5 best tips for data science beginners
For Portuguese, go to this link.
I’d like to share the five best tips that I wish I had known some 10 years ago, and this is why:
When I was young, I was passionate about robots and planned my whole career around becoming a roboticist (it is still on my mind). The problem began when I realized that here in my country (Brazil) there are very few jobs for someone focused on hardware development rather than software. I stumbled for many years before I realized that this was the hard way.
Then I started to work with IBM Watson Assistant (formerly Dialog). One thing led to another and, long story short, I became a Data Scientist in mid-2018.
However, this wasn’t easy either for someone who was entirely focused on hardware skills rather than software skills, and perhaps you’re not a developer by training either. Who knows, maybe you’re a statistician? Or a biologist?
So, I’ll give you my five best tips (even though I think they aren’t nearly enough to pass on the little knowledge that I’ve gathered in the past few years):
1. LEARN BY YOURSELF
You’ll save a lot, and I mean A LOT, of money if you learn to sit your butt on the chair, start with an easier course and keep up the discipline throughout your learning trajectory.
I don’t really think that you should spend money on expensive certifications (unless that is no problem for you), so I’ll list a few of the best and easiest courses that I’ve collected so far.
Course: Andrew Ng — Machine Learning
A lot of people are going to recommend this course to you, and it is fairly good, I agree. However, since it’s quite old, you might be disappointed with the audio quality from back in the day. Still, Andrew will teach you the essentials of Machine Learning.
Book: An Introduction to Statistical Learning
This is one of the classic books, and I really recommend it if you have enough patience or prefer to learn by reading. Not everyone does, though.
YouTube: StatQuest with Josh Starmer
I think this is the BEST CHANNEL of all time for learning machine learning from scratch. Using some of the easiest examples, Josh explains the inner details of each ML model, and you’ll surely find most of the data science toolbox in those videos.
Course: Introduction to Deep Learning with PyTorch
This is another good course, and it is where you’ll get started in the deep learning world.
However, DO NOT go into Deep Learning until you’re comfortable with Machine Learning. You’ll need to understand a lot of the concepts behind the most classic models before diving deep into the neural network world.
(DL has the most amazing and astonishing techniques, the kind of thing that will make your friends say WOW. You can go there and run something for fun hahaha, but I strongly advise you to leave this for the future.)
2. UNDERSTAND THE PROBLEM
Another thing that is often forgotten is how important it is to understand what you’re dealing with.
You’ll try to implement the state-of-the-art model and the newest techniques from academia, but is that really what the problem needs? What if you’re trying to predict which customers are most inclined to make a purchase and a plain logistic regression would do? Or perhaps a random forest?
I’m not saying that you shouldn’t try things out and go for the best solution you possibly can, but time to market beats the perfect solution. Improve on the fly.
And for that, I’d recommend that you take a look at two things:
The surprising truth about what it takes to build a Machine Learning Project
Machine Learning Canvas, by Louis Dorard
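To make the purchase-prediction question from this tip a bit more concrete, here is a minimal sketch of how you might check whether a simple model is already enough before reaching for anything fancy. Everything in it is an assumption for illustration: the data is synthetic (scikit-learn's `make_classification` stands in for real customer data), and the names `candidates` and `compare_models` are hypothetical, not taken from any of the resources above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for a "will this customer purchase?" dataset,
# with roughly 30% positive examples.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    weights=[0.7, 0.3],
    random_state=42,
)

# Hypothetical candidates: a trivial reference point plus the two models
# mentioned in the text.
candidates = {
    "majority_class": DummyClassifier(strategy="most_frequent"),
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}


def compare_models(models, X, y, cv=5):
    """Return the mean cross-validated ROC AUC for each model."""
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
        for name, model in models.items()
    }


scores = compare_models(candidates, X, y)

# Print the candidates from best to worst.
for name, auc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>20}: ROC AUC = {auc:.3f}")
```

If the simple models land close to whatever your heavier model achieves, that is a hint that time to market should win. On real data you would of course swap in your own features and a metric that matches the business question.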
3. BE LAZY (BUT DON’T)
In ML and DS, you’ll find yourself doing the same boring thing over and over again.
You might be a Data Scientist who is asked to do an NLP task, or who deals with PySpark DataFrames and builds countless master tables, and you’ll end up connecting to the cluster again, and again, and again (just like the horribly slow murderer with the extremely inefficient weapon hahahah).
So don’t. Be lazy: optimize your work and be efficient.
And how do you do that? Your time is the most precious thing you have. I know that you’ve heard it a lot, but you’ll get my point in a minute.
If you’re privileged enough to work at a big company that has a fairly powerful cluster, take advantage of it.
- Create a baseline model
- Design all of your hypotheses
- Do like Andrew Ng and focus on the “caviar approach”: put all your notebooks to run at the same time, discard the inefficient children (not the human ones though) and embrace the ones most likely to become the best solutions. Use AutoML techniques if available (TPOT, auto-sklearn and AutoKeras are good open source options), but still, KNOW WHAT YOU’RE DOING!
- Use all the vCPUs, RAM and GPUs available to you (in an optimized manner).
- Create Python packages that deal with the most boring stuff.
- Do not forget about quality: being a scientist doesn’t mean that you shouldn’t be an engineer as well.
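The first few items of this checklist can be sketched in a handful of lines. This is only a rough illustration under stated assumptions, not a recipe: the data is synthetic, the four "hypotheses" are arbitrary scikit-learn models picked for the example, and the `evaluate` helper is a hypothetical name. It uses `joblib` to spread the work over all available cores.

```python
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumption: synthetic data standing in for your real master table.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=0
)


def evaluate(name, model, X, y):
    """Score one hypothesis with 3-fold cross-validated accuracy."""
    score = cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    return name, score


# 1. Create a baseline model (here: always predict the most frequent class).
_, baseline_score = evaluate(
    "baseline", DummyClassifier(strategy="most_frequent"), X, y
)

# 2. Design all of your hypotheses up front (illustrative choices only).
hypotheses = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf_shallow": RandomForestClassifier(
        n_estimators=50, max_depth=3, random_state=0
    ),
    "rf_deep": RandomForestClassifier(n_estimators=50, random_state=0),
    "gboost": GradientBoostingClassifier(random_state=0),
}

# 3. "Caviar approach": run every hypothesis at the same time, one per core.
results = Parallel(n_jobs=-1)(
    delayed(evaluate)(name, model, X, y) for name, model in hypotheses.items()
)

# 4. Discard whatever does not beat the baseline and rank the survivors.
survivors = sorted(
    [(name, score) for name, score in results if score > baseline_score],
    key=lambda item: item[1],
    reverse=True,
)

print(f"baseline accuracy: {baseline_score:.3f}")
for name, score in survivors:
    print(f"{name:>12}: {score:.3f}")
```

Once a loop like this has proven useful a couple of times, it is a good candidate to move into your own Python package, so that the next project starts from the baseline instead of from a blank notebook.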
By now you’re somewhere around an intermediate-level data scientist (at least as far as these tips go haha): you know where to look for learning material, how to think about ML projects and that you should optimize your workflow in order to deliver fast results. What’s next?
4. SEARCH FOR THE BEST
One of the most amazing things about data science and machine learning is that you’re always close to the state of the art, and it seems so near that you feel like a part of it (kinda).
Why am I saying this? Think of the state of the art in engineering: unless you’re a billionaire, you’re light-years away from it. When it comes to coding, in the best case you’re a GitHub repository away, and in the worst case a paper away, and all you have to do (not so easily though) is implement that concept.
Enough talking, where to search? I have a few places where I like to look:
Natural Language Processing Progress
This is maintained by Sebastian Ruder, and I think of him as one of the authorities on tracking progress in NLP.
Here you’ll find lots of good ideas, concepts and models/architectures that are state of the art.
This is a curated list of AutoML concepts and tools.
Maintained by Adrian Rosebrock, this is one of the must-follow blogs for keeping track of good things in Image Search. It is not the place to follow state-of-the-art models, but it will speed up your learning a lot.
Maintained by Valerio Velardo. If you’re trying to learn about sound, speech and music, this is another must-follow.
And now last, but not least, I also think that this is one of the best places to look for new things:
This is where you’ll receive the most recent papers in your inbox. One of the things that I’ve learnt as a data scientist is that there are good papers and bad papers, just like books, and you’re the one responsible for choosing what you’re going to absorb. Do not believe everything, but try to improve yourself and your own hypotheses.
The main goal here is to learn to think, right? Being a data scientist/ML researcher is more about “scientistical thinking” than about your current profession, and believe me, you’ll apply this to more areas of your life.
5. LOOK BACK TO THE FOUNDATIONS
“Ok, now we’ve reached the final tip, and what? You’re trying to tell me to go back to tip number one?”
Yes, that’s right. Be humble, stay foolish. If you’ve followed me all the way to this last tip, you can be happy with yourself for understanding how most “AI” techniques work and what this dazzling world looks like on the inside.
I’ve seen two kinds of scientists: those who are still happy kids eager to know more, and those who say: “What? NO, I’m a SENIOR, I don’t do these things anymore”.
It actually doesn’t matter whether you’ve been a scientist for five years or ten. Think of an LSTM (Long Short-Term Memory): you’ll keep the information that is most relevant to you today and forget the rest. You’ll put information into boxes inside your mind, some of it easy to retrieve and some not so much.
Does it make you a bad scientist if you don’t remember the nitty-gritty details of a specific technique, such as how kernel X or activation function Y works in the latest iteration of a task that you haven’t done in the last twelve months? No, but be humble enough to say “I don’t know/I don’t remember/I haven’t done or looked into these tasks in the past N months”.
This is quite a big area, and you’re not supposed to know everything, or even to remember everything all the time. It’s always better to remember the foundations, and you’ll stay like you were in those younger years.
Remember: Data Science/Machine Learning research is a state of mind, a way of “scientistical” thinking. (Why not just statistical? Because I think there’s more than just statistics to this area, and this will make a difference, but these are just my philosophical thoughts.)