Skills for Data Scientists – Statistics isn’t Enough
Skills for data scientists are wide ranging and varied, since data scientists come from different backgrounds.
Every year, a good portion of my science friends end up realizing that they don’t want to be in academia and start looking for a new career. Data science is a path many lean into. And why not? Data science is the hot job of the year. Even Forbes listed it as the Best Job in America in 2018 because of great salaries, high job satisfaction, and employer demand.
Crucial skills for data scientists
You may be wondering what skills you need to get your first job. While statistics is the obvious core of being a data scientist, aspiring data scientists sometimes overlook the coding and communication skills that are critical to success on the job. I want to discuss why coding (and which languages), communication, and reproducibility are critical to being a top data scientist, and how to develop domain knowledge even if you don’t work in the field yet.
Statistics is a huge and obvious piece of the puzzle, which is why scientists who are switching into tech naturally gravitate towards data science. A strong statistics background is key to standing out as a data science candidate.
Get familiar with probability, distributions, hypothesis testing, inference, randomization, generalizability, time series, and Bayes. How To Ace Data Science Interviews dives deep into topics in these categories and what key ideas you should know.
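To make hypothesis testing and randomization a little more concrete, here is a minimal sketch of a two-sided permutation test for a difference in means, using only the Python standard library. The function name `permutation_test`, the made-up `control` and `treatment` measurements, and the default of 10,000 shuffles are all assumptions chosen for illustration, not something taken from a particular textbook or course.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means.

    The group labels are shuffled repeatedly; the p-value is the share of
    shuffles whose absolute mean difference is at least as large as the
    one actually observed.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations


# Hypothetical measurements, invented purely for this example.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
treatment = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]

p_value = permutation_test(control, treatment)
print(f"Estimated p-value: {p_value:.4f}")
```

A small p-value here means that a difference this large rarely shows up when the group labels are assigned at random. Being able to explain that reasoning in plain words is the kind of understanding worth building.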
If you never took a course on statistics, or your statistics text is woefully outdated, below are some great resources to get started.
Though this book doesn’t go in depth on every statistical concept, it’s a great starting point. It also has code examples using R.
Think Stats. It’s free to download the PDF, and it uses Python for the examples.
Most online data science courses focus more on syntax and data visualization than on statistics. However, here are some statistics-focused courses, Specializations (Statistics with R is particularly good with the math), and MicroMasters that are useful for solidifying mathematical concepts.
Coding is another big portion of data science. Every data scientist should be familiar with Python, though some still use R. However, it’s easier to communicate and prototype with the engineering team using Python, since most of them will be comfortable with it too.
Python is a powerful language. Many analysis, machine learning, and visualization packages are available, like numpy, pandas, scikit-learn, and matplotlib.
SQL is another good tool to have in your skillset. Sooner or later, you will need to get data out of a database, and SQL is the query language for relational databases. SQLZoo is a fun, free resource for learning SQL. Of course, you can get into very complex queries, but basic SQL can be picked up in under a week.
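As a rough idea of what "basic SQL" looks like in practice, the sketch below runs a small aggregate query from Python using the standard-library `sqlite3` module and an in-memory database. The `orders` table, its columns, and the rows are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# Hypothetical rows, invented purely for this example.
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [
        ("alice", 30.0),
        ("bob", 20.0),
        ("alice", 45.0),
        ("carol", 60.0),
        ("bob", 15.0),
    ],
)

# Total spend per customer, keeping only customers above 50, largest first.
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 50
    ORDER BY total DESC
    """
).fetchall()
conn.close()

for customer, n_orders, total in rows:
    print(customer, n_orders, total)
```

`SELECT`, `GROUP BY`, `HAVING`, and `ORDER BY` cover a large share of everyday analysis queries; joins and window functions are natural next steps.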
Besides the programming languages and packages, you want to focus on reproducible code. Using GitHub allows you to easily share and collaborate on projects. Jupyter Notebooks, open source interactive computational notebooks, allow data scientists to quickly explore data and test hypotheses. Reproducibility becomes critical once you start working with a team and need to communicate with other data scientists and engineers.
Codecademy also has courses on git, data analysis, pandas, SQL, and more.
Generally, worry less about choosing the correct tools and Python packages, and focus more on learning how to learn. Data science is always evolving and software keeps changing. Part of the job will be to keep up to date with the latest technology. Do you enjoy YouTube videos or reading textbooks? How do you learn best?
Communication is an absolutely critical skill set and often overlooked. The stereotype is of the lone coder in the corner of a dark room, typing away on their laptop. However, you need to be able to communicate your results to engineers, product managers, and executive staff. You need to interpret the results for them and explain the reasoning behind your analysis. You will also need to adapt to the different educational backgrounds and personalities of the people who are receiving the results of your work. Being able to make your work clear and understandable to your target audience is key.
Beyond that, domain knowledge can be a huge plus for your application. Sometimes, it’s picked up on the job. That’s one of the best ways to acquire domain expertise. However, if you want to break into data science for healthcare, it’s unlikely you can get access to a doctor’s office and poke around the data. What you can do, however, is to look for peer-reviewed papers in that field. Look at the methods and read how the scientists analyzed their experimental data, and why they chose to do it that way.
As you read more in the domain, you’ll begin to notice trends in the methods and gain familiarity with how the data looks. From here, you’ll have a better idea of a project that you can do with publicly available data sets. Remember to share your results on GitHub and Kaggle, or post them on a blog. This will give you practice communicating with different audiences and gathering feedback. Use this feedback to iterate on your project.
Statistics, coding, and communication are the key skills for data scientists to succeed. What books and courses have you used in your path to becoming a data scientist? Which languages, packages, and tools have you found to be the most valuable?