Data science

My date: I am studying machine learning.

Me [trying to impress her]: I too am studying how to do regressions in California.

Data science is a difficult-to-define field, but I like the broad definition that Wikipedia gives (or at least gave in March 2021): “an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of domains”.

In my experience, data science comprises the following four main areas. Someone who can do all of them well might be considered a “full stack” data scientist.

  • Obtaining and building datasets. This can range from writing a query against an existing SQL database to building infrastructure that logs the events you need in a product into a new database. Data engineering has become a separate job description altogether in recent years, but a lot of early data scientists (and many still today, I imagine) spent a large percentage of their time writing code to build their datasets. There are quite a few companies building interesting tools in this space.
  • Training and evaluating models. Again, the range of work here is very wide. This can be anything from comparing a couple of out-of-the-box models in an existing package like scikit-learn to writing custom image detection algorithms. But this definitely requires a solid knowledge of statistics and the fundamentals of machine learning to do well. An understanding of (discrete) optimisation is also important if you get a little deeper into modelling, as at their heart most machine learning algorithms are optimisation problems. Specialising in this area can lead to job titles like Machine Learning Engineer, or Algorithm Developer, and the like.
  • Interpreting and communicating results. Your models and analyses are useless if no one hears about the results and they are not used to make decisions. So being able to turn your own results into actionable insights, and to communicate those insights clearly and directly, is hugely important. The best data scientists, in my opinion, are the ones who can do this well without even realising it.
  • Building your models into a product. Being able to code up models and explain how you think a product should be built is one thing, but being able to actually code up a prototype / minimum viable product is very rewarding, and it also means you don’t need to rely on others to get your ideas out.
  • Interesting article about moving from statistics to data science. The main point is that statistics is about finding \(\hat{\beta}\), while data science is about finding \(\hat{y}\).
  • A set of notes on causal inference I’d like to read one day.
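
To make the \(\hat{\beta}\) versus \(\hat{y}\) distinction a little more concrete, here is a minimal sketch using a one-variable least-squares fit in plain Python. The toy data and the helper names `fit_simple_ols` and `predict` are made up for illustration and are not taken from the linked article; the point is only that the same fitted model can be read as a pair of coefficients to interpret or as a machine for producing predictions.

```python
from statistics import mean


def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(beta_hat, x):
    """Apply the fitted coefficients to a new input."""
    intercept, slope = beta_hat
    return intercept + slope * x


# Hypothetical toy data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# The "statistics" view: the estimated coefficients are the object of interest.
beta_hat = fit_simple_ols(xs, ys)

# The "data science" view: the prediction for an unseen input is what matters.
y_hat = predict(beta_hat, 5.0)

print("beta_hat:", beta_hat)
print("y_hat at x = 5:", y_hat)
```

In practice the two views rarely separate this cleanly, and a flexible model can predict well while having no coefficients that are meaningful to interpret.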