The most common question I get from aspiring data scientists is, “Where do I start?” Most dive into a method like regression, see Greek symbols in the math, hear obscure-sounding terms, then find themselves backtracking through the math and programming they need to implement a method they only partially understand. It is a frustrating process that I want to make a bit simpler.
You will need to be proficient with algebra, especially graphing functions. You will also need some basic calculus, specifically partial derivatives. Running a regression does not require you to do any calculus yourself, but understanding the foundations of the method does. Those foundations contain recurring themes that you will see again in many other types of machine learning, so they are worth understanding.
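To make the link to partial derivatives concrete, here is a minimal sketch in Python. It assumes the usual least-squares setup: a line y = a + b*x is chosen to minimize the sum of squared errors, and setting the partial derivatives with respect to a and b to zero gives the closed-form formulas below. The function name `fit_line` and the toy data are made up for illustration.

```python
def fit_line(xs, ys):
    """Fit y = a + b*x by ordinary least squares.

    The formulas come from setting the partial derivatives of the
    sum of squared errors (with respect to a and b) equal to zero.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up toy data that lies exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

Because the toy points sit exactly on a line, the fit recovers that line. Real data will leave residuals, which is where the statistics come in.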
You will need to know basic statistics. You should be comfortable with z-scores, the correlation coefficient, the coefficient of determination, sample mean, standard deviation, variance, etc. Terms like dependent variable and independent variable should be well understood. You also want to understand the deeper concepts around significance and correlation. Diving into analysis of variance (ANOVA) is a great way to get there.
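As a quick self-check on those terms, the sketch below computes each one by hand in Python. The helper names and the small dataset (hours studied versus a score) are hypothetical. It uses the sample (n - 1) versions of variance and standard deviation, and the coefficient of determination is taken as the square of the correlation coefficient, which holds for simple linear regression with one independent variable.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    # Sample variance divides by n - 1, not n
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def sample_std(values):
    return math.sqrt(sample_variance(values))


def z_scores(values):
    # How many standard deviations each value sits from the mean
    m, s = mean(values), sample_std(values)
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    # Pearson correlation coefficient r
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: hours studied (independent) and score (dependent)
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [2.0, 4.0, 5.0, 4.0, 5.0]

r = correlation(hours, scores)
r_squared = r ** 2  # coefficient of determination for one predictor
print(f"r={r:.3f}, r_squared={r_squared:.3f}")
print("z-scores of hours:", [round(z, 2) for z in z_scores(hours)])
```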
R is probably the simplest language in which to run a linear regression, but it is not much harder in Python or Java. You need a basic understanding of how to use an integrated development environment (IDE). You will need to understand building and calling functions, methods, and classes. You will need to understand how to import data into the model for training. R ships with several built-in datasets that work well for toy models. Finally, you will need to be able to read the outputs of the model. These are rarely displayed clearly by default, so you will need to get accustomed to the output window as your viewer into the model’s performance.
Linear regression is useful anytime you have two variables, one dependent and one independent, and you want to assess the relationship between them. That is a common question during exploratory data analysis of all kinds. Deploying the model into production? That is less common with a simple technique like linear regression. The results of the exploration can be useful as part of a larger model development process, but typically not on their own.
In Practice Knowledge:
The gap between theory and practice is worth noting for any application of an algorithm. Everything above would give you a great theoretical foundation in linear regression, but once you start applying it, you will find additional considerations.
You will rarely have a single independent variable in a deployed model, so you will be adding different variables to increase the predictive value of the model. That introduces the concepts behind my recommendation to learn ANOVA. There are numerous KPIs for model performance, each with its own merit. Learning which ones to use and when is a process in and of itself.
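To give a feel for how those performance measures differ, here is a small Python sketch of three common ones: mean absolute error (MAE), root mean squared error (RMSE), and R-squared. These three are my choice for illustration, not an exhaustive list, and the actual and predicted values are made up.

```python
import math


def mae(actual, predicted):
    # Average size of the errors, in the units of the target
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    # Like MAE, but squaring penalizes large errors more heavily
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


def r_squared(actual, predicted):
    # Share of the variance in the target explained by the predictions
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up actual values and model predictions
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 10.0]

print(f"MAE={mae(actual, predicted):.3f}")
print(f"RMSE={rmse(actual, predicted):.3f}")
print(f"R^2={r_squared(actual, predicted):.3f}")
```

MAE and RMSE are reported in the units of the target, while R-squared is unitless, which is one reason the right choice depends on who will read the result.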
Your coding skills will need to advance as well. You will be coding as part of a team, so learning to comment and write legible, maintainable code is a must. You will need to learn how to debug your code when things do not go as expected or you make a coding error. You will also need to learn how to deploy models into production.
Concepts around training and testing your model are also more advanced. Model validation goes beyond the statistical scores to understanding how the model will perform with real-world, out-of-sample data. You will need to understand the impact of errors on downstream processes and users.
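The simplest form of out-of-sample testing is a holdout split: fit on part of the data and score on the rest. The Python sketch below assumes synthetic data generated from y = 1 + 2x plus random noise and an 80/20 split. Both are arbitrary choices for illustration, and `fit_line` is a hypothetical helper rather than a library function.

```python
import math
import random


def fit_line(xs, ys):
    # Ordinary least squares for y = intercept + slope * x
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(actual, predicted):
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


# Synthetic data (an assumption for this sketch): y = 1 + 2x + noise
rng = random.Random(42)
xs = [float(i) for i in range(50)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

# Shuffle the row indices, then hold out the last 20% for testing
indices = list(range(len(xs)))
rng.shuffle(indices)
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

# Fit only on the training rows
intercept, slope = fit_line([xs[i] for i in train_idx],
                            [ys[i] for i in train_idx])

# Score only on rows the model never saw during fitting
test_predictions = [intercept + slope * xs[i] for i in test_idx]
test_rmse = rmse([ys[i] for i in test_idx], test_predictions)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, test RMSE={test_rmse:.3f}")
```

A single random split is only a starting point. Real-world data often drifts over time, so a split that respects time order or a business boundary may be a more honest test.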
You will also need more advanced data skills. Data wrangling does not feature prominently in educational projects. Most real-world projects require you to source and clean up the data prior to model development, training, and testing. That means getting to know relational and NoSQL databases at a minimum.
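As a small taste of that clean-up work, the sketch below uses Python's built-in `sqlite3` module with an in-memory relational database. The table name `sales`, its columns, and the rows are all hypothetical, and dropping rows with missing values is only one of several ways to handle them.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")

# Hypothetical raw data; None becomes SQL NULL (a missing value)
raw_rows = [
    (10.0, 25.0),
    (None, 30.0),
    (20.0, 45.0),
    (30.0, None),
    (40.0, 85.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?)", raw_rows)

# Keep only complete rows before handing the data to a model
rows = conn.execute(
    "SELECT ad_spend, revenue FROM sales "
    "WHERE ad_spend IS NOT NULL AND revenue IS NOT NULL "
    "ORDER BY ad_spend"
).fetchall()
conn.close()

print(f"kept {len(rows)} of {len(raw_rows)} rows: {rows}")
```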
Also get to know external data sources, like data brokers or APIs from companies like Facebook, Twitter, etc. Sometimes data comes in streams and other times it comes in dumps (occasionally you get both a dump to start and access to a stream for real time analysis).
You will also need to learn how to visualize your results. Even for projects that feed into other models rather than directly to a user, visualizing your results is an important step in communicating the output of your model. Do not get used to relying on the default output from the IDE or some shell. That will not be sufficient to explain your results to a mixed audience and will be considered lazy by experts.
Keep in mind that the “in practice knowledge” applies to many projects you will do, regardless of the algorithm you select. While you do not need it to get started, you will want to pay attention to each facet while you are learning. The “in practice knowledge” seems trivial when you are learning, but it is critical for success in the work world.