Machine Learning Basics. What to Learn, Before You Learn About Data Science.
Here are some specific sources to get started learning the field. I will keep this list updated going forward.
December 21st, 2020
This is my answer to the question, “Where do I start so I can become a Data Scientist?” Educational content overlooks the need to “Learn Before You Learn.” Diving in without enough background is what derails most aspiring Data Scientists’ learning paths. This is a detailed list of the fundamentals.
With these basics, all the “Learn Machine Learning in Six Weeks” bootcamps become useful. There is an unspoken assumption that students come into class with prerequisite knowledge. However, few courses explain what those prerequisites are.
I include resources to get started with each area. These are my favorites but not the only options. Substitute to meet your needs. You have a learning style and preferred content type. Use what works to build your body of knowledge.
Expect to come back to the more complex areas of study. Do not try to master everything on the first pass. It is easier to learn, apply, and come back to fill in the gaps you find.
You will be tempted to start building before you have learned all the pieces. Should you? Definitely. That is a good way to reinforce complex concepts. My only caution is to keep learning as you build. Do not let a working project make you overlook the gaps in your knowledge.
Data Scientists use data to create instructions for a computer to follow. Here are the major pieces:
Avoid Machine Learning-focused courses for Math, Programming Languages, and Business Acumen. They leave too many gaps, and you will eventually have to relearn the foundations. It is quicker to start with the foundations, then look at Machine Learning applications.
Geometry aka Analytical Geometry. Every machine learning concept is easier to understand with a solid foundation in geometry. Differential Geometry and Topology are foundational pieces of deep learning.
Calculus (part 2, multivariable) and Linear Algebra. Both are extremely broad from an application standpoint. You will spend a month or two learning Calculus, then another learning Linear Algebra. You will spend years learning how both are used.
What order should you learn them in? Geometry, Calculus, Linear Algebra, Differential Geometry, Topology. After Calculus, the subjects start blending and borrowing from each other.
Statistics is usually introduced as part of Math for Machine Learning. Probability and distributions are your introduction.
After this point, implementations become the focus of learning. Principal Component Analysis, Regression, Support Vector Machines, Least Squares, Nearest Neighbors, Decision Trees: the name of the implementation replaces a broad category of math.
The trap here is that math, programming languages, tools, and frameworks get mixed into the same lesson. You will be learning implementations using statistics, calculus, and linear algebra. Learn the math by itself before you get to these lessons.
It is best to learn Principal Component Analysis, then apply it. Use Python to implement it in a simple project. Once you are past the core math and the core technical foundations, combine the two. Bring tools and frameworks in to build your implementation. Do that for each of the categories I list above.
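As a rough illustration of the learn-then-apply step for Principal Component Analysis, here is a minimal sketch in Python. It assumes NumPy is installed. The function name `pca` and the synthetic two-column data set are made up for this example and are not taken from any course or library.

```python
import numpy as np


def pca(data, n_components):
    """Project the rows of `data` onto the top principal components."""
    # Center each column so the covariance describes spread around the mean.
    centered = data - data.mean(axis=0)
    # Covariance matrix of the columns (features).
    covariance = np.cov(centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    # Keep the directions with the largest variance.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    explained = eigenvalues[order] / eigenvalues.sum()
    return centered @ components, components, explained


# Hypothetical data: the second column is roughly twice the first, plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

projected, components, explained = pca(data, n_components=1)
print(projected.shape)   # one coordinate per row
print(explained)         # share of total variance kept by the first component
```

Every line maps back to the math: centering and covariance come from statistics, the eigendecomposition from linear algebra. Writing it by hand once makes a library version such as the one in Scikit-Learn much easier to reason about.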
Deep learning uses implementations of the math I have already listed. The walkthroughs look complicated, but they are all proofs, reductions, and derivations using what you have already learned. Deep learning is the creative application of math to reduce a complex problem to something simpler. Use the same learn-then-apply approach.
Python is the easiest place to start. However, there are fundamental software development concepts that can be overlooked by diving straight into Python. In essence, Python is too well made. It allows you to build without understanding everything that is necessary to build well.
Learn what a computer is: CPU, GPU, memory, and so on. Operating systems and networking are also required. You need to understand what a programming language is. There are software engineering principles, design patterns, and best practices. This covers what you need to know before you create software for release, either open source or for a business.
However, you do not need to know all of that to start coding. A programming language like Python is syntax. The foundations from the last paragraph teach you how to build reliable products and software using programming languages.
That is an opportunity and a trap. I recommend you start writing code whenever you want to. I strongly recommend that you work to understand the likely flaws in what you build. In software development, compiling without errors, working, and reliable functionality are 3 different levels. You can have the first 2 without the third and it is easy to lose sight of that.
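To make the three levels concrete, here is a small hypothetical sketch in Python. The functions `average` and `reliable_average` are invented for illustration. The first one runs without errors and works on typical input, yet it is not reliable.

```python
def average(values):
    """Runs without errors and works on typical input, but is not reliable."""
    return sum(values) / len(values)


def reliable_average(values):
    """Same calculation, with the failure case handled on purpose."""
    values = list(values)
    if not values:
        # Fail with a clear message instead of an unexplained crash.
        raise ValueError("average of an empty sequence is undefined")
    return sum(values) / len(values)


print(average([2, 4, 6]))  # 4.0, so it looks finished

try:
    average([])  # an input nobody tried while building
except ZeroDivisionError as error:
    print("unreliable:", error)
```

The first version would pass a quick demo. The gap only shows up when it meets input the author never tried, which is why a working project should not be mistaken for a reliable one.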
Building early is an opportunity. Writing code helps theory make sense. You will also learn what does not work, and that is not taught in the classroom. When your code does not work, you are forced to debug. Finding and fixing bugs is a core capability of software development. It is part of your learning path. Again, use the learn-then-apply approach.
Learn C++ and Java. Python is the best first language to learn. C++ and Java are deeply integrated with Machine Learning for many practical reasons. Knowing both is a requirement.
TensorFlow, PyTorch, Pandas, Scikit-Learn, NumPy, SciPy, Matplotlib, Scrapy. Those are your core Python packages supporting machine learning development. They are required.
Python has different distributions. Learn Anaconda and CPython. You write code in a development environment. Learn Jupyter and PyCharm. As you learn Java and C++ you will learn about other IDEs (development environments). You save your code to a repository. Learn Git/GitHub.
Once completed, a model needs to be deployed and available for use. REST APIs and TensorFlow Extended are worth learning. Flask is commonly used and easy to learn.
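As a minimal sketch of what deployment behind a REST API can look like, the example below wraps a stand-in model in a Flask endpoint. It assumes Flask is installed. The `/predict` route, the `features` field, and the `predict` function are all hypothetical choices for this example; a real service would load a trained model instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict(features):
    """Hypothetical stand-in for a trained model: returns the mean of the inputs."""
    return sum(features) / len(features)


@app.route("/predict", methods=["POST"])
def predict_endpoint():
    # silent=True returns None instead of raising on a missing or invalid JSON body.
    payload = request.get_json(silent=True) or {}
    features = payload.get("features")
    is_valid = (
        isinstance(features, list)
        and len(features) > 0
        and all(isinstance(value, (int, float)) for value in features)
    )
    if not is_valid:
        return jsonify(error="expected a non-empty list of numbers in 'features'"), 400
    return jsonify(prediction=predict(features))
```

Flask's built-in test client (`app.test_client()`) can post JSON to the endpoint without starting a server, which is a convenient way to check the behavior before deploying.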
Optimization is a consistent theme in machine learning. CUDA is a must know. This ties back to learning the hardware level of programming.
This is the foundation for understanding the Python ecosystem. With what you learned from the Programming Languages section, these topics are all Google searchable. You have the foundational knowledge required to independently discover and learn. The same applies to the next section, Platforms.
AWS, Docker, Kerberos, and Spark are core platforms you must learn.
Platforms are your introduction to architecture. Each company uses a technology stack (platforms, programming languages, distributed resources, networking, storage) and there is no single, uniform stack. There is no comprehensive curriculum specific to machine learning architecture.
Systems Engineering and Enterprise Architecture (I am looking for a good course to recommend) are core concepts you must learn. These are the other side of the coin to the software concepts I outlined in the previous section. You build programs. They must run in an environment.
Data pipelines are a core component of all machine learning technology stacks. Pipelines are made up of data gathering, storage, transformation, and availability components. There is no single stack. Systems Architecture teaches the fundamentals behind building out a data pipeline.
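The four pipeline components can be sketched in a few lines of Python using only the standard library. Everything here is a toy assumption: the hard-coded rows stand in for a real data source, an in-memory SQLite database stands in for production storage, and the function names are invented for the example. The transformation step runs before storage here, which is only one of several possible orderings.

```python
import sqlite3


def gather():
    """Gathering: stand-in for reading from an API, a file, or a sensor."""
    return [("2020-12-01", "21.5"), ("2020-12-02", "n/a"), ("2020-12-03", "19.0")]


def transform(raw_rows):
    """Transformation: parse readings and drop rows that fail to parse."""
    clean = []
    for day, reading in raw_rows:
        try:
            clean.append((day, float(reading)))
        except ValueError:
            continue  # skip malformed readings such as "n/a"
    return clean


def store(connection, rows):
    """Storage: persist the cleaned rows in a table."""
    connection.execute("CREATE TABLE IF NOT EXISTS readings (day TEXT, value REAL)")
    connection.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    connection.commit()


def fetch_average(connection):
    """Availability: expose the stored data to whoever consumes it."""
    return connection.execute("SELECT AVG(value) FROM readings").fetchone()[0]


connection = sqlite3.connect(":memory:")
store(connection, transform(gather()))
average_reading = fetch_average(connection)
print(average_reading)  # 20.25
```

Swapping any stage for a different technology leaves the shape of the pipeline unchanged, which is why the fundamentals carry over from one company's stack to the next.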
Databases are a constant element of data pipelines. Learn MySQL and MongoDB.
The lack of consistency across businesses, sometimes even within a single business, means you can replace any of the platforms I recommend with a different one that serves the same purpose. Once you have learned Systems Engineering and Architecture, the foundations are there for you to make the right choice for your career.
Everything built using data depends on the scientific method and research methodology. Science is introduced as part of math. Proofs and derivations introduce the key concepts: reproducibility, peer review, and independent confirmation.
Experimental methods are introduced during statistics and applied mathematics. Formulas proposed to explain an area of physics need experimental validation. Statistics offers tools to gather and validate data during experiments, then analyze the conclusions drawn from them.
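One example of such a tool is a permutation test, which asks how often a difference as large as the observed one would appear if group labels were assigned at random. The sketch below uses only the Python standard library. The measurements are invented for illustration and the function name `permutation_test` is hypothetical.

```python
import random
from statistics import fmean


def permutation_test(control, treatment, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = fmean(treatment) - fmean(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values, then split them into two relabeled groups.
        rng.shuffle(pooled)
        difference = fmean(pooled[n_control:]) - fmean(pooled[:n_control])
        if abs(difference) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations


# Invented measurements from a hypothetical experiment.
control = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
treatment = [10.9, 11.2, 10.8, 11.1, 11.0, 10.7]

observed, p_value = permutation_test(control, treatment)
print(observed, p_value)
```

A small p-value says the observed difference would be unusual under random labeling. It does not say why the difference exists, and that question is where experimental design comes in.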
That is the introduction to observational study and experimental design. Next you must learn how to build, review, complete, and present findings from research. You will do small scale research alone and you will need to learn how to work on a team for large projects.
You will publish some of your research publicly in a journal, at a conference, or through a patent. You need to learn the review process.
There is a growing body of work on causal inference and machine learning. These (CausalML, dowhy) belong in Tools and Frameworks. I include them here because they fit further along in your learning journey. Read the papers associated with each library and follow through by learning the do-calculus. This is a good introduction to structured research in practice.
Growing your knowledge from there is not classroom or online course learning. As I said, you will get the foundational concepts from other classes. Continued learning happens when you become part of a research group. You need real world experience and mentorship.
You need to learn how businesses function. That includes business and product strategy, execution, project management, sales, marketing, customers, and intrateam communications. I am writing a series on soft skills.
Data visualization is a large part of machine learning communications. You must learn to simplify and present data to a variety of audiences. As part of science and research education, you learn the rigor needed to present data in a responsible, defensible way. Visualization techniques will allow you to take those presentations to a semi-technical and non-technical audience.
Here again, I do not recommend a specific course. With everything else provided, you have the knowledge to select the right technical course or tutorials.
I recommend learning about leadership at this stage. However, you may never want to leave the technical, individual contributor career path. Leadership is an elective.
This learning path is meant to cover someone starting from a High School Diploma. College and secondary education in general have evolved. I include people who take this learning path outside of traditional academic institutions.
I also include people transitioning into the field from any role. You can be a salesperson or construction worker and complete this learning path. My first internship was in Civil Engineering. I held a sales job early on in college. I also ran a small IT company doing web design, PC setup, and networking. I have worked with excellent Data Scientists with similar backgrounds in non-technical and non-development roles.
Completing this learning path will give you all the capabilities you need to do intermediate and advanced Machine Learning. Most people in our field do not have this body of knowledge and their work reflects it. Your capabilities will stand well above your peers. Often you will know more than the people interviewing you for roles in the field.
From here, Deep Learning concepts are easy to pick up. You will be able to quickly learn and customize any existing approach as well as those that will emerge over the course of your career.
How long will this learning path take? You will be ready to learn Platforms, Tools, and Frameworks after about 6 months. There will still be math and programming left to learn but you will have enough of the foundations to start. You will be ready for the Science and Research in about 18 to 24 months. I would start working on Business Acumen (Economics) after learning statistics.
All in all, this is 2 to 3 years’ worth of work. I think it can be done faster if you can learn full time. Plan to do significant projects as part of this learning path. Building reinforces concepts and introduces real world applications.
You will be ready for a software developer’s role after about 6 months, data analyst at about 9 to 18 months, and data scientist at 18 months. You can get hired before you finish but you must finish to build a career.