The Data Science Lifecycle
For companies moving forward into the next phase of data science maturity, a consistent lifecycle is needed. This is the blueprint for Enterprise Data Science.
Vin Vashishta | Originally Published: May 3rd, 2018
Dell/EMC, Microsoft, SAS, and many others have published their process for the Data Science Lifecycle. Kirk Borne tackled this topic in a 2014 talk on bringing maturity to the analytics profession. CRISP-DM has been repurposed for it as well. I have learned from these and many other sources. They are all worth exploring for a complete study of the topic.
My take on the DSLC is also shaped by my experience both building data science products and working with companies to build data science capabilities. My view of many of the key elements is different or more detailed because I’ve needed to implement them as a data scientist and build them as a data strategy consultant. I come at this with two goals:
As an aspiring data scientist, there are benefits to understanding the DSLC. It allows you to focus your education on the areas that most interest you. A holistic understanding of the process helps you understand not only your role in the DSLC but also how that role depends on and supports the other phases.
From a business perspective, it allows a team approach to data science and machine learning where each role is well defined. That has advantages when it comes to hiring, strategy planning, and project management. It is a managed process which allows for oversight with the goal of either revenue or cost savings clearly in mind.
I also see a process that runs parallel to the DSLC: data monetization. Building a business case, selecting data science projects that align with business goals, and measuring the ROI of each project are some of the steps that fall under this process. I pull a lot of the strategic elements into the data monetization process.
Data scientists need to be involved in these monetization steps, but not without help. This is where external stakeholders and senior leaders get involved. The parallel process keeps data science capabilities from becoming a silo within the organization. Those stakeholders also inject business needs and goals into the process, which helps monetize more initiatives.
Data monetization is the reason the DSLC needs to be a managed, transparent process. When you view monetization as a separate process, managing the DSLC that way becomes a realistic objective. With that in mind, let’s dive into the DSLC.
The entrance criteria for the research phase can come from several sources. A business case is the most common. Either a data science team with deep business acumen or a data science savvy strategy team builds the business case for a specific initiative. That business case, or another document describing the problem and opportunity, gives the researcher or research team enough information to explore three spaces.
The problem space exploration is where many projects are determined to be non-data science or non-machine learning problems. Not every problem that can be solved with data science should be solved that way when there is a simpler approach. The result of a problem space exploration is often a determination that additional research, analytics, BI, or traditional software development is the most cost-effective solution.
For those projects that pass the problem space gate, the solution space is the next to explore. Again, many projects are determined to be non-data science or machine learning in this review. Often others have come up with solutions that do not include using either approach. A week’s work can save months of failed iterations.
As you can imagine, understanding how others have approached the solution saves time for projects that are greenlit for the next phase. Narrowing down the potential solutions makes data gathering and model selection more directed.
The third review is the data space. What data does the business have that would assist in solving the business problem? What is the provenance of that data? The question of data provenance is critical because data goes stale, data is often questionable with respect to sourcing, and data is frequently inaccurate or incomplete.
What additional data needs to be gathered and what methods will be used to gather it? This is typically the gating question. Cost of data gathering and/or access to reliable data will often determine whether the project moves forward or is shelved.
The exit criteria for the research phase come down to whether there’s sufficient understanding of the problem space, sufficient guidance on the solution space, and data to support potential solutions. Each step is a gate looking to weed out non-data science and non-machine learning problems, while the phase itself is a gate for problems that cannot be solved by the business at this time.
The data science phase is the best understood. It is the first part of the process that is iterative, which makes managing it challenging without a thorough research phase. The three explorations build a degree of certainty as to how best to proceed as well as what iterations are likely. That brings more stability to this phase’s iterations, making oversight easier.
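The three gates described above can be sketched as a simple decision function. This is only an illustration: the `ResearchFindings` fields, the `Outcome` names, and the `research_gate` function are hypothetical, and a real review would capture far more than three yes/no answers.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PROCEED = "proceed to the data science phase"
    SIMPLER_APPROACH = "use research, analytics, BI, or traditional software"
    SHELVE = "shelve until the business can support it"


@dataclass
class ResearchFindings:
    # Hypothetical summary of the three explorations.
    needs_data_science: bool       # problem space: is this a DS/ML problem at all?
    simpler_solution_exists: bool  # solution space: solved elsewhere without DS/ML?
    data_available: bool           # data space: reliable data on hand or affordable to gather?


def research_gate(findings: ResearchFindings) -> Outcome:
    """Apply the problem, solution, and data space gates in order."""
    if not findings.needs_data_science:
        return Outcome.SIMPLER_APPROACH
    if findings.simpler_solution_exists:
        return Outcome.SIMPLER_APPROACH
    if not findings.data_available:
        return Outcome.SHELVE
    return Outcome.PROCEED


# A project that clears the first two gates but lacks affordable data.
print(research_gate(ResearchFindings(True, False, False)).name)
```

Ordering the checks this way mirrors the text: the cheapest question (is this a data science problem at all?) is asked first, and the data question, which is typically the gating one, comes last.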
Each data science phase iteration consists of:
Each step is a gate looking to include/exclude a specific model as a viable solution to the problem. It is worth noting that a model can fail at each step and when a model passes all steps, additional iterations may still be desirable to find a model that better solves the problem.
External oversight is necessary to ensure that each iteration is valuable from a business perspective. From a data science perspective, additional exploration is often possible to find a better model. However, from a business value perspective, this exercise will result in diminishing returns. The business and data science teams work together to determine a threshold where further iterations will not return adequate business value.
The exit criterion for this phase is a working prototype. The prototype should be robust enough to demonstrate the model’s expected predictions or analysis to a non-technical audience. It is not necessary for the prototype to be a complete solution, ready for production. That is the role of the next phase.
The final phase productizes the prototype. This phase can be abbreviated or altogether unnecessary depending on the project goals and how complete the prototype is. The entrance criteria are the prototype and the monetization strategy, which should come from the business case. Often the monetization strategy needs to be modified based on the results of the prior phases of work. That is a parallel process not covered here.
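One way to make that threshold concrete is a stopping rule on marginal gain. The sketch below is an assumption, not a prescribed method: `iterations_worth_running`, the per-iteration value estimates, and `min_gain` are all hypothetical, and in practice the estimates would come from the business and data science teams together.

```python
def iterations_worth_running(estimated_values, min_gain):
    """Count iterations to run before the marginal gain drops below min_gain.

    estimated_values: estimated cumulative business value after each iteration.
    min_gain: smallest improvement that still justifies another iteration.
    """
    if not estimated_values:
        return 0
    count = 1  # the first iteration is always run
    for previous, current in zip(estimated_values, estimated_values[1:]):
        if current - previous < min_gain:
            break  # diminishing returns: stop iterating
        count += 1
    return count


# Hypothetical value estimates (in thousands) after each of five iterations.
values = [100, 160, 190, 200, 203]
print(iterations_worth_running(values, min_gain=20))
```

With these illustrative numbers the gains are 60, 30, 10, and 3, so the rule stops after the third iteration, where the next gain first falls below 20.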
Data engineering follows a process like software development.
This phase follows normal development gates meant to evaluate each step with an eye towards product quality rather than eliminating projects. Managing each step also follows traditional software project and product management best practices. While the skills required to execute this phase are unique to the DSLC, the process and management are not.
Businesses without an understanding of the DSLC see value from their data science capabilities only intermittently. Projects often go off the rails and do not result in anything tangible or monetizable. Data science is like any other capability. Repeatable, documented processes are going to lead to successful business outcomes. With these types of processes, improvement is possible. Following a documented DSLC is a sign of data science maturity within the enterprise.
Aspiring data scientists who do not understand the DSLC have difficulty deciding what to learn to be successful in the field. Focus allows us to learn our part of the process rather than believing we must take on the whole thing ourselves. That is an important point to remember. The unicorn data scientist is expected to handle data science from inception to completion. After looking at the DSLC, how realistic do you think that is? Find an area of focus, then understand its place in the DSLC. What are the inputs, process, and outputs for your chosen role? A solid grasp of those three elements makes for a more effective data scientist.