Best Practices In Data Science
Applied data science and machine learning require best practices. Here are the basics.
Vin Vashishta | Originally Published: January 19th, 2015
Data science was growing up when I originally wrote this in 2015. I was working with large companies on their early prototypes and proofs of concept. Businesses of all sizes are now working on their second- and third-generation data science and advanced analytics projects. This is a good time to introduce a larger audience to best practices.
Integrating best practices has been an elusive goal for many data science organizations, and it costs them in the long run. Everything from data breaches to unsuccessful projects to model failures can be traced back to a lack of best practices. Here are a few tried-and-true best practices that it is time to add to your data science department.
Data governance needs to cover the full lifecycle of collection, analysis, storage, and disposal of data. A growing number of compliance issues, privacy and ethical concerns, and costs put data governance at the top of my data science best practices list. That was true in 2015 and is more so today.
Compliance. In 2015, the US had over 20 separate laws that dealt with the handling of data. California is now setting the standard with CCPA but there are still differences across states. CCPA and new rulings based on GDPR mean some of the data a business collects are held to higher standards of protection than others.
Data collected needs to have a stated purpose. There are issues moving data from Europe to the US. Asia is a patchwork of regulations. Without a compliance strategy, a business is asking for trouble.
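As a rough illustration of tying a stated purpose and a disposal date to each dataset, here is a minimal Python sketch. The `DatasetRecord` class, its field names, and the `governance_issues` helper are all hypothetical, invented for this example; a real governance process would draw its retention periods and rules from the regulations that apply to the business.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical governance metadata kept alongside a dataset."""
    name: str
    purpose: str          # why the data was collected
    collected_on: date
    retention_days: int   # assumed retention period before disposal

    def disposal_due(self) -> date:
        """Date on which the dataset should be disposed of."""
        return self.collected_on + timedelta(days=self.retention_days)


def governance_issues(record: DatasetRecord, today: date) -> list:
    """Return a list of governance problems found for one dataset."""
    issues = []
    if not record.purpose.strip():
        issues.append("no stated purpose")
    if today >= record.disposal_due():
        issues.append("past retention period; schedule disposal")
    return issues


# Illustrative records only; names and dates are made up.
stale = DatasetRecord("web_clickstream", "", date(2015, 1, 19), 365)
healthy = DatasetRecord("newsletter_signups", "send requested newsletters",
                        date(2020, 1, 1), 730)

print(governance_issues(stale, date(2020, 1, 19)))
print(governance_issues(healthy, date(2020, 6, 1)))
```

Running a check like this on a schedule is one way to catch datasets that have no stated purpose or have outlived their retention period before they turn into a compliance or storage cost problem.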
The costs associated with the collection, storage, and processing of large datasets are not trivial. This is an area where companies can find significant cost savings. Being smart about data logistics is all it takes to capture those savings.
I think the hottest job segment (still agree with this 5 years later) within data science will be the data science product manager. Productizing data science initiatives is starting to yield big money for companies. Data itself is being compared to currency and is a potential revenue stream. In 2015, data from healthcare and finance was untapped value for businesses. Today companies across industries have monetized their data both for internal use and for sale.
The data science capabilities companies are building internally must be connected with the business. That is why I believe product management needs to be brought into data science initiatives. Companies are missing out on revenue, and all they need to capitalize is someone to say, “We can monetize that!”
The monetization strategies around data science are not traditional so it is not a simple matter of bringing a software product manager in. This role is a specialized hybrid who knows enough about data science to understand the projects, while also understanding the market well enough to identify and monetize opportunities. It is worth the effort to find a good data science product manager because the ROI is so high.
Companies are starting to hire data quality engineers and incorporate quality assurance into data science projects. The cost of defects in a data science system can be much worse than the cost of defects in traditional software. That is because the business is making critical decisions based on advanced analytics and products are more dependent on model-based features.
Software testing in many data science teams is currently handled by the data scientists responsible for writing the code and selecting the algorithms. Any time the fox guards the henhouse, there is going to be trouble. A dedicated quality engineer avoids that conflict and frees the data scientist to work on development full time.
This is another segment of data science that I think will take off this year (2015) and next (2016…and now in 2020). Machine learning algorithms are tested using a variety of methods. Those tests look at accuracy and optimization, and check for pitfalls like over-fitting. Pairing engineers who specialize in model optimization and quality with those who can select, design, and build models from scratch saves time and produces a better model.
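One common check for over-fitting is to compare accuracy on the training data with accuracy on data the model has never seen. The Python sketch below shows the idea using only the standard library. The synthetic data, the two toy "models", and the helper names `accuracy` and `overfit_gap` are assumptions made up for illustration, not a method from any particular team or library.

```python
import random


def accuracy(model, rows):
    """Fraction of (x, label) rows the model classifies correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def overfit_gap(model, train, holdout):
    """Training accuracy minus holdout accuracy.

    A large positive gap suggests the model fits the training
    data far better than it generalizes.
    """
    return accuracy(model, train) - accuracy(model, holdout)


# Synthetic data (assumption): label is 1 when x > 0.5,
# with roughly 20% of labels flipped as noise.
rng = random.Random(0)


def make_rows(n):
    rows = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.2:
            y = 1 - y
        rows.append((x, y))
    return rows


train, holdout = make_rows(500), make_rows(500)

# Toy model 1: memorizes every training row, guesses 0 for anything new.
lookup = dict(train)


def memorizer(x):
    return lookup.get(x, 0)


# Toy model 2: a simple threshold rule that ignores the noise.
def threshold(x):
    return int(x > 0.5)


print("memorizer gap:", round(overfit_gap(memorizer, train, holdout), 3))
print("threshold gap:", round(overfit_gap(threshold, train, holdout), 3))
```

The memorizing model scores perfectly on its own training data and much worse on the holdout set, while the threshold rule scores about the same on both. A quality engineer can turn a gap like this into an automated test that fails when it exceeds an agreed limit.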
Businesses group information security in with data governance, but it does not belong there. It is part of the system design and needs to be in the hands of security engineers. Data governance should have an oversight role (Does the security of the system meet company requirements? What do we do when there is a breach?). However, the data science team needs an information security engineer. The system needs to be architected with security in mind. Look at Target or any of several large companies that have dealt with a large-scale data breach. Security should be an imperative in any data science team.
Data science is not a “wild west” technology anymore. It has grown up and matured over the last three (now eight) years. As businesses of all sizes build or ramp up data science teams, it is a good time to think about how to build in the fundamental best practices of data science. Process and innovation are in a constant struggle, so striking a balance between the overhead of process and the pace of progress is critical.
This is still a core primer on best practices for our field. Machine learning has gained traction in most companies. Best practices have not. There is an opportunity for cost savings and improved execution.