Best Practices In Data Science

Applied data science and machine learning require best practices. Here are the basics.

Vin Vashishta | Originally Published: January 19th, 2015


Data science was growing up when I originally wrote this in 2015. I was working with large companies on their early prototypes and proofs of concept. Businesses of all sizes are now working on their second- and third-generation data science and advanced analytics projects. This is a good time to introduce a larger audience to best practices.

Integrating best practices has been an elusive goal for many data science organizations, and it costs them in the long run. Everything from data breaches to unsuccessful projects to model failures can be traced back to a lack of best practices. Here are a few tried-and-true practices it is time to add to your data science department.

MDM (Master Data Management) and Data Governance

Data governance needs to cover the full lifecycle of data: collection, analysis, storage, and disposal. A growing number of compliance issues, privacy and ethical concerns, and costs put data governance at the top of my data science best practices list. That was true in 2015 and is even more so today.

Compliance. In 2015, the US had over 20 separate laws that dealt with the handling of data. California is now setting the standard with CCPA, but there are still differences across states. CCPA and new rulings based on GDPR mean some of the data a business collects is held to a higher standard of protection than the rest.

Collected data needs to have a stated purpose. Moving data from Europe to the US raises its own issues, and Asia is a patchwork of regulations. Without a compliance strategy, a business is asking for trouble.

Individual privacy and the ethical use of data are other areas that need attention. I ask my clients, “What would happen if your data science program’s uses for data were made public? How would your customers and partners react?” Answering that starts with cataloging all the data gathering and usage going on right now. There needs to be regular review and oversight, and a comprehensive privacy policy crafted to align with the brand and culture of the business.
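To make cataloging concrete, here is a minimal sketch of what a per-dataset catalog entry might record. It is written in Python, and the field names, example values, and the 365-day retention figure are all illustrative assumptions; a real governance catalog would be shaped by counsel and the compliance strategy.

```python
# Minimal sketch of a data-catalog entry for governance review.
# All fields and example values are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    name: str
    stated_purpose: str          # why the data is collected
    contains_pii: bool           # drives CCPA/GDPR obligations
    storage_region: str          # matters for cross-border transfers
    retention_days: int          # when the data must be disposed of
    consumers: list = field(default_factory=list)  # teams that use it

catalog = [
    DatasetRecord(
        name="web_clickstream",
        stated_purpose="product recommendation model training",
        contains_pii=True,
        storage_region="us-west",
        retention_days=365,
        consumers=["recommendations", "marketing-analytics"],
    ),
]

# A simple oversight query: which PII datasets lack a stated purpose?
flagged = [d.name for d in catalog if d.contains_pii and not d.stated_purpose]
print(flagged or "all PII datasets have a stated purpose")
```

Even a flat list of entries like this makes the “what if this were public?” question answerable, and it gives the regular review something concrete to audit.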

The costs associated with the collection, storage, and processing of large datasets are not trivial. This is an area where companies can find significant cost savings, and being smart about data logistics is often all it takes.

Product Management

I think the hottest job segment within data science will be the data science product manager (still agree with this 5 years later). Productizing data science initiatives is starting to yield big money for companies. Data itself is being compared to currency and is a potential revenue stream. In 2015, data from healthcare and finance was untapped value for businesses. Today, companies across industries have monetized their data, both for internal use and for sale.

The data science capabilities companies are building internally must be connected with the business. That is why I believe product management needs to be brought into data science initiatives. Companies are missing out on revenue, and all they need to capitalize is someone to say, “We can monetize that!”

The monetization strategies around data science are not traditional, so it is not a simple matter of bringing in a software product manager. This role is a specialized hybrid who knows enough about data science to understand the projects while also understanding the market well enough to identify and monetize opportunities. It is worth the effort to find a good data science product manager because the ROI is so high.

Testing

Companies are starting to hire data quality engineers and incorporate quality assurance into data science projects. The cost of defects in a data science system can be much worse than the cost of defects in traditional software, because the business makes critical decisions based on advanced analytics and products depend more and more on model-based features.

Software testing in many data science teams is currently handled by the same data scientists who write the code and select the algorithms. Anytime the fox guards the henhouse, there is going to be trouble. A dedicated quality engineer avoids that conflict and frees the data scientist to work on development full time.

This is another segment of data science that I think will take off this year (2015) and next (2016…and now in 2020). Machine learning models are tested using a variety of methods. Those tests look at accuracy, optimization, and pitfalls like over-fitting. Pairing engineers who specialize in model optimization and quality with those who can select, design, and build models from scratch will save time while producing a better model.
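As a concrete illustration, here is a minimal sketch of an automated model quality check along those lines. It assumes scikit-learn, synthetic data standing in for a real project dataset, and arbitrary thresholds (an 80% accuracy floor and a 10-point train/validation gap); a quality engineer would tune all of these to the project.

```python
# Minimal sketch of an automated model quality check.
# The thresholds and synthetic data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def check_model_quality(model, X, y, min_accuracy=0.80, max_gap=0.10):
    """Flag weak accuracy and train/validation gaps that suggest over-fitting."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, random_state=42
    )
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    val_acc = model.score(X_val, y_val)

    problems = []
    if val_acc < min_accuracy:
        problems.append(f"validation accuracy too low: {val_acc:.2f}")
    if train_acc - val_acc > max_gap:
        problems.append(f"possible over-fitting: gap of {train_acc - val_acc:.2f}")
    return problems

if __name__ == "__main__":
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    issues = check_model_quality(RandomForestClassifier(random_state=0), X, y)
    print(issues or "model passed basic quality checks")
```

Checks like this belong in the build pipeline, so a model that regresses fails the build the same way broken code would.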

Security

Businesses group this in with data governance, but it does not belong there. Security is part of the system design and needs to be in the hands of security engineers. Data governance should have an oversight role (Does the security of the system meet company requirements? What do we do when there is a breach?). However, the data science team needs an information security engineer, and the system needs to be architected with security in mind. Look at Target or any of the several large companies that have dealt with a large-scale data breach. Security should be an imperative for any data science team.
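One small example of designing with security in mind: pseudonymize direct identifiers before raw records ever reach the data science environment. The sketch below uses keyed hashing; the field names and the environment-variable key are illustrative assumptions, and real key management belongs with the security engineer.

```python
# Minimal sketch: replace direct identifiers with stable keyed hashes
# before records enter the data science environment.
# Field names and the environment-variable key are illustrative assumptions.
import hashlib
import hmac
import os

PII_FIELDS = {"email", "phone", "customer_id"}  # assumed schema
SECRET_KEY = os.environ.get("PSEUDONYM_KEY", "dev-only-key").encode()

def pseudonymize(record: dict) -> dict:
    """Return a copy of the record with PII fields replaced by keyed hashes."""
    clean = {}
    for name, value in record.items():
        if name in PII_FIELDS:
            digest = hmac.new(SECRET_KEY, str(value).encode(), hashlib.sha256)
            clean[name] = digest.hexdigest()
        else:
            clean[name] = value
    return clean

if __name__ == "__main__":
    raw = {"customer_id": 1042, "email": "a@example.com", "basket_total": 57.20}
    print(pseudonymize(raw))
```

Because the hash is keyed and stable, analysts can still join records on the pseudonymized identifier without ever handling the raw value.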

 

Data science is not a “wild west” technology anymore. It has grown up and matured over the last three (now eight) years. As businesses of all sizes build or ramp up data science teams, it is a good time to think about how to build in the fundamental best practices of data science. Process and innovation are in a constant struggle, so striking a balance between the overhead of process and the pace of progress is critical.

 

Five Years Later…

This is still a core primer on best practices for our field. Machine learning has gained traction in most companies. Best practices have not. That gap is an opportunity for cost savings and improved execution.