This article was initially published by our friends at Dataiku.
Deriving tangible value from company data using data science and machine learning (ML) technologies has become integral to business success. What’s more, business leaders have begun to realize that this work deserves the same level of scrutiny and control found in the now well-established practices of software development and productization. In other words, similar technical and organizational forces are at play: a mature approach to data science takes hard-earned lessons from the software world and applies them as consistently as possible to data.
Different skill sets and expertise, such as data engineering, ML, and MLOps, play a vital role when it comes to data. However, these competencies must be practiced with more rigor if they are to deliver maintainable, productized deployments of data initiatives.
In this article, we address a cross-section of the forces at play when working at scale and how they translate into lifting ‘from data to value’ efforts to a higher level of quality, maintainability, and repeatability, thus protecting data science investments.
Data at Scale
Because data volumes are ever-growing, scrutiny of data quality, lineage, and long-term storage is no longer a nice-to-have. Rather, it is a vital component in driving data science projects and product development success. A subject matter expert can no longer be expected to make judgments regarding these efforts without comprehensive support from the data science ecosystem.
Combining a well-established, long-term data storage solution with data governance solutions — whether they are custom, open source, or commercial product-based cloud data lakes — will pave the way to addressing this need.
Develop at Scale
Each development use case takes the shape of a large system with many responsibilities that interact and need coordination: ETL (extract, transform, load), exploratory data analysis, model training, model deployment, reporting, monitoring, and integration with ancillary infrastructure (such as those for authentication and authorization).
Many of these responsibilities and sub-systems will overlap, suggesting opportunities for reuse or co-development. It is wise to take lessons from traditional software development:
- Don’t Repeat Yourself (DRY) – Approach custom development with an eye on reusability. Note that this applies to areas such as custom feature and algorithm development.
- Single Responsibility Principle (SRP) – Properly segment custom code to clearly delineate functionality and ease composition.
- Keep It Simple, Stupid (KISS) and You Aren’t Gonna Need It (YAGNI) – Don’t go overboard; concentrate on the functionality that is currently needed.
- Open/Closed Principle (OCP) – Open for extension, closed for modification. Proper design of custom functionality allows for future feature growth.
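To make these principles a little more concrete, here is a minimal sketch of how they might look in custom feature development. Everything in it is hypothetical and chosen for illustration only: the `FEATURES` registry, the `feature` decorator, and the two example features are not taken from any particular platform.

```python
from typing import Callable, Dict, List

# A feature is a named function from a raw record to a number.
Feature = Callable[[dict], float]

# Shared registry (DRY): each feature is defined once and reused by any model.
FEATURES: Dict[str, Feature] = {}


def feature(name: str) -> Callable[[Feature], Feature]:
    """Register a feature under a name. New features extend the registry
    without modifying existing code (OCP)."""
    def register(fn: Feature) -> Feature:
        FEATURES[name] = fn
        return fn
    return register


@feature("basket_value")
def basket_value(record: dict) -> float:
    # Single responsibility (SRP): this function computes one feature only.
    return record["quantity"] * record["unit_price"]


@feature("is_bulk_order")
def is_bulk_order(record: dict) -> float:
    return 1.0 if record["quantity"] >= 10 else 0.0


def build_feature_vector(record: dict, names: List[str]) -> List[float]:
    """Compose only the features a given model currently needs (KISS/YAGNI)."""
    return [FEATURES[name](record) for name in names]


order = {"quantity": 12, "unit_price": 2.5}
vector = build_feature_vector(order, ["basket_value", "is_bulk_order"])
print(vector)  # [30.0, 1.0]
```

Adding a third feature means writing one more decorated function; neither the registry nor `build_feature_vector` has to change.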
Train at Scale
Single node or ‘local’ model training simply will not cut it, despite the fact that data scientists usually begin their exploration and experimentation in this mode. At-scale projects will quickly show the inadequacy of this approach and the need for scalable lab environments, which allow users to juggle multiple projects concurrently and enable the exploration of many potential candidate models or algorithms that will require hyperparameter tuning. Projects themselves are represented explicitly in such environments, avoiding the need for manual housekeeping and enabling sharing across the team.
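As a rough illustration of exploring many candidates concurrently, the sketch below fans a small hyperparameter grid out over a pool of workers. It rests on several assumptions: `train_and_score` is a hypothetical stand-in with a toy score surface rather than real training, the parameter names are invented, and local threads merely stand in for the remote workers a scalable lab environment would provide.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_score(params: dict) -> float:
    """Hypothetical stand-in for training a candidate model and returning a
    validation score. A real project would fit and evaluate a model here."""
    lr, depth = params["learning_rate"], params["max_depth"]
    # Toy score surface that peaks at learning_rate=0.1, max_depth=6.
    return 1.0 - abs(lr - 0.1) - 0.01 * abs(depth - 6)


def grid(search_space: dict) -> list:
    """Expand a dict of candidate values into a list of parameter dicts."""
    keys = list(search_space)
    return [dict(zip(keys, values)) for values in product(*search_space.values())]


def search(search_space: dict, max_workers: int = 4) -> tuple:
    """Score every candidate in parallel and return the best one."""
    candidates = grid(search_space)
    # Threads stand in for the remote workers of a scalable environment.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(train_and_score, candidates))
    best_score, best_params = max(zip(scores, candidates), key=lambda pair: pair[0])
    return best_params, best_score


space = {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [3, 6, 9]}
best_params, best_score = search(space)
print(best_params, best_score)
```

The point of the structure is that the search logic stays the same whether the pool has four local threads or hundreds of cluster workers; only the executor changes.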
Deploy at Scale
Demand for the predictions served by the deployed model will grow as adoption increases. Selecting a deployment approach that allows for dynamically expanding the number of ‘workers’ is critical to meeting customers’ performance expectations. Automated scaling up or down on the basis of measured load and throughput is now feasible.
Further, the deployment environment must monitor uptime and take appropriate failover actions to guarantee high availability. Model serving should also take a cue from regular (micro)services regarding ongoing evolution: blue-green deployments, in which a new version runs alongside the old one until traffic is switched over, are the norm.
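Scaling on the basis of measured load can be sketched with a simple proportional rule, the same general form that autoscalers such as the Kubernetes Horizontal Pod Autoscaler document. The function name, the default limits, and the choice of per-worker requests per second as the metric are assumptions made for this example.

```python
import math


def desired_workers(current_workers: int, measured_load: float, target_load: float,
                    min_workers: int = 1, max_workers: int = 20) -> int:
    """Proportional scaling rule: resize the pool so that the load per worker
    moves back toward the target, clamped to a configured range.

    measured_load and target_load are per-worker values in the same unit,
    for example requests per second per worker.
    """
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    wanted = math.ceil(current_workers * measured_load / target_load)
    return max(min_workers, min(max_workers, wanted))


print(desired_workers(4, measured_load=150, target_load=100))   # 6 (scale up)
print(desired_workers(4, measured_load=20, target_load=100))    # 1 (scale down)
print(desired_workers(10, measured_load=500, target_load=100))  # 20 (capped)
```

A production autoscaler would add smoothing and cool-down periods so that brief spikes do not cause the pool to thrash; those are left out here to keep the rule visible.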
Monitor & Track at Scale
As a use case ages, feature drift is likely to occur and adversely impact model performance. Automated means of continuously assessing incoming data for drift are a team’s ‘canary in the coal mine.’ Monitoring, in conjunction with model score tracking, will signal the need for model (re)training and redeployment. Again, automation is critical to allow this to happen alongside the data science team’s other responsibilities.
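One way such an automated drift check might look is sketched below, using the Population Stability Index (PSI) to compare a feature’s training-time distribution with recent production data. PSI is only one of several possible drift measures and is not prescribed by this article; the bin count, the `1e-6` floor, and the 0.2 threshold are illustrative assumptions.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample (e.g. training
    data) and a current sample (e.g. recent production data) of one feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference sample

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range fall into the edge bins.
            index = min(bins - 1, max(0, int((v - lo) / width)))
            counts[index] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def needs_retraining(reference: Sequence[float], current: Sequence[float],
                     threshold: float = 0.2) -> bool:
    # 0.2 is a commonly quoted rule of thumb, not a universal constant.
    return psi(reference, current) > threshold


reference = [i / 100 for i in range(100)]      # feature values seen in training
shifted = [0.5 + i / 100 for i in range(100)]  # same shape, shifted upward

print(needs_retraining(reference, reference))  # False
print(needs_retraining(reference, shifted))    # True
```

Run on a schedule against each monitored feature, a check like this can raise the retraining signal without anyone having to inspect the data by hand.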
Teams at Scale
Data science projects are multi-disciplinary efforts. As system complexities increase, team members need to fill specialized roles to drive projects to a favorable conclusion. Data science ecosystems that recognize and emphasize these roles are becoming increasingly important. Maintaining strong relationships between such roles and authorization levels helps keep projects under control and within regulatory requirements.
Taking the Next Step
Operating at scale requires data teams to leverage automation and take advantage of commoditized core storage as well as compute and management services. The lessons learned over decades by the software industry are fully applicable when deriving value from data. As organizations and their implementation teams deal with sustained growth in the number of use cases they are required to implement and keep afloat, nothing less will do to guarantee success.