DevOps and Data Science: What Works, What Doesn’t, and How We Can Do Better

Is DevOps Detrimental to Data Science?

Arriving at the Scene

You are brought into an organization as a solutions architect and are introduced to two teams with two seemingly different goals assigned to the same project. One team is focused on accuracy and analytics. The other team is focused on handling the cold, hard reality of running intelligent systems at scale. In between them sits a wall of confusion and a jumbled mess of personal, professional, and (not always rational) opinions. Those who have read books like The Phoenix Project by Gene Kim, George Spafford, and Kevin Behr may see the DevOps signal light up the night sky. DevOps enthusiasts will suit up to rid this organization of “wrongdoing”.

These calls to action do not come without their own set of risks and consequences. What happens when, instead of becoming DevOps caped crusaders, we become the villain we sought to defeat? What about when, instead of creating the Knights of the Round Table that fight for all, we create an oligarchy of “we know best”? What are the consequences of unintentionally building a table that divides instead of unites? Last but certainly not least, what are the lessons learned and frameworks we can use to make sure our crusade against confusion and conflict leaves everyone better off than when we found them?

If you are interested in handling these types of scenarios, learning what happens when you miss the mark, working towards a DevOps-like experience for data scientists, and deciding whether referring to this cultural shift as “DevOps for data scientists” does us more harm than good, you are in the right place. We at Arrikto salute those about to Rok, and we salute you too for delving into the depths of DevOps and Data Science: What Works, What Doesn’t, and How We Can Do Better.

The Disruptive Duo

DevOps and data science. Two terms fueling a frenzy of tech advice and talent scouting. One is the old guard (DevOps); the other is driving the world into a new era of excitement and uncertainty (data science). You have most likely come in contact with some sort of ChatGPT post claiming your DevOps job is obsolete, or a data scientist telling you how to tune your neural network for better performance. I cannot offer you advice on handling the new AI onslaught or tuning your next-gen model, but I can offer you advice on how to bring your data scientists, platform teams, machine learning engineers, and data professionals together so you can lead the AI revolution with a unified force.

Hostage Negotiations

Any technologist, whether they just wrote their first kubectl command hoping for a KubeCon talk someday or are scoffing at the absurdity of Kubernetes while writing WebAssembly, knows that technology is all about tradeoffs. Where there are tradeoffs, there are breeding grounds for conflict. Where there is conflict, there is room for negotiation, and negotiations can be challenging, to say the least.
Whether it’s an internal team trying to assert dominance by mandating arbitrary requirements or a junior platform engineer who simply can’t say no to their customers, solutions architects have witnessed many varieties of friction between teams. It is such a major aspect of the role that I (as a solutions architect myself) have had to adjust my skill set from that of a Kubernetes-focused platform engineering geek to that of an architect focused on negotiating the terms of release for projects taken prisoner by cross-team quarrels. I have become so fixated on trying to bring teams together with proper negotiation tactics that I have taken to enrolling in MasterClasses and reading books on negotiation strategies. One such book, Never Split the Difference: Negotiating As If Your Life Depended On It by Chris Voss and Tahl Raz, was written by a former FBI hostage negotiator. Negotiation can be serious stuff!
People have different wants and needs created by external pressures. Understanding those requirements takes time and genuine curiosity about what is actually driving someone. The slow pace of actually listening to people and understanding what steers them can feel excruciating in the high-velocity tech world, where we are taught to “fail fast” and dream of five-minute development cycles that push directly to prod without hesitation. Like software quality, negotiation quality takes time. Any machine learning model observability team will tell you that accuracy changes as context changes. Context changes apply to your negotiated agreements as well. We need to be able to iterate. Now, am I saying the skills of an FBI hostage negotiator are required when talking to machine learning and operations teams? I don’t know if I’d go that far, but sometimes it does feel like many data science projects are taken hostage by certain teams with underexplored demands. These conflicts can masquerade as religious tooling debates, constraint complaints, or handoff heresy, but in the end it all comes down to one question: “Why?”

The Big Why

Realistically, “why” isn’t very helpful. “Why” is an accusation. “Why” translates to “defend your actions”. “What” and “how” are often more effective: “What are you trying to do?” “What drives that desired outcome?” “How have you approached this problem before?” “How can we work together to get you where you want to be?” We can empower these questions even further by asking “what’s in it for you?” A question starting with “what” helps us actually flesh out the problem. A question beginning with “how” helps us solve it. The cost of anything becomes a “what”-focused conversation, because many see cost only as a problem and don’t focus on what they lose by not investing. Investment doesn’t have to take the form of JUST money (in fact, it rarely does). One of my favorite quotes about cost comes from the Site Reliability Engineering book’s chapter “The Evolution of Automation at Google”: “If we are engineering processes and solutions that are not automatable, we continue having to staff humans to maintain the system. If we have to staff humans to do the work, we are feeding the machines with the blood, sweat, and tears of human beings. Think The Matrix with less special effects and more upset System Administrators.”

Costly Consequences

The cost of failing to automate is not just money, represented by the salaries of the administrators, but also the mental health cost that reduced job satisfaction forces our administrators to pay. How can you retain top talent if they are busy carrot-feeding servers or repeatedly reading runbooks for your unreliable system? The DevOps movement tries to improve collaboration and reduce conflict through technology by helping teams define a unified mission for developing software, packaging it, and communicating requirements across teams reproducibly. DevOps also has another, equally noble mission: providing value and peace of mind (stability) to the organization. All of these strategies exist to prevent the tolls levied on poorly aligned and frustrated teams. What about our data science teams? Are they immune to the consequences of poor collaboration?

DevOps for Data Scientists?

Enter MLOps. MLOps can easily be misinterpreted as “just DevOps for data scientists”. What assumptions did we just make by mapping those two terms so closely together? When negotiating, we can sometimes get so caught up in how we see the world (i.e., how a developer moves an application to production) that we forget to lean in and understand how the team we are negotiating with (the actual data scientists, machine learning engineers, and data professionals) views the world. We seek to be understood instead of to understand. As a solutions architect with an interest in platform engineering, Kubernetes, and Kubeflow, I have to catch myself when I start to solve a problem by hitting it with a Kubeflow-shaped hammer. I have to constantly ask myself, “Can I actually help them solve their problem?”

When we start listening to ourselves instead of our customers, we can create a major misunderstanding. This misunderstanding can lead to teams developing tools that don’t actually solve real problems. We could spend days, weeks, or months building out a solution to help reduce a data scientist’s toil, only to demo it to them and send a signal screaming “we weren’t listening at all”. Both sides would be justified in being defensive. After all, one team might have just spent company time developing a solution based on their deep development expertise only to have it rejected by the team they built it for! Imagine if you spent months talking to a custom car company about your dream car and they dropped off your new car with a manual transmission! You were sure you were clear, they assured you they were listening, but you don’t know (and don’t want to know) how to drive a stick! Both parties were living a lie!

This is exactly what happens when we build solutions without actively listening to our customers. We’ve seen firsthand what goes wrong when you fail to handle these negotiations properly. In our next blog post, we will discuss the mistakes we made and the lessons we learned when we approached DevOps and data science the wrong way.

About the author

At the time of writing, Chase is a solutions architect at Arrikto whose focus is helping people discover “what’s in it for them” as they journey into the world of leveraging platforms to accelerate their development efforts. Chase’s early career was oriented around manual tasks and testing. Since then, he has pursued the goal of reducing the amount of “job chores” professionals must take part in and giving them time to solve more interesting and valuable problems. Reducing toil aligns well with Chase’s solutions architecture strategy of putting people’s problems first and technology second, to avoid “yet another tool” being thrown into the already overencumbered toolbox that technology professionals are pressured to keep up with.

About Kubeflow

Kubeflow is an open source, cloud-native MLOps platform originally developed by Google that aims to provide all the tooling that both data scientists and machine learning engineers need to run workflows in production. Features include model development, training, serving, AutoML, monitoring, and artifact management.

Kubeflow is the open source machine learning toolkit for Kubernetes.

About Arrikto

We are a machine learning platform powered by Kubeflow and built for data scientists. We make Kubeflow easy to adopt, deploy, and use, having made significant contributions since the 0.4 release and continuing to contribute across multiple areas of the project and community. Our projects/products include:

  • Enterprise Kubeflow (EKF) is a complete MLOps platform that reduces costs, while accelerating the delivery of scalable models from laptop to production.
  • Kubeflow as a Service is the easiest way to get started with Kubeflow in minutes! It comes with a Free 7-day trial (no credit card required).
  • Rok is a data management solution for Kubeflow. Rok’s built-in Kubeflow integration simplifies operations and increases performance, while enabling data versioning, packaging, and secure sharing across teams and cloud boundaries.
  • Kale is an open source workflow tool for Kubeflow that orchestrates all of Kubeflow’s components seamlessly.


Free Technical Workshop

Turbocharge your team’s Kubeflow and MLOps skills with a free workshop.