While doing AI is considered sexy and cool, data infrastructure typically is not. However, production-grade machine learning applications rely heavily on proper data infrastructure. Hence, in order to generate actual business value, solid data pipelines are required.
A truth about data science projects
Consider the following story: We start a data science project with a time-boxed proof of concept (PoC) and a skillful, enthusiastic project team that generates insights in fast iterations. When our time box ends, we have achieved promising results and get the opportunity to present them to “the decision makers”. Luckily, we ace our presentation and it becomes more and more clear: “this idea is a winner!”. Everyone is feeling great.
However, under the hood, the promising results were achieved on a local machine with data dumps (CSV files) from multiple data sources, carefully hand-crafted by an SQL expert with a special series of magical queries. Working with data dumps is often very useful for fast iterations, but it is also a shortcut.
This shortcut becomes a problem when it is ignored, and even more so when it is not properly communicated to and understood by “the decision makers”. The price for fast insights and results during the PoC was that we did not consider data infrastructure or build proper data pipelines. Now that everyone is hyped about the successful presentation and expects business value within “a few days”, we have to face the truth: we took a shortcut, and before we can generate business value, we have to fix it.
A heavy burden for data science projects
Because short-term success as sketched in our fictional story is possible, the AI hype often drives companies to start data science projects before evaluating the current state of their data infrastructure. This can be useful for validating ideas. However, it becomes a problem if not all stakeholders are aware of the implication: depending on the data infrastructure, the next step of building a production-grade machine learning application is more likely than not a challenging one.
At worst, outdated or missing data infrastructure can lead to the bizarre situation in which a data science project team is expected to solve company-wide data infrastructure challenges in passing. On the quest to generate business value after a successful PoC, the project team has to deal with organisational issues without much visibility in the company, and hence without much support. This can result in frustration and in failure to generate business value with the once so promising idea.
Metaphorically speaking, the team is driving a sports car (the shiny new machine learning model) on a forest path (the data infrastructure), and everyone around wonders why it does not run at full speed. This creates the impression that it is simply not possible to generate actual business value with data science projects.
The attitude towards data infrastructure
Historically, setting up a new data infrastructure is associated with great pain, because it typically requires huge investments of time and money. Furthermore, there are various paradigms, e.g. data lakes, data lakehouses and data meshes, each with a different set of benefits and trade-offs to consider. However, improving the data infrastructure is not solely a technical issue. In fact, what is even more important is changing the mindset about the significance of data in the company. This change is likely the single most crucial step for generating sustainable and growing business value with data science projects.
There are (at least) two reasons to rethink data infrastructure:
Faster cycle time for validation
Starting with a PoC as described above yields first insights as to whether or not an idea is promising. However, without proper data infrastructure, it is usually hard and costly to validate an idea end-to-end. This usually leads to lengthy discussions about whether and when to start, without ever actually starting and learning valuable lessons about the idea. Proper data infrastructure that allows for fast end-to-end validation and experimentation makes it possible to dismiss unsuccessful ideas faster and to single out the successful ones. The ability to validate ideas with fast end-to-end experiments is particularly crucial for building successful machine learning applications because there are many moving parts and the application typically affects various parts of the company.
The true value is the ability to utilize machine learning algorithms
Typically, there are plenty of ideas on how to use data science to generate business value in a company. More often than not, there are also known machine learning algorithms already used by others to implement the same or similar ideas. Hence, there is legitimate reason to assume that such a known algorithm would work at least well enough for a first version of the application. However, in many cases it is not the knowledge about the algorithm, but the availability of data and the ability to integrate the algorithm end-to-end into the company infrastructure in a painless way that decide whether business value can be created.
Starting to (re-)think data is not only a technical issue; first of all, it amounts to asking questions like:
- What data is available?
- What quality does the data have?
- Who owns the data?
- Who controls access to the data?
- Who is responsible for the data and its quality?
- What is the current relationship between producers and consumers of data?
- How can we evolve and improve this crucial relationship?
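The first two questions can often be approached with very little tooling. The following is a minimal sketch, not a recommended tool: it profiles a CSV data dump (like the ones from our PoC story) and reports, per column, how many rows exist, how many values are missing and how many distinct values occur. The function name `profile_csv`, the sample data and the column names are purely hypothetical and chosen for illustration.

```python
import csv
import io


def profile_csv(text: str) -> dict:
    """Return per-column row count, missing count and distinct-value count."""
    reader = csv.DictReader(io.StringIO(text))
    fieldnames = reader.fieldnames or []
    # Collect raw statistics per column; distinct values are kept in a set.
    stats = {name: {"rows": 0, "missing": 0, "distinct": set()} for name in fieldnames}
    for row in reader:
        for name in fieldnames:
            value = row.get(name)  # None if the row has fewer fields than the header
            stats[name]["rows"] += 1
            if value is None or value.strip() == "":
                stats[name]["missing"] += 1
            else:
                stats[name]["distinct"].add(value.strip())
    # Replace each set by its size so the report is easy to print or serialize.
    return {
        name: {"rows": s["rows"], "missing": s["missing"], "distinct": len(s["distinct"])}
        for name, s in stats.items()
    }


# Hypothetical data dump with two missing values (assumption, for illustration only).
SAMPLE = "customer_id,country,revenue\n1,DE,100\n2,,250\n3,DE,\n"

report = profile_csv(SAMPLE)
for column, column_stats in report.items():
    print(column, column_stats)
```

A report like this does not answer who owns the data or who is responsible for it, but it gives producers and consumers of the data a concrete basis for that conversation.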
Here, the main driver for answering these questions is to find ways to reduce the overall end-to-end cycle time for the development of machine learning applications by improving, for example, the availability, the processing speed and the quality of data. In most cases, we do not recommend solving all data infrastructure problems upfront and only then starting data science projects. In fact, many issues are typically unknown at this point, and it is very challenging to find suitable solutions without implementing specific applications.
Hence, it is perfectly fine to start rethinking data on a large scale and to act small, on a use-case-by-use-case basis. For one thing, this allows validating different technical approaches. For another (and this is probably more important), it allows changing the mindset about the significance of data step by step. The focus here is on implementing use cases end-to-end and learning, in particular, how to build production-grade machine learning applications. At this stage, using cloud infrastructure offers a good way to move quickly without much ramp-up. As soon as several use cases create business value, unifying the data infrastructure might yield additional value.
While everyone is talking about AI, in the end it takes much more than data science and machine learning algorithms to build production-grade machine learning applications. Sustainable business value with machine learning and AI requires more than clever algorithms: it requires rethinking data.
In fact, many state-of-the-art algorithms are publicly available. For example, the Google search algorithm is published and even its patent has expired, the state-of-the-art natural language processing model BERT is open source, and the state-of-the-art image classification network architecture ResNet is available and even implemented in various frameworks like Keras and PyTorch.
However, the major part of the business value is not the algorithm but the capability to employ the algorithm in a production-grade machine learning application, and hence the true value lies in the underlying data (often not open source) and the data infrastructure.
Here, a big-bang attempt to change the data infrastructure, or starting by pouring all available data into a single place, is more often than not an unnecessarily challenging approach. Instead, starting to rethink data use case by use case and building a suitable data infrastructure along the way appears to be much more promising.
If you are interested in some more ideas and our approach to data science projects, see the blog post (German) and the free on-demand webinar (German).