BigQuery to the rescue: How to prototype an ML system for a medium-sized wholesale company in the Google Cloud

2.2.2022 | 9 minutes of reading time

BigQuery can help with building an ML system for production with a short time to market.

  • Follow industry standards. Agile methods, the MLOps framework and focus on an MVP are helpful.
  • Model improvement is not everything. A good model evaluation as well as versioning and metadata tracking are worth the effort.
  • BigQuery also has downsides. The API and slot management can be a hindrance at first.

In my bachelor thesis I examined how medium-sized companies can effectively set up their own machine learning (ML) system using limited resources.
The motivation for this: in 2020, despite the coronavirus pandemic, 71% of companies increased their investments in ML [1]. However, the resources of a medium-sized company remain finite. In addition, only 53% of ML proofs of concept (POCs) were deployed to production and realised their promised value between 2018 and 2020, according to Gartner [2].

In the context of my thesis, I cooperated with Mosecker GmbH, a leading family-owned, medium-sized wholesale company for domestic appliances whose core business is customer satisfaction rather than IT. We carried out a project to develop a production-ready ML system prototype in order to test a combination of agile methods and the MLOps framework as a development approach.

The approach proved helpful. In the following, however, I will focus on BigQuery ML, the managed cloud service we used as the basis for the ML system. I will first introduce the project and the system in general and then go into detail about a few particularities.

The Project

The project's goal was to create a recommendation system prototype for personalised product recommendations in Mosecker's web store in less than ten weeks. The prototype should also be suitable for production use, in order to bring the idea to the market as quickly as possible and validate it.

For the project, delivery data between 01.01.2019 and 15.09.2021 was made available. It included

  • a total of 4,823,891 transactions 
  • 5,897 different customers 
  • 910,002 individual items

System Setup

Since one of the primary conditions was to create the prototype with the least possible resources and effort, all required resources were sourced from the cloud right from the start to avoid ramp-up costs and delays due to hardware provisioning and configuration.
Another advantage of the cloud strategy is the on-demand use and billing of storage and computing resources. This makes it possible to easily scale up the prototype at a later time. 

I mainly used the BigQuery service and its built-in ML capabilities on the Google Cloud Platform (GCP) for this project, as I had previously used it for a similar use case. At the time of writing, the first processed terabyte of every month was free, which meant even lower costs. For a detailed description of how to create a recommendation model with BigQuery ML, I recommend two blog posts by my codecentric colleagues, Timo Böhm and Niklas Haas. The first [3] is about the technical part; the second [4] is about the business perspective.

To complete the system with metadata tracking, I used the open-source tool MLflow. In concept, the ML system then consisted of three parts:

  1. BigQuery in GCP to process the data, train and evaluate the model, and create the recommendations.
  2. A Docker image containing the pipeline, executed (locally), which manages and monitors the individual steps of the process.
  3. MLflow, which collects various metadata about the training run and the created model during execution and presents it for comparison in a web UI.

Figure 1 shows the interaction of the components during execution.

Figure 1: Schematic of training pipeline, communication between pipeline and BigQuery

This adds up to a single GCP project and two Docker images: one to run the pipeline, the other to host the MLflow web UI.
The training data is provided as CSV files in a mounted volume.
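
To give an idea of what the data import step looks like in practice, here is a minimal sketch of loading one of the mounted CSV files into BigQuery with the Python client library. Project, dataset, table and file names are placeholders, not the ones used in the project.

```python
# Minimal sketch: load a delivery CSV from the mounted volume into a BigQuery table.
# Project, dataset, table and file names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row
    autodetect=True,      # let BigQuery infer the schema
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

with open("/data/deliveries.csv", "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file,
        "my-gcp-project.recommender.deliveries",
        job_config=job_config,
    )

load_job.result()  # wait until the load job has finished
print(f"Loaded {load_job.output_rows} rows.")
```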

I have to note here that continuous monitoring was not within the project's scope, although it is in general a crucial step when following industry-standard MLOps principles [5].

Pipeline steps

The pipeline is separated into five steps:

    1. Data Import: Upload CSV files to BigQuery
    2. Data preparation: Data cleaning and creation of the training dataset
    3. Model training: Creating the model and extracting the recommendations
    4. Model evaluation: Validating the model using mathematical metrics
    5. Prediction serving: Exporting the top recommendations per customer to a CSV file

Google's MLOps white paper [5] served as a template for the structure. It was then adapted to the project's needs and the technical requirements of the BigQuery ML service, which resulted in the pipeline steps described above.
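
As a rough sketch, assuming the SQL for each step lives in its own file and MLflow is already configured, the pipeline entry point could look like the following. File names, the run name and the project ID are placeholders; in the actual pipeline the data import uses the client library rather than SQL, as shown above.

```python
# Sketch of the pipeline entry point running inside the Docker image:
# the five steps are executed in order against BigQuery, wrapped in one MLflow run.
import pathlib

import mlflow
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

STEPS = [
    "01_data_import.sql",         # load the staged data into working tables
    "02_data_preparation.sql",    # data cleaning and training dataset
    "03_model_training.sql",      # CREATE MODEL ... and ML.RECOMMEND
    "04_model_evaluation.sql",    # mAP@k / mP@k calculation
    "05_prediction_serving.sql",  # top recommendations per customer
]

with mlflow.start_run(run_name="training-pipeline"):
    for step in STEPS:
        sql = pathlib.Path("sql", step).read_text()
        mlflow.log_text(sql, artifact_file=f"queries/{step}")  # keep the query for debugging
        client.query(sql).result()  # run the step and wait for completion
```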

Implementation peculiarities and beneficial factors

The use of BigQuery as an integral part of the system presented some challenges. But I also want to highlight some beneficial factors for completing the project.

Start simple with a baseline model

One of them is to start with a minimal baseline model. Even though the idea of a technical cut-through is not new, I want to highlight it here. Since BigQuery ML uses SQL to train ML models, creating a baseline model with default values is easy and fast. This allowed me to focus on building the system itself first and on making a technical cut-through including recommendation serving, which in turn simplified the further improvement of the model.
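
For illustration, a baseline matrix factorisation model with largely default options can be created and queried like this. Dataset, table and column names are placeholders, not the project's actual schema.

```python
# Sketch: baseline matrix factorisation model in BigQuery ML with default values.
# Dataset, table and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

create_model_sql = """
CREATE OR REPLACE MODEL `recommender.baseline_mf`
OPTIONS (
  model_type = 'matrix_factorization',
  feedback_type = 'implicit',   -- purchases instead of explicit ratings
  user_col = 'customer_id',
  item_col = 'item_id',
  rating_col = 'quantity'
) AS
SELECT customer_id, item_id, quantity
FROM `recommender.training_data`
"""
client.query(create_model_sql).result()

# Predicted ratings for every customer/item pair; the top k per customer
# are selected in a later step.
recommend_sql = "SELECT * FROM ML.RECOMMEND(MODEL `recommender.baseline_mf`)"
recommendations = client.query(recommend_sql).to_dataframe()
```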

Note:
Using an established ML use case made the development of the model easier, which is why I was able to focus on the technical cut-through. However, it might make more sense to focus on the model first in less common use cases, where you cannot rely on a broad knowledge base for the model.

Evaluation of the model in BigQuery ML

One of the biggest challenges for me was to evaluate the created model. BigQuery ML does evaluate the model automatically during its creation, but Google does not provide detailed information about the metric calculation or implementation, making it a black box. For this reason, I implemented the calculation of the mean average precision@k (mAP@k) [6] myself. You can find the SQL statement in this gist [7].

During the project, I realised that the mAP@k exhibited unexpected behaviour in some edge cases.
As soon as the remaining recommendations of a query were all negative (0), they were no longer considered in the precision calculation, as seen in figure 2. For this reason, I designed the mean precision@k (mP@k) and implemented it in the project. Here, every recommendation in a query is included in the precision, regardless of whether it is positive or negative, as seen in figure 3.

Figure 2: Mean average precision@k, common case and edge case

Figure 3: Mean precision@k, common case and edge case

If this explanation was too short, I recommend this more detailed article [8] by Ren Jie Tan. For evaluating the model, I preferred the mP@k because I find it more understandable and thus closer to reality. The SQL statements for calculating the metrics can be found in this gist [7].
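
To make the edge case tangible, here is a small Python illustration of both metrics for a single customer. It is not the project's SQL (see the gist [7] for that), and the mP@k variant shown reflects my reading of "every recommendation counts", i.e. the running precision averaged over all k ranks.

```python
# Illustration of the edge case: one relevant recommendation followed by misses.
# Simplified Python version, not the SQL used in the project.

def average_precision_at_k(relevances, k, num_relevant):
    """AP@k as described in [8]: the running precision is only sampled at relevant ranks."""
    hits, score = 0, 0.0
    for i, rel in enumerate(relevances[:k], start=1):
        if rel:
            hits += 1
            score += hits / i
    return score / min(num_relevant, k) if num_relevant else 0.0

def precision_mean_at_k(relevances, k):
    """mP@k as I read it: the running precision averaged over all k ranks."""
    hits, score = 0, 0.0
    for i, rel in enumerate(relevances[:k], start=1):
        hits += rel
        score += hits / i
    return score / k

relevances = [1, 0, 0, 0, 0]  # a hit at rank 1, then only misses
print(average_precision_at_k(relevances, k=5, num_relevant=1))  # 1.0: the trailing misses vanish
print(precision_mean_at_k(relevances, k=5))                     # ~0.46: every rank counts
```

Averaging these per-customer values over all queries then yields the mAP@k and mP@k shown in figures 2 and 3.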

Master the slots

As much as the cloud can save costs, it can also cause them if you are not careful. Particular caution is required when booking computing resources in BigQuery, so-called slots [9]. Slots must be used for training a matrix factorisation model; other algorithms can also run on on-demand resources.
The minimum slot amount (100) costs 4.80 €/hour in the Netherlands. A model training for our data usually took only 20 to 30 minutes, so the costs were manageable.
However, booking (and managing) slots is only possible via the web interface, SQL or the CLI of the Cloud SDK. Although I automated this step in the pipeline using SQL, I wanted to make sure that the monetary damage would be limited in case of an error.

Therefore, I used Cloud Scheduler and a Cloud Function to regularly drop all slots at midnight. Fortunately, dropping the slots is also possible with the Python client library. The Cloud Function code and the CLI commands for the scheduler can be found in this gist [10]. In retrospect, I noticed that one could alternatively have used a scheduled query, but this might complicate the integration into a monitoring system.
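
As a hedged sketch of how such a Cloud Function could look with the reservation client library (the project's actual code is in the gist [10]; project ID and location are placeholders):

```python
# Sketch of an HTTP-triggered Cloud Function that drops all BigQuery slots.
# Called by Cloud Scheduler at midnight. Project and location are placeholders.
from google.cloud import bigquery_reservation_v1

PARENT = "projects/my-gcp-project/locations/europe-west4"

def drop_slots(request):
    client = bigquery_reservation_v1.ReservationServiceClient()

    # Assignments have to be removed before the reservations they point to.
    for assignment in client.list_assignments(parent=f"{PARENT}/reservations/-"):
        client.delete_assignment(name=assignment.name)

    for reservation in client.list_reservations(parent=PARENT):
        client.delete_reservation(name=reservation.name)

    # Finally release the (flex) capacity commitments, which are what is billed.
    for commitment in client.list_capacity_commitments(parent=PARENT):
        client.delete_capacity_commitment(name=commitment.name)

    return "All slots dropped."
```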

Tracking metadata with MLflow

A month into the project, I lost track of the model versions and their performance. Therefore, I included MLflow tracking [11] in the pipeline and used its web UI to compare the models.
In doing so, the following things proved useful:

  • logging the SQL queries with `log_text()` for debugging
  • saving the pipeline YAML configuration as a file artefact with `log_artifact()`
  • using dataset IDs as model version indicators and table prefixes for different runs
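
Put together, the tracking part of the pipeline looks roughly like this. The tracking URI, run name, parameter values and file paths are placeholders.

```python
# Sketch of the MLflow tracking calls used in the pipeline; all values are placeholders.
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # address of the MLflow UI container

create_model_sql = "CREATE OR REPLACE MODEL `recommender.run_v3_mf` OPTIONS (...) AS SELECT ..."

with mlflow.start_run(run_name="baseline_v3"):
    # dataset ID as model version indicator, table prefix to separate runs
    mlflow.log_param("dataset_id", "deliveries_2021_09_15")
    mlflow.log_param("table_prefix", "run_v3_")

    # pipeline YAML configuration as a file artefact
    mlflow.log_artifact("pipeline_config.yaml")

    # executed SQL queries for debugging
    mlflow.log_text(create_model_sql, artifact_file="queries/create_model.sql")

    # evaluation metrics become comparable across runs in the web UI
    mlflow.log_metric("mp_at_k", 0.0)  # placeholder value
```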

Figure 4: Example for comparing two runs in MLflow

I should mention that BigQuery ML is intended for the fast testing of model ideas by data analysts. Its use as an ML system is not intended, so features like metadata tracking or a proper client API are (currently) missing.
For ML systems, Google intends the use of Vertex AI, but it has a steep learning curve, especially for a beginner with a small project. AutoML systems, on the other hand, were very limiting in their configuration.
Through its declarative work with SQL, BigQuery ML offered just the right amount of simplification and freedom to create the prototype according to my needs. A proper API is unfortunately missing, but the use of SQL queries had its own advantages: thanks to the declarative approach, there were hardly any bugs in the data transformation part.

Project Results

After eight weeks of development, the prototype fulfilled all basic requirements:

  • Creation and serving of a new model with one command
  • Relevant recommendations
  • Tracking of metadata

In addition, the model did not mainly recommend previously bought items. Instead, it also recommended new items, which the stakeholders evaluated as reasonable.

Figure 5: Model quality by different metrics in the course of the project

Figure 5 shows the different model quality metrics over the course of the project, as well as their varying results. For more information on the discrepancy between offline evaluation and user validation, I recommend this paper [12] by Dietmar Jannach.
The collection of the Ambassador User Validation, which was the key metric for the project, was facilitated by the completed ML system. 

Conclusion

BigQuery ML is not intended to be an ML system; this becomes apparent when you want to use it from a script. But it reduces a lot of complexity to simple SQL statements without taking away too much freedom in the model configuration. This makes creating a minimum viable product or prototype fast and easy.

The mentioned challenges in creating an ML system, like slot management or model evaluation with SQL calculations, are a hindrance. But for me, especially in smaller projects, they are outweighed by the ease of use combined with the freedom it offers.

In the end, BigQuery did the job. It enabled me to build a machine learning system prototype with a continuous training pipeline, metadata tracking and prediction serving within eight weeks.

Acknowledgements

A huge thanks goes to Holger Sickmann and Fabrice Lepach at Mosecker GmbH for making this project possible by providing the data and their domain knowledge.
I also want to thank my supervisors, Michael Bücker at the FH Münster and Niklas Haas from codecentric, for keeping me on track and helping me with all my questions.
