Thoughts after completing the Coursera “Data Engineering, Big Data, and Machine Learning on GCP Specialization”


Having worked with Google Cloud Platform’s Big Data services for almost a year, I wanted to get a broader view of GCP’s capabilities. In this post, I will give you an overview of the services touched by the Coursera specialization. I was a GCP fan already, and now I am even more convinced.
In the hands-on course (version: September 2019), which is quite up to date regarding service maturity, though not regarding service names, I gained a deeper understanding of:

  • PubSub: Fully managed message queue for buffering messages
  • Dataflow: Serverless (i.e. autoscaling) Apache Beam service, can be used to process both batch and streaming data
  • Dataproc: Fully managed Hadoop cluster with Spark on YARN. Special focus was put on the separation of storage and compute, which also allows you to use cheap preemptible instances
  • BigQuery: Serverless data storage and analytics service, feels like an SQL database, with latency in seconds
  • Bigtable: A data sink for terabytes of data with millisecond latency, where you need to put some thoughts into schema design (especially access keys) and cluster configuration (it’s not serverless). I guess I will use it for IoT applications in the future!
  • ML Engine (new name: AI Platform): Serverless training of TensorFlow models with seamless deployment to an API
  • Vision API: Query pre-trained models for already solved tasks, for example image classification (its siblings, such as the Natural Language API, cover tasks like sentiment analysis)
  • AutoML: Bring your own data to pre-trained models in the cloud. Then trigger transfer learning and finetuning with automated hyperparameter optimization. The GUI (which comes with AutoML Vision, I don’t know about other domains like language) makes the model easily accessible to your customer’s end users.
  • Datalab: Their notebook service which is built on the Jupyter ecosystem, but integrates nicely with Google’s infrastructure (CPU, GPU, TPU)
  • Data Studio: External dashboarding service which can be connected to BigQuery very easily
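The Bigtable point about schema design deserves a concrete illustration. Because Bigtable sorts rows lexicographically by row key, the key layout determines which scans are cheap. A common IoT pattern packs the device ID and a reverse timestamp into the key so that a device’s newest readings sort first. A minimal sketch in plain Python (the field layout and names are my own assumption, not from the course):

```python
# Bigtable rows are sorted lexicographically by row key, so the key layout
# decides which scans are cheap. A common IoT pattern: <device-id>#<reverse-ts>
# keeps one device's rows contiguous and puts its newest reading first.
MAX_TS = 10_000_000_000  # seconds; any bound larger than every real timestamp

def row_key(device_id: str, ts_seconds: int) -> str:
    # Zero-pad so lexicographic order matches numeric order.
    reverse_ts = MAX_TS - ts_seconds
    return f"{device_id}#{reverse_ts:011d}"

# Newer readings produce lexicographically smaller keys:
k_old = row_key("sensor-42", 1_600_000_000)
k_new = row_key("sensor-42", 1_600_000_060)
assert k_new < k_old
```

A prefix scan on `sensor-42#` then returns that device’s readings newest-first, without touching other devices’ rows.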

I especially enjoyed the hands-on labs where

  • we used AutoML to finetune a pretrained model to detect types of clouds in the sky (pun intended, I guess)
  • we set up a pipeline to monitor (simulated) traffic in San Diego to get the average lane speed and congestion information. I used PubSub to buffer messages, processed them with Dataflow (computing the average speed), wrote them to BigQuery and built a dashboard with Data Studio that refreshed regularly.
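The core aggregation of that second lab, averaging speeds per lane over a fixed time window, can be sketched in plain Python. (Dataflow/Beam would express this with windowing and a grouped combine; the event field names here are my own assumption.)

```python
from collections import defaultdict

def average_lane_speeds(events, window_seconds=60):
    """Group (timestamp, lane, speed) events into fixed windows and return
    the mean speed per (window, lane) -- the same aggregation the Dataflow
    job performs before writing results to BigQuery."""
    sums = defaultdict(lambda: [0.0, 0])  # (window, lane) -> [sum, count]
    for ts, lane, speed in events:
        window = ts - (ts % window_seconds)  # fixed-window assignment
        acc = sums[(window, lane)]
        acc[0] += speed
        acc[1] += 1
    return {key: s / n for key, (s, n) in sums.items()}

events = [
    (0, "lane-1", 60.0), (30, "lane-1", 70.0),  # window starting at 0
    (65, "lane-1", 50.0),                        # window starting at 60
]
print(average_lane_speeds(events))  # {(0, 'lane-1'): 65.0, (60, 'lane-1'): 50.0}
```

In the real pipeline, Dataflow applies this per streaming window and appends one row per (window, lane) to the BigQuery table behind the dashboard.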

Some more words about …

BigQuery

I really enjoy using BigQuery! Because

  • it’s serverless, you simply type SQL queries. I don’t care about node numbers and their uptime.
  • pricing is easy and transparent, it’s divided into storage and query costs (no artificial units involved, no dependency on cluster specs).
  • it handles nested data easily, and offers all the SQL functionalities you already know from your RDBMS.
  • you can create ML models with an SQL query. Of course, the built-in models are quite basic (linear and logistic regression), or you bring your own trained TensorFlow model, but I guess the service will keep evolving, and batch inference is already expressible in SQL via ML.PREDICT.
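The pricing model is simple enough for back-of-the-envelope math. Using the on-demand list prices around the time of writing (roughly $5 per TB scanned and $0.02 per GB-month of active storage; these numbers are my assumption, so check the current price sheet), a quick sketch:

```python
# Rough BigQuery on-demand cost model (list prices as of ~2019, my assumption;
# always check the current pricing page). No cluster specs, no artificial units:
# queries are billed per TB scanned, storage per GB-month.
PRICE_PER_TB_SCANNED = 5.00   # USD
PRICE_PER_GB_MONTH = 0.02     # USD, active storage

def monthly_cost(stored_gb: float, scanned_tb_per_month: float) -> float:
    return stored_gb * PRICE_PER_GB_MONTH + scanned_tb_per_month * PRICE_PER_TB_SCANNED

# Example: 500 GB stored, 2 TB scanned per month:
print(round(monthly_cost(500, 2), 2))  # 20.0
```

This transparency is exactly the point: cost scales with data stored and data scanned, not with how many nodes happen to be running.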

AutoML

In my opinion, Machine Learning (ML) is currently being commoditized, and AutoML is an outcome of this process. I also enjoy using it. You could call it “No-Ops” machine learning: you simply receive the tuned model after automated hyperparameter optimization and assess its quality.

If my next problem at a customer site has been solved already, like sentiment analysis or image classification, but needs tweaking towards the customer’s own data, then this is the way to go for me. Models also need to be retrained continuously; with the offered GUI, this can be done by (almost) non-technical personnel.
Besides that, AutoML models can also serve as a quick baseline to assess whether the business problem is suited for ML at all. AutoML then delivers the benchmark that any upcoming custom pipeline has to beat.

AutoML automates a lot of work for Data {Scientists, Engineers} (i.e. myself). IMHO, Data Scientists will still be in demand, though. But they will spend more time on generating insights, productionizing them, and implementing (semi-)automated decision-making systems. Solutions for common use cases will be implemented more quickly, which will also increase the footprint and visibility of ML in the company. With that increased awareness, new use cases will arrive in the pipeline, and possibly some of them cannot be solved with a template or service (yet), so that I can do hardcore modeling again.

A note on GCP AI services naming scheme

I personally consider Google to be the leader in non-robotic applications of Machine Learning (Google AI, the collaboration with DeepMind, virtually unlimited training data in Google Photos storage, and I guess their reinforcement learning researchers are eagerly anticipating the take-off of the Stadia gaming platform). But when browsing the website, I get really confused about the AI services they offer. Some snippets:

  • Cloud AI
  • ML Engine (which is now being renamed to “AI Platform”)
  • AI Hub
  • Vision AI
  • Vision API
  • AutoML Vision

It is a little hard for me to differentiate between

  • Cloud services
  • Google Cloud consulting services
  • which APIs I can query out of the box, and what problems exactly they solve

Apache Beam (Dataflow)

They are advertising Apache Beam heavily, understandably, as it originated at Google.
I haven’t seen a managed Beam service from any other cloud provider; GCP’s Dataflow is a runner for Apache Beam.

It’s a unified framework for batch and stream processing with nice monitoring in Google Cloud Dataflow. The next time I need to send (streaming) data from A to B (for example, PubSub to BigQuery) and don’t need any JOIN or complex operations, I will definitely consider using it. Alternatives are, for example, Spark and Spark Streaming.
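The “unified” idea can be illustrated even without Beam: you define the transform once and apply it unchanged to a bounded batch and to an unbounded-style stream. A toy sketch in plain Python (this mimics the concept only, not Beam’s actual API):

```python
from typing import Iterable, Iterator

def to_record(line: str) -> dict:
    """The single transform definition, shared by the batch and the streaming
    path -- the core idea behind Beam's unified programming model."""
    sensor, value = line.split(",")
    return {"sensor": sensor, "value": float(value)}

def run(pipeline_input: Iterable[str]) -> Iterator[dict]:
    # Works for a bounded list (batch) and a generator (stream) alike.
    return (to_record(line) for line in pipeline_input)

# Batch: a finite list, e.g. lines read from Cloud Storage.
batch = ["a,1.0", "b,2.5"]
print(list(run(batch)))  # [{'sensor': 'a', 'value': 1.0}, {'sensor': 'b', 'value': 2.5}]

# Stream: an unbounded-style source, e.g. messages arriving from PubSub.
def stream():
    yield "c,3.0"
    yield "d,4.0"

for record in run(stream()):
    print(record["sensor"], record["value"])
```

In Beam, the same `PTransform` graph runs against a bounded or unbounded `PCollection`; Dataflow adds the managed execution and monitoring on top.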

Datalab

Just a feeling, but I personally prefer “plain” Jupyter notebooks, as I am already used to them and know my shortcuts. Of course, I definitely want the pre-configured ML machine out of the box, but I don’t see the need for a special Jupyter notebook look.

Summary

Most of our customers’ business models are not built around ML. But we use ML to add new features to their business model (recommendations), increase productivity (for example by doing anomaly detection with early warnings) or make their established processes more cost-efficient through automation.

Therefore, my main objective in my career at this moment is to productionize ML, i.e. to implement production pipelines of insights that can be used for automated decision making. I am a fan of GCP, their products seem well thought out, and the high-quality Coursera specialization gave a nice overview after one year of GCP application. I especially welcome GCP’s growing AutoML capabilities. Using AutoML lets me focus on model quality and business problem solving, ultimately enhancing my customers’ business models at a reasonable cost.

The next step on my path is now the Google Cloud Professional Data Engineer certification.

Niklas Haas

Niklas is a passionate Data Scientist who uses cloud technologies and Machine Learning to generate insights and discover valuable patterns in Big Data. His work is not finished with the PoC, instead he is keen to deploy the data product into production. As a graduate industrial engineer, he always incorporates the impact on business value in his decision making process.
