By doing a cross join, we get a list of all possible customer/product combinations. We then calculate a recommender score for each pair in order to get the relevant recommendations. In this post, we use a dummy random score, since building an actual recommender is out of scope here.
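As a rough sketch, the cross join with a dummy score could look like the following. The DataFrames and column names (`customers`, `products`, `customer_id`, `product_id`) are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins; in practice these would be the real customer and product tables.
customers = spark.createDataFrame([(1,), (2,), (3,)], ["customer_id"])
products = spark.createDataFrame([("a",), ("b",)], ["product_id"])

# Every customer paired with every product, with a random placeholder score
# standing in for a real recommender score.
recommendations = customers.crossJoin(products).withColumn("score", rand(seed=42))
recommendations.show()
```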
To understand how to do a cross join efficiently, we need to talk briefly about partitions in Spark. In general, you can assume the following defaults: 1 partition = 1 (executor) core = 1 task
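For example, assuming an active SparkSession called `spark`, you can inspect the default parallelism, which corresponds to the total number of executor cores available:

```python
# Default number of partitions/tasks Spark will use for parallel operations;
# this reflects the total number of executor cores in the cluster.
print(spark.sparkContext.defaultParallelism)  # e.g. 12 on the cluster described below
```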
Of course, these defaults can be adjusted. For example, in the current 2 x `n1-standard-8` cluster setting, Spark allocated the resources as follows (a configuration sketch follows the list):
- 2 workers with 8 cores each, which yield
- 3 executors with 4 cores each, i.e. 12 cores in total (so 4 of the 16 cores are not used for data crunching), which in turn yield
- 12 partitions that can be processed in parallel (1 partition per core)
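For illustration only, the settings below would request such an allocation explicitly. These values are an assumption for the sketch, not taken from the post's setup; a managed cluster normally derives them automatically from the machine types:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.instances", 3)  # 3 executors ...
    .config("spark.executor.cores", 4)      # ... with 4 cores each = 12 parallel tasks
    .getOrCreate()
)
```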
You can think of executors as Java Virtual Machines (JVMs) running on the workers. Master nodes don’t do the number crunching; they are occupied with scheduling and coordination.
If your dataset sits in 1 partition only, only 1 core can read from it and the rest will be idle. No parallelization happens, which is bad for performance. To circumvent that, Spark sometimes performs an internal repartition() during shuffles, which creates 200 partitions by default (the `spark.sql.shuffle.partitions` setting). Thus 200 partitions is a sensible default we can use for repartitioning our small dataset explicitly. Let’s check the current partition count.
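A minimal sketch, assuming the cross-joined DataFrame from the earlier example is called `recommendations`:

```python
# How many partitions does the DataFrame currently have?
print(recommendations.rdd.getNumPartitions())

# Repartition explicitly to 200 partitions so all cores get work.
recommendations = recommendations.repartition(200)
print(recommendations.rdd.getNumPartitions())  # 200
```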