
(Consistent) App Analytics with Snowplow

Written by Dimitar Nedev · Jan 27, 2020 · 7 min read

To build the best app for grocery shopping, we determined that a robust analytics platform is critical for collecting the data needed for decision support. Let me give you an example: how does one find out which features in the app are useful to our customers and which are clutter? With the help of a consistent analytics platform, of course.





In the interest of keeping this article readable, we are skipping a few steps, like how to set up Snowplow; we will come back to that in another blog post. Instead, we will focus on how to implement tracking with the goal of fast, easy, and consistent analysis. Why is that important? Well, if our engineers cannot quickly analyze the data, or the results are inconclusive, all the tracking data in the world will not help. In other words, we are going to be doing “smart data” and not “BIG data.”









Consistent event schemas





The first step in delivering useful event schemas is the naming. There are three simple rules:





  • Give your events and pages/screens logical names. Users should not have to wonder what’s what.
  • Apply modularity to names. For example, the main registration screen is just registration, a specific one for business customers is registration_b2b, and one requiring postal-code data is registration_postcode. You can also combine them, like this: registration_b2b_postcode.
  • Last but not least, document your decisions. You will eventually forget what’s what, why, and which release introduced it. Reliable documentation goes a long way in assuring consistency.




We also made sure our events are consistently named. Because events are actions, we describe them like short sentences, verb first. For example, “A user added a product to the shopping cart” becomes add_product, instead of product_add.
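
To illustrate what this buys us at analysis time, here is a minimal sketch of the kind of query these conventions enable. The table and column names (analytics.events, event_name, screen_name) are simplified placeholders, not our actual schema:

    -- Count screen views per registration-screen variant.
    -- Table and column names are illustrative placeholders.
    SELECT
        screen_name,                        -- registration, registration_b2b, registration_b2b_postcode, ...
        COUNT(*) AS screen_views
    FROM analytics.events
    WHERE event_name = 'view_screen'        -- verb-first: "a user viewed a screen"
      AND screen_name LIKE 'registration%'  -- modular names group naturally under a prefix
    GROUP BY screen_name
    ORDER BY screen_views DESC;

Because both the event names and the screen names follow the same conventions, the query needs no lookup tables or string gymnastics.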





Fewer time-series with standalone events (and contexts)





Time-series analyses are useful, but the queries are hard to write and slow to run. For that reason, we designed our analytics events to be meaningful even when standalone. The idea is that, given enough metadata, each event supplies us with sufficient information for analysis. Here’s an example: an add_product event for a promotional product on a search results page would otherwise require a complex time-series query to find all the relevant data: the search query, the typed text, the suggestions given, the listed products, and so on. This would force us to extract data from half a dozen events only to rebuild all the information needed for a single add_product analysis.
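
Purely for illustration, this is roughly what such a reconstruction looks like. The event names are our own; domain_sessionid, derived_tstamp, and event_name are standard columns of Snowplow’s atomic events table:

    -- Stitch each add_product event to the most recent search event
    -- in the same session (a sketch of the approach we try to avoid).
    WITH ordered_events AS (
        SELECT
            domain_sessionid,
            derived_tstamp,
            event_name,
            MAX(CASE WHEN event_name = 'search' THEN derived_tstamp END)
                OVER (PARTITION BY domain_sessionid
                      ORDER BY derived_tstamp
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS last_search_tstamp
        FROM atomic.events
    )
    SELECT domain_sessionid, derived_tstamp, last_search_tstamp
    FROM ordered_events
    WHERE event_name = 'add_product';
    -- ...and the result still has to be joined back to the search, screen
    -- and product events to recover the query text, suggestions and listing.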





The alternative is to use Snowplow’s rich metadata contexts. We create a context for each specialty metadata domain we mean to track. In the above example, we need:





  • A search context is attached when we need to track the search query, the typed text, and the search suggestions.
  • A screen context contains the name of the page we are on, the previous page, and a listing of the products displayed.
  • A promotion context tracks which articles are on promotion and under which promotion.




This way, using the relevant metadata, an analyst can quickly find out what the search query was and which other products were on the same results page, just by looking at the add_product event for a search page. As a result, a single event becomes multi-purpose: if one is only interested in the search results, one only needs the search context. If a comparative analysis is our goal, then we can use the products listing from the screen context. Similarly, if we want to analyze the impact of promotional products on a search results page for a specific search term, we can combine all three contexts.
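
To make this concrete, here is a sketch of such a lookup as it looks in our data warehouse (more on that choice below), where each context is stored as a column of JSON on a single events table. The column and field names (contexts_com_picnic_search_1 and friends) are illustrative, not our real schema:

    -- Everything needed to analyse add_product on a search page, in one query.
    -- Context column and field names are illustrative.
    SELECT
        e.derived_tstamp,
        e.contexts_com_picnic_search_1[0]:search_query::STRING    AS search_query,
        e.contexts_com_picnic_screen_1[0]:screen_name::STRING     AS screen_name,
        e.contexts_com_picnic_promotion_1[0]:promotion_id::STRING AS promotion_id
    FROM atomic.events e
    WHERE e.event_name = 'add_product'
      AND e.contexts_com_picnic_search_1 IS NOT NULL;  -- only add_product events on a search page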





Figure: Time-series vs. rich-context events




Using this methodology, we can choose what data we need and obtain it from the relevant context. Each context remains small and focused on a single domain. Also, since the contexts are relatively generic, we can use them in more than one place. For example, the promotion context can apply to any type of page, not only search results, including a dedicated promotional-products category.





In addition, contexts allow you to filter data more easily. If you are only looking for search-related events, just query for the rows where the search context is set, as in the IS NOT NULL filter above. Over very large volumes of data, that is a further performance improvement.





Snowplow-ing with Snowflake





Now that we have so much data, enriched with metadata, we need to make sure we have the right tool for the job. As you can imagine, the volume we collect is significant, the usage varies, and the system has to scale with both increasing volume and query complexity.





Enter Snowflake, a cloud-based columnar database which (in the author’s opinion) is ideally suited for one’s Snowplow analytics needs. Snowflake allows us to:





  • Independently scale storage and compute resources. Storage comes at a very affordable price, so we no longer have to worry about where our data will be stored and how it will be accessed: it all goes into Snowflake, and we can query it with SQL. Applied to the rich-metadata model above, this means storing more contextual data is easier, and cheap.
  • Snowplow data stored in Snowflake uses a single database table with rich JSON data for each event/context. This model allows us to access all the relevant data without using JOINs, providing excellent performance and keeping the stored data in a format close to the tracked one. In case you are wondering how JSON data access can perform so well, Snowflake “cheats”: the system seems to decompose JSON fields into columns and virtually stitch them back together. This way, you get the flexibility of a semi-structured data format and the performance of a columnar database.
  • Snowflake compute resources can scale instantaneously. A single click (or a one-line SQL statement, sketched below) can double the available processing instances. This way, we can also eliminate the need for an external data processing cluster (e.g. Spark), even for our “BIG data.” Moreover, we can use a consistent data processing language: SQL. All our data engineers and data scientists know SQL, and all Picnic analysts receive (mandatory) SQL training. So as a bonus, we no longer need Spark developers “programming” our data analysis pipelines.
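
For illustration, resizing compute really is a one-line statement; the warehouse name here is a placeholder:

    -- Double the compute for a heavy analysis run, then scale back down.
    ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';
    -- ... run the expensive queries ...
    ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'MEDIUM';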




To make analysis easier, we usually create views (as in VIEW in a SQL/relational database) on top of the Snowplow data, enriching it with useful information, for example, identifying the order the event was part of, or getting more information about the product added to the basket. The views allow us to enrich the data without creating more copies of an already large data source. Such views are also very versatile, since they can be reused in more than one type of analysis, as long as we need the same data enrichments.
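
A minimal sketch of such a view, with purely illustrative table, column, and context names:

    -- Enrich add_product events with the order the session resulted in,
    -- without copying the raw events. All names below are illustrative.
    CREATE OR REPLACE VIEW analytics.add_product_enriched AS
    SELECT
        e.event_id,
        e.derived_tstamp,
        e.domain_sessionid,
        e.unstruct_event_com_picnic_add_product_1:product_id::STRING AS product_id,
        o.order_id
    FROM atomic.events e
    LEFT JOIN analytics.orders o
           ON o.session_id = e.domain_sessionid   -- hypothetical session-to-order mapping
    WHERE e.event_name = 'add_product';

Analysts then query the view as if the enrichment were part of the event, while the underlying events table is stored only once.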





Parting thoughts





To get reliable insights out of analytics data, you should focus on consistency, not volume. Our team’s approach was to ensure consistent tracking and replicable results. The methodology is:





  • Have consistent event and entity names. I cannot stress enough that if you cannot make sense of your events, neither will an automated system.
  • Build standalone events using the Snowplow rich-metadata contexts, avoiding time-series as much as possible.
  • Use an analytics system that can handle your use cases and data volume, which in our case is Snowflake. Make sure you read up on the features your analytical database supports, as they can help you get results faster (or more easily).
  • And a bonus one: avoid “BIG data” as much as possible. Focus on doing smart things with your systems and data; that can help you avoid scaling up Spark clusters just to see how many customers saw your new feature yesterday.
