If you’re interested enough in this topic, you’re probably savvy enough to know that collecting and analyzing events is your information superhighway to understanding how your users interact with your application. But let’s back up just a second and ask the question that this idea is predicated upon: Why do we collect events in the first place?
Past iterations of business intelligence professionals would have inferred quite a bit from demographic and psychographic information collected about their users at various stages in the buying process. It’s easy to say that people from the United States tend to purchase more than those in other countries and that younger people tend to have a higher proclivity to shopping online.
But demographics and psychographics won’t tell you everything. Studies show that behavioral data about users is strikingly better for predicting things like purchases, upgrades, and churn.
At Cooladata, we live by the dictum of our founder, Tomer Ben Moshe:
“You are defined by what you do, not who you are.”
This philosophy not only binds our company together, but it permeates through our product and guides every design decision we make. Cooladata is an entire system designed to enable swift analysis and understanding of user behavior.
In this universe of user behavior, events are the atoms. When a user takes some kind of measurable action on your website, in your application, or within one of your digital assets, that action is recorded as an event.
Online data coming from mobile web or IOT is based on events where each event has its own properties, for example:
Event data modeling is the process of using business logic to store event-level data to produce data that is “analyzable,” meaning simpler to query and understand user behavior.
An event data model would usually contain 3 main hierarchical levels:
User entity will contain specific attributes about the user, for example:
- Zip code
- Mobile number
- and other attributes
Session entity will contain specific attributes about the session, for example:
- Referral source, etc.
Event entity will contain specific attributes about each event type, for example:
- Login event : User ID, landing page, referral source
- Add Item event : Product, Price, Color, Size, etc.
- Checkout event : Total price, Discount, Shipping Option, etc.
Traditional BI data models such as star schema or snowflake contain fact tables and dimensions with a fixed structure.
This is efficient and easy to use for analyzing data, but it only works on data which is structured and the schema doesn’t change much.
An event-based data model is based on structured and semi-structured data, for example:
- URL: contains many attribute in one string (page, title, campaign, etc.)
- DUA: device user agent contains many attributes about session, such as device, browser, OS, location, etc.
- Event type: each event type has a different schema since it is based on totally different attributes making it almost impossible to use traditional BI data models
The event data structure also tends to change very frequently. This means you would have to add fields and tables on an almost daily basis. The data structures tend to change quickly due to the rapidly evolving nature of businesses today (versus 20 years ago).
With a new product feature, new tool, or new release you’ll find yourself needing an event-based data model that is flexible, scalable, and can easily incorporate new events. Rigid models such as snowflake or star schema simply aren’t good at adapting to the rapidly changing nature of 21st-century businesses.
There are several ways of building event-based data models. Below, we’ve listed 3 of the most popular, as well as our own version. Before we get started, let’s look at some of the most important considerations for an event-based data model.
Considerations for an event-based data model:
- Ease of use – How easy it is to analyze data using SQL
- Performance – Minimize joins and full scan
- Pricing – You would want to use less expensive Redshift clusters or reduce BigQuery costs
- Easy to populate – How easy it is to load data into the model? Full load, incremental load, and updating data
Now let’s take a closer look at some of your options.
Option 1 – A table for each event type + a common attribute table
- A fact table for each event type
- Each event-type table contains specific fields for the event attribute
- “All Events” fact table for all event types with fields for common attributes
- Good performance for specific behavioral queries.
For example: Funnel analysis will create a join between the event-type tables which is much more efficient than creating join on the all the events using subquery for each event type.
- Optimized for analyzing specific event type.
For example: Analyzing “install” event type on installs table will be much more efficient than doing so on the entire event table. (i.e. Analyzing “Signup event will access signup table only.)
- You can still analyze all events with common attributes using “All events” fact table.
- Difficult to manage, especially in dynamic environment. Whenever you add an event type, you need to add a table reflecting on ETL and metadata and monitoring.
- Each event type requires a new table. Even if this event type is very seldomly used, you will need to maintain a table for each event type. When you have hundreds of event types this becomes unmanageable.
- By having an “all_event” table and an “event_type” table, you are replicating data. This requires a lot more work and can easily lose sync resulting in data quality issues.
Option 2 – A common event table + JSON field for custom attributes
- A single event table
- The table contains the common attributes as fields
- Custom attributes are stored in a JSON field
- Easier to manage than a table for each event. New attributes are just added to the JSON field and there’s no need to manage many event_type tables.
- Easy to analyze common event attributes.
For example: Event time stamp, user_id, session_id, etc.
- You can store user and session data on the events table. (This is called smearing and helps you manage slowly changing dimensions.)
- Difficult to analyze the custom attributes. You need to extract the custom attribute from the JSON field for each query resulting in slow performance.
For example: In Google BigQuery, selecting a small amount of data from a larger JSON object is more costly due to the BigQuery pricing being based on scan size (bytes processed).
- Difficult to map metadata and use BI tools for the custom attributes, since data can be in a nested JSON string.
Option 3 – A nested table data model
- A single fact table for all events
- Common attributes in the same hierarchy
- Custom attributes in the nested hierarchy
- You can create hierarchies for user and session data
- Elegant and easy to understand the way data is stored.
- Flexible in attributes for each hierarchy.
- Good performance for certain queries such as count distinct.
- Certain queries such as count distinct will scan much less data.
- All data is contained within a single table, reducing the requirement for joins between tables.
- Nested data become difficult to query easily.
- Difficult to load data into the tables, especially incremental data.
- Difficult to map metadata and use BI tools over a nested data model. Columns containing nested data require reasonably complex defined schemas to account for possible self-describing JSON events that may occur in these columns.
- Almost Impossible to update data.
The Cooladata Option:
- A single event table
- There is an option to store specific event types into different tables.
For example: Popular event types, such as “page view,” can be stored in a separate table and the remaining event types can be stored in a single table
- The table contains the common & custom attributes as fields
- The table has user & session data for each event row (Smearing)
- Easy to understand the way data is stored.
- By storing popular event types in separate tables you get the optimal cost performance.
- All loading of the data is handled by Cooladata.
- You are able to analyze the data very easily.
- You have easy access to historical changes (slowly changing dimension).
- For optimal performance you need to use CQL (Cooladata SQL).
- Data management is via Cooladata.
Additional Cooladata features
We partition data by time and property, such as event type, geography, etc. to get maximum performance and optimize data access.
We store the data on a sparse columnar analytical database, getting both flexibility and performance.
We automatically create the semantic layer and metadata data from online data, as well as traditional data.
Optimized Data access layer (Parser)
Our intelligent parser translates simple SQL statements to be able to access the relevant partitions.
Behavioral time series extensions (CQL)
We extend SQL, which is set-based with more functions, enabling time series analysis over granular data.
Our unique technology provides these features:
- Combines transactional & online-based data to one coherent flow.
- Flexibility of a document-based schema in the data model.
- Comprehensive dimensional and time series analyses in one schema.
- Ability to use common visualization tools on top of data model.
- Ad hoc unification online and traditional data.
- Optimize data cost by smart partitioning.
- Full track of historical data changes (slowly changing dimensions).