Data collection is driven by an SDK that we first install in the client application. A trigger fires the SDK, and the resulting event is added to our database. Storing event data is quite simple: we first filter out invalid events, and then store the remaining relevant information in raw data tables.
What is this raw data? It’s a collection of properties from a specific point in time for a relevant activity.
How can we enrich the data? How can we add more information to our metadata that can be applied immediately to this event?
The answer is “smearing.” We can use previously collected information on a certain user and integrate it with the raw data.
When a specific user sends an event to our system, we can identify that user by a unique identifier. For example, we can use a cookie to get the identifier of a registered user. The system can then aggregate past information about the user and add it to the raw data.
For example, let’s say a user added an email address to a website a few weeks ago. We store that information in our “user profile” database so that the next time we receive an event from this user without an email property, we can “fill the gap” and add the “email” property to the raw event data.
Since we cannot retroactively update the event data in our database, there is only a single point in time at which we can do this: before writing the event to our database.
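The gap-filling idea can be sketched in a few lines of Python. This is a minimal illustration, not the actual implementation: the in-memory `user_profiles` dictionary stands in for the “user profile” database, and the `PROFILE_PROPERTIES` list and the `smear_event` function are hypothetical names chosen for the example.

```python
from typing import Any, Dict

Event = Dict[str, Any]

# Hypothetical stand-in for the "user profile" database (assumption).
user_profiles: Dict[str, Dict[str, Any]] = {}

# Assumed set of properties that are remembered per user.
PROFILE_PROPERTIES = ("email", "first_name", "last_name", "phone")


def smear_event(event: Event) -> Event:
    """Return a copy of the event with missing profile properties filled in."""
    profile = user_profiles.setdefault(event["user_id"], {})
    enriched = dict(event)
    for prop in PROFILE_PROPERTIES:
        if event.get(prop) is not None:
            # The event carries the property: remember it for later events.
            profile[prop] = event[prop]
        elif prop in profile:
            # The event lacks the property: fill the gap from the profile.
            enriched[prop] = profile[prop]
    return enriched


# The user supplies an email once; a later "light" event inherits it.
first = smear_event({"user_id": "u1", "event": "signup", "email": "a@example.com"})
later = smear_event({"user_id": "u1", "event": "page_view"})
print(later["email"])
```

Because the function returns an enriched copy, each stored event keeps the value that was known when it arrived, even if the profile changes afterwards.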
While processing an event, during the validation phase, we must include an additional step: the “smearing” process.
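To make the ordering concrete, here is a rough sketch of such a pipeline: validate, smear, then write exactly once. The required fields, the `process_event` function, the list used as a raw table, and the `add_known_email` helper are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]

# Assumed minimal schema; a real system would define its own validation rules.
REQUIRED_FIELDS = ("user_id", "event", "timestamp")


def is_valid(event: Event) -> bool:
    """An event is valid here if every required field is present and not None."""
    return all(event.get(field) is not None for field in REQUIRED_FIELDS)


def process_event(event: Event,
                  smear: Callable[[Event], Event],
                  raw_table: List[Event]) -> bool:
    """Validate, smear, then write. Returns False if the event was filtered out."""
    if not is_valid(event):
        return False  # invalid events never reach the raw table
    # Smearing happens before the write, because stored events are not updated later.
    raw_table.append(smear(event))
    return True


# Toy smearing step (hypothetical): fill in an email we already know.
known_emails = {"u1": "a@example.com"}


def add_known_email(event: Event) -> Event:
    enriched = dict(event)
    email = known_emails.get(event["user_id"])
    if "email" not in enriched and email is not None:
        enriched["email"] = email
    return enriched


raw_events: List[Event] = []
stored = process_event({"user_id": "u1", "event": "page_view", "timestamp": 1},
                       add_known_email, raw_events)
rejected = process_event({"event": "page_view"}, add_known_email, raw_events)
```

Passing the smearing step in as a function keeps the pipeline independent of which properties are smeared.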
What kind of data must we add in the smearing process in order to avoid complex “Joins” later in the system? We can add things like:
- All user properties that we have collected up to this moment, such as email, last name, first name, phone number, credit history, A/B testing group, and additional identities.
- Calculated properties about this user. For example, “user_creation_date” should contain the date we first received an event from this user in the system, and “is_paying_flag” will be TRUE if we have received at least one purchase event in the past and FALSE otherwise.
- Metadata about the user history, such as the serial session number of the current event or the number of devices that this user is using.
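The calculated properties and history metadata listed above could be maintained with a small per-user state object. The sketch below is one possible approach under stated assumptions: the `UserState` class, the `session_number` and `device_count` output names, the ISO 8601 timestamp format, and the event name `"purchase"` are all hypothetical. It also assumes that “is_paying_flag” reflects purchases received strictly before the current event.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Optional, Set

Event = Dict[str, Any]


@dataclass
class UserState:
    """Hypothetical running state kept per user between events."""
    creation_date: Optional[str] = None
    has_purchased: bool = False
    session_numbers: Dict[str, int] = field(default_factory=dict)
    devices: Set[str] = field(default_factory=set)


def add_calculated_properties(event: Event, state: UserState) -> Event:
    """Return a copy of the event with calculated properties attached."""
    if state.creation_date is None:
        # The first event ever seen for this user defines the creation date.
        state.creation_date = datetime.fromisoformat(event["timestamp"]).date().isoformat()
    if event["session_id"] not in state.session_numbers:
        # Sessions are numbered serially in order of first appearance.
        state.session_numbers[event["session_id"]] = len(state.session_numbers) + 1
    state.devices.add(event["device"])

    enriched = dict(event)
    enriched["user_creation_date"] = state.creation_date
    enriched["is_paying_flag"] = state.has_purchased  # purchases before this event
    enriched["session_number"] = state.session_numbers[event["session_id"]]
    enriched["device_count"] = len(state.devices)

    if event["event"] == "purchase":
        state.has_purchased = True  # visible to the user's next event
    return enriched


state = UserState()
e1 = add_calculated_properties(
    {"timestamp": "2024-01-05T10:00:00", "session_id": "s1",
     "device": "phone", "event": "purchase"}, state)
e2 = add_calculated_properties(
    {"timestamp": "2024-02-01T09:30:00", "session_id": "s2",
     "device": "laptop", "event": "page_view"}, state)
```

In this example, `e2` carries the creation date of the first event, a TRUE paying flag, session number 2, and a device count of 2, so a report can read these values directly from the event row.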
The main benefits of using smearing are:
- You can use simpler queries to generate reports and avoid complex joins with the “user profile” table.
- At each point in time, the event holds the information about a particular user that was correct and accurate at the specific time the event was sent. A user can update these properties later, but we captured the “frozen” property value that was relevant at the moment the event occurred.
- Reduced traffic: we can send “light” events with partial properties, and then rely on the system’s ability to complete the missing information.
| timestamp | session id | user id | event | is_paying | creation date | device |
Figure 1: User properties smearing process
What about “short-term” history smearing? We can collect and smear the common information for shorter units of time, such as a specific visit.
We can define a different scope of smearing as well, not only in the user lifetime scope, but also in the visit/session scope.
The requirement for this kind of smearing is derived from the properties that are relevant only for a specific visit.
For example, the IP address or device properties (DUA) may differ between sessions of the same user, so we cannot inherit their values from a previous visit.
The solution is to logically break down the event properties into a few levels/scopes, such as:
- User Lifetime Scope
- Visit Scope
- Additional logical scopes as defined in each application
During the event upload process, we should identify the missing pieces of information and fill the gaps with the information that we already stored in the past.
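Scoped smearing can be sketched by keeping one store per scope and looking up each property in the store that matches its scope. This is only an illustration: the `PROPERTY_SCOPES` mapping, the property names, and the two in-memory caches are assumptions, and each application would define its own scopes.

```python
from typing import Any, Dict, Tuple

Event = Dict[str, Any]

# Assumed mapping of property -> scope (hypothetical names).
PROPERTY_SCOPES = {
    "email": "user",
    "ab_group": "user",
    "ip_address": "visit",
    "user_agent": "visit",
}

# In-memory stand-ins for the stored history (assumption).
user_cache: Dict[str, Dict[str, Any]] = {}
visit_cache: Dict[Tuple[str, str], Dict[str, Any]] = {}


def smear_by_scope(event: Event) -> Event:
    """Fill missing properties, each from the store that matches its scope."""
    stores = {
        # User lifetime scope: keyed by user only.
        "user": user_cache.setdefault(event["user_id"], {}),
        # Visit scope: keyed by user and session, so values never leak across visits.
        "visit": visit_cache.setdefault((event["user_id"], event["session_id"]), {}),
    }
    enriched = dict(event)
    for prop, scope in PROPERTY_SCOPES.items():
        store = stores[scope]
        if event.get(prop) is not None:
            store[prop] = event[prop]      # remember the value within its scope
        elif prop in store:
            enriched[prop] = store[prop]   # fill the gap from the same scope
    return enriched


smear_by_scope({"user_id": "u1", "session_id": "s1",
                "email": "a@example.com", "ip_address": "10.0.0.1"})
same_visit = smear_by_scope({"user_id": "u1", "session_id": "s1"})
new_visit = smear_by_scope({"user_id": "u1", "session_id": "s2"})
```

Here `same_visit` inherits both the email and the IP address, while `new_visit` inherits only the email, because the IP address is scoped to the earlier visit.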