However, this data is constantly evolving, and properties that are considered important today might give way to newer, more relevant ones.
This constant evolution is at odds with the venerable Relational Database Management System, which is tightly coupled to a schema.
NoSQL systems can provide flexibility when defining fields and types, but are constrained in their ability to calculate aggregations and provide business insights.
To achieve the best of both worlds, providing state-of-the-art analytic abilities while giving customers the flexibility to evolve with the times, CoolaData created "Seamless Events".

So what does this mean, and how is it accomplished? When a customer sends an event with a new property (a "Seamless Event"), our metadata system identifies it. If the property was not given a scope (such as session or user scope), the system tries to assign one, and then maps the property into one of the type-strong fields in our analytics database.

Does this process scale?
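As a rough illustration of that mapping step, here is a minimal sketch in Python. All names (`ProjectMetadata`, the `string_N` slot names, the default scope) are hypothetical and stand in for whatever CoolaData's metadata system actually uses; the point is only that a free-form property gets a scope and is pinned to a fixed, typed column slot.

```python
# Hypothetical sketch: map a newly seen "seamless" property into one of a
# fixed pool of type-strong column slots. Names are illustrative only.

EVENT_SCOPE = "event"  # assumed default when the customer gives no scope


class ProjectMetadata:
    """Per-project mapping of property name -> (scope, typed slot)."""

    def __init__(self, free_slots):
        self.free_slots = list(free_slots)  # e.g. ["string_1", "string_2"]
        self.mapping = {}                   # property name -> (scope, slot)

    def assign(self, prop, scope=None):
        """Assign a typed slot to a newly seen property, if one is free."""
        if prop in self.mapping:            # already known: reuse the slot
            return self.mapping[prop]
        if not self.free_slots:             # the "no empty slots" case
            return None
        slot = self.free_slots.pop(0)
        self.mapping[prop] = (scope or EVENT_SCOPE, slot)
        return self.mapping[prop]


meta = ProjectMetadata(["string_1", "string_2"])
print(meta.assign("plan_type", scope="user"))  # -> ('user', 'string_1')
print(meta.assign("referrer"))                 # -> ('event', 'string_2')
print(meta.assign("campaign"))                 # -> None: slots exhausted
```

In a single-process world this would be the whole story; the rest of the post is about why it breaks down on a distributed cluster.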
At CoolaData, we track and process billions of events, with peaks of tens of thousands of events per second (EPS). At this scale, we must implement a workflow that runs in parallel on distributed systems. Dynamically assigning metadata can give rise to the following issues:

1) Race conditions on seamless events. Each server in the cluster that handles events holds a full copy of the metadata. This reduces latency, but the moment two events with new properties arrive at different machines, both machines can assign the same slot and corrupt the data.

2) Versioning issues. If we let every process in the ETL cluster update the client schema, we will soon have several machines handling client data, each holding a different version of the client metadata. It is therefore obvious that some synchronization process needs to run to impose order on the metadata and align all the processes on a single version. Doing so, however, raises several questions. How frequently should this process run? It is very time-consuming, yet the system must maintain a high EPS. Who is in charge of this process, and what should be done when schema migrations conflict? And what should be done when the process does not respond to a refresh-metadata request?

3) No empty slots left. What should be done when a seamless event arrives with a new property but there are no empty slots left? Should we reject all events that have new properties? Do we need to add columns to the data table and add empty slots to the metadata?
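The race condition in issue (1) is easy to reproduce in miniature. The toy below (all names hypothetical) gives two workers their own full copy of the metadata, just as each server in the cluster has; with no coordination, both hand out the same slot to two different new properties.

```python
import copy

# Toy reproduction of issue (1): each worker decides from its own local
# copy of the metadata, so two workers can claim the same empty slot for
# two different new properties. Names here are illustrative only.

shared = {"free_slots": ["string_1"], "mapping": {}}


def handle_event(local_meta, prop):
    # Decision is made purely from the local copy; no coordination.
    slot = local_meta["free_slots"].pop(0)
    local_meta["mapping"][prop] = slot
    return slot


# Each worker starts from its own full copy, as each ETL server does.
worker_a = copy.deepcopy(shared)
worker_b = copy.deepcopy(shared)

slot_a = handle_event(worker_a, "plan_type")
slot_b = handle_event(worker_b, "referrer")
print(slot_a, slot_b)  # both workers claimed "string_1": data corruption
```

Two different properties now share one typed column, which is exactly the corruption described above.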
Here at CoolaData, we solved these metadata problems by taking metadata manipulation out of the ETL process itself. A separate process, in a microservices style, changes and updates the schema: it receives update requests from the ETL processes in the cluster and applies them to the schema. In this way, we avoid two problems:
Two processes trying to fill the same empty slot with two new but different properties.
Two processes trying to fill different slots with the same property.
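A minimal sketch of that central unit, under the assumption (not stated in the post) that it serializes updates internally, e.g. with a lock. Because every slot assignment goes through one place, the same property always gets the same slot, and a slot can never be handed out twice.

```python
import threading

# Sketch of a single central metadata service that serializes all slot
# assignments. Class and method names are hypothetical.


class MetadataService:
    def __init__(self, free_slots):
        self._lock = threading.Lock()
        self._free_slots = list(free_slots)
        self._mapping = {}

    def request_slot(self, prop):
        with self._lock:                  # one update at a time
            if prop in self._mapping:     # same property -> same slot
                return self._mapping[prop]
            if not self._free_slots:      # no empty slots left
                return None
            slot = self._free_slots.pop(0)
            self._mapping[prop] = slot
            return slot


svc = MetadataService(["string_1", "string_2"])
print(svc.request_slot("plan_type"))   # string_1
print(svc.request_slot("plan_type"))   # string_1 again (idempotent)
print(svc.request_slot("referrer"))    # string_2
```

Requests from any number of ETL processes funnel through `request_slot`, so both failure modes listed above become impossible by construction.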
This is accomplished as follows: each ETL process in the cluster holds a view of the project metadata in its cache, keyed by an ETAG string. When one of the ETL processes discovers a new property, it sends an update request to the metadata central unit. After applying the update, this unit returns the resulting ETAG of the metadata to the process that made the request.
If the call changed the metadata, a new ETAG is created. When an ETL process receives the response and sees that the returned ETAG does not match the one it holds, it initiates a refresh-metadata request; the ETAG in the metadata central unit may have been changed by this process's own request, or by a request from another process in the cluster. If the returned ETAG matches the process's current ETAG, the process knows that no changes were made to the metadata, for example because there were no empty slots left or because the particular update was not legal, and it marks the event as invalid.
Using an ETAG spares the ETL processes a great deal of cache refreshing, which is a very time-consuming operation.
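The handshake described above can be sketched as follows. This is a hedged approximation, not CoolaData's actual implementation: the class names are invented, and generating a fresh UUID on every metadata change stands in for whatever the real system uses as an ETAG.

```python
import uuid

# Sketch of the ETAG handshake between an ETL process and the metadata
# central unit. Names and the UUID-as-ETAG choice are assumptions.


class MetadataCentralUnit:
    def __init__(self, free_slots):
        self.free_slots = list(free_slots)
        self.mapping = {}
        self.etag = uuid.uuid4().hex

    def update(self, prop):
        """Apply an update request and return the current ETAG."""
        if prop not in self.mapping and self.free_slots:
            self.mapping[prop] = self.free_slots.pop(0)
            self.etag = uuid.uuid4().hex   # metadata changed: new ETAG
        return self.etag                   # unchanged ETAG means "no change"


class EtlProcess:
    def __init__(self, central):
        self.central = central
        self.cache = dict(central.mapping)  # local view of the metadata
        self.etag = central.etag

    def on_new_property(self, prop):
        returned = self.central.update(prop)
        if returned != self.etag:
            # Metadata changed (by us or another process): refresh the cache.
            self.cache = dict(self.central.mapping)
            self.etag = returned
            return "refreshed"
        # ETAG unchanged: the update did nothing (e.g. no empty slots left).
        return "invalid"


central = MetadataCentralUnit(["string_1"])
etl = EtlProcess(central)
print(etl.on_new_property("plan_type"))  # refreshed: slot was assigned
print(etl.on_new_property("campaign"))   # invalid: no slots left
```

The expensive full refresh happens only when the ETAGs differ; a matching ETAG lets the process skip the refresh entirely, which is the latency saving described above.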
These were the types of questions and issues we dealt with when we came up with the idea of a "Seamless Event." And yet, some questions have still to be answered. Maintaining a valid, synchronized version of a client's metadata has a price that affects the time it takes to process each event. Keeping processing time as short as possible while keeping the schema synchronized raises many technical challenges.