Anyone who has ever built an event-based (behavioral) analytical data warehouse that scales to billions of rows can tell you that it’s not a task for the faint of heart.
While it may seem like the perfect data solution for your organization, the truth is that the process can become extremely complicated and the final results achieved may not meet your expectations.
In fact, in a recent Forbes article, Bernard Marr wrote that, “half of all big data projects will fail to deliver against their expectations.”
This is a guide to overcoming your current challenge in the most efficient way possible, and to do so you’ll need to understand the potentially problematic scenario you may be facing if you follow the traditional path.
We’re about to paint a very detailed picture to help you better understand exactly what goes into planning and implementing a data warehousing project. What follows are the nine stages of the process, outlining a hypothetical but common series of events that many will encounter.
Embarking upon a journey without fully understanding the risks and complexity is a fool’s errand in any regard, especially with the task of building a data warehouse to codify a company’s digital information.
The allegorical tale to follow describes the plight of no particular analyst; rather, it traces the organizational and emotional progression of the data team of a growing company.
This is also a guide to delivering the most effective end result for your company and improving the way each department views data. As such, we conclude our guide by outlining solutions to the problems faced within this scenario.
You need to fully understand what you’re getting your team into BEFORE you start your project. This guide is our way of sharing our expert insights in hopes that we can help to streamline your process and make your job a whole lot easier.
Let’s get started with the first stage of this build.
Stage 1: Determining Your Problem
Some employees find themselves with piles of work being dumped on their desk. Ironically for the data analyst, the things they need to do their job are very rarely dumped on their desk.
It’s your job to take the business questions from your colleagues and superiors, then find the relevant data and present the answers in an easily digestible format.
No matter the type of organization you’re a part of, the demands you experience are likely pretty similar.
Putting Relevant Information in Front of the Right People
Your executives want reports on their specific KPIs, even though you know that those KPIs don’t tell the whole story.
The marketing department comes with its own list of demands. They want to know the basics: number of website visitors per month, campaign-level metrics, conversion rates by channel.
Sales is probably using a CRM that has been improperly configured from the get-go, but they already have their “process” ironed out such that changing the configuration would be even more work than it’s worth.
They’ll probably tell you that they already have some kind of solution for analytics, but your inquiry into their process doesn’t give you much confidence in their half-baked implementation of some cloud-based SaaS tool.
They probably have some tight security around financials so getting past that will take some political maneuvering.
What That Means for Your Team
In a traditional organization, your data team is stuck doing manual data reporting processes even though they are all well-versed in SQL and have very strong technical backgrounds.
Some ad-hoc systems have been built for reporting, but they frequently contradict each other and haven’t yet earned the trust of management. They shrug and say that the only way to deal with more data is to hire more people to manage the never-ending sequence of recursively updating spreadsheets.
You have multiple sources of data. Your out-of-the-box SaaS tools deliver skewed data at a departmental level. The report you need to generate requires pulling data from 20 unique sources. Your execs are making decisions without the data they need.
It is what it is, but where do you go from here?
Stage 2: Finding Solutions for Your Problem
Surely there are vendors that have built a solution to this problem. If only you could find the one that perfectly suits the needs of your business and execute a successful implementation, your company would think that you’re a hero for restoring order in a land of data-chaos.
Searching for a Solution
Your search begins to fall flat when you realize the substantial gap between the application-specific tools that your team is already using and ERPs (enterprise resource planning systems).
Even if you could convince your team that you need an ERP, you already know that you’d have to import all of your historical data, strip all of the teams off of their existing solutions and spend a year on implementation.
That’s only if you can afford the solution, which is proving difficult to determine because you can’t get clear answers about pricing when you try to get a quote.
Either way, you know that suggesting the implementation of an ERP would be lengthy, costly and politically disastrous (and it’s probably overkill anyway).
Okay, back to the drawing board.
Is there some way you can connect all of these disparate data sources and bring them to a central location for analytical querying?
Now you think you’re getting somewhere.
Several tools latch on to the APIs of your department’s favorite SaaS tools and will shoot the data in whatever direction you tell them to. This seems like the answer, you just need to work with one of these managed integrators to start pulling all of this data out and piping it to… to where exactly?
In the copious amounts of marketing material you’ve undoubtedly read by now, these integrators all talk about shipping your data into your data warehouse. You may not have experience designing and building a data warehouse, but the idea of having a warehouse for all kinds of different data sounds very appealing.
Think about it: all of your company’s data from your team’s SaaS apps, your data from external databases, and live interaction data all seamlessly flowing into a data warehouse.
Unified under a single query engine, you and your team could spend hours creating endlessly complex reports to find answers to questions you had never even had the tenacity to ask before!
Fully equipped with a behavioral layer that would enable simple time-series analysis and transferability of metadata for things like segments, funnels and cohorts, this would be the ultimate tool.
Stage 3: Building Your Data Warehousing Concept
If you’re going to take the time to build a data warehouse, it’s going to be the data warehouse to end all data warehouses.
It should be able to ingest every kind of data, structured or semi-structured (even those nasty nested JSON files), and store it in a convenient, immutable format primed for fast querying.
What to Incorporate
Anticipating the structure of this data would take a lot of work, especially when you consider the fast-changing nature of your business, so you decide that you’ll need some kind of self-learning schema in your ETL process.
Then there are the marketing and product teams: you’re going to need some way to ingest all of the data that their users are generating. Remembering that web data is inherently messy, you decide that you’ll need an even more sophisticated ETL process than you had initially envisioned.
Marrying web data with mobile data is something that some of those SaaS apps your product team uses do really well, so you need to make sure to include that in your design as well (somehow).
Of course, if you’re going to marry data between web and mobile, you’ll need to incorporate some kind of identity management solution so that a user’s actions on the website and in the mobile app are recorded and associated with the same user ID, with no duplication across platforms.
Dealing with those darned freeloading anonymous users on your website who haven’t signed in yet is a big portion of what the marketing team does, so you decide that you’ll help them out by retroactively associating past history with a unique user ID once the user converts. A simple but elegant touch like that, helping them calculate their ROI more accurately, could go a long way.
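As a rough illustration, that retroactive association can be sketched in a few lines of Python. The field names (`anonymous_id`, `user_id`) are hypothetical, and a production identity graph would be far more involved:

```python
def stitch_identities(events):
    """Retroactively associate anonymous events with a known user ID.

    `events` is a list of dicts with keys 'anonymous_id', 'user_id'
    (None until the user signs in or converts), and 'name'.
    """
    # First pass: learn which anonymous ID maps to which user ID.
    identity_map = {}
    for e in events:
        if e["user_id"] is not None:
            identity_map[e["anonymous_id"]] = e["user_id"]

    # Second pass: backfill the user ID onto earlier anonymous events.
    stitched = []
    for e in events:
        resolved = e["user_id"] or identity_map.get(e["anonymous_id"])
        stitched.append({**e, "user_id": resolved})
    return stitched
```

Events from anonymous IDs that never convert simply keep a `None` user ID, which is exactly the gap the marketing team keeps running into.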
Great, now we have all of this data, so how should we store it?
Should we create one table per event type, or have one big table for all of the events, or something in between? For highly optimized queries you know you need a columnar data store, so figuring out how to structure this will be a little tricky.
Once you’ve decided how the whole ETL and internal data structure will look, you realize that all of this data is going to be useless without some way to get it in and out quite easily. Your data scientist is going to need an ODBC or JDBC connection to pipe data into their fancy predictive modelling script and an easy way to incorporate the results into the initial data store.
Your design specification is growing by the day.
Stage 4: Long Term Planning and Getting Buy In
It’s time to see just how far you and your team are going to take this thing. In order to get the cluster sizing right you need some way to estimate your data volume.
Taking a look at the basic analytics your marketing and product team have in place, you estimate that between your web, mobile and enrichment data you’re going to have about 100,000 users per month with about 5 sessions per user and about 60 events per session.
More Events Than You Planned For
Working the math out you realize that you’re talking about 30M events per month. Wow, that’s a little more than you thought you’d have.
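The back-of-envelope math, using the estimates above, works out as follows:

```python
# Back-of-envelope event volume estimate from the figures above.
users_per_month = 100_000
sessions_per_user = 5
events_per_session = 60

events_per_month = users_per_month * sessions_per_user * events_per_session
print(f"{events_per_month:,} events per month")  # 30,000,000 events per month

# At a steady 30M events/month, the billion-row mark arrives in:
months_to_billion = 1_000_000_000 / events_per_month
print(f"~{months_to_billion:.0f} months (~{months_to_billion / 12:.1f} years)")
```

At a flat 30M events per month that is roughly 33 months to a billion rows, which is where the "about three years" figure below comes from; any growth in users or tracked events pulls that date closer.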
Regardless, you know that if your company is able to hit their sales target by the end of the year, the number of events per month could easily double or triple, especially as the marketing and product team start tracking more and more events.
No fear, elastically scaling data stores are here!
Just a mental note, though: assuming things stay on course for your business, you’ll probably hit the billion-row mark in about three years, and that’s ignoring your company’s aggressive growth plans, which could double that figure.
Regardless, building your own home-grown solution is going to be far superior to any off-the-shelf product you can find.
And what’s to say of the cost?
Sure, cloud costs have come down dramatically in the last couple of years, so your solution is now more economical than it would have been otherwise. Still, the worrying pace of data volume growth is something to keep an eye on. You decide to find a data warehousing solution that bills for storage and compute separately and just leave it at that.
Now that you have this warehouse project you’re going to need to sell it to management.
Presenting to the C-Suite
They just want to see the ROI on the project before giving it the green light, so you run a quick analysis that shows just how much bad data costs the organization. Even you are surprised by the figure; it’s quite an easy way to justify a six- or seven-figure expenditure on a data warehouse, even though Gartner estimates that over 60% of these projects are completely abandoned.
Either way, after you put all of the high-level specs into a PowerPoint that circulates among management, you are greeted (to your great surprise) with overwhelming support for your project.
There’s no doubt that your superiors have been reading the most popular business publications and are eager to make their organization “data driven”.
This is it. You have the vision, the budget, and the resources necessary to accomplish what you had initially set out to do.
Stage 5: Keeping Up With Expectations
Alright, time to build this thing.
But who’s going to build it?
Your project needs resources for an entire development team, a DBA, an ETL specialist and an architect, but your budget only allows for about half of those people to come onboard.
Either way, you talk to HR and start writing job descriptions to hire the people that you need. You could contract out the work but you know that hiring internally for a long-term project is always better.
In the meantime, you start looking at actually building your data store. Should you choose a pricing model that charges for compute and storage separately, or together?
You start building some data models that represent a typical business use-case only to find that this is far harder to predict than you had originally anticipated. How do you forecast design parameters and cost without having a better idea of what the queries are?
The Knowledge Gap
A week or two later, HR has several candidates who seem to fit the job descriptions you wrote up. After interviewing the first several candidates, you realize that building and managing a team is going to be a lot more work than you had originally anticipated.
The people who are qualified cost a lot more than you have budgeted for, and the people you can afford don’t have any experience with this sort of project. Making compromises and sacrifices along the way, you make your hiring decisions and put your team to work on building out the first level of infrastructure.
Managing a development team is a completely separate skillset that you haven’t quite had the chance to hone yet. They’re used to things like agile and scrum development methodologies but you didn’t think to include a scrum master in your original budget.
Source management is now an issue, and members of your team are getting frustrated by the lack of communication between each other. Begrudgingly now, your team takes some data from the tools your company is using and starts loading it into the database.
Taking Longer Than Planned
Your flexible schema architecture is now showing its promise, but you know that queries are still going to be running suboptimally. As it turns out, building continuous, real-time data pipelines is a bit of a tricky business and data integrity seems to be a barrier to good insights.
After piping the data, your ETL process begins to consume all of your resources. How do you design a model that handles data discrepancies, null values, invalids and other kinds of problematic data? This turns out to be quite a time-consuming process that takes 3 months all by itself.
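To make the problem concrete, here is a minimal sketch of the kind of per-record cleaning logic such a model has to encode. The field names and rules are illustrative assumptions, not a real pipeline:

```python
from datetime import datetime, timezone

def clean_event(raw):
    """Normalize one raw event dict, or return None if it is unrecoverable.

    Hypothetical field names; a real pipeline has many more rules.
    """
    # Required fields: drop the record if they are missing or null.
    if not raw.get("event_name") or not raw.get("timestamp"):
        return None

    # Coerce the timestamp; reject unparseable or future-dated values.
    try:
        ts = datetime.fromisoformat(raw["timestamp"])
    except (TypeError, ValueError):
        return None
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    if ts > datetime.now(timezone.utc):
        return None

    # Optional fields: substitute explicit defaults for nulls.
    return {
        "event_name": raw["event_name"].strip().lower(),
        "timestamp": ts,
        "user_id": raw.get("user_id"),       # None is allowed here
        "revenue": float(raw.get("revenue") or 0.0),
    }
```

Every data source needs its own variant of these rules, and deciding whether to drop, default, or quarantine each kind of bad record is where the three months go.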
Don’t even mention the click-stream data from web and mobile. Your mobile attribution and navigation SDKs are far from easy to glean insights from, especially since they’re plagued by the same data integrity issues as your other data sources.
No Turning Back
Now that you actually have a team, resources, data and a whole mess of problems, you realize that there’s no turning back now. Cloud costs are beginning to rack up as you optimize your queries for efficiency, your team is a few months behind on implementation and you’re still miles from a finished product.
But you haven’t given up hope.
Determination is what’s keeping you moving forward with this project despite the significant hurdles you’ve had to overcome. Others in the organization are starting to wonder why you aren’t hitting deadlines.
Stage 6: Your Data Warehouse Problem
Pressing onward, you begin to sense a creeping organizational doubt about your project. The costs you’ve incurred and the time you’ve spent can’t be made up now, but your results are still far from what was expected.
In attempting to tackle the problem of data fragmentation, you may have made the situation even worse, a strikingly ironic realization.
Either way, data doesn’t lie, so as long as you can get this data warehouse to spit out accurate and actionable results in some form or fashion you’ll be in the clear. Let’s take a look at those data sources that we’re piping in and see what we can come up with.
Your Data is Lying to You
Looking at the web traffic data, you see that there are some obvious problems.
Users who access your assets both on mobile and web are being counted twice in your calculations. Even though your identity management solution has been implemented, your session data isn’t reflective of the user experience.
Server-side sessionization was supposed to solve this problem, but it’s a lot more complicated than you had originally thought. Timestamps from devices that conflict or are set in different time zones make your work quite a bit harder than you had anticipated.
Furthermore, the click stream data that’s being sent in needs to be retroactively attributed to a given user once they perform some kind of identifying event. Incorporating server-side sessionization with the retroactive inclusion of click-stream data requires a complex smearing algorithm to properly associate the unique user IDs with the actions they performed.
Oh, and don’t forget about all of the offline events that need to be added retroactively to a session to make it complete as well.
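As a rough sketch, the core of server-side sessionization, splitting one user's event stream on an inactivity gap, might look like this. This is a simplified illustration; the real smearing and retroactive-attribution logic described above is far messier:

```python
def sessionize(events, gap_seconds=1800):
    """Group one user's events into sessions using an inactivity gap.

    `events` is a list of (timestamp_in_seconds, event_name) tuples;
    a gap longer than `gap_seconds` (30 minutes by default) starts a
    new session. Illustrative shapes, not a real event schema.
    """
    sessions = []
    current = []
    last_ts = None
    for ts, name in sorted(events):  # device clocks force a server-side sort
        if last_ts is not None and ts - last_ts > gap_seconds:
            sessions.append(current)  # gap exceeded: close the session
            current = []
        current.append((ts, name))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```

Even this toy version assumes the timestamps are trustworthy and the stream belongs to a single resolved user, which is precisely what the conflicting clocks and anonymous traffic break.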
What you end up with is two or three records for each individual user, each with a fragmented journey. Your identity management system works in theory, but your users don’t have any incentive to make your life easier.
Forget about the fact that 20% of the internet uses Ad-block now and that this is going to skew your data either way.
Living Up to Your Promises
Marketing is now knocking on the door and wondering how they’re going to associate their ad spend with all of these individual user activities. They were promised visibility on ROI data and instead they get a bunch of fragmented touchpoints from what appear to be different users.
At least the product team has the luxury of having users signed in while you’re collecting tracking info, but that doesn’t count a whole lot for the team that’s managing a six figure ad spend with no clear relationship between cost and campaigns.
The cruel irony is that this is just a small piece of the puzzle and other tools do this quite well at a far lower cost.
You have multiple entities for single customers and multiple sessions for a single user on different platforms. Your data is lying to you.
Stage 7: Putting Out Fires
It’s been six months since you began this project and management is starting to voice their concerns. Your project is well behind the implementation timeline, you’re already significantly over budget, and you haven’t even tuned your database for the analytical queries that your users are going to be running.
Pressing forward, you assure management that the money will come back in the form of organizational efficiency. Time to optimize queries.
After you’ve set up dynamic partitioning on your data warehouse to allocate the bulk of your data in storage and only the recent stuff in memory, you start running and tuning, an exercise you quite enjoy.
It looks like your columnar data store isn’t actually as cost-effective as you thought. Average query time is far longer and far costlier than you had originally anticipated. What’s going on?
Optimizing for specific queries is easy; optimizing for general use is near impossible. But that’s the tradeoff with a flexible data warehouse, right?
Attempting to find some way to make this scale to billions of rows without a significant number of DBAs working around the clock, you implement some clever hashing and caching mechanisms to optimize your queries for runtime and cost.
This seems to work well, but you know that your elastic schema is (definitionally) going to change and this will require some more upkeep.
Which is cheaper, hiring a full-time DBA or playing cat and mouse with dynamic optimization?
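The hashing-and-caching idea above can be illustrated with a toy sketch. The class and its interface are hypothetical, and a real warehouse would also need to invalidate entries when the underlying tables change:

```python
import hashlib

class QueryCache:
    """Toy result cache keyed by a hash of the normalized query text.

    Illustrative only: real caches must handle invalidation, memory
    limits, and concurrent access.
    """
    def __init__(self):
        self._store = {}

    def _key(self, sql):
        # Normalize whitespace and case so trivially different
        # spellings of the same query share one cache entry.
        normalized = " ".join(sql.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def run(self, sql, execute):
        key = self._key(sql)
        if key not in self._store:
            self._store[key] = execute(sql)  # cache miss: hit the engine
        return self._store[key]
```

The catch the narrator runs into is exactly the invalidation problem: an elastic schema and continuously arriving data mean cached results go stale, so the cat-and-mouse game never really ends.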
Losing Confidence in Your Data Warehouse Plan
Cobbling together the resources you have, you go back to management to report your progress. They are not happy with the results. They were promised an analytical data warehouse three months ago and have not had anything solid delivered.
If the marketing and product teams can release features on time, why can’t you build your data warehouse within the defined time range?
You’ve lost the confidence of your key stakeholders. Your project is costly, and under threat. People in the company are starting to talk about pulling the plug and diverting the funds to another campaign.
Stage 8: Encountering Serious Problems
You need something to win them back. You need to deliver some key insights that the company is waiting for or your project is going into the gutter.
Hastily throwing together a user interface using some open-source visualization tools, you and your team start to provide some high-level metrics to the marketing and product team that you hope will help satiate their ever-increasing demands.
Your team is greeted initially with enthusiasm but quickly finds that the results are far from satisfactory.
The data coming from the data warehouse doesn’t match what their SaaS tools are saying. Now you have multiple data sources that are conflicting and multiple versions of the truth.
Not only are the figures off, they’re not even close.
You’re positive that you’ve done everything you can to merge the inherently messy collection of web data with the clean and tidy process of querying a database, but the numbers just don’t add up.
Two explanations exist.
Either your data warehouse just doesn’t work, or you’re attempting to tell them something that contradicts their pre-existing notions.
Identifying Unforeseen Problems
Determined to figure out what might be the problem, you take a quick look at data quality and find out that the trackers that you’ve implemented are returning timestamps from the future! Not only is your data warehouse lying to you, but your data sources are encountering problems as well.
In addition to this strange conundrum, you find out that your batch loading process isn’t set to run for another hour, so you’re still dealing with data from yesterday before it’s fully loaded into your data warehouse.
At the last minute, you realize there needs to be a real-time element to your project so as to detect discrepancies like this.
Spinning up a new cluster that will hold data for 7 days with no pre-processing should solve one part of the problem, but you now know that an even bigger issue has been created.
You’ve lost confidence in the quality of your data. Others don’t trust it either. Your dashboards have been implemented but people don’t look at them because they don’t trust them. They’d much rather have reliable, isolated metrics than contradictory ones that have been fully enriched.
Stage 9: Struggling to Provide Results
Without the support of your company, your team is beginning to lose focus on the merits of the project. Your accomplishments look far less like what you had imagined and much more like a costly project that is far more expensive than originally intended and delivers far less than was hoped.
At this point, all of your focus is on providing the behavioral insights that you had originally set out to deliver. If you could just show them the value of this tool to the business, you could turn this thing around.
Finally, you’ve reached the point where you can start interfacing with the behavioral semantic layer: the layer that takes advantage of your complex ETL process, your identity unification system, your unique approach to sessionization, your dynamic schemas and, finally, your query optimization.
Let’s show some time series behavior.
Delivering Positive Results More Difficult Than Expected
The product team is the most likely candidate to appreciate time-series behavior, so we’ll start with something valuable. Let’s show them a reverse cohort analysis that helps them identify activities associated with churn.
Although this isn’t simple, it’s something that no one on the product team could have imagined possible before your solution. SQL is what you and all of your analysts know, so you had better be able to glean these insights from your data store with it.
Starting to write your query, you realize that this is going to be one hell of a query. Joining data from multiple sessions, multiple users, over an unspecified period of time with interlocking events is quite a tricky thing to write in SQL.
Why is this so hard?
You remember learning somewhere that SQL is a set-based language, which makes it great for filtering, subsetting, performing simplifying calculations and joining tables on common keys. Its apparent usefulness in this area seems to have compromised its ability to function as a behavioral language.
After several hours of building aggregation tables and joining on what seems like an endless number of conditions, you’ve figured out how to get the result you’re looking for.
But how in the world are you going to get a team of product managers, most of whom don’t have a great grasp of SQL, these insights on a consistent basis without making your team the bottleneck?
Your company’s preferred visualization tool definitely doesn’t have the capacity to automatically generate these kinds of queries. It too is built for analyzing data in set form, not for tracking behavior across a longitudinal time span.
Replicating Efforts for Each Department
Your team realizes that the only way to provide this information is to spin up some custom dashboards for every department.
Now you need to create the queries, design the dashboards, embed them in something easily accessible (but still secure) and distribute them to the key decision-makers.
Your solution isn’t performing at nearly the capacity you had hoped. What was once a revolutionary idea is now a thorn in your side and something you need to maintain.
The Final Stage: Stuck With Subpar Results
You feel stuck. Instead of doing things simply and quickly, they are now complex and messy.
Teams Won’t Use Your Data
The dashboards you’ve created for departments in the company aren’t actually being used because no one trusts the data. Even if they had a question about it, they know that there’s no verification mechanism to ensure that what they’re seeing is actually correct.
The data is far from interactive as well. To support the nature of the complex queries associated with time-series, behavioral analysis, only members of your data team that are well-versed in SQL can provide support for ad-hoc questions. Once again, your team has become the bottleneck for meaningful insights, a painful irony considering the initial motivation for this project.
Management is disappointed at the outcome but unable to forget that they themselves hung their hat on this project from the get-go. Far from willing to admit defeat, they timidly declare the project a success but tout none of its value to the company. Even though the data team has doubled in headcount, its cost has more than tripled and the insights it provides are only marginally better than they were before.
One Big Messy Data Team
You’re stuck in charge of the team that now has to deal with a homegrown mess. Forget about working with another vendor or third-party consultant at this point; you’re too far down to enlist their help now. Scrapping the solution would be unthinkable for management, given the sunk cost associated with it.
In this position, you feel that you can no longer move up, down, sideways or even out. There is too much now that depends on keeping this thing running for you to make any kind of move. Remaining in this state is your only option.
3 Ways to Avoid These Problems
Slumping at your desk, you start to wonder why this all happened. Not only are you dealing with the consequences of your grand mistake on a day-to-day basis, but your professional career is now in jeopardy. You remember boldly, confidently looking into the chasm that lay before you, and, despite the risks, taking the plunge.
Have you fallen victim to your own hubris? Was there an issue with the process that you overlooked? What could have gone differently to avoid this?
After doing a thorough post-mortem, you realize that taking the following approach would have been far more advantageous:
1 – Start small and add incremental value:
Trying to digest all of the company’s data at once was a huge mistake, and probably an impossibility from the outset. By starting small and adding incremental value to a single department at a time, you could have identified problems early on without consuming a significant amount of resources along the way. Small wins and early corrections have compounding effects over time.
2 – Leverage existing vendors:
While building and maintaining your own solution affords you the maximum amount of customizability, you realize that you don’t need all of that customizability right out of the gate. There are several vendors who have solutions that would have gotten you 90% of the way to where you wanted to be at a fraction of the cost. Furthermore, cancelling a relationship with a bad vendor is a lot easier than firing an underperforming team.
3 – Treat it as a process, not a project:
A major misunderstanding is treating the data warehouse as a project rather than a process. It’s not something that can ever be marked as “done”. It’s something that is continuously built and refined as the company grows and matures. Communicating this to your stakeholders and having it reflected in your planning would have made a world of difference.
The Benefits of Data Warehousing Without The Pains
The scenario outlined above is not uncommon, and it explains why over half of all data warehouse projects don’t live up to expectations.
However, tools exist that provide the benefits of a data warehouse without the pains of building and maintaining one. In fact, let us tell you more about how that’s possible by outlining a scenario in which you’ve chosen to implement one of those tools instead.
The Behavioral Analytics Alternative
Showing up to work early has become a routine, not because you have to, but because you’re excited to keep working on that query that you’ve just about nailed. The coffee is hot and so are you, especially after leading the company through the implementation of your new behavioral analytics platform.
While it wasn’t a point-and-click installation, you know that the work you and your team did on the implementation was necessary for a properly functioning behavioral BI system.
Teams Are Operating More Effectively Than Ever
With all of your online and offline data points being aggregated automatically, teams have the ability to explore, visualize, understand, and interact with their data. This level of automation means that this is all possible without any additional resource investment on their part.
- Meetings now have substance, with a live, interactive dashboard prominently displayed overhead, seemingly to ward off personal bias or misinformation.
- Executives turn to their custom KPI dashboards and use them to share with their investors and external stakeholders.
- Marketers can now track their campaigns all the way from the first visit to the closed/won opportunity. They begin experimenting with different asset and promotion types to eliminate waste and maximize ROI.
- Sales is now able to determine the ideal number of touches that ensure the maximum likelihood of closing a deal. By integrating offline data with ad campaign data they’re able to understand what opportunities are most valuable.
- The product team now knows exactly which factors impact churn and is quick to test and implement fixes. With the ability to segment their users into behavioral cohorts, they quickly identify the users with the highest LTV and focus their efforts on retaining and monetizing that customer base.
- Accounting and Finance teams love how predictable the costs of the new platform are. Event-based pricing means that costs grow only in proportion to actual business activity, making the solution economical at any scale.
What About Your Team?
Speaking of your data team, how are they feeling about all of this?
Instead of frantically cobbling together data from scattered sources, they now spend their days writing queries that are enhanced by your tool’s behavioral extension, CQL. One of your analysts glows that, with about 10 lines of CQL, he can build a time-series query that shows purchasing behavior across several different factors, a query that would have taken five pages of equivalent SQL.
Data is seamlessly streamed into the platform and the elastic-schema just handles it. Your marketing and product teams don’t even have to coordinate their event schemas with you.
All of the query optimization happens on the back end, so there’s no need to worry about partitioning, hashing or managing indexes. Additional clusters are deployed and retired without lifting a finger. You watch as one of your CQL queries scans over a billion rows and delivers insights in under ten seconds, and you wonder how your competition is going to keep up. You have transformed your data from mere information into a strategic enabler for your business!
If you weren’t already aware, the tool described in the previous section is Cooladata. While the picture we painted above may seem a little too good to be true, it simply is not.
The truth is that Cooladata can do everything outlined above and far more. Not only that, it can be implemented quickly and requires a significantly lower investment of time, capital, and resources than building your own data warehouse.
Sure, a data warehouse may be the best option for your enterprise, and we’ll be the first ones to tell you so if that’s the case. We want to work with companies we can help.
It’s time to schedule a demo with one of our data scientists so we can show you exactly how Cooladata can help define the perfect data solution for your team.