Given its convenience and ease of use, Google BigQuery is a popular platform for large-scale data analytics. Yet the costs of frequent BigQuery use can quickly sneak up on your organization. And although Google provides features for controlling your costs, the tradeoff is strict user limits that can hold back the progress of your data analysis and business.
While it may seem like your only options with BigQuery are to either pay the high costs or set quotas, there are caching and partitioning strategies that can help you run more queries at a lower cost—all while getting better performance.
This white paper explains how you can get the equivalent of a “95% off coupon” for computer resources for any query that you run on BigQuery. It covers:
- Why BigQuery costs add up so quickly
- How Cooladata caching and partitioning strategies significantly reduce costs and query times
- How the cost savings add up
Let’s get to it.
PUTTING BIGQUERY COSTS UNDER THE MICROSCOPE
In many ways, the usage-based pricing for BigQuery is straightforward. You pay for the amount of data you store, query, and insert by streaming. While this seems fair, it’s important to remember that each and every query has a cost to it and the BigQuery engine scales elastically to accommodate computationally intensive queries. In other words, the more complex the query (or the more data the query has to scan) the faster the costs add up. And with unlimited resources at your disposal, the costs of BigQuery can be quite high. All it takes is poor planning or a query-entry error to accidentally blow through the entire monthly query budget far too early.
Given the more subtle realities of the BigQuery pricing structure, you can see why it would be a good idea to plan ahead for how much data you expect to store and the types and numbers of scans you expect to run. The problem is that you have better things to be doing than constantly worrying about costs and trying to anticipate company query needs. Fortunately, there are some best practices that can help you to control your costs.
Before we get into them, however, let’s take a closer look at the standard pricing that Google offers for BigQuery and talk about why understanding how the costs add up is so important.
Figure 1: Google BigQuery pricing with storage example.
To put the numbers from Figure 1 into perspective, let’s look at an example of a company looking to analyze user behavior within their web application. Often the best way to understand user behavior is by analyzing a sequence of actions performed by a set of users within a defined timeframe, typically referred to as a session.
Each of these actions triggers a small script within the website or application to send an event to a database that includes a user’s unique identifier, some information on their device, and some specific information about the action they performed. If you do this each time a user clicks a button or views a page, you’ll then have a whole pile of events in your database that describe how individual users navigate and interact with your application.
A typical event might have 40 or so properties, but can have far more depending on how much information you want to pass on for analysis. An event with 40 properties will be about 1.3 KB when streamed into BigQuery, just barely exceeding their 1 KB minimum billing threshold.
A company operating a web application with one million users who log in 10 times per month and are generating roughly 70 events per session would likely see event volume of 910 GB, costing about $50/month just in storage. See the equations below for more details:
Users * sessions_per_month * events_per_session = events
1,000,000 * 10 * 70 = 700,000,000 events
Events * size_per_event = volume
700,000,000 * 1.3 KB = 910,000,000 KB or 910 GB
Volume * streaming_cost_per_gb = monthly_cost
910 GB * $0.05/GB = $45.50
Now, for an organization with a million unique users, chances are that they have (or are working on building) a pretty robust BI team. Google charges based on the volume of data scanned for individual queries at a rate of $5/TB.
Let’s assume that this organization is looking to put dashboards in front of all 200 of its employees, who, by loading their dashboards, will run approximately 100 queries per day.
While considering the average scan size, you’ll obviously have to consider how complex these queries are. Let’s assume for the sake of illustration, the average query scans 5 GB of data. Under these conditions, the monthly cost of your queries will be about $10k.
Employees * queries_per_day * working_days = queries_per_month
200 * 100 * 20 = 400,000 queries per month
Queries_per_month * average_query_scan = monthly_scan_volume
400,000 * 5 GB = 2,000,000 GB or 2,000 TB
Monthly_scan_volume * query_scan_cost = monthly_query_cost
2,000 TB * $5/TB = $10,000 per month
As you can see, the bulk of the costs associated with BigQuery for this particular example are tied up in the analysis of the data, not the storage ($10,000 vs $50). Obviously, if you’re looking to trim your costs for BigQuery, you’d better spend your time on the analysis component.
Google’s answer to helping you control costs is to provide billing alert and custom quota options. But that effectively stops everyone’s queries until the next month rolls over. And given the potential tradeoffs, finding other ways to keep BigQuery costs in check can prove invaluable to your business.
RELYING ON A SMART OPTIMIZATION STRATEGY
The top priority for your data analysts is answering all of the questions your company has. So the less they need to worry about infrastructure costs the better. With this in mind, Cooladata developed smart ways to optimize the cost of BigQuery while simultaneously improving data availability and dashboard refresh efficiencies.
Moreover, while data optimization is critical for improving cost efficiency, there is a residual benefit of SPEED. Even though BigQuery distributes queries between many servers, data availability in BigQuery still has a direct impact on query execution time. For example, scanning 100 million rows rather than a billion is much faster and cheaper.
Sweating the details
Cooladata was one of the first companies to build features into its platform to take advantage of the BigQuery infrastructure. We tap into many BigQuery features, including table decorators, partitioning, and caching, as well as many of its analytical capabilities, window functions, and nested field capabilities. Generally speaking there are several ways to reduce BigQuery costs. For the remainder of this paper, however, we will focus on two of them:
- Partitioning (by time and property)
A CLOSER LOOK AT CACHING AND PARTITIONING
Before we delve into the intricacies of caching in BigQuery, let’s quickly revisit what caching is and why it matters to query costs.
Caching is a mechanism where the result of a query that references an unchanged dataset will be stored, alleviating the need to rescan the data. As long as the underlying data hasn’t changed, the result of the query should be the same.
In real-world terms, it’s kind of like when you ask a clerk at a store to help you find a specific item, such as sundried tomatoes. A new clerk may need to painstakingly scour the entire store in a time-consuming and resource-intensive search. Moving forward, however, as long as the layout of the store doesn’t change, the clerk can answer this question quickly and cheaply.
In BigQuery, the basic principle is the same. The results of queries are sent to an anonymous dataset, with a name generated from the name of the tables, their last update timestamp, and the query. When a user executes a query, BigQuery first checks to see if a cached result exists. If the query does return a cached result, it costs nothing because no additional processing is required. The problem is that query cache is only enabled in BigQuery in a limited number of scenarios.
- The query is deterministic—Queries are cached when they reference a specific data range from the past, such as 1/10/2017 and 5/10/2017, without a NOW () date parameter.
- The table does NOT have a streaming buffer—If there is a high volume of data being added to the table, it may have a streaming buffer. Think of the streaming buffer as the place where new data sits before it’s officially added to the table and ready for querying. A table that contains data from the previous month that is not being updated is an example of a table without a streaming buffer.
- The result of the query was not saved to a named table—When you write a query, you can save the result to another new table. That result will be saved as a table and will be readily available.
There are caveats to the above examples, however. For instance, query cache does not work in the following scenarios:
- The table is being updated—Updates to a table (for example, rows being inserted every 10 minutes) invalidate the cache. When you run a query, BigQuery will check to see if the underlying dataset has changed. If it has (or the table is currently in the process of being updated), BigQuery will not provide the cached result.
- A query is more than 24 hours old—The BigQuery cache only stores queries for 24 hours, so if you wrote a query a week ago that references data from last month you have to run the entire query again despite the fact that the result is the same. Re-running the query costs money and is slower.
How Cooladata handles cache
Using Cooladata in conjunction with BigQuery gives you access to more robust caching capabilities that help you save money and speed up some types of queries by:
- Caching all results for queries run on historical data
- Saving cache per query across all users
These strategies are helpful because time series historic data, such as the number of users that logged into your app three weeks ago, is not going to change. Moreover, any query about the previous day will return the same results, even if the current data on the same table is being updated.
With our date_range functions, we are aware of time. For example, we translate a date range (last 7 days) to a between clause (between 1/10/2017 and 7/10/2017). And since we know that historical data doesn’t change, it will be taken from cache.
What’s more, the Cooladata cache is saved per query across all users, not per user. That means employees from the same company can access historical data from other queries in the cache. In other words, if Bob runs a query that references historical data, Cooladata caches the result for the next time he wants to run the same query. Normally with BigQuery, if Cindy wanted to run the same query, she would have to spend the same amount of time and money to re-run the entire query. Since Cooladata caches queries across users, however, Cindy can rely on the same cache as Bob, saving her time and query costs.
Figure 2: [Caching costs nothing and returns immediate results]
How Cooladata handles partitioning
Partitioning by time
One approach we knew would help was by partitioning based on date and then querying data based on the partition date. All we needed to do was provide a way to partition tables to start saving money. Simple, right?
The catch is that BigQuery does not provide a simple way to partition a table based on an existing timestamp. It only allows you to create a partitioned table based on a date-shared table, so Cooladata needed a way to create a date-partitioned table. The easiest way would be to write a script that creates an empty date-partitioned table and then runs a query to get the data for each and every day, writing results back to the correct partition. Simple enough, however, the cost of this approach would be enormous, since it would require a full table scan for each day in a table.
To work around the BigQuery limitations for date partitioning, Cooladata uses the following approach.
- Creates a table with the relevant date range associated with the query being executed
- Supports changes to date-range intervals
- Relies on an intelligent parser that knows which tables to access according to date ranges
Partitioning by property
BigQuery does not support partitioning by property, which is another helpful way to speed up queries. A property is a detail that is passed with an event, such as IP location, gender, or browser type. When partitioning by property, there should be a single partition for each property that is passed through BigQuery.
Cooladata’s approach for partitioning by property in BigQuery is to create a table with the relevant date range and property. For example, the table fact_20151002_3 in the query below contains data from a single day, 2 of Oct 2015, and the data is filtered by event_name=’aplus-view’. Cooladata’s intelligent parser knows which tables to access according to date ranges and property
SELECT event_name , date(event_time_ts) , count(*) as cnt_not_opti from cooladata where date_range(last 30 days) and event_name=’aplus-view’ group by 1 ,2 LIMIT 10000″
SELECT session_event_letter AS event_name, DATE(session_event_time_ts) AS _expr_1, COUNT(*) AS cnt_not_opti FROM [project_114353_prod_dataset.fact_20150922_3], [project_114353_prod_dataset.fact_20151002_3], [project_114353_prod_dataset.fact_20151012_3]
WHERE session_event_time_ts BETWEEN TIMESTAMP(‘2015-09-24 00:00:00’) AND TIMESTAMP (‘2015-10-19 23:59:59.999999’) AND session_event_letter = ‘G’ GROUP EACH BY 1, 2 LIMIT 10000
In this example, we’re finding the number of events that happen over a series of days (to ostensibly produce a line graph of activity over time). The first query leverages Cooladata’s contextual query parameters, including event names and date ranges. The second query is written in ANSI SQL and manually specifies the dates, the partitions, and way that the events should be referenced from the tables.
The highlighted part in the first query corresponds with the highlighted part in the second. Essentially, referring to something by event_name automatically references three different partitions in BigQuery.
As you can see by the output in Figure 3 below, referencing only the table partitioned by time yields far poorer performance than querying tables that are partitioned by time and event query.
Figure 3: The performance of a query partitioned by time and event is far superior to one partitioned only by time.
PUTTING IT ALL INTO PERSPECTIVE
If you’re wondering what the numbers in the above example mean in practical terms, it’s a great question. You could expect that for a dashboard trying to populate 12 reports on a dataset of 25GB, which is the case for the example, it would take more than two minutes to load the whole thing. Cooladata, on the other hand, breaks the query into its components and instructs BigQuery to parallelize the operation and return all of the results in seconds.
Furthermore, if you’re dealing with terabytes instead of gigabytes of data, most dashboards will crumble under the intensity of these queries. You can forget interactivity with big data, the dashboards would hardly be usable.
With Cooladata you can expect much more.
- Faster: Got the result in 28% of the time (and that’s comparing to partitioned-by-time data).
- Cheaper: We scanned 5.65% of the data.
- More Reliable: We’re so confident in our ability to optimize your queries that we don’t even charge for compute, just streaming events.
- Better: Cooladata significantly lowers your BigQuery costs.
Given its ability to offload demands on compute resources through more targeted scans of cached data, with the Cooladata platform it’s like you have a 95% off coupon for any query that is run on BigQuery. And if your company runs hundreds or thousands of queries per month, these efficiency gains can mean big savings. It all adds up to having even more time and money for answering important questions day in and day out.
Cooladata provides a secure, fully-managed analytics and data warehouse solution optimized for behavioral and time-series analysis.
Cooladata is an end-to-end solution that lets companies capture, unify, analyze, visualize and share their data to empower every team to make smarter decisions, faster.
Data and BI teams use Cooladata to perform ad-hoc analysis of their data, and answer complex questions in seconds without writing long SQL queries. We’re so confident in our ability to automatically optimize our queries that we offer almost unlimited queries on your data for a fixed price. Moreover, for data teams considering building their own data warehouse for behavioral analysis, Cooladata gives you a solution as powerful and flexible, but at a fraction of the cost.
Cooladata is backed by 83 North / Greylock Israel, Carmel Ventures and Salesforce Ventures.