Predictive Statistical Models with Python using CoolaData

Being an open source scripting language, Python is now one of the leading high-level programming languages. It offers functions for almost any statistical or model-building operation you may want to perform. Moreover, since the introduction of pandas (a data analysis package), Python has become very useful for operations on structured data. Even after choosing the right tool for your business (out of many), there still remains the challenge of running prediction and data models on your data. Libraries such as SciPy (http://www.scipy.org/), pandas, and others are widely used by data scientists.

CoolaData is an event-based analytics platform that data scientists can now connect to using Python in order to build statistical and advanced data mining models. The connection is made by addressing CoolaData's servers through its simple Query API. You can then run CoolaSQL (CQL), a behavioral extension to the SQL language that enables you to use cohorts, paths, and funnels to retrieve sample data for training your models.
By running these behavioral queries on large data sets and getting results quickly, you shorten the analysis cycle. All you need to take care of is the data model; you then receive behavioral segments built directly from huge sets of the events your users performed.

“As a data scientist, it’s important to have access to raw data. Accessing the CoolaData API with Python allows me to pre-aggregate data using CQL, and then perform whatever processing I want myself using powerful analysis tools like Pandas for Python. The CoolaData API automatically returns queries as .CSV files, which makes storing and processing data easier.” Clinton Boys, Data Scientist.

The advantage of using CoolaData is that it stores your data, offers ETL, and transfers event-based data in real time. After you've finished tweaking, you can save the query results in your statistical tool. The real power here is the combination of a full-stack analytics platform and the ability to pull that data into Python for statistical use.


Let's look at a real-life example:

After exploring the data in CoolaData, we can see a correlation between the time a user spends in our application during their first session and the number of lifetime visits for the same user. Now we will connect to our event store in CoolaData, retrieve the data we need, and build a model to test our hypothesis that first-visit duration is indeed a predictor of a user's future visits.

1. Connect to CoolaData:

from requests import Session

def SQLtoURL(file):
    ## Convert a .sql file into a single-line query string that can be passed to
    ## the CoolaData REST API: flatten line breaks and collapse extra whitespace.
    with open(file, 'r') as raw:
        data = raw.read().replace('\n', ' ').replace('\t', ' ').replace('   ', ' ').replace('  ', ' ')
    return data

def QueryCoola(query, file=None):
    ## Send the raw CQL query to the CoolaData API via an HTTP POST request.
    session = Session()
    response = session.post(
        url='https://app.cooladata.com/api/v2/projects/xxxxxx/cql/',
        data={'tq': query,
              'tqx': 'out:csv'},
        headers={'Authorization': 'Token xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'})
    if file:
        ## Save the CSV response to disk.
        with open(file + '.csv', 'w') as f:
            f.write(response.text)
        return 'Done.'
    else:
        return response.text

query = SQLtoURL('get_counts.sql')
print(QueryCoola(query))


2. Our Statistical Model:

import pandas as pd
import numpy as np
import statsmodels.formula.api as sm

# Read the query results from multiple CSV files
data = []
for i in range(1, 3):
    frame = pd.read_csv('users_data/users_first_ses_dur_visit_count' + str(i) + '.csv')
    data.append(frame)

# Merge the frames into a single one
user_data = pd.concat(data)
user_data.columns = ['uid', 'duration', 'returned']

# Learn a little bit about our data
print(user_data.describe())

# Let's see if the columns are really correlated:
print(user_data.corr())

# This is where the magic happens: we assume that a user's return visits can be
# predicted from the duration of that user's first session.
result = sm.ols(formula="returned ~ duration", data=user_data).fit()

# Let's see what we've got
print('R2: ' + str(result.rsquared))
# Do more stuff with the data: crunch and munge
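Once the model is fit, its predict method can score new first-session durations. A minimal sketch on synthetic training data (the numbers below are illustrative stand-ins, not real CoolaData output):

```python
import pandas as pd
import statsmodels.formula.api as sm

# Synthetic training data: longer first sessions roughly track more lifetime visits
user_data = pd.DataFrame({
    'duration': [10, 30, 60, 120, 240, 300],
    'returned': [1, 1, 2, 4, 7, 9],
})
result = sm.ols(formula="returned ~ duration", data=user_data).fit()

# Score two hypothetical new users by their first-session duration
new_users = pd.DataFrame({'duration': [45, 180]})
predictions = result.predict(new_users)
print(predictions.round(2))
```

The same pattern applies to the real query results: fit once on historical users, then score each new user's first session as it arrives.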


The best thing to do from here is to send the data back to CoolaData, where we can further explore the data, slice and dice it, and act upon the insights we gathered.
