GitHub, like any other website or app, constantly records the actions of its members: open source developers from all over the world working on millions of projects, writing code and documentation, submitting and fixing bugs, and watching or commenting on repositories. In keeping with the open source approach, the GitHub Archive project makes these data sets available to anyone who wants to analyze the GitHub experience.
The archive tracks around 20 event types, each carrying a metadata-rich JSON payload. As an open behavioral analytics solution, we took this as a personal invitation to apply our approach and draw some behavioral insights about the GitHub community.
Analysis Beyond Trends and KPIs
GitHub routinely publishes lists of trending repositories and developers, as well as some other KPIs. The GitHub Archive, however, holds big data from 20 event types tracked across 12 million users and 31 million repositories, which makes it challenging to analyze beyond these simple measurements.
GitHub events such as create, deployment status, push, download, fork, apply, page_build, pull_request and watch are analogous to the user_registration, add_to_cart, comment, view_product_page, add_favorites and buy events on an eCommerce site. Whatever the action or event, when it is tracked across platforms and stored properly, it is fertile ground for behavioral analysis.
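As a rough illustration of what "tracked and stored properly" can mean in practice, here is a minimal Python sketch that counts events by type from newline-delimited JSON records. The sample records and their field names (type, actor, repo, created_at) are simplified assumptions made for this example, not a verified copy of the archive schema.

```python
import json
from collections import Counter

# Hypothetical sample records, one JSON object per line.
# Field names are assumptions for illustration only.
SAMPLE_LINES = [
    '{"type": "PushEvent", "actor": "alice", "repo": "org/app", "created_at": "2015-03-02T10:00:00Z"}',
    '{"type": "WatchEvent", "actor": "bob", "repo": "org/app", "created_at": "2015-03-02T10:00:30Z"}',
    '{"type": "PushEvent", "actor": "alice", "repo": "org/app", "created_at": "2015-03-03T09:15:00Z"}',
    '{"type": "ForkEvent", "actor": "carol", "repo": "org/app", "created_at": "2015-03-03T11:00:00Z"}',
]


def count_event_types(lines):
    """Count how many events of each type appear in an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        counts[event.get("type", "unknown")] += 1
    return counts


event_counts = count_event_types(SAMPLE_LINES)
for event_type, n in event_counts.most_common():
    print(event_type, n)
```

The same function could be pointed at any iterable of lines, such as an open file, as long as each line holds one JSON event.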
Serial contribution to repositories analyzed as retention & churn
Retention analysis is a classic behavioral report because it looks at cohorts of users who perform a series of actions over time. We examined user retention or stickiness of the hottest repositories. We looked at GitHub users who initially performed a push event (contributing to a repository) and the pattern of their push behavior over the following five days.
We used the Cohort analysis visualization to follow contributions by the same users, day by day, to the same repository. Interesting repositories get repeated push events from the same users. Even without drilling down further (take a closer look at the Cohort report below), we see that an average period of four days represents a “work unit” for a repository. After four days retention drops, perhaps because of the weekend. One possible explanation is that the feedback on the initial push stimulates the continued user activity.
This behavior is comparable to a compelling article published on a content website: interest dies down naturally once the slew of comments on the article has run its course. In the GitHub case, repeated pushes bring the code up to date, and the need for further pushes subsides.
Though the intuitive CoolaSQL query language behind this report is quite simple, it answers a complex question. With advanced behavioral analytics, you can select behavioral functions and JOIN them with others to create the reports.
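To make the retention idea concrete, here is a minimal Python sketch of a push-retention calculation. It is not the CoolaSQL query behind the report. The sample data is hypothetical, and the definition used (the share of user and repository pairs that push again exactly N days after their first push) is an assumption chosen for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical push events: (user, repository, day of the push).
PUSHES = [
    ("alice", "org/app", date(2015, 3, 2)),
    ("alice", "org/app", date(2015, 3, 3)),
    ("alice", "org/app", date(2015, 3, 4)),
    ("alice", "org/app", date(2015, 3, 5)),
    ("bob", "org/app", date(2015, 3, 2)),
    ("bob", "org/app", date(2015, 3, 3)),
    ("carol", "org/lib", date(2015, 3, 3)),
    ("carol", "org/lib", date(2015, 3, 4)),
    ("carol", "org/lib", date(2015, 3, 5)),
    ("carol", "org/lib", date(2015, 3, 6)),
    ("dave", "org/lib", date(2015, 3, 2)),
]


def push_retention(pushes, horizon_days=5):
    """Return the share of (user, repo) pairs that push on day 0, 1, ..., horizon_days
    counted from each pair's first push."""
    days_by_pair = defaultdict(set)
    for user, repo, day in pushes:
        days_by_pair[(user, repo)].add(day)

    cohort_size = len(days_by_pair)
    retention = []
    for offset in range(horizon_days + 1):
        retained = 0
        for days in days_by_pair.values():
            first_day = min(days)
            if any((d - first_day).days == offset for d in days):
                retained += 1
        retention.append(retained / cohort_size if cohort_size else 0.0)
    return retention


retention_curve = push_retention(PUSHES)
print(retention_curve)
```

Day 0 is always 100% by construction, since every pair enters the cohort through its first push. The later values show how quickly repeat contributions taper off.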
We used the same Cohort analysis visualization for churn, the opposite of retention and a main concern for every online business. To analyze GitHub churn, we took all users who performed any action and followed them to see how soon they became inactive, that is, performed no further actions. The analysis covers one week in March and shows churn of about 60%, an indication that about 60% of the users are occasional users who are not continuously active on GitHub.
We could also run the analysis backwards with Reverse Cohort: instead of following users from their first action until they became inactive, we start with the final event and look back in time to see which actions led up to it.
Whether churn is analyzed forwards or backwards, results like these would be worrisome for an eCommerce site!
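A churn figure of this kind could be sketched in Python as follows. The sample events are hypothetical, and the definition used here is a simplifying assumption: a user counts as churned if they show no activity after their first active day in the window. The report itself may define inactivity differently.

```python
from datetime import date

# Hypothetical activity log: (user, day on which the user performed any action).
ACTIVITY = [
    ("alice", date(2015, 3, 2)),
    ("alice", date(2015, 3, 4)),
    ("alice", date(2015, 3, 6)),
    ("bob", date(2015, 3, 2)),
    ("carol", date(2015, 3, 3)),
    ("dave", date(2015, 3, 3)),
    ("dave", date(2015, 3, 4)),
]


def active_span_days(activity):
    """Map each user to the number of days between their first and last action."""
    first_day, last_day = {}, {}
    for user, day in activity:
        first_day[user] = min(day, first_day.get(user, day))
        last_day[user] = max(day, last_day.get(user, day))
    return {user: (last_day[user] - first_day[user]).days for user in first_day}


def churn_rate(activity):
    """Share of users with no activity after their first active day (assumed definition)."""
    spans = active_span_days(activity)
    if not spans:
        return 0.0
    churned = sum(1 for span in spans.values() if span == 0)
    return churned / len(spans)


spans = active_span_days(ACTIVITY)
rate = churn_rate(ACTIVITY)
print(spans, rate)
```

The span per user also gives a simple "how soon did they go quiet" distribution, which is the day-by-day view a cohort chart presents.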
It takes 8 seconds to get watched (if you’re really popular…)!
The newest repositories are watched by other users almost immediately. This shows that popular repositories gain loyal followers who watch new uploads. The quickest watch happens within 8 seconds!
This is a typical behavioral analysis – looking at a series of actions (in this case push, and later watch), with the dimension of time added.
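The push-then-watch timing can be sketched in Python like this. The events are hypothetical sample data, and the measure used (seconds from a repository's first push to the first watch that follows it) is an assumption made for the example.

```python
from datetime import datetime

# Hypothetical events: (repository, event type, timestamp).
EVENTS = [
    ("org/app", "PushEvent", datetime(2015, 3, 2, 10, 0, 0)),
    ("org/app", "WatchEvent", datetime(2015, 3, 2, 10, 0, 30)),
    ("org/lib", "PushEvent", datetime(2015, 3, 2, 11, 0, 0)),
    ("org/lib", "WatchEvent", datetime(2015, 3, 2, 11, 5, 0)),
    ("org/app", "WatchEvent", datetime(2015, 3, 2, 12, 0, 0)),
]


def seconds_to_first_watch(events):
    """For each repository, return the seconds between its first push
    and the earliest watch event at or after that push."""
    first_push = {}
    for repo, kind, ts in events:
        if kind == "PushEvent":
            first_push[repo] = min(ts, first_push.get(repo, ts))

    delays = {}
    for repo, kind, ts in events:
        if kind == "WatchEvent" and repo in first_push and ts >= first_push[repo]:
            delay = (ts - first_push[repo]).total_seconds()
            delays[repo] = min(delay, delays.get(repo, delay))
    return delays


delays = seconds_to_first_watch(EVENTS)
quickest = min(delays.values())
print(delays, quickest)
```

Taking the minimum over all repositories gives the "quickest watch" figure. A median or a histogram of the same delays would show how typical that speed is.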
Wouldn’t a commerce or content site be eager to track the speed of its users’ actions in response to a new offer, product, article or blog?
17% of GitHub users are just watching
The paths users take are another focus of behavioral analytics. Here, we used our advanced path analysis, represented in a Sunburst visualization, to check on the most popular paths for GitHub users.
Online apps and websites obviously want to know what visitors do on their site, and which paths are the most and the least popular. Here we can clearly see that 17% of all GitHub sessions during this period begin with a watch action right after login, and that is where the path ends for most of them: a single watch event on a single repository.
Often the less traveled paths deserve a closer look, as they can reveal anomalies in users’ behavior. While these paths are taken by fewer users, they are still easily traceable in this path analysis visualization.
If we click the first step of this narrow path, we see that 0.33% of all GitHub sessions start with a fork action (copying a repository), followed by a push event, and end with a pull request. What does this path reveal? To answer that, we need to drill down and query each step in the funnel further. Naturally, this is all possible when your big data is open for querying.
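The numbers behind a path visualization of this kind can be sketched in Python as follows. The sessions are hypothetical sample data, and how events are grouped into sessions is assumed to have been decided upstream. This illustrates path counting in general, not the Sunburst implementation.

```python
from collections import Counter

# Hypothetical sessions: each is the ordered list of event types in one session.
SESSIONS = [
    ["WatchEvent"],
    ["WatchEvent"],
    ["ForkEvent", "PushEvent", "PullRequestEvent"],
    ["PushEvent", "PushEvent"],
    ["WatchEvent", "ForkEvent"],
]


def path_shares(sessions):
    """Return the share of sessions that follow each exact path."""
    total = len(sessions)
    if total == 0:
        return {}
    counts = Counter(tuple(session) for session in sessions)
    return {path: n / total for path, n in counts.items()}


def share_starting_with(sessions, first_event):
    """Return the share of sessions whose first event is first_event."""
    if not sessions:
        return 0.0
    matching = sum(1 for s in sessions if s and s[0] == first_event)
    return matching / len(sessions)


shares = path_shares(SESSIONS)
watch_first = share_starting_with(SESSIONS, "WatchEvent")
fork_push_pr = shares.get(("ForkEvent", "PushEvent", "PullRequestEvent"), 0.0)
print(watch_first, fork_push_pr)
```

Sorting the shares in ascending order is a quick way to surface the less traveled paths that deserve a closer look.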
Behavioral Analytics applied
GitHub, or any other site that collects events representing user behavior, can and should apply behavioral analytics to place user events into context and produce actionable insights. This enables the site to better serve its users’ needs and interests, which ultimately serves as a springboard for increased revenue.
These behavioral reports and others are open and available. Sign up to see more.