With the fusion of technology into the field of science and research, human minds have come up with the greatest innovations in order to make our lives faster and easier.
One such field where technology has shown its power is data analysis. It’s an area in which raw information is researched, scanned, transformed and modified in order to crunch huge numbers down into simple figures from which to draw conclusions and make decisions.
To crunch huge numbers, however, data analysts rely on high-level programming languages.
Two of the most popular programming languages in the field of data analysis are R and Python. Both are used for data modeling and data integration, and the question of which one is better has captivated data scientists.
R is an open-source language that began as an implementation of the S programming language. Initially, it was mainly used by scholars and researchers, but it has taken a leap into the world of business in recent years.
Many users prefer the R programming language over other languages not only because it’s interactive, but also because it handles data structures very well. It has strong abilities to produce visualizations for data comparison, and it does better than many other programming languages when it comes to missing values, a factor that frustrates programmers greatly. Real-world data almost always contains missing values, and R has built-in support for representing and handling them.
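For comparison, here is a minimal sketch of what handling missing values looks like in plain Python, which has no single built-in marker comparable to R's NA. It assumes missing entries are encoded as None or NaN; the names readings and is_missing are hypothetical and chosen only for illustration.

```python
import math
from statistics import mean

# Hypothetical sample: None and NaN both stand for "missing" here (an assumption).
readings = [4.2, None, 5.1, float("nan"), 3.9]

def is_missing(value):
    """Return True for None or NaN, the two missing-value encodings assumed above."""
    return value is None or (isinstance(value, float) and math.isnan(value))

# Drop missing entries before computing a statistic, similar in spirit
# to calling mean(x, na.rm = TRUE) in R.
observed = [v for v in readings if not is_missing(v)]
missing_count = len(readings) - len(observed)

print(missing_count)
print(round(mean(observed), 2))
```

The filtering step has to be written out explicitly here, which is the kind of bookkeeping that R handles for you.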
Python, on the other hand, is a general-purpose programming language that is used not only for data analysis but also for web development. Its syntax is easy to understand, which leads to simpler, more readable code. It became popular among programmers who want to explore data analysis more deeply or employ statistical techniques, which made it a popular choice in the field of data science.
What would you prefer?
So let me walk you through certain features of both languages, which might shed light on the one you would like to opt for.
Even though R initially comes across as complicated, it gets easier once you cross the hurdle of getting acquainted with the basics. Learning advanced material in R is simple for experienced programmers, who can easily get accustomed to it. Python, on the other hand, takes some time to master, but its syntax is simple and easy to read. So if you’re new to the world of programming languages, Python is the clear choice.
R comes with easy techniques for using complex formulas. Statistical tests and models are widely available and easy to use.
Python shows its flexibility when you need to do something that has never been done before. It is also widely used by developers to script websites and other applications.
R provides the means to represent statistical models in very few lines. In R, the same functionality can be implemented in several ways. There are R style guides, but they haven’t gained much popularity among developers.
Python, however, owing to its simple syntax, makes coding and debugging easier for programmers. The indentation of the code is hugely important, because it defines the block structure and therefore changes what the code does. Python favors one obvious way of implementing a given piece of functionality, so the same task tends to be written the same way every time, quite the contrary to R.
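To illustrate how indentation affects the code, here is a small sketch with two hypothetical functions, total_a and total_b, that differ only in the indentation of the return statement.

```python
def total_a(values):
    """Sum all values; the return sits outside the loop body."""
    total = 0
    for v in values:
        total += v
    return total  # runs once, after the loop has finished

def total_b(values):
    """Same statements, but the return is indented into the loop body."""
    total = 0
    for v in values:
        total += v
        return total  # runs during the first iteration and exits early

print(total_a([1, 2, 3]))  # 6
print(total_b([1, 2, 3]))  # 1
```

Both versions are valid Python, so the interpreter raises no error; only the result differs.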
Data Handling Efficiency
The huge number of packages in R, its readily usable tests and the boon of being able to use formulas make it extremely useful for data analysis. Basic data analysis can be done without having to install separate packages. Packages such as data.table and dplyr are, however, usually needed for big datasets.
Python, on the contrary, is more of a newcomer in the world of data analysis. The immaturity of its packages was a huge issue for a long time, but it has improved considerably in the recent past. Packages such as NumPy and pandas must be used to make Python usable for data analysis.
When exploration is needed, R fits in much more easily. With just a few lines of code, statistical models can be developed.
Python, on the other hand, is a good tool for implementing algorithms for production use.
IDE and Support
The most widely used IDE for R is RStudio.
There are various popular packages in R that are widely used. Some of them are dplyr, plyr and data.table for data manipulation, stringr for managing strings, zoo for working with regular and irregular time series, ggvis, lattice and ggplot2 for visualizing data, and caret for machine learning.
Stack Overflow, the R documentation and R-help (a mailing list) are among the resources providing support for R users.
Python, on the other hand, is supported by various IDEs. The most popular and widely used are Spyder, IPython Notebook and Rodeo.
Some of its popular packages are pandas for smooth manipulation of data, SciPy and NumPy for scientific computation, scikit-learn for machine learning methods, matplotlib for graphics-related work, and statsmodels for data exploration, estimation of statistical models, and performing various statistical and unit tests.
Websites such as Stack Overflow, along with mailing lists, provide support for people using Python.
CRAN, which stands for Comprehensive R Archive Network, is a huge repository of R packages to which users can contribute easily. The packages mostly consist of collections of R functions, data and compiled code, and each can be installed in R with just a single line of code.
The Python Package Index, or PyPI, is the repository of Python software and libraries. Users can contribute to PyPI, but in practice the process is a bit complicated.
R is equipped with very advanced graphical capabilities, provided by packages such as ggplot2, lattice, rCharts, googleVis and ggvis.
The graphical capabilities of Python lie somewhere in between, with options to use either native libraries such as matplotlib or derived libraries that allow the calling of R functions. The RPy2 library is used to run R code from within Python by providing a low-level interface from Python to R.
Let us consider the example of a Random Forest to show you the difference between implementations in R and Python.
So before we begin, let us understand what random forest is!
Random Forest is a machine learning method that is versatile enough to perform both regression and classification. It can also handle other steps of data exploration, such as dimensionality reduction and the treatment of missing values and outliers.
Multiple trees are grown here, and each tree classifies a new object (votes for that class) on the basis of its attributes. In the end, the forest selects the classification with the most votes. For regression, the average of the outputs by different trees is taken into consideration.
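The aggregation step described here can be sketched in a few lines of Python. This is only an illustration of the voting and averaging idea; the function names aggregate_classification and aggregate_regression are hypothetical, and the per-tree outputs are made-up values rather than predictions from real trees.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(tree_votes):
    """Return the class that received the most votes across the trees."""
    return Counter(tree_votes).most_common(1)[0][0]

def aggregate_regression(tree_outputs):
    """Return the average of the numeric outputs of the trees."""
    return mean(tree_outputs)

# Made-up outputs of three trees for a single new object.
print(aggregate_classification(["spam", "ham", "spam"]))  # spam
print(aggregate_regression([2.0, 3.0, 4.0]))              # 3.0
```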
Before turning to Random Forest implementations in R and Python, let us have a look at the algorithm itself for a better understanding of the code.
Here, each tree is planted and grown under the following set of guidelines:
- If the number of cases in the training set is N, a sample of N cases is taken at random, but with replacement. This sample becomes the training set for growing the tree.
- If there are M input variables, a number m < M is specified such that at each node, m variables are selected at random out of the M. The best split on these m variables is used to split the node. The value of m is held constant while the forest is grown.
- Each tree is grown to the largest extent possible, with no pruning at all.
- Finally, the new data is predicted by making an aggregation of the predictions of the ntree trees (i.e. average for regression and the majority votes for classification).
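The guidelines above can be sketched in plain Python as follows. This is a simplified illustration for classification only, not the implementation used by any R or Python package. It assumes numeric features, Gini impurity as the split criterion, and class labels that are not tuples; all function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, feature_ids):
    """Return the (feature, threshold) pair with the lowest weighted Gini, or None."""
    best, best_score = None, float("inf")
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # this threshold does not separate anything
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow_tree(rows, labels, m, rng):
    """Grow an unpruned tree; every node looks at m randomly selected features."""
    if len(set(labels)) == 1:
        return labels[0]  # pure leaf
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), m))
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # no usable split: majority leaf
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], m, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], m, rng))

def predict_tree(node, row):
    """Walk from the root to a leaf; internal nodes are tuples, leaves are labels."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def random_forest(rows, labels, ntree, m, seed=0):
    """Grow ntree trees, each on N cases drawn at random with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of size N
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx], m, rng))
    return forest

def predict_forest(forest, row):
    """Aggregate the trees by majority vote."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]

# Toy data: two well-separated groups with M = 2 input variables.
rows = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.3),
        (5.0, 5.0), (5.2, 4.8), (4.9, 5.1), (5.1, 5.3)]
labels = ["a"] * 4 + ["b"] * 4
forest = random_forest(rows, labels, ntree=25, m=1)
print(predict_forest(forest, (0.5, 0.5)))
print(predict_forest(forest, (6.0, 6.0)))
```

For regression, the majority vote in predict_forest would be replaced by an average of the tree outputs, and the split criterion by a variance-based measure.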
The graph below shows the popularity of R and Python between 2013 and February 2015 according to the TIOBE Index.
Additionally, the RedMonk ranking shows the relative performance of R and Python on GitHub and Stack Overflow for September 2012 and January 2013, 2014 and 2015.
When it comes to opportunities, scope and salary, Python leads in opportunities and scope, while R beats Python by a wide margin in salary. With Python you can work in innumerable fields of computer science, whereas R is a language developed for researchers and skilled programmers. Its sophisticated and complex nature has resulted in highly paid jobs.
According to the 2014 Dice Tech Salary Survey, the average salary for high-paying skills and experience is $115,531 for an R programmer and $94,139 for a Python programmer.
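To put the two survey figures in proportion, the gap can be worked out with a few lines of Python. The variable names are hypothetical; the only inputs are the two salaries quoted above.

```python
# Average salaries quoted from the 2014 Dice Tech Salary Survey.
r_salary = 115_531
python_salary = 94_139

gap = r_salary - python_salary            # absolute difference in dollars
gap_pct = 100 * gap / python_salary       # difference relative to the Python figure

print(f"R premium: ${gap:,} ({gap_pct:.1f}% above Python)")
# R premium: $21,392 (22.7% above Python)
```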
Now let us discuss certain features common to both R and Python:
Both R and Python are open source and freely available for download, unlike commercial statistical software such as SAS and SPSS.
Advanced Tools- Many newly developed statistical methods appear first in open-source R packages and, to a lesser extent, in Python as well.
Online Communities and Portals- In contrast to the paid customer support for commercial software, R and Python have various online communities that provide support to their users.
The Bottom Line
No matter what you choose to learn first, Python or R, each language has its own set of pros and cons for different situations and tasks. With CoolaData, you can now connect using Python or R in order to build statistical and advanced data mining models. The connection is made by addressing CoolaData’s servers through the simple Query API. You can then run CoolaSQL, a behavioral extension to the SQL language, in the form of a cohort, path or funnel to retrieve sample data for training your models.