I am a big fan of Excel. It has been one of the biggest productivity-boosting technologies of recent times. But it has its limitations.
Learning to code, like learning any language, is both challenging and rewarding. It takes time, effort and perseverance.
Learning R is no different. The initial learning phase can be daunting and there are many moments when, deadlines looming, you abandon ship and return to the safety of Excel. But once you build up confidence, you’ll find the benefits are huge. Tasks can be done much faster. New types of analysis are possible. Repetitive tasks are automatable and workflows can be shared with colleagues.
Here are 6 (of many) reasons to take the leap from Excel to R. (PS. it’s also free!!)
1. Cleaning data
Transforming messy raw data into usable clean processed data is generally the first data analysis task of a new project. There are many subtasks - merging different datasets together, checking for missing or erroneous data, reshaping tables, reformatting and transforming variables.
Most of these are just about possible in Excel. Merging datasets generally involves getting very friendly with the VLOOKUP() function. Searching for missing values involves a lot of filtering and sorting. Reshaping data is difficult, usually involving a lot of playing about with pivot tables.
In R these tasks are all straightforward. Packages from the tidyverse contain clean, well-written functions, and each of these tasks can be neatly chained into a logical workflow.
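As a flavour of what that looks like, here is a minimal sketch of a tidyverse cleaning workflow. The file names and column names (sales_by_quarter.csv, regions.csv, region_id, q1–q4, revenue) are hypothetical, stand-ins for whatever your raw data happens to contain.

```r
library(dplyr)
library(tidyr)
library(readr)

# Read the two raw extracts (file and column names here are hypothetical)
sales   <- read_csv("sales_by_quarter.csv")   # one column per quarter: q1, q2, q3, q4
regions <- read_csv("regions.csv")            # lookup table: region_id, region_name

clean <- sales %>%
  left_join(regions, by = "region_id") %>%    # merge datasets (no VLOOKUP needed)
  pivot_longer(q1:q4,                         # reshape wide quarterly columns into tidy rows
               names_to = "quarter",
               values_to = "revenue") %>%
  filter(!is.na(revenue)) %>%                 # drop rows with missing revenue
  mutate(quarter = toupper(quarter))          # reformat a variable
```

Each verb does one job, and the whole pipeline reads top to bottom like a recipe.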
2. Automating tasks
If you receive the same messy raw data periodically, you might want to automate the data cleaning process. Coding in general allows many repetitive tasks to be automated. At best this can be achieved with VBA in the Excel environment, but it's not ideal.
At DS Analytics we have a whole host of R scripts that we recycle to clean, model and visualise data. Once written, they can be executed with a single command.
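To illustrate the idea (not one of our actual scripts), a hypothetical cleaning script, clean_data.R, might look like this; the file paths and column names are assumptions for the example.

```r
# clean_data.R - re-run whenever a new raw extract arrives
library(readr)
library(dplyr)

raw <- read_csv("data/raw/latest_extract.csv")

clean <- raw %>%
  distinct() %>%                       # drop duplicate rows
  filter(!is.na(customer_id)) %>%      # remove records with no customer id
  mutate(date = as.Date(date))         # standardise the date column

write_csv(clean, "data/clean/latest_clean.csv")
```

From the command line (or a scheduler such as cron) the whole thing runs with `Rscript clean_data.R`, so refreshing the data becomes a one-line job rather than an afternoon of clicking.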
3. Sharing workflows
Most analysts, at one time or another, will have been on the receiving end of a bad project handover: a myriad of miscellaneous .xls and .csv files, the former filled with hideously long formulae and conditional formatting. Generally it is not possible to trace back the stages of the analysis.
This agony can be relieved by replacing these messy workflows with a series of well commented and documented R scripts. Workflows can be packaged up into custom-built R functions, which themselves can be wrapped up into your own internal R package.
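As a small, hypothetical illustration of what "packaged up" means in practice, a workflow step can be wrapped as a documented function (using roxygen2-style comments), ready to live in an internal package; the function name, columns and rules below are assumptions, not a real internal function.

```r
#' Clean a raw sales extract
#'
#' Illustrative example of wrapping a workflow step in a documented function.
#'
#' @param raw A data frame read straight from the raw extract.
#' @return A tidy data frame ready for modelling.
clean_sales <- function(raw) {
  raw |>
    dplyr::distinct() |>                 # drop duplicate rows
    dplyr::filter(!is.na(revenue)) |>    # remove records with missing revenue
    dplyr::mutate(date = as.Date(date))  # standardise the date column
}
```

Whoever inherits the project can read the documentation, call the function, and see exactly what was done to the data.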
4. Building statistical models
You can just about build models in Excel, but it isn't pretty. Without doubt it is not the right tool for that job. Whatever kind of modelling you're doing, whether correlation analysis, analysis of variance, regression modelling or cluster analysis, you're better off in the land of R. There are a clutch of other programs you could turn to - Stata, SPSS, SAS or EViews - but they are all fast becoming obsolete. The only other contender worth considering when it comes to data modelling is Python (which is in fact complementary to R, rather than a competitor).
In R you can build every conceivable type of model, from a simple linear regression to the most intricate algorithms.
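The modelling tasks mentioned above are each a line or two of base R. A quick sketch using R's built-in datasets:

```r
# Correlation analysis
cor(mtcars$mpg, mtcars$wt)

# Linear regression
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)

# Analysis of variance
anova_fit <- aov(weight ~ group, data = PlantGrowth)
summary(anova_fit)

# k-means cluster analysis
clusters <- kmeans(scale(iris[, 1:4]), centers = 3)
table(clusters$cluster, iris$Species)
```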
5. Handling larger datasets
When you've got datasets with more than ~50k rows, Excel starts to splutter to a halt. I can't count the number of times I've accidentally opened a dataset too large for Excel to handle and had to watch it freeze to death.
On most personal laptops R is comfortable with millions of rows of data. When data gets very large (exceeding your laptop's RAM), there is an increasing number of options available. There are packages for efficient memory usage and parallelisation. And it's now relatively painless for a non-computer scientist to fire up high-memory virtual machines with R pre-installed, through cloud services like Google Cloud Platform and Amazon Web Services.
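One of the memory-efficient packages alluded to above is data.table; a minimal sketch (the file name and columns are hypothetical):

```r
library(data.table)

# Fast, memory-efficient read of a large csv
dt <- fread("big_extract.csv")

# Aggregate millions of rows in memory, grouped by region
summary_dt <- dt[, .(total_revenue = sum(revenue, na.rm = TRUE)), by = region]
```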
Very, very large data will be stored in big data platforms such as Hadoop or Spark. These are often designed to work with R, with easy-to-use APIs to either pull subsets of the data out of them, or to push R code to them.
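For Spark, the sparklyr package is one such API: you write ordinary dplyr-style code and the heavy lifting happens inside the cluster. A sketch, assuming Spark is available (locally or on a cluster) and a hypothetical sales_df data frame and its columns:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy a local data frame into Spark (use spark_read_csv() for data already in the cluster)
sales_tbl <- copy_to(sc, sales_df, "sales")

sales_tbl %>%
  group_by(region) %>%
  summarise(total = sum(revenue)) %>%   # this aggregation runs inside Spark, not in R's memory
  collect()                             # pull only the small summarised result back into R

spark_disconnect(sc)
```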
6. Creating better data visualisations
Excel plots have got a lot better in the latest versions, and sometimes they're right for the job at hand. But if you want to produce dynamic, highly customised plots, perhaps within a web application available to users across your organisation, you'll need another solution.
One route would be to buy an expensive licence for a dashboarding program, e.g. Tableau or Power BI. While these are extremely powerful tools, they come with drawbacks. Often visualisation is a subtask nested within a series of others.
For example your workflow might be:
Read in raw data >
Clean data >
Transform variables >
Visualise with simple plots >
Run statistical tests >
More visualisation >
Build models >
More visualisation >
Write a report with the analysis results and visualisations, or build a data app so others can use the information
Drag-and-drop tools like Tableau and Power BI can only do elements of this workflow, so you'll have to keep jumping in and out of different programs. All of the above can be done in R, including writing up tidy reports (R Markdown) or building data apps (Shiny).
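For the visualisation steps, ggplot2 covers everything from quick exploratory plots to highly customised charts, and the same plots drop straight into an R Markdown report or a Shiny app. A quick example using a built-in dataset:

```r
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE) +      # one fitted line per cylinder group
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders",
       title = "Fuel efficiency vs weight")
```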
A great many organisations have switched to open source software like R and Python. Not everyone in your organisation, or even within your analytics team, needs coding and R skills. But having some in-house R skills can greatly improve analytical efficiency.
Where next? Read a previous blog on 8 Skills Every Data Analytics Team Should Have
DS Analytics is a data science consultancy. We provide clients with support, mentoring and training, and we help companies embed data science capabilities within their organisations. We offer R support and training, as well as training in other data science tools and techniques.
Get in touch to find out more!