You've probably heard the buzzwords: "Big Data" and "Artificial Intelligence" have become well-known topics in science and technology, promising to change our lives, mostly for the better. At first exposure, this sounds like something out of science fiction, bringing to mind movies like 'I, Robot', 'The Terminator', and '2001: A Space Odyssey'. Although we're well on our way to realizing the technology behind some of these concepts, at a basic level this stuff is relatively easy to understand.
I'm going to start with the basics of data wrangling using R and RStudio, then slowly progress to some of the more advanced topics, such as machine learning and the application of these concepts to real-world problems.
Before we begin, you'll need to download and install the free software we're going to use!
- R - Download and install R (base) for your OS.
- RStudio (Desktop) - Download and install RStudio (after the installation of base R is complete).
Once this is done, launch RStudio and you'll be up and running.
Welcome to the R 'ecosystem'
At some point, I'll add more info about the history of R, but for now, suffice it to say that it's a high-level statistical programming language that is free and open source. It is widely used for statistical and data analysis in academia as well as industry.
There is a large number of users, so searching Stack Overflow to help solve similar problems will often yield good results. Furthermore, there is a large number of packages which have been contributed by the R community, meaning that there's a good chance that someone has developed a package or function to help you with your analysis.
Useful Packages to Get Started
Here are a few packages that I couldn't work without:
ggplot2: A very powerful & flexible plotting package.
data.table: A powerful extension of the data.frame. This one is definitely worth learning for significant performance gains and easy data manipulation.
ProjectTemplate: A package for creating a nice hierarchy for organizing your projects.
I use the first two extensively; however, from there it largely depends on the analysis I'm trying to perform.
To install packages, you can type the following at the R console:
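As a minimal sketch, assuming you want ggplot2 from the list above (any other CRAN package name should work the same way):

```r
# Download and install a single package from CRAN
# (ggplot2 is only an example; substitute the package you need)
install.packages("ggplot2")

# Once installed, load the package into the current session
library(ggplot2)
```

Installation only needs to happen once per machine, while library() has to be called in every new session.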
or, for a list of packages:
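One way this might look, using sapply to call install.packages on each name in turn. The variable name pkgs is just an illustrative choice:

```r
# Names of the packages to install (the three mentioned above)
pkgs <- c("ggplot2", "data.table", "ProjectTemplate")

# Call install.packages() once for each element of pkgs;
# invisible() hides the return value that sapply() would otherwise print
invisible(sapply(pkgs, install.packages))
```

Note that install.packages also accepts a character vector directly, so install.packages(pkgs) should give the same result without the explicit loop.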
That last one will look awfully unfamiliar to those of you used to Python, C, or most other programming languages. R has a family of apply functions (lapply, sapply, and others), which loop over the values in a vector or list and apply a specified function to each one.
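To make the idea concrete, here is a small sketch comparing sapply and lapply with the explicit for loop you might write in another language. The vector x and the squaring function are made up purely for illustration:

```r
# A small numeric vector to loop over
x <- c(1, 2, 3, 4)

# sapply() calls the function on each element and simplifies the result to a vector
squares <- sapply(x, function(v) v^2)

# lapply() does the same work but always returns a list
squares_list <- lapply(x, function(v) v^2)

# The equivalent explicit for loop, for comparison
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# squares and squares_loop both hold the values 1, 4, 9, 16
print(squares)
```

The apply version is shorter, and it avoids having to set up the result vector and index by hand.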
RDataMining.com: Has some great reference documents, including examples and a quick reference card.
R-Bloggers.com: A site that pulls together R news and blog posts from more than 500 R users and bloggers across the internet.
Another thing to consider when using R for analysis is version control and tracking. Try checking out Git and GitHub, or Atlassian's Bitbucket (which also supports Git).
Keep in mind the help files within R often include examples, demos, vignettes, and package manuals. All of these things can be very helpful when learning to use R.
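A few of the ways to reach that built-in help from the console are sketched below. sapply and the stats package are used only as examples, since both ship with base R:

```r
# Open the help page for a function (these two lines are equivalent)
?sapply
help("sapply")

# Run the examples from a function's help page
example(sapply)

# List the vignettes and demos available in your installed packages
vignette()
demo()

# Show the index of help topics for a whole package
help(package = "stats")
```

In RStudio, the help pages open in the Help pane rather than in the console.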
I'm going to leave it there for now, but I'll be adding to this post in the near future.