1 The R Ecosystem

R is an open-source, community driven platform available for a wide variety of computer operating systems and is becoming a de facto standard in modern biological analyses. It is not a point-and-click interface, rather it is a rich analytical ecosystem that will make your research and scientific life more pleasurable.

1.1 Getting R Configured

The grammar of the R language was derived from another system called S-Plus. S-Plus was a proprietary analysis platform developed by AT&T and licenses were sold for its use, mostly in industry and education. Given the closed nature of the S-Plus platform, R was developed with the goal of creating an interpreter that could read grammar similar to S-Plus but be available to the larger research community. The use of R has increased dramatically due to its open nature and the ability of people to share code solutions with relatively little barriers.

The main repository for R is located at the CRAN Repository, which is where you can download the latest version. It is in your best interests to make sure you update the underlying R system, changes are made continually (perhaps despite the outward appearance of the website).

The current version of this book uses version 3.2. To get the correct version, open the page and there should be a link at the top for your particular computing platform. Download the appropriate version and install it following the instructions appropriate for your computer.

1.1.1 Packages

The base R system comes with enough functionality to get you going. By default, there is only a limited amount of functionality in R, which is a great thing. You only load and use the packages that you intend to use. There are just too many packages to have them all loaded into memory at all times and there is such a broad range of packages, it is not likely you’d need more than a small fraction of the packages during the course of all your analyses. Once you have a package, you can tell R that you intend to use it by either

library(package_name)

or

require(package_name)

They are approximately equivalent, differing in only that the second one actually returns a TRUE or FALSE indicating the presence of that library on you machine. If you do not want to be mocked by other users of R, I would recommend the library version—there are situations where require will not do what you think you want it to do (even though it is a verb and should probably be the correct one to use grammatically).

There are, at present, a few different places you can get packages. The packages can either be downloaded from these online repositories and installed into R or you can install them from within R itself. Again, I’ll prefer the latter as it is a bit more straightforward.

1.1.2 CRAN

The main repository for packages is hosted by the r-project page itself. There are packages with solutions to analyses ranging from Approximate Bayesian Computation to Zhang & Pilon’s approaches to characterizing climatic trends. The list of these packages is large and ever growing. It can be found on CRAN under the packages menu. To install a package from this repository, you use the function

install.packages("thePackageName") 

You can see that R went to the CRAN mirror site (I use the rstudio one), downloaded the package, look for particular dependencies that that package may have and download them as well for you. It should install these packages for you and give you an affirmative message indicating it had done so.

At times, there are some packages that are not available in binary form for all computer systems (the rgdal package is one that comes to mind for which we will provide a work around later) and the packages need to be compiled. This means that you need to have some additional tools and/or libraries on your machine to take the code, compile it, and link it together to make the package. In these cases, the internet and the documentation that the developer provide are key to your sanity.

For packages on CRAN, you can also keep them updated to the latest version using the command:

update.packages(ask=FALSE)

I recommend using the ask=FALSE option on this one as it can be very tedious to respond to a “Y/N” question for each library you may need to update (there could be hundreds).

1.1.3 GitHub

There are an increasing number of projects that are being developed either in the open source community or shared among a set of developers. These projects are often hosted on http://www.github.com where you can get access to the latest code updates. The packages that I develop are hosted on Github at (http://github.com/dyerlab) and only pushed to CRAN when I make major changes.

To install packages from Github you need to install the devtools library from CRAN first

install.packages("devtools")
library(devtools)

Then you need to use the devtools function install_github() to go grab it. To do so you need two separate pieces of information, the name of the developer who is creating the repository and the name of the repository it is contained within. For the gstudio package, the develop is dyerlab and the repository name is gstudio. If you are comfortable with git, you can also check out certain branches of development if you like with optional arguments to the install_github() function. Here is how you would install both the gstudio and popgraph libraries I use in this book.

install_github("dyerlab/gstudio")
install_github("dyerlab/popgraph")

If you are starting to work with R and intend to compile packages, there are some tools that you will need to have on your machine. For Windows machines, there is an rtools download EXE that you can get. On OSX, you need to download the developers tools from Apple (available on the Application Store for free). In either case, the developer (if they care about their code being used by others) should provide sufficient documentation. If you are working on Windows, I do not know how that system (and never have used it for any prolonged period of time) so you’ll have to look to the internet on where to find compilers and other developer tools.

In a way similar to that for the CRAN packages, you can update existing github packages using:

library(devtools)
update_packages()

It will ask you if you really want to do this (which has three options: ‘Yup’, ‘No’, and ‘I forget’) and then proceed.

1.1.4 Bioconductor

The last common location to find packages for R is on the Bioconductor site, a collection of software for bioinformatic analyses. To install from the bioconductor site you need to download their own installer as:

source("http://bioconductor.org/biocLite.R")
biocLite()

And then to install packages you use

biocLite(c("GenomicFeatures", "AnnotationDbi"))

There are no libraries used in this text from bioconductor as this text but if you shoudl do those kinds of analyses, it will be helpful to see what they have available.

1.1.5 Troublesome Packages

Some packages provide a unique set of problems for getting them onto your computer. There are a variety of reasons for this and Google is your friend (though it would be easier if R was named something more unique than the 18th letter of the alphabet…).

1.1.5.1 RGDAL & RGEOS

Every time I upgrade in any significant way, two R libraries seem to raise their ugly heads and scream like a spoiled child—rgdal and rgeos. Why do these two have to be SOOOO much of a pain? Why can’t we have a auto build of a binary with all the options in it for OSX? Who knows? I always feel like I get the fuzzy end of the lollipop with these two. Here is my latest approach for getting them going.

First you have to make sure you have the latest GDAL libraries. I used to get mine from Kyngchaos, just download the framework, install it, and then do some kind of long R CMD INSTALL dance, which seems to no longer work for me. I also tried installing from Ripley’s repository and found that:

  1. It was a version older than the one I already had on my machine, and
  2. You can’t install from that repository, there is a malformed header and the install.packages() function just barfs.

Time to try something new. I typically stay away from the various installer frameworks out there on OSX to keep everything in Frameworks. But this time, I used MacPorts. You can find the latest version here. Here is how I got it to help me out.

  • Download XCode from Apple, it is both free and has a lot of tools that make your life as a programmer easier. It is a big package and you’ll have to install the command line developer tools as well. You will be prompted on how to do this.
  • Downloaded the version of macports for your OS, I’m currently on 10.11 and installed it with little problems. It takes a bit of time at the end of the installation because it is downloading a lot of information. Be patient.
  • In the terminal, I updated it sudo ports -v selfupdate and again be patient, it is going to grab a lot of stuff from the internet.
  • I then used it to install gdal as a unix library (rather than as a framework so it won’t be located in /Library/Frameworks) by sudo ports install gdal. There were a lot of dependencies for this one so it took a while.
  • I then had R install rgdal as install.packages( rgdal, type="source")

Worked like a charm.

1.2 Libraries Used

This work requires several libraries that you may need to get from either CRAN or GitHub. The following if you run the following code, you should be up-to-date on the necessary packages used throughout this text.

Table 1.1: R packages used in the examples shown in this book.
Name Title
DescTools Tools for Descriptive Statistics
devtools Tools to Make Developing R Packages Easier
GGally Extension to ‘ggplot2’
ggplot2 Create Elegant Data Visualisations Using the Grammar of Graphics
gstudio Tools Related to the Spatial Analysis of Genetic Marker Data
MASS Support Functions and Datasets for Venables and Ripley’s MASS

1.3 The RStudio Environment

By itself, R can be used in the terminal or as the basic interface the installer provides. While both of these methods are sufficient for working in R, they are less than optimal (IMHO). I believe one of the most important tools you can use is a good IDE as it helps you organize and work with your data in ways that focus more on the outcome and less on the implementation. I’ve spent a lot of time in both Emacs and Vim and while they may be tools for the geek elite, RStudio has made me much more productive in terms of output per unit time.

The RStudio IDE can be can be downloaded directly from (http://rstudio.org) and comes in varieties for desktops and servers. The packages are easy to install and contain all the instructions you need. If you run your own server, you can install it and do your analyses via a web interface that looks identical to the desktop client (in fact it is more similar than you perhaps know).

The interface has four panes, two of which (the source code editor and the terminal output) you will work with the most often. The other panes have information on the history of your project, plot output, packages, help, command history, a list of all variables, and a file browser (in the image above, I have minimized the pane with the “Environment” and “History” tabs on it to make more room for the code editor).

You can type in R commands in the terminal window and receive responses directly, just as normal. However, there are several alternative ways you can extract use R including:

  1. Code can be entered into a script file (ending in *.R) and run them directly by submitting that file to the R interpreter. This kind of file only has code (and comments) in it.
  2. A Markdown (*.Rmd) or LaTeX (*.Rwd) file that mixes code and non-code documenation together. This text is acutally written in RMarkdown and mixes together the word-processing parts and the analytical code in a single file. This is a very important and useful approach. If you are a serious user of R, you can use LaTeX or RMarkdown to make documents, slides, presentations, posters, etc. It is a versatile tool and one worth looking into, especially as it pertains to reproducible research (something we all need to strive for).