5 Programming
One of the strengths of R as an analysis platform is that it is a language rather than a program. With programs, such as SPSS & JMP, you are limited by the functionality that the designers thought would be necessary to meet the broadest audience. In R, you can rely upon simple functions or you can create entire analysis and simulation programs de novo. To do this, we need to dig into flow control and decision making processes, both of which you need for doing more in-depth programming.
5.1 Function Writing
Here we look at how to create an R function. Writing small functions like this is a huge benefit to you as an analyst and this is a great place to start. A function in R is defined as:
function_name <- function( arguments ) { Stuff you want the function to do }
You define a function name for whatever you like (given you use the names) and assign it the stuff to the right. In R, the function named function()
is a special one, it tells R that you are about to create a little routine and you want that set of code to be available to you for later use under the name of whatever you named it. This allows a tremendous amount of flexibility as you develop your own set of routines and analyses for your work. The part that actually does stuff is after the function call. It may be that the function that you create does need some data (those are the arguments) or you may not need any input in to the function (in which you pass no arguments). It all depends upon what you are creating.
The key to understanding functions is that they are encapsulations of code—a shortcut for a sequence of instructions if you will not have to type over and over again. The less typing you do, the lower the probability that you will have errors (and all code has errors).
Here is an example of some code that I’m going to develop into a function. This function will allow me to determine if one genotype could possibly be the offspring of the other genotype.
library(gstudio)
loc1 <- locus( c(128,130) )
loc2 <- locus( c(128,128) )
cat( loc1, loc2 )
## 128:130 128:128
We start out with two loci, a 128:130
heterozygote and a 128:128
homozygote. These may represent repeat motifs at a microsatellite locus or some other co-dominant genotype. First, I’ll break the locus into a vector of genotypes.
off.alleles <- alleles( loc1 )
off.alleles
## [1] "128" "130"
mom.alleles <- alleles( loc2 )
mom.alleles
## [1] "128" "128"
To be a valid potential offspring there should be at least one of the alleles in the parent that matches the allele in the offspring. The intersect()
function returns the set of values common to both vectors.
shared <- intersect( off.alleles, mom.alleles )
shared
## [1] "128"
If it has at least one of the alleles present (it could have both if parent and offspring are both the same heterozygote) then you cannot exclude this individual as a potential offspring. If there are no alleles in common, then the value returned is an empty vector.
loc3 <- locus( c(132,132))
dad.alleles <- alleles( loc3 )
intersect( mom.alleles, dad.alleles )
## character(0)
This logic can be shoved into a function. You have to wrap it into a set of curly brackets. I use the length of the result from the intersect() to return from the function. Potential values for
potential_offspring <- function( parent, offspring ) {
off <- alleles( offspring )
par <- alleles( loc2 )
shared <- intersect( off, par )
return( length( shared ) > 0 )
}
Now, you can call this function anytime you need, just passing it two genotypes. If they can be offspring it returns TRUE, as in the comparison between 128:130 and 128:128 genotypes.
potential_offspring(loc1, loc2)
## [1] TRUE
And it returns FALSE for the comparison between 128:128 and 132:132.
potential_offspring(loc2, loc3)
## [1] FALSE
5.2 Variable Scope
There is a lot more information on writing functions and we will get into that as we progress through the text. However, it is important that I bring this up now. The value assigned to a variable is defined by its scope. Consider the following code
x <- 10
and the function defined as
do_it <- function( x ) {
x <- x + 10
return( x )
}
When I call the function, the variable x that is the argument of the function is not the same variable that is in the environment that I assigned a value of 10. The x in the function argument is what we call “local to that function” in that within the curly brackets that follow (and any number of curly brackets nested within those, the value of x is given whatever was passed to the function.
5.3 Decision Making
We interact with our data in many ways and introspection of the values we have in the variables we are working with are of prime importance. Decision making in your code is where you evaluate your data and make a choice of outcomes based upon some criteria. Here is some example data that we can use as we explore the basics of if()
, if(){} else{}
, and if(){} elif(){} else{}
coding patterns.
5.3.1 The if Pattern
The most basic version of decision making is asking a single question and if the answer is TRUE then do something. The if(){}
function does this and has the form
if( CRITERIA ) {
DO_SOMETHING
}
You pass a logical statement (or something that can be coerced into a logical type) to the function as the CRITERIA and if it evaluates to TRUE
, then the contents of the DO_SOMETHING
are executed. If the value of CRITERIA
is not TRUE
the DO_SOMETHING
is skipped entirely—it is not even seen by the interpreter.
Here we can test this out using the loci defined above along with the is_heterozygote() function. This function takes one or more locus objects and returns TRUE/FALSE if they are or are not a heterozygote.
is_heterozygote( c(loc1, loc2) )
## [1] TRUE FALSE
If we shove that function into the if()
parameters we can use its evaluation of the heterozygous state of the locus to do something interesting, say tell us it is a heterozygote—it is admittedly a contrived example, but hey you try to make easy examples, it is not easy.
if( is_heterozygote(loc1) ){
print("It's a het!")
}
## [1] "It's a het!"
If the is_heterozygote()
function returns a value of FALSE, then the contents of the if()
function (the stuff within the curly brackets is skipped entirely.
if( is_heterozygote(loc2) ){
print("It's a het!")
}
Notice, there was no indication of any of that code inside the curly brackets. The if-else Pattern If there are more than on thing you want to potentially do when making a decision, you can add an else clause after the if pattern. Here if is_heterozygote() returns FALSE, the contents of the else{} clause will be executed. Here is the heterozygote example
if( is_heterozygote(loc1) ) {
cat(loc1, "is a heterozygote")
} else {
cat(loc1, "is a homozygote")
}
## 128:130 is a heterozygote
and the homozygote one
if( is_heterozygote(loc2) ) {
cat(loc2, "is a heterozygote")
} else {
cat(loc2, "is a homozygote")
}
## 128:128 is a homozygote
There is a slightly shorter version of this that is available for the lazy programmer and lets be honest, all programmers are lazy and the more you can accomplish with fewer strokes on the keyboard the better (this is how we got emacs and vim). I generally don’t teach the shortcuts up front, but this one is short and readily apparent so it may be more helpful than confusing. The ifelse()
function has three parts, the condition, the result if TRUE
, and the result if FALSE
.
ans <- ifelse( is_heterozygote( c(loc1, loc2)) , "heterozygote", "Not")
ans
## [1] "heterozygote" "Not"
So iterating through the x vector, the condition x>0
is evaluated and if TRUE the sqrt() of the value is returned, else the NA is given. It is compact and easy to use so you may run into it often.
5.4 The if-else
Pattern
It is possible to test many conditions in a single sequence by stringing together else-if conditions. The point that is important here is that the first condition that evaluates to TRUE will be executed and all remaining ones will be skipped, even if they also are logically TRUE. This means that it is important to figure out the proper order of asking your conditions. Here is an example function that determines if none, one, or both of the genotypes passed to it are heterozygotes. By default, I step through every one of the potential options of available on this comparison.
1. The first is a heterozygote and the second one isn’t 2. The first one isn’t and the second one is 3. Both are heterozygotes 4. The last state (both are not)
Here is the function.
which_is_het <- function( A, B) {
if( is_heterozygote(A) & !is_heterozygote(B) ) {
print("First is heterozygote")
} else if( !is_heterozygote(A) & is_heterozygote(B) ){
print("Second is heterozygote")
} else if( is_heterozygote(A) & is_heterozygote(B) ){
print("Both are heterozygotes")
} else {
print( "Neither are heterozygotes")
}
}
It is possible that the order of these CRITERIA could be changed, the important thing to remember is that the sequence of if - else if - else if etc. will terminate the very first time one of the CRITERIA is evaluated to be TRUE
.
5.5 Flow Control
Flow control is the process of iterating across objects and perhaps doing operations on those objects. The R language has several mechanisms that you can use to control the flow of a script or bit of code.
5.5.1 The for()
Loop
x <- c(3,8,5,4,6)
x
## [1] 3 8 5 4 6
You can iterate through this vector using a for() loop. This is a simple function that has the form:
for( SOME_SEQUENCE ){
DO_SOMETHING
}
Where the SOME_SEQUENCE
component is a sequence of values either specified OR calculated and the DO_SOMETHING
is the thing you want to do with each of the values in the sequence. Usually, there is a variable defined in the SOME_SEQUENCE
component and the value of that variable is used. Here are a few examples. The first goes through the existing vector directly and assigns (in sequential order) the entries of ‘x’ to the variable val. We can then do whatever we want with the value in val (though if we change it, nothing happens to the original x vector).
for( val in x ){
print(val)
}
## [1] 3
## [1] 8
## [1] 5
## [1] 4
## [1] 6
We can also specify a sequence directly and then use it as an index. Here I use an index variable named i to take on the integer seqeunce equal in length to the length of the original x
variable. Then I can iterate through the original vector and use that index variable to grab the value I want.
for( i in 1:length(x)){
print( x[i] )
}
## [1] 3
## [1] 8
## [1] 5
## [1] 4
## [1] 6
Both give us the same output, namely a way to go through the variable x
. However, there may be a need to use the latter approach in your calculations. For example, perhaps I want to do some other operation on the values. In this very contrived example that follows, I want to perform operations on the values in x
depending on if they are even or odd. For the odd ones, I add the corresponding value in y
and if not I subtract it. Sure, this is totally contrived and I cannot think of a reason why I would be doing this, but if I need to know what index (row, column or whatever) an entry is during the iteration process, then I need to use this approach over the for( val in x)
approach.
y <- 1:5
for( i in 1:length(x)){
if( x[i] %% 2)
print( x[i] + y[i])
else
print( x[i] - y[i] )
}
## [1] 4
## [1] 6
## [1] 8
## [1] 0
## [1] 1
5.5.2 Short Circuiting the Loop
It is possible to short circuit the looping process using the keywords next and break, though in my programming style, I consider their use in my source files as evidence of inelegant code. That said, you may need them on occasion.
The next keyword basically stops all commands after that during the current iteration of the loop. It does not terminate the loop itself, it just stops the commands that follow it this time through. Here is an example that uses the modulus operator, %%
(e.g., the remainder after division), to print out only those numbers that are divisible by three.
for( i in 1:20 ){
if( i %% 3 )
next
cat("The value of i =",i,"\n")
}
## The value of i = 3
## The value of i = 6
## The value of i = 9
## The value of i = 12
## The value of i = 15
## The value of i = 18
The use of break to exit the loop entirely is perhaps more commonly encountered. When this keyword is encountered, the loop terminates immediately, as if it reached the send of the sequence.
for( i in 1:10){
if( i > 2 )
break
cat("The value of i=",i,"\n")
}
## The value of i= 1
## The value of i= 2