STA/OPR 9750 Week 3 In-Class Activity: R
, These are your First Steps
Welcome!
Vector Semantics in R
R
works with 3 core principal structures:
- Vectors: an ordered (one-dimensional) collection of the same type
- Numeric (real number / double), Integer, Character, Logical (Boolean)
- Scalars (single values) are just vectors of length 1
- Lists: an ordered collection of arbitrary objects (other vectors, other lists, etc)
- Data Frames: ordered tabular database (think SQL tables)
- Next week, we’ll discuss these in detail
This week, our primary focus is on vectors.
Review of R
In this section, we will review some of the basic ‘built-in’ features of R
. In the next section (Packages
) we will discuss how to add to the “base-bones” functionality. When working with R
, there are two interacting ‘subsystems’ in play:
- The
R
language and interpreter: this is the part ofR
that is similar topython
orC
/C++
. You will writeR
code in theR
language and theR
interpreter will run that code. The fact that all of these elements are calledR
is a bit confusing, but once you get the hang of things, the distinctions will melt away. R
packages: When working inR
, you do not have to start from scratch every time. Other programmers makesets
of code available to you in the form ofpackages
. For our purposes, a package can contain two things:- Pre-written functions to help you achieve some goal
- Data sets Most of the time, the primary purpose of a package is sharing functions and code: there are easier ways to share data with the world.
When you first downloaded R
, you downloaded the interpreter and a set of base packages written by the “R Core Development Team”.
Run the following code to see what your R
environment looks like:
Compare the output of running this here-in the browser-with what you get by running sessionInfo()
on your machine.
There is lots of useful information here, including
- the version of
R
being used - the operating system
- the numerical linear algebra libraries (BLAS, LAPACK) used
- system language and time zone information
- loaded packages
When asking for help, always include the output of the sessionInfo()
command so that your helper can quickly know how your system is configured.
Packages
R
code is distributed as packages, many of which come included with R
by default. These are the base
packages, and they are noted in your sessionInfo()
. But we can do many more things with R
using contributed (non-base) packages!
The most common platform for distributing R
packages is CRAN
, the Comprehensive R Archive Network, available at https://cran.r-project.org/. You have likely already visited this site to download R
. The available.packages()
function in R
lists all packages currently on CRAN
. We can see that there are many:
NROW(available.packages())
[1] 21510
If you want to use a contributed package, you need to do two things:
- Download it from CRAN and install it onto your computer (one time)
- Load it from your hard drive into
R
(every time you restartR
)
The first step - download and install - can be completed using the install.packages()
function. For example, to install the palmerpenguins
package, I would run:
install.packages("palmerpenguins")
This will automatically download and install this package for me. R
is helpful and also tries to automatically install all packages that a given package relies upon. Because of this, it is often sufficient to install the “last step” and trust R
to handle the dependencies automatically. In this course, most of the packages we use can be automatically installed by installing the tidyverse
package.
install.packages("tidyverse")
Note that there really isn’t much in the tidyverse
package we want, but it’s a useful proxy for a much larger set of packages.
Once a package is installed, we need to load it into R
with the library
function:
library(palmerpenguins)
After doing this, we have access to the contents of the palmerpenguins
package until we restart R
.
Note that the install.packages
function wants you to quote its argument, but library
does not. This is a weird historical quirk of R
that you will trip up on many times before this course ends.
install.packages("tidyverse")
library(tidyverse)
For example, palmerpenguins
package provides a data set of penguin measurements. If we try to get the data set without loading palmerpenguins
, we get an error message:
After we install and load palermerpenguins
, we are good to go:
So much tuxedo goodness!
In general, if you get a error message of the form Error: object 'X' not found
, you should:
- Make sure you spelled
X
properly - If
X
comes from a package, make sure youlibrary()
that package.
There’s no harm to library()
-ing a package multiple times; if you install.packages()
a package that you have already loaded, you may need to restart R
.
As mentioned last week, I strongly recommend never saving your workspace in R
or RStudio
. One of the things “saved” in a workspace is the list of loaded packages, so it becomes essentially impossible to reinstall a package properly.
Variables and Assignment
Whenever you type a “word” of R
code, it must be one of three things:
A reserved word: this is a small set of
keywords
thatR
keeps for its own use. These have special rules for their use that we’ll learn as we go along. The main ones are:if
,else
,for
,in
,while
,function
,repeat
,break
, andnext
.If you use one of these words and get a weird error message, it’s likely because you aren’t respecting the special rules for these words.
For the nitty gritty, see the
Reserved
help page but feel free to skip this for now. TheControl
help page gives additional details.(When you run a help page in this tutorial, it looks a bit funny. Try running
?Reserved
directly inRStudio
for better formatting.)A “literal”. This is a word that represents “just the thing” without any additional indirection. The most common types of literals are:
- Numeric: e.g.,
3
,42.0
, or1e-3
- String: e.g.,
'a'
,"beach"
, or'cream soda'
There are a few rules for literals, but the most important is that strings begin and end with the same character, either a single quote or a double quote. WhenR
sees a single quote, it will read everything until the next single quote as one string, even if there’s a double quote inside.
Try some literals:
- Numeric: e.g.,
What does the literal 0xF
represent? (You don’t need to worry about why. This is a fancier literal than we will use in this class.) - A “variable name”. This is the most common sort of “word” in code. It is used to something without actually having to know what it is.
We can create variables using the “assignment” operator: <-
When you read this outloud, read <-
as “gets” so x <- 3
becomes “x gets 3.”
When we use the assignment operator on a variable, it overwrites the value of a variable silently and without warning
We also put expressions on the right hand side of an assignment:
Also note the trick we’ve used here a few times: a “plain” line of code without an assignment generally prints its value.
Vector Data Types
A vector is a ordered array of the same sort of thing (e.g., all numbers or all strings). We can create vectors using the c
command (short for concatenate).
To see the type (sort) of a vector, you can use the str
command.
str(x)
tells us about the structure of x
. Here, we see that x
is a numeric vector of length 3.
R
will try to do the right thing when doing arithmetic on vectors.
When you give R
vectors of different lenghts, it will “recycle” the shorter one to the length of the longer one.
This can be a double-edge sword when the two vectors don’t fit together so nicely:
Here we see also that R
printed a warning message. A warning message is R
’s way of saying “something is funny, but I can still do this” while it (successfully) implements your command. It’s here to help you, but sometimes can be safely ignored if you’re sure about what you’re doing.
An error is a “I can’t do this” message. When R
encounters an error
it stops and does not fully execute the command
Here we get an error because there is no meaningful way to multiply a string by a number, unlike earlier where the recycling rule told R
what to do, even if it was probably a bad idea.
Functions
Functions
In many of these exercises, we have used commands that have the form NAME()
with zero or more comma-separated elements in the parentheses.
This represents a function call. Specifically, the command func(x, y)
calls the function named func
with two arguments x
and y
.
Functions are the verbs of the programming world. They are how anything gets done. So far, we’ve only used some basic functions:
c
: the concatenate functionprint
: the print functionstr
: the structure functionlist
: the list making function
But there are tons of other useful ones!
Try these out: - length
on a vector - colnames
on a data frame (like PlantGrowth
) - toupper
on a string (vector) - as.character
on a numeric value
Arguments: Positional and Keyword
The inputs to a function are called the arguments. They come in two forms: - Positional - Keyword
So far we have only seen positional arguments. The function interprets them in an order that depends on they were given:
Here paste
combines two values into a string. We get different output strings depending on the order of the input.
Other arguments can be passed as keyword arguments. Keyword arguments come with names that tell functions how to interpret them. For example, the paste
function has an optional keyword argument sep
that controls how the strings are combined.
Keyword arguments typically have defaults so you don’t need to always provide them. For the paste
function, the sep
defaults to " "
.
Creating Your Own Functions
When you want to create your own function, you use a variant of the assignment structure
<- function(x, y) {
my_addition + y
x }
Let’s break this into pieces:
- On the left hand side of the assignment operator
<-
, we see the function name. This works exactly the same as vector assignment. - Immediately to the right of the assignment operator, we see the keyword
function
. This tellsR
that we are defining a function. - After the word function, we see the “argument list”, i.e., the list of inputs to the function (comma separated). Here, we are not providing default values for any function.
- Finally, between the curly braces, we get the body of the function. This is actually the code defining a function’s behavior. You can do basically anything here! Define variables, do arithmetic, load packages, call other functions - it’s all valid. (In fact, you can even define a function within a function, but that’s sort of advanced.)
- The last line of the body (here the only line) defines the return value of the function, i.e., its output.
This function will add two numbers together. Now that we’ve defined it, we can use it just like a built-in function:
my_addition(2, 4)
[1] 6
Tip: You can see the code used to define any function by simply printing it: think of the code as being the “value” and the function name as the variable name. (This isn’t actually just a metaphor - it’s literally true!)
Default Arguments
Sometimes, we want functions to have default but changeable behvaior. This is default arguments come in. If the user provides a value, the function uses it, but otherwise the default is used.
For example,
<- function(x, by=2){
make_bigger + by
x
}
make_bigger(3)
[1] 5
make_bigger(3, by = 3)
[1] 6
Here by
defaults to 2
, but the user is required to supply x
because it has no default.
make_bigger(by=3)
Error in make_bigger(by = 3): argument "x" is missing, with no default
There are lots of details in the mechanics - they even can be ‘dynamic’ using some tricks - but in general, they should “just work.”
Control Flow
So far, all of the code we have written executes linearly, one line at a time. To write complex programs, however, we sometimes need code to execute in other ways: e.g., going line by line through a complex data set running the same code (a “loop”) or doing different things depending on the value of a variable (a “conditional”). This brings us to the topic of control flow, or how a program gets executed.
Conditionals
Perhaps the most common control flow operator is the “conditional” - the if
operator. In R
, the if
operator looks like this:
if(TEST){
# Some code goes here
# This gets run if `TEST` is true
else {
} # Some code goes here
# This gets run if `TEST` if false
}
For example
<- 3
x if(x > 0){
print("x is positive!")
else {
} print("x is negative")
}
[1] "x is positive!"
The element inside the if
(the test statement) should ideally be Boolean (TRUE/FALSE-ish) but R
will make a reasonable guess if it isn’t.
Note that you can omit the else
part and the second set of braces that go with it, but the first set of braces, immediately after if()
, should always be there.
Change the value of x
and see what happens. Next, modify this by adding an else
statement to handle the case of odd numbers.
Note that we’re using the %%
operator here. If you haven’t seen it before, recall you can get help by typing
?
%%
in R
. In this case, %%
is a modulo
operator; that is, it is the “remainder” from division. (Do you see how it works here?)
We’ll practice using conditional operators below.
Looping
In other ## Programming Exercises
Write functions to perform each of the following tasks.
- Write a function that takes in a vector of numbers, calculates the length and maximum value of the vector, and prints that information to the screen in a formatted way.
> func_1(c(1, 2, 3, 5, 7))
The largest value in that list of 5 numbers is 7.
> func_1(c(1, 2, 5, 5))
The largest value in that list of 4 numbers is 5.
To make your output as attractive as possible, you might want to use the cat
command instead of the print
command.
Write a program that tests whether its (integer) outputs are leap years. Recall the leap year rules:
- A year is a leap year if it is divisible by 4
- But it is not a leap year if it is divisible by 100
- Unless it is also divisible by 400
> leap_year(2023)
FALSE
> leap_year(2024)
TRUE
> leap_year(2100)
FALSE
> leap_year(2000)
TRUE
Remember our discussion of the %\%
operator from class.
Write a function to greet your classmates with varying levels of enthusiasm. It should have three optional arguments:
name
. The name of the person to greet. Default"friend"
times
. The number of times to repeat the greeting.emphasis
. A Boolean (TRUE/FALSE) value indicating whether the greeting should end with an exclamaition point. (DefaultFALSE
)
> greetings()
Hello, friend
> greetings(name="Michael")
Hello, Michael
> greetings(times=2)
Hello, friend
Hello, friend
> greetings(emphasis=TRUE)
Hello, friend!
> greetings("Michael", 5, TRUE)
Hello, Michael!
Hello, Michael!
Hello, Michael!
Hello, Michael!
Hello, Michael!
- The Riemann Zeta Function is a famous function in analytic number theory1 defined as \[\zeta(k) = 1 + \left(\frac{1}{2}\right)^k + \left(\frac{1}{3}\right)^k + \dots = \sum_{i=1}^{\infty} i^{-k} \] We cannot implement an infinite series in
R
, but we can get very close by taking a large number of terms in the series (e.g, the first 500,000). Implement the zeta function and show that \(\zeta(2) = \frac{\pi^2}{6}\)
> zeta(2)
[1] 1.644932
> zeta(3)
[1] 1.202057
> zeta(4)
[1] 1.082323
> all.equal(zeta(2), pi^2/6, tol=1e-4)
[1] TRUE
Hero of Alexandria developed a method for computing square roots numerically. He showed that by performing the following update repeatedly, \(x\) will converge to \(\sqrt{n}\): \[x \leftarrow \frac{1}{2}\left(x + \frac{n}{x}\right)\] You can start with any positive \(x\), but \(n/2\) is a good choice.
Implement this method to compute square roots. Use an optional keyword argument (default value 100) to control how many iterations are performed:
> hero_sqrt(100)
[1] 1.644932
> hero_sqrt(3)^2
[1] 3
> hero_sqrt(3, iter=2)
[1] 1.732143
> hero_sqrt(3000)
[1] 54.77226
Footnotes
ANL is basically the application of calculus techniques to prove properties of prime numbers: it’s a surprisingly powerful approach.↩︎
Comments
When you include a
#
symbol,R
will ignore everything after it. This is called a comment and you can use it to leave notes to yourself about what you are doing and why.