STA 9750 Week 3 In-Class Activity: R, These are your First Steps

Welcome!

In this week’s in-class activities, we are going to dive a bit deeper into R’s basic data structure: the vector.

Vectors in R

R works with three principal data structures:

  • Vectors: an ordered (one-dimensional) collection of the same type
    • Numeric (real number / double), Integer, Character, Logical (Boolean)
    • Scalars (single values) are just vectors of length 1
  • Lists: an ordered collection of arbitrary objects (other vectors, other lists, etc)
  • Data Frames: ordered tabular database (think SQL tables)
    • Next week, we’ll discuss these in detail

This week, our primary focus is on vectors.

Review of R

In this section, we will review some of the basic ‘built-in’ features of R.
In the next section (Packages) we will discuss how to add to the “bare-bones” functionality. When working with R, there are two interacting ‘subsystems’ in play:

  • The R language and interpreter: this is the part of R that is similar to Python or C/C++. You will write R code in the R language and the R interpreter will run that code. The fact that all of these elements are called R is a bit confusing, but once you get the hang of things, the distinctions will melt away.
  • R packages: When working in R, you do not have to start from scratch every time. Other programmers make sets of code available to you in the form of packages. For our purposes, a package can contain two things:
    • Pre-written functions to help you achieve some goal
    • Data sets

    Most of the time, the primary purpose of a package is sharing functions and code: there are easier ways to share data with the world.

When you first downloaded R, you downloaded the interpreter and a set of base packages written by the “R Core Development Team”.

Run the following code to see what your R environment looks like:
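
A minimal version, assuming nothing beyond base R, is a single call to sessionInfo(). The variable name info is just a placeholder so that individual pieces can be pulled out afterwards:

```r
info <- sessionInfo()   # collect details about the current R session
print(info)             # the full report: R version, OS, BLAS/LAPACK, locale, packages

info$R.version$version.string   # just the R version line
info$basePkgs                   # the base packages attached in this session
```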

Compare the output of running this here, in the browser, with what you get by running sessionInfo() on your machine.

There is lots of useful information here, including

  • the version of R being used
  • the operating system
  • the numerical linear algebra libraries (BLAS, LAPACK) used
  • system language and time zone information
  • loaded packages

When asking for help, always include the output of the sessionInfo() command so that your helper can quickly know how your system is configured.

Packages

R code is distributed as packages, many of which come included with R by default. These are the base packages, and they are noted in your sessionInfo(). But we can do many more things with R using contributed (non-base) packages!

The most common platform for distributing R packages is CRAN, the Comprehensive R Archive Network, available at https://cran.r-project.org/. You have likely already visited this site to download R. The available.packages() function in R lists all packages currently on CRAN. We can see that there are many:

The list of available packages is long

available.packages()[,"Package"]
 [1] "AalenJohansen" "aamatch"       "AATtools"      "ABACUS"       
 [5] "abasequence"   "abbreviate"    "abc"           "abc.data"     
 [9] "ABC.RAP"       "ABCanalysis"   "ABCDscores"    "abclass"      
[13] "ABCoptim"      "abcrf"         "abcrlda"       "abctools"     
[17] "abd"           "abdiv"         "abe"           "aberrance"    
 [ reached 'max' / getOption("max.print") -- omitted 22750 entries ]

As of the time of writing, CRAN contains approximately 22770 different freely available packages, and it continues to grow daily. A full list of CRAN packages can be found here. In addition to all of these, there are even more packages hosted on sites like GitHub or on repositories that are specific to large organizations.1

If you want to use a contributed package, you need to do two things:

  1. Download it from CRAN and install it onto your computer (one time)
  2. Load it from your hard drive into R (every time you restart R)

The first step - download and install - can be completed using the install.packages() function. For example, to install the nycflights13 package, I would run:

install.packages("nycflights13")

This will automatically download and install this package for me. R is helpful and also tries to automatically install all packages that a given package relies upon. Because of this, it is often sufficient to install the “last step” and trust R to handle the dependencies automatically. In this course, most of the packages we use can be automatically installed by installing the tidyverse package.

install.packages("tidyverse")

Note that there really isn’t much in the tidyverse package we want, but it’s a useful proxy for a much larger set of packages.

There is no package called XXX

If you encounter an error with the message ‘there is no package called XXX’, this almost always means you need to install package XXX. You can do so by running install.packages("XXX") in the Console.

Once a package is installed, we need to load it into R with the library function, e.g., library(nycflights13).

After doing this, we have access to the contents of the nycflights13 package until we restart R.

Note that the install.packages function requires you to quote its argument, but library does not. This is a weird historical quirk of R that you will trip up on many times before this course ends. To be safe, you can just quote the argument to library() as well.

For example, the nycflights13 package provides a data set covering different US and Canadian airports. If we try to access that data set without loading nycflights13, we get an error message:

After we install and load nycflights13, we are good to go:

In general, if you get an error message of the form Error: object 'X' not found, you should:

  1. Make sure you spelled X properly
  2. If X comes from a package, make sure you library() that package.

As you first get started with the various packages we’ll use in this course, you might wonder how you should know what functions and data are in a given package. The easiest way to do so is to find the web documentation for that package. For the nycflights13 package used above, a little web searching will lead you to https://nycflights13.tidyverse.org/. On this site, we see a high-level description of what the nycflights13 package includes. If you click “Reference” at the top of the page, you will be taken to a page listing everything made available to you by that package. All of the packages we’ll use in this course have similarly designed web documentation. See the Resources Page for direct links to some of the most important packages.

As a general rule, however, I don’t recommend trying to learn a package by learning everything in it. It’s much more productive to look up the functions you need ‘just-in-time’ for you to use them. This way, you will repeatedly look up the most useful functions and commit them to memory the most rapidly.

There’s no harm in library()-ing a package multiple times. If you install.packages() a package that you have already loaded, however, you may need to restart R.

Caution

As mentioned last week, I strongly recommend never saving your workspace in R or RStudio. One of the things “saved” in a workspace is the list of loaded packages, so it becomes essentially impossible to re-install a package properly.

Variables and Assignment

Whenever you type a “word” of R code, it must be one of three things:

  • A reserved word: this is a small set of keywords that R keeps for its own use. These have special rules for their use that we’ll learn as we go along. The main ones are: if, else, for, in, while, function, repeat, break, and next.

    If you use one of these words and get a weird error message, it’s likely because you aren’t respecting the special rules for these words.

    For the nitty gritty, see the Reserved help page but feel free to skip this for now. The Control help page gives additional details.

    (When you run a help page in this tutorial, it looks a bit funny. Try running ?Reserved directly in RStudio for better formatting.)

  • A “literal”. This is a word that represents “just the thing” without any additional indirection. The most common types of literals are:

    • Numeric: e.g., 3, 42.0, or 1e-3
    • String: e.g., 'a', "beach", or 'cream soda'

    There are a few rules for literals, but the most important is that strings begin and end with the same character, either a single quote or a double quote. When R sees a single quote, it will read everything until the next single quote as one string, even if there’s a double quote inside.

    Try some literals:

    What does the literal 0xF represent? (You don’t need to worry about why. This is a fancier literal than we will use in this class.)

  • A “variable name”. This is the most common sort of “word” in code. It is used to refer to something without actually having to know what it is.

We can create variables using the “assignment” operator: <-

When you read this out loud, read <- as “gets” so x <- 3 becomes “x gets 3.”

When we use the assignment operator on a variable that already exists, it overwrites the old value silently and without warning.

We can also put expressions on the right hand side of an assignment:
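
As a minimal sketch of these ideas (the names x and y are just placeholders chosen for illustration):

```r
x <- 3          # read as "x gets 3"
x               # a plain line prints the value
x <- 10         # silently replaces the old value of x; no warning is given
x
y <- 2 * x + 1  # the right-hand side is evaluated first, then stored in y
y
```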

Also note the trick we’ve used here a few times: a “plain” line of code without an assignment generally prints its value.

Comments

When you include a # symbol, R will ignore everything after it. This is called a comment and you can use it to leave notes to yourself about what you are doing and why.
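
A short sketch of both styles of comment, a whole-line comment and a trailing one (radius and area are illustrative names only):

```r
# Compute the area of a circle (this whole line is a comment)
radius <- 2              # everything after the # is ignored by R
area <- pi * radius^2    # pi is built into R
area
```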

Vector Data Types

A vector is an ordered array of the same sort of thing (e.g., all numbers or all strings). We can create vectors using the c command (short for concatenate).
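
For instance, a small numeric vector can be built like this (x is simply the name used in the discussion that follows):

```r
x <- c(1, 2, 3)   # combine three numbers into one vector
x                 # print the vector
```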

Change the above example to c(1, "a", 3) and examine the output. What happened? Why?

To see the type (sort) of a vector, you can use the str command.
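
A minimal sketch, assuming x is the three-number vector from the discussion above (typeof is an extra base R function shown only for comparison):

```r
x <- c(1, 2, 3)
str(x)      # compact summary: type, length, and the first few values
typeof(x)   # the underlying storage type of the vector
```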

str(x) tells us about the structure of x. Here, we see that x is a numeric vector of length 3.

R will try to do the right thing when doing arithmetic on vectors.

When you give R vectors of different lengths, it will “recycle” the shorter one to the length of the longer one.
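
Here is a small sketch of element-wise arithmetic and recycling, using made-up vectors x and y whose lengths fit together evenly:

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20)

x * 2    # the length-1 vector 2 is recycled: 2 4 6 8
x + y    # y is recycled to 10 20 10 20, giving 11 22 13 24
```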

This can be a double-edged sword when the two vectors don’t fit together so nicely:
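
One way to see this, assuming a length-3 vector x and a length-2 vector y (both invented for this sketch):

```r
x <- c(1, 2, 3)
y <- c(10, 20)

# 3 is not a multiple of 2, so R still recycles y but also issues a warning
x + y
```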

How was the last element of x+y computed?

Here we see also that R printed a warning message. A warning message is R’s way of saying “something is funny, but I can still do this” while it (successfully) implements your command. It’s here to help you, but sometimes can be safely ignored if you’re sure about what you’re doing.

An error is an “I can’t do this” message. When R encounters an error it stops and does not fully execute the command.
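
As a sketch, multiplying a string by a number triggers an error. Note that tryCatch() is not something covered in this section: it is used here only as an assumption-laden convenience so the example can run to completion and display the message rather than halting.

```r
# Typing "a" * 3 directly at the Console stops with an error.
# tryCatch() captures that error so we can look at its message.
msg <- tryCatch(
  "a" * 3,
  error = function(e) conditionMessage(e)
)
msg   # the text of the error R raised
```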

Here we get an error because there is no meaningful way to multiply a string by a number, unlike earlier where the recycling rule told R what to do, even if it was probably a bad idea.

Functions

In many of these exercises, we have used commands that have the form NAME() with zero or more comma-separated elements in the parentheses.

This represents a function call. Specifically, the command func(x, y) calls the function named func with two arguments x and y.

Functions are the verbs of the programming world. They are how anything gets done. So far, we’ve only used some basic functions:

  • c: the concatenate function
  • print: the print function
  • str: the structure function
  • list: the list making function

But there are tons of other useful ones!

Try these out:

  • length on a vector
  • colnames on a data frame (like PlantGrowth)
  • toupper on a string (vector)
  • as.character on a numeric value
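
If you want something to compare against, here is one possible set of calls; the particular inputs are arbitrary choices, and PlantGrowth is a data frame that ships with R:

```r
length(c(10, 20, 30))      # number of elements in a vector
colnames(PlantGrowth)      # column names of a built-in data frame
toupper(c("a", "beach"))   # upper-case each string in the vector
as.character(42)           # turn a number into a string
```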

Arguments: Positional and Keyword

The inputs to a function are called the arguments. They come in two forms:

  • Positional
  • Keyword

So far we have only seen positional arguments. The function interprets them according to the order in which they were given:
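
A quick sketch with paste, using two arbitrary strings:

```r
paste("hello", "world")   # "hello world"
paste("world", "hello")   # "world hello": same inputs, different order
```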

Here paste combines two values into a string. We get different output strings depending on the order of the input.

Other arguments can be passed as keyword arguments. Keyword arguments come with names that tell functions how to interpret them. For example, the paste function has an optional keyword argument sep that controls how the strings are combined.

Keyword arguments typically have defaults so you don’t need to always provide them. For the paste function, the sep defaults to " ".
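
For example, here is paste with and without an explicit sep (the input strings are arbitrary):

```r
paste("a", "b")              # sep defaults to " ", giving "a b"
paste("a", "b", sep = "-")   # "a-b"
paste("a", "b", sep = "")    # "ab"
```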

Creating Your Own Functions

When you want to create your own function, you use a variant of the assignment structure
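
A minimal sketch of such a definition; the name add_two and the argument names x and y are placeholders chosen for this illustration:

```r
add_two <- function(x, y) {
  x + y
}
```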

Let’s break this into pieces:

  • On the left hand side of the assignment operator <-, we see the function name. This works exactly the same as vector assignment.
  • Immediately to the right of the assignment operator, we see the keyword function. This tells R that we are defining a function.
  • After the word function, we see the “argument list”, i.e., the list of inputs to the function (comma separated). Here, we are not providing default values for any argument.
  • Finally, between the curly braces, we get the body of the function. This is actually the code defining a function’s behavior. You can do basically anything here! Define variables, do arithmetic, load packages, call other functions - it’s all valid. (In fact, you can even define a function within a function, but that’s sort of advanced.)
  • The last line of the body (here the only line) defines the return value of the function, i.e., its output.

This function will add two numbers together. Now that we’ve defined it, we can use it just like a built-in function:
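
As a sketch, assuming the two-argument adding function is called add_two (a hypothetical name; it is redefined here so the example stands alone):

```r
add_two <- function(x, y) {
  x + y
}

add_two(2, 3)     # 5
add_two(10, -4)   # 6
add_two           # the name without parentheses prints the function's code
```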

Tip: You can see the code used to define any function by simply printing it: think of the code as being the “value” and the function name as the variable name. (This isn’t actually just a metaphor - it’s literally true!)

Default Arguments

Sometimes, we want functions to have default but changeable behavior. This is where default arguments come in. If the user provides a value, the function uses it, but otherwise the default is used.

For example,
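
Here is a minimal sketch with one required argument and one default argument. The function name step_up is invented for illustration; only the argument names x and by come from the discussion here.

```r
step_up <- function(x, by = 2) {
  x + by
}

step_up(10)           # uses the default by = 2, giving 12
step_up(10, by = 5)   # overrides the default, giving 15
```

Calling step_up() with no arguments at all fails, because x has no default to fall back on.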

Here by defaults to 2, but the user is required to supply x because it has no default.

There are lots of details in the mechanics of default arguments - they even can be ‘dynamic’ using some tricks - but in general, they should “just work.”

Control Flow

So far, all of the code we have written executes linearly, one line at a time. To write complex programs, however, we sometimes need code to execute in other ways: e.g., going line by line through a complex data set running the same code (a “loop”) or doing different things depending on the value of a variable (a “conditional”). This brings us to the topic of control flow, or how a program gets executed.

Conditionals

Perhaps the most common control flow operator is the “conditional” - the if operator. In R, the if operator looks like this:

if(TEST){
    # Some code goes here
    # This gets run if `TEST` is true
} else {
    # Some code goes here
    # This gets run if `TEST` is false
}

For example
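
A small sketch that checks whether a number is even (the starting value of x is arbitrary, and there is deliberately no else branch yet):

```r
x <- 4
if (x %% 2 == 0) {
  print("x is even")
}
```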

The element inside the if (the test statement) should ideally be Boolean (TRUE/FALSE-ish) but R will make a reasonable guess if it isn’t.

Note that you can omit the else part and the second set of braces that go with it, but the first set of braces, immediately after if(), should always be there.

Change the value of x and see what happens. Next, modify this by adding an else statement to handle the case of odd numbers.

Note that we’re using the %% operator here. If you haven’t seen it before, recall you can get help by typing

?"%%"

in R. In this case, %% is a modulo operator; that is, it is the “remainder” from division. (Do you see how it works here?)

We’ll practice using conditional operators below.

Looping

In other contexts, it is useful to perform the same operation several times over. When possible, it is best to use a vectorized approach, but sometimes that’s not feasible. In particular, if you have a situation where you need a value from the previous time, vectorization isn’t going to work.

In this case, you need a loop. There are several types of loops, but the most common is the for loop, which looks like this:

for(VARIABLE_NAME in VECTOR){
    BODY
}

For example, we can use the toupper function and the month.name vector to do this:
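
One plausible version, assuming the goal is simply to print each month's name in upper case (month.name is built into R; month is the loop variable):

```r
for (month in month.name) {
  print(toupper(month))   # month holds one name at a time
}
```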

Here, the loop repeats 12 times - once for each element of month.name. As we go through the loop, the value of the variable month changes: it is set to each element of the vector month.name separately. On a single iteration month is a scalar (length-one vector), not a length-12 vector. The “body” of the loop is the code between the curly braces. It is repeated as is, with the only changes coming from the fact that the values of variables change.

A common pattern is to want to repeat a block of code without needing to iterate over a vector. In this case, just define an unused iteration variable and iterate over a sequence. E.g., to compute the first 10 powers of 2:
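
A minimal sketch, starting from x equal to 1 and doubling it on each pass (i is the unused iteration variable):

```r
x <- 1
for (i in 1:10) {
  x <- 2 * x   # the old x is read on the right, the new x is stored on the left
  print(x)
}
```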

Note that this example can be computed more efficiently using vectorization: do you see how?

Note that if you re-assign a variable other than the loop variable, you can use it again on the next time through the loop: so here, we get the old value of x on the right-hand side of x <- 2 * x and we put the new value back into x for later use.

If you try to assign to the loop variable (i), your value is overwritten at the start of the next iteration.
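
A short sketch of this behavior, using an arbitrary three-step loop:

```r
for (i in 1:3) {
  print(i)       # 1, then 2, then 3: the loop resets i on every pass
  i <- i * 100   # changes i only for the rest of this pass
  print(i)       # 100, then 200, then 300
}
```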

So be careful what you use the loop variable for.

Because it has vectorization, R uses loops much less than other languages, but they still sometimes appear and it’s worth being aware of them.

Programming Exercises

Write functions to perform each of the following tasks.

  1. Write a function that takes in a vector of numbers, calculates the length and maximum value of the vector, and prints that information to the screen in a formatted way.

    > func_1(c(1, 2, 3, 5, 7))
    The largest value in that vector of 5 numbers is 7.
    
    > func_1(c(1, 2, 5, 5))
    The largest value in that vector of 4 numbers is 5.

    To make your output as attractive as possible, you might want to use the cat command instead of the print command.

  2. Write a function that tests whether its (integer) input is a leap year. Recall the leap year rules:

    • A year is a leap year if it is divisible by 4
    • But it is not a leap year if it is divisible by 100
    • Unless it is also divisible by 400

    Your function should behave like this:

    > leap_year(2025)
    [1] FALSE
    > leap_year(2024)
    [1] TRUE
    > leap_year(2100)
    [1] FALSE
    > leap_year(2000)
    [1] TRUE

    Remember our discussion of the %% operator from above.

  3. Write a function to greet your classmates with varying levels of enthusiasm. It should have three optional arguments:

    1. name. The name of the person to greet. Default "friend"
    2. times. The number of times to repeat the greeting. Default 1
    3. emphasis. A Boolean (TRUE/FALSE) value indicating whether the greeting should end with an exclamation point. (Default FALSE)

    > greetings()
    Hello, friend
    > greetings(name="Michael")
    Hello, Michael
    > greetings(times=2)
    Hello, friend
    Hello, friend
    > greetings(emphasis=TRUE)
    Hello, friend!
    > greetings("Michael", 5, TRUE)
    Hello, Michael!
    Hello, Michael!
    Hello, Michael!
    Hello, Michael!
    Hello, Michael!

  4. The Riemann Zeta Function is a famous function in analytic number theory2 defined as

    \[\zeta(k) = 1 + \left(\frac{1}{2}\right)^k + \left(\frac{1}{3}\right)^k + \dots = \sum_{i=1}^{\infty} i^{-k}\]

    We cannot implement an infinite series in R, but we can get very close by taking a large number of terms in the series (e.g., the first 500,000). Implement the zeta function and show that \(\zeta(2) = \frac{\pi^2}{6}\), at least to within numerical tolerance.

    > zeta(2)
    [1] 1.644932
    > zeta(3)
    [1] 1.202057
    > zeta(4)
    [1] 1.082323
    > all.equal(zeta(2), pi^2/6, tol=1e-4)
    [1] TRUE

  5. Hero of Alexandria developed a method for computing square roots numerically. He showed that by performing the following update repeatedly, \(x\) will converge to \(\sqrt{n}\):

    \[x \leftarrow \frac{1}{2}\left(x + \frac{n}{x}\right)\]

    You can start with any positive \(x\), but \(n/2\) is a good choice.

    Implement this method to compute square roots. Use an optional keyword argument iter with a default value of 100 to control how many iterations are performed:

    > hero_sqrt(100)
    [1] 10
    > hero_sqrt(3)^2
    [1] 3
    > hero_sqrt(3, iter=2)
    [1] 1.732143
    > hero_sqrt(3000)
    [1] 54.77226

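
If you get stuck, the following is one possible set of solutions, sketched using only base R. These are not the only (or necessarily the best) ways to solve the exercises; the argument name n_terms in zeta is an assumption introduced here, and hero_sqrt assumes the suggested starting value of n/2.

```r
# 1. Report the length and maximum of a numeric vector.
func_1 <- function(x) {
  cat("The largest value in that vector of ", length(x),
      " numbers is ", max(x), ".\n", sep = "")
}

# 2. Divisible by 4 and not by 100, unless also divisible by 400.
leap_year <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | (year %% 400 == 0)
}

# 3. Greet someone a given number of times, optionally with emphasis.
greetings <- function(name = "friend", times = 1, emphasis = FALSE) {
  ending <- if (emphasis) "!" else ""
  for (i in seq_len(times)) {
    cat("Hello, ", name, ending, "\n", sep = "")
  }
}

# 4. Truncated zeta series: sum the first n_terms terms (vectorized).
zeta <- function(k, n_terms = 500000) {
  i <- 1:n_terms
  sum(i^(-k))
}

# 5. Hero's method: repeat the update x <- (x + n / x) / 2.
hero_sqrt <- function(n, iter = 100) {
  x <- n / 2
  for (i in seq_len(iter)) {
    x <- 0.5 * (x + n / x)
  }
  x
}

func_1(c(1, 2, 3, 5, 7))
leap_year(c(2025, 2024, 2100, 2000))
greetings("Michael", 2, TRUE)
all.equal(zeta(2), pi^2 / 6, tolerance = 1e-4)
hero_sqrt(3000)
```

Note that zeta uses vectorization rather than a loop, while hero_sqrt needs a loop because each step depends on the value from the previous step.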

Executing Code in R

Now that you’re executing R code, it’s worth taking a moment and walking through some of the details of how R and RStudio go about executing code. There are quite a few details here – and not all of them will make complete sense at first – so bookmark this section and come back to it as needed. If something is still confusing, contact the instructor and I’ll provide some more details.

In particular, save this section and refer back to it when you start Mini-Project #01, as that will be the first time you need to write a large amount of R code in a qmd document over several sessions.

All R code gets executed in a “session” - you can think of this as a ‘place’ or ‘environment’ where computations are performed, variables get stored, packages loaded, etc. Sessions are isolated, so whatever happens in session A has no effect on session B or vice versa. In particular, if you create a variable in Session A, but then try to use it in Session B, you will get an “object not found” error since Session B can’t see what you did in Session A. There are really only two ways to do things in R that translate across sessions - and that’s because they involve saving files to your computer that can be accessed by the other session:

  1. Installing packages. Just like you don’t have to install.packages a package more than once over time, you also don’t need to install it in multiple sessions. All R packages live in a globally accessible directory, so all R sessions can see all installed packages by default.
  2. Saving data files. If you save a data file, like a csv, in one R session, you can read that csv into a different R session. There’s really nothing special about the fact both the ‘reader’ and ‘writer’ are R sessions here. In general, files saved by one program can be accessed by another program on the same computer. The fact that our reader and writer are both R sessions just gives us a better chance of avoiding file-type compatibility issues.
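
The second point can be sketched as follows. Both steps run in a single session here purely for illustration; in practice the read step could happen in any later session. The file name example.csv and the use of a temporary directory are assumptions made for this sketch.

```r
# "Writer" step: save a built-in data frame to a csv file on disk
path <- file.path(tempdir(), "example.csv")
write.csv(PlantGrowth, path, row.names = FALSE)

# "Reader" step: load the file back in; only the file is shared, not variables
pg <- read.csv(path)
str(pg)
```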

So - except for those two “global” actions - whenever you run code, it’s important to distinguish which session is being used to run your code.

Anything that gets executed in the Console in RStudio runs in the same session, unless you specifically restart RStudio or otherwise create a new session. (This normally only happens when things go wrong and RStudio has to abort your current session to avoid even worse problems.) Code can be executed in the Console session in several different ways:

  • You can type something directly into the Console. This is the simplest model, but unless you are a perfect typist, this doesn’t really scale to larger or more complex commands.

  • You can type something in a .R file, i.e., an “R Script”, and then run it in the Console by either i) highlighting and Running the relevant lines; or ii) Sourceing the entire .R file. Either way, when you do this, you will see that the code is copied into the Console automatically and executed there.3

  • You can execute one or more code blocks from a Quarto (.qmd) document in the Console session. This happens whenever you press the little green “Run Current Chunk” arrow buttons to the top-right of a code chunk4 or if you use the Run button near the top of the Window to run one or more chunks.

So, wherever you happen to type the code - it gets run in the same session if it is run in any of these three ways.

So why is it worth talking about sessions at all if all code goes to the same place to be executed? It turns out there’s one exception - but it’s a very important one.

When you hit the Render button for a .qmd document (or hit the Save button if you have Render-on-Save enabled), all of the code chunks in that document – and only the code chunks in that document – get run in a brand-new session. In particular, you get a new session every time you hit Render, so hitting Render twice will execute your code twice in two different (and not connected) sessions.

Because you’re starting from a clean state at every render, it is important that all of the code you need exists in a code chunk in the .qmd file. If you only define a variable in the Console session and not in a code chunk in your .qmd file, but try to reference that variable within the .qmd Rendering process, you will get an “object not found” error. This is because the Render-session can’t ‘see’ what you’ve done anywhere else: the Render-session does not have access to whatever you’ve done in the Console session. In particular, if you run the same code chunk once in the Console session using a Run button and later in a Render-induced different session, the two executions will be totally isolated, even though it is the same code.

So why is this second session worth it? Why does Quarto insist on doing things in a new session instead of giving us maximum flexibility to write and run our code wherever is most convenient? The use of a ‘clean-room’ session guarantees reproducibility and self-containedness. If the .qmd Rendering uses all of the code in the .qmd document and only the code in the .qmd document, then the .qmd document must be a perfect record of those calculations.

If you do something directly in the Console, but never save the code, then there’s no guarantee that the code you have in your document actually matches the output that gets printed. Even worse, the code that only exists in the Console may be lost if it is not recorded elsewhere in a .qmd or similar document.

When writing code in a Quarto document, I tend to work in parallel tracks. As I’m figuring out what I want to do, I try a bunch of different things in the Console session, either writing them directly or, more commonly, in a .R script that gets executed in the same session. Then, once I’ve settled on code that does what I want – or at least seems to do what I want – I copy it into a new chunk in my .qmd document. As I work through the project, my .qmd document contains only my “good code” and the scraps of my earlier attempts are lost in the aether.

I also regularly re-Render the .qmd file to ensure that it has everything it needs to execute properly. If I forgot to put some necessary code into a .qmd code block, I will get an error when I move on to the next code block that depends on the missing code block. I try to Render about as often as possible: this makes debugging easier. In particular, if I re-Render any time I add or edit a code chunk and I observe an error, I know that it is coming from the chunk I just worked on. Finding the error may still take a bit of effort, but it’s far simpler than finding it in the entire document without any context.

In particular, I don’t write code directly in my .qmd document and I don’t use the Run button or the little Run arrows to run .qmd code chunks in the Console session. I find it easier to think of Console-land and Render-land as two distinct “places” and avoid mixing them. R won’t get them crossed, but I do every time I break this separation, so I do my best to avoid going down that path. It takes a bit more copy and paste work, but it’s worth it as it makes the mental bookkeeping easier.

There is one rough edge to this workflow. When you Render a .qmd file - and all code is executed in a new session - but an error is encountered, it can sometimes be a bit hard to figure out the root cause of the error. In particular, Quarto will print out the text of the error in the Background Jobs tab, but there’s no way to ‘pause’ mid-Render and take a look around. Quarto tries to help a little by giving line numbers identifying the chunk where the error occurred, but that’s only a ‘symptom’: the actual cause may be much earlier. E.g., if I define a variable in Chunk 1 but misspell its name, and then I wait to use that variable until Chunk 50 where I spell the name correctly, I will get an object not found error in Chunk 50, not Chunk 1.

As noted above, frequent re-Rendering certainly makes it easier - though not 100% painless - to find the source of an error. Because Rendering takes place in a separate session, specifically in a Background Job, I can set a long document to render while I continue to write new code and only need to interrupt my flow if an error is encountered in the background Render process.

Wrapping up, it doesn’t matter where you write the code: it matters where it gets executed, i.e., in which session. Sessions map roughly onto tabs in RStudio: the primary session you’ll use for writing and trying out code is associated with the Console tab. I use this session to keep iterating on code until I’m happy with it.

The other session is associated with the Render button of a .qmd file and with the Background Jobs tab. Quarto uses this to run all the code in a .qmd file in a “clean room” as part of rendering a document. I only put my ‘finished’ code in a .qmd file code block. I Render regularly to make sure that those code blocks are working as expected.

Finally, and hopefully this goes without saying, all the interactive blocks on this page are a totally different session than anything you do through RStudio. But all blocks on this page share the same session, which lasts until you refresh or exit the page. This is why we could define a function in one chunk of the Exercises and test it in the next chunk - they are both executed in the same session and have the same access to our functions (once defined).

Footnotes

  1. The BioConductor project produces over 2000 high-quality non-CRAN packages for bioinformatics alone.↩︎

  2. Analytic number theory is basically the application of calculus techniques to prove properties of prime numbers: it’s a surprisingly powerful approach.↩︎

  3. Sometimes the settings on the Source button will be set to run your code, printing only the output and suppressing the input. I find this behavior a bit confusing - I want to see the code that was executed! - so I tend to disable it. If editing a .R file, click the tiny-little down arrow button next to the Source button and select Source with Echo to ensure the code is printed before execution.↩︎

  4. If you are of a certain age, they look like an old-school VCR right arrow / play button.↩︎