Practices to Avoid in R Programming

This page describes some practices that are common in R programming but should be avoided because they make programs slow, hard to maintain, or unsafe.

Using `T` or `F`

Never use T or F. Always use TRUE or FALSE instead.

R uses some special “reserved words” which can never be overwritten. For example, you cannot name a variable if, because if is a foundational element of how R programs are structured. To see the full list of reserved words, run ?reserved.

Two of these are TRUE and FALSE. These are used to represent logical values and control flow, like this:

if (TRUE && FALSE){
    print("both TRUE!")
} else {
    print("at least one FALSE")
}

## [1] "at least one FALSE"

Because TRUE and FALSE are reserved keywords, it’s impossible to change their values.

TRUE <- "hello"

## Error in `TRUE <- "hello"`:
## ! invalid (do_set) left-hand side to assignment

R also comes with two objects, T and F. If you type one of these in the console, you’ll see that T evaluates to TRUE and F evaluates to FALSE. It’s common for R programmers to believe that T and TRUE are interchangeable, but they are not! T and F are ordinary variables that merely start out with those values.

Unlike TRUE and FALSE, T and F can be changed!

T <- "hello"
F <- "goodbye"

if (T && F){
    print("both TRUE!")
} else {
    print("at least one FALSE")
}

## Error in `T && F`:
## ! invalid 'x' type in 'x && y'
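The error above is at least loud. Reassigning T can also go wrong silently. The sketch below assumes a hypothetical script that stores a temperature reading in a variable named T and later passes T as a logical argument; the variable names are made up for illustration.

```r
# Hypothetical scenario: a script stores a temperature reading in T,
# which hides the base object T whose value is TRUE.
T <- 0

readings <- c(3, NA, 5)

# na.rm receives 0, which is treated as FALSE, so the NA is not removed.
mean_with_T <- mean(readings, na.rm = T)

# TRUE is a reserved word, so this call always removes the NA.
mean_with_TRUE <- mean(readings, na.rm = TRUE)

print(mean_with_T)
print(mean_with_TRUE)

# Remove the shadowing variable so that T refers to the base object again.
rm(T)
```

No error or warning is raised here: the first result is NA and the second is 4.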

Generating vectors with `1:length(x)`

Never use 1:length(x). Use seq_len() or seq_along() instead.

It’s inevitable that you’ll encounter a situation where you want to loop over every element of a vector and do something with it.

revenue <- c(10, 20,  15, 30)
cost <- c(8, 18, 11, 26)
for (i in 1:length(cost)){
    print(paste0("i: ", i))
    print(paste0("  * revenue: ", revenue[i]))
    print(paste0("  * cost: ", cost[i]))
    print(paste0("  * profit: ", revenue[i] - cost[i]))
}

## [1] "i: 1"
## [1] "  * revenue: 10"
## [1] "  * cost: 8"
## [1] "  * profit: 2"
## [1] "i: 2"
## [1] "  * revenue: 20"
## [1] "  * cost: 18"
## [1] "  * profit: 2"
## [1] "i: 3"
## [1] "  * revenue: 15"
## [1] "  * cost: 11"
## [1] "  * profit: 4"
## [1] "i: 4"
## [1] "  * revenue: 30"
## [1] "  * cost: 26"
## [1] "  * profit: 4"

The code above will work fine in most cases, but it will behave in a surprising way if cost is empty:

revenue <- numeric()
cost <- numeric()
for (i in 1:length(cost)){
    print(paste0("i: ", i))
    print(paste0("  * revenue: ", revenue[i]))
    print(paste0("  * cost: ", cost[i]))
    print(paste0("  * profit: ", revenue[i] - cost[i]))
}

## [1] "i: 1"
## [1] "  * revenue: NA"
## [1] "  * cost: NA"
## [1] "  * profit: NA"
## [1] "i: 0"
## [1] "  * revenue: "
## [1] "  * cost: "
## [1] "  * profit: "

This is because 1:0 generates a 2-element vector equivalent to c(1, 0). In this case, what we really want is to not run the loop at all because the input is empty! R provides two functions that are safer for this task:

seq_along(x): equivalent to 1:length(x), but returns a length-0 output for a length-0 input
seq_len(length.out): generates the integer vector 1, 2, ..., length.out, or a length-0 vector when length.out is 0

revenue <- numeric()
cost <- numeric()
for (i in seq_along(cost)){
    print(paste0("i: ", i))
    print(paste0("  * revenue: ", revenue[i]))
    print(paste0("  * cost: ", cost[i]))
    print(paste0("  * profit: ", revenue[i] - cost[i]))
}
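With empty inputs, the body of this loop never runs, so nothing is printed. As a rough illustration of the difference, the sketch below compares the three ways of building an index. The variable names are made up for this example.

```r
empty <- numeric()

# 1:length(x) counts down from 1 to 0 when x is empty.
index_colon <- 1:length(empty)

# seq_along(x) and seq_len(length(x)) both return a zero-length vector.
index_along <- seq_along(empty)
index_len <- seq_len(length(empty))

print(length(index_colon))
print(length(index_along))
print(length(index_len))

# seq_len() is also useful when the count is not a vector length,
# for example the number of rows in a data frame that has no rows.
no_rows <- data.frame(revenue = numeric(), cost = numeric())
for (i in seq_len(nrow(no_rows))) {
    print(i)  # never reached
}
```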

Using `require()` to load libraries

Never use require() in scripts, use library() instead.

In R, “packages” are bundles of R code which you can load into your programs and re-use. Except for a few absolutely essential default packages (getOption("defaultPackages")), the code from these packages has to be explicitly loaded before your code can use it. For example, the code below throws an error because I have not loaded {data.table}.

set.seed(708)
data.table(
    x = rnorm(10)
    , y = rnorm(10)
)

## Error in `data.table()`:
## ! could not find function "data.table"

When I load the package, this command now succeeds.

library(data.table)

## Warning: package 'data.table' was built under R version 4.4.3

set.seed(708)
data.table(
    x = rnorm(10)
    , y = rnorm(10)
)

##              x           y
##          <num>       <num>
##  1: -0.3635938 -0.93689125
##  2:  0.8000444  1.15348849
##  3:  0.2545262 -0.49272222
##  4:  1.1050339  0.08861516
##  5:  0.2239490  0.04470975
##  6:  0.3043927 -0.07872333
##  7:  0.5863823  1.18891894
##  8: -1.4412849 -2.17314240
##  9:  1.0383001  1.53658612
## 10:  0.2348604  0.13554731

The most popular commands to load packages are library() and require(). You may find examples on the internet which say or at least imply that these can be used interchangeably. Those examples are not correct.

require() will throw a warning if you use it on a package that has not been installed, but it will not throw an error. That means that code that comes after require() will run, which increases the time until you find out that you are missing a required package for some program!

require("nonsense-package")

## Loading required package: nonsense-package

## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'nonsense-package'

print("this code ran")

## [1] "this code ran"

library("nonsense-package")

## Error in `library()`:
## ! there is no package called 'nonsense-package'
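Part of the reason require() does not stop a script is that it reports failure through its return value rather than through an error. The following is a minimal sketch of that difference, using the same made-up package name as above. The suppressWarnings() and tryCatch() wrappers are there only so the example can run to completion and are not something a normal script would need.

```r
# require() returns FALSE (invisibly) when the package cannot be loaded.
# suppressWarnings() is used here only to keep the example output short.
loaded <- suppressWarnings(
    require("nonsense-package", character.only = TRUE, quietly = TRUE)
)
print(loaded)

# library() signals an error instead. tryCatch() captures it here so that
# the example can keep running and show the message.
result <- tryCatch(
    library("nonsense-package", character.only = TRUE)
    , error = function(e) conditionMessage(e)
)
print(result)
```

Unless a script checks the value returned by require(), a missing package goes unnoticed until some later line fails.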

Subsetting with positional references

Never use positional subsetting for data frames. Always use names or logical conditions that reference names.

In R, it’s possible to subset objects by numbers. You can say things like “get the 5th column” or “get rows 15-88 from this matrix”. This practice, called positional subsetting, is often a bad idea.

Look at the example below. Can you tell what it does?

data(mtcars)
mtcars[27:31, ]

##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8

What about this one?

data(mtcars)
mtcars[mtcars$gear == 5, ]

##                 mpg cyl  disp  hp drat    wt qsec vs am gear carb
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8

Both statements produce the same result: a dataframe with all the cars in mtcars that have 5-gear transmissions. The second statement, though, will still produce the right answer even if mtcars is sorted or randomly shuffled.
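To make that concrete, here is a small sketch that reorders mtcars and applies both statements again. Sorting by mpg is just one assumed way the row order might change.

```r
data(mtcars)

# Reorder the rows, for example by sorting on fuel economy.
sorted_cars <- mtcars[order(mtcars$mpg), ]

# Positional subsetting now picks up whichever cars landed in rows 27 to 31.
by_position <- sorted_cars[27:31, ]

# The logical condition still selects exactly the 5-gear cars.
by_condition <- sorted_cars[sorted_cars$gear == 5, ]

print(all(by_position$gear == 5))
print(all(by_condition$gear == 5))
print(nrow(by_condition))
```

After sorting, the positional version includes cars with 4 gears, while the condition-based version still returns the same five cars.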

Using absolute file paths in shared code

For any code you want other people to run, do not use absolute file paths. Use relative paths instead.

See “File Paths” in the programming supplement for an overview of file paths in R.

Consider the following code.

DATA_DIR <- "/home/fnourzad/Downloads/data-files"
measures <- c("cpi-2018", "cpi-2019", "cpi-2020")
data_frames <- lapply(
    X = measures
    , FUN = function(filename){
        full_filepath <- file.path(DATA_DIR, filename)
        return(data.table::fread(full_filepath))
    }
)
cpiDT <- data.table::rbindlist(data_frames)

This code will only work on a machine that has those files stored in a directory /home/fnourzad/Downloads/data-files. If you intend to share this code with people, it would be better to use relative file paths, which R resolves against the current working directory.

DATA_DIR <- file.path(getwd(), "data-files")
measures <- c("cpi-2018", "cpi-2019", "cpi-2020")
data_frames <- lapply(
    X = measures
    , FUN = function(filename){
        full_filepath <- file.path(DATA_DIR, filename)
        return(data.table::fread(full_filepath))
    }
)
cpiDT <- data.table::rbindlist(data_frames)
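Because relative paths are resolved against the working directory, the directory name alone is enough to locate the files. The sketch below is only an illustration of that behavior: it builds a throwaway project under tempdir(), and the folder name example-project and the placeholder file contents are assumptions, not real data.

```r
# Build a throwaway project directory; the names are hypothetical.
project_dir <- file.path(tempdir(), "example-project")
dir.create(
    file.path(project_dir, "data-files")
    , recursive = TRUE
    , showWarnings = FALSE
)

# Placeholder file so there is something to find (not real CPI data).
writeLines("placeholder", file.path(project_dir, "data-files", "cpi-2018"))

# Run from the project root, as a collaborator would.
old_wd <- setwd(project_dir)

# A relative path: no user name or machine-specific prefix.
relative_path <- file.path("data-files", "cpi-2018")
found <- file.exists(relative_path)

# Restore the previous working directory.
setwd(old_wd)

print(relative_path)
print(found)
```

The same relative path works for anyone who runs the code from the root of their own copy of the project.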

Installing packages in shared code

In any R code you want other people to run, do not install packages. Use library() instead so people will get loud errors telling them to install things.

When you write R code for other people to run, don’t install packages. Each R library can hold only one version of a package at a time, so installing a package may replace the version someone already has, and it’s bad practice to update other people’s environments for them.

bad

install.packages(c("data.table", "corrplot"))
...

good

library(data.table)
library(corrplot)
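If a script depends on many packages, it may help to report every missing package at once instead of failing on the first library() call. The following is one possible sketch. The helper find_missing_packages() is hypothetical (it is not part of base R), and it relies on requireNamespace(), which loads a package's namespace if the package is installed but never installs anything.

```r
# Hypothetical helper: returns the names of packages that are not installed.
# It never installs anything.
find_missing_packages <- function(packages){
    is_installed <- vapply(
        X = packages
        , FUN = requireNamespace
        , FUN.VALUE = logical(1)
        , quietly = TRUE
    )
    return(packages[!is_installed])
}

# "stats" ships with R; "nonsense-package" is a made-up name.
missing_packages <- find_missing_packages(c("stats", "nonsense-package"))
print(missing_packages)

# In a real script, stop with instructions instead of installing:
# if (length(missing_packages) > 0){
#     stop("Please install: ", paste(missing_packages, collapse = ", "))
# }
```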

Using magic constants

Always store important literal values in informatively-named variables and re-use those variables everywhere.

Consider the following snippet. Given distance traveled in kilometers and time spent in minutes, it tries to calculate the average trip speed (in miles per hour).

distance_travelled <- distance_travelled * 0.621371
hours_spent <- travel_time / 60
mph <- distance_travelled / hours_spent

This code might not make sense if you didn’t have the two sentences of documentation above it. If someone else handed you this code, you might wonder “Where did 0.621371 come from? Why is travel time being divided by 60?”

The following is a better way to write this:

MILES_PER_KILOMETER <- 0.621371
MINUTES_PER_HOUR <- 60

distance_travelled <- distance_travelled * MILES_PER_KILOMETER
hours_spent <- travel_time / MINUTES_PER_HOUR
mph <- distance_travelled / hours_spent
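Named constants pair well with a small function whose argument names carry the units. The sketch below is one way this could look. The function average_speed_mph() and its argument names are made up for illustration.

```r
MILES_PER_KILOMETER <- 0.621371
MINUTES_PER_HOUR <- 60

# Hypothetical helper: average speed in miles per hour, given a distance
# in kilometers and a travel time in minutes.
average_speed_mph <- function(distance_km, travel_minutes){
    distance_miles <- distance_km * MILES_PER_KILOMETER
    hours_spent <- travel_minutes / MINUTES_PER_HOUR
    return(distance_miles / hours_spent)
}

# 100 kilometers covered in 120 minutes
speed <- average_speed_mph(distance_km = 100, travel_minutes = 120)
print(speed)
```

If a conversion factor ever needs to be corrected, it only has to change in one place.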