This page describes some practices that are common in R programming but should be avoided because they make programs slow, hard to maintain, or unsafe.
T or F
Never use T or F. Always use TRUE or FALSE instead.
R uses some special “reserved words” which can never be overwritten.
For example, you cannot name a variable if because
if is a foundational element of how R programs are
structured. To see the full list of reserved words, run
?reserved.
Two of these are TRUE and FALSE. These are
used to represent logical values and control flow, like this:
if (TRUE && FALSE){
print("both TRUE!")
} else {
print("at least one FALSE")
}
## [1] "at least one FALSE"
Because TRUE and FALSE are reserved
keywords, it’s impossible to change their values.
TRUE <- "hello"
## Error in `TRUE <- "hello"`:
## ! invalid (do_set) left-hand side to assignment
R also comes with two objects, T and F. If
you type one of these in the console, you’ll see that T
holds the value TRUE and F holds the value
FALSE. It’s common for R programmers to believe that
T and TRUE are interchangeable, but they are
not!
Unlike TRUE and FALSE, T and
F can be changed!
T <- "hello"
F <- "goodbye"
if (T && F){
print("both TRUE!")
} else {
print("at least one FALSE")
}
## Error in `T && F`:
## ! invalid 'x' type in 'x && y'
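The failure above is loud, but a reassigned T can also change results without any error. Here is a minimal sketch, assuming a script that reuses T as an ordinary variable name; the values vector and the result names are hypothetical.

```r
# Hypothetical: an earlier line reuses `T` as a variable name
# (for example, as shorthand for a time or temperature).
T <- 0

values <- c(1, 2, NA)

# With T equal to 0, `na.rm = T` behaves like `na.rm = FALSE`,
# so the NA is kept and the mean is NA.
mean_with_T <- mean(values, na.rm = T)

# TRUE cannot be reassigned, so this call always drops the NA.
mean_with_TRUE <- mean(values, na.rm = TRUE)

print(mean_with_T)     # NA
print(mean_with_TRUE)  # 1.5

# Remove the override so that `T` refers to the base R object again.
rm(T)
```

No warning is raised by the first call, which is what makes this kind of bug hard to spot.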
1:length(x)
Never use 1:length(x), use seq_len() or seq_along() instead.
It’s inevitable that you’ll encounter a situation where you want to loop over every element of a vector and do something with it.
revenue <- c(10, 20, 15, 30)
cost <- c(8, 18, 11, 26)
for (i in 1:length(cost)){
print(paste0("i: ", i))
print(paste0(" * revenue: ", revenue[i]))
print(paste0(" * cost: ", cost[i]))
print(paste0(" * profit: ", revenue[i] - cost[i]))
}
## [1] "i: 1"
## [1] " * revenue: 10"
## [1] " * cost: 8"
## [1] " * profit: 2"
## [1] "i: 2"
## [1] " * revenue: 20"
## [1] " * cost: 18"
## [1] " * profit: 2"
## [1] "i: 3"
## [1] " * revenue: 15"
## [1] " * cost: 11"
## [1] " * profit: 4"
## [1] "i: 4"
## [1] " * revenue: 30"
## [1] " * cost: 26"
## [1] " * profit: 4"
The code above will work fine in most cases, but it will behave in a
surprising way if cost is empty:
revenue <- numeric()
cost <- numeric()
for (i in 1:length(cost)){
print(paste0("i: ", i))
print(paste0(" * revenue: ", revenue[i]))
print(paste0(" * cost: ", cost[i]))
print(paste0(" * profit: ", revenue[i] - cost[i]))
}
## [1] "i: 1"
## [1] " * revenue: NA"
## [1] " * cost: NA"
## [1] " * profit: NA"
## [1] "i: 0"
## [1] " * revenue: "
## [1] " * cost: "
## [1] " * profit: "
This is because 1:0 generates a 2-element vector
equivalent to c(1, 0). In this case, what we really want is
to not run the loop at all because the input is empty! R provides two
functions that are safer for this task:
seq_along(x): equivalent to 1:length(x),
but returns a length-0 output for length-0 input
seq_len(length.out): generates an integer vector with
length length.out
revenue <- numeric()
cost <- numeric()
for (i in seq_along(cost)){
print(paste0("i: ", i))
print(paste0(" * revenue: ", revenue[i]))
print(paste0(" * cost: ", cost[i]))
print(paste0(" * profit: ", revenue[i] - cost[i]))
}
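The loop above prints nothing, because seq_along() returns an empty index for an empty input. Comparing the index vectors directly makes this visible. This is a short sketch; the variable names are illustrative.

```r
empty <- numeric()

# 1:length(empty) is 1:0, which counts down from 1 to 0.
bad_index <- 1:length(empty)

# Both alternatives return a zero-length integer vector for empty input.
along_index <- seq_along(empty)
len_index <- seq_len(length(empty))

print(length(bad_index))    # 2
print(length(along_index))  # 0
print(length(len_index))    # 0

# For non-empty input, the approaches agree.
cost <- c(8, 18, 11, 26)
print(identical(1:length(cost), seq_along(cost)))  # TRUE
```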
require() to load libraries
Never use require() in scripts, use library() instead.
In R, “packages” are bundles of R code which you can load into your
programs and re-use. Except for a few absolutely essential default
packages (getOption("defaultPackages")), the code from
these packages has to be explicitly loaded before your code can use it. For
example, the code below throws an error because I have not loaded
{data.table}.
set.seed(708)
data.table(
x = rnorm(10)
, y = rnorm(10)
)
## Error in `data.table()`:
## ! could not find function "data.table"
When I load the package, this command now succeeds.
library(data.table)
## Warning: package 'data.table' was built under R version 4.4.3
set.seed(708)
data.table(
x = rnorm(10)
, y = rnorm(10)
)
## x y
## <num> <num>
## 1: -0.3635938 -0.93689125
## 2: 0.8000444 1.15348849
## 3: 0.2545262 -0.49272222
## 4: 1.1050339 0.08861516
## 5: 0.2239490 0.04470975
## 6: 0.3043927 -0.07872333
## 7: 0.5863823 1.18891894
## 8: -1.4412849 -2.17314240
## 9: 1.0383001 1.53658612
## 10: 0.2348604 0.13554731
The most popular commands to load packages are library()
and require(). You may find examples on the internet which
say or at least imply that these can be used interchangeably. Those
examples are not correct.
require() will throw a warning if you use it on a
package that has not been installed, but it will not throw an error.
That means that code that comes after require() will run,
which increases the time until you find out that you are missing a
required package for some program!
require("nonsense-package")
## Loading required package: nonsense-package
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'nonsense-package'
print("this code ran")
## [1] "this code ran"
library("nonsense-package")
## Error in `library()`:
## ! there is no package called 'nonsense-package'
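The difference can also be seen programmatically: require() returns FALSE and lets execution continue, while library() signals an error. The sketch below is illustrative only. It assumes 'nonsense-package' is not installed, and the tryCatch() wrapper is there only so the example can run to the end.

```r
# Hypothetical package name, assumed not to be installed.
pkg <- "nonsense-package"

# require() reports failure through its return value (plus a warning,
# suppressed here), so the script keeps running.
require_result <- suppressWarnings(
  require(pkg, character.only = TRUE, quietly = TRUE)
)

# library() signals an error, which is caught here only for demonstration.
library_result <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)

print(require_result)  # FALSE
print(library_result)  # "error"
```

In a real script you would call library() without tryCatch(), so that a missing package stops the program immediately.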
Never use positional subsetting for data frames. Always use names or logical conditions that reference names.
In R, it’s possible to subset objects by numbers. You can say things like “get the 5th column” or “get rows 15-88 from this matrix”. This practice, called positional subsetting, is often a bad idea.
Look at the example below. Can you tell what it does?
data(mtcars)
mtcars[27:31, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
What about this one?
data(mtcars)
mtcars[mtcars$gear == 5, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
Both statements produce the same result: a dataframe with all the
cars in mtcars that have 5-gear transmissions. The second
statement, though, will still produce the right answer even if
mtcars is sorted or randomly shuffled.
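To make the difference concrete, here is a sketch that reorders mtcars before subsetting. Sorting by mpg is just one illustrative choice of reordering, and the variable names are hypothetical.

```r
data(mtcars)

# Reorder the rows, here by fuel efficiency.
sorted_cars <- mtcars[order(mtcars$mpg), ]

# Positional subsetting now picks whichever rows happen to sit at 27-31.
by_position <- sorted_cars[27:31, ]

# Subsetting by condition still selects exactly the cars with gear == 5.
by_condition <- sorted_cars[sorted_cars$gear == 5, ]

print(all(by_position$gear == 5))   # FALSE
print(all(by_condition$gear == 5))  # TRUE
print(nrow(by_condition))           # 5
```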
Always store important literal values in informatively-named variables and re-use those variables everywhere.
Consider the following snippet. Given distance traveled in kilometers and time spent in minutes, it tries to calculate the average trip speed (in miles per hour).
distance_travelled <- distance_travelled * 0.621371
hours_spent <- travel_time / 60
mph <- distance_travelled / hours_spent
This code might not make sense if you didn’t have the two sentences of documentation above it. If someone else handed you this code, you might wonder “Where did 0.621371 come from? Why is travel time being divided by 60?”
The following is a better way to write this:
MILES_PER_KILOMETER <- 0.621371
MINUTES_PER_HOUR <- 60
distance_travelled <- distance_travelled * MILES_PER_KILOMETER
hours_spent <- travel_time / MINUTES_PER_HOUR
mph <- distance_travelled / hours_spent
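One way to check the named-constant version is to run it on sample inputs. The sketch below assumes a hypothetical trip of 100 kilometers that takes 90 minutes. The input values and the names distance_km, travel_time_minutes, and distance_miles are illustrative and do not come from the snippet above.

```r
MILES_PER_KILOMETER <- 0.621371
MINUTES_PER_HOUR <- 60

# Hypothetical inputs: a 100 km trip that takes 90 minutes.
distance_km <- 100
travel_time_minutes <- 90

# Convert units using the named constants, then compute speed.
distance_miles <- distance_km * MILES_PER_KILOMETER
hours_spent <- travel_time_minutes / MINUTES_PER_HOUR
mph <- distance_miles / hours_spent

print(round(mph, 2))  # 41.42
```

Using a separate name for the converted distance also avoids overwriting the original value with one in different units.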