November 29, 2023
Data frames are one of the most important data structures for representing tabular data. Base R includes the tried and tested data.frame type, which is technically a list of equal-length vectors (where each vector corresponds to a column in the data frame). The tibble package (part of the Tidyverse) offers a slightly tweaked data frame type called tibble. In practical data analysis pipelines, we frequently create subsets of the data frame, for example by selecting one or more columns and/or rows. In most situations, the data.frame and tibble types are interchangeable. However, there are subtle differences in the context of subsetting, which I will highlight in this post.
Subsetting data frames is slightly more challenging than subsetting vectors (see this post), mainly because there is a multitude of available (and partly redundant) options. We’ll start with a small data frame df consisting of four rows and three columns:
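A data frame with exactly this content can be built from three equal-length vectors. The snippet is a minimal sketch; the column values are taken from the printed output.

```r
# Build the example data frame from three equal-length vectors
df = data.frame(
    first=1:4,                # integer column 1, 2, 3, 4
    second=seq(pi, pi + 3),   # numeric column pi, pi + 1, pi + 2, pi + 3
    third=LETTERS[1:4]        # character column "A", "B", "C", "D"
)
print(df)
```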
first second third
1 1 3.141593 A
2 2 4.141593 B
3 3 5.141593 C
4 4 6.141593 D
This is a classic data.frame, so let’s also create a tibble with identical contents:
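One way to do this, assuming the tibble package is installed, is to convert the existing data frame with as_tibble(). The variable name tb is a placeholder chosen for this sketch and is not taken from the original post.

```r
library(tibble)

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])

# Convert the data.frame to a tibble; `tb` is a hypothetical name
tb = as_tibble(df)
print(tb)
```

Printing a tibble also shows its dimensions and the type of each column.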
In the following examples, we will explore different options to select the second column (named "second").
$ operator

We’ll start with the $ operator, which extracts a single column by name as follows:
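The calls themselves are not shown in the text. A plausible reconstruction, with df rebuilt so that the snippet runs on its own, is:

```r
df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])

# Column name without quotes (the more common variant)
df$second

# Column name in quotes
df$"second"
```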
[1] 3.141593 4.141593 5.141593 6.141593
[1] 3.141593 4.141593 5.141593 6.141593
We can enclose the desired column name in quotes, but the first variant without quotes is more common. In either case, R returns the single column as a basic vector. This is also true when working with a tibble:
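The same two calls on a tibble might look as follows. This sketch assumes the tibble package is available and uses the placeholder name tb for the tibble.

```r
library(tibble)

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])
tb = as_tibble(df)  # `tb` is a hypothetical name for the tibble

# Both variants return a plain numeric vector, not a tibble
tb$second
tb$"second"
```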
[1] 3.141593 4.141593 5.141593 6.141593
[1] 3.141593 4.141593 5.141593 6.141593
The $ notation is convenient for interactive exploration, because we don’t have to type a lot of extra characters (except for the $ sign). In addition, RStudio offers auto-completion of matching column names in its console.
Subsetting a data.frame with $ performs partial matching: R returns a column whose name merely starts with the given string, provided that this prefix matches exactly one column. For example, an abbreviated name such as df$sec will happily return df$second. You can learn more about $ by typing ?`$` in the interactive console.
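The following sketch shows partial matching in action. The abbreviation sec and the second data frame amb are made up for illustration.

```r
df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])

# "sec" is a prefix of exactly one column name, so $ resolves it to "second"
df$sec

# [[ ]] matches names exactly by default, so the same prefix yields NULL
df[["sec"]]

# An ambiguous prefix does not match any column and also yields NULL
amb = data.frame(size=1:2, side=3:4)  # hypothetical data frame
amb$si
```

The ambiguous case is the dangerous one: adding a column later can silently turn a working abbreviation into NULL.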
The $ operator applied to a tibble does not perform partial matching. Instead, an abbreviated column name results in NULL and raises a warning.
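A minimal sketch of this behavior, again assuming the tibble package and the placeholder names tb and sec:

```r
library(tibble)

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])
tb = as_tibble(df)  # `tb` is a hypothetical name for the tibble

# "sec" is not an exact column name, so no column is matched:
# the result is NULL and tibble warns about an unknown column
abbreviated = tb$sec
is.null(abbreviated)

# The full name works as usual
tb$second
```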
It is easy to shoot yourself in the foot with partial matching. Therefore, I advise against using the $ notation when working with data.frame objects.
[[]] operator

Another way to select a single column uses double square brackets notation [[]]. We can specify either the position or the name of the desired column:
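The four calls behind the output might look like this (a sketch; tb is a placeholder name for the tibble):

```r
library(tibble)

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])
tb = as_tibble(df)  # `tb` is a hypothetical name for the tibble

# Select by position
df[[2]]
tb[[2]]

# Select by name
df[["second"]]
tb[["second"]]
```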
[1] 3.141593 4.141593 5.141593 6.141593
[1] 3.141593 4.141593 5.141593 6.141593
[1] 3.141593 4.141593 5.141593 6.141593
[1] 3.141593 4.141593 5.141593 6.141593
Both data.frame and tibble objects return the desired column as a vector.
[] operator

Interestingly, we can also select a single column with single square bracket notation [], which we’ve already seen with atomic vectors:
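With a single index and no comma, [] treats the data frame as a list of columns. A sketch of the corresponding calls, using the placeholder name tb for the tibble:

```r
library(tibble)

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])
tb = as_tibble(df)  # `tb` is a hypothetical name for the tibble

# Select by name or by position; the result keeps its data frame type
df["second"]
df[2]
tb["second"]
tb[2]
```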
second
1 3.141593
2 4.141593
3 5.141593
4 6.141593
second
1 3.141593
2 4.141593
3 5.141593
4 6.141593
# A tibble: 4 × 1
second
<dbl>
1 3.14
2 4.14
3 5.14
4 6.14
# A tibble: 4 × 1
second
<dbl>
1 3.14
2 4.14
3 5.14
4 6.14
The important difference here is that the resulting subset is a data frame (depending on the original type either a data.frame or a tibble) and not a vector, even though we select only a single column.
If we want to select multiple columns, we have to use single square bracket notation []. We can specify both a row selection and a column selection, separated by a comma, within the square brackets. However, we can omit either or both indices to select entire columns or rows.
Let’s start with selecting a single column. For example, we can grab the second column by omitting the row selection:
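As a sketch, leaving the row index empty and putting the column after the comma could be written like this (tb is a placeholder name for the tibble):

```r
library(tibble)

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])
tb = as_tibble(df)  # `tb` is a hypothetical name for the tibble

# data.frame: the single column is simplified to a vector
df[, 2]
df[, "second"]

# tibble: the result stays a tibble with one column
tb[, 2]
tb[, "second"]
```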
[1] 3.141593 4.141593 5.141593 6.141593
[1] 3.141593 4.141593 5.141593 6.141593
# A tibble: 4 × 1
second
<dbl>
1 3.14
2 4.14
3 5.14
4 6.14
# A tibble: 4 × 1
second
<dbl>
1 3.14
2 4.14
3 5.14
4 6.14
A data.frame returns the column as a vector, whereas a tibble returns a tibble (with a single column).
When selecting a single column, we can set the returned value to be a vector or a single-column data frame with the drop argument (drop=TRUE means vector, whereas drop=FALSE means data frame):
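The sketch below overrides the default in both directions: drop=FALSE keeps a data.frame as a data frame, and drop=TRUE turns a single-column tibble selection into a vector. The name tb is a placeholder.

```r
library(tibble)

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])
tb = as_tibble(df)  # `tb` is a hypothetical name for the tibble

# data.frame: keep the data frame structure instead of simplifying
df[, 2, drop=FALSE]
df[, "second", drop=FALSE]

# tibble: explicitly request a plain vector
tb[, 2, drop=TRUE]
tb[, "second", drop=TRUE]
```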
second
1 3.141593
2 4.141593
3 5.141593
4 6.141593
second
1 3.141593
2 4.141593
3 5.141593
4 6.141593
[1] 3.141593 4.141593 5.141593 6.141593
[1] 3.141593 4.141593 5.141593 6.141593
I’ve rarely seen this in practice, so I don’t recommend using it unless there is no other option.
In contrast to $ and [[]], single square bracket notation [] allows us to select multiple columns:
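There are several equivalent spellings, with or without the comma and by position or by name. A sketch for the data.frame case follows; the same calls should work unchanged on a tibble.

```r
df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])

# With a comma: empty row index, columns by position or by name
by_position = df[, 2:3]
by_name = df[, c("second", "third")]

# Without a comma: the data frame is indexed like a list of columns
list_position = df[2:3]
list_name = df[c("second", "third")]

print(by_position)
```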
second third
1 3.141593 A
2 4.141593 B
3 5.141593 C
4 6.141593 D
second third
1 3.141593 A
2 4.141593 B
3 5.141593 C
4 6.141593 D
second third
1 3.141593 A
2 4.141593 B
3 5.141593 C
4 6.141593 D
second third
1 3.141593 A
2 4.141593 B
3 5.141593 C
4 6.141593 D
# A tibble: 4 × 2
second third
<dbl> <chr>
1 3.14 A
2 4.14 B
3 5.14 C
4 6.14 D
# A tibble: 4 × 2
second third
<dbl> <chr>
1 3.14 A
2 4.14 B
3 5.14 C
4 6.14 D
# A tibble: 4 × 2
second third
<dbl> <chr>
1 3.14 A
2 4.14 B
3 5.14 C
4 6.14 D
# A tibble: 4 × 2
second third
<dbl> <chr>
1 3.14 A
2 4.14 B
3 5.14 C
4 6.14 D
The returned subset will always be a data frame.
A tibble is more consistent than a data.frame when using []-style subsetting, because the result will always be a tibble. In contrast, with data.frame objects, a row-and-column index such as df[, 2] returns a vector when it selects a single column and a data.frame when it selects multiple columns.
Selecting one or more rows is also known as filtering. We use the row index (the value before the comma) within single square brackets [] to create the desired subset. The result will always be a data frame.
For example, we can select the second row as follows (don’t forget the trailing comma):
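A sketch of the two calls, with tb as a placeholder name for the tibble:

```r
library(tibble)

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])
tb = as_tibble(df)  # `tb` is a hypothetical name for the tibble

# Row index before the comma, empty column index after it
df[2, ]
tb[2, ]
```

Without the trailing comma, df[2] would select the second column instead of the second row.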
first second third
2 2 4.141593 B
# A tibble: 1 × 3
first second third
<int> <dbl> <chr>
1 2 4.14 B
Similarly, we can also select multiple rows:
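For instance, a vector of row positions selects several rows at once (a sketch; tb is a placeholder name):

```r
library(tibble)

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])
tb = as_tibble(df)  # `tb` is a hypothetical name for the tibble

# Rows 2 and 3, all columns
df[2:3, ]
tb[2:3, ]
```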
first second third
2 2 4.141593 B
3 3 5.141593 C
# A tibble: 2 × 3
first second third
<int> <dbl> <chr>
1 2 4.14 B
2 3 5.14 C
Logical subsetting is especially useful for filtering rows. The following example creates a subset by selecting rows where the values in the second column are greater than 5:
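A minimal sketch of such a filter: the comparison produces one logical value per row, and only rows where it is TRUE are kept. The name tb is again a placeholder for the tibble.

```r
library(tibble)

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])
tb = as_tibble(df)  # `tb` is a hypothetical name for the tibble

# Logical vector with one element per row
keep = df$second > 5

# Use it as the row index; the column index stays empty
df[keep, ]
tb[tb$second > 5, ]
```

In this data, only the third and fourth rows have values above 5 in the second column.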