(df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4]))
  first   second third
1     1 3.141593     A
2     2 4.141593     B
3     3 5.141593     C
4     4 6.141593     D
data.frame and Tidyverse tibble objects.
November 29, 2023
Data frames are one of the most important data structures for representing tabular data. Base R includes the tried and tested data.frame type, which is technically a list of equal-length vectors (where each vector corresponds to a column in the data frame). The tibble package (part of the Tidyverse) offers a slightly tweaked data frame type called tibble. In practical data analysis pipelines, we frequently create subsets of the data frame, for example by selecting one or more columns and/or rows. In most situations, the data.frame and tibble types are interchangeable. However, there are subtle differences in the context of subsetting, which I will highlight in this post.
Subsetting data frames is slightly more challenging than subsetting vectors (see this post), mainly because there is a multitude of available (and partly redundant) options. We’ll start with a small data frame df consisting of four rows and three columns:
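As a minimal sketch, the data frame printed below can be built with base R alone. The data.frame() call is taken from this post; the sanity check after it is only added here for illustration.

```r
# Build the example data frame: four rows, three columns
df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])

# Illustrative sanity check (not part of the original post)
stopifnot(nrow(df) == 4, ncol(df) == 3)

# Print the data frame
df
```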
  first   second third
1     1 3.141593     A
2     2 4.141593     B
3     3 5.141593     C
4     4 6.141593     D
This is a classic data.frame, so let’s also create a tibble with identical contents:
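One way to do this, sketched under the assumption that the tibble package is installed, is to convert the existing data frame with as_tibble(). The variable name tf is a hypothetical choice made for these examples.

```r
library(tibble)  # part of the Tidyverse

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])

# `tf` is a hypothetical name for the tibble version of `df`;
# the surrounding parentheses print the result
(tf = as_tibble(df))
```

Calling tibble(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4]) directly should give the same contents.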
In the following examples, we will explore different options to select the second column (named "second").
$ operator
We’ll start with the $ operator, which extracts a single column by name as follows:
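The two calls that would produce the output below are sketched here, first without and then with quotes around the column name:

```r
df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])

df$second    # unquoted column name
df$"second"  # quoted column name, same result
```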
[1] 3.141593 4.141593 5.141593 6.141593
[1] 3.141593 4.141593 5.141593 6.141593
We can enclose the desired column name in quotes, but the first variant without quotes is more common. In either case, R returns the single column as a basic vector. This is also true when working with a tibble:
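The same pair of calls on a tibble might look as follows. The name tf is a hypothetical one used for the tibble in these sketches, and the tibble package is assumed to be installed.

```r
library(tibble)

# `tf` is a hypothetical name for the tibble with the same contents as `df`
tf = tibble(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])

tf$second    # unquoted column name
tf$"second"  # quoted column name, same result
```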
[1] 3.141593 4.141593 5.141593 6.141593
[1] 3.141593 4.141593 5.141593 6.141593
The $ notation is convenient for interactive exploration, because we don’t have to type a lot of extra characters (except for the $ sign). In addition, RStudio offers auto-completion of matching column names in its console.
Subsetting a data.frame with $ performs partial matching. This means that R will return the column whose name starts with the given string, provided that the match is unique (if several column names match partially, the result is NULL), for example:
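A minimal sketch of such a call, using the abbreviated (and non-existent) column name sec as an illustrative choice:

```r
df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])

# There is no column called "sec", but "second" is the only column
# whose name starts with "sec", so partial matching picks it
df$sec
```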
R will happily return df$second in this example. You can learn more about $ by typing ?`$` in the interactive console.
The $ operator applied to a tibble does not perform partial matching. Instead, the following example will result in NULL and raise a warning:
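Sketched with a tibble named tf (a hypothetical name, with the tibble package assumed to be installed), the same abbreviated column name behaves differently:

```r
library(tibble)

# `tf` is a hypothetical name for the tibble with the same contents as `df`
tf = tibble(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])

# No partial matching: this yields NULL and a warning about an unknown column
tf$sec
```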
It is easy to shoot yourself in the foot with partial matching. Therefore, I advise against using the $ notation when working with data.frame objects.
[[]] operator
Another way to select a single column uses double square brackets notation [[]]. We can specify either the position or the name of the desired column:
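The four calls behind the output below could be written like this, first for the data.frame and then for the tibble (tf is a hypothetical name for the tibble; the tibble package is assumed to be installed):

```r
library(tibble)

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])
tf = as_tibble(df)  # `tf` is a hypothetical name

df[[2]]         # by position
df[["second"]]  # by name
tf[[2]]         # by position
tf[["second"]]  # by name
```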
[1] 3.141593 4.141593 5.141593 6.141593
[1] 3.141593 4.141593 5.141593 6.141593
[1] 3.141593 4.141593 5.141593 6.141593
[1] 3.141593 4.141593 5.141593 6.141593
Both data.frame and tibble objects return the desired column as a vector.
[] operator
Interestingly, we can also select a single column with single square bracket notation [], which we’ve already seen with atomic vectors:
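A sketch of the corresponding calls, again by position and by name, for both types (tf is a hypothetical name for the tibble; the tibble package is assumed to be installed):

```r
library(tibble)

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])
tf = as_tibble(df)  # `tf` is a hypothetical name

df[2]         # by position
df["second"]  # by name
tf[2]         # by position
tf["second"]  # by name
```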
    second
1 3.141593
2 4.141593
3 5.141593
4 6.141593
    second
1 3.141593
2 4.141593
3 5.141593
4 6.141593
# A tibble: 4 × 1
  second
   <dbl>
1   3.14
2   4.14
3   5.14
4   6.14
# A tibble: 4 × 1
  second
   <dbl>
1   3.14
2   4.14
3   5.14
4   6.14
The important difference here is that the resulting subset is a data frame (depending on the original type either a data.frame or a tibble) and not a vector, even though we select only a single column.
If we want to select multiple columns, we have to use single square bracket notation []. We can specify both a row selection and a column selection, separated by a comma, within the square brackets. We can also leave either index empty: an empty row index keeps all rows, and an empty column index keeps all columns.
Let’s start with selecting a single column. For example, we can grab the second column by omitting the row selection:
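With an empty row index before the comma, the calls might look like this (tf is a hypothetical name for the tibble; the tibble package is assumed to be installed):

```r
library(tibble)

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])
tf = as_tibble(df)  # `tf` is a hypothetical name

df[, 2]         # all rows, column by position
df[, "second"]  # all rows, column by name
tf[, 2]         # all rows, column by position
tf[, "second"]  # all rows, column by name
```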
[1] 3.141593 4.141593 5.141593 6.141593
[1] 3.141593 4.141593 5.141593 6.141593
# A tibble: 4 × 1
  second
   <dbl>
1   3.14
2   4.14
3   5.14
4   6.14
# A tibble: 4 × 1
  second
   <dbl>
1   3.14
2   4.14
3   5.14
4   6.14
As the output shows, a data.frame returns the column as a vector, whereas a tibble returns a tibble (with a single column).
When selecting a single column, we can set the returned value to be a vector or a single-column data frame with the drop argument (drop=TRUE means vector, whereas drop=FALSE means data frame):
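As a sketch, the following calls override the default of each type: drop=FALSE keeps the data.frame result as a data frame, and drop=TRUE turns the tibble result into a vector (tf is a hypothetical name for the tibble; the tibble package is assumed to be installed):

```r
library(tibble)

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])
tf = as_tibble(df)  # `tf` is a hypothetical name

df[, 2, drop=FALSE]         # single-column data.frame instead of a vector
df[, "second", drop=FALSE]
tf[, 2, drop=TRUE]          # vector instead of a single-column tibble
tf[, "second", drop=TRUE]
```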
    second
1 3.141593
2 4.141593
3 5.141593
4 6.141593
    second
1 3.141593
2 4.141593
3 5.141593
4 6.141593
[1] 3.141593 4.141593 5.141593 6.141593
[1] 3.141593 4.141593 5.141593 6.141593
I’ve rarely seen this in practice, so I don’t recommend using it unless there is no other option.
In contrast to $ and [[]], single square bracket notation [] allows us to select multiple columns:
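The sketch below lists four equivalent ways to select the second and third columns, applied first to the data.frame and then to the tibble. The exact order of the calls is an assumption, as is the name tf for the tibble; the tibble package is assumed to be installed.

```r
library(tibble)

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])
tf = as_tibble(df)  # `tf` is a hypothetical name

# data.frame: with and without the comma, by position and by name
df[, 2:3]
df[, c("second", "third")]
df[2:3]
df[c("second", "third")]

# tibble: the same four variants
tf[, 2:3]
tf[, c("second", "third")]
tf[2:3]
tf[c("second", "third")]
```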
    second third
1 3.141593     A
2 4.141593     B
3 5.141593     C
4 6.141593     D
    second third
1 3.141593     A
2 4.141593     B
3 5.141593     C
4 6.141593     D
    second third
1 3.141593     A
2 4.141593     B
3 5.141593     C
4 6.141593     D
    second third
1 3.141593     A
2 4.141593     B
3 5.141593     C
4 6.141593     D
# A tibble: 4 × 2
  second third
   <dbl> <chr>
1   3.14 A
2   4.14 B
3   5.14 C
4   6.14 D
# A tibble: 4 × 2
  second third
   <dbl> <chr>
1   3.14 A
2   4.14 B
3   5.14 C
4   6.14 D
# A tibble: 4 × 2
  second third
   <dbl> <chr>
1   3.14 A
2   4.14 B
3   5.14 C
4   6.14 D
# A tibble: 4 × 2
  second third
   <dbl> <chr>
1   3.14 A
2   4.14 B
3   5.14 C
4   6.14 D
The returned subset will always be a data frame.
A tibble is more consistent than a data.frame when using []-style subsetting with a comma, because the result will always be a tibble. In contrast, a data.frame returns a vector when we select a single column this way and a data.frame when we select multiple columns.
Selecting one or more rows is also known as filtering. We use the row index (the value before the comma) within single square brackets [] to create the desired subset. The result will always be a data frame.
For example, we can select the second row as follows (don’t forget the trailing comma):
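A minimal sketch of the two calls, one per type (tf is a hypothetical name for the tibble; the tibble package is assumed to be installed):

```r
library(tibble)

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])
tf = as_tibble(df)  # `tf` is a hypothetical name

df[2, ]  # second row, all columns
tf[2, ]  # second row, all columns
```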
  first   second third
2     2 4.141593     B
# A tibble: 1 × 3
  first second third
  <int>  <dbl> <chr>
1     2   4.14 B
Similarly, we can also select multiple rows:
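For instance, rows two and three could be selected with an integer range as the row index (tf is a hypothetical name for the tibble; the tibble package is assumed to be installed):

```r
library(tibble)

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])
tf = as_tibble(df)  # `tf` is a hypothetical name

df[2:3, ]  # rows 2 and 3, all columns
tf[2:3, ]  # rows 2 and 3, all columns
```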
  first   second third
2     2 4.141593     B
3     3 5.141593     C
# A tibble: 2 × 3
  first second third
  <int>  <dbl> <chr>
1     2   4.14 B
2     3   5.14 C
Logical subsetting is especially useful for filtering rows. The following example creates a subset by selecting rows where the values in the second column are greater than 5:
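A sketch of how this might be written: the comparison produces a logical vector with one entry per row, which then serves as the row index (tf is a hypothetical name for the tibble; the tibble package is assumed to be installed).

```r
library(tibble)

df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4])
tf = as_tibble(df)  # `tf` is a hypothetical name

df$second > 5         # logical vector: FALSE FALSE TRUE TRUE

df[df$second > 5, ]   # keeps rows 3 and 4 of the data.frame
tf[tf$second > 5, ]   # keeps rows 3 and 4 of the tibble
```

Only the values 5.141593 and 6.141593 exceed 5, so both results contain the last two rows.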