Subsetting data frames in R

Basics

Selecting rows and columns from a data frame is one of the most basic data manipulation operations. In this post, I show several options for creating subsets of data frames in R, and I also point out important differences between classic data.frame and Tidyverse tibble objects.

Published

November 29, 2023

Introduction

Data frames are one of the most important data structures for representing tabular data. Base R includes the tried and tested data.frame type, which is technically a list of equal-length vectors (where each vector corresponds to a column in the data frame). The tibble package (part of the Tidyverse) offers a slightly tweaked data frame type called tibble. In practical data analysis pipelines, we frequently create subsets of the data frame, for example by selecting one or more columns and/or rows. In most situations, the data.frame and tibble types are interchangeable. However, there are subtle differences in the context of subsetting, which I will highlight in this post.

Subsetting data frames is slightly more challenging than subsetting vectors (see this post), mainly because there is a multitude of available (and partly redundant) options. We’ll start with a small data frame df consisting of four rows and three columns:

(df = data.frame(first=1:4, second=seq(pi, pi + 3), third=LETTERS[1:4]))

  first   second third
1     1 3.141593     A
2     2 4.141593     B
3     3 5.141593     C
4     4 6.141593     D

This is a classic data.frame, so let’s also create a tibble with identical contents:

(tf = tibble::as_tibble(df))

# A tibble: 4 × 3
  first second third
  <int>  <dbl> <chr>
1     1   3.14 A    
2     2   4.14 B    
3     3   5.14 C    
4     4   6.14 D
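The statement that a data.frame is technically a list of equal-length vectors can be checked with a few base R functions. The following is a minimal sketch that recreates df so it runs on its own:

```r
# Recreate the example data frame so the snippet is self-contained
df <- data.frame(first = 1:4, second = seq(pi, pi + 3), third = LETTERS[1:4])

is.list(df)   # TRUE: a data.frame is built on top of a list
typeof(df)    # "list"
length(df)    # 3: one list element per column
lengths(df)   # each of the three columns has length 4
```

This list structure is the reason why the list operators $ and [[]] work on data frames at all.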

Selecting a single column

In the following examples, we will explore different options to select the second column (named "second").

The `$` operator

We’ll start with the $ operator, which extracts a single column by name as follows:

df$second  # vector

[1] 3.141593 4.141593 5.141593 6.141593

df$"second"  # vector

[1] 3.141593 4.141593 5.141593 6.141593

We can enclose the desired column name in quotes, but the first variant without quotes is more common. In either case, R returns the single column as a basic vector. This is also true when working with a tibble:

tf$second  # vector

[1] 3.141593 4.141593 5.141593 6.141593

tf$"second"  # vector

[1] 3.141593 4.141593 5.141593 6.141593

The $ notation is convenient for interactive exploration, because we don’t have to type a lot of extra characters (except for the $ sign). In addition, RStudio offers auto-completion of matching column names in its console.

Important

Subsetting a data.frame with $ performs partial matching. This means that R will return the column whose name starts with the given characters, as long as exactly one column matches, for example:

df$s  # extracts column "second"

[1] 3.141593 4.141593 5.141593 6.141593

R will happily return df$second in this example. You can learn more about $ by typing ?`$` in the interactive console.

The $ operator applied to a tibble does not perform partial matching. Instead, the following example will result in NULL and raise a warning:

tf$s  # returns NULL (no partial matching)

Warning: Unknown or uninitialised column: `s`.

NULL

It is easy to shoot yourself in the foot with partial matching. Therefore, I advise against using the $ notation when working with data.frame objects.
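To illustrate how partial matching can go wrong, here is a small sketch using base R only. The extra column size is a hypothetical addition (it is not part of the original example) so that two column names start with "s":

```r
df <- data.frame(first = 1:4, second = seq(pi, pi + 3), third = LETTERS[1:4])

# Hypothetical extra column so that two names start with "s"
df$size <- c(10, 20, 30, 40)

ambiguous <- df$s        # NULL: "s" matches both "second" and "size"
unique_match <- df$se    # "se" matches only "second", so we get that column
exact_only <- df[["s"]]  # NULL: [[]] matches names exactly by default

# Optionally ask R to warn whenever $ falls back to partial matching
old <- options(warnPartialMatchDollar = TRUE)
df$se                    # same result as before, but should now emit a warning
options(old)             # restore the previous setting
```

Note that adding a column silently changed what df$s returns, which is exactly the kind of surprise that makes partial matching risky.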

The `[[]]` operator

Another way to select a single column uses double square brackets notation [[]]. We can specify either the position or the name of the desired column:

df[[2]]  # vector

[1] 3.141593 4.141593 5.141593 6.141593

df[["second"]]  # vector

[1] 3.141593 4.141593 5.141593 6.141593

tf[[2]]  # vector

[1] 3.141593 4.141593 5.141593 6.141593

tf[["second"]]  # vector

[1] 3.141593 4.141593 5.141593 6.141593

Both data.frame and tibble objects return the desired column as a vector.

Note

If you really want, you can enable partial matching for data.frame types as follows (but you probably don’t want to do this):

df[["s", exact=FALSE]]  # vector (partial matching)

[1] 3.141593 4.141593 5.141593 6.141593
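One practical advantage of [[]] over $ is that the column name can come from a variable. The sketch below assumes a hypothetical variable col that holds the name of the desired column:

```r
df <- data.frame(first = 1:4, second = seq(pi, pi + 3), third = LETTERS[1:4])

col <- "second"           # hypothetical variable holding a column name

by_variable <- df[[col]]  # uses the value of col, so we get column "second"
by_dollar <- df$col       # looks for a column literally named "col": NULL
```

This makes [[]] the natural choice inside functions or loops where the column name is not known in advance.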

The `[]` operator

Interestingly, we can also select a single column with single square bracket notation [], which we’ve already seen with atomic vectors:

df[2]  # data.frame

df["second"]  # data.frame

tf[2]  # tibble

# A tibble: 4 × 1
  second
   <dbl>
1   3.14
2   4.14
3   5.14
4   6.14

tf["second"]  # tibble

# A tibble: 4 × 1
  second
   <dbl>
1   3.14
2   4.14
3   5.14
4   6.14

The important difference here is that the resulting subset is a data frame (depending on the original type either a data.frame or a tibble) and not a vector, even though we select only a single column.
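We can verify this difference between [] and [[]] by inspecting the results. A minimal sketch with the classic data.frame:

```r
df <- data.frame(first = 1:4, second = seq(pi, pi + 3), third = LETTERS[1:4])

single_bracket <- df["second"]    # keeps the data frame structure
double_bracket <- df[["second"]]  # extracts the column itself

is.data.frame(single_bracket)     # TRUE
dim(single_bracket)               # 4 rows, 1 column
is.data.frame(double_bracket)     # FALSE
is.numeric(double_bracket)        # TRUE: a plain numeric vector
```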

Selecting multiple columns

If we want to select multiple columns, we have to use single square bracket notation []. We can specify both a row selection and a column selection, separated by a comma, within the square brackets. However, we can omit either or both indices to select entire columns or rows.

Let’s start with selecting a single column. For example, we can grab the second column by omitting the row selection:

df[, 2]  # vector

[1] 3.141593 4.141593 5.141593 6.141593

df[, "second"]  # vector

[1] 3.141593 4.141593 5.141593 6.141593

tf[, 2]  # tibble

# A tibble: 4 × 1
  second
   <dbl>
1   3.14
2   4.14
3   5.14
4   6.14

tf[, "second"]  # tibble

# A tibble: 4 × 1
  second
   <dbl>
1   3.14
2   4.14
3   5.14
4   6.14

Important

With this comma notation, a data.frame will return a single column as a vector, whereas a tibble will return a tibble (with a single column).

Note

When selecting a single column, we can set the returned value to be a vector or a single-column data frame with the drop argument (drop=TRUE means vector, whereas drop=FALSE means data frame):

df[, 2, drop=FALSE]  # data.frame

df[, "second", drop=FALSE]  # data.frame

tf[, 2, drop=TRUE]  # vector

[1] 3.141593 4.141593 5.141593 6.141593

tf[, "second", drop=TRUE]  # vector

[1] 3.141593 4.141593 5.141593 6.141593

I’ve rarely seen this in practice, so I don’t recommend using it unless there is no other option.
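One situation where drop can be hard to avoid is a function that receives an arbitrary number of column names and should always return a data frame. The helper select_cols below is hypothetical and only meant as a sketch of that case:

```r
df <- data.frame(first = 1:4, second = seq(pi, pi + 3), third = LETTERS[1:4])

# Hypothetical helper: always returns a data frame, even for a single column
select_cols <- function(data, cols) {
  data[, cols, drop = FALSE]
}

one <- select_cols(df, "second")
two <- select_cols(df, c("second", "third"))

ncol(one)             # 1
ncol(two)             # 2
ncol(df[, "second"])  # NULL, because without drop=FALSE we get a plain vector
```

For a fixed, known column, the other notations shown in this post are usually easier to read.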

In contrast to $ and [[]], single square bracket notation [] allows us to select multiple columns:

df[, c("second", "third")]

    second third
1 3.141593     A
2 4.141593     B
3 5.141593     C
4 6.141593     D

df[, c(2, 3)]

    second third
1 3.141593     A
2 4.141593     B
3 5.141593     C
4 6.141593     D

df[c("second", "third")]

    second third
1 3.141593     A
2 4.141593     B
3 5.141593     C
4 6.141593     D

df[c(2, 3)]

    second third
1 3.141593     A
2 4.141593     B
3 5.141593     C
4 6.141593     D

tf[, c("second", "third")]

# A tibble: 4 × 2
  second third
   <dbl> <chr>
1   3.14 A    
2   4.14 B    
3   5.14 C    
4   6.14 D

tf[, c(2, 3)]

# A tibble: 4 × 2
  second third
   <dbl> <chr>
1   3.14 A    
2   4.14 B    
3   5.14 C    
4   6.14 D

tf[c("second", "third")]

# A tibble: 4 × 2
  second third
   <dbl> <chr>
1   3.14 A    
2   4.14 B    
3   5.14 C    
4   6.14 D

tf[c(2, 3)]

# A tibble: 4 × 2
  second third
   <dbl> <chr>
1   3.14 A    
2   4.14 B    
3   5.14 C    
4   6.14 D

When we select two or more columns, the returned subset will always be a data frame.

Tip

A tibble is more consistent than a data.frame when using []-style subsetting, because the result will always be a tibble. In contrast, we get a vector when selecting a single column and a data.frame when selecting multiple columns with data.frame objects.

Selecting rows

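Selecting one or more rows is also known as filtering. We use the row index (the value before the comma) within single square brackets [] to create the desired subset. The result will be a data frame; the one exception is a data.frame with only a single column, where the result is dropped to a vector unless we set drop=FALSE.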

For example, we can select the second row as follows (don’t forget the trailing comma):

df[2, ]  # data.frame

  first   second third
2     2 4.141593     B

tf[2, ]  # tibble

# A tibble: 1 × 3
  first second third
  <int>  <dbl> <chr>
1     2   4.14 B

Similarly, we can also select multiple rows:

df[c(2, 3), ]  # data.frame

  first   second third
2     2 4.141593     B
3     3 5.141593     C

tf[c(2, 3), ]  # tibble

# A tibble: 2 × 3
  first second third
  <int>  <dbl> <chr>
1     2   4.14 B    
2     3   5.14 C

Logical subsetting is especially useful for filtering rows. The following example creates a subset by selecting rows where the values in the second column are greater than 5:

df[df[, 2] > 5, ]

  first   second third
3     3 5.141593     C
4     4 6.141593     D

tf[tf[, 2] > 5, ]

# A tibble: 2 × 3
  first second third
  <int>  <dbl> <chr>
1     3   5.14 C    
2     4   6.14 D
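The logical condition can also be built from a column extracted with [[]], which yields a plain logical vector for both data.frame and tibble objects. The following sketch uses the classic data.frame and additionally combines two conditions (the variable names keep and both are only for illustration):

```r
df <- data.frame(first = 1:4, second = seq(pi, pi + 3), third = LETTERS[1:4])

# Logical vector with one entry per row
keep <- df[["second"]] > 5   # FALSE FALSE TRUE TRUE

df[keep, ]                   # rows 3 and 4

# Combine conditions element-wise with & (and) or | (or)
both <- df[df[["second"]] > 5 & df[["third"]] == "D", ]
both                         # only row 4
```

Storing the condition in a variable first makes longer filters easier to read and to reuse.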
