Visruth Srimath Kandali

The Pipe

The pipe is an essential part of my R toolkit, allowing for simple and cogent method chaining and sensibly decomposing long chains of commands. I was motivated to begin this piece as an ode to the pipe, and as a brief introduction to, and endorsement of, embracing piping to the fullest extent. I have often seen people using the pipe half-heartedly, and I wish to urge them to adopt it wholly. I will elaborate further on my stance later, but first a short history of the pipe.

A Very Brief History of Piping

Shells

I didn’t do too much digging (just some basic Wikipedia sleuthing), but I found that the pipe seems to have been created in the early ’70s as part of Version 3 Unix.1 This was featured in its shell, the Thompson Shell2, with a syntax that has stayed remarkably consistent in *NIX shells, using | to pipe the output of one command into the input of the next.

# list all files (`ls`), then filter them and print to stdout the ones which end with 'pdf' (`grep`)
ls -lh | grep 'pdf$'

Related to piping in UNIX is redirection: sending the stdout of a command to a file instead of the terminal.

# save the results of the prior commands to a new file at `./my_pdfs.txt`
ls -lh | grep 'pdf$' > my_pdfs.txt

Calling > piping is a bit of an abuse of terminology, as piping usually refers to sending the output of one command into another, whereas > sends the output of a command to a file.

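The | syntax has stayed the same for the past 50 years and can be seen in modern shells like fish3, nushell4, and xonsh5, all of which pipe with | as described above. Redirection with > carries over to fish and xonsh as well; nushell, as far as I can tell, spells it differently (out> or the save command), since it uses > for comparisons.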

Functional Languages

Despite it being a mainstay in shells since the ’70s, I wasn’t able to find the pipe in many programming languages. Pipes are most often seen in functional programming languages, where the natural predisposition towards function application pairs neatly with an infix operator for chaining functions. Such an operator doesn’t seem to be a part of Standard ML. I think the earliest implementation of a piping operator is F#’s pipe (|>), which seems to have been part of the language spec from the get-go in the early noughties.6 The syntax is simple and was formative, as most languages after F# inherited the usage of |> to signify piping. OCaml7 and Haskell8 got their pipe operators (|> and &) in the mid-2010s. Julia also seems to have had the pipe from its inception around this time9. These languages are all functional or “semi-functional” (viz. R and Julia), and though I didn’t look too hard, I couldn’t find instances of piping outside of this context. That isn’t too surprising: there isn’t much need for this kind of explicit piping in most software, I’d imagine, and other languages use similar ideas with different syntax (e.g. Java’s Streams).

Usage

Idiomatic Piping

Though I’ll be focusing on R, I’d hazard that some of these broad ideas are portable. The context of this section is mostly statistical (or “data science”, i.e. computational statistics, though I will probably write more on that later), so the examples may be contrived and restricted to this domain, but the ideas still hold and could be generalized.

I’ve noticed a good bit of the following usage of pipes, which I find worrying, as it misses the whole point of piping.

library(dplyr)

filtered_mtcars <- mtcars |> filter(mpg >= 15)
grouped_mtcars <- filtered_mtcars |> group_by(cyl)
grouped_mtcars |> summarize(mpg_mean = mean(mpg), mpg_sd = sd(mpg))
#> # A tibble: 3 × 3
#>     cyl mpg_mean mpg_sd
#>   <dbl>    <dbl>  <dbl>
#> 1     4     26.7   4.51
#> 2     6     19.7   1.45
#> 3     8     16.5   1.58

I’ve seen this semi-frequently, and I think it shows a bit of a misunderstanding of how to use the pipe. This style doesn’t fully embrace piping, because it stores intermediary throwaway objects like filtered_mtcars and grouped_mtcars, which aren’t used again in the analysis. The whole point of the pipe is to avoid the creation of such objects; an idiomatic approach to performing the same operations would be something like the following.

library(dplyr)

mtcars |>
    filter(mpg >= 15) |>
    group_by(cyl) |>
    summarize(mpg_mean = mean(mpg), mpg_sd = sd(mpg))
#> # A tibble: 3 × 3
#>     cyl mpg_mean mpg_sd
#>   <dbl>    <dbl>  <dbl>
#> 1     4     26.7   4.51
#> 2     6     19.7   1.45
#> 3     8     16.5   1.58

This example is, again, contrived and rather simple, but I think it highlights the difference between the two styles pretty well. The pipe exists to remove the nuisance intermediate objects you would otherwise need to create. Though this isn’t the best example due to its simplicity, saving the intermediate objects also carries a performance cost, as the following benchmark suggests.

library(dplyr)

bnch <- bench::mark(
    Improper = {
        filtered_mtcars <- mtcars |> filter(mpg >= 15)
        grouped_mtcars <- filtered_mtcars |> group_by(cyl)
        grouped_mtcars |> summarize(mpg_mean = mean(mpg), mpg_sd = sd(mpg))
    },
    Proper = {
        mtcars |>
            filter(mpg >= 15) |>
            group_by(cyl) |>
            summarize(mpg_mean = mean(mpg), mpg_sd = sd(mpg))
    },
    iterations = 10000
)
summary(bnch)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 Improper     3.29ms   4.85ms      193.    2.76MB     7.82
#> 2 Proper       2.47ms   4.83ms      199.    9.38KB     8.08
summary(bnch, relative = TRUE)
#> # A tibble: 2 × 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 Improper    1.33   1.00      1         301.     1
#> 2 Proper      1      1         1.03        1      1.03

Though the times are similar, the memory allocations differ strikingly: in this run, the version that saves objects allocated around 300 times more memory than the purely piped one. I’d treat the exact ratio with some caution, since a benchmark this small is sensitive to one-off costs in whichever expression runs first, but it was still somewhat startling to see. The cost of holding on to intermediates would only grow with larger dataframes.

I think the above behaviour is a special case of a more general pattern I’ve been noticing, wherein people make unnecessary assignments, saving objects which they will never reuse. Assignment should be avoided where it isn’t needed: a named object stays in memory for as long as its name does, and keeping memory use down is, as a general rule of thumb, a good thing. I don’t know enough computing to give a nuanced argument, but memory is a limited resource, and so habits should be built so as to systematically and automatically avoid using it unnecessarily.
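To make the point about leftover objects concrete, here is a minimal base-R sketch (no dplyr needed). The environments stepwise and piped are names I made up for illustration; they exist only so that we can list what each style leaves behind.

```r
# Two scratch environments (hypothetical names) to hold each style's objects.
stepwise <- new.env()
piped <- new.env()

# Style 1: save the intermediate result under its own name.
evalq({
    filtered <- subset(mtcars, mpg >= 15)
    result <- with(filtered, tapply(mpg, cyl, mean))
}, stepwise)

# Style 2: pipe straight through; only the final result gets a name.
evalq({
    result <- mtcars |>
        subset(mpg >= 15) |>
        with(tapply(mpg, cyl, mean))
}, piped)

ls(stepwise)
#> [1] "filtered" "result"
ls(piped)
#> [1] "result"

# Both styles compute the same thing.
identical(stepwise$result, piped$result)
#> [1] TRUE
```

The stepwise version keeps filtered bound for as long as its environment lives, whereas the piped version never names the intermediate, so R is free to reclaim it once the next step has consumed it.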

More Piping

As I’ve embraced piping, I’ve run across some issues at times. A classic one is how to deal with functions which don’t take data as their first argument, a seemingly insurmountable issue for the pipe. Treat the next snippet as the ground truth we wish to emulate by using a pipe. Though it doesn’t feature any function chaining, it is reasonable to think that we might want to do some operations on the dataframe before calling myfunc. We could just save those results and pass them manually to myfunc in a manner like below, or we could be a touch more clever as shown later.

myfunc <- function(col, value, data) data |> dplyr::filter({{ col }} > value)
myfunc(cyl, 6, mtcars)
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

We can’t pipe into this function since it doesn’t take in data as its first argument. There are a few ways we could rectify this issue.

The first is the one I am most familiar with, and was my default option till recently due to its flexibility and simplicity; you can just wrap the function in a lambda which takes data as its first (and probably only) argument.

mtcars |> (\(data) myfunc(cyl, 6, data))()
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

I don’t think this approach is fundamentally flawed, but there are some slightly better ways of doing this. If you name the other arguments, the piped data is matched to the one argument left over, which here is data. The problem, of course, is that this may not scale very well, since the arguments ahead of data all have to be spelled out by name.

mtcars |> myfunc(col = cyl, value = 6)
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
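
Why naming the other arguments works can be seen from how the base pipe is parsed. As far as I understand it, lhs |> f(args) is rewritten at parse time into f(lhs, args), and R's usual argument matching then handles named arguments first and fills the rest by position. Here is a small base-R sketch of that idea; demo_func is a hypothetical stand-in I wrote with the same argument order as myfunc, taking the column name as a string so that dplyr isn't needed.

```r
# quote() returns the parsed call without evaluating it, so we can see
# what the pipe turns into (myfunc does not even need to exist here).
piped_call <- quote(mtcars |> myfunc(col = cyl, value = 6))
print(piped_call)
#> myfunc(mtcars, col = cyl, value = 6)

# Hypothetical stand-in: same argument order as myfunc, base R only.
demo_func <- function(col, value, data) data[data[[col]] > value, ]

# `col` and `value` are matched by name, so the single unnamed argument
# (the piped-in data frame) falls through to `data`.
result <- mtcars |> demo_func(col = "cyl", value = 6)
nrow(result)
#> [1] 14
```

The 14 rows match the 8-cylinder cars listed in the output above.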

I think the most elegant approach utilizes the built-in functionality of the base pipe. I wasn’t aware of this until recently, but it neatly handles the issue and will be my go-to option. Using _ as a placeholder (available since R 4.2) sends the data to that argument in the function call, as demonstrated in the following snippet.

mtcars |> myfunc(cyl, 6, data = _)
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

One caveat is that the argument has to be named. There are some more subtle differences between the base pipe (|>) and the magrittr pipe (%>%), but I’ll point you to Hadley’s post instead of rehashing the matter.
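
The named-argument caveat can be checked without running any pipeline, since the restriction is enforced when the code is parsed. This is a rough sketch assuming R 4.2 or later; myfunc is never called, so it does not need to be defined, and I have left out the exact wording of the error message since it may differ between R versions.

```r
# Parsing a call with an unnamed placeholder should fail, because the
# base pipe only accepts `_` as a named argument.
unnamed <- tryCatch(
    parse(text = "mtcars |> myfunc(cyl, 6, _)"),
    error = function(e) conditionMessage(e)
)
is.character(unnamed)  # TRUE when parsing failed; `unnamed` holds the message
#> [1] TRUE

# The named form parses into an ordinary call with the data slotted in.
named <- parse(text = "mtcars |> myfunc(cyl, 6, data = _)")[[1]]
print(named)
#> myfunc(cyl, 6, data = mtcars)
```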

I’ve already posted about another simple pattern I’ve been using a little, that of printing and saving an object. See my post for more. That is, however, less about the pipe and more about print() (viz. functions called for side-effects invisibly returning their input, unmodified.) Another thing I’ve been using a lot more is splitting pipes and View()ing the intermediate output, i.e. if I have a pipe that looks like:

library(dplyr)

mtcars |>
    mutate(squared_mpg = mpg^2, sqrt_cyl = sqrt(cyl)) |>
    filter(wt < 3.2) |>
    group_by(gear) |>
    select(starts_with("squared")) |>
    summarize(mean_mpg = mean(squared_mpg))

and I want to make some quick changes to try something, or need to diagnose some issues, I’ve been running the lines I need to, including the pipe–something like this.

library(dplyr)

mtcars |>
    mutate(squared_mpg = mpg^2, sqrt_cyl = sqrt(cyl)) |>
    filter(wt < 3.2) |>
    group_by(gear) |>
    # select(starts_with("squared")) |>
    # summarize(mean_mpg = mean(squared_mpg))

This won’t immediately execute though, as the trailing pipe is expecting another function call. “Running” the above code in your IDE (e.g. Positron, RStudio) and bringing your cursor to the console (using a keybind, of course) allows you to just type View() to complete the expression and see the dataframe as it is at that point in the pipeline. I’ve found this to be very convenient, and it is quite fast if you’re decent at selecting lines and know a few common shortcuts.

In a similar vein, if I’m doing EDA or such, I’ve taken to piping even simple single function calls, i.e. doing something like this:

1:10 |> mean()

as I’ll probably need to do more things with 1:10, like getting the median. Written this way, swapping in another function is easier: changing the code to 1:10 |> median() only touches the end of the line, which is the easiest part to edit when recalling a command from console history (up-arrow).

In Conclusion

Anyway, the pipe is cool, and one should use it generously without storing intermediary results as piping allows for complex sets of instructions to be split up into digestible chunks. There are some cool tricks one can do with the pipe and it is certainly an essential part of modern idiomatic R.

Appendix

All snippets created on 2025-04-10 with reprex v2.1.1
