02. Toolkit: R and the Shell

Data Science for Economists

2026-03-01

Session Roadmap

  • Shell essentials
  • Make (brief)
  • R basics
  • Quarto

Reproducibility (Recap)

Covered in the previous session — remember the 8 building blocks from Gentzkow & Shapiro:

Automation, Version Control, Directories, Keys, Abstraction, Documentation, Management, Code Style

Everything we do today serves these principles.

Shell / Bash

Shell

  • Terminology: shell, terminal, tty, command prompt, etc.
    • In everyday use these all refer to the command line interface (CLI)
  • Many shell variants: focus on Bash (“Bourne again shell”)
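  • Installed by default on Linux and macOS (recent macOS versions start zsh by default, but Bash is still included)
  • Windows users need to install a Bash-compatible shell, e.g. via WSL or Git Bash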

Why the Shell?

  • Powerful: executing commands and fixing problems
    • some things you just can’t do in an IDE or GUI
  • Reproducibility: scripting is reproducible, clicking is not
  • Remote: interacting with servers and supercomputers
  • Automation: workflow and analysis pipelines, e.g. with Makefile

Navigation, Files and Directories

Basics

username@hostname:~$
  • username denotes a specific user
  • hostname denotes name of the computer
  • the path after the colon is the current working directory (~ is shorthand for the user’s home directory)
  • $ marks the end of the prompt for a regular user (# for root)
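
The prompt only tells you where you are. A minimal sketch of moving around, assuming a throwaway directory created with mktemp (the names project and input are made up for illustration):

```shell
# Work in a scratch directory so no real files are touched
scratch=$(mktemp -d)
cd "$scratch"

mkdir -p project/input        # -p also creates missing parent directories
cd project/input              # move into the nested directory
here=$(basename "$PWD")       # $PWD holds the current working directory
echo "Now in: $here"

cd ..                         # .. is the parent directory
parent=$(basename "$PWD")
echo "Back in: $parent"

pwd                           # print the full working directory
echo ~                        # ~ expands to your home directory
```

pwd and cd are usually the first two commands worth memorising: one answers "where am I?", the other changes the answer.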

Keyboard Shortcuts

  • Tab completion
  • Up/Down keys to scroll through previous commands
  • Ctrl + Right (and Ctrl + Left) to skip whole words at a time
  • Ctrl + a moves the cursor to the beginning of the line
  • Ctrl + e moves the cursor to the end of the line
  • Ctrl + k deletes everything to the right of the cursor
  • Ctrl + u deletes everything to the left of the cursor
  • Ctrl + Shift + c to copy and Ctrl + Shift + v to paste

Syntax

command option(s) argument(s)

astronaut $ ls -lh
total 4.0K
drwxr-xr-x 3 astronaut astronaut  96 Apr 26 19:03 01-getting-started
drwxr-xr-x 2 astronaut astronaut  64 Apr 26 19:03 02-toolkit
-rw-r--r-- 1 astronaut astronaut 135 Apr 19 15:43 README.md
  • Short options start with a single dash and are usually one letter (-l)
  • Several short options can be combined after one dash (-lah); long options start with two dashes (--human-readable)
$ ls -lah 01-getting-started/
$ ls --group-directories-first --human-readable 01-getting-started/

Create Files and Directories

  • touch and mkdir
$ mkdir testing
$ touch testing/test1.txt testing/test2.txt testing/test3.txt
$ ls testing
test1.txt  test2.txt  test3.txt

Removing Files and Directories

  • rm
$ rm testing/test1.txt
$ ls testing
test2.txt  test3.txt
$ rm testing
rm: cannot remove 'testing': Is a directory
$ rm -rf testing
$ ls testing
ls: cannot access 'testing': No such file or directory
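  • -r (or -R) removes directories “recursively”, -f “forces” removal without asking
  • Careful: rm deletes permanently, there is no trash bin to restore from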
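
Because rm -rf asks no questions, a more cautious habit is to remove directories with rmdir, which only succeeds on empty directories. A small sketch under that assumption, run in a scratch directory (the variable names are illustrative):

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir testing
touch testing/test1.txt

# rmdir refuses to delete a non-empty directory, a useful safety net
if rmdir testing 2>/dev/null; then
  first_try="removed"
else
  first_try="refused"
fi
echo "rmdir on a non-empty directory: $first_try"

rm testing/test1.txt   # empty the directory first
rmdir testing          # now it succeeds

if [ -d testing ]; then
  still_there="yes"
else
  still_there="no"
fi
echo "Directory still exists: $still_there"
```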

Copying

  • cp object path/copyname (the copy keeps the original name if no new name is given)
$ touch example.txt
$ mkdir testing
$ cp example.txt testing
$ ls testing
example.txt
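
Directories need the recursive option when copying, just as with rm. A short sketch in a scratch directory; backup.txt and testing_copy are made-up names:

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir testing
touch example.txt

cp example.txt testing/              # the copy keeps the name example.txt
cp example.txt testing/backup.txt    # the copy gets the new name backup.txt
cp -r testing testing_copy           # -r copies a directory and its contents

ls testing_copy                      # lists both copied files
```

Unlike mv, cp leaves the original example.txt in place.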

Moving and Renaming

  • mv object path/newobjectname
$ mv example.txt testing/example2.txt
$ ls testing
example2.txt  example.txt
$ mv testing/example2.txt testing/example_new.txt
$ ls testing
example_new.txt  example.txt

Working with Text and Pipes

Working with Text Files

  • Print whole file with cat (“concatenate”)
$ cat -n input/sonnets.txt
  • Print only first or last lines with head and tail
$ head -n 3 input/sonnets.txt   ## First 3 rows
$ tail -n 1 input/sonnets.txt   ## Last row

Working with Text Files: grep

  • Count lines, words and bytes with wc (“word count”)
$ wc input/sonnets.txt
 2633 17698 95662 input/sonnets.txt
  • Search within files with grep (“global regular expression print”)
$ grep -n "Shall I compare thee" input/sonnets.txt
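
Two grep options come up constantly: -c to count matching lines and -i to ignore case. A sketch on a tiny stand-in file, since input/sonnets.txt may not be at hand (sample.txt and its three lines are invented for the example):

```shell
scratch=$(mktemp -d)
cd "$scratch"
# A three-line stand-in for input/sonnets.txt
printf 'Shall I compare thee\nshall we begin\nNothing here\n' > sample.txt

lines=$(wc -l < sample.txt)               # -l counts lines only
exact=$(grep -c "Shall" sample.txt)       # -c counts matching lines
any_case=$(grep -ci "shall" sample.txt)   # -i ignores upper/lower case
echo "lines: $lines, exact matches: $exact, any case: $any_case"
```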

Redirect

  • Send output from the shell to a file using redirect operator >
$ echo "At first, I was afraid, I was petrified" > survive.txt
$ find survive.txt
survive.txt
  • To append to a file, use >> (> overwrites)
$ echo "'Kept thinking I could never live without you by my side" >> survive.txt
$ cat survive.txt
At first, I was afraid, I was petrified
'Kept thinking I could never live without you by my side
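
The difference between > and >> is easy to get wrong once and lose a file. A minimal sketch that makes the overwrite visible, using a hypothetical notes.txt in a scratch directory:

```shell
scratch=$(mktemp -d)
cd "$scratch"

echo "first"  > notes.txt     # creates notes.txt
echo "second" >> notes.txt    # appends a second line
after_append=$(wc -l < notes.txt)

echo "third" > notes.txt      # > replaces the whole file
after_overwrite=$(cat notes.txt)

echo "lines after append: $after_append"
echo "content after overwrite: $after_overwrite"
```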

Pipes

  • Send (“pipe”) output to another command with |
    • chain together a sequence of simple operations
$ cat -n input/sonnets.txt | head -n100 | tail -n10
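
Pipes pay off when each command does one small job. A sketch of a classic pipeline that finds the most frequent line in a file; words.txt and its contents are made up, and sort, uniq and awk are standard tools not covered on the slides:

```shell
scratch=$(mktemp -d)
cd "$scratch"
printf 'love\ntime\nlove\nbeauty\nlove\ntime\n' > words.txt

# sort groups identical lines, uniq -c counts each group,
# sort -rn orders by count (largest first), head -n 1 keeps the top line,
# awk prints the second field (the word, without its count)
top=$(sort words.txt | uniq -c | sort -rn | head -n 1 | awk '{print $2}')
echo "Most frequent word: $top"
```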

Loops and Scripting

Loops

  • Loops repeat an operation for every item in a list
for i in LIST
do
  OPERATION $i
done
  • Example: numbering text files
$ n=1
$ for f in input/*.txt
> do
>   echo "=== File $n: $f ==="
>   head -n 2 "$f"
>   n=$((n + 1))
> done
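
The same loop pattern can accumulate a result across files. A sketch that sums line counts, assuming two small invented files in a scratch input/ directory:

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir input
printf 'a\nb\n' > input/one.txt
printf 'c\n'    > input/two.txt

total=0
for f in input/*.txt
do
  n_lines=$(wc -l < "$f")        # lines in this file
  echo "$f has $n_lines line(s)"
  total=$((total + n_lines))     # arithmetic expansion, as in n=$((n + 1))
done
echo "Total lines: $total"
```

Quoting "$f" keeps the loop working for file names that contain spaces.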

Scripting

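  • .sh files containing shell commands can be executed as scripts
#!/usr/bin/env bash
echo -e "\nHello World!\n"
  • The first line is a shebang: it names the program that runs the script when the file is executed directly (./script.sh)
    • echo -e is not portable across sh implementations, so the shebang points to bash rather than sh
  • Passing the file to bash explicitly also works and ignores the shebang
$ bash code/00-shell-exercise.sh
Hello World!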
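
Scripts become more useful once they accept arguments. A minimal sketch, assuming a hypothetical greet.sh written to a scratch directory; $1 is the first argument and ${1:-World} supplies a default when none is given:

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Write a small script; quoting 'EOF' keeps $1 literal until the script runs
cat > greet.sh <<'EOF'
#!/usr/bin/env bash
# $1 is the first argument; fall back to "World" if none is given
name="${1:-World}"
echo "Hello ${name}!"
EOF

default_out=$(bash greet.sh)         # no argument: the default is used
custom_out=$(bash greet.sh Econ)     # "Econ" becomes $1 inside the script
echo "$default_out"
echo "$custom_out"
```

After chmod +x greet.sh the script could also be started as ./greet.sh, in which case the shebang line selects the interpreter.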

Running Other Languages from the Shell

  • Not limited to running shell scripts in the shell
  • Example: Rscript
$ Rscript -e 'cat("Hello World, from R!")'
Hello World, from R!

Make

Make: Automate Your Pipeline

  • make automates the sequence from raw data → results → paper
  • Define targets, prerequisites, and recipes in a Makefile
  • Only re-runs steps whose inputs have changed
# Makefile
# Note: recipe lines must start with a tab character, not spaces
paper.pdf: paper.tex figures/plot.png
	pdflatex paper.tex

figures/plot.png: output/results.csv code/plot.R
	Rscript code/plot.R

output/results.csv: input/data.csv code/analysis.R
	Rscript code/analysis.R
  • Run make and it figures out what needs rebuilding
  • Change the data? Only the downstream steps re-run
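
The "only re-run what changed" decision rests on file timestamps. A rough shell analogy of that check, not how make is actually implemented (make also handles missing targets and whole dependency chains); the file names and dates are invented:

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch -t 202001010000 results.csv   # pretend output, last built in 2020
touch -t 202101010000 data.csv      # input that changed later, in 2021

# -nt means "newer than": rebuild only if the input is newer than the output
if [ data.csv -nt results.csv ]; then
  decision="rebuild"
else
  decision="up to date"
fi
echo "results.csv: $decision"

touch results.csv                   # simulate re-running the recipe now
if [ data.csv -nt results.csv ]; then
  decision_after="rebuild"
else
  decision_after="up to date"
fi
echo "results.csv after rebuild: $decision_after"
```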

R Basics

  • A great calculator
  • Logic, negation, equality tests (==), matching (%in%)
    • careful: == can fail on floating-point numbers, e.g. 0.1 + 0.2 == 0.3 is FALSE
    • better: isTRUE(all.equal(0.1 + 0.2, 0.3))
  • Assignment with = or <-
  • Questions? help(plot) or ?plot
  • Commenting with #

Objects

  • vectors
  • matrices
  • data frames (and derivatives like data.table and tibble)
  • lists
  • functions
  • etc.

Conversion Between Objects

# Create a small data frame called "d"
d = data.frame(x = 1:2, y = 3:4)
d
#>   x y
#> 1 1 3
#> 2 2 4

# Convert it to (i.e. create) a matrix called "m"
m = as.matrix(d)
m
#>      x y
#> [1,] 1 3
#> [2,] 2 4

Class, Type and Structure

# Evaluate its class
class(d)
#> [1] "data.frame"

# Evaluate its type
typeof(d)
#> [1] "list"

# Show its structure
str(d)
#> 'data.frame':   2 obs. of  2 variables:
#>  $ x: int  1 2
#>  $ y: int  3 4

Global Environment

# View d
View(d)
d
#>   x y
#> 1 1 3
#> 2 2 4

# x and y live inside d, not in the global environment, so this fails
lm(y ~ x)
#> Error in eval(predvars, data, env) : object 'y' not found

lm(y ~ x, data = d)
#> Call:
#> lm(formula = y ~ x, data = d)
#>
#> Coefficients:
#> (Intercept)            x
#>           2            1

Reserved Words

  • Fundamental commands, operators and relations cannot be reassigned
if
else
while
function
for
TRUE
FALSE
NULL
Inf
NaN
NA

Semi-reserved Words

  • Built-in names such as c and pi can be reassigned, but doing so invites confusion
my_vector = c(1, 2, 5)
my_vector
#> [1] 1 2 5

c = 4
c(1, 2, 5)
#> [1] 1 2 5
c
#> [1] 4

pi
#> [1] 3.141593

pi = 2
pi
#> [1] 2

Indexing: []

a = 1:10
a[4]
#> [1] 4
a[c(4, 6)]
#> [1] 4 6

m[1, 1]
#> x
#> 1

my_list = list(a = "hello", b = c(1, 2, 3),
               c = data.frame(x = 1:5, y = 6:10))
my_list[[1]]
#> [1] "hello"
my_list[[2]][3]
#> [1] 3

Indexing: $

my_list
#> $a
#> [1] "hello"
#>
#> $b
#> [1] 1 2 3
#>
#> $c
#>   x  y
#> 1 1  6
#> 2 2  7
#> 3 3  8
#> 4 4  9
#> 5 5 10

Indexing: $ (continued)

my_list$a
#> [1] "hello"

my_list$b[3]
#> [1] 3

my_list$c$x
#> [1] 1 2 3 4 5

Indexing: $ and the Global Environment

# Remember the earlier problem?
lm(d$y ~ d$x)
#> Call:
#> lm(formula = d$y ~ d$x)
#>
#> Coefficients:
#> (Intercept)          d$x
#>           2            1

Functions

  • A lot of functionality in “base R”
    • in-built functions, like lm()
  • User-built functions are easy to implement
example_function = function(a, b) {
  output = a + b
  return(output)
}
example_function(1, 2)
#> [1] 3

Installing Packages

# pacman: install-if-missing + load
if (!require("pacman")) install.packages("pacman"); library(pacman)
p_load(data.table)
p_load(ggplot2)
p_load_current_gh("ropensci/rnaturalearthhires")  # from GitHub
  • pacman — single-line install + load; good for reproducible teaching setups

Libraries

  • Community-built (set of) functions: libraries or packages
library(data.table)
#> data.table 1.16.4 using 4 threads (see ?getDTthreads).
#> Latest news: r-datatable.com

The Pipe: |>

  • R 4.1.0 introduced the native pipe operator |>
  • Passes the left-hand side as the first argument of the right-hand side
# Without pipe
head(subset(mtcars, cyl == 4), 3)

# With native pipe |>
mtcars |> subset(cyl == 4) |> head(3)
  • We use |> throughout this course
  • You’ll see %>% (magrittr pipe) in older code — same idea, but requires library(magrittr) or library(tidyverse)

Data Frames vs Tibbles vs data.tables

You’ll encounter all three — they’re all rectangular data, with different trade-offs:

            data.frame     tibble          data.table
Package     base R         tidyverse       data.table
Print       all rows       fits screen     fits screen
Speed       slow           slow            very fast
Syntax      df[row, col]   dplyr verbs     dt[i, j, by]
Best for    small data     tidy pipelines  large data
  • tibble is a data.frame with nicer defaults
  • data.table modifies in place — crucial for memory on large data
  • We’ll use all three; Module 03 goes deeper

Quarto: Literate Programming

  • Quarto is the next generation of R Markdown
    • supports R, Python, Julia, and Observable JS
    • renders to HTML, PDF, Word, slides (reveal.js), websites, books, …
  • A single .qmd file combines prose, code, and output
  • Replaces R Markdown (.Rmd) for new projects
  • Learn more: https://quarto.org

Quarto: Minimal Example

---
title: "My Analysis"
format: html
---

## Data

```{r}
library(tidyverse)
mtcars |> head()
```

## Plot

```{r}
mtcars |> ggplot(aes(wt, mpg)) + geom_point()
```
  • Render from the terminal
$ quarto render analysis.qmd           # → analysis.html
$ quarto render analysis.qmd --to pdf  # → analysis.pdf

Wrap Up

  • Shell essentials, Make, R basics, Quarto
  • Next session: Working with large structured data

Further reading