# Web scraping 🕸
### S. Mason Garrison

---

layout: true
 
<div class="my-footer">

<a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Data Science for Psychologists</a>

</div>

---

# Scraping the web

---

## Scraping the web: what? why?

- Increasing amounts of data are available on the web
--

- These data are provided in an unstructured format: you can always copy and paste, 
  - but it's time-consuming and error-prone

--
- Web scraping is the process of extracting this information automatically and transforming it into a structured dataset

--
- Two different scenarios:
    - Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
    - Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.
---

# Web Scraping with rvest

---

## Hypertext Markup Language

- Most of the data on the web is still largely available as HTML 
- It is structured (hierarchical / tree based), but it's often not available in a form useful for analysis (flat / tidy).

```html
<html>
 <head>
 <title>This is a title</title>
 </head>
 <body>
 Hello world!
 </body>
</html>
```
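
As a minimal sketch of how this tree shows up in R, the same snippet can be parsed from a character string with rvest. The object names (`doc`, `children`, `title_text`, `body_text`) are illustrative choices, not part of the original slides.

```r
library(rvest)

# Parse the small HTML document shown above from a literal string
doc <- read_html(
  "<html><head><title>This is a title</title></head><body>Hello world!</body></html>"
)

# Walk the tree: the root <html> node has two children, <head> and <body>
children <- html_children(doc)
html_name(children)

# Pull the text out of individual nodes
title_text <- doc %>% html_node("title") %>% html_text()
body_text  <- doc %>% html_node("body") %>% html_text()
title_text
body_text
```

The hierarchy is preserved after parsing, so getting to a flat, tidy structure still means selecting nodes and extracting their contents.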

---

## rvest

.pull-left[
- The **rvest** package makes basic processing and manipulation of HTML data straightforward
- It's designed to work with pipelines built with `%>%`
]
.pull-right[
<img src="img/rvest.png" width="230" style="display: block; margin: auto 0 auto auto;" />
]

---

## Core rvest functions

- `read_html`   - Read HTML data from a URL or character string
- `html_node`   - Select the first node that matches a selector in an HTML document
- `html_nodes`  - Select all nodes that match a selector in an HTML document
- `html_table`  - Parse an HTML table into a data frame
- `html_text`   - Extract the text content of tags
- `html_name`   - Extract tags' names
- `html_attrs`  - Extract all of each tag's attributes
- `html_attr`   - Extract tags' attribute value by name
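
The functions above can be tried on a tiny page built from a string. This is only a sketch: the HTML, the `film` class, and the object names are made up for illustration.

```r
library(rvest)

# A tiny made-up page: two links and one table (all content is illustrative)
doc <- read_html('
<html><body>
  <a class="film" href="/title/a">Film A</a>
  <a class="film" href="/title/b">Film B</a>
  <table>
    <tr><th>name</th><th>score</th></tr>
    <tr><td>Film A</td><td>9.1</td></tr>
    <tr><td>Film B</td><td>8.7</td></tr>
  </table>
</body></html>')

links <- doc %>% html_nodes("a.film")   # all matching nodes
first <- doc %>% html_node("a.film")    # only the first match

html_text(links)           # text between the opening and closing tags
html_name(links)           # tag names
html_attr(links, "href")   # one attribute, selected by name
html_attrs(first)          # all attributes of a node, as a named vector

# Parse the table into a data frame
films <- doc %>% html_node("table") %>% html_table()
films
```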

---
class: middle

# Wrapping Up...

---

# Using the SelectorGadget

---

## SelectorGadget

.pull-left-narrow[
- Open source tool that eases CSS selector generation and discovery
- Easiest to use with the [Chrome Extension](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) 
- Find out more on the [SelectorGadget vignette](https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html)
- Note: Our class is about 50-50 (Apple-Microsoft) 
  - I have screenshots from a Mac in the slides,
  - and screen captures from my PC
]
.pull-right-wide[
<img src="img/selector_gadget/selector-gadget.png" width="75%" style="display: block; margin: auto;" />
]

---

# Using the SelectorGadget (MAC)

---
## Using the SelectorGadget (MAC)

Through this process of selection and rejection, 
SelectorGadget helps you find the appropriate CSS selector for your needs

<img src="img/selector_gadget/selector-gadget.gif" width="65%" style="display: block; margin: auto;" />
---

# Using the SelectorGadget (PC)

---

## Using the SelectorGadget (PC)
.small.pull-left-narrow[
- Click on the logo next to the search bar
- A box will open in the bottom right of the website
- Click on a page element (it will turn green), 
  - SelectorGadget will generate a minimal CSS selector for that element, 
  - and will highlight (yellow) everything that is matched by the selector
- Click on a highlighted element to remove it from the selector (red), 
  - or click on an unhighlighted element to add it to the selector
- This process of selection and rejection
  - helps you find the appropriate CSS selector for your needs

]
---
class: middle

# Wrapping Up...

---

# Top 250 movies on IMDB

---

## Top 250 movies on IMDB

Take a look at the source code and look for the `table` tag:
 
http://www.imdb.com/chart/top

---

## First check if you're allowed!

```r
library(robotstxt)
paths_allowed("http://www.imdb.com")
```

```
## [1] TRUE
```

vs.

```r
paths_allowed("http://www.facebook.com")
```

```
## [1] FALSE
```
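
Permission can differ from path to path within one site, so it can help to check the specific paths you plan to scrape. The sketch below assumes the `text` argument of `robotstxt()` and the `$check()` method of the object it returns, as described in the robotstxt package documentation. The rules themselves are made up, so nothing is downloaded.

```r
library(robotstxt)

# A made-up robots.txt, supplied as text so nothing is downloaded
rules <- robotstxt(text = "User-agent: *\nDisallow: /private/\n")

# Ask about specific paths rather than the whole site
allowed <- rules$check(paths = c("/public/", "/private/data"), bot = "*")
allowed
```

For a live site, the same question can be put to `paths_allowed()` by passing the paths of interest together with the domain.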

---

.your-turn[
- [class git repo](https://github.com/DataScience4Psych) > `AE 08 - IMDB + Webscraping`.
- Open `01-imdb-250movies.R`.
- Follow along, and fill in the blanks as we go based on upcoming slides.
]

---

## Plan

1. Read the whole page

2. Scrape movie titles and save as `titles`

3. Scrape years movies were made in and save as `years`

4. Scrape IMDB ratings and save as `ratings`

5. Create a data frame called `imdb_top_250` with variables `title`, `year`, and `rating`
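
The plan above can be sketched end to end on a small stand-in page. The HTML below is made up for illustration, including the `ratingColumn` class and the two film names. Only the selectors `.titleColumn a`, `.secondaryInfo`, and `strong` are taken from these slides, and `top_films` is a hypothetical name chosen to keep it distinct from the real `imdb_top_250`.

```r
library(rvest)
library(stringr)
library(tibble)

# Step 1: read the page (here a two-row stand-in parsed from a string)
page <- read_html('
<table>
  <tr>
    <td class="titleColumn"><a href="/a">Film A</a> <span class="secondaryInfo">(1994)</span></td>
    <td class="ratingColumn"><strong>9.2</strong></td>
  </tr>
  <tr>
    <td class="titleColumn"><a href="/b">Film B</a> <span class="secondaryInfo">(1972)</span></td>
    <td class="ratingColumn"><strong>9.0</strong></td>
  </tr>
</table>')

# Step 2: titles
titles <- page %>% html_nodes(".titleColumn a") %>% html_text()

# Step 3: years, with the parentheses stripped before converting
years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>%
  str_remove("\\)") %>%
  as.numeric()

# Step 4: ratings
ratings <- page %>% html_nodes("strong") %>% html_text() %>% as.numeric()

# Step 5: combine the three vectors into a data frame
top_films <- tibble(title = titles, year = years, rating = ratings)
top_films
```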

---

# Step 1. Read the whole page

---

## Read the whole page

```r
page <- read_html("https://www.imdb.com/chart/top/")
page
```

```
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html ...
## [2] <body id="styleguide-v2" class="fixed">\n <img ...
```

---

## A webpage in R

- The result is a list with two elements

```r
typeof(page)
```

```
## [1] "list"
```

- that we need to convert to something more familiar, like a data frame.

```r
class(page)
```

```
## [1] "xml_document" "xml_node"
```

---

# Step 2. Scrape movie titles and save as `titles`

---

## Scrape movie titles

---

## Scrape the nodes

.pull-left[

```r
page %>%
  html_nodes(".titleColumn a")
```

```
## {xml_nodeset (250)}
## [1] <a href="/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [2] <a href="/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [3] <a href="/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [4] <a href="/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [5] <a href="/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [6] <a href="/title/tt0108052/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [7] <a href="/title/tt0167260/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [8] <a href="/title/tt0110912/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [9] <a href="/title/tt0120737/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [10] <a href="/title/tt0060196/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [11] <a href="/title/tt0109830/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [12] <a href="/title/tt0137523/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [13] <a href="/title/tt1375666/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [14] <a href="/title/tt0167261/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [15] <a href="/title/tt0080684/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
## [16] <a href="/title/tt0133093/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_ ...
...
```
]
.pull-right[
<img src="img/titles.png" width="100%" style="display: block; margin: auto;" />
]

---

## Extract the text from the nodes

.pull-left[

```r
page %>%
  html_nodes(".titleColumn a") %>%
  html_text()
```

```
## [1] "The Shawshank Redemption" 
## [2] "The Godfather" 
## [3] "The Dark Knight" 
## [4] "The Godfather: Part II" 
## [5] "12 Angry Men" 
## [6] "Schindler's List" 
## [7] "The Lord of the Rings: The Return of the King" 
## [8] "Pulp Fiction" 
## [9] "The Lord of the Rings: The Fellowship of the Ring" 
## [10] "The Good, the Bad and the Ugly" 
## [11] "Forrest Gump" 
## [12] "Fight Club" 
## [13] "Inception" 
## [14] "The Lord of the Rings: The Two Towers" 
## [15] "Star Wars: Episode V - The Empire Strikes Back" 
## [16] "The Matrix" 
...
```
]
.pull-right[
<img src="img/titles.png" width="100%" style="display: block; margin: auto;" />
]

---

## Save as `titles`

.pull-left[

```r
titles <- page %>%
 html_nodes(".titleColumn a") %>%
 html_text()

titles
```

```
## [1] "The Shawshank Redemption" 
## [2] "The Godfather" 
## [3] "The Dark Knight" 
## [4] "The Godfather: Part II" 
## [5] "12 Angry Men" 
## [6] "Schindler's List" 
## [7] "The Lord of the Rings: The Return of the King" 
## [8] "Pulp Fiction" 
## [9] "The Lord of the Rings: The Fellowship of the Ring" 
## [10] "The Good, the Bad and the Ugly" 
## [11] "Forrest Gump" 
## [12] "Fight Club" 
## [13] "Inception" 
## [14] "The Lord of the Rings: The Two Towers" 
...
```
]
.pull-right[
<img src="img/titles.png" width="100%" style="display: block; margin: auto;" />
]

---

# Step 3. Scrape years movies were made in and save as `years`

---

## Scrape years movies were made in

---

## Scrape the nodes

.pull-left[

```r
page %>%
  html_nodes(".secondaryInfo")
```

```
## {xml_nodeset (250)}
## [1] (1994)
## [2] (1972)
## [3] (2008)
## [4] (1974)
## [5] (1957)
## [6] (1993)
## [7] (2003)
## [8] (1994)
## [9] (2001)
## [10] (1966)
## [11] (1994)
## [12] (1999)
## [13] (2010)
## [14] (2002)
## [15] (1980)
## [16] (1999)
...
```
]
.pull-right[
<img src="img/years.png" width="100%" style="display: block; margin: auto;" />
]

---

## Extract the text from the nodes

.pull-left[

```r
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text()
```

```
## [1] "(1994)" "(1972)" "(2008)" "(1974)" "(1957)" "(1993)"
## [7] "(2003)" "(1994)" "(2001)" "(1966)" "(1994)" "(1999)"
## [13] "(2010)" "(2002)" "(1980)" "(1999)" "(1990)" "(1975)"
## [19] "(1995)" "(1954)" "(1946)" "(1991)" "(1998)" "(2002)"
## [25] "(1997)" "(1999)" "(1977)" "(2014)" "(1991)" "(1985)"
## [31] "(2001)" "(1960)" "(1994)" "(2002)" "(2019)" "(1994)"
## [37] "(2000)" "(1998)" "(1995)" "(2006)" "(2021)" "(2006)"
## [43] "(1942)" "(2014)" "(2011)" "(1936)" "(1968)" "(1962)"
## [49] "(1988)" "(1979)" "(1954)" "(1931)" "(2000)" "(1979)"
## [55] "(1988)" "(1981)" "(2012)" "(2008)" "(2006)" "(1950)"
## [61] "(1980)" "(1957)" "(1940)" "(2018)" "(1957)" "(1986)"
## [67] "(1999)" "(2012)" "(1964)" "(2019)" "(2018)" "(2003)"
## [73] "(1995)" "(1995)" "(1984)" "(2017)" "(2009)" "(1981)"
## [79] "(2019)" "(1997)" "(2022)" "(1984)" "(1997)" "(2010)"
## [85] "(2000)" "(2009)" "(1952)" "(2016)" "(1983)" "(1992)"
## [91] "(2004)" "(1968)" "(1941)" "(1963)" "(1962)" "(1931)"
...
```
]
.pull-right[
<img src="img/years.png" width="100%" style="display: block; margin: auto;" />
]

---

## Clean up the text

We need to go from `"(1994)"` to `1994`:

- Remove `(` and `)`: string manipulation
- Convert to numeric: `as.numeric()`

---

## stringr

.pull-left-wide[
- **stringr** provides a cohesive set of functions designed to make working with strings as easy as possible
- Functions in stringr start with `str_*()`, e.g.
  - `str_remove()` to remove a pattern from a string

```r
str_remove(string = "jello", pattern = "el")
```

```
## [1] "jlo"
```
  - `str_replace()` to replace a pattern with another
  .midi[

```r
str_replace(string = "jello", pattern = "j", replacement = "h")
```

```
## [1] "hello"
```
] 
]
.pull-right-narrow[
<img src="img/stringr.png" width="100%" style="display: block; margin: auto auto auto 0;" />
]

---

## Clean up the text

```r
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") # remove (
```

```
##   [1] "1994)" "1972)" "2008)" "1974)" "1957)" "1993)" "2003)"
##   [8] "1994)" "2001)" "1966)" "1994)" "1999)" "2010)" "2002)"
##  [15] "1980)" "1999)" "1990)" "1975)" "1995)" "1954)" "1946)"
##  [22] "1991)" "1998)" "2002)" "1997)" "1999)" "1977)" "2014)"
##  [29] "1991)" "1985)" "2001)" "1960)" "1994)" "2002)" "2019)"
##  [36] "1994)" "2000)" "1998)" "1995)" "2006)" "2021)" "2006)"
##  [43] "1942)" "2014)" "2011)" "1936)" "1968)" "1962)" "1988)"
##  [50] "1979)" "1954)" "1931)" "2000)" "1979)" "1988)" "1981)"
##  [57] "2012)" "2008)" "2006)" "1950)" "1980)" "1957)" "1940)"
##  [64] "2018)" "1957)" "1986)" "1999)" "2012)" "1964)" "2019)"
##  [71] "2018)" "2003)" "1995)" "1995)" "1984)" "2017)" "2009)"
##  [78] "1981)" "2019)" "1997)" "2022)" "1984)" "1997)" "2010)"
##  [85] "2000)" "2009)" "1952)" "2016)" "1983)" "1992)" "2004)"
##  [92] "1968)" "1941)" "1963)" "1962)" "1931)" "2018)" "1959)"
##  [99] "2012)" "1958)" "2001)" "1971)" "1987)" "1983)" "1944)"
...
```

---

## Clean up the text

```r
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>% # remove (
  str_remove("\\)") # remove )
```

```
##   [1] "1994" "1972" "2008" "1974" "1957" "1993" "2003" "1994"
##   [9] "2001" "1966" "1994" "1999" "2010" "2002" "1980" "1999"
##  [17] "1990" "1975" "1995" "1954" "1946" "1991" "1998" "2002"
##  [25] "1997" "1999" "1977" "2014" "1991" "1985" "2001" "1960"
##  [33] "1994" "2002" "2019" "1994" "2000" "1998" "1995" "2006"
##  [41] "2021" "2006" "1942" "2014" "2011" "1936" "1968" "1962"
##  [49] "1988" "1979" "1954" "1931" "2000" "1979" "1988" "1981"
##  [57] "2012" "2008" "2006" "1950" "1980" "1957" "1940" "2018"
##  [65] "1957" "1986" "1999" "2012" "1964" "2019" "2018" "2003"
##  [73] "1995" "1995" "1984" "2017" "2009" "1981" "2019" "1997"
##  [81] "2022" "1984" "1997" "2010" "2000" "2009" "1952" "2016"
##  [89] "1983" "1992" "2004" "1968" "1941" "1963" "1962" "1931"
##  [97] "2018" "1959" "2012" "1958" "2001" "1971" "1987" "1983"
...
```

---

## Convert to numeric

```r
page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>% # remove (
  str_remove("\\)") %>% # remove )
  as.numeric()
```

```
##   [1] 1994 1972 2008 1974 1957 1993 2003 1994 2001 1966 1994 1999
##  [13] 2010 2002 1980 1999 1990 1975 1995 1954 1946 1991 1998 2002
##  [25] 1997 1999 1977 2014 1991 1985 2001 1960 1994 2002 2019 1994
##  [37] 2000 1998 1995 2006 2021 2006 1942 2014 2011 1936 1968 1962
##  [49] 1988 1979 1954 1931 2000 1979 1988 1981 2012 2008 2006 1950
##  [61] 1980 1957 1940 2018 1957 1986 1999 2012 1964 2019 2018 2003
##  [73] 1995 1995 1984 2017 2009 1981 2019 1997 2022 1984 1997 2010
##  [85] 2000 2009 1952 2016 1983 1992 2004 1968 1941 1963 1962 1931
##  [97] 2018 1959 2012 1958 2001 1971 1987 1983 1944 1985 1960 1976
## [109] 1962 1973 1997 2009 2020 1995 1952 2000 1988 1989 2011 1927
## [121] 1948 2010 2019 2007 2005 1965 2016 2004 1921 1959 2020 1950
...
```

---

## Save as `years`

```r
years <- page %>%
 html_nodes(".secondaryInfo") %>%
 html_text() %>%
 str_remove("\\(") %>% # remove (
 str_remove("\\)") %>% # remove )
 as.numeric()

years
```

---

# Step 4. Scrape IMDB ratings and save as `ratings`

---

## Scrape IMDB ratings

---

## Scrape the nodes

.pull-left[

```r
page %>%
  html_nodes("strong")
```

```
## {xml_nodeset (250)}
## [1] 9.2</ ...
## [2] 9.2</ ...
## [3] 9.0</ ...
## [4] 9.0</ ...
## [5] 9.0</st ...
## [6] 8.9</ ...
## [7] 8.9</ ...
## [8] 8.9</ ...
## [9] 8.8</ ...
## [10] 8.8</st ...
## [11] 8.8</ ...
## [12] 8.8</ ...
## [13] 8.7</ ...
## [14] 8.7</ ...
## [15] 8.7</ ...
## [16] 8.7</ ...
...
```
]
.pull-right[
<img src="img/ratings.png" width="100%" style="display: block; margin: auto;" />
]

---

## Extract the text from the nodes

.pull-left[

```r
page %>%
  html_nodes("strong") %>%
  html_text()
```

```
## [1] "9.2" "9.2" "9.0" "9.0" "9.0" "8.9" "8.9" "8.9" "8.8" "8.8"
## [11] "8.8" "8.8" "8.7" "8.7" "8.7" "8.7" "8.7" "8.6" "8.6" "8.6"
## [21] "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.5" "8.5"
## [31] "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5"
## [41] "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.4"
## [51] "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4"
## [61] "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.3" "8.3" "8.3" "8.3"
## [71] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3"
## [81] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3"
## [91] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3"
## [101] "8.3" "8.3" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [111] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [121] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [131] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [141] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.1" "8.1"
## [151] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1"
...
```
]
.pull-right[
<img src="img/ratings.png" width="100%" style="display: block; margin: auto;" />
]

---

## Convert to numeric

.pull-left[

```r
page %>%
  html_nodes("strong") %>%
  html_text() %>%
  as.numeric()
```

```
## [1] 9.2 9.2 9.0 9.0 9.0 8.9 8.9 8.9 8.8 8.8 8.8 8.8 8.7 8.7 8.7
## [16] 8.7 8.7 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.5 8.5
## [31] 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5
## [46] 8.5 8.5 8.5 8.5 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4
## [61] 8.4 8.4 8.4 8.4 8.4 8.4 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3
## [76] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3
## [91] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.2 8.2 8.2
## [106] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [121] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [136] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.1 8.1
## [151] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [166] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [181] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [196] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [211] 8.1 8.1 8.1 8.1 8.1 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0
...
```
]
.pull-right[
<img src="img/ratings.png" width="100%" style="display: block; margin: auto;" />
]

---

## Save as `ratings`

.pull-left[

```r
ratings <- page %>%
 html_nodes("strong") %>%
 html_text() %>%
 as.numeric()

ratings
```

```
## [1] 9.2 9.2 9.0 9.0 9.0 8.9 8.9 8.9 8.8 8.8 8.8 8.8 8.7 8.7 8.7
## [16] 8.7 8.7 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.5 8.5
## [31] 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5
## [46] 8.5 8.5 8.5 8.5 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4
## [61] 8.4 8.4 8.4 8.4 8.4 8.4 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3
## [76] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3
## [91] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.2 8.2 8.2
## [106] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [121] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [136] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.1 8.1
## [151] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [166] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
...
```
]
.pull-right[
<img src="img/ratings.png" width="100%" style="display: block; margin: auto;" />
]

---

# Step 5. Create a data frame called `imdb_top_250`

---

## Create a data frame: `imdb_top_250`

```r
imdb_top_250 <- tibble(
 title = titles, 
 year = years, 
 rating = ratings
 )

imdb_top_250
```

```
## # A tibble: 250 x 3
## title year rating
## <chr> <dbl> <dbl>
## 1 The Shawshank Redemption 1994 9.2
## 2 The Godfather 1972 9.2
## 3 The Dark Knight 2008 9 
## 4 The Godfather: Part II 1974 9 
## 5 12 Angry Men 1957 9 
## 6 Schindler's List 1993 8.9
## 7 The Lord of the Rings: The Return of the King 2003 8.9
## 8 Pulp Fiction 1994 8.9
## 9 The Lord of the Rings: The Fellowship of the Ring 2001 8.8
## 10 The Good, the Bad and the Ugly 1966 8.8
## # ... with 240 more rows
```

---


## Clean up / enhance

This may or may not be a lot of work, depending on how messy the data are

- See if you like what you got:

```r
glimpse(imdb_top_250)
```

```
## Rows: 250
## Columns: 3
## $ title <chr> "The Shawshank Redemption", "The Godfather", "Th~
## $ year <dbl> 1994, 1972, 2008, 1974, 1957, 1993, 2003, 1994, ~
## $ rating <dbl> 9.2, 9.2, 9.0, 9.0, 9.0, 8.9, 8.9, 8.9, 8.8, 8.8~
```

- Add a variable for rank

```r
imdb_top_250 <- imdb_top_250 %>%
 mutate(rank = 1:nrow(imdb_top_250)) %>%
 relocate(rank)
```

---

```
## # A tibble: 250 x 4
## rank title year rating
## <int> <chr> <dbl> <dbl>
## 1 1 The Shawshank Redemption 1994 9.2
## 2 2 The Godfather 1972 9.2
## 3 3 The Dark Knight 2008 9 
## 4 4 The Godfather: Part II 1974 9 
## 5 5 12 Angry Men 1957 9 
## 6 6 Schindler's List 1993 8.9
## 7 7 The Lord of the Rings: The Return of the K~ 2003 8.9
## 8 8 Pulp Fiction 1994 8.9
## 9 9 The Lord of the Rings: The Fellowship of t~ 2001 8.8
## 10 10 The Good, the Bad and the Ugly 1966 8.8
## 11 11 Forrest Gump 1994 8.8
## 12 12 Fight Club 1999 8.8
## 13 13 Inception 2010 8.7
## 14 14 The Lord of the Rings: The Two Towers 2002 8.7
## 15 15 Star Wars: Episode V - The Empire Strikes ~ 1980 8.7
## 16 16 The Matrix 1999 8.7
## 17 17 Goodfellas 1990 8.7
## 18 18 One Flew Over the Cuckoo's Nest 1975 8.6
## 19 19 Se7en 1995 8.6
## 20 20 Seven Samurai 1954 8.6
## # ... with 230 more rows
```

---

# What next?

---

```r
imdb_top_250 %>% 
  count(year, sort = TRUE)
```

```
## # A tibble: 86 x 2
## year n
## <dbl> <int>
## 1 1995 8
## 2 2004 7
## 3 1957 6
## 4 2003 6
## 5 2009 6
## 6 2019 6
## 7 1975 5
## 8 1994 5
## 9 1997 5
## 10 1998 5
## # ... with 76 more rows
```

---

```r
imdb_top_250 %>% 
  filter(year == 1995) %>%
  print(n = 8)
```

```
## # A tibble: 8 x 4
## rank title year rating
## <int> <chr> <dbl> <dbl>
## 1 19 Se7en 1995 8.6
## 2 39 The Usual Suspects 1995 8.5
## 3 73 Braveheart 1995 8.3
## 4 74 Toy Story 1995 8.3
## 5 114 Heat 1995 8.2
## 6 136 Casino 1995 8.2
## 7 187 Before Sunrise 1995 8.1
## 8 244 La Haine 1995 8
```

---

.question[
Visualize the average yearly rating over time for movies that made it onto the top 250 list.
]

.pull-left[
<img src="d16_webscraping_files/figure-html/unnamed-chunk-54-1.png" width="100%" style="display: block; margin: auto;" />
]
.medi.pull-right[

```r
imdb_top_250 %>% 
  group_by(year) %>%
  summarize(avg_score = mean(rating)) %>%
  ggplot(aes(y = avg_score, x = year)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Year", y = "Average score")
```
]

---

.your-turn[
- [class git repo](https://github.com/DataScience4Psych) > `AE 08 - IMDB + Webscraping`.
- Open `02-imdb-tvshows.R`.
- Scrape the names, scores, and years of most popular TV shows on IMDB:
[www.imdb.com/chart/tvmeter](http://www.imdb.com/chart/tvmeter).
- Create a data frame called `tvshows` with four variables: `rank`, `name`, `score`, `year`.
- Examine each of the **first three** TV shows to also obtain genre, runtime, number of episodes so far, and first five plot keywords.
- Add this information to the `tvshows` data frame you created earlier.
]

---

# Wrapping Up...

---

# Ethics

---

## "Can you?" vs "Should you?"

.footnote[.small[
Source: Brian Resnick, [Researchers just released profile data on 70,000 OkCupid users without permission](https://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release), Vox.
]]

---

# Challenges

---

## Unreliable formatting at the source

---

## Data broken into many pages

---

# Workflow

---

## Screen scraping vs. APIs

Two different scenarios for web scraping:

- Screen scraping: 
  - extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy)
- Web APIs (application programming interfaces): 
  - the website offers a set of structured HTTP requests that return JSON or XML files
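
To contrast with the screen-scraping steps earlier, here is a rough sketch of the API scenario, in which the response is already structured. The JSON payload is made up and parsed from a string, and jsonlite is an assumed choice of parser that the slides do not name.

```r
library(jsonlite)

# A made-up JSON payload of the kind a web API might return
response_body <- '[
  {"title": "Film A", "year": 1994, "rating": 9.2},
  {"title": "Film B", "year": 1972, "rating": 9.0}
]'

# fromJSON() turns an array of records into a data frame
films <- fromJSON(response_body)
films
```

No selectors or string cleanup are needed here, because the field names and types come with the response.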

---

## A new R workflow

- When working in an R Markdown document, 
  - your analysis is re-run each time you knit
- If web scraping in an R Markdown document, 
  - you'd be re-scraping the data each time you knit, 
  - which is undesirable (and not *nice*)!
- An alternative workflow: 
  - Use an R script to save your code 
  - Save interim data scraped using the code in the script as CSV or RDS files
  - Use the saved data in your analysis in your R Markdown document
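
The alternative workflow might look like the sketch below. The data frame is a stand-in for scraped results, and `tempfile()` is used only to keep the example self-contained. In a project you would more likely write to a fixed path inside the repository.

```r
library(readr)
library(tibble)

# Stand-in for the result of a scraping script (values are illustrative)
scraped <- tibble(title = c("Film A", "Film B"), year = c(1994, 1972))

# In the R script: save the scraped data once, as CSV ...
csv_path <- tempfile(fileext = ".csv")
write_csv(scraped, csv_path)

# ... or as RDS, which also preserves column types exactly
rds_path <- tempfile(fileext = ".rds")
saveRDS(scraped, rds_path)

# In the R Markdown document: read the saved file instead of re-scraping
films     <- read_csv(csv_path)
films_rds <- readRDS(rds_path)
films
```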

---

# Sources

- Mine Çetinkaya-Rundel's Data Science in a Box ([link](https://datasciencebox.org/))

---

# Wrapping Up...