class: center, middle, inverse, title-slide # Web scraping
🕸
### S. Mason Garrison

---
layout: true

<div class="my-footer">
<span>
<a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Data Science for Psychologists</a>
</span>
</div>

---
class: middle

# Scraping the web

---

## Scraping the web: what? why?

- Increasing amounts of data are available on the web

--

- These data are provided in an unstructured format: you can always copy and paste,
  - but it's time-consuming and prone to errors

--

- Web scraping is the process of extracting this information automatically and transforming it into a structured dataset

--

- Two different scenarios:
  - Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
  - Web APIs (application programming interfaces): a website offers a set of structured HTTP requests that return JSON or XML files.

---
class: middle

# Web Scraping with rvest

---

## Hypertext Markup Language

- Most of the data on the web is still largely available as HTML
- It is structured (hierarchical / tree based), but it's often not available in a form useful for analysis (flat / tidy).

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>
```

---

## rvest

.pull-left[
- The **rvest** package makes basic processing and manipulation of HTML data straightforward
- It's designed to work with pipelines built with `%>%`
]

.pull-right[
<img src="img/rvest.png" width="230" style="display: block; margin: auto 0 auto auto;" />
]

---

## Core rvest functions

- `read_html` - Read HTML data from a URL or character string
- `html_node` - Select a specified node from an HTML document
- `html_nodes` - Select specified nodes from an HTML document
- `html_table` - Parse an HTML table into a data frame
- `html_text` - Extract tag pairs' content
- `html_name` - Extract tags' names
- `html_attrs` - Extract all of each tag's attributes
- `html_attr` - Extract tags' attribute value by name

---
class: middle

# Wrapping Up...

---
class: middle

# Using the SelectorGadget

---

## SelectorGadget

.pull-left-narrow[
- Open source tool that eases CSS selector generation and discovery
- Easiest to use with the [Chrome Extension](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)
- Find out more on the [SelectorGadget vignette](https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html)
- Note: Our class is about 50-50 (Apple-Microsoft)
  - I have screenshots from a Mac in the slides,
  - and screen captures from my PC
]

.pull-right-wide[
<img src="img/selector_gadget/selector-gadget.png" width="75%" style="display: block; margin: auto;" />
]

---
class: middle

# Using the SelectorGadget (MAC)

---

## Using the SelectorGadget (MAC)

<img src="img/selector_gadget/selector-gadget.gif" width="80%" style="display: block; margin: auto;" />

---

<img src="img/selector_gadget/selector-gadget-1.png" width="95%" style="display: block; margin: auto;" />

---

<img src="img/selector_gadget/selector-gadget-2.png" width="95%" style="display: block; margin: auto;" />

---

<img
src="img/selector_gadget/selector-gadget-3.png" width="95%" style="display: block; margin: auto;" />

---

<img src="img/selector_gadget/selector-gadget-4.png" width="95%" style="display: block; margin: auto;" />

---

<img src="img/selector_gadget/selector-gadget-5.png" width="95%" style="display: block; margin: auto;" />

---

## Using the SelectorGadget (MAC)

Through this process of selection and rejection, SelectorGadget helps you find the appropriate CSS selector for your needs

<img src="img/selector_gadget/selector-gadget.gif" width="65%" style="display: block; margin: auto;" />

---
class: middle

# Using the SelectorGadget (PC)

---

## Using the SelectorGadget (PC)

.small.pull-left-narrow[
- Click on the logo next to the search bar
  - A box will open in the bottom right of the website
- Click on a page element (it will turn green),
  - SelectorGadget will generate a minimal CSS selector for that element,
  - and will highlight (yellow) everything that is matched by the selector
- Click on a highlighted element to remove it from the selector (red),
  - or click on an unhighlighted element to add it to the selector
- This process of selection and rejection
  - helps you find the appropriate CSS selector for your needs
]

---
class: middle

# Wrapping Up...

---
class: middle

# Top 250 movies on IMDB

---

## Top 250 movies on IMDB

Take a look at the source code and look for the `table` tag:
<br>
http://www.imdb.com/chart/top

<img src="img/imdb-top-250.png" width="65%" style="display: block; margin: auto;" />

---

## First check if you're allowed!

```r
library(robotstxt)
paths_allowed("http://www.imdb.com")
```

```
## [1] TRUE
```

vs.

```r
paths_allowed("http://www.facebook.com")
```

```
## [1] FALSE
```

---

.your-turn[
- [class git repo](https://github.com/DataScience4Psych) > `AE 08 - IMDB + Webscraping`.
- Open `01-imdb-250movies.R`.
- Follow along, and fill in the blanks as we go based on upcoming slides.
] --- ## Plan <img src="img/plan.png" width="90%" style="display: block; margin: auto;" /> --- ## Plan 1. Read the whole page 2. Scrape movie titles and save as `titles` 3. Scrape years movies were made in and save as `years` 4. Scrape IMDB ratings and save as `ratings` 5. Create a data frame called `imdb_top_250` with variables `title`, `year`, and `rating` --- class: middle # Step 1. Read the whole page --- ## Read the whole page ```r page <- read_html("https://www.imdb.com/chart/top/") page ``` ``` ## {html_document} ## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml"> ## [1] <head>\n<meta http-equiv="Content-Type" content="text/html ... ## [2] <body id="styleguide-v2" class="fixed">\n <img ... ``` --- ## A webpage in R - Result is a list with 2 elements ```r typeof(page) ``` ``` ## [1] "list" ``` -- - that we need to convert to something more familiar, like a data frame.... ```r class(page) ``` ``` ## [1] "xml_document" "xml_node" ``` --- class: middle # Step 2. Scrape movie titles and save as `titles` --- ## Scrape movie titles <img src="img/titles.png" width="70%" style="display: block; margin: auto;" /> --- ## Scrape the nodes .pull-left[ ```r page %>% html_nodes(".titleColumn a") ``` ``` ## {xml_nodeset (250)} ## [1] <a href="/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_ ... ## [2] <a href="/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_ ... ## [3] <a href="/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&pf_ ... ## [4] <a href="/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&pf_ ... ## [5] <a href="/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&pf_ ... ## [6] <a href="/title/tt0108052/?pf_rd_m=A2FGELUUNOQJNL&pf_ ... ## [7] <a href="/title/tt0167260/?pf_rd_m=A2FGELUUNOQJNL&pf_ ... ## [8] <a href="/title/tt0110912/?pf_rd_m=A2FGELUUNOQJNL&pf_ ... ## [9] <a href="/title/tt0120737/?pf_rd_m=A2FGELUUNOQJNL&pf_ ... ## [10] <a href="/title/tt0060196/?pf_rd_m=A2FGELUUNOQJNL&pf_ ... ## [11] <a href="/title/tt0109830/?pf_rd_m=A2FGELUUNOQJNL&pf_ ... 
## [12] <a href="/title/tt0137523/?pf_rd_m=A2FGELUUNOQJNL&pf_ ... ## [13] <a href="/title/tt1375666/?pf_rd_m=A2FGELUUNOQJNL&pf_ ... ## [14] <a href="/title/tt0167261/?pf_rd_m=A2FGELUUNOQJNL&pf_ ... ## [15] <a href="/title/tt0080684/?pf_rd_m=A2FGELUUNOQJNL&pf_ ... ## [16] <a href="/title/tt0133093/?pf_rd_m=A2FGELUUNOQJNL&pf_ ... ... ``` ] .pull-right[ <img src="img/titles.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Extract the text from the nodes .pull-left[ ```r page %>% html_nodes(".titleColumn a") %>% html_text() ``` ``` ## [1] "The Shawshank Redemption" ## [2] "The Godfather" ## [3] "The Dark Knight" ## [4] "The Godfather: Part II" ## [5] "12 Angry Men" ## [6] "Schindler's List" ## [7] "The Lord of the Rings: The Return of the King" ## [8] "Pulp Fiction" ## [9] "The Lord of the Rings: The Fellowship of the Ring" ## [10] "The Good, the Bad and the Ugly" ## [11] "Forrest Gump" ## [12] "Fight Club" ## [13] "Inception" ## [14] "The Lord of the Rings: The Two Towers" ## [15] "Star Wars: Episode V - The Empire Strikes Back" ## [16] "The Matrix" ... ``` ] .pull-right[ <img src="img/titles.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Save as `titles` .pull-left[ ```r titles <- page %>% html_nodes(".titleColumn a") %>% html_text() titles ``` ``` ## [1] "The Shawshank Redemption" ## [2] "The Godfather" ## [3] "The Dark Knight" ## [4] "The Godfather: Part II" ## [5] "12 Angry Men" ## [6] "Schindler's List" ## [7] "The Lord of the Rings: The Return of the King" ## [8] "Pulp Fiction" ## [9] "The Lord of the Rings: The Fellowship of the Ring" ## [10] "The Good, the Bad and the Ugly" ## [11] "Forrest Gump" ## [12] "Fight Club" ## [13] "Inception" ## [14] "The Lord of the Rings: The Two Towers" ... ``` ] .pull-right[ <img src="img/titles.png" width="100%" style="display: block; margin: auto;" /> ] --- class: middle # Step 3. 
Scrape year movies were made and save as `years` --- ## Scrape years movies were made in <img src="img/years.png" width="70%" style="display: block; margin: auto;" /> --- ## Scrape the nodes .pull-left[ ```r page %>% html_nodes(".secondaryInfo") ``` ``` ## {xml_nodeset (250)} ## [1] <span class="secondaryInfo">(1994)</span> ## [2] <span class="secondaryInfo">(1972)</span> ## [3] <span class="secondaryInfo">(2008)</span> ## [4] <span class="secondaryInfo">(1974)</span> ## [5] <span class="secondaryInfo">(1957)</span> ## [6] <span class="secondaryInfo">(1993)</span> ## [7] <span class="secondaryInfo">(2003)</span> ## [8] <span class="secondaryInfo">(1994)</span> ## [9] <span class="secondaryInfo">(2001)</span> ## [10] <span class="secondaryInfo">(1966)</span> ## [11] <span class="secondaryInfo">(1994)</span> ## [12] <span class="secondaryInfo">(1999)</span> ## [13] <span class="secondaryInfo">(2010)</span> ## [14] <span class="secondaryInfo">(2002)</span> ## [15] <span class="secondaryInfo">(1980)</span> ## [16] <span class="secondaryInfo">(1999)</span> ... 
``` ] .pull-right[ <img src="img/years.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Extract the text from the nodes .pull-left[ ```r page %>% html_nodes(".secondaryInfo") %>% html_text() ``` ``` ## [1] "(1994)" "(1972)" "(2008)" "(1974)" "(1957)" "(1993)" ## [7] "(2003)" "(1994)" "(2001)" "(1966)" "(1994)" "(1999)" ## [13] "(2010)" "(2002)" "(1980)" "(1999)" "(1990)" "(1975)" ## [19] "(1995)" "(1954)" "(1946)" "(1991)" "(1998)" "(2002)" ## [25] "(1997)" "(1999)" "(1977)" "(2014)" "(1991)" "(1985)" ## [31] "(2001)" "(1960)" "(1994)" "(2002)" "(2019)" "(1994)" ## [37] "(2000)" "(1998)" "(1995)" "(2006)" "(2021)" "(2006)" ## [43] "(1942)" "(2014)" "(2011)" "(1936)" "(1968)" "(1962)" ## [49] "(1988)" "(1979)" "(1954)" "(1931)" "(2000)" "(1979)" ## [55] "(1988)" "(1981)" "(2012)" "(2008)" "(2006)" "(1950)" ## [61] "(1980)" "(1957)" "(1940)" "(2018)" "(1957)" "(1986)" ## [67] "(1999)" "(2012)" "(1964)" "(2019)" "(2018)" "(2003)" ## [73] "(1995)" "(1995)" "(1984)" "(2017)" "(2009)" "(1981)" ## [79] "(2019)" "(1997)" "(2022)" "(1984)" "(1997)" "(2010)" ## [85] "(2000)" "(2009)" "(1952)" "(2016)" "(1983)" "(1992)" ## [91] "(2004)" "(1968)" "(1941)" "(1963)" "(1962)" "(1931)" ... ``` ] .pull-right[ <img src="img/years.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Clean up the text We need to go from `"(1994)"` to `1994`: - Remove `(` and `)`: string manipulation - Convert to numeric: `as.numeric()` --- ## stringr .pull-left-wide[ - **stringr** provides a cohesive set of functions designed to make working with strings as easy as possible - Functions in stringr start with `str_*()`, e.g. 
- `str_remove()` to remove a pattern from a string ```r str_remove(string = "jello", pattern = "el") ``` ``` ## [1] "jlo" ``` - `str_replace()` to replace a pattern with another .midi[ ```r str_replace(string = "jello", pattern = "j", replacement = "h") ``` ``` ## [1] "hello" ``` ] ] .pull-right-narrow[ <img src="img/stringr.png" width="100%" style="display: block; margin: auto auto auto 0;" /> ] --- ## Clean up the text ```r page %>% html_nodes(".secondaryInfo") %>% html_text() %>% str_remove("\\(") # remove ( ``` ``` ## [1] "1994)" "1972)" "2008)" "1974)" "1957)" "1993)" "2003)" ## [8] "1994)" "2001)" "1966)" "1994)" "1999)" "2010)" "2002)" ## [15] "1980)" "1999)" "1990)" "1975)" "1995)" "1954)" "1946)" ## [22] "1991)" "1998)" "2002)" "1997)" "1999)" "1977)" "2014)" ## [29] "1991)" "1985)" "2001)" "1960)" "1994)" "2002)" "2019)" ## [36] "1994)" "2000)" "1998)" "1995)" "2006)" "2021)" "2006)" ## [43] "1942)" "2014)" "2011)" "1936)" "1968)" "1962)" "1988)" ## [50] "1979)" "1954)" "1931)" "2000)" "1979)" "1988)" "1981)" ## [57] "2012)" "2008)" "2006)" "1950)" "1980)" "1957)" "1940)" ## [64] "2018)" "1957)" "1986)" "1999)" "2012)" "1964)" "2019)" ## [71] "2018)" "2003)" "1995)" "1995)" "1984)" "2017)" "2009)" ## [78] "1981)" "2019)" "1997)" "2022)" "1984)" "1997)" "2010)" ## [85] "2000)" "2009)" "1952)" "2016)" "1983)" "1992)" "2004)" ## [92] "1968)" "1941)" "1963)" "1962)" "1931)" "2018)" "1959)" ## [99] "2012)" "1958)" "2001)" "1971)" "1987)" "1983)" "1944)" ... 
``` --- ## Clean up the text ```r page %>% html_nodes(".secondaryInfo") %>% html_text() %>% str_remove("\\(") %>% # remove ( str_remove("\\)") # remove ) ``` ``` ## [1] "1994" "1972" "2008" "1974" "1957" "1993" "2003" "1994" ## [9] "2001" "1966" "1994" "1999" "2010" "2002" "1980" "1999" ## [17] "1990" "1975" "1995" "1954" "1946" "1991" "1998" "2002" ## [25] "1997" "1999" "1977" "2014" "1991" "1985" "2001" "1960" ## [33] "1994" "2002" "2019" "1994" "2000" "1998" "1995" "2006" ## [41] "2021" "2006" "1942" "2014" "2011" "1936" "1968" "1962" ## [49] "1988" "1979" "1954" "1931" "2000" "1979" "1988" "1981" ## [57] "2012" "2008" "2006" "1950" "1980" "1957" "1940" "2018" ## [65] "1957" "1986" "1999" "2012" "1964" "2019" "2018" "2003" ## [73] "1995" "1995" "1984" "2017" "2009" "1981" "2019" "1997" ## [81] "2022" "1984" "1997" "2010" "2000" "2009" "1952" "2016" ## [89] "1983" "1992" "2004" "1968" "1941" "1963" "1962" "1931" ## [97] "2018" "1959" "2012" "1958" "2001" "1971" "1987" "1983" ... ``` --- ## Convert to numeric ```r page %>% html_nodes(".secondaryInfo") %>% html_text() %>% str_remove("\\(") %>% # remove ( str_remove("\\)") %>% # remove ) as.numeric() ``` ``` ## [1] 1994 1972 2008 1974 1957 1993 2003 1994 2001 1966 1994 1999 ## [13] 2010 2002 1980 1999 1990 1975 1995 1954 1946 1991 1998 2002 ## [25] 1997 1999 1977 2014 1991 1985 2001 1960 1994 2002 2019 1994 ## [37] 2000 1998 1995 2006 2021 2006 1942 2014 2011 1936 1968 1962 ## [49] 1988 1979 1954 1931 2000 1979 1988 1981 2012 2008 2006 1950 ## [61] 1980 1957 1940 2018 1957 1986 1999 2012 1964 2019 2018 2003 ## [73] 1995 1995 1984 2017 2009 1981 2019 1997 2022 1984 1997 2010 ## [85] 2000 2009 1952 2016 1983 1992 2004 1968 1941 1963 1962 1931 ## [97] 2018 1959 2012 1958 2001 1971 1987 1983 1944 1985 1960 1976 ## [109] 1962 1973 1997 2009 2020 1995 1952 2000 1988 1989 2011 1927 ## [121] 1948 2010 2019 2007 2005 1965 2016 2004 1921 1959 2020 1950 ... 
``` --- ## Save as `years` .pull-left[ ```r years <- page %>% html_nodes(".secondaryInfo") %>% html_text() %>% str_remove("\\(") %>% # remove ( str_remove("\\)") %>% # remove ) as.numeric() years ``` ``` ## [1] 1994 1972 2008 1974 1957 1993 2003 1994 2001 1966 1994 1999 ## [13] 2010 2002 1980 1999 1990 1975 1995 1954 1946 1991 1998 2002 ## [25] 1997 1999 1977 2014 1991 1985 2001 1960 1994 2002 2019 1994 ## [37] 2000 1998 1995 2006 2021 2006 1942 2014 2011 1936 1968 1962 ## [49] 1988 1979 1954 1931 2000 1979 1988 1981 2012 2008 2006 1950 ## [61] 1980 1957 1940 2018 1957 1986 1999 2012 1964 2019 2018 2003 ## [73] 1995 1995 1984 2017 2009 1981 2019 1997 2022 1984 1997 2010 ## [85] 2000 2009 1952 2016 1983 1992 2004 1968 1941 1963 1962 1931 ## [97] 2018 1959 2012 1958 2001 1971 1987 1983 1944 1985 1960 1976 ## [109] 1962 1973 1997 2009 2020 1995 1952 2000 1988 1989 2011 1927 ## [121] 1948 2010 2019 2007 2005 1965 2016 2004 1921 1959 2020 1950 ... ``` ] .pull-right[ <img src="img/years.png" width="100%" style="display: block; margin: auto;" /> ] --- class: middle # Step 4. Scrape IMDB ratings and save as `ratings` --- ## Scrape IMDB ratings <img src="img/ratings.png" width="70%" style="display: block; margin: auto;" /> --- ## Scrape the nodes .pull-left[ ```r page %>% html_nodes("strong") ``` ``` ## {xml_nodeset (250)} ## [1] <strong title="9.2 based on 2,556,144 user ratings">9.2</ ... ## [2] <strong title="9.2 based on 1,759,566 user ratings">9.2</ ... ## [3] <strong title="9.0 based on 2,513,199 user ratings">9.0</ ... ## [4] <strong title="9.0 based on 1,218,952 user ratings">9.0</ ... ## [5] <strong title="9.0 based on 755,400 user ratings">9.0</st ... ## [6] <strong title="8.9 based on 1,303,994 user ratings">8.9</ ... ## [7] <strong title="8.9 based on 1,761,125 user ratings">8.9</ ... ## [8] <strong title="8.9 based on 1,964,714 user ratings">8.9</ ... ## [9] <strong title="8.8 based on 1,783,066 user ratings">8.8</ ... 
## [10] <strong title="8.8 based on 736,637 user ratings">8.8</st ... ## [11] <strong title="8.8 based on 1,972,731 user ratings">8.8</ ... ## [12] <strong title="8.8 based on 2,012,472 user ratings">8.8</ ... ## [13] <strong title="8.7 based on 2,244,850 user ratings">8.7</ ... ## [14] <strong title="8.7 based on 1,591,134 user ratings">8.7</ ... ## [15] <strong title="8.7 based on 1,238,547 user ratings">8.7</ ... ## [16] <strong title="8.7 based on 1,842,667 user ratings">8.7</ ... ... ``` ] .pull-right[ <img src="img/ratings.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Extract the text from the nodes .pull-left[ ```r page %>% html_nodes("strong") %>% html_text() ``` ``` ## [1] "9.2" "9.2" "9.0" "9.0" "9.0" "8.9" "8.9" "8.9" "8.8" "8.8" ## [11] "8.8" "8.8" "8.7" "8.7" "8.7" "8.7" "8.7" "8.6" "8.6" "8.6" ## [21] "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.5" "8.5" ## [31] "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" ## [41] "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.4" ## [51] "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" ## [61] "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.3" "8.3" "8.3" "8.3" ## [71] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" ## [81] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" ## [91] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" ## [101] "8.3" "8.3" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" ## [111] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" ## [121] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" ## [131] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" ## [141] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.1" "8.1" ## [151] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" ... 
``` ] .pull-right[ <img src="img/ratings.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Convert to numeric .pull-left[ ```r page %>% html_nodes("strong") %>% html_text() %>% as.numeric() ``` ``` ## [1] 9.2 9.2 9.0 9.0 9.0 8.9 8.9 8.9 8.8 8.8 8.8 8.8 8.7 8.7 8.7 ## [16] 8.7 8.7 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.5 8.5 ## [31] 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 ## [46] 8.5 8.5 8.5 8.5 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 ## [61] 8.4 8.4 8.4 8.4 8.4 8.4 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 ## [76] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 ## [91] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.2 8.2 8.2 ## [106] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 ## [121] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 ## [136] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.1 8.1 ## [151] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 ## [166] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 ## [181] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 ## [196] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 ## [211] 8.1 8.1 8.1 8.1 8.1 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 ... 
``` ] .pull-right[ <img src="img/ratings.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Save as `ratings` .pull-left[ ```r ratings <- page %>% html_nodes("strong") %>% html_text() %>% as.numeric() ratings ``` ``` ## [1] 9.2 9.2 9.0 9.0 9.0 8.9 8.9 8.9 8.8 8.8 8.8 8.8 8.7 8.7 8.7 ## [16] 8.7 8.7 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.5 8.5 ## [31] 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 ## [46] 8.5 8.5 8.5 8.5 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 ## [61] 8.4 8.4 8.4 8.4 8.4 8.4 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 ## [76] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 ## [91] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.2 8.2 8.2 ## [106] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 ## [121] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 ## [136] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.1 8.1 ## [151] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 ## [166] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 ... ``` ] .pull-right[ <img src="img/ratings.png" width="100%" style="display: block; margin: auto;" /> ] --- class: middle # Step 5. Create a data frame called `imdb_top_250` --- ## Create a data frame: `imdb_top_250` ```r imdb_top_250 <- tibble( title = titles, year = years, rating = ratings ) imdb_top_250 ``` ``` ## # A tibble: 250 x 3 ## title year rating ## <chr> <dbl> <dbl> ## 1 The Shawshank Redemption 1994 9.2 ## 2 The Godfather 1972 9.2 ## 3 The Dark Knight 2008 9 ## 4 The Godfather: Part II 1974 9 ## 5 12 Angry Men 1957 9 ## 6 Schindler's List 1993 8.9 ## 7 The Lord of the Rings: The Return of the King 2003 8.9 ## 8 Pulp Fiction 1994 8.9 ## 9 The Lord of the Rings: The Fellowship of the Ring 2001 8.8 ## 10 The Good, the Bad and the Ugly 1966 8.8 ## # ... with 240 more rows ``` ---
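
## Putting Steps 2 to 5 together

Steps 2 through 5 can be bundled into one function, which makes it easy to rerun the scrape and to catch selectors that return different numbers of items. This is only a sketch: `scrape_top_movies()` and the toy page are hypothetical, the selectors are the ones used in the earlier slides, and `str_remove_all("[()]")` is swapped in for the two `str_remove()` calls.

```r
library(rvest)
library(stringr)
library(tibble)

# Hypothetical helper: bundles Steps 2-5 so they can be rerun on any parsed page.
scrape_top_movies <- function(page) {
  titles <- page %>%
    html_nodes(".titleColumn a") %>%
    html_text()

  years <- page %>%
    html_nodes(".secondaryInfo") %>%
    html_text() %>%
    str_remove_all("[()]") %>% # drop both parentheses in one step
    as.numeric()

  ratings <- page %>%
    html_nodes("strong") %>%
    html_text() %>%
    as.numeric()

  # Stop early if the selectors did not return one value per movie.
  stopifnot(
    length(titles) == length(years),
    length(titles) == length(ratings)
  )

  tibble(title = titles, year = years, rating = ratings)
}

# Toy page (made up for illustration) that mimics the markup used in the slides.
toy_page <- read_html('
  <table>
    <tr>
      <td class="titleColumn"><a href="/title/a/">Movie A</a>
        <span class="secondaryInfo">(1994)</span></td>
      <td><strong>9.2</strong></td>
    </tr>
    <tr>
      <td class="titleColumn"><a href="/title/b/">Movie B</a>
        <span class="secondaryInfo">(1972)</span></td>
      <td><strong>9.0</strong></td>
    </tr>
  </table>
')

toy_movies <- scrape_top_movies(toy_page)
toy_movies
```

Testing on a small inline page first means the live site is only requested once the selectors behave as expected.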
--- ## Clean up / enhance May or may not be a lot of work depending on how messy the data are - See if you like what you got: ```r glimpse(imdb_top_250) ``` ``` ## Rows: 250 ## Columns: 3 ## $ title <chr> "The Shawshank Redemption", "The Godfather", "Th~ ## $ year <dbl> 1994, 1972, 2008, 1974, 1957, 1993, 2003, 1994, ~ ## $ rating <dbl> 9.2, 9.2, 9.0, 9.0, 9.0, 8.9, 8.9, 8.9, 8.8, 8.8~ ``` - Add a variable for rank ```r imdb_top_250 <- imdb_top_250 %>% mutate(rank = 1:nrow(imdb_top_250)) %>% relocate(rank) ``` --- ``` ## # A tibble: 250 x 4 ## rank title year rating ## <int> <chr> <dbl> <dbl> ## 1 1 The Shawshank Redemption 1994 9.2 ## 2 2 The Godfather 1972 9.2 ## 3 3 The Dark Knight 2008 9 ## 4 4 The Godfather: Part II 1974 9 ## 5 5 12 Angry Men 1957 9 ## 6 6 Schindler's List 1993 8.9 ## 7 7 The Lord of the Rings: The Return of the K~ 2003 8.9 ## 8 8 Pulp Fiction 1994 8.9 ## 9 9 The Lord of the Rings: The Fellowship of t~ 2001 8.8 ## 10 10 The Good, the Bad and the Ugly 1966 8.8 ## 11 11 Forrest Gump 1994 8.8 ## 12 12 Fight Club 1999 8.8 ## 13 13 Inception 2010 8.7 ## 14 14 The Lord of the Rings: The Two Towers 2002 8.7 ## 15 15 Star Wars: Episode V - The Empire Strikes ~ 1980 8.7 ## 16 16 The Matrix 1999 8.7 ## 17 17 Goodfellas 1990 8.7 ## 18 18 One Flew Over the Cuckoo's Nest 1975 8.6 ## 19 19 Se7en 1995 8.6 ## 20 20 Seven Samurai 1954 8.6 ## # ... with 230 more rows ``` --- class: middle # What next? --- .question[ Which years have the most movies on the list? ] -- ```r imdb_top_250 %>% count(year, sort = TRUE) ``` ``` ## # A tibble: 86 x 2 ## year n ## <dbl> <int> ## 1 1995 8 ## 2 2004 7 ## 3 1957 6 ## 4 2003 6 ## 5 2009 6 ## 6 2019 6 ## 7 1975 5 ## 8 1994 5 ## 9 1997 5 ## 10 1998 5 ## # ... with 76 more rows ``` --- .question[ Which 1995 movies made the list? 
] -- ```r imdb_top_250 %>% filter(year == 1995) %>% print(n = 8) ``` ``` ## # A tibble: 8 x 4 ## rank title year rating ## <int> <chr> <dbl> <dbl> ## 1 19 Se7en 1995 8.6 ## 2 39 The Usual Suspects 1995 8.5 ## 3 73 Braveheart 1995 8.3 ## 4 74 Toy Story 1995 8.3 ## 5 114 Heat 1995 8.2 ## 6 136 Casino 1995 8.2 ## 7 187 Before Sunrise 1995 8.1 ## 8 244 La Haine 1995 8 ``` --- .question[ Visualize the average yearly rating for movies that made it on the top 250 list over time. ] -- .pull-left[ <img src="d16_webscraping_files/figure-html/unnamed-chunk-54-1.png" width="100%" style="display: block; margin: auto;" /> ] .medi.pull-right[ ```r imdb_top_250 %>% group_by(year) %>% summarize(avg_score = mean(rating)) %>% ggplot(aes(y = avg_score, x = year)) + geom_point() + geom_smooth(method = "lm", se = FALSE) + labs(x = "Year", y = "Average score") ``` ] --- .your-turn[ - [class git repo](https://github.com/DataScience4Psych) > `AE 08 - IMDB + Webscraping`. - Open `02-imdb-tvshows.R`. - Scrape the names, scores, and years of most popular TV shows on IMDB: [www.imdb.com/chart/tvmeter](http://www.imdb.com/chart/tvmeter). - Create a data frame called `tvshows` with four variables: `rank`, `name`, `score`, `year`. - Examine each of the **first three** TV shows to also obtain genre, runtime, how many episodes so far, first five plot keywords. - Add this information to the `tvshows` data frame you created earlier. ] --- class: middle # Wrapping Up... --- class: middle # Ethics --- ## "Can you?" vs "Should you?" <img src="img/ok-cupid-1.png" width="60%" style="display: block; margin: auto;" /> .footnote[.small[ Source: Brian Resnick, [Researchers just released profile data on 70,000 OkCupid users without permission](https://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release), Vox. ]] --- ## "Can you?" vs "Should you?" 
<img src="img/ok-cupid-2.png" width="70%" style="display: block; margin: auto;" />

---
class: middle

# Challenges

---

## Unreliable formatting at the source

<img src="img/unreliable-formatting.png" width="70%" style="display: block; margin: auto;" />

---

## Data broken into many pages

<img src="img/many-pages.png" width="70%" style="display: block; margin: auto;" />

---
class: middle

# Workflow

---

## Screen scraping vs. APIs

Two different scenarios for web scraping:

- Screen scraping:
  - extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy)
- Web APIs (application programming interfaces):
  - a website offers a set of structured HTTP requests that return JSON or XML files

---

## A new R workflow

- When working in an R Markdown document,
  - your analysis is re-run each time you knit
- If web scraping in an R Markdown document,
  - you'd be re-scraping the data each time you knit,
  - which is undesirable (and not *nice*)!
- An alternative workflow:
  - Use an R script to save your code
  - Save interim data scraped using the code in the script as CSV or RDS files
  - Use the saved data in your analysis in your R Markdown document

---

# Sources

- Mine Çetinkaya-Rundel's Data Science in a Box ([link](https://datasciencebox.org/))

---
class: middle

# Wrapping Up...
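
---

## A new R workflow: a sketch

The alternative workflow from the "A new R workflow" slide might look roughly like this. It is a minimal sketch under stated assumptions: the file name, the temporary folder, and the two made-up rows standing in for scraped data are all hypothetical, and readr is just one option for writing CSV files.

```r
library(readr)
library(tibble)

# Hypothetical location; a real project would more likely use a data/ folder.
out_file <- file.path(tempdir(), "imdb_top_250.csv")

# Stand-in for the scraped data frame (made-up rows, for illustration only).
scraped <- tibble(
  title  = c("Movie A", "Movie B"),
  year   = c(1994, 1972),
  rating = c(9.2, 9.0)
)

# In the R script: scrape once, then save the interim data.
write_csv(scraped, out_file)

# In the R Markdown document: read the saved file instead of re-scraping.
imdb_from_disk <- read_csv(out_file)
imdb_from_disk
```

With this split, knitting the document only touches the saved file, so the website is not contacted again.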