class: center, middle, inverse, title-slide .title[ # Deeper Dive into ggplot
🌊 ] .author[ ### S. Mason Garrison ] --- layout: true <div class="my-footer"> <span> <a href="https://DataScience4Psych.github.io/DataScience4Psych/" target="_blank">Data Science for Psychologists</a> </span> </div> --- class: middle # Deep Diving into Grammar of Graphics --- ## What does "grammar of graphics" mean? * The analogy with English grammar, * or any language's grammar, * is that it allows you to put together component parts -- + Better than "grammar of graphics" might be the * "orthogonal components of graphics," * but that doesn't have the same alliterative appeal. -- + The power of the grammar of graphics is that it is modular: * different aspects of the plot can be specified independently of each other. --- ## An example + As an example, the coordinate system is specified separately * from the geometric object used to represent the points. + We have three representations of the same data, where * the only difference between them * is the coordinate system used to represent them. .medi[ ``` r bar_plot = ggplot(diamonds) + aes(x ="", fill = clarity)+ geom_bar(width = 1, position = "stack") bar_plot ``` ``` r bar_plot + coord_polar() ``` ``` r bar_plot + coord_polar(theta = "y") ``` ] --- ``` r bar_plot ``` <img src="d13b_moreggplot_files/figure-html/unnamed-chunk-3-1.png" width="60%" style="display: block; margin: auto;" /> --- ``` r bar_plot + coord_polar() ``` <img src="d13b_moreggplot_files/figure-html/unnamed-chunk-4-1.png" width="60%" style="display: block; margin: auto;" /> --- ``` r bar_plot + coord_polar(theta = "y") ``` <img src="d13b_moreggplot_files/figure-html/unnamed-chunk-5-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Again, the same dataset Again, the same dataset, three different coordinate systems, very different representations: ``` r dodged_bar_plot = ggplot(diamonds) + geom_bar(aes(x = "", fill = clarity), width = 1, position = "dodge") dodged_bar_plot ``` ``` r dodged_bar_plot + coord_polar() ``` ``` r dodged_bar_plot + coord_polar(theta = "y") ``` --- ``` r dodged_bar_plot ``` <img src="d13b_moreggplot_files/figure-html/unnamed-chunk-7-1.png" width="60%" style="display: block; margin: auto;" /> --- ``` r dodged_bar_plot + coord_polar() ``` <img src="d13b_moreggplot_files/figure-html/unnamed-chunk-8-1.png" width="60%" style="display: block; margin: auto;" /> --- ``` r dodged_bar_plot + coord_polar(theta = "y") ``` <img src="d13b_moreggplot_files/figure-html/unnamed-chunk-9-1.png" width="60%" style="display: block; margin: auto;" /> --- class: middle # Wrapping Up... --- class: middle # What are the components of a plot? --- ## Components of a plot + A default dataset and set of mappings from variables to aesthetics. + One or more _layers_, each of which contains -- + one geometric (`geom_*`) object, + one statistical transformation (`stat_*`), + one position adjustment (`position_*`), + and one dataset and set of aesthetic mappings. -- + One scale for each aesthetic mapping. + A coordinate system (`coord_*`). + A facet specification (`facet_*`). -- .tip[ Layers are the most important and involved part of the plot. ] --- ## What is a layer? - data and aesthetic mapping - statistical transformation (stat) - geometric object (geom) - position adjustment -- ``` r p = ggplot(diamonds, aes(x = color, y = clarity)) + geom_point() p$layers ``` ``` ## [[1]] ## geom_point: na.rm = FALSE ## stat_identity: na.rm = FALSE ## position_identity ``` --- ## Data and aesthetic mapping + Data: self evident. For ggplot the data needs to be formatted as a tibble or a data.frame. + Aesthetic mapping: Less evident -- + Describes how variables in the dataset are mapping + to "aesthetic" attributes of the plot. + "Aesthetic" here means perceivable: + something you can observe on the plot. --- ## Aesthetic/perceivable attributes + Examples include * position along the `\(x\)`-axis, * color, * shape, * position along the `\(y\)`-axis, * opacity, * linetype --- ## Data and aesthetic mapping go together * They are not independent of each other: * the aesthetic mapping takes variables in your data and + maps them to aesthetics/perceivable parts of the plot and + is therefore specific to a dataset. --- class: middle # Wrapping Up... --- class: middle # Stats, Geoms, and Positions --- # Statistical transformation A statistical transformation is some transformation of the data. ![](img/ggplot_stat_table.png) -- <br> .footnote[ These stat transformations and position adjustments can overlap. Often there is more than one way to create the same plot. ] --- ## Geometric objects Geometric objects (`geom_*`) control the type of plot you create. Examples are: - Points, text * (zero-dimensional geometric objects) - Line, path * (one-dimensional geometric objects) - Polygon, interval * (two-dimensional geometric objects) - More complicated: boxplot --- ## Relationship between stats and geoms - Every statistic has a default geometric object, - and every geometric object has a default statistic. - Stats and geoms are not completely orthogonal: - not every combination is valid - (although many are). --- ## For example: - `stat_bin` and `geom_bar` - is valid and - standard for a histogram. - `stat_bin` and `geom_point` or `stat_bin` + `geom_line` - are valid - but non-standard combinations. - They give a plot that is resembles a histogram - (interpretable that same way, too) - `stat_identity` and `geom_boxplot` - is invalid, because - boxplot needs certain computed quantities from the data. --- ## Position adjustment Used to avoid "collisions" in the plot objects: - In bar plots where one of the aesthetics is height, - the bars would often be plotted over each other. - In this case we use the "dodge" or "stack" position adjustments. - If we have issues with overplotting (multiple points in exactly the same place), - we can use the "jitter" position adjustment - to randomly move the points a small amount. --- .medi[.pull-right-wide[ ``` r (p = ggplot(diamonds, aes(x = color, y = price)) + geom_boxplot()) ``` <img src="d13b_moreggplot_files/figure-html/unnamed-chunk-11-1.png" width="60%" style="display: block; margin: auto;" /> ] .pull-left-narrow[ ``` r p$layers ``` ``` ## [[1]] ## geom_boxplot: outliers = TRUE, outlier.colour = NULL, outlier.fill = NULL, outlier.shape = 19, outlier.size = 1.5, outlier.stroke = 0.5, outlier.alpha = NULL, notch = FALSE, notchwidth = 0.5, staplewidth = 0, varwidth = FALSE, na.rm = FALSE, orientation = NA ## stat_boxplot: na.rm = FALSE, orientation = NA ## position_dodge2 ``` ] ] --- ``` r ggplot(diamonds,aes(x = color, y = price, color = clarity)) + geom_boxplot(position = "identity") ``` <img src="d13b_moreggplot_files/figure-html/unnamed-chunk-13-1.png" width="60%" style="display: block; margin: auto;" /> --- ``` r ggplot(diamonds,aes(x = color, y = price, color = clarity)) + geom_boxplot(position = "dodge") ``` <img src="d13b_moreggplot_files/figure-html/unnamed-chunk-14-1.png" width="60%" style="display: block; margin: auto;" /> .footnote[`position = "dodge"` is the default for boxplots, so you don't need to specify it.] --- ``` r ggplot(diamonds,aes(x = color, y = price, color = clarity)) + geom_boxplot(position = position_dodge(width = 1)) ``` <img src="d13b_moreggplot_files/figure-html/unnamed-chunk-15-1.png" width="60%" style="display: block; margin: auto;" /> .footnote[`position = "dodge"` is the default for boxplots, so you don't need to specify it.] --- class: middle # Wrapping Up... --- class: middle # Scales and Coordinates --- ## Scales + So far, we've defined aesthetic mappings that specify, + which perceivable aspects of the plot correspond + to which variables. -- + However, perceivable aspects of the plot + can be mapped to variables many other ways. --- ## For example + If we have a categorical variable that takes values "A" and "B" to the color aesthetic, + that means that color is going to represent + whether variable took value "A" or "B". + But we could do that in a practically infinite number of ways -- + e.g. - A maps to "red", B maps to "black" - A maps to "green", B maps to "blue" - A maps to "purple", B maps to "gold" -- - ... You probably get the picture --- ## Modernized ggplot - *Now*, ggplot has good default mappings from values into aesthetic space, - but you will sometimes want to set them by hand. -- - To do so, you use the `scale_*` functions. -- - The *old* version of ggplot had poor mappings from continuous values to colors, - and the [viridis](https://cran.r-project.org/web/packages/viridis/index.html) color scheme was much better. - The most recent version of ggplot uses viridis by default - for both continuous values and ordered factors. --- ## Coordinate system Another aspect of the plot that can be specified independently of everything else is the coordinate system. + `coord_cartesian` is the default, - and is almost always what you want. -- + `coord_flip` is sometimes useful: - for example, boxplots require the explanatory variable to be mapped to x, - so if you want a horizontal boxplot you need to use `coord_flip`. -- + `coord_polar` will often make your plots look cooler * and more difficult to read. * Not usually recommended. --- ## Faceting Allows you to make small multiples of plots. + Other grammars/plotting systems implement faceting with a fancy coordinate system, + but it turns out that it's easier to use if you think about it separately. -- + Each facet plots a subset of the data, - and it takes as input - what variable(s) to use to make the subsets and - how to lay out the subsets. --- ## Facet options - `facet_wrap`: - Lays out the plots for each subset sequentially. -- - `facet_grid`: - Subsets the data according to two separate variables. -- + The facet position along the `\(x\)`-axis represents levels of one variable, - and the facet position along the `\(y\)`-axis represents levels of the other variable. --- class: middle # Wrapping Up... --- class: middle # How this all works --- ## The long way - One way to specify a ggplot is to - specify all of the components we've seen. - If you understand all the parts, - this way is probably the least confusing method to specify a ggplot. - The problem is that it's verbose. --- ## Verbose - Suppose we want to make a plot with points and a smoother - from the diamonds dataset. - We can specify data, mapping, geom, stat, and positions for each layer, - along with scales and the coordinate system as follows: -- .midi[ ``` r ggplot() + layer( data = diamonds, mapping = aes(x = carat, y = price), geom = "point", stat = "identity", position = "identity") + layer( data = diamonds, mapping = aes(x = carat, y = price), geom = "smooth", position = "identity", stat = "smooth", params = list(method = "lm")) + scale_x_log10() + scale_y_log10() + coord_cartesian() ``` ] --- <img src="d13b_moreggplot_files/figure-html/unnamed-chunk-16-1.png" width="85%" style="display: block; margin: auto;" /> --- ## Defaults make the code shorter The more standard way of writing the same plot would be: .medi[ ``` r p = ggplot(data = diamonds, aes(x = carat, y = price)) + geom_point() + stat_smooth(method = "lm") + scale_x_log10() + scale_y_log10() ``` ] -- This code is still fairly long, but we don't have to specify... - position: Default for both `geom_point` and `stat_smooth` is `position = "identity"`. - stat, for `geom_point`: The default stat for `geom_point` is `stat = "identity"`. - geom, for `stat_smooth`: The default geom for `stat_smooth` is `geom_smooth`. - coordinate system: `coord_cartesian` is always the default. --- You can check what stat, geom, and position is used for each of the layers: .medi[ ``` r names(p) ``` ``` ## [1] "data" "layers" "scales" "guides" ## [5] "mapping" "theme" "coordinates" "facet" ## [9] "plot_env" "layout" "labels" ``` .medi[ ``` r p$layers ``` ``` ## [[1]] ## geom_point: na.rm = FALSE ## stat_identity: na.rm = FALSE ## position_identity ## ## [[2]] ## geom_smooth: se = TRUE, na.rm = FALSE, orientation = NA ## stat_smooth: method = lm, formula = NULL, se = TRUE, n = 80, fullrange = FALSE, level = 0.95, na.rm = FALSE, orientation = NA, method.args = list(), span = 0.75, xseq = NULL ## position_identity ``` ] --- class: middle # Wrapping Up... --- class: middle # Minard on Napoleon's invasion of Russia --- ## Minard on Napoleon's invasion of Russia One of [the most famous statistical graphics, created by Charles Minard](https://en.wikipedia.org/wiki/Charles_Joseph_Minard) depicts Napoleon's 1812 invasion of and retreat from Russia. .center[ <img src="img/Minard.png" width="80%" style="display: block; margin: auto;" /> ] --- ## Minard on Napoleon's invasion of Russia One of [the most famous statistical graphics, created by Charles Minard](https://en.wikipedia.org/wiki/Charles_Joseph_Minard) depicts Napoleon's 1812 invasion of and retreat from Russia. .center[ <img src="img/Minard_Update.png" width="81%" style="display: block; margin: auto;" /> ] --- .medi[ ``` r minard = read_csv("data/minard.csv");minard_cities = read_csv("data/minard-cities.csv") (plot_troops = ggplot(minard) + geom_path(aes(x = long, y = lat, color = direction, size = surviv, group = division))) ``` <img src="d13b_moreggplot_files/figure-html/minardpt1-1.png" width="55%" style="display: block; margin: auto;" /> ] --- Let's add another layer for the cities: .medi[ ``` r (plot_both = plot_troops + geom_text(aes(x = long, y = lat, label = city), data = minard_cities, size = 3)) ``` <img src="d13b_moreggplot_files/figure-html/minardpt2-1.png" width="55%" style="display: block; margin: auto;" /> ] .footnote[Notice: we have a new dataset for this layer.] --- ## Next Steps Some things that are still not great about this plot: .pull-left[ - We would like the colors - for advance and retreat to be gray and red. - We want the line widths to be - proportional to the number of survivors. - We would like the line ends to be - round instead of square. ] .pull-right[ <img src="d13b_moreggplot_files/figure-html/unnamed-chunk-22-1.png" width="99%" style="display: block; margin: auto;" /> ] --- ## An improved version of the plot, with better scales: .small[.pull-left-narrow[ ``` r ggplot(minard) + geom_path(aes(x = long, y = lat, color = direction, size = surviv^2, group = division), lineend = "round") + geom_text(aes(x = long, y = lat, label = city), data = minard_cities, size = 3) + scale_size(range = c(.18, 15), breaks = (1:3 * 10^5)^2, labels = scales::comma(1:3 * 10^5), "Survivors") + scale_color_manual(values = c("grey50", "red"), breaks = c("A", "R"), labels = c("Advance", "Retreat"), "") + xlab(NULL) + ylab(NULL) + theme(axis.text = element_blank(), axis.ticks = element_blank(), panel.grid = element_blank()) ``` ]] .pull-right-wide[ <img src="d13b_moreggplot_files/figure-html/unnamed-chunk-23-1.png" width="99%" style="display: block; margin: auto;" /> ] --- .your-turn[ ] - Try creating a histogram on the diamonds dataset, for example with... ``` r p = ggplot() + geom_histogram(aes(x = carat), data = diamonds) ``` - Re-write this using the `layer` function. .tip[ - If you don't know what the default values for some of the aspects of the plot, examine `p$layers` - Remember that a histogram is a plot with `stat_bin` and `geom_bar`. ] --- .your-turn[ ] .pull-left-narrow[ <img src="d13b_moreggplot_files/figure-html/unnamed-chunk-25-1.png" width="99%" style="display: block; margin: auto;" /> ] - Modify your histogram code so that it uses a different geom, for example `geom_line` or `geom_point`. - This change should be simple once you have the `layer` specification of a histogram. - In your histogram, - add an aesthetic mapping from one of the factor variables (maybe color or clarity) to the color aesthetic. .question[What is the default position adjustment for a histogram? ] .tip[- Try changing the position adjustment to something different - (try `position_dodge`).] --- ## Difficult Stretch Goal - Recreate the Minard map more precisely. - Add text for the number of troops surviving on each segment, - and add the time and temperature data to the bottom of the plot. --- ## ggplot and EDA Remember our passage from Tukey: > *Exploratory data analysis is detective work*... As all detective stories remind us, many of the circumstances surrounding a crime are accidental or misleading. Equally, many of the indications to be discerned in bodies of data are accidental or misleading. To accept all appearances as conclusive would be destructively foolish, either in crime detection or in data analysis. *To fail to collect all appearances because some -- or even most -- are only accidents would, however, be gross misfeasance deserving (and often receiving) appropriate punishment.* --- # Concluding Thoughts - The flexibility in the grammar of graphics allows us to collect many more "appearances" in the data than we would be able to, - if we just have access to a handful of named plots. -- - Many of the plots that we can make with ggplot are not useful, - but the point is to try visualizing the data in many different ways. - ggplot opens up a very large space of statistical graphics to us for not very much effort. --- # Sources - Mine Çetinkaya-Rundel's Data Science in a Box ([link](https://datasciencebox.org/)) - Julia Fukuyama's EDA ([link](https://jfukuyama.github.io/)) --- class: middle # Wrapping Up...