Guidance
This document provides some basic assignment instructions and information. It is not the syllabus for students taking this course. Please see the data science syllabus on my syllabus website. And yes, the bookdown theme looks familiar…
0.6 Materials
0.6.2 Required Texts
The text is intended to supplement the videos, lecture notes, tutorials, activities, and labs. You probably need to consume all of them in order to be successful in this class. All materials for this course are open source, including the multimedia course notes.
- Garrison’s Data Science for Psychologists (https://datascience4psych.github.io/DataScience4Psych/)
- Wickham and Grolemund’s R for Data Science text
0.6.3 Software
0.6.3.1 R and RStudio
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX, Windows, and macOS platforms.
RStudio is a free integrated development environment (IDE), a powerful user interface for R.
0.6.3.2 Git and GitHub
Git is a version control system. Its original purpose was to help groups of developers work collaboratively on big software projects. Git manages the evolution of a set of files – called a repository – in a structured way. Think of it like the “Track Changes” feature in Microsoft Word.
GitHub is a free hosting service for Git repositories. As a Wake Forest student, you should be able to access the GitHub Student Developer Pack for free. It includes a free Pro upgrade for your GitHub account. You can learn more about how we’ll use GitHub in this class here. If you want to jump ahead, here are some Git clients that you can use to interact with GitHub.
0.7 Portfolio Instructions
0.7.1 EDA as Practice
Data science (and Exploratory Data Analysis) is like basketball. We can watch either being done, and appreciate the art and skill involved in high-level performance. In the hands of LeBron James or Michael Jordan, a basketball is a highly tuned artistic instrument; in the hands of John Tukey, a graph sings the praises of data in melodies both harmonious and discordant, reflecting model, data, and mood. Part of this course will be devoted to Watching and Studying the master at his work.
But basketball is played by thousands of bodies with less than NBA training and ability. Some novice basketball players are just learning their craft, and others will evolve into future LJs and MJs; others have lower aspirations, yet still enjoy participating. So, too, should data science be played. A second part of this course will involve learning to do EDA by Doing It.
That’s what your portfolio is: a record of your practice. It’s not a collection of finished products or a checklist of requirements—it’s a demonstration of your engagement with data over time. Through each piece, you’re showing how your thinking has evolved, how you’ve tried and revised and explored, and how you’re developing your own way of working with data. Some pieces may be refined, others experimental or unfinished. What matters is that they reflect your process—what you’re noticing, questioning, building, and becoming.
If you want a sense of what that can look like, you can explore Eric Stone’s portfolio https://estone65.github.io/portfolios/, which includes three reflective pieces from an earlier version of this class. You’ll notice that none of them follow a rigid template. Each one is personal, exploratory, and tied to his own questions and data. That’s what yours should be, too: a series of artifacts that reflect your thinking, your style, and your process.
Each of you will be expected to create several portfolio pieces. These can emerge from in-class activities, labs, tutorials, your thesis or research, or anything else that allows you to demonstrate what you’re working on. You’ll report on your progress during lab, and you’ll compile everything into your final portfolio site by the end of the semester.
You are welcome (in fact, strongly encouraged) to use data that is relevant to your broader work: thesis or dissertation data, side projects, collaborations, or even data collected from your own life. You’re also welcome to use data I’ve provided, or to scrape, simulate, or remix your own.
0.7.1.1 Documenting Your Projects
Your portfolio should not only showcase your final products but also illuminate your thought process, challenges, and insights. Here’s how you should document each project in your portfolio:
Goal – Clearly state the question, challenge, or idea you were exploring. Explain why you chose this particular piece.
Product – Describe what you created. This could be a plot, an app, a sketch, a dashboard, a transformation, a function, or something else.
Data – Detail the data you used. Provide a brief description of the source, format, and any cleaning or wrangling steps you took.
Interpretation – Share what you noticed, learned, or discovered. Reflect on what you would try next or what surprised you.
At the end of the course, you will turn in a Portfolio, which consists of two components:
A report describing all your projects. The total number depends on the scope and difficulty of each project (as specified in your contract). There may be projects that you don’t finish. That’s fine; EDA projects are hardly ever completely finished—write them up anyway. Projects should be numbered consecutively in the order you began them, and each should follow the structure outlined above. Reports must be reproducible and of high quality in terms of writing, grammar, and presentation.
A prototypical example of the product of each project. This might include a graph, computer code, visualization, or another artifact. You may wish to embed these in your report or append them separately, depending on format.
Portfolios will not be returned; if you wish to have a copy, make one before you turn it in. Portfolios are due during finals. Project reports will not be accepted late. Please, no exceptions!!!
0.7.1.2 Portfolio Template
If this sounds like a lot to organize, don’t worry. I’ve created a portfolio template that’s designed to make the whole process easier, and a former student, Benjamin Egan, wrote up this entire subsection to help you get started. The template is a GitHub repository that you can clone to your own GitHub account.
0.7.1.2.1 Why use the template?
The benefit of using the template is that you can create a neatly packaged version of all your portfolio pieces. Using the build function of the .Rproj file, R will knit all portfolio pieces into one big interactive website. Don’t worry if you haven’t thought of all of your portfolio pieces yet; those sections will remain blank until your magnificent pieces are created.
You can also take control of your website through customization and personalization. Try checking out the secret folder in the template for extra tips and tricks to make the website your own!
0.7.1.3 Possible Portfolio Pieces
The following ideas are just a jumping-off point for your creativity—think of them as a playground rather than a checklist. You are not required to choose anything from this list; feel free to come up with your own unique projects that resonate with you. Your portfolio should highlight your unique journey through data exploration—feel free to choose a few ideas that excite you and delve into them with gusto!
Draw plots by hand of some data that are of interest to you, and transform the variables in several different ways. Interpret your results.
Choose some data from EDA (Exploratory Data Analysis) or VDat (Visualizing Data); table or plot them in a way that Tukey/Cleveland didn’t.
Find some population data of interest to you (e.g., North Carolina, Forsyth County, your cat herd, etc.) and do several hand plots like those in Chapter 5 of EDA. Interpret results.
Find some data in the World Almanac and plot and/or table them.
Use some two-way data, and repeatedly extract the medians like Tukey does in Ch. 10 & 11.
Find some time series data, and smooth them in several different ways (see EDA, ch. 7). Data with seasonal patterns are especially interesting (see VDat, pp. 152-172).
Write an R, SAS-Graph, SPSS, BASIC, FORTRAN, C, JAVA, or other program to portray influence-enhanced scatter plots. Produce scatter plots of several relationships.
Write a BASIC, FORTRAN, C, SAS, SPSS, JAVA, or other program to portray scatter plots on a computer. Give the user the option to plot X and/or Y as either raw data, logs, squares, cubes, reciprocals, roots, etc.
Write an R, SAS-Graph, SPSS, BASIC, FORTRAN, C, JAVA or other program to produce some exotic version of stem-and-leaf diagrams.
Write an R, SAS-Graph, SPSS, BASIC, FORTRAN, C, JAVA, or other program to plot in three dimensions with time as one of the dimensions (i.e., a kinostatistical plot).
Use R or SAS-Graph or some other dedicated graphical package to plot some interesting data (preferably in color, possibly in 3D, maybe even in higher than 3D).
Write an R/SAS routine to do median smoothing by three, and use it on some data.
Write an R program, or SAS MACRO or SAS PROC or SAS program to produce some EDA output (but don’t duplicate what PROC UNIVARIATE already does).
Find an R program in the R library that does interesting EDA; apply it to some interesting data.
Produce a correlation matrix between many variables, and develop a scatter plot matrix from it.
Read the literature on graphical data analysis and develop some new graphical techniques. Program your techniques. Apply them to real data.
Invent a new EDA graphical application, and apply it to real data.
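To make one of the ideas above concrete, here is a minimal sketch of "median smoothing by three" (running medians of three). It is an illustration only, not a model answer. The function name `median_smooth3` and the sample data are hypothetical, and the end points are simply copied unchanged, which is a simplifying assumption rather than Tukey's end-point rule. Python is used purely for illustration; the same logic ports directly to R, where `stats::runmed()` and `stats::smooth()` provide built-in versions.

```python
from statistics import median


def median_smooth3(values):
    """Return the running median of three for a sequence of numbers.

    Assumption: the first and last values are copied unchanged;
    Tukey's end-point rule is NOT implemented here.
    """
    values = list(values)
    if len(values) < 3:
        # Too short to smooth: return the data as-is.
        return values
    smoothed = [values[0]]
    for i in range(1, len(values) - 1):
        # Replace each interior point with the median of itself
        # and its two immediate neighbours.
        smoothed.append(median(values[i - 1:i + 2]))
    smoothed.append(values[-1])
    return smoothed


# Made-up series with one spike (30) and one dip (2).
series = [4, 7, 30, 8, 9, 2, 10, 11]
print(median_smooth3(series))  # -> [4, 7, 8, 9, 8, 9, 10, 11]
```

The spike and the dip both disappear after one pass, which is the point of a median smoother: isolated outliers are removed without averaging them into their neighbours. A portfolio piece could compare this against repeated passes or against other smoothers on real data.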
Additional brainstorms and half-formed ideas to spark your creativity:
- Data Cleaning Project Using Lab Data
- Web Scraping Project
- Tidy Tuesday Project
- Data Innovation
- Recreate A Classic Visualization
- Interactive Project (R Shiny)
- Infographic
- Master’s Thesis / First Year Project
- Misleading Graph
- Impossible To Read
- Colorblind Friendly
- Visualization That Only Uses X Colors
- Animated/Video
- Tutorial
- Webscraper Data
- Digital Humanities Project
- Reproduce Findings From An Article In Your Content Area
- Machine Learn!
- Live Dashboard
- Maps/ Geospatial Things
- Lie To Me Graphic
0.7.2 Additional Ground Rules
In this class, I actively encourage you to “double-dip.” My aim is for you to incorporate your research into portfolio pieces and use these pieces in other places. However, I (Mason) have some ground rules. These rules exist for various reasons, but they primarily preserve my sanity by establishing some boundaries. As much as I adore this class, my students (especially you, dear reader!), and everything it stands for – it blurs many lines because I often serve numerous roles while teaching this class. In particular, I may also be your mentor, committee member, collaborator, letter of recommendation writer, colleague, confidant, statistics consultant, friend, or cat caretaker. I do my best to navigate these roles. However, for my sanity, here are my basic rules.
For anything that is related to a graduate school milestone (thesis, first-year project, major area paper) or something else that you’re working with your advisor on… you need to actively discuss it with your advisor and get their approval in advance.
They may not approve – and that’s okay. If they’re unsure, I’m very happy to schedule a quick 30-minute chat with all three of us. My role on anything milestone-related is to assist you with the implementation or give you general feedback. I cannot give you advice related to modeling or anything that would merit a discussion about authorship on the final work. I may disagree with something they recommend that you do… such as using a specific software that isn’t R. (I do have many strong opinions about SPSS and AMOS, as well as modeling choices. However, there is nearly always room for debate on these issues. I try to remind folks that my opinion is just that – an opinion.)
For all things related to your research at Wake Forest, please defer to your advisor. Everyone in the department is amazing at what they do. Seriously, each is an expert in their research area – just as I am an expert in R (and genetically informed designs and strange data…). In all likelihood, they know more about the modeling in their specific area. I am happy to share my expert opinions, and I have the privilege of facilitating such conversations through the nature of this class. HOWEVER, please defer to your advisor for all design decisions. I will do my best to tell you when something is outside the scope of the class or if we’re approaching a gray area. I will not always be successful at navigating this issue. However, I will do my best to do so because it is worth the extra challenge.