49 Data usually finds me

This document is based on my SAM talk on “I don’t go looking for Data … Data usually finds me,” from 2020.Here is a link to the slides of that talk.

49.1 I don’t go looking for Data … Data usually finds me

The most interesting aspects of my work (or at least to me) are the aspects related to finding data. However, this part is also the least documented. In my case, it primarily lives in footnotes, personal statements, and appendices. In the world of data science, finding interesting datasets is often more serendipitous than strategic. Let’s explore the various ways data can find us or be found. Fundamentally, data can be discovered through two major approaches: exploratory and confirmatory. The exploratory approach is descriptive and is data-driven. Data scientists often apply this approach. In contrast, the confirmatory approach is question driven, and aimed at testing specific hypotheses. Research scientists often apply this approach, but not always. Both approaches have their merits and challenges, and the methods of data acquisition can vary widely. Let’s delve into the adventure of data retrieval and discover the best places to look for datasets.

49.2 Two Major Approaches to Data Discovery

Do you start with the question or with the data?

49.2.1 The Exploratory Approach

This data-driven method is often favored by data scientists. It’s descriptive in nature, allowing the data itself to guide the questions and analyses. Most of my early experiences fall into this category, and they’re often the most interesting. In this approach, you start with the data and let it tell you what questions to ask. This can be a bit like a treasure hunt, where you’re not sure what you’ll find until you start digging. The data can be found in various ways, including referrals, reading, rumor, and random chance. We’ll explore these methods in more detail later.

But, how does data find you? Data can find you in numerous ways including referrals, reading, rumor, and random chance. A referral example would be when a speaker tells you about an intergenerational data set partially run by the BLS. You may also just stumble across it in your readings, such as when a historian using aspects of a marriage study. A rumor may inspire you, such as observing that a control group is mentioned in the original write-up of the Terman study (1921ish). Serendipity might lead you to fly to SPSP, talk to the person sitting next to you about a study you were in…

49.2.2 Confirmatory Approach to Archival Data

In contrast, this question-driven approach is aimed at testing specific hypotheses. It’s commonly used by research scientists, though not exclusively. Here, you start with specific questions that guide your data search. These could be related to theories, measures, subjects, models, replication, or external motivations.

Theory-based questions include questions like “Do smart girls delay sex? Measurement based questions can ask things like”Is Coding Speed from the ASVAB a decent proxy for conscientiousness?“. Questions about subjects include”Where can I find Twins Raised Apart?” Modeling questions can include things like How do I illustrate my dual mediated survival model? Replication: Can I replicate my finding in another sample? Externally-inspired questions can include things like Can I address reviewer two’s concern about reliability of difference scores?

These questions narrow your search… Otherwise the scope of data is overwhelming. The wonderful Kathy Shields helped me add a section to the WFU library website to get you started guides.zsr.wfu.edu/psychology. This is a great place to start your search for data.

Regardless of which approach you take, you’ll need to acquire data. Let’s look at the various ways this can happen.

49.3 The Data Acquisition Spectrum

Data acquisition methods generally fall into five categories:

  • Direct Download: The simplest method, where data is available in ready-to-use formats like CSV or Excel files.
  • Wrapped API Access: Using specialized tools or packages to access data repositories.
  • Raw API Interaction: Crafting custom queries for more specific data needs.
  • Web Scraping: Extracting data embedded in web pages.
  • Physical Retrieval: Sometimes, data isn’t digital and requires old-fashioned legwork.

Now, let’s explore how these methods play out in real-world scenarios.

49.3.1 How Data Finds You

In the exploratory approach, data can appear in your life through various means:

  • Referrals: A conference speaker might mention an intriguing dataset, leading you to directly download it from a repository.
  • Reading: While reviewing literature, you might stumble upon a study with online supplementary data, requiring web scraping to access.
  • Rumors: Colleagues might discuss an old study with valuable data, prompting you to track down physical records.
  • Random chance: A conversation at a conference could lead you to a researcher with access to unique datasets.

For the confirmatory approach, your specific questions guide your search through these acquisition methods. For example: - Investigating “Do smart girls delay sexual activity?” might lead you to directly download relevant datasets from the BLS. - Exploring ASVAB’s Coding Speed as a proxy for conscientiousness could involve using a wrapped API to access specific datasets. - Finding data on twins raised apart might require raw API interactions with multiple sources.

49.3.2 The Adventure of Data Retrieval

Sometimes, acquiring data involves multiple methods and unexpected challenges. Here’s a real-world example that spans the acquisition spectrum:

  • Apply for access to a “digitized” dataset on Dataverse (Direct Download attempt) for an econometrics class
  • Discover the dataset isn’t actually digitized, requiring physical retrieval and approval from the researchers.
  • Learn that the researchers who created it are hard to find.
    • One researcher (E. Lowell Kelly) died in 1986 and the other (James Conley) is nowhere to be found.
  • Track down a researcher who changed their name in the 1990s by
    • Teaming up with a 2nd-year assistant professor (Josh Jackson) to find Conley
    • Tracking down James Connolly, who legally changed his name in 1992 (or so) from James Conley
  • Convince the researcher to share the data.
  • Determine the data’s location and retrieve boxes from various archives and libraries:
    • Henry A. Murray Research Archive (Part of Harvard’s Dataverse)
    • Jim Connolly’s office
    • Bentley Library (Part of Michigan’s ICPSR) Retrieve boxes of data from various locations, including research archives and libraries.

49.3.3 Where to Look

Great places to start your data search (or let it find you) include: - ICPSR (http://icpsr.org): ~15,000 datasets - Harvard Dataverse (http://dataverse.harvard.edu/): ~95,000 datasets - Data.gov (https://catalog.data.gov/dataset): ~250,000 datasets - Other resources: OSF(http://osf.io/), Figshare(http://figshare.com/), Dryad (http://datadryad.org) and field-specific repositories - Nature has a great list of repositories: https://www.nature.com/sdata/policies/repositories - ZSR Guide

Remember, the more hurdles between you and the data, the less likely you are to be scooped! Whether you’re letting data find you or actively seeking it out, keep an open mind. The journey to finding the right dataset can be as illuminating as the analysis itself.