4 Gathering and Modeling Data

Up to this point, we have mostly focused on describing our data. What if we want to say more about the processes unfolding behind those data? This week, we look at data gathering and modeling. We will first examine some data collection methods, such as web scraping and APIs. We will then examine linear and non-linear models, their benefits, and some potential drawbacks.

Monday Readings:

Wednesday Readings:

4.1 New Ways of Gathering Data

Gathering data will always be important, but the tools that we use to gather data change. Here, we will look at some new ways of gathering data. We will look at new ways of surveying people and new types of data.

First, smartphones and digital surveys offer new possibilities for asking people questions. For example, Sugie used a probability sample to provide 133 incarcerated men with cell phones upon their release from prison. She was then able to administer daily surveys to these men, asking real-time questions about well-being and employment searches. From these data, she was able to describe how formerly incarcerated people look for work at a scale and in a level of detail that had not been possible before. These descriptions can, in turn, inform questions and policies around post-prison transitions in our society.

Figure 4.1: Source: Sugie, 2016.

In the plot above (from Sugie, 2016), we observe response rates for traditional interviews and cell phone interviews over a 12-week period. Notice the decline in responses for each, but also that cell phone interviews elicit higher response rates throughout the study period than traditional methods do.

New tools also offer new types of data to gather. Researchers have long studied gentrification and debated the extent to which it occurs, who it displaces, and so on. However, they have lacked empirical tools for measuring the processes that we typically associate with gentrification, like changes in the urban built environment. To address this, Hwang measured neighborhood change using Google Street View (GSV). By comparing GSV images across different time points, she was able to directly measure gentrification over time. The study yields insights into how and where gentrification unfolds: she finds that neighborhoods with larger Black and Hispanic populations are less likely to gentrify, a relationship that levels off once a neighborhood's Black population reaches 40% or more.

Figure 4.2: Source: Hwang, 2014

In the plot above (from Hwang, 2014), estimated gentrification in Chicago is shown both from an older study (Hammel and Wyly, 1996, who used field surveys and census data) and for 2007-09 using the GSV data. Notice the similarities and differences. What types of changes do you think the GSV data would pick up on (or omit)?

4.2 Application Programming Interfaces (APIs)

Application Programming Interfaces, or APIs, are a relatively new and increasingly common way to gather data in the age of big data. We’ve actually already used an API in this class: gtrends!

Some APIs allow us to gather data that has already been collected via traditional survey or census methods. For instance, the tidycensus package wraps the Census Bureau’s API, allowing us to easily pull data from the Decennial Census, American Community Survey, and more into R for analysis. This is not a new source of data; researchers have long studied the census to understand social issues. However, the API allows us to streamline data gathering, analysis, and even writing our manuscript (if we use markdown).

One critical first step in working with tidycensus is obtaining a Census Data API key. You can get one here. Once you have a key, you can load tidycensus and tidyverse with library() (the tidyverse is just a collection of tidy packages, some of which we’ve worked with already, like tidyr and dplyr).

library(tidycensus)
library(tidyverse)

census_api_key("YOUR API KEY GOES HERE")

Now we can gather data! For example, we could use the decennial census (every 10 years) to get information on median age by state:

# gather age data
age20 <- get_decennial(geography = "state", 
                       variables = "P13_001N", 
                       year = 2020,
                       sumfile = "dhc")

head(age20)
## # A tibble: 6 × 4
##   GEOID NAME                 variable value
##   <chr> <chr>                <chr>    <dbl>
## 1 09    Connecticut          P13_001N  41.1
## 2 10    Delaware             P13_001N  41.1
## 3 11    District of Columbia P13_001N  33.9
## 4 12    Florida              P13_001N  43  
## 5 13    Georgia              P13_001N  37.5
## 6 15    Hawaii               P13_001N  40.8

To see what other variables exist in the get_decennial or get_acs functions, you can download the codebook for the 2020 demographic and housing characteristic file with the following code:

# 2020 dhc codebook
v20 <- load_variables(2020, "dhc")

To see what other codebooks are available, try ?load_variables!
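Because load_variables() returns a regular tibble (with name, label, and concept columns), one way to find a variable on a specific topic is to filter the codebook. Here is a quick sketch, assuming the search string roughly matches how the label is written in the file:

# search the 2020 dhc codebook for median age variables
v20 %>%
  filter(str_detect(label, "Median age"))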

We can now explore the age data that we downloaded. For instance, we could use ggplot() to show the following:

age20 %>%
  ggplot(aes(x = value, y = reorder(NAME, value))) + 
  geom_point()

Notice how NAME was reordered according to its value within aes(). Next, let’s turn to the American Community Survey. We can use the following code to examine the variables available in the 5-year ACS (i.e. estimates pooled over a 5-year sampling window):

# 2023 5-year ACS codebook
v23acs <- load_variables(2023, "acs5")

There are a lot of variables to sort through! If we wanted to gather “Median Household Income in the Past 12 Months (in 2023 Inflation-Adjusted Dollars)” we could use “B19013_001.” Below, this is done for the state of Vermont:

vt <- get_acs(geography = "county", 
              variables = c(medincome = "B19013_001"), 
              state = "VT", 
              year = 2023)

vt
## # A tibble: 14 × 5
##    GEOID NAME                       variable  estimate   moe
##    <chr> <chr>                      <chr>        <dbl> <dbl>
##  1 50001 Addison County, Vermont    medincome    88478  3554
##  2 50003 Bennington County, Vermont medincome    71494  4402
##  3 50005 Caledonia County, Vermont  medincome    66075  2723
##  4 50007 Chittenden County, Vermont medincome    94310  2768
##  5 50009 Essex County, Vermont      medincome    58985  3546
##  6 50011 Franklin County, Vermont   medincome    79078  5479
##  7 50013 Grand Isle County, Vermont medincome    90625  9258
##  8 50015 Lamoille County, Vermont   medincome    69897  6220
##  9 50017 Orange County, Vermont     medincome    77328  3918
## 10 50019 Orleans County, Vermont    medincome    66426  3655
## 11 50021 Rutland County, Vermont    medincome    64778  2933
## 12 50023 Washington County, Vermont medincome    79853  2906
## 13 50025 Windham County, Vermont    medincome    68021  4236
## 14 50027 Windsor County, Vermont    medincome    75247  3153

And of course, we can plot this!

vt %>%
  mutate(NAME = gsub(" County, Vermont", "", NAME)) %>%
  ggplot(aes(x = estimate, y = reorder(NAME, estimate))) +
  geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) +
  geom_point(color = "red", size = 3) +
  labs(title = "Household income by county in Vermont",
       subtitle = "2017-2021 American Community Survey",
       y = "",
       x = "ACS estimate (bars represent margin of error)")

4.3 Web Scraping

Last class, we looked at demographic data on California’s ethnic/racial compositions from Wikipedia. However, we didn’t discuss how this data was actually gathered.

One method for gathering such data is web scraping. In the “wild west” days of the internet, sites were fairly disconnected and scraping was an acceptable way to gather information from them. Today, many sites disallow web scraping and offer APIs as alternatives. Therefore, it is important to check that you are not violating a site’s terms of service when scraping it. You can check a site’s rules by adding “/robots.txt” to the end of its URL. For example, if we go to https://en.wikipedia.org/robots.txt we see that some crawlers and bots have been banned for going too fast or for other behavior, but there are no blanket restrictions on scraping.
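If you would rather check from inside R, the robotstxt package (a separate package you would need to install; not one we use elsewhere in this class) can parse a site’s robots.txt for you. A minimal sketch:

library(robotstxt)

# returns TRUE if general-purpose bots are allowed to fetch this page
paths_allowed("https://en.wikipedia.org/wiki/Incarceration_in_the_United_States")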

To try scraping a Wikipedia page, we will first download and library the rvest package.

library(rvest)

Great! Now we can try reading an HTML file (the language behind a website, which tells your browser the meaning and structure of the site) into R. It will work best for us to read in a Wikipedia page with a table. Let’s try a different Wikipedia page, this time on US incarceration over time.

# read in html
us_inc <- read_html("https://en.wikipedia.org/wiki/Incarceration_in_the_United_States")

# take a look
us_inc
## {html_document}
## <html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-menu-disabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available skin-theme-clientpref-thumb-standard" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ...

Hmmmm, it’s not really in a readable format. But that’s ok! In order to extract our table of interest, we just need to gather a little more information:

  1. Go to the page of interest and find the table that we want to extract
  2. Right-click the table and click Inspect
  3. Find the <table… element of interest in the Elements pane. When you hover over this text, it should highlight the entire table in the page
  4. Right-click the table element, then Copy -> Copy XPath

Great, now that we have the XPath copied, we can use rvest’s html_nodes() function to extract just the html of interest.

# extract nodes of interest
us_inc_table <- html_nodes(us_inc,
                            xpath = '//*[@id="mw-content-text"]/div[2]/div[5]/table')

It is still in html format, but there is another function, html_table(), which will convert this to a dataframe. Let’s try it!

library(dplyr)
library(magrittr)

# convert object to table
us_inc_table %<>%
  html_table()

# take another look
us_inc_table %>%
  head()
## [[1]]
## # A tibble: 20 × 3
##     Year Count      Rate
##    <int> <chr>     <int>
##  1  1940 264,834     201
##  2  1950 264,620     176
##  3  1960 346,015     193
##  4  1970 328,020     161
##  5  1980 503,586     220
##  6  1985 744,208     311
##  7  1990 1,148,702   457
##  8  1995 1,585,586   592
##  9  2000 1,937,482   683
## 10  2002 2,033,022   703
## 11  2004 2,135,335   725
## 12  2006 2,258,792   752
## 13  2008 2,307,504   755
## 14  2010 2,270,142   731
## 15  2012 2,228,424   707
## 16  2014 2,217,947   693
## 17  2016 2,157,800   666
## 18  2018 2,102,400   642
## 19  2020 1,675,400   505
## 20  2021 1,767,200   531

This looks better. But notice the [[1]]? This means that our table is still technically in list format, and that this is the first element of a list. We will change that below with the as.data.frame() function.

Also, you might be wondering, what is a tibble? According to CRAN, “Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating.” For our purposes, we can treat them just like dataframes.

# convert list to dataframe
us_inc_table %<>%
  as.data.frame()

# take another look
us_inc_table %>% 
  head()
##   Year   Count Rate
## 1 1940 264,834  201
## 2 1950 264,620  176
## 3 1960 346,015  193
## 4 1970 328,020  161
## 5 1980 503,586  220
## 6 1985 744,208  311

Ok perfect, now we have a true dataframe. In order to look at our data, we might want to clean it up using functions like filter() or slice(). If we had additional data, we could add these with bind_rows()!
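As a quick sketch of what that cleanup might look like (using the dplyr functions just mentioned), we could keep only more recent years or pull out specific rows by position:

# keep only rows from 1980 onward
us_inc_table %>%
  filter(Year >= 1980)

# or select the first five rows by position
us_inc_table %>%
  slice(1:5)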

The first thing we want to do when we gather new data is look at it. We can do this either by visually inspecting the data (using the View() function) or by plotting it (using some of the methods from section 3). We can start by plotting the data with ggplot() and geom_point().

library(ggplot2)

# plot incarceration data
ggplot(us_inc_table, aes(x = Year, y = Count))+
  geom_point()

Hmmm, something doesn’t seem quite right … Any guesses? If we inspect our data further, we notice that not all of our variables have a numeric class(); in particular, Count came in as character.

# examine class of variables
class(us_inc_table$Year)
## [1] "integer"
class(us_inc_table$Count)
## [1] "character"
class(us_inc_table$Rate)
## [1] "integer"

In the problem set, you will use mutate() and parse_number() to solve this problem! For now, let’s look at a different variable: Rate. We can look at the relationship between year and incarceration rate.

library(ggplot2)

# plot incarceration data
ggplot(us_inc_table, aes(x = Year, y = Rate))+
  geom_point()

This looks a bit better (at least, in the data sense). Now that we have a plot, what can we say about it?

4.4 Models

From what we have seen so far, we can answer descriptive questions about mass incarceration. But we may also want to explain or predict changes. In other words, an initial question as we look at our data might be “what direction is the trend?” Are incarceration rates going up, down, or remaining stable? We don’t need to do much analysis here to see that the trend is positive (i.e. rates are going up over time). But if we quantify this, we can answer questions like “if the trend keeps going at this rate, where will we be in 10 years?”

As social scientists, we tend to make three types of claims about our data. Here are examples of each:

  1. Describe: In the U.S., between 1970 and 2010, the incarceration rate increased from 161 to 731.
  2. Explain: As time passes, modern societies come to rely increasingly on carceral solutions to social problems, and incarceration rates increase.
  3. Predict: In 2030, the U.S. incarceration rate will reach 1,000 (persons per 100,000).

For the latter two claims, we are going to need a model of how our data works. Let’s start with a simple example: in our incarceration data, we have just one data point per year. The simplest option is to run a line through our points, which geom_line() does for us:

library(ggplot2)

ggplot(us_inc_table, aes(x = Year, y = Rate))+
  geom_point()+
  geom_line()

But is this useful? Well, yes and no. It is useful in describing the data. But it doesn’t really simplify what is going on in any way. We might want a model that gives a single value for the slope (i.e. the amount that our points are increasing or decreasing over time).

We can quantify the trend using linear regression. If we just had two data points, we would run a line through them. However, we have a lot more than two data points. So what do we do?

If you recall from prior math classes, we can express a line with a formula like y = mx + b, where m is the slope and b is the intercept. Our goal is to find a single line of this form that runs through our data as closely as possible.

Most data points will not fall directly on this line. The vertical distance between each data point and the line is called its residual. When choosing values for m and b, we choose the values that minimize the sum of the squared residuals; the line that does this is called the line of “best fit.” The R-squared value tells us how well our line actually fits our data (i.e. the proportion of variation that is explained by our model).
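To see these quantities directly for our data, we can use base R’s lm() function (the tidymodels approach appears in section 4.5). This is just a sketch of the idea:

# fit the line of best fit: Rate as a function of Year
fit <- lm(Rate ~ Year, data = us_inc_table)

# the intercept (b) and slope (m)
coef(fit)

# the R-squared value (proportion of variation explained)
summary(fit)$r.squared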

We can add this linear model directly to our plot of the incarceration data with the following code:

# add linear model
ggplot(us_inc_table, aes(x = Year, y = Rate))+
  geom_point()+
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

Notice we used method = "lm" for “linear model.” If we remove this, the default option is a non-linear model, in this case, a “loess” model or local polynomial model. You might recall polynomials like quadratic and cubic models, which use squared terms or cubed terms to estimate curves. For more information on this method, you can run ?loess. What do you notice about the fit of this model, compared to the linear model?

library(ggplot2)

ggplot(us_inc_table, aes(x = Year, y = Rate))+
  geom_point()+
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Our linear and non-linear models will give us different predictions for the future. For example, we can extend the linear model beyond our data to see where it estimates incarceration rates will head over the coming decades (here, through 2040):

# linear model
ggplot(us_inc_table, aes(x = Year, y = Rate))+
  geom_point()+
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE)+
  lims(x = c(1940, 2040))

How well do you think this estimates future trends?
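For comparison, here is a sketch of how a curved fit extrapolates over the same range, using a quadratic (polynomial) model like those mentioned above:

# quadratic (polynomial) model extended over the same range
ggplot(us_inc_table, aes(x = Year, y = Rate))+
  geom_point()+
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, fullrange = TRUE)+
  lims(x = c(1940, 2040))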

4.5 Building a Model

To understand our data, we often want to build a model. As with many things in R, there are multiple ways to do this. We will build models with tidymodels, which offers a succinct way to split data, fit models, and make predictions. We will try it out with the linear example from above.

4.5.1 Splitting the Data

Historically, social scientists have focused more on explaining trends than predicting them. For this reason, they have mostly used all of their data to build models.

Data scientists, on the other hand, take a different approach to building models and predicting the future. The data scientist begins by dividing their data into training and test sets. It is important that this division is random. We can do this with the tidymodels R package (you may also have to install GPfit or another dependency; R will tell you if this is the case).

library(tidymodels)

# split data
splits <- initial_split(us_inc_table)

Now we can create “training” and “test” datasets.

df_train <- training(splits)

df_test <- testing(splits)
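One thing to keep in mind: initial_split() chooses rows at random, so your training and test sets (and everything computed from them) will differ from run to run. The output in the rest of this section reflects one particular random split. If you want your split to be reproducible, a sketch is to set a seed first:

# set a seed so the random split is the same each time the code is run
set.seed(1234)

splits <- initial_split(us_inc_table, prop = 0.75)
df_train <- training(splits)
df_test <- testing(splits)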

For our purposes, we will fit our model with these training and test sets. But depending on your goals, you may not need to split your data if you are not interested in prediction!

4.5.2 Fitting a Model

In our linear model above, we represent incarceration trends with something like the following:

\(Y_i = \beta_0 + \beta_1X_i\)

Where \(Y_i\) is our outcome variable (for example, the incarceration rate) at year \(i\), \(\beta_0\) is the value of our intercept, and \(\beta_1\) is the value of our slope.

There are three general steps to building tidy models (from Kuhn and Silge, 2023):

  1. Specify the type of model based on its mathematical structure (e.g., linear regression, random forest, KNN, etc.).
  2. Specify the engine for fitting the model. Most often this reflects the software package that should be used, like Stan or glmnet. These are models in their own right, and parsnip provides consistent interfaces by using these as engines for modeling.
  3. When required, declare the mode of the model. The mode reflects the type of prediction outcome. For numeric outcomes, the mode is regression; for qualitative outcomes, it is classification. If a model algorithm can only address one type of prediction outcome, such as linear regression, the mode is already set.

We can specify the model and the engine as follows:

library(tidymodels)

# specify the model engine
lm_model <- 
  linear_reg() %>% 
  set_engine("lm")

lm_form_fit <- lm_model %>% 
  fit(Rate ~ Year, data = df_train)

Try running lm_form_fit! What do you make of the results?

# look at results
lm_form_fit
## parsnip model object
## 
## 
## Call:
## stats::lm(formula = Rate ~ Year, data = data)
## 
## Coefficients:
## (Intercept)         Year  
##  -14188.612        7.365
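If you want more than the printed coefficients, tidymodels provides a couple of ways to inspect the fit. This is a sketch: tidy() returns the coefficients as a tibble, and extract_fit_engine() pulls out the underlying lm object so you can use familiar functions like summary():

# coefficients (and standard errors) as a tidy tibble
tidy(lm_form_fit)

# or extract the underlying lm object and summarize it
lm_form_fit %>%
  extract_fit_engine() %>%
  summary()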

4.5.3 Making Predictions

We might be interested in making predictions from our data. The good news is that doing so is easy with tidymodels:

# examine predictions
predict(lm_form_fit, new_data = df_test)
## # A tibble: 5 × 1
##   .pred
##   <dbl>
## 1  248.
## 2  557.
## 3  601.
## 4  631.
## 5  675.

Notice that we used df_test here to make predictions on our test dataset (not the data that we trained the model with). We could compare these to the true values for these points:

# look at true values
df_test
##   Year     Count Rate
## 1 1960   346,015  193
## 2 2002 2,033,022  703
## 3 2008 2,307,504  755
## 4 2012 2,228,424  707
## 5 2018 2,102,400  642
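One way to make this comparison concrete is to bind the predictions to the observed test data and compute an error summary. This is a sketch using bind_cols() and rmse() from the yardstick package (loaded as part of tidymodels):

# put predictions and observed values side by side
test_results <- predict(lm_form_fit, new_data = df_test) %>%
  bind_cols(df_test)

# root mean squared error of the predictions on the test set
test_results %>%
  rmse(truth = Rate, estimate = .pred)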

However, we might want to predict beyond the years in our data. We could create a new dataframe of out-of-sample points for future years! Let’s try this for 2022-2030.

# create evaluation dataset
df_eval <- data.frame(Year = c(2022:2030),
                      Count = NA,
                      Rate = NA)

Now, we can make predictions for this new period. Do they seem plausible to you? Why or why not?

# examine predictions
predict(lm_form_fit, new_data = df_eval)
## # A tibble: 9 × 1
##   .pred
##   <dbl>
## 1  704.
## 2  712.
## 3  719.
## 4  726.
## 5  734.
## 6  741.
## 7  748.
## 8  756.
## 9  763.
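One way to judge plausibility is to plot these out-of-sample predictions alongside the observed data. A quick sketch (coloring the predicted years blue):

# attach predicted rates to the evaluation years
df_eval <- df_eval %>%
  mutate(Rate = predict(lm_form_fit, new_data = df_eval)$.pred)

# observed rates (black points) and predicted rates for 2022-2030 (blue points)
ggplot(us_inc_table, aes(x = Year, y = Rate))+
  geom_point()+
  geom_point(data = df_eval, color = "blue")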

4.6 Final Project Proposal

Final projects can be completed in groups of up to four students. The proposals can also be collaborative, but each student should turn in one.

Proposals should answer the following questions:

  1. What is your research question? Why does it matter? How does it relate to society?

  2. What data will you use to answer the question? Do you have access to these data?

  • The class list that we made together in Problem Set 2 is available below.
  • Ideally, you are able to show access to at least a portion of the data that you will need.
  • Beyond the APIs and other data sources we have used in class, there are some great options on Kaggle.
  • Consider conducting a survey! All Our Ideas is one tool that could be useful for building such a survey.
  • See an example project on wildfires from a similar class here.
  3. What methods will you use to answer your research question?
  • Most projects will use some of the methods that we haven’t yet covered. This is great, but show that you have an idea of where you are heading!
  4. What kind of product will you create? How do you want others to read or engage with your project?
  • In most cases, projects will be R Markdown documents or Jupyter Notebooks, both of which can be published as websites.

The class list of ideas for data sources:

• All Our Ideas (custom made; https://allourideas.org/). All Our Ideas is a wiki survey that asks participants to compare two items at a time (e.g. which is better, Stanford or Cal?). The respondents’ answers are then aggregated so that each answer has a win percentage, and the surveyor can rank a large number of items. Respondents can also add their own ideas (answers) to the survey. This is useful for surveys on large or ambiguous questions, where the researcher might not know the best answers and it may be hard to compare them all at once, but where there is a definitive underlying ranking.

• US Census Data (custom made; https://data.census.gov/). data.census.gov is the U.S. Census Bureau’s platform for exploring and downloading data from the decennial census, the American Community Survey, and other Census Bureau surveys.

• Medicare’s priciest drugs (custom made; https://www.cms.gov/data-research/statistics-trends-and-reports/cms-drug-spending). CMS has released several information products that provide greater transparency on spending for drugs in the Medicare and Medicaid programs.

• ABC News Polls (ready made; https://www.langerresearch.com/category/abc-news-polls/). ABC News Polls is a publicly available archive produced by Langer Research Associates. These surveys capture public opinion polling for the ABC News Television Network from 2010-2025, covering a variety of topics from sentiment on immigration to political figures, tariffs, and more.

• General Social Survey 2024 (ready made; https://gss.norc.org/get-the-data.html). The GSS has been a reliable source of data to help researchers, students, and journalists monitor and explain trends in American behaviors, demographics, and opinions. The complete GSS data set is available on this site, along with the GSS Data Explorer for exploring, analyzing, extracting, and sharing custom sets of GSS data. The 2024 GSS Cross-section continues the multi-mode design and adds several other new features; users are encouraged to review the documentation before performing any analysis.

• IPUMS USA (ready made; https://usa.ipums.org/usa/). IPUMS USA provides U.S. census data for social, economic, and health research. It collects, preserves, and harmonizes U.S. census microdata and provides easy access to these data with enhanced documentation. Data include decennial censuses from 1790 to 2010 and American Community Surveys (ACS) from 2000 to the present.

• Education and Employment (not custom made; https://www.bls.gov/emp/tables/unemployment-earnings-education.htm). Education and Employment is a data set from the Bureau of Labor Statistics outlining unemployment and median earnings by educational level. As expected, those with advanced degrees (PhD, master’s, etc.) have the lowest unemployment rates and highest median earnings, while those with little educational experience report the opposite. This data is useful for establishing a potential relationship between education and employment outcomes like wages and salary.

• Consumer Expenditure Survey (custom made; https://www.bls.gov/cex/). Consumer Expenditure Surveys provide data on expenditures, income, and demographic characteristics of consumers in the United States. It is the only federal household survey to provide data on the full range of consumers’ expenditures and incomes, and it can be used for poverty and consumer research.

• Roper Center (custom made; https://ropercenter.cornell.edu/). The Roper Center for Public Opinion Research uses survey and polling data to gather information about the public’s uncensored views of different policies, discussions, and broad topics. It is possible to register as a data provider with the Roper Center, meaning that the center also accepts data from a shared collective.

• Hate Crime Reported in the US (custom made; https://cde.ucr.cjis.gov/LATEST/webapp/#/pages/explorer/crime/hate-crime). Hate Crime Reported in the US provides data on the number of hate crime offenses and incidents in the United States and the percent of hate crime relative to the total US population. Different bias categories are also included, and the time span examined can range from 3 months to 10 years.

• Pew Research Center (ready made; https://www.pewresearch.org/). Pew Research Center is a nonprofit research organization with a dedicated staff of researchers and experts in a variety of fields. Pew primarily develops its datasets through surveys; the subjects of those surveys are largely based on current events, the questions people are asking, and gaps in reliable data. It is widely regarded as a reliable, non-partisan resource for social science and aims to uphold rigorous research standards.

• Pew Research Center (previously collected; https://www.pewresearch.org/datasets/). Pew Research Center collects data on a wide range of topics, from healthcare and AI to politics and social opinions, primarily via surveys. These are large-scale datasets, ranging from surveys of the entire United States to international surveys. This source is free, covers a wide variety of social science topics, and is relevant for understanding human behavior and experiences in a social context.

• Education and Income (Census) (not custom; https://www.census.gov/library/stories/2025/09/education-and-income.html). Education and Income is a U.S. Census Bureau story that shows how income varies by educational level. It demonstrates that higher levels of education are associated with higher earnings and with differences in income outcomes across the population.

• Google Trends (ready made; https://trends.google.com/). Google Trends provides data on how often specific search terms are entered into Google over time, allowing researchers to analyze patterns in public interest and behavior. The platform aggregates search data and normalizes it so that trends can be compared across different time periods and geographic regions. Researchers can use this tool to track changes in attention, identify spikes in interest around events, and compare the popularity of multiple topics. This is especially useful for studying real-time social trends, cultural shifts, and public reactions to news or events, although it only captures behavior from people who use Google and may not fully represent the entire population.

• Wikipedia Pageviews (ready made; https://pageviews.wmcloud.org/). Wikipedia Pageviews offers a number of tools that allow the public to view and compare statistics for Wikipedia articles. These statistics include the number of page visits, the number of edits made, and the number of visits to an article in different languages, among others. This data can be used for analyses similar to Google Trends, but with an emphasis on summary articles rather than general Google searches.

• Google Trends (ready made; https://trends.google.com/trends/). Google Trends provides data on search behavior over time, allowing researchers to analyze public interest and trends.

4.7 Problem Set 4

Recommended Resources: Tidy Modeling With R

  1. Start a new document, problemset4.Rmd, in your soc10problemsets repository. (You can download a template here, but be sure to save it in your own soc10problemsets repo). In the chunk below, show how you would gather data on California’s ethnoracial demographics for 2020 using tidycensus. (Note: in the chunk below, eval is set to false, so you won’t see any results. That is ok! Just show the code. And remove your personal key). Commit your changes.

  2. In Bit by Bit, Salganik describes several possibilities and challenges of survey data in the digital age. Do you think any of these have changed since 2017 (when the book was written)? Pick a new type of survey or data collection, and describe its current possibilities and limitations. Commit your changes.

  3. Gather the data from this Wikipedia page on incarceration in the US (you can use the code from Section 4 of the course website to gather this). Use mutate(), parse_number() from the readr package, and the magrittr pipe (%<>%) to fix the issue with the Count variable. Then plot the Count variable over time with ggplot(). Commit your changes.

  4. Now model it! Add a linear model of incarceration over time to your plot. (Note: If you are unable to plot question 3 correctly, you can plot Rate as the outcome for this question). Commit your changes.

  5. Would a linear or non-linear model be more appropriate for explaining and predicting trends in US society and incarceration? Explain the benefits and drawbacks of each. Commit your changes.