After you finish the homework, please complete the following (short, anonymous) post-homework survey: https://forms.gle/AM1x5qEnLCvxsgrJ7
We have marked questions in blue. Please put answers in black (do not change colors). You'll want to write text answers in "markdown" mode instead of code. In Jupyter notebook, you can go to Cell > Cell Type > Markdown from the menu. Please carefully read the late days policy and grading procedure here. In that link, we also give some tips on exporting your notebook to PDF, which is required for GradeScope submission.
A few notes about this homework:
Please read Sections 3 and 4 (pages 6-13) here: https://www.nber.org/system/files/working_papers/w20830/w20830.pdf, and answer the following questions.
Please summarize the sections in no more than two sentences.
Do you think it's a problem that most ratings are positive? If so, why? Answer in no more than four sentences. Please incorporate concepts discussed in class in your answer.
Think back to a time when you trained a model on data from people or gathered opinions via a survey (an informal one is fine). If you have not done that before, you may answer these questions about an article in the news that reported on public opinions, or about a model that you think might be in deployment at a company or organization with which you interact (for example, Amazon, Google Maps, etc.).
Briefly summarize the scenario in no more than two sentences.
What was the construct that you cared about/wanted to measure? What was the measurement (numerical data)? In what ways did the measurement not match the construct you cared about? Answer in no more than 4 sentences.
What selection biases/differential non-response issues occurred and how did it affect your measurement? (If your answer is "None," explain exactly why you believe the assumptions discussed in class were met). Answer in no more than 3 sentences.
Given what we have learned in class so far, what would you do differently if faced with the same scenario again? Answer in no more than 3 sentences.
In this part of the homework, we provide you with data from a poll in Florida before the 2016 Presidential election in the United States. We also provide you with (one pollster's) estimates of who will vote in the 2016 election, made before the election. You will use this data and apply the weighting techniques covered in class.
import pandas as pd
import numpy as np
dfpoll = pd.read_csv('polling_data_hw1.csv') # raw polling data
dfpoll.head()
| | candidate | age | gender | party | race | education |
|---|---|---|---|---|---|---|
| 0 | Someone else | 30-44 | Male | Independent | White | College |
| 1 | Hillary Clinton | 45-64 | Male | Republican | Hispanic | College |
| 2 | Hillary Clinton | 30-44 | Male | Independent | Hispanic | College |
| 3 | Hillary Clinton | 65+ | Female | Democrat | White | College |
| 4 | Donald Trump | 65+ | Female | Republican | White | High School |
dfdemographic = pd.read_csv('florida_proportions_hw1.csv') # proportions of population
dfdemographic.head()
| | Electoral_Proportion | Demographic_Type_1 | Demographic_Type_2 | Demographic_1 | Demographic_2 |
|---|---|---|---|---|---|
| 0 | 0.387927 | party | NaN | Democrat | NaN |
| 1 | 0.398788 | party | NaN | Republican | NaN |
| 2 | 0.213285 | party | NaN | Independent | NaN |
| 3 | 0.445928 | gender | NaN | Male | NaN |
| 4 | 0.554072 | gender | NaN | Female | NaN |
dfdemographic.tail()
| | Electoral_Proportion | Demographic_Type_1 | Demographic_Type_2 | Demographic_1 | Demographic_2 |
|---|---|---|---|---|---|
| 112 | 0.034216 | race | education | Hispanic | Some College |
| 113 | 0.027588 | race | education | Hispanic | College |
| 114 | 0.010929 | race | education | Other | High School |
| 115 | 0.010570 | race | education | Other | Some College |
| 116 | 0.015142 | race | education | Other | College |
dfdemographic contains estimates of likely voters in Florida in 2016. When Demographic_Type_2 is NaN, the row refers to just the marginal population percentage of the group in Demographic_1 of type Demographic_Type_1. When it is not NaN, the row has the joint distribution of the corresponding demographic groups.
For example, row 0 means that 38.7927% of the electorate is from the Democrat party. Row 113 means that 2.7588% of the electorate is Hispanic AND graduated college.
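The marginal/joint layout described above can be separated with two small helpers. This is only a sketch on a made-up stand-in table: `toy_demo`, `marginal_rows`, and `joint_rows` are hypothetical names, the proportions are invented, and the helpers assume a pair is stored in the order (`Demographic_Type_1`, `Demographic_Type_2`), which you should check against the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for dfdemographic; the proportions are invented for illustration.
toy_demo = pd.DataFrame({
    "Electoral_Proportion": [0.4, 0.6, 0.1, 0.3, 0.2, 0.4],
    "Demographic_Type_1": ["gender"] * 6,
    "Demographic_Type_2": [np.nan, np.nan] + ["education"] * 4,
    "Demographic_1": ["Male", "Female", "Male", "Male", "Female", "Female"],
    "Demographic_2": [np.nan, np.nan, "High School", "College", "High School", "College"],
})


def marginal_rows(demo, dtype):
    """Marginal shares: rows of the given type whose Demographic_Type_2 is NaN."""
    mask = (demo["Demographic_Type_1"] == dtype) & demo["Demographic_Type_2"].isna()
    return demo.loc[mask].set_index("Demographic_1")["Electoral_Proportion"]


def joint_rows(demo, type1, type2):
    """Joint shares: rows where both demographic types are filled in."""
    mask = (demo["Demographic_Type_1"] == type1) & (demo["Demographic_Type_2"] == type2)
    return demo.loc[mask].set_index(["Demographic_1", "Demographic_2"])["Electoral_Proportion"]


gender_marginal = marginal_rows(toy_demo, "gender")
gender_edu_joint = joint_rows(toy_demo, "gender", "education")
print(gender_marginal)
print(gender_edu_joint)
```

In the toy table the joint cells for "Male" add up to the "Male" marginal, which is a quick sanity check worth repeating on the real data.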
Here, we'll visualize whether the respondents in the poll match the likely voter estimates. Create a scatter-plot where each point represents one Demographic group (for example, party-Independent), where the X axis is the Electoral_Proportion in dfdemographic, and the Y axis is the proportion in dfpoll.
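One possible way to line up the two proportions before plotting is sketched below. The data is a toy example (five invented respondents and invented electorate shares), and the `ratio` column is just one of several reasonable definitions of over-representation.

```python
import pandas as pd

# Toy poll and toy electorate shares; all values are invented for illustration.
toy_poll = pd.DataFrame(
    {"party": ["Democrat", "Democrat", "Republican", "Independent", "Democrat"]}
)
toy_electorate = pd.Series(
    {"Democrat": 0.4, "Republican": 0.4, "Independent": 0.2}, name="electorate_share"
)

# Share of poll respondents in each group.
poll_share = toy_poll["party"].value_counts(normalize=True).rename("poll_share")

# Align both shares on the group label, then compare them.
comparison = pd.concat([toy_electorate, poll_share], axis=1)
comparison["ratio"] = comparison["poll_share"] / comparison["electorate_share"]
print(comparison)
```

With matplotlib installed, `comparison.plot.scatter(x="electorate_share", y="poll_share")` would draw the points; groups above the diagonal are over-represented in the poll under this definition.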
In your view, which group is most over-represented? Most under-represented? Why? Answer in no more than 3 sentences. There are multiple reasonable definitions of "over" or "under" represented; any choice is fine as long as you justify your answer.
For this question, we'll ignore people who answered anything but "Hillary Clinton" or "Donald Trump."
You'll notice that some of the groups in the polling data ("refused") do not show up in the population percentages. For the questions that require weighting by demographics, ignore those respondents.
Below, report the "raw polling average": the number of people who answered "Hillary Clinton" divided by the number who answered either "Hillary Clinton" or "Donald Trump."
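As a minimal sketch of this two-way share, on an invented six-row poll (the real computation would use `dfpoll` instead of `toy_poll`):

```python
import pandas as pd

# Toy poll; the responses are invented for illustration.
toy_poll = pd.DataFrame({
    "candidate": [
        "Hillary Clinton", "Donald Trump", "Someone else",
        "Hillary Clinton", "Donald Trump", "Hillary Clinton",
    ]
})

# Keep only respondents who chose one of the two major candidates.
two_way = toy_poll[toy_poll["candidate"].isin(["Hillary Clinton", "Donald Trump"])]

# Clinton's share of the two-way vote: 3 of the 5 remaining respondents.
raw_clinton_share = (two_way["candidate"] == "Hillary Clinton").mean()
print(f"Raw --- Clinton: {raw_clinton_share:.3f}, Trump: {1 - raw_clinton_share:.3f}")
```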
For each demographic type separately -- age, gender, party, race, and education -- weight the poll by just that demographic type, in accordance with the population proportions given. Report the resulting poll results, and briefly (at most 3 sentences) describe what you observe.
For example, when weighted by race, you'll report:
Weighted by race --- Clinton: 0.530, Trump: 0.470
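Weighting by one demographic type amounts to taking the Clinton share within each group and averaging those shares with the population proportions as weights. The sketch below uses an invented poll and invented shares; `weight_by_one` is a hypothetical helper, and dropping groups that are absent from the population table (such as "Refused") and renormalizing the remaining weights is one reasonable reading of the instructions, not the only one.

```python
import pandas as pd

# Toy two-candidate poll and toy population shares (invented values).
toy_poll = pd.DataFrame({
    "candidate": ["Hillary Clinton", "Hillary Clinton", "Donald Trump",
                  "Hillary Clinton", "Donald Trump", "Donald Trump"],
    "gender": ["Female", "Female", "Female", "Male", "Male", "Refused"],
})
toy_share = pd.Series({"Female": 0.5, "Male": 0.5})


def weight_by_one(poll, column, population_share):
    """Population-weighted Clinton share using a single demographic column."""
    # Ignore respondents whose group has no population share (e.g. "Refused").
    kept = poll[poll[column].isin(population_share.index)]
    clinton = (kept["candidate"] == "Hillary Clinton").astype(float)
    cell_mean = clinton.groupby(kept[column]).mean()
    # Renormalize in case some population groups have no respondents.
    weights = population_share.reindex(cell_mean.index)
    weights = weights / weights.sum()
    return float((cell_mean * weights).sum())


estimate = weight_by_one(toy_poll, "gender", toy_share)
print(f"Weighted by gender --- Clinton: {estimate:.3f}, Trump: {1 - estimate:.3f}")
```

Here the group means are 2/3 (Female) and 1/2 (Male), so the weighted estimate is 0.5 × 2/3 + 0.5 × 1/2 = 7/12.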
Now, for each pair of demographic types in dfdemographic, do the same -- weight the poll by that pair of demographic types, in accordance with the given joint distributions, and briefly (at most 3 sentences) describe what you observe.
For example, when weighted by race and age, you'll find:
Weighted by age and race: Clinton: 0.525, Trump: 0.475
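The same idea extends to a pair of demographic types by grouping on both columns and weighting each cell by its joint share. This is a sketch on invented data; `weight_by_pair` is a hypothetical helper, and how to treat cells with no respondents (here they are dropped and the weights renormalized) is an assumption you may want to handle differently.

```python
import pandas as pd

# Toy two-candidate poll and toy joint shares (invented values).
toy_poll = pd.DataFrame({
    "candidate": ["Hillary Clinton", "Donald Trump", "Hillary Clinton",
                  "Hillary Clinton", "Donald Trump", "Donald Trump"],
    "gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "party": ["Democrat", "Republican", "Democrat",
              "Democrat", "Republican", "Republican"],
})
toy_joint = pd.Series({
    ("Female", "Democrat"): 0.2, ("Female", "Republican"): 0.3,
    ("Male", "Democrat"): 0.2, ("Male", "Republican"): 0.3,
})


def weight_by_pair(poll, columns, joint_share):
    """Population-weighted Clinton share using the joint cells of two columns."""
    clinton = (poll["candidate"] == "Hillary Clinton").astype(float)
    cell_mean = clinton.groupby([poll[c] for c in columns]).mean()
    # Keep cells that appear in both the poll and the population table.
    weights = joint_share.reindex(cell_mean.index).dropna()
    weights = weights / weights.sum()
    return float((cell_mean.reindex(weights.index) * weights).sum())


estimate = weight_by_pair(toy_poll, ["gender", "party"], toy_joint)
print(f"Weighted by gender and party: Clinton: {estimate:.3f}, Trump: {1 - estimate:.3f}")
```

In this toy poll every Democrat chose Clinton and every Republican chose Trump, so the estimate equals the total Democrat share, 0.4, even though the raw average is 0.5.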
We don't always have access to joint distributions across the population. For example, it may be hard to estimate the joint distribution of education and gender from past exit polls (surveys done as people are leaving the polling station). However, marginal distributions are often available.
As discussed in class, one strategy when you don't have access to joint distributions -- only marginals -- is to multiply the marginal distributions. For example, if 50% of your population is Democratic and 50% are women, then pretend that 50% times 50% = 25% of your population is Democratic women. Clearly this technique is not perfect (it treats the two demographics as independent), but it is sometimes a useful heuristic.
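A minimal sketch of the product-of-marginals approximation is below. The marginal shares are invented (apart from the 50%/50% figures from the example above), and `product_of_marginals` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy marginal shares; only the two 50% figures come from the example in the text.
party_share = pd.Series({"Democrat": 0.5, "Republican": 0.3, "Independent": 0.2})
gender_share = pd.Series({"Female": 0.5, "Male": 0.5})


def product_of_marginals(share_a, share_b):
    """Approximate a joint distribution by multiplying two marginals."""
    cells = pd.MultiIndex.from_product([share_a.index, share_b.index])
    # np.outer is row-major, matching from_product (first level varies slowest).
    values = np.outer(share_a.to_numpy(), share_b.to_numpy()).ravel()
    return pd.Series(values, index=cells)


approx_joint = product_of_marginals(party_share, gender_share)
print(approx_joint)
```

The resulting series has the same shape as a joint-share table, so it can be passed to whatever joint weighting routine you already wrote and compared against the result from the true joint distribution.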
For the following pairs of Demographic types, report the weighting results if you use the joint distributions in dfdemographic versus if you approximate the joint distribution using the marginals. Briefly (at most 3 sentences) describe what you observe.
(party, gender)
(race, gender)
As an example output, here are the results for two other pairs of demographics:
| | Demo1 | Demo2 | Joint |
|---|---|---|---|
| 0 | age | race | 0.524516 |
| 1 | age | education | 0.525483 |
The above techniques use the mean answer among people who share a demographic as the estimate for that demographic. But that wastes information across demographics. For example, maybe people who only have "Some College" are similar enough to people who have "High School" as to provide some useful information.
First, do the following: use a logistic regression (or your favorite prediction tool) to predict candidate choice, using the demographics. You might want to convert some demographics (like education) to ordered numeric (e.g., 1, 2, 3) as opposed to using discrete categories.
Here, you will earn partial bonus points by just reporting the predictions and comparing them to the means of each covariate group in the raw polling data. Give a scatter-plot, where each point is one combination of full demographics (age, gender, party, race/ethnicity, education), the X axis is the raw polling average for that combination, and the Y axis is your regression prediction for that combination.
Then, once you have predictions for each set of covariates, "post-stratify" to get a single population estimate by plugging them into the above weighting techniques, where you use the predictions instead of the raw averages in that cell. Report the resulting estimates if you do the 2-dimensional joint weighting (on every pair).
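The regress-then-post-stratify idea can be sketched as follows. Everything here is illustrative: the poll, the cell shares, and the helper names (`design_matrix`, `fit_logistic`) are invented, only two covariates are used, and the logistic regression is a plain gradient-descent fit with a small L2 penalty written in NumPy so the example has no extra dependencies. A library implementation would serve equally well.

```python
import numpy as np
import pandas as pd

EDU_ORDER = {"High School": 1, "Some College": 2, "College": 3}  # ordered coding

# Toy poll (1 = Clinton, 0 = Trump) and toy joint cell shares; values invented.
toy_poll = pd.DataFrame({
    "clinton": [1, 1, 0, 1, 0, 0, 1, 0, 0, 1],
    "education": ["High School", "College", "High School", "College", "Some College",
                  "High School", "Some College", "College", "Some College", "College"],
    "party": ["Democrat", "Democrat", "Republican", "Democrat", "Republican",
              "Republican", "Democrat", "Republican", "Democrat", "Republican"],
})
toy_cells = pd.DataFrame({
    "education": ["High School", "Some College", "College"] * 2,
    "party": ["Democrat"] * 3 + ["Republican"] * 3,
    "share": [0.15, 0.15, 0.20, 0.20, 0.15, 0.15],
})


def design_matrix(frame):
    """Columns: intercept, ordered education, Democrat indicator."""
    return np.column_stack([
        np.ones(len(frame)),
        frame["education"].map(EDU_ORDER).to_numpy(dtype=float),
        (frame["party"] == "Democrat").to_numpy(dtype=float),
    ])


def fit_logistic(X, y, l2=0.1, lr=0.1, steps=5000):
    """Gradient descent on the L2-penalized logistic loss (intercept penalized too)."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta -= lr * (X.T @ (p - y) / len(y) + l2 * beta)
    return beta


beta = fit_logistic(design_matrix(toy_poll), toy_poll["clinton"].to_numpy(dtype=float))

# Predict for every population cell, then weight predictions by cell share.
toy_cells["prediction"] = 1.0 / (1.0 + np.exp(-design_matrix(toy_cells) @ beta))
estimate = float((toy_cells["prediction"] * toy_cells["share"]).sum()
                 / toy_cells["share"].sum())
print(toy_cells)
print(f"Post-stratified Clinton share: {estimate:.3f}")
```

Because predictions come from the fitted model rather than from cell means, cells with few or no respondents still receive an estimate, which is the information-sharing benefit described above.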
You may use existing Python packages, such as here. Another approach would be to use rpy2 to call R, as there are many well-maintained packages in R to analyze polling data. One example is here.
i. In Part B, you should notice a discrepancy between what we said in class and the data -- weighting by education does not seem to do much to make the polling average less pro-Clinton.
Here, we'll try to dig into the data to see why the methods we tried above might not be perfect, and what data you would want (such as demographic joint distribution) to do better.
First, aggregate (using the groupby function) the poll results by education. Second, aggregate by education and some of the other covariates (for example, education and race, or education and party). Discuss in 4 sentences or fewer.
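The two aggregations might look like the sketch below, shown on an invented six-row poll where `clinton` is a hypothetical 0/1 indicator column. Reporting the group size next to the mean helps you see how thin some cells are.

```python
import pandas as pd

# Toy poll (1 = Clinton, 0 = Trump); values invented for illustration.
toy_poll = pd.DataFrame({
    "clinton": [1, 0, 1, 1, 0, 0],
    "education": ["College", "College", "College",
                  "High School", "High School", "High School"],
    "party": ["Democrat", "Republican", "Democrat",
              "Democrat", "Republican", "Republican"],
})

# Clinton share and respondent count by education alone.
by_edu = toy_poll.groupby("education")["clinton"].agg(["mean", "size"])

# The same, broken down further by party.
by_edu_party = toy_poll.groupby(["education", "party"])["clinton"].agg(["mean", "size"])

print(by_edu)
print(by_edu_party)
```

In this toy data the education groups differ in their Clinton share, yet within each party the share is identical across education levels, so the apparent education effect comes entirely from the mix of parties in each education group.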
ii. You'll notice that there are some responses with "refused," and that those people in particular are Trump-leaning. Furthermore, there are likely many people who refused to answer the poll at all, who do not show up in the data. The weighting techniques we used above would ignore these people. How would you adjust your procedures/estimates above to take them into account? Answer in at most 3 sentences.
None of the above techniques deal with selection biases/non-response on un-measured covariates. Do you think that may be an important concern in this dataset? Why or why not? Respond in 3 or fewer sentences.
Throughout this homework, you made many estimates of the same quantity -- the fraction of people who will vote for Clinton in Florida. Below, plot a histogram of all your estimates.
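As a sketch of the bookkeeping, you might collect the estimates in a list as you go and bin them. The four values below are only the example outputs quoted earlier in this handout (not a full set of results), and the choice of 3 bins is arbitrary.

```python
import numpy as np

# Example Clinton shares quoted earlier in the handout; replace with your own estimates.
estimates = [0.530, 0.525, 0.524516, 0.525483]

# Bin the estimates; counts[i] is the number of estimates between edges[i] and edges[i + 1].
counts, edges = np.histogram(estimates, bins=3)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:.4f}, {right:.4f}]: {count}")
```

With matplotlib installed, `plt.hist(estimates, bins=3)` would draw the same bins as a plot.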
Given all your above analysis, if you were a pollster what would you report as your single estimate?
Justify your choice in at most 3 sentences.
Though we did not discuss how to calculate margin of error or standard errors with weighting in this course, what would you say if someone asked you how confident you are in your estimate? You may either qualitatively answer, or try to come up with a margin of error.