Chapter 23 Papers

23.1 ‘Mandatory Minimums’

23.1.1 Task

  • Working individually and in an entirely reproducible way, please find a dataset of interest on Open Data Toronto and write a short paper telling a story about the data.

23.1.2 Guidance

  • Find a dataset of interest on Open Data Toronto and download it in a reproducible way using the R package opendatatoronto (Gelfand 2020).
  • Create a folder with appropriate sub-folders, add it to GitHub, and then prepare a PDF using R Markdown with these sections (you are welcome to use this starter folder: https://github.com/RohanAlexander/starter_folder):
    • title,
    • author,
    • date,
    • abstract,
    • introduction,
    • data, and
    • references.
  • In the data section, thoroughly and precisely discuss the source of the data and the bias this brings (ethical, statistical, and otherwise). Comprehensively describe and summarize the data using text and at least one graph and one table. Graphs must be made in ggplot2 (Wickham 2016) and tables must be made using knitr::kable() (with or without kableExtra) or gt (Iannone, Cheng, and Schloerke 2020b). Make sure to cross-reference graphs and tables.
  • Use bibtex to add references. Be sure to reference R and any R packages you use, as well as the dataset. Check that you have referenced everything. Strong submissions will draw on related literature and would be sure to also reference those. There are various options in R Markdown for references style; just pick one that you are used to.
  • Go back and write an introduction. This should be two or three paragraphs. The last paragraph should set out the remainder of the paper.
  • Add an abstract. This should be three or four sentences. And then add a descriptive title (Hint: ‘Paper 1’ is not descriptive.)
  • Add a link to your GitHub repo via a footnote.
  • Check that your GitHub repo is well-organized, and add an informative README. (Hint: Comment. Your. Code.) Make sure that you’ve got at least one R script in there, in addition to your R Markdown file.
  • Pull this all together as a PDF and check that the paper is well-written and able to be understood by the average reader of, say, FiveThirtyEight. This means that you are allowed to use mathematical notation, but you must explain all of it in plain language. All statistical concepts and terminology must be explained. Your reader is someone with a university education, but not necessarily someone who understands what a p-value is - explain everything that you use.
  • Check there is no evidence that this is a class assignment.
  • Via Quercus, submit the PDF.
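
A reproducible download step could be sketched as below. This is a minimal sketch, not a definitive implementation: the function name `download_toronto_data`, the script name, and the output path are hypothetical, and the choice to take the first CSV resource is an assumption that will not suit every dataset. It relies on `list_package_resources()` and `get_resource()` from opendatatoronto, and needs that package and an internet connection when the function is called.

```r
# Sketch of a download script, e.g. scripts/00-download_data.R (hypothetical name).
# Nothing is downloaded until the function is called.

download_toronto_data <- function(package_id, out_file) {
  # List every resource (file) attached to the chosen Open Data Toronto package.
  resources <- opendatatoronto::list_package_resources(package_id)

  # Assumption: the data of interest is a CSV resource. Inspect `resources`
  # yourself, because the right resource is dataset-specific.
  csv_resources <- resources[tolower(resources$format) == "csv", ]
  if (nrow(csv_resources) == 0) {
    stop("No CSV resource found for this package; inspect the resources by hand.")
  }

  # Download the first CSV resource and save an untouched raw copy.
  raw_data <- opendatatoronto::get_resource(csv_resources$id[1])
  dir.create(dirname(out_file), recursive = TRUE, showWarnings = FALSE)
  write.csv(raw_data, out_file, row.names = FALSE)

  invisible(raw_data)
}

# Usage (not run here): replace the placeholder with the ID of your dataset.
# download_toronto_data("<your-package-id>", "inputs/data/raw_data.csv")
```

Keeping the download in its own script means the raw data can be re-fetched by anyone who clones the repo, and the R Markdown file only needs to read the saved copy.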

23.1.3 Check offs points

  • Check you’ve not included any R code or raw R output in the final PDF.
  • Check that, although most of your code will probably be in the R Markdown file, you have at least one R script in the scripts folder.
  • Check there is thoroughly commented code that directly creates your PDF. Do not knit to html and then save as a PDF. Do not knit to Word and then save as a PDF.
  • Check that your graph and discussion are extremely clear, and of comparable quality to those of FiveThirtyEight.
  • Check that the date is updated.
  • Check your entire workflow is entirely reproducible.
  • Check for typos.

23.1.4 FAQ

  • Can I use a dataset from Kaggle instead? No, because too many people use Kaggle datasets so employers are sick of them.
  • I can’t use code to download my dataset, can I just manually download it? No, because your entire workflow needs to be reproducible. Please fix the download problem or pick a different dataset.
  • How much should I write? Most students submit something in the two-to-six-page range, but it’s really up to you. Be precise and thorough.
  • My data is about apartment blocks/NBA/League of Legends so there’s no ethical or bias aspect, what do I do? Please re-read the readings to better understand bias and ethics. If you really can’t think of something, then it might be worth picking a different dataset.
  • Can I use Python? No. If you already know Python then it doesn’t hurt to learn another language.
  • Why do I need to cite R, when I don’t need to cite Word? R is a free statistical programming language with academic origins so it’s appropriate to acknowledge the work of others. It’s also important for reproducibility.

23.2 ‘These numbers mean dial it up’

23.2.1 Task

Please consider this scenario:

  • ‘You are employed as a junior data scientist at Petit Poll - a Canadian polling company. Petit Poll has a contract with a ‘client’ - an Ontario government department - to provide them with advice. In particular, the client wants to understand the effect of COVID shut-downs on restaurant businesses and has asked Petit Poll to design an experiment where some restaurants are shut down.’
  • Working as part of a small team of 1-3 people, and in an entirely reproducible way, please decide on an intervention, and some measurement strategies, and then write a short paper telling a story about the effect of shut-downs on restaurants.

23.2.2 Guidance

  • Working as part of a team of 1-3 people, prepare a PDF in R Markdown with the following features (you are welcome to use this starter folder: https://github.com/RohanAlexander/starter_folder):
    • title,
    • author/s,
    • date,
    • abstract,
    • introduction,
    • data,
    • discussion, and
    • references.
  • In the data section you should specify the intervention and data gathering methodology.
  • In the discussion section and any other relevant section, please be sure to discuss ethics and bias with reference to relevant literature.
  • Decide on an intervention. Some aspects to address include:
    • How will it be designed and implemented?
    • What will be random about it?
    • How will you ensure the separation of treatment and non-treatment?
    • How long will it run?
  • Decide on a survey methodology. Some aspects to address include:
    • What is the population, frame, and sample?
    • What sampling methods will you use and why? What are some of the statistical properties that the method brings to the table?
    • How are you going to reach your desired respondents?
    • How much do you estimate this will cost?
    • What steps will you take to deal with non-response and how will non-response affect your survey?
    • How are you going to protect respondent privacy?
  • Remember to consider all of this in the context of your ‘client’ - for instance, what are they interested in?
  • Develop a survey on a platform that was introduced in class. Be sure to test it yourselves. Test it as much as possible; you could even swap informally with another group.
  • Now release the surveys into the (simulated) ‘field’. Please do this by simulating an appropriate number of responses to your survey in R. Don’t forget to simulate in relation to the intervention that you proposed. Do you need two, or even more, surveys? Show the results and discuss your ‘findings’. Everything must be entirely reproducible.
  • You may wish to scrape some data and/or use open data sources to appropriately parameterize your simulations. Don’t forget to cite them when you do this.
  • Use R Markdown to write a PDF report about all of this. Discuss your intervention, results and findings, your survey design and motivations, etc - all of it. You are writing a report that will eventually go to the client, so you must set the scene, and use language that demonstrates your command of statistical concepts but brings the reader along with you. Be sure to include graphs and tables and reference them in your discussion. Be sure to be clear about weaknesses and biases, and opportunities for future work.
  • Your report must be well written. You are allowed to, and should, use mathematical notation, but you must explain all of it in plain language. Similarly, you can, and should, use experimental/survey/sampling/observational data terminology, but again, you need to explain it.
  • Your graphs and tables must be of an incredibly high standard. Graphs and tables should be well formatted and report-ready. They should be clean and digestible. Furthermore, you should label and describe each table/figure.
  • Your client has stats graduates working for it who need to be impressed by the main content of the report, but also has people who barely know what an average is and these people need to be impressed also.
  • Your graphs must be of an extremely high standard.
  • Check that you have referenced everything, including R, R packages, and datasets. Strong submissions will draw on related literature and would be sure to also reference those. The style of references does not matter, provided it is consistent.
  • Via Quercus, submit your PDF report. You must provide a link to the GitHub repo where the code that you used for this assignment lives (hint: Comment. Your. Code.). Your entire workflow must be entirely reproducible. Your repo should be clearly organised and a useful README included. And you must include the R Markdown file that produced the PDF in that repo.
  • Please be sure to include a link to your survey/s in your report and screenshots of the survey/s in the appendix of your report.
  • Everyone in the team receives the same mark.
  • There should be no evidence that this is a class assignment.
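
The random assignment and simulated survey responses described above could be sketched in base R as follows. Every number here (sample size, revenue levels, effect size, response rate) is a made-up assumption for illustration; in practice you would parameterize these from open data sources and cite them.

```r
set.seed(853)  # fix the seed so the simulation is reproducible

n_restaurants <- 1000  # assumed number of sampled restaurants

# Randomly assign exactly half of the restaurants to the shut-down (treatment) group.
simulated <- data.frame(
  restaurant_id = seq_len(n_restaurants),
  treated = sample(rep(c(0, 1), each = n_restaurants / 2))
)

# Assumed outcome: weekly revenue in dollars, averaging 10,000 for untreated
# restaurants and 3,000 lower for treated restaurants.
simulated$weekly_revenue <- rnorm(
  n_restaurants,
  mean = 10000 - 3000 * simulated$treated,
  sd = 2000
)

# Assumed non-response: each restaurant answers the survey with probability 0.6.
simulated$responded <- rbinom(n_restaurants, size = 1, prob = 0.6)
respondents <- simulated[simulated$responded == 1, ]

# Compare mean revenue between groups among those who responded.
group_means <- aggregate(weekly_revenue ~ treated, data = respondents, FUN = mean)
estimated_effect <-
  group_means$weekly_revenue[group_means$treated == 1] -
  group_means$weekly_revenue[group_means$treated == 0]

print(group_means)
print(estimated_effect)
```

Here non-response is independent of treatment, which is the most optimistic case; a fuller simulation might let the response probability depend on treatment or revenue, to show how non-response could bias the estimate.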

23.2.3 Check offs points

23.2.4 FAQ

  • Can I work by myself? Yes. But I recommend forming a group and the workload for the course assumes you’ll work on the second and third paper as part of a group of four.
  • Can we switch groups for the third paper? Yes.
  • How can I find a group? I will randomly create groups of four in Quercus. You are welcome to shift out of those groups and form your own groups if you’d like.
  • Can I get a different mark to the rest of my group? No. Everyone in the group gets the same mark.
  • I wrote my paper by myself, so can I be graded on a different scale? No. All papers are graded in the same way.
  • How much should I write? Most students submit something in the 10-to-15-page range, but it’s really up to you. Be precise and thorough.

23.3 ‘The Short List’

23.3.1 Task

  • Working as part of a small team of 1-3 people, and in an entirely reproducible way, please pick a paper to reproduce from an approved list and then write a short paper telling a story based on this. Your story should talk about both the (reproduced) findings and (a bit more ‘meta’) what you learnt from the process.

23.3.2 Guidance

  • Working as part of a team of 1-3 people, prepare a PDF in R Markdown with the following features:
    • title,
    • author/s,
    • date,
    • abstract,
    • introduction,
    • data,
    • model,
    • results,
    • discussion, and
    • references.
  • In the discussion section and any other relevant section, please be sure to discuss ethics and bias with reference to relevant literature.
  • You should reproduce one of the following papers:
    • Liran Einav, Amy Finkelstein, Tamar Oostrom, Abigail Ostriker, Heidi Williams, 2020, ‘Screening and Selection: The Case of Mammograms’, American Economic Review.
    • Pons, Vincent, 2018, ‘Will a Five-Minute Discussion Change Your Mind? A Countrywide Experiment on Voter Choice in France’ American Economic Review.
    • Barari, Soubhik, Christopher Lucas, and Kevin Munger, 2021, ‘Political Deepfake Videos Misinform the Public, But No More than Other Fake Media’, 13 January, https://osf.io/cdfh3/.
    • Others TBD
    • If you have a favourite paper and want to reproduce it, then please submit it to me for consideration before Reading Week.
  • You should follow the lead of the author/s of the paper you’re reproducing, but thoroughly think about, and discuss, what is being done. Regardless of the particular model that you are using, and the extent (possibly lacking) to which this is done in the paper, your model must be well explained, thoroughly justified, and appropriate to the task at hand, and the results must be beautifully described.
  • You must include a DAG (probably in the model section).
  • You must have a discussion of power and experimental design (probably in the data section).
  • Your paper must be well-written, draw on relevant literature, and show your statistical skills by explaining all statistical concepts that you draw on.
  • You are welcome to use appendices for supporting, but not critical, material. Your discussion must include sub-sections that focus on three or four interesting points, and also sub-sections on weaknesses and next steps.
  • In your report you must provide a link to a GitHub repo that fully contains your analysis. Your code must be entirely reproducible, documented, and readable. Your repo must be well-organised and appropriately use folders.
  • Your graphs and tables must be of an incredibly high standard. Graphs and tables should be well formatted and report-ready. They should be clean and digestible. Furthermore, you should label and describe each table/figure.
  • When you discuss the dataset (in the data section) you should make sure to discuss (at least):
    • Its key features, strengths, and weaknesses generally.
    • A discussion of the questionnaire - what is good and bad about it?
    • A discussion of the methodology including how they find people to take the survey; what their population, frame, and sample were; what sampling approach they took and what some of the trade-offs may be; what they do about non-response; the cost.
    • A discussion of the intervention and experimental design.
    • These are just some of the issues strong submissions will consider. Show off your knowledge. If this becomes too detailed then you should push some of this to footnotes or an appendix.
  • When you discuss your model (in the model section), you must be extremely careful to spell out the statistical model that you are using, defining and explaining each aspect and why it is important. (For a Bayesian model, a discussion of priors and regularization is almost always important.) You should mention the software that you used to run the model. You should be clear about model convergence, model checks, and diagnostic issues. How do the sampling and survey aspects that you discussed assert themselves in the modelling decisions that you make? Again, if it becomes too detailed then push some of the details to footnotes or an appendix. You have the original paper to guide you, but you’ll likely need to go well beyond what is included.
  • You should present model results, graphs, figures, etc, in the results section. This section should strictly relay results. Interpretation of these results and conclusions drawn from the results should be left for the discussion section.
  • Your discussion should focus on your model results. Interpret them and explain what they mean. Put them in context. What do we learn about the world having understood your model and its results? What caveats could apply? To what extent does your model represent the small world and the large world (to use the language of McElreath, Ch 2)? What are some weaknesses and opportunities for future work? Additionally, as this is a reproduction you should include a sub-section on differences you found and difficulties that you had.
  • Check that you have referenced everything. Strong submissions will draw on related literature in the discussion (and other sections) and would be sure to also reference those. The style of references does not matter, provided it is consistent.
  • As a team, via Quercus, submit a PDF of your paper. Again, in your paper you must have a link to the associated GitHub repo. And you must include the R Markdown file that produced the PDF in that repo. The repo must be well-organized and have a detailed README.
  • A good way to work as a team would be to split up the work, so that one person is doing each section. The people doing the sections that rely on data (such as the analysis and the graphs) could just simulate it while they are waiting for the person putting together the data to finish.
  • It is expected that your submission be well written and able to be understood by the average reader of, say, 538. This means that you are allowed to use mathematical notation, but you must be able to explain it all in plain English. Similarly, you can (and hint: you should) use survey, sampling, observational, and statistical terminology, but again you need to explain it. Your work should have flow and should be easy to follow and understand. To communicate well, anyone at the university level should be able to read your report once and relay back the methodology, overall results, findings, weaknesses and next steps without confusion.
  • Everyone in the team receives the same mark.
  • There should be no evidence that this is a class assignment.
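
As a starting point for the discussion of power, a calculation along these lines may help. It is a sketch under assumed numbers: the baseline rate of 0.50 and treated rate of 0.55 are invented for illustration and do not come from any of the listed papers. It uses `power.prop.test()` from base R's stats package, which applies to a comparison of two proportions with equal group sizes.

```r
# Assumed rates: 50 per cent of the control group and 55 per cent of the
# treatment group show the outcome. Solve for the sample size per group that
# gives 80 per cent power at the 5 per cent significance level.
design <- power.prop.test(p1 = 0.50, p2 = 0.55, sig.level = 0.05, power = 0.80)
n_per_group <- ceiling(design$n)

# Check the power that this (rounded-up) sample size would achieve.
achieved <- power.prop.test(n = n_per_group, p1 = 0.50, p2 = 0.55, sig.level = 0.05)

print(n_per_group)
print(achieved$power)
```

In plain language, power is the chance that the experiment detects an effect of the assumed size if that effect really exists; comparing the required sample size with the one in the original paper is one way to frame the discussion.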

23.3.3 Check offs points

23.3.4 FAQ

  • Do I have to stay in the same group as the second paper? No. You’re welcome to change. However, it’s important that you don’t change the second paper group on Quercus - be sure to only change the third paper group.
  • Can we switch groups for the third paper? Yes.
  • How much should I write? Most students submit something in the 10-to-15-page range, but it’s really up to you. Be precise and thorough.
  • My paper doesn’t have a DAG, what do I do? You need to make the DAG.

23.4 ‘Two Cathedrals’

23.4.1 Task

  • Working individually, please conduct original research that applies methods from statistics to a question that involves an experiment.

23.4.2 Guidance

  • You have various options for topics (pick one):
    • Develop a research question that is of interest to you and obtain or create a relevant dataset. This option involves developing your own research question based on your own interests, background, and expertise. I encourage you to take this option, but please discuss your plans with me. How does one come up with ideas? One way is to be question-driven, where you keep an informal log of small ideas, questions, and puzzles that you have as you’re reading and working. Often, after dwelling on it for a while you can manage to find some questions of interest. Another way is to be data-driven - try to find some interesting dataset and then work backward. Finally, yet another way is to be methods-driven - let’s say that you happen to understand Gaussian processes, then just apply that expertise.
    • Others TBA

  • You should know the expectations by now. If you need a refresher then review the past problem sets. But essentially:
    • Everything is entirely reproducible.
    • Your paper must be written in R Markdown.
    • Your paper must have the following sections:
      • Title, date, author, keywords, abstract, introduction, data, model, results, discussion, appendix (optional, for supporting, but not critical, material), and a reference list.
    • Your paper must be well-written, draw on relevant literature, and show your statistical skills by explaining all statistical concepts that you draw on.
    • The discussion needs to be substantial. For instance, if the paper were 10 pages long then a discussion should be at least 2.5 pages. In the discussion, the paper must include subsections on weaknesses and next steps - but these must be in proportion.
    • The report must provide a link to a GitHub repo that contains everything (apart from any raw data that you git ignored if it is not yours to share). The code must be entirely reproducible, documented, and readable. The repo must be well-organised and appropriately use folders and README files.

23.4.3 Peer review submission

  • My expectations for this paper are very high. I’m very excited to read what you submit. To help you achieve this standard, there is an initial ‘submission’ where you can get comments and feedback and then the final, actual, submission.
  • Submit initial materials for peer-review.
    • As an individual, submit a PDF of your rough draft via Quercus.
    • At a minimum this must include:
      • All top-matter (title, author (you can use a pseudonym if you want), date, keywords, abstract) completely filled out.
      • A fully written Introduction section.
      • All other sections must be present in your paper, but don’t have to be filled out (e.g. you must have a ‘Data’ heading, but you don’t need to have content in that section).
  • To be clear - it is fine to later change any aspect of what you submit at this check-point.
  • You will be awarded two percentage points just for submitting a draft that meets this minimum.
  • The point of this is to get feedback on your work (and to make sure you have at least started thinking about this project) so you are more than welcome to include other sections that you wish to get feedback on.
  • There will be no extensions granted for this submission since the following submission is dependent on this date.

23.4.4 Conduct peer-review

  • As an individual, you will be randomly assigned a handful of rough drafts on which to provide feedback. You have three days to provide feedback to your peers.
  • If you provide feedback to one peer you will receive one percentage point; to two peers, two percentage points; and to three (or more) peers, the full three percentage points.
  • Your feedback must include at least five comments (meaningful/useful bullet points). These must be well-written and thoughtful.
  • There will be no extensions granted for this submission since the following submission is dependent on this date.
  • Please remember that you are providing feedback here to help your colleagues. All comments should be professional and kind. It is challenging to receive criticism. Please remember that your goal here is to help your peers advance their writing/analysis. Any feedback that is inappropriate or not up to standard will receive a 0 and cannot be redeemed later.

23.4.5 Check offs points

23.4.6 FAQ

  • Can I work as part of a team? No. It’s important that you have some work that is entirely your own. You really need your own work to show off for job applications etc.
  • How much should I write? Most students submit something in the 10-to-15-page range, but it’s really up to you. Be precise and thorough.

23.5 ‘A Proportional Response’

23.5.1 Task

Working in teams of one to four people, please consider this scenario:

  • ‘You are employed as a junior statistician at Petit Poll - a Canadian polling company. Petit Poll has a contract with a Canadian political party to provide them with monthly polling updates.’
  • Working as part of a small team of 1-4 people, and in an entirely reproducible way, please write a short paper that tells the client a story about their standing.

23.5.3 Check offs points

23.5.4 FAQ

23.6 ‘Mr Willis of Ohio’

23.6.1 Task

  • Working in teams of one to four people, and in an entirely reproducible way, please use the Canadian General Social Survey (GSS) and a regression model to tell a story.

23.6.3 Check offs points

  • It is recommended that you (informally) proofread one another’s sections - why not exchange papers with another group?
  • Everyone in the team receives the same mark.
  • There should be no evidence that this is a class assignment.

23.6.4 FAQ

23.7 ‘Five Votes Down’

23.7.1 Task

  • The primary goal of this paper is to predict the overall popular vote of the 2020 American presidential election using multilevel regression with post-stratification.
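
The post-stratification step can be illustrated with a small worked example. This sketch covers only the final weighting step, not the multilevel regression that would produce the cell-level estimates, and every number in it is made up for illustration: the estimate of support in each cell is weighted by that cell's share of the population.

```r
# Hypothetical cells: in practice the estimates come from the fitted multilevel
# model and the counts come from a census or large survey.
cells <- data.frame(
  age_group = c("18-29", "30-44", "45-64", "65+"),
  estimated_support = c(0.60, 0.55, 0.48, 0.42),  # made-up model predictions
  population_count = c(200, 250, 350, 200)        # made-up population counts
)

# Post-stratified estimate: population-weighted average of the cell estimates.
poststratified_support <-
  sum(cells$estimated_support * cells$population_count) /
  sum(cells$population_count)

# For comparison, the unweighted average treats every cell as equally large.
unweighted_support <- mean(cells$estimated_support)

print(poststratified_support)  # 0.5095
print(unweighted_support)      # 0.5125
```

The two numbers differ because the cells are not equally sized in the population; the same weighted average can also be computed with `weighted.mean(cells$estimated_support, cells$population_count)`.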

23.7.3 Check offs points

  • It is expected that your submission be well written and able to be understood by the average reader of, say, 538. This means that you are allowed to use mathematical notation, but you must be able to explain it all in plain English. Similarly, you can (and hint: you should) use survey, sampling, observational, and statistical terminology, but again you need to explain it. The average person doesn’t know what a p-value is, nor what a confidence interval is. You need to explain all of this in plain language the first time you use it. Your work should have flow and should be easy to follow and understand. To communicate well, anyone at the university level should be able to read your report once and relay back the methodology, overall results, findings, weaknesses and next steps without confusion.
  • It is recommended that you (informally) proofread one another’s work - why not exchange papers with another group?
  • Everyone in the team receives the same mark.
  • There should be no evidence that this is a class assignment.

23.7.4 FAQ

23.8 ‘What’s next?’

23.8.1 Task

Please work individually. In this paper, you will conduct original research that applies methods from statistics to a question involving surveys, sampling or observational data.

23.8.3 Check offs points

23.8.4 FAQ

References

Gelfand, Sharla. 2020. Opendatatoronto: Access the City of Toronto Open Data Portal. https://CRAN.R-project.org/package=opendatatoronto.

Iannone, Richard, Joe Cheng, and Barret Schloerke. 2020b. Gt: Easily Create Presentation-Ready Display Tables. https://CRAN.R-project.org/package=gt.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.