5 On writing

Required material

If you want to be a writer, you must do two things above all others: read a lot and write a lot. There’s no way around these two things that I’m aware of, no shortcut.

S. King (2000, 145)

  • Read On Writing Well (any edition is fine) (Zinsser 1976).
  • Read Publication, publication (G. King 2006).
  • Read two of the following well-written quantitative papers:
  • Read two of the following articles from The New Yorker:
    • Funny Like a Guy, Tad Friend
    • Going the Distance, David Remnick
    • Happy Feet, Alexandra Jacobs
    • Levels of the Game, John McPhee
    • Reporting from Hiroshima, John Hersey
    • The Catastrophist, Elizabeth Kolbert
    • The Quiet German, George Packer
  • Read two of the following articles from miscellaneous publications:
    • Blades of Glory, Holly Anderson
    • Born to Run, Walt Harrington
    • Dropped, Jason Fagone
    • Federer as Religious Experience, David Foster Wallace
    • Generation Why?, Zadie Smith
    • One hundred years of arm bars, David Samuels
    • Out in the Great Alone, Brian Phillips
    • Pearls Before Breakfast, Gene Weingarten
    • The Cult of ‘Jurassic Park’, Bryan Curtis
    • The House that Hova Built, Zadie Smith
    • The Re-Education of Chris Copeland, Flinder Boyd
    • The Sea of Crisis, Brian Phillips

Key concepts and skills

  • To get better at writing, write, ideally every day.
  • Write for the reader.
  • Have one message that you want to communicate.
  • Get to a first draft as quickly as possible.
  • Rewrite brutally.
  • Remove as many words as possible.

5.1 Introduction

[T]he duty of a scientist is not only to find new things, but to communicate them successfully in at least three forms: 1) Writing papers and books. 2) Prepared public talks. 3) Impromptu talks.

Hamming (1996, 65)

People who need to write: founders, VCs, lawyers, software engineers, designers, painters, data scientists, musicians, filmmakers, creative directors, physical trainers, teachers, writers. Learn to write.

Sahil Lavingia, 3 February 2020.

Writing well has done just as much for me as knowing how to code. I’d add that if you’re intimidated by writing, start a blog and write often about something you’re interested in. You’ll get better. At least that’s what I’ve done for the past 10 years. :)

Vicki Boykis, 3 February 2020.

We need to write in order to tell our stories. Writing allows us to communicate efficiently. It is also a way to work out what we believe and allows us to get feedback on our ideas. Effective papers are tightly written and well-organized, which makes the story flow and makes it easy to follow. Proper sentence structure, spelling, vocabulary, and grammar are important because they remove distractions and enable each point to be clearly articulated. Effective papers demonstrate understanding of the topic by confidently using relevant terms and techniques and considering issues without being overly verbose. Graphs, tables, and references are used to enhance both the story and its credibility.

This chapter is about writing. By the end of it, you will have a better idea of how to write short, detailed, quantitative papers that communicate what you want them to, and do not waste the reader’s time. We write for the reader, not for ourselves. Specifically, we write to be useful to the reader, where ‘[u]seful writing tells people something true and important that they didn’t already know, and tells them as unequivocally as possible’ (Graham 2020). That said, the greatest benefit of writing nonetheless often accrues to the writer, even when we write for our audience. This is because the process of writing is a way to work out what we think and how we came to believe it.

The way to do a piece of writing is three or four times over, never once. For me, the hardest part comes first, getting something—anything—out in front of me. Sometimes in a nervous frenzy I just fling words as if I were flinging mud at a wall. Blurt out, heave out, babble out something—anything—as a first draft. With that, you have achieved a sort of nucleus. Then, as you work it over and alter it, you begin to shape sentences that score higher with the ear and the eye. Edit it again—top to bottom. The chances are that about now you’ll be seeing something that you are sort of eager for others to see. And all that takes time. What I have left out is the interstitial time. You finish that first awful blurting, and then you put the thing aside. You get in your car and drive home. On the way, your mind is still knitting at the words. You think of a better way to say something, a good phrase to correct a certain problem. Without that drafted version—if it did not exist—you obviously would not be thinking of things that would improve it. In short, you may be actually writing only two or three hours a day, but your mind, in one way or another, is working on it twenty-four hours a day—yes, while you sleep—but only if some sort of draft or earlier version already exists. Until it exists, writing has not really begun.

McPhee (2017, 159)

The process of writing is a process of re-writing. And the critical task is to get to a first draft as quickly as possible. A complete first draft of a five-to-ten-page quantitative paper can be done in a day. Until that complete first draft exists, it is useful to try not to delete or even revise anything that was written, regardless of how bad it may seem. Just write.

One of the most intimidating things in the world is a blank page, and we deal with this by immediately adding headings such as: ‘Introduction,’ ‘Data,’ ‘Model,’ ‘Results,’ and ‘Discussion.’ We then add fields in the top matter for the various bits and pieces that are needed, such as ‘title,’ ‘date,’ ‘author,’ and ‘abstract.’ This creates a generic outline, and its role is akin to placing on the counter the ingredients that we will use to prepare dinner (McPhee 2017).
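As a minimal sketch of that generic outline, the following Python builds a skeleton document with the top-matter fields and section headings named above. The Markdown-with-YAML layout, the function name `make_skeleton`, and the default values are assumptions made for illustration; any format that holds the same fields and headings would serve.

```python
def make_skeleton(title="Working title", author="Author name"):
    """Build a generic paper outline: top matter, then empty sections."""
    # Top-matter fields; the YAML-style layout is an assumption.
    top_matter = [
        "---",
        f'title: "{title}"',
        f'author: "{author}"',
        'date: "TODO"',
        'abstract: "TODO"',
        "---",
    ]
    # The generic section headings listed in the text.
    headings = ["Introduction", "Data", "Model", "Results", "Discussion"]
    body = [f"# {heading}\n" for heading in headings]
    return "\n".join(top_matter) + "\n\n" + "\n".join(body)


skeleton = make_skeleton()
print(skeleton)
```

The point is only that the page is no longer blank; the working title and the ‘TODO’ fields are replaced during redrafting.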

Having established this generic outline, we need to develop an understanding of what we are exploring through developing a research question. In theory, we develop a research question, answer it, and then we do all the writing; but that rarely actually happens (Franklin 2005). Instead, we typically have some idea of the question, and our answer, and these become less vague as we write. This is because it is through the process of writing that we refine our thinking (S. King 2000, 131). Having put down some thoughts about the research question, we can start to add dot points in each of the sections, adding sub-sections, with informative sub-headings as needed. We then go back and expand those dot points into paragraphs.

While writing the first draft it is important to ignore the feeling that one is not good enough, or that it is impossible. Just write. We need words on paper, even if they are bad, and the first draft is when we accomplish this. Remove all distractions and just write. Perfectionism is the enemy, and should be set aside. Sometimes this can be accomplished by getting up very early to write, or by creating a deadline, or with a glass or two of wine. One friend puts her baby to sleep to the sound of her typing, with the result being that she must keep typing otherwise the baby will wake up. Creating a sense of urgency can be useful. Rather than adding proper citations as we go, which could slow us down, just add something like ‘[TODO: CITE R HERE].’ Do the same with graphs and tables. That is, include textual descriptions such as ‘[TODO: ADD GRAPH THAT SHOWS EACH COUNTRY OVER TIME HERE]’ instead of actual graphs and tables. Focus on adding content, even if it is poorly written, or not ideal. When this is all done, a first draft exists!
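Placeholders are only useful if they are all found again before the second draft is finished. As a minimal sketch, assuming the draft is plain text and every placeholder follows the ‘[TODO: …]’ pattern used above, the following Python lists what remains to be filled in. The function name `find_todos` and the example draft are hypothetical.

```python
import re

# Matches placeholders such as "[TODO: CITE R HERE]".
TODO_PATTERN = re.compile(r"\[TODO:\s*([^\]]+)\]")


def find_todos(draft_text):
    """Return (line_number, note) pairs for each placeholder in the draft."""
    todos = []
    for line_number, line in enumerate(draft_text.splitlines(), start=1):
        for match in TODO_PATTERN.finditer(line):
            todos.append((line_number, match.group(1).strip()))
    return todos


# A made-up three-line draft, for illustration only.
example_draft = """We use R for the analysis [TODO: CITE R HERE].
[TODO: ADD GRAPH THAT SHOWS EACH COUNTRY OVER TIME HERE]
The results are discussed below."""

for line_number, note in find_todos(example_draft):
    print(f"Line {line_number}: {note}")
```

Running a check like this at the start of the second draft gives a list of the citations, graphs, and tables that still need to be added.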

This first draft will be bad. But it is by writing a bad first draft that we can get to a good second draft, a great third draft, and eventually excellence (Lamott 1994, 20). That first draft will be too long, it will not make sense, it will contain claims that cannot be supported, and some claims that should not be. Having focused on adding content while writing the first draft, to turn that into a second draft, we use the ‘delete’ key extensively, as well as ‘cut’ and ‘paste.’ Printing out the paper and using a red pen to move or remove is especially helpful. The process of going from a first draft to a second draft is best done in one sitting, to help with flow and consistency of the story. One aspect of this first re-write is enhancing the story that we want to tell. And another aspect is taking out everything that is not the story (S. King 2000, 57).

As we go through what was written in each of the sections, we try to bring some sense to it, with special consideration to how it supports our story. This revision process is the essence of writing (McPhee 2017, 160). We should also fix the references, and add the real graphs and tables. As part of this re-writing process, the paper’s central message tends to develop, and our answers to the research questions tend to become clearer. At this point, aspects such as the introduction can be returned to and, finally, the abstract. Typos and other issues affect the credibility of the work, and so it is important that these are fixed as part of the second draft.

We now have a paper that is sensible. The job now is to make it brilliant. Print it out again, and again go through it on paper. It is especially important to brutally remove everything that does not contribute to the story. At about this stage, we may be starting to get too close to the paper. We write for our reader, and so this is a great opportunity to give it to someone else for their comments. We ask them for feedback that enables us to better understand the weak parts of the story. After addressing these, it can be helpful to go through the paper once more, this time reading it aloud. A paper is rarely ever ‘done’; rather, at a certain point we either run out of time or become sick of the sight of it.

5.2 Developing research questions

Both qualitative and quantitative approaches have their place, but here we focus on quantitative approaches. Qualitative research is important as well, and often the most interesting work has a little of both. When conducting quantitative analysis, we are subject to issues such as data quality, scales, measurement, and sources. We are often especially interested in trying to tease out causality. Regardless, we are trying to learn something about the world. Our research questions need to take this all into account.

Broadly, there are two ways to go about research:

  1. data-first; or
  2. question-first.

5.2.1 Data-first

When being data-first, the main issue is working out the questions that can be reasonably answered with the available data. When deciding what these are, it is useful to consider:

  1. Theory: Is there a reasonable expectation that there is something causal that could be determined? For instance, if the question involves charting the stock market, then it might be better to consider haruspicy because at least that way we would have something to eat. Questions usually need to have some plausible theoretical underpinning to help avoid spurious relationships.
  2. Importance: There are plenty of trivial questions that can be answered, but it is important not to waste our time or that of the reader. Having an important question can also help with motivation when we find ourselves in, say, the fourth straight week of cleaning data and de-bugging code. It can also make it easier to attract talented employees and funding.
  3. Availability: Is there a reasonable expectation of additional data being available in the future? This could allow us to answer related questions and turn this one paper into a research agenda.
  4. Iteration: Is this something that could be run multiple times, or is it a once-off analysis? If it is the former, then it becomes possible to start answering specific research questions and then iterate. But if we can only get access to the data once then we need to think about broader questions.

There’s a saying, sometimes attributed to Xiao-Li Meng, that all of statistics is a missing data problem. And so, paradoxically, another way to ask data-first questions is to think about which data we do not have. For instance, returning to the neonatal and maternal mortality examples discussed in, respectively, Chapters 1 and 2, the fundamental problem is that we do not have perfect and complete data about cause of death. If we did, then we could count the number of relevant deaths. Having established the missing data problem, we can take a data-driven approach by looking at the data we do have, and then ask research questions that speak to the extent that we can use that to approximate our hypothetical perfect and complete dataset.

5.2.2 Question-first

When trying to be question-first, there is the inverse issue of being concerned about data availability. The ‘FINER framework’ is used in medicine to help guide the development of research questions. It recommends asking questions that are: Feasible, Interesting, Novel, Ethical, and Relevant (Hulley 2007). Farrugia et al. (2010) build on FINER with PICOT, which recommends additional considerations: Population, Intervention, Comparison group, Outcome of interest, and Time. It can feel overwhelming trying to write out a question. One way to go about it is to ask a very specific question. Another is to decide whether we are interested in descriptive, predictive, inferential, or causal analysis. These then lead to different types of questions, for instance, descriptive analysis: ‘What happened when…?’; predictive analysis: ‘What will happen…?’; inferential: ‘Why does… happen?’; and causal: ‘What happens if…?’

Often time will be constrained, possibly in interesting ways, and these constraints can guide the specifics of the research question. If we are interested in the effect of Trump’s tweets on the stock market, then that can be done just by looking at the minutes (milliseconds?) after he tweets. But what if we are interested in the effect of a cancer drug on long term outcomes? If the effect takes 20 years, then we must either wait a while or look at people who were treated in 2000, but then we have selection effects and circumstances that differ from those of giving the drug today. Often the only reasonable thing to do is to build a statistical model, but that requires adequate sample sizes, among other things.

When answering questions, the creation of a counterfactual is usually crucial. Briefly, a counterfactual is an if-then statement in which the ‘if’ is false. Consider the example of Humpty Dumpty from Lewis Carroll’s Through the Looking-Glass (Carroll 1871).

‘What tremendously easy riddles you ask!’ Humpty Dumpty growled out. ‘Of course I don’t think so! Why, if ever I did fall off—which there’s no chance of—but if I did—’ Here he pursed his lips and looked so solemn and grand that Alice could hardly help laughing. ‘If I did fall,’ he went on, ‘The King has promised me—with his very own mouth-to-to-’

Humpty is satisfied with what would happen if he were to fall off, even though he is similarly satisfied that this would never happen. It is this comparison group that often determines the answer to a question. For instance, consider the effect of VO2 max on the outcome of a bike race. If we compare over the general population then it is an important variable, but if we only compare over elite athletes, then it is less important, because of selection.
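The role of the comparison group can be illustrated with a small simulation. The sketch below uses entirely made-up numbers: it assumes that performance is VO2 max plus unrelated noise, and that ‘elite athletes’ are the top 5 per cent of the population by VO2 max. Neither assumption comes from real data; they only show how restricting the comparison group weakens an association that is strong in the general population.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(853)  # fixed seed so the sketch is reproducible

# Assumed, made-up population: performance is VO2 max plus unrelated noise.
population = []
for _ in range(20_000):
    vo2_max = rng.gauss(40, 8)
    performance = vo2_max + rng.gauss(0, 8)
    population.append((vo2_max, performance))

# Assumption: "elite athletes" are the top 5 per cent by VO2 max.
population.sort(key=lambda person: person[0], reverse=True)
elite = population[:1_000]

r_everyone = pearson([p[0] for p in population], [p[1] for p in population])
r_elite = pearson([p[0] for p in elite], [p[1] for p in elite])

print(f"Correlation, general population: {r_everyone:.2f}")
print(f"Correlation, elite athletes only: {r_elite:.2f}")
```

Among the elite group everyone has a high VO2 max, so there is little variation left for it to explain, and the correlation with performance is noticeably weaker than in the full population.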

5.3 Writing

I had not indeed published anything before I commenced “The Professor,” but in many a crude effort, destroyed almost as soon as composed, I had got over any such taste as I might once have had for ornamented and redundant composition, and come to prefer what was plain and homely.

The Professor (Brontë 1857).

We discuss the following components: title, abstract, introduction, data, results, discussion, figures, tables, equations, and technical terms. Throughout all sections of a paper it is important that we are as brief and specific as possible.

5.3.1 Title

A title is the first opportunity that we have to engage our reader in our story. Ideally, we are able to tell our reader exactly what we found. Effective titles are critical because otherwise papers will be ignored by readers. While a title does not have to be ‘cute,’ it does need to be effective. This means it needs to make the story clear.

One example of a title that is good enough is ‘On the 2016 Brexit referendum.’ This title is useful because the reader at least knows what the paper will be about. But it is not particularly informative or enticing. A slightly better variant could be ‘On the “Vote Leave” outcome in the 2016 Brexit referendum.’ This variant adds specificity, which makes it more informative. Finally, another variant would be ‘Vote Leave outperforms in rural areas in the 2016 Brexit referendum: Evidence from a Bayesian hierarchical model.’ Here the reader knows the approach of the paper and also the main take-away.

We will consider a few examples of particularly effective titles. Hug et al. (2019) uses ‘National, regional, and global levels and trends in neonatal mortality between 1990 and 2017, with scenario-based projections to 2030: a systematic analysis.’ Here it is clear what the paper is about and the methods that are used. R. Alexander and Alexander (2021) uses ‘The Increased Effect of Elections and Changing Prime Ministers on Topics Discussed in the Australian Federal Parliament between 1901 and 2018.’ While the method used in that paper is not clear from the title, the main finding is, along with a good deal of information about what the content will be. And finally, M. J. Alexander, Kiang, and Barbieri (2018) uses ‘Trends in Black and White Opioid Mortality in the United States, 1979–2015.’

A title is often among the last aspects of a paper to be finalized. While getting through the first draft, we would typically just use a working title that is good enough to get the job done. We then refine it over the course of redrafting. The title needs to reflect the final story of the paper, and this is not usually something that we know at the start. We are interested in striking a balance between getting our reader interested enough to read the paper, and conveying enough of the content so as to be useful (Hayot 2014). We can think here of classic books, such as Macaulay’s History of England from the Accession of James the Second, or Churchill’s A History of the English-Speaking Peoples. Both are clear about what the content is, and, for their target audience, spark interest.

One specific approach is the form: ‘Exciting content: Specific content,’ for instance, ‘Returning to their roots: Examining the performance of “Vote Leave” in the 2016 Brexit referendum.’ Kennedy and Gelman (2020) provides a particularly nice example of this approach with ‘Know your population and know your model: Using model-based regression and poststratification to generalize findings beyond the observed sample,’ as does Craiu (2019) with ‘The Hiring Gambit: In Search of the Twofer Data Scientist.’ A close variant of this is ‘A question? And an answer.’ For instance, Cahill, Weinberger, and Alkema (2020) with ‘What increase in modern contraceptive use is needed in FP2020 countries to reach 75% demand satisfied by 2030? An assessment using the Accelerated Transition Method and Family Planning Estimation Model.’ As one gains experience with this variant, it becomes possible to know when it is appropriate to drop the answer part yet remain effective, such as Briggs (2021) with ‘Why Does Aid Not Target the Poorest?’ Another specific approach is ‘Specific content then broad content’ or the inverse. For instance, ‘Rurality, elites, and support for “Vote Leave” in the 2016 Brexit referendum’ or ‘Support for “Vote Leave” in the 2016 Brexit referendum, rurality and elites.’ This approach is used by Tolley and Paquet (2021) with ‘Gender, municipal party politics, and Montreal’s first woman mayor.’

5.3.2 Abstract

For a five-to-ten-page paper, a good abstract is a paragraph of three to five sentences. For a longer paper the abstract can be slightly longer. The abstract needs to specify the story of the paper, and the objective of an abstract is to convey what was done and why it matters. To do this an abstract typically touches on the context of the work, its objectives, approach, and findings.

More specifically, a good recipe for an abstract is: first sentence: specify the general area of the paper and encourage the reader; second sentence: specify the dataset and methods at a general level; third sentence: specify the headline result; and a fourth sentence about implications.

We see this pattern in a variety of abstracts. For instance, Tolley and Paquet (2021) draw in the reader with their first sentence by mentioning the election of the first woman mayor in 400 years. The second sentence is clear about what is done in the paper. The third sentence tells the reader how it is done, that is, with a survey. And the fourth sentence adds some detail. The fifth and final sentence makes the main take-away from the paper clear.

In 2017, Montreal elected Valérie Plante, the first woman mayor in the city’s 400-year history. Using this election as a case study, we show how gender did and did not influence the outcome. A survey of Montreal electors suggests that gender was not a salient factor in vote choice. Although gender did not matter much for voters, it did shape the organization of the campaign and party. We argue that Plante’s victory can be explained in part by a strategy that showcased a less leader-centric party and a degendered campaign that helped counteract stereotypes about women’s unsuitability for positions of political leadership.

Similarly, Beauregard and Sheppard (2021) make the broader environment clear within the first two sentences, along with the specific contribution of their paper to that environment. The third and fourth sentences make the data source clear and also the main findings. The fifth and sixth sentences add specificity that would be of interest to likely readers of this abstract, that is, academic political science experts. And then the final sentence makes the position of the authors clear.

Previous research on support for gender quotas focuses on attitudes toward gender equality and government intervention as explanations. We argue the role of attitudes toward women in understanding support for policies aiming to increase the presence of women in politics is ambivalent—both hostile and benevolent forms of sexism contribute in understanding support, albeit in different ways. Using original data from a survey conducted on a probability-based sample of Australian respondents, our findings demonstrate that hostile sexists are more likely to oppose increasing of women’s presence in politics through the adoption of gender quotas. Benevolent sexists, on the other hand, are more likely to support these policies than respondents exhibiting low levels of benevolent sexism. We argue this is because benevolent sexism holds that women are pure and need protection; they do not have what it takes to succeed in politics without the assistance of quotas. Finally, we show that while women are more likely to support quotas, ambivalent sexism has the same relationship with support among both women and men. These findings suggest that aggregate levels of public support for gender quotas do not necessarily represent greater acceptance of gender equality generally.

And finally, Briggs (2021) begins with a claim that seems unquestionably true. In the second sentence he then claims to have found that it is false. The third sentence specifies the extent of this claim, and the fourth sentence details how he comes to this position, before providing more detail. The final two sentences speak to broad implications and importance.

Foreign-aid projects typically have local effects, so they need to be placed close to the poor if they are to reduce poverty. I show that, conditional on local population levels, World Bank (WB) project aid targets richer parts of countries. This relationship holds over time and across world regions. I test five donor-side explanations for pro-rich targeting using a pre-registered conjoint experiment on WB Task Team Leaders (TTLs). TTLs perceive aid-receiving governments as most interested in targeting aid politically and controlling implementation. They also believe that aid works better in poorer or more remote areas, but that implementation in these areas is uniquely difficult. These results speak to debates in distributive politics, international bargaining over aid, and principal-agent issues in international organizations. The results also suggest that tweaks to WB incentive structures to make ease of project implementation less important may encourage aid to flow to poorer parts of countries.

The journal Nature provides a guide for constructing an abstract. They recommend a structure that results in an abstract of six parts that add up to around 200 words.

  1. A basic introductory sentence that is comprehensible to a wide audience.
  2. A more detailed sentence about background that is relevant to likely readers.
  3. A sentence that states the general problem.
  4. Sentences that summarize and then explain the main results.
  5. A sentence about general context.
  6. And finally, a sentence about the broader perspective.
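These length guides are easy to check mechanically. As a rough sketch, the following Python counts the sentences and words in an abstract, using the Tolley and Paquet (2021) abstract quoted above as the example. The function name `abstract_summary` is hypothetical, and the sentence splitting is deliberately naive: it would miscount abstracts that contain abbreviations such as ‘i.e.’ or ‘et al.’

```python
import re


def abstract_summary(abstract):
    """Count sentences and words using deliberately simple rules."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]
    words = abstract.split()
    return {"sentences": len(sentences), "words": len(words)}


example_abstract = (
    "In 2017, Montreal elected Valérie Plante, the first woman mayor in the "
    "city’s 400-year history. Using this election as a case study, we show "
    "how gender did and did not influence the outcome. A survey of Montreal "
    "electors suggests that gender was not a salient factor in vote choice. "
    "Although gender did not matter much for voters, it did shape the "
    "organization of the campaign and party. We argue that Plante’s victory "
    "can be explained in part by a strategy that showcased a less "
    "leader-centric party and a degendered campaign that helped counteract "
    "stereotypes about women’s unsuitability for positions of political "
    "leadership."
)

summary = abstract_summary(example_abstract)
print(summary)
if not 3 <= summary["sentences"] <= 5:
    print("Outside the three-to-five sentence guide for a short paper.")
if summary["words"] > 200:
    print("Longer than the roughly 200 words that Nature suggests.")
```

A check like this only flags length. Whether the abstract conveys what was done and why it matters still has to be judged by reading it.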

5.3.3 Introduction

An introduction needs to be self-contained and convey everything that a reader needs to know. It is important to recognize that we are not writing a mystery story. Instead, we want to give away the most important points in the introduction. For a six-page paper, an introduction may be two or three paragraphs of main content. Hayot (2014, 90) describes the goal of an introduction as being to engage the reader, locate them in some discipline and background, and then tell them what happens in the rest of the paper. It is completely reader-focused.

The introduction should set the scene and give the reader some background. For instance, we typically start a little broader. This provides some context to the paper. We then describe how the paper fits into that context, and give some high-level results, especially focused on the one key result that is the main part of the story. We provide more detail here than we provided in the abstract, but not the full extent. And the final bit of main content is to broadly discuss next steps. Finally, we finish the introduction with an additional short final paragraph that highlights the structure of the paper.

As an example (with made-up details):

The UK Conservative Party has always done well in rural electorates. And the 2016 Brexit vote was no different, with a significant difference in support between rural and urban areas. But even by the standard of rural support for conservative issues, support for ‘Vote Leave’ was unusually strong, with ‘Vote Leave’ being most heavily supported in the East Midlands and the East of England, while the strongest support for ‘Remain’ was in Greater London.

In this paper we look at why the performance of ‘Vote Leave’ in the 2016 Brexit referendum was so correlated with rurality. We construct a model in which support for ‘Vote Leave’ at a voting area level, is explained by the number of farms in the area, the average internet connectivity, and the median age. We find that as the median age of an area increases, the likelihood that an area supported ‘Vote Leave’ decreases by 14 percentage points. Future work could look at the effect of having a Conservative MP which would allow a more nuanced understanding of these effects.

The remainder of this paper is structured as follows: Section 2 discusses the data, Section 3 discusses the model, Section 4 presents the results, and finally Section 5 discusses our findings and some weaknesses.

The introduction needs to be self-contained and tell the reader everything that they need to know. A reader should be able to read only the introduction and come away with an accurate picture of all the major aspects of the whole paper. It would be rare to include graphs or tables in the introduction. An introduction always closes with the structure of the paper. For instance (and this is just a rough guide), an introduction for a 10-page paper should probably be about 3 or 4 paragraphs, or 10 per cent, but it depends on specifics.

5.3.4 Data

Robert Caro, Lyndon B. Johnson’s biographer, describes the importance of conveying ‘a sense of place’ when writing biography (Caro 2019, 141). This he defines as ‘the physical setting in which a book’s action is occurring: to see it clearly enough, in sufficient detail, so that he feels as if he himself were present while the action is occurring.’ He provides the following example:

When Rebekah walked out the front door of that little house, there was nothing—a roadrunner streaking behind some rocks with something long and wet dangling from his beak, perhaps, or a rabbit disappearing around a bush so fast that all she really saw was the flash of a white tail—but otherwise nothing. There was no movement except for the ripple of the leaves in the scattered trees, no sound except for the constant whisper of the wind… If Rebekah climbed, almost in desperation, the hill in the back of the house, what she saw from its crest was more hills, an endless vista of hills, hills on which there was visible not a single house… hills on which nothing moved, empty hills with, above them, empty sky; a hawk circling silently overhead was an event. But most of all, there was nothing human, no one to talk to.

Caro (2019, 146)

How thoroughly we can imagine the circumstances of Rebekah Baines Johnson (Lyndon B. Johnson’s mother). When writing our papers, we need to achieve that same sense of place for our data as Caro is able to provide for the Hill Country. We do this by being as explicit as possible about showing our dataset. We typically have a whole section about it and this is designed to show the reader, as closely as possible, the actual data that underpin our story.

When writing the data section, we are beginning our answer to the critical question about our claims, which is, how is it possible to know this? (McPhee 2017, 78). The preeminent example of a data section is provided by Doll and Hill (1950), who are interested in the effect of smoking, comparing control and treatment groups. They begin by clearly describing their dataset. They then use tables to display relevant cross-tabs, and graphs to contrast their groups.

In the data section we need to thoroughly discuss the variables in the dataset that we are using. If there are other datasets that could have been used, but were not, then these should be mentioned and our choices justified. If variables were constructed or combined, then this process and motivation should be explained.

To get a sense of the data, it is important that the reader is able to understand what the data that underpin the results look like. This means that we should graph the actual data that are used in our analysis, or as close to them as possible. And we should also include tables of summary statistics. If the dataset was created from some other source, then it can also help to include an example of that original source. For instance, if the dataset was created from survey responses then the survey form should be included, potentially in an appendix.

The data section will also have figures and tables. Here some judgment is required. While it is important that the reader has the opportunity to understand the details, it may be that some are better placed in an appendix. Figures and tables are a critical aspect of convincing people of a story. In a graph we can show the data and then let the reader decide for themselves. And using a table, we can more easily summarize our dataset. At the very least, every variable needs to be shown in a graph and summarized in a table. Figures and tables should be numbered and then cross-referenced in the text, for instance, “Figure 1 shows…,” “Table 1 describes….” For every graph and table there should be extensive accompanying text that describes their main aspects, and adds additional detail.

We discuss the components of graphs and tables, including titles and labels, in Chapter 6. But here we will discuss captions, as they sit between the text and the graph or table. Captions need to be informative and self-contained. As Cleveland (1994, 57) says, the ‘interplay between graph, caption, and text is a delicate one’; however, the reader should be able to read only the caption and understand what the graph or table shows. A caption that is two or three lines long is not necessarily inappropriate. And all aspects of the graph or table should be explained. For instance, consider Figures 5.1 and 5.2 from Bowley (1901, 151), which are both exceptionally clear and self-contained.

Figure 5.1: Example of a well-captioned figure

Figure 5.2: Example of a well-captioned table

The choice between a table and a graph comes down to how much information is to be conveyed. In general, if there is specific information that should be considered, such as a summary statistic, then a table is a good option, while if we are interested in the reader making comparisons and understanding trends then a graph is a good option (Gelman, Pasarica, and Dodhia 2002).

Finally, if there is relevant literature then we would discuss it throughout the paper as appropriate. For instance, literature relevant to the data should be discussed in this section, while literature relevant to the model, results, or discussion should be mentioned in those sections. It is rarely necessary to have a separate literature review section.

5.3.5 Model

We will often build a statistical model that we will use to explore the data, and we often have a specific section about this. At a minimum it is important to clearly specify equation/s that describe the model being used, and explain their components with plain language and cross-references.

The model section typically begins with the model being written out, explained, and justified. Depending on the expected reader, some background may be needed. After specifying the model with appropriate mathematical notation and cross-referencing it, the components of the model are then typically defined and explained. It is especially important to define each aspect of the notation. This helps convince the reader that the model was well-chosen and enhances the credibility of the paper. The model’s variables should correspond to those that were discussed in the data section, making a clear link between the two sections.
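As a minimal sketch of what writing out and defining a model can look like, consider a simple linear regression. The outcome, predictor, and parameter names here are hypothetical and chosen only for illustration; they do not come from any particular paper.

```latex
\begin{align}
y_i \mid \mu_i, \sigma &\sim \mbox{Normal}(\mu_i, \sigma) \\
\mu_i &= \beta_0 + \beta_1 x_i
\end{align}
```

Here $y_i$ is the outcome for observation $i$, $x_i$ is the predictor for that observation, $\beta_0$ is the intercept, $\beta_1$ is the change in the expected outcome associated with a one-unit change in the predictor, and $\sigma$ is the standard deviation of the outcome around its expected value $\mu_i$. Every symbol that appears in the equations is defined in the text that follows them, and $y_i$ and $x_i$ would correspond to variables already described in the data section.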

There should be some discussion of how features enter the model and why. For instance, why use ages rather than age-groups, why does state/province have a levels effect, and why is gender a categorical variable? In general, we are trying to convey a sense that this is the model for the situation. We want the reader to understand how the aspects that were discussed in the data section assert themselves in the modelling decisions that were made.

The model section should close with some discussion of the assumptions that underpin the model, and a brief discussion of alternative models, or variants, with their strengths and weaknesses made clear. It should be clear in the reader’s mind why it was this model that was chosen.

At some point in this section, it is usually appropriate to specify the software that was used to run the model, and to provide some evidence of thought about the circumstances in which the model may not be appropriate. The latter point would typically be expanded on in the discussion. And there should be evidence of model validation and checking, model convergence, and/or diagnostic issues. Again, there is a balance needed here, and some of this content may be more appropriately placed in appendices.

When technical terms are used, they should be briefly explained in plain language for readers who might not be familiar with them. For instance, M. Alexander (2019b) integrates an explanation of the Gini coefficient that brings the reader along.

To look at the concentration of baby names, let’s calculate the Gini coefficient for each country, sex and year. The Gini coefficient measures dispersion or inequality among values of a frequency distribution. It can take any value between 0 and 1. In the case of income distributions, a Gini coefficient of 1 would mean one person has all the income. In this case, a Gini coefficient of 1 would mean that all babies have the same name. In contrast, a Gini coefficient of 0 would mean names are evenly distributed across all babies.
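To make the quoted explanation concrete, the following sketch computes a Gini coefficient from counts of babies per name. It uses one common rank-based formula; this is an assumption for illustration and not necessarily the calculation used by M. Alexander (2019b). The function name `gini` and the counts are hypothetical. Note that with a finite number of names the largest value this formula can return is (n - 1)/n, which approaches 1 as the number of names grows.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 means perfectly even)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive count")
    # Rank-weighted sum over the counts sorted in ascending order
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical counts of babies per name, invented for illustration
even_names = [25, 25, 25, 25]        # names spread evenly across babies
concentrated_names = [97, 1, 1, 1]   # almost every baby has the same name

print(round(gini(even_names), 2))
print(round(gini(concentrated_names), 2))
```

The evenly spread counts give a coefficient of zero, while the concentrated counts give a value much closer to one, matching the interpretation in the quoted passage.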

5.3.6 Results

Two excellent examples of results sections are provided by Kharecha and Hansen (2013) and Kiang et al. (2021). In the results section, we want to communicate the outcomes of the model clearly, without much discussion of implications. The results section likely requires summary statistics, tables, and graphs. Each of those aspects should be cross-referenced and have text associated with them that details what is seen in them. This section should strictly relay results; that is, we are interested in what the results are, rather than what they mean.

This section would also typically include table/s of coefficient estimates based on the modelling that we used to further explore the data. Various features of the estimates should be discussed, and differences between the models explained. It may be that different subsets of the data are considered separately. Again, all graphs and tables need to have plain language text accompany them. A rough guide is that the amount of text should be at least equal to the amount of space taken up by the tables and graphs. For instance, if a full page is used to display a table of coefficient estimates, then that should be cross-referenced and accompanied by at least a full page of text about that table.

5.3.7 Discussion

A discussion section may be the final section of a paper and would typically have four or five sub-sections.

The discussion section would typically begin with a sub-section that comprises a one- or two-paragraph summary of what was done in the paper. This would be followed by two or three sub-sections that are devoted to the key things that we learn about the world from this paper. For instance, there are typically a few implications that come from the modelling results. These few sub-sections are the main opportunity to justify or detail the implications of the story being told in the paper. Typically, these sub-sections do not introduce new graphs or tables, but are instead focused on what we learn from those that were introduced in earlier sections. It may be that some of the results are discussed in relation to what others have found, and an attempt could be made to reconcile any differences here.

Following these sub-sections of what we learn about the world, we would typically have a sub-section focused on some of the weaknesses of what was done. This could concern aspects such as the data that were used, the approach, and the model. And the final sub-section is typically a few paragraphs that specify what is left to learn, and how future work could proceed.

In general, we would expect this section to take at least twenty-five per cent of the total paper. For instance, in an eight-page paper, we would expect at least two pages of discussion.

5.3.8 Brevity, typos, and grammar

Brevity is important. Partly this is because we write for the reader, and the reader has other priorities. But it is also because, as the writer, it forces us to consider what our most important points are, how we can best support them, and where our arguments are weakest. Jean Chrétien, the former Canadian Prime Minister, describes how ‘[t]o allow me to get to the heart of an issue quickly, I asked the officials to summarize their documents in two or three pages and attach the rest of the materials as background information. I soon discovered that this was a problem only for those who didn’t really know what they were talking about.’ (Chrétien 2007, 105).

This experience is not unique to Canada. For instance, Oliver Letwin, the former British Conservative Cabinet member, describes there as being ‘a huge amount of terrible guff, at huge, colossal, humongous length coming from some departments’ and how he asked ‘for them to be one quarter of the length’ (Hughes and Rutter 2016). He found that the departments were able to accommodate this request without losing anything important.

This experience is also not new. For instance, Churchill asked for brevity during the Second World War, saying ‘the discipline of setting out the real points concisely will prove an aid to clearer thinking.’ And the letter from Szilard and Einstein to FDR that was the catalyst for the Manhattan Project was only two pages.

This experience is also not unique to academia. For instance, one of the foundations of Amazon, which is one of the world’s largest companies, is clear writing. Specifically, instead of PowerPoint presentations, Jeff Bezos asked for ‘[w]ell structured, narrative text… [which] forces better thought and better understanding of what’s more important than what, and how things are related.’

Zinsser (1976) goes further and describes ‘the secret of good writing’ being ‘to strip every sentence to its cleanest components.’ Every sentence should be simplified to its essence. And every word that does not contribute should be removed.

Typos and other grammatical mistakes affect the credibility of claims. If the reader cannot trust us to use a spell-checker, then why should they trust us to use logistic regression? Microsoft Word and Google Docs are useful here for their spell-checkers: copy/paste from R Markdown, look for the red and green lines, and fix them in R Markdown.

We are not worried about the n-th degree of grammatical content. Instead, we are interested in the grammar and sentence structure that occur in conversational language use (S. King 2000, 118). The way to develop that comfort is by reading a lot, and by asking others to read your work.

Unnecessary words, typos, and grammatical issues should be removed from papers with a fanatical zeal.

5.3.9 Rules

A variety of authors have established rules for writing, including famously, Orwell (1946), which were reimagined by The Economist (2013). A further reimagining, focused on telling stories with data, could be:

  • Focus on the reader and their needs. Everything else is commentary.
  • Establish a logical structure and rely on that structure to tell the story.
  • Write a first draft as quickly as possible.
  • Re-write that extensively and without favor.
  • Aim to be concise and direct. Remove as many words as possible.
  • Use words precisely. Stock-markets rise or fall, not improve or worsen.
  • Use short sentences where possible.
  • Avoid jargon.
  • Write as though your work will be on the front page of a newspaper. Because it could be.

5.4 Exercises and tutorial

5.4.1 Exercises

  1. According to G. King (2006), what is the key task of subheadings (pick one)?
    1. Enable a reader who randomly falls asleep but keeps turning pages to know where they are.
    2. Be broad and sweeping so that a reader is impressed by the importance of the paper.
    3. Use acronyms to integrate the paper into the literature.
  2. According to G. King (2006), what is the maximum length of an abstract (pick one)?
    1. Two hundred words.
    2. Two hundred and fifty words.
    3. One hundred words.
    4. One hundred and fifty words.
  3. According to G. King (2006), in a paper, raw computer output should be (pick one)?
    1. Commented out.
    2. Not included.
    3. Included.
  4. According to G. King (2006), if our standard error was 0.05 then which of the following specificity for a coefficient would be silly (select all that apply)?
    1. 2.7182818
    2. 2.718282
    3. 2.72
    4. 2.7
    5. 2.7183
    6. 2.718
    7. 3
    8. 2.71828
  5. When should we try not to use the ‘delete’ key (pick one)?
    1. While writing the first draft.
    2. While writing the second draft.
    3. While writing the third draft.
    4. The ‘delete’ key should always be used.
  6. How long should a first draft of a five-to-ten-page paper take to write (pick one)?
    1. One hour
    2. One day
    3. One week
    4. One month
  7. What is a key aspect of the re-drafting process (select all that apply)?
    1. Going through it with a red pen to remove unneeded words.
    2. Printing the paper and reading a physical copy.
    3. Cutting and pasting to enhance flow.
    4. Reading it aloud.
    5. Exchanging it with others.
  8. What are three features of a good research question (write a paragraph or two)?
  9. What are some of the challenges of being ‘data-first’ (write a paragraph or two)?
  10. What are some of the challenges of being ‘question-first’ (write a paragraph or two)?
  11. What is a counterfactual (pick one)?
    1. If-then statements in which the if does not happen.
    2. If-then statements in which the if happens.
    3. Statements that are either true or false.
    4. Statements that are neither true nor false.
  12. Which of the following is the best title (pick one)?
    1. “Problem Set 1”
    2. “Unemployment”
    3. “Examining England’s Unemployment (2010-2020)”
    4. “England’s Unemployment Increased between 2010 and 2020”
  13. Which of the following is the best title (pick one)?
    1. “Problem Set 2”
    2. “Standard errors”
    3. “On standard errors with small samples”
  14. Which word/s can be removed from the following sentence without affecting its meaning (select all that apply)? ‘Like many parents, when our children were born, one of the first things that my wife and I did regularly was read stories to them.’
    1. first
    2. regularly
    3. stories
  15. Please write a new title for either Barron et al. (2018) or Fourcade and Healy (2017).
  16. Please write a new title for the first article from the list of articles from The New Yorker that you read.
  17. Please write a new title for the other article from the list of articles from The New Yorker that you read.
  18. Please write a new four-sentence abstract for Chambliss (1989).
  19. Please write a new four-sentence abstract for Doll and Hill (1950) or Student (1908) or Kharecha and Hansen (2013).
  20. Please write an abstract for the first article from the list of ‘miscellaneous’ articles that you read.
  21. Please write an abstract for the other article from the list of ‘miscellaneous’ articles that you read.
  22. Using only the 1000-most popular words in the English language – https://xkcd.com/simplewriter/ – re-write the following so that it retains its original meaning:

When using data, we try to tell a convincing story. It may be as exciting as predicting elections, as banal as increasing internet advertising click rates, as serious as finding the cause of a disease, or as fun as forecasting basketball games. In any case the key elements are the same.

5.4.2 Tutorial

Caro (2019, xii) writes at least one thousand words almost every day. In this tutorial we will write every day for a week. Begin by picking seven of the well-written papers specified above. Each day complete the following tasks:

  • Transcribe, by writing each word yourself, the entire introduction.
  • (This idea comes from McPhee (2017, 186).) Re-write the introduction so that it is five lines (or 10 per cent, whichever is less) shorter.
  • Transcribe, by writing each word yourself, the abstract.
  • Re-write a new, four-sentence, abstract for the paper.
  • (This idea comes from Chelsea Parlett-Pelleriti.) Write a second version of your new abstract using only the one-thousand most popular words in the English language: https://xkcd.com/simplewriter/.
  • Detail three points about the way the paper is written that you like.
  • Detail one point about the way the paper is written that you do not like.

Submit all seven papers.