STATUS: Under construction.
- (The) Economist, 2013, ‘Johnson: Those six little rules,’ Prospero, 29 July 2013, available at: https://www.economist.com/prospero/2013/07/29/johnson-those-six-little-rules.
- Alexander, Monica, 2019, ‘The concentration and uniqueness of baby names in Australia and the US,’ https://www.monicaalexander.com/posts/2019-20-01-babynames/. (Look at how Monica explains concepts, especially the Gini coefficient, in a way that you can understand even if you’ve never heard of it before.)
- Bronner, Laura, 2020, ‘Quant Editing,’ http://www.laurabronner.com/quant-editing. (Read these points and evaluate your own writing against them. It’s fine to not comply with them if you have a good reason, but you need to know that you’re not complying with them).
- Girouard, Dave, 2020, ‘A Founder’s Guide to Writing Well,’ First Round Review, 4 August, https://firstround.com/review/a-founders-guide-to-writing-well/.
- Graham, Paul, 2020, ‘How to Write Usefully,’ http://paulgraham.com/useful.html. (Graham is good at writing for a programmer, but if you have a similar background then you may like this.)
- Healy, Kieran, 2019, Data Visualization: A Practical Introduction, Princeton University Press, Chapters 3, 4, and 7, https://socviz.co/.
- Hodgetts, Paul, 2020, ‘The ggfortify Package,’ 31 December, https://www.hodgettsp.com/posts/r-ggfortify/.
- Wickham, Hadley, and Garrett Grolemund, 2017, R for Data Science, Chapter 28, https://r4ds.had.co.nz/.
- Zinsser, William, 1976, On Writing Well. (Any edition is fine. This book is included because if you’re serious about improving your writing then you should start with this book. It only takes a few hours to read. You’ll go on to other books, but start with this one.)
- Zinsser, William, 2009, ‘Writing English as a Second Language,’ Lecture, Columbia Graduate School of Journalism, 11 August, https://theamericanscholar.org/writing-english-as-a-second-language/. (I’m realistic enough to realise that requiring a book, even though I’ve said it’s great and it’s short, is a bit of a stretch. If you really don’t want to commit to reading the Zinsser, then please at least read this ‘crib notes’ version of it.)
- Kuriwaki, Shiro, 2020, ‘Making maps in R with sf,’ 1 March, freely available at: https://vimeo.com/394800836.
- (The) Economist, 1991, ‘The Economist Style Guide,’ Twelfth edition. (Any edition is fine. Pick a point or two each day and think about how it relates to your own writing.)
- Cochrane, John H., 2005, ‘Writing Tips for Ph. D. Students,’ https://faculty.chicagobooth.edu/john.cochrane/research/papers/phd_paper_writing.pdf. (This is aimed at academic research papers, but parts are still broadly relevant. And if you’re going into academia then this is very relevant.)
- Codrey, Laura, 2013, ‘Churchill’s call for brevity,’ 17 October, https://blog.nationalarchives.gov.uk/churchills-call-for-brevity/.
- Engel, Claudia A, 2019, Using Spatial Data with R, 11 February, Chapter 3 Making Maps in R, freely available at: https://cengel.github.io/R-spatial/mapping.html.
- Five Thirty Eight, 2020, Pick almost any article in their sports (https://fivethirtyeight.com/sports/) or politics (https://fivethirtyeight.com/politics/) sections. (The people at 538 write beautifully. Look at how their titles tell you exactly what is going on, or what they found. Look at how nicely their first paragraphs motivate you to read the rest of the article. Why am I reading about BYU basketball when I’m indifferent to both BYU and college basketball? Because that title and first paragraph hooked me.)
- Graham, Paul, 2005, ‘Writing, Briefly,’ http://paulgraham.com/writing44.html.
- Lovelace, Robin, Jakub Nowosad, Jannes Muenchow, 2020, Geocomputation with R, 29 March, Chapter 8, Making maps with R, freely available at: https://geocompr.robinlovelace.net/adv-map.html.
- Patrick, Cameron, 2019, ‘Plotting multiple variables at once using ggplot2 and tidyr,’ 26 November, https://cameronpatrick.com/post/2019/11/plotting-multiple-variables-ggplot2-tidyr/.
- Patrick, Cameron, 2020, ‘Making beautiful bar charts with ggplot,’ 15 March, https://cameronpatrick.com/post/2020/03/beautiful-bar-charts-ggplot/.
- Shapiro, Jesse M., ‘Four Steps to an Applied Micro Paper,’ https://www.brown.edu/Research/Shapiro/pdfs/foursteps.pdf. (This is mostly recommended for the part about ‘the robot’ with regard to your data section.)
- Shapiro, Julian, ‘Writing Well,’ https://www.julian.com/guide/write/intro.
- Strunk, William Jr., 1959, ‘The Elements of Style.’ (Any edition is fine. Eventually you’ll move beyond this, but it’s important to know the rules before you break them.)
- Vanderplas, Susan, Dianne Cook, and Heike Hofmann, 2020, ‘Testing Statistical Charts: What Makes a Good Graph?’ Annual Review of Statistics and Its Application, https://www.annualreviews.org/doi/abs/10.1146/annurev-statistics-031219-041252
Examples of well-written papers
- Barron, Alexander TJ, Jenny Huang, Rebecca L. Spang, and Simon DeDeo. “Individuals, institutions, and innovation in the debates of the French Revolution.” Proceedings of the National Academy of Sciences 115, no. 18 (2018): 4607-4612.
- Chambliss, Daniel F. “The Mundanity of Excellence: An Ethnographic Report on Stratification and Olympic Swimmers.” Sociological Theory 7, no. 1 (1989): 70-86. doi:10.2307/202063.
- Joyner, Michael J. “Modeling: optimal marathon performance on the basis of physiological factors.” Journal of Applied Physiology, 70, no. 2 (1991): 683-687.
- Kharecha, Pushker A., and James E. Hansen, 2013, ‘Prevented mortality and greenhouse gas emissions from historical and projected nuclear power,’ Environmental science & technology, 47, no. 9, pp. 4889-4895.
- Samuel, Arthur L., 1959, ‘Some studies in machine learning using the game of checkers,’ IBM Journal of research and development, 3, no. 3, pp. 210-229.
- Wardrop, Robert L., 1995, ‘Simpson’s paradox and the hot hand in basketball,’ The American Statistician, 49, no. 1, 24-28.
- Show the reader your raw data, or as close as you can come to it.
- Start with either a scatter plot or a bar chart.
- Writing efficiently and effectively is a requirement if you want your work to be convincing.
- Don’t waste your reader’s time.
- A good title says what the paper is about, a great title says what the paper found.
- For a six-page paper, a good abstract is a three to five sentence paragraph. For a longer paper your abstract can be slightly longer.
- Thinking of maps as a (often fiddly, but strangely enjoyable) variant of a usual ggplot.
- Broadening the data that we make available via interactive maps, while still telling a clear story.
- Becoming comfortable with (and excited about) creating static maps.
[T]he duty of a scientist is not only to find new things, but to communicate them successfully in at least three forms: 1) Writing papers and books. 2) Prepared public talks. 3) Impromptu talks.
In order to convince someone of your story, your paper must be well-written, well-organized, and easy to follow. It should flow easily from one point to the next. It should have proper sentence structure, spelling, vocabulary, and grammar. Each point should be articulated clearly and completely without being overly verbose. Papers should demonstrate your understanding of the topics you are writing about and your confidence in discussing the terms, techniques and issues that are relevant. References must be included and properly cited because this enhances your credibility.
People who need to write: founders, VCs, lawyers, software engineers, designers, painters, data scientists, musicians, filmmakers, creative directors, physical trainers, teachers, writers. Learn to write.
This is great advice. Writing well has done just as much for me as knowing how to code. I’d add that if you’re intimidated by writing, start a blog and write often about something you’re interested in. You’ll get better. At least that’s what I’ve done for the past 10 years. :)
This chapter is about writing. By the end of it you will have a better idea of how to write short, detailed, quantitative papers that communicate exactly what you want them to and don’t waste the time of your reader.
One critical part of telling stories with data is that it is ultimately the data that has to convince the reader. You’re the medium, but the data are the message. To that end, the easiest way to try to convince someone of your story is to show them the data that allowed you to come to that story. Plot your raw data, or as close to it as possible.
ggplot is a fantastic tool for doing this, but there is a lot to that package and so it can be difficult to know where to start. My recommendation is that you start with either a scatter plot or a bar chart. What is critical is that you show the reader your raw data. These notes run through how to do that. They then discuss some more advanced options, but the important thing is that you show the reader your raw data (or as close to it as you can). Students seem to get confused about what ‘raw’ means; I’m using it to refer to as close to the original dataset as possible, so no sums, or averages, etc, if possible. Sometimes your data are too dispersed for that or you’ve got other constraints, so there needs to be an element of manipulation. The main point is that you, at the very least, need to plot the data that you’re going to be modelling. If you are dealing with larger datasets then just take a 10/1/0.1/etc per cent sample.
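The suggestion to sample larger datasets can be sketched as follows. This is only an illustration: the `large_data` tibble is hypothetical and simulated here, and the 1 per cent figure is one of the options mentioned above rather than a recommendation.

```r
library(dplyr)
library(ggplot2)
library(tibble)

set.seed(853)

# Hypothetical large dataset, simulated purely for illustration
large_data <- tibble(
  x = rnorm(n = 100000),
  y = 2 * x + rnorm(n = 100000)
)

# Take a 1 per cent random sample so the raw points can still be plotted
sampled_data <- large_data %>%
  slice_sample(prop = 0.01)

# Plot the sampled raw data; transparency helps where points overlap
sampled_data %>%
  ggplot(mapping = aes(x = x, y = y)) +
  geom_point(alpha = 0.3) +
  theme_minimal()
```

The sample is random, so setting a seed keeps the graph reproducible.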
Source: YouTube screenshot.
Graphs are critical to tell a compelling story. And the most important thing with your graphs is to plot your raw data. Again: Plot. Your. Raw. Data.
Figure 6.1 provides invaluable advice (thank you to Thomas William Rosenthal).
Let’s look at a somewhat fun example from the datasauRus package (Locke and D’Agostino McGowan 2018).
library(datasauRus)
library(tidyverse)
# Code from: https://juliasilge.com/blog/datasaurus-multiclass/
datasaurus_dozen %>%
  filter(dataset %in% c("dino", "star", "away", "bullseye")) %>%
  group_by(dataset) %>%
  summarise(across(c(x, y), list(mean = mean, sd = sd)),
            x_y_cor = cor(x, y)) %>%
  ungroup()
#> # A tibble: 4 × 6
#>   dataset  x_mean  x_sd y_mean  y_sd x_y_cor
#>   <chr>     <dbl> <dbl>  <dbl> <dbl>   <dbl>
#> 1 away       54.3  16.8   47.8  26.9 -0.0641
#> 2 bullseye   54.3  16.8   47.8  26.9 -0.0686
#> 3 dino       54.3  16.8   47.8  26.9 -0.0645
#> 4 star       54.3  16.8   47.8  26.9 -0.0630
And despite these similarities at a summary statistic level, they’re actually very different, well, beasts, when you plot the raw data.
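The code for that plot is not shown here, so the following is a minimal sketch of one way to draw it. The choice of facets and colours is an assumption, not necessarily how the original figure was made.

```r
library(datasauRus)
library(dplyr)
library(ggplot2)

# Keep the same four datasets that were summarised above
datasaurus_subset <- datasaurus_dozen %>%
  filter(dataset %in% c("dino", "star", "away", "bullseye"))

# Plot the raw points, one panel per dataset
datasaurus_subset %>%
  ggplot(mapping = aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  facet_wrap(vars(dataset), nrow = 2, ncol = 2) +
  theme_minimal() +
  labs(colour = "Dataset")
```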
Bar charts are useful when you have one variable that you want to focus on. Hint: you almost always have one variable that you want to focus on. Hence, you should almost always include at least one (and likely many) bar charts. Bar charts go by a variety of names, depending on their specifics. I recommend the R Studio Data Viz Cheat Sheet.
To get started, let’s simulate some data.
library(tidyverse)
set.seed(853)
number_of_observation <- 10000
example_data <- tibble(
  person = c(1:number_of_observation),
  smoker = sample(x = c("Smoker", "Non-smoker"),
                  size = number_of_observation,
                  replace = TRUE),
  age_died = runif(number_of_observation, min = 0, max = 100) %>% round(digits = 0),
  height = sample(x = c(50:220),
                  size = number_of_observation,
                  replace = TRUE),
  num_children = sample(x = c(0:5),
                        size = number_of_observation,
                        replace = TRUE,
                        prob = c(0.1, 0.2, 0.40, 0.15, 0.1, 0.05))
)
First, let’s have a look at the data.
head(example_data)
#> # A tibble: 6 × 5
#>   person smoker     age_died height num_children
#>    <int> <chr>         <dbl>  <int>        <int>
#> 1      1 Smoker           55     80            3
#> 2      2 Non-smoker       54     78            2
#> 3      3 Non-smoker       84    109            1
#> 4      4 Smoker           75    114            4
#> 5      5 Smoker           32    135            1
#> 6      6 Smoker           37    220            0
Now let’s plot the age distribution. Based on our simulated data, we’re expecting a fairly uniform plot.
example_data %>%
  ggplot(mapping = aes(x = age_died)) +
  geom_bar()
Now let’s make it look a little better. There are themes that are built into ggplot, or you can install other themes from other packages, or you can edit aspects yourself. I’d recommend starting with the ggthemes package for some fun ones, but I tend to just use classic or minimal. Remember that you must always refer to your graphs in your text (Figure 6.2).
example_data %>%
  ggplot(mapping = aes(x = age_died)) +
  geom_bar() +
  theme_minimal() +
  labs(x = "Age died",
       y = "Number",
       title = "Number of people who died at each age",
       caption = "Source: Simulated data.")
We may want to facet by some variable, in this case whether the person is a smoker (Figure 6.3).
example_data %>%
  ggplot(mapping = aes(x = age_died)) +
  geom_bar() +
  theme_minimal() +
  facet_wrap(vars(smoker)) +
  labs(x = "Age died",
       y = "Number",
       title = "Number of people who died at each age, by whether they smoke",
       caption = "Source: Simulated data.")
Alternatively, we may wish to colour by that instead (Figure 6.4). I’ll filter to just a handful of age-groups to keep it tractable.
example_data %>%
  filter(age_died < 25) %>%
  ggplot(mapping = aes(x = age_died, fill = smoker)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  labs(x = "Age died",
       y = "Number",
       fill = "Smoker",
       title = "Number of people who died at each age, by whether they smoke",
       caption = "Source: Simulated data.")
It’s important to recognise that a box plot hides the full distribution of a variable. Unless you need to communicate the general distribution of many variables at once, you should not use one. The same box plot can apply to very different distributions.
Often, we are also interested in the relationship between two series. We’ll do that with a scatter plot. A scatter plot is almost always your best choice (Weissgerber et al. 2015). In this case, let’s simulate some data, say years of education and income.
set.seed(853)
number_of_observation <- 500
scatter_data <- tibble(
  years_of_education = runif(n = number_of_observation, min = 10, max = 25),
  error = rnorm(n = number_of_observation, mean = 0, sd = 10000)
) %>%
  mutate(income = years_of_education * 5000 + error,
         income = if_else(income < 0, 0, income))
head(scatter_data)
#> # A tibble: 6 × 3
#>   years_of_education   error income
#>                <dbl>   <dbl>  <dbl>
#> 1               15.4 -13782. 63180.
#> 2               11.8   7977. 66985.
#> 3               17.3  -9787. 76498.
#> 4               14.7  12999. 86689.
#> 5               10.6  -1500. 51302.
#> 6               16.1   1911. 82202.
Now let’s look at income as a function of years of education (Figure 6.5).
scatter_data %>%
  ggplot(mapping = aes(x = years_of_education, y = income)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Years of education",
       y = "Income",
       title = "Relationship between income and years of education",
       caption = "Source: Simulated data.")
Box plots are almost never appropriate because they hide the distribution of data. To see this, consider some data from a beta distribution.
Then compare the box plots.
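The code for this comparison is not shown, so here is a minimal sketch of the idea. The shape parameters, sample sizes, and the name `beta_data` are assumptions chosen for illustration: one sample piles up near 0 and 1, while the other has a single peak in the middle.

```r
library(dplyr)
library(ggplot2)
library(tibble)

set.seed(853)

# Two simulated samples from beta distributions (parameters are illustrative)
beta_data <- tibble(
  distribution = rep(c("Beta(0.5, 0.5)", "Beta(2, 2)"), each = 1000),
  value = c(rbeta(n = 1000, shape1 = 0.5, shape2 = 0.5),
            rbeta(n = 1000, shape1 = 2, shape2 = 2))
)

# Histograms show the shapes: two peaks at the edges versus one in the middle
beta_data %>%
  ggplot(mapping = aes(x = value)) +
  geom_histogram(binwidth = 0.05) +
  facet_wrap(vars(distribution)) +
  theme_minimal()

# Box plots of the same data: both are symmetric boxes centred near 0.5,
# and nothing in them reveals that one sample has two peaks
beta_data %>%
  ggplot(mapping = aes(x = distribution, y = value)) +
  geom_boxplot() +
  theme_minimal()
```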
If we’re interested in quickly adding a line of best fit then, continuing with the earlier income example, we can do that with geom_smooth() (Figure 6.6).
scatter_data %>%
  ggplot(mapping = aes(x = years_of_education, y = income)) +
  geom_point() +
  geom_smooth(method = lm, color = "black") +
  theme_minimal() +
  labs(x = "Years of education",
       y = "Income",
       title = "Relationship between income and years of education",
       caption = "Source: Simulated data.")
#> `geom_smooth()` using formula 'y ~ x'
If we want to get counts by groups, then we may want to use a histogram. Figure 6.7 shows the counts for our simulated incomes.
scatter_data %>%
  ggplot(mapping = aes(x = income)) +
  geom_histogram() +
  theme_minimal() +
  labs(x = "Income",
       y = "Number",
       title = "Distribution of income",
       caption = "Source: Simulated data.")
#> `stat_bin()` using `bins = 30`. Pick better value with
#> `binwidth`.
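The message from `stat_bin()` is a prompt to choose the bin width deliberately. A minimal sketch of doing that follows; the `income_data` tibble is a hypothetical stand-in simulated here, and the bin width of 10,000 is an assumption that happens to suit its scale.

```r
library(ggplot2)
library(tibble)

set.seed(853)

# Hypothetical incomes, simulated for illustration only
income_data <- tibble(
  income = pmax(rnorm(n = 500, mean = 85000, sd = 24000), 0)
)

# Setting binwidth explicitly silences the message and makes the choice visible
income_plot <- ggplot(data = income_data, mapping = aes(x = income)) +
  geom_histogram(binwidth = 10000) +
  theme_minimal() +
  labs(x = "Income",
       y = "Number",
       caption = "Source: Simulated data.")

income_plot
```

It is worth trying a few different widths, because the apparent shape of a histogram can change with the bins.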
Finally, let’s try putting them together. We’re going to use the patchwork package (Pedersen 2020) and the palmerpenguins package for data. Don’t forget install.packages("palmerpenguins") as this is probably the first time you’ve used the package.
And we can make things fairly involved fairly quickly.
(p1 | p2) / p2
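The composition above needs two existing plots, `p1` and `p2`, whose code is not shown. As a minimal sketch, they could be built from the `penguins` data like this; the particular variables plotted are assumptions.

```r
library(ggplot2)
library(palmerpenguins)
library(patchwork)

# A scatter plot of two bill measurements (rows with missing values are dropped with a warning)
p1 <- ggplot(data = penguins,
             mapping = aes(x = bill_length_mm, y = bill_depth_mm, colour = species)) +
  geom_point() +
  theme_minimal()

# A bar chart of the number of penguins of each species
p2 <- ggplot(data = penguins, mapping = aes(x = species)) +
  geom_bar() +
  theme_minimal()

# `|` places plots side by side and `/` stacks them
combined <- (p1 | p2) / p2
combined
```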
Tables are also critical to tell a compelling story. We may prefer a table to a graph when there are only a few features that we want to focus on. We’ll use knitr::kable() alongside the ‘kableExtra’ package, and also the gt package.
Let’s start with knitr::kable() and the summary dinosaur data from earlier.
Even the defaults are pretty good, but we can add a few tweaks to make the table better. The first is that this many significant digits is inappropriate. We may also like to add a caption, make the column names consistent, and change the alignment.
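Those tweaks can be sketched with arguments that `knitr::kable()` already provides. The name `summary_table`, the caption wording, and the column labels below are assumptions made for illustration.

```r
library(datasauRus)
library(dplyr)

# Recreate the summary dinosaur data from earlier
summary_table <- datasaurus_dozen %>%
  filter(dataset %in% c("dino", "star", "away", "bullseye")) %>%
  group_by(dataset) %>%
  summarise(across(c(x, y), list(mean = mean, sd = sd)),
            x_y_cor = cor(x, y))

# Fewer digits, a caption, consistent column names, and explicit alignment
summary_table %>%
  knitr::kable(
    digits = 1,
    caption = "Summary statistics for four datasauRus datasets",
    col.names = c("Dataset", "x mean", "x sd", "y mean", "y sd", "Correlation"),
    align = c("l", "r", "r", "r", "r", "r")
  )
```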
The ‘kableExtra’ package builds extra functionality (Zhu 2020).
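As a minimal sketch of that extra functionality, `kable_styling()` and `add_header_above()` can be applied to a table made by `knitr::kable()`. The small data frame below reuses rounded values from the summary output earlier; its name and the grouping of columns under ‘x’ and ‘y’ are assumptions.

```r
library(kableExtra)

# Rounded values copied from the summary output shown earlier
small_table <- data.frame(
  dataset = c("dino", "star"),
  x_mean = c(54.3, 54.3),
  x_sd = c(16.8, 16.8),
  y_mean = c(47.8, 47.8),
  y_sd = c(26.9, 26.9)
)

basic_table <- knitr::kable(
  small_table,
  format = "html",
  col.names = c("Dataset", "Mean", "SD", "Mean", "SD")
)

# Stop the table spanning the full page width
styled_table <- kable_styling(basic_table, full_width = FALSE)

# Add a header row that groups the columns: one blank, two for x, two for y
styled_table <- add_header_above(styled_table, header = c(" " = 1, "x" = 2, "y" = 2))

styled_table
```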
The gt package (Iannone, Cheng, and Schloerke 2020) is a newer package that brings a lot of exciting features. However, being newer it sometimes has issues with PDF output.
We could add sub-titles easily.
library(gt)
# Assumes `example_data` holds the summary dinosaur table from earlier
example_data %>%
  gt() %>%
  tab_header(
    title = "Summary stats can be misleading",
    subtitle = "With an example from a dinosaur!"
  )
One common reason for needing a table is to report regression results. There are several packages to consider, but at the moment my favourite is modelsummary (Arel-Bundock 2021).
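A minimal sketch of a regression table with `modelsummary()` follows. The data are re-simulated here along the lines of the earlier education and income example so that the block runs on its own; the object names and the formatting choices are assumptions.

```r
library(modelsummary)

set.seed(853)

# Simulated data in the spirit of the earlier education and income example
number_of_observation <- 500
years_of_education <- runif(n = number_of_observation, min = 10, max = 25)
income <- pmax(years_of_education * 5000 +
                 rnorm(n = number_of_observation, mean = 0, sd = 10000), 0)
simulated_data <- data.frame(years_of_education, income)

# Fit a simple linear model
income_model <- lm(income ~ years_of_education, data = simulated_data)

# One column per model; fmt = 1 keeps one decimal place
modelsummary(list("Income" = income_model), fmt = 1, output = "markdown")
```

Passing a named list makes it straightforward to add further models as extra columns later.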
In many ways maps can be thought of as a fancy graph, where the x-axis is longitude, the y-axis is latitude, and there is some outline or a background image. This type of set-up is quite familiar from ggplot. Static maps will be useful for printed output, such as a PDF or Word report, or where there is something in particular that you want to illustrate.
ggplot() +
  geom_polygon( # First draw an outline
    data = some_data,
    aes(x = longitude, y = latitude, group = group)
  ) +
  geom_point( # Then add points of interest
    data = some_other_data,
    aes(x = longitude, y = latitude)
  )
And while there are some small complications, for the most part it is as straightforward as that. The first step is to get some data. Helpfully, there is some geographic data built into ggplot, and there is some other information built into a package called maps.
library(maps)
library(tidyverse)

canada <- map_data(database = "world", regions = "canada")
canadian_cities <- maps::canada.cities

head(canada)
#>        long      lat group order region    subregion
#> 1 -59.78760 43.93960     1     1 Canada Sable Island
#> 2 -59.92227 43.90391     1     2 Canada Sable Island
#> 3 -60.03775 43.90664     1     3 Canada Sable Island
#> 4 -60.11426 43.93911     1     4 Canada Sable Island
#> 5 -60.11748 43.95337     1     5 Canada Sable Island
#> 6 -59.93604 43.93960     1     6 Canada Sable Island

head(canadian_cities)
#>            name country.etc    pop   lat    long capital
#> 1 Abbotsford BC          BC 157795 49.06 -122.30       0
#> 2      Acton ON          ON   8308 43.63  -80.03       0
#> 3 Acton Vale QC          QC   5153 45.63  -72.57       0
#> 4    Airdrie AB          AB  25863 51.30 -114.02       0
#> 5    Aklavik NT          NT    643 68.22 -135.00       0
#> 6    Albanel QC          QC   1090 48.87  -72.42       0
With that information in hand we can then create a map of Canada that shows the cities with a population over 1,000. (The geom_polygon() function within ggplot draws shapes by connecting points within groups. And the coord_map() function adjusts for the fact that we are making a 2D map to represent something that is 3D.)
ggplot() +
  geom_polygon(data = canada,
               aes(x = long, y = lat, group = group),
               fill = "white",
               colour = "grey") +
  coord_map(ylim = c(40, 70)) +
  geom_point(data = canadian_cities,
             aes(x = long, y = lat),
             alpha = 0.3,
             color = "black") +
  theme_classic() +
  labs(x = "Longitude", y = "Latitude")
# If I'm being honest, this 'simple example' took me six hours to work out. Firstly
# to find Canada and then to find Canadian cities.
As is often the case with R, there are many different ways to get started creating static maps. We’ve already seen how they can be built using just ggplot, but here we’ll explore one package that has a lot of built-in functionality that will make things easier: ggmap.
There are two essential components to a map: 1) some border or background image (also known as a tile); and 2) something of interest within that border or on top of that tile. In ggmap, we will use an open-source option for our tile, Stamen Maps (maps.stamen.com), and we will plot points based on latitude and longitude.
Like Canada, in Australia people go to specific locations, called booths, to vote. These booths have latitudes and longitudes and so we can plot these. One reason we may like to do this is to notice patterns over geographies.
To get started we need to get a tile. We are going to use ggmap to get a tile from Stamen Maps, which builds on OpenStreetMap (openstreetmap.org). The main argument to the function that does this, get_stamenmap(), is a bounding box. This requires two latitudes - one for the top of the box and one for the bottom of the box - and two longitudes - one for the left of the box and one for the right of the box. (It can be useful to use Google Maps, or an alternative, to find the values of these that you need.) The bounding box provides the coordinates of the edges that you are interested in. In this case I have provided it with coordinates such that it will be centered around Canberra, Australia (our equivalent of Ottawa - a small city that was created for the purposes of being the capital).
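A bounding box is just a named numeric vector of four edges. The sketch below shows the shape that is expected; the specific coordinates are approximate values chosen here so that the box surrounds Canberra, and are assumptions rather than the exact values behind the map.

```r
# Approximate, illustrative edges of a box around Canberra, Australia.
# Longitudes give the left and right edges; latitudes give the bottom and top.
bbox <- c(left = 148.95, bottom = -35.5, right = 149.3, top = -35.1)

# Quick sanity checks: left must be west of right, and bottom south of top
stopifnot(bbox[["left"]] < bbox[["right"]])
stopifnot(bbox[["bottom"]] < bbox[["top"]])

bbox
```

Latitudes in the southern hemisphere are negative, so the bottom edge is the more negative of the two.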
Once you have defined the bounding box, then the function get_stamenmap() will get the tiles in that area. The number of tiles that it needs to get depends on the zoom, and the type of tiles that it gets depends on the maptype. I’ve chosen the maptype that I like here - the black and white option - but the helpfile specifies a few others that you may like. At this point you can pass your maps to ggmap and it will plot the tile! It will be actively downloading these tiles, so you need an internet connection.
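library(ggmap)
# `bbox` is the named vector of left, bottom, right, and top coordinates described above
canberra_stamen_map <- get_stamenmap(bbox, zoom = 11, maptype = "toner-lite")
ggmap(canberra_stamen_map)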
Once we have a map then we can use ggmap() to plot it. (That circle in the middle of the map is where the Australian Parliament House is… yes, our parliament is surrounded by circular roads (we call them ‘roundabouts’). Actually, it’s surrounded by two of them.)
Now we want to get some data that we will plot on top of our tiles. We will just plot the location of the polling places, coloured by the ‘division’ (the Australian equivalent to ‘ridings’ in Canada) that each is in. This is available here: https://results.aec.gov.au/24310/Website/Downloads/GeneralPollingPlacesDownload-24310.csv. (The Australian Electoral Commission (AEC) is the official government agency that is responsible for elections in Australia.)
# Read in the booths data for each year
booths <- readr::read_csv("https://results.aec.gov.au/24310/Website/Downloads/GeneralPollingPlacesDownload-24310.csv",
                          skip = 1,
                          guess_max = 10000)
head(booths)
#> # A tibble: 6 × 15
#>   State DivisionID DivisionNm PollingPlaceID
#>   <chr>      <dbl> <chr>               <dbl>
#> 1 ACT          318 Bean                93925
#> 2 ACT          318 Bean                93927
#> 3 ACT          318 Bean                11877
#> 4 ACT          318 Bean                11452
#> 5 ACT          318 Bean                 8761
#> 6 ACT          318 Bean                 8763
#> # … with 11 more variables: PollingPlaceTypeID <dbl>,
#> #   PollingPlaceNm <chr>, PremisesNm <chr>,
#> #   PremisesAddress1 <chr>, PremisesAddress2 <chr>,
#> #   PremisesAddress3 <chr>, PremisesSuburb <chr>,
#> #   PremisesStateAb <chr>, PremisesPostCode <chr>,
#> #   Latitude <dbl>, Longitude <dbl>
This dataset is for the whole of Australia, but as we are just going to plot the area around Canberra we will filter to that and only to booths that are geographic (the AEC has various options for people who are in hospital, or not able to get to a booth, etc, and these are still ‘booths’ in this dataset).
# Reduce the booths data to only rows that have a latitude and longitude
booths_reduced <- booths %>%
  filter(State == "ACT") %>%
  select(PollingPlaceID, DivisionNm, Latitude, Longitude) %>%
  filter(!is.na(Longitude)) %>% # Remove rows that don't have a geography
  filter(Longitude < 165) # Remove Norfolk Island
Now we can use ggmap in the same way as before to plot our underlying tiles, and then build on that using geom_point() to add our points of interest.
ggmap(canberra_stamen_map, extent = "normal", maprange = FALSE) +
  geom_point(data = booths_reduced,
             aes(x = Longitude, y = Latitude, colour = DivisionNm)) +
  scale_color_brewer(name = "2019 Division", palette = "Set1") +
  # Use the bounding box stored on the tile object to set the plot limits
  coord_map(projection = "mercator",
            xlim = c(attr(canberra_stamen_map, "bb")$ll.lon,
                     attr(canberra_stamen_map, "bb")$ur.lon),
            ylim = c(attr(canberra_stamen_map, "bb")$ll.lat,
                     attr(canberra_stamen_map, "bb")$ur.lat)) +
  labs(x = "Longitude", y = "Latitude") +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
We may like to save the map so that we don’t have to draw it every time, and we can do that in the same way as any other graph, using ggsave().
ggsave("outputs/figures/map.pdf", width = 20, height = 10, units = "cm")
Finally, the reason that I used Stamen Maps and OpenStreetMap is that they are open source; however, you can also use Google Maps if you want. This requires you to first register a credit card with Google and specify a key, but with low usage it should be free. The get_googlemap() function within ggmap brings some nice features that get_stamenmap() does not have. For instance, you can enter a place name and it’ll do its best to find it rather than needing to specify a bounding box.
Let’s see another example of a static map, this time using Toronto data accessed via the opendatatoronto package. The dataset that we are going to plot is available here: https://open.toronto.ca/dataset/street-furniture-bicycle-parking/.
# This code is based on code from: https://open.toronto.ca/dataset/street-furniture-bicycle-parking/.
library(opendatatoronto)
library(tidyverse)
# (The string identifies the package.)
resources <- list_package_resources("71e6c206-96e1-48f1-8f6f-0e804687e3be")
# In this case there is only one dataset within this resource so just need the first one
raw_data <- filter(resources, row_number() == 1) %>%
  get_resource()
write_csv(raw_data, "inputs/data/bike_racks.csv")
head(raw_data)
Now that we’ve saved a copy of the data, we can use that one. First, we need to clean it up a bit. There are some clear errors in the ADDRESSNUMBERTEXT field, but not too many, so we’ll just ignore them.
raw_data <- read_csv("inputs/data/bike_racks.csv")
# We'll just focus on the data that we want
bike_data <- tibble(ward = raw_data$WARD,
                    id = raw_data$ID,
                    status = raw_data$STATUS,
                    street_address = paste(raw_data$ADDRESSNUMBERTEXT, raw_data$ADDRESSSTREET),
                    latitude = raw_data$LATITUDE,
                    longitude = raw_data$LONGITUDE)
rm(raw_data)
Some of the bike racks were temporary, so let’s remove them. Let’s also just look at the area around the university, which is Ward 11.
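The code for this step is not shown, so here is a minimal sketch of the filtering on a toy stand-in for the bike data. The status labels (‘Existing’ and ‘Temporary’), the way the ward is coded, and every value in `toy_bike_data` are assumptions; check the real values with something like `count(bike_data, status)` before filtering.

```r
library(dplyr)
library(tibble)

# Toy stand-in with the same columns as the bike data; all values are made up
toy_bike_data <- tibble(
  ward = c(11, 11, 4, 11),
  id = 1:4,
  status = c("Existing", "Temporary", "Existing", "Existing"),
  street_address = c("1 Example St", "2 Example St", "3 Example St", "4 Example St"),
  latitude = c(43.66, 43.66, 43.70, 43.67),
  longitude = c(-79.40, -79.39, -79.45, -79.39)
)

# Drop temporary racks, keep Ward 11, then drop the columns we no longer need
toy_bike_subset <- toy_bike_data %>%
  filter(status != "Temporary") %>%
  filter(ward == 11) %>%
  select(-status, -ward)

toy_bike_subset
```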
If you look at the dataset at this point, then you’ll notice that there is a row for every bike parking spot. But we don’t really need to know that, because sometimes there are lots right next to each other. Instead, we’d just like the one point (we’ll take advantage of this in an interactive graph in a moment). So, we want to create a count by address, and then just get one instance per address.
bike_data <- bike_data %>%
  group_by(street_address) %>%
  mutate(number_of_spots = n(),
         running_total = row_number()) %>%
  ungroup() %>%
  filter(running_total == 1) %>%
  select(-id, -running_total)
head(bike_data)
#> # A tibble: 6 × 4
#>   street_address   latitude longitude number_of_spots
#>   <chr>               <dbl>     <dbl>           <int>
#> 1 8 Kensington Ave     43.7     -79.4               1
#> 2 87 Avenue Rd         43.7     -79.4               4
#> 3 162 Mc Caul St       43.7     -79.4               1
#> 4 147 Baldwin St       43.7     -79.4               2
#> 5 888 Yonge St         43.7     -79.4               1
#> 6 180 Elizabeth St     43.7     -79.4              10
write_csv(bike_data, "outputs/data/bikes.csv")
Now we can grab our tile and add our bike rack data onto it.
bbox <- c(left = -79.420390, bottom = 43.642658, right = -79.383354, top = 43.672557)
toronto_stamen_map <- get_stamenmap(bbox, zoom = 14, maptype = "toner-lite")
ggmap(toronto_stamen_map, maprange = FALSE) +
  geom_point(data = bike_data,
             aes(x = longitude, y = latitude),
             alpha = 0.3) +
  labs(x = "Longitude", y = "Latitude") +
  theme_minimal()
To this point we have just assumed that we already had geocoded data. The places ‘Canberra, Australia,’ or ‘Ottawa, Canada,’ are just names; they don’t inherently have a location. In order to plot them we need to get a latitude and longitude for them. The process of going from names to coordinates is called geocoding.
There are a range of options to geocode data in R, but one good package is tidygeocoder (Cambon and Belanger 2021). To get started using the package we need a dataframe of locations. So we’ll just quickly make one here.
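The code that builds the dataframe is not shown. A minimal sketch is below, assuming a `city` column and a `country` column; the name `some_locations` and the two places match the geocoding call and output that follow.

```r
library(tibble)

# One row per place that we want to geocode
some_locations <- tibble(
  city = c("Canberra", "Ottawa"),
  country = c("Australia", "Canada")
)

some_locations
```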
tidygeocoder::geo(city = some_locations$city,
                  country = some_locations$country,
                  method = 'osm')
#> # A tibble: 2 × 4
#>   city     country     lat  long
#>   <chr>    <chr>     <dbl> <dbl>
#> 1 Canberra Australia -35.3 149.
#> 2 Ottawa   Canada     45.4 -75.7
- I have a dataset that contains measurements of height (in cm) for a sample of 300 penguins, who are either the Adélie or Emperor species. I am interested in visualizing the distribution of heights by species in a graphical way. Please discuss whether a pie chart is an appropriate type of graph to use. What about a box and whisker plot? Finally, what are some considerations if you made a histogram? [Please write a paragraph or two for each aspect.]
- Assume the dataset and columns exist. Would this code work (pick one)?
data %>% ggplot(aes(x = col_one)) %>% geom_point()
- If I have categorical data, which geom should I use to plot it (pick one)?
- Why are box plots often inappropriate (pick one)?
- They hide the full distribution of the data.
- They are hard to make.
- They are ugly.
- The mode is clearly displayed.
- Which of the following, if any, are elements of the layered grammar of graphics (Wickham 2010) (select all that apply)?
- A default dataset and set of mappings from variables to aesthetics.
- One or more layers, with each layer having one geometric object, one statistical transformation, one position adjustment, and optionally, one dataset and set of aesthetic mappings.
- Colours that enable the reader to understand the main point.
- A coordinate system.
- The facet specification.
- One scale for each aesthetic mapping used.