Exploratory Data Analysis

Rohan Alexander https://rohanalexander.com/ (University of Toronto)
May 17, 2020

The skim function from the skimr package is just a quick way of looking at the data. It can be handy when you get a new dataset to quickly come to terms with it. The p0, p25, p50, etc, values that it provides are percentiles. So for instance, P50 is the 50th percentile which means that 50 per cent of the data are below this, and 50 per cent are above i.e. the median. P0 should be the minimum, and p100 should be the maximum. The middle ones are similarly defined, so p25, means that 25 per cent of the data are below this point and 75 per cent of it is above this point. The general idea of both the percentiles and the histogram is to give you a quick idea of how skewed your data is.

Exploratory data analysis

This chapter was written with Monica Alexander.

Required reading

Recommended reading

Key concepts/skills/etc

Key libraries/functions/etc

Pre-quiz

Introduction

Exploratory data analysis is never finished, you just die.

This chapter is about exploratory data analysis (EDA) and data visualization steps in R. The aim is to get you used to working with real data (that has issues) to understand the main characteristics and potential issues.

We will be using the opendatatoronto R package, which interfaces with the City of Toronto Open Data Portal.

A note on packages

If you are running this Rmd on your local machine, you may need to install various packages used (using the install.packages function).

Load in all the packages we need:

TTC subway delays

This package provides an interface to all data available on the Open Data Portal provided by the City of Toronto.

Use the list_packages function to look at what’s available

all_data <- list_packages(limit = 500)
all_data
# A tibble: 402 x 11
   title id    topics civic_issues publisher excerpt dataset_category
   <chr> <chr> <chr>  <chr>        <chr>     <chr>   <chr>           
 1 Clot… 58ef… Commu… Poverty red… Municipa… Toront… Map             
 2 Bike… ac87… Devel… Mobility     Transpor… The To… Map             
 3 Stre… 99b1… City … <NA>         Transpor… Inform… Map             
 4 Body… c405… City … <NA>         Toronto … This d… Table           
 5 Neig… 4def… Locat… Affordable … Social D… Bounda… Map             
 6 Toro… 1d07… City … Mobility     Informat… Linear… Document        
 7 Stre… 1db3… City … Mobility     Transpor… Transi… Map             
 8 Pede… 4b5c… Trans… Affordable … Informat… This d… Document        
 9 Stre… 74f6… City … <NA>         Transpor… Public… Map             
10 Stre… 821f… City … <NA>         Transpor… Public… Map             
# … with 392 more rows, and 4 more variables: num_resources <int>,
#   formats <chr>, refresh_rate <chr>, last_refreshed <date>

Let’s download the data on TTC subway delays in 2019. There are multiple files for 2019 so we need to get them all and make them into one big dataframe.

res <- list_package_resources("996cfe8d-fb35-40ce-b569-698d51fc683b")
res <- res %>% mutate(year = str_extract(name, "201.?"))
delay_2019_ids <- res %>% filter(year==2019) %>% select(id) %>% pull()

delay_2019 <- c()
for(i in 1:length(delay_2019_ids)) {
  delay_2019 <- bind_rows(delay_2019, get_resource(delay_2019_ids[i]))
}

# make the column names nicer to work with
delay_2019 <- clean_names(delay_2019)

Let’s also download the delay code and readme, as reference.

delay_codes <- get_resource("fece136b-224a-412a-b191-8d31eb00491e")
delay_data_codebook <- get_resource("54247e39-5a7d-40db-a137-82b2a9ab0708")

This dataset has a bunch of interesting variables. You can refer to the readme for descriptions. Our outcome of interest is min_delay, which give the delay in mins.

head(delay_2019)
# A tibble: 6 x 10
  date                time  day   station code  min_delay min_gap
  <dttm>              <chr> <chr> <chr>   <chr>     <dbl>   <dbl>
1 2019-01-01 00:00:00 01:08 Tues… YORK M… PUSI          0       0
2 2019-01-01 00:00:00 02:14 Tues… ST AND… PUMST         0       0
3 2019-01-01 00:00:00 02:16 Tues… JANE S… TUSC          0       0
4 2019-01-01 00:00:00 02:27 Tues… BLOOR … SUO           0       0
5 2019-01-01 00:00:00 03:03 Tues… DUPONT… MUATC        11      16
6 2019-01-01 00:00:00 03:08 Tues… EGLINT… EUATC        11      16
# … with 3 more variables: bound <chr>, line <chr>, vehicle <dbl>

EDA and data viz

The following section highlights some tools that might be useful for you when you are getting used to a new dataset. There’s no one way of exploration, but it’s important to always keep in mind:

In any data analysis project, if it turns out you have data issues, surprising values, missing data etc, it’s important you document anything you found and the subsequent steps or assumptions you made before moving onto your data analysis / modeling.

As always:

  1. Start with an end in mind.
  2. Be as lazy as possible.

Data checks

Sanity Checks

We need to check variables should be what they say they are. If they aren’t, the natural next question is to what to do with issues (recode? remove?)

E.g. check days of week

unique(delay_2019$day)
[1] "Tuesday"   "Wednesday" "Thursday"  "Friday"    "Saturday" 
[6] "Sunday"    "Monday"   

Check lines: oh no. some issues here. Some have obvious recodes, others, not so much.

unique(delay_2019$line)
 [1] "YU"                     "BD"                    
 [3] "YU/BD"                  "SHP"                   
 [5] "SRT"                    NA                      
 [7] "YUS"                    "B/D"                   
 [9] "BD LINE"                "999"                   
[11] "YU/ BD"                 "YU & BD"               
[13] "BD/YU"                  "YU\\BD"                
[15] "46 MARTIN GROVE"        "RT"                    
[17] "BLOOR-DANFORTH"         "YU / BD"               
[19] "134 PROGRESS"           "YU - BD"               
[21] "985 SHEPPARD EAST EXPR" "22 COXWELL"            
[23] "100 FLEMINGDON PARK"    "YU LINE"               

The skimr package might also be useful here

# skim(delay_2019)

What are the different values of bound for each line?

For simplicity, just keep the correct line labels.

# delay_2019 %>%
#   filter(line %in% c("BD", "YU", "SHP", "SRT")) %>%
#   mutate(bound = as.factor(bound)) %>%
#   group_by(line) %>%
#   skim(bound)

Missing values

Look to see how many NAs by variable

delay_2019 %>% 
  summarise_all(.funs = funs(sum(is.na(.))/nrow(delay_2019)*100))
# A tibble: 1 x 10
   date  time   day station  code min_delay min_gap bound  line
  <dbl> <dbl> <dbl>   <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
1     0     0     0       0     0         0       0  22.8 0.260
# … with 1 more variable: vehicle <dbl>

The visdat package is also useful here, particularly to see how missing values are distributed.

vis_dat(delay_2019)
vis_miss(delay_2019)

Duplicates?

The get_dupes function from the janitor package is useful for this.

get_dupes(delay_2019)
# A tibble: 158 x 11
   date                time  day   station code  min_delay min_gap
   <dttm>              <chr> <chr> <chr>   <chr>     <dbl>   <dbl>
 1 2019-01-01 00:00:00 08:18 Tues… DONLAN… MUESA         5      10
 2 2019-01-01 00:00:00 08:18 Tues… DONLAN… MUESA         5      10
 3 2019-02-01 00:00:00 05:51 Frid… SCARB … MRTO         10      15
 4 2019-02-01 00:00:00 05:51 Frid… SCARB … MRTO         10      15
 5 2019-02-01 00:00:00 06:45 Frid… MIDLAN… MRWEA         3       8
 6 2019-02-01 00:00:00 06:45 Frid… MIDLAN… MRWEA         3       8
 7 2019-02-01 00:00:00 06:55 Frid… LAWREN… ERDO          0       0
 8 2019-02-01 00:00:00 06:55 Frid… LAWREN… ERDO          0       0
 9 2019-02-01 00:00:00 07:16 Frid… MCCOWA… MRWEA         5      10
10 2019-02-01 00:00:00 07:16 Frid… MCCOWA… MRWEA         5      10
# … with 148 more rows, and 4 more variables: bound <chr>,
#   line <chr>, vehicle <dbl>, dupe_count <int>

There are quite a few duplicates. Remove for now:

delay_2019 <- delay_2019 %>% distinct()

Visualizing distributions

Histograms, barplots, and density plots are your friends here.

Let’s look at the outcome of interest: min_delay. First of all just a histogram of all the data:

## Removing the observations that have non-standardized lines

delay_2019 <- delay_2019 %>% filter(line %in% c("BD", "YU", "SHP", "SRT"))

ggplot(data = delay_2019) + 
  geom_histogram(aes(x = min_delay))

To improve readability, could plot on logged scale:

ggplot(data = delay_2019) + 
  geom_histogram(aes(x = min_delay)) + scale_x_log10()

Our initial EDA hinted at an outlying delay time, let’s take a look at the largest delays below. Join the delay_codes dataset to see what the delay is. (Have to do some mangling as SRT has different codes).

delay_2019 <- delay_2019 %>% 
  left_join(delay_codes %>% rename(code = `SUB RMENU CODE`, code_desc = `CODE DESCRIPTION...3`) %>% select(code, code_desc)) 


delay_2019 <- delay_2019 %>%
  mutate(code_srt = ifelse(line=="SRT", code, "NA")) %>% 
  left_join(delay_codes %>% rename(code_srt = `SRT RMENU CODE`, code_desc_srt = `CODE DESCRIPTION...7`) %>% select(code_srt, code_desc_srt))  %>% 
  mutate(code = ifelse(code_srt=="NA", code, code_srt),
         code_desc = ifelse(is.na(code_desc_srt), code_desc, code_desc_srt)) %>% 
  select(-code_srt, -code_desc_srt)

The 455 min delay due to ‘Rail Related Problem’ is an outlier.

delay_2019 %>% 
  left_join(delay_codes %>% rename(code = `SUB RMENU CODE`, code_desc = `CODE DESCRIPTION...3`) %>% select(code, code_desc)) %>% 
  arrange(-min_delay) %>% 
  select(date, time, station, line, min_delay, code, code_desc)
# A tibble: 18,697 x 7
   date                time  station  line  min_delay code  code_desc 
   <dttm>              <chr> <chr>    <chr>     <dbl> <chr> <chr>     
 1 2019-06-25 00:00:00 18:48 WILSON … YU          455 PUTR  Rail Rela…
 2 2019-02-12 00:00:00 20:28 LAWRENC… SRT         284 MRWEA Weather R…
 3 2019-06-05 00:00:00 12:42 UNION T… YU          250 MUPLA Fire/Smok…
 4 2019-10-22 00:00:00 14:22 LAWRENC… YU          228 PUTS  Structure…
 5 2019-09-26 00:00:00 11:38 YORK MI… YU          193 MUPR1 Priority …
 6 2019-06-08 00:00:00 08:51 SPADINA… BD          180 MUPLB Fire/Smok…
 7 2019-12-02 00:00:00 06:59 DUNDAS … BD          176 MUPLB Fire/Smok…
 8 2019-01-29 00:00:00 05:46 VICTORI… BD          174 MUWEA Weather R…
 9 2019-02-22 00:00:00 17:32 ELLESME… SRT         168 PRW   Rail Defe…
10 2019-02-10 00:00:00 07:53 BAYVIEW… SHP         165 PUSI  Signals o…
# … with 18,687 more rows

Grouping and small multiples

A quick and powerful visualization technique is to group the data by a variable of interest, e.g. line

ggplot(data = delay_2019) + 
  geom_histogram(aes(x = min_delay, y = ..density.., fill = line), position = 'dodge', bins = 10) + scale_x_log10()

I switched to density above to look at the the distributions more comparably, but we should also be aware of differences in frequency, in particular, SHP and SRT have much smaller counts:

ggplot(data = delay_2019) + 
  geom_histogram(aes(x = min_delay, fill = line), position = 'dodge', bins = 10) + scale_x_log10()

If you want to group by more than one variable, facets are good:

ggplot(data = delay_2019) + 
  geom_density(aes(x = min_delay, color = day), bw = .08) + 
  scale_x_log10() + facet_grid(~line)

Side note: the station names are a mess. Try and clean up the station names a bit by taking just the first word (or, the first two if it starts with “ST”):

delay_2019 <- delay_2019 %>% 
  mutate(station_clean = ifelse(str_starts(station, "ST"), word(station, 1,2), word(station, 1)))

Plot top five stations by mean delay:

delay_2019 %>% 
  group_by(line, station_clean) %>% 
  summarise(mean_delay = mean(min_delay), n_obs = n()) %>% 
  filter(n_obs>1) %>% 
  arrange(line, -mean_delay) %>% 
  slice(1:5) %>% 
  ggplot(aes(station_clean, mean_delay)) + geom_col() + coord_flip() + facet_wrap(~line, scales = "free_y")

Visualizing time series

Daily plot is messy (you can check for yourself). Let’s look by week to see if there’s any seasonality. The lubridate package has lots of helpful functions that deal with date variables. First, mean delay (of those that were delayed more than 0 mins):

delay_2019 %>% 
  filter(min_delay>0) %>% 
  mutate(week = week(date)) %>% 
  group_by(week, line) %>% 
  summarise(mean_delay = mean(min_delay)) %>% 
  ggplot(aes(week, mean_delay, color = line)) + geom_point() + geom_smooth() + facet_grid(~line)

What about proportion of delays that were greater than 10 mins?

delay_2019 %>% 
  mutate(week = week(date)) %>% 
  group_by(week, line) %>% 
  summarise(prop_delay = sum(min_delay>10)/n()) %>% 
  ggplot(aes(week, prop_delay, color = line)) + geom_point() + geom_smooth() + facet_grid(~line)

Visualizing relationships

Note that scatter plots are a good precursor to modeling, to visualize relationships between continuous variables. Nothing obvious to plot here, but easy to do with geom_point.

Look at top five reasons for delay by station. Do they differ? Think about how this could be modeled.

delay_2019 %>%
  group_by(line, code_desc) %>%
  summarise(mean_delay = mean(min_delay)) %>%
  arrange(-mean_delay) %>%
  slice(1:5) %>%
  ggplot(aes(x = code_desc,
             y = mean_delay)) +
  geom_col() + 
  facet_wrap(vars(line), 
             scales = "free_y",
             nrow = 4) +
  coord_flip()

PCA

Principal components analysis is a really powerful exploratory tool. It allows you to pick up potential clusters and/or outliers that can help to inform model building.

Let’s do a quick (and imperfect) example looking at types of delays by station.

The delay categories are a bit of a mess, and there’s hundreds of them. As a simple start, let’s just take the first word:

delay_2019 <- delay_2019 %>% 
  mutate(code_red = case_when(
    str_starts(code_desc, "No") ~ word(code_desc, 1, 2),
    str_starts(code_desc, "Operator") ~ word(code_desc, 1,2),
    TRUE ~ word(code_desc,1))
         )

Let’s also just restrict the analysis to causes that happen at least 50 times over 2019. To do the PCA, the dataframe also needs to be switched to wide format:

dwide <- delay_2019 %>% 
  group_by(line, station_clean) %>% 
  mutate(n_obs = n()) %>% 
  filter(n_obs>1) %>% 
  group_by(code_red) %>% 
  mutate(tot_delay = n()) %>% 
  arrange(tot_delay) %>% 
  filter(tot_delay>50) %>% 
  group_by(line, station_clean, code_red) %>% 
  summarise(n_delay = n()) %>% 
  pivot_wider(names_from = code_red, values_from = n_delay) %>% 
  mutate_all(.funs = funs(ifelse(is.na(.), 0, .)))

Do the PCA:

delay_pca <- prcomp(dwide[,3:ncol(dwide)])

df_out <- as_tibble(delay_pca$x)
df_out <- bind_cols(dwide %>% select(line, station_clean), df_out)
head(df_out)
# A tibble: 6 x 40
# Groups:   line, station_clean [6]
  line  station_clean    PC1     PC2    PC3    PC4    PC5      PC6
  <chr> <chr>          <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>
1 BD    BATHURST        6.50   26.9   -2.71 -10.8  -8.40  -11.7   
2 BD    BAY            24.8     7.63  -2.19  -7.05  0.714   3.90  
3 BD    BLOOR         -62.4  -112.    57.3  -23.4  -5.09  -14.1   
4 BD    BROADVIEW      -6.60   28.1   -1.06 -14.0  -6.49   -8.29  
5 BD    CASTLE         23.8    11.8   -1.31  -7.93 -3.62   -3.37  
6 BD    CHESTER        24.6    -1.87 -18.6    2.75  1.85    0.0736
# … with 32 more variables: PC7 <dbl>, PC8 <dbl>, PC9 <dbl>,
#   PC10 <dbl>, PC11 <dbl>, PC12 <dbl>, PC13 <dbl>, PC14 <dbl>,
#   PC15 <dbl>, PC16 <dbl>, PC17 <dbl>, PC18 <dbl>, PC19 <dbl>,
#   PC20 <dbl>, PC21 <dbl>, PC22 <dbl>, PC23 <dbl>, PC24 <dbl>,
#   PC25 <dbl>, PC26 <dbl>, PC27 <dbl>, PC28 <dbl>, PC29 <dbl>,
#   PC30 <dbl>, PC31 <dbl>, PC32 <dbl>, PC33 <dbl>, PC34 <dbl>,
#   PC35 <dbl>, PC36 <dbl>, PC37 <dbl>, PC38 <dbl>

Plot the first two PCs, and label some outlying stations:

ggplot(df_out,aes(x=PC1,y=PC2,color=line )) + geom_point() + geom_text_repel(data = df_out %>% filter(PC2>100|PC1<100*-1), aes(label = station_clean))

Plot the factor loadings. Some evidence of public v operator?

df_out_r <- as_tibble(delay_pca$rotation)
df_out_r$feature <- colnames(dwide[,3:ncol(dwide)])

df_out_r
# A tibble: 38 x 39
        PC1      PC2      PC3      PC4      PC5      PC6      PC7
      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 -0.0412   0.0638   1.33e-2 -4.67e-2  0.0246   0.0184  -0.00363
 2 -0.0332  -0.00469 -4.14e-2 -7.51e-3  0.0201  -0.0122  -0.0914 
 3 -0.135    0.207    2.37e-2 -1.44e-1  0.135   -0.0381  -0.00931
 4 -0.0652   0.0475  -4.43e-2 -2.51e-2 -0.00139 -0.0748  -0.144  
 5 -0.00443  0.00878 -4.99e-5 -8.30e-4  0.00967  0.00954 -0.0160 
 6 -0.0268  -0.00722 -4.39e-3  5.34e-4 -0.0151  -0.0125  -0.00381
 7 -0.0813   0.0960  -4.62e-2  4.79e-2 -0.0978  -0.0365  -0.0766 
 8 -0.0117   0.0135   5.48e-3 -2.94e-2  0.0125   0.0377  -0.0790 
 9 -0.516    0.655   -1.77e-2 -1.62e-1 -0.221   -0.287   -0.184  
10 -0.151    0.0826   5.48e-2  3.52e-1 -0.397    0.281    0.110  
# … with 28 more rows, and 32 more variables: PC8 <dbl>, PC9 <dbl>,
#   PC10 <dbl>, PC11 <dbl>, PC12 <dbl>, PC13 <dbl>, PC14 <dbl>,
#   PC15 <dbl>, PC16 <dbl>, PC17 <dbl>, PC18 <dbl>, PC19 <dbl>,
#   PC20 <dbl>, PC21 <dbl>, PC22 <dbl>, PC23 <dbl>, PC24 <dbl>,
#   PC25 <dbl>, PC26 <dbl>, PC27 <dbl>, PC28 <dbl>, PC29 <dbl>,
#   PC30 <dbl>, PC31 <dbl>, PC32 <dbl>, PC33 <dbl>, PC34 <dbl>,
#   PC35 <dbl>, PC36 <dbl>, PC37 <dbl>, PC38 <dbl>, feature <chr>
ggplot(df_out_r,aes(x=PC1,y=PC2,label=feature )) + geom_text_repel()

Exercises

  1. Using the opendatatoronto package, download the data on mayoral campaign contributions for 2014. (note: the 2014 file you will get from get_resource, so just keep the sheet that relates to the Mayor election).
  2. Clean up the data format (fixing the parsing issue and standardizing the column names using janitor)
  3. Summarize the variables in the dataset. Are there missing values, and if so, should we be worried about them? Is every variable in the format it should be? If not, create new variable(s) that are in the right format.
  4. Visually explore the distribution of values of the contributions. What contributions are notable outliers? Do they share a similar characteristic(s)? It may be useful to plot the distribution of contributions without these outliers to get a better sense of the majority of the data.
  5. List the top five candidates in each of these categories:
    • total contributions
    • mean contribution
    • number of contributions
  6. Repeat 5 but without contributions from the candidates themselves.
  7. How many contributors gave money to more than one candidate?

Case study - Opinions about a casino in Toronto

This was written by Michael Chong.

Data preparation

Getting data from opendatatoronto

Here we use the opendatatoronto package again. See the previous example RMarkdown file for a deeper explanation of how the code below works.

The dataset I’m extracting below are the results from a survey in 2012 regarding the establishment of a casino in Toronto. More info available by following this link. In this analysis, we’ll be hoping to address the question, which demographic (age/gender) groups are more likely to be supportive of a new casino in Toronto?

# Get the data
casino_resource <- search_packages("casino survey")%>%
  list_package_resources() %>%
  filter(name == "toronto-casino-survey-results") %>%
  get_resource()

Getting the right kind of object

The object casino_resource isn’t quite useable yet, because it’s (inconveniently) stored as a list of 2 data frames:

# Check what kind of object the casino_resource object is
class(casino_resource)
[1] "list"

If we just return the object, we can see that the 2nd list item is empty, and we just want to keep the first one:

casino_resource
$tblSurvey
# A tibble: 17,766 x 94
   SurveyID Q1_A  Q1_B1 Q1_B2 Q1_B3 Q2_A  Q2_B  Q3_A  Q3_B  Q3_C 
      <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1        1 Stro… Do n… Do n… Do n… Does… "As … Not … Very… Not …
 2        2 Stro… Econ… Jobs  Arts… Fits… "Cos… Very… Very… Very…
 3        3 Stro… Ther… If t… <NA>  Fits… "Big… Very… Very… Very…
 4        4 Some… beli… mone… evid… Does… "My … Very… Very… Some…
 5        5 Neut… Like… Conc… <NA>  Neut… "Aga… Very… Very… Very…
 6        6 Stro… have… <NA>  <NA>  Does… "Tor… Not … Not … Not …
 7        7 Stro… The … Peop… We s… Does… "#3 … Not … Not … Not …
 8        8 Stro… It w… Mora… <NA>  Does… "Cas… Very… Very… Very…
 9        9 Stro… It's… traf… heal… Does… "No … Not … Very… Not …
10       10 Stro… Toro… Avoi… Prov… Fits… "Tor… Very… Very… Very…
# … with 17,756 more rows, and 84 more variables: Q3_D <chr>,
#   Q3_E <chr>, Q3_F <chr>, Q3_G <chr>, Q3_H <chr>, Q3_I <chr>,
#   Q3_J <chr>, Q3_K <chr>, Q3_L <chr>, Q3_M <chr>, Q3_N <chr>,
#   Q3_O <chr>, Q3_P <chr>, Q3_Q <chr>, Q3_Q_Other <chr>,
#   Q3_Comments <chr>, Q4_A <chr>, Q5 <chr>, Q6 <chr>,
#   Q6_Comments <chr>, Q7_A_StandAlone <chr>, Q7_A_Integrated <chr>,
#   Q7_A1 <chr>, Q7_A2 <chr>, Q7_A3 <chr>, Q7_A_A <chr>,
#   Q7_A_B <chr>, Q7_A_C <chr>, Q7_A_D <chr>, Q7_A_E <chr>,
#   Q7_A_F <chr>, Q7_A_G <chr>, Q7_A_H <chr>, Q7_A_I <chr>,
#   Q7_A_J <chr>, Q7_A_J_Other <chr>, Q7_B_StandAlone <chr>,
#   Q7_B_Integrated <chr>, Q7_B1 <chr>, Q7_B2 <chr>, Q7_B3 <chr>,
#   Q7_B_A <chr>, Q7_B_B <chr>, Q7_B_C <chr>, Q7_B_D <chr>,
#   Q7_B_E <chr>, Q7_B_F <chr>, Q7_B_G <chr>, Q7_B_H <chr>,
#   Q7_B_I <chr>, Q7_B_J <chr>, Q7_B_J_Other <chr>,
#   Q7_C_StandAlone <chr>, Q7_C_Integrated <chr>, Q7_C1 <chr>,
#   Q7_C2 <chr>, Q7_C3 <chr>, Q7_C_A <chr>, Q7_C_B <chr>,
#   Q7_C_C <chr>, Q7_C_D <chr>, Q7_C_E <chr>, Q7_C_F <chr>,
#   Q7_C_G <chr>, Q7_C_H <chr>, Q7_C_I <chr>, Q7_C_J <chr>,
#   Q7_C_J_Other <chr>, Q8_A1 <chr>, Q8_A2 <chr>, Q8_B1 <chr>,
#   Q8_B2 <chr>, Q8_B3 <chr>, Q9 <chr>, Q9_Considerations <chr>,
#   Q10 <chr>, Q11 <chr>, Age <chr>, Gender <chr>, PostalCode <chr>,
#   GroupName <chr>, DateCreated <dttm>, ...93 <lgl>, ...94 <lgl>

$Sheet1
# A tibble: 0 x 0

So, let’s only keep the first item by indexing the list with double square brackets:

casino_data <- casino_resource[[1]]

Cleaning up the dataframe

Let’s check out what the first couple rows of the dataframe looks like. By default, head() returns the first 6 rows:

head(casino_data) 
# A tibble: 6 x 94
  SurveyID Q1_A  Q1_B1 Q1_B2 Q1_B3 Q2_A  Q2_B  Q3_A  Q3_B  Q3_C  Q3_D 
     <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1        1 Stro… Do n… Do n… Do n… Does… "As … Not … Very… Not … Not …
2        2 Stro… Econ… Jobs  Arts… Fits… "Cos… Very… Very… Very… Very…
3        3 Stro… Ther… If t… <NA>  Fits… "Big… Very… Very… Very… Very…
4        4 Some… beli… mone… evid… Does… "My … Very… Very… Some… Some…
5        5 Neut… Like… Conc… <NA>  Neut… "Aga… Very… Very… Very… Not …
6        6 Stro… have… <NA>  <NA>  Does… "Tor… Not … Not … Not … Not …
# … with 83 more variables: Q3_E <chr>, Q3_F <chr>, Q3_G <chr>,
#   Q3_H <chr>, Q3_I <chr>, Q3_J <chr>, Q3_K <chr>, Q3_L <chr>,
#   Q3_M <chr>, Q3_N <chr>, Q3_O <chr>, Q3_P <chr>, Q3_Q <chr>,
#   Q3_Q_Other <chr>, Q3_Comments <chr>, Q4_A <chr>, Q5 <chr>,
#   Q6 <chr>, Q6_Comments <chr>, Q7_A_StandAlone <chr>,
#   Q7_A_Integrated <chr>, Q7_A1 <chr>, Q7_A2 <chr>, Q7_A3 <chr>,
#   Q7_A_A <chr>, Q7_A_B <chr>, Q7_A_C <chr>, Q7_A_D <chr>,
#   Q7_A_E <chr>, Q7_A_F <chr>, Q7_A_G <chr>, Q7_A_H <chr>,
#   Q7_A_I <chr>, Q7_A_J <chr>, Q7_A_J_Other <chr>,
#   Q7_B_StandAlone <chr>, Q7_B_Integrated <chr>, Q7_B1 <chr>,
#   Q7_B2 <chr>, Q7_B3 <chr>, Q7_B_A <chr>, Q7_B_B <chr>,
#   Q7_B_C <chr>, Q7_B_D <chr>, Q7_B_E <chr>, Q7_B_F <chr>,
#   Q7_B_G <chr>, Q7_B_H <chr>, Q7_B_I <chr>, Q7_B_J <chr>,
#   Q7_B_J_Other <chr>, Q7_C_StandAlone <chr>, Q7_C_Integrated <chr>,
#   Q7_C1 <chr>, Q7_C2 <chr>, Q7_C3 <chr>, Q7_C_A <chr>,
#   Q7_C_B <chr>, Q7_C_C <chr>, Q7_C_D <chr>, Q7_C_E <chr>,
#   Q7_C_F <chr>, Q7_C_G <chr>, Q7_C_H <chr>, Q7_C_I <chr>,
#   Q7_C_J <chr>, Q7_C_J_Other <chr>, Q8_A1 <chr>, Q8_A2 <chr>,
#   Q8_B1 <chr>, Q8_B2 <chr>, Q8_B3 <chr>, Q9 <chr>,
#   Q9_Considerations <chr>, Q10 <chr>, Q11 <chr>, Age <chr>,
#   Gender <chr>, PostalCode <chr>, GroupName <chr>,
#   DateCreated <dttm>, ...93 <lgl>, ...94 <lgl>

Unfortunately the column names aren’t very informative. For simplicity, we’ll use the .pdf questionnaire that accompanies this dataset from the Toronto Open Data website. Alternatively, we could get and parse the readme through the R package.

Here’s a link to the questionnaire.

Question 1 indicates the level of support for a casino in Toronto. We’ll use this as the response variable.

Concerning potential predictor variables, most of the questions ask respondents about their opinions on different aspects of a potential casino development, which aren’t particularly useful towards our cause. The only demographic variables are Age and Gender, so let’s choose these.

Here I’m also going to rename the columns so that my resulting data frame has columns opinion, age, and gender.

# Narrow down the dataframe to our variables of interest
casino_data <- casino_data %>%
  select(Q1_A, Age, Gender) %>%
  rename(opinion = Q1_A, age = Age, gender = Gender)

# Look at first couple rows:
head(casino_data)
# A tibble: 6 x 3
  opinion                   age   gender
  <chr>                     <chr> <chr> 
1 Strongly Opposed          25-34 Male  
2 Strongly in Favour        35-44 Female
3 Strongly in Favour        55-64 Male  
4 Somewhat Opposed          25-34 Male  
5 Neutral or Mixed Feelings 25-34 Female
6 Strongly Opposed          45-54 Female

Some visual exploration (and more cleanup, of course)

Let’s first do some quick exploration to get a feel for what’s going on in the data. We’ll first calculate proportions of casino support for each age-gender combination:

# Calculate proportions
casino_summary <- casino_data %>%
  group_by(age, gender, opinion) %>%
  summarise(n = n()) %>% # Count the number in each group and response
  group_by(age, gender) %>%
  mutate(prop = n/sum(n)) # Calculate proportions within each group

Some notes: * we use geom_col() to make a bar chart, * facet_grid() modifies the plot so that the plot has panels that correspond only to certain values of discrete variables (in this case, we will “facet” by age and gender). This is helpful in this case because we are interested in how the distribution of opinions changes by age and gender.

Some things to note:

Getting the data into a more model-suitable format

Get rid of responses that aren’t suitable

For simplicity we’ll assume that NA values and “Prefer not to disclose” responses occur randomly, and remove them from our dataset (note in reality this assumption might not hold up and we might want to be more careful). Let’s check how many rows are in the original dataset:

# nrow() returns the number of rows in a dataframe:
nrow(casino_data)
[1] 17766

Now let’s filter() accordingly to omit the responses we don’t want. In case you’re unfamiliar, I’m going to make use of:

casino_data <- casino_data %>%
  # Only keep rows with non-NA:
  filter(!is.na(opinion), !is.na(age), !is.na(gender)) %>%
  # Only keep rows where age and gender are disclosed:
  filter(age != "Prefer not to disclose", gender != "Prefer not to disclose")

Let’s check how many rows of data we’re left with:

nrow(casino_data)
[1] 13658

Convert response variable into binary

To clean up the first problem (response variables out of order), we might as well take this opportunity to convert these into a format suitable for our model. In a logistic regression, we would like our response variable to be binary, but in this case we have 5 possible categories ranging from “Strongly Opposed” to “Strongly in Favour”. We’ll recategorize them into a new supportive_or_not variable as follows.

We do this with the mutate() function, which creates new columns (possibly as functions of existing columns), and case_when(), which provides a way to assign values conditional on if-statements. The syntax here is a little strange. On the LHS of the ~ is the “if” condition, and the RHS of the tilde is the value to return. For example, x == 0 ~ 3 would return 3 when x is 0.

Another commonly used operator here is the %in% operator, which checks whether something is an element of a vector. E.g.:

# Store possible opinions in vectors
yes_opinions <- c("Strongly in Favour", "Somewhat in Favour")
no_opinions <- c("Neutral or Mixed Feelings", "Somewhat Opposed", "Strongly Opposed")

# Create `supportive` column:
casino_data <- casino_data %>%
  mutate(supportive = case_when(
    opinion %in% yes_opinions ~ TRUE, # Assign TRUE
    opinion %in% no_opinions ~ FALSE  # Assign FALSE
  ))

Convert age to a numeric variable

Age in this survey is given in age groups. Let’s instead treat it map it to a numeric variable so that we can more easily talk about trends with age. We’ll map the youngest age to 1, and so on:

casino_data <- casino_data %>%
  mutate(age_group = case_when(
    age == "Under 15" ~ 0,
    age == "15-24" ~ 1,
    age == "25-34" ~ 2,
    age == "35-44" ~ 3,
    age == "45-54" ~ 4,
    age == "55-64" ~ 5,
    age == "65 or older" ~ 6
  ))

Now let’s make the same plot again, with our new processed data:

We can sort of see some difference in the distribution between different panels. To formalize this, we can run a logistic regression.

Logistic Regression

Now, we’re set up to feed it to the regression. We can do this with glm(), which allows us to fit generalized linear models.

We use family = "binomial" to specify a logistic regression, and our formula is supportive ~ age_group + gender, which indicates that supportive is the (binary) response variable since it’s on the LHS, and age_group and gender are our predictor variables.

casino_glm <- glm(supportive ~ age_group + gender, data = casino_data, family = "binomial")

We can take a look at the results of running the GLM using summary():

summary(casino_glm)

Call:
glm(formula = supportive ~ age_group + gender, family = "binomial", 
    data = casino_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0107  -0.8888  -0.6804   1.4249   1.8822  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -1.10594    0.05863 -18.862  < 2e-16 ***
age_group           -0.07983    0.01376  -5.801 6.59e-09 ***
genderMale           0.70036    0.04027  17.390  < 2e-16 ***
genderTransgendered  0.69023    0.39276   1.757   0.0789 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 16010  on 13657  degrees of freedom
Residual deviance: 15653  on 13654  degrees of freedom
AIC: 15661

Number of Fisher Scoring iterations: 4

Interpretation

Interpretation can be a little tricky. Here are some important things to note about our results:

Numeric age group variable

Remember that we coded age_group as numbers 1 to 5. Because we’ve used age groups instead of age, we have to be careful with how we phrase our conclusion. The coefficient estimate corresponds to the effect of moving up a unit on the age group scale (e.g. from the 25-34 age group to the 35-44 age group), rather than 1 year in age (e.g. from age 28 to 29).

log-odds ratios

The effect estimates are on the log-odds scale. This means the effect of -0.07983 for age_group is interpreted as: for each unit increase in age_group, we estimate a 0.07983 decrease in the log-odds of being supportive of a casino.

We could exponentiate the coefficient estimate to make this at least a little easier to interpret. The number we get is interpreted as a factor for the odds.

exp(-0.07983)
[1] 0.9232733

So our (cleaner) interpretation is: the odds of an individuals of the same gender being pro-casino are predicted to change by a factor of 0.9232733 for each unit increase in age_group

Baseline category

First, note that because we have categorical variables, the gender coefficients are relative to a “baseline” category. The value of gender that doesn’t appear in the table, Female, is implicitly used as our baseline gender category.

Technical note: if the variable is stored as a character class, then glm() will choose the alphabetically first value to use as the baseline.

exp(0.70036)
[1] 2.014478

So, the interpretation of the genderMale coefficient is: the odds of a male individual supporting a casino is 2.0144778 times higher than a female individual of the same age_group.

Making estimates

A manual way

Using the formula found in ISLR 4.3.3, we can make estimates for an individual of certain characteristics. Suppose we wanted to predict the the probability of supporting a Toronto casino for an individual who was 36 and identified as transgender. Then:

First, let’s extract the coefficient estimates as a vector using coefficients():

coefs <- coefficients(casino_glm)
coefs
        (Intercept)           age_group          genderMale 
        -1.10593925         -0.07983372          0.70036199 
genderTransgendered 
         0.69022910 

Since this vector is labelelled, we can index it using square brackets and names. For instance:

coefs["age_group"]
  age_group 
-0.07983372 

So first let’s evaluate the exponent term \(e^{\beta_0 + \cdots + \beta_p X_p}\):

exp_term <- exp(coefs["(Intercept)"] + coefs["age_group"]*3 + coefs["genderTransgendered"]*1)

Now evaluate the expression that gives the probability of casino support:

# The unname() command just takes off the label that it "inherited" from the coefs vector.
# (don't worry about it, doesn't affect any functionality)
unname(exp_term / (1 + exp_term))
[1] 0.3418161

A more streamlined way

Thankfully R comes with a convenient function to make prediction estimates from a glm(). We do this using the predict() function. First, we need to make a dataframe that has the relevant variables and values that we’re interested in predicting. We’ll use the same values as before:

prediction_df <- data.frame(age_group = 3, gender = "Transgendered")

The dataframe looks like this:

prediction_df
  age_group        gender
1         3 Transgendered

Then we feed it into the predict() function, along with our glm object. To get the probability, we need to specify type = "response".

predict(casino_glm, newdata = prediction_df, type = "response")
        1 
0.3418161 

This matches the probability we got from doing this manually, yay!

Case study - Historical Canadian elections

Case study - Airbnb listing in Toronto

Essentials

In this case study we…

Recommended reading

Set up

Get data

The dataset is from Inside Airbnb (http://insideairbnb.com).

We can give read_csv() a link to where the dataset is and it will download it. This helps with reproducibility because the source is clear. But, as that link could change at any time, longer-term reproducibility, as well as wanting to minimise the effect on the Inside Airbnb servers, suggest that we should also save a local copy of the data and then use that. (As the original data is not ours, we should not make that public without first getting written permission.)

Clean data

There are an enormous number of columns, so we’ll just select a few.

names(airbnb_data) %>% length()
[1] 106
airbnb_data_selected <- 
  airbnb_data %>% 
  select(host_id, 
         host_since, 
         host_response_time, 
         host_is_superhost, 
         host_listings_count,
         host_total_listings_count,
         host_neighbourhood, 
         host_listings_count, 
         neighbourhood_cleansed, 
         room_type, 
         bathrooms, 
         bedrooms, square_feet, 
         price, 
         number_of_reviews, 
         has_availability, 
         review_scores_rating, 
         review_scores_accuracy, 
         review_scores_cleanliness, 
         review_scores_checkin, 
         review_scores_communication, 
         review_scores_location, 
         review_scores_value
         )

We might like to have a brief look at the dataset to see if anything weird is going on. There are a bunch of ways of doing this, but one way is ‘skim’ from the skimr package (Waring, Quinn, McNamara, Arino de la Rubia, Zhu and Ellis, 2020).

# skim(airbnb_data_selected)

A few things jump out:

  1. There are a character variables that should probably be numerics or dates/times: host_response_time, price, weekly_price, monthly_price, cleaning_fee.
  2. Weekly and monthly price is missing in an overwhelming number of observations.
  3. Roughly a fifth of observations are missing a review score and there seems like there is some correlation between those review-type variables.
  4. There are two variants of the neighbourhood name.
  5. There are NAs in host_is_superhost.
  6. The reviews seem really skewed.
  7. There is someone who has 328 properties on Airbnb.

Price

First we need to convert to a numeric. This is a common problem, and you need to be a little careful that it doesn’t all just convert to NAs. In our case if we just force the price data to be a numeric then it will go to NA because there are a lot of characters that R doesn’t know how to convert, e.g. what is the numeric for ‘$’? So we need to remove those characters first.

airbnb_data_selected$price %>% head()
[1] "$469.00" "$99.00"  "$66.00"  "$72.00"  "$199.00" "$54.00" 
# First work out what is going on
airbnb_data_selected$price %>% str_split("") %>% unlist() %>% unique()
 [1] "$" "4" "6" "9" "." "0" "7" "2" "1" "5" "3" "8" ","
# It's clear that '$' needs to go. The only odd thing is ',' so take a look at those:
airbnb_data_selected %>% 
  select(price) %>% 
  filter(str_detect(price, ","))
# A tibble: 87 x 1
   price    
   <chr>    
 1 $1,988.00
 2 $1,099.00
 3 $2,799.00
 4 $1,100.00
 5 $1,450.00
 6 $1,999.00
 7 $1,060.00
 8 $1,300.00
 9 $2,142.00
10 $1,200.00
# … with 77 more rows
# It's clear that the data is just nicely formatted, but we need to remove the comma:
airbnb_data_selected <- 
  airbnb_data_selected %>% 
  mutate(price = str_remove(price, "\\$"),
         price = str_remove(price, ","),
         price = as.integer(price)
         )

Now we can have a look at prices.

# Look at distribution of price
airbnb_data_selected %>%
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 10) +
  theme_classic() +
  labs(x = "Price per night",
       y = "Number of properties")
#  We use bins with a width of 10, so that's going to aggregate prices into 10s so that we don't get overwhelmed with bars.

So we have some outliers. Let’s zoom in on prices that are more than $1,000.

# Look at distribution of high prices
airbnb_data_selected %>%
  filter(price > 1000) %>% 
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 10) +
  theme_classic() +
  labs(x = "Price per night",
       y = "Number of properties")

Let’s look in more detail at those with a price more than $4,999
airbnb_data_selected %>%
  filter(price > 4999)
# A tibble: 22 x 22
   host_id host_since host_response_t… host_is_superho…
     <dbl> <date>     <chr>            <lgl>           
 1  9.91e7 2016-10-10 N/A              FALSE           
 2  5.29e6 2013-03-02 within an hour   TRUE            
 3  1.20e8 2017-03-07 N/A              FALSE           
 4  8.46e6 2013-08-27 N/A              FALSE           
 5  1.47e8 2017-08-22 N/A              FALSE           
 6  6.74e7 2016-04-16 N/A              FALSE           
 7  2.58e8 2019-04-24 within an hour   TRUE            
 8  2.58e8 2019-04-24 within an hour   TRUE            
 9  2.71e8 2019-06-24 a few days or m… FALSE           
10  2.72e8 2019-06-27 within an hour   FALSE           
# … with 12 more rows, and 18 more variables:
#   host_listings_count <dbl>, host_total_listings_count <dbl>,
#   host_neighbourhood <chr>, neighbourhood_cleansed <chr>,
#   room_type <chr>, bathrooms <dbl>, bedrooms <dbl>,
#   square_feet <dbl>, price <int>, number_of_reviews <dbl>,
#   has_availability <lgl>, review_scores_rating <dbl>,
#   review_scores_accuracy <dbl>, review_scores_cleanliness <dbl>,
#   review_scores_checkin <dbl>, review_scores_communication <dbl>,
#   review_scores_location <dbl>, review_scores_value <dbl>
# It's pretty clear that there is something odd going on with some of these, but some of them seem legit.

Let’s look at the distribution of prices that are in a ‘reasonable’ range, which until Monica is a full professor, will be defined as a nightly price of less than $1,000.

airbnb_data_selected %>%
  filter(price < 1000) %>% 
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 10) +
  theme_classic() +
  labs(x = "Price per night",
       y = "Number of properties")

Interestingly it looks like there is some bunching of prices, possible around numbers ending in zero or nine? Let’s just zoom in on prices between $90 and $210, out of interest, but change the bins to be smaller.

# Look at distribution of price again
airbnb_data_selected %>%
  filter(price > 90) %>% 
  filter(price < 210) %>% 
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 1) +
  theme_classic() +
  labs(x = "Price per night",
       y = "Number of properties")

Superhosts

Airbnb says that:

Superhosts are experienced hosts who provide a shining example for other hosts, and extraordinary experiences for their guests.

Once a host reaches Superhost status, a badge superhost badge will automatically appear on their listing and profile to help you identify them.

We check Superhosts’ activity four times a year, to ensure that the program highlights the people who are most dedicated to providing outstanding hospitality.

First we’ll look at the NAs in host_is_superhost.

airbnb_data_selected %>%
  filter(is.na(host_is_superhost))
# A tibble: 285 x 22
   host_id host_since host_response_t… host_is_superho…
     <dbl> <date>     <chr>            <lgl>           
 1  2.27e6 NA         <NA>             NA              
 2  2.27e6 NA         <NA>             NA              
 3  2.27e6 NA         <NA>             NA              
 4  5.99e6 NA         <NA>             NA              
 5  8.73e6 NA         <NA>             NA              
 6  1.11e7 NA         <NA>             NA              
 7  1.62e7 NA         <NA>             NA              
 8  1.46e5 NA         <NA>             NA              
 9  1.56e7 NA         <NA>             NA              
10  1.85e7 NA         <NA>             NA              
# … with 275 more rows, and 18 more variables:
#   host_listings_count <dbl>, host_total_listings_count <dbl>,
#   host_neighbourhood <chr>, neighbourhood_cleansed <chr>,
#   room_type <chr>, bathrooms <dbl>, bedrooms <dbl>,
#   square_feet <dbl>, price <int>, number_of_reviews <dbl>,
#   has_availability <lgl>, review_scores_rating <dbl>,
#   review_scores_accuracy <dbl>, review_scores_cleanliness <dbl>,
#   review_scores_checkin <dbl>, review_scores_communication <dbl>,
#   review_scores_location <dbl>, review_scores_value <dbl>

There are 285 of these and it’s clear that there is something odd going on - maybe the host removed the listing or similar?

We’ll also want to create a binary variable out of this. It’s true/false at the moment, which is fine for the modelling, but there are a handful of situations where it’ll be easier if we have a 0/1.

airbnb_data_selected <- 
  airbnb_data_selected %>%
  mutate(host_is_superhost_binary = case_when(
    host_is_superhost == TRUE ~ 1,
    host_is_superhost == FALSE ~ 0,
    TRUE ~ 999
    )
  )
airbnb_data_selected$host_is_superhost_binary[airbnb_data_selected$host_is_superhost_binary == 999] <- NA

Reviews

Airbnb says that:

In addition to written reviews, guests can submit an overall star rating and a set of category star ratings for their stay.

Hosts can view their star ratings on their Progress page, under Reviews. Hosts using professional hosting tools can find reviews and quality details on their Performance page, under Quality.

Guests can give ratings on:

In each category, hosts are able to see how often they get 5 stars, how guests rated nearby hosts, and, in some cases, tips to help improve the listing.

The number of stars displayed at the top of a listing page is an aggregate of the primary scores guests have given for that listing. At the bottom of a listing page, there’s an aggregate for each category rating. A host needs to receive star ratings from at least 3 guests before their aggregate score appears.

TODO: I don’t understand how these review scores are being constructed. Airbnb says it’s a star rating, but how are they converting this into the 10 point scale, similarly, how are their constructing the overall one, which seems to be out of 100? There’s a lot of clumping around 20, 40, 60, 80, 100 - could they be averaging a five-star scale and then rebasing it to 100?

Now we’ll deal with the NAs in review_scores_rating. This one is more complicated as there are a lot of them.

airbnb_data_selected %>%
  filter(is.na(review_scores_rating))
# A tibble: 4,676 x 23
   host_id host_since host_response_t… host_is_superho…
     <dbl> <date>     <chr>            <lgl>           
 1   48239 2009-10-25 N/A              FALSE           
 2  545074 2011-04-29 N/A              FALSE           
 3 1210571 2011-09-26 N/A              FALSE           
 4 1408862 2011-11-15 N/A              FALSE           
 5 1411076 2011-11-15 N/A              FALSE           
 6 1409872 2011-11-15 N/A              FALSE           
 7 1466410 2011-12-02 within a few ho… FALSE           
 8 1664812 2012-01-28 N/A              FALSE           
 9 1828773 2012-02-28 N/A              FALSE           
10 1923052 2012-03-14 N/A              FALSE           
# … with 4,666 more rows, and 19 more variables:
#   host_listings_count <dbl>, host_total_listings_count <dbl>,
#   host_neighbourhood <chr>, neighbourhood_cleansed <chr>,
#   room_type <chr>, bathrooms <dbl>, bedrooms <dbl>,
#   square_feet <dbl>, price <int>, number_of_reviews <dbl>,
#   has_availability <lgl>, review_scores_rating <dbl>,
#   review_scores_accuracy <dbl>, review_scores_cleanliness <dbl>,
#   review_scores_checkin <dbl>, review_scores_communication <dbl>,
#   review_scores_location <dbl>, review_scores_value <dbl>,
#   host_is_superhost_binary <dbl>
Now see if it’s just because they don’t have any reviews.
airbnb_data_selected %>%
  filter(is.na(review_scores_rating)) %>% 
  select(number_of_reviews) %>% 
  table()
.
   0    1    2    3    4 
4389  251   26    8    2 

So it’s clear that in almost all these cases they don’t have a review yet because they don’t have enough reviews. However, it’s a large proportion of the total - almost a fifth of properties don’t have any reviews (hence an NA in review_scores_rating).

We can use vis_miss from the visdat package (Tierney, 2017) to make sure that all components of the review are missing. If the NAs are being driven by the Airbnb requirement of at least three reviews then we would expect they would all be missing.

# We'll just check whether this is the same for all of the different variants of reviews
airbnb_data_selected %>% 
  select(review_scores_rating, 
         review_scores_accuracy, 
         review_scores_cleanliness, 
         review_scores_checkin, 
         review_scores_communication, 
         review_scores_location, 
         review_scores_value) %>% 
  vis_miss()

It looks pretty convincing that in almost all cases, all the different variants of reviews are missing. So let’s just focus on the main review.

airbnb_data_selected %>%
  filter(!is.na(review_scores_rating)) %>% 
  ggplot(aes(x = review_scores_rating)) +
  geom_histogram(binwidth = 1) +
  theme_classic() +
  labs(x = "Average review score",
       y = "Number of properties")

It’s pretty clear that almost all the reviews are more than 80. Let’s just zoom in on that 60 to 80 range to check what the distribution looks like in that range.

airbnb_data_selected %>%
  filter(!is.na(review_scores_rating)) %>% 
  filter(review_scores_rating > 60) %>%
  filter(review_scores_rating < 80) %>% 
  ggplot(aes(x = review_scores_rating)) +
  geom_histogram(binwidth = 1) +
  theme_classic() +
  labs(x = "Average review score",
       y = "Number of properties")

Response time

Airbnb says that:

Hosts have 24 hours to officially accept or decline reservation requests. You’ll be updated by email about the status of your request.

More than half of all reservation requests are accepted within one hour of being received. The vast majority of hosts reply within 12 hours.

If a host confirms your request, your payment is processed and collected by Airbnb in full. If a host declines your request or the request expires, we don’t process your payment.

TODO: I don’t understand how you can get a response time of NA? It must be related to some other variable.

Looking now at response time:
table(airbnb_data_selected$host_response_time)

a few days or more                N/A       within a day 
               321               6324               1584 
within a few hours     within an hour 
              2542              12341 

Interestingly it seems like what looks like ‘NAs’ in the host_response_time variable are not being coded as proper NAs, but are instead being treated as another category. We’ll recode them to be actual NAs.

airbnb_data_selected$host_response_time[airbnb_data_selected$host_response_time == "N/A"] <- NA

So here we clearly have issues with NAs. We probably want to filter them away for this example because it’s just a quick example, but there are an awful lot of them (more than 20 per cent) so we’ll have a quick look at them in relation to the review score.

airbnb_data_selected %>% 
  filter(is.na(host_response_time)) %>% 
  ggplot(aes(x = review_scores_rating)) +
  geom_histogram(binwidth = 1)

There seem to be an awful lot that have an overall review of 100. There are also an awful lot that have a review score of NA.

airbnb_data_selected %>% 
  filter(is.na(host_response_time)) %>%
  filter(is.na(review_scores_rating))
# A tibble: 2,313 x 23
   host_id host_since host_response_t… host_is_superho…
     <dbl> <date>     <chr>            <lgl>           
 1   48239 2009-10-25 <NA>             FALSE           
 2  545074 2011-04-29 <NA>             FALSE           
 3 1210571 2011-09-26 <NA>             FALSE           
 4 1408862 2011-11-15 <NA>             FALSE           
 5 1411076 2011-11-15 <NA>             FALSE           
 6 1409872 2011-11-15 <NA>             FALSE           
 7 1664812 2012-01-28 <NA>             FALSE           
 8 1828773 2012-02-28 <NA>             FALSE           
 9 1923052 2012-03-14 <NA>             FALSE           
10 2291374 2012-05-04 <NA>             FALSE           
# … with 2,303 more rows, and 19 more variables:
#   host_listings_count <dbl>, host_total_listings_count <dbl>,
#   host_neighbourhood <chr>, neighbourhood_cleansed <chr>,
#   room_type <chr>, bathrooms <dbl>, bedrooms <dbl>,
#   square_feet <dbl>, price <int>, number_of_reviews <dbl>,
#   has_availability <lgl>, review_scores_rating <dbl>,
#   review_scores_accuracy <dbl>, review_scores_cleanliness <dbl>,
#   review_scores_checkin <dbl>, review_scores_communication <dbl>,
#   review_scores_location <dbl>, review_scores_value <dbl>,
#   host_is_superhost_binary <dbl>

Host number of listings

There are two versions of a variable telling you how many properties a host has on Airbnb, so to start just check whether there’s a difference.

airbnb_data_selected %>% 
  mutate(listings_count_is_same = if_else(host_listings_count == host_total_listings_count, 1, 0)) %>% 
  filter(listings_count_is_same == 0)
# A tibble: 0 x 24
# … with 24 variables: host_id <dbl>, host_since <date>,
#   host_response_time <chr>, host_is_superhost <lgl>,
#   host_listings_count <dbl>, host_total_listings_count <dbl>,
#   host_neighbourhood <chr>, neighbourhood_cleansed <chr>,
#   room_type <chr>, bathrooms <dbl>, bedrooms <dbl>,
#   square_feet <dbl>, price <int>, number_of_reviews <dbl>,
#   has_availability <lgl>, review_scores_rating <dbl>,
#   review_scores_accuracy <dbl>, review_scores_cleanliness <dbl>,
#   review_scores_checkin <dbl>, review_scores_communication <dbl>,
#   review_scores_location <dbl>, review_scores_value <dbl>,
#   host_is_superhost_binary <dbl>, listings_count_is_same <dbl>

There are none in this dataset so we can just remove one column for now and have a quick look at the other one.

airbnb_data_selected <- 
  airbnb_data_selected %>% 
  select(-host_listings_count)

airbnb_data_selected %>% 
  count(host_total_listings_count)
# A tibble: 60 x 2
   host_total_listings_count     n
                       <dbl> <int>
 1                         0  1750
 2                         1  9727
 3                         2  3424
 4                         3  1889
 5                         4  1049
 6                         5   809
 7                         6   530
 8                         7   353
 9                         8   284
10                         9   287
# … with 50 more rows

So there are a large number who have somewhere in the 2-10 properties range, but the usual long tail. The number with 0 listings is unexpected and worth following up on. And there are a bunch with NA that we’ll need to deal with.

airbnb_data_selected %>% 
  filter(host_total_listings_count == 0) %>% 
  head()
# A tibble: 6 x 22
  host_id host_since host_response_t… host_is_superho…
    <dbl> <date>     <chr>            <lgl>           
1  140602 2010-06-08 within an hour   FALSE           
2  992557 2011-08-19 a few days or m… FALSE           
3 2737251 2012-06-25 <NA>             FALSE           
4 3367744 2012-08-25 within a day     FALSE           
5 2139510 2012-04-14 <NA>             FALSE           
6 1786233 2012-02-21 within a few ho… FALSE           
# … with 18 more variables: host_total_listings_count <dbl>,
#   host_neighbourhood <chr>, neighbourhood_cleansed <chr>,
#   room_type <chr>, bathrooms <dbl>, bedrooms <dbl>,
#   square_feet <dbl>, price <int>, number_of_reviews <dbl>,
#   has_availability <lgl>, review_scores_rating <dbl>,
#   review_scores_accuracy <dbl>, review_scores_cleanliness <dbl>,
#   review_scores_checkin <dbl>, review_scores_communication <dbl>,
#   review_scores_location <dbl>, review_scores_value <dbl>,
#   host_is_superhost_binary <dbl>

There’s nothing that immediately jumps out as odd about the people with zero listings, but there must be something going on.

Based on this dataset, there’s a third way of looking at the number of properties someone has and that’s to look at the number of times the unique ID occurs.

airbnb_data_selected %>% 
  count(host_id) %>% 
  arrange(-n) %>% 
  head()
# A tibble: 6 x 2
    host_id     n
      <dbl> <int>
1  10202618   121
2   1919294    82
3 152088065    53
4  33238402    50
5  63526506    48
6 164320990    48

Again this makes it clear that there are many with multiple properties listed.

Decisions

The purpose of this document is to just give a quick introduction to using real-world data, so we’ll just remove anything that is annoying, but if you’re using this for research then you’d need to justify these decisions and/or possibly make different ones.

Get rid of prices more than $999.
airbnb_data_filtered <- 
  airbnb_data_selected %>% 
  filter(price < 1000)
dim(airbnb_data_filtered)
[1] 23310    22
Get rid of anyone with an NA for whether they are a super host.
# Just remove host_is_superhost NAs for now.
airbnb_data_filtered <- 
  airbnb_data_filtered %>%
  filter(!is.na(host_is_superhost))
dim(airbnb_data_filtered)
[1] 23025    22
Get rid of anyone with an NA in their main review score - this removes roughly 20 per cent of observations.
# We'll just get rid of them for now, but this is probably something that deserves more attention - possibly in an appendix or similar.
airbnb_data_filtered <- 
  airbnb_data_filtered %>%
  filter(!is.na(review_scores_rating))
# There are still some where the rest of the reviews are missing even though there is a main review score
# There seem to be an awful lot that have an overall review of 100. Does that make sense?
dim(airbnb_data_filtered)
[1] 18485    22
Get rid of anyone with a main review score less than 70.
# We'll just get rid of them for now, but this is probably something that deserves more attention - possibly in an appendix or similar.
airbnb_data_filtered <- 
  airbnb_data_filtered %>%
  filter(review_scores_rating > 69)
# There are still some where the rest of the reviews are missing even though there is a main review score
# There seem to be an awful lot that have an overall review of 100. Does that make sense?
dim(airbnb_data_filtered)
[1] 18143    22
Get rid of anyone with a NA in their response time - this removes roughly another 20 per cent of the observations.
airbnb_data_filtered <- 
  airbnb_data_filtered %>% 
  filter(!is.na(host_response_time))
dim(airbnb_data_filtered)
[1] 14165    22

(TODO: We don’t have to do this next step as we’ve already got rid of them at some other point. So there’s something systematic going on and we should come back and look into it.)

Get rid of anyone with a NA in their number of properties.
airbnb_data_filtered <- 
  airbnb_data_filtered %>% 
  filter(!is.na(host_total_listings_count))
dim(airbnb_data_filtered)
[1] 14165    22
Get rid of anyone with a 100 for their review_scores_rating.
airbnb_data_filtered <- 
  airbnb_data_filtered %>% 
  filter(review_scores_rating != 100)
dim(airbnb_data_filtered)
[1] 10857    22
Only keep people with one property:
airbnb_data_filtered <- 
  airbnb_data_filtered %>% 
  add_count(host_id) %>% 
  filter(n == 1) %>% 
  select(-n)
dim(airbnb_data_filtered)
[1] 4880   22

Explore data

We might like to make some graphs to see if we can any relationships jump out. Some aspects that come to mind is looking at prices and reviews and super hosts, and number of properties and neighbourhood.

TODO: Come back and make this paragraph better.

Price and reviews

Look at the relationship between price and reviews, and whether they are a super-host.

# Look at both price and reviews
airbnb_data_filtered %>%
  ggplot(aes(x = price, y = review_scores_rating, color = host_is_superhost)) +
  geom_point(size = 1, alpha = 0.1) + # Make the points smaller and more transparent as they overlap considerably.
  theme_classic() +
  labs(x = "Price per night",
       y = "Average review score",
       color = "Super host") + # Probably should recode this to more meaningful than TRUE/FALSE.
  scale_color_brewer(palette = "Set1")

Superhost and response-time

One of the aspects that may make someone a super host is how quickly they respond to enquiries. One could imagine that being a superhost involves quickly saying yes or no to enquiries. Let’s look at the data. First, we want to look at the possible values of superhost by their response times.

airbnb_data_filtered %>% 
  tabyl(host_is_superhost) %>% 
  adorn_totals("row") %>% 
  adorn_pct_formatting()
 host_is_superhost    n percent
             FALSE 2617   53.6%
              TRUE 2263   46.4%
             Total 4880  100.0%

Fortunately, it looks like when we removed the reviews rows we removed any NAs from whether they were a super host, but if we go back and look into that we may need to check again. The tabyl function within the janitor package (Firke, 2020) would list the NAs if there were any, but in case you don’t trust it, another way of check this is to try to filter to just the NAs.

airbnb_data_filtered %>% 
  filter(is.na(host_is_superhost))
# A tibble: 0 x 22
# … with 22 variables: host_id <dbl>, host_since <date>,
#   host_response_time <chr>, host_is_superhost <lgl>,
#   host_total_listings_count <dbl>, host_neighbourhood <chr>,
#   neighbourhood_cleansed <chr>, room_type <chr>, bathrooms <dbl>,
#   bedrooms <dbl>, square_feet <dbl>, price <int>,
#   number_of_reviews <dbl>, has_availability <lgl>,
#   review_scores_rating <dbl>, review_scores_accuracy <dbl>,
#   review_scores_cleanliness <dbl>, review_scores_checkin <dbl>,
#   review_scores_communication <dbl>, review_scores_location <dbl>,
#   review_scores_value <dbl>, host_is_superhost_binary <dbl>

Now let’s look at the response time.

airbnb_data_filtered %>% 
  tabyl(host_response_time) %>% 
  adorn_totals("row") %>% 
  adorn_pct_formatting()
 host_response_time    n percent
 a few days or more   78    1.6%
       within a day  538   11.0%
 within a few hours  833   17.1%
     within an hour 3431   70.3%
              Total 4880  100.0%

So a vast majority respond within an hour.

Finally, we can look at the cross-tab.

airbnb_data_filtered %>% 
  tabyl(host_response_time, host_is_superhost) %>% 
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 0) %>%
  adorn_ns() %>% 
  adorn_title()
                    host_is_superhost           
 host_response_time             FALSE       TRUE
 a few days or more        92%   (72)  8%    (6)
       within a day        74%  (396) 26%  (142)
 within a few hours        60%  (502) 40%  (331)
     within an hour        48% (1647) 52% (1784)

So if someone doesn’t respond within an hour then it’s unlikely that they are a super host.

Neighbourhood

Finally, let’s look at neighbourhood. The data provider has attempted to clean the neighbourhood variable for us, so we’ll just use this for now. If we were doing this analysis properly, we’d need to check whether they’d made any mistakes.

# We expect something in the order of 100 to 150 neighbourhoods, with the top ten accounting for a large majority of listings.
airbnb_data_filtered %>% 
  tabyl(neighbourhood_cleansed) %>% 
  adorn_totals("row") %>% 
  adorn_pct_formatting() %>% 
  nrow()
[1] 139
airbnb_data_filtered %>% 
  tabyl(neighbourhood_cleansed) %>% 
  adorn_pct_formatting() %>% 
  arrange(-n) %>% 
  filter(n > 100) %>% 
  adorn_totals("row") %>% 
  head()
              neighbourhood_cleansed    n percent
   Waterfront Communities-The Island 1001   20.5%
                             Niagara  273    5.6%
               Church-Yonge Corridor  176    3.6%
                               Annex  174    3.6%
 Dovercourt-Wallace Emerson-Junction  163    3.3%
                 Bay Street Corridor  130    2.7%

Model data

We will now run some models on our dataset. We will first split the data into test/training groups, we do this using functions from the tidymodels package (Kuhn and Wickham, 2019) (which like the tidyverse package (Wickham et al., 2019) is a package of packages).

set.seed(853)

airbnb_data_filtered_split <- 
  airbnb_data_filtered %>%
  initial_split(prop = 3/4)

airbnb_train <- training(airbnb_data_filtered_split)
airbnb_test <- testing(airbnb_data_filtered_split)

rm(airbnb_data_filtered_split)

Logistic regression

We may like to look at whether we can forecast whether someone is a super host, and the factors that go into explaining that. As the dependent variable is binary, this is a good opportunity to look at logistic regression. We expect that better reviews will be associated with faster responses and higher reviews. Specifically, the model that we estimate is:

\[\mbox{Prob(Is super host} = 1) = \beta_0 + \beta_1 \mbox{Response time} + \beta_2 \mbox{Reviews} + \epsilon\]

We estimate the model using glm in the R language (R Core Team, 2019).

logistic_reg_superhost_response_review <- glm(host_is_superhost ~ 
                                                host_response_time + 
                                                review_scores_rating,
                                              data = airbnb_train,
                                              family = binomial
                                              )

We can have a quick look at the results, for instance, the summary function. We could also use tidy or glance from the broom package (Robinson and Hayes, 2020). The details should be the same, but the broom functions are tibbles, which means that we can more easily deal with them within a tidy framework.

summary(logistic_reg_superhost_response_review)

Call:
glm(formula = host_is_superhost ~ host_response_time + review_scores_rating, 
    family = binomial, data = airbnb_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9513  -0.7401  -0.0529   0.8562   4.4894  

Coefficients:
                                      Estimate Std. Error z value
(Intercept)                          -49.44238    1.90724 -25.924
host_response_timewithin a day         1.13811    0.47746   2.384
host_response_timewithin a few hours   1.99356    0.46831   4.257
host_response_timewithin an hour       2.42867    0.46035   5.276
review_scores_rating                   0.49249    0.01905  25.851
                                     Pr(>|z|)    
(Intercept)                           < 2e-16 ***
host_response_timewithin a day         0.0171 *  
host_response_timewithin a few hours 2.07e-05 ***
host_response_timewithin an hour     1.32e-07 ***
review_scores_rating                  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 5051.2  on 3659  degrees of freedom
Residual deviance: 3583.2  on 3655  degrees of freedom
AIC: 3593.2

Number of Fisher Scoring iterations: 6
tidy(logistic_reg_superhost_response_review)
# A tibble: 5 x 5
  term                          estimate std.error statistic   p.value
  <chr>                            <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)                    -49.4      1.91      -25.9  3.61e-148
2 host_response_timewithin a d…    1.14     0.477       2.38 1.71e-  2
3 host_response_timewithin a f…    1.99     0.468       4.26 2.07e-  5
4 host_response_timewithin an …    2.43     0.460       5.28 1.32e-  7
5 review_scores_rating             0.492    0.0191     25.9  2.39e-147
glance(logistic_reg_superhost_response_review)
# A tibble: 1 x 8
  null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
          <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1         5051.    3659 -1792. 3593. 3624.    3583.        3655  3660

We might like to look at what our model predicts, compared with whether the person was actually a super host. We can do that in a variety of ways, but one way is to use augment from the broom package (Robinson and Hayes, 2020). This will add the prediction and associated uncertainty to the data. For every row we will then have the probability that our model is estimating that they are a superhost. But ultimately, we need a binary forecast. There are a bunch of different options, but one is to just say that if the model estimates a probability of more than 0.5 then we bin it into a superhost, and other not.

airbnb_data_filtered_logistic_fit_train <- 
  augment(logistic_reg_superhost_response_review, 
          data = airbnb_train %>% select(host_is_superhost, 
                                         host_is_superhost_binary,
                                         host_response_time,
                                         review_scores_rating
                                         ),
          type.predict = "response") %>% # We use the "response" option here so that the function does the work of worrying about the exponential and log odds for us. Our result will be a probability.
  select(-.hat, -.sigma, -.cooksd, -.std.resid) %>% 
  mutate(predict_host_is_superhost = if_else(.fitted > 0.5, 1, 0), # How do things change if we change the 0.5 cutoff?
         host_is_superhost_binary = as.factor(host_is_superhost_binary),
         predict_host_is_superhost_binary = as.factor(predict_host_is_superhost)
         )

We can look at how far off the model is. There are a bunch of ways of doing this, but one is to look at what probability the model has given each person.

airbnb_data_filtered_logistic_fit_train %>% 
  ggplot(aes(x = .fitted, fill = host_is_superhost_binary)) +
  geom_histogram(binwidth = 0.05, position = "dodge") +
  theme_classic() +
  labs(x = "Estimated probability that host is super host",
       y = "Number",
       fill = "Host is super host") +
  scale_fill_brewer(palette = "Set1")

We can look at how the model probabilities change based on average review score, and their average time to respond.

ggplot(airbnb_data_filtered_logistic_fit_train, 
       aes(x = review_scores_rating, 
           y = .fitted, 
           color = host_response_time)) +
  geom_line() +
  geom_point() +
  labs(x = "Average review score",
       y = "Predicted probability of being a superhost",
       color = "Host response time") +
  theme_classic() +
  scale_color_brewer(palette = "Set1")
# What is missing from this graph?

This nice thing about this graph is that it illustrates nicely the effect of a host having an average response time of, say, ‘within an hour’ compared with ‘within a few hours’.

We can focus on how the model does in terms of raw classification using confusionMatrix from the caret package (Kuhn, 2020). This also gives a bunch of diagnostics (the help file explains what they are). In general, they suggest this isn’t the best model that’s ever existed.

caret::confusionMatrix(data = airbnb_data_filtered_logistic_fit_train$predict_host_is_superhost_binary,
                       reference = airbnb_data_filtered_logistic_fit_train$host_is_superhost_binary)
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1502  341
         1  472 1345
                                         
               Accuracy : 0.7779         
                 95% CI : (0.764, 0.7912)
    No Information Rate : 0.5393         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.5555         
                                         
 Mcnemar's Test P-Value : 5.132e-06      
                                         
            Sensitivity : 0.7609         
            Specificity : 0.7977         
         Pos Pred Value : 0.8150         
         Neg Pred Value : 0.7402         
             Prevalence : 0.5393         
         Detection Rate : 0.4104         
   Detection Prevalence : 0.5036         
      Balanced Accuracy : 0.7793         
                                         
       'Positive' Class : 0              
                                         

In any case, to this point we’ve been looking at how the model has done on the training set. It’s also relevant how it does on the test set. Again, there are a bunch of ways to do this, but one is to again use the augment function, but to include a newdata argument.

airbnb_data_filtered_logistic_fit_test <- 
  augment(logistic_reg_superhost_response_review, 
          data = airbnb_train %>% select(host_is_superhost, 
                                         host_is_superhost_binary,
                                         host_response_time,
                                         review_scores_rating
                                         ),
          newdata = airbnb_test %>% select(host_is_superhost, 
                                         host_is_superhost_binary,
                                         host_response_time,
                                         review_scores_rating
                                         ), # I'm selecting just because the
          # dataset is quite wide, and so this makes it easier to look at.
          type.predict = "response") %>% 
  mutate(predict_host_is_superhost = if_else(.fitted > 0.5, 1, 0), 
         host_is_superhost_binary = as.factor(host_is_superhost_binary),
         predict_host_is_superhost_binary = as.factor(predict_host_is_superhost)
         )

We would expect the performance to be slightly worse on the test set. But it’s actually fairly similar.

caret::confusionMatrix(data = airbnb_data_filtered_logistic_fit_test$predict_host_is_superhost_binary,
                       reference = airbnb_data_filtered_logistic_fit_test$host_is_superhost_binary)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 501 132
         1 142 445
                                          
               Accuracy : 0.7754          
                 95% CI : (0.7509, 0.7986)
    No Information Rate : 0.527           
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.5499          
                                          
 Mcnemar's Test P-Value : 0.5866          
                                          
            Sensitivity : 0.7792          
            Specificity : 0.7712          
         Pos Pred Value : 0.7915          
         Neg Pred Value : 0.7581          
             Prevalence : 0.5270          
         Detection Rate : 0.4107          
   Detection Prevalence : 0.5189          
      Balanced Accuracy : 0.7752          
                                          
       'Positive' Class : 0               
                                          

We could compare the test with the training sets in terms of forecasts.

training <- airbnb_data_filtered_logistic_fit_train %>% 
  select(host_is_superhost_binary, .fitted) %>% 
  mutate(type = "Training set")

test <- airbnb_data_filtered_logistic_fit_test %>% 
  select(host_is_superhost_binary, .fitted) %>% 
  mutate(type = "Test set")

both <- rbind(training, test)
rm(training, test)

both %>% 
  ggplot(aes(x = .fitted, 
             fill = host_is_superhost_binary)) +
  geom_histogram(binwidth = 0.05, position = "dodge") +
  theme_minimal() +
  labs(x = "Estimated probability that host is super host",
       y = "Number",
       fill = "Host is super host") +
  scale_fill_brewer(palette = "Set1") +
  facet_wrap(vars(type),
             nrow = 2,
             scales = "free_y")

HINT: How many aspects mentioned in the modelling criteria in rubric for PS3 have I gone through?

Next steps

Ideas for the future:

References

References