APIs

Required reading

Recommended reading

Recommended viewing

Key concepts/skills/etc

Key libraries

Key functions/etc

Pre-quiz

  1. In your own words, what is an API?
  2. Find two APIs and discuss how you could use them to tell interesting stories.
  3. Find two APIs that have an R package written around them. How could you use these to tell interesting stories?
  4. What is the main argument to the GET method?

Introduction

In everyday language, and for our purposes, an Application Programming Interface (API) is simply a situation in which someone has set up specific files on their computer such that you can follow their instructions to get them. For instance, when you use a gif on Slack, Slack asks Giphy’s server for the appropriate gif, Giphy’s server gives that gif to Slack and then Slack inserts it into your chat. The way in which Slack and Giphy interact is determined by Giphy’s API. More strictly, an API is just an application that runs on a server that we access using the HTTP protocol.

In our case, we are going to focus on using APIs for gathering data, so I’ll tailor the language that I use toward that:

[a]n API is the tool that makes a website’s data digestible for a computer. Through it, a computer can view and edit data, just like a person can by loading pages and submitting forms. (Cooksey 2014, Chapter 1)

For instance, you could go to Google Maps and then scroll and click and drag to center the map on Canberra, Australia, or you could just paste this into your browser: https://www.google.ca/maps/@-35.2812958,149.1248113,16z. You just used the Google Maps API.1
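To make the structure of that address concrete, here is a minimal sketch that assembles the same URL from its parts in R. The variable names are my own, and the pattern of latitude, then longitude, then zoom is simply read off the URL above rather than taken from any Google documentation.

```r
# Pieces of the map address, read off the URL in the text
latitude <- -35.2812958
longitude <- 149.1248113
zoom <- 16L

# Glue the pieces into one address; "%.7f" keeps seven decimal places
canberra_url <- sprintf(
  "https://www.google.ca/maps/@%.7f,%.7f,%dz",
  latitude, longitude, zoom
)

print(canberra_url)
```

Changing the numbers and pasting the result into a browser should move the map, which is the sense in which the URL is a set of instructions that the server follows.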

The advantage of using an API is that the data provider specifies exactly the data that they are willing to provide, and the terms under which they will provide it. These terms may include things like rate limits (i.e. how often you can ask for data), and what you can do with the data (e.g. maybe you’re not allowed to use it for commercial purposes, or to republish it, or whatever). Additionally, because the API is being provided specifically for you to use it, it is less likely to be subject to unexpected changes. Because of this it is ethically and legally clear that when an API is available you should try to use it.
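One simple way to stay within a rate limit is to pause between requests. The following is only a sketch: fetch_politely() and pretend_request() are hypothetical names of my own, not functions from any package, and the right pause length depends on the terms of whichever API you are using.

```r
# Hypothetical helper (not from any package): apply `fetch_one` to each input,
# pausing between calls so that we stay under a provider's rate limit
fetch_politely <- function(inputs, fetch_one, pause_seconds = 1) {
  results <- vector("list", length(inputs))
  for (i in seq_along(inputs)) {
    results[[i]] <- fetch_one(inputs[[i]])
    # No need to wait after the final request
    if (i < length(inputs)) {
      Sys.sleep(pause_seconds)
    }
  }
  results
}

# Stand-in for a real API call, so the sketch runs without the internet
pretend_request <- function(term) paste("results for", term)

fetch_politely(c("rstats", "canberra"), pretend_request, pause_seconds = 0.1)
```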

In this chapter we introduce some APIs in R. In particular, we will first introduce some R packages that wrap around APIs and make it easier to use an API, and we will then deal directly with APIs.

R packages that wrap around APIs

There are a lot of R packages that wrap around APIs, making it easier for you to use an API within ‘familiar surroundings’. Here, I’ll run through some useful and/or fun ones.

rtweet

Twitter is a rich source of text and other data. The Twitter API is the way in which Twitter asks that you interact with Twitter in order to gather these data. The rtweet package (Kearney 2019) is built around this API and allows us to interact with it in ways that are similar to using any other R package. Initially all you need is a regular Twitter account.

Get started by installing the package if you need to, and then loading it.

# install.packages('rtweet')
library(rtweet)
library(tidyverse)

To get started we need to authorise rtweet. We start that process by calling a function from the package.

get_favorites(user = "RohanAlexander")

This will open a browser on your computer, and you will then have to log into your regular Twitter account as shown in Figure 1.

Figure 1: rtweet authorisation page

Once that is done we can actually get my favourites and then save them.

rohans_favs <- get_favorites("RohanAlexander")

saveRDS(rohans_favs, "dont_push/rohans_favs.rds")

And then looking at the most recent favourite, we can see it was when Professor Bolton tweeted about one of the stellar students in ISSC.

rohans_favs %>% 
  arrange(desc(created_at)) %>% 
  slice(1) %>% 
  select(screen_name, text)
# A tibble: 1 x 2
  screen_name text                                                              
  <chr>       <chr>                                                             
1 Liza_Bolton One of our awesome @UofTStatSci students! I 💜 learning about the …

Let’s look at who is tweeting about R, using one of the common R hashtags: #rstats. I’ve removed retweets so that we hopefully get some actually interesting projects.

rstats_tweets <- search_tweets(
  q = "#rstats",
  include_rts = FALSE
)

saveRDS(rstats_tweets, "dont_push/rstats_tweets.rds")

And then have a look at them.

names(rstats_tweets)
 [1] "user_id"                 "status_id"              
 [3] "created_at"              "screen_name"            
 [5] "text"                    "source"                 
 [7] "display_text_width"      "reply_to_status_id"     
 [9] "reply_to_user_id"        "reply_to_screen_name"   
[11] "is_quote"                "is_retweet"             
[13] "favorite_count"          "retweet_count"          
[15] "quote_count"             "reply_count"            
[17] "hashtags"                "symbols"                
[19] "urls_url"                "urls_t.co"              
[21] "urls_expanded_url"       "media_url"              
[23] "media_t.co"              "media_expanded_url"     
[25] "media_type"              "ext_media_url"          
[27] "ext_media_t.co"          "ext_media_expanded_url" 
[29] "ext_media_type"          "mentions_user_id"       
[31] "mentions_screen_name"    "lang"                   
[33] "quoted_status_id"        "quoted_text"            
[35] "quoted_created_at"       "quoted_source"          
[37] "quoted_favorite_count"   "quoted_retweet_count"   
[39] "quoted_user_id"          "quoted_screen_name"     
[41] "quoted_name"             "quoted_followers_count" 
[43] "quoted_friends_count"    "quoted_statuses_count"  
[45] "quoted_location"         "quoted_description"     
[47] "quoted_verified"         "retweet_status_id"      
[49] "retweet_text"            "retweet_created_at"     
[51] "retweet_source"          "retweet_favorite_count" 
[53] "retweet_retweet_count"   "retweet_user_id"        
[55] "retweet_screen_name"     "retweet_name"           
[57] "retweet_followers_count" "retweet_friends_count"  
[59] "retweet_statuses_count"  "retweet_location"       
[61] "retweet_description"     "retweet_verified"       
[63] "place_url"               "place_name"             
[65] "place_full_name"         "place_type"             
[67] "country"                 "country_code"           
[69] "geo_coords"              "coords_coords"          
[71] "bbox_coords"             "status_url"             
[73] "name"                    "location"               
[75] "description"             "url"                    
[77] "protected"               "followers_count"        
[79] "friends_count"           "listed_count"           
[81] "statuses_count"          "favourites_count"       
[83] "account_created_at"      "verified"               
[85] "profile_url"             "profile_expanded_url"   
[87] "account_lang"            "profile_banner_url"     
[89] "profile_background_url"  "profile_image_url"      
rstats_tweets %>% 
  select(screen_name, text) %>% 
  head()
# A tibble: 6 x 2
  screen_name     text                                                          
  <chr>           <chr>                                                         
1 CRANberriesFeed CRAN updates: gstat hdme opalr tinytest https://t.co/y5W2NTKS…
2 CRANberriesFeed New CRAN package FeatureImpCluster with initial version 0.1.2…
3 CRANberriesFeed CRAN updates: crfsuite https://t.co/y5W2NTKSXT #rstats        
4 CRANberriesFeed CRAN updates: DSI https://t.co/y5W2NTKSXT #rstats             
5 CRANberriesFeed New CRAN package arcos with initial version 1.1 https://t.co/…
6 CRANberriesFeed CRAN updates: COVID19 gaiah mosaic smoothSurv https://t.co/y5…
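The first six rows all come from one account, so a natural next step is to count tweets per account. This is a minimal sketch using dplyr: toy_tweets is a made-up stand-in with invented account names, used only so the example runs without Twitter access. On the real data you would swap in rstats_tweets.

```r
library(dplyr)

# Made-up stand-in for rstats_tweets, with only the two columns used here
toy_tweets <- tibble(
  screen_name = c("account_a", "account_a", "account_b"),
  text = c("first #rstats", "second #rstats", "third #rstats")
)

# Count tweets per account, most active first
tweets_per_account <- toy_tweets %>%
  count(screen_name, sort = TRUE)

tweets_per_account
```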

There are a bunch of other things that you can do just using a regular user account, and if you’re interested then you should try the examples in the rtweet package documentation: https://rtweet.info/index.html. But more is available once you register as a developer (https://developer.twitter.com/en/apply-for-access). The Twitter API documentation is surprisingly readable and you may enjoy some of it: https://developer.twitter.com/en/docs.

When I introduced APIs I said that the ‘data provider specifies exactly the data that they are willing to provide…’ and we have certainly been able to take advantage of what they provide. But I continued ‘…and the terms under which they will provide it’ and here we haven’t done our part. In particular, I took some tweets and saved them. If I had pushed these to GitHub then it’s possible I may have accidentally stored sensitive information if there happened to be some in the tweets. Or if I had taken enough tweets to start to do some reasonable statistical analysis then even if there wasn’t sensitive information I may have violated the terms if I had pushed those saved tweets to GitHub. Finally, I linked a Twitter user name, in this case @Liza_Bolton, with Professor Bolton. I happened to ask her if this was okay, but if I hadn’t done that then I would have been violating the Twitter terms of service.
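One practical safeguard is to tell Git to ignore the folder where the tweets were saved. The sketch below assumes that the ‘dont_push’ folder used earlier is meant to stay out of version control; ignore_folder() is a hypothetical helper of my own written in base R, and it is tried out here in a temporary folder rather than in a real project.

```r
# Hypothetical helper: make sure `folder` is listed in the project's .gitignore
ignore_folder <- function(folder, project_dir = ".") {
  ignore_file <- file.path(project_dir, ".gitignore")
  entry <- paste0(folder, "/")
  existing <- if (file.exists(ignore_file)) readLines(ignore_file) else character()
  # Only add the entry if it is not already there
  if (!entry %in% existing) {
    writeLines(c(existing, entry), ignore_file)
  }
  invisible(ignore_file)
}

# Try it in a temporary folder rather than a real project
demo_project <- tempfile("demo_project")
dir.create(demo_project)
ignore_folder("dont_push", project_dir = demo_project)
readLines(file.path(demo_project, ".gitignore"))
```

Ignoring a folder only keeps the files out of the repository. It does not settle what the terms allow you to do with the data.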

If you use Twitter data, please take a moment to look at the terms: https://developer.twitter.com/en/developer-terms/more-on-restricted-use-cases.

spotifyr

For the next example I will introduce the spotifyr package (Thompson et al. 2020). Again, this is a wrapper that has been developed around an API, in this case the Spotify API.

https://www.rcharlie.com/spotifyr/

# devtools::install_github('charlie86/spotifyr')
library(spotifyr)

In order to use this package you need a Spotify Developer Account, which you can set up here: https://developer.spotify.com/dashboard/. That’ll have you log in with your Spotify details and then accept their terms (it’s worth looking at some of these and I’ll follow up on a few below) as in Figure 2.

Figure 2: Spotify Developer Account set-up page

What we need from here is a ‘Client ID’, which you get by filling out some basic details. In our case we probably ‘don’t know’ what we’re building, which means that Spotify requires us to use a non-commercial agreement, which is fine. In order to use the Spotify API we need both a Client ID and a Client Secret.

These are things that you want to keep to yourself. There are a variety of ways of keeping them secret (and my understanding is that a helpful package is on its way), but we’ll keep them in our System Environment. In this way, when we push to GitHub they won’t be included. To do this we need to be careful about the naming, because spotifyr will look in our environment for specifically named keys.

To set these up we are going to use the usethis package (Wickham and Bryan 2020). If you don’t have it then please install it. There is a file called ‘.Renviron’ which we will open and add our secrets to. This file also controls things like your default library location, and more information is available in Lopp (2017) and Bryan and Hester (2020).

usethis::edit_r_environ() 

When you run that function it will open a file. There you can add your Spotify secrets.

SPOTIFY_CLIENT_ID = 'PUT_YOUR_CLIENT_ID_HERE'
SPOTIFY_CLIENT_SECRET = 'PUT_YOUR_SECRET_HERE'

Save your ‘.Renviron’ file, and then restart R (Session -> Restart R). You can now draw on those variables when you need them.
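After restarting, it is worth checking that R can see both values. This is a small sketch using base R’s Sys.getenv(), with the same variable names as in the ‘.Renviron’ entries above. It deliberately reports only whether the values exist, rather than printing them.

```r
# Look up the two values by the names used in '.Renviron'
client_id <- Sys.getenv("SPOTIFY_CLIENT_ID")
client_secret <- Sys.getenv("SPOTIFY_CLIENT_SECRET")

# Sys.getenv() returns "" for a variable that is not set
secrets_are_set <- nzchar(client_id) && nzchar(client_secret)

if (secrets_are_set) {
  message("Both Spotify secrets were found.")
} else {
  message("At least one Spotify secret is missing; check '.Renviron' and restart R.")
}
```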

Some functions that require your secrets as arguments will now just work. For instance, we will get information about Radiohead using get_artist_audio_features(). One of the arguments is authorization, but as that defaults to looking in the R environment, we don’t need to do anything further.

radiohead <- get_artist_audio_features('radiohead')
saveRDS(radiohead, "inputs/radiohead.rds")
radiohead <- readRDS("inputs/radiohead.rds")
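Saving the result and then reading it back means that the API only needs to be asked once. That pattern could be wrapped in a small helper along the following lines. This is a sketch only: read_or_fetch() is a hypothetical function of my own, and a data frame written to a temporary file stands in for the API call.

```r
# Hypothetical helper: read a saved copy if there is one, otherwise run
# `fetch` (a function that gets the data) and save the result for next time
read_or_fetch <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- fetch()
  dir.create(dirname(path), showWarnings = FALSE, recursive = TRUE)
  saveRDS(result, path)
  result
}

# Stand-in for an API call, saved to a temporary file
demo_path <- file.path(tempdir(), "demo_inputs", "demo.rds")
first <- read_or_fetch(demo_path, function() data.frame(x = 1:3))

# The second call finds the file, so its `fetch` function is never run
second <- read_or_fetch(demo_path, function() stop("not called: file exists"))
```

With spotifyr, the same idea would look something like read_or_fetch("inputs/radiohead.rds", function() get_artist_audio_features('radiohead')).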

names(radiohead)
 [1] "artist_name"                  "artist_id"                   
 [3] "album_id"                     "album_type"                  
 [5] "album_images"                 "album_release_date"          
 [7] "album_release_year"           "album_release_date_precision"
 [9] "danceability"                 "energy"                      
[11] "key"                          "loudness"                    
[13] "mode"                         "speechiness"                 
[15] "acousticness"                 "instrumentalness"            
[17] "liveness"                     "valence"                     
[19] "tempo"                        "track_id"                    
[21] "analysis_url"                 "time_signature"              
[23] "artists"                      "available_markets"           
[25] "disc_number"                  "duration_ms"                 
[27] "explicit"                     "track_href"                  
[29] "is_local"                     "track_name"                  
[31] "track_preview_url"            "track_number"                
[33] "type"                         "track_uri"                   
[35] "external_urls.spotify"        "album_name"                  
[37] "key_name"                     "mode_name"                   
[39] "key_mode"                    
radiohead %>% 
  select(artist_name, track_name, album_name) %>% 
  head()
  artist_name                               track_name
1   Radiohead                      Airbag - Remastered
2   Radiohead            Paranoid Android - Remastered
3   Radiohead Subterranean Homesick Alien - Remastered
4   Radiohead     Exit Music (For a Film) - Remastered
5   Radiohead                    Let Down - Remastered
6   Radiohead                Karma Police - Remastered
                     album_name
1 OK Computer OKNOTOK 1997 2017
2 OK Computer OKNOTOK 1997 2017
3 OK Computer OKNOTOK 1997 2017
4 OK Computer OKNOTOK 1997 2017
5 OK Computer OKNOTOK 1997 2017
6 OK Computer OKNOTOK 1997 2017

Let’s just make a quick graph looking at track length over time.

radiohead %>% 
  ggplot(aes(x = album_release_year, y = duration_ms)) +
  geom_point()
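Track length is stored in milliseconds, which makes the vertical axis hard to read. A minimal sketch of converting it to minutes with dplyr follows. The helper name add_duration_minutes() and the new column duration_min are my own, and the two durations are made up purely to show the arithmetic.

```r
library(dplyr)

# Hypothetical helper: add track length in minutes next to duration_ms
add_duration_minutes <- function(tracks) {
  tracks %>%
    mutate(duration_min = duration_ms / 60000)  # 60,000 ms in a minute
}

# Made-up durations, just to show the conversion
add_duration_minutes(tibble(duration_ms = c(60000, 150000)))
```

Applied to the radiohead data, the new column could then be used as the y aesthetic in the graph.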

Just because we can, let’s settle an argument. I’ve always said that Radiohead are quite depressing, but they’re my wife’s favourite band. So let’s see how depressing they are. Spotify provides various information about each track, including ‘valence’, which Spotify define as ‘(a) measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).’ So higher values are happier. Let’s compare someone who we know is likely to be happy - Taylor Swift - with Radiohead.

swifty <- get_artist_audio_features('taylor swift')
saveRDS(swifty, "inputs/swifty.rds")
swifty <- readRDS("inputs/swifty.rds")

tibble(name = c(swifty$artist_name, radiohead$artist_name),
       year = c(swifty$album_release_year, radiohead$album_release_year),
       valence = c(swifty$valence, radiohead$valence)
               ) %>% 
  ggplot(aes(x = year, y = valence, color = name)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Year",
       y = "Valence",
       color = "Name") +
  scale_color_brewer(palette = "Set1")
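A graph is one way to settle the argument, and a single number per artist is another. The following sketch defines a hypothetical helper, summarise_valence(), which assumes only that its input has the artist_name and valence columns listed in the names() output earlier.

```r
library(dplyr)

# Hypothetical helper: average valence and number of tracks for each artist
summarise_valence <- function(tracks) {
  tracks %>%
    group_by(artist_name) %>%
    summarise(
      mean_valence = mean(valence, na.rm = TRUE),
      n_tracks = n()
    ) %>%
    # Happiest-sounding artist first
    arrange(desc(mean_valence))
}
```

To use it here, one option would be to keep just those two columns from swifty and radiohead, stack them with bind_rows(), and pass the result to summarise_valence().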

Finally, for the sake of embarrassment, let’s look at our most played artists.

top_artists <- get_my_top_artists_or_tracks(type = 'artists', time_range = 'long_term', limit = 20)

saveRDS(top_artists, "inputs/top_artists.rds")
top_artists <- readRDS("inputs/top_artists.rds")

top_artists %>% 
  select(name, popularity)
                  name popularity
1            Radiohead         81
2  Bombay Bicycle Club         66
3                Drake        100
4        Glass Animals         74
5                JAY-Z         85
6        Laura Marling         65
7       Sufjan Stevens         75
8      Vampire Weekend         73
9     Sturgill Simpson         65
10          Nick Drake         66
11        Dire Straits         78
12               Lorde         80
13         Marian Hill         65
14       José González         68
15       Stevie Wonder         79
16          Disclosure         82
17      Ben Folds Five         52
18       Ainslie Wills         40
19            Coldplay         89
20               alt-J         75

So pretty much my wife and I like what everyone else likes, with the exception of Ainslie Wills, who is Australian, and I suspect we used to listen to her when we were homesick.

How amazing that we live in a world in which all that information is available with very little effort or cost.

Again, there is a lot more at the package’s website: https://www.rcharlie.com/spotifyr/. A very nice little application of the Spotify API using some statistical analysis is Pavlik (2019).

Next steps: Using APIs directly

In this section we introduce GET requests, in which we use an API directly. We will use the httr package (Wickham 2019). A GET request tries to obtain some specific data.

You make a GET request using GET(), which takes a URL as its main argument.
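As a minimal sketch of what that looks like with httr, the following builds a URL and defines a small function that would request it. The address https://httpbin.org/get is an assumption on my part: it is a public test service that echoes a request back, chosen only for illustration, and fetch_parsed() is a hypothetical helper name. The request itself is wrapped in a function so that nothing is sent until you call it.

```r
library(httr)

# Assumption: httpbin.org/get is a public test service that echoes a request
# back; it is used here only for illustration and is not from the text
request_url <- modify_url("https://httpbin.org/get", query = list(q = "rstats"))

# Wrapped in a function so that nothing is sent until it is called
fetch_parsed <- function(url) {
  response <- GET(url)
  # Turn an HTTP error (e.g. 404) into an R error
  stop_for_status(response)
  # Parse the body as JSON into an R list
  content(response, as = "parsed", type = "application/json")
}

# With an internet connection: echoed <- fetch_parsed(request_url)
```

Checking the status before parsing is a design choice: it means that a failed request stops straight away rather than handing back an error page that looks like data.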

References

Bryan, Jennifer, and Jim Hester. 2020. What They Forgot to Teach You About R. https://rstats.wtf/index.html.

Cooksey, Brian. 2014. “An Introduction to APIs.” Zapier. https://zapier.com/learn/apis/.

Kearney, Michael W. 2019. “rtweet: Collecting and Analyzing Twitter Data.” Journal of Open Source Software 4 (42): 1829. https://doi.org/10.21105/joss.01829.

Lopp, Sean. 2017. “R for Enterprise: Understanding R’s Startup.” R Views. https://rviews.rstudio.com/2017/04/19/r-for-enterprise-understanding-r-s-startup/.

Pavlik, Kaylin. 2019. “Understanding + Classifying Genres Using Spotify Audio Features.” https://www.kaylinpavlik.com/classifying-songs-genres/.

Thompson, Charlie, Josiah Parry, Donal Phipps, and Tom Wolff. 2020. spotifyr: R Wrapper for the ’Spotify’ Web API. http://github.com/charlie86/spotifyr.

Wickham, Hadley. 2019. httr: Tools for Working with URLs and HTTP. https://CRAN.R-project.org/package=httr.

Wickham, Hadley, and Jennifer Bryan. 2020. usethis: Automate Package and Project Setup. https://CRAN.R-project.org/package=usethis.


  1. There are at least six great coffee shops shown just in this section of map including: Mocan & Green Grout; The Cupping Room; Barrio Collective Coffee; Lonsdale Street Cafe; Two Before Ten; and Red Brick. There are also two coffee shops that I love but that most wouldn’t classify as ‘great’ including: The Street Theatre Cafe; and the CBE Cafe.