Luke Stanke

Data Science – Analytics – Psychometrics – Applied Statistics

Taking Advanced Analytics to the Cloud – Part I: R on AWS


Running R on the cloud isn’t very difficult. This demo shows how to get Rstudio running on Amazon Web Services. To run R on the cloud we need to launch a machine and install R – that’s all very simple. Where you might get caught up is in the settings and permissions.

Step 1: Create an AWS Account

If you don’t have an AWS account you’ll need to sign up.

Step 2: Create a Role

Once you are signed up and signed into AWS, the first thing you’ll need to do is create a role for your machine. A role provides the permissions your machine needs to access the various APIs offered by AWS.

You can do this by searching IAM on the services tab.

Once you are on the page click “Roles” on the left menu, then the “Create New Role” button.

Then select Amazon EC2 from the AWS Service Role list.

There are a number of services your instance could work with, each with its own permissions. For now, just search for and select AmazonS3FullAccess, AmazonEC2FullAccess, RedshiftFullAccess, and AthenaFullAccess. You won’t need any of these right away, but they will be useful when connecting to other services like S3 or another EC2 instance. Note: the photo does not include AthenaFullAccess, but you should include it!

From there you’ll be good to go! Select policies according to your actual access needs. In a production setup you’d want to create narrowly scoped policies rather than the over-arching full-access options we’ve selected for this demo.

Step 3: Create a Key Pair

Next, you’ll want to create a key pair. This will allow you to securely log into your instance to update your machine and install Rstudio.

Go to Services and search EC2 in the search bar. Once the EC2 page loads, click “Key Pairs” under the “Network & Security” section on the left menu bar.

From there click “Create Key Pair”. Give your key a name and hit create. The key pair will download as a .pem file. Do not lose this key! Also: do not share this key!

Step 4: Create an R/Rstudio/Shiny Security Group

As you can see from the steps so far, it’s all about security. From the EC2 menu, under the “Network & Security” section on the left menu, select “Security Groups”. Click the “Create Security Group” button and a pop-up will appear. Create a security group named “Rstudio/Shiny”. Under description write “Allows port access for R and Shiny”. You can leave the VPC dropdown alone. On the inbound tab – the default – add three new rules.

Under the first rule select SSH from the dropdown. The source should be auto populated with 0.0.0.0/0. This opens up SSH to anywhere in the world – but you’ll need the .pem key to login.

For the second rule, leave it as custom TCP and type 8787 as the port range. This is the port that needs to be open on the server for you to connect to Rstudio. Under source, type 0.0.0.0/0. This means you could log into Rstudio from any IP address. You could also restrict it to your own IP address if you know it.

For the third rule, again leave it as custom TCP and type 3838 as the port range. This is the port for the Shiny connection. We’re not going to use it in this demo, but it’ll be very useful in the future. Under source, type 0.0.0.0/0 as well.

Step 5: Launch EC2 Instance

Staying in the EC2 section, click “Instances” under the “Instances” section on the left menu. Click the “Launch Instance” button. This will take you to “Step 1: Choose an Amazon Machine Image”.

You’ll have a number of tabs to select AMIs from. Stay on the Quick Start tab and select Amazon Linux AMI – it’s the first option. It has a number of pre-built tools that are useful – but it doesn’t have R installed. You’ll do that in a bit.

For “Step 2: Choose an Instance Type”, choosing an instance type can be daunting. Here you are essentially specifying the type of computer you want to run. I typically select from the General purpose, Compute optimized, or Memory optimized options depending on the type of models I’m running and the type of data I am working with. For this example select t2.micro, because this is just a demo of the tool. Click “Next: Configure Instance”. Note: you’ll need a more powerful machine to install the packages in Step 8 and beyond – I’d recommend a c2.large machine just to be safe.

For “Step 3: Configure Instance”, under “IAM Role” select the role you created earlier – my role was called EC2-S3-Redshift. Note: under Advanced Options you could send a bash script to configure your instance; instead, we’ll do it with command-line tools. Click “Next: Add Storage”.

For “Step 4: Add Storage” we can stick with the default settings – for now. Click “Next: Add Tags”.

For “Step 5: Add Tags”, click “Add Tag”, enter a key of “Name” and a value of “Rstudio Example” – this will give our machine the name “Rstudio Example”. Having a name is important when you have more than one machine running. Click “Next: Configure Security Group”.

For “Step 6: Configure Security Group”, under “Assign a Security Group” select “Select an existing security group”. Find and select your “Rstudio/Shiny” security group. If this is your first time doing this, it should be easy to find. If you have multiple security groups like I do, you’ll have to search through the lot. Click “Review and Launch”.

Under the review screen, make sure you’ve selected the instance type, security group, and IAM role as described above. Click “Launch” – you’ll get a pop-up to select an existing key pair or create a new key pair. Choose an existing key pair and select the key pair you just created. My key is called awesome_key. Acknowledge that you’ve selected the key pair and that you’ll need it to log on. This should go without saying, but I’m saying it – you need your key pair to log into your machine and set it up! Launch your instance.

Step 6: Login to your instance

This is where the Windows OS world diverges from the Apple/Unix/Linux/Ubuntu worlds. Windows doesn’t have a built-in terminal like these other systems, so you’ll have to download and set up PuTTY if you don’t have it already. Next you’ll use your terminal to SSH into your newly created instance and set it up. The instance will likely need about 60 seconds to start up from when you hit launch.

After you launch, click on your instance ID just to the right of the message saying the instance launch has been initiated.

This will take you to your instance on the EC2 dashboard. The dashboard is full of important information – most notably it’ll tell you if your instance is up and running. From the image below you can see my instance state is green running. This tells me it’s ready to be SSH’ed into.

You can also see the Public DNS on the right and on the description tab. We’ll need that information to SSH into the instance. Note: you can also see my IAM role and key pair name. Click the “Connect” button just to the right of the “Launch Instance” button. This provides you with additional directions on how to connect to your instance.

First, let’s open our terminal and change directories so that we are in the folder that contains our .pem key. If you haven’t moved it out of your downloads folder, it’s probably there – and you should probably move it.

Next change the permissions of your key pair to allow you to SSH onto your AWS instance.

chmod 400 awesome_key.pem

Then SSH onto your machine using the following format. You’ll have to replace the key pair with the key pair you’ve created. You’ll also have to change the public DNS address to the address of your machine.

ssh -i "awesome_key.pem" ec2-user@ec2-54-145-158-106.compute-1.amazonaws.com

Once you are logged in, your terminal should look like this:

Step 7: Setup your instance

Once we are logged in we need to update our machine, install a few additional programs, install R, and install Rstudio. You can do this by running the following commands line-by-line through your EC2 instance.

# Update the machine
sudo yum -y update

# Install programs that run well with the devtools package
sudo yum -y install libcurl-devel openssl-devel # used for devtools

# Install programs that assist APIs
sudo yum -y install libxml2 libxml2-devel

# Install R
sudo su
yum install -y R

#  Install PostgreSQL
yum install -y postgresql-devel

After running this code you will have 1) updated your machine; 2) installed tools that allow the devtools package to run; 3) installed tools that allow packages like httr and aws.s3 to run; 4) installed base R; and 5) installed the PostgreSQL libraries.

Next you’ll want to install the most recent versions of Rstudio Server and Shiny Server. Check here to find the most recent releases of Rstudio and Shiny. Edit the code below so that you install the most recent versions, then run the install of Rstudio and Shiny.


# Install RStudio Server - change version when installing your Rstudio
wget -P /tmp https://s3.amazonaws.com/rstudio-dailybuilds/rstudio-server-rhel-1.0.143-x86_64.rpm
sudo yum install -y --nogpgcheck /tmp/rstudio-server-rhel-1.0.143-x86_64.rpm

#install shiny and shiny-server - change version when installing your Shiny Server
R -e "install.packages('shiny', repos='http://cran.rstudio.com/')"
wget https://download3.rstudio.org/centos5.9/x86_64/shiny-server-1.5.3.838-rh5-x86_64.rpm
yum install -y --nogpgcheck shiny-server-1.5.3.838-rh5-x86_64.rpm

#add user(s)
sudo useradd -m stanke
sudo passwd stanke

Finally, add a user and set a password. Once this is done you can close your SSH session.

Step 8: Log into Rstudio

Copy the public DNS that was shown on your EC2 page earlier. Paste it into your browser and add “:8787” at the end – in my case “ec2-54-145-158-106.compute-1.amazonaws.com:8787”. Hit enter. Your Rstudio login page should appear. Enter the credentials for your new user and click “Sign In”.

Step 9: Setup your Rstudio Defaults.

So technically you’ve done it. You’ve logged into Rstudio on AWS. But let’s take this another few steps. Let’s set up some defaults so that any time we want to set up an Rstudio instance on AWS we don’t have to go through the hassle we just did above. Let’s install a bunch of packages that you might regularly use. This install might take a while since you’ll be installing a number of packages onto your machine.

## install packages if not present
install.packages(
  c(
    "devtools",
    "sparklyr",
    "ggplot2",
    "magrittr",
    "tidyverse",
    "Lahman",
    "DBI",
    "rsparkling",
    "h2o",
    "ghit",
    "xml2",
    "stringr",
    "magrittr",
    "data.table",
    "clipr"
  )
)
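
If the install feels slow, one option – a sketch, not part of the original walkthrough – is to let install.packages compile on multiple cores via its Ncpus argument:

# Optional: compile packages on several cores to speed up installation.
# parallel::detectCores() reports how many cores the instance has.
install.packages(c("devtools", "tidyverse"), Ncpus = parallel::detectCores())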

Let’s also create a new .R file and write a few lines of code. You don’t really need to do this, but it’ll show the power of Step 10 in a few minutes.

Here is the R script you can use:

### get data
data("iris")

###  See data structure
head(iris)

###  Save iris data locally.
write.csv(iris, "iris.csv")

After the initial script: A .R file, an iris object, and an iris.csv file.

Step 10: Take an image of your current instance

Setting up Rstudio is a pain – all that time waiting for packages to install, getting your data just right, only to possibly find out you didn’t size your instance correctly. No one wants to deal with that every time they set up R on AWS. This isn’t a problem, as you can take a snapshot of your instance as it is and spin off new instances from that point in time.

To take a snapshot go back to the webpage with your EC2 details. Click the “Actions” button, then go to “Image” and “Create Image”.

From there enter an Image Name of “Rstudio AMI Example” and Image Description of “Rstudio AMI Example”. Click “Create Image” and wait a few minutes.

Step 11: Create an instance from the AMI

Launch a new instance. On “Step 1: Choose an Amazon Machine Image”, click “My AMIs”. Select “Rstudio AMI Example”. Follow Step 5 above for the rest of the setup. However, with this instance, tag the Name as “Rstudio AMI”. If you’ve set everything up correctly you shouldn’t need to SSH into your instance to configure it.

Copy the Public DNS into your browser and add “:8787”. Log in with the username and password you created earlier. Once logged in you’ll see that your .R script and .csv file are still saved on the instance – allowing you to quickly jump back into your analyses.

In fact, creating an image and then creating multiple instances allows you to quickly fit models and test which instance type is best for the models and data you are currently working with, which helps minimize costs. If the data become more complicated or larger, you can create a new instance that accommodates those changes. Sometimes, however, the data become so large that you need a distributed system – Rstudio can sit on top of distributed systems as well; I’ll talk about that in a different post. This approach will give you a great jump-start though.
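
As a rough, hypothetical example of that sizing test – not from the original post – you could run the same timing script on each candidate instance type and compare elapsed times. A minimal sketch using the diamonds data that ships with ggplot2:

# Fit the same model on each instance type and compare how long it takes.
data("diamonds", package = "ggplot2")
timing <- system.time(
  fit <- lm(price ~ carat + cut + color + clarity, data = diamonds)
)
timing["elapsed"]  # compare this value across instance types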


Taking Advanced Analytics to the Cloud – Part II: Objects from S3 to R using the aws.s3 package


When taking advanced analytics to the cloud you’ll need a strong understanding of your platform – whether it’s compute, storage, or some other feature. This tutorial walks you through writing objects to and reading them from Amazon Web Services’ Simple Storage Service (S3). For this demo, code will be run through Rstudio, which is running on a Linux server in the cloud – you can learn how to set that up here.

Using aws.s3 package

Before I found this package I was doing things the hard way – using the AWS command-line tools to put and get data from S3. The aws.s3 package makes these tasks very convenient.

library(aws.s3)
library(magrittr)

Saving System Variables

To make life easier you can save your AWS access key credentials as environment variables. Though doing this makes life easier, it is a greater security risk.

Sys.setenv(
  "AWS_ACCESS_KEY_ID" = "ABCDEFGHIJLMNOP",
  "AWS_SECRET_ACCESS_KEY" = "ABCAaKDKJHFSKhfiayrhekjabdfkasdhfiaewr0293u4bsn"
)
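
If you’d rather not hard-code keys in a script, one alternative – a sketch, not from the original post – is to put the same two variables in your ~/.Renviron file so R sets them at startup; aws.s3 will pick them up automatically:

# ~/.Renviron (plain text, not R code) would contain lines like:
#   AWS_ACCESS_KEY_ID=ABCDEFGHIJLMNOP
#   AWS_SECRET_ACCESS_KEY=ABCAaKDKJHFSKhfiayrhekjabdfkasdhfiaewr0293u4bsn
# After restarting R, check that the keys were picked up:
Sys.getenv("AWS_ACCESS_KEY_ID") != ""  # should be TRUE if the key is set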

Looking into S3

Saving your credentials eliminates additional arguments needed to run each of the aws.s3 functions shown below. Let’s start by looking at my buckets using the bucket_list_df function. This returns my bucket names and creation dates as a data frame.

bucket_list_df() %>%
  dplyr::arrange(dplyr::desc(CreationDate)) %>%
  head()
##                                            Bucket             CreationDate
## 1                                sample-data-demo 2017-06-01T20:04:07.000Z
## 2 aws-athena-query-results-666957067213-us-east-1 2017-05-20T18:18:31.000Z
## 3                 aws-logs-666957067213-us-east-1 2017-02-19T21:59:02.000Z
## 4                                         test.io 2017-01-25T13:38:32.000Z
## 5                                            test 2017-01-25T13:37:28.000Z
## 6                                      stanke.co 2016-10-04T13:02:41.000Z

I’m most interested in the sample-data-demo bucket. We can use the get_bucket function to examine the contents of the bucket. The output comes as a list – which isn’t always the easiest to work with – so I’ve written some code to take the output and transform it into a data frame/tibble.

##  List files in bucket
files <- get_bucket("sample-data-demo")

##  Convert files to tidy
files_df <-
  tibble::data_frame(
    file = character(),
    LastModified = character()
  )

n_files <- length(files)

for (i in 1:n_files) {
  files_df <-
    tibble::data_frame(
      file = files[i][[1]]$Key,
      LastModified = files[i][[1]]$LastModified
    ) %>%
    dplyr::bind_rows(files_df)
}

rm(n_files)

head(files_df)
## # A tibble: 6 x 2
##                   file             LastModified
##                  <chr>                    <chr>
## 1 flights_2008.csv.bz2 2017-06-04T02:39:13.000Z
## 2     flights_2008.csv 2017-06-04T16:01:52.000Z
## 3 flights_2007.csv.bz2 2017-06-04T02:39:08.000Z
## 4     flights_2007.csv 2017-06-04T15:59:50.000Z
## 5 flights_2006.csv.bz2 2017-06-04T02:39:03.000Z
## 6     flights_2006.csv 2017-06-04T15:57:58.000Z

Putting data into S3

Putting data into S3 is pretty easy. We can use several functions: s3save, put_object, or save_object.

Where do we use these?

s3save

s3save is analogous to save. We can take an object and save it as an .Rdata file. Here I’ll take a local file – pro football results – create an object, and then save it to S3.

games <- readr::read_csv("data/NFL/GAME.csv")

s3save(games, object = "games.Rdata", bucket = "sample-data-demo")

Please note that I have to save this as an .Rdata object – even though it was originally a .csv file.
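
To pull that .Rdata object back into R later, aws.s3 provides s3load as the counterpart to s3save – a quick sketch:

# Load the games object from S3 back into the current session.
s3load(object = "games.Rdata", bucket = "sample-data-demo")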

put_object

put_object allows me to put any file on a local drive onto S3. This can be basically any file type – .Rdata, .csv, .csv.bz2 – they are all covered here. There are three arguments you need to know: 1) file, the location of the file you want to send to S3; 2) object, the name you want to give the S3 object – probably the same as the file argument; and 3) bucket, the name of the bucket you’d like to place the object into.

put_object("data/NFL/GAME.csv", "games.csv", bucket = "sample-data-demo" )
## [1] TRUE

Here we took the same .csv we read in earlier and saved the object as games.csv into the sample-data-demo bucket. You’ll see TRUE is returned indicating the file has successfully uploaded to S3.
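
If you want to double-check that an upload landed, the package also has an object_exists function – a small sketch:

# Confirm the object is now in the bucket; returns TRUE if it exists.
aws.s3::object_exists("games.csv", bucket = "sample-data-demo")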

save_object

The save_object function sounds like it might save a file to S3. But it’s actually the opposite of put_object. save_object takes a file on S3 and saves it to your working directory for you to use. I REPEAT: save_object takes a file from S3 and saves it to your working directory.

save_object("games.csv", file = "games.csv", bucket = "sample-data-demo")
## [1] "games.csv"

We can then take this file and read it like we would normally do.

games <- readr::read_csv('games.csv')
dim(games)
## [1] 4256   17

get_object

The save_object function stores information locally. If you want to keep as much as possible in-memory you can use the get_object function – which returns the file as raw data.

games <- get_object("games.csv", bucket = "sample-data-demo")

games[1:100]
##   [1] 67 69 64 2c 73 65 61 73 2c 77 6b 2c 64 61 79 2c 76 2c 68 2c 73 74 61
##  [24] 64 2c 74 65 6d 70 2c 68 75 6d 64 2c 77 73 70 64 2c 77 64 69 72 2c 63
##  [47] 6f 6e 64 2c 73 75 72 66 2c 6f 75 2c 73 70 72 76 2c 70 74 73 76 2c 70
##  [70] 74 73 68 0d 0a 31 2c 32 30 30 30 2c 31 2c 53 55 4e 2c 53 46 2c 41 54
##  [93] 4c 2c 47 65 6f 72 67 69

As I mentioned, using the get_object function returns raw data. This means if you look at the immediate output you’ll see the bits of information as they are. To return the data as intended you’ll need to use the rawToChar function to convert the data:

games <- 
  aws.s3::get_object("games.csv", bucket = "sample-data-demo") %>%
  rawToChar() %>%
  readr::read_csv()

dim(games)
## [1] 4256   17

This works pretty well for reading in most file types, but I’ve found it very hard to use for compressed files. I’d recommend save_object for .bz2, .gz, .zip or any other compressed file – I just haven’t found a good in-memory solution yet.
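
For example, for one of the compressed files listed earlier, a workable pattern – sketched here, not from the original post – is to pull the file down with save_object and let readr handle the decompression:

# Download the compressed file to the working directory...
save_object("flights_2008.csv.bz2", file = "flights_2008.csv.bz2", bucket = "sample-data-demo")

# ...then read it; readr decompresses .bz2 files automatically.
flights <- readr::read_csv("flights_2008.csv.bz2")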

delete_object

To delete an object on S3, just use the delete_object function. Here I’ll delete the files I just created for this demo.

aws.s3::delete_object("games.Rdata", bucket = "sample-data-demo")
## [1] TRUE
aws.s3::delete_object("games.csv", bucket = "sample-data-demo")
## [1] TRUE

put_bucket and delete_bucket

I can easily create or delete a bucket with the put_bucket and delete_bucket functions. With put_bucket I can also specify its ACL – i.e., private, public-read, or public-read-write.

##  Make a bucket.
# put_bucket("stanke123123123")
##  And make it disappear.
# delete_bucket("stanke123123123")
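
If you do want to set permissions at creation time, put_bucket takes an acl argument – a sketch with a hypothetical bucket name:

##  Create a private bucket (hypothetical name).
# put_bucket("stanke-demo-bucket", acl = "private")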

These functions get you started with S3 on AWS. There are a host of other services available that I’ll continue to share.


Most Downloaded R Packages of 2016

I was curious which packages are downloaded most from CRAN. A quick Google search of “most downloaded R packages 2016” produces outdated articles (see here, here, and here). Luckily, the Rstudio CRAN mirror keeps logs of the number of packages downloaded each day. With a few lines of code the data can be downloaded and aggregated. The data were then placed in Tableau and packages were given weekly ranks. Here are the results:
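
For anyone who wants to reproduce the aggregation, here is a minimal sketch – assuming the cran-logs.rstudio.com URL layout – that pulls one day of logs and counts downloads per package; the chart itself was built in Tableau:

# Download one day of CRAN download logs and count downloads per package.
tmp <- tempfile(fileext = ".csv.gz")
download.file("http://cran-logs.rstudio.com/2016/2016-06-01.csv.gz", tmp)

logs <- readr::read_csv(tmp)
head(dplyr::count(logs, package, sort = TRUE), 10)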

Target Store Locations with rvest and ggmap

I just finished developing a presentation for Target Analytics Network showcasing geospatial and mapping tools in R. I decided to use Target store locations as part of a case study in the presentation. The problem: I didn’t have any store location data, so I needed to get it from somewhere on the web. Since there are some great tools in R to get this information – mainly rvest for scraping and ggmap for geocoding – it wasn’t a problem. Instead of just doing the work, I thought I should share what this process looks like:

First, we can go to the Target website and find stores broken down by state.


After finding this information, we can use the rvest package to scrape it. The URL is so nicely formatted that you can easily grab any state if you know the state’s mailing code.

# Set the state -- Minnesota's mailing code is MN.
state <- 'MN'

Now that we have a state, we can build the URL.

# Set the URL to borrow the data.
TargetURL <- paste0('http://www.target.com/store-locator/state-result?stateCode=', state)

Now that we have the URL, let’s grab the html from the webpage.

# Download the webpage.
TargetWebpage <-
  TargetURL %>%
  xml2::read_html()

Now we have to find the location of the table in the html code.


Once we have found the html table, there are a number of ways we could extract the data. I like to copy the XPath location. It’s a bit lazy, but for the purpose of this exercise it makes life easy.

Once we have the XPath location, it’s easy to extract the table from Target’s webpage. First we pipe the html through the html_nodes function, which isolates the html responsible for creating the store locations table. After that we use html_table to parse the html table into an R list. Then we use the data.frame function to turn the list into a data frame, and the select function from the dplyr library to select specific variables. One wrinkle is that the city, state, and zip code are in a single column. It’s not really a problem for this exercise, but maybe it’s the perfectionist in me. Let’s use the separate function from the tidyr library to give city, state, and zipcode their own columns.

# Get all of the store locations.
TargetStores <-
  TargetWebpage %>%
  rvest::html_nodes(xpath = '//*[@id="stateresultstable"]/table') %>%
  rvest::html_table() %>%
  data.frame() %>%
  dplyr::select(`Store Name` = Store.Name, Address, `City/State/ZIP` = City.State.ZIP) %>%
  tidyr::separate(`City/State/ZIP`, into = c('City', 'Zipcode'), sep = paste0(', ', state)) %>%
  dplyr::mutate(State = state) %>%
  dplyr::as_data_frame()

Let’s get the coordinates for these stores; we can pass each store’s address through the geocode function which obtains the information from the Google Maps API — you can only geocode up to 2500 locations per day for free using the Google API.

# Geocode each store
TargetStores %<>%
  dplyr::bind_cols(
    ggmap::geocode(
      paste0(
        TargetStores$`Store Name`, ', ',
        TargetStores$Address, ', ',
        TargetStores$City, ', ',
        TargetStores$State, ', ',
        TargetStores$Zipcode
      ),
      output = 'latlon',
      source = 'google'
    )
  )

Now that we have the data, let’s plot. In order to plot this data, we need to put it in a spatial data frame – we can do this using the SpatialPointsDataFrame and CRS functions from the sp package. We need to specify the coordinates, the underlying data, and the projection.

# Make a spatial data frame
TargetStores <-
  sp::SpatialPointsDataFrame(
    coords = TargetStores %>% dplyr::select(lon, lat) %>% data.frame,
    data = TargetStores %>% data.frame,
    proj4string = sp::CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
  )

Now that we have a spatial data frame, we can plot these points – I’m going to plot some other spatial data frames to add context for the Target store point data.

# Plot Target in Minnesota
plot(mnCounties, col = '#EAF6AE', lwd = .4, border = '#BEBF92', bg = '#F5FBDA')
plot(mnRoads, col = 'darkorange', lwd = .5, add = TRUE)
plot(mnRoads2, col = 'darkorange', lwd = .15, add = TRUE)
plot(mnRivers, lwd = .6, add = TRUE, col = '#13BACC')
plot(mnLakes, border = '#13BACC', lwd = .2, col = '#EAF6F9', add = TRUE)
plot(TargetStores, add = TRUE, col = scales::alpha('#E51836', .8), pch = 20, cex = .6)

Target Locations in Minnesota

Yes! We’ve done it. We’ve plotted Target stores in Minnesota. That’s cool and all, but really we haven’t done much with the data we just obtained. Stay tuned for the next post to see what else we can do with this data.

UPDATE: David Radcliffe of the Twin Cities R User group presented something similar using Walmart stores.