Luke Stanke

Data Science – Analytics – Psychometrics – Applied Statistics

Creating orthographic maps (globe-like maps) in Tableau is easier than it seems.

I recently put together an orthographic (globe-like) map of eclipse paths in Tableau – a visualization that at first seems impossible to create. My inspiration was the dynamic globe built in D3.js by the Washington Post. Building this visualization may look difficult, but I want to show that this is a myth: building just about anything in Tableau isn’t that hard when you break it down step-by-step. This post walks through the process of building this visualization.

Step 1: Get eclipse data.

I found eclipse data here for total and hybrid solar eclipses from 1961 through 2100. The data are in the .kmz format – for Google Earth. The bad news: Tableau won’t read multiple .kmz files, so I needed to extract the data from them. The good news: it’s easy to convert each file into something useable. If I change the extension from file_name.kmz to file_name.kmz.zip and then unzip the file, I’ll find a .kml file inside the unzipped folder.
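If you have a lot of .kmz files, the rename-and-unzip step can also be done from R. Here’s a minimal sketch, assuming the files sit in the Downloads folder like the example paths later in this post (the file name below is illustrative):

# Rename one .kmz to .zip and unzip it into a folder of the same name.
kmz <- "/Users/luke.stanke/Downloads/TSE_2038_12_26.kmz"
zipped <- paste0(kmz, ".zip")

file.rename(kmz, zipped)              # file_name.kmz -> file_name.kmz.zip
unzip(zipped, exdir = kmz)            # extract into a folder named after the .kmz
list.files(kmz, pattern = "\\.kml$")  # the .kml sits inside that folder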

 

More bad news: if I take a look at the file_name.kml file, I’ll see a file structure that at first seems very difficult to understand.

Step 2: Extract eclipse data.

Good news: no one should be intimidated by .kml. After a few minutes I recognized that all the data were contained under the “<Document><Folder>” path. Each folder is a different layer that could be mapped. For my visualization I was interested in the Umbral Path – the path of totality – which is identified in the <name> tag. Within the <Folder> structure there are three different <Placemark> tags: Northern Limit, Center Line, and Southern Limit. Within each <Placemark> tag are <coordinates> that I can extract.
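If you want to sanity-check that structure before writing a full parser, here’s a quick sketch in R (the .kml path is illustrative – point it at the file inside one unzipped folder):

library(magrittr)

# Parse one .kml and peek at its structure.
shape <- xml2::read_html("/Users/luke.stanke/Downloads/TSE_2038_12_26.kmz/TSE_2038_12_26.kml")

# Layer names, one per <Folder> (look for "Umbral Path").
shape %>%
  rvest::html_nodes("folder > name") %>%
  rvest::html_text()

# Placemark names nested inside the folders (Northern Limit, Center Line, Southern Limit).
shape %>%
  rvest::html_nodes("folder placemark name") %>%
  rvest::html_text()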

This is just a long-winded way of saying there is a consistent path we can follow to extract the coordinates. All we need to do is pull out that information. I prefer R, so here is a little function I wrote to do it. It’s a long function, but it’s commented so you can see what each step does. It’s also about half of all the code needed to create the file.

# Packages used throughout (dplyr for the data-frame verbs, magrittr for %>% and %<>%);
# other packages are called with their namespaces below.
library(dplyr)
library(magrittr)

# Build a function to extract coordinates.
# Input: the path to one unzipped .kmz folder.
extract_data <- function(loc = "/Users/luke.stanke/Downloads/TSE_2038_12_26.kmz") {

# Find kml file.
  loc <-
    loc %>% 
    list.files() %>%
    .[!stringr::str_detect(., "files")] %>%
    paste(loc, ., sep = "/")

# Read .kml.  
  shape <- xml2::read_html(loc)
  

# Get data for the Umbral Path folder.
  valid <-
    shape %>%
    rvest::html_nodes("folder") %>%
    stringr::str_detect(., "Umbral Path")
  
# Extract coordinates.
  shape_file <-
    shape %>%
    rvest::html_nodes("folder") %>%
    .[valid] %>%
    rvest::html_nodes("placemark linestring coordinates") %>% 
    .[1] %>%
    rvest::html_text("") %>%
    stringr::str_replace_all("\r", "") %>%
    stringr::str_trim() %>%
    stringr::str_split(" ") %>%
    unlist() %>%
    data_frame(
      coords = .
    ) 
 
# Pull data for the Southern Limit. 
  name <-
    shape %>%
    rvest::html_nodes("folder") %>%
    .[valid] %>%
    rvest::html_nodes("placemark name") %>%
    .[1] %>%
    rvest::html_text()
  
# Build the data frame.
  shape_file <-
    strsplit(shape_file$coords, ",") %>%
    unlist() %>%
    matrix(ncol = 3, byrow = TRUE) %>%
    as_data_frame %>%
    mutate(
      V1 = as.numeric(V1),
      V2 = as.numeric(V2),
      V3 = as.numeric(V3),
      name = name
    ) %>%
    rename(
      latitude = V1,
      longitude = V2,
      altitude = V3
    ) %>%
    tibble::rownames_to_column("index") 
  
# Repeat for Central Line.
  if(shape %>%
    rvest::html_nodes("folder") %>%
    .[valid] %>%
    rvest::html_nodes("placemark linestring coordinates") %>%
    length() %>%
    `>=`(2)) {
    
    shape_file2 <-
      shape %>%
      rvest::html_nodes("folder") %>%
      .[valid] %>%
      rvest::html_nodes("placemark linestring coordinates") %>% 
      .[2] %>%
      rvest::html_text("") %>%
      stringr::str_replace_all("\r", "") %>%
      stringr::str_trim() %>%
      stringr::str_split(" ") %>%
      unlist() %>%
      data_frame(
        coords = .
      ) 
    
    name <-
      shape %>%
      rvest::html_nodes("folder") %>%
      .[valid] %>%
      rvest::html_nodes("placemark name") %>%
      .[2] %>%
      rvest::html_text()
    
    shape_file2 <-
      strsplit(shape_file2$coords, ",") %>%
      unlist() %>%
      matrix(ncol = 3, byrow = TRUE) %>%
      as_data_frame %>%
      mutate(
        V1 = as.numeric(V1),
        V2 = as.numeric(V2),
        V3 = as.numeric(V3),
        name = name
      ) %>%
      rename(
        latitude = V1,
        longitude = V2,
        altitude = V3
      ) %>%
      tibble::rownames_to_column("index")
    
    shape_file %<>%
      bind_rows(shape_file2)
    
    
  }
  
# Repeat for Northern Limit.
  if(shape %>%
     rvest::html_nodes("folder") %>%
     .[valid] %>%
     rvest::html_nodes("placemark linestring coordinates") %>%
     length() %>%
     `>=`(3)) {
    
    shape_file2 <-
      shape %>%
      rvest::html_nodes("folder") %>%
      .[valid] %>%
      rvest::html_nodes("placemark linestring coordinates") %>% 
      .[3] %>%
      rvest::html_text("") %>%
      stringr::str_replace_all("\r", "") %>%
      stringr::str_trim() %>%
      stringr::str_split(" ") %>%
      unlist() %>%
      data_frame(
        coords = .
      ) 
    
    name <-
      shape %>%
      rvest::html_nodes("folder") %>%
      .[valid] %>%
      rvest::html_nodes("placemark name") %>%
      .[3] %>%
      rvest::html_text()
    
    shape_file2 <-
      strsplit(shape_file2$coords, ",") %>%
      unlist() %>%
      matrix(ncol = 3, byrow = TRUE) %>%
      as_data_frame %>%
      mutate(
        V1 = as.numeric(V1),
        V2 = as.numeric(V2),
        V3 = as.numeric(V3),
        name = name
      ) %>%
      rename(
        latitude = V1,
        longitude = V2,
        altitude = V3
      ) %>%
      tibble::rownames_to_column("index")
    
    shape_file %<>%
      bind_rows(shape_file2)
    
    
  }  

# Combine everything together.

  
  shape_file %>%
    rename(longitude = latitude, latitude = longitude) %>%
    mutate(index = as.numeric(index)) %>%
    filter(name != "Central Line") %>% # Chose to drop Central line later.
    mutate(
      index = ifelse(name == "Northern Limit", -index, index),
      Date = loc %>%
        stringr::str_split("TSE_") %>% 
        .[[1]] %>%
        .[2] %>%
        stringr::str_replace_all(".kml", "") %>%
        stringr::str_replace_all("_", "-") %>%
        lubridate::ymd()
    ) %>%
    arrange(index)  
}

With this function we can quickly build a single data frame with all of the eclipse data.

# Search for all TSE files in the Downloads folder.
key_dirs <-
  list.dirs("/Users/luke.stanke/Downloads/") %>%
  .[stringr::str_detect(., "TSE_")] %>%
  .[!stringr::str_detect(., "files")]

# Run a for loop to create data.
for(i in 1:length(key_dirs)) {
  if(i == 1) {
    eclipse <- extract_data(key_dirs[i])
  } else {
    eclipse %<>%
      bind_rows(
        extract_data(key_dirs[i])
      )
  }
}
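The loop gets the job done; the same thing can also be written as a purrr one-liner if that’s more your style (a sketch, assuming the same key_dirs vector as above – purrr isn’t used elsewhere in this post):

# Apply extract_data() to every eclipse folder and row-bind the results.
eclipse <- purrr::map_df(key_dirs, extract_data)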

This produces a data frame with latitude, longitude, altitude (just 1), index (for the path of the polygons) and date of eclipse.

Step 3: Build Projections.

If we quickly visualize this here’s what we’ll have.

This is nice, but I really want this projected to look like a globe – an orthographic projection. To do that I need to create the math for the projection. We’re going to create three calculations in Tableau: [x], [y], and a filter called [valid]. But first we’ll create two parameters: a base latitude and a base longitude. These will be our reference point and allow us to move around the globe.

The [Base Longitude] parameter is set to a range of -180 to 180 with steps of 15 degrees. This will eventually rotate the globe clockwise or counter-clockwise.

The [Base Latitude] parameter is set to a range of -90 to 90 with steps of 15 degrees. This will tilt the globe up or down towards the poles.

With these, let’s build our [x], [y], and [valid] calculations for the globe projection (we’re going to edit these later). [x] and [y] give the new coordinates, while [valid] works as a filter to show only the points on the visible, outward-facing side of the projection.

// In Tableau: Create X coordinates.
COS(RADIANS([Latitude])) * SIN(RADIANS([Longitude]) - RADIANS([Base Longitude]))
// In Tableau: Create Y coordinates.
(
  COS(RADIANS([Base Latitude])) * 
  SIN(RADIANS([Latitude]))
) - (
  SIN(RADIANS([Base Latitude])) * 
  COS(RADIANS([Latitude])) * 
  COS(RADIANS([Longitude]) - RADIANS([Base Longitude]))
)
// In Tableau: Valid filter.
(
  SIN(RADIANS([Base Latitude])) *
  SIN(RADIANS([Latitude]))
) + (
  COS(RADIANS([Base Latitude])) *
  COS(RADIANS([Latitude])) *
  COS(RADIANS([Longitude]) - RADIANS([Base Longitude]))
)
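If you’d like to double-check the math outside of Tableau, here is a small R sketch of the same orthographic projection. It’s a helper I’m adding purely for illustration – it isn’t part of the workbook:

# x/y are the projected coordinates; valid is the cosine of the angular distance
# from the reference point, so points with valid >= 0 face the viewer.
ortho_project <- function(lat, lon, base_lat = 0, base_lon = 0) {
  to_rad <- function(deg) deg * pi / 180
  lat  <- to_rad(lat);      lon  <- to_rad(lon)
  lat0 <- to_rad(base_lat); lon0 <- to_rad(base_lon)
  data.frame(
    x     = cos(lat) * sin(lon - lon0),
    y     = cos(lat0) * sin(lat) - sin(lat0) * cos(lat) * cos(lon - lon0),
    valid = sin(lat0) * sin(lat) + cos(lat0) * cos(lat) * cos(lon - lon0)
  )
}

# Example: a point in Minnesota viewed from a globe centered at (45, -90).
ortho_project(44.98, -93.27, base_lat = 45, base_lon = -90)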

Here is what I should see if I add [x] and [y] to the columns and rows, add date to color and index to path, and add [valid] to filter and set the range from 0 to 1.

Boom! It’s the eclipses projected onto a sphere. Once again, bad news: we don’t have the earth behind the eclipses. But we can find this data and append it to the existing dataset.

Step 4: Download and extract shapefiles of countries.

In my opinion the go-to file is “ne_10m_admin_0_countries_lakes.shp”. This is country boundaries, without lakes, at Natural Earth’s most detailed 1:10m scale. It’s far more detailed than we probably need, but it’s nice. Just give it a search on the web and download the file.

Data are easily read into R with the st_read() function and polygon data can be extracted with the st_geometry() function.

Taking a look at the data at this point, I noticed the data are MULTIPOLYGON. The best way to describe this is that each country is (potentially) made up of multiple polygons, so each country has a set of polygons nested within it. In some cases these polygons are nested multiple levels deep. A little investigation told me there were three levels. Being the creative type, I named these levels “set”, “subset”, and “subsubset”.

# Read in Land data.
library(sf)
land <- st_read("/Users/luke.stanke/Downloads/ne_10m_admin_0_countries_lakes/ne_10m_admin_0_countries_lakes.shp")

# Get underlying data in the shapefile.
# Pull these attribute columns before dropping the geometry below.
# ADMIN, ABBREV are labels.
# scalerank, LABELRANK, TINY are admin levels I considered as filters.
land_data2 <- 
  land %>% 
  as.data.frame() %>%
  select(scalerank, LABELRANK, ADMIN, ABBREV, TINY) %>%
  as_data_frame()

# Obtain underlying geometry/shapes.
land <- st_geometry(land)


# Create an empty data frame to load data into.
land_data <-
  data_frame(
    longitude = double(),
    latitude = double(),
    set = integer(),      # Data are nested.
    subset = integer(),   # These are nesting
    subsubset = integer() # identifiers (filled with i/j/k below).
  )

From there it’s a matter of looping through the layers to extract the data.

# sort through highest level (countries).
for(i in 1:length(land)) {
  level_1 <-
    land %>%
    .[[i]] %>%
    stringr::str_split("MULTIPOLYGON \\(")
  
  # Sort through second level. (subset)
  for(j in 1:length(level_1)) {
    level_2 <-
      level_1 %>%
      .[[j]] %>%
      stringr::str_split("list\\(") %>%
      .[[1]] %>%
      .[. != ""] %>%
      stringr::str_split("c\\(") %>%
      .[[1]] %>%
      .[. != ""]

# Sort through third level get data and add to dataframe. (subsubset)    
    for(k in 1:length(level_2)) {
      level_2 %>%
        .[k] %>%
        stringr::str_replace_all("\\)", "") %>%
        stringr::str_split(",") %>%
        .[[1]] %>%
        stringr::str_trim() %>%
        as.numeric() %>%
        .[!is.na(.)] %>%
        matrix(ncol = 2) %>%
        as_data_frame() %>%
        rename(
          longitude = V1,
          latitude = V2
        ) %>%
        mutate(
          set = i,
          subset = j,
          subsubset = k
        ) %>%
        bind_rows(land_data) ->
        land_data
    }
  }
}

# Sort data by set, subset, subsubset.    
land_data %<>%
  arrange(set, subset, subsubset)

# Join underlying data.    
land_data %<>%
  mutate(set = as.character(set)) %>%
  left_join(
    as_data_frame(land_data2) %>%
    tibble::rownames_to_column("set")
  )

land_data %<>%
  mutate(
    name = "Land",
    altitude = 1
  ) %>%
  group_by(set, subset) %>%
  mutate(
    index = row_number()
  )

This produces a dataset with columns of latitude, longitude, set, subset, and subsubset. Now all we need to do is join this data together.

eclipse %<>%
  mutate(altitude = 1.01) %>%
  bind_rows(land_data)

# Write file. 
eclipse %>%
  readr::write_csv("eclipse.csv")    

If I go back to Tableau and refresh my data, add [set], [subset], and [subsubset] to detail, I now have countries.

To place the countries in the right order we need to manually sort the [set] dimension, placing “NA” at the highest level of the sort screen.

Step 5: Edit functions to incorporate the great circle (globe background).

Things are starting to look good. With a little formatting I could even be done here. But what I want is a background of ocean – a great circle – representing the boundary of the earth. I’m going to use the following R code to build my great circle:

# Build the great circle: a data point every 1/3 of a degree. 
outline <-
  data_frame(
    index = seq(0, 360, 1/3), 
    altitude = 1,
    set = "Outline"
  ) %>%
  mutate(
    latitude = altitude * cos(2*pi*(index/360)),
    longitude = altitude * sin(2*pi*(index/360)),
    index = index*3
  )

This produces a dataset with five columns: [index], [altitude], [set], [latitude], and [longitude]. Once we have this dataset all we need to do is bind the data to our current eclipse data in R.

# Write file. 
eclipse %>%
  bind_rows(outline) %>%
  readr::write_csv("eclipse.csv")    

Step 6: Incorporate great circle in Tableau.

So here’s the thing. For the outline data in R, I already set the latitude and longitude to their projected values, meaning that if Tableau applied the projection again – which would happen if I left the calculations as they are – I wouldn’t get the values I want, because the projection would be applied twice. In other words, I don’t want Tableau to apply the [x], [y], and [valid] functions when [set] is equal to “Outline”. You’re probably wondering why I did it this way. One option would be to generate a full grid of points – (360*3) multiplied by (180*3) – so that I wouldn’t have to touch my current functions, but that would bring in far more data than necessary. Because I am bringing in only the outline, and I don’t want the outline to rotate as I change the projection, I need to edit my functions to support this particular situation.

Here is what the [x], [y], and [valid] functions should now look like:

// In Tableau: Edit X coordinates.
IF [set] = "Outline"
THEN [Longitude]
ELSE 
    COS(RADIANS([Latitude])) * 
    SIN(RADIANS([Longitude]) - 
    RADIANS([Base Longitude]))
END
// In Tableau: Edit Y coordinates.
//COS(RADIANS([Latitude])) * SIN(RADIANS([Longitude]))
IF [set] = "Outline"
THEN [Latitude]
ELSE
  (
    COS(RADIANS([Base Latitude])) * 
    SIN(RADIANS([Latitude]))
  ) - (
    SIN(RADIANS([Base Latitude])) * 
    COS(RADIANS([Latitude])) * 
    COS(RADIANS([Longitude]) - RADIANS([Base Longitude])))
END
// In Tableau: Edit valid coordinates.
IF [set] = "Outline"
THEN [Latitude]
ELSE
  (
    SIN(RADIANS([Base Latitude])) *
    SIN(RADIANS([Latitude]))
  ) + (
    COS(RADIANS([Base Latitude])) *
    COS(RADIANS([Latitude])) *
    COS(RADIANS([Longitude]) - RADIANS([Base Longitude]))
  )
END

From the re-worked functions above you should see that we are basically skipping the application of the projection for the great circle – which is exactly what we wanted to do. When we take a look at our visualization, we have the following:

There is no issue with the visualization – the countries are just the same color as the background. Let’s create a crude color palette to color everything how we would like:

// In Tableau: Create [color] dimension.
IF [set] = "Outline"
THEN "Background"
ELSEIF [ADMIN] != "NA"
THEN "Land"
ELSE STR(-ROUND([Years]/10)*10) // Group eclipses by 10 year occurrences.
END

Dropping this new color calculation onto the color mark should produce something like this:

From there a little customization of the color can produce the following:

If I wanted to show the outline of the countries and the eclipse polygons all I need to do is add a border to the color.

Step 7: Create lines of latitude and longitude to add to the map.

Okay, this is the hardest part of the map. Up until now it was basically: get the data, create projections, edit data, edit projections, re-edit data, re-edit projections. Now what I want to do is add grid lines for latitude and longitude. Here’s the thing: our current visualization is made up of polygons, and I can’t mix-and-match mark types to have both lines and polygons. This means I need to make the lines into polygons: draw each latitude and longitude line, then trace the same path backwards until I am back where I started. The result is essentially a polygon with no surface between its edges.

Here’s the R code to create the latitude and longitude “polygons”:

# Build latitude/longitude polygons.
# Build latitude.
latitude <-
  c(seq(0, 90, 20), -seq(0, 90, 20)) %>%
  sort() %>%
  unique() %>%
  rep(each = 360)

longitude_ <-
  -179:180 %>%
  rep(length(unique(latitude)))

latitude <-
  data_frame(
    latitude = latitude,
    longitude = longitude_,
    altitude = 1,
    index = longitude_,
    set = "latitude",
    subset = latitude
  )

latitude %<>%
  bind_rows(
    latitude %>% 
      mutate(index = 361 - index)
  )

# Build longitude.
longitude <-
  seq(-180, 179, 20) %>%
  rep(each = length(seq(-90, 90, 1)))

latitude_ <-
  seq(-90,90, 1) %>%
  rep(length(seq(-180, 179, 20)))

longitude <-
  data_frame(
    longitude = longitude,
    latitude = latitude_,
    altitude = 1,
    index = latitude_,
    set = "longitude",
    subset = longitude
  )

longitude %<>%
  bind_rows(
    longitude %>%
      mutate(
        index = 361 - index
      )
  )

# Output FINAL .csv
eclipse %>%
  bind_rows(latitude) %>%
  bind_rows(longitude) %>%
  bind_rows(outline) %>%
  readr::write_csv("eclipse.csv")

This is the last time I’ll use R and export data from it. Now if we refresh the dataset in Tableau, we’ll see the following:

Step 8: Fix the latitude lines.

Everything looks nearly correct except for the lines connecting the ends of the latitude lines. If you rotate the map you’ll see that sometimes they show up and other times they do not. The short explanation: they show up whenever the start of a latitude line falls inside the currently visible space, because the valid filter clips the polygon before it can loop back on itself (remember, each latitude polygon is designed to go all the way around and then trace back along the same path to its start). The fix is simple: adjust the longitude so that each latitude line starts on the opposite side of the currently visible earth. This allows the clipping to happen correctly.

I’ll create a new calculation called [Long Adj] to represent the adjusted longitude:

// In Tableau: Create [long adj] measure.
IF CONTAINS([set], "latitude")
THEN [Longitude] + [Base Longitude]
ELSE [Longitude]
END

With this new longitude calculation, we then need to edit the [x], [y], and [valid] functions one last time.

// In Tableau: Edit X coordinates.
IF [set] = "Outline"
THEN [Longitude]
ELSE 
    COS(RADIANS([Latitude])) * 
    SIN(RADIANS([Long Adj]) - 
    RADIANS([Base Longitude]))
END
// In Tableau: Edit Y coordinates.
//COS(RADIANS([Latitude])) * SIN(RADIANS([Long Adj]))
IF [set] = "Outline"
THEN [Latitude]
ELSE
  (
    COS(RADIANS([Base Latitude])) * 
    SIN(RADIANS([Latitude]))
  ) - (
    SIN(RADIANS([Base Latitude])) * 
    COS(RADIANS([Latitude])) * 
    COS(RADIANS([Long Adj]) - RADIANS([Base Longitude])))
END
// In Tableau: Edit valid coordinates.
IF [set] = "Outline"
THEN [Latitude]
ELSE
  (
    SIN(RADIANS([Base Latitude])) *
    SIN(RADIANS([Latitude]))
  ) + (
    COS(RADIANS([Base Latitude])) *
    COS(RADIANS([Latitude])) *
    COS(RADIANS([Long Adj]) - RADIANS([Base Longitude]))
  )
END

These updates produce the following visualization after you re-sort the [set] values of “latitude” and “longitude” to sit just below “NA”:

Step 9: Add borders to countries and keep latitude/longitude lines.

Now this looks really good. The only remaining change: I want the eclipse polygons to have no outlines, while every country keeps its borders and the latitude/longitude grid lines still show up.

This requires me to create two x calculations and then create a dual axis chart.

My new x calculations – [x1] and [x2] are then:

// In Tableau: Create [x1] calculation for everything but eclipses.
IF ISNULL([Date])
THEN [x]
END
// In Tableau: Create [x2] calculation for only eclipses.
IF NOT ISNULL([Date])
THEN [x]
END

I can now replace [x] with [x1] on the columns shelf and add [x2] to the columns shelf. I’ll also edit the borders and the color opacity on [x2].

Step 10: Clean things up.

From here all I need to do is create the dual axis chart, synchronize the axes, do some formatting, and remove tooltips.

And that’s it. Is there a lot of custom coding? Yes – to get the eclipse data, plus a little work to get the multipolygons for each country. But now I have country shapes for future maps, so there is little work involved to do this again. Was it hard? No. There are only a few calculations created in Tableau: [x] (along with [x1] and [x2]), [y], [valid], [long adj], and the optional [color]. There are also two parameters that rotate the map: [Base Latitude] and [Base Longitude]. Given the 7 custom components, I’d call this one easy but time-intensive (mostly to write the blog).

Taking Advanced Analytics to the Cloud – Part I: R on AWS


Running R on the cloud isn’t very difficult. This demo shows how to get Rstudio running on Amazon Web Services. To run R on the cloud we need to initiate a machine/computer and install R – that’s all very simple. Where you might get caught up is in the settings and permissions.

Step 1: Create an AWS Account

If you don’t have an AWS account you’ll need to sign up.

Step 2: Create a Role

Once you are signed up and signed into AWS, the first thing you’ll need to do is create a role for your machine. A role provides the necessary permissions for your machine to access the various APIs offered by AWS.

You can do this by searching IAM on the services tab.

Once you are on the page click “Roles” on the left menu, then the “Create New Role” button.

Then select Amazon EC2 from the AWS service role list.

There are a number of services that you could run with the instance you are about to set up – and there are varying permissions, too. For now, just search for and select AmazonS3FullAccess, AmazonEC2FullAccess, RedshiftFullAccess, and AthenaFullAccess. You won’t need any of these right away, but they will be useful when connecting to other services like S3 or another EC2 instance. Note: the photo does not include AthenaFullAccess, but you should include it!

From there you’ll be good to go! Your policies should be selected according to your access needs. In fact, if you were doing this properly, you’d create policies that are scoped more narrowly than the over-arching full-access options we’ve selected for this demo.

Step 3: Create a Key Pair

Next, you’ll want to create a key pair. This will allow you to securely log into your instance to update your machine and install Rstudio.

Go to Services and search EC2 in the search bar. Once the EC2 page loads, click “Key Pairs” under the “Network & Security” section on the left menu bar.

From there click “Create Key Pair”. Give your key a name and hit create. The key pair will download as a .pem file. Do not lose this key! Also: do not share this key!

Step 4: Create a R/Rstudio/Shiny Security Group

As you can see from the steps thus far, it’s all about security. From the EC2 menu, under the “Network & Security” section on the left menu, select “Security Groups”. Click the “Create Security Group” button and a pop-up will appear. Create a security group named “Rstudio/Shiny”. Under description write “Allows port access for R and Shiny”. You can leave the VPC drop-down alone. On the inbound tab – the default – add three new rules.

Under the first rule select SSH from the dropdown. The source should be auto populated with 0.0.0.0/0. This opens up SSH to anywhere in the world – but you’ll need the .pem key to login.

For the second rule leave it as custom TCP and type 8787 as the port range. This is the port that needs to be opened on the server for you to connect to Rstudio. Under source type 0.0.0.0/0. This means you could log into Rstudio from any IP address. You could also just use your own IP address if you know it, too.

For the third rule, again leave it as custom TCP and type 3838 as the port range. This is the port for the Shiny connection. We’re not going to use it for this demo, but it’ll be very useful in the future. Under source type 0.0.0.0/0 as well.

Step 5: Launch EC2 Instance

Staying in the EC2 Section click “Instance” under the “Instances” Section on the left menu. Click the “Launch Instance” button. This will take you to “Step 1: Choose an Amazon Machine Image”.

You’ll have a number of tabs to select AMIs from. Stay on the Quick Start tab and select Amazon Linux AMI – it’s the first option. It has a number of pre-built tools that are useful – but it doesn’t have R installed. You’ll do that in a bit.

For “Step 2: Choose an Instance Type”, choosing an instance type can be daunting. Here you are essentially specifying the type of computer you want to run. I typically select from the General purpose, Compute optimized, or Memory optimized options depending on the type of models I’m running and the type of data I’m working with. For this example select t2.micro, because this is just a demo of the tool. Click “Next: Configure Instance”. Note: you’ll need a more powerful machine to install the packages in Step 8 and beyond – I’d recommend a c2.large machine just to be safe.

For “Step 3: Configure Instance”, under “IAM Role” select the role you created earlier – my role was called EC2-S3-Redshift. Note: under Advanced Options you could pass a bash script to configure your instance; instead we’ll do it with command-line tools. Click “Next: Add Storage”.

For “Step 4: Add Storage” we can stick with the default settings – for now. Click “Next: Add Tags”.

For “Step 5: Add Tags”, click “Add Tag”, enter a key of “Name” and a value of “Rstudio Example” – this will give our machine the name “Rstudio Example”. Having a name is important when you have more than one machine running. Click “Next: Configure Security Group”.

For “Step 6: Configure Security Group”, under “Assign a Security Group” select “Select an existing security group”. Find and select your “Rstudio/Shiny” security group. If this is your first time through, it will be easy to spot; if you have multiple security groups like me, you’ll have to search through the lot. Click “Review and Launch”.

On the review screen, make sure you’ve selected the instance type, security group, and IAM role as described above. Click “Launch” – you’ll get a pop-up to select an existing key pair or create a new key pair. Choose an existing key pair and select the key pair you just created (mine is called awesome_key). Acknowledge that you’ve selected the key pair and that you’ll need it to log on. Note: make sure you have the key pair to log into your instance. This should go without saying, but I’m saying it – you need your key pair to log into your machine and set it up! Launch your instance.

Step 6: Login to your instance

This is where the Windows OS world diverges from the Apple/Unix/Linux/Ubuntu worlds. Windows doesn’t have a built-in terminal like these other machines, so you’ll have to download and set up PuTTY if you don’t have it already. Next you’ll use your terminal to SSH into your newly created instance and set it up. The instance will likely need about 60 seconds to spin up from when you hit launch.

After you launch, click on your instance id just to the right of the message saying the instance launch has been initiated.

This will take you to your instance on the EC2 dashboard. The dashboard is full of important information – most notably it’ll tell you if your instance is up and running. From the image below you can see my instance state is a green “running”. This tells me it’s ready to be SSH’ed into.

You can also see the Public DNS on the right and on the description tab. We’ll need that information to SSH into the instance. Note: you can also see my IAM role and key pair name. Click the “Connect” button just to the right of the “Launch Instance” button. This provides additional directions on how to connect to your instance.

First, let’s open our terminal and change directories so that we are in the folder that contains our .pem key. If you haven’t moved it out of your downloads folder, it’s probably there – and you should probably move it.

Next change the permissions of your key pair to allow you to SSH onto your AWS instance.

chmod 400 awesome_key.pem

Then SSH onto your machine using the following format. You’ll have to replace the key pair with the key pair you’ve created. You’ll also have to change the public DNS address to the address of your machine.

ssh -i "awesome_key.pem" ec2-user@ec2-54-145-158-106.compute-1.amazonaws.com

Once you are logged in, your terminal should look like this:

Step 7: Setup your instance

Once we are logged in we need to update our machine, install a few additional programs, install R, and install Rstudio. You can do this by running the following commands line-by-line through your EC2 instance.

# Update the machine
sudo yum -y update

# Install programs that run well with the devtools package
sudo yum -y install libcurl-devel openssl-devel # used for devtools

# Install programs that assist APIs
sudo yum -y install libxml2 libxml2-devel

# Install R
sudo su
yum install -y R

#  Install PostgreSQL
yum install -y postgresql-devel

After running this code you will have 1) updated your machine; 2) installed tools that allow the devtools package to run; 3) installed tools that allow packages like httr and aws.s3 to run; and 4) installed base R.

Next you’ll want to install the most recent version of Rstudio and Shiny. Check here to find the most recent releases of Rstudio and Shiny. Edit the code so that you install the most recent version. Run the install of Rstudio and Shiny.


# Install RStudio Server - change version when installing your Rstudio
wget -P /tmp https://s3.amazonaws.com/rstudio-dailybuilds/rstudio-server-rhel-1.0.143-x86_64.rpm
sudo yum install -y --nogpgcheck /tmp/rstudio-server-rhel-1.0.143-x86_64.rpm

#install shiny and shiny-server - change version when installing your Shiny Server
R -e "install.packages('shiny', repos='http://cran.rstudio.com/')"
wget https://download3.rstudio.org/centos5.9/x86_64/shiny-server-1.5.3.838-rh5-x86_64.rpm
yum install -y --nogpgcheck shiny-server-1.5.3.838-rh5-x86_64.rpm

#add user(s)
sudo useradd -m stanke
sudo passwd stanke

Finally, add a user and set their password. Once this is done you can close your SSH session.

Step 8: Log into Rstudio

Copy your public DNS that was located on your EC2 page earlier. Paste that public DNS into your browser and add “:8787” after your instance – in my case “ec2-54-145-158-106.compute-1.amazonaws.com:8787”. Hit enter. Your Rstudio login page should appear. Enter your credentials from your new user. Click “Sign In”.

Step 9: Setup your Rstudio Defaults.

So technically you’ve done it. You’ve logged into Rstudio on AWS. But let’s take this another few steps. Let’s set up some defaults so that any time we want to set up an Rstudio instance on AWS we don’t have to go through the hassle we just did above. Let’s install a bunch of packages that you might regularly use. This install might take a while since you’ll be installing a number of packages onto your machine.

## install packages if not present
install.packages(
  c(
    "devtools",
    "sparklyr",
    "ggplot2",
    "magrittr",
    "tidyverse",
    "Lahman",
    "DBI",
    "rsparkling",
    "h2o",
    "ghit",
    "xml2",
    "stringr",
    "magrittr",
    "data.table",
    "clipr"
  )
)

Let’s also create a new .R file and write a few lines of code. You don’t really need to do this, but it’ll show the power of Step 10 in a few minutes.

Here is the R script you can use:

### get data
data("iris")

###  See data structure
head(iris)

###  Save iris data locally.
write.csv(iris, "iris.csv")

After the initial script: A .R file, an iris object, and an iris.csv file.

Step 10: Take an image of your current instance

Setting up Rstudio is a pain – all that time waiting for packages to install and getting your data just right, only to possibly find out you didn’t size your instance correctly. No one wants to deal with that every time they set up R on AWS. Fortunately you don’t have to: you can take a snapshot of your instance as it is and spin up new instances from that point in time.

To take a snapshot go back to the webpage with your EC2 details. Click the “Actions” button, then go to “Image” and “Create Image”.

From there enter an Image Name of “Rstudio AMI Example” and Image Description of “Rstudio AMI Example”. Click “Create Image” and wait a few minutes.

Step 11: Create an instance from the AMI

Launch a new instance. On “Step 1: Choose an Amazon Machine Image”, click “My AMIs” and select “Rstudio AMI Example”. Follow Step 5 above for the rest of the setup, but tag this instance’s Name as “Rstudio AMI”. If you’ve set everything up correctly you shouldn’t need to SSH into the instance to configure it.

Copy the Public DNS into your browser and add “:8787”. Log in with the username and password you created earlier. Once logged in you’ll see that your .R script and .csv file are still saved to the instance – allowing you to quickly jump back into your analyses.

In fact, creating an image and then spinning up multiple instances lets you quickly fit models and test which instance type is best for the models and data you are currently working with – and eventually minimize costs. If the data become more complicated or larger, you can create a new instance that accommodates those changes. Sometimes the data become so large that we need a distributed system – Rstudio can sit on top of distributed systems as well, and I’ll talk about that in a different post. This approach will give you a great jump-start, though.


Taking Advanced Analytics to the Cloud – Part II: Objects from S3 to R using the aws.s3 package


When taking advanced analytics to the cloud you’ll need a strong understanding of your platform – whether it’s compute, storage, or some other feature. This tutorial walks through reading from and writing to Amazon Web Services’ Simple Storage Service (S3). For this demo, code will be running through Rstudio, which is running on a Linux server in the cloud – you can learn how to set that up here.

Using aws.s3 package

Before I found this package I was doing things the hard way – using the AWS command line tools to put and get data from S3. The aws.s3 package makes these tasks very convenient.

library(aws.s3)
library(magrittr)

Saving System Variables

To make life easier you should save your AWS access key credentials as environment variables. Though doing this makes life easier, it is a greater security risk.

Sys.setenv(
  "AWS_ACCESS_KEY_ID" = "ABCDEFGHIJLMNOP",
  "AWS_SECRET_ACCESS_KEY" = "ABCAaKDKJHFSKhfiayrhekjabdfkasdhfiaewr0293u4bsn"
)

Looking into S3

Saving your credentials eliminates additional arguments needed to run each of the aws.s3 functions shown below. Let’s start by looking at my buckets with the bucket_list_df function. This returns my bucket names and creation dates as a data frame.

bucket_list_df() %>%
  dplyr::arrange(dplyr::desc(CreationDate)) %>%
  head()
##                                            Bucket             CreationDate
## 1                                sample-data-demo 2017-06-01T20:04:07.000Z
## 2 aws-athena-query-results-666957067213-us-east-1 2017-05-20T18:18:31.000Z
## 3                 aws-logs-666957067213-us-east-1 2017-02-19T21:59:02.000Z
## 4                                         test.io 2017-01-25T13:38:32.000Z
## 5                                            test 2017-01-25T13:37:28.000Z
## 6                                      stanke.co 2016-10-04T13:02:41.000Z

I’m most interested in the sample-data-demo bucket. We can use the get_bucket function to examine the contents of the bucket. The output comes as a list – which isn’t always the best to work with – so I’ve written some code to take the output and transform it into a data frame/tibble.

##  List files in bucket
files <- get_bucket("sample-data-demo")

##  Convert files to tidy
files_df <-
  tibble::data_frame(
    file = character(),
    LastModified = character()
  )

n_files <- length(files)

for(i in 1:n_files) {
  files_df <-
    tibble::data_frame(
      file = files[i][[1]]$Key,
      LastModified = files[i][[1]]$LastModified
    ) %>% 
    dplyr::bind_rows(files_df)
}

rm(n_files)

head(files_df)
## # A tibble: 6 x 2
##                   file             LastModified
##                  <chr>                    <chr>
## 1 flights_2008.csv.bz2 2017-06-04T02:39:13.000Z
## 2     flights_2008.csv 2017-06-04T16:01:52.000Z
## 3 flights_2007.csv.bz2 2017-06-04T02:39:08.000Z
## 4     flights_2007.csv 2017-06-04T15:59:50.000Z
## 5 flights_2006.csv.bz2 2017-06-04T02:39:03.000Z
## 6     flights_2006.csv 2017-06-04T15:57:58.000Z
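As an aside, the same tidy listing can be sketched with purrr instead of a for loop (this assumes the list structure returned by get_bucket() shown above; purrr isn’t used elsewhere in this post):

##  One row per object in the bucket.
files_df <-
  purrr::map_df(
    get_bucket("sample-data-demo"),
    ~ tibble::data_frame(file = .x$Key, LastModified = .x$LastModified)
  )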

Putting data into S3

Putting data into S3 is pretty easy. We can use several functions: s3save, put_object, or save_object.

Where do we use these?

S3save

s3save is analogous to save: we can take an object and save it as an .Rdata file on S3. Here I’ll take a local file – pro football game results – create an object, and then save it to S3.

games <- readr::read_csv("data/NFL/GAME.csv")

s3save(games, object = "games.Rdata", bucket = "sample-data-demo")

Please note that I have to save this as an .Rdata object – even though it was originally a .csv file.
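To pull that .Rdata object back into a later session, the package also provides s3load, the mirror image of s3save – a quick sketch (it restores the saved object into your current environment):

# Restores the `games` object saved above into the current R session.
s3load(object = "games.Rdata", bucket = "sample-data-demo")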

put_object

put_object allows me to put any file on a local drive onto S3. This can be basically any file type – .Rdata, .csv, .csv.bz2 – they are all covered here. There are three arguments you need to know: 1) file = the location of the file you want to send to S3; 2) object = the name you want to give the S3 object – probably the same as the file argument; and 3) bucket = the name of the bucket you’d like to place the object into.

put_object("data/NFL/GAME.csv", "games.csv", bucket = "sample-data-demo" )
## [1] TRUE

Here we took the same .csv we read in earlier and saved the object as games.csv into the sample-data-demo bucket. You’ll see TRUE is returned indicating the file has successfully uploaded to S3.

save_object

The save_object function sounds like it might save a file to S3. But it’s actually the opposite of put_object. save_object takes a file on S3 and saves it to your working directory for you to use. I REPEAT: save_object takes a file from S3 and saves it to your working directory.

save_object("games.csv", file = "games.csv", bucket = "sample-data-demo")
## [1] "games.csv"

We can then take this file and read it like we would normally do.

games <- readr::read_csv('games.csv')
dim(games)
## [1] 4256   17

get_object

The save_object function stores information locally. If you want to keep as much as possible in-memory you can use the get_object function – which returns the file as raw data.

games <- get_object("games.csv", bucket = "sample-data-demo")

games[1:100]
##   [1] 67 69 64 2c 73 65 61 73 2c 77 6b 2c 64 61 79 2c 76 2c 68 2c 73 74 61
##  [24] 64 2c 74 65 6d 70 2c 68 75 6d 64 2c 77 73 70 64 2c 77 64 69 72 2c 63
##  [47] 6f 6e 64 2c 73 75 72 66 2c 6f 75 2c 73 70 72 76 2c 70 74 73 76 2c 70
##  [70] 74 73 68 0d 0a 31 2c 32 30 30 30 2c 31 2c 53 55 4e 2c 53 46 2c 41 54
##  [93] 4c 2c 47 65 6f 72 67 69

As I mentioned, the get_object function returns raw data. This means if you look at the immediate output you’ll see the raw bytes as they are. To return the data as intended you’ll need to use the rawToChar function to convert the data:

games <- 
  aws.s3::get_object("games.csv", bucket = "sample-data-demo") %>%
  rawToChar() %>%
  readr::read_csv()

dim(games)
## [1] 4256   17

This works pretty well for reading in most file types, but I’ve found it very hard to use for compressed files. I’d recommend save_object for .bz2, .gz, .zip or any other compressed file. I just haven’t found a good solution yet.
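In line with that recommendation, one workable pattern is to save_object the compressed file to disk and let readr decompress it on read (readr reads .gz and .bz2 transparently). A sketch using one of the files listed earlier:

# Download the compressed object, then read it; readr handles the .bz2.
save_object("flights_2008.csv.bz2", file = "flights_2008.csv.bz2", bucket = "sample-data-demo")
flights <- readr::read_csv("flights_2008.csv.bz2")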

delete_object

To delete an object on S3, just use the delete_object function. Here I’ll delete the files I just created for this demo.

aws.s3::delete_object("games.Rdata", bucket = "sample-data-demo")
## [1] TRUE
aws.s3::delete_object("games.csv", bucket = "sample-data-demo")
## [1] TRUE

put_bucket and delete_bucket

I can easily create or delete a bucket with the put_bucket and delete_bucket functions. With put_bucket I can also specify its ACL – i.e. private, public read, or public read/write.

##  Make a bucket.
# put_bucket("stanke123123123")
##  And make it disappear.
# delete_bucket("stanke123123123")
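For reference, the ACL is passed when the bucket is created. A hedged sketch is below – the acl argument name is what I recall from the aws.s3 documentation, so double-check it against your installed version:

##  Create a bucket with an explicit ACL (left commented out, like the example above).
# put_bucket("stanke123123123", acl = "public-read")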

These functions get you started with S3 on AWS. There are a host of other services available that I’ll continue to share.


Impressions of Climate Change

Sports Analytics Mirrors Real Life: My Big Takeaways from SSAC17

Last weekend I finished up my first Sloan Sports Analytics Conference. For many in the analytics world, attending means running into big-name sports analytics figures: Bill James, Jessica Gelman, Nate Silver, Bill Barnwell, and Daryl Morey, to name a few. And while the conference boasts big names that packed very large auditoriums, there were some excellent smaller sessions that anyone in attendance could have taken a lot from. Here are a few of my bigger takeaways:

 

Perception doesn’t always meet reality.

In any given sport there are two or three organizations on the leading edge of innovation. These organizations are consistent performers and are constantly seeking competitive advantages in-game and in-business. Teams are now tracking in great detail the physical exertion of players in practice to minimize injuries, keep players healthy, and keep the in-game product at a high level for their fans. Even more impressive are some of the fan-experience analytics: some teams are tracking many sources of data – beyond box score, weather, and attendance data – to understand fans’ habits.

The conference highlights big wins in sports analytics. But in hallway conversations, other organizations – those not presenting – were quick to confide they felt way behind. And that’s not really a surprise: while sports analytics has been a hot topic for nearly a decade, the business only began to boom in the past five years. A team on the leading edge five years ago can easily maintain its competitive advantage, which leaves teams late to the game playing major catch-up.

"Today, you will need to understand this, whether you are the first or last on the team" – Luis Scola on analytics #SSAC17 #futureofbball

— Sloan Sports Conf. (@SloanSportsConf) March 4, 2017

There’s an overused quote: “sports are a microcosm of life”. The phrase also applies to analytics. In any industry it’s easy to believe that all organizations – especially your competitors – have a strong analytics culture and are building an ROI analytics behemoth. It’s just not the case. The reality is that only a handful of organizations – paraphrasing, but I don’t think it’s out of line – have “figured analytics out”.

 

A culture of analytics goes beyond the team (on the field/court).

Let me re-iterate my earlier point: it’s important to use analytics to improve player performance and the in-game product, but the culture begins with leaders who drive business decisions with data. There are team executives across multiple sports who drive success using both their expertise of the game and advanced analytics. But there are also leaders who are not convinced by analytics and believe intuition and the eye-test are still the true way to gain a competitive advantage. That just isn’t true: research has largely shown that leaders who leverage both data and their expertise produce the best outcomes in any industry.

Building an analytics culture also requires capable employees who are domain experts, are comfortable with advancing technologies, can code, and can distill all the data into clear stories for the organization to use. People with all of these skills are nearly impossible to find in any industry – that’s why most executives call them unicorns.

Be able to program https://t.co/CDQaQwNCcG

— Daryl Morey (@dmorey) March 3, 2017

That’s the thing: finding the right people for a team is hard. Sometimes the right move is promoting an analytics-savvy individual from within the organization; sometimes it’s hiring a fresh-out-of-school MBA graduate.

Both can backfire. The person promoted internally might be a strong communicator with the business but lack the expertise to assign appropriate probabilities to leads, leading to wasted effort by ticket reps. And that recent MBA grad might know how to build the best models for assigning probabilities to leads but might not know how to talk to the business, leading to the exact same problem.

I’m not saying I have a solution for this. But it’s clear that every organization faces the same problem – and not just in the sports world, but in every industry, everywhere. It’s clear that sports teams are trending towards the integrated use of analytics across all lines of business, but because talent is hard to find, there have been growing pains in some markets.

 

Make sure a story can be told from the data.

In every session I heard the same thing – directly and indirectly: analytics professionals need the capacity for data-based storytelling. And those stories need to be actionable. In a session with basketball professionals, coach Vinny Del Negro said he receives a 40-page analytics binder before every game. That’s 40 pages of information 82 times a season – roughly 3,280 pages coach Del Negro has to absorb to develop in-game strategies for a dozen players. It’s too much information to digest, so he often just scans it for a handful of key takeaways he can bring to his team.

"There are two types of coaches: those that embrace analytics and those that are unemployed" @adirshiffman #SSAC17 #sustaininggreatness

— Sloan Sports Conf. (@SloanSportsConf) March 4, 2017

When it comes to the players, women’s professional Sue Bird tries to put aside analytics about herself to focus on her teammates’ strengths. That message carried from session to session: most pros already know their own tendencies, but they are less familiar with their teammates and the competition – and that’s where analytics is most useful on-court. With digestible bites of data she makes in-game decisions about which open teammate should get a pass based on situational statistics for each player.

Regardless of background – in pro sports or in another industry – it’s clear that analytics professionals have to be able to take complicated ideas and bake them into a 3-5 point story that has clear meaning and is easily executable. Without this skillset analytics is practically fruitless. With it, teams can develop improved efficiencies in-game or in-business.

Recent NFL History

Please be patient: loading every play from every game since 2002. Trust me, it’s worth the wait!

NFL Drive Effectiveness 2016

13 Tips for Growing Data Science Teams

I recently presented to a local organization on setting up and scaling data science teams. We had a thoughtful one-hour conversation about the pseudo-prerequisites to accelerating the adoption of data science within an organization. Our conversation kept coming back to the same point: data scientists are a magnifying glass for an organization – whether it has a strong analytic culture or disparate systems, the current state will only be amplified by data scientists. But when data scientists unearth issues – in the data, in the infrastructure, or even in the culture – it’s an opportunity to improve organizational productivity and culture. Now for some subtext:

In October of 2012, Harvard Business Review dubbed the data scientist the sexiest job of the 21st century. Four years later, the demand remains immense for individuals who can write code to sift through both known-critical and seemingly superfluous data, correctly develop and apply statistical models, and effectively communicate key insights to improve organizational outcomes. In fact, as success stories continue to pour into the media, organizations – many you are very familiar with, but which shall remain nameless – are hastily moving to add individuals with these skillsets to their teams. With big data and analytics raising the eyebrows of most c-suite executives, business units are gearing up for the deep dive into the data science waters. While there are plenty of insights regarding the ideal skillset of a data scientist – hint: it’s the soft skills – very few mention the prerequisites for scaling data science teams. This begs the question: if I am building a data science team, what should I do ASAP to make sure my team is successful? What follows are some of the overlooked aspects of building data science teams that my peers and I have personally experienced – some are even blockers for today’s most highly effective data science teams.

1. Access to data.

This shouldn’t come as a surprise: data scientists work with data, preferably in the rawest form it’s collected. Granting access to data might seem straightforward, but some organizations will limit data scientists to data marts or cubes (pre-aggregated data). With most data scientists spending a big chunk of their time playing data “Go Fish”, making data available lets them focus on what they do best.

2. Appropriate governance.

With freedom comes responsibility. Providing appropriate access to the individuals who likely have the highest analytical ceilings in the organization also means having strong data governance practices in place. Organizations should be thoughtful about the risk and reward of limiting access. Applying some basic data governance practices will allow data scientists to understand the data, the data flow, and data quality issues.

3. Provide an analytic sandbox.

Data scientists are constantly building models and structuring data for analysis – it’s 80% of the job. Creating a space for teams to build models and share data, so that processes do not have to be re-run from scratch each time, will save valuable business time.

4. Leverage data visualization tools.

This, I believe, is a big one. First, data science is more than just model building; it’s about collecting critical insights about an organization and effectively communicating them. One of the best ways a data scientist can do this is by building tools that allow key stakeholders to see and interact with the story the data tells. Second, good data tools also build a culture of data fluency by allowing the business to interact with and understand the who, what, and where of the business. They allow an organization to take action. The tools build trust around using data to improve the business and getting to the when, why, and how.

5. Keep your team structures flexible.

If you are introducing the concept of data science teams to your organization for the first time you’ll probably spend hundreds of hours conceptualizing the ideal organizational structure. That’s fine, but just know that a team’s structure is likely to evolve in the first few years depending on the people and existing processes and structures you already have in place. In a nutshell, put more of a premium on collaboration over team structure.

6. Integrate data scientists with existing business units.

It’s great when data scientists can collaborate with each other. It’s just as important that data scientists work with business units to better understand and solve problems. This elevates data scientists from reactive stats-gatherers to proactive partners.

7. Be sure your data science sponsors are 100% committed to supporting the team.

Data science return on investment can be accelerated – or limited – by leader support. These individuals are evangelists and enforcers for data science work across the organization. In some cases there are leaders who will leverage the idea of data scientists – not actually implement work, but hype the potential work – just to move up the organizational ranks, which hinders the effectiveness of data scientists and the organization.

8. Hire for diverse skillsets.

Prioritize culture and communication, but also make sure that your data scientists bring varying analytic skillsets. If you are trying to build a robust data science team, it should not consist only of data scientists. Consider supporting roles that might include a project manager, business analysts, data visualization specialists, a UX designer/front-end developer, and/or ETL developers.

9. Develop internal team skillsets.

This should apply to all individuals in an organization because skillsets should be constantly evolving to match the tools and technologies that improve the business; consider embracing open-source technologies and cloud-based tools.

10. Be willing to fail.

Experimentation is a cornerstone of data science. It allows the business to learn what not to do. Experimentation shouldn’t happen on big parts of the business; it should happen in small, controllable chunks. This is the heart of data science, and many organizations might be afraid that they are going to lose. In reality they are going to learn quickly about what does and doesn’t grow their business, and they should adopt the fail-fast mantra.

11. Iterate quickly.

This is another cornerstone of data science. Data science projects shouldn’t take months to spin out initial insights. The team should learn what works and what doesn’t from the data using short cycles, focusing on quick wins for the organization. If something works, then move it to a larger production environment.

12. Be comfortable with data not being perfect.

Because you are iterating quickly, the data might not be perfect. Data science is about action – avoid slowing down to do significant QA on data, extensive data collection, and/or complex modeling – at least to start.

13. Plan now for scaling up.

While you should start small and get some quick wins, start thinking about how you will want to scale and automate insights from the data science practice.

To wrap things up: highly effective data science teams don’t appear overnight. Building a successful team means developing people, culture, processes, science, and technologies. There are a number of things that can be done to get there – many are rarely outlined for an organization, so I’ve tried to highlight some of them above. Most of these points are just good business practices – practices that data scientists tend to highlight.

NFL Analytical MVP 2016