Luke Stanke

Data Science – Analytics – Psychometrics – Applied Statistics

Creating orthographic maps (globe-like maps) in Tableau is easier than it seems.

I recently put together an orthographic (globe-like) map of eclipse paths in Tableau – a visualization that at first seems impossible to create. My inspiration was the dynamic globe built in D3.js by the Washington Post. Building this visualization may look difficult, but I want to show that this is a myth: building almost anything in Tableau isn't that hard when you break it down step-by-step. This post walks through the process of building the visualization.

Step 1: Get eclipse data.

I found eclipse data here for total and hybrid solar eclipses from 1961 through 2100. The data are in the .kmz format – for Google Earth. The bad news: Tableau won't read multiple .kmz files, so I needed to extract the data myself. The good news: it's easy to convert the files into something usable. If I change the extension from file_name.kmz to file_name.kmz.zip and then unzip the file, I'll find a .kml file inside the unzipped folder.
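If you'd rather script that rename-and-unzip step than do it by hand, here's a minimal R sketch (the download folder path is just an example):

# Rename each .kmz to .kmz.zip, then unzip to expose the .kml inside.
kmz_files <- list.files("~/Downloads", pattern = "\\.kmz$", full.names = TRUE)

for (kmz in kmz_files) {
  zip_path <- paste0(kmz, ".zip")                        # file_name.kmz -> file_name.kmz.zip
  file.rename(kmz, zip_path)
  unzip(zip_path, exdir = sub("\\.zip$", "", zip_path))  # the unzipped folder holds the .kml
}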


More bad news: if I take a look at the file_name.kml file, I’ll see a file structure that at first seems very difficult to understand.

Step 2: Extract eclipse data.

Good news: no one should be intimidated by KML. After a few minutes I recognized that all the data were contained under the “<Document><Folder>” path. Each folder is a different layer that could be mapped. For my visualization I was interested in the Umbral Path – the path of totality – which is identified by the <name> tag. Within the <Folder> structure there are three different <Placemark> tags: Northern Limit, Center Line, and Southern Limit. Within each <Placemark> tag are <coordinates> that I could extract.

This is just a long-winded way of saying there is a consistent path we can follow to extract the coordinates. I prefer R, so here is a little function I wrote to pull the data. It's a long function, but it's commented so you can see what each step does. It's also about half of all the code needed to create the file.

# Load the packages used throughout (magrittr for %>%/%<>%, dplyr for data wrangling).
library(magrittr)
library(dplyr)

# Build a function to extract coordinates.
# Input the unzipped .kmz folder location into the function.
extract_data <- function(loc = "/Users/luke.stanke/Downloads//TSE_2038_12_26.kmz") {

# Find kml file.
  loc <-
    loc %>% 
    list.files() %>%
    .[!stringr::str_detect(., "files")] %>%
    paste(loc, ., sep = "/")

# Read .kml.  
  shape <- xml2::read_html(loc)
  

# Get data for the Umbral Path folder.
  valid <-
    shape %>%
    rvest::html_nodes("folder") %>%
    stringr::str_detect(., "Umbral Path")
  
# Extract coordinates.
  shape_file <-
    shape %>%
    rvest::html_nodes("folder") %>%
    .[valid] %>%
    rvest::html_nodes("placemark linestring coordinates") %>% 
    .[1] %>%
    rvest::html_text() %>%
    stringr::str_replace_all("\r", "") %>%
    stringr::str_trim() %>%
    stringr::str_split(" ") %>%
    unlist() %>%
    data_frame(
      coords = .
    ) 
 
# Pull data for the Southern Limit. 
  name <-
    shape %>%
    rvest::html_nodes("folder") %>%
    .[valid] %>%
    rvest::html_nodes("placemark name") %>%
    .[1] %>%
    rvest::html_text()
  
# Build the data frame.
  shape_file <-
    strsplit(shape_file$coords, ",") %>%
    unlist() %>%
    matrix(ncol = 3, byrow = TRUE) %>%
    as_data_frame %>%
    mutate(
      V1 = as.numeric(V1),
      V2 = as.numeric(V2),
      V3 = as.numeric(V3),
      name = name
    ) %>%
    rename(
      latitude = V1,
      longitude = V2,
      altitude = V3
    ) %>%
    tibble::rownames_to_column("index") 
  
# Repeat for Central Line.
  if(shape %>%
    rvest::html_nodes("folder") %>%
    .[valid] %>%
    rvest::html_nodes("placemark linestring coordinates") %>%
    length() %>%
    `>=`(2)) {
    
    shape_file2 <-
      shape %>%
      rvest::html_nodes("folder") %>%
      .[valid] %>%
      rvest::html_nodes("placemark linestring coordinates") %>% 
      .[2] %>%
      rvest::html_text() %>%
      stringr::str_replace_all("\r", "") %>%
      stringr::str_trim() %>%
      stringr::str_split(" ") %>%
      unlist() %>%
      data_frame(
        coords = .
      ) 
    
    name <-
      shape %>%
      rvest::html_nodes("folder") %>%
      .[valid] %>%
      rvest::html_nodes("placemark name") %>%
      .[2] %>%
      rvest::html_text()
    
    shape_file2 <-
      strsplit(shape_file2$coords, ",") %>%
      unlist() %>%
      matrix(ncol = 3, byrow = TRUE) %>%
      as_data_frame %>%
      mutate(
        V1 = as.numeric(V1),
        V2 = as.numeric(V2),
        V3 = as.numeric(V3),
        name = name
      ) %>%
      rename(
        latitude = V1,
        longitude = V2,
        altitude = V3
      ) %>%
      tibble::rownames_to_column("index")
    
    shape_file %<>%
      bind_rows(shape_file2)
    
    
  }
  
# Repeat for Northern Limit.
  if(shape %>%
     rvest::html_nodes("folder") %>%
     .[valid] %>%
     rvest::html_nodes("placemark linestring coordinates") %>%
     length() %>%
     `>=`(3)) {
    
    shape_file2 <-
      shape %>%
      rvest::html_nodes("folder") %>%
      .[valid] %>%
      rvest::html_nodes("placemark linestring coordinates") %>% 
      .[3] %>%
      rvest::html_text() %>%
      stringr::str_replace_all("\r", "") %>%
      stringr::str_trim() %>%
      stringr::str_split(" ") %>%
      unlist() %>%
      data_frame(
        coords = .
      ) 
    
    name <-
      shape %>%
      rvest::html_nodes("folder") %>%
      .[valid] %>%
      rvest::html_nodes("placemark name") %>%
      .[3] %>%
      rvest::html_text()
    
    shape_file2 <-
      strsplit(shape_file2$coords, ",") %>%
      unlist() %>%
      matrix(ncol = 3, byrow = TRUE) %>%
      as_data_frame %>%
      mutate(
        V1 = as.numeric(V1),
        V2 = as.numeric(V2),
        V3 = as.numeric(V3),
        name = name
      ) %>%
      rename(
        latitude = V1,
        longitude = V2,
        altitude = V3
      ) %>%
      tibble::rownames_to_column("index")
    
    shape_file %<>%
      bind_rows(shape_file2)
    
    
  }  

# Combine everything together.

  
  shape_file %>%
    rename(longitude = latitude, latitude = longitude) %>%
    mutate(index = as.numeric(index)) %>%
    filter(name != "Central Line") %>% # Chose to drop Central line later.
    mutate(
      index = ifelse(name == "Northern Limit", -index, index),
      Date = loc %>%
        stringr::str_split("TSE_") %>% 
        .[[1]] %>%
        .[2] %>%
        stringr::str_replace_all(".kml", "") %>%
        stringr::str_replace_all("_", "-") %>%
        lubridate::ymd()
    ) %>%
    arrange(index)  
}

With this function we can quickly build a single data frame with all of the eclipse data.

# Search for all TSE files in the Downloads folder.
key_dirs <-
  list.dirs("/Users/luke.stanke/Downloads/") %>%
  .[stringr::str_detect(., "TSE_")] %>%
  .[!stringr::str_detect(., "files")]

# Run a for loop to create data.
for(i in 1:length(key_dirs)) {
  if(i == 1) {
    eclipse <- extract_data(key_dirs[i])
  } else {
    eclipse %<>%
      bind_rows(
        extract_data(key_dirs[i])
      )
  }
}

This produces a data frame with latitude, longitude, altitude (just 1), index (for the path of the polygons) and date of eclipse.

Step 3: Build Projections.

If we quickly visualize this here’s what we’ll have.

This is nice, but I really want it projected to look like a globe – an orthographic projection. To do this I need to create the math for the projection. We're going to create three calculations in Tableau: [x], [y], and a filter called [valid]. First, though, we'll create two parameters, a base latitude and a base longitude. These will be our reference point and will let us move around the globe.

The [Base Longitude] parameter is set to a range of -180 to 180 with steps of 15 degrees. This will eventually rotate the globe clockwise or counter-clockwise.

The [Base Latitude] parameter is set to a range of -90 to 90 with steps of 15 degrees. This will tilt the globe up or down towards the poles.

With these, let's build our [x], [y], and [valid] calculations for the globe projection (we're going to edit these later). [x] and [y] identify the new coordinates, while [valid] works as a filter so we only show points on the visible, outward-facing side of the projection.
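Before writing these in Tableau, it can help to prototype the same orthographic math in R as a sanity check. Here's a small sketch that mirrors the three calculations below (base_lat and base_long stand in for the two parameters):

# Orthographic projection of latitude/longitude (in degrees) around a reference point.
# Mirrors the Tableau [x], [y], and [valid] calculations that follow.
orthographic <- function(lat, long, base_lat = 0, base_long = 0) {
  to_rad <- pi / 180
  phi  <- lat * to_rad
  lam  <- long * to_rad
  phi0 <- base_lat * to_rad
  lam0 <- base_long * to_rad

  data.frame(
    x = cos(phi) * sin(lam - lam0),
    y = cos(phi0) * sin(phi) - sin(phi0) * cos(phi) * cos(lam - lam0),
    # Cosine of the angular distance from the reference point;
    # a point is on the visible hemisphere when this is >= 0.
    valid = sin(phi0) * sin(phi) + cos(phi0) * cos(phi) * cos(lam - lam0)
  )
}

# e.g. orthographic(eclipse$latitude, eclipse$longitude, base_lat = 40, base_long = -100)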

// In Tableau: Create X coordinates.
COS(RADIANS([Latitude])) * SIN(RADIANS([Longitude]) - RADIANS([Base Longitude]))
// In Tableau: Create Y coordinates.
(
  COS(RADIANS([Base Latitude])) * 
  SIN(RADIANS([Latitude]))
) - (
  SIN(RADIANS([Base Latitude])) * 
  COS(RADIANS([Latitude])) * 
  COS(RADIANS([Longitude]) - RADIANS([Base Longitude]))
)
// In Tableau: Valid filter.
(
  SIN(RADIANS([Base Latitude])) *
  SIN(RADIANS([Latitude]))
) + (
  COS(RADIANS([Base Latitude])) *
  COS(RADIANS([Latitude])) *
  COS(RADIANS([Longitude]) - RADIANS([Base Longitude]))
)

Here is what I should see if I add [x] and [y] to the columns and rows, add date to color and index to path, and add [valid] to filter and set the range from 0 to 1.

Boom! It's the eclipses projected onto a sphere. Once again, bad news: we don't yet have the earth behind the eclipse paths. But we can find this data and append it to the existing dataset.

Step 4: Download and extract shapefiles of countries.

In my opinion the go-to file is “ne_10m_admin_0_countries_lakes.shp” from Natural Earth: country boundaries, with lakes removed, at 1:10 million scale. It's far more detailed than we probably need, but it's nice. Just give it a search on the web and download the file.

Data are easily read into R with the st_read() function and polygon data can be extracted with the st_geometry() function.

When taking a look at the data at this point, I notice the geometries are MULTIPOLYGON. The best way to describe this is that each country is (potentially) made up of multiple polygons, so each country has a set of polygons nested within it. In some cases these polygons are nested multiple levels deep; a little investigation told me there were three levels. Being the creative type, I named these levels “set”, “subset”, and “subsubset”.
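As an aside, sf can also unnest MULTIPOLYGON geometries directly: st_coordinates() returns the X/Y values plus L1/L2/L3 indices that roughly correspond to the three nesting levels described above. A quick sketch of that alternative, once the shapefile is read in as land below (this post sticks with the manual loop instead):

# Alternative sketch: let sf unnest the MULTIPOLYGON levels for us.
# The L1/L2/L3 columns roughly play the role of subsubset/subset/set.
coords <- sf::st_coordinates(land)
head(coords)   # columns: X, Y, L1, L2, L3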

# Read in land data.
library(sf)
land <- st_read("/Users/luke.stanke/Downloads/ne_10m_admin_0_countries_lakes/ne_10m_admin_0_countries_lakes.shp")

# Get the underlying attribute data from the shapefile before dropping to geometry.
# ADMIN, ABBREV are labels.
# scalerank, LABELRANK, TINY are admin levels I considered as filters.
land_data2 <- 
  land %>% 
  as.data.frame() %>%
  select(scalerank, LABELRANK, ADMIN, ABBREV, TINY) %>%
  as_data_frame()

# Obtain underlying geometry/shapes.
land <- st_geometry(land)


# Create empty data frame to load data into.
land_data <-
  data_frame(
    longitude = double(),
    latitude = double(),
    set = character(),      # Data are nested.
    subset = character(),   # These are nesting
    subsubset = character() # identifiers.
  )

From there it's a matter of looping through the layers to extract the data.

# sort through highest level (countries).
for(i in 1:length(land)) {
  level_1 <-
    land %>%
    .[[i]] %>%
    stringr::str_split("MULTIPOLYGON \\(")
  
  # Sort through second level. (subset)
  for(j in 1:length(level_1)) {
    level_2 <-
      level_1 %>%
      .[[j]] %>%
      stringr::str_split("list\\(") %>%
      .[[1]] %>%
      .[. != ""] %>%
      stringr::str_split("c\\(") %>%
      .[[1]] %>%
      .[. != ""]

# Sort through third level get data and add to dataframe. (subsubset)    
    for(k in 1:length(level_2)) {
      level_2 %>%
        .[k] %>%
        stringr::str_replace_all("\\)", "") %>%
        stringr::str_split(",") %>%
        .[[1]] %>%
        stringr::str_trim() %>%
        as.numeric() %>%
        .[!is.na(.)] %>%
        matrix(ncol = 2) %>%
        as_data_frame() %>%
        rename(
          longitude = V1,
          latitude = V2
        ) %>%
        mutate(
          set = i,
          subset = j,
          subsubset = k
        ) %>%
        bind_rows(land_data) ->
        land_data
    }
  }
}

# Sort data by set, subset, subsubset.    
land_data %<>%
  arrange(set, subset, subsubset)

# Join underlying data.    
land_data %<>%
  mutate(set = as.character(set)) %>%
  left_join(
    as_data_frame(land_data2) %>%
    tibble::rownames_to_column("set")
  )

land_data %<>%
  mutate(
    name = "Land",
    altitude = 1
  ) %>%
  group_by(set, subset) %>%
  mutate(
    index = row_number()
  )

This produces a dataset with columns of latitude, longitude, set, subset, and subsubset. Now all we need to do is join this data together.

eclipse %<>%
  mutate(altitude = 1.01) %>%
  bind_rows(land_data)

# Write file. 
eclipse %>%
  readr::write_csv("eclipse.csv")    

If I go back to Tableau and refresh my data, add [set], [subset], and [subsubset] to detail, I now have countries.

To place countries in the right order we need to manually sort the [set] dimension, placing “NA” at the top of the sort.

Step 5: Edit functions to incorporate the great circle (globe background).

Things are starting to look good. With a little formatting I could even be done here. But what I want is a background of ocean – a great circle – representing the boundary of the earth. I'm going to use the following R code to build my great circle:

# Build great circle. 
outline <-
  data_frame(
    index = seq(0, 360, 1/3), # Create a data point every 1/3 of a degree.
    altitude = 1,
    set = "Outline"
  ) %>%
  mutate(
    latitude = altitude * cos(2*pi*(index/360)),
    longitude = altitude * sin(2*pi*(index/360)),
    index = index*3
  )

This produces a dataset with five columns: [index], [altitude], [set], [latitude], and [longitude]. Once we have this dataset all we need to do is bind the data to our current eclipse data in R.

# Write file. 
eclipse %>%
  bind_rows(outline) %>%
  readr::write_csv("eclipse.csv")    

Step 6: Incorporate great circle in Tableau.

So here's the thing. For the outline data in R, I already generated latitude and longitude in projected (screen) coordinates. If Tableau applied the projection calculations to them again – which would happen if I left the calculations unchanged – I wouldn't get the values I want, because the projection would have been applied twice. So I don't want Tableau to apply the [x], [y], and [valid] functions when [set] is equal to “Outline”. You're probably wondering why I did it this way. One option would be to generate a full grid of points (360×3 by 180×3) so I wouldn't have to touch my current functions, but that would bring in far more data than necessary. Because I'm bringing in only the outline, and I don't want the outline to rotate as I change the projection, I need to edit my functions to handle this particular case.

Here is what the [x], [y], and [valid] functions should now look like:

// In Tableau: Edit X coordinates.
IF [set] = "Outline"
THEN [Longitude]
ELSE 
    COS(RADIANS([Latitude])) * 
    SIN(RADIANS([Longitude]) - 
    RADIANS([Base Longitude]))
END
// In Tableau: Edit Y coordinates.
//COS(RADIANS([Latitude])) * SIN(RADIANS([Longitude]))
IF [set] = "Outline"
THEN [Latitude]
ELSE
  (
    COS(RADIANS([Base Latitude])) * 
    SIN(RADIANS([Latitude]))
  ) - (
    SIN(RADIANS([Base Latitude])) * 
    COS(RADIANS([Latitude])) * 
    COS(RADIANS([Longitude]) - RADIANS([Base Longitude])))
END
// In Tableau: Edit valid coordinates.
IF [set] = "Outline"
THEN [Latitude]
ELSE
  (
    SIN(RADIANS([Base Latitude])) *
    SIN(RADIANS([Latitude]))
  ) + (
    COS(RADIANS([Base Latitude])) *
    COS(RADIANS([Latitude])) *
    COS(RADIANS([Longitude]) - RADIANS([Base Longitude]))
  )
END

From the re-worked functions above you should see that we are simply skipping the projection for the great circle – which is exactly what we wanted. When we take a look at our visualization, we have the following:

There is no issue with the visualization – the countries are just the same color as the background. Let's create a crude color palette to color everything the way we would like:

// In Tableau: Create [color] dimension.
IF [set] = "Outline"
THEN "Background"
ELSEIF [ADMIN] != "NA"
THEN "Land"
ELSE STR(-ROUND([Years]/10)*10) // Group eclipses by 10 year occurrences.
END

Dropping this new color calculation onto the color mark should produce something like this:

From there a little customization of the color can produce the following:

If I wanted to show the outline of the countries and the eclipse polygons all I need to do is add a border to the color.

Step 7: Create lines of latitude and longitude to add to the map.

Okay, this is the hardest part of the map. Up until now it was basically: get the data, create projections, edit data, edit projections, re-edit data, re-edit projections. Now I want to add grid lines for latitude and longitude. Here's the thing: our current visualization is made up of polygons, and I can't mix and match mark types to have both lines and polygons. That means I need to turn the lines into polygons: draw each latitude and longitude line, then trace the same path backwards until I'm back where the line started. The result is a polygon with no surface between its edges.

Here’s the R code to create the latitude and longitude “polygons”:

# Build latitude/longitude polygons.
# Build latitude.
latitude <-
  c(seq(0, 90, 20), -seq(0, 90, 20)) %>%
  sort() %>%
  unique() %>%
  rep(each = 360)

longitude_ <-
  -179:180 %>%
  rep(length(unique(latitude)))

latitude <-
  data_frame(
    latitude = latitude,
    longitude = longitude_,
    altitude = 1,
    index = longitude_,
    set = "latitude",
    subset = latitude
  )

latitude %<>%
  bind_rows(
    latitude %>% 
      mutate(index = 361 - index)
  )

# Build longitude.
longitude <-
  seq(-180, 179, 20) %>%
  rep(each = length(seq(-90, 90, 1)))

latitude_ <-
  seq(-90,90, 1) %>%
  rep(length(seq(-180, 179, 20)))

longitude <-
  data_frame(
    longitude = longitude,
    latitude = latitude_,
    altitude = 1,
    index = latitude_,
    set = "longitude",
    subset = longitude
  )

longitude %<>%
  bind_rows(
    longitude %>%
      mutate(
        index = 361 - index
      )
  )

# Output FINAL .csv
eclipse %>%
  bind_rows(latitude) %>%
  bind_rows(longitude) %>%
  bind_rows(outline) %>%
  readr::write_csv("eclipse.csv")

This is the last time I'll use R to export data. Now if we refresh the dataset in Tableau, we'll see the following:

Step 8: Fix the latitude lines.

Everything looks nearly correct except for the lines connecting the ends of the latitude lines. If you rotate the map you'll see that sometimes they show up and other times they don't. The short explanation is that they show up when the start of a latitude line sits in the currently visible space: when the valid filter clips a line, it no longer loops back on itself (remember, each latitude polygon is designed to go all the way around and then trace back to its start). The fix is simple: adjust the longitude so that each latitude line starts on the far side of the currently visible earth. This allows for correct clipping.

I'll create a new calculation called [Long Adj] to represent the adjusted longitude:

// In Tableau: Create [long adj] measure.
IF CONTAINS([set], "latitude")
THEN [Longitude] + [Base Longitude]
ELSE [Longitude]
END

With this new longitude calculation, we then need to edit the [x], [y], and [valid] functions one last time.

// In Tableau: Edit X coordinates.
IF [set] = "Outline"
THEN [Longitude]
ELSE 
    COS(RADIANS([Latitude])) * 
    SIN(RADIANS([Long Adj]) - 
    RADIANS([Base Longitude]))
END
// In Tableau: Edit Y coordinates.
//COS(RADIANS([Latitude])) * SIN(RADIANS([Long Adj]))
IF [set] = "Outline"
THEN [Latitude]
ELSE
  (
    COS(RADIANS([Base Latitude])) * 
    SIN(RADIANS([Latitude]))
  ) - (
    SIN(RADIANS([Base Latitude])) * 
    COS(RADIANS([Latitude])) * 
    COS(RADIANS([Long Adj]) - RADIANS([Base Longitude])))
END
// In Tableau: Edit valid coordinates.
IF [set] = "Outline"
THEN [Latitude]
ELSE
  (
    SIN(RADIANS([Base Latitude])) *
    SIN(RADIANS([Latitude]))
  ) + (
    COS(RADIANS([Base Latitude])) *
    COS(RADIANS([Latitude])) *
    COS(RADIANS([Long Adj]) - RADIANS([Base Longitude]))
  )
END

These updates produce the following visualization after you re-sort the [set] values of “latitude” and “longitude” to sit just below “NA”:

Step 9: Add borders to countries and keep latitude/longitude lines.

Now this looks really good. The only thing left is that I want the eclipse polygons to have no outlines, while every country keeps its border and the latitude/longitude grid lines still show.

This requires me to create two x calculations and then create a dual axis chart.

My new x calculations – [x1] and [x2] – are:

// In Tableau: Create [x1] calculation for everything but eclipses.
IF ISNULL([Date])
THEN [x]
END
// In Tableau: Create [x2] calculation for only eclipses.
IF NOT ISNULL([Date])
THEN [x]
END

I can now replace [x] with [x1] on the columns shelf and add [x2] to the columns shelf. I'll also edit the borders on [x2] and the opacity of the color.

Step 10: Clean things up.

From here all I need to do is create the dual axis chart, synchronize the axes, do some formatting, and remove tooltips.

And that's it. Is there a lot of custom coding? Yes – to get the eclipse data, plus a little more to get the multipolygons for each country. But now I have country shapes for future maps, so there is little work involved to do this again. Was it hard? No. There are only a few calculations to create in Tableau: [x] (plus [x1] and [x2]), [y], [valid], [long adj], and the optional [color]. There are also two parameters that rotate the map: [Base Latitude] and [Base Longitude]. Given the handful of custom components, I'd call this one easy but time intensive (mostly the part where I wrote the blog).

Sports Analytics Mirrors Real Life: My Big Takeaways from SSAC17

Last weekend I finished up my first Sloan Sports Analytics Conference. For many in the analytics world, attending means running into big-name sports analytics figures: Bill James, Jessica Gelman, Nate Silver, Bill Barnwell, and Daryl Morey, to name a few. And while the conference boasts big names that packed very large auditoriums, there were some excellent smaller sessions that anyone in attendance could have taken a lot from. Here are a few of my bigger takeaways:

 

Perception doesn’t always meet reality.

In any given sport there are two or three organizations on the leading edge of innovation. These organizations are consistent performers and are constantly seeking competitive advantages in-game and in-business. Teams now track, in great detail, the physical exertion of players in practice to minimize injuries, keep players healthy, and keep the in-game product at a high level for their fans. Even more impressive are some of the fan-experience analytics: some teams track many sources of data – beyond box score, weather, and attendance data – to understand fans' habits.

The conference highlights big wins in sports analytics. But in hallway conversations, other organizations – those not presenting – were quick to confide they felt way behind. And that's not really a surprise: while sports analytics has been a hot topic for nearly a decade, the business only began to boom in the past five years. A team on the leading edge five years ago can easily maintain its competitive advantage, which leaves teams late to the game playing major catch-up.

"Today, you will need to understand this, whether you are the first or last on the team" – Luis Scola on analytics #SSAC17 #futureofbball

— Sloan Sports Conf. (@SloanSportsConf) March 4, 2017

There's an overused quote: “sports are a microcosm of life”. The phrase also applies to analytics. In any industry it's easy to believe that all organizations – especially your competitors – have a strong analytics culture and are building an ROI analytics behemoth. It's just not the case. The reality is that only a handful of organizations – paraphrasing here, but I don't think it's out of line – have “figured analytics out”.

 

A culture of analytics goes beyond the team (on the field/court).

Let me reiterate an earlier point: it's important to use analytics to improve player performance and the in-game product, but the culture begins with leaders who drive business decisions with data. There are team executives across multiple sports who drive success using both their expertise of the game and advanced analytics. But there are also leaders who are not convinced of analytics and believe intuition and the eye test are still the true way to gain competitive advantage. This just isn't true: research has largely shown that leaders leveraging both data and their expertise produce the best outcomes in any industry.

Building an analytics culture also requires capable employees who are domain experts, have capacity with advancing technologies, can code, and can distill all the data into clear stories for the organization to use. People with these skills are nearly impossible to find in any industry – that's why most executives call them unicorns.

Be able to program https://t.co/CDQaQwNCcG

— Daryl Morey (@dmorey) March 3, 2017

That’s the thing, finding the right people for a team is hard. Sometimes the right move is promoting an analytics savvy individual from within the organization; sometimes it’s hiring a fresh-out-of-school MBA graduate.

Both can backfire. The person promoted internally might be a strong communicator with the business but lack the expertise to assign appropriate probabilities to leads, leading to wasted effort by ticket reps. And the recent MBA grad might know how to build the best models to assign probabilities to leads but might not know how to talk to the business, leading to the exact same problem.

I'm not saying I have a solution for this. But it's clear that every organization faces the same problem – and not just in the sports world, but in every industry, everywhere. It's also clear that sports teams are trending towards the integrated use of analytics across all lines of business, but because talent is hard to find there have been growing pains in some markets.

 

Make sure a story can be told from the data.

In every session I heard the same thing, directly and indirectly: analytics professionals need the capacity for data-based storytelling, and those stories need to be actionable. In a session with basketball professionals, coach Vinny Del Negro said he receives a 40-page analytics binder before every game. That's 40 pages of information 82 times a season, meaning coach Del Negro has to absorb 3,280 pages to develop in-game strategies for a dozen players. It's too much information to digest, so he often just scans it for a handful of key takeaways he can bring to his team.

"There are two types of coaches: those that embrace analytics and those that are unemployed" @adirshiffman #SSAC17 #sustaininggreatness

— Sloan Sports Conf. (@SloanSportsConf) March 4, 2017

When it comes to the players, women's professional basketball player Sue Bird tries to put aside analytics about herself and focus on her teammates' strengths. That message carried from session to session: most pros already know their own tendencies, but they are less familiar with their teammates and the competition – and that's where analytics is most useful on-court. With digestible bites of data, she makes in-game decisions about which open teammate should get a pass based on situational statistics for each player.

Regardless of background – in pro sports or in another industry – it's clear that analytics professionals have to be able to take complicated ideas and bake them into a three-to-five-point story that has clear meaning and is easily executable. Without this skillset, analytics is practically fruitless. With it, teams can develop improved efficiencies in-game and in-business.

Recent NFL History

Please be patient: this loads every play from every game since 2002. Trust me, it's worth the wait!

13 Tips for Growing Data Science Teams

I recently presented to a local organization on setting up and scaling data science teams. We had a thoughtful one-hour conversation about the pseudo-prerequisites for accelerating the adoption of data science within an organization. Our conversation kept coming back to the same point: data scientists are a magnifying glass for an organization. Whether an organization has a strong analytic culture or its systems are disparate, the current state will only be amplified by data scientists. But when data scientists unearth issues – in the data, in the infrastructure, or even in the culture – it's an opportunity to improve organizational productivity and culture. Now for some subtext:

In October of 2012, Harvard Business Review dubbed the data scientist the sexiest job of the 21st century. Four years later, demand remains immense for individuals who can write code to sift through both known-critical and seemingly superfluous data, correctly develop and apply statistical models, and effectively communicate key insights to improve organizational outcomes. In fact, as success stories continue to pour into the media, organizations – many you are very familiar with, but which shall remain nameless – are hastily moving to add individuals with these skillsets to their teams. With big data and analytics raising the eyebrows of most c-suite executives, business units are preparing for the deep dive into the data science waters. While there are plenty of insights about the ideal skillset of a data scientist – hint: it's the soft skills – very few mention the prerequisites for scaling data science teams. This raises the question: if I am building a data science team, what should I do ASAP to make sure my team is successful? What follows are some of the overlooked aspects of building data science teams that my peers and I have personally experienced – some are blockers even for today's most highly effective data science teams.

1. Access to data.

This shouldn't come as a surprise, but data scientists work with data, preferably in the rawest form it's collected in. Granting access to data might seem straightforward, but some organizations limit data scientists to data marts or cubes (pre-aggregated data). With most data scientists spending a big chunk of their time playing data “Go Fish”, making data available lets them focus on what they do best.

2. Appropriate governance.

With freedom comes responsibility. Providing appropriate access to individuals who likely have the highest analytical ceilings in the organization also means having strong data governance practices in place. Organizations should be thoughtful about the risk and reward of limiting access. Applying some basic data governance practices will allow data scientists to understand the data, the data flow, and data quality issues.

3. Provide an analytic sandbox.

Data scientists are constantly building models and structuring data for analysis – it's 80% of the job. Creating a space for teams to build models and share data, so that processes do not have to be re-run from scratch each time, will save valuable business time.

4. Leverage data visualization tools.

This I believe is a big one. First, data science is more than just model building, it’s collecting critical insights about an organization and effectively communicating those insights. One of the best ways a data scientist can do this is by building tools that allow key stakeholders to see and interact with the story the data tells. Second, good data tools also build a culture of data fluency by allowing the business to interact and understand the who, what, and where of the business. They allow an organization to take action. The tools build trust around using data to improve the business and getting to the when, why and how.

5. Keep your team structures flexible.

If you are introducing the concept of data science teams to your organization for the first time you’ll probably spend hundreds of hours conceptualizing the ideal organizational structure. That’s fine, but just know that a team’s structure is likely to evolve in the first few years depending on the people and existing processes and structures you already have in place. In a nutshell, put more of a premium on collaboration over team structure.

6. Integrate data scientists with existing business units.

It's great when data scientists can collaborate with each other. It's just as important that they work with business units to better understand and solve problems. This elevates data scientists from reactive stats-gatherers to proactive partners.

7. Be sure your data science sponsors are 100% committed to supporting the team.

Data science return on investment can be accelerated – or limited – by leader support. These individuals are evangelists and enforcers for data science work across the organization. In some cases there are leaders who will leverage the idea of data scientists – not actually implement work, but hype the potential work – just to move up the organizational ranks, which hinders the effectiveness of data scientists and the organization.

8. Hire for diverse skillsets.

Prioritize culture and communication, but also make sure that your data scientists bring varying analytic skillsets. A robust data science team should not consist solely of data scientists: consider supporting roles such as a project manager, business analysts, data visualization specialists, a UX designer/front-end developer, and/or ETL developers.

9. Develop internal team skillsets.

This should apply to all individuals in an organization because skillsets should be constantly evolving to match the tools and technologies that improve the business; consider embracing open-source technologies and cloud-based tools.

10. Be willing to fail.

Experimentation is a cornerstone of data science. It allows the business to learn what not to do. Experiments shouldn't happen on big parts of the business; they should be run in small, controllable chunks. This is the heart of data science, and many organizations might be afraid they are going to lose. In reality they are going to learn quickly about what does and doesn't grow their business, and should adopt the fail-fast mantra.

11. Iterate quickly.

This is another cornerstone of data science. Data science projects shouldn't take months to spin out initial insights. The team should learn what works and what doesn't from the data using short cycles and focus on quick wins for the organization. If something works, move it to a larger production environment.

12. Be comfortable with data not being perfect.

Because you are iterating quickly data might not be perfect. Data science is about action – avoid slowing down to do significant QA on data, extensive data collection, and/or complex modeling – at least to start.

13. Plan now for scaling up.

While you should start small and get some quick wins, start thinking about how you will want to scale and automate insights from the data science practice.

To wrap things up: highly effective data science teams don't appear overnight. Building a successful team means developing people, culture, processes, science, and technologies. Many of the steps involved are rarely outlined for an organization, so I've tried to highlight some of them above. Most of these points are just good business practices – practices that data scientists happen to put a spotlight on.

Developing a Dynamic Ticket Viz

Tableau Public now supports Google Spreadsheets. At first I didn't think of this as anything special, but then I realized cloud-based spreadsheets mean dashboards that update automatically. So I set out to find a way to constantly update data in Tableau, given my limited toolset – mainly R, Tableau, and a handful of AWS tools.

The Data

I decided I wanted to monitor NFL ticket prices on an hourly basis. I'd hit an API for the 253 regular-season, United States-based games (the other 3 are in London). This is not easy, since there are no pre-built datasets: I'd need to hit a secondary ticket market API and store that data somewhere.

API Data Extraction

I decided to use secondary ticket market data from the StubHub API. To stream data, I created a cloud instance of R and RStudio on an AWS Elastic Compute Cloud (EC2) instance. FYI: EC2 is a web service that provides resizable compute capacity in the cloud, designed to make web-scale cloud computing easier for developers. In short, it's a cloud computer.
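The pull itself is just an authenticated GET request from R. Here's a rough sketch with httr – the endpoint path, query parameter, and token handling are placeholders, not the exact StubHub call I used:

library(httr)

# Sketch only: the endpoint and parameters are placeholders -- check the StubHub
# developer docs for the real inventory/search endpoint and required fields.
resp <- GET(
  "https://api.stubhub.com/search/inventory/v2",                             # assumed endpoint
  add_headers(Authorization = paste("Bearer", Sys.getenv("STUBHUB_TOKEN"))),
  query = list(eventid = "1234567")                                          # hypothetical event id
)

listings <- content(resp, as = "parsed")  # parsed JSON -> nested R lists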

Data Storage

Now comes the interesting part. I'm taking data from the API and processing it on EC2, and I need to store it somewhere. Since I'm using Amazon already, I used its simple file storage system: Simple Storage Service – or S3 – which provides developers and IT teams with secure, durable, highly scalable cloud storage.
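Writing each hourly pull out to S3 from R is straightforward with the aws.s3 package. A sketch, assuming AWS credentials are set as environment variables on the EC2 instance and with placeholder object and bucket names:

library(aws.s3)

# listings_df is the hourly data frame built from the API response (hypothetical name).
readr::write_csv(listings_df, "listings_week2.csv")

put_object(
  file   = "listings_week2.csv",
  object = "stubhub/listings_week2.csv",   # key inside the bucket
  bucket = "my-ticket-bucket"              # placeholder bucket name
)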

Data Processing

The data from StubHub is compiled every hour and stored on S3. From there I process it and want to send it to Google Spreadsheets. Technically I was able to do this from R, but it required manual re-authentication for the R-Google handshake every 4 hours. That was a pain – it would have required me to wake up at night – so I decided to go a different route.

More Handshakes

Good news: I could save data from EC2/RStudio into S3. From there I could load data from S3 into Google Spreadsheets using the Blockspring API, which sends .csv files from S3 to Google. The API also lets me update the data hourly. Bonus.

Intermission

In case you are wondering, here's what's going on:


I wish this was as easy as I wrote it down to be. But a lot is going on.

The Visualization


There are two different things happening in this visualization: the line charts and the stadium shapes. These are not difficult to pull, but they take some time. Remember, the data come from StubHub, and they include IDs for each stadium and its corresponding zones. This means that if I have the shape files, I can plot the data as shapes. Here is a stadium for an upcoming game.


Good news: the colors on each section of the stadium are drawn as custom polygons in the HTML. That means I can download the HTML from the webpage and extract the polygons from the graphic, which I did with some simple R code that pulls out the SVG. That's the good news.

# Pull the SVG path nodes for the stadium sections from the saved HTML page.
stadium <-
  paste0(loc, team, '.htm') %>%
  xml2::read_html() %>%
  rvest::html_nodes('.svgcontainer svg path') %>%
  .[1:(length(.) - 1)]

# Wrap the paths in an <svg> tag, strip the class attribute, and write a standalone file.
stadium %>%
  as.character() %>%
  c('<svg>', ., '</svg>') %>%
  stringr::str_replace_all('class=\"st0\"', '') %>%
  writeLines(., file(paste0(loc, team, '.txt')))


The bad news: another tool is needed. The SVG just tells the browser how to draw each polygon; it doesn't contain the actual points along the polygon, which is what I want. This requires reformatting the data to a file type called .wkt (well-known text). There are R packages that can do this, but I didn't have time to learn them, so I used an online converter. After that I saved the shapes – now digestible by Tableau – back to the EC2/RStudio instance. From there it's pretty easy to read .wkt files into R, which gives me x and y values and some basic indices.
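For reference, reading the converted .wkt back into R can be done with the sf package: st_as_sfc() parses WKT strings and st_coordinates() returns the x/y points along with nesting indices. A sketch with a placeholder file name:

# Sketch: parse the converted WKT and pull out x/y coordinates plus indices.
wkt <- readLines("stadium_zones.wkt")      # placeholder file name
zones <- sf::st_as_sfc(wkt)                # WKT strings -> geometries
zone_points <-
  sf::st_coordinates(zones) %>%
  dplyr::as_data_frame() %>%
  dplyr::rename(x = X, y = Y)              # the L1/L2/... columns keep the ring/polygon indices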


More bad news: hand coding. The converter didn't carry some of the underlying data over to the shapes, and I couldn't find a way around it other than hand-coding 9,900 rows of data. This meant looking at every section in 32 stadiums, and some of them were very disorganized. But once that was done, I could connect ticket prices back to zones in stadiums.

Send an MMS Text

Prior to the start of this final Iron Viz feeder, I joked with a few colleagues that my next viz would be able to text a picture of itself to you. About 48 hours before submitting my visualization – while my data was still aggregating – I decided I needed to figure out how to actually do this. My coding skills outside of R are very limited, but there are a number of cool packages available in R that would let me send a text message. Here is how I did it.

Twilio is a messaging service: I can send or receive texts or phone calls with it. Through the service I can make an API call to send a text message, and the httr package makes that easy. Below is the code I used to send the message:

httr::POST(
  paste0('https://api.twilio.com/2010-04-01/Accounts/', 'XXXX', '/Messages.json'),
  config = httr::authenticate('XXXXX', 'XXXXX', "basic"),
  body = list(
    Body = 'See which NFL games had the hottest ticket prices in week 2: https://public.tableau.com/profile/stanke#!/vizhome/Whohasthehottestticketsrightnow/HottestNFLTicket',
    From = "6122550403",
    Method = "GET",
    To = paste0("+1", stringr::str_extract(as.character(isolate(input$phone_number)), "[[:digit:]]+")),
    MediaUrl = "https://XXXXXX.com/dashboard.png"
  ),
  encode = 'json'
)

Here I am sending the API an authenticated message in JSON format, specifying the to number, the from number, the text message, and the image URL. Everything is hard-coded with the exception of the input phone number – which I'll talk about in a minute.

But first, the image sent with the MMS. To grab it I used the webshot package in R, which requires PhantomJS, ImageMagick, and GraphicsMagick. Once those were installed on my EC2 instance I could just run the code below and it would grab an image of any website – including the published Tableau dashboard.

webshot::webshot(
  "https://public.tableau.com/profile/stanke#!/vizhome/Whohasthehottestticketsrightnow/Print",
  "dashboard.png",
  selector = ".tableau-viz",
  delay = 8,
  vwidth = '800'
) %>%
  webshot::resize('300%') %>%
  webshot::shrink()

Then I just put this onto a web server using the put_object function from the aws.s3 package:

aws.s3::put_object('dashboard.png')

The code you see above is essential for sending the text, but I still needed a user interface to go with it. I relied on Shiny, a web application framework for R. I created an open text box and a submit button that triggers the code above. I wanted an alert to pop up saying the message was sent, but I ran out of time – this was less than 6 hours before the submission deadline and I still didn't know what my visualization was going to look like.
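For context, the Shiny side was only a few lines: a text input, a submit button, and an observer that fires the Twilio POST above when clicked. A stripped-down sketch (the real app had more explanatory text around it):

library(shiny)

ui <- fluidPage(
  textInput("phone_number", "Your phone number"),
  actionButton("send", "Text me the viz")
)

server <- function(input, output, session) {
  observeEvent(input$send, {
    # The httr::POST() call shown above goes here, using
    # isolate(input$phone_number) to build the "To" number.
  })
}

shinyApp(ui, server)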

I thought the functionality turned out all right, considering I was still learning how to program it just hours before the due date. I added a bunch of text clarifying to the user how the texting works. It's overkill and takes away from the experience a bit, but it's still pretty cool that it works.

 

#TableauTorch Mobile Visualization

Visualize a .png in Tableau

I've always thought it was really awesome how some people can replicate images in Tableau, so I thought I would give it a shot. I figured it would take a long time to figure out, but it turns out it's pretty easy to get the key underlying data in R. After that you can quickly plot the information in Tableau.

So let’s start with a photo of downtown Minneapolis:


Well, I want to reduce the resolution of the photo, so I used pixlr to change the width of the photo to 200 pixels.

Once we do this we can load the photo into R using the readPNG function from the png package.

image <- 
  png::readPNG('.../mpls.png')

This produces an array that is height-by-width-by-channel (RGB) in size. With this information I can reshape the data for Tableau.

###  Create empty space for the image
image2 <-
  data.frame(
    matrix(
      NA,
      nrow = dim(image)[1]*dim(image)[2],
      ncol = 5
    )
  ) %>%
  dplyr::as_data_frame() %>%
  dplyr::rename(
    row = X1,
    column = X2,
    red = X3,
    green = X4,
    blue = X5
  ) %>%
  dplyr::mutate(
    row = image %>% dim() %>% .[1] %>% seq(1, .) %>% rep(image %>% dim() %>% .[2]),
    column = image %>% dim() %>% .[2] %>% seq(1, .) %>% rep(each = (image %>% dim() %>% .[1]))
  )

### Fill in the data 
for(w in (image %>% dim() %>% .[2]) %>% seq(1, .)) {
  for(h in (image %>% dim() %>% .[1]) %>% seq(1, .)) {
    image2$red[image2$column == w & image2$row == h] <- image[h, w, 1]
    image2$green[image2$column == w & image2$row == h] <- image[h, w, 2]
    image2$blue[image2$column == w & image2$row == h] <- image[h, w, 3]
  }
}

Now let's add the hex color as another column. The code below rounds each channel to the nearest third, so there are at most 64 possible colors (this image ends up with 27 distinct colors). We'll also save the output.

image2 %>%
  dplyr::mutate(
    red = round(red * 3, 0)/3,
    green = round(green * 3, 0)/3,
    blue = round(blue * 3, 0)/3,
    color = rgb(red, green, blue, maxColorValue = 1)
  ) %>%
  readr::write_csv('.../mpls.csv')

Now we can load this data into Tableau. Once we load it in, let's add the column and row fields to the Columns and Rows shelves as dimensions.


Unfortunately, my photo is upside down. There are a few fixes for this; the easiest is to reverse the row axis.


The color of each point isn't exactly what I'm looking for, so I am going to create some custom palettes that include all of the colors currently in the visualization. Here you can see all the colors in the graphic:


The list totals 27 colors. Since a custom palette contains 16 colors, I'm going to make 2 palettes for the colors.
If you haven't ever made a custom palette, read this document.

If we create the colors in the same order they are sorted in the image above, the colors will be aligned from top to bottom and it makes it easy to load the correct color to the correct shape. The color change results in this viz:


We've essentially done it! But now I want to edit the skyline so that it changes color based on the time of day. The color of the sky is #AAAAFF. When it's mid-day I want it to be bright blue, and when it's night I want it to be a dark blue. To do this, I need to make a dual-axis chart: one chart includes all the dots that are part of the sky, and another all of the buildings.

To do this, I need to create two variables that will help sort things out later: Sky? and Exclude Sky. Both have the same formula:

IF [Color] = '#AAAAFF'
THEN 'Yes'
ELSE 'No'
END

I'm going to change the shape type for both charts based on each of these variables. Then I'll hide the sky on one viz and the buildings on the other, and create a dual-axis chart with synchronized axes.


Now let's create the sky color. I want it to start getting dark by 7PM and be fully dark by 9PM, and to start getting light by 7AM and be fully light by 9AM. I'm going to use the logistic (inverse logit) function to make this happen. Here's the formula for Sky Color:

IF DATEPART("hour", NOW()) >= 12
THEN EXP((DATEPART("hour", NOW()) - 7.5)/2)/(1 + (EXP((DATEPART("hour", NOW()) - 7.5)/2)))
ELSE EXP((19.5 - DATEPART("hour", NOW()))/2)/(1 + (EXP((19.5 - DATEPART("hour", NOW()))/2)))
END

I'll make this as the color of my sky on the Sky portion of this viz. But let's make the lights show up when it's dark by creating a Modified Color variable. Here's the code:

IF DATEPART('hour', NOW()) < 7 OR DATEPART('hour', NOW()) >= 19
THEN IF [Color] = '#AAAAAA'
    THEN '#FFFFAA'
    ELSE [Color]
    END
ELSE [Color]
END

Now I'll use this Modified Color calculation in place of the [Color] variable on the Color shelf.


Now I'll put this in a dashboard and publish it:

Who should be the NFL MVP

The National Football League regular season ends tomorrow. With that comes talk about who the best player — the MVP — will be. Since 2000, the MVP award has been won 11 times by quarterbacks and 4 times by running backs. In fact, since the award has been given out, only two defensive players — Alan Page in 1971 and Lawrence Taylor in 1986 — have ever won MVP. The award has gone to a special teams player just once — Mark Moseley in the strike-shortened 1982 season. His recognition is a bit of an oddity, and some suggest he might not have even been the best kicker that year.

It's easy to say what Most Valuable means — a player who contributed the most to his team's success — but it's a lot harder to quantify. Smart statisticians at big sports companies work on this all the time, with far more data than I have access to, but I'm going to give it a shot. Using play-by-play data from Armchair Analysis, I created a model that gives a team's chances of winning a game given the context of the game — things like time remaining, down, distance to go, distance from the goal line, and the difference in score.
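The exact model matters less than the idea, but as an illustration, a simple version of this kind of win-probability model is a logistic regression on the game state. The column names below are hypothetical stand-ins for the play-by-play fields described above, not the Armchair Analysis schema:

# Illustrative sketch: a logistic regression win-probability model.
wp_model <- glm(
  offense_won ~ seconds_remaining + down + yards_to_go +
    yards_from_goal + score_differential,
  data   = plays,
  family = binomial()
)

# Predicted chance the offense wins, given the state at the start of each play.
plays$win_prob <- predict(wp_model, type = "response")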

Using this model I can figure out a team's chances of winning at any given point in a game. Take this situation: let's pretend Aaron Rodgers has the ball on his opponent's 40-yard line. It's 2nd down with 7 yards to go, 7:50 remaining in the game, and the Packers are down by four. In this situation, Rodgers and the Packers have a 45.92% chance to win the game. On the next play he is forced to run, and he gains 34 yards. The new odds of winning for the Packers are 50.02%. That means Rodgers contributed .041 points to his team's chances of winning. We can figure this out for every player on every play, and then add it all up to figure out who's the MVP.
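Adding it up by player then looks something like this — again a sketch with hypothetical column names: take the change in win probability from one play to the next and credit it to the player involved.

library(dplyr)

# Sketch: win probability added (WPA), summed by player.
mvp_table <-
  plays %>%
  arrange(game_id, play_number) %>%
  group_by(game_id) %>%
  mutate(wpa = lead(win_prob) - win_prob) %>%   # change in win probability on each play
  ungroup() %>%
  group_by(player) %>%
  summarise(total_wpa = sum(wpa, na.rm = TRUE)) %>%
  arrange(desc(total_wpa))                      # highest total contribution ~ strongest MVP case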

So who is the MVP?