Luke Stanke

Data Science – Analytics – Psychometrics – Applied Statistics

Taking Advanced Analytics to the Cloud – Part I: R on AWS


Running R on the cloud isn’t very difficult. This demo shows how to get Rstudio running on Amazon Web Services. To run R on the cloud we need to spin up a machine and install R – that’s all very simple. Where you might get caught up is in the settings and permissions.

Step 1: Create an AWS Account

If you don’t have an AWS account you’ll need to sign up.

Step 2: Create a Role

Once you are signed up and signed into AWS, the first thing you’ll need to do is create a role for your machine. A role provides the necessary permissions for your machine to access the various APIs offered by AWS.

You can do this by searching IAM on the services tab.

Once you are on the page click “Roles” on the left menu, then the “Create New Role” button.

Then select Amazon EC2 from the AWS service role list.

There are a number of AWS services that your instance could work with – each with its own set of permissions. For now, just search for and select AmazonS3FullAccess, AmazonEC2FullAccess, RedshiftFullAccess, and AthenaFullAccess. You really won’t need any of these right away, but they will be useful when connecting to other services like S3 or another EC2 instance. Note: the photo does not include AthenaFullAccess, but you should include it!

From there you’ll be good to go! Select policies according to your access needs. In fact, if you were doing this properly, you’d create narrowly scoped policies rather than the over-arching full-access options we’ve selected for this demo.
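If you’d rather script this step, a rough AWS CLI equivalent is sketched below. It assumes the AWS CLI is installed and configured, uses the managed-policy names as they appear in IAM (which differ slightly from the shortened labels above), and names the role EC2-S3-Redshift to match the role referenced later in Step 5 – treat it as a sketch, not a drop-in replacement for the console steps.

# Trust policy letting EC2 assume the role (saved as ec2-trust.json)
cat > ec2-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Create the role and attach the managed policies used in this demo
aws iam create-role --role-name EC2-S3-Redshift --assume-role-policy-document file://ec2-trust.json
for policy in AmazonS3FullAccess AmazonEC2FullAccess AmazonRedshiftFullAccess AmazonAthenaFullAccess; do
  aws iam attach-role-policy --role-name EC2-S3-Redshift --policy-arn arn:aws:iam::aws:policy/$policy
done

# EC2 needs an instance profile wrapping the role
aws iam create-instance-profile --instance-profile-name EC2-S3-Redshift
aws iam add-role-to-instance-profile --instance-profile-name EC2-S3-Redshift --role-name EC2-S3-Redshift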

Step 3: Create a Key Pair

Next, you’ll want to create a key pair. This will allow you to securely log into your instance to update your machine and install Rstudio.

Go to Services and search EC2 in the search bar. Once the EC2 page loads, click “Key Pairs” under the “Network & Security” section on the left menu bar.

From there click “Create Key Pair”. Give your key a name and hit create. The key pair will download as a .pem file. Do not lose this key! Also: do not share this key!
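If you prefer the command line, the same key pair can be created with the AWS CLI – a minimal sketch, assuming the CLI is installed and configured, and using the key name that appears later in this demo:

# Create the key pair and save the private key locally
aws ec2 create-key-pair --key-name awesome_key --query 'KeyMaterial' --output text > awesome_key.pem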

Step 4: Create an R/Rstudio/Shiny Security Group

As you can see from the steps so far, it’s all about security. From the EC2 menu, under the “Network & Security” section on the left menu, select “Security Groups”. Click the “Create Security Group” button and a pop-up will appear. Name the security group “Rstudio/Shiny”. Under description write “Allows port access for R and Shiny”. You can leave the VPC dropdown alone. On the inbound tab – the default – add three new rules.

Under the first rule select SSH from the dropdown. The source should be auto-populated with 0.0.0.0/0. This opens up SSH to anywhere in the world – but you’ll need the .pem key to log in.

For the second rule, leave it as custom TCP and type 8787 as the port range. This is the port that needs to be opened on the server for you to connect to Rstudio. Under source, type 0.0.0.0/0. This means you could log into Rstudio from any IP address. You could also restrict it to just your own IP address if you know it.

For the third rule, leave it as custom TCP as well and type 3838 as the port range. This is the port for the Shiny connection. We’re not going to use it in this demo, but it’ll be very useful in the future. Under source, type 0.0.0.0/0 as well.
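If you’d rather script this, a rough AWS CLI equivalent of the group and its three inbound rules is below – a sketch that assumes a default VPC and uses a hyphenated group name, since that’s friendlier on the command line:

# Create the security group
aws ec2 create-security-group --group-name Rstudio-Shiny --description "Allows port access for R and Shiny"

# Open SSH (22), Rstudio (8787), and Shiny (3838) to any IP address
for port in 22 8787 3838; do
  aws ec2 authorize-security-group-ingress --group-name Rstudio-Shiny --protocol tcp --port $port --cidr 0.0.0.0/0
done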

Step 5: Launch EC2 Instance

Staying in the EC2 section, click “Instances” under the “Instances” section on the left menu. Click the “Launch Instance” button. This will take you to “Step 1: Choose an Amazon Machine Image”.

You’ll have a number of tabs to select AMIs from. Stay on the Quick Start tab and select Amazon Linux AMI – it’s the first option. It has a number of pre-built tools that are useful – but it doesn’t have R installed. You’ll do that in a bit.

For “Step 2: Choose an Instance Type”, choosing an instance type can be daunting. Here you are essentially specifying the type of computer you want to run. I typically select from the General purpose, Compute optimized, or Memory optimized options depending on the type of models I’m running and the type of data I am working with. For this example select t2.micro, because we’re just demoing the tool. Click “Next: Configure Instance”. Note: you’ll need a more powerful machine to install the packages in Step 8 and beyond – I’d recommend a c2.large machine just to be safe.

For “Step 3: Configure Instance”, under “IAM Role” select the role you created earlier – my role was called EC2-S3-Redshift. Note: under Advanced Options you could send a bash script to configure your instance; instead, we’ll do it with command-line tools. Click “Next: Add Storage”.
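For reference, a user-data script passed through those advanced options could do roughly what we’ll do by hand in Step 7 – a minimal sketch reusing the same commands from later in this post (the Rstudio Server version shown is the one used below; check for the current release before using it):

#!/bin/bash
# Runs as root on first boot: update the machine, install R, install Rstudio Server
yum -y update
yum -y install libcurl-devel openssl-devel libxml2 libxml2-devel R
wget -P /tmp https://s3.amazonaws.com/rstudio-dailybuilds/rstudio-server-rhel-1.0.143-x86_64.rpm
yum install -y --nogpgcheck /tmp/rstudio-server-rhel-1.0.143-x86_64.rpm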

For “Step 4: Add Storage” we can stick with the default settings – for now. Click “Next: Add Tags”.

For “Step 5: Add Tags”, click “Add Tag”, enter a key of “Name” and a value of “Rstudio Example” – this will give our machine a name of “Rstudio Example”. Having a name is important when you have more than one machine running. Click “Next: Configure Security Group”.

For “Step 6: Configure Security Group”, under “Assign a security group” select “Select an existing security group”. Find and select your “Rstudio/Shiny” security group. If this is your first time doing this, it should be easy to spot. If you have multiple security groups like I do, you’ll have to search through the lot. Click “Review and Launch”.

Under the review screen, make sure you’ve selected the instance type, security group, and IAM role as described above. Click “Launch” – you’ll get a pop-up to select an existing key pair or create a new key pair. Choose an existing key pair and select the key pair you just created. My key is called awesome_key. Acknowledge that you’ve selected the key pair and that you’ll need it to log on. Note: make sure you have the key pair to log into your instance. This should go without saying, but I’m saying it – you need your key pair to log into your machine and set it up! Launch your instance.

Step 6: Login to your instance

This is where the Windows OS world diverges from the Apple/Unix/Linux/Ubuntu worlds. Windows doesn’t have a built-in terminal like these other machines, so you’ll have to download and set up PuTTY if you don’t have it already. Next you’ll use your terminal to SSH into your newly created instance and set it up. The instance will likely need about 60 seconds to spin up from when you hit launch.

After you launch, click on your instance ID just to the right of the message saying the instance launch has been initiated.

This will take you to your instance on the EC2 dashboard. The dashboard is full of important information – most notably, it’ll tell you whether your instance is up and running. From the image below you can see my instance state is a green “running”. This tells me it’s ready to be SSH’ed into.

You can also see the Public DNS on the right and on the description tab. We’ll need that information to SSH into the instance. Note: you can also see my IAM role and key pair name. Click the “Connect” button just to the right of the “Launch Instance” button. This provides you with additional directions on how to connect to your instance.
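If you’d rather not hunt through the console, the public DNS can also be pulled with the AWS CLI – a sketch that assumes you tagged the instance “Rstudio Example” in Step 5:

# Look up the public DNS of the instance by its Name tag
aws ec2 describe-instances --filters "Name=tag:Name,Values=Rstudio Example" --query "Reservations[].Instances[].PublicDnsName" --output text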

First, let’s open our terminal and change directories so that we are in the folder that contains our .pem key. If you haven’t moved it out of your downloads folder, it’s probably there – and you should probably move it.

Next change the permissions of your key pair to allow you to SSH onto your AWS instance.

chmod 400 awesome_key.pem

Then SSH onto your machine using the following format. You’ll have to replace the key pair with the key pair you’ve created. You’ll also have to change the public DNS address to the address of your machine.

ssh -i "awesome_key.pem" ec2-user@ec2-54-145-158-106.compute-1.amazonaws.com

Once you are logged in, your terminal should look like this:

Step 7: Setup your instance

Once we are logged in we need to update our machine, install a few additional programs, install R, and install Rstudio. You can do this by running the following commands line-by-line through your EC2 instance.

# Update the machine
sudo yum -y update

# Install programs that run well with the devtools package
sudo yum -y install libcurl-devel openssl-devel # used for devtools

# Install programs that assist APIs
sudo yum -y install libxml2 libxml2-devel

# Install R
sudo su
yum install -y R

#  Install PostgreSQL
yum install -y postgresql-devel

After running this code you will have 1) updated your machine; 2) installed tools that allow the devtools package to run; 3) installed tools that allow packages like httr and aws.s3 to run; and 4) installed base R.

Next you’ll want to install the most recent version of Rstudio and Shiny. Check here to find the most recent releases of Rstudio and Shiny. Edit the code so that you install the most recent version. Run the install of Rstudio and Shiny.


# Install RStudio Server - change version when installing your Rstudio
wget -P /tmp https://s3.amazonaws.com/rstudio-dailybuilds/rstudio-server-rhel-1.0.143-x86_64.rpm
sudo yum install -y --nogpgcheck /tmp/rstudio-server-rhel-1.0.143-x86_64.rpm

#install shiny and shiny-server - change version when installing your Shiny Server
R -e "install.packages('shiny', repos='http://cran.rstudio.com/')"
wget https://download3.rstudio.org/centos5.9/x86_64/shiny-server-1.5.3.838-rh5-x86_64.rpm
yum install -y --nogpgcheck shiny-server-1.5.3.838-rh5-x86_64.rpm

#add user(s)
sudo useradd -m stanke
sudo passwd stanke

Finally, add a user and set that user’s password. Once this is done, you can close your SSH session.

Step 8: Log into Rstudio

Copy the public DNS you located on your EC2 page earlier. Paste that public DNS into your browser and add “:8787” after it – in my case “ec2-54-145-158-106.compute-1.amazonaws.com:8787”. Hit enter. Your Rstudio login page should appear. Enter the credentials for your new user. Click “Sign In”.
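If the login page doesn’t appear, SSH back into the instance and make sure the server is actually up – rstudio-server ships with a couple of helpful subcommands:

# Check the Rstudio Server install and restart it if needed
sudo rstudio-server verify-installation
sudo rstudio-server restart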

Step 9: Setup your Rstudio Defaults.

So technically you’ve done it. You’ve logged into Rstudio on AWS. But let’s take this another few steps. Let’s set up some defaults so that any time we want to set up an Rstudio instance on AWS we don’t have to go through the hassle we just did above. Let’s install a bunch of packages that you might regularly use. This install might take a while since you’ll be installing a number of packages onto your machine.

## install packages if not present
install.packages(
  c(
    "devtools",
    "sparklyr",
    "ggplot2",
    "magrittr",
    "tidyverse",
    "Lahman",
    "DBI",
    "rsparkling",
    "h2o",
    "ghit",
    "xml2",
    "stringr",
    "magrittr",
    "data.table",
    "clipr"
  )
)

Let’s also create a new .R file and write a few lines of code. You don’t really need to do this, but it’ll show the power of Step 10 in a few minutes.

Here is the R script you can use:

### get data
data("iris")

###  See data structure
head(iris)

###  Save iris data locally.
write.csv(iris, "iris.csv")

After the initial script: A .R file, an iris object, and an iris.csv file.

Step 10: Take an image of your current instance

Setting up Rstudio is a pain – all that time waiting for packages to install, getting your data just right, only to possibly find out you didn’t size your instance correctly. No one wants to deal with that every time they set up R on AWS. This isn’t a problem, as you can take a snapshot of your instance as it is and spin off new instances from that point in time.

To take a snapshot go back to the webpage with your EC2 details. Click the “Actions” button, then go to “Image” and “Create Image”.

From there enter an Image Name of “Rstudio AMI Example” and Image Description of “Rstudio AMI Example”. Click “Create Image” and wait a few minutes.
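The same image can also be created from the command line – a sketch, with a placeholder instance ID you’d replace with your own from the EC2 dashboard:

# Create an AMI from the running instance (replace the placeholder instance ID)
aws ec2 create-image --instance-id i-0123456789abcdef0 --name "Rstudio AMI Example" --description "Rstudio AMI Example"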

Step 11: Create an instance from the AMI

Launch a new instance. On “Step 1: Choose an Amazon Machine Image”, click “My AMIs”. Select “Rstudio AMI Example”. Follow Step 5 above for the rest of the setup. However, for this instance, tag the Name as “Rstudio AMI”. If you’ve set everything up correctly, you shouldn’t need to SSH into your instance to configure it.

Copy the Public DNS into your browser and add “:8787”. Log in with the username and password you created earlier. Once logged in you’ll see that your .R script and .csv file are still saved to the instance – allowing you to quickly jump back into your analyses.

In fact, creating an image and then creating multiple instances will allow you to quickly fit models, test which instance type is best for the models/data you are currently working with, and eventually minimize costs. If the data become more complicated or larger in size, you can create a new instance that accommodates those changes. However, sometimes the data become so large that we need to use a distributed system – Rstudio can sit on top of distributed systems as well; I’ll talk about that in a different post. This approach will give you a great jump-start, though.


Taking Advanced Analytics to the Cloud – Part II: Objects from S3 to R using the aws.s3 package


When taking advanced analytics to the cloud you’ll need a strong understanding of your platform – whether it’s compute, storage, or some other feature. This tutorial walks you through reading from and writing to Amazon Web Services’ Simple Storage Service (S3). For this demo, code will be run through Rstudio, which is running on a Linux server in the cloud – you can learn how to set that up here.

Using aws.s3 package

Before I found this package I was doing things the hard way – using the AWS command-line tools to put and get data from S3. The aws.s3 package makes these practices very convenient.
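For comparison, the “hard way” looked roughly like this – shelling out to the AWS CLI to copy and list objects instead of staying inside R (bucket and file names are just examples):

# Push a local file to S3 and list the bucket's contents
aws s3 cp games.csv s3://sample-data-demo/games.csv
aws s3 ls s3://sample-data-demo/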

library(aws.s3)
library(magrittr)

Saving System Variables

To make life easier you should save your AWS access key credentials as environment variables. Though doing this makes life easier, it’s a greater security risk.

Sys.setenv(
  "AWS_ACCESS_KEY_ID" = "ABCDEFGHIJLMNOP",
  "AWS_SECRET_ACCESS_KEY" = "ABCAaKDKJHFSKhfiayrhekjabdfkasdhfiaewr0293u4bsn"
)

Looking into S3

Saving your credentials eliminates additional arguments needed to run each of the aws.s3 functions shown below. Let’s start by looking at my buckets using the bucket_list_df function. This returns my bucket names and creation dates as a data frame.

bucket_list_df() %>%
  dplyr::arrange(dplyr::desc(CreationDate)) %>%
  head()
##                                            Bucket             CreationDate
## 1                                sample-data-demo 2017-06-01T20:04:07.000Z
## 2 aws-athena-query-results-666957067213-us-east-1 2017-05-20T18:18:31.000Z
## 3                 aws-logs-666957067213-us-east-1 2017-02-19T21:59:02.000Z
## 4                                         test.io 2017-01-25T13:38:32.000Z
## 5                                            test 2017-01-25T13:37:28.000Z
## 6                                      stanke.co 2016-10-04T13:02:41.000Z

I’m most interested in the sample-data-demo bucket. We can use the get_bucket function to examine the contents of the bucket. The output comes as a list – which isn’t always the best to work with – so I’ve written some code to take the output and transform it into a data frame/tibble.

##  List files in bucket
files <- get_bucket("sample-data-demo")

##  Convert files to tidy
files_df <-
  tibble::data_frame(
    file = character(),
    LastModified = character()
  )

n_files <- length(files)

for(i in 1:n_files) {
  files_df <-
    tibble::data_frame(
      file = files[i][[1]]$Key,
      LastModified = files[i][[1]]$LastModified
    ) %>% 
    dplyr::bind_rows(files_df)
}

rm(n_files)

head(files_df)
## # A tibble: 6 x 2
##                   file             LastModified
##                  <chr>                    <chr>
## 1 flights_2008.csv.bz2 2017-06-04T02:39:13.000Z
## 2     flights_2008.csv 2017-06-04T16:01:52.000Z
## 3 flights_2007.csv.bz2 2017-06-04T02:39:08.000Z
## 4     flights_2007.csv 2017-06-04T15:59:50.000Z
## 5 flights_2006.csv.bz2 2017-06-04T02:39:03.000Z
## 6     flights_2006.csv 2017-06-04T15:57:58.000Z

Putting data into S3

Putting data into S3 is pretty easy. We can use one of several functions: S3save, put_object, or save_object.

Where do we use these?

S3save

S3save is analogous to save. We can take an object and save it as an .Rdata file. Here I’ll take a local file – pro football results – and create an object. Then I’ll save it to S3.

games <- readr::read_csv("data/NFL/GAME.csv")

s3save(games, object = "games.Rdata", bucket = "sample-data-demo")

Please note that I have to save this as an .Rdata object – even though it was originally a .csv file.

put_object

put_object allows me to put any file that is on a local drive onto S3. This can be basically any file type – .Rdata, .csv, .csv.bz2 – they are all covered here. There are three arguments you need to know: 1) file = the location of the file you want to send to S3; 2) object = the name you want to give the S3 object – probably the same as the file argument; and 3) bucket = the name of the bucket you’d like to place the object into.

put_object("data/NFL/GAME.csv", "games.csv", bucket = "sample-data-demo" )
## [1] TRUE

Here we took the same .csv we read in earlier and saved it as games.csv in the sample-data-demo bucket. You’ll see TRUE is returned, indicating the file was successfully uploaded to S3.

save_object

The save_object function sounds like it might save a file to S3. But it’s actually the opposite of put_object. save_object takes a file on S3 and saves it to your working directory for you to use. I REPEAT: save_object takes a file from S3 and saves it to your working directory.

save_object("games.csv", file = "games.csv", bucket = "sample-data-demo")
## [1] "games.csv"

We can then take this file and read it like we would normally do.

games <- readr::read_csv('games.csv')
dim(games)
## [1] 4256   17

get_object

The save_object function stores information locally. If you want to keep as much as possible in-memory you can use the get_object function – which returns the file as raw data.

games <- get_object("games.csv", bucket = "sample-data-demo")

games[1:100]
##   [1] 67 69 64 2c 73 65 61 73 2c 77 6b 2c 64 61 79 2c 76 2c 68 2c 73 74 61
##  [24] 64 2c 74 65 6d 70 2c 68 75 6d 64 2c 77 73 70 64 2c 77 64 69 72 2c 63
##  [47] 6f 6e 64 2c 73 75 72 66 2c 6f 75 2c 73 70 72 76 2c 70 74 73 76 2c 70
##  [70] 74 73 68 0d 0a 31 2c 32 30 30 30 2c 31 2c 53 55 4e 2c 53 46 2c 41 54
##  [93] 4c 2c 47 65 6f 72 67 69

As I mentioned, using the get_object function returns raw data. This means if you look at the immediate output you’ll see the raw bytes as they are. To return the data as intended, you’ll need to use the rawToChar function to convert it:

games <- 
  aws.s3::get_object("games.csv", bucket = "sample-data-demo") %>%
  rawToChar() %>%
  readr::read_csv()

dim(games)
## [1] 4256   17

This works pretty well for reading in most file types, but I’ve found it very hard to use for compressed files. I’d recommend save_object for .bz2, .gz, .zip, or any other compressed file – I just haven’t found a good in-memory solution yet.

delete_object

To delete an object on S3, just use the delete_object function. Here I’ll delete the files I just created for this demo.

aws.s3::delete_object("games.Rdata", bucket = "sample-data-demo")
## [1] TRUE
aws.s3::delete_object("games.csv", bucket = "sample-data-demo")
## [1] TRUE

put_bucket and delete_bucket

I can easily create or delete a bucket with the put_bucket and delete_bucket functions. With put_bucket I can also specify its ACL – e.g., private, public read, or public read/write.

##  Make a bucket.
# put_bucket("stanke123123123")
##  And make it disappear.
# delete_bucket("stanke123123123")

These functions get you started with S3 on AWS. There are a host of other services available that I’ll continue to share.


Impressions of Climate Change

Sports Analytics Mirrors Real Life: My Big Takeaways from SSAC17

Last weekend I finished up my first Sloan Sports Analytics Conference. For many in the analytics world, attending means running into big-name sports analytics figures: Bill James, Jessica Gelman, Nate Silver, Bill Barnwell, and Daryl Morey, to name a few. And while the conference boasts big names that packed very large auditoriums, there were some excellent smaller sessions that anyone in attendance could have taken a lot from. Here are a few of my bigger takeaways:

 

Perception doesn’t always meet reality.

In any given sport there are 2-3 organizations who are on the leading edge of innovation. These organizations are consistent performers and are constantly seeking competitive advantages in-game and in-business. Teams are now tracking in great detail the physical exertion of players in practice to minimize injuries, keep players healthy, and keep the in-game product at a high level for their fans. Even more impressive are some of the fan-experience analytics. Some teams are tracking many sources of data – beyond box score, weather, and attendance data – to understand fans’ habits.

The conference highlights big wins in sports analytics. But in hallway conversations with other organizations – those not presenting – they were quick to confide they felt way behind. And that’s not really a surprise – while sports analytics has been a hot topic for nearly a decade the business only began to boom in the past 5 years. A team on the leading edge five years ago can easily maintain their competitive advantage. This leaves teams late to the game playing major catch-up.

"Today, you will need to understand this, whether you are the first or last on the team" – Luis Scola on analytics #SSAC17 #futureofbball

— Sloan Sports Conf. (@SloanSportsConf) March 4, 2017

There’s an overused quote: “sports are a microcosm of life”. But this phrase also applies to analytics. In any industry it’s easy to believe that all organizations – especially your competitors – have a strong analytics culture and are building an ROI analytics behemoth. It’s just not the case. The reality is that a handful of organizations – paraphrasing, but I don’t think it’s out of line – have “figured analytics out”.

 

A culture of analytics goes beyond the team (on the field/court).

Let me re-iterate my earlier points: it’s important to use analytics to improve player performance and the in-game product, but the culture begins with leaders who drive business decisions with data. There are team executives across multiple sports who drive success using both their expertise of the game and advanced analytics. But there are also leaders who are not convinced of analytics and believe intuition and the eye test are still the true way to gain a competitive advantage. This just isn’t true: research has largely shown that leaders leveraging data and their expertise produce the best outcomes in any industry.

Building an analytics culture also requires capable employees who are domain experts, have capacity with advancing technologies, can code, and can distill all the data into clear stories for the organization to use. People with these skills are nearly impossible to find in any industry – that’s why most executives call them unicorns.

Be able to program https://t.co/CDQaQwNCcG

— Daryl Morey (@dmorey) March 3, 2017

That’s the thing: finding the right people for a team is hard. Sometimes the right move is promoting an analytics-savvy individual from within the organization; sometimes it’s hiring a fresh-out-of-school MBA graduate.

Both can backfire. The person promoted internally might be a strong communicator with the business but not have the expertise to assign appropriate probabilities to leads, leading to wasted effort by ticket reps. And that recent MBA grad might know how to build the best models to assign probabilities to leads but might not know how to talk to the business, leading to the exact same problem.

I’m not saying I have a solution for this. But it’s clear that every organization faces the same problem. And that’s not just in the sports world, but also in every industry, everywhere. It’s clear that sports teams are trending towards the integrated use of analytics across all lines of business but because talent is hard to find there have been some growing pains in some markets.

 

Make sure a story can be told from the data.

In every session I heard the same thing – directly and indirectly: analytics professionals need the capacity for data-based storytelling. And those stories need to be actionable. In a session with basketball professionals, coach Vinny Del Negro stated he receives a 40-page analytics binder before every game. That’s 40 pages of information 82 times a season, meaning coach Del Negro has to absorb 3,280 pages to develop in-game strategies for a dozen players. It’s too much information to digest, so he often just scans it for a handful of key takeaways he can bring to his team.

"There are two types of coaches: those that embrace analytics and those that are unemployed" @adirshiffman #SSAC17 #sustaininggreatness

— Sloan Sports Conf. (@SloanSportsConf) March 4, 2017

When it comes to the players, women’s professional Sue Bird tries to put aside analytics on herself to focus on her teammates’ strengths. That message carried from session to session: most pros already know their own tendencies, but they are less familiar with their teammates and the competition. And that’s where analytics is most useful on-court. With these digestible bites of data she makes in-game decisions about which open teammate should get a pass based on situational statistics for each player.

Regardless of background – in pro sports or in another industry – it’s clear that analytics professionals have to be able to take complicated ideas and bake them into a 3-5 point story that has clear meaning and is easily executable. Without this skillset analytics is practically fruitless. But with it, teams can develop improved efficiencies in-game or in-business.

Recent NFL History


NFL Drive Effectiveness 2016

13 Tips for Growing Data Science Teams

I recently presented to a local organization on setting up and scaling data science teams. We had a thoughtful one-hour conversation about the pseudo-prerequisites to accelerating the adoption of data science within an organization. Our conversation kept coming back to the same point: data scientists are a magnifying glass for an organization – whether an organization has a strong analytics culture or its systems are disparate, the current state will only be amplified by data scientists. But when data scientists unearth issues – in the data, in the infrastructure, or even in the culture – it’s an opportunity to improve organizational productivity and culture. Now for some subtext:

In October of 2012, Harvard Business Review dubbed the data scientist the sexiest job of the 21st century. Four years later, the immense demand remains unchanged for individuals who can write code to sift through both known-critical and seemingly superfluous data, correctly develop and apply statistical models, and effectively communicate key insights to improve organizational outcomes. In fact, as success stories continue to pour into the media, organizations – many you are very familiar with, but which shall remain nameless – are hastily moving to add individuals with these skillsets to their teams to explore their environments. With big data and analytics raising the eyebrows of most c-suite executives, business units are catalyzing for the deep dive into the data science waters. While there are plenty of insights regarding the ideal skillset of a data scientist – hint: it’s the soft skills – very few mention the prerequisites for scaling data science teams. This raises the question: if I am building a data science team, what should I do ASAP to make sure my team is successful? What follows are some of the overlooked aspects of building data science teams that my peers and I have personally experienced – some are even blockers for today’s most highly effective data science teams.

1. Access to data.

This shouldn’t come as a surprise, but data scientists work with data, preferably in the rawest form it’s collected. Granting access to data might seem straightforward, but some organizations will limit data scientists to data marts or cubes (pre-aggregated data). With most data scientists spending a big chunk of their time playing data “Go Fish”, making data available allows them to focus on what they do best.

2. Appropriate governance.

With freedom comes responsibility. Providing appropriate access to individuals who most likely have the highest analytical ceilings in an organization also means having strong data governance practices in place. Organizations should be thoughtful about the risk and reward of limiting access. Applying some basic data governance practices will allow data scientists to understand the data, the data flow, and data quality issues.

3. Provide an analytic sandbox.

Data scientists are constantly building models and structuring data for analysis – it’s 80% of the job. Creating a space for teams to build models and share data, so that processes do not have to be re-run from scratch each time, will save valuable business time.

4. Leverage data visualization tools.

This, I believe, is a big one. First, data science is more than just model building; it’s about collecting critical insights about an organization and effectively communicating those insights. One of the best ways a data scientist can do this is by building tools that allow key stakeholders to see and interact with the story the data tells. Second, good data tools also build a culture of data fluency by allowing the business to interact with and understand the who, what, and where of the business. They allow an organization to take action. The tools build trust around using data to improve the business and getting to the when, why, and how.

5. Keep your team structures flexible.

If you are introducing the concept of data science teams to your organization for the first time you’ll probably spend hundreds of hours conceptualizing the ideal organizational structure. That’s fine, but just know that a team’s structure is likely to evolve in the first few years depending on the people and existing processes and structures you already have in place. In a nutshell, put more of a premium on collaboration over team structure.

6. Integrate data scientists with existing business units.

It’s great when data scientists can collaborate with each other. It’s just as important that data scientists work with business units to better understand and solve problems. This elevates data scientists from reactive stats-gatherers to proactive partners.

7. Be sure your data science sponsors are 100% committed to supporting the team.

Data science return on investment can be accelerated – or limited – by leader support. These individuals are evangelists and enforcers for data science work across the organization. In some cases there are leaders who will leverage the idea of data scientists – not actually implement work, but hype the potential work – just to move up the organizational ranks, which hinders the effectiveness of data scientists and the organization.

8. Hire for diverse skillsets.

Prioritize culture and communication, but also make sure that your data scientists bring varying analytic skillsets. If you are trying to build a robust data science team, it should not be made up of data scientists alone. Consider having supporting roles that might include a project manager, business analysts, data visualization specialists, a UX designer/front-end developer, and/or ETL developers.

9. Develop internal team skillsets.

This should apply to all individuals in an organization because skillsets should be constantly evolving to match the tools and technologies that improve the business; consider embracing open-source technologies and cloud-based tools.

10. Be willing to fail.

Experimentation is a cornerstone of data science. It allows the business to learn what not to do. Experimentation shouldn’t happen on big parts of the business; it should happen in small, controllable chunks. This is the heart of data science, and many organizations might be afraid that they are going to lose. In reality they are going to learn quickly about what does and doesn’t grow their business, and they should adopt the fail-fast mantra.

11. Iterate quickly.

This is another cornerstone of data science. Data science projects shouldn’t take months to spin out initial insights. The team should learn what works and what doesn’t from the data using short cycles and focus on quick wins for the organization. If something works, then move it to a larger production environment.

12. Be comfortable with data not being perfect.

Because you are iterating quickly, the data might not be perfect. Data science is about action – avoid slowing down to do significant QA on data, extensive data collection, and/or complex modeling – at least to start.

13. Plan now for scaling up.

While you should start small and get some quick wins, start thinking about how you will want to scale and automate insights from the data science practice.

To wrap things up: highly effective data science teams don’t appear overnight. Building a successful team means developing people, culture, processes, science, and technologies. There are a number of things that can be done to get there – many of these steps are rarely outlined for an organization, but I’ve tried to highlight some of them above. Most of the points above are just good business practices – practices that are simply brought into focus by data scientists.

NFL Analytical MVP 2016

Which NFL tickets are hottest?