When taking advanced analytics to the cloud you’ll need a strong understanding of your platform – whether it’s compute, storage, or some other feature. This tutorial walks you through reading from and writing to Amazon Web Services’ Simple Storage Service (S3). For this demo the code runs through RStudio, which is running on a Linux server in the cloud – which you can learn how to do here.

Using the aws.s3 package

Before I found this package I was doing things the hard way – using the AWS command line tools to put and get data from S3. The aws.s3 package makes these tasks much more convenient.

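If you don’t have the package yet, it’s a one-time install from CRAN (shown commented out so it isn’t re-run every time):

# install.packages("aws.s3")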
library(aws.s3)
library(magrittr)

Saving Environment Variables

To make life easier you should save your AWS access key credentials as environment variables. Keep in mind that while this makes life easier, it’s also a greater security risk.

Sys.setenv(
  "AWS_ACCESS_KEY_ID" = "ABCDEFGHIJLMNOP",
  "AWS_SECRET_ACCESS_KEY" = "ABCAaKDKJHFSKhfiayrhekjabdfkasdhfiaewr0293u4bsn"
)

Looking into S3

Saving your credentials eliminates additional arguments needed to run each of the aws.s3 functions shown below. Let’s start by looking at my buckets with the bucket_list_df function. This returns my bucket names and creation dates as a data frame.

bucket_list_df() %>%
  dplyr::arrange(dplyr::desc(CreationDate)) %>%
  head()
##                                            Bucket             CreationDate
## 1                                sample-data-demo 2017-06-01T20:04:07.000Z
## 2 aws-athena-query-results-666957067213-us-east-1 2017-05-20T18:18:31.000Z
## 3                 aws-logs-666957067213-us-east-1 2017-02-19T21:59:02.000Z
## 4                                         test.io 2017-01-25T13:38:32.000Z
## 5                                            test 2017-01-25T13:37:28.000Z
## 6                                      stanke.co 2016-10-04T13:02:41.000Z

I’m most interested in the sample-data-demo bucket. We can use the get_bucket function to examine the contents of the bucket. The output comes as a list – which isn’t always the best to work with. So I’ve written some code to take the output and transform it into a data frame/tibble.

##  List files in bucket
files <- get_bucket("sample-data-demo")

##  Convert the bucket listing to a tidy tibble
files_df <-
  tibble::data_frame(
    file = character(),
    LastModified = character()
  )

n_files <- length(files)

for (i in 1:n_files) {
  files_df <-
    tibble::data_frame(
      file = files[i][[1]]$Key,
      LastModified = files[i][[1]]$LastModified
    ) %>%
    dplyr::bind_rows(files_df)
}

rm(n_files)

head(files_df)
## # A tibble: 6 x 2
##                   file             LastModified
##                  <chr>                    <chr>
## 1 flights_2008.csv.bz2 2017-06-04T02:39:13.000Z
## 2     flights_2008.csv 2017-06-04T16:01:52.000Z
## 3 flights_2007.csv.bz2 2017-06-04T02:39:08.000Z
## 4     flights_2007.csv 2017-06-04T15:59:50.000Z
## 5 flights_2006.csv.bz2 2017-06-04T02:39:03.000Z
## 6     flights_2006.csv 2017-06-04T15:57:58.000Z
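If you’d rather skip the loop, the same tidy summary can be built in one step – a minimal sketch, assuming the purrr package is available (it isn’t otherwise used in this demo):

##  One-step version of the loop above: one row per object in the bucket
files_df <-
  purrr::map_df(
    files,
    ~ tibble::data_frame(
      file = .x$Key,
      LastModified = .x$LastModified
    )
  )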

Putting data into S3

Putting data into S3 is pretty easy. There are a few functions to know here: s3save, put_object, and save_object.

Where do we use these?

s3save

s3save is analogous to save. We can take an object and save it as an .Rdata file. Here I’ll take a local file – pro football results – and read it in as an object. Then I’ll save it to S3.

games <- readr::read_csv("data/NFL/GAME.csv")

s3save(games, object = "games.Rdata", bucket = "sample-data-demo")

Please note that I have to save this as an .Rdata object – even though it was originally a .csv file.
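The counterpart for reading that file back is s3load, which works like load: it fetches the .Rdata object from S3 and restores its contents into the current session.

##  Reload the games object saved above
s3load("games.Rdata", bucket = "sample-data-demo")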

put_object

put_object allows me to put any file that is on a local drive onto S3. This can be basically any file type – .Rdata, .csv, .csv.bz2 – they are all covered here. There are three arguments you need to know: 1) file = the location of the file you want to send to S3; 2) object = the name you want to give to the S3 object – probably the same as the file argument; and 3) bucket = the name of the bucket you’d like to place the object into.

put_object("data/NFL/GAME.csv", "games.csv", bucket = "sample-data-demo" )
## [1] TRUE

Here we took the same .csv we read in earlier and saved the object as games.csv into the sample-data-demo bucket. You’ll see TRUE returned, indicating the file was successfully uploaded to S3.

save_object

The save_object function sounds like it might save a file to S3. But it’s actually the opposite of put_object. save_object takes a file on S3 and saves it to your working directory for you to use. I REPEAT: save_object takes a file from S3 and saves it to your working directory.

save_object("games.csv", file = "games.csv", bucket = "sample-data-demo")
## [1] "games.csv"

We can then read this file as we normally would.

games <- readr::read_csv('games.csv')
dim(games)
## [1] 4256   17

get_object

The save_object function stores information locally. If you want to keep as much as possible in memory, you can use the get_object function – which returns the file as raw data.

games <- get_object("games.csv", bucket = "sample-data-demo")

games[1:100]
##   [1] 67 69 64 2c 73 65 61 73 2c 77 6b 2c 64 61 79 2c 76 2c 68 2c 73 74 61
##  [24] 64 2c 74 65 6d 70 2c 68 75 6d 64 2c 77 73 70 64 2c 77 64 69 72 2c 63
##  [47] 6f 6e 64 2c 73 75 72 66 2c 6f 75 2c 73 70 72 76 2c 70 74 73 76 2c 70
##  [70] 74 73 68 0d 0a 31 2c 32 30 30 30 2c 31 2c 53 55 4e 2c 53 46 2c 41 54
##  [93] 4c 2c 47 65 6f 72 67 69

As I mentioned, the get_object function returns raw data. This means if you look at the immediate output you’ll see the raw bytes as they are. To return the data as intended you’ll need to use the rawToChar function to convert it:

games <- 
  aws.s3::get_object("games.csv", bucket = "sample-data-demo") %>%
  rawToChar() %>%
  readr::read_csv()

dim(games)
## [1] 4256   17

This works pretty well for reading in most file types, but I’ve found it very hard to use for compressed files. I’d recommend save_object for .bz2, .gz, .zip, or any other compressed file – I just haven’t found a good in-memory solution yet.
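For example, using one of the compressed flight files listed earlier, I’d pull it down with save_object and let readr handle the decompression, since read_csv reads .bz2 files straight from disk (a sketch, not something run above):

##  Download the compressed file, then read it from disk;
##  readr decompresses .bz2 transparently
flights <-
  save_object("flights_2008.csv.bz2", file = "flights_2008.csv.bz2",
              bucket = "sample-data-demo") %>%
  readr::read_csv()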

delete_object

To delete an object on S3, just use the delete_object function. Here I’ll delete the files I just created for this demo.

aws.s3::delete_object("games.Rdata", bucket = "sample-data-demo")
## [1] TRUE
aws.s3::delete_object("games.csv", bucket = "sample-data-demo")
## [1] TRUE

put_bucket and delete_bucket

I can easily create or delete a bucket with the put_bucket and delete_bucket functions. With put_bucket I can also specify its ACL – i.e. private, public read, or public read/write.

##  Make a bucket.
# put_bucket("stanke123123123")
##  And make it disappear.
# delete_bucket("stanke123123123")
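As a quick sketch of the ACL option (assuming the standard canned ACL names AWS uses, e.g. "private", "public-read", "public-read-write"):

##  Create a bucket anyone can read – commented out like the example above
# put_bucket("stanke123123123", acl = "public-read")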

These functions get you started with S3 on AWS. There are a host of other services available that I’ll continue to share.