When taking advanced analytics to the cloud you’ll need a strong understanding of your platform, whether it’s compute, storage, or some other feature. This tutorial walks you through reading from and writing to Amazon Web Services’ Simple Storage Service (S3). For this demo, code will be run through RStudio, which is running on a Linux server in the cloud; you can learn how to set that up here.
Using the aws.s3 package
Before I found this package I was doing things the hard way: using the AWS command line tools to put and get data from S3. The aws.s3 package makes these tasks much more convenient.
library(aws.s3)
library(magrittr)
Saving System Variables
To make life easier you should save your AWS access key credentials as environment variables. Though doing this makes life easier, it is also a greater security risk.
Sys.setenv(
  "AWS_ACCESS_KEY_ID" = "ABCDEFGHIJLMNOP",
  "AWS_SECRET_ACCESS_KEY" = "ABCAaKDKJHFSKhfiayrhekjabdfkasdhfiaewr0293u4bsn"
)
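If you’d rather keep the keys out of your scripts entirely, one option is to put them in your ~/.Renviron file, which R reads at startup. A minimal sketch; the values below are placeholders:

# ~/.Renviron -- picked up automatically when an R session starts
AWS_ACCESS_KEY_ID=ABCDEFGHIJLMNOP
AWS_SECRET_ACCESS_KEY=ABCAaKDKJHFSKhfiayrhekjabdfkasdhfiaewr0293u4bsn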
Looking into S3
Saving your credentials eliminates the additional authentication arguments otherwise needed for each of the aws.s3 functions. For example, bucket_list_df() returns the buckets in your account:
bucket_list_df() %>%
  dplyr::arrange(dplyr::desc(CreationDate)) %>%
  head()
##                                            Bucket             CreationDate
## 1                                sample-data-demo 2017-06-01T20:04:07.000Z
## 2 aws-athena-query-results-666957067213-us-east-1 2017-05-20T18:18:31.000Z
## 3                 aws-logs-666957067213-us-east-1 2017-02-19T21:59:02.000Z
## 4                                         test.io 2017-01-25T13:38:32.000Z
## 5                                            test 2017-01-25T13:37:28.000Z
## 6                                       stanke.co 2016-10-04T13:02:41.000Z
I’m most interested in the sample-data-demo bucket. We can use the get_bucket() function to list the files it contains.
## List files in bucket
files <- get_bucket("sample-data-demo")

## Convert files to a tidy data frame
files_df <- tibble::data_frame(
  file = character(),
  LastModified = character()
)

n_files <- length(files)

for(i in 1:n_files) {
  files_df <- tibble::data_frame(
    file = files[i][[1]]$Key,
    LastModified = files[i][[1]]$LastModified
  ) %>%
    dplyr::bind_rows(files_df)
}

rm(n_files)

head(files_df)
## # A tibble: 6 x 2
##                   file             LastModified
##                  <chr>                    <chr>
## 1 flights_2008.csv.bz2 2017-06-04T02:39:13.000Z
## 2     flights_2008.csv 2017-06-04T16:01:52.000Z
## 3 flights_2007.csv.bz2 2017-06-04T02:39:08.000Z
## 4     flights_2007.csv 2017-06-04T15:59:50.000Z
## 5 flights_2006.csv.bz2 2017-06-04T02:39:03.000Z
## 6     flights_2006.csv 2017-06-04T15:57:58.000Z
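The loop above works fine, but if you prefer a more functional style, a purrr sketch along these lines builds the same two columns, one row per file (row order may differ from the loop, and it assumes files is the list returned by get_bucket()):

files_df <- purrr::map_df(
  files,
  ~ tibble::tibble(
    file = .x$Key,                  # object key in the bucket
    LastModified = .x$LastModified  # timestamp reported by S3
  )
)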
Putting data into S3
Putting data into S3 is pretty easy. We can use several functions: s3save(), put_object(), and put_bucket().
Where do we use these?
s3save
s3save() is analogous to save(). We can take an in-memory object and save it as an .Rdata file directly in an S3 bucket.
games <- readr::read_csv("data/NFL/GAME.csv") s3save(games, object = "games.Rdata", bucket = "sample-data-demo")
Please note that I have to save this as an .Rdata object, even though it was originally a .csv file.
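The counterpart for reading it back is s3load(), which pulls the .Rdata object straight from the bucket into your session. A quick sketch:

## Load the games object back into the current environment
s3load("games.Rdata", bucket = "sample-data-demo")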
put_object
put_object() allows me to put any object that is on a local drive onto S3. This can be basically any file type: .Rdata, .csv, .csv.bz2, they are all covered here. There are three arguments you need to know: 1) the path to the local file, 2) the name (key) the object should have in S3, and 3) the bucket to put it in.
put_object("data/NFL/GAME.csv", "games.csv", bucket = "sample-data-demo" )
## [1] TRUE
Here we took the same .csv we read in earlier and saved the object as games.csv in the sample-data-demo bucket. You’ll see the function returns TRUE when the upload succeeds.
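If you want to confirm the upload without pulling the whole bucket listing, head_object() is a quick check. A small sketch:

## Returns TRUE if games.csv now exists in the bucket
head_object("games.csv", bucket = "sample-data-demo")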
save_object
The save_object() function does the opposite of put_object(): it downloads an object from an S3 bucket and writes it to a local file.
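The call that produces the local games.csv used below would look roughly like this (file is the local path the download is written to):

## Download games.csv from S3 into the working directory
save_object("games.csv", bucket = "sample-data-demo", file = "games.csv")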
We can then take this file and read it like we would normally do.
games <- readr::read_csv('games.csv')
dim(games)
## [1] 4256 17
get_object
The get_object() function returns the object itself as a raw vector rather than saving it to disk.
games <- get_object("games.csv", bucket = "sample-data-demo")
games[1:100]
##  [1] 67 69 64 2c 73 65 61 73 2c 77 6b 2c 64 61 79 2c 76 2c 68 2c 73 74 61
## [24] 64 2c 74 65 6d 70 2c 68 75 6d 64 2c 77 73 70 64 2c 77 64 69 72 2c 63
## [47] 6f 6e 64 2c 73 75 72 66 2c 6f 75 2c 73 70 72 76 2c 70 74 73 76 2c 70
## [70] 74 73 68 0d 0a 31 2c 32 30 30 30 2c 31 2c 53 55 4e 2c 53 46 2c 41 54
## [93] 4c 2c 47 65 6f 72 67 69
As I mentioned, using the get_object() function returns raw bytes, so we need to convert the result with rawToChar() before parsing it as a .csv:
games <- aws.s3::get_object("games.csv", bucket = "sample-data-demo") %>%
  rawToChar() %>%
  readr::read_csv()

dim(games)
## [1] 4256 17
This works pretty well for reading in most file types, but I’ve found it very hard to use for compressed files. I’d recommend save_object() for those instead.
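For example, a compressed file can be pulled down with save_object() and handed to readr, which decompresses .bz2 files on read. A sketch assuming the flights file listed earlier:

## Download the compressed file, then let readr handle the decompression
save_object("flights_2008.csv.bz2", bucket = "sample-data-demo",
            file = "flights_2008.csv.bz2")
flights <- readr::read_csv("flights_2008.csv.bz2")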
delete_object
To delete an object on S3, just use the delete_object() function.
aws.s3::delete_object("games.Rdata", bucket = "sample-data-demo")
## [1] TRUE
aws.s3::delete_object("games.csv", bucket = "sample-data-demo")
## [1] TRUE
put_bucket and delete_bucket
I can easily create or delete a bucket with the put_bucket() and delete_bucket() functions.
## Make a bucket.
# put_bucket("stanke123123123")

## And make it disappear.
# delete_bucket("stanke123123123")
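put_bucket() also takes a region argument if you need the bucket created somewhere other than your default region. A sketch, with an example bucket name and region:

## Create the bucket in a specific region
# put_bucket("stanke123123123", region = "us-east-1")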
These functions get you started with S3 on AWS. There are a host of other services available that I’ll continue to share in future posts.