Predictive analytics and Kaggle

Bozeman R useR Group

www.kaggle.com

Taxi Competition

Steps of predictive analytics

Step 1: Data processing

  • Data munging/cleaning
  • 80/20 rule for data science (link)
  • For Kaggle competitions, the data comes well organized (CSV)
  • Create the response variable

Equipment

  • RAM: 32 GB
  • SSD
  • Core i7, 8 virtual cores, 4.4 GHz
  • ATLAS for BLAS and LAPACK
train <- read.csv("train.csv")

121 seconds

1.8 GB; 1.7 million rows

library(readr)
train <- read_csv("train.csv")

25 seconds

library(data.table)
train <- fread("train.csv")

35 seconds

  • Get the total number of GPS fixes => trip time (fixes arrive at a fixed 15-second interval in this competition)
  • Split the polyline out of the rest of the training variables
  • Save the polyline and the rest of the training set separately (a minimal sketch follows)
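
A minimal sketch of that split with data.table, assuming the competition's TRIP_ID and POLYLINE column names; counting the "],[" separators is a cheap way to get the number of fixes without parsing the coordinates:

library(data.table)
train <- fread("train.csv")

# number of GPS fixes = number of "],[" separators + 1
train[ , N_GPS := lengths(regmatches(POLYLINE,
            gregexpr("],[", POLYLINE, fixed = TRUE))) + 1 ]

# split the polyline from the remaining variables and save each separately
fwrite(train[ , .(TRIP_ID, POLYLINE) ], "train_polyline.csv")
fwrite(train[ , !"POLYLINE" ], "train_rest.csv")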

Step 2: Feature engineering

Create variables to go into the model.

  • Transformations (log, square, inverse, standardization, discretization)
  • Interactions
  • More complicated…
# base R equivalent: train$new_col <- train$old_col1 / train$old_col2

train <- as.data.table(train)

# add a single derived column by reference
train[ , new_col := old_col1 / old_col2 ]

# add several derived columns at once
train[ , c("new_col1", "new_col2") :=
                         list(old_col1^2, sin(old_col2)) ]
# convert the UNIX timestamp into weekday, month,
# and time of day (as a decimal hour)
thedate <- as.POSIXct(as.numeric(train$TIMESTAMP),
                origin = "1970-01-01", tz = "GMT")

train$DAY <- format(thedate, "%a")
train$MONTH <- format(thedate, "%b")
train$TIME <- as.numeric(format(thedate, "%H")) +
    as.numeric(format(thedate, "%M")) / 60

Step 3: Data augmentation

  • Create even more data out of the data we have
  • Easy example: image analysis
      • rotate
      • blur
      • Instagram-style filters
  • Test data is incomplete trips
  • Training data is complete trips
  • Sample a portion of each whole trip from the training data
import math
import random

with open("./train_polyline.csv") as read_file, \
     open("./train_aug_polyline_raw.csv", 'w') as write_file:

    for line in read_file:
        # number of GPS fixes = number of "],[" separators + 1
        n_gps = line.count('],[') + 1
        # take more samples from longer trips, on a log scale
        num_samples = int(math.ceil(math.log(n_gps / 5.0 + 1) + 1))

        for sample in range(0, num_samples):
            if n_gps == 1:
                num_gps_samp = 1
            else:
                num_gps_samp = int(math.ceil(random.uniform(1, n_gps - 1)))
            # keep only the first num_gps_samp fixes: a partial trip
            new_poly = line.split('],[')[0:num_gps_samp]
            write_file.write("\"" + "\",\"".join(new_poly) + "\"\n")
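
The log scaling in num_samples keeps long trips from dominating the augmented set: a trip with hundreds of fixes yields only a handful of partial copies, while short trips still contribute a couple.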

Step 4: Feature engineering (again)

  • Get information out of the polyline.
  • Define clusters based on all recorded points (a k-means sketch follows the example below).
  • Compute cluster stats for the augmented partial trips.
  • Starting cluster, second cluster, most recent cluster.
Raw polylines, one trip per row:

id1, [1, 2], [3, 4], [5, 6], ...
id2, [7, 8], [9, 10], ...

Long format, one GPS fix per row:

id1, 1, 2,
id1, 3, 4,
id1, 5, 6,
...
id2, 7, 8,
id2, 9, 10,
...

Cluster assignment for each fix:

id, clust
id1, A
id1, A
id1, B
id1, B
...
id1, N
id2, C
id2, D
...
id2, G

Cluster summary per trip:

id, first, second, last
id1,    A,      B,    N
id2,    C,      D,    G
...
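
The letter labels above are illustrative. One way to produce them is k-means over all recorded points; a minimal sketch, assuming a hypothetical long-format file (id, lon, lat) and an arbitrary choice of 20 centers:

library(data.table)

# hypothetical long-format file: one GPS fix per row
points <- fread("train_points_long.csv")

set.seed(1)
km <- kmeans(points[ , .(lon, lat) ], centers = 20, iter.max = 30)

# letter labels to match the illustration above
points[ , clust := LETTERS[km$cluster] ]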
# first, second, and most recent distinct cluster visited on a trip
cluststat <- function(X)
{
    clust <- unique(X)
    list(first = clust[1],
         second = clust[2],
         last = clust[length(clust)])
}

# apply per trip id; clust resolves to each group's column
cluster_data[ , cluststat(clust), by = id]

Step 5: Model

  • DEEP LEARNING
  • random forest
  • k-means
  • GLM!!

h2o

library(h2o)

# use all threads: nthreads = -1
h2oserver <- h2o.init(nthreads = -1)

The H2O web UI is then available at localhost:54321.

  • Model: Poisson
  • Response: number of GPS check-ins remaining
# read the file directly into the h2o cluster
h2otrain <- h2o.importFile("train.csv")
# or push an existing R object: h2otrain <- as.h2o(train)

x <- c("CALL_TYPE", "TAXI_ID", "MONTH", "DAY", "TIME", "FIRST",
       "SECOND", "LAST", "POINTS_SO_FAR")

y <- "POINTS_LEFT"

h2o.fit <- h2o.glm(x = x, y = y, training_frame = h2otrain,
                   family = "poisson")

Step 6: Evaluation and model averaging

Cross Validation

  • Split the training data into validation and train sets
  • Fit on train and predict on validation
  • Evaluate performance (a minimal sketch follows)
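
A minimal sketch of a single 80/20 validation split in base R, using column names from above; the formula and error metric are illustrative, not the competition's:

set.seed(42)
idx <- sample(nrow(train), floor(0.8 * nrow(train)))
tr  <- train[idx, ]
val <- train[-idx, ]

# illustrative Poisson model on a few of the engineered features
fit  <- glm(POINTS_LEFT ~ TIME + DAY + MONTH,
            data = tr, family = poisson)
pred <- predict(fit, newdata = val, type = "response")

# root mean squared error on the held-out set
sqrt(mean((val$POINTS_LEFT - pred)^2))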

Model averaging

  • Make predictions with several different types of models
  • Evaluate each model's performance with cross validation
  • Take a weighted average (weighted consensus) to make the final predictions (sketch below)
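
A minimal sketch of the weighted consensus, assuming hypothetical prediction vectors pred_glm, pred_rf, and pred_dl from three cross-validated models:

# hypothetical weights, e.g. proportional to cross-validated performance
w <- c(glm = 0.5, rf = 0.3, dl = 0.2)

final_pred <- w["glm"] * pred_glm +
              w["rf"]  * pred_rf +
              w["dl"]  * pred_dl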

Try it!

  • www.kaggle.com (Liberty Mutual Property Inspection Prediction)
  • github.com/useRbozeman