r - Speed up for loop on a large dataset -


i running analysis of citibike data , have created code works - extracts of trips of bikes started different station bike last stopped (that is, instances start.station.id different previous end.station.id, means bike moved truck. monthly datasets, however, large, summer months containing on 1 million individual trips (you can find them here: citibike data.

here snapshot of dataset:

  head(nov2015, n = 20)    tripduration          starttime           stoptime start.station.id                start.station.name 1          1110 11/1/2015 00:00:00 11/1/2015 00:18:31              537           lexington ave & e 24 st 2          1094 11/1/2015 00:00:01 11/1/2015 00:18:15              537           lexington ave & e 24 st 3           520 11/1/2015 00:00:05 11/1/2015 00:08:45              536                   1 ave & e 30 st 4           753 11/1/2015 00:00:15 11/1/2015 00:12:48              229                    great jones st 5           353 11/1/2015 00:00:22 11/1/2015 00:06:15              285                broadway & e 14 st 6          1285 11/1/2015 00:00:22 11/1/2015 00:21:48              268             howard st & centre st 7           477 11/1/2015 00:00:25 11/1/2015 00:08:23              379                   w 31 st & 7 ave 8           362 11/1/2015 00:00:28 11/1/2015 00:06:30              407              henry st & poplar st 9          2316 11/1/2015 00:00:37 11/1/2015 00:39:14              147          greenwich st & warren st 10          627 11/1/2015 00:00:42 11/1/2015 00:11:10              521                   8 ave & w 31 st 11         2304 11/1/2015 00:00:44 11/1/2015 00:39:08              147          greenwich st & warren st 12         1471 11/1/2015 00:01:04 11/1/2015 00:25:35              281 grand army plaza & central park s 13         1484 11/1/2015 00:01:36 11/1/2015 00:26:21              281 grand army plaza & central park s 14          284 11/1/2015 00:01:36 11/1/2015 00:06:20              247            perry st & bleecker st 15          886 11/1/2015 00:01:39 11/1/2015 00:16:25              492                   w 33 st & 7 ave 16          886 11/1/2015 00:01:42 11/1/2015 00:16:28              492                   w 33 st & 7 ave 17         1379 11/1/2015 00:01:44 11/1/2015 00:24:44              512                   w 29 st & 9 ave 18          179 11/1/2015 00:01:47 11/1/2015 00:04:47              319              fulton st & broadway 19          309 11/1/2015 00:01:51 11/1/2015 00:07:00              160           e 37 st & lexington ave 20          616 11/1/2015 00:02:08 11/1/2015 00:12:24              479                   9 ave & w 45 st    start.station.latitude start.station.longitude end.station.id         end.station.name 1                40.74026               -73.98409            531   forsyth st & broome st 2                40.74026               -73.98409            531   forsyth st & broome st 3                40.74144               -73.97536            498       broadway & w 32 st 4                40.72743               -73.99379            328  watts st & greenwich st 5                40.73455               -73.99074            151 cleveland pl & spring st 6                40.71911               -73.99973            476          e 31 st & 3 ave 7                40.74916               -73.99160            546     e 30 st & park ave s 8                40.70047               -73.99145            310      state st & smith st 9                40.71542               -74.01122            441          e 52 st & 2 ave 10               40.75097               -73.99444            285       broadway & e 14 st 11               40.71542               -74.01122            441          e 52 st & 2 ave 12               40.76440               -73.97371            367  e 53 st & lexington ave 13               40.76440               -73.97371            367  e 53 st & lexington ave 14               40.73535               -74.00483            453          w 22 st & 8 ave 15               40.75020               -73.99093            377         6 ave & canal st 16               40.75020               -73.99093            377         6 ave & canal st 17               40.75007               -73.99839            445       e 10 st & avenue 18               40.71107               -74.00945            264     maiden ln & pearl st 19               40.74824               -73.97831            362       broadway & w 37 st 20               40.76019               -73.99126            440          e 45 st & 3 ave    end.station.latitude end.station.longitude bikeid   usertype birth.year gender 1              40.71894             -73.99266  22545 subscriber       1981      2 2              40.71894             -73.99266  23959 subscriber       1980      1 3              40.74855             -73.98808  22251 subscriber       1988      1 4              40.72406             -74.00966  15869 subscriber       1981      1 5              40.72210             -73.99725  21645 subscriber       1987      1 6              40.74394             -73.97966  14788   customer         na      0 7              40.74445             -73.98304  21128 subscriber       1962      2 8              40.68927             -73.98913  21016 subscriber       1978      1 9              40.75601             -73.96742  24117 subscriber       1988      2 10             40.73455             -73.99074  17048 subscriber       1986      2 11             40.75601             -73.96742  18241 subscriber       1984      1 12             40.75828             -73.97069  24223   customer         na      0 13             40.75828             -73.97069  16779   customer         na      0 14             40.74475             -73.99915  17272 subscriber       1976      1 15             40.72244             -74.00566  15008 subscriber       1981      1 16             40.72244             -74.00566  23019 subscriber       1982      1 17             40.72741             -73.98142  23843 subscriber       1962      2 18             40.70706             -74.00732  22538 subscriber       1981      1 19             40.75173             -73.98754  22042 subscriber       1988      1 20             40.75255             -73.97283  22699 subscriber       1982      1  

and of code used extract "hidden" bike movements , put them coherent data.frame:

raw_data = read.csv("201511-citibike-tripdata.csv") unique_id = unique(raw_data$bikeid)   output <- data.frame("bikeid"= integer(0), "end.station.id"= integer(0), "start.station.id" = integer(0), "diff.time" = numeric(0),  "stoptime" = character(),"starttime" = character(), stringsasfactors=false)  (bikeid in unique_id) { onebike <- raw_data[ which(raw_data$bikeid== bikeid), ]  if(nrow(onebike) >=2 ){ for(i in 2:nrow(onebike )) { if(is.integer(onebike[i-1,"end.station.id"]) & is.integer(onebike[i,"start.station.id"]) & onebike[i-1,"end.station.id"] != onebike[i,"start.station.id"]){ diff_time <- as.double(difftime(strptime(onebike[i,"starttime"], "%m/%d/%y %h:%m:%s"),                                 strptime(onebike[i-1,"stoptime"], "%m/%d/%y %h:%m:%s")                                 ,units = "mins")) new_row <- c(bikeid, onebike[i-1,"end.station.id"], onebike[i,"start.station.id"], diff_time, as.character(onebike[i-1,"stoptime"]), as.character(onebike[i,"starttime"])) output[nrow(output) + 1,] = new_row } } } } 

because uses loop, extraction takes long time. there way speed process or rewrite code in way avoids using loop?

the output should stay same. namely:

 head(output)   bikeid end.station.id start.station.id        diff.time           stoptime          starttime 1  22545            520              529 24.8166666666667 11/2/2015 08:38:22 11/2/2015 09:03:11 2  22545            520              517 537.483333333333 11/2/2015 09:39:19 11/2/2015 18:36:48 3  22545           2004             3230 563.066666666667 11/2/2015 22:06:27 11/3/2015 07:29:31 4  22545            296             3236 471.783333333333 11/4/2015 23:40:29 11/5/2015 07:32:16 5  22545            520              449 43.4166666666667 11/9/2015 08:24:06 11/9/2015 09:07:31 6  22545            359              519 30.7166666666667 11/9/2015 09:14:46 11/9/2015 09:45:29 

this solution using data.table takes few minutes me.

uses shift function , adds previous row's data current row bikeid existing data.table.

then filter !is.na(end.station.id)&(end.station.id!=start.station.id) followed removing columns not needed , setting column order.

library(data.table) bikedata<-fread('201511-citibike-tripdata.csv') colnames(bikedata)<-make.names(colnames(bikedata))   bikedata[,c("end.station.id",             "diff.time",             "stoptime",             "starttime") :=            list(shift(end.station.id,1l,type="lag"),                 as.double(difftime(strptime(starttime,                                              "%m/%d/%y %h:%m:%s"),                                    strptime(shift(stoptime,1l,type="lag"),                                              "%m/%d/%y %h:%m:%s")                                    ,units = "mins")),                 as.character(shift(stoptime,1l,type="lag")),                 as.character(starttime)            ),          by=bikeid]  bikedatamoved<-bikedata[!is.na(end.station.id)&(end.station.id!=start.station.id)]  bikedatamoved[,               setdiff(colnames(bikedatamoved),c("bikeid","end.station.id",                                                 "start.station.id",                                                 "diff.time",                                                 "stoptime",                                                 "starttime")):=null]  setcolorder(bikedatamoved, c("bikeid",                               "end.station.id",                               "start.station.id",                                      "diff.time",                                         "stoptime",                                        "starttime")) 

Comments

Popular posts from this blog

ruby - Trying to change last to "x"s to 23 -

jquery - Clone last and append item to closest class -

c - Unrecognised emulation mode: elf_i386 on MinGW32 -