r - Speed up for loop on a large dataset -
i running analysis of citibike data , have created code works - extracts of trips of bikes started different station bike last stopped (that is, instances start.station.id
different previous end.station.id
, means bike moved truck. monthly datasets, however, large, summer months containing on 1 million individual trips (you can find them here: citibike data.
here snapshot of dataset:
head(nov2015, n = 20) tripduration starttime stoptime start.station.id start.station.name 1 1110 11/1/2015 00:00:00 11/1/2015 00:18:31 537 lexington ave & e 24 st 2 1094 11/1/2015 00:00:01 11/1/2015 00:18:15 537 lexington ave & e 24 st 3 520 11/1/2015 00:00:05 11/1/2015 00:08:45 536 1 ave & e 30 st 4 753 11/1/2015 00:00:15 11/1/2015 00:12:48 229 great jones st 5 353 11/1/2015 00:00:22 11/1/2015 00:06:15 285 broadway & e 14 st 6 1285 11/1/2015 00:00:22 11/1/2015 00:21:48 268 howard st & centre st 7 477 11/1/2015 00:00:25 11/1/2015 00:08:23 379 w 31 st & 7 ave 8 362 11/1/2015 00:00:28 11/1/2015 00:06:30 407 henry st & poplar st 9 2316 11/1/2015 00:00:37 11/1/2015 00:39:14 147 greenwich st & warren st 10 627 11/1/2015 00:00:42 11/1/2015 00:11:10 521 8 ave & w 31 st 11 2304 11/1/2015 00:00:44 11/1/2015 00:39:08 147 greenwich st & warren st 12 1471 11/1/2015 00:01:04 11/1/2015 00:25:35 281 grand army plaza & central park s 13 1484 11/1/2015 00:01:36 11/1/2015 00:26:21 281 grand army plaza & central park s 14 284 11/1/2015 00:01:36 11/1/2015 00:06:20 247 perry st & bleecker st 15 886 11/1/2015 00:01:39 11/1/2015 00:16:25 492 w 33 st & 7 ave 16 886 11/1/2015 00:01:42 11/1/2015 00:16:28 492 w 33 st & 7 ave 17 1379 11/1/2015 00:01:44 11/1/2015 00:24:44 512 w 29 st & 9 ave 18 179 11/1/2015 00:01:47 11/1/2015 00:04:47 319 fulton st & broadway 19 309 11/1/2015 00:01:51 11/1/2015 00:07:00 160 e 37 st & lexington ave 20 616 11/1/2015 00:02:08 11/1/2015 00:12:24 479 9 ave & w 45 st start.station.latitude start.station.longitude end.station.id end.station.name 1 40.74026 -73.98409 531 forsyth st & broome st 2 40.74026 -73.98409 531 forsyth st & broome st 3 40.74144 -73.97536 498 broadway & w 32 st 4 40.72743 -73.99379 328 watts st & greenwich st 5 40.73455 -73.99074 151 cleveland pl & spring st 6 40.71911 -73.99973 476 e 31 st & 3 ave 7 40.74916 -73.99160 546 e 30 st & park ave s 8 40.70047 -73.99145 310 state st & smith st 9 40.71542 -74.01122 441 e 52 st & 2 ave 10 40.75097 -73.99444 285 broadway & e 14 st 11 40.71542 -74.01122 441 e 52 st & 2 ave 12 40.76440 -73.97371 367 e 53 st & lexington ave 13 40.76440 -73.97371 367 e 53 st & lexington ave 14 40.73535 -74.00483 453 w 22 st & 8 ave 15 40.75020 -73.99093 377 6 ave & canal st 16 40.75020 -73.99093 377 6 ave & canal st 17 40.75007 -73.99839 445 e 10 st & avenue 18 40.71107 -74.00945 264 maiden ln & pearl st 19 40.74824 -73.97831 362 broadway & w 37 st 20 40.76019 -73.99126 440 e 45 st & 3 ave end.station.latitude end.station.longitude bikeid usertype birth.year gender 1 40.71894 -73.99266 22545 subscriber 1981 2 2 40.71894 -73.99266 23959 subscriber 1980 1 3 40.74855 -73.98808 22251 subscriber 1988 1 4 40.72406 -74.00966 15869 subscriber 1981 1 5 40.72210 -73.99725 21645 subscriber 1987 1 6 40.74394 -73.97966 14788 customer na 0 7 40.74445 -73.98304 21128 subscriber 1962 2 8 40.68927 -73.98913 21016 subscriber 1978 1 9 40.75601 -73.96742 24117 subscriber 1988 2 10 40.73455 -73.99074 17048 subscriber 1986 2 11 40.75601 -73.96742 18241 subscriber 1984 1 12 40.75828 -73.97069 24223 customer na 0 13 40.75828 -73.97069 16779 customer na 0 14 40.74475 -73.99915 17272 subscriber 1976 1 15 40.72244 -74.00566 15008 subscriber 1981 1 16 40.72244 -74.00566 23019 subscriber 1982 1 17 40.72741 -73.98142 23843 subscriber 1962 2 18 40.70706 -74.00732 22538 subscriber 1981 1 19 40.75173 -73.98754 22042 subscriber 1988 1 20 40.75255 -73.97283 22699 subscriber 1982 1
and of code used extract "hidden" bike movements , put them coherent data.frame:
raw_data = read.csv("201511-citibike-tripdata.csv") unique_id = unique(raw_data$bikeid) output <- data.frame("bikeid"= integer(0), "end.station.id"= integer(0), "start.station.id" = integer(0), "diff.time" = numeric(0), "stoptime" = character(),"starttime" = character(), stringsasfactors=false) (bikeid in unique_id) { onebike <- raw_data[ which(raw_data$bikeid== bikeid), ] if(nrow(onebike) >=2 ){ for(i in 2:nrow(onebike )) { if(is.integer(onebike[i-1,"end.station.id"]) & is.integer(onebike[i,"start.station.id"]) & onebike[i-1,"end.station.id"] != onebike[i,"start.station.id"]){ diff_time <- as.double(difftime(strptime(onebike[i,"starttime"], "%m/%d/%y %h:%m:%s"), strptime(onebike[i-1,"stoptime"], "%m/%d/%y %h:%m:%s") ,units = "mins")) new_row <- c(bikeid, onebike[i-1,"end.station.id"], onebike[i,"start.station.id"], diff_time, as.character(onebike[i-1,"stoptime"]), as.character(onebike[i,"starttime"])) output[nrow(output) + 1,] = new_row } } } }
because uses loop, extraction takes long time. there way speed process or rewrite code in way avoids using loop?
the output should stay same. namely:
head(output) bikeid end.station.id start.station.id diff.time stoptime starttime 1 22545 520 529 24.8166666666667 11/2/2015 08:38:22 11/2/2015 09:03:11 2 22545 520 517 537.483333333333 11/2/2015 09:39:19 11/2/2015 18:36:48 3 22545 2004 3230 563.066666666667 11/2/2015 22:06:27 11/3/2015 07:29:31 4 22545 296 3236 471.783333333333 11/4/2015 23:40:29 11/5/2015 07:32:16 5 22545 520 449 43.4166666666667 11/9/2015 08:24:06 11/9/2015 09:07:31 6 22545 359 519 30.7166666666667 11/9/2015 09:14:46 11/9/2015 09:45:29
this solution using data.table takes few minutes me.
uses shift function , adds previous row's data current row bikeid existing data.table.
then filter !is.na(end.station.id)&(end.station.id!=start.station.id) followed removing columns not needed , setting column order.
library(data.table) bikedata<-fread('201511-citibike-tripdata.csv') colnames(bikedata)<-make.names(colnames(bikedata)) bikedata[,c("end.station.id", "diff.time", "stoptime", "starttime") := list(shift(end.station.id,1l,type="lag"), as.double(difftime(strptime(starttime, "%m/%d/%y %h:%m:%s"), strptime(shift(stoptime,1l,type="lag"), "%m/%d/%y %h:%m:%s") ,units = "mins")), as.character(shift(stoptime,1l,type="lag")), as.character(starttime) ), by=bikeid] bikedatamoved<-bikedata[!is.na(end.station.id)&(end.station.id!=start.station.id)] bikedatamoved[, setdiff(colnames(bikedatamoved),c("bikeid","end.station.id", "start.station.id", "diff.time", "stoptime", "starttime")):=null] setcolorder(bikedatamoved, c("bikeid", "end.station.id", "start.station.id", "diff.time", "stoptime", "starttime"))
Comments
Post a Comment