The cabspotting.org project (now inactive) collected data on each of San Francisco's Yellow Cabs, including position, timestamp and whether there was a paying customer at the time. A copy of the original dataset can be obtained from the CrawDad project (see also the Accio's Mobility datasets). You can get a copy of the local data (used in this notebook) here (94Mb file).
The whole dataset consists of 537 text files, with a total of 11,220,490 records. Each text file corresponds to a cab. Here's a sample of one of the files:
37.61632 -122.38803 1 1213034957
37.6158 -122.38463 1 1213034899
37.61639 -122.38593 0 1213034560
37.61574 -122.38801 0 1213034500
37.61393 -122.39732 0 1213034439
Columns are, respectively: latitude, longitude, occupancy (1 if there was a paying customer, 0 otherwise) and timestamp (UNIX epoch, in seconds).
We can think about some interesting questions about this apparently simple data:
Please note that not all of these questions may be answerable: some may require additional data, which is not provided, and some may remain open even with additional data. Even if the data is reliable we cannot be absolutely sure about some possible answers (e.g. the feasibility of identifying circular rides).
The questions (and attempts to answer those) may not lead to good answers, but may lead to better questions about the data!
First let's include the libraries we will need.
library(lubridate)
To read one of the data files into an R data frame we could run:
file <- "Data/cabspottingdata/new_obceoo.txt"
data <- read.table(file, sep = "", header = FALSE, na.strings = "NA", stringsAsFactors = FALSE)
colnames(data) <- c("latitude", "longitude", "occupied", "timestamp")
str(data)
## 'data.frame': 26115 obs. of 4 variables:
## $ latitude : num 37.8 37.8 37.8 37.8 37.8 ...
## $ longitude: num -122 -122 -122 -122 -122 ...
## $ occupied : int 0 0 0 0 0 0 0 0 0 0 ...
## $ timestamp: int 1213036146 1213036088 1213036028 1213035964 1213035907 1213035846 1213035789 1213035725 1213035663 1213035603 ...
We need to convert occupied to a factor and get a date/time representation for timestamp. Let's preserve the original column timestamp since it will be easier to use in some operations.
data$time <- as_datetime(data$timestamp)
data$occupied <- factor(data$occupied)
str(data)
## 'data.frame': 26115 obs. of 5 variables:
## $ latitude : num 37.8 37.8 37.8 37.8 37.8 ...
## $ longitude: num -122 -122 -122 -122 -122 ...
## $ occupied : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ timestamp: int 1213036146 1213036088 1213036028 1213035964 1213035907 1213035846 1213035789 1213035725 1213035663 1213035603 ...
## $ time : POSIXct, format: "2008-06-09 18:29:06" "2008-06-09 18:28:08" ...
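The times printed above are in UTC. For questions about time of day it may help to view them in San Francisco local time. The following is a minimal sketch in base R, assuming the `America/Los_Angeles` time zone is the right one for this data; the variable names `utc` and `local` are only illustrative.

```r
# First timestamp from the str() output above, in seconds since the UNIX epoch
ts <- 1213036146

# Interpret it as an instant in UTC
utc <- as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")

# Show the same instant as wall-clock time in San Francisco
# (assumption: "America/Los_Angeles" is the relevant zone)
local <- format(utc, "%Y-%m-%d %H:%M:%S", tz = "America/Los_Angeles")
print(local)
```

Only the displayed clock time changes; the underlying instant (and the original `timestamp` column) stays the same.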
Now we know how to read and preprocess one data file. What about many files? The original data is spread over 536 files with the same structure, one for each cab. We need to read all files into a single data frame, but if we just do that we will not know which data file (cab) each record came from. So we also need to add a column to the data that tells us which original file each part of the data was read from.
files <- list.files(path = "Data/cabspottingdata/", pattern = "^new_.*\\.txt", full.names = TRUE)
# Keep only the first three files for this example (R indices start at 1)
files <- files[1:3]
allData <- lapply(files, read.table, sep = "", header = FALSE, na.strings = "NA", stringsAsFactors = FALSE)
lapply(allData, nrow)
## [[1]]
## [1] 23495
##
## [[2]]
## [1] 5454
##
## [[3]]
## [1] 21962
allData2 <- do.call(rbind, allData)
nrow(allData2)
## [1] 50911
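The code above combines the files but does not yet record which cab each row came from. One possible way to add that column is sketched below. The helper `read_cab` and the column name `cab` are hypothetical, and the two small files written to a temporary directory are stand-ins so the example runs without the real dataset.

```r
# Hypothetical helper: read one cab file and tag every row with an id
# derived from the file name, e.g. "new_obceoo.txt" -> "obceoo"
read_cab <- function(path) {
  d <- read.table(path, sep = "", header = FALSE, na.strings = "NA",
                  stringsAsFactors = FALSE,
                  col.names = c("latitude", "longitude", "occupied", "timestamp"))
  d$cab <- sub("^new_(.*)\\.txt$", "\\1", basename(path))
  d
}

# Stand-in files so the sketch is self-contained (not the real data)
demo_dir <- file.path(tempdir(), "cabs_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("37.61632 -122.38803 1 1213034957",
             "37.6158 -122.38463 1 1213034899"),
           file.path(demo_dir, "new_aaa.txt"))
writeLines("37.61639 -122.38593 0 1213034560",
           file.path(demo_dir, "new_bbb.txt"))

files <- list.files(path = demo_dir, pattern = "^new_.*\\.txt$", full.names = TRUE)

# Read every file with the helper, then stack the results
allData <- do.call(rbind, lapply(files, read_cab))
print(table(allData$cab))
```

With the real data, pointing `list.files` at `Data/cabspottingdata/` instead of the temporary directory should be the only change needed.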
Here are some issues I ran into while creating this part of the notebook, with links to the solutions: "R list.files: some regexes only return a single file" and "lapply in R, cannot open connection?".
In order to answer some questions we may need to create a new representation of the data from:
latitude | longitude | occupancy | timestamp |
---|---|---|---|
37.61511 | -122.38483 | 1 | 1213033524 |
37.61505 | -122.38479 | 1 | 1213033508 |
37.8045 | -122.40276 | 1 | 1213029887 |
37.80622 | -122.40873 | 1 | 1213029833 |
37.80603 | -122.40993 | 0 | 1213029808 |
37.80455 | -122.40992 | 0 | 1213029748 |
37.8037 | -122.40679 | 0 | 1213029689 |
37.80372 | -122.4064 | 0 | 1213029631 |
37.80175 | -122.41208 | 0 | 1213029480 |
37.80106 | -122.4111 | 0 | 1213029443 |
37.80099 | -122.41097 | 1 | 1213029432 |
37.80086 | -122.4108 | 0 | 1213029424 |
To something like:
startLatitude | startLongitude | endLatitude | endLongitude | startTimestamp | endTimestamp | occupancy | timeLength |
---|---|---|---|---|---|---|---|
37.61511 | -122.38483 | 37.80622 | -122.40873 | 1213033524 | 1213029833 | 1 | 3691 |
37.80603 | -122.40993 | 37.80106 | -122.4111 | 1213029808 | 1213029443 | 0 | 365 |
37.80099 | -122.41097 | 37.80099 | -122.41097 | 1213029432 | 1213029432 | 1 | 0 |
37.80086 | -122.4108 | 37.80086 | -122.4108 | 1213029424 | 1213029424 | 0 | 0 |
More information on the output may be required or desirable.
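One way to build this representation is to group consecutive records with the same occupancy value using run-length encoding (`rle`). This is a minimal sketch for a single cab, not necessarily the approach used in the notebook's source. It follows the table's convention of calling the first record of each run its "start", even though the records are listed newest first.

```r
# Sample records from the first table (one cab, newest record first)
data <- data.frame(
  latitude  = c(37.61511, 37.61505, 37.8045, 37.80622, 37.80603, 37.80455,
                37.8037, 37.80372, 37.80175, 37.80106, 37.80099, 37.80086),
  longitude = c(-122.38483, -122.38479, -122.40276, -122.40873, -122.40993,
                -122.40992, -122.40679, -122.4064, -122.41208, -122.4111,
                -122.41097, -122.4108),
  occupied  = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0),
  timestamp = c(1213033524, 1213033508, 1213029887, 1213029833, 1213029808,
                1213029748, 1213029689, 1213029631, 1213029480, 1213029443,
                1213029432, 1213029424)
)

# Runs of identical consecutive occupancy values
runs  <- rle(data$occupied)
last  <- cumsum(runs$lengths)       # row index of the last record in each run
first <- last - runs$lengths + 1    # row index of the first record in each run

rides <- data.frame(
  startLatitude  = data$latitude[first],
  startLongitude = data$longitude[first],
  endLatitude    = data$latitude[last],
  endLongitude   = data$longitude[last],
  startTimestamp = data$timestamp[first],
  endTimestamp   = data$timestamp[last],
  occupancy      = runs$values,
  timeLength     = data$timestamp[first] - data$timestamp[last]
)
print(rides)
```

For data covering several cabs, the same steps would have to be applied to each cab separately, otherwise a run could span the boundary between two files.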
Preprocessing and exploring the data may lead to more questions about the data. For example, should we keep or remove rides (with or without occupancy) that last for a single timestamp? Should we identify and remove offline intervals? Since the data may be truncated should we remove the first and last rides (since we don't know if they started before or ended after the data was recorded)?
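As an illustration of the offline-interval question, the sketch below looks at the time elapsed between consecutive records. The 10-minute threshold `offline_threshold` is an arbitrary assumption chosen for the example, not a value taken from the data or its documentation.

```r
# Timestamps of the first six sample records (newest first)
timestamp <- c(1213033524, 1213033508, 1213029887,
               1213029833, 1213029808, 1213029748)

# Seconds between each record and the next (older) one;
# diff() is negative for decreasing values, so flip the sign
gap <- -diff(timestamp)

# Assumption: more than 10 minutes without a record counts as "offline"
offline_threshold <- 600

# Positions of the suspicious gaps (gap i lies between records i and i + 1)
offline <- which(gap > offline_threshold)
print(gap)
print(offline)
```

In this sample the flagged gap falls inside a run of occupied records, which suggests that a long "ride" may in fact contain an interval with no data at all.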
Warning: Code and results presented in this document are for reference use only. Code was written to be clear, not efficient. There are several ways to achieve the results, and not all were considered.
See the R source code for this notebook.