Cabspotting Data EDA

About

The cabspotting.org project (now inactive) collected data on each of San Francisco's Yellow Cabs, including position, timestamp and whether there was a paying customer at the time. A copy of the original dataset can be obtained from the CRAWDAD project (see also the Accio's Mobility datasets). You can get a copy of the local data (used in this notebook) here (94 MB file).

The Data

The whole dataset is composed of 537 text files, with a total of 11,220,490 records. Each text file corresponds to a cab. Here's a sample from one of the files:

37.61632 -122.38803 1 1213034957
37.6158 -122.38463 1 1213034899
37.61639 -122.38593 0 1213034560
37.61574 -122.38801 0 1213034500
37.61393 -122.39732 0 1213034439

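Columns are, respectively: latitude, longitude, occupancy (1 if the cab has a paying customer, 0 otherwise) and timestamp (Unix epoch time, in seconds).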

Questions about the Data

We can think about some interesting questions about this apparently simple data:

  1. What is the temporal coverage of the data?
  2. What is its spatial coverage?
  3. What does the data look like over a map?
  4. Which cab is the busiest?
  5. How many distinct rides do we have (considering a ride a continuous sequence of records with occupancy)?
  6. What are the shortest and longest rides (with simplified distance, i.e. considering straight lines between records)? What is the distribution of the rides' length?
  7. What are the shortest and longest rides in time?
  8. Can we identify offline intervals for the cabs?
  9. How does the distribution of the length of the rides vary with the hour of the day?
  10. How does the distribution of the length of the rides vary with the day of the week?
  11. What is the longest time without a ride, for any cab (longest continuous sequence of records without occupancy)?
  12. Can we find slow spots (e.g. where estimated speed is low)?
  13. Do the slow spots vary during the hours of the day?
  14. Are there circular rides (i.e. rides where the starting and ending points are very close)?
  15. Are there hotspots for passenger pickup, i.e. spots where the average length of the ride is larger than at other spots nearby?
  16. Can we predict the length (distance) of a ride based on time of day, day of week and pickup region?

Please note that not all of these questions may be answerable: some may require additional data, which is not provided, and some may remain open even with additional data. Even if the data is reliable, we cannot be absolutely sure about some possible answers (e.g. the feasibility of identifying circular rides).

The questions (and the attempts to answer them) may not lead to good answers, but may lead to better questions about the data!
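As a rough illustration of how the first two questions could be approached, the sketch below computes the time range and the bounding box of a handful of records. The data frame `sample_data` is a stand-in built from the five sample lines shown earlier, and the names `temporal_coverage` and `spatial_coverage` are only illustrative; the same calls would apply to the full data frame.

```r
library(lubridate)

# Stand-in data: the five sample records shown earlier in this notebook.
sample_data <- data.frame(
  latitude  = c(37.61632, 37.6158, 37.61639, 37.61574, 37.61393),
  longitude = c(-122.38803, -122.38463, -122.38593, -122.38801, -122.39732),
  occupied  = c(1, 1, 0, 0, 0),
  timestamp = c(1213034957, 1213034899, 1213034560, 1213034500, 1213034439)
)

# Question 1: temporal coverage, as the earliest and latest record (UTC).
temporal_coverage <- as_datetime(range(sample_data$timestamp))

# Question 2: spatial coverage, as the bounding box of all positions.
spatial_coverage <- c(
  lat_min = min(sample_data$latitude),
  lat_max = max(sample_data$latitude),
  lon_min = min(sample_data$longitude),
  lon_max = max(sample_data$longitude)
)

temporal_coverage
spatial_coverage
```

A bounding box is only a first approximation of spatial coverage, since a few stray GPS fixes can stretch it considerably.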

Basic processing of the data

First let's include the libraries we will need.

library(lubridate)

Reading and organizing the data in a tidy data frame

To read one of the data files into an R data frame we could run:

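# Path to the trace file of a single cab
file <- "Data/cabspottingdata/new_obceoo.txt"
# sep = "" splits on any whitespace; the files have no header row
data <- read.table(file, sep = "",
                   header = FALSE,
                   na.strings = "NA", stringsAsFactors = FALSE)
colnames(data) <- c("latitude", "longitude", "occupied", "timestamp")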
str(data)
## 'data.frame':	26115 obs. of  4 variables:
##  $ latitude : num  37.8 37.8 37.8 37.8 37.8 ...
##  $ longitude: num  -122 -122 -122 -122 -122 ...
##  $ occupied : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ timestamp: int  1213036146 1213036088 1213036028 1213035964 1213035907 1213035846 1213035789 1213035725 1213035663 1213035603 ...

We need to convert occupied to a factor and get a date/time representation for timestamp. Let's preserve the original column timestamp since it will be easier to use in some operations.

data$time <- as_datetime(data$timestamp)
data$occupied <- factor(data$occupied)
str(data)
## 'data.frame':	26115 obs. of  5 variables:
##  $ latitude : num  37.8 37.8 37.8 37.8 37.8 ...
##  $ longitude: num  -122 -122 -122 -122 -122 ...
##  $ occupied : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ timestamp: int  1213036146 1213036088 1213036028 1213035964 1213035907 1213035846 1213035789 1213035725 1213035663 1213035603 ...
##  $ time     : POSIXct, format: "2008-06-09 18:29:06" "2008-06-09 18:28:08" ...
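One detail worth keeping in mind for the hour-of-day and day-of-week questions: `as_datetime()` returns times in UTC, while the cabs operate in San Francisco. The sketch below, which assumes the "America/Los_Angeles" time zone is the appropriate one, shows how the same instant can be viewed in local time with lubridate's `with_tz()`. The variable names are illustrative.

```r
library(lubridate)

# First timestamp from the str() output above.
utc_time <- as_datetime(1213036146)

# Same instant, displayed in San Francisco local time (assumed time zone).
local_time <- with_tz(utc_time, tzone = "America/Los_Angeles")

hour(utc_time)    # 18
hour(local_time)  # 11

# Day of the week in local time, as a labelled factor.
wday(local_time, label = TRUE)
```

`with_tz()` changes only how the instant is displayed, not the instant itself, so the original `timestamp` column stays valid for arithmetic.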

Reading and organizing all the data in a tidy data frame

Now we know how to read and preprocess one data file. What about many files? The original data is spread over 536 files with the same structure, one for each cab. We need to read all files into a single data frame, but if we just do that we will not know which data file (cab) each record came from. So we also need to add a column to the data that tells us which original file was used for that part of the data.

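# Match only the per-cab trace files
files <- list.files(path = "Data/cabspottingdata/", pattern = "^new_.*\\.txt$", full.names = TRUE)
# Keep only the first three files for this example (R vectors are indexed from 1)
files <- files[1:3]
allData <- lapply(files, read.table, sep = "", header = FALSE, na.strings = "NA", stringsAsFactors = FALSE)
lapply(allData, nrow)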
## [[1]]
## [1] 23495
## 
## [[2]]
## [1] 5454
## 
## [[3]]
## [1] 21962
allData2 <- do.call(rbind,
         allData)
nrow(allData2)
## [1] 50911
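The combined data frame above does not yet record which file each row came from. One possible way to add that column is sketched below. The helper `read_cab()` and the column name `cab` are hypothetical, and the identifier is simply assumed to be the part of the file name between `new_` and `.txt`. To keep the sketch runnable without the real dataset, it writes two tiny stand-in files to a temporary directory.

```r
# Hypothetical helper: read one cab file and tag every row with an
# identifier derived from the file name (e.g. "new_aaa.txt" -> "aaa").
read_cab <- function(path) {
  cab <- read.table(path, sep = "", header = FALSE,
                    col.names = c("latitude", "longitude", "occupied", "timestamp"),
                    stringsAsFactors = FALSE)
  cab$cab <- sub("^new_(.*)\\.txt$", "\\1", basename(path))
  cab
}

# Stand-in files (made-up names, records taken from the sample shown earlier).
demo_dir <- file.path(tempdir(), "cabspotting_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("37.61632 -122.38803 1 1213034957",
             "37.6158 -122.38463 1 1213034899"),
           file.path(demo_dir, "new_aaa.txt"))
writeLines("37.61639 -122.38593 0 1213034560",
           file.path(demo_dir, "new_bbb.txt"))

# Read every file with the helper and stack the results.
demo_files <- list.files(path = demo_dir, pattern = "^new_.*\\.txt$", full.names = TRUE)
all_cabs <- do.call(rbind, lapply(demo_files, read_cab))

# Number of records per cab.
table(all_cabs$cab)
```

Pointing `list.files()` at "Data/cabspottingdata/" instead of `demo_dir` would apply the same idea to the real files.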

Here are some issues I ran into while creating this part of the notebook and links to the solutions: R list.files: some regexes only return a single file, lapply in R, cannot open connection?.

Extracting rides' information from the data

In order to answer some questions we may need to create a new representation of the data from:

latitude  longitude   occupancy  timestamp
37.61511  -122.38483  1          1213033524
37.61505  -122.38479  1          1213033508
37.8045   -122.40276  1          1213029887
37.80622  -122.40873  1          1213029833
37.80603  -122.40993  0          1213029808
37.80455  -122.40992  0          1213029748
37.8037   -122.40679  0          1213029689
37.80372  -122.4064   0          1213029631
37.80175  -122.41208  0          1213029480
37.80106  -122.4111   0          1213029443
37.80099  -122.41097  1          1213029432
37.80086  -122.4108   0          1213029424

To something like:

startLatitude  startLongitude  endLatitude  endLongitude  startTimestamp  endTimestamp  occupancy  timeLength
37.61511       -122.38483      37.80622     -122.40873    1213033524      1213029833    1          3691
37.80603       -122.40993      37.80106     -122.4111     1213029808      1213029443    0          365
37.80099       -122.41097      37.80099     -122.41097    1213029432      1213029432    1          0
37.80086       -122.4108       37.80086     -122.4108     1213029424      1213029424    0          0

More information on the output may be required or desirable.
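One way this transformation could be done is with run-length encoding: `rle()` finds consecutive records with the same occupancy, and each run becomes one row of the new representation. The sketch below is only one possible approach, not necessarily the one this notebook will end up using. The function name `extract_rides()` is hypothetical, "start" and "end" follow the order of the records in the file, and the time length is taken as the absolute difference between the two timestamps.

```r
# Hypothetical helper: collapse consecutive records with the same
# occupancy into one row per run.
extract_rides <- function(data) {
  # rle() needs a plain vector, so undo a possible factor conversion first.
  occupied <- as.integer(as.character(data$occupied))
  runs  <- rle(occupied)
  last  <- cumsum(runs$lengths)       # index of the last record of each run
  first <- last - runs$lengths + 1    # index of the first record of each run
  data.frame(
    startLatitude  = data$latitude[first],
    startLongitude = data$longitude[first],
    endLatitude    = data$latitude[last],
    endLongitude   = data$longitude[last],
    startTimestamp = data$timestamp[first],
    endTimestamp   = data$timestamp[last],
    occupancy      = runs$values,
    timeLength     = abs(data$timestamp[last] - data$timestamp[first])
  )
}

# The twelve records from the first table above.
records <- data.frame(
  latitude  = c(37.61511, 37.61505, 37.8045, 37.80622, 37.80603, 37.80455,
                37.8037, 37.80372, 37.80175, 37.80106, 37.80099, 37.80086),
  longitude = c(-122.38483, -122.38479, -122.40276, -122.40873, -122.40993,
                -122.40992, -122.40679, -122.4064, -122.41208, -122.4111,
                -122.41097, -122.4108),
  occupied  = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0),
  timestamp = c(1213033524, 1213033508, 1213029887, 1213029833, 1213029808,
                1213029748, 1213029689, 1213029631, 1213029480, 1213029443,
                1213029432, 1213029424)
)

rides <- extract_rides(records)
rides
```

When several cabs are stacked in one data frame, the function would have to be applied per cab, otherwise a run could span the boundary between two cabs.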

Preprocessing and exploring the data may lead to more questions about the data. For example, should we keep or remove rides (with or without occupancy) that last for a single timestamp? Should we identify and remove offline intervals? Since the data may be truncated should we remove the first and last rides (since we don't know if they started before or ended after the data was recorded)?
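If we decided to apply some of these filters, they could look roughly like the sketch below. The `rides_demo` data frame holds made-up values (not taken from the dataset) with just the two columns the filters need, and whether each filter is a good idea is exactly the open question raised above.

```r
# Made-up rides, in the order they appear in the data.
rides_demo <- data.frame(
  occupancy  = c(1, 0, 1, 0, 1),
  timeLength = c(600, 120, 0, 0, 300)
)

# Drop rides that last for a single timestamp (time length of zero).
rides_multi <- rides_demo[rides_demo$timeLength > 0, ]

# Drop the first and last rides, which may be truncated by the
# beginning or end of the recording.
rides_inner <- rides_demo[-c(1, nrow(rides_demo)), ]

nrow(rides_multi)
nrow(rides_inner)
```

Note that the second filter only makes sense per cab, since each cab has its own first and last ride.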

To be concluded...

Warning: Code and results presented in this document are for reference use only. Code was written to be clear, not efficient. There are several ways to achieve the results, and not all were considered.

See the R source code for this notebook.

Updated August 09, 2019