Data Science Example - Books about Data Science

About

This notebook demonstrates some basic concepts of Data Science. In this example we will count how often certain keywords appear in a dataset of book titles and do some very basic plotting of the counts per year.

We will use some basic data extracted from a book database dump. The dataset contains two fields: the title of the publication and the year it was published. There are more than one million entries in the dataset. You can download the dataset here (80MB).

Let's import the libraries we will need:

library(data.table)
library(ggplot2)

Reading the data

We can read a CSV file into R with the following code. It parses the file, uses the header (first line) to name the fields, and keeps strings as character vectors instead of converting them to factors.

data <- read.csv("Data/OldPubs/libgenbooks.csv", header = TRUE, stringsAsFactors = FALSE)

Let's see the structure of the dataframe:

str(data)

## 'data.frame':	1221173 obs. of  2 variables:
##  $ Title: chr  "Handbook of Clinical Drug Data" "Handbook of Herbs and Spices" "Handbook of Herbs and Spices Volume 2" "Medical terminology an illustrated guide" ...
##  $ Year : int  2001 2001 2004 2004 1980 2002 2000 2001 2005 2003 ...
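Since data.table is already loaded, its fread function is a commonly used alternative to read.csv for large files. The following is only a minimal sketch: the temporary file and its two rows are stand-ins for the real dataset, and the name books is an assumption made for illustration.

```r
library(data.table)

# Write a tiny stand-in CSV with the same two columns as the real dataset.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Title,Year",
             "Handbook of Herbs and Spices,2001",
             "Medical terminology an illustrated guide,2004"), tmp)

# fread detects the separator and the header row. It returns a data.table,
# so convert it to a plain data.frame to match the rest of this notebook.
books <- as.data.frame(fread(tmp))
str(books)
```

As with read.csv and stringsAsFactors=FALSE, the titles stay as character strings.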

Counting books with specific words in their titles

We want to see whether specific words and combinations of words appear in the books' titles. Let's use a naive approach: grep. We'll just create data frames containing the rows whose titles match the pattern passed as the first argument to grep.

filteredDS <- data[grep("Data Science", data$Title,ignore.case=TRUE), ]
nrow(filteredDS)

## [1] 238
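To make clear what grep returns, here is a small sketch on a made-up vector of titles (the titles are illustrative, not taken from the dataset). It also shows grepl, a close relative that returns one logical value per element.

```r
titles <- c("Data Science from Scratch",
            "Handbook of Herbs and Spices",
            "Doing data science")

# grep returns the positions of the matching elements...
idx <- grep("Data Science", titles, ignore.case = TRUE)
idx

# ...while grepl returns one TRUE/FALSE per element, which can be
# summed to count matches or used directly as a row filter.
hits <- grepl("Data Science", titles, ignore.case = TRUE)
sum(hits)
```

Both forms can be used inside the square brackets of a data frame to select the matching rows.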

Let's try with other keywords!

filteredDM <- data[grep("Data Mining", data$Title,ignore.case=TRUE), ]
nrow(filteredDM)

## [1] 1020

filteredBD <- data[grep("Big Data", data$Title,ignore.case=TRUE), ]
nrow(filteredBD)

## [1] 586

filteredAI <- data[grep("Artificial Intelligence", data$Title,ignore.case=TRUE), ]
nrow(filteredAI)

## [1] 1478

filteredNN <- data[grep("Neural Network", data$Title,ignore.case=TRUE), ]
nrow(filteredNN)

## [1] 883

filteredML <- data[grep("Machine Learning", data$Title,ignore.case=TRUE), ]
nrow(filteredML)

## [1] 835
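The same count can be obtained for several keywords at once. This is a sketch on a small made-up data frame (books and keywords are names assumed for illustration), not a replacement for the step-by-step code above.

```r
books <- data.frame(
  Title = c("Data Mining Basics", "Big Data in Practice",
            "Machine Learning and Data Mining", "Handbook of Herbs and Spices"),
  Year = c(2001, 2014, 2016, 2001),
  stringsAsFactors = FALSE
)
keywords <- c("Data Mining", "Big Data", "Machine Learning")

# For each keyword, count the titles that contain it (ignoring case).
counts <- sapply(keywords, function(k) {
  sum(grepl(k, books$Title, ignore.case = TRUE))
})
counts
```

Note that a title can be counted under more than one keyword, as happens with the third title here.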

grep works, but we need to know how many books with a specific phrase in their titles were published each year. table can help us there.

table(filteredDS$Year)

## 
## 1998 2003 2004 2005 2006 2011 2013 2014 2015 2016 2017 2018 
##    2    1    1    1    2    2   17   19   50   62   71   10
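Notice that table lists only the years that actually occur, so years without matching titles are skipped. If a zero is wanted for those years, one option is to convert the years to a factor with explicit levels first. A sketch with made-up years:

```r
years <- c(1998, 1998, 2003, 2005)

# table() lists only the values that occur in the data.
table(years)

# Converting to a factor with explicit levels keeps empty years as zeros.
full <- table(factor(years, levels = 1998:2005))
full
```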

Looks OK, but we would like to have the counts as dataframes with proper column names. Let's create tables for each subset of the data, convert them to dataframes, relabel these dataframes and ensure that the year is treated as a numeric value (instead of a factor).

DSFrequency <- as.data.frame(table(filteredDS$Year))
names(DSFrequency) <- c("Year","Data Science")
DSFrequency$Year <- as.numeric(levels(DSFrequency$Year))[DSFrequency$Year]
DSFrequency

##    Year Data Science
## 1  1998            2
## 2  2003            1
## 3  2004            1
## 4  2005            1
## 5  2006            2
## 6  2011            2
## 7  2013           17
## 8  2014           19
## 9  2015           50
## 10 2016           62
## 11 2017           71
## 12 2018           10
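The conversion of the year deserves a closer look, because calling as.numeric directly on a factor returns its internal integer codes, not the labels. A minimal sketch with made-up years (Var1 is the default column name that as.data.frame gives to the tabulated variable):

```r
freq <- as.data.frame(table(c(1998, 1998, 2003)))

# Var1 is a factor whose levels are the year labels "1998" and "2003".
wrong <- as.numeric(freq$Var1)                     # internal codes
right <- as.numeric(levels(freq$Var1))[freq$Var1]  # actual years

wrong
right
```

Indexing the converted levels by the factor is what maps each row back to its real year.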

Much better! Let's do the same for the other dataset subsets.

DMFrequency <- as.data.frame(table(filteredDM$Year))
names(DMFrequency) <- c("Year", "Data Mining")
DMFrequency$Year <- as.numeric(levels(DMFrequency$Year))[DMFrequency$Year]

BDFrequency <- as.data.frame(table(filteredBD$Year))
names(BDFrequency) <- c("Year", "Big Data")
BDFrequency$Year <- as.numeric(levels(BDFrequency$Year))[BDFrequency$Year]

AIFrequency <- as.data.frame(table(filteredAI$Year))
names(AIFrequency) <- c("Year", "Artificial Intelligence")
AIFrequency$Year <- as.numeric(levels(AIFrequency$Year))[AIFrequency$Year]

NNFrequency <- as.data.frame(table(filteredNN$Year))
names(NNFrequency) <- c("Year", "Neural Networks")
NNFrequency$Year <- as.numeric(levels(NNFrequency$Year))[NNFrequency$Year]

MLFrequency <- as.data.frame(table(filteredML$Year))
names(MLFrequency) <- c("Year", "Machine Learning")
MLFrequency$Year <- as.numeric(levels(MLFrequency$Year))[MLFrequency$Year]
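The three steps are the same for every keyword, so they could be wrapped in a function. The helper below is hypothetical (keyword_frequency, books and label are names invented for this sketch and are not part of the original notebook), and it is shown on a small made-up data frame.

```r
# Hypothetical helper: count the titles containing `keyword` per year
# and name the count column `label`.
keyword_frequency <- function(books, keyword, label = keyword) {
  matches <- books[grepl(keyword, books$Title, ignore.case = TRUE), ]
  freq <- as.data.frame(table(matches$Year))
  names(freq) <- c("Year", label)
  # Convert the factor of year labels back to numbers.
  freq$Year <- as.numeric(levels(freq$Year))[freq$Year]
  freq
}

books <- data.frame(
  Title = c("Data Mining", "Advanced data mining", "Big Data"),
  Year = c(2001, 2001, 2014),
  stringsAsFactors = FALSE
)
res <- keyword_frequency(books, "Data Mining")
res
```

The explicit version above is longer but makes each step visible, which is the goal of this notebook.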

We need to merge all these dataframes together. See "Simultaneously merge multiple data.frames in a list" for some ways to do that.

all <- Reduce(function(dtf1, dtf2)
             merge(dtf1,dtf2,by="Year",all=TRUE),
             list(DSFrequency,DMFrequency,BDFrequency,AIFrequency,NNFrequency,MLFrequency))
head(all)

##   Year Data Science Data Mining Big Data Artificial Intelligence
## 1  101           NA           1       NA                      NA
## 2 1962           NA          NA       NA                       1
## 3 1968           NA          NA       NA                      NA
## 4 1971           NA          NA       NA                       5
## 5 1975           NA          NA       NA                       3
## 6 1976           NA          NA       NA                       2
##   Neural Networks Machine Learning
## 1              NA               NA
## 2              NA               NA
## 3               1                2
## 4              NA                1
## 5              NA               NA
## 6              NA               NA
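To see where the NAs come from, here is the same Reduce and merge pattern on three tiny made-up dataframes (a, b and c3 are illustrative names).

```r
a  <- data.frame(Year = c(2000, 2001), A = c(1, 2))
b  <- data.frame(Year = c(2001, 2002), B = c(5, 6))
c3 <- data.frame(Year = 2000, C = 9)

# Reduce applies the two-argument function pairwise:
# first a with b, then that result with c3.
# all = TRUE keeps every year found in any input (a full outer join)
# and fills the counts that are missing for a year with NA.
merged <- Reduce(function(x, y) merge(x, y, by = "Year", all = TRUE),
                 list(a, b, c3))
merged
```

Each year that is absent from one of the inputs gets an NA in that input's column.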

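This is a bit messy -- there are a lot of NAs caused by the merging of the dataframes: a year that is missing from one subset gets an NA in that subset's column. Since an NA here means that no matching book was found, let's replace the NAs with zeros (here's how: "How do I replace NA values with zeros in an R dataframe?"):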

all[is.na(all)] <- 0
str(all)

## 'data.frame':	47 obs. of  7 variables:
##  $ Year                   : num  101 1962 1968 1971 1975 ...
##  $ Data Science           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Data Mining            : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ Big Data               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Artificial Intelligence: num  0 1 0 5 3 2 1 5 3 7 ...
##  $ Neural Networks        : num  0 0 1 0 0 0 0 0 1 0 ...
##  $ Machine Learning       : num  0 0 2 1 0 0 0 0 0 0 ...
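The replacement works because is.na applied to a dataframe returns a logical matrix of the same shape, which can be used as an index. A small sketch with made-up values:

```r
df <- data.frame(Year = c(2000, 2001), A = c(1, NA), B = c(NA, 4))

# is.na(df) is a logical matrix with the same shape as df; assigning
# through it replaces only the NA cells and leaves the others untouched.
df[is.na(df)] <- 0
df
```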

There are also books published before 1980, including at least one entry with an obviously invalid year (101). Let's remove those. Let's also remove the data for 2018, since it is incomplete.

all <- all[all$Year >= 1980, ]
all <- all[all$Year < 2018, ]
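The two filters can also be written as a single condition. A sketch on a made-up dataframe (counts and kept are illustrative names):

```r
counts <- data.frame(Year = c(101, 1975, 1980, 2017, 2018),
                     Count = c(1, 3, 2, 71, 10))

# Keep only the rows whose year is between 1980 and 2017, inclusive.
kept <- counts[counts$Year >= 1980 & counts$Year <= 2017, ]
kept$Year
```

Note the single & operator, which compares the two logical vectors element by element.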

We're ready to plot the time series! Since we want one of the lines to be thicker than the others, we need an auxiliary field (see "When creating a multiple line plot in ggplot2, how do you make one line thicker than the others?"). See also "ggplot2: axis manipulation and themes".

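# melt() from data.table works on data.tables: convert, melt, then go back to a data.frame.
melted <- as.data.frame(melt(as.data.table(all), id.vars = "Year"))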
colnames(melted) <- c("Year", "Keyword","Count")
melted$thickness <- 1
melted$thickness[melted$Keyword=="Data Science"] <- 3
head(melted)

##   Year      Keyword Count thickness
## 1 1980 Data Science     0         3
## 2 1981 Data Science     0         3
## 3 1982 Data Science     0         3
## 4 1983 Data Science     0         3
## 5 1984 Data Science     0         3
## 6 1985 Data Science     0         3
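The melt step turns the wide table (one column per keyword) into a long one (one row per year and keyword), which is the layout ggplot2 expects. A sketch with made-up counts, naming the new columns directly in the call instead of renaming them afterwards:

```r
library(data.table)

# Illustrative values only.
wide <- data.table(Year = c(2016, 2017),
                   `Data Science` = c(60, 70),
                   `Big Data` = c(100, 90))

# One row per (Year, Keyword) pair instead of one column per keyword.
long <- melt(wide, id.vars = "Year",
             variable.name = "Keyword", value.name = "Count")
long
```

The Keyword column is a factor, so it can be mapped straight to colour and group in the plot.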

ggplot(melted,aes(x=Year,y=Count,colour=Keyword,group=Keyword,size=thickness)) +
  geom_line()+
  scale_size(range = c(1,3), guide="none")+
  scale_x_continuous("Year", breaks=seq(1980,2017,2))+
  guides(colour = guide_legend(override.aes = list(size=3)))+
  ggtitle("Books")+
  theme(axis.title=element_text(size=14),
        axis.text.x=element_text(size=11,angle=-90,vjust=0.5,hjust=1),
        axis.text.y=element_text(size=12),
        legend.title=element_text(size=14),
        legend.text=element_text(size=12),
        plot.title=element_text(size=22))
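In ggplot2 3.4.0 and later, the thickness of lines is controlled by the linewidth aesthetic, and using size for lines produces a deprecation warning. Assuming such a version is installed, the thickness trick could be sketched as follows, on made-up data and without the theme adjustments:

```r
library(ggplot2)

# Illustrative data only; the counts are made up.
toy <- data.frame(
  Year = rep(2015:2017, times = 2),
  Keyword = rep(c("Data Science", "Big Data"), each = 3),
  Count = c(50, 60, 70, 90, 100, 95)
)
toy$thickness <- ifelse(toy$Keyword == "Data Science", 3, 1)

# Same idea as the plot above, with linewidth in place of size.
p <- ggplot(toy, aes(x = Year, y = Count, colour = Keyword,
                     group = Keyword, linewidth = thickness)) +
  geom_line() +
  scale_linewidth(range = c(1, 3), guide = "none")
p
```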

Exercises

Here are some suggested exercises:

Our simple use of grep is not a very good choice: it ignores case but matches only the exact phrase, so we may miss titles that could also be interesting (e.g. "Data Scientist"). Try a better approach to filter books by title.
Try different keywords in different domains to see changes in publications' themes.
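As a possible starting point for the first exercise, a regular expression can cover several variants of a phrase at once. This is only one of many options, shown on made-up titles; the pattern is an assumption about which variants are of interest.

```r
titles <- c("Data Science for Business",
            "The Data Scientist's Handbook",
            "Database Systems",
            "Metadata science fiction")

# Match "data science", "data scientist" or "data scientists" as whole
# words, ignoring case. \\b marks a word boundary, so "Metadata" is skipped.
pattern <- "\\bdata scien(ce|tists?)\\b"
hits <- grepl(pattern, titles, ignore.case = TRUE, perl = TRUE)
titles[hits]
```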

Warning: Code and results presented in this document are for reference use only. Code was written to be clear, not efficient. There are several ways to achieve the results, and not all were considered.

See the R source code for this notebook.

Updated July 29, 2019

CAP394

Rafael Santos