Is Data Science getting popular?


Let's see if Data Science is getting more popular through a simple and lazy experiment: querying Clarivate's Web of Science for some terms, and counting how often those terms occur in paper titles in each year they appear.

The Data

Some results were extracted from Clarivate's Web of Science beforehand to facilitate processing. From its main portal we searched for specific terms of interest, clicked on "analyze results", selected "Publication Years", chose to show 25 results, and then used "Download data rows displayed in table". A sample file with downloaded data is shown below.

We created one file for each of the terms "Artificial Intelligence", "Deep Learning", "Data Mining", "Data Science", "Expert System*", "Machine Learning", and "Neural Network*". The corresponding file names are, respectively, WoS_AI.txt, WoS_DL.txt, WoS_DM.txt, WoS_DS.txt, WoS_ES.txt, WoS_ML.txt, and WoS_NN.txt.

Reading the Data into a Data Frame

Before executing any code, let's load the libraries:
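
The code further down calls melt() and ggplot(), so a minimal setup, assuming melt() comes from the reshape2 package (other packages offer a function with the same name), would be:

```r
library(ggplot2)   # ggplot(), scale_size(), guides(), ggtitle()
library(reshape2)  # melt(); assumed here to be the reshape2 version
```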


Our first step is to read that data into a data frame. It won't be straightforward since the format is pure text, without indication of which lines are data, headers or comments. Worse, there are lines at the end of the file that are not data and must be skipped before parsing the other lines.

One way to do that is to read all lines into a vector, remove the unwanted lines from that vector, and then parse what remains:

# Read all lines into a vector:
rawData <- readLines("Data/WoS/WoS_DS.txt")
# Remove first line of that vector:
rawData <- rawData[-1]
# Remove last two lines of that vector:
rawData <- head(rawData,-2)
# Use it to create a data frame:
DSData <- read.table(textConnection(rawData),sep = "",
                     col.names = c("Year","Counts","Percent"))
OK, that works. We should create a function that takes a file name as a parameter and returns the data frame. While we're at it, let's remove the "Percent" column, since we won't use it later. Let's also give the "Counts" column a label specific to each file, which will make merging data frames simpler later.
Here it is:

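readFile <- function(file,colName) {
  # Read all lines, then drop the header line and the two trailing lines.
  rawData <- readLines(file)
  rawData <- rawData[-1]
  rawData <- head(rawData,-2)
  data <- read.table(textConnection(rawData),sep = "",
                     col.names = c("Year",colName,"Percent"))
  # Drop the column we won't use and return the data frame.
  data$Percent <- NULL
  data
}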

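To try these parsing steps without the real exports, here is a small sketch that writes a made-up file in the assumed layout (one header line, data rows, two trailing non-data lines) and parses it. The header text, footer lines, and numbers are invented placeholders, not Web of Science results, and readFileSketch is a hypothetical name for a variant that passes the lines through read.table's text argument instead of textConnection.

```r
# Made-up file contents in the assumed layout. All values are placeholders.
sampleLines <- c(
  "Publication Years\trecords\t% of total",
  "2018\t30\t60.000",
  "2017\t20\t40.000",
  "(placeholder footer line 1)",
  "(placeholder footer line 2)"
)
sampleFile <- tempfile(fileext = ".txt")
writeLines(sampleLines, sampleFile)

# Same steps as the function above, using read.table(text = ...).
readFileSketch <- function(file, colName) {
  rawData <- readLines(file)
  rawData <- rawData[-1]          # drop the header line
  rawData <- head(rawData, -2)    # drop the two trailing lines
  data <- read.table(text = rawData, sep = "",
                     col.names = c("Year", colName, "Percent"))
  data$Percent <- NULL            # not used later
  data
}

parsed <- readFileSketch(sampleFile, "Data Science")
names(parsed)
```

Note that read.table, with its default check.names = TRUE, turns a label such as "Data Science" into the syntactically valid name Data.Science, which is why later code refers to the column with a dot.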
Processing the Data

Let's read data for all queries we did:

AIData <- readFile("Data/WoS/WoS_AI.txt","Artificial Intelligence")
DLData <- readFile("Data/WoS/WoS_DL.txt","Deep Learning")
DMData <- readFile("Data/WoS/WoS_DM.txt","Data Mining")
DSData <- readFile("Data/WoS/WoS_DS.txt","Data Science")
ESData <- readFile("Data/WoS/WoS_ES.txt","Expert Systems")
MLData <- readFile("Data/WoS/WoS_ML.txt","Machine Learning")
NNData <- readFile("Data/WoS/WoS_NN.txt","Neural Networks")

We need to merge all these data frames together. See "Simultaneously merge multiple data.frames in a list" for some ways to do that.

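# Merge pairwise on Year, keeping years that appear in only some of the files.
allData <- Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "Year", all = TRUE),
                  list(AIData,DLData,DMData,DSData,ESData,MLData,NNData))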
This is a bit messy: there are a lot of NAs caused by the merging of the data frames. Let's fix this (here's how: "How do I replace NA values with zeros in an R dataframe?"):

# Replace only the NA cells with zero.
allData[is.na(allData)] <- 0
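
To see why the merge produces NAs and what the replacement does, here is a small sketch on two made-up data frames that only partly overlap in Year. It assumes the merge is a full outer join on Year (all = TRUE). The frames a and b and their values are invented for illustration.

```r
# Two tiny made-up data frames that only partly overlap in Year.
a <- data.frame(Year = c(2000, 2001), A = c(1, 2))
b <- data.frame(Year = c(2001, 2002), B = c(5, 6))

# Full outer join on Year: a year missing from one side yields NA there.
merged <- Reduce(function(x, y) merge(x, y, by = "Year", all = TRUE),
                 list(a, b))
hadNA <- anyNA(merged)

# Replace only the NA cells, leaving the real counts untouched.
merged[is.na(merged)] <- 0
merged
```

The result has one row per year from either frame (2000, 2001, 2002), with zeros where a term had no entry for that year.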
Let's ignore results before 1980. Let's also remove data for 2019 since it is incomplete.

allData <- allData[allData$Year >= 1980, ]
allData <- allData[allData$Year < 2019, ]

We're ready to plot the data. Plotting multiple time series in ggplot2 requires some melting of the data, so that each row holds a year, a term, and the count for that term in that year. We will add another column that defines the thickness of the line, so we can draw some terms with a thicker line.
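
To make the melted layout concrete, here is a minimal sketch on a tiny made-up data frame, assuming melt() is the one from reshape2. The values are invented for illustration.

```r
library(reshape2)  # assumed source of melt()

# Wide format: one row per year, one column per term.
wide <- data.frame(Year = c(2000, 2001),
                   Data.Science = c(1, 2),
                   Deep.Learning = c(0, 3))

# Long format: one row per (year, term) pair.
long <- melt(wide, id = "Year")
colnames(long) <- c("Year", "Term", "Count")
long
```

Two years and two terms give four rows, and Term becomes a factor whose levels are the original column names.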

# Melt the data in the proper format, with specific column names.
melted <- melt(allData,id="Year")
colnames(melted) <- c("Year","Term","Count")
# Set the thickness depending on the term.
melted$thickness <- 1
melted$thickness[melted$Term=="Data.Science"] <- 3
# Plot it with style!
ggplot(melted,aes(x=Year,y=Count,colour=Term,group=Term,size=thickness)) +
  geom_line()+  # one line per term
  scale_size(range = c(1,3), guide="none")+
  guides(colour = guide_legend(override.aes = list(size=3)))+
  ggtitle("Terms' Count")
Updated July 29, 2019