Lattes CVs Database

About

The Lattes Platform is an information system about researchers and institutions in Brazil. Individual researchers can use a web-based interface to list their CVs (publications, work experience, achievements, etc.). Information on that platform is used by several other systems and applications, both by the Brazilian government and by bibliometrics researchers.

There are more than 5.5 million individual CVs published on the platform. It is possible to retrieve individual CVs in a zipped XML file, but batch downloading is not possible (except for institutes and universities that have signed an agreement with CNPq, the National Council for Scientific and Technological Development, which maintains the platform).

Data in these XML files is less organized than one might expect. A good part of the data is user-generated and therefore may be incomplete or inconsistent. For example, suppose researchers A, B and C are coauthors of a paper. This paper may be listed with slightly different titles in A's, B's and C's Lattes CVs. Each coauthor may also list the other coauthors' names with different spellings, or omit them altogether. Naming consistency is not enforced: one CV may list one of the researcher's coauthors as "Smith, J."; another as "John Smith" or "J Smith" and so on -- and there is no guarantee that these names refer to the same person. This gets even worse for foreign coauthors, who often do not have entries in the Lattes CVs database.

The Lattes platform imposes order in some aspects: researchers can enter the Digital Object Identifier (DOI) of a paper, and the platform will retrieve data from the DOI system. Some coauthors are also identifiable through the web interface of the CV: every researcher has a 16-digit identifier, and coauthors are sometimes identified by both name and ID. Since most of the data is entered manually, however, mistakes are still present in the CVs.

The Data

Raw Data

This data is not tidy: there is too much information to be contained in a single table. To facilitate some of the possible analyses, we selected a subset of the 5.5 million Lattes CVs and extracted aspects of its data into different files. For this project we chose the CVs of the researchers who were recipients of productivity grants at levels 1A and 1B when the data was collected (around July 2018). There were 1257 recipients of level 1A grants and 1240 recipients of level 1B grants, for a total of 2497 researchers, but not all CVs were downloaded at that time: only 2282 CVs were downloaded. You can download a 364MB zip file containing those CVs.

Each CV is stored in a zip file which contains a single entry: the curriculo.xml file. The zip file's name is the 16-digit identifier used in the CV itself. You can download the DTD file for the CVs or get more information on it at the Extração de Dados (Data Extraction) web site (in Portuguese).

To make things worse, some downloaded CVs may even be empty due to a system error, meaning that the XML content does not conform to the DTD. Other CVs, for some reason, do not contain some of the data we will need. We will skip these CVs for some parts of this project.
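
As a rough illustration of how one of these files could be read and how unusable CVs could be skipped, here is a minimal sketch using the xml2 package. The element and attribute names (DADOS-GERAIS, NUMERO-IDENTIFICADOR, NOME-COMPLETO) are assumptions that should be checked against the DTD, and the function names read_cv_metadata and read_cv_zip are hypothetical; this is not the script that produced the CSV files described below.

```r
library(xml2)

# Read one extracted curriculo.xml and return a one-row data frame,
# or NULL when the file is missing, empty, or not well-formed XML.
# Element and attribute names are assumptions; check them against the DTD.
read_cv_metadata <- function(xml_path) {
  if (length(xml_path) != 1 || !file.exists(xml_path)) return(NULL)
  if (file.size(xml_path) == 0) return(NULL)
  doc <- tryCatch(read_xml(xml_path), error = function(e) NULL)
  if (is.null(doc)) return(NULL)
  root <- xml_root(doc)
  general <- xml_find_first(doc, "//DADOS-GERAIS")
  data.frame(
    i16  = xml_attr(root, "NUMERO-IDENTIFICADOR"),
    name = xml_attr(general, "NOME-COMPLETO"),
    stringsAsFactors = FALSE
  )
}

# Extract the single entry of a CV zip file into a temporary directory and parse it.
read_cv_zip <- function(zip_path) {
  out_dir <- tempfile("cv_")
  extracted <- unzip(zip_path, files = "curriculo.xml", exdir = out_dir)
  read_cv_metadata(extracted)
}
```

A CV for which the function returns NULL, or for which a needed field comes back as NA, could then simply be left out of the corresponding analysis.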

Preprocessed Data

We used some scripts to extract data from the selected CVs, in the CSV format. Data is organized in separate files:

Lattes_Metadata.csv (3.3 MB, 2279 lines): contains basic metadata about the Lattes CVs. Each line contains:

name: name of the researcher.
i16: researcher's 16 digit identifier.
date: date of the last update of the CV (YYYYMMDD).
countryB: country of birth.
institution: institution of address given by researcher.
country: country of address given by researcher.
abstract: researcher's CV abstract.

Lattes_Educational.csv (1.3 MB, 6809 lines): contains information on the educational background of each researcher in the list of CVs. Information about undergraduate, master's and PhD courses was collected (where present), so each researcher may have more than one line associated with their ID. Each line contains:

i16: researcher's 16 digit identifier.
level: educational level (e.g. "DOUTORADO", "GRADUACAO").
code_institution: code of the educational institution.
name_institution: name of the educational institution.
code_course: code of the course.
name_course: name of the course.
start: year course started.
end: year course finished.
thesis: title of the thesis.
advisor: advisor's 16 digit identifier. May be empty.
nadvisor: Name of advisor.

Lattes_Papers.csv (1.1 GB, 5456412 lines): contains information on the papers published in journals by the researchers, including coauthorship. This file contains some repetition: since each paper may have several coauthors, the paper's information is repeated for each pair of coauthors. A paper by authors X, Y, Z and W listed in X's CV will have one line for each pair XY, XZ, XW. If W is also in the database there will be more entries for WX, WY, WZ. Resolving redundancies and ambiguities is an important task.
Each line contains:

i16: researcher's 16 digit identifier.
coi16: researcher's coauthor's 16 digit identifier. May be empty (in case of non-identified coauthors, e.g. identified only by name).
coname: researcher's coauthor's name.
year: year of publication of paper.
title: title of paper.
journal: name of journal.
issn: ISSN of journal.
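
Because of this pairwise layout, counting rows overcounts papers. A minimal sketch of one way to collapse the pairs into one row per paper per CV is shown below, on made-up toy data with the same column names. It only removes exact repetitions: the same paper listed with slightly different titles in different CVs would still count as different papers.

```r
# Toy data in the same layout as Lattes_Papers.csv (all values are made up)
papers <- data.frame(
  i16    = c("0000000000000001", "0000000000000001",
             "0000000000000001", "0000000000000002"),
  coi16  = c("0000000000000002", NA, NA, "0000000000000001"),
  coname = c("Y", "Z", "W", "X"),
  year   = c(2010L, 2010L, 2010L, 2010L),
  title  = c("Paper A", "Paper A", "Paper A", "Paper A"),
  stringsAsFactors = FALSE
)

# One row per paper as listed in each CV:
# drop the coauthor columns, then remove duplicated rows
papers_per_cv <- unique(papers[, c("i16", "year", "title")])

# Number of distinct papers listed in each CV
paper_counts <- table(papers_per_cv$i16)
```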

Lattes_Conferences.csv (291 MB, 1368082 lines): contains information on the papers published in conferences by the researchers, including coauthorship. This file contains some repetition: since each paper may have several coauthors, the paper's information is repeated for each pair of coauthors. A paper by authors X, Y, Z and W listed in X's CV will have one line for each pair XY, XZ, XW. If W is also in the database there will be more entries for WX, WY, WZ. Resolving redundancies and ambiguities is an important task.
Each line contains:

i16: researcher's 16 digit identifier.
coi16: researcher's coauthor's 16 digit identifier. May be empty (in case of non-identified coauthors, e.g. identified only by name).
coname: researcher's coauthor's name.
year: year of publication of paper.
title: title of paper.
proceedings: name of conference/title of proceedings.
nature: whether the paper is a complete paper, expanded abstract, etc.
class: whether the conference was international, national, local, etc.

Lattes_Supervisions.csv (31 MB, 112874 lines): contains information on the students who were supervised by the researchers. Only master's and PhD supervisions were considered. Each line contains:

i16: researcher's 16 digit identifier.
sti16: student's 16 digit identifier. May be empty (in case of non-identified students, e.g. identified only by name).
stname: student's name.
year: year of publication of thesis.
code_institution: code of the educational institution.
name_institution: name of the educational institution.
code_course: code of the course.
name_course: name of the course.
nature: whether it is thesis or dissertation.
type: information on the type (academic, professional) of the course.
title: title of thesis or dissertation.

Additional Data

Additional data that can be useful:

Here is the XML file for my Lattes CV. Open it in a browser to see it formatted (tested in Chrome).

Questions about the Data

Let's think about questions about the data.

Get the date of the CV that was most recently updated when the data was collected (we don't know for sure the date when the CVs were collected, but we know it was around July 2018).
Complete the "tidyification" of the other data files.
How many researchers per country of birth do we have?
Get a histogram of the "age" of the CVs' data (the date of the last update).
Show the evolution of numbers of publications through time, per researcher.
List the IDs of researchers that ought to be in the CV database (e.g. by being coauthors or advisors/advisees) but are not. One example: 8699821828310072
How many researchers per country of birth and grant level (1A, 1B) do we have?
How many researchers per institution do we have?
Which are the shortest and longest time to get a PhD (considering only researchers in this database)?
Who are the top 10 most prolific advisors for PhDs? And for undergraduate theses?
Create a word cloud with all the CVs' abstracts.
Create a word cloud with all the papers' titles for a specific CV.
Show the evolution of numbers of publications through time, per identifiable group (e.g. university).
How many researchers have a PhD but not a masters degree?
Can we identify frequent collaborators?
Can we identify names of researchers that ought to be in the CV database (e.g. by being coauthors) but are not? Note that this list is not the same one as the ID list: there are coauthors identified by name but not by ID.
With the list of IDs and names of the researchers that are not in the database, create a list with only one ID per row and all the names associated with that ID. For example: the researcher with ID 1234567812345678 may have entries as coauthor or advisor/advisee with different names ("J. Smith", "Smith, J.", "John Smith"). This list could be used for unification/disambiguation of authors.
Compare publication counts (considering both journal papers and conference articles) for researchers with grants 1A and 1B. Use, for example, a scatterplot for this.
Can we extract advisor genealogies from the data?
How can we disambiguate coauthors' names? To do this we will need a list of all possible names and IDs associated with these names (see other questions in this list).
Compare publication counts (considering both journal papers and conference articles) for researchers with grants 1A and 1B by year or period of years. Use, for example, a scatterplot for this.
Can we identify frequent indirect collaborators (e.g. A often publishes with B, B with C but not A with C)?
Can we identify frequent collaborators' changes through time (e.g. collaborations that ceased to be frequent)?
How can we disambiguate papers' titles?
Can we assess the quality of the data in a CV (e.g. its completeness)?
Can we suggest potential cooperation between researchers?
I cheated: the original data is in a zipped XML file, which was parsed and filtered by some scripts outside of this notebook. How could we get rid of the external parsing step? Hint: there are libraries to parse XML in R!
Think about more items for this list that could be answered with this data.

Please note that not all of these questions may be answerable: there is no guarantee that they can be answered without additional data, which is not provided, or even with it. Even if the data is reliable we cannot be absolutely sure about some possible answers (e.g. disambiguation). In particular, for these questions, one must consider how they can be answered and how we can be sure the data is adequate to answer them.

The questions (and attempts to answer those) may not lead to good answers, but may lead to better questions about the data!
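
To make one of the questions above more concrete, here is a minimal sketch of the "one ID per row with all associated names" list, built on made-up toy data. The data frame mentions, its column names and the vector known_ids are hypothetical; in practice the mentions would have to be gathered from the coauthor, advisor and student columns of the CSV files.

```r
# Toy records: one row per (ID, name) mention; all values are made up
mentions <- data.frame(
  id   = c("1234567812345678", "1234567812345678", "1234567812345678",
           "8765432187654321", NA),
  name = c("J. Smith", "Smith, J.", "John Smith", "M. Silva", "Unknown Person"),
  stringsAsFactors = FALSE
)

# IDs of researchers whose CVs are already in the database (hypothetical)
known_ids <- c("8765432187654321")

# Keep mentions that carry an ID which is not in the database
missing <- mentions[!is.na(mentions$id) & !(mentions$id %in% known_ids), ]

# One row per ID, with all distinct name variants pasted together
names_by_id <- aggregate(
  name ~ id, data = missing,
  FUN = function(x) paste(sort(unique(x)), collapse = " | ")
)
```

Mentions without an ID are dropped here; they belong to the separate list of names without IDs and would need some form of name matching instead.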

Basic processing of the data

First let's include the libraries we will need.

library(stringr)

Reading and tidying up the metadata information file

file = "Data/Lattes/Lattes_Metadata.csv"
metadata <- read.csv(file,header=TRUE,sep=",",stringsAsFactors=FALSE)
str(metadata)

## 'data.frame':	2278 obs. of  7 variables:
##  $ name       : chr  "Ana Luiza Coelho Netto" "Edgard Graner" "Afranio Lineu Kritski" "Licio Augusto Velloso" ...
##  $ i16        : num  3.26e+11 4.60e+12 8.77e+12 9.61e+12 3.49e+13 ...
##  $ date       : int  20180416 20180704 20180802 20180316 20180808 20180809 20180313 20180330 20180703 20180809 ...
##  $ countryB   : chr  "Brasil" "Brasil" "Brasil" "Brasil" ...
##  $ institution: chr  "Universidade Federal do Rio de Janeiro" "Universidade Estadual de Campinas" "Universidade Federal do Rio de Janeiro" "Universidade Estadual de Campinas" ...
##  $ country    : chr  "Brasil" "Brasil" "Brasil" "Brasil" ...
##  $ abstract   : chr  "Bacharel em Geografia na Universidade Federal do Rio de Janeiro (1973); M.Sc. em Geografia Física/Geomorfologia"| __truncated__ "Graduou-se em Odontologia pela Universidade Estadual de Campinas em 1990, realizou o mestrado em Biologia e Pat"| __truncated__ "Possui graduação em Medicina pela Faculdade Evangélica do Paraná (1980), mestrado em Pneumologia e Tisiologia p"| __truncated__ "Professor Titular do Departamento de Clínica Médica da UNICAMP. Membro Titular da Academia Brasileira de Cienci"| __truncated__ ...
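
Note that i16 was read as a number. Doubles cannot represent every integer above 2^53 (about 9.007e15), so some 16-digit identifiers could in principle be rounded, and converting large doubles to text may produce scientific notation. This may also be why sprintf with a %d format fails on this column: %d expects integers, and these values are doubles too large for R's integer type. A possible alternative, sketched below on made-up CSV text, is to read the identifier column as character through the colClasses argument of read.csv and pad it with base R. The name toy and its values are illustrative only, and the outputs shown in this notebook come from the original numeric read.

```r
# Toy CSV text standing in for Lattes_Metadata.csv (all values are made up)
csv_text <- "name,i16,date\nA,325690951570,20180416\nB,9999999999999999,20180704"

# Read the identifier column as text so that no digits are lost
toy <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE,
                colClasses = c(i16 = "character"))

# Left-pad with zeros to 16 characters using base R only
toy$i16 <- paste0(strrep("0", 16 - nchar(toy$i16)), toy$i16)
```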

Let's set the proper data types:

# https://stackoverflow.com/questions/5812493/how-to-add-leading-zeros
metadata$i16 <- str_pad(metadata$i16,16,pad="0") # sprintf does not work, why?
metadata$countryB <- as.factor(metadata$countryB)
metadata$institution <- as.factor(metadata$institution)
metadata$country <- as.factor(metadata$country)
# https://stackoverflow.com/questions/17518564/convert-yyyymmdd-string-to-date-class-in-r
metadata$date <- as.Date(as.character(metadata$date), "%Y%m%d")
str(metadata)

## 'data.frame':	2278 obs. of  7 variables:
##  $ name       : chr  "Ana Luiza Coelho Netto" "Edgard Graner" "Afranio Lineu Kritski" "Licio Augusto Velloso" ...
##  $ i16        : chr  "0000325690951570" "0004597502232092" "0008770194107817" "0009613666274466" ...
##  $ date       : Date, format: "2018-04-16" "2018-07-04" ...
##  $ countryB   : Factor w/ 46 levels "Alemanha","Argentina",..: 7 7 7 7 7 7 38 7 7 7 ...
##  $ institution: Factor w/ 205 levels "","Associação Brasileira de Teoria e Análise Musical",..: 177 126 177 126 153 155 126 105 153 156 ...
##  $ country    : Factor w/ 3 levels "","Brasil","Canadá": 2 2 2 2 2 2 2 2 2 2 ...
##  $ abstract   : chr  "Bacharel em Geografia na Universidade Federal do Rio de Janeiro (1973); M.Sc. em Geografia Física/Geomorfologia"| __truncated__ "Graduou-se em Odontologia pela Universidade Estadual de Campinas em 1990, realizou o mestrado em Biologia e Pat"| __truncated__ "Possui graduação em Medicina pela Faculdade Evangélica do Paraná (1980), mestrado em Pneumologia e Tisiologia p"| __truncated__ "Professor Titular do Departamento de Clínica Médica da UNICAMP. Membro Titular da Academia Brasileira de Cienci"| __truncated__ ...
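
With the types set, some of the simpler questions can be attempted directly. The sketch below uses a small made-up data frame with the same column names and types as metadata, so the values are not results from the real data.

```r
# Toy stand-in for the tidied metadata (all values are made up)
metadata <- data.frame(
  i16      = c("0000000000000001", "0000000000000002", "0000000000000003"),
  date     = as.Date(c("2018-04-16", "2018-07-04", "2018-08-09")),
  countryB = factor(c("Brasil", "Brasil", "Argentina")),
  stringsAsFactors = FALSE
)

# Date of the most recently updated CV
latest_update <- max(metadata$date)

# Number of researchers per country of birth, most frequent first
by_country <- sort(table(metadata$countryB), decreasing = TRUE)

# "Age" of each CV in days, relative to the most recent update
age_days <- as.numeric(latest_update - metadata$date)
```

Calling hist(age_days) would then give the histogram of CV ages mentioned in the list of questions.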

Reading and tidying up the educational background information file

file = "Data/Lattes/Lattes_Educational.csv"
education <- read.csv(file,header=TRUE,sep=",",stringsAsFactors=FALSE)
str(education)

## 'data.frame':	6808 obs. of  11 variables:
##  $ i16             : num  3.26e+11 3.26e+11 3.26e+11 4.60e+12 4.60e+12 ...
##  $ level           : chr  "GRADUACAO" "MESTRADO" "DOUTORADO" "GRADUACAO" ...
##  $ code_institution: chr  "020200000009" "020200000009" "001800000992" "007900000004" ...
##  $ name_institution: chr  "Universidade Federal do Rio de Janeiro" "Universidade Federal do Rio de Janeiro" "Katholieke Universiteit Leuven - Belgium" "Universidade Estadual de Campinas" ...
##  $ code_course     : int  90000001 90000026 90000027 90000001 33070148 33070148 90000004 90000001 33140308 90000001 ...
##  $ name_course     : chr  "Geografia" "Programa de Pós-Graduação em Geografia" "Geografie-Geologie" "Odontologia" ...
##  $ start           : int  1970 1974 1980 1987 1991 1994 1975 1986 1992 1980 ...
##  $ end             : int  1973 1979 1985 1990 1993 1996 1980 1990 1995 1983 ...
##  $ thesis          : chr  "" "(Com Louvor) - Processos Erosivos nas Encostas do Maciço da Tijuca, RJ: condicionantes e diretrizes" "(Summa Cum Lauda) Surface Hydrology and Soil Erosion in a Tropical Mountainous Rainforest Drainage Basin, Rio d"| __truncated__ "" ...
##  $ advisor         : num  NA NA NA NA NA ...
##  $ nadvisor        : chr  "Maria Regina Mousinho de Meis" "Maria Regina Mousinho de Meis" "Professor Jan De Ploey" "" ...

Let's set the proper data types:

# https://stackoverflow.com/questions/5812493/how-to-add-leading-zeros
education$i16 <- str_pad(education$i16,16,pad="0")
education$level <- as.factor(education$level)
education$code_institution <- as.factor(str_pad(education$code_institution,12,pad="0"))
education$name_institution <- as.factor(education$name_institution)
education$code_course <- as.factor(str_pad(education$code_course,8,pad="0"))
education$name_course <- as.factor(education$name_course)
education$advisor <- str_pad(education$advisor,16,pad="0")
str(education)

## 'data.frame':	6808 obs. of  11 variables:
##  $ i16             : chr  "0000325690951570" "0000325690951570" "0000325690951570" "0004597502232092" ...
##  $ level           : Factor w/ 3 levels "DOUTORADO","GRADUACAO",..: 2 3 1 2 3 1 2 3 1 2 ...
##  $ code_institution: Factor w/ 427 levels "000000000000",..: 105 105 30 70 70 70 347 105 60 70 ...
##  $ name_institution: Factor w/ 948 levels "","A. F. IOFFE PHYSICO-TECHNICAL INSTITUTE",..: 615 615 355 567 567 567 136 615 600 567 ...
##  $ code_course     : Factor w/ 675 levels "00000001","00000002",..: 597 622 623 597 355 355 600 597 428 597 ...
##  $ name_course     : Factor w/ 1693 levels "","(PhD) Estatistica",..: 936 1480 943 1350 215 215 1231 1450 1001 1231 ...
##  $ start           : int  1970 1974 1980 1987 1991 1994 1975 1986 1992 1980 ...
##  $ end             : int  1973 1979 1985 1990 1993 1996 1980 1990 1995 1983 ...
##  $ thesis          : chr  "" "(Com Louvor) - Processos Erosivos nas Encostas do Maciço da Tijuca, RJ: condicionantes e diretrizes" "(Summa Cum Lauda) Surface Hydrology and Soil Erosion in a Tropical Mountainous Rainforest Drainage Basin, Rio d"| __truncated__ "" ...
##  $ advisor         : chr  NA NA NA NA ...
##  $ nadvisor        : chr  "Maria Regina Mousinho de Meis" "Maria Regina Mousinho de Meis" "Professor Jan De Ploey" "" ...
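
Two of the questions about educational background can be sketched from these columns alone. The example below uses made-up toy data with the same column names; on the real data, missing or inconsistent start and end years would need to be checked before trusting the result.

```r
# Toy stand-in for the tidied education data (all values are made up)
education <- data.frame(
  i16   = c(rep("0000000000000001", 3), rep("0000000000000002", 2)),
  level = factor(c("GRADUACAO", "MESTRADO", "DOUTORADO",
                   "GRADUACAO", "DOUTORADO")),
  start = c(1970L, 1974L, 1980L, 1987L, 1991L),
  end   = c(1973L, 1979L, 1985L, 1990L, 1997L),
  stringsAsFactors = FALSE
)

# Shortest and longest time to get a PhD, in years
phd <- education[education$level == "DOUTORADO", ]
phd$duration <- phd$end - phd$start
phd_range <- range(phd$duration, na.rm = TRUE)

# Researchers with a PhD entry but no master's entry
has_phd <- unique(education$i16[education$level == "DOUTORADO"])
has_msc <- unique(education$i16[education$level == "MESTRADO"])
phd_without_msc <- setdiff(has_phd, has_msc)
```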

Reading and tidying up the papers file

file = "Data/Lattes/Lattes_Papers.csv"
papers <- read.csv(file,header=TRUE,sep=",",stringsAsFactors=FALSE)
str(papers)

## 'data.frame':	5456411 obs. of  7 variables:
##  $ i16    : num  3.26e+11 3.26e+11 3.26e+11 3.26e+11 3.26e+11 ...
##  $ coi16  : num  NA NA 5.22e+15 7.87e+15 8.28e+15 ...
##  $ coname : chr  "A. M. F. MONTEIRO" "M. R. M. MEIS" "Andre de Souza Avelar" "N. F. FERNANDES" ...
##  $ year   : chr  "1974" "1974" "1992" "1994" ...
##  $ title  : chr  "Formacao Macacu: variações texturais e aproveitamentos economico." "Formacao Macacu: variações texturais e aproveitamentos economico." "Fraturas e Desenvolvimento de Unidades Geomorfologicas Concavas no Medio Vale do Rio Paraiba do Sul." "Subsurface Hydrology Of Layered Colluvium Mantle In Unchanneled Valleys: Southeastern Brazil." ...
##  $ journal: chr  "Boletim Paulista de Geografia" "Boletim Paulista de Geografia" "Revista Brasileira de Geociências" "Earth Surface Processes and Landforms" ...
##  $ issn   : chr  "00066079" "00066079" "03757536" "01979337" ...
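
The types can then be set along the same lines as for the other files. The following is only a sketch on made-up toy data: pad_id is a hypothetical helper, and the non-numeric year value is invented to show why year may have been read as text.

```r
# Toy stand-in for the papers data as read above (all values are made up)
papers <- data.frame(
  i16     = c(325690951570, 325690951570),
  coi16   = c(NA, 5220000000000000),
  coname  = c("A. Author", "B. Author"),
  year    = c("1974", "unknown"),
  journal = c("Journal One", "Journal Two"),
  issn    = c("00066079", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: format a numeric ID as 16 zero-padded digits, keeping NA as NA
pad_id <- function(x) {
  out <- sprintf("%016.0f", x)
  out[is.na(x)] <- NA_character_
  out
}

papers$i16   <- pad_id(papers$i16)
papers$coi16 <- pad_id(papers$coi16)
# Non-numeric years become NA; the coercion warning is expected and suppressed
papers$year  <- suppressWarnings(as.integer(papers$year))
# Treat an empty ISSN as missing
papers$issn[papers$issn == ""] <- NA
papers$journal <- as.factor(papers$journal)
```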

Reading and tidying up the conference articles file

file = "Data/Lattes/Lattes_Conferences.csv"
conferences <- read.csv(file,header=TRUE,sep=",",stringsAsFactors=FALSE)
str(conferences)

## 'data.frame':	1368081 obs. of  8 variables:
##  $ i16        : num  3.26e+11 3.26e+11 3.26e+11 3.26e+11 3.26e+11 ...
##  $ coi16      : num  4.48e+15 6.21e+15 4.38e+15 6.12e+15 2.89e+15 ...
##  $ coname     : chr  "R. O. ROSAS" "M. E. DANTAS" "L. G. EIRADO SILVA" "Otávio Miguez da Rocha Leão" ...
##  $ year       : int  1994 1994 1994 1995 1995 1995 1995 1995 1995 1995 ...
##  $ title      : chr  "Definição de Domínios Geo-Hidroecológicos como Subsídio ao Planejamento Ambiental: o método analítico-integrati"| __truncated__ "Spatially Non-Uniform Sediment Storage In Fluvial Systems: The Role Of Bedrock Knickpoints In The Southeastern "| __truncated__ "Spatially Non-Uniform Sediment Storage In Fluvial Systems: The Role Of Bedrock Knickpoints In The Southeastern "| __truncated__ "Revegetação Induzida no Controle da Hidrologia e Erosão Superficial, P.N.T./RJ." ...
##  $ proceedings: chr  "ATAS DO 1. ENCONTRO NACIONAL DE CIÊNCIAS AMBIENTAIS" "ANAIS DO 14th INTERNATIONAL SEDIMENTARY CONGRESS, IAS-INTERN. ASS. OF GEOMOPHOLOGISTS" "ANAIS DO 14th INTERNATIONAL SEDIMENTARY CONGRESS, IAS-INTERN. ASS. OF GEOMOPHOLOGISTS" "ANAIS DO VI SIMPÓSIO NACIONAL DE GEOGRAFIA FÍSICA APLICADA" ...
##  $ nature     : chr  "COMPLETO" "RESUMO_EXPANDIDO" "RESUMO_EXPANDIDO" "COMPLETO" ...
##  $ class      : chr  "NAO_INFORMADO" "NAO_INFORMADO" "NAO_INFORMADO" "NACIONAL" ...
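
As a possible next step, nature and class can be treated as categories and tabulated after removing the repetition caused by the coauthor pairs. The sketch below runs on made-up toy data with the same column names.

```r
# Toy stand-in for the conferences data (all values are made up)
conferences <- data.frame(
  i16    = rep("0000325690951570", 3),
  coname = c("A. Author", "B. Author", "C. Author"),
  year   = c(1994L, 1994L, 1995L),
  title  = c("Title One", "Title One", "Title Two"),
  nature = c("COMPLETO", "COMPLETO", "RESUMO_EXPANDIDO"),
  class  = c("NAO_INFORMADO", "NAO_INFORMADO", "NACIONAL"),
  stringsAsFactors = FALSE
)

conferences$nature <- as.factor(conferences$nature)
conferences$class  <- as.factor(conferences$class)

# One row per article per CV: drop the coauthor column, then remove duplicates
conf_papers <- unique(conferences[, c("i16", "year", "title", "nature", "class")])

# Cross-tabulation of article nature against conference class
nature_by_class <- table(conf_papers$nature, conf_papers$class)
```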

Reading and tidying up the supervisions file

file = "Data/Lattes/Lattes_Supervisions.csv"
supervisions <- read.csv(file,header=TRUE,sep=",",stringsAsFactors=FALSE)
str(supervisions)

## 'data.frame':	112873 obs. of  11 variables:
##  $ i16             : num  3.26e+11 3.26e+11 3.26e+11 3.26e+11 3.26e+11 ...
##  $ sti16           : num  NA 6.21e+15 6.89e+15 NA NA ...
##  $ stname          : chr  "Carlos Edgar de Deus" "Marcelo Eduardo Dantas" "Marcelo Motta de Freitas" "Ana Valéria Freire Alemão" ...
##  $ year            : int  1991 1995 1998 1998 2001 2000 2001 1998 2003 2004 ...
##  $ code_institution: chr  "020200000009" "020200000009" "020200000009" "020200000009" ...
##  $ name_institution: chr  "Universidade Federal do Rio de Janeiro" "Universidade Federal do Rio de Janeiro" "Universidade Federal do Rio de Janeiro" "Universidade Federal do Rio de Janeiro" ...
##  $ code_course     : int  31000240 31000240 31000240 31000240 31000240 31000240 31000240 31000240 31000240 31000240 ...
##  $ name_course     : chr  "Geografia" "Geografia" "Geografia" "Geografia" ...
##  $ nature          : chr  "Dissertação de mestrado" "Dissertação de mestrado" "Dissertação de mestrado" "Dissertação de mestrado" ...
##  $ type            : chr  "ACADEMICO" "ACADEMICO" "ACADEMICO" "ACADEMICO" ...
##  $ title           : chr  "O Papel da Formiga Sauva (Do Genero Atta) Na Hidrologia e Erosao dos Solos Em Ambiente de Pastagem: Bananal - Sp" "Controle Naturais e Antropogênicos Na Sedimentação Fluvial, Espacialmente Não Uniforme, Na Bacia do Rio Bananal"| __truncated__ "Comportamento Hidrológico e Erosivo de Bacia Montanhosa Sob Uso Agrícola: Subsídio Ao Controle Erosivo dos Solo"| __truncated__ "Recarga de Drenagem em Solos Florestados: o papel dos sistemas radiculares ." ...
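
The question about the most prolific PhD advisors can be sketched from this file. The example below uses made-up toy data; the label "Tese de doutorado" for PhD supervisions is an assumption (only "Dissertação de mestrado" appears in the output above), so the values of nature should be inspected first, for example with table(supervisions$nature).

```r
# Toy stand-in for the supervisions data (all values are made up)
supervisions <- data.frame(
  i16    = c("0000000000000001", "0000000000000001",
             "0000000000000001", "0000000000000002"),
  stname = c("Student A", "Student B", "Student C", "Student D"),
  nature = c("Dissertação de mestrado", "Tese de doutorado",
             "Tese de doutorado", "Tese de doutorado"),
  stringsAsFactors = FALSE
)

# Rows that look like PhD supervisions (assumed label, matched case-insensitively)
is_phd <- grepl("doutorado", supervisions$nature, ignore.case = TRUE)

# Number of PhD supervisions per researcher, largest first
phd_counts <- sort(table(supervisions$i16[is_phd]), decreasing = TRUE)

# Top 10 (or fewer, if there are fewer researchers)
top_advisors <- head(phd_counts, 10)
```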

To be concluded...

Warning: Code and results presented in this document are for reference use only. Code was written to be clear, not efficient. There are several ways to achieve the results, and not all were considered.

See the R source code for this notebook.

Updated August 09, 2019

CAP394

Rafael Santos