A detailed description of the Russian Index of Research Organizations (RIRO) project (version 1.0).
RIRO, as the article’s title suggests, is a project launched in May 2021 and devoted to the Russian organizations doing research. What title does not tell is that RIRO was designed as a public, open, and autonomous project, where
In its current form RIRO v.1.0 is a set of tables linking the organization identifiers and profiles from various international registries (see a list below) to the Russian legal entities.
All files are available at Zenodo: https://zenodo.org/communities/riro/.
The code and examples of use will be shared via GitHub: https://github.com/OpenRIRO.
The project’s main purpose is to create and link the profiles (PID) corresponding to the Russian research organizations in the various information systems. One may argue that in those identifiers are already assigned to some organizaions. It looks so, but each register of identifiers has its own handicaps - some are trying to keep the most updated orgnames and rename the accounts, wiping out old names and connections; the others have more conservative record update policy and provide a lot of obsolete names; most do not follow the Russian mergers in science; and none is great in covering the small organizations (which is a terra incognita for any citation index). As a result, the identifiers provided by the international services suffer from a patchy coverage and have a limited value for the research assessment and scientometric analysis.
An UpSet diagram below shows the sets of organizations listen in RIRO (v.1.0) and matched to the various identifiers. Only the head and active (existing) organizations are counted here, so the identifiers referring to the branches or to the predecessors liquidated via acquisition, are not counted.
filename <- paste0(img.dir, "chart_upset_eng.png")
knitr::include_graphics(filename)
With the RIRO we aim to:
create, update, and share a public dataset of the organization identifiers that reflects not only their current statuses, but also their recent (at least) transformations;
asssess how complete is a representation of the Russian organizations in the international registries;
attract the researh community’s attention to the new information services and its features.
RIRO is about research, so in general any organization that appear in the affiliations of the scientific articles is a good candidate for RIRO. Participation is free (as in speech). And yet some organizations have been prioritized to appear in v.1.0, those are:
state (federal/regonal) organizations doing fundamental or applied research (research centres, scientific institutes, universities, large clinical centers or hospitals, reserves, museums, etc.);
private organizations (mostly universities).
We have not even reviewed the entities associated with the Russian state corporations (RosAtom, RosSpace, RosTech) or the military ones.
RIRO does not copy all the attributes linked to the records in the sources. We selected only the most useful (to our opinion) fields. In future RIRO releases we may add more fields.
The data in the tables 4-11 is the same as in the original sources - we have not changed the original records.
Some tables share the common fields. For example, ROR (table 4) has a column Wikidata, and Wikidata (table 5) has a column ROR. We did not chang or check those links, they are kept as they are present in the original sources, but renamed to avoid confusion (added a prefix referring to the source - ror_wikidata, wd_ror, etc.
Matching the identifiers to the organizations was made based on available information (name, location) - we have not checked if the data asssociated with the identifiers (in the source databases) is correctly assigned to the organizations.
The RIRO dataset (as a set of CSV tables) can be downloaded from Zenodo community, via OAI-PMH Harvesting API or by using REST API. Zenodo – is one of the leading data repositories supported by CERN (details).
The CSV tables share a primary key named code, so one can easily join the tables or build a database.
riro_versions <- list.dirs(paste0(dir, "/final_tables/"), recursive = FALSE) %>%
sort(., decreasing = TRUE) %>% .[grepl("1.0",.)]
riro_tables <- list.files(riro_versions, full.names = TRUE)
riro_data <- map(riro_tables, ~read_csv(.x, col_types = cols(.default = col_character())))
Below are the parts of each table corrsponding to the 3 organizations selected as an example:
This table comprises the basic organization details - OGRN (Primary State Registration Number), INN (Taxpayer Identification Number), KPP (Tax Registration Reason Code), full & short names, status {active, liquidated, in reorganization process}, and the branch type {head or branch}.
A table below lists the legal entities found by OGRN and their branches (the branch organization and its head entity share same OGRN and INN). To save some space a column with the short names column is not included.
t1 <- riro_data[[1]]
test_group <- t1 %>%
filter(ogrn %in% c("1021100511332", "1037739552740", "1027000861568"))
test_group %>% select(code, level, status, name_full, ogrn, inn, kpp) %>%
datatable(rownames = FALSE, filter = "none", escape = FALSE,
class = "row-border",
options = list(columnDefs = list(
list(width = '600px', targets = c(3)))))
Each row has its unique code whih serves as a primary key for all the tables of the RIRO database.
Table 2 comprises the full address and its separate parts (in Russian), accompanied with the geocode, geo coordinates and time zone. Table 2 is the only table in RIRO having 1:1 correspondance to Table 1 via code. The other tables can have few rows for one code.
This is a very important table, as it links the parent organizations not only with its current branches, but also with the predecessors (for convenience, in this document both will be referred as “children accounts”).
Table 3 does not pretend to be complete for few reasons:
the list includes only the last predecessors of the current organizations, so there is no historical perspective from 2000s or from USSR.
some organizations have many branch offices (e.g. hospitals), but information about their hierarchy has little or no value from research assessment perspective. Therefore, for some organizations RIRO does not show the branches. The orgs including the branches are mainly the federal organizations (subdued to the ministries and the federal agencies).
The “child_code” is a code for the children account, and the values in the “relation” column reflect nature of the subordination (it can be a branch or a predecessor).
So, using a list of OGRNs for 3 selected organizations, we extracted from Table 1 6 entities with unique codes (head & branch organizations), further used to retrieve a list of all the predecessors. As a result we get a list of 42 entities with unique code values.
Let’s look into this list:
hierarchy <- full_test %>%
add_row(code = unique(full_test$code),
child_code = unique(full_test$code),
relation = "Головная") %>%
rename(parent_code = code, code = child_code) %>%
arrange(parent_code, relation)
hierarchy %>% count(parent_code, relation) %>%
pivot_wider(names_from = relation,
values_from = n, values_fill = 0) %>%
select(parent_code, parent_org = `Головная`,
branch = `Филиал`, predecessors = `Правопредшественники`) %>%
datatable(rownames = FALSE, filter = "none",
escape = FALSE, class = "row-border",
options = list(autoWidth = FALSE,
columnDefs = list(list(className = 'dt-center',
targets = c(0:3)))))
One may note that a list of both head and branch organizations can be retrieved from Table 1, so why to use the table 3? Fair question, let me show you why the RIRO table 3 should not be ignored.
Research Organizations Registry (ROR) is an international project launched in 2019 with an ambitious goal to create a public ORCID-like registry for the research organizations. It inherits a lot from GRID (Global Research Identifier Database) and Wikidata. The ROR organization info can be downloaded as a JSON dump or retrieved via API.
Table 4 contains not all the Russian-related records from ROR, but only those matched to the organizations present in RIRO.
t4 <- riro_data[[4]]
hierarchy %>% inner_join(t4, by = "code") %>%
arrange(parent_code) %>% select(-parent_code, - relation) %>%
datatable(rownames = FALSE, filter = "none",
escape = FALSE, class = "row-border",
options = list(columnDefs = list(
list(width = '250px', targets = c(3:5)),
list(width = '400px', targets = c(8)))))
The column “Relationships” comprise the composite values of following structure:
label:xxxx|type:yyyyyy|id:https://ror.org/zzzzz
,
with 3 units (label, type, id) for the relative (according to ROR best judgements) organizations.
In cases like ours such references can be misleading - according to ROR the research institutes of the Komi Federal Research Center have different parents:
So for a single organization ROR shows 4 accounts subordinating to three different RAS structures.
The truth is that the found 4 research institutes of the Komi Federal Research Center ceased to exist as legal entities 3 years ago, they were merged into a single federal research center in May 2018.
That’s the value of Table 3 in RIRO - it helps to gather the related identifiers and qualify them as corresponding to a branch or a predecessor.
WikiData is a public repository of structured data originating from multiple sources. Some sources are more or less consistent (like CrossRef or ISSN), but there’s also a lot of Wikidata records that are created and modified by people. As a result, even though Wikidata offers a pre-defined templates for the profiles of universities and research organizations, many profiles have unpopulated fields.
The table 5 comprises a list of fields found in the Wikidata profiles for the organizations, but not a full copy. Moreover, Table 5 include only those Russian research organizations that match to the organizations in RIRO.
t5 <- riro_data[[5]]
hierarchy %>% inner_join(t5, by = "code") %>%
arrange(parent_code) %>%
select(-parent_code, - relation, -wd_item) %>%
mutate_at(c("wikipedia_eng", "wikipedia_rus", "wd_itemaltlabel", "wd_altlabel"),
~ifelse(is.na(.x),.x, paste0(substr(.x,1,40),"...")))%>%
datatable(rownames = FALSE, filter = "none", escape = FALSE, class = "row-border",
options = list(columnDefs = list(
list(width = '250px', targets = c(2:4)),
list(width = '150px', targets = c(6:9)))))
Scopus is a (one of leading) citation index accumulating the metadata from 20k+ journal titles, selected conference sources, and some academic book titles. Table 6 lists the Scopus affiliation profiles matched to the organizations in RIRO, and also a number of publications under Scopus affiliation profile (on a data of request, April 2021).
Over 1000 Russian research organizations (and universities) have an access to Scopus under the state-funded centralized subscription, and can use the matched IDs via the online UI or the API-service. The latter has few wrappers for python and R that make working with API more comfortable.
Please note that matching the affiliation profiles to RIRO organizations is based on the affiliation name and city. It does not guarantee that all the publications in the profile are assigned to it correctly. More details on how to edit the affiliation profiles in Scopus can be found on Elsevier web site.
t6 <- riro_data[[6]]
hierarchy %>% inner_join(t6, by = "code") %>%
arrange(parent_code) %>% select(-parent_code, - relation) %>%
datatable(rownames = FALSE, filter = "none", escape = FALSE, class = "row-border",
options = list(columnDefs = list(
list(className = 'dt-center', targets = c(0,6)),
list(width = '350px', targets = c(2:3)))))
Microsoft Academic Graph (MAG) is a database created based on the information extracted with Bing-parsers from the publisher web sites and PDF files details). This approach is differen from the one utilized by Web of Science and Scopus that receive a large part of information for indexation directly from the publishers.
MAG is also a source of information for many novel solutions like Lens, Semantic Scholar, Open Academic Graph, Unsub.
Even though the last news about MAG shocked us too, we decided to include the MAG Organization IDs into RIRO. Few international companies committed to launching a new tool that may substitute MAG:
ANNOUNCING: We’re building a replacement for Microsoft Academic Graph. https://t.co/GXelkpt6Zc
— Our Research (@our_research) May 8, 2021
Thanks to @MSFTResearch for providing the Microsoft Academic Graph (MAG). We’ve been working with MSR and MAG since 2018, and we’ve been collaborating on this transition for some time. 1/3https://t.co/7aHTLio8uK
— Semantic Scholar (@SemanticScholar) May 11, 2021
Table 7 lists just the MAG organization IDs and names agains the RIRO codes.
t7 <- riro_data[[7]]
hierarchy %>% inner_join(t7, by = "code") %>%
arrange(parent_code) %>% select(-parent_code, - relation) %>%
datatable(rownames = FALSE, filter = "none", escape = FALSE, class = "row-border",
options = list(autoWidth = FALSE,
columnDefs = list(
list(className = 'dt-center', targets = c(0:1)),
list(width = '450px', targets = c(2)))))
InCites is an analytical solution build over Web of Science Core Collection. It allows to export the records, so the matched names can be used for further analysis.
The table 8 lists the official organization names in InCite and Web of Science Core Collection against the RIRO codes.
t8 <- riro_data[[8]]
hierarchy %>% inner_join(t8, by = "code") %>%
arrange(parent_code) %>% select(-parent_code, - relation) %>%
datatable(rownames = FALSE, filter = "none", escape = FALSE, class = "row-border",
options = list(autoWidth = FALSE,
columnDefs = list(
list(className = 'dt-center', targets = c(0,3)),
list(width = '450px', targets = c(1:2)))))
SciVal is an analytical tool build over Scopus. Some Russian organizations have an access to SciVal API and could use the IDs matched against the RIRO codes in the table 9.
t9 <- riro_data[[9]]
hierarchy %>% inner_join(t9, by = "code") %>%
arrange(parent_code) %>% select(-parent_code, - relation) %>%
datatable(rownames = FALSE, filter = "none", escape = FALSE, class = "row-border",
options = list(autoWidth = FALSE,
columnDefs = list(
list(className = 'dt-center', targets = c(0:1)),
list(width = '450px', targets = c(2)))))
The system gathers the university statistical reports from all Russian higher education institutions (except some schools under the Ministries of Defence, etc). The reports contain a lot of useful information - from financial to enrollment data.
Table 10 lists the IDs that corresponds to the university’s web page on the portal, matched to the RIRO codes.
t10 <- riro_data[[10]]
hierarchy %>% inner_join(t10, by = "code") %>%
arrange(parent_code) %>% select(-parent_code, - relation) %>%
datatable(rownames = FALSE, filter = "none", escape = FALSE, class = "row-border",
options = list(autoWidth = FALSE,
columnDefs = list(list(className = 'dt-center',
targets = c(0:1)))))
Web of Science is by far the world’s oldest and most prominent citation index. At this moment Web of Science does not provide the organization IDs that could be used for search or data retrieval, but the search results have the orgaization names. The table 11 lists almost 4000 such names matched to the organizations in RIRO. This is not a complete list of known affiliation names for the Russian research organizations, but we hope to adjust this table in future releases of RIRO.
t11 <- riro_data[[11]]
hierarchy %>% inner_join(t11, by = "code") %>%
arrange(parent_code) %>% select(-parent_code, - relation) %>%
datatable(rownames = FALSE, filter = "none",
escape = FALSE, class = "row-border",
options = list(autoWidth = FALSE,
columnDefs = list(
list(className = 'dt-center', targets = c(0)),
list(width = '450px', targets = c(1,3)))))
An illustration below shows the identifiers (ROR, GRID, Scopus Affiliation ID, InCites ID, MAG, Wikidata) matched to 3 organizations selected as an example. Some IDs correspond to the head organizations, the others either to the branches or to the predecessors.
The identifiers are placed along X-axis, organized in three sections (by organization). The RIRO entities are place along Y-axis, the existing organization are shown as squares (the head organizations are also marked with a special sign), the predecessors as circles.
filename <- paste0(img.dir, "chart_examples_eng.png")
knitr::include_graphics(filename)
For RIRO v.1.0 we have a google form (in Russian) with the following scenarios of the change requests:
We do hope that feedback will help us to make RIRO a more valuable tool.
Allaire J, Xie Y, McPherson J, Luraschi J, Ushey K, Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2021). rmarkdown: Dynamic Documents for R. R package version 2.7, <URL: https://github.com/rstudio/rmarkdown>.
Blondel E (2021). zen4R: Interface to ‘Zenodo’ REST API. R package version 0.4-3, <URL: https://github.com/eblondel/zen4R>.
Chang, W (2014). extrafont: Tools for using fonts. R package version 0.17, <URL: https://CRAN.R-project.org/package=extrafont>.
Henry L, Wickham H (2020). purrr: Functional Programming Tools. R package version 0.3.4, <URL: https://CRAN.R-project.org/package=purrr>.
Krassowski M (2020). “ComplexUpset.” doi: 10.5281/zenodo.3700590 (URL: https://doi.org/10.5281/zenodo.3700590), <URL: https://doi.org/10.5281/zenodo.3700590>.
Lex A, Gehlenborg N, Strobelt H, Vuillemot R, Pfister H (2014). “UpSet: Visualization of Intersecting Sets,.” IEEE Transactions on Visualization and Computer Graphics, 20(12), 1983–1992. doi: 10.1109/TVCG.2014.2346248 (URL: https://doi.org/10.1109/TVCG.2014.2346248), <URL: https://doi.org/10.1109/TVCG.2014.2346248>.
Wickham H (2020). tidyr: Tidy Messy Data. R package version 1.1.2, <URL: https://CRAN.R-project.org/package=tidyr>.
Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, <URL: https://ggplot2.tidyverse.org>.
Wickham H (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0, <URL: https://CRAN.R-project.org/package=stringr>.
Wickham H, Francois R, Henry L, Muller K (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.3, <URL: https://CRAN.R-project.org/package=dplyr>.
Wickham H, Hester J (2020). readr: Read Rectangular Text Data. R package version 1.4.0, <URL: https://CRAN.R-project.org/package=readr>.
Wickham H, Seidel D (2020). scales: Scale Functions for Visualization. R package version 1.1.1, <URL: https://CRAN.R-project.org/package=scales>.
Xie Y (2020). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.30, <URL: https://yihui.org/knitr/>.
Xie Y (2015). Dynamic Documents with R and knitr, 2nd edition. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963, <URL: https://yihui.org/knitr/>.
Xie Y (2014). “knitr: A Comprehensive Tool for Reproducible Research in R.” In Stodden V, Leisch F, Peng RD (eds.), Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595, <URL: http://www.crcpress.com/product/isbn/9781466561595>.
Xie Y, Allaire J, Grolemund G (2018). R Markdown: The Definitive Guide. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 9781138359338, <URL: https://bookdown.org/yihui/rmarkdown>.
Xie Y, Cheng J, Tan X (2021). DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.17, <URL: https://CRAN.R-project.org/package=DT>.
Xie Y, Dervieux C, Riederer E (2020). R Markdown Cookbook. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 9780367563837, <URL: https://bookdown.org/yihui/rmarkdown-cookbook>.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Lutai & Sterligov (2021, May 26). Russian Index of Research Organizations: RIRO v.1.0 (eng). Retrieved from https://openriro.github.io/posts/rirov10eng/
BibTeX citation
@misc{lutai2021riro, author = {Lutai, Aleksei and Sterligov, Ivan}, title = {Russian Index of Research Organizations: RIRO v.1.0 (eng)}, url = {https://openriro.github.io/posts/rirov10eng/}, year = {2021} }