Comparison of bibliographic data sources: Implications for the robustness of university rankings

Universities are increasingly evaluated on the basis of their outputs. These are often converted to simple and contested rankings with substantial implications for recruitment, income, and perceived prestige. Such evaluation usually relies on a single data source to define the set of outputs for a university. However, few studies have explored differences across data sources and their implications for metrics and rankings at the institutional scale. We address this gap by performing detailed bibliographic comparisons between Web of Science (WoS), Scopus, and Microsoft Academic (MSA) at the institutional level and supplement this with a manual analysis of 15 universities. We further construct two simple rankings based on citation count and open access status. Our results show that there are significant differences across databases. These differences contribute to drastic changes in rank positions of universities, which are most prevalent for non-English-speaking universities and those outside the top positions in international university rankings. Overall, MSA has greater coverage than Scopus and WoS, but with less complete affiliation metadata. We suggest that robust evaluation measures need to consider the effect of choice of data sources and recommend an approach where data from multiple sources is integrated to provide a more robust data set.


Comparison of bibliographic data sources: Implications for the robustness of university rankings

10.1101/750075 ◽

2019 ◽

Cited By ~ 2

Author(s):

Chun-Kai (Karl) Huang ◽

Cameron Neylon ◽

Chloe Brookes-Kenworthy ◽

Richard Hosking ◽

Lucy Montgomery ◽

...

Keyword(s):

Data Sources ◽

University Rankings ◽

Multiple Sources ◽

Publication Type ◽

Bibliographic Data ◽

Data Source ◽

English Speaking ◽

Single Data ◽

The University ◽

Staff Recruitment

Abstract: Universities are increasingly evaluated, both internally and externally, on the basis of their outputs. Often these are converted to simple, and frequently contested, rankings based on quantitative analysis of those outputs. These rankings can have substantial implications for student and staff recruitment, research income, and perceived prestige of a university. Both internal and external analyses usually rely on a single data source to define the set of outputs assigned to a specific university. Although some differences between such databases are documented, few studies have explored them at the institutional scale and examined the implications of these differences for the metrics and rankings that are derived from them. We address this gap by performing detailed bibliographic comparisons between three key databases: Web of Science (WoS), Scopus, and the recently relaunched Microsoft Academic (MSA). We analyse the differences between outputs with DOIs identified from each source for a sample of 155 universities and supplement this with a detailed manual analysis of the differences for fifteen universities. We find significant differences between the sources at the university level. Sources differ in the publication year of specific objects, the completeness of metadata, as well as in their coverage of disciplines, outlets, and publication type. We construct two simple rankings based on citation counts and open access status of the outputs for these universities and show dramatic changes in position based on the choice of bibliographic data sources. Those universities that experience the largest changes are frequently those from non-English speaking countries and those that are outside the top positions in international university rankings. Overall, MSA has greater coverage than Scopus or WoS, but has less complete affiliation metadata. We suggest that robust evaluation measures need to consider the effect of choice of data sources and recommend an approach where data from multiple sources is integrated to provide a more robust dataset.
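The comparison of DOI-identified outputs across sources can be illustrated with a small sketch. This is not the authors' workflow; it only shows one plausible way to measure how much of the combined DOI set each source covers for a single institution. The function name `coverage_report` and the DOIs are made up for illustration.

```python
def coverage_report(doi_sets):
    """Summarise, per source, how much of the combined DOI set it covers.

    doi_sets maps a source name to an iterable of DOI strings
    (hypothetical input format, assumed for this sketch).
    """
    # DOIs are case-insensitive, so compare them in lower case.
    normalised = {
        src: {doi.strip().lower() for doi in dois}
        for src, dois in doi_sets.items()
    }
    union = set().union(*normalised.values())
    report = {}
    for src, dois in normalised.items():
        # Everything found by at least one of the other sources.
        others = set().union(
            *(v for k, v in normalised.items() if k != src)
        )
        report[src] = {
            "count": len(dois),
            "share_of_union": len(dois) / len(union) if union else 0.0,
            "unique_to_source": len(dois - others),
        }
    return report


# Made-up DOIs for one hypothetical university.
example = {
    "WoS": ["10.1/a", "10.1/b"],
    "Scopus": ["10.1/A", "10.1/c"],
    "MSA": ["10.1/a", "10.1/b", "10.1/c", "10.1/d"],
}
report = coverage_report(example)
for source, stats in report.items():
    print(source, stats)
```

A source with a high `unique_to_source` count contributes outputs that the others miss, which is the kind of difference that can shift an institution's position in a citation-based ranking.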


Event detection of different English data sources based on transfer learning

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189798 ◽

2021 ◽

pp. 1-11

Author(s):

Yanan Huang ◽

Yuji Miao ◽

Zhenjing Da

Keyword(s):

Transfer Learning ◽

Event Detection ◽

Visual Analysis ◽

Learning Algorithm ◽

Data Sources ◽

Data Set ◽

Data Source ◽

Single Data Source ◽

The Difference ◽

Single Data

The methods of multi-modal English event detection under a single data source and isomorphic event detection of different English data sources based on transfer learning still need to be improved. In order to improve the efficiency of English and data source time detection, based on the transfer learning algorithm, this paper proposes multi-modal event detection under a single data source and isomorphic event detection based on transfer learning for different data sources. Moreover, by stacking multiple classification models, this paper merges the features with one another and conducts adversarial training through the difference between the two classifiers to further make the distributions of different source data similar. In addition, in order to verify the proposed algorithm, a multi-source English event detection data set is collected through a data collection method. Finally, this paper uses the data set to verify the proposed method and compares it with the current mainstream transfer learning methods. Through experimental analysis, convergence analysis, visual analysis and parameter evaluation, the effectiveness of the proposed algorithm is demonstrated.


Tailoring data source distributions for fairness-aware data integration

Proceedings of the VLDB Endowment ◽

10.14778/3476249.3476299 ◽

2021 ◽

Vol 14 (11) ◽

pp. 2519-2532

Author(s):

Fatemeh Nargesian ◽

Abolfazl Asudeh ◽

H. V. Jagadish

Keyword(s):

Optimal Solution ◽

Cost Effective ◽

Data Sources ◽

Data Sets ◽

Multiple Sources ◽

Data Set ◽

Demographic Groups ◽

Reward Function ◽

Effective Manner ◽

Data Source

Data scientists often develop data sets for analysis by drawing upon sources of data available to them. A major challenge is to ensure that the data set used for analysis has an appropriate representation of relevant (demographic) groups: it meets desired distribution requirements. Whether data is collected through some experiment or obtained from some data provider, the data from any single source may not meet the desired distribution requirements. Therefore, a union of data from multiple sources is often required. In this paper, we study how to acquire such data in the most cost effective manner, for typical cost functions observed in practice. We present an optimal solution for binary groups when the underlying distributions of data sources are known and all data sources have equal costs. For the generic case with unequal costs, we design an approximation algorithm that performs well in practice. When the underlying distributions are unknown, we develop an exploration-exploitation based strategy with a reward function that captures the cost and approximations of group distributions in each data source. Besides theoretical analysis, we conduct comprehensive experiments that confirm the effectiveness of our algorithms.


Is it possible to rank universities using fewer indicators? A study on five international university rankings

Aslib Journal of Information Management ◽

10.1108/ajim-05-2018-0118 ◽

2019 ◽

Vol 71 (1) ◽

pp. 18-37 ◽

Cited By ~ 5

Author(s):

Güleda Doğan ◽

Umut Al

Keyword(s):

Similarity Measure ◽

Decision Makers ◽

Cosine Similarity ◽

University Rankings ◽

University Ranking ◽

Data Set ◽

Content Type ◽

Research Questions ◽

Cosine Similarity Measure ◽

International University

Purpose: The purpose of this paper is to analyze the similarity of intra-indicators used in research-focused international university rankings (Academic Ranking of World Universities (ARWU), NTU, University Ranking by Academic Performance (URAP), Quacquarelli Symonds (QS) and Round University Ranking (RUR)) over years, and show the effect of similar indicators on overall rankings for 2015. The research questions addressed in this study in accordance with these purposes are as follows: At what level are the intra-indicators used in international university rankings similar? Is it possible to group intra-indicators according to their similarities? What is the effect of similar intra-indicators on overall rankings?

Design/methodology/approach: Indicator-based scores of all universities in five research-focused international university rankings for all years they ranked form the data set of this study for the first and second research questions. The authors used multidimensional scaling (MDS) and the cosine similarity measure to analyze the similarity of indicators and to answer these two research questions. Indicator-based scores and overall ranking scores for 2015 are used as data, and the Spearman correlation test is applied to answer the third research question.

Findings: Results of the analyses show that the intra-indicators used in ARWU, NTU and URAP are highly similar and that they can be grouped according to their similarities. The authors also examined the effect of similar indicators on 2015 overall ranking lists for these three rankings. NTU and URAP are least affected by the omitted similar indicators, which means it is possible for these two rankings to create overall ranking lists very similar to the existing ones using fewer indicators.

Research limitations/implications: CWTS, Mapping Scientific Excellence, Nature Index, and SCImago Institutions Rankings (until 2015) are not included in the scope of this paper, since they do not create overall ranking lists. Likewise, Times Higher Education, CWUR and US are not included because they do not present indicator-based scores. Required data were not accessible for QS for 2010 and 2011. Moreover, although QS ranks more than 700 universities, only the first 400 universities in the 2012–2015 rankings could be analyzed. Although QS’s and RUR’s data were analyzed in this study, it was statistically not possible to reach any conclusion for these two rankings.

Practical implications: The results of this study may be considered mainly by ranking bodies and by policy- and decision-makers. The ranking bodies may use the results to review the indicators they use, to decide which indicators to use in their rankings, and to question whether it is necessary to continue overall rankings. Policy- and decision-makers may also benefit from the results of this study by considering whether to stop using overall ranking results as an important input in their decisions and policies.

Originality/value: This study is the first to use MDS and the cosine similarity measure for revealing the similarity of indicators. Ranking data are skewed, which requires nonparametric statistical analysis; therefore, MDS is used. The study covers all ranking years and all universities in the ranking lists, and differs from similar studies in the literature that analyze data for shorter time intervals and top-ranked universities in the ranking lists. Based on the literature review, the similarity of intra-indicators for URAP, NTU and RUR is analyzed for the first time in this study.
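As a rough illustration of the cosine similarity measure mentioned in the abstract above, the sketch below compares two indicator score vectors, one score per university. It is a generic textbook formulation rather than the authors' analysis, and the indicator names and scores are invented.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length score vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Invented scores of four universities on two hypothetical indicators.
publications = [80.0, 60.0, 40.0, 20.0]
citations = [75.0, 62.0, 35.0, 25.0]

similarity = cosine_similarity(publications, citations)
print(round(similarity, 3))
```

A value close to 1 suggests the two indicators order and weight universities in nearly the same way, which is the situation in which dropping one of them would change an overall ranking very little.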


Application of Dempster–Shafer Data Fusion Technique in Support of Decision Making with Big Data

Transportation Research Record Journal of the Transportation Research Board ◽

10.3141/2645-04 ◽

2017 ◽

Vol 2645 (1) ◽

pp. 32-37 ◽

Cited By ~ 2

Author(s):

Ping Yi ◽

Songling Zhang

Keyword(s):

Decision Making ◽

Big Data ◽

Data Fusion ◽

Incomplete Information ◽

Data Sources ◽

Fusion Technique ◽

Crowd Management ◽

Multiple Data ◽

Data Source ◽

Single Data

This paper introduces applications of the Dempster–Shafer (D-S) data fusion technique in transportation system decision making. D-S inference is a statistics-based data classification technique, and it can be used when data sources contribute discontinuous and incomplete information and no single data source can produce an overwhelmingly high probability of certainty for identifying the most probable event. The technique captures and combines the information contributed by the data sources by using Dempster’s rule to find the conjunction of the events and to determine the highest associated probability. The D-S theory is explained and its implementation described through numerical examples of a ride-hailing service and of crowd management at a subway station. Results from the applications have shown that the technique is very effective in dealing with incomplete information and multiple data sources in the era of big data.
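Dempster's rule of combination, as referred to in the abstract above, can be sketched in a few lines. This is the standard textbook rule, not the paper's implementation; the hypotheses ("crowded", "normal") and the mass values are invented for illustration.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to its mass;
    the masses in each function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            # Agreeing evidence supports the shared hypotheses.
            combined[intersection] = (
                combined.get(intersection, 0.0) + mass_a * mass_b
            )
        else:
            # Disjoint hypotheses contribute to the conflict mass.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalise so the remaining masses sum to 1.
    return {h: m / (1.0 - conflict) for h, m in combined.items()}


CROWDED = frozenset({"crowded"})
NORMAL = frozenset({"normal"})
EITHER = CROWDED | NORMAL  # mass left unassigned (ignorance)

# Two hypothetical sensors reporting on a subway platform.
camera = {CROWDED: 0.6, NORMAL: 0.1, EITHER: 0.3}
gate_counter = {CROWDED: 0.7, EITHER: 0.3}

fused = combine(camera, gate_counter)
most_probable = max(fused, key=fused.get)
print(sorted(most_probable), round(fused[most_probable], 3))
```

Neither source alone is decisive here, but after combination most of the mass concentrates on a single hypothesis, which is the behaviour the abstract describes for incomplete and partly conflicting inputs.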


Analysis of key university leadership factors based on their international rankings (QS World University Rankings and Times Higher Education)

Problems and Perspectives in Management ◽

10.21511/ppm.18(4).2020.13 ◽

2020 ◽

Vol 18 (4) ◽

pp. 142-152

Author(s):

Maxim Polyakov ◽

Vladimir Bilozubenko ◽

Maxim Korneyev ◽

Natalia Nebaba

Keyword(s):

Higher Education ◽

Market Competition ◽

University Rankings ◽

Key Factors ◽

University Leadership ◽

Data Set ◽

Second Stage ◽

World University Rankings ◽

International University ◽

Significant Factors

In the context of globalization of the educational services market, competition between universities is becoming more intense. This manifests itself, among other things, in the struggle for positions in international university rankings. Given that universities are evaluated according to many criteria in such rankings, it becomes necessary to identify the most significant factors in determining their positions. This study aims to identify the key factors determining the world’s leading universities’ leadership in international university rankings. The numerical values of the criteria for compiling the QS World University Rankings (QS) and Times Higher Education (THE) rankings were an empirical basis for the study. The analysis covered the Top 50 universities (according to the QS ranking) and was conducted based on reports for 2020 and 2021. At first, clustering was carried out (method – k-means); the data set was the combination of numerical values of QS and THE criteria (six and five criteria, respectively). The universities were divided into three clusters in 2020 (23, 19, 8 universities) and 2021 (23, 17, 10 universities). This showed the universities’ leadership relative to each other for each year. At the second stage, classification processing was performed (method – decision trees). As a result, criteria combinations that give an absolute separation of all clusters (2020 – five combinations; 2021 – eight combinations) were identified. The obtained combinations largely determine universities’ affiliation to clusters; their criteria are recognized as key factors of their leadership in the rankings. This study’s results can serve as guidelines for improving universities’ positions in the rankings.


Evaluating institutional open access performance: Sensitivity analysis

10.1101/2020.03.19.998542 ◽

2020 ◽

Author(s):

Chun-Kai Huang ◽

Cameron Neylon ◽

Richard Hosking ◽

Lucy Montgomery ◽

Katie Wilson ◽

...

Keyword(s):

Sensitivity Analysis ◽

Open Access ◽

Selection Bias ◽

Web Of Science ◽

Data Sources ◽

Institutional Level ◽

Margin Of Error ◽

Bibliographic Data ◽

Data Source ◽

Major Data

Abstract: In the article “Evaluating institutional open access performance: Methodology, challenges and assessment” we develop the first comprehensive and reproducible workflow that integrates multiple bibliographic data sources for evaluating institutional open access (OA) performance. The major data sources include Web of Science, Scopus, Microsoft Academic, and Unpaywall. However, each of these databases continues to update, both actively and retrospectively. This implies that the results produced by the proposed process are potentially sensitive to both the choice of data source and the versions of them used. In addition, there remains the issue of selection bias in sample size and margin of error. The current work shows that the levels of sensitivity relating to the above issues can be significant at the institutional level. Hence, transparency and clear documentation of the choices made on data sources (and their versions) and cut-off boundaries are vital for reproducibility and verifiability.


Comparative analysis of Dimensions and Scopus bibliographic data sources: an approach to university research productivity

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v12i1.pp706-720 ◽

2022 ◽

Vol 12 (1) ◽

pp. 706

Author(s):

Pachisa Kulkanjanapiban ◽

Tipawan Silwattananusarn

Keyword(s):

Bibliometric Analysis ◽

Research Productivity ◽

Citation Count ◽

Data Sources ◽

Scientific Databases ◽

Bibliographic Data ◽

Significant Comparison ◽

Citation Counts ◽

Highly Correlated ◽

Number Of Citations

This paper presents a comparison of two primary bibliographic data sources, Scopus and Dimensions, at the document level. The emphasis is on the differences in their document coverage at the institutional level of aggregation. The main objective is to assess whether Dimensions offers good new possibilities for bibliometric analysis at the institutional level, as it does at the global level. The study compares the citation count profiles of articles published by faculty members of Prince of Songkla University (PSU) in Dimensions and Scopus from the year the databases first included PSU-authored papers (1970 and 1978, respectively) through the end of June 2020. Descriptive statistics and correlation analysis were applied to 19,846 articles indexed in Dimensions and 13,577 indexed in Scopus. The main finding was that the number of citations recorded in Dimensions was highly correlated with citation counts in Scopus; Spearman’s correlation between the two indicated a strong relationship. The findings mainly bear on Dimensions’ potential as an instrument for bibliometric analysis of university members’ research productivity. University researchers can use Dimensions to retrieve information, and the design policies can be used to evaluate research using scientific databases.


Earthquakes Reconnaissance Data Sources, a Literature Review

10.20944/preprints202106.0714.v1 ◽

2021 ◽

Author(s):

Diana Maria Contreras Mojica ◽

Sean Wilkinson ◽

Philip James

Keyword(s):

Laser Scanning ◽

Building Damage ◽

Current Data ◽

Data Sources ◽

Damage Data ◽

Efficient Data ◽

Data Source ◽

Single Data ◽

Post Disaster ◽

Selection Of

Earthquakes are one of the most catastrophic natural phenomena. After an earthquake, earthquake reconnaissance enables effective recovery by collecting data on building damage and other impacts. This paper aims to identify state-of-the-art data sources for building damage assessment and to guide more efficient data collection. It reviews 38 articles that indicate the sources used by different authors to collect data related to damage and post-disaster recovery progress after earthquakes between 2014 and 2021. The current data collection methods have been grouped into seven categories: fieldwork or ground surveys, omnidirectional imagery (OD), terrestrial laser scanning (TLS), remote sensing (RS), crowdsourcing platforms, social media (SM) and closed-circuit television videos (CCTV). The selection of a particular data source or collection technique for earthquake reconnaissance involves different criteria. Nowadays, reconnaissance missions cannot rely on a single data source; different data sources should complement each other, validate collected data, or quantify the damage comprehensively. The recent increase in the number of crowdsourcing and SM platforms used as a source of data for earthquake reconnaissance is a clear indication of the future trend in data sources.


Exploiting Bioschemas Markup to Populate IDPcentral

10.37044/osf.io/v3jct ◽

2021 ◽

Author(s):

Alasdair J. G. Gray ◽

Petros Papadopoulos ◽

Ivan Mičetić ◽

András Hatos

Keyword(s):

Intrinsically Disordered Protein ◽

Data Sources ◽

Web Pages ◽

Multiple Sources ◽

Intrinsically Disordered ◽

Disordered Protein ◽

Europe 2020 ◽

Community Data ◽

Data Source ◽

Insight Into

One of the goals of the ELIXIR Intrinsically Disordered Protein (IDP) community is to create a registry called IDPcentral. The registry will aggregate data contained in the community's specialist data sources such as DisProt, MobiDB, and Protein Ensemble Database (PED) so that proteins that are known to be intrinsically disordered can be discovered, with summary details of the protein presented and the specialist source consulted for more detailed data. At the ELIXIR BioHackathon-Europe 2020, we aimed to investigate the feasibility of populating IDPcentral by harvesting the Bioschemas markup that has been deployed on the IDP community data sources. The benefit of using Bioschemas markup, which is embedded in the HTML web pages for each protein in the data source, is that a standard harvesting approach can be used for all data sources, rather than needing bespoke wrappers for each data source API. We expect to harvest the markup using the Bioschemas Markup Scraper and Extractor (BMUSE) tool that has been developed specifically for this purpose. The challenge, however, is that the sources contain overlapping information about proteins but use different identifiers for the proteins. After the data has been harvested, it will need to be processed so that information about a particular protein, which will come from multiple sources, is consolidated into a single concept for the protein, with links back to where each piece of data originated. As well as populating the IDPcentral registry, we plan to consolidate the markup into a knowledge graph that can be queried to gain further insight into the IDPs.
