Comparison of bibliographic data sources: Implications for the robustness of university rankings

2020 ◽  
pp. 1-34 ◽  
Author(s):  
Chun-Kai (Karl) Huang ◽  
Cameron Neylon ◽  
Chloe Brookes-Kenworthy ◽  
Richard Hosking ◽  
Lucy Montgomery ◽  
...  

Universities are increasingly evaluated on the basis of their outputs. These are often converted to simple and contested rankings with substantial implications for recruitment, income, and perceived prestige. Such evaluation usually relies on a single data source to define the set of outputs for a university. However, few studies have explored differences across data sources and their implications for metrics and rankings at the institutional scale. We address this gap by performing detailed bibliographic comparisons between Web of Science (WoS), Scopus, and Microsoft Academic (MSA) at the institutional level and supplement this with a manual analysis of 15 universities. We further construct two simple rankings based on citation count and open access status. Our results show that there are significant differences across databases. These differences contribute to drastic changes in rank positions of universities, which are most prevalent for non-English-speaking universities and those outside the top positions in international university rankings. Overall, MSA has greater coverage than Scopus and WoS, but with less complete affiliation metadata. We suggest that robust evaluation measures need to consider the effect of choice of data sources and recommend an approach where data from multiple sources is integrated to provide a more robust data set.
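As a rough illustration of how the choice of data source can move a university's rank position, the sketch below ranks the same institutions by citation count under two sources and reports the shift. The function names, university names, and citation totals are hypothetical and are not taken from the study.

```python
def rank_by_score(scores):
    """Map each name to its rank (1 = highest score); ties are broken by name."""
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return {name: position for position, name in enumerate(ordered, start=1)}


def rank_shifts(scores_a, scores_b):
    """Rank under source B minus rank under source A, for names present in both."""
    common = set(scores_a) & set(scores_b)
    ranks_a = rank_by_score({name: scores_a[name] for name in common})
    ranks_b = rank_by_score({name: scores_b[name] for name in common})
    return {name: ranks_b[name] - ranks_a[name] for name in sorted(common)}


# Hypothetical citation totals for the same universities in two data sources.
source_a = {"Univ A": 1200, "Univ B": 950, "Univ C": 400}
source_b = {"Univ A": 1500, "Univ B": 700, "Univ C": 900}

shifts = rank_shifts(source_a, source_b)
print(shifts)  # {'Univ A': 0, 'Univ B': 1, 'Univ C': -1}
```

A positive shift means the university falls to a lower position under the second source; a negative shift means it rises.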

2019 ◽  
Author(s):  
Chun-Kai (Karl) Huang ◽  
Cameron Neylon ◽  
Chloe Brookes-Kenworthy ◽  
Richard Hosking ◽  
Lucy Montgomery ◽  
...  

Universities are increasingly evaluated, both internally and externally, on the basis of their outputs. Often these are converted to simple, and frequently contested, rankings based on quantitative analysis of those outputs. These rankings can have substantial implications for student and staff recruitment, research income and perceived prestige of a university. Both internal and external analyses usually rely on a single data source to define the set of outputs assigned to a specific university. Although some differences between such databases are documented, few studies have explored them at the institutional scale and examined the implications of these differences for the metrics and rankings that are derived from them. We address this gap by performing detailed bibliographic comparisons between three key databases: Web of Science (WoS), Scopus, and the recently relaunched Microsoft Academic (MSA). We analyse the differences between outputs with DOIs identified from each source for a sample of 155 universities and supplement this with a detailed manual analysis of the differences for fifteen universities. We find significant differences between the sources at the university level. Sources differ in the publication year of specific objects, the completeness of metadata, as well as in their coverage of disciplines, outlets, and publication type. We construct two simple rankings based on citation counts and open access status of the outputs for these universities and show dramatic changes in position based on the choice of bibliographic data sources. Those universities that experience the largest changes are frequently those from non-English speaking countries and those that are outside the top positions in international university rankings. Overall, MSA has greater coverage than Scopus or WoS, but has less complete affiliation metadata. We suggest that robust evaluation measures need to consider the effect of choice of data sources and recommend an approach where data from multiple sources is integrated to provide a more robust dataset.


2021 ◽  
pp. 1-11
Author(s):  
Yanan Huang ◽  
Yuji Miao ◽  
Zhenjing Da

Methods for multi-modal English event detection under a single data source, and for isomorphic event detection across different English data sources based on transfer learning, still need to be improved. To improve the efficiency of English event detection, this paper proposes, on the basis of a transfer learning algorithm, multi-modal event detection under a single data source and isomorphic event detection across different data sources. Moreover, by stacking multiple classification models, the paper merges the features with one another and conducts adversarial training using the difference between two classifiers, so that the distributions of data from different sources become more similar. In addition, to verify the proposed algorithm, a multi-source English event detection data set is collected. Finally, the data set is used to validate the proposed method and to compare it with the current mainstream transfer learning methods. Experimental analysis, convergence analysis, visual analysis, and parameter evaluation demonstrate the effectiveness of the proposed algorithm.


2021 ◽  
Vol 14 (11) ◽  
pp. 2519-2532
Author(s):  
Fatemeh Nargesian ◽  
Abolfazl Asudeh ◽  
H. V. Jagadish

Data scientists often develop data sets for analysis by drawing upon sources of data available to them. A major challenge is to ensure that the data set used for analysis has an appropriate representation of relevant (demographic) groups: it meets desired distribution requirements. Whether data is collected through some experiment or obtained from some data provider, the data from any single source may not meet the desired distribution requirements. Therefore, a union of data from multiple sources is often required. In this paper, we study how to acquire such data in the most cost effective manner, for typical cost functions observed in practice. We present an optimal solution for binary groups when the underlying distributions of data sources are known and all data sources have equal costs. For the generic case with unequal costs, we design an approximation algorithm that performs well in practice. When the underlying distributions are unknown, we develop an exploration-exploitation based strategy with a reward function that captures the cost and approximations of group distributions in each data source. Besides theoretical analysis, we conduct comprehensive experiments that confirm the effectiveness of our algorithms.


2019 ◽  
Vol 71 (1) ◽  
pp. 18-37 ◽  
Author(s):  
Güleda Doğan ◽  
Umut Al

Purpose: The purpose of this paper is to analyze the similarity of intra-indicators used in research-focused international university rankings (Academic Ranking of World Universities (ARWU), NTU, University Ranking by Academic Performance (URAP), Quacquarelli Symonds (QS) and Round University Ranking (RUR)) over the years, and to show the effect of similar indicators on overall rankings for 2015. The research questions addressed in this study are as follows: At what level are the intra-indicators used in international university rankings similar? Is it possible to group intra-indicators according to their similarities? What is the effect of similar intra-indicators on overall rankings?
Design/methodology/approach: Indicator-based scores of all universities in five research-focused international university rankings, for all the years in which they were ranked, form the data set for the first and second research questions. The authors used multidimensional scaling (MDS) and the cosine similarity measure to analyze the similarity of indicators and to answer these two research questions. Indicator-based scores and overall ranking scores for 2015 are used as data, and the Spearman correlation test is applied, to answer the third research question.
Findings: Results of the analyses show that the intra-indicators used in ARWU, NTU and URAP are highly similar and that they can be grouped according to their similarities. The authors also examined the effect of similar indicators on the 2015 overall ranking lists for these three rankings. NTU and URAP are least affected by omitting similar indicators, which means that these two rankings could produce overall ranking lists very similar to the existing ones using fewer indicators.
Research limitations/implications: CWTS, Mapping Scientific Excellence, Nature Index, and SCImago Institutions Rankings (until 2015) are not included in the scope of this paper, since they do not create overall ranking lists. Likewise, Times Higher Education, CWUR and US are not included because they do not present indicator-based scores. Required data were not accessible for QS for 2010 and 2011. Moreover, although QS ranks more than 700 universities, only the first 400 universities in the 2012–2015 rankings could be analyzed. Although QS’s and RUR’s data were analyzed in this study, it was statistically not possible to reach any conclusion for these two rankings.
Practical implications: The results of this study may be of interest mainly to ranking bodies and to policy- and decision-makers. The ranking bodies may use the results to review the indicators they use, to decide which indicators to use in their rankings, and to question whether it is necessary to continue overall rankings. Policy- and decision-makers may also benefit from the results by considering whether to stop using overall ranking results as an important input in their decisions and policies.
Originality/value: This study is the first to use MDS and the cosine similarity measure to reveal the similarity of indicators. Ranking data are skewed, which requires nonparametric statistical analysis; therefore, MDS is used. The study covers all ranking years and all universities in the ranking lists, unlike similar studies in the literature, which analyze shorter time intervals and only top-ranked universities. Based on the literature review, the similarity of intra-indicators for URAP, NTU and RUR is analyzed for the first time in this study.


Author(s):  
Ping Yi ◽  
Songling Zhang

This paper introduces applications of the Dempster–Shafer (D-S) data fusion technique in transportation system decision making. D-S inference is a statistics-based data classification technique, and it can be used when data sources contribute discontinuous and incomplete information and no single data source can produce an overwhelmingly high probability of certainty for identifying the most probable event. The technique captures and combines the information contributed by the data sources by using Dempster’s rule to find the conjunction of the events and to determine the highest associated probability. The D-S theory is explained and its implementation described through numerical examples of a ride-hailing service and of crowd management at a subway station. Results from the applications have shown that the technique is very effective in dealing with incomplete information and multiple data sources in the era of big data.
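Dempster’s rule of combination can be sketched in a few lines. For two sources, the masses of every pair of hypothesis sets are multiplied. Products whose sets intersect are credited to the intersection, products whose sets are disjoint count as conflict, and the result is renormalized by one minus the total conflict. The code below is a minimal sketch of that rule. The sensor readings, the two-state frame ("crowded", "normal"), and all names are hypothetical and do not reproduce the paper's numerical examples.

```python
from itertools import product


def combine_masses(m1, m2):
    """Combine two mass functions (dicts keyed by frozenset) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (set_a, mass_a), (set_b, mass_b) in product(m1.items(), m2.items()):
        intersection = set_a & set_b
        if intersection:
            # Agreeing evidence is assigned to the common hypothesis set.
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Disjoint hypothesis sets contribute only to the conflict mass.
            conflict += mass_a * mass_b
    if 1.0 - conflict <= 1e-12:
        raise ValueError("Sources are in total conflict; the rule is undefined.")
    return {hypothesis: mass / (1.0 - conflict) for hypothesis, mass in combined.items()}


# Hypothetical evidence about a station platform from two incomplete sources.
CROWDED = frozenset({"crowded"})
NORMAL = frozenset({"normal"})
EITHER = CROWDED | NORMAL  # mass left uncommitted ("don't know")

camera = {CROWDED: 0.6, EITHER: 0.4}
gate_counter = {CROWDED: 0.7, NORMAL: 0.1, EITHER: 0.2}

fused = combine_masses(camera, gate_counter)
most_probable = max(fused, key=fused.get)
print(sorted(most_probable), round(fused[most_probable], 3))
```

Keying the masses by `frozenset` lets set intersection stand in directly for the conjunction of events, and mass assigned to the whole frame represents what a source leaves undecided.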


2020 ◽  
Vol 18 (4) ◽  
pp. 142-152
Author(s):  
Maxim Polyakov ◽  
Vladimir Bilozubenko ◽  
Maxim Korneyev ◽  
Natalia Nebaba

In the context of globalization of the educational services market, competition between universities is becoming more intense. This manifests itself, among other things, in the struggle for positions in international university rankings. Given that universities are evaluated according to many criteria in such rankings, it becomes necessary to identify the most significant factors in determining their positions. This study aims to identify the key factors determining the world’s leading universities’ leadership in international university rankings. The numerical values of the criteria for compiling the QS World University Rankings (QS) and Times Higher Education (THE) rankings were an empirical basis for the study. The analysis covered the Top 50 universities (according to the QS ranking) and was conducted based on reports for 2020 and 2021. At first, clustering was carried out (method – k-means); the data set was the combination of numerical values of QS and THE criteria (six and five criteria, respectively). The universities were divided into three clusters in 2020 (23, 19, 8 universities) and 2021 (23, 17, 10 universities). This showed the universities’ leadership relative to each other for each year. At the second stage, classification processing was performed (method – decision trees). As a result, criteria combinations that give an absolute separation of all clusters (2020 – five combinations; 2021 – eight combinations) were identified. The obtained combinations largely determine universities’ affiliation to clusters; their criteria are recognized as key factors of their leadership in the rankings. This study’s results can serve as guidelines for improving universities’ positions in the rankings.


2020 ◽  
Author(s):  
Chun-Kai Huang ◽  
Cameron Neylon ◽  
Richard Hosking ◽  
Lucy Montgomery ◽  
Katie Wilson ◽  
...  

In the article “Evaluating institutional open access performance: Methodology, challenges and assessment” we develop the first comprehensive and reproducible workflow that integrates multiple bibliographic data sources for evaluating institutional open access (OA) performance. The major data sources include Web of Science, Scopus, Microsoft Academic, and Unpaywall. However, each of these databases continues to update, both actively and retrospectively. This implies that the results produced by the proposed process are potentially sensitive to both the choice of data sources and the versions of them used. In addition, there remain issues relating to selection bias in sample size and margin of error. The current work shows that the levels of sensitivity relating to the above issues can be significant at the institutional level. Hence, transparency and clear documentation of the choices made on data sources (and their versions) and cut-off boundaries are vital for reproducibility and verifiability.


Author(s):  
Pachisa Kulkanjanapiban ◽  
Tipawan Silwattananusarn

This paper presents a comparison of two primary bibliographic data sources, Scopus and Dimensions, at the document level. The emphasis is on the differences in their document coverage at the institutional level of aggregation. The main objective is to assess whether Dimensions offers the same good new possibilities for bibliometric analysis at the institutional level as it does at the global level. The paper reports a comparative study of the citation count profiles of articles published by faculty members of Prince of Songkla University (PSU) in Dimensions and Scopus, from the year each database first included PSU-authored papers (1970 and 1978, respectively) through the end of June 2020. Descriptive statistics and correlation analysis were applied to 19,846 articles indexed in Dimensions and 13,577 indexed in Scopus. The main finding was that citation counts in Dimensions were highly correlated with citation counts in Scopus, as measured by Spearman’s correlation. The findings mainly bear on the potential of Dimensions as an instrument for bibliometric analysis of university members’ research productivity. University researchers can use Dimensions to retrieve information, and the design policies can be used to evaluate research using scientific databases.
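For readers unfamiliar with the statistic, Spearman’s correlation is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below shows the calculation on a handful of made-up citation counts. The numbers and function names are hypothetical, and the paper does not state which software or tie-handling convention it used.

```python
def average_ranks(values):
    """Rank values from 1 upward; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda index: values[index])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # convert 0-based positions to 1-based rank
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical citation counts for the same five articles in two databases.
dimensions_citations = [12, 0, 45, 7, 3]
scopus_citations = [10, 1, 50, 5, 2]

rho = spearman(dimensions_citations, scopus_citations)
print(round(rho, 3))
```

Because only the ordering of the articles matters, the two databases can report different absolute counts and still correlate perfectly, as they do in this toy data.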


Author(s):  
Diana Maria Contreras Mojica ◽  
Sean Wilkinson ◽  
Philip James

Earthquakes are one of the most catastrophic natural phenomena. After an earthquake, earthquake reconnaissance enables effective recovery by collecting data on building damage and other impacts. This paper aims to identify state-of-the-art data sources for building damage assessment and to guide more efficient data collection. It reviews 38 articles that indicate the sources used by different authors to collect data related to damage and post-disaster recovery progress after earthquakes between 2014 and 2021. The current data collection methods have been grouped into seven categories: fieldwork or ground surveys, omnidirectional imagery (OD), terrestrial laser scanning (TLS), remote sensing (RS), crowdsourcing platforms, social media (SM) and closed-circuit television videos (CCTV). The selection of a particular data source or collection technique for earthquake reconnaissance involves different criteria. Nowadays, reconnaissance missions cannot rely on a single data source; different data sources should complement each other, validate collected data, or quantify the damage comprehensively. The recent increase in the number of crowdsourcing and SM platforms used as a source of data for earthquake reconnaissance is a clear indication of where data sources are heading in the future.


2021 ◽  
Author(s):  
Alasdair J. G. Gray ◽  
Petros Papadopoulos ◽  
Ivan Mičetić ◽  
András Hatos

One of the goals of the ELIXIR Intrinsically Disordered Protein (IDP) community is to create a registry called IDPcentral. The registry will aggregate data contained in the community's specialist data sources such as DisProt, MobiDB, and Protein Ensemble Database (PED) so that proteins that are known to be intrinsically disordered can be discovered, with summary details of the protein presented and the specialist source consulted for more detailed data. At the ELIXIR BioHackathon-Europe 2020, we aimed to investigate the feasibility of populating IDPcentral by harvesting the Bioschemas markup that has been deployed on the IDP community data sources. The benefit of using Bioschemas markup, which is embedded in the HTML web pages for each protein in the data source, is that a standard harvesting approach can be used for all data sources, rather than needing bespoke wrappers for each data source API. We expect to harvest the markup using the Bioschemas Markup Scraper and Extractor (BMUSE) tool that has been developed specifically for this purpose. The challenge, however, is that the sources contain overlapping information about proteins but use different identifiers for the proteins. After the data has been harvested, it will need to be processed so that information about a particular protein, which will come from multiple sources, is consolidated into a single concept for the protein, with links back to where each piece of data originated. As well as populating the IDPcentral registry, we plan to consolidate the markup into a knowledge graph that can be queried to gain further insight into the IDPs.

