Finding citations for PubMed: a large-scale comparison between five freely available bibliographic data sources

2021 ◽  
Author(s):  
Zhentao Liang ◽  
Jin Mao ◽  
Kun Lu ◽  
Gang Li


Author(s):  
Vicente P. Guerrero-Bote ◽  
Zaida Chinchilla-Rodríguez ◽  
Abraham Mendoza ◽  
Félix de Moya-Anegón

This paper presents a large-scale document-level comparison of two major bibliographic data sources: Scopus and Dimensions. The focus is on the differences in their coverage of documents at two levels of aggregation: by country and by institution. The main goal is to analyze whether Dimensions offers as good opportunities for bibliometric analysis at the country and institutional levels as it does at the global level. Differences in the completeness and accuracy of citation links are also studied. The results allow a profile of Dimensions to be drawn in terms of its coverage by country and institution. Dimensions’ coverage is more than 25% greater than that of Scopus, which is consistent with previous studies. However, the main finding of this study is the lack of affiliation data in a large fraction of Dimensions documents. We found that close to half of all documents in Dimensions are not associated with any country of affiliation, while the proportion of documents lacking this data in Scopus is much lower. This mainly limits the possibilities that Dimensions can offer as an instrument for carrying out bibliometric analyses at the country and institutional level. Both of these aspects are highly pragmatic considerations for information retrieval and for the design of policies on the use of scientific databases in research evaluation.
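To make the affiliation-coverage issue concrete, the following minimal sketch shows how the share of documents lacking country-of-affiliation data could be computed from an exported document list; the record structure and field names are assumptions for illustration, not the actual Dimensions or Scopus export schema.

```python
# Hypothetical sketch: estimate the share of documents with no country of
# affiliation in an exported document list. Field names are assumptions.
from collections import Counter

def affiliation_coverage(documents):
    """documents: iterable of dicts, each with an optional 'countries' list (assumed schema)."""
    counts = Counter()
    for doc in documents:
        countries = doc.get("countries") or []
        counts["with_country" if countries else "no_country"] += 1
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

sample = [
    {"doi": "10.1000/a", "countries": ["ES"]},   # hypothetical records
    {"doi": "10.1000/b", "countries": []},
]
print(affiliation_coverage(sample))  # {'with_country': 0.5, 'no_country': 0.5}
```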


2021 ◽  
pp. 1-22
Author(s):  
Martijn Visser ◽  
Nees Jan van Eck ◽  
Ludo Waltman

We present a large-scale comparison of five multidisciplinary bibliographic data sources: Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic. The comparison considers scientific documents from the period 2008–2017 covered by these data sources. Scopus is compared in a pairwise manner with each of the other data sources. We first analyze differences between the data sources in the coverage of documents, focusing for instance on differences over time, differences per document type, and differences per discipline. We then study differences in the completeness and accuracy of citation links. Based on our analysis, we discuss the strengths and weaknesses of the different data sources. We emphasize the importance of combining a comprehensive coverage of the scientific literature with a flexible set of filters for making selections of the literature.
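As a rough illustration of such a pairwise comparison, the sketch below matches two document sets by DOI to estimate overlap and source-specific coverage; it assumes DOI-keyed records, whereas the actual matching in studies of this kind also has to handle documents without DOIs.

```python
# Illustrative sketch only: pairwise coverage comparison of two document sets keyed by DOI.
def compare_coverage(scopus_dois, other_dois):
    scopus, other = set(scopus_dois), set(other_dois)
    return {
        "in_both": len(scopus & other),       # documents covered by both sources
        "scopus_only": len(scopus - other),   # covered only by Scopus
        "other_only": len(other - scopus),    # covered only by the other source
    }

print(compare_coverage({"10.1/x", "10.1/y"}, {"10.1/y", "10.1/z"}))
# {'in_both': 1, 'scopus_only': 1, 'other_only': 1}
```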


Epidemiologia ◽  
2021 ◽  
Vol 2 (3) ◽  
pp. 315-324
Author(s):  
Juan M. Banda ◽  
Ramya Tekumalla ◽  
Guanyu Wang ◽  
Jingyuan Yu ◽  
Tuo Liu ◽  
...  

As the COVID-19 pandemic continues to spread worldwide, an unprecedented amount of open data is being generated for medical, genetics, and epidemiological research. The unparalleled rate at which many research groups around the world are releasing data and publications on the ongoing pandemic is allowing other scientists to learn from local experiences and data generated on the front lines of the COVID-19 pandemic. However, there is a need to integrate additional data sources that map and measure the social dynamics of such a unique worldwide event into biomedical, biological, and epidemiological analyses. For this purpose, we present a large-scale curated dataset of over 1.12 billion tweets, growing daily, related to COVID-19 chatter and generated from 1 January 2020 to 27 June 2021 at the time of writing. This resource provides a freely available additional data source for researchers worldwide to conduct a wide and diverse range of research projects, such as epidemiological analyses, studies of emotional and mental responses to social distancing measures, identification of sources of misinformation, and stratified measurement of sentiment towards the pandemic in near real time, among many others.
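Tweet datasets of this kind are typically distributed as tweet identifiers that must be rehydrated through the Twitter API before analysis. The sketch below filters identifiers to a date range prior to rehydration; the file name and column names are assumptions rather than the dataset's documented layout.

```python
# Hypothetical sketch: select tweet IDs from a date range prior to rehydration.
# The file name and the 'tweet_id'/'date' columns are assumptions, not the
# dataset's documented schema.
import pandas as pd

ids = pd.read_csv("covid19_tweet_ids.tsv", sep="\t", dtype={"tweet_id": str})
ids["date"] = pd.to_datetime(ids["date"])
march_2020 = ids[(ids["date"] >= "2020-03-01") & (ids["date"] < "2020-04-01")]
march_2020["tweet_id"].to_csv("march_2020_ids.txt", index=False, header=False)
```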


2020 ◽  
Vol 14 (3) ◽  
pp. 320-328
Author(s):  
Long Guo ◽  
Lifeng Hua ◽  
Rongfei Jia ◽  
Fei Fang ◽  
Binqiang Zhao ◽  
...  

With the rapid growth of e-commerce in recent years, e-commerce platforms are becoming a primary place for people to find, compare, and ultimately purchase products. To improve the online shopping experience for consumers and increase sales for sellers, it is important to understand user intent accurately and to be notified of changes in intent in a timely manner. In this way, the right information can be offered to the right person at the right time. To achieve this goal, we propose a unified deep intent prediction network, named EdgeDIPN, which is deployed at the edge, i.e., on the mobile device, and is able to monitor multiple user intents at different granularities simultaneously and in real time. We propose to train EdgeDIPN with multi-task learning, by which EdgeDIPN can share representations between different tasks for better performance while saving edge resources. In particular, we propose a novel task-specific attention mechanism which enables different tasks to pick out the most relevant features from different data sources. To extract the shared representations more effectively, we utilize two kinds of attention mechanisms: a multi-level attention mechanism that identifies the important actions within each data source, and an inter-view attention mechanism that learns the interactions between different data sources. In experiments conducted on a large-scale industrial dataset, EdgeDIPN significantly outperforms the baseline solutions. Moreover, EdgeDIPN has been deployed in the operational system of Alibaba. Online A/B testing results in several business scenarios reveal the potential of monitoring user intent in real time. To the best of our knowledge, EdgeDIPN is the first full-fledged real-time user intent understanding center deployed at the edge and serving hundreds of millions of users on a large-scale e-commerce platform.
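As a rough illustration of the task-specific attention idea, the sketch below scores per-source feature vectors with a learned query per task and returns a task-specific weighted combination; the dimensions and structure are assumptions for illustration, not EdgeDIPN's published architecture.

```python
# Illustrative sketch of task-specific attention over per-source representations;
# shapes and structure are assumptions, not the paper's actual model.
import torch
import torch.nn as nn

class TaskSpecificAttention(nn.Module):
    def __init__(self, dim, n_tasks):
        super().__init__()
        # one learned query per task, used to score each data source's features
        self.task_queries = nn.Parameter(torch.randn(n_tasks, dim))

    def forward(self, source_feats, task_id):
        # source_feats: (batch, n_sources, dim)
        query = self.task_queries[task_id]                     # (dim,)
        scores = source_feats @ query                          # (batch, n_sources)
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)  # (batch, n_sources, 1)
        return (weights * source_feats).sum(dim=1)             # (batch, dim)

attn = TaskSpecificAttention(dim=32, n_tasks=3)
fused = attn(torch.randn(4, 5, 32), task_id=0)  # task 0 picks its own source mix
print(fused.shape)  # torch.Size([4, 32])
```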


2018 ◽  
Vol 188 ◽  
pp. 05004
Author(s):  
Christos Panagiotou ◽  
Christos Antonopoulos ◽  
Stavros Koubias

WSNs, as adopted in current smart city deployments, must handle demanding traffic patterns and remain resilient to failures. Furthermore, caching data in a WSN can significantly benefit resource conservation and network performance. However, data sources generate data volumes that cannot fit within the restricted cache resources of the caching nodes, which unavoidably leads to data items being evicted and replaced. This paper experimentally evaluates prominent caching techniques in large-scale networks that resemble the smart city paradigm, assessing network performance with respect to critical application and network parameters. The result analysis provides valuable insights into the behaviour of caching in typical large-scale WSN scenarios.
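As a concrete example of the eviction-and-replacement behaviour such evaluations exercise, the sketch below implements a least-recently-used (LRU) cache on a node with limited capacity; the abstract does not name the specific policies evaluated, so LRU is used here purely for illustration.

```python
# Minimal LRU cache sketch to illustrate eviction when a node's cache is full.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        if key not in self.store:
            return None
        self.store.move_to_end(key)          # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:  # evict the least recently used item
            self.store.popitem(last=False)

cache = LRUCache(capacity=2)
cache.put("t1", 21.5); cache.put("t2", 19.8); cache.put("t3", 22.1)
print(cache.get("t1"))  # None: 't1' was evicted when 't3' arrived
```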


2021 ◽  
Author(s):  
Yumi Wakabayashi ◽  
Masamitsu Eitoku ◽  
Narufumi Suganuma

Abstract Background Interventional studies are the fundamental method for obtaining answers to clinical questions. However, these studies are sometimes difficult to conduct because of insufficient financial or human resources or the rarity of the disease in question. One means of addressing these issues is to conduct a non-interventional observational study using electronic health record (EHR) databases as the data source, although how best to evaluate the suitability of an EHR database when planning a study remains to be clarified. The aim of the present study is to identify and characterize the data sources that have been used for conducting non-interventional observational studies in Japan and to propose a flow diagram to help researchers determine the most appropriate EHR database for their study goals. Methods We compiled a list of published articles reporting observational studies conducted in Japan by searching PubMed for relevant articles published in the last 3 years and by searching database providers’ publication lists related to studies using their databases. For each article, we reviewed the abstract and/or full text to obtain information about data source, target disease or therapeutic area, number of patients, and study design (prospective or retrospective). We then characterized the identified EHR databases. Results In Japan, non-interventional observational studies have mostly been conducted using data stored locally at individual medical institutions (713/1463) or collected from several collaborating medical institutions (351/1463). Whereas the studies conducted with large-scale integrated databases (195/1463) were mostly retrospective (68.2%), 27.2% of the single-center studies, 46.2% of the multi-center studies, and 74.4% of the post-marketing surveillance studies identified in the present study were conducted prospectively. Conclusions Our analysis revealed that non-interventional observational studies in Japan have been conducted using data stored locally at individual medical institutions or collected from collaborating medical institutions. Disease registries, disease databases, and large-scale databases would enable researchers to conduct studies with large sample sizes to provide robust data from which strong inferences could be drawn. Using our flow diagram, researchers planning non-interventional observational studies should consider the strengths and limitations of each available database and choose the most appropriate one for their study goals. Trial registration Not applicable.
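As a purely hypothetical illustration of the kind of decision logic such a flow diagram encodes, the sketch below maps a few study requirements to candidate data-source types; the criteria and their ordering are assumptions, not the authors' actual diagram.

```python
# Hypothetical sketch of database-selection logic in the spirit of a selection
# flow diagram; the criteria and their order are illustrative assumptions only.
def suggest_data_source(rare_disease, needs_prospective, needs_large_sample):
    if needs_prospective:
        return "single- or multi-center collection at collaborating institutions"
    if rare_disease:
        return "disease registry or disease-specific database"
    if needs_large_sample:
        return "large-scale integrated database"
    return "data stored locally at an individual medical institution"

print(suggest_data_source(rare_disease=False, needs_prospective=False,
                          needs_large_sample=True))
```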


2020 ◽  
Author(s):  
James A. Fellows Yates ◽  
Aida Andrades Valtueña ◽  
Ashild J. Vågene ◽  
Becky Cribdon ◽  
Irina M. Velsko ◽  
...  

Ancient DNA and RNA are valuable data sources for a wide range of disciplines. Within the field of ancient metagenomics, the number of published genetic datasets has risen dramatically in recent years, and tracking this data for reuse is particularly important for large-scale ecological and evolutionary studies of individual microbial taxa, microbial communities, and metagenomic assemblages. AncientMetagenomeDir (archived at https://doi.org/10.5281/zenodo.3980833) is a collection of indices of published genetic data deriving from ancient microbial samples that provides basic, standardised metadata and accession numbers to allow rapid data retrieval from online repositories. These collections are community-curated and span multiple sub-disciplines in order to ensure adequate breadth and consensus in metadata definitions, as well as longevity of the database. Internal guidelines and automated checks to facilitate compatibility with established sequence-read archives and term-ontologies ensure consistency and interoperability for future meta-analyses. This collection will also assist in standardising metadata reporting for future ancient metagenomic studies.
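The following sketch illustrates the general shape of an automated metadata check of the kind described; the required columns and the accession-number pattern are assumptions for illustration, not AncientMetagenomeDir's actual schema or validation rules.

```python
# Illustrative metadata check; required fields and the accession pattern are
# assumptions, not the project's actual schema.
import re

REQUIRED_FIELDS = ["project_name", "sample_name", "archive_accession"]  # assumed
ACCESSION_RE = re.compile(r"^(ERS|SRS|ERR|SRR|PRJ[EDN][A-Z])\w+$")       # assumed

def validate_record(record):
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if not record.get(f)]
    acc = record.get("archive_accession", "")
    if acc and not ACCESSION_RE.match(acc):
        errors.append(f"accession does not look like an INSDC accession: {acc}")
    return errors

print(validate_record({"project_name": "Example2020", "sample_name": "S1",
                       "archive_accession": "ERS1234567"}))  # []
```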


Author(s):  
Pattabiraman V. ◽  
Parvathi R.

Natural data arising directly from various sources, such as text, image, video, audio, and sensor data, comes with an inherent property of having very large numbers of dimensions or features. While these features add richness and perspective to the data, the sparsity associated with them adds to the computational complexity of learning, makes the data difficult to visualize and interpret, and thus requires large-scale computational power to extract insights. This is famously called the “curse of dimensionality.” This chapter discusses conventional methods for curing the curse of dimensionality and analyzes their performance on complex datasets. It also discusses the advantages of nonlinear methods over linear methods, and of neural networks, which can be a better approach when compared to other nonlinear methods. It also discusses future research areas, such as the application of deep learning techniques as a cure for this curse.
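A minimal sketch of the linear-versus-nonlinear contrast discussed in the chapter, using scikit-learn to project a toy nonlinear dataset with PCA (linear) and kernel PCA (nonlinear); the dataset and parameters are illustrative, not those used in the chapter.

```python
# Sketch contrasting a linear and a nonlinear dimensionality-reduction method
# on a toy nonlinear dataset; parameters are illustrative, not tuned.
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA, KernelPCA

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

linear_2d = PCA(n_components=2).fit_transform(X)          # linear projection
nonlinear_2d = KernelPCA(n_components=2, kernel="rbf",
                         gamma=0.02).fit_transform(X)     # nonlinear embedding

print(linear_2d.shape, nonlinear_2d.shape)  # (1000, 2) (1000, 2)
```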


2019 ◽  
Vol 28 (6) ◽  
pp. 560-566 ◽  
Author(s):  
Anat Rafaeli ◽  
Shelly Ashtar ◽  
Daniel Altman

New technologies create and archive digital traces—records of people’s behavior—that can supplement and enrich psychological research. Digital traces offer psychological-science researchers novel, large-scale data (which reflect people’s actual behaviors) that can be rapidly collected and analyzed with new tools. We promote the integration of digital-trace data into psychological science, suggesting that it can enrich current research and overcome some of its limitations. In this article, we review helpful data sources, tools, and resources and discuss challenges associated with using digital traces in psychological research. Our review positions digital-trace research as complementary to traditional psychological-research methods and as offering the potential to enrich insights on human psychology.


AI Magazine ◽  
2016 ◽  
Vol 37 (2) ◽  
pp. 19-32 ◽  
Author(s):  
Sasin Janpuangtong ◽  
Dylan A. Shell

The infrastructure and tools necessary for large-scale data analytics, formerly the exclusive purview of experts, are increasingly available. Whereas a knowledgeable data-miner or domain expert can rightly be expected to exercise caution when required (for example, around fallacious conclusions supposedly supported by the data), the nonexpert may benefit from some judicious assistance. This article describes an end-to-end learning framework that allows a novice to create models from data easily by helping structure the model-building process and capturing extended aspects of domain knowledge. By treating the whole modeling process interactively and exploiting high-level knowledge in the form of an ontology, the framework is able to aid the user in a number of ways, including helping to avoid pitfalls such as data dredging. Prudence must be exercised to avoid these hazards, as certain conclusions may only be supported if, for example, there is extra knowledge that gives reason to trust a narrower set of hypotheses. This article adopts the solution of using higher-level knowledge to allow this sort of domain knowledge to be used automatically, selecting relevant input attributes and thereby constraining the hypothesis space. We describe how the framework automatically exploits structured knowledge in an ontology to identify relevant concepts, and how a data extraction component can make use of online data sources to find measurements of those concepts so that their relevance can be evaluated. To validate our approach, models of four different problem domains were built using our implementation of the framework. Prediction error on unseen examples from these models shows that our framework, making use of the ontology, helps to improve model generalization.
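The sketch below illustrates, under simplified assumptions, how an ontology could be used to keep only input attributes whose concepts are linked to the target concept; the concept graph, attribute-to-concept mapping, and distance cutoff are hypothetical and not the article's implementation.

```python
# Hypothetical sketch of ontology-guided attribute selection: keep only input
# attributes whose concepts are connected to the target concept in a small
# concept graph. The graph, mapping, and depth cutoff are illustrative.
from collections import deque

ONTOLOGY = {  # assumed "related-to" edges between concepts
    "obesity": ["diet", "exercise"],
    "diet": ["calorie_intake"],
    "exercise": ["activity_minutes"],
}
ATTRIBUTE_CONCEPTS = {"calorie_intake": "calorie_intake",
                      "activity_minutes": "activity_minutes",
                      "zip_code": "geography"}  # assumed attribute-to-concept map

def related_concepts(start, max_depth=2):
    # breadth-first search over the concept graph up to max_depth hops
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        concept, depth = queue.popleft()
        if depth == max_depth:
            continue
        for nxt in ONTOLOGY.get(concept, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen

def select_attributes(target_concept, attributes):
    relevant = related_concepts(target_concept)
    return [a for a in attributes if ATTRIBUTE_CONCEPTS.get(a) in relevant]

print(select_attributes("obesity", ["calorie_intake", "activity_minutes", "zip_code"]))
# ['calorie_intake', 'activity_minutes']  ('zip_code' has no ontology link here)
```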

