Large expert-curated database for benchmarking document similarity detection in biomedical literature search

Database ◽

10.1093/database/baz085 ◽

2019 ◽

Vol 2019 ◽

Author(s):

Peter Brown ◽

Aik-Choon Tan ◽

Mohamed A El-Esawi ◽

Thomas Liehr ◽

Oliver Blanck ◽

...

Keyword(s):

Literature Search ◽

Relevant Literature ◽

Biomedical Literature ◽

Medical Subject Headings ◽

Document Similarity ◽

Inverse Document Frequency ◽

Research Fields ◽

Experience Levels ◽

Document Frequency ◽

Systematic Biases

Abstract Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that cover a variety of research fields such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article/s. The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency–Inverse Document Frequency and PubMed Related Articles) had similar overall performances. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for the downloading of annotation data and the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new powerful techniques for title and title/abstract-based search engines for relevant articles in biomedical research.
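One of the three baseline methods named in this abstract, Okapi Best Matching 25 (BM25), can be illustrated with a short sketch. This is not the consortium's implementation: the whitespace tokenization, the parameter values `k1=1.5` and `b=0.75` (common defaults), the smoothed IDF variant, and the toy seed and candidate texts are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized candidate against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of candidates containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative (an assumption).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical seed article and candidate titles (toy data, not RELISH data).
seed = "cytokine gene expression in disease".split()
candidates = [
    "cytokine gene expression arrays".split(),
    "gene regulation in plants".split(),
    "farm management decision making".split(),
]
scores = bm25_scores(seed, candidates)
ranking = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
```

Here `ranking` lists candidate indices from most to least similar to the seed; a candidate that shares no terms with the seed scores zero.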


Erratum to: Large expert-curated database for benchmarking document similarity detection in biomedical literature search

Database ◽

10.1093/database/baz138 ◽

2020 ◽

Vol 2020 ◽

Author(s):

Peter Brown ◽

Yaoqi Zhou ◽

Keyword(s):

Literature Search ◽

Biomedical Literature ◽

Document Similarity ◽

Similarity Detection


Text documents clustering using data mining techniques

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v11i1.pp664-670 ◽

2021 ◽

Vol 11 (1) ◽

pp. 664

Author(s):

Ahmed Adeeb Jalal ◽

Basheer Husham Ali

Keyword(s):

Information Technologies ◽

Scientific Field ◽

Research Papers ◽

Text Documents ◽

Inverse Document Frequency ◽

Classification Approach ◽

Research Fields ◽

Document Frequency ◽

Specific Category ◽

Using Data

Increasing progress in numerous research fields and information technologies has led to an increase in the publication of research papers. As a result, researchers spend a lot of time finding interesting research papers that are close to their field of specialization. In this paper, we propose a document classification approach that clusters the text documents of research papers into meaningful categories, each covering a similar scientific field. The approach is based on the essential focus and scope of the target categories, where each category includes many topics. Accordingly, we extract the word tokens from the topics that relate to each specific category separately. The frequency of word tokens in a document affects the weight of that document, which is calculated using the term frequency-inverse document frequency (TF-IDF) statistic. The proposed approach uses the title, abstract, and keywords of the paper, in addition to the category topics, to perform the classification. Subsequently, documents are classified and clustered into the primary categories based on the highest cosine similarity between the category weights and the document weights.
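The pipeline described in this abstract (TF-IDF weights, then assignment to the category with the highest cosine similarity) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the category names, topic words, paper texts, whitespace tokenization, and the smoothed IDF formula are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict term -> weight) per text."""
    tokenized = [t.lower().split() for t in texts]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF (an assumption) keeps weights positive for common terms.
        vectors.append({
            term: (count / len(toks)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical category topics and paper texts (title + abstract + keywords).
categories = {
    "networks": "network routing protocol wireless",
    "machine_learning": "learning classification neural training",
}
papers = [
    "wireless sensor network routing",
    "neural network training for classification",
]

names = list(categories)
vectors = tfidf_vectors(list(categories.values()) + papers)
category_vecs, paper_vecs = vectors[:len(names)], vectors[len(names):]

# Assign each paper to the category with the highest cosine similarity.
labels = [
    max(names, key=lambda name: cosine(pv, category_vecs[names.index(name)]))
    for pv in paper_vecs
]
```

With these toy inputs, `labels` holds one category name per paper. Fitting the IDF over categories and papers together is a design choice of this sketch; the abstract does not say how the authors computed it.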


Design and Implementation of a Big Data Evaluator Recommendation System Using Deep Learning Methodology

Applied Sciences ◽

10.3390/app10228000 ◽

2020 ◽

Vol 10 (22) ◽

pp. 8000

Author(s):

Sukil Cha ◽

Mun Y. Yi ◽

Sekyoung Youm

Keyword(s):

Big Data ◽

Deep Learning ◽

Full Text ◽

Recommendation System ◽

Selection Process ◽

Korean Literature ◽

Inverse Document Frequency ◽

Design And Implementation ◽

Research Fields ◽

Document Frequency

As the number of researchers in South Korea has grown, there is increasing dissatisfaction with the selection process for national research and development (R&D) projects among unsuccessful applicants. In this study, we designed a system that can recommend the best possible R&D evaluators using big data that are collected from related systems, refined, and analyzed. Our big data recommendation system compares keywords extracted from applications and from the full text of the achievements of the evaluator candidates. Weights for different keywords are scored using the term frequency–inverse document frequency algorithm. Using the keywords extracted from the evaluator candidates' achievements, a project comparison module searches, scores, and ranks these achievements by their similarity to the project applications. The similarity scoring module calculates the overall similarity scores for different candidates based on the project comparison module scores. To assess the performance of the evaluator candidate recommendation system, evaluator candidates were recommended by the system for 61 applications in three Review Board (RB) research fields (system fusion, organic biochemistry, and Korean literature), in the same manner as the RB's recommendation. Our tests reveal that the evaluator candidates recommended by the Korean Review Board and those recommended by our system for the 61 applications in different areas were the same. However, our system performed the recommendation in less time, with no bias and fewer personnel. The system requires revisions to reflect qualitative indicators, such as journal reputation, before it can entirely replace the current evaluator recommendation process.


Immune modulators in disease: integrating knowledge from the biomedical literature and gene expression

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocv166 ◽

2015 ◽

Vol 23 (3) ◽

pp. 617-626 ◽

Cited By ~ 1

Author(s):

Nophar Geifman ◽

Sanchita Bhattacharya ◽

Atul J Butte

Keyword(s):

Gene Expression ◽

Large Scale ◽

Biomedical Literature ◽

Cytokine Gene Expression ◽

Future Research ◽

Cytokine Gene ◽

Medical Subject Headings ◽

Expression Arrays ◽

Gene Expression Arrays ◽

Subject Headings

Abstract Objective: Cytokines play a central role in both health and disease, modulating immune responses and acting as diagnostic markers and therapeutic targets. This work takes a systems-level approach for integration and examination of immune patterns, such as cytokine gene expression with information from biomedical literature, and applies it in the context of disease, with the objective of identifying potentially useful relationships and areas for future research. Results: We present herein the integration and analysis of immune-related knowledge, namely, information derived from biomedical literature and gene expression arrays. Cytokine-disease associations were captured from over 2.4 million PubMed records, in the form of Medical Subject Headings descriptor co-occurrences, as well as from gene expression arrays. Clustering of cytokine-disease co-occurrences from biomedical literature is shown to reflect current medical knowledge as well as potentially novel relationships between diseases. A correlation analysis of cytokine gene expression in a variety of diseases revealed compelling relationships. Finally, a novel analysis comparing cytokine gene expression in different diseases to parallel associations captured from the biomedical literature was used to examine which associations are interesting for further investigation. Discussion: We demonstrate the usefulness of capturing Medical Subject Headings descriptor co-occurrences from biomedical publications in the generation of valid and potentially useful hypotheses. Furthermore, integrating and comparing descriptor co-occurrences with gene expression data was shown to be useful in detecting new, potentially fruitful, and unaddressed areas of research. Conclusion: Using integrated large-scale data captured from the scientific literature and experimental data, a better understanding of the immune mechanisms underlying disease can be achieved and applied to research.
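The descriptor co-occurrence idea in this abstract can be sketched as a simple count of how often a cytokine descriptor and a disease descriptor are assigned to the same record. The sketch below is only an illustration under assumptions: the function name, the representation of each record as a set of descriptor strings, and the three toy records are hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(records, cytokines, diseases):
    """Count records in which a cytokine and a disease descriptor co-occur."""
    counts = Counter()
    for descriptors in records:
        found_cytokines = cytokines & descriptors
        found_diseases = diseases & descriptors
        # Every cytokine-disease pair present in this record counts once.
        for pair in product(sorted(found_cytokines), sorted(found_diseases)):
            counts[pair] += 1
    return counts


# Toy records: each is the set of descriptors assigned to one article.
records = [
    {"Interleukin-6", "Arthritis, Rheumatoid"},
    {"Interleukin-6", "Asthma", "Humans"},
    {"Interleukin-6", "Arthritis, Rheumatoid", "Tumor Necrosis Factor-alpha"},
]
cytokines = {"Interleukin-6", "Tumor Necrosis Factor-alpha"}
diseases = {"Arthritis, Rheumatoid", "Asthma"}

counts = cooccurrence_counts(records, cytokines, diseases)
```

The resulting `counts` maps each (cytokine, disease) pair to the number of records containing both; a matrix built from such counts could then feed a clustering step like the one the abstract mentions.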


Poisson mixtures

Natural Language Engineering ◽

10.1017/s1351324900000139 ◽

1995 ◽

Vol 1 (2) ◽

pp. 163-190 ◽

Cited By ~ 146

Author(s):

Kenneth W. Church ◽

William A. Gale

Keyword(s):

Negative Binomial ◽

Probability Distributions ◽

Hidden Variables ◽

Heterogeneous Structure ◽

Text Compression ◽

Inverse Document Frequency ◽

Poisson Mixtures ◽

Document Frequency ◽

Wide Range ◽

Better Than

Abstract Shannon (1948) showed that a wide range of practical problems can be reduced to the problem of estimating probability distributions of words and ngrams in text. It has become standard practice in text compression, speech recognition, information retrieval and many other applications of Shannon's theory to introduce a “bag-of-words” assumption. But obviously, word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph. The proposed Poisson mixture captures much of this heterogeneous structure by allowing the Poisson parameter θ to vary over documents subject to a density function φ. φ is intended to capture dependencies on hidden variables such as genre, author, topic, etc. (The Negative Binomial is a well-known special case where φ is a Γ distribution.) Poisson mixtures fit the data better than standard Poissons, producing more accurate estimates of the variance over documents (σ²), entropy (H), inverse document frequency (IDF), and adaptation (Pr(x ≥ 2 | x ≥ 1)).
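To make the contrast in this abstract concrete, the sketch below compares the IDF and adaptation predicted by a single Poisson with those predicted by a Negative Binomial (the Γ-mixed Poisson) that has the same mean rate. It assumes IDF is defined as −log₂ Pr(x ≥ 1), the negative log of the fraction of documents containing the word; the mean rate of 0.1 occurrences per document and the dispersion `k = 0.1` are illustrative values, not figures from the paper.

```python
import math


def _idf_and_adaptation(p0, p1):
    """Derive IDF and adaptation from Pr(x = 0) and Pr(x = 1)."""
    at_least_one = 1.0 - p0
    idf = -math.log2(at_least_one)                     # -log2 Pr(x >= 1)
    adaptation = (at_least_one - p1) / at_least_one    # Pr(x >= 2 | x >= 1)
    return idf, adaptation


def poisson_stats(theta):
    """Predictions when every document shares one Poisson rate theta."""
    p0 = math.exp(-theta)
    p1 = theta * p0
    return _idf_and_adaptation(p0, p1)


def negative_binomial_stats(mean, k):
    """Predictions when the Poisson rate is Gamma-distributed over documents.

    k is the Gamma shape (dispersion); smaller k means more heterogeneity.
    """
    q = mean / (k + mean)
    p0 = (1.0 - q) ** k
    p1 = k * q * p0
    return _idf_and_adaptation(p0, p1)


# Illustrative values: 0.1 occurrences per document on average.
poisson_idf, poisson_adapt = poisson_stats(0.1)
nb_idf, nb_adapt = negative_binomial_stats(0.1, k=0.1)
```

With the same mean, the mixture concentrates occurrences in fewer documents, so it predicts both a higher IDF and a higher chance of a second occurrence given a first. As `k` grows large, the Negative Binomial predictions approach the Poisson ones.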


Demystifying intuition as the main decision system used by farmers.

CAB Reviews Perspectives in Agriculture Veterinary Science Nutrition and Natural Resources ◽

10.1079/pavsnnr202116010 ◽

2021 ◽

Vol 16 (10) ◽

Author(s):

Peter Nuthall

Keyword(s):

Decision Making ◽

Literature Search ◽

Relevant Literature ◽

Farm Management ◽

Decision Making Process ◽

Decision System ◽

Intuitive Process ◽

Important Improvement ◽

The Many ◽

Do So

Abstract Over the decades, many researchers have explored the concept of intuition as a decision-making process. However, most of this research does not quantify the important aspects of intuition, making it difficult to fully understand its nature and improve the intuitive process, enabling an efficient method of decision-making. The research described here, through a review of the relevant literature, demystifies intuition as a decision system by isolating the important intuition determining variables and relating them to quantitative intuition research. As most farm decisions are made through intuition, farmers, consultants, researchers and students of farm management will find the review useful, stimulating efforts for improving decision-making skills in farmers. The literature search covered all journals and recent decades and includes articles that consider the variables to be targeted in improving intuitive skill. This provides a basis for thinking about intuition and its improvement within the farming world. It was found from the literature that most of the logical areas that should influence decisions do in fact do so and should be targeted in improving intuition. One of the most important improvement processes is a farmer's self-criticism skills through using a decision diary in conjunction with reflection and consultation leading to improved decisions. This must be in conjunction with understanding, and learning about, the many other variables also impacting on intuitive skill.


Inverse document frequency-based sensitivity scoring for privacy analysis

Signal Image and Video Processing ◽

10.1007/s11760-021-02013-1 ◽

2021 ◽

Author(s):

Onder Coban ◽

Ali Inan ◽

Selma Ayse Ozel

Keyword(s):

Inverse Document Frequency ◽

Document Frequency ◽

Privacy Analysis


PubMed Labs: an experimental system for improving biomedical literature search

Database ◽

10.1093/database/bay094 ◽

2018 ◽

Vol 2018 ◽

Cited By ~ 9

Author(s):

Nicolas Fiorini ◽

Kathi Canese ◽

Rostyslav Bryzgunov ◽

Ievgeniia Radetska ◽

Asta Gindulyte ◽

...

Keyword(s):

Literature Search ◽

Experimental System ◽

Biomedical Literature


A Combination of Text Mining Techniques for Relevant Literature Search and Extractive Summarization

Proceedings of the 2nd International Conference on Natural Language Processing and Information Retrieval - NLPIR 2018 ◽

10.1145/3278293.3278300 ◽

2018 ◽

Author(s):

Thiptanawat Phongwattana ◽

Jonathan H. Chan

Keyword(s):

Text Mining ◽

Literature Search ◽

Relevant Literature ◽

Extractive Summarization


Efficient natural language classification algorithm for detecting duplicate unsupervised features

Informatics and Automation - Информатика и автоматизация ◽

10.15622/ia.2021.3.5 ◽

2021 ◽

Vol 20 (3) ◽

pp. 623-653

Author(s):

Saud Altaf ◽

Sofia Iqbal ◽

Muhammad Waseem Soomro

Keyword(s):

Natural Language ◽

Short Term Memory ◽

Short Term ◽

Vocabulary Size ◽

Language Understanding ◽

Inverse Document Frequency ◽

Classification Technique ◽

Document Frequency ◽

Text Features ◽

Long Short Term Memory

This paper focuses on capturing the meaning of Natural Language Understanding (NLU) text features to detect the duplicate unsupervised features. The NLU features are compared with lexical approaches to prove the suitable classification technique. The transfer-learning approach is utilized to train the extraction of features on the Semantic Textual Similarity (STS) task. All features are evaluated with two types of datasets that belong to Bosch bug and Wikipedia article reports. This study aims to structure the recent research efforts by comparing NLU concepts for featuring semantics of text and applying it to IR. The main contribution of this paper is a comparative study of semantic similarity measurements. The experimental results demonstrate the Term Frequency–Inverse Document Frequency (TF-IDF) feature results on both datasets with reasonable vocabulary size. It indicates that the Bidirectional Long Short Term Memory (BiLSTM) can learn the structure of a sentence to improve the classification.
