An Automatic Pipeline for Biomedical Research Hotspot Mining: Keyphrase Mining in Specific Disease Fields (Preprint)

2021 ◽  
Author(s):  
Ling Chai ◽  
Xiaoming Wu ◽  
Yuan Ni ◽  
Guotong Xie ◽  
Liyu Cao ◽  
...  

BACKGROUND: With the increase in the number of biomedical scientific publications, it is of great value to characterize the research status of subtopics in this field, especially in specific disease fields. However, there has been no fully automated pipeline for mining and analysing research hotspots in such fields.
OBJECTIVE: We propose a completely automatic method based on natural language processing technology to analyse scientific innovations in a specific disease area.
METHODS: The whole pipeline consists of three steps: keyphrase extraction, clustering, and cluster naming. The pipeline extends existing literature analysis methods (including keyphrase extraction, document clustering, and paper ranking), adds advanced semantic mining technology (contextualized embeddings from pre-trained language models), and designs a document cluster naming strategy based on core document mining and topic-related phrase mining. With this pipeline, a full picture of the field of a specific disease is established. Distinct document clusters are generated to describe the various subfields of disease-related research. Core documents and topic-related phrases are used to name clusters and to interpret the concerns of researchers. In addition, the relations between clusters are analysed. Finally, several important clusters are examined; their core citation paths illustrate the research roadmap of each subfield, and their phrases directly describe the hotspots in each subfield.
RESULTS: We applied the method to the field of cataracts. From 35,117 cataract publications, the proposed method extracted high-frequency phrases such as cataract extraction, cataract formation, and intraocular pressure. The method also identified the most important documents in this field, which reveal the flow of research hotspots over time. In total, 23 communities are generated, and the top 10 topic-related phrases and core documents are extracted to name them. The cluster with the most papers is mainly about cataract formation. The cluster with the most high-impact papers focuses on common cataract diseases related to cataract epidemiology surveys. The cluster with the highest novelty and progressiveness is related to the femtosecond laser technique.
CONCLUSIONS: This fully automated method can provide a full picture of the research status of a specific disease field without expert annotation.
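The three-step pipeline described in the METHODS section can be illustrated with a minimal sketch. The paper relies on contextualized embeddings from pre-trained language models; for a self-contained example, the sketch below substitutes TF-IDF vectors and k-means clustering and names clusters by their highest-weighted phrases. All function and variable names are illustrative, not the authors' code.

```python
# Minimal sketch of the three-step pipeline: keyphrase extraction,
# document clustering, and cluster naming. TF-IDF stands in for the
# contextualized embeddings used in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

def cluster_and_name(abstracts, n_clusters=23, top_k=10):
    # Step 1: candidate keyphrases as unigrams/bigrams weighted by TF-IDF.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english",
                                 max_features=20000)
    X = vectorizer.fit_transform(abstracts)

    # Step 2: group documents into subfield clusters.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)

    # Step 3: name each cluster by its highest-weighted phrases, a stand-in
    # for the paper's core-document / topic-phrase naming strategy.
    terms = np.array(vectorizer.get_feature_names_out())
    names = {}
    for c in range(n_clusters):
        centroid = km.cluster_centers_[c]
        names[c] = terms[np.argsort(centroid)[::-1][:top_k]].tolist()
    return km.labels_, names
```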

2017 ◽  
Vol 4 (4) ◽  
pp. 201-206
Author(s):  
Chi-Chen Zhang ◽  
Rui-Fang Zhu ◽  
Hui-Ning Zhao ◽  
Zhen-Zhen Jin ◽  
Feng-Ru Yan ◽  
...  

2021 ◽  
Vol 2021 ◽  
pp. 1-16
Author(s):  
Yunfei Yang ◽  
Ke Lv ◽  
Jian Xue ◽  
Xi Huang

Fractional order research has interdisciplinary characteristics and has been widely applied in the natural sciences; it has therefore become an important area of concern for scholars. This paper used 2,854 publications collected from the China National Knowledge Infrastructure (CNKI) database between 2001 and 2020 as the data source and applied bibliometric analysis and two visualization methods to China's fractional order research. First, this paper analyzed the time-series distribution of publications, the distribution of research institutions, the author co-occurrence network, the distribution of important journals, and the distribution of important literature, which characterizes the research status of the field. Furthermore, this paper used the VOSviewer software to analyze the clustering and density distribution of fractional order research keywords, which reveals the hotspots of fractional order research. Finally, with the help of the CiteSpace software, burst keywords were analyzed to further explore the frontiers of fractional order research. This paper systematically reveals the research status, hotspots, and frontiers of China's fractional order research, which can provide theoretical and practical references for follow-up researchers.
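The keyword clustering and density maps produced by VOSviewer are built on keyword co-occurrence counts. The sketch below shows that underlying counting step under simplified assumptions; the records and keywords are hypothetical and are not drawn from the CNKI data.

```python
# Minimal sketch of the keyword co-occurrence counting that underlies the
# VOSviewer clustering step (VOSviewer and CiteSpace handle this internally;
# the records here are illustrative).
from itertools import combinations
from collections import Counter

def cooccurrence(records):
    """records: list of keyword lists, one per publication."""
    pairs = Counter()
    for keywords in records:
        for a, b in combinations(sorted(set(keywords)), 2):
            pairs[(a, b)] += 1
    return pairs

# Example: three hypothetical bibliographic records.
recs = [["fractional calculus", "chaos", "synchronization"],
        ["fractional calculus", "viscoelasticity"],
        ["chaos", "synchronization", "control"]]
print(cooccurrence(recs).most_common(3))
```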


2021 ◽  
Vol 7 ◽  
pp. 205520762110576
Author(s):  
Phillip Richter-Pechanski ◽  
Nicolas A Geis ◽  
Christina Kiriakou ◽  
Dominic M Schwab ◽  
Christoph Dieterich

Objective A vast amount of medical data is still stored in unstructured text documents. We present an automated method of information extraction from unstructured German clinical routine data from the cardiology domain, enabling their usage in state-of-the-art data-driven deep learning projects. Methods We evaluated pre-trained language models to extract a set of 12 cardiovascular concepts from German discharge letters. We compared three bidirectional encoder representations from transformers (BERT) models pre-trained on different corpora and fine-tuned them on the task of cardiovascular concept extraction using 204 discharge letters manually annotated by cardiologists at the University Hospital Heidelberg. We compared our results with traditional machine learning methods based on a long short-term memory network and a conditional random field. Results Our best-performing model, based on a publicly available German pre-trained BERT model, achieved a token-wise micro-average F1-score of 86% and outperformed the baseline by at least 6%. Moreover, this approach achieved the best trade-off between precision (positive predictive value) and recall (sensitivity). Conclusion Our results show the applicability of state-of-the-art deep learning methods using pre-trained language models to the task of cardiovascular concept extraction with limited training data. This minimizes annotation efforts, which are currently the bottleneck of any application of data-driven deep learning in the clinical domain for German and many other European languages.
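Framing concept extraction as token classification over a pretrained German BERT can be sketched with the Hugging Face transformers library. The model name, the three-tag label set, and the example sentence below are placeholders; the study's actual models, 12 concept classes, and annotated discharge letters are not reproduced here.

```python
# Minimal sketch of cardiovascular concept extraction as token classification
# with a German pretrained BERT. Model name, labels, and text are placeholders.
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

labels = ["O", "B-CONCEPT", "I-CONCEPT"]  # the paper distinguishes 12 concepts
tok = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-german-cased", num_labels=len(labels))

text = "Der Patient zeigt eine hochgradige Aortenklappenstenose."
enc = tok(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits              # shape: (1, seq_len, num_labels)
pred = logits.argmax(-1)[0].tolist()
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
print(list(zip(tokens, [labels[i] for i in pred])))
# Fine-tuning on annotated discharge letters (e.g. via the Trainer API) would
# replace the randomly initialized classification head used here.
```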


2021 ◽  
Vol 8 ◽  
Author(s):  
Xia Cao ◽  
Qi-Jun Wu ◽  
Qing Chang ◽  
Tie-Ning Zhang ◽  
Xiang-Sen Li ◽  
...  

Background: The global incidence of metabolic syndrome (MetS) is continuously increasing, making it a potential worldwide public health concern. Research on dietary factors related to MetS has attracted considerable attention in recent decades. However, the research hotspots, knowledge structure, and theme trends for the dietary factors associated with MetS remain unknown and have not yet been systematically mapped. This study aimed to review the research status of diet as a risk factor for MetS through bibliometric methods. Bibliometric analysis was conducted using the Web of Science database. Research hotspots were identified using biclustering analysis with the gCLUTO software, and the knowledge structure was explored via social network analysis using the Ucinet software. Theme trends were investigated using evolutionary analysis with the SciMAT software. In total, 1,305 papers were analyzed. The research output on the dietary factors associated with MetS increased steadily, and the research scope gradually expanded and diversified. Overall, eight research hotspots, four key dietary nodes, and four motor themes on the dietary factors associated with MetS were identified. Fatty acids, dietary fiber, and polyphenols have been the focus of research in this field over the years. Evolutionary analysis showed that fish oil and vitamin C have recently become well-developed research foci, while prebiotics was recognized as an emerging theme with development potential. These findings provide a better understanding of the research status of the dietary factors associated with MetS and a reference for future investigations.
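The biclustering step (performed with gCLUTO in the study) co-clusters papers and dietary keywords from a paper-by-keyword matrix. The sketch below illustrates the same idea with scikit-learn's spectral co-clustering; the keyword list is taken from the abstract, but the matrix is synthetic and purely illustrative.

```python
# Minimal sketch of biclustering a paper-by-keyword matrix. gCLUTO was used in
# the study; spectral co-clustering is a substitute for illustration only.
import numpy as np
from sklearn.cluster import SpectralCoclustering

keywords = ["fatty acids", "dietary fiber", "polyphenols",
            "fish oil", "vitamin C", "prebiotics"]
rng = np.random.RandomState(0)
# rows = papers, columns = keywords; 1 = keyword assigned to the paper
X = rng.randint(0, 2, size=(30, len(keywords)))
X[np.arange(30), rng.randint(0, len(keywords), 30)] = 1  # every paper keeps >= 1 keyword

model = SpectralCoclustering(n_clusters=3, random_state=0).fit(X)
for c in range(3):
    cols = [kw for kw, lab in zip(keywords, model.column_labels_) if lab == c]
    print(f"bicluster {c}: {cols}")
```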


2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Liangping Ding ◽  
Zhixiong Zhang ◽  
Huan Liu ◽  
Jie Li ◽  
Gaihong Yu

Purpose: Automatic keyphrase extraction (AKE) is an important task for grasping the main points of a text. In this paper, we aim to combine the benefits of a sequence labeling formulation and a pretrained language model to propose an automatic keyphrase extraction model for Chinese scientific research.
Design/methodology/approach: We regard AKE from Chinese text as a character-level sequence labeling task to avoid the segmentation errors of Chinese tokenizers and initialize our model with the pretrained language model BERT, released by Google in 2018. We collect data from the Chinese Science Citation Database and construct a large-scale dataset from the medical domain, which contains 100,000 abstracts as the training set, 6,000 abstracts as the development set, and 3,094 abstracts as the test set. We use unsupervised keyphrase extraction methods, including term frequency (TF), TF-IDF, and TextRank, and supervised machine learning methods, including Conditional Random Field (CRF), Bidirectional Long Short-Term Memory Network (BiLSTM), and BiLSTM-CRF, as baselines. Experiments are designed to compare word-level and character-level sequence labeling approaches on supervised machine learning models and BERT-based models.
Findings: Compared with character-level BiLSTM-CRF, the best baseline model with an F1 score of 50.16%, our character-level sequence labeling model based on BERT obtains an F1 score of 59.80%, a 9.64% absolute improvement.
Research limitations: We consider only the automatic keyphrase extraction task rather than keyphrase generation, so only keyphrases that occur in the given text can be extracted. In addition, our proposed dataset is not suitable for dealing with nested keyphrases.
Practical implications: We make our character-level IOB format dataset of Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts (CAKE) publicly available for the benefit of the research community at: https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction.
Originality/value: By designing comparative experiments, our study demonstrates that the character-level formulation is more suitable for the Chinese automatic keyphrase extraction task under the general trend of pretrained language models. Our proposed dataset also provides a unified method for model evaluation and can promote the development of Chinese automatic keyphrase extraction to some extent.
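The character-level IOB formulation can be sketched as a simple conversion from a text plus its gold keyphrases to per-character BIO tags. The example sentence and keyphrases below are illustrative and are not taken from the CAKE dataset.

```python
# Minimal sketch of character-level IOB (BIO) tagging for Chinese AKE:
# each character is tagged B, I, or O depending on whether it begins,
# continues, or lies outside a gold keyphrase.
def char_level_bio(text, keyphrases):
    tags = ["O"] * len(text)
    for phrase in keyphrases:
        start = text.find(phrase)
        while start != -1:
            tags[start] = "B"
            for i in range(start + 1, start + len(phrase)):
                tags[i] = "I"
            start = text.find(phrase, start + len(phrase))
    return list(zip(list(text), tags))

# Illustrative example (not from the CAKE dataset).
print(char_level_bio("糖尿病患者的血糖控制至关重要", ["糖尿病", "血糖控制"]))
# [('糖', 'B'), ('尿', 'I'), ('病', 'I'), ('患', 'O'), ...]
```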


Author(s):  
J. S. Lally ◽  
R. J. Lee

In the 50-year period since the discovery of electron diffraction from crystals, much theoretical effort has been devoted to the calculation of diffracted intensities as a function of crystal thickness, orientation, and structure. However, in many applications of electron diffraction, what is required is a simple identification of an unknown structure when some of the shape and orientation parameters required for intensity calculations are not known. In these circumstances, an automated method is needed to solve diffraction patterns obtained near crystal zone-axis directions that accounts for the systematic absences of reflections due to lattice symmetry and for the additional reflections due to double diffraction. Two programs have been developed to enable relatively inexperienced microscopists to identify unknown crystals from diffraction patterns. Before indexing any given electron diffraction pattern, a set of possible crystal structures must be selected for comparison against the unknown.
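The comparison against candidate structures can be sketched, for the cubic case, by computing the allowed interplanar spacings d_hkl = a / sqrt(h^2 + k^2 + l^2), subject to the candidate's systematic absences, and matching them against measured spacings. The sketch below uses FCC aluminium as an illustrative candidate; the tolerance and values are assumptions, not the programs described by the authors.

```python
# Minimal sketch of matching measured d-spacings against a candidate cubic
# structure while respecting its systematic absences (FCC shown here).
import math

def allowed_fcc(h, k, l):
    # FCC systematic absences: h, k, l must be all even or all odd.
    return len({h % 2, k % 2, l % 2}) == 1

def cubic_d_spacings(a, max_index=4):
    """d_hkl = a / sqrt(h^2 + k^2 + l^2) for allowed FCC reflections."""
    ds = set()
    for h in range(max_index + 1):
        for k in range(max_index + 1):
            for l in range(max_index + 1):
                if (h, k, l) != (0, 0, 0) and allowed_fcc(h, k, l):
                    ds.add(round(a / math.sqrt(h*h + k*k + l*l), 4))
    return sorted(ds, reverse=True)

def matches(measured_d, candidate_ds, tol=0.02):
    """True if every measured spacing (nm) lies within tol of a candidate spacing."""
    return all(any(abs(d - c) <= tol for c in candidate_ds) for d in measured_d)

# Aluminium (FCC, a ~ 0.405 nm): the strong low-index spacings {111}, {200}, {220}.
al_ds = cubic_d_spacings(0.405)
print(matches([0.234, 0.202, 0.143], al_ds))
```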

