FullMeSH: improving large-scale MeSH indexing with full text

2019 ◽  
Vol 36 (5) ◽  
pp. 1533-1541
Author(s):  
Suyang Dai ◽  
Ronghui You ◽  
Zhiyong Lu ◽  
Xiaodi Huang ◽  
Hiroshi Mamitsuka ◽  
...  

Abstract. Motivation: With the rapidly growing biomedical literature, automatically indexing biomedical articles by Medical Subject Heading (MeSH), namely MeSH indexing, has become increasingly important for facilitating hypothesis generation and knowledge discovery. Over the past years, many large-scale MeSH indexing approaches have been proposed, such as Medical Text Indexer, MeSHLabeler, DeepMeSH and MeSHProbeNet. However, the performance of these methods is hampered by using limited information, i.e. only the title and abstract of biomedical articles. Results: We propose FullMeSH, a large-scale MeSH indexing method taking advantage of the recent increase in the availability of full text articles. Compared to DeepMeSH and other state-of-the-art methods, FullMeSH has three novelties: (i) Instead of using a full text as a whole, FullMeSH segments it into several sections with their normalized titles in order to distinguish their contributions to the overall performance. (ii) FullMeSH integrates the evidence from different sections in a ‘learning to rank’ framework by combining the sparse and deep semantic representations. (iii) FullMeSH trains an Attention-based Convolutional Neural Network for each section, which achieves better performance on infrequent MeSH headings. FullMeSH has been developed and empirically trained on the entire set of 1.4 million full-text articles in the PubMed Central Open Access subset. It achieved a Micro F-measure of 66.76% on a test set of 10 000 articles, which was 3.3% and 6.4% higher than DeepMeSH and MeSHLabeler, respectively. Furthermore, FullMeSH demonstrated an average improvement of 4.7% over DeepMeSH for indexing Check Tags, a set of most frequently indexed MeSH headings. Availability and implementation: The software is available upon request. Supplementary information: Supplementary data are available at Bioinformatics online.
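
The Micro F-measure reported above can be illustrated with a minimal Python sketch. It assumes the usual micro-averaged definition, in which true positives, false positives and false negatives are pooled over all article-heading pairs before precision and recall are computed; the abstract does not spell out the formula, and the function name and toy headings below are illustrative only.

```python
from typing import Sequence, Set, Tuple


def micro_f_measure(
    true_sets: Sequence[Set[str]], pred_sets: Sequence[Set[str]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F) micro-averaged over all articles.

    Assumption: counts are pooled across articles before the ratios
    are taken, which is the common definition of micro-averaging.
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)   # headings predicted and correct
        fp += len(pred - truth)   # headings predicted but not in the gold set
        fn += len(truth - pred)   # gold headings that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Toy example with two articles (headings chosen for illustration only).
gold = [{"Humans", "Female"}, {"Mice"}]
predicted = [{"Humans", "Male"}, {"Mice", "Animals"}]
precision, recall, f_measure = micro_f_measure(gold, predicted)
```

In the toy example the pooled counts are 2 true positives, 2 false positives and 1 false negative, so the F-measure is 4/7. Pooling means that frequently indexed headings dominate the score.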

Author(s):  
Ronghui You ◽  
Yuxuan Liu ◽  
Hiroshi Mamitsuka ◽  
Shanfeng Zhu

Abstract. Motivation: With the rapid increase of biomedical articles, large-scale automatic Medical Subject Headings (MeSH) indexing has become increasingly important. FullMeSH, the only method for large-scale MeSH indexing with full text, suffers from three major drawbacks: FullMeSH (i) uses Learning To Rank, which is time-consuming, (ii) can capture some pre-defined sections only in full text and (iii) ignores the whole MEDLINE database. Results: We propose a computationally lighter, full text and deep-learning-based MeSH indexing method, BERTMeSH, which is flexible for section organization in full text. BERTMeSH has two technologies: (i) the state-of-the-art pre-trained deep contextual representation, Bidirectional Encoder Representations from Transformers (BERT), which makes BERTMeSH capture deep semantics of full text. (ii) A transfer learning strategy for using both full text in PubMed Central (PMC) and title and abstract (only and no full text) in MEDLINE, to take advantages of both. In our experiments, BERTMeSH was pre-trained with 3 million MEDLINE citations and trained on ∼1.5 million full texts in PMC. BERTMeSH outperformed various cutting-edge baselines. For example, for 20 K test articles of PMC, BERTMeSH achieved a Micro F-measure of 69.2%, which was 6.3% higher than FullMeSH with the difference being statistically significant. Also prediction of 20 K test articles needed 5 min by BERTMeSH, while it took more than 10 h by FullMeSH, proving the computational efficiency of BERTMeSH. Supplementary information: Supplementary data are available at Bioinformatics online.


2020 ◽  
Author(s):  
Ronghui You ◽  
Yuxuan Liu ◽  
Hiroshi Mamitsuka ◽  
Shanfeng Zhu

Abstract. Motivation: With the rapid increase of biomedical articles, large-scale automatic Medical Subject Headings (MeSH) indexing has become increasingly important. FullMeSH, the only method for large-scale MeSH indexing with full text, suffers from three major drawbacks: FullMeSH 1) uses Learning To Rank (LTR), which is time-consuming, 2) can capture some pre-defined sections only in full text, and 3) ignores the whole MEDLINE database. Results: We propose a computationally lighter, full-text and deep learning based MeSH indexing method, BERTMeSH, which is flexible for section organization in full text. BERTMeSH has two technologies: 1) the state-of-the-art pre-trained deep contextual representation, BERT (Bidirectional Encoder Representations from Transformers), which makes BERTMeSH capture deep semantics of full text. 2) A transfer learning strategy for using both full text in PubMed Central (PMC) and title and abstract (only and no full text) in MEDLINE, to take advantages of both. In our experiments, BERTMeSH was pre-trained with 3 million MEDLINE citations and trained on approximately 1.5 million full text in PMC. BERTMeSH outperformed various cutting edge baselines. For example, for 20K test articles of PMC, BERTMeSH achieved a Micro F-measure of 69.2%, which was 6.3% higher than FullMeSH with the difference being statistically significant. Also prediction of 20K test articles needed 5 minutes by BERTMeSH, while it took more than 10 hours by FullMeSH, proving the computational efficiency of BERTMeSH.


2020 ◽  
Vol 36 (14) ◽  
pp. 4180-4188
Author(s):  
Lizhi Liu ◽  
Xiaodi Huang ◽  
Hiroshi Mamitsuka ◽  
Shanfeng Zhu

Abstract. Motivation: Annotating human proteins by abnormal phenotypes has become an important topic. Human Phenotype Ontology (HPO) is a standardized vocabulary of phenotypic abnormalities encountered in human diseases. As of November 2019, only <4000 proteins have been annotated with HPO. Thus, a computational approach for accurately predicting protein–HPO associations would be important, whereas no methods have outperformed a simple Naive approach in the second Critical Assessment of Functional Annotation, 2013–2014 (CAFA2). Results: We present HPOLabeler, which is able to use a wide variety of evidence, such as protein–protein interaction (PPI) networks, Gene Ontology, InterPro, trigram frequency and HPO term frequency, in the framework of learning to rank (LTR). LTR has been proved to be powerful for solving large-scale, multi-label ranking problems in bioinformatics. Given an input protein, LTR outputs the ranked list of HPO terms from a series of input scores given to the candidate HPO terms by component learning models (logistic regression, nearest neighbor and a Naive method), which are trained from given multiple evidence. We empirically evaluate HPOLabeler extensively through mainly two experiments of cross validation and temporal validation, for which HPOLabeler significantly outperformed all component models and competing methods including the current state-of-the-art method. We further found that (i) PPI is most informative for prediction among diverse data sources and (ii) low prediction performance of temporal validation might be caused by incomplete annotation of new proteins. Availability and implementation: http://issubmission.sjtu.edu.cn/hpolabeler/. Supplementary information: Supplementary data are available at Bioinformatics online.


2019 ◽  
Author(s):  
Charles Tapley Hoyt ◽  
Daniel Domingo-Fernández ◽  
Rana Aldisi ◽  
Lingling Xu ◽  
Kristian Kolpeja ◽  
...  

Abstract. The rapid accumulation of new biomedical literature not only causes curated knowledge graphs to become outdated and incomplete, but also makes manual curation an impractical and unsustainable solution. Automated or semi-automated workflows are necessary to assist in prioritizing and curating the literature to update and enrich knowledge graphs. We have developed two workflows: one for re-curating a given knowledge graph to assure its syntactic and semantic quality and another for rationally enriching it by manually revising automatically extracted relations for nodes with low information density. We applied these workflows to the knowledge graphs encoded in Biological Expression Language from the NeuroMMSig database using content that was pre-extracted from MEDLINE abstracts and PubMed Central full text articles using text mining output integrated by INDRA. We have made this workflow freely available at https://github.com/bel-enrichment/bel-enrichment. Database URL: https://github.com/bel-enrichment/results


2021 ◽  
Author(s):  
Chun-chao Lo ◽  
Shubo Tian ◽  
Yuchuan Tao ◽  
Jie Hao ◽  
Jinfeng Zhang

Most queries submitted to a literature search engine can be more precisely written as sentences to give the search engine more specific information. Sentence queries should be more effective, in principle, than short queries with small numbers of keywords. Querying with full sentences is also a key step in question-answering and citation recommendation systems. Despite the considerable progress in natural language processing (NLP) in recent years, using sentence queries on current search engines does not yield satisfactory results. In this study, we developed a deep learning-based method for sentence queries, called DeepSenSe, using citation data available in full-text articles obtained from PubMed Central (PMC). A large amount of labeled data was generated from millions of matched citing sentences and cited articles, making it possible to train quality predictive models using modern deep learning techniques. A two-stage approach was designed: in the first stage we used a modified BM25 algorithm to obtain the top 1000 relevant articles; the second stage involved re-ranking the relevant articles using DeepSenSe. We tested our method using a large number of sentences extracted from real scientific articles in PMC. Our method performed substantially better than PubMed and Google Scholar for sentence queries.
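
The two-stage design described above (lexical retrieval of a candidate pool, then re-ranking) can be sketched in Python as follows. This is a hedged illustration only: it uses the standard Okapi BM25 formula, not the authors' modified variant, and the `rerank` callable is a hypothetical stand-in for the DeepSenSe model, whose interface the abstract does not describe. Documents and queries are assumed to be pre-tokenized lists of strings.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def bm25_scores(
    query: Sequence[str], docs: List[Sequence[str]], k1: float = 1.5, b: float = 0.75
) -> List[float]:
    """Score every document against the query with standard Okapi BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq: Counter = Counter()
    for d in docs:
        doc_freq.update(set(d))  # count each term once per document
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def two_stage_search(
    query: Sequence[str],
    docs: List[Sequence[str]],
    rerank: Callable[[Sequence[str], Sequence[str]], float],
    top_k: int = 1000,
) -> List[int]:
    """Stage 1: keep the top_k documents by BM25. Stage 2: re-order them by `rerank`."""
    first_stage = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: first_stage[i], reverse=True)[:top_k]
    return sorted(candidates, key=lambda i: rerank(query, docs[i]), reverse=True)


# Toy usage: the lambda is a placeholder scorer, not a learned model.
corpus = [
    ["mesh", "indexing", "full", "text"],
    ["protein", "phenotype"],
    ["mesh", "mesh", "indexing"],
]
ranking = two_stage_search(["mesh", "indexing"], corpus, rerank=lambda q, d: -len(d), top_k=2)
```

The split keeps the expensive scorer off most of the corpus: only the `top_k` candidates that survive the cheap lexical stage are passed to it.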


Children ◽  
2021 ◽  
Vol 8 (2) ◽  
pp. 137
Author(s):  
Kalliopi Kappou ◽  
Myrto Ntougia ◽  
Aikaterini Kourtesi ◽  
Eleni Panagouli ◽  
Elpis Vlachopapadopoulou ◽  
...  

Background: Anorexia nervosa (AN) is a serious, multifactorial mental disorder affecting predominantly young females. This systematic review examines neuroimaging findings in adolescents and young adults up to 24 years old, in order to explore alterations associated with disease pathophysiology. Methods: Eligible studies on structural and functional brain neuroimaging were sought systematically in PubMed, CENTRAL and EMBASE databases up to 5 October 2020. Results: Thirty-three studies were included, investigating a total of 587 patients with a current diagnosis of AN and 663 healthy controls (HC). Global and regional grey matter (GM) volume reduction as well as white matter (WM) microstructure alterations were detected. The mainly affected regions were the prefrontal, parietal and temporal cortex, hippocampus, amygdala, insula, thalamus and cerebellum as well as various WM tracts such as corona radiata and superior longitudinal fasciculus (SLF). Regarding functional imaging, alterations were pointed out in large-scale brain networks, such as default mode network (DMN), executive control network (ECN) and salience network (SN). Most findings appear to reverse after weight restoration. Specific limitations of neuroimaging studies in still developing individuals are also discussed. Conclusions: Structural and functional alterations are present in the early course of the disease, most of them being partially or totally reversible. Nonetheless, neuroimaging findings have been open to many biological interpretations. Thus, more studies are needed to clarify their clinical significance.


Energies ◽  
2021 ◽  
Vol 14 (4) ◽  
pp. 1099
Author(s):  
María José Rodríguez-Torres ◽  
Ainoa Morillas-España ◽  
José Luis Guzmán ◽  
Francisco Gabriel Acién

One of the most critical variables in microalgae-related processes is the pH; it directly determines the overall performance of the production system, especially when coupled with wastewater treatment. In microalgae-related wastewater treatment processes, the adequacy of the pH has a large impact on the microalgae/bacteria consortium developing in these systems. For cost-saving reasons, the pH is usually controlled by classical On/Off control algorithms during the daytime period, typically without the dynamics of the system and its disturbances being considered in the design of the control system. This paper presents the modelling and pH control of open photobioreactors, both raceway and thin-layer, using advanced controllers. In both types of photobioreactors, a classic control was implemented and compared with a Proportional–Integral (PI) control; operation during only the daylight period and during the complete daily cycle was also evaluated. Thus, the three major variables studied were (i) the type of reactor (thin-layer and raceway), (ii) the type of control algorithm (On/Off and PI), and (iii) the control period (during the daytime only and throughout the daytime and nighttime). Results show that the pH was adequately controlled in both photobioreactors, although each type requires different control algorithms. The pH control was largely improved when using PI controllers, which allow the total costs of the process to be reduced by reducing CO2 injections. Moreover, control during the complete daily cycle (including night) does not increase the amount of CO2 to be injected but instead reduces it, and it also improves the overall performance of the production process. The optimal pH control systems developed here are highly useful for developing robust large-scale microalgae-related wastewater treatment processes.
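
The difference between the two control algorithms compared above can be sketched in Python. This is a generic illustration, not the controllers or tuning from the paper: it assumes that the manipulated variable is a CO2 valve opening between 0 and 1, that injecting CO2 lowers the pH, and that the PI controller uses conditional integration as a simple anti-windup measure. The gains, sample time and pH values are arbitrary.

```python
from dataclasses import dataclass


def on_off_control(ph: float, setpoint: float) -> float:
    """On/Off law: open the CO2 valve fully while the pH is above the setpoint."""
    return 1.0 if ph > setpoint else 0.0


@dataclass
class PIController:
    kp: float              # proportional gain (illustrative value, not from the paper)
    ki: float              # integral gain (illustrative value, not from the paper)
    dt: float              # sample time
    integral: float = 0.0  # accumulated error

    def step(self, ph: float, setpoint: float) -> float:
        """Return a valve opening in [0, 1] for the current pH reading."""
        error = ph - setpoint  # positive error: pH too high, so inject CO2
        candidate = self.integral + error * self.dt
        raw = self.kp * error + self.ki * candidate
        output = min(1.0, max(0.0, raw))
        # Anti-windup: only accept the new integral when the valve is not saturated.
        if raw == output:
            self.integral = candidate
        return output


# Toy comparison at a single reading above the setpoint.
pi = PIController(kp=0.5, ki=0.1, dt=1.0)
valve_pi = pi.step(ph=8.4, setpoint=8.0)         # partial opening
valve_on_off = on_off_control(ph=8.4, setpoint=8.0)  # fully open
```

The On/Off law can only switch between no injection and full injection, whereas the PI law scales the valve opening with the size and persistence of the pH error.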


2020 ◽  
Author(s):  
Marco Bertoni ◽  
Stephen Gibbons ◽  
Olmo Silva

Abstract. We study how demand responds to the rebranding of existing state schools as autonomous ‘academies’ in the context of a radical and large-scale reform to the English education system. The academy programme encouraged schools to opt out of local state control and funding, but provided parents and students with limited information on the expected benefits. We use administrative data on school applications for three cohorts of students to estimate whether this rebranding changes schools’ relative popularity. We find that families – particularly higher-income, White British – are more likely to rank converted schools above non-converted schools on their applications. We also find that it is mainly schools that are high-performing, popular and proximate to families’ homes that attract extra demand after conversion. Overall, the patterns we document suggest that families read academy conversion as a signal of future quality gains – although this signal is in part misleading as we find limited evidence that conversion causes improved performance.


2019 ◽  
Vol 35 (14) ◽  
pp. i417-i426 ◽  
Author(s):  
Erin K Molloy ◽  
Tandy Warnow

Abstract. Motivation: At RECOMB-CG 2018, we presented NJMerge and showed that it could be used within a divide-and-conquer framework to scale computationally intensive methods for species tree estimation to larger datasets. However, NJMerge has two significant limitations: it can fail to return a tree and, when used within the proposed divide-and-conquer framework, has O(n^5) running time for datasets with n species. Results: Here we present a new method called ‘TreeMerge’ that improves on NJMerge in two ways: it is guaranteed to return a tree and it has dramatically faster running time within the same divide-and-conquer framework: only O(n^2) time. We use a simulation study to evaluate TreeMerge in the context of multi-locus species tree estimation with two leading methods, ASTRAL-III and RAxML. We find that the divide-and-conquer framework using TreeMerge has a minor impact on species tree accuracy, dramatically reduces running time, and enables both ASTRAL-III and RAxML to complete on datasets (that they would otherwise fail on), when given 64 GB of memory and 48 h maximum running time. Thus, TreeMerge is a step toward a larger vision of enabling researchers with limited computational resources to perform large-scale species tree estimation, which we call Phylogenomics for All. Availability and implementation: TreeMerge is publicly available on Github (http://github.com/ekmolloy/treemerge). Supplementary information: Supplementary data are available at Bioinformatics online.


2015 ◽  
Vol 23 (3) ◽  
pp. 617-626 ◽  
Author(s):  
Nophar Geifman ◽  
Sanchita Bhattacharya ◽  
Atul J Butte

Abstract. Objective: Cytokines play a central role in both health and disease, modulating immune responses and acting as diagnostic markers and therapeutic targets. This work takes a systems-level approach for integration and examination of immune patterns, such as cytokine gene expression with information from biomedical literature, and applies it in the context of disease, with the objective of identifying potentially useful relationships and areas for future research. Results: We present herein the integration and analysis of immune-related knowledge, namely, information derived from biomedical literature and gene expression arrays. Cytokine-disease associations were captured from over 2.4 million PubMed records, in the form of Medical Subject Headings descriptor co-occurrences, as well as from gene expression arrays. Clustering of cytokine-disease co-occurrences from biomedical literature is shown to reflect current medical knowledge as well as potentially novel relationships between diseases. A correlation analysis of cytokine gene expression in a variety of diseases revealed compelling relationships. Finally, a novel analysis comparing cytokine gene expression in different diseases to parallel associations captured from the biomedical literature was used to examine which associations are interesting for further investigation. Discussion: We demonstrate the usefulness of capturing Medical Subject Headings descriptor co-occurrences from biomedical publications in the generation of valid and potentially useful hypotheses. Furthermore, integrating and comparing descriptor co-occurrences with gene expression data was shown to be useful in detecting new, potentially fruitful, and unaddressed areas of research. Conclusion: Using integrated large-scale data captured from the scientific literature and experimental data, a better understanding of the immune mechanisms underlying disease can be achieved and applied to research.
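
Counting descriptor co-occurrences of the kind described above can be sketched in a few lines of Python. The sketch assumes each PubMed record has already been reduced to a set of MeSH descriptor names and that the cytokine and disease vocabularies are supplied as sets; the function name, the toy records and the choice to count each pair once per record are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
from itertools import product
from typing import Iterable, Set, Tuple


def cooccurrence_counts(
    records: Iterable[Set[str]], cytokines: Set[str], diseases: Set[str]
) -> Counter:
    """Count, per (cytokine, disease) pair, the records annotated with both descriptors."""
    counts: Counter = Counter()
    for descriptors in records:
        present = set(descriptors)
        # Every cytokine/disease pair found in the same record counts once.
        for pair in product(present & cytokines, present & diseases):
            counts[pair] += 1
    return counts


# Toy records for illustration only.
toy_records = [
    {"Interleukin-6", "Arthritis, Rheumatoid", "Humans"},
    {"Interleukin-6", "Asthma"},
    {"Interleukin-6", "Arthritis, Rheumatoid"},
    {"Asthma"},
]
toy_counts = cooccurrence_counts(
    toy_records,
    cytokines={"Interleukin-6"},
    diseases={"Arthritis, Rheumatoid", "Asthma"},
)
```

The resulting pair counts form a cytokine-by-disease matrix, which is the kind of input that clustering or comparison with expression data would start from.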

