NLP-MeTaxa: A Natural Language Processing approach for Metagenomic Taxonomic Binning based on deep learning

2021 ◽  
Vol 16 ◽  
Author(s):  
Brahim Matougui ◽  
Abdelbasset Boukelia ◽  
Hacene Belhadef ◽  
Clovis Galiez ◽  
Mohamed Batouche

Background: Metagenomics is the study of genomic content sampled in bulk from an environment of interest, such as the human gut or soil. Taxonomy, the science of defining and naming groups of microbial organisms that share the same characteristics, is one of the most important fields of metagenomics. The taxonomic classification problem is the identification and quantification of microbial species, or higher-level taxa, sampled by high-throughput sequencing. Objective: Although many methods exist to deal with the taxonomic classification problem, assignment to low taxonomic ranks remains an important challenge for binning methods, as does scalability to the Gb-sized datasets generated with deep sequencing techniques. Methods: In this paper, we introduce NLP-MeTaxa, a novel composition-based method for taxonomic binning that relies on word embeddings and a deep learning architecture. The proposed approach is word-based: the metagenomic DNA fragments are processed as sets of overlapping words and vectorized with the word2vec model in order to feed the deep learning model. NLP-MeTaxa's output is visualized as an NCBI taxonomy tree; this representation helps to show the connections between the predicted taxonomic identifiers. NLP-MeTaxa was trained on large-scale data from NCBI RefSeq, comprising more than 14,000 complete microbial genomes. The NLP-MeTaxa code is available at the website: https://github.com/padriba/NLP_MeTaxa/ Results: We evaluated NLP-MeTaxa on real and simulated metagenomic datasets and compared our results to those of other tools. The experimental results show that our method outperforms the other methods, especially for classification at low taxonomic ranks such as species and genus. Conclusion: In summary, our new method may provide novel insight for understanding a microbial community through the identification of the organisms it might contain.
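
The k-mer-and-word2vec pipeline described above can be illustrated with a minimal sketch; the k-mer length, embedding dimension, and averaging of k-mer vectors are illustrative assumptions, not the settings used by NLP-MeTaxa:

```python
# Minimal sketch of the composition-based preprocessing described above,
# assuming (hypothetically) k-mers of length 6 and 100-dimensional embeddings.
from gensim.models import Word2Vec
import numpy as np

def to_kmers(fragment, k=6):
    """Split a DNA fragment into overlapping k-mer 'words'."""
    return [fragment[i:i + k] for i in range(len(fragment) - k + 1)]

fragments = [
    "ATGCGTACGTTAGC",
    "GGCTTACGATCGTA",
]
corpus = [to_kmers(f) for f in fragments]

# Train word2vec on the k-mer corpus (skip-gram, toy settings).
w2v = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

# Represent each fragment as the mean of its k-mer vectors, which can then
# be fed to a downstream deep learning classifier.
fragment_vectors = np.array([
    np.mean([w2v.wv[kmer] for kmer in kmers], axis=0) for kmers in corpus
])
print(fragment_vectors.shape)  # (2, 100)
```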

2020 ◽  
Vol 2 (4) ◽  
pp. 209-215
Author(s):  
Eriss Eisa Babikir Adam

Computer systems are developing models for speech synthesis as one aspect of natural language processing. Speech synthesis has been explored through articulatory, formant, and concatenative synthesis. These techniques introduce more aperiodic distortion and yield exponentially increasing error rates during processing. Recently, advances in speech synthesis have moved strongly towards deep learning in order to achieve better performance, since leveraging large-scale data yields effective feature representations for speech synthesis. The main objective of this research article is to apply deep learning techniques to speech synthesis and to compare their performance, in terms of aperiodic distortion, with prior algorithmic models in natural language processing.


2021 ◽  
Vol 2021 ◽  
pp. 1-7
Author(s):  
Juncai Li ◽  
Xiaofei Jiang

Molecular property prediction is an essential task in drug discovery. Most computational approaches with deep learning techniques either focus on designing novel molecular representations or on combining them with advanced models. However, researchers pay less attention to the potential benefits of massive unlabeled molecular data (e.g., ZINC). The task becomes increasingly challenging owing to the limited scale of labeled data. Motivated by recent advances in pretrained models for natural language processing, drug molecules can, to some extent, be naturally viewed as a language. In this paper, we investigate how to adapt the pretrained BERT model to extract useful molecular substructure information for molecular property prediction. We present a novel end-to-end deep learning framework, named Mol-BERT, that combines an effective molecular representation with a pretrained BERT model tailored for molecular property prediction. Specifically, a large-scale BERT model is pretrained to generate embeddings of molecular substructures, using four million unlabeled drug SMILES (i.e., ZINC 15 and ChEMBL 27). The pretrained BERT model can then be fine-tuned on various molecular property prediction tasks. To examine the performance of our proposed Mol-BERT, we conduct several experiments on four widely used molecular datasets. In comparison to traditional and state-of-the-art baselines, the results show that our proposed Mol-BERT outperforms the current sequence-based methods and achieves at least a 2% improvement in ROC-AUC score on the Tox21, SIDER, and ClinTox datasets.
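
As an illustration of the fine-tuning stage only, the following sketch runs one training step of a generic BERT classifier over SMILES strings using the Hugging Face transformers library; the checkpoint, plain-text tokenizer, labels, and hyperparameters are placeholders, not the Mol-BERT substructure tokenization or pretrained weights:

```python
# Illustrative fine-tuning sketch with a generic BERT checkpoint; Mol-BERT itself
# pretrains its own model on molecular substructures, which is not reproduced here.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

smiles = ["CCO", "c1ccccc1O"]          # toy molecules (ethanol, phenol)
labels = torch.tensor([0, 1])          # toy binary property labels

batch = tokenizer(smiles, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)   # one fine-tuning step
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```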


2021 ◽  
Author(s):  
R. Salter ◽  
Quyen Dong ◽  
Cody Coleman ◽  
Maria Seale ◽  
Alicia Ruvinsky ◽  
...  

The Engineer Research and Development Center, Information Technology Laboratory’s (ERDC-ITL’s) Big Data Analytics team specializes in the analysis of large-scale datasets with capabilities across four research areas that require vast amounts of data to inform and drive analysis: large-scale data governance, deep learning and machine learning, natural language processing, and automated data labeling. Unfortunately, data transfer between government organizations is a complex and time-consuming process requiring coordination of multiple parties across multiple offices and organizations. Past successes in large-scale data analytics have placed a significant demand on ERDC-ITL researchers, highlighting that few individuals fully understand how to successfully transfer data between government organizations; future project success therefore depends on a small group of individuals to efficiently execute a complicated process. The Big Data Analytics team set out to develop a standardized workflow for the transfer of large-scale datasets to ERDC-ITL, in part to educate peers and future collaborators on the process required to transfer datasets between government organizations. Researchers also aim to increase workflow efficiency while protecting data integrity. This report provides an overview of the created Data Lake Ecosystem Workflow by focusing on the six phases required to efficiently transfer large datasets to supercomputing resources located at ERDC-ITL.


Big data refers to large-scale data collected for knowledge discovery and has been widely used in various applications. Big data often includes image data from these applications and requires effective techniques to process it. In this paper, a survey of big image data research has been conducted to analyze the effectiveness of existing methods. Deep learning techniques provide effective performance compared to other methods, including wavelet-based methods. However, deep learning techniques require more computational time, which can be mitigated by lightweight methods.


Author(s):  
Yilin Yan ◽  
Jonathan Chen ◽  
Mei-Ling Shyu

Stance detection is an important research direction that attempts to automatically determine the attitude (positive, negative, or neutral) of the author of a text (such as a tweet) towards a target. Nowadays, a number of frameworks have been proposed using deep learning techniques that show promising results in application domains such as automatic speech recognition and computer vision, as well as natural language processing (NLP). This article presents a novel deep learning-based fast stance detection framework for bipolar affinities on Twitter. Millions of tweets regarding Clinton and Trump were produced per day on Twitter during the 2016 United States presidential election campaign, and this is used as a test case because of its significant and unique counter-factual properties. In addition, stance detection can be utilized to infer the political tendency of the general public. Experimental results show that the proposed framework achieves high accuracy when compared to several existing stance detection methods.
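
The three-class formulation (positive, negative, neutral) can be sketched with a generic recurrent classifier; the architecture, vocabulary size, and random inputs below are illustrative assumptions and do not reproduce the framework proposed in the article:

```python
# Toy three-class stance classifier sketch; a generic deep learning baseline,
# not the fast stance detection framework described in the article.
import torch
import torch.nn as nn

class StanceLSTM(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=64, num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embed(token_ids)
        _, (hidden, _) = self.lstm(embedded)
        return self.fc(hidden[-1])          # logits over {positive, negative, neutral}

model = StanceLSTM()
fake_batch = torch.randint(0, 5000, (8, 20))   # 8 tweets, 20 token ids each (random stand-ins)
logits = model(fake_batch)
print(logits.shape)                            # torch.Size([8, 3])
```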


Iproceedings ◽  
10.2196/15225 ◽  
2019 ◽  
Vol 5 (1) ◽  
pp. e15225
Author(s):  
Felipe Masculo ◽  
Jorn op den Buijs ◽  
Mariana Simons ◽  
Aki Harma

Background A Personal Emergency Response Service (PERS) enables an aging population to receive help quickly when an emergency situation occurs. The reasons that trigger a PERS alert are varied, including a sudden worsening of a chronic condition, a fall, or other injury. Every PERS case is documented by the response center using a combination of structured variables and free text notes. The text notes, in particular, contain a wealth of information about an incident, such as contextual information, details about the situation, symptoms, and more. Analysis of these notes at a population level could provide insight into the various situations that cause PERS medical alerts. Objective The objectives of this study were (1) to develop methods to enable the large-scale analysis of text notes from a PERS response center, and (2) to apply these methods to a large dataset and gain insight into the different situations that cause medical alerts. Methods More than 2.5 million deidentified PERS case text notes were used to train a document embedding model (ie, a deep learning Recurrent Neural Network [RNN] that takes the medical alert text note as input and produces a corresponding fixed-length vector representation as output). We applied this model to 100,000 PERS text notes related to medical incidents that resulted in emergency department admission. Finally, we used t-SNE, a nonlinear dimensionality reduction method, to visualize the vector representations of the text notes in 2D as part of a graphical user interface that enabled interactive exploration of the dataset and visual analytics. Results Visual analysis of the vectors revealed the existence of several well-separated clusters of incidents, such as fall, stroke/numbness, seizure, breathing problems, chest pain, and nausea, each of them related to the emergency situation encountered by the patient as recorded in an existing structured variable. In addition, subclusters were identified within each cluster that grouped cases based on additional features extracted from the PERS text notes and not available in the existing structured variables. For example, the incidents labeled as falls (n=37,842) were split into several subclusters corresponding to falls with bone fracture (n=1437), falls with bleeding (n=4137), falls caused by dizziness (n=519), etc. Conclusions The combination of state-of-the-art natural language processing, deep learning, and visualization techniques enables the large-scale analysis of medical alert text notes. This analysis demonstrates that, in addition to fall alerts, the PERS service is broadly used to signal for help in situations often related to underlying chronic conditions and acute symptoms such as respiratory distress, chest pain, diabetic reaction, etc. Moreover, the proposed techniques enable the extraction of structured information related to the medical alert from unstructured text with minimal human supervision. This structured information could be used, for example, to track trends over time, to generate concise medical alert summaries, and to create predictive models for desired outcomes.
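
The dimensionality-reduction step can be sketched as follows, assuming fixed-length note embeddings have already been produced by a document-embedding model; random vectors stand in here for the RNN outputs described above:

```python
# Sketch of the visualization step only: fixed-length note embeddings are
# projected to 2D with t-SNE for interactive exploration.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
note_vectors = rng.normal(size=(500, 128))     # 500 hypothetical note embeddings

coords_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(note_vectors)
print(coords_2d.shape)                         # (500, 2) points for plotting/clustering
```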


2020 ◽  
Vol 10 (10) ◽  
pp. 2459-2465
Author(s):  
Iftikhar Ahmad ◽  
Muhammad Javed Iqbal ◽  
Mohammad Basheri

The size of data gathered from various ongoing biological and clinical studies is increasing at an exponential rate. These biological data mainly comprise DNA, genes, proteins, and a variety of proteomics and genetic-disease data. Additionally, DNA microarray data are available for the early diagnosis and prediction of various types of cancer. Interestingly, these data may store vital information about genes, their structure, and their important biological functions. The huge volume and constant increase of the extracted biological data have opened several challenges. Many bioinformatics and machine learning models have been developed, but they fail to address key challenges present in the efficient and accurate analysis of a variety of complex biologically inspired data, such as genetic diseases. The reliable and robust classification of the extracted data into different classes, based on the information hidden in the sample data, is also a very interesting and open problem. This research work mainly focuses on overcoming major challenges in accurate protein classification, building on the success of deep learning models in natural language processing by treating protein sequences as a language. The learning ability and overall classification performance of the proposed system are validated against deep learning classification models. The proposed system classifies the mentioned datasets more accurately than previous approaches and shows better results. The in-depth analysis of multifaceted biological data may also help in the early diagnosis of diseases caused by gene mutations and in overcoming emerging challenges in the development of large-scale healthcare systems.
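
The "protein sequence as language" idea can be sketched by splitting sequences into overlapping 3-mer tokens and feeding them to a small embedding-plus-convolution classifier; the tokenization scheme, vocabulary, class count, and architecture below are illustrative assumptions rather than the system described above:

```python
# Minimal sketch: protein sequences as sentences of overlapping 3-mer "words",
# classified by a toy embedding + 1D-CNN model (hypothetical settings).
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
KMER_TO_ID = {a + b + c: i for i, (a, b, c) in enumerate(
    (x, y, z) for x in AMINO_ACIDS for y in AMINO_ACIDS for z in AMINO_ACIDS)}

def encode(sequence, k=3):
    """Turn a protein sequence into a tensor of overlapping k-mer token ids."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return torch.tensor([KMER_TO_ID[kmer] for kmer in kmers])

class ProteinCNN(nn.Module):
    def __init__(self, vocab_size=len(KMER_TO_ID), embed_dim=32, num_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, 64, kernel_size=5)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, token_ids):                   # token_ids: (batch, length)
        x = self.embed(token_ids).transpose(1, 2)   # (batch, embed_dim, length)
        x = torch.relu(self.conv(x)).max(dim=2).values
        return self.fc(x)                           # logits over protein classes

tokens = encode("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ").unsqueeze(0)
print(ProteinCNN()(tokens).shape)                   # torch.Size([1, 5])
```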


2017 ◽  
Vol 68 ◽  
pp. 32-42 ◽  
Author(s):  
Rodrigo F. Berriel ◽  
Franco Schmidt Rossi ◽  
Alberto F. de Souza ◽  
Thiago Oliveira-Santos
