Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

2021 ◽  
Author(s):  
Deepak Narayanan ◽  
Mohammad Shoeybi ◽  
Jared Casper ◽  
Patrick LeGresley ◽  
Mostofa Patwary ◽  
...  
2020 ◽  
Vol 36 (10) ◽  
pp. 3011-3017 ◽  
Author(s):  
Olga Mineeva ◽  
Mateo Rojas-Carulla ◽  
Ruth E Ley ◽  
Bernhard Schölkopf ◽  
Nicholas D Youngblut

Abstract

Motivation: Methodological advances in metagenome assembly are rapidly increasing the number of published metagenome assemblies. However, identifying misassemblies is challenging due to a lack of closely related reference genomes that can act as pseudo ground truth. Existing reference-free methods are no longer maintained, can make strong assumptions that may not hold across a diversity of research projects, and have not been validated on large-scale metagenome assemblies.

Results: We present DeepMAsED, a deep learning approach for identifying misassembled contigs without the need for reference genomes. Moreover, we provide an in silico pipeline for generating large-scale, realistic metagenome assemblies for comprehensive model training and testing. DeepMAsED accuracy substantially exceeds the state-of-the-art when applied to large and complex metagenome assemblies. Our model estimates a 1% contig misassembly rate in two recent large-scale metagenome assembly publications.

Conclusions: DeepMAsED accurately identifies misassemblies in metagenome-assembled contigs from a broad diversity of bacteria and archaea without the need for reference genomes or strong modeling assumptions. Running DeepMAsED is straightforward, as is model re-training with our dataset generation pipeline. Therefore, DeepMAsED is a flexible misassembly classifier that can be applied to a wide range of metagenome assembly projects.

Availability and implementation: DeepMAsED is available from GitHub at https://github.com/leylabmpi/DeepMAsED.

Supplementary information: Supplementary data are available at Bioinformatics online.
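The abstract describes classifying contigs from read-alignment evidence alone, without a reference genome. The following is a minimal toy sketch of that idea, with entirely hypothetical feature names and thresholds; DeepMAsED itself uses a trained deep model on richer per-position features, not this hand-written rule.

```python
# Toy sketch (hypothetical features/thresholds): flag a contig as suspect when
# many positions show anomalous read-alignment evidence (coverage drops or
# discordant read pairs), the kind of reference-free signal the abstract describes.

def misassembly_score(coverage, discordant_pairs):
    """Return the fraction of contig positions with anomalous alignment evidence."""
    assert len(coverage) == len(discordant_pairs)
    mean_cov = sum(coverage) / len(coverage)
    anomalous = sum(
        1 for c, d in zip(coverage, discordant_pairs)
        # a position is anomalous if coverage collapses or discordant pairs dominate
        if c < 0.5 * mean_cov or d > 0.5 * max(c, 1)
    )
    return anomalous / len(coverage)

# A clean contig scores 0.0; a contig with a breakpoint-like position scores higher.
clean = misassembly_score([10, 10, 10, 10], [0, 0, 0, 0])
suspect = misassembly_score([10, 10, 10, 1], [0, 0, 0, 5])
```

In the real approach, per-position features like these are the inputs to a learned classifier rather than to a fixed threshold rule.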


2020 ◽  
Vol 2020 ◽  
pp. 1-9
Author(s):  
Li Wang ◽  
Wenjie Pan ◽  
QingHua Wang ◽  
Heming Bai ◽  
Wei Liu ◽  
...  

Drug-drug interactions (DDIs) are an important factor leading to adverse drug reactions. Considering the unique structure of Food and Drug Administration Adverse Event Reporting System (FDA AERS) reports, we changed the scope of the window in the original skip-gram algorithm and propose a language concept representation model that extracts drug-name and reaction features from large-scale AERS reports. Our scheme was validated by comparison with vectors derived from the co-occurrence matrix under tenfold cross-validation. For verifying the enrichment of descriptions in the DrugBank DDI database, accuracy was used as the measure. The average area under the receiver operating characteristic curve of logistic regression classifiers based on the proposed language model is 6% higher than that of the co-occurrence matrix. At the same time, the average accuracy across five severe adverse event classes is 88%. These results indicate that our language model can be useful for extracting drug and reaction features from large-scale AERS reports.
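The abstract's key modification is to the skip-gram window scope: because drug names and reactions within one AERS report are related regardless of their distance, a natural reading is that the context window spans the whole report rather than a fixed number of neighboring tokens. A minimal sketch of generating skip-gram training pairs under that assumption (the function name and example tokens are hypothetical, not from the paper):

```python
# Sketch (assumption: the modified window spans an entire report): every
# drug/reaction token in a report is paired with every other token in that
# report, instead of only tokens within a fixed-size skip-gram window.

def report_skipgram_pairs(report_tokens):
    """Yield (target, context) pairs treating the whole report as the window."""
    pairs = []
    for i, target in enumerate(report_tokens):
        for j, context in enumerate(report_tokens):
            if i != j:
                pairs.append((target, context))
    return pairs

# Hypothetical single-report token list: two drugs and one reaction term.
report = ["warfarin", "aspirin", "gastrointestinal_haemorrhage"]
pairs = report_skipgram_pairs(report)
# Each of the 3 tokens is paired with the other 2, giving 6 training pairs.
```

These pairs would then be fed to a standard skip-gram trainer to produce the drug/reaction embeddings that the classifiers consume.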


2014 ◽  
Vol 40 (3) ◽  
pp. 687-723 ◽  
Author(s):  
Cyril Allauzen ◽  
Bill Byrne ◽  
Adrià de Gispert ◽  
Gonzalo Iglesias ◽  
Michael Riley

This article describes the use of pushdown automata (PDA) in the context of statistical machine translation and alignment under a synchronous context-free grammar. We use PDAs to compactly represent the space of candidate translations generated by the grammar when applied to an input sentence. General-purpose PDA algorithms for replacement, composition, shortest path, and expansion are presented. We describe HiPDT, a hierarchical phrase-based decoder using the PDA representation and these algorithms. We contrast the complexity of this decoder with that of a decoder based on a finite-state automata representation, showing that PDAs provide a more suitable framework for achieving exact decoding with larger synchronous context-free grammars and smaller language models. We assess this experimentally on a large-scale Chinese-to-English alignment and translation task. In translation, we propose a two-pass decoding strategy involving a weaker language model in the first pass, motivated by the results of the PDA complexity analysis. We study in depth the experimental conditions and tradeoffs under which HiPDT can achieve state-of-the-art performance for large-scale SMT.
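The central contrast in the abstract is between fully expanding a grammar's candidate space into a finite-state automaton and keeping nonterminal calls on a stack, the PDA view, so the space is explored lazily and represented compactly. A minimal sketch of that PDA-style membership test, using a tiny hypothetical grammar (not from the paper, and omitting the weights, composition, and shortest-path machinery of a real decoder):

```python
# Sketch (hypothetical toy grammar): PDA-style membership testing.
# Nonterminals are expanded lazily via an explicit stack of pending symbol
# sequences, rather than precomputing the full finite-state expansion.

GRAMMAR = {
    "S": [["X", "translation"]],   # S -> X translation
    "X": [["the"], ["a"]],         # X -> the | a
}

def accepts(tokens):
    """Return True if the grammar derives exactly this token sequence."""
    # Frontier of (remaining grammar symbols, position consumed in tokens).
    frontier = [(("S",), 0)]
    while frontier:
        symbols, pos = frontier.pop()
        if not symbols:
            if pos == len(tokens):
                return True          # all symbols consumed, all tokens matched
            continue
        head, rest = symbols[0], symbols[1:]
        if head in GRAMMAR:
            # Nonterminal: push each alternative right-hand side (lazy expansion).
            for rhs in GRAMMAR[head]:
                frontier.append((tuple(rhs) + rest, pos))
        elif pos < len(tokens) and tokens[pos] == head:
            # Terminal: consume one input token.
            frontier.append((rest, pos + 1))
    return False
```

The FSA-style alternative would enumerate every string the grammar generates up front, which is exactly what becomes infeasible for the larger grammars the article targets.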

