HD Spot: Interpretable Deep Learning Classification of Single Cell Transcript Data

Mapping Intimacies ◽

10.1101/822759 ◽

2019 ◽

Cited By ~ 1

Author(s):

Eric Prince ◽

Todd C. Hankinson

Keyword(s):

Deep Learning ◽

Single Cell ◽

High Throughput ◽

Ground Truth ◽

Sequencing Technologies ◽

Bioinformatic Tool ◽

Complex Relationships ◽

Insight Into ◽

Generation Sequencing

High-throughput data is commonplace in biomedical research, as seen with technologies such as single-cell RNA sequencing (scRNA-seq) and other Next Generation Sequencing technologies. As these techniques are increasingly utilized, it is critical to have analysis tools that can identify meaningful complex relationships between variables (i.e., in the case of scRNA-seq: genes) in a way such that human bias is absent. Moreover, it is equally paramount that both linear and non-linear (i.e., one-to-many) variable relationships be considered when contrasting datasets. HD Spot is a deep learning-based framework that generates an optimal interpretable classifier for a given high-throughput dataset using a simple genetic algorithm as well as an autoencoder-to-classifier transfer learning approach. Using four unique publicly available scRNA-seq datasets with published ground truth, we demonstrate the robustness of HD Spot and its ability to identify ontologically accurate gene lists for a given data subset. HD Spot serves as a bioinformatic tool that allows novice and advanced analysts to gain complex insight into their respective datasets, enabling the development of novel hypotheses.


Accurate classification of protein subcellular localization from high throughput microscopy images using deep learning

10.1101/050757 ◽

2016 ◽

Cited By ~ 4

Author(s):

Tanel Pärnamaa ◽

Leopold Parts

Keyword(s):

Deep Learning ◽

Subcellular Localization ◽

High Throughput ◽

Single Cells ◽

Cellular Compartment ◽

Cell Localization ◽

Image Characteristics ◽

Basic Image ◽

Training Examples

High throughput microscopy of many single cells generates high-dimensional data that are far from straightforward to analyze. One important problem is automatically detecting the cellular compartment where a fluorescently tagged protein resides, a task relatively simple for an experienced human, but difficult to automate on a computer. Here, we train an 11-layer neural network on data from mapping thousands of yeast proteins, achieving per cell localization classification accuracy of 91%, and per protein accuracy of 99% on held out images. We confirm that low-level network features correspond to basic image characteristics, while deeper layers separate localization classes. Using this network as a feature calculator, we train standard classifiers that assign proteins to previously unseen compartments after observing only a small number of training examples. Our results are the most accurate subcellular localization classifications to date, and demonstrate the usefulness of deep learning for high throughput microscopy.
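The gap between per-cell (91%) and per-protein (99%) accuracy is what one would expect if many single-cell predictions are pooled for each protein. The abstract does not state the pooling rule, so the sketch below assumes a simple majority vote; the function name and the protein and compartment labels are hypothetical.

```python
from collections import Counter, defaultdict


def per_protein_calls(cell_predictions):
    """Pool per-cell predictions into one call per protein.

    cell_predictions: iterable of (protein_id, predicted_compartment) pairs,
    one pair per imaged cell. Majority voting is an assumption made for this
    illustration, not the rule reported by the authors.
    """
    votes = defaultdict(Counter)
    for protein, compartment in cell_predictions:
        votes[protein][compartment] += 1
    # Keep the most frequent compartment for each protein.
    return {protein: counts.most_common(1)[0][0]
            for protein, counts in votes.items()}


# Hypothetical example: one misclassified cell is outvoted by two correct ones.
example = [
    ("P1", "nucleus"),
    ("P1", "nucleus"),
    ("P1", "cytoplasm"),
    ("P2", "vacuole"),
]
calls = per_protein_calls(example)
```

Under this assumption, isolated per-cell errors do not change the per-protein call as long as most cells for that protein are classified correctly.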


Application of BERT to Enable Gene Classification Based on Clinical Evidence

BioMed Research International ◽

10.1155/2020/5491963 ◽

2020 ◽

Vol 2020 ◽

pp. 1-13

Author(s):

Yuhan Su ◽

Hongxin Xiang ◽

Haotian Xie ◽

Yong Yu ◽

Shiyan Dong ◽

...

Keyword(s):

Clinical Evidence ◽

Genetic Mutations ◽

Data Presentation ◽

Sequencing Technologies ◽

Practical Tool ◽

Logarithmic Loss ◽

Manual Classification ◽

High Repeatability ◽

Generation Sequencing

The identification of profiled cancer-related genes plays an essential role in cancer diagnosis and treatment. According to the literature, the classification of genetic mutations is still done manually. Manual classification of genetic mutations is pathologist-dependent, subjective, and time-consuming. To improve the accuracy of clinical interpretation, scientists have proposed computational approaches for the automatic analysis of mutations since the advent of next-generation sequencing technologies. Nevertheless, challenges such as multiple classifications, the complexity of texts, redundant descriptions, and inconsistent interpretation have limited the development of algorithms. To overcome these difficulties, we have adapted a deep learning method named Bidirectional Encoder Representations from Transformers (BERT) to classify genetic mutations based on text evidence from an annotated database. During training, three challenging features (the extreme length of texts, biased data presentation, and high repeatability) were addressed. Finally, BERT+abstract demonstrates satisfactory results with 0.80 logarithmic loss, 0.6837 recall, and 0.705 F-measure. It is feasible for BERT to classify genomic mutation text within literature-based datasets. Consequently, BERT is a practical tool for facilitating and significantly speeding up cancer research on tumor progression, diagnosis, and the design of more precise and effective treatments.
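For readers unfamiliar with the reported metrics, the following minimal sketch shows the standard definitions of multi-class logarithmic loss and the F-measure (taken here as F1, the harmonic mean of precision and recall). The function names and toy inputs are illustrative assumptions; the paper's own evaluation code and averaging scheme are not described in the abstract.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Mean multi-class logarithmic loss.

    y_true: list of true class indices.
    probs:  list of per-class probability lists, one per sample.
    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for label, p in zip(y_true, probs):
        p_true = min(max(p[label], eps), 1.0 - eps)
        total += -math.log(p_true)
    return total / len(y_true)


def f_measure(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy inputs (not data from the paper): two samples, two classes.
toy_loss = log_loss([0, 1], [[0.5, 0.5], [0.5, 0.5]])
toy_f = f_measure(0.5, 0.5)
```

Lower logarithmic loss is better because it penalizes confident wrong predictions heavily, while a higher F-measure is better.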


Deep learning enables high-throughput early detection and classification of bacterial colonies using time-lapse coherent imaging (Conference Presentation)

Optics and Biophotonics in Low-Resource Settings VI ◽

10.1117/12.2547399 ◽

2020 ◽

Author(s):

Hongda Wang ◽

Hatice C. Koydemir ◽

Yunzhe Qiu ◽

Bijie Bai ◽

Yibo Zhang ◽

...

Keyword(s):

Deep Learning ◽

Early Detection ◽

High Throughput ◽

Time Lapse ◽

Coherent Imaging


Non-coding RNA bioinformatics platform for full backing of the high-throughput sequencing experiments generated by next-generation sequencing technologies

EMBnet journal ◽

10.14806/ej.18.a.461 ◽

2012 ◽

Vol 18 (A) ◽

pp. 132

Author(s):

F Licciulli ◽

A Consiglio ◽

G De Caro ◽

A Gisel ◽

G Grillo ◽

...

Keyword(s):

Next Generation Sequencing ◽

High Throughput ◽

High Throughput Sequencing ◽

Next Generation ◽

Sequencing Technologies ◽

Non Coding Rna ◽

Generation Sequencing


Image-based taxonomic classification of bulk biodiversity samples using deep learning and domain adaptation

10.1101/2021.12.22.473797 ◽

2021 ◽

Author(s):

Tomochika Fujisawa ◽

Victor Noguerales ◽

Emmanouil Meramveliotakis ◽

Anna Papadopoulou ◽

Alfried P Vogler

Keyword(s):

Deep Learning ◽

High Throughput ◽

Domain Adaptation ◽

Network Models ◽

Neural Network Models ◽

Data Set ◽

Model Training ◽

Trained Neural Network ◽

Domain Transfer

Complex bulk samples of invertebrates from biodiversity surveys present a great challenge for taxonomic identification, especially if obtained from unexplored ecosystems. High-throughput imaging combined with machine learning for rapid classification could overcome this bottleneck. Developing such procedures requires that taxonomic labels from an existing source data set are used for model training and prediction of an unknown target sample. Yet the feasibility of transfer learning for the classification of unknown samples remains to be tested. Here, we assess the efficiency of deep learning and domain transfer algorithms for family-level classification of below-ground bulk samples of Coleoptera from understudied forests of Cyprus. We trained neural network models with images from local surveys versus global databases of above-ground samples from tropical forests and evaluated how prediction accuracy was affected by: (a) the quality and resolution of images, (b) the size and complexity of the training set and (c) the transferability of identifications across very disparate source-target pairs that do not share any species or genera. Within-dataset classification accuracy reached 98% and depended on the number and quality of training images and on dataset complexity. The accuracy of between-datasets predictions was reduced to a maximum of 82% and depended greatly on the standardisation of the imaging procedure. When the source and target images were of similar quality and resolution, albeit from different faunas, the reduction of accuracy was minimal. Application of algorithms for domain adaptation significantly improved the prediction performance of models trained by non-standardised, low-quality images. Our findings demonstrate that existing databases can be used to train models and successfully classify images from unexplored biota, when the imaging conditions and classification algorithms are carefully considered. Also, our results provide guidelines for data acquisition and algorithmic development for high-throughput image-based biodiversity surveys.


Platon: identification and characterization of bacterial plasmid contigs in short-read draft assemblies exploiting protein-sequence-based replicon distribution scores

10.1101/2020.04.21.053082 ◽

2020 ◽

Cited By ~ 2

Author(s):

Oliver Schwengers ◽

Patrick Barth ◽

Linda Falgenhauer ◽

Torsten Hain ◽

Trinad Chakraborty ◽

...

Keyword(s):

High Throughput ◽

Protein Sequence ◽

Scientific Community ◽

Vital Role ◽

Bacterial Genomes ◽

Short Read ◽

Link Type ◽

Sequencing Technologies ◽

Generation Sequencing

Abstract

Plasmids are extrachromosomal genetic elements replicating independently of the chromosome which play a vital role in the environmental adaptation of bacteria. Due to potential mobilization or conjugation capabilities, plasmids are important genetic vehicles for antimicrobial resistance genes and virulence factors with huge and increasing clinical implications. They are therefore subject to large genomic studies within the scientific community worldwide. As a result of rapidly improving next generation sequencing methods, the amount of sequenced bacterial genomes is constantly increasing, in turn raising the need for specialized tools to (i) extract plasmid sequences from draft assemblies, (ii) derive their origin and distribution, and (iii) further investigate their genetic repertoire. Recently, several bioinformatic methods and tools have emerged to tackle this issue; however, a combination of both high sensitivity and specificity in plasmid sequence identification is rarely achieved in a taxon-independent manner. In addition, many software tools are not appropriate for large high-throughput analyses or cannot be included into existing software pipelines due to their technical design or software implementation. In this study, we investigated differences in the replicon distributions of protein-coding genes on a large scale as a new approach to distinguish plasmid-borne from chromosome-borne contigs. We defined and computed statistical discrimination thresholds for a new metric: the replicon distribution score (RDS) which achieved an accuracy of 96.6%. The final performance was further improved by the combination of the RDS metric with heuristics exploiting several plasmid specific higher-level contig characterizations. We implemented this workflow in a new high-throughput taxon-independent bioinformatics software tool called Platon for the recruitment and characterization of plasmid-borne contigs from short-read draft assemblies. Compared to PlasFlow, Platon achieved a higher accuracy (97.5%) and more balanced predictions (F1=82.6%) tested on a broad range of bacterial taxa and better or equal performance against the targeted tools PlasmidFinder and PlaScope on sequenced E. coli isolates. Platon is available at: platon.computational.bio

Data Summary

Platon was developed as a Python 3 command line application for Linux.

The complete source code and documentation is available on GitHub under a GPL3 license: https://github.com/oschwengers/platon and platon.computational.bio.

All database versions are hosted at Zenodo: DOI 10.5281/zenodo.3349651.

Platon is available via bioconda package platon

Platon is available via PyPI package cb-platon

Bacterial representative sequences for UniProt’s UniRef90 protein clusters, complete bacterial genome sequences from the NCBI RefSeq database, complete plasmid sequences from the NCBI genomes plasmid section, created artificial contigs, RDS threshold metrics and raw protein replicon hit counts used to create and evaluate the marker protein sequence database are hosted at Zenodo: DOI 10.5281/zenodo.3759169

24 Escherichia coli isolates sequenced with short read (Illumina MiSeq) and long read sequencing technologies (Oxford Nanopore Technology GridION platform) used for real data benchmarks are available under the following NCBI BioProjects: PRJNA505407, PRJNA387731

Impact Statement

Plasmids play a vital role in the spread of antibiotic resistance and pathogenicity genes. The increasing numbers of clinical outbreaks involving resistant pathogens worldwide pushed the scientific community to increase their efforts to comprehensively investigate bacterial genomes. Due to the maturation of next-generation sequencing technologies, nowadays entire bacterial genomes including plasmids are sequenced in huge scale. To analyze draft assemblies, a mandatory first step is to separate plasmid from chromosome contigs. Recently, many bioinformatic tools have emerged to tackle this issue. Unfortunately, several tools are implemented only as interactive or web-based tools disabling them for necessary high-throughput analysis of large data sets. Other tools providing such a high-throughput implementation however often come with certain drawbacks, e.g. providing taxon-specific databases only, not providing actionable, i.e. true binary classification or achieving biased classification performances towards either sensitivity or specificity.

Here, we introduce the tool Platon implementing a new replicon distribution-based approach combined with higher-level contig characterizations to address the aforementioned issues. In addition to the plasmid detection within draft assemblies, Platon provides the user with valuable information on certain higher-level contig characterizations. We show that Platon provides a balanced classification performance as well as a scalable implementation for high-throughput analyses. We therefore consider Platon to be a powerful, species-independent and flexible tool to scan large amounts of bacterial whole-genome sequencing data for their plasmid content.


Single-cell classification of foodborne pathogens using hyperspectral microscope imaging coupled with deep learning frameworks

Sensors and Actuators B Chemical ◽

10.1016/j.snb.2020.127789 ◽

2020 ◽

Vol 309 ◽

pp. 127789 ◽

Cited By ~ 3

Author(s):

Rui Kang ◽

Bosoon Park ◽

Matthew Eady ◽

Qin Ouyang ◽

Kunjie Chen

Keyword(s):

Deep Learning ◽

Single Cell ◽

Foodborne Pathogens ◽

Cell Classification ◽

Microscope Imaging ◽

Learning Frameworks


Assessment of Bioleaching Microbial Community Structure and Function Based on Next-Generation Sequencing Technologies

Minerals ◽

10.3390/min8120596 ◽

2018 ◽

Vol 8 (12) ◽

pp. 596 ◽

Cited By ~ 1

Author(s):

Shuang Zhou ◽

Min Gan ◽

Jianyu Zhu ◽

Xinxing Liu ◽

Guanzhou Qiu

Keyword(s):

Community Structure ◽

Next Generation Sequencing ◽

Microbial Ecology ◽

High Throughput ◽

High Throughput Sequencing ◽

Structure And Function ◽

Next Generation ◽

Sequencing Technologies ◽

And Function ◽

Generation Sequencing

It is widely known that bioleaching microorganisms have to cope with complex, extreme environments in which microbial ecology, in terms of community structure and function, varies across environmental types. However, analysis of the microbial ecology of bioleaching bacteria is still a challenge. To address this challenge, numerous technologies have been developed. In recent years, high-throughput sequencing technologies, which bring comprehensive sequencing analysis of cellular RNA and DNA within the reach of most laboratories, have been added to the toolbox of microbial ecology. Next-generation sequencing of DNA can produce draft genomic sequences of more bioleaching bacteria, which provides the opportunity to predict models of the genetic and metabolic potential of bioleaching bacteria and ultimately deepens our understanding of bioleaching microorganisms. High-throughput sequencing that targets the phylogenetic marker 16S rRNA has been effectively applied to characterize community diversity in ore leaching environments. RNA-seq, another application of high-throughput sequencing that profiles RNA, can be used for both mapping and quantifying the transcriptome and has demonstrated high efficiency in quantifying the changing expression level of each transcript under different conditions. It has proven to be a powerful tool for dissecting the relationship between genotype and phenotype, leading to the interpretation of functional elements of the genome and revealing molecular mechanisms of adaptation. This review describes high-throughput sequencing approaches for bioleaching environmental microorganisms, focusing particularly on their applications and the associated challenges.


Genomics of medulloblastoma: from Giemsa-banding to next-generation sequencing in 20 years

Neurosurgical FOCUS ◽

10.3171/2009.10.focus09218 ◽

2010 ◽

Vol 28 (1) ◽

pp. E6 ◽

Cited By ~ 39

Author(s):

Paul A. Northcott ◽

James T. Rutka ◽

Michael D. Taylor

Keyword(s):

Next Generation Sequencing ◽

Molecular Mechanisms ◽

Genomic Medicine ◽

Prognostic Significance ◽

Next Generation ◽

Sequencing Technologies ◽

Disease Stratification ◽

Insight Into ◽

Generation Sequencing

Advances in the field of genomics have recently enabled the unprecedented characterization of the cancer genome, providing novel insight into the molecular mechanisms underlying malignancies in humans. The application of high-resolution microarray platforms to the study of medulloblastoma has revealed new oncogenes and tumor suppressors and has implicated changes in DNA copy number, gene expression, and methylation state in its etiology. Additionally, the integration of medulloblastoma genomics with patient clinical data has confirmed molecular markers of prognostic significance and highlighted the potential utility of molecular disease stratification. The advent of next-generation sequencing technologies promises to greatly transform our understanding of medulloblastoma pathogenesis in the next few years, permitting comprehensive analyses of all aspects of the genome and increasing the likelihood that genomic medicine will become part of the routine diagnosis and treatment of medulloblastoma.


A rank-based marker selection method for high throughput scRNA-seq data

BMC Bioinformatics ◽

10.1186/s12859-020-03641-z ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Alexander H. S. Vargo ◽

Anna C. Gilbert

Keyword(s):

Single Cell ◽

High Throughput ◽

Parametric Method ◽

Ground Truth ◽

Specific Cell ◽

Data Sets ◽

Fast Method ◽

Selection Methods ◽

Single Experiment ◽

Marker Selection

Abstract

Background

High throughput microfluidic protocols in single cell RNA sequencing (scRNA-seq) collect mRNA counts from up to one million individual cells in a single experiment; this enables high resolution studies of rare cell types and cell development pathways. Determining small sets of genetic markers that can identify specific cell populations is thus one of the major objectives of computational analysis of mRNA counts data. Many tools have been developed for marker selection on single cell data; most of them, however, are based on complex statistical models and handle the multi-class case in an ad-hoc manner.

Results

We introduce RankCorr, a fast method with strong mathematical underpinnings that performs multi-class marker selection in an informed manner. RankCorr proceeds by ranking the mRNA counts data before linearly separating the ranked data using a small number of genes. The step of ranking is intuitively natural for scRNA-seq data and provides a non-parametric method for analyzing count data. In addition, we present several performance measures for evaluating the quality of a set of markers when there is no known ground truth. Using these metrics, we compare the performance of RankCorr to a variety of other marker selection methods on an assortment of experimental and synthetic data sets that range in size from several thousand to one million cells.

Conclusions

According to the metrics introduced in this work, RankCorr is consistently one of the best-performing marker selection methods on scRNA-seq data. Most methods show similar overall performance, however; thus, the speed of the algorithm is the most important consideration for large data sets (and comparing the markers selected by several methods can be fruitful). RankCorr is fast enough to easily handle the largest data sets and, as such, it is a useful tool to add into computational pipelines when dealing with high throughput scRNA-seq data. RankCorr software is available for download at https://github.com/ahsv/RankCorr with extensive documentation.
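The idea of ranking counts before selecting markers can be illustrated with a small sketch. This is not the RankCorr algorithm or its API: the scoring step below (correlating each gene's ranks with a cluster indicator) is a simplified stand-in chosen for illustration, and all function names, gene names, and counts are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def top_markers(counts, labels, target, k=1):
    """Rank-transform each gene, then keep the k genes whose ranks
    correlate most strongly with membership in the target cluster."""
    indicator = [1.0 if label == target else 0.0 for label in labels]
    scores = {gene: pearson(average_ranks(c), indicator)
              for gene, c in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical counts for three genes across four cells in two clusters.
counts = {
    "geneA": [0, 0, 5, 7],
    "geneB": [3, 3, 3, 3],
    "geneC": [9, 8, 0, 1],
}
labels = ["x", "x", "y", "y"]
markers_for_y = top_markers(counts, labels, "y", k=1)
```

Because only ranks enter the score, the result is unchanged by any monotone rescaling of the counts, which is the sense in which a rank-based approach is non-parametric.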
