MLSeq: Machine learning interface for RNA-sequencing data

Abstract: The research landscape of single-cell and single-nuclei RNA sequencing is evolving rapidly, and one area enabled by this technology is the detection of rare cells. An automated, unbiased and accurate annotation of rare subpopulations is challenging. Once rare cells are identified in one dataset, it will usually be necessary to generate other datasets to enrich the analysis (e.g., with samples from other tissues). From a machine learning perspective, the challenge arises from the fact that rare cell subpopulations constitute an imbalanced classification problem. We here introduce a Machine Learning (ML)-based oversampling method that uses gene expression counts of already identified rare cells as input to generate synthetic cells and then identify similar (rare) cells in other publicly available experiments. We utilize single-cell synthetic oversampling (sc-SynO), which is based on the Localized Random Affine Shadowsampling (LoRAS) algorithm. The algorithm corrects for the overall imbalance ratio of the minority and majority class. We demonstrate the effectiveness of the method for two independent use cases, each consisting of two published datasets. The first use case identifies cardiac glial cells in snRNA-Seq data (17 nuclei out of 8,635). This use case was designed to take a larger imbalance ratio (∼1 to 500) into account and only uses single-nuclei data. The second use case was designed to jointly use snRNA-Seq data and scRNA-Seq on a lower imbalance ratio (∼1 to 26) for the training step to likewise investigate the potential of the algorithm to consider both single cell capture procedures and the impact of “less” rare-cell types. For validation purposes, all datasets have also been analyzed in a traditional manner using common data analysis approaches, such as the Seurat3 workflow. Our algorithm identifies rare-cell populations with a high accuracy and low false positive detection rate. A striking benefit of our algorithm is that it can be readily implemented in other and existing workflows. The code base is publicly available at FairdomHub (https://fairdomhub.org/assays/1368) and can easily be transferred to train other customized approaches.
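The oversampling idea in this abstract can be illustrated with a small sketch. It is not the sc-SynO or LoRAS implementation: the function names, the neighbourhood size `k`, the noise level `noise_sd`, and the toy expression values are all assumptions made for illustration. The sketch builds noisy copies ("shadowsamples") of a rare cell's nearest rare neighbours and returns a random convex (affine) combination of them as one synthetic cell.

```python
import math
import random


def nearest_neighbours(points, index, k):
    """Return the k minority points closest to points[index] (itself included)."""
    return sorted(points, key=lambda p: math.dist(points[index], p))[:k]


def convex_weights(n, rng):
    """Draw n non-negative weights that sum to 1."""
    raw = [rng.random() + 1e-12 for _ in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def oversample_minority(minority, n_synthetic, k=3, noise_sd=0.05, seed=0):
    """Generate synthetic minority-class cells (illustrative, not sc-SynO itself)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        centre = rng.randrange(len(minority))
        neighbourhood = nearest_neighbours(minority, centre, min(k, len(minority)))
        # Shadowsamples: neighbourhood points perturbed with Gaussian noise.
        shadows = [[x + rng.gauss(0.0, noise_sd) for x in p] for p in neighbourhood]
        weights = convex_weights(len(shadows), rng)
        # Synthetic cell: weighted average of the shadowsamples, gene by gene.
        n_genes = len(shadows[0])
        cell = [sum(w * s[d] for w, s in zip(weights, shadows)) for d in range(n_genes)]
        synthetic.append(cell)
    return synthetic


# Hypothetical normalized expression of four rare cells over three genes.
rare_cells = [[5.0, 0.1, 2.0], [5.2, 0.0, 2.1], [4.8, 0.2, 1.9], [5.1, 0.1, 2.2]]
synthetic_cells = oversample_minority(rare_cells, n_synthetic=20)
```

Because the combination is convex, a synthetic cell stays inside the local neighbourhood of real rare cells when the noise is zero. With noise added, values can drift slightly outside that range, so raw counts would need clipping at zero.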


Authentication of Differential Gene Expression in Oral Squamous Cell Carcinoma using Machine Learning Applications

10.21203/rs.3.rs-128045/v1 ◽

2020 ◽

Author(s):

Rian Pratama ◽

Jae Joon Hwang ◽

Ji Hye Lee ◽

Giltae Song ◽

Hae Ryoun Park

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Squamous Cell Carcinoma ◽

Oral Squamous Cell Carcinoma ◽

Cell Carcinoma ◽

Rna Sequencing ◽

Differential Gene Expression ◽

Sequencing Data ◽

Machine Learning Applications ◽

Differential Gene

Abstract Background: Recently, the possibility of tumour classification based on genetic data has been investigated. However, genetic datasets are difficult to handle because of their massive size and complexity of manipulation. In the present study, we examined the diagnostic performance of machine learning applications using imaging-based classifications of oral squamous cell carcinoma (OSCC) gene sets. Methods: RNA sequencing data from SCC tissues from various sites, including oral, non-oral head and neck, oesophageal, and cervical regions, were downloaded from The Cancer Genome Atlas (TCGA). The feature genes were extracted through a convolutional neural network (CNN) and machine learning, and the performance of each analysis was compared. Results: The ability of the machine learning analysis to classify OSCC tumours was excellent. However, the tool exhibited poorer performance in discriminating histopathologically dissimilar cancers derived from the same type of tissue than in differentiating cancers of the same histopathologic type with different tissue origins, revealing that the differential gene expression pattern is a more important factor than the histopathologic features for differentiating cancer types. Conclusion: The CNN-based diagnostic model and the visualisation methods using RNA sequencing data were useful for correctly categorising OSCC. The analysis showed differentially expressed genes in multiwise comparisons of various types of SCCs, such as KCNA10, FOSL2, and PRDM16, and extracted leader genes from pairwise comparisons were FGF20, DLC1, and ZNF705D.


A Machine Learning Approach to Prostate Cancer Risk Classification Through Use of RNA Sequencing Data

Lecture Notes in Computer Science - Big Data – BigData 2019 ◽

10.1007/978-3-030-23551-2_5 ◽

2019 ◽

pp. 65-79

Author(s):

Matthew Casey ◽

Baldwin Chen ◽

Jonathan Zhou ◽

Nianjun Zhou

Keyword(s):

Prostate Cancer ◽

Machine Learning ◽

Cancer Risk ◽

Rna Sequencing ◽

Prostate Cancer Risk ◽

Risk Classification ◽

Learning Approach ◽

Sequencing Data ◽

Machine Learning Approach


Machine learning applied to whole‐blood RNA‐sequencing data uncovers distinct subsets of patients with systemic lupus erythematosus

Clinical & Translational Immunology ◽

10.1002/cti2.1093 ◽

2019 ◽

Vol 8 (12) ◽

Cited By ~ 4

Author(s):

William A Figgett ◽

Katherine Monaghan ◽

Milica Ng ◽

Monther Alhamdoosh ◽

Eugene Maraskovsky ◽

...

Keyword(s):

Machine Learning ◽

Systemic Lupus Erythematosus ◽

Rna Sequencing ◽

Lupus Erythematosus ◽

Whole Blood ◽

Sequencing Data ◽

Systemic Lupus


Machine Learning-Assisted Identification of Factors Contributing to the Technical Variability Between Bulk and Single-Cell RNA-Seq Experiments

10.21203/rs.3.rs-1247889/v1 ◽

2022 ◽

Author(s):

Sofya Lipnitskaya ◽

Yang Shen ◽

Stefan Legewie ◽

Holger Klein ◽

Kolja Becker

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Single Cell ◽

Rna Sequencing ◽

Quantitative Difference ◽

Rna Seq ◽

Sequencing Data ◽

Factors Affecting ◽

Expression Variability ◽

Technical Variability

Abstract Background: Recent studies in the area of transcriptomics performed on single-cell and population levels reveal noticeable variability in gene expression measurements provided by different RNA sequencing technologies. Due to the increased noise and complexity of single-cell RNA-Seq (scRNA-Seq) data compared with bulk experiments, there is a substantial number of variably-expressed genes and so-called dropouts, which challenge the subsequent computational analysis and potentially lead to false positive discoveries. To investigate factors affecting technical variability between RNA sequencing experiments of different technologies, we performed a systematic assessment of single-cell and bulk RNA-Seq data that have undergone the same pre-processing and sample preparation procedures. Results: Our analysis indicates that variability between gene expression measurements as well as dropout events are not exclusively caused by biological variability, low expression levels, or random variation. Furthermore, we propose FAVSeq, a machine learning-assisted pipeline for detection of factors contributing to gene expression variability in matched RNA-Seq data provided by two technologies. Based on the analysis of the matched bulk and single-cell dataset, we found the 3'-UTR and transcript lengths to be the most relevant effectors of the observed variation between RNA-Seq experiments, while the same factors together with cellular compartments were shown to be associated with dropouts. Conclusions: Here, we investigated the sources of variation in RNA-Seq profiles of matched single-cell and bulk experiments. In addition, we proposed the FAVSeq pipeline for analyzing multimodal RNA sequencing data, which allowed us to identify factors affecting quantitative differences in gene expression measurements as well as the presence of dropouts. The derived knowledge can be used to improve the interpretation of RNA-Seq data and to identify genes that can be affected by assay-based deviations. Source code is available under the MIT license at https://github.com/slipnitskaya/FAVSeq.


Characterization of Cancer Types by Applying Machine Learning Methods on Blood RNA-Sequencing Data

2019 3rd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT) ◽

10.1109/ismsit.2019.8932905 ◽

2019 ◽

Author(s):

Cem Bugra Alkan ◽

Zerrin Isik

Keyword(s):

Machine Learning ◽

Rna Sequencing ◽

Sequencing Data ◽

Learning Methods ◽

Machine Learning Methods ◽

Cancer Types


Identifying Cancer Biomarkers from High-Throughput RNA Sequencing Data by Machine Learning

Intelligent Computing Theories and Application - Lecture Notes in Computer Science ◽

10.1007/978-3-030-26969-2_49 ◽

2019 ◽

pp. 517-528

Author(s):

Zishuang Zhang ◽

Zhi-Ping Liu

Keyword(s):

Machine Learning ◽

Rna Sequencing ◽

High Throughput ◽

Cancer Biomarkers ◽

Sequencing Data


Chord: Identifying Doublets in Single-Cell RNA Sequencing Data by an Ensemble Machine Learning Algorithm

10.1101/2021.05.07.442884 ◽

2021 ◽

Author(s):

Ke-Xu Xiong ◽

Han-Lin Zhou ◽

Jian-Hua Yin ◽

Karsten Kristiansen ◽

Huan-Ming Yang ◽

...

Keyword(s):

Machine Learning ◽

Single Cell ◽

Rna Sequencing ◽

Learning Algorithm ◽

Synthetic Data ◽

Detection Methods ◽

Machine Learning Algorithm ◽

Sequencing Data ◽

Modular Architecture ◽

Single Cell Rna Sequencing

High-throughput single-cell RNA sequencing (scRNA-seq) is a popular method, but it is accompanied by doublet rate problems that disturb the downstream analysis. Several computational approaches have been developed to detect doublets. However, most of these methods perform well on some datasets but lack stability on others; thus, it is difficult to regard a single method as the gold standard for every scenario, and choosing the most appropriate software is a difficult and time-consuming task for researchers. To address these issues, we propose Chord, which implements a machine learning algorithm that integrates multiple doublet detection methods. Chord had higher accuracy and stability than the individual approaches on different datasets containing real and synthetic data. Moreover, Chord was designed with a modular architecture port, which has high flexibility and adaptability to the incorporation of any new tools. Chord is a general solution to the doublet detection problem.
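As a rough illustration of integrating several doublet detectors with a learned model, the sketch below stacks per-cell scores from three hypothetical detectors using logistic regression fitted by gradient descent. The abstract does not say which learner Chord uses, so logistic regression is a stand-in, and the scores, labels, and function names are invented for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stacker(score_rows, labels, lr=0.5, epochs=2000):
    """Fit logistic-regression weights on per-detector scores (batch gradient descent)."""
    n_features = len(score_rows[0])
    n = len(score_rows)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, y in zip(score_rows, labels):
            err = sigmoid(bias + sum(w * x for w, x in zip(weights, row))) - y
            for j, x in enumerate(row):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_doublet_probability(row, weights, bias):
    """Combined doublet probability for one cell's detector scores."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))


# Hypothetical scores in [0, 1] from three detectors; label 1 marks a known
# (for example, simulated) doublet and label 0 a singlet.
train_scores = [
    [0.9, 0.8, 0.7], [0.8, 0.4, 0.9], [0.7, 0.9, 0.3],
    [0.2, 0.1, 0.3], [0.1, 0.3, 0.2], [0.3, 0.2, 0.6],
]
train_labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_stacker(train_scores, train_labels)
p_high = predict_doublet_probability([0.85, 0.9, 0.8], weights, bias)
p_low = predict_doublet_probability([0.1, 0.1, 0.1], weights, bias)
```

A stacked model of this kind can down-weight a detector that is unreliable on the training data, which is one plausible reason an ensemble would be more stable than any single method.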


Identification of Three Rheumatoid Arthritis Disease Subtypes by Machine Learning Integration of Synovial Histologic Features and RNA Sequencing Data

Arthritis & Rheumatology ◽

10.1002/art.40428 ◽

2018 ◽

Vol 70 (5) ◽

pp. 690-701 ◽

Cited By ~ 49

Author(s):

Dana E. Orange ◽

Phaedra Agius ◽

Edward F. DiCarlo ◽

Nicolas Robine ◽

Heather Geiger ◽

...

Keyword(s):

Rheumatoid Arthritis ◽

Machine Learning ◽

Rna Sequencing ◽

Sequencing Data ◽

Rheumatoid Arthritis Disease ◽

Disease Subtypes


How Machine Learning and Statistical Models Advance Molecular Diagnostics of Rare Disorders Via Analysis of RNA Sequencing Data

Frontiers in Molecular Biosciences ◽

10.3389/fmolb.2021.647277 ◽

2021 ◽

Vol 8 ◽

Author(s):

Lea D. Schlieben ◽

Holger Prokisch ◽

Vicente A. Yépez

Keyword(s):

Machine Learning ◽

Rna Sequencing ◽

Rare Diseases ◽

Statistical Models ◽

Multiple Testing ◽

Diagnostic Yield ◽

Genetic Diagnosis ◽

Sequencing Data ◽

Rare Disorders ◽

Aberrant Expression

Rare diseases, although individually rare, collectively affect approximately 350 million people worldwide. Currently, nearly 6,000 distinct rare disorders with a known molecular basis have been described, yet establishing a specific diagnosis based on the clinical phenotype is challenging. Increasing integration of whole exome sequencing into routine diagnostics of rare diseases is improving diagnostic rates. Nevertheless, about half of the patients do not receive a genetic diagnosis due to the challenges of variant detection and interpretation. In recent years, RNA sequencing has been increasingly used as a complementary diagnostic tool providing functional data. Initially, arbitrary thresholds were applied to call aberrant expression, aberrant splicing, and mono-allelic expression. As RNA sequencing came to be applied in the search for a molecular diagnosis, robust statistical models on normalized read counts allowed the detection of significant outliers corrected for multiple testing. More recently, machine learning methods have been developed to improve the normalization of RNA sequencing read count data by taking confounders into account. Together, these methods have increased the power and sensitivity of detection and interpretation of pathogenic variants, leading to diagnostic rates of 10–35% in rare diseases. In this review, we provide an overview of the methods used for RNA sequencing and illustrate how these can improve the diagnostic yield of rare diseases.
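The step from arbitrary thresholds to significance calls corrected for multiple testing can be sketched as follows. This is a simplified illustration, not any of the reviewed tools: it assumes normalized expression values that are roughly normal per gene, scores one sample against the remaining samples with a z-score, and applies the Benjamini-Hochberg procedure. The data layout, gene names, and function names are hypothetical.

```python
import math
import statistics


def two_sided_p(z):
    """Two-sided p-value under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_outliers(expression, sample, alpha=0.05):
    """Genes whose value in `sample` is an expression outlier after BH correction.

    expression: dict mapping gene -> list of normalized values, one per sample.
    Assumes each gene has non-zero variance across the other samples.
    """
    genes = list(expression)
    p_values = []
    for gene in genes:
        values = expression[gene]
        others = values[:sample] + values[sample + 1:]
        z = (values[sample] - statistics.mean(others)) / statistics.stdev(others)
        p_values.append(two_sided_p(z))
    adjusted = benjamini_hochberg(p_values)
    return [g for g, q in zip(genes, adjusted) if q < alpha]


# Hypothetical normalized expression; the last sample is the patient of interest.
expression = {
    "GENE_A": [10.0, 10.2, 9.9, 10.1, 10.0, 2.0],
    "GENE_B": [5.0, 5.1, 4.9, 5.2, 5.0, 5.05],
    "GENE_C": [7.0, 7.3, 6.8, 7.1, 6.9, 7.0],
}
outliers = call_outliers(expression, sample=5)
```

The correction matters because thousands of genes are tested per patient, so an uncorrected cutoff would flag many genes by chance alone.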
