SomaticSiMu: A mutational signature simulator

Abstract Summary SomaticSiMu is an in silico simulator of mutations in genome sequences. SomaticSiMu simulates single and double base substitutions, and single base insertions and deletions, in an input genomic sequence to mimic mutational signatures. The tool is the first mutational signature simulator featuring a graphical user interface, control of mutation rates, and built-in visualization tools for the simulated mutations. SomaticSiMu generates simulated FASTA sequences and mutational catalogs with imposed mutational signatures. The reliability of SomaticSiMu in simulating mutational signatures was affirmed by supervised machine learning classification of simulated sequences with different mutation types and burdens, and by mutational signature extraction from simulated mutational catalogs. SomaticSiMu is useful in validating sequence classification and mutational signature extraction tools. Availability and implementation SomaticSiMu is written in Python 3.8.3. The open-source code, documentation, and tutorials are available at https://github.com/HillLab/SomaticSiMu under the terms of the Creative Commons Attribution 4.0 International license. Contact [email protected] Supplementary information Supplementary data are appended.
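
The idea of imposing a signature on an input sequence can be illustrated with a minimal sketch. It is not SomaticSiMu's code or interface: the three-channel `TOY_SIGNATURE`, the function `simulate_sbs`, and the sampling scheme are hypothetical, and only single base substitutions are covered. Trinucleotide contexts are taken from the unmutated input sequence, as an assumption.

```python
import random

# Hypothetical toy signature: relative weight of each
# (trinucleotide context, new centre base) substitution channel.
TOY_SIGNATURE = {
    ("TCA", "T"): 0.6,  # C>T in a TCA context
    ("TCT", "T"): 0.3,  # C>T in a TCT context
    ("ACG", "T"): 0.1,  # C>T in an ACG context
}


def simulate_sbs(sequence, signature, n_mutations, seed=0):
    """Impose up to n_mutations substitutions drawn from the signature.

    Returns the mutated sequence and a catalog of
    (position, context, new base) tuples.
    """
    rng = random.Random(seed)
    seq = list(sequence)

    # Index the centre position of every trinucleotide in the input sequence.
    sites = {}
    for i in range(1, len(seq) - 1):
        context = sequence[i - 1:i + 2]
        sites.setdefault(context, []).append(i)

    # Keep only channels whose context occurs in this sequence.
    channels = [c for c in signature if sites.get(c[0])]
    weights = [signature[c] for c in channels]

    catalog, used = [], set()
    for _ in range(n_mutations):
        if not channels:
            break
        context, alt = rng.choices(channels, weights=weights, k=1)[0]
        free = [i for i in sites[context] if i not in used]
        if not free:
            continue  # every site for this channel is already mutated
        pos = rng.choice(free)
        seq[pos] = alt
        used.add(pos)
        catalog.append((pos, context, alt))
    return "".join(seq), catalog
```

Calling `simulate_sbs("ATCATTCTAACGTTCAG", TOY_SIGNATURE, 3)` returns a mutated sequence of the same length together with the catalog of substitutions that were applied, which is the kind of paired output (sequence plus catalog) the abstract describes.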

Mutational signature learning with supervised negative binomial non-negative matrix factorization

Bioinformatics ◽

10.1093/bioinformatics/btaa473 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i154-i160 ◽

Cited By ~ 1

Author(s):

Xinrui Lyu ◽

Jean Garret ◽

Gunnar Rätsch ◽

Kjong-Van Lehmann

Keyword(s):

Matrix Factorization ◽

Negative Binomial ◽

Extraction Methods ◽

Supplementary Information ◽

Cancer Type ◽

Mutational Signatures ◽

Signature Extraction ◽

Mutational Signature ◽

Mutational Processes ◽

Non Negative Matrix Factorization

Abstract Motivation Understanding the underlying mutational processes of cancer patients has been a long-standing goal in the community and promises to provide new insights that could improve cancer diagnoses and treatments. Mutational signatures are summaries of the mutational processes, and improving the derivation of mutational signatures can yield new discoveries previously obscured by technical and biological confounders. Results from existing mutational signature extraction methods depend on the size of the available patient cohort and focus solely on the analysis of mutation count data, without exploiting metadata. Results Here we present a supervised method that utilizes cancer type as metadata to extract more distinctive signatures. More specifically, we use a negative binomial non-negative matrix factorization and add a support vector machine loss. We show that mutational signatures extracted by our proposed method have a lower reconstruction error and are designed to be more predictive of cancer type than those generated by unsupervised methods. Unlike existing unsupervised signature extraction methods, this design reduces the need for elaborate post-processing strategies to recover most of the known signatures. Signatures extracted by a supervised model used in conjunction with cancer-type labels are also more robust, especially when using small and potentially cancer-type limited patient cohorts. Finally, we adapted our model such that molecular features can be utilized to derive a corresponding mutational signature. We used APOBEC expression and MUTYH mutation status to demonstrate the possibilities that arise from this ability. We conclude that our method, which exploits available metadata, improves the quality of mutational signatures and helps derive more interpretable representations. Availability and implementation https://github.com/ratschlab/SNBNMF-mutsig-public.
Supplementary information Supplementary data are available at Bioinformatics online.
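
For orientation, the unsupervised baseline that this method extends can be sketched as plain non-negative matrix factorization. The sketch below is not the authors' model: it uses a squared-error loss with the classic multiplicative updates, whereas the abstract describes a negative binomial likelihood combined with a support vector machine loss. The function names `matmul`, `transpose`, `nmf`, and `reconstruction_error` are illustrative.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (channels x samples) as w @ h.

    Columns of w play the role of signatures and h holds their
    per-sample exposures.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps)
              for j in range(n_cols)] for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps)
              for k in range(rank)] for i in range(n_rows)]
    return w, h


def reconstruction_error(v, w, h):
    """Sum of squared differences between v and w @ h."""
    wh = matmul(w, h)
    return sum((a - b) ** 2
               for row_v, row_wh in zip(v, wh)
               for a, b in zip(row_v, row_wh))
```

A supervised variant in the spirit of the abstract would add a classification loss on the exposures `h` to the objective. That term is omitted here because the abstract does not specify how it enters the updates.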

Machine Learning Classification and Regression Approaches for Optical Network Traffic Prediction

Electronics ◽

10.3390/electronics10131578 ◽

2021 ◽

Vol 10 (13) ◽

pp. 1578

Author(s):

Daniel Szostak ◽

Adam Włodarczyk ◽

Krzysztof Walkowiak

Keyword(s):

Machine Learning ◽

Optical Networks ◽

Network Traffic ◽

Optical Network ◽

Optimization Methods ◽

Supervised Machine Learning ◽

Traffic Prediction ◽

Machine Learning Classification ◽

Classification And Regression ◽

Network Technologies

The rapid growth of network traffic drives the need for new network technologies. Artificial intelligence provides suitable tools to improve currently used network optimization methods. In this paper, we propose a procedure for network traffic prediction. Based on the characteristics of optical networks (and other network technologies), we focus on the prediction of fixed bitrate levels called traffic levels. We develop and evaluate two approaches based on different supervised machine learning (ML) methods: classification and regression. We examine four different ML models with various selected features. The tested datasets are based on real traffic patterns provided by the Seattle Internet Exchange Point (SIX). The results are analyzed using a new quality metric, which allows researchers to find the best forecasting algorithm in terms of network resource usage and operational costs. Our research shows that regression provides better results than classification for all analyzed datasets. Additionally, the final choice of the most appropriate ML algorithm and model should depend on the network operator's expectations.

Source allocation of per- and polyfluoroalkyl substances (PFAS) with supervised machine learning: Classification performance and the role of feature selection in an expanded dataset

Chemosphere ◽

10.1016/j.chemosphere.2021.130124 ◽

2021 ◽

Vol 275 ◽

pp. 130124

Author(s):

Tohren C.G. Kibbey ◽

Rafal Jabrzemski ◽

Denis M. O’Carroll

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Classification Performance ◽

Supervised Machine Learning ◽

Machine Learning Classification ◽

Polyfluoroalkyl Substances ◽

Source Allocation

Evaluating disaster-related tweet credibility using content-based and user-based features

Information Discovery and Delivery ◽

10.1108/idd-04-2020-0044 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Nasser Assery ◽

Yuan (Dorothy) Xiaohong ◽

Qu Xiuli ◽

Roy Kaushik ◽

Sultan Almalki

Keyword(s):

Machine Learning ◽

Unsupervised Learning ◽

Supervised Learning ◽

Emergency Response ◽

Learning Model ◽

Performance Comparison ◽

Supervised Machine Learning ◽

Learning Methods ◽

Content Type ◽

Machine Learning Classification

Purpose This study aims to propose an unsupervised learning model to evaluate the credibility of disaster-related Twitter data and present a performance comparison with commonly used supervised machine learning models. Design/methodology/approach First, historical tweets on two recent hurricane events are collected via the Twitter API. Then a credibility scoring system is implemented in which the tweet features are analyzed to give a credibility score and credibility label to the tweet. After that, supervised machine learning classification is implemented using various classification algorithms and their performances are compared. Findings The proposed unsupervised learning model could enhance the emergency response by providing a fast way to determine the credibility of disaster-related tweets. Additionally, the comparison of the supervised classification models reveals that the Random Forest classifier performs significantly better than the SVM and Logistic Regression classifiers in classifying the credibility of disaster-related tweets. Originality/value In this paper, an unsupervised 10-point scoring model is proposed to evaluate the tweets’ credibility based on the user-based and content-based features. This technique could be used to evaluate the credibility of disaster-related tweets on future hurricanes and would have the potential to enhance emergency response during critical events. The comparative study of different supervised learning methods has revealed effective supervised learning methods for evaluating the credibility of Twitter data.

De Novo Mutational Signature Discovery in Tumor Genomes using SparseSignatures

10.1101/384834 ◽

2018 ◽

Cited By ~ 5

Author(s):

Avantika Lal ◽

Keli Liu ◽

Robert Tibshirani ◽

Arend Sidow ◽

Daniele Ramazzotti

Keyword(s):

Cross Validation ◽

De Novo ◽

State Of The Art ◽

Point Mutations ◽

Simulated Data ◽

Large Datasets ◽

Genome Sequences ◽

Mutational Signatures ◽

Mutational Signature ◽

Current State

Abstract Cancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or “mutational signatures”. Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates DNA replication error as a background, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets. We show that SparseSignatures outperforms current state-of-the-art methods on simulated data using standard metrics. We then apply SparseSignatures to whole genome sequences of 147 tumors from pancreatic cancer, discovering 8 signatures in addition to the background.

Learning mutational signatures and their multidimensional genomic properties with TensorSignatures

10.1101/850453 ◽

2019 ◽

Cited By ~ 3

Author(s):

Harald Vöhringer ◽

Arne van Hoeck ◽

Edwin Cuppen ◽

Moritz Gerstung

Keyword(s):

Somatic Hypermutation ◽

Translesion Synthesis ◽

Strand Break ◽

Signature Analysis ◽

Transcription Start ◽

Mutational Signatures ◽

Transcription Start Sites ◽

Mutational Signature ◽

Cancer Genomes ◽

Cancer Types

Abstract Mutational signature analysis is an essential part of the cancer genome analysis toolkit. Conventionally, mutational signature analysis extracts patterns of different mutation types across many cancer genomes. Here we present TensorSignatures, an algorithm to learn mutational signatures jointly across all variant categories and their genomic context. The analysis of 2,778 primary and 3,824 metastatic cancer genomes of the PCAWG consortium and the HMF cohort shows that practically all signatures operate dynamically in response to various genomic and epigenomic states. The analysis pins differential spectra of UV mutagenesis found in active and inactive chromatin to global genome nucleotide excision repair. TensorSignatures accurately characterises transcription-associated mutagenesis, which is detected in 7 different cancer types. The analysis also unmasks replication- and double strand break repair-driven APOBEC mutagenesis, which manifests with differential numbers and length of mutation clusters indicating a differential processivity of the two triggers. As a fourth example, TensorSignatures detects a signature of somatic hypermutation generating highly clustered variants around the transcription start sites of active genes in lymphoid leukaemia, distinct from a more general and less clustered signature of Polη-driven translesion synthesis found in a broad range of cancer types. Key findings Simultaneous inference of mutational signatures across mutation types and genomic features refines signature spectra and defines their genomic determinants. Analysis of 6,602 cancer genomes reveals pervasive intra-genomic variation of mutational processes. Distinct mutational signatures found in quiescent and active regions of the genome reveal differential repair and mutagenicity of UV- and tobacco-induced DNA damage. APOBEC mutagenesis produces two signatures reflecting highly clustered, double strand break repair-initiated and lowly clustered replication-driven mutagenesis, respectively. Somatic hypermutation in lymphoid cancers produces a strongly clustered mutational signature localised to transcription start sites, which is distinct from a weakly clustered translesion synthesis signature found in multiple tumour types.

Prediction of Compound-Protein Interactions with Machine Learning Methods

Chemoinformatics and Advanced Machine Learning Perspectives ◽

10.4018/978-1-61520-911-8.ch016 ◽

2011 ◽

pp. 304-317

Author(s):

Yoshihiro Yamanishi ◽

Hisashi Kashima

Keyword(s):

Machine Learning ◽

Protein Interactions ◽

Chemical Structure ◽

Genomic Sequence ◽

Sequence Data ◽

Binary Classification ◽

Biological Data ◽

Supervised Machine Learning ◽

Learning Methods ◽

Machine Learning Methods

In silico prediction of compound-protein interactions from heterogeneous biological data is critical in the process of drug development. In this chapter the authors review several supervised machine learning methods to predict unknown compound-protein interactions from chemical structure and genomic sequence information simultaneously. The authors review several kernel-based algorithms from two different viewpoints: binary classification and dimension reduction. In the results, they demonstrate the usefulness of the methods on the prediction of drug-target interactions and ligand-protein interactions from chemical structure data and genomic sequence data.

Supervised Machine Learning Classification of Human Sperm Head Based on Morphological Features

10.1007/978-3-030-75945-2_9 ◽

2021 ◽

pp. 177-191

Author(s):

Natalia V. Revollo ◽

G. Noelia Revollo Sarmiento ◽

Claudio Delrieux ◽

Marcela Herrera ◽

Rolando González-José

Keyword(s):

Machine Learning ◽

Human Sperm ◽

Sperm Head ◽

Morphological Features ◽

Supervised Machine Learning ◽

Machine Learning Classification

Studying 3D genome evolution using genomic sequence

Bioinformatics ◽

10.1093/bioinformatics/btz775 ◽

2019 ◽

Author(s):

Raphaël Mourad

Keyword(s):

Genome Evolution ◽

High Throughput Sequencing ◽

Genomic Sequence ◽

Regulation Of Gene Expression ◽

Replication Timing ◽

Three Dimensions ◽

Supplementary Information ◽

3D Genome ◽

Chromatin Looping ◽

Novel Approach

Abstract Motivation The three-dimensional (3D) genome is essential to numerous key processes such as the regulation of gene expression and the replication-timing program. In vertebrates, chromatin looping is often mediated by CTCF, and marked by CTCF motif pairs in convergent orientation. Comparative high-throughput sequencing technique (Hi-C) recently revealed that chromatin looping evolves across species. However, Hi-C experiments are complex and costly, which currently limits their use for evolutionary studies over a large number of species. Results Here, we propose a novel approach to study the 3D genome evolution in vertebrates using the genomic sequence only, i.e. without the need for Hi-C data. The approach is simple and relies on comparing the distances between convergent and divergent CTCF motifs by computing a ratio we named the 3D ratio or ‘3DR’. We show that 3DR is a powerful statistic to detect CTCF looping encoded in the human genome sequence, thus reflecting strong evolutionary constraints encoded in DNA and associated with the 3D genome. When comparing vertebrate genomes, our results reveal that 3DR, which underlies CTCF looping and topologically associating domain organization, evolves over time and suggest that ancestral character reconstruction can be used to infer 3DR in ancestral genomes. Availability and implementation The R code is available at https://github.com/morphos30/PhyloCTCFLooping. Supplementary information Supplementary data are available at Bioinformatics online.
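
The comparison of convergent and divergent motif distances can be sketched as follows. The abstract does not give the exact definition of 3DR, so this sketch assumes one plausible form: the median distance between consecutive convergent motif pairs divided by the median distance between consecutive divergent pairs. The function name `three_d_ratio` and the `(position, strand)` input format are hypothetical, and the published R code may compute the statistic differently.

```python
from statistics import median


def three_d_ratio(motifs):
    """Ratio of median convergent to median divergent motif distances.

    motifs: iterable of (position, strand) tuples on one chromosome,
    with strand given as "+" or "-".
    """
    ordered = sorted(motifs)
    convergent, divergent = [], []
    # Walk over consecutive motif pairs along the chromosome.
    for (pos_a, strand_a), (pos_b, strand_b) in zip(ordered, ordered[1:]):
        distance = pos_b - pos_a
        if strand_a == "+" and strand_b == "-":
            convergent.append(distance)  # motifs face each other
        elif strand_a == "-" and strand_b == "+":
            divergent.append(distance)   # motifs face away from each other
    if not convergent or not divergent:
        raise ValueError("need at least one convergent and one divergent pair")
    return median(convergent) / median(divergent)
```

Under this assumed definition, a ratio well above 1 means that convergent pairs tend to span longer distances than divergent pairs, which is the pattern expected if convergent motifs anchor loops.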

The software for interactive evaluation of mass spectra stability and reproducibility

Bioinformatics ◽

10.1093/bioinformatics/btaa1072 ◽

2020 ◽

Author(s):

E S Zhvansky ◽

A A Sorokin ◽

D S Bormotov ◽

K V Bocharov ◽

D S Zavorotnyuk ◽

...

Keyword(s):

Outlier Detection ◽

Mass Spectra ◽

Variance Estimation ◽

Ion Sources ◽

Supplementary Information ◽

Operator Experience ◽

Machine Learning Classification ◽

Other Information ◽

The Cost

Abstract Summary Mass spectrometry (MS) methods are widely used for the analysis of biological and medical samples. Recently developed methods, such as DESI, REIMS and NESI, allow fast analyses without sample preparation at the cost of higher variability of spectra. In biology and medicine, MS profiles are often used with machine learning (classification, regression, etc.) algorithms and statistical analysis, which are sensitive to outliers and intraclass variability. Here, we present spectra similarity matrix (SSM) Display software, a tool for fast visual outlier detection and variance estimation in mass spectrometric profiles. The tool speeds up the process of manual spectra inspection, improves accuracy and explainability of outlier detection, and lowers the operator experience required. It was shown that the batch effect could be revealed through SSM analysis and that the SSM calculation can also be used for tuning novel ion sources with respect to the quality of the obtained mass spectra. Availability and implementation Source code, example datasets, binaries and other information are available at https://github.com/EvgenyZhvansky/R_matrix. Supplementary information Supplementary data are available at Bioinformatics online.
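
A spectra similarity matrix of the kind described can be sketched in a few lines. The abstract does not state which similarity measure the software uses, so cosine similarity is assumed here, and the outlier rule (lowest mean similarity to the other spectra) is likewise an illustrative choice. The function names `cosine`, `similarity_matrix`, and `least_similar` are hypothetical and not taken from the tool.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two intensity vectors on the same m/z grid."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_matrix(spectra):
    """Pairwise similarity of all spectra (a symmetric square matrix)."""
    return [[cosine(a, b) for b in spectra] for a in spectra]


def least_similar(matrix):
    """Index of the spectrum with the lowest mean similarity to the others."""
    n = len(matrix)
    means = [(sum(row) - row[i]) / (n - 1) for i, row in enumerate(matrix)]
    return min(range(n), key=means.__getitem__)
```

Displayed as a heat map, such a matrix shows an outlying spectrum as a row and column of low values, and a batch effect as block structure along the diagonal.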
