Journal of Bioinformatics and Computational Biology

Optimized splitting of mixed-species RNA sequencing data

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720022500019 ◽

2022 ◽

Author(s):

Xuan Song ◽

Hai Yun Gao ◽

Karl Herrup ◽

Ronald P. Hart

Keyword(s):

Rna Sequencing ◽

Traditional Approach ◽

Optimal Strategies ◽

Error Rates ◽

Sequencing Data ◽

Mixed Species ◽

Transcript Quantification ◽

Gene Expression Studies ◽

Reference Index ◽

Human And Mouse

Gene expression studies using xenograft transplants or co-culture systems, usually with mixed human and mouse cells, have proven valuable for uncovering cellular dynamics during development or in disease models. However, mRNA sequence similarity among species presents a challenge for accurate transcript quantification. To identify optimal strategies for analyzing mixed-species RNA sequencing data, we evaluate both alignment-dependent and alignment-independent methods. Alignment of reads to a pooled reference index is effective, particularly if optimal alignments are used to classify sequencing reads by species, which are then re-aligned to the individual genomes, generating [Formula: see text] accuracy across a range of species ratios. Alignment-independent methods, such as convolutional neural networks, which extract the conserved patterns of sequences from two species, classify RNA sequencing reads with over 85% accuracy. Importantly, both methods perform well with different ratios of human and mouse reads. While non-alignment strategies successfully partitioned reads by species, the more traditional approach of mixed-genome alignment followed by optimized separation of reads proved more successful, with lower error rates.
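
As a rough illustration of the classification step described in this abstract, the sketch below assigns each read to a species according to its best-scoring alignment against a pooled reference. The input layout (tuples of read ID, species, score), the name `classify_reads`, and the handling of ties are assumptions made for illustration; they are not taken from the authors' pipeline.

```python
from collections import defaultdict

def classify_reads(alignments):
    """Assign each read to the species of its best-scoring alignment.

    `alignments` is an iterable of (read_id, species, score) tuples, a
    hypothetical flattened view of hits against a pooled human+mouse index.
    Reads whose best score is tied between species are labeled 'ambiguous'.
    """
    # Keep the best score seen per read and per species.
    best = defaultdict(dict)
    for read_id, species, score in alignments:
        prev = best[read_id].get(species)
        if prev is None or score > prev:
            best[read_id][species] = score

    calls = {}
    for read_id, per_species in best.items():
        top = max(per_species.values())
        winners = [s for s, sc in per_species.items() if sc == top]
        calls[read_id] = winners[0] if len(winners) == 1 else "ambiguous"
    return calls

# Toy input: r1 aligns better to human, r2 only to mouse, r3 is tied.
hits = [
    ("r1", "human", 98), ("r1", "mouse", 90),
    ("r2", "mouse", 100),
    ("r3", "human", 95), ("r3", "mouse", 95),
]
calls = classify_reads(hits)
```

In a workflow of the kind the abstract describes, the reads in each species bin would then be re-aligned to that species' genome alone; how ambiguous reads are treated is a design choice this sketch leaves open.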

EdClust: A heuristic sequence clustering method with higher sensitivity

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720021500360 ◽

2021 ◽

Author(s):

Ming Cao ◽

Qinke Peng ◽

Ze-Gang Wei ◽

Fei Liu ◽

Yi-Fan Hou

Keyword(s):

Large Scale ◽

Sequence Data ◽

Clustering Algorithms ◽

Clustering Methods ◽

Sequencing Data ◽

Clustering Method ◽

Cluster Number ◽

Sequence Clustering ◽

Downstream Analysis ◽

Heuristic Clustering

The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) based on Edlib, a C/C++ library for fast, exact semi-global sequence alignment, to group similar sequences. The new method edClust was tested on three large-scale sequence databases, and we compared edClust to several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust produces fewer clusters than the other methods and that its SS is higher. The source code of edClust is available from https://github.com/zhang134/EdClust.git under the GNU GPL license.
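
The general idea of heuristic (greedy incremental) sequence clustering can be sketched as follows. This is a minimal illustration, not the edClust implementation: a plain Levenshtein distance stands in for Edlib's semi-global alignment, and the function names, the identity formula, and the `min_identity` threshold are assumptions chosen for the example.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (a stand-in for an Edlib alignment call)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def greedy_cluster(seqs, min_identity=0.9):
    """Greedy incremental clustering; longer sequences become seeds first."""
    clusters = []  # list of (seed, members) pairs
    for seq in sorted(seqs, key=len, reverse=True):
        for seed, members in clusters:
            identity = 1.0 - edit_distance(seq, seed) / max(len(seq), len(seed))
            if identity >= min_identity:
                members.append(seq)  # join the first seed that is close enough
                break
        else:
            clusters.append((seq, [seq]))  # no seed matched: start a new cluster
    return clusters

seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]
clusters = greedy_cluster(seqs, min_identity=0.8)
```

Because each sequence is compared only against existing seeds, the cost depends on the number of clusters rather than on all pairs, which is the usual reason heuristic methods scale to large data sets.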

Clinical drug response prediction from preclinical cancer cell lines by logistic matrix factorization approach

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720021500359 ◽

2021 ◽

Author(s):

Akram Emdadi ◽

Changiz Eslahchi

Keyword(s):

Cell Line ◽

Cell Lines ◽

Cancer Cell ◽

Matrix Factorization ◽

Drug Response ◽

Drug Sensitivity ◽

Cancer Cell Lines ◽

Tissue Type ◽

Factorization Approach ◽

Clinical Drug

Predicting tumor drug response using cancer cell line drug response values for a large number of anti-cancer drugs is a significant challenge in personalized medicine. Predicting patient response to drugs from data obtained from preclinical models is made easier by the availability of diverse information about cell lines and drugs. This paper proposes the TCLMF method, a model for predicting drug response in tumor samples that is trained on preclinical samples and based on the logistic matrix factorization approach. The TCLMF model uses gene expression profiles, tissue type information, the chemical structure of drugs and drug sensitivity (IC50) data from cancer cell lines. We use preclinical data from the Genomics of Drug Sensitivity in Cancer (GDSC) dataset to train the proposed drug response model, which we then use to predict drug sensitivity of samples from The Cancer Genome Atlas (TCGA) dataset. The TCLMF approach focuses on identifying informative features of cell lines and drugs in order to calculate the probability that a tumor sample is sensitive to a drug. In this study, the closest cell line neighbours of each tumor sample are determined using a similarity measure between tumor samples and cell lines. The drug response for a new tumor is then calculated by averaging the low-rank features obtained from its neighboring cell lines. We compare the results of the TCLMF model with those of previously proposed methods using two databases and two approaches to test the model's performance. In the first approach, 12 drugs with enough known clinical drug response, considered in previous methods, are studied. For 7 of the 12 drugs, TCLMF can significantly distinguish between patients who are resistant to these drugs and patients who are sensitive to them. In the second approach, the methods are converted to classification models using a threshold, and the results are compared. The results demonstrate that the TCLMF method provides accurate predictions relative to the results of the other algorithms. Finally, we accurately classify tumor tissue type using the latent vectors obtained from TCLMF's logistic matrix factorization process. These findings demonstrate that the TCLMF approach produces effective latent vectors for tumor samples. The source code of the TCLMF method is available at https://github.com/emdadi/TCLMF.
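
The two steps named in this abstract, averaging the latent vectors of neighbouring cell lines and turning a latent score into a probability, can be sketched in generic form. This is a minimal sketch of plain logistic matrix factorization scoring, assuming the probability is the sigmoid of a dot product; the actual TCLMF model may include further terms, and all names and toy vectors below are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def average_latent(neighbour_vectors):
    """Average the latent vectors of a tumor's nearest cell-line neighbours."""
    n = len(neighbour_vectors)
    dim = len(neighbour_vectors[0])
    return [sum(v[k] for v in neighbour_vectors) / n for k in range(dim)]

def sensitivity_probability(sample_vec, drug_vec):
    """Generic logistic matrix factorization score: sigmoid of the dot product."""
    return sigmoid(sum(s * d for s, d in zip(sample_vec, drug_vec)))

# Toy latent vectors (made up for illustration, not learned from data).
neighbours = [[1.0, 0.0], [0.0, 1.0]]
tumor_vec = average_latent(neighbours)
drug_vec = [2.0, -2.0]
p = sensitivity_probability(tumor_vec, drug_vec)
```

A threshold on `p` would turn this score into a sensitive/resistant call, which corresponds to the classification setting the abstract mentions.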

Involving repetitive regions in scaffolding improvement

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720021400163 ◽

2021 ◽

Author(s):

Quentin Delorme ◽

Rémy Costa ◽

Yasmine Mansour ◽

Anna-Sophie Fiston-Lavier ◽

Annie Chateau

Keyword(s):

Transposable Element ◽

Repeat Element ◽

Assembly Process ◽

Repeat Elements ◽

Before And After ◽

Genome Assemblies

In this paper, we investigate through a preliminary study the influence of repeat elements during the assembly process. We analyze the link between the presence and nature of one type of repeat element, the transposable element (TE), and misassembly events in genome assemblies. We propose to improve assemblies by taking into account the presence of repeat elements, including TEs, during the scaffolding step. We analyze the results and relate the misassemblies to TEs before and after correction.

Introduction to the Special Issue of the 18th Annual International RECOMB Satellite Workshop on Comparative Genomics

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720021020030 ◽

2021 ◽

Author(s):

Rohan B. H. Williams ◽

Louxin Zhang

Keyword(s):

Comparative Genomics ◽

Special Issue

DNN-Boost: Somatic mutation identification of tumor-only whole-exome sequencing data using deep neural network and XGBoost

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720021400175 ◽

2021 ◽

Author(s):

Firda Aminy Maruf ◽

Rian Pratama ◽

Giltae Song

Keyword(s):

Neural Network ◽

Exome Sequencing ◽

Whole Exome Sequencing ◽

Somatic Mutation ◽

Deep Neural Network ◽

Somatic Mutations ◽

Sequencing Data ◽

Exome Sequencing Data ◽

Whole Exome ◽

Whole Exome Sequencing Data

Detection of somatic mutation in whole-exome sequencing data can help elucidate the mechanism of tumor progression. Most computational approaches require exome sequencing for both tumor and normal samples. However, it is more common to sequence exomes for tumor samples only without the paired normal samples. To include these types of data for extensive studies on the process of tumorigenesis, it is necessary to develop an approach for identifying somatic mutations using tumor exome sequencing data only. In this study, we designed a machine learning approach using Deep Neural Network (DNN) and XGBoost to identify somatic mutations in tumor-only exome sequencing data and we integrated this into a pipeline called DNN-Boost. The XGBoost algorithm is used to extract the features from the results of variant callers and these features are then fed into the DNN model as input. The XGBoost algorithm resolves issues of missing values and overfitting. We evaluated our proposed model and compared its performance with other existing benchmark methods. We noted that the DNN-Boost classification model outperformed the benchmark method in classifying somatic mutations from paired tumor-normal exome data and tumor-only exome data.

Comparing the topology of phylogenetic network generators

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720021400126 ◽

2021 ◽

Author(s):

Remie Janssen ◽

Pengyu Liu

Keyword(s):

Bayesian Methods ◽

Phylogenetic Trees ◽

Evolutionary History ◽

Phylogenetic Network ◽

Random Trees ◽

Phylogenetic Networks ◽

Phylogeny Reconstruction ◽

Evolutionary Analysis ◽

Summary Statistics ◽

History Of

Phylogenetic networks represent evolutionary history of species and can record natural reticulate evolutionary processes such as horizontal gene transfer and gene recombination. This makes phylogenetic networks a more comprehensive representation of evolutionary history compared to phylogenetic trees. Stochastic processes for generating random trees or networks are important tools in evolutionary analysis, especially in phylogeny reconstruction where they can be utilized for validation or serve as priors for Bayesian methods. However, as more network generators are developed, there is a lack of discussion or comparison for different generators. To bridge this gap, we compare a set of phylogenetic network generators by profiling topological summary statistics of the generated networks over the number of reticulations and comparing the topological profiles.
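
To make the notion of a topological summary statistic concrete, the sketch below computes two simple ones for a network given as a list of directed edges: the reticulation number and the number of leaves. The edge-list representation and the function names are assumptions for illustration, and these statistics are common examples rather than the specific set profiled in the paper.

```python
from collections import Counter

def reticulation_number(edges):
    """Sum of (in-degree - 1) over all nodes that have more than one parent."""
    indegree = Counter(child for _, child in edges)
    return sum(d - 1 for d in indegree.values() if d > 1)

def count_leaves(edges):
    """Leaves are nodes that appear as a child but never as a parent."""
    parents = {p for p, _ in edges}
    children = {c for _, c in edges}
    return len(children - parents)

# Toy network: node "h" has two parents ("a" and "b"), so it is a reticulation.
edges = [
    ("root", "a"), ("root", "b"),
    ("a", "h"), ("b", "h"),
    ("a", "x"), ("b", "y"), ("h", "z"),
]
r = reticulation_number(edges)
leaves = count_leaves(edges)
```

Profiling a generator would then amount to computing such statistics over many sampled networks and grouping them by reticulation number.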

Author Index Volume 19 (2021)

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720021990018 ◽

2021 ◽

Vol 19 (06) ◽

Keyword(s):

Index Volume

A new Bayesian approach for QTL mapping of family data

Journal of Bioinformatics and Computational Biology ◽

10.1142/s021972002150030x ◽

2021 ◽

Author(s):

Daiane Aparecida Zuanetti ◽

Luis Aparecido Milan

Keyword(s):

Qtl Mapping ◽

Random Effects ◽

Bayesian Approach ◽

Variance Component ◽

Mendelian Inheritance ◽

Family Data ◽

Data Sets ◽

Mendelian Segregation ◽

Gaw17 Data ◽

The Bayesian Approach

In this paper, we propose a new Bayesian approach for QTL mapping of family data. The main purpose is to model a phenotype as a function of QTLs' effects. The model considers the detailed familial dependence and does not rely on random effects. It combines the probability for Mendelian inheritance of parents' genotypes and the correlation between flanking markers and QTLs. This is an advance compared with models that use only Mendelian segregation or only the correlation between markers and QTLs to estimate transmission probabilities. We use the Bayesian approach to estimate the number of QTLs, their locations and the additive and dominance effects. We compare the performance of the proposed method with variance component and LASSO models using simulated and GAW17 data sets. Under the tested conditions, the proposed method outperforms the other methods in aspects such as estimating the number of QTLs, the accuracy of the QTLs' positions and the estimates of their effects. The results of the application of the proposed method to data sets exceeded all of our expectations.
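
The Mendelian inheritance component mentioned in this abstract can be illustrated with a short calculation of offspring genotype probabilities at a single biallelic locus. This is a sketch of textbook Mendelian segregation only; the paper's model additionally uses the correlation with flanking markers, which is not shown here, and the function name and genotype encoding are assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def offspring_genotype_probs(mother, father):
    """Genotype probabilities for a child under Mendelian segregation.

    Each parent is a 2-tuple of alleles, e.g. ("A", "a"); each parental
    allele is transmitted with probability 1/2. Genotypes are unordered,
    so ("a", "A") and ("A", "a") are counted as the same genotype.
    """
    counts = Counter(tuple(sorted(pair)) for pair in product(mother, father))
    total = sum(counts.values())
    return {genotype: Fraction(c, total) for genotype, c in counts.items()}

# Cross of two heterozygous parents.
probs = offspring_genotype_probs(("A", "a"), ("A", "a"))
```

For two heterozygous parents this gives the familiar 1:2:1 ratio of genotypes.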

The monoploid chromosome complement of reconstructed ancestral genomes in a phylogeny

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720021400084 ◽

2021 ◽

Author(s):

Qiaoji Xu ◽

Xiaomeng Zhang ◽

Yue Zhang ◽

Chunfang Zheng ◽

James H. Leebens-Mack ◽

...

Keyword(s):

Woody Plants ◽

Chromosome Complement ◽

Gene Content ◽

Number Formula ◽

Complete Set

Using RACCROCHE, a method for reconstructing the gene content and order of ancestral chromosomes from a phylogeny of extant genomes represented by the gene orders on their chromosomes, we study the evolution of three orders of woody plants. The method retrieves the monoploid complement of each ancestor in a phylogeny, consisting of a complete set of distinct chromosomes, despite some of the extant genomes being recently or historically polyploidized. The three orders are the Sapindales, the Fagales and the Malvales. All of these are independently estimated to have ancestral monoploid number [Formula: see text].

Published By World Scientific
