Allowing mutations in maximal matches boosts genome compression performance

2020 ◽  
Vol 36 (18) ◽  
pp. 4675-4681 ◽  
Author(s):  
Yuansheng Liu ◽  
Limsoon Wong ◽  
Jinyan Li

Abstract Motivation A maximal match between two genomes is a contiguous, non-extendable sub-sequence common to the two genomes. DNA bases frequently mutate from the genome of one individual to another. When a mutation occurs inside a maximal match, it breaks the match into shorter segments. The coding cost of using these broken segments for reference-based genome compression is much higher than that of using a maximal match that is allowed to contain mutations. Results We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. memRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme; it then extends these matches to cover mismatches (mutations) and merges them with neighbouring maximal matches to form long MCMs. Experiments reveal that memRGC boosts compression performance by an average of 27% in reference-based genome compression. memRGC also outperforms the best state-of-the-art methods on all of the benchmark datasets, sometimes by 50%. Moreover, memRGC uses much less memory and fewer decompression resources, while providing comparable compression speed. These advantages are of significant benefit to genome data storage and transmission. Availability and implementation https://github.com/yuansliu/memRGC. Supplementary information Supplementary data are available at Bioinformatics online.
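For intuition, the sketch below shows how a match can be extended across an isolated mismatch so that one reference entry covers the whole region; the helper name, the fixed look-ahead heuristic and the encoding triple are illustrative assumptions, not the memRGC algorithm itself.

```python
# Illustrative sketch (hypothetical helper, not the memRGC implementation):
# walk two genomes in parallel from a seed position and record isolated
# mismatches instead of breaking the match, yielding one mutation-containing
# match (MCM) encodable as (ref_pos, length, mismatch_list).

def extend_with_mutations(ref: str, tgt: str, ref_pos: int, tgt_pos: int, look: int = 12):
    mismatches = []
    i = 0
    while ref_pos + i < len(ref) and tgt_pos + i < len(tgt):
        if ref[ref_pos + i] != tgt[tgt_pos + i]:
            # Toy heuristic: keep going only if the next `look` bases agree again.
            if ref[ref_pos + i + 1:ref_pos + i + 1 + look] != tgt[tgt_pos + i + 1:tgt_pos + i + 1 + look]:
                break
            mismatches.append((i, tgt[tgt_pos + i]))  # offset and substituted base
        i += 1
    return ref_pos, i, mismatches

ref = "ACGTACGTTTGACCAGTACGGA"
tgt = "ACGTACGATTGACCAGTACGGA"  # one substitution at offset 7
print(extend_with_mutations(ref, tgt, 0, 0))  # (0, 22, [(7, 'A')])
```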

2021 ◽  
Author(s):  
Atiq Rehman ◽  
Samir Brahim Belhaouari

Abstract Detection and removal of outliers in a dataset is a fundamental preprocessing task without which the analysis of the data can be misleading. Furthermore, the existence of anomalies in the data can heavily degrade the performance of machine learning algorithms. In order to detect anomalies in a dataset in an unsupervised manner, some novel statistical techniques are proposed in this paper. The proposed techniques are based on statistical methods that consider data compactness and other properties. The newly proposed ideas are found to be efficient in terms of performance, ease of implementation, and computational complexity. Furthermore, two of the proposed techniques use only a one-dimensional distance vector to detect the outliers, so the techniques remain computationally inexpensive and feasible regardless of the dimensionality of the data. A comprehensive performance analysis of the proposed anomaly detection schemes is presented in the paper, and the newly proposed schemes are found to perform better than the state-of-the-art methods when tested on several benchmark datasets.
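A minimal illustration of the one-dimensional distance-vector idea (a generic centroid-distance rule with a robust threshold, not the paper's specific statistics) might look like this:

```python
# Generic illustration: reduce each point to its distance from the centroid,
# then flag outliers with a robust threshold on that single distance vector.
import numpy as np

def distance_vector_outliers(X: np.ndarray, k: float = 3.0) -> np.ndarray:
    d = np.linalg.norm(X - X.mean(axis=0), axis=1)    # one value per point
    med = np.median(d)
    mad = np.median(np.abs(d - med)) + 1e-12          # robust spread estimate
    return d > med + k * mad                          # boolean outlier mask

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 50)), rng.normal(8, 1, (5, 50))])  # 5 planted anomalies
print(distance_vector_outliers(X).nonzero()[0])       # indices of flagged points
```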


Author(s):  
Kexin Huang ◽  
Tianfan Fu ◽  
Lucas M Glass ◽  
Marinka Zitnik ◽  
Cao Xiao ◽  
...  

Abstract Summary Accurate prediction of drug–target interactions (DTI) is crucial for drug discovery. Recently, deep learning (DL) models have shown promising performance for DTI prediction. However, these models can be difficult to use for both computer scientists entering the biomedical field and bioinformaticians with limited DL experience. We present DeepPurpose, a comprehensive and easy-to-use DL library for DTI prediction. DeepPurpose supports training of customized DTI prediction models by implementing 15 compound and protein encoders and over 50 neural architectures, along with providing many other useful features. We demonstrate state-of-the-art performance of DeepPurpose on several benchmark datasets. Availability and implementation https://github.com/kexinhuang12345/DeepPurpose. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
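As a rough picture of the encoder-pair pattern such a library builds on, here is a generic PyTorch sketch with assumed input dimensions; it is not the DeepPurpose API.

```python
# Generic drug encoder + protein encoder + prediction head (illustrative only).
import torch
import torch.nn as nn

class DTIModel(nn.Module):
    def __init__(self, drug_dim=1024, prot_dim=400, hidden=256):
        super().__init__()
        self.drug_enc = nn.Sequential(nn.Linear(drug_dim, hidden), nn.ReLU())  # e.g. fingerprint features
        self.prot_enc = nn.Sequential(nn.Linear(prot_dim, hidden), nn.ReLU())  # e.g. sequence composition features
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, drug_x, prot_x):
        z = torch.cat([self.drug_enc(drug_x), self.prot_enc(prot_x)], dim=-1)
        return self.head(z).squeeze(-1)  # predicted interaction / affinity score

model = DTIModel()
scores = model(torch.randn(4, 1024), torch.randn(4, 400))  # a batch of 4 drug-protein pairs
print(scores.shape)  # torch.Size([4])
```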


Author(s):  
Hang Li ◽  
Haozheng Wang ◽  
Zhenglu Yang ◽  
Haochen Liu

Network representation is the basis of many applications and of extensive interest in various fields, such as information retrieval, social network analysis, and recommendation systems. Most previous methods for network representation consider only part of the problem, for example the link structure or the node information alone, or integrate the two only partially. The present study proposes a deep network representation model that seamlessly integrates the text information and structure of a network. Our model captures highly non-linear relationships between nodes and complex features of a network by exploiting the variational autoencoder (VAE), a deep unsupervised generative algorithm. We also merge the representation learned with a paragraph vector model and that learned with the VAE to obtain a network representation that preserves both structure and text information. We conduct comprehensive empirical experiments on benchmark datasets and find that our model outperforms state-of-the-art techniques by a large margin.
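A bare-bones sketch of the generative component described above, a VAE over a node's adjacency row, is given below; layer sizes and the loss weighting are assumptions, not the authors' architecture.

```python
# Minimal PyTorch VAE over a node's adjacency vector (generic sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NodeVAE(nn.Module):
    def __init__(self, n_nodes, latent=128):
        super().__init__()
        self.enc = nn.Linear(n_nodes, 512)
        self.mu, self.logvar = nn.Linear(512, latent), nn.Linear(512, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 512), nn.ReLU(), nn.Linear(512, n_nodes))

    def forward(self, adj_row):
        h = F.relu(self.enc(adj_row))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)       # reparameterization trick
        recon = self.dec(z)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL term of the ELBO
        bce = F.binary_cross_entropy_with_logits(recon, adj_row, reduction='sum')
        return z, bce + kl                                            # embedding and VAE loss

vae = NodeVAE(n_nodes=1000)
z, loss = vae(torch.bernoulli(torch.full((8, 1000), 0.05)))           # 8 sampled adjacency rows
# The final node representation would concatenate z with a paragraph-vector text embedding.
```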


2021 ◽  
Vol 12 (06) ◽  
pp. 65-76
Author(s):  
Kieran Greer

This paper presents a batch classifier that splits a dataset into tree branches depending on the category type. It improves on an earlier version and fixes a mistake in the earlier paper. Two important changes have been made. The first is to represent each category with a separate classifier. Each classifier then classifies its own subset of data rows, using batched input values to create the centroid, which also represents the category itself. If the classifier contains data from more than one category, however, it needs to create new classifiers for the incorrectly classified data. The second change is therefore to allow the classifier to branch to new layers when there is a split in the data, and to create new classifiers there for the data rows that are incorrectly classified. Each layer can therefore branch like a tree, not to distinguish features but to distinguish categories. The paper then suggests a further innovation, which is to represent some data columns with fixed value ranges, or bands. When considering features, it is shown that some of the data can be classified directly through these fixed value ranges, while the rest must be classified using a classifier technique; this idea allows the paper to discuss a biological analogy with neurons and neuron links. Tests show that the method can successfully classify a diverse set of benchmark datasets to an accuracy better than the state of the art.
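The branching-centroid idea can be caricatured as follows; the class structure, depth limit and routing rule are hypothetical simplifications rather than the paper's algorithm.

```python
# Toy sketch: each node holds one centroid per category and spawns a child
# node for the rows it classifies incorrectly, so layers branch by category.
import numpy as np

class CentroidNode:
    def __init__(self, X, y, depth=0, max_depth=3):
        self.centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
        self.child = None
        wrong = np.array([self._nearest(x) != c for x, c in zip(X, y)])
        if wrong.any() and wrong.sum() < len(y) and depth < max_depth:
            self.child = CentroidNode(X[wrong], y[wrong], depth + 1, max_depth)

    def _nearest(self, x):
        return min(self.centroids, key=lambda c: np.linalg.norm(x - self.centroids[c]))

    def predict(self, x):
        # A fuller version would route to the child only when the parent is unsure.
        return self._nearest(x)

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y = np.array([0, 0, 1, 1])
root = CentroidNode(X, y)
print([root.predict(x) for x in X])  # [0, 0, 1, 1]
```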


2018 ◽  
Vol 35 (14) ◽  
pp. 2498-2500 ◽  
Author(s):  
Ehsaneddin Asgari ◽  
Philipp C Münch ◽  
Till R Lesker ◽  
Alice C McHardy ◽  
Mohammad R K Mofrad

Abstract Summary Identifying distinctive taxa for microbiome-related diseases is considered key to establishing diagnosis and therapy options in precision medicine, and imposes high demands on the accuracy of microbiome analysis techniques. We propose an alignment- and reference-free, subsequence-based 16S rRNA data analysis as a new paradigm for microbiome phenotype and biomarker detection. Our method, called DiTaxa, replaces standard operational taxonomic unit (OTU) clustering by segmenting 16S rRNA reads into the most frequent variable-length subsequences. We compared the performance of DiTaxa to state-of-the-art methods in phenotype and biomarker detection, using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa performed competitively with the k-mer-based state-of-the-art approach in phenotype prediction, while outperforming the OTU-based state-of-the-art approach in finding biomarkers in both resolution and coverage, evaluated over known links from the literature and synthetic benchmark datasets. Availability and implementation DiTaxa is available under the Apache 2 license at http://llp.berkeley.edu/ditaxa. Supplementary information Supplementary data are available at Bioinformatics online.
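The following toy example illustrates frequency-driven segmentation of reads into variable-length subsequences in the spirit of byte-pair encoding; it is an illustration of the general idea, not DiTaxa's implementation.

```python
# Toy byte-pair-style segmentation: repeatedly merge the most frequent adjacent
# pair of segments across all reads to build variable-length subsequences.
from collections import Counter

def learn_merges(reads, n_merges=10):
    seqs = [list(r) for r in reads]                # start from single bases
    merges = []
    for _ in range(n_merges):
        pairs = Counter((s[i], s[i + 1]) for s in seqs for i in range(len(s) - 1))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        for s in seqs:                             # merge every occurrence of the top pair
            i = 0
            while i < len(s) - 1:
                if s[i] == a and s[i + 1] == b:
                    s[i:i + 2] = [a + b]
                i += 1
    return merges, seqs

reads = ["ACGTACGT", "ACGTTACG", "TTACGTAC"]
merges, segmented = learn_merges(reads, n_merges=5)
print(merges)       # learned frequent subsequences
print(segmented)    # each read expressed as variable-length segments
```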


2019 ◽  
Author(s):  
Sebastian Deorowicz

Abstract Motivation The amount of genomic data that needs to be stored is huge. Therefore, it is not surprising that a lot of work has been done in the field of specialized data compression of FASTQ files. The existing algorithms are, however, still imperfect and the best tools produce quite large archives. Results We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on ideas from the well-known prediction by partial matching and dynamic Markov coder algorithms from the world of general-purpose compressors. The compression ratios are often tens of percent better than those offered by the state-of-the-art tools. Availability and Implementation https://github.com/refresh-bio/ Contact [email protected] Supplementary information Supplementary data are available at the publisher’s Web site.
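To make the prediction-by-partial-matching idea concrete, here is a tiny order-k context model over DNA bases; it is a teaching sketch under simplifying assumptions (no escape mechanism, no arithmetic coder) and is unrelated to FQSqueezer's actual implementation.

```python
# Minimal order-k context model: predict the next base from counts of what
# followed the same k-base context in the data seen so far.
from collections import defaultdict, Counter

class ContextModel:
    def __init__(self, order=3):
        self.order = order
        self.counts = defaultdict(Counter)

    def update(self, seq):
        for i in range(self.order, len(seq)):
            self.counts[seq[i - self.order:i]][seq[i]] += 1

    def predict(self, context):
        c = self.counts.get(context[-self.order:])
        if not c:                      # a real PPM coder would "escape" to a shorter context
            return {b: 0.25 for b in "ACGT"}
        total = sum(c.values())
        return {b: c[b] / total for b in "ACGT"}

m = ContextModel(order=3)
m.update("ACGACGACGACGT")
print(m.predict("ACG"))  # mostly 'A', occasionally 'T' at the end
```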


Author(s):  
Shubham Chandak ◽  
Kedar Tatwawadi ◽  
Srivatsan Sridhar ◽  
Tsachy Weissman

Abstract Motivation Nanopore sequencing provides a real-time and portable solution to genomic sequencing, enabling better assembly, structural variant discovery and modified base detection than second-generation technologies. The sequencing process generates a huge amount of data in the form of raw signal contained in fast5 files, which must be compressed to enable efficient storage and transfer. Since the raw data is inherently noisy, lossy compression has the potential to significantly reduce space requirements without adversely impacting the performance of downstream applications. Results We explore the use of lossy compression for nanopore raw data using two state-of-the-art lossy time-series compressors, and evaluate the trade-off between compressed size and basecalling/consensus accuracy. We test several basecallers and consensus tools on a variety of datasets at varying depths of coverage, and conclude that lossy compression can provide a 35–50% further reduction in the compressed size of raw data over the state-of-the-art lossless compressor, with negligible impact on basecalling accuracy (≲0.2% reduction) and consensus accuracy (≲0.002% reduction). In addition, we evaluate the impact of lossy compression on methylation calling accuracy and observe that this impact is minimal for similar reductions in compressed size, although further evaluation with improved benchmark datasets is required to reach a definite conclusion. The results suggest the possibility of using lossy compression, potentially on the nanopore sequencing device itself, to achieve significant reductions in storage and transmission costs while preserving the accuracy of downstream applications. Availability and implementation The code is available at https://github.com/shubhamchandak94/lossy_compression_evaluation. Supplementary information Supplementary data are available at Bioinformatics online.
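The size-versus-fidelity trade-off can be illustrated with a simple uniform quantizer followed by a general-purpose lossless compressor; this is a generic sketch with made-up signal statistics, not the dedicated time-series compressors evaluated in the paper.

```python
# Quantize a noisy raw signal with a maximum-error bound, then compare
# losslessly compressed sizes at different error bounds.
import zlib
import numpy as np

def quantize(signal: np.ndarray, max_error: float) -> np.ndarray:
    step = 2 * max_error                                   # uniform quantizer with |error| <= max_error
    return np.round(signal / step) * step

rng = np.random.default_rng(1)
raw = rng.normal(500, 20, 100_000).astype(np.float32)     # stand-in for nanopore current values
for err in (0.0, 1.0, 4.0):
    data = raw if err == 0 else quantize(raw, err).astype(np.float32)
    size = len(zlib.compress(data.tobytes(), 9))
    print(f"max_error={err:>4}: {size / 1e3:8.1f} kB")     # larger error bound -> smaller archive
```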


2020 ◽  
Vol 12 (12) ◽  
pp. 2010 ◽  
Author(s):  
Seyd Teymoor Seydi ◽  
Mahdi Hasanlou ◽  
Meisam Amani

The diversity of change detection (CD) methods and the limitations in generalizing these techniques across different types of remote sensing datasets and study areas have been a challenge for CD applications. Additionally, most CD methods have been implemented in two intensive and time-consuming steps: (a) predicting change areas, and (b) deciding on the predicted areas. In this study, a novel CD framework based on the convolutional neural network (CNN) is proposed not only to address the aforementioned problems but also to considerably improve accuracy. The proposed CNN-based CD network contains three parallel channels: the first and second channels extract deep features from the original first- and second-time imagery, respectively, and the third channel focuses on extracting change deep features by differencing and stacking deep features. Additionally, each channel includes three types of convolution kernels: 1D-, 2D- and 3D-dilated convolutions. The effectiveness and reliability of the proposed CD method are evaluated using three different types of remote sensing benchmark datasets (i.e. multispectral, hyperspectral, and Polarimetric Synthetic Aperture Radar (PolSAR)). The resulting CD maps are evaluated both visually and statistically by calculating nine different accuracy indices. Moreover, the results of the CD using the proposed method are compared to those of several state-of-the-art CD algorithms. All the results show that the proposed method outperforms the other remote sensing CD techniques. For instance, across the different scenarios, the Overall Accuracies (OAs) and Kappa Coefficients (KCs) of the proposed CD method are better than 95.89% and 0.805, respectively, and the Miss Detection (MD) and False Alarm (FA) rates are lower than 12% and 3%, respectively.
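A schematic PyTorch rendering of the three-parallel-channel idea is shown below; the layer widths, patch size and use of 2D dilated convolutions only are assumptions made for brevity, not the authors' network.

```python
# Three branches: one per acquisition date plus one on the stacked/differenced
# pair, each built from dilated convolutions, fused for change / no-change.
import torch
import torch.nn as nn

def branch(in_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=2, dilation=2), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=2, dilation=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

class ThreeChannelCD(nn.Module):
    def __init__(self, bands=6):
        super().__init__()
        self.b1, self.b2 = branch(bands), branch(bands)
        self.b3 = branch(3 * bands)                  # stacked pair plus difference image
        self.cls = nn.Linear(3 * 32, 2)              # change / no-change logits

    def forward(self, t1, t2):
        fused = torch.cat([t1, t2, t2 - t1], dim=1)
        f = torch.cat([self.b1(t1), self.b2(t2), self.b3(fused)], dim=1)
        return self.cls(f)

net = ThreeChannelCD(bands=6)
logits = net(torch.randn(4, 6, 15, 15), torch.randn(4, 6, 15, 15))  # 4 patch pairs
print(logits.shape)  # torch.Size([4, 2])
```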


2020 ◽  
Vol 36 (Supplement_2) ◽  
pp. i659-i667
Author(s):  
Jack Lanchantin ◽  
Yanjun Qi

Abstract Motivation Predictive models of the DNA chromatin profile (i.e. epigenetic state), such as transcription factor binding, are essential for understanding regulatory processes and developing gene therapies. It is known that the 3D genome, or spatial structure of DNA, is highly influential in the chromatin profile. Deep neural networks have achieved state-of-the-art performance on chromatin profile prediction by using short windows of DNA sequences independently. These methods, however, ignore the long-range dependencies when predicting the chromatin profiles, because modeling the 3D genome is challenging. Results In this work, we introduce ChromeGCN, a graph convolutional network for chromatin profile prediction that fuses both local sequence and long-range 3D genome information. By incorporating the 3D genome, we relax the independent and identically distributed assumption of local windows for a better representation of DNA. ChromeGCN explicitly incorporates known long-range interactions into the modeling, allowing us to identify and interpret the important long-range dependencies that influence chromatin profiles. We show experimentally that by fusing sequential and 3D genome data with ChromeGCN, we obtain a significant improvement over state-of-the-art deep learning methods, as indicated by three metrics. Importantly, we show that ChromeGCN is particularly useful for identifying epigenetic effects in those DNA windows that have a high degree of interactions with other DNA windows. Availability and implementation https://github.com/QData/ChromeGCN. Supplementary information Supplementary data are available at Bioinformatics online.
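Conceptually, the fusion step resembles a graph-convolution update over per-window sequence embeddings using a contact-derived adjacency matrix; the sketch below uses assumed shapes and a random adjacency stand-in and is not ChromeGCN itself.

```python
# One graph-convolution step over DNA-window embeddings, with neighbourhoods
# defined by long-range 3D-genome contacts (e.g. Hi-C derived).
import torch
import torch.nn as nn

class WindowGCNLayer(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, H, A):
        # H: (n_windows, dim) embeddings from a local sequence model
        # A: (n_windows, n_windows) contact adjacency matrix
        A_hat = A + torch.eye(A.size(0))                   # add self-loops
        deg = A_hat.sum(dim=1, keepdim=True)
        return torch.relu(self.lin((A_hat / deg) @ H))     # row-normalized neighbourhood averaging

n, d = 1000, 128
H = torch.randn(n, d)                                      # local window representations
A = (torch.rand(n, n) > 0.995).float()                     # sparse stand-in for contacts
A = ((A + A.t()) > 0).float()                              # make it symmetric
H_out = WindowGCNLayer(d)(H, A)
print(H_out.shape)  # torch.Size([1000, 128])
```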



