CpG Island Identification with Higher Order and Variable Order Markov Models

Author(s):  
Zhenqiu Liu ◽  
Dechang Chen ◽  
Xue-wen Chen
2020 ◽  
Vol 36 (14) ◽  
pp. 4130-4136
Author(s):  
David J Burks ◽  
Rajeev K Azad

Abstract. Motivation: Alignment-free, stochastic models derived from k-mer distributions representing reference genome sequences have a rich history in the classification of DNA sequences. In particular, the variants of Markov models have previously been used extensively. Higher-order Markov models have been used with caution, perhaps sparingly, primarily because of the lack of enough training data and computational power. Advances in sequencing technology and computation have enabled exploitation of the predictive power of higher-order models. We, therefore, revisited higher-order Markov models and assessed their performance in classifying metagenomic sequences. Results: Comparative assessment of higher-order models (HOMs, 9th order or higher) with interpolated Markov model, interpolated context model and lower-order models (8th order or lower) was performed on metagenomic datasets constructed using sequenced prokaryotic genomes. Our results show that HOMs outperform other models in classifying metagenomic fragments as short as 100 nt at all taxonomic ranks, and at lower ranks when the fragment size was increased to 250 nt. HOMs were also found to be significantly more accurate than local alignment which is widely relied upon for taxonomic classification of metagenomic sequences. A novel software implementation written in C++ performs classification faster than the existing Markovian metagenomic classifiers and can therefore be used as a standalone classifier or in conjunction with existing taxonomic classifiers for more robust classification of metagenomic sequences. Availability and implementation: The software has been made available at https://github.com/djburks/SMM. Supplementary information: Supplementary data are available at Bioinformatics online.
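The idea of classifying a fragment with fixed-order Markov models can be illustrated with a minimal Python sketch. This is not the SMM software described above. The function names (`train_markov`, `log_likelihood`, `classify`), the additive pseudocount, and the toy training strings are all assumptions made for illustration, and the order used here is far lower than the 9th order or higher discussed in the abstract.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: sequences contain only these four bases


def train_markov(sequences, order):
    """Count (context -> next base) transitions for a fixed-order model."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            counts[context][base] += 1
    return counts


def log_likelihood(seq, counts, order, pseudocount=1.0):
    """Log-probability of seq under the model, with additive smoothing."""
    total = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        ctx_counts = counts.get(context, {})  # unseen context -> empty counts
        num = ctx_counts.get(base, 0) + pseudocount
        den = sum(ctx_counts.values()) + pseudocount * len(ALPHABET)
        total += math.log(num / den)
    return total


def classify(fragment, models, order):
    """Return the name of the model that gives the fragment the highest score."""
    return max(models, key=lambda name: log_likelihood(fragment, models[name], order))


# Toy usage with made-up reference sequences (hypothetical class names).
models = {
    "ac_rich": train_markov(["ACACACACACACACAC"], order=2),
    "gt_rich": train_markov(["GTGTGTGTGTGTGTGT"], order=2),
}
label = classify("ACACACAC", models, order=2)
```

The number of possible contexts grows as 4 to the power of the order, which is why the abstract ties higher-order models to the availability of training data and computing power.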


2019 ◽  
Vol 35 (22) ◽  
pp. 4607-4616
Author(s):  
Fabio Cunial ◽  
Jarno Alanko ◽  
Djamal Belazzougui

Abstract. Motivation: Markov models with contexts of variable length are widely used in bioinformatics for representing sets of sequences with similar biological properties. When models contain many long contexts, existing implementations are either unable to handle genome-scale training datasets within typical memory budgets, or they are optimized for specific model variants and are thus inflexible. Results: We provide practical, versatile representations of variable-order Markov models and of interpolated Markov models, that support a large number of context-selection criteria, scoring functions, probability smoothing methods, and interpolations, and that take up to four times less space than previous implementations based on the suffix array, regardless of the number and length of contexts, and up to ten times less space than previous trie-based representations, or more, while matching the size of related, state-of-the-art data structures from Natural Language Processing. We describe how to further compress our indexes to a quantity related to the redundancy of the training data, saving up to 90% of their space on very repetitive datasets, and making them become up to 60 times smaller than previous implementations based on the suffix array. Finally, we show how to exploit constraints on the length and frequency of contexts to further shrink our compressed indexes to half of their size or more, achieving data structures that are a hundred times smaller than previous implementations based on the suffix array, or more. This allows variable-order Markov models to be used with bigger datasets and with longer contexts on the same hardware, thus possibly enabling new applications. Availability and implementation: https://github.com/jnalanko/VOMM Supplementary information: Supplementary data are available at Bioinformatics online.
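To make the notion of a variable-length context concrete, the following Python sketch stores counts in a plain dictionary and predicts from the longest context that has been seen often enough. It deliberately ignores the space-efficient indexes that are the subject of the abstract. The function names, the `min_count` frequency threshold, and the additive smoothing are assumptions chosen for illustration, not the VOMM tool's method or API.

```python
from collections import defaultdict


def build_context_counts(text, max_order):
    """Count next-symbol occurrences for every context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text)):
        for k in range(max_order + 1):
            if i - k < 0:
                break  # not enough history for a context this long
            counts[text[i - k:i]][text[i]] += 1
    return counts


def longest_context(history, counts, max_order, min_count=2):
    """Return the longest suffix of history observed at least min_count times."""
    for k in range(min(max_order, len(history)), -1, -1):
        context = history[len(history) - k:]
        if sum(counts.get(context, {}).values()) >= min_count:
            return context
    return ""  # fall back to the empty (order-0) context


def predict(history, counts, max_order, alphabet, min_count=2, pseudocount=1.0):
    """Smoothed next-symbol distribution from the selected context."""
    context = longest_context(history, counts, max_order, min_count)
    ctx = counts.get(context, {})
    total = sum(ctx.values()) + pseudocount * len(alphabet)
    return {s: (ctx.get(s, 0) + pseudocount) / total for s in alphabet}


# Toy usage on a made-up repetitive string.
counts = build_context_counts("ACGACGACGACG", max_order=3)
probs = predict("TTACG", counts, max_order=3, alphabet="ACGT")
```

A dictionary of this kind stores every context explicitly, so its size grows quickly with the maximum order. That growth is the memory problem the abstract addresses with compressed representations.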


2017 ◽  
Author(s):  
Ahmad Fadly Nurullah Rasedee ◽  
Hazizah Mohd Ijam ◽  
Mohammad Hasan Abdul Sathar ◽  
Norizarina Ishak ◽  
Muhamad Azrin Nazri ◽  
...  

2014 ◽  
Vol 136 (6) ◽  
Author(s):  
Alberto Varello ◽  
Erasmo Carrera

The free vibration analysis of thin- and thick-walled layered structures via a refined one-dimensional (1D) approach is addressed in this paper. Carrera unified formulation (CUF) is employed to introduce higher-order 1D models with a variable order of expansion for the displacement unknowns over the cross section. Classical Euler–Bernoulli (EBBM) and Timoshenko (TBM) beam theories are obtained as particular cases. Different kinds of vibrational modes with increasing half-wave numbers are investigated for short and relatively short cylindrical shells with different cross section geometries and laminations. Numerical results of natural frequencies and modal shapes are provided by using the finite element method (FEM), which permits various boundary conditions to be handled with ease. The analyses highlight that the refinement of the displacement field by means of higher-order terms is fundamental especially to capture vibrational modes that require warping and in-plane deformation to be detected. Classical beam models are not able to predict the realistic dynamic behavior of shells. Comparisons with three-dimensional elasticity solutions and solid finite element solutions prove that CUF provides accuracy in the free vibration analysis of even short, nonhomogeneous thin- and thick-walled shell structures, despite its 1D approach. The results clearly show that bending, radial, axial, and also shell lobe-type modes can be accurately evaluated by variable kinematic 1D CUF models with a remarkably lower computational effort compared to solid FE models.


Geophysics ◽  
2012 ◽  
Vol 77 (3) ◽  
pp. T69-T82 ◽  
Author(s):  
Shen Wang ◽  
Jianlin Xia ◽  
Maarten V. de Hoop ◽  
Xiaoye S. Li

We considered the discretization and approximate solutions of equations describing time-harmonic qP-polarized waves in 3D inhomogeneous anisotropic media. The anisotropy comprises general (tilted) transversely isotropic symmetries. We are concerned with solving these equations for a large number of different sources. We considered higher-order partial differential equations and variable-order finite-difference schemes to accommodate anisotropy on the one hand and allow higher-order accuracy — to control sampling rates for relatively high frequencies — on the other hand. We made use of a nested dissection based domain decomposition in a massively parallel multifrontal solver combined with hierarchically semiseparable matrix compression techniques. The higher-order partial differential operators and the variable-order finite-difference schemes require the introduction of separators with variable thickness in the nested dissection; the development of these and their integration with the multifrontal solver is the main focus of our study. The algorithm that we developed is a powerful tool for anisotropic full-waveform inversion.


2018 ◽  
Author(s):  
Fabio Cunial ◽  
Jarno Alanko ◽  
Djamal Belazzougui

Abstract. Motivation: Markov models with contexts of variable length are widely used in bioinformatics for representing sets of sequences with similar biological properties. When models contain many long contexts, existing implementations are either unable to handle genome-scale training datasets within typical memory budgets, or they are optimized for specific model variants and are thus inflexible. Results: We provide practical, versatile representations of variable-order Markov models and of interpolated Markov models, that support a large number of context-selection criteria, scoring functions, probability smoothing methods, and interpolations, and that take up to 4 times less space than previous implementations based on the suffix array, regardless of the number and length of contexts, and up to 10 times less space than previous trie-based representations, or more, while matching the size of related, state-of-the-art data structures from Natural Language Processing. We describe how to further compress our indexes to a quantity related to the redundancy of the training data, saving up to 90% of their space on repetitive datasets, and making them become up to 60 times smaller than previous implementations based on the suffix array. Finally, we show how to exploit constraints on the length and frequency of contexts to further shrink our compressed indexes to half of their size or more, achieving data structures that are 100 times smaller than previous implementations based on the suffix array, or more. This allows variable-order Markov models to be trained on bigger datasets and with longer contexts on the same hardware, thus possibly enabling new applications. Availability and implementation: https://github.com/jnalanko/VOMM


2004 ◽  
Vol 22 ◽  
pp. 385-421 ◽  
Author(s):  
R. Begleiter ◽  
R. El-Yaniv ◽  
G. Yona

This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average log-loss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a "decomposed" CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the Lempel-Ziv compression algorithm, significantly outperforms all algorithms on the protein classification problems.
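The average log-loss mentioned above can be sketched in Python, assuming the usual definition: the mean of the negative base-2 logarithm of the probability the predictor assigned to each symbol given the symbols before it. The `predictor` callback interface and the `uniform_predictor` baseline are hypothetical and are not taken from the paper.

```python
import math


def average_log_loss(sequence, predictor):
    """Mean of -log2 P(symbol | preceding symbols) over the sequence, in bits."""
    if not sequence:
        raise ValueError("sequence must be non-empty")
    total = 0.0
    for i, symbol in enumerate(sequence):
        # predictor(history) is assumed to return a dict: symbol -> probability
        p = predictor(sequence[:i])[symbol]
        total -= math.log2(p)
    return total / len(sequence)


def uniform_predictor(alphabet):
    """Hypothetical baseline: every symbol equally likely, whatever the history."""
    prob = 1.0 / len(alphabet)
    return lambda history: {s: prob for s in alphabet}


# A uniform predictor over four symbols costs log2(4) bits per symbol.
loss = average_log_loss("ACGT", uniform_predictor("ACGT"))
```

Lower values indicate better prediction. Because the quantity is a per-symbol code length in bits, it also reflects the link to lossless compression that the abstract draws.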

