GRaphical footprint based Alignment-Free method (GRAFree) for reconstructing evolutionary Traits in Large-Scale Genomic Features

2018 ◽  
Author(s):  
Aritra Mahapatra ◽  
Jayanta Mukherjee

Abstract In our study, we attempt to extract novel features from mitochondrial genomic sequences that reflect their evolutionary traits, using our proposed method GRAFree (GRaphical footprint based Alignment-Free method). These features are used to build a phylogenetic tree for a given set of species from insects, fish, birds, and mammals. A novel distance measure in the feature space is proposed to reflect the proximity of these species in the evolutionary process. The distance function is found to be a metric. We propose a three-step technique to select a feature vector from the feature space. We vary these selected feature vectors to generate multiple hypotheses of the tree, and finally use a consensus-based tree merging algorithm to obtain the phylogeny. Experiments were carried out with 157 species covering four classes, namely Insecta, Actinopterygii, Aves, and Mammalia. We also introduce a measure of quality for the inferred tree, intended especially for cases where no reference tree is available. The performance of the output tree can be measured at each clade by considering the presence of each species in the corresponding clade. GRAFree can be applied to any graphical representation of a genome to reconstruct the phylogenetic tree. We apply our proposed distance function to the selected feature vectors for three naive methods of graphical representation of a genome. The inferred tree reflects some accepted evolutionary traits with high bootstrap support. We conclude that our proposed distance function can be applied to capture the evolutionary relationships of a large number of both close and distant species using graphical methods.
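The claim that a distance function is a metric can be probed numerically on a finite sample. The following is a minimal sketch, not GRAFree's distance or the authors' proof: the function name `satisfies_metric_axioms`, the tolerance, and the example distances are all hypothetical. A finite check of this kind can only refute the metric property, never establish it in general.

```python
from itertools import product


def satisfies_metric_axioms(points, dist, tol=1e-12):
    """Check the metric axioms for `dist` on a finite list of distinct points.

    Returns False as soon as one axiom is violated on the sample.
    """
    for x, y in product(points, repeat=2):
        d = dist(x, y)
        if d < -tol:                      # non-negativity
            return False
        if abs(d - dist(y, x)) > tol:     # symmetry
            return False
        if x == y and abs(d) > tol:       # d(x, x) must be 0
            return False
        if x != y and d <= tol:           # distinct points must be separated
            return False
    for x, y, z in product(points, repeat=3):
        # triangle inequality: d(x, z) <= d(x, y) + d(y, z)
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:
            return False
    return True


def manhattan(a, b):
    """L1 distance between two equal-length tuples (a known metric)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def squared_difference(a, b):
    """Squared difference of two numbers (violates the triangle inequality)."""
    return (a - b) ** 2
```

For instance, `satisfies_metric_axioms([(0, 0), (1, 2), (3, 1)], manhattan)` passes, while the squared difference on the points 0, 1, 2 fails because the direct distance from 0 to 2 exceeds the sum of the two intermediate steps.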

2019 ◽  
Vol 15 ◽  
pp. 117693431984907 ◽  
Author(s):  
Tomáš Farkaš ◽  
Jozef Sitarčík ◽  
Broňa Brejová ◽  
Mária Lucká

Computing similarity between 2 nucleotide sequences is one of the fundamental problems in bioinformatics. Current methods are based mainly on 2 major approaches: (1) sequence alignment, which is computationally expensive, and (2) faster, but less accurate, alignment-free methods based on various statistical summaries, for example, short word counts. We propose a new distance measure based on mathematical transforms from the domain of signal processing. To tolerate large-scale rearrangements in the sequences, the transform is computed across sliding windows. We compare our method on several data sets with current state-of-the-art alignment-free methods. Our method compares favorably in terms of accuracy and outperforms other methods in running time and memory requirements. In addition, it scales up to dozens of processing units without loss of performance due to communication overhead. Source files and sample data are available at https://bitbucket.org/fiitstubioinfo/swspm/src


2021 ◽  
Author(s):  
Xuemei Liu ◽  
Wen Li ◽  
Guanda Huang ◽  
Tianlai Huang ◽  
Qingang Xiong ◽  
...  

Algorithms for constructing phylogenetic trees are fundamental to studying the evolution of viruses, bacteria, and other microbes. Established multiple alignment-based algorithms are inefficient for large-scale metagenomic sequence data because of their high requirement of inter-sequence correlation and high computational complexity. In this paper, we present SeqDistK, a novel tool for alignment-free phylogenetic analysis. SeqDistK computes the dissimilarity matrix for phylogenetic analysis, incorporating seven k-mer based dissimilarity measures, namely d2, d2S, d2star, Euclidean, Manhattan, CVTree, and Chebyshev. Based on these dissimilarities, SeqDistK constructs a phylogenetic tree using the Unweighted Pair Group Method with Arithmetic Mean algorithm. Using a gold standard dataset of 16S rRNA and its associated phylogenetic tree, we compared SeqDistK to Muscle, a multiple sequence aligner. We found that SeqDistK was not only 38 times faster than Muscle but also more accurate. SeqDistK achieved the smallest symmetric difference between the inferred and ground truth trees, ranging from 13 to 18, while that of Muscle was 62. When the measures d2, d2star, d2S, and Euclidean were used with k-mer size k=5, SeqDistK consistently inferred a phylogenetic tree almost identical to the ground truth tree. We also performed clustering of 16S rRNA sequences using SeqDistK and found the clustering was highly consistent with known biological taxonomy. Among all the measures, d2S (k=5, M=2) showed the best accuracy, as it correctly clustered and classified all sample sequences. In summary, SeqDistK is a novel, fast, and accurate alignment-free tool for large-scale phylogenetic analysis. SeqDistK software is freely available at https://github.com/htczero/SeqDistK.
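To illustrate what a k-mer based dissimilarity looks like, the sketch below computes three of the simpler measures named above (Euclidean, Manhattan, and Chebyshev) on normalized k-mer frequency vectors. It is not SeqDistK's code: the helper names are hypothetical, the use of relative frequencies rather than raw counts is an assumption, and d2S, d2star, and CVTree are omitted because they require a background sequence model.

```python
import math
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Return the relative frequency of every DNA k-mer, in lexicographic order.

    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts.get(kmer, 0) / total for kmer in all_kmers]


def euclidean(x, y):
    """L2 distance between two frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """L1 distance between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """Largest single-coordinate difference between two frequency vectors."""
    return max(abs(a - b) for a, b in zip(x, y))
```

A pairwise matrix of such values over a set of sequences is the kind of input that a UPGMA tree-building step consumes.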


2019 ◽  
Author(s):  
Anna Danese ◽  
Maria L. Richter ◽  
David S. Fischer ◽  
Fabian J. Theis ◽  
Maria Colomé-Tatché

Abstract Epigenetic single-cell measurements reveal a layer of regulatory information not accessible to single-cell transcriptomics; however, single-cell-omics analysis tools mainly focus on gene expression data. To address this issue, we present epiScanpy, a computational framework for the analysis of single-cell DNA methylation and single-cell ATAC-seq data. EpiScanpy makes the many existing RNA-seq workflows from scanpy available to large-scale single-cell data from other -omics modalities. We introduce and compare multiple feature space constructions for epigenetic data and show the feasibility of common clustering, dimension reduction, and trajectory learning techniques. We benchmark epiScanpy by interrogating different single-cell mouse brain atlases of DNA methylation, ATAC-seq, and transcriptomics. We find that differentially methylated and differentially open markers between cell clusters enrich transcriptome-based cell type labels with orthogonal epigenetic information.


Author(s):  
Liguo Fei ◽  
Yuqiang Feng

Belief functions have always played an indispensable role in modeling cognitive uncertainty. As an extension of this theory, the theory of D numbers has been proposed and developed in a more efficient and robust way. Within the framework of D number theory, two more generalized properties are introduced: (1) the elements in the frame of discernment (FOD) of D numbers are not required to be strictly mutually exclusive; (2) the completeness constraint is released. The investigation shows that the distance function is very significant in measuring the difference between two D numbers, especially in information fusion and decision making. Modeling methods of uncertainty that incorporate D numbers have become increasingly popular; however, very few approaches have tackled the challenges of distance metrics. In this study, the distance measure between two D numbers is presented for several cases, including complete information, incomplete information, and non-exclusive elements.


2020 ◽  
Author(s):  
Yu Wang ◽  
ZAHEER ULLAH KHAN ◽  
Shaukat Ali ◽  
Maqsood Hayat

Abstract Background A bacteriophage, or phage, is a type of virus that replicates itself inside bacteria. It consists of genetic material surrounded by a protein structure. Bacteriophages play a vital role in phage therapy and genetic engineering. Phage and hydrolase enzyme proteins have a significant impact on the cure of pathogenic bacterial infections and on disease treatment. Accurate identification of bacteriophage proteins is important for host subcellular localization, for further understanding of the interaction between phages and hydrolases, and for designing antibacterial drugs. Given the significance of bacteriophage proteins, several computational models have been developed alongside wet laboratory-based methods. However, their performance has been limited by inefficient feature schemes, redundancy, noise, and the lack of an intelligent learning engine. We have therefore developed an innovative bi-layered model named DeepEnzyPred. A hybrid feature vector was obtained via a novel Multi-Level Multi-Threshold subset feature selection (MLMT-SFS) algorithm. A two-dimensional convolutional neural network was adopted as the baseline classifier. Results A conductive hybrid feature was obtained via a serial combination of CTD and KSAACGP features. The optimum feature was selected via the novel Multi-Level Multi-Threshold subset feature selection algorithm. Under 5-fold jackknife cross-validation, an accuracy of 91.6%, sensitivity of 63.39%, specificity of 95.72%, MCC of 0.6049, and ROC value of 0.8772 were recorded for layer 1. Similarly, the underlying model obtained an accuracy of 96.05%, sensitivity of 96.22%, specificity of 95.91%, MCC of 0.9219, and ROC value of 0.9899 for layer 2. Conclusion This paper presents a robust and effective classification model for bacteriophages and their types. Primitive features were extracted via CTD and KSAACGP. A novel method (MLMT-SFS) was devised to yield an optimum hybrid feature space from the primitive features. The results obtained with the hybrid feature space and the 2D-CNN showed excellent classification performance. Based on the recorded results, we believe that the developed predictor will be a valuable resource for large-scale discrimination of unknown phage and hydrolase enzymes in particular, and for new antibacterial drug design in pharmaceutical companies in general.
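The evaluation measures reported above (accuracy, sensitivity, specificity, and MCC) all derive from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions. The function name and the example counts are hypothetical and are not taken from the DeepEnzyPred experiments.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives, and false negatives.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }
```

Because MCC combines all four cells, it can stay moderate even when accuracy is high, as happens when sensitivity and specificity differ widely on an imbalanced data set.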


Author(s):  
Anne H.H. Ngu ◽  
Jialie Shen ◽  
John Shepherd

The optimized distance-based access methods currently available for multimedia databases are based on two major assumptions: a suitable distance function is known a priori, and the dimensionality of image features is low. The standard approach to building image databases is to represent images via vectors based on low-level visual features and make retrieval based on these vectors. However, due to the large gap between the semantic notions and low-level visual content, it is extremely difficult to define a distance function that accurately captures the similarity of images as perceived by humans. Furthermore, popular dimension reduction methods suffer from either the inability to capture the nonlinear correlations among raw data or very expensive training cost. To address the problems, in this chapter we introduce a new indexing technique called Combining Multiple Visual Features (CMVF) that integrates multiple visual features to get better query effectiveness. Our approach is able to produce low-dimensional image feature vectors that include not only low-level visual properties but also high-level semantic properties. The hybrid architecture can produce feature vectors that capture the salient properties of images yet are small enough to allow the use of existing high-dimensional indexing methods to provide efficient and effective retrieval.


GigaScience ◽  
2020 ◽  
Vol 9 (5) ◽  
Author(s):  
Morteza Hosseini ◽  
Diogo Pratas ◽  
Burkhard Morgenstern ◽  
Armando J Pinho

Abstract Background The development of high-throughput sequencing technologies and, as its result, the production of huge volumes of genomic data, has accelerated biological and medical research and discovery. Study on genomic rearrangements is crucial owing to their role in chromosomal evolution, genetic disorders, and cancer. Results We present Smash++, an alignment-free and memory-efficient tool to find and visualize small- and large-scale genomic rearrangements between 2 DNA sequences. This computational solution extracts information contents of the 2 sequences, exploiting a data compression technique to find rearrangements. We also present Smash++ visualizer, a tool that allows the visualization of the detected rearrangements along with their self- and relative complexity, by generating an SVG (Scalable Vector Graphics) image. Conclusions Tested on several synthetic and real DNA sequences from bacteria, fungi, Aves, and Mammalia, the proposed tool was able to accurately find genomic rearrangements. The detected regions were in accordance with previous studies, which took alignment-based approaches or performed FISH (fluorescence in situ hybridization) analysis. The maximum peak memory usage among all experiments was ∼1 GB, which makes Smash++ feasible to run on present-day standard computers.


Author(s):  
Prashant Tiwari ◽  
SH Upadhyay

The performance degradation assessment of ball bearings is of great importance for increasing the efficiency and reliability of rotating mechanical systems. The large dimensionality of the feature space introduces a lot of noise and buries the potential information about faults hidden in the feature data. This paper proposes a novel health assessment method that combines two compatible methods, namely curvilinear component analysis and a self-organizing map network. The novelty lies in implementing a vector quantization approach for the sub-manifolds in the feature space and extracting the fault signatures through a nonlinear mapping technique. Curvilinear component analysis is a nonlinear mapping tool that can effectively represent the average manifold of highly folded information while preserving the local topology of the data. To achieve reliability and accuracy in bearing performance degradation assessment, the work is carried out in the following steps: first, ensemble empirical mode decomposition is used to decompose the vibration signals into useful intrinsic mode functions; second, two fault features, i.e., singular values and energy entropies, are extracted from the envelopes of the intrinsic mode function signals; third, the feature vectors extracted under healthy conditions, further reduced with curvilinear component analysis, are used to train the self-organizing map model; finally, the reduced test feature vectors are supplied to the trained self-organizing map and the confidence value is obtained. The effectiveness of the proposed technique is validated on three run-to-failure test signals with different types of defects. The results indicate that the proposed technique detects weak degradation earlier than widely used indicators such as root mean square, kurtosis, self-organizing map-based minimum quantization error, and minimum quantization error based on principal component analysis.
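Two of the baseline indicators mentioned above, root mean square and kurtosis, are simple statistics of a vibration signal. The following minimal sketch uses their standard definitions. The function names are hypothetical, and kurtosis is taken here as the population (non-excess) form, which is an assumption, since the abstract does not state which variant was used.

```python
import math


def root_mean_square(signal):
    """Root mean square of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def kurtosis(signal):
    """Population kurtosis: fourth central moment divided by squared variance.

    A Gaussian signal gives a value near 3 in this non-excess form.
    """
    n = len(signal)
    mean = sum(signal) / n
    second_moment = sum((x - mean) ** 2 for x in signal) / n
    fourth_moment = sum((x - mean) ** 4 for x in signal) / n
    if second_moment == 0:
        raise ValueError("kurtosis is undefined for a constant signal")
    return fourth_moment / second_moment ** 2
```

Impulsive content such as that produced by a localized bearing defect raises kurtosis, while overall vibration energy is reflected in the root mean square.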


2019 ◽  
Vol 58 (10) ◽  
pp. 2295-2311
Author(s):  
Yonghe Liu ◽  
Jinming Feng ◽  
Zongliang Yang ◽  
Yonghong Hu ◽  
Jianlin Li

Abstract Few statistical downscaling applications have provided gridded products that give downscaled values for ungauged areas, as dynamical downscaling does. In this study, a gridded statistical downscaling scheme is presented to downscale summer precipitation to a dense grid that covers North China. The main innovation of this scheme is interpolating the parameters of single-station models to this dense grid and assigning optimal predictor values according to an interpolated predictand–predictor distance function. This method can produce spatial dependence (spatial autocorrelation) and transmit the spatial heterogeneity of predictor values from the large-scale predictors to the downscaled outputs. The gridded output at ungauged locations shows performance comparable to that at the gauged stations. The area mean precipitation of the downscaled results is comparable to other products. The main value of the downscaling scheme is that it can obtain reasonable outputs for ungauged locations.

