Grammatical Inference: Algorithms and Applications

Abstract. Estimating the genomic location and length of identical-by-descent (IBD) segments among individuals is a crucial step in many genetic analyses. However, the exponential growth in the size of biobank and direct-to-consumer (DTC) genetic data sets makes accurate IBD inference a significant computational challenge. Here we present the templated positional Burrows-Wheeler transform (TPBWT) to make fast IBD estimates robust to genotype and phasing errors. Using haplotype data simulated over pedigrees with realistic genotyping and phasing errors, we show that the TPBWT outperforms other state-of-the-art IBD inference algorithms in terms of speed and accuracy. For each phase-aware method, we explore the false positive and false negative rates of inferring IBD by segment length and characterize the types of error commonly found. Our results highlight the fragility of most phased IBD inference methods; the accuracy of IBD estimates can be highly sensitive to the quality of haplotype phasing. Additionally, we compare the performance of the TPBWT against a widely used phase-free IBD inference approach that is robust to phasing errors. We introduce both in-sample and out-of-sample TPBWT-based IBD inference algorithms and demonstrate their computational efficiency on massive-scale datasets with millions of samples. Furthermore, we describe the binary file format for TPBWT-compressed haplotypes that results in fast and efficient out-of-sample IBD computation against very large cohort panels. Finally, we demonstrate the utility of the TPBWT in a brief empirical analysis exploring geographic patterns of haplotype sharing within Mexico. Hierarchical clustering of IBD shared across regions within Mexico reveals geographically structured haplotype sharing and a strong signal of isolation by distance. Our software implementation of the TPBWT is freely available for non-commercial use in the code repository https://github.com/23andMe/phasedibd.
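For orientation, the following is a minimal sketch of the plain positional Burrows-Wheeler transform (PBWT) that the TPBWT extends. It is not the templated variant described in the abstract and is not taken from the phasedibd code: it assumes error-free, fully phased 0/1 haplotypes, and the function names `pbwt_arrays` and `matches_ending_at_last_site` are hypothetical. It maintains the positional prefix array and the divergence array site by site, then reports pairs of haplotypes whose exact match ending at the last site spans at least `min_len` sites.

```python
def pbwt_arrays(haps):
    """Return the positional prefix array `a` and divergence array `d`
    after processing every site of a list of equal-length 0/1 haplotypes.

    a[i] is the index of the i-th haplotype when haplotypes are sorted by
    their reversed prefixes; d[i] is the first site of the exact match
    between a[i] and a[i-1] that ends at the last processed site.
    """
    n_haps = len(haps)
    n_sites = len(haps[0]) if haps else 0
    a = list(range(n_haps))
    d = [0] * n_haps
    for k in range(n_sites):
        a0, a1, d0, d1 = [], [], [], []
        p = q = k + 1  # k + 1 means "no match ending at site k"
        for idx, start in zip(a, d):
            p = max(p, start)
            q = max(q, start)
            if haps[idx][k] == 0:
                a0.append(idx)
                d0.append(p)
                p = 0
            else:
                a1.append(idx)
                d1.append(q)
                q = 0
        # Stable partition: haplotypes carrying allele 0 first, then allele 1.
        a, d = a0 + a1, d0 + d1
    return a, d


def matches_ending_at_last_site(haps, min_len):
    """Return the set of haplotype index pairs (i, j), i < j, whose exact
    match ending at the last site covers at least `min_len` sites."""
    n_sites = len(haps[0]) if haps else 0
    a, d = pbwt_arrays(haps)
    pairs = set()
    for i in range(len(a)):
        start = 0
        for j in range(i + 1, len(a)):
            # Match start between sorted positions i and j is the maximum
            # divergence value strictly after i up to and including j.
            start = max(start, d[j])
            if n_sites - start < min_len:
                break  # later neighbours can only match over fewer sites
            pairs.add((min(a[i], a[j]), max(a[i], a[j])))
    return pairs
```

Because a single mismatching allele ends a match in this plain version, one genotyping or phasing error splits a long shared segment in two, which is the fragility the abstract says the TPBWT is designed to address.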


ModularBoost: an efficient network inference algorithm based on module decomposition

BMC Bioinformatics, DOI 10.1186/s12859-021-04074-y, 2021, Vol 22 (1)

Author(s): Xinyu Li, Wei Zhang, Jianming Zhang, Guang Li

Keyword(s): Network Inference, Detection Methods, Inference Problem, Topological Constraints, Inference Algorithms, Module Detection, Series Expression, Gene Modules, Inference Methods, Complicated Task

Abstract. Background: Given expression data, gene regulatory network (GRN) inference approaches try to determine regulatory relations. However, current inference methods ignore the inherent topological characteristics of GRNs to some extent, leading to structures that lack a clear biological explanation. To make the inferred networks more biophysically meaningful, this study performed data-driven module detection before network inference. Gene modules were identified by decomposition-based methods. Results: ICA-decomposition-based module detection methods have been used to detect functional modules directly from transcriptomic data. Experiments on time-series expression, curated, and scRNA-seq datasets demonstrated the advantages of the proposed ModularBoost method over established methods, especially in efficiency and accuracy. For scRNA-seq datasets, the ModularBoost method outperformed the other candidate inference algorithms. Conclusions: As a complicated task, GRN inference can be decomposed into several tasks of reduced complexity. Using identified gene modules as topological constraints, the initial inference problem can be solved by inferring intra-modular and inter-modular interactions separately. Experimental outcomes suggest that the proposed ModularBoost method can improve the accuracy and efficiency of inference algorithms by introducing topological constraints.
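The decomposition idea in the Conclusions can be illustrated with a small sketch. It is not the ModularBoost implementation: it only shows how a given module assignment splits the candidate regulator-to-target pairs into an intra-modular set and an inter-modular set, so that each set can be handled as a smaller inference task. The function name `split_candidate_edges`, the gene names, and the module labels are all hypothetical, and the module assignment is assumed to come from an earlier detection step.

```python
from itertools import permutations


def split_candidate_edges(module_of):
    """Partition all ordered (regulator, target) gene pairs by whether
    both genes belong to the same module.

    `module_of` maps each gene name to its module label.
    Returns (intra_modular_pairs, inter_modular_pairs).
    """
    intra, inter = [], []
    for regulator, target in permutations(sorted(module_of), 2):
        if module_of[regulator] == module_of[target]:
            intra.append((regulator, target))
        else:
            inter.append((regulator, target))
    return intra, inter


# Hypothetical assignment: two modules over five genes.
module_of = {"g1": "A", "g2": "A", "g3": "A", "g4": "B", "g5": "B"}
intra, inter = split_candidate_edges(module_of)
print(len(intra), "intra-modular candidates;", len(inter), "inter-modular candidates")
```

With five genes there are 20 ordered pairs in total; under this assignment 8 fall within a module and 12 cross modules, so a method that first infers edges inside each module works on much smaller candidate sets.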


A Nonstationary Hidden Markov Model with Approximately Infinitely-Long Time-Dependencies

International Journal of Artificial Intelligence Tools, DOI 10.1142/s0218213016400017, 2016, Vol 25 (05), pp. 1640001

Cited by ~3

Author(s): Sotirios Chatzis, Dimitrios Kosmopoulos, George Papadourakis

Keyword(s): Markov Models, Hidden Markov, Mean Field, Sequential Data, First Order, Order Markov Chain, Inference Algorithms, Unrealistic Assumption, Long Time, Time Dependencies

Hidden Markov models (HMMs) are a popular approach for modeling sequential data, typically based on the assumption of a first-order Markov chain. In other words, only one-step-back dependencies are modeled, which is a rather unrealistic assumption in most applications. In this paper, we propose a method for postulating HMMs with approximately infinitely long time dependencies. Our approach considers the whole history of model states in the postulated dependencies by making use of a recently proposed nonparametric Bayesian method for modeling label sequences with infinitely long time dependencies, namely the sequence memoizer. By employing a mean-field-like approximation, we derive training and inference algorithms for our model with computational costs identical to those of simple first-order HMMs, despite the infinitely long time dependencies the model entails. The efficacy of our proposed model is experimentally demonstrated.
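As a point of reference for the first-order baseline the abstract contrasts against, here is a minimal sketch of the standard forward algorithm for a discrete first-order HMM. It is not the proposed model and uses no sequence memoizer or mean-field approximation; the function name `forward_likelihood` and the example parameters are hypothetical. Each step depends only on the previous step's state distribution, which is exactly the one-step-back dependency the abstract calls unrealistic.

```python
def forward_likelihood(initial, transition, emission, observations):
    """Likelihood of an observation sequence under a first-order HMM.

    initial[s]        : probability of starting in state s
    transition[r][s]  : probability of moving from state r to state s
    emission[s][o]    : probability of emitting symbol o in state s
    observations      : list of symbol indices
    """
    if not observations:
        return 1.0
    n_states = len(initial)
    # alpha[s] = P(observations so far, current state = s)
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emission[s][obs] * sum(alpha[r] * transition[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state, two-symbol model.
initial = [0.6, 0.4]
transition = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.9, 0.1], [0.2, 0.8]]
print(forward_likelihood(initial, transition, emission, [0, 1, 0]))
```

The cost is proportional to the sequence length times the square of the number of states, which is the per-sequence cost the abstract says the proposed model matches.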
