SimRVSequences: an R package to simulate genetic sequence data for pedigrees

2019 ◽  
Author(s):  
Christina Nieuwoudt ◽  
Angela Brooks-Wilson ◽  
Jinko Graham

Abstract Summary Family-based studies have several advantages over case-control studies for finding causal rare variants for a disease; these include increased power, smaller sample size requirements, and improved detection of sequencing errors. However, collecting suitable families and compiling their data is time-consuming and expensive. To evaluate methodology to identify causal rare variants in family-based studies, one can use simulated data. For this purpose we present the R package SimRVSequences. Users supply a sample of pedigrees and single-nucleotide variant data from a sample of unrelated individuals representing the pedigree founders. Users may also model genetic heterogeneity among families. For ease of use, SimRVSequences offers methods to import and format single-nucleotide variant data and pedigrees from existing software. Availability and Implementation SimRVSequences is available as a library for R ≥ 3.5.0 on the Comprehensive R Archive Network.


2019 ◽  
Vol 36 (7) ◽  
pp. 2295-2297
Author(s):  
Christina Nieuwoudt ◽  
Angela Brooks-Wilson ◽  
Jinko Graham

Abstract Summary We present the R package SimRVSequences to simulate sequence data for pedigrees. SimRVSequences allows for simulations of large numbers of single-nucleotide variants (SNVs) and scales well with increasing numbers of pedigrees. Users provide a sample of pedigrees and SNV data from a sample of unrelated individuals. Availability and implementation SimRVSequences is publicly available on CRAN at https://cran.r-project.org/web/packages/SimRVSequences/. Supplementary information Supplementary data are available at Bioinformatics online.
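The inputs and output described in the abstract above (founder haplotypes and a pedigree in, sequence data for every pedigree member out) can be illustrated with a bare-bones gene-dropping sketch. This is not the SimRVSequences algorithm or API: it is written in Python rather than R, it ignores recombination and any conditioning on a causal rare variant, and every name in it (`gene_drop`, `pedigree`, `founders`) is hypothetical.

```python
import random


def gene_drop(pedigree, founder_haplotypes, rng):
    """Assign two haplotypes to every pedigree member.

    pedigree: list of (person_id, father_id, mother_id), parents listed
              before their children; founders have None for both parents.
    founder_haplotypes: dict person_id -> (hap_a, hap_b), each haplotype a
              tuple of 0/1 alleles at the same SNV positions.
    Simplifying assumption: each parent transmits one whole haplotype chosen
    at random (no recombination).
    """
    haps = {}
    for person, father, mother in pedigree:
        if father is None and mother is None:
            # Founders take their haplotypes directly from the supplied sample.
            haps[person] = founder_haplotypes[person]
        else:
            paternal = rng.choice(haps[father])
            maternal = rng.choice(haps[mother])
            haps[person] = (paternal, maternal)
    return haps


# Toy pedigree: two founders (1, 2) and their two children (3, 4).
pedigree = [(1, None, None), (2, None, None), (3, 1, 2), (4, 1, 2)]
founders = {
    1: ((0, 1, 0), (0, 0, 0)),  # founder 1 carries a variant at the 2nd SNV
    2: ((0, 0, 0), (0, 0, 1)),
}
sim = gene_drop(pedigree, founders, random.Random(1))
for person, (hap_p, hap_m) in sim.items():
    print(person, hap_p, hap_m)
```

Passing an explicit `random.Random` instance keeps the toy simulation reproducible without touching global random state.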



2014 ◽  
Vol 8 (Suppl 1) ◽  
pp. S27 ◽  
Author(s):  
Jing Huang ◽  
Yong Chen ◽  
Michael D Swartz ◽  
Iuliana Ionita-Laza


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Pavlos Mamouris ◽  
Vahid Nassiri ◽  
Geert Molenberghs ◽  
Marjan van den Akker ◽  
Joep van der Meer ◽  
...  

Abstract Background In case-control studies, most algorithms allow controls to be sampled several times, which is not always optimal. If many controls are available and adjustment for several covariates is necessary, matching without replacement might increase statistical efficiency. Comparing similar units is of utmost importance in observational data, since confounding and selection bias are present. The aim was twofold: first, to create a method that accommodates the option that a control is not resampled, and second, to display several scenarios that identify changes in odds ratios (ORs) as the balance of the matched sample increases. Methods The algorithm was derived iteratively, from the pre-processing steps used to derive the data through to its application in a study investigating the effect of antibiotics on colorectal cancer risk in the INTEGO registry (Flanders, Belgium). Different scenarios were developed to investigate the fluctuation of ORs using combinations of exact and varying variables, with or without replacement of controls. To achieve balance in the population, we introduced the Comorbidity Index (CI) variable, the sum of chronic diseases, as a means of obtaining comparable units for drawing valid associations. Results This algorithm is fast and optimal. It is fast because, on simulated data, the run-time of matching was minimal even with millions of patients. It is optimal because the closest control is always captured (using the appropriate ordering and some auxiliary variables), and when a case has only one control, that control is guaranteed to be matched to that case, thus maximizing the number of cases used in the analysis. In total, 72 different scenarios were displayed, indicating the fluctuation of ORs and revealing patterns, especially a drop when balancing the population.
Conclusions We created an optimal and computationally efficient algorithm to derive a matched case-control sample with and without replacement of controls. The code and functions are publicly available as open source in an R package. Finally, we emphasize the importance of displaying several scenarios and assessing the differences in ORs while using an index to balance the population in observational data.
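The two properties claimed in the abstract above (the closest control is chosen, and a case with a single eligible control keeps it) can be illustrated with a small greedy sketch. It is not the authors' algorithm or their R package: the function name, the `(stratum, value)` data layout, the caliper, and the "fewest candidates first" ordering are all assumptions made for illustration.

```python
def match_without_replacement(cases, controls, caliper):
    """Greedy 1:1 matching in which each control is used at most once.

    cases, controls: dict id -> (stratum, value); the stratum must match
    exactly (e.g. sex) and the value (e.g. age) must lie within `caliper`.
    Returns dict case_id -> control_id for the cases that could be matched.
    """
    def eligible(case_id, pool):
        stratum, value = cases[case_id]
        return [c for c in pool
                if controls[c][0] == stratum
                and abs(controls[c][1] - value) <= caliper]

    pool = set(controls)
    # Heuristic ordering: cases with the fewest candidate controls go first,
    # so a case with a single candidate is not robbed of it.
    order = sorted(cases, key=lambda k: (len(eligible(k, pool)), k))
    matches = {}
    for case_id in order:
        candidates = eligible(case_id, pool)
        if not candidates:
            continue  # no unused control left for this case
        value = cases[case_id][1]
        # Closest control on the varying variable; ties broken by id.
        best = min(candidates, key=lambda c: (abs(controls[c][1] - value), c))
        matches[case_id] = best
        pool.remove(best)  # without replacement: remove from the pool
    return matches


cases = {"A": ("F", 50), "B": ("F", 52)}
controls = {"x": ("F", 51), "y": ("F", 58)}
matches = match_without_replacement(cases, controls, caliper=6)
print(matches)
```

In this toy data, case A can only be matched to control x, while case B could take either x or y. Processing A first leaves y for B, so both cases are kept; processing B first would have given x to B and left A unmatched.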



Author(s):  
Alexandre Yahi ◽  
Paul Hoffman ◽  
Margot Brandt ◽  
Pejman Mohammadi ◽  
Nicholas P. Tatonetti ◽  
...  

Abstract Genome editing experiments are generating an increasing amount of targeted sequencing data with specific mutational patterns indicating the success of the experiments and genotypes of clonal cell lines. We present EdiTyper, a high-throughput command line tool specifically designed for analysis of sequencing data from polyclonal and monoclonal cell populations from CRISPR gene editing. It requires simple inputs of sequencing data and reference sequences, and provides comprehensive outputs including summary statistics, plots, and SAM/BAM alignments. Analysis of simulated data showed that EdiTyper is highly accurate for detection of both single nucleotide mutations and indels, robust to sequencing errors, as well as fast and scalable to large experimental batches. EdiTyper is available on GitHub (https://github.com/LappalainenLab/edityper) under the MIT license.



2018 ◽  
Author(s):  
Ehsan Motazedi ◽  
Richard Finkers ◽  
Chris Maliepaard ◽  
Dick de Ridder

Abstract DNA sequence reads contain information about the genomic variants located on a single chromosome. By extracting and extending this information (using the overlaps of the reads), the haplotypes of an individual can be obtained. Adding parent-offspring relationships to the read information in a population can considerably improve the quality of the haplotypes obtained from short reads, as pedigree information can compensate for spurious overlaps (due to sequencing errors) and insufficient overlaps (due to shallow coverage). This improvement is especially beneficial for polyploid organisms, which have more than two copies of each chromosome and are therefore more difficult to haplotype than diploids. We develop a novel method, PopPoly, to estimate polyploid haplotypes in an F1-population from short sequence data by considering the transmission of the haplotypes from the parents to the offspring. In addition, PopPoly employs this information to improve genotype dosage estimation and to call missing genotypes in the population. Through realistic simulations, we compare PopPoly to other haplotyping methods and show its better performance in terms of phasing accuracy and the accuracy of phased genotypes. We apply PopPoly to estimate the parental and offspring haplotypes for a tetraploid potato cross with 10 offspring, using Illumina HiSeq sequence data of 9 genomic regions involved in plant maturity and tuberisation.



2021 ◽  
Author(s):  
Charles S.P. Foster ◽  
Sacha Stelzer-Braid ◽  
Ira W. Deveson ◽  
Rowena A. Bull ◽  
Malinna Yeang ◽  
...  

Whole-genome sequencing of viral isolates is critical for informing transmission patterns and ongoing evolution of pathogens, especially during a pandemic. However, when genomes have low variability in the early stages of a pandemic, the impact of technical and/or sequencing errors increases. We quantitatively assessed inter-laboratory differences in consensus genome assemblies of 72 matched SARS-CoV-2-positive specimens sequenced at different laboratories in Sydney, Australia. Raw sequence data were assembled using two different bioinformatics pipelines in parallel, and resulting consensus genomes were compared to detect laboratory-specific differences. Matched genome sequences were predominantly concordant, with a median pairwise identity of 99.997%. Identified differences were predominantly driven by ambiguous site content. Ignoring these produced differences in only 2.3% (5/216) of pairwise comparisons, each differing by a single nucleotide. Matched samples were assigned the same Pango lineage in 98.2% (212/216) of pairwise comparisons, and were mostly assigned to the same phylogenetic clade. However, epidemiological inference based only on single nucleotide variant distances may lead to significant differences in the number of defined clusters if variant allele frequency thresholds for consensus genome generation differ between laboratories. These results underscore the need for a unified, best-practices approach to bioinformatics between laboratories working on a common outbreak problem.
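The comparison described in the abstract above, counting differences between matched consensus genomes with and without ambiguous sites, can be sketched as follows. This is an illustrative helper, not the study's pipeline: it assumes the two sequences are already aligned to the same length, and it treats any character other than A, C, G, or T as ambiguous.

```python
UNAMBIGUOUS = set("ACGT")


def pairwise_differences(seq_a, seq_b, ignore_ambiguous=True):
    """Return (differences, sites_compared) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if ignore_ambiguous and (a not in UNAMBIGUOUS or b not in UNAMBIGUOUS):
            continue  # skip sites where either genome has an ambiguous call
        compared += 1
        if a != b:
            differences += 1
    return differences, compared


lab_1 = "ACGTNACGTA"  # one ambiguous call (N) at the fifth site
lab_2 = "ACGTAACGTT"

strict = pairwise_differences(lab_1, lab_2, ignore_ambiguous=False)
lenient = pairwise_differences(lab_1, lab_2, ignore_ambiguous=True)
print(strict)   # (2, 10)
print(lenient)  # (1, 9)

diffs, sites = lenient
print(f"identity ignoring ambiguous sites: {1 - diffs / sites:.3f}")
```

In the toy pair, one of the two apparent differences comes from an ambiguous base rather than a true nucleotide disagreement, which is the kind of effect the abstract attributes to ambiguous site content.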



2017 ◽  
Vol 7 (1) ◽  
pp. 17-20 ◽  
Author(s):  
Shiro Fujita ◽  
Katsuhiro Masago ◽  
Chiyuki Okuda ◽  
Akito Hata ◽  
Reiko Kaji ◽  
...  


2017 ◽  
Author(s):  
J.E. Hicks ◽  
M. A. Province

Abstract The contribution of rare variants to disease burden has become an important focus in genetic epidemiology. These effects are difficult to detect in population-based datasets, and as a result, interest in family-based study designs has resurfaced. Linkage analysis tools will need to be updated to accommodate the scale of data generated by modern genotyping and sequencing technologies. In conventional linkage analysis, individuals in different pedigrees are assumed to be independent of each other. However, cryptic relatedness is often present in populations, and haplotypes that harbor rare variants may be shared between pedigrees as well as within them. With millions of polymorphisms, identity-by-descent (IBD) states across the genome can now be inferred without use of pedigree information. This is done by identifying long runs of identical-by-state (IBS) genotypes, which are unlikely to arise without IBD. Previously, IBD had to be estimated in pedigrees from recombination events in a sparse set of markers. We present a method for variance-components linkage that can incorporate large numbers of markers and allows for between-pedigree relatedness. We replace the IBD matrix generated from pedigree-based analysis with one generated from a genotype-based method. All pedigrees in a dataset are considered jointly, allowing between-pedigree IBD to be included in the model. In simulated data, we show that power is increased when a haplotype is shared IBD between members of different pedigrees. If there is no between-pedigree IBD, the analysis reduces to conventional variance-components analysis. By determining IBD states from long runs of dense IBS genotypes, linkage signals can be determined from their physical position, allowing more precise localization.
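The idea of inferring IBD from long runs of identical-by-state genotypes can be illustrated with a toy scan. This is a sketch under stated assumptions, not the method of the paper: genotypes are coded as alternate-allele counts (0, 1, 2), two individuals are treated as sharing at least one allele IBS unless they are opposite homozygotes (0 versus 2), run length is measured in markers rather than genetic distance, and genotyping error is ignored.

```python
def long_ibs_runs(geno_a, geno_b, min_markers):
    """Return half-open (start, end) marker intervals of IBS-compatible runs.

    A marker is IBS-compatible when the two genotypes share at least one
    allele, i.e. they are not opposite homozygotes. Only runs spanning at
    least `min_markers` consecutive markers are reported.
    """
    if len(geno_a) != len(geno_b):
        raise ValueError("genotype vectors must cover the same markers")
    runs = []
    start = None
    for i, (a, b) in enumerate(zip(geno_a, geno_b)):
        compatible = abs(a - b) < 2
        if compatible and start is None:
            start = i  # a new run begins here
        elif not compatible and start is not None:
            if i - start >= min_markers:
                runs.append((start, i))
            start = None  # an opposite homozygote ends the run
    if start is not None and len(geno_a) - start >= min_markers:
        runs.append((start, len(geno_a)))
    return runs


person_1 = [0, 1, 2, 2, 1, 0, 0, 2, 1, 1]
person_2 = [1, 1, 2, 1, 0, 2, 0, 1, 1, 2]
print(long_ibs_runs(person_1, person_2, min_markers=5))  # [(0, 5)]
print(long_ibs_runs(person_1, person_2, min_markers=4))  # [(0, 5), (6, 10)]
```

With dense markers, short compatible runs arise by chance, so the length threshold is what separates plausible IBD segments from background sharing.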



2018 ◽  
Author(s):  
Mara Battagin ◽  
Serap Gonen ◽  
Roger Ros-Freixedes ◽  
Andrew Whalen ◽  
Gregor Gorjanc ◽  
...  

This paper describes a family-based phasing algorithm, for variable-coverage sequence data, that first minimises phasing errors and then maximises the proportion of alleles phased. This algorithm is one of the essential tools that underpin an overall strategy for generating highly accurate sequence data on whole populations at low cost. The algorithm is called AlphaFamSeq. It uses sequence data on the focal individual and at least two generations of ancestors to phase alleles. In the first step, AlphaFamSeq calculates allele probabilities using iterative peeling. In subsequent steps, the alleles are phased using heuristics deriving information from the sequence data of parents, grandparents and progenies and, if available, from other families in the pedigree. AlphaFamSeq was tested on a range of simulated data sets. AlphaFamSeq gives low phasing error rates and, if there is sufficient sequence information and haplotype sharing amongst individuals, it can give a high yield of correctly phased alleles. The allele threshold had a large effect and window size had a small effect on performance. When all individuals in a single family were sequenced at different coverages the highest correctly phased alleles reached 90% of the possible maximum (98.9%) at ~1/6 of the maximum aggregate coverage. Adding sequence information from other related individuals increased the percentage of correctly phased alleles. Imputation performance was high across all allele frequencies (average correlation by marker of 0.94), except for a slight decrease at very low frequencies (≤0.01 MAF). Within an overall strategy for generating highly accurate sequence data on whole populations at low cost the role of AlphaFamSeq is to provide very accurately phased haplotypes on focal individuals, who are individuals whose haplotypes are very common in the population.
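The simplest kind of parent-derived phasing information mentioned in the abstract above can be sketched for a single site in a trio. This is not AlphaFamSeq: it skips the iterative peeling step, allele probabilities, grandparents, and progeny entirely, and it assumes error-free genotype calls coded as alternate-allele counts (0, 1, 2). The function name is hypothetical.

```python
def phase_child_site(child, father, mother):
    """Phase one biallelic site of a child from its parents' genotypes.

    Returns (paternal_allele, maternal_allele) when exactly one transmission
    pattern is consistent with the three genotypes, otherwise None (either
    ambiguous, e.g. all three heterozygous, or Mendelian-inconsistent).
    """
    # Alleles each parental genotype can transmit.
    transmissible = {0: {0}, 1: {0, 1}, 2: {1}}
    options = {
        (p, m)
        for p in transmissible[father]
        for m in transmissible[mother]
        if p + m == child
    }
    return options.pop() if len(options) == 1 else None


print(phase_child_site(1, 0, 1))  # (0, 1): the father can only give allele 0
print(phase_child_site(1, 2, 1))  # (1, 0): the father can only give allele 1
print(phase_child_site(1, 1, 1))  # None: both phases are possible
print(phase_child_site(1, 0, 0))  # None: inconsistent with Mendelian inheritance
```

Returning `None` rather than guessing mirrors the priority stated in the abstract, which is to minimise phasing errors before maximising the proportion of alleles phased.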



2018 ◽  
Author(s):  
Maria Victoria Fernández ◽  
John Budde ◽  
Jorge Del-Aguila ◽  
Laura Ibañez ◽  
Yuetiva Deming ◽  
...  

Abstract Gene-based tests to study the combined effect of rare variants on a particular phenotype have been widely developed for case-control studies, but their evolution and adaptation for family-based studies, especially for complex incomplete families, has been slower. In this study, we have performed a practical examination of all the latest gene-based methods available for family-based study designs using both simulated and real datasets. We have examined the performance of several collapsing, variance-component and transmission disequilibrium tests across eight different software packages and twenty-two models utilizing a cohort of 285 families (N=1,235) with late-onset Alzheimer disease (LOAD). After a thorough examination of each of these tests, we propose a methodological approach to identify, with high confidence, genes associated with the studied phenotype, and we provide recommendations to select the best software and model for family-based gene-based analyses. Additionally, in our dataset, we identified PTK2B, a GWAS candidate gene for sporadic AD, along with six novel genes (CHRD, CLCN2, HDLBP, CPAMD8, NLRP9, MAS1L) as candidate genes for familial LOAD.


