SVEngine: an efficient and versatile simulator of genome structural variations with features of cancer clonal evolution

2018 ◽  
Author(s):  
Li Charlie Xia ◽  
Dongmei Ai ◽  
Hojoon Lee ◽  
Noemi Andor ◽  
Chao Li ◽  
...  

Background: Simulating genome sequence data with features can facilitate the development and benchmarking of structural variant analysis programs. However, only a limited number of data simulators provide structural variants in silico. Moreover, there is a paucity of programs that generate structural variants with different allelic fractions and haplotypes.
Findings: We developed SVEngine, an open source tool to address this need. SVEngine simulates next generation sequencing data with embedded structural variations. As input, SVEngine takes template haploid sequences (FASTA) and an external variant file, a variant distribution file and/or a clonal phylogeny tree file (NEWICK). It then simulates and outputs sequence contigs (FASTAs), sequence reads (FASTQs) and/or post-alignment files (BAMs). All of the files contain the desired variants, along with BED files containing the ground truth. SVEngine’s flexible design process enables one to specify size, position, and allelic fraction for deletion, insertion, duplication, inversion and translocation variants. Finally, SVEngine simulates sequence data that replicates the characteristics of a sequencing library with mixed sizes of DNA insert molecules. SVEngine is highly parallelized to reduce the simulation time.
Conclusions: We demonstrated the versatile features of SVEngine and its improved runtime compared with other available simulators. SVEngine’s features include the simulation of locus-specific variant frequency designed to mimic the phylogeny of cancer clonal evolution. We validated the accuracy of the simulations. Our evaluation included checking various sequencing mapping features such as coverage change, read clipping, insert size shift and neighbouring hanging read pairs for representative variant types. SVEngine is implemented as a standard Python package and is freely available for academic use at: https://bitbucket.org/charade/svengine.
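The idea of locus-specific variant frequencies that follow a clonal phylogeny can be illustrated with a minimal sketch. It is not SVEngine's implementation or input format: the tree, the clone names, the prevalences and the function `clone_cell_fractions` are all hypothetical. It assumes that a variant arising in a clone is inherited by every descendant clone, so the fraction of cells carrying it is the summed prevalence of that clone's subtree.

```python
from typing import Dict, List


def clone_cell_fractions(
    children: Dict[str, List[str]],
    prevalence: Dict[str, float],
    root: str,
) -> Dict[str, float]:
    """Map each clone to the fraction of cells carrying variants that arose in it."""
    fractions: Dict[str, float] = {}

    def visit(clone: str) -> float:
        # A clone's variants are present in the clone itself and in all descendants.
        total = prevalence.get(clone, 0.0)
        for child in children.get(clone, []):
            total += visit(child)
        fractions[clone] = total
        return total

    visit(root)
    return fractions


# Hypothetical tree: normal -> A -> (B, C); prevalences are made up and sum to 1.
children = {"normal": ["A"], "A": ["B", "C"]}
prevalence = {"normal": 0.25, "A": 0.25, "B": 0.375, "C": 0.125}

cell_fractions = clone_cell_fractions(children, prevalence, "normal")

# Assumption: heterozygous variants in diploid, copy-neutral regions, so the
# expected allelic fraction is half of the carrying cell fraction.
allelic_fractions = {
    clone: fraction / 2
    for clone, fraction in cell_fractions.items()
    if clone != "normal"
}
```

Under these assumptions a variant that arose in clone A is expected at an allelic fraction of 0.375, while variants private to B and C are expected at 0.1875 and 0.0625.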

2018 ◽  
Author(s):  
Alfredo Iacoangeli ◽  
Ahmad Al Khleifat ◽  
William Sproviero ◽  
Aleksey Shatunov ◽  
Ashley R Jones ◽  
...  

Amyotrophic lateral sclerosis (ALS, MND) is a neurodegenerative disease of upper and lower motor neurons resulting in death from neuromuscular respiratory failure, typically within two years of first symptoms. Genetic factors are an important cause of ALS, with variants in more than 25 genes having strong evidence, and weaker evidence available for variants in more than 120 genes. With the increasing availability of next-generation sequencing data, non-specialists, including health care professionals and patients, are obtaining their genomic information without a corresponding ability to analyse and interpret it. Furthermore, the relevance of novel or existing variants in ALS genes is not always apparent. Here we present ALSgeneScanner, a tool that is easy to install and use and that provides an automatic, detailed, annotated report on a list of ALS genes, from whole genome sequence data in a few hours and from whole exome sequence data in about one hour, on a readily available mid-range computer. This will be of value to non-specialists and aid in the interpretation of the relevance of novel and existing variants identified in DNA sequencing data.


2019 ◽  
Author(s):  
Joshua I Brian ◽  
Simon K Davy ◽  
Shaun P Wilkinson

Coral reefs rely on their intracellular dinoflagellate symbionts (family Symbiodiniaceae) for nutritional provision in nutrient-poor waters, yet this association is threatened by thermally stressful conditions. Despite this, the evolutionary potential of these symbionts remains poorly characterised. In this study, we tested the potential for divergent Symbiodiniaceae types to sexually reproduce (i.e. hybridise) within Cladocopium, the most ecologically prevalent genus in this family. With sequence data from three organelles (cob gene, mitochondria; psbAncr region, chloroplast; and ITS2 region, nucleus), we utilised the Incongruence Length Difference test, Approximately Unbiased test, tree hybridisation analyses and visual inspection of raw data in stepwise fashion to highlight incongruences between organelles, and thus provide evidence of reticulate evolution. Using this approach, we identified three putative hybrid Cladocopium samples among the 158 analysed, at two of the seven sites sampled. These samples were identified as the common Cladocopium types C40 or C1 with respect to the mitochondria and chloroplasts, but the rarer types C3z, C3u and C1# with respect to their nuclear identity. These five Cladocopium types have previously been confirmed as evolutionarily distinct and were also recovered in non-incongruent samples multiple times, which is strongly suggestive that they sexually reproduced to produce the incongruent samples. A concomitant inspection of Next Generation Sequencing data for these samples suggests that other plausible explanations, such as incomplete lineage sorting, are much less likely. The approach taken in this study allows incongruences between gene regions to be identified with confidence, and brings new light to the evolutionary potential within Symbiodiniaceae.


2019 ◽  
Author(s):  
David Wyllie ◽  
Trien Do ◽  
Richard Myers ◽  
Vlad Nikolayevskyy ◽  
Derrick Crook ◽  
...  

Background: The prevalence, association with disease status, and public health impact of infection with mixtures of M. tuberculosis strains are unclear, in part due to limitations of existing methods for detecting mixed infections.
Methods: We developed an algorithm to identify mixtures of M. tuberculosis strains using next generation sequencing data, assessing performance using simulated sequences. We identified mixed M. tuberculosis strains when there was at least one mixed nucleotide position, and where both the mixture’s components were present in similar isolates from other individuals. We determined risk factors for mixed infection among isolations of M. tuberculosis in England using logistic regression. We used survival analyses to assess the association between mixed infection and putative transmission.
Findings: 6,560 isolations of TB were successfully sequenced in England in 2016-2018. Of 3,691 (56%) specimens for which similar sequences had been isolated from at least two other individuals, 341 (9.2%) were mixed. Infection with lineages other than Lineage 4 was associated with mixed infection. Among the 1,823 individuals with pulmonary infection with Lineage 4 M. tuberculosis, mixed infection was associated with significantly increased risk of subsequent isolation of closely related organisms from a different individual (HR 1.43, 95% CI 1.05, 1.94), indicative of transmission.
Interpretation: Mixtures of transmissible strains occur in at least 5% of tuberculosis infections in England; when present in pulmonary disease, such mixtures are associated with an increased risk of tuberculosis transmission.
Funding: Public Health England; NIHR Health Protection Research Unit Oxford; European Union.
Research in Context
Evidence Before This Study: We searched Pubmed using the search terms ‘tuberculosis’ and ‘mixed’ or ‘mixture’ for English Language articles published up to 1 April 2019. Studies, most performed without the benefit of genomic sequencing, report mixed TB infection from a range of medium and high prevalence areas and show it to be associated with delayed treatment response. Modelling suggests detection and treatment of mixed TB infection is an important goal for TB eradication campaigns. Although routine DNA sequencing of M. tuberculosis isolates is becoming widespread, efficient methods for detecting mixed infection from such data are underdeveloped, and the true prevalence of mixed infection and its association with transmission are unclear.
Added Value of This Study: This study investigated a large series of TB isolations obtained as part of a routine Mycobacterial sequencing program by two reference laboratories, in a low incidence area, England. We developed an efficient generalisable approach to identify transmitted mixed M. tuberculosis infection; our approach is capable of sensitive and specific detection of a single mixed nucleotide position. We identified mixed infection of similar strains (‘microvariation’) in about 9.2% of the M. tuberculosis samples which we were able to assess, and found evidence of increased transmission from individuals with mixed infection.
Implications of All the Available Evidence: TB microvariation is a risk factor for TB transmission, even in the low incidence area studied. Although an efficient and highly specific technique identifying microvariation exists, it relies on comparison with similar sequences isolated from other patients. Sharing of sequence data from the many TB sequencing programs being deployed globally will increase the sensitivity of microvariation detection, and may assist targeted public health interventions.
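The mixture criterion described in the Methods (a mixed nucleotide position whose two components are both seen in similar isolates from other individuals) can be sketched roughly as follows. This is an illustration only, not the authors' algorithm: the depth and minor-allele thresholds, the data layout and the function names are all assumptions introduced here.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical thresholds; the abstract does not state the values actually used.
MIN_DEPTH = 20
MIN_MINOR_FRACTION = 0.1


def mixed_alleles(counts: Dict[str, int]) -> Optional[Tuple[str, str]]:
    """Return (major, minor) bases if a position looks mixed, otherwise None."""
    depth = sum(counts.values())
    if depth < MIN_DEPTH or len(counts) < 2:
        return None
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    (major, _), (minor, minor_reads) = ranked[0], ranked[1]
    if minor_reads / depth < MIN_MINOR_FRACTION:
        return None
    return major, minor


def supported_mixed_positions(
    sample: Dict[int, Dict[str, int]],
    similar_isolates: List[Dict[int, str]],
) -> List[int]:
    """Positions that are mixed and whose two alleles both occur in similar isolates."""
    supported = []
    for position, counts in sorted(sample.items()):
        alleles = mixed_alleles(counts)
        if alleles is None:
            continue
        # Consensus bases observed at this position in isolates from other individuals.
        seen = {isolate.get(position) for isolate in similar_isolates}
        if alleles[0] in seen and alleles[1] in seen:
            supported.append(position)
    return supported


# Made-up read counts per base for one sample, keyed by genome position.
sample = {
    100: {"A": 30, "G": 20},   # mixed, both alleles seen in similar isolates
    200: {"C": 50},            # not mixed
    300: {"T": 40, "C": 10},   # mixed, but "C" is not seen in similar isolates
}
similar_isolates = [
    {100: "A", 200: "C", 300: "T"},
    {100: "G", 200: "C", 300: "T"},
]
positions = supported_mixed_positions(sample, similar_isolates)
```

With this toy input only position 100 qualifies, because position 300 is mixed but its minor allele has no support among the similar isolates.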


2017 ◽  
Author(s):  
Maurizio Pellegrino ◽  
Adam Sciambi ◽  
Sebastian Treusch ◽  
Robert Durruthy-Durruthy ◽  
Kaustubh Gokhale ◽  
...  

To enable the characterization of genetic heterogeneity in tumor cell populations, we developed a novel microfluidic approach that barcodes amplified genomic DNA from thousands of individual cancer cells confined to droplets. The barcodes are then used to reassemble the genetic profiles of cells from next generation sequencing data. Using this approach, we sequenced longitudinally collected AML tumor populations from two patients and genotyped up to 62 disease relevant loci across more than 16,000 individual cells. Targeted single-cell sequencing was able to sensitively identify tumor cells during complete remission and uncovered complex clonal evolution within AML tumors that was not observable with bulk sequencing. We anticipate that this approach will make feasible the routine analysis of heterogeneity in AML leading to improved stratification and therapy selection for the disease.


PeerJ ◽  
2019 ◽  
Vol 7 ◽  
pp. e7178 ◽  
Author(s):  
Joshua I. Brian ◽  
Simon K. Davy ◽  
Shaun P. Wilkinson

Coral reefs rely on their intracellular dinoflagellate symbionts (family Symbiodiniaceae) for nutritional provision in nutrient-poor waters, yet this association is threatened by thermally stressful conditions. Despite this, the evolutionary potential of these symbionts remains poorly characterised. In this study, we tested the potential for divergent Symbiodiniaceae types to sexually reproduce (i.e. hybridise) within Cladocopium, the most ecologically prevalent genus in this family. With sequence data from three organelles (cob gene, mitochondrion; psbAncr region, chloroplast; and ITS2 region, nucleus), we utilised the Incongruence Length Difference test, Approximately Unbiased test, tree hybridisation analyses and visual inspection of raw data in stepwise fashion to highlight incongruences between organelles, and thus provide evidence of reticulate evolution. Using this approach, we identified three putative hybrid Cladocopium samples among the 158 analysed, at two of the seven sites sampled. These samples were identified as the common Cladocopium types C40 or C1 with respect to the mitochondria and chloroplasts, but the rarer types C3z, C3u and C1# with respect to their nuclear identity. These five Cladocopium types have previously been confirmed as evolutionarily distinct and were also recovered in non-incongruent samples multiple times, which is strongly suggestive that they sexually reproduced to produce the incongruent samples. A concomitant inspection of next generation sequencing data for these samples suggests that other plausible explanations, such as incomplete lineage sorting or the presence of co-dominance, are much less likely. The approach taken in this study allows incongruences between gene regions to be identified with confidence, and brings new light to the evolutionary potential within Symbiodiniaceae.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Michael M. Khayat ◽  
Sayed Mohammad Ebrahim Sahraeian ◽  
Samantha Zarate ◽  
Andrew Carroll ◽  
Huixiao Hong ◽  
...  

Background: Genomic structural variations (SV) are important determinants of genotypic and phenotypic changes in many organisms. However, the detection of SV from next-generation sequencing data remains challenging.
Results: In this study, DNA from a Chinese family quartet is sequenced at three different sequencing centers in triplicate. A total of 288 derivative data sets are generated utilizing different analysis pipelines and compared to identify sources of analytical variability. Mapping methods provide the major contribution to variability, followed by sequencing centers and replicates. Interestingly, SV supported by only one center or replicate often represent true positives with 47.02% and 45.44% overlapping the long-read SV call set, respectively. This is consistent with an overall higher false negative rate for SV calling in centers and replicates compared to mappers (15.72%). Finally, we observe that the SV calling variability also persists in a genotyping approach, indicating the impact of the underlying sequencing and preparation approaches.
Conclusions: This study provides the first detailed insights into the sources of variability in SV identification from next-generation sequencing and highlights remaining challenges in SV calling for large cohorts. We further give recommendations on how to reduce SV calling variability and the choice of alignment methodology.


2019 ◽  
Author(s):  
Sergey Aganezov ◽  
Benjamin J. Raphael

Many cancer genomes are extensively rearranged with highly aberrant chromosomal karyotypes. These genome rearrangements, or structural variants, can be detected in tumor DNA sequencing data by abnormal mapping of sequence reads to the reference genome. However, nearly all cancer sequencing to date is of bulk tumor samples which consist of a heterogeneous mixture of normal cells and subpopulations of cancer cells, or clones, that harbor distinct somatic structural variants. We introduce a novel algorithm, Reconstructing Cancer Karyotypes (RCK), to reconstruct haplotype-specific karyotypes of one or more rearranged cancer genomes, or clones, that best explain the read alignments from a bulk tumor sample. RCK leverages specific evolutionary constraints on the somatic mutation process in cancer to reduce ambiguity in the deconvolution of admixed DNA sequence data into multiple haplotype-specific cancer karyotypes. In particular, RCK relies on generalizations of the infinite sites assumption that a genome rearrangement is highly unlikely to occur at the same nucleotide position more than once during somatic evolution. RCK’s comprehensive model allows us to incorporate information from both short- and long-read sequencing technologies and is applicable to bulk tumor samples containing a mixture of an arbitrary number of derived genomes. We compared RCK to the state-of-the-art method ReMixT on a dataset of 17 primary and metastatic prostate cancer samples. We demonstrate that ReMixT’s limited support for heterogeneity and lack of evolutionary constraints lead to reconstruction of implausible karyotypes. In contrast, RCK infers cancer karyotypes that better explain read alignments from bulk tumor samples and are consistent with a reasonable evolutionary model. RCK’s reconstructions of clone- and haplotype-specific karyotypes will aid further studies of the role of intra-tumor heterogeneity in cancer development and response to treatment. RCK is available at https://github.com/raphael-group/RCK.


2021 ◽  
Author(s):  
Jean-Pierre Kocher ◽  
Zachary Stephens ◽  
Daniel O'Brien ◽  
Mrunal Dehankar ◽  
Lewis Roberts ◽  
...  

The integration of viruses into the human genome is known to be associated with tumorigenesis in many cancers, but the accurate detection of integration breakpoints from short read sequencing data is made difficult by human-viral homologies, viral genome heterogeneity, coverage limitations, and other factors. To address this, we present Exogene, a sensitive and efficient workflow for detecting viral integrations from paired-end next generation sequencing data. Exogene's read filtering and breakpoint detection strategies yield integration coordinates that are highly concordant with those found in long read validation sets. We demonstrate this concordance across 6 TCGA Hepatocellular carcinoma (HCC) tumor samples, identifying integrations of hepatitis B virus that are validated by long reads. Additionally, we applied Exogene to targeted capture data from 426 previously studied HCC samples, achieving 98.9% concordance with existing methods and identifying 238 high-confidence integrations that were not previously reported. Exogene is applicable to multiple types of paired-end sequence data, including genome, exome, RNA-Seq or targeted capture.


Author(s):  
Lianming Du ◽  
Qin Liu ◽  
Zhenxin Fan ◽  
Jie Tang ◽  
Xiuyue Zhang ◽  
...  

FASTA and FASTQ are the most widely used biological data formats and have become the de facto standard for exchanging sequence data between bioinformatics tools. With the avalanche of next-generation sequencing data, the amount of sequence data being deposited and accessed in FASTA/Q formats is increasing dramatically. However, the existing tools have very low efficiency at random retrieval of subsequences due to the requirement of loading the entire index into memory. In addition, most existing tools cannot build an index for large FASTA/Q files because of limited memory. Furthermore, the tools do not support random access to sequences in FASTA/Q files compressed by gzip, which is extensively adopted by most public databases to save storage. In this study, we developed pyfastx as a versatile Python package with commonly used command-line tools to overcome the above limitations. Compared to other tools, pyfastx yielded the highest performance in terms of building an index and random access to sequences, particularly when dealing with large FASTA/Q files with hundreds of millions of sequences. A key advantage of pyfastx over other tools is that it offers an efficient way to randomly extract subsequences directly from gzip compressed FASTA/Q files without needing to uncompress beforehand. Pyfastx can easily be installed from PyPI (https://pypi.org/project/pyfastx) and the source code is freely available at https://github.com/lmdu/pyfastx.
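Index-based random access to a FASTA file, the general technique the abstract refers to, can be sketched with the standard library alone. This is not pyfastx's API or its index format. The names `IndexEntry`, `build_index` and `fetch` are hypothetical, and the sketch assumes an uncompressed, well-formed file in which every line of a record except the last has the same length. It does not cover the gzip case that the abstract highlights.

```python
import os
import tempfile
from typing import Dict, NamedTuple


class IndexEntry(NamedTuple):
    length: int      # total number of bases in the record
    offset: int      # byte offset of the record's first base
    line_bases: int  # bases per full sequence line
    line_bytes: int  # bytes per full sequence line, newline included


def build_index(path: str) -> Dict[str, IndexEntry]:
    """Scan the file once and record where each sequence starts."""
    index: Dict[str, IndexEntry] = {}
    name = None
    length = seq_offset = line_bases = line_bytes = 0
    position = 0  # byte offset of the current line
    with open(path, "rb") as handle:
        for raw in handle:
            if raw.startswith(b">"):
                if name is not None:
                    index[name] = IndexEntry(length, seq_offset, line_bases, line_bytes)
                name = raw[1:].split()[0].decode("ascii")
                length = line_bases = line_bytes = 0
                seq_offset = position + len(raw)
            elif name is not None:
                bases = len(raw.rstrip(b"\r\n"))
                if line_bases == 0:
                    line_bases, line_bytes = bases, len(raw)
                length += bases
            position += len(raw)
    if name is not None:
        index[name] = IndexEntry(length, seq_offset, line_bases, line_bytes)
    return index


def fetch(path: str, index: Dict[str, IndexEntry], name: str, start: int, end: int) -> str:
    """Return bases [start, end) (0-based, half-open) by seeking, not scanning."""
    entry = index[name]
    end = min(end, entry.length)
    if start < 0 or start >= end:
        return ""

    def byte_of(base: int) -> int:
        full_lines, within = divmod(base, entry.line_bases)
        return entry.offset + full_lines * entry.line_bytes + within

    with open(path, "rb") as handle:
        handle.seek(byte_of(start))
        chunk = handle.read(byte_of(end - 1) + 1 - byte_of(start))
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")


# Small made-up FASTA file used only to exercise the sketch.
with tempfile.TemporaryDirectory() as workdir:
    fasta_path = os.path.join(workdir, "demo.fa")
    with open(fasta_path, "wb") as out:
        out.write(b">chr1 demo\nACGTACGT\nTTGGCCAA\nGG\n>chr2\nNNNNACGT\n")
    demo_index = build_index(fasta_path)
    window = fetch(fasta_path, demo_index, "chr1", 6, 10)
    tail = fetch(fasta_path, demo_index, "chr2", 4, 8)
```

Only the small index has to be consulted for each lookup; the requested bytes are read directly with `seek`, which is what makes retrieval from very large files cheap.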


PLoS ONE ◽  
2021 ◽  
Vol 16 (9) ◽  
pp. e0250915
Author(s):  
Zachary Stephens ◽  
Daniel O’Brien ◽  
Mrunal Dehankar ◽  
Lewis R. Roberts ◽  
Ravishankar K. Iyer ◽  
...  

The integration of viruses into the human genome is known to be associated with tumorigenesis in many cancers, but the accurate detection of integration breakpoints from short read sequencing data is made difficult by human-viral homologies, viral genome heterogeneity, coverage limitations, and other factors. To address this, we present Exogene, a sensitive and efficient workflow for detecting viral integrations from paired-end next generation sequencing data. Exogene’s read filtering and breakpoint detection strategies yield integration coordinates that are highly concordant with long read validation. We demonstrate this concordance across 6 TCGA Hepatocellular carcinoma (HCC) tumor samples, identifying integrations of hepatitis B virus that are also supported by long reads. Additionally, we applied Exogene to targeted capture data from 426 previously studied HCC samples, achieving 98.9% concordance with existing methods and identifying 238 high-confidence integrations that were not previously reported. Exogene is applicable to multiple types of paired-end sequence data, including genome, exome, RNA-Seq and targeted capture.

