Sarek: A portable workflow for whole-genome sequencing analysis of germline and somatic variants

2018 ◽  
Author(s):  
Maxime Garcia ◽  
Szilveszter Juhos ◽  
Malin Larsson ◽  
Pall I. Olason ◽  
Marcel Martin ◽  
...  

Abstract Summary: Whole-genome sequencing (WGS) is a cornerstone of precision medicine, but portable and reproducible open-source workflows for WGS analyses of germline and somatic variants are lacking. We present Sarek, a modular, comprehensive, and easy-to-install workflow, combining a range of software for the identification and annotation of single-nucleotide variants (SNVs), insertion and deletion variants (indels), structural variants, tumor sample heterogeneity, and karyotyping from germline or paired tumor/normal samples. Sarek is implemented in a bioinformatics workflow language (Nextflow) with Docker- and Singularity-compatible containers, ensuring easy deployment and full reproducibility at any Linux-based compute cluster or cloud computing environment. Sarek supports the human reference genomes GRCh37 and GRCh38, and can readily be used both as a core production workflow at sequencing facilities and as a powerful stand-alone tool for individual research groups. Availability: Source code and instructions for local installation are available at GitHub (https://github.com/SciLifeLab/Sarek) under the MIT open-source license, and we invite the research community to contribute additional functionality as a collaborative open-source development project.

BMC Genetics ◽  
2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Lucy Crooks ◽  
Johnathan Cooper-Knock ◽  
Paul R. Heath ◽  
Ahmed Bouhouche ◽  
Mostafa Elfahime ◽  
...  

Abstract Background Large-scale human sequencing projects have described around a hundred million single nucleotide variants (SNVs). These studies have predominantly involved individuals with European ancestry despite the fact that genetic diversity is expected to be highest in Africa, where Homo sapiens evolved and has maintained a large population for the longest time. The African Genome Variation Project examined several African populations but these were all located south of the Sahara. Morocco is on the northwest coast of Africa and mostly lies north of the Sahara, which makes it very attractive for studying genetic diversity. The ancestry of present-day Moroccans is unknown and may be substantially different from Africans found south of the Sahara desert. Recent genomic data of Taforalt individuals in Eastern Morocco revealed 15,000-year-old modern humans and suggested that North African individuals may be genetically distinct from previously studied African populations. Results We present SNVs discovered by whole genome sequencing (WGS) of three Moroccans. From a total of 5.9 million SNVs detected, over 200,000 were not identified by 1000G and were not in the extensive gnomAD database. We summarise the SNVs by genomic position, type of sequence gene context and effect on proteins encoded by the sequence. Comparison of the overall genomic information of the Moroccan individuals with individuals from 1000G supports the Moroccan population being distinct from both sub-Saharan African and European populations. Conclusions We conclude that Moroccan samples are genetically distinct and lie in the middle of the previously observed cline between populations of European and African ancestry. WGS of Moroccan individuals can identify a large number of novel SNVs and aid in functional characterisation of the genome.


2021 ◽  
Author(s):  
Arya R. Massarat ◽  
Arko Sen ◽  
Jeff Jaureguy ◽  
Sélène T. Tyndale ◽  
Yi Fu ◽  
...  

Abstract Genetic variants and de novo mutations in regulatory regions of the genome are typically discovered by whole-genome sequencing (WGS); however, WGS is expensive and most WGS reads come from non-regulatory regions. The Assay for Transposase-Accessible Chromatin (ATAC-seq) generates reads from regulatory sequences and could potentially be used as a low-cost ‘capture’ method for regulatory variant discovery, but its use for this purpose has not been systematically evaluated. Here we apply seven variant callers to bulk and single-cell ATAC-seq data and evaluate their ability to identify single nucleotide variants (SNVs) and insertions/deletions (indels). In addition, we develop an ensemble classifier, VarCA, which combines features from individual variant callers to predict variants. The Genome Analysis Toolkit (GATK) is the best-performing individual caller with precision/recall on a bulk ATAC test dataset of 0.92/0.97 for SNVs and 0.87/0.82 for indels. On bulk ATAC-seq reads, VarCA achieves superior performance with precision/recall of 0.99/0.95 for SNVs and 0.93/0.80 for indels. On single-cell ATAC-seq reads, VarCA attains precision/recall of 0.98/0.94 for SNVs and 0.82/0.82 for indels. In summary, ATAC-seq reads can be used to accurately discover non-coding regulatory variants in the absence of whole-genome sequencing data and our ensemble method, VarCA, has the best overall performance.
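One way to compare the precision/recall pairs quoted above on a single scale is the F1 score (the harmonic mean of precision and recall). The sketch below is illustrative only: F1 is a choice made here and is not a metric named in the abstract, and the function and variable names are hypothetical. The input numbers are the bulk ATAC-seq figures reported above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs quoted in the abstract for bulk ATAC-seq reads.
reported = {
    ("GATK", "SNV"): (0.92, 0.97),
    ("GATK", "indel"): (0.87, 0.82),
    ("VarCA", "SNV"): (0.99, 0.95),
    ("VarCA", "indel"): (0.93, 0.80),
}

# F1 is an assumption of this sketch, not a figure from the paper.
f1_by_caller = {key: f1_score(p, r) for key, (p, r) in reported.items()}

for (caller, variant_type), f1 in sorted(f1_by_caller.items()):
    print(f"{caller:5s} {variant_type:5s} F1 = {f1:.3f}")
```

Under this summary, VarCA scores higher than GATK for both SNVs and indels on the bulk data, which is consistent with the abstract's claim of best overall performance.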


2017 ◽  
Author(s):  
Zhiting Wei ◽  
Funan He ◽  
Guohui Chuai ◽  
Hanhui Ma ◽  
Zhixi Su ◽  
...  

To the Editor: Schaefer et al.1 (referred to as Study_1) recently presented the provocative conclusion that CRISPR-Cas9 nuclease can induce many unexpected off-target mutations across the genome that arise from sites with poor homology to the gRNA. As Wilson et al.2 pointed out, however, the selection of a co-housed mouse as the control is insufficient to attribute the observed mutation differences between the CRISPR-treated mice and control mice to CRISPR-Cas9. Therefore, the causes of these mutations need to be further investigated. In 2015, Iyer et al.3 (referred to as Study_2) used Cas9 and a pair of sgRNAs to mutate the Ar gene in vivo, and off-target mutations were investigated by comparing the control mice and the offspring of the modified mice. After analyzing the whole genome sequencing (WGS) data of the offspring and the control mice, they claimed that off-target mutations are rare from CRISPR-Cas9 engineering. Notably, their study only focused on indel off-target mutations. We re-analyzed the WGS data of these two studies and detected both single nucleotide variants (SNVs) and indel mutations.


2020 ◽  
Author(s):  
Lucy Crooks ◽  
Johnathan Cooper-Knock ◽  
Paul R. Heath ◽  
Ahmed Bouhouche ◽  
Elmostafa El Fahime ◽  
...  

Abstract Background Large-scale human sequencing projects have described around a hundred million single nucleotide variants (SNVs), but have predominantly focused on individuals with European ancestry despite the fact that genetic diversity is expected to be highest in Africa, where Homo sapiens evolved and has maintained a large population for the longest time. The more recent African Genome Variation Project examined several African populations but these were all located south of the Sahara. Morocco is on the northwest coast of Africa and mostly lies north of the Sahara, which makes it very attractive for studying genetic diversity. Recent genomic data of Taforalt individuals in Eastern Morocco revealed 15,000-year-old modern humans and suggested that North African individuals are expected to show genetic differences from previously studied African populations. Results We present single nucleotide variant (SNV) results from whole genome sequencing (WGS) of three Moroccans. From a total of 5.9 million SNVs detected, over 200,000 were not identified by 1000G. We provide a summary of the SNVs by genomic position, gene context and effect on protein coding. Comparison of genome-wide information of the Moroccan individuals to individuals from 1000G by principal component analysis revealed a substantial genomic distinction between the Moroccan population and sub-Saharan African populations. Conclusions We conclude that Moroccan samples lie in the middle of the previously observed cline between populations of European and African ancestry. WGS of Moroccan individuals can identify a large number of new SNVs and aid in functional characterisation of the genome.


Author(s):  
Se Jin Park ◽  
Gwan Woo Ku ◽  
Su Yel Lee ◽  
Daeun Kang ◽  
Wan Jin Hwang ◽  
...  

There are many epidemiological studies asserting that fine dust causes lung cancer, but the biological mechanism is not clear. This study was conducted to investigate the effect of PM10 (particulate matter less than 10 μm) on single nucleotide variants through whole genome sequencing in lung epithelial cancer cell lines (HCC-827, NCI-H358) that have been exposed to PM10. The two cell lines were exposed to PM10 for 15 days. We performed experimental and next generation sequencing analyses on an experimental group that had been exposed to PM10 as well as an unexposed control group. After exposure to PM10, 3005 single nucleotide variants were newly identified in the NCI-H358 group, and 4402 mutations were identified in the HCC-827 group. We analyzed these single nucleotide variants with the Mutalisk program. We observed kataegis in chromosome 1 in NCI-H358 and chromosome 7 in HCC-827. In the mutational signature analysis, COSMIC mutational signature 5 was the highest in both the HCC-827 and NCI-H358 groups, with cosine similarities of 0.964 in the HCC-827 group and 0.979 in the NCI-H358 group. The etiology of COSMIC mutational signature 5 is unknown at present. Well-designed studies are needed to determine whether environmental factors, such as PM10, cause COSMIC mutational signature 5.
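The cosine similarity quoted above measures how closely an observed mutation spectrum matches a reference signature, with 1.0 meaning the two have identical proportions. A minimal sketch of the calculation follows. The six-class spectra are hypothetical toy values, not data from the study (real COSMIC signatures are typically defined over 96 trinucleotide contexts), and this is not the Mutalisk implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy spectra over six substitution classes (illustration only).
observed_counts = [30, 25, 10, 20, 10, 5]                 # mutation counts in a sample
reference_signature = [0.28, 0.27, 0.08, 0.22, 0.10, 0.05]  # signature proportions

similarity = cosine_similarity(observed_counts, reference_signature)
print(f"cosine similarity = {similarity:.3f}")
```

Because the measure depends only on direction, raw counts and normalised proportions can be compared directly without rescaling.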


F1000Research ◽  
2020 ◽  
Vol 9 ◽  
pp. 63 ◽  
Author(s):  
Maxime Garcia ◽  
Szilveszter Juhos ◽  
Malin Larsson ◽  
Pall I. Olason ◽  
Marcel Martin ◽  
...  

Whole-genome sequencing (WGS) is a fundamental technology for research to advance precision medicine, but the limited availability of portable and user-friendly workflows for WGS analyses poses a major challenge for many research groups and hampers scientific progress. Here we present Sarek, an open-source workflow to detect germline variants and somatic mutations based on sequencing data from WGS, whole-exome sequencing (WES), or gene panels. Sarek features (i) easy installation, (ii) robust portability across different computer environments, (iii) comprehensive documentation, (iv) transparent and easy-to-read code, and (v) extensive quality metrics reporting. Sarek is implemented in the Nextflow workflow language and supports both Docker and Singularity containers as well as Conda environments, making it ideal for easy deployment on any POSIX-compatible computer and in cloud compute environments. Sarek follows the GATK best-practice recommendations for read alignment and pre-processing, and includes a wide range of software for the identification and annotation of germline and somatic single-nucleotide variants, insertion and deletion variants, structural variants, tumour sample purity, and variations in ploidy and copy number. Sarek offers easy, efficient, and reproducible WGS analyses, and can readily be used both as a production workflow at sequencing facilities and as a powerful stand-alone tool for individual research groups. The Sarek source code, documentation and installation instructions are freely available at https://github.com/nf-core/sarek and at https://nf-co.re/sarek/.


2021 ◽  
Vol 3 (3) ◽  
Author(s):  
Farah Qaiser ◽  
Tara Sadoway ◽  
Yue Yin ◽  
Quratulain Zulfiqar Ali ◽  
Charlotte M Nguyen ◽  
...  

Abstract Epilepsies are a group of common neurological disorders with a substantial genetic basis. Despite this, the molecular diagnosis of epilepsies remains challenging due to their heterogeneity. Studies utilizing whole-genome sequencing may provide additional insights into genetic causes of epilepsies of unknown aetiology. Whole-genome sequencing was used to evaluate a cohort of adults with unexplained developmental and epileptic encephalopathies (n = 30), for whom prior genetic tests, including whole-exome sequencing in some cases, were negative or inconclusive. Rare single nucleotide variants, insertions/deletions, copy number variants and tandem repeat expansions were analysed. Seven pathogenic or likely pathogenic single nucleotide variants, and two pathogenic deleterious copy number variants were identified in nine patients (32.1% of the cohort). One of the copy number variants, identified in a patient with Lennox–Gastaut syndrome, was too small to be detected by chromosomal microarray techniques. We also identified two tandem repeat expansions with clinical implications in two other patients with Lennox–Gastaut syndrome: a CGG repeat expansion in the 5′ untranslated region of DIP2B, and a CTG expansion in ATXN8OS (previously implicated in spinocerebellar ataxia type 8). Three patients had KCNA2 pathogenic variants. One of them died of sudden unexpected death in epilepsy. The other two patients had, in addition to a KCNA2 variant, a second de novo variant impacting potential epilepsy-relevant genes (KCNIP4 and UBR5). Overall, whole-genome sequencing provided a genetic explanation in 32.1% of the total cohort. This is also the first report of coding and non-coding tandem repeat expansions identified in patients with Lennox–Gastaut syndrome. This study demonstrates that using whole-genome sequencing, the examination of multiple types of rare genetic variation, including those found in the non-coding region of the genome, can help resolve unexplained epilepsies.


2019 ◽  
Vol 2019 ◽  
pp. 1-8
Author(s):  
Luobu Gesang ◽  
Lamu Gusang ◽  
Ciren Dawa ◽  
Gawa Gesang ◽  
Kang Li

Background. The hypoxic conditions at high altitudes are a great threat to survival, creating pressure for adaptation. A growing number of high-altitude residents fail to adapt and develop high-altitude polycythemia (HAPC), a condition featuring excessive erythrocytosis. The etiology of HAPC, a high-altitude sickness, is still unclear. Methods. In this study, we report a whole-genome sequencing-based study of 10 native Tibetans with HAPC and 10 control subjects, followed by genotyping of 21 variants selected from the discovered single nucleotide variants (SNVs) in an independent cohort (232 cases and 266 controls). Results. We discovered the egl nine homologue 3 (egln3/phd3) (14q13.1, rs1346902, P=1.91×10⁻⁵) and PPP1R2P1 (Protein Phosphatase 1 Regulatory Inhibitor Subunit 2) gene (6p21.32, rs521539, P=0.012). Our results provide an unbiased framework to identify etiological mechanisms of HAPC and show that egln3/phd3 and PPP1R2P1 may be associated with susceptibility to HAPC. Egln3/phd3b is associated with hypoxia-inducible factor subunit α (HIFα). Protein Phosphatase 1 Regulatory Inhibitor is associated with reactive oxygen species (ROS) and oxidative stress. Conclusions. Our genome sequencing conducted in Tibetan HAPC patients identified egln3/phd3 and PPP1R2P1 as associated with HAPC.


F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 1336 ◽  
Author(s):  
Gianmarco Contino ◽  
Matthew D. Eldridge ◽  
Maria Secrier ◽  
Lawrence Bower ◽  
Rachael Fels Elliott ◽  
...  

Esophageal adenocarcinoma (EAC) is highly mutated and molecularly heterogeneous. The number of cell lines available for study is limited and their genome has been only partially characterized. The availability of an accurate annotation of their mutational landscape is crucial for accurate experimental design and correct interpretation of genotype-phenotype findings. We performed high coverage, paired end whole genome sequencing on eight EAC cell lines—ESO26, ESO51, FLO-1, JH-EsoAd1, OACM5.1 C, OACP4 C, OE33, SK-GT-4—all verified against original patient material, and one esophageal high grade dysplasia cell line, CP-D. We have made available the aligned sequence data and report single nucleotide variants (SNVs), small insertions and deletions (indels), and copy number alterations, identified by comparison with the human reference genome and known single nucleotide polymorphisms (SNPs). We compare these putative mutations to mutations found in primary tissue EAC samples, to inform the use of these cell lines as a model of EAC.

