scholarly journals Deep whole-genome sequencing of 3 cancer cell lines on 2 sequencing platforms

2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Kanika Arora ◽  
Minita Shah ◽  
Molly Johnson ◽  
Rashesh Sanghvi ◽  
Jennifer Shelton ◽  
...  

AbstractTo test the performance of a new sequencing platform, develop an updated somatic calling pipeline and establish a reference for future benchmarking experiments, we performed whole-genome sequencing of 3 common cancer cell lines (COLO-829, HCC-1143 and HCC-1187) along with their matched normal cell lines to great sequencing depths (up to 278x coverage) on both Illumina HiSeqX and NovaSeq sequencing instruments. Somatic calling was generally consistent between the two platforms despite minor differences at the read level. We designed and implemented a novel pipeline for the analysis of tumor-normal samples, using multiple variant callers. We show that coupled with a high-confidence filtering strategy, the use of combination of tools improves the accuracy of somatic variant calling. We also demonstrate the utility of the dataset by creating an artificial purity ladder to evaluate the somatic pipeline and benchmark methods for estimating purity and ploidy from tumor-normal pairs. The data and results of the pipeline are made accessible to the cancer genomics community.

2021 ◽  
Vol 20 ◽  
pp. 117693512110492
Author(s):  
Ahmed Ibrahim Samir Khalil ◽  
Anupam Chattopadhyay ◽  
Amartya Sanyal

Background: The revolution in next-generation sequencing (NGS) technology has allowed easy access and sharing of high-throughput sequencing datasets of cancer cell lines and their integrative analyses. However, long-term passaging and culture conditions introduce high levels of genomic and phenotypic diversity in established cell lines resulting in strain differences. Thus, clonal variation in cultured cell lines with respect to the reference standard is a major barrier in systems biology data analyses. Therefore, there is a pressing need for a fast and entry-level assessment of clonal variations within cell lines using their high-throughput sequencing data. Results: We developed a Python-based software, AStra, for de novo estimation of the genome-wide segmental aneuploidy to measure and visually interpret strain-level similarities or differences of cancer cell lines from whole-genome sequencing (WGS). We demonstrated that aneuploidy spectrum can capture the genetic variations in 27 strains of MCF7 breast cancer cell line collected from different laboratories. Performance evaluation of AStra using several cancer sequencing datasets revealed that cancer cell lines exhibit distinct aneuploidy spectra which reflect their previously-reported karyotypic observations. Similarly, AStra successfully identified large-scale DNA copy number variations (CNVs) artificially introduced in simulated WGS datasets. Conclusions: AStra provides an analytical and visualization platform for rapid and easy comparison between different strains or between cell lines based on their aneuploidy spectra solely using the raw BAM files representing mapped reads. We recommend AStra for rapid first-pass quality assessment of cancer cell lines before integrating scientific datasets that employ deep sequencing. AStra is an open-source software and is available at https://github.com/AISKhalil/AStra .


2019 ◽  
Author(s):  
Kanika Arora ◽  
Minita Shah ◽  
Molly Johnson ◽  
Rashesh Sanghvi ◽  
Jennifer Shelton ◽  
...  

AbstractTo test the performance of a new sequencing platform, develop an updated somatic calling pipeline and establish a reference for future benchmarking experiments, we sequenced 3 common cancer cell lines along with their matched normal cell lines to great sequencing depths (up to 278X coverage) on both Illumina HiSeqX and NovaSeq sequencing instruments. Somatic calling was generally consistent between the two platforms despite minor differences at the read level. We designed and implemented a novel pipeline for the analysis of tumor-normal samples, using multiple variant callers. We show that coupled with a high-confidence filtering strategy, it improves the accuracy of somatic calls. We also demonstrate the utility of the dataset by creating an artificial purity ladder to evaluate the somatic pipeline and benchmark methods for estimating purity and ploidy from tumor-normal pairs. The data and results of the pipeline are made accessible to the cancer genomics community.


2021 ◽  
Author(s):  
Niantao Deng ◽  
Andre Minoche ◽  
Kate Harvey ◽  
Andrei Goga ◽  
Alex Swarbrick

Abstract BackgroundBreast cancer cell lines (BCCLs) and patient-derived xenografts (PDX) are the most frequently used models in breast cancer research. Despite their widespread usage, genome sequencing of these models is incomplete, with previous studies only focusing on targeted gene panels, whole exome or shallow whole genome sequencing. Deep whole genome sequencing is the most sensitive and accurate method to detect single nucleotide variants and indels, gene copy number and structural events such as gene fusions. ResultsHere we describe deep whole genome sequencing (WGS) of commonly used BCCL and PDX models using the Illumina X10 platform with an average ~ 60x coverage. We identify novel genomic alterations, including point mutations and genomic rearrangements at base-pair resolution, compared to previously available sequencing data. Through integrative analysis with publicly available functional screening data, we annotate new genomic features likely to be of biological significance. CSMD1 , previously identified as a tumor suppressor gene in various cancer types, including head and neck, lung and breast cancers, has been identified with deletion in 50% of our PDX models, suggesting an important role in aggressive breast cancers. ConclusionsOur WGS data provides a comprehensive genome sequencing resource of these models.


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Kelley Paskov ◽  
Jae-Yoon Jung ◽  
Brianna Chrisman ◽  
Nate T. Stockham ◽  
Peter Washington ◽  
...  

Abstract Background As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates since it allows us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample. Results We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method’s versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites. Conclusion Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.


2021 ◽  
Author(s):  
Niantao Deng ◽  
Andre Minoche ◽  
Kate Harvey ◽  
Meng Li ◽  
Juliane Winkler ◽  
...  

Abstract Background: Breast cancer cell lines (BCCLs) and patient-derived xenografts (PDX) are the most frequently used models in breast cancer research. Despite their widespread usage, genome sequencing of these models is incomplete, with previous studies only focusing on targeted gene panels, whole exome or shallow whole genome sequencing. Deep whole genome sequencing is the most sensitive and accurate method to detect single nucleotide variants and indels, gene copy number and structural events such as gene fusions.Results: Here we describe deep whole genome sequencing (WGS) of commonly used BCCL and PDX models using the Illumina X10 platform with an average ~ 60x coverage. We identify novel genomic alterations, including point mutations and genomic rearrangements at base-pair resolution, compared to previously available sequencing data. Through integrative analysis with publicly available functional screening data, we annotate new genomic features likely to be of biological significance. CSMD1, previously identified as a tumor suppressor gene in various cancer types, including head and neck, lung and breast cancers, has been identified with deletion in 50% of our PDX models, suggesting an important role in aggressive breast cancers. Conclusions: Our WGS data provides a comprehensive genome sequencing resource of these models.


2019 ◽  
Author(s):  
Dmitriy Korostin ◽  
Nikolay Kulemin ◽  
Vladimir Naumov ◽  
Vera Belova ◽  
Dmitriy Kwon ◽  
...  

AbstractBackgroundMGISEQ-2000 developed by MGI Tech Co. Ltd. (a subsidiary of the BGI Group) is a new competitor of such next-generation sequencing platforms as NovaSeq and HiSeq (Illumina). Its sequencing principle is based on the DNB and cPAS technologies, which were also used in the previous version of the BGISEQ-500 device. However, the reagents for MGISEQ-2000 have been refined and the platform utilizes updated software. The cPAS method is an advanced technology based on cPAL previously created by Complete Genomics.ResultIn this paper, the authors compare the results of the whole-genome sequencing of a DNA sample from a Russian female donor performed on MGISEQ-2000 and Illumina HiSeq 2500 (both PE150). Two platforms were compared in terms of sequencing quality, number of errors and performance. Additionally, we performed variant calling using four different software packages: Samtools mpileaup, Strelka2, Sentieon, and GATK. The accuracy of single nucleotide polymorphism (SNP) detection was similar in the data generated by MGISEQ-2000 and HiSeq 2500, which was used as a reference. At the same time, a separate indel analysis of the overall error rate revealed similar FPR values and lower sensitivity.Conclusionsit may be concluded with confidence that the data generated by the analyzed sequencing systems is characterized by comparable magnitudes of error and that MGISEQ-2000 can be used for a wide range of research tasks on par with HiSeq 2500.


2021 ◽  
Vol 9 (8) ◽  
pp. 1585
Author(s):  
Ana C. Reis ◽  
Liliana C. M. Salvador ◽  
Suelee Robbe-Austerman ◽  
Rogério Tenreiro ◽  
Ana Botelho ◽  
...  

Classical molecular analyses of Mycobacterium bovis based on spoligotyping and Variable Number Tandem Repeat (MIRU-VNTR) brought the first insights into the epidemiology of animal tuberculosis (TB) in Portugal, showing high genotypic diversity of circulating strains that mostly cluster within the European 2 clonal complex. Previous surveillance provided valuable information on the prevalence and spatial occurrence of TB and highlighted prevalent genotypes in areas where livestock and wild ungulates are sympatric. However, links at the wildlife–livestock interfaces were established mainly via classical genotype associations. Here, we apply whole genome sequencing (WGS) to cattle, red deer and wild boar isolates to reconstruct the M. bovis population structure in a multi-host, multi-region disease system and to explore links at a fine genomic scale between M. bovis from wildlife hosts and cattle. Whole genome sequences of 44 representative M. bovis isolates, obtained between 2003 and 2015 from three TB hotspots, were compared through single nucleotide polymorphism (SNP) variant calling analyses. Consistent with previous results combining classical genotyping with Bayesian population admixture modelling, SNP-based phylogenies support the branching of this M. bovis population into five genetic clades, three with apparent geographic specificities, as well as the establishment of an SNP catalogue specific to each clade, which may be explored in the future as phylogenetic markers. The core genome alignment of SNPs was integrated within a spatiotemporal metadata framework to further structure this M. bovis population by host species and TB hotspots, providing a baseline for network analyses in different epidemiological and disease control contexts. WGS of M. bovis isolates from Portugal is reported for the first time in this pilot study, refining the spatiotemporal context of TB at the wildlife–livestock interface and providing further support to the key role of red deer and wild boar on disease maintenance. The SNP diversity observed within this dataset supports the natural circulation of M. bovis for a long time period, as well as multiple introduction events of the pathogen in this Iberian multi-host system.


Sign in / Sign up

Export Citation Format

Share Document