CNVkit-RNA: Copy number inference from RNA-Sequencing data

2018 ◽  
Author(s):  
Eric Talevich ◽  
A. Hunter Shain

Abstract RNA-sequencing is most commonly used to measure gene expression, but it is possible to extract genotypic information from RNA-sequencing data, too. Point mutations and translocations can be detected when they occur in expressed genes; however, there are few software solutions to infer copy number information from RNA-sequencing data. This is because a gene’s expression is dictated by a number of variables, including, but not limited to, copy number variation. Here, we report new functionalities within the software package CNVkit that enable copy number inference from RNA-sequencing data. First, CNVkit removes technical variation in gene expression associated with GC-content and transcript length. Next, CNVkit assigns a weight, dictated by several variables, to each transcript with the net effect of preferentially inferring copy number from highly and stably expressed genes. We benchmarked our approach on 105 melanomas from The Cancer Genome Atlas project and observed a high degree of concordance (R = 0.739) between our estimates and those from array comparative genomic hybridization (aCGH) on the same samples. After initial configuration, the software requires few inputs, is able to process a batch of up to 100 samples in less than ten minutes, and can be used in conjunction with pre-existing features of CNVkit, including visualization tools. Overall, we present a rapid, user-friendly software solution to infer copy number information from gene expression data.
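The weighting idea in this abstract can be illustrated with a toy calculation. The sketch below is not CNVkit's implementation: the abstract does not state the weighting formula, so `gene_weight`, the pseudocount of 1, and the `(sample_expr, ref_mean, ref_sd)` input layout are all hypothetical stand-ins, and the GC-content and transcript-length correction is left out entirely. It only shows how a weighted mean lets highly and stably expressed genes dominate a segment's copy number estimate.

```python
import math

def log2_ratio(sample_expr: float, ref_mean: float) -> float:
    """Log2 ratio of sample expression to the reference mean (pseudocount of 1)."""
    return math.log2((sample_expr + 1) / (ref_mean + 1))

def gene_weight(ref_mean: float, ref_sd: float) -> float:
    """Hypothetical weight: larger for highly expressed genes, smaller for variable ones."""
    return math.log2(ref_mean + 1) / (1 + ref_sd)

def weighted_segment_log2(genes):
    """Weighted mean log2 ratio over the genes in one genomic segment.

    `genes` is a list of (sample_expr, ref_mean, ref_sd) tuples.
    """
    weights = [gene_weight(mean, sd) for _, mean, sd in genes]
    total = sum(weights)
    if total == 0:
        raise ValueError("no informative genes in this segment")
    ratios = [log2_ratio(sample, mean) for sample, mean, _ in genes]
    return sum(w * r for w, r in zip(weights, ratios)) / total

segment = [
    (31.0, 15.0, 0.0),  # highly and stably expressed: log2 ratio +1, weight 4
    (7.0, 15.0, 3.0),   # same expression level but variable: log2 ratio -1, weight 1
    (0.0, 0.0, 5.0),    # unexpressed in the reference: weight 0, so it is ignored
]
print(weighted_segment_log2(segment))
```

With these made-up numbers the stable gene pulls the segment estimate toward its own log2 ratio, which is the "net effect" the abstract describes.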


2020 ◽  
Author(s):  
Christopher W. Whelan ◽  
Robert E. Handsaker ◽  
Giulio Genovese ◽  
Seva Kashin ◽  
Monkol Lek ◽  
...  

Abstract Two intriguing forms of genome structural variation (SV) – dispersed duplications, and de novo rearrangements of complex, multi-allelic loci – have long escaped genomic analysis. We describe a new way to find and characterize such variation by utilizing identity-by-descent (IBD) relationships between siblings together with high-precision measurements of segmental copy number. Analyzing whole-genome sequence data from 706 families, we find hundreds of “IBD-discordant” (IBDD) CNVs: loci at which siblings’ CNV measurements and IBD states are mathematically inconsistent. We found that commonly-IBDD CNVs identify dispersed duplications; we mapped 95 of these common dispersed duplications to their true genomic locations through family-based linkage and population linkage disequilibrium (LD), and found several to be in strong LD with genome-wide association (GWAS) signals for common diseases or gene expression variation at their revealed genomic locations. Other CNVs that were IBDD in a single family appear to involve de novo mutations in complex and multi-allelic loci; we identified 26 de novo structural mutations that had not been previously detected in earlier analyses of the same families by diverse SV analysis methods. These included a de novo mutation of the amylase gene locus and multiple de novo mutations at chromosome 15q14. Combining these complex mutations with more-conventional CNVs, we estimate that segmental mutations larger than 1kb arise in about one per 22 human meioses. These methods are complementary to previous techniques in that they interrogate genomic regions that are home to segmental duplication, high CNV allele frequencies, and multi-allelic CNVs.
Author Summary Copy number variation is an important form of genetic variation in which individuals differ in the number of copies of segments of their genomes. Certain aspects of copy number variation have traditionally been difficult to study using short-read sequencing data.
For example, standard analyses often cannot tell whether the duplicated copies of a segment are located near the original copy or are dispersed to other regions of the genome. Another aspect of copy number variation that has been difficult to study is the detection of mutations in the copy number of DNA segments passed down from parents to their children, particularly when the mutations affect genome segments which already display common copy number variation in the population. We develop an analytical approach to solving these problems when sequencing data is available for all members of families with at least two children. This method is based on determining the number of parental haplotypes the two siblings share at each location in their genome, and using that information to determine the possible inheritance patterns that might explain the copy numbers we observe in each family member. We show that dispersed duplications and mutations can be identified by looking for copy number variants that do not follow these expected inheritance patterns. We use this approach to determine the location of 95 common duplications which are dispersed to distant regions of the genome, and demonstrate that these duplications are linked to genetic variants that affect disease risk or gene expression levels. We also identify a set of copy number mutations not detected by previous analyses of sequencing data from a large cohort of families, and show that repetitive and complex regions of the genome undergo frequent mutations in copy number.
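The inheritance check described in the Author Summary can be sketched as a brute-force search. This is a simplified illustration, not the authors' method: it assumes exact integer copy numbers, a single locus, and no measurement noise, and the function names `haplotype_splits` and `ibd_consistent` are hypothetical. It splits each parent's total copy number across two haplotypes in every possible way and asks whether any transmission pattern matching the siblings' IBD state reproduces both siblings' copy numbers.

```python
from itertools import product

def haplotype_splits(diploid_cn: int):
    """Every way to divide a parent's total copy number between two haplotypes."""
    return [(a, diploid_cn - a) for a in range(diploid_cn + 1)]

def ibd_consistent(father_cn, mother_cn, sib1_cn, sib2_cn, ibd_state):
    """True if some inheritance pattern explains both siblings' copy numbers.

    ibd_state is the number of parental haplotypes the siblings share (0, 1 or 2).
    Assumes a single locus, exact integer copy numbers and no new mutation.
    """
    for father, mother in product(haplotype_splits(father_cn),
                                  haplotype_splits(mother_cn)):
        # p* / m* index the paternal / maternal haplotype each sibling received
        for p1, m1, p2, m2 in product(range(2), repeat=4):
            shared = (p1 == p2) + (m1 == m2)
            if shared != ibd_state:
                continue
            if (father[p1] + mother[m1] == sib1_cn
                    and father[p2] + mother[m2] == sib2_cn):
                return True
    return False

# Siblings sharing both parental haplotypes must carry equal copy numbers.
print(ibd_consistent(2, 2, 2, 2, ibd_state=2))
print(ibd_consistent(2, 2, 2, 3, ibd_state=2))
```

In this toy model, a locus for which `ibd_consistent` returns `False` is the kind of inconsistency the abstract calls IBD-discordant, pointing to either a duplication located somewhere other than the measured locus or a new mutation.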



Blood ◽  
2016 ◽  
Vol 128 (22) ◽  
pp. 374-374 ◽  
Author(s):  
Chase Miller ◽  
Jennifer Yesil ◽  
Mary Derome ◽  
Andrea Donnelly ◽  
Jean Marrian ◽  
...  

Abstract Fluorescent in situ hybridization (FISH) is commonly used in the multiple myeloma field to subtype and risk-stratify patients. There are many benefits to FISH-based assays, which are widely used around the world and represent true single-cell assays. However, there are significant discrepancies in the specific assays, utilization of reflex testing strategies, and enumeration requirements between clinical centers. By comparison, next-generation sequencing tests can be designed to simultaneously detect the copy number abnormalities and translocations detected by clinical FISH along with gene mutations that cannot be detected by FISH. As part of the MMRF CoMMpass Study, we compared the results obtained using clinical FISH assays with sequencing-based FISH (Seq-FISH) results. Clinical FISH reports from a random subset of 339 CoMMpass patients were extracted by a single individual based on the ISCN result lines of each report. To validate the accuracy of the central data extraction, two independent cross validations of 10% of the cohort were performed, after which our data entry error rate is expected to be less than 0.348%. The Seq-FISH results were extracted from the whole genome sequencing data available from each patient using a rapid and fully automated informatics process, and the results were cross-validated using the matching exome sequencing data for copy number abnormalities and by RNA sequencing data for dysregulated immunoglobulin translocation target genes. There were 230 patients with clinical FISH and Seq-FISH results. In this cohort, 151 translocations were identified by Seq-FISH. This includes translocations to MYC, CCND2, MAFA, and those involving IgK and IgL, which are not tested by clinical FISH. After filtering non-tested translocations there are 118 translocations identified by Seq-FISH.
Only 97 of these translocations had a clinical FISH assay performed, with 89 (91.75%) of these being detected by clinical FISH, yet spiked target gene expression was observed in all 89 cases by RNA sequencing. Conversely, 93 translocations were called by clinical FISH; of these, 89 were called by Seq-FISH (95.7%). Of the 4 translocations only called by clinical FISH, 3 were t(4;14) and 1 was a t(11;14). In two of these t(4;14) cases we did observe spiked target gene expression by RNA sequencing, suggesting these are false negatives by Seq-FISH. However, the remaining two events appear to be false positive clinical FISH results. The t(4;14) event was only observed in 1/200 cells and a co-occurring t(11;14) was also called, which was confirmed by Seq-FISH and spiked gene expression. Similarly, the one t(11;14) was observed in 3/56 cells but a del13q14 was seen in 47/50 cells; unfortunately, RNA sequencing data is not available to cross-validate in this case. Plasma cell enrichment or identification is commonly used to prepare myeloma samples for FISH because even in myeloma, the total plasma cell percentage can be low (median 8.3% in the MMRF CoMMpass Baseline Cohort). Therefore, performing FISH on a sample without performing purification or plasma cell identification will indiscriminately assay non-plasma cells and limit the efficacy of the assay. We looked at the two most common translocations in myeloma, t(4;14) and t(11;14), to test the effect of enrichment on sensitivity. Sensitivity was higher for both sets of translocations in the enriched cohort. There was 1 false negative in the enriched population, yielding sensitivities of 100% (32/32) and 95% (19/20) for CCND1 and WHSC1, respectively. For those reports that did not indicate enrichment was performed, the observed sensitivities were 86.36% (19/22) and 92.86% (13/14).
Seq-FISH identified almost all of the translocations called by clinical FISH and, simultaneously, identified 30 translocations missed by clinical FISH. The translocations that were not reported by clinical FISH can be attributed to a mixture of the correct assay not being performed and the translocation being missed even though the assay was performed. We believe that Seq-FISH is a viable alternative to clinical FISH, with similar specificity and greater sensitivity. It is important to note that a single Seq-FISH assay is sufficient to investigate all translocations, while each translocation must be investigated separately with clinical FISH. As such, Seq-FISH obviates the concern that a translocation would be missed because the correct assay was not performed. Disclosures McBride: Instat: Employment.
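The percentages quoted in this abstract follow directly from the reported counts. A minimal sketch that recomputes them is given below; the helper name `percent` and the dictionary labels are illustrative only, and the two "no enrichment reported" figures are labelled by position because the abstract does not say explicitly which translocation each belongs to.

```python
def percent(detected: int, total: int) -> float:
    """Return detected/total as a percentage rounded to two decimals."""
    return round(100 * detected / total, 2)

# (detected, total) pairs as quoted in the abstract
counts = {
    "Seq-FISH calls also detected by clinical FISH": (89, 97),
    "clinical FISH calls also called by Seq-FISH": (89, 93),
    "enriched samples, CCND1": (32, 32),
    "enriched samples, WHSC1": (19, 20),
    "no enrichment reported, first figure": (19, 22),
    "no enrichment reported, second figure": (13, 14),
}

for label, (detected, total) in counts.items():
    print(f"{label}: {detected}/{total} = {percent(detected, total)}%")
```

Each computed value matches the figure given in the text (91.75%, 95.7%, 100%, 95%, 86.36% and 92.86%).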



2013 ◽  
Vol 2013 ◽  
pp. 1-7 ◽  
Author(s):  
Yan Guo ◽  
Quanghu Sheng ◽  
David C. Samuels ◽  
Brian Lehmann ◽  
Joshua A. Bauer ◽  
...  

Exome sequencing using next-generation sequencing technologies is a cost-efficient approach to selectively sequencing coding regions of the human genome for detection of disease variants. One of the lesser-known yet important applications of exome sequencing data is to identify copy number variation (CNV). There have been many exome CNV tools developed over the last few years, but the performance and accuracy of these programs have not been thoroughly evaluated. In this study, we systematically compared four popular exome CNV tools (CoNIFER, cn.MOPS, exomeCopy, and ExomeDepth) and evaluated their effectiveness against array comparative genome hybridization (array CGH) platforms. We found that exome CNV tools are capable of identifying CNVs, but they can have problems such as high false positives, low sensitivity, and duplication bias when compared to array CGH platforms. While exome CNV tools do serve their purpose for data mining, careful evaluation and additional validation are highly recommended. Based on all these results, we recommend CoNIFER and cn.MOPS for nonpaired exome CNV detection over the other two tools due to a low false-positive rate, although none of the four exome CNV tools performed at an outstanding level when compared to array CGH.



2013 ◽  
Vol 31 (15_suppl) ◽  
pp. 3623-3623
Author(s):  
F. Anthony San Lucas ◽  
Scott Kopetz ◽  
Paul A. Scheet ◽  
Eduardo Vilar Sanchez

3623 Background: Approximately 10% of colorectal cancers (CRCs) harbor a BRAF mutation (BRAFm). Patients with BRAFm tumors have poor prognosis and are a therapeutic challenge. A BRAFm gene expression signature has been reported (Popovici et al, JCO 2012), which can identify BRAFm tumors as well as BRAF wild-type tumors that display a similar expression pattern. Collectively, these tumors are termed BRAFm-like. Our goal was to validate this signature using next-generation sequencing and to discover novel therapies for BRAFm-like CRCs using a systems biology approach. Methods: We developed a semi-automated workflow, named Cancer In-silico Drug Discovery (CIDD), that integrates publicly available tools. To validate the BRAFm-like signature, we used CIDD to analyze the CRC dataset from The Cancer Genome Atlas Network (TCGA). Samples were stratified on BRAFm status using exome-sequencing, and expression profiles were inferred from RNA-sequencing. We matched expression profiles with drug-induced signatures inferred from the Connectivity Map (CMap) – a systems biology tool that contains expression data of cell lines treated with 1,500 compounds. CIDD statistically ranks candidate compounds and annotates them to pathways using public databases. Results: When applied to TCGA RNA-sequencing data, a classifier based on the BRAFm-like signature resulted in 93.3% sensitivity and 83.5% specificity for detecting BRAFm samples. When applied to Agilent gene expression data, this resulted in 80% sensitivity and 91.1% specificity. 41% of KRAS-mutated samples and 14% of double wild-type samples were predicted to be BRAFm-like. 100% of MSI-high and 18% of MSS samples were predicted to be BRAFm-like. Compounds near the top of our drug rankings include gefitinib and MG-262, a proteasome inhibitor. Conclusions: We have validated the BRAFm-like signature using RNA-sequencing and Agilent expression data from the TCGA, and showed a high degree of robustness across technologies.
We have identified EGFR and proteasome inhibitors as potential compounds to target BRAFm-like CRCs.



2020 ◽  
Vol 49 (D1) ◽  
pp. D1268-D1275
Author(s):  
Hsi-Yuan Huang ◽  
Jing Li ◽  
Yun Tang ◽  
Yi-Xian Huang ◽  
Yi-Gang Chen ◽  
...  

Abstract DNA methylation is an important epigenetic regulator in gene expression and has several roles in cancer and disease progression. MethHC version 2.0 (MethHC 2.0) is an integrated and web-based resource focusing on the aberrant methylomes of human diseases, specifically cancer. This paper presents an updated implementation of MethHC 2.0 by incorporating additional DNA methylomes and transcriptomes from several public repositories, including 33 human cancers, over 50 118 microarray and RNA sequencing data from TCGA and GEO, and accumulating up to 3586 manually curated data from >7000 collected published literature with experimental evidence. MethHC 2.0 has also been equipped with enhanced data annotation functionality and a user-friendly web interface for data presentation, search, and visualization. Provided features include clinical-pathological data, mutation and copy number variation, multiplicity of information (gene regions, enhancer regions, and CGI regions), and circulating tumor DNA methylation profiles, available for research such as biomarker panel design, cancer comparison, diagnosis, prognosis, therapy study and identifying potential epigenetic biomarkers. MethHC 2.0 is now available at http://awi.cuhk.edu.cn/~MethHC.



2014 ◽  
Author(s):  
Eric Talevich ◽  
A. Hunter Shain ◽  
Thomas Botton ◽  
Boris C. Bastian

Germline copy number variants (CNVs) and somatic copy number alterations (SCNAs) are of significant importance in syndromic conditions and cancer. Massively parallel sequencing is increasingly used to infer copy number information from variations in the read depth in sequencing data. However, this approach has limitations in the case of targeted re-sequencing, which leaves gaps in coverage between the regions chosen for enrichment and introduces biases related to the efficiency of target capture and library preparation. We present a method for copy number detection, implemented in the software package CNVkit, that uses both the targeted reads and the nonspecifically captured off-target reads to infer copy number evenly across the genome. This combination achieves both exon-level resolution in targeted regions and sufficient resolution in the larger intronic and intergenic regions to identify copy number changes. In particular, we successfully inferred copy number genome-wide at a resolution equivalent to 100 kilobases from a platform targeting as few as 293 genes. After normalizing read counts to a pooled reference, we evaluated and corrected for three sources of bias that explain most of the extraneous variability in the sequencing read depth: GC content, target footprint size and spacing, and repetitive sequences. We compared the performance of CNVkit to copy number changes identified by array comparative genomic hybridization. We packaged the components of CNVkit so that it is straightforward to use and provides visualizations, detailed reporting of significant features, and export options for compatibility with other software. Availability: http://github.com/etal/cnvkit



PLoS ONE ◽  
2021 ◽  
Vol 16 (8) ◽  
pp. e0256416
Author(s):  
Keller J. Toral ◽  
Mark A. Wuenschel ◽  
Esther P. Black

The identification of novel therapies, new strategies for combination of therapies, and repurposing of drugs approved for other indications are all important for continued progress in the fight against lung cancers. Antibodies that target immune checkpoints can unmask an immunologically hot tumor from the immune system of a patient. However, despite accounts of significant tumor regression resulting from these medications, most patients do not respond. In this study, we sought to use protein expression and RNA sequencing data from The Cancer Genome Atlas and two smaller studies deposited onto the Gene Expression Omnibus (GEO) to advance our hypothesis that inhibition of SHP-2, a tyrosine phosphatase, will improve the activity of immune checkpoint inhibitors (ICI) that target PD-1 or PD-L1 in lung cancers. We first collected protein expression data from The Cancer Proteome Atlas (TCPA) to study the association of SHP-2 and PD-L1 expression in lung adenocarcinomas. RNA sequencing data was collected from the same subjects through the NCI Genomic Data Commons and evaluated for expression of the PTPN11 (SHP-2) and CD274 (PD-L1) genes. We then analyzed RNA sequencing data from a series of melanoma patients who were either treatment naïve or resistant to ICI therapy. PTPN11 and CD274 expression was compared between groups. Finally, we analyzed gene expression and drug response data collected from 21 non-small cell lung cancer (NSCLC) patients for PTPN11 and CD274 expression. From the three studies, we hypothesize that the activity of SHP-2, rather than the expression, likely controls the expression of PD-L1 as only a weak relationship between PTPN11 and CD274 expression in either lung adenocarcinomas or melanomas was observed. Lastly, the expression of CD274, not PTPN11, correlates with response to ICI in NSCLC.



Author(s):  
Rebecca Elyanow ◽  
Ron Zeira ◽  
Max Land ◽  
Benjamin J. Raphael

Abstract Tumors are highly heterogeneous, consisting of cell populations with both transcriptional and genetic diversity. These diverse cell populations are spatially organized within a tumor, creating a distinct tumor microenvironment. A new technology called spatial transcriptomics can measure spatial patterns of gene expression within a tissue by sequencing RNA transcripts from a grid of spots, each containing a small number of cells. In tumor cells, these gene expression patterns represent the combined contribution of regulatory mechanisms, which alter the rate at which a gene is transcribed, and genetic diversity, particularly copy number aberrations (CNAs) which alter the number of copies of a gene in the genome. CNAs are common in tumors and often promote cancer growth through upregulation of oncogenes or downregulation of tumor-suppressor genes. We introduce a new method, STARCH (Spatial Transcriptomics Algorithm Reconstructing Copy-number Heterogeneity), to infer CNAs from spatial transcriptomics data. STARCH overcomes challenges in inferring CNAs from RNA-sequencing data by leveraging the observation that cells located nearby in a tumor are likely to share similar CNAs. We find that STARCH outperforms existing methods for inferring CNAs from RNA-sequencing data without incorporating spatial information.



2021 ◽  
Vol 39 (15_suppl) ◽  
pp. e14534-e14534
Author(s):  
Chan-Young Ock ◽  
Seunghwan Shin ◽  
Wonkyung Jung ◽  
Sangheon Ahn ◽  
Haejoon Kim ◽  
...  

e14534 Background: Novel immuno-oncology (IO) agents are promising, but showing their efficacy in early-phase clinical trials has been challenging due to limited enrichment strategies using practical biomarker platforms. We hypothesize that an artificial intelligence (AI)-powered spatial analysis of tumor-infiltrating lymphocytes (TIL) using practically feasible H&E slides can reflect a specific target gene expression derived from RNA sequencing. This enhances its potential application in early development of novel IO agents. Methods: An AI-powered spatial TIL analyzer, namely Lunit SCOPE IO, was developed with data from 2.8 × 10⁹ μm² of H&E-stained tissue regions and 5.9 × 10⁶ TILs from 3,166 whole slide images (WSIs) of multiple cancer types, annotated by board-certified pathologists. The Inflamed Score (IS) and Immune-Excluded Score (IES) were defined as the proportion of all tumor-containing 1 mm² tiles within a WSI classified as being of inflamed immune phenotype (high TIL density within cancer epithelium) and immune-excluded phenotype (low TIL density within cancer epithelium, but high TIL density within stroma), respectively. We used RNA sequencing data and H&E images from The Cancer Genome Atlas database, excluding those of mesenchymal origin (n = 7,467). Spearman's rank correlation between each gene's expression and IS or IES, respectively, was calculated. A correlation coefficient > 0.2 with a false discovery rate (FDR) < 1% was considered a significant correlation. Results: Of a total of 20,304 genes, 871 (4.3%) and 1,155 (5.7%) genes were significantly correlated with IS and IES, respectively. The IS was highly related to genes reflecting immune cytolytic activity and targets of approved immune checkpoint inhibitors (Table). Interestingly, it was also significantly correlated with target genes of novel IO such as TIGIT, LAG3, TIM3, IDO, Adenosine receptor A2A, OX40, ICOS, M-CSF, IL2, IL7, and IL12.
Moreover, the IES was exclusively correlated with the target genes of CEACAM, TGFB, and IL1. Conclusions: Expression levels of novel I-O target genes are correlated with three scores derived from AI-powered TIL analysis using H&E slides, which can be easily applied to clinical research.[Table: see text]
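The selection rule in the Methods (Spearman correlation above 0.2 with FDR below 1%) can be sketched in plain Python. This is an illustration under assumptions, not the authors' pipeline: the abstract does not name the FDR procedure, so Benjamini-Hochberg is assumed here, the function names are hypothetical, and p-values are taken as given inputs rather than computed.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals

def significant_genes(results, min_rho=0.2, max_fdr=0.01):
    """results maps gene name -> (rho, p_value); returns names passing both cutoffs."""
    names = list(results)
    qvals = benjamini_hochberg([results[g][1] for g in names])
    return [g for g, q in zip(names, qvals)
            if results[g][0] > min_rho and q < max_fdr]

print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
```

In practice the correlation and its p-value would usually come from a statistics library; the hand-written versions here only keep the example self-contained.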



2014 ◽  
Vol 2014 ◽  
pp. 1-13 ◽  
Author(s):  
Samuel M. Peterson ◽  
Jennifer L. Freeman

DNA copy number variation has long been associated with highly penetrant genomic disorders, but it was not until recently that the widespread occurrence of copy number variation among phenotypically normal individuals was recognized as a considerable source of genetic variation. It is also now appreciated that copy number variants (CNVs) play a role in the onset of complex diseases. Many of the complex diseases with which CNVs are associated are reported to be influenced by yet-to-be-identified environmental factors. It is hypothesized that exposure to environmental chemicals generates CNVs and influences disease onset and pathogenesis. In this study, a proof-of-principle experiment was completed with ethyl methanesulfonate (EMS) and cytosine arabinoside (Ara-C) to investigate the generation of CNVs using array comparative genomic hybridization (CGH) and the zebrafish vertebrate model system. Exposure to both chemicals resulted in CNVs. CNVs were detected in similar genomic regions among multiple exposure concentrations with EMS, and five CNVs were common among both chemicals. Furthermore, CNVs were correlated with altered gene expression. This study suggests that chemical exposure generates CNVs with impacts on gene expression, warranting further investigation of this phenomenon with environmental chemicals.


