CanVar: A resource for sharing germline variation in cancer patients

F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 2813 ◽  
Author(s):  
Daniel Chubb ◽  
Peter Broderick ◽  
Sara E. Dobbins ◽  
Richard S. Houlston

The advent of high-throughput sequencing has accelerated our ability to discover genes predisposing to disease and is transforming clinical genomic sequencing. In both contexts, knowledge of the spectrum and frequency of genetic variation in the general population and in disease cohorts is vital to the interpretation of sequencing data. While population-level data are becoming increasingly available from publicly accessible sources, as exemplified by the Exome Aggregation Consortium (ExAC), the availability of large-scale disease-specific frequency information is limited. These data are of particular importance for contextualising findings from clinical mutation screens and small gene discovery projects. This is especially true for cancer, which is typified by a number of hereditary predisposition syndromes. Although mutation frequencies in tumours are available from resources such as COSMIC and The Cancer Genome Atlas, a similar facility for germline variation is lacking. Here we present the Cancer Variation Resource (CanVar), an online database developed using the ExAC framework to provide open access to germline variant frequency data from the sequenced exomes of cancer patients. In its first release, CanVar catalogues the exomes of 1,006 familial early-onset colorectal cancer (CRC) patients sequenced at The Institute of Cancer Research. It is anticipated that CanVar will host data for additional cancers, providing a resource for others studying cancer predisposition and an example of how the research community can utilise the ExAC framework to share sequencing data.
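As a rough illustration of why cohort-specific frequencies matter, the sketch below compares a variant's allele frequency in a case series with its frequency in a population reference. The function names and all counts are hypothetical and are not drawn from CanVar or ExAC; only the cohort size of 1,006 comes from the abstract. It assumes an autosomal site at which every sample was genotyped.

```python
def allele_frequency(allele_count: int, n_samples: int) -> float:
    """Allele frequency at an autosomal site (two alleles per sample)."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    allele_number = 2 * n_samples
    if not 0 <= allele_count <= allele_number:
        raise ValueError("allele_count must lie between 0 and 2 * n_samples")
    return allele_count / allele_number


def frequency_ratio(case_af: float, population_af: float) -> float:
    """Ratio of case to population allele frequency (undefined if the latter is 0)."""
    if population_af == 0:
        raise ValueError("population allele frequency is zero; ratio is undefined")
    return case_af / population_af


# Made-up counts for illustration only.
case_af = allele_frequency(allele_count=4, n_samples=1006)         # case cohort
population_af = allele_frequency(allele_count=6, n_samples=60000)  # reference panel
print(f"case AF = {case_af:.5f}, population AF = {population_af:.5f}")
print(f"ratio = {frequency_ratio(case_af, population_af):.1f}")
```

A ratio like this is only a first-pass screen; it is not a statistical test and does not account for ancestry, coverage, or genotype quality.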

2020 ◽  
Author(s):  
Nan Li ◽  
Kai Yu ◽  
Ling Zhong ◽  
Dingyuan Zeng

Abstract Background. The prognosis for prostate cancer patients remains poor. High-throughput sequencing data provide a solid basis for identifying genes associated with cancer prognosis, but genetic markers are needed to predict the clinical outcome of prostate cancer. Methods. The Cancer Genome Atlas (TCGA) database (N = 551) was used to estimate the prognostic value of immune genes. RNA-seq and clinical follow-up data were downloaded from TCGA. The samples were randomly divided into training and test sets. Cox regression analyses and the least absolute shrinkage and selection operator (LASSO) were used to develop an immune risk score. Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) and single-sample Gene Set Enrichment Analysis (ssGSEA) were used for functional analysis. The Tumor Immune Estimation Resource (TIMER) was used to analyze the immune score, and RMS curves and clinical decision curve analysis were used to compare the model with published models. Results. Survival analyses revealed that 19 genes were significantly associated with overall survival (OS). A 10-gene signature was ultimately obtained through random forest feature selection. The risk score effectively stratified samples in the training, test, and external verification sets and in the full TCGA set. The 5-year survival AUC in the training set, the verification sets and the full TCGA set was around 0.7. Univariate and multivariate analyses showed that the 10-gene signature has good clinical predictive performance. TIMER analysis showed that immunosuppression may reduce the chances of survival for patients with prostate cancer. Compared with published models, our model has a higher C-index. Conclusion. We constructed a 10-gene signature as a new prognostic marker for predicting survival of prostate cancer patients.
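A gene-signature risk score of this kind is commonly computed as a weighted sum of expression values and then used to split patients into risk groups. The sketch below shows that general pattern only. The gene names, coefficients, and expression values are invented placeholders, and the median cutoff is an assumption, since the abstract does not state how the groups were defined.

```python
from statistics import median
from typing import Dict, List


def risk_score(expression: Dict[str, float], coefficients: Dict[str, float]) -> float:
    """Linear risk score: sum of coefficient * expression over signature genes."""
    return sum(coef * expression[gene] for gene, coef in coefficients.items())


def stratify_by_median(scores: List[float]) -> List[str]:
    """Label each sample 'high' if its score exceeds the cohort median, else 'low'."""
    cutoff = median(scores)
    return ["high" if score > cutoff else "low" for score in scores]


# Hypothetical two-gene signature with made-up coefficients.
coefficients = {"GENE_A": 0.5, "GENE_B": -0.25}
samples = [
    {"GENE_A": 2.0, "GENE_B": 4.0},
    {"GENE_A": 4.0, "GENE_B": 0.0},
    {"GENE_A": 0.0, "GENE_B": 4.0},
]

scores = [risk_score(sample, coefficients) for sample in samples]
groups = stratify_by_median(scores)
for score, group in zip(scores, groups):
    print(f"score = {score:+.2f} -> {group} risk")
```

In practice the coefficients would come from a fitted survival model, and the cutoff would be fixed on the training set before being applied to test or external cohorts.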


2019 ◽  
Author(s):  
Wikum Dinalankara ◽  
Qian Ke ◽  
Donald Geman ◽  
Luigi Marchionni

Abstract Given the ever-increasing amount of high-dimensional and complex omics data becoming available, it is increasingly important to discover simple but effective methods of analysis. Divergence analysis transforms each entry of a high-dimensional omics profile into a digitized (binary or ternary) code based on the deviation of the entry from a given baseline population. This is a novel framework that is significantly different from existing omics data analysis methods: it allows digitization of continuous omics data at the univariate or multivariate level, facilitates sample level analysis, and is applicable on many different omics platforms. The divergence package, available on the R platform through the Bioconductor repository collection, provides easy-to-use functions for carrying out this transformation. Here we demonstrate how to use the package with sample high throughput sequencing data from the Cancer Genome Atlas.
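The univariate ternary coding described above can be sketched in a few lines. This is not the divergence package's API or its estimator. It is a simplified stand-in that assumes the baseline range for each feature is an interval between two order statistics of the baseline samples, and all function names and values are hypothetical.

```python
from typing import List, Sequence, Tuple


def baseline_intervals(
    baseline: Sequence[Sequence[float]],
    lower_q: float = 0.0,
    upper_q: float = 1.0,
) -> List[Tuple[float, float]]:
    """Per-feature (low, high) interval taken from sorted baseline values.

    With the default quantiles this is simply the min and max of each feature.
    """
    n_samples = len(baseline)
    n_features = len(baseline[0])
    intervals = []
    for j in range(n_features):
        column = sorted(row[j] for row in baseline)
        low = column[int(lower_q * (n_samples - 1))]
        high = column[int(upper_q * (n_samples - 1))]
        intervals.append((low, high))
    return intervals


def digitize(profile: Sequence[float], intervals: Sequence[Tuple[float, float]]) -> List[int]:
    """Ternary code per feature: -1 below baseline, 0 within, +1 above."""
    codes = []
    for value, (low, high) in zip(profile, intervals):
        if value < low:
            codes.append(-1)
        elif value > high:
            codes.append(1)
        else:
            codes.append(0)
    return codes


# Toy baseline: three samples, two features.
baseline = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]]
intervals = baseline_intervals(baseline)
print(digitize([0.5, 11.0], intervals))
print(digitize([2.0, 15.0], intervals))
```

A binary code follows by taking the absolute value of each entry, which records only whether a feature diverges from the baseline, not in which direction.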


Viruses ◽  
2021 ◽  
Vol 13 (10) ◽  
pp. 2006
Author(s):  
Anna Y Budkina ◽  
Elena V Korneenko ◽  
Ivan A Kotov ◽  
Daniil A Kiselev ◽  
Ilya V Artyushin ◽  
...  

According to various estimates, only a small percentage of existing viruses have been discovered, and even fewer are represented in genomic databases. High-throughput sequencing technologies are developing rapidly, enabling large-scale screening of various biological samples for the presence of pathogen-associated nucleotide sequences, but many organisms have yet to be assigned specific loci for identification. This problem particularly impedes viral screening because of the vast heterogeneity of viral genomes. In this paper, we present a new bioinformatic pipeline, VirIdAl, for detecting and identifying viral pathogens in sequencing data. We also demonstrate the utility of the new software by applying it to viral screening of the feces of bats collected in the Moscow region, which revealed a significant variety of viruses associated with bats, insects, plants, and protozoa. The presence of alpha- and betacoronavirus reads, including those of a MERS-like bat virus, deserves special mention, as it once again indicates that bats are reservoirs for many viral pathogens. In addition, alignment-based methods were unable to identify the taxon for a large proportion of reads; we therefore applied other approaches and showed that they can reveal further viral agents in sequencing data. However, the incompleteness of viral databases remains a significant problem in studies of viral diversity and necessitates the use of combined approaches, including those based on machine learning methods.


2020 ◽  
Vol 48 (W1) ◽  
pp. W200-W207
Author(s):  
Simone Puccio ◽  
Giorgio Grillo ◽  
Arianna Consiglio ◽  
Maria Felicia Soluri ◽  
Daniele Sblattero ◽  
...  

Abstract High-throughput sequencing technologies are transforming many research fields, including the analysis of phage display libraries. Phage display technology coupled with deep sequencing was introduced more than a decade ago and holds the potential to circumvent the traditional, laborious picking and testing of individual rescued phage clones. However, from a bioinformatics point of view, the analysis of this kind of data has always been performed by adapting tools designed for other purposes, and so has not accounted for the background noise typical of the ‘interactome sequencing’ approach or for the heterogeneity of the data. InteractomeSeq is a web server for the analysis of protein domains (‘domainome’) or epitopes (‘epitome’) from either eukaryotic or prokaryotic genomic phage libraries generated and selected by following an interactome sequencing approach. InteractomeSeq allows users to upload raw sequencing data and to obtain an accurate characterization of domainome/epitome profiles after setting the parameters required to tune the analysis. The release of this tool is relevant for the scientific and clinical community, because InteractomeSeq will fill an existing gap in the field of large-scale biomarker profiling, reverse vaccinology, and structural/functional studies, thus contributing essential information for gene annotation or antigen identification. InteractomeSeq is freely available at https://InteractomeSeq.ba.itb.cnr.it/


2020 ◽  
Vol 36 (12) ◽  
pp. 3632-3636 ◽  
Author(s):  
Weibo Zheng ◽  
Jing Chen ◽  
Thomas G Doak ◽  
Weibo Song ◽  
Ying Yan

Abstract Motivation. Programmed DNA elimination (PDE) plays a crucial role in the transitions between germline and somatic genomes in diverse organisms ranging from unicellular ciliates to multicellular nematodes. However, software specific for the detection of DNA splicing events is scarce. In this paper, we describe Accurate Deletion Finder (ADFinder), an efficient detector of PDEs using high-throughput sequencing data. ADFinder can predict PDEs with relatively low sequencing coverage, detect multiple alternative splicing forms in the same genomic location and calculate the frequency of each splicing event. This software will facilitate research on PDEs and all downstream analyses. Results. By analyzing genome-wide DNA splicing events in the micronuclear genomes of Oxytricha trifallax and Tetrahymena thermophila, we show that ADFinder is effective in predicting large-scale PDEs. Availability and implementation. The source code and manual of ADFinder are available on GitHub: https://github.com/weibozheng/ADFinder. Supplementary information. Supplementary data are available at Bioinformatics online.


Genetics ◽  
2019 ◽  
Vol 213 (4) ◽  
pp. 1209-1224 ◽  
Author(s):  
Juho A. J. Kontio ◽  
Mikko J. Sillanpää

Gaussian process (GP)-based automatic relevance determination (ARD) is known to be an efficient technique for identifying determinants of gene-by-gene interactions important to trait variation. However, the estimation of GP models is feasible only for low-dimensional datasets (∼200 variables), which severely limits application of the GP-based ARD method for high-throughput sequencing data. In this paper, we provide a nonparametric prescreening method that preserves virtually all the major benefits of the GP-based ARD method and extends its scalability to the typical high-dimensional datasets used in practice. In several simulated test scenarios, the proposed method compared favorably with existing nonparametric dimension reduction/prescreening methods suitable for higher-order interaction searches. As a real-data example, the proposed method was applied to a high-throughput dataset downloaded from The Cancer Genome Atlas (TCGA) with measured expression levels of 16,976 genes (after preprocessing) from patients diagnosed with acute myeloid leukemia.


2022 ◽  
Author(s):  
Bermond Scoggins ◽  
Matthew Peter Robertson

The scientific method is predicated on transparency -- yet the pace at which transparent research practices are being adopted by the scientific community is slow. The replication crisis in psychology showed that published findings employing statistical inference are threatened by undetected errors, data manipulation, and data falsification. To mitigate these problems and bolster research credibility, open data and preregistration have increasingly been adopted in the natural and social sciences. While many political science and international relations journals have committed to implementing these reforms, the extent of open science practices is unknown. We bring large-scale text analysis and machine learning classifiers to bear on the question. Using population-level data -- 93,931 articles across the top 160 political science and IR journals between 2010 and 2021 -- we find that approximately 21% of all statistical inference papers have open data, and 5% of all experiments are preregistered. Despite this shortfall, the example of leading journals in the field shows that change is feasible and can be effected quickly.


PLoS ONE ◽  
2020 ◽  
Vol 15 (12) ◽  
pp. e0244480
Author(s):  
Xinghuo Ye ◽  
Zhihong Yang ◽  
Yeqin Jiang ◽  
Lan Yu ◽  
Rongkai Guo ◽  
...  

Identification of the target genes of microRNAs (miRNAs), trans-acting small interfering RNAs (ta-siRNAs), and small interfering RNAs (siRNAs) is an important step for understanding their regulatory roles in plants. In recent years, many bioinformatics software packages based on the analysis of small RNA (sRNA) high-throughput sequencing (HTS) and degradome sequencing data have provided strong technical support for large-scale mining of sRNA-target pairs. However, sRNA-target regulation operates through a complex network of interactions, since one transcript may be co-regulated by multiple sRNAs and one sRNA may affect multiple targets. Currently used mining software can identify multiple unknown targets of a known sRNA, but it cannot rule out co-regulation of the same target by other, unknown sRNAs. Hence, the resulting regulatory network may be incomplete. We have developed a new mining tool, sRNATargetDigger, that includes two function modules, “Forward Digger” and “Reverse Digger”, which can identify regulatory sRNA-target pairs bidirectionally. Moreover, it can identify unknown sRNAs co-regulating the same target, yielding a more complete and reliable sRNA-target regulatory network. Upon re-examination of the published sRNA-target pairs in Arabidopsis thaliana, sRNATargetDigger found 170 novel co-regulatory sRNA-target pairs. This software can be downloaded from http://www.bioinfolab.cn/sRNATD.html.


eLife ◽  
2019 ◽  
Vol 8 ◽  
Author(s):  
Joshua S Bloom ◽  
James Boocock ◽  
Sebastian Treusch ◽  
Meru J Sadhu ◽  
Laura Day ◽  
...  

How variants with different frequencies contribute to trait variation is a central question in genetics. We use a unique model system to disentangle the contributions of common and rare variants to quantitative traits. We generated ~14,000 progeny from crosses among 16 diverse yeast strains and identified thousands of quantitative trait loci (QTLs) for 38 traits. We combined our results with sequencing data for 1011 yeast isolates to show that rare variants make a disproportionate contribution to trait variation. Evolutionary analyses revealed that this contribution is driven by rare variants that arose recently, and that negative selection has shaped the relationship between variant frequency and effect size. We leveraged the structure of the crosses to resolve hundreds of QTLs to single genes. These results refine our understanding of trait variation at the population level and suggest that studies of rare variants are a fertile ground for discovery of genetic effects.

