dbCID: a manually curated resource for exploring the driver indels in human cancer

2019 ◽  
Vol 20 (5) ◽  
pp. 1925-1933 ◽  
Author(s):  
Zhenyu Yue ◽  
Le Zhao ◽  
Na Cheng ◽  
Hua Yan ◽  
Junfeng Xia

Abstract While recent advances in next-generation sequencing technologies have enabled the creation of a multitude of databases in cancer genomic research, there is no comprehensive database focusing on the annotation of driver indels (insertions and deletions) yet. Therefore, we have developed the database of Cancer driver InDels (dbCID), which is a collection of known coding indels that are likely to be engaged in cancer development, progression or therapy. dbCID contains experimentally supported and putative driver indels derived from manual curation of literature and is freely available online at http://bioinfo.ahu.edu.cn:8080/dbCID. Using the data deposited in dbCID, we summarized features of driver indels at four levels (gene, DNA, transcript and protein) by comparing them with putative neutral indels. We found that most of the genes containing driver indels in dbCID are known cancer genes playing a role in tumorigenesis. Contrary to expectation, the sequences affected by driver frameshift indels are not larger than those affected by neutral ones. In addition, the frameshift and inframe driver indels tend to disrupt highly conserved regions in both DNA sequences and protein domains. Finally, we developed a computational method for discriminating cancer driver from neutral frameshift indels based on the data deposited in dbCID. The proposed method outperformed other widely used non-cancer-specific predictors on an external test set, which demonstrated the usefulness of the data deposited in dbCID. We hope dbCID will be a benchmark for improving and evaluating prediction algorithms, and the characteristics summarized here may assist with investigating the mechanism of indel–cancer association.

Author(s):  
Zhenyu Yue ◽  
Xinlu Chu ◽  
Junfeng Xia

Abstract The discrimination of driver from passenger mutations has been a hot topic in the field of cancer biology. Although recent advances have improved the identification of driver mutations in cancer genomic research, there is no computational method specific to cancer frameshift indels (insertions and/or deletions) yet. In addition, existing pathogenic frameshift indel predictors may suffer from many missing values because of different choices of transcripts during the variant annotation process. In this study, we proposed a computational model, called PredCID (Predictor for Cancer driver frameshift InDels), for accurately predicting cancer driver frameshift indels. Gene, DNA, transcript and protein level features are combined and selected for classification with an eXtreme Gradient Boosting classifier. Benchmarking results on the cross-validation dataset and an independent dataset showed that PredCID achieves better and more robust performance than existing non-cancer-specific methods in distinguishing cancer driver frameshift indels from passengers and is therefore a valuable method for deeper understanding of frameshift indels in human cancer. PredCID is freely available for academic research at http://bioinfo.ahu.edu.cn:8080/PredCID.


2019 ◽  
Vol 14 (2) ◽  
pp. 157-163
Author(s):  
Majid Hajibaba ◽  
Mohsen Sharifi ◽  
Saeid Gorgin

Background: One of the pivotal challenges in genomic research today is the fast processing of voluminous data such as that produced by high-throughput Next-Generation Sequencing technologies. However, BLAST (Basic Local Alignment Search Tool), a long-established and renowned tool in Bioinformatics, has been shown to be very slow in this regard. Objective: To improve the performance of BLAST on voluminous data, we have applied a novel memory-aware technique to BLAST for faster parallel processing. Method: We have used a master-worker model alongside a memory-aware technique in which the master partitions the whole data into equal chunks, one chunk for each worker, and each worker then further splits and formats its allocated data chunk according to the size of its memory. Each worker searches each split one by one against a list of queries. Results: We have chosen a list of queries with different lengths to run insensitive searches in a huge database called UniProtKB/TrEMBL. Our experiments show a 20 percent improvement in performance when workers used our proposed memory-aware technique compared to when they were not memory aware. Experiments show an even higher performance improvement, approximately 50 percent, when we applied our memory-aware technique to mpiBLAST. Conclusion: We have shown that memory-awareness in formatting a bulky database, when running BLAST, can improve performance significantly, while preventing unexpected crashes in low-memory environments. Even though distributed computing attempts to mitigate search time by partitioning and distributing database portions, our memory-aware technique alleviates the negative effects of page faults on performance.
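The two-level partitioning described in the Method can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of record indices in place of real sequence data, and the fixed per-record size are all assumptions made for illustration, and no BLAST or mpiBLAST call is involved.

```python
from typing import List


def partition_equal(n_records: int, n_workers: int) -> List[range]:
    """Master step (hypothetical): split record indices into near-equal chunks."""
    base, extra = divmod(n_records, n_workers)
    chunks = []
    start = 0
    for w in range(n_workers):
        # The first `extra` workers take one additional record.
        size = base + (1 if w < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def split_for_memory(chunk: range, record_bytes: int, memory_bytes: int) -> List[range]:
    """Worker step (hypothetical): cut a chunk into pieces that each fit in memory.

    Assumes every record occupies `record_bytes`; a real database would need
    actual per-sequence sizes.
    """
    per_piece = max(1, memory_bytes // record_bytes)
    return [
        range(i, min(i + per_piece, chunk.stop))
        for i in range(chunk.start, chunk.stop, per_piece)
    ]


chunks = partition_equal(10, 3)
pieces = split_for_memory(chunks[0], record_bytes=100, memory_bytes=250)
```

Under these assumptions, each worker would format and search one piece at a time, so that the working set stays within its own memory rather than spilling into swap.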


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Cesim Erten ◽  
Aissa Houdjedj ◽  
Hilal Kazan

Abstract Background: Recent cancer genomic studies have generated detailed molecular data on a large number of cancer patients. A key remaining problem in cancer genomics is the identification of driver genes. Results: We propose BetweenNet, a computational approach that integrates genomic data with a protein-protein interaction network to identify cancer driver genes. BetweenNet utilizes a measure based on betweenness centrality on patient-specific networks to identify the so-called outlier genes that correspond to dysregulated genes for each patient. Setting up the relationship between the mutated genes and the outliers through a bipartite graph, it employs a random-walk process on the graph, which provides the final prioritization of the mutated genes. We compare BetweenNet against state-of-the-art cancer gene prioritization methods on lung, breast, and pan-cancer datasets. Conclusions: Our evaluations show that BetweenNet is better at recovering known cancer genes based on multiple reference databases. Additionally, we show that the GO terms and the reference pathways enriched in BetweenNet-ranked genes and those that are enriched in known cancer genes overlap significantly when compared to the overlaps achieved by the rankings of the alternative methods.
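As a rough illustration of the final step, the sketch below runs a generic random walk with restart on a small bipartite graph linking mutated genes to outlier genes. The abstract does not specify the walk's parameters, seeding or edge weights, so the restart probability, the choice of outliers as seeds, and the gene labels (M1, O1, and so on) are all assumptions; this is not BetweenNet's actual procedure.

```python
from collections import defaultdict


def random_walk_with_restart(edges, seeds, restart=0.3, iters=100):
    """Approximate visiting probabilities of a random walk with restart.

    edges: iterable of (u, v) pairs of an undirected graph.
    seeds: nodes the walker jumps back to; assumed to all appear in `edges`.
    """
    seeds = set(seeds)
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    nodes = list(nbrs)
    # Restart distribution: uniform over the seed nodes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(iters):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            # Spread the non-restart mass evenly over u's neighbours.
            share = (1.0 - restart) * p[u] / len(nbrs[u])
            for v in nbrs[u]:
                nxt[v] += share
        p = nxt
    return p


# Hypothetical toy graph: mutated genes M1..M3, outlier genes O1 and O2.
edges = [("M1", "O1"), ("M1", "O2"), ("M2", "O1"), ("M3", "O2")]
scores = random_walk_with_restart(edges, seeds={"O1", "O2"})
ranking = sorted(["M1", "M2", "M3"], key=lambda g: scores[g], reverse=True)
```

In this toy graph the mutated gene connected to both outliers ranks first, which is the kind of prioritization a walk over the mutated-gene/outlier graph would be expected to yield.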


2016 ◽  
Vol 82 (11) ◽  
pp. 3225-3238 ◽  
Author(s):  
Laura Glendinning ◽  
Steven Wright ◽  
Jolinda Pollock ◽  
Peter Tennant ◽  
David Collie ◽  
...  

ABSTRACT Sequencing technologies have recently facilitated the characterization of bacterial communities present in lungs during health and disease. However, there is currently a dearth of information concerning the variability of such data in health both between and within subjects. This study seeks to examine such variability using healthy adult sheep as our model system. Protected specimen brush samples were collected from three spatially disparate segmental bronchi of six adult sheep (age, 20 months) on three occasions (day 0, 1 month, and 3 months). To further explore the spatial variability of the microbiotas, more-extensive brushing samples (n = 16) and a throat swab were taken from a separate sheep. The V2 and V3 hypervariable regions of the bacterial 16S rRNA genes were amplified and sequenced via Illumina MiSeq. DNA sequences were analyzed using the mothur software package. Quantitative PCR was performed to quantify total bacterial DNA. Some sheep lungs contained dramatically different bacterial communities at different sampling sites, whereas in others, airway microbiotas appeared similar across the lung. In our spatial variability study, we observed clustering related to the depth within the lung from which samples were taken. Lung depth refers to increasing distance from the glottis, progressing in a caudal direction. We conclude that both host influence and local factors have impacts on the composition of the sheep lung microbiota. IMPORTANCE Until recently, it was assumed that the lungs were a sterile environment which was colonized by microbes only during disease. However, recent studies using sequencing technologies have found that there is a small population of bacteria which exists in the lung during health, referred to as the “lung microbiota.” In this study, we characterize the variability of the lung microbiotas of healthy sheep. Sheep not only are economically important animals but also are often used as large animal models of human respiratory disease. We conclude that, while host influence does play a role in dictating the types of microbes which colonize the airways, it is clear that local factors also play an important role in this regard. Understanding the nature and influence of these factors will be key to understanding the variability in, and functional relevance of, the lung microbiota.


2014 ◽  
Vol 67 (2) ◽  
pp. 7247-7260 ◽  
Author(s):  
Pablo Andrés Gutiérrez Sánchez ◽  
Juan Fernando Alzate ◽  
Mauricio Marín Montoya

Spongospora subterranea, the causal agent of potato powdery scab, is an important soil-borne obligate protozoan commonly found in Andean soils. The disease is a serious problem that causes cosmetic damage on the skin of tubers and induces root gall formation, diminishing the yield and commercial value of the potato. Genetic studies on S. subterranea are difficult due to its obligate parasitism, which explains the lack of available knowledge on its basic biology. S. subterranea is a member of the order Plasmodiophorida, a protist taxon that includes other important plant pathogens such as Plasmodiophora brassicae and Spongospora nasturtii. Little is known about the genomes of Plasmodiophorida; however, with the use of Next-Generation Sequencing technologies combined with appropriate bioinformatic techniques, it is possible to obtain genomic sequences from obligate pathogens such as S. subterranea. To gain a better understanding of the biology of this pathogen and Plasmodiophorida in general, DNA sequences from a cystosori-enriched sample of S. subterranea were obtained using 454 pyrosequencing technology. As a first step in understanding the nutritional requirements of S. subterranea as well as its infective and resistance structures, we present a bioinformatic analysis of 24 contigs related to genes involved in glycolysis and in starch, cellulose and chitin metabolism. Intron structure and codon usage are also discussed. The genes analyzed in this study are a good source of information for studies aimed at characterizing these enzymes in vitro, as well as for the generation of new methods for the molecular detection of S. subterranea in either soils or infected plants.


PeerJ ◽  
2019 ◽  
Vol 7 ◽  
pp. e7170
Author(s):  
Daniel Liu

Next-generation sequencing technologies create large, multiplexed DNA sequences that require preprocessing before any further analysis. Part of this preprocessing includes demultiplexing and trimming sequences. Although there are many existing tools that can handle these preprocessing steps, they cannot be easily extended to new sequence schematics when new pipelines are developed. We present Fuzzysplit, a tool that relies on a simple declarative language to describe the schematics of sequences, which makes it incredibly adaptable to different use cases. In this paper, we explain the matching algorithms behind Fuzzysplit and we provide a preliminary comparison of its performance with other well-established tools. Overall, we find that its matching accuracy is comparable to previous tools.
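To make the demultiplexing and trimming step concrete, here is a minimal sketch of mismatch-tolerant barcode matching. The abstract does not describe Fuzzysplit's declarative language or its matching algorithms, so this is a generic Hamming-distance approach under assumed conventions (a fixed-length barcode at the start of each read, hypothetical sample names and function names), not the tool's own method.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(read: str, barcodes: dict, max_mismatches: int = 1):
    """Assign a read to the sample whose barcode best matches its prefix.

    Returns (sample, trimmed_read), or (None, read) when no barcode is within
    `max_mismatches`. Ties go to the first barcode encountered.
    """
    best = None
    best_dist = max_mismatches + 1
    for sample, barcode in barcodes.items():
        prefix = read[:len(barcode)]
        if len(prefix) < len(barcode):
            continue  # read is shorter than this barcode
        d = hamming(prefix, barcode)
        if d < best_dist:
            best, best_dist = (sample, len(barcode)), d
    if best is None:
        return None, read
    sample, n = best
    return sample, read[n:]  # trim the barcode off the read


# Hypothetical barcodes and reads for illustration.
barcodes = {"s1": "ACGT", "s2": "TTGA"}
hit = demultiplex("ACGAGGGCCC", barcodes)
miss = demultiplex("GGGGAAAA", barcodes)
```

A tool built around a declarative sequence description would generalize this by letting the user state where barcodes, adapters and variable regions sit, instead of hard-coding a prefix match as done here.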


2003 ◽  
Vol 77 (3) ◽  
pp. 2056-2062 ◽  
Author(s):  
Rachel Kim ◽  
Alla Trubetskoy ◽  
Takeshi Suzuki ◽  
Nancy A. Jenkins ◽  
Neal G. Copeland ◽  
...  

ABSTRACT The identification of tumor-inducing genes is a driving force for elucidating the molecular mechanisms underlying cancer. Many retroviruses induce tumors by insertion of viral DNA adjacent to cellular oncogenes, resulting in altered expression and/or structure of the encoded proteins. The availability of the mouse genome sequence now allows analysis of retroviral common integration sites in murine tumors to be used as a genetic screen for identification of large numbers of candidate cancer genes. By positioning the sequences of inverse PCR-amplified, virus-host junction fragments within the mouse genome, 19 target genes were identified in T-cell lymphomas induced by the retrovirus SL3-3. The candidate cancer genes included transcription factors (Fos, Gfi1, Lef1, Myb, Myc, Runx3, and Sox3), all three D cyclins, Ras signaling pathway components (Rras2/TC21 and Rasgrp1), and Cmkbr7/CCR7. The most frequent target was Rras2. Insertions as far as 57 kb away from the transcribed portion were associated with substantially increased transcription of Rras2, and no coding sequence mutations, including those typically involved in Ras activation, were detected. These studies demonstrate the power of genome-based analysis of retroviral insertion sites for cancer gene discovery, identify several new genes worth examining for a role in human cancer, and implicate the pathways in which those genes act in lymphomagenesis. They also provide strong genetic evidence that overexpression of unmutated Rras2 contributes to tumorigenesis, thus suggesting that it may also do so if it is inappropriately expressed in human tumors.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Felix Grassmann ◽  
Yudi Pawitan ◽  
Kamila Czene

Abstract Genes involved in cancer are under constant evolutionary pressure, potentially resulting in diverse molecular properties. In this study, we explore 23 omic features from publicly available databases to define the molecular profile of different classes of cancer genes. Cancer genes were grouped according to mutational landscape (germline and somatically mutated genes), role in cancer initiation (cancer driver genes) or cancer survival (survival genes), as well as being implicated by genome-wide association studies (GWAS genes). For each gene, we also computed feature scores based on all omic features, effectively summarizing how closely a gene resembles cancer genes of the respective class. In general, cancer genes are longer, have a lower GC content, have more isoforms with shorter exons, are expressed in more tissues and have more transcription factor binding sites than non-cancer genes. We found that germline genes more closely resemble single tissue GWAS genes while somatic genes are more similar to pleiotropic cancer GWAS genes. As a proof-of-principle, we utilized aggregated feature scores to prioritize genes in breast cancer GWAS loci and found that top ranking genes were enriched in cancer related pathways. In conclusion, we have identified multiple omic features associated with different classes of cancer genes, which can assist prioritization of genes in cancer gene discovery.


2020 ◽  
Vol 49 (D1) ◽  
pp. D1289-D1301 ◽  
Author(s):  
Tao Wang ◽  
Shasha Ruan ◽  
Xiaolu Zhao ◽  
Xiaohui Shi ◽  
Huajing Teng ◽  
...  

Abstract The prevalence of neutral mutations in cancer cell populations makes it difficult to distinguish cancer-causing driver mutations from passenger mutations. To systematically prioritize the oncogenic ability of somatic mutations and cancer genes, we constructed a platform, OncoVar (https://oncovar.org/), which employs published bioinformatics algorithms and incorporates known driver events to identify driver mutations and driver genes. We identified 20 162 cancer driver mutations, 814 driver genes and 2360 pathogenic pathways with high confidence by reanalyzing 10 769 exomes from 33 cancer types in The Cancer Genome Atlas (TCGA) and 1942 genomes from 18 cancer types in the International Cancer Genome Consortium (ICGC). OncoVar provides four points of view, ‘Mutation’, ‘Gene’, ‘Pathway’ and ‘Cancer’, to help researchers visualize the relationships between cancers and driver variants. Importantly, identification of actionable driver alterations provides promising druggable targets and repurposing opportunities for combinational therapies. OncoVar provides a user-friendly interface for browsing, searching and downloading somatic driver mutations, driver genes and pathogenic pathways in various cancer types. This platform will facilitate the identification of cancer drivers across individual cancer cohorts and help rank mutations or genes for better decision-making among clinical oncologists, cancer researchers and the broad scientific community interested in cancer precision medicine.


2010 ◽  
Vol 10 (1) ◽  
pp. 59-64 ◽  
Author(s):  
Thomas Santarius ◽  
Janet Shipley ◽  
Daniel Brewer ◽  
Michael R. Stratton ◽  
Colin S. Cooper
