Computational identification and characterization of glioma candidate biomarkers through multi-omics integrative profiling

2020 ◽  
Author(s):  
Lin Liu ◽  
Guangyu Wang ◽  
Liguo Wang ◽  
Chunlei Yu ◽  
Mengwei Li ◽  
...  

Abstract Background: Glioma is one of the most common malignant brain tumors and exhibits a low resection rate and a high recurrence risk. Although a large number of glioma studies powered by high-throughput sequencing technologies have led to massive multi-omics datasets, there is a lack of comprehensive integration of glioma datasets for uncovering candidate biomarker genes. Results: In this study, we collected a large-scale assembly of multi-omics multi-cohort datasets from worldwide public resources, involving a total of 16,939 samples across 19 independent studies. Through comprehensive molecular profiling across different datasets, we revealed that PRKCG (Protein Kinase C Gamma), a brain-specific gene detectable in cerebrospinal fluid, is closely associated with glioma. Specifically, it presents lower expression and higher methylation in glioma samples compared with normal samples. A change in PRKCG expression/methylation from high to low is indicative of glioma progression from low-grade to high-grade, and high RNA expression is suggestive of good survival. Importantly, PRKCG in combination with MGMT is effective in predicting survival outcomes more precisely. Conclusions: PRKCG bears great potential for glioma diagnosis, prognosis and therapy, and PRKCG-like genes may represent a set of important genes associated with different molecular mechanisms in glioma tumorigenesis. Our study indicates the importance of computational integrative multi-omics data analysis and represents a data-driven scheme toward precision tumor subtyping and accurate personalized healthcare.

2019 ◽  
Author(s):  
Lin Liu ◽  
Guangyu Wang ◽  
Liguo Wang ◽  
Chunlei Yu ◽  
Mengwei Li ◽  
...  

Abstract Glioma is one of the most common malignant brain tumors and exhibits a low resection rate and a high recurrence risk. Although a large number of glioma studies powered by high-throughput sequencing technologies have led to massive multi-omics datasets, there is a lack of comprehensive integration of glioma datasets for uncovering candidate biomarker genes. In this study, we collected a large-scale assembly of multi-omics multi-cohort datasets from worldwide public resources, involving a total of 16,939 samples across 19 independent studies. Through comprehensive multi-omics molecular profiling across different datasets, we revealed that PRKCG (Protein Kinase C Gamma), a brain-specific gene detectable in cerebrospinal fluid, is closely associated with glioma. Specifically, it presents lower expression and higher methylation in glioma samples compared with normal samples. A change in PRKCG expression/methylation from high to low is indicative of glioma progression from low-grade to high-grade, and high RNA expression is suggestive of good survival. Importantly, PRKCG in combination with MGMT is effective in predicting survival outcomes after TMZ chemotherapy more precisely. Collectively, PRKCG bears great potential for glioma diagnosis, prognosis and therapy, and PRKCG-like genes may represent a set of important genes associated with different molecular mechanisms in glioma tumorigenesis. Accordingly, our study indicates the importance of computational integrative multi-omics data analysis and represents a data-driven scheme toward precision tumor subtyping and accurate personalized healthcare.
Author Summary Glioma is a type of brain tumor that represents one of the most lethal human malignancies, with little chance of recovery. Nowadays, more and more studies have adopted high-throughput sequencing technologies to decode the molecular profiles of glioma at different omics levels, resulting in massive glioma datasets generated by different projects and laboratories throughout the world. Therefore, it has become crucially important to make full use of these valuable datasets for computational identification of glioma candidate biomarker genes in aid of precision tumor subtyping and accurate personalized treatment. In this study, we comprehensively integrated glioma datasets from all over the world and performed multi-omics molecular data mining. We revealed that PRKCG, a brain-specific gene abundantly expressed in cerebrospinal fluid, bears great potential for glioma diagnosis, prognosis and treatment prediction, which has been consistently observed on multiple independent datasets. In the era of big data, our study highlights the significance of computational integrative data mining toward precision medicine in cancer research.


Viruses ◽  
2021 ◽  
Vol 13 (10) ◽  
pp. 2006
Author(s):  
Anna Y Budkina ◽  
Elena V Korneenko ◽  
Ivan A Kotov ◽  
Daniil A Kiselev ◽  
Ilya V Artyushin ◽  
...  

According to various estimates, only a small percentage of existing viruses have been discovered, and naturally even fewer are represented in genomic databases. High-throughput sequencing technologies are developing rapidly, empowering large-scale screening of various biological samples for the presence of pathogen-associated nucleotide sequences, but many organisms are yet to be attributed specific loci for identification. This problem particularly impedes viral screening, due to the vast heterogeneity of viral genomes. In this paper, we present a new bioinformatic pipeline, VirIdAl, for detecting and identifying viral pathogens in sequencing data. We also demonstrate the utility of the new software by applying it to viral screening of the feces of bats collected in the Moscow region, which revealed a significant variety of viruses associated with bats, insects, plants, and protozoa. The presence of alpha and beta coronavirus reads, including the MERS-like bat virus, deserves a special mention, as it once again indicates that bats are indeed reservoirs for many viral pathogens. In addition, alignment-based methods were unable to identify the taxon for a large proportion of reads, so we applied other approaches as well, showing that they can further reveal the presence of viral agents in sequencing data. However, the incompleteness of viral databases remains a significant problem in studies of viral diversity, and therefore necessitates the use of combined approaches, including those based on machine learning methods.


2020 ◽  
Author(s):  
Yang Young Lu ◽  
Jiaxing Bai ◽  
Yiwen Wang ◽  
Ying Wang ◽  
Fengzhu Sun

Abstract Motivation: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high-throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10^2 to 10^4-fold reduction of storage space, CRAFT performs fast queries for gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/[email protected]; [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
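The k-mer frequency comparison that the abstract above takes as its starting point can be illustrated with a minimal sketch. This is not CRAFT's learned embedding; it is the plain baseline representation, and the function names `kmer_frequencies` and `cosine_distance` as well as the choice of cosine distance are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries (one per k-mer over A, C, G, T), which is
    why memory and storage grow quickly as k increases.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only k-mers made of A/C/G/T contribute; windows with N etc. are ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def cosine_distance(u, v):
    """Cosine distance between two frequency vectors (illustrative choice)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)
```

For k = 12 the vector already has 4**12 (about 16.8 million) entries per sequence, which gives a sense of the storage problem that compact representations are meant to address.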


2020 ◽  
Vol 4 (Supplement_1) ◽  
Author(s):  
Thomas Kim

Abstract The hypothalamus is a central regulator of physiological homeostasis. During development, multiple transcription factors coordinate the patterning and specification of hypothalamic nuclei. However, the molecular mechanisms controlling hypothalamic patterning and cell fate specification are poorly understood. To identify genes that control these processes, we have used single-cell RNA sequencing (scRNA-Seq) to profile mouse hypothalamic gene expression across multiple developmental time points. We have further utilised scRNA-Seq to phenotype mutations in genes that play major roles in early hypothalamic patterning. To first understand hypothalamic development, hypothalami were collected at both embryonic (E10-E16, E18) and postnatal (PN4, PN8, PN14, PN45) time points. At early stages, when the bulk of hypothalamic patterning occurs (E11-E13), we observed a clear separation between mitotic progenitors and postmitotic neural precursor cells. We likewise observed clean segregation among cells expressing regional hypothalamic markers identified in a previous large-scale analysis of hypothalamic development. This analysis reveals new region-specific markers and identifies candidate genes for selectively regulating patterning and cell fate specification in individual hypothalamic regions. We integrated our rich dataset of the developing mouse hypothalamus with the Allen Brain Atlas in situ data and a publicly available adult hypothalamic scRNA-Seq dataset to understand the hierarchy of hypothalamic cell differentiation and to redefine the cell types of the hypothalamus. We next used scRNA-Seq to phenotype multiple mutant lines, including a line that has been extensively characterised, as a proof of concept (Ctnnb1 overexpression), and lines that have not been characterised (Nkx2.1, Nkx2.2, Dlx1/2 deletion). 
We show that this approach can rapidly and comprehensively characterize mutants that have altered hypothalamic patterning, and in doing so, we have identified multiple genes that simultaneously repress posterior hypothalamic identity while promoting prethalamic identity. This result supports a modified columnar model of organization for the diencephalon, where prethalamus and hypothalamus are situated in adjacent dorsal and ventral domains of the anterior diencephalon. These data serve as a resource for further studies of hypothalamic development and dysfunction, and can help delineate the transcriptional regulatory networks of hypothalamic formation. Lastly, using our mouse hypothalamus data as a guideline, we are comparing datasets of developing chicken, zebrafish and human hypothalamus to identify evolutionarily conserved and divergent region-specific gene regulatory networks. We aim to use this knowledge of the key molecular pathways of human hypothalamic development to produce human hypothalamus organoids.


Author(s):  
Elisa Pappalardo ◽  
Domenico Cantone

The successful sequencing of the genomes of various species has produced a great amount of data that needs to be managed and analyzed. With the increasing popularity of high-throughput sequencing technologies, such data require flexible, scalable, efficient algorithms and enterprise data structures that can be used by both biologists and computational scientists. This chapter focuses on the design of large-scale database-driven applications for genomic and proteomic data. It is widely believed that biological databases are similar to any standard database-driven application; however, a number of different and increasingly complex challenges arise. In particular, while standard databases are used just to manage information, in biology they represent a main source for further computational analysis, which frequently focuses on the identification of relations and properties of a network of entities. The analysis starts from the first text-based storage approach and ends with new insights on object-relational mapping for biological data.


Blood ◽  
2011 ◽  
Vol 118 (21) ◽  
pp. 462-462
Author(s):  
Anna M Jankowska ◽  
Yun Huang ◽  
Myunggon Ko ◽  
Utz J Pape ◽  
Hideki Makishima ◽  
...  

Abstract 462 In myelodysplastic syndrome (MDS), mutations in genes affecting epigenetic regulation constitute a link between genomic and epigenetic instability. Previously, we and others described mutations in TET2, coding for a 2-oxoglutarate-dependent methylcytosine dioxygenase, which converts 5-methylcytosine (5-mC) to 5-hydroxymethylcytosine (5-hmC). Subsequently, dysfunction of wild type TET2 was mechanistically linked to neomorphic IDH mutations which deplete 2-oxoglutarate and produce a competitive inhibitor, 2-hydroxyglutarate. Previously, we established analytic tools to indirectly quantify 5-hmC content in leukemic genomes: in patients with myeloid malignancies 5-hmC levels are decreased as compared to healthy controls (p=1.8e-09). A decrease in 5-hmC levels correlated with dysfunction of TET2 as a consequence of inactivating hypomorphic mutations. Nevertheless, while in a majority of patients with decreased 5-hmC levels TET2 mutations can be found, in a substantial minority of cases no explanation for the 5-hmC deficiency has been found; down-modulation of TET2 mRNA and protein expression was absent and mutations in TET1 and TET3 have not been identified. Thus, other currently unidentified proteins may be directly or indirectly (via regulation of TET activity) involved in the deregulation of 5-hmC levels in TET2 and IDH1/2-mutation-negative cases with low 5-hmC. To further investigate this issue, we first characterized patients with low 5-hmC at the molecular level using various approaches. SNP-A karyotyping failed to identify recurrent chromosomal defects in these patients that could point towards defects in pathogenic genes involved in the regulation of 5-hmC levels. We also screened 107 MDS patients to correlate genomic 5-hmC content with the presence of recurrent mutations in genes including IDH1/2, DNMT3A, ASXL1 and RUNX1 (as well as TET2). 
Within these genes, except for an association with TET2 mutations, a positive correlation with low 5-hmC levels was found only for IDH1/2 mutant cases (p=.05, n=5), whereas no correlation has been established for DNMT3A (p=.119, n=12), ASXL1 (p=.434, n=21) and RUNX1 (p=.602, n=22) mutant cases. While TET2 and IDH mutations were rarely seen together (n=1), none of the other studied gene mutations were mutually exclusive with TET2, suggesting contributions of defects in novel yet not identified genes. Several other genes similar to TET or IDH proteins, or hypothetically linked to DNA demethylation pathways could, at least theoretically, affect 5-hmC content, including for instance D2HGDH and the ELP gene family. However, no mutations were identified in these patients, except for identification of yet unknown SNPs in D2HGDH and ELP4 in some patients with unexplained low 5-hmC levels. In addition to the targeted approach we have also applied next generation sequencing technologies and sequenced whole exomes of malignant and non-affected cells (paired-end (2×100) Illumina HiSeq 2000) to identify novel acquired determinants of 5-mC hydroxymethylation in two representative patients. By using a selective algorithm, 18 overlapping potential somatic alterations in these patients were found in genes which could functionally affect 5-hmC content. In addition, several other mutated genes have been identified in each patient; these are being further investigated in other patients with low 5-hmC levels. Sanger sequencing was applied to confirm the presence of previously detected mutations in NF1 and KRAS, as well as all novel mutations, for instance in BRCC3 and SF3B1, in these patients. In sum, our results provide novel insights into the molecular mechanisms underlying MDS pathophysiology and describe the possibility that the TET family enzymes can act together with other putative proteins linked to DNA demethylation pathways. 
The use of high-throughput sequencing technologies increases the probability of identifying novel changes which can be linked to functional consequences in these patients, ultimately furthering the understanding of its role in genomic stability in MDS. Disclosures: No relevant conflicts of interest to declare.


2019 ◽  
Vol 2 (1) ◽  
pp. 39-67
Author(s):  
Chao Deng ◽  
Timothy Daley ◽  
Guilherme De Sena Brandine ◽  
Andrew D. Smith

High-throughput sequencing technologies have evolved at a stellar pace for almost a decade and have greatly advanced our understanding of genome biology. In these sampling-based technologies, there is an important detail that is often overlooked in the analysis of the data and the design of the experiments, specifically that the sampled observations often do not give a representative picture of the underlying population. This has long been recognized as a problem in statistical ecology and in the broader statistics literature. In this review, we discuss the connections between these fields, methodological advances that parallel both the needs and opportunities of large-scale data analysis, and specific applications in modern biology. In the process we describe unique aspects of applying these approaches to sequencing technologies, including sequencing error, population and individual heterogeneity, and the design of experiments.
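As a rough illustration of the sampling problem described above, the sketch below computes the classical Good-Turing estimate of the probability mass belonging to classes that a sample has not yet observed (the number of classes seen exactly once, divided by the sample size). The abstract does not name specific estimators, so the choice of Good-Turing and the function name `good_turing_unseen_mass` are assumptions made for illustration.

```python
from collections import Counter


def good_turing_unseen_mass(observations):
    """Estimate the probability that the next observation is a new class.

    Good-Turing estimate: (number of classes observed exactly once) / (sample size).
    """
    counts = Counter(observations)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("at least one observation is required")
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n
```

In a sequencing setting the observations might be, for example, distinct molecules or species labels assigned to reads; a value close to zero suggests that further sequencing would mostly re-sample classes already seen.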


2015 ◽  
Author(s):  
Sophie Adjalley ◽  
Christophe Chabbert ◽  
Bernd Klaus ◽  
Vicent Pelechano ◽  
Lars Steinmetz

The lack of a comprehensive map of transcription start sites (TSS) across the highly AT-rich genome of P. falciparum has hindered progress towards deciphering the molecular mechanisms that underlie the timely regulation of gene expression in this malaria parasite. Using high-throughput sequencing technologies, we generated a comprehensive atlas of transcription initiation events at single-nucleotide resolution during the parasite intra-erythrocytic developmental cycle. This detailed analysis of TSS usage enabled us to define architectural features of plasmodial promoters. We demonstrate that TSS selection and strength are constrained by local nucleotide composition. Furthermore, we provide evidence for coordinate and stage-specific TSS usage from distinct sites within the same transcriptional unit, thereby producing transcript isoforms, a subset of which are developmentally regulated. This work offers a framework for further investigations into the interactions between genomic sequences and regulatory factors governing the complex transcriptional program of this major human pathogen.


2017 ◽  
Author(s):  
Mark J.P. Chaisson ◽  
Ashley D. Sanders ◽  
Xuefang Zhao ◽  
Ankit Malhotra ◽  
David Porubsky ◽  
...  

ABSTRACT The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, and strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three human parent–child trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per human genome. We also discover 156 inversions per genome, most of which previously escaped detection. Fifty-eight of the inversions we discovered intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The method and the dataset serve as a gold standard for the scientific community and we make specific recommendations for maximizing structural variation sensitivity for future large-scale genome sequencing studies.


2020 ◽  
Vol 48 (4) ◽  
pp. 1545-1556 ◽  
Author(s):  
Qianpeng Li ◽  
Zhao Li ◽  
Changrui Feng ◽  
Shuai Jiang ◽  
Zhang Zhang ◽  
...  

LncRNAs (long non-coding RNAs) are pervasively transcribed in the human genome and also extensively involved in a variety of essential biological processes and human diseases. The comprehensive annotation of human lncRNAs is of great significance in navigating the functional landscape of the human genome and deepening the understanding of the multi-featured RNA world. However, the unique characteristics of lncRNAs as well as their enormous quantity have complicated and challenged the annotation of lncRNAs. Advances in high-throughput sequencing technologies give rise to a large volume of omics data that are generated at an unprecedented rate and scale, providing possibilities in the identification, characterization and functional annotation of lncRNAs. Here, we review the recent important discoveries of human lncRNAs through analysis of various omics data and summarize specialized lncRNA database resources. Moreover, we highlight the multi-omics integrative analysis as a powerful strategy to efficiently discover and characterize the functional lncRNAs and elucidate their potential molecular mechanisms.

