The distribution pattern of genetic variation in the transcript isoforms of the alternatively spliced protein-coding genes in the human genome

2015 ◽  
Vol 11 (5) ◽  
pp. 1378-1388 ◽  
Author(s):  
Ting Liu ◽  
Kui Lin

The relationships among the types of transcripts, the classes of coding SNPs and the population frequencies in the human genome.

2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Chao-Hsin Chen ◽  
Chao-Yu Pan ◽  
Wen-chang Lin

Abstract The completion of human genome sequences and the advancement of next-generation sequencing technologies have engendered a clear understanding of all human genes. Overlapping genes are usually observed in compact genomes, such as those of bacteria and viruses. Notably, overlapping protein-coding genes do exist in the human genome. Accordingly, we used the current Ensembl gene annotations to identify overlapping human protein-coding genes. We analysed 19,200 well-annotated protein-coding genes and determined that 4,951 protein-coding genes overlapped with their adjacent genes. Approximately a quarter of all human protein-coding genes were overlapping genes. We observed different clusters of overlapping protein-coding genes, ranging from two genes (paired overlapping genes) to 22 genes. We also divided the paired overlapping protein-coding gene groups into four subtypes. We found that the divergent overlapping gene subtype had a stronger expression association than did the 5ʹ-tandem overlapping and 3ʹ-tandem overlapping subtypes. The majority of paired overlapping genes exhibited comparable, coincidental tissue expression profiles; however, a few overlapping gene pairs displayed distinctive tissue expression association patterns. In summary, we have carefully examined the genomic features and distributions of human overlapping protein-coding genes and found coincidental expression in tissues for most overlapping protein-coding genes.
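As a rough illustration of the clustering step described in this abstract, the following Python sketch groups genes whose genomic intervals overlap on the same chromosome. It is a minimal sketch under stated assumptions, not the authors' pipeline: the `Gene` record, the `overlap_clusters` function, and the toy coordinates are all hypothetical, strand is ignored, and the paper's exact overlap definition (and its four subtypes) is not reproduced here.

```python
from collections import defaultdict
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical gene record; coordinates are 1-based and inclusive."""
    name: str
    chrom: str
    start: int
    end: int


def overlap_clusters(genes):
    """Return lists of gene names whose intervals overlap on one chromosome.

    Genes that overlap nothing are omitted, so every returned cluster
    has at least two members.
    """
    by_chrom = defaultdict(list)
    for gene in genes:
        by_chrom[gene.chrom].append(gene)

    clusters = []
    for chrom in sorted(by_chrom):
        current, current_end = [], None
        # Sweep genes left to right, extending the open cluster while
        # the next gene starts before the cluster's rightmost end.
        for gene in sorted(by_chrom[chrom], key=lambda g: (g.start, g.end)):
            if current and gene.start <= current_end:
                current.append(gene.name)
                current_end = max(current_end, gene.end)
            else:
                if len(current) > 1:
                    clusters.append(current)
                current, current_end = [gene.name], gene.end
        if len(current) > 1:
            clusters.append(current)
    return clusters


# Toy coordinates (assumed for illustration, not real annotations).
genes = [
    Gene("A", "chr1", 100, 500),
    Gene("B", "chr1", 450, 900),
    Gene("C", "chr1", 1200, 1500),
    Gene("D", "chr2", 100, 300),
    Gene("E", "chr2", 250, 400),
    Gene("F", "chr2", 390, 800),
]
print(overlap_clusters(genes))  # [['A', 'B'], ['D', 'E', 'F']]
```

In this toy input, A and B form a paired overlapping group, D, E and F form a three-gene cluster, and C is left out because it overlaps no neighbour.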


Author(s):  
Aysha Divan ◽  
Janice A. Royds

Biological functions require protein and the protein makeup of a cell determines its behaviour and identity. Proteins, therefore, are the most abundant molecules in the body except for water. The approximately 20,000 protein coding genes in the human genome can, by alternative splicing, multiple translation starts, and post-translational modifications, produce over 1,000,000 different proteins, collectively called ‘the proteome’. It is the size of the proteome and not the genome that defines the complexity of an organism. ‘Proteins’ describes the composition and structure of proteins and how they are studied. What information is required in order to understand how proteins work and what happens when this function is impaired in disease?


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Kuo-Feng Tung ◽  
Chao-Yu Pan ◽  
Chao-Hsin Chen ◽  
Wen-chang Lin

Abstract With the considerable accumulation of RNA-Seq transcriptome data, we have extended our understanding of protein-coding gene transcript compositions. However, the complex alternative patterns of human protein-coding gene transcripts complicate gene expression data processing and interpretation. It is essential to exhaustively interrogate complex mRNA isoforms of protein-coding genes with a unified data resource. In order to investigate representative mRNA transcript isoforms to be utilized as transcriptome analysis references, we utilized GTEx data to establish a top-ranked transcript isoform expression data resource for human protein-coding genes. Distinctive tissue-specific expression profiles and modulations could be observed for individual top-ranked transcripts of protein-coding genes. Protein-coding transcripts or genes occupy a much higher expression fraction in transcriptome data. In addition, top-ranked transcripts are the dominantly expressed ones in various normal tissues. Intriguingly, some of the top-ranked transcripts are noncoding splicing isoforms, which implies diverse gene regulation mechanisms. Comprehensive investigation of the tissue expression patterns of top-ranked transcript isoforms is crucial. Thus, we established a web tool to examine top-ranked transcript isoforms in various normal human tissue types, which provides concise transcript information and easy-to-use graphical user interfaces. Investigation of top-ranked transcript isoforms would contribute to understanding the functional significance of distinctive alternatively spliced transcript isoforms.


Blood ◽  
2009 ◽  
Vol 114 (22) ◽  
pp. 3260-3260
Author(s):  
Rosana A Silveira ◽  
Angela A Fachel ◽  
Yuri B Moreira ◽  
Marcia T Delamain ◽  
Carmino Antonio De Souza ◽  
...  

Abstract Abstract 3260 Poster Board III-1 Background: CML treatment with tyrosine kinase inhibitors induces high and durable rates of complete cytogenetic response. Despite treatment efficacy, a significant proportion of patients develop resistance to these drugs. We measured gene expression profiles in an attempt to identify gene pathways that may be associated with dasatinib resistance. Patients and Methods: Mononuclear cells were separated from peripheral blood samples from seven CML patients resistant to imatinib, collected prior to and after dasatinib treatment. Three patients who achieved partial cytogenetic response (PCyR; Ph-positive cells: 1% - 35%) within twelve months were considered responders (R), whereas four patients who failed to achieve PCyR within 12 months of treatment were classified as non-responders. RNA samples prepared from peripheral mononuclear cells were hybridized to Agilent Technologies 4×44K Whole Human Genome Microarrays (WHGM) and 4×44K intronic-exonic custom oligoarrays. The latter was developed by Verjovski-Almeida's group (Nakaya et al, Genome Biology 2007, 8:R43) and contains sense and antisense probes that map to intronic regions in the human genome representing totally (TIN) and partially (PIN) intronic non-coding RNAs (ncRNAs), in addition to probes for the corresponding protein-coding genes of the same loci. Raw microarray data were normalized with the Affy package in the R statistical language, implemented in the Bioconductor platform. Each sample was labeled in replicate with Cy3 or Cy5 and the two were considered technical replicates. Two independent statistical approaches, SAM (Significance Analysis of Microarrays) and Golub's discrimination score (SNR, Signal to Noise Ratio, with permutations), were used to identify differentially expressed transcripts between responder and non-responder patients.
For the intronic-exonic platform, the analysis parameters were FDR 10%, SNR>1.5 and p<0.01, and for the WHGM platform the parameters were FDR 5%, SNR>1.5 and p<0.001. For this latter platform, we also performed a patient leave-one-out analysis. Functions of differentially expressed transcripts were annotated and compared using GO Biological Process categories (www.genetools.microarray.ntu.no/egon). Results: We identified 34 ncRNAs with altered expression (26 overexpressed and 8 underexpressed in responders) in pre-treatment samples and 33 ncRNAs (20 overexpressed and 13 underexpressed in responders) in post-treatment samples. Functions associated with protein-coding genes from the same genomic loci as those of the intronic differentially expressed ncRNAs were: regulation of transcription (PRMT5, SOD2, SSBP3, BCL7A, MLL), signal transduction (PRKCB1, RASGRP2, NF1, PXN) and apoptosis (BCL2, PCSK6, TNFAIP8, EIF4G2). WHGM platform data analysis showed 63 and 250 protein-coding genes differentially expressed in pre- and post-treatment samples, respectively. We observed a higher number of protein-coding genes with altered expression after treatment in the following functions: cell communication, immune response and metabolic process (p<0.02). Conclusions: Overall, these findings indicate that protein-coding genes and intronic ncRNAs may be related to dasatinib resistance and response to treatment. In particular, altered expression of ncRNAs transcribed from the introns of ‘regulation of transcription’ genes could be part of an important alternative mechanism of gene expression control during emergence of resistance. Support: FAPESP (2005/60266-8) Disclosures: No relevant conflicts of interest to declare.


2021 ◽  
Author(s):  
Noah Dukler ◽  
Mehreen R Mughal ◽  
Ritika Ramani ◽  
Yi-Fei Huang ◽  
Adam Siepel

Genome sequencing of tens of thousands of human individuals has recently enabled the measurement of large selective effects for mutations to protein-coding genes. Here we describe a new method, called ExtRaINSIGHT, for measuring similar selective effects at individual sites in noncoding as well as in coding regions of the human genome. ExtRaINSIGHT estimates the prevalence of strong purifying selection, or "ultraselection" (λs), as the fractional depletion of rare single-nucleotide variants (minor allele frequency <0.1%) in a target set of genomic sites relative to matched sites that are putatively neutrally evolving, in a manner that controls for local variation and neighbor-dependence in mutation rate. We show using simulations that, above an appropriate threshold, λs is closely related to the average site-specific selection coefficient against heterozygous point mutations, as predicted at mutation-selection balance. Applying ExtRaINSIGHT to 71,702 whole genome sequences from gnomAD v3, we find particularly strong evidence of ultraselection in evolutionarily ancient miRNAs and neuronal protein-coding genes, as well as at splice sites. Moreover, our estimated selection coefficient against heterozygous amino-acid replacements across the genome (at 1.4%) is substantially larger than previous estimates based on smaller sample sizes. By contrast, we find weak evidence of ultraselection in other noncoding RNAs and transcription factor binding sites, and only modest evidence in ultraconserved elements and human accelerated regions. We estimate that ~0.3-0.5% of the human genome is ultraselected, with one third to one half of ultraselected sites falling in coding regions. These estimates suggest ~0.3-0.4 lethal or nearly lethal de novo mutations per potential human zygote, together with ~2 de novo mutations that are more weakly deleterious.
Overall, our study sheds new light on the genome-wide distribution of fitness effects for new point mutations by combining deep new sequencing data sets and classical theory from population genetics.
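To make the phrase "fractional depletion of rare single-nucleotide variants" concrete, here is a naive Python sketch of that ratio. It is not the ExtRaINSIGHT estimator: the function name `fractional_depletion`, its parameters, and the counts are hypothetical, and the sketch assumes the target and neutral sites are already matched, so it does not control for local variation or neighbor-dependence in mutation rate as the method in the abstract does.

```python
def fractional_depletion(target_rare, target_sites, neutral_rare, neutral_sites):
    """Naive depletion of rare variants at target sites versus neutral sites.

    Returns 1 - (target rare-variant rate / neutral rare-variant rate).
    A value of 0 means no depletion; a value of 1 means rare variants are
    entirely absent from the target set. Illustrative only: assumes the
    two site sets already have matched mutation rates.
    """
    if target_sites <= 0 or neutral_sites <= 0:
        raise ValueError("site counts must be positive")
    neutral_rate = neutral_rare / neutral_sites
    if neutral_rate == 0:
        raise ValueError("neutral set has no rare variants; ratio undefined")
    target_rate = target_rare / target_sites
    return 1.0 - target_rate / neutral_rate


# Made-up counts: 30 rare variants per 1,000 target sites versus
# 50 rare variants per 1,000 matched neutral sites.
estimate = fractional_depletion(30, 1000, 50, 1000)
print(f"naive depletion estimate: {estimate:.2f}")
```

With these made-up counts, the target set carries 60% of the rare-variant density of the neutral set, so the naive depletion is about 0.4.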


2013 ◽  
Vol 42 (5) ◽  
pp. 2820-2832 ◽  
Author(s):  
Nicolas Philippe ◽  
Elias Bou Samra ◽  
Anthony Boureux ◽  
Alban Mancheron ◽  
Florence Rufflé ◽  
...  

Abstract Recent sequencing technologies that allow massive parallel production of short reads are the method of choice for transcriptome analysis. Particularly, digital gene expression (DGE) technologies produce a large dynamic range of expression data by generating short tag signatures for each cell transcript. These tags can be mapped back to a reference genome to identify new transcribed regions that can be further covered by RNA-sequencing (RNA-Seq) reads. Here, we applied an integrated bioinformatics approach that combines DGE tags, RNA-Seq, tiling array expression data and species-comparison to explore new transcriptional regions and their specific biological features, particularly tissue expression or conservation. We analysed tags from a large DGE data set (designated as ‘TranscriRef’). We then annotated 750 000 tags that were uniquely mapped to the human genome according to Ensembl. We retained transcripts originating from both DNA strands and categorized tags corresponding to protein-coding genes, antisense, intronic- or intergenic-transcribed regions and computed their overlap with annotated non-coding transcripts. Using this bioinformatics approach, we identified ∼34 000 novel transcribed regions located outside the boundaries of known protein-coding genes. As demonstrated using sequencing data from human pluripotent stem cells for biological validation, the method could be easily applied for the selection of tissue-specific candidate transcripts. DigitagCT is available at http://cractools.gforge.inria.fr/softwares/digitagct.


Nature ◽  
2005 ◽  
Vol 437 (7062) ◽  
pp. 1153-1157 ◽  
Author(s):  
Carlos D. Bustamante ◽  
Adi Fledel-Alon ◽  
Scott Williamson ◽  
Rasmus Nielsen ◽  
Melissa Todd Hubisz ◽  
...  

2014 ◽  
Author(s):  
Iakes Ezkurdia ◽  
David Juan ◽  
Jose Manuel Rodriguez ◽  
Adam Frankish ◽  
Mark Diekhans ◽  
...  

Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein-coding potential is the detection of cellular protein expression through peptide mass spectrometry experiments. Here we map the peptides detected in 7 large-scale proteomics studies to almost 60% of the protein-coding genes in the GENCODE annotation of the human genome. We find that conservation across vertebrate species and the age of the gene family are key indicators of whether a peptide will be detected in proteomics experiments. We find peptides for most highly conserved genes and for practically all genes that evolved before bilateria. At the same time, there is almost no evidence of protein expression for genes that have appeared since primates, or for genes that do not have any protein-like features or cross-species conservation. We identify 19 non-protein-like features, such as weak conservation, no protein features or ambiguous annotations in major databases, that are indicators of low peptide detection rates. We use these features to describe a set of 2,001 genes that are potentially non-coding, and show that many of these genes behave more like non-coding genes than protein-coding genes. We detect peptides for just 3% of these genes. We suggest that many of these 2,001 genes do not code for proteins under normal circumstances and that they should not be included in the human protein-coding gene catalogue. These potential non-coding genes will be revised as part of the ongoing human genome annotation effort.

