Abstract: Variants in the CDH23 gene are known to be responsible for both syndromic hearing loss (Usher syndrome type 1D: USH1D) and non-syndromic hearing loss (DFNB12). Our series of studies demonstrated that CDH23 variants cause a broad range of non-syndromic hearing loss (DFNB12) phenotypes, from congenital profound hearing loss to late-onset, progressive hearing loss involving the high frequencies. In this study, based on the genetic and clinical data from more than 10,000 patients, the mutational spectrum, clinical characteristics and genotype/phenotype correlations were evaluated. The present results reconfirmed that variants in CDH23 are an important cause of non-syndromic sensorineural hearing loss. In addition, we showed the mutational spectrum in the Japanese population, which is probably representative of the East Asian population in general, as well as frequent CDH23 variants that might be due to founder effects. The present study demonstrated that CDH23 variants cause a broad range of phenotypes, from non-syndromic to syndromic hearing loss as well as from congenital to age-related hearing loss. Genotype (variant combinations) and phenotype (association with retinitis pigmentosa, onset age) are shown to be well correlated and are thought to be related to the residual function defined by the CDH23 variants.
Abstract: Deep generative models can learn the underlying structure, such as pathways or gene programs, from omics data. We provide an introduction as well as an overview of such techniques, specifically illustrating their use with single-cell gene expression data. For example, the low-dimensional latent representations offered by various approaches, such as variational auto-encoders, help to clarify the relations between observed gene expression and experimental factors or phenotypes. Furthermore, by providing a generative model for the latent and observed variables, deep generative models can generate synthetic observations, which allow us to assess the uncertainty in the learned representations. While deep generative models are useful for learning the structure of high-dimensional omics data by efficiently capturing non-linear dependencies between genes, they are sometimes difficult to interpret due to their neural network building blocks. More precisely, it is difficult to understand the relationship between learned latent variables and observed variables, e.g., gene transcript abundances and external phenotypes. Therefore, we also illustrate current approaches that allow us to infer the relationship between learned latent variables and observed variables as well as external phenotypes. Thereby, we render deep learning approaches more interpretable. In an application with single-cell gene expression data, we demonstrate the utility of the discussed methods.
Abstract: The emergence of SARS-CoV-2 variants stressed the demand for tools that can interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient (MCC) for ProtT5 embeddings of 0.596 ± 0.006 vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) showed our approach to be competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, either consistently or statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~20k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.
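The two-state Matthews Correlation Coefficient quoted above summarizes a binary confusion matrix in a single value between -1 and 1. The following is a minimal sketch of the standard formula, not the evaluation code of the study; the function name, the 1 = conserved / 0 = non-conserved encoding, and the toy labels are assumptions made purely for illustration.

```python
import math


def two_state_mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels.

    Assumed encoding (illustrative only): 1 = conserved, 0 = non-conserved.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when a row or column of the
        # confusion matrix is empty and the coefficient is undefined.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical per-residue labels, invented for this example.
observed = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
print(two_state_mcc(observed, predicted))
```

A value of 1 means perfect agreement, 0 means no better than chance, and -1 means complete disagreement, which is why the measure is often preferred over plain accuracy when the two classes are imbalanced.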
Abstract: Neurofibromatosis type 1 (NF1) is the most frequent disorder associated with multiple café-au-lait macules (CALM), which may either be present at birth or appear during the first year of life. Other NF1-associated features such as skin-fold freckling and Lisch nodules occur later during childhood, whereas dermal neurofibromas are rare in young children and usually only arise during early adulthood. The NIH clinical diagnostic criteria for NF1, established in 1988, include the most common NF1-associated features. Since many of these features are age-dependent, arriving at a definitive diagnosis of NF1 by employing these criteria may not be possible in infancy if CALM are the only clinical feature evident. Indeed, approximately 46% of patients who are diagnosed with NF1 later in life do not meet the NIH diagnostic criteria by the age of 1 year. Further, the 1988 diagnostic criteria for NF1 are not specific enough to distinguish NF1 from other related disorders such as Legius syndrome. In this review, we outline the challenges faced in diagnosing NF1 in young children, and evaluate the utility of the recently revised (2021) diagnostic criteria for NF1, which include the presence of pathogenic variants in the NF1 gene and choroidal anomalies, for achieving an early and accurate diagnosis.
Abstract: The discovery of introns over four decades ago revealed a new vision of genes and their interrupted arrangement. Throughout the years, it has appeared that introns play essential roles in the regulation of gene expression. Unique processing of excised introns through the formation of lariats suggests a widespread role for these molecules in the structure and function of cells. In addition to rapid destruction, these lariats may linger on in the nucleus or may even be exported to the cytoplasm, where they remain stable circular RNAs (circRNAs). Alternative splicing (AS) is a source of diversity in mature transcripts harboring retained introns (RI-mRNAs). Such RNAs may contain one or more entire retained intron(s) (RIs), but they may also have intron fragments resulting from sequential excision of smaller subfragments via recursive splicing (RS), which is characteristic of long introns. There are many potential fates of RI-mRNAs, including their downregulation via nuclear and cytoplasmic surveillance systems and the generation of new protein isoforms with potentially different functions. Various reports have linked the presence of such unprocessed transcripts in mammals to important roles in normal development and in disease-related conditions. In certain human neurological-neuromuscular disorders, including myotonic dystrophy type 2 (DM2), frontotemporal dementia/amyotrophic lateral sclerosis (FTD/ALS) and Duchenne muscular dystrophy (DMD), peculiar processing of long introns has been identified and is associated with their pathogenic effects. In this review, we discuss different mechanisms involved in the processing of introns during AS and the functions of these large sections of the genome in our biology.