Hierarchical Clustering of DNA k-mer Counts in RNAseq Fastq Files Identifies Sample Heterogeneities

2018 ◽  
Vol 19 (11) ◽  
pp. 3687
Author(s):  
Wolfgang Kaisers  ◽  
Holger Schwender ◽  
Heiner Schaal 

We apply hierarchical clustering (HC) of DNA k-mer counts on multiple Fastq files. The tree structures produced by HC may reflect experimental groups and thereby indicate experimental effects, but clustering of preparation groups indicates the presence of batch effects. Hence, HC of DNA k-mer counts may serve as a diagnostic device. To provide a simple, applicable tool, we implemented sequential analysis of Fastq reads with low memory usage in an R package (seqTools) available on Bioconductor. The approach is validated by analysis of Fastq file batches containing RNAseq data. Analysis of three Fastq batches downloaded from ArrayExpress indicated experimental effects. Analysis of RNAseq data from two cell types (dermal fibroblasts and Jurkat cells) sequenced in our facility indicated the presence of batch effects. The observed batch effects were also present in reads mapped to the human genome and in reads filtered for high quality (Phred > 30). We propose that hierarchical clustering of DNA k-mer counts provides an unspecific diagnostic tool for RNAseq experiments. Further exploration is required once samples are identified as outliers in HC-derived trees.
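The idea in this abstract can be illustrated with a minimal sketch. It is not the seqTools implementation: the function names, the toy reads, the choice of k = 2, the Canberra distance and the average-linkage rule are all assumptions made here for illustration. Each sample's reads are reduced to a k-mer frequency profile, and samples are then merged pairwise by increasing profile distance.

```python
from collections import Counter
from itertools import product


def kmer_profile(reads, k=2):
    """Relative frequencies of all 4**k DNA k-mers over a list of reads."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    # Only pure A/C/G/T k-mers are kept, so k-mers containing N are ignored.
    total = sum(counts[m] for m in kmers) or 1
    return [counts[m] / total for m in kmers]


def canberra(p, q):
    """Canberra distance; positions where both entries are zero are skipped."""
    return sum(abs(a - b) / (a + b) for a, b in zip(p, q) if a + b > 0)


def average_linkage(names, profiles):
    """Naive agglomerative clustering; returns the list of merges in order."""
    clusters = {name: [i] for i, name in enumerate(names)}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        for x in range(len(keys)):
            for y in range(x + 1, len(keys)):
                a, b = clusters[keys[x]], clusters[keys[y]]
                # Mean pairwise distance between members of the two clusters.
                d = sum(canberra(profiles[i], profiles[j])
                        for i in a for j in b) / (len(a) * len(b))
                if best is None or d < best[0]:
                    best = (d, keys[x], keys[y])
        d, left, right = best
        merges.append((left, right, d))
        merged = clusters.pop(left) + clusters.pop(right)
        clusters["(" + left + "," + right + ")"] = merged
    return merges


# Hypothetical toy data: two samples per group, two short reads per sample.
samples = {
    "A1": ["ACGTACGTAC", "ACGTTACGTA"],
    "A2": ["ACGTACGTTA", "TACGTACGTA"],
    "B1": ["GGGCCCGGGC", "CCCGGGCCCG"],
    "B2": ["GGCCCGGGCC", "GCCCGGGCCG"],
}
names = list(samples)
profiles = [kmer_profile(samples[n]) for n in names]
merges = average_linkage(names, profiles)
for left, right, dist in merges:
    print(f"{left} + {right} (distance {dist:.3f})")
```

In this toy input the A samples and the B samples each merge first, and the two groups are joined last. Whether such a grouping reflects an experimental effect or a batch effect depends on what the groups correspond to, which is the diagnostic question the abstract raises.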

Author(s):  
Wolfgang Kaisers ◽  
Holger Schwender ◽  
Heiner Schaal

We apply hierarchical clustering (HC) of DNA k-mer counts on multiple Fastq files. The tree structures produced by HC may reflect experimental groups and thereby indicate experimental effects, but clustering of preparation groups indicates the presence of batch effects. Hence, HC of DNA k-mer counts may serve as an unspecific diagnostic device. To provide a simple, applicable tool, we implemented sequential analysis of Fastq reads with low memory usage in an R package (seqTools) available on Bioconductor. The approach is validated by analysis of Fastq file batches containing RNAseq data. Analysis of three Fastq batches downloaded from ArrayExpress indicated experimental effects. Analysis of RNAseq data from two cell types (dermal fibroblasts and Jurkat cells) sequenced in our facility indicated the presence of batch effects. The observed batch effects were also present in reads mapped to the human genome and in reads filtered for high quality (Phred > 30). We propose that hierarchical clustering of DNA k-mer counts provides an unspecific diagnostic tool and a quality criterion for RNAseq experiments.


Author(s):  
Massimo Andreatta ◽  
Santiago J Carmona

Summary: STACAS is a computational method for the identification of integration anchors in the Seurat environment, optimized for the integration of single-cell (sc) RNA-seq datasets that share only a subset of cell types. We demonstrate that by (i) correcting batch effects while preserving relevant biological variability across datasets, (ii) filtering aberrant integration anchors with a quantitative distance measure and (iii) constructing optimal guide trees for integration, STACAS can accurately align scRNA-seq datasets composed of only partially overlapping cell populations. Availability and implementation: Source code and R package available at https://github.com/carmonalab/STACAS; Docker image available at https://hub.docker.com/repository/docker/mandrea1/stacas_demo.


Author(s):  
Massimo Andreatta ◽  
Santiago J. Carmona

Abstract: Computational tools for the integration of single-cell transcriptomics data are designed to correct batch effects between technical replicates or different technologies applied to the same population of cells. However, they have inherent limitations when applied to heterogeneous sets of data with moderate overlap in cell states or sub-types. STACAS is a package for the identification of integration anchors in the Seurat environment, optimized for the integration of datasets that share only a subset of cell types. We demonstrate that by i) correcting batch effects while preserving relevant biological variability across datasets, ii) filtering aberrant integration anchors with a quantitative distance measure, and iii) constructing optimal guide trees for integration, STACAS can accurately align scRNA-seq datasets composed of only partially overlapping cell populations. We anticipate that the algorithm will be a useful tool for the construction of comprehensive single-cell atlases by integration of the growing amount of single-cell data becoming available in public repositories. Code availability: R package: https://github.com/carmonalab/STACAS. Docker image: https://hub.docker.com/repository/docker/mandrea1/stacas_demo


2020 ◽  
Vol 36 (11) ◽  
pp. 3522-3527 ◽  
Author(s):  
Emanuele Aliverti ◽  
Jeffrey L Tilson ◽  
Dayne L Filer ◽  
Benjamin Babcock ◽  
Alejandro Colaneri ◽  
...  

Motivation: Low-dimensional representations of high-dimensional data are routinely employed in biomedical research to visualize, interpret and communicate results from different pipelines. In this article, we propose a novel procedure to directly estimate t-SNE embeddings that are not driven by batch effects. Without correction, interesting structure in the data can be obscured by batch effects. The proposed algorithm can therefore significantly aid visualization of high-dimensional data. Results: The proposed methods are based on linear algebra and constrained optimization, leading to efficient algorithms and fast computation in many high-dimensional settings. Results on artificial single-cell transcription profiling data show that the proposed procedure successfully removes multiple batch effects from t-SNE embeddings, while retaining fundamental information on cell types. When applied to single-cell gene expression data to investigate mouse medulloblastoma, the proposed method successfully removes batch effects related to mouse identifiers and the date of the experiment, while preserving clusters of oligodendrocytes, astrocytes, and endothelial cells and microglia, which are expected to lie in the stroma within or adjacent to the tumours. Availability and implementation: Source code implementing the proposed approach is available as an R package at https://github.com/emanuelealiverti/BC_tSNE, including a tutorial to reproduce the simulation studies.


2019 ◽  
Vol 14 (2) ◽  
pp. 148-156
Author(s):  
Nighat Noureen ◽  
Sahar Fazal ◽  
Muhammad Abdul Qadir ◽  
Muhammad Tanvir Afzal

Background: Specific combinations of Histone Modifications (HMs) contributing to the histone code hypothesis lead to various biological functions. HM combinations have been utilized by various studies to divide the genome into different regions. These regions have been classified as chromatin states. Mostly, Hidden Markov Model (HMM)-based techniques have been utilized for this purpose. In chromatin studies, data from Next Generation Sequencing (NGS) platforms are used. Chromatin states based on histone modification combinatorics are annotated by mapping them to functional regions of the genome. The number of states predicted by the HMM tools has so far been justified biologically. Objective: The present study aimed at providing a computational scheme to identify the underlying hidden states in the data under consideration. Methods: We proposed a computational scheme, HCVS, based on a hierarchical clustering and visualization strategy in order to achieve the objective of the study. Results: We tested our proposed scheme on a real data set of nine cell types comprising nine chromatin marks. The approach successfully identified the state numbers for various possibilities. The results have also been compared with one of the existing models and showed quite good correlation. Conclusion: The HCVS model not only helps in deciding the optimal number of states for a particular data set but also justifies the results biologically, thereby correlating the computational and biological aspects.


2020 ◽  
Vol 21 (8) ◽  
pp. 2748 ◽  
Author(s):  
Ruth Barral-Arca ◽  
Alberto Gómez-Carballa ◽  
Miriam Cebey-López ◽  
María José Currás-Tuala ◽  
Sara Pischedda ◽  
...  

There is a growing interest in unraveling gene expression mechanisms leading to viral host invasion and infection progression. Current findings reveal that long non-coding RNAs (lncRNAs) are implicated in the regulation of the immune system by influencing gene expression through a wide range of mechanisms. By mining whole-transcriptome shotgun sequencing (RNA-seq) data using machine learning approaches, we detected two lncRNAs (ENSG00000254680 and ENSG00000273149) that are downregulated in a wide range of viral infections and different cell types, including blood mononuclear cells, umbilical vein endothelial cells, and dermal fibroblasts. The efficiency of these two lncRNAs was positively validated in different viral phenotypic scenarios. These two lncRNAs showed a strong downregulation in virus-infected patients when compared to healthy control transcriptomes, indicating that these biomarkers are promising targets for infection diagnosis. To the best of our knowledge, this is the very first study using host lncRNA biomarkers for the diagnosis of human viral infections.


2020 ◽  
Author(s):  
Mohit Goyal ◽  
Guillermo Serrano ◽  
Ilan Shomorony ◽  
Mikel Hernaez ◽  
Idoia Ochoa

Abstract: Single-cell RNA-seq is a powerful tool in the study of the cellular composition of different tissues and organisms. A key step in the analysis pipeline is the annotation of cell-types based on the expression of specific marker genes. Since manual annotation is labor-intensive and does not scale to large datasets, several methods for automated cell-type annotation have been proposed based on supervised learning. However, these methods generally require feature extraction and batch alignment prior to classification, and their performance may become unreliable in the presence of cell-types with very similar transcriptomic profiles, such as differentiating cells. We propose JIND, a framework for automated cell-type identification based on neural networks that directly learns a low-dimensional representation (latent code) in which cell-types can be reliably determined. To account for batch effects, JIND performs a novel asymmetric alignment in which the transcriptomic profile of unseen cells is mapped onto the previously learned latent space, hence avoiding the need of retraining the model whenever a new dataset becomes available. JIND also learns cell-type-specific confidence thresholds to identify and reject cells that cannot be reliably classified. We show on datasets with and without batch effects that JIND classifies cells more accurately than previously proposed methods while rejecting only a small proportion of cells. Moreover, JIND batch alignment is parallelizable, being more than five or six times faster than Seurat integration. Availability: https://github.com/mohit1997/JIND.


2020 ◽  
Author(s):  
Qing Hua ◽  
Wenhao Xu ◽  
Xuefang Shen ◽  
Xi Tian ◽  
Peng Wang ◽  
...  

Background: Surgery remains the most important treatment strategy for solid tumors, such as colorectal cancer (CRC); however, a number of studies have suggested that surgical stress contributes to tumor recurrence or distant metastases. Extracellular vesicles (EVs), which contain a rich variety of RNAs with specialized functions and clinical applications, have been shown to be an indicator for diagnosis and prognosis of cancers. The effect of surgical stress on the landscape and characteristics of EV long RNA (exLR) in human blood, however, remains largely unknown. Methods: We present an optimized strategy for exLR sequencing (exLR-seq) of the plasma from three patients with CRC at 4 time points (before surgery [T0], after extubation [T1], 1 day after surgery [T2], and 3 days after surgery [T3]). The “Limma” R package was used to evaluate the dynamic changes of mRNAs and long non-coding (lnc)RNAs from EVs. We also constructed a protein–protein interaction (PPI) network of hub genes and predicted biological processes, cellular components, and molecular functions of gene ontology (GO) functional analysis and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway. Results: We observed a sufficient number of exLRs, including 12,924 mRNAs and 2196 lncRNAs. Both mRNAs and lncRNAs underwent dynamic changes during the peri-operative period. Compared with T0, there were 110 mRNAs differentially expressed after extubation, 60 differentially expressed genes (DEGs) 1 day after surgery, and 50 DEGs 3 days after surgery. A total of 11 genes changed at all 3 time points and were related to regulation of the membrane potential, receptor complex, and passive transmembrane transporter activity. In addition, 22 lncRNAs were differentially expressed after extubation (T1). Nineteen lncRNAs were differentially expressed between T0 and T2, and 38 lncRNAs were differentially expressed between T0 and T3. In addition, we found that only 3 lncRNAs changed at all 3 time points. Interestingly, blood exLRs reflected the tissue origins and relative fractions of different immune cell types. EVs from CD8+ T, CD4+ memory T, and NK cells decreased after surgery, and the absolute quality of EVs from immune cells decreased as well. Conclusion: In summary, this study demonstrated abundant exLRs in human plasma and the dynamic changes of these exLRs; exLRs originating from CD8+ T and CD4+ memory T cells were reduced during the peri-operative period.


BioTechniques ◽  
2020 ◽  
Vol 69 (5) ◽  
pp. 347-355
Author(s):  
Snehal Kadam ◽  
Madhusoodhanan Vandana ◽  
Karishma S Kaushik

Direct contact-based coculture of human dermal fibroblasts and epidermal keratinocytes has been a long-standing and challenging issue owing to different serum and growth factor requirements of the two cell types. Existing protocols employ high serum concentrations (up to 10% fetal bovine serum), complex feeder systems and a range of supplemental factors. These approaches are technically demanding and labor intensive, and pose scientific and ethical limitations associated with the high concentrations of animal serum. On the other hand, serum-free conditions often fail to support the proliferation of one or both cell types when they are cultured together. We have developed two reduced serum approaches (1–2% serum) that support the contact-based coculture of human dermal fibroblasts and immortalized keratinocytes and enable the study of cell migration and wound closure.

