TraceQC: An R package for quality control of CRISPR lineage tracing data

2020 ◽  
Author(s):  
Jingyuan Hu ◽  
Rami Al-Ouran ◽  
Xiang Zhang ◽  
Zhandong Liu ◽  
Hyun-Hwan Jeong

Abstract
Motivation: The CRISPR-based lineage tracing system is emerging as a powerful new sequencing tool to track cell lineages by marking cells with irreversible genetic mutations. Accurate reconstruction of cell lineages from CRISPR-based data is sensitive to noise, so quality control is critical for filtering out low-quality data points. Yet existing quality control tools for RNA-seq and DNA-seq do not measure features specific to the CRISPR lineage tracing system.
Results: We introduce TraceQC, a quality control package that addresses the challenges of measuring the quality of CRISPR-based lineage tracing data and helps in interpreting and constructing lineage trees.
Availability: The R package is available at https://github.com/LiuzLab/
Contact: [email protected] or [email protected]
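
As a rough illustration of the kind of per-read metrics such quality control involves, the sketch below computes a few simple summaries (read count, fraction of unedited reads, read-length distribution, per-position mismatch rate) for reads covering a CRISPR target site. This is not TraceQC's API: the reference sequence, the reads, and the metric choices are invented for illustration, and a real tool would align reads and classify indels rather than compare positions naively.

    from collections import Counter

    def barcode_qc(reads, reference):
        # Summarise simple quality metrics for reads covering a CRISPR target site.
        # A real QC tool would align reads and classify insertions/deletions;
        # here we only compare bases over the shared prefix of read and reference.
        n = len(reads)
        exact = sum(1 for r in reads if r == reference)
        length_counts = Counter(len(r) for r in reads)
        mismatch = [0] * len(reference)
        for r in reads:
            for i, (a, b) in enumerate(zip(r, reference)):
                if a != b:
                    mismatch[i] += 1
        return {
            "n_reads": n,
            "fraction_unedited": exact / n if n else float("nan"),
            "read_length_distribution": dict(length_counts),
            "per_position_mismatch_rate": [m / n for m in mismatch] if n else [],
        }

    # Toy data: a hypothetical 20-bp target and three reads (one edited, one truncated).
    ref = "ACGTACGTACGTACGTACGT"
    reads = [ref, "ACGTACGTTCGTACGTACGT", "ACGTACGTACGT"]
    print(barcode_qc(reads, ref))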

2020 ◽  
Author(s):  
Maxim Ivanov ◽  
Albin Sandelin ◽  
Sebastian Marquardt

Abstract
Background: The quality of gene annotation determines the interpretation of results obtained in transcriptomic studies. The growing amount of genome sequence information calls for experimental and computational pipelines for de novo transcriptome annotation. Ideally, gene and transcript models should be called from a limited set of key experimental data.
Results: We developed TranscriptomeReconstructoR, an R package which implements a pipeline for automated transcriptome annotation. It relies on integrating features from independent and complementary datasets: (i) full-length RNA-seq for detection of splicing patterns and (ii) high-throughput 5' and 3' tag sequencing data for accurate definition of gene borders. The pipeline can also take a nascent RNA-seq dataset to supplement the called gene model with transient transcripts. As a proof of principle, we reconstructed de novo the transcriptional landscape of wild-type Arabidopsis thaliana seedlings. A comparison to the existing transcriptome annotations revealed that our gene model is more accurate and comprehensive than the two most commonly used community gene models, TAIR10 and Araport11. In particular, we identify thousands of transient transcripts missing from the existing annotations. Our new annotation promises to improve the quality of A. thaliana genome research.
Conclusions: Our proof-of-concept data suggest a cost-efficient strategy for rapid and accurate annotation of complex eukaryotic transcriptomes. We combine the choice of library preparation methods and sequencing platforms with the dedicated computational pipeline implemented in the TranscriptomeReconstructoR package. The pipeline requires prior knowledge only of the reference genomic DNA sequence, not the transcriptome. The package seamlessly integrates with Bioconductor packages for downstream analysis.


2012 ◽  
Vol 2012 ◽  
pp. 1-8 ◽  
Author(s):  
Janet E. Squires ◽  
Alison M. Hutchinson ◽  
Anne-Marie Bostrom ◽  
Kelly Deis ◽  
Peter G. Norton ◽  
...  

Researchers strive to optimize data quality in order to ensure that study findings are valid and reliable. In this paper, we describe a data quality control program designed to maximize the quality of survey data collected using computer-assisted personal interviews. The quality control program comprised three phases: (1) software development, (2) an interviewer quality control protocol, and (3) a data cleaning and processing protocol. To illustrate the value of the program, we assess its use in the Translating Research in Elder Care Study. We use data collected annually for two years from computer-assisted personal interviews with 3004 healthcare aides. Data quality was assessed using both survey and process data. Missing data and data errors were minimal. Mean and median values and standard deviations were within acceptable limits. Process data indicated that interviewers were unable to conduct interviews in accordance with the program protocol in only 3.4% and 4.0% of cases. Interviewers' perceptions of interview quality also improved significantly between Years 1 and 2. Although this data quality control program was demanding in terms of time and resources, we found that the benefits clearly outweighed the effort required to achieve high-quality data.
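
Checks of the kind summarised above (missingness, out-of-range values, descriptive statistics per variable) can be partly automated during data cleaning. The sketch below is a minimal, generic illustration of such checks; the variable names, permissible ranges, and records are hypothetical and are not taken from the study's instruments.

    import statistics

    # Hypothetical survey variables, permissible ranges, and records (all invented).
    allowed = {"age": (18, 100), "years_experience": (0, 60), "job_satisfaction": (1, 5)}
    records = [
        {"age": 42, "years_experience": 10, "job_satisfaction": 4},
        {"age": 38, "years_experience": None, "job_satisfaction": 5},   # missing value
        {"age": 240, "years_experience": 3, "job_satisfaction": 2},     # data-entry error
    ]

    def quality_report(records, allowed):
        report = {}
        for var, (lo, hi) in allowed.items():
            values = [r.get(var) for r in records]
            present = [v for v in values if v is not None]
            report[var] = {
                "missing": len(values) - len(present),
                "out_of_range": [v for v in present if not lo <= v <= hi],
                "mean": statistics.mean(present) if present else None,
                "median": statistics.median(present) if present else None,
                "sd": statistics.stdev(present) if len(present) > 1 else None,
            }
        return report

    for var, checks in quality_report(records, allowed).items():
        print(var, checks)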


2018 ◽  
Author(s):  
Nikolaos Papadopoulos ◽  
R. Gonzalo Parra ◽  
Johannes Söding

Abstract
Background: Single-cell RNA sequencing (scRNA-seq) is an enabling technology for the study of cellular differentiation and heterogeneity. From snapshots of the transcriptomic profiles of differentiating single cells, the cellular lineage tree that leads from a progenitor population to multiple types of differentiated cells can be derived. The underlying lineage trees of most published datasets are linear or have a single branchpoint, but many studies with more complex lineage trees will soon become available. To test and further develop tools for lineage tree reconstruction, we need test datasets with known trees.
Results: PROSSTT can simulate scRNA-seq datasets for differentiation processes with lineage trees of any desired complexity, noise level, noise model, and size. PROSSTT also provides scripts to quantify the quality of predicted lineage trees.
Availability: https://github.com/soedinglab/
Contact: [email protected]
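
To make the tree-structured simulation idea concrete, the sketch below generates a toy scRNA-seq dataset: mean expression drifts by a random walk in log-space along each branch of a small lineage tree, and counts are drawn from a Poisson distribution. This is a generic stand-in rather than PROSSTT's algorithm; the tree, the parameter values, and the Poisson noise model are simplifying assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    n_genes, cells_per_branch, steps = 50, 30, 30

    # A small lineage tree given as child -> parent; parents are listed before children.
    tree = {"root": None, "A": "root", "B": "root", "C": "B"}

    def simulate(tree):
        # Mean expression performs a random walk in log-space along each branch;
        # cells are sampled uniformly along the branch and counts drawn from Poisson.
        endpoint, cells, labels = {}, [], []
        for branch, parent in tree.items():
            log_mu = (endpoint[parent] if parent else np.full(n_genes, np.log(5.0))).copy()
            path = []
            for _ in range(steps):
                log_mu = log_mu + rng.normal(0.0, 0.05, n_genes)   # drift along pseudotime
                path.append(log_mu)
            endpoint[branch] = log_mu
            for idx in rng.integers(0, steps, cells_per_branch):
                cells.append(rng.poisson(np.exp(path[idx])))
                labels.append(branch)
        return np.array(cells), labels

    counts, branch_labels = simulate(tree)
    print(counts.shape)            # (120, 50): 4 branches x 30 cells, 50 genes
    print(sorted(set(branch_labels)))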


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Maxim Ivanov ◽  
Albin Sandelin ◽  
Sebastian Marquardt

Abstract
Background: The quality of gene annotation determines the interpretation of results obtained in transcriptomic studies. The growing amount of genome sequence information calls for experimental and computational pipelines for de novo transcriptome annotation. Ideally, gene and transcript models should be called from a limited set of key experimental data.
Results: We developed TranscriptomeReconstructoR, an R package which implements a pipeline for automated transcriptome annotation. It relies on integrating features from independent and complementary datasets: (i) full-length RNA-seq for detection of splicing patterns and (ii) high-throughput 5′ and 3′ tag sequencing data for accurate definition of gene borders. The pipeline can also take a nascent RNA-seq dataset to supplement the called gene model with transient transcripts. As a proof of principle, we reconstructed de novo the transcriptional landscape of wild-type Arabidopsis thaliana seedlings and Saccharomyces cerevisiae cells. A comparison to the existing transcriptome annotations revealed that our gene model is more accurate and comprehensive than the most commonly used community gene models: TAIR10 and Araport11 for A. thaliana and SacCer3 for S. cerevisiae. In particular, we identify multiple transient transcripts missing from the existing annotations. Our new annotations promise to improve the quality of A. thaliana and S. cerevisiae genome research.
Conclusions: Our proof-of-concept data suggest a cost-efficient strategy for rapid and accurate annotation of complex eukaryotic transcriptomes. We combine the choice of library preparation methods and sequencing platforms with the dedicated computational pipeline implemented in the TranscriptomeReconstructoR package. The pipeline requires prior knowledge only of the reference genomic DNA sequence, not the transcriptome. The package seamlessly integrates with Bioconductor packages for downstream analysis.
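
The idea of combining long reads with 5′ and 3′ tag data for border definition can be illustrated with a toy example: overlapping long-read alignments are merged into loci, and each locus border is then snapped to a nearby TSS or TES tag cluster. The coordinates, peak positions, and 100-bp search window below are invented, and the real pipeline is considerably more sophisticated (strand handling, peak calling, splicing, and so on).

    def merge_intervals(reads):
        # Merge overlapping long-read alignments (start, end) into loci.
        merged = []
        for start, end in sorted(reads):
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        return merged

    def snap_borders(locus, tss_peaks, tes_peaks, window=100):
        # Replace fuzzy long-read ends with nearby tag-supported positions, if any.
        start, end = locus
        tss = [p for p in tss_peaks if abs(p - start) <= window]
        tes = [p for p in tes_peaks if abs(p - end) <= window]
        return (min(tss, default=start), max(tes, default=end))

    # Invented plus-strand example: two loci, each with a supporting TSS and TES cluster.
    long_reads = [(1050, 3980), (1100, 4020), (6000, 7500)]
    tss_peaks = [1002, 5990]
    tes_peaks = [4105, 7510]

    for locus in merge_intervals(long_reads):
        print("locus", locus, "->", snap_borders(locus, tss_peaks, tes_peaks))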


2018 ◽  
Author(s):  
Tito Candelli ◽  
Philip Lijnzaad ◽  
Mauro J Muraro ◽  
Hindrik Kerstens ◽  
Patrick Kemmeren ◽  
...  

Abstract
Despite the meteoric rise of single-cell RNA-seq, only a few preprocessing pipelines exist that are able to perform all steps from the original fastq files to a gene expression table ready for further analysis. Here we present Sharq, a versatile preprocessing pipeline designed to work with plate-based 3’-end protocols that include Unique Molecular Identifiers (UMIs). Sharq performs stringent step-wise trimming of reads, assigns them to features according to a flexible hierarchical model, and uses the barcode and UMI information to avoid amplification biases and produce gene expression tables. Additionally, Sharq provides an extensive plate diagnostics report for quality control and troubleshooting, including the detection of spatial artefacts. The diagnostics report includes measures of the quality of the individual plate wells as well as a robust assessment of which of them contain material from live cells. Collectively, the innovative approaches presented here provide a valuable tool for processing and quality control of single-cell RNA-seq data.
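
The UMI step mentioned above can be illustrated independently of Sharq: once each read carries a cell barcode and a UMI, PCR duplicates collapse by counting distinct UMIs per (cell, gene) pair. The sketch below shows this counting logic on made-up reads; it is not Sharq's implementation.

    from collections import defaultdict

    # Made-up reads as (cell_barcode, gene, UMI) triples; counting distinct UMIs per
    # (cell, gene) collapses PCR duplicates into single molecules.
    reads = [
        ("ACGT", "GeneA", "TTT"),
        ("ACGT", "GeneA", "TTT"),   # PCR duplicate of the previous read
        ("ACGT", "GeneA", "GGC"),
        ("ACGT", "GeneB", "AAA"),
        ("TGCA", "GeneA", "CCC"),
    ]

    umis = defaultdict(set)
    for cell, gene, umi in reads:
        umis[(cell, gene)].add(umi)

    expression = {key: len(molecules) for key, molecules in umis.items()}
    print(expression)   # {('ACGT', 'GeneA'): 2, ('ACGT', 'GeneB'): 1, ('TGCA', 'GeneA'): 1}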


2018 ◽  
Author(s):  
Xiuwei Zhang ◽  
Chenling Xu ◽  
Nir Yosef

The abundance of new computational methods for processing and interpreting transcriptomes at a single cell level raises the need for in-silico platforms for evaluation and validation. Simulated datasets which resemble the properties of real datasets can aid in method development and prioritization as well as in questions in experimental design by providing an objective ground truth. Here, we present SymSim, a simulator software that explicitly models the processes that give rise to data observed in single cell RNA-Seq experiments. The components of the SymSim pipeline pertain to the three primary sources of variation in single cell RNA-Seq data: noise intrinsic to the process of transcription, extrinsic variation that is indicative of different cell states (both discrete and continuous), and technical variation due to low sensitivity and measurement noise and bias. Unlike other simulators, the parameters that govern the simulation process directly represent meaningful properties such as mRNA capture rate, the number of PCR cycles, sequencing depth, or the use of unique molecular identifiers. We demonstrate how SymSim can be used for benchmarking methods for clustering and differential expression and for examining the effects of various parameters on their performance. We also show how SymSim can be used to evaluate the number of cells required to detect a rare population and how this number deviates from the theoretical lower bound as the quality of the data decreases. SymSim is publicly available as an R package and allows users to simulate datasets with desired properties or matched with experimental data.
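
A minimal sketch of the layered-noise idea, under strong simplifying assumptions: per-cell promoter activity drawn from a Beta distribution stands in for intrinsic bursting, a Poisson draw produces "true" counts, and binomial downsampling with a fixed capture rate mimics technical loss. The parameter values are illustrative only, and this is not SymSim's actual kinetic model.

    import numpy as np

    rng = np.random.default_rng(1)

    def simulate_gene(n_cells, k_on=0.3, k_off=0.7, scale=100, capture_rate=0.1):
        # Per-cell promoter activity ~ Beta(k_on, k_off) stands in for bursting kinetics;
        # Poisson gives the "true" transcript counts; binomial thinning mimics capture loss.
        p_active = rng.beta(k_on, k_off, n_cells)
        true_counts = rng.poisson(scale * p_active)
        observed = rng.binomial(true_counts, capture_rate)
        return true_counts, observed

    true_counts, observed = simulate_gene(n_cells=1000)
    print("mean true count:", true_counts.mean())
    print("mean observed count:", observed.mean())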


2018 ◽  
Vol 4 (Supplement 2) ◽  
pp. 156s-156s
Author(s):  
S. Rayne ◽  
J. Meyerowitz ◽  
G. Even-Tov ◽  
H. Rae ◽  
N. Tapela ◽  
...  

Background and context: Breast cancer is one of the most common cancers in most resource-constrained environments worldwide. Although breast cancer awareness has improved, a lack of understanding of the diagnosis and management can cause patient anxiety and noncompliance, and may ultimately affect survival through compromised or delayed care. South African women attending government hospitals are diverse, with differing levels of income, education and support, and they often lack access to appropriate information about their cancer care.
Aim: A novel bioinformatics data management system was conceived through a close collaboration between Wits Biomedical Informatics and Translational Science (Wits-BITS) and academic breast cancer surgeons. The aim was to develop a platform that captures epidemiologic data and synchronously converts it into a personalised cancer plan and “take-home” information sheet for the patient.
Strategy/Tactics: The concept of a clinician “customer” was used, in which the “currency” with which clinicians paid for the database service was accurate data. For this payment they received the “product” of an immediate personalised information sheet for their patient.
Program/Policy process: A custom software module was developed to generate individualized patient letters containing a mixture of template text and information from the patient's medical record. The letter is populated with the patient's name, the clinic where they were seen, and a personalised explanation of the patient's specific cancer stage according to the TNM system.
Outcomes: Through continuous use with patient and clinician feedback, the quality of data in the system improved. Patients valued the personalised information sheet, which allowed them and their families to understand and be reassured by the management plan. Clinicians found that the quality of the information sheet provided instant feedback on the comprehensiveness of their data input, which in turn ensured compliance and the quality of data points.
What was learned: Using a consumer model and cross-discipline collaboration, in a setting where access to appropriate patient information is normally poor and data entry by overburdened clinicians is often incomplete, a low-cost model of high-quality, real-time data collection was achieved by the clinicians best qualified to enter correct data points. Patients also benefited immediately from participation in the database, through personalised information sheets that improved their understanding of their cancer care.
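
The letter-generation step described above is essentially template filling from structured record fields. The sketch below is a hypothetical illustration only: the field names, the drastically simplified stage-grouping rule, and the wording are invented and are not the Wits-BITS module.

    STAGE_TEXT = {
        "I": "an early-stage cancer that is usually treated with surgery first",
        "II": "a cancer that may need a combination of surgery and other treatments",
        "IV": "a cancer that has spread and is treated to control it and relieve symptoms",
    }

    def stage_group(t, n, m):
        # Drastically simplified stand-in for the TNM stage-grouping rules.
        if m != "M0":
            return "IV"
        return "II" if (n != "N0" or t in ("T2", "T3", "T4")) else "I"

    def patient_letter(record):
        stage = stage_group(record["T"], record["N"], record["M"])
        explanation = STAGE_TEXT.get(stage, "a cancer that needs further specialist discussion")
        return (
            f"Dear {record['name']},\n"
            f"Thank you for attending {record['clinic']} today.\n"
            f"Your breast cancer is stage {stage} ({record['T']} {record['N']} {record['M']}), "
            f"which means it is {explanation}.\n"
        )

    print(patient_letter({"name": "Jane Doe", "clinic": "the breast clinic",
                          "T": "T2", "N": "N0", "M": "M0"}))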


2018 ◽  
Author(s):  
Jesse Kerkvliet ◽  
Arthur de Fouchier ◽  
Michiel van Wijk ◽  
Astrid T. Groot

Abstract
Transcriptome quality control is an important step in RNA-seq experiments. However, the quality of de novo assembled transcriptomes is difficult to assess because there is no reference genome to compare the assembly to. We developed a method to assess and improve the quality of de novo assembled transcriptomes by focusing on the removal of chimeric sequences, which can result from misassembled contigs that merge two transcripts into one. The method is incorporated into a pipeline that we named Bellerophon, which is broadly applicable and easy to use. Bellerophon first uses the quality-assessment tool TransRate to score the assembly, then applies a Transcripts Per Million (TPM) filter to remove lowly expressed contigs and CD-HIT-EST to remove nearly identical contigs. To validate the quality of this method, we performed three benchmark experiments: (1) computational creation of chimeras, (2) identification of chimeric contigs in a transcriptome assembly, and (3) a simulated RNA-seq experiment using a known reference transcriptome. Overall, the Bellerophon pipeline removed between 40% and 91.9% of the chimeras in transcriptome assemblies and removed more chimeric than non-chimeric contigs. The Bellerophon sequence of filtration steps is thus a broadly applicable solution for improving transcriptome assemblies.
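
The TPM filtering step can be illustrated in a few lines: convert raw read counts to transcripts per million using contig lengths, then drop contigs below a threshold. The counts, lengths, and the 1-TPM cutoff below are illustrative and are not Bellerophon's defaults.

    def tpm(counts, lengths_bp):
        # Reads per kilobase, then normalise so values sum to one million across contigs.
        rates = {c: counts[c] / (lengths_bp[c] / 1000) for c in counts}
        total = sum(rates.values())
        return {c: rate / total * 1e6 for c, rate in rates.items()}

    counts = {"contig1": 1500, "contig2": 0, "contig3": 800}    # contig2 attracted no reads
    lengths = {"contig1": 2000, "contig2": 1800, "contig3": 900}

    tpm_values = tpm(counts, lengths)
    kept = [c for c, value in tpm_values.items() if value >= 1.0]
    print(tpm_values)
    print("kept contigs:", kept)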


Author(s):  
Jeffrey J. Quinn ◽  
Matthew G. Jones ◽  
Ross A. Okimoto ◽  
Shigeki Nanjo ◽  
Michelle M. Chan ◽  
...  

Abstract
Cancer progression is characterized by rare, transient events that are nonetheless highly consequential to disease etiology and mortality. Detailed cell phylogenies can recount the history and chronology of these critical events, including metastatic seeding. Here, we applied our Cas9-based lineage tracer to study the subclonal dynamics of metastasis in a lung cancer xenograft mouse model, revealing the underlying rates, routes, and drivers of metastasis. We report deeply resolved phylogenies for tens of thousands of metastatically disseminated cancer cells. We observe surprisingly diverse metastatic phenotypes, ranging from metastasis-incompetent to aggressive populations. These phenotypic distinctions result from pre-existing, heritable, and characteristic differences in gene expression, and we demonstrate that these differentially expressed genes can drive invasiveness. Furthermore, metastases transit via diverse, multidirectional tissue routes and seeding topologies. Our work demonstrates the power of tracing cancer progression at unprecedented resolution and scale.
One Sentence Summary: Single-cell lineage tracing and RNA-seq capture diverse metastatic behaviors and drivers in lung cancer xenografts in mice.

