CLASS: Accurate and Efficient Splice Variant Annotation from RNA-seq Reads

2014 ◽  
Author(s):  
Li Song ◽  
Sarven Sabunciyan ◽  
Liliana D Florea

Next generation sequencing of cellular RNA is making it possible to characterize genes and alternative splicing in unprecedented detail. However, designing bioinformatics tools to capture splicing variation accurately has proven difficult. Current programs find major isoforms of a gene but miss finer splicing variations, or are sensitive but highly imprecise. We present CLASS, a novel open-source tool for accurate genome-guided transcriptome assembly from RNA-seq reads. CLASS employs a splice graph to represent a gene and its splice variants, a linear program to determine an accurate set of exons, and efficient splice graph-based algorithms to select transcripts. When compared against reference programs, CLASS had the best overall accuracy and could detect up to twice as many splicing events with precision similar to the best reference program. Notably, it was the only tool that produced consistently reliable transcript models for a wide range of applications and sequencing strategies, including very large data sets and ribosomal RNA-depleted samples. Lightweight and multi-threaded, CLASS required less than 3 GB of RAM and less than one day to analyze a set of 350 million reads, making it an excellent choice for transcriptomics studies, from clinical RNA sequencing to alternative splicing analyses and the annotation of new genomes.
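The splice-graph representation is easy to prototype. The sketch below is illustrative only and is not the CLASS implementation: it builds a tiny splice graph as an exon adjacency list (exon coordinates and junctions are made-up examples) and enumerates candidate isoforms as paths through the graph; CLASS additionally uses a linear program to refine exon sets and scoring to pick a compact set of well-supported transcripts.

```python
# Minimal splice-graph sketch: nodes are exons, edges are observed junctions.
# Candidate isoforms correspond to paths from the gene's first to last exon.
# Illustration of the data structure only, not the CLASS algorithm.

exons = {            # hypothetical exon coordinates (start, end)
    "E1": (100, 200),
    "E2": (300, 380),     # cassette exon, sometimes skipped
    "E3": (500, 650),
}
splice_graph = {     # edges = splice junctions supported by reads
    "E1": ["E2", "E3"],   # the E1->E3 edge represents skipping of E2
    "E2": ["E3"],
    "E3": [],
}

def enumerate_isoforms(graph, node, path=None):
    """Depth-first enumeration of exon chains (candidate transcripts)."""
    path = (path or []) + [node]
    if not graph[node]:                 # terminal exon: emit the chain
        yield path
    for nxt in graph[node]:
        yield from enumerate_isoforms(graph, nxt, path)

for isoform in enumerate_isoforms(splice_graph, "E1"):
    print(" -> ".join(isoform))        # E1 -> E2 -> E3  and  E1 -> E3
```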

2018 ◽  
Author(s):  
Dmitry Kobak ◽  
Philipp Berens

Abstract: Single-cell transcriptomics yields ever growing data sets containing RNA expression levels for thousands of genes from up to millions of cells. Common data analysis pipelines include a dimensionality reduction step for visualising the data in two dimensions, most frequently performed using t-distributed stochastic neighbour embedding (t-SNE). It excels at revealing local structure in high-dimensional data, but naive applications often suffer from severe shortcomings, e.g. the global structure of the data is not represented accurately. Here we describe how to circumvent such pitfalls, and develop a protocol for creating more faithful t-SNE visualisations. It includes PCA initialisation, a high learning rate, and multi-scale similarity kernels; for very large data sets, we additionally use exaggeration and downsampling-based initialisation. We use published single-cell RNA-seq data sets to demonstrate that this protocol yields superior results compared to the naive application of t-SNE.
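The core of this protocol translates into a few lines of standard tooling. The sketch below is a hedged approximation using scikit-learn: PCA initialisation rescaled to a small standard deviation and a learning rate of roughly n/12, as recommended; multi-scale similarity kernels and exaggeration for very large data sets are not available in scikit-learn's TSNE and would need a library such as openTSNE. The matrix X is a random placeholder for a cells x genes expression table.

```python
# Sketch of the recommended t-SNE settings with scikit-learn.
# Assumes X is a (n_cells x n_genes) matrix of normalised, log-transformed
# expression values; a random matrix stands in for real data here.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 1000))          # placeholder expression matrix

# 1. Reduce to ~50 principal components first (denoising + speed).
X_pca = PCA(n_components=50, random_state=0).fit_transform(X)

# 2. PCA initialisation, rescaled so the embedding starts small (std ~1e-4).
init = X_pca[:, :2] / np.std(X_pca[:, 0]) * 1e-4

# 3. High learning rate ~ n/12 instead of the old default of 200.
n = X.shape[0]
tsne = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate=n / 12,
    init=init,
    random_state=42,
)
embedding = tsne.fit_transform(X_pca)
print(embedding.shape)                     # (5000, 2)
```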


MRS Bulletin ◽  
2009 ◽  
Vol 34 (10) ◽  
pp. 717-724 ◽  
Author(s):  
David N. Seidman ◽  
Krystyna Stiller

Abstract: Atom-probe tomography (APT) is in the midst of a dynamic renaissance as a result of the development of well-engineered commercial instruments that are robust, ergonomic, and capable of collecting large data sets, hundreds of millions of atoms, in short time periods compared to their predecessor instruments. An APT setup involves a field-ion microscope coupled directly to a special time-of-flight (TOF) mass spectrometer that permits one to determine the mass-to-charge states of individual field-evaporated ions, plus their x-, y-, and z-coordinates in a specimen, in direct space with subnanoscale resolution. The three-dimensional (3D) data sets acquired are analyzed using increasingly sophisticated software programs running on high-end workstations, which permit one to handle ever larger data sets. APT has the unique ability to dissect a lattice, with subnanometer-scale spatial resolution, using either voltage or laser pulses, on an atom-by-atom and atomic plane-by-plane basis, and to reconstruct it in 3D with the chemical identity of each detected atom determined by TOF mass spectrometry. Employing pico- or femtosecond laser pulses, at wavelengths from the visible (green or blue) to the ultraviolet, makes the analysis of metallic, semiconducting, ceramic, and organic materials practical, with varying degrees of success. The use of dual-beam focused ion-beam microscopy to prepare microtip specimens from multilayer and surface films and semiconductor devices, and to produce site-specific specimens, greatly extends the capabilities of APT to a wider range of scientific and engineering problems than could previously be studied, for a wide range of materials: metals, semiconductors, ceramics, biominerals, and organic materials.
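The TOF measurement itself reduces to a simple kinematic relation: an ion of mass m and charge state n accelerated through an effective voltage V over a flight path d satisfies m/n = 2eV(t/d)^2. The snippet below is a back-of-the-envelope illustration of that relation with made-up instrument parameters, not a reconstruction workflow.

```python
# Back-of-the-envelope mass-to-charge calculation from time of flight.
# neV = (1/2) m v^2 with v = d/t  =>  m/n = 2 e V (t/d)^2
# All instrument parameters below are illustrative placeholders.
E_CHARGE = 1.602176634e-19      # elementary charge, C
AMU = 1.66053906660e-27         # atomic mass unit, kg

def mass_to_charge(voltage_v, flight_path_m, tof_s):
    """Return m/n in atomic mass units (Da per charge state)."""
    m_over_n_kg = 2.0 * E_CHARGE * voltage_v * (tof_s / flight_path_m) ** 2
    return m_over_n_kg / AMU

# Example: 5 kV effective voltage, 10 cm flight path, 529 ns flight time
print(round(mass_to_charge(5e3, 0.10, 529e-9), 1))   # ~27.0 Da, e.g. Al+
```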


2017 ◽  
Author(s):  
Zhigang Lu ◽  
Yanyan Zhang

Abstract: In biological research, analysis of large data sets, such as RNA-seq gene expression data, often involves visualising thousands of data points and querying associated databases. Static charts produced by traditional tools lack the ability to reveal the underlying information, and separate database queries are laborious and involve a great deal of manual effort. Interactive charting can make the data transparent, but visualisation tools often require programming skills, which hinders many academic users. We present here an open-source chart editor for interactive visualisation, designed for academic users with no programming experience. It can not only visualise the data interactively, but also link the data points to external databases, saving the user a great deal of manual effort. We believe that interactive visualisation using such tools will facilitate the analysis of large data sets as well as the presentation and interpretation of the data.
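The idea of linking plotted points to external database records can be sketched with off-the-shelf tools. The example below is not the authors' chart editor; it is a minimal illustration using Bokeh in which hovering shows per-gene values and clicking a point opens an NCBI Gene search for that gene symbol. The gene names, values, and URL template are assumptions for the sake of the example.

```python
# Minimal interactive scatter plot with hover tooltips and click-through links.
# Illustrative only; gene names and values are made up, and clicking a point
# simply opens an NCBI Gene search for the selected gene symbol.
from bokeh.models import ColumnDataSource, HoverTool, OpenURL, TapTool
from bokeh.plotting import figure, show

source = ColumnDataSource(data=dict(
    gene=["TP53", "MYC", "GAPDH", "ACTB"],
    log2fc=[2.1, -1.4, 0.1, -0.2],
    neglog10p=[8.5, 6.2, 0.3, 0.4],
))

p = figure(title="Differential expression (illustrative)",
           x_axis_label="log2 fold change",
           y_axis_label="-log10(p)",
           tools="tap,pan,wheel_zoom,reset")
p.scatter("log2fc", "neglog10p", size=10, source=source)

p.add_tools(HoverTool(tooltips=[("gene", "@gene"),
                                ("log2FC", "@log2fc"),
                                ("-log10 p", "@neglog10p")]))

# Clicking a point opens the corresponding database query in the browser.
taptool = p.select(type=TapTool)
taptool.callback = OpenURL(url="https://www.ncbi.nlm.nih.gov/gene/?term=@gene")

show(p)  # writes and opens a standalone HTML file
```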


Author(s):  
Steve Blair ◽  
Jon Cotter

The need for high-performance Data Mining (DM) algorithms is driven by the exponentially increasing availability of data such as images, audio and video from a variety of domains, including social networks and the Internet of Things (IoT). Deep learning is an emerging area of pattern recognition and Machine Learning (ML) research. It uses many nonlinear processing layers of artificial neurons to learn and represent data at increasing levels of abstraction. Deep learning models, which can be deployed on cloud platforms and large computational systems, are able to capture the complex structure of large data sets. Heterogeneity is one of the most prominent characteristics of large data sets, and Heterogeneous Computing (HC) raises issues of system integration and Advanced Analytics. This article presents HC processing techniques, Big Data Analytics (BDA), tools for large data sets, and some classic ML and DM methodologies. The application of deep learning to Data Analytics is investigated, and the benefits of integrating BDA, deep learning, HPC (High Performance Computing), and HC are highlighted. Approaches to Data Analytics and to coping with a wide range of data are also discussed.


2020 ◽  
Vol 15 (5) ◽  
pp. 420-430 ◽  
Author(s):  
Fernando Mora-Márquez ◽  
José Luis Vázquez-Poletti ◽  
Víctor Chano ◽  
Carmen Collada ◽  
Álvaro Soto ◽  
...  

Background: Bioinformatics software for RNA-seq analysis has high computational requirements in terms of the number of CPUs, RAM size, and processor characteristics. Specifically, de novo transcriptome assembly demands a large computational infrastructure due to the massive data size and the complexity of the algorithms employed. Comparative studies on the quality of the transcriptomes yielded by de novo assemblers have been published previously, lacking, however, a hardware efficiency-oriented approach to help select the assembly hardware platform in a cost-efficient way.
Objective: We tested the performance of two popular de novo transcriptome assemblers, Trinity and SOAPdenovo-Trans (SDNT), in terms of cost-efficiency and quality to assess their limitations, and we provide troubleshooting advice and guidelines for running transcriptome assemblies efficiently.
Methods: We built virtual machines with different hardware characteristics (CPU number, RAM size) in the Amazon Elastic Compute Cloud of Amazon Web Services. Using simulated and real data sets, we measured the elapsed time, cost, CPU percentage and output size of small and large data set assemblies.
Results: For small data sets, SDNT outperformed Trinity by an order of magnitude, significantly reducing the duration and cost of the assembly. For large data sets, Trinity performed better than SDNT. Both assemblers produced good-quality transcriptomes.
Conclusion: The selection of the optimal transcriptome assembler and the provisioning of computational resources depend on the combined effect of the size and complexity of the RNA-seq experiment.
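Measuring elapsed time, peak memory and cost for an assembly run can be scripted in a few lines. The sketch below is a generic harness under stated assumptions: the assembler command line is a placeholder (to be replaced by the actual Trinity or SOAPdenovo-Trans invocation), and the hourly EC2 price must be filled in for the instance type being tested.

```python
# Generic benchmarking harness for an assembly run on a Linux cloud VM.
# ASSEMBLER_CMD and HOURLY_PRICE_USD are placeholders, not real settings.
import resource
import subprocess
import time

ASSEMBLER_CMD = ["bash", "-c", "sleep 2"]   # placeholder for the real command
HOURLY_PRICE_USD = 1.53                     # placeholder on-demand price

start = time.perf_counter()
subprocess.run(ASSEMBLER_CMD, check=True)
elapsed_s = time.perf_counter() - start

# Peak resident set size of child processes (kilobytes on Linux).
peak_rss_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss

cost = elapsed_s / 3600 * HOURLY_PRICE_USD
print(f"elapsed: {elapsed_s:.1f} s, peak RSS: {peak_rss_kb / 1e6:.2f} GB, "
      f"estimated cost: ${cost:.2f}")
```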


2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Dmitry Kobak ◽  
Philipp Berens

Abstract: Single-cell transcriptomics yields ever growing data sets containing RNA expression levels for thousands of genes from up to millions of cells. Common data analysis pipelines include a dimensionality reduction step for visualising the data in two dimensions, most frequently performed using t-distributed stochastic neighbour embedding (t-SNE). It excels at revealing local structure in high-dimensional data, but naive applications often suffer from severe shortcomings, e.g. the global structure of the data is not represented accurately. Here we describe how to circumvent such pitfalls, and develop a protocol for creating more faithful t-SNE visualisations. It includes PCA initialisation, a high learning rate, and multi-scale similarity kernels; for very large data sets, we additionally use exaggeration and downsampling-based initialisation. We use published single-cell RNA-seq data sets to demonstrate that this protocol yields superior results compared to the naive application of t-SNE.


Author(s):  
John A. Hunt

Spectrum-imaging is a useful technique for comparing different processing methods on very large data sets that are identical for each method. This paper compares methods of electron energy-loss spectroscopy (EELS) quantitative analysis on the Al-Li system. The spectrum-image analyzed here was obtained from an Al-10 at.% Li foil aged to produce δ' precipitates that can span the foil thickness. Two 1024-channel EELS spectra offset in energy by 1 eV were recorded and stored at each pixel in the 80x80 spectrum-image (25 Mbytes). An energy range of 39-89 eV (20 channels/eV) is represented. During processing, the spectra are either subtracted to create an artifact-corrected difference spectrum, or the energy offset is numerically removed and the spectra are added to create a normal spectrum. The spectrum-images are processed into 2D floating-point images using methods and software described in [1].
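The per-pixel arithmetic described here, subtracting the two energy-offset spectra for an artifact-corrected difference spectrum, or shifting one spectrum back by the 1 eV offset and adding, can be expressed compactly in array form. The sketch below uses synthetic data with the stated dimensions (80x80 pixels, 1024 channels, 20 channels/eV); it illustrates the arithmetic only and is not the original software.

```python
# Per-pixel combination of two energy-offset EELS spectra.
# spectra_a and spectra_b are recorded 1 eV apart; at 20 channels/eV the
# offset corresponds to 20 channels. Synthetic Poisson counts stand in for
# the real 80x80x1024 spectrum-image.
import numpy as np

CHANNELS_PER_EV = 20
OFFSET_CHANNELS = 1 * CHANNELS_PER_EV       # 1 eV energy offset

rng = np.random.default_rng(0)
spectra_a = rng.poisson(100, size=(80, 80, 1024)).astype(np.float32)
spectra_b = rng.poisson(100, size=(80, 80, 1024)).astype(np.float32)

# Artifact-corrected difference spectrum: channel-wise subtraction, so fixed
# detector artifacts common to both acquisitions cancel.
difference = spectra_a - spectra_b

# "Normal" spectrum: remove the energy offset numerically, then add.
shifted_b = np.roll(spectra_b, OFFSET_CHANNELS, axis=-1)
shifted_b[..., :OFFSET_CHANNELS] = 0        # channels with no counterpart
normal = spectra_a + shifted_b

print(difference.shape, normal.shape)       # (80, 80, 1024) for both
```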


Author(s):  
Thomas W. Shattuck ◽  
James R. Anderson ◽  
Neil W. Tindale ◽  
Peter R. Buseck

Individual particle analysis involves the study of tens of thousands of particles using automated scanning electron microscopy and elemental analysis by energy-dispersive x-ray emission spectroscopy (EDS). EDS produces large data sets that must be analyzed using multivariate statistical techniques. A complete study uses cluster analysis, discriminant analysis, and factor or principal components analysis (PCA). The three techniques are used here in the study of particles sampled during the FeLine cruise to the mid-Pacific Ocean in the summer of 1990. The mid-Pacific aerosol provides information on long-range particle transport, iron deposition, sea salt ageing, and halogen chemistry.

Aerosol particle data sets present a number of difficulties for pattern recognition by cluster analysis. There is a great disparity in the number of observations per cluster and in the range of the variables in each cluster. The variables are not normally distributed, they are subject to considerable experimental error, and many values are zero because of finite detection limits. Many of the clusters overlap considerably because of natural variability, agglomeration, and chemical reactivity.
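A typical multivariate pipeline for such particle data, standardising the elemental compositions, reducing with PCA, and then clustering, can be sketched as follows. The data are simulated and the specific choices (standardisation, k-means with a fixed number of clusters) are illustrative assumptions, not the methods used in this study.

```python
# Illustrative multivariate workflow for particle composition data:
# standardise, reduce with PCA, cluster with k-means.
# The composition matrix is simulated; real EDS data would also need care
# with zeros from detection limits, skewed distributions, and very unequal
# cluster sizes, as noted in the text.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 10,000 particles x 8 elements (e.g. Na, Mg, Si, S, Cl, K, Ca, Fe), made up.
X = np.abs(rng.normal(size=(10_000, 8)))

X_std = StandardScaler().fit_transform(X)           # zero mean, unit variance
scores = PCA(n_components=3).fit_transform(X_std)   # principal component scores
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(scores)

print(np.bincount(labels))     # number of particles assigned to each cluster
```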


Author(s):  
Mykhajlo Klymash ◽  
Olena Hordiichuk-Bublivska ◽  
Ihor Tchaikovskyi ◽  
Oksana Urikova

This article investigates the processing of large arrays of information in distributed systems. Singular value decomposition (SVD) is used to reduce the amount of data processed by eliminating redundancy. Computational efficiency was measured for distributed systems built on the MPI message-passing protocol and on the MapReduce programming model. The efficiency of each technology was analysed for different data sizes: non-distributed systems are inefficient for large volumes of information because of their low computing performance. We propose using distributed systems that apply SVD to reduce the amount of information processed. For systems using the MPI protocol and the MapReduce model, we measured how computation time depends on the number of processes; the results confirm the value of distributed computing when processing large data sets. We also found that distributed systems based on the MapReduce model work much more efficiently than MPI, especially with large amounts of data, whereas MPI performs calculations more efficiently for small amounts of information. As data sets grow, it is advisable to use the MapReduce model.
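Singular value decomposition underpins the data-reduction step described above. The sketch below shows a basic rank-k truncation with NumPy on a synthetic low-rank-plus-noise matrix; distributing the computation with MPI (e.g. via mpi4py) or MapReduce is outside the scope of this illustration.

```python
# Rank-k truncated SVD as a data-reduction step: keep only the k largest
# singular values/vectors and work with the compressed factors.
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=(2000, 20))
R = rng.normal(size=(20, 500))
A = L @ R + 0.1 * rng.normal(size=(2000, 500))   # low-rank signal + noise

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 50                                      # number of retained components
A_k = U[:, :k] * s[:k] @ Vt[:k, :]          # rank-k approximation of A

original = A.size
compressed = U[:, :k].size + k + Vt[:k, :].size
rel_error = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(f"stored values: {compressed} vs {original}, "
      f"relative error {rel_error:.3f}")
```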

