FASTQuick: rapid and comprehensive quality assessment of raw sequence reads

Abstract. Background: Rapid and thorough quality assessment of sequenced genomes on an ultra-high-throughput scale is crucial for successful large-scale genomic studies. Comprehensive quality assessment typically requires full genome alignment, which costs a substantial amount of computational resources and turnaround time. Existing tools are either computationally expensive owing to full alignment or lack essential quality metrics because they skip read alignment. Findings: We developed a set of rapid and accurate methods to produce comprehensive quality metrics directly from a subset of raw sequence reads (from whole-genome or whole-exome sequencing) without full alignment. Our methods offer orders of magnitude faster turnaround time than existing full alignment–based methods while providing comprehensive and sophisticated quality metrics, including estimates of genetic ancestry and cross-sample contamination. Conclusions: By rapidly and comprehensively performing the quality assessment, our tool will help investigators detect potential issues in ultra-high-throughput sequence reads in real time at a low computational cost in the early stages of the analyses, ensuring high-quality downstream results and preventing unexpected loss of time, money, and invaluable specimens.

FASTQuick: Rapid and comprehensive quality assessment of raw sequence reads

10.1101/2020.06.10.143768 ◽

2020 ◽

Author(s):

Fan Zhang ◽

Hyun Min Kang

Keyword(s):

Quality Assessment ◽

High Throughput ◽

Large Scale ◽

Computational Cost ◽

Turnaround Time ◽

Quality Metrics ◽

Genome Alignment ◽

Full Genome ◽

Genomic Studies ◽

Downstream Analysis

Abstract. Background: Rapid and thorough quality assessment of sequenced genomes on an ultra-high-throughput scale is crucial for successful large-scale genomic studies. Comprehensive quality assessment typically requires full genome alignment, which costs a significant amount of computational resources and turnaround time. Existing tools are either computationally expensive due to full alignment or lack essential quality metrics because they skip read alignment. Findings: We developed a set of rapid and accurate methods to produce comprehensive quality metrics directly from raw sequence reads without full genome alignment. Our methods offer orders of magnitude faster turnaround time than existing full alignment-based methods while providing comprehensive and sophisticated quality metrics, including estimates of genetic ancestry and contamination. Conclusions: By rapidly and comprehensively performing the quality assessment, our tool will help investigators detect potential issues in ultra-high-throughput sequence reads in real time at a low computational cost, ensuring high-quality downstream analysis and preventing unexpected loss of time, money, and invaluable specimens.

A Replica Based Co-Scheduler (RBS) for Fault Tolerant Computational Grid

Cloud, Grid and High Performance Computing ◽

10.4018/978-1-60960-603-9.ch007 ◽

2011 ◽

pp. 101-116

Author(s):

Zahid Raza ◽

Deo P. Vidyarthi

Keyword(s):

Fault Tolerance ◽

Performance Optimization ◽

Large Scale ◽

Heterogeneous Computing ◽

Fault Tolerant ◽

Turnaround Time ◽

Computational Grid ◽

Successful Execution ◽

Computational Resources ◽

The Cost

A Grid is a parallel and distributed computing network system comprising heterogeneous computing resources spread over multiple administrative domains that offers high-throughput computing. Since the Grid operates at a large scale, there is always a possibility of failure, ranging from hardware to software. The penalty paid for these failures may be very large. The system needs to tolerate the various possible failures which, in spite of many precautions, are bound to happen. Replication is a strategy often used to introduce fault tolerance into the system to ensure successful execution of a job even when some of the computational resources fail. Though replication incurs a heavy cost, a selective degree of replication can offer a good compromise between performance and cost. This chapter proposes a co-scheduler that can be integrated with the main scheduler for the execution of jobs submitted to a computational Grid. The main scheduler may have any performance optimization criteria; the integration of the co-scheduler is an added advantage towards fault tolerance. The chapter evaluates the performance of the co-scheduler with a main scheduler designed to minimize the turnaround time of a modular job, introducing module replication to counter the effects of node failures in a Grid. A simulation study reveals that the model works well under various conditions, resulting in a graceful degradation of the scheduler’s performance while improving the overall reliability offered to the job.
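The trade-off between replication cost and reliability mentioned in this abstract can be illustrated with a small calculation. The sketch below is not the chapter's co-scheduler or its reliability model; it assumes, purely for illustration, that every node fails independently with the same probability and that a module succeeds if at least one of its replicas survives. The function names `job_success_probability` and `min_replicas` are hypothetical.

```python
def job_success_probability(node_failure_prob, replicas_per_module):
    """Probability that every module of a job completes.

    Assumption (not from the source): node failures are independent and
    identically likely, and a module succeeds if any one replica survives.
    """
    p_job = 1.0
    for k in replicas_per_module:
        # A module fails only if all k of its replicas fail.
        p_job *= 1.0 - node_failure_prob ** k
    return p_job


def min_replicas(node_failure_prob, target):
    """Smallest replica count for one module that reaches the target reliability."""
    if not (0.0 <= node_failure_prob < 1.0 and 0.0 <= target < 1.0):
        raise ValueError("require 0 <= node_failure_prob < 1 and 0 <= target < 1")
    k = 1
    while 1.0 - node_failure_prob ** k < target:
        k += 1
    return k


if __name__ == "__main__":
    # Two modules, no replication versus two replicas each.
    print(job_success_probability(0.1, [1, 1]))
    print(job_success_probability(0.1, [2, 2]))
    print(min_replicas(0.1, 0.995))
```

Under these assumptions each extra replica multiplies a module's failure probability by the node failure probability, so reliability gains shrink quickly while cost grows linearly, which is why a selective degree of replication can be a reasonable compromise.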

Desktop Grids

Handbook of Research on Scalable Computing Technologies ◽

10.4018/978-1-60566-661-7.ch003 ◽

2010 ◽

pp. 31-61 ◽

Cited By ~ 2

Author(s):

Franck Cappello ◽

Gilles Fedak ◽

Derrick Kondo ◽

Paul Malecot ◽

Ala Rezmerita

Keyword(s):

High Throughput ◽

Large Scale ◽

Ecological Impact ◽

Desktop Grids ◽

Recent Argument ◽

High Throughput Computing ◽

Desktop Computers ◽

Computing Platforms ◽

Computational Resources

Desktop Grids, literally Grids made of desktop computers, are very popular in the context of “Volunteer Computing” for large-scale “Distributed Computing” projects like SETI@home and Folding@home. They are very appealing as “Internet Computing” platforms for scientific projects seeking a huge amount of computational resources for massive high-throughput computing, like the EGEE project in Europe. Companies are also interested in using cheap computing solutions that do not add extra hardware and cost of ownership. A very recent argument for Desktop Grids is their ecological impact: by scavenging unused CPU cycles without excessively increasing power consumption, they reduce the waste of electricity. This book chapter presents the background of Desktop Grids, their principles and essential mechanisms, the evolution of their architectures, their applications, and the research tools associated with this technology.

A Low Computational-Cost Electronic Payment Scheme for Mobile Commerce with Large-Scale Mobile Users

Wireless Personal Communications ◽

10.1007/s11277-010-0109-2 ◽

2010 ◽

Vol 63 (1) ◽

pp. 83-99 ◽

Cited By ~ 10

Author(s):

Jen-Ho Yang ◽

Chin-Chen Chang

Keyword(s):

Large Scale ◽

Computational Cost ◽

Mobile Commerce ◽

Mobile Users ◽

Electronic Payment ◽

Payment Scheme ◽

Low Computational Cost

Design method of multiuser MIMO system using large scale transmit array with low computational cost based on subarray division

IEICE Communications Express ◽

10.1587/comex.2016xbl0130 ◽

2016 ◽

Vol 5 (11) ◽

pp. 407-412 ◽

Cited By ~ 1

Author(s):

Tetsuki Taniguchi ◽

Yoshio Karasawa

Keyword(s):

Large Scale ◽

Design Method ◽

Mimo System ◽

Computational Cost ◽

Multiuser Mimo ◽

Low Computational Cost

High-performance computing service for bioinformatics and data science

Journal of the Medical Library Association JMLA ◽

10.5195/jmla.2018.512 ◽

2018 ◽

Vol 106 (4) ◽

Author(s):

Jean-Paul Courneya ◽

Alexa Mayo

Keyword(s):

Open Source ◽

High Throughput ◽

High Performance ◽

Large Scale ◽

Data Science ◽

Wet Work ◽

High Throughput Data ◽

Guided Learning ◽

Computational Resources ◽

Performance Computing

Despite having an ideal setup in their labs for wet work, researchers often lack the computational infrastructure to analyze the magnitude of data that result from “-omics” experiments. In this innovative project, the library supports analysis of high-throughput data from global molecular profiling experiments by offering a high-performance computer with open source software along with expert bioinformationist support. The audience for this new service is faculty, staff, and students for whom using the university’s large scale, CORE computational resources is not warranted because these resources exceed the needs of smaller projects. In the library’s approach, users are empowered to analyze high-throughput data that they otherwise would not be able to on their own computers. To develop the project, the library’s bioinformationist identified the ideal computing hardware and a group of open source bioinformatics software to provide analysis options for experimental data such as scientific images, sequence reads, and flow cytometry files. To close the loop between learning and practice, the bioinformationist developed self-guided learning materials and workshops or consultations on topics such as the National Center for Biotechnology Information’s BLAST, Bioinformatics on the Cloud, and ImageJ. Researchers apply the data analysis techniques that they learned in the classroom in an ideal computing environment.

Designing production-friendly machine learning

Proceedings of the VLDB Endowment ◽

10.14778/3484224.3484241 ◽

2021 ◽

Vol 14 (13) ◽

pp. 3420-3420

Author(s):

Matei Zaharia

Keyword(s):

Machine Learning ◽

Open Source ◽

Large Scale ◽

Question Answering ◽

Failure Modes ◽

Computational Cost ◽

Language Models ◽

Software Systems ◽

Resource Cost ◽

Low Computational Cost

Building production ML applications is difficult because of their resource cost and complex failure modes. I will discuss these challenges from two perspectives: the Stanford DAWN Lab and experience with large-scale commercial ML users at Databricks. I will then present two emerging ideas to help address these challenges. The first is "ML platforms", an emerging class of software systems that standardize the interfaces used in ML applications to make them easier to build and maintain. I will give a few examples, including the open-source MLflow system from Databricks [3]. The second idea is models that are more "production-friendly" by design. As a concrete example, I will discuss retrieval-based NLP models such as Stanford's ColBERT [1, 2] that query documents from an updateable corpus to perform tasks such as question-answering, which gives multiple practical advantages, including low computational cost, high interpretability, and very fast updates to the model's "knowledge". These models are an exciting alternative to large language models such as GPT-3.

Large-scale simulation of biomembranes: bringing realistic kinetics to coarse-grained models

10.1101/815571 ◽

2019 ◽

Author(s):

Mohsen Sadeghi ◽

Frank Noé

Keyword(s):

Large Scale ◽

Stochastic Dynamics ◽

Computational Cost ◽

Coarse Grained ◽

Fluctuation Spectrum ◽

Biologically Relevant ◽

Membrane Models ◽

Kinetic Effects ◽

The Cost ◽

Low Computational Cost

Biomembranes are two-dimensional assemblies of phospholipids that are only a few nanometres thick, but form micrometre-sized structures vital to cellular function. Explicit modelling of biologically relevant membrane systems is computationally expensive, especially when the large number of solvent particles and slow membrane kinetics are taken into account. While highly coarse-grained solvent-free models are available to study the equilibrium behaviour of membranes, their efficiency comes at the cost of sacrificing realistic kinetics, and thereby the ability to predict pathways and mechanisms of membrane processes. Here, we present a framework for integrating coarse-grained membrane models with anisotropic stochastic dynamics and continuum-based hydrodynamics, allowing us to simulate large biomembrane systems with realistic kinetics at low computational cost. This paves the way for whole-cell simulations that still include nanometre/nanosecond spatiotemporal resolutions. As a demonstration, we obtain and verify the fluctuation spectrum of a full-sized human red blood cell in a single 150-millisecond-long trajectory. We show how the kinetic effects of different cytoplasmic viscosities can be studied with such a simulation, with predictions that agree with single-cell experimental observations.

A nanoluciferase SARS-CoV-2 for rapid neutralization testing and screening of anti-infective drugs for COVID-19

10.1101/2020.06.22.165712 ◽

2020 ◽

Cited By ~ 7

Author(s):

Xuping Xie ◽

Antonio E. Muruato ◽

Xianwen Zhang ◽

Kumari G. Lokugamage ◽

Camila R. Fontes-Garfias ◽

...

Keyword(s):

High Throughput ◽

Large Scale ◽

A549 Cells ◽

Neutralizing Antibody ◽

Turnaround Time ◽

Wild Type Virus ◽

Antibody Activity ◽

Reporter Virus ◽

Antibody Testing ◽

Tenofovir Alafenamide

Abstract: A high-throughput platform would greatly facilitate COVID-19 serological testing and antiviral screening. Here we report a nanoluciferase SARS-CoV-2 (SARS-CoV-2-Nluc) that is genetically stable and replicates similarly to the wild-type virus in cell culture. We demonstrate that the optimized reporter virus assay in Vero E6 cells can be used to measure neutralizing antibody activity in patient sera and produces results in concordance with a plaque reduction neutralization test (PRNT). Compared with the low-throughput PRNT (3 days), the SARS-CoV-2-Nluc assay has substantially shorter turnaround time (5 hours) with a high-throughput testing capacity. Thus, the assay can be readily deployed for large-scale vaccine evaluation and neutralizing antibody testing in humans. Additionally, we developed a high-throughput antiviral assay using SARS-CoV-2-Nluc infection of A549 cells expressing human ACE2 receptor (A549-hACE2). When tested against this reporter virus, remdesivir exhibited substantially more potent activity in A549-hACE2 cells compared to Vero E6 cells (EC50 0.115 vs 1.28 μM), while this difference was not observed for chloroquine (EC50 1.32 vs 3.52 μM), underscoring the importance of selecting appropriate cells for antiviral testing. Using the optimized SARS-CoV-2-Nluc assay, we evaluated a collection of approved and investigational antivirals and other anti-infective drugs. Nelfinavir, rupintrivir, and cobicistat were identified as the most selective inhibitors of SARS-CoV-2-Nluc (EC50 0.77 to 2.74 μM). In contrast, most of the clinically approved antivirals, including tenofovir alafenamide, emtricitabine, sofosbuvir, ledipasvir, and velpatasvir were inactive at concentrations up to 10 μM. Collectively, this high-throughput platform represents a reliable tool for rapid neutralization testing and antiviral screening for SARS-CoV-2.

An Algorithm to Estimate the Two-Way Fixed Effects Model

Journal of Econometric Methods ◽

10.1515/jem-2014-0008 ◽

2016 ◽

Vol 5 (1) ◽

Cited By ~ 8

Author(s):

Paulo Somaini ◽

Frank A. Wolak

Keyword(s):

Least Squares ◽

Method Of Moments ◽

Fixed Effects ◽

Asymptotic Variance ◽

Generalized Method Of Moments ◽

Computational Cost ◽

Ordinary Least Squares ◽

Fixed Effects Model ◽

Computational Resources ◽

Low Computational Cost

Abstract: We present an algorithm to estimate the two-way fixed effects linear model. The algorithm relies on the Frisch-Waugh-Lovell theorem and applies to ordinary least squares (OLS), two-stage least squares (TSLS) and generalized method of moments (GMM) estimators. The coefficients of interest are computed using the residuals from the projection of all variables on the two sets of fixed effects. Our algorithm has three desirable features. First, it manages memory and computational resources efficiently, which speeds up the computation of the estimates. Second, it allows the researcher to estimate multiple specifications using the same set of fixed effects at a very low computational cost. Third, the asymptotic variance of the parameters of interest can be consistently estimated using standard routines on the residualized data.
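The residualization idea in this abstract can be sketched in a few lines. The example below is not the authors' algorithm: it swaps in plain alternating group demeaning to project each variable off the two sets of fixed effects, then runs OLS on the residuals as the Frisch-Waugh-Lovell theorem permits. It assumes NumPy is available and that group identifiers are non-negative integers; the function names `demean_two_way` and `two_way_fe_ols` are hypothetical.

```python
import numpy as np


def demean_two_way(v, ids_a, ids_b, tol=1e-10, max_iter=1000):
    """Residualize a 1-D variable on two sets of fixed effects.

    Uses alternating group demeaning (an illustrative stand-in, not the
    paper's method). ids_a and ids_b are non-negative integer group labels.
    """
    v = np.array(v, dtype=float)
    ids_a = np.asarray(ids_a)
    ids_b = np.asarray(ids_b)
    n_a, n_b = ids_a.max() + 1, ids_b.max() + 1
    # Guard against unused labels so the division below never hits zero.
    cnt_a = np.maximum(np.bincount(ids_a, minlength=n_a), 1)
    cnt_b = np.maximum(np.bincount(ids_b, minlength=n_b), 1)
    for _ in range(max_iter):
        prev = v.copy()
        # Subtract the group means of the first, then the second, effect.
        v -= (np.bincount(ids_a, weights=v, minlength=n_a) / cnt_a)[ids_a]
        v -= (np.bincount(ids_b, weights=v, minlength=n_b) / cnt_b)[ids_b]
        if np.max(np.abs(v - prev)) < tol:
            break
    return v


def two_way_fe_ols(y, X, ids_a, ids_b):
    """OLS slope estimates after projecting y and X off both fixed effects."""
    X = np.asarray(X, dtype=float)
    y_res = demean_two_way(y, ids_a, ids_b)
    X_res = np.column_stack(
        [demean_two_way(X[:, j], ids_a, ids_b) for j in range(X.shape[1])]
    )
    beta, *_ = np.linalg.lstsq(X_res, y_res, rcond=None)
    return beta
```

Because the residualized regressors depend only on the fixed-effect structure, they can be computed once and reused across several specifications, which is the kind of saving the abstract's second feature describes.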
