Optimization of data-intensive next generation sequencing in high performance computing

Author(s):  
Nagarajan Kathiresan ◽  
Rashid Al-Ali ◽  
Puthen V. Jithesh ◽  
Tariq AbuZaid ◽  
Ramzi Temanni ◽  
...  
2015 ◽  
Author(s):  
Pierre Carrier ◽  
Bill Long ◽  
Richard Walsh ◽  
Jef Dawson ◽  
Carlos P. Sosa ◽  
...  

High Performance Computing (HPC) best practices offer opportunities to apply lessons learned in areas such as computational chemistry and physics to genomics workflows, specifically Next-Generation Sequencing (NGS) workflows. In this study we briefly describe how distributed-memory parallelism can improve the performance and resource utilization of NGS workflows. We illustrate this point with results on the parallelization of the Inchworm module of the Trinity RNA-Seq pipeline for de novo transcriptome assembly. We show that these types of applications can scale to thousands of cores. Time scaling and memory scaling are discussed at length using two RNA-Seq datasets, from Mus musculus (mouse) and the axolotl (Mexican salamander). Details of the efficient MPI communication and its impact on performance are also shown. We hope to demonstrate that this type of parallelization approach can be extended to most types of bioinformatics workflows, with substantial benefits. The efficient, distributed-memory parallel implementation eliminates memory bottlenecks and dramatically accelerates NGS analysis. We further include a summary of programming paradigms available to the bioinformatics community, such as C++/MPI.
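
The memory-scaling idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes hash-based ownership of k-mers, which the abstract does not specify, and it simulates the ranks inside a single Python process instead of using MPI. The names `owner_rank` and `distributed_kmer_count` are hypothetical, and the reads are toy strings.

```python
from collections import Counter
import zlib


def kmers(read, k):
    """Yield every k-length substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def owner_rank(kmer, n_ranks):
    """Map a k-mer to the rank that owns it, using a deterministic hash."""
    return zlib.crc32(kmer.encode("ascii")) % n_ranks


def distributed_kmer_count(reads, k, n_ranks):
    """Simulate per-rank k-mer tables; each Counter stands in for one rank's memory."""
    tables = [Counter() for _ in range(n_ranks)]
    for read in reads:
        for km in kmers(read, k):
            # In a real MPI program this would be a message to the owning rank.
            tables[owner_rank(km, n_ranks)][km] += 1
    return tables


reads = ["ACGTACGT", "CGTACGTA", "TTACGTAC"]  # toy reads, not real data
tables = distributed_kmer_count(reads, k=4, n_ranks=4)

# Merging the per-rank tables recovers the serial count.
merged = sum(tables, Counter())
```

Because each k-mer lives on exactly one rank, the table held by any single rank shrinks as ranks are added, which is the sense in which a distributed-memory layout can relieve a memory bottleneck.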


2020 ◽  
Vol 73 (9) ◽  
pp. 602-604
Author(s):  
Silvia Bessi ◽  
Francesco Pepe ◽  
Marco Ottaviantonio ◽  
Pasquale Pisapia ◽  
Umberto Malapelle ◽  
...  

In the present study, we analysed 44 formalin-fixed paraffin-embedded (FFPE) samples from different solid tumours using two different next generation sequencing platforms: GeneReader (QIAGEN, Hilden, Germany) and Ion Torrent (Thermo Fisher Scientific, Waltham, Massachusetts, USA). We observed 100% concordance between the platforms. In addition, focusing on variant detection, we found very good agreement between the two tests (Cohen’s kappa=0.84) and, when taking into account the variant allele fraction value for each variant, very high concordance (Pearson’s r=0.94). Our results underline the high performance of GeneReader on FFPE samples and its suitability for routine molecular predictive practice.
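
For readers unfamiliar with the two agreement statistics quoted above, the following sketch shows how each is computed from first principles. The input values are invented toy data and do not reproduce the study's samples or results; the variable names (`calls_a`, `vaf_a`, and so on) are hypothetical.

```python
import math


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical calls."""
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of items where both raters give the same call.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    p_e = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    return (p_o - p_e) / (1 - p_e)


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return cov / (sd_x * sd_y)


# Toy data: 1 = variant detected, 0 = not detected, on two hypothetical platforms.
calls_a = [1, 1, 1, 0, 0, 0, 1, 0]
calls_b = [1, 1, 0, 0, 0, 0, 1, 0]
kappa = cohens_kappa(calls_a, calls_b)  # 0.75 for this toy data

# Toy variant allele fractions (percent) for variants seen on both platforms.
vaf_a = [10.0, 20.0, 30.0, 40.0]
vaf_b = [12.0, 19.0, 33.0, 41.0]
r = pearson_r(vaf_a, vaf_b)
```

Kappa corrects raw agreement for the agreement expected by chance, so it is usually lower than the simple percentage of matching calls; Pearson's r measures only linear association between the allele fractions, not whether the two platforms report identical values.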


2021 ◽  
Vol 13 (21) ◽  
pp. 11782
Author(s):  
Taha Al-Jody ◽  
Hamza Aagela ◽  
Violeta Holmes

There is a tradition at our university of teaching and research in High Performance Computing (HPC) systems engineering. With exascale computing on the horizon and a shortage of HPC talent, there is a need for new specialists to secure the future of research computing. Whilst many institutions provide research computing training for users within their particular domain, few offer HPC engineering and infrastructure-related courses, making it difficult for students to acquire these skills. This paper outlines how and why we are training students in HPC systems engineering, including the technologies used in delivering this goal. We demonstrate the potential for a multi-tenant HPC system for education and research, using a novel container- and cloud-based architecture. This work is supported by our previously published work that uses the latest open-source technologies to create sustainable, fast and flexible turn-key HPC environments with secure access via an HPC portal. The proposed multi-tenant HPC resources can be deployed on a “bare metal” infrastructure or in the cloud. An evaluation of our activities over the last five years is given in terms of recruitment metrics, skills audit feedback from students, and research outputs enabled by the multi-tenant usage of the resource.


2012 ◽  
pp. 841-861
Author(s):  
Chao-Tung Yang ◽  
Wen-Chung Shih

Biology databases are diverse and massive. As a result, researchers must compare each sequence with vast numbers of other sequences. Comparison, whether of structural features or protein sequences, is vital in bioinformatics. These activities require high-speed, high-performance computing power to search through and analyze large amounts of data, and industrial-strength databases to perform a range of data-intensive computing functions. Grid computing and cluster computing meet these requirements. Biological data exist in various web services that help biologists search for and extract useful information. The data formats produced are heterogeneous, and powerful tools are needed to handle the complex and difficult task of integrating the data. This paper presents a review of the technologies and an approach to solve this problem using cluster and grid computing technologies. The authors implement an experimental distributed computing application for bioinformatics, consisting of basic high-performance computing environments (Grid and PC Cluster systems), multiple interfaces at user portals that provide useful graphical interfaces to enable biologists to benefit directly from the use of high-performance technology, and a translation tool for converting biology data into XML format.

