High Performance Computational Analysis of Large-scale Proteome Data Sets to Assess Incremental Contribution to Coverage of the Human Genome

2013 ◽  
Vol 12 (6) ◽  
pp. 2858-2868 ◽  
Author(s):  
Nadin Neuhauser ◽  
Nagarjuna Nagaraj ◽  
Peter McHardy ◽  
Sara Zanivan ◽  
Richard Scheltema ◽  
...


2020 ◽ 
Vol 496 (1) ◽  
pp. 629-637
Author(s):  
Ce Yu ◽  
Kun Li ◽  
Shanjiang Tang ◽  
Chao Sun ◽  
Bin Ma ◽  
...  

ABSTRACT Time series data of celestial objects are commonly used to study valuable and unexpected objects such as extrasolar planets and supernovae in time-domain astronomy. Because data volumes are growing rapidly, traditional manual methods have become infeasible for continuously analysing the accumulated observation data. To meet this demand, we designed and implemented a dedicated tool named AstroCatR that can efficiently and flexibly reconstruct time series data from large-scale astronomical catalogues. AstroCatR loads original catalogue data from Flexible Image Transport System (FITS) files or databases, matches each item to determine which object it belongs to, and finally produces time series data sets. To support high-performance parallel processing of large-scale data sets, AstroCatR uses an extract-transform-load (ETL) pre-processing module to create sky zone files and balance the workload. The matching module uses an overlapped indexing method and an in-memory reference table to improve accuracy and performance. The output of AstroCatR can be stored in CSV files or transformed into other formats as needed. At the same time, the module-based software architecture ensures the flexibility and scalability of AstroCatR. We evaluated AstroCatR with actual observation data from the three Antarctic Survey Telescopes (AST3). The experiments demonstrate that AstroCatR can efficiently and flexibly reconstruct all time series data by setting the relevant parameters and configuration files. Furthermore, the tool is approximately 3× faster than methods using relational database management systems at matching massive catalogues.
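
The matching step lends itself to a compact illustration. The sketch below, in Python, shows the general sky-zone partitioning and nearest-neighbour matching idea described above; it is not AstroCatR's implementation, and the zone height, matching radius, and all function names (zone_id, build_reference_table, reconstruct_time_series) are assumptions made for illustration.

```python
# Minimal sketch of zone-based catalogue cross-matching (not AstroCatR's code).
# Detections are binned into declination zones; each detection is matched to the
# nearest reference object within a tolerance, and matches accumulate per-object
# time series. Zone width, tolerance, and all names here are illustrative.
import math
from collections import defaultdict

ZONE_HEIGHT_DEG = 0.5          # declination width of one sky zone
MATCH_RADIUS_DEG = 1.0 / 3600  # 1 arcsec matching tolerance

def zone_id(dec_deg):
    """Index of the declination zone containing dec_deg."""
    return int((dec_deg + 90.0) // ZONE_HEIGHT_DEG)

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))

def build_reference_table(ref_objects):
    """Group reference objects by zone; neighbouring zones overlap so that
    objects near a zone boundary are still found."""
    table = defaultdict(list)
    for obj_id, ra, dec in ref_objects:
        z = zone_id(dec)
        for neighbour in (z - 1, z, z + 1):
            table[neighbour].append((obj_id, ra, dec))
    return table

def reconstruct_time_series(detections, ref_table):
    """Assign each detection (time, ra, dec, mag) to its reference object."""
    series = defaultdict(list)
    for t, ra, dec, mag in detections:
        candidates = ref_table.get(zone_id(dec), [])
        best = min(candidates,
                   key=lambda c: angular_sep_deg(ra, dec, c[1], c[2]),
                   default=None)
        if best and angular_sep_deg(ra, dec, best[1], best[2]) <= MATCH_RADIUS_DEG:
            series[best[0]].append((t, mag))
    return series

if __name__ == "__main__":
    refs = [("obj-1", 150.00010, -30.00020), ("obj-2", 150.10000, -30.05000)]
    dets = [(2458000.5, 150.00011, -30.00021, 15.2),
            (2458001.5, 150.00009, -30.00019, 15.3)]
    print(reconstruct_time_series(dets, build_reference_table(refs)))
```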


2018 ◽  
Author(s):  
Li Chen ◽  
Bai Zhang ◽  
Michael Schnaubelt ◽  
Punit Shah ◽  
Paul Aiyetan ◽  
...  

ABSTRACT Rapid development and wide adoption of mass spectrometry-based proteomics technologies have empowered scientists to study proteins and their modifications in complex samples on a large scale. This progress has also created unprecedented challenges for individual labs to store, manage and analyze proteomics data, both in the cost of proprietary software and high-performance computing and in the long processing times that discourage the on-the-fly changes to data-processing settings required in exploratory and discovery analysis. We developed an open-source, cloud computing-based pipeline, MS-PyCloud, with graphical user interface (GUI) support, for LC-MS/MS data analysis. The major components of this pipeline include data file integrity validation, MS/MS database search for spectral assignment, false discovery rate estimation, protein inference, determination of post-translational modifications, and quantitation of specific (modified) peptides and proteins. To ensure the transparency and reproducibility of data analysis, MS-PyCloud includes open-source software tools with comprehensive testing and versioning for spectrum assignments. Leveraging public cloud computing infrastructure via Amazon Web Services (AWS), MS-PyCloud scales seamlessly with analysis demand to achieve fast and efficient performance. Application of the pipeline to large-scale iTRAQ/TMT LC-MS/MS data sets demonstrated the effectiveness and high performance of MS-PyCloud. The software can be downloaded at https://bitbucket.org/mschnau/ms-pycloud/downloads/
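
As one concrete piece of the kind of pipeline described above, the sketch below illustrates the standard target-decoy approach to false discovery rate estimation for peptide-spectrum matches. It is a generic illustration rather than MS-PyCloud's own code; the PSM tuple layout, the 1% default threshold, and the function name filter_psms_at_fdr are assumptions.

```python
# Generic target-decoy FDR filtering for peptide-spectrum matches (PSMs).
# This illustrates the standard approach, not MS-PyCloud's implementation;
# the PSM tuple layout and the default 1% threshold are assumptions.
from typing import List, Tuple

def filter_psms_at_fdr(psms: List[Tuple[str, float, bool]],
                       fdr_threshold: float = 0.01) -> List[Tuple[str, float, bool]]:
    """psms: (peptide, score, is_decoy) with higher score = better match.
    Returns the target PSMs that survive the score cutoff at the given FDR."""
    ranked = sorted(psms, key=lambda p: p[1], reverse=True)
    accepted, decoys, targets = [], 0, 0
    for peptide, score, is_decoy in ranked:
        decoys += is_decoy
        targets += not is_decoy
        # Estimated FDR = decoy hits / target hits among everything scored so far.
        fdr = decoys / max(targets, 1)
        if fdr > fdr_threshold:
            break
        if not is_decoy:
            accepted.append((peptide, score, is_decoy))
    return accepted

if __name__ == "__main__":
    example = [("PEPTIDEA", 98.1, False), ("PEPTIDEB", 95.4, False),
               ("XDECOYX", 94.0, True), ("PEPTIDEC", 90.2, False)]
    print(filter_psms_at_fdr(example, fdr_threshold=0.5))
```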


2021 ◽  
Author(s):  
Murtadha Al-Habib ◽  
Yasser Al-Ghamdi

Abstract Extensive computing resources are required to leverage today's advanced geoscience workflows, which are used to explore and characterize giant petroleum resources. In these cases, high-performance workstations are often unable to adequately handle the scale of computing required. The workflows typically utilize complex and massive data sets, which require advanced computing resources to store, process, manage, and visualize the data in its various forms throughout its lifecycle. This work describes a large-scale, end-to-end geoscience interpretation platform customized to run on a cluster-based remote visualization environment. A team of computing-infrastructure and geoscience-workflow experts was established to collaborate on the deployment, which was broken down into separate phases. Initially, an evaluation and analysis phase was conducted to analyze computing requirements and assess potential solutions. A testing environment was then designed, implemented and benchmarked. The third phase used the test environment to determine the scale of infrastructure required for the production environment. Finally, the full-scale customized production environment was deployed to end users. During the testing phase, aspects such as connectivity, stability, interactivity, functionality, and performance were investigated using the largest available geoscience datasets. Multiple computing configurations were benchmarked until optimal performance was achieved under the applicable corporate information security guidelines. The customized production environment was able to execute workflows that could not run on local user workstations. For example, during connectivity, stability and interactivity benchmarking, the test environment was operated for extended periods to ensure stability for workflows that require multiple days to run. To estimate the scale of the required production environment, user portfolios were grouped into categories based on data type, scale and workflow. Continuous monitoring of system resources and utilization enabled continuous improvement of the final solution. A fit-for-purpose, customized remote visualization solution may reduce or ultimately eliminate the need to deploy high-end workstations to all end users. Instead, a shared, scalable and reliable cluster-based solution can serve a much larger user community in a highly performant manner.


2014 ◽  
Author(s):  
R Daniel Kortschak ◽  
David L Adelson

bíogo is a framework designed to ease the development and maintenance of computationally intensive bioinformatics applications. The library is written in the Go programming language, a garbage-collected, strictly typed, compiled language with built-in support for concurrent processing and performance comparable to C and Java. It provides a variety of data types and utility functions to facilitate the manipulation and analysis of large-scale genomic and other biological data. bíogo uses a concise and expressive syntax, lowering the barriers to entry for researchers who need to process large data sets with custom analyses while retaining computational safety and ease of code review. We believe bíogo provides an excellent environment for training and research in computational biology because of its combination of strict typing, simple and expressive syntax, and high performance.


2013 ◽  
pp. 287-321
Author(s):  
Judy Qiu ◽  
Jaliya Ekanayake ◽  
Thilina Gunarathne ◽  
Jong Youl Choi ◽  
Seung-Hee Bae ◽  
...  

Data-intensive computing, cloud computing, and multicore computing are converging as frontiers to address massive data problems with hybrid programming models and/or runtimes, including MapReduce, MPI, and parallel threading on multicore platforms. A major challenge is to use these technologies and large-scale computing resources effectively to advance fundamental science discoveries such as those in the Life Sciences. Recently developed next-generation sequencers have enabled large-scale genome sequencing in areas such as environmental sample sequencing, leading to metagenomic studies of collections of genes. Metagenomic research is just one of the areas that present a significant computational challenge because of the amount and complexity of data to be processed. This chapter discusses the use of innovative data-mining algorithms and new programming models for several Life Sciences applications. The authors particularly focus on methods that are applicable to large data sets coming from high-throughput devices of steadily increasing power. They show results for both clustering and dimension-reduction algorithms, and for the use of MapReduce on modest-size problems. They identify two key areas where further research is essential, and propose to develop new O(N log N) complexity algorithms suitable for the analysis of millions of sequences. They suggest Iterative MapReduce as a promising programming model combining the best features of MapReduce with those of high-performance environments such as MPI.
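
To make the MapReduce pattern referred to above concrete, the following is a toy Python sketch of a map phase and a reduce phase over sequence data (k-mer counting with a process pool). It illustrates only the programming model, not the authors' Iterative MapReduce runtime; the value of k and the input sequences are invented.

```python
# Toy MapReduce-style k-mer counting over DNA sequences with a process pool.
# A minimal sketch of the programming pattern, not the chapter's implementation;
# k and the example inputs are made up.
from collections import Counter
from functools import reduce
from multiprocessing import Pool

K = 3

def map_kmers(sequence: str) -> Counter:
    """Map step: count k-mers in one sequence."""
    return Counter(sequence[i:i + K] for i in range(len(sequence) - K + 1))

def reduce_counts(a: Counter, b: Counter) -> Counter:
    """Reduce step: merge partial counts."""
    a.update(b)
    return a

if __name__ == "__main__":
    sequences = ["ACGTACGT", "TTACGTTA", "ACGTTTAC"]
    with Pool() as pool:
        partial = pool.map(map_kmers, sequences)          # map phase in parallel
    totals = reduce(reduce_counts, partial, Counter())    # reduce phase
    print(totals.most_common(3))
```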


Author(s):  
Marco Atzori ◽  
Wiebke Köpp ◽  
Steven W. D. Chien ◽  
Daniele Massaro ◽  
Fermín Mallor ◽  
...  

Abstract In situ visualization on high-performance computing systems allows us to analyze simulation results that would otherwise be impossible to handle, given the size of the simulation data sets and the execution time of offline post-processing. We develop an in situ adaptor for ParaView Catalyst and Nek5000, a massively parallel Fortran and C code for computational fluid dynamics. We perform a strong scalability test up to 2048 cores on KTH's Beskow Cray XC40 supercomputer and assess the impact of in situ visualization on Nek5000 performance. In our case study, a high-fidelity simulation of turbulent flow, we observe that in situ operations significantly limit the strong scalability of the code, reducing the relative parallel efficiency to only ≈21% on 2048 cores (the relative efficiency of Nek5000 without in situ operations is ≈99%). Through profiling with Arm MAP, we identified a bottleneck in the image composition step (which uses the Radix-kr algorithm), where the majority of the time is spent on MPI communication. We also identified an imbalance of in situ processing time between rank 0 and all other ranks. In our case, better scaling and load balancing in the parallel image composition would considerably improve the performance of Nek5000 with in situ capabilities. In general, this study highlights the technical challenges posed by integrating high-performance simulation codes with data-analysis libraries and by their practical use in complex cases, even when efficient algorithms already exist for a given application scenario.
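
For reference, relative parallel efficiency in a strong-scaling test is the measured speedup over a baseline core count divided by the ideal speedup. The helper below is a generic sketch, not the paper's analysis scripts; the example timings are invented and chosen only to reproduce an efficiency near the reported ≈21%.

```python
# Relative parallel efficiency for a strong-scaling test: speedup relative to a
# baseline core count divided by the ideal speedup. Generic helper, not the
# paper's code; the example timings below are hypothetical.
def relative_efficiency(t_base: float, n_base: int, t_n: float, n: int) -> float:
    """E(n) = (t_base * n_base) / (t_n * n)."""
    return (t_base * n_base) / (t_n * n)

if __name__ == "__main__":
    # Hypothetical wall-clock times per step (seconds) at 256 and 2048 cores.
    print(f"{relative_efficiency(8.0, 256, 4.8, 2048):.2f}")  # ~0.21
```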


2019 ◽  
Author(s):  
Anders Brahme ◽  
Maj Hultén ◽  
Carin Bengtsson ◽  
Andreas Hultgren ◽  
Anders Zetterberg

Abstract Eight different data sets covering the whole human genome are compared with regard to their genomic distribution. A close correlation was found between cytologically detected chiasma and MLH1 immunofluorescence sites and the recombination density distribution from the HapMap project. Sites with a high probability of chromatid breakage after exposure to low- and high-ionization-density radiation are often located inside common and rare Fragile Sites (FSs), indicating that the common Radiation-Induced Breakpoint sites (RIBs) may be a new kind of more local fragility. Furthermore, oncogenes and other cancer-related genes are commonly located in regions with an increased probability of rearrangements during genomic recombination, or in regions with a high probability of copy number changes, possibly because these processes may be involved in oncogene activation and cancer induction. An increased CpG density is linked to regions of high gene density to secure high-fidelity reproduction and survival. To minimize cancer induction, these genes are often located in regions of decreased recombination density and/or higher-than-average CpG density. Interestingly, copy number changes occur predominantly at common RIBs and/or FSs, at least for breast cancers with poor prognosis, and they decrease weakly but significantly in regions with increasing recombination density and CpG density. It is compelling that all these data sets are influenced by the cell's handling of double-strand breaks and, more generally, DNA damage to its genome. In fact, the DNA repair genes systematically avoid regions with a high recombination density. This may be a consequence of natural selection, as they need to remain intact to accurately handle repairable DNA lesions.


Author(s):  
C.K. Wu ◽  
P. Chang ◽  
N. Godinho

Recently, the use of refractory metal silicides as low-resistivity, high-temperature, oxidation-resistant gate materials in large-scale integrated (LSI) circuits has become an important approach in advanced MOS process development (1). This research is a systematic study of the structure and properties of molybdenum silicide thin films and their applicability to high-performance LSI fabrication.


Author(s):  
В.В. ГОРДЕЕВ ◽  
В.Е. ХАЗАНОВ

The choice of the type and size of a milking installation must take into account the maximum planned number of dairy cows, the size of a technological group, the number of milkings per day, and the duration of one milking and of the operator's working shift. An analysis of the technical and economic indicators of the currently most common types of milking installations of the same technical level shows that the Carousel installation has the best specific indicators, while the Herringbone installation requires higher labour and cash costs; the Parallel installation falls in between. In terms of throughput and the required number of operators, the Herringbone is recommended for farms with up to 600 dairy cows, the Parallel for up to 1200 dairy cows, and the Carousel for more than 1200 dairy cows. The Carousel was found to be the most practical, high-performance, easily automated and therefore promising milking system for milking parlours, especially on large-scale dairy farms.
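
For illustration only, the herd-size recommendation above can be written as a small selection rule; the function name and interface below are invented, and the thresholds are taken directly from the text.

```python
# Illustrative helper encoding the herd-size recommendation from the text above
# (Herringbone up to 600 cows, Parallel up to 1200, Carousel above 1200).
# The function name and interface are made up for this sketch.
def recommend_milking_parlour(dairy_cows: int) -> str:
    if dairy_cows <= 600:
        return "Herringbone"
    if dairy_cows <= 1200:
        return "Parallel"
    return "Carousel"

if __name__ == "__main__":
    for herd in (400, 900, 1500):
        print(herd, "->", recommend_milking_parlour(herd))
```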

