shinyheatmap: ultra fast low memory heatmap web interface for big data genomics

AbstractBackgroundTranscriptomics, metabolomics, metagenomics, and other various next-generation sequencing (-omics) fields are known for their production of large datasets, especially across single-cell sequencing studies. Visualizing such big data has posed technical challenges in biology, both in terms of available computational resources as well as programming acumen. Since heatmaps are used to depict high-dimensional numerical data as a colored grid of cells, efficiency and speed have often proven to be critical considerations in the process of successfully converting data into graphics. For example, rendering interactive heatmaps from large input datasets (e.g., 100k+ rows) has been computationally infeasible on both desktop computers and web browsers. In addition to memory requirements, programming skills and knowledge have frequently been barriers-to-entry for creating highly customizable heatmaps.ResultsWe propose shinyheatmap: an advanced user-friendly heatmap software suite capable of efficiently creating highly customizable static and interactive biological heatmaps in a web browser. shinyheatmap is a low memory footprint program, making it particularly well-suited for the interactive visualization of extremely large datasets that cannot typically be computed in-memory due to size restrictions. Also, shinyheatmap features a built-in high performance web plug-in, fastheatmap, for rapidly plotting interactive heatmaps of datasets as large as 105 − 107 rows within seconds, effectively shattering previous performance benchmarks of heatmap rendering speed.Conclusionsshinyheatmap is hosted online as a freely available web server with an intuitive graphical user interface: http://shinyheatmap.com. The methods are implemented in R, and are available as part of the shinyheatmap project at: https://github.com/Bohdan-Khomtchouk/shinyheatmap. Users can access fastheatmap directly from within the shinyheatmap web interface, and all source code has been made publicly available on Github: https://github.com/Bohdan-Khomtchouk/fastheatmap.

Download Full-text

SOAPMetaS: profiling large metagenome datasets efficiently on distributed clusters

Bioinformatics ◽

10.1093/bioinformatics/btaa697 ◽

2020 ◽

Author(s):

Shixu He ◽

Zhibo Huang ◽

Xiaohan Wang ◽

Lin Fang ◽

Shengkang Li ◽

...

Keyword(s):

Big Data ◽

Large Volume ◽

Machine Tools ◽

High Performance ◽

Marker Gene ◽

Source Code ◽

Large Datasets ◽

Supplementary Information ◽

Supplementary Data ◽

Multiple Sample

Abstract Summary Rapid increase of the data size in metagenome researches has raised the demand for new tools to process large datasets efficiently. To accelerate the metagenome profiling process in the scenario of big data, we developed SOAPMetaS, a marker gene-based multiple-sample metagenome profiling tool built on Apache Spark. SOAPMetaS demonstrates high performance and scalability to process large datasets. It can process 80 samples of FASTQ data, summing up to 416 GiB, in around half an hour; and the accuracy of species profiling results of SOAPMetaS is similar to that of MetaPhlAn2. SOAPMetaS can deal with a large volume of metagenome data more efficiently than common-used single-machine tools. Availability and implementation Source code is implemented in Java and freely available at https://github.com/BGI-flexlab/SOAPMetaS. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Who’s who? Detecting and resolving sample anomalies in human DNA sequencing studies with peddy

10.1101/074385 ◽

2016 ◽

Author(s):

Brent S. Pedersen ◽

Aaron R. Quinlan

Keyword(s):

Dna Sequencing ◽

Web Interface ◽

1000 Genomes ◽

Link Type ◽

Human Dna ◽

Machine Learning Model ◽

Sequencing Studies ◽

Sequencing Quality ◽

Interactive Visualizations ◽

Genetic Discovery

AbstractThe potential for genetic discovery in human DNA sequencing studies is greatly diminished if DNA samples from the cohort are mislabelled, swapped, contaminated, or include unintended individuals. Unfortunately, the potential for such errors is significant since DNA samples are often manipulated by several protocols, labs or scientists in the process of sequencing. We have developed peddy to identify and facilitate the remediation of such errors via interactive visualizations and reports comparing the stated sex, relatedness, and ancestry to what is inferred from each individual’s genotypes. Peddy predicts a sample’s ancestry using a machine learning model trained on individuals of diverse ancestries from the 1000 Genomes Project reference panel. Peddy’s speed, text reports and web interface facilitate both automated and visual detection of sample swaps, poor sequencing quality and other indicators of sample problems that, were they left undetected, would inhibit discovery.Software Availabilityhttps://github.com/brentp/peddyDemonstration (Chrome suggested)http://home.chpc.utah.edu/∼u6000771//plots/ceph1463.html

Download Full-text

Artificial Intelligence-Assisted Colonoscopy for Detection of Colon Polyps: a Prospective, Randomized Cohort Study

Journal of Gastrointestinal Surgery ◽

10.1007/s11605-020-04802-4 ◽

2020 ◽

Author(s):

Yuchen Luo ◽

Yi Zhang ◽

Ming Liu ◽

Yihong Lai ◽

Panpan Liu ◽

...

Keyword(s):

Artificial Intelligence ◽

Real Time ◽

High Performance ◽

Detection System ◽

Random Order ◽

Colon Polyps ◽

Clinical Environment ◽

Polyp Detection ◽

Link Type ◽

Polyp Detection Rate

Abstract Background and aims Improving the rate of polyp detection is an important measure to prevent colorectal cancer (CRC). Real-time automatic polyp detection systems, through deep learning methods, can learn and perform specific endoscopic tasks previously performed by endoscopists. The purpose of this study was to explore whether a high-performance, real-time automatic polyp detection system could improve the polyp detection rate (PDR) in the actual clinical environment. Methods The selected patients underwent same-day, back-to-back colonoscopies in a random order, with either traditional colonoscopy or artificial intelligence (AI)-assisted colonoscopy performed first by different experienced endoscopists (> 3000 colonoscopies). The primary outcome was the PDR. It was registered with clinicaltrials.gov. (NCT047126265). Results In this study, we randomized 150 patients. The AI system significantly increased the PDR (34.0% vs 38.7%, p < 0.001). In addition, AI-assisted colonoscopy increased the detection of polyps smaller than 6 mm (69 vs 91, p < 0.001), but no difference was found with regard to larger lesions. Conclusions A real-time automatic polyp detection system can increase the PDR, primarily for diminutive polyps. However, a larger sample size is still needed in the follow-up study to further verify this conclusion. Trial Registration clinicaltrials.gov Identifier: NCT047126265

Download Full-text

Perspectives on High-Performance Computing in a Big Data World

Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing - HPDC '19 ◽

10.1145/3307681.3325410 ◽

2019 ◽

Author(s):

Geoffrey C. Fox

Keyword(s):

Big Data ◽

High Performance Computing ◽

High Performance ◽

Performance Computing

Download Full-text

Computational storage: an efficient and scalable platform for big data and HPC applications

Journal Of Big Data ◽

10.1186/s40537-019-0265-5 ◽

2019 ◽

Vol 6 (1) ◽

Cited By ~ 2

Author(s):

Mahdi Torabzadehkashi ◽

Siavash Rezaei ◽

Ali HeydariGorji ◽

Hosein Bobarshad ◽

Vladimir Alves ◽

...

Keyword(s):

Big Data ◽

High Performance ◽

Distributed Processing ◽

Data Access ◽

Distributed Applications ◽

Process Data ◽

Storage Devices ◽

Hadoop Mapreduce ◽

Big Data Applications ◽

Application Processor

AbstractIn the era of big data applications, the demand for more sophisticated data centers and high-performance data processing mechanisms is increasing drastically. Data are originally stored in storage systems. To process data, application servers need to fetch them from storage devices, which imposes the cost of moving data to the system. This cost has a direct relation with the distance of processing engines from the data. This is the key motivation for the emergence of distributed processing platforms such as Hadoop, which move process closer to data. Computational storage devices (CSDs) push the “move process to data” paradigm to its ultimate boundaries by deploying embedded processing engines inside storage devices to process data. In this paper, we introduce Catalina, an efficient and flexible computational storage platform, that provides a seamless environment to process data in-place. Catalina is the first CSD equipped with a dedicated application processor running a full-fledged operating system that provides filesystem-level data access for the applications. Thus, a vast spectrum of applications can be ported for running on Catalina CSDs. Due to these unique features, to the best of our knowledge, Catalina CSD is the only in-storage processing platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and HPC applications in-place without any modifications on the underlying distributed processing framework. For the proof of concept, we build a fully functional Catalina prototype and a CSD-equipped platform using 16 Catalina CSDs to run Intel HiBench Hadoop and HPC benchmarks to investigate the benefits of deploying Catalina CSDs in the distributed processing environments. The experimental results show up to 2.2× improvement in performance and 4.3× reduction in energy consumption, respectively, for running Hadoop MapReduce benchmarks. Additionally, thanks to the Neon SIMD engines, the performance and energy efficiency of DFT algorithms are improved up to 5.4× and 8.9×, respectively.

Download Full-text

Bandwidth compression of multiple numerical data streams for high performance custom computing

2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors ◽

10.1109/asap.2014.6868660 ◽

2014 ◽

Cited By ~ 4

Author(s):

Tomohiro Ueno ◽

Ryo Ito ◽

Kentaro Sano ◽

Satoru Yamamoto

Keyword(s):

Data Streams ◽

High Performance ◽

Numerical Data ◽

Bandwidth Compression

Download Full-text

AusGeochem and Big Data Analytics in Low-Temperature Thermochronology

10.5194/egusphere-egu21-16550 ◽

2021 ◽

Author(s):

Samuel Boone ◽

Fabian Kohlmann ◽

Moritz Theile ◽

Wayne Noble ◽

Barry Kohn ◽

...

Keyword(s):

Big Data ◽

Low Temperature ◽

Data Analytics ◽

Big Data Analytics ◽

Application Programming Interface ◽

Large Datasets ◽

Geochemical Data ◽

Advisory Group ◽

New Era ◽

Analytical Instruments

The AuScope Geochemistry Network (AGN) and partners Lithodat Pty Ltd are developing AusGeochem, a novel cloud-based platform for Australian-produced geochemistry data from around the globe. The open platform will allow laboratories to upload, archive, disseminate and publish their datasets, as well as perform statistical analyses and data synthesis within the context of large volumes of publicly funded geochemical data. As part of this endeavour, representatives from four Australian low-temperature thermochronology laboratories (University of Melbourne, University of Adelaide, Curtin University and University of Queensland) are advising the AGN and Lithodat on the development of low-temperature thermochronology (LTT)-specific data models for the relational AusGeochem database and its international counterpart, LithoSurfer. These schemas will facilitate the structured archiving of a wide variety of thermochronology data, enabling geoscientists to readily perform LTT Big Data analytics and gain new insights into the thermo-tectonic evolution of Earth&#8217;s crust.Adopting established international data reporting best practices, the LTT expert advisory group has designed database schemas for the fission track and (U-Th-Sm)/He methods, as well as for thermal history modelling results and metadata. In addition to recording the parameters required for LTT analyses, the schemas include fields for reference material results and error reporting, allowing AusGeochem users to independently perform QA/QC on data archived in the database. Development of scripts for the automated upload of data directly from analytical instruments into AusGeochem using its open-source Application Programming Interface are currently under way.The advent of a LTT relational database heralds the beginning of a new era of Big Data analytics in the field of low-temperature thermochronology. By methodically archiving detailed LTT (meta-)data in structured schemas, intractably large datasets comprising 1000s of analyses produced by numerous laboratories can be readily interrogated in new and powerful ways. These include rapid derivation of inter-data relationships, facilitating on-the-fly age computation, statistical analysis and data visualisation. With the detailed LTT data stored in relational schemas, measurements can then be re-calculated and re-modelled using user-defined constants and kinetic algorithms. This enables analyses determined using different parameters to be equated and compared across regional- to global scales.The development of this novel tool heralds the beginning of a new era of structured Big Data in the field of low-temperature thermochronology, improving laboratories&#8217; ability to manage and share their data in alignment with FAIR data principles while enabling analysts to readily interrogate intractably large datasets in new and powerful ways.

Download Full-text

The potential of entry barriers in the context of digitalization

Moscow University Economics Bulletin ◽

10.38050/01300105202116 ◽

2021 ◽

Vol 21 (1) ◽

pp. 128-147

Author(s):

Aleksey Zazdravnykh

Keyword(s):

Big Data ◽

Market Power ◽

Network Effects ◽

Economies Of Scale ◽

Digital Transformation ◽

Entry Barriers ◽

Barriers To Entry ◽

Significant Growth ◽

Digital Infrastructure ◽

Market Operation

The article analyzes the practical aspects of the functioning of some barriers to entry in the era of digital transformation of industry markets. It is noted that under the influence of digitalization processes, both positive changes in the mechanism of market operation are recorded, as well as a number of negative circumstances that have become a serious challenge for antitrust agencies. Control of big data, initial investment in digital infrastructure, and broad technological capabilities of digital blocking of users, against the background of powerful network effects and pronounced economies of scale, carry the potential for significant growth in the market power of individual firms. The article substantiates that such trends theoretically pose a significant threat to competition, and can form new types of entry barriers. At the same time, practical arguments are presented that indicate the ambiguity of this position.

Download Full-text