A guide and best practices for R/Bioconductor tool integration in Galaxy

F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 2757 ◽  
Author(s):  
Nitesh Turaga ◽  
Mallory A. Freeberg ◽  
Dannon Baker ◽  
John Chilton ◽  
Anton Nekrutenko ◽  
...  

Galaxy provides a web-based platform for interactive, large-scale data analyses, which integrates bioinformatics tools written in a variety of languages. A substantial number of these tools are written in the R programming language, which enables powerful analysis and visualization of complex data. The Bioconductor Project provides access to these open-source R tools and currently contains over 1200 R packages. While some R/Bioconductor tools are currently available in Galaxy, scientific research communities would benefit greatly if they were integrated on a larger scale. Tool development in Galaxy is an early entry point for Galaxy developers, biologists, and bioinformaticians who want to make their work more accessible to a larger community of scientists. Here, we present a guide and best practices for R/Bioconductor tool integration into Galaxy. In addition, we introduce new functionalities to existing software that resolve dependency issues and semi-automate generation of tool integration components. With these improvements, novice and experienced developers can easily integrate R/Bioconductor tools into Galaxy to make their work more accessible to the scientific community.
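
Galaxy tool wrappers typically invoke a non-interactive script whose input and output paths arrive on the command line. The R sketch below illustrates that shape only; the file layout and the per-column summary are illustrative assumptions, not code from the article.

    # Hypothetical shape of an R script ready for Galaxy wrapping: no
    # interactive prompts, no hard-coded paths; Galaxy supplies the file
    # locations as command-line arguments.
    args <- commandArgs(trailingOnly = TRUE)
    input_path  <- args[1]   # tabular dataset selected by the Galaxy user
    output_path <- args[2]   # destination Galaxy expects the tool to write

    tbl <- read.delim(input_path, header = TRUE)

    # A real tool would call an R/Bioconductor package here; a per-column
    # summary stands in for that analysis step.
    result <- data.frame(column = names(tbl),
                         mean   = vapply(tbl, function(x) mean(as.numeric(x)),
                                         numeric(1)))

    write.table(result, file = output_path, sep = "\t",
                quote = FALSE, row.names = FALSE)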

F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 859
Author(s):  
Krishna Choudhary ◽  
Alexander R. Pico

Rapid technological advances in the past decades have enabled molecular biologists to generate large-scale and complex data with affordable resource investments, or obtain such data from public repositories. Yet, many graduate students, postdoctoral scholars, and senior researchers in the biosciences find themselves ill-equipped to analyze large-scale data. Global surveys have revealed that active researchers prefer short training workshops to fill their skill gaps. In this article, we focus on the challenge of delivering a short data analysis workshop to absolute beginners in computer programming. We propose that introducing R or other programming languages for data analysis as smart versions of calculators can help lower the communication barrier with absolute beginners. We describe this comparison with a few analogies and hope that other instructors will find them useful. We utilized these analogies in our four-hour-long training workshops involving participatory live coding, which we delivered in person and via videoconferencing. Anecdotal evidence suggests that our exposition made R programming seem easy and enabled beginners to explore it on their own.
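
A minimal sketch of the calculator analogy the authors describe: R answers arithmetic like a calculator, then adds memory (variables) and the ability to operate on many numbers at once.

    # A calculator first...
    2 + 2            # 4
    (170 - 120) / 5  # 10

    # ...then a calculator with memory and vectorized operations
    heights <- c(160, 172, 181, 168)   # store four measurements at once
    mean(heights)                      # 170.25
    heights - mean(heights)            # one expression, four results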


2017 ◽  
Author(s):  
Julie A McMurry ◽  
Nick Juty ◽  
Niklas Blomberg ◽  
Tony Burdett ◽  
Tom Conlin ◽  
...  

In many disciplines, data is highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure. Drawing on our experience and on work by other groups, we outline ten lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision and reuse of identifiers; we also outline important considerations for those referencing identifiers in various circumstances, including by authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness about how to avoid and manage common identifier problems, especially those related to persistence and web-accessibility/resolvability. We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.
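
As a small illustration of one practice in this vein, the sketch below builds prefixed compact identifiers and web-resolvable URLs; the resolver pattern assumes the identifiers.org convention, and the accessions are only examples.

    # Compact identifiers (prefix:accession) plus a resolver give stable,
    # web-actionable references; URLs assume the identifiers.org pattern.
    records <- data.frame(prefix    = c("uniprot", "taxonomy"),
                          accession = c("P05067", "9606"))
    records$curie <- paste(records$prefix, records$accession, sep = ":")
    records$url   <- paste0("https://identifiers.org/", records$curie)
    records$url
    # "https://identifiers.org/uniprot:P05067" "https://identifiers.org/taxonomy:9606"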


F1000Research ◽  
2014 ◽  
Vol 3 ◽  
pp. 44 ◽  
Author(s):  
Jose M. Villaveces ◽  
Rafael C. Jimenez ◽  
Bianca H. Habermann

Summary: Protein interaction networks have become an essential tool in large-scale data analysis, integration, and the visualization of high-throughput data in the context of complex cellular networks. Many individual databases are available that provide information on binary interactions of proteins and small molecules. Community efforts such as PSICQUIC aim to unify and standardize information emanating from these public databases. Here we introduce PsicquicGraph, an open-source, web-based visualization component for molecular interactions from PSICQUIC services. Availability: PsicquicGraph is freely available at the BioJS Registry for download and enhancement. Instructions on how to use the tool are available at http://goo.gl/kDaIgZ and the source code can be found at http://github.com/biojs/biojs and DOI:10.5281/zenodo.7709.
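
A hedged sketch of fetching the kind of data PsicquicGraph visualizes: PSICQUIC services return tab-delimited MITAB rows from a REST query. The endpoint below follows the usual PSICQUIC URL convention (IntAct assumed as the service), but it should be verified against the PSICQUIC registry before use.

    # Query one PSICQUIC service for interactions involving BRCA2;
    # the response is header-less, tab-separated MITAB.
    base  <- "https://www.ebi.ac.uk/Tools/webservices/psicquic/intact/webservices/current"
    mitab <- read.delim(paste(base, "search/query/brca2", sep = "/"),
                        header = FALSE, stringsAsFactors = FALSE)
    head(mitab[, 1:2])   # columns 1-2: identifiers of the two interactors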


2006 ◽  
Vol 5 (2) ◽  
pp. 67-72
Author(s):  
Yousuke Kimura ◽  
Tomohiro Mashita ◽  
Atsushi Nakazawa ◽  
Takashi Machida ◽  
Kiyoshi Kiyokawa ◽  
...  

We propose a new rendering system for large-scale 3D geometric data that can be used with web-based content management systems (CMS). To achieve this, we employed the hierarchical geometry encoding method "QSplat" and implemented it in a Java and JOGL (Java bindings for OpenGL) environment. Users can view large-scale geometric data in conventional HTML browsers, even with a low-powered CPU and a low-speed network, and the system is platform-independent. We added two features that help users understand the geometric data: annotations and HTML synchronization. Annotations display the names or detailed explanations of particular portions alongside the geometric data. HTML synchronization lets users smoothly and interactively switch between our rendering system and HTML content. Experimental results show that our system achieves interactive frame rates even for large-scale data that other systems cannot render.
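
The level-of-detail idea behind QSplat can be sketched briefly (in R here, for illustration only; the actual system is implemented in Java/JOGL): traverse a hierarchy of bounding spheres and stop refining once a sphere's projected screen size falls below a threshold, drawing that sphere as a single splat.

    # Hypothetical bounding-sphere traversal; the node layout and the
    # projection formula are simplified assumptions, not the paper's code.
    select_splats <- function(node, viewer_dist, threshold_px, focal_px = 800) {
      projected <- focal_px * node$radius / viewer_dist  # approx. screen size
      if (projected < threshold_px || is.null(node$children)) {
        return(list(node))                               # coarse enough: one splat
      }
      unlist(lapply(node$children, select_splats, viewer_dist = viewer_dist,
                    threshold_px = threshold_px, focal_px = focal_px),
             recursive = FALSE)
    }

    leaf <- function() list(radius = 0.5, children = NULL)
    root <- list(radius = 1.0, children = list(leaf(), leaf()))
    length(select_splats(root, viewer_dist = 50, threshold_px = 20))  # 1: far, coarse
    length(select_splats(root, viewer_dist = 2,  threshold_px = 20))  # 2: near, refined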


2018 ◽  
Author(s):  
Ben Marwick ◽  
Carl Boettiger ◽  
Lincoln Mullen

Computers are a central tool in the research process, enabling complex and large scale data analysis. As computer-based research has increased in complexity, so have the challenges of ensuring that this research is reproducible. To address this challenge, we review the concept of the research compendium as a solution for providing a standard and easily recognisable way for organising the digital materials of a research project to enable other researchers to inspect, reproduce, and extend the research. We investigate how the structure and tooling of software packages of the R programming language are being used to produce research compendia in a variety of disciplines. We also describe how software engineering tools and services are being used by researchers to streamline working with research compendia. Using real-world examples, we show how researchers can improve the reproducibility of their work using research compendia based on R packages and related tools.
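
A minimal sketch of the compendium layout discussed here, scaffolded with base R only; the directory names follow common conventions from this literature and are one possible arrangement, not a fixed standard.

    # Scaffold a research compendium built on the R package structure:
    # a DESCRIPTION file declares dependencies and makes it installable,
    # while analysis/ separates manuscript, raw data, and derived data.
    scaffold_compendium <- function(path) {
      dirs <- file.path(path, c("R",
                                "analysis/paper",
                                "analysis/data/raw_data",
                                "analysis/data/derived_data"))
      for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
      writeLines(c("Package: mycompendium",
                   "Title: Research Compendium for ...",
                   "Version: 0.0.1"),
                 file.path(path, "DESCRIPTION"))
    }
    scaffold_compendium("mycompendium")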


2017 ◽  
Vol 9 (1) ◽  
pp. 65-78
Author(s):  
Konrad Grzanek

The dynamic typing of the R programming language can cause quality problems in the large-scale data-science and machine-learning projects for which the language is used. Following our efforts to provide a gradual typing library for Clojure, we present chR, a library that offers run-time type checks in R. The solution is not only a dynamic type checker; it also helps systematize thinking about types in the language, while offering high expressiveness and full adherence to a functional programming style.
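
The abstract does not reproduce chR's interface, so the sketch below shows only the general idea of a run-time type check in plain R, not chR's actual API: validate types at the function boundary and fail early with an informative error.

    # Generic run-time check helper (illustrative, not chR's API)
    check_type <- function(x, predicate, what) {
      if (!predicate(x)) {
        stop(sprintf("expected %s, got %s", what,
                     paste(class(x), collapse = "/")), call. = FALSE)
      }
      invisible(x)
    }

    scale01 <- function(v) {
      check_type(v, is.numeric, "a numeric vector")
      (v - min(v)) / (max(v) - min(v))
    }

    scale01(c(2, 4, 6))   # 0.0 0.5 1.0
    # scale01("oops")     # error: expected a numeric vector, got character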


2019 ◽  
Vol 7 (2) ◽  
pp. 95-104
Author(s):  
Mani Manavalan ◽  
Nur Mohammad Ali Chisty

Manual approaches rely on the abilities and knowledge of individual human administrators to detect, analyze, and interpret attacks. Intrusion Detection Systems (IDS) can automatically detect attacks and warn the appropriate people when they occur. Although detecting individual attacks is useful, it is frequently insufficient for understanding the entire attack process, as well as the attackers' skills and objectives. The attack stage is usually only one component of a larger infiltration process, during which attackers gather information and set up the proper conditions before launching an attack, after which they clear log records to conceal their footprints and disappear. Current approaches require pre-defined cause-and-effect links between events, which are difficult and time-consuming to specify. Our technique for constructing attack scenarios is based on the linking nature of web pages and, unlike previous work, does not require pre-defined cause-and-effect links. Constructed scenarios are displayed in spatial and temporal coordinate systems to make them more convenient to view and analyze. In addition, we developed a prototype implementation of the concept, which we used to test a number of attack scenarios.
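
A toy sketch of the linking idea (in R, with hypothetical alert fields): connect alerts that share an attribute, here the source host, and lie within a time window, rather than relying on predefined cause-and-effect rules.

    # Four hypothetical IDS alerts; only the first two share a host and
    # occur within the 10-minute window, so only they are linked.
    alerts <- data.frame(
      id   = 1:4,
      time = as.POSIXct("2021-06-01 00:00:00") + c(0, 300, 7200, 7500),
      src  = c("10.0.0.5", "10.0.0.5", "10.0.0.5", "10.0.0.9"),
      type = c("scan", "exploit", "log-wipe", "scan"))

    link_alerts <- function(a, window_sec = 600) {
      pairs <- expand.grid(i = a$id, j = a$id)
      pairs <- pairs[pairs$i < pairs$j, ]
      keep  <- mapply(function(i, j) {
        a$src[i] == a$src[j] &&
          abs(as.numeric(difftime(a$time[j], a$time[i],
                                  units = "secs"))) <= window_sec
      }, pairs$i, pairs$j)
      pairs[keep, ]
    }
    link_alerts(alerts)   # one linked pair: alerts 1 and 2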


2021 ◽  
Author(s):  
Stephan Meylan ◽  
Tom Griffiths

Language research has come to rely heavily on large-scale, web-based datasets. These datasets can present significant methodological challenges, requiring researchers to make a number of decisions about how they are collected, represented, and analyzed. These decisions often concern long-standing challenges in corpus-based language research, including determining what counts as a word, deciding which words should be analyzed, and matching sets of words across languages. We illustrate these challenges by revisiting "Word lengths are optimized for efficient communication" (Piantadosi, Tily, & Gibson, 2011), which found that word lengths in 11 languages are more strongly correlated with their average predictability (or average information content) than with their frequency. Using what we argue to be best practices for large-scale corpus analyses, we find significantly attenuated support for this result and demonstrate that a stronger relationship obtains between word frequency and length for a majority of the languages in the sample. We consider the implications of the results for language research more broadly and provide several recommendations to researchers regarding best practices.
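
The core comparison the article revisits can be sketched in a few lines of R. The tiny data frame below is toy data standing in for corpus estimates (here arranged so frequency wins, echoing the authors' finding for most languages); the real result depends on corpus, tokenization, and word-matching decisions.

    # Is word length better predicted by frequency or by average
    # information content? Toy numbers only; ranks are illustrative.
    words <- data.frame(
      word     = c("the", "of", "through", "probably", "notwithstanding"),
      length   = c(3, 2, 7, 8, 15),
      log_freq = c(-1.5, -1.2, -3.8, -4.1, -6.3),  # toy log relative frequency
      avg_info = c(2.1, 1.8, 7.2, 6.0, 11.5))      # toy mean surprisal (bits)

    cor(words$length, -words$log_freq, method = "spearman")  # 1.0 here
    cor(words$length, words$avg_info,  method = "spearman")  # 0.9 here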

