A guide and best practices for R/Bioconductor tool integration in Galaxy

F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 2757 ◽  
Author(s):  
Nitesh Turaga ◽  
Mallory A. Freeberg ◽  
Dannon Baker ◽  
John Chilton ◽  
Anton Nekrutenko ◽  
...  

Galaxy provides a web-based platform for interactive, large-scale data analyses, which integrates bioinformatics tools written in a variety of languages. A substantial number of these tools are written in the R programming language, which enables powerful analysis and visualization of complex data. The Bioconductor Project provides access to these open-source R tools and currently contains over 1200 R packages. While some R/Bioconductor tools are currently available in Galaxy, scientific research communities would benefit greatly if they were integrated on a larger scale. Tool development in Galaxy is an early entry point for Galaxy developers, biologists, and bioinformaticians who want to make their work more accessible to a larger community of scientists. Here, we present a guide and best practices for R/Bioconductor tool integration into Galaxy. In addition, we introduce new functionalities to existing software that resolve dependency issues and semi-automate generation of tool integration components. With these improvements, novice and experienced developers can easily integrate R/Bioconductor tools into Galaxy to make their work more accessible to the scientific community.
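
Galaxy tool wrappers typically invoke a non-interactive script whose input and output paths arrive on the command line. The R sketch below illustrates that shape only; the file layout and the per-column summary are illustrative assumptions, not code from the article.

    # Hypothetical shape of an R script ready for Galaxy wrapping: no
    # interactive prompts, no hard-coded paths; Galaxy supplies the file
    # locations as command-line arguments.
    args <- commandArgs(trailingOnly = TRUE)
    input_path  <- args[1]   # tabular dataset selected by the Galaxy user
    output_path <- args[2]   # destination Galaxy expects the tool to write

    tbl <- read.delim(input_path, header = TRUE)

    # A real tool would call an R/Bioconductor package here; a per-column
    # summary stands in for that analysis step.
    result <- data.frame(column = names(tbl),
                         mean   = vapply(tbl, function(x) mean(as.numeric(x)),
                                         numeric(1)))

    write.table(result, file = output_path, sep = "\t",
                quote = FALSE, row.names = FALSE)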

F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 859
Author(s):  
Krishna Choudhary ◽  
Alexander R. Pico

Rapid technological advances in the past decades have enabled molecular biologists to generate large-scale and complex data with affordable resource investments, or obtain such data from public repositories. Yet, many graduate students, postdoctoral scholars, and senior researchers in the biosciences find themselves ill-equipped to analyze large-scale data. Global surveys have revealed that active researchers prefer short training workshops to fill their skill gaps. In this article, we focus on the challenge of delivering a short data analysis workshop to absolute beginners in computer programming. We propose that introducing R or other programming languages for data analysis as smart versions of calculators can help lower the communication barrier with absolute beginners. We describe this comparison with a few analogies and hope that other instructors will find them useful. We utilized these analogies in our four-hour-long training workshops involving participatory live coding, which we delivered in person and via videoconferencing. Anecdotal evidence suggests that our exposition made R programming seem easy and enabled beginners to explore it on their own.
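
A minimal sketch of the calculator analogy the authors describe: R answers arithmetic like a calculator, then adds memory (variables) and the ability to operate on many numbers at once.

    # A calculator first...
    2 + 2            # 4
    (170 - 120) / 5  # 10

    # ...then a calculator with memory and vectorized operations
    heights <- c(160, 172, 181, 168)   # store four measurements at once
    mean(heights)                      # 170.25
    heights - mean(heights)            # one expression, four results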


2017 ◽  
Author(s):  
Julie A McMurry ◽  
Nick Juty ◽  
Niklas Blomberg ◽  
Tony Burdett ◽  
Tom Conlin ◽  
...  

In many disciplines, data is highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure. Drawing on our experience and on work by other groups, we outline ten lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision and reuse of identifiers; we also outline important considerations for those referencing identifiers in various circumstances, including by authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness about how to avoid and manage common identifier problems, especially those related to persistence and web-accessibility/resolvability. We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.
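
As a small illustration of one practice in this vein, the sketch below builds prefixed compact identifiers and web-resolvable URLs; the resolver pattern assumes the identifiers.org convention, and the accessions are only examples.

    # Compact identifiers (prefix:accession) plus a resolver give stable,
    # web-actionable references; URLs assume the identifiers.org pattern.
    records <- data.frame(prefix    = c("uniprot", "taxonomy"),
                          accession = c("P05067", "9606"))
    records$curie <- paste(records$prefix, records$accession, sep = ":")
    records$url   <- paste0("https://identifiers.org/", records$curie)
    records$url
    # "https://identifiers.org/uniprot:P05067" "https://identifiers.org/taxonomy:9606"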


F1000Research ◽  
2014 ◽  
Vol 3 ◽  
pp. 44 ◽  
Author(s):  
Jose M. Villaveces ◽  
Rafael C. Jimenez ◽  
Bianca H. Habermann

Summary: Protein interaction networks have become an essential tool in large-scale data analysis, integration, and the visualization of high-throughput data in the context of complex cellular networks. Many individual databases are available that provide information on binary interactions of proteins and small molecules. Community efforts such as PSICQUIC aim to unify and standardize information emanating from these public databases. Here we introduce PsicquicGraph, an open-source, web-based visualization component for molecular interactions from PSICQUIC services. Availability: PsicquicGraph is freely available at the BioJS Registry for download and enhancement. Instructions on how to use the tool are available at http://goo.gl/kDaIgZ and the source code can be found at http://github.com/biojs/biojs and DOI:10.5281/zenodo.7709.
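
A hedged sketch of fetching the kind of data PsicquicGraph visualizes: PSICQUIC services return tab-delimited MITAB rows from a REST query. The endpoint below follows the usual PSICQUIC URL convention (IntAct assumed as the service), but it should be verified against the PSICQUIC registry before use.

    # Query one PSICQUIC service for interactions involving BRCA2;
    # the response is header-less, tab-separated MITAB.
    base  <- "https://www.ebi.ac.uk/Tools/webservices/psicquic/intact/webservices/current"
    mitab <- read.delim(paste(base, "search/query/brca2", sep = "/"),
                        header = FALSE, stringsAsFactors = FALSE)
    head(mitab[, 1:2])   # columns 1-2: identifiers of the two interactors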


2006 ◽  
Vol 5 (2) ◽  
pp. 67-72
Author(s):  
Yousuke Kimura ◽  
Tomohiro Mashita ◽  
Atsushi Nakazawa ◽  
Takashi Machida ◽  
Kiyoshi Kiyokawa ◽  
...  

We propose a new rendering system for large-scale 3D geometric data that can be used with web-based content management systems (CMS). To achieve this, we employed the hierarchical geometry encoding method "QSplat" and implemented it in a Java and JOGL (Java bindings for OpenGL) environment. Users can view large-scale geometric data in conventional HTML browsers, even with a low-powered CPU and a low-speed network, and the system is platform-independent. We added two features that help users understand the geometric data: annotations and HTML synchronization. Annotations display the names or detailed explanations of particular portions alongside the geometric data. HTML synchronization lets users smoothly and interactively switch between our rendering system and HTML content. Experimental results show that our system achieves interactive frame rates even for large-scale data that other systems cannot render.
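
The level-of-detail idea behind QSplat can be sketched briefly (in R here, for illustration only; the actual system is implemented in Java/JOGL): traverse a hierarchy of bounding spheres and stop refining once a sphere's projected screen size falls below a threshold, drawing that sphere as a single splat.

    # Hypothetical bounding-sphere traversal; the node layout and the
    # projection formula are simplified assumptions, not the paper's code.
    select_splats <- function(node, viewer_dist, threshold_px, focal_px = 800) {
      projected <- focal_px * node$radius / viewer_dist  # approx. screen size
      if (projected < threshold_px || is.null(node$children)) {
        return(list(node))                               # coarse enough: one splat
      }
      unlist(lapply(node$children, select_splats, viewer_dist = viewer_dist,
                    threshold_px = threshold_px, focal_px = focal_px),
             recursive = FALSE)
    }

    leaf <- function() list(radius = 0.5, children = NULL)
    root <- list(radius = 1.0, children = list(leaf(), leaf()))
    length(select_splats(root, viewer_dist = 50, threshold_px = 20))  # 1: far, coarse
    length(select_splats(root, viewer_dist = 2,  threshold_px = 20))  # 2: near, refined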


2018 ◽  
Author(s):  
Ben Marwick ◽  
Carl Boettiger ◽  
Lincoln Mullen

Computers are a central tool in the research process, enabling complex and large scale data analysis. As computer-based research has increased in complexity, so have the challenges of ensuring that this research is reproducible. To address this challenge, we review the concept of the research compendium as a solution for providing a standard and easily recognisable way for organising the digital materials of a research project to enable other researchers to inspect, reproduce, and extend the research. We investigate how the structure and tooling of software packages of the R programming language are being used to produce research compendia in a variety of disciplines. We also describe how software engineering tools and services are being used by researchers to streamline working with research compendia. Using real-world examples, we show how researchers can improve the reproducibility of their work using research compendia based on R packages and related tools.
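
A minimal sketch of the compendium layout discussed here, scaffolded with base R only; the directory names follow common conventions from this literature and are one possible arrangement, not a fixed standard.

    # Scaffold a research compendium built on the R package structure:
    # a DESCRIPTION file declares dependencies and makes it installable,
    # while analysis/ separates manuscript, raw data, and derived data.
    scaffold_compendium <- function(path) {
      dirs <- file.path(path, c("R",
                                "analysis/paper",
                                "analysis/data/raw_data",
                                "analysis/data/derived_data"))
      for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
      writeLines(c("Package: mycompendium",
                   "Title: Research Compendium for ...",
                   "Version: 0.0.1"),
                 file.path(path, "DESCRIPTION"))
    }
    scaffold_compendium("mycompendium")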


2017 ◽  
Vol 9 (1) ◽  
pp. 65-78
Author(s):  
Konrad Grzanek

The dynamic typing of the R programming language can cause quality problems in the large-scale data-science and machine-learning projects for which the language is used. Following our efforts to provide a gradual typing library for Clojure, we present chR, a library that offers run-time type checks in R. The solution is not only a dynamic type checker; it also helps systematize thinking about types in the language, while offering high expressiveness and full adherence to a functional programming style.
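
The abstract does not reproduce chR's interface, so the sketch below shows only the general idea of a run-time type check in plain R, not chR's actual API: validate types at the function boundary and fail early with an informative error.

    # Generic run-time check helper (illustrative, not chR's API)
    check_type <- function(x, predicate, what) {
      if (!predicate(x)) {
        stop(sprintf("expected %s, got %s", what,
                     paste(class(x), collapse = "/")), call. = FALSE)
      }
      invisible(x)
    }

    scale01 <- function(v) {
      check_type(v, is.numeric, "a numeric vector")
      (v - min(v)) / (max(v) - min(v))
    }

    scale01(c(2, 4, 6))   # 0.0 0.5 1.0
    # scale01("oops")     # error: expected a numeric vector, got character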


2019 ◽  
Vol 7 (2) ◽  
pp. 95-104
Author(s):  
Mani Manavalan ◽  
Nur Mohammad Ali Chisty

Manual approaches rely on the abilities and knowledge of individual human administrators to detect, analyze, and interpret attacks. Intrusion Detection Systems (IDS) can automatically detect attacks and warn the appropriate people when they occur. Although detecting individual attacks is useful, it is frequently insufficient for understanding the entire attack process, as well as the attackers' skills and objectives. The attack stage is usually only one component of a larger infiltration process, during which attackers gather information and set up the proper conditions before launching an attack, after which they clear log records to conceal their footprints and disappear. Current approaches require pre-defined cause-and-effect links between events, which are difficult and time-consuming to specify. Our technique for constructing attack scenarios is based on the linking nature of web pages and, unlike previous work, does not require pre-defined cause-and-effect links. Constructed scenarios are displayed in spatial and temporal coordinate systems to make them more convenient to view and analyze. In addition, we developed a prototype implementation of the concept, which we used to test a number of attack scenarios.
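
A toy sketch of the linking idea (in R, with hypothetical alert fields): connect alerts that share an attribute, here the source host, and lie within a time window, rather than relying on predefined cause-and-effect rules.

    # Four hypothetical IDS alerts; only the first two share a host and
    # occur within the 10-minute window, so only they are linked.
    alerts <- data.frame(
      id   = 1:4,
      time = as.POSIXct("2021-06-01 00:00:00") + c(0, 300, 7200, 7500),
      src  = c("10.0.0.5", "10.0.0.5", "10.0.0.5", "10.0.0.9"),
      type = c("scan", "exploit", "log-wipe", "scan"))

    link_alerts <- function(a, window_sec = 600) {
      pairs <- expand.grid(i = a$id, j = a$id)
      pairs <- pairs[pairs$i < pairs$j, ]
      keep  <- mapply(function(i, j) {
        a$src[i] == a$src[j] &&
          abs(as.numeric(difftime(a$time[j], a$time[i],
                                  units = "secs"))) <= window_sec
      }, pairs$i, pairs$j)
      pairs[keep, ]
    }
    link_alerts(alerts)   # one linked pair: alerts 1 and 2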


2021 ◽  
Author(s):  
Stephan Meylan ◽  
Tom Griffiths

Language research has come to rely heavily on large-scale, web-based datasets. These datasets can present significant methodological challenges, requiring researchers to make a number of decisions about how they are collected, represented, and analyzed. These decisions often concern long-standing challenges in corpus-based language research, including determining what counts as a word, deciding which words should be analyzed, and matching sets of words across languages. We illustrate these challenges by revisiting "Word lengths are optimized for efficient communication" (Piantadosi, Tily, & Gibson, 2011), which found that word lengths in 11 languages are more strongly correlated with their average predictability (or average information content) than with their frequency. Using what we argue to be best practices for large-scale corpus analyses, we find significantly attenuated support for this result and demonstrate that a stronger relationship obtains between word frequency and length for a majority of the languages in the sample. We consider the implications of the results for language research more broadly and provide several recommendations to researchers regarding best practices.
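
The core comparison the article revisits can be sketched in a few lines of R. The tiny data frame below is toy data standing in for corpus estimates (here arranged so frequency wins, echoing the authors' finding for most languages); the real result depends on corpus, tokenization, and word-matching decisions.

    # Is word length better predicted by frequency or by average
    # information content? Toy numbers only; ranks are illustrative.
    words <- data.frame(
      word     = c("the", "of", "through", "probably", "notwithstanding"),
      length   = c(3, 2, 7, 8, 15),
      log_freq = c(-1.5, -1.2, -3.8, -4.1, -6.3),  # toy log relative frequency
      avg_info = c(2.1, 1.8, 7.2, 6.0, 11.5))      # toy mean surprisal (bits)

    cor(words$length, -words$log_freq, method = "spearman")  # 1.0 here
    cor(words$length, words$avg_info,  method = "spearman")  # 0.9 here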

