Keemei: cloud-based validation of tabular bioinformatics file formats in Google Sheets

2016 ◽  
Author(s):  
Jai Ram Rideout ◽  
John H Chase ◽  
Evan Bolyen ◽  
Gail Ackermann ◽  
Antonio Gonzalez ◽  
...  

Bioinformatics software often requires human-generated tabular text files as input and has specific requirements for how those data are formatted. Users frequently manage these data in spreadsheet programs, which is convenient for the researchers compiling the requisite information because spreadsheet programs can easily be used on different platforms, including laptops and tablets, and because they provide a familiar interface. It is increasingly common for many different researchers to be involved in compiling these data, including study coordinators, clinicians, lab technicians, and bioinformaticians. As a result, many research groups are shifting toward cloud-based spreadsheet programs, such as Google Sheets, which support concurrent editing of a single spreadsheet by different users working on different platforms. Often, most of the researchers entering data will not be familiar with the formatting requirements of the bioinformatics programs that will be used, so validating and correcting file formats is a frequent bottleneck before bioinformatics analysis can begin. We present Keemei, a Google Sheets Add-on for validating tabular files used in bioinformatics analyses. Keemei is available free of charge from Google’s Chrome Web Store and can be installed and run in any web browser supported by Google Sheets. Keemei currently supports validation of two widely used tabular bioinformatics formats, the QIIME sample metadata mapping file format and the Spatially Referenced Genetic Data (SRGD) format, and is designed so that support for other formats can easily be added. Keemei will save researchers time and frustration by providing a convenient interface for tabular bioinformatics file format validation. By allowing everyone involved with data entry for a project to easily validate their data, it will reduce the validation and formatting bottlenecks that are commonly encountered when human-generated data files are first used with a bioinformatics system. Simplifying the validation of essential tabular data files, such as sample metadata, will reduce common errors and thereby improve the quality and reliability of research outcomes.
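As a rough illustration of the kind of rule Keemei checks, the sketch below validates the header and sample IDs of a QIIME 1 sample metadata mapping file. The column rules follow the QIIME 1 documentation only in broad strokes, and the file name is a placeholder; Keemei's actual validation logic lives in the add-on itself.

```python
# Minimal sketch of QIIME 1 mapping-file validation (illustrative only;
# the required-column rules are assumptions based on the QIIME 1 docs).
import csv
import re

REQUIRED_FIRST = "#SampleID"
REQUIRED_COLS = {"BarcodeSequence", "LinkerPrimerSequence"}
SAMPLE_ID_RE = re.compile(r"^[a-zA-Z0-9.]+$")  # alphanumeric plus periods

def validate_mapping_file(path):
    errors = []
    with open(path, newline="") as fh:
        rows = list(csv.reader(fh, delimiter="\t"))
    if not rows:
        return ["file is empty"]
    header = rows[0]
    if not header or header[0] != REQUIRED_FIRST:
        errors.append(f"first column must be {REQUIRED_FIRST!r}")
    missing = REQUIRED_COLS - set(header)
    if missing:
        errors.append(f"missing required columns: {sorted(missing)}")
    if header and header[-1] != "Description":
        errors.append("last column must be 'Description'")
    for lineno, row in enumerate(rows[1:], start=2):
        if row and not SAMPLE_ID_RE.match(row[0]):
            errors.append(f"line {lineno}: invalid sample ID {row[0]!r}")
    return errors

if __name__ == "__main__":
    for problem in validate_mapping_file("mapping.tsv"):  # placeholder file
        print("ERROR:", problem)
```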



2022 ◽  
Author(s):  
Yi Nian Niu ◽  
Eric G. Roberts ◽  
Danielle Denisko ◽  
Michael M. Hoffman

Background: Bioinformatics software tools operate largely through the use of specialized genomics file formats. Often these formats lack formal specification, and only rarely do the creators of these tools robustly test them for correct handling of input and output. This causes problems in interoperability between different tools that, at best, wastes time and frustrates users. At worst, interoperability issues could lead to undetected errors in scientific results. Methods: We sought (1) to assess the interoperability of a wide range of bioinformatics software using a shared genomics file format and (2) to provide a simple, reproducible method for enhancing interoperability. As a focus, we selected the popular BED file format for genomic interval data. Based on the file format's original documentation, we created a formal specification. We developed a new verification system, Acidbio (https://github.com/hoffmangroup/acidbio), which tests for correct behavior in bioinformatics software packages. We crafted tests to unify correct behavior when tools encounter various edge cases—potentially unexpected inputs that exemplify the limits of the format. To analyze the performance of existing software, we tested the input validation of 80 Bioconda packages that parsed the BED format. We also used a fuzzing approach to automatically perform additional testing. Results: Of 80 software packages examined, 75 achieved less than 70% correctness on our test suite. We categorized multiple root causes for the poor performance of different types of software. Fuzzing detected other errors that the manually designed test suite could not. We also created a badge system that developers can use to indicate more precisely which BED variants their software accepts and to advertise the software's performance on the test suite. Discussion: Acidbio makes it easy to assess interoperability of software using the BED format, and therefore to identify areas for improvement in individual software packages. Applying our approach to other file formats would increase the reliability of bioinformatics software and data.
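To make the edge-case idea concrete, here is a small sketch in the spirit of, but far narrower than, Acidbio's test suite: a BED3 line validator that rejects a few inputs a well-behaved parser should reject. The rules shown (tab-delimited fields, integer coordinates, 0-based half-open intervals with start not exceeding end) come from the BED format's documentation.

```python
# Illustrative BED3 edge-case checks; Acidbio's real tests cover far more.
def validate_bed3_line(line, lineno):
    errors = []
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return [f"line {lineno}: expected at least 3 tab-separated fields"]
    chrom, start_s, end_s = fields[:3]
    try:
        start, end = int(start_s), int(end_s)
    except ValueError:
        return [f"line {lineno}: coordinates must be integers"]
    if start < 0:
        errors.append(f"line {lineno}: negative start {start}")
    if start > end:  # BED intervals are 0-based and half-open
        errors.append(f"line {lineno}: start {start} exceeds end {end}")
    return errors

# Edge cases a robust parser should reject: reversed interval,
# negative coordinate, non-integer coordinate.
for i, bad in enumerate(["chr1\t100\t50", "chr1\t-5\t50", "chr1\tx\t50"], 1):
    print(validate_bed3_line(bad, i))
```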


2014 ◽  
Vol 940 ◽  
pp. 433-436 ◽  
Author(s):  
Ying Zhang ◽  
Xin Shi

Based on a detailed analysis of the STL file format, a program written in VC++ 6.0 was used to extract the information stored in both ASCII and binary STL files. OpenGL triangle-drawing technology was then used to render the contents of the STL file, with functions for materials, coordinate transformation, and lighting, finally realizing the loading and three-dimensional display of both the ASCII and binary STL formats.
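The paper's loader is implemented in VC++ 6.0 with OpenGL; as a language-neutral sketch of the parsing step only, the following Python reads the standard binary STL layout (80-byte header, uint32 facet count, 50 bytes per facet) and shows the usual heuristic for telling the two encodings apart.

```python
# Sketch of binary STL parsing; rendering (the OpenGL part) is out of scope.
import struct

def read_binary_stl(path):
    """Return a list of (normal, v1, v2, v3) tuples from a binary STL file."""
    triangles = []
    with open(path, "rb") as fh:
        fh.read(80)                                 # 80-byte header, usually ignorable
        (count,) = struct.unpack("<I", fh.read(4))  # little-endian facet count
        for _ in range(count):
            data = struct.unpack("<12fH", fh.read(50))  # 12 floats + 2-byte attribute
            normal, v1, v2, v3 = data[0:3], data[3:6], data[6:9], data[9:12]
            triangles.append((normal, v1, v2, v3))
    return triangles

def looks_like_ascii_stl(path):
    # Heuristic: ASCII STL begins with the keyword "solid", but some binary
    # files do too, so real loaders also sanity-check the file size.
    with open(path, "rb") as fh:
        return fh.read(5) == b"solid"
```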


2020 ◽  
Author(s):  
A. E. Sullivan ◽  
S. J. Tappan ◽  
P. J. Angstman ◽  
A. Rodriguez ◽  
G. C. Thomas ◽  
...  

With advances in microscopy and computer science, the technique of digitally reconstructing, modeling, and quantifying microscopic anatomies has become central to many fields of biological research. MBF Bioscience has chosen to openly document its digital reconstruction file format, the Neuromorphological File Specification (4.0), available at www.mbfbioscience.com/filespecification (Angstman et al. 2020). The format created and maintained by MBF Bioscience is broadly used by the neuroscience community. The data format’s structure and capabilities have evolved since its inception, with modifications made to keep pace with advancements in microscopy and with the scientific questions raised by experts worldwide. More recent modifications to the neuromorphological data format ensure it abides by the Findable, Accessible, Interoperable, and Reusable (FAIR) data standards promoted by the International Neuroinformatics Coordinating Facility (INCF; Wilkinson et al. 2016). The incorporated metadata make it easy to identify and repurpose these data for downstream applications and investigation. This publication describes key elements of the file format and details their structural advantages, in an effort to encourage the reuse of these rich data files for alternative analyses or reproduction of derived conclusions.
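The specification describes an XML-based format; as a hedged sketch of how such a reconstruction file might be read, the snippet below walks tracing points with Python's standard library. The tag and attribute names used here (tree, point, x/y/z/d) and the file name are assumptions for illustration only; the authoritative definitions are in the published specification at www.mbfbioscience.com/filespecification.

```python
# Hedged sketch: element/attribute names are illustrative assumptions,
# not a guaranteed rendering of the Neuromorphological File Specification.
import xml.etree.ElementTree as ET

def iter_tracing_points(path):
    root = ET.parse(path).getroot()
    # A real reader would also handle XML namespaces if the file declares them.
    for tree in root.iter("tree"):
        for point in tree.iter("point"):
            yield (float(point.get("x")), float(point.get("y")),
                   float(point.get("z")), float(point.get("d")))

for x, y, z, d in iter_tracing_points("reconstruction.xml"):  # placeholder file
    print(x, y, z, d)
```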


2018 ◽  
pp. 218-233
Author(s):  
Mayank Yuvaraj

When planning an institutional repository, a digital library collection, or a digital preservation service, it is inevitable that file format policies must be drafted to ensure long-term digital preservation, accessibility, and compatibility. Sincere efforts have been made to encourage the adoption of standard formats, yet digital preservation policies vary from library to library. Against this background, the present paper aims to give the digital preservation community a common understanding of the file formats commonly used in digital libraries and institutional repositories. The paper discusses both open and proprietary file formats for several media types.


2002 ◽  
Vol 1804 (1) ◽  
pp. 144-150
Author(s):  
Kenneth G. Courage ◽  
Scott S. Washburn ◽  
Jin-Tae Kim

The proliferation of traffic software programs on the market has resulted in many very specialized programs, intended to analyze one or two specific items within a transportation network. Consequently, traffic engineers use multiple programs on a single project, which ironically has resulted in new inefficiency for the traffic engineer. Most of these programs deal with the same core set of data, for example, physical roadway characteristics, traffic demand levels, and traffic control variables. However, most of these programs have their own formats for saving data files. Therefore, these programs cannot share information directly or communicate with each other because of incompatible data formats. Thus, the traffic engineer is faced with manually reentering common data from one program into another. In addition to inefficiency, this also creates additional opportunities for data entry errors. XML is catching on rapidly as a means for exchanging data between two systems or users who deal with the same data but in different formats. Specific vocabularies have been developed for statistics, mathematics, chemistry, and many other disciplines. The traffic model markup language (TMML) is introduced as a resource for traffic model data representation, storage, rendering, and exchange. TMML structure and vocabulary are described, and examples of their use are presented.
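Since TMML is an XML vocabulary, a short example may help make the exchange idea concrete. The element and attribute names below are invented for illustration only; the actual TMML vocabulary is the one defined in the paper.

```python
# Hedged illustration of parsing a tiny, hypothetical TMML-like fragment.
# Element names (network, link) and attributes are invented for this sketch.
import xml.etree.ElementTree as ET

SAMPLE = """\
<network>
  <link id="L1" lanes="2" length="350"/>
  <link id="L2" lanes="3" length="120"/>
</network>
"""

root = ET.fromstring(SAMPLE)
for link in root.iter("link"):
    print(link.get("id"), link.get("lanes"), link.get("length"))
```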


Blood ◽  
2011 ◽  
Vol 118 (21) ◽  
pp. 4763-4763
Author(s):  
William T. Tse ◽  
Kevin K. Duh ◽  
Morris Kletzel

Data collection and analysis in clinical studies in hematology often require the use of specialized databases, which demand extensive information technology (IT) support and are expensive to maintain. With the goal of reducing the cost of clinical trials and promoting outcomes research, we have devised a new informatics framework that is low-cost, low-maintenance, and adaptable to both small- and large-scale clinical studies. This framework is based on the idea that most clinical data are hierarchical in nature: a clinical protocol typically entails the creation of sequential patient files, each of which documents multiple encounters, during which clinical events and data are captured and tagged for later retrieval and analysis. These hierarchical trees of clinical data can be easily stored in a hypertext markup language (HTML) document format, which is designed to represent similar hierarchical data on web pages. In this framework, the stored clinical data are structured according to a web standard called the Document Object Model (DOM), for which powerful informatics techniques have been developed to allow efficient retrieval and collation of data from the HTML documents. The proposed framework has many potential advantages. The data are stored in plain text files in the HTML format, which is both human- and machine-readable, facilitating data exchange between collaborative groups. The framework requires only a regular web browser to function, thereby easing its adoption in multiple institutions. There is no need to set up or maintain a relational database for data storage, which minimizes data fragmentation and reduces the demand for IT support. Data entry and analysis are performed mostly on the client computer, requiring a backend server only for central data storage. Utility programs for data management and manipulation are written in JavaScript and jQuery, languages that are free, open-source, and easy to maintain. Data can be captured, retrieved, and analyzed on different devices, including desktop computers, tablets, and smartphones. Encryption and password protection can be applied in document storage and data transmission to ensure data security and HIPAA compliance.
In a pilot project to implement and test this informatics framework, we designed prototype programming modules to perform individual tasks commonly encountered in clinical data management. The functionalities of these modules included user-interface creation, patient data entry and retrieval, visualization and analysis of aggregate results, and exporting and reporting of extracted data. These modules were used to access simulated clinical data stored in a remote server, employing standard web browsers available on all desktop computers and mobile devices. To test the capability of these modules, benchmark tests were performed. Simulated datasets of complete patient records, each with 1000 data items, were created and stored in the remote server. Data were retrieved via the web in a gzip-compressed format. Retrieval of 100, 300, and 1000 such records took only 1.01, 2.45, and 6.67 seconds using a desktop computer via a broadband connection, or 3.67, 11.39, and 30.23 seconds using a tablet computer via a 3G connection. Filtering of specific data from the retrieved records was equally speedy. Automated extraction of relevant data from 300 complete records for a two-sample t-test analysis took 1.97 seconds; a similar extraction of data for a Kaplan-Meier survival analysis took 4.19 seconds. The program allowed the data to be presented separately for individual patients or in aggregate for different clinical subgroups. A user-friendly interface enabled viewing of the data in either tabular or graphical form. Incorporation of a new web browser technique permitted caching of the entire dataset locally for offline access and analysis. Adaptable programming allowed efficient export of data in different formats for regulatory reporting purposes. Once the system was set up, no further intervention from the IT department was necessary.
In summary, we have designed and implemented a prototype of a new informatics framework for clinical data management, which should be low-cost and highly adaptable to various types of clinical studies. Field-testing of this framework in real-life clinical studies will be the next step to demonstrate its effectiveness and potential benefits.
Disclosures: No relevant conflicts of interest to declare.
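The framework itself uses JavaScript and jQuery in the browser; purely to make the hierarchical HTML storage idea concrete, here is a hedged Python sketch that represents one patient record as nested elements and walks the tree to collate events. The class and attribute names are invented for illustration and are not the authors' schema.

```python
# Hedged sketch of DOM-style retrieval from an HTML-encoded patient record.
# Class/attribute names ("patient", "encounter", "event", data-*) are invented.
import xml.etree.ElementTree as ET

RECORD = """\
<div class="patient" id="pt-001">
  <div class="encounter" data-date="2011-03-15">
    <span class="event" data-type="CBC">WBC 5.2</span>
    <span class="event" data-type="transplant">Day 0</span>
  </div>
</div>
"""

root = ET.fromstring(RECORD)
for enc in root.iter("div"):
    if enc.get("class") == "encounter":
        for ev in enc.iter("span"):
            print(enc.get("data-date"), ev.get("data-type"), ev.text)
```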


2019 ◽  
Author(s):  
Zou Yutong ◽  
Bui Thuy Tien ◽  
Kumar Selvarajoo

Here we report a bio-statistical/informatics tool, ABioTrans, developed in R for gene expression analysis. The tool allows the user to directly read RNA-Seq data files deposited in the Gene Expression Omnibus (GEO) database. Operated in any web browser application, ABioTrans provides easy options for multiple statistical distribution fitting, Pearson and Spearman rank correlations, PCA, k-means and hierarchical clustering, differential expression analysis, Shannon entropy and noise (square of the coefficient of variation) analyses, as well as Gene Ontology classifications.
Availability and implementation: ABioTrans is available at https://github.com/buithuytien/ABioTrans
Operating system(s): Platform independent (web browser)
Programming language: R (R Studio)
Other requirements: Bioconductor genome-wide annotation databases and the R packages shiny, LSD, fitdistrplus, actuar, entropy, moments, RUVSeq, edgeR, DESeq2, NOISeq, AnnotationDbi, ComplexHeatmap, circlize, clusterProfiler, reshape2, DT, plotly, shinycssloaders, dplyr, and ggplot2. These packages are installed automatically when ABioTrans.R is executed in R Studio.
No restrictions on usage for non-academics.
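Two of the quantities ABioTrans computes have compact definitions: noise is the square of the coefficient of variation (variance divided by the squared mean), and Shannon entropy summarizes the expression distribution. ABioTrans itself is written in R; the following Python is only a language-neutral sketch of the two formulas, with invented sample values.

```python
# Sketch of the noise and entropy formulas named in the abstract;
# not ABioTrans code. Sample values below are invented.
import numpy as np

def noise(expr):
    """Squared coefficient of variation: variance / mean^2."""
    expr = np.asarray(expr, dtype=float)
    return expr.var() / expr.mean() ** 2

def shannon_entropy(expr):
    """Entropy (in bits) of the normalized expression distribution."""
    expr = np.asarray(expr, dtype=float)
    p = expr / expr.sum()
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return -(p * np.log2(p)).sum()

samples = [120.0, 95.0, 143.0, 101.0]
print(noise(samples), shannon_entropy(samples))
```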


Author(s):  
Karl W Broman ◽  
Kara H. Woo

Spreadsheets are widely used software tools for data entry, storage, analysis, and visualization. Focusing on the data entry and storage aspects, this paper offers practical recommendations for organizing spreadsheet data to reduce errors and ease later analyses. The basic principles are: be consistent, write dates like YYYY-MM-DD, don't leave any cells empty, put just one thing in a cell, organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row), create a data dictionary, don't include calculations in the raw data files, don't use font color or highlighting as data, choose good names for things, make backups, use data validation to avoid data entry errors, and save the data in plain text files.
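Two of these recommendations, YYYY-MM-DD dates and no empty cells, can themselves be checked programmatically once the data are exported to plain text. A minimal sketch with pandas, assuming a hypothetical study_data.csv that contains a column named date:

```python
# Sketch of automated checks for two of the paper's recommendations.
# File name and column name are placeholders.
import pandas as pd

df = pd.read_csv("study_data.csv", dtype=str)

# No empty cells: report the coordinates of every missing value
# (+2 converts the 0-based index to the spreadsheet row, counting the header).
for row, col in zip(*df.isna().to_numpy().nonzero()):
    print(f"empty cell at spreadsheet row {row + 2}, column {df.columns[col]!r}")

# Dates written as YYYY-MM-DD (assumes a 'date' column exists).
bad = df[~df["date"].fillna("").str.fullmatch(r"\d{4}-\d{2}-\d{2}")]
print("rows with malformed dates:", list(bad.index))
```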


Big data is one of the most influential technologies of the modern era. However, in order to support the maturity of big data systems, the development and sustenance of heterogeneous environments is required. This, in turn, requires integration of technologies as well as concepts. Computing and storage are the two core components of any big data system. With that said, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings the facet of big data file formats into the picture. This paper classifies available big data file formats into five categories, namely text-based, row-based, column-based, in-memory, and data storage services. It also compares the advantages, shortcomings, and possible use cases of the available big data file formats for Hadoop, which is the foundation for most big data computing technologies. Lastly, it discusses the trade-offs that must be considered when choosing a file format for a big data system, providing a framework for creating file format selection criteria.
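To make the text-based versus column-based trade-off concrete, here is a hedged sketch using pandas, with Parquet standing in for the column-based category (writing Parquet assumes the pyarrow package is installed). Reading a single column from a columnar file deserializes only that column, whereas a CSV must be scanned row by row.

```python
# Sketch of the row-oriented (CSV) vs. column-oriented (Parquet) trade-off.
# Requires pandas plus pyarrow for the Parquet calls; file names are placeholders.
import pandas as pd

df = pd.DataFrame({"id": range(1_000_000),
                   "value": [x * 0.5 for x in range(1_000_000)]})

df.to_csv("events.csv", index=False)   # text-based: human-readable, bulky
df.to_parquet("events.parquet")        # column-based: compressed, column-prunable

# Column pruning: only the 'value' column is read back from Parquet.
values = pd.read_parquet("events.parquet", columns=["value"])
print(values["value"].mean())
```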

