Assessing Migration Risk for Scientific Data Formats

2012 ◽  
Vol 7 (1) ◽  
pp. 27-38 ◽  
Author(s):  
Chris Frisz ◽  
Geoffrey Brown ◽  
Samuel Waggoner

The majority of information about science, culture, society, the economy and the environment is born digital, yet the underlying technology is subject to rapid obsolescence. One solution to this obsolescence, format migration, is widely practiced and supported by many software packages, yet migration has well-known risks. For example, newer formats – even where similar in function – do not generally support all of the features of their predecessors, and, where similar features exist, there may be significant differences of interpretation. There appears to be a conflict between the wide use of migration and its known risks. In this paper we explore a simple hypothesis – that, where migration paths exist, the majority of data files can be safely migrated, leaving only a few that must be handled more carefully – in the context of several scientific data formats that are or were widely used. Our approach is to gather information about potential migration mismatches and, using custom tools, to evaluate a large collection of data files for the incidence of these risks. Our results support our initial hypothesis, though with some caveats. Further, we found that writing a tool to identify "risky" format features is considerably easier than writing a migration tool.
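To make the paper's risk-scanning approach concrete, here is a minimal Python sketch of a tool that flags files using format features a migration target may not support. The format details (magic number, version field, feature flags) are hypothetical placeholders, not any real format:

```python
# Minimal sketch of a migration-risk scanner: flag files that use
# format features the migration target cannot represent, rather than
# migrating them blindly. All format details below are hypothetical.
import struct
from pathlib import Path

RISKY_FLAGS = {
    0x01: "compressed extension block",
    0x04: "vendor-specific metadata",
}

def scan(path: Path) -> list[str]:
    """Return a list of risky features found in one file."""
    with path.open("rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"HYPO":   # hypothetical magic number
        return ["unrecognized header"]
    version, flags = struct.unpack("<HH", header[4:8])
    risks = [name for bit, name in RISKY_FLAGS.items() if flags & bit]
    if version > 2:   # newer than the (hypothetical) migration tool supports
        risks.append(f"unsupported version {version}")
    return risks

if __name__ == "__main__":
    for p in Path("data").glob("*.dat"):
        risks = scan(p)
        print(p, "SAFE TO MIGRATE" if not risks else f"RISKY: {', '.join(risks)}")
```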

2019 ◽  
Vol 2 (1) ◽  
pp. 8 ◽  
Author(s):  
Jesse Meyer

The identification of nearly all proteins in a biological system using data-dependent acquisition (DDA) tandem mass spectrometry has become routine for organisms with relatively small genomes, such as bacteria and yeast. Still, quantifying the identified proteins can be a complex process and often requires multiple software packages. In this protocol, I describe a flexible strategy for the identification and label-free quantification of proteins from bottom-up proteomics experiments. This method can be used to quantify all the detectable proteins in any DDA dataset collected with high-resolution precursor scans and may be used to quantify proteome remodeling in response to drug treatment or a gene knockout. Notably, the method is statistically rigorous, uses the latest and fastest freely available software, and the entire protocol can be completed in a few hours with a small number of data files from the analysis of yeast.
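A minimal sketch of the label-free quantification step that follows peptide identification: median-normalize per-run intensities, roll peptides up to proteins, and test for differential abundance. The input column names (protein, run, intensity) and the run naming convention are assumptions for illustration, not the protocol's actual output format:

```python
# Label-free quantification sketch: normalize, roll up, compare.
import numpy as np
import pandas as pd
from scipy import stats

peptides = pd.read_csv("peptide_intensities.csv")  # assumed columns: protein, run, intensity

# Median normalization: equalize the intensity distribution across runs.
peptides["log2"] = np.log2(peptides["intensity"])
peptides["log2"] -= peptides.groupby("run")["log2"].transform("median")

# Roll up: summarize each protein in each run by its median peptide signal.
proteins = peptides.groupby(["protein", "run"])["log2"].median().unstack("run")

# Two-condition comparison (runs assumed to be named ctrl_* and drug_*).
ctrl = proteins.filter(like="ctrl")
drug = proteins.filter(like="drug")
t, p = stats.ttest_ind(drug, ctrl, axis=1, nan_policy="omit")
result = pd.DataFrame({
    "log2_fc": drug.mean(axis=1) - ctrl.mean(axis=1),
    "p_value": p,
}, index=proteins.index)
print(result.sort_values("p_value").head())
```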


2002 ◽  
Vol 1804 (1) ◽  
pp. 144-150
Author(s):  
Kenneth G. Courage ◽  
Scott S. Washburn ◽  
Jin-Tae Kim

The proliferation of traffic software programs on the market has resulted in many highly specialized programs, each intended to analyze one or two specific items within a transportation network. Consequently, traffic engineers use multiple programs on a single project, which, ironically, has created new inefficiencies for the traffic engineer. Most of these programs deal with the same core set of data, for example, physical roadway characteristics, traffic demand levels, and traffic control variables. However, most of these programs have their own formats for saving data files. Therefore, these programs cannot share information directly or communicate with each other because of incompatible data formats, and the traffic engineer is faced with manually reentering common data from one program into another. Besides being inefficient, this also creates additional opportunities for data entry errors. XML is catching on rapidly as a means of exchanging data between systems or users who deal with the same data but in different formats. Specific vocabularies have been developed for statistics, mathematics, chemistry, and many other disciplines. The traffic model markup language (TMML) is introduced as a resource for traffic model data representation, storage, rendering, and exchange. TMML structure and vocabulary are described, and examples of their use are presented.
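To illustrate the kind of shared-vocabulary exchange TMML aims at, here is a sketch that writes and reads one XML description of an intersection that multiple analysis tools could consume. The element and attribute names are illustrative guesses, not the published TMML schema:

```python
# Sketch of a shared XML vocabulary for core traffic data.
# Element/attribute names are illustrative, not the real TMML schema.
import xml.etree.ElementTree as ET

root = ET.Element("tmml")
link = ET.SubElement(root, "link", id="NB-approach", lanes="3", freeFlowSpeed="45")
ET.SubElement(link, "demand", vehPerHour="1200")
signal = ET.SubElement(root, "signal", id="main-and-5th")
ET.SubElement(signal, "phase", movement="NB-through", green="34", yellow="4")

ET.indent(root)  # pretty-print (Python 3.9+)
print(ET.tostring(root, encoding="unicode"))

# Any consumer can recover the shared core data without knowing
# which program wrote the file:
for ln in root.iter("link"):
    print(ln.get("id"), "demand:", ln.find("demand").get("vehPerHour"), "veh/h")
```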


2013 ◽  
Vol 5 (2) ◽  
pp. 365-373 ◽  
Author(s):  
H. Keller-Rudek ◽  
G. K. Moortgat ◽  
R. Sander ◽  
R. Sörensen

Abstract. We present the MPI-Mainz UV/VIS Spectral Atlas of Gaseous Molecules, a large collection of absorption cross sections and quantum yields in the ultraviolet and visible (UV/VIS) wavelength region for gaseous molecules and radicals, primarily of atmospheric interest. The data files contain the results of individual measurements, covering almost a century of research. To compare and visualize the data sets, multicoloured graphical representations have been created. The MPI-Mainz UV/VIS Spectral Atlas is available on the Internet at http://www.uv-vis-spectral-atlas-mainz.org. It now appears with improved browse and search options, based on new database software. In addition to the Web pages, which are continuously updated, a frozen version of the data is available at doi:10.5281/zenodo.6951.
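As a minimal sketch of reusing the Atlas's data files, which are essentially two-column tables of wavelength versus absorption cross section, the following resamples one measurement onto a regular grid for comparison. The file name and exact header layout are assumptions; check each downloaded file:

```python
# Resample a cross-section measurement onto a regular wavelength grid,
# the first step toward overlaying data sets as the Atlas's plots do.
import numpy as np

# Assumed layout: two columns, wavelength (nm) and cross section
# (cm^2/molecule), with '#' comment headers.
wl, xs = np.loadtxt("O3_cross_sections.txt", unpack=True, comments="#")

grid = np.arange(wl.min(), wl.max(), 0.5)   # 0.5 nm steps
xs_interp = np.interp(grid, wl, xs)
print(f"{len(wl)} points resampled to {len(grid)}; "
      f"peak sigma = {xs_interp.max():.3e} cm^2")
```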


2021 ◽  
Author(s):  
Josh Moore ◽  
Chris Allan ◽  
Sebastien Besson ◽  
Jean-Marie Burel ◽  
Erin Diel ◽  
...  

Biological imaging is one of the most innovative fields in the modern biological sciences. New imaging modalities, probes, and analysis tools appear every few months and often prove decisive for enabling new directions in scientific discovery. One feature of this dynamic field is the need to capture new types of data and data structures. While there is a strong drive to make scientific data Findable, Accessible, Interoperable and Reusable (FAIR, 1), the rapid rate of innovation in imaging impedes the unification and adoption of standardized data formats. Despite this, the opportunities for sharing and integrating bioimaging data and, in particular, linking these data to other "omics" datasets have never been greater; therefore, to every extent possible, increasing the "FAIRness" of bioimaging data is critical for maximizing scientific value, as well as for promoting openness and integrity. In the absence of a common, FAIR format, two approaches have emerged to provide access to bioimaging data: translation and conversion. On-the-fly translation produces a transient representation of bioimage metadata and binary data but must be repeated on each use. In contrast, conversion produces a permanent copy of the data, ideally in an open format that makes the data more accessible and improves performance and parallelization in reads and writes. Both approaches have been implemented successfully in the bioimaging community, but both have limitations. At cloud scale, those shortcomings limit scientific analysis and the sharing of results. We introduce here next-generation file formats (NGFF) as a solution to these challenges.
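A minimal sketch of the "conversion" approach, using the zarr library (on which OME-NGFF builds) to write a permanent, chunked copy of an image volume so that reads can parallelize at cloud scale. The array shape and chunk sizes are arbitrary example values:

```python
# Convert an in-memory volume to a chunked on-disk store: each chunk
# becomes independently readable, enabling parallel, partial access.
import numpy as np
import zarr

# Stand-in for an image volume produced by a converter.
volume = np.random.randint(0, 2**16, size=(64, 1024, 1024), dtype="uint16")

z = zarr.open(
    "image.zarr", mode="w",
    shape=volume.shape,
    chunks=(16, 256, 256),   # example chunking; tune to access pattern
    dtype=volume.dtype,
)
z[:] = volume

# Any worker can now fetch just the region it needs, in parallel:
tile = zarr.open("image.zarr", mode="r")[0:16, 0:256, 0:256]
print(tile.shape, "read without touching the rest of the volume")
```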


2016 ◽  
Author(s):  
Edmund Hart ◽  
Pauline Barmby ◽  
David LeBauer ◽  
François Michonneau ◽  
Sarah Mount ◽  
...  

Data is the central currency of science, but the nature of scientific data has changed dramatically with the rapid pace of technology. This change has led to a wide variety of data formats, dataset sizes, data complexity, data use cases, and data sharing practices. Improvements in high-throughput DNA sequencing, sustained institutional support for large sensor networks, and sky surveys with large-format digital cameras have created massive quantities of data. At the same time, the combination of increasingly diverse research teams and data aggregation in portals (e.g., GBIF or iDigBio for biodiversity data) necessitates increased coordination among data collectors and institutions. As a consequence, "data" can now mean anything from petabytes of information stored in professionally maintained databases, through spreadsheets on a single computer, to hand-written tables in lab notebooks on shelves. All remain important, but data curation practices must continue to keep pace with the changes brought about by new forms and practices of data collection and storage.


2012 ◽  
Vol 06 (04) ◽  
pp. 447-489
Author(s):  
Hajo Rijgersberg ◽  
Jan Top ◽  
Bob Wielinga

Computers are central to processing scientific data, which is typically expressed as numbers and strings. Appropriate annotation of "bare" data is required to allow people or machines to interpret it and to relate the data to real-world phenomena. In scientific practice, however, annotations are often incomplete and ambiguous — let alone machine-interpretable. This holds for reports and papers, but also for spreadsheets and databases. Moreover, in practice it is often unclear how the data has been created. This hampers interpretation, reproduction and reuse of results and thus leads to suboptimal science. In this paper we focus on the annotation of scientific computations. For this purpose we propose OQR (Ontology of Quantitative Research), an ontology that represents generic scientific methods and their implementation in software packages, the invocation of these methods, and the handling of tabular datasets. This ontology promotes annotation by humans but also allows automatic, semantic processing of numerical data: it allows scientists to understand the selected settings of computational methods and to automatically reproduce data generated by others. A prototype application demonstrates that this can be done, illustrated by a case in food research; we evaluate this case with a number of researchers in the domain.
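A sketch of what machine-readable annotation of a computation could look like in the spirit of OQR, using the rdflib library: record which method, which settings, and which dataset produced a result. The namespace URI and term names below are placeholders, not the published ontology:

```python
# Annotate a computation as RDF triples so a machine can answer
# "which method, with which settings, produced this result?"
from rdflib import Graph, Literal, Namespace, RDF

OQR = Namespace("http://example.org/oqr#")   # placeholder, not the real OQR URI
EX = Namespace("http://example.org/lab/")

g = Graph()
g.bind("oqr", OQR)

run = EX["run-42"]
g.add((run, RDF.type, OQR.MethodInvocation))
g.add((run, OQR.implementsMethod, OQR.LinearRegression))
g.add((run, OQR.softwarePackage, Literal("scipy 1.11")))
g.add((run, OQR.parameterSetting, Literal("fit_intercept=true")))
g.add((run, OQR.inputDataset, EX["viscosity-table-3"]))

print(g.serialize(format="turtle"))
```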


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Arsenij Ustjanzew ◽  
Alexander Desuki ◽  
Christoph Ritzel ◽  
Alina Corinna Dolezilek ◽  
Daniel-Christoph Wagner ◽  
...  

Abstract Background Extensive sequencing of tumor tissues has greatly improved our understanding of cancer biology over the past years. The integration of genomic and clinical data is increasingly used to select personalized therapies in dedicated tumor boards (Molecular Tumor Boards) or to identify patients for basket studies. Genomic alterations and clinical information can be stored, integrated and visualized in the open-access resource cBioPortal for Cancer Genomics. cBioPortal can be run as a local instance, enabling storage and analysis of patient data in single institutions in compliance with data privacy requirements. However, uploading clinical input data and genetic aberrations requires preparing multiple data files in specific formats, which makes it difficult to integrate this system into clinical practice. To solve this problem, we developed cbpManager. Results cbpManager is an R package providing a web-based interactive graphical user interface intended to facilitate the maintenance of mutation data and clinical data, including patient and sample information as well as timeline data. cbpManager enables a broad spectrum of researchers and physicians, regardless of their informatics skills, to intuitively create data files ready for upload to cBioPortal for Cancer Genomics, either daily or in batch. Due to its modular structure based on R Shiny, further data formats such as copy-number and fusion data can be covered in future versions. Further, we provide cbpManager as a containerized solution, enabling straightforward large-scale deployment in clinical systems and secure access in combination with ShinyProxy. cbpManager is freely available via the Bioconductor project at https://bioconductor.org/packages/cbpManager/ under the AGPL-3 license. It is already used at six university hospitals in Germany (Mainz, Gießen, Lübeck, Halle, Freiburg, and Marburg). Conclusion In summary, our package cbpManager is currently a unique software solution in the cBioPortal for Cancer Genomics workflow, assisting the user in the interactive generation and management of study files suited for later upload to cBioPortal.
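To show the kind of file cbpManager generates, here is a minimal Python sketch of a cBioPortal clinical patient file: tab-separated values preceded by '#'-prefixed metadata header rows. The layout follows the cBioPortal file-format documentation, but the clinical attributes shown are example values:

```python
# Write a minimal cBioPortal clinical patient file: four '#' metadata
# rows (display names, descriptions, datatypes, priorities), then the
# attribute-ID row, then one tab-separated row per patient.
rows = [
    ("P001", "Lung adenocarcinoma", "58"),
    ("P002", "Colorectal carcinoma", "64"),
]
header = [
    "#Patient Identifier\tCancer Type\tAge",
    "#Patient identifier\tDiagnosed cancer type\tAge at diagnosis",
    "#STRING\tSTRING\tNUMBER",
    "#1\t1\t1",
    "PATIENT_ID\tCANCER_TYPE\tAGE",
]
with open("data_clinical_patient.txt", "w") as f:
    f.write("\n".join(header) + "\n")
    for r in rows:
        f.write("\t".join(r) + "\n")
print("Wrote data_clinical_patient.txt, ready for cBioPortal validation")
```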


INvoke ◽  
2020 ◽  
Vol 5 ◽  
Author(s):  
Sarah Marlow

This paper critically responds to Stacy Alaimo's "Eluding Capture: The Science, Culture and Pleasure of Queer Animals" (2010), from Queer Ecologies, edited by Catriona Mortimer-Sandilands and Bruce Erickson. Here, I focus on how the author addresses the relationship between the social sciences and the natural sciences, how social structures shape the ways in which we understand and interpret scientific data, and how she suggests we embrace the concept of "Naturecultures" in order to move forward in recognizing that heteronormative accounts of life, while dominant, are not the only possible lenses through which nature and sex can or should be seen. I explore Alaimo's arguments against various accounts of "same-sex" sexual activity in nature, whilst also reiterating that she does not wish to use animal sex as a form of validation for the LGBTQ+ community, which would reduce its mere existence to biological essentialism and erase any possible discussion of gender and sexual fluidity. Instead, she cleverly uses rhetoric regarding animal sex and its perceived sexuality to expose the intrinsic heteronormativity that permeates even the supposedly "empirical" biological sciences, whilst bringing forward what I perceive as a very valuable discussion of how social life influences biological life, as opposed to the other way around.

Keywords: naturecultures, biopolitics, sexuality, queer


2019 ◽  
Author(s):  
Dobromir Rahnev ◽  
Kobe Desender ◽  
Alan L. F. Lee ◽  
William T. Adler ◽  
David Aguilar-Lleyda ◽  
...  

Understanding how people rate their confidence is critical for characterizing a wide range of perceptual, memory, motor, and cognitive processes. However, as in many other fields, progress has been slowed by the difficulty of collecting new data and the unavailability of existing data. To address this issue, we created a large database of confidence studies spanning a broad set of paradigms, participant populations, and fields of study. The data from each study are structured in a common, easy-to-use format that can be imported and analyzed in multiple software packages. Each dataset is further accompanied by an explanation regarding the nature of the collected data. At the time of publication, the Confidence Database (available at osf.io/s46pr) contained 145 datasets with data from over 8,700 participants and almost 4 million trials. The database will remain open for new submissions indefinitely and is expected to continue to grow. We show the usefulness of this large collection of datasets in four analyses that provide precise estimates of several foundational confidence-related effects and lead to new findings that depend on the availability of large quantities of data. The Confidence Database will continue to enable new discoveries and can serve as a blueprint for similar databases in related fields.
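A minimal sketch of reusing one Confidence Database dataset: compute each participant's mean confidence on correct versus error trials, a basic marker of metacognitive sensitivity. The column names follow the common format described in the paper (Subj_idx, Stimulus, Response, Confidence), but should be verified against the downloaded file:

```python
# Per-subject confidence on correct vs. error trials from one dataset.
import pandas as pd

df = pd.read_csv("data_Example.csv")  # hypothetical file name
df["correct"] = (df["Stimulus"] == df["Response"]).astype(int)

by_subject = df.groupby(["Subj_idx", "correct"])["Confidence"].mean().unstack("correct")
by_subject.columns = ["error", "correct"]   # unstack orders 0 (error), 1 (correct)
print(by_subject.head())
print("Mean confidence gap (correct - error):",
      round((by_subject["correct"] - by_subject["error"]).mean(), 3))
```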

