Development of Data Representation Standards by the Human Proteome Organization Proteomics Standards Initiative

2015 ◽  
Vol 22 (3) ◽  
pp. 495-506 ◽  
Author(s):  
Eric W Deutsch ◽  
Juan Pablo Albar ◽  
Pierre-Alain Binz ◽  
Martin Eisenacher ◽  
Andrew R Jones ◽  
...  

Objective: To describe the goals of the Proteomics Standards Initiative (PSI) of the Human Proteome Organization, the methods that the PSI has employed to create data standards, the resulting output of the PSI, lessons learned from the PSI’s evolution, and future directions and synergies for the group.

Materials and Methods: The PSI has 5 categories of deliverables that have guided the group. These are minimum information guidelines, data formats, controlled vocabularies, resources and software tools, and dissemination activities. These deliverables are produced via the leadership and working group organization of the initiative, driven by frequent workshops and ongoing communication within the working groups. Official standards are subjected to a rigorous document process that includes several levels of peer review prior to release.

Results: We have produced and published minimum information guidelines describing what information should be provided when making data public, either via public repositories or other means. The PSI has produced a series of standard formats covering mass spectrometer input, mass spectrometer output, results of informatics analysis (both qualitative and quantitative analyses), reports of molecular interaction data, and gel electrophoresis analyses. We have produced controlled vocabularies that ensure that concepts are uniformly annotated in the formats and engaged in extensive software development and dissemination efforts so that the standards can efficiently be used by the community.

Conclusion: In its first dozen years of operation, the PSI has produced many standards that have accelerated the field of proteomics by facilitating data exchange and deposition to data repositories. We look to the future to continue developing standards for new proteomics technologies and workflows and mechanisms for integration with other omics data types. Our products facilitate the translation of genomics and proteomics findings to clinical and biological phenotypes. The PSI website can be accessed at http://www.psidev.info.
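The best known of these standard formats is mzML, the PSI format for mass spectrometer output. As a minimal sketch of how such a standard enables tool interoperability, the snippet below reads spectra from an mzML file; it assumes the third-party pyteomics library is installed, and the file name is a placeholder.

```python
# A minimal sketch of consuming a PSI standard format (mzML).
# Assumes the third-party pyteomics library; "example.mzML" is a placeholder.
from pyteomics import mzml

with mzml.read("example.mzML") as spectra:
    for spectrum in spectra:
        # Spectrum attributes are keyed by PSI-MS controlled-vocabulary terms,
        # so any compliant tool interprets them identically.
        print(spectrum["id"], spectrum["ms level"], len(spectrum["m/z array"]))
        break  # just the first spectrum
```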

2018 ◽  
Vol 12 (2) ◽  
pp. 76-85
Author(s):  
Dharma Akmon ◽  
Margaret Hedstrom ◽  
James D. Myers ◽  
Anna Ovchinnikova ◽  
Inna Kouper

SEAD – a project funded by the US National Science Foundation’s DataNet program – has spent the last five years designing, building, and deploying an integrated set of services to better connect scientists’ research workflows to data publication and preservation activities. Throughout the project, SEAD has promoted the concept and practice of “active curation,” which consists of capturing data and metadata early and refining them throughout the data life cycle. In promoting active curation, our team saw an opportunity to develop tools that would help scientists better manage data for their own use, improve team coordination around data, implement practices that would serve the data better over time, and seamlessly connect with data repositories to ease the burden of sharing and publishing. SEAD has worked with 30 projects, dozens of researchers, and hundreds of thousands of files, providing us with ample opportunities to learn about data and metadata, about integrating with researchers’ workflows, and about building tools and services for data. In this paper, we discuss the lessons we have learned and suggest how they might guide future data infrastructure development efforts.


2020 ◽  
pp. 958-971
Author(s):  
Marcel Ramos ◽  
Ludwig Geistlinger ◽  
Sehyun Oh ◽  
Lucas Schiffer ◽  
Rimsha Azhar ◽  
...  

PURPOSE: Investigations of the molecular basis for the development, progression, and treatment of cancer increasingly use complementary genomic assays to gather multiomic data, but management and analysis of such data remain complex. The cBioPortal for cancer genomics currently provides multiomic data from > 260 public studies, including The Cancer Genome Atlas (TCGA) data sets, but integration of different data types remains challenging and error prone for computational methods and tools using these resources. Recent advances in data infrastructure within the Bioconductor project enable a novel and powerful approach to creating fully integrated representations of these multiomic, pan-cancer databases.

METHODS: We provide a set of R/Bioconductor packages for working with TCGA legacy data and cBioPortal data, with special considerations for loading time; efficient representations in and out of memory; analysis platform; and an integrative framework, such as MultiAssayExperiment. Large methylation data sets are provided through out-of-memory data representation to provide responsive loading times and analysis capabilities on machines with limited memory.

RESULTS: We developed the curatedTCGAData and cBioPortalData R/Bioconductor packages to provide integrated multiomic data sets from the TCGA legacy database and the cBioPortal web application programming interface using the MultiAssayExperiment data structure. This suite of tools provides coordination of diverse experimental assays with clinicopathological data with minimal data management burden, as demonstrated through several greatly simplified multiomic and pan-cancer analyses.

CONCLUSION: These integrated representations enable analysts and tool developers to apply general statistical and plotting methods to extensive multiomic data through user-friendly commands and documented examples.
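The packages themselves are R/Bioconductor; as a language-neutral illustration of the resource they wrap, the sketch below queries the public cBioPortal web API directly. The instance URL and the response field names are assumptions based on the public cBioPortal deployment, not part of the packages described above.

```python
# A hedged sketch of querying the cBioPortal web API that cBioPortalData wraps.
# Assumes the public instance at https://www.cbioportal.org is reachable and
# that study records expose "studyId" and "name" fields.
import requests

resp = requests.get("https://www.cbioportal.org/api/studies")
resp.raise_for_status()
studies = resp.json()

print(len(studies), "studies available")
print(studies[0]["studyId"], "-", studies[0]["name"])
```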


2002 ◽  
Vol 1804 (1) ◽  
pp. 144-150
Author(s):  
Kenneth G. Courage ◽  
Scott S. Washburn ◽  
Jin-Tae Kim

The proliferation of traffic software programs on the market has resulted in many very specialized programs, intended to analyze one or two specific items within a transportation network. Consequently, traffic engineers use multiple programs on a single project, which ironically has resulted in new inefficiency for the traffic engineer. Most of these programs deal with the same core set of data, for example, physical roadway characteristics, traffic demand levels, and traffic control variables. However, most of these programs have their own formats for saving data files. Therefore, these programs cannot share information directly or communicate with each other because of incompatible data formats. Thus, the traffic engineer is faced with manually reentering common data from one program into another. In addition to inefficiency, this also creates additional opportunities for data entry errors. XML is catching on rapidly as a means for exchanging data between two systems or users who deal with the same data but in different formats. Specific vocabularies have been developed for statistics, mathematics, chemistry, and many other disciplines. The traffic model markup language (TMML) is introduced as a resource for traffic model data representation, storage, rendering, and exchange. TMML structure and vocabulary are described, and examples of their use are presented.
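To make the shared-data idea concrete, the sketch below builds a small XML fragment of the kind a vocabulary like TMML could carry for a single roadway link. The element names are illustrative only, not the published TMML vocabulary.

```python
# A hedged sketch of an XML exchange fragment for core traffic data
# (roadway characteristics, demand, control). Element names are
# illustrative, not the published TMML vocabulary.
import xml.etree.ElementTree as ET

network = ET.Element("network")
link = ET.SubElement(network, "link", id="NB-101")
ET.SubElement(link, "lanes").text = "2"
ET.SubElement(link, "demandVehPerHour").text = "1450"
ET.SubElement(link, "control").text = "signalized"

# Any tool that understands the shared vocabulary can read this directly,
# instead of forcing the engineer to re-enter the data by hand.
print(ET.tostring(network, encoding="unicode"))
```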


Author(s):  
Roy Gelbard ◽  
Avichai Meged

Representing, and consequently processing, fuzzy data in standard and binary databases is problematic. The problem is further amplified in binary databases, where continuous data is represented by means of discrete ‘1’ and ‘0’ bits. As regards classification, the problem becomes even more acute: we may want to group objects based on some fuzzy attributes, but unfortunately, an appropriate fuzzy similarity measure is not always easy to find. The current paper proposes a novel model and measure for representing fuzzy data, which lends itself to both classification and data mining.

Classification algorithms and data mining attempt to set up hypotheses regarding the assignment of different objects to groups and classes on the basis of the similarity/distance between them (Estivill-Castro & Yang, 2004; Lim, Loh & Shih, 2000; Zhang & Srihari, 2004). They are widely used in numerous fields, including the social sciences, where observations and questionnaires are used to study mechanisms of social behavior; marketing, for segmentation and customer profiling; finance, for fraud detection; computer science, for image processing and expert systems applications; medicine, for diagnostics; and many other fields. Classification algorithms and data mining methodologies are based on a procedure that calculates a similarity matrix, based on a similarity index between objects, and on a grouping technique. Research has shown that a similarity measure based upon binary data representation yields better results than regular similarity indexes (Erlich, Gelbard & Spiegler, 2002; Gelbard, Goldman & Spiegler, 2007). However, binary representation is currently limited to nominal discrete attributes, such as gender or marital status (Zhang & Srihari, 2003). This makes the binary approach to data representation unattractive for widespread data types.

The current research describes a novel approach to binary representation, referred to as Fuzzy Binary Representation, which is suitable for all data types: nominal, ordinal, and continuous. We propose that there is meaning not only in the actual explicit attribute value, but also in its implicit similarity to other possible attribute values. These similarities can be determined either by a problem domain expert or automatically, by analyzing fuzzy functions that represent the problem domain. The added fuzzy similarity yields improved classification and data mining results. More generally, Fuzzy Binary Representation and its related similarity measures exemplify that refined and carefully designed handling of data, including the eliciting of domain expertise regarding similarity, may add both value and knowledge to existing databases.
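A minimal sketch of the idea for an ordinal attribute: a crisp one-hot bit vector gives zero similarity between adjacent values, while a fuzzy encoding that grants partial membership to neighbouring values recovers it. The membership function and the Dice-style index below are illustrative, not the authors’ exact model.

```python
# A minimal sketch of fuzzy binary representation: neighbouring attribute
# values receive partial membership instead of a crisp 0/1 bit.
# Membership spread and similarity index are illustrative choices.

def crisp_encoding(value, categories):
    return [1.0 if c == value else 0.0 for c in categories]

def fuzzy_encoding(value, categories, spread=0.5):
    # Ordinal attribute: membership decays with distance from the actual value.
    i = categories.index(value)
    return [max(0.0, 1.0 - spread * abs(i - j)) for j in range(len(categories))]

def dice_similarity(a, b):
    overlap = sum(x * y for x, y in zip(a, b))
    return 2 * overlap / (sum(x * x for x in a) + sum(y * y for y in b))

ages = ["young", "adult", "middle-aged", "old"]
# Crisp bits: adjacent values look completely dissimilar.
print(dice_similarity(crisp_encoding("adult", ages),
                      crisp_encoding("middle-aged", ages)))   # 0.0
# Fuzzy bits: adjacent values retain meaningful similarity.
print(dice_similarity(fuzzy_encoding("adult", ages),
                      fuzzy_encoding("middle-aged", ages)))   # ~0.67
```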


Data ◽  
2019 ◽  
Vol 4 (1) ◽  
pp. 16 ◽  
Author(s):  
Silke Cuno ◽  
Lina Bruns ◽  
Nikolay Tcholtchev ◽  
Philipp Lämmel ◽  
Ina Schieferdecker

European cities and communities (and beyond) require a structured overview and a set of tools to achieve a sustainable transformation towards smarter cities/municipalities, leveraging the enormous potential of the emerging data-driven economy. This paper presents the results of a recent study that was conducted with a number of German municipalities/cities. Based on the recommendations emerging from the study, which are briefly presented, the authors propose the concept of an Urban Data Space (UDS), which facilitates an ecosystem for data exchange and added-value creation, utilizing the various types of data within a smart city/municipality. Looking at an Urban Data Space from within a German context, and considering the current situation and developments in German municipalities, this paper proposes a classification of urban data that allows various data types to be related to legal aspects and supports sound decisions regarding technical implementation designs. Furthermore, the Urban Data Space is described and analyzed in detail, relevant stakeholders are identified, and corresponding technical artifacts are introduced. The authors propose to set up Urban Data Spaces based on emerging standards from the area of ICT reference architectures for Smart Cities, such as DIN SPEC 91357 “Open Urban Platform” and EIP SCC. In the course of this, the paper walks the reader through the construction of a UDS based on the above-mentioned architectures and outlines the goals, recommendations, and potentials that an Urban Data Space can reveal to a municipality/city. Finally, we derive the proposed concepts in such a way that they have the potential to be part of the required set of tools for the sustainable transformation of German and European cities towards smarter urban environments, based on utilizing the hidden potential of digitalization and efficient, interoperable data exchange.


2019 ◽  
Vol 27 (5) ◽  
pp. 687-710
Author(s):  
Oleksii Osliak ◽  
Andrea Saracino ◽  
Fabio Martinelli

Purpose: This paper aims to propose a structured threat information expression (STIX)-based data representation for privacy-preserving data analysis, to report the format and semantics of specific data types, and to represent sticky policies in the form of embedded human-readable data sharing agreements (DSAs). More specifically, the authors exploit and extend the STIX standard to represent, in a structured way, analysis-ready pieces of data and the attached privacy policies.

Design/methodology/approach: The whole scheme is designed to be completely compatible with the STIX 2.0 standard for cyber-threat intelligence (CTI) representation. The proposed scheme is implemented in this work by defining a complete scheme for representing an email, designed specifically for spam email analysis, which is more expressive than the standard one defined for STIX.

Findings: The paper provides a new scheme for general DSA representation that has been practically applied in the process of encoding specific attributes in different CTI reports.

Research limitations/implications: Because of the chosen approach, the research results may have limitations. Specifically, a limitation of current practice for entity recognition was discovered during the research; however, its effect on processing time was minimized, and a way to improve it was proposed.

Originality/value: This paper covers an existing gap: the lack of generality in DSA representation for privacy-preserving analysis of structured CTI. Therefore, a new model for DSA representation was introduced, together with its practical implementation.
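A hedged sketch of the general idea: a STIX 2.0-style observed-data object carrying a spam email observable, with a sticky policy embedded as a custom (x_-prefixed) property. The core fields follow the STIX 2.0 specification; the DSA property name and contents are illustrative, not the authors’ published schema.

```python
# A sketch of a STIX 2.0-style CTI report with an embedded sticky policy.
# Core observed-data fields follow STIX 2.0; "x_data_sharing_agreement"
# is a hypothetical custom property standing in for the paper's DSA scheme.
import json

report = {
    "type": "observed-data",
    "id": "observed-data--00000000-0000-0000-0000-000000000000",
    "created": "2019-01-01T00:00:00.000Z",
    "modified": "2019-01-01T00:00:00.000Z",
    "first_observed": "2019-01-01T00:00:00.000Z",
    "last_observed": "2019-01-01T00:00:00.000Z",
    "number_observed": 1,
    "objects": {
        "0": {"type": "email-message", "is_multipart": False, "subject": "You won!"}
    },
    # Hypothetical human-readable DSA attached as a sticky policy:
    "x_data_sharing_agreement": {
        "purpose": "spam-analysis",
        "allowed_operations": ["aggregate", "classify"],
        "prohibitions": ["re-identification"],
    },
}
print(json.dumps(report, indent=2))
```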


2020 ◽  
Author(s):  
John S. Hughes ◽  
Daniel J. Crichton

The PDS4 Information Model (IM) Version 1.13.0.0 was released for use in December 2019. The ontology-based IM remains true to its foundational principles found in the Open Archive Information System (OAIS) Reference Model (ISO 14721) and the Metadata Registry (MDR) standard (ISO/IEC 11179). The standards generated from the IM have become the de-facto data archiving standards for the international planetary science community and have successfully scaled to meet the requirements of the diverse and evolving planetary science disciplines.

A key foundational principle is the use of a multi-level governance scheme that partitions the IM into semi-independent dictionaries. The governance scheme first partitions the IM vertically into three levels: the common, discipline, and project/mission levels. The IM is then partitioned horizontally across both discipline and project/mission levels into individual Local Data Dictionaries (LDDs).

The Common dictionary defines the classes used across the science disciplines such as product, collection, bundle, data formats, data types, and units of measurement. The dictionary resulted from a large collaborative effort involving domain experts across the community. An ontology modeling tool was used to enforce a modeling discipline, for configuration management, to ensure consistency and extensibility, and to enable interoperability. The Common dictionary encompasses the information categories defined in the OAIS RM, specifically data representation, provenance, fixity, identification, reference, and context. Over the last few years, the Common dictionary has remained relatively stable in spite of requirements levied by new missions, instruments, and more complex data types.

Since the release of the Common dictionary, the creation of a significant number of LDDs has proved the effectiveness of multi-level, steward-based governance. This scheme is allowing the IM to scale to meet the archival and interoperability demands of the evolving disciplines. In fact, an LDD development “cottage industry” has emerged that required improvements to the development processes and configuration management. An LDD development tool now allows dictionary stewards to quickly produce specialized LDDs that are consistent with the Common dictionary.

The PDS4 Information Model is a world-class knowledge-base that governs the Planetary Science community's trusted digital repositories. This presentation will provide an overview of the model and additional information about its multi-level governance scheme including the topics of stewardship, configuration management, processes, and oversight.
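As a concrete anchor for the common-dictionary classes mentioned above, the sketch below shows a heavily abbreviated PDS4-style label and reads its identifier programmatically. The element names follow publicly documented PDS4 label structure, but the identifier and title values are placeholders, not products of this presentation.

```python
# A minimal sketch of a PDS4-style label, abbreviated to the common (pds:)
# namespace; discipline LDDs would add their own namespaces alongside it.
# Identifier and title values are placeholders.
import xml.etree.ElementTree as ET

LABEL = """<?xml version="1.0" encoding="UTF-8"?>
<Product_Observational xmlns="http://pds.nasa.gov/pds4/pds/v1">
  <Identification_Area>
    <logical_identifier>urn:nasa:pds:example_bundle:data:obs_001</logical_identifier>
    <version_id>1.0</version_id>
    <title>Example observational product</title>
    <information_model_version>1.13.0.0</information_model_version>
    <product_class>Product_Observational</product_class>
  </Identification_Area>
</Product_Observational>"""

ns = {"pds": "http://pds.nasa.gov/pds4/pds/v1"}
root = ET.fromstring(LABEL.encode("utf-8"))
lid = root.find("./pds:Identification_Area/pds:logical_identifier", ns)
print(lid.text)  # the globally unique identifier the registry governs
```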


2003 ◽  
Vol 4 (1) ◽  
pp. 16-19 ◽  
Author(s):  
Sandra Orchard ◽  
Paul Kersey ◽  
Henning Hermjakob ◽  
Rolf Apweiler

The Proteomics Standards Initiative (PSI) aims to define community standards for data representation in proteomics and to facilitate data comparison, exchange and verification. Initially the fields of protein–protein interactions (PPI) and mass spectrometry have been targeted, and the inaugural meeting of the PSI addressed the questions of data storage and exchange in both of these areas. The PPI group rapidly reached consensus as to the minimum requirements for a data exchange model; an XML draft is now being produced. The mass spectrometry group has achieved major advances in the definition of a required data model, and working groups are currently taking these discussions further. A further meeting is planned in January 2003 to advance both these projects.


2004 ◽  
Vol 5 (2) ◽  
pp. 184-189 ◽  
Author(s):  
H. Schoof ◽  
R. Ernst ◽  
K. F. X. Mayer

The completion of the Arabidopsis genome and the large collections of other plant sequences generated in recent years have sparked extensive functional genomics efforts. However, the utilization of this data is inefficient, as data sources are distributed and heterogeneous and efforts at data integration are lagging behind. PlaNet aims to overcome the limitations of individual efforts as well as the limitations of heterogeneous, independent data collections. PlaNet is a distributed effort among European bioinformatics groups and plant molecular biologists to establish a comprehensive integrated database in a collaborative network. Objectives are the implementation of infrastructure and data sources to capture plant genomic information into a comprehensive, integrated platform. This will facilitate the systematic exploration of Arabidopsis and other plants. New methods for data exchange, database integration and access are being developed to create a highly integrated, federated data resource for research. The connection between the individual resources is realized with BioMOBY. BioMOBY provides an architecture for the discovery and distribution of biological data through web services. While knowledge is centralized, data is maintained at its primary source without a need for warehousing. To standardize nomenclature and data representation, ontologies and generic data models are defined in interaction with the relevant communities. Minimal data models should make it simple to allow broad integration, while inheritance allows detail and depth to be added to more complex data objects without losing integration. To allow expert annotation and keep databases curated, local and remote annotation interfaces are provided. Easy and direct access to all data is key to the project.
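A hedged sketch of the federated pattern described above: a central registry knows which service provides which kind of data, while the records themselves stay at their primary sources. The registry URL, query parameters, and response fields below are hypothetical, not the actual BioMOBY interfaces.

```python
# A sketch of registry-based service discovery with data kept at the source.
# Registry URL, parameters, and fields are hypothetical placeholders for a
# MOBY-style architecture; nothing here is the real BioMOBY API.
import requests

REGISTRY = "https://registry.example.org/services"  # hypothetical central registry

# 1. Discover a service by the type of data it produces (knowledge is central).
services = requests.get(REGISTRY, params={"output_type": "GeneAnnotation"}).json()
endpoint = services[0]["endpoint"]

# 2. Fetch the record from its primary source (data is never warehoused).
annotation = requests.get(endpoint, params={"gene_id": "At1g01010"}).json()
print(annotation)
```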

