BioPaxCOMP: an efficient system for integrating, compressing, and querying BioPAX
2016 ◽  
Author(s):  
Giuseppe Agapito ◽  
Andrea Greco ◽  
Mario Cannataro

Abstract Biological networks and, in particular, biological pathways are composed of thousands of nodes and edges, posing several challenges for analysis and storage. The primary format used to represent pathway data is BioPAX (http://biopax.org). BioPAX is a standard language that aims to enable the integration, exchange, visualization, and analysis of biological pathway data. BioPAX is an open, collaborative effort by a community of researchers, software developers, and institutions, and it specifically supports data exchange between pathway data groups. BioPAX is defined in OWL and is represented in the RDF/XML format. OWL (Web Ontology Language) is a W3C standard designed for applications that need to process the content of information rather than merely present it to humans; RDF is a standard model for data interchange on the Web. Although OWL allows a standard representation of pathways, it is based on XML and is therefore verbose and redundant, so pathway files can become very large, hindering efficient transmission and sharing of these data. The typical size of a pathway depends on the organism; for example, the Homo sapiens pathways from the Reactome database occupy nearly 200 MB on disk. Moreover, integrating pathway data from different sources may require gigabytes of space. A second problem is the need to integrate information coming from different data sources so that up-to-date information is available in a centralized way. Several pathway databases exist, each emphasizing different aspects of the same pathway; it is therefore useful to integrate and annotate pathways from different databases to obtain centralized, more informative pathway data. The principal obstacle to integrating, storing, and exchanging such data is the extreme growth in size when several pathway datasets are merged, which poses challenges from both the computational and the archiving point of view. Pathway data can readily be classified as big data because they meet all of the 5V characteristics (Volume, Velocity, Variety, Veracity, Value) typical of Big Data; hence the need to integrate and compress pathway data efficiently. The methodology for pathway data integration is based on the following steps: i) local aggregation and validation of data coming from several pathway databases, ii) identification and normalization of compound and reaction identifiers, and iii) integration. Integration occurs at the level of physical entities, such as proteins and small molecules. This is accomplished by linking interaction and pathway records together when they reference the same physical entities (e.g., UniProt entries for proteins) and by adding annotation data from UniProt or Gene Ontology.
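
To make the format concrete, the sketch below (not the BioPaxCOMP implementation; the input file name is hypothetical) uses the rdflib library to parse a BioPAX Level 3 OWL/RDF file, count pathway and protein records, and show how much generic gzip compression alone reduces the verbose RDF/XML serialization on disk.

```python
import gzip
import os

from rdflib import Graph, Namespace, RDF

# BioPAX Level 3 namespace (standard); the input file name is hypothetical.
BP = Namespace("http://www.biopax.org/release/biopax-level3.owl#")
INPUT = "homo_sapiens_pathways.owl"

# Parse the RDF/XML serialization of the BioPAX document.
g = Graph()
g.parse(INPUT, format="xml")

# Count pathway and protein records declared in the file.
pathways = list(g.subjects(RDF.type, BP.Pathway))
proteins = list(g.subjects(RDF.type, BP.Protein))
print(f"{len(pathways)} pathways, {len(proteins)} proteins")

# Illustrate how verbose RDF/XML is: even generic gzip shrinks it considerably.
with open(INPUT, "rb") as src, gzip.open(INPUT + ".gz", "wb") as dst:
    dst.write(src.read())
print(f"raw: {os.path.getsize(INPUT)} bytes, "
      f"gzipped: {os.path.getsize(INPUT + '.gz')} bytes")
```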


2015 ◽  
Vol 31 (2) ◽  
pp. 249-262 ◽  
Author(s):  
Piet J.H. Daas ◽  
Marco J. Puts ◽  
Bart Buelens ◽  
Paul A.M. van den Hurk

Abstract More and more data are being produced by an increasing number of electronic devices physically surrounding us and on the internet. The large amount of data and the high frequency at which they are produced have resulted in the introduction of the term ‘Big Data’. Because these data reflect many different aspects of our daily lives and because of their abundance and availability, Big Data sources are very interesting from an official statistics point of view. This article discusses the exploration of both opportunities and challenges for official statistics associated with the application of Big Data. Experiences gained with analyses of large amounts of Dutch traffic loop detection records and Dutch social media messages are described to illustrate the topics characteristic of the statistical analysis and use of Big Data.


Author(s):  
Alberto Traverso ◽  
Frank J. W. M. Dankers ◽  
Leonard Wee ◽  
Sander M. J. van Kuijk

Abstract Pre-requisites to better understand the chapter: basic knowledge of the major sources of clinical data. Logical position of the chapter with respect to the previous chapter: in the previous chapter, you learned what the major sources of clinical data are. In this chapter, we dive into the main characteristics of the presented data sources; in particular, we learn how to distinguish and classify data according to their scale. Learning objectives: you will learn the major differences between the data sources presented in previous chapters and how clinical data can be classified according to their scale. You will become familiar with the concept of 'big' clinical data, and you will learn the major concerns limiting 'big' data exchange.
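
As a minimal illustration of classifying variables by measurement scale (not taken from the chapter itself; the variable names below are hypothetical), a Python sketch might look like this:

```python
from enum import Enum

# Stevens' measurement scales, commonly used to classify clinical variables.
class Scale(Enum):
    NOMINAL = "categories without order (e.g., blood group)"
    ORDINAL = "ordered categories (e.g., tumour stage I-IV)"
    INTERVAL = "numeric, no true zero (e.g., temperature in Celsius)"
    RATIO = "numeric with a true zero (e.g., tumour volume in mL)"

# Hypothetical clinical variables mapped to their scale.
clinical_variables = {
    "blood_group": Scale.NOMINAL,
    "ecog_performance_status": Scale.ORDINAL,
    "body_temperature_c": Scale.INTERVAL,
    "tumour_volume_ml": Scale.RATIO,
}

for name, scale in clinical_variables.items():
    print(f"{name}: {scale.name} - {scale.value}")
```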


Metahumaniora ◽  
2018 ◽  
Vol 8 (3) ◽  
pp. 300
Author(s):  
Tania Intan ◽  
Trisna Gumilar

Abstract This study aims to (1) describe readers' responses to the novel Le Petit Prince, (2) describe readers' horizon of expectations for Le Petit Prince, and (3) describe the factors causing differences in readers' responses and horizons of expectations. This is a descriptive qualitative study. The research data are texts containing the responses of 20 readers of Le Petit Prince, while the data sources are articles and papers published in print and electronic mass media, including the internet. The research instrument is a set of concepts concerning the reader, reader response, and the horizon of expectations. Data were collected through observation and analysed using a qualitative descriptive technique. The results are as follows. (1) All readers responded positively to the theme, plot, characters, setting, point of view, language style, storytelling technique, language, and content of Le Petit Prince. (2) The expectations of most readers before reading Le Petit Prince matched these nine elements of the novel, so readers could easily accept and praise it. (3) The differences in readers' responses and horizons of expectations arise not only from differences in which elements they emphasized, but also from differences in their knowledge of literature, knowledge of life, and experience in reading literary works. Keywords: reader response, horizon of expectations, Le Petit Prince


2019 ◽  
Author(s):  
Michiru Makuuchi

Symbolic behaviours such as language, music, drawing, and dance are unique to humans and are found universally in every culture on earth [1]. These behaviours operate in different cognitive domains, but they are commonly characterised as linear sequences of symbols [2,3]. One of the most prominent features of language is hierarchical structure [4], which is also found in music [5,6] and mathematics [7]. The current research addresses whether hierarchical structure exists in drawing. When we draw complex objects, such as a face, we draw part by part in a hierarchical manner guided by visual semantic knowledge [8]. More specifically, we predicted how hierarchical structure emerges in drawing as follows. Although the drawing order of the constituent parts composing the target object differs amongst individuals, some parts will be drawn in succession consistently, thereby forming chunks. These chunks of parts would then be further integrated with other chunks into superordinate chunks, while showing differential affinity amongst chunks. The integration of chunks to an even higher chunk level repeats until finally reaching the full object. We analysed the order of drawing strokes of twenty-two complex objects by twenty-five young healthy adult participants with a cluster analysis [9] and demonstrated reasonable hierarchical structures. The results suggest that drawing involves a linear production of symbols with a hierarchical structure. From an evolutionary point of view, we argue that ancient engravings and paintings manifest Homo sapiens' capability for hierarchical symbolic cognition.
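
For illustration only, the sketch below (not the authors' analysis pipeline; the part names, drawing orders, and distance definition are invented) shows how a hierarchy of chunks can be extracted from drawing-order data of this kind with SciPy's agglomerative clustering.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Hypothetical data: for each participant, the rank at which each part of a
# face was drawn (0 = drawn first). Part names and ranks are invented.
parts = ["outline", "left_eye", "right_eye", "nose", "mouth", "hair"]
orders = np.array([
    [0, 1, 2, 3, 4, 5],
    [0, 2, 1, 3, 4, 5],
    [1, 2, 3, 4, 5, 0],
    [0, 1, 2, 4, 3, 5],
])

# Distance between two parts = mean absolute difference of their drawing
# ranks across participants; parts drawn in succession end up close together.
n = len(parts)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        dist[i, j] = np.mean(np.abs(orders[:, i] - orders[:, j]))

# Agglomerative clustering on the condensed distance matrix yields a
# hierarchy of chunks; the dendrogram encodes how chunks merge into
# superordinate chunks up to the full object.
Z = linkage(squareform(dist, checks=False), method="average")
tree = dendrogram(Z, labels=parts, no_plot=True)
print(tree["ivl"])  # leaf order of the resulting hierarchy
```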


2020 ◽  
Author(s):  
Bankole Olatosi ◽  
Jiajia Zhang ◽  
Sharon Weissman ◽  
Zhenlong Li ◽  
Jianjun Hu ◽  
...  

BACKGROUND The Coronavirus Disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), remains a serious global pandemic. Currently, all age groups are at risk of infection, but the elderly and persons with underlying health conditions are at higher risk of severe complications. In the United States (US), the pandemic curve is changing rapidly, with over 6,786,352 cases and 199,024 deaths reported. South Carolina (SC), as of 9/21/2020, reported 138,624 cases and 3,212 deaths across the state.
OBJECTIVE The growing availability of COVID-19 data provides a basis for deploying Big Data science to leverage multitudinal and multimodal data sources for incremental learning. Doing this requires the acquisition and collation of multiple data sources at the individual and county level.
METHODS The population for the comprehensive database comes from statewide COVID-19 testing surveillance data (March 2020 to present) for all SC COVID-19 patients (N≈140,000). This project will 1) connect multiple partner data sources for prediction and intelligence gathering, and 2) build a REDCap database that links de-identified multitudinal and multimodal data sources useful for machine learning and deep learning algorithms, enabling further studies. Additional data will include hospital-based COVID-19 patient registries, Health Sciences South Carolina (HSSC) data, data from the office of Revenue and Fiscal Affairs (RFA), and Area Health Resource Files (AHRF).
RESULTS The project was funded as of June 2020 by the National Institutes of Health.
CONCLUSIONS The development of such a linked and integrated database will allow the identification of important predictors of short- and long-term clinical outcomes for SC COVID-19 patients using data science.
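
As a schematic illustration of this kind of record linkage (not the project's actual pipeline or schema; all column names and values below are hypothetical), individual-level test records can be joined with county-level context using pandas:

```python
import pandas as pd

# Hypothetical, de-identified extracts; column names are invented for
# illustration and do not reflect the project's actual schema.
tests = pd.DataFrame({
    "patient_id": ["p1", "p2", "p3"],          # de-identified study IDs
    "county_fips": ["45079", "45063", "45079"],
    "test_date": pd.to_datetime(["2020-04-01", "2020-05-12", "2020-06-20"]),
    "result": ["positive", "positive", "negative"],
})
county_resources = pd.DataFrame({               # e.g., AHRF-style county data
    "county_fips": ["45079", "45063"],
    "hospital_beds_per_1k": [3.1, 2.4],
})

# Link individual-level records with county-level context for modeling.
linked = tests.merge(county_resources, on="county_fips", how="left")
print(linked.head())
```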


Author(s):  
Marco Angrisani ◽  
Anya Samek ◽  
Arie Kapteyn

The number of data sources available for academic research on retirement economics and policy has increased rapidly in the past two decades. Data quality and comparability across studies have also improved considerably, with survey questionnaires progressively converging towards common ways of eliciting the same measurable concepts. Probability-based Internet panels have become a more accepted and recognized tool to obtain research data, allowing for fast, flexible, and cost-effective data collection compared to more traditional modes such as in-person and phone interviews. In an era of big data, academic research has also increasingly been able to access administrative records (e.g., Kostøl and Mogstad, 2014; Cesarini et al., 2016), private-sector financial records (e.g., Gelman et al., 2014), and administrative data married with surveys (Ameriks et al., 2020), to answer questions that could not be successfully tackled otherwise.


2021 ◽  
Vol 37 (1) ◽  
pp. 161-169
Author(s):  
Dominik Rozkrut ◽  
Olga Świerkot-Strużewska ◽  
Gemma Van Halderen

Never has there been a more exciting time to be an official statistician. The data revolution is responding to the demands of the COVID-19 pandemic and a complex sustainable development agenda to improve how data is produced and used, to close data gaps to prevent discrimination, to build capacity and data literacy, to modernize data collection systems, and to liberate data to promote transparency and accountability. But can all data be liberated in the production and communication of official statistics? This paper explores the UN Fundamental Principles of Official Statistics in the context of eight new and big data sources. The paper concludes that each data source can be used for the production of official statistics in adherence with the Fundamental Principles and argues that these data sources should be used if National Statistical Systems are to adhere to the first Fundamental Principle of compiling and making available official statistics that honor citizens' entitlement to public information.


Omega ◽  
2021 ◽  
pp. 102479
Author(s):  
Zhongbao Zhou ◽  
Meng Gao ◽  
Helu Xiao ◽  
Rui Wang ◽  
Wenbin Liu
