Relational Databases: A Transparent Framework for Encouraging Biology Students To Think Informatically

2004 ◽  
Vol 3 (4) ◽  
pp. 241-252 ◽  
Author(s):  
Michael Rice ◽  
William Gladstone ◽  
Michael Weir

We discuss how relational databases constitute an ideal framework for representing and analyzing large-scale genomic data sets in biology. As a case study, we describe a Drosophila splice-site database that we recently developed at Wesleyan University for use in research and teaching. The database stores data about splice sites computed by a custom algorithm using Drosophila cDNA transcripts and genomic DNA and supports a set of procedures for analyzing splice-site sequence space. A generic Web interface permits the execution of the procedures with a variety of parameter settings and also supports custom structured query language queries. Moreover, new analytical procedures can be added by updating special metatables in the database without altering the Web interface. The database provides a powerful setting for students to develop informatic thinking skills.
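
The metatable mechanism described above can be sketched in a few lines. The following is a minimal illustration using Python's built-in sqlite3; the table and column names (procedures, splice_sites) are hypothetical and not taken from the Wesleyan database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE splice_sites (site_id INTEGER PRIMARY KEY,
                               kind TEXT,      -- 'donor' or 'acceptor'
                               sequence TEXT,
                               score REAL);
    INSERT INTO splice_sites (kind, sequence, score) VALUES
        ('donor', 'CAGGTAAGT', 0.97), ('donor', 'AAGGTGAGC', 0.81);
    -- Metatable: each row registers one analysis procedure as a SQL template,
    -- so new procedures can be added without changing the interface code.
    CREATE TABLE procedures (name TEXT PRIMARY KEY, sql_template TEXT);
    INSERT INTO procedures VALUES
        ('top_sites', 'SELECT sequence, score FROM splice_sites
                       WHERE kind = :kind ORDER BY score DESC LIMIT :n');
""")

def run_procedure(name, **params):
    """Look up a procedure in the metatable and execute it with parameters."""
    (template,) = conn.execute(
        "SELECT sql_template FROM procedures WHERE name = ?", (name,)).fetchone()
    return conn.execute(template, params).fetchall()

print(run_procedure("top_sites", kind="donor", n=10))
```

Adding a row to the procedures metatable exposes a new analysis through the same generic executor, which is the property the abstract highlights: the web interface never changes.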

2020 ◽  
Author(s):  
Yannick Spreen ◽  
Maximilian Miller

Motivation: The applicability and reproducibility of bioinformatics methods and results often depend on the structure and software architecture behind their development. Exponentially growing data sets demand ever more optimization and performance, which conventional computing capacities cannot keep up with. This creates a large overhead for software development in a research area that is primarily interested in solving complex biological problems rather than in developing new, performant software solutions. In computer science, new architectures in the field of web development have produced more efficient processes for container-based software solutions, but the advantages of these architectures have rarely been explored at a broader scientific scale. The same holds for the trend of migrating computations from on-premise resources to the cloud. Results: We created Bio-Node, a new platform for large-scale bio data analysis utilizing cloud compute resources (publicly available at https://bio-node.de). Bio-Node enables building complex workflows through a sophisticated web interface. We applied Bio-Node to implement bioinformatic workflows for rapid metagenome function annotation. We further developed "Auto-Clustering", a workflow that automatically extracts the most suitable clustering parameters for specific data types and then optimally segregates unknown samples of the same type. Compared to existing methods and approaches, Bio-Node improves the performance and cost of bioinformatics data analyses while providing an easier and faster development process focused on reproducibility and reusability.
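
The "Auto-Clustering" idea, scanning candidate parameter settings and keeping the best-scoring one, can be illustrated with a short sketch. The code below is a stand-in using scikit-learn's KMeans and silhouette score; it shows the concept only, not Bio-Node's actual workflow or selection criterion.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def auto_cluster(X, k_range=range(2, 11)):
    """Scan candidate cluster counts and keep the best silhouette score."""
    best = None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best  # (silhouette, chosen k, cluster labels)

X = np.random.rand(200, 5)  # stand-in for a metagenome feature matrix
score, k, labels = auto_cluster(X)
print(f"selected k={k} with silhouette={score:.3f}")
```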


Author(s):  
DONG-HYUK IM ◽  
SANG-WON LEE ◽  
HYOUNG-JOO KIM

RDF is widely used as an ontology language for representing metadata in the Semantic Web, knowledge management systems, and e-commerce. Since ontologies model the knowledge of a particular domain, they may change over time. Furthermore, ontologies are usually developed and controlled in a distributed and collaborative way. Thus, it is very important to be able to manage multiple versions of RDF data. Earlier studies on RDF versioning have focused on providing access to different versions (i.e., snapshots) and computing the differences between two versions. However, the existing approaches suffer from space overhead for large-scale data, since all snapshots must be kept redundantly in a repository. Moreover, it is very time-consuming to compute the delta between two specific versions, a very common operation in RDF applications. In this paper, we propose a framework for RDF version management in relational databases. It stores the original version and the deltas between consecutive versions, thereby reducing the space requirement considerably. Another benefit is that our approach is well suited to change queries. On the flip side, answering a query on a specific logical version requires constructing that version on the fly by applying the deltas between the original version and the logical version, which can slow down query performance. To overcome this, we propose a compression technique for deltas, called Aggregated Delta, that creates a logical version directly rather than applying the whole sequence of deltas. An experimental study with real-life RDF data sets shows that our framework maintains multiple versions efficiently.
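
The delta-storage scheme is easy to sketch if each delta is modeled as a pair of triple sets (added, deleted). The minimal Python sketch below also models the Aggregated Delta idea as the composition of consecutive deltas, so a logical version is reached in one step; the composition rule is an illustrative reconstruction, not the paper's exact algorithm.

```python
def apply_delta(triples, delta):
    """Apply one delta (added, deleted) to a set of RDF triples."""
    added, deleted = delta
    return (triples - deleted) | added

def compose(d1, d2):
    """Compose delta d1 followed by d2 into a single aggregated delta."""
    a1, r1 = d1
    a2, r2 = d2
    return ((a1 - r2) | a2, (r1 - a2) | r2)

v0 = {("ex:a", "rdf:type", "ex:Thing")}
d1 = ({("ex:a", "ex:p", "1")}, set())                    # v0 -> v1: add a triple
d2 = ({("ex:a", "ex:p", "2")}, {("ex:a", "ex:p", "1")})  # v1 -> v2: replace it

agg = compose(d1, d2)  # one aggregated delta taking v0 directly to v2
assert apply_delta(apply_delta(v0, d1), d2) == apply_delta(v0, agg)
```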


2021 ◽  
Vol 1 (2) ◽  
pp. 17-20
Author(s):  
Renas Rajab Asaad ◽  
Revink Masoud Abdulhakim

This paper explains the concept of data mining and the need for it, its objectives, and its uses in various fields; it describes its procedures and tools and the types of data that are mined, while simplifying the concepts of databases, relational databases, and the query language. It also explains the benefits and uses of mining data stored in specialized databases in various vital areas of society. Data mining is the process of analyzing data from different perspectives and discovering anomalies, patterns, and correlations in data sets that are insightful and useful for predicting results, helping you make good decisions. To return to the mining analogy: when you plan to prospect for gold or any other valuable mineral, you first have to determine where you think the gold is before you start digging. Data mining follows the same concept: to mine data, you must first collect it from various sources, prepare it, and store it in one place, since data mining itself is not concerned with searching for the data. Currently, companies store such data in what is called a data warehouse, which we discuss in detail at a later stage.
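
As a toy illustration of the collect-then-mine pipeline just described, the sketch below consolidates two hypothetical sources into one table (a stand-in for a data warehouse) and then looks for a simple correlation; all names and numbers are invented.

```python
import pandas as pd

sales = pd.DataFrame({"customer": [1, 2, 3, 4],
                      "monthly_spend": [120, 45, 300, 80]})
support = pd.DataFrame({"customer": [1, 2, 3, 4],
                        "tickets": [0, 5, 1, 4]})

warehouse = sales.merge(support, on="customer")   # consolidate first...
corr = warehouse["monthly_spend"].corr(warehouse["tickets"])
print(f"spend/ticket correlation: {corr:.2f}")    # ...then look for patterns
```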


2019 ◽  
Vol 4 (2) ◽  
pp. 206-220
Author(s):  
Dashne Raouf Arif ◽  
Nzar Abdulqadir Ali

Real-time monitoring systems utilize two types of databases: relational databases, such as MySQL, and non-relational databases, such as MongoDB. A relational database management system (RDBMS) stores data in a structured format using rows and columns; it is relational because the values in its tables are connected to one another. A non-relational database does not adopt the relational structure of traditional RDBMSs; in recent years, this class of databases has also been referred to as Not only SQL (NoSQL). This paper discusses comparisons that have been conducted on the execution-time performance of the two types of databases (SQL and NoSQL). In SQL (Structured Query Language) databases, different algorithms are used for inserting and updating data, such as indexing, bulk insert, and multiple update. In NoSQL, different algorithms are used for insert and update operations, such as default indexing, batch insert, multiple update, and pipeline aggregation. The results show, first, that compared with related papers the performance of both SQL and NoSQL can be improved, and second, that insert and update operations can be dramatically faster in the NoSQL database than in the SQL database. To demonstrate the performance of the different algorithms for inserting and updating data in SQL and NoSQL, the paper evaluates data sets of different sizes and reports the corresponding results. The SQL experiments range from 50,000 to 3,000,000 records, while the NoSQL experiments range from 50,000 to 16,000,000 documents (2 GB). In SQL, three million records are inserted within 606.53 seconds, while in NoSQL the same number of documents is inserted within 67.87 seconds. For updates, in SQL 300,000 records are updated within 271.17 seconds, while in NoSQL the same number of documents is updated within just 46.02 seconds.
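
The two bulk-insert styles compared in the paper can be sketched side by side. The snippet below contrasts SQL bulk insert via executemany in SQLite with NoSQL batch insert via insert_many in MongoDB; it assumes pymongo and a local MongoDB instance are available, and it illustrates the operation types rather than reproducing the paper's MySQL benchmark.

```python
import sqlite3, time
from pymongo import MongoClient  # assumes pymongo + a local MongoDB server

rows = [(i, f"name{i}") for i in range(50_000)]

# SQL side: one transaction, one bulk statement (executemany).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
t0 = time.perf_counter()
with conn:  # wraps the bulk insert in a single transaction
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
print(f"SQLite bulk insert:  {time.perf_counter() - t0:.2f}s")

# NoSQL side: one batch insert of documents (insert_many).
coll = MongoClient()["bench"]["t"]
coll.drop()
t0 = time.perf_counter()
coll.insert_many([{"id": i, "name": n} for i, n in rows], ordered=False)
print(f"MongoDB batch insert: {time.perf_counter() - t0:.2f}s")
```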


Author(s):  
Stefan Esser ◽  
Dirk Fahland

Abstract Process event data is usually stored either in a sequential process event log or in a relational database. While the sequential, single-dimensional nature of event logs aids querying for (sub)sequences of events based on temporal relations such as “directly/eventually-follows,” it does not support querying multi-dimensional event data of multiple related entities. Relational databases allow storing multi-dimensional event data, but existing query languages do not support querying for sequences or paths of events in terms of temporal relations. In this paper, we propose a general data model for multi-dimensional event data based on labeled property graphs that allows storing structural and temporal relations in a single, integrated graph-based data structure in a systematic way. We provide semantics for all concepts of our data model, and generic queries for modeling event data over multiple entities that interact synchronously and asynchronously. The queries allow for efficiently converting large real-life event data sets into our data model, and we provide 5 converted data sets for further research. We show that typical and advanced queries for retrieving and aggregating such multi-dimensional event data can be formulated and executed efficiently in the existing query language Cypher, giving rise to several new research questions. Specifically, aggregation queries on our data model enable process mining over multiple inter-related entities using off-the-shelf technology.
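
A flavor of such queries: the sketch below runs a variable-length "eventually-follows" Cypher pattern through the official neo4j Python driver. The node label :Event, the relationship type :DF, and the property names are assumptions for the example, not a verbatim query from the paper.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

# All activities that eventually follow a given activity along a
# directly-follows path, found with a variable-length pattern.
query = """
MATCH (a:Event {Activity: $start})-[:DF*]->(b:Event)
RETURN b.Activity AS activity, count(*) AS paths
ORDER BY paths DESC
"""
with driver.session() as session:
    for record in session.run(query, start="Create Order"):
        print(record["activity"], record["paths"])
driver.close()
```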


Author(s):  
Lior Shamir

Abstract Several recent observations using large data sets of galaxies showed a non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to have gravitational interaction. Here, a data set of $\sim8.7\cdot10^3$ spiral galaxies imaged by the Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey (SDSS). The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. Both data sets exhibit a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to a cosine dependence shows a dipole axis with probabilities of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^{\circ},\delta=47^{\circ})$ and is well within the $1\sigma$ error range of the location of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^{\circ},\delta=61^{\circ})$.
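
The cosine-dependence fit can be sketched as a grid search over candidate dipole axes, fitting the spin labels against the cosine of the angular distance to each axis. The following numpy sketch, run here on random stand-in data, illustrates only the shape of the method, not the paper's statistics or error analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
ra = rng.uniform(0, 2 * np.pi, 5000)           # stand-in galaxy coordinates
dec = rng.uniform(-np.pi / 2, np.pi / 2, 5000)
spin = rng.choice([-1.0, 1.0], 5000)           # clockwise / counterclockwise

def cos_angle(ra0, dec0):
    """Cosine of angular distance from each galaxy to axis (ra0, dec0)."""
    return (np.sin(dec) * np.sin(dec0)
            + np.cos(dec) * np.cos(dec0) * np.cos(ra - ra0))

# Grid-search candidate axes; keep the one with the strongest fitted dipole.
best = max(
    ((abs(np.polyfit(cos_angle(a, d), spin, 1)[0]), a, d)
     for a in np.linspace(0, 2 * np.pi, 36)
     for d in np.linspace(-np.pi / 2, np.pi / 2, 18)),
    key=lambda t: t[0],
)
print(f"strongest dipole at RA={np.degrees(best[1]):.0f} deg, "
      f"Dec={np.degrees(best[2]):.0f} deg, |d|={best[0]:.4f}")
```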


Algorithms ◽  
2021 ◽  
Vol 14 (5) ◽  
pp. 154
Author(s):  
Marcus Walldén ◽  
Masao Okita ◽  
Fumihiko Ino ◽  
Dimitris Drikakis ◽  
Ioannis Kokkinakis

Increasing processing capabilities and input/output constraints of supercomputers have increased the use of co-processing approaches, i.e., visualizing and analyzing data sets of simulations on the fly. We present a method that evaluates the importance of different regions of simulation data and a data-driven approach that uses the proposed method to accelerate in-transit co-processing of large-scale simulations. We use the importance metrics to simultaneously employ multiple compression methods on different data regions to accelerate the in-transit co-processing. Our approach strives to adaptively compress data on the fly and uses load balancing to counteract memory imbalances. We demonstrate the method’s efficiency through a fluid mechanics application, a Richtmyer–Meshkov instability simulation, showing how to accelerate the in-transit co-processing of simulations. The results show that the proposed method can expeditiously identify regions of interest, even when using multiple metrics. Our approach achieved a speedup of 1.29× in a lossless scenario. The data decompression time was sped up by 2× compared to using a single compression method uniformly.
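
A minimal sketch of the importance-driven adaptive compression: score each block of a field, keep high-importance blocks lossless, and quantize low-importance blocks before compression. The metric (mean gradient magnitude), the threshold, and the quantization step are illustrative choices, not the paper's configuration.

```python
import numpy as np, zlib

# Stand-in field: a smooth half and a turbulent half, 64^3 cells.
rng = np.random.default_rng(0)
field = np.tile(np.linspace(0, 1, 64, dtype=np.float32)[:, None, None],
                (1, 64, 64))
field[32:] += rng.random((32, 64, 64), dtype=np.float32)  # "interesting" region
BLOCK = 16

def importance(block):
    gx, gy, gz = np.gradient(block)  # mean gradient magnitude per block
    return float(np.mean(np.sqrt(gx**2 + gy**2 + gz**2)))

compressed = {}
for x in range(0, 64, BLOCK):
    for y in range(0, 64, BLOCK):
        for z in range(0, 64, BLOCK):
            block = field[x:x+BLOCK, y:y+BLOCK, z:z+BLOCK]
            if importance(block) > 0.05:   # region of interest: lossless
                payload = block.tobytes()
            else:                          # elsewhere: quantize, then compress
                payload = block.astype(np.float16).tobytes()
            compressed[(x, y, z)] = zlib.compress(payload)

total = sum(len(v) for v in compressed.values())
print(f"compressed {field.nbytes} bytes -> {total} bytes")
```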


Electronics ◽  
2021 ◽  
Vol 10 (5) ◽  
pp. 621
Author(s):  
Giuseppe Psaila ◽  
Paolo Fosci

Internet technology and mobile technology have enabled the production and diffusion of massive data sets concerning almost every aspect of day-to-day life. Remarkable examples are social media and apps for volunteered information production, as well as Open Data portals on which public administrations publish authoritative and (often) geo-referenced data sets. In this context, JSON has become the most popular standard for representing and exchanging possibly geo-referenced data sets over the Internet. Analysts wishing to manage, integrate, and cross-analyze such data sets need a framework that allows them to access possibly remote storage systems for JSON data sets, and to retrieve and query data sets by means of a unique query language (independent of the specific storage technology) while exploiting possibly remote computational resources (such as cloud servers), working comfortably on the PCs in their offices, largely unaware of the real location of the resources. In this paper, we present the current state of the J-CO Framework, a platform-independent and analyst-oriented software framework to manipulate and cross-analyze possibly geo-tagged JSON data sets. The paper presents the general approach behind the J-CO Framework by illustrating the query language through a simple, yet non-trivial, example of geographical cross-analysis. The paper also presents the novel features introduced by the re-engineered version of the execution engine and the most recent components, i.e., the storage service for large single JSON documents and the user interface that allows analysts to comfortably share data sets and computational resources with other analysts possibly working in different parts of the world. Finally, the paper reports the results of an experimental campaign, which show that the execution engine performs in a more than satisfactory way, proving that our framework can actually be used by analysts to process JSON data sets.
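
The J-CO query language itself is not shown in this abstract, so the sketch below illustrates the same kind of geographical cross-analysis in plain Python: a spatial join of two geo-tagged JSON data sets. File names and field names are invented for the example.

```python
import json, math

def close(p, q, km=1.0):
    """Rough equirectangular distance, adequate for a small-radius join."""
    dx = math.radians(p["lon"] - q["lon"]) * math.cos(math.radians(p["lat"]))
    dy = math.radians(p["lat"] - q["lat"])
    return 6371 * math.hypot(dx, dy) <= km

stations = json.load(open("stations.json"))  # hypothetical data set A
reports = json.load(open("reports.json"))    # hypothetical data set B

matches = [(s["name"], r["id"])
           for s in stations for r in reports
           if close(s["location"], r["location"])]
print(f"{len(matches)} report/station pairs within 1 km")
```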


GigaScience ◽  
2020 ◽  
Vol 9 (1) ◽  
Author(s):  
T Cameron Waller ◽  
Jordan A Berg ◽  
Alexander Lex ◽  
Brian E Chapman ◽  
Jared Rutter

Abstract Background: Metabolic networks represent all chemical reactions that occur between molecular metabolites in an organism’s cells. They offer biological context in which to integrate, analyze, and interpret omic measurements, but their large scale and extensive connectivity present unique challenges. While it is practical to simplify these networks by placing constraints on compartments and hubs, it is unclear how these simplifications alter the structure of metabolic networks and the interpretation of metabolomic experiments. Results: We curated and adapted the latest systemic model of human metabolism and developed customizable tools to define metabolic networks with and without compartmentalization in subcellular organelles and with or without inclusion of prolific metabolite hubs. Compartmentalization made networks larger, less dense, and more modular, whereas hubs made networks larger, more dense, and less modular. When present, these hubs also dominated shortest paths in the network, yet their exclusion exposed the subtler prominence of other metabolites that are typically more relevant to metabolomic experiments. We applied the non-compartmental network without metabolite hubs in a retrospective, exploratory analysis of metabolomic measurements from 5 studies on human tissues. Network clusters identified individual reactions that might experience differential regulation between experimental conditions, several of which were not apparent in the original publications. Conclusions: Exclusion of specific metabolite hubs exposes modularity in both compartmental and non-compartmental metabolic networks, improving detection of relevant clusters in omic measurements. Better computational detection of metabolic network clusters in large data sets has potential to identify differential regulation of individual genes, transcripts, and proteins.
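
The hub-exclusion effect on modularity can be reproduced in miniature with networkx: remove the highest-degree nodes from a hub-heavy graph and compare density and community structure before and after. The toy graph below stands in for the curated metabolic network, and the cutoff of 10 hubs is an arbitrary illustrative choice.

```python
import networkx as nx
from networkx.algorithms.community import (greedy_modularity_communities,
                                           modularity)

G = nx.barabasi_albert_graph(500, 3, seed=1)  # hub-heavy stand-in network

def summarize(g, label):
    comms = greedy_modularity_communities(g)
    print(f"{label}: density={nx.density(g):.4f}, "
          f"modularity={modularity(g, comms):.3f}, communities={len(comms)}")

summarize(G, "with hubs")
hubs = sorted(G.degree, key=lambda kv: kv[1], reverse=True)[:10]
G.remove_nodes_from([n for n, _ in hubs])
summarize(G, "hubs removed")
```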

