Helping Novices Avoid the Hazards of Data: Leveraging Ontologies to Improve Model Generalization Automatically with Online Data Sources

AI Magazine ◽  
2016 ◽  
Vol 37 (2) ◽  
pp. 19-32 ◽  
Author(s):  
Sasin Janpuangtong ◽  
Dylan A. Shell

The infrastructure and tools necessary for large-scale data analytics, formerly the exclusive purview of experts, are increasingly available. Whereas a knowledgeable data-miner or domain expert can rightly be expected to exercise caution when required (for example, around fallacious conclusions supposedly supported by the data), the nonexpert may benefit from some judicious assistance. This article describes an end-to-end learning framework that allows a novice to create models from data easily by helping structure the model building process and capturing extended aspects of domain knowledge. By treating the whole modeling process interactively and exploiting high-level knowledge in the form of an ontology, the framework is able to aid the user in a number of ways, including helping to avoid pitfalls such as data dredging. Prudence must be exercised to avoid these hazards as certain conclusions may only be supported if, for example, there is extra knowledge which gives reason to trust a narrower set of hypotheses. This article adopts the solution of using higher-level knowledge to allow this sort of domain knowledge to be used automatically, selecting relevant input attributes, and thence constraining the hypothesis space. We describe how the framework automatically exploits structured knowledge in an ontology to identify relevant concepts, and how a data extraction component can make use of online data sources to find measurements of those concepts so that their relevance can be evaluated. To validate our approach, models of four different problem domains were built using our implementation of the framework. Prediction error on unseen examples of these models shows that our framework, making use of the ontology, helps to improve model generalization.
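The attribute-selection idea in this abstract can be sketched in a few lines. This is a toy illustration of the general technique, not the authors' implementation; the ontology, concept names, and hop limit are all hypothetical:

```python
# Minimal sketch: restrict the hypothesis space by keeping only input
# attributes whose concepts are linked to the target concept in an
# ontology, so spurious attributes never enter the model.

# Hypothetical ontology: concept -> set of directly related concepts.
ONTOLOGY = {
    "life_expectancy": {"health_spending", "gdp_per_capita"},
    "gdp_per_capita": {"education_index"},
    "health_spending": set(),
    "education_index": set(),
    "shoe_size": set(),  # unrelated concept, a data-dredging trap
}

def related_concepts(target, ontology, max_hops=2):
    """Concepts reachable from `target` within `max_hops` ontology links."""
    frontier, seen = {target}, {target}
    for _ in range(max_hops):
        frontier = {n for c in frontier for n in ontology.get(c, set())} - seen
        seen |= frontier
    return seen - {target}

def select_attributes(candidates, target, ontology):
    """Keep only candidate attributes the ontology deems relevant."""
    relevant = related_concepts(target, ontology)
    return [a for a in candidates if a in relevant]
```

Here `select_attributes(["shoe_size", "health_spending", "education_index"], "life_expectancy", ONTOLOGY)` would drop `shoe_size` before any model fitting, which is the sense in which the ontology constrains the hypothesis space.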

2015 ◽  
Vol 31 (3) ◽  
pp. 475-487 ◽  
Author(s):  
John R. Bryant ◽  
Patrick Graham

Abstract The article describes a Bayesian approach to deriving population estimates from multiple administrative data sources. Coverage rates play an important role in the approach: identifying anomalies in coverage rates is a key step in the model-building process, and data sources receive more weight within the model if their coverage rates are more consistent. Random variation in population processes and measurement processes is dealt with naturally within the model, and all outputs come with measures of uncertainty. The model is applied to the problem of estimating regional populations in New Zealand. The New Zealand example illustrates the continuing importance of coverage surveys.
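The weighting principle described here, that sources with more consistent coverage rates receive more weight, can be caricatured with a simple inverse-variance scheme. This is an illustrative sketch only, not the authors' Bayesian model; the data and the epsilon guard are our own assumptions:

```python
# Illustrative sketch: combine population counts from several
# administrative sources, down-weighting sources whose implied
# coverage rates vary more across regions.

from statistics import pvariance

def coverage_rates(counts, reference):
    """Coverage rate per region: recorded count / reference population."""
    return [c / r for c, r in zip(counts, reference)]

def precision_weights(sources, reference):
    """Weight each source by the inverse variance of its coverage rates.

    A small epsilon keeps a perfectly consistent source (zero variance)
    from causing a division by zero while still dominating the weights.
    """
    inv_var = [1.0 / (pvariance(coverage_rates(s, reference)) + 1e-9)
               for s in sources]
    total = sum(inv_var)
    return [w / total for w in inv_var]

def blended_estimate(sources, reference, region):
    """Weighted average of coverage-adjusted counts for one region."""
    weights = precision_weights(sources, reference)
    adjusted = [s[region] / (sum(coverage_rates(s, reference)) / len(reference))
                for s in sources]
    return sum(w * a for w, a in zip(weights, adjusted))
```

A source that records 95% of the reference population in every region gets nearly all the weight over one whose coverage swings between 75% and 120%, mirroring the consistency criterion in the abstract; the full model additionally propagates uncertainty, which this sketch does not.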


2021 ◽  
Vol 118 (20) ◽  
pp. e2024287118
Author(s):  
J. Masison ◽  
J. Beezley ◽  
Y. Mei ◽  
HAL Ribeiro ◽  
A. C. Knapp ◽  
...  

This paper presents a modular software design for the construction of computational modeling technology that will help implement precision medicine. In analogy to a common industrial strategy used for preventive maintenance of engineered products, medical digital twins are computational models of disease processes calibrated to individual patients using multiple heterogeneous data streams. They have the potential to help improve diagnosis, prognosis, and personalized treatment for a wide range of medical conditions. Their large-scale development relies on both mechanistic and data-driven techniques and requires the integration and ongoing update of multiple component models developed across many different laboratories. Distributed model building and integration requires an open-source modular software platform for the integration and simulation of models that is scalable and supports a decentralized, community-based model building process. This paper presents such a platform, including a case study in an animal model of a respiratory fungal infection.
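The modular-integration idea can be made concrete with a minimal component interface. This is a hedged sketch of the design pattern, not the paper's actual platform; the two toy models and all parameter values are invented for illustration:

```python
# Sketch: each lab contributes a component model behind a common
# interface, and a simple orchestrator advances all components in
# lockstep while exchanging their published outputs.

from abc import ABC, abstractmethod

class ComponentModel(ABC):
    """Interface every contributed component model implements."""
    @abstractmethod
    def outputs(self) -> dict: ...
    @abstractmethod
    def step(self, dt: float, inputs: dict) -> None: ...

class PathogenLoad(ComponentModel):
    """Toy fungal-load model: grows, but is cleared by immune effort."""
    def __init__(self, load=1.0):
        self.load = load
    def outputs(self):
        return {"load": self.load}
    def step(self, dt, inputs):
        clearance = inputs.get("immune_effort", 0.0)
        self.load = max(0.0, self.load + dt * (0.5 * self.load - clearance))

class ImmuneResponse(ComponentModel):
    """Toy immune model: effort ramps up with sensed pathogen load."""
    def __init__(self):
        self.effort = 0.0
    def outputs(self):
        return {"immune_effort": self.effort}
    def step(self, dt, inputs):
        self.effort += dt * 0.2 * inputs.get("load", 0.0)

def simulate(components, dt, steps):
    """Advance all components together, sharing outputs each step."""
    for _ in range(steps):
        shared = {}
        for c in components:
            shared.update(c.outputs())
        for c in components:
            c.step(dt, shared)
    merged = {}
    for c in components:
        merged.update(c.outputs())
    return merged
```

Because components interact only through the shared output dictionary, either model can be swapped for a lab's more detailed version without touching the other, which is the decentralized, community-based property the abstract emphasizes.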


2009 ◽  
pp. 596-614 ◽  
Author(s):  
I. Koffina ◽  
G. Serfiotis ◽  
V. Christophides ◽  
V. Tannen

Semantic Web (SW) technology aims to facilitate the integration of legacy data sources spread worldwide. Despite the plethora of SW languages (e.g., RDF/S, OWL) recently proposed for supporting large-scale information interoperation, the vast majority of legacy sources still rely on relational databases (RDB) published on the Web or corporate intranets as virtual XML. In this article, we advocate a first-order logic framework for mediating high-level queries to relational and/or XML sources using community ontologies expressed in a SW language such as RDF/S. We describe the architecture and reasoning services of our SW integration middleware, termed SWIM, and we present the main design choices and techniques for supporting powerful mappings between different data models, as well as reformulation and optimization of queries expressed against mediator ontologies and views.
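The core mediation step, rewriting a query over ontology vocabulary into a query over a legacy relational source, can be illustrated very simply. The mapping tables and names below are hypothetical and far simpler than SWIM's actual mapping and reformulation machinery:

```python
# Simplified illustration: reformulate a selection over an RDF/S-style
# ontology class into SQL against a legacy relational source, using
# declarative class-to-table and property-to-column mappings.

# Hypothetical mappings: ontology class -> table, property -> column.
CLASS_MAP = {"ex:Person": "persons"}
PROPERTY_MAP = {"ex:name": "full_name", "ex:age": "age"}

def reformulate(cls, properties, condition=None):
    """Rewrite a class + property selection into a SQL query string."""
    table = CLASS_MAP[cls]
    cols = ", ".join(PROPERTY_MAP[p] for p in properties)
    sql = f"SELECT {cols} FROM {table}"
    if condition:
        prop, op, value = condition
        sql += f" WHERE {PROPERTY_MAP[prop]} {op} {value!r}"
    return sql
```

For example, `reformulate("ex:Person", ["ex:name"], ("ex:age", ">", 30))` yields `SELECT full_name FROM persons WHERE age > 30`; a real mediator must additionally handle joins across mappings, XML sources, and query optimization.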


Author(s):  
Tomasz Gubala ◽  
Marian Bubak ◽  
Peter Sloot

Research environments for modern, cross-disciplinary scientific endeavors have to unite multiple users, with varying levels of expertise and roles, along with multitudes of data sources and processing units. The high level of required integration contrasts with the loosely-coupled nature of environments which are appropriate for research. The problem is to support integration of dynamic service-based infrastructures with data sources, tools and users in a way that conserves ubiquity, extensibility and usability. This chapter presents a close examination of related achievements in the field and a description of the proposed approach. It shows that integration of loosely-coupled system components with formally defined vocabularies may fulfill the listed requirements. The authors demonstrate that combining formal representations of domain knowledge with techniques like data integration, semantic annotations and shared vocabularies, enables the development of systems for modern e-Science. For demonstration they present how several semantically-augmented experiments are modeled in the ViroLab virtual laboratory for virology.


2010 ◽  
Vol 38 (5) ◽  
pp. 1197-1201 ◽  
Author(s):  
David A. Fell ◽  
Mark G. Poolman ◽  
Albert Gevorgyan

Reconstructing a model of the metabolic network of an organism from its annotated genome sequence would seem, at first sight, to be one of the most straightforward tasks in functional genomics, even if the various data sources required were never designed with this application in mind. The number of genome-scale metabolic models is, however, lagging far behind the number of sequenced genomes and is likely to continue to do so unless the model-building process can be accelerated. Two aspects that could usefully be improved are the ability to find the sources of error in a nascent model rapidly, and the generation of tenable hypotheses concerning solutions that would improve a model. We will illustrate these issues with approaches we have developed in the course of building metabolic models of Streptococcus agalactiae and Arabidopsis thaliana.
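One common way to "find the sources of error in a nascent model rapidly" is to flag dead-end metabolites. The check below is our own minimal illustration of that standard diagnostic, not the authors' software, and it ignores exchange reactions that would legitimately leave boundary metabolites one-sided:

```python
# Sketch: flag "dead-end" metabolites that are only ever produced or
# only ever consumed, a frequent symptom of missing or mis-annotated
# reactions in draft genome-scale metabolic reconstructions.

def dead_end_metabolites(reactions):
    """reactions: {name: (substrates, products)} with metabolite lists.

    Returns metabolites that appear on only one side of the network,
    i.e. are consumed but never produced, or produced but never consumed.
    """
    consumed, produced = set(), set()
    for substrates, products in reactions.values():
        consumed.update(substrates)
        produced.update(products)
    return (consumed | produced) - (consumed & produced)
```

Given a two-reaction toy network `glc -> g6p -> f6p`, the check reports `glc` and `f6p` as dead ends; each such flag is a tenable hypothesis that a transport, exchange, or downstream reaction is missing from the draft.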


Author(s):  
Georgi Derluguian

The author develops ideas about the origin of social inequality during the evolution of human societies and reflects on the possibilities of overcoming it. What makes human beings different from other primates is a high level of egalitarianism and altruism, which contributed to more successful adaptability of human collectives at early stages of the development of society. The transition to agriculture, coupled with substantially increasing population density, was marked by the emergence and institutionalisation of social inequality based on the inequality of tangible assets and symbolic wealth. Then, new institutions of warfare came into existence, and they were aimed at conquering and enslaving the neighbours engaged in productive labour. While exercising control over nature, people also established and strengthened their power over other people. Chiefdom as a new type of polity came into being. Elementary forms of power (political, economic and ideological) served as a basis for the formation of early states. The societies in those states were characterised by social inequality and cruelties, including slavery, mass violence and numerous victims. Nowadays, the old elementary forms of power that are inherent in personalistic chiefdom are still functioning along with modern institutions of public and private bureaucracy. This constitutes the key contradiction of our time, which is the juxtaposition of individual despotic power and public infrastructural power. However, society is evolving towards an ever more efficient combination of social initiatives with the sustainability and viability of large-scale organisations.


Genetics ◽  
2001 ◽  
Vol 159 (4) ◽  
pp. 1765-1778
Author(s):  
Gregory J Budziszewski ◽  
Sharon Potter Lewis ◽  
Lyn Wegrich Glover ◽  
Jennifer Reineke ◽  
Gary Jones ◽  
...  

Abstract We have undertaken a large-scale genetic screen to identify genes with a seedling-lethal mutant phenotype. From screening ~38,000 insertional mutant lines, we identified >500 seedling-lethal mutants, completed cosegregation analysis of the insertion and the lethal phenotype for >200 mutants, molecularly characterized 54 mutants, and provided a detailed description for 22 of them. Most of the seedling-lethal mutants seem to affect chloroplast function because they display altered pigmentation and affect genes encoding proteins predicted to have chloroplast localization. Although a high level of functional redundancy in Arabidopsis might be expected because 65% of genes are members of gene families, we found that 41% of the essential genes found in this study are members of Arabidopsis gene families. In addition, we isolated several interesting classes of mutants and genes. We found three mutants in the recently discovered nonmevalonate isoprenoid biosynthetic pathway and mutants disrupting genes similar to Tic40 and tatC, which are likely to be involved in chloroplast protein translocation. Finally, we directly compared T-DNA and Ac/Ds transposon mutagenesis methods in Arabidopsis on a genome scale. In each population, we found only about one-third of the insertion mutations cosegregated with a mutant phenotype.


Epidemiologia ◽  
2021 ◽  
Vol 2 (3) ◽  
pp. 315-324
Author(s):  
Juan M. Banda ◽  
Ramya Tekumalla ◽  
Guanyu Wang ◽  
Jingyuan Yu ◽  
Tuo Liu ◽  
...  

As the COVID-19 pandemic continues to spread worldwide, an unprecedented amount of open data is being generated for medical, genetics, and epidemiological research. The unparalleled rate at which many research groups around the world are releasing data and publications on the ongoing pandemic is allowing other scientists to learn from local experiences and data generated on the front lines of the COVID-19 pandemic. However, there is a need to integrate additional data sources that map and measure the role of social dynamics of such a unique worldwide event in biomedical, biological, and epidemiological analyses. For this purpose, we present a large-scale curated dataset of over 1.12 billion tweets, growing daily, related to COVID-19 chatter generated from 1 January 2020 to 27 June 2021 at the time of writing. This data source provides a freely available additional data source for researchers worldwide to conduct a wide and diverse number of research projects, such as epidemiological analyses, emotional and mental responses to social distancing measures, the identification of sources of misinformation, stratified measurement of sentiment towards the pandemic in near real time, among many others.
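A typical first step with a dataset like this is filtering the released identifiers before hydration. The sketch below is a hypothetical usage example; the column names (`tweet_id`, `date`, `lang`) are assumed and should be checked against the dataset's actual documentation:

```python
# Hypothetical usage sketch: filter a tab-separated file of tweet IDs
# down to one day's English-language chatter before hydrating the IDs
# through the Twitter API.

import csv
import io

def filter_tweet_ids(tsv_text, date, lang="en"):
    """Return tweet IDs matching a date (YYYY-MM-DD) and language."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["tweet_id"] for row in reader
            if row["date"] == date and row["lang"] == lang]
```

Working from the ID file first keeps the expensive hydration step proportional to the slice a study actually needs, e.g. one day's tweets for a sentiment time series.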


1979 ◽  
Vol 6 (2) ◽  
pp. 70-72
Author(s):  
T. A. Coffelt ◽  
F. S. Wright ◽  
J. L. Steele

Abstract A new method of harvesting and curing breeder's seed peanuts in Virginia was initiated that would 1) reduce the labor requirements, 2) maintain a high level of germination, 3) maintain varietal purity at 100%, and 4) reduce the risk of frost damage. Three possible harvesting and curing methods were studied. The traditional stack-pole method satisfied the latter 3 objectives, but not the first. The windrow-combine method satisfied the first 2 objectives, but not the last 2. The direct harvesting method satisfied all four objectives. The experimental equipment and curing procedures for direct harvesting had been developed but not tested on a large scale for seed harvesting. This method has been used in Virginia to produce breeder's seed of 3 peanut varieties (Florigiant, VA 72R and VA 61R) over five years. Compared to the stack-pole method, labor requirements have been reduced, satisfactory levels of germination and varietal purity have been obtained, and the risk of frost damage has been minimized.

