Efficient management of transitive relationships in large data and knowledge bases

1989 ◽  
Vol 18 (2) ◽  
pp. 253-262 ◽  
Author(s):  
R. Agrawal ◽  
A. Borgida ◽  
H. V. Jagadish

2015 ◽  
Vol 32 (6) ◽  
pp. 918-925 ◽  
Author(s):  
Samuel Croset ◽  
Joachim Rupp ◽  
Martin Romacker

Abstract: Motivation: The increasing diversity of data available to the biomedical scientist holds promise for better understanding of diseases and discovery of new treatments for patients. In order to provide a complete picture of a biomedical question, data from many different origins need to be combined into a unified representation. During this data integration process, inevitable errors and ambiguities present in the initial sources compromise the quality of the resulting data warehouse and greatly diminish the scientific value of the content. Expensive and time-consuming manual curation is then required to improve the quality of the information. However, it becomes increasingly difficult to dedicate and optimize the resources for data integration projects as available repositories are growing both in size and in number every day. Results: We present a new generic methodology to identify problematic records that cause what we describe as ‘data hairball’ structures. The approach is graph-based and relies on two metrics traditionally used in the social sciences: graph density and betweenness centrality. We evaluate and discuss these measures and show their relevance for flexible, optimized and automated data curation and linkage. The methodology focuses on information coherence and correctness to improve the scientific meaningfulness of data integration endeavors, such as knowledge bases and large data warehouses. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
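
A minimal sketch of how the two graph metrics could be used to flag such records with networkx; the graph construction, function names and threshold values are illustrative assumptions, not the authors' implementation:

```python
# Illustrative sketch: flag potential "data hairball" structures in an
# integrated record graph using graph density and betweenness centrality.
# The graph construction and the thresholds below are assumptions for
# demonstration, not the authors' actual pipeline.
import networkx as nx

def find_hairball_candidates(edges, density_threshold=0.3, centrality_threshold=0.1):
    """edges: iterable of (record_a, record_b) cross-references between sources."""
    g = nx.Graph(edges)
    candidates = []
    # Dense connected components are suspicious: many records collapsed together.
    for component in nx.connected_components(g):
        sub = g.subgraph(component)
        if len(sub) > 2 and nx.density(sub) > density_threshold:
            candidates.append(("dense_component", sorted(sub.nodes())))
    # Records with high betweenness centrality act as hubs that glue
    # otherwise unrelated entities together (typical of ambiguous mappings).
    for node, score in nx.betweenness_centrality(g).items():
        if score > centrality_threshold:
            candidates.append(("high_betweenness", node))
    return candidates

if __name__ == "__main__":
    sample_edges = [("drugA", "targetX"), ("drugB", "targetX"),
                    ("drugC", "targetX"), ("targetX", "diseaseY"),
                    ("drugA", "drugB"), ("drugB", "drugC"), ("drugA", "drugC")]
    for kind, item in find_hairball_candidates(sample_edges):
        print(kind, item)
```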


Author(s):  
John A. Hunt

Spectrum-imaging is a useful technique for comparing different processing methods on very large data sets which are identical for each method. This paper is concerned with comparing methods of electron energy-loss spectroscopy (EELS) quantitative analysis on the Al-Li system. The spectrum-image analyzed here was obtained from an Al-10 at% Li foil aged to produce δ' precipitates that can span the foil thickness. Two 1024-channel EELS spectra offset in energy by 1 eV were recorded and stored at each pixel in the 80x80 spectrum-image (25 Mbytes). An energy range of 39-89 eV (20 channels/eV) is represented. During processing, the spectra are either subtracted to create an artifact-corrected difference spectrum, or the energy offset is numerically removed and the spectra are added to create a normal spectrum. The spectrum-images are processed into 2D floating-point images using methods and software described in [1].
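
A minimal NumPy sketch of the two processing paths for one pixel's pair of offset spectra; the array layout, names, shift direction and the 20-channel shift (from the stated 20 channels/eV) are assumptions for illustration, not the software of [1]:

```python
# Sketch of the two processing paths described above for a pair of EELS
# spectra recorded with a 1 eV energy offset.  Shapes and the 20-channel
# offset follow the numbers in the text; variable names are illustrative.
import numpy as np

CHANNELS_PER_EV = 20
OFFSET_CHANNELS = 1 * CHANNELS_PER_EV  # 1 eV offset between the two acquisitions

def difference_spectrum(spec_a, spec_b):
    """Artifact-corrected difference spectrum: subtract the offset pair directly."""
    return spec_a - spec_b

def summed_spectrum(spec_a, spec_b):
    """Normal spectrum: numerically remove the offset, then add the pair."""
    realigned = np.roll(spec_b, OFFSET_CHANNELS)
    realigned[:OFFSET_CHANNELS] = 0  # channels with no counterpart after the shift
    return spec_a + realigned

# Example on a single pixel of an 80x80 spectrum-image with 1024 channels.
rng = np.random.default_rng(0)
spectrum_image = rng.poisson(100, size=(80, 80, 2, 1024)).astype(float)
pixel_a, pixel_b = spectrum_image[0, 0, 0], spectrum_image[0, 0, 1]
diff = difference_spectrum(pixel_a, pixel_b)
total = summed_spectrum(pixel_a, pixel_b)
```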


Author(s):  
Thomas W. Shattuck ◽  
James R. Anderson ◽  
Neil W. Tindale ◽  
Peter R. Buseck

Individual particle analysis involves the study of tens of thousands of particles using automated scanning electron microscopy and elemental analysis by energy-dispersive x-ray emission spectroscopy (EDS). EDS produces large data sets that must be analyzed using multivariate statistical techniques. A complete study uses cluster analysis, discriminant analysis, and factor or principal components analysis (PCA). The three techniques are used in the study of particles sampled during the FeLine cruise to the mid-Pacific Ocean in the summer of 1990. The mid-Pacific aerosol provides information on long-range particle transport, iron deposition, sea salt ageing, and halogen chemistry.

Aerosol particle data sets suffer from a number of difficulties for pattern recognition using cluster analysis. There is a great disparity in the number of observations per cluster and in the range of the variables in each cluster. The variables are not normally distributed, they are subject to considerable experimental error, and many values are zero because of finite detection limits. Many of the clusters show considerable overlap because of natural variability, agglomeration, and chemical reactivity.
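
A minimal scikit-learn sketch of the three multivariate techniques applied to a standardized particles-by-elements matrix; the synthetic counts and the parameter choices are illustrative assumptions, not the study's actual workflow:

```python
# Illustrative sketch of the multivariate workflow mentioned above on a
# particles-by-elements EDS matrix: standardization, PCA, k-means
# clustering, and linear discriminant analysis.  The synthetic data and
# the choice of k are assumptions for demonstration only.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
counts = rng.poisson(5, size=(1000, 12)).astype(float)  # 1000 particles, 12 elements

# Scaling mitigates the large disparity in the ranges of the variables.
scaled = StandardScaler().fit_transform(counts)

# Principal components analysis for dimension reduction / factor interpretation.
scores = PCA(n_components=3).fit_transform(scaled)

# Cluster analysis to group particles into candidate particle types.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(scores)

# Discriminant analysis to check how well the clusters separate.
lda = LinearDiscriminantAnalysis().fit(scaled, labels)
print("resubstitution accuracy:", lda.score(scaled, labels))
```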


Author(s):  
Hakan Ancin

This paper presents methods for performing detailed quantitative automated three-dimensional (3-D) analysis of cell populations in thick tissue sections while preserving the relative 3-D locations of cells. Specifically, the method disambiguates overlapping clusters of cells, and accurately measures the volume, 3-D location, and shape parameters for each cell. Finally, the entire population of cells is analyzed to detect patterns and groupings with respect to various combinations of cell properties. All of the above is accomplished with zero subjective bias.

In this method, a laser-scanning confocal light microscope (LSCM) is used to collect optical sections through the entire thickness (100-500 μm) of fluorescently-labelled tissue slices. The acquired stack of optical slices is first subjected to axial deblurring using the expectation-maximization (EM) algorithm. The resulting isotropic 3-D image is segmented using a spatially adaptive Poisson-based image segmentation algorithm with region-dependent smoothing parameters. Extracting the voxels that were labelled as "foreground" into an active voxel data structure results in a large data reduction.
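
A minimal NumPy sketch of that final data-reduction step, keeping only the voxels labelled as foreground in a sparse "active voxel" array; the boolean mask here is a placeholder for the output of the Poisson-based segmentation, which is not shown:

```python
# Minimal sketch of the data-reduction step described above: after
# segmentation, only the voxels labelled "foreground" are carried forward
# as a sparse "active voxel" structure (coordinates plus intensity).
# The mask and intensity volume are stand-ins for the real segmentation output.
import numpy as np

def extract_active_voxels(volume, foreground_mask):
    """Return an (N, 4) array of z, y, x coordinates and intensities for foreground voxels."""
    coords = np.argwhere(foreground_mask)            # (N, 3) voxel coordinates
    values = volume[foreground_mask][:, np.newaxis]  # (N, 1) intensities, same ordering
    return np.hstack([coords, values])

rng = np.random.default_rng(2)
volume = rng.poisson(3, size=(64, 256, 256)).astype(float)
mask = volume > 8                                    # placeholder for real segmentation
active = extract_active_voxels(volume, mask)
print(f"kept {active.shape[0]} of {volume.size} voxels "
      f"({100 * active.shape[0] / volume.size:.1f}%)")
```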


1994 ◽  
Vol 33 (05) ◽  
pp. 454-463 ◽  
Author(s):  
A. M. van Ginneken ◽  
J. van der Lei ◽  
J. H. van Bemmel ◽  
P. W. Moorman

Abstract: Clinical narratives in patient records are usually recorded in free text, limiting the use of this information for research, quality assessment, and decision support. This study focuses on the capture of clinical narratives in a structured format by supporting physicians with structured data entry (SDE). We analyzed and made explicit which requirements SDE should meet to be acceptable to the physician on the one hand, and to generate unambiguous patient data on the other. Starting from these requirements, we found that in order to support SDE, the knowledge on which it is based needs to be made explicit: we refer to this knowledge as descriptional knowledge. We articulate the nature of this knowledge and propose a model in which it can be formally represented. The model allows the construction of specific knowledge bases, each representing the knowledge needed to support SDE within a circumscribed domain. Data entry is made possible through a general entry program, whose behavior is determined by a combination of user input and the content of the applicable domain knowledge base. We clarify how descriptional knowledge is represented, modeled, and used for data entry to achieve SDE that meets the proposed requirements.
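
A minimal sketch of the separation between a general entry program and a declarative domain knowledge base; the schema and the sample content are invented for illustration and do not reproduce the authors' model of descriptional knowledge:

```python
# Illustrative sketch of the separation described above: a generic entry
# routine whose behaviour is determined entirely by a declarative domain
# knowledge base.  The schema and the sample content are invented for
# demonstration and are not the authors' representation.
CARDIOLOGY_KB = {
    "chest pain": {
        "location": ["retrosternal", "left-sided", "diffuse"],
        "severity": ["mild", "moderate", "severe"],
    },
    "dyspnea": {
        "onset": ["at rest", "on exertion"],
    },
}

def structured_entry(knowledge_base, ask):
    """Drive data entry from the knowledge base; ask(prompt, options) supplies user input."""
    record = {}
    for finding, attributes in knowledge_base.items():
        if ask(f"Record finding '{finding}'?", ["yes", "no"]) != "yes":
            continue
        record[finding] = {
            attr: ask(f"{finding}: {attr}?", options)
            for attr, options in attributes.items()
        }
    return record

# Non-interactive usage example: always pick the first offered option.
print(structured_entry(CARDIOLOGY_KB, lambda prompt, options: options[0]))
```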


1998 ◽  
Vol 37 (04/05) ◽  
pp. 327-333 ◽  
Author(s):  
F. Buekens ◽  
G. De Moor ◽  
A. Waagmeester ◽  
W. Ceusters

Abstract: Natural language understanding systems have to exploit various kinds of knowledge in order to represent the meaning behind texts. Getting this knowledge in place is often such a huge enterprise that it is tempting to look for systems that can discover such knowledge automatically. We describe how the distinction between conceptual and linguistic semantics may assist in reaching this objective, provided that distinguishing between them is not done too rigorously. We present several examples to support this view and argue that in a multilingual environment, linguistic ontologies should be designed as interfaces between domain conceptualizations and linguistic knowledge bases.
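
A minimal sketch of a linguistic ontology acting as an interface between a domain conceptualization and language-specific lexical entries; the concept identifier and the lexicalizations are invented examples:

```python
# Minimal sketch of the interface role suggested above: a small linguistic
# ontology mapping domain concepts to language-specific lexical entries.
# The identifier and the lexicalizations below are invented examples.
DOMAIN_CONCEPTS = {"MYOCARDIAL_INFARCTION": "domain concept from the conceptualization"}

LINGUISTIC_ONTOLOGY = {
    # concept id -> language -> lexical entries in the linguistic knowledge base
    "MYOCARDIAL_INFARCTION": {
        "en": ["myocardial infarction", "heart attack"],
        "nl": ["myocardinfarct", "hartaanval"],
    },
}

def lexicalize(concept_id, language):
    """Return the surface forms an understanding or generation component may use."""
    return LINGUISTIC_ONTOLOGY.get(concept_id, {}).get(language, [])

print(lexicalize("MYOCARDIAL_INFARCTION", "nl"))
```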

