On the Concepts of Identity and Similarity in the Context of Biomedical Record Linkage

Author(s):  
Murat Sariyar
Jürgen Holm

Record linkage refers to a range of methods for merging and consolidating data such that duplicates are detected and false links are avoided. For this task it is crucial to discern between the similarity and the identity of entities. This paper explores the implications of ontological concepts of identity for record linkage (RL) on biomedical data sets. In order to draw substantial conclusions, we use the differentiation between numerical identity, qualitative identity, and relational identity. We discuss the problems of using similarity measures for comparing record pairs and of using qualitative identity for ascertaining the real status of these pairs. We conclude that relational identity should be operationalized for RL.
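Below is a minimal sketch of the similarity-versus-identity distinction discussed above, using Python's standard difflib as a stand-in for string comparators such as Jaro-Winkler; the field names and the threshold are illustrative assumptions, not the authors' operationalization.

    from difflib import SequenceMatcher

    def field_sim(a: str, b: str) -> float:
        """Normalized string similarity in [0, 1] (stand-in for Jaro-Winkler)."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def pair_score(rec_a: dict, rec_b: dict, fields=("name", "birth_date", "city")) -> float:
        """Average field-wise similarity of a record pair (qualitative comparison)."""
        return sum(field_sim(str(rec_a[f]), str(rec_b[f])) for f in fields) / len(fields)

    a = {"name": "John Maier", "birth_date": "1970-01-02", "city": "Bern"}
    b = {"name": "Jon Mayer", "birth_date": "1970-01-02", "city": "Bern"}

    score = pair_score(a, b)
    # Similarity suggests, but does not establish, numerical identity:
    # near-identical twins yield high scores, and one person may appear
    # under dissimilar records.
    print(f"similarity = {score:.2f} -> {'candidate match' if score > 0.85 else 'non-match'}")

A high score only licenses a claim of qualitative similarity; establishing numerical identity needs further evidence, which is why the abstract argues for operationalizing relational identity (e.g., stable relations to other records).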

2014
Vol 2014
pp. 1-7
Author(s):  
Itziar Irigoien
Basilio Sierra
Concepción Arenas

In one-class classification (OCC), one of the classes, the target class, has to be distinguished from all other possible objects, which are considered nontargets. This situation arises in many biomedical problems, for example in diagnosis, image-based tumor recognition, or the analysis of electrocardiogram data. In this paper, an approach to OCC based on a typicality test is experimentally compared with reference state-of-the-art OCC techniques (Gaussian, mixture of Gaussians, naive Parzen, Parzen, and support vector data description) using biomedical data sets. We evaluate the procedures on twelve experimental data sets whose variables are not necessarily continuous. As there are few benchmark data sets for one-class classification, all data sets considered in the evaluation have multiple classes; each class in turn is treated as the target class, and the units in the other classes are treated as new units to be classified. The results of the comparison show the good performance of the typicality approach, which is also applicable to high-dimensional data. Notably, it can be used with any kind of data (continuous, discrete, or nominal), whereas applying the state-of-the-art approaches is not straightforward when nominal variables are present.
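The evaluation protocol described here (each class in turn as the target) is easy to reproduce; the following sketch uses scikit-learn's OneClassSVM as a stand-in for support vector data description, since the typicality test itself is not available there. The data set, kernel, and nu value are illustrative choices.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.metrics import roc_auc_score
    from sklearn.svm import OneClassSVM

    X, y = load_iris(return_X_y=True)
    for target in np.unique(y):
        # Fit on target-class units only; everything else counts as a nontarget.
        model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X[y == target])
        scores = model.decision_function(X)      # higher = more target-like
        labels = (y == target).astype(int)
        # (For brevity the target training units are included in the evaluation.)
        print(f"target class {target}: AUC = {roc_auc_score(labels, scores):.3f}")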


Genes
2020
Vol 11 (7)
pp. 717
Author(s):  
Garba Abdulrauf Sharifai
Zurinahni Zainol

Training a machine learning algorithm on an imbalanced data set is an inherently challenging task. It becomes even more demanding when samples are limited but the number of features is massive (high dimensionality). High-dimensional and imbalanced data sets pose severe challenges in many real-world applications, such as biomedical data analysis. Numerous researchers have investigated either the imbalanced-class problem or the high-dimensionality problem and proposed various methods. Nonetheless, few approaches reported in the literature address the intersection of high dimensionality and class imbalance, owing to their complicated interactions. Lately, feature selection has become a well-known technique for overcoming this problem by selecting discriminative features that represent both the minority and the majority class. This paper proposes a new method called Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA); rCBR-BGOA employs an ensemble of multiple filters coupled with the correlation-based redundancy method to select optimal feature subsets. A binary grasshopper optimization algorithm (BGOA) casts feature selection as an optimization problem and selects the best (near-optimal) combination of features from the majority and minority classes. The obtained results, supported by proper statistical analysis, indicate that rCBR-BGOA can improve the classification performance for high-dimensional and imbalanced data sets in terms of the G-mean and the area under the curve (AUC) performance metrics.
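A simplified sketch of the filtering stage may clarify the pipeline: an ensemble of filters ranks the features, the rankings are aggregated, and correlation-based redundancy prunes the survivors; the BGOA search over binary feature masks is only indicated. The aggregation by mean rank and the |r| > 0.9 cutoff are illustrative assumptions.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

    # Imbalanced, high-dimensional toy data standing in for a biomedical set.
    X, y = make_classification(n_samples=100, n_features=200, n_informative=10,
                               weights=[0.9, 0.1], random_state=0)
    Xp = X - X.min(axis=0)                      # chi2 requires non-negative input

    # Multi-filter ensemble: aggregate per-filter rankings by mean rank.
    scores = [chi2(Xp, y)[0], f_classif(X, y)[0],
              mutual_info_classif(X, y, random_state=0)]
    ranks = np.mean([np.argsort(np.argsort(-s)) for s in scores], axis=0)
    candidates = np.argsort(ranks)[:40]         # keep the top-ranked candidates

    # Correlation-based redundancy: drop features highly correlated with a kept one.
    kept = []
    for f in candidates:
        if all(abs(np.corrcoef(X[:, f], X[:, g])[0, 1]) < 0.9 for g in kept):
            kept.append(f)
    print("candidate subset for the BGOA search:", kept)
    # BGOA would then search 0/1 masks over `kept`, scoring each mask by a
    # classifier's G-mean/AUC on validation folds.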


Author(s):  
Fang Chu
Lipo Wang

Accurate diagnosis of cancers is of great importance for doctors to choose a proper treatment. Furthermore, it also plays a key role in the search for the pathology of cancers and in drug discovery. Recently, this problem has attracted great attention in the context of microarray technology. Here, we apply radial basis function (RBF) neural networks to this pattern recognition problem. Our experimental results on several well-known microarray data sets indicate that our method can achieve very high accuracy with a small number of genes.
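A minimal RBF-network sketch in this spirit follows: hidden units are Gaussian basis functions centered by k-means, followed by a linear output layer. The number of centers and the width heuristic are illustrative, and synthetic data stand in for gene-expression profiles after gene selection.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=120, n_features=20, random_state=0)
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

    # Hidden layer: Gaussian units centered on k-means prototypes.
    centers = KMeans(n_clusters=10, n_init=10, random_state=0).fit(Xtr).cluster_centers_
    width = np.mean([np.linalg.norm(c1 - c2) for c1 in centers for c2 in centers])

    def rbf_features(X):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        return np.exp(-(d / width) ** 2)        # activations of the hidden units

    # Linear output layer trained on the RBF activations.
    clf = LogisticRegression(max_iter=1000).fit(rbf_features(Xtr), ytr)
    print("test accuracy:", clf.score(rbf_features(Xte), yte))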


2019
Vol 21 (1)
pp. 79
Author(s):  
Jörn Lötsch
Alfred Ultsch

Advances in flow cytometry enable the acquisition of large and high-dimensional data sets per patient. Novel computational techniques allow the visualization of structures in these data and, finally, the identification of relevant subgroups. Correct data visualizations and projections from the high-dimensional space to the visualization plane require the correct representation of the structures in the data. This work shows that frequently used techniques are unreliable in this respect. One of the most important methods for data projection in this area is t-distributed stochastic neighbor embedding (t-SNE). We analyzed its performance on artificial and real biomedical data sets. t-SNE introduced a cluster structure for homogeneously distributed data that did not contain any subgroup structure. In other data sets, t-SNE occasionally suggested the wrong number of subgroups or projected data points belonging to different subgroups as if they belonged to the same subgroup. As an alternative approach, emergent self-organizing maps (ESOM) were used in combination with U-matrix methods. This approach correctly identified homogeneous data, while in data sets containing distance- or density-based subgroup structures the number of subgroups and the data point assignments were correctly displayed. The results highlight possible pitfalls in the use of a currently widely applied algorithmic technique for the detection of subgroups in high-dimensional cytometric data and suggest a robust alternative.
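The homogeneous-data check is straightforward to reproduce; the sketch below runs t-SNE on uniformly distributed points that contain no subgroup structure and compares a clustering index before and after the projection. ESOM/U-matrix methods are not part of scikit-learn, so only the t-SNE side is shown; the sample size and perplexity are illustrative.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.manifold import TSNE
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(500, 10))             # homogeneous: no real subgroups

    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
    for Z, name in [(X, "original space"), (emb, "t-SNE embedding")]:
        labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
        print(f"{name}: silhouette = {silhouette_score(Z, labels):.2f}")
    # A markedly higher silhouette in the embedding than in the original space
    # indicates cluster structure introduced by the projection itself.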


2019
Vol 11 (16)
pp. 1886
Author(s):  
Xinghui Zhao
Na Chen
Weifu Li
Shen
Peng

As input to Numerical Weather Prediction (NWP) models, Microwave Radiation Imager (MWRI) data have been widely distributed to the user community. With the development of remote sensing technology, improving the geolocation accuracy of MWRI data is required, and the first step is to estimate the geolocation error accurately. However, traditional methods such as the coastline inflection method (CIM) usually suffer from low accuracy and poor robustness to noise. To overcome these limitations, this paper proposes a novel ℓp iterative closest point coastline inflection method (ℓp-ICP CIM). It assumes that the fields of view (FOVs) across the coastline degenerate into a step function and employs an ℓp (0 ≤ p < 1) sparse regularization optimization model to solve for the coastline points. After estimating the coastline points, the ICP algorithm is employed to estimate the correspondence between the estimated coastline points and the real coastline. Finally, the geolocation error is defined as the distance between each estimated coastline point and the corresponding point on the true coastline. Experimental results on simulated and real data sets show the effectiveness of our method over CIM. The geolocation error estimated by ℓp-ICP CIM is accurate to within 0.1 pixel in more than 90% of cases. We also show that, after geolocation error correction, the distribution of brightness temperature near the coastline is more consistent with the real coastline and the average geolocation error is reduced by 63%.
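The step-function assumption can be illustrated with a one-dimensional toy example: brightness temperatures along a scan line crossing a coastline are modeled as a step, and the crossing index is chosen by best fit. Plain least squares stands in for the paper's ℓp (0 ≤ p < 1) sparse-regularized model, the ICP correspondence step is omitted, and all numbers are invented.

    import numpy as np

    rng = np.random.default_rng(0)
    n, true_edge = 50, 23
    tb = np.where(np.arange(n) < true_edge, 150.0, 270.0)   # ocean vs. land levels
    tb = tb + rng.normal(0, 5, n)                           # sensor noise

    def step_residual(k):
        """Sum of squared residuals of the best step with its jump at index k."""
        left, right = tb[:k], tb[k:]
        return ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()

    edge = min(range(2, n - 2), key=step_residual)
    print(f"estimated coastline index: {edge} (true: {true_edge})")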


2006
Vol 45 (02)
pp. 200-203
Author(s):  
L. Bobrowski

Summary. Objectives: To improve medical diagnosis support rules based on comparisons of diagnosed patients with similar cases (precedents) archived in a clinical database. The case-based reasoning (CBR) and nearest neighbors (K-NN) classifications, which operate on referencing (learning) data sets, belong to this scheme. Methods: Inducing a similarity measure through special linear transformations of the referencing sets, aimed at the best separation of these sets. Designing separable transformations can be based on dipolar models and on minimization of convex and piecewise linear (CPL) criterion functions in accordance with the basis exchange algorithm. Results: For some data sets, separable linear transformations decrease the error rate of the K-NN classification rule based on the Euclidean distance. Such results can be seen on the example of data sets taken from the Hepar system of diagnosis support. Conclusions: Medical diagnosis support based on the CBR or K-NN rules can be improved through separable transformations of the referencing sets.
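The overall idea, learning a linear transformation of the referencing set so that K-NN with the Euclidean distance separates the classes better, can be sketched with off-the-shelf tools; neighborhood components analysis is used below as a stand-in, whereas the paper instead minimizes CPL criterion functions via the basis exchange algorithm, and the data set is an arbitrary example.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    plain = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    transformed = make_pipeline(StandardScaler(),
                                NeighborhoodComponentsAnalysis(random_state=0),
                                KNeighborsClassifier(n_neighbors=5))
    for name, model in [("K-NN, Euclidean", plain),
                        ("K-NN after learned linear transform", transformed)]:
        acc = cross_val_score(model, X, y, cv=5).mean()
        print(f"{name}: accuracy = {acc:.3f}")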


2019
Vol 31 (04)
pp. 1950030
Author(s):  
Ayesha Sohail

Due to advancements in data collection and maintenance strategies, current clinical databases around the globe are rich in the sense that they contain detailed information not only about an individual's medical conditions but also about the environmental features associated with the individual. Classification within these data could provide new medical insights. Data mining technology has attracted researchers due to its effectiveness and efficacy in biomedical research. Due to the diverse structure of such data sets, only a few successful techniques and easy-to-use software tools are available in the literature. A Bayesian analysis provides a more intuitive statement of the probability that a hypothesis is true. The Bayesian approach uses all available information and can answer complex questions more accurately, because Bayesian methods incorporate prior information: no relevant information is excluded, as the prior represents all the available information apart from the data itself. Bayesian techniques are specifically suited to decision making, where uncertainty is the main hurdle; due to a lack of information about the relevant parameters, there is uncertainty about a given decision, and Bayesian methods measure these uncertainties using probability. In this study, selected techniques of biostatistical Bayesian inference (a probability-based inference approach for identifying uncertainty in databases) are discussed. To show the efficiency of a hybrid technique, its application to two distinct data sets is presented in a novel way.
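As a minimal illustration of the Bayesian updating described here, the sketch below combines a Beta prior over a disease prevalence with observed case counts to obtain a posterior and a credible interval; the prior parameters and counts are invented and are unrelated to the paper's hybrid technique.

    from scipy import stats

    prior_a, prior_b = 2, 8          # prior belief: prevalence around 20%
    cases, n = 14, 40                # observed data: 14 positives out of 40

    # Beta prior + binomial likelihood -> Beta posterior (conjugate update).
    posterior = stats.beta(prior_a + cases, prior_b + n - cases)
    lo, hi = posterior.ppf([0.025, 0.975])
    print(f"posterior mean prevalence: {posterior.mean():.3f}")
    print(f"95% credible interval: [{lo:.3f}, {hi:.3f}]")

The credible interval expresses the decision-relevant uncertainty directly: given the prior and the data, the prevalence lies in that range with probability 0.95.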

