Similarity to a single set

2016 ◽  
Author(s):  
Lee Naish

Identifying patterns and associations in data is fundamental to discovery in science. This work investigates a very simple instance of the problem, where each data point consists of a vector of binary attributes, and attributes are treated equally. For example, each data point may correspond to a person and the attributes may be their sex, whether they smoke cigarettes, whether they have been diagnosed with lung cancer, etc. Measuring similarity of attributes in the data is equivalent to measuring similarity of sets: an attribute can be mapped to the set of data points which have the attribute. Furthermore, there is one identified base set (or attribute) and only similarity to that set is considered; the other sets are just ranked according to how similar they are to the base set. For example, if the base set is lung cancer sufferers, the set of smokers may well be high in the ranking. Identifying set similarity or correlation has many uses and is often the first step in determining causality. Set similarity is also the basis for comparing binary classifiers such as diagnostic tests for any data set. More than a hundred set similarity measures have been proposed in the literature, but there is very little understanding of how best to choose a similarity measure for a given domain. This work discusses numerous properties that similarity measures can have, weakening some previously proposed definitions so they are no longer incompatible, and identifying important forms of symmetry which have not previously been considered. It defines ordering relations over similarity measures and shows how some properties of a domain can be used to help choose a similarity measure which will perform well for that domain.
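
The ranking described in this abstract can be illustrated with a minimal sketch. The abstract does not single out any particular measure, so the Jaccard index is used here purely as one familiar example. The function names and the toy data are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple


def jaccard(base: Set[int], other: Set[int]) -> float:
    """Jaccard index |A & B| / |A | B|; one illustrative measure among many."""
    union = base | other
    if not union:
        return 1.0  # convention assumed here: two empty sets count as identical
    return len(base & other) / len(union)


def rank_by_similarity(
    base: Set[int],
    candidates: Dict[str, Set[int]],
    measure: Callable[[Set[int], Set[int]], float] = jaccard,
) -> List[Tuple[str, float]]:
    """Score every candidate set against the base set, most similar first."""
    scored = [(name, measure(base, members)) for name, members in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: each attribute maps to the IDs of the people who have it.
lung_cancer = {1, 2, 3, 4}
attributes = {
    "smoker": {1, 2, 3, 5},
    "male": {1, 4, 6, 7, 8},
    "left_handed": {7, 9},
}
ranking = rank_by_similarity(lung_cancer, attributes)
```

Passing a different function as `measure` changes only the scores and possibly the order, which is the kind of choice between measures that the abstract is concerned with.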


1998 ◽  
Vol 52 (4) ◽  
pp. 621-625 ◽  
Author(s):  
H. G. Schulze ◽  
L. S. Greek ◽  
C. J. Barbosa ◽  
M. W. Blades ◽  
R. F. B. Turner

We report on a method to reduce background noise and amplify signals in data sets with low signal-to-noise ratios (SNRs). This method consists of taking a data set with mean 0 and normalized with respect to absolute value, adding 1 to all values to adjust the mean to 1, and then applying a moving product (MP) to the transformed data set (similar to the application of a moving average or 0-order Savitzky–Golay filtering). A data point in the presence of a signal raises the probability of that data point having a value > 1, while the absence of a signal increases the probability of that data point having a value < 1. If the autocorrelation lag of the signal is larger than the autocorrelation lag of the associated noise, the use of an MP with a window comparable to the signal width (i.e., 2–3 times the signal standard deviation) will tend to reduce the values of data points where no signal is present and similarly amplify data points where signal is present. Signal amplification, often to a considerable degree, is gained at the cost of signal distortion. We have used this method on simulated data sets with SNRs of 1, 0.5, and 0.33, and obtained signal-to-background noise ratio (SBNR) enhancements in excess of 100 times. We have also applied this procedure to low-SNR measured Raman spectra, and we discuss our findings and their implications. This method is expected to be useful in the detection of weak signals buried in strong background noise.
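
The steps listed in this abstract can be sketched as follows. This is a rough illustration and not the authors' implementation. It assumes that "normalized with respect to absolute value" means dividing by the largest absolute value, and that the window slides one point at a time with no edge padding. The name `moving_product` is hypothetical.

```python
from math import prod
from typing import List, Sequence


def moving_product(data: Sequence[float], window: int) -> List[float]:
    """Shift data to mean 1, then multiply the values inside each sliding window."""
    if window < 1 or window > len(data):
        raise ValueError("window must be between 1 and len(data)")
    mean = sum(data) / len(data)
    centered = [x - mean for x in data]                 # mean 0
    peak = max(abs(x) for x in centered)
    # Assumed normalization: scale so that every value lies in [-1, 1].
    scaled = [x / peak for x in centered] if peak > 0 else centered
    shifted = [x + 1.0 for x in scaled]                 # mean 1, values in [0, 2]
    # One output per full window, so the result is shorter than the input.
    return [prod(shifted[i:i + window]) for i in range(len(shifted) - window + 1)]


# Hypothetical noiseless example: a flat baseline with a three-point signal.
example = [0, 0, 0, 0, 3, 3, 3, 0, 0, 0, 0]
enhanced = moving_product(example, window=3)
```

Windows that fall on the baseline multiply values below 1 and shrink toward 0, while a window that covers the signal multiplies values above 1 and grows, which is the amplification and suppression the abstract describes.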


2022 ◽  
pp. 1-23
Author(s):  
Zhenghang Cui ◽  
Issei Sato

Noisy pairwise comparison feedback has been incorporated to improve the overall query complexity of interactively learning binary classifiers. The positivity comparison oracle is extensively used to provide feedback on which data point in a pair is more likely to be positive. Because it is impossible to determine accurate labels using this oracle alone without knowing the classification threshold, existing methods still rely on the traditional explicit labeling oracle, which explicitly answers the label given a data point. The current method sorts all data points and uses the explicit labeling oracle to find the classification threshold. However, it has two drawbacks: (1) it needs unnecessary sorting for label inference and (2) it naively adapts quicksort to noisy feedback. In order to avoid these inefficiencies and acquire information about the classification threshold at the same time, we propose a new pairwise comparison oracle concerning uncertainties. This oracle answers which one has higher uncertainty given a pair of data points. We then propose an efficient adaptive labeling algorithm to take advantage of the proposed oracle. In addition, we address the situation where the labeling budget is insufficient compared to the data set size. Furthermore, we confirm the feasibility of the proposed oracle and the performance of the proposed algorithm theoretically and empirically.


2003 ◽  
Vol 99 (6) ◽  
pp. 1255-1262 ◽  
Author(s):  
Wei Lu ◽  
James G. Ramsay ◽  
James M. Bailey

Background Many pharmacologic studies record data as binary, yes-or-no, variables with analysis using logistic regression. In a previous study, it was shown that estimates of C50, the drug concentration associated with a 50% probability of drug effect, were unbiased, whereas estimates of gamma, the term describing the steepness of the concentration-effect relationship, were biased when sparse data were naively pooled for analysis. In this study, it was determined whether mixed-effects analysis improved the accuracy of parameter estimation.
Methods Pharmacodynamic studies with binary, yes-or-no, responses were simulated and analyzed with NONMEM. The bias and coefficient of variation of C50 and gamma estimates were determined as a function of the number of patients in the simulated study, the number of simulated data points per patient, and the "true" value of gamma. In addition, 100 sparse binary human data sets were generated from an evaluation of midazolam for postoperative sedation of adult patients undergoing cardiac surgery by random selection of a single data point (sedation score vs. midazolam plasma concentration) from each of the 30 patients in the study. C50 and gamma were estimated for each of these data sets by using NONMEM and were compared with the estimates from the complete data set of 656 observations.
Results Estimates of C50 were unbiased, even for sparse data (one data point per patient), with coefficients of variation of 30-50%. Estimates of gamma were highly biased for sparse data for all values of gamma greater than 1, and the value of gamma was overestimated. Unbiased estimation of gamma required 10 data points per patient. The coefficient of variation of gamma estimates was greater than that of the C50 estimates. Clinical data for sedation with midazolam confirmed the simulation results, showing an overestimate of gamma with sparse data.
Conclusion Although accurate estimations of C50 from sparse binary data are possible, estimates of gamma are biased. Data with 10 or more observations per patient are necessary for accurate estimations of gamma.
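
The roles of C50 and gamma can be made concrete with a small sketch. The abstract does not state the functional form of the model, so the sigmoid below is an assumption: it is the commonly used form in which the probability of effect is exactly 0.5 at C50 and gamma controls the steepness. The function name is hypothetical.

```python
def effect_probability(concentration: float, c50: float, gamma: float) -> float:
    """Assumed sigmoid model: P = C**gamma / (C50**gamma + C**gamma)."""
    if concentration < 0 or c50 <= 0 or gamma <= 0:
        raise ValueError("concentration must be >= 0; c50 and gamma must be > 0")
    # Work with the ratio C / C50 so the expression stays dimensionless.
    ratio = (concentration / c50) ** gamma
    return ratio / (1.0 + ratio)


# Illustrative values only, not taken from the study.
at_c50 = effect_probability(2.0, c50=2.0, gamma=5.0)    # 0.5 by construction
steep = effect_probability(3.0, c50=2.0, gamma=5.0)     # steeper curve above C50
shallow = effect_probability(3.0, c50=2.0, gamma=1.0)   # shallower curve above C50
```

Under this assumed form, an overestimated gamma makes the fitted curve steeper around C50 than the true curve while leaving the midpoint unchanged.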


Author(s):  
Bojan Furlan ◽  
Vladimir Sivački ◽  
Davor Jovanović ◽  
Boško Nikolić

This paper presents methods for measuring the semantic similarity of texts, evaluating different approaches based on existing similarity measures. On one side, word similarity was calculated by processing large text corpora; on the other, a commonsense knowledge base was used. Given that a large fraction of the information available today, on the Web and elsewhere, consists of short text snippets (e.g. abstracts of scientific documents, image captions or product descriptions), where commonsense knowledge has an important role, in this paper we focus on computing the similarity between two sentences or two short paragraphs by extending existing measures with information from the ConceptNet knowledge base. Extensive research has also been done in the field of corpus-based semantic similarity, so we evaluated existing solutions with some modifications. Through experiments performed on a paraphrase data set, we demonstrate that some of the proposed approaches can improve the semantic similarity measurement of short texts.


2004 ◽  
Vol 127 (2) ◽  
pp. 311-317 ◽  
Author(s):  
James Coburn ◽  
Joseph J. Crisco

Kinematic interpolation is an important tool in biomechanics. The purpose of this work is to describe a method for interpolating three-dimensional kinematic data, minimizing error while maintaining ease of calculation. This method uses cubic quaternion and Hermite interpolation to fill gaps between kinematic data points. Data sets with a small number of samples were extracted from a larger data set and used to validate the technique. Two additional types of interpolation were applied and then compared to the cubic quaternion interpolation. Displacement errors below 2% using the cubic quaternion method were achieved using 4% of the total samples, representing a decrease in error relative to the other algorithms.


2014 ◽  
Vol 2014 ◽  
pp. 1-10 ◽  
Author(s):  
Zhicai Liu ◽  
Keyun Qin ◽  
Zheng Pei

Soft set theory, proposed by Molodtsov, has been regarded as an effective mathematical tool to deal with uncertainties. Recently, uncertainty measures of soft sets and fuzzy soft sets have gained attention from researchers. This paper is devoted to the study of uncertainty measures of fuzzy soft sets. Axioms for similarity measures and entropy are proposed. A new category of similarity measures and entropies is presented based on fuzzy equivalence. Our approach is general in the sense that by using different fuzzy equivalences one gets different similarity measures and entropies. The relationships among these measures and other proposals in the literature are analyzed.


2021 ◽  
Author(s):  
Valerie Cross ◽  
Michael Zmuda

Current machine learning research is addressing the problem that occurs when the data set includes numerous features but the number of training examples is small. Microarray data, for example, typically has a very large number of features, the genes, as compared to the number of training examples, the patients. An important research problem is to develop techniques to effectively reduce the number of features by selecting the best set of features for use in a machine learning process, referred to as the feature selection problem. Another means of addressing high-dimensional data is the use of an ensemble of base classifiers. Ensembles have been shown to improve the predictive performance of a single model by training multiple models and combining their predictions. This paper examines combining an enhancement of the random subspace model of feature selection using fuzzy set similarity measures with different measures of evaluating feature subsets in the construction of an ensemble classifier. Experimental results show that in most cases a fuzzy set similarity measure paired with a feature subset evaluator outperforms the corresponding fuzzy similarity measure by itself, and the learning process typically needs to occur on only about half the number of base classifiers, since the feature subset evaluator eliminates feature subsets of low quality from use in the ensemble. In general, the fuzzy consistency index is the better performing feature subset evaluator, and inclusion maximum is the better performing fuzzy similarity measure.


Author(s):  
Ngoc Anh Nguyen

An analysis of a data set of observations for Vietnamese banks over the period 2011-2015 shows how the Capital Adequacy Ratio (CAR) is influenced by selected factors: asset size of the bank (SIZE), loans in total assets (LOA), leverage (LEV), net interest margin (NIM), loan loss reserve (LLR), and cash and precious metals in total assets (LIQ). The results indicate that NIM and LIQ have a significant effect on CAR. On the other hand, SIZE and LEV do not appear to have a significant effect on CAR. The variables NIM and LIQ have a positive effect on CAR, while LLR and LOA are negatively related to CAR.


2020 ◽  
Vol 11 (SPL3) ◽  
pp. 1861-1868
Author(s):  
Bianca Princeton ◽  
Abilasha R ◽  
Preetha S

Oral hygiene is defined as the practice of keeping the mouth clean and healthy by brushing and flossing to prevent gum diseases such as periodontitis or gingivitis. The main aim of oral hygiene is to prevent the buildup of plaque, which is defined as a sticky film of bacteria and food formed on the teeth. A coastal guard is an official employed to watch the sea near a coast for ships that are in danger or involved in illegal activities. Coastal guards have a high likelihood of being affected by mesothelioma or lung cancer due to asbestos exposure. A questionnaire consisting of 20 questions was therefore created and circulated through Google Forms among a hundred participants who were coastal guards. The responses were recorded and tabulated in the form of bar graphs. Out of a hundred participants, 52.4% were not aware that coastal guards have a high chance of developing lung cancer and mesothelioma. 53.7% were aware of oral manifestations of lung cancer other than bleeding gums. The majority of the coastal guards feel that they are given enough information about dental hygiene protocols. To conclude, oral hygiene habits have to be explained using various tools in the right manner to ensure better health of teeth and gums.

