Clustering Based on Conditional Distributions in an Auxiliary Space

2002 ◽  
Vol 14 (1) ◽  
pp. 217-239 ◽  
Author(s):  
Janne Sinkkonen ◽  
Samuel Kaski

We study the problem of learning groups or categories that are local in the continuous primary space but homogeneous by the distributions of an associated auxiliary random variable over a discrete auxiliary space. Assuming that variation in the auxiliary space is meaningful, categories will emphasize similarly meaningful aspects of the primary space. From a data set consisting of pairs of primary and auxiliary items, the categories are learned by minimizing a Kullback-Leibler divergence-based distortion between (implicitly estimated) distributions of the auxiliary data, conditioned on the primary data. Still, the categories are defined in terms of the primary space. An online algorithm resembling the traditional Hebb-type competitive learning is introduced for learning the categories. Minimizing the distortion criterion turns out to be equivalent to maximizing the mutual information between the categories and the auxiliary data. In addition, connections to density estimation and to the distributional clustering paradigm are outlined. The method is demonstrated by clustering yeast gene expression data from DNA chips, with biological knowledge about the functional classes of the genes as the auxiliary data.
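The distortion criterion described above can be illustrated with a small sketch. Below is a minimal batch variant in Python, assuming the conditional distributions of the auxiliary variable given each primary item have already been estimated and are supplied as rows of `cond`; the function names, the fixed number of clusters, and the batch (rather than online, Hebb-type) update are my own simplifications, not the authors' algorithm.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) between discrete distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def distributional_clustering(cond, n_clusters=3, n_iter=50, seed=0):
    """Cluster items by minimizing the KL distortion between their conditional
    auxiliary distributions (rows of `cond`) and cluster prototype distributions."""
    rng = np.random.default_rng(seed)
    protos = cond[rng.choice(len(cond), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        # assignment step: each item goes to the prototype with minimal KL distortion
        labels = np.array([np.argmin([kl(p, q) for q in protos]) for p in cond])
        # update step: each prototype becomes the mean distribution of its members
        for j in range(n_clusters):
            if np.any(labels == j):
                protos[j] = cond[labels == j].mean(axis=0)
    return labels, protos

# toy example: 6 items whose auxiliary-class distributions fall into two groups
cond = np.array([[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
                 [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]])
print(distributional_clustering(cond, n_clusters=2)[0])
```

Minimizing this distortion over hard assignments is a discrete analogue of the criterion that, as the abstract notes, is equivalent to maximizing the mutual information between the categories and the auxiliary data.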

Author(s):  
Oren Fivel ◽  
Moshe Klein ◽  
Oded Maimon

In this paper we develop the foundation of a new theory for decision trees based on new modeling of phenomena with soft numbers. Soft numbers represent the theory of soft logic, which addresses the need to combine real processes and cognitive ones in the same framework. At the same time, soft logic develops a new concept of modeling and dealing with uncertainty: the uncertainty of time and space. It is a language that can speak in two reference frames and also suggests a way to combine them. In classical probability theory, for continuous random variables there is no distinction between probabilities involving strict and non-strict inequalities. Moreover, a probability involving equality collapses to zero, without distinguishing among the values against which we would like to compare the random variable. This work presents soft probability, obtained by incorporating soft numbers into probability theory. Soft numbers are a new set of numbers that are linear combinations of multiples of "ones" and multiples of "zeros". In this work, we develop a probability involving equality as a "soft zero" multiple of a probability density function (PDF). We also extend this notion of soft probability to the classical definitions of complements, unions, intersections and conditional probabilities, and to the expectation, variance and entropy of a continuous random variable conditioned on being in a union of disjoint intervals and a discrete set of numbers. This extension provides information about a continuous random variable lying within a discrete set of numbers, such that its probability does not collapse completely to zero. In developing the notion of soft entropy, we found potentially another soft axis, multiples of 0log(0), which motivates exploring the properties and applications of these new numbers. We extend the notion of soft entropy to the definitions of cross entropy and Kullback–Leibler divergence (KLD), and we find that a soft KLD is a soft number that does not contain a multiple of 0log(0). Based on the soft KLD, we define a soft mutual information, which can be used as a splitting criterion in decision trees for data sets of continuous random variables consisting of single samples and intervals.
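As one concrete reading of the "soft zero multiple of a PDF" construction, the following Python sketch represents a soft probability as a pair (ordinary probability, soft-zero coefficient) and evaluates it for a union of disjoint intervals plus a discrete set of points. The class name, the pair representation, and the use of a normal distribution are illustrative assumptions; the soft-number algebra developed in the paper is richer than this.

```python
from dataclasses import dataclass
from scipy.stats import norm

@dataclass
class SoftProb:
    """A soft probability real + soft*z0, where `real` is an ordinary
    probability and `soft` is the coefficient of the 'soft zero' axis."""
    real: float   # multiple of 'one'
    soft: float   # multiple of the 'soft zero'

def soft_probability(dist, intervals, points):
    """Soft probability of X lying in a union of disjoint intervals plus a
    discrete set of points, for a continuous distribution `dist`."""
    real = sum(dist.cdf(b) - dist.cdf(a) for a, b in intervals)  # ordinary mass
    soft = sum(dist.pdf(x) for x in points)                      # PDF at the points
    return SoftProb(real, soft)

# example: standard normal, X in [-1, 0] or [1, 2], or X equal to 0.5 exactly
print(soft_probability(norm(), [(-1, 0), (1, 2)], [0.5]))
```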


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Jagadish Sankaran ◽  
Harikrushnan Balasubramanian ◽  
Wai Hoh Tang ◽  
Xue Wen Ng ◽  
Adrian Röllin ◽  
...  

Super-resolution microscopy and single molecule fluorescence spectroscopy require mutually exclusive experimental strategies optimizing either temporal or spatial resolution. To achieve both, we implement a GPU-supported, camera-based measurement strategy that highly resolves spatial structures (~100 nm), temporal dynamics (~2 ms), and molecular brightness from the exact same data set. Simultaneous super-resolution of spatial and temporal details leads to an improved precision in estimating the diffusion coefficient of the actin binding polypeptide Lifeact and corrects structural artefacts. Multi-parametric analysis of epidermal growth factor receptor (EGFR) and Lifeact suggests that the domain partitioning of EGFR is primarily determined by EGFR-membrane interactions, possibly sub-resolution clustering and inter-EGFR interactions, but is largely independent of EGFR-actin interactions. These results demonstrate that pixel-wise cross-correlation of parameters obtained from different techniques on the same data set enables robust physicochemical parameter estimation and provides biological knowledge that cannot be obtained from sequential measurements.
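The pixel-wise cross-correlation step can be sketched as follows; the array names and the synthetic parameter maps are hypothetical stand-ins for maps (e.g. diffusion coefficient and molecular brightness) estimated from the same image stack, not the authors' analysis pipeline.

```python
import numpy as np

def pixelwise_correlation(map_a, map_b):
    """Pearson correlation between two parameter maps, computed over the
    pixels that are valid (finite) in both maps."""
    a, b = np.ravel(map_a), np.ravel(map_b)
    valid = np.isfinite(a) & np.isfinite(b)
    return np.corrcoef(a[valid], b[valid])[0, 1]

# hypothetical 64x64 parameter maps estimated from the same image stack
rng = np.random.default_rng(1)
diffusion = rng.lognormal(mean=0.0, sigma=0.3, size=(64, 64))
brightness = 0.5 * diffusion + rng.normal(scale=0.1, size=(64, 64))
print(pixelwise_correlation(diffusion, brightness))
```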


2018 ◽  
pp. 130-155
Author(s):  
Fozia Munir ◽  
Mirajul Haq ◽  
Syed Nisar Hussain Hamadani

Maximization of wellbeing is the central objective that conventional economics pursues. Given its central place, economists have developed well-structured models and tools to measure and investigate wellbeing. In the received literature on the subject, various factors affecting wellbeing have been investigated. However, wellbeing, which is viewed from different approaches and takes different forms, is not shaped equally by different types of factors. In this context, this study attempts to investigate how subjective wellbeing is affected by social capital. The basic hypothesis is that "individual wellbeing moves in parallel with social capital". The hypothesis is empirically tested using a primary data set of 848 individuals collected from Azad Jammu and Kashmir (Pakistan). The empirical estimates indicate that, keeping other factors constant, individuals who embody more social capital enjoy more wellbeing in their lives. JEL Classification: B24, I30, C43


Entropy ◽  
2018 ◽  
Vol 20 (8) ◽  
pp. 601 ◽  
Author(s):  
Paul Darscheid ◽  
Anneli Guthke ◽  
Uwe Ehret

When constructing discrete (binned) distributions from samples of a data set, there are applications where it is desirable to ensure that all bins of the sample distribution have nonzero probability: for example, if the sample distribution is part of a predictive model that must return a response for the entire codomain, or if we use Kullback–Leibler divergence to measure the (dis-)agreement of the sample distribution and the original distribution of the variable, which in the described case is inconveniently infinite. Several sample-based distribution estimators exist which assure nonzero bin probability, such as adding one counter to each zero-probability bin of the sample histogram, adding a small probability to the sample PDF, smoothing methods such as kernel density smoothing, or Bayesian approaches based on the Dirichlet and multinomial distributions. Here, we suggest and test an approach based on the Clopper–Pearson method, which makes use of the binomial distribution. Based on the sample distribution, confidence intervals for bin-occupation probability are calculated. The mean of each confidence interval is a strictly positive estimator of the true bin-occupation probability and is convergent with increasing sample size. For small samples, it converges towards a uniform distribution, i.e., the method effectively applies a maximum entropy approach. We apply this nonzero method and four alternative sample-based distribution estimators to a range of typical distributions (uniform, Dirac, normal, multimodal, and irregular) and measure the effect with Kullback–Leibler divergence. While the performance of each method strongly depends on the distribution type it is applied to, on average, and especially for small sample sizes, the nonzero method, the simple "add one counter" method, and the Bayesian Dirichlet-multinomial model show very similar behavior and perform best. We conclude that, when estimating distributions without an a priori idea of their shape, applying one of these methods is favorable.
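A minimal sketch of the described nonzero estimator, assuming a 95% confidence level and a final renormalization step (both are my assumptions; the paper may choose differently): the mean of each bin's Clopper–Pearson interval is strictly positive even for empty bins.

```python
import numpy as np
from scipy.stats import beta

def nonzero_bin_probabilities(counts, alpha=0.05):
    """Strictly positive bin-probability estimates from the means of
    Clopper-Pearson confidence intervals for each bin's occupation probability
    (confidence level and renormalization are assumptions)."""
    counts = np.asarray(counts, dtype=int)
    n = counts.sum()
    lower = np.where(counts > 0,
                     beta.ppf(alpha / 2, counts, n - counts + 1), 0.0)
    upper = np.where(counts < n,
                     beta.ppf(1 - alpha / 2, counts + 1, n - counts), 1.0)
    means = (lower + upper) / 2          # strictly positive, even for empty bins
    return means / means.sum()           # renormalize to a proper distribution

# small sample: the last bin is empty yet still receives nonzero probability
print(nonzero_bin_probabilities([5, 3, 2, 0]))
```

For very small samples each interval is wide and its mean approaches 0.5, so after renormalization the estimate approaches the uniform distribution, matching the maximum entropy behaviour described above.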


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Le Quoc Hoi ◽  
Hương Lan Trần

Purpose: This paper aims to examine credit composition and income inequality reduction in Vietnam. In particular, the authors focus on the distinction between policy and commercial credits and investigate whether these two types of credit had adverse effects on income inequality. The authors also examine whether the impact of policy credit on income inequality is conditioned by educational level and institutional quality.
Design/methodology/approach: The authors use a primary data set, which contains a panel of 60 provinces collected from the General Statistics Office of Vietnam from 2002 to 2016. The authors employ the generalized method of moments to address the endogeneity problem.
Findings: The authors show that while commercial credit increases income inequality, policy credit contributes to reducing income inequality in Vietnam. In addition, the authors provide evidence that institutional quality and educational level condition the impact of policy credit on income inequality. Based on the findings, the paper implies that it was not the size of private credit but its composition that mattered in reducing income inequality, due to the asymmetric effects of different types of credit.
Originality/value: This is the first study that examines the links between the two components of credit and income inequality, as well as the constraints on those links. The authors argue that analyzing the separate effects of commercial and policy credits is more important for explaining the role of credit in income inequality than the size of total credit.


2020 ◽  
Author(s):  
Alexander E. Zarebski ◽  
Louis du Plessis ◽  
Kris V. Parag ◽  
Oliver G. Pybus

Inferring the dynamics of pathogen transmission during an outbreak is an important problem in both infectious disease epidemiology and phylodynamics. In mathematical epidemiology, estimates are often informed by time series of infected cases, while in phylodynamics genetic sequences sampled through time are the primary data source. Each data type provides different, and potentially complementary, insights into transmission. However, inference methods are typically highly specialised and field-specific. Recent studies have recognised the benefits of combining data sources, which include improved estimates of the transmission rate and the number of infected individuals. However, the methods they employ are either computationally prohibitive or require intensive simulation, limiting their real-time utility. We present a novel birth-death phylogenetic model, called TimTam, which can be informed by both phylogenetic and epidemiological data. Moreover, we derive a tractable analytic approximation of the TimTam likelihood, the computational complexity of which is linear in the size of the data set. Using TimTam, we show how key parameters of transmission dynamics and the number of unreported infections can be estimated accurately using these heterogeneous data sources. The approximate likelihood facilitates inference on large data sets, an important consideration as such data become increasingly common due to improving sequencing capability.


Author(s):  
Ladislav Stejskal ◽  
Jana Pustinová ◽  
Jana Stávková

This article is devoted to the evaluation of the Czech population's income situation based on the survey conducted within the Statistics on Income and Living Conditions (SILC) project, carried out by the Czech Statistical Office in 2005. Selected introductory analyses are presented to illustrate the possibilities of using the primary data. The main aim of the paper is to present basic quantitative indicators of Czech households' income situation, first in general and then broken down by social group and region. A further aim is the identification and analysis of income inequality using an alternative methodological approach. The essential findings and income characteristics are introduced, including recomputation per physical and so-called standardized household member. Households endangered by an insufficient income level, meaning that household earnings cannot cover standard living costs, are identified according to a predefined threshold. This part is followed by a brief statistical analysis of the data set for this group of households and references to other studies currently being pursued. The conclusion outlines the range of analyses that could follow, or have already been carried out, building on the existing findings. One example is the evaluation of the income situation of seniors covered by the survey, since this segment is traditionally perceived as economically weak and largely dependent on the social system settings.


Filomat ◽  
2018 ◽  
Vol 32 (17) ◽  
pp. 5931-5947
Author(s):  
Hatami Mojtaba ◽  
Alamatsaz Hossein

In this paper, we propose a new transformation of circular random variables based on circular distribution functions, which we shall call the inverse distribution function (idf) transformation. We show that the Möbius transformation is a special case of our idf transformation. Very general results are provided for the properties of the proposed family of idf transformations, including their trigonometric moments, maximum entropy, random variate generation, finite mixture and modality properties. In particular, we focus our attention on the subfamily obtained when the idf transformation is based on the cardioid circular distribution function. Modality and shape properties are investigated for this subfamily. In addition, we obtain further statistical properties for the resulting distribution by applying the idf transformation to a random variable following a von Mises distribution. In fact, we introduce the Cardioid-von Mises (CvM) distribution and estimate its parameters by the maximum likelihood method. Finally, an application of the CvM family and its inferential methods is illustrated using a real data set containing times of gun crimes in Pittsburgh, Pennsylvania.
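The abstract does not spell out the idf construction, so the Python sketch below only illustrates one ingredient it mentions, random variate generation via a cardioid distribution function: the cardioid CDF is inverted numerically and, as an illustrative composition assumed here (not necessarily the paper's definition of the CvM family), von Mises angles rescaled to (0, 1) are pushed through that inverse.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import vonmises

def cardioid_cdf(theta, mu=0.0, rho=0.25):
    """CDF of the cardioid distribution on [0, 2*pi), with |rho| <= 1/2."""
    return (theta + 2 * rho * (np.sin(theta - mu) + np.sin(mu))) / (2 * np.pi)

def cardioid_inv_cdf(u, mu=0.0, rho=0.25):
    """Numerical inverse of the cardioid CDF via root finding on [0, 2*pi]."""
    return brentq(lambda t: cardioid_cdf(t, mu, rho) - u, 0.0, 2 * np.pi)

# illustrative idf-style transform: von Mises angles, rescaled to (0, 1),
# are mapped through the inverse cardioid distribution function
angles = vonmises.rvs(kappa=2.0, size=1000) % (2 * np.pi)
transformed = np.array([cardioid_inv_cdf(a / (2 * np.pi)) for a in angles])
```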


2018 ◽  
Author(s):  
Ruth Stoney ◽  
Jean-Mark Schwartz ◽  
David L Robertson ◽  
Goran Nenadic

Background: The consolidation of pathway databases, such as KEGG [1], Reactome [2] and ConsensusPathDB [3], has generated widespread biological interest; however, the issue of pathway redundancy impedes the use of these consolidated datasets. Attempts to reduce this redundancy have focused on visualizing pathway overlap or merging pathways, but the resulting pathways may be of heterogeneous sizes and cover multiple biological functions. Efforts have also been made to deal with redundancy in pathway data by consolidating enriched pathways into a number of clusters or concepts. We present an alternative approach, which generates pathway subsets capable of covering all of the genes present within either pathway databases or enrichment results, yielding substantial reductions in redundancy.
Results: We propose a method that uses set cover to reduce pathway redundancy without merging pathways. The proposed approach considers three objectives: removal of pathway redundancy, control of pathway size and coverage of the gene set. By applying set cover to the ConsensusPathDB dataset we were able to produce a reduced set of pathways representing 100% of the genes in the original data set with 74% less redundancy, or 95% of the genes with 88% less redundancy. We also developed an algorithm to simplify enrichment data and applied it to a set of enriched osteoarthritis pathways, revealing that, within the top ten pathways, five were redundant subsets of more enriched pathways. Applying set cover to the enrichment results removed these redundant pathways, allowing more informative pathways to take their place.
Conclusion: Our method provides an alternative approach for handling pathway redundancy, while ensuring that the pathways are of homogeneous size and that gene coverage is maximised. Pathways are not altered from their original form, allowing biological knowledge regarding the data set to be directly applicable. We demonstrate the ability of the algorithms to prioritise redundancy reduction, pathway size control or gene set coverage. The application of set cover to pathway enrichment results produces an optimised summary of the pathways that best represent the differentially regulated gene set.
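A greedy set-cover heuristic captures the core selection step described above; this sketch ignores the additional pathway-size and redundancy/coverage trade-off controls mentioned in the abstract, and the dictionary-of-gene-sets input format is an assumption.

```python
def greedy_set_cover(pathways, target_genes=None):
    """Greedy set cover: repeatedly pick the pathway covering the most
    not-yet-covered genes until the target gene set is covered (or no
    pathway adds genes). `pathways` maps pathway name -> set of genes."""
    if target_genes is None:
        target_genes = set().union(*pathways.values())
    uncovered, selected = set(target_genes), []
    while uncovered:
        name, genes = max(pathways.items(), key=lambda kv: len(kv[1] & uncovered))
        if not genes & uncovered:        # no remaining pathway adds coverage
            break
        selected.append(name)
        uncovered -= genes
    return selected

# toy example: the redundant subset pathway 'p3' is never selected
pathways = {"p1": {"g1", "g2", "g3"}, "p2": {"g3", "g4"}, "p3": {"g1", "g2"}}
print(greedy_set_cover(pathways))   # e.g. ['p1', 'p2']
```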

