Systematic exploration of unsupervised methods for mapping behavior

2016 ◽  
Author(s):  
Jeremy G. Todd ◽  
Jamey S. Kain ◽  
Benjamin L. de Bivort

To fully understand the mechanisms giving rise to behavior, we need to be able to precisely measure it. When coupled with large behavioral data sets, unsupervised clustering methods offer the potential of unbiased mapping of behavioral spaces. However, unsupervised techniques to map behavioral spaces are in their infancy, and there have been few systematic considerations of all the methodological options. We compared the performance of seven distinct mapping methods in clustering a data set consisting of the x- and y-positions of the six legs of individual flies. Legs were automatically tracked using small pieces of fluorescent dye while the fly was tethered and walking on an air-suspended ball. We find that there is considerable variation in the performance of these mapping methods, and that better performance is attained when clustering is done in higher-dimensional spaces (which are otherwise less preferable because they are hard to visualize). High dimensionality means that some algorithms, including the non-parametric watershed cluster assignment algorithm, cannot be used. We developed an alternative watershed algorithm that can be used in high-dimensional spaces when the probability density estimate can be computed directly. With these tools in hand, we examined the behavioral space of fly leg postural dynamics and locomotion. We find a striking division of behavior into modes involving the fore legs and modes involving the hind legs, with few direct transitions between them. By computing behavioral clusters using the data from all flies simultaneously, we show that this division appears to be common to all flies. We also identify individual-to-individual differences in behavior and behavioral transitions. Lastly, we suggest a computational pipeline that can achieve satisfactory levels of performance without the taxing computational demands of a systematic combinatorial approach. Abbreviations: GMM: Gaussian mixture model; PCA: principal components analysis; SW: sparse watershed; t-SNE: t-distributed stochastic neighbor embedding
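
A minimal sketch of the kind of embed-then-cluster pipeline compared in this study (dimensionality reduction followed by probabilistic cluster assignment), assuming scikit-learn. The array shapes, component counts, and cluster number below are illustrative assumptions, not the authors' actual settings.

```python
# Embed leg-tracking features into a low-dimensional space, then cluster.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical tracking data: frames x 12 features (x, y for each of 6 legs).
leg_positions = rng.normal(size=(10_000, 12))

# Reduce to a handful of postural modes, then cluster in that space.
embedding = PCA(n_components=4).fit_transform(leg_positions)
gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
behavior_labels = gmm.fit_predict(embedding)

# Each frame now carries a behavioral-mode label; transitions between labels
# approximate the behavioral transition structure discussed in the abstract.
print(np.bincount(behavior_labels))
```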

2015 ◽  
Vol 17 (5) ◽  
pp. 719-732
Author(s):  
Dulakshi Santhusitha Kumari Karunasingha ◽  
Shie-Yui Liong

A simple clustering method is proposed for extracting representative subsets from lengthy data sets. The main purpose of the extracted subset is to build prediction models (in the form of approximating functional relationships) instead of using the entire large data set. Such smaller subsets of data are often required in the exploratory analysis stages of studies that involve resource-consuming investigations. A few recent studies have used a subtractive clustering method (SCM) for such data extraction, in the absence of clustering methods designed for function approximation. SCM, however, requires several parameters to be specified. This study proposes a clustering method that requires only a single parameter to be specified, yet is shown to be as effective as SCM. A method to find suitable values for the parameter is also proposed. Because it has only a single parameter, the proposed clustering method is shown to be orders of magnitude more efficient than SCM. The effectiveness of the proposed method is demonstrated on phase-space prediction of three univariate time series and prediction of two multivariate data sets. Some drawbacks of SCM when applied to data extraction are identified, and the proposed method is shown to address them.
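
A generic illustration of the task this abstract addresses: extracting a representative subset by clustering and keeping the point nearest each cluster centre. This is not the authors' single-parameter method (whose details are not given here); it only sketches, under assumed data and subset size, what clustering-based subset extraction looks like.

```python
# Extract a representative subset of a large data set via k-means clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
data = rng.normal(size=(5_000, 6))      # large data set (illustrative)
n_representatives = 200                 # size of the extracted subset

km = KMeans(n_clusters=n_representatives, n_init=10, random_state=1).fit(data)
# For each cluster, pick the actual data point closest to its centre.
dists = np.linalg.norm(data - km.cluster_centers_[km.labels_], axis=1)
subset_idx = np.array([
    np.where(km.labels_ == c)[0][np.argmin(dists[km.labels_ == c])]
    for c in range(n_representatives)
])
subset = data[subset_idx]               # used to build the prediction model
print(subset.shape)
```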


2021 ◽  
Vol 40 (1) ◽  
pp. 477-490
Author(s):  
Yanping Xu ◽  
Tingcong Ye ◽  
Xin Wang ◽  
Yuping Lai ◽  
Jian Qiu ◽  
...  

In the field of security, data labels are often unknown or too expensive to obtain, so clustering methods are used to detect the threat behavior contained in big data. The most widely used probabilistic clustering model is the Gaussian mixture model (GMM), which is flexible and powerful in applying prior knowledge to model the uncertainty of the data. Therefore, in this paper, we use a GMM to build the threat behavior detection model. Commonly, Expectation Maximization (EM) and Variational Inference (VI) are used to estimate the optimal parameters of a GMM. However, both EM and VI are quite sensitive to the initial values of the parameters. We therefore propose using Singular Value Decomposition (SVD) to initialize the parameters. First, SVD is used to factorize the data set matrix to obtain the singular value matrix and the singular vector matrices. Then, the number of GMM components is calculated from the first two singular values and the dimension of the data. Next, the other parameters of the GMM, such as the mixing coefficients, means, and covariances, are calculated based on the number of components. Finally, these initialization values are passed to EM and VI to estimate the optimal parameters of the GMM. The experimental results indicate that the proposed method performs well for initializing the parameters of GMM clustering when EM and VI are used for parameter estimation.
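
A sketch of the initialization idea described above: use an SVD of the data matrix to choose the number of GMM components and to seed the parameters before EM. The abstract does not spell out the exact rule mapping the first two singular values and the data dimension to the component count, so the heuristic and seeding scheme below are illustrative stand-ins.

```python
# SVD-based initialization of a Gaussian mixture model, refined by EM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = rng.normal(size=(2_000, 10))        # hypothetical security/behavior features

U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
d = X.shape[1]
k = max(2, int(round(d * s[1] / s[0])))  # assumed heuristic from s1, s2, and dimension

# Seed the means by projecting onto the leading singular direction and
# splitting the ordered projections into k groups; weights start uniform.
proj = X @ Vt[0]
order = np.argsort(proj)
means_init = np.array([X[chunk].mean(axis=0) for chunk in np.array_split(order, k)])

gmm = GaussianMixture(n_components=k, means_init=means_init,
                      weights_init=np.full(k, 1.0 / k), random_state=0)
labels = gmm.fit_predict(X)              # EM refines the seeded parameters
print(k, np.bincount(labels))
```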


2019 ◽  
Author(s):  
Srishti Mishra ◽  
Zohair Shafi ◽  
Santanu Pathak

Data-driven decision making is becoming an increasingly important aspect of successful business execution. More and more organizations are moving towards taking informed decisions based on the data they generate. Most of this data is temporal, that is, time series data. Analyzing time series data sets effectively, efficiently, and quickly is a challenge. The most interesting and valuable part of such analysis is generating insights into correlation and causation across multiple time series data sets. This paper looks at methods that can be used to analyze such data sets and gain useful insights from them, primarily in the form of correlation and causation analysis. It focuses on two methods, a two-sample test with dynamic time warping and hierarchical clustering, and looks at how the results returned from both can be used to gain a better understanding of the data. Moreover, the methods used are meant to work with any data set, regardless of the subject domain and idiosyncrasies of the data set; in short, a data-agnostic approach.
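
A minimal sketch of the second method mentioned above: pairwise dynamic time warping (DTW) distances between series, followed by hierarchical clustering. The synthetic series, the number of clusters, and the linkage choice are illustrative assumptions.

```python
# DTW distances between time series, then average-linkage hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) DTW with absolute-difference local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(3)
series = [np.cumsum(rng.normal(size=100)) for _ in range(12)]  # hypothetical series

# Symmetric pairwise DTW distance matrix, condensed for scipy's linkage.
n = len(series)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw_distance(series[i], series[j])

labels = fcluster(linkage(squareform(dist), method="average"), t=3, criterion="maxclust")
print(labels)  # series sharing a label behave similarly under time warping
```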


Author(s):  
Muhammad Shoaib ◽  
Saif Ur Rehman ◽  
Imran Siddiqui ◽  
Shafiqur Rehman ◽  
Shamim Khan ◽  
...  

In order to obtain a reliable estimate of the wind energy potential of a site, high-frequency wind speed and direction data recorded over an extended period of time are required. The Weibull distribution function is commonly used to approximate the recorded data distribution when estimating wind energy. In the present study, the Weibull function and a Gaussian mixture model (GMM) are compared as theoretical distribution functions. The data set used consists of hourly wind speeds and wind directions of 54 years' duration recorded at the Ijmuiden wind site located in the north of Holland. The entire hourly data set of 54 years is reduced to 12 sets of hourly averaged data corresponding to the 12 months. The authenticity of the data is assessed by computing descriptive statistics on the entire un-averaged data set and on the 12 monthly data sets. The descriptive statistics show that wind speeds are positively skewed and that most of the wind is observed to blow in the south-west direction. Cumulative distribution and probability density functions for all data sets are determined for both the Weibull function and the GMM. Wind power densities, monthly as well as for the entire set, are determined from both models using the probability density functions of the Weibull function and the GMM. To assess the goodness-of-fit of the fitted Weibull function and GMM, the coefficient of determination (R2) and the Kolmogorov-Smirnov (K-S) test are also computed. Although the R2 values for the Weibull function are much closer to 1 than those for the GMM, the overall performance of the GMM is superior to the Weibull function in terms of estimated wind power densities, which are in good agreement with the power densities estimated from the wind data for the same duration. Wind power densities for the entire wind data set are reported as 307 W/m2 and 403.96 W/m2 estimated using the GMM and the Weibull function, respectively.
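
A sketch of the comparison described above: fit a Weibull distribution and a GMM to wind-speed data and estimate the wind power density from each fitted density via P = 0.5 * rho * ∫ v³ f(v) dv. The synthetic wind speeds, the number of GMM components, and the air-density value are assumptions for illustration, not the Ijmuiden data.

```python
# Weibull vs GMM fits to wind speeds and the resulting wind power densities.
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
v = stats.weibull_min.rvs(2.2, scale=8.0, size=20_000, random_state=4)  # synthetic hourly speeds
rho = 1.225                                                             # air density, kg/m^3

# Weibull fit (location fixed at zero, as is usual for wind speeds).
shape, loc, scale = stats.weibull_min.fit(v, floc=0)
# GMM fit on the same data.
gmm = GaussianMixture(n_components=3, random_state=0).fit(v.reshape(-1, 1))

grid = np.linspace(0, v.max(), 2_000)
pdf_weibull = stats.weibull_min.pdf(grid, shape, loc=loc, scale=scale)
pdf_gmm = np.exp(gmm.score_samples(grid.reshape(-1, 1)))

power_density_weibull = 0.5 * rho * np.trapz(grid**3 * pdf_weibull, grid)
power_density_gmm = 0.5 * rho * np.trapz(grid**3 * pdf_gmm, grid)
print(power_density_weibull, power_density_gmm)   # W/m^2
```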


2020 ◽  
Vol 224 (1) ◽  
pp. 40-68 ◽  
Author(s):  
Thibaut Astic ◽  
Lindsey J Heagy ◽  
Douglas W Oldenburg

SUMMARY In a previous paper, we introduced a framework for carrying out petrophysically and geologically guided geophysical inversions. In that framework, petrophysical and geological information is modelled with a Gaussian mixture model (GMM). In the inversion, the GMM serves as a prior for the geophysical model. The formulation and applications were confined to problems in which a single physical property model was sought, and a single geophysical data set was available. In this paper, we extend that framework to jointly invert multiple geophysical data sets that depend on multiple physical properties. The petrophysical and geological information is used to couple geophysical surveys that, otherwise, rely on independent physics. This requires advancements in two areas. First, an extension from a univariate to a multivariate analysis of the petrophysical data, and their inclusion within the inverse problem, is necessary. Secondly, we address the practical issues of simultaneously inverting data from multiple surveys and finding a solution that acceptably reproduces each one, along with the petrophysical and geological information. To illustrate the efficacy of our approach and the advantages of carrying out multi-physics inversions coupled with petrophysical and geological information, we invert synthetic gravity and magnetic data associated with a kimberlite deposit. The kimberlite pipe contains two distinct facies embedded in a host rock. Inverting the data sets individually, even with petrophysical information, leads to a binary geological model: background or undetermined kimberlite. A multi-physics inversion, with petrophysical information, differentiates between the two main kimberlite facies of the pipe. Through this example, we also highlight the capabilities of our framework to work with interpretive geological assumptions when minimal quantitative information is available. In those cases, the dynamic updates of the GMM allow us to perform multi-physics inversions by learning a petrophysical model.
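
A small sketch of the multivariate ingredient discussed above: a GMM fitted jointly over two physical properties (e.g. density contrast and magnetic susceptibility) so that its log-probability can act as a petrophysical coupling term between the two inversions. The sample values, units, and facies structure below are illustrative assumptions, and this shows only the prior-evaluation step, not the authors' full inversion framework.

```python
# A multivariate GMM over petrophysical properties, used as a joint prior.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
# Hypothetical measurements: columns = [density contrast (g/cc),
# log10 magnetic susceptibility (SI)] for a host rock and two kimberlite facies.
samples = np.vstack([
    rng.normal([0.0, -5.0], [0.02, 0.2], size=(200, 2)),   # background
    rng.normal([-0.2, -3.0], [0.03, 0.2], size=(80, 2)),   # facies 1
    rng.normal([-0.1, -2.0], [0.03, 0.2], size=(60, 2)),   # facies 2
])
petro_prior = GaussianMixture(n_components=3, covariance_type="full",
                              random_state=0).fit(samples)

# During a joint inversion, each cell's current (density, susceptibility) pair
# could be scored against this prior; low log-probability cells are penalized.
candidate_cells = np.array([[0.0, -5.1], [-0.18, -2.9], [0.3, -1.0]])
print(petro_prior.score_samples(candidate_cells))
```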


Author(s):  
Hong Lu ◽  
Xiangyang Xue

With the amount of video data increasing rapidly, automatic methods are needed to deal with large-scale video data sets in various applications. In content-based video analysis, a common and fundamental preprocessing step for these applications is video segmentation. Based on the segmentation results, video has a hierarchical representation structure of frames, shots, and scenes, from the low level to the high level. Due to the huge number of video frames, it is not appropriate to represent video contents using frames. Within this structure, a shot is defined as an unbroken sequence of frames from one camera; however, the contents of shots are trivial and can hardly convey valuable semantic information. A scene, on the other hand, is a group of consecutive shots that focuses on an object or objects of interest, and a scene can represent a semantic unit for further processing such as story extraction, video summarization, etc. In this chapter, we survey methods for video scene segmentation. Specifically, there are two kinds of scenes. The first considers only the visual similarity of video shots, and clustering methods are used for scene clustering. The second considers both the visual similarity and the temporal constraints of video shots, i.e., shots with similar contents that do not lie too far apart in temporal order. We also present our proposed methods for scene clustering and scene segmentation using the Gaussian mixture model, graph theory, sequential change detection, and spectral methods.
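
A toy sketch of the second kind of scene grouping mentioned above: shots are merged into a scene when they are visually similar and close in time. The shot feature vectors (e.g. colour histograms), similarity threshold, and temporal window are all illustrative assumptions, not a specific method from the chapter.

```python
# Group shots into scenes using visual similarity plus a temporal constraint.
import numpy as np

def group_shots(shot_features, sim_threshold=0.85, window=3):
    """Assign scene ids by linking each shot to a recent, visually similar shot."""
    feats = np.asarray(shot_features, dtype=float)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)   # for cosine similarity
    scene_of = np.arange(len(feats))                        # start: one scene per shot
    for i in range(1, len(feats)):
        for j in range(max(0, i - window), i):              # temporal constraint
            if feats[i] @ feats[j] >= sim_threshold:        # visual similarity
                scene_of[i] = scene_of[j]
                break
    _, labels = np.unique(scene_of, return_inverse=True)    # relabel consecutively
    return labels

rng = np.random.default_rng(6)
shots = rng.random((10, 16))                                # hypothetical shot histograms
print(group_shots(shots))
```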


2013 ◽  
Vol 19 (5) ◽  
pp. 1281-1289 ◽  
Author(s):  
Jesse Ward ◽  
Rebecca Marvin ◽  
Thomas O'Halloran ◽  
Chris Jacobsen ◽  
Stefan Vogt

X-ray fluorescence (XRF) microscopy is an important tool for studying trace metals in biology, enabling simultaneous detection of multiple elements of interest and allowing quantification of metals in organelles without the need for subcellular fractionation. Currently, analysis of XRF images is often done using manually defined regions of interest (ROIs). However, since advances in synchrotron instrumentation have enabled the collection of very large data sets encompassing hundreds of cells, manual approaches are becoming increasingly impractical. We describe here the use of soft clustering to identify cell ROIs based on elemental contents, using data collected over a sample of the malaria parasite Plasmodium falciparum as a test case. Soft clustering was able to successfully classify regions in infected erythrocytes as “parasite,” “food vacuole,” “host,” or “background.” In contrast, hard clustering using the k-means algorithm was found to have difficulty in distinguishing cells from background. While initial tests showed convergence on two or three distinct solutions in 60% of the cells studied, subsequent modifications to the clustering routine improved results to yield 100% consistency in image segmentation. Data extracted using soft cluster ROIs were found to be as accurate as data extracted using manually defined ROIs, and analysis time was considerably improved.
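
A sketch contrasting soft and hard pixel clustering on elemental maps of the kind described above. Here a GMM's posterior probabilities stand in for the soft memberships; the specific soft-clustering algorithm used in the paper may differ, and the four-class structure and synthetic pixel data are assumptions.

```python
# Hard (k-means) vs soft (GMM posterior) clustering of per-pixel elemental contents.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
# Hypothetical per-pixel elemental contents (e.g. Fe, Zn, Cu, K) for one image.
pixels = rng.gamma(shape=2.0, scale=1.0, size=(50 * 50, 4))

hard_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(pixels)

gmm = GaussianMixture(n_components=4, random_state=0).fit(pixels)
soft_memberships = gmm.predict_proba(pixels)   # one probability per class per pixel

# A pixel can be kept in an ROI only when its membership is confident, which is
# what makes soft ROIs more robust against ambiguous background pixels.
confident = soft_memberships.max(axis=1) > 0.9
print(hard_labels[:10], soft_memberships[0], confident.mean())
```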


2013 ◽  
Vol 748 ◽  
pp. 590-594
Author(s):  
Li Liao ◽  
Yong Gang Lu ◽  
Xu Rong Chen

We propose a novel density estimation method that uses both the k-nearest-neighbor (KNN) graph and the potential field of the data points to capture the local and global data distribution information, respectively. Clustering is performed based on the computed density values: a forest of trees is built with each data point as a tree node, and clusters are formed according to the trees in the forest. The new clustering method is evaluated by comparing it with three popular clustering methods: K-means++, mean shift, and DBSCAN. Experiments on two synthetic data sets and one real data set show that our approach can effectively improve the clustering results.
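
A simplified illustration of the density-then-trees idea described above: estimate each point's density from its k-nearest-neighbour distances, then attach every point to its nearest higher-density neighbour so that the resulting trees become clusters. The potential-field (global) term of the paper is omitted, and the choice of k and the data are illustrative assumptions.

```python
# KNN-based density estimation followed by tree-building cluster assignment.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 0.3, (150, 2)), rng.normal(3, 0.3, (150, 2))])

k = 10
dists, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
density = 1.0 / (dists[:, 1:].mean(axis=1) + 1e-12)   # local KNN density estimate

# Parent = nearest neighbour with strictly higher density; density maxima stay roots.
parent = np.arange(len(X))
for i in range(len(X)):
    higher = [j for j in idx[i, 1:] if density[j] > density[i]]
    if higher:
        parent[i] = higher[0]

# Follow parents up to the root of each tree; each root labels one cluster.
labels = parent.copy()
for _ in range(len(X)):
    labels = parent[labels]
print(np.unique(labels).size, "clusters found")
```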


2013 ◽  
Vol 7 (1) ◽  
pp. 19-24
Author(s):  
Kevin Blighe

Elaborate downstream methods are required to analyze large microarray data sets. At times, when the end goal is to look for relationships between (or patterns within) different subgroups or even individual samples, large data sets must first be filtered using statistical thresholds in order to reduce their overall volume. For example, in anthropological microarray studies, such ‘dimension reduction’ techniques are essential to elucidate any links between polymorphisms and phenotypes for given populations. In such large data sets, a subset can first be taken to represent the larger data set, much as polling results taken during elections are used to infer the opinions of the population at large. However, what is the best and easiest method of capturing a subset of variation in a data set that can represent the overall portrait of variation? In this article, principal components analysis (PCA) is discussed in detail, including its history, the mathematics behind the process, and the ways in which it can be applied to modern large-scale biological data sets. New methods of analysis using PCA are also suggested, with tentative results outlined.
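
A minimal sketch of the dimension-reduction step discussed above: run PCA on a samples-by-probes expression matrix and inspect how much of the variation the leading components capture. The matrix size and the 80% variance threshold are illustrative assumptions.

```python
# PCA on a (samples x probes) expression matrix, with explained-variance check.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(9)
expression = rng.normal(size=(60, 5_000))       # 60 samples x 5,000 probes (hypothetical)

pca = PCA()
scores = pca.fit_transform(StandardScaler().fit_transform(expression))

# How many components are needed to represent, say, 80% of the variation?
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_kept = int(np.searchsorted(cum_var, 0.80) + 1)
print(n_kept, "components explain 80% of variance")
print(scores[:, :2].shape)                      # PC1/PC2 coordinates for plotting
```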


2019 ◽  
Vol 28 (06) ◽  
pp. 1960001
Author(s):  
Erdem Beğenilmiş ◽  
Susan Uskudarli

The use of social media to manipulate public opinion, via bots and hired individuals spreading (mis)information to unsuspecting users, reached alarming levels during the 2016 US elections and the Brexit deliberations in the UK. Fake interactions such as “liking” and “retweeting” are staged to foster trust in the posts of bots and individuals, which makes it difficult for users to detect the posts that are part of larger schemes. We propose an approach based on supervised learning to classify collections of tweets as “organized” when they exhibit premeditated intent and as “organic” otherwise. Features related to users and posting behavior are used to train the classifiers, using 851 data sets totaling more than 270 million tweets. Further classifiers are trained to assess the effectiveness of the selected features. The random forest algorithm consistently yielded the best results, with scores greater than 95% for both accuracy and f-measure. For comparison, unsupervised learning methods were used to cluster the same data sets; a Gaussian mixture model clustered the organized vs. organic data sets with 99% agreement with the labels. The success of using only behavioral features to detect organized behavior is encouraging.
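
A sketch of the supervised setup described above: per-collection behavioral features labelled “organized” or “organic” and fed to a random forest, with accuracy and f-measure reported. The feature names, synthetic data, and train/test split are illustrative assumptions, not the authors' exact feature set.

```python
# Random forest classification of tweet collections as organized vs organic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(10)
# Hypothetical per-collection features, e.g. mean account age, retweet ratio,
# posting-time entropy, hashtag concentration, follower/friend ratio.
X = rng.random((851, 5))
y = (X[:, 1] + 0.3 * rng.random(851) > 0.8).astype(int)   # 1 = organized, 0 = organic

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(accuracy_score(y_te, pred), f1_score(y_te, pred))
```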

