Quality metrics for diversified similarity searching: What they stand for?

Diversity-oriented searches retrieve objects not only similar to a reference element but also related to the different types of collections within the queried dataset. While such characterization is flexible enough to include methods originally from information retrieval, data clustering, and similarity searching under the same umbrella, diversity metrics are expected to be much less paradigm-biased in order to discriminate which approaches are more suitable and when they should be applied. Accordingly, we extend and implement a broad set of quality metrics from those distinct realms and experimentally discuss their trends and limitations. In particular, we evaluate the suitability of data clustering indexes and similarity-driven measures regarding their adherence to diversified similarity searching. Experiments on real-world datasets indicate that such measures are capable of distinguishing diversity methods from different paradigms, but they heavily favor the approaches of the same group – especially cluster indexes. As an alternative, we argue that diversity is better addressed by a set of measures rather than a single quality value. Therefore, we propose the Diversity Features Model (DFM), which combines the perspectives of the competing approaches into a multidimensional point whose features are calculated based on the distance distribution within both retrieved and queried datasets. Empirical evaluations showed that DFM compares different diversity searching approaches by considering multiple criteria, whereas overall winners can be found by ranking aggregation or visualized through parallel coordinates maps.


An empirical assessment of quality metrics for diversified similarity searching

Journal of Information and Data Management ◽

10.5753/jidm.2021.1917 ◽

2021 ◽

Vol 12 (3) ◽

Author(s):

Camila R. Lopes ◽

Lúcio F. D. Santos ◽

Daniel L. Jasbick ◽

Daniel De Oliveira ◽

Marcos Bedo

Keyword(s):

Data Clustering ◽

Similarity Search ◽

Quality Metrics ◽

Experimental Comparison ◽

Similarity Searching ◽

Parallel Coordinates ◽

Research Areas ◽

Open Issue ◽

Ranking Aggregation ◽

Multidimensional Representation

A diversified similarity search retrieves elements that are simultaneously similar to a query object and akin to the different collections within the explored data. While several methods in information retrieval, data clustering, and similarity searching have tackled the problem of adding diversity into result sets, the experimental comparison of their performances is still an open issue mainly because the quality metrics are “borrowed” from those different research areas, bringing their biases alongside. In this manuscript, we investigate a series of such metrics and experimentally discuss their trends and limitations. We conclude diversity is better addressed by a set of measures rather than a single quality index and introduce the concept of Diversity Features Model (DFM), which combines the viewpoints of biased metrics into a multidimensional representation. Experimental evaluations indicate (i) DFM enables comparing different result diversification algorithms by considering multiple criteria, and (ii) the most suitable searching methods for a particular dataset are spotted by combining DFM with ranking aggregation and parallel coordinates maps.
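The abstract does not say which ranking aggregation scheme is combined with DFM. As a minimal sketch, assuming a plain Borda count and hypothetical feature and method names (`feature_a`, `method_x`, and so on are placeholders, not DFM features), several per-criterion scores can be merged into one overall ranking like this:

```python
from typing import Dict, List


def borda_aggregate(scores: Dict[str, Dict[str, float]]) -> List[str]:
    """Merge per-criterion scores into one ranking with a Borda count.

    scores[criterion][method] is a value where higher means better.
    Methods tied within a criterion still receive distinct points here,
    a simplification of this sketch.
    """
    points: Dict[str, int] = {}
    for per_method in scores.values():
        # Order methods from worst to best for this criterion.
        ordered = sorted(per_method, key=lambda m: per_method[m])
        for position, method in enumerate(ordered):
            # The worst method earns 0 points, the best earns len - 1.
            points[method] = points.get(method, 0) + position
    # Highest total first; ties in the total are broken alphabetically.
    return sorted(points, key=lambda m: (-points[m], m))


# Hypothetical scores for three methods under three criteria.
example = {
    "feature_a": {"method_x": 0.9, "method_y": 0.5, "method_z": 0.1},
    "feature_b": {"method_x": 0.2, "method_y": 0.8, "method_z": 0.4},
    "feature_c": {"method_x": 0.6, "method_y": 0.7, "method_z": 0.3},
}
ranking = borda_aggregate(example)
print(ranking)
```

A Borda count uses only the order of methods within each criterion, so criteria on different scales can be combined without normalisation. Other aggregation rules would be equally valid choices.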


An improved ACS algorithm for data clustering

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v17.i3.pp1506-1515 ◽

2020 ◽

Vol 17 (3) ◽

pp. 1506

Author(s):

Ayad Mohammed Jabbar ◽

Ku Ruhana Ku-Mahamud ◽

Rafid Sagban

Keyword(s):

Data Clustering ◽

Foraging Behaviour ◽

Clustering Algorithms ◽

Data Mining Technique ◽

Mining Technique ◽

Algorithm Comparison ◽

Hidden Patterns ◽

Real World Datasets ◽

Acs Algorithm ◽

Modification Rate

Data clustering is a data mining technique that discovers hidden patterns by creating groups (clusters) of objects. Each object in every cluster exhibits sufficient similarity to its neighbourhood, whereas objects with insufficient similarity are found in other clusters. Data clustering techniques maximise intra-cluster similarity in each cluster and maximise inter-cluster dissimilarity amongst different clusters. Ant colony optimisation for clustering (ACOC) is a swarm algorithm inspired by the foraging behaviour of ants. This algorithm minimises deterministic imperfections in which clustering is considered an optimisation problem. However, ACOC suffers from high diversification in which the algorithm cannot search for best solutions in the local neighbourhood. To improve the ACOC, this study proposes a modified ACOC, called M-ACOC, which has a modification rate parameter that controls the convergence of the algorithm. A comparison with several common clustering algorithms on real-world datasets shows that the accuracy of the proposed algorithm surpasses that of the other algorithms.
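To make the two quantities a clustering objective trades off more concrete, the sketch below computes the mean intra-cluster and mean inter-cluster Euclidean distance for a labelled partition. It is a generic illustration and not part of ACOC or M-ACOC. The function name and the toy data are assumptions. A compact, well-separated partition shows small intra-cluster distances (high similarity) and large inter-cluster distances.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def mean_intra_inter(clusters: Dict[str, List[Point]]) -> Tuple[float, float]:
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    labelled = [(label, p) for label, pts in clusters.items() for p in pts]
    intra: List[float] = []
    inter: List[float] = []
    for (label_a, a), (label_b, b) in combinations(labelled, 2):
        distance = math.dist(a, b)  # Euclidean distance (Python 3.8+)
        # Same label: the pair lies inside one cluster; otherwise across clusters.
        (intra if label_a == label_b else inter).append(distance)
    mean = lambda values: sum(values) / len(values) if values else 0.0
    return mean(intra), mean(inter)


# Toy partition: two tight groups placed far apart.
toy = {
    "a": [(0.0, 0.0), (0.0, 1.0)],
    "b": [(10.0, 0.0), (10.0, 1.0)],
}
intra_mean, inter_mean = mean_intra_inter(toy)
print(intra_mean, inter_mean)
```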


A Trust-based Mixture of Gaussian Processes Model for Reliable Regression in Participatory Sensing

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/540 ◽

2017 ◽

Cited By ~ 2

Author(s):

Qikun Xiang ◽

Jie Zhang ◽

Ido Nevat ◽

Pengfei Zhang

Keyword(s):

Real World ◽

Gaussian Processes ◽

Spatial Regression ◽

Participatory Sensing ◽

Sensing Applications ◽

Different Types ◽

Inaccurate Estimation ◽

Real World Datasets ◽

Gp Model ◽

Mixture Of Gaussian

Data trustworthiness is a crucial issue in real-world participatory sensing applications. Without considering this issue, different types of worker misbehavior, especially the challenging collusion attacks, can result in biased and inaccurate estimation and decision making. We propose a novel trust-based mixture of Gaussian processes (GP) model for spatial regression to jointly detect such misbehavior and accurately estimate the spatial field. We develop a Markov chain Monte Carlo (MCMC)-based algorithm to efficiently perform Bayesian inference of the model. Experiments using two real-world datasets show the superior robustness of our model compared with existing approaches.


Microbiological description of self-overgrowing spoil heaps and sand quarries in Nordwest Russia

10.5194/egusphere-egu21-2086 ◽

2021 ◽

Author(s):

Aleksei Zverev ◽

Anastasiia Kimeklis ◽

Grigory Gladkov ◽

Arina Kichko ◽

Evgeny Andronov ◽

...

Keyword(s):

Beta Diversity ◽

Alpha Diversity ◽

Illumina Miseq ◽

Differential Abundance ◽

Alpha And Beta Diversity ◽

Technogenic Landscapes ◽

Different Types ◽

Undisturbed Soils ◽

Disturbed Soils ◽

Diversity Metrics

Self-overgrowing recovery is one of the important processes in the reclamation of disturbed soils. Different types of anthropogenic disturbance, together with the variety of soil types and their genesis, lead to different bacterial communities involved in reclamation processes. Here we describe regional self-overgrowing soils in two locations (Novgorod region, Northwest Russia). We analyse the top level of industrially disturbed soils after coal mining (spoil tips with extremely low pH, and overburden soil) and sand quarry dumps, along with local undisturbed soils. We perform 16S amplicon sequencing (V4 region) on Illumina MiSeq and routine chemical analysis (pH, C, N and others). We provide alpha- and beta-diversity analysis, followed by CCA and analysis of differential abundance of taxa. Sand quarry dumps and regional soils look similar at the phylum level and contain common soil phyla such as Proteobacteria, Actinobacteria and Verrucomicrobia. Their alpha-diversity metrics are also similar, despite differences in beta-diversity. Overburden soil and soil from spoil tips, by contrast, are very different even at the phylum level. The main phyla here are Actinobacteria, Chloroflexi and Nitrospirae. They also show extremely low alpha-diversity metrics. This work was supported by RSF 17-16-01030, «Dynamics of soil biota in chronoseries of post-technogenic landscapes: analysis of soil-ecological efficiency of ecosystem restoration processes».


A Novel Weighted Meta Graph Method for Classification in Heterogeneous Information Networks

Applied Sciences ◽

10.3390/app10051603 ◽

2020 ◽

Vol 10 (5) ◽

pp. 1603

Author(s):

Jinli Zhang ◽

Tong Li ◽

Zongli Jiang ◽

Xiaohua Hu ◽

Ali Jazayeri

Keyword(s):

Real World ◽

Structural Features ◽

Information Networks ◽

Heterogeneous Information ◽

Heterogeneous Information Networks ◽

Real World Applications ◽

Different Types ◽

Multiple Challenges ◽

Real World Datasets

There has been increasing interest in the analysis and mining of Heterogeneous Information Networks (HINs) and the classification of their components in recent years. However, there are multiple challenges associated with distinguishing different types of objects in HINs in real-world applications. In this paper, a novel framework is proposed for the weighted Meta graph-based Classification of Heterogeneous Information Networks (MCHIN) to address these challenges. The proposed framework has several appealing properties. In contrast to other proposed approaches, MCHIN can fully compute the weights of different meta graphs and mine the latent structural features of different nodes by using these weighted meta graphs. Moreover, MCHIN significantly enlarges the training sets by introducing the concept of Extension Meta Graphs in HINs. The extension meta graphs are used to augment the semantic relationship among the source objects. Finally, based on the ranking distribution of objects, MCHIN groups the objects into pre-specified classes. We verify the performance of MCHIN on three real-world datasets. As shown and discussed in the results section, the proposed framework outperforms the baseline algorithms.


Covid-19 News Clustering using MCMC-Based Learning of finite EMSD Mixture Models

The International FLAIRS Conference Proceedings ◽

10.32473/flairs.v34i1.128506 ◽

2021 ◽

Vol 34 (1) ◽

Author(s):

Xuanbo Su ◽

Nizar Bouguila ◽

Nuha Zamzami

Keyword(s):

Mixture Models ◽

Data Clustering ◽

Bayesian Learning ◽

Finite Mixture Models ◽

State Of The Art ◽

Generative Models ◽

Finite Mixture ◽

Model Parameters ◽

Different Types ◽

Statistical Approaches

With the growth of social media information on the Web, performing clustering on different types of data is a challenging task. Statistical approaches are widely used to tackle this task. Among the successful statistical approaches, finite mixture models have received a lot of attention thanks to their flexibility. There are already many finite mixture models to cope with this task, but the Exponential Multinomial Scaled Dirichlet Distributions (EMSD) has recently been shown to attain higher accuracy compared to other state-of-the-art generative models for count data clustering. Thus, in this paper, we present a Bayesian learning method based on Markov Chain Monte Carlo and the Metropolis-Hastings algorithm for learning this model's parameters. The proposed method is validated via extensive simulations and comparison with multinomial-based mixture models.
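For readers unfamiliar with the Metropolis-Hastings step mentioned above, here is a minimal random-walk sketch on a toy one-dimensional target. It is not the EMSD mixture model or the authors' sampler. The function name, the Gaussian proposal, the step size and the standard normal target are all assumptions chosen for illustration.

```python
import math
import random
from typing import Callable, List


def metropolis_hastings(
    log_target: Callable[[float], float],
    start: float,
    steps: int,
    step_size: float = 1.0,
    seed: int = 0,
) -> List[float]:
    """Random-walk Metropolis-Hastings for an unnormalised 1-D log density."""
    rng = random.Random(seed)
    x = start
    log_p = log_target(x)
    samples: List[float] = []
    for _ in range(steps):
        # Symmetric Gaussian proposal centred on the current state.
        proposal = x + rng.gauss(0.0, step_size)
        log_p_new = log_target(proposal)
        # With a symmetric proposal the acceptance probability is
        # min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            x, log_p = proposal, log_p_new
        samples.append(x)  # a rejected proposal repeats the current state
    return samples


def log_std_normal(x: float) -> float:
    # Toy target: standard normal log density, up to an additive constant.
    return -0.5 * x * x


chain = metropolis_hastings(log_std_normal, start=0.0, steps=20000)
kept = chain[2000:]  # discard an initial burn-in segment
mean = sum(kept) / len(kept)
variance = sum((v - mean) ** 2 for v in kept) / len(kept)
print(mean, variance)
```

In a Bayesian learning setting the scalar state would be replaced by the model parameters, and `log_target` by the log posterior, which is the log likelihood plus the log prior.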


Benchmarking Computational Integration Methods for Spatial Transcriptomics Data

10.1101/2021.08.27.457741 ◽

2021 ◽

Author(s):

Yijun Li ◽

Stefan Stanojevic ◽

Bing He ◽

Zheng Jing ◽

Qianhui Huang ◽

...

Keyword(s):

Clustering Analysis ◽

Data Clustering ◽

Expression Patterns ◽

Cell Types ◽

Marker Genes ◽

Sequencing Data ◽

Integration Methods ◽

Spatial Expression ◽

Different Types ◽

Transcriptomics Data

The increasing popularity of spatial transcriptomics has allowed researchers to analyze transcriptome data in its tissue sample’s spatial context. Various methods have been developed for detecting SV (spatially variable) genes, with distinct spatial expression patterns. However, the accuracy of using such SV genes in clustering cell types has not been thoroughly studied. On the other hand, in single cell resolution sequencing data, clustering analysis is usually done on highly variable (HV) genes. Here we investigate if integrating SV genes and HV genes from spatial transcriptomics data can improve clustering performance beyond using SV genes alone. We evaluated six methods that integrate different features measured from the same samples including MOFA+, scVI, Seurat v4, CIMLR, SNF, and the straightforward concatenation approach. We applied these methods on 19 real datasets from three different spatial transcriptomics technologies (merFISH, SeqFISH+, and Visium) as well as 20 simulated datasets of varying spatial expression conditions. Our evaluations show that the performances of these integration methods are largely dependent on spatial transcriptomics platforms. Despite the variations among the results, in general MOFA+ and simple concatenation have good performances across different types of spatial transcriptomics platforms. This work shows that integrating quantitative and spatial marker genes in the spatial transcriptomics data can improve clustering. It also provides practical guides on the choices of computational methods to accomplish this goal.


Side Information Fusion for Recommender Systems over Heterogeneous Information Network

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3441446 ◽

2021 ◽

Vol 15 (4) ◽

pp. 1-32

Author(s):

Huan Zhao ◽

Quanming Yao ◽

Yangqiu Song ◽

James T. Kwok ◽

Dik Lun Lee

Keyword(s):

Information Fusion ◽

Side Information ◽

Low Rank ◽

Information Network ◽

Heterogeneous Information Network ◽

Heterogeneous Information ◽

Different Types ◽

Latent Features ◽

Real World Datasets ◽

Types Of Information

Collaborative filtering (CF) has been one of the most important and popular recommendation methods, which aims at predicting users’ preferences (ratings) based on their past behaviors. Recently, various types of side information beyond the explicit ratings users give to items, such as social connections among users and metadata of items, have been introduced into CF and shown to be useful for improving recommendation performance. However, previous works process different types of information separately, thus failing to capture the correlations that might exist across them. To address this problem, in this work, we study the application of heterogeneous information network (HIN), which offers a unifying and flexible representation of different types of side information, to enhance CF-based recommendation methods. However, we face challenging issues in HIN-based recommendation, i.e., how to capture similarities of complex semantics between users and items in a HIN, and how to effectively fuse these similarities to improve final recommendation performance. To address these issues, we apply metagraph to similarity computation and solve the information fusion problem with a “matrix factorization (MF) + factorization machine (FM)” framework. For the MF part, we obtain the user-item similarity matrix from each metagraph and then apply low-rank matrix approximation to obtain latent features for both users and items. For the FM part, we apply FM with Group lasso (FMG) on the features obtained from the MF part to train the recommending model and, at the same time, identify the useful metagraphs. Besides FMG, a two-stage method, we further propose an end-to-end method, hierarchical attention fusing, to fuse metagraph-based similarities for the final recommendation. Experimental results on four large real-world datasets show that the two proposed frameworks significantly outperform existing state-of-the-art methods in terms of recommendation performance.


A study on quality metrics vs. human perception: Can visual measures help us to filter visualizations of interest?

it - Information Technology ◽

10.1515/itit-2014-1070 ◽

2015 ◽

Vol 57 (1) ◽

Cited By ~ 2

Author(s):

Dirk J. Lehmann ◽

Sebastian Hundt ◽

Holger Theisel

Keyword(s):

Visual Search ◽

Real Number ◽

Visual Analysis ◽

Human Perception ◽

Quality Metrics ◽

Visual Pattern ◽

High Dimensional ◽

Parallel Coordinates ◽

Filter Approach ◽

High Dimensional Datasets

The number of visualizations required for a complete view of data grows non-linearly with the number of data dimensions. Thus, relevant visualizations need to be filtered to guide the user during the visual search. A popular filter approach is the usage of quality metrics, which map a visual pattern to a real number. This way, visualizations that contain interesting patterns are automatically detected. Quality metrics are a useful tool in visual analysis if they resemble human perception. In this work we present a broad study to examine the relation between filtering relevant visualizations based on human perception versus quality metrics. For this, seven widely-used quality metrics were tested on five high-dimensional datasets, covering scatterplots, parallel coordinates, and radial visualizations. In total, 102 participants were available. The results of our studies show that quality metrics often work similarly to human perception. Interestingly, a subset of so-called Scagnostic measures does the best job.


An Adaptive Heterogeneous Online Learning Ensemble Classifier for Nonstationary Environments

Computational Intelligence and Neuroscience ◽

10.1155/2021/6669706 ◽

2021 ◽

Vol 2021 ◽

pp. 1-11

Author(s):

Tinofirei Museba ◽

Fulufhelo Nelwamondo ◽

Khmaies Ouahada

Keyword(s):

Concept Drift ◽

Predictive Performance ◽

Dynamic Environments ◽

Ensemble Classifiers ◽

Data Generation ◽

Ensemble Selection ◽

Different Types ◽

Real World Datasets ◽

The Impact ◽

Dynamic Ensemble Selection

In recent years, the prevalence of technological advances has led to an enormous and ever-increasing amount of data that are now commonly available in a streaming fashion. In such nonstationary environments, the underlying process generating the data stream is characterized by an intrinsic nonstationary or evolving or drifting phenomenon known as concept drift. Given the increasingly common applications whose data generation mechanisms are susceptible to change, the need for effective and efficient algorithms for learning from and adapting to evolving or drifting environments can hardly be overstated. In dynamic environments associated with concept drift, learning models are frequently updated to adapt to changes in the underlying probability distribution of the data. A lot of work in the area of learning in nonstationary environments focuses on updating the learning predictive model to optimize recovery from concept drift and convergence to new concepts by adjusting parameters and discarding poorly performing models while little effort has been dedicated to investigate what type of learning model is suitable at any given time for different types of concept drift. In this paper, we investigate the impact of heterogeneous online ensemble learning based on online model selection for predictive modeling in dynamic environments. We propose a novel heterogeneous ensemble approach based on online dynamic ensemble selection that accurately interchanges between different types of base models in an ensemble to enhance its predictive performance in nonstationary environments. The approach is known as Heterogeneous Dynamic Ensemble Selection based on Accuracy and Diversity (HDES-AD) and makes use of models generated by different base learners to increase diversity to circumvent problems associated with existing dynamic ensemble classifiers that may experience loss of diversity due to the exclusion of base learners generated by different base algorithms. 
The algorithm is evaluated on artificial and real-world datasets against well-known homogeneous online ensemble approaches such as DDD, AFWE, and OAUE. The results show that HDES-AD performed significantly better than these three homogeneous online ensemble approaches in nonstationary environments.
