MI3: Machine-initiated Intelligent Interaction for Interactive Classification and Data Reconstruction

2021 ◽  
Vol 11 (3-4) ◽  
pp. 1-34
Author(s):  
Yu Zhang ◽  
Bob Coecke ◽  
Min Chen

In many applications, while machine learning (ML) can be used to derive algorithmic models to aid decision processes, it is often difficult to learn a precise model when the number of similar data points is limited. One example of such applications is data reconstruction from historical visualizations, many of which encode precious data, but their numerical records are lost. On the one hand, there is not enough similar data for training an ML model. On the other hand, manual reconstruction of the data is both tedious and arduous. Hence, a desirable approach is to train an ML model dynamically using interactive classification, and hopefully, after some training, the model can complete the data reconstruction tasks with less human interference. For this approach to be effective, the number of annotated data objects used for training the ML model should be as small as possible, while the number of data objects to be reconstructed automatically should be as large as possible. In this article, we present a novel technique for the machine to initiate intelligent interactions to reduce the user’s interaction cost in interactive classification tasks. The technique of machine-initiated intelligent interaction (MI3) builds on a generic framework featuring active sampling and default labeling. To demonstrate the MI3 approach, we use the well-known cholera map visualization by John Snow as an example, as it features three instances of MI3 pipelines. The experiment has confirmed the merits of the MI3 approach.
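A minimal sketch of the kind of loop described above, combining active sampling (querying the human about the least confident object) with default labeling (auto-accepting confident predictions). The RandomForest model, confidence threshold, seeding scheme and the oracle_label callback are illustrative assumptions, not the authors' MI3 implementation:

```python
# Hypothetical active-sampling / default-labeling loop (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mi3_like_loop(X, oracle_label, budget=50, confidence=0.95, seed=0):
    """Iteratively query a human (oracle_label) for the most uncertain
    objects (active sampling); afterwards, auto-label objects the model
    is already confident about (default labeling)."""
    rng = np.random.default_rng(seed)
    # Seed the process with a few human-annotated objects.
    labeled = {int(i): oracle_label(int(i))
               for i in rng.choice(len(X), size=5, replace=False)}
    model = RandomForestClassifier(n_estimators=100, random_state=seed)

    for _ in range(budget):
        idx = np.array(sorted(labeled))
        model.fit(X[idx], [labeled[i] for i in idx])
        conf = model.predict_proba(X).max(axis=1)
        unlabeled = np.setdiff1d(np.arange(len(X)), idx)
        if unlabeled.size == 0:
            break
        # Active sampling: ask the human about the least confident object.
        query = int(unlabeled[np.argmin(conf[unlabeled])])
        labeled[query] = oracle_label(query)

    # Default labeling: accept sufficiently confident predictions as-is.
    idx = np.array(sorted(labeled))
    model.fit(X[idx], [labeled[i] for i in idx])
    proba = model.predict_proba(X)
    rest = np.setdiff1d(np.arange(len(X)), idx)
    auto = {int(i): model.classes_[proba[i].argmax()]
            for i in rest if proba[i].max() >= confidence}
    return labeled, auto
```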

Author(s):  
STEFANO MERLER ◽  
BRUNO CAPRILE ◽  
CESARE FURLANELLO

In this paper, we propose a regularization technique for AdaBoost. The method implements a bias-variance control strategy in order to avoid overfitting in classification tasks on noisy data. The method is based on a notion of easy and hard training patterns, as it emerges from an analysis of the dynamical evolution of the AdaBoost weights. The procedure consists of sorting the training data points by a hardness measure and progressively eliminating the hardest, stopping at an automatically selected threshold. The effectiveness of the method is tested and discussed on synthetic as well as real data.
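A minimal sketch of the underlying idea, assuming the mean sample weight across boosting rounds as the hardness measure and a fixed drop fraction in place of the paper's automatically selected threshold:

```python
# Illustrative sketch: use the evolution of AdaBoost sample weights as a
# "hardness" score and drop the hardest training points before retraining.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_hardness(X, y, n_rounds=50):
    """Run discrete AdaBoost with decision stumps and return the mean
    sample weight across rounds as a per-point hardness score."""
    y = np.asarray(y)
    n = len(X)
    w = np.full(n, 1.0 / n)
    weight_history = []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        err = np.clip(np.sum(w * miss), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        # Misclassified points gain weight, correctly classified ones lose it.
        w = w * np.where(miss, np.exp(alpha), np.exp(-alpha))
        w /= w.sum()
        weight_history.append(w.copy())
    return np.mean(weight_history, axis=0)

def drop_hardest(X, y, drop_fraction=0.1):
    """Remove the hardest drop_fraction of points; the caller can then
    retrain AdaBoost on the filtered training set."""
    hardness = adaboost_hardness(X, np.asarray(y))
    keep = np.argsort(hardness)[: int(len(X) * (1 - drop_fraction))]
    return X[keep], np.asarray(y)[keep]
```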


Author(s):  
Sharanjit Kaur

Knowledge discovery in databases (KDD) is a nontrivial process of detecting valid, novel, potentially useful and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy, 1996). In general, KDD tasks can be classified into four categories: i) dependency detection, ii) class identification, iii) class description and iv) outlier detection. The first three categories correspond to patterns that apply to many objects, while task (iv) focuses on a small fraction of data objects, often called outliers (Han & Kamber, 2006). Typically, outliers are data points that deviate from the majority of points in a dataset by more than the user expects. There are two types of outliers: i) data points/objects with abnormally large errors and ii) data points/objects with normal errors but at a far distance from their neighboring points (Maimon & Rokach, 2005). The former type may be the outcome of a malfunctioning data generator or of errors made while recording data, whereas the latter is due to genuine data variation reflecting an unexpected trend in the data. Outliers may be present in real-life datasets for several reasons, including errors in the capture, storage and communication of data. Since outliers often interfere with and obstruct the data mining process, they are considered a nuisance. Yet in several commercial and scientific applications, a small set of objects representing rare or unexpected events is often more interesting than the larger groups. Example applications in the commercial domain include credit-card fraud detection, criminal activities in e-commerce, pharmaceutical research, etc. In the scientific domain, unknown astronomical objects, unexpected values of vital parameters in patient analysis, etc. manifest as exceptions in the observed data. In applications like network intrusion detection and weather prediction, outliers must be reported immediately so that appropriate action can be taken, whereas in other applications like astronomy, further investigation of outliers may lead to the discovery of new celestial objects. Thus exception/outlier handling is an important task in KDD and often leads to a more meaningful discovery (Breunig, Kriegel, Raymond & Sander, 2000). In this article, different approaches for outlier detection in static datasets are presented.
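As a concrete illustration of the second type of outlier (points far from their neighbors), a minimal distance-based detector might look as follows; the k-NN distance score and the mean-plus-three-standard-deviations cutoff are hypothetical choices, not taken from the article:

```python
# Minimal distance-based outlier sketch (illustrative, not from the article):
# flag points whose distance to their k-th nearest neighbour is unusually large.
import numpy as np

def knn_distance_outliers(X, k=5, threshold=3.0):
    """Return indices of points whose k-NN distance exceeds
    mean + threshold * std of all k-NN distances."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)          # ignore self-distances
    knn_dist = np.sort(dist, axis=1)[:, k - 1]
    cutoff = knn_dist.mean() + threshold * knn_dist.std()
    return np.where(knn_dist > cutoff)[0]
```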


2016 ◽  
Vol 2 (2) ◽  
pp. 95-112 ◽  
Author(s):  
Paolo Gerbaudo

To advance the study of digital politics, it is urgent to complement data analytics with data hermeneutics, understood as a methodological approach that focuses on the interpretation of the deep structures of meaning in social media conversations as they develop around various political phenomena, from digital protest movements to online election campaigns. The diffusion of Big Data techniques in recent scholarship on political behavior has led to a quantitative bias in the understanding of online political phenomena and a disregard for issues of content and meaning. To solve this problem, it is necessary to adapt the hermeneutic approach to the conditions of social media communication and shift its object of analysis from texts to datasets. On the one hand, this involves identifying procedures to select samples of social media posts out of datasets, so that they can be analysed in more depth. I describe three sampling strategies - top sampling, random sampling and zoom-in sampling - to attain this goal. On the other hand, the "close reading" procedures used in hermeneutic analysis need to be adapted to the different quality of digital objects vis-à-vis traditional texts. This can be achieved by analysing posts not only as data points in a dataset, but also as interventions in a collective conversation and as utterances of broader "discourses". The task of interpreting social media data also requires an understanding of the political and social contexts in which digital political phenomena unfold, as well as taking into account the subjective viewpoints and motivations of those involved, which can be gained through in-depth interviews and other qualitative social science methods. Data hermeneutics thus holds promise for closing the gap between quantitative and qualitative approaches in the study of digital politics, allowing for a deeper and more holistic understanding of online political phenomena.
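A rough sketch of how the three sampling strategies could be operationalized on a hypothetical collection of posts (dicts with text, shares and timestamp fields, all assumed here purely for illustration):

```python
# Hedged sketch of top, random and zoom-in sampling over a hypothetical
# list of social media posts; field names and criteria are assumptions.
import random

def top_sampling(posts, n=50):
    """Most-shared posts: the most visible interventions in the conversation."""
    return sorted(posts, key=lambda p: p["shares"], reverse=True)[:n]

def random_sampling(posts, n=50, seed=0):
    """A cross-section of the dataset, regardless of visibility."""
    return random.Random(seed).sample(posts, min(n, len(posts)))

def zoom_in_sampling(posts, start, end, n=50, seed=0):
    """Posts from a narrower time window (e.g. around a key event),
    selected for closer reading."""
    window = [p for p in posts if start <= p["timestamp"] <= end]
    return random.Random(seed).sample(window, min(n, len(window)))
```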


Symmetry ◽  
2019 ◽  
Vol 11 (2) ◽  
pp. 285
Author(s):  
Jia-Ch’ng Loh ◽  
Swee-Huay Heng ◽  
Syh-Yuan Tan

An optimistic fair exchange protocol is designed for two parties to exchange items in a fair way, where an arbitrator always remains offline and is referred to only if a dispute happens. There are various optimistic fair exchange protocols with different security properties in the literature. Most optimistic fair exchange protocols satisfy resolution ambiguity, where a signature signed by the signer is computationally indistinguishable from one resolved by the arbitrator. Huang et al. proposed the first generic framework for an accountable optimistic fair exchange protocol in the random oracle model, which possesses resolution ambiguity and is able to reveal the actual signer when needed. Ganjavi et al. later proposed the first generic framework in the standard model. In this paper, we propose a new generic framework for an accountable optimistic fair exchange protocol in the standard model, using an ordinary signature, a convertible undeniable signature, and a ring signature scheme as the underlying building blocks. We also provide an instantiation of our proposed generic framework to obtain an efficient pairing-based accountable optimistic fair exchange protocol with short signatures.


Sensors ◽  
2020 ◽  
Vol 20 (6) ◽  
pp. 1652 ◽  
Author(s):  
Peida Wu ◽  
Ziguan Cui ◽  
Zongliang Gan ◽  
Feng Liu

In recent years, deep learning methods have been widely used in hyperspectral image (HSI) classification tasks. Among them, spectral-spatial combined methods based on three-dimensional (3-D) convolution have shown good performance. However, because of the 3-D convolutions, increasing the network depth results in a dramatic rise in the number of parameters. In addition, previous methods do not make full use of spectral information: they mostly feed the data directly into the network after dimensionality reduction, which results in poor classification ability for categories with small numbers of samples. To address these two issues, in this paper we design an end-to-end 3D-ResNeXt network that further adopts feature fusion and a label smoothing strategy. On the one hand, the residual connections and the split-transform-merge strategy alleviate the declining-accuracy phenomenon and decrease the number of parameters; we can adjust the hyperparameter cardinality, instead of the network depth, to extract more discriminative features of HSIs and improve the classification accuracy. On the other hand, in order to improve the classification accuracy of classes with small numbers of samples, we enrich the input of the 3D-ResNeXt spectral-spatial feature learning network with additional spectral feature learning, and finally use a loss function modified by a label smoothing strategy to address class imbalance. Experimental results on three popular HSI datasets demonstrate the superiority of the proposed network and an effective improvement in accuracy, especially for classes with small numbers of training samples.
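A generic label-smoothing cross-entropy, shown here in NumPy as an illustration of the strategy mentioned above; the smoothing factor is an assumption and this is not the authors' exact loss:

```python
# Generic label-smoothing cross-entropy sketch (NumPy); eps=0.1 is assumed.
import numpy as np

def label_smoothing_cross_entropy(logits, labels, num_classes, eps=0.1):
    """logits: (N, C) raw scores; labels: (N,) integer class ids."""
    # Smoothed targets: (1 - eps) on the true class, eps spread uniformly.
    targets = np.full((len(labels), num_classes), eps / num_classes)
    targets[np.arange(len(labels)), labels] += 1.0 - eps
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1).mean()
```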


2013 ◽  
Vol 791-793 ◽  
pp. 1289-1292
Author(s):  
Le Qiang Bai ◽  
Yan Yao Zhou ◽  
Shi Hong Zhang

To address the sensitivity of the K-Means algorithm to the selection of initial clustering centers, this paper proposes a method for choosing the initial points of K-Means. The algorithm examines the properties of the data objects, determines the density of each data object by counting the number of similar data objects, and selects the initial category centers according to these densities. With the number of clusters given, clustering results on UCI standard datasets and randomly generated datasets demonstrate that the proposed algorithm has good stability and accuracy.
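A hedged sketch of this style of density-based seeding: count neighbors within a radius as each point's density, pick dense, well-separated points as initial centers, then run standard K-Means. The radius heuristic and the separation rule are assumptions, not the paper's exact procedure:

```python
# Hedged sketch of density-based seeding for K-Means (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

def density_seeds(X, k, radius=None):
    """Pick k initial centers: high-density points far from chosen centers."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    if radius is None:
        radius = np.median(dist)          # crude default radius
    density = (dist < radius).sum(axis=1)  # neighbours within radius
    order = np.argsort(-density)           # densest first
    centers = [order[0]]
    for i in order[1:]:
        if len(centers) == k:
            break
        if dist[i, centers].min() > radius:  # keep centers well separated
            centers.append(i)
    # Fall back to the densest remaining points if too few separated ones.
    for i in order:
        if len(centers) == k:
            break
        if i not in centers:
            centers.append(i)
    return X[centers]

# Usage: KMeans(n_clusters=k, init=density_seeds(X, k), n_init=1).fit(X)
```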


Acta Numerica ◽  
2001 ◽  
Vol 10 ◽  
pp. 313-355 ◽  
Author(s):  
Markus Hegland

Methods for knowledge discovery in data bases (KDD) have been studied for more than a decade. New methods are required owing to the size and complexity of data collections in administration, business and science. They include procedures for data query and extraction, for data cleaning, data analysis, and methods of knowledge representation. The part of KDD dealing with the analysis of the data has been termed data mining. Common data mining tasks include the induction of association rules, the discovery of functional relationships (classification and regression) and the exploration of groups of similar data objects in clustering. This review provides a discussion of and pointers to efficient algorithms for the common data mining tasks in a mathematical framework. Because of the size and complexity of the data sets, efficient algorithms and often crude approximations play an important role.


Author(s):  
Shapol M. Mohammed ◽  
Karwan Jacksi ◽  
Subhi R. M. Zeebaree

Semantic similarity is the process of identifying relevant data semantically. The traditional way of identifying document similarity uses synonymous keywords and syntactic matching, whereas semantic similarity finds similar data using the meaning of words and their semantics. Clustering is the concept of grouping objects that have the same features and properties into a cluster, separate from objects that have different features and properties. In semantic document clustering, documents are clustered using semantic similarity techniques with similarity measurements. One of the common families of techniques for clustering documents is the density-based clustering algorithms, which use the density of data points as the main strategy to measure the similarity between them. In this paper, a state-of-the-art survey is presented to analyze density-based algorithms for clustering documents. Furthermore, the similarity and evaluation measures used with the selected algorithms are investigated to identify the most common ones. The review reveals that the most used density-based algorithms in document clustering are DBSCAN and DPC, and that the most effective similarity measure used with them is cosine similarity, with the F-measure used for performance and accuracy evaluation.
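A minimal sketch of density-based document clustering with cosine distance using scikit-learn, with TF-IDF vectors standing in for semantic embeddings; the eps and min_samples values are illustrative assumptions:

```python
# Density-based document clustering with cosine distance (illustrative).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

docs = [
    "machine learning for text clustering",
    "clustering documents with density based methods",
    "the weather was sunny in the mountains",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()
labels = DBSCAN(eps=0.9, min_samples=2, metric="cosine").fit_predict(X)
print(labels)  # -1 marks noise/outlier documents
```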


Nowadays, clustering plays a vital role in big data, where it is very difficult to analyze and cluster large volumes of data. Clustering is a procedure for grouping similar data objects of a data set, aiming for high intra-cluster similarity within a cluster and low inter-cluster similarity between clusters. Clustering is used in statistical analysis, geographical maps, biological cell analysis and Google Maps. The main approaches to clustering are grid-based clustering, density-based clustering, hierarchical methods and partitioning approaches. In this survey paper we focus on these algorithms for large datasets such as big data and report a comparison among them, using time complexity as the main metric to differentiate the algorithms.


Sensors ◽  
2018 ◽  
Vol 18 (12) ◽  
pp. 4129 ◽  
Author(s):  
Hassan Mushtaq ◽  
Sajid Gul Khawaja ◽  
Muhammad Usman Akram ◽  
Amanullah Yasin ◽  
Muhammad Muzammal ◽  
...  

Clustering is the most common method for organizing unlabeled data into its natural groups (called clusters), based on similarity (in some sense or another) among the data objects. The Partitioning Around Medoids (PAM) algorithm belongs to the partitioning-based clustering methods widely used for object categorization, image analysis, bioinformatics and data compression, but due to its high time complexity, the PAM algorithm cannot be used with large datasets or in embedded or real-time applications. In this work, we propose a simple and scalable parallel architecture for the PAM algorithm to reduce its running time. This architecture can easily be implemented either on a multi-core processor system to deal with big data or on a reconfigurable hardware platform, such as FPGAs and MPSoCs, which makes it suitable for real-time clustering applications. Our proposed model partitions the data equally among multiple processing cores. Each core executes the same sequence of tasks simultaneously on its respective data subset and shares intermediate results with the other cores to produce the final result. Experiments show that the computational complexity of the PAM algorithm is reduced exponentially as we increase the number of cores working in parallel. It is also observed that the speedup graph of our proposed model becomes more linear with an increase in the number of data points and as the clusters become more uniform. The results also demonstrate that the proposed architecture produces the same results as the original PAM algorithm, but with reduced computational complexity.
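A simplified sketch of the data-parallel pattern described above: points are split across worker processes, each assigns its chunk to the nearest medoid and returns a partial cost, and the main process combines the results. Only the assignment/cost step is shown; the medoid-swap search and the FPGA/MPSoC mapping are omitted, and this is not the authors' architecture:

```python
# Data-parallel PAM assignment step (illustrative sketch, not the paper's design).
import numpy as np
from multiprocessing import Pool

def assign_chunk(args):
    """Assign one chunk of points to the nearest medoid; return labels and cost."""
    chunk, medoids = args
    d = np.linalg.norm(chunk[:, None, :] - medoids[None, :, :], axis=-1)
    return d.argmin(axis=1), d.min(axis=1).sum()

def parallel_assignment(X, medoids, n_cores=4):
    """Split X across n_cores workers and combine their partial results."""
    chunks = np.array_split(X, n_cores)
    with Pool(n_cores) as pool:
        results = pool.map(assign_chunk, [(c, medoids) for c in chunks])
    labels = np.concatenate([r[0] for r in results])
    total_cost = sum(r[1] for r in results)
    return labels, total_cost

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 8))
    medoids = X[rng.choice(len(X), 3, replace=False)]
    labels, cost = parallel_assignment(X, medoids)
    print(labels.shape, cost)
```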

