Hydra: a method for strain-minimizing hyperbolic embedding of network- and distance-based data

Abstract: We introduce hydra (hyperbolic distance recovery and approximation), a new method for embedding network- or distance-based data into hyperbolic space. We show mathematically that hydra satisfies an optimality guarantee: it minimizes the ‘hyperbolic strain’ between original and embedded data points. Moreover, it recovers points exactly when they are contained in a low-dimensional hyperbolic subspace of the feature space. Testing on real network data, we show that the embedding quality of hydra is competitive with existing hyperbolic embedding methods, but is achieved in substantially shorter computation time. An extended method, termed hydra+, typically outperforms existing methods in both computation time and embedding quality.

A Novel Density-based Technique for Outlier Detection of High Dimensional Data Utilizing Full Feature Space

Information Technology And Control ◽

10.5755/j01.itc.50.1.25588 ◽

2021 ◽

Vol 50 (1) ◽

pp. 138-152

Author(s):

Mujeeb Ur Rehman ◽

Dost Muhammad Khan

Keyword(s):

Data Mining ◽

Outlier Detection ◽

High Dimensional Data ◽

Research Work ◽

Feature Space ◽

High Dimensional ◽

Data Set ◽

Data Points ◽

Low Dimensional ◽

Intrinsic Feature

Recently, anomaly detection has drawn growing attention from data mining researchers as its use has spread across practical domains such as product marketing, fraud detection, medical diagnosis, fault detection and many other fields. Outlier detection in high dimensional data poses exceptional challenges for data mining experts because of the curse of dimensionality and the resulting resemblance between distant and adjoining points. Traditional algorithms and techniques for outlier detection were tested on the full feature space. Customary methodologies concentrate largely on low dimensional data and hence are ineffective at discovering anomalies in a data set with a high number of dimensions. Digging out anomalies in a high dimensional data set becomes very difficult and tedious when all subsets of projections need to be explored. All data points in high dimensional data behave like similar observations because of an intrinsic feature of such data, i.e., the contrast between distances among observations approaches zero as the number of dimensions tends to infinity. This research work proposes a novel technique that explores deviation among all data points and embeds its findings inside well established density-based techniques. The technique is state of the art in that it opens a new line of research toward resolving inherent problems of high dimensional data where outliers reside within clusters having different densities. A high dimensional dataset from the UCI Machine Learning Repository is chosen to test the proposed technique, and its results are compared with those of density-based techniques to evaluate its efficiency.
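
The distance-concentration effect mentioned in the abstract can be illustrated with a small simulation. This is a sketch under assumptions not taken from the paper: points are drawn uniformly from the unit cube, and `relative_contrast` is a hypothetical helper that reports (farthest - nearest) / nearest distance from a random query point.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min Euclidean distance from a random query to random points."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

if __name__ == "__main__":
    # The contrast shrinks as the dimension grows, so "near" and "far"
    # neighbours become harder to tell apart.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

A shrinking contrast is one reason distance- and density-based outlier scores lose discriminative power on the full feature space.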

Parametric Embedding for Class Visualization

Neural Computation ◽

10.1162/neco.2007.19.9.2536 ◽

2007 ◽

Vol 19 (9) ◽

pp. 2536-2556 ◽

Cited By ~ 28

Author(s):

Tomoharu Iwata ◽

Kazumi Saito ◽

Naonori Ueda ◽

Sean Stromsten ◽

Thomas L. Griffiths ◽

...

Keyword(s):

Latent Dirichlet Allocation ◽

Gaussian Mixture ◽

Web Pages ◽

Data Points ◽

Latent Topics ◽

Low Dimensional ◽

Number Of Classes ◽

Parametric Embedding ◽

Embedding Methods ◽

Insight Into

We propose a new method, parametric embedding (PE), that embeds objects with the class structure into a low-dimensional visualization space. PE takes as input a set of class conditional probabilities for given data points and tries to preserve the structure in an embedding space by minimizing a sum of Kullback-Leibler divergences, under the assumption that samples are generated by a gaussian mixture with equal covariances in the embedding space. PE has many potential uses depending on the source of the input data, providing insight into the classifier's behavior in supervised, semisupervised, and unsupervised settings. The PE algorithm has a computational advantage over conventional embedding methods based on pairwise object relations since its complexity scales with the product of the number of objects and the number of classes. We demonstrate PE by visualizing supervised categorization of Web pages, semisupervised categorization of digits, and the relations of words and latent topics found by an unsupervised algorithm, latent Dirichlet allocation.
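
The objective described above can be sketched as follows. The abstract only states that a sum of Kullback-Leibler divergences is minimized under an equal-covariance Gaussian mixture; the identity covariance, the explicit class priors, and the names `embedding_posteriors`, `kl_divergence` and `pe_objective` are assumptions made for illustration.

```python
import math

def embedding_posteriors(r, centers, priors):
    """Class posteriors at embedding point r for a Gaussian mixture with
    identity covariance (an assumed special case of 'equal covariances')."""
    logits = [
        math.log(prior) - 0.5 * sum((a - b) ** 2 for a, b in zip(r, center))
        for center, prior in zip(centers, priors)
    ]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pe_objective(input_posteriors, coords, centers, priors):
    """Sum over objects of KL(input posterior || embedding-space posterior)."""
    return sum(
        kl_divergence(p, embedding_posteriors(r, centers, priors))
        for p, r in zip(input_posteriors, coords)
    )
```

Evaluating this sum touches each object-class pair once, which matches the abstract's remark that the cost scales with the product of the number of objects and the number of classes.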

Hierarchical Clustering Approach for Selecting Representative Skylines

Information ◽

10.3390/info10030096 ◽

2019 ◽

Vol 10 (3) ◽

pp. 96

Author(s):

Lkhagvadorj Battulga ◽

Aziz Nasridinov

Keyword(s):

Data Distribution ◽

Computation Time ◽

Agglomerative Clustering ◽

Skyline Query ◽

Big Data Applications ◽

Wide Range ◽

Hierarchical Agglomerative Clustering ◽

Data Points ◽

Low Dimensional ◽

Representative Skyline

Recently, the skyline query has attracted interest in a wide range of applications from recommendation systems to computer networks. The skyline query is useful to obtain the dominant data points from the given dataset. In a low-dimensional dataset, the skyline query may return a small number of skyline points. However, as the dimensionality of the dataset increases, the number of skyline points also increases. In other words, depending on the data distribution and dimensionality, most of the data points may become skyline points. With the emergence of big data applications, where the data distribution and dimensionality are a significant problem, obtaining representative skyline points among the resulting skyline points is necessary. Several methods have focused on extracting representative skyline points, with varying success. However, existing methods suffer from re-computation when the global threshold changes. Moreover, in certain cases, the resulting representative skyline points may not satisfy a user with multiple preferences. Thus, in this paper, we propose a new representative skyline query processing method, called representative skyline cluster (RSC), which solves the problems of the existing methods. Our method utilizes the hierarchical agglomerative clustering method to find the exact representative skyline points, which enables us to reduce the re-computation time significantly. We show the superiority of our proposed method over the existing state-of-the-art methods with various types of experiments.
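
For readers unfamiliar with the skyline query itself, a minimal quadratic-time sketch is given here. It shows only the basic dominance test, not the RSC method, and it assumes that smaller values are preferred in every dimension; `dominates` and `skyline` are hypothetical names.

```python
def dominates(a, b):
    """True if a is no worse than b in every dimension and strictly better
    in at least one (smaller values are assumed to be better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

if __name__ == "__main__":
    data = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]
    print(skyline(data))  # (3, 3) and (4, 4) are dominated by (2, 2)
```

Because a point survives as soon as it is best in any trade-off, adding dimensions gives each point more ways to avoid being dominated, which is why the skyline tends to grow with dimensionality.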

A Survey on Internet of Things : Applications and Layered Wise Security Issues

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit195624 ◽

2019 ◽

pp. 171-180

Author(s):

Uppuluri Sirisha ◽

G. Lakshme Eswari

Keyword(s):

Quality Of Life ◽

Internet Of Things ◽

Security And Privacy ◽

Global Network ◽

Business Growth ◽

Paper Briefly ◽

Smart Meters ◽

Security Issues ◽

Data Points

This paper briefly introduces the Internet of Things (IoT) as an intelligent connectivity among physical objects or devices, which is driving large gains in areas such as efficiency, quality of life and business growth. IoT is a global network that interconnects around 46 million smart meters in the U.S. alone, with 1.1 billion data points per day [1]. The total installed base of IoT connected devices would increase to 75.44 billion globally by 2025, with growth in business, productivity, government efficiency, lifestyle, etc. This paper addresses serious concerns such as effective security and privacy to ensure confidentiality, integrity, authentication and access control among the devices.

Analyzing Intra-Speaker and Inter-Speaker Vocal Tract Impedance Characteristics in a Low-Dimensional Feature Space Using t-SNE

10.21437/interspeech.2019-1492 ◽

2019 ◽

Author(s):

Balamurali B.T. ◽

Jer-Ming Chen

Keyword(s):

Vocal Tract ◽

Feature Space ◽

Impedance Characteristics ◽

Low Dimensional

Ensemble-Based Out-of-Distribution Detection

Electronics ◽

10.3390/electronics10050567 ◽

2021 ◽

Vol 10 (5) ◽

pp. 567

Author(s):

Donghun Yang ◽

Kien Mai Mai Ngoc ◽

Iksoo Shin ◽

Kyong-Ha Lee ◽

Myunggwon Hwang

Keyword(s):

Detection Method ◽

State Of The Art ◽

Metric Learning ◽

Feature Space ◽

Confidence Score ◽

Distance Metric Learning ◽

Current State ◽

Overall Performance ◽

Deep Learning Model

To design an efficient deep learning model that can be used in the real world, it is important to detect out-of-distribution (OOD) data well. Various studies have been conducted to solve the OOD problem. The current state-of-the-art approach uses a confidence score based on the Mahalanobis distance in a feature space. Although it outperformed the previous approaches, the results were sensitive to the quality of the trained model and the dataset complexity. Herein, we propose a novel OOD detection method that can train a more efficient feature space for OOD detection. The proposed method uses an ensemble of the features trained using the softmax-based classifier and the network based on distance metric learning (DML). Through the complementary interaction of these two networks, the trained feature space has a more clumped distribution and can fit well on the Gaussian distribution by class. Therefore, OOD data can be efficiently detected by setting a threshold in the trained feature space. To evaluate the proposed method, we applied our method to various combinations of image datasets. The results show that the overall performance of the proposed approach is superior to those of other methods, including the state-of-the-art approach, on any combination of datasets.
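
A Mahalanobis-based confidence score with a threshold can be sketched as below. This is not the ensemble method proposed in the paper; it assumes two-dimensional features, class means and a single shared covariance estimated elsewhere, and the names `invert_2x2`, `mahalanobis_sq`, `confidence` and `is_ood` are hypothetical.

```python
def invert_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists (assumed non-singular)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis_sq(x, mean, cov_inv):
    """Squared Mahalanobis distance (x - mean)^T cov_inv (x - mean)."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    n = len(diff)
    return sum(diff[i] * cov_inv[i][j] * diff[j] for i in range(n) for j in range(n))

def confidence(x, class_means, cov_inv):
    """Negative squared distance to the closest class mean (higher = more in-distribution)."""
    return max(-mahalanobis_sq(x, mean, cov_inv) for mean in class_means)

def is_ood(x, class_means, cov_inv, threshold):
    """Flag x as out-of-distribution when its confidence falls below the threshold."""
    return confidence(x, class_means, cov_inv) < threshold
```

The threshold is a free parameter here; in practice it would be chosen on held-out data, and a feature space with tighter per-class clusters makes a single threshold more effective.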

Developing a LTE Localization Framework using Real Network Data towards RAN Optimization through Context Knowledge

2020 23rd International Symposium on Wireless Personal Multimedia Communications (WPMC) ◽

10.1109/wpmc50192.2020.9309483 ◽

2020 ◽

Author(s):

R. Borralho ◽

D. Duarte ◽

A. Quddus ◽

P. Vieira

Keyword(s):

Network Data ◽

Real Network ◽

Context Knowledge ◽

Real Network Data

A Heuristics-Based Parthenogenetic Algorithm for the VRP with Potential Demands and Time Windows

Scientific Programming ◽

10.1155/2016/8461857 ◽

2016 ◽

Vol 2016 ◽

pp. 1-12

Author(s):

Chenghua Shi ◽

Tonglei Li ◽

Yu Bai ◽

Fei Zhao

Keyword(s):

Genetic Algorithm ◽

Time Windows ◽

Computation Time ◽

Routing Problem ◽

Split Delivery ◽

Soft Time Windows ◽

Comparison Results ◽

Potential Demand ◽

The Cost

We present the vehicle routing problem with potential demands and time windows (VRP-PDTW), which is a variation of the classical VRP. A homogeneous fleet of vehicles originating from a central depot serves customers with soft time windows and deliveries from/to their locations, and split delivery is considered. Besides the initial demand in the order contract, the potential demand caused by conformity consuming behavior is also integrated and modeled in our problem. The objective is to minimize the travel cost of the vehicles plus the penalty cost due to violating time windows. We propose a heuristics-based parthenogenetic algorithm (HPGA) for finding optimal solutions to the problem, in which heuristics are introduced to generate the initial solution. Computational experiments are reported for instances, and the proposed algorithm is compared with a genetic algorithm (GA) and a heuristics-based genetic algorithm (HGA) from the literature. The comparison results show that our algorithm is quite competitive in terms of solution quality and computation time.
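
An objective of the kind described, travel cost plus soft time window penalties, can be sketched for a single route. This is an illustration only: the linear penalty rates, unit speed, zero service time and the helper name `route_cost` are assumptions, and split delivery and potential demand are left out.

```python
import math

def route_cost(depot, stops, early_rate=1.0, late_rate=2.0, speed=1.0):
    """Travel distance plus linear penalties for arriving outside each window.

    stops is a list of (x, y, window_start, window_end) tuples visited in order.
    The penalty rates and speed are hypothetical parameters.
    """
    cost = 0.0
    clock = 0.0
    position = depot
    for x, y, window_start, window_end in stops:
        leg = math.hypot(x - position[0], y - position[1])
        cost += leg
        clock += leg / speed
        if clock < window_start:    # early arrival: penalised, no waiting assumed
            cost += early_rate * (window_start - clock)
        elif clock > window_end:    # late arrival
            cost += late_rate * (clock - window_end)
        position = (x, y)
    # Return leg back to the depot.
    cost += math.hypot(position[0] - depot[0], position[1] - depot[1])
    return cost
```

A genetic or parthenogenetic search would evaluate a function of this shape for every candidate route, so keeping it linear in the number of stops matters for computation time.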

Unsupervised labelling of remote sensing images based on force field clustering

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-210802 ◽

2021 ◽

pp. 1-14

Author(s):

Zhenggang Wang ◽

Jin Jin

Keyword(s):

Remote Sensing ◽

Force Field ◽

Image Data ◽

Model Parameters ◽

Remote Sensing Images ◽

Initial Cluster ◽

Data Points ◽

Global Optimal ◽

Density Force

Remote sensing image segmentation provides technical support for decision making in many areas of environmental resource management. However, the quality of the remote sensing images obtained from different channels can vary considerably, and manually labeling a large amount of image data is too expensive and inefficient. In this paper, we propose a point density force field clustering (PDFC) process. According to the spectral information from different ground objects, remote sensing superpixel points are divided into core and edge data points. The differences in the densities of core data points are used to form the local peak. The center of the initial cluster can be determined by the weighted density and position of the local peak. An iterative nebular clustering process is used to obtain the result, and a proposed new objective function is used to optimize the model parameters automatically to obtain the global optimal clustering solution. The proposed algorithm can cluster the areas of different ground objects in remote sensing images automatically, and these categories can then be labeled easily by humans.

Great differences in performance and outcome of high-throughput sequencing data analysis platforms for fungal metabarcoding

MycoKeys ◽

10.3897/mycokeys.39.28109 ◽

2018 ◽

Vol 39 ◽

pp. 29-40 ◽

Cited By ~ 21

Author(s):

Sten Anslan ◽

R. Henrik Nilsson ◽

Christian Wurzbacher ◽

Petr Baldrian ◽

Leho Tedersoo ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Computation Time ◽

Potential Effect ◽

Data Sets ◽

Sequencing Data ◽

Operational Taxonomic Units ◽

High Throughput Sequencing Data ◽

Recent Developments

Along with recent developments in high-throughput sequencing (HTS) technologies and thus fast accumulation of HTS data, there has been a growing need and interest for developing tools for HTS data processing and communication. In particular, a number of bioinformatics tools have been designed for analysing metabarcoding data, each with specific features, assumptions and outputs. To evaluate the potential effect of the application of different bioinformatics workflows on the results, we compared the performance of different analysis platforms on two contrasting high-throughput sequencing data sets. Our analysis revealed that the computation time, the quality of error filtering and hence the output of a specific bioinformatics process largely depend on the platform used. Our results show that none of the bioinformatics workflows appears to perfectly filter out the accumulated errors and generate Operational Taxonomic Units, although PipeCraft, LotuS and PIPITS perform better than QIIME2 and Galaxy for the tested fungal amplicon dataset. We conclude that the output of each platform requires manual validation of the OTUs by examining the taxonomy assignment values.
