Hydra: a method for strain-minimizing hyperbolic embedding of network- and distance-based data

2020 ◽  
Vol 8 (1) ◽  
Author(s):  
Martin Keller-Ressel ◽  
Stephanie Nargang

We introduce hydra (hyperbolic distance recovery and approximation), a new method for embedding network- or distance-based data into hyperbolic space. We show mathematically that hydra satisfies a certain optimality guarantee: it minimizes the ‘hyperbolic strain’ between original and embedded data points. Moreover, it is able to recover points exactly when they are contained in a low-dimensional hyperbolic subspace of the feature space. Testing on real network data, we show that the embedding quality of hydra is competitive with existing hyperbolic embedding methods, but is achieved in substantially shorter computation time. An extended method, termed hydra+, typically outperforms existing methods in both computation time and embedding quality.
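The abstract above relies on distances measured in hyperbolic space. As a minimal sketch of what such a distance looks like, and not the hydra algorithm itself, the following computes the geodesic distance between two points in the hyperboloid model with curvature -1. The function names are hypothetical and chosen for illustration only.

```python
import math

def lift_to_hyperboloid(v):
    """Map a point v in R^d onto the hyperboloid x0^2 - |v|^2 = 1, x0 > 0."""
    x0 = math.sqrt(1.0 + sum(c * c for c in v))
    return [x0, *v]

def lorentz_product(x, y):
    """Lorentz product, with the sign chosen so that it is >= 1 on the hyperboloid."""
    return x[0] * y[0] - sum(a * b for a, b in zip(x[1:], y[1:]))

def hyperbolic_distance(x, y):
    """Geodesic distance between two hyperboloid points (curvature -1)."""
    # max(..., 1.0) guards against rounding that pushes the product just below 1.
    return math.acosh(max(lorentz_product(x, y), 1.0))

origin = lift_to_hyperboloid([0.0, 0.0])
point = lift_to_hyperboloid([math.sinh(1.0), 0.0])
distance = hyperbolic_distance(origin, point)
```

Here `point` is constructed to lie at hyperbolic distance 1 from `origin`, which gives a quick sanity check of the formula.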

2021 ◽  
Vol 50 (1) ◽  
pp. 138-152
Author(s):  
Mujeeb Ur Rehman ◽  
Dost Muhammad Khan

Recently, anomaly detection has attracted considerable attention from data mining researchers, as its popularity has grown steadily in practical domains such as product marketing, fraud detection, medical diagnosis, fault detection and many others. Outlier detection in high-dimensional data poses exceptional challenges for data mining experts because of the curse of dimensionality and the resulting resemblance between distant and adjacent points. Traditional algorithms and techniques perform outlier detection on the full feature space. These customary methodologies concentrate largely on low-dimensional data and hence are ineffective at discovering anomalies in data sets with a high number of dimensions. Digging out anomalies in a high-dimensional data set becomes very difficult and tedious when all subsets of projections need to be explored. All data points in high-dimensional data behave like similar observations because of an intrinsic property: the contrast between the distances to near and far observations approaches zero as the number of dimensions tends to infinity. This research work proposes a novel technique that explores the deviation among all data points and embeds its findings inside well-established density-based techniques. This is a state-of-the-art technique, as it opens a new direction of research towards resolving the inherent problems of high-dimensional data, where outliers reside within clusters of different densities. A high-dimensional dataset from the UCI Machine Learning Repository is chosen to test the proposed technique, and its results are compared with those of density-based techniques to evaluate its efficiency.


2007 ◽  
Vol 19 (9) ◽  
pp. 2536-2556 ◽  
Author(s):  
Tomoharu Iwata ◽  
Kazumi Saito ◽  
Naonori Ueda ◽  
Sean Stromsten ◽  
Thomas L. Griffiths ◽  
...  

We propose a new method, parametric embedding (PE), that embeds objects with class structure into a low-dimensional visualization space. PE takes as input a set of class conditional probabilities for given data points and tries to preserve the structure in an embedding space by minimizing a sum of Kullback-Leibler divergences, under the assumption that samples are generated by a Gaussian mixture with equal covariances in the embedding space. PE has many potential uses depending on the source of the input data, providing insight into the classifier's behavior in supervised, semisupervised, and unsupervised settings. The PE algorithm has a computational advantage over conventional embedding methods based on pairwise object relations, since its complexity scales with the product of the number of objects and the number of classes. We demonstrate PE by visualizing supervised categorization of Web pages, semisupervised categorization of digits, and the relations of words and latent topics found by an unsupervised algorithm, latent Dirichlet allocation.
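The objective described above, a sum of Kullback-Leibler divergences between input class probabilities and those implied by a Gaussian mixture in the embedding space, can be sketched as follows. This is an illustrative reading of the abstract rather than the authors' implementation: it assumes identity covariance for every component, and all function and parameter names are hypothetical. Only the evaluation of the objective is shown, not its minimization.

```python
import math

def embedding_posteriors(coord, centers, priors):
    """Class posteriors at an embedding coordinate under a Gaussian mixture
    with identity covariance (an assumption made for this sketch)."""
    logits = [
        math.log(prior) - 0.5 * sum((a - b) ** 2 for a, b in zip(coord, center))
        for center, prior in zip(centers, priors)
    ]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def pe_objective(input_posteriors, coords, centers, priors):
    """Sum over objects of KL(input posterior || embedding-space posterior)."""
    objective = 0.0
    for p_in, coord in zip(input_posteriors, coords):
        q = embedding_posteriors(coord, centers, priors)
        objective += sum(p * math.log(p / qc) for p, qc in zip(p_in, q) if p > 0)
    return objective

centers = [(-1.0, 0.0), (1.0, 0.0)]   # one class center per class
priors = [0.5, 0.5]
matched = pe_objective([[0.5, 0.5]], [(0.0, 0.0)], centers, priors)
mismatched = pe_objective([[0.9, 0.1]], [(0.0, 0.0)], centers, priors)
```

A point placed midway between two equally likely class centers reproduces a 50/50 input posterior, so `matched` is zero, while `mismatched` is positive. The cost of one evaluation grows with the number of objects times the number of classes, consistent with the complexity claim in the abstract.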


Information ◽  
2019 ◽  
Vol 10 (3) ◽  
pp. 96
Author(s):  
Lkhagvadorj Battulga ◽  
Aziz Nasridinov

Recently, the skyline query has attracted interest in a wide range of applications, from recommendation systems to computer networks. The skyline query is useful for obtaining the dominant data points from a given dataset. In a low-dimensional dataset, the skyline query may return a small number of skyline points. However, as the dimensionality of the dataset increases, the number of skyline points also increases. In other words, depending on the data distribution and dimensionality, most of the data points may become skyline points. With the emergence of big data applications, where the data distribution and dimensionality are a significant problem, obtaining representative skyline points among the resulting skyline points is necessary. Several methods have focused on extracting representative skyline points, with varying success. However, existing methods require re-computation when the global threshold changes. Moreover, in certain cases, the resulting representative skyline points may not satisfy a user with multiple preferences. Thus, in this paper, we propose a new representative skyline query processing method, called representative skyline cluster (RSC), which solves the problems of the existing methods. Our method utilizes hierarchical agglomerative clustering to find the exact representative skyline points, which enables us to reduce the re-computation time significantly. We show the superiority of our proposed method over the existing state-of-the-art methods with various types of experiments.
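To make the notion of a skyline point concrete, here is a minimal sketch of a plain skyline query. It is a naive quadratic scan, not the RSC method of the abstract, and it assumes that smaller values are preferred in every dimension. The function names are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere
    (assuming smaller is better in every dimension)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """Return the points that no other point dominates (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

hotels = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]  # e.g. (price, distance)
result = skyline(hotels)
```

In this toy dataset, (3, 3) and (4, 4) are both dominated by (2, 2), so the skyline is the remaining three points. With more dimensions, fewer points dominate one another, which is why the skyline grows as the abstract describes.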


Author(s):  
Uppuluri Sirisha ◽  
G. Lakshme Eswari

This paper briefly introduces the Internet of Things (IoT) as intelligent connectivity among physical objects or devices, which is driving massive gains in areas such as efficiency, quality of life and business growth. IoT is a global network that interconnects around 46 million smart meters in the U.S. alone, generating 1.1 billion data points per day [1]. The total installed base of IoT-connected devices is projected to increase to 75.44 billion globally by 2025, with accompanying growth in business, productivity, government efficiency, lifestyle, etc. This paper introduces serious concerns such as effective security and privacy, which are needed to ensure confidentiality, integrity, authentication and access control among the devices.


Electronics ◽  
2021 ◽  
Vol 10 (5) ◽  
pp. 567
Author(s):  
Donghun Yang ◽  
Kien Mai Mai Ngoc ◽  
Iksoo Shin ◽  
Kyong-Ha Lee ◽  
Myunggwon Hwang

To design an efficient deep learning model that can be used in the real world, it is important to detect out-of-distribution (OOD) data well. Various studies have been conducted to solve the OOD problem. The current state-of-the-art approach uses a confidence score based on the Mahalanobis distance in a feature space. Although it outperformed the previous approaches, the results were sensitive to the quality of the trained model and the dataset complexity. Herein, we propose a novel OOD detection method that can train a more efficient feature space for OOD detection. The proposed method uses an ensemble of the features trained using a softmax-based classifier and a network based on distance metric learning (DML). Through the complementary interaction of these two networks, the trained feature space has a more clumped distribution and fits a class-wise Gaussian distribution well. Therefore, OOD data can be efficiently detected by setting a threshold in the trained feature space. To evaluate the proposed method, we applied it to various combinations of image datasets. The results show that the overall performance of the proposed approach is superior to that of other methods, including the state-of-the-art approach, on any combination of datasets.
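The Mahalanobis-based confidence score and thresholding step mentioned above can be sketched as follows. This is a simplified illustration, not the authors' method: it assumes class-wise Gaussians with a shared diagonal covariance, it operates on feature vectors that are already extracted, and it omits the softmax and DML ensemble entirely. All names and the toy threshold are hypothetical.

```python
def fit_class_gaussians(features, labels):
    """Estimate per-class means and a shared diagonal variance (simplifying assumption)."""
    dim = len(features[0])
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    means = {c: [sum(col) / len(xs) for col in zip(*xs)] for c, xs in by_class.items()}
    var = [0.0] * dim
    for x, y in zip(features, labels):
        for j in range(dim):
            var[j] += (x[j] - means[y][j]) ** 2
    # Floor the variance to avoid division by zero on constant features.
    var = [max(v / len(features), 1e-12) for v in var]
    return means, var

def confidence_score(x, means, var):
    """Negative squared Mahalanobis distance to the closest class mean."""
    return -min(
        sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, mean, var))
        for mean in means.values()
    )

def is_ood(x, means, var, threshold):
    """Flag x as out-of-distribution when its confidence falls below the threshold."""
    return confidence_score(x, means, var) < threshold

train = [(0.0, 0.0), (0.2, -0.2), (-0.2, 0.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
labels = [0, 0, 0, 1, 1, 1]
means, var = fit_class_gaussians(train, labels)
near_flag = is_ood((0.1, 0.1), means, var, threshold=-10.0)
far_flag = is_ood((2.5, 2.5), means, var, threshold=-10.0)
```

In practice the threshold would be tuned on held-out data. The tighter the class clusters in feature space, the smaller the variance and the more sharply the score separates in-distribution from OOD inputs, which is the motivation the abstract gives for a more clumped feature distribution.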


2016 ◽  
Vol 2016 ◽  
pp. 1-12
Author(s):  
Chenghua Shi ◽  
Tonglei Li ◽  
Yu Bai ◽  
Fei Zhao

We present the vehicle routing problem with potential demands and time windows (VRP-PDTW), which is a variation of the classical VRP. A homogeneous fleet of vehicles originating from a central depot serves customers with soft time windows and deliveries from/to their locations, and split delivery is considered. Besides the initial demand in the order contract, the potential demand caused by conformity consuming behavior is also integrated and modeled in our problem. The objective is to minimize the travel cost of the vehicles plus the penalty cost incurred by violating time windows. We propose a heuristics-based parthenogenetic algorithm (HPGA) to solve the problem, in which a heuristic is introduced to generate the initial solution. Computational experiments are reported for a set of instances, and the proposed algorithm is compared with the genetic algorithm (GA) and the heuristics-based genetic algorithm (HGA) from the literature. The comparison results show that our algorithm is quite competitive in terms of solution quality and computation time.


2021 ◽  
pp. 1-14
Author(s):  
Zhenggang Wang ◽  
Jin Jin

Remote sensing image segmentation provides technical support for decision making in many areas of environmental resource management. However, the quality of the remote sensing images obtained from different channels can vary considerably, and manually labeling a massive amount of image data is too expensive and inefficient. In this paper, we propose a point density force field clustering (PDFC) process. According to the spectral information from different ground objects, remote sensing superpixel points are divided into core and edge data points. The differences in the densities of core data points are used to form the local peak. The center of the initial cluster can be determined by the weighted density and position of the local peak. An iterative nebular clustering process is used to obtain the result, and a newly proposed objective function is used to optimize the model parameters automatically to obtain the globally optimal clustering solution. The proposed algorithm can automatically cluster the areas of different ground objects in remote sensing images, and these categories can then be labeled easily by humans.


MycoKeys ◽  
2018 ◽  
Vol 39 ◽  
pp. 29-40 ◽  
Author(s):  
Sten Anslan ◽  
R. Henrik Nilsson ◽  
Christian Wurzbacher ◽  
Petr Baldrian ◽  
Leho Tedersoo ◽  
...  

Along with recent developments in high-throughput sequencing (HTS) technologies and the resulting fast accumulation of HTS data, there has been a growing need for and interest in developing tools for HTS data processing and communication. In particular, a number of bioinformatics tools have been designed for analysing metabarcoding data, each with specific features, assumptions and outputs. To evaluate the potential effect of applying different bioinformatics workflows on the results, we compared the performance of different analysis platforms on two contrasting high-throughput sequencing data sets. Our analysis revealed that the computation time, the quality of error filtering and hence the output of a specific bioinformatics process largely depend on the platform used. Our results show that none of the bioinformatics workflows appears to perfectly filter out the accumulated errors and generate Operational Taxonomic Units (OTUs), although PipeCraft, LotuS and PIPITS perform better than QIIME2 and Galaxy for the tested fungal amplicon dataset. We conclude that the output of each platform requires manual validation of the OTUs by examining the taxonomy assignment values.

