CDEC: a constrained deep embedded clustering

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Elham Amirizadeh ◽  
Reza Boostani

Purpose The aim of this study is to propose a deep neural network (DNN) method that uses side information to improve clustering results for big datasets; the authors also show that applying this information improves clustering performance and increases the speed of network training convergence. Design/methodology/approach In data mining, semisupervised learning is an attractive approach because good performance can be achieved with a small subset of labeled data; one reason is that data labeling is expensive, and semisupervised learning does not need all labels. One type of semisupervised learning is constrained clustering; this type of learning does not use class labels for clustering. Instead, it uses information about pairs of instances (side information): these instances may be in the same cluster (must-link [ML]) or in different clusters (cannot-link [CL]). Constrained clustering has been studied extensively; however, few works have focused on constrained clustering for big datasets. In this paper, the authors present a constrained clustering method for big datasets that uses a DNN. The authors inject the constraints (ML and CL) into this DNN to promote clustering performance and call the model constrained deep embedded clustering (CDEC). In this manner, an autoencoder is implemented to elicit informative low-dimensional features in the latent space, and the encoder network is then retrained using a proposed Kullback–Leibler divergence objective function, which captures the constraints in order to cluster the projected samples. The proposed CDEC was compared with the adversarial autoencoder, constrained 1-spectral clustering and autoencoder + k-means on the well-known MNIST, Reuters-10k and USPS datasets, and their performance was assessed in terms of clustering accuracy. Empirical results confirmed the statistical superiority of CDEC over the counterparts in terms of clustering accuracy. Findings First, this is the first DNN-based constrained clustering method that uses side information to improve clustering performance, without using labels, on big high-dimensional datasets. Second, the authors define a formula to inject side information into the DNN. Third, the proposed method improves clustering performance and network convergence speed. Originality/value Few works have focused on constrained clustering for big datasets, and studies on DNNs for clustering with a specific loss function that simultaneously extracts features and clusters the data are rare. The method improves the performance of big data clustering without using labels, which is important because data labeling is expensive and time-consuming, especially for big datasets.
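The abstract does not reproduce the authors' exact objective, but the general shape of a DEC-style KL clustering loss with an added constraint penalty can be sketched. The PyTorch fragment below is a minimal illustration under assumed forms: the Student's-t kernel, the pairwise ML/CL penalty terms and the weight `lam` are assumptions, not the paper's formula.

```python
import torch
import torch.nn.functional as F

def soft_assign(z, centers, alpha=1.0):
    # Student's-t kernel soft assignment of embedded points to cluster centers.
    dist_sq = torch.cdist(z, centers) ** 2
    q = (1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    # Sharpened auxiliary target P for the KL objective.
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

def cdec_loss(z, centers, ml_pairs, cl_pairs, lam=0.1):
    q = soft_assign(z, centers)
    p = target_distribution(q).detach()
    kl = F.kl_div(q.log(), p, reduction="batchmean")
    # Assumed side-information penalty: must-link pairs should share soft
    # assignments; cannot-link pairs should not overlap.
    ml_term = sum(((q[i] - q[j]) ** 2).sum() for i, j in ml_pairs)
    cl_term = sum((q[i] * q[j]).sum() for i, j in cl_pairs)
    return kl + lam * (ml_term + cl_term)

z = torch.randn(16, 10)        # latent codes from the encoder
centers = torch.randn(3, 10)   # cluster centers
loss = cdec_loss(z, centers, ml_pairs=[(0, 1)], cl_pairs=[(0, 2)])
```

Minimizing such a loss pulls must-link pairs toward the same soft cluster assignment while pushing cannot-link pairs apart, which is one plausible way to inject side information into the KL objective.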

2021 ◽  
Vol 34 (2) ◽  
Author(s):  
Caitlyn L. Holmes ◽  
Mark T. Anderson ◽  
Harry L. T. Mobley ◽  
Michael A. Bachman

SUMMARY Gram-negative bacteremia is a devastating public health threat, with high mortality in vulnerable populations and significant costs to the global economy. Concerningly, rates of both Gram-negative bacteremia and antimicrobial resistance in the causative species are increasing. Gram-negative bacteremia develops in three phases. First, bacteria invade or colonize initial sites of infection. Second, bacteria overcome host barriers, such as immune responses, and disseminate from initial body sites to the bloodstream. Third, bacteria adapt to survive in the blood and blood-filtering organs. To develop new therapies, it is critical to define species-specific and multispecies fitness factors required for bacteremia in model systems that are relevant to human infection. A small subset of species is responsible for the majority of Gram-negative bacteremia cases, including Escherichia coli, Klebsiella pneumoniae, Pseudomonas aeruginosa, and Acinetobacter baumannii. The few bacteremia fitness factors identified in these prominent Gram-negative species demonstrate shared and unique pathogenic mechanisms at each phase of bacteremia progression. Capsule production, adhesins, and metabolic flexibility are common mediators, whereas only some species utilize toxins. This review provides an overview of Gram-negative bacteremia, compares animal models for bacteremia, and discusses prevalent Gram-negative bacteremia species.


2021 ◽  
pp. 1-27
Author(s):  
Tim Sainburg ◽  
Leland McInnes ◽  
Timothy Q. Gentner

Abstract UMAP is a nonparametric graph-based dimensionality reduction algorithm using applied Riemannian geometry and algebraic topology to find low-dimensional embeddings of structured data. The UMAP algorithm consists of two steps: (1) computing a graphical representation of a data set (fuzzy simplicial complex) and (2) through stochastic gradient descent, optimizing a low-dimensional embedding of the graph. Here, we extend the second step of UMAP to a parametric optimization over neural network weights, learning a parametric relationship between data and embedding. We first demonstrate that parametric UMAP performs comparably to its nonparametric counterpart while conferring the benefit of a learned parametric mapping (e.g., fast online embeddings for new data). We then explore UMAP as a regularization, constraining the latent distribution of autoencoders, parametrically varying global structure preservation, and improving classifier accuracy for semisupervised learning by capturing structure in unlabeled data.
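Parametric UMAP is available in the umap-learn package (version 0.5 or later, assuming its TensorFlow backend is installed); a minimal usage sketch:

```python
from umap.parametric_umap import ParametricUMAP  # umap-learn >= 0.5
from sklearn.datasets import load_digits

X = load_digits().data

# Train a neural network whose output is the 2-D UMAP embedding.
embedder = ParametricUMAP(n_components=2)
embedding = embedder.fit_transform(X)

# New points are embedded with a single forward pass, no re-optimization.
new_embedding = embedder.transform(X[:10])
```

Because the mapping is a trained network, embedding new data is a forward pass rather than a fresh graph optimization, which is what enables the fast online embeddings mentioned above.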


Symmetry ◽  
2020 ◽  
Vol 12 (3) ◽  
pp. 434 ◽  
Author(s):  
Huilin Ge ◽  
Zhiyu Zhu ◽  
Kang Lou ◽  
Wei Wei ◽  
Runbang Liu ◽  
...  

Infrared image recognition technology can work day and night and has a long detection distance. However, infrared objects carry little prior information, and external factors in real-world environments easily interfere with them. Therefore, infrared object classification is a very challenging research area. Manifold learning can be used to improve the classification accuracy of infrared images in the manifold space. In this article, we propose a novel manifold learning algorithm for infrared object detection and classification. First, a manifold space is constructed with each pixel of the infrared object image as a dimension. Infrared images are represented as data points in this constructed manifold space. Next, we model the probability distribution of the infrared data points with a Gaussian distribution in the manifold space. Then, based on the Gaussian distribution information in the manifold space, the distribution characteristics of the infrared image data points in the low-dimensional space are derived. The proposed algorithm minimizes a Kullback-Leibler (KL) divergence loss between the two symmetric distributions and finally completes the classification in the low-dimensional manifold space. The efficiency of the algorithm is validated on two public infrared image data sets. The experiments show that the proposed method achieves 97.46% classification accuracy and competitive speed on the analyzed data sets.
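This pipeline resembles stochastic neighbor embedding with Gaussian kernels on both sides. Below is a hypothetical NumPy sketch of the two pairwise distributions and the KL loss; the bandwidth and normalization are assumptions, and the paper's exact formulation may differ.

```python
import numpy as np

def gaussian_affinities(X, sigma=1.0):
    # Pairwise Gaussian similarities, normalized into a probability distribution.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def kl_loss(P, Q, eps=1e-12):
    # KL divergence between the high- and low-dimensional pairwise distributions.
    return np.sum(P * np.log((P + eps) / (Q + eps)))

# Gradient descent on the low-dimensional coordinates Y would minimize
# kl_loss(gaussian_affinities(X_high), gaussian_affinities(Y)).
```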


2013 ◽  
Vol 12 (9) ◽  
pp. 1293-1304 ◽  
Author(s):  
Anda Zhang ◽  
Zhongle Liu ◽  
Lawrence C. Myers

ABSTRACT The multisubunit eukaryotic Mediator complex integrates diverse positive and negative gene regulatory signals and transmits them to the core transcription machinery. Mutations in individual subunits within the complex can lead to decreased or increased transcription of certain subsets of genes, which are highly specific to the mutated subunit. Recent studies suggest a role for Mediator in epigenetic silencing. Using white-opaque morphological switching in Candida albicans as a model, we have shown that Mediator is required for the stability of both the epigenetic silenced (white) and active (opaque) states of the bistable transcription circuit driven by the master regulator Wor1. Individual deletions of eight C. albicans Mediator subunits have shown that different Mediator subunits have dramatically diverse effects on the directionality, frequency, and environmental induction of epigenetic switching. Among the Mediator deletion mutants analyzed, only Med12 has a steady-state transcriptional effect on the components of the Wor1 circuit that clearly corresponds to its effect on switching. The MED16 and MED9 genes have been found to be among a small subset of genes that are required for the stability of both the white and opaque states. Deletion of the Med3 subunit completely destabilizes the opaque state, even though the Wor1 transcription circuit is intact and can be driven by ectopic expression of Wor1. The highly impaired ability of the med3 deletion mutant to mate, even when Wor1 expression is ectopically induced, reveals that the activation of the Wor1 circuit can be decoupled from the opaque state and one of its primary biological consequences.


2016 ◽  
Vol 116 (4) ◽  
pp. 667-689 ◽  
Author(s):  
Chao-Lung Yang ◽  
Thi Phuong Quyen Nguyen

Purpose – Class-based storage has been studied extensively and proved to be an efficient storage policy. However, few studies have addressed how to cluster stock items for class-based storage. The purpose of this paper is to develop a constrained clustering method integrated with principal component analysis (PCA) to meet the need of clustering stored items while considering practical storage constraints. Design/methodology/approach – In order to account for item characteristics and the associated storage restrictions, must-link and cannot-link constraints were constructed to meet the storage requirements. The cube-per-order index (COI), which has been used for location assignment in class-based warehouses, was analyzed by PCA. The proposed constrained clustering method utilizes the principal component loadings as item sub-group features to identify the COI distribution of item sub-groups. The clustering results are then used for allocating storage via a COI-based heuristic assignment model. Findings – The clustering results showed that the proposed method provides better compactness among item clusters. The simulation results also show that the new location assignment produced by the proposed method improved retrieval efficiency by 33 percent. Practical implications – When the number of items in a warehouse is tremendously large, human intervention to reveal storage constraints becomes impossible. The developed method can easily be applied to solve the problem regardless of the size of the data. Originality/value – The case study demonstrates an example of a practical location assignment problem with constraints. This paper also sheds light on developing a data clustering method that can be directly applied to practical data analysis issues.
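As a rough sketch of such a pipeline, the fragment below pairs scikit-learn PCA with a simplified COP-KMeans-style assignment step. The synthetic data and constraint pairs are placeholders, and the paper's exact use of principal-component loadings and its COI-based heuristic assignment model are not reproduced.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def violates(idx, label, labels, ml, cl):
    # Reject an assignment that splits a must-link pair or joins a cannot-link pair.
    for i, j in ml:
        other = j if i == idx else i if j == idx else None
        if other is not None and labels[other] not in (-1, label):
            return True
    for i, j in cl:
        other = j if i == idx else i if j == idx else None
        if other is not None and labels[other] == label:
            return True
    return False

def constrained_assign(Z, centers, ml, cl):
    # Assign each point to the nearest center that keeps all constraints satisfied.
    labels = np.full(len(Z), -1)
    for idx, z in enumerate(Z):
        for label in np.argsort(np.linalg.norm(centers - z, axis=1)):
            if not violates(idx, label, labels, ml, cl):
                labels[idx] = label
                break
    return labels  # -1 marks points with no feasible cluster

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                    # stand-in for item feature matrix
Z = PCA(n_components=2).fit_transform(X)         # principal-component features
centers = KMeans(n_clusters=3, n_init=10).fit(Z).cluster_centers_
labels = constrained_assign(Z, centers, ml=[(0, 1)], cl=[(0, 2)])
```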


2016 ◽  
Vol 29 (3) ◽  
pp. 274-291
Author(s):  
Alexey Feigin ◽  
Andrew Ferguson ◽  
Matthew Grosse ◽  
Tom Scott

Purpose The purpose of this study is to consider why firms use different disclosure outlets. The authors argue that the firm's choice of disclosure outlet can be explained by voluntary disclosure theories and investigate whether the market response around different disclosure outlets varies. Design/methodology/approach The authors investigate differences in the characteristics of firms purchasing analyst research, holding investor presentations or holding Open Briefings, and compare market reactions around each disclosure event. Findings The authors find that firm incentives to reduce information acquisition costs or mitigate disclosure risk affect firm disclosure outlet choice, and find mixed evidence in support of talent-signalling motivations. There is a lower absolute abnormal return around Open Briefings and a higher signed abnormal return around purchased analyst research. Research limitations/implications The research is exploratory in nature and considers only a small subset of disclosure outlets. There may be differences in information content across disclosure outlets. Originality/value The authors show that disclosure outlets are not homogeneous and provide empirical evidence that voluntary disclosure theories help explain differences in firms' use of disclosure outlets. Considering the growing number of disclosure outlets available, disclosure outlet choice is likely to be an increasingly important topic in accounting research.


2016 ◽  
Vol 88 (6) ◽  
pp. 729-739
Author(s):  
Mario Perhinschi ◽  
Dia Al Azzawi ◽  
Hever Moncayo ◽  
Andres Perez ◽  
Adil Togayev

Purpose This paper aims to present the development of prediction models for aircraft actuator failure impact on flight envelope within the artificial immune system (AIS) paradigm. Design/methodology/approach Simplified algorithms are developed for estimating ranges of flight envelope-relevant variables using an AIS in conjunction with the hierarchical multi-self strategy. The AIS is a new computational paradigm mimicking mechanisms of its biological counterpart for health management of complex systems. The hierarchical multi-self strategy consists of building the AIS as a collection of low-dimensional projections replacing the hyperspace of the self to avoid numerical and conceptual issues related to the high dimensionality of the problem. Findings The proposed methodology demonstrates the capability of the AIS to not only detect and identify abnormal conditions (ACs) of the aircraft subsystem but also evaluate their impact and consequences. Research limitations/implications The prediction of altered ranges of relevant variables at post-failure conditions requires failure-specific algorithms to correlate with the characteristics and dimensionality of self-projections. Future investigations are expected to expand the types of subsystems that are affected and the nature of the ACs targeted. Practical implications It is expected that the proposed methodology will facilitate the design of on-board augmentation systems to increase aircraft survivability and improve operation safety. Originality/value The AIS paradigm is extended to AC evaluation as part of an integrated and comprehensive health management process system, also including AC detection, identification and accommodation.
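As a toy illustration of the hierarchical multi-self idea (not the authors' implementation), one can replace the high-dimensional self with a set of 2-D projections and flag a sample that falls outside the self region in any projection. The radius, the all-pairs projection choice and the data below are arbitrary placeholders.

```python
import numpy as np
from itertools import combinations

def self_projections(nominal):
    # One 2-D projection of the nominal ("self") data per variable pair,
    # replacing the full high-dimensional self hyperspace.
    dims = range(nominal.shape[1])
    return {pair: nominal[:, pair] for pair in combinations(dims, 2)}

def is_abnormal(x, projections, radius=0.5):
    # Flag a sample that falls outside the self region in any projection.
    for pair, self_pts in projections.items():
        if np.linalg.norm(self_pts - x[list(pair)], axis=1).min() > radius:
            return True
    return False

nominal = np.random.default_rng(1).normal(size=(500, 4))   # nominal flight data
proj = self_projections(nominal)
print(is_abnormal(np.array([6.0, 6.0, 0.0, 0.0]), proj))   # True: outside self
```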


2019 ◽  
Vol 15 (3) ◽  
pp. 346-358
Author(s):  
Luciano Barbosa

Purpose Matching instances of the same entity, a task known as entity resolution, is a key step in the process of data integration. This paper aims to propose a deep learning network that learns different representations of Web entities for entity resolution. Design/methodology/approach To match Web entities, the proposed network learns the following representations of entities: embeddings, which are vector representations of the words in the entities in a low-dimensional space; convolutional vectors from a convolutional layer, which capture short-distance patterns in word sequences in the entities; and bag-of-words vectors, created by a bag-of-words (BoW) layer that learns weights for words in the vocabulary based on the task at hand. Given a pair of entities, the similarity between their learned representations is used as a feature for a binary classifier that identifies a possible match. In addition to those features, the classifier also uses a modification of inverse document frequency for pairs, which identifies discriminative words in pairs of entities. Findings The proposed approach was evaluated on two commercial and two academic entity resolution benchmark data sets. The results show that the proposed strategy outperforms previous approaches on the commercial data sets, which are more challenging, and achieves results similar to those of its competitors on the academic data sets. Originality/value No previous work has used a single deep learning framework to learn different representations of Web entities for entity resolution.
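A rough PyTorch sketch of the three representations and the pairwise similarity features follows; the dimensions, pooling choices and learned BoW weighting are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntityEncoder(nn.Module):
    # Assumed encoder producing the three representations described above.
    def __init__(self, vocab_size, emb_dim=64, conv_dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, conv_dim, kernel_size=3, padding=1)
        self.bow_w = nn.Parameter(torch.ones(vocab_size))  # learned word weights

    def forward(self, ids):                        # ids: (batch, seq) token indices
        e = self.emb(ids)                          # word embeddings
        emb_vec = e.mean(dim=1)                    # averaged embedding vector
        conv = F.relu(self.conv(e.transpose(1, 2)))
        conv_vec = conv.max(dim=2).values          # short-distance pattern features
        bow = torch.zeros(ids.size(0), self.bow_w.numel())
        bow.scatter_add_(1, ids, self.bow_w[ids])  # weighted bag-of-words vector
        return emb_vec, conv_vec, bow

def similarity_features(enc, a_ids, b_ids):
    # One cosine similarity per representation, fed to a binary match classifier.
    return torch.stack([F.cosine_similarity(x, y, dim=1)
                        for x, y in zip(enc(a_ids), enc(b_ids))], dim=1)
```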


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Hui Liu ◽  
Tinglong Tang ◽  
Jake Luo ◽  
Meng Zhao ◽  
Baole Zheng ◽  
...  

Purpose This study aims to address the challenge of training a detection model for the robot to detect the abnormal samples in the industrial environment, while abnormal patterns are very rare under this condition. Design/methodology/approach The authors propose a new model with double encoder–decoder (DED) generative adversarial networks to detect anomalies when the model is trained without any abnormal patterns. The DED approach is used to map high-dimensional input images to a low-dimensional space, through which the latent variables are obtained. Minimizing the change in the latent variables during the training process helps the model learn the data distribution. Anomaly detection is achieved by calculating the distance between two low-dimensional vectors obtained from two encoders. Findings The proposed method has better accuracy and F1 score when compared with traditional anomaly detection models. Originality/value A new architecture with a DED pipeline is designed to capture the distribution of images in the training process so that anomalous samples are accurately identified. A new weight function is introduced to control the proportion of losses in the encoding reconstruction and adversarial phases to achieve better results. An anomaly detection model is proposed to achieve superior performance against prior state-of-the-art approaches.
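A minimal sketch of the double encoder–decoder scoring idea appears below; the layer sizes are placeholders, and the adversarial training loop and the paper's loss-weighting function are omitted.

```python
import torch
import torch.nn as nn

def mlp(sizes):
    # Simple fully connected stack with ReLU between layers.
    layers = []
    for a, b in zip(sizes, sizes[1:]):
        layers += [nn.Linear(a, b), nn.ReLU()]
    return nn.Sequential(*layers[:-1])

class DEDAnomaly(nn.Module):
    # Sketch: encoder -> decoder -> second encoder over the reconstruction.
    def __init__(self, in_dim=784, latent=32):
        super().__init__()
        self.enc1 = mlp([in_dim, 256, latent])
        self.dec = mlp([latent, 256, in_dim])
        self.enc2 = mlp([in_dim, 256, latent])

    def forward(self, x):
        z1 = self.enc1(x)        # latent code of the input
        x_hat = self.dec(z1)     # reconstruction
        z2 = self.enc2(x_hat)    # latent code of the reconstruction
        return x_hat, z1, z2

def anomaly_score(model, x):
    # Distance between the two latent codes; large for unseen (abnormal) patterns.
    _, z1, z2 = model(x)
    return torch.norm(z1 - z2, dim=1)
```

Trained only on normal samples, the two encoders agree on in-distribution inputs, so a large latent distance signals an anomaly.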

