Learning representations of Web entities for entity resolution

2019 ◽  
Vol 15 (3) ◽  
pp. 346-358
Author(s):  
Luciano Barbosa

Purpose Matching instances of the same entity, a task known as entity resolution, is a key step in data integration. This paper proposes a deep learning network that learns different representations of Web entities for entity resolution.
Design/methodology/approach To match Web entities, the proposed network learns the following representations of entities: embeddings, which are vector representations of the words in the entities in a low-dimensional space; convolutional vectors from a convolutional layer, which capture short-distance patterns in word sequences in the entities; and bag-of-words vectors, created by a bow layer that learns weights for words in the vocabulary based on the task at hand. Given a pair of entities, the similarity between their learned representations is used as a feature for a binary classifier that identifies a possible match. In addition to these features, the classifier uses a modification of inverse document frequency for pairs, which identifies discriminative words in pairs of entities.
Findings The proposed approach was evaluated on two commercial and two academic entity resolution benchmark data sets. The results show that the proposed strategy outperforms previous approaches on the commercial data sets, which are more challenging, and achieves results comparable to its competitors on the academic data sets.
Originality/value No previous work has used a single deep learning framework to learn different representations of Web entities for entity resolution.
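
As a hedged illustration of the pairwise features described above, the sketch below assembles one similarity score per learned representation and computes an IDF-style weight over entity pairs. The helper names (cosine, pair_idf_weights, match_features) and the whitespace tokenization are assumptions for the sketch, not the paper's implementation.

```python
import numpy as np
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two learned representation vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def pair_idf_weights(entity_pairs):
    """Assumed 'IDF for pairs': weight a word by how rarely it occurs across
    entity pairs, so rare words shared by a pair are treated as discriminative."""
    n_pairs = len(entity_pairs)
    doc_freq = Counter(w for a, b in entity_pairs
                       for w in set(a.split()) | set(b.split()))
    return {w: float(np.log(n_pairs / df)) for w, df in doc_freq.items()}

def match_features(emb_a, emb_b, conv_a, conv_b, bow_a, bow_b):
    """One similarity feature per representation (embedding, convolutional,
    bag-of-words); the vector would feed a binary match/non-match classifier."""
    return np.array([cosine(emb_a, emb_b),
                     cosine(conv_a, conv_b),
                     cosine(bow_a, bow_b)])
```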

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Samar Ali Shilbayeh ◽  
Sunil Vadera

Purpose This paper describes the use of a meta-learning framework for recommending cost-sensitive classification methods, with the aim of answering an important question that arises in machine learning, namely, “Among all the available classification algorithms, and in considering a specific type of data and cost, which is the best algorithm for my problem?”
Design/methodology/approach The framework is based on the idea of applying machine learning techniques to discover knowledge about the performance of different machine learning algorithms. It includes components that repeatedly apply different classification methods to data sets and measure their performance. The characteristics of the data sets, combined with the algorithms and their performance, provide the training examples. A decision tree algorithm is applied to the training examples to induce the knowledge, which can then be used to recommend algorithms for new data sets. The paper contributes to both meta-learning and cost-sensitive machine learning. Neither field is new; the contribution is a recommender that recommends the optimal cost-sensitive approach for a given data problem.
Findings The proposed solution is implemented in WEKA and evaluated by applying it to different data sets and comparing the results with existing studies available in the literature. The results show that the developed meta-learning solution produces better results than METAL, a well-known meta-learning system. Unlike the compared system, the developed solution takes the misclassification cost into consideration during the learning process.
Originality/value Meta-learning work has been done before, but this paper presents a new meta-learning framework that is cost-sensitive.
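
The following sketch, under loose assumptions, shows the meta-learning loop the abstract describes: run candidate classifiers on each data set, record the cheapest one under a misclassification cost matrix, and train a decision tree that maps simple data-set characteristics to that choice. The toy meta-features, the training-set evaluation and the (name, factory) interface are illustrative simplifications, not the WEKA implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def meta_features(X, y):
    """Toy data-set characteristics; the real framework uses a richer set."""
    return [X.shape[0], X.shape[1], len(np.unique(y)), float(np.var(X))]

def misclassification_cost(y_true, y_pred, cost_matrix):
    """Total cost of errors under a user-supplied cost matrix indexed [true][predicted]."""
    return sum(cost_matrix[t][p] for t, p in zip(y_true, y_pred))

def build_meta_learner(datasets, candidates, cost_matrix):
    """Evaluate every candidate on every data set, keep the cheapest, and learn
    a decision tree mapping data-set characteristics to the recommended algorithm."""
    meta_X, meta_y = [], []
    for X, y in datasets:
        costs = []
        for name, make_clf in candidates:            # candidates: [(name, factory), ...]
            clf = make_clf().fit(X, y)
            costs.append((misclassification_cost(y, clf.predict(X), cost_matrix), name))
        meta_X.append(meta_features(X, y))
        meta_y.append(min(costs)[1])                  # cheapest algorithm for this data set
    return DecisionTreeClassifier().fit(meta_X, meta_y)
```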


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Hui Liu ◽  
Tinglong Tang ◽  
Jake Luo ◽  
Meng Zhao ◽  
Baole Zheng ◽  
...  

Purpose This study aims to address the challenge of training a detection model for a robot to detect abnormal samples in an industrial environment, where abnormal patterns are very rare.
Design/methodology/approach The authors propose a new model with double encoder–decoder (DED) generative adversarial networks to detect anomalies when the model is trained without any abnormal patterns. The DED approach maps high-dimensional input images to a low-dimensional space, through which the latent variables are obtained. Minimizing the change in the latent variables during the training process helps the model learn the data distribution. Anomaly detection is achieved by calculating the distance between two low-dimensional vectors obtained from the two encoders.
Findings The proposed method achieves better accuracy and F1 score than traditional anomaly detection models.
Originality/value A new architecture with a DED pipeline is designed to capture the distribution of images in the training process so that anomalous samples are accurately identified. A new weight function is introduced to control the proportion of losses in the encoding reconstruction and adversarial phases to achieve better results. An anomaly detection model is proposed that achieves superior performance against prior state-of-the-art approaches.
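
A minimal sketch of the scoring step described above, assuming the two encoders and the decoder are available as callables: the anomaly score is simply the distance between the latent code of the input and the latent code of its reconstruction. The function names and the thresholding rule are assumptions, not the authors' code.

```python
import numpy as np

def anomaly_score(encode_1, encode_2, decode, image):
    """Distance between the two low-dimensional vectors produced by the two
    encoders of a double encoder-decoder pipeline (input vs. reconstruction)."""
    z1 = np.asarray(encode_1(image))          # latent code of the input image
    z2 = np.asarray(encode_2(decode(z1)))     # latent code of its reconstruction
    return float(np.linalg.norm(z1 - z2))

def is_anomalous(score, threshold):
    """Flag a sample when its latent distance exceeds a validation-chosen threshold."""
    return score > threshold
```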


2021 ◽  
Author(s):  
Nanditha Mallesh ◽  
Max Zhao ◽  
Lisa Meintker ◽  
Alexander Höllein ◽  
Franz Elsner ◽  
...  

Multi-parameter flow cytometry (MFC) is a cornerstone of clinical decision making for hematological disorders such as leukemia and lymphoma. MFC data analysis requires trained experts to manually gate cell populations of interest, which is time-consuming and subjective. Manual gating is often limited to a two-dimensional space. In recent years, deep learning models have been developed to analyze the data in high-dimensional space and are highly accurate. Such models have been used successfully in histology, cytopathology, image flow cytometry and conventional MFC analysis. However, current AI models for subtype classification based on MFC data are limited to the antibody (flow cytometry) panel they were trained on. Thus, a key challenge in deploying AI models in routine diagnostics is their robustness and adaptability. In this study, we present a workflow that extends our previous model to four additional MFC panels. We employ knowledge transfer to adapt the model to smaller data sets, training a model for each data set by transferring the features learned in our base model. With this workflow, we increase the model's overall performance and, most notably, the learning rate for very small training sizes.
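
As a hedged sketch of the knowledge-transfer step, the PyTorch snippet below reuses the feature layers of a base model and retrains only a new panel-specific head on the smaller data set. It assumes the base model is an nn.Sequential ending in a Linear classifier; the actual architecture and training details in the study differ.

```python
import torch.nn as nn

def adapt_to_new_panel(base_model: nn.Sequential, n_new_classes: int,
                       freeze_features: bool = True) -> nn.Sequential:
    """Transfer features learned on the large base panel to a smaller MFC panel
    by swapping the output layer and (optionally) freezing the feature layers."""
    feature_layers = nn.Sequential(*list(base_model.children())[:-1])
    if freeze_features:
        for p in feature_layers.parameters():
            p.requires_grad = False            # keep the base-panel features fixed
    in_features = list(base_model.children())[-1].in_features
    head = nn.Linear(in_features, n_new_classes)   # new panel-specific classifier
    return nn.Sequential(feature_layers, head)
```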


Author(s):  
Bolin Chen ◽  
Yourui Han ◽  
Xuequn Shang ◽  
Shenggui Zhang

The identification of disease-related genes plays an essential role in bioinformatics. To this end, many powerful machine learning methods have been proposed from various computational perspectives, such as biological network analysis, classification, regression and deep learning. Among them, deep learning-based methods have achieved great success in identifying disease-related genes in terms of accuracy and efficiency. However, these methods rarely handle two issues well: (1) the multifunctionality of many genes; and (2) the scale-free property of biological networks. To overcome these issues, we propose a novel network representation method that transforms individual vertices, together with their surrounding topological structures, into image-like datasets. It takes each node-induced sub-network as a candidate and adds its environmental characteristics to generate a low-dimensional representation. These image-like datasets can be fed directly to a convolutional neural network-based method for identifying cancer-related genes. Numerical experiments show that the proposed method achieves an AUC of 0.9256 on a single network and 0.9452 on multiple networks, outperforming many existing methods.
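
To make the vertex-to-image idea concrete, here is a small sketch under stated assumptions: for one vertex, take its node-induced sub-network, order the neighbours deterministically and write the corresponding adjacency entries into a fixed-size matrix that a CNN can consume. The degree-based ordering, truncation and zero-padding are illustrative choices, not the paper's exact encoding.

```python
import numpy as np

def node_image(adj, node, size=16):
    """Image-like representation of one vertex from its induced sub-network."""
    adj = np.asarray(adj, dtype=float)
    neighbours = [v for v in np.where(adj[node] > 0)[0].tolist() if v != node]
    order = [node] + sorted(neighbours, key=lambda v: -adj[v].sum())
    order = order[:size]                       # truncate; missing slots stay zero-padded
    img = np.zeros((size, size))
    for i, u in enumerate(order):
        for j, v in enumerate(order):
            img[i, j] = adj[u, v]
    return img                                 # stack such matrices as CNN input channels
```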


2018 ◽  
Vol 4 ◽  
pp. e154 ◽  
Author(s):  
Kelwin Fernandes ◽  
Davide Chicco ◽  
Jaime S. Cardoso ◽  
Jessica Fernandes

Cervical cancer remains a significant cause of mortality around the world, even though it can be prevented and cured by removing affected tissue at an early stage. Providing universal and efficient access to cervical screening programs is a challenge that requires identifying vulnerable individuals in the population, among other steps. In this work, we present a computationally automated strategy for predicting the outcome of a patient's biopsy, given risk patterns from individual medical records. We propose a machine learning technique that allows a joint and fully supervised optimization of dimensionality reduction and classification models. We also build a model able to highlight relevant properties in the low-dimensional space, to ease the classification of patients. We instantiated the proposed approach with deep learning architectures and achieved accurate prediction results (best area under the curve, AUC = 0.6875) that outperform previously developed methods such as denoising autoencoders. Additionally, we explored some clinical findings from the embedding spaces and validated them through the medical literature, making them reliable for physicians and biomedical researchers.
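
The joint optimization of dimensionality reduction and classification can be pictured with the minimal PyTorch sketch below: a low-dimensional bottleneck feeds the classifier and both are trained end to end by the supervised loss, so the embedding is shaped by the prediction task. The layer sizes and two-dimensional latent space are assumptions for illustration, not the architecture used in the study.

```python
import torch.nn as nn

class JointModel(nn.Module):
    """Supervised dimensionality reduction and classification trained jointly."""
    def __init__(self, n_features, n_latent=2, n_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU(),
                                     nn.Linear(16, n_latent))   # low-dimensional embedding
        self.classifier = nn.Linear(n_latent, n_classes)

    def forward(self, x):
        z = self.encoder(x)                     # embedding kept for visual inspection
        return self.classifier(z), z

model = JointModel(n_features=30)
loss_fn = nn.CrossEntropyLoss()                 # one supervised loss drives both parts
```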


2022 ◽  
Vol 40 (3) ◽  
pp. 1-28
Author(s):  
Surong Yan ◽  
Kwei-Jay Lin ◽  
Xiaolin Zheng ◽  
Haosen Wang

Explicit and implicit knowledge about users and items has been used to describe complex and heterogeneous side information for recommender systems (RSs). Many existing methods use knowledge graph embedding (KGE) to learn the representation of a user-item knowledge graph (KG) in a low-dimensional space. In this article, we propose a lightweight end-to-end joint learning framework that fuses the tasks of KGE and RSs at the model level. Our method introduces a lightweight KG embedding approach that uses bidirectional bijection relation-type modeling to enable scalability for large graphs and self-adaptive negative sampling to optimize negative sample generation. It further generates integrated views for users and items based on relation types to explicitly model users' preferences and items' features, respectively. Finally, we add virtual “recommendation” relations between the integrated views of users and items to model the preferences of users on items, seamlessly integrating the RS with the user-item KG over a unified graph. Experimental results on multiple datasets and benchmarks show that our method achieves better recommendation accuracy than existing state-of-the-art methods. Complexity and runtime analysis suggests that our method has lower time and space complexity than most existing methods and improves scalability.
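
A tiny sketch of the unified-graph step mentioned above, under the assumption that the KG is stored as (head, relation, tail) triples: observed user-item interactions are added as virtual “recommendation” edges so that KG embedding and recommendation share one graph. The ID formatting is illustrative.

```python
def add_recommendation_relations(kg_triples, interactions, relation="recommendation"):
    """Extend a user-item KG with a virtual 'recommendation' edge per interaction."""
    virtual = [(f"user:{u}", relation, f"item:{i}") for u, i in interactions]
    return list(kg_triples) + virtual

# e.g. add_recommendation_relations([("item:1", "directed_by", "person:7")], [(42, 1)])
```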


Author(s):  
Jianhua Su ◽  
Rui Li ◽  
Hong Qiao ◽  
Jing Xu ◽  
Qinglin Ai ◽  
...  

Purpose The purpose of this paper is to develop a dual peg-in-hole insertion strategy. Dual peg-in-hole insertion is one of the most common tasks in manufacturing. Most previous work develops the insertion strategy in a two- or three-dimensional space, assuming the initial yaw angle is zero and considering only the roll and pitch angles. However, in some cases, the yaw angle cannot be ignored because of the pose uncertainty of the peg on the gripper. There is therefore a need to design the insertion strategy in a higher-dimensional configuration space.
Design/methodology/approach In this paper, the authors handle the insertion problem by converting it into several sub-problems based on the attractive region formed by the constraints. The existence of the attractive region in the high-dimensional configuration space is first discussed. Then, the construction of the high-dimensional attractive region, with its sub-attractive regions in the low-dimensional space, is proposed. The robotic insertion strategy can thus be designed in the subspace to eliminate some of the uncertainties between the dual pegs and dual holes.
Findings Dual peg-in-hole insertion is realized without the use of force sensors. The proposed strategy is also used to demonstrate precision dual peg-in-hole insertion, where the clearance between the dual pegs and dual holes is about 0.02 mm.
Practical implications The sensorless insertion strategy does not increase the cost of the assembly system and can also be used for dual peg-in-hole insertion.
Originality/value Theoretical and experimental analyses of dual peg-in-hole insertion without the use of force sensors are presented.


2018 ◽  
Vol 11 (3) ◽  
pp. 386-403 ◽  
Author(s):  
M. Arif Wani ◽  
Saduf Afzal

Purpose Many strategies have been put forward for training deep network models; however, stacking several layers of non-linearities typically results in poor propagation of gradients and activations. The purpose of this paper is to explore a two-step strategy in which an initial deep learning model is first obtained by unsupervised learning and then optimized by fine-tuning. A number of fine-tuning algorithms are explored for optimizing deep learning models, including a new algorithm in which backpropagation with adaptive gain is integrated with the Dropout technique; the authors evaluate its performance in fine-tuning the pretrained deep network.
Design/methodology/approach The parameters of the deep neural network are first learnt using greedy layer-wise unsupervised pretraining. The proposed technique is then used to perform supervised fine-tuning of the deep neural network model. An extensive experimental study is performed to evaluate the proposed fine-tuning technique on three benchmark data sets: USPS, Gisette and MNIST. The authors tested the approach on data sets of varying size, using randomly chosen training samples of 20, 50, 70 and 100 percent of the original data set.
Findings The extensive experimental study shows that the two-step strategy and the proposed fine-tuning technique yield promising results in the optimization of deep network models.
Originality/value This paper proposes employing several algorithms for fine-tuning deep network models. A new approach that integrates the adaptive gain backpropagation (BP) algorithm with the Dropout technique is proposed for fine-tuning deep networks. An evaluation and comparison of the various fine-tuning algorithms on three benchmark data sets is presented.
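
The second step of the two-step strategy can be sketched as below, assuming the greedily pretrained layers are available as an nn.Sequential ending in a Linear layer: a Dropout layer and a task-specific output layer are added and the stack is fine-tuned with ordinary supervised backpropagation. The adaptive-gain update of the paper's BP variant is not reproduced here; this is a plain-SGD illustration.

```python
import torch
import torch.nn as nn

def fine_tune(pretrained, n_classes, train_loader, epochs=10, p_drop=0.5, lr=0.01):
    """Supervised fine-tuning of a pretrained stack with Dropout (plain SGD)."""
    model = nn.Sequential(pretrained,
                          nn.Dropout(p_drop),                    # dropout during fine-tuning
                          nn.Linear(pretrained[-1].out_features, n_classes))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```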


2017 ◽  
Vol 29 (4) ◽  
pp. 1053-1102 ◽  
Author(s):  
Hossein Soleimani ◽  
David J. Miller

Many classification tasks require both labeling objects and determining label associations for parts of each object. Example applications include labeling segments of images or determining relevant parts of a text document when the training labels are available only at the image or document level. This task is usually referred to as multi-instance (MI) learning, where the learner typically receives a collection of labeled (or sometimes unlabeled) bags, each containing several segments (instances). We propose a semisupervised MI learning method for multilabel classification. Most MI learning methods treat instances in each bag as independent and identically distributed samples. However, in many practical applications, instances are related to each other and should not be considered independent. Our model discovers a latent low-dimensional space that captures structure within each bag. Further, unlike many other MI learning methods, which are primarily developed for binary classification, we model multiple classes jointly, thus also capturing possible dependencies between different classes. We develop our model within a semisupervised framework, which leverages both labeled and, typically, a larger set of unlabeled bags for training. We develop several efficient inference methods for our model. We first introduce a Markov chain Monte Carlo method for inference, which can handle arbitrary relations between bag labels and instance labels, including the standard hard-max MI assumption. We also develop an extension of our model that uses stochastic variational Bayes methods for inference, and thus scales better to massive data sets. Experiments show that our approach outperforms several MI learning and standard classification methods on both bag-level and instance-level label prediction. All code for replicating our experiments is available from https://github.com/hsoleimani/MLTM.
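
The standard hard-max MI assumption mentioned above is simple enough to state in a few lines: a bag carries a class label if at least one of its instances does. The thresholding below is an illustrative way to turn instance-level probabilities into a bag-level multilabel vector.

```python
import numpy as np

def bag_label_hard_max(instance_probs, threshold=0.5):
    """instance_probs: (n_instances, n_classes) probabilities for one bag.
    A class is on for the bag if any instance exceeds the threshold."""
    probs = np.asarray(instance_probs)
    return (probs.max(axis=0) >= threshold).astype(int)

# e.g. bag_label_hard_max([[0.1, 0.9], [0.2, 0.3]]) -> array([0, 1])
```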


2018 ◽  
Author(s):  
Peng Xie ◽  
Mingxuan Gao ◽  
Chunming Wang ◽  
Pawan Noel ◽  
Chaoyong Yang ◽  
...  

Characterization of individual cell types is fundamental to the study of multicellular samples such as tumor tissues. Single-cell RNA-seq (scRNA-seq) techniques, which allow high-throughput expression profiling of individual cells, have significantly advanced our ability to perform this task. Currently, most scRNA-seq data analyses begin with unsupervised clustering of cells, followed by visualization of the clusters in a low-dimensional space. Clusters are often assigned to different cell types based on canonical markers. However, the efficiency of characterizing known cell types in this way is low, and it is limited by the investigator's knowledge. In this study, we present a technical framework for training an expandable supervised classifier that reveals single-cell identities based on RNA expression profiles. Using multiple scRNA-seq datasets, we demonstrate the superior accuracy, robustness, compatibility and expandability of this new solution compared to traditional methods. We use two examples of model upgrades to demonstrate how the projected evolution of the cell-type classifier is realized.
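
As a hedged contrast to the cluster-then-annotate workflow criticized above, the sketch below trains a plain supervised classifier directly on expression profiles and uses it to assign cell identities. The log-transform and logistic-regression choice are illustrative stand-ins; the study's expandable classifier framework is more elaborate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_cell_type_classifier(expression_matrix, cell_type_labels):
    """Learn cell identities directly from (cells x genes) expression counts."""
    X = np.log1p(expression_matrix)            # standard log-transform of counts
    return LogisticRegression(max_iter=1000).fit(X, cell_type_labels)

def predict_identities(model, expression_matrix):
    """Assign a cell-type label to each cell in a new expression matrix."""
    return model.predict(np.log1p(expression_matrix))
```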

