Deep learning for blocking in entity matching

Entity matching (EM) finds data instances that refer to the same real-world entity. Most EM solutions perform blocking and then matching. Many works have applied deep learning (DL) to matching, but far fewer have applied DL to blocking. These blocking works are also limited in that they consider only a simple form of DL, and some of them require labeled training data. In this paper, we develop the DeepBlocker framework, which significantly advances the state of the art in applying DL to blocking for EM. We first define a large space of DL solutions for blocking, which contains solutions of varying complexity and subsumes most previous works. Next, we develop eight representative solutions in this space. These solutions do not require labeled training data and exploit recent advances in DL (e.g., sequence modeling, transformers, self-supervision). We empirically determine which solutions perform best on which kinds of datasets (structured, textual, or dirty). We show that the best of these eight solutions outperform the best existing DL solution and the best existing non-DL solutions (including a state-of-the-art industrial non-DL solution) on dirty and textual data, and are comparable on structured data. Finally, we show that the combination of the best DL and non-DL solutions can perform even better, suggesting a new avenue for research.
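To make the blocking step concrete, here is a minimal sketch of similarity-based blocking: each record is turned into a vector, and each record in one table keeps only its k most similar records in the other table as candidate pairs for the matcher. This is not the DeepBlocker method. The character-trigram vector is a simple non-DL stand-in for a learned tuple embedding, and the names `trigram_vector`, `cosine`, and `block` are hypothetical.

```python
from collections import Counter
from math import sqrt


def trigram_vector(text):
    """Stand-in for a learned tuple embedding (assumption): character trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def block(table_a, table_b, k=1):
    """Return candidate pairs (i, j): each record in table_a with its k nearest records in table_b."""
    vectors_b = [trigram_vector(record) for record in table_b]
    candidates = set()
    for i, record in enumerate(table_a):
        vec = trigram_vector(record)
        ranked = sorted(
            range(len(table_b)),
            key=lambda j: cosine(vec, vectors_b[j]),
            reverse=True,
        )
        candidates.update((i, j) for j in ranked[:k])
    return candidates
```

Only the pairs returned by `block` would be passed to the (more expensive) matching step; a larger k trades a bigger candidate set for higher recall.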

Named Entity Recognition and Relation Extraction

ACM Computing Surveys ◽

10.1145/3445965 ◽

2021 ◽

Vol 54 (1) ◽

pp. 1-39

Author(s):

Zara Nasar ◽

Syed Waqar Jaffry ◽

Muhammad Kamran Malik

Keyword(s):

Deep Learning ◽

State Of The Art ◽

Named Entity Recognition ◽

Relation Extraction ◽

The State ◽

Entity Recognition ◽

Joint Models ◽

Named Entity ◽

Textual Data ◽

Benchmark Datasets

With the advent of Web 2.0, many online platforms now produce massive amounts of textual data. With ever-increasing textual data at hand, it is of immense importance to extract information nuggets from this data. One approach to harnessing this unstructured textual data effectively is to transform it into structured text. Hence, this study presents an overview of approaches that can be applied to extract key insights from textual data in a structured way. To this end, this review focuses mainly on Named Entity Recognition and Relation Extraction. The former deals with the identification of named entities, and the latter deals with the problem of extracting relations between a set of entities. This study covers early approaches as well as the developments made to date using machine learning models. Survey findings indicate that deep-learning-based hybrid and joint models currently dominate the state of the art. It is also observed that annotated benchmark datasets for various textual-data generators, such as Twitter and other social forums, are not available. This scarcity of datasets has resulted in relatively little progress in these domains. Additionally, the majority of state-of-the-art techniques are offline and computationally expensive. Last, with the increasing focus on deep-learning frameworks, there is a need to understand and explain the processes taking place inside deep architectures.

A Survey on Bias and Fairness in Machine Learning

ACM Computing Surveys ◽

10.1145/3457607 ◽

2021 ◽

Vol 54 (6) ◽

pp. 1-35

Author(s):

Ninareh Mehrabi ◽

Fred Morstatter ◽

Nripsuta Saxena ◽

Kristina Lerman ◽

Aram Galstyan

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Deep Learning ◽

Real World ◽

State Of The Art ◽

Future Directions ◽

Discriminatory Behavior ◽

Real World Applications ◽

Near Future ◽

Different Sources

With the widespread use of artificial intelligence (AI) systems and applications in our everyday lives, accounting for fairness has gained significant importance in the design and engineering of such systems. AI systems can be used in many sensitive environments to make important and life-changing decisions; thus, it is crucial to ensure that these decisions do not reflect discriminatory behavior toward certain groups or populations. More recently, some work has been developed in traditional machine learning and deep learning that addresses such challenges in different subdomains. With the commercialization of these systems, researchers are becoming more aware of the biases that these applications can contain and are attempting to address them. In this survey, we investigated different real-world applications that have shown biases in various ways, and we listed different sources of biases that can affect AI applications. We then created a taxonomy for fairness definitions that machine learning researchers have defined to avoid the existing bias in AI systems. In addition, we examined different domains and subdomains in AI, showing what researchers have observed with regard to unfair outcomes in the state-of-the-art methods and the ways they have tried to address them. There are still many future directions and solutions that can be taken to mitigate the problem of bias in AI systems. We hope that this survey will motivate researchers to tackle these issues in the near future by observing existing work in their respective fields.

Unsupervised Deep Learning via Affinity Diffusion

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6757 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11029-11036

Author(s):

Jiabo Huang ◽

Qi Dong ◽

Shaogang Gong ◽

Xiatian Zhu

Keyword(s):

Deep Learning ◽

State Of The Art ◽

General Purpose ◽

Training Data ◽

Learning Approach ◽

Model Learning ◽

Feature Representations ◽

Discriminative Feature ◽

Training Samples ◽

Unsupervised Deep Learning

Convolutional neural networks (CNNs) have achieved unprecedented success in a variety of computer vision tasks. However, they usually rely on supervised model learning, which needs massive labelled training data and dramatically limits their usability and deployability in real-world scenarios without any labelling budget. In this work, we introduce a general-purpose unsupervised deep learning approach to deriving discriminative feature representations. It is based on self-discovering semantically consistent groups of unlabelled training samples with the same class concepts through a progressive affinity diffusion process. Extensive experiments on object image classification and clustering show the performance superiority of the proposed method over the state-of-the-art unsupervised learning models on six common image recognition benchmarks: MNIST, SVHN, STL10, CIFAR10, CIFAR100 and ImageNet.

A Probabilistic Deep Learning Approach for Twitter Sentiment Analysis

International Journal of Distributed Artificial Intelligence ◽

10.4018/ijdai.2020070102 ◽

2020 ◽

Vol 12 (2) ◽

pp. 21-34

Author(s):

Mostefai Abdelkader

Keyword(s):

Deep Learning ◽

Sentiment Analysis ◽

State Of The Art ◽

Learning Approach ◽

Probabilistic Representation ◽

Effective Manner ◽

Textual Data ◽

Positive Class ◽

Negative Class ◽

Deep Learning Model

In recent years, increasing attention has been paid to sentiment analysis on microblogging platforms such as Twitter. Sentiment analysis refers to the task of detecting whether a textual item (e.g., a tweet) contains an opinion about a topic. This paper proposes a probabilistic deep learning approach for sentiment analysis. The deep learning model used is a convolutional neural network (CNN). The main contribution of this approach is a new probabilistic representation of the text to be fed as input to the CNN. This representation is a matrix that stores, for each word in the message, the probability that it belongs to the positive class and the probability that it belongs to the negative class. The proposed approach is evaluated on four well-known datasets: HCR, OMD, STS-gold, and a dataset provided by the SemEval-2017 Workshop. The results of the experiments show that the proposed approach competes with state-of-the-art sentiment analyzers and has the potential to detect sentiments from textual data in an effective manner.
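The matrix described above can be sketched as follows. The abstract does not say how the per-word probabilities are estimated, so this sketch assumes a smoothed relative frequency over labeled messages. The function names, the `smoothing` parameter, the fixed `max_len`, and the neutral `[0.5, 0.5]` padding row are all hypothetical choices, not taken from the paper.

```python
from collections import Counter


def word_class_probabilities(labeled_texts, smoothing=1.0):
    """Estimate P(positive | word) by smoothed relative frequency (assumed estimator)."""
    pos, neg = Counter(), Counter()
    for text, label in labeled_texts:
        target = pos if label == "positive" else neg
        target.update(text.lower().split())
    vocab = set(pos) | set(neg)
    return {
        w: (pos[w] + smoothing) / (pos[w] + neg[w] + 2 * smoothing)
        for w in vocab
    }


def represent(text, p_positive, max_len=6):
    """Build a max_len x 2 matrix with one row [P(positive), P(negative)] per word.

    Unknown words and padding rows use the neutral value [0.5, 0.5] (assumption).
    """
    words = text.lower().split()[:max_len]
    rows = [[p_positive.get(w, 0.5), 1.0 - p_positive.get(w, 0.5)] for w in words]
    rows += [[0.5, 0.5] for _ in range(max_len - len(rows))]
    return rows
```

A fixed-size matrix of this kind could then be passed to a CNN in place of word embeddings.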

Examining Deep Learning Architectures for Crime Classification and Prediction

Forecasting ◽

10.3390/forecast3040046 ◽

2021 ◽

Vol 3 (4) ◽

pp. 741-762

Author(s):

Panagiotis Stalidis ◽

Theodoros Semertzidis ◽

Petros Daras

Keyword(s):

Deep Learning ◽

State Of The Art ◽

Open Data ◽

Training Data ◽

Crime Prediction ◽

Crime Types ◽

Improved Performance ◽

Learning Architectures ◽

And Training ◽

Crime Classification

In this paper, a detailed study on crime classification and prediction using deep learning architectures is presented. We examine the effectiveness of deep learning algorithms in this domain and provide recommendations for designing and training deep learning systems for predicting crime areas, using open data from police reports. Having time-series of crime types per location as training data, a comparative study of 10 state-of-the-art methods against 3 different deep learning configurations is conducted. In our experiments with 5 publicly available datasets, we demonstrate that the deep learning-based methods consistently outperform the existing best-performing methods. Moreover, we evaluate the effectiveness of different parameters in the deep learning architectures and give insights for configuring them to achieve improved performance in crime classification and finally crime prediction.

An Input-aware Factorization Machine for Sparse Prediction

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/203 ◽

2019 ◽

Cited By ~ 2

Author(s):

Yantao Yu ◽

Zhen Wang ◽

Bo Yuan

Keyword(s):

Neural Network ◽

Deep Learning ◽

Real World ◽

State Of The Art ◽

Overall Performance ◽

Factorization Machine ◽

The Impact ◽

Novel Model ◽

Individual Input ◽

Better Than

Factorization machines (FMs) are a class of general predictors that work effectively with sparse data and represent features using factorized parameters and weights. However, the accuracy of FMs can be adversely affected by the fixed representation trained for each feature, as the same feature is usually not equally predictive and useful in different instances. In fact, the inaccurate representation of features may even introduce noise and degrade the overall performance. In this work, we improve FMs by explicitly considering the impact of each individual input on the representation of features. We propose a novel model named Input-aware Factorization Machine (IFM), which learns a unique input-aware factor for the same feature in different instances via a neural network. Comprehensive experiments on three real-world recommendation datasets are used to demonstrate the effectiveness and mechanism of IFM. Empirical results indicate that IFM is significantly better than the standard FM model and consistently outperforms four state-of-the-art deep learning based methods.
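For reference, the standard second-order FM that the abstract uses as its baseline scores an instance as a bias, plus a linear term, plus pairwise interactions weighted by inner products of per-feature factor vectors. The sketch below implements only that standard model, using the usual linear-time rewriting of the pairwise sum. It does not implement IFM's input-aware factor, and the function and argument names are hypothetical.

```python
def fm_predict(x, w0, w, v):
    """Second-order factorization machine score for one instance.

    x: feature values (length n); w0: bias; w: linear weights (length n);
    v: one length-k factor vector per feature.
    Computes w0 + sum_i w[i]*x[i] + sum_{i<j} <v[i], v[j]> * x[i] * x[j].
    """
    linear = sum(w_i * x_i for w_i, x_i in zip(w, x))
    k = len(v[0])
    pairwise = 0.0
    for f in range(k):
        # Identity: sum_{i<j} a_i * a_j = 0.5 * ((sum a_i)^2 - sum a_i^2)
        terms = [v[i][f] * x[i] for i in range(len(x))]
        total = sum(terms)
        pairwise += 0.5 * (total * total - sum(t * t for t in terms))
    return w0 + linear + pairwise
```

Because `v` is fixed after training, every instance sees the same representation of a given feature, which is the limitation the abstract sets out to address.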

Var-CNN: A Data-Efficient Website Fingerprinting Attack Based on Deep Learning

Proceedings on Privacy Enhancing Technologies ◽

10.2478/popets-2019-0070 ◽

2019 ◽

Vol 2019 (4) ◽

pp. 292-310 ◽

Cited By ~ 10

Author(s):

Sanjit Bhat ◽

David Lu ◽

Albert Kwon ◽

Srinivas Devadas

Keyword(s):

Deep Learning ◽

State Of The Art ◽

False Positive Rate ◽

True Positive Rate ◽

Training Data ◽

Open World ◽

Prior Art ◽

Lower False Positive Rate ◽

Positive Rate ◽

Fingerprinting Attack

In recent years, there have been several works that use website fingerprinting techniques to enable a local adversary to determine which website a Tor user visits. While the current state-of-the-art attack, which uses deep learning, outperforms prior art with medium to large amounts of data, it attains marginal to no accuracy improvements when both use small amounts of training data. In this work, we propose Var-CNN, a website fingerprinting attack that leverages deep learning techniques along with novel insights specific to packet sequence classification. In open-world settings with large amounts of data, Var-CNN attains over 1% higher true positive rate (TPR) than state-of-the-art attacks while achieving 4× lower false positive rate (FPR). Var-CNN’s improvements are especially notable in low-data scenarios, where it reduces the FPR of prior art by 3.12% while increasing the TPR by 13%. Overall, insights used to develop Var-CNN can be applied to future deep learning based attacks, and substantially reduce the amount of training data needed to perform a successful website fingerprinting attack. This shortens the time needed for data collection and lowers the likelihood of having data staleness issues.
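The two metrics quoted above follow the standard definitions: TPR is the fraction of actual positives that are flagged, and FPR is the fraction of actual negatives that are wrongly flagged. A minimal sketch, with a hypothetical function name and made-up counts in the usage note:

```python
def rates(tp, fp, tn, fn):
    """Return (TPR, FPR) from confusion-matrix counts.

    TPR = TP / (TP + FN); FPR = FP / (FP + TN).
    A rate is reported as 0.0 when its denominator is zero.
    """
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr
```

For example, `rates(tp=90, fp=5, tn=95, fn=10)` describes a classifier that catches 90 of 100 monitored-site visits and wrongly flags 5 of 100 unmonitored ones. These counts are illustrative only and are not results from the paper.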

LABEL-EFFICIENT DEEP LEARNING-BASED SEMANTIC SEGMENTATION OF BUILDING POINT CLOUDS AT LOD3 LEVEL

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xliii-b2-2021-449-2021 ◽

2021 ◽

Vol XLIII-B2-2021 ◽

pp. 449-456

Author(s):

Y. Cao ◽

M. Scaioni

Keyword(s):

Deep Learning ◽

State Of The Art ◽

Semantic Segmentation ◽

Point Clouds ◽

Training Data ◽

Second Step ◽

Dynamic Graph ◽

Input Point ◽

Supervised Methods ◽

Global And Local

In recent research, fully supervised Deep Learning (DL) techniques and large amounts of pointwise labels are employed to train a segmentation network to be applied to buildings’ point clouds. However, fine-labelled buildings’ point clouds are hard to find and manually annotating pointwise labels is time-consuming and expensive. Consequently, the application of fully supervised DL for semantic segmentation of buildings’ point clouds at LoD3 level is severely limited. To address this issue, we propose a novel label-efficient DL network that obtains per-point semantic labels of LoD3 buildings’ point clouds with limited supervision. In general, it consists of two steps. The first step (Autoencoder – AE) is composed of a Dynamic Graph Convolutional Neural Network-based encoder and a folding-based decoder, designed to extract discriminative global and local features from input point clouds by reconstructing them without any label. The second step is semantic segmentation. By supplying a small amount of task-specific supervision, a segmentation network is proposed for semantically segmenting the encoded features acquired from the pre-trained AE. Experimentally, we evaluate our approach based on the ArCH dataset. Compared to the fully supervised DL methods, we find that our model achieved state-of-the-art results on the unseen scenes, with only 10% of labelled training data from fully supervised methods as input.

Deep industrial transfer learning at runtime for image recognition

at - Automatisierungstechnik ◽

10.1515/auto-2020-0119 ◽

2021 ◽

Vol 69 (3) ◽

pp. 211-220

Author(s):

Benjamin Maschler ◽

Simon Kamm ◽

Michael Weyrich

Keyword(s):

Deep Learning ◽

Transfer Learning ◽

State Of The Art ◽

Training Data ◽

Use Case ◽

Industrial Transfer ◽

Distributed Training ◽

Two Factors ◽

Changes Over Time ◽

Over Time

The utilization of deep learning in the field of industrial automation is hindered by two factors: the amount and diversity of training data needed, and the need to continuously retrain as the use case changes over time. Both problems can be addressed by industrial deep transfer learning, which allows for performant, continuous and potentially distributed training on small, dispersed datasets. As a specific example, a dual memory algorithm for computer vision problems is developed and evaluated. It shows the potential for state-of-the-art performance while being trained only on fractions of the complete ImageNet dataset at multiple locations at once.

Bootstrapping Entity Alignment with Knowledge Graph Embedding

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/611 ◽

2018 ◽

Cited By ~ 35

Author(s):

Zequn Sun ◽

Wei Hu ◽

Qingheng Zhang ◽

Yuzhong Qu

Keyword(s):

Performance Improvement ◽

Real World ◽

State Of The Art ◽

Graph Embedding ◽

Training Data ◽

Knowledge Graph ◽

Error Accumulation ◽

Knowledge Graphs ◽

Real World Datasets ◽

Low Dimensional

Embedding-based entity alignment represents different knowledge graphs (KGs) as low-dimensional embeddings and finds entity alignment by measuring the similarities between entity embeddings. Existing approaches have achieved promising results; however, they are still challenged by the lack of sufficient prior alignment to serve as labeled training data. In this paper, we propose a bootstrapping approach to embedding-based entity alignment. It iteratively labels likely entity alignment as training data for learning alignment-oriented KG embeddings. Furthermore, it employs an alignment editing method to reduce error accumulation during iterations. Our experiments on real-world datasets showed that the proposed approach significantly outperformed the state-of-the-art embedding-based ones for entity alignment. The proposed alignment-oriented KG embedding, bootstrapping process and alignment editing method all contributed to the performance improvement.
