Deep learning based approach to unstructured record linkage

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Anna Jurek-Loughrey

Purpose In the world of big data, data integration technology is crucial for maximising the capability of data-driven decision-making. Integrating data from multiple sources drastically expands the power of information and allows us to address questions that are impossible to answer using a single data source. Record Linkage (RL) is the task of identifying and linking records from multiple sources that describe the same real-world object (e.g. a person), and it plays a crucial role in the data integration process. RL is challenging, as it is uncommon for different data sources to share a unique identifier. Hence, the records must be matched based on a comparison of their corresponding values. Most existing RL techniques assume that records across different data sources are structured and represented by the same schema (i.e. set of attributes). Given the increasing number of heterogeneous data sources, those assumptions are rather unrealistic. The purpose of this paper is to propose a novel RL model for unstructured data. Design/methodology/approach In previous work (Jurek-Loughrey, 2020), the authors proposed a novel approach to linking unstructured data based on the application of the Siamese Multilayer Perceptron model. It was demonstrated that the method performed on par with other approaches that make constraining assumptions regarding the data. This paper expands the previous work originally presented at iiWAS2020 [16] by exploring new architectures of the Siamese Neural Network, which improve the generalisation of the RL model and make it less sensitive to parameter selection. Findings The experimental results confirm that the new Autoencoder-based architecture of the Siamese Neural Network obtains better results than the Siamese Multilayer Perceptron model proposed in (Jurek et al., 2020). Better results have been achieved in three out of four data sets. Furthermore, it has been demonstrated that the second proposed (hybrid) architecture, based on integrating the Siamese Autoencoder with a Multilayer Perceptron model, makes the model more stable in terms of parameter selection. Originality/value To address the problem of unstructured RL, this paper presents a new deep learning based approach to improve the generalisation of the Siamese Multilayer Perceptron model and make it less sensitive to parameter selection.
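
Below is a minimal PyTorch sketch of the Siamese idea described above: a shared encoder maps each record's vector representation into a common space and a classifier scores the pair as match/non-match. The input dimensionality, hidden sizes and the absolute-difference merge are illustrative assumptions, not the authors' exact Siamese MLP or Autoencoder configuration.

```python
# Hedged sketch: a minimal Siamese MLP for record linkage, assuming each record is
# already vectorised (e.g. character n-gram counts); sizes are illustrative only.
import torch
import torch.nn as nn

class SiameseMLP(nn.Module):
    def __init__(self, input_dim=1000, hidden_dim=128):
        super().__init__()
        # Shared encoder applied to both records (weight tying is what makes it "Siamese").
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Match/non-match decision from the element-wise absolute difference.
        self.classifier = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, record_a, record_b):
        ea, eb = self.encoder(record_a), self.encoder(record_b)
        return self.classifier(torch.abs(ea - eb))

model = SiameseMLP()
a, b = torch.rand(4, 1000), torch.rand(4, 1000)   # a batch of 4 candidate record pairs
match_prob = model(a, b)                          # probability that each pair refers to the same entity
print(match_prob.shape)                           # torch.Size([4, 1])
```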

2015 ◽  
Vol 11 (3) ◽  
pp. 370-396 ◽  
Author(s):  
Tuan-Dat Trinh ◽  
Peter Wetz ◽  
Ba-Lam Do ◽  
Elmar Kiesling ◽  
A Min Tjoa

Purpose – This paper aims to present a collaborative mashup platform for dynamic integration of heterogeneous data sources. The platform encourages sharing and connects data publishers, integrators, developers and end users. Design/methodology/approach – This approach is based on a visual programming paradigm and follows three fundamental principles: openness, connectedness and reusability. The platform is based on semantic Web technologies and the concept of linked widgets, i.e. semantic modules that allow users to access, integrate and visualize data in a creative and collaborative manner. Findings – The platform can effectively tackle data integration challenges by allowing users to explore relevant data sources for different contexts, tackling the data heterogeneity problem and facilitating automatic data integration, easing data integration via simple operations and fostering reusability of data processing tasks. Research limitations/implications – This research has focused exclusively on conceptual and technical aspects so far; a comprehensive user study as well as extensive performance and scalability testing are left for future work. Originality/value – A key contribution of this paper is the concept of distributed mashups. These ad hoc data integration applications allow users to perform data processing tasks in a collaborative and distributed manner simultaneously on multiple devices. This approach requires no server infrastructure to upload data, but rather allows each user to keep control over their data and expose only relevant subsets. Distributed mashups can run persistently in the background and are hence ideal for real-time data monitoring or data streaming use cases. Furthermore, we introduce automatic mashup composition as an innovative approach based on an explicit semantic widget model.


Author(s):  
Lihua Lu ◽  
Hengzhen Zhang ◽  
Xiao-Zhi Gao

Purpose – Data integration aims to combine data residing at different sources and to provide users with a unified interface to these data. An important issue in data integration is the existence of conflicts among the different data sources. Data sources may conflict with each other at the data level, which is defined as data inconsistency. The purpose of this paper is to address this problem and propose a solution for data inconsistency in data integration. Design/methodology/approach – A relational data model extended with data source quality criteria is first defined. Then, based on the proposed data model, a data inconsistency solution strategy is provided. To accomplish the strategy, a fuzzy multi-attribute decision-making (MADM) approach based on data source quality criteria is applied to obtain the results. Finally, user feedback strategies are proposed to optimize the result of the fuzzy MADM approach into the final data inconsistency solution. Findings – To evaluate the proposed method, data obtained from sensors are extracted. Some experiments are designed and performed to demonstrate the effectiveness of the proposed strategy. The results substantiate that the solution has a better performance than the other methods on correctness, time cost and stability indicators. Practical implications – Since inconsistent data collected from sensors are pervasive, the proposed method can solve this problem and correct wrong choices to some extent. Originality/value – In this paper, for the first time, the authors study the effect of user feedback on integration results for inconsistent data.
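
As a rough illustration of the source-ranking idea described above, the sketch below resolves one conflicting attribute by scoring each source against quality criteria. The paper uses a fuzzy MADM approach refined by user feedback; this crisp weighted-sum ranking, with made-up criteria and weights, is only a simplified stand-in.

```python
# Hedged sketch: resolving one inconsistent attribute by ranking conflicting sources
# on quality criteria. The criteria, weights and values below are invented placeholders.
import numpy as np

sources = ["sensor_A", "sensor_B", "sensor_C"]
conflicting_values = {"sensor_A": 21.4, "sensor_B": 19.8, "sensor_C": 21.6}

# Rows: sources; columns: quality criteria, e.g. accuracy, freshness, completeness (all in [0, 1]).
quality = np.array([
    [0.9, 0.6, 0.8],
    [0.5, 0.9, 0.7],
    [0.8, 0.7, 0.9],
])
weights = np.array([0.5, 0.2, 0.3])      # relative importance of the criteria

scores = quality @ weights               # weighted-sum score per source
best = sources[int(np.argmax(scores))]
print(best, conflicting_values[best])    # value adopted for the integrated record
```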


2021 ◽  
Author(s):  
Ewerthon Dyego de Araújo Batista ◽  
Wellington Candeia de Araújo ◽  
Romeryto Vieira Lira ◽  
Laryssa Izabel de Araújo Batista

Dengue is a public health problem in Brazil, and cases of the disease are rising again in Paraíba. The epidemiological bulletin of Paraíba, released in August 2021, reports a 53% increase in cases compared to the previous year. Machine Learning (ML) and Deep Learning techniques are being used as tools for predicting the disease and supporting efforts to combat it. Using the techniques Random Forest (RF), Support Vector Regression (SVR), Multilayer Perceptron (MLP), Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN), this article presents a system capable of forecasting dengue-related hospital admissions for the cities of Bayeux, Cabedelo, João Pessoa and Santa Rita. The system produced forecasts for Bayeux with an error rate of 0.5290, while for Cabedelo the error was 0.92742, for João Pessoa 9.55288 and for Santa Rita 0.74551.
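
As an illustration of this forecasting setup, the sketch below trains a Random Forest (one of the techniques listed above) on lagged weekly admission counts. The synthetic series, lag window and train/test split are assumptions for demonstration only; they do not reproduce the article's data or error figures.

```python
# Hedged sketch: forecasting weekly admissions from lagged counts with a Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
weekly_admissions = rng.poisson(lam=12, size=120).astype(float)   # synthetic weekly series

def make_lagged(series, n_lags=4):
    # Build (features, target) pairs where the last n_lags weeks predict the next week.
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])
        y.append(series[t])
    return np.array(X), np.array(y)

X, y = make_lagged(weekly_admissions)
X_train, X_test = X[:-12], X[-12:]
y_train, y_test = y[:-12], y[-12:]

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)
mae = np.mean(np.abs(pred - y_test))
print(f"MAE over the last 12 weeks: {mae:.2f}")
```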


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Venkateswara Rao Kota ◽  
Shyamala Devi Munisamy

Purpose A neural network (NN)-based deep learning (DL) approach is considered for sentiment analysis (SA), incorporating a convolutional neural network (CNN), bi-directional long short-term memory (Bi-LSTM) and attention methods. Unlike conventional supervised machine learning natural language processing algorithms, the authors have used unsupervised deep learning algorithms. Design/methodology/approach The method presented for sentiment analysis is designed using a CNN, Bi-LSTM and the attention mechanism. Word2vec word embedding is used for natural language processing (NLP). The discussed approach is designed for sentence-level SA and consists of one embedding layer, two convolutional layers with max-pooling, one LSTM layer and two fully connected (FC) layers. Overall, the system training time is 30 min. Findings The method's performance is analyzed using metrics such as precision, recall, F1 score and accuracy. The CNN helps reduce complexity and the Bi-LSTM helps process long input text sequences. Originality/value The attention mechanism is adopted to decide the significance of every hidden state and produce a weighted sum of all the features fed as input.
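
A minimal PyTorch sketch of the layer stack described above (embedding, two convolution/max-pooling blocks, a Bi-LSTM, attention over the hidden states, and fully connected layers). Vocabulary size, channel counts and other dimensions are assumptions for illustration, not the authors' trained configuration.

```python
# Hedged sketch: CNN + Bi-LSTM + attention sentence classifier with illustrative dimensions.
import torch
import torch.nn as nn

class CnnBiLstmAttention(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=128, hidden=64, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Sequential(
            nn.Conv1d(emb_dim, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.bilstm = nn.LSTM(128, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)          # scores each hidden state
        self.fc = nn.Sequential(nn.Linear(2 * hidden, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, token_ids):                     # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2) # (batch, emb_dim, seq_len) for Conv1d
        x = self.conv(x).transpose(1, 2)              # (batch, reduced_len, 128)
        h, _ = self.bilstm(x)                         # (batch, reduced_len, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # attention over time steps
        context = (weights * h).sum(dim=1)            # weighted sum of hidden states
        return self.fc(context)

model = CnnBiLstmAttention()
logits = model(torch.randint(0, 20000, (8, 60)))      # batch of 8 sentences, 60 tokens each
print(logits.shape)                                   # torch.Size([8, 2])
```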


Author(s):  
Hyunjung Cheon ◽  
Charles M. Katz ◽  
Vincent J. Webb

Purpose Although trafficking of persons for commercial sex has been increasingly recognized as a community-level problem, most estimates of the prevalence of sex trafficking in the USA are made by federal entities and vary depending on the data sources used. Little is known about how local police agencies assess and understand sex trafficking in their own communities. The paper aims to discuss this issue. Design/methodology/approach To help fill this gap, the current study, using survey data from a sample of local police agencies across the USA (n=72), examines law enforcement agencies’ knowledge of and experience with addressing local sex trafficking problems in their jurisdiction. Findings The majority of police agencies reported that sex trafficking is a problem in their jurisdictions and that they have a special unit with primary responsibility for addressing sex trafficking issues. Agencies with a special unit tend to use multiple sources of information, including official records, intelligence data and personal experience, to estimate the community’s trafficking problems when compared to agencies without a unit; however, most agencies primarily depend on their professional experience. Originality/value This is the first study to examine the data sources used by local police agencies to estimate the scope and nature of their community’s sex trafficking problem, and the findings have important policy implications for understanding the reliability and validity of these estimates, and for their potential use to develop and implement data-driven responses to sex trafficking problems.


Author(s):  
Wojciech Pietrowski

Purpose Diagnostics of electrical machines is a very important task. The purpose of this paper is to present the coupling of three numerical techniques (finite element analysis, signal analysis and an artificial neural network) in the diagnostics of electrical machines. The study focused on detection of a time-varying inter-turn short-circuit in a stator winding of an induction motor. Design/methodology/approach The finite element method is widely used for the calculation of phase current waveforms of induction machines. In the presented results, a time-varying inter-turn short-circuit of the stator winding has been taken into account in the elaborated field-circuit model of the machine. One of the time-varying short-circuit symptoms is a time-varying resistance of the shorted circuit and, consequently, the waveform of the phase current. A general regression neural network (GRNN) has been elaborated to find the number of shorted turns on the basis of the fast Fourier transform (FFT) of the phase current. The input vector of the GRNN has been built from the FFT of the phase current waveform, and the output vector from the values of resistance of the shorted circuit for the respective numbers of shorted turns. The performance of the GRNN was compared with that of a multilayer perceptron neural network. Findings The GRNN can contribute to better detection of the time-varying inter-turn short-circuit in the stator winding than the multilayer perceptron neural network. Originality/value It is argued that the proposed method, based on the FFT of the phase current and a GRNN, is capable of detecting a time-varying inter-turn short-circuit. The GRNN can be used in a health monitoring system as an inference module.
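
To make the GRNN step concrete, here is a sketch in which a Nadaraya-Watson-style general regression network maps the FFT magnitude spectrum of a phase current to a number of shorted turns. The toy current waveforms, the fault-dependent harmonic and the smoothing width are invented placeholders; the paper derives its spectra from a field-circuit finite element model.

```python
# Hedged sketch: GRNN (kernel regression) over FFT spectra of a synthetic phase current.
import numpy as np

def grnn_predict(X_train, y_train, x, sigma=1.0):
    # Radial-basis weight of each training spectrum with respect to the query spectrum.
    d2 = np.sum((X_train - x) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.sum(w * y_train) / (np.sum(w) + 1e-12)

t = np.arange(0, 0.2, 1 / 10_000)                     # 0.2 s sampled at 10 kHz

def spectrum(n_shorted):
    # Toy phase current: 50 Hz fundamental plus a fault-dependent third harmonic.
    i = np.sin(2 * np.pi * 50 * t) + 0.02 * n_shorted * np.sin(2 * np.pi * 150 * t)
    return np.abs(np.fft.rfft(i))[:200]               # low-frequency part of the magnitude spectrum

X_train = np.array([spectrum(n) for n in range(10)])
y_train = np.arange(10, dtype=float)                  # known numbers of shorted turns

query = spectrum(4)
print(grnn_predict(X_train, y_train, query, sigma=5.0))  # should be close to 4
```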


2020 ◽  
Author(s):  
Dongdong Zhang ◽  
Changchang Yin ◽  
Jucheng Zeng ◽  
Xiaohui Yuan ◽  
Ping Zhang

Background: The broad adoption of Electronic Health Records (EHRs) provides great opportunities to conduct health care research and solve various clinical problems in medicine. With recent advances and success, methods based on machine learning and deep learning have become increasingly popular in medical informatics. However, while many research studies utilize temporal structured data for predictive modeling, they typically neglect potentially valuable information in unstructured clinical notes. Integrating heterogeneous data types across EHRs through deep learning techniques may help improve the performance of prediction models. Methods: In this research, we proposed 2 general-purpose multi-modal neural network architectures to enhance patient representation learning by combining sequential unstructured notes with structured data. The proposed fusion models leverage document embeddings for the representation of long clinical note documents, either convolutional neural networks or long short-term memory networks to model the sequential clinical notes and temporal signals, and one-hot encoding for static information representation. The concatenated representation is the final patient representation, which is used to make predictions. Results: We evaluate the performance of the proposed models on 3 risk prediction tasks (i.e., in-hospital mortality, 30-day hospital readmission, and long length of stay prediction) using data derived from the publicly available Medical Information Mart for Intensive Care III dataset. Our results show that by combining unstructured clinical notes with structured data, the proposed models outperform models that utilize either unstructured notes or structured data only. Conclusions: The proposed fusion models learn better patient representations by combining structured and unstructured data. Integrating heterogeneous data types across EHRs helps improve the performance of prediction models and reduce errors.
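
A compact PyTorch sketch of the fusion idea: one recurrent branch over precomputed note embeddings, one over temporal signals, a dense branch for static one-hot features, and a concatenated patient representation feeding the prediction head. All dimensions, and the choice of LSTM for both sequential branches, are illustrative assumptions rather than the paper's exact MIMIC-III configuration.

```python
# Hedged sketch: multi-modal fusion of note embeddings, vitals and static features.
import torch
import torch.nn as nn

class FusionModel(nn.Module):
    def __init__(self, note_dim=200, vitals_dim=17, static_dim=10, hidden=64):
        super().__init__()
        self.note_rnn = nn.LSTM(note_dim, hidden, batch_first=True)     # sequence of note embeddings
        self.vitals_rnn = nn.LSTM(vitals_dim, hidden, batch_first=True) # temporal signals
        self.static_fc = nn.Linear(static_dim, hidden)                  # one-hot static information
        self.head = nn.Sequential(nn.Linear(3 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, note_seq, vitals_seq, static_onehot):
        _, (hn_notes, _) = self.note_rnn(note_seq)        # last hidden state of the notes branch
        _, (hn_vitals, _) = self.vitals_rnn(vitals_seq)   # last hidden state of the vitals branch
        s = torch.relu(self.static_fc(static_onehot))
        patient = torch.cat([hn_notes[-1], hn_vitals[-1], s], dim=1)   # fused patient representation
        return self.head(patient)                         # e.g. in-hospital mortality logit

model = FusionModel()
logit = model(torch.rand(2, 5, 200), torch.rand(2, 48, 17), torch.rand(2, 10))
print(logit.shape)   # torch.Size([2, 1])
```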


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Faris Elghaish ◽  
Saeed Talebi ◽  
Essam Abdellatef ◽  
Sandra T. Matarneh ◽  
M. Reza Hosseini ◽  
...  

Purpose This paper aims to test the capabilities/accuracies of four pre-trained deep learning convolutional neural network (CNN) models in detecting and classifying types of highway cracks, as well as to develop a new CNN model that maximizes accuracy at different learning rates. Design/methodology/approach A sample of 4,663 images of highway cracks was collected and classified into three categories of cracks, namely, “vertical cracks,” “horizontal and vertical cracks” and “diagonal cracks.” Subsequently, “Matlab” was used to split the sample into training (70%) and testing (30%) sets, to apply the four deep learning CNN models and to compute their accuracies. A new deep learning CNN model was then developed to maximize the accuracy of detecting and classifying highway cracks, and its accuracy was tested using three optimization algorithms at different learning rates. Findings The accuracy results of the four pre-trained deep learning models are above the averages of their top-1 and top-5 benchmarks; the accuracy of classifying and detecting the samples exceeded the top-5 accuracy of the pre-trained AlexNet model by around 3% and that of the GoogleNet model by 0.2%. The most accurate pre-trained model is GoogleNet, with an accuracy of 89.08%, which is 1.26% higher than AlexNet. The computed accuracy of the newly created deep learning CNN model exceeded all pre-trained models, achieving 97.62% at a learning rate of 0.001 using the Adam optimization algorithm. Practical implications The created deep learning CNN model will enable users (e.g. highway agencies) to scan a long highway and detect types of cracks accurately in a very short time compared to traditional approaches. Originality/value A new deep learning CNN-based highway crack detection model was developed based on testing four pre-trained CNN models and analyzing the capabilities of each model to maximize the accuracy of the proposed CNN.
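
For orientation, the sketch below shows a typical transfer-learning setup of the kind the abstract describes: a pre-trained CNN (AlexNet, via torchvision) has its ImageNet head replaced with a three-class crack classifier and is fine-tuned with Adam at a 0.001 learning rate. The paper's experiments were run in Matlab, so this PyTorch version with a dummy batch is only illustrative.

```python
# Hedged sketch: fine-tuning a pre-trained CNN for three crack classes with Adam, lr=0.001.
import torch
import torch.nn as nn
from torchvision import models

n_classes = 3   # vertical, horizontal-and-vertical, diagonal cracks
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, n_classes)   # replace the 1000-way ImageNet head

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of 224x224 RGB crack images.
images = torch.rand(8, 3, 224, 224)
labels = torch.randint(0, n_classes, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(f"loss after one step: {loss.item():.3f}")
```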


2022 ◽  
Author(s):  
Isaac Ronald Ward ◽  
Jack Joyner ◽  
Casey Lickfold ◽  
Yulan Guo ◽  
Mohammed Bennamoun

Graph neural networks (GNNs) have recently grown in popularity in the field of artificial intelligence (AI) due to their unique ability to ingest relatively unstructured data types as input. Although some elements of the GNN architecture are conceptually similar in operation to traditional neural networks (and neural network variants), other elements represent a departure from traditional deep learning techniques. This tutorial exposes the power and novelty of GNNs to AI practitioners by collating and presenting details regarding the motivations, concepts, mathematics and applications of the most common and performant variants of GNNs. We present this material concisely, alongside practical examples, thus providing an accessible tutorial on the topic of GNNs.
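
As a self-contained taste of the message-passing idea such tutorials build on, here is one graph-convolution layer, H' = ReLU(D^{-1/2}(A + I)D^{-1/2} H W), on a toy four-node graph in NumPy. This generic example is not drawn from the tutorial itself.

```python
# Hedged sketch: a single graph-convolution (message-passing) layer on a toy graph.
import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)          # adjacency of a small undirected graph
H = np.random.default_rng(0).normal(size=(4, 8))   # 8-dimensional node features
W = np.random.default_rng(1).normal(size=(8, 4))   # learnable weights (4 output features)

A_hat = A + np.eye(4)                              # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
H_next = np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)   # aggregate neighbours, transform, activate
print(H_next.shape)                                # (4, 4): one updated embedding per node
```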


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Ran Feng ◽  
Xiaoe Qu

Purpose To identify and analyze the occurrence of Internet financial market risk, data mining technology is combined with deep learning to process and analyze Internet financial data. The aim of Internet market risk management is to improve the management of Internet financial risk, improve Internet financial supervision policy and promote the healthy development of Internet finance. Design/methodology/approach In this exploration, data mining technology is combined with deep learning to mine Internet financial data, warn of potential risks in the market and provide targeted risk management measures. Therefore, to improve the ability of data mining to deal with Internet financial risk management, this article proposes a radial basis function (RBF) neural network algorithm optimized by ant colony optimization (ACO). Findings The results show that the actual error of the ACO-optimized RBF neural network is 0.249, which differs from the target error by 0.149, indicating that the optimized algorithm makes the calculation results more accurate. The fitting results of the RBF neural network and the ACO-optimized RBF neural network for a nonlinear function are compared. Compared with the performance of the other algorithms, the error of the ACO-optimized RBF neural network is 0.249, the running time is 2.212 s, and the number of iterations is 36, which is far lower than the actual results of the other two algorithms. Originality/value The optimized algorithm has better spatial mapping and generalization ability and can achieve higher accuracy with short-term training. Therefore, the ACO-optimized RBF neural network algorithm designed in this exploration has high accuracy for the prediction of Internet financial market risk.
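
To illustrate the nonlinear-function fitting experiment mentioned above, the sketch below fits a small radial basis function network by least squares. The paper tunes the RBF parameters with ant colony optimization; here a simple random search over the kernel width stands in for ACO, purely for demonstration.

```python
# Hedged sketch: an RBF network fitted to a noisy nonlinear function, with a random
# search over the kernel width standing in for the paper's ACO tuning step.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(2 * x).ravel() + 0.1 * rng.normal(size=200)      # noisy nonlinear target

centres = np.linspace(-3, 3, 15).reshape(-1, 1)              # fixed RBF centres

def fit_predict(sigma):
    Phi = np.exp(-((x - centres.T) ** 2) / (2 * sigma ** 2)) # design matrix of RBF activations
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)              # output weights by least squares
    return Phi @ w

# Stand-in for the ACO search: try several widths and keep the best-fitting one.
best_sigma = min((rng.uniform(0.05, 1.5) for _ in range(50)),
                 key=lambda s: np.mean((fit_predict(s) - y) ** 2))
print(f"selected width: {best_sigma:.3f}, "
      f"MSE: {np.mean((fit_predict(best_sigma) - y) ** 2):.4f}")
```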

