Parallel discrepancy detection and incremental detection

2021 ◽  
Vol 14 (8) ◽  
pp. 1351-1364
Author(s):  
Wenfei Fan ◽  
Chao Tian ◽  
Yanghao Wang ◽  
Qiang Yin

This paper studies how to catch duplicates, mismatches and conflicts in the same process. We adopt a class of entity enhancing rules that embed machine learning predicates, unify entity resolution and conflict resolution, and are collectively defined across multiple relations. We detect discrepancies as violations of such rules. We establish the complexity of discrepancy detection and incremental detection problems with the rules; they are both NP-complete and W[1]-hard. To cope with the intractability and scale with large datasets, we develop parallel algorithms and parallel incremental algorithms for discrepancy detection. We show that both algorithms are parallelly scalable, i.e., they are guaranteed to reduce runtime when more processors are used. Moreover, the parallel incremental algorithm is relatively bounded. The complexity bounds and algorithms carry over to denial constraints, a special case of the entity enhancing rules. Using real-life and synthetic datasets, we experimentally verify the effectiveness, scalability and efficiency of the algorithms.

Interval data mining is used to extract unknown patterns, hidden rules, associations, etc. from interval-based data. The extraction of closed intervals is important because, by mining the set of closed intervals and their support counts, the support count of any interval can be computed easily. In this work, an incremental algorithm for computing closed intervals together with their support counts from an interval dataset is proposed. Many methods for mining closed intervals are available. Most of these methods assume a static dataset as input, and hence the algorithms are non-incremental. Real-life datasets are, however, dynamic by nature. An efficient incremental algorithm called CI-Tree has already been proposed for computing the closed intervals present in dynamic interval data. However, this method could not compute the support values of the closed intervals. The proposed algorithm, called SCI-Tree, extracts all closed intervals together with their support values incrementally from the given interval data. Also, all the frequent closed intervals can be computed for any user-defined minimum support with a single scan of the SCI-Tree, without revisiting the dataset. The proposed method has been tested with real-life and synthetic datasets, and the results are reported.
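
To make the notions of support and closedness concrete, the following minimal Python sketch computes both by brute force. It assumes (the abstract does not state this) that an interval is supported by every dataset interval that contains it, and that an interval is closed when no proper super-interval has the same support. The names `support`, `closure` and `is_closed` are illustrative only; this is not SCI-Tree, which maintains such information incrementally in a tree.

```python
def support(interval, dataset):
    """Count dataset intervals [a, b] that contain the query interval (assumed definition)."""
    lo, hi = interval
    return sum(1 for a, b in dataset if a <= lo and hi <= b)


def closure(interval, dataset):
    """Intersection of all dataset intervals containing the query, or None if unsupported."""
    lo, hi = interval
    supporters = [(a, b) for a, b in dataset if a <= lo and hi <= b]
    if not supporters:
        return None
    return (max(a for a, _ in supporters), min(b for _, b in supporters))


def is_closed(interval, dataset):
    """Under the assumed definition, an interval is closed iff it equals its own closure."""
    return closure(interval, dataset) == tuple(interval)


# Hypothetical toy dataset of closed intervals [start, end].
data = [(1, 5), (2, 6), (3, 4)]
print(support((2, 4), data), is_closed((2, 4), data))  # 2 False
print(support((2, 5), data), is_closed((2, 5), data))  # 2 True
```

Here (2, 4) is not closed because its super-interval (2, 5) has the same support, which illustrates why the support of any interval can be read off from the closed intervals alone.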


2021 ◽  
Vol 16 (2) ◽  
pp. 1-31
Author(s):  
Chunkai Zhang ◽  
Zilin Du ◽  
Yuting Yang ◽  
Wensheng Gan ◽  
Philip S. Yu

Utility mining has emerged as an important and interesting topic owing to its wide application and considerable popularity. However, conventional utility mining methods have a bias toward items that have longer on-shelf time as they have a greater chance to generate a high utility. To eliminate the bias, the problem of on-shelf utility mining (OSUM) is introduced. In this article, we focus on the task of OSUM of sequence data, where the sequential database is divided into several partitions according to time periods and items are associated with utilities and several on-shelf time periods. To address the problem, we propose two methods, OSUM of sequence data (OSUMS) and OSUMS+, to extract on-shelf high-utility sequential patterns. For further efficiency, we also design several strategies to reduce the search space and avoid redundant calculation with two upper bounds, time prefix extension utility (TPEU) and time reduced sequence utility (TRSU). In addition, two novel data structures are developed for facilitating the calculation of upper bounds and utilities. Substantial experimental results on certain real and synthetic datasets show that the two methods outperform the state-of-the-art algorithm. In conclusion, OSUMS may consume a large amount of memory and is unsuitable for cases with limited memory, while OSUMS+ has wider real-life applications owing to its high efficiency.


2021 ◽  
Vol 14 (3) ◽  
pp. 1-21
Author(s):  
Roy Abitbol ◽  
Ilan Shimshoni ◽  
Jonathan Ben-Dov

The task of assembling fragments in a puzzle-like manner into a composite picture plays a significant role in the field of archaeology as it supports researchers in their attempt to reconstruct historic artifacts. In this article, we propose a method for matching and assembling pairs of ancient papyrus fragments containing mostly unknown scriptures. Papyrus paper is manufactured from papyrus plants and therefore portrays typical thread patterns resulting from the plant’s stems. The proposed algorithm is founded on the hypothesis that these thread patterns contain unique local attributes such that nearby fragments show similar patterns reflecting the continuations of the threads. We posit that these patterns can be exploited using image processing and machine learning techniques to identify matching fragments. The algorithm and system which we present support the quick and automated classification of matching pairs of papyrus fragments as well as the geometric alignment of the pairs against each other. The algorithm consists of a series of steps and is based on deep learning and machine learning methods. The first step is to deconstruct the problem of matching fragments into a smaller problem of finding thread continuation matches in local edge areas (squares) between pairs of fragments. This phase is solved using a convolutional neural network ingesting raw images of the edge areas and producing local matching scores. This stage yields very high recall but low precision. Thus, we use these scores to decide whether entire fragment pairs match by establishing an elaborate voting mechanism. We enhance this voting with geometric alignment techniques from which we extract additional spatial information. Eventually, we feed all the data collected from these steps into a Random Forest classifier in order to produce a higher order classifier capable of predicting whether a pair of fragments is a match.
Our algorithm was trained on a batch of fragments which was excavated from the Dead Sea caves and is dated circa the 1st century BCE. The algorithm shows excellent results on a validation set which is of a similar origin and conditions. We then tried to run the algorithm against a real-life set of fragments for which we have no prior knowledge or labeling of matches. This test batch is considered extremely challenging due to its poor condition and the small size of its fragments. Evidently, numerous researchers have tried seeking matches within this batch with very little success. Our algorithm performance on this batch was sub-optimal, returning a relatively large ratio of false positives. However, the algorithm was quite useful by eliminating 98% of the possible matches thus reducing the amount of work needed for manual inspection. Indeed, experts that reviewed the results have identified some positive matches as potentially true and referred them for further investigation.


Author(s):  
Daniel R. Cassar ◽  
Saulo Martiello Mastelini ◽  
Tiago Botari ◽  
Edesio Alcobaça ◽  
André C.P.L.F. de Carvalho ◽  
...  

Sensors ◽  
2021 ◽  
Vol 21 (11) ◽  
pp. 3726
Author(s):  
Ivan Vaccari ◽  
Vanessa Orani ◽  
Alessia Paglialonga ◽  
Enrico Cambiaso ◽  
Maurizio Mongelli

The application of machine learning and artificial intelligence techniques in the medical world is growing, with a range of purposes: from the identification and prediction of possible diseases to patient monitoring and clinical decision support systems. Furthermore, the widespread use of remote monitoring medical devices, under the umbrella of the “Internet of Medical Things” (IoMT), has simplified the retrieval of patient information as they allow continuous monitoring and direct access to data by healthcare providers. However, due to possible issues in real-world settings, such as loss of connectivity, irregular use, misuse, or poor adherence to a monitoring program, the data collected might not be sufficient to implement accurate algorithms. For this reason, data augmentation techniques can be used to create synthetic datasets sufficiently large to train machine learning models. In this work, we apply the concept of generative adversarial networks (GANs) to perform a data augmentation from patient data obtained through IoMT sensors for Chronic Obstructive Pulmonary Disease (COPD) monitoring. We also apply an explainable AI algorithm to demonstrate the accuracy of the synthetic data by comparing it to the real data recorded by the sensors. The results obtained demonstrate how synthetic datasets created through a well-structured GAN are comparable with a real dataset, as validated by a novel approach based on machine learning.


Author(s):  
Amrik Singh ◽  
K.R. Ramkumar

Due to the advancement of medical sensor technologies, new vectors can be added to health insurance packages. Such medical sensors can help the health sector as well as the insurance sector to construct mathematical risk equation models with parameters that can map real-life risk conditions. In this paper, parameter analysis has been done in terms of medical relevancy as well as correlation. Treating it as an ‘inverse problem’, the mathematical relationship between the risk indicators has been found and tested against the ground truth. The pairwise correlation analysis gives a stable mathematical equation model that can be used for health risk analysis. The equation gives coefficient values from which classification regarding health insurance risk can be derived and quantified. The Logistic Regression equation model gives the highest accuracy (86.32%), ahead of the Ridge Bayesian and Ordinary Least Square algorithms. A machine learning algorithm based risk analysis approach was formulated, and the series of experiments shows that the K-Nearest Neighbor classifier has the highest accuracy, 93.21%, for risk classification.


Author(s):  
Marlene Arangú ◽  
Miguel Salido

A fine-grained arc-consistency algorithm for non-normalized constraint satisfaction problems

Constraint programming is a powerful software technology for solving numerous real-life problems. Many of these problems can be modeled as Constraint Satisfaction Problems (CSPs) and solved using constraint programming techniques. However, solving a CSP is NP-complete, so filtering techniques to reduce the search space are still necessary. Arc-consistency algorithms are widely used to prune the search space. The concept of arc-consistency is bidirectional, i.e., it must be ensured in both directions of the constraint (direct and inverse constraints). Two of the most well-known and frequently used arc-consistency algorithms for filtering CSPs are AC3 and AC4. These algorithms repeatedly carry out revisions and require support checks for identifying and deleting all unsupported values from the domains. Nevertheless, many revisions are ineffective, i.e., they cannot delete any value yet consume a lot of checks and time. In this paper, we present AC4-OP, an optimized version of AC4 that manages the binary and non-normalized constraints in only one direction, storing the supports found for the inverse direction for later evaluation. Thus, it shortens the propagation phase by avoiding unnecessary or ineffective checking. The use of AC4-OP reduces the number of constraint checks by 50% while pruning the same search space as AC4. The evaluation section shows the improvement of AC4-OP over AC4, AC6 and AC7 in random and non-normalized instances.
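
As background for the revisions and support checks mentioned above, here is a minimal Python sketch of the textbook AC3 procedure. It is not AC4-OP, whose data structures the abstract does not describe. The names `revise` and `ac3`, and the representation of constraints as a dictionary keyed by directed arcs, are assumptions made for illustration; both directions of each constraint are listed explicitly, reflecting the bidirectional nature of arc-consistency.

```python
from collections import deque


def revise(domains, constraints, x, y):
    """Delete values of x that have no support in y; return True if x's domain shrank."""
    check = constraints[(x, y)]
    removed = False
    for vx in list(domains[x]):
        # A support check: does some value of y satisfy the constraint with vx?
        if not any(check(vx, vy) for vy in domains[y]):
            domains[x].remove(vx)
            removed = True
    return removed


def ac3(domains, constraints):
    """Enforce arc-consistency in place; return False if some domain becomes empty."""
    queue = deque(constraints)  # start with every directed arc
    while queue:
        x, y = queue.popleft()
        if revise(domains, constraints, x, y):
            if not domains[x]:
                return False
            # Re-examine arcs pointing at x, except the one coming back from y.
            for (z, w) in constraints:
                if w == x and z != y:
                    queue.append((z, w))
    return True


# Hypothetical example: X < Y, stated in both directions (direct and inverse).
domains = {"X": {1, 2, 3}, "Y": {1, 2, 3}}
constraints = {
    ("X", "Y"): lambda vx, vy: vx < vy,
    ("Y", "X"): lambda vy, vx: vy > vx,
}
consistent = ac3(domains, constraints)
print(consistent, domains)  # True {'X': {1, 2}, 'Y': {2, 3}}
```

Every call to `check` inside `revise` is one constraint check, the quantity that optimized algorithms such as the one described above aim to reduce.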


2020 ◽  
Vol 13 (10) ◽  
pp. 1669-1681
Author(s):  
Zijing Tan ◽  
Ai Ran ◽  
Shuai Ma ◽  
Sheng Qin

Pointwise order dependencies (PODs) are dependencies that specify ordering semantics on attributes of tuples. POD discovery refers to the process of identifying the set Σ of valid and minimal PODs on a given data set D. In practice D is typically large and keeps changing, and it is prohibitively expensive to compute Σ from scratch every time. In this paper, we make a first effort to study the incremental POD discovery problem, aiming at computing changes ΔΣ to Σ such that Σ ⊕ ΔΣ is the set of valid and minimal PODs on D with a set ΔD of tuple insertion updates. (1) We first propose a novel indexing technique for inputs Σ and D. We give algorithms to build and choose indexes for Σ and D, and to update indexes in response to ΔD. We show that POD violations w.r.t. Σ incurred by ΔD can be efficiently identified by leveraging the proposed indexes, with a cost dependent on log(|D|). (2) We then present an effective algorithm for computing ΔΣ, based on Σ and identified violations caused by ΔD. The PODs in Σ that become invalid on D + ΔD are efficiently detected with the proposed indexes, and further new valid PODs on D + ΔD are identified by refining those invalid PODs in Σ on D + ΔD. (3) Finally, using both real-life and synthetic datasets, we experimentally show that our approach outperforms the batch approach that computes from scratch, up to orders of magnitude.
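
The incremental idea, that only tuple pairs involving an inserted tuple can introduce new violations, can be illustrated with a deliberately naive Python sketch. It assumes a single POD of the simple form "if t[lhs] ≤ s[lhs] then t[rhs] ≤ s[rhs]" and scans D linearly for each new tuple, so it does not achieve the logarithmic cost of the indexing technique described above. The function names and the `salary`/`tax` attributes are hypothetical.

```python
from itertools import permutations


def violates(t, s, lhs, rhs):
    """True if the ordered pair (t, s) breaks: t[lhs] <= s[lhs] implies t[rhs] <= s[rhs]."""
    return t[lhs] <= s[lhs] and not (t[rhs] <= s[rhs])


def incremental_violations(D, delta, lhs, rhs):
    """Return violating ordered pairs that involve at least one inserted tuple.

    Pairs made only of old tuples are skipped, on the assumption that the
    dependency was already valid on D before the insertions.
    """
    pairs = []
    for new in delta:
        for old in D:
            pairs.append((new, old))
            pairs.append((old, new))
    pairs.extend(permutations(delta, 2))  # ordered pairs among the new tuples
    return [(t, s) for t, s in pairs if violates(t, s, lhs, rhs)]


# Hypothetical data: a higher salary should never come with a lower tax.
D = [{"salary": 10, "tax": 1}, {"salary": 20, "tax": 2}]
delta = [{"salary": 15, "tax": 3}]
found = incremental_violations(D, delta, "salary", "tax")
print(found)  # [({'salary': 15, 'tax': 3}, {'salary': 20, 'tax': 2})]
```

The inserted tuple earns less than the second old tuple but pays more tax, so the dependency that held on D becomes invalid on D + ΔD.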


2022 ◽  
Vol 12 (2) ◽  
pp. 828
Author(s):  
Tebogo Bokaba ◽  
Wesley Doorsamy ◽  
Babu Sena Paul

Road traffic accidents (RTAs) are a major cause of injuries and fatalities worldwide. In recent years, there has been a growing global interest in analysing RTAs, specifically concerned with analysing and modelling accident data to better understand and assess the causes and effects of accidents. This study analysed the performance of widely used machine learning classifiers using a real-life RTA dataset from Gauteng, South Africa. The study aimed to assess prediction model designs for RTAs to assist transport authorities and policymakers. It considered classifiers such as naïve Bayes, logistic regression, k-nearest neighbour, AdaBoost, support vector machine, and random forest (RF), together with five missing data methods. These classifiers were evaluated using five evaluation metrics: accuracy, root-mean-square error, precision, recall, and receiver operating characteristic curves. Furthermore, the assessment involved parameter adjustment and incorporated dimensionality reduction techniques. The empirical results and analyses show that the RF classifier, combined with multiple imputation by chained equations, yielded the best performance when compared with the other combinations.
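
For reference, four of the five metrics named above can be computed directly from predicted and true labels. The Python sketch below uses the standard definitions for a binary task; the label vectors are made-up illustrative values, not results from the study, and the function name `binary_metrics` is hypothetical.

```python
import math


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and RMSE for 0/1 labels (standard definitions)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(pairs)
    return {
        "accuracy": (tp + tn) / n,
        # Guard against division by zero when a class is never predicted or present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "rmse": math.sqrt(sum((t - p) ** 2 for t, p in pairs) / n),
    }


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

With these labels there are three true positives, three true negatives, one false positive and one false negative, so accuracy, precision and recall are each 0.75 and the RMSE is 0.5.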


Author(s):  
Sebastian Panman de Wit ◽  
Doina Bucur ◽  
Jeroen van der Ham

Mobile malware are malicious programs that target mobile devices. They are an increasing problem, as seen in the rise of detected mobile malware samples per year. The number of active smartphone users is expected to grow, stressing the importance of research on the detection of mobile malware. Detection methods for mobile malware exist but are still limited. In this paper, we propose dynamic malware-detection methods that use device information such as the CPU usage, battery usage, and memory usage for the detection of 10 subtypes of Mobile Trojans on the Android Operating System (OS). We use a real-life sensor dataset containing device and malware data from 47 users for a year (2016) to create multiple mobile malware detection methods. We examine which features (i.e., aspects) of a device are most important to monitor to detect (subtypes of) Mobile Trojans. The focus of this paper is on dynamic hardware features. Using these dynamic features, we apply the following machine learning classifiers: Random Forest, K-Nearest Neighbour, and AdaBoost.

