First Steps towards Data-driven Adversarial Deduplication


Information ◽  
2018 ◽  
Vol 9 (8) ◽  
pp. 189 ◽  
Author(s):  
Jose Paredes ◽  
Gerardo Simari ◽  
Maria Martinez ◽  
Marcelo Falappa

In traditional databases, the entity resolution problem (which is also known as deduplication) refers to the task of mapping multiple manifestations of virtual objects to their corresponding real-world entities. When addressing this problem, in both theory and practice, it is widely assumed that such sets of virtual objects appear as the result of clerical errors, transliterations, missing or updated attributes, abbreviations, and so forth. In this paper, we address this problem under the assumption that this situation is caused by malicious actors operating in domains in which they do not wish to be identified, such as hacker forums and markets in which the participants are motivated to remain semi-anonymous (though they wish to keep their true identities secret, they find it useful for customers to identify their products and services). We are therefore in the presence of a different, and even more challenging, problem that we refer to as adversarial deduplication. In this paper, we study this problem via examples that arise from real-world data on malicious hacker forums and markets arising from collaborations with a cyber threat intelligence company focusing on understanding this kind of behavior. We argue that it is very difficult—if not impossible—to find ground truth data on which to build solutions to this problem, and develop a set of preliminary experiments based on training machine learning classifiers that leverage text analysis to detect potential cases of duplicate entities. Our results are encouraging as a first step towards building tools that human analysts can use to enhance their capabilities towards fighting cyber threats.
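For readers who want a concrete picture of the text-analysis approach mentioned above, the following sketch pairs accounts by TF-IDF similarity and trains a simple classifier to score potential duplicates; the feature set, the toy posts, and the account names are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of a pairwise text-similarity classifier for flagging
# potential duplicate accounts. This is NOT the authors' exact pipeline;
# the feature choice (TF-IDF cosine similarity) and the toy data are
# illustrative assumptions only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Toy corpus: concatenated posts per account (hypothetical data).
posts = {
    "acct_a": "selling fresh cc dumps, escrow accepted, pm for samples",
    "acct_b": "fresh dumps for sale, escrow ok, message me for samples",
    "acct_c": "tutorial on setting up a private jabber server",
}

vec = TfidfVectorizer(ngram_range=(1, 2)).fit(posts.values())
X = {name: vec.transform([text]) for name, text in posts.items()}

def pair_features(a, b):
    """Feature vector for an account pair: here just TF-IDF cosine similarity."""
    return [cosine_similarity(X[a], X[b])[0, 0]]

# Labeled pairs (1 = same actor, 0 = different) would normally come from
# analyst annotations; here they are made up to show the training call.
pairs = [("acct_a", "acct_b", 1), ("acct_a", "acct_c", 0), ("acct_b", "acct_c", 0)]
features = np.array([pair_features(a, b) for a, b, _ in pairs])
labels = np.array([y for _, _, y in pairs])

clf = LogisticRegression().fit(features, labels)
print(clf.predict_proba([pair_features("acct_a", "acct_b")])[:, 1])  # duplicate score
```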


2021 ◽  
Vol 14 (6) ◽  
pp. 997-1005
Author(s):  
Sandeep Tata ◽  
Navneet Potti ◽  
James B. Wendt ◽  
Lauro Beltrão Costa ◽  
Marc Najork ◽  
...  

Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in virtually infinitely many different ways. A good solution to this problem is one that generalizes well not only to known templates such as invoices from a known vendor, but also to unseen ones. We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean, and discuss three key data management challenges: 1) managing the quality of ground truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.
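As a rough illustration of the second challenge (generating training data from labeled documents), the sketch below pairs candidate spans with analyst-provided values for a target schema; the schema fields, the Candidate structure, and the matching rule are hypothetical and not Glean's actual data model.

```python
# Hypothetical sketch of turning a target schema plus labeled documents into
# training examples for a field-extraction model. Glean's internal data model
# is not described in this abstract; the field names, candidate generator,
# and label-matching rule below are assumptions.
from dataclasses import dataclass

TARGET_SCHEMA = ["invoice_date", "total_amount", "vendor_name"]

@dataclass
class Candidate:
    field: str      # schema field this span might fill
    text: str       # extracted span text
    page_x: float   # coarse layout features
    page_y: float

def make_training_examples(candidates, labeled_values):
    """Pair each candidate span with a binary label: does it match the
    analyst-provided ground-truth value for its field?"""
    examples = []
    for c in candidates:
        gold = labeled_values.get(c.field)
        label = int(gold is not None and c.text.strip() == gold.strip())
        examples.append(({"field": c.field, "x": c.page_x, "y": c.page_y,
                          "text": c.text}, label))
    return examples

# Toy usage with made-up candidates for one invoice.
cands = [Candidate("total_amount", "$1,200.00", 0.8, 0.9),
         Candidate("total_amount", "$12.00", 0.2, 0.5),
         Candidate("invoice_date", "2021-03-01", 0.7, 0.1)]
gold = {"total_amount": "$1,200.00", "invoice_date": "2021-03-01"}
for feats, y in make_training_examples(cands, gold):
    print(y, feats["field"], feats["text"])
```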


2020 ◽  
Vol 39 (6) ◽  
pp. 688-728
Author(s):  
Teodor Tomić ◽  
Philipp Lutz ◽  
Korbinian Schmid ◽  
Andrew Mathers ◽  
Sami Haddadin

In this article, we consider the problem of multirotor flying robots physically interacting with the environment under the influence of wind. The results are the first algorithms for simultaneous online estimation of contact and aerodynamic wrenches acting on the robot based on real-world data, without the need for dedicated sensors. For this purpose, we investigated two model-based techniques for discriminating between aerodynamic and interaction forces. The first technique is based on aerodynamic and contact torque models, and uses the external force to estimate wind speed. Contacts are then detected based on the residual between estimated external torque and expected (modeled) aerodynamic torque. Upon detecting contact, wind speed is assumed to change very slowly. From the estimated interaction wrench, we are also able to determine the contact location. This is embedded into a particle filter framework to further improve contact location estimation. The second algorithm uses the propeller aerodynamic power and angular speed as measured by the speed controllers to obtain an estimate of the airspeed. An aerodynamics model is then used to determine the aerodynamic wrench. Both methods rely on accurate aerodynamics models. Therefore, we evaluate data-driven and physics-based models as well as offline system identification for flying robots. For obtaining ground-truth data, we performed autonomous flights in a 3D wind tunnel. Using this data, aerodynamic model selection, parameter identification, and discrimination between aerodynamic and contact forces could be performed. Finally, the developed methods could serve as useful estimators for interaction control schemes with simultaneous compensation of wind disturbances.
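The residual-based contact detection described above can be summarized in a few lines; the placeholder aerodynamic model and threshold below are assumptions for illustration, not the paper's identified models.

```python
# Illustrative sketch of the residual-based contact detection idea: compare
# the estimated external torque with the torque predicted by an aerodynamic
# model and declare contact when the residual is large. The aerodynamic model
# and threshold are placeholder assumptions.
import numpy as np

def aero_torque_model(wind_speed_vec, drag_coeff=0.05):
    """Placeholder aerodynamic torque model: torque proportional to wind speed."""
    return drag_coeff * np.asarray(wind_speed_vec)

def detect_contact(tau_ext_est, wind_speed_vec, threshold=0.1):
    """Contact is flagged when the residual between estimated external torque
    and the modeled aerodynamic torque exceeds a tuned threshold."""
    residual = np.asarray(tau_ext_est) - aero_torque_model(wind_speed_vec)
    return np.linalg.norm(residual) > threshold, residual

in_contact, r = detect_contact(tau_ext_est=[0.02, 0.30, 0.01],
                               wind_speed_vec=[0.5, 0.4, 0.0])
print(in_contact, r)
```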


2020 ◽  
Vol 34 (07) ◽  
pp. 11661-11668 ◽  
Author(s):  
Yunfei Liu ◽  
Feng Lu

Many real-world vision tasks, such as reflection removal from a transparent surface and intrinsic image decomposition, can be modeled as single-image layer separation. However, this problem is highly ill-posed, requiring accurately aligned and hard-to-collect triplet data to train the CNN models. To address this problem, this paper proposes an unsupervised method that requires no ground truth data triplets in training. At the core of the method are two assumptions about data distributions in the latent spaces of different layers, based on which a novel unsupervised layer separation pipeline can be derived. The method is then constructed based on the GAN framework with self-supervision and cycle consistency constraints, among others. Experimental results demonstrate that it outperforms existing unsupervised methods in both synthetic and real-world tasks. The method also shows its ability to solve a more challenging multi-layer separation task.
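To make the cycle-consistency idea concrete, the toy computation below checks that separated layers recombine to the input under an assumed additive mixing model; it also shows why cycle consistency alone admits degenerate solutions, which is why the method additionally relies on assumptions about the latent-space distributions of the layers.

```python
# Minimal numerical sketch of a cycle-consistency constraint for single-image
# layer separation: the separated layers should recombine to reproduce the
# input image. The linear mixing model and L1 penalty here are illustrative
# assumptions; the paper's GAN-based losses are more involved.
import numpy as np

def cycle_consistency_loss(mixed, layer1, layer2):
    """L1 distance between the input and the recombined layers,
    assuming an additive mixing model."""
    return np.abs(mixed - (layer1 + layer2)).mean()

rng = np.random.default_rng(0)
clean = rng.random((8, 8))
reflection = 0.3 * rng.random((8, 8))
mixed = clean + reflection

print(cycle_consistency_loss(mixed, clean, reflection))        # perfect split: ~0.0
print(cycle_consistency_loss(mixed, mixed, np.zeros((8, 8))))  # degenerate split: also 0.0
print(cycle_consistency_loss(mixed, clean, np.zeros((8, 8))))  # poor split: > 0
```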


2021 ◽  
Vol 4 ◽  
Author(s):  
Bradley Butcher ◽  
Vincent S. Huang ◽  
Christopher Robinson ◽  
Jeremy Reffin ◽  
Sema K. Sgaier ◽  
...  

Developing data-driven solutions that address real-world problems requires understanding of these problems’ causes and how their interactions affect the outcome, often with only observational data. Causal Bayesian Networks (BNs) have been proposed as a powerful method for discovering and representing the causal relationships from observational data as a Directed Acyclic Graph (DAG). BNs could be especially useful for research in global health in Low- and Middle-Income Countries, where there is an increasing abundance of observational data that could be harnessed for policy making, program evaluation, and intervention design. However, BNs have not been widely adopted by global health professionals, and in real-world applications, confidence in the results of BNs generally remains inadequate. This is partially due to the inability to validate against some ground truth, as the true DAG is not available. This is especially problematic if a learned DAG conflicts with pre-existing domain doctrine. Here we conceptualize and demonstrate the idea of a “Causal Datasheet” that could approximate and document BN performance expectations for a given dataset, aiming to provide confidence and sample size requirements to practitioners. To generate results for such a Causal Datasheet, a tool was developed that can generate synthetic Bayesian networks and their associated synthetic datasets to mimic real-world datasets. The results given by well-known structure learning algorithms and a novel implementation of the OrderMCMC method using the Quotient Normalized Maximum Likelihood score were recorded. These results were used to populate the Causal Datasheet, and recommendations could be made depending on whether expected performance met user-defined thresholds. We present our experience in the creation of Causal Datasheets to aid analysis decisions at different stages of the research process. First, one was deployed to help determine the appropriate sample size of a planned study of sexual and reproductive health in Madhya Pradesh, India. Second, a datasheet was created to estimate the performance of an existing maternal health survey we conducted in Uttar Pradesh, India. Third, we validated the generated performance estimates and investigated current limitations on the well-known ALARM dataset. Our experience demonstrates the utility of the Causal Datasheet, which can help global health practitioners gain more confidence when applying BNs.
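A minimal version of the datasheet-style evaluation loop might look like the following: sample datasets of varying size from a known synthetic DAG, run a structure learner, and score the recovered edges against the ground truth. The tiny linear-Gaussian network and the stand-in learner are assumptions for illustration; real runs would use the structure-learning algorithms named above.

```python
# Hedged sketch of the datasheet-style evaluation loop: sample data from a
# known synthetic DAG, hand it to some structure-learning routine, and score
# the recovered graph against the ground truth (here a simple structural
# Hamming distance over the skeleton). The network and "learner" are toys.
import numpy as np

TRUE_EDGES = {("A", "B"), ("B", "C")}   # ground-truth DAG: A -> B -> C

def sample_dataset(n, rng):
    a = rng.normal(size=n)
    b = 0.8 * a + rng.normal(scale=0.5, size=n)
    c = 0.6 * b + rng.normal(scale=0.5, size=n)
    return {"A": a, "B": b, "C": c}

def structural_hamming_distance(learned_edges, true_edges=TRUE_EDGES):
    """Count edge insertions/deletions needed to match the true graph."""
    return len(learned_edges ^ true_edges)

def toy_learner(data):
    """Stand-in for a real structure learner (e.g. OrderMCMC): keep edges whose
    pairwise correlation is strong. Real learners are far more sophisticated."""
    names = list(data)
    edges = set()
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if abs(np.corrcoef(data[x], data[y])[0, 1]) > 0.5:
                edges.add((x, y))
    return edges

rng = np.random.default_rng(0)
for n in (50, 500, 5000):   # the datasheet idea: report quality vs. sample size
    shd = structural_hamming_distance(toy_learner(sample_dataset(n, rng)))
    print(f"n={n}: SHD={shd}")
```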


Algorithms ◽  
2021 ◽  
Vol 14 (2) ◽  
pp. 57
Author(s):  
Ryan Feng ◽  
Yu Yao ◽  
Ella Atkins

Autonomous vehicles require fleet-wide data collection for continuous algorithm development and validation. The smart black box (SBB) intelligent event data recorder has been proposed as a system for prioritized high-bandwidth data capture. This paper extends the SBB by applying anomaly detection and action detection methods for generalized event-of-interest (EOI) detection. An updated SBB pipeline is proposed for the real-time capture of driving video data. A video dataset is constructed to evaluate the SBB on real-world data for the first time. SBB performance is assessed by comparing the compression of normal and anomalous data and by comparing our prioritized data recording with a FIFO strategy. The results show that SBB data compression can increase the anomalous-to-normal memory ratio by ∼25%, while the prioritized recording strategy increases the anomalous-to-normal count ratio when compared to a FIFO strategy. We compare the real-world dataset SBB results to a baseline SBB given ground-truth anomaly labels and conclude that improved general EOI detection methods will greatly improve SBB performance.
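The difference between prioritized recording and a FIFO buffer under a fixed storage budget can be illustrated as follows; the clip identifiers, anomaly scores, and capacity are made up for illustration.

```python
# Illustrative sketch of prioritized recording versus FIFO under a fixed
# storage budget: FIFO drops the oldest clip, while the prioritized buffer
# drops the clip with the lowest event-of-interest (anomaly) score.
import heapq
from collections import deque

CAPACITY = 3
clips = [("clip1", 0.1), ("clip2", 0.9), ("clip3", 0.2),
         ("clip4", 0.8), ("clip5", 0.05)]   # (id, anomaly score)

# FIFO strategy: keep only the most recent clips.
fifo = deque(maxlen=CAPACITY)
for clip in clips:
    fifo.append(clip)

# Prioritized strategy: min-heap on anomaly score, evict the least anomalous.
prioritized = []
for clip_id, score in clips:
    heapq.heappush(prioritized, (score, clip_id))
    if len(prioritized) > CAPACITY:
        heapq.heappop(prioritized)   # drop lowest-priority clip

print("FIFO keeps:       ", [c for c, _ in fifo])
print("Prioritized keeps:", sorted(c for _, c in prioritized))
```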


2021 ◽  
Author(s):  
Prasanta Pal ◽  
Remko Van Lutterveld ◽  
Nancy Quirós ◽  
Veronique Taylor ◽  
Judson Brewer

Real-world signal acquisition through sensors is at the heart of the modern digital revolution. However, almost every signal acquisition system is contaminated with noise and outliers. Precise detection and curation of data is an essential step to reveal the true nature of the uncorrupted observations. With the exploding volumes of digital data sources, there is a critical need for a robust but easy-to-operate, low-latency, generic yet highly customizable outlier-detection and curation tool that is easily accessible and adaptable to diverse types of data sources. Existing methods often boil down to data smoothing, which inherently causes valuable information loss. We have developed a C++-based software tool to decontaminate time-series and matrix-like data sources, with the goal of recovering the ground truth. The SOCKS tool will be made available as open-source software for broader adoption in the scientific community. Our work calls for a philosophical shift in the design of real-world data processing pipelines. We propose that raw data should be decontaminated first, through conditional flagging of outliers and curation of flagged points, followed by iterative, parametrically tuned, asymptotic convergence to the ground truth as accurately as possible, before performing traditional data processing tasks.
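The flag-then-curate loop described above might look roughly like the sketch below, written in Python for illustration; SOCKS itself is a C++ tool, and its actual thresholds and flagging rules are not reproduced here.

```python
# Schematic illustration of the flag-then-curate idea: flag points that
# deviate strongly from a robust baseline, replace only the flagged points by
# interpolation, and repeat until no new outliers are found. The MAD-based
# rule and threshold are assumptions, not the tool's actual defaults.
import numpy as np

def flag_outliers(x, z_thresh=3.5):
    """Flag points far from the median, scaled by the median absolute deviation."""
    med = np.median(x)
    mad = np.median(np.abs(x - med)) or 1e-12
    robust_z = 0.6745 * (x - med) / mad
    return np.abs(robust_z) > z_thresh

def curate(x, max_iter=5):
    x = np.asarray(x, dtype=float).copy()
    for _ in range(max_iter):
        flags = flag_outliers(x)
        if not flags.any():
            break
        idx = np.arange(len(x))
        # Replace only flagged samples by interpolating from unflagged neighbours.
        x[flags] = np.interp(idx[flags], idx[~flags], x[~flags])
    return x

signal = np.array([1.0, 1.1, 0.9, 25.0, 1.05, 0.95, -30.0, 1.0])
print(curate(signal))
```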


2021 ◽  
Vol 6 (1) ◽  
Author(s):  
Iacopo Pozzana ◽  
Christos Ellinas ◽  
Georgios Kalogridis ◽  
Konstantinos Sakellariou

Understanding the role of individual nodes is a key challenge in the study of spreading processes on networks. In this work we propose a novel metric, the reachability-heterogeneity (RH), to quantify the contribution of each node to the robustness of the network against a spreading process. We then introduce a dataset consisting of four large engineering projects described by their activity networks, including records of the performance of each activity, i.e., whether it was timely delivered or delayed; such data, describing the spreading of performance fluctuations across activities, can be used as a reliable ground truth for the study of spreading phenomena on networks. We test the validity of the RH metric on these project networks, and discover that nodes scoring low in RH tend to consistently perform better. We also compare RH and seven other node metrics, showing that the former is highly interdependent with activity performance. Given the context agnostic nature of RH, our results, based on real-world data, signify the role that network structure plays with respect to overall project performance.
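The abstract does not give the formal definition of RH, so the snippet below only illustrates a plausible ingredient of such a metric: counting how many downstream activities a delay at each node could reach in a toy project DAG. The network and node names are made up.

```python
# NOT the RH metric itself (its definition is not given in the abstract);
# this only shows per-node downstream reachability in a small activity DAG,
# one quantity a reachability-based metric might build on.
from collections import defaultdict

edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
graph = defaultdict(list)
for u, v in edges:
    graph[u].append(v)

def downstream_reach(node):
    """Number of distinct activities reachable from `node` via precedence edges."""
    seen, stack = set(), [node]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen)

for n in ("A", "B", "C", "D", "E"):
    print(n, downstream_reach(n))
```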


2019 ◽  
Author(s):  
Gerhard Aigner ◽  
Bernd Grimm ◽  
Christian Lederer ◽  
Martin Daumer

Background. Physical activity (PA) is increasingly being recognized as a major factor related to the development or prevention of many diseases, as an intervention to cure or delay disease, and for patient assessment in diagnostics, as a clinical outcome measure or clinical trial endpoint. Thus, wearable sensors and signal algorithms to monitor PA in the free-living environment (real world) are becoming popular in medicine and clinical research. This is especially true for walking speed, a parameter of PA behaviour with increasing evidence to serve as a patient outcome and clinical trial endpoint in many diseases. The development and validation of sensor signal algorithms for PA classification, in particular walking, and for deriving specific PA parameters, such as real-world walking speed, depend on the availability of large reference data sets with ground truth values. In this study, a novel, reliable, scalable (high-throughput), user-friendly device and method to generate such ground truth data for real-world walking speed, other physical activity types, and further gait-related parameters in a real-world environment is described and validated. Methods. A surveyor’s wheel was instrumented with a rotating 3D accelerometer (actibelt). A signal processing algorithm is described to derive distance and speed values. In addition, a high-resolution camera was attached via an active gimbal to video-record context and detail. Validation was performed in the following main parts: 1) walking distance measurement was compared to the wheel’s built-in mechanical counter, 2) walking speed measurement was analysed on a treadmill at various speed settings, and 3) speed measurement accuracy was analysed by an independent certified calibration laboratory (accredited by DAkkS) applying standardised test procedures. Results. The mean relative error for distance measurements between our method and the built-in counter was 0.12%. Comparison of the speed values algorithmically extracted from accelerometry data and true treadmill speed revealed a mean adjusted absolute error of 0.01 m/s (relative error: 0.71%). The calibration laboratory found a mean relative error between values algorithmically extracted from accelerometry data and the laboratory gold standard of 0.36% (min/max 0.17-0.64), which is below the resolution of the laboratory. An official certificate was issued. Discussion. Error values were an order of magnitude smaller than any clinically important difference for walking speed. Conclusion. Besides its high accuracy, the presented method can be deployed in a real-world setting and allows integration into the digital data flow.
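A back-of-the-envelope sketch of the distance, speed, and relative-error arithmetic behind such a validation is shown below; the wheel circumference and counts are illustrative values, and the actual algorithm that extracts wheel rotations from the 3D accelerometer signal is not reproduced.

```python
# Illustrative arithmetic only: distance from wheel revolutions, mean walking
# speed, and relative error against a reference count. The circumference and
# the numbers are assumptions, not values from the study.

WHEEL_CIRCUMFERENCE_M = 1.0          # typical surveyor's wheel; assumption

def distance_from_revolutions(revolutions):
    return revolutions * WHEEL_CIRCUMFERENCE_M

def mean_speed(distance_m, elapsed_s):
    return distance_m / elapsed_s

def relative_error(measured, reference):
    return abs(measured - reference) / reference

dist = distance_from_revolutions(500)          # 500 m walked
speed = mean_speed(dist, elapsed_s=360)        # ~1.39 m/s
print(f"distance={dist:.1f} m, speed={speed:.2f} m/s")
print(f"relative error vs. counter: {relative_error(dist, 500.6):.4%}")
```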

