Boosting Instance Segmentation with Synthetic Data: A study to overcome the limits of real world data sets

Malicious software utilizes HTTP protocol for communication purposes, creating network traffic that is hard to identify as it blends into the traffic generated by benign applications. To this aim, fingerprinting tools have been developed to help track and identify such traffic by providing a short representation of malicious HTTP requests. However, currently existing tools do not analyze all information included in the HTTP message or analyze it insufficiently. To address these issues, we propose Hfinger, a novel malware HTTP request fingerprinting tool. It extracts information from the parts of the request such as URI, protocol information, headers, and payload, providing a concise request representation that preserves the extracted information in a form interpretable by a human analyst. For the developed solution, we have performed an extensive experimental evaluation using real-world data sets and we also compared Hfinger with the most related and popular existing tools such as FATT, Mercury, and p0f. The conducted effectiveness analysis reveals that on average only 1.85% of requests fingerprinted by Hfinger collide between malware families, what is 8–34 times lower than existing tools. Moreover, unlike these tools, in default mode, Hfinger does not introduce collisions between malware and benign applications and achieves it by increasing the number of fingerprints by at most 3 times. As a result, Hfinger can effectively track and hunt malware by providing more unique fingerprints than other standard tools.

Download Full-text

Comparison of M2M Traffic Models Against Real World Data Sets

2018 IEEE 23rd International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD) ◽

10.1109/camad.2018.8515000 ◽

2018 ◽

Cited By ~ 1

Author(s):

Marco Sansoni ◽

Giuseppe Ravagnani ◽

Daniel Zucchetto ◽

Chiara Pielli ◽

Andrea Zanella ◽

...

Keyword(s):

Real World ◽

Data Sets ◽

Real World Data ◽

Traffic Models ◽

World Data

Download Full-text

Activities for Students: As the Ball Rolls: A Quadratic Investigation Using Multiple Representations

Mathematics Teacher ◽

10.5951/mt.103.1.0062 ◽

2009 ◽

Vol 103 (1) ◽

pp. 62-68

Author(s):

Kathleen Cage Mittag ◽

Sharon Taylor

Keyword(s):

Data Collection ◽

Real Time ◽

Physical Model ◽

Real World ◽

Graphing Calculator ◽

Data Sets ◽

Real World Data ◽

World Data ◽

Hands On ◽

Modeling Data

Using activities to create and collect data is not a new idea. Teachers have been incorporating real-world data into their classes since at least the advent of the graphing calculator. Plenty of data collection activities and data sets exist, and the graphing calculator has made modeling data much easier. However, the authors were in search of a better physical model for a quadratic. We wanted students to see an actual parabola take shape in real time and then explore its characteristics, but we could not find such a hands-on model.

Download Full-text

Comparing ICP variants on real-world data sets

Autonomous Robots ◽

10.1007/s10514-013-9327-2 ◽

2013 ◽

Vol 34 (3) ◽

pp. 133-148 ◽

Cited By ~ 318

Author(s):

François Pomerleau ◽

Francis Colas ◽

Roland Siegwart ◽

Stéphane Magnenat

Keyword(s):

Real World ◽

Data Sets ◽

Real World Data ◽

World Data

Download Full-text

An empirical analysis of dealing with patients who are lost to follow-up when developing prognostic models using a cohort design

10.21203/rs.3.rs-54715/v3 ◽

2020 ◽

Author(s):

Jenna M Reps ◽

Peter Rijnbeek ◽

Alana Cuthbert ◽

Patrick B Ryan ◽

Nicole Pratt ◽

...

Keyword(s):

At Risk ◽

Real World ◽

Synthetic Data ◽

Real World Data ◽

Test Set ◽

Loss To Follow Up ◽

World Data ◽

Cohort Design ◽

Lost To Follow Up

Abstract Background: Researchers developing prediction models are faced with numerous design choices that may impact model performance. One key decision is how to include patients who are lost to follow-up. In this paper we perform a large-scale empirical evaluation investigating the impact of this decision. In addition, we aim to provide guidelines for how to deal with loss to follow-up.Methods: We generate a partially synthetic dataset with complete follow-up and simulate loss to follow-up based either on random selection or on selection based on comorbidity. In addition to our synthetic data study we investigate 21 real-world data prediction problems. We compare four simple strategies for developing models when using a cohort design that encounters loss to follow-up. Three strategies employ a binary classifier with data that: i) include all patients (including those lost to follow-up), ii) exclude all patients lost to follow-up or iii) only exclude patients lost to follow-up who do not have the outcome before being lost to follow-up. The fourth strategy uses a survival model with data that include all patients. We empirically evaluate the discrimination and calibration performance.Results: The partially synthetic data study results show that excluding patients who are lost to follow-up can introduce bias when loss to follow-up is common and does not occur at random. However, when loss to follow-up was completely at random, the choice of addressing it had negligible impact on model discrimination performance. Our empirical real-world data results showed that the four design choices investigated to deal with loss to follow-up resulted in comparable performance when the time-at-risk was 1-year but demonstrated differential bias when we looked into 3-year time-at-risk. Removing patients who are lost to follow-up before experiencing the outcome but keeping patients who are lost to follow-up after the outcome can bias a model and should be avoided.Conclusion: Based on this study we therefore recommend i) developing models using data that includes patients that are lost to follow-up and ii) evaluate the discrimination and calibration of models twice: on a test set including patients lost to follow-up and a test set excluding patients lost to follow-up.

Download Full-text

A Spatial Biosurveillance Synthetic Data Generator in R

Online Journal of Public Health Informatics ◽

10.5210/ojphi.v9i1.7583 ◽

2017 ◽

Vol 9 (1) ◽

Cited By ~ 1

Author(s):

Drew Levin ◽

Patrick Finley

Keyword(s):

Power Law ◽

Real World ◽

Degree Distribution ◽

Transportation Network ◽

Synthetic Data ◽

Data Sets ◽

Real World Data ◽

Data Set ◽

Scale Free ◽

Data Generator

ObjectiveTo develop a spatially accurate biosurveillance synthetic datagenerator for the testing, evaluation, and comparison of new outbreakdetection techniques.IntroductionDevelopment of new methods for the rapid detection of emergingdisease outbreaks is a research priority in the field of biosurveillance.Because real-world data are often proprietary in nature, scientists mustutilize synthetic data generation methods to evaluate new detectionmethodologies. Colizza et. al. have shown that epidemic spread isdependent on the airline transportation network [1], yet current datagenerators do not operate over network structures.Here we present a new spatial data generator that models thespread of contagion across a network of cities connected by airlineroutes. The generator is developed in the R programming languageand produces data compatible with the popular `surveillance’ softwarepackage.MethodsColizza et. al. demonstrate the power-law relationships betweencity population, air traffic, and degree distribution [1]. We generate atransportation network as a Chung-Lu random graph [2] that preservesthese scale-free relationships (Figure 1).First, given a power-law exponent and a desired number of cities,a probability mass function (PMF) is generated that mirrors theexpected degree distribution for the given power-law relationship.Values are then sampled from this PMF to generate an expecteddegree (number of connected cities) for each city in the network.Edges (airline connections) are added to the network probabilisticallyas described in [2]. Unconnected graph components are each joinedto the largest component using linear preferential attachment. Finally,city sizes are calculated based on an observed three-quarter power-law scaling relationship with the sampled degree distribution.Each city is represented as a customizable stochastic compartmentalSIR model. Transportation between cities is modeled similar to [2].An infection is initialized in a single random city and infection countsare recorded in each city for a fixed period of time. A consistentfraction of the modeled infection cases are recorded as daily clinicvisits. These counts are then added onto statically generated baselinedata for each city to produce a full synthetic data set. Alternatively,data sets can be generated using real-world networks, such as the onemaintained by the International Air Transport Association.ResultsDynamics such as the number of cities, degree distribution power-law exponent, traffic flow, and disease kinetics can be customized.In the presented example (Figure 2) the outbreak spreads over a 20city transportation network. Infection spreads rapidly once the morepopulated hub cities are infected. Cities that are multiple flights awayfrom the initially infected city are infected late in the process. Thegenerator is capable of creating data sets of arbitrary size, length, andconnectivity to better mirror a diverse set of observed network types.ConclusionsNew computational methods for outbreak detection andsurveillance must be compared to established approaches. Outbreakmitigation strategies require a realistic model of human transportationbehavior to best evaluate impact. These actions require test data thataccurately reflect the complexity of the real-world data they wouldbe applied to. The outbreak data generated here represents thecomplexity of modern transportation networks and are made to beeasily integrated with established software packages to allow for rapidtesting and deployment.Randomly generated scale-free transportation network with a power-lawdegree exponent ofλ=1.8. City and link sizes are scaled to reflect their weight.An example of observed daily outbreak-related clinic visits across a randomlygenerated network of 20 cities. Each city is colored by the number of flightsrequired to reach the city from the initial infection location. These generatedcounts are then added onto baseline data to create a synthetic data set forexperimentation.KeywordsSimulation; Network; Spatial; Synthetic; Data

Download Full-text

Computing top-k temporal closeness in temporal networks

Knowledge and Information Systems ◽

10.1007/s10115-021-01639-4 ◽

2022 ◽

Author(s):

Lutz Oettershagen ◽

Petra Mutzel

Keyword(s):

Real World ◽

High Probability ◽

Transitive Closure ◽

Closeness Centrality ◽

Data Sets ◽

Temporal Networks ◽

Real World Data ◽

World Data ◽

Temporal Domain ◽

Vertex Sets

AbstractThe closeness centrality of a vertex in a classical static graph is the reciprocal of the sum of the distances to all other vertices. However, networks are often dynamic and change over time. Temporal distances take these dynamics into account. In this work, we consider the harmonic temporal closeness with respect to the shortest duration distance. We introduce an efficient algorithm for computing the exact top-k temporal closeness values and the corresponding vertices. The algorithm can be generalized to the task of computing all closeness values. Furthermore, we derive heuristic modifications that perform well on real-world data sets and drastically reduce the running times. For the case that edge traversal takes an equal amount of time for all edges, we lift two approximation algorithms to the temporal domain. The algorithms approximate the transitive closure of a temporal graph (which is an essential ingredient for the top-k algorithm) and the temporal closeness for all vertices, respectively, with high probability. We experimentally evaluate all our new approaches on real-world data sets and show that they lead to drastically reduced running times while keeping high quality in many cases. Moreover, we demonstrate that the top-k temporal and static closeness vertex sets differ quite largely in the considered temporal networks.

Download Full-text

Evaluating a Longitudinal Synthetic Data Generator using Real World Data

2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS) ◽

10.1109/cbms52027.2021.00074 ◽

2021 ◽

Author(s):

Zhenchen Wang ◽

Puja Myles ◽

Anu Jain ◽

James L. Keidel ◽

Roberto Liddi ◽

...

Keyword(s):

Real World ◽

Synthetic Data ◽

Real World Data ◽

World Data ◽

Data Generator

Download Full-text

Verification of statistical oncological endpoints on encrypted data: Confirming the feasibility of real-world data sharing without the need to reveal protected patient information.

Journal of Clinical Oncology ◽

10.1200/jco.2021.39.15_suppl.e18725 ◽

2021 ◽

Vol 39 (15_suppl) ◽

pp. e18725-e18725

Author(s):

Ravit Geva ◽

Barliz Waissengrin ◽

Dan Mirelman ◽

Felix Bokstein ◽

Deborah T. Blumenthal ◽

...

Keyword(s):

Data Sharing ◽

Real World ◽

Clinical Decision Making ◽

Homomorphic Encryption ◽

Data Sets ◽

Real World Data ◽

Data Set ◽

Raw Data ◽

Encrypted Data ◽

World Data

e18725 Background: Healthcare data sharing is important for the creation of diverse and large data sets, supporting clinical decision making, and accelerating efficient research to improve patient outcomes. This is especially vital in the case of real world data analysis. However, stakeholders are reluctant to share their data without ensuring patients’ privacy, proper protection of their data sets and the ways they are being used. Homomorphic encryption is a cryptographic capability that can address these issues by enabling computation on encrypted data without ever decrypting it, so the analytics results are obtained without revealing the raw data. The aim of this study is to prove the accuracy of analytics results and the practical efficiency of the technology. Methods: A real-world data set of colorectal cancer patients’ survival data, following two different treatment interventions, including 623 patients and 24 variables, amounting to 14,952 items of data, was encrypted using leveled homomorphic encryption implemented in the PALISADE software library. Statistical analysis of key oncological endpoints was blindly performed on both the raw data and the homomorphically-encrypted data using descriptive statistics and survival analysis with Kaplan-Meier curves. Results were then compared with an accuracy goal of two decimals. Results: The difference between the raw data and the homomorphically encrypted data results, regarding all variables analyzed was within the pre-determined accuracy range goal, as well as the practical efficiency of the encrypted computation measured by run time, are presented in table. Conclusions: This study demonstrates that data encrypted with Homomorphic Encryption can be statistical analyzed with a precision of at least two decimal places, allowing safe clinical conclusions drawing while preserving patients’ privacy and protecting data owners’ data assets. Homomorphic encryption allows performing efficient computation on encrypted data non-interactively and without requiring decryption during computation time. Utilizing the technology will empower large-scale cross-institution and cross- stakeholder collaboration, allowing safe international collaborations. Clinical trial information: 0048-19-TLV. [Table: see text]

Download Full-text