Machine Learning for Detecting Data Exfiltration

Context: Research at the intersection of cybersecurity, Machine Learning (ML), and Software Engineering (SE) has recently taken significant steps in proposing countermeasures for detecting sophisticated data exfiltration attacks. It is important to systematically review and synthesize the ML-based data exfiltration countermeasures to build a body of knowledge on this topic. Objective: This article aims to systematically review ML-based data exfiltration countermeasures in order to identify and classify the ML approaches, feature engineering techniques, evaluation datasets, and performance metrics used for these countermeasures. The review also aims to identify gaps in research on ML-based data exfiltration countermeasures. Method: We used the Systematic Literature Review (SLR) method to select and review 92 papers. Results: The review has enabled us to: (a) classify the ML approaches used in the countermeasures into data-driven and behavior-driven approaches; (b) categorize features into six types: behavioral, content-based, statistical, syntactical, spatial, and temporal; (c) classify the evaluation datasets into simulated, synthesized, and real datasets; and (d) identify 11 performance measures used by these studies. Conclusion: We conclude that: (i) the integration of data-driven and behavior-driven approaches should be explored; (ii) there is a need to develop large, high-quality evaluation datasets; (iii) incremental ML model training should be incorporated in countermeasures; (iv) resilience to adversarial learning should be considered and explored during the development of countermeasures to avoid poisoning attacks; and (v) the use of automated feature engineering should be encouraged for efficiently detecting data exfiltration attacks.

Machine Learning for Materials Scientists: An Introductory Guide Towards Best Practices

10.26434/chemrxiv.12249752 ◽

2020 ◽

Author(s):

Anthony Wang ◽

Ryan Murdock ◽

Steven Kauwe ◽

Anton Oliynyk ◽

Aleksander Gurlo ◽

...

Keyword(s):

Machine Learning ◽

Best Practices ◽

Data Driven ◽

Feature Engineering ◽

Engineering Model ◽

Learning Centered ◽

Domain Expertise ◽

Learning Research ◽

Model Training

This Editorial is intended for materials scientists interested in performing machine learning-centered research. We cover broad guidelines and best practices regarding the obtaining and treatment of data, feature engineering, model training, validation, evaluation and comparison, popular repositories for materials data and benchmarking datasets, model and architecture sharing, and finally publication. In addition, we include interactive Jupyter notebooks with example Python code to demonstrate some of the concepts, workflows, and best practices discussed. Overall, the data-driven methods and machine learning workflows and considerations are presented in a simple way, allowing interested readers to more intelligently guide their machine learning research using the suggested references, best practices, and their own materials domain expertise.

Machine Learning for Materials Scientists: An Introductory Guide Towards Best Practices

10.26434/chemrxiv.12249752.v1 ◽

2020 ◽

Author(s):

Anthony Wang ◽

Ryan Murdock ◽

Steven Kauwe ◽

Anton Oliynyk ◽

Aleksander Gurlo ◽

...

Keyword(s):

Machine Learning ◽

Best Practices ◽

Data Driven ◽

Feature Engineering ◽

Engineering Model ◽

Learning Centered ◽

Domain Expertise ◽

Learning Research ◽

Model Training

Two-Stage Monitoring of Patients in Intensive Care Unit for Sepsis Prediction Using Non-Overfitted Machine Learning Models

Electronics ◽

10.3390/electronics9071133 ◽

2020 ◽

Vol 9 (7) ◽

pp. 1133

Author(s):

Vytautas Abromavičius ◽

Darius Plonis ◽

Deividas Tarasevičius ◽

Artūras Serackis

Keyword(s):

Machine Learning ◽

Intensive Care Unit ◽

Intensive Care ◽

Early Detection ◽

Performance Metrics ◽

Imbalanced Data ◽

Clinical Criteria ◽

Unbalanced Dataset ◽

Model Training ◽

Clinical Records

The presented research addresses the problem of early detection of sepsis for patients in the Intensive Care Unit. The PhysioNet/Computing in Cardiology Challenge 2019 facilitated the development of automated, open-source algorithms for the early detection of sepsis from clinical data. A labeled clinical records dataset for training and verification of the algorithms was provided by the challenge organizers. However, a relatively small number of records with sepsis, supported by Sepsis-3 clinical criteria, led to a highly unbalanced dataset (only 2% of records carry a sepsis label). Such a heavily unbalanced dataset is a great challenge for machine learning model training and is not suitable for training classical classifiers. To address these issues, a method taking into account the amount of time the patients spent in the intensive care unit (ICU) was proposed. The proposed method uses two separate ensemble models, one trained on patient records under 56 h in the ICU, and another for patients who stayed longer than 56 h. A solution including feature selection and weighting-based training on imbalanced data was proposed in this paper. In addition, several performance metrics were investigated. Results show that, for successful prediction, a particular model having fewer or more predictors, depending on the length of stay in the Intensive Care Unit, should be applied.
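The two ideas in this abstract, routing a patient to one of two models by length of stay and weighting classes to offset imbalance, can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the inverse-frequency weighting formula is one common scheme chosen here as an assumption, and sending a stay of exactly 56 h to the second model is also an assumption, since the abstract does not specify that boundary case.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence, TypeVar

Model = TypeVar("Model")

ICU_SPLIT_HOURS = 56.0  # split point stated in the abstract


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count).

    Assumption: inverse-frequency weighting; the abstract only says
    "weighting based training" without giving a formula.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def select_model(hours_in_icu: float, early_model: Model, late_model: Model) -> Model:
    """Pick the model that matches the patient's current length of stay.

    Assumption: a stay of exactly 56 h is routed to the late model.
    """
    return early_model if hours_in_icu < ICU_SPLIT_HOURS else late_model
```

With a 2% positive rate, this weighting scheme counts each sepsis record 49 times as heavily as a non-sepsis record, which is the kind of correction that weighting-based training on imbalanced data relies on.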

Practices and Infrastructures for ML Systems – An Interview Study

10.36227/techrxiv.16939192.v1 ◽

2021 ◽

Author(s):

Dennis Muiruri ◽

Lucy Ellen Lwakatare ◽

Jukka K. Nurminen ◽

Tommi Mikkonen

Keyword(s):

Machine Learning ◽

Best Practices ◽

Data Management ◽

Management Practices ◽

Interview Study ◽

The State ◽

Data Driven ◽

Software Systems ◽

State Of Practice ◽

Model Training

The best practices and infrastructures for developing and maintaining machine learning (ML) enabled software systems are often reported by large and experienced data-driven organizations. However, little is known about the state of practice across other organizations. Using interviews, we investigated practices and tool-chains for ML-enabled systems from 16 organizations in various domains. Our study makes three broad observations related to data management practices, monitoring practices, and automation practices in ML model training and serving workflows. These have a limited number of generic practices and tools applicable across organizations in different domains.

dfgcompare: a library to support process variant analysis through Markov models

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-021-01715-3 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Amin Jalali ◽

Paul Johannesson ◽

Erik Perjons ◽

Ylva Askfors ◽

Abdolazim Rezaei Kalladj ◽

...

Keyword(s):

Performance Metrics ◽

Markov Models ◽

Process Analysis ◽

Cohort Analysis ◽

Data Driven ◽

Software Support ◽

Cohort Comparison ◽

Analysis Technique ◽

Variant Analysis ◽

And Performance

Background: Data-driven process analysis is an important area that relies on software support. Process variant analysis is an analysis technique in which analysts compare executed process variants, a.k.a. process cohorts. This comparison can help to identify insights for improving processes. A few software tools enable process cohort comparison based on the frequencies of process activities and performance metrics. These metrics are effective in cohort analysis, but they cannot support cohort comparison based on the probability of transitions among states, which is an important enabler for cohort analysis in healthcare. Results: This paper defines an approach to compare process cohorts using Markov models. The approach is formalized, and it is implemented as an open-source Python library named dfgcompare. This library can be used by other researchers to compare process cohorts. The implementation is also used to compare caregivers’ behavior when prescribing drugs in the Stockholm Region. The result shows that the approach enables the comparison of process cohorts in practice. Conclusions: We conclude that dfgcompare supports identifying differences among process cohorts.
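To make the idea of comparing cohorts by transition probabilities concrete, the following minimal sketch estimates a first-order Markov model from each cohort's traces and reports the per-transition difference. It does not use or reproduce the dfgcompare API; the function names and the trace representation (each trace as an ordered list of activity labels) are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence, Tuple

Trace = Sequence[Hashable]
Probabilities = Dict[Hashable, Dict[Hashable, float]]


def transition_probabilities(traces: Sequence[Trace]) -> Probabilities:
    """Estimate P(next activity | current activity) from observed traces."""
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for trace in traces:
        # Count each directly-follows pair in the trace.
        for src, dst in zip(trace, trace[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: c / sum(dsts.values()) for dst, c in dsts.items()}
        for src, dsts in counts.items()
    }


def transition_differences(
    cohort_a: Sequence[Trace], cohort_b: Sequence[Trace]
) -> Dict[Tuple[Hashable, Hashable], float]:
    """Return P_a(src -> dst) - P_b(src -> dst) for every observed transition.

    A transition that never occurs in a cohort counts as probability 0.0.
    """
    pa = transition_probabilities(cohort_a)
    pb = transition_probabilities(cohort_b)
    edges = {(s, d) for p in (pa, pb) for s, dsts in p.items() for d in dsts}
    return {
        (s, d): pa.get(s, {}).get(d, 0.0) - pb.get(s, {}).get(d, 0.0)
        for (s, d) in edges
    }
```

Transitions with a large absolute difference are the ones where the two cohorts behave differently, which frequency counts alone would not isolate.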

Investigation of Machine Learning Models and Different Feature Sets for the Efficiency of Early Sepsis Prediction from Highly Unbalanced Data

10.20944/preprints202005.0205.v1 ◽

2020 ◽

Author(s):

Vytautas Abromavičius ◽

Darius Plonis ◽

Deividas Tarasevičius ◽

Artūras Serackis

Keyword(s):

Machine Learning ◽

Intensive Care Unit ◽

Intensive Care ◽

Early Detection ◽

Performance Metrics ◽

Unbalanced Data ◽

Clinical Criteria ◽

Unbalanced Dataset ◽

Model Training ◽

Clinical Records

The presented research addresses the problem of early detection of sepsis for patients in the Intensive Care Unit. The PhysioNet/Computing in Cardiology Challenge 2019 facilitated the development of automated, open-source algorithms for the early detection of sepsis from clinical data. A labeled clinical records dataset for training and verification of the algorithms was provided by the challenge organizers. However, a relatively small number of records with sepsis, supported by Sepsis-3 clinical criteria, led to a highly unbalanced dataset (only 2% of records carry a sepsis label). Such a heavily unbalanced dataset is a great challenge for machine learning model training and is not suitable for training classical classifiers. To address these issues, a number of models were investigated. A solution including feature selection and data balancing techniques was proposed in this paper. In addition, several performance metrics were investigated. Results show that, for successful prediction, a particular model having fewer or more predictors, depending on the length of stay in the Intensive Care Unit, should be applied.

Data-Driven Personalization of Body-Machine Interfaces to Control Diverse Robot Types

10.21203/rs.3.rs-657990/v1 ◽

2021 ◽

Author(s):

Matteo Macchini ◽

Fabrizio Schiano ◽

Dario Floreano

Keyword(s):

Machine Learning ◽

Degrees Of Freedom ◽

Data Driven ◽

Machine Learning Method ◽

Learning Method ◽

Body Movements ◽

Motor Synergies ◽

And Performance ◽

Robotic Teleoperation ◽

Individual Motor

Body-Machine Interfaces (BoMIs) for robotic teleoperation can improve a user’s experience and performance. However, the implementation of such systems needs to be optimized on each robot independently, as a general approach has not been proposed to date. Here, we present a novel machine learning method to generate personalized BoMIs from an operator’s spontaneous body movements. The method captures individual motor synergies that can be used for the teleoperation of robots. The proposed algorithm applies to people with diverse behavioral patterns to control robots with diverse morphologies and degrees of freedom, such as a fixed-wing drone, a quadrotor, and a robotic manipulator.

Corsi, Fenwick and Gramsci: How bloggers and advanced analytics are changing the National Hockey League

International Review for the Sociology of Sport ◽

10.1177/1012690219869192 ◽

2019 ◽

Vol 55 (8) ◽

pp. 1192-1211

Author(s):

Stephen W Sheps

Keyword(s):

Online Community ◽

Performance Metrics ◽

The Body ◽

National Hockey League ◽

Antonio Gramsci ◽

Sport Media ◽

Body Of Knowledge ◽

The People ◽

Advanced Analytics ◽

And Performance

Since the early 2000s, an ever growing online community of bloggers and amateur statisticians has been developing a new set of advanced analytics and performance metrics for the National Hockey League. Many of the people who are driving innovation in this field are not data scientists but, rather, intellectually curious fans of the game, who are playing a significant role in reshaping the way the game is consumed and understood. Yet, despite the body of knowledge created online, only within the last few years have the National Hockey League and the mainstream sport media begun to take notice of these innovations. I argue that the analytics movement is being driven from the fans up, rather than from the National Hockey League and other professional leagues down, and that the drivers of this movement are examples of what Antonio Gramsci calls ‘organic intellectuals’ – the analytics camp is locked into its own ‘war of position’ against the hegemony of traditional hockey fans, coaches, management and sport media. My research explores the resistance Internet-based content creators have experienced from established hockey media personalities (‘Hockey Men’) and the National Hockey League itself, connecting this resistance to a growing trend away from evidence-based discourse in the current Western media landscape.

Machine learning in men’s professional football: Current applications and future directions for improving attacking play

International Journal of Sports Science & Coaching ◽

10.1177/1747954119879350 ◽

2019 ◽

Vol 14 (6) ◽

pp. 798-817

Author(s):

Mat Herold ◽

Floris Goes ◽

Stephan Nopp ◽

Pascal Bauer ◽

Chris Thompson ◽

...

Keyword(s):

Machine Learning ◽

Performance Metrics ◽

Prediction Method ◽

Professional Football ◽

Future Directions ◽

Video Footage ◽

Complex Process ◽

Football Match ◽

And Performance ◽

Tactical Knowledge

It is common practice amongst coaches and analysts to search for key performance indicators related to attacking play in football. Match analysis in professional football has predominately utilised notational analysis, a statistical summary of events based on video footage, to study the sport and prepare teams for competition. Recent advances in technology have facilitated the dynamic analysis of more complex process variables, giving practitioners the potential to quickly evaluate a match with consideration of contextual parameters. One field of research, known as machine learning, is a form of artificial intelligence that uses algorithms to detect meaningful patterns based on positional data. Machine learning is a relatively new concept in football, and little is known about its usefulness in identifying performance metrics that determine match outcome. Few studies and no reviews have focused on the use of machine learning to improve tactical knowledge and performance; existing work instead focuses on the models used or on machine learning as a prediction method. Accordingly, this article provides a critical appraisal of the application of machine learning in football related to attacking play, discussing current challenges and future directions that may provide deeper insight to practitioners.

An Overview of Opportunities for Machine Learning Methods in Underground Rock Engineering Design

Geosciences ◽

10.3390/geosciences9120504 ◽

2019 ◽

Vol 9 (12) ◽

pp. 504

Author(s):

Josephine Morgenroth ◽

Usman T. Khan ◽

Matthew A. Perras

Keyword(s):

Machine Learning ◽

Engineering Design ◽

Performance Metrics ◽

Mining Industry ◽

Learning Methods ◽

Rock Engineering ◽

Input Selection ◽

Machine Learning Methods ◽

And Performance ◽

Rock Engineering Design

Machine learning methods for data processing are gaining momentum in many geoscience industries. This includes the mining industry, where machine learning is primarily being applied to autonomously driven vehicles such as haul trucks, and to ore body and resource delineation. However, the development of machine learning applications in rock engineering literature is relatively recent, despite these methods being widely used and generally accepted for decades in other risk assessment-type design areas, such as flood forecasting. Operating mines and underground infrastructure projects collect more instrumentation data than ever before. However, only a small fraction of the useful information is typically extracted for rock engineering design, and there is often insufficient time to investigate complex rock mass phenomena in detail. This paper presents a summary of current practice in rock engineering design, as well as a review of literature and methods at the intersection of machine learning and rock engineering. It identifies gaps, such as standards for architecture, input selection and performance metrics, and areas for future work. These gaps present an opportunity to define a framework for integrating machine learning into conventional rock engineering design methodologies to make them more rigorous and reliable in predicting probable underlying physical mechanics and phenomena.
