Machine Learning for Detecting Data Exfiltration

2021 ◽  
Vol 54 (3) ◽  
pp. 1-47
Author(s):  
Bushra Sabir ◽  
Faheem Ullah ◽  
M. Ali Babar ◽  
Raj Gaire

Context: Research at the intersection of cybersecurity, Machine Learning (ML), and Software Engineering (SE) has recently taken significant steps in proposing countermeasures for detecting sophisticated data exfiltration attacks. It is important to systematically review and synthesize the ML-based data exfiltration countermeasures to build a body of knowledge on this topic. Objective: This article systematically reviews ML-based data exfiltration countermeasures to identify and classify the ML approaches, feature engineering techniques, evaluation datasets, and performance metrics used for these countermeasures. The review also aims to identify gaps in research on ML-based data exfiltration countermeasures. Method: We used the Systematic Literature Review (SLR) method to select and review 92 papers. Results: The review has enabled us to: (a) classify the ML approaches used in the countermeasures into data-driven and behavior-driven approaches; (b) categorize features into six types: behavioral, content-based, statistical, syntactical, spatial, and temporal; (c) classify the evaluation datasets into simulated, synthesized, and real datasets; and (d) identify 11 performance measures used by these studies. Conclusion: We conclude that: (i) the integration of data-driven and behavior-driven approaches should be explored; (ii) high-quality, large evaluation datasets need to be developed; (iii) incremental ML model training should be incorporated in countermeasures; (iv) resilience to adversarial learning should be considered and explored during the development of countermeasures to avoid poisoning attacks; and (v) the use of automated feature engineering should be encouraged for efficiently detecting data exfiltration attacks.

2020 ◽  
Author(s):  
Anthony Wang ◽  
Ryan Murdock ◽  
Steven Kauwe ◽  
Anton Oliynyk ◽  
Aleksander Gurlo ◽  
...  

This Editorial is intended for materials scientists interested in performing machine learning-centered research. We cover broad guidelines and best practices regarding the obtaining and treatment of data, feature engineering, model training, validation, evaluation and comparison, popular repositories for materials data and benchmarking datasets, model and architecture sharing, and finally publication. In addition, we include interactive Jupyter notebooks with example Python code to demonstrate some of the concepts, workflows, and best practices discussed. Overall, the data-driven methods and machine learning workflows and considerations are presented in a simple way, allowing interested readers to more intelligently guide their machine learning research using the suggested references, best practices, and their own materials domain expertise.


Electronics ◽  
2020 ◽  
Vol 9 (7) ◽  
pp. 1133
Author(s):  
Vytautas Abromavičius ◽  
Darius Plonis ◽  
Deividas Tarasevičius ◽  
Artūras Serackis

This research addresses the problem of early detection of sepsis in patients in the Intensive Care Unit (ICU). The PhysioNet/Computing in Cardiology Challenge 2019 facilitated the development of automated, open-source algorithms for the early detection of sepsis from clinical data. The challenge organizers provided a labeled dataset of clinical records for training and verifying the algorithms. However, the relatively small number of records with sepsis, as defined by the Sepsis-3 clinical criteria, led to a highly unbalanced dataset (only 2% of records carry a sepsis label). Such heavily unbalanced data is a great challenge for machine learning model training and is not suitable for training classical classifiers. To address these issues, a method that takes into account the amount of time patients spent in the ICU was proposed. The proposed method uses two separate ensemble models, one trained on records of patients who spent under 56 h in the ICU and another on records of patients who stayed longer than 56 h. The solution includes feature selection and weighting-based training on imbalanced data. In addition, several performance metrics were investigated. Results show that, for successful prediction, a particular model, with fewer or more predictors depending on the length of stay in the ICU, should be applied.
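
The split by length of stay and the weighting of the rare sepsis class can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the 56 h threshold comes from the abstract, while the function names (`balanced_class_weights`, `pick_model`), the handling of a stay of exactly 56 h, and the inverse-frequency weighting formula are assumptions made for this example. The formula is the same heuristic scikit-learn uses for its "balanced" class weights; the abstract does not say which weighting scheme the paper used.

```python
from collections import Counter

ICU_SPLIT_HOURS = 56  # threshold stated in the abstract


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumed weighting scheme for illustration only; the abstract does
    not give the paper's exact scheme.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def pick_model(hours_in_icu, short_stay_model, long_stay_model):
    """Route a record to the model trained on the matching stay length.

    Treating exactly 56 h as a long stay is an assumption.
    """
    if hours_in_icu < ICU_SPLIT_HOURS:
        return short_stay_model
    return long_stay_model


# Toy labels: 2 sepsis records out of 100, mirroring the ~2% positive rate.
labels = [1] * 2 + [0] * 98
weights = balanced_class_weights(labels)
```

With these toy labels the sepsis class receives a much larger weight than the majority class, so each misclassified sepsis record costs more during training.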


2021 ◽  
Author(s):  
Dennis Muiruri ◽  
Lucy Ellen Lwakatare ◽  
Jukka K. Nurminen ◽  
Tommi Mikkonen

The best practices and infrastructures for developing and maintaining machine learning (ML) enabled software systems are often reported by large and experienced data-driven organizations. However, little is known about the state of practice across other organizations. Using interviews, we investigated practices and tool-chains for ML-enabled systems from 16 organizations in various domains. Our study makes three broad observations related to data management practices, monitoring practices, and automation practices in ML model training and serving workflows. These areas have a limited number of generic practices and tools applicable across organizations in different domains.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Amin Jalali ◽  
Paul Johannesson ◽  
Erik Perjons ◽  
Ylva Askfors ◽  
Abdolazim Rezaei Kalladj ◽  
...  

Background: Data-driven process analysis is an important area that relies on software support. Process variant analysis is an analysis technique in which analysts compare executed process variants, a.k.a. process cohorts. This comparison can yield insights for improving processes. A few software tools support process cohort comparison based on the frequencies of process activities and performance metrics. These metrics are effective in cohort analysis, but they cannot support cohort comparison based on the probability of transitions among states, which is an important enabler for cohort analysis in healthcare. Results: This paper defines an approach to compare process cohorts using Markov models. The approach is formalized and implemented as an open-source Python library named dfgcompare. This library can be used by other researchers to compare process cohorts. The implementation is also used to compare caregivers’ behavior when prescribing drugs in the Stockholm Region. The result shows that the approach enables the comparison of process cohorts in practice. Conclusions: We conclude that dfgcompare supports identifying differences among process cohorts.
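
Comparing cohorts by transition probabilities can be illustrated with a small first-order Markov sketch. This is not the dfgcompare API, whose interface the abstract does not describe. The function names, the activity labels, and the choice of a simple probability difference as the comparison measure are all assumptions made for this example.

```python
from collections import defaultdict


def transition_probabilities(traces):
    """Estimate first-order Markov transition probabilities from traces."""
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        # Count each consecutive (source, destination) activity pair.
        for src, dst in zip(trace, trace[1:]):
            counts[src][dst] += 1
    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        probs[src] = {dst: n / total for dst, n in dsts.items()}
    return probs


def compare_cohorts(traces_a, traces_b):
    """Return {(src, dst): P_a - P_b} for transitions seen in either cohort."""
    pa = transition_probabilities(traces_a)
    pb = transition_probabilities(traces_b)
    pairs = {(s, d) for p in (pa, pb) for s, ds in p.items() for d in ds}
    return {
        (s, d): pa.get(s, {}).get(d, 0.0) - pb.get(s, {}).get(d, 0.0)
        for (s, d) in pairs
    }


# Hypothetical activity labels, invented for this example.
cohort_a = [["visit", "prescribe", "followup"], ["visit", "prescribe", "followup"]]
cohort_b = [["visit", "prescribe", "followup"], ["visit", "refer", "followup"]]
diff = compare_cohorts(cohort_a, cohort_b)
```

A positive value in `diff` means the transition is more likely in the first cohort, and a negative value means it is more likely in the second. A transition that starts from a state one cohort never visits is treated as having probability 0 in that cohort, which is a simplification.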


Author(s):  
Vytautas Abromavičius ◽  
Darius Plonis ◽  
Deividas Tarasevičius ◽  
Artūras Serackis

This research addresses the problem of early detection of sepsis in patients in the Intensive Care Unit (ICU). The PhysioNet/Computing in Cardiology Challenge 2019 facilitated the development of automated, open-source algorithms for the early detection of sepsis from clinical data. The challenge organizers provided a labeled dataset of clinical records for training and verifying the algorithms. However, the relatively small number of records with sepsis, as defined by the Sepsis-3 clinical criteria, led to a highly unbalanced dataset (only 2% of records carry a sepsis label). Such heavily unbalanced data is a great challenge for machine learning model training and is not suitable for training classical classifiers. To address these issues, a number of models were investigated. A solution including feature selection and data balancing techniques was proposed in this paper. In addition, several performance metrics were investigated. Results show that, for successful prediction, a particular model, with fewer or more predictors depending on the length of stay in the ICU, should be applied.


2021 ◽  
Author(s):  
Matteo Macchini ◽  
Fabrizio Schiano ◽  
Dario Floreano

Body-Machine Interfaces (BoMIs) for robotic teleoperation can improve a user’s experience and performance. However, the implementation of such systems needs to be optimized on each robot independently, as a general approach has not been proposed to date. Here, we present a novel machine learning method to generate personalized BoMIs from an operator’s spontaneous body movements. The method captures individual motor synergies that can be used for the teleoperation of robots. The proposed algorithm applies to people with diverse behavioral patterns to control robots with diverse morphologies and degrees of freedom, such as a fixed-wing drone, a quadrotor, and a robotic manipulator.


2019 ◽  
Vol 55 (8) ◽  
pp. 1192-1211
Author(s):  
Stephen W Sheps

Since the early 2000s, an ever-growing online community of bloggers and amateur statisticians has been developing a new set of advanced analytics and performance metrics for the National Hockey League. Many of the people who are driving innovation in this field are not data scientists but, rather, intellectually curious fans of the game, who are playing a significant role in reshaping the way the game is consumed and understood. Yet, despite the body of knowledge created online, only within the last few years have the National Hockey League and the mainstream sport media begun to take notice of these innovations. I argue that the analytics movement is being driven from the fans up, rather than from the National Hockey League and other professional leagues down, and that the drivers of this movement are examples of what Antonio Gramsci calls ‘organic intellectuals’ – the analytics camp is locked into its own ‘war of position’ against the hegemony of traditional hockey fans, coaches, management and sport media. My research explores the resistance Internet-based content creators have experienced from established hockey media personalities (‘Hockey Men’) and the National Hockey League itself, connecting this resistance to a growing trend away from evidence-based discourse in the current Western media landscape.


2019 ◽  
Vol 14 (6) ◽  
pp. 798-817
Author(s):  
Mat Herold ◽  
Floris Goes ◽  
Stephan Nopp ◽  
Pascal Bauer ◽  
Chris Thompson ◽  
...  

It is common practice amongst coaches and analysts to search for key performance indicators related to attacking play in football. Match analysis in professional football has predominantly utilised notational analysis, a statistical summary of events based on video footage, to study the sport and prepare teams for competition. Recent advances in technology have facilitated the dynamic analysis of more complex process variables, giving practitioners the potential to quickly evaluate a match with consideration of contextual parameters. One field of research, known as machine learning, is a form of artificial intelligence that uses algorithms to detect meaningful patterns based on positional data. Machine learning is a relatively new concept in football, and little is known about its usefulness in identifying performance metrics that determine match outcome. Few studies and no reviews have focused on the use of machine learning to improve tactical knowledge and performance, instead focusing on the models used, or on its use as a prediction method. Accordingly, this article provides a critical appraisal of the application of machine learning in football related to attacking play, discussing current challenges and future directions that may provide deeper insight to practitioners.


Geosciences ◽  
2019 ◽  
Vol 9 (12) ◽  
pp. 504
Author(s):  
Josephine Morgenroth ◽  
Usman T. Khan ◽  
Matthew A. Perras

Machine learning methods for data processing are gaining momentum in many geoscience industries. This includes the mining industry, where machine learning is primarily being applied to autonomously driven vehicles such as haul trucks, and to ore body and resource delineation. However, the development of machine learning applications in rock engineering literature is relatively recent, even though machine learning has been widely used and generally accepted for decades in other risk assessment-type design areas, such as flood forecasting. Operating mines and underground infrastructure projects collect more instrumentation data than ever before; however, only a small fraction of the useful information is typically extracted for rock engineering design, and there is often insufficient time to investigate complex rock mass phenomena in detail. This paper presents a summary of current practice in rock engineering design, as well as a review of literature and methods at the intersection of machine learning and rock engineering. It identifies gaps, such as standards for architecture, input selection and performance metrics, and areas for future work. These gaps present an opportunity to define a framework for integrating machine learning into conventional rock engineering design methodologies to make them more rigorous and reliable in predicting probable underlying physical mechanics and phenomena.

