Mitigation Techniques to Overcome Data Harm in Model Building for ML

Mapping Intimacies ◽

10.5121/csit.2021.111916 ◽

2021 ◽

Author(s):

Ayse Arslan

Keyword(s):

Machine Learning ◽

Life Cycle ◽

Data Collection ◽

Model Building ◽

Collection Development ◽

Potential Sources ◽

Mitigation Techniques ◽

Model Training ◽

The Impact

Given the impact of Machine Learning (ML) on individuals and society, understanding how harm might occur throughout the ML life cycle is more critical than ever. By offering a framework for identifying distinct potential sources of downstream harm in the ML pipeline, the paper demonstrates the importance of choices made throughout the distinct phases of data collection, development, and deployment, which extend far beyond model training. Relevant mitigation techniques are also suggested, to be used instead of merely relying on generic notions of what counts as fairness.

Data Science Methods for Psychology

Psychology ◽

10.1093/obo/9780199828340-0259 ◽

2020 ◽

Author(s):

Jeffrey Stanton

Keyword(s):

Machine Learning ◽

Big Data ◽

Data Analysis ◽

Data Collection ◽

Data Science ◽

Large Data ◽

Large Data Sets ◽

Predictive Analysis ◽

Data Sets ◽

The Impact

The term “data science” refers to an emerging field of research and practice that focuses on obtaining, processing, visualizing, analyzing, preserving, and re-using large collections of information. A related term, “big data,” has been used to refer to one of the important challenges faced by data scientists in many applied environments: the need to analyze large data sources, in certain cases using high-speed, real-time data analysis techniques. Data science encompasses much more than big data, however, as a result of many advancements in cognate fields such as computer science and statistics. Data science has also benefited from the widespread availability of inexpensive computing hardware—a development that has enabled “cloud-based” services for the storage and analysis of large data sets. The techniques and tools of data science have broad applicability in the sciences. Within the field of psychology, data science offers new opportunities for data collection and data analysis that have begun to streamline and augment efforts to investigate the brain and behavior. The tools of data science also enable new areas of research, such as computational neuroscience. As an example of the impact of data science, psychologists frequently use predictive analysis as an investigative tool to probe the relationships between a set of independent variables and one or more dependent variables. While predictive analysis has traditionally been accomplished with techniques such as multiple regression, recent developments in the area of machine learning have put new predictive tools in the hands of psychologists. These machine learning tools relax distributional assumptions and facilitate exploration of non-linear relationships among variables. These tools also enable the analysis of large data sets by opening options for parallel processing. 
In this article, a range of relevant areas from data science is reviewed for applicability to key research problems in psychology including large-scale data collection, exploratory data analysis, confirmatory data analysis, and visualization. This bibliography covers data mining, machine learning, deep learning, natural language processing, Bayesian data analysis, visualization, crowdsourcing, web scraping, open source software, application programming interfaces, and research resources such as journals and textbooks.

Understanding Potential Sources of Harm throughout the Machine Learning Life Cycle

10.21428/2c646de5.c16a07bb ◽

2021 ◽

Author(s):

Harini Suresh ◽

John Guttag

Keyword(s):

Machine Learning ◽

Life Cycle ◽

Potential Sources

The Essential Role of Open Data and Software for the Future of Ultrasound-Based Neuronavigation

Frontiers in Oncology ◽

10.3389/fonc.2020.619274 ◽

2021 ◽

Vol 10 ◽

Author(s):

Ingerid Reinertsen ◽

D. Louis Collins ◽

Simon Drouin

Keyword(s):

Machine Learning ◽

Data Collection ◽

Open Source ◽

Open Source Software ◽

Graphics Processing Units ◽

Large Scale ◽

Training Data ◽

Standard Format ◽

Real Time Processing ◽

The Impact

With the recent developments in machine learning and modern graphics processing units (GPUs), there is a marked shift in the way intra-operative ultrasound (iUS) images can be processed and presented during surgery. Real-time processing of images to highlight important anatomical structures, combined with in-situ display, has the potential to greatly facilitate the acquisition and interpretation of iUS images when guiding an operation. To take full advantage of the recent advances in machine learning, large amounts of high-quality annotated training data are necessary to develop and validate the algorithms. To ensure efficient collection of a sufficient number of patient images and external validity of the models, training data should be collected at several centers by different neurosurgeons, and stored in a standard format directly compatible with the most commonly used machine learning toolkits and libraries. In this paper, we argue that such an effort to collect and organize large-scale multi-center datasets should be based on common open source software and databases. We first describe the development of existing open-source ultrasound-based neuronavigation systems and how these systems have contributed to enhanced neurosurgical guidance over the last 15 years. We review the impact of the large number of projects worldwide that have benefited from the publicly available datasets “Brain Images of Tumors for Evaluation” (BITE) and “Retrospective evaluation of Cerebral Tumors” (RESECT) that include MR and US data from brain tumor cases. We also describe the need for continuous data collection and how this effort can be organized through the use of a well-adapted and user-friendly open-source software platform that integrates both continually improved guidance and automated data collection functionalities.

A New Proposal on the Advanced Persistent Threat: A Survey

Applied Sciences ◽

10.3390/app10113874 ◽

2020 ◽

Vol 10 (11) ◽

pp. 3874

Author(s):

Santiago Quintero-Bonilla ◽

Angel Martín del Rey

Keyword(s):

Machine Learning ◽

Life Cycle ◽

Early Detection ◽

Machine Learning Techniques ◽

Cyber Attack ◽

New Approach ◽

Advanced Persistent Threat ◽

Learning Techniques ◽

Proposed Model ◽

Mitigation Techniques

An advanced persistent threat (APT) can be defined as a targeted and very sophisticated cyber attack. IT administrators need tools that allow for the early detection of these attacks. Several approaches based on the attack life cycle have been proposed to address this problem. Recently, machine learning techniques have been incorporated into these approaches to improve detection. This paper proposes a new approach to APT detection that uses machine learning techniques and is based on the life cycle of an APT attack. The proposed model is organised into two passive stages and three active stages to adapt the mitigation techniques based on machine learning.

ML-morph: A Fast, Accurate and General Approach for Automated Detection and Landmarking of Biological Structures in Images

10.1101/769075 ◽

2019 ◽

Author(s):

Arthur Porto ◽

Kjetil L. Voje

Keyword(s):

Machine Learning ◽

Data Collection ◽

Biological Research ◽

List Type ◽

Shape Variation ◽

Morphometric Data ◽

General Applicability ◽

Minimal Impact ◽

Biological Structures ◽

Model Training

Morphometrics has become an indispensable component of the statistical analysis of size and shape variation in biological structures. Morphometric data has traditionally been gathered through low-throughput manual landmark annotation, which represents a significant bottleneck for morphometric-based phenomics. Here we propose a machine-learning-based high-throughput pipeline to collect high-dimensional morphometric data in images of semi-rigid biological structures. The proposed framework has four main strengths. First, it allows for dense phenotyping with minimal impact on specimens. Second, it presents landmarking accuracy comparable to manual annotators, when applied to standardized datasets. Third, it performs data collection at speeds several orders of magnitude higher than manual annotators. And finally, it is of general applicability (i.e., not tied to a specific study system). State-of-the-art validation procedures show that the method achieves low error levels when applied to three morphometric datasets of increasing complexity, with error varying from 0.5% to 2% of the structure’s length in the automated placement of landmarks. As a benchmark for the speed of the entire automated landmarking pipeline, our framework places 23 landmarks on 13,686 objects (zooids) detected in 1684 pictures of fossil bryozoans in 3.12 minutes using a personal computer. The proposed machine-learning-based phenotyping pipeline can greatly increase the scale, reproducibility and speed of data collection within biological research. To aid the use of the framework, we have developed a file conversion algorithm that can be used to leverage current morphometric datasets for automation, allowing the entire procedure, from model training all the way to prediction, to be performed in a matter of hours.
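
The abstract reports landmarking error as a percentage of the structure's length. The following is a minimal sketch of one way such a figure could be computed. It assumes the error is the mean Euclidean distance between automated and manual landmarks, which the abstract does not state; the function name and the sample coordinates are hypothetical.

```python
import math


def mean_landmark_error_pct(predicted, reference, structure_length):
    """Mean landmark error as a percentage of the structure's length.

    Assumption (not stated in the abstract): error is the mean Euclidean
    distance between each predicted landmark and its manual counterpart.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("landmark lists must be non-empty and the same length")
    if structure_length <= 0:
        raise ValueError("structure_length must be positive")
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    return 100.0 * (sum(distances) / len(distances)) / structure_length


# Hypothetical pixel coordinates for two landmarks on a 250-pixel structure.
automated = [(0.0, 0.0), (3.0, 4.0)]
manual = [(0.0, 0.0), (0.0, 0.0)]
error_pct = mean_landmark_error_pct(automated, manual, structure_length=250.0)
```

Normalising by structure length makes errors comparable across images taken at different scales, which is presumably why the abstract reports them this way.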

The Impact of Pan-Sharpening and Spectral Resolution on Vineyard Segmentation through Machine Learning

Remote Sensing ◽

10.3390/rs12060934 ◽

2020 ◽

Vol 12 (6) ◽

pp. 934 ◽

Cited By ~ 3

Author(s):

Eriita G. Jones ◽

Sebastien Wong ◽

Anthony Milton ◽

Joseph Sclauzero ◽

Holly Whittenbury ◽

...

Keyword(s):

Machine Learning ◽

Remote Sensing ◽

Vegetation Index ◽

A Priori ◽

Elevation Angle ◽

Model Performance ◽

Machine Learning Techniques ◽

Normalised Difference Vegetation Index ◽

Model Training ◽

The Impact

Precision viticulture benefits from the accurate detection of vineyard vegetation from remote sensing, without a priori knowledge of vine locations. Vineyard detection enables efficient, and potentially automated, derivation of spatial measures such as length and area of crop, and hence required volumes of water, fertilizer, and other resources. Machine learning techniques have provided significant advancements in recent years in the areas of image segmentation, classification, and object detection, with neural networks shown to perform well in the detection of vineyards and other crops. However, what has not been extensively quantitatively examined is the extent to which the initial choice of input imagery impacts detection/segmentation accuracy. Here, we use a standard deep convolutional neural network (CNN) to detect and segment vineyards across Australia using DigitalGlobe Worldview-2 images at ∼50 cm (panchromatic) and ∼2 m (multispectral) spatial resolution. A quantitative assessment of the variation in model performance with input parameters during model training is presented from a remote sensing perspective, with combinations of panchromatic, multispectral, pan-sharpened multispectral, and the spectral Normalised Difference Vegetation Index (NDVI) considered. The impact of image acquisition parameters—namely, the off-nadir angle and solar elevation angle—on the quality of pan-sharpening is also assessed. The results are synthesised into a ‘recipe’ for optimising the accuracy of vineyard segmentation, which can provide a guide to others aiming to implement or improve automated crop detection and classification.
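
For readers unfamiliar with the index named in the abstract, NDVI is conventionally defined per pixel as (NIR - Red) / (NIR + Red). The sketch below illustrates that standard definition only; it is not the authors' processing chain, and the function name, the band values, and the choice to return 0.0 where both bands are zero are assumptions made for illustration.

```python
def ndvi(nir_band, red_band):
    """Per-pixel NDVI for two equally sized bands given as lists of rows.

    NDVI = (NIR - Red) / (NIR + Red). Pixels where NIR + Red is zero are
    set to 0.0 here, which is an illustrative choice, not a standard.
    """
    if len(nir_band) != len(red_band):
        raise ValueError("bands must have the same number of rows")
    result = []
    for nir_row, red_row in zip(nir_band, red_band):
        if len(nir_row) != len(red_row):
            raise ValueError("bands must have the same number of columns")
        row = []
        for nir, red in zip(nir_row, red_row):
            total = nir + red
            row.append((nir - red) / total if total != 0 else 0.0)
        result.append(row)
    return result


# Hypothetical reflectance values: one vegetated pixel and one bare pixel.
index = ndvi([[0.5, 0.2]], [[0.1, 0.2]])
```

Healthy vegetation reflects strongly in the near infrared and absorbs red light, so vegetated pixels score closer to 1 while bare soil scores near 0.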

Delineating the impact of machine learning elements in pre-microRNA detection

PeerJ ◽

10.7717/peerj.3131 ◽

2017 ◽

Vol 5 ◽

pp. e3131 ◽

Cited By ~ 6

Author(s):

Müşerref Duygu Saçar Demirci ◽

Jens Allmer

Keyword(s):

Machine Learning ◽

Gene Regulation ◽

Model Performance ◽

Protein Product ◽

Algorithm Selection ◽

Microrna Detection ◽

Mirna Detection ◽

Model Training ◽

Transcriptional Gene Regulation ◽

The Impact

Gene regulation modulates RNA expression via transcription factors. Post-transcriptional gene regulation in turn influences the amount of protein product through, for example, microRNAs (miRNAs). Experimental establishment of miRNAs and their effects is complicated and even futile when aiming to establish the entirety of miRNA target interactions. Therefore, computational approaches have been proposed. Many such tools rely on machine learning (ML), which involves example selection, feature extraction, model training, algorithm selection, and parameter optimization. Different ML algorithms have been used for model training on various example sets, more than 1,000 features describing pre-miRNAs have been proposed, and different training and testing schemes have been used for model establishment. For pre-miRNA detection, negative examples cannot easily be established, causing a problem for two-class classification algorithms. There is also no consensus on which ML approach works best and, therefore, we set out to establish the impact of the different parts involved in ML on model performance. Furthermore, we established two new negative datasets and analyzed the impact of using them for training and testing. It was our aim to attach an order of importance to the parts involved in ML for pre-miRNA detection, but instead we found that all parts are intricately connected and their contributions cannot be easily untangled, leading us to suggest that when attempting ML-based pre-miRNA detection many scenarios need to be explored.

Forecasting US movies box office performances in Turkey using machine learning algorithms

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189120 ◽

2020 ◽

Vol 39 (5) ◽

pp. 6579-6590

Author(s):

Sandy Çağlıyor ◽

Başar Öztayşi ◽

Selime Sezgin

Keyword(s):

Machine Learning ◽

Global Economy ◽

Learning Algorithms ◽

Forecast Model ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

High Stakes ◽

Box Office ◽

Industry Forecast ◽

The Impact

The motion picture industry is one of the largest industries worldwide and has significant importance in the global economy. Considering the high stakes and high risks in the industry, forecast models and decision support systems are gaining importance. Several attempts have been made to estimate the theatrical performance of a movie before or at the early stages of its release. Nevertheless, these models are mostly used for predicting domestic performances and the industry still struggles to predict box office performances in overseas markets. In this study, the aim is to design a forecast model using different machine learning algorithms to estimate the theatrical success of US movies in Turkey. A dataset of 1559 movies is constructed from various sources. First, independent variables are grouped as pre-release, distributor type, and international distribution based on their characteristics. The number of attendances is discretized into three classes. Four popular machine learning algorithms (artificial neural networks, decision tree regression, gradient boosting tree, and random forest) are employed, and the impact of each group is observed by comparing the performance of the models. Then the number of target classes is increased to five and eight, and the results are compared with the previously developed models in the literature.
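
The abstract says attendance counts are discretized into three classes, and later five and eight, but does not say how the class boundaries are chosen. The sketch below assumes equal-frequency (quantile) binning purely for illustration; the function names and the attendance figures are hypothetical.

```python
from bisect import bisect_right


def quantile_cut_points(values, n_classes):
    """Return n_classes - 1 cut points that split values into
    roughly equal-sized groups (assumed binning scheme)."""
    if n_classes < 2 or len(values) < n_classes:
        raise ValueError("need at least two classes and one value per class")
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_classes] for k in range(1, n_classes)]


def discretize(value, cut_points):
    """Map a value to a class index: 0 is the lowest attendance class."""
    return bisect_right(cut_points, value)


# Hypothetical attendance figures for six films.
attendance = [100, 250, 400, 1200, 5000, 90000]
cuts = quantile_cut_points(attendance, n_classes=3)
labels = [discretize(a, cuts) for a in attendance]
```

Changing `n_classes` to 5 or 8 gives the finer-grained targets mentioned in the abstract; the resulting labels could then be passed to any classifier.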

INDONESIA LANGUAGE INTERFERENCE FORM IN BASIC PPL LEARNING THAILAND UNHASY COLLEGE SYUDENTS

SASTRANESIA Jurnal Program Studi Pendidikan Bahasa dan Sastra Indonesia ◽

10.32682/sastranesia.v8i2.1438 ◽

2020 ◽

Vol 8 (2) ◽

pp. 38

Author(s):

Siti Mariana Ulfa

Keyword(s):

Qualitative Research ◽

Data Analysis ◽

Data Collection ◽

Teaching Practice ◽

Data Presentation ◽

Analysis Techniques ◽

Thai Students ◽

The Impact ◽

Collection Methods ◽

Recording Techniques

Humans need social interaction with others, and they can use more than one language to communicate. One consequence of using more than one language is contact between languages, and one obvious form of such contact is interference. Interference can occur at all levels of life. This study, Indonesian Language Interference in Learning PPL Basic Thailand Unhasy Students, examines the forms of interference that occur among Thai students conducting teaching practice in the classroom. It is descriptive qualitative research that seeks to describe any interference occurring in the speech of Thai students during teaching practice. The data collection methods are (1) observation techniques, (2) audio-visual recording techniques using CCTV, and (3) recording techniques, by recording all data that has been obtained. Data validity is checked through (1) data triangulation, (2) increased perseverance, and (3) peer review through discussion. The data analysis techniques are (1) data collection, (2) data reduction, (3) data presentation, and (4) conclusions. The interference observed includes (1) interference in phonological systems, (2) interference in morphological systems, and (3) interference in syntactic systems.
