Mitigation Techniques to Overcome Data Harm in Model Building for ML

2021 ◽  
Author(s):  
Ayse Arslan

Given the impact of Machine Learning (ML) on individuals and society, understanding how harm might occur throughout the ML life cycle is more critical than ever. By offering a framework for identifying distinct potential sources of downstream harm in the ML pipeline, the paper demonstrates the importance of choices made throughout the distinct phases of data collection, development, and deployment, choices that extend far beyond model training. Relevant mitigation techniques are also suggested as an alternative to relying merely on generic notions of what counts as fairness.

Psychology ◽  
2020 ◽  
Author(s):  
Jeffrey Stanton

The term “data science” refers to an emerging field of research and practice that focuses on obtaining, processing, visualizing, analyzing, preserving, and re-using large collections of information. A related term, “big data,” has been used to refer to one of the important challenges faced by data scientists in many applied environments: the need to analyze large data sources, in certain cases using high-speed, real-time data analysis techniques. Data science encompasses much more than big data, however, as a result of many advancements in cognate fields such as computer science and statistics. Data science has also benefited from the widespread availability of inexpensive computing hardware—a development that has enabled “cloud-based” services for the storage and analysis of large data sets. The techniques and tools of data science have broad applicability in the sciences. Within the field of psychology, data science offers new opportunities for data collection and data analysis that have begun to streamline and augment efforts to investigate the brain and behavior. The tools of data science also enable new areas of research, such as computational neuroscience. As an example of the impact of data science, psychologists frequently use predictive analysis as an investigative tool to probe the relationships between a set of independent variables and one or more dependent variables. While predictive analysis has traditionally been accomplished with techniques such as multiple regression, recent developments in the area of machine learning have put new predictive tools in the hands of psychologists. These machine learning tools relax distributional assumptions and facilitate exploration of non-linear relationships among variables. These tools also enable the analysis of large data sets by opening options for parallel processing. 
In this article, a range of relevant areas from data science is reviewed for applicability to key research problems in psychology including large-scale data collection, exploratory data analysis, confirmatory data analysis, and visualization. This bibliography covers data mining, machine learning, deep learning, natural language processing, Bayesian data analysis, visualization, crowdsourcing, web scraping, open source software, application programming interfaces, and research resources such as journals and textbooks.


2021 ◽  
Vol 10 ◽  
Author(s):  
Ingerid Reinertsen ◽  
D. Louis Collins ◽  
Simon Drouin

With the recent developments in machine learning and modern graphics processing units (GPUs), there is a marked shift in the way intra-operative ultrasound (iUS) images can be processed and presented during surgery. Real-time processing of images to highlight important anatomical structures, combined with in-situ display, has the potential to greatly facilitate the acquisition and interpretation of iUS images when guiding an operation. In order to take full advantage of the recent advances in machine learning, large amounts of high-quality annotated training data are necessary to develop and validate the algorithms. To ensure efficient collection of a sufficient number of patient images and external validity of the models, training data should be collected at several centers by different neurosurgeons, and stored in a standard format directly compatible with the most commonly used machine learning toolkits and libraries. In this paper, we argue that such an effort to collect and organize large-scale multi-center datasets should be based on common open-source software and databases. We first describe the development of existing open-source ultrasound-based neuronavigation systems and how these systems have contributed to enhanced neurosurgical guidance over the last 15 years. We review the impact of the large number of projects worldwide that have benefited from the publicly available datasets “Brain Images of Tumors for Evaluation” (BITE) and “Retrospective evaluation of Cerebral Tumors” (RESECT) that include MR and US data from brain tumor cases. We also describe the need for continuous data collection and how this effort can be organized through the use of a well-adapted and user-friendly open-source software platform that integrates both continually improved guidance and automated data collection functionalities.


2020 ◽  
Vol 10 (11) ◽  
pp. 3874
Author(s):  
Santiago Quintero-Bonilla ◽  
Angel Martín del Rey

An advanced persistent threat (APT) can be defined as a targeted and very sophisticated cyber attack. IT administrators need tools that allow for the early detection of these attacks. Several approaches based on the attack life cycle have been proposed to address this problem. Recently, machine learning techniques have been incorporated into these approaches to improve detection. This paper proposes a new approach to APT detection that uses machine learning techniques and is based on the life cycle of an APT attack. The proposed model is organised into two passive stages and three active stages to adapt the mitigation techniques based on machine learning.


2019 ◽  
Author(s):  
Arthur Porto ◽  
Kjetil L. Voje

Morphometrics has become an indispensable component of the statistical analysis of size and shape variation in biological structures. Morphometric data has traditionally been gathered through low-throughput manual landmark annotation, which represents a significant bottleneck for morphometric-based phenomics. Here we propose a machine-learning-based high-throughput pipeline to collect high-dimensional morphometric data in images of semi-rigid biological structures. The proposed framework has four main strengths. First, it allows for dense phenotyping with minimal impact on specimens. Second, it presents landmarking accuracy comparable to manual annotators, when applied to standardized datasets. Third, it performs data collection at speeds several orders of magnitude higher than manual annotators. And finally, it is of general applicability (i.e., not tied to a specific study system). State-of-the-art validation procedures show that the method achieves low error levels when applied to three morphometric datasets of increasing complexity, with error varying from 0.5% to 2% of the structure’s length in the automated placement of landmarks. As a benchmark for the speed of the entire automated landmarking pipeline, our framework places 23 landmarks on 13,686 objects (zooids) detected in 1684 pictures of fossil bryozoans in 3.12 minutes using a personal computer. The proposed machine-learning-based phenotyping pipeline can greatly increase the scale, reproducibility and speed of data collection within biological research. To aid the use of the framework, we have developed a file conversion algorithm that can be used to leverage current morphometric datasets for automation, allowing the entire procedure, from model training all the way to prediction, to be performed in a matter of hours.
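
The abstract reports landmark error as a percentage of the structure's length but does not give the exact formula. One plausible reading, sketched below purely as an illustration, is the mean Euclidean distance between automated and manual landmarks divided by the structure length. The function name `percent_landmark_error` and the sample coordinates are hypothetical and are not taken from the paper.

```python
import math

def percent_landmark_error(predicted, manual, structure_length):
    """Mean landmark displacement as a percentage of structure length.

    Assumed definition (not confirmed by the source): average Euclidean
    distance between matching landmarks, divided by the structure length.
    """
    if len(predicted) != len(manual) or not predicted:
        raise ValueError("need two non-empty landmark lists of equal length")
    if structure_length <= 0:
        raise ValueError("structure_length must be positive")
    # math.dist computes the Euclidean distance between two points (Python 3.8+).
    distances = [math.dist(p, m) for p, m in zip(predicted, manual)]
    mean_distance = sum(distances) / len(distances)
    return 100.0 * mean_distance / structure_length

# Hypothetical pixel coordinates for two landmarks on a 100-pixel structure.
error = percent_landmark_error([(0, 0), (3, 4)], [(0, 0), (0, 0)], 100.0)
```

Normalising by structure length makes errors comparable across images taken at different magnifications, which is presumably why the abstract reports error in those units.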


2020 ◽  
Vol 12 (6) ◽  
pp. 934 ◽  
Author(s):  
Eriita G. Jones ◽  
Sebastien Wong ◽  
Anthony Milton ◽  
Joseph Sclauzero ◽  
Holly Whittenbury ◽  
...  

Precision viticulture benefits from the accurate detection of vineyard vegetation from remote sensing, without a priori knowledge of vine locations. Vineyard detection enables efficient, and potentially automated, derivation of spatial measures such as length and area of crop, and hence required volumes of water, fertilizer, and other resources. Machine learning techniques have provided significant advancements in recent years in the areas of image segmentation, classification, and object detection, with neural networks shown to perform well in the detection of vineyards and other crops. However, what has not been extensively quantitatively examined is the extent to which the initial choice of input imagery impacts detection/segmentation accuracy. Here, we use a standard deep convolutional neural network (CNN) to detect and segment vineyards across Australia using DigitalGlobe Worldview-2 images at ∼50 cm (panchromatic) and ∼2 m (multispectral) spatial resolution. A quantitative assessment of the variation in model performance with input parameters during model training is presented from a remote sensing perspective, with combinations of panchromatic, multispectral, pan-sharpened multispectral, and the spectral Normalised Difference Vegetation Index (NDVI) considered. The impact of image acquisition parameters—namely, the off-nadir angle and solar elevation angle—on the quality of pan-sharpening is also assessed. The results are synthesised into a ‘recipe’ for optimising the accuracy of vineyard segmentation, which can provide a guide to others aiming to implement or improve automated crop detection and classification.
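
For readers unfamiliar with the index named above, NDVI is conventionally computed per pixel as (NIR - Red) / (NIR + Red). The following is a minimal sketch of that standard formula, not the authors' processing code; the function name `ndvi` and the sample reflectance values are illustrative assumptions.

```python
def ndvi(nir, red):
    """Return per-pixel NDVI = (NIR - Red) / (NIR + Red) for two equal-length bands."""
    if len(nir) != len(red):
        raise ValueError("bands must have the same number of pixels")
    result = []
    for n, r in zip(nir, red):
        total = n + r
        # Guard against division by zero where both bands are zero.
        result.append(0.0 if total == 0 else (n - r) / total)
    return result

# Hypothetical reflectance values: a vegetated pixel, a bare-soil pixel, an empty pixel.
values = ndvi([0.5, 0.3, 0.0], [0.1, 0.3, 0.0])
```

Healthy vegetation reflects strongly in the near infrared and absorbs red light, so vegetated pixels yield values closer to 1 while bare surfaces sit near 0.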


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3131 ◽  
Author(s):  
Müşerref Duygu Saçar Demirci ◽  
Jens Allmer

Gene regulation modulates RNA expression via transcription factors. Post-transcriptional gene regulation in turn influences the amount of protein product through, for example, microRNAs (miRNAs). Experimental establishment of miRNAs and their effects is complicated and even futile when aiming to establish the entirety of miRNA target interactions. Therefore, computational approaches have been proposed. Many such tools rely on machine learning (ML), which involves example selection, feature extraction, model training, algorithm selection, and parameter optimization. Different ML algorithms have been used for model training on various example sets, more than 1,000 features describing pre-miRNAs have been proposed, and different training and testing schemes have been used for model establishment. For pre-miRNA detection, negative examples cannot easily be established, causing a problem for two-class classification algorithms. There is also no consensus on what ML approach works best and, therefore, we set forth and established the impact of the different parts involved in ML on model performance. Furthermore, we established two new negative datasets and analyzed the impact of using them for training and testing. It was our aim to attach an order of importance to the parts involved in ML for pre-miRNA detection, but instead we found that all parts are intricately connected and their contributions cannot be easily untangled, leading us to suggest that, when attempting ML-based pre-miRNA detection, many scenarios need to be explored.


2020 ◽  
Vol 39 (5) ◽  
pp. 6579-6590
Author(s):  
Sandy Çağlıyor ◽  
Başar Öztayşi ◽  
Selime Sezgin

The motion picture industry is one of the largest industries worldwide and has significant importance in the global economy. Considering the high stakes and high risks in the industry, forecast models and decision support systems are gaining importance. Several attempts have been made to estimate the theatrical performance of a movie before or at the early stages of its release. Nevertheless, these models are mostly used for predicting domestic performance, and the industry still struggles to predict box office performance in overseas markets. In this study, the aim is to design a forecast model using different machine learning algorithms to estimate the theatrical success of US movies in Turkey. A dataset of 1559 movies is constructed from various sources. First, the independent variables are grouped as pre-release, distributor type, and international distribution based on their characteristics. The number of attendances is discretized into three classes. Four popular machine learning algorithms (artificial neural networks, decision tree regression, gradient boosting trees, and random forests) are employed, and the impact of each variable group is observed by comparing the performance of the resulting models. Then the number of target classes is increased to five and eight, and the results are compared with models previously developed in the literature.
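
The abstract does not say how attendance is discretized into three, five, or eight classes. As one possible illustration, the sketch below uses rank-based equal-frequency binning; this choice, the function name `quantile_labels`, and the attendance figures are all assumptions rather than details from the study.

```python
def quantile_labels(values, k):
    """Assign each value a class in 0..k-1 so classes hold roughly equal counts.

    Assumed scheme (not confirmed by the source): equal-frequency binning
    based on the rank of each value.
    """
    n = len(values)
    if k < 1 or n == 0:
        raise ValueError("need at least one class and one value")
    # Indices of the values in ascending order of attendance.
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, idx in enumerate(order):
        # rank < n, so rank * k // n always falls in 0..k-1.
        labels[idx] = rank * k // n
    return labels

# Hypothetical attendance counts for six movies, split into three classes.
labels = quantile_labels([1200, 50000, 300, 9800, 760000, 15000], 3)
```

Calling the same function with `k=5` or `k=8` mirrors the later step in which the number of target classes is increased, although the study may have used fixed attendance thresholds instead.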


Author(s):  
Siti Mariana Ulfa

Humans need social interaction with others and can use more than one language to communicate. When more than one language is in use, the languages come into contact. One obvious form of language contact is interference, which can occur at all levels of life. This study, Indonesian Language Interference in Learning PPL Basic Thailand Unhasy Students, examines the forms of interference that occur among Thai students conducting teaching practice in the classroom. It is a descriptive qualitative study that seeks to describe any interference occurring in the speech of Thai students during teaching practice. The data collection methods are (1) observation, (2) audio-visual recording using CCTV, and (3) recording of all data obtained. Data validity is checked through (1) data triangulation, (2) increased perseverance, and (3) peer review through discussion. The data analysis techniques are (1) data collection, (2) data reduction, (3) data presentation, and (4) drawing conclusions. The interference observed includes (1) interference in the phonological system, (2) interference in the morphological system, and (3) interference in the syntactic system.

