Optimization of Diabetes Training Data Using Machine Learning Algorithms

2018 ◽  
Vol 6 (2) ◽  
pp. 283-286
Author(s):  
M. Samba Siva Rao ◽  
M. Yaswanth ◽  
K. Raghavendra Swamy ◽  
...  
2021 ◽  
Vol 99 (Supplement_3) ◽  
pp. 264-265
Author(s):  
Duy Ngoc Do ◽  
Guoyu Hu ◽  
Younes Miar

Abstract American mink (Neovison vison) is the major source of fur for the fur industry worldwide, and Aleutian disease (AD) is causing severe financial losses to the mink industry. Different methods have been used to diagnose AD in mink, but a combination of several methods can be the most appropriate approach for the selection of AD-resilient mink. The iodine agglutination test (IAT) and counterimmunoelectrophoresis (CIEP) are commonly employed in test-and-remove strategies, while enzyme-linked immunosorbent assay (ELISA) and packed-cell volume (PCV) methods are complementary. However, using multiple methods is expensive, which hinders the proper use of AD tests in selection. This research presents an assessment of AD classification based on machine learning algorithms. Aleutian disease was tested on 1,830 individuals using these tests on an AD-positive mink farm (Canadian Centre for Fur Animal Research, NS, Canada). The accuracy of classifying the CIEP result was evaluated using sex information and the IAT, ELISA, and PCV test results as predictors in seven machine learning classification algorithms (Random Forest, Artificial Neural Networks, C50Tree, Naive Bayes, Generalized Linear Models, Boost, and Linear Discriminant Analysis), implemented with the caret package in R. The accuracy of prediction varied among the methods. Overall, Random Forest was the best-performing algorithm for the current dataset, with an accuracy of 0.89 on the training data and 0.94 on the testing data. Our work demonstrates the utility and relative ease of using machine learning algorithms to assess the CIEP information, and consequently to reduce the cost of AD tests. However, further work requires the inclusion of production and reproduction information in the models and an extension of phenotypic collection to increase the accuracy of the current methods.
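As a rough illustration of the classification workflow described above (the study itself used the caret package in R), a scikit-learn sketch might look like the following. The file name, column layout, and train/test split are assumptions for illustration, not details from the paper:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical layout: sex plus IAT, ELISA, and PCV results as predictors,
# with the CIEP result as the class label.
data = pd.read_csv("mink_ad_tests.csv")            # assumed file name
X = pd.get_dummies(data[["sex", "IAT", "ELISA", "PCV"]])
y = data["CIEP"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Random Forest was the study's best performer (0.89 train / 0.94 test).
model = RandomForestClassifier(n_estimators=500, random_state=42)
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```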


Diagnostics ◽  
2019 ◽  
Vol 9 (3) ◽  
pp. 104 ◽  
Author(s):  
Ahmed ◽  
Yigit ◽  
Isik ◽  
Alpkocak

Leukemia is a fatal cancer with two main types, acute and chronic, each of which has two subtypes, lymphoid and myeloid; in total, there are four subtypes of leukemia. This study proposes a new approach for diagnosing all subtypes of leukemia from microscopic blood cell images using convolutional neural networks (CNN), which require a large training data set. Therefore, we also investigated the effect of data augmentation, which synthetically increases the number of training samples. We used two publicly available leukemia data sources: ALL-IDB and the ASH Image Bank. We applied seven different image transformation techniques as data augmentation and designed a CNN architecture capable of recognizing all subtypes of leukemia. In addition, we explored other well-known machine learning algorithms such as naive Bayes, support vector machines, k-nearest neighbors, and decision trees. To evaluate our approach, we set up a series of experiments using 5-fold cross-validation. The experimental results showed that our CNN model achieves 88.25% accuracy for leukemia-versus-healthy classification and 81.74% for multiclass classification of all subtypes. Finally, we also showed that the CNN model performs better than the other well-known machine learning algorithms.
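A minimal sketch of the augmentation-plus-CNN idea is shown below, assuming Keras and 128x128 RGB blood-cell images. The layer sizes and transformations are illustrative and do not reproduce the paper's seven augmentation techniques or its architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# On-the-fly augmentation: a few of the kinds of image transformations
# typically used to enlarge a small training set synthetically.
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomTranslation(0.1, 0.1),
])

model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),
    augment,                                  # applied during training only
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(4, activation="softmax"),    # four leukemia subtypes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```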


2020 ◽  
Author(s):  
Hanna Meyer ◽  
Edzer Pebesma

Spatial mapping is an important task in environmental science for revealing spatial patterns and changes in the environment. In this context, predictive modelling using flexible machine learning algorithms has become very popular. However, looking at the diversity of modelled (global) maps of environmental variables, one might increasingly get the impression that machine learning is a magic tool to map everything. Recently, the reliability of such maps has been increasingly questioned, calling for a reliable quantification of uncertainties.

Although spatial (cross-)validation provides a general error estimate for the predictions, models are usually applied to make predictions for a much larger area, or might even be transferred to areas they were not trained on. When predictions are made across heterogeneous landscapes, there will be areas featuring environmental properties that were not observed in the training data and hence were not learned by the algorithm. This is problematic, as most machine learning algorithms are weak at extrapolation and can only make reliable predictions for environments with conditions the model has knowledge about. Hence, predictions for environmental conditions that differ significantly from the training data have to be considered uncertain.

To approach this problem, we suggest a measure of uncertainty that allows identifying locations where predictions should be regarded with care. The proposed uncertainty measure is based on distances to the training data in the multidimensional predictor variable space. However, distances are not equally relevant within the feature space: some variables are more important than others in the machine learning model and hence are mainly responsible for prediction patterns. Therefore, we weight the distances by the model-derived importance of the predictors.

As a case study we use a simulated area-wide response variable for Europe, bio-climatic variables as predictors, and simulated field samples. Random Forest is applied as the algorithm to predict the simulated response. The model is then used to make predictions for the whole of Europe. We then calculate the corresponding uncertainty and compare it to the area-wide true prediction error. The results show that the uncertainty map reflects the patterns in the true error very well and considerably outperforms ensemble-based standard deviations of predictions as an indicator of uncertainty.

The resulting map of uncertainty gives valuable insights into the spatial patterns of prediction uncertainty, which is important when the predictions are used as a baseline for decision making or subsequent environmental modelling. Hence, we suggest that a map of distance-based uncertainty be provided in addition to prediction maps.
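The core idea (nearest-neighbour distance in predictor space, with axes weighted by model-derived variable importance) can be sketched as follows. This is an illustrative reimplementation on synthetic data, not the authors' reference code:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.uniform(size=(300, 5))                  # stand-in for bio-climatic predictors
y_train = X_train[:, 0] + 0.2 * rng.normal(size=300)  # simulated response
X_grid = rng.uniform(size=(10000, 5))                 # stand-in for the prediction area

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)

# Scale each predictor axis by its model-derived importance, then take each
# prediction location's distance to its nearest training sample.
w = rf.feature_importances_ / rf.feature_importances_.sum()
tree = cKDTree(X_train * w)
uncertainty, _ = tree.query(X_grid * w, k=1)
# Large values flag locations whose predictor combinations were not covered
# by the training data, i.e. where predictions warrant caution.
```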


Author(s):  
James A. Tallman ◽  
Michal Osusky ◽  
Nick Magina ◽  
Evan Sewall

Abstract This paper provides an assessment of three different machine learning techniques for accurately reproducing a distributed temperature prediction of a high-pressure turbine airfoil. A three-dimensional finite element analysis (FEA) thermal model of a cooled turbine airfoil was solved repeatedly (200 instances) for various operating-point settings of the corresponding gas turbine engine. The response surface created by the repeated solutions was fed into three machine learning algorithms, and surrogate model representations of the FEA model's response were generated. The machine learning algorithms investigated were a Gaussian Process, a Boosted Decision Tree, and an Artificial Neural Network. Additionally, a simple Linear Regression surrogate model was created for comparative purposes. The Artificial Neural Network model proved to be the most successful at reproducing the FEA model over the range of operating points. The mean and standard deviation of the differences between the FEA and Neural Network models were 15% and 14% of a desired accuracy threshold, respectively. The Digital Thread for Design (DT4D) was used to expedite all model execution and machine learning training. A description of DT4D is also provided.
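The surrogate-modelling step can be sketched with one of the three techniques investigated, a Gaussian Process fitted to (operating point, temperature) samples from a solver. The data below are synthetic stand-ins; the paper's FEA model and DT4D tooling are not reproduced:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 3))            # 200 operating-point settings (as in the paper)
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2    # stand-in for an FEA temperature response

# Fit the surrogate; the trained GP replaces repeated, expensive FEA solves.
gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
gp.fit(X, y)
mean, std = gp.predict(rng.uniform(size=(10, 3)), return_std=True)
```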


Author(s):  
Stylianos Chatzidakis ◽  
Miltiadis Alamaniotis ◽  
Lefteri H. Tsoukalas

Creep rupture is increasingly becoming one of the most important problems affecting the behavior and performance of power production systems operating in high-temperature environments, potentially under irradiation, as is the case in nuclear reactors. Creep rupture forecasting and estimation of the useful life are required to avoid unanticipated component failure and cost-ineffective operation. Despite rigorous investigations of creep mechanisms and their effect on component lifetime, experimental data are sparse, rendering time-to-rupture prediction a rather difficult problem. An approach for performing creep rupture forecasting that exploits the unique characteristics of machine learning algorithms is proposed herein. The approach seeks to introduce a mechanism that synergistically exploits recent findings in creep rupture with the state-of-the-art computational paradigm of machine learning. In this study, three machine learning algorithms, namely General Regression Neural Networks, Artificial Neural Networks, and Gaussian Processes, were employed to capture the underlying trends and provide creep rupture forecasting. The implementation is demonstrated and evaluated on actual experimental creep rupture data. Results show that the Gaussian process model based on the Matérn kernel achieved the best overall prediction performance (56.38%). Significant dependencies exist on the number of training data, neural network size, kernel selection, and whether interpolation or extrapolation is performed.
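A minimal sketch of the best-performing configuration named above, Gaussian Process regression with a Matérn kernel, is shown below on synthetic creep-like data; the feature choice and units are assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(2)
# Hypothetical features: applied stress [MPa] and temperature [K].
X = rng.uniform([100, 700], [300, 1000], size=(60, 2))
# Synthetic log time-to-rupture, decreasing with stress and temperature.
log_t_rupture = 10 - 0.02 * X[:, 0] - 0.005 * X[:, 1] + rng.normal(0, 0.1, 60)

gp = GaussianProcessRegressor(kernel=Matern(nu=1.5), normalize_y=True)
gp.fit(X, log_t_rupture)
# Interpolation case: a query point inside the training range.
pred, std = gp.predict([[200.0, 850.0]], return_std=True)
```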


Geophysics ◽  
2019 ◽  
Vol 84 (1) ◽  
pp. V67-V79 ◽  
Author(s):  
Yazeed Alaudah ◽  
Motaz Alfarraj ◽  
Ghassan AlRegib

Recently, there has been significant interest in various supervised machine learning techniques that can help reduce the time and effort consumed by manual interpretation workflows. However, most successful supervised machine learning algorithms require huge amounts of annotated training data. Obtaining these labels for large seismic volumes is a very time-consuming and laborious task. We have addressed this problem by presenting a weakly supervised approach for predicting the labels of various seismic structures. By having an interpreter select a very small number of exemplar images for every class of subsurface structures, we use a novel similarity-based retrieval technique to extract thousands of images that contain similar subsurface structures from the seismic volume. By assuming that similar images belong to the same class, we obtain thousands of image-level labels for these images; we validate this assumption. We have evaluated a novel weakly supervised algorithm for mapping these rough image-level labels into more accurate pixel-level labels that localize the different subsurface structures within the image. This approach dramatically simplifies the process of obtaining labeled data for training supervised machine learning algorithms on seismic interpretation tasks. Using our method, we generate thousands of automatically labeled images from the Netherlands Offshore F3 block with reasonably accurate pixel-level labels. We believe that this work will allow for more advances in machine learning-enabled seismic interpretation.
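The similarity-based retrieval step, assigning each unlabeled image the class of its most similar interpreter-chosen exemplar, might be sketched as follows; the feature embedding, class list, and confidence filter are placeholders, not the authors' novel technique:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
exemplar_feats = rng.normal(size=(20, 64))     # features of interpreter-chosen exemplars
exemplar_labels = rng.integers(0, 4, size=20)  # hypothetical structure classes
patch_feats = rng.normal(size=(5000, 64))      # features of patches from the seismic volume

# Retrieve the most similar exemplar for every patch and inherit its class,
# yielding weak image-level labels for thousands of images.
nn = NearestNeighbors(n_neighbors=1).fit(exemplar_feats)
dist, idx = nn.kneighbors(patch_feats)
weak_labels = exemplar_labels[idx[:, 0]]
keep = dist[:, 0] < np.percentile(dist, 20)    # keep only the most confident matches
```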


Author(s):  
Werner Kurschl ◽  
Stefan Mitsch ◽  
Johannes Schoenboeck

Pervasive healthcare applications aim at improving habitability by assisting individuals in living autonomously. To achieve this goal, data on an individual's behavior and his or her environment (often collected with wireless sensors) are interpreted by machine learning algorithms; their decisions finally lead to the initiation of appropriate actions, e.g., turning on the light. Developers of pervasive healthcare applications therefore face complexity stemming, among other things, from different types of environmental and vital parameters, heterogeneous sensor platforms, unreliable network connections, and different programming languages. Moreover, developing such applications often includes extensive prototyping work to collect the large amounts of training data needed to optimize the machine learning algorithms. In this chapter the authors present a model-driven prototyping approach for the development of pervasive healthcare applications that manages the complexity incurred in developing prototypes and applications. They support the approach with a development environment that simplifies application development through graphical editors, code generators, and pre-defined components.


2020 ◽  
Vol 11 (3) ◽  
pp. 80-105 ◽  
Author(s):  
Vijay M. Khadse ◽  
Parikshit Narendra Mahalle ◽  
Gitanjali R. Shinde

The emerging area of the internet of things (IoT) generates large amounts of data from IoT applications such as health care, smart cities, etc. This data needs to be analyzed in order to derive useful inferences, and machine learning (ML) plays a significant role in analyzing such data. It is difficult, however, to select the optimal algorithm from the available set of algorithms/classifiers to obtain the best results, because the performance of algorithms differs when applied to datasets from different application domains. It is also difficult to determine whether a difference in performance is real or due to random variation in the test data, the training data, or the internal randomness of the learning algorithms. This study takes these issues into account in a comparison of ML algorithms for binary and multivariate classification, and it provides guidelines for the statistical validation of results. The results obtained show that the accuracy of one algorithm differs from that of the others by more than the critical difference (CD) over binary and multivariate datasets obtained from different application domains.
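One common way to make such a statistical validation concrete is a Friedman test over per-dataset accuracies followed by the Nemenyi critical difference; the sketch below uses a made-up accuracy matrix and does not reproduce the study's exact procedure or numbers:

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Rows = datasets, columns = classifiers (accuracies are illustrative).
acc = np.array([[0.91, 0.89, 0.85],
                [0.88, 0.90, 0.84],
                [0.93, 0.91, 0.86],
                [0.90, 0.92, 0.87],
                [0.87, 0.88, 0.83]])
stat, p = friedmanchisquare(*acc.T)   # is any difference real at all?

N, k = acc.shape
ranks = np.apply_along_axis(lambda row: rankdata(-row), 1, acc)  # rank 1 = best
mean_ranks = ranks.mean(axis=0)

# Nemenyi critical difference: CD = q_alpha * sqrt(k(k+1) / (6N)).
q_alpha = 2.343                       # tabulated q for k=3 classifiers, alpha=0.05
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))
# Two classifiers differ significantly if their mean ranks differ by more than cd.
```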


2020 ◽  
Vol 9 (1) ◽  
Author(s):  
E. Popoff ◽  
M. Besada ◽  
J. P. Jansen ◽  
S. Cope ◽  
S. Kanters

Abstract Background Despite existing research on text mining and machine learning for title and abstract screening, the role of machine learning (ML) within systematic literature reviews (SLRs) for health technology assessment (HTA) remains unclear, given the lack of extensive testing and of guidance from HTA agencies. We sought to address two knowledge gaps: to extend ML algorithms to provide a reason for exclusion, in order to align with current practices, and to determine optimal parameter settings for feature-set generation and ML algorithms. Methods We used abstract and full-text selection data from five large SLRs (n = 3089 to 12,769 abstracts) across a variety of disease areas. Each SLR was split into training and test sets. We developed a multi-step algorithm to categorize each citation as: included; excluded for a specific PICOS criterion; or unclassified. We used a bag-of-words approach for feature-set generation and compared support vector machines (SVMs), naïve Bayes (NB), and bagged classification and regression trees (CART) for classification. We also compared alternative training-set strategies: using the full data versus downsampling (i.e., reducing the number of excludes to balance includes/excludes, because ML algorithms perform better with balanced data), and using inclusion/exclusion decisions from abstract versus full-text screening. Performance was compared in terms of specificity, sensitivity, accuracy, and matching the reason for exclusion. Results The best-fitting model (optimized for sensitivity and specificity) was based on the SVM algorithm, using training data based on full-text decisions, downsampling, and excluding words occurring fewer than five times. The sensitivity and specificity of this model ranged from 94 to 100% and 54 to 89%, respectively, across the five SLRs. On average, 75% of excluded citations were excluded with a reason, and 83% of these citations matched the reviewers' original reason for exclusion. Sensitivity significantly improved when both downsampling and abstract decisions were used. Conclusions ML algorithms can improve the efficiency of the SLR process, and the proposed algorithms could reduce the workload of a second reviewer by identifying exclusions with a relevant PICOS reason, thus aligning with HTA guidance. Downsampling can be used to improve study selection, and the improvements from using full-text exclusions have implications for a learn-as-you-go approach.
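A rough sketch of such a screening pipeline (bag-of-words features dropping words occurring fewer than five times, a linear SVM, and downsampling of the majority exclude class) might look like this in scikit-learn; the file name and column layout are assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Assumed columns: "abstract" (text) and "label" ("include" or a PICOS
# exclusion reason such as "exclude-population", "exclude-outcome", ...).
train = pd.read_csv("slr_training.csv")
includes = train[train.label == "include"]
excludes = train[train.label != "include"].sample(len(includes), random_state=0)
balanced = pd.concat([includes, excludes])   # downsampling the excludes

# min_df=5 drops words occurring fewer than five times, as in the best model.
clf = make_pipeline(CountVectorizer(min_df=5), LinearSVC())
clf.fit(balanced.abstract, balanced.label)   # labels carry the exclusion reason
predicted = clf.predict(train.abstract)
```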


2018 ◽  
Vol 210 ◽  
pp. 04019 ◽  
Author(s):  
Hyontai SUG

Recent Go matches between humans and the artificial intelligence AlphaGo demonstrated the great advances in machine learning technologies. While AlphaGo was trained using real-world data, AlphaGo Zero was trained using massive amounts of randomly generated data, and the fact that AlphaGo Zero beat AlphaGo decisively revealed that the diversity and size of training data are important for the performance of machine learning algorithms, especially deep learning algorithms based on neural networks. Artificial neural networks and decision trees, meanwhile, are widely accepted machine learning algorithms because of their robustness to errors and their comprehensibility, respectively. In this paper, in order to show empirically that the diversity and size of data are important factors for the performance of machine learning algorithms, these two representative algorithms are used in experiments. A real-world data set called breast tissue was chosen because it consists of real numbers, a property well suited to the generation of artificial random data. The results of the experiment confirmed that the diversity and size of data are very important factors for better performance.
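The kind of experiment described can be sketched as follows: train a decision tree and a neural network on increasingly large training samples and compare test accuracy. A synthetic data set stands in for the breast tissue data here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in: real-valued features, several classes.
X, y = make_classification(n_samples=2000, n_features=9, n_classes=3,
                           n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Growing training-set sizes: accuracy should improve with more, diverse data.
for n in (100, 300, 1000, len(X_tr)):
    for model in (DecisionTreeClassifier(random_state=0),
                  MLPClassifier(max_iter=1000, random_state=0)):
        model.fit(X_tr[:n], y_tr[:n])
        print(type(model).__name__, n,
              round(accuracy_score(y_te, model.predict(X_te)), 3))
```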

