A Data-Centric Approach to Improve Machine Learning Model’s Performance in Production

Author(s):  
Pritom Bhowmik ◽
Arabinda Saha Partha

Machine learning teaches computers to learn in a way similar to how humans do. An ML model works by exploring data and identifying patterns with minimal human intervention. A supervised ML model learns by mapping inputs to outputs based on labeled (X, y) example pairs, while an unsupervised ML model discovers previously undetected patterns and information in unlabeled data. Because an ML project is an extensively iterative process, there is always a need to change the ML code/model and the datasets. However, when an ML model achieves 70-75% accuracy, the code or algorithm most probably works fine. Nevertheless, in many cases, e.g., medical or spam detection models, 75% accuracy is too low to deploy in production. A medical model used in sensitive tasks such as detecting certain diseases must reach an accuracy level of 98-99%, and that is a big challenge to achieve. In that scenario, we may already have a well-working model, so a model-centric approach may not help much in reaching the desired accuracy threshold. However, improving the dataset will improve the overall performance of the model. Improving the dataset does not always mean bringing more and more data into it: improving the quality of the data by establishing a reasonable baseline level of performance, labeler consistency, error analysis, and performance auditing will thoroughly improve the model's accuracy. This review paper focuses on the data-centric approach to improving the performance of a production machine learning model.
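Labeler consistency, one of the data-quality levers named above, is straightforward to quantify. A minimal sketch (not from the paper; the labels below are hypothetical) that measures inter-annotator agreement with Cohen's kappa and flags disagreements for re-labeling:

```python
# Minimal sketch of a labeler-consistency check (not from the paper):
# quantify inter-annotator agreement, then queue disagreements for review.
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators over the same 10 examples.
labels_a = ["spam", "ham", "spam", "spam", "ham", "ham", "spam", "ham", "spam", "ham"]
labels_b = ["spam", "ham", "ham", "spam", "ham", "spam", "spam", "ham", "spam", "ham"]

kappa = cohen_kappa_score(labels_a, labels_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level

# Send disagreements back for adjudication before retraining the model.
disagreements = [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]
print("Re-label examples:", disagreements)
```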

2020 ◽  
pp. short19-1-short19-9
Author(s):  
Alexey Kochkarev ◽  
Alexander Khvostikov ◽  
Dmitry Korshunov ◽  
Andrey Krylov ◽  
Mikhail Boguslavskiy

Data imbalance is a common problem in machine learning and image processing. The lack of training data for the rarest classes can impair learning and negatively affect segmentation quality. In this paper, we focus on the problem of data balancing for the task of image segmentation. We review major trends in handling unbalanced data and propose a new data balancing method based on the Distance Transform. The method is designed for use in segmentation convolutional neural networks (CNNs), but it is universal and can be used with any patch-based segmentation machine learning model. The proposed data balancing method is evaluated on two datasets. The first is the medical dataset LiTS, containing CT images of livers with tumor abnormalities. The second is a geological dataset consisting of photographs of polished sections of different ores. The proposed algorithm enhances the data balance between classes and improves the overall performance of the CNN model.
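The abstract does not give implementation details, so the following is only a plausible sketch of one distance-transform-based balancing variant: sample patch centers with probability proportional to the distance transform of the rare-class mask, which biases patches toward rare-class interiors. The function and weighting rule are assumptions, not the authors' algorithm.

```python
# Plausible sketch (not the authors' code) of distance-transform-weighted
# patch sampling for a rare class in a segmentation mask.
import numpy as np
from scipy.ndimage import distance_transform_edt

def sample_patch_centers(mask, n_patches, rng=np.random.default_rng(0)):
    """mask: binary array, 1 = rare class. Returns n_patches (row, col) centers,
    sampled with probability proportional to the distance transform, which
    concentrates patches on the interior of rare-class regions."""
    dist = distance_transform_edt(mask)  # 0 outside the class, grows inward
    probs = dist.ravel() / dist.sum()
    idx = rng.choice(dist.size, size=n_patches, p=probs)
    return np.stack(np.unravel_index(idx, mask.shape), axis=1)

mask = np.zeros((64, 64), dtype=np.uint8)
mask[20:30, 40:55] = 1  # a small synthetic "tumor" region
print(sample_patch_centers(mask, 5))
```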


2017 ◽  
Author(s):  
Aymen A. Elfiky ◽  
Maximilian J. Pany ◽  
Ravi B. Parikh ◽  
Ziad Obermeyer

Background: Cancer patients who die soon after starting chemotherapy incur costs of treatment without benefits. Accurately predicting mortality risk from chemotherapy is important, but few patient data-driven tools exist. We sought to create and validate a machine learning model predicting mortality for patients starting new chemotherapy. Methods: We obtained electronic health records for patients treated at a large cancer center (26,946 patients; 51,774 new regimens) over 2004-14, linked to Social Security data for date of death. The model was derived using 2004-11 data, and performance was measured on non-overlapping 2012-14 data. Findings: 30-day mortality from chemotherapy start was 2.1%. Common cancers included breast (21.1%), colorectal (19.3%), and lung (18.0%). Model predictions were accurate for all patients (AUC 0.94). Predictions for patients starting palliative chemotherapy (46.6% of regimens), for whom prognosis is particularly important, remained highly accurate (AUC 0.92). To illustrate model discrimination, we ranked patients initiating palliative chemotherapy by model-predicted mortality risk, and calculated observed mortality by risk decile. 30-day mortality in the highest-risk decile was 22.6%; in the lowest-risk decile, no patients died. Predictions remained accurate across all primary cancers, stages, and chemotherapies, even for clinical trial regimens that first appeared in years after the model was trained (AUC 0.94). The model also performed well for prediction of 180-day mortality (AUC 0.87; mortality 74.8% in the highest risk decile vs. 0.2% in the lowest). Predictions were more accurate than data from randomized trials of individual chemotherapies, or SEER estimates. Interpretation: A machine learning algorithm accurately predicted short-term mortality in patients starting chemotherapy using EHR data. Further research is necessary to determine generalizability and the feasibility of applying this algorithm in clinical settings.
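To make the discrimination analysis concrete, here is a sketch of that evaluation style on synthetic predictions (no real EHR data; the numbers will not match the paper's): AUC of predicted risk plus observed mortality by predicted-risk decile.

```python
# Sketch of risk-decile evaluation on synthetic data (not the study's data):
# AUC of predicted risk, plus observed mortality within each risk decile.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
risk = rng.uniform(0, 1, 5000)               # model-predicted 30-day risk
died = rng.uniform(0, 1, 5000) < risk * 0.3  # synthetic outcomes, correlated with risk

print(f"AUC: {roc_auc_score(died, risk):.3f}")

df = pd.DataFrame({"risk": risk, "died": died})
df["decile"] = pd.qcut(df["risk"], 10, labels=False) + 1
print(df.groupby("decile")["died"].mean())   # observed mortality per risk decile
```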


Author(s):  
Monalisa Ghosh ◽  
Chetna Singhal

Video streaming services dominate internet traffic, driving a competitive environment in which providers strive to deliver the best quality of experience (QoE) to users. The standard codecs used in video transmission systems eliminate spatiotemporal redundancies to reduce bandwidth requirements, which may adversely affect the perceptual quality of videos. Both subjective and objective parameters can be used to rate video quality, so it is essential to construct frameworks that measure the integrity of video as humans would. This chapter focuses on the application of machine learning to evaluate QoE without human effort, achieving accuracies of 86% and 91% using linear regression and support vector regression, respectively. A machine learning model is developed to forecast, from subjective scores, the quality of H.264 videos streamed over wireless networks.
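As a concrete illustration of the regression step, a hedged sketch that fits linear regression and SVR to predict a subjective score from objective stream features; the feature names and data here are synthetic placeholders, not the chapter's dataset.

```python
# Hedged sketch of the regression step: predict a subjective quality score
# from objective stream features. Features and data are hypothetical.
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical features: [bitrate_kbps, packet_loss_pct, mean_QP]
X = rng.uniform([300, 0, 20], [4000, 5, 45], size=(200, 3))
# Synthetic mean opinion score, degraded by loss and quantization.
mos = 5 - 0.4 * X[:, 1] - 0.03 * (X[:, 2] - 20) + rng.normal(0, 0.2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, mos, random_state=0)
for model in (LinearRegression(), SVR(kernel="rbf", C=10)):
    print(type(model).__name__, f"R^2 = {model.fit(X_tr, y_tr).score(X_te, y_te):.2f}")
```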


Energies ◽  
2020 ◽  
Vol 13 (17) ◽  
pp. 4368 ◽  
Author(s):  
Chun-Wei Chen ◽  
Chun-Chang Li ◽  
Chen-Yu Lin

An energy baseline is an important method for measuring the energy-saving benefits of a chiller system; the benefits can be calculated by comparing prediction models with actual results. Currently, machine learning is often adopted as the prediction model for energy baselines. Common models include regression, ensemble learning, and deep learning models. In this study, we first reviewed several machine learning algorithms used to establish prediction models. We then adopted clustering to preprocess the chiller data. Data mining, K-means clustering, and the gap statistic were used to successfully identify the critical variables with which to cluster chiller modes. Applying these key variables effectively enhanced the quality of the chiller data, and combining the clustering results with the machine learning model effectively improved the prediction accuracy of the model and the reliability of the energy baselines.
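The gap statistic is not built into scikit-learn, so a simplified sketch follows: it picks the k with the largest gap rather than applying the original Tibshirani stopping rule, and uses synthetic data in place of chiller measurements.

```python
# Simplified gap-statistic sketch for choosing k in K-means (synthetic data,
# not chiller measurements; picks argmax of the gap, not the original rule).
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic_k(X, k_max=8, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
        # Reference dispersion: uniform samples over the data's bounding box.
        log_ref = [np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                          .fit(rng.uniform(mins, maxs, X.shape)).inertia_)
                   for _ in range(n_refs)]
        gaps.append(np.mean(log_ref) - log_wk)
    return int(np.argmax(gaps)) + 1

# Three well-separated synthetic "chiller modes"; expect k = 3.
X = np.vstack([np.random.default_rng(i).normal(i * 5, 1, (100, 3)) for i in range(3)])
print("chosen k:", gap_statistic_k(X))
```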


2020 ◽  
Vol 8 (7_suppl6) ◽  
pp. 2325967120S0036
Author(s):  
Audrey Wright ◽  
Jaret Karnuta ◽  
Bryan Luu ◽  
Heather Haeberle ◽  
Eric Makhni ◽  
...  

Objectives: With the accumulation of big data surrounding the National Hockey League (NHL) and the advent of advanced computational processors, machine learning (ML) is ideally suited to develop a predictive algorithm capable of ingesting historical data to accurately project a future player's availability to play based on prior injury and performance. To leverage available analytics for data-driven injury prevention strategies and informed decisions by NHL franchises beyond static logistic regression (LR) analysis, the objectives of this study of NHL players were to (1) characterize the epidemiology of publicly reported NHL injuries from 2007-17, (2) determine the validity of a machine learning model in predicting next-season injury risk for both goalies and non-goalies, and (3) compare the performance of modern ML algorithms against LR analyses. Methods: Hockey player data were compiled for the years 2007 to 2017 from two publicly reported databases, in the absence of an official NHL-approved database. Attributes acquired for each NHL player in each professional year included age, 85 player metrics, and injury history. A total of 5 ML algorithms were created for both non-goalie and goalie data: Random Forest, K-Nearest Neighbors, Naive Bayes, XGBoost, and a Top 3 Ensemble. Logistic regression was also performed for both non-goalie and goalie data. Area under the receiver operating characteristic curve (AUC) was the primary measure of validity. Results: Player data were generated from 2,109 non-goalies and 213 goalies with an average follow-up of 4.5 years. The results are shown in Table 1. For models predicting following-season injury risk for non-goalies, XGBoost performed best with an AUC of 0.948, compared to an AUC of 0.937 for logistic regression. For models predicting following-season injury risk for goalies, XGBoost had the highest AUC with 0.956, compared to an AUC of 0.947 for LR. Conclusion: Advanced ML models such as XGBoost outperformed LR and demonstrated good to excellent capability of predicting whether a publicly reportable injury is likely to occur the next season. As more player-specific data become available, algorithm refinement may be possible to strengthen predictive insights, allowing ML to offer quantitative risk management for franchises, present opportunities for targeted preventive intervention by medical personnel, and replace regression analysis as the new gold standard for predictive modeling.
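A sketch of the headline comparison on synthetic data (the 85 real player metrics are not reproduced here): XGBoost versus logistic regression, scored by AUC on a held-out split.

```python
# Sketch of the model comparison on synthetic data (not the study's data):
# XGBoost vs. logistic regression on an imbalanced binary label.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # assumes the xgboost package is installed

# Synthetic stand-in for 85 player metrics and an "injured next season" label.
X, y = make_classification(n_samples=2000, n_features=85, n_informative=15,
                           weights=[0.7], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for model in (LogisticRegression(max_iter=1000), XGBClassifier(eval_metric="logloss")):
    auc = roc_auc_score(y_te, model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
    print(f"{type(model).__name__}: AUC = {auc:.3f}")
```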


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Andrei Bratu ◽  
Gabriela Czibula

Data augmentation is a commonly used technique in data science for improving the robustness and performance of machine learning models. The purpose of this paper is to study the feasibility of generating synthetic data points of a temporal nature toward this end. A general approach named DAuGAN (Data Augmentation using Generative Adversarial Networks) is presented for identifying poorly represented sections of a time series, synthesizing and integrating new data points, and measuring the performance improvement on a benchmark machine learning model. The problem is studied and applied in the domain of algorithmic trading, whose constraints are presented and taken into consideration. The experimental results highlight an improvement in the performance of a benchmark reinforcement learning agent trained to trade a financial instrument on a dataset enhanced with DAuGAN.
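The abstract does not detail DAuGAN's architecture; the sketch below covers only its first step, flagging poorly represented sections of a series by binning a rolling statistic. The regime proxy, window, and threshold are assumptions, not the authors' choices.

```python
# Sketch of the first DAuGAN step as described in the abstract: flag poorly
# represented sections of a series. Regime proxy (rolling volatility),
# window, and threshold are assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
prices = pd.Series(100 + np.cumsum(rng.normal(0, 1, 1000)))  # synthetic price path

vol = prices.pct_change().rolling(20).std()     # volatility as a regime proxy
bins = pd.cut(vol.dropna(), 10, labels=False)   # 10 equal-width volatility regimes
counts = bins.value_counts().sort_index()

# Regimes with few samples are candidates for GAN-based synthesis.
rare = counts[counts < counts.median() * 0.5].index.tolist()
print("Under-represented regimes:", rare)
```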


2020 ◽  
Vol 25 (6) ◽  
pp. 655-664
Author(s):  
Wienand A. Omta ◽  
Roy G. van Heesbeen ◽  
Ian Shen ◽  
Jacob de Nobel ◽  
Desmond Robers ◽  
...  

There has been an increase in the use of machine learning and artificial intelligence (AI) for the analysis of image-based cellular screens. The accuracy of these analyses, however, is greatly dependent on the quality of the training sets used for building the machine learning models. We propose that unsupervised exploratory methods should first be applied to the data set to gain better insight into the quality of the data. This improves the selection and labeling of data for creating training sets before the application of machine learning. We demonstrate this using a high-content genome-wide small interfering RNA screen. We perform an unsupervised exploratory data analysis to facilitate the identification of four robust phenotypes, which we subsequently use as a training set for building a high-quality random forest machine learning model to differentiate the four phenotypes with an accuracy of 91.1% and a kappa of 0.85. Our approach enhanced our ability to extract new knowledge from the screen when compared with the use of unsupervised methods alone.
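A sketch of that workflow on synthetic data: unsupervised exploration (PCA and k-means) to derive candidate phenotype labels, then a random forest scored with accuracy and Cohen's kappa. Cluster counts and data are placeholders, not the screen's actual parameters.

```python
# Sketch of the exploration-then-supervision workflow on synthetic data:
# PCA + k-means to propose phenotype labels, then a random forest classifier.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

X, _ = make_blobs(n_samples=800, centers=4, n_features=20, random_state=0)

# Exploratory step: inspect structure before committing to labels.
print("PCA explained variance:", PCA(n_components=2).fit(X).explained_variance_ratio_)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Curated clusters become the training set for the supervised model.
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, stratify=labels, random_state=0)
pred = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)
print(f"accuracy = {accuracy_score(y_te, pred):.3f}, kappa = {cohen_kappa_score(y_te, pred):.3f}")
```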


2020 ◽  
Author(s):  
Carlotta Valerio ◽  
Alberto Garrido ◽  
Gonzalo Martinez-Muñoz ◽  
Lucia De Stefano

Freshwater ecosystems are threatened by multiple anthropic pressures. Understanding the effect of these pressures on ecological status is essential for the design of effective policy measures but can be methodologically challenging. In this study we propose to capture these complex relations by means of a machine learning model that predicts the ecological response of surface water bodies to several anthropic stressors. The model was applied to the Spanish stretch of the Tagus River Basin. The performance of two machine learning algorithms, Random Forest (RF) and Boosted Regression Trees (BRT), was compared. The response variables in the model were the biotic quality indices of macroinvertebrates (Iberian Biomonitoring Working Party) and diatoms (Indice de Polluosensibilité Spécifique). The stressors used as explanatory variables belong to the following categories: physicochemical water quality, land use, alteration of the hydrological regime, and hydromorphological degradation. Variables describing natural environmental variability were also included. According to the coefficient of determination, the root mean square error, and the mean absolute error, the RF algorithm has the best explanatory power for both biotic indices. The categories of land cover in the upstream catchment area, the nutrient concentrations, and the elevation of the water body rank as the main features determining the quality of biological communities. Among the hydromorphological elements, the alteration of the riparian forest (expressed by the Riparian Forest Quality Index) is the most relevant feature, while hydrological alteration does not seem to significantly influence the value of the biotic indices. Our model was used to identify potential policy measures aimed at improving the biological quality of surface water bodies in the most critical areas of the basin. Specifically, the biotic quality indices were modelled imposing the maximum concentration of nutrients that Spanish legislation prescribes to ensure good ecological status. According to our model, the nutrient thresholds set by Spanish legislation are insufficient to ensure values of biological indicators consistent with good ecological status across the entire basin. We tested several scenarios of more restrictive nutrient concentrations and values of hydromorphological quality to explore the conditions required to achieve good ecological status. The predicted percentage of water bodies in good status increases when a high Riparian Forest Quality Index is imposed, confirming the importance of combining physicochemical and hydromorphological improvements to ameliorate the status of freshwater ecosystems.
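A sketch of the algorithm comparison on synthetic data, using scikit-learn's GradientBoostingRegressor as a stand-in for BRT and reporting the three metrics named above; the stressor features are placeholders, not the basin's data.

```python
# Sketch of the RF vs. BRT comparison on synthetic data (GradientBoostingRegressor
# stands in for BRT), scored with the abstract's three metrics.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for stressor features and a biotic quality index.
X, y = make_regression(n_samples=500, n_features=12, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (RandomForestRegressor(random_state=0), GradientBoostingRegressor(random_state=0)):
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(type(model).__name__,
          f"R2={r2_score(y_te, pred):.2f}",
          f"RMSE={mean_squared_error(y_te, pred) ** 0.5:.1f}",
          f"MAE={mean_absolute_error(y_te, pred):.1f}")
```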

