Estimating Degradation of Machine Learning Data Assets

2022 · Vol 14 (2) · pp. 1-15
Author(s): Lara Mauri, Ernesto Damiani

Large-scale adoption of Artificial Intelligence and Machine Learning (AI-ML) models fed by heterogeneous, possibly untrustworthy data sources has spurred interest in estimating the degradation of such models due to spurious, adversarial, or low-quality data assets. We propose a quantitative estimate of the severity of classifiers' training set degradation: an index expressing the deformation of the convex hulls of the classes, computed on a held-out dataset generated via an unsupervised technique. We show that our index is computationally light, can be calculated incrementally, and complements existing quality measures for ML data assets well. As an experimental validation, we present the computation of our index for a benchmark convolutional image classifier.
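The abstract leaves the exact form of the index unspecified. As a rough illustration only, the Python sketch below compares per-class convex hull volumes between a reference embedding and a possibly degraded one; the volume-ratio definition and the function name hull_deformation_index are assumptions, not the authors' formula.

```python
# Minimal sketch of a hull-deformation measure (illustrative; NOT the paper's
# exact index). Per-class convex hull volumes of a reference low-dimensional
# embedding are compared against those of a possibly degraded embedding.
import numpy as np
from scipy.spatial import ConvexHull

def hull_deformation_index(ref_points, new_points, labels):
    """Mean relative change in per-class convex hull volume.

    ref_points, new_points: (n_samples, d) low-dimensional embeddings
    labels: (n_samples,) class labels shared by both embeddings
    """
    scores = []
    for c in np.unique(labels):
        ref_vol = ConvexHull(ref_points[labels == c]).volume
        new_vol = ConvexHull(new_points[labels == c]).volume
        scores.append(abs(new_vol - ref_vol) / ref_vol)
    return float(np.mean(scores))

# Toy usage: a clean 2-D embedding vs. the same embedding with injected noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
X_degraded = X + rng.normal(scale=0.5, size=X.shape)
print(hull_deformation_index(X, X_degraded, y))
```

Because only per-class hulls are touched, such a measure can be updated incrementally as new points arrive, which is consistent with the incremental-computation property claimed in the abstract.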

2021 · Vol 11 (2) · pp. 472
Author(s): Hyeongmin Cho, Sangkyun Lee

Machine learning has been proven to be effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training datasets, many datasets are being disclosed and published online. From a data consumer's or manager's point of view, measuring data quality is an important first step in the learning process: we need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially for large-scale high-dimensional data such as images and videos. This paper proposes two data quality measures that can compute class separability and in-class variability, two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures, based on random projections and bootstrapping, with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.
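The abstract does not give the measures' formulas, so the sketch below only illustrates the general recipe of random projections plus pair subsampling. The simplified definitions used here (separability as between-centroid distance over within-class spread; in-class variability as mean same-class pair distance) and all function names are assumptions for illustration.

```python
# Illustrative sketch (not the paper's exact measures): estimate class
# separability and in-class variability on a random low-dimensional
# projection, subsampling same-class pairs bootstrap-style to stay cheap
# on large-scale high-dimensional data.
import numpy as np

def project(X, k, rng):
    R = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)  # Gaussian random projection
    return X @ R

def separability(Xp, y):
    # Ratio of mean between-centroid distance to mean within-class spread.
    classes = np.unique(y)
    centroids = np.stack([Xp[y == c].mean(axis=0) for c in classes])
    between = np.mean([np.linalg.norm(centroids[i] - centroids[j])
                       for i in range(len(classes))
                       for j in range(i + 1, len(classes))])
    within = np.mean([np.linalg.norm(Xp[y == c] - Xp[y == c].mean(axis=0),
                                     axis=1).mean() for c in classes])
    return float(between / within)

def in_class_variability(Xp, y, rng, n_pairs=1000):
    # Mean distance between randomly sampled same-class pairs.
    dists = []
    for c in np.unique(y):
        Xc = Xp[y == c]
        i = rng.integers(0, len(Xc), size=n_pairs)
        j = rng.integers(0, len(Xc), size=n_pairs)
        dists.append(np.linalg.norm(Xc[i] - Xc[j], axis=1).mean())
    return float(np.mean(dists))

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 512))          # toy "high-dimensional" data
y = rng.integers(0, 10, size=5000)
Xp = project(X, 32, rng)                  # work in 32 projected dimensions
print(separability(Xp, y), in_class_variability(Xp, y, rng))
```

The cost of both estimates depends on the projected dimension and the number of sampled pairs rather than on the original dimensionality, which is what makes this style of measure attractive for images and videos.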


Author(s): Ladly Patel, Kumar Abhishek Gaurav

In today's world, a huge amount of data is available. This data is analyzed to extract information, which is then used to train machine learning algorithms. Machine learning is a subfield of artificial intelligence in which machines are trained on data and then predict results. Machine learning is being used in healthcare, image processing, marketing, etc. The aim of machine learning is to reduce the programmer's work of writing complex code and to decrease human interaction with systems. The machine learns from past data and then predicts the desired output. This chapter describes machine learning in brief, covering different machine learning algorithms with examples, as well as machine learning frameworks such as TensorFlow and Keras. The limitations of machine learning and various applications of machine learning are discussed. This chapter also describes how to identify features in machine learning data.
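As a concrete illustration of the framework usage the chapter surveys, here is a minimal Keras sketch (not taken from the chapter; the toy data, labels, and layer sizes are arbitrary choices) that learns from past data and then predicts:

```python
# A minimal Keras classifier, illustrating the train-then-predict pattern
# described above (toy data; architecture choices are arbitrary).
import numpy as np
from tensorflow import keras

X = np.random.rand(500, 4)            # 500 samples, 4 features
y = (X.sum(axis=1) > 2).astype(int)   # toy binary labels

model = keras.Sequential([
    keras.layers.Input(shape=(4,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)  # "learn from past data"
print(model.predict(X[:3]))           # "...then predict the desired output"
```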


Author(s): Divya Chaudhary, Er. Richa Vasuja

In today's scenario, data is generated by every one of us, so it becomes vital to handle this data. To do so, new technologies such as machine learning and data mining are being developed. This paper presents a study of machine learning (ML). Machine learning algorithms repeatedly produce precise approximations: an ML system effectively "learns" how to predict from a training set of completed tasks. The main purpose of this review is to give a rough overview of the most widely used algorithms in machine learning.


Author(s): P. Alison Paprica, Frank Sullivan, Yin Aphinyanaphongs, Garth Gibson

Many health systems and research institutes are interested in supplementing their traditional analyses of linked data with machine learning (ML) and other artificial intelligence (AI) methods and tools. However, the availability of individuals who have the required skills to develop and/or implement ML/AI is a constraint, as there is high demand for ML/AI talent in many sectors. The three organizations presenting are all actively involved in training and capacity building for ML/AI broadly, and each has a focus on, and/or discrete initiatives for, particular trainees.

P. Alison Paprica, Vector Institute for Artificial Intelligence, Institute for Clinical Evaluative Sciences, University of Toronto, Canada. Alison is VP, Health Strategy and Partnerships at Vector, responsible for health strategy and also playing a lead role in "1000AIMs", a Vector-led initiative in support of the Province of Ontario's $30 million investment to increase the number of AI-related master's program graduates to 1,000 per year within five years.

Frank Sullivan, University of St Andrews, Scotland. Frank is a family physician and an associate director of HDRUK@Scotland. Health Data Research UK (https://hdruk.ac.uk/) has recently provided funding to six sites across the UK to address challenging healthcare issues through the use of data science. A Doctoral Training Scheme in AI with 50 PhD students has also been announced. Each site works in close partnership with National Health Service bodies and the public to translate research findings into benefits for patients and populations.

Yin Aphinyanaphongs, INTREPID NYU clinical training program for incoming clinical fellows. Yin is the Director of the Clinical Informatics Training Program at NYU Langone Health. He is deeply interested in the intersection of computer science and health care, and as a physician and a scientist he has a unique perspective on how to train medical professionals for a data-driven world. One version of this teaching process is demonstrated in the INTREPID clinical training program, in which Yin teaches clinicians to work with large-scale data within the R environment and to generate hypotheses and insights.

The session will begin with three brief presentations, followed by a facilitated session in which all participants share their insights about the essential skills and competencies required for different kinds of ML/AI applications and contributions. Live polling and voting will be used at the end of the session to capture participants' views on the key learnings and takeaway points. The intended outputs and outcomes of the session are: participants will have a better understanding of the skills and competencies required for individuals to contribute to AI applications in health in various ways; and participants will gain knowledge about different options for capacity building, from targeted enhancement of the skills of clinical fellows, to producing large numbers of applied master's graduates, to doctoral-level training. After the session, the co-leads will work together to create a resource that summarizes the learnings from the session and make it public (through publication in a peer-reviewed journal and/or through the IPDLN website).


2021 · Vol 8 (32) · pp. 22-38
Author(s): José Manuel Amigo

Concepts like Machine Learning, Data Mining or Artificial Intelligence have become part of our daily life. This is mostly due to the incredible advances made in computation (hardware and software), the increasing capability to generate and store all types of data and, especially, the benefits (societal and economic) generated by the analysis of such data. At the same time, Chemometrics has played an important role since the late 1970s in analyzing data within the natural sciences (and especially in Analytical Chemistry). Yet, despite the strong parallels between all of the abovementioned terms, and although they are familiar to most of us, it is still difficult to clearly define or differentiate the meanings of Machine Learning, Data Mining, Artificial Intelligence, Deep Learning and Chemometrics. This manuscript sheds some light on the definitions of Machine Learning, Data Mining, Artificial Intelligence and Big Data Analysis, defines their application ranges, and seeks an application space within the field of analytical chemistry (a.k.a. Chemometrics). The manuscript is full of personal, sometimes probably subjective, opinions and statements. Therefore, all opinions here are open for constructive discussion, with the sole purpose of Learning (like the Machines do nowadays).


2021
Author(s): Dillon Kessy, Jose Ignacio Sierra Castro, Jose Chirinos, Giorgio De Paola, Maria Jose Lopez Perez-Valiente

The application of Artificial Intelligence for planning has received increased attention in the energy industry in the past few years, particularly given increased production-efficiency requirements and environmental standards. The objective of this paper is to show the successful integration of production, completion, subsurface, and spatial data using machine-learning algorithms to predict production performance for future development wells.

The internal Marcellus Business Unit (MBU) well database, populated with data from 500+ historical wells, has been used in this study. Production data, treated as time series, has been processed using functional Principal Component Analysis (PCA) to allow removal of outliers and mode detection. Utilizing this data, a suite of machine-learning algorithms has been applied to reconstruct gas production from available and target well data. Uncertainty quantification has been provided for production curves to identify the quality of the prediction. During the study, sensitivity analysis on the input variables has been performed iteratively to screen and rank the most important variables for prediction.

The workflow, Unconventional Reservoir Assistant (URA), has been implemented in a proprietary cloud-based platform providing the necessary means for data upload, integration, pre-processing, and finally model training and deployment. This allows the user to focus on the evaluation of model output quality, data filtering, and workspace generation for continuous model testing and improvement.

The full well dataset, split into training and test data, has been used for prediction as a preliminary guide to where the most prolific areas of development are located. Results were ranked based on production expected by pad and on normalized performance. The information was then used to compare with type curves and the original development order. In parallel, an economic break-even evaluation was performed to rank all future pads. Consequently, the integration of the model prediction and the break-even ranking was used to generate the final development order for the MBU.

The URA tool has shown preliminary success in predicting production performance for the pilot development area. Multiple case studies have been run, achieving blind-test results with high accuracy for historical prediction. Results show some dependency of predictor-variable ranking on the field development area, providing insight into how the subsurface may affect well dynamic behavior.

This paper describes how the integration of URA can complement the development workflow for unconventional reservoirs and be used to predict performance based on complex data integration. The methodology is superior to standard machine-learning models that provide only production indicators, as it gives the user the possibility to evaluate economics and completion-design sensitivity for future well activities. The methodology can be further extended as a proxy model for well-schedule optimization in planning or for better insight into well refrac selection.
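The abstract does not detail the functional-PCA step, so the sketch below only illustrates the general idea of PCA-based outlier screening on production curves: each well's curve is treated as a vector, an ordinary PCA is fit, and curves with large reconstruction error are flagged. The synthetic decline curves, the component count, and the 3-sigma threshold are all assumptions for illustration, not values from the study.

```python
# Illustrative sketch of PCA-based outlier screening on production time series
# (a simplification of the functional-PCA step described in the abstract).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = np.linspace(0, 36, 36)                            # 36 months of production
curves = np.exp(-t / rng.uniform(6, 18, (500, 1)))    # synthetic decline curves
curves[::50] += rng.normal(scale=0.5, size=(10, 36))  # inject a few outliers

pca = PCA(n_components=3).fit(curves)                 # low-rank model of curves
recon = pca.inverse_transform(pca.transform(curves))
err = np.linalg.norm(curves - recon, axis=1)          # reconstruction error

keep = err < err.mean() + 3 * err.std()               # simple 3-sigma screen
print(f"flagged {np.sum(~keep)} of {len(curves)} wells as outliers")
```

Screening in a low-dimensional principal-component space, rather than on the raw monthly values, keeps the outlier test robust to ordinary well-to-well variation in decline shape.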


Author(s): M. Stashevskaya

The article contains a study of existing views on the economic content of big data. From among the views by which authors define big data, the descriptive-model, utility-digital, and complex-technological approaches are formulated. Against the background of the large-scale spread of digital technologies (machine learning, cloud computing, artificial intelligence, augmented and virtual reality, etc.), which function thanks to big data, the study of their economic essence is becoming especially relevant. The study finds that big data forms the basis of economic activity in the digital economy, and a definition of big data as a resource of the digital economy is proposed.


Today is the era of Machine Learning and Artificial Intelligence. Machine Learning is a field of scientific study that uses statistical models to predict answers to never-before-asked questions. Machine Learning algorithms use a huge quantity of sample data, which is then used to generate a model. A larger and higher-quality training set leads to higher accuracy in the calculated results. ML is a highly popular research field and is also helpful in pattern finding, artificial intelligence, and data analysis. In this paper we explain the basic concepts of Machine Learning and its various types of methods, which can be applied according to the user's requirements. Machine Learning tasks are divided into various categories, and these tasks are accomplished by computer systems without being explicitly programmed.

