Programmatic Labeling of Dark Data for Artificial Intelligence in Spatial Informatics

Author(s):  
Jason Meil

The data preparation process generally consumes up to 80% of a data scientist's time, with 60% of that attributed to cleaning and labeling data [1]. Our solution is to use automated pipelines to prepare, annotate, and catalog data. The first step upon ingestion, especially for real-world unstructured and unlabeled datasets, is to leverage Snorkel, a tool designed specifically to rapidly create, manage, and model training data. Configured properly, Snorkel can alleviate this labeling bottleneck through a process called weak supervision. Weak supervision uses programmatic labeling functions, written in Python and based on heuristics, distant supervision, subject-matter expertise, or knowledge bases, to generate "noisy" labels. Each function traverses the entire dataset, and the resulting labels are fed into a generative (conditionally probabilistic) model. The role of this model is to output the distribution of each response variable and predict the conditional probability based on a joint probability distribution, by comparing the labeling functions and the degree to which their outputs agree with one another. A labeling function whose outputs are highly congruent with those of the other functions will have a high learned accuracy, that is, the fraction of predictions the model got right; conversely, a function with low congruence will have low learned accuracy. The predictions are then combined, weighted by estimated accuracy, so that the predictions of functions with higher learned accuracy count more heavily. The result is a transformation from a binary label of 0 or 1 to a fuzzy label between 0 and 1: there is probability "x" that, based on heuristic "n", the response variable is "y". As data is added to this generative model, multi-class inference is made over the response variables positive, negative, or abstain, assigning probabilistic labels to potentially millions of data points. Thus, we have generated a ground truth for all further labeling efforts and improved the scalability of our models, and the labeling functions can be applied to new unlabeled data to further machine learning efforts.

Once our datasets are labeled and a ground truth is established, we persist the data into our Delta Lake, since it combines the most performant aspects of a warehouse with the low-cost storage of a data lake. In addition, the lake can accept unstructured, semi-structured, or structured data sources, and those sources can be further aggregated into raw-ingestion, cleaned, and feature-engineered data layers. By sectioning the data sources into these "layers", the data engineering work is abstracted away from the data scientist, who can access model-ready data at any time. Data can be ingested via batch or stream.

The entire ecosystem is designed to eliminate as much technical debt in machine learning paradigms as possible, in terms of configuration, data collection, verification, governance, extraction, analytics, process management, resource management, infrastructure, monitoring, and post-verification.
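To make the weak supervision step concrete, below is a minimal sketch of the workflow using the Snorkel labeling API; the DataFrame, column name, and heuristics are hypothetical placeholders, not the production pipeline.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

# Label space: functions may also abstain, which the generative model expects.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

@labeling_function()
def lf_keyword(x):
    # Heuristic: flag records whose text mentions a keyword of interest.
    return POSITIVE if "anomaly" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_record(x):
    # Heuristic: very short records are treated as negatives.
    return NEGATIVE if len(x.text) < 20 else ABSTAIN

df_train = pd.DataFrame({"text": ["sensor anomaly at site 7", "ok",
                                  "routine reading logged"]})

# Apply every labeling function to every row, producing a label matrix.
applier = PandasLFApplier(lfs=[lf_keyword, lf_short_record])
L_train = applier.apply(df=df_train)

# The generative label model learns each function's accuracy from agreement
# and disagreement patterns, then emits fuzzy labels between 0 and 1.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=500, seed=42)
probs = label_model.predict_proba(L=L_train)  # per-class probabilities
```

The probabilistic outputs can then be persisted to the lake as the generated ground truth for downstream model training.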

Author(s):  
Meenakshi Anurag Thalor ◽  
Shrishailapa Patil

Incremental learning on non-stationary distributions has been shown to be a very challenging problem in machine learning and data mining, because the joint probability distribution between the data and the classes changes over time. Many real-time problems suffer from concept drift as they change with time: for example, an advertisement recommendation system in which customers' behavior may change depending on the season of the year, on inflation, and on new products made available. An extra challenge arises when the classes to be learned are not represented equally in the training data, i.e., the classes are imbalanced, as most machine learning algorithms work well only when the training data is balanced. The objective of this paper is to develop an ensemble-based classification algorithm for non-stationary data streams (ENSDS) with a focus on two-class problems. In addition, we present an exhaustive comparison of the proposed algorithm with state-of-the-art classification approaches using different evaluation measures such as recall, F-measure, and G-mean.
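For imbalanced two-class problems like this, plain accuracy is misleading, which is why the abstract's evaluation measures are derived from the confusion matrix instead; a minimal sketch (the labels and predictions are illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score, f1_score

y_true = np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0])  # minority class = 1
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = recall_score(y_true, y_pred)        # sensitivity: tp / (tp + fn)
f_measure = f1_score(y_true, y_pred)         # harmonic mean of precision, recall
specificity = tn / (tn + fp)
g_mean = np.sqrt(recall * specificity)       # balances both classes equally
print(f"recall={recall:.2f}  F-measure={f_measure:.2f}  G-mean={g_mean:.2f}")
```

The G-mean collapses to zero if either class is ignored entirely, which makes it a natural choice when the stream's minority class must not be sacrificed.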


2021 ◽  
Vol 14 (6) ◽  
pp. 997-1005
Author(s):  
Sandeep Tata ◽  
Navneet Potti ◽  
James B. Wendt ◽  
Lauro Beltrão Costa ◽  
Marc Najork ◽  
...  

Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in virtually infinitely many ways. A good solution to this problem is one that generalizes well not only to known templates, such as invoices from a known vendor, but also to unseen ones. We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean and discuss three key data management challenges: (1) managing the quality of ground truth data, (2) generating training data for the machine learning model using labeled documents, and (3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without them. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.
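The second challenge, turning labeled documents into training data, can be illustrated generically (this is not Glean's actual implementation; the schema, field names, and candidate spans below are hypothetical): each candidate span in a document becomes a binary example of whether it fills a given schema field.

```python
from dataclasses import dataclass

@dataclass
class Example:
    text: str    # candidate span extracted from the document
    field: str   # schema field this span is a candidate for
    label: int   # 1 if the span matches the labeled ground truth

# Hypothetical target schema for an invoice-like document type.
SCHEMA = ["invoice_date", "total_amount"]

def make_training_examples(candidates, ground_truth):
    """Turn schema-typed candidate spans plus a labeled document into
    binary training examples: does this candidate fill this field?"""
    examples = []
    for span_text, field in candidates:
        if field not in SCHEMA:
            continue
        label = 1 if ground_truth.get(field) == span_text else 0
        examples.append(Example(span_text, field, label))
    return examples

# Labeled document: the ground truth says which values fill which fields.
truth = {"invoice_date": "2021-03-15", "total_amount": "$120.00"}
cands = [("2021-03-15", "invoice_date"), ("$95.00", "total_amount"),
         ("$120.00", "total_amount")]
print(make_training_examples(cands, truth))
```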


Animals ◽  
2020 ◽  
Vol 10 (9) ◽  
pp. 1690 ◽  
Author(s):  
Marianne Cockburn

Dairy farmers use herd management systems, behavioral sensors, feeding lists, breeding schedules, and health records to document herd characteristics. Consequently, large amounts of dairy data are becoming available. However, a lack of data integration makes it difficult for farmers to analyze the data on their dairy farm, which indicates that these data are currently not being used to their full potential. Hence, multiple problems in dairy farming, such as low longevity, poor performance, and health issues, remain. We aimed to evaluate whether machine learning (ML) methods can solve some of these existing problems in dairy farming. This review summarizes peer-reviewed ML papers published in the dairy sector between 2015 and 2020. Ultimately, 97 papers from the subdomains of management, physiology, reproduction, behavior analysis, and feeding were considered in this review. The results confirm that ML algorithms have become common tools in most areas of dairy research, particularly for prediction tasks. Despite the quantity of research available, most tested algorithms have not performed well enough for reliable implementation in practice, which may be due to poor training data. The availability of data resources from multiple farms covering longer periods would be useful to improve prediction accuracies. In conclusion, ML is a promising tool in dairy research that could be used to develop and improve decision support for farmers. As the cow is a multifactorial system, ML algorithms could analyze integrated data sources that describe, and ultimately allow managing, cows according to all relevant influencing factors. However, both the integration of multiple data sources and the obtainability of public data currently remain challenging.


Author(s):  
Jacob Whitehill

Recent work on privacy-preserving machine learning has considered how data-mining competitions such as Kaggle could potentially be "hacked", either intentionally or inadvertently, by using information from an oracle that reports a classifier's accuracy on the test set (Blum and Hardt 2015; Hardt and Ullman 2014; Zheng 2015; Whitehill 2016). For binary classification tasks in particular, one of the most common accuracy metrics is the Area Under the ROC Curve (AUC), and in this paper we explore the mathematical structure of how the AUC is computed from an n-vector of real-valued "guesses" with respect to the ground-truth labels. Under the assumption of perfect knowledge of the test set AUC c = p/q, we show how knowing c constrains the set W of possible ground-truth labelings, and we derive an algorithm both to compute the exact number of such labelings and to enumerate over them efficiently. We also provide empirical evidence that, surprisingly, the number of compatible labelings can actually decrease as n grows, until a test-set-dependent threshold is reached. Finally, we show how W can be efficiently whittled down, through pairs of oracle queries, to infer all the ground-truth test labels with complete certainty.
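The structure the paper exploits comes from the fact that the AUC of real-valued guesses equals the fraction of (positive, negative) pairs ranked correctly, so it is always a rational number p/q; a minimal sketch of that computation (the guesses and labels are illustrative):

```python
from fractions import Fraction

def auc(guesses, labels):
    """AUC as the fraction of (positive, negative) pairs that the guesses
    order correctly, counting ties as one half. Returns an exact rational
    p/q, the quantity c that constrains the compatible labelings."""
    pos = [g for g, y in zip(guesses, labels) if y == 1]
    neg = [g for g, y in zip(guesses, labels) if y == 0]
    wins = sum(Fraction(1) if p > n else Fraction(1, 2) if p == n else 0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

guesses = [0.9, 0.4, 0.7, 0.1]
labels  = [1, 0, 1, 0]
print(auc(guesses, labels))  # 1: every (positive, negative) pair is ordered correctly
```

Because c fixes only this ratio of correctly ordered pairs, many distinct labelings can induce the same AUC; that set of labelings is exactly the W the paper counts and enumerates.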


2021 ◽  
Author(s):  
Viraj Kulkarni ◽  
Manish Gawali ◽  
Amit Kharat

The use of machine learning to develop intelligent software tools for interpretation of radiology images has gained widespread attention in recent years. The development, deployment, and eventual adoption of these models in clinical practice, however, remains fraught with challenges. In this paper, we propose a list of key considerations that machine learning researchers must recognize and address to make their models accurate, robust, and usable in practice. Namely, we discuss: insufficient training data, decentralized datasets, high cost of annotations, ambiguous ground truth, imbalance in class representation, asymmetric misclassification costs, relevant performance metrics, generalization of models to unseen datasets, model decay, adversarial attacks, explainability, fairness and bias, and clinical validation. We describe each consideration and identify techniques to address it. Although these techniques have been discussed in prior research literature, by freshly examining them in the context of medical imaging and compiling them in the form of a laundry list, we hope to make them more accessible to researchers, software developers, radiologists, and other stakeholders.


Author(s):  
Ruslan Babudzhan ◽  
Konstantyn Isaienkov ◽  
Danilo Krasiy ◽  
Oleksii Vodka ◽  
Ivan Zadorozhny ◽  
...  

The paper investigates the relationship between the vibration acceleration of bearings and their operational state. To determine these dependencies, a test bench was built and 112 experiments were carried out with different bearings: 100 bearings that developed an internal defect during operation and 12 bearings without a defect. From the obtained records, a dataset was formed and used to build classifiers; the dataset is freely available. A method for classifying new and used bearings was proposed, which consists of searching for dependencies and regularities in the signal using descriptive functions: statistical measures, entropy, fractal dimensions, and others. In addition to processing the signal itself, the frequency domain of the bearing operation signal was also used to complement the feature space. The paper considered the possibility of generalizing the classification for application to signals that were not obtained in the course of laboratory experiments. An extraneous dataset was found in the public domain and was used to determine how accurate a classifier is when trained and tested on significantly different signals. Training and validation were carried out using the bootstrapping method to eradicate the effect of randomness, given the small amount of training data available. To estimate the quality of the classifiers, the F1-measure was used as the main metric due to the imbalance of the datasets. The following supervised machine learning methods were chosen as classifier models: logistic regression, support vector machine, random forest, and K-nearest neighbors. The results are presented in the form of density distribution plots and diagrams.
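A minimal sketch of the kind of pipeline described, with synthetic stand-in signals: descriptive time-domain and frequency-domain features feed one of the named classifiers, scored by bootstrapped F1. The feature set and fault simulation are illustrative, not the paper's exact choices.

```python
import numpy as np
from scipy.stats import entropy, kurtosis, skew
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def describe(signal):
    """Descriptive features of one vibration record: statistics of the
    waveform plus the entropy of its normalized amplitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    return [signal.std(), skew(signal), kurtosis(signal),
            np.abs(signal).max(), entropy(spectrum / spectrum.sum())]

rng = np.random.default_rng(0)
# Synthetic records: "defective" bearings get an added periodic component.
healthy = [rng.normal(size=2048) for _ in range(60)]
faulty = [rng.normal(size=2048) + np.sin(np.arange(2048) * 0.3) for _ in range(60)]
X = np.array([describe(s) for s in healthy + faulty])
y = np.array([0] * 60 + [1] * 60)

# Bootstrap: train on a resample with replacement, test on the out-of-bag
# records, and average F1 to damp the randomness of a small dataset.
scores = []
for seed in range(30):
    rs = np.random.default_rng(seed)
    idx = rs.integers(0, len(y), len(y))        # bootstrap sample indices
    oob = np.setdiff1d(np.arange(len(y)), idx)  # out-of-bag test set
    clf = RandomForestClassifier(random_state=seed).fit(X[idx], y[idx])
    scores.append(f1_score(y[oob], clf.predict(X[oob])))
print(f"F1 = {np.mean(scores):.2f} +/- {np.std(scores):.2f}")
```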


2020 ◽  
Author(s):  
Emad Kasaeyan Naeini ◽  
Ajan Subramanian ◽  
Michael-David Calderon ◽  
Kai Zheng ◽  
Nikil Dutt ◽  
...  

BACKGROUND There is a strong demand for an accurate and objective means of assessing acute pain among hospitalized patients to help clinicians provide a proper dosage of pain medications in a timely manner. Heart rate variability (HRV) comprises changes in the time intervals between consecutive heartbeats, which can be measured through acquisition and interpretation of electrocardiogram (ECG) signals captured from bedside monitors or wearable devices. As increased sympathetic activity affects the HRV, an index of autonomic regulation of heart rate, ultra-short-term HRV analysis can provide a reliable source of information for acute pain monitoring. In this study, widely used HRV time- and frequency-domain measurements are used to assess acute pain among postoperative patients. Existing approaches have focused only on stimulated pain in healthy subjects, whereas, to the best of our knowledge, there is no work in the literature that builds models using real pain data from postoperative patients. OBJECTIVE To develop and evaluate an automatic and adaptable pain assessment algorithm based on ECG features for assessing acute pain in postoperative patients likely experiencing mild to moderate pain. METHODS The study used a prospective observational design. The sample consisted of 25 patient participants aged 18 to 65 years. In part 1 of the study, a Transcutaneous Electrical Nerve Stimulation unit was employed to obtain a baseline discomfort threshold for each patient. In part 2, a multichannel biosignal acquisition device was used while patients were engaging in non-noxious activities. At all times, pain intensity was measured using patient self-reports based on the Numerical Rating Scale (NRS). A weak supervision framework was adopted for rapid training data creation, and the collected labels were transformed from 11 intensity levels to 5 intensity levels. Prediction models were developed using 5 different machine-learning methods. Mean prediction accuracy was calculated using Leave-One-Subject-Out cross-validation. We compared the performance of these models with the results from a previously published research study. RESULTS Five different machine-learning algorithms were applied to perform binary classification of no pain (NP) vs. 4 distinct pain levels (PL1 through PL4). Using the 3 time-domain HRV features from the BioVid research paper, the highest validation accuracy for NP vs. any single pain level was achieved by the SVM, ranging from 62.72% (NP vs. PL4) to 84.14% (NP vs. PL2). Similar results were achieved with the top 8 features selected by the Gini index using the SVM method, with accuracies ranging from 63.86% (NP vs. PL4) to 84.79% (NP vs. PL2). CONCLUSIONS We propose a novel pain assessment method for postoperative patients using the ECG signal. Weak supervision applied to labeling and feature extraction improves the robustness of the approach. Our results show the viability of using machine-learning algorithms to accurately and objectively assess acute pain among hospitalized patients. INTERNATIONAL REGISTERED REPORT RR2-10.2196/17783
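A minimal sketch of the evaluation protocol named in the abstract, Leave-One-Subject-Out cross-validation with an SVM over HRV features; the feature matrix, labels, and subject IDs below are hypothetical stand-ins for the study data:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n_windows, n_features = 250, 8             # e.g., top 8 HRV features by Gini index
X = rng.normal(size=(n_windows, n_features))
y = rng.integers(0, 2, n_windows)          # binary: no pain (0) vs. a pain level (1)
subjects = rng.integers(0, 25, n_windows)  # 25 patient participants

# Leave-One-Subject-Out: each fold holds out every window from one patient,
# so the accuracy reflects generalization to patients never seen in training.
logo = LeaveOneGroupOut()
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(model, X, y, groups=subjects, cv=logo)
print(f"mean LOSO accuracy: {scores.mean():.2f}")
```

Grouping the folds by subject rather than by window is what prevents a patient's own physiology from leaking between the training and test sets.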


2019 ◽  
Vol 2 ◽  
pp. 1-8
Author(s):  
Lukas Gokl ◽  
Marvin Mc Cutchan ◽  
Bartosz Mazurkiewicz ◽  
Paolo Fogliaroni ◽  
Ioannis Giannopoulos

Abstract. Location-Based Services (LBS) are very helpful for people interacting with an unfamiliar environment, but also for those who already possess a certain level of familiarity with it. In order to avoid overwhelming familiar users with unnecessary information, the level of detail offered by the LBS should be adapted to the user's familiarity with the environment: providing more details to unfamiliar users and a lighter amount of information (which would be superfluous, if not misleading) to users who are more familiar with the current environment. Currently, the information exchange between the service and its users does not take familiarity into account. In this work, we investigate the potential of machine learning for a binary classification of familiarity (i.e., familiar vs. unfamiliar) with the surrounding environment. For this purpose, a 3D virtual environment based on a part of Vienna, Austria was designed using datasets from the municipal government. During a navigation experiment with 22 participants, we collected ground truth data in order to train four machine learning algorithms. The captured data included the motion and orientation of the users as well as their visual interaction with the surrounding buildings during navigation. This work demonstrates the potential of machine learning for predicting the state of familiarity, as an enabling step toward LBS better tailored to the user.


Author(s):  
M. A. Zurbaran ◽  
P. Wightman ◽  
M. A. Brovelli

Abstract. Satellite imagery from earth observation missions enables processing big data to gather information about the world. Automating the creation of maps that reflect ground truth is a desirable outcome that would aid decision makers in taking adequate actions in alignment with the United Nations Sustainable Development Goals. In order to harness the power that the availability of the new generation of satellites enables, it is necessary to implement techniques capable of handling annotations for the massive volume and variability of high-spatial-resolution imagery for further processing. However, the availability of public datasets for training machine learning models for image segmentation plays an important role in scalability.

This work focuses on bridging remote sensing and computer vision by providing an open-source pipeline for generating machine learning training datasets for road detection in an area of interest. The proposed pipeline addresses road detection as a binary classification problem, using road annotations existing in OpenStreetMap to create masks. For this case study, Planet images of 3 m resolution are used to create a training dataset for road detection in Kenya.
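A minimal sketch of the mask-generation step such a pipeline implies: rasterizing OpenStreetMap road geometries into a binary mask aligned with an imagery tile. It assumes the roads have already been fetched as a GeoDataFrame and that the tile's CRS is metric (the file names and the 3 m buffer width, matching the pixel size, are illustrative):

```python
import geopandas as gpd
import numpy as np
import rasterio
from rasterio import features

# Open the imagery tile to copy its grid (shape and affine transform),
# so the mask lines up pixel-for-pixel with the image.
with rasterio.open("tile.tif") as src:
    out_shape, transform, crs = (src.height, src.width), src.transform, src.crs

# OSM roads for the tile's extent, reprojected into the tile's CRS.
roads = gpd.read_file("osm_roads.geojson").to_crs(crs)

# Roads are line geometries; buffer them (~3 m here) so the rasterized
# mask gives each road a plausible width instead of a 1-pixel trace.
shapes = ((geom, 1) for geom in roads.geometry.buffer(3.0))

mask = features.rasterize(shapes, out_shape=out_shape, transform=transform,
                          fill=0, dtype="uint8")  # 1 = road, 0 = background
np.save("tile_mask.npy", mask)
```

Each (image, mask) pair produced this way becomes one training example for the binary road-detection model.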

