Resolving data-hungry nature of machine learning reference evapotranspiration estimating models using inter-model ensembles with various data management schemes

Data availability statements can provide useful information about how researchers actually share research data. We used unsupervised machine learning to analyze 124,000 data availability statements submitted by research authors to 176 Wiley journals between 2013 and 2019. We categorized the data availability statements, and looked at trends over time. We found expected increases in the number of data availability statements submitted over time, and marked increases that correlate with policy changes made by journals. Our open data challenge becomes to use what we have learned to present researchers with relevant and easy options that help them to share and make an impact with new research data.


Glean

Proceedings of the VLDB Endowment ◽

10.14778/3447689.3447703 ◽

2021 ◽

Vol 14 (6) ◽

pp. 997-1005

Author(s):

Sandeep Tata ◽

Navneet Potti ◽

James B. Wendt ◽

Lauro Beltrão Costa ◽

Marc Najork ◽

...

Keyword(s):

Machine Learning ◽

Data Management ◽

Real World ◽

Empirical Studies ◽

Ground Truth ◽

Training Data ◽

Ground Truth Data ◽

Document Type ◽

Machine Learning Model ◽

Structured Information

Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in a virtually unlimited number of ways. A good solution to this problem is one that generalizes well not only to known templates, such as invoices from a known vendor, but also to unseen ones. We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean and discuss three key data management challenges: 1) managing the quality of ground truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.


IoT Data Management System for Rapid Development of Machine Learning Models

2019 IEEE International Conference on Cognitive Computing (ICCC) ◽

10.1109/iccc.2019.00021 ◽

2019 ◽

Author(s):

Keith Grueneberg ◽

Bongjun Ko ◽

David Wood ◽

Xiping Wang ◽

Dean Steuer ◽

...

Keyword(s):

Machine Learning ◽

Data Management ◽

Management System ◽

Rapid Development ◽

Data Management System ◽

Learning Models ◽

Machine Learning Models


Infrared spectroscopy coupled to cloud-based data management as a tool to diagnose malaria: a pilot study in a malaria-endemic country

Malaria Journal ◽

10.1186/s12936-019-2945-1 ◽

2019 ◽

Vol 18 (1) ◽

Cited By ~ 3

Author(s):

Philip Heraud ◽

Patutong Chatchawal ◽

Molin Wongwattanakul ◽

Patcharaporn Tippayawat ◽

Christian Doerig ◽

...

Keyword(s):

Machine Learning ◽

Infrared Spectroscopy ◽

Data Management ◽

False Positives ◽

Mobile Telephone ◽

Support Vector ◽

Laptop Computer ◽

Asymptomatic Carriers ◽

Machine Learning Classification ◽

Validation Testing

Background: Widespread elimination of malaria requires an ultra-sensitive detection method that can detect the low parasitaemia levels seen in asymptomatic carriers, who act as reservoirs for further transmission of the disease, yet is inexpensive and easy to deploy in the field in low-income settings. It was hypothesized that a new method of malaria detection based on infrared spectroscopy, shown in the laboratory to have similar sensitivity to PCR-based detection, could prove effective in detecting malaria in a field setting using cheap portable units with data management systems that allow them to be operated by users inexpert in spectroscopy. This study was designed to determine whether the methodology developed in the laboratory could be translated to the field to diagnose the presence of Plasmodium in the blood of patients presenting at hospital with symptoms of malaria, as a precursor to trials testing the sensitivity of the method to detect asymptomatic carriers.

Methods: The field study tested 318 patients presenting with suspected malaria at four regional clinics in Thailand. Two portable infrared spectrometers were employed, operated from a laptop computer or a mobile telephone with in-built software that guided the user through the simple measurement steps. Diagnostic modelling and validation testing using linear and machine learning approaches were performed against the gold standard qPCR. Sample spectra from the 318 patients were used for building calibration models (112 positive and 110 negative samples according to PCR testing) and for independent validation testing (39 positive and 57 negative samples by PCR).

Results: The machine learning classification (support vector machines; SVM) performed with 92% sensitivity (3 false negatives) and 97% specificity (2 false positives). The Area Under the Receiver Operating Characteristic curve (AUROC) for the SVM classification was 0.98. These results may be better than stated, as one of the spectroscopy false positives was infected by a Plasmodium species other than Plasmodium falciparum or Plasmodium vivax, which was not detected by the PCR primers employed.

Conclusions: It was demonstrated that ATR-FTIR spectroscopy could be used as an efficient and reliable malaria diagnostic tool and has the potential to be developed for use at point of care under tropical field conditions, with spectra analysed via a Cloud-based system and the diagnostic results returned to the user’s mobile telephone or computer. The combination of accessibility to mass screening, high sensitivity and selectivity, low logistics requirements, and portability makes this new approach a potentially outstanding tool in the context of malaria elimination programmes. The next step in the experimental programme now underway is to reduce the sample requirements to fingerprick volumes.
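The reported sensitivity and specificity can be approximately recomputed from the counts given in the abstract. The sketch below assumes that the 3 false negatives and 2 false positives refer to the independent validation set (39 PCR-positive and 57 PCR-negative samples); the abstract does not state this explicitly, and the function and variable names are illustrative only.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of PCR-positive samples that the classifier flags as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of PCR-negative samples that the classifier flags as negative."""
    return true_neg / (true_neg + false_pos)


# Counts taken from the abstract; assigning them to the validation set is an assumption.
pcr_positives, pcr_negatives = 39, 57
false_neg, false_pos = 3, 2

sens = sensitivity(pcr_positives - false_neg, false_neg)  # 36 / 39
spec = specificity(pcr_negatives - false_pos, false_pos)  # 55 / 57

print(f"sensitivity = {sens:.1%}")
print(f"specificity = {spec:.1%}")
```

Under this assumption the sensitivity is 36/39, which matches the reported 92%, while the specificity is 55/57, or about 96.5%, close to the reported 97%. The small gap may reflect rounding or a detail that the abstract does not state.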


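Des Translators neue Kleider. Die Translationwirtschaft in Zeiten von Digitalisierung, Datafizierung und Big Data Management [The Translator's New Clothes: The Translation Industry in Times of Digitalization, Datafication, and Big Data Management]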

Lebende Sprachen ◽

10.1515/les-2017-0027 ◽

2017 ◽

Vol 62 (2) ◽

Cited By ~ 2

Author(s):

Martin Forstner

Keyword(s):

Machine Learning ◽

At Risk ◽

Big Data ◽

Internet Of Things ◽

Data Management ◽

Low Cost ◽

The Internet ◽

Computer Linguistics ◽

Translation Services ◽

The Internet Of Things

The Internet of Things will influence all professional environments, including translation services. Advances in machine learning, supported by accelerating improvements in computer linguistics, have enabled new systems that can learn from their own experience and will have repercussions on the workflow processes of translators, or may even put their services at risk in the expected digitalized society. Outsourcing has become a common practice, and working in the cloud and in the crowd tends to enable translating at a very low cost. Confronted with promising new labels like


Forecasting Marine and Structural Integrity Parameters for Offshore Platforms

Volume 3: Structures, Safety and Reliability ◽

10.1115/omae2015-41046 ◽

2015 ◽

Cited By ~ 1

Author(s):

Igor Prislin ◽

Reza Jafarkhani ◽

Soma Maroju

Keyword(s):

Machine Learning ◽

Data Management ◽

Structural Integrity ◽

Operational Risk ◽

Computer Programs ◽

Measured Data ◽

Offshore Platforms ◽

Integrity Monitoring ◽

Design Values ◽

Paper Address

Marine and structural integrity monitoring for offshore platforms is the cornerstone for managing operational risk and safety. Measuring platform responses and loads enables comparisons with design values, thus ensuring that the risk does not exceed the designed limits. This paper discusses an advanced data management approach based on machine learning, a set of specialized computer programs that can learn and generalize the platform responses from measured data. The programs should produce sufficiently accurate predictions in previously unseen cases. Examples provided in the paper address capabilities for forecasting the marine and structural integrity parameters.


Experiencing ProvLake to Manage the Data Lineage of AI Workflows

10.5753/sbsi.2020.13144 ◽

2020 ◽

Author(s):

Leonardo Guerreiro Azevedo ◽

Renan Souza ◽

Raphael Melo Thiago ◽

Elton Soares ◽

Marcio Moreno

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Data Management ◽

Oil And Gas ◽

Core Concept ◽

Data Lineage ◽

Oil And Gas Exploration ◽

Provenance Data ◽

Management Techniques ◽

Artificial Intelligence Systems

Machine Learning (ML) is a core concept behind Artificial Intelligence systems, which are driven by data and generate ML models. These models are used for decision making, and it is crucial to trust their outputs by, e.g., understanding the process that derives them. One way to explain the derivation of ML models is to track the whole ML lifecycle and generate its data lineage, which may be accomplished with provenance data management techniques. In this work, we present the use of the ProvLake tool for ML provenance data management in the ML lifecycle for Well Top Picking, an essential process in Oil and Gas exploration. We show how ProvLake supported the validation of ML models, the understanding of whether the ML models generalize while respecting the domain characteristics, and the understanding of their derivation.


Practices and Infrastructures for ML Systems – An Interview Study

10.36227/techrxiv.16939192.v1 ◽

2021 ◽

Author(s):

Dennis Muiruri ◽

Lucy Ellen Lwakatare ◽

Jukka K. Nurminen ◽

Tommi Mikkonen

Keyword(s):

Machine Learning ◽

Best Practices ◽

Data Management ◽

Management Practices ◽

Interview Study ◽

The State ◽

Data Driven ◽

Software Systems ◽

State Of Practice ◽

Model Training

The best practices and infrastructures for developing and maintaining machine learning (ML) enabled software systems are often reported by large and experienced data-driven organizations. However, little is known about the state of practice across other organizations. Using interviews, we investigated practices and tool-chains for ML-enabled systems from 16 organizations in various domains. Our study makes three broad observations related to data management practices, monitoring practices, and automation practices in ML model training and serving workflows. These areas have a limited number of generic practices and tools applicable across organizations in different domains.
