An embedded system for the automated generation of labeled plant images to enable machine learning applications in agriculture

PLoS ONE ◽  
2020 ◽  
Vol 15 (12) ◽  
pp. e0243923
Author(s):  
Michael A. Beck ◽  
Chen-Yi Liu ◽  
Christopher P. Bidinosti ◽  
Christopher J. Henry ◽  
Cara M. Godee ◽  
...  

A lack of sufficient training data, both in terms of variety and quantity, is often the bottleneck in the development of machine learning (ML) applications in any domain. For agricultural applications, ML-based models designed to perform tasks such as autonomous plant classification will typically be coupled to just one or perhaps a few plant species. As a consequence, each crop-specific task is very likely to require its own specialized training data, and the question of how to serve this need for data now often overshadows the more routine exercise of actually training such models. To tackle this problem, we have developed an embedded robotic system to automatically generate and label large datasets of plant images for ML applications in agriculture. The system can image plants from virtually any angle, thereby ensuring a wide variety of data; and with an imaging rate of up to one image per second, it can produce labeled datasets on the scale of thousands to tens of thousands of images per day. As such, this system offers an important alternative to time- and cost-intensive methods of manual generation and labeling. Furthermore, the use of a uniform background made of blue keying fabric enables additional image processing techniques such as background replacement and image segmentation. It also helps in the training process, essentially forcing the model to focus on the plant features and eliminating spurious background correlations. To demonstrate the capabilities of our system, we generated a dataset of over 34,000 labeled images, with which we trained an ML model to distinguish grasses from non-grasses in test data from a variety of sources. We now plan to generate much larger datasets of Canadian crop plants and weeds that will be made publicly available in the hope of further enabling ML applications in the agriculture sector.
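
The blue keying fabric makes background removal essentially a color-thresholding problem. As a rough illustration, not the authors' pipeline, the hypothetical segment_plant sketch below masks a blue background in HSV space with OpenCV; the hue and saturation bounds are assumptions that would need tuning to the actual fabric and lighting.

```python
# Hypothetical blue-keying segmentation sketch; thresholds are assumptions.
import cv2
import numpy as np

def segment_plant(image_bgr: np.ndarray) -> np.ndarray:
    """Mask out a uniform blue keying background, keeping the plant pixels."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    # Assumed hue range for the blue fabric (OpenCV hue runs 0-179).
    lower_blue = np.array([100, 80, 40])
    upper_blue = np.array([130, 255, 255])
    background = cv2.inRange(hsv, lower_blue, upper_blue)
    plant_mask = cv2.bitwise_not(background)
    # Remove speckle with a small morphological opening.
    kernel = np.ones((5, 5), np.uint8)
    plant_mask = cv2.morphologyEx(plant_mask, cv2.MORPH_OPEN, kernel)
    return cv2.bitwise_and(image_bgr, image_bgr, mask=plant_mask)
```

The same mask supports background replacement: composite any new backdrop where plant_mask is zero.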


Author(s):  
Thilo Hagendorff

Machine behavior that is based on learning algorithms can be significantly influenced by exposure to data of different qualities. Up to now, those qualities have been measured solely in technical terms, not in ethical ones, despite the significant role of training and annotation data in supervised machine learning. This is the first study to fill this gap by describing new dimensions of data quality for supervised machine learning applications. Based on the rationale that different social and psychological backgrounds of individuals correlate in practice with different modes of human–computer interaction, the paper describes from an ethical perspective how the varying qualities of behavioral data that individuals leave behind while using digital technologies have socially relevant ramifications for the development of machine learning applications. The specific objective of this study is to describe how training data can be selected according to ethical assessments of the behavior it originates from, establishing an innovative filter regime to transition from the big-data rationale n = all to a more selective way of processing data for training sets in machine learning. The overarching aim of this research is to promote methods for achieving beneficial machine learning applications that could be widely useful for industry as well as academia.
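
As a hypothetical illustration of such a filter regime (the paper argues for ethical assessments but prescribes no concrete API; the assess function and threshold below are assumptions), selecting training records by an ethical score could look like this:

```python
# Hypothetical filter regime: instead of training on all behavioral records
# (n = all), keep only those whose originating behavior passes an ethical
# assessment. The scoring function and threshold are illustrative assumptions.
from typing import Callable, Iterable

def ethical_filter(records: Iterable[dict],
                   assess: Callable[[dict], float],
                   threshold: float = 0.5) -> list[dict]:
    """Select training records whose ethical assessment meets the threshold."""
    return [r for r in records if assess(r) >= threshold]
```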


2021 ◽  
Author(s):  
Tobias Greisager Rehfeldt ◽  
Konrad Krawczyk ◽  
Mathias Bøgebjerg ◽  
Veit Schwämmle ◽  
Richard Röttger

Motivation: Liquid-chromatography mass-spectrometry (LC-MS) is the established standard for analyzing the proteome in biological samples by identification and quantification of thousands of proteins. Machine learning (ML) promises to considerably improve the analysis of the resulting data; however, there is yet to be any tool that mediates the path from raw data to modern ML applications. More specifically, ML applications are currently hampered by three major limitations: (1) the absence of balanced training data with large sample size; (2) the unclear definition of sufficiently information-rich data representations for, e.g., peptide identification; and (3) the lack of benchmarking of ML methods on specific LC-MS problems.

Results: We created the MS2AI pipeline, which automates the process of gathering vast quantities of mass spectrometry (MS) data for large-scale ML applications. The software retrieves raw data from either in-house sources or from the proteomics identifications database PRIDE. Subsequently, the raw data are stored in a standardized format amenable to ML, encompassing MS1/MS2 spectra and peptide identifications. This tool bridges the gap between MS and AI, and to this effect we also present an ML application in the form of a convolutional neural network for the identification of oxidized peptides.

Availability: An open-source implementation of the software is freely available for non-commercial use at https://gitlab.com/roettgerlab/ms2ai

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.
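
As a rough sketch of the kind of model such a pipeline could feed (this is not the MS2AI network; the 2048-bin input and all layer sizes are assumptions), a 1D convolutional classifier over binned MS2 spectra might look like this:

```python
# Minimal 1D-CNN sketch for binary peptide classification from binned
# spectra. NOT the MS2AI architecture; all dimensions are assumptions.
import torch
import torch.nn as nn

class SpectrumCNN(nn.Module):
    def __init__(self, n_bins: int = 2048):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.classifier = nn.Linear(32 * (n_bins // 16), 2)  # oxidized / not

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_bins) intensity vector of a binned MS2 spectrum
        h = self.features(x)
        return self.classifier(h.flatten(1))

model = SpectrumCNN()
logits = model(torch.randn(8, 1, 2048))  # 8 example spectra
```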


2009 ◽  
pp. 3325-3339
Author(s):  
Du Zhang

Software engineering research and practice have thus far been conducted primarily in a value-neutral setting, where each artifact in software development, such as a requirement, use case, test case, or defect, is treated as equally important during the development process. Such value-neutral software engineering has a number of shortcomings. Value-based software engineering integrates value considerations into the full range of existing and emerging software engineering principles and practices. Machine learning has been playing an increasingly important role in helping develop and maintain large and complex software systems. However, machine learning applications to software engineering have been largely confined to the value-neutral setting. The general message of this paper is that machine learning methods and algorithms should be applied to value-based software engineering: the training data, background knowledge, domain theory, heuristics, or bias used by machine learning methods in generating target models or functions should be aligned with stakeholders' value propositions. An initial research agenda is proposed for machine learning in value-based software engineering.
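
One hypothetical way to operationalize this alignment (an illustration under assumed inputs, not the paper's method) is to weight each training artifact by its stakeholder-assessed value, so that high-value cases dominate the fit:

```python
# Hypothetical value-weighted training: each software artifact (e.g., a
# defect report) is weighted by its assessed business value. The value
# scores here are random placeholders, purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 5)            # features of 100 software artifacts
y = np.random.randint(0, 2, 100)      # e.g., defect-prone or not
value = np.random.rand(100)           # stakeholder-assessed value per artifact

model = LogisticRegression()
model.fit(X, y, sample_weight=value)  # high-value artifacts count more
```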


Algorithms ◽  
2019 ◽  
Vol 12 (2) ◽  
pp. 32 ◽  
Author(s):  
Alessandro Scirè ◽  
Fabrizio Tropeano ◽  
Aris Anagnostopoulos ◽  
Ioannis Chatzigiannakis

Designing advanced health monitoring systems is still an active research topic. Wearable and remote monitoring devices enable the monitoring of physiological and clinical parameters (heart rate, respiration rate, temperature, etc.) and their analysis using cloud-centric machine-learning applications and decision-support systems to predict critical clinical states. This paper moves from a totally cloud-centric concept to a more distributed one by transferring sensor-data processing and analysis tasks to the edges of the network. The resulting solution enables the analysis and interpretation of sensor-data traces within the wearable device to provide actionable alerts without any dependence on cloud services. In this paper, we use a supervised-learning approach to detect heartbeats and classify arrhythmias. The system uses a window-based feature definition that is suitable for execution within an asymmetric multicore embedded processor that provides a dedicated core for hardware-assisted pattern matching. We evaluate the performance of the system against various existing approaches in terms of the accuracy achieved in detecting abnormal events. The results show that the proposed embedded system achieves a high detection rate that in some cases matches the accuracy of state-of-the-art algorithms executed on standard processors.
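
To illustrate what a window-based feature definition means in practice (the window length and feature set below are assumptions, not the authors' design), a sliding window over an ECG-like trace can be reduced to a few cheap per-window statistics suitable for an embedded core:

```python
# Illustrative sliding-window feature extraction; parameters are assumptions.
import numpy as np

def window_features(signal: np.ndarray, window: int = 180, step: int = 90):
    """Slide a fixed window over the trace and emit simple statistics."""
    feats = []
    for start in range(0, len(signal) - window + 1, step):
        w = signal[start:start + window]
        feats.append([w.mean(), w.std(), w.max() - w.min(),
                      np.abs(np.diff(w)).mean()])  # mean slope magnitude
    return np.array(feats)

# Each row can then be fed to a lightweight classifier on the wearable.
features = window_features(np.sin(np.linspace(0, 50, 2000)))
```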


PLoS ONE ◽  
2020 ◽  
Vol 15 (12) ◽  
pp. e0244245
Author(s):  
Abigail Hotaling ◽  
James P. Bagrow

Allowing members of the crowd to propose novel microtasks for one another is an effective way to combine the efficiencies of traditional microtask work with the inventiveness and hypothesis-generation potential of human workers. However, microtask proposal leads to a growing set of tasks that may overwhelm limited crowdsourcer resources. Crowdsourcers can employ methods to utilize their resources efficiently, but algorithmic approaches to efficient crowdsourcing generally require a fixed task set of known size. In this paper, we introduce cost forecasting as a means for a crowdsourcer to use efficient crowdsourcing algorithms with a growing set of microtasks. Cost forecasting allows the crowdsourcer to decide between eliciting new tasks from the crowd or receiving responses to existing tasks based on whether or not new tasks will cost less to complete than existing tasks, efficiently balancing resources as crowdsourcing occurs. Experiments with real and synthetic crowdsourcing data show that cost forecasting leads to improved accuracy. Accuracy and efficiency gains for crowd-generated microtasks hold the promise to further leverage the creativity and wisdom of the crowd, with applications such as generating more informative and diverse training data for machine learning and improving the performance of user-generated content and question-answering platforms.
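
The core decision rule can be sketched in a few lines (the trivial running-mean forecaster below is an assumption for illustration; the paper's forecasting model is more involved):

```python
# Hypothetical sketch of the elicit-vs-collect decision: elicit a new task
# only when its forecast completion cost beats the cheapest open task.
from statistics import mean

def choose_action(past_new_task_costs: list[float],
                  remaining_task_costs: list[float]) -> str:
    forecast_new = mean(past_new_task_costs)       # forecast cost of a new task
    cheapest_existing = min(remaining_task_costs)  # cost to finish an open task
    return ("elicit_new_task" if forecast_new < cheapest_existing
            else "collect_responses")

print(choose_action([3.2, 4.1, 3.8], [5.0, 6.5]))  # -> elicit_new_task
```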


Author(s):  
Gergely Csányi ◽  
Tamás Orosz

Sorting legal documents by subject matter is an essential but time-consuming task due to the large amount of data involved. Many machine learning-based text categorization methods exist that can address this problem. However, these algorithms cannot perform well if they lack sufficient training data for every category; text augmentation can resolve this shortage. Data augmentation is a widely used technique in machine learning applications, especially in computer vision. Textual data has different characteristics than images, so different solutions must be applied when the need for data augmentation arises. Moreover, the type and characteristics of the textual data, or the task itself, may reduce the number of methods that can be applied in a given scenario. This paper focuses on text augmentation methods that can be applied to legal documents when classifying them into specific groups of subject matters.
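
One simple textual augmentation commonly used in such settings is synonym replacement; the toy synonym table below is an assumption (in practice a legal thesaurus or word embeddings would supply candidates):

```python
# Illustrative synonym-replacement augmentation; the table is hypothetical.
import random

SYNONYMS = {"contract": ["agreement"], "breach": ["violation"],
            "party": ["side"]}  # assumed legal-domain synonym table

def augment(text: str, p: float = 0.3, seed: int = 0) -> str:
    """Randomly swap known words for synonyms with probability p."""
    rng = random.Random(seed)
    words = [rng.choice(SYNONYMS[w]) if w in SYNONYMS and rng.random() < p
             else w for w in text.split()]
    return " ".join(words)

print(augment("the party committed a breach of contract"))
```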


Author(s):  
Benjamin Sobel

Many machine learning applications depend on unauthorized uses of copyrighted training data. Scholars and lawmakers often articulate this problem as a deficiency in copyright’s exceptions and limitations. In fact, today’s predicament results not from inadequate exceptions to copyright, but rather from two systemic features of the regime—the absence of formalities and the low threshold of copyrightable originality—combined with technology that turns routine activities into acts of authorship. This chapter taxonomizes AI applications by their training data. Four categories emerge: (1) public-domain data, (2) licensed data, (3) market-encroaching uses of copyrighted data, and (4) non-market-encroaching uses of copyrighted data. Copyright can only regulate market-encroaching uses, but these uses comprise only a narrow subset of AI, and copyright’s remedies are ill-suited to address them. The chapter concludes with a discussion of solutions to the problems it identifies. It observes that the exception for Text and Data Mining in the European Union’s Directive on Copyright in the Digital Single Market represents a positive development because the exception addresses structural causes of AI’s copyright problems. The TDM provision styles itself as an exception, but it may be better understood as a formality: rights holders must affirmatively reserve the right to exclude materials from training datasets. Thus, the TDM exception addresses a cause of AI’s copyright dilemma. The next step for an equitable AI framework will be to transition towards rules that not only clarify that non-market-encroaching uses do not infringe copyright, but also facilitate remunerated uses of copyrighted works for market-encroaching purposes.


2021 ◽  
Vol 11 (22) ◽  
pp. 11040
Author(s):  
Quoc Nguyen ◽  
Tomoaki Shikina ◽  
Daichi Teruya ◽  
Seiji Hotta ◽  
Huy-Dung Han ◽  
...  

In training-based machine learning applications, training data are frequently labeled by non-experts and exhibit substantial label noise, which greatly degrades the trained models. In this work, a novel method for reducing the effect of label noise is introduced. Rules created from expert knowledge identify incorrect non-expert training data. Using the gradient descent algorithm, the violating data samples are weighted less to mitigate their effect during model training. The proposed method is applied to the image classification problem using the Manga109 and CIFAR-10 datasets. The experiments show that when the noise level is up to 50%, the proposed method significantly increases the accuracy of the model compared to conventional learning methods.
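
A minimal sketch of the weighting idea (the rule check and the fixed down-weight below are assumptions; the paper adjusts weights via gradient descent rather than fixing them):

```python
# Illustrative weighted loss: rule-violating samples contribute less.
import torch
import torch.nn as nn

def weighted_loss(logits, labels, violates_rule):
    """Cross-entropy where rule-violating samples are down-weighted."""
    per_sample = nn.functional.cross_entropy(logits, labels, reduction="none")
    weights = torch.where(violates_rule, torch.tensor(0.2), torch.tensor(1.0))
    return (weights * per_sample).mean()

logits = torch.randn(4, 10)                     # 4 samples, 10 classes
labels = torch.tensor([1, 3, 5, 7])
violates = torch.tensor([False, True, False, True])
loss = weighted_loss(logits, labels, violates)  # backprop as usual
```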


Author(s):  
Daniel Kottke ◽  
Marek Herde ◽  
Tuan Pham Minh ◽  
Alexander Benz ◽  
Pascal Mergard ◽  
...  

Machine learning applications often need large amounts of training data to perform well. Whereas unlabeled data can be gathered easily, the labeling process is difficult, time-consuming, or expensive in most applications. Active learning can help solve this problem by querying labels for those data points that will improve performance the most, so that the learning algorithm performs sufficiently well with fewer labels. We provide a library called scikit-activeml that covers the most relevant query strategies and implements tools to work with partially labeled data. It is programmed in Python and builds on top of scikit-learn.
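
A minimal pool-based usage sketch, adapted from the library's documented workflow (class and parameter names may differ across versions):

```python
# Pool-based active learning loop with scikit-activeml; adapted from the
# library's documented example, details may vary across versions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from skactiveml.classifier import SklearnClassifier
from skactiveml.pool import UncertaintySampling
from skactiveml.utils import MISSING_LABEL

X, y_true = make_classification(random_state=0)
y = np.full(shape=y_true.shape, fill_value=MISSING_LABEL)  # all unlabeled

clf = SklearnClassifier(LogisticRegression(), classes=np.unique(y_true))
qs = UncertaintySampling(method="entropy")

for _ in range(20):                      # 20 labeling cycles
    query_idx = qs.query(X=X, y=y, clf=clf)
    y[query_idx] = y_true[query_idx]     # oracle provides the true label

clf.fit(X, y)                            # final model on the partial labels
```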

