A system for automated analysis of subsea movies using citizen science and machine learning

This paper describes a data system for analysing large amounts of subsea movie data for marine ecological research. The system consists of three distinct modules for data management and archiving, citizen science, and machine learning in a high-performance computing environment. It allows scientists to upload underwater footage to a customised citizen science website hosted by Zooniverse, where volunteers from the public classify the footage. Classifications with high agreement among citizen scientists are then used to train machine learning algorithms. An application programming interface allows researchers to test the algorithms and track biological objects in new footage. We tested our system using recordings from remotely operated vehicles (ROVs) in a Marine Protected Area, the Kosterhavet National Park in Sweden. Results indicate a strong decline of cold-water corals in the park over a period of 15 years, showing that our system can effectively extract valuable occurrence and abundance data for key ecological species from underwater footage. We argue that the combination of citizen science tools, machine learning, and high-performance computers is key to successfully analysing large amounts of image data in the future, and suggest that these services should be consolidated and interlinked by national and international research infrastructures. Novel information system to analyse marine underwater footage.
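The agreement filter mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how agreement is measured, so the function name `consensus_labels`, the `(clip_id, label)` input format, and both thresholds (`min_votes`, `min_agreement`) are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def consensus_labels(classifications, min_votes=3, min_agreement=0.8):
    """Keep clips whose volunteer votes agree strongly on one label.

    classifications: iterable of (clip_id, label) pairs, one per volunteer vote.
    min_votes and min_agreement are illustrative thresholds (assumptions).
    """
    # Group all votes by the clip they refer to.
    votes_by_clip = {}
    for clip_id, label in classifications:
        votes_by_clip.setdefault(clip_id, []).append(label)

    accepted = {}
    for clip_id, votes in votes_by_clip.items():
        if len(votes) < min_votes:
            continue  # too few volunteers to judge agreement
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= min_agreement:
            accepted[clip_id] = label  # usable as a training label
    return accepted


votes = [
    ("clip_a", "coral"), ("clip_a", "coral"), ("clip_a", "coral"),
    ("clip_a", "coral"), ("clip_a", "sponge"),
    ("clip_b", "coral"), ("clip_b", "sponge"), ("clip_b", "nothing"),
    ("clip_c", "coral"),
]
print(consensus_labels(votes))  # {'clip_a': 'coral'}
```

Here `clip_b` is dropped because its volunteers disagree, and `clip_c` because it has only one vote. Only `clip_a` would be passed on as training data.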

An open-source, citizen science and machine learning approach to analyse subsea movies

Biodiversity Data Journal ◽

10.3897/bdj.9.e60548 ◽

2021 ◽

Vol 9 ◽

Author(s):

Victor Anton ◽

Jannes Germishuys ◽

Per Bergström ◽

Mats Lindegarth ◽

Matthias Obst

Keyword(s):

Machine Learning ◽

Open Source ◽

Citizen Science ◽

High Performance ◽

Marine Protected Area ◽

Cold Water ◽

Biological Data ◽

Machine Learning Algorithms ◽

Test Machine ◽

Ecological Research

The increasing access to autonomously-operated technologies offers vast opportunities to sample large volumes of biological data. However, these technologies also impose novel demands on ecologists, who need tools for data management and processing that are efficient, publicly available and easy to use. Such tools are starting to be developed for a wider community, and here we present an approach that combines essential analytical functions for analysing large volumes of image data in marine ecological research. This paper describes the Koster Seafloor Observatory, an open-source approach to analysing large amounts of subsea movie data for marine ecological research. The approach incorporates three distinct modules to: manage and archive the subsea movies; involve citizen scientists to accurately classify the footage; and train and test machine learning algorithms for detection of biological objects. This modular approach is based on open-source code and allows researchers to customise and further develop the presented functionalities for various types of data and questions related to the analysis of marine imagery. We tested our approach for monitoring cold-water corals in a Marine Protected Area in Sweden using videos from remotely-operated vehicles (ROVs). Our study resulted in a machine learning model with adequate performance, which was trained entirely with classifications provided by citizen scientists. We illustrate the application of machine learning models for automated inventories and monitoring of cold-water corals. Our approach shows how citizen science can be used to effectively extract occurrence and abundance data for key ecological species and habitats from underwater footage. We conclude that the combination of open-source tools, citizen science systems, machine learning and high-performance computational resources is key to successfully analysing large amounts of underwater imagery in the future.

Smartphone-based real-time object recognition architecture for portable and constrained systems

Journal of Real-Time Image Processing ◽

10.1007/s11554-021-01164-1 ◽

2021 ◽

Author(s):

Ignacio Martinez-Alpiste ◽

Gelayol Golcarenarenji ◽

Qi Wang ◽

Jose Maria Alcaraz-Calero

Keyword(s):

Machine Learning ◽

Object Recognition ◽

Real Time ◽

High Performance ◽

High Efficiency ◽

Constrained Systems ◽

Machine Learning Algorithms ◽

Learning Platforms ◽

High Performance Computers ◽

Final System

Machine learning algorithms based on convolutional neural networks (CNNs) have recently been explored in a myriad of object detection applications. Nonetheless, many devices with limited computation resources and strict power consumption constraints are not suitable for running such algorithms, which are designed for high-performance computers. Hence, a novel smartphone-based architecture intended for portable and constrained systems is designed and implemented to run CNN-based object recognition in real time and with high efficiency. The system is designed and optimised by integrating the best-in-class components of state-of-the-art machine learning platforms, including OpenCV, TensorFlow Lite, and Qualcomm Snapdragon, informed by empirical testing and evaluation of each candidate framework in a comparable scenario with a highly demanding neural network. The final system has been prototyped by combining the strengths of these frameworks, leading to a new machine learning-based object recognition execution environment embedded in a smartphone, with advantageous performance compared with the previous frameworks.

PyBDA: a command line tool for automated analysis of big biological data sets

BMC Bioinformatics ◽

10.1186/s12859-019-3087-8 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 1

Author(s):

Simon Dirmeier ◽

Mario Emmenlauer ◽

Christoph Dehio ◽

Niko Beerenwinkel

Keyword(s):

Machine Learning ◽

High Performance ◽

Single Cells ◽

Automated Analysis ◽

Biological Data ◽

Machine Learning Algorithms ◽

Data Sets ◽

Command Line ◽

Command Line Tool ◽

High Performance Computing Cluster

Background: Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to the lack of accessible tools that scale to hundreds of millions of data points. Results: We developed a novel machine learning command line tool called PyBDA for automated, distributed analysis of big biological data sets. By using Apache Spark in the backend, PyBDA scales to data sets beyond the size of current applications. It uses Snakemake to automatically schedule jobs to a high-performance computing cluster. We demonstrate the utility of the software by analysing image-based RNA interference data of 150 million single cells. Conclusion: PyBDA allows automated, easy-to-use data analysis using common statistical methods and machine learning algorithms. It can be used entirely with simple command line calls, making it accessible to a broad user base. PyBDA is available at https://pybda.rtfd.io.

Advances in the Prediction of Protein Subcellular Locations with Machine Learning

Current Bioinformatics ◽

10.2174/1574893614666181217145156 ◽

2019 ◽

Vol 14 (5) ◽

pp. 406-421 ◽

Cited By ~ 3

Author(s):

Ting-He Zhang ◽

Shao-Wu Zhang

Keyword(s):

Machine Learning ◽

Feature Fusion ◽

Protein Sequences ◽

Subcellular Location ◽

Automated Analysis ◽

Cellular Level ◽

Machine Learning Algorithms ◽

Feature Representation ◽

Protein Subcellular Location ◽

Protein Subcellular Locations

Background: Revealing the subcellular location of a newly discovered protein can bring insight into its function and guide research at the cellular level. The experimental methods currently used to identify protein subcellular locations are both time-consuming and expensive. Thus, computational methods for efficiently and effectively identifying protein subcellular locations are highly desirable. In particular, the rapidly increasing number of protein sequences entering the genome databases has called for the development of automated analysis methods. Methods: In this review, we describe recent advances in predicting protein subcellular locations with machine learning from the following aspects: i) protein subcellular location benchmark dataset construction, ii) protein feature representation and feature descriptors, iii) common machine learning algorithms, iv) cross-validation test methods and assessment metrics, v) web servers. Result & Conclusion: Given the large number of protein sequences generated by high-throughput technologies, four future directions for predicting protein subcellular locations with machine learning deserve attention. The first is the selection of novel and effective features (e.g., statistical, physical-chemical, evolutionary) from the sequences and structures of proteins. The second is the feature fusion strategy. The third is the design of a powerful predictor, and the fourth is the prediction of proteins with multiple location sites.

An IoT-Focused Intrusion Detection System Approach Based on Preprocessing Characterization for Cybersecurity Datasets

Sensors ◽

10.3390/s21020656 ◽

2021 ◽

Vol 21 (2) ◽

pp. 656

Author(s):

Xavier Larriva-Novo ◽

Víctor A. Villagrá ◽

Mario Vega-Barbas ◽

Diego Rivera ◽

Mario Sanz Rodrigo

Keyword(s):

Machine Learning ◽

Intrusion Detection ◽

High Performance ◽

Learning Algorithm ◽

Detection System ◽

Machine Learning Algorithms ◽

Statistical Characteristics ◽

Detection Techniques ◽

Traffic Characteristics ◽

Benchmark Datasets

Security in IoT networks is now essential because of the large amount of data these networks handle. These systems are vulnerable to a range of cybersecurity attacks, which are increasing in number and sophistication. For this reason, new intrusion detection techniques that are as accurate as possible in these scenarios have to be developed. Intrusion detection systems based on machine learning algorithms have already shown high performance in terms of accuracy. This research studies and evaluates several preprocessing techniques based on traffic categorization for a machine learning neural network algorithm. The evaluation uses two benchmark datasets, UGR16 and UNSW-NB15, and one of the most widely used datasets, KDD99. The preprocessing techniques were evaluated in accordance with scalar and normalization functions. All of these preprocessing models were applied to different sets of characteristics based on a categorization composed of four groups of features: basic connection features, content characteristics, statistical characteristics and, finally, a group composed of traffic-based features and connection direction-based traffic characteristics. The objective of this research is to evaluate this categorization by using various data preprocessing techniques to obtain the most accurate model. Our proposal shows that, by applying the categorization of network traffic and several preprocessing techniques, the accuracy can be enhanced by up to 45%. Preprocessing a specific group of characteristics yields greater accuracy, allowing the machine learning algorithm to correctly classify the parameters related to possible attacks.
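As a rough illustration of per-feature preprocessing, the sketch below applies two common scaling functions to a single numeric traffic feature. The abstract does not name the functions it evaluated, so min-max scaling and z-score standardization are assumed here as typical examples; the function names and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardize values to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature
    return [(v - mu) / sigma for v in values]


# Hypothetical values of one numeric feature, e.g. packets per connection.
packet_counts = [2, 4, 4, 4, 5, 5, 7, 9]
print(min_max_scale([0, 5, 10]))  # [0.0, 0.5, 1.0]
print(z_score(packet_counts))     # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

In a categorization like the one described, each feature group could be given its own scaling function before training, which is one way the choice of preprocessing can change model accuracy.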

Machine Learning-Assisted Sampling of SERS Substrates Improves Data Collection Efficiency

Applied Spectroscopy ◽

10.1177/00037028211034543 ◽

2021 ◽

pp. 000370282110345

Author(s):

Tatu Rojalin ◽

Dexter Antonio ◽

Ambarish Kulkarni ◽

Randy P. Carney

Keyword(s):

Machine Learning ◽

Data Collection ◽

Domain Knowledge ◽

Collection Efficiency ◽

Point Of Care ◽

Automated Analysis ◽

Downstream Processing ◽

Machine Learning Algorithms ◽

Label Free ◽

Expert User

Surface-enhanced Raman scattering (SERS) is a powerful technique for sensitive label-free analysis of chemical and biological samples. While much recent work has established sophisticated automation routines using machine learning and related artificial intelligence methods, these efforts have largely focused on downstream processing (e.g., classification tasks) of previously collected data. While fully automated analysis pipelines are desirable, current progress is limited by cumbersome and manually intensive sample preparation and data collection steps. Specifically, a typical lab-scale SERS experiment requires the user to evaluate the quality and reliability of the measurement (i.e., the spectra) as the data are being collected. This need for expert user intuition is a major bottleneck that limits the applicability of SERS-based diagnostics for point-of-care clinical applications, where trained spectroscopists are likely to be unavailable. While application-agnostic numerical approaches (e.g., signal-to-noise thresholding) are useful, there is an urgent need to develop algorithms that leverage expert user intuition and domain knowledge to simplify and accelerate data collection steps. To address this challenge, in this work, we introduce a machine learning-assisted method at the acquisition stage. We tested six common algorithms to identify the best performance in the context of spectral quality judgment. For adoption into future automation platforms, we developed an open-source Python package tailored for rapid expert user annotation to train machine learning algorithms. We expect that this new approach of using machine learning to assist in data acquisition can serve as a useful building block for point-of-care SERS diagnostic platforms.
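The signal-to-noise thresholding that the abstract cites as an application-agnostic baseline might look like the sketch below. This is not the authors' package. The SNR definition (peak height above the mean of a signal-free region, divided by that region's standard deviation), the function names, and the threshold of 10 are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def snr(spectrum, noise_region):
    """Estimate SNR as (peak - baseline) / noise standard deviation.

    noise_region is a (start, stop) index pair for a part of the spectrum
    assumed to contain no peaks (an assumption of this sketch).
    """
    start, stop = noise_region
    noise = spectrum[start:stop]
    baseline = mean(noise)
    sigma = stdev(noise)
    if sigma == 0:
        return float("inf")  # perfectly flat noise region
    return (max(spectrum) - baseline) / sigma


def passes_quality(spectrum, noise_region, threshold=10.0):
    """Accept a spectrum only if its estimated SNR reaches the threshold."""
    return snr(spectrum, noise_region) >= threshold


# Hypothetical intensity values: the first six points are treated as noise.
with_peak = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 8.0, 1.0]
no_peak = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 1.0]
print(passes_quality(with_peak, (0, 6)))  # True
print(passes_quality(no_peak, (0, 6)))    # False
```

A fixed rule like this knows nothing about which peaks matter for a given analyte, which is the gap the abstract proposes to close with models trained on expert annotations.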

Development of use-specific high-performance cyber-nanomaterial optical detectors by effective choice of machine learning algorithms

Machine Learning: Science and Technology ◽

10.1088/2632-2153/ab8967 ◽

2020 ◽

Vol 1 (2) ◽

pp. 025007 ◽

Cited By ~ 2

Author(s):

Davoud Hejazi ◽

Shuangjun Liu ◽

Amirreza Farnoosh ◽

Sarah Ostadabbas ◽

Swastik Kar

Keyword(s):

Machine Learning ◽

High Performance ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Optical Detectors ◽

Effective Choice

Expert-level Automated Biomarker Identification in Optical Coherence Tomography Scans

Scientific Reports ◽

10.1038/s41598-019-49740-7 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 4

Author(s):

Thomas Kurmann ◽

Siqing Yu ◽

Pablo Márquez-Neila ◽

Andreas Ebneter ◽

Martin Zinkernagel ◽

...

Keyword(s):

Machine Learning ◽

Optical Coherence Tomography ◽

Critical Role ◽

Cost Effective ◽

Automated Analysis ◽

Machine Learning Algorithms ◽

Three Dimensions ◽

Optical Coherence ◽

Clinical Routine ◽

Wide Range

In ophthalmology, retinal biological markers, or biomarkers, play a critical role in the management of chronic eye conditions and in the development of new therapeutics. While many imaging technologies used today can visualize these, Optical Coherence Tomography (OCT) is often the tool of choice due to its ability to image retinal structures in three dimensions at micrometer resolution. But with widespread use in clinical routine, and the growing prevalence of chronic retinal conditions, the quantity of scans acquired worldwide is surpassing the capacity of retinal specialists to inspect them in meaningful ways. Instead, automated analysis of scans using machine learning algorithms provides a cost-effective and reliable alternative to assist ophthalmologists in clinical routine and research. We present a machine learning method capable of consistently identifying a wide range of common retinal biomarkers from OCT scans. Our approach avoids the need for costly segmentation annotations and allows scans to be characterized by biomarker distributions. These can then be used to classify scans based on their underlying pathology in a device-independent way.

Integrating hierarchical statistical models and machine-learning algorithms for ground-truthing drone images of the vegetation: taxonomy, abundance and population ecological models

10.1101/491381 ◽

2018 ◽

Cited By ~ 1

Author(s):

Christian Damgaard

Keyword(s):

Machine Learning ◽

Statistical Models ◽

Learning Algorithms ◽

Plant Competition ◽

Image Data ◽

Ground Truth ◽

Ecological Models ◽

Machine Learning Algorithms ◽

Ground Truth Data ◽

Ground Truthing

In order to fit population ecological models, e.g. plant competition models, to new drone-aided image data, we need to develop statistical models that take into account the new type of measurement uncertainty introduced by applying machine-learning algorithms, and that quantify its importance for statistical inferences and ecological predictions. Here, it is proposed to quantify the uncertainty and bias of image-predicted plant taxonomy and abundance in a hierarchical statistical model that is linked to ground-truth data obtained by the pin-point method. It is critical that the error rate in the species identification process is minimized when the image data are fitted to the population ecological models, and several avenues for reaching this objective are discussed. The outlined method for statistically modelling known sources of uncertainty when applying machine-learning algorithms may be relevant for other applied scientific disciplines.

Classification of masked image data

PLoS ONE ◽

10.1371/journal.pone.0254181 ◽

2021 ◽

Vol 16 (7) ◽

pp. e0254181

Author(s):

Kamila Lis ◽

Mateusz Koryciński ◽

Konrad A. Ciecierski

Keyword(s):

Neural Network ◽

Machine Learning ◽

Image Data ◽

Original Data ◽

Machine Learning Algorithms ◽

General Data Protection Regulation ◽

Additional Information ◽

Classification Of Images ◽

Applications Of Machine Learning

Data classification is one of the most common applications of machine learning. There are many well-developed algorithms that perform this task very well in various environments and for different data distributions. Classification algorithms, like other machine learning algorithms, have one thing in common: in order to operate on data, they must see the data. In today's world, where concerns about privacy, the GDPR (General Data Protection Regulation), business confidentiality and security keep growing, this requirement to work directly on the original data can, in some situations, become a burden. This paper presents an approach to classifying images that cannot be directly accessed during training. It shows that one can train a deep neural network to create a representation of the original data such that i) without additional information, the original data cannot be restored, and ii) this representation, called a masked form, can still be used for classification purposes. Moreover, it shows that classification of the masked data can be done using both classical and neural network-based classifiers.
