An open-source, citizen science and machine learning approach to analyse subsea movies

The increasing access to autonomously-operated technologies offer vast opportunities to sample large volumes of biological data. However, these technologies also impose novel demands on ecologists who need to apply tools for data management and processing that are efficient, publicly available and easy to use. Such tools are starting to be developed for a wider community and here we present an approach to combine essential analytical functions for analysing large volumes of image data in marine ecological research. This paper describes the Koster Seafloor Observatory, an open-source approach to analysing large amounts of subsea movie data for marine ecological research. The approach incorporates three distinct modules to: manage and archive the subsea movies, involve citizen scientists to accurately classify the footage and, finally, train and test machine learning algorithms for detection of biological objects. This modular approach is based on open-source code and allows researchers to customise and further develop the presented functionalities to various types of data and questions related to analysis of marine imagery. We tested our approach for monitoring cold water corals in a Marine Protected Area in Sweden using videos from remotely-operated vehicles (ROVs). Our study resulted in a machine learning model with an adequate performance, which was entirely trained with classifications provided by citizen scientists. We illustrate the application of machine learning models for automated inventories and monitoring of cold water corals. Our approach shows how citizen science can be used to effectively extract occurrence and abundance data for key ecological species and habitats from underwater footage. We conclude that the combination of open-source tools, citizen science systems, machine learning and high performance computational resources are key to successfully analyse large amounts of underwater imagery in the future.

Download Full-text

A system for automated analysis of subsea movies using citizen science and machine learning

10.3897/arphapreprints.e60597 ◽

2020 ◽

Author(s):

Victor Anton ◽

Jannes Germishuys ◽

Matthias Obst

Keyword(s):

Machine Learning ◽

Citizen Science ◽

High Performance ◽

Marine Protected Area ◽

Cold Water ◽

Image Data ◽

Application Programming Interface ◽

Automated Analysis ◽

Machine Learning Algorithms ◽

High Performance Computers

This paper describes a data system to analyse large amounts of subsea movie data for marine ecological research. The system consists of three distinct modules for data management and archiving, citizen science, and machine learning in a high performance computation environment. It allows scientists to upload underwater footage to a customised citizen science website hosted by Zooniverse, where volunteers from the public classify the footage. Classifications with high agreement among citizen scientists are then used to train machine learning algorithms. An application programming interface allows researchers to test the algorithms and track biological objects in new footage. We tested our system using recordings from remotely operated vehicles (ROVs) in a Marine Protected Area, the Kosterhavet National Park in Sweden. Results indicate a strong decline of cold-water corals in the park over a period of 15 years, showing that our system allows to effectively extract valuable occurrence and abundance data for key ecological species from underwater footage. We argue that the combination of citizen science tools, machine learning, and high performance computers are key to successfully analyse large amounts of image data in the future, suggesting that these services should be consolidated and interlinked by national and international research infrastructures. Novel information system to analyse marine underwater footage.

Download Full-text

PyBDA: a command line tool for automated analysis of big biological data sets

BMC Bioinformatics ◽

10.1186/s12859-019-3087-8 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 1

Author(s):

Simon Dirmeier ◽

Mario Emmenlauer ◽

Christoph Dehio ◽

Niko Beerenwinkel

Keyword(s):

Machine Learning ◽

High Performance ◽

Single Cells ◽

Automated Analysis ◽

Biological Data ◽

Machine Learning Algorithms ◽

Data Sets ◽

Command Line ◽

Command Line Tool ◽

High Performance Computing Cluster

Abstract Background Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to lack of accessible tools that scale to hundreds of millions of data points. Results We developed a novel machine learning command line tool called PyBDA for automated, distributed analysis of big biological data sets. By using Apache Spark in the backend, PyBDA scales to data sets beyond the size of current applications. It uses Snakemake in order to automatically schedule jobs to a high-performance computing cluster. We demonstrate the utility of the software by analyzing image-based RNA interference data of 150 million single cells. Conclusion PyBDA allows automated, easy-to-use data analysis using common statistical methods and machine learning algorithms. It can be used with simple command line calls entirely making it accessible to a broad user base. PyBDA is available at https://pybda.rtfd.io.

Download Full-text

An IoT-Focused Intrusion Detection System Approach Based on Preprocessing Characterization for Cybersecurity Datasets

Sensors ◽

10.3390/s21020656 ◽

2021 ◽

Vol 21 (2) ◽

pp. 656

Author(s):

Xavier Larriva-Novo ◽

Víctor A. Villagrá ◽

Mario Vega-Barbas ◽

Diego Rivera ◽

Mario Sanz Rodrigo

Keyword(s):

Machine Learning ◽

Intrusion Detection ◽

High Performance ◽

Learning Algorithm ◽

Detection System ◽

Machine Learning Algorithms ◽

Statistical Characteristics ◽

Detection Techniques ◽

Traffic Characteristics ◽

Benchmark Datasets

Security in IoT networks is currently mandatory, due to the high amount of data that has to be handled. These systems are vulnerable to several cybersecurity attacks, which are increasing in number and sophistication. Due to this reason, new intrusion detection techniques have to be developed, being as accurate as possible for these scenarios. Intrusion detection systems based on machine learning algorithms have already shown a high performance in terms of accuracy. This research proposes the study and evaluation of several preprocessing techniques based on traffic categorization for a machine learning neural network algorithm. This research uses for its evaluation two benchmark datasets, namely UGR16 and the UNSW-NB15, and one of the most used datasets, KDD99. The preprocessing techniques were evaluated in accordance with scalar and normalization functions. All of these preprocessing models were applied through different sets of characteristics based on a categorization composed by four groups of features: basic connection features, content characteristics, statistical characteristics and finally, a group which is composed by traffic-based features and connection direction-based traffic characteristics. The objective of this research is to evaluate this categorization by using various data preprocessing techniques to obtain the most accurate model. Our proposal shows that, by applying the categorization of network traffic and several preprocessing techniques, the accuracy can be enhanced by up to 45%. The preprocessing of a specific group of characteristics allows for greater accuracy, allowing the machine learning algorithm to correctly classify these parameters related to possible attacks.

Download Full-text

Empirical comparison of machine learning algorithms for bug prediction in open source software

2017 International Conference on Big Data Analytics and Computational Intelligence (ICBDAC) ◽

10.1109/icbdaci.2017.8070806 ◽

2017 ◽

Cited By ~ 2

Author(s):

Ruchika Malhotra ◽

Laavanye Bahl ◽

Sushant Sehgal ◽

Pragati Priya

Keyword(s):

Machine Learning ◽

Open Source ◽

Open Source Software ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Empirical Comparison

Download Full-text

Development of use-specific high-performance cyber-nanomaterial optical detectors by effective choice of machine learning algorithms

Machine Learning: Science and Technology ◽

10.1088/2632-2153/ab8967 ◽

2020 ◽

Vol 1 (2) ◽

pp. 025007 ◽

Cited By ~ 2

Author(s):

Davoud Hejazi ◽

Shuangjun Liu ◽

Amirreza Farnoosh ◽

Sarah Ostadabbas ◽

Swastik Kar

Keyword(s):

Machine Learning ◽

High Performance ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Optical Detectors ◽

Effective Choice

Download Full-text

Targeting Virus-host Protein Interactions: Feature Extraction and Machine Learning Approaches

Current Drug Metabolism ◽

10.2174/1389200219666180829121038 ◽

2019 ◽

Vol 20 (3) ◽

pp. 177-184 ◽

Cited By ~ 16

Author(s):

Nantao Zheng ◽

Kairou Wang ◽

Weihua Zhan ◽

Lei Deng

Keyword(s):

Machine Learning ◽

Computational Methods ◽

Protein Interactions ◽

Prediction Models ◽

Learning Algorithms ◽

Biological Data ◽

Machine Learning Algorithms ◽

Host Protein ◽

Protein Protein Interactions ◽

Protein Motifs

Background:Targeting critical viral-host Protein-Protein Interactions (PPIs) has enormous application prospects for therapeutics. Using experimental methods to evaluate all possible virus-host PPIs is labor-intensive and time-consuming. Recent growth in computational identification of virus-host PPIs provides new opportunities for gaining biological insights, including applications in disease control. We provide an overview of recent computational approaches for studying virus-host PPI interactions.Methods:In this review, a variety of computational methods for virus-host PPIs prediction have been surveyed. These methods are categorized based on the features they utilize and different machine learning algorithms including classical and novel methods.Results:We describe the pivotal and representative features extracted from relevant sources of biological data, mainly include sequence signatures, known domain interactions, protein motifs and protein structure information. We focus on state-of-the-art machine learning algorithms that are used to build binary prediction models for the classification of virus-host protein pairs and discuss their abilities, weakness and future directions.Conclusion:The findings of this review confirm the importance of computational methods for finding the potential protein-protein interactions between virus and host. Although there has been significant progress in the prediction of virus-host PPIs in recent years, there is a lot of room for improvement in virus-host PPI prediction.

Download Full-text

Can Gut Microbiota Be a Good Predictor for Parkinson’s Disease? A Machine Learning Approach

Brain Sciences ◽

10.3390/brainsci10040242 ◽

2020 ◽

Vol 10 (4) ◽

pp. 242 ◽

Cited By ~ 3

Author(s):

Daniele Pietrucci ◽

Adelaide Teofani ◽

Valeria Unida ◽

Rocco Cerroni ◽

Silvia Biocca ◽

...

Keyword(s):

Machine Learning ◽

Parkinson’S Disease ◽

Parkinson's Disease ◽

Random Forest ◽

Gut Microbiota ◽

Biological Data ◽

Machine Learning Algorithms ◽

Support Vector ◽

Published Data ◽

Promising Tool

The involvement of the gut microbiota in Parkinson’s disease (PD), investigated in several studies, identified some common alterations of the microbial community, such as a decrease in Lachnospiraceae and an increase in Verrucomicrobiaceae families in PD patients. However, the results of other bacterial families are often contradictory. Machine learning is a promising tool for building predictive models for the classification of biological data, such as those produced in metagenomic studies. We tested three different machine learning algorithms (random forest, neural networks and support vector machines), analyzing 846 metagenomic samples (472 from PD patients and 374 from healthy controls), including our published data and those downloaded from public databases. Prediction performance was evaluated by the area under curve, accuracy, precision, recall and F-score metrics. The random forest algorithm provided the best results. Bacterial families were sorted according to their importance in the classification, and a subset of 22 families has been identified for the prediction of patient status. Although the results are promising, it is necessary to train the algorithm with a larger number of samples in order to increase the accuracy of the procedure.

Download Full-text

PREDICTING SOFTWARE CHANGE IN AN OPEN SOURCE SOFTWARE USING MACHINE LEARNING ALGORITHMS

International Journal of Reliability Quality and Safety Engineering ◽

10.1142/s0218539313500253 ◽

2013 ◽

Vol 20 (06) ◽

pp. 1350025

Author(s):

RUCHIKA MALHOTRA ◽

ANKITA JAIN BANSAL

Keyword(s):

Machine Learning ◽

Open Source ◽

Open Source Software ◽

Roc Analysis ◽

Area Under The Curve ◽

Machine Learning Algorithms ◽

Operating Characteristics ◽

Preventive Actions ◽

Change Proneness ◽

The Relationship

Due to various reasons such as ever increasing demands of the customer or change in the environment or detection of a bug, changes are incorporated in a software. This results in multiple versions or evolving nature of a software. Identification of parts of a software that are more prone to changes than others is one of the important activities. Identifying change prone classes will help developers to take focused and timely preventive actions on the classes of the software with similar characteristics in the future releases. In this paper, we have studied the relationship between various object oriented (OO) metrics and change proneness. We collected a set of OO metrics and change data of each class that appeared in two versions of an open source dataset, 'Java TreeView', i.e., version 1.1.6 and version 1.0.3. Besides this, we have also predicted various models that can be used to identify change prone classes, using machine learning and statistical techniques and then compared their performance. The results are analyzed using Area Under the Curve (AUC) obtained from Receiver Operating Characteristics (ROC) analysis. The results show that the models predicted using both machine learning and statistical methods demonstrate good performance in terms of predicting change prone classes. Based on the results, it is reasonable to claim that quality models have a significant relevance with OO metrics and hence can be used by researchers for early prediction of change prone classes.

Download Full-text

Applying Machine Learning Techniques to Predict the Maintainability of Open Source Software

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.e1045.0785s319 ◽

2019 ◽

Vol 8 (5S3) ◽

pp. 192-195

Keyword(s):

Machine Learning ◽

Open Source ◽

Open Source Software ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Software Applications ◽

Learning Techniques ◽

Software Maintainability ◽

Quality Aspect

Software maintainability is a vital quality aspect as per ISO standards. This has been a concern since decades and even today, it is of top priority. At present, majority of the software applications, particularly open source software are being developed using Object-Oriented methodologies. Researchers in the earlier past have used statistical techniques on metric data extracted from software to evaluate maintainability. Recently, machine learning models and algorithms are also being used in a majority of research works to predict maintainability. In this research, we performed an empirical case study on an open source software jfreechart by applying machine learning algorithms. The objective was to study the relationships between certain metrics and maintainability.

Download Full-text

Cricket umpire assistance and ball tracking system using a single smartphone camera

10.7287/peerj.preprints.3402 ◽

2017 ◽

Author(s):

Udit Arora ◽

Sohit Verma ◽

Sarthak Sahni ◽

Tushar Sharma

Keyword(s):

Machine Learning ◽

Open Source ◽

Motion Tracking ◽

Tracking System ◽

Contour Detection ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Support Vector ◽

Ball Tracking ◽

Histogram Of Gradients

Several ball tracking algorithms have been reported in literature. However, most of them use high-quality video and multiple cameras, and the emphasis has been on coordinating the cameras or visualizing the tracking results. This paper aims to develop a system for assisting the umpire in the sport of Cricket in making decisions like detection of no-balls, wide-balls, leg before wicket and bouncers, with the help of a single smartphone camera. It involves the implementation of Computer Vision algorithms for object detection and motion tracking, as well as the integration of machine learning algorithms to optimize the results. Techniques like Histogram of Gradients (HOG) and Support Vector Machine (SVM) are used for object classification and recognition. Frame subtraction, minimum enclosing circle, and contour detection algorithms are optimized and used for the detection of a cricket ball. These algorithms are applied using the Open Source Python Library - OpenCV. Machine Learning techniques - Linear and Quadratic Regression are used to track and predict the motion of the ball. It also involves the use of open source Python library VPython for the visual representation of the results. The paper describes the design and structure for the approach undertaken in the system for analyzing and visualizing off-air low-quality cricket videos.

Download Full-text