Batch-Mask: An automated Mask R-CNN workflow to isolate non-standard biological specimens for color pattern analysis

Efficient comparisons of biological color patterns are critical for understanding the mechanisms by which organisms evolve in ecosystems, including sexual selection, predator-prey interactions, and thermoregulation. However, elongate or spiral-shaped organisms do not conform to the standard orientation and photographic techniques required for automated analysis. Currently, large-scale color analysis of elongate animals requires time-consuming manual landmarking, which reduces their representation in coloration research despite their ecological importance. We present Batch-Mask: an automated and customizable workflow to facilitate the analysis of large photographic data sets of non-standard biological subjects. First, we present a user guide to run an open-source region-based convolutional neural network with fine-tuned weights for identifying and isolating a biological subject from a background (masking). Then, we demonstrate how to combine masking with existing manual visual analysis tools into a single streamlined, automated workflow for comparing color patterns across images. Batch-Mask was 60x faster than manual landmarking, produced masks that correctly identified 96% of all snake pixels, and produced pattern energy results that were not significantly different from the manually landmarked data set. The fine-tuned weights for the masking neural network, user guide, and automated workflow substantially decrease the amount of time and attention required to quantitatively analyze non-standard biological subjects. By using these tools, biologists will be able to compare color, pattern, and shape differences in large data sets that include significant morphological variation in elongate body forms. This advance will be especially valuable for comparative analyses of natural history collections, and through automation can greatly expand the scale of space, time, or taxonomic breadth across which color variation can be quantitatively examined.
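
The abstract does not say how the 96% figure was computed. One plausible reading is a per-pixel recall: the share of subject pixels in a manually made reference mask that the automated mask also marks. The sketch below works under that assumption; the function name `pixel_recall` and the nested-list mask format are hypothetical and are not part of Batch-Mask.

```python
def pixel_recall(predicted, reference):
    """Fraction of reference (subject) pixels that the predicted mask also marks.

    Both masks are rectangular lists of rows holding truthy/falsy values.
    This is an illustrative metric, not the Batch-Mask implementation.
    """
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same number of rows")
    subject_pixels = 0
    recovered_pixels = 0
    for pred_row, ref_row in zip(predicted, reference):
        if len(pred_row) != len(ref_row):
            raise ValueError("mask rows must have the same length")
        for pred, ref in zip(pred_row, ref_row):
            if ref:
                subject_pixels += 1
                if pred:
                    recovered_pixels += 1
    if subject_pixels == 0:
        raise ValueError("reference mask contains no subject pixels")
    return recovered_pixels / subject_pixels


# Toy 3x3 example: the reference marks 4 subject pixels, the prediction finds 3.
reference_mask = [[0, 1, 0],
                  [0, 1, 1],
                  [0, 1, 0]]
predicted_mask = [[0, 1, 0],
                  [0, 1, 0],
                  [1, 1, 0]]
print(pixel_recall(predicted_mask, reference_mask))
```

Recall ignores background pixels that are wrongly marked as subject (the bottom-left pixel in the toy prediction), so a full evaluation would normally report it alongside a precision or overlap measure.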

An Interaction-Based Convolutional Neural Network (ICNN) Toward a Better Understanding of COVID-19 X-ray Images

Algorithms ◽

10.3390/a14110337 ◽

2021 ◽

Vol 14 (11) ◽

pp. 337

Author(s):

Shaw-Hwa Lo ◽

Yiqiao Yin

Keyword(s):

Neural Network ◽

Deep Learning ◽

Convolutional Neural Network ◽

Large Scale ◽

Explanatory Power ◽

Prediction Performance ◽

Data Sets ◽

Real World Data ◽

Data Set ◽

Model Free

The field of explainable artificial intelligence (XAI) aims to build explainable and interpretable machine learning (or deep learning) methods without sacrificing prediction performance. Convolutional neural networks (CNNs) have been successful in making predictions, especially in image classification. These popular and well-documented successes use extremely deep CNNs such as VGG16, DenseNet121, and Xception. However, these well-known deep learning models use tens of millions of parameters based on a large number of pretrained filters that have been repurposed from previous data sets. Among these identified filters, a large portion contain no information yet remain as input features. Thus far, there is no effective method to omit these noisy features from a data set, and their existence negatively impacts prediction performance. In this paper, a novel interaction-based convolutional neural network (ICNN) is introduced that does not make assumptions about the relevance of local information. Instead, a model-free influence score (I-score) is proposed to directly extract the influential information from images to form important variable modules. This innovative technique replaces all pretrained filters found by trial and error with explainable, influential, and predictive variable sets (modules) determined by the I-score. In other words, future researchers need not rely on pretrained filters; the suggested algorithm identifies only the variables or pixels with high I-score values that are extremely predictive and important. The proposed method and algorithm were tested on a real-world data set, and a state-of-the-art prediction performance of 99.8% was achieved without sacrificing the explanatory power of the model. This proposed design can efficiently screen patients infected by COVID-19 before human diagnosis and can be a benchmark for addressing future XAI problems in large-scale data sets.

Galaxy spin direction distribution in HST and SDSS show similar large-scale asymmetry

Publications of the Astronomical Society of Australia ◽

10.1017/pasa.2020.46 ◽

2020 ◽

Vol 37 ◽

Author(s):

Lior Shamir

Keyword(s):

Large Scale ◽

Spiral Galaxies ◽

Hubble Space Telescope ◽

Gravitational Interaction ◽

Large Data ◽

Sloan Digital Sky Survey ◽

Data Sets ◽

Dipole Axis ◽

Data Set ◽

The Asymmetry

Several recent observations using large data sets of galaxies showed a non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to interact gravitationally. Here, a data set of $\sim8.7\cdot10^3$ spiral galaxies imaged by the Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey (SDSS). The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. The results show that both data sets exhibit a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to cosine dependence shows a dipole axis with probabilities of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^\circ,\delta=47^\circ)$ and is well within the $1\sigma$ error range compared to the location of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^\circ,\delta=61^\circ)$.

A Comparison Study of Mahalanobis-Taguchi System and Neural Network for Multivariate Pattern Recognition

Design Engineering, Parts A and B ◽

10.1115/imece2005-80029 ◽

2005 ◽

Cited By ~ 10

Author(s):

Jungeui Hong ◽

Elizabeth A. Cudney ◽

Genichi Taguchi ◽

Rajesh Jugulum ◽

Kioumars Paryani ◽

...

Keyword(s):

Neural Network ◽

Small Data ◽

Data Sets ◽

Comparison Study ◽

Data Set ◽

Set Size ◽

Breast Cancer Study ◽

Discriminant Ability ◽

Small Data Sets ◽

Multivariate Pattern

The Mahalanobis-Taguchi System is a diagnostic and predictive method for analyzing patterns in multivariate cases. The goal of this study is to compare the ability of the Mahalanobis-Taguchi System and a neural network to discriminate using small data sets. We examine the discriminant ability as a function of data set size using an application area where reliable data is publicly available. The study uses the Wisconsin Breast Cancer study with nine attributes and one class.
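
The abstract gives no formulas, so the following is only a minimal sketch of the quantity the Mahalanobis-Taguchi System is built around: the squared Mahalanobis distance of an observation from a reference ("normal") group. All function names are hypothetical. The scaling, variable standardization, and variable-screening steps that a full Mahalanobis-Taguchi analysis would add are not shown.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1) of the reference group."""
    n = len(rows)
    mu = mean_vector(rows)
    k = len(mu)
    cov = [[0.0] * k for _ in range(k)]
    for row in rows:
        dev = [x - m for x, m in zip(row, mu)]
        for i in range(k):
            for j in range(k):
                cov[i][j] += dev[i] * dev[j]
    return [[value / (n - 1) for value in r] for r in cov]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mahalanobis_squared(x, reference):
    """Squared Mahalanobis distance of x from the reference group."""
    mu = mean_vector(reference)
    cov = covariance_matrix(reference)
    dev = [xi - mi for xi, mi in zip(x, mu)]
    weighted = solve(cov, dev)  # equals inverse(cov) applied to dev
    return sum(d * w for d, w in zip(dev, weighted))


# Toy reference group with mean (1, 1) and uncorrelated variables.
reference_group = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
print(mahalanobis_squared([3.0, 1.0], reference_group))
```

An observation is flagged as abnormal when its distance exceeds a chosen threshold. With small data sets the covariance estimate becomes unstable or singular, which is one reason the dependence on data set size is worth studying.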

The Midlatitude Continental Convective Clouds Experiment (MC3E) sounding network: operations, processing and analysis

Atmospheric Measurement Techniques ◽

10.5194/amt-8-421-2015 ◽

2015 ◽

Vol 8 (1) ◽

pp. 421-434 ◽

Cited By ~ 18

Author(s):

M. P. Jensen ◽

T. Toto ◽

D. Troyan ◽

P. E. Ciesielski ◽

D. Holdridge ◽

...

Keyword(s):

Large Scale ◽

Scale Model ◽

Data Sets ◽

Central Plains ◽

Data Set ◽

Convective Systems ◽

Convective Clouds ◽

Quality Checks ◽

Network Operations ◽

The Impact

The Midlatitude Continental Convective Clouds Experiment (MC3E) took place during the spring of 2011, centered in north-central Oklahoma, USA. The main goal of this field campaign was to capture the dynamical and microphysical characteristics of precipitating convective systems in the US Central Plains. A major component of the campaign was a six-site radiosonde array designed to capture the large-scale variability of the atmospheric state with the intent of deriving model forcing data sets. Over the course of the 46-day MC3E campaign, a total of 1362 radiosondes were launched from the enhanced sonde network. This manuscript provides details on the instrumentation used as part of the sounding array; the data processing activities, including quality checks and humidity bias corrections; and an analysis of the impacts of bias correction and algorithm assumptions on the determination of convective levels and indices. It is found that corrections for known radiosonde humidity biases and assumptions regarding the characteristics of the surface convective parcel result in significant differences in the derived values of convective levels and indices in many soundings. In addition, the impact of including the humidity corrections and quality controls on the thermodynamic profiles that are used in the derivation of a large-scale model forcing data set is investigated. The results show a significant impact on the derived large-scale vertical velocity field, illustrating the importance of addressing these humidity biases.

A fast methodology for large-scale focusing inversion of gravity and magnetic data using the structured model matrix and the 2-D fast Fourier transform

Geophysical Journal International ◽

10.1093/gji/ggaa372 ◽

2020 ◽

Vol 223 (2) ◽

pp. 1378-1397

Author(s):

Rosemary A Renaut ◽

Jarom D Hogue ◽

Saeed Vatankhah ◽

Shuang Liu

Keyword(s):

Fourier Transform ◽

Fast Fourier Transform ◽

Linear Systems ◽

Large Scale ◽

Surface Measurement ◽

Magnetic Data ◽

Uniform Grid ◽

Data Sets ◽

Inversion Algorithm ◽

Data Set

We discuss the focusing inversion of potential field data for the recovery of sparse subsurface structures from surface measurement data on a uniform grid. For the uniform grid, the model sensitivity matrices have a block Toeplitz Toeplitz block structure for each block of columns related to a fixed depth layer of the subsurface. Then, all forward operations with the sensitivity matrix, or its transpose, are performed using the 2-D fast Fourier transform. Simulations are provided to show that the implementation of the focusing inversion algorithm using the fast Fourier transform is efficient, and that the algorithm can be realized on standard desktop computers with sufficient memory for storage of volumes up to size $n \approx 10^6$. The linear systems of equations arising in the focusing inversion algorithm are solved using either Golub–Kahan bidiagonalization or randomized singular value decomposition algorithms. These two algorithms are contrasted for their efficiency when used to solve large-scale problems with respect to the sizes of the projected subspaces adopted for the solutions of the linear systems. The results confirm earlier studies that the randomized algorithms are to be preferred for the inversion of gravity data, and for data sets of size $m$ it is sufficient to use projected spaces of size approximately $m/8$. For the inversion of magnetic data sets, we show that it is more efficient to use the Golub–Kahan bidiagonalization, and that it is again sufficient to use projected spaces of size approximately $m/8$. Simulations support the presented conclusions and are verified for the inversion of a magnetic data set obtained over the Wuskwatim Lake region in Manitoba, Canada.
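
A brief sketch of why the 2-D fast Fourier transform applies to a block Toeplitz Toeplitz block matrix, in notation that is assumed here for illustration and is not taken from the paper. On a uniform $n_x \times n_y$ grid, suppose the sensitivity entry linking the measurement at $(i,j)$ to the cell at $(i',j')$ of a fixed depth layer depends only on the index offsets through a hypothetical kernel $g$. The product of that layer's block $G$ with a model vector $x$ is then a 2-D discrete convolution:

$$(G x)_{i,j} = \sum_{i'=1}^{n_x} \sum_{j'=1}^{n_y} g(i-i',\, j-j')\, x_{i',j'}$$

Embedding $g$ and the zero-padded $x$ in arrays $\hat{g}$ and $\hat{x}$ of size $(2n_x-1) \times (2n_y-1)$ turns this into a circular convolution, so the product can be read off from

$$\mathcal{F}^{-1}\big(\mathcal{F}(\hat{g}) \odot \mathcal{F}(\hat{x})\big)$$

restricted to the original index range, where $\mathcal{F}$ is the 2-D discrete Fourier transform and $\odot$ is the elementwise product. With $N = n_x n_y$, this costs $O(N \log N)$ operations per layer instead of the $O(N^2)$ of a dense product, and only the $(2n_x-1)(2n_y-1)$ kernel values need to be stored rather than the full $N \times N$ block.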

Six years of total ozone column measurements from SCIAMACHY nadir observations

Atmospheric Measurement Techniques ◽

10.5194/amt-2-87-2009 ◽

2009 ◽

Vol 2 (1) ◽

pp. 87-98 ◽

Cited By ~ 39

Author(s):

C. Lerot ◽

M. Van Roozendael ◽

J. van Geffen ◽

J. van Gent ◽

C. Fayt ◽

...

Keyword(s):

Cross Sections ◽

Total Ozone ◽

Large Scale ◽

European Space Agency ◽

Data Sets ◽

Data Set ◽

Ozone Data ◽

Space Agency ◽

German Aerospace ◽

The Impact

Total O3 columns have been retrieved from six years of SCIAMACHY nadir UV radiance measurements using SDOAS, an adaptation of the GDOAS algorithm previously developed at BIRA-IASB for the GOME instrument. GDOAS and SDOAS have been implemented by the German Aerospace Center (DLR) in version 4 of the GOME Data Processor (GDP) and in version 3 of the SCIAMACHY Ground Processor (SGP), respectively. The processors are being run at the DLR processing centre on behalf of the European Space Agency (ESA). We first focus on the description of the SDOAS algorithm, with particular attention to the impact of uncertainties on the reference O3 absorption cross-sections. Second, the resulting SCIAMACHY total ozone data set is globally evaluated through large-scale comparisons with results from GOME and OMI as well as with ground-based correlative measurements. The various total ozone data sets are found to agree within 2% on average. However, a negative trend of 0.2–0.4%/year has been identified in the SCIAMACHY O3 columns; this probably originates from instrumental degradation effects that have not yet been fully characterized.

Collecting public RGB-D datasets for human daily activity recognition

International Journal of Advanced Robotic Systems ◽

10.1177/1729881417709079 ◽

2017 ◽

Vol 14 (4) ◽

pp. 172988141770907 ◽

Cited By ~ 2

Author(s):

Hanbo Wu ◽

Xin Ma ◽

Zhimeng Zhang ◽

Haibo Wang ◽

Yibin Li

Keyword(s):

Activity Recognition ◽

Daily Activity ◽

Visual Cues ◽

Large Scale ◽

Hot Spot ◽

Feature Representation ◽

Data Sets ◽

Activity Data ◽

Data Set ◽

Depth Motion Maps

Human daily activity recognition has been a hot spot in the field of computer vision for many decades. Despite best efforts, activity recognition in naturally uncontrolled settings remains a challenging problem. Recently, by being able to perceive depth and visual cues simultaneously, RGB-D cameras have greatly boosted the performance of activity recognition. However, due to some practical difficulties, the publicly available RGB-D data sets are not sufficiently large for benchmarking when considering the diversity of their activities, subjects, and backgrounds. This severely affects the applicability of complicated learning-based recognition approaches. To address the issue, this article provides a large-scale RGB-D activity data set by merging five public RGB-D data sets that differ from each other in many aspects, such as length of actions, nationality of subjects, or camera angles. This data set comprises 4528 samples depicting 7 action categories (up to 46 subcategories) performed by 74 subjects. To verify how challenging the data set is, three feature representation methods are evaluated: depth motion maps, spatiotemporal depth cuboid similarity feature, and curvature space scale. Results show that the merged large-scale data set is more realistic and challenging and therefore more suitable for benchmarking.

Characterising RDF data sets

Journal of Information Science ◽

10.1177/0165551516677945 ◽

2017 ◽

Vol 44 (2) ◽

pp. 203-229 ◽

Cited By ~ 6

Author(s):

Javier D Fernández ◽

Miguel A Martínez-Prieto ◽

Pablo de la Fuente Redondo ◽

Claudio Gutiérrez

Keyword(s):

Data Structures ◽

Large Scale ◽

Open Data ◽

Structural Features ◽

Data Sets ◽

Data Set ◽

Wide Range ◽

Rdf Data ◽

Description Framework ◽

Resource Description

The publication of semantic web data, commonly represented in Resource Description Framework (RDF), has experienced outstanding growth over the last few years. Data from all fields of knowledge are shared publicly and interconnected in active initiatives such as Linked Open Data. However, despite the increasing availability of applications managing large-scale RDF information such as RDF stores and reasoning tools, little attention has been given to the structural features emerging in real-world RDF data. Our work addresses this issue by proposing specific metrics to characterise RDF data. We specifically focus on revealing the redundancy of each data set, as well as common structural patterns. We evaluate the proposed metrics on several data sets, which cover a wide range of designs and models. Our findings provide a basis for more efficient RDF data structures, indexes and compressors.

The Application of Probabilistic Neural Network in Speech Recognition Based on Partition Clustering

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.263-266.2173 ◽

2012 ◽

Vol 263-266 ◽

pp. 2173-2178

Author(s):

Xin Guang Li ◽

Min Feng Yao ◽

Li Rui Jian ◽

Zhen Jiang Li

Keyword(s):

Neural Network ◽

Speech Recognition ◽

Clustering Algorithm ◽

Probabilistic Neural Network ◽

Back Propagation ◽

Back Propagation Neural Network ◽

Data Sets ◽

Data Set ◽

Proposed Model ◽

Partition Clustering

A probabilistic neural network (PNN) speech recognition model based on the partition clustering algorithm is proposed in this paper. The most important advantage of a PNN is that training is easy and instantaneous, so a PNN is capable of real-time speech recognition. Because the selection of the training data set is one of the most important factors in PNN performance, this paper proposes using the partition clustering algorithm to select the data. The proposed model is tested on two data sets from the field of spoken Arabic numbers, with promising results. The performance of the proposed model is compared to that of a single back propagation neural network and an integrated back propagation neural network. The final comparison shows that the proposed model performs better than the other two neural networks, with an accuracy rate of 92.41%.
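
To illustrate why PNN training is described as instantaneous, here is a minimal sketch of a textbook PNN classifier with Gaussian kernels: "training" only stores the labelled patterns, and classification averages a kernel over each class. The function name `pnn_classify`, the smoothing parameter `sigma`, and the toy data are assumptions for illustration. The paper's partition clustering step, which selects which patterns to store, is not shown.

```python
import math


def pnn_classify(x, training, sigma=1.0):
    """Return the class whose stored patterns give the highest average Gaussian kernel.

    training maps each class label to a list of stored feature vectors.
    sigma is the kernel width (an assumed smoothing parameter).
    """
    best_label = None
    best_score = -1.0
    for label, patterns in training.items():
        if not patterns:
            continue  # a class with no stored patterns cannot be scored
        total = 0.0
        for pattern in patterns:
            squared_distance = sum((a - b) ** 2 for a, b in zip(x, pattern))
            total += math.exp(-squared_distance / (2.0 * sigma ** 2))
        score = total / len(patterns)  # class-conditional density estimate
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy two-class data; "training" is nothing more than storing these vectors.
stored_patterns = {
    "a": [[0.0, 0.0], [0.0, 1.0]],
    "b": [[5.0, 5.0], [6.0, 5.0]],
}
print(pnn_classify([0.2, 0.4], stored_patterns))
print(pnn_classify([5.5, 5.0], stored_patterns))
```

Because every stored pattern is visited at prediction time, cost and memory grow with the number of stored patterns, which is why choosing a smaller representative subset matters.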

Neural Network for Big Data Sets

10.4018/978-1-6684-2408-7.ch003 ◽

2022 ◽

pp. 41-67

Author(s):

Vo Ngoc Phu ◽

Vo Thi Ngoc Tran

Keyword(s):

Neural Network ◽

Big Data ◽

Computer Science ◽

Large Scale ◽

Massive Data ◽

Data Sets ◽

Massive Data Sets ◽

Large Scale Data ◽

Commercial Applications ◽

Novel Model

Machine learning (ML), neural networks (NNs), evolutionary algorithms (EAs), fuzzy systems (FSs), and computer science more broadly have been prominent and significant for many years. They have been applied in many different areas and have contributed much to the development of many large-scale corporations and organizations. These corporations and organizations have in turn generated large amounts of information and massive data sets (MDSs). Such big data sets (BDSs) pose challenges for many commercial applications and research efforts. Consequently, many ML, NN, EA, and FS algorithms have been developed to handle these massive data sets. To support this process, the authors present the available NN algorithms for large-scale data sets (LSDSs) in this chapter. Finally, they present a novel NN model for BDSs in a sequential environment (SE) and a distributed network environment (DNE).
