Machine learning for improved data analysis of biological aerosol using the WIBS

Author(s):  
Simon Ruske ◽  
David O. Topping ◽  
Virginia E. Foot ◽  
Andrew P. Morse ◽  
Martin W. Gallagher

Abstract. Primary biological aerosols, including bacteria, fungal spores and pollen, have important implications for public health and the environment. Such particles may have different concentrations of chemical fluorophores and will provide different responses in the presence of ultraviolet light, which could potentially be used to discriminate between different types of biological aerosol. Development of ultraviolet light induced fluorescence (UV-LIF) instruments such as the Wideband Integrated Bioaerosol Sensor (WIBS) has made it possible to collect size, morphology and fluorescence measurements in real time. However, without studying responses from the instrument in the laboratory, it is unclear to what extent we can discriminate between different types of particles. Collection of laboratory data is vital to validate any approach used to analyse the data and to ensure that the data available is utilised as effectively as possible. In this manuscript we test a variety of methodologies on traditional reference particles and a range of laboratory generated aerosols. Hierarchical Agglomerative Clustering (HAC) has been previously applied to UV-LIF data in a number of studies and is tested alongside other algorithms that could be used to solve the classification problem: Density-Based Spatial Clustering of Applications with Noise (DBSCAN), k-means and gradient boosting. Whilst HAC was able to effectively discriminate between the reference particles, yielding a classification error of only 1.8 %, similar results were not obtained when testing on laboratory generated aerosol, where the classification error was found to be between 11.5 % and 24.2 %. Furthermore, there is a worryingly large uncertainty in this approach in terms of the data preparation and the cluster index used, and we were unable to attain consistent results across the different sets of laboratory generated aerosol tested.
The best results were obtained using gradient boosting, where the misclassification rate was between 4.38 % and 5.42 %. The largest contribution to this error was the pollen samples, where 28.5 % of the samples were misclassified as fungal spores. The technique was also robust to changes in data preparation provided a fluorescent threshold was applied to the data. Where laboratory training data is unavailable, DBSCAN was found to be a potential alternative to HAC. In the case of one of the data sets, where 22.9 % of the data was left unclassified, we were able to produce three distinct clusters, obtaining a classification error of only 1.42 % on the classified data. These results could not be replicated, however, for the other data set, where 26.8 % of the data was not classified and a classification error of 13.8 % was obtained. This method, like HAC, also appeared to be heavily dependent on data preparation, requiring a different selection of parameters depending on the preparation used. Further analysis will also be required to confirm our selection of parameters when using this method on ambient data. There is a clear need for the collection of additional laboratory generated aerosol to improve interpretation of current databases and to aid in the analysis of data collected from an ambient environment. New instruments with a greater resolution are likely to improve on current discrimination between pollen, bacteria and fungal spores and even between their different types; however, the need for extensive laboratory training data sets will grow as a result.

2018 ◽  
Vol 11 (11) ◽  
pp. 6203-6230 ◽  
Author(s):  
Simon Ruske ◽  
David O. Topping ◽  
Virginia E. Foot ◽  
Andrew P. Morse ◽  
Martin W. Gallagher

Abstract. Primary biological aerosols, including bacteria, fungal spores and pollen, have important implications for public health and the environment. Such particles may have different concentrations of chemical fluorophores and will respond differently in the presence of ultraviolet light, potentially allowing for different types of biological aerosol to be discriminated. Development of ultraviolet light induced fluorescence (UV-LIF) instruments such as the Wideband Integrated Bioaerosol Sensor (WIBS) has allowed for size, morphology and fluorescence measurements to be collected in real time. However, without studying instrument responses in the laboratory, it is unclear to what extent different types of particles can be discriminated. Collection of laboratory data is vital to validate any approach used to analyse data and ensure that the data available is utilized as effectively as possible. In this paper a variety of methodologies are tested on a range of particles collected in the laboratory. Hierarchical agglomerative clustering (HAC) has been previously applied to UV-LIF data in a number of studies and is tested alongside other algorithms that could be used to solve the classification problem: Density-Based Spatial Clustering of Applications with Noise (DBSCAN), k-means and gradient boosting. Whilst HAC was able to effectively discriminate between reference narrow-size distribution PSL particles, yielding a classification error of only 1.8 %, similar results were not obtained when testing on laboratory generated aerosol, where the classification error was found to be between 11.5 % and 24.2 %. Furthermore, there is a large uncertainty in this approach in terms of the data preparation and the cluster index used, and we were unable to attain consistent results across the different sets of laboratory generated aerosol tested. The lowest classification errors were obtained using gradient boosting, where the misclassification rate was between 4.38 % and 5.42 %.
The largest contribution to the error, in the case of the higher misclassification rate, was the pollen samples, where 28.5 % of the samples were incorrectly classified as fungal spores. The technique was robust to changes in data preparation provided a fluorescent threshold was applied to the data. In the event that laboratory training data are unavailable, DBSCAN was found to be a potential alternative to HAC. In the case of one of the data sets, where 22.9 % of the data were left unclassified, we were able to produce three distinct clusters, obtaining a classification error of only 1.42 % on the classified data. These results could not be replicated for the other data set, where 26.8 % of the data were not classified and a classification error of 13.8 % was obtained. This method, like HAC, also appeared to be heavily dependent on data preparation, requiring a different selection of parameters depending on the preparation used. Further analysis will also be required to confirm our selection of the parameters when using this method on ambient data. There is a clear need for the collection of additional laboratory generated aerosol to improve interpretation of current databases and to aid in the analysis of data collected from an ambient environment. New instruments with a greater resolution are likely to improve on current discrimination between pollen, bacteria and fungal spores and even between different species; however, the need for extensive laboratory data sets will grow as a result.
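
The DBSCAN figures quoted above pair two numbers: the share of particles left unclassified and the classification error among the particles that were assigned to a cluster. A minimal Python sketch of how such a pair could be computed follows. It assumes that unclassified (noise) points carry the label -1 and that each cluster is scored against the most common true label among its members. The abstract does not state how clusters were matched to particle types, so this mapping, the function name evaluate_clustering and the toy labels are illustrative assumptions only.

```python
from collections import Counter

NOISE = -1  # assumed marker for points the clustering left unclassified


def evaluate_clustering(true_labels, cluster_ids):
    """Return (unclassified fraction, classification error on classified points)."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have the same length")
    if not true_labels:
        raise ValueError("at least one point is required")

    # Keep only the points that were assigned to some cluster.
    classified = [(t, c) for t, c in zip(true_labels, cluster_ids) if c != NOISE]
    unclassified_fraction = 1 - len(classified) / len(true_labels)

    # Count the true labels found inside each cluster.
    members = {}
    for t, c in classified:
        members.setdefault(c, Counter())[t] += 1

    # Assumption: a cluster "stands for" its most common true label,
    # so every member carrying that label counts as correct.
    correct = sum(counts.most_common(1)[0][1] for counts in members.values())
    error = 1 - correct / len(classified) if classified else float("nan")
    return unclassified_fraction, error


# Made-up toy data: 10 particles, 2 left unclassified, 1 fungal spore in the wrong cluster.
truth = ["bacteria"] * 4 + ["fungal"] * 4 + ["pollen"] * 2
clusters = [0, 0, 0, NOISE, 1, 1, 1, 0, 2, NOISE]
unclassified, error = evaluate_clustering(truth, clusters)
print(f"unclassified: {unclassified:.1%}, error on classified: {error:.1%}")
```

Reporting both numbers together matters because a low error on the classified data can be obtained simply by leaving the difficult particles unclassified.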


2016 ◽  
Author(s):  
Simon Ruske ◽  
David O. Topping ◽  
Virginia E. Foot ◽  
Paul H. Kaye ◽  
Warren R. Stanley ◽  
...  

Abstract. Characterisation of bio-aerosols has important implications within the environment and public health sectors. Recent developments in Ultra-Violet Light Induced Fluorescence (UV-LIF) detectors such as the Wideband Integrated bio-aerosol Spectrometer (WIBS) and the newly introduced Multiparameter bio-aerosol Spectrometer (MBS) have allowed for the real-time collection of fluorescence, size and morphology measurements for the purpose of discriminating between bacteria, fungal spores and pollen. This new generation of instruments has enabled ever larger data sets to be compiled with the aim of studying more complex environments. In real world data sets, particularly those from an urban environment, the population may be dominated by non-biological fluorescent interferents, bringing into question the accuracy of measurements of quantities such as concentrations. It is therefore imperative that we validate the performance of different algorithms which can be used for the task of classification. For unsupervised learning we test Hierarchical Agglomerative Clustering with various linkages. For supervised learning, ten methods were tested: decision trees; the ensemble methods Random Forests, Gradient Boosting and AdaBoost; two implementations of support vector machines, libsvm and liblinear; the Gaussian methods Gaussian naïve Bayes, quadratic discriminant analysis and linear discriminant analysis; and finally the k-nearest neighbours algorithm. The methods were applied to two different data sets measured using a new Multiparameter bio-aerosol Spectrometer, which provides multichannel UV-LIF fluorescence signatures for single airborne biological particles. Clustering in general performs slightly worse than the supervised learning methods, correctly classifying at best only 72.7 and 91.1 percent for the two data sets respectively.
For supervised learning the gradient boosting algorithm was found to be the most effective, on average correctly classifying 88.1 and 97.8 percent of the testing data respectively across the two data sets.


Author(s):  
M. L. R. Gonzaga ◽  
M. T. S. Wong ◽  
A. C. Blanco ◽  
J. A. Principe

Abstract. With the Philippines ranking as the third largest source of plastics that end up in the oceans, there is a need to further explore methodologies that can aid plastic waste removal from the ocean. Manila Bay is a natural harbor in the Philippines that serves as the center of different economic activities. However, the bay is also threatened by plastic pollution due to increasing population and industrial activities. BASECO is one of the areas in Manila Bay where clean-up activities are focused, as this is where trash accumulates. Sentinel-2 images are provided free of charge by the European Commission's Copernicus Programme. Satellite images from June 2019 to May 2020 were inspected, then cloud-free images were downloaded. After downloading and pre-processing, the training data were selected using spectral data for different types of plastic (shipping pouch, bubble wrap, styrofoam, PET bottle, sando bag and snack packaging). These spectra were measured with a spectrometer during fieldwork by the Development of Integrated Mapping, Monitoring, and Analytical Network System for Manila Bay and Linked Environments (project MapABLE). Then, indices from previous studies, such as the Normalized Difference Vegetation Index (NDVI), Floating Debris Index (FDI) and Plastic Index (PI), were analyzed to further separate the classes used as training data. These training data served as input to the two supervised classification methods, Naive Bayes and Mixture Tuned Matched Filtering (MTMF). Both methods were validated against reports and articles from Philippine agencies indicating the spots where trash frequently accumulates.
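
Of the indices named above, NDVI has a standard closed form, (NIR - Red) / (NIR + Red), which can be sketched in a few lines of Python. The snippet assumes reflectance values taken from the Sentinel-2 near-infrared and red bands (bands 8 and 4); the function name ndvi and the sample reflectances are hypothetical and are not taken from the study. FDI and PI are not shown because the abstract does not give their definitions.

```python
import math


def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel's reflectances."""
    denominator = nir + red
    if denominator == 0:
        # No signal in either band: the index is undefined.
        return math.nan
    return (nir - red) / denominator


# Hypothetical reflectances, for illustration only.
vegetation_like = ndvi(nir=0.5, red=0.1)   # strong NIR reflectance gives a high NDVI
water_like = ndvi(nir=0.02, red=0.05)      # NIR absorbed by water gives a negative NDVI
print(round(vegetation_like, 3), round(water_like, 3))
```

Because vegetation and open water fall at opposite ends of the index, a value like this can help separate those classes before floating debris is considered.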


Author(s):  
Jorge Arroyo-Palacios ◽  
Daniela M. Romano

Affective bio-feedback can be an important instrument to enhance the game experience. Several studies have provided evidence of the usefulness of physiological signals for affective gaming; however, due to the limited knowledge about the distinctive autonomic signatures for every emotion, the pattern matching models employed are limited in the number of emotions they are able to classify. This paper presents a bio-affective gaming interface (BAGI) that can be used to customize a game experience according to the player’s emotional response. Its architecture offers characteristics that are important for gaming because they make it possible to reuse previous findings and to include new models in the system. In order to demonstrate the effectiveness of BAGI, two different types of neural networks have been trained to recognize emotions. They were incorporated into the system to customize, in real time, the computer wallpaper according to the emotion experienced by the user. The best results were obtained with a probabilistic neural network, with an accuracy of 84.46% on the training data and 78.38% on validation with new independent data sets.


2015 ◽  
Vol 54 (03) ◽  
pp. 215-220 ◽  
Author(s):  
M. Matteucci ◽  
L. Mainardi ◽  
A. Tahirovic

Summary Introduction: This article is part of the Focus Theme of Methods of Information in Medicine on “Biosignal Interpretation: Advanced Methods for Neural Signals and Images”. Objectives: The main objectives of the paper are to analyse the amplitude spatial distribution of the P300 evoked potential over the scalp of a particular subject and to find an averaged spatial distribution template for that subject. This template, which may differ between subjects, can help in getting a more accurate P300 detection for all BCIs that inherently use spatial filtering to detect the P300 signal. Finally, the proposed averaging technique obtains an averaged spatial distribution template for a particular subject from only several epochs, which makes the technique fast and usable without any prior training data, unlike the data enhancement technique. Methods: The method used in the proposed framework for the averaging of the spatial distribution of P300 evoked potentials is based on the statistical properties of independent components (ICs). These components are obtained by using independent component analysis (ICA) from different target epochs. Results: This paper gives a novel averaging technique for the spatial distribution of P300 evoked potentials, which is based on the P300 signals obtained from different target epochs using the ICA algorithm. Such a technique provides a more reliable P300 spatial distribution for a subject of interest, which can be used either for an improved spatial selection of ICs, or for more accurate P300 detection and extraction. In addition, the experiments demonstrate that the values of spatial intensity computed by the proposed technique for the P300 signal converge after only several target epochs for each electrode allocation.
Such a speed of convergence allows the proposed algorithm to easily adapt to a subject of interest without any additional artificial data preparation prior to the algorithm execution, unlike the data enhancement technique. Conclusion: The proposed technique averages the P300 spatial distribution for a particular subject over all electrode allocations. First, the technique combines P300-like components obtained by the ICA run within a target epoch in order to obtain an averaged P300 spatial distribution. Second, the technique averages spatial distributions of P300 signals obtained from different target epochs in order to get the final averaged template. Such a template can be useful for any BCI technique where spatial selection is used to detect evoked potentials.


2017 ◽  
Vol 57 (8) ◽  
pp. 1012-1025 ◽  
Author(s):  
Andrei P. Kirilenko ◽  
Svetlana O. Stepchenkova ◽  
Hany Kim ◽  
Xiang (Robert) Li

Interest in applying Big Data to tourism is increasing, and automated sentiment analysis has been used to extract public opinion from various sources. This article evaluates the suitability of different types of automated classifiers for applications typical in tourism, hospitality, and marketing studies by comparing their performance to that of human raters. While the commonly used performance indices suggest that on easier-to-classify data sets machine learning methods demonstrate performance comparable to that by human raters, other performance measures such as Cohen’s kappa show that the results of machine learning are still inferior to manual processing. On more difficult and noisy data sets, automated analysis has poorer performance than human raters. The article discusses issues pertinent to selection of appropriate sentiment analysis software and offers a word of caution against using automated classifiers uncritically.
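
The contrast drawn above between common performance indices and Cohen’s kappa can be illustrated with a short Python sketch. Kappa compares the observed agreement p_o between two raters with the agreement p_e expected by chance from their label frequencies: kappa = (p_o - p_e) / (1 - p_e). The function name cohens_kappa and the toy labels below are made up for illustration and are not data from the article.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the two raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)

    if expected == 1:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


# Toy example: a classifier that always answers "positive" on a skewed data set.
human = ["positive"] * 8 + ["negative"] * 2
machine = ["positive"] * 10
accuracy = sum(h == m for h, m in zip(human, machine)) / len(human)
kappa = cohens_kappa(human, machine)
print(f"accuracy = {accuracy:.2f}, kappa = {kappa:.2f}")
```

In this toy case the classifier agrees with the human rater on 80% of items, yet kappa is zero because that level of agreement is exactly what chance would produce on such a skewed label distribution.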


2020 ◽  
Vol 30 (Supplement_5) ◽  
Author(s):  
R Haneef ◽  
S Fuentes ◽  
R Hrzic ◽  
S Fosse-Edorh ◽  
S Kab ◽  
...  

Abstract Background The use of artificial intelligence to estimate and predict health outcomes from large data sets is increasing. The main objectives were to develop two algorithms using machine learning techniques to identify new cases of diabetes (case study I) and to classify type 1 and type 2 diabetes (case study II) in France. Methods We selected the training data set from a cohort study linked with the French national health database (i.e., SNDS). Two final data sets were used to achieve each objective. A supervised machine learning method comprising the following eight steps was developed: selection of the data set, case definition, coding and standardization of variables, splitting the data into training and test data sets, variable selection, training, validation and selection of the model. We planned to apply the trained models on the SNDS to estimate the incidence of diabetes and the prevalence of type 1/2 diabetes. Results For case study I, 23/3468 and for case study II, 14/3481 SNDS variables were selected based on an optimal balance between variance explained and using the ReliefExp algorithm. We trained four models using different classification algorithms on the training data set. The Linear Discriminant Analysis model performed best in both case studies. The models were assessed on the test data sets and achieved a specificity of 67% and a sensitivity of 62% in case study I, and a specificity of 97% and a sensitivity of 100% in case study II. The case study II model was applied to the SNDS and estimated the prevalence in France in 2016 at 0.3% for type 1 diabetes and 4.4% for type 2 diabetes. The case study I model was not applied to the SNDS. Conclusions The case study II model to estimate the prevalence of type 1/2 diabetes has good performance and will be used in routine surveillance.
The case study I model to identify new cases of diabetes showed a poor performance due to missing necessary information on determinants of diabetes and will need to be improved for further research.
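
The sensitivity and specificity figures reported above are derived from a binary confusion matrix: sensitivity is the share of true cases that the model detects, and specificity is the share of non-cases that it correctly rejects. A minimal Python sketch follows; the function name sensitivity_specificity and the toy labels are hypothetical and do not reproduce the SNDS data.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Tally the four cells of the confusion matrix.
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1

    # Each ratio is undefined when its class is absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy test set: 4 true cases and 6 non-cases (made-up values).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Reporting the two values separately, as the abstract does, shows whether a model's errors are mostly missed cases or mostly false alarms, which a single accuracy figure would hide.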


1995 ◽  
Vol 31 (2) ◽  
pp. 193-204 ◽  
Author(s):  
Koen Grijspeerdt ◽  
Peter Vanrolleghem ◽  
Willy Verstraete

A comparative study of several recently proposed one-dimensional sedimentation models has been made. This has been achieved by fitting these models to steady-state and dynamic concentration profiles obtained in a down-scaled secondary decanter. The models were evaluated with several a posteriori model selection criteria. Since the purpose of the modelling task is to do on-line simulations, the calculation time was used as one of the selection criteria. Finally, the practical identifiability of the models for the available data sets was also investigated. It could be concluded that the model of Takács et al. (1991) gave the most reliable results.


Author(s):  
Ritu Khandelwal ◽  
Hemlata Goyal ◽  
Rajveer Singh Shekhawat

Introduction: Machine learning is a technology that works as a bridge between businesses and data science. With the involvement of data science, the business goal is to obtain valuable insights from the available data. Bollywood, a multi-million dollar industry, makes up a large part of Indian cinema. This paper attempts to predict whether an upcoming Bollywood movie will be a Blockbuster, Superhit, Hit, Average or Flop. For this, machine learning techniques (classification and prediction) will be applied. The first step in building a classifier or prediction model is the learning stage, in which a training data set is used to train the model with some technique or algorithm. Rules are then generated that help to build a model and predict future trends in different types of organizations. Methods: Classification and prediction techniques such as Support Vector Machine (SVM), Random Forest, Decision Tree, Naïve Bayes, Logistic Regression, AdaBoost and KNN will be applied in order to find efficient and effective results. All of these can be applied with GUI-based workflows, which are available in categories such as Data, Visualize, Model and Evaluate. Result: The first step in building a classifier or prediction model is the learning stage, in which a training data set is used to train the model with some technique or algorithm. Rules are then generated that help to build a model and predict future trends in different types of organizations. Conclusion: This paper focuses on a comparative analysis based on different parameters, such as accuracy and the confusion matrix, to identify the best possible model for predicting movie success. Using advertisement propaganda, production houses can plan the best time to release a movie according to the predicted success rate in order to gain higher benefits.
Discussion: Data mining is the process of discovering patterns in large data sets. From these patterns, relationships are discovered that help to solve business problems and to predict forthcoming trends. This prediction can help production houses with advertisement propaganda and with planning their costs, and by taking these factors into account they can make a movie more profitable.

