A practical study of CITES wood species identification by untargeted DART/QTOF, GC/QTOF and LC/QTOF together with machine learning processes and statistical analysis

Abstract Background: The identification of tropical African wood species based on microscopic imagery is a challenging problem due to the heterogeneous composition of wood combined with the vast number of candidate species. Image classification methods that rely on machine learning can facilitate this identification, provided that sufficient training material is available. Although all three main anatomical sections contain information that is relevant for species identification, current methods rely only on the transversal section. Additionally, commonly used procedures for evaluating the performance of these methods neglect the fact that multiple images often originate from the same tree, leading to an overly optimistic estimate of the performance. Results: We introduce a new image dataset containing microscopic images of the three main anatomical sections of 77 Congolese wood species. A dedicated multiview image classification method is developed and obtains an accuracy (computed using the naive but common approach) of 95%, outperforming the singleview methods by a large margin. An in-depth analysis shows that naive accuracy estimates can lead to a dramatic overestimation of the accuracy, of up to 60%. Conclusions: Additional images from the non-transversal sections can boost the performance of machine-learning-based wood species identification methods. Additionally, care should be taken when evaluating the performance of machine-learning-based wood species identification methods to avoid an overestimation of the performance.
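
The evaluation pitfall described in this abstract (images of the same tree appearing in both training and test data) can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `grouped_train_test_split`, the tree labels, and the test fraction are all hypothetical, chosen only to show how a split can keep every tree entirely on one side.

```python
import random
from collections import defaultdict


def grouped_train_test_split(tree_ids, test_fraction=0.25, seed=0):
    """Split sample indices so that all images of one tree land on the same side.

    tree_ids[i] is the (hypothetical) identifier of the tree that image i came from.
    """
    # Collect the image indices belonging to each tree.
    by_tree = defaultdict(list)
    for idx, tree in enumerate(tree_ids):
        by_tree[tree].append(idx)

    # Shuffle the trees (not the images) and hold out a fraction of them.
    trees = sorted(by_tree)
    random.Random(seed).shuffle(trees)
    n_test = max(1, round(len(trees) * test_fraction))
    held_out = set(trees[:n_test])

    train = [i for t in trees if t not in held_out for i in by_tree[t]]
    test = [i for t in trees if t in held_out for i in by_tree[t]]
    return sorted(train), sorted(test)


# Toy example: eight images taken from four trees, two images per tree.
tree_ids = ["t1", "t1", "t2", "t2", "t3", "t3", "t4", "t4"]
train_idx, test_idx = grouped_train_test_split(tree_ids)
train_trees = {tree_ids[i] for i in train_idx}
test_trees = {tree_ids[i] for i in test_idx}
```

A naive split would shuffle individual images instead, so near-duplicate images of one tree could end up on both sides and inflate the measured accuracy; splitting at the tree level avoids that leakage.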

A Survey on Bias in Deep NLP

Applied Sciences ◽

10.3390/app11073184 ◽

2021 ◽

Vol 11 (7) ◽

pp. 3184

Author(s):

Ismael Garrido-Muñoz ◽

Arturo Montejo-Ráez ◽

Fernando Martínez-Santiago ◽

L. Alfonso Ureña-López

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Natural Language Processing ◽

Probability Distribution ◽

Natural Language ◽

Network Design ◽

Language Processing ◽

Deep Neural Networks ◽

Learning Processes ◽

Relevant Issue

Deep neural networks are the dominant approach in many machine learning areas, including natural language processing (NLP). Thanks to the availability of large corpora and the capability of deep architectures to shape internal language mechanisms in self-supervised learning processes (also known as “pre-training”), versatile, high-performing models are released continuously for every new network design. These networks learn a probability distribution of words and relations across the training collection used, inheriting the potential flaws, inconsistencies and biases contained in that collection. As pre-trained models have been found to be very useful for transfer learning, dealing with bias has become a relevant issue in this new scenario. We introduce bias in a formal way and explore how it has been treated in several networks, in terms of detection and correction. In addition, available resources are identified and a strategy to deal with bias in deep NLP is proposed.

Stiffness and Strength of Stabilized Organic Soils—Part I/II: Experimental Database and Statistical Description for Machine Learning Modelling

Geosciences ◽

10.3390/geosciences11060243 ◽

2021 ◽

Vol 11 (6) ◽

pp. 243

Author(s):

Hernandez-Martinez Francisco G. ◽

Al-Tabbaa Abir ◽

Medina-Cetina Zenon ◽

Yousefpour Negin

Keyword(s):

Machine Learning ◽

Statistical Analysis ◽

Portland Cement ◽

Soil Stabilization ◽

Data Availability ◽

Data Repository ◽

Organic Soils ◽

Experimental Database ◽

Soil Mixing ◽

The Impact

This paper presents the experimental database and corresponding statistical analysis (Part I), which serves as a basis to perform the corresponding parametric analysis and machine learning modelling (Part II) of a comprehensive study on organic soil strength and stiffness, stabilized via the wet soil mixing method. The experimental database includes unconfined compression tests performed under laboratory-controlled conditions to investigate the impact of soil type, the soil’s organic content, the soil’s initial natural water content, binder type, binder quantity, grout to soil ratio, water to binder ratio, curing time, temperature, curing relative humidity and carbon dioxide content on the stabilized organic specimens’ stiffness and strength. A descriptive statistical analysis complements the description of the experimental database, along with a qualitative study on the stabilization hydration process via scanning electron microscopy images. Results confirmed findings on the use of Portland cement alone and a mix of Portland cement with ground granulated blast furnace slag as suitable binders for soil stabilization. Findings on mixes including lime and magnesium oxide cements demonstrated minimal stabilization. Specimen size affected stiffness, but not strength, for mixes of peat and Portland cement. The experimental database, along with all produced data analyses, is available at the Texas Data Repository as indicated in the Data Availability Statement below, to allow for data reproducibility and promote the use of competing artificial intelligence and machine learning modelling techniques such as those presented in Part II of this paper.

Finding Homogeneous Climate Zones in Bangladesh From Statistical Analysis of Climate Data Using Machine Learning Technique

2020 23rd International Conference on Computer and Information Technology (ICCIT) ◽

10.1109/iccit51783.2020.9392689 ◽

2020 ◽

Author(s):

Faisal Bin Ashraf ◽

Md Rayhan Kabir ◽

Md Shafiur Raihan Shafi ◽

Jubair Ibn Malik Rifat

Keyword(s):

Machine Learning ◽

Statistical Analysis ◽

Climate Data ◽

Machine Learning Technique ◽

Climate Zones ◽

Learning Technique

Advancement of Statistical Analysis, Machine Learning and Decision Analysis Based on the Fourteenth ICMSEM Proceedings

Proceedings of the Fourteenth International Conference on Management Science and Engineering Management - Advances in Intelligent Systems and Computing ◽

10.1007/978-3-030-49829-0_1 ◽

2020 ◽

pp. 1-9

Author(s):

Jiuping Xu

Keyword(s):

Machine Learning ◽

Statistical Analysis ◽

Decision Analysis

Front Cover: Statistical Analysis and Discovery of Heterogeneous Catalysts Based on Machine Learning from Diverse Published Data (ChemCatChem 18/2019)

ChemCatChem ◽

10.1002/cctc.201901455 ◽

2019 ◽

Vol 11 (18) ◽

pp. 4443-4443

Author(s):

Keisuke Suzuki ◽

Takashi Toyao ◽

Zen Maeno ◽

Satoru Takakusagi ◽

Ken‐ichi Shimizu ◽

...

Keyword(s):

Machine Learning ◽

Statistical Analysis ◽

Heterogeneous Catalysts ◽

Published Data ◽

Front Cover

Rapid identification of wood species using XRF and neural network machine learning

Scientific Reports ◽

10.1038/s41598-021-96850-2 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Aaron N. Shugar ◽

B. Lee Drake ◽

Greg Kelley

Keyword(s):

Neural Network ◽

Machine Learning ◽

Convolutional Neural Network ◽

Wood Species ◽

Cost Effective ◽

Rapid Identification ◽

Invasive Technique ◽

X Ray ◽

Non Invasive ◽

Cost Effective Alternative

Abstract: An innovative approach for the rapid identification of wood species is presented. By combining X-ray fluorescence spectrometry with convolutional neural network machine learning, 48 different wood specimens were clearly differentiated and identified with 99% accuracy. Wood species identification is imperative to assess illegally logged and transported lumber. Alternative options for identification can be time consuming and require some level of sampling. This non-invasive technique offers a viable, cost-effective alternative to rapidly and accurately identify timber in efforts to support environmental protection laws and regulations.

Evaluation of Biomarkers in Critical Care and Perioperative Medicine

Anesthesiology ◽

10.1097/aln.0000000000003600 ◽

2020 ◽

Vol 134 (1) ◽

pp. 15-25

Author(s):

Sabri Soussi ◽

Gary S. Collins ◽

Peter Jüni ◽

Alexandre Mebazaa ◽

Etienne Gayat ◽

...

Keyword(s):

Machine Learning ◽

Critical Care ◽

Statistical Analysis ◽

Research Methods ◽

Statistical Methods ◽

Perioperative Medicine ◽

Scientific Rigor ◽

Starting Point ◽

Novel Biomarkers

SUMMARY Interest in developing and using novel biomarkers in critical care and perioperative medicine is increasing. Biomarker studies are often presented with flaws in the statistical analysis that preclude them from providing a scientifically valid and clinically relevant message for clinicians. To improve scientific rigor, the proper application and reporting of traditional and emerging statistical methods (e.g., machine learning) in biomarker studies is required. This Readers’ Toolbox article aims to be a starting point for nonexpert readers and investigators to understand traditional and emerging research methods to assess biomarkers in critical care and perioperative medicine.

Techniques and Methods That Help to Make Big Data the Simplest Recipe for Success

Big Data Analytics for Entrepreneurial Success - Advances in Business Information Systems and Analytics ◽

10.4018/978-1-5225-7609-9.ch006 ◽

2019 ◽

pp. 161-194

Keyword(s):

Machine Learning ◽

Big Data ◽

Data Analysis ◽

Statistical Analysis ◽

Data Analytics ◽

Big Data Analysis ◽

Customer Segmentation ◽

Learning Context ◽

Feature Vectors

Data analytics has grown in a machine learning context. Whatever the purpose for which data is used or exploited, whether customer segmentation or marketing targeting, it must first be processed and represented as feature vectors. Many algorithms, such as clustering, regression, classification, and others, need to be presented and clarified in order to facilitate processing and statistical analysis. While the previous chapters showed the importance of big data analysis (the Why?), as with every major innovation, the biggest confusion lies in its exact scope (What?) and its implementation (How?). In this chapter, we will take a look at the different analytics algorithms and techniques that we can use to exploit large amounts of data.

A Statistical Analysis of Risk Factors and Biological Behavior in Canine Mammary Tumors: A Multicenter Study

Animals ◽

10.3390/ani10091687 ◽

2020 ◽

Vol 10 (9) ◽

pp. 1687

Author(s):

Giovanni P. Burrai ◽

Andrea Gabrieli ◽

Valentina Moccia ◽

Valentina Zappulli ◽

Ilaria Porcellato ◽

...

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Statistical Analysis ◽

Multicenter Study ◽

Malignant Tumors ◽

Mammary Tumors ◽

Supervised Machine Learning ◽

Biological Behavior ◽

Clinical Staging ◽

Canine Mammary Tumors

Canine mammary tumors (CMTs) represent a serious issue in worldwide veterinary practice and several risk factors are variably implicated in the biology of CMTs. The present study examines the relationship between risk factors and histological diagnosis of a large CMT dataset from three academic institutions by classical statistical analysis and supervised machine learning methods. Epidemiological, clinical, and histopathological data of 1866 CMTs were included. Dogs with malignant tumors were significantly older than dogs with benign tumors (9.6 versus 8.7 years, p < 0.001). Malignant tumors were significantly larger than benign counterparts (2.69 versus 1.7 cm, p < 0.001). Interestingly, 18% of malignant tumors were smaller than 1 cm in diameter, providing compelling evidence that the size of the tumor should be reconsidered during the assessment of the TNM-WHO clinical staging. The application of the logistic regression and the machine learning model identified the age and the tumor’s size as the best predictors with an overall diagnostic accuracy of 0.63, suggesting that these risk factors are sufficient but not exhaustive indicators of the malignancy of CMTs. This multicenter study increases the general knowledge of the main epidemiological-clinical risk factors involved in the onset of CMTs and paves the way for further investigations of these factors in association with CMTs and in the application of machine learning technology.
