Automated classification of fauna in seabed photographs: the impact of training and validation dataset size, with considerations for the class imbalance

2021, pp. 102612
Author(s): Jennifer M. Durden, Brett Hosking, Brian J. Bett, Danelle Cline, Henry A. Ruhl

2021, Vol 21 (S2)
Author(s): Kun Zeng, Yibin Xu, Ge Lin, Likeng Liang, Tianyong Hao

Abstract

Background: Eligibility criteria are the primary means of screening target participants for a clinical trial. Automated classification of eligibility criteria text using machine learning methods improves recruitment efficiency and reduces the cost of clinical research. However, existing methods classify poorly because eligibility criteria text data are complex and imbalanced.

Methods: An ensemble learning-based model with metric learning is proposed for eligibility criteria classification. The model integrates a set of pre-trained models: Bidirectional Encoder Representations from Transformers (BERT), A Robustly Optimized BERT Pretraining Approach (RoBERTa), XLNet, Pre-training Text Encoders as Discriminators Rather Than Generators (ELECTRA), and Enhanced Representation through Knowledge Integration (ERNIE). Focal Loss is used as the loss function to address the data imbalance problem. Metric learning is employed to train the embedding of each base model so that features of different classes are easier to distinguish. Soft Voting is applied to obtain the final classification of the ensemble model. The dataset comes from standard evaluation task 3 of the 5th China Health Information Processing Conference and contains 38,341 eligibility criteria texts in 44 categories.

Results: Our ensemble method had an accuracy of 0.8497, a precision of 0.8229, and a recall of 0.8216 on the dataset. The macro F1-score was 0.8169, outperforming state-of-the-art baseline methods by 0.84% on average. In addition, the performance improvement had a p-value of 2.152e-07 with a standard t-test, indicating that our model achieved a significant improvement.

Conclusions: A model for classifying eligibility criteria text of clinical trials based on multi-model ensemble learning and metric learning was proposed. The experiments demonstrated that our ensemble model significantly improved classification performance. In addition, metric learning improved the word embedding representation, and the focal loss reduced the impact of data imbalance on model performance.
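The two mechanisms named in the Methods, Focal Loss and Soft Voting, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `focal_loss` and `soft_vote` are hypothetical, and the values `gamma=2.0` and `alpha=1.0` are assumed defaults, since the abstract does not report the settings used.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example: -alpha * (1 - p_t)**gamma * log(p_t).

    probs  : predicted class probabilities for the example
    target : index of the true class
    gamma  : focusing parameter (assumed value; gamma=0 gives cross-entropy)
    alpha  : weighting factor (assumed value)
    """
    # Clamp to avoid log(0) when the model assigns zero probability.
    p_t = min(max(probs[target], 1e-12), 1.0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def soft_vote(model_probs):
    """Average class probabilities over models; return (label, averaged probs)."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    avg = [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: avg[c])
    return label, avg


if __name__ == "__main__":
    # A well-classified example contributes far less loss than a hard one.
    print(focal_loss([0.9, 0.1], target=0))
    print(focal_loss([0.1, 0.9], target=0))
    # Three hypothetical base models voting on a two-class example.
    print(soft_vote([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]))
```

Down-weighting easy examples through the `(1 - p_t)**gamma` factor is what lets the loss concentrate on hard, often minority-class, examples. Averaging probabilities rather than counting hard votes lets a confident base model outweigh several uncertain ones.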


2018, Vol 7 (2.7), pp. 786
Author(s): T Sajana, M R. Narasingarao

Malaria is rampant in semi-urban and non-urban areas, especially in resource-poor developing countries. Datasets for diseases such as malaria and dengue typically contain more negative patients (non-occurrence of the disease) than positive cases (patients suffering from the disease). Developing a model-based decision support system with such unbalanced datasets is a cause for concern, yet a model that predicts the disease accurately is necessary. Classifying imbalanced malaria data has become a crucial task in the medical application domain because most conventional machine learning algorithms perform very poorly at determining whether a patient is affected by malaria. In imbalanced data, majority (unaffected) class samples dominate minority (affected) class samples, which leads to class imbalance. To overcome the class imbalance problem, balancing the data samples is the best solution, as it yields better accuracy in classifying minority samples. The aim of this research is to present a comparative study of classifying imbalanced malaria data using the Naive Bayesian classifier in two environments, Weka and the R language. We present a clinical descriptive study of 165 patients of different age groups, collected at medical wards of Narasaraopet from 2014-17. The Synthetic Minority Oversampling Technique (SMOTE) was used to balance the class distribution, and we then performed a comparative study on the dataset using the Naive Bayesian algorithm on both platforms. Of the balanced data, 70% was used to train the Naive Bayesian algorithm and the rest was used to test the model in both the Weka and R environments. Experimental results indicate that classification of the malaria data in the Weka environment achieved the highest accuracy, 88.5%, compared with 87.5% for the Naive Bayesian algorithm in R. The impact of vector-borne disease is very high in medical applications. Prediction of diseases like malaria is the need of the hour, and it is possible only with a suitable model for a given dataset. Hence, we developed a model using the Naive Bayesian algorithm for the current research.
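The balancing and splitting steps described above can be sketched as follows. The study itself used Weka and R, so this Python version is only an illustration of the idea behind SMOTE (interpolating between a minority sample and one of its nearest minority neighbours), not the authors' pipeline. The names `smote_like` and `train_test_split`, the neighbour count `k=3`, and the toy data are all assumptions.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def train_test_split(rows, train_frac=0.7, seed=0):
    """Shuffle rows and split them, with train_frac of the rows for training."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Toy minority-class feature vectors (hypothetical, not the study's data).
    minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(smote_like(minority, n_new=3))
    train, test = train_test_split(list(range(10)))
    print(len(train), len(test))
```

Because every synthetic point is an interpolation between two real minority samples, the oversampled class stays within the region the minority data already occupies, which is the main difference from simply duplicating rows.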


2019, pp. 27-35
Author(s): Alexandr Neznamov

Digital technologies are no longer the future but the present of civil proceedings. That is why any research in this direction seems relevant. At the same time, some fundamental problems remain unaddressed by the scientific community. One of these is the classification of digital technologies in civil proceedings. On the basis of instrumental and genetic approaches to the understanding of digital technologies, it is concluded that their most significant feature is the ability to mediate the interaction of participants in legal proceedings with information; their differentiating feature is the function performed by a particular technology in the interaction with information. On this basis, it is proposed to distinguish the following groups of digital technologies in civil proceedings: a) technologies for recording, storing and displaying (reproducing) information, b) technologies for transferring information, c) technologies for processing information. A brief description of each group is given. The presented classification could serve as a basis for a more systematic discussion of the impact of digital technologies on the essence of civil proceedings. In particular, it is pointed out that issues of recording, storing, reproducing and transferring information are traditionally more "technological" for civil procedure, while issues of information processing are more conceptual.

