Handwritten Gurmukhi Digit Recognition System for Small Datasets

Gurpartap Singh; Sunil Agrawal; Balwinder Singh Sohi

doi:10.18280/ts.370416

Traitement du signal ◽

2020 ◽

Vol 37 (4) ◽

pp. 661-669

Author(s):

Gurpartap Singh ◽

Sunil Agrawal ◽

Balwinder Singh Sohi

Keyword(s):

Recognition Accuracy ◽

Recognition System ◽

Gradient Boosting ◽

Support Vector ◽

Discrete Wavelet ◽

Testing Time ◽

Training Time ◽

Digit Recognition ◽

Extreme Gradient Boosting ◽

The Impact

This study proposes a method to increase the recognition accuracy of handwritten Gurmukhi (an Indian regional script) digits. The methodology uses a DCNN (Deep Convolutional Neural Network) cascaded with an XGBoost (Extreme Gradient Boosting) algorithm. A comprehensive analysis is also carried out to understand the impact of the DCNN kernel size on recognition accuracy. A DCNN is used because of its impressive recognition accuracy on handwritten digits; however, to achieve good accuracy it requires a large amount of data and significant training and testing time. To increase the accuracy of the DCNN on a small dataset, additional images are generated by applying a shear transformation (a transformation that preserves parallelism but not lengths and angles) to the original images. To address the long training time, only two hidden layers are used, together with selective cascading of XGBoost among the misclassified digits. The issue of overfitting is also discussed in detail and is reduced to a great extent. Finally, the results are compared with the performance of some recent techniques, namely SVM (Support Vector Machine), Random Forest, and XGBoost classifiers on DCT (Discrete Cosine Transform) and DWT (Discrete Wavelet Transform) features obtained on the same dataset. The proposed methodology is found to outperform the other techniques in terms of overall recognition rate.
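The shear-based augmentation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `shear_image` and `augment`, the nearest-neighbour sampling, the centring on the middle row, and the shear factors are all assumptions chosen for illustration.

```python
def shear_image(img, k, fill=0):
    """Horizontally shear a 2-D image (a list of rows) by factor k.

    Output pixel (x, y) is read from source column x - k * (y - cy),
    where cy is the vertical centre, using nearest-neighbour sampling.
    Pixels that map outside the image are set to `fill`.
    """
    h, w = len(img), len(img[0])
    cy = (h - 1) / 2.0
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        shift = k * (y - cy)  # rows further from the centre move further
        for x in range(w):
            src = int(round(x - shift))
            if 0 <= src < w:
                out[y][x] = img[y][src]
    return out


def augment(img, factors=(-0.3, -0.15, 0.15, 0.3)):
    """Return one sheared copy of img per shear factor (factors are illustrative)."""
    return [shear_image(img, k) for k in factors]
```

A vertical stroke stays straight but becomes slanted under this mapping, which is why shear is a plausible stand-in for natural variation in handwriting slant.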

Explainable machine learning can outperform Cox regression predictions and provide insights in breast cancer survival

Scientific Reports ◽

10.1038/s41598-021-86327-7 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Arturo Moncada-Torres ◽

Marissa C. van Maaren ◽

Mathijs P. Hendriks ◽

Sabine Siesling ◽

Gijs Geleijnse

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Explicit Knowledge ◽

Cox Regression ◽

Metastatic Breast ◽

Gradient Boosting ◽

Support Vector ◽

Netherlands Cancer Registry ◽

Extreme Gradient Boosting ◽

The Impact

Abstract: Cox Proportional Hazards (CPH) analysis is the standard for survival analysis in oncology. Recently, several machine learning (ML) techniques have been adapted for this task. Although they have been shown to yield results at least as good as classical methods, they are often disregarded because of their lack of transparency and little to no explainability, which are key for their adoption in clinical settings. In this paper, we used data from the Netherlands Cancer Registry of 36,658 non-metastatic breast cancer patients to compare the performance of CPH with ML techniques (Random Survival Forests, Survival Support Vector Machines, and Extreme Gradient Boosting [XGB]) in predicting survival using the c-index. We demonstrated that in our dataset, ML-based models can perform at least as well as the classical CPH regression (c-index ~0.63), and in the case of XGB even better (c-index ~0.73). Furthermore, we used Shapley Additive Explanation (SHAP) values to explain the models’ predictions. We concluded that the difference in performance can be attributed to XGB’s ability to model nonlinearities and complex interactions. We also investigated the impact of specific features on the models’ predictions as well as their corresponding insights. Lastly, we showed that explainable ML can generate explicit knowledge of how models make their predictions, which is crucial in increasing the trust and adoption of innovative ML techniques in oncology and healthcare overall.
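For readers unfamiliar with the c-index used as the evaluation metric above, the following is a minimal sketch of Harrell's concordance index. It is an illustrative O(n²) implementation, not the one used in the paper; the function name and the convention of counting tied risk scores as half-concordant are assumptions.

```python
def concordance_index(times, events, risks):
    """Harrell's c-index for right-censored survival data.

    times:  observed follow-up time per subject
    events: 1 if the event was observed, 0 if the subject was censored
    risks:  predicted risk score (higher means earlier expected event)

    A pair (i, j) is comparable when subject i had an observed event
    strictly before time j. It is concordant when risk i > risk j;
    tied risks count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported values of roughly 0.63 and 0.73 in context.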

Exploring the Mechanism of Crashes with Autonomous Vehicles Using Machine Learning

Mathematical Problems in Engineering ◽

10.1155/2021/5524356 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Hengrui Chen ◽

Hong Chen ◽

Ruiyu Zhou ◽

Zhizhen Liu ◽

Xiaoke Sun

Keyword(s):

Machine Learning ◽

Autonomous Vehicles ◽

Classification And Regression Tree ◽

Gradient Boosting ◽

Support Vector ◽

Crash Severity ◽

Apriori Algorithm ◽

Driving Mode ◽

Extreme Gradient Boosting ◽

The Impact

The safety issue has become a critical obstacle that cannot be ignored in the marketization of autonomous vehicles (AVs). The objective of this study is to explore the mechanism of AV-involved crashes and analyze the impact of each feature on crash severity. We use the Apriori algorithm to explore the causal relationship between multiple factors to explore the mechanism of crashes. We use various machine learning models, including support vector machine (SVM), classification and regression tree (CART), and eXtreme Gradient Boosting (XGBoost), to analyze the crash severity. Besides, we apply the Shapley Additive Explanations (SHAP) to interpret the importance of each factor. The results indicate that XGBoost obtains the best result (recall = 75%; G-mean = 67.82%). Both XGBoost and Apriori algorithm effectively provided meaningful insights about AV-involved crash characteristics and their relationship. Among all these features, vehicle damage, weather conditions, accident location, and driving mode are the most critical features. We found that most rear-end crashes are conventional vehicles bumping into the rear of AVs. Drivers should be extremely cautious when driving in fog, snow, and insufficient light. Besides, drivers should be careful when driving near intersections, especially in the autonomous driving mode.
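The G-mean reported above is commonly defined, for a binary problem, as the geometric mean of recall (sensitivity) and specificity. The sketch below assumes that binary definition; the abstract does not state how the metric was computed for the crash-severity classes, and the function names are hypothetical.

```python
import math


def recall(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that were predicted negative."""
    return tn / (tn + fp)


def g_mean(tp, fn, tn, fp):
    """Geometric mean of recall and specificity (binary definition, assumed)."""
    return math.sqrt(recall(tp, fn) * specificity(tn, fp))
```

Because it multiplies the two rates, the G-mean drops sharply when either class is poorly recognised, which makes it a common choice for imbalanced data such as crash records.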

Predicting Suitable Habitats of Melia Azedarach L. Using Data Mining

10.21203/rs.3.rs-1004808/v1 ◽

2021 ◽

Author(s):

Lei Feng ◽

Xiangni Tian ◽

Yousry A. El-Kassaby ◽

Jian Qiu ◽

Ze Feng ◽

...

Keyword(s):

Data Mining ◽

Species Distribution ◽

Mean Annual Precipitation ◽

Gradient Boosting ◽

Melia Azedarach ◽

Support Vector ◽

Suitable Habitat ◽

Degree Days ◽

Extreme Gradient Boosting ◽

The Impact

Abstract Background: Melia azedarach L. is a globally distributed tree species of economic importance; however, it is unclear how the species distribution will respond to future climate changes. Methods: We aimed to select the most accurate one among seven data mining models to predict the species' suitable contemporary and future habitats. These models include: maximum entropy (MaxEnt), support vector machine (SVM), generalized linear model (GLM), random forest (RF), naive Bayesian model (NBM), extreme gradient boosting (XGBoost), and gradient boosting machine (GBM). A total of 906 M. azedarach locations were identified, and sixteen climate predictors were used for model building. The models’ validity was assessed using three measures (Area Under the Curve (AUC), kappa, and accuracy). Results: We found that the RF provided the most outstanding performance in prediction power and generalization capacity. The top climate factors affecting the species distribution were mean coldest month temperature (MCMT), followed by the number of frost-free days (NFFD), degree-days above 18°C (DD>18), temperature difference between MWMT and MCMT, or continentality (TD), mean annual precipitation (MAP), and degree-days below 18°C (DD<18). We projected that future suitable habitat of this species would increase under both the RCP4.5 and RCP8.5 scenarios for the 2020s, 2050s, and 2080s. Conclusion: Our findings are expected to assist in better understanding the impact of climate change on the species and provide a scientific basis for its planting and conservation.

Improvement of Prediction Performance With Conjoint Molecular Fingerprint in Deep Learning

Frontiers in Pharmacology ◽

10.3389/fphar.2020.606668 ◽

2020 ◽

Vol 11 ◽

Author(s):

Liangxu Xie ◽

Lei Xu ◽

Ren Kong ◽

Shan Chang ◽

Xiaojun Xu

Keyword(s):

Deep Learning ◽

Short Term Memory ◽

Molecular Descriptor ◽

Predictive Performance ◽

Gradient Boosting ◽

Support Vector ◽

Quantitative Structure ◽

Structure Activity ◽

Extreme Gradient Boosting ◽

The Impact

The accurate prediction of the physical properties and bioactivity of drug molecules in deep learning depends on how molecules are represented. Many types of molecular descriptors have been developed for quantitative structure-activity/property relationships (QSAR/QSPR). However, each molecular descriptor is optimized for a specific application with encoding preference. Considering that standalone featurization methods may only cover parts of the information of the chemical molecules, we proposed to build the conjoint fingerprint by combining two supplementary fingerprints. The impact of the conjoint fingerprint and each standalone fingerprint on predictive performance was systematically evaluated in predicting the logarithm of the partition coefficient (logP) and the binding affinity of protein-ligand complexes by using machine learning/deep learning (ML/DL) methods, including random forest (RF), support vector regression (SVR), extreme gradient boosting (XGBoost), long short-term memory network (LSTM), and deep neural network (DNN). The results demonstrated that the conjoint fingerprint yielded improved predictive performance, even outperforming the consensus model using two standalone fingerprints in four out of five examined methods. Given that the conjoint fingerprint scheme shows easy extensibility and high applicability, we expect that the proposed conjoint scheme would create new opportunities for continuously improving predictive performance of deep learning by harnessing the complementarity of various types of fingerprints.

OFFLINE YORÙBÁ HANDWRITTEN WORD RECOGNITION USING GEOMETRIC FEATURE EXTRACTION AND SUPPORT VECTOR MACHINE CLASSIFIER

MALAYSIAN JOURNAL OF COMPUTING ◽

10.24191/mjoc.v5i2.8947 ◽

2020 ◽

Vol 5 (2) ◽

pp. 504

Author(s):

Matthias Omotayo Oladele ◽

Temilola Morufat Adepoju ◽

Olaide Abiodun Olatoke ◽

Oluwaseun Adewale Ojo

Keyword(s):

Support Vector Machine ◽

Feature Extraction ◽

Word Recognition ◽

Support Vector Machine Classifier ◽

Recognition Accuracy ◽

Recognition System ◽

Support Vector ◽

Geometric Features ◽

Total Length ◽

Yoruba Language

Yorùbá is one of the three main languages spoken in Nigeria. It is a tonal language that carries accents on its vowels. The Yorùbá alphabet has twenty-five (25) letters, one of which is a digraph (GB). Because typing handwritten Yorùbá documents is difficult, there is a need for a handwriting recognition system that can convert handwritten text to digital format. This study discusses an offline Yorùbá handwritten word recognition system (OYHWR) that recognizes Yorùbá uppercase letters. Handwritten characters and words were obtained from different writers using the Paint application and M708 graphics tablets. The characters were used for training and the words were used for testing. The images were pre-processed, and their geometric features were extracted using zoning and gradient-based feature extraction. Geometric features are the different line types that form a particular character, such as vertical, horizontal, and diagonal lines. The geometric features used are the number of horizontal lines, number of vertical lines, number of right diagonal lines, number of left diagonal lines, total length of all horizontal lines, total length of all vertical lines, total length of all right-slanting lines, total length of all left-slanting lines, and the area of the skeleton. The characters are divided into 9 zones, and gradient feature extraction was used to extract the horizontal and vertical components and the geometric features in each zone. The words were fed into the support vector machine classifier and the performance was evaluated based on recognition accuracy. The support vector machine is a two-class classifier; hence a multiclass SVM classifier, the least squares support vector machine (LSSVM), was used for word recognition. The one-vs-one strategy and an RBF kernel were used, and the recognition accuracies obtained for the tested words were 66.7%, 83.3%, 85.7%, 87.5%, and 100%. The low recognition rate for some of the words could be a result of the similarity in the extracted features.
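The zoning step described above (dividing each character into 9 zones and computing features per zone) can be sketched as follows. This is only an illustration of zoning: it computes the foreground pixel density of each zone, which is a simpler stand-in for the gradient and line-based features the study extracts, and the function name and the 3 x 3 grid layout are assumptions.

```python
def zone_densities(img, rows=3, cols=3):
    """Split a binary image (list of rows of 0/1) into rows x cols zones.

    Returns one feature per zone, in row-major order: the fraction of
    foreground (non-zero) pixels in that zone.
    """
    h, w = len(img), len(img[0])
    feats = []
    for r in range(rows):
        y0, y1 = r * h // rows, (r + 1) * h // rows
        for c in range(cols):
            x0, x1 = c * w // cols, (c + 1) * w // cols
            cells = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            if cells:
                feats.append(sum(1 for v in cells if v) / len(cells))
            else:
                feats.append(0.0)  # zone is empty when the image is tiny
    return feats
```

Any per-zone feature (for example a gradient histogram) could replace the density computation while keeping the same zone boundaries.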

Machine learning models to identify low adherence to influenza vaccination among Korean adults with cardiovascular disease

BMC Cardiovascular Disorders ◽

10.1186/s12872-021-01925-7 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Moojung Kim ◽

Young Jae Kim ◽

Sung Jin Park ◽

Kwang Gi Kim ◽

Pyung Chun Oh ◽

...

Keyword(s):

Machine Learning ◽

Cardiovascular Disease ◽

Influenza Vaccination ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Support Vector ◽

Age Group ◽

Learning Models ◽

Extreme Gradient Boosting ◽

Machine Learning Models

Abstract Background: Annual influenza vaccination is an important public health measure to prevent influenza infections and is strongly recommended for cardiovascular disease (CVD) patients, especially in the current coronavirus disease 2019 (COVID-19) pandemic. The aim of this study is to develop a machine learning model to identify Korean adult CVD patients with low adherence to influenza vaccination. Methods: Adults with CVD (n = 815) from a nationally representative dataset of the Fifth Korea National Health and Nutrition Examination Survey (KNHANES V) were analyzed. Among these adults, 500 (61.4%) had answered "yes" to whether they had received seasonal influenza vaccinations in the past 12 months. The classification process was performed using the logistic regression (LR), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGB) machine learning techniques. Because the Ministry of Health and Welfare in Korea offers free influenza immunization for the elderly, separate models were developed for the < 65 and ≥ 65 age groups. Results: The accuracy of machine learning models using 16 variables as predictors of low influenza vaccination adherence was compared; for the ≥ 65 age group, XGB (84.7%) and RF (84.7%) have the best accuracies, followed by LR (82.7%) and SVM (77.6%). For the < 65 age group, SVM has the best accuracy (68.4%), followed by RF (64.9%), LR (63.2%), and XGB (61.4%). Conclusions: The machine learning models show comparable performance in classifying adult CVD patients with low adherence to influenza vaccination.

Interpretable Detection and Location of Myocardial Infarction Based on Ventricular Fusion Rule Features

Journal of Healthcare Engineering ◽

10.1155/2021/4123471 ◽

2021 ◽

Vol 2021 ◽

pp. 1-15

Author(s):

Wenzhi Zhang ◽

Runchuan Li ◽

Shengya Shen ◽

Jinliang Yao ◽

Yan Peng ◽

...

Keyword(s):

Myocardial Infarction ◽

Clinical Decision Making ◽

Human Life ◽

Principal Component ◽

Fusion Rule ◽

Clinical Decision ◽

Gradient Boosting ◽

Discrete Wavelet ◽

Extreme Gradient Boosting ◽

Ventricular Activity

Myocardial infarction (MI) is one of the most common cardiovascular diseases threatening human life. In order to accurately distinguish myocardial infarction and have a good interpretability, the classification method that combines rule features and ventricular activity features is proposed in this paper. Specifically, according to the clinical diagnosis rule and the pathological changes of myocardial infarction on the electrocardiogram, the local information extracted from the Q wave, ST segment, and T wave is computed as the rule feature. All samples of the QT segment are extracted as ventricular activity features. Then, in order to reduce the computational complexity of the ventricular activity features, the effects of Discrete Wavelet Transform (DWT), Principal Component Analysis (PCA), and Locality Preserving Projections (LPP) on the extracted ventricular activity features are compared. Combining rule features and ventricular activity features, all the 12 leads features are fused as the ultimate feature vector. Finally, eXtreme Gradient Boosting (XGBoost) is used to identify myocardial infarction, and the overall accuracy rate of 99.86% is obtained on the Physikalisch-Technische Bundesanstalt (PTB) database. This method has a good medical diagnosis basis while improving the accuracy, which is very important for clinical decision-making.

Establishing a Credit Risk Evaluation System for SMEs Using the Soft Voting Fusion Model

Risks ◽

10.3390/risks9110202 ◽

2021 ◽

Vol 9 (11) ◽

pp. 202

Author(s):

Ge Gao ◽

Hongxin Wang ◽

Pengbin Gao

Keyword(s):

Credit Risk ◽

Evaluation System ◽

Predictive Accuracy ◽

Assessment System ◽

Gradient Boosting ◽

Support Vector ◽

Fusion Model ◽

Light Gradient ◽

Extreme Gradient Boosting ◽

The Government

In China, SMEs are facing financing difficulties, and commercial banks and financial institutions are the main financing channels for SMEs. Thus, a reasonable and efficient credit risk assessment system is important for credit markets. Based on traditional statistical methods and AI technology, a soft voting fusion model, which incorporates logistic regression, support vector machine (SVM), random forest (RF), eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), is constructed to improve the predictive accuracy of SMEs’ credit risk. To verify the feasibility and effectiveness of the proposed model, we use data from 123 SMEs nationwide that worked with a Chinese bank from 2016 to 2020, including financial information and default records. The results show that the accuracy of the soft voting fusion model is higher than that of a single machine learning (ML) algorithm, which provides a theoretical basis for the government to control credit risk in the future and offers important references for banks to make credit decisions.
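Soft voting, the fusion rule named above, averages the class probabilities predicted by each base model and picks the class with the highest average. The sketch below shows the rule for a single sample; the function name, the optional weights, and the example probabilities are assumptions for illustration and do not come from the paper.

```python
def soft_vote(model_probs, weights=None):
    """Combine per-model class probabilities for one sample.

    model_probs: one probability vector per base model, all the same length
    weights:     optional per-model weights (equal weights if omitted)

    Returns (predicted_class_index, averaged_probabilities).
    """
    if weights is None:
        weights = [1.0] * len(model_probs)
    total = float(sum(weights))
    n_classes = len(model_probs[0])
    avg = [
        sum(w * p[c] for w, p in zip(weights, model_probs)) / total
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=avg.__getitem__), avg
```

With the probabilities `[[0.9, 0.1], [0.4, 0.6], [0.4, 0.6]]`, two of three models favour class 1, yet the averaged probabilities favour class 0. This is the practical difference from hard (majority) voting: a confident model can outweigh several uncertain ones.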

Classification of Hot Spots using XGBoost and LightGBM Algorithms

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.e9459.069520 ◽

2020 ◽

Vol 9 (5) ◽

pp. 722-724

Keyword(s):

Computational Methods ◽

Protein Interactions ◽

Hot Spots ◽

Cell Metabolism ◽

Pearson Correlation ◽

Classification Performance ◽

Gradient Boosting ◽

Support Vector ◽

Extreme Gradient Boosting ◽

Hub Proteins

Protein-protein interactions (PPIs) play a significant role in biological functions such as cell metabolism, immune response, and signal transduction. Hot spots are a small fraction of interface residues that contribute substantial binding energy in PPIs. Identifying hot spots is therefore important for discovering and analyzing molecular medicines and diseases. The current experimental strategy, alanine scanning, is not suitable for large-scale applications because the technique is costly and time-consuming. Existing computational methods have poor classification performance and prediction accuracy, and they focus on the topological structure and gene expression of hub proteins. The proposed system focuses on hot spots of hub proteins by eliminating redundant and highly correlated features using the Pearson correlation coefficient and support vector machine based feature elimination. Extreme Gradient Boosting and LightGBM algorithms are used to combine a set of weak classifiers into a strong classifier. The proposed system shows better accuracy than the existing computational methods. The model can also be used to predict accurate molecular inhibitors for specific PPIs.
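The correlation-based part of the feature elimination described above can be sketched as a greedy filter: a feature is kept only if its absolute Pearson correlation with every feature already kept stays below a threshold. The function names, the greedy ordering, and the 0.9 threshold are assumptions; the abstract does not give the paper's exact procedure.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.9):
    """Greedily keep features whose |r| with all kept features is below threshold.

    columns: dict mapping feature name -> list of values (insertion order is
    the order in which features are considered).
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

For example, with features `a = [1, 2, 3, 4]`, `b = [2, 4, 6, 8]`, and `c = [1, -1, 1, -1]`, feature `b` is dropped because it is a scaled copy of `a`, while `c` is retained.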

HyP-ABC: A Novel Automated Hyper-Parameter Tuning Algorithm Using Evolutionary Optimization

10.36227/techrxiv.14714508.v2 ◽

2021 ◽

Author(s):

Leila Zahedi ◽

Farid Ghareh Mohammadi ◽

M. Hadi Amini

Keyword(s):

Parameter Optimization ◽

Real World ◽

Optimization Problems ◽

State Of The Art ◽

Parameter Tuning ◽

Gradient Boosting ◽

Support Vector ◽

Wide Range ◽

Extreme Gradient Boosting ◽

Art Techniques

Machine learning techniques lend themselves as promising decision-making and analytic tools in a wide range of applications. Different ML algorithms have various hyper-parameters. In order to tailor an ML model towards a specific application, a large number of hyper-parameters should be tuned. Tuning the hyper-parameters directly affects the performance (accuracy and run-time). However, for large-scale search spaces, efficiently exploring the ample number of combinations of hyper-parameters is computationally challenging. Existing automated hyper-parameter tuning techniques suffer from high time complexity. In this paper, we propose HyP-ABC, an automatic innovative hybrid hyper-parameter optimization algorithm using the modified artificial bee colony approach, to measure the classification accuracy of three ML algorithms, namely random forest, extreme gradient boosting, and support vector machine. Compared to the state-of-the-art techniques, HyP-ABC is more efficient and has a limited number of parameters to be tuned, making it worthwhile for real-world hyper-parameter optimization problems. We further compare our proposed HyP-ABC algorithm with state-of-the-art techniques. In order to ensure the robustness of the proposed method, the algorithm takes a wide range of feasible hyper-parameter values, and is tested using a real-world educational dataset.
