Examining the Part-of-speech Features in Assessing the Readability of Vietnamese Texts

2020 ◽  
Vol 10 (2) ◽  
pp. 127-142
Author(s):  
An-Vinh Luong ◽  
Diep Nguyen ◽  
Dien Dinh

Text readability plays a very important role in selecting materials appropriate to a reader's level. The readability of Vietnamese text has received considerable attention in recent years; however, studies have mainly been limited to simple statistics such as sentence length and word length. In this article, we investigate the role of word-level grammatical characteristics in assessing the difficulty of texts in Vietnamese textbooks. We used machine learning models (for instance, Decision Tree, K-nearest neighbor, and Support Vector Machines) to evaluate the accuracy of classifying texts by readability, using word-level grammatical features along with other statistical characteristics. Empirical results show that including POS-level characteristics increases classification accuracy by 2-4%.

2021 ◽  
Vol 2021 ◽  
pp. 1-8
Author(s):  
Shaker El-Sappagh ◽  
Tamer Abuhmed ◽  
Bader Alouffi ◽  
Radhya Sahal ◽  
Naglaa Abdelhade ◽  
...  

Early detection of Alzheimer’s disease (AD) progression is crucial for proper disease management. Most studies concentrate on analyzing neuroimaging data from baseline visits only. They ignore the fact that AD is a chronic disease and that patient data are naturally longitudinal. In addition, no studies examine the effect of dementia medicines on the behavior of the disease. In this paper, we propose a machine learning-based architecture for early detection of AD progression based on multimodal data combining AD drugs and cognitive scores. We compare the performance of five popular machine learning techniques (support vector machine, random forest, logistic regression, decision tree, and K-nearest neighbor) in predicting AD progression after 2.5 years. Extensive experiments are performed using an ADNI dataset of 1036 subjects. The cross-validation performance of most algorithms improved when the drug and cognitive score data were fused. The results indicate that the drugs patients take play an important role in the progression of AD.


2019 ◽  
Vol 2019 ◽  
pp. 1-7
Author(s):  
Pengliang Chen ◽  
Pengwei Shi ◽  
Gang Du ◽  
Zhen Zhang ◽  
Liang Liu

Predicting the outcome after a cancer diagnosis is critical. Advances in high-throughput sequencing technologies provide physicians with vast amounts of data, yet prognostication remains challenging because the data are high-dimensional and complex. We evaluated Wnt/β-catenin, carbohydrate metabolism, and PI3K-Akt signaling pathway-related genes as predictive features for classifying tumors and normal samples. Using differentially expressed genes as controls, these pathway-related genes were assessed for accuracy using support-vector machines and three other recommended machine learning models, namely, the random forest, decision tree, and k-nearest neighbor algorithms. The first two outperformed the others. All candidate pathway-related genes yielded areas under the curve exceeding 95.00% for cancer outcomes, and they were most accurate in predicting colorectal cancer. These results suggest that these pathway-related genes are useful and accurate biomarkers for understanding the mechanisms behind cancer development.


2021 ◽  
pp. 179-218
Author(s):  
Magy Seif El-Nasr ◽  
Truong Huy Nguyen Dinh ◽  
Alessandro Canossa ◽  
Anders Drachen

This chapter discusses several classification and regression methods that can be used with game data. Specifically, we will discuss regression methods, including Linear Regression, and classification methods, including K-Nearest Neighbor, Naïve Bayes, Logistic Regression, Linear Discriminant Analysis, Support Vector Machines, Decision Trees, and Random Forests. We will discuss how you can set up the data to apply these algorithms, how you can interpret the results, and the pros and cons of each method. We will conclude the chapter with some remarks on the process of applying these methods to games and the expected outcomes. The chapter also includes practical labs that walk you through applying these methods to real game data.


2020 ◽  
Vol 38 (4) ◽  
pp. 1073-1082
Author(s):  
Florença das Graças MOURA ◽  
Álvaro Xavier FERREIRA ◽  
Tati ALMEIDA ◽  
Jérémie GARNIER ◽  
Rejane Ennes CICERELLI ◽  
...  

Lake Poopó is the second largest lake in Bolivia and is currently undergoing a severe water crisis, which some authors attribute directly to changes in land occupation. In this work, land use and land cover in the lake's P6 sub-basin were classified for the years between 1985 and 2017. The performance of the SVM (Support Vector Machines), KNN (K-Nearest Neighbor), and MaxVer (Maximum Likelihood) classifiers was analyzed. The most accurate classification was produced by the SVM classifier, with Kappa index values of 82.28% and 83.7% for the Landsat-5 and Landsat-8 images, respectively, and an overall accuracy of 92% for both images. The resulting classifications show that the largest changes occurred in the native vegetation, agriculture, and wetland classes. Wetland loss in the sub-basin has been occurring since 1995, 15 years before the increase in agricultural activity, which began in 2010. Several factors may therefore be contributing to this accelerated reduction of water bodies, such as local climatic variations and anthropogenic activities that interfere with the hydrological cycle at a regional scale.
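
The Kappa index and overall accuracy reported for the classifiers are both computed from a confusion matrix. A minimal sketch of that arithmetic follows; the function names and the 3-class matrix are illustrative assumptions and are not data from the study.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def cohen_kappa(cm):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_observed = sum(cm[i][i] for i in range(n)) / total
    row_totals = [sum(cm[i]) for i in range(n)]
    col_totals = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    # Chance agreement from the row and column marginals.
    p_expected = sum(row_totals[k] * col_totals[k] for k in range(n)) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up matrix (rows: reference class, columns: predicted class).
example_cm = [
    [45, 3, 2],
    [4, 40, 6],
    [1, 5, 44],
]

print(round(overall_accuracy(example_cm), 4))
print(round(cohen_kappa(example_cm), 4))
```

Kappa is always at or below the overall accuracy whenever chance agreement is positive, which is consistent with the reported Kappa values sitting below the 92% overall accuracy.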


Author(s):  
Meenakshi Garg ◽  
Manisha Malhotra ◽  
Harpal Singh

This paper presents a multiple-feature extraction and reduction-based approach for Content-Based Image Retrieval (CBIR). Discrete Wavelet Transforms (DWT) on the color channels are used to decompose the image at multiple stages. The Gray Level Co-occurrence Matrix (GLCM) concept is used to extract statistical characteristics for texture image classification. The definition of shared knowledge is used to classify the most common features for all COREL dataset groups. These features are also fed into a feature selector based on particle swarm optimization, which reduces the number of features used during the classification stage. Three classifiers, the Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Decision Tree (DT), are trained and tested; of these, SVM gives high classification accuracy and precision rates. For several of the COREL dataset types, experimental findings have demonstrated above 94 percent precision and precision values of 0.80 to 0.90.


2021 ◽  
Author(s):  
Hoda Heidari ◽  
Zahra Einalou ◽  
Mehrdad Dadgostar ◽  
Hamidreza Hosseinzadeh

Most studies in the field of Brain-Computer Interface (BCI) based on electroencephalography have a wide range of applications. Extracting the Steady State Visual Evoked Potential (SSVEP) is regarded as one of the most useful tools in BCI systems. In this study, methods from the whole signal-processing pipeline were compared: 1) feature extraction with different spectral methods (Shannon entropy, skewness, kurtosis, mean, variance) and wavelet transform magnitude; 2) feature selection by various methods (decision tree, principal component analysis (PCA), t-test, Wilcoxon, receiver operating characteristic (ROC)); and 3) classification with k-nearest neighbor (k-NN), support vector machines (SVM), Bayesian, and multilayer perceptron (MLP) classifiers. By combining these methods, the study gives an overview of the accuracy of classical methods. In addition, the study relies on a rather new feature selection approach based on decision tree and PCA, applied to BCI-SSVEP systems. Finally, the accuracies were calculated for the four recorded frequencies representing four directions: right, left, up, and down. The highest accuracy obtained was 91.39%.
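
The statistical features named in the feature-extraction step (mean, variance, skewness, kurtosis, Shannon entropy) can be sketched for a single signal window as below. This is an illustration only: the function names are hypothetical, the entropy is estimated from a fixed-width histogram (an assumption, since the abstract does not say how it was computed), and the input is a plain list of samples rather than real EEG data.

```python
import math


def moments(signal):
    """Return (mean, variance, skewness, kurtosis) using population formulas."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((v - mean) ** 2 for v in signal) / n
    if var == 0:
        # A constant signal has no defined shape statistics; report zeros.
        return mean, var, 0.0, 0.0
    std = math.sqrt(var)
    skew = sum(((v - mean) / std) ** 3 for v in signal) / n
    kurt = sum(((v - mean) / std) ** 4 for v in signal) / n
    return mean, var, skew, kurt


def shannon_entropy(signal, bins=8):
    """Histogram-based Shannon entropy in bits (assumed estimator)."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        return 0.0
    counts = [0] * bins
    for v in signal:
        # Map each sample to a bin; the maximum value goes in the last bin.
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(signal)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


window = [1.0, 2.0, 3.0, 4.0, 5.0]
print(moments(window))
print(shannon_entropy(window, bins=5))
```

Each window would contribute one such feature vector, which the selection and classification stages then operate on.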


Author(s):  
Moses L. Gadebe ◽  
Okuthe P. Kogeda ◽  
Sunday O. Ojo

Recognizing human activity in real time with a limited dataset is possible on a resource-constrained device. However, most classification algorithms, such as Support Vector Machines, C4.5, and K Nearest Neighbor, require a large dataset to accurately predict human activities. In this paper, we present a novel real-time human activity recognition model based on the Gaussian Naïve Bayes (GNB) algorithm, using a personalized JavaScript Object Notation (JSON) dataset extracted from the publicly available Physical Activity Monitoring for Aging People dataset and the University of Southern California Human Activity dataset. With the proposed method, the personalized JSON training dataset is extracted and compressed into a 12×8 multi-dimensional array of time-domain features, computed as a signal magnitude vector and tilt angles from tri-axial accelerometer sensor data. The algorithm is implemented on the Android platform using the Cordova cross-platform framework with HTML5 and JavaScript. Leave-one-activity-out cross validation is implemented as a testTrainer() function, the results of which are presented using a confusion matrix. The testTrainer() function holds out category K as the testing subset and uses the remaining K-1 categories as the training dataset to validate the proposed GNB algorithm. The proposed model is inexpensive in terms of memory and computational power owing to the use of a small, compressed training dataset. The test for each category K was repeated five times, and the algorithm consistently produced the same result for each test. The simulation using the tilt angle features shows overall precision, recall, F-measure, and accuracy rates of 90%, 99.6%, 94.18%, and 89.51%, respectively, compared with rates of 36.9%, 75%, 42%, and 36.9% when the signal magnitude vector features were used. The simulation results confirm that, when using the tilt angle dataset, the GNB algorithm is superior to the Support Vector Machines, C4.5, and K Nearest Neighbor algorithms.
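
To make the feature and classifier ideas concrete, here is a minimal Python sketch of a signal magnitude vector, a tilt angle, and a one-feature Gaussian Naïve Bayes classifier. It is not the authors' JavaScript implementation or their testTrainer() function: the function names are hypothetical, the tilt angle is assumed to be the angle between the sensor's z axis and the measured acceleration vector, and the accelerometer readings are made up.

```python
import math


def signal_magnitude(x, y, z):
    """Euclidean norm of one tri-axial accelerometer reading."""
    return math.sqrt(x * x + y * y + z * z)


def tilt_angle_deg(x, y, z):
    """Angle in degrees between the z axis and the reading (assumed definition)."""
    smv = signal_magnitude(x, y, z)
    if smv == 0:
        return 0.0
    return math.degrees(math.acos(max(-1.0, min(1.0, z / smv))))


def fit_gnb(samples, labels, var_floor=1e-6):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    by_class = {}
    for feats, label in zip(samples, labels):
        by_class.setdefault(label, []).append(feats)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        cols = list(zip(*rows))
        means = [sum(col) / n for col in cols]
        # var_floor keeps the variance strictly positive for tiny datasets.
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(cols, means)]
        model[label] = (math.log(n / len(samples)), means, variances)
    return model


def predict_gnb(model, feats):
    """Return the class with the highest Gaussian log posterior."""
    def log_posterior(params):
        log_prior, means, variances = params
        total = log_prior
        for v, m, var in zip(feats, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(model[label]))


# Made-up readings in units of g, labelled with two illustrative postures.
readings = [(0.0, 0.1, 0.99), (0.05, 0.0, 1.0), (0.98, 0.05, 0.1), (1.0, 0.0, 0.05)]
labels = ["upright", "upright", "lying", "lying"]
features = [[tilt_angle_deg(*r)] for r in readings]
model = fit_gnb(features, labels)

print(predict_gnb(model, [tilt_angle_deg(0.02, 0.03, 0.97)]))
```

Because the fitted model stores only a prior, a mean, and a variance per class and feature, it stays small, which fits the abstract's point about low memory and computational cost.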


Author(s):  
MAYY M. AL-TAHRAWI ◽  
RAED ABU ZITAR

Many techniques and algorithms for automatic text categorization (TC) have been devised and proposed in the literature. However, there is still much room for researchers in this area to improve existing algorithms or come up with new techniques. Polynomial Networks (PNs) have not previously been used in TC. This can be attributed to the huge datasets used in TC, as well as to the technique itself, which has high computational demands. In this paper, we investigate and propose using PNs in TC. The proposed PN classifier achieved competitive classification performance in our experiments. More importantly, this performance is achieved with one-shot (non-iterative) training and using just 0.25%–0.5% of the corpora features. Experiments are conducted on two benchmark datasets in TC: Reuters-21578 and 20 Newsgroups. Five well-known classifiers are evaluated on the same data and feature subsets: the state-of-the-art Support Vector Machines (SVM), Logistic Regression (LR), k-nearest-neighbor (kNN), Naive Bayes (NB), and Radial Basis Function (RBF) networks.

