Using of n-grams from morphological tags for fake news classification

2021 ◽  
Vol 7 ◽  
pp. e624
Author(s):  
Jozef Kapusta ◽  
Martin Drlik ◽  
Michal Munk

Research into techniques for effective fake news detection has become both necessary and attractive. These techniques draw on many research disciplines, including morphological analysis. Several researchers have stated that simple content-related n-grams and POS tagging have proven insufficient for fake news classification. However, over the last decade they have not published empirical results that would confirm these statements experimentally. Given this gap, the main aim of the paper is to experimentally evaluate the potential of the combined use of n-grams and POS tags for the correct classification of fake and true news. A dataset of published fake and real news about the current Covid-19 pandemic was pre-processed using morphological analysis. As a result, n-grams of POS tags were prepared and further analysed. Three techniques based on POS tags were proposed and applied to different groups of n-grams in the pre-processing phase of fake news detection. The n-gram size was examined first. Subsequently, the most suitable depth of the decision trees for sufficient generalization was determined. Finally, the performance measures of models based on the proposed techniques were compared with the standardised reference TF-IDF technique. The performance measures considered were accuracy, precision, recall and F1-score, estimated with 10-fold cross-validation. At the same time, the question of whether the TF-IDF technique can be improved using POS tags was examined in detail. The results showed that the newly proposed techniques are comparable with the traditional TF-IDF technique. Moreover, morphological analysis can improve the baseline TF-IDF technique: precision for fake news and recall for real news were statistically significantly improved.
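As a rough illustration of what "n-grams of POS tags" means, the sketch below slides a window over a tag sequence and computes relative n-gram frequencies for one document. The function names, the tag set and the example sequence are assumptions made for illustration; the abstract does not specify the tagger or the exact feature encoding used in the paper.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def pos_ngrams(tags: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams from a sequence of POS tags."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def ngram_frequencies(tags: Sequence[str], n: int) -> Dict[Tuple[str, ...], float]:
    """Relative frequency of each POS n-gram within one document."""
    grams = pos_ngrams(tags, n)
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}


# Hypothetical tag sequence for one short headline (tags are illustrative only).
example_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
bigrams = pos_ngrams(example_tags, 2)
bigram_freq = ngram_frequencies(example_tags, 2)
```

Frequency vectors of this kind could then be fed to a decision tree or combined with TF-IDF weights, which is the comparison the abstract describes.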

Mekatronika ◽  
2021 ◽  
Vol 3 (1) ◽  
pp. 27-31
Author(s):  
Ken-ji Ee ◽  
Ahmad Fakhri Bin Ab. Nasir ◽  
Anwar P. P. Abdul Majeed ◽  
Mohd Azraai Mohd Razman ◽  
Nur Hafieza Ismail

An animal classification system is a technology that automatically classifies the animal class (type) and is useful in many applications. Many types of learning models have recently been applied to this technology. Nonetheless, it is worth noting that extracting and classifying animal features is non-trivial for a successful animal classification system, particularly in the deep learning approach. The use of Transfer Learning (TL) has been demonstrated to be a powerful tool for extracting essential features. However, the employment of such a method in animal classification applications is somewhat limited. The present study aims to determine a suitable pipeline that pairs TL with a conventional classifier for animal classification. VGG16 and VGG19 were used to extract features and were then coupled with either a k-Nearest Neighbour (k-NN) or a Support Vector Machine (SVM) classifier. Prior to that, a total of 4000 images were gathered, covering five classes: cows, goats, buffalos, dogs, and cats. The data was split in an 80:20 ratio for training and testing. The classifiers' hyperparameters were tuned by a grid search that utilises five-fold cross-validation. The study demonstrated that the best TL pipeline is VGG16 with an optimised SVM, which yielded an average classification accuracy of 0.975. The findings of the present investigation could facilitate animal classification applications, e.g. for monitoring animals in the wild.
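The tuning step described here, a grid search scored by five-fold cross-validation, can be sketched in plain Python for a k-NN classifier. This is a minimal sketch under stated assumptions: the toy 2-D points stand in for VGG features, the candidate grid `[1, 3, 5]` is invented, and all function names are hypothetical. In practice the search would be run on the 80% training portion only.

```python
import math
from collections import Counter


def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds interleaved test folds."""
    return [list(range(fold, n_samples, n_folds)) for fold in range(n_folds)]


def knn_predict(train_x, train_y, query, k):
    """Majority label among the k training points closest to the query."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(xs, ys, k, n_folds=5):
    """Mean accuracy of k-NN across the cross-validation folds."""
    scores = []
    for test_idx in k_fold_indices(len(xs), n_folds):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        hits = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


def grid_search_k(xs, ys, candidates, n_folds=5):
    """Return (best_k, best_score) over the candidate neighbour counts."""
    results = {k: cv_accuracy(xs, ys, k, n_folds) for k in candidates}
    best_k = max(results, key=results.get)
    return best_k, results[best_k]


# Toy, well-separated two-class data standing in for extracted image features.
xs = [(0.1 * i, 0.0) for i in range(10)] + [(5.0 + 0.1 * i, 5.0) for i in range(10)]
ys = ["a"] * 10 + ["b"] * 10
best_k, best_score = grid_search_k(xs, ys, candidates=[1, 3, 5])
```

The same loop applies to an SVM by swapping the predictor and searching over its own hyperparameters instead of k.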


2021 ◽  
Vol 39 (15_suppl) ◽  
pp. 3044-3044
Author(s):  
David Haan ◽  
Anna Bergamaschi ◽  
Yuhong Ning ◽  
William Gibb ◽  
Michael Kesling ◽  
...  

3044 Background: Epigenomic assays have recently become popular tools for identifying molecular biomarkers, both in tissue and in plasma. In particular, the 5-hydroxymethyl-cytosine (5hmC) method has been shown to enable the epigenomic regulation of gene expression and subsequent gene activity, with different patterns across several tumor and normal tissue types. In this study we show that 5hmC profiles enable discrete classification of tumor and normal tissue for breast, colorectal, lung, ovary and pancreas. Such classification was also recapitulated in cfDNA from patients with breast, colorectal, lung, ovarian and pancreatic cancers. Methods: DNA was isolated from 176 fresh frozen tissues from breast, colorectal, lung, ovary and pancreas (44 per tumor per tissue type and up to 11 tumor tissues for each stage (I-IV)) and up to 10 normal tissues per tissue type. cfDNA was isolated from plasma from 783 non-cancer individuals and 569 cancer patients. Plasma-isolated cfDNA and tumor genomic DNA were enriched for the 5hmC fraction using chemical labelling, sequenced, and aligned to a reference genome to construct feature sets of 5hmC patterns. Results: 5hmC multinomial logistic regression analysis was employed across tumor and normal tissues and identified a set of specific and discrete tumor and normal tissue gene-based features. This indicates that we can classify samples regardless of source, with a high degree of accuracy, based on tissue of origin and also distinguish between normal and tumor status. Next, we applied a stacked ensemble machine learning algorithm, combining multiple logistic regression models across diverse feature sets, to the cfDNA dataset composed of 783 non-cancers and 569 cancers comprising 67 breast, 118 colorectal, 210 lung, 71 ovarian and 100 pancreatic cancers.
We identified a genomic signature that enables the classification of non-cancer versus cancer samples with an outer-fold cross-validation sensitivity of 49% (CI 45%-53%) at 99% specificity. Further, individual cancer outer-fold cross-validation sensitivity at 99% specificity was measured as follows: breast 30% (CI 19%-42%); colorectal 41% (CI 32%-50%); lung 49% (CI 42%-56%); ovarian 72% (CI 60%-82%); pancreatic 56% (CI 46%-66%). Conclusions: This study demonstrates that 5hmC profiles can distinguish cancer and normal tissues based on their origin. Further, 5hmC changes in cfDNA enable detection of several cancer types: breast, colorectal, lung, ovarian and pancreatic cancers. Our technology provides a non-invasive tool for cancer detection with low-risk sample collection, enabling better compliance than current screening methods. Among other utilities, we believe our technology could be applied to asymptomatic high-risk individuals, thus enabling enrichment for those subjects who most need a diagnostic imaging follow-up.
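"Sensitivity at 99% specificity" means the decision cut-off is fixed so that a set fraction of non-cancer samples score at or below it, and sensitivity is then read off the cancer samples. The sketch below shows that calculation on invented scores. The function names and numbers are assumptions, a 90% target is used only because the toy set has ten negatives, and the abstract does not describe how the study chose its cut-off within each fold.

```python
import math


def threshold_at_specificity(negative_scores, specificity):
    """Smallest observed cut-off with at least `specificity` of negatives at or below it."""
    ranked = sorted(negative_scores)
    # round() guards against float noise in specificity * n before taking the ceiling.
    needed = math.ceil(round(specificity * len(ranked), 9))
    return ranked[max(needed, 1) - 1]


def sensitivity(positive_scores, threshold):
    """Fraction of positives scoring strictly above the cut-off."""
    return sum(score > threshold for score in positive_scores) / len(positive_scores)


# Invented classifier scores: higher means "more likely cancer".
non_cancer = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.90]
cancer = [0.30, 0.50, 0.60, 0.95]

threshold = threshold_at_specificity(non_cancer, 0.90)
achieved_specificity = sum(s <= threshold for s in non_cancer) / len(non_cancer)
sensitivity_at_cutoff = sensitivity(cancer, threshold)
```

With the target raised to 0.99 and a large enough negative set, the same two functions give the kind of figure quoted in the abstract.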


2020 ◽  
Vol 10 (6) ◽  
pp. 1999 ◽  
Author(s):  
Milica M. Badža ◽  
Marko Č. Barjaktarović

The classification of brain tumors is performed by biopsy, which is not usually conducted before definitive brain surgery. The improvement of technology and machine learning can help radiologists in tumor diagnostics without invasive measures. A machine-learning algorithm that has achieved substantial results in image segmentation and classification is the convolutional neural network (CNN). We present a new CNN architecture for brain tumor classification of three tumor types. The developed network is simpler than already-existing pre-trained networks, and it was tested on T1-weighted contrast-enhanced magnetic resonance images. The performance of the network was evaluated using four approaches: combinations of two 10-fold cross-validation methods and two databases. The generalization capability of the network was tested with one of the 10-fold methods, subject-wise cross-validation, and the improvement was tested by using an augmented image database. The best result for the 10-fold cross-validation method was obtained for the record-wise cross-validation on the augmented data set, and, in that case, the accuracy was 96.56%. With good generalization capability and good execution speed, the newly developed CNN architecture could be used as an effective decision-support tool for radiologists in medical diagnostics.
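The difference between record-wise and subject-wise cross-validation is in how fold membership is assigned: per image, or per patient. The sketch below is a minimal illustration with hypothetical patient identifiers and helper names. It uses two folds on ten records for brevity, whereas the paper uses ten folds.

```python
def record_wise_folds(n_records, n_folds):
    """Assign individual records to folds, ignoring which subject they came from."""
    return [list(range(fold, n_records, n_folds)) for fold in range(n_folds)]


def subject_wise_folds(subject_ids, n_folds):
    """Assign whole subjects to folds, so all records of a subject share one fold."""
    subjects = sorted(set(subject_ids))
    fold_of = {subject: i % n_folds for i, subject in enumerate(subjects)}
    folds = [[] for _ in range(n_folds)]
    for index, subject in enumerate(subject_ids):
        folds[fold_of[subject]].append(index)
    return folds


def subjects_leak(subject_ids, folds):
    """True if any subject has records in more than one fold."""
    first_fold = {}
    for fold_number, fold in enumerate(folds):
        for index in fold:
            if first_fold.setdefault(subject_ids[index], fold_number) != fold_number:
                return True
    return False


# Hypothetical patient IDs, one entry per image.
subject_ids = ["p1", "p1", "p1", "p2", "p2", "p3", "p3", "p3", "p4", "p4"]
by_record = record_wise_folds(len(subject_ids), n_folds=2)
by_subject = subject_wise_folds(subject_ids, n_folds=2)
record_leak = subjects_leak(subject_ids, by_record)
subject_leak = subjects_leak(subject_ids, by_subject)
```

Record-wise splitting lets images of the same patient land in both training and test folds, which is why subject-wise splitting is the stricter test of generalization.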


2013 ◽  
Vol 658 ◽  
pp. 647-651 ◽  
Author(s):  
Jun Jie Zhu ◽  
Xiao Jun Zhang ◽  
Ji Hua Gu ◽  
He Ming Zhao ◽  
Qiang Zhou ◽  
...  

This paper studies the classification of pathological versus normal voice based on the sustained vowel /a/. First, 18 original acoustic features are extracted. Then, on the basis of the extracted parameters, the pathological voice is recognized using an AD Tree. During the classification stage, cross-validation of the features is also used as a reference. The method is validated with a voice database provided by the Massachusetts Eye and Ear Infirmary (MEEI). After 10-fold cross-validation and comparison with 7 other kinds of classifiers, the experimental results show that the AD Tree achieves the highest recognition rate, 95.2%. The results indicate that all the extracted parameters are reasonable for the subsequent recognition process and that the AD Tree is a good recognition method for pathological voice research.


2008 ◽  
Vol 17 (05) ◽  
pp. 957-971
Author(s):  
ATAOLLAH EBRAHIMZADEH ◽  
ABOLFAZL RANJBAR ◽  
MEHRDAD ARDEBLILPOUR

Classification of communication signals has seen increasing demand. In this paper, we present a new technique that identifies a variety of digital communication signal types. This technique utilizes a radial basis function neural network (RBFN) as the classifier. Swarm intelligence, as an evolutionary algorithm, is used to construct the RBFN. A combination of the higher-order moments and the higher-order cumulants up to the eighth order is selected as the features of the considered digital signal types. In conjunction with the RBFN, we have used k-fold cross-validation to improve the generalization capability. Simulation results show that the proposed technique has high performance for classification of different communication signals even at very low signal-to-noise ratios.
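To give a sense of why higher-order cumulants work as modulation features, the sketch below computes two fourth-order cumulants from sample moments using the standard textbook relations C40 = M40 - 3*M20^2 and C42 = M42 - |M20|^2 - 2*M21^2. It covers only fourth order, not the eighth-order features mentioned in the abstract. The function names and the noiseless unit-power symbol sets are assumptions for illustration.

```python
def moment(signal, p, q):
    """Sample mixed moment M_pq = E[y^(p-q) * conj(y)^q] of a complex signal."""
    return sum(y ** (p - q) * y.conjugate() ** q for y in signal) / len(signal)


def fourth_order_cumulants(signal):
    """Return (C40, C42) for a complex baseband signal after mean removal."""
    mean = sum(signal) / len(signal)
    centred = [y - mean for y in signal]
    m20 = moment(centred, 2, 0)
    m21 = moment(centred, 2, 1)
    m40 = moment(centred, 4, 0)
    m42 = moment(centred, 4, 2)
    c40 = m40 - 3 * m20 ** 2
    c42 = m42 - abs(m20) ** 2 - 2 * m21 ** 2
    return c40, c42


# Noiseless, unit-power symbol sets (illustrative; real signals include noise).
bpsk = [1 + 0j, -1 + 0j] * 4
qpsk = [1 + 0j, 0 + 1j, -1 + 0j, 0 - 1j] * 2

bpsk_c40, bpsk_c42 = fourth_order_cumulants(bpsk)
qpsk_c40, qpsk_c42 = fourth_order_cumulants(qpsk)
```

The two constellations produce clearly different cumulant values, which is the property a classifier such as an RBFN can exploit.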


Author(s):  
Gede Aditra Pradnyana ◽  
I Komang Agus Suryantara ◽  
I Gede Mahendra Darmawiguna

An impression can be interpreted as a psychological feeling toward a product, and it plays an important role in decision making. Therefore, understanding data in the domain of impressions is very useful. The objective of this research was to assess the performance of the K-Nearest Neighbors method in classifying endek image impressions, using K-Fold Cross Validation. The images were taken from 3 locations, namely CV. Artha Dharma, Agung Bali Collection, and Pengrajin Sri Rejeki. The image impressions were obtained by consulting an endek expert, Dr. D.A Tirta Ray, M.Si. Data mining was done using the K-Nearest Neighbors method, a classification method that classifies new objects based on their attributes and on training samples that have been classified previously. K-Fold Cross Validation testing obtained an accuracy of 91% with K values in K-Nearest Neighbors of 3, 4, 7 and 8.


2020 ◽  
Author(s):  
Zekuan Yu ◽  
Xiaohu Li ◽  
Haitao Sun ◽  
Jian Wang ◽  
Tongtong Zhao ◽  
...  

Background: To implement real-time diagnosis of the severity of patients infected with novel coronavirus 2019 (COVID-19) and guide follow-up therapeutic treatment, we collected chest CT scans of 202 patients diagnosed with COVID-19 from three hospitals in Anhui Province, China. Methods: A total of 729 2D axial plane slices, with 246 severe cases and 483 non-severe cases, were employed in this study. Four pre-trained deep models (Inception-V3, ResNet-50, ResNet-101, DenseNet-201) with multiple classifiers (linear discriminant, linear SVM, cubic SVM, KNN and Adaboost decision tree) were applied to identify the severe and non-severe COVID-19 cases. Three validation strategies (holdout validation, 10-fold cross-validation and leave-one-out) were employed to validate the feasibility of the proposed pipelines. Results and conclusion: The experimental results demonstrate that classifying features from pre-trained deep models shows promise for COVID-19 screening, with DenseNet-201 combined with a cubic SVM achieving the best performance. Specifically, it achieved the highest severity classification accuracy of 95.20% and 95.34% for 10-fold cross-validation and leave-one-out, respectively. The established pipeline was able to achieve rapid and accurate identification of the severity of COVID-19. This may help physicians make more efficient and reliable decisions.


2018 ◽  
Vol 7 (2.15) ◽  
pp. 136 ◽  
Author(s):  
Rosaida Rosly ◽  
Mokhairi Makhtar ◽  
Mohd Khalid Awang ◽  
Mohd Isa Awang ◽  
Mohd Nordin Abdul Rahman

This paper analyses the performance of classification models using single classifiers and combinations of ensemble methods, with the Breast Cancer Wisconsin and Hepatitis data sets as training datasets. This paper presents a comparison of different classifiers based on 10-fold cross-validation using a data mining tool. In this experiment, various classifiers are implemented, including three popular ensemble methods for the combination: boosting, bagging and stacking. The results show that for the classification of the Breast Cancer Wisconsin data set, the single Naïve Bayes (NB) classifier and the combination of bagging+NB achieved the same highest accuracy (97.51%) compared to other combinations of ensemble classifiers. For the classification of the Hepatitis data set, the results showed that the combination of stacking+Multi-Layer Perceptron (MLP) achieved a higher accuracy of 86.25%. By using the ensemble classifiers, the result may be improved. In future, a multi-classifier approach will be proposed by introducing a fusion at the classification level between these classifiers to obtain classification with higher accuracies.
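Of the three ensemble methods named here, bagging is the simplest to sketch: train copies of a base learner on bootstrap resamples and combine them by majority vote. In the sketch below the base learner is a nearest-centroid rule on a single feature. That is a stand-in chosen for brevity, not the Naïve Bayes or MLP models used in the paper, and the data, labels and function names are all invented.

```python
import random
from collections import Counter


def fit_centroids(xs, ys):
    """Base learner: per-class mean of a 1-D feature."""
    sums, counts = Counter(), Counter()
    for x, y in zip(xs, ys):
        sums[y] += x
        counts[y] += 1
    return {label: sums[label] / counts[label] for label in counts}


def predict_centroid(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


def fit_bagging(xs, ys, n_models=11, seed=0):
    """Train n_models base learners, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        picks = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_centroids([xs[i] for i in picks], [ys[i] for i in picks]))
    return models


def predict_bagging(models, x):
    """Combine base learners by majority vote."""
    votes = Counter(predict_centroid(model, x) for model in models)
    return votes.most_common(1)[0][0]


# Invented, well-separated 1-D data with two classes.
xs = [0.1 * i for i in range(10)] + [5.0 + 0.1 * i for i in range(10)]
ys = ["neg"] * 10 + ["pos"] * 10
ensemble = fit_bagging(xs, ys)
low_prediction = predict_bagging(ensemble, 0.3)
high_prediction = predict_bagging(ensemble, 5.4)
```

Boosting and stacking differ in how the base learners are trained and combined, but they plug into the same 10-fold evaluation loop.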


2012 ◽  
Vol 542-543 ◽  
pp. 1438-1442
Author(s):  
Ting Hua Wang ◽  
Cai Yun Cai ◽  
Yan Liao

The kernel is a key component of support vector machines (SVMs) and other kernel methods. Based on the data distributions of classes in the feature space, this paper proposes a model selection criterion to evaluate the goodness of a kernel in a multiclass classification scenario. This criterion is computationally efficient and is differentiable with respect to the kernel parameters. Compared with the k-fold cross-validation technique, which is often regarded as a benchmark, this criterion is found to yield about the same performance with much less computational overhead.

