Roman Urdu Headline News Text Classification Using RNN, LSTM and CNN

Author(s):  
Irfan Ali Kandhro ◽  
Sahar Zafar Jumani ◽  
Kamlash Kumar ◽  
Abdul Hafeez ◽  
Fayyaz Ali

This paper presents an automated tool for classifying text into predefined categories. Text classification has long been considered a vital method for managing and processing the huge number of documents in digital form, which are widespread and continuously increasing. Most research in text classification has been done on Urdu, English, and other languages, but limited work has been carried out on Roman Urdu data. Technically, text classification follows two steps: the first selects the main features from all the available features of the text documents using feature extraction techniques; the second applies classification algorithms to those chosen features. The data set was collected with scraping tools from the popular news websites Awaji Awaze and Daily Jhoongar and split into training and testing sets of 70% and 30%, respectively. In this paper, deep learning models, namely RNN, LSTM, and CNN, are used to classify Roman Urdu headline news. The testing accuracies are 81% (RNN), 82% (LSTM), and 79% (CNN); the experimental results demonstrate that the LSTM method outperforms CNN and RNN.
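The gating mechanism that lets an LSTM carry context across a headline, and which plausibly accounts for its edge over the plain RNN here, can be sketched in a few lines of NumPy. This is an illustrative single-cell forward pass under assumed layer sizes, not the authors' implementation:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM forward step. W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b            # all four gate pre-activations at once
    i = 1 / (1 + np.exp(-z[:H]))          # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))       # forget gate
    o = 1 / (1 + np.exp(-z[2*H:3*H]))     # output gate
    g = np.tanh(z[3*H:])                  # candidate cell state
    c = f * c_prev + i * g                # new cell state: keep + write
    h = o * np.tanh(c)                    # new hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 8, 4                               # embedding and hidden sizes (assumed)
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):         # a 5-token headline as random vectors
    h, c = lstm_step(x, h, c, W, U, b)
```

A trained model would learn W, U, and b by backpropagation and feed the final h into a softmax over the news categories.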

Author(s):  
Adam Piotr Idczak

It is estimated that approximately 80% of all data gathered by companies are text documents. This article is devoted to one of the most common problems in text mining, i.e. text classification in sentiment analysis, which focuses on determining a document's sentiment. The lack of a defined structure in text makes this problem more challenging and has led to the development of various techniques for determining a document's sentiment. In this paper a comparative analysis of two sentiment classification methods, the naive Bayes classifier and logistic regression, was conducted. The analysed texts are written in Polish and come from banks. Classification was carried out by means of a bag-of-n-grams approach, in which a text document is represented as a set of terms and each term consists of n words. The results show that logistic regression performed better.
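The bag-of-n-grams representation described above (each term is a sequence of n consecutive words) can be sketched in plain Python; the example sentence is invented:

```python
from collections import Counter

def bag_of_ngrams(text, n=2):
    """Represent a document as counts of n-word terms (bag-of-n-grams)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

doc = "the bank approved the loan the bank"
bag = bag_of_ngrams(doc, n=2)
# ('the', 'bank') occurs twice; every other bigram occurs once
```

These counts then form the feature vectors fed to the naive Bayes or logistic regression classifier.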


2021 ◽  
Vol 21 (3) ◽  
pp. 3-10
Author(s):  
Petr ŠALOUN ◽  
◽  
Barbora CIGÁNKOVÁ ◽  
David ANDREŠIČ ◽  
Lenka KRHUTOVÁ ◽  
...  

For a long time, both professionals and the lay public showed little interest in informal carers, yet these people deal with multiple common issues in their everyday lives. As the population ages, we can observe a change in this attitude, and thanks to advances in computer science we can offer them effective assistance and support by providing necessary information and connecting them with both the professional and lay communities. In this work we describe a project called "Research and development of support networks and information systems for informal carers for persons after stroke", which produces an information system visible to the public as a web portal. The portal does not provide just a simple set of information: using artificial intelligence, text document classification, and crowdsourcing to further improve its accuracy, it also provides effective visualization of and navigation over content made mostly by the community itself, personalized to the phase of the informal carer's care-taking timeline. It can be beneficial for informal carers because it allows them to find content specific to their current situation. This work describes our approach to the classification of text documents and its improvement through crowdsourcing. Its goal is to test a text document classifier based on document similarity measured by the N-grams method and to design an evaluation and crowdsourcing-based classification improvement mechanism. The crowdsourcing interface was created using the CMS WordPress. In addition to data collection, the purpose of the interface is to evaluate classification accuracy, which extends the classifier's test data set and thus makes the classification more successful.
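One common way to measure document similarity with an N-grams method is Jaccard overlap of character n-gram sets, assigning a new text the label of its most similar reference document. The article does not fix the exact measure, so this sketch (with invented reference documents) is only one plausible instantiation:

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def classify(doc, labelled_docs, n=3):
    """Assign the label of the most similar reference document."""
    grams = char_ngrams(doc, n)
    return max(labelled_docs,
               key=lambda label: jaccard(grams, char_ngrams(labelled_docs[label], n)))

refs = {
    "stroke care": "rehabilitation exercises after a stroke",
    "benefits": "how to apply for a carer allowance",
}
label = classify("daily exercises for stroke rehabilitation", refs)
```

Crowdsourced corrections would then add newly labelled documents to `refs`, which is how the portal's accuracy improves over time.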


Author(s):  
Desi Ramayanti

In digital business, managers commonly need to process text so that it can be used to support decision-making. The number of text documents containing ideas and opinions is growing and is challenging to understand one by one, whereas if the data are processed and correctly rendered using machine learning, they can quickly present a general overview of a particular case, organization, or object. Numerous studies have been carried out in this research area; nevertheless, most of them concentrated on English text classification. Every language has its own techniques or methods for classifying text, depending on the characteristics of its grammar, so the result of classification may differ between languages even when the same algorithm is used. Given the importance of text classification, two algorithms that can be implemented are the support vector machine (SVM) and random forest (RF). Against this background, this research aims to find out how the support vector machine and random forest algorithms perform in the classification of Indonesian text. The SVM classifier with 10-fold cross-validation achieved the best accuracy, 0.9648, but required 40.118 seconds of computation time. The RF classifier with the parameters 'bootstrap': False, 'min_samples_leaf': 1, 'n_estimators': 10, 'min_samples_split': 3, 'criterion': 'entropy', 'max_features': 3, 'max_depth': None achieved an accuracy of 0.9561 with a computation time of 109.399 seconds.
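A minimal scikit-learn sketch of this comparison, reusing the RF hyperparameters quoted above but on a synthetic stand-in for the Indonesian text features (the actual corpus and feature pipeline are not reproduced here):

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic features standing in for the vectorised Indonesian documents
X, y = make_classification(n_samples=300, n_features=20, random_state=42)

svm = SVC(kernel="linear")
rf = RandomForestClassifier(bootstrap=False, min_samples_leaf=1, n_estimators=10,
                            min_samples_split=3, criterion="entropy",
                            max_features=3, max_depth=None, random_state=42)

for name, clf in [("SVM", svm), ("RF", rf)]:
    start = time.time()
    acc = cross_val_score(clf, X, y, cv=10).mean()   # 10-fold cross-validation
    print(f"{name}: accuracy={acc:.4f}, time={time.time() - start:.3f}s")
```

On the paper's real data the SVM was slightly more accurate but notably faster than this RF configuration.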


2014 ◽  
Vol 7 (4) ◽  
pp. 1-11 ◽  
Author(s):  
Said Nouri ◽  
Mohamed Fakir

This paper presents a new method, called density weight and zigzag sequence, for recognizing printed Arabic names. The technique is performed in two steps: the first reduces the 96x96 matrix to 12x12 using the density weight technique; in the second step, the resulting 12x12 matrix is traversed along a zigzag path to extract 144 sequence values. The 144 features found are used to represent each name in the data set. The proposed technique was tested on Moroccan town and village names using a KNN classifier with the consensus rule and an SVM classifier. A perfect score was obtained with KNN (k=9) and SVM (linear kernel).
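A minimal NumPy sketch of the two steps, assuming the density weight of each 8x8 block is its mean pixel value and a JPEG-style zigzag scan; the paper does not spell out these details, so both are assumptions:

```python
import numpy as np

def density_weights(img, out=12):
    """Reduce a 96x96 binary glyph image to a 12x12 grid of 8x8 block densities."""
    k = img.shape[0] // out                      # block size, 8 here
    return img.reshape(out, k, out, k).mean(axis=(1, 3))

def zigzag(m):
    """Read a square matrix in zigzag (anti-diagonal) order."""
    n = m.shape[0]
    out = []
    for s in range(2 * n - 1):                   # one pass per anti-diagonal
        idx = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        if s % 2 == 0:
            idx.reverse()                        # alternate diagonal direction
        out.extend(m[i, j] for i, j in idx)
    return np.array(out)

img = (np.random.default_rng(1).random((96, 96)) > 0.5).astype(float)
features = zigzag(density_weights(img))          # 144-element feature vector
```

The resulting 144-vector is what the KNN and SVM classifiers would consume.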


2014 ◽  
Vol 989-994 ◽  
pp. 4704-4707
Author(s):  
Sheng Wu Xu ◽  
Zheng You Xia

Most current news recommendation systems are suited to news that comes from a single news website, not to news from different news websites. Little research has been reported on utilizing hundreds of news websites to provide top hot news services for group customers (e.g. government staff). In this paper, we present a hot news recommendation system based on Hadoop that draws on hundreds of different news websites, and we discuss the system's architecture. We conclude that Hadoop is an excellent tool for web big data analytics and scales well with increasing data set size and number of nodes in the cluster. Experimental results demonstrate the reliability and effectiveness of our method.


Author(s):  
Vijayakumar T ◽  
Vinothkanna R ◽  
Duraipandian M

The human heart is divided into four chambers: the left and right atria and the left and right ventricles. Monitoring and taking care of every human heart is essential, so early prediction is needed to save lives and raise awareness about diet plans and lifestyle schedules; it also serves to improve the clinical diagnosis and treatment of patients. To predict or identify cardiovascular problems, the electrocardiogram (ECG) is used to record the electrical signal of the heart from the body surface. An algorithm that learns from previously labelled data is called supervised; one that learns from an unlabelled data set is called unsupervised. Large numbers of heartbeats are then classified into the categories of normal, abnormal, and irregular in order to detect cardiovascular diseases. This research article compares various methods of classifying the dataset with a fusion-based feature extraction method. In addition, our work includes a de-noising filter to reconstruct the raw data from the original input: the proposed framework performs preprocessing with a filtering approach that removes noise from the raw data set. The signal is affected by thermal noise, instrumentation noise, and calibration noise due to power line fluctuation; this interference is high in many handheld devices and can be eliminated by de-noising filters. The output of the de-noising filter is the input for fusion-based feature extraction and prediction model construction. This workflow has given good results for classifier effectiveness under imbalanced conditions: we achieved a good accuracy of 96.5% and minimal computation time for the classification of the ECG signal.
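As an illustration of the de-noising stage, here is a simple centered moving-average filter in NumPy applied to a synthetic noisy trace; the article does not specify its exact filter, so this is only a stand-in:

```python
import numpy as np

def moving_average_denoise(signal, window=5):
    """Smooth a 1-D signal with a centered moving average (simple de-noising)."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")

t = np.linspace(0, 1, 500)
clean = np.sin(2 * np.pi * 5 * t)                      # stand-in for an ECG trace
noisy = clean + 0.3 * np.random.default_rng(0).normal(size=t.size)
denoised = moving_average_denoise(noisy, window=7)
# the filtered signal sits closer to the clean trace than the noisy input does
```

Power-line interference in particular is narrowband, so real pipelines often use a notch filter instead; the principle of attenuating noise while preserving the waveform is the same.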


2017 ◽  
Vol 10 (2) ◽  
pp. 333-337
Author(s):  
Sindhu Sindhu ◽  
V Vaidhehi

Large databases of digital images have been used as an efficient and advanced way of classifying and intelligently retrieving medical images. This research work classifies human organs based on MRI images, with various MRI images of organs considered as the data set. The main objective is to automate the medical imaging system. Digital images are retrieved based on shape using Canny edge detection and are clustered together into one class using the K-Means algorithm. A data set of 2564 images related to the brain and heart is considered for this research. The system was trained to classify the images, which results in faster execution in the medical field and also helped in obtaining noiseless and efficient data.
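The clustering step can be illustrated with a plain NumPy K-Means on two synthetic clusters standing in for the shape descriptors of the two organ classes; this is a sketch, not the authors' code:

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain K-Means: assign points to the nearest centroid, recompute centroids."""
    centroids = X[:: len(X) // k][:k].copy()         # spread-out deterministic init
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)                # nearest-centroid assignment
        for j in range(k):
            if np.any(labels == j):                  # guard against empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# two well-separated blobs standing in for brain vs heart shape descriptors
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centres = kmeans(X, k=2)
```

In the paper's pipeline, the input points would be shape features extracted by Canny edge detection rather than raw coordinates.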


2018 ◽  
Vol 14 (2) ◽  
pp. 18-36 ◽  
Author(s):  
Yongjun Zhang ◽  
Zijian Wang ◽  
Yongtao Yu ◽  
Bolun Chen ◽  
Jialin Ma ◽  
...  

This article describes how text documents are a major data structure in the era of big data. With the explosive growth of data, the number of documents with multiple labels has increased dramatically. The popular multi-label classification technology, which is usually employed to handle multinomial text documents, is sensitive to the noise terms of text documents; therefore, there is still considerable room for improvement in the multi-label classification of text documents. This article introduces a supervised topic model, named labeled LDA with function terms (LF-LDA), to filter out the noisy function terms from text documents, which can help to improve the performance of multi-label classification. The article also shows the derivation of the Gibbs sampling formulas in detail, which can be generalized to other similar topic models. Based on the textual data set RCV1-v2, the article compares the proposed model with two other state-of-the-art multi-label classifiers, tuned SVM and labeled LDA, on both Macro-F1 and Micro-F1 metrics. The results show that LF-LDA outperforms them and has the lowest variance, which indicates the robustness of the LF-LDA classifier.


2021 ◽  
Vol 10 (5) ◽  
pp. 2780-2788
Author(s):  
Denis Eka Cahyani ◽  
Irene Patasik

Emotion is what humans feel when communicating with other humans or reacting to everyday events, and emotion classification is needed to recognize human emotions from text. This study compares the performance of the TF-IDF and Word2Vec models in representing features for emotional text classification. We use the support vector machine (SVM) and multinomial naive Bayes (MNB) methods to classify emotional text in commuter line and Transjakarta tweet data. The emotion classification in this study has two steps: the first step classifies data as containing emotion or no emotion; the second step classifies data that contain emotion into five types of emotion, i.e. happy, angry, sad, scared, and surprised. The study used three scenarios, namely SVM with TF-IDF, SVM with Word2Vec, and MNB with TF-IDF. SVM with TF-IDF generated the highest accuracy of the three methods in both the first and second classification steps, followed by MNB with TF-IDF and, last, SVM with Word2Vec. Evaluation using precision, recall, and F1-measure likewise shows that SVM with TF-IDF is the best method overall. This study shows that TF-IDF modeling performs better than Word2Vec modeling, and it improves on the classification results of previous studies.
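The TF-IDF weighting compared here can be computed from scratch in a few lines; the tweets below are invented stand-ins for the commuter line data:

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF: term frequency weighted by inverse document frequency."""
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenised for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vectors.append({t: (tf[t] / len(toks)) * math.log(n / df[t]) for t in tf})
    return vectors

tweets = ["kereta telat lagi", "senang sekali hari ini", "telat lagi kereta penuh"]
vecs = tfidf(tweets)
# terms unique to one tweet get the highest weight; terms in every tweet get zero
```

Unlike Word2Vec, which maps each word to a dense learned vector, TF-IDF produces sparse per-document weights; on short, vocabulary-specific tweets that sparsity is one plausible reason it won here.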


Author(s):  
M. Jeyanthi ◽  
C. Velayutham

BCI plays a vital role in science and technology research. Classification is a data mining technique used to predict group membership for data instances. Analysing BCI data is challenging because feature extraction and classification are more difficult for these data than for raw data. In this paper, we extract features from the raw EEG data using statistical Haralick features. The features are then normalized, and binning is used to improve the accuracy of the predictive models by reducing noise and eliminating some irrelevant attributes; classification is then performed with different techniques, namely the naive Bayes, k-nearest neighbour, and SVM classifiers, on the BCI dataset. Finally, we propose the SVM classification algorithm for the BCI data set.
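Haralick features are derived from the grey-level co-occurrence matrix (GLCM); a minimal NumPy sketch computing one such feature (energy, the angular second moment) on a tiny invented quantised image:

```python
import numpy as np

def glcm(img, levels=4, dx=1, dy=0):
    """Grey-level co-occurrence matrix, the basis of Haralick texture features."""
    m = np.zeros((levels, levels))
    h, w = img.shape
    for y in range(h - dy):
        for x in range(w - dx):
            m[img[y, x], img[y + dy, x + dx]] += 1   # count (value, neighbour) pairs
    return m / m.sum()                               # normalise to joint probabilities

def haralick_energy(p):
    """One Haralick feature: angular second moment (energy)."""
    return float((p ** 2).sum())

# a tiny quantised stand-in for an EEG-derived image (values in 0..3)
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [2, 2, 3, 3],
                [2, 2, 3, 3]])
p = glcm(img)
energy = haralick_energy(p)
```

The paper's full pipeline would compute several such statistics per GLCM, then normalise and bin them before feeding the classifiers.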

