Text Mining of Research Articles Using Clustering Approach

The publication of research articles across research fields is growing rapidly. Tracking down an appropriate article in a large research archive is laborious and time consuming. Clustering research articles by domain plays an important role in helping researchers retrieve articles faster. Hence a commonly practiced search mechanism, namely domain name search, has been applied to retrieve appropriate documents and articles. When documents from new domains are added to the repository, it is necessary to spot keywords and boost the corresponding domains for proper classification. Classification techniques, namely the Random Forest classifier, SVM, and TF-IDF, have been used to classify articles and to compare their processing times. TF-IDF (Term Frequency-Inverse Document Frequency) has further been proposed to transform the corpus into a vector space model. Clustering algorithms such as K-Means and hierarchical clustering have been used to cluster articles. Finally, SVM has a better processing time than the Random Forest classifier and TF-IDF, and K-Means gives a better understanding than the hierarchical algorithm.
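To make the vector space model step concrete, here is a minimal sketch of TF-IDF weighting in plain Python. The toy corpus, the function names `tfidf_vectors` and `cosine`, and the choice of length-normalized term frequency with an unsmoothed logarithmic IDF are assumptions made for illustration; the abstract does not say which TF-IDF variant the authors used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    Other variants (smoothed idf, sublinear tf) are equally common.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: two text-mining titles and one unrelated title.
corpus = [
    "clustering research articles with k-means",
    "hierarchical clustering of research articles",
    "random forest for cancer detection",
]
vecs = tfidf_vectors(corpus)
```

In this toy corpus the first two documents share weighted terms and so have a positive cosine similarity, while the first and third share none. A clustering algorithm such as K-Means would then operate on vectors of this kind.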

Poisson mixtures

Natural Language Engineering ◽ 10.1017/s1351324900000139 ◽ 1995 ◽ Vol 1 (2) ◽ pp. 163-190

Cited By ~ 146

Author(s): Kenneth W. Church ◽ William A. Gale

Keyword(s): Negative Binomial ◽ Probability Distributions ◽ Hidden Variables ◽ Heterogeneous Structure ◽ Text Compression ◽ Inverse Document Frequency ◽ Poisson Mixtures ◽ Document Frequency ◽ Wide Range ◽ Better Than

Abstract: Shannon (1948) showed that a wide range of practical problems can be reduced to the problem of estimating probability distributions of words and ngrams in text. It has become standard practice in text compression, speech recognition, information retrieval and many other applications of Shannon's theory to introduce a “bag-of-words” assumption. But obviously, word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph. The proposed Poisson mixture captures much of this heterogeneous structure by allowing the Poisson parameter θ to vary over documents subject to a density function φ. φ is intended to capture dependencies on hidden variables such as genre, author, topic, etc. (The Negative Binomial is a well-known special case where φ is a Γ distribution.) Poisson mixtures fit the data better than standard Poissons, producing more accurate estimates of the variance over documents (σ²), entropy (H), inverse document frequency (IDF), and adaptation (Pr(x ≥ 2 | x ≥ 1)).
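The difference between a single Poisson and a Poisson mixture can be illustrated with a short sketch. It compares the variance, IDF, and adaptation predicted by a Poisson with rate θ against those of a Negative Binomial with the same mean, taking IDF as −log₂ of the predicted fraction of documents that contain the word. The function names, the mean-and-shape parameterization, and the numeric values are assumptions chosen for illustration, not figures from the paper.

```python
import math


def poisson_stats(theta):
    """Statistics for a word whose count per document is Poisson(theta)."""
    p0 = math.exp(-theta)   # Pr(x = 0)
    p1 = theta * p0         # Pr(x = 1)
    p_ge1 = 1.0 - p0        # predicted fraction of documents containing the word
    return {
        "variance": theta,
        "idf": -math.log2(p_ge1),
        "adaptation": (1.0 - p0 - p1) / p_ge1,  # Pr(x >= 2 | x >= 1)
    }


def negbin_stats(mean, shape):
    """Same statistics when theta is Gamma-distributed (Negative Binomial).

    Assumed parameterization: mean of the counts and Gamma shape parameter.
    """
    p = shape / (shape + mean)
    p0 = p ** shape                 # Pr(x = 0)
    p1 = shape * (1.0 - p) * p0     # Pr(x = 1)
    p_ge1 = 1.0 - p0
    return {
        "variance": mean + mean ** 2 / shape,
        "idf": -math.log2(p_ge1),
        "adaptation": (1.0 - p0 - p1) / p_ge1,
    }


# Hypothetical word: 0.1 occurrences per document on average.
po = poisson_stats(0.1)
nb = negbin_stats(mean=0.1, shape=0.05)
```

With equal means, the mixture predicts a larger variance, a higher IDF (the word is concentrated in fewer documents), and a much higher chance of a second occurrence once the word has appeared, which is the kind of heterogeneity the abstract describes.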

SMS Spam Message Detection using Term Frequency-Inverse Document Frequency and Random Forest Algorithm

Procedia Computer Science ◽ 10.1016/j.procs.2019.11.150 ◽ 2019 ◽ Vol 161 ◽ pp. 509-515

Cited By ~ 1

Author(s): Nilam Nur Amir Sjarif ◽ Nurulhuda Firdaus Mohd Azmi ◽ Suriayati Chuprat ◽ Haslina Md Sarkan ◽ Yazriwati Yahya ◽ ...

Keyword(s): Random Forest ◽ Random Forest Algorithm ◽ Inverse Document Frequency ◽ Term Frequency ◽ Document Frequency

Aspect Category Classification with a Machine Learning Approach Using an Indonesian-Language Dataset

Jurnal Nasional Teknik Elektro dan Teknologi Informasi (JNTETI) ◽ 10.22146/jnteti.v10i3.1819 ◽ 2021 ◽ Vol 10 (3) ◽ pp. 229-235

Author(s): Syaifulloh Amien Pandega Perdana ◽ Teguh Bharata Aji ◽ Ridi Ferdiana

Keyword(s): Machine Learning ◽ Support Vector Machine ◽ Random Forest ◽ Sentiment Analysis ◽ Support Vector ◽ Term Weighting ◽ Inverse Document Frequency ◽ Term Frequency ◽ Document Frequency ◽ Bahasa Indonesia

Customer reviews are opinions about the quality of goods or services as experienced by consumers. Customer reviews contain information that is useful to both consumers and providers of goods or services. The availability of large numbers of customer reviews on websites calls for a framework that extracts sentiment automatically. A single customer review often covers many aspects, so Aspect Based Sentiment Analysis (ABSA) must be used to determine the polarity of each aspect. One important task in ABSA is Aspect Category Detection. Machine learning methods for Aspect Category Detection have been applied widely in English-language domains, but only rarely in Indonesian-language domains. This paper compares the performance of three machine learning algorithms, namely Naïve Bayes (NB), Support Vector Machine (SVM), and Random Forest (RF), on Indonesian-language customer reviews, using Term Frequency–Inverse Document Frequency (TF-IDF) for term weighting. The results show that RF outperforms NB and SVM in three different domains, namely restaurants, hotels, and e-commerce, with f1-scores of 84.3%, 85.7%, and 89.3%, respectively.
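For readers unfamiliar with the f1-score used to compare the classifiers, the following minimal sketch computes it as the harmonic mean of precision and recall. The function name and the counts are hypothetical and are not taken from the paper.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall for one class."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a single aspect category.
score = f1_score(true_positives=80, false_positives=10, false_negatives=20)
```

Because it is a harmonic mean, the f1-score stays low unless both precision and recall are high, which makes it a stricter summary than accuracy when aspect categories are unevenly represented.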

On the Feature Selection of Microarray Data for Cancer Detection based on Random Forest Classifier

JURNAL INFOTEL ◽ 10.20895/infotel.v12i3.485 ◽ 2020 ◽ Vol 12 (3)

Author(s): Tita Nurul Nuklianggraita ◽ Adiwijaya Adiwijaya ◽ Annisa Aditsania

Keyword(s): Random Forest ◽ Cancer Detection ◽ Microarray Data ◽ Random Forest Classifier ◽ World Health ◽ Selection Operator ◽ Health Organization ◽ Microarray Technique ◽ Selection Of ◽ Better Than

Cancer is a disease that can affect all human organs. According to the World Health Organization (WHO) fact sheet of 2018, cancer deaths have reached 9.6 million. One known way to detect cancer is the microarray technique, but microarray data are high-dimensional because the number of features is very large compared to the number of samples. Therefore, dimension reduction should be applied to achieve optimum accuracy. In this paper, we compare Minimum Redundancy Maximum Relevance (MRMR) and Least Absolute Shrinkage and Selection Operator (LASSO) for reducing the dimension of microarray data, and we compare the resulting classification (cancer detection) performance using a Random Forest (RF) classifier. Based on simulation, it can be concluded that LASSO is better than MRMR because it yields an evaluation score of 100% for lung and ovarian cancer, 92% for colon cancer, 93% for prostate tumor, and 83% for central nervous system.

Automatic Complaints Categorization Using Random Forest and Gradient Boosting

Advance Sustainable Science, Engineering and Technology ◽ 10.26877/asset.v3i1.8460 ◽ 2021 ◽ Vol 3 (1) ◽ pp. 0210106

Author(s): Muchamad Taufiq Anwar

Keyword(s): Random Forest ◽ Gradient Boosting ◽ Future Research ◽ Research Directions ◽ Inverse Document Frequency ◽ The Public ◽ Bangalore City ◽ Document Frequency ◽ Future Research Directions ◽ Multi Class Classification

Capturing and responding to complaints from the public is an important effort in developing a good city or country. This project aims to use data mining to automate complaint categorization. More than 35,000 complaints from Bangalore city, India, were retrieved from the “I Change My City” website (https://www.ichangemycity.com). The vector space of the complaints was created using Term Frequency–Inverse Document Frequency (TF-IDF), and the multi-class text classification was done using Random Forest (RF) and Gradient Boosting (GB). Results showed that RF and GB have similar performance, with an accuracy of 73% on the 10-class classification task. Results also showed that the model is highly dependent on the word usage in the complaint's description. Future research directions to increase task performance are also suggested.

Classification of Genuinity in Job Posting Using Machine Learning

International Journal for Research in Applied Science and Engineering Technology ◽ 10.22214/ijraset.2021.39580 ◽ 2021 ◽ Vol 9 (12) ◽ pp. 1569-1575

Author(s): Charan Lokku

Keyword(s): Machine Learning ◽ Random Forest ◽ Language Processing ◽ Large Dataset ◽ Final Model ◽ Inverse Document Frequency ◽ Document Frequency ◽ Machine Learning Approach ◽ Job Postings

Abstract: To reduce fraudulent job postings on the internet, we use a machine learning approach to predict the chances of a job posting being fake, so that candidates can stay alert and make informed decisions. The model uses NLP to analyze the sentiments and patterns in the job posting and a TF-IDF vectorizer for feature extraction. We use the Synthetic Minority Oversampling Technique (SMOTE) to balance the data and Random Forest for classification, because it predicts with high accuracy, runs efficiently even on large datasets, and helps prevent overfitting. The final model takes in any relevant job posting data and produces a result indicating whether the job is real or fake. Keywords: Natural Language Processing (NLP), Term Frequency-Inverse Document Frequency (TF-IDF), Synthetic Minority Oversampling Technique (SMOTE), Random Forest.
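The balancing step can be sketched as follows. This is a simplified take on the SMOTE idea, assuming dense numeric feature vectors: each synthetic sample is placed at a random point on the line between a minority sample and one of its nearest minority neighbors. The function name `smote`, its parameters, and the 2-D toy points are hypothetical; in the setting described above, the inputs would be TF-IDF vectors of the rarer class.

```python
import math
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority samples by interpolating toward neighbors.

    Simplification: the base sample is drawn at random for every synthetic
    point, rather than looping over all minority samples at a fixed rate.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbors of the base sample, excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Hypothetical minority-class points (for example, fake job postings).
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority_points, n_synthetic=6)
```

Every synthetic point lies between two existing minority samples, so the minority class grows without exact duplicates. Oversampling is normally applied to the training split only, so that the test data keep their original class balance.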

Textual analysis in finance

10.12681/eadd/49794 ◽ 2021

Author(s): Απόστολος Κατσαφάδος

Keyword(s): Random Forest ◽ Textual Analysis ◽ Inverse Document Frequency ◽ Term Frequency ◽ Document Frequency ◽ Out Of Sample

This doctoral dissertation is divided into seven chapters. The common thread across all of these chapters is that they revolve around the use of textual analysis and, by extension, its application to the financial sector. The first chapter provides the introduction to the dissertation and points out why a focus on textual analysis is important. The second chapter then presents a relatively concise but substantive review of the literature, in order to crystallize the foundations, the constants, and the trends in research activity in this area. In this way, the connection of the dissertation to the literature and its contribution to it are highlighted, and the empirical findings can be better understood. The third chapter uses textual analysis to identify the banks that participate in a merger, either as target or as acquirer, in the US banking sector. Based on the positive and negative words of Loughran and McDonald, we compute the sentiment of banks' annual reports (10-Ks). In our empirical analysis, we use logistic regressions to estimate the probability that a bank participates in a merger. First, we show that a higher frequency of positive words in a bank's 10-K is associated with a higher probability of acquiring. Second, we find that a higher frequency of negative words in a bank's 10-K is associated with a higher probability of being acquired. Our empirical conclusions remain robust even after a variety of bank-specific variables are introduced into the logistic regression models.
The fourth chapter examines the topic of the previous chapter from a different perspective. In contrast to the use of econometric methodologies to establish the statistical significance of coefficients under an explanatory approach, here the goal is prediction using machine learning techniques, including deep learning techniques. More specifically, it investigates whether textual information from annual reports has predictive power when predicting bank mergers. We show that textual data improve the prediction accuracy of the models, both for banks that are targets and for banks in the role of acquirer. In general, combining textual and financial variables as model input achieves better predictive ability. On the one hand, the findings for targets suggest that random forests are best in terms of out-of-sample prediction. In this case, we use text features of unigrams and bigrams weighted with term frequency-inverse document frequency (TF-IDF), together with financial variables. On the other hand, deep learning models perform more effectively when we predict targets in a merger. More specifically, we use the centroid of the word embeddings together with financial variables. Notably, our finance-specific word embeddings produce better results than generic ones. Once again, TF-IDF weighting appears to improve the overall prediction result. Our findings show that textual information manages to mitigate the opacity of banks.
The fifth chapter investigates the predictive power of textual data drawn from initial prospectuses (S-1) for predicting underpricing in initial public offerings (IPOs). More specifically, we use machine learning models to make our predictions. First and foremost, our research differs from the prior literature in that we predict not only whether an IPO will be underpriced or overpriced, framed as binary classification, but also the magnitude of the potential underpricing. In both cases, we find that text features can effectively complement financial variables. In fact, machine learning models that use a combination of textual and financial variables achieve higher performance than those that take only one type of information as input. We also explore methodological ways of effectively combining the financial variables with the large number of textual variables. Overall, our results provide empirical evidence of how information from text manages to reduce ex-ante uncertainty in the valuation of IPOs.
The sixth chapter seeks to explain IPO underpricing, specifically on the basis of the tone of the prospectuses. We show that more uncertain text in the S-1 filing, as an internal source of uncertainty, is associated with higher underpricing. However, the main contribution of our research is that it focuses on policy uncertainty as an external source of uncertainty, in addition to using textual sentiment. Surprisingly, we find that higher policy uncertainty before the S-1 filing date is associated with less underpricing. Interestingly, we show that high policy uncertainty affects a firm's decision to proceed with the IPO. In fact, policy uncertainty is negatively associated with IPO volume. We further document that only firms of good quality continue to proceed toward an IPO despite high policy uncertainty, which consequently means that they enjoy lower underpricing.
The seventh chapter provides the main conclusions of the dissertation and offers suggestions for future research.

Random Forest: A Hybrid Implementation for Sarcasm Detection in Public Opinion Mining

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽ 10.35940/ijitee.l3758.1081219 ◽ 2019 ◽ Vol 8 (12) ◽ pp. 5022-5025

Keyword(s): Decision Making ◽ Public Opinion ◽ Random Forest ◽ Decision Tree ◽ Opinion Mining ◽ Random Forest Classifier ◽ Decision Tree Classifier ◽ Wrong Decision ◽ Tree Classifier ◽ Better Than

Modelling sentiment with context is one of the most important parts of sentiment analysis. Various classifiers help in detecting and classifying it. Taking sarcasm into account would make sentiment detection more accurate. However, detecting sarcasm in people's reviews is a challenging task, and undetected sarcasm may lead to wrong decisions or misclassification. This paper uses Decision Tree and Random Forest classifiers and compares the performance of both. Here we consider the random forest as a hybrid decision tree classifier. We propose, with appropriate reasoning, that the random forest classifier performs better than an ordinary decision tree classifier.
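The relationship between a decision tree and a random forest can be sketched in a few lines. The example below is a deliberately reduced illustration, not the paper's implementation: each "tree" is a one-split stump trained on a bootstrap sample and a randomly chosen feature, and the forest predicts by majority vote. All names and the toy data are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Fit a one-split tree on one feature, choosing the threshold with fewest errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
        right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(l != left_label for l in left)
        errors += sum(l != right_label for l in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda row: left_label if row[feature] <= threshold else right_label


def train_forest(rows, labels, n_trees=15, seed=0):
    """Train stumps on bootstrap samples, each with a randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest


def predict(forest, row):
    """Majority vote over the predictions of all trees."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]


# Hypothetical two-feature data: class 0 near the origin, class 1 far away.
X = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 9), (9, 8), (8, 8), (9, 9)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(X, y)
```

A single tree depends heavily on the particular sample it was trained on, whereas averaging many trees trained on different resamples reduces that variance, which is the usual argument for why a forest tends to outperform one tree.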

Sentiment analysis of customer reviews in zomato bangalore restaurants using random forest classifier

Abstract Proceedings International Scholars Conference ◽ 10.35974/isc.v7i1.1003 ◽ 2019 ◽ Vol 7 (1) ◽ pp. 1831-1840

Author(s): Bern Jonathan ◽ Jay Idoan Sihotang ◽ Stanley Martin

Keyword(s): Natural Language Processing ◽ Random Forest ◽ Natural Language ◽ Sentiment Analysis ◽ Language Processing ◽ Natural Languages ◽ Inverse Document Frequency ◽ Customer Reviews ◽ Document Frequency ◽ Split Test

Introduction: Natural Language Processing is a part of Artificial Intelligence and Machine Learning concerned with the interactions between computers and human (natural) languages. Sentiment analysis is a part of Natural Language Processing that is often used to analyze the words people write in order to find positive, negative, or neutral sentiment. Sentiment analysis is useful for knowing whether users like something or not. Zomato is an application for rating restaurants. Each rating comes with a review of the restaurant, which can be used for sentiment analysis. Based on this, the authors aim to predict the sentiment of the reviews. Method: The reviews are preprocessed by lowercasing all words, tokenizing, removing numbers, punctuation, and stop words, and lemmatizing. The words are then converted to vectors with term frequency-inverse document frequency (TF-IDF). The data comprise 150,000 reviews. Reviews with a rating above 3 are labeled positive, reviews with a rating below 3 are labeled negative, and reviews with a rating of 3 are labeled neutral. The authors use a split test, with 80% of the data for training and 20% for testing. The metrics used to evaluate the random forest classifier are precision, recall, and accuracy. The accuracy in this research is 92%. Result: The precision for positive, negative, and neutral sentiment is 92%, 93%, and 96%. The recall for positive, negative, and neutral sentiment is 99%, 89%, and 73%. Average precision and recall are 93% and 87%. The 10 words that affect the results are: “bad”, “good”, “average”, “best”, “place”, “love”, “order”, “food”, “try”, and “nice”.
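The preprocessing and labeling steps in the Method part can be sketched as below. This is an illustration under stated assumptions: the stop word list is a tiny hypothetical one, reviews are assumed to be English text, lemmatization is left out because it needs an external language resource, and the labeling rule assumes that ratings above 3 are positive, below 3 negative, and exactly 3 neutral, which is one reading of the rule in the abstract.

```python
import re

# Hypothetical, deliberately small stop word list; real pipelines use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to", "of", "i"}


def preprocess(review):
    """Lowercase, drop numbers and punctuation, tokenize, remove stop words."""
    text = review.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    return [token for token in text.split() if token not in STOP_WORDS]


def label(rating):
    """Map a numeric rating to a sentiment class (assumed thresholds)."""
    if rating > 3:
        return "positive"
    if rating < 3:
        return "negative"
    return "neutral"


# Hypothetical review and rating.
tokens = preprocess("The food was GOOD, 10/10!")
sentiment = label(4.5)
```

The resulting token lists would then be turned into TF-IDF vectors and split 80/20 into training and test sets before the classifier is fitted.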

Automated brain tumor detection and classification using weighted fuzzy clustering algorithm, deep auto encoder with barnacle mating algorithm and random forest classifier techniques

International Journal of Imaging Systems and Technology ◽ 10.1002/ima.22582 ◽ 2021

Author(s): Shenbagarajan Anantharajan ◽ Shenbagalakshmi Gunasekaran

Keyword(s): Brain Tumor ◽ Random Forest ◽ Fuzzy Clustering ◽ Clustering Algorithm ◽ Tumor Detection ◽ Random Forest Classifier ◽ Fuzzy Clustering Algorithm
