document frequency
Recently Published Documents


TOTAL DOCUMENTS

376
(FIVE YEARS 241)

H-INDEX

15
(FIVE YEARS 4)

Author(s):  
Sujatha Arun Kokatnoor ◽  
Balachandran Krishnan

This research seeks the reasons behind new COVID-19 cases in India as reflected in public perception. The analysis uses machine learning approaches, and the inferences are validated with medical professionals. Data processing and analysis proceed in three steps. First, the dimensionality of the vector space model (VSM) is reduced through an improvised feature engineering (FE) process that uses weighted term frequency-inverse document frequency (TF-IDF) and forward scan trigrams (FST), followed by removal of weak features with a feature hashing technique. Second, an enhanced K-means clustering algorithm groups public posts from Twitter. Third, latent Dirichlet allocation (LDA) is applied to discover the trigram topics relevant to the reasons behind the increase in new COVID-19 cases. The enhanced K-means clustering improved the Dunn index by 18.11% compared with the traditional K-means method. With the improvised two-step FE process, the LDA model improved by 14% in coherence score, and by 19% and 15% compared with latent semantic analysis (LSA) and the hierarchical Dirichlet process (HDP), respectively, yielding 14 root causes for the spike in the disease.
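
The first step above weights trigram features by TF-IDF. The abstract does not specify the weighting scheme or how forward scan trigrams are built, so the sketch below is only an illustration: it assumes plain contiguous word trigrams and the textbook weight tf × log(N / df). The function names and sample posts are hypothetical.

```python
import math
from collections import Counter

def word_trigrams(text):
    """Return contiguous word trigrams, scanning the text left to right."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def tfidf(docs):
    """Compute TF-IDF weights per document over word-trigram features.

    tf  = occurrences of the trigram in the document / trigrams in the document
    idf = log(N / df), where df is the number of documents containing the trigram
    """
    features = [Counter(word_trigrams(d)) for d in docs]
    n_docs = len(docs)
    df = Counter()
    for counts in features:
        df.update(counts.keys())  # each document counts once per trigram
    weights = []
    for counts in features:
        total = sum(counts.values())
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical posts, used only to exercise the functions.
docs = [
    "cases rise after large public gatherings",
    "cases rise after lockdown rules relax",
    "vaccine supply delays slow the rollout",
]
weights = tfidf(docs)
```

A trigram shared by two posts ("cases rise after") receives a lower weight than one unique to a single post, which is the effect the document-frequency term is meant to produce.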


Author(s):  
Parita Shah ◽  
Priya Swaminarayan ◽  
Maitri Patel

Opinion analysis is among the most fundamental areas of natural language processing. It deals with characterizing text to determine the intent of its source, which may be appreciation (positive) or criticism (negative). This paper compares the results obtained by applying classification algorithms with different classifiers, for example K-nearest neighbors (KNN) and multinomial naive Bayes (MNB). These techniques are used to label an opinion as either a positive or a negative comment. The data come from polarity film review datasets, and the results are compared against available evidence for a careful assessment. The paper investigates the influence of the word-level count vectorizer and term frequency-inverse document frequency (TF-IDF) on film sentiment analysis. We conclude that the MNB classifier produces more accurate results with the TF-IDF vectorizer than with CountVectorizer, while the KNN classifier achieves the same accuracy with both.
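
To make the classifier side of this comparison concrete, here is a minimal multinomial naive Bayes sketch with Laplace smoothing. It is not the authors' code: the function names, the four training reviews, and the smoothing value `alpha=1.0` are assumptions. Features are passed as `{term: weight}` dictionaries, so either raw counts or TF-IDF weights could be supplied.

```python
import math
from collections import Counter, defaultdict

def train_mnb(samples, alpha=1.0):
    """Fit multinomial naive Bayes on (feature_weights, label) pairs."""
    class_counts = Counter()
    feature_totals = defaultdict(Counter)
    vocab = set()
    for features, label in samples:
        class_counts[label] += 1
        for term, weight in features.items():
            feature_totals[label][term] += weight
            vocab.add(term)
    n = sum(class_counts.values())
    model = {}
    for label, count in class_counts.items():
        # Smoothed total mass of the class over the whole vocabulary
        denom = sum(feature_totals[label].values()) + alpha * len(vocab)
        log_likelihood = {
            t: math.log((feature_totals[label][t] + alpha) / denom) for t in vocab
        }
        model[label] = (math.log(count / n), log_likelihood)
    return model

def predict_mnb(model, features):
    """Return the label with the highest log posterior; unseen terms are skipped."""
    def score(label):
        prior, log_likelihood = model[label]
        return prior + sum(
            w * log_likelihood[t] for t, w in features.items() if t in log_likelihood
        )
    return max(model, key=score)

def counts(text):
    """Word-level count features for one review."""
    return Counter(text.lower().split())

# Hypothetical training reviews, for illustration only.
train = [
    (counts("great film loved it"), "pos"),
    (counts("wonderful acting great story"), "pos"),
    (counts("boring film hated it"), "neg"),
    (counts("terrible acting dull story"), "neg"),
]
model = train_mnb(train)
```

Calling `predict_mnb(model, counts("great story"))` scores the review against both classes and returns the more likely label.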


2022 ◽  
pp. 155-170
Author(s):  
Lap-Kei Lee ◽  
Kwok Tai Chui ◽  
Jingjing Wang ◽  
Yin-Chun Fung ◽  
Zhanhui Tan

Dependence on the Internet in daily life is ever-growing, which provides an opportunity to discover valuable and subjective information using advanced techniques such as natural language processing and artificial intelligence. In this chapter, the research focus is a convolutional neural network for three-class (positive, neutral, and negative) cross-domain sentiment analysis. The model is enhanced in two ways. First, a similarity label method facilitates the management between the source and target domains to generate more labelled data. Second, term frequency-inverse document frequency (TF-IDF) and latent semantic indexing (LSI) are employed to compute the similarity between source and target domains. Performance evaluation is conducted using three datasets: beauty reviews, toy reviews, and phone reviews. The proposed method improves accuracy by 4.3-7.6% and reduces training time by 50%. The limitations of the work are discussed and serve as the rationale for future research directions.
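
The abstract does not say which similarity measure is applied to the TF-IDF and LSI representations. Cosine similarity is a common choice for such vectors, so the sketch below assumes it; the review vectors and their weights are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF weights for one source-domain and one target-domain review.
source_review = {"battery": 0.8, "screen": 0.5, "price": 0.2}
target_review = {"price": 0.6, "colour": 0.7, "screen": 0.3}
similarity = cosine_similarity(source_review, target_review)
```

Only the terms shared by both reviews ("screen" and "price") contribute to the dot product, so reviews from different domains with little vocabulary overlap score close to zero.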


2022 ◽  
Vol 12 (1) ◽  
pp. 0-0

Retrieving keywords from a text has attracted researchers for a long time, as it forms the basis for many natural language applications such as information retrieval, text summarization, and document categorization. A text is a collection of words that naturally represent its theme, and capturing that naturalness under fixed rules is itself a challenging task. In the present paper, the authors evaluate different spatial distribution-based keyword extraction methods available in the literature on three standard scientific texts. Because all the methods are in some way based on frequency, the authors evaluate only the first few high-frequency words to reduce complexity. They find that the methods do not give good results, particularly for the first few retrieved words. The authors therefore propose a new measure based on frequency, inverse document frequency, variance, and Tsallis entropy. The methods are evaluated on precision, recall, and F-measure. Results show that the proposed method provides improved results.
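
The abstract names Tsallis entropy as one ingredient but does not give the combined measure, so the sketch below shows only the standard definition, S_q = (1 - Σ p_i^q) / (q - 1), applied to how a word's occurrences spread across equal-width segments of a text. The segmentation scheme, the default `q=2.0`, and the toy token lists are assumptions, not the authors' method.

```python
def tsallis_entropy(probabilities, q=2.0):
    """Tsallis entropy S_q = (1 - sum(p_i ** q)) / (q - 1), defined for q != 1."""
    if q == 1:
        raise ValueError("q = 1 is the Shannon limit; use Shannon entropy instead")
    return (1.0 - sum(p ** q for p in probabilities if p > 0)) / (q - 1.0)

def segment_distribution(tokens, word, n_segments=4):
    """Share of a word's occurrences falling in each equal-width segment of the text."""
    counts = [0] * n_segments
    for i, token in enumerate(tokens):
        if token == word:
            counts[i * n_segments // len(tokens)] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError(f"{word!r} does not occur in the text")
    return [c / total for c in counts]

# Toy texts: "a" is spread evenly in the first and clustered at the start of the second.
spread = segment_distribution("a x a x a x a x".split(), "a")
clustered = segment_distribution("a a x x x x x x".split(), "a")
entropy_spread = tsallis_entropy(spread)
entropy_clustered = tsallis_entropy(clustered)
```

A word whose occurrences are concentrated in one segment has entropy 0, while an evenly spread word reaches the maximum for the chosen number of segments.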


Author(s):  
Charan Lokku

Abstract: To reduce fraudulent job postings on the internet, we use a machine learning approach to predict the chance that a job posting is fake, so that candidates can stay alert and make informed decisions. The model uses NLP to analyze the sentiment and patterns in the job posting and a TF-IDF vectorizer for feature extraction. We use the Synthetic Minority Oversampling Technique (SMOTE) to balance the data and a Random Forest for classification, which predicts with high accuracy, runs efficiently even on large datasets, and helps prevent overfitting. The final model takes in any relevant job posting data and determines whether the job is real or fake. Keywords: Natural Language Processing (NLP), Term Frequency-Inverse Document Frequency (TF-IDF), Synthetic Minority Oversampling Technique (SMOTE), Random Forest.
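
The balancing step can be illustrated with the core idea of SMOTE: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified sketch, not the authors' pipeline; the function name, parameters, and the four sample points are hypothetical, and it needs at least two minority samples.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding the point itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class feature vectors (for example, fake postings).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote(minority, n_new=5)
```

Because every new point lies between two existing minority samples, the synthetic data stay inside the region the minority class already occupies rather than duplicating rows exactly.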


2021 ◽  
Vol 12 (1) ◽  
pp. 347
Author(s):  
Min-Young Seo ◽  
Se-Yun Hwang ◽  
Jang-Hyun Lee ◽  
Jae-Gon Kim ◽  
Hong-Bae Jun

There are two types of maintenance policies for equipment: breakdown maintenance and preventive maintenance. In the case of applying preventive maintenance, the maintenance is carried out based on time or the condition of the equipment. However, with the development of Information and Communications Technologies (ICT) and the Internet of Things (IoT) technology, the data collected from equipment has rapidly increased and the use of Condition-Based Maintenance (CBM) to perform appropriate maintenance based on the condition of the equipment is increasing. In this study, based on gathered sensor data, we introduce an approach to diagnosing the condition of the equipment by extracting specific data features related to the types of failures that occur with equipment. To this end, we used the K-means clustering method, support vector machine (SVM) classifier, and Pattern Frequency–Inverse Failure mode Frequency (PF–IFF) method with the Term Frequency–Inverse Document Frequency (TF–IDF) method. As a case study, we applied the proposed approach to a centrifugal pump and carried out computational experiments for assessing the performance and validity of the proposed approach.


2021 ◽  
Author(s):  
Yadi Zhao ◽  
Zhifeng Wei ◽  
Bingqiang Gao ◽  
Shuo Zhang

With the completion of the State Grid Corporation’s maintenance system, the number of substations has increased dramatically, the grid structure has become increasingly complex, and equipment failures occur from time to time for internal and external reasons, such as unforeseen emergencies. This paper aims to explore the potential value of massive data, reveal patterns in business data, strengthen the support that data provide for enterprise operation and production management, and promote intelligent and lean core business for the power grid. It uses power system data, analyzed with ANOVA and neural network statistical analysis, to provide reliable data support for full-cycle equipment defect management and equipment state analysis. We also use the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm to calculate the importance of keywords and construct a power keyword library. A Bayesian text classification model then classifies defect parts, defect categories, and defect causes automatically. The method can be applied to build a text analysis system for power grid production work orders, improving data quality and the level of system automation, helping business departments work more efficiently, and providing a basis for power grid business analysis. Applied to data cleaning for the primary production equipment of power grid enterprises, the method corrects data errors for equipment defects at voltages above 110 kV with an accuracy between 93% and 95%.


2021 ◽  
Vol 6 (3) ◽  
pp. 236-251
Author(s):  
Novira Azpiranda ◽  
Ahmad Afif Supianto ◽  
Nanang Yudi Setiawan ◽  
Endang Suryawati ◽  
R. Sandra Yuwana ◽  
...  

Al-Ghiff Steak is a restaurant located in Cirebon City that offers quality steaks at affordable prices. To maintain its competitive advantage and reputation, Al-Ghiff Steak needs to build good relationships with customers and adopt a business strategy that considers customer opinions. In practice, however, Al-Ghiff Steak has difficulty collecting and processing customer review data manually. It is therefore necessary to conduct sentiment analysis on Google Reviews to determine customer perspectives on Al-Ghiff Steak products and services. This analysis was conducted on 968 Google Reviews from 2016 to 2020 using the Support Vector Machine (SVM) and Term Frequency-Inverse Document Frequency (TF-IDF) methods. Classification is tested with a confusion matrix against four parameters: accuracy, precision, recall, and f1-score. SVM with TF-IDF achieves an accuracy of 83%, precision of 64%, recall of 60%, and f1-score of 59%. The sentiment classification result is then visualized in the form of a dashboard. We use the System Usability Scale (SUS) for usability testing, which produces a value of 77.5. This result falls in the Acceptable category with an Excellent rating.
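
The four evaluation parameters can be derived from true and predicted labels as sketched below. The abstract does not state how the per-class scores were averaged, so macro averaging is an assumption here, and the label lists are hypothetical.

```python
def classification_report(y_true, y_pred):
    """Return accuracy, macro (precision, recall, f1), and per-class scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class[label] = (precision, recall, f1)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Macro average: unweighted mean of each score over the classes
    macro = tuple(sum(scores[i] for scores in per_class.values()) / len(labels) for i in range(3))
    return accuracy, macro, per_class

# Hypothetical labels for six reviews.
y_true = ["pos", "pos", "pos", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "neg", "neg", "pos"]
accuracy, macro, per_class = classification_report(y_true, y_pred)
```

Accuracy counts every correct prediction equally, whereas the macro scores give each class the same weight, which is why accuracy can sit well above precision and recall when classes are imbalanced.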


2021 ◽  
Vol 02 (02) ◽  
Author(s):  
Mohammed A. Ahmed ◽  
Hanif Baharin ◽  
Puteri N. E. Nohuddin ◽  
...  

Al-Quran is the primary text of Muslims’ religion and practice. Millions of Muslims around the world use al-Quran as their reference guide, so Muslims and Islamic scholars in general can obtain knowledge from it. Al-Quran has been translated into various languages, for example English, by several translators. Each translator brings his own ideas, comments, and statements to the translation of the verses (Tafseer). This paper therefore tries to cluster the translations of the Tafseer using text clustering, a text mining method that groups related documents into the same cluster. The study adapted two clustering algorithms, mini-batch k-means and k-means, to explain and define the link between keywords, known as features or concepts, for the Al-Baqarah chapter of 286 verses. For this dataset, data preprocessing and feature extraction using Term Frequency-Inverse Document Frequency (TF-IDF) and Principal Component Analysis (PCA) were applied. The results are shown as two- and three-dimensional cluster plots with seven cluster categories (k = 7) for the Tafseer. The running time of the mini-batch k-means algorithm (0.05485 s) was shorter than that of the k-means algorithm (0.23334 s). Finally, ‘god’, ‘people’, and ‘believe’ were the most frequent features.
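
The speed difference comes from how mini-batch k-means updates its centres: it processes small batches and nudges each centre toward the samples assigned to it with a per-centre learning rate of 1 / count, instead of recomputing every centre from the full dataset on each iteration. The sketch below is a simplified illustration, not the study's implementation: it walks through the data in fixed sequential batches (the usual algorithm samples batches at random), and the points and initial centres are hypothetical.

```python
def nearest(centers, x):
    """Index of the centre closest to x by squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(centers[j], x)),
    )

def mini_batch_kmeans(points, init_centers, batch_size=2, epochs=5):
    """Simplified mini-batch k-means with a per-centre learning rate of 1 / count."""
    centers = [list(c) for c in init_centers]
    counts = [0] * len(centers)
    for _ in range(epochs):
        for start in range(0, len(points), batch_size):
            batch = points[start:start + batch_size]
            # Assign the whole batch first, then update the centres
            assignments = [nearest(centers, x) for x in batch]
            for x, j in zip(batch, assignments):
                counts[j] += 1
                eta = 1.0 / counts[j]  # shrinks as the centre sees more samples
                centers[j] = [(1 - eta) * c + eta * xi for c, xi in zip(centers[j], x)]
    return centers

# Hypothetical 2-D points (for example, documents after PCA) forming two groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
centers = mini_batch_kmeans(points, init_centers=[(0.0, 0.0), (10.0, 10.0)])
```

With this learning rate each centre is the running mean of every sample it has been assigned, so on well-separated data it settles at the group mean while touching only a few points per update.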


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Qianyao Zhu

In view of the lack of accurate course recommendation and selection on online teaching platforms in the new form of higher education, a network course recommendation system based on a double-layer attention mechanism is proposed. First, the collected data are preprocessed, and the student and course information is normalized and classified. Then, the dual attention mechanism is introduced into the parallel neural network recommendation model to improve the model’s ability to mine important features. TF-IDF (term frequency-inverse document frequency) is improved using the student score and course category. The recommendation results are classified according to the weight of the course categories, so as to construct different types of course groups and complete the recommendation. The experimental results show that the proposed algorithm effectively improves recommendation accuracy compared with other algorithms.

