The Research and Implementation of Keyword Extraction Technology

Keyword extraction plays an important role in abstract, information retrieval, data mining, text clustering etc. Extracting the keywords from a document can increases the efficiency of retrieval, thus provide great help to efficiently organize the resource. Few writers on the Internet have given the keywords of a document. Artificially extracting the keywords of a document is a great deal of work, so we need a method of extracting the keywords automatically. The paper constructing a verb, function words, stop words etc. small library from the perspective of the Chinese part of speech, realize rapid word segmentation based on the research, analysis, improvement of traditional lexical maximum matching points, and analyze, realize extracting the keywords based on TFIDF(Term Frequency Inverse Document Frequency).

Download Full-text

Improving keyword extraction in multilingual texts

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v10i6.pp5909-5916 ◽

2020 ◽

Vol 10 (6) ◽

pp. 5909

Author(s):

Bahare Hashemzahde ◽

Majid Abdolrazzagh-Nezhad

Keyword(s):

Extraction Procedure ◽

The Other ◽

Keyword Extraction ◽

Inverse Document Frequency ◽

Retrieval Systems ◽

Document Frequency ◽

Extraction Algorithm ◽

Information Retrieval Systems ◽

Multilingual Text ◽

Available Information

The accuracy of keyword extraction is a leading factor in information retrieval systems and marketing. In the real world, text is produced in a variety of languages, and the ability to extract keywords based on information from different languages improves the accuracy of keyword extraction. In this paper, the available information of all languages is applied to improve a traditional keyword extraction algorithm from a multilingual text. The proposed keywork extraction procedure is an unsupervise algorithm and designed based on selecting a word as a keyword of a given text, if in addition to that language holds a high rank based on the keywords criteria in other languages, as well. To achieve to this aim, the average TF-IDF of the candidate words were calculated for the same and the other languages. Then the words with the higher averages TF-IDF were chosen as the extracted keywords. The obtained results indicat that the algorithms’ accuracis of the multilingual texts in term frequency-inverse document frequency (TF-IDF) algorithm, graph-based algorithm, and the improved proposed algorithm are 80%, 60.65%, and 91.3%, respectively.

Download Full-text

A comparative study of keyword extraction algorithms for English texts

Journal of Intelligent Systems ◽

10.1515/jisys-2021-0040 ◽

2021 ◽

Vol 30 (1) ◽

pp. 808-815

Author(s):

Jinye Li

Keyword(s):

English Literature ◽

Recall Rate ◽

English Text ◽

Keyword Extraction ◽

Keyphrase Extraction ◽

Inverse Document Frequency ◽

Document Frequency ◽

Analysis Experiment ◽

Extraction Algorithm ◽

Precision Rate

Abstract This study mainly analyzed the keyword extraction of English text. First, two commonly used algorithms, the term frequency–inverse document frequency (TF–IDF) algorithm and the keyphrase extraction algorithm (KEA), were introduced. Then, an improved TF–IDF algorithm was designed, which improved the calculation of word frequency, and it was combined with the position weight to improve the performance of keyword extraction. Finally, 100 English literature was selected from the British Academic Written English Corpus for the analysis experiment. The results showed that the improved TF–IDF algorithm had the shortest running time and took only 4.93 s in processing 100 texts; the precision of the algorithms decreased with the increase of the number of extracted keywords. The comparison between the two algorithms demonstrated that the improved TF–IDF algorithm had the best performance, with a precision rate of 71.2%, a recall rate of 52.98%, and an F 1 score of 60.75%, when five keywords were extracted from each article. The experimental results show that the improved TF–IDF algorithm is effective in extracting English text keywords, which can be further promoted and applied in practice.

Download Full-text

Peringkasan multi-dokumen berita berdasarkan fitur berita dan part of speech tagging

10.26594/register.v4i2.1251 ◽

2018 ◽

Vol 4 (2) ◽

pp. 56

Author(s):

Moch. Zawaruddin Abdullah ◽

Chastine Fatichah

Keyword(s):

Weighting Method ◽

Inverse Document Frequency ◽

Information Approach ◽

Part Of Speech Tagging ◽

Pos Tagging ◽

Part Of Speech ◽

Document Frequency ◽

Sentence Position ◽

Text Preprocessing ◽

Speech Tagging

News Feature Scoring (NeFS) merupakan metode pembobotan kalimat yang sering digunakan untuk melakukan pembobotan kalimat pada peringkasan dokumen berdasarkan fitur berita. Beberapa fitur berita diantaranya seperti word frequency, sentence position, Term Frequency-Inverse Document Frequency (TF-IDF), dan kemiripan kalimat terhadap judul. Metode NeFS mampu memilih kalimat penting dengan menghitung frekuensi kata dan mengukur similaritas kata antara kalimat dengan judul. Akan tetapi pembobotan dengan metode NeFS tidak cukup, karena metode tersebut mengabaikan kata informatif yang terkandung dalam kalimat. Kata-kata informatif yang terkandung pada kalimat dapat mengindikasikan bahwa kalimat tersebut penting. Penelitian ini bertujuan untuk melakukan pembobotan kalimat pada peringkasan multi-dokumen berita dengan pendekatan fitur berita dan informasi gramatikal (NeFGIS). Informasi gramatikal yang dibawa oleh part of speech tagging (POS Tagging) dapat menunjukkan adanya konten informatif. Pembobotan kalimat dengan pendekatan fitur berita dan informasi gramatikal diharapkan mampu memilih kalimat representatif secara lebih baik dan mampu meningkatkan kualitas hasil ringkasan. Pada penelitian ini terdapat 4 tahapan yang dilakukan antara lain seleksi berita, text preprocessing, sentence scoring, dan penyusunan ringkasan. Untuk mengukur hasil ringkasan menggunakan metode evaluasi Recall-Oriented Understudy for Gisting Evaluation (ROUGE) dengan empat varian fungsi yaitu ROUGE-1, ROUGE-2, ROUGE-L, dan ROUGE-SU4. Hasil ringkasan menggunakan metode yang diusulkan (NeFGIS) dibandingkan dengan hasil ringkasan menggunakan metode pembobotan dengan pendekatan fitur berita dan trending issue (NeFTIS). Metode NeFGIS memberikan hasil yang lebih baik dengan peningkatan nilai untuk fungsi recall pada ROUGE-1, ROUGE-2, ROUGE-L, dan ROUGE-SU4 secara berturut-turut adalah 20,37%, 33,33%, 1,85%, 23,14%. News Feature Scoring (NeFS) is a sentence weighting method that used to weight the sentences in document summarization based on news features. There are several news features including word frequency, sentence position, Term Frequency-Inverse Document Frequency (TF-IDF), and sentences resemblance to the title. The NeFS method is able to select important sentences by calculating the frequency of words and measuring the similarity of words between sentences and titles. However, NeFS weighting method is not enough, because the method ignores the informative word in the sentence. The informative words contained in the sentence can indicate that the sentence is important. This study aims to weight the sentence in news multi-document summarization with news feature and grammatical information approach (NeFGIS). Grammatical information carried by part of speech tagging (POS Tagging) can indicate the presence of informative content. Sentence weighting with news features and grammatical information approach is expected to be able to determine sentence representatives better and be able to improve the quality of the summary results. In this study, there are 4 stages that are carried out including news selection, text preprocessing, sentence scoring, and compilation of summaries. Recall-Oriented Understanding for Gisting Evaluation (ROUGE) is used to measure the summary results with four variants of function; ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-SU4. Summary results using the proposed method (NeFGIS) are compared with summary results using sentence weighting methods with news feature and trending issue approach (NeFTIS). The NeFGIS method provides better results with increased value for recall functions in ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-SU4 respectively 20.37%, 33.33%, 1.85%, 23.14%.

Download Full-text

A comparative study of sentiment analysis using SVM and SentiWordNet

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v13.i3.pp902-909 ◽

2019 ◽

Vol 13 (3) ◽

pp. 902 ◽

Cited By ~ 7

Author(s):

Mohammad Fikri ◽

Riyanarto Sarno

Keyword(s):

Sentiment Analysis ◽

Extraction Method ◽

Support Vector ◽

The Internet ◽

Imbalanced Dataset ◽

Rule Based ◽

Inverse Document Frequency ◽

Feature Extraction Method ◽

Document Frequency ◽

Svm Algorithm

<p><span>Sentiment analysis has grown rapidly which impact on the number of services using the internet popping up in Indonesia. In this research, the sentiment analysis uses the rule-based method with the help of SentiWordNet and Support Vector Machine (SVM) algorithm with Term Frequency–Inverse Document Frequency (TF-IDF) as feature extraction method. Since the number of sentences in positive, negative and neutral classes is imbalanced, the oversampling method is implemented. For imbalanced dataset, the rule-based SentiWordNet and SVM algorithm achieve accuracies of 56% and 76%, respectively. However, for the balanced dataset, the rule-based SentiWordNet and SVM algorithm achieve accuracies of 52% and 89%, respectively.</span></p>

Download Full-text

Extreme Gradient Boosting for Cyberpropaganda Detection

10.3233/faia210012 ◽

2021 ◽

Author(s):

Jaouhar Fattahi ◽

Mohamed Mejri ◽

Marwa Ziadia

Keyword(s):

Machine Learning ◽

Social Networks ◽

Gradient Boosting ◽

The Internet ◽

Bag Of Words ◽

Fake News ◽

Inverse Document Frequency ◽

Document Frequency ◽

Extreme Gradient Boosting ◽

New Phenomena

Propaganda, defamation, abuse, insults, disinformation and fake news are not new phenomena and have been around for several decades. However, with the advent of the Internet and social networks, their magnitude has increased and the damage caused to individuals and corporate entities is becoming increasingly greater, even irreparable. In this paper, we tackle the detection of text-based cyberpropaganda using Machine Learning and NLP techniques. We use the eXtreme Gradient Boosting (XGBoost) algorithm for learning and detection, in tandem with Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) for text vectorization. We highlight the contribution of gradient boosting and regularization mechanisms in the performance of the explored model.

Download Full-text

A comparative evaluation of different keyword extraction techniques

International Journal of Information Retrieval Research ◽

10.4018/ijirr.289573 ◽

2022 ◽

Vol 12 (1) ◽

pp. 0-0

Keyword(s):

High Frequency ◽

Extraction Methods ◽

Text Summarization ◽

Keyword Extraction ◽

Extraction Techniques ◽

Scientific Texts ◽

Inverse Document Frequency ◽

Document Frequency ◽

Long Time ◽

Document Categorization

Retrieving keywords in a text is attracting researchers for a long time as it forms a base for many natural language applications like information retrieval, text summarization, document categorization etc. A text is a collection of words that represent the theme of the text naturally and to bring the naturalism under certain rules is itself a challenging task. In the present paper, the authors evaluate different spatial distribution based keyword extraction methods available in the literature on three standard scientific texts. The authors choose the first few high-frequency words for evaluation to reduce the complexity as all the methods are somehow based on frequency. The authors find that the methods are not providing good results particularly in the case of the first few retrieved words. Thus, the authors propose a new measure based on frequency, inverse document frequency, variance, and Tsallis entropy. Evaluation of different methods is done on the basis of precision, recall, and F-measure. Results show that the proposed method provides improved results.

Download Full-text

Large expert-curated database for benchmarking document similarity detection in biomedical literature search

Database ◽

10.1093/database/baz085 ◽

2019 ◽

Vol 2019 ◽

Author(s):

Peter Brown ◽

Aik-Choon Tan ◽

Mohamed A El-Esawi ◽

Thomas Liehr ◽

Oliver Blanck ◽

...

Keyword(s):

Literature Search ◽

Relevant Literature ◽

Biomedical Literature ◽

Medical Subject Headings ◽

Document Similarity ◽

Inverse Document Frequency ◽

Research Fields ◽

Experience Levels ◽

Document Frequency ◽

Systematic Biases

Abstract Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that cover a variety of research fields such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article/s. The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency–Inverse Document Frequency and PubMed Related Articles) had similar overall performances. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for the downloading of annotation data and the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new powerful techniques for title and title/abstract-based search engines for relevant articles in biomedical research.

Download Full-text

Poisson mixtures

Natural Language Engineering ◽

10.1017/s1351324900000139 ◽

1995 ◽

Vol 1 (2) ◽

pp. 163-190 ◽

Cited By ~ 146

Author(s):

Kenneth W. Church ◽

William A. Gale

Keyword(s):

Negative Binomial ◽

Probability Distributions ◽

Hidden Variables ◽

Heterogeneous Structure ◽

Text Compression ◽

Inverse Document Frequency ◽

Poisson Mixtures ◽

Document Frequency ◽

Wide Range ◽

Better Than

AbstractShannon (1948) showed that a wide range of practical problems can be reduced to the problem of estimating probability distributions of words and ngrams in text. It has become standard practice in text compression, speech recognition, information retrieval and many other applications of Shannon's theory to introduce a “bag-of-words” assumption. But obviously, word rates vary from genre to genre, author to author, topic to topic, document to document, section to section, and paragraph to paragraph. The proposed Poisson mixture captures much of this heterogeneous structure by allowing the Poisson parameter θ to vary over documents subject to a density function φ. φ is intended to capture dependencies on hidden variables such genre, author, topic, etc. (The Negative Binomial is a well-known special case where φ is a Г distribution.) Poisson mixtures fit the data better than standard Poissons, producing more accurate estimates of the variance over documents (σ2), entropy (H), inverse document frequency (IDF), and adaptation (Pr(x ≥ 2/x ≥ 1)).

Download Full-text

Inverse document frequency-based sensitivity scoring for privacy analysis

Signal Image and Video Processing ◽

10.1007/s11760-021-02013-1 ◽

2021 ◽

Author(s):

Onder Coban ◽

Ali Inan ◽

Selma Ayse Ozel

Keyword(s):

Inverse Document Frequency ◽

Document Frequency ◽

Privacy Analysis

Download Full-text

Efficient natural language classification algorithm for detecting duplicate unsupervised features

Informatics and Automation - Информатика и автоматизация ◽

10.15622/ia.2021.3.5 ◽

2021 ◽

Vol 20 (3) ◽

pp. 623-653

Author(s):

Saud Altaf ◽

Sofia Iqbal ◽

Muhammad Waseem Soomro

Keyword(s):

Natural Language ◽

Short Term Memory ◽

Short Term ◽

Vocabulary Size ◽

Language Understanding ◽

Inverse Document Frequency ◽

Classification Technique ◽

Document Frequency ◽

Text Features ◽

Long Short Term Memory

This paper focuses on capturing the meaning of Natural Language Understanding (NLU) text features to detect the duplicate unsupervised features. The NLU features are compared with lexical approaches to prove the suitable classification technique. The transfer-learning approach is utilized to train the extraction of features on the Semantic Textual Similarity (STS) task. All features are evaluated with two types of datasets that belong to Bosch bug and Wikipedia article reports. This study aims to structure the recent research efforts by comparing NLU concepts for featuring semantics of text and applying it to IR. The main contribution of this paper is a comparative study of semantic similarity measurements. The experimental results demonstrate the Term Frequency–Inverse Document Frequency (TF-IDF) feature results on both datasets with reasonable vocabulary size. It indicates that the Bidirectional Long Short Term Memory (BiLSTM) can learn the structure of a sentence to improve the classification.

Download Full-text