Improving keyword extraction in multilingual texts

Author(s):  
Bahare Hashemzahde ◽  
Majid Abdolrazzagh-Nezhad

The accuracy of keyword extraction is a leading factor in information retrieval systems and marketing. In the real world, text is produced in a variety of languages, and the ability to extract keywords using information from different languages improves the accuracy of keyword extraction. In this paper, the available information from all languages is applied to improve a traditional keyword extraction algorithm on multilingual text. The proposed keyword extraction procedure is an unsupervised algorithm designed around the idea that a word is selected as a keyword of a given text if, in addition to ranking highly on the keyword criteria in its own language, it also ranks highly in the other languages. To achieve this aim, the average term frequency-inverse document frequency (TF-IDF) of the candidate words was calculated over the same and the other languages. Then the words with the highest average TF-IDF were chosen as the extracted keywords. The obtained results indicate that the accuracies of the TF-IDF algorithm, the graph-based algorithm, and the proposed improved algorithm on multilingual texts are 80%, 60.65%, and 91.3%, respectively.
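
Sketched below (not from the paper) is one way to read the averaging step: TF-IDF is computed for a candidate word in each language's version of the text, and the per-language scores are averaged. The `translate` alignment function and the parallel-corpus inputs are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(word, doc, corpus):
    """Plain TF-IDF of `word` in `doc` relative to `corpus` (a list of token lists)."""
    tf = Counter(doc)[word] / len(doc)
    df = sum(1 for d in corpus if word in d)
    return tf * math.log(len(corpus) / (1 + df))

def multilingual_keywords(docs_by_lang, corpora_by_lang, translate, top_k=10):
    """Score each candidate word in the base language by its TF-IDF
    averaged across all languages, then keep the top_k highest averages.
    `translate(word, lang)` is a hypothetical word-alignment function."""
    base_lang = next(iter(docs_by_lang))
    scores = {}
    for word in set(docs_by_lang[base_lang]):
        vals = []
        for lang, doc in docs_by_lang.items():
            w = word if lang == base_lang else translate(word, lang)
            vals.append(tf_idf(w, doc, corpora_by_lang[lang]))
        scores[word] = sum(vals) / len(vals)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```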

2021 ◽  
Vol 30 (1) ◽  
pp. 808-815
Author(s):  
Jinye Li

Abstract This study mainly analyzed keyword extraction from English text. First, two commonly used algorithms, the term frequency-inverse document frequency (TF-IDF) algorithm and the keyphrase extraction algorithm (KEA), were introduced. Then, an improved TF-IDF algorithm was designed that refines the calculation of word frequency and combines it with a position weight to improve keyword extraction performance. Finally, 100 English texts were selected from the British Academic Written English Corpus for the analysis experiment. The results showed that the improved TF-IDF algorithm had the shortest running time, taking only 4.93 s to process the 100 texts, and that the precision of the algorithms decreased as the number of extracted keywords increased. The comparison between the algorithms demonstrated that the improved TF-IDF algorithm had the best performance, with a precision of 71.2%, a recall of 52.98%, and an F1 score of 60.75% when five keywords were extracted from each article. The experimental results show that the improved TF-IDF algorithm is effective in extracting keywords from English text and can be further promoted and applied in practice.
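
The abstract does not spell out the position-weighting formula; the sketch below shows one plausible reading in which TF-IDF is multiplied by a boost for words that also occur in the title. `title_boost` is an illustrative parameter, not a value from the study.

```python
import math
from collections import Counter

def positional_tf_idf(doc_tokens, corpus, title_tokens, title_boost=2.0):
    """TF-IDF combined with a simple position weight: words that also
    appear in the title are boosted. This is a hedged sketch, not the
    paper's exact weighting scheme."""
    n_docs = len(corpus)
    tf = Counter(doc_tokens)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in corpus if word in d)
        idf = math.log(n_docs / (1 + df))
        pos_w = title_boost if word in title_tokens else 1.0
        scores[word] = (count / len(doc_tokens)) * idf * pos_w
    return scores

# Example: keep the five highest-scoring words, as in the evaluation
# top5 = sorted(scores, key=scores.get, reverse=True)[:5]
```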


2008 ◽  
Vol 3 (4) ◽  
pp. 18 ◽  
Author(s):  
Jeffrey Beall ◽  
Karen Kafadar

Objective – This article measures the extent of the synonym problem in full-text searching. The synonym problem occurs when a search misses documents because it was based on a synonym rather than a more familiar term. Methods – We considered a sample of 90 single-word synonym pairs and searched for each word in the pair, both singly and jointly, in the Yahoo! database. We determined the number of web sites that were missed when only one term, but not the other, was included in the search field. Results – Depending on how common the synonym's usage is, the percentage of missed web sites can vary from almost 0% to almost 100%. A search using a very uncommon synonym ("diaconate") can miss a very high percentage of web pages (95%), whereas a search using the more common term ("deacons") misses only 9%. If both terms in a word pair were nearly equal in usage ("cooks" and "chefs"), then a search on one term but not the other missed almost half the relevant web pages. Conclusion – Our results indicate great value in search engines incorporating automatic synonym searching, not only for user-specified terms but also for high-usage synonyms. Moreover, the results demonstrate the value of information retrieval systems that use controlled vocabularies and cross references to generate search results.
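
As a rough illustration of the measurement (the study's exact counting procedure may differ), the miss rate can be derived from hit counts for one term alone versus the OR of the synonym pair:

```python
def percent_missed(hits_term, hits_either):
    """Share of relevant pages missed when searching one synonym only:
    pages matching either term but not the searched one, as a
    percentage of all pages matching either term."""
    return 100.0 * (hits_either - hits_term) / hits_either

# Illustrative numbers in the spirit of the "diaconate"/"deacons" example:
# percent_missed(hits_term=50_000, hits_either=1_000_000) -> 95.0
```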


1985 ◽  
Vol 107 (3) ◽  
pp. 285-294 ◽  
Author(s):  
J. R. Fries ◽  
F. E. Kennedy

It is important that the modern-day researcher and engineer stay abreast of technology in his field, but this task is made very difficult by the recent flood of scientific and technical information. Coping with the information explosion requires the use of computerized information systems. This paper reviews computer-based information retrieval systems in engineering and focuses specifically on databases of literature and information relevant to tribologists and lubrication engineers. These databases are listed and their characteristics are discussed. Results of a sample computer-based literature search are included. It is shown that no single database has complete coverage of all aspects of tribology and that several databases should be searched to get all available information on a subject.


2019 ◽  
Vol 26 (2) ◽  
pp. 1443-1454 ◽  
Author(s):  
Hamid Naderi ◽  
Sina Madani ◽  
Behzad Kiani ◽  
Kobra Etminani

The ability to automatically categorize submitted questions by topic and suggest similar questions and answers to users reduces the number of redundant questions. Our objective was to compare intra-topic and inter-topic similarity between questions and answers by using concept-based similarity computing analysis. We gathered existing questions and answers from several popular online health communities. Then, Unified Medical Language System (UMLS) concepts related to selected questions and to experts in different topics were extracted and weighted by term frequency-inverse document frequency (TF-IDF) values. Finally, the similarity between the weighted vectors of UMLS concepts was computed. Our results showed a considerable gap between intra-topic and inter-topic similarities: the average intra-topic similarity (0.095, 0.192, and 0.110, respectively) was higher than the average inter-topic similarity (0.012, 0.025, and 0.018, respectively) for questions from the top three popular online communities, NetWellness, WebMD, and Yahoo Answers. Similarity scores between the content of questions answered by experts in the same and in different topics were calculated as 0.51 and 0.11, respectively. Concept-based similarity computing methods can be used to develop intelligent question-answering retrieval systems that automatically recommend similar questions and experts.
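
A minimal sketch of the final similarity step, assuming each question has already been mapped to a dict of TF-IDF-weighted UMLS concepts; the concept identifiers in the comment are illustrative:

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors of TF-IDF-weighted
    concepts, each represented as a {concept_id: weight} dict."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[c] * vec_b[c] for c in shared)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# e.g. vectors keyed by UMLS concept IDs (illustrative weights):
# q1 = {"C0011849": 0.42, "C0020538": 0.17}
# q2 = {"C0011849": 0.35, "C0027051": 0.22}
# cosine_similarity(q1, q2)
```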


2014 ◽  
Vol 602-605 ◽  
pp. 3706-3711
Author(s):  
Hao Chen ◽  
Qin Qun Chen ◽  
Shao Xia YE

In recent years, much research has been devoted to the analysis of 128-bit architectures; on the other hand, few have evaluated the construction of wide-area networks. In fact, few cyberneticists would disagree with the understanding of IPv6. This is an important point to understand. We describe an autonomous tool for developing compilers, which we call ADZ.


Author(s):  
Indrawan Maria ◽  
Loke Seng

The debate on the effectiveness of ontology in solving semantic problems has intensified recently in many domains of information technology. One side of the debate accepts the inclusion of ontology as a suitable solution. The other side argues that ontology is far from an ideal solution to the semantic problem. This article explores this debate in the area of information retrieval. Several past approaches were explored, and a new approach was investigated to test the effectiveness of a generic ontology such as WordNet in improving the performance of information retrieval systems. The tests and the analysis of the experiments suggest that WordNet is far from an ideal solution to semantic problems in information retrieval. However, several observations made and reported in this article allow research on ontology for information retrieval to move in the right direction.
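
For concreteness, a minimal sketch of the kind of WordNet-based query expansion such experiments typically evaluate, using NLTK's WordNet interface (the article's own expansion procedure may differ, and `max_synonyms` is an illustrative parameter):

```python
# Requires: nltk.download('wordnet') beforehand.
from nltk.corpus import wordnet as wn

def expand_query(terms, max_synonyms=3):
    """Add up to `max_synonyms` WordNet synonyms per query term."""
    expanded = set(terms)
    for term in terms:
        for synset in wn.synsets(term):
            for lemma in synset.lemma_names()[:max_synonyms]:
                expanded.add(lemma.replace("_", " ").lower())
    return sorted(expanded)

# expand_query(["car"]) -> ['auto', 'automobile', 'cable car', 'car', ...]
```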


Author(s):  
Hans Christian ◽  
Mikhael Pramodana Agus ◽  
Derwin Suhartono

The increasing availability of online information has triggered intensive research in the area of automatic text summarization within Natural Language Processing (NLP). Text summarization reduces the text by removing less useful information, which helps the reader find the required information quickly. There are many kinds of algorithms that can be used to summarize text; one of them is TF-IDF (Term Frequency-Inverse Document Frequency). This research aimed to produce an automatic text summarizer implemented with the TF-IDF algorithm and to compare it with various online automatic text summarizers. To evaluate the summary produced by each summarizer, the F-measure was used as the standard comparison value. This research produces 67% accuracy on three data samples, which is higher than the other online summarizers.
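
A simplified sketch of a TF-IDF extractive summarizer in the spirit of the approach described, treating each sentence as a document for IDF purposes; this is not the authors' exact system:

```python
import math
import re
from collections import Counter

def summarize(text, n_sentences=3):
    """Score each sentence by the sum of TF-IDF weights of its words
    and return the top-scoring sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    n = len(sentences)
    # Document frequency: in how many sentences does each word occur?
    df = Counter(w for toks in tokenized for w in set(toks))

    def score(toks):
        if not toks:
            return 0.0
        tf = Counter(toks)
        return sum((c / len(toks)) * math.log(n / df[w]) for w, c in tf.items())

    ranked = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)
```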


Author(s):  
José Antonio García-Díaz ◽  
Rafael Valencia-García

Abstract Satirical content on social media is hard to distinguish from real news, misinformation, hoaxes, or propaganda when there are no clues as to which medium the news was originally published in. It is important, therefore, to provide information retrieval systems with mechanisms to identify which results are legitimate and which are misleading. Our contribution to satire identification is twofold. On the one hand, we release the Spanish SatiCorpus 2021, a balanced dataset that contains satirical and non-satirical documents. On the other hand, we conduct an extensive evaluation of this dataset with linguistic features and embedding-based features. All feature sets are evaluated separately and combined using different strategies. Our best result is achieved with a combination of the linguistic features and BERT, with an accuracy of 97.405%. In addition, we compare our proposal with existing Spanish datasets on satire and irony.
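
One common combination strategy is feature concatenation followed by a linear classifier; the sketch below assumes the BERT embeddings and linguistic features have already been extracted, and the shapes are illustrative rather than taken from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_combined(bert_embeddings, linguistic_features, labels):
    """bert_embeddings: (n_docs, 768) pooled document embeddings;
    linguistic_features: (n_docs, k) hand-crafted feature counts;
    labels: 1 = satirical, 0 = non-satirical."""
    X = np.hstack([bert_embeddings, linguistic_features])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    return clf
```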


2014 ◽  
Vol 644-650 ◽  
pp. 2003-2008
Author(s):  
Ya Min Li ◽  
Xian Huan Zhang

Keyword extraction plays an important role in abstracting, information retrieval, data mining, text clustering, etc. Extracting keywords from a document increases the efficiency of retrieval and thus greatly helps in organizing resources efficiently. Few writers on the Internet provide keywords for their documents, and extracting keywords manually is a great deal of work, so a method of extracting keywords automatically is needed. This paper constructs a small lexicon of verbs, function words, stop words, etc. from the perspective of Chinese parts of speech; achieves rapid word segmentation by researching, analyzing, and improving the traditional maximum matching segmentation method; and extracts keywords based on TF-IDF (Term Frequency-Inverse Document Frequency).
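
For reference, the textbook forward maximum matching baseline that the paper improves upon might look like the following; the improved variant itself is not reproduced here:

```python
def forward_max_match(text, lexicon, max_len=4):
    """Classic forward maximum matching segmentation: at each position,
    take the longest dictionary word that matches, falling back to a
    single character."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

# forward_max_match("研究生命起源", {"研究生", "研究", "生命", "起源"})
# -> ['研究生', '命', '起源']  (the classic ambiguity example that
#    motivates improving on plain maximum matching)
```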


2022 ◽  
Vol 12 (1) ◽  
pp. 0-0

Retrieving keywords from a text has attracted researchers for a long time, as it forms the basis for many natural language applications such as information retrieval, text summarization, and document categorization. A text is a collection of words that naturally conveys its theme, and bringing this naturalness under formal rules is itself a challenging task. In the present paper, the authors evaluate different spatial-distribution-based keyword extraction methods available in the literature on three standard scientific texts. The authors choose the first few high-frequency words for evaluation to reduce complexity, as all the methods are in some way based on frequency. The authors find that the methods do not provide good results, particularly for the first few retrieved words. Thus, the authors propose a new measure based on frequency, inverse document frequency, variance, and Tsallis entropy. The different methods are evaluated on the basis of precision, recall, and F-measure. Results show that the proposed method provides improved results.
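
The paper's exact weighting scheme is not given in this abstract; the sketch below illustrates only the Tsallis entropy component over a word's positional distribution, with `q` and the binning as illustrative choices:

```python
import numpy as np

def tsallis_entropy(probs, q=2.0):
    """Tsallis entropy S_q = (1 - sum(p_i^q)) / (q - 1); it reduces to
    Shannon entropy in the limit q -> 1."""
    probs = np.asarray(probs, dtype=float)
    return (1.0 - np.sum(probs ** q)) / (q - 1.0)

def position_distribution(positions, text_length, n_bins=10):
    """Distribution of a word's occurrence positions across equal-width
    bins of the text; clustered (keyword-like) words give a more skewed
    distribution than uniformly spread function words."""
    counts, _ = np.histogram(positions, bins=n_bins, range=(0, text_length))
    return counts / counts.sum()

# Illustrative: a word clustered near the start of a 1000-token text
# tsallis_entropy(position_distribution([10, 12, 15, 900], 1000))
```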

