Comparison of document similarity algorithms in extracting document keywords from an academic paper

A large amount of information never reaches readers because text-based documents are misclassified, and misclassified data can also lead readers to the wrong information. The method proposed in this paper aims to classify documents into the correct group. Each document has a membership value in several different classes. The degree of similarity between two documents is measured by their semantic similarity. No document is entirely unrelated to another, although the relationship may be close to 0. The method calculates the similarity between two documents by taking into account the similarity of their words and of those words' synonyms. Once all inter-document similarity values have been obtained, they are collected in a matrix, which is then used as a semi-supervised factor. The output of the method is the membership value of each document in each class; the greatest membership value of a document indicates the group to which it is assigned. The classification result computed by the method is good, reaching 90%. Index Terms - Fuzzy co-clustering, Heuristic, Semantic Similarity, Semi-supervised learning.
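The pipeline described in this abstract (synonym-aware word similarity, an inter-document similarity matrix, and assignment by greatest membership value) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' method: the `SYNONYMS` table, the 1.0/0.5 weights, the best-match averaging, and the example membership values are all hypothetical, and the fuzzy co-clustering step that would actually produce the memberships is not shown.

```python
from itertools import combinations

# Hypothetical toy synonym table; the abstract does not name the lexical resource used.
SYNONYMS = {
    "car": {"automobile"},
    "automobile": {"car"},
    "fast": {"quick"},
    "quick": {"fast"},
}


def word_similarity(a, b):
    """Return 1.0 for identical words, 0.5 for listed synonyms, else 0.0 (assumed weights)."""
    if a == b:
        return 1.0
    if b in SYNONYMS.get(a, set()):
        return 0.5
    return 0.0


def document_similarity(doc_a, doc_b):
    """Average, over both directions, of each word's best match in the other document."""
    words_a, words_b = doc_a.lower().split(), doc_b.lower().split()
    if not words_a or not words_b:
        return 0.0
    a_to_b = sum(max(word_similarity(w, v) for v in words_b) for w in words_a) / len(words_a)
    b_to_a = sum(max(word_similarity(w, v) for v in words_a) for w in words_b) / len(words_b)
    return (a_to_b + b_to_a) / 2


def similarity_matrix(docs):
    """Build the symmetric inter-document similarity matrix (diagonal set to 1.0)."""
    n = len(docs)
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = document_similarity(docs[i], docs[j])
    return matrix


def assign_groups(memberships):
    """For each document, pick the class index with the greatest membership value."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]


docs = ["fast car", "quick automobile", "green tea"]
matrix = similarity_matrix(docs)

# Hypothetical membership values (documents x classes), standing in for clustering output.
groups = assign_groups([[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]])
```

In this toy setup the first two documents share no literal words but are still related through synonyms, which is the behaviour the abstract attributes to the semantic similarity measure.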

Shalott’s Song: a Specific Feature Found in Balmont’s Translation of A. Tennyson’s Poem «The Lady of Shalott»

Известия Смоленского государственного университета ◽

10.35785/2072-9464-2020-50-2-22-33 ◽

2020 ◽

pp. 22-33

Author(s):

Margarita Shanurina

Keyword(s):

Neutral Word ◽

Fairy Tale ◽

The Other ◽

Original Text ◽

Academic Paper ◽

Other Hand ◽

The One

This academic paper analyzes a specific feature of K. Balmont’s translation of A. Tennyson’s poem «The Lady of Shalott». The aim of the work is to study why Balmont uses the word «волшебница» to describe the heroine in his translation when there is no word with such semantics in the original text. (This word appears in the title of the translated work and in almost every stanza.) The English analogue of «volshebnitsa» is «enchantress», which, according to the Oxford English Dictionary, is closest to it in semantics; yet this word does not occur in the original text of the poem, which uses the neutral word «lady» and, once (in the speech of the mower who hears the heroine singing but does not see her), the word «fairy». This article, on the one hand, summarizes existing studies on the topic and, on the other hand, complements them. The study highlights and considers several reasons for the above-mentioned discrepancy between the original text and its translation: an emphasis on the connection with a fairy tale, the presence of motifs that play an important role in Balmont’s own work (namely, music and creativity as magic), and an indication of the heroine’s charming beauty.

Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies

Proceedings of the 10th International Conference on Knowledge Capture - K-CAP '19 ◽

10.1145/3360901.3364444 ◽

2019 ◽

Cited By ~ 1

Author(s):

Carlos Badenes-Olmedo ◽

José Luis Redondo-García ◽

Oscar Corcho

Keyword(s):

Document Similarity ◽

Specific Concept ◽

Cross Lingual ◽

Concept Hierarchies

Large expert-curated database for benchmarking document similarity detection in biomedical literature search

Database ◽

10.1093/database/baz085 ◽

2019 ◽

Vol 2019 ◽

Author(s):

Peter Brown ◽

Aik-Choon Tan ◽

Mohamed A El-Esawi ◽

Thomas Liehr ◽

Oliver Blanck ◽

...

Keyword(s):

Literature Search ◽

Relevant Literature ◽

Biomedical Literature ◽

Medical Subject Headings ◽

Document Similarity ◽

Inverse Document Frequency ◽

Research Fields ◽

Experience Levels ◽

Document Frequency ◽

Systematic Biases

Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that cover a variety of research fields such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article(s). The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency–Inverse Document Frequency and PubMed Related Articles) had similar overall performances. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for the downloading of annotation data and the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new powerful techniques for title and title/abstract-based search engines for relevant articles in biomedical research.
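To make one of the baselines named in this abstract concrete, the following is a minimal sketch of ranking candidate articles against a seed article by Term Frequency–Inverse Document Frequency weighting and cosine similarity. It is not the consortium's implementation: the whitespace tokenization, the raw-count `tf * log(N / df)` weighting, the function names, and the example titles are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using raw tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(seed, candidates):
    """Return candidate indices ordered from most to least similar to the seed."""
    vectors = tfidf_vectors([seed] + candidates)
    scores = [cosine(vectors[0], v) for v in vectors[1:]]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Hypothetical seed and candidate titles, invented for this example.
seed = "gene expression in tumor cells"
candidates = [
    "tumor gene expression profiling",
    "protein folding simulation methods",
    "expression of tumor suppressor gene in cells",
]
ranking = rank_candidates(seed, candidates)
```

A benchmark of annotated seed and candidate pairs, such as the one the abstract describes, would allow a ranking like this to be scored against expert relevance judgments and compared with other methods.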