Cross-Language Automatic Plagiarism Detector Using Latent Semantic Analysis and Self-Organizing Map

Approach for Multi-Label Text Data Class Verification and Adjustment Based on Self-Organizing Map and Latent Semantic Analysis

Informatica ◽

10.15388/22-infor473 ◽

2022 ◽

pp. 1-22

Author(s):

Pavel Stefanovič ◽

Olga Kurasova

Keyword(s):

Latent Semantic Analysis ◽

Semantic Analysis ◽

The Self ◽

Self Organizing Map ◽

Text Data ◽

New Approach ◽

New Class ◽

Financial News ◽

Self Organizing

In this paper, a new approach is proposed for multi-label text data class verification and adjustment. The approach supports semi-automated revision of class assignments to improve the quality of the data. Data quality significantly influences the accuracy of the created models, for example, in classification tasks, and the approach can also be useful for other data analysis tasks. The proposed approach combines a text similarity measure with two methods: latent semantic analysis and the self-organizing map. First, the text data must be pre-processed by selecting various filters to clean the data of unnecessary and irrelevant information. Latent semantic analysis is used to reduce the dimensionality of the vectors that correspond to each text in the analysed data. The cosine similarity distance is used to determine which of the multi-label text data classes should be changed or adjusted. The self-organizing map is the key method used to detect similarity between text data and to make decisions on a new class assignment. The experimental investigation has been performed using newly collected multi-label text data: financial news in the Lithuanian language, collected from four public websites and manually classified by experts into ten classes. Various parameters of the methods have been analysed, and their influence on the final results has been estimated. The final results are validated by experts. The research showed that the proposed approach can help to verify and adjust multi-label text data classes. Correct assignments reach 82% when latent semantic analysis reduces the data dimensionality to 40 and the self-organizing map size is reduced from 40 to 5 in steps of 5.
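The cosine-similarity check described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it skips the latent semantic analysis and self-organizing map stages, works on raw word counts, and assumes a simple rule (flag a label when the document is dissimilar to the other documents carrying that label). The function names, the threshold of 0.5, and the English toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def bag_of_words(text):
    """Word-count vector of a text (whitespace tokenisation, lowercased)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def class_totals(docs):
    """Sum the word counts of all documents that carry each label."""
    totals = defaultdict(Counter)
    for text, labels in docs:
        for label in labels:
            totals[label].update(bag_of_words(text))
    return totals

def suspicious_labels(docs, threshold=0.5):
    """Return (document index, label) pairs whose label looks doubtful.

    Assumed rule: a label is doubtful when the document's similarity to
    the *other* documents of that class falls below the threshold.
    """
    totals = class_totals(docs)
    flagged = []
    for i, (text, labels) in enumerate(docs):
        vec = bag_of_words(text)
        for label in labels:
            rest = totals[label] - vec  # leave the document itself out
            if cosine(vec, rest) < threshold:
                flagged.append((i, label))
    return flagged

# Hypothetical toy data: (text, set of expert-assigned labels).
docs = [
    ("bank raises interest rates", {"banking"}),
    ("central bank keeps interest rates", {"banking"}),
    ("bank lowers interest rates again", {"banking"}),
    ("stock market index falls", {"markets"}),
    ("stock market index rises sharply", {"markets"}),
    ("stock market index rebounds", {"markets", "banking"}),
]
flagged = suspicious_labels(docs)
print(flagged)
```

In this toy set only the "banking" label on the last document is flagged, because that text shares no words with the other banking documents. In the approach summarised above, the vectors would first be reduced with latent semantic analysis, and the self-organizing map would then drive the decision on a new class.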

Self-organizing weighted incremental probabilistic latent semantic analysis

International Journal of Machine Learning and Cybernetics ◽

10.1007/s13042-017-0681-9 ◽

2017 ◽

Vol 9 (12) ◽

pp. 1987-1998 ◽

Cited By ~ 5

Author(s):

Ning Li ◽

Wenjuan Luo ◽

Kun Yang ◽

Fuzhen Zhuang ◽

Qing He ◽

...

Keyword(s):

Latent Semantic Analysis ◽

Semantic Analysis ◽

Probabilistic Latent Semantic Analysis ◽

Self Organizing

Analysis on the Effect of Term-Document's Matrix to the Accuracy of Latent-Semantic-Analysis-Based Cross-Language Plagiarism Detection

Proceedings of the Fifth International Conference on Network, Communication and Computing - ICNCC '16 ◽

10.1145/3033288.3033300 ◽

2016 ◽

Author(s):

Anak Agung Putri Ratna ◽

F. Astha Ekadiyanto ◽

Mardiyah ◽

Prima Dewi Purnamasari ◽

Muhammad Salman

Keyword(s):

Latent Semantic Analysis ◽

Semantic Analysis ◽

Plagiarism Detection ◽

Cross Language

Latent Semantic Analysis for Text Mining and Beyond

Intelligent Multimedia Databases and Information Retrieval ◽

10.4018/978-1-61350-126-9.ch015 ◽

2013 ◽

pp. 253-280 ◽

Cited By ~ 2

Author(s):

Anne Kao ◽

Steve Poteet ◽

Jason Wu ◽

William Ferng ◽

Rod Tjoelker ◽

...

Keyword(s):

Information Retrieval ◽

Text Mining ◽

Latent Semantic Analysis ◽

Web Mining ◽

Semantic Analysis ◽

Search Space ◽

Latent Semantic Indexing ◽

Cross Language Information Retrieval ◽

Text Information ◽

Cross Language

Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI), when applied to information retrieval, has been a major analysis approach in text mining. It is an extension of the vector space method in information retrieval, representing documents as numerical vectors but using a more sophisticated mathematical approach to characterize the essential features of the documents and reduce the number of features in the search space. This chapter summarizes several major approaches to this dimensionality reduction, each of which has strengths and weaknesses, and it describes recent breakthroughs and advances. It shows how the constructs and products of LSA applications can be made user-interpretable and reviews applications of LSA beyond information retrieval, in particular, to text information visualization. While the major application of LSA is for text mining, it is also highly applicable to cross-language information retrieval, Web mining, and analysis of text transcribed from speech and textual information in video.
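For orientation, the dimensionality reduction most commonly associated with LSA is the truncated singular value decomposition of a term-document matrix. The formulation below is the standard textbook one and is only one of the approaches the chapter may cover; the symbols $A$, $k$, $q$ and $\hat{q}$ are notation introduced here, not taken from the chapter.

```latex
% A: m x n term-document matrix (m terms, n documents), rank r
A = U \Sigma V^{\top}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

% Keep only the k largest singular values (k < r):
A_k = U_k \Sigma_k V_k^{\top}

% A_k is the best rank-k approximation of A in the Frobenius norm:
\| A - A_k \|_F = \min_{\operatorname{rank}(B) \le k} \| A - B \|_F

% A query (or new document) vector q in term space is folded into
% the k-dimensional latent space as:
\hat{q} = \Sigma_k^{-1} U_k^{\top} q
```

Documents are then compared in the $k$-dimensional space, typically with cosine similarity between the rows of $V_k$ and $\hat{q}$, rather than in the original $m$-dimensional term space.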

Self-organizing maps for latent semantic analysis of free-form text in support of public policy analysis

Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery ◽

10.1002/widm.1112 ◽

2013 ◽

Vol 4 (1) ◽

pp. 71-86 ◽

Cited By ~ 4

Author(s):

Bernie C. Till ◽

Justin Longo ◽

A. Rod Dobell ◽

Peter F. Driessen

Keyword(s):

Public Policy ◽

Policy Analysis ◽

Latent Semantic Analysis ◽

Semantic Analysis ◽

Free Form ◽

Self Organizing Maps ◽

Public Policy Analysis ◽

Self Organizing

Development of Cross Language Clone Detector for C, C++ & Java Repositories using Natural Language Processing

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.b3612.129219 ◽

2019 ◽

Vol 9 (2) ◽

pp. 2289-2293

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Latent Semantic Analysis ◽

Semantic Analysis ◽

Code Clones ◽

Code Base ◽

Value Decomposition ◽

Cross Language ◽

Bug Fixes

Reusing code, with or without modification, is a common process in building large system-software codebases such as Linux, gcc, and jdk. This process is referred to as software cloning or forking. Developers often find bug fixing difficult when porting a large code base from one language to another. Many approaches exist for identifying software clones within the same language, but they may not help developers involved in porting; hence there is a need for a cross-language clone detector. This paper uses a Natural Language Processing (NLP) approach, a latent semantic analysis algorithm based on singular value decomposition, to find cross-language clones in neighboring languages for all 4 types of clones. It takes code (C, C++, or Java) as input and matches all the neighboring code clones in the static repository in terms of the frequency of lines matched.
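A rough sketch of how latent semantic analysis with singular value decomposition can match code across languages is given below. It is not the paper's detector: it assumes a plain token-count matrix, a rank of 2, and cosine similarity in the latent space, and it does not model the four clone types or the line-frequency matching mentioned in the abstract. The helper names and the four-snippet repository are hypothetical; NumPy is assumed to be installed.

```python
import re
import numpy as np

def tokenize(code):
    """Split source code into lowercase identifier/keyword tokens."""
    return re.findall(r"[a-z_]+", code.lower())

def term_document_matrix(snippets):
    """Build a token-by-snippet count matrix and the token index."""
    vocab = sorted({t for s in snippets for t in tokenize(s)})
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(snippets)))
    for j, s in enumerate(snippets):
        for t in tokenize(s):
            A[index[t], j] += 1.0
    return A, index

def lsa_fit(A, k):
    """Truncated SVD: keep the k largest singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T  # rows of last item = snippets

def fold_in(code, index, U_k, s_k):
    """Project a new snippet into the k-dimensional latent space."""
    q = np.zeros(U_k.shape[0])
    for t in tokenize(code):
        if t in index:  # tokens unseen in the repository are ignored
            q[index[t]] += 1.0
    return (q @ U_k) / s_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical static repository: two C-style and two Java-style snippets.
repo = [
    "total = 0; while (idx < count) { total += values[idx]; idx++; }",
    "total = 0; while (idx < count) { total += values.get(idx); idx++; }",
    "fputs(message, stream); fflush(stream);",
    "writer.write(message); writer.flush();",
]
A, index = term_document_matrix(repo)
U_k, s_k, doc_coords = lsa_fit(A, k=2)

query = "while (idx < count) { total += values[idx]; idx++; }"
q_hat = fold_in(query, index, U_k, s_k)
sims = [cosine(q_hat, d) for d in doc_coords]
best = int(np.argmax(sims))
print(best, [round(x, 3) for x in sims])
```

Here the summation-loop query lands next to the two summation snippets, one C-style and one Java-style, and away from the two output snippets. A real detector would need language-aware tokenisation and a much larger repository before similarities of this kind become meaningful.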
