Chinese-English Cross-Language Text Clustering Algorithm Based on Latent Semantic Analysis

Online discussion forums have rapidly gained usage in e-learning systems, which has placed a heavy burden on course instructors who moderate student discussions. Along with this growth in usage, there is a need for accelerated knowledge extraction tools that analyse and present online messages in a useful and meaningful manner. Previous methods of assessing student participation in online discussions followed strictly quantitative approaches that did not necessarily capture the students’ effort. This article discusses a qualitative approach that involves content analysis of the discussions and generation of clustered keywords, which can be used to identify topics of discussion. The authors apply a new k-means++ clustering algorithm with latent semantic analysis to assess the topics expressed by students in online discussion forums, and then compare the proposed algorithm with the standard k-means++ algorithm. Using the Moodle course management forum to validate the proposed algorithm, the authors show that the k-means++ clustering algorithm with latent semantic analysis performs better than stand-alone k-means++.
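
The pipeline this abstract describes (latent semantic analysis followed by k-means++ clustering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the posts are invented, the helper names (`doc_term_matrix`, `lsa_embed`, `kmeans_pp`) are hypothetical, and plain k-means++ seeding with Lloyd iterations stands in for the "new" variant, whose details the abstract does not give.

```python
import numpy as np

def doc_term_matrix(docs):
    """Count matrix with one row per post and one column per word."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    col = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.lower().split():
            X[i, col[w]] += 1.0
    return X

def lsa_embed(X, k):
    """Keep the top-k singular directions of X, then unit-normalise each row."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :k] * s[:k]
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, 1e-12)

def kmeans_pp(Z, k, rng, n_iter=20):
    """k-means++ seeding followed by standard Lloyd iterations."""
    # Seeding: first centre uniformly at random, later centres with
    # probability proportional to squared distance to the nearest centre.
    centers = [Z[rng.integers(len(Z))]]
    while len(centers) < k:
        d2 = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[rng.choice(len(Z), p=d2 / d2.sum())])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's centre unchanged
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

# Invented forum posts on two topics (assessment vs. programming help).
posts = [
    "exam grade deadline",
    "exam grade",
    "grade deadline exam",
    "python loop error",
    "python error",
    "loop error python",
]
rng = np.random.default_rng(0)
labels = kmeans_pp(lsa_embed(doc_term_matrix(posts), k=2), k=2, rng=rng)
print(labels)
```

On this toy input the first three posts should land in one cluster and the last three in the other; which cluster receives which label depends on the random seeding.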

Latent Semantic Analysis for Text Mining and Beyond

Intelligent Multimedia Databases and Information Retrieval ◽ 10.4018/978-1-61350-126-9.ch015 ◽ 2013 ◽ pp. 253-280

Cited By ~ 2

Author(s): Anne Kao ◽ Steve Poteet ◽ Jason Wu ◽ William Ferng ◽ Rod Tjoelker ◽ ...

Keyword(s): Information Retrieval ◽ Text Mining ◽ Latent Semantic Analysis ◽ Web Mining ◽ Semantic Analysis ◽ Search Space ◽ Latent Semantic Indexing ◽ Cross Language Information Retrieval ◽ Text Information ◽ Cross Language

Latent Semantic Analysis (LSA), known as Latent Semantic Indexing (LSI) when applied to information retrieval, has been a major analysis approach in text mining. It is an extension of the vector space method in information retrieval: it represents documents as numerical vectors but uses a more sophisticated mathematical approach to characterize the essential features of the documents and reduce the number of features in the search space. This chapter summarizes several major approaches to this dimensionality reduction, each of which has strengths and weaknesses, and describes recent breakthroughs and advances. It shows how the constructs and products of LSA applications can be made user-interpretable and reviews applications of LSA beyond information retrieval, in particular to text information visualization. While the major application of LSA is text mining, it is also highly applicable to cross-language information retrieval, Web mining, and analysis of text transcribed from speech and textual information in video.
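
As a rough illustration of the dimensionality reduction described above, the sketch below uses a truncated singular value decomposition, one common way to realise LSA; the chapter surveys several approaches and this is not necessarily the one it favours. The four documents, the query, and the helper names `to_vector` and `cosine` are all invented for the example.

```python
import numpy as np

# Invented mini-corpus: two documents about vehicles, two about plants.
docs = [
    "car engine",
    "automobile engine",
    "flower petal",
    "flower garden petal",
]
vocab = sorted({w for d in docs for w in d.split()})

def to_vector(text):
    """Bag-of-words count vector over the fixed vocabulary."""
    words = text.split()
    return np.array([float(words.count(w)) for w in vocab])

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

X = np.vstack([to_vector(d) for d in docs])  # documents x terms

# Truncated SVD: keep only the k strongest latent dimensions.
k = 2
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Vk = Vt[:k].T  # terms x k projection matrix

doc_coords = X @ Vk            # documents in the reduced space
query = to_vector("car")
query_coords = query @ Vk      # the query is projected the same way

raw_sim = [cosine(query, row) for row in X]
lsa_sim = [cosine(query_coords, row) for row in doc_coords]

for d, r, l in zip(docs, raw_sim, lsa_sim):
    print(f"{d!r}: raw={r:.2f} lsa={l:.2f}")
```

In the plain vector space model the query "car" has zero similarity to "automobile engine" because they share no term. In the reduced space the two come out close together, since both co-occur with "engine", while the plant documents remain unrelated.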

Development of Cross Language Clone Detector for C, C++ & Java Repositories using Natural Language Processing

International Journal of Engineering and Advanced Technology - Regular Issue ◽ 10.35940/ijeat.b3612.129219 ◽ 2019 ◽ Vol 9 (2) ◽ pp. 2289-2293

Keyword(s): Natural Language Processing ◽ Natural Language ◽ Language Processing ◽ Latent Semantic Analysis ◽ Semantic Analysis ◽ Code Clones ◽ Code Base ◽ Value Decomposition ◽ Cross Language ◽ Bug Fixes

Reusing code, with or without modification, is a common process in building large system software codebases such as Linux, gcc, and jdk. This process is referred to as software cloning or forking. When porting a large code base from one language to another, developers often find it difficult to carry bug fixes across. Many approaches exist for identifying software clones within a single language, but these may not help developers involved in porting; hence there is a need for a cross-language clone detector. This paper uses a Natural Language Processing (NLP) approach, latent semantic analysis based on singular value decomposition, to find cross-language clones in neighboring languages across all 4 types of clones. It takes code (C, C++ or Java) as input and matches all the neighboring code clones in a static repository in terms of frequency of lines matched.
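
One way such a detector could be approximated is sketched below: identifiers are extracted from each snippet, a snippet-by-token matrix is reduced with singular value decomposition, and a query snippet is compared with the repository by cosine similarity. This is only an illustration under assumptions. It works on identifier tokens rather than on the line-frequency matching the abstract mentions, it does not distinguish the 4 clone types, and the snippets, the `STOP` keyword list, and all helper names are invented.

```python
import re
import numpy as np

# Assumed stop list: keywords shared by C, C++ and Java that carry little
# information about what a snippet does.
STOP = {"int", "void", "char", "const", "for", "while", "return"}

def tokens(code):
    """Identifier-like tokens of a snippet, minus shared keywords."""
    return [t for t in re.findall(r"[A-Za-z_]\w*", code) if t not in STOP]

# Invented static repository of C and C++ snippets.
repository = {
    "c_sum": "int sum(int *values, int count) { int total = 0; "
             "for (int i = 0; i < count; i++) total += values[i]; return total; }",
    "cpp_sum": "int sum(const std::vector<int>& values) { int total = 0; "
               "for (std::size_t i = 0; i < values.size(); ++i) total += values[i]; "
               "return total; }",
    "c_copy": "void copy(char *dest, char *source) { "
              "while (*source) *dest++ = *source++; }",
}

# Invented Java query snippet to be matched against the repository.
query_java = ("int sum(int[] values) { int total = 0; "
              "for (int i = 0; i < values.length; i++) total += values[i]; "
              "return total; }")

vocab = sorted({t for src in repository.values() for t in tokens(src)})

def to_vector(code):
    """Token-count vector over the repository vocabulary (unseen tokens dropped)."""
    toks = tokens(code)
    return np.array([float(toks.count(t)) for t in vocab])

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

names = list(repository)
X = np.vstack([to_vector(repository[n]) for n in names])  # snippets x tokens

# LSA step: keep the k strongest latent dimensions from the SVD.
k = 2
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Vk = Vt[:k].T

repo_coords = X @ Vk
query_coords = to_vector(query_java) @ Vk

scores = {n: cosine(query_coords, repo_coords[i]) for i, n in enumerate(names)}
best = max(scores, key=scores.get)
print(best, scores)
```

On this toy repository the Java summation routine should score highest against the C and C++ summation snippets and near zero against the string-copy routine. A realistic detector would need a much larger repository, a principled choice of `k`, and handling for renamed identifiers.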
