Hybrid SVD For Document Representation Using Different Vectorization

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit195260 ◽

2019 ◽

pp. 388-393

Author(s):

Kalpana P ◽

Rosini B R ◽

Sathya Priya K P ◽

Sowmiya S

Keyword(s):

Document Clustering ◽

High Dimensionality ◽

Word Embeddings ◽

Document Representation ◽

Challenging Problem ◽

Large Database ◽

Relevant Document ◽

Electronic Form ◽

Large Databases ◽

Uniform Manner

Document clustering is the process of segmenting a collection of texts into subgroups. Most documents are now stored in electronic form, which makes retrieving relevant documents from a large database difficult. The goal is to transform text written in everyday language into a structured database format. In this way, different documents are summarized and presented in a uniform manner. The challenging problems of document clustering are large volume, high dimensionality and complex semantics. This paper focuses on clustering multi-sense word embeddings using three different algorithms (K-means, DBSCAN, CURE). Among these three algorithms, CURE gives better accuracy and can handle large databases efficiently.
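
To make the clustering step in the abstract above concrete, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. The 2-D points stand in for word-sense embeddings and are invented for illustration; the function name `kmeans`, the starting centroids and the iteration count are assumptions, not the authors' setup, and DBSCAN and CURE are not shown.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and a centroid update."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point takes the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical 2-D "sense embeddings": two well-separated groups.
embeddings = [(0.9, 0.1), (1.0, 0.2), (0.8, 0.0),
              (0.1, 0.9), (0.0, 1.0), (0.2, 0.8)]
labels, centroids = kmeans(embeddings, centroids=[embeddings[0], embeddings[3]])
print(labels)
```

Real embeddings have hundreds of dimensions, which is where the high-dimensionality problem named in the abstract arises; the algorithm itself is unchanged.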

Hybrid SVD Model for Document Representation

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.f1191.0986s319 ◽

2019 ◽

Vol 8 (6S3) ◽

pp. 1147-1150

Keyword(s):

Singular Value Decomposition ◽

Performance Measures ◽

Document Clustering ◽

Singular Value ◽

High Dimensionality ◽

Racial Groups ◽

Document Representation ◽

Uniform Manner ◽

Value Decomposition ◽

Difficult Issue

Document clustering is a way to segment a set of texts into subgroups. Most records are now stored in electronic form, which makes retrieving the appropriate document from a large database difficult. The objective is to convert text written in everyday language into a structured database format. Different documents are thus summarized and presented in a uniform manner. Large volume, high dimensionality and complicated semantics are the difficult issues of document clustering. The aim of this article is primarily to cluster multi-sense word embeddings with three distinct algorithms (K-means, DBSCAN, CURE) in combination with singular value decomposition. Performance is evaluated using several different metrics.
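
As an illustration of how singular value decomposition can reduce dimensionality before clustering, the sketch below finds the leading singular triplet of a small document-term matrix by power iteration. The matrix values and the function name `top_singular_triplet` are invented for this example, and the abstract does not say how the authors compute or truncate the SVD, so this is a generic sketch and not their method.

```python
import math

def top_singular_triplet(A, iterations=100):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value."""
    n_rows, n_cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # unit-length starting vector
    for _ in range(iterations):
        Av = [sum(a * x for a, x in zip(row, v)) for row in A]  # A v
        AtAv = [sum(A[i][j] * Av[i] for i in range(n_rows))     # A^T (A v)
                for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in AtAv))
        v = [x / norm for x in AtAv]
    Av = [sum(a * x for a, x in zip(row, v)) for row in A]
    sigma = math.sqrt(sum(x * x for x in Av))  # sigma = ||A v||
    u = [x / sigma for x in Av]                # left singular vector
    return sigma, u, v

# Hypothetical document-term counts: rows are documents, columns are terms.
doc_term = [[2.0, 1.0, 0.0],
            [4.0, 2.0, 0.0],
            [0.0, 0.0, 3.0]]
sigma, u, v = top_singular_triplet(doc_term)

# One coordinate per document along the leading direction (rank-1 representation).
doc_coords = [sigma * ui for ui in u]
print(sigma, doc_coords)
```

Keeping more than one singular direction works the same way in principle; a library routine such as a truncated SVD would normally be used instead of hand-written power iteration.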

Towards Big Data to Improve Availability of Massive Database

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.263-266.3326 ◽

2012 ◽

Vol 263-266 ◽

pp. 3326-3329

Author(s):

Jia Jia Miao ◽

Guo You Chen ◽

Kai Du ◽

Zhong Jun Fang

Keyword(s):

Big Data ◽

Relational Databases ◽

Database System ◽

Physical Change ◽

Large Database ◽

Large Databases ◽

Order Of Magnitude ◽

Human Costs ◽

Hardware Costs ◽

Node Size

Due to its huge scale and number of components, big data is difficult to work with using relational databases and desktop statistics and visualization packages. Database replication is widely used to increase the MTTF, but few replication techniques target large database systems, where traditional backup is not feasible and reducing the MTTR requires expensive manpower. Based on an analysis of the characteristics of data in large databases, we propose a new method called the Detaching Read-Only (DRO) mechanism and its variant, DRO+. It reduces MTTR by reducing the physical change of the data in each database, separating data at the granularity of node size. Analysis and experimental results show that our method not only reduces the MTTR by an order of magnitude, but also incurs no additional hardware costs and reduces the high human costs.
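
The link between MTTR and availability can be shown with the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The abstract does not state this formula, and the hour values below are hypothetical; they only show what an order-of-magnitude drop in MTTR does when MTTF is unchanged.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical figures, not taken from the paper.
baseline = availability(mttf_hours=1000.0, mttr_hours=10.0)
improved = availability(mttf_hours=1000.0, mttr_hours=1.0)  # MTTR cut tenfold

print(f"baseline: {baseline:.6f}, improved: {improved:.6f}")
```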

Aggregating Neural Word Embeddings for Document Representation

Lecture Notes in Computer Science - Advances in Information Retrieval ◽

10.1007/978-3-319-76941-7_23 ◽

2018 ◽

pp. 303-315 ◽

Cited By ~ 2

Author(s):

Ruqing Zhang ◽

Jiafeng Guo ◽

Yanyan Lan ◽

Jun Xu ◽

Xueqi Cheng

Keyword(s):

Word Embeddings ◽

Document Representation

Up-to-Date Feature Selection Methods for Scalable and Efficient Machine Learning

Efficiency and Scalability Methods for Computational Intellect ◽

10.4018/978-1-4666-3942-3.ch001 ◽

2013 ◽

pp. 1-26 ◽

Cited By ~ 2

Author(s):

Amparo Alonso-Betanzos ◽

Verónica Bolón-Canedo ◽

Diego Fernández-Francos ◽

Iago Porto-Díaz ◽

Noelia Sánchez-Maroño

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Real World ◽

Large Scale ◽

High Dimensionality ◽

Selection Methods ◽

Learning Methods ◽

Large Databases ◽

Efficient Machine ◽

Processing Techniques

With the advent of high dimensionality, machine learning researchers are now interested not only in accuracy, but also in scalability of algorithms. When dealing with large databases, pre-processing techniques are required to reduce input dimensionality and machine learning can take advantage of feature selection, which consists of selecting the relevant features and discarding irrelevant ones with a minimum degradation in performance. In this chapter, we will review the most up-to-date feature selection methods, focusing on their scalability properties. Moreover, we will show how these learning methods are enhanced when applied to large scale datasets and, finally, some examples of the application of feature selection in real world databases will be shown.
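
One of the simplest filter-style approaches to reducing input dimensionality is to drop features whose values barely vary. The sketch below is a generic variance filter on invented data, with an assumed function name `select_by_variance` and an assumed threshold; it is not one of the methods reviewed in the chapter, and variance is only a rough proxy for relevance because it ignores the prediction target.

```python
from statistics import pvariance

def select_by_variance(rows, threshold):
    """Keep the columns whose population variance exceeds the threshold."""
    columns = list(zip(*rows))
    kept = [j for j, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[j] for j in kept] for row in rows]
    return kept, reduced

# Hypothetical dataset: column 0 is constant and carries no information.
data = [[1.0, 5.0, 0.0],
        [1.0, 3.0, 1.0],
        [1.0, 4.0, 0.0],
        [1.0, 6.0, 1.0]]
kept, reduced = select_by_variance(data, threshold=0.1)
print(kept, reduced)
```

Filters like this scale linearly with the number of features, which is the kind of scalability property the abstract emphasizes.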

A Similarity Rough Set Model for Document Representation and Document Clustering

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2011.p0125 ◽

2011 ◽

Vol 15 (2) ◽

pp. 125-133 ◽

Cited By ~ 3

Author(s):

Nguyen Chi Thanh ◽

Koichi Yamada ◽

Muneyuki Unehara

Keyword(s):

Vector Space ◽

Rough Set ◽

Document Clustering ◽

Vector Space Model ◽

Document Representation ◽

Space Model ◽

Large Sets ◽

Tolerance Rough Set ◽

Document Organization ◽

The One

Document clustering is a text-mining technique for unsupervised document organization. It helps users browse and navigate large sets of documents. Ho et al. proposed a Tolerance Rough Set Model (TRSM) [1] to improve the vector space model, which represents documents by vectors of terms, and applied it to document clustering. In this paper we analyze their model to propose a new model for efficient clustering of documents. We introduce the Similarity Rough Set Model (SRSM) as another model for representing documents in document clustering. The model is evaluated by experiments on test collections. The experimental results show that the SRSM document clustering method outperforms the one with TRSM and that the results of SRSM are less affected by the parameter value than those of TRSM.
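
The vector space model that TRSM and SRSM build on can be sketched in a few lines: each document becomes a vector of term counts, and documents are compared by cosine similarity. The example sentences, the whitespace tokenization and the helper names `term_vector` and `cosine` are assumptions for illustration; the rough-set extensions themselves are not implemented here.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a document as raw term counts (lowercased, whitespace-split)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents.
d1 = term_vector("rough set model for document clustering")
d2 = term_vector("document clustering with a rough set model")
d3 = term_vector("image retrieval on GPU platforms")

print(cosine(d1, d2), cosine(d1, d3))
```

Documents that share no terms score zero under this model even when they are related in meaning, which is the kind of limitation that motivates extending the plain term vectors.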

Leveraging Topic Models with Novel Word Embeddings for Effective Document Clustering

Advances in Computational Intelligence and Informatics - Lecture Notes in Networks and Systems ◽

10.1007/978-981-15-3338-9_17 ◽

2020 ◽

pp. 133-139

Author(s):

Thayyaba Khatoon Mohammed ◽

Rajasekhar Rangasamy ◽

V. S. K. Reddy ◽

A. Govardhan

Keyword(s):

Document Clustering ◽

Topic Models ◽

Word Embeddings

Query-relevant document representation for text clustering

2010 Fifth International Conference on Digital Information Management (ICDIM) ◽

10.1109/icdim.2010.5664205 ◽

2010 ◽

Cited By ~ 1

Author(s):

Masoud Makrehchi

Keyword(s):

Text Clustering ◽

Document Representation ◽

Relevant Document

Multivariate Gaussian Document Representation from Word Embeddings for Text Categorization

10.18653/v1/e17-2072 ◽

2017 ◽

Cited By ~ 1

Author(s):

Giannis Nikolentzos ◽

Polykarpos Meladianos ◽

Francois Rousseau ◽

Yannis Stavrakas ◽

Michalis Vazirgiannis

Keyword(s):

Text Categorization ◽

Word Embeddings ◽

Document Representation ◽

Multivariate Gaussian

Cloud-Based Image Retrieval Using GPU Platforms

Computers ◽

10.3390/computers8020048 ◽

2019 ◽

Vol 8 (2) ◽

pp. 48 ◽

Cited By ~ 2

Author(s):

Sidi Ahmed Mahmoudi ◽

Mohammed Amin Belarbi ◽

El Wardani Dadi ◽

Saïd Mahmoudi ◽

Mohammed Benjelloun

Keyword(s):

Image Retrieval ◽

Performance Improvement ◽

Computation Time ◽

Image Features ◽

Multimedia Retrieval ◽

Large Database ◽

Computation Complexity ◽

Multimedia Documents ◽

Large Databases ◽

Reduction Of Energy Consumption

Image retrieval is a useful tool for many computer vision domains, such as multimedia retrieval, pattern recognition, medical imaging, video surveillance and movement analysis. Visual characteristics of images such as color, texture and shape are used to identify image content. However, retrieval becomes very challenging because large databases are hard to manage in terms of storage, computational complexity, temporal performance and similarity representation. In this paper, we propose a cloud-based platform that integrates several feature extraction algorithms used in content-based image retrieval (CBIR) systems. Moreover, we propose an efficient combination of SIFT and SURF descriptors that extracts and matches image features and hence improves image retrieval. The proposed algorithms have been implemented on the CPU and also adapted to fully exploit the power of GPUs. Our platform comes with a responsive web interface that lets users exploit, test and evaluate image retrieval methods. It gives users simple access to algorithms such as the SIFT and SURF descriptors without the need to set up an environment or install anything, and with minimal preprocessing and configuration effort. In addition, our cloud-based CPU and GPU implementations are scalable, so they can be used even with large databases of multimedia documents. The results showed: 1. improved retrieval quality in terms of recall and precision; 2. improved performance in terms of computation time as a result of exploiting GPUs in parallel; 3. reduced energy consumption.

Document Representation with Statistical Word Senses in Cross-Lingual Document Clustering

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s021800141559003x ◽

2015 ◽

Vol 29 (02) ◽

pp. 1559003 ◽

Cited By ~ 3

Author(s):

Guoyu Tang ◽

Yunqing Xia ◽

Erik Cambria ◽

Peng Jin ◽

Thomas Fang Zheng

Keyword(s):

Latent Dirichlet Allocation ◽

State Of The Art ◽

Document Clustering ◽

Word Sense ◽

Document Representation ◽

Word Sense Induction ◽

Translation Ambiguity ◽

Cross Lingual ◽

Word Senses ◽

Induction Model

Cross-lingual document clustering is the task of automatically organizing a large collection of multi-lingual documents into a few clusters, depending on their content or topic. It is well known that the language barrier and translation ambiguity are two challenging issues for cross-lingual document representation. To address them, we propose to represent cross-lingual documents through statistical word senses, which are automatically discovered from a parallel corpus through a novel cross-lingual word sense induction model and a sense clustering method. In particular, the former consists of a sense-based vector space model and the latter leverages a sense-based latent Dirichlet allocation. Evaluation on the benchmark datasets shows that the proposed models outperform two state-of-the-art methods for cross-lingual document clustering.
