Hybrid SVD For Document Representation Using Different Vectorization

Author(s):  
Kalpana P ◽  
Rosini B R ◽  
Sathya Priya K P ◽  
Sowmiya S

Document clustering is the process of segmenting a collection of text documents into subgroups. Nowadays almost all documents are in electronic form, which raises the problem of retrieving relevant documents from large databases. The goal is to transform text written in everyday language into a structured, database-like format, so that different documents can be summarized and presented in a uniform manner. The challenging problems of document clustering are large volume, high dimensionality, and complex semantics. This paper focuses on clustering multi-sense word embeddings using three different algorithms (K-means, DBSCAN, CURE). Among the three, CURE gives the best accuracy and can handle large databases efficiently.
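The abstract compares K-means, DBSCAN, and CURE on multi-sense word embeddings. As an illustration only, here is a minimal NumPy sketch of the K-means step; the deterministic initialization and toy blob data are our own simplifications, not the authors' setup:

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    # deterministic init for the sketch: evenly spaced points as seeds
    centroids = X[np.linspace(0, len(X) - 1, k, dtype=int)].copy()
    for _ in range(iters):
        # distance from every point to every centroid, shape (n, k)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# two well-separated Gaussian blobs standing in for embedding vectors
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 5)),
               rng.normal(3.0, 0.1, (20, 5))])
labels = kmeans(X, k=2)
```

A density-based method such as DBSCAN would instead grow clusters from dense neighbourhoods and needs no cluster count up front, which is one reason the paper compares the three.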

Document clustering segments a given set of texts into subgroups. Nowadays all records are in electronic form, which makes retrieving appropriate documents from a large database difficult. The objective is to convert text written in everyday language into a structured database format, so that different documents are summarized and presented in a uniform manner. Large volume, high dimensionality, and complicated semantics are the difficult issues of document clustering. This article primarily clusters multi-sense word embeddings using three distinct algorithms (K-means, DBSCAN, CURE) together with singular value decomposition, and measures performance using different metrics.
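The SVD step mentioned here reduces document vectors to a low-rank representation before clustering. A minimal NumPy sketch on a made-up term-document matrix (the matrix and rank are illustrative, not from the paper):

```python
import numpy as np

# toy term-document matrix: rows = documents, columns = term counts
A = np.array([
    [2, 1, 0, 0],
    [3, 1, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 3],
], dtype=float)

# full SVD, then keep only the top-k singular directions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_reduced = U[:, :k] * s[:k]   # k-dimensional document representation
```

Documents about the same topic (rows 0 and 1, or rows 2 and 3) end up close together in the reduced space, which is exactly what the downstream clustering algorithms exploit.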


2012 ◽  
Vol 263-266 ◽  
pp. 3326-3329
Author(s):  
Jia Jia Miao ◽  
Guo You Chen ◽  
Kai Du ◽  
Zhong Jun Fang

Due to its huge scale and number of components, big data is difficult to handle with relational databases and desktop statistics or visualization packages. Database replication is widely used to increase MTTF, but for a large database system the traditional backup method is not feasible, and expensive manual effort is needed to reduce MTTR. Based on an analysis of the characteristics of data in large databases, we propose a new method called the Detaching Read-Only (DRO) mechanism, together with its variant DRO+. It reduces MTTR by reducing the amount of physically changing data in each database, separating data at node-size granularity. Analysis and experimental results show that our method not only reduces MTTR by an order of magnitude with no additional hardware cost, but also reduces high labor costs.
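The abstract does not spell out the DRO mechanism's internals, but the core idea it names, detaching read-only data so routine backup and recovery only touch the mutable part, can be sketched loosely. Everything below (the record layout, the age threshold) is a hypothetical illustration, not the paper's design:

```python
from datetime import datetime, timedelta

def partition_records(records, now, threshold=timedelta(days=30)):
    """Hypothetical sketch: records untouched longer than `threshold` are
    detached into a read-only partition that is backed up once and then
    excluded, so routine backup/recovery handles only the mutable set."""
    read_only, mutable = {}, {}
    for key, (value, last_modified) in records.items():
        if now - last_modified > threshold:
            read_only[key] = value   # frozen: back up once, never again
        else:
            mutable[key] = value     # active: part of every routine backup
    return read_only, mutable

now = datetime(2024, 1, 31)
records = {
    "a": ("old row", datetime(2023, 1, 1)),
    "b": ("recent row", datetime(2024, 1, 30)),
}
cold, hot = partition_records(records, now)
```

The smaller the mutable partition, the less data a recovery must replay, which is where the MTTR reduction comes from.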


Author(s):  
Ruqing Zhang ◽  
Jiafeng Guo ◽  
Yanyan Lan ◽  
Jun Xu ◽  
Xueqi Cheng

Author(s):  
Amparo Alonso-Betanzos ◽  
Verónica Bolón-Canedo ◽  
Diego Fernández-Francos ◽  
Iago Porto-Díaz ◽  
Noelia Sánchez-Maroño

With the advent of high dimensionality, machine learning researchers are now interested not only in accuracy but also in the scalability of algorithms. When dealing with large databases, pre-processing techniques are required to reduce input dimensionality, and machine learning can take advantage of feature selection, which consists of selecting the relevant features and discarding irrelevant ones with minimal degradation in performance. In this chapter, we review the most up-to-date feature selection methods, focusing on their scalability properties. Moreover, we show how these learning methods behave when applied to large-scale datasets and, finally, present some examples of the application of feature selection to real-world databases.
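To make the idea concrete, here is a deliberately crude filter-style selector in NumPy that keeps the most variable columns; it is our own toy example, not one of the methods the chapter surveys (real filter methods also weigh relevance to the target variable):

```python
import numpy as np

def select_by_variance(X, k):
    """Keep the k columns with the highest variance; a crude filter-style
    feature selector used purely for illustration."""
    variances = X.var(axis=0)
    keep = np.sort(np.argsort(variances)[-k:])  # top-k columns, in order
    return keep, X[:, keep]

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 5.0, 100),    # high-variance feature
    rng.normal(0, 0.01, 100),   # near-constant (irrelevant) feature
    rng.normal(0, 2.0, 100),    # mid-variance feature
])
kept, X_reduced = select_by_variance(X, k=2)
```

Filter methods like this scale well because they look at each feature independently; wrapper methods, which retrain a model per candidate subset, are the ones that struggle on large databases.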


Author(s):  
Nguyen Chi Thanh ◽  
Koichi Yamada ◽  
Muneyuki Unehara

Document clustering is a text-mining technique for unsupervised document organization. It helps users browse and navigate large sets of documents. Ho et al. proposed a Tolerance Rough Set Model (TRSM) [1] to improve the vector space model, which represents documents as vectors of terms, and applied it to document clustering. In this paper we analyze their model and propose a new model for efficient clustering of documents. We introduce the Similarity Rough Set Model (SRSM) as another model for representing documents in document clustering. The model is evaluated by experiments on test collections. The experimental results show that the SRSM document clustering method outperforms the one with TRSM, and that the results of SRSM are less affected by the parameter value than those of TRSM.
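The tolerance relation at the heart of TRSM can be sketched simply: two terms are "tolerant" of each other when they co-occur in at least θ documents. The toy corpus and threshold below are ours, and SRSM (the paper's contribution) replaces this co-occurrence test with a similarity measure:

```python
def tolerance_classes(docs, theta):
    """For each term, collect the terms that co-occur with it in at least
    `theta` documents - a simplified sketch of TRSM's tolerance classes."""
    terms = sorted({t for d in docs for t in d})
    classes = {}
    for t in terms:
        classes[t] = {
            u for u in terms
            if sum(1 for d in docs if t in d and u in d) >= theta
        }
    return classes

# toy corpus: each document is just its set of terms
docs = [
    {"data", "mining", "cluster"},
    {"data", "mining"},
    {"cluster", "graph"},
]
classes = tolerance_classes(docs, theta=2)
```

Enriching each document vector with the tolerance classes of its terms is what lets TRSM-style clustering relate documents that share no terms directly, only tolerant ones.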


2017 ◽  
Author(s):  
Giannis Nikolentzos ◽  
Polykarpos Meladianos ◽  
Francois Rousseau ◽  
Yannis Stavrakas ◽  
Michalis Vazirgiannis

Computers ◽  
2019 ◽  
Vol 8 (2) ◽  
pp. 48 ◽  
Author(s):  
Sidi Ahmed Mahmoudi ◽  
Mohammed Amin Belarbi ◽  
El Wardani Dadi ◽  
Saïd Mahmoudi ◽  
Mohammed Benjelloun

The process of image retrieval presents an interesting tool for different domains related to computer vision, such as multimedia retrieval, pattern recognition, medical imaging, video surveillance, and movement analysis. Visual characteristics of images such as color, texture, and shape are used to identify their content. However, the retrieval process becomes very challenging because of the difficulty of managing large databases in terms of storage, computational complexity, temporal performance, and similarity representation. In this paper, we propose a cloud-based platform that integrates several feature extraction algorithms used in content-based image retrieval (CBIR) systems. Moreover, we propose an efficient combination of SIFT and SURF descriptors that allows extracting and matching image features and hence improves the image retrieval process. The proposed algorithms have been implemented on the CPU and also adapted to fully exploit the power of GPUs. Our platform is presented as a responsive web solution that offers users the possibility to exploit, test, and evaluate image retrieval methods. It gives users simple access to different algorithms, such as the SIFT and SURF descriptors, without the need to set up the environment or install anything, while spending minimal effort on preprocessing and configuration. On the other hand, our cloud-based CPU and GPU implementations are scalable, which means they can be used even with large databases of multimedia documents. The obtained results show: 1. improved retrieval quality in terms of recall and precision; 2. improved performance in terms of computation time as a result of exploiting GPUs in parallel; 3. reduced energy consumption.
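The matching step that follows SIFT/SURF extraction is usually nearest-neighbour search with Lowe's ratio test. The NumPy sketch below shows only that step on made-up 4-D "descriptors" (real SIFT descriptors are 128-D and would come from a library such as OpenCV, not from hand-written arrays):

```python
import numpy as np

def match_descriptors(d1, d2, ratio=0.75):
    """Nearest-neighbour matching with Lowe's ratio test: keep a match
    only when the best candidate is clearly better than the second best."""
    matches = []
    for i, v in enumerate(d1):
        dists = np.linalg.norm(d2 - v, axis=1)
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:
            matches.append((i, int(best)))
    return matches

# toy descriptors: d1[0] has one clear counterpart in d2, d1[1] is ambiguous
d1 = np.array([[1.0, 0.00, 0.0, 0.0],
               [0.5, 0.50, 0.0, 0.0]])
d2 = np.array([[1.0, 0.05, 0.0, 0.0],
               [0.5, 0.45, 0.0, 0.0],
               [0.5, 0.55, 0.0, 0.0]])
matches = match_descriptors(d1, d2)
```

The ratio test discards the ambiguous descriptor, which is the behaviour that keeps false matches low; the per-query distance computations are also embarrassingly parallel, which is what the paper's GPU implementation exploits.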


Author(s):  
Guoyu Tang ◽  
Yunqing Xia ◽  
Erik Cambria ◽  
Peng Jin ◽  
Thomas Fang Zheng

Cross-lingual document clustering is the task of automatically organizing a large collection of multilingual documents into a few clusters, depending on their content or topic. It is well known that the language barrier and translation ambiguity are two challenging issues for cross-lingual document representation. To this end, we propose to represent cross-lingual documents through statistical word senses, which are automatically discovered from a parallel corpus through a novel cross-lingual word sense induction model and a sense clustering method. In particular, the former builds on a sense-based vector space model and the latter leverages a sense-based latent Dirichlet allocation. Evaluation on benchmark datasets shows that the proposed models outperform two state-of-the-art methods for cross-lingual document clustering.
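The payoff of sense-based representation is that words from different languages induced into the same sense cluster share one dimension, so documents become comparable without translation. The sketch below uses an entirely hypothetical sense inventory and assumes one sense per word for simplicity (the paper's induction model handles ambiguity per occurrence):

```python
import numpy as np

# hypothetical sense inventory: English and Spanish words mapped into
# shared sense clusters (illustrative only, not the paper's output)
sense_of = {
    "bank": 0, "banco": 0,      # financial-institution sense
    "river": 1, "rio": 1,       # river sense
    "money": 2, "dinero": 2,
}

def sense_vector(doc, n_senses=3):
    """Represent a document as its normalised distribution over senses."""
    v = np.zeros(n_senses)
    for w in doc:
        if w in sense_of:
            v[sense_of[w]] += 1
    return v / max(v.sum(), 1)

en = sense_vector(["bank", "money", "money"])
es = sense_vector(["banco", "dinero", "dinero"])
```

The two documents, one English and one Spanish, map to identical sense distributions, so any standard clustering algorithm can group them despite sharing no surface vocabulary.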

