Latent semantic analysis for tagging activation states and identifiability in northwestern Mexican news outlets

Author(s):  
Manuel-Alejandro Sánchez-Fernández ◽  
Alfonso Medina-Urrea ◽  
Juan-Manuel Torres-Moreno

The present work studies the relationship between measures obtained from Latent Semantic Analysis (LSA) and a variant known as SPAN, on the one hand, and the activation and identifiability states (informative states) of referents in noun phrases, on the other, in news articles from northwestern Mexican outlets written in Spanish. The aim and challenge is to find a strategy for labelling new/given information in discourse that is rooted in a linguistically grounded stance. The new/given distinction can be defined from different perspectives, which vary in which linguistic forms they take into account; this work focuses on full referential devices (n = 2,388). Pearson's r correlation tests, analysis of variance, graphical exploration of label clustering, and a classification experiment with random forests were performed. The experiment used two labelling schemes, noun phrases labelled with all 10 informative-state tags and a binary labelling, as well as two bags of words for each noun phrase: the interior and the exterior. LSA in conjunction with the interior bag of words was found to classify certain informative states, and the same measure performed well on the binary division, detecting which sentences introduce new referents into discourse. Previous work applying a similar method to English noun phrases reached 80% accuracy (n = 478) in its classification exercise; our best test for Spanish reached 79%. No prior work on Spanish has used this method, and this kind of experiment is important because Spanish exhibits more complex inflectional morphology.
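The pipeline described above can be sketched in a few lines: a TF-IDF matrix over the bag of words is projected with truncated SVD (the standard LSA step), and the resulting features feed a random forest for the binary new/given decision. The noun phrases and labels below are invented toy data, not the article's corpus, and the character n-gram featurization is an assumption for illustration only.

```python
# Minimal sketch: LSA features (TF-IDF + truncated SVD) feeding a
# random forest for a binary new/given classification. Toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

phrases = [
    "un hombre", "el hombre", "una casa nueva", "la casa",
    "un perro grande", "el perro", "una mujer", "la mujer",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = new referent, 0 = given

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    TruncatedSVD(n_components=5, random_state=0),  # the LSA projection
    RandomForestClassifier(n_estimators=100, random_state=0),
)
clf.fit(phrases, labels)
print(clf.predict(["un gato"]))
```

On real data the interior and exterior bags of words would each be vectorized and projected separately before classification.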

2019 ◽  
Vol 10 (1) ◽  
pp. 29
Author(s):  
Yulius Denny Prabowo ◽  
Tedi Lesmana Marselino ◽  
Meylisa Suryawiguna

Extracting information from a large amount of structured data requires expensive computing. The Vector Space Model works by mapping words into a continuous vector space in which semantically similar words are mapped to nearby vectors. The model assumes that words appearing in the same contexts have similar semantic meaning. In practice there are two different approaches: count-based methods (e.g., Latent Semantic Analysis) and predictive methods (e.g., the Neural Probabilistic Language Model). This study aims to apply the Word2Vec method, using the Continuous Bag-of-Words approach, to the Indonesian language. The research data were obtained by crawling several online news portals. The expected result of the research is a vector mapping of Indonesian words based on the data used. Keywords: vector space model, word to vector, Indonesian vector space model.


AusArt ◽  
2016 ◽  
Vol 4 (1) ◽  
pp. 19-28
Author(s):  
Pilar Rosado Rodrigo ◽  
Eva Figueras Ferrer ◽  
Ferran Reverter Comes

From pixel to visual resonances: Images with voices

Abstract
The objective of our research is to develop a series of computer vision programs to search for analogies in large datasets, in this case collections of images of abstract paintings, based solely on their visual content, without textual annotation. We have programmed an algorithm based on a specific model of image description used in computer vision. The approach places a regular grid of interest points over each image and selects a pixel region around each node; dense features computed over this grid with overlapping patches, describing the grey-level gradients found in each region, are used to represent the images. By analysing the distances between the whole set of image descriptors, we group them according to their similarity, and each resulting group determines what we call a "visual word". This model is called the Bag-of-Words representation. Given the frequency with which each visual word occurs in each image, we apply pLSA (Probabilistic Latent Semantic Analysis), a statistical model that classifies the images according to their formal patterns fully automatically, without any textual annotation. In this way, we hope to develop a tool for both producing and analysing works of art.
Keywords: artificial vision; Bag-of-Words model; CBIR (Content-Based Image Retrieval); pLSA (Probabilistic Latent Semantic Analysis); visual word
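The pLSA step applied to the visual-word counts can be sketched directly: given a word-by-image count matrix, EM alternately estimates P(word|topic) and P(topic|image), and each image is assigned the topic with the highest posterior. The count matrix below is synthetic; in the article it would come from clustering dense grid descriptors into visual words.

```python
# Toy pLSA via EM on a synthetic visual-word x image count matrix.
import numpy as np

rng = np.random.default_rng(0)
n_words, n_images, n_topics = 20, 10, 2
counts = rng.poisson(3.0, size=(n_words, n_images)).astype(float)

p_wz = rng.dirichlet(np.ones(n_words), size=n_topics).T   # P(w|z), shape (W, Z)
p_zd = rng.dirichlet(np.ones(n_topics), size=n_images).T  # P(z|d), shape (Z, D)

for _ in range(50):
    # E-step: responsibilities P(z|w,d), normalized over topics.
    joint = p_wz[:, :, None] * p_zd[None, :, :]            # (W, Z, D)
    joint /= joint.sum(axis=1, keepdims=True) + 1e-12
    # M-step: re-estimate both distributions from expected counts.
    weighted = counts[:, None, :] * joint                  # (W, Z, D)
    p_wz = weighted.sum(axis=2)
    p_wz /= p_wz.sum(axis=0, keepdims=True)
    p_zd = weighted.sum(axis=0)
    p_zd /= p_zd.sum(axis=0, keepdims=True)

# Each image's "formal category" is its most probable topic.
print(p_zd.argmax(axis=0))
```

The fully automatic classification the abstract describes corresponds to this final argmax over P(topic|image), with no textual annotation involved anywhere.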


2002 ◽  
Vol 55 (3) ◽  
pp. 879-896 ◽  
Author(s):  
Timothy Desmet ◽  
Marc Brysbaert ◽  
Constantijn De Baecke

We examined the production of relative clauses in sentences with a complex noun phrase containing two possible attachment sites for the relative clause (e.g., “Someone shot the servant of the actress who was on the balcony.”). On the basis of two corpus analyses and two sentence continuation tasks, we conclude that much research about this specific syntactic ambiguity has used complex noun phrases that are quite uncommon. These noun phrases involve the relationship between two humans and, at least in Dutch, induce a different attachment preference from noun phrases referring to non-human entities. We provide evidence that the use of this type of complex noun phrase may have distorted the conclusions about the processes underlying relative clause attachment. In addition, it is shown that, notwithstanding some notable differences between sentence production in the continuation task and in coherent text writing, there seems to be a remarkable correspondence between the attachment patterns obtained with both modes of production.


2018 ◽  
Vol 9 (5) ◽  
pp. 17
Author(s):  
Ainul Azmin Md Zamin ◽  
Raihana Abu Hasan

An abstract, as the summary of a dissertation, harbours important information: it leads readers either to read the entire text or to abandon it. This study investigates the backward translation of abstracts by 10 randomly selected postgraduate students. The research serves as a guideline for students composing their abstracts, as it compares the differences in noun phrase structure between Malay translations and their English originals. It also analyses the types of errors that occur when English noun phrases are translated into Malay. Preliminary findings from this pilot study show that the translation errors committed were mainly inaccurate word order, inaccurate translation, added translation, dropped translation, and structure change. The study applies an exploratory mode of semantic analysis, looking at noun phrases, the meaningful groups of words that form a major part of any sentence, with the noun as the head of the group. Syntax is inevitably interwoven in the analysis, as the structure and grammatical aspects of the translations are also analysed; the English texts are examined against their corresponding translations in Malay. Particularly relevant in this study is the need to emphasise students' semantic and syntactic skills before good translation work can be produced. Language practitioners can also draw on translation activities to improve learners' language competency.


2014 ◽  
Vol 4 (1) ◽  
pp. 46-67
Author(s):  
Keisuke Inohara ◽  
Ryoko Honma ◽  
Takayuki Goto ◽  
Takashi Kusumi ◽  
Akira Utsumi

This study examined the relationship between reading literary novels and generating predictive inferences by analyzing a corpus of Japanese novels. Latent semantic analysis (LSA) was used to capture the statistical structure of the corpus. The authors then asked 74 Japanese college students to generate predictive inferences (e.g., "The newspaper burned") in response to Japanese event sentences (e.g., "A newspaper fell into a bonfire") and obtained more than 5,000 predicted events. The analysis showed a significant relationship between the LSA similarity of the event sentences to the predicted events and the frequency of those predicted events. This result suggests that exposure to literary works may help develop readers' inference-generation skills. In addition, two vector operation methods for constructing sentence vectors from word vectors were compared: the "Average" method and the "Predication Algorithm" method (Kintsch, 2001). The results support the superiority of the Predication Algorithm over the Average method.
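The "Average" construction compared in the study can be sketched directly: a sentence vector is the mean of its word vectors, and similarity is the cosine between the two means. The tiny three-dimensional word vectors below are invented for illustration; the study derived its vectors from a corpus of Japanese novels via LSA.

```python
# "Average" sentence vectors and cosine similarity, with toy vectors.
import numpy as np

word_vectors = {
    "newspaper": np.array([0.9, 0.1, 0.0]),
    "fell":      np.array([0.2, 0.7, 0.1]),
    "bonfire":   np.array([0.1, 0.2, 0.9]),
    "burned":    np.array([0.2, 0.3, 0.8]),
}

def sentence_vector(words):
    # Average method: the mean of the word vectors.
    return np.mean([word_vectors[w] for w in words], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

event = sentence_vector(["newspaper", "fell", "bonfire"])
inference = sentence_vector(["newspaper", "burned"])
print(round(cosine(event, inference), 3))  # -> 0.958
```

The Predication Algorithm differs by weighting the predicate's nearest semantic neighbours when combining vectors, rather than averaging all words uniformly.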


2012 ◽  
Vol 132 (9) ◽  
pp. 1473-1480
Author(s):  
Masashi Kimura ◽  
Shinta Sawada ◽  
Yurie Iribe ◽  
Kouichi Katsurada ◽  
Tsuneo Nitta

Author(s):  
Priyanka R. Patil ◽  
Shital A. Patil

Similarity View is an application for visually comparing and exploring multiple text models over a collection of documents. Friendbook infers users' lifestyles from user-centric sensor data, measures the similarity of lifestyles among users, and recommends friends to users whose lifestyles are highly similar; modelling a user's daily life as life documents, it extracts lifestyles using the Latent Dirichlet Allocation algorithm. Manual techniques cannot be used for checking research papers, as the assigned reviewer may have insufficient knowledge of the research disciplines involved, and differing subjective views can cause misinterpretations. There is therefore an urgent need for an effective and feasible approach to checking submitted research papers with the support of automated software. Text-mining methods address the problem of checking research papers semantically and automatically. The proposed method finds the similarity of texts in a collection of documents using the Latent Dirichlet Allocation (LDA) algorithm together with Latent Semantic Analysis (LSA). One variant, LSA with synonyms, finds synonyms of indexed terms using the English WordNet dictionary; the other, LSA without synonyms, measures text similarity on the indexed terms alone. The accuracy of LSA with synonyms is greater when synonyms are considered for matching.
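The LSA-with-synonyms comparison can be sketched as follows: documents are expanded with synonyms of their terms before the TF-IDF plus truncated SVD (LSA) projection, and pairs are then compared by cosine similarity. A small hand-made synonym table stands in for the English WordNet lookup, and all document texts are invented for illustration.

```python
# Sketch: synonym expansion before an LSA projection, then cosine
# similarity between documents. The synonym table is a stand-in
# for a WordNet lookup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

SYNONYMS = {"paper": ["article"], "check": ["verify"], "method": ["approach"]}

def expand(text):
    words = text.split()
    extra = [s for w in words for s in SYNONYMS.get(w, [])]
    return " ".join(words + extra)

docs = [
    "a method to check research paper similarity",
    "an approach to verify article similarity",
    "stock prices rose sharply today",
]
expanded = [expand(d) for d in docs]

tfidf = TfidfVectorizer().fit_transform(expanded)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
sims = cosine_similarity(lsa)

# The two paper-checking documents should score higher than the
# unrelated finance document.
print(sims[0, 1] > sims[0, 2])
```

Without the expansion step, "check"/"verify" and "paper"/"article" share no surface terms, which is exactly the gap the synonym variant is meant to close.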


This article examines the method of latent semantic analysis (LSA), its advantages and disadvantages, and how it can be adapted for arrays of unstructured data, which make up most of the information Internet users deal with. To extract context-dependent word meanings through statistical processing of large sets of textual data, the LSA method operates on numeric word-by-text matrices, whose rows correspond to words and whose columns correspond to texts. Words are integrated into themes, and text units are represented in the theme space, by applying a matrix decomposition to the data: singular value decomposition or non-negative matrix factorization. LSA studies have shown that the resulting word and text similarities closely match human judgement. Based on these methods, the author has developed and proposes a new way of finding semantic links between unstructured data, namely posts on social networks. The method combines latent semantic and frequency analyses: it processes the retrieved search results, splits each remaining text (post) into separate words, considers a window of n words to the left and right of each word, counts the occurrences of each term, and consults a pre-built semantic resource (dictionary, ontology, RDF schema, ...). The developed method and algorithm were tested on six well-known social networks, accessed through the API of each network. The average score of the author's results exceeded that of the networks' own search. The results obtained in the course of this work can be used in developing recommendation, search, and other systems concerned with finding, categorising, and filtering information.
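The word-by-text decomposition described above can be sketched with a plain SVD: rows are words, columns are texts, and truncating to k singular values projects the texts into a k-dimensional theme space where topically similar texts land close together. The tiny count matrix and labels below are illustrative assumptions.

```python
# Minimal LSA: truncated SVD of a word-by-text count matrix, then
# cosine similarity between texts in the theme space. Toy data only.
import numpy as np

words = ["network", "friend", "post", "stock", "price", "market"]
texts = ["social1", "social2", "finance1", "finance2"]
A = np.array([  # rows = words, columns = texts
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
    [0, 1, 1, 2],
], dtype=float)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # number of latent themes
text_themes = (np.diag(S[:k]) @ Vt[:k]).T    # texts in theme space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Texts on the same topic should be closer in theme space than
# texts on different topics.
print(cos(text_themes[0], text_themes[1]) > cos(text_themes[0], text_themes[2]))
```

Replacing `np.linalg.svd` with a non-negative matrix factorization gives the second decomposition the article mentions, at the cost of orthogonality but with directly interpretable theme weights.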

