An Automated Corpus Annotation Experiment in Brazilian Portuguese for Sentiment Analysis in Public Security

Author(s):  
Victor Diogho Heuer de Carvalho ◽  
Thyago Celso Cavalcante Nepomuceno ◽  
Ana Paula Cabral Seixas Costa
2021 ◽  
Vol 29 (2) ◽  
pp. 859
Author(s):  
Márcio De Souza Dias ◽  
Ariani Di Felippo ◽  
Amanda Pontes Rassi ◽  
Paula Cristina Figueira Cardoso ◽  
Fernando Antônio Asevedo Nóbrega ◽  
...  

Abstract: Automatic summaries commonly present diverse linguistic problems that affect textual quality and thus their understanding by users. Few studies have tried to characterize such problems and their relation to the performance of summarization systems. In this paper, we investigated the problems in multi-document extracts (i.e., summaries produced by concatenating several sentences taken exactly as they appear in the source texts) generated by systems for Brazilian Portuguese that have different approaches (i.e., superficial and deep) and performances (i.e., baseline and state-of-the-art methods). To that end, we first reviewed the main characterization studies, resulting in a typology of linguistic problems more suitable for multi-document summarization. Then, we manually annotated a corpus of automatic multi-document extracts in Portuguese based on the typology, which showed that some linguistic problems are significantly more recurrent than others. Thus, this corpus annotation may support research on the detection and correction of linguistic problems for summary improvement, allowing the production of automatic summaries that are not only informative (i.e., they convey the content of the source material), but also linguistically well structured.
Keywords: automatic summarization; multi-document summary; linguistic problem; corpus annotation.


Corpora ◽  
2017 ◽  
Vol 12 (1) ◽  
pp. 23-54 ◽  
Author(s):  
Paula C.F. Cardoso ◽  
Thiago A.S. Pardo ◽  
Maite Taboada

Subtopic segmentation aims to break documents into subtopical text passages, which develop a main topic in a text. Being capable of automatically detecting subtopics is very useful for several Natural Language Processing applications. For instance, in automatic summarisation, having the subtopics at hand enables the production of summaries with good subtopic coverage. Given the usefulness of subtopic segmentation, it is common to assemble a reference-annotated corpus that supports the study of the envisioned phenomena and the development and evaluation of systems. In this paper, we describe the subtopic annotation process in a corpus of news texts written in Brazilian Portuguese, following a systematic annotation process and answering the main research questions when performing corpus annotation. Based on this corpus, we propose novel methods for subtopic segmentation following patterns of discourse organisation, specifically using Rhetorical Structure Theory. We show that discourse structures mirror the subtopic changes in news texts. An important outcome of this work is the freely available annotated corpus, which, to the best of our knowledge, is the only one for Portuguese. We demonstrate that some discourse knowledge may significantly help to find boundaries automatically in a text. In particular, the relation type and the level of the tree structure are important features.


2021 ◽  
Vol 13 (1) ◽  
pp. 1-20
Author(s):  
Victor Diogho Heuer de Carvalho ◽  
Ana Paula Cabral Seixas Costa

This article presents (1) the results of a literature review on social web mining and sentiment analysis on public security; (2) the idea of a framework for the analytical process involved in the literature review themes; and (3) a research agenda with a perspective for future studies, considering some elements of the analytical process. The literature review was based on searches of five databases: Scopus, IEEE Xplore, Web of Science, ScienceDirect, and Springer Link. Search strings were applied to retrieve four kinds of literature material, with no initial cutoff date, in order to capture the full publication history associated with the main theme. After some filtering, primary and secondary findings were separated, enabling the identification of elements for the framework. Finally, the research agenda is presented, containing a set of three research artifacts related to the proposed framework.


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Jianguo Sun ◽  
Hanqi Yin ◽  
Ye Tian ◽  
Junpeng Wu ◽  
Linshan Shen ◽  
...  

Large amounts of data are widely stored in cyberspace. Not only can they bring much convenience to people’s lives and work, but they can also assist work in the information security field, such as microexpression recognition and sentiment analysis in criminal investigations. Thus, it is of great significance to recognize and analyze sentiment information, which is usually expressed through different modalities. Because data from different modalities are correlated, multimodal data can provide more comprehensive and robust information than unimodal data in data analysis tasks. The complementary information from different modalities can be obtained by multimodal fusion methods. These approaches process multimodal data through fusion algorithms and help ensure the accuracy of the information used for subsequent classification or prediction tasks. In this study, a two-level multimodal fusion (TlMF) method with both data-level and decision-level fusion is proposed for the sentiment analysis task. In the data-level fusion stage, a tensor fusion network is utilized to obtain the text-audio and text-video embeddings by fusing the text with audio and video features, respectively. During the decision-level fusion stage, the soft fusion method is adopted to fuse the classification or prediction results of the upstream classifiers, so that the final classification or prediction results can be as accurate as possible. The proposed method is tested on the CMU-MOSI, CMU-MOSEI, and IEMOCAP datasets, and the empirical results and ablation studies confirm the effectiveness of TlMF in capturing useful information from all the test modalities.
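
The abstract names the two fusion levels but does not give their formulas, so the following is only a minimal sketch under stated assumptions: data-level fusion is taken to be the outer-product form commonly used in tensor fusion networks (each embedding extended with a constant 1), and soft fusion is taken to be a plain average of class probabilities. The function names `tensor_fuse`, `soft_fuse`, and `predict` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def tensor_fuse(text: Sequence[float], other: Sequence[float]) -> List[float]:
    """Data-level fusion (assumed form): flattened outer product of two
    embeddings, each extended with a constant 1 so that the unimodal
    features survive alongside the pairwise interaction terms."""
    t = [1.0, *text]
    o = [1.0, *other]
    return [a * b for a in t for b in o]


def soft_fuse(prob_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Decision-level fusion (assumed form): average the class-probability
    vectors produced by the upstream classifiers."""
    n = len(prob_vectors)
    return [sum(column) / n for column in zip(*prob_vectors)]


def predict(prob_vectors: Sequence[Sequence[float]]) -> int:
    """Return the index of the class with the highest fused probability."""
    fused = soft_fuse(prob_vectors)
    return max(range(len(fused)), key=fused.__getitem__)


# Toy embeddings; real text, audio, and video features would be much larger.
text_audio = tensor_fuse([0.1, 0.2], [0.3, 0.4, 0.5])
text_video = tensor_fuse([0.1, 0.2], [0.6, 0.7])

# Hypothetical outputs of two upstream classifiers, one per fused embedding.
label = predict([[0.5, 0.5], [1.0, 0.0]])
```

In this sketch the fused embedding has (len(text) + 1) × (len(other) + 1) entries, which is why outer-product fusion is usually applied to low-dimensional embeddings.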


Author(s):  
Abdulrahman Alqarafi ◽  
Ahsan Adeel ◽  
Ahmed Hawalah ◽  
Kevin Swingler ◽  
Amir Hussain
