Abstractive text summarization and new large-scale datasets for agglutinative languages Turkish and Hungarian

Author(s):
Batuhan Baykara
Tunga Güngör


2002
Vol 8 (2-3)
pp. 209-233
Author(s):
Olivier Ferret
Brigitte Grau

Topic analysis is important for many applications dealing with texts, such as text summarization or information extraction. However, it can be done with great precision only if it relies on structured knowledge, which is difficult to produce on a large scale. In this paper, we propose using bootstrapping to solve this problem: a first topic analysis based on a weakly structured source of knowledge, a collocation network, is used for learning explicit topic representations that then support a more precise and reliable topic analysis.
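To make the bootstrapping idea concrete, here is a minimal sketch, not the authors' implementation: the function names, cohesion measure, and threshold are illustrative assumptions. A collocation network is built from windowed co-occurrence counts (the weakly structured knowledge source), used to score the topical cohesion of text segments, and the cohesive segments are then merged into an explicit topic representation for a second, more precise analysis pass.

```python
# Minimal sketch of topic bootstrapping over a collocation network.
# Names and the cohesion measure are illustrative assumptions.
from collections import Counter, defaultdict
from itertools import combinations

def build_collocation_network(sentences, window=2):
    """Weakly structured knowledge: co-occurrence counts between words."""
    network = defaultdict(Counter)
    for words in sentences:
        for i, w in enumerate(words):
            for u in words[i + 1 : i + 1 + window]:
                if u != w:
                    network[w][u] += 1
                    network[u][w] += 1
    return network

def cohesion(words, network):
    """Average pairwise collocation strength of a text span."""
    pairs = list(combinations(set(words), 2))
    if not pairs:
        return 0.0
    return sum(network[a][b] for a, b in pairs) / len(pairs)

def bootstrap_topics(segments, network, threshold=1.0):
    """First-pass analysis: keep cohesive segments and merge them into an
    explicit topic representation (a word-frequency signature) that can
    support a second, more precise analysis pass."""
    topic = Counter()
    for seg in segments:
        if cohesion(seg, network) >= threshold:
            topic.update(seg)
    return topic
```

In practice a first-pass analysis would segment the text before `bootstrap_topics` is applied; the sketch shows only the learning step.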


2020
Vol 34 (05)
pp. 7651-7658
Author(s):  
Yang Deng
Wai Lam
Yuexiang Xie
Daoyuan Chen
Yaliang Li
...  

Community question answering (CQA) has recently gained popularity in both academia and industry. However, the redundancy and length of crowdsourced answers limit the performance of answer selection and lead to reading difficulties and misunderstandings for community users. To solve these problems, we tackle the tasks of answer selection and answer summary generation in CQA with a novel joint learning model. Specifically, we design a question-driven pointer-generator network, which exploits the correlation between question-answer pairs to help attend to the essential information when generating answer summaries. Meanwhile, we leverage the answer summaries to alleviate noise in the original lengthy answers when ranking the relevance of question-answer pairs. In addition, we construct a new large-scale CQA corpus, WikiHowQA, which contains long answers for answer selection as well as reference summaries for answer summarization. The experimental results show that the joint learning method effectively addresses the answer redundancy issue in CQA and achieves state-of-the-art results on both the answer selection and text summarization tasks. Furthermore, the proposed model transfers well to resource-poor CQA tasks, which lack reference answer summaries.
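A minimal sketch of what the question-driven attention step inside a pointer-generator decoder might look like; it illustrates the idea rather than reproducing the paper's model. The projection modules W_s, W_h, W_q (each an nn.Linear(H, H)) and v (nn.Linear(H, 1)) are assumed here.

```python
# Illustrative question-driven attention for a pointer-generator decoder.
# Tensor names and module signatures are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def question_driven_attention(dec_state, ans_enc, q_repr, W_s, W_h, W_q, v):
    """Attend over answer tokens, biased by the question representation.

    dec_state: (B, H)    decoder hidden state
    ans_enc:   (B, T, H) encoder states of the answer tokens
    q_repr:    (B, H)    pooled question representation
    """
    scores = v(torch.tanh(
        W_h(ans_enc)                      # per-token features
        + W_s(dec_state).unsqueeze(1)     # decoder query
        + W_q(q_repr).unsqueeze(1)        # question bias: the "question-driven" part
    )).squeeze(-1)                        # (B, T)
    attn = F.softmax(scores, dim=-1)
    context = torch.bmm(attn.unsqueeze(1), ans_enc).squeeze(1)  # (B, H)
    return attn, context
```

As in a standard pointer-generator, the resulting context vector would then be combined with the decoder state to compute the generation probability that mixes vocabulary generation with copying from the answer.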


Author(s):  
Lijun Wu
Li Zhao
Tao Qin
Jianhuang Lai
Tie-Yan Liu

Reinforcement learning (RL), which has been successfully applied to sequence prediction, introduces reward as a sequence-level supervision signal to evaluate the quality of a generated sequence. Existing RL approaches use the ground-truth sequence to define the reward, which limits the application of RL techniques to labeled data. Since labeled data is usually scarce and/or costly to collect, it is desirable to leverage large-scale unlabeled data. In this paper, we extend existing RL methods for sequence prediction to exploit unlabeled data. We propose to learn the reward function from labeled data and use the predicted reward as a pseudo reward for unlabeled data, so that we can learn from unlabeled data using this pseudo reward. To obtain a good pseudo reward on unlabeled data, we propose an RNN-based reward network with an attention mechanism, trained on a purposely biased data distribution. Experiments show that the pseudo reward provides good supervision and guides the learning process on unlabeled data. We observe significant improvements on both neural machine translation and text summarization.
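The following sketch illustrates the pseudo-reward idea under simplifying assumptions: the reward network here scores only the generated tokens (a fuller model would also condition on the source sequence), and the training procedure is outlined in comments. All names are illustrative.

```python
# Simplified sketch of an RNN-based reward network with attention,
# used to produce pseudo rewards on unlabeled data. Not the paper's
# exact architecture; names and sizes are assumptions.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """GRU encoder with additive attention pooling that maps a token
    sequence to a scalar reward; trained on labeled pairs."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.attn = nn.Linear(dim, 1)
        self.out = nn.Linear(dim, 1)

    def forward(self, tokens):                  # tokens: (B, T)
        h, _ = self.rnn(self.emb(tokens))       # (B, T, D)
        w = torch.softmax(self.attn(h), dim=1)  # attention over time steps
        pooled = (w * h).sum(dim=1)             # (B, D)
        return self.out(pooled).squeeze(-1)     # predicted reward, (B,)

# Training sketch:
# 1. Fit RewardNet on labeled data so its output matches the true
#    sequence-level reward (e.g., BLEU against the reference).
# 2. For an unlabeled source, sample y from the policy, set
#    r = RewardNet(y) as the pseudo reward, and apply REINFORCE:
#    loss = -(r - baseline) * log_prob(y).
```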


Author(s):  
Shen Gao
Xiuying Chen
Piji Li
Zhaochun Ren
Lidong Bing
...  

In the field of neural abstractive summarization, conventional sequence-to-sequence models often summarize the wrong aspect of the document rather than its main aspect. To tackle this problem, we propose the task of reader-aware abstractive summary generation, which utilizes reader comments to help the model produce a better summary of the main aspect. Unlike the traditional abstractive summarization task, reader-aware summarization confronts two main challenges: (1) comments are informal and noisy; (2) jointly modeling the news document and the reader comments is challenging. To tackle these challenges, we design an adversarial learning model named reader-aware summary generator (RASG), which consists of four components: (1) a sequence-to-sequence based summary generator; (2) a reader attention module capturing the reader-focused aspects; (3) a supervisor modeling the semantic gap between the generated summary and the reader-focused aspects; (4) a goal tracker producing the goal for each generation step. The supervisor and the goal tracker are used to guide the training of our framework in an adversarial manner. Extensive experiments are conducted on our large-scale real-world text summarization dataset, and the results show that RASG achieves state-of-the-art performance in terms of both automatic metrics and human evaluations. The experimental results also demonstrate the effectiveness of each module in our framework. We release our large-scale dataset for further research.
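A minimal sketch of the reader-attention and supervisor ideas, assuming pre-encoded document, comment, and summary representations; shapes and names are illustrative, not the released RASG code.

```python
# Illustrative reader attention and semantic-gap supervisor.
# Shapes and function names are assumptions.
import torch
import torch.nn.functional as F

def reader_focus(doc_enc, comment_enc):
    """Align comment tokens to document tokens; document words that many
    comment tokens attend to approximate the reader-focused aspects.

    doc_enc:     (B, Td, H) encoded news document
    comment_enc: (B, Tc, H) encoded reader comments
    """
    sim = torch.bmm(comment_enc, doc_enc.transpose(1, 2))  # (B, Tc, Td)
    align = F.softmax(sim, dim=-1)                         # comment -> doc attention
    focus_weights = align.mean(dim=1)                      # (B, Td), sums to 1
    aspect_vec = torch.bmm(focus_weights.unsqueeze(1), doc_enc).squeeze(1)
    return focus_weights, aspect_vec

def semantic_gap(summary_vec, aspect_vec):
    """Supervisor signal: distance between the generated summary
    representation and the reader-focused aspect representation."""
    return 1.0 - F.cosine_similarity(summary_vec, aspect_vec, dim=-1)
```

In an adversarial setup of this kind, the generator would be pushed to shrink the gap while the supervisor learns to detect it; the goal tracker would turn the gap into a per-step generation target.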


Author(s):  
William Darling

This chapter discusses approaches to applying text summarization research to the real-world problem of opinion summarization of user comments. Following a brief overview of the history of research in text summarization, the authors consider large-scale user opinion summarization on the Web, a summarization problem distinct from the traditional domain that research has focused on until very recently. More specifically, they consider opinion summarization of large datasets that generally contain substantial noise and little editorial structure. To deal with this kind of real-world problem, the chapter addresses three major areas that must be considered when designing systems for this type of problem: simple techniques, domain knowledge, and evaluative testing. Each area is covered in detail, and throughout the chapter the lessons are illustrated by a case study that applies the recommendations to the design of a real-world opinion summarization system for a fictional book publisher.


2020
Vol 34 (05)
pp. 8188-8195
Author(s):  
Haoran Li
Peng Yuan
Song Xu
Youzheng Wu
Xiaodong He
...  

We present an abstractive summarization system that produces summaries for Chinese e-commerce products. This task is more challenging than general text summarization. First, the appearance of a product typically plays a significant role in customers' decisions to buy the product or not, which requires that the summarization model effectively use the visual information of the product. Furthermore, different products have remarkable features in various aspects, such as "energy efficiency" and "large capacity" for refrigerators, and different customers may care about different aspects. Thus, the summarizer needs to capture the most attractive aspects of a product that resonate with potential purchasers. We propose an aspect-aware multimodal summarization model that can effectively incorporate the visual information and also determine the most salient aspects of a product. We construct a large-scale Chinese e-commerce product summarization dataset that contains approximately 1.4 million manually created product summaries paired with detailed product information, including an image, a title, and other textual descriptions for each product. The experimental results on this dataset demonstrate that our models significantly outperform the compared methods in terms of both the ROUGE score and manual evaluations.
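A schematic sketch of aspect-aware multimodal fusion in the spirit of this model; the layer names, dimensions, and number of aspects are assumptions rather than the paper's exact architecture.

```python
# Schematic aspect-aware multimodal fusion: ground the text encoding in
# the image feature, predict salient aspects, and build an aspect-aware
# context for the decoder. Names and sizes are illustrative.
import torch
import torch.nn as nn

class AspectAwareFusion(nn.Module):
    def __init__(self, text_dim=512, img_dim=2048, n_aspects=10):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, text_dim)      # project CNN image feature
        self.aspect_emb = nn.Embedding(n_aspects, text_dim)
        self.aspect_cls = nn.Linear(text_dim, n_aspects)  # predict salient aspects

    def forward(self, text_vec, img_feat):
        """text_vec: (B, D) pooled product text; img_feat: (B, Di) image feature."""
        fused = text_vec + self.img_proj(img_feat)        # visual grounding
        aspect_probs = torch.softmax(self.aspect_cls(fused), dim=-1)
        # Aspect-aware context: a soft mixture of aspect embeddings that
        # the decoder can condition on so the summary stresses the
        # product's most salient aspects.
        aspect_ctx = aspect_probs @ self.aspect_emb.weight  # (B, D)
        return fused + aspect_ctx, aspect_probs
```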


2020
Author(s):
Michael Hahn
Judith Degen
Richard Futrell

Memory limitations are known to constrain language comprehension and production, and have been argued to account for crosslinguistic word order regularities. However, a systematic assessment of the role of memory limitations in language structure has proven elusive, in part because it is hard to extract precise large-scale quantitative generalizations about language from existing mechanistic models of memory use in sentence processing. We provide an architecture-independent information-theoretic formalization of memory limitations which enables a simple calculation of the memory efficiency of languages. Our notion of memory efficiency is based on the idea of a memory–surprisal tradeoff: a certain level of average surprisal per word can only be achieved at the cost of storing some amount of information about past context. Based on this notion of memory usage, we advance the Efficient Tradeoff Hypothesis: the order of elements in natural language is under pressure to enable favorable memory–surprisal tradeoffs. We derive that languages enable more efficient tradeoffs when they exhibit information locality: when predictive information about an element is concentrated in its recent past. We provide empirical evidence from three test domains in support of the Efficient Tradeoff Hypothesis: a reanalysis of a miniature artificial language learning experiment, a large-scale study of word order in corpora of 54 languages, and an analysis of morpheme order in two agglutinative languages. These results suggest that principles of order in natural language can be explained via highly generic cognitively motivated principles and lend support to efficiency-based models of the structure of human language.
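The memory–surprisal tradeoff admits a simple empirical approximation: let H_t be the conditional entropy of a word given the t preceding words; each additional unit of lookback t reduces surprisal by I_t = H_{t-1} - H_t at a memory cost of t · I_t. The sketch below estimates the curve from n-gram counts; the maximum-likelihood estimator and token-level treatment are simplifying assumptions, not the paper's exact procedure.

```python
# Minimal sketch of a memory-surprisal tradeoff curve under an n-gram
# approximation. The MLE entropy estimator is a simplifying assumption.
import math
from collections import Counter

def conditional_entropy(corpus, t):
    """H(w | previous t words) in bits, estimated from n-gram counts."""
    ctx, joint = Counter(), Counter()
    for i in range(t, len(corpus)):
        c = tuple(corpus[i - t : i])
        ctx[c] += 1
        joint[c + (corpus[i],)] += 1
    n = sum(joint.values())
    return -sum(k / n * math.log2(k / ctx[g[:t]]) for g, k in joint.items())

def memory_surprisal_curve(corpus, max_t=4):
    """Points (memory, surprisal): lookback t buys a surprisal reduction
    I_t = H_{t-1} - H_t at a memory cost of t * I_t."""
    H = [conditional_entropy(corpus, t) for t in range(max_t + 1)]
    curve, memory = [(0.0, H[0])], 0.0
    for t in range(1, max_t + 1):
        memory += t * (H[t - 1] - H[t])
        curve.append((memory, H[t]))
    return curve
```

Under this formalization, a language exhibits information locality when the I_t mass is concentrated at small t, which yields a curve that drops steeply for little memory.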

