A Systematic Survey on Multi-document Text Summarization

Automatic text summarization is the task of generating a short and accurate summary of a longer text document. Text summarization can be classified based on the number of input documents (single-document and multi-document summarization) and based on the characteristics of the summary generated (extractive and abstractive summarization). Multi-document summarization is an automatic process of creating a relevant, informative, and concise summary from a cluster of related documents. This paper presents a detailed survey of the existing literature on the various approaches to text summarization. A few of the most popular approaches, such as graph-based, cluster-based, and deep-learning-based summarization techniques, are discussed here along with the evaluation metrics, which can provide insight to future researchers.
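Of the evaluation metrics this survey covers, ROUGE-N is the most widely used. As a generic illustration (not tied to any one surveyed paper), ROUGE-1 can be computed from unigram overlap between a candidate and a reference summary:

```python
from collections import Counter

def rouge_1(candidate, reference):
    """ROUGE-1 recall, precision, and F1 from unigram (multiset) overlap."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())  # clipped unigram matches
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

ref = "the cat sat on the mat".split()
cand = "the cat lay on the mat".split()
scores = rouge_1(cand, ref)  # recall = precision = F1 = 5/6 here
```

Counting with `Counter` rather than sets clips repeated words correctly, which matters for longer summaries.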

2021 ◽  
Vol 10 (2) ◽  
pp. 42-60
Author(s):  
Khadidja Chettah ◽  
Amer Draa

Automatic text summarization has recently become a key instrument for reducing the huge quantity of textual data. In this paper, the authors propose a quantum-inspired genetic algorithm (QGA) for extractive single-document summarization. The QGA is used inside a fully automated system as an optimizer that searches for the best combination of sentences to put in the final summary. The presented approach is compared with 11 reference methods, including supervised and unsupervised summarization techniques. The performance of the proposed approach is evaluated on the DUC 2001 and DUC 2002 datasets using the ROUGE-1 and ROUGE-2 metrics. The obtained results show that the proposal can compete with other state-of-the-art methods: it ranks first out of 12, outperforming all the other algorithms.
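The quantum-inspired operators are specific to the paper, but the underlying search problem — optimizing a binary sentence-selection mask — can be sketched with a plain classical genetic algorithm. The coverage-based fitness and all parameters below are illustrative assumptions, not the authors' implementation:

```python
import random

def fitness(mask, sentences, doc_words, max_len=20):
    """Fraction of document vocabulary covered by the selected sentences,
    with a hard word-length cap (infeasible summaries score zero)."""
    chosen = [s for s, keep in zip(sentences, mask) if keep]
    length = sum(len(s.split()) for s in chosen)
    if length > max_len:
        return 0.0
    words = {w for s in chosen for w in s.split()}
    return len(words & doc_words) / len(doc_words)

def ga_summarize(sentences, generations=40, pop_size=20, seed=0):
    rng = random.Random(seed)
    doc_words = {w for s in sentences for w in s.split()}
    n = len(sentences)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda m: fitness(m, sentences, doc_words),
                        reverse=True)
        pop = scored[:pop_size // 2]          # elitist truncation selection
        while len(pop) < pop_size:
            a, b = rng.sample(scored[:10], 2)
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]         # one-point crossover
            child[rng.randrange(n)] ^= 1      # bit-flip mutation
            pop.append(child)
    best = max(pop, key=lambda m: fitness(m, sentences, doc_words))
    return [s for s, keep in zip(sentences, best) if keep]

sents = ["the cat sat", "a dog barked", "the cat sat on the mat", "birds fly high"]
summary = ga_summarize(sents)
```

The QGA replaces the bitstrings with qubit-like probability amplitudes and the mutation with a rotation-gate update, but the selection/evaluation loop has the same shape.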


In a world where information grows rapidly every single day, we need tools that generate summaries and headlines from text which are accurate as well as short and precise. In this paper, we describe a method for generating headlines from articles. This is done by applying a hybrid pointer-generator network with an attention distribution and a coverage mechanism to the article, which produces an abstractive summary, followed by an encoder-decoder recurrent neural network with LSTM units that generates a headline from the summary. The hybrid pointer-generator model helps remove inaccuracies as well as repetition. We use CNN/Daily Mail as our dataset.
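As a toy illustration of the pointer-generator idea (not this paper's network), the decoder's final output distribution mixes a generation distribution over the vocabulary with a copy distribution given by the attention over source tokens; this mixing is what lets the model emit out-of-vocabulary words that appear in the source:

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """P_final(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention on w.

    p_gen         -- scalar in [0, 1], the generation probability
    vocab_dist    -- dict token -> P_vocab(token), sums to 1
    attention     -- attention weights over the source, sum to 1
    source_tokens -- source tokens aligned with the attention weights
    """
    final = {tok: p_gen * p for tok, p in vocab_dist.items()}
    for tok, a in zip(source_tokens, attention):
        final[tok] = final.get(tok, 0.0) + (1.0 - p_gen) * a
    return final

dist = final_distribution(
    p_gen=0.7,
    vocab_dist={"the": 0.5, "cat": 0.5},
    attention=[0.9, 0.1],
    source_tokens=["zygote", "cat"],   # "zygote" is out-of-vocabulary
)
```

Because the copy term routes probability mass to `"zygote"`, the model can output it even though the generator assigns it zero probability; the coverage mechanism (not shown) additionally penalizes attending to the same source positions repeatedly, which is what curbs repetition.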


MATICS ◽  
2020 ◽  
Vol 12 (2) ◽  
pp. 111-116
Author(s):  
Muhammad Adib Zamzam

Text summarization is an approach that can be used to condense a long article into a shorter, more compact text, so that the relatively short summary can stand in for the long original. Automatic text summarization is summarization performed automatically by a computer. There are two kinds of automatic text summarization algorithms: extraction-based summarization and abstractive summarization. The TextRank algorithm is extraction-based, or extractive, where extraction means selecting text units (sentences, sentence segments, paragraphs, or passages) considered to contain the important information of the document, and arranging those units (sentences) in the right order. Experiments with 50 input articles and summaries of 12.5% of the original text length show that the system achieves a ROUGE recall of 41.659%. The highest ROUGE recall was recorded for article 48, with a value of 0.764; the lowest for article 37, with a value of 0.167.
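A minimal TextRank sketch in Python: build a sentence-similarity graph, run PageRank-style power iteration, and rank sentences by score. The `+ 1` inside the logarithms is a guard for one-word sentences and is an assumption of this sketch, not part of the original formulation:

```python
import math

def similarity(s1, s2):
    """Shared-word overlap, normalised by sentence lengths (log-scaled)."""
    w1, w2 = set(s1.split()), set(s2.split())
    overlap = len(w1 & w2)
    if overlap == 0:
        return 0.0
    return overlap / (math.log(len(w1) + 1) + math.log(len(w2) + 1))

def textrank(sentences, d=0.85, iters=50):
    """Return sentence indices ranked by PageRank over the similarity graph."""
    n = len(sentences)
    sim = [[similarity(a, b) if i != j else 0.0
            for j, b in enumerate(sentences)]
           for i, a in enumerate(sentences)]
    out_sum = [sum(row) or 1.0 for row in sim]   # avoid division by zero
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - d) / n
                  + d * sum(sim[j][i] / out_sum[j] * scores[j] for j in range(n))
                  for i in range(n)]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

ranked = textrank(["a b c", "b c d", "x y z"])
```

An extractive summary then keeps the top-k ranked sentences but emits them in their original document order.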


Information ◽  
2020 ◽  
Vol 11 (2) ◽  
pp. 78 ◽  
Author(s):  
Tulu Tilahun Hailu ◽  
Junqing Yu ◽  
Tessfu Geteye Fantaye

Text summarization is a process of producing a concise version (summary) of text from one or more information sources. If the generated summary preserves the meaning of the original text, it helps users make fast and effective decisions. However, how much meaning of the source text is preserved is becoming harder to evaluate. The most commonly used automatic evaluation metrics, such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE), rely strictly on the overlapping n-gram units between reference and candidate summaries, which makes them unsuitable for measuring the quality of abstractive summaries. Another major challenge in evaluating text summarization systems is the lack of consistent, ideal reference summaries. Studies show that human summarizers can produce variable reference summaries of the same source, which can significantly affect the automatic evaluation scores of summarization systems. Humans are biased toward certain content while producing a summary; even the same person may produce substantially different summaries of the same source at different times. This paper proposes a word-embedding-based automatic text summarization and evaluation framework, which determines the salient top-n sentences of a source text as a reference summary and evaluates the quality of system summaries against it. Extensive experimental results demonstrate that the proposed framework is effective and able to outperform several baseline methods, with regard to both text summarization systems and automatic evaluation metrics, when tested on a publicly available dataset.
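The core of an embedding-based evaluation of this kind can be sketched as cosine similarity between averaged word vectors, which scores paraphrases that share no surface n-grams. The 2-d embeddings below are toy values for illustration; a real system would load pre-trained vectors:

```python
import math

def sentence_vector(tokens, embeddings):
    """Average the word vectors of the tokens that have an embedding."""
    dim = len(next(iter(embeddings.values())))
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy 2-d embeddings: "car" and "automobile" point the same way.
emb = {"car": [1.0, 0.1], "automobile": [0.9, 0.2], "banana": [0.0, 1.0]}
ref = sentence_vector(["car"], emb)
cand = sentence_vector(["automobile"], emb)
score = cosine(ref, cand)
```

Unlike ROUGE, `cosine(ref, cand)` is high here despite zero n-gram overlap, which is exactly the failure mode of n-gram metrics on abstractive summaries.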


2020 ◽  
Vol 21 (2) ◽  
Author(s):  
Sheena Kurian K ◽  
Sheena Mathew

The number of scientific or research papers published every year is growing at an exponential rate, which has led to intensive research in scientific document summarization. The different methods commonly used in automatic text summarization are discussed in this paper, with their pros and cons. Commonly used evaluation techniques and datasets in this field are also discussed. ROUGE and Pyramid scores of the different methods are tabulated for easy comparison of the results.


Author(s):  
Erwin Yudi Hidayat ◽  
Fahri Firdausillah ◽  
Khafiizh Hastuti ◽  
Ika Novita Dewi ◽  
Azhari Azhari

In this paper, we present Latent Dirichlet Allocation (LDA) in automatic text summarization to improve accuracy in document clustering. The experiments involve a dataset of 398 public blog articles obtained using a Python Scrapy crawler and scraper. The steps of clustering in this research are preprocessing, automatic document compression using the feature method, automatic document compression using LDA, word weighting, and the clustering algorithm. The results show that automatic document summarization with LDA reaches 72% accuracy at an LDA compression level of 40%, compared to the traditional k-means method, which reaches only 66%.
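For reference, the traditional k-means baseline the LDA pipeline is compared against is Lloyd's algorithm over document feature vectors. A self-contained sketch, with toy 2-d points standing in for weighted document vectors:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; points are equal-length feature vectors."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean.
        centroids = [[sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

pts = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
cents, clus = kmeans(pts, 2)
```

In the paper's pipeline the input vectors would instead come from the word-weighting step applied to the LDA-compressed documents.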


Author(s):  
Hui Lin ◽  
Vincent Ng

The focus of automatic text summarization research has exhibited a gradual shift from extractive methods to abstractive methods in recent years, owing in part to advances in neural methods. Originally developed for machine translation, neural methods provide a viable framework for obtaining an abstract representation of the meaning of an input text and generating informative, fluent, and human-like summaries. This paper surveys existing approaches to abstractive summarization, focusing on the recently developed neural approaches.


2020 ◽  
Vol 34 (01) ◽  
pp. 11-18
Author(s):  
Yue Cao ◽  
Xiaojun Wan ◽  
Jinge Yao ◽  
Dian Yu

Automatic text summarization aims at producing a shorter version of the input text that conveys the most important information. However, multi-lingual text summarization, where the goal is to process texts in multiple languages and output summaries in the corresponding languages with a single model, has been rarely studied. In this paper, we present MultiSumm, a novel multi-lingual model for abstractive summarization. The MultiSumm model uses the following training regime: (I) multi-lingual learning that contains language model training, auto-encoder training, translation and back-translation training, and (II) joint summary generation training. We conduct experiments on summarization datasets for five rich-resource languages: English, Chinese, French, Spanish, and German, as well as two low-resource languages: Bosnian and Croatian. Experimental results show that our proposed model significantly outperforms a multi-lingual baseline model. Specifically, our model achieves comparable or even better performance than models trained separately on each language. As an additional contribution, we construct the first summarization dataset for Bosnian and Croatian, containing 177,406 and 204,748 samples, respectively.

