Building a Korean Sentence-Compression Corpus by Analyzing Sentences and Deleting Words

2021 · Vol 48 (2) · pp. 183-194
Author(s): GyoungHo Lee, Yo-Han Park, Kong Joo Lee
2016 · Vol 78 (8)
Author(s): Suraya Alias, Siti Khaotijah Mohammad, Gan Keng Hoon, Tan Tien Ping

A text summary serves as a condensed representation of a written input source in which important and salient information is kept. However, the condensed representation lacks semantic coherence if the summary is produced verbatim from the input itself. Sentence Compression is a technique in which unimportant details are eliminated from a sentence while preserving its grammatical pattern. In this study, we analyzed our developed Malay Text Corpus to discover the rules and patterns by which human summarizers compress sentences and eliminate unimportant constituents to construct a summary. A Pattern-Growth-based model named Frequent Eliminated Pattern (FASPe) is introduced to represent the text using sets of adjacent word sequences that are frequently eliminated across the document collection. From the rules obtained, heuristic knowledge about Sentence Compression is presented, with confidence values as high as 85%, which can serve as a reference for Text Summarization in the Malay language.
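The eliminated-pattern idea can be illustrated with a minimal sketch: given (original, compressed) sentence pairs, collect the contiguous word spans that were deleted and keep the spans that recur above a support threshold. Everything below (the function names, the toy English pairs, the word-set deletion check) is hypothetical and simplified, not the paper's actual FASPe algorithm, which applies pattern-growth mining to the Malay corpus.

```python
from collections import Counter

def eliminated_spans(original, compressed):
    """Return contiguous word spans present in the original sentence
    but deleted from its compressed version (word-set approximation)."""
    kept = set(compressed)  # assumes compression only deletes words
    spans, current = [], []
    for word in original:
        if word in kept:
            if current:
                spans.append(tuple(current))
                current = []
        else:
            current.append(word)
    if current:
        spans.append(tuple(current))
    return spans

def frequent_eliminated_patterns(pairs, min_support=2):
    """Count eliminated spans across all (original, compressed) pairs
    and keep those meeting the minimum support threshold."""
    counts = Counter()
    for original, compressed in pairs:
        counts.update(eliminated_spans(original, compressed))
    return {p: c for p, c in counts.items() if c >= min_support}

# toy English pairs for illustration (not from the Malay corpus)
pairs = [
    ("the red car stopped at the light".split(), "the car stopped".split()),
    ("a red car passed by".split(), "a car passed by".split()),
]
print(frequent_eliminated_patterns(pairs))  # → {('red',): 2}
```

A span's support divided by the number of sentences containing it would give the kind of confidence value the study reports for its rules.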


2020 · Vol 34 (05) · pp. 8050-8057
Author(s): Hidetaka Kamigaito, Manabu Okumura

Sentence compression is the task of compressing a long sentence into a short one by deleting redundant words. In sequence-to-sequence (Seq2Seq) based models, the decoder unidirectionally decides whether to retain or delete each word. Thus, it usually cannot explicitly capture the relationships between decoded words and unseen words that will be decoded in future time steps. Therefore, to avoid generating ungrammatical sentences, the decoder sometimes drops important words when compressing sentences. To solve this problem, we propose a novel Seq2Seq model, the syntactically look-ahead attention network (SLAHAN), which can generate informative summaries by explicitly tracking both dependency parent and child words during decoding and capturing important words that will be decoded in the future. The results of the automatic evaluation on the Google sentence compression dataset showed that SLAHAN achieved the best kept-token-based F1, ROUGE-1, ROUGE-2, and ROUGE-L scores of 85.5, 79.3, 71.3, and 79.1, respectively. SLAHAN also improved summarization performance on longer sentences. Furthermore, in the human evaluation, SLAHAN improved informativeness without losing readability.
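Since a deletion-based compression is fully described by the set of token positions it retains, the kept-token-based F1 used in the evaluation can be sketched under the assumption that it is an F1 over the retained positions of the system and gold compressions. The function name and example indices below are illustrative assumptions, not the paper's implementation.

```python
def kept_token_f1(gold_kept, system_kept):
    """F1 over the sets of token indices each compression retains
    (assumed reading of the kept-token-based F1 metric)."""
    gold, system = set(gold_kept), set(system_kept)
    overlap = len(gold & system)
    if not overlap:
        return 0.0
    precision = overlap / len(system)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# gold keeps tokens 0, 2, 3; the system keeps 0, 2, 4
print(round(kept_token_f1([0, 2, 3], [0, 2, 4]), 3))  # → 0.667
```

Because both compressions keep three tokens and share two of them, precision and recall are each 2/3, giving an F1 of about 0.667.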

