Building a Korean Sentence-Compression Corpus by Analyzing Sentences and Deleting Words

2021 · Vol 48 (2) · pp. 183-194
Author(s): GyoungHo Lee, Yo-Han Park, Kong Joo Lee
2016 · Vol 78 (8)
Author(s): Suraya Alias, Siti Khaotijah Mohammad, Gan Keng Hoon, Tan Tien Ping

A text summary serves as a condensed representation of a written input source in which important and salient information is kept. However, the condensed representation lacks semantic coherence if the summary is produced verbatim from the input itself. Sentence Compression is a technique in which unimportant details are eliminated from a sentence while preserving its grammatical pattern. In this study, we analyzed our developed Malay Text Corpus to discover the rules and patterns by which human summarizers compress sentences and eliminate unimportant constituents to construct a summary. A Pattern-Growth-based model named Frequent Eliminated Pattern (FASPe) is introduced to represent the text using sets of adjacent word sequences that are frequently eliminated across the document collection. From the rules obtained, heuristic knowledge about Sentence Compression is presented, with confidence values as high as 85%, which can serve as a reference for Text Summarization in the Malay language.
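The eliminated-pattern idea can be illustrated with a minimal sketch: given (original, compressed) sentence pairs, collect the contiguous word spans that were deleted and keep the spans that recur above a support threshold. Everything below (the function names, the toy English pairs, the word-set deletion check) is hypothetical and simplified, not the paper's actual FASPe algorithm, which applies pattern-growth mining to the Malay corpus.

```python
from collections import Counter

def eliminated_spans(original, compressed):
    """Return contiguous word spans present in the original sentence
    but deleted from its compressed version (word-set approximation)."""
    kept = set(compressed)  # assumes compression only deletes words
    spans, current = [], []
    for word in original:
        if word in kept:
            if current:
                spans.append(tuple(current))
                current = []
        else:
            current.append(word)
    if current:
        spans.append(tuple(current))
    return spans

def frequent_eliminated_patterns(pairs, min_support=2):
    """Count eliminated spans across all (original, compressed) pairs
    and keep those meeting the minimum support threshold."""
    counts = Counter()
    for original, compressed in pairs:
        counts.update(eliminated_spans(original, compressed))
    return {p: c for p, c in counts.items() if c >= min_support}

# toy English pairs for illustration (not from the Malay corpus)
pairs = [
    ("the red car stopped at the light".split(), "the car stopped".split()),
    ("a red car passed by".split(), "a car passed by".split()),
]
print(frequent_eliminated_patterns(pairs))  # → {('red',): 2}
```

A span's support divided by the number of sentences containing it would give the kind of confidence value the study reports for its rules.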


2020 · Vol 34 (05) · pp. 8050-8057
Author(s): Hidetaka Kamigaito, Manabu Okumura

Sentence compression is the task of compressing a long sentence into a short one by deleting redundant words. In sequence-to-sequence (Seq2Seq) based models, the decoder unidirectionally decides whether to retain or delete each word. Thus, it usually cannot explicitly capture the relationships between decoded words and unseen words that will be decoded in future time steps. Therefore, to avoid generating ungrammatical sentences, the decoder sometimes drops important words when compressing sentences. To solve this problem, we propose a novel Seq2Seq model, the syntactically look-ahead attention network (SLAHAN), which can generate informative summaries by explicitly tracking both dependency parent and child words during decoding and capturing important words that will be decoded in the future. The results of the automatic evaluation on the Google sentence compression dataset showed that SLAHAN achieved the best kept-token-based F1, ROUGE-1, ROUGE-2, and ROUGE-L scores of 85.5, 79.3, 71.3, and 79.1, respectively. SLAHAN also improved summarization performance on longer sentences. Furthermore, in the human evaluation, SLAHAN improved informativeness without losing readability.
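Since a deletion-based compression is fully described by the set of token positions it retains, the kept-token-based F1 used in the evaluation can be sketched under the assumption that it is an F1 over the retained positions of the system and gold compressions. The function name and example indices below are illustrative assumptions, not the paper's implementation.

```python
def kept_token_f1(gold_kept, system_kept):
    """F1 over the sets of token indices each compression retains
    (assumed reading of the kept-token-based F1 metric)."""
    gold, system = set(gold_kept), set(system_kept)
    overlap = len(gold & system)
    if not overlap:
        return 0.0
    precision = overlap / len(system)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# gold keeps tokens 0, 2, 3; the system keeps 0, 2, 4
print(round(kept_token_f1([0, 2, 3], [0, 2, 4]), 3))  # → 0.667
```

Because both compressions keep three tokens and share two of them, precision and recall are each 2/3, giving an F1 of about 0.667.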

