A Global Model for Concept-to-Text Generation

Concept-to-text generation refers to the task of automatically producing textual output from non-linguistic input. We present a joint model that captures content selection ("what to say") and surface realization ("how to say") in an unsupervised, domain-independent fashion. Rather than breaking up the generation process into a sequence of local decisions, we define a probabilistic context-free grammar that globally describes the inherent structure of the input (a corpus of database records and text describing some of them). We recast generation as the task of finding the best derivation tree for a set of database records and describe an algorithm for decoding in this framework that allows us to intersect the grammar with additional information capturing fluency and syntactic well-formedness constraints. Experimental evaluation on several domains achieves results competitive with state-of-the-art systems that use domain-specific constraints, explicit feature engineering, or labeled data.
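The idea of "finding the best derivation tree" under a probabilistic context-free grammar can be illustrated with a minimal Viterbi CKY sketch. This is not the paper's decoder: the toy grammar, the word-level input, and the names `BINARY`, `LEXICAL`, and `viterbi_cky` are all assumptions made for illustration, and the paper's model operates on database records rather than on a fixed word sequence.

```python
import math
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (not taken from the paper).
BINARY = {  # (left child, right child) -> list of (parent, probability)
    ("NP", "VP"): [("S", 1.0)],
    ("V", "NP"): [("VP", 1.0)],
}
LEXICAL = {  # word -> list of (preterminal, probability)
    "rain": [("NP", 0.5)],
    "clouds": [("NP", 0.5)],
    "follows": [("V", 1.0)],
}


def viterbi_cky(words, root="S"):
    """Return (log probability, tree) of the best derivation, or None."""
    n = len(words)
    # best[i, j] maps a symbol to its best (log probability, tree) over words[i:j].
    best = defaultdict(dict)
    for i, word in enumerate(words):
        for symbol, prob in LEXICAL.get(word, []):
            best[i, i + 1][symbol] = (math.log(prob), (symbol, word))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for left, (lp, ltree) in best[i, k].items():
                    for right, (rp, rtree) in best[k, j].items():
                        for parent, prob in BINARY.get((left, right), []):
                            score = math.log(prob) + lp + rp
                            old = best[i, j].get(parent)
                            if old is None or score > old[0]:
                                best[i, j][parent] = (score, (parent, ltree, rtree))
    return best[0, n].get(root)


result = viterbi_cky(["rain", "follows", "clouds"])
```

Working in log space avoids numerical underflow on longer inputs; the chart keeps only the single best analysis per symbol and span, which is what makes the search global rather than a sequence of local decisions.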

Knowledge Sources for Constituent Parsing of German, a Morphologically Rich and Less-Configurational Language

Computational Linguistics, DOI 10.1162/coli_a_00135, 2013, Vol 39 (1), pp. 57-85, Cited By ~ 2

Author(s): Alexander Fraser, Helmut Schmid, Richárd Farkas, Renjing Wang, Hinrich Schütze

Keyword(s): State Of The Art, Lessons Learned, Knowledge Sources, Lexical Knowledge, Context Free Grammar, The Impact, Context Free, Probabilistic Context

We study constituent parsing of German, a morphologically rich and less-configurational language. We use a probabilistic context-free grammar treebank grammar that has been adapted to the morphologically rich properties of German by markovization and special features added to its productions. We evaluate the impact of adding lexical knowledge. Then we examine both monolingual and bilingual approaches to parse reranking. Our reranking parser is the new state of the art in constituency parsing of the TIGER Treebank. We perform an analysis, concluding with lessons learned, which apply to parsing other morphologically rich and less-configurational languages.

Sentiment analysis in Turkish: Supervised, semi-supervised, and unsupervised techniques

Natural Language Engineering, DOI 10.1017/s1351324920000200, 2020, pp. 1-29

Author(s): Cem Rıfkı Aydın, Tunga Güngör

Keyword(s): Sentiment Analysis, Morphological Analysis, State Of The Art, Feature Weighting, Domain Specific, Word Forms, Supervised Classifiers, Supervised Methods, Domain Independent, Agglutinative Language

Although many studies on sentiment analysis have been carried out for widely spoken languages, this topic is still immature for Turkish. Most work on this language focuses on supervised models, which require comprehensive annotated corpora. There are a few unsupervised methods, and they use sentiment lexicons either translated from English lexicons or built from corpora. This results in improper word polarities, because language and domain characteristics are ignored. In this paper, we develop unsupervised (domain-independent) and semi-supervised (domain-specific) methods for Turkish, which are based on a set of antonym word pairs as seeds. We make a comprehensive analysis of supervised methods under several feature weighting schemes. We then form an ensemble of supervised classifiers and also combine the unsupervised and supervised methods. Since Turkish is an agglutinative language, we perform morphological analysis and use different word forms. The methods were tested on two Turkish datasets with different styles and also on English datasets to show the portability of the approaches across languages. We observed that the combination of the unsupervised and supervised approaches outperforms the other methods, and we obtained a significant improvement over the state-of-the-art results for both Turkish and English.

Triple-to-Text Generation with an Anchor-to-Prototype Framework

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, DOI 10.24963/ijcai.2020/523, 2020

Author(s): Ziran Li, Zibo Lin, Ning Ding, Hai-Tao Zheng, Ying Shen

Keyword(s): Natural Language, State Of The Art, Experimental Results, Training Data, Generation Process, Text Generation, Language Generation, Structured Input, Textual Description, Specific Description

Generating a textual description from a set of RDF triples is a challenging task in natural language generation. Neural methods, which often generate sentences from scratch, have recently become mainstream for this task. However, due to the huge gap between the structured input and the unstructured output, the input triples alone are insufficient to determine an expressive and specific description. In this paper, we propose a novel anchor-to-prototype framework to bridge the gap between structured RDF triples and natural text. The model retrieves a set of prototype descriptions from the training data and extracts writing patterns from them to guide the generation process. Furthermore, to make more precise use of the retrieved prototypes, we employ a triple anchor that aligns the input triples into groups so as to better match the prototypes. Experimental results on both English and Chinese datasets show that our method significantly outperforms the state-of-the-art baselines in terms of both automatic and manual evaluation, demonstrating the benefit of learning guidance from retrieved prototypes to facilitate triple-to-text generation.

Automatic Induction of Bellman-Error Features for Probabilistic Planning

Journal of Artificial Intelligence Research, DOI 10.1613/jair.3021, 2010, Vol 38, pp. 687-755, Cited By ~ 3

Author(s): J. Wu, R. Givan

Keyword(s): Bellman Equation, State Of The Art, Feature Space, Automatic Processes, Value Iteration, Probabilistic Planning, Domain Specific, Hypothesis Space, Approximate Value Iteration, Domain Independent

Domain-specific features are important in representing problem structure throughout machine learning and decision-theoretic planning. In planning, once state features are provided, domain-independent algorithms such as approximate value iteration can learn weighted combinations of those features that often perform well as heuristic estimates of state value (e.g., distance to the goal). Successful applications in real-world domains often require features crafted by human experts. Here, we propose automatic processes for learning useful domain-specific feature sets with little or no human intervention. Our methods select and add features that describe state-space regions of high inconsistency in the Bellman equation (statewise Bellman error) during approximate value iteration. Our method can be applied using any real-valued-feature hypothesis space and corresponding learning method for selecting features from training sets of state-value pairs. We evaluate the method with hypothesis spaces defined by both relational and propositional feature languages, using nine probabilistic planning domains. We show that approximate value iteration using a relational feature space performs at the state of the art in domain-independent stochastic relational planning. Our method provides the first domain-independent approach that plays Tetris successfully (without human-engineered features).
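To make the notion of statewise Bellman error concrete, here is a minimal tabular sketch. It only computes the error signal (the one-step backup of a value estimate minus the estimate itself) on a hypothetical three-state chain; it does not reproduce the paper's feature-learning procedure. The MDP, the discount `GAMMA`, and the function names are assumptions made for illustration.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical 3-state chain; state 2 is an absorbing goal.
# TRANSITIONS[state][action] = list of (probability, next_state, reward)
TRANSITIONS = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(0.8, 2, 1.0), (0.2, 1, 0.0)]},
    2: {"stay": [(1.0, 2, 0.0)]},
}


def backup(state, value):
    """One-step Bellman backup: best expected reward plus discounted value."""
    return max(
        sum(p * (r + GAMMA * value[s2]) for p, s2, r in outcomes)
        for outcomes in TRANSITIONS[state].values()
    )


def bellman_error(value):
    """Statewise Bellman error: backup of the estimate minus the estimate."""
    return {s: backup(s, value) - value[s] for s in TRANSITIONS}


def value_iteration(sweeps):
    """Plain tabular value iteration starting from the all-zero estimate."""
    value = {s: 0.0 for s in TRANSITIONS}
    for _ in range(sweeps):
        value = {s: backup(s, value) for s in TRANSITIONS}
    return value


# With an all-zero estimate, only state 1 (next to the goal) is inconsistent.
initial_errors = bellman_error({s: 0.0 for s in TRANSITIONS})
# After many sweeps the estimate is nearly consistent everywhere.
final_errors = bellman_error(value_iteration(500))
```

States where the error is large in magnitude are those the current value estimate describes poorly, which is the kind of region the abstract says new features should capture.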

Identifying Norms from Observation Using MCMC Sampling

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, DOI 10.24963/ijcai.2021/17, 2021

Author(s): Stephen Cranefield, Ashish Dhiman

Keyword(s): Robot Manipulator, Multi Agent Systems, Context Free Grammar, Domain Specific, MCMC Sampling, Agent Interactions, Multi Agent, Identification Techniques, Context Free, Probabilistic Context

To promote efficient interactions in dynamic and multi-agent systems, there is much interest in techniques that allow agents to represent and reason about social norms that govern agent interactions. Much of this work assumes that norms are provided to agents, but some work has investigated how agents can identify the norms present in a society through observation and experience. However, the norm-identification techniques proposed in the literature often depend on a very specific and domain-specific representation of norms, or require that the possible norms can be enumerated in advance. This paper investigates the problem of identifying norm candidates from a normative language expressed as a probabilistic context-free grammar, using Markov Chain Monte Carlo (MCMC) search. We apply our technique to a simulated robot manipulator task and show that it allows effective identification of norms from observation.
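As a rough illustration of searching a grammar-defined space of norm candidates with MCMC, the sketch below uses an independence Metropolis-Hastings sampler whose proposals are drawn from a small PCFG prior. This is a swapped-in, simplified technique rather than the authors' algorithm: the grammar, the action names, the violation rate `EPS`, and the likelihood model (a compliant agent chooses uniformly among permitted actions) are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

ACTIONS = ["pick_red", "pick_blue", "drop_red", "drop_blue"]  # hypothetical
EPS = 0.01  # assumed probability that an agent violates the norm

# Hypothetical normative language as a PCFG: symbol -> [(probability, rhs)].
GRAMMAR = {
    "NORM": [(1.0, ["prohibited", "ACTS"])],
    "ACTS": [(0.7, ["ACT"]), (0.3, ["ACT", "or", "ACTS"])],
    "ACT": [(0.25, [a]) for a in ACTIONS],
}


def sample(symbol, rng):
    """Sample a token sequence from the PCFG by top-down expansion."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal
    rules = GRAMMAR[symbol]
    _, rhs = rng.choices(rules, weights=[p for p, _ in rules])[0]
    return [tok for child in rhs for tok in sample(child, rng)]


def prohibited_set(tokens):
    return frozenset(t for t in tokens if t in ACTIONS)


def log_likelihood(prohibited, observations):
    """Assumed model: mass 1-EPS spread over permitted actions, EPS over the rest."""
    permitted = [a for a in ACTIONS if a not in prohibited]
    total = 0.0
    for obs in observations:
        if obs in prohibited:
            mass = EPS if permitted else 1.0  # everything prohibited: all mass here
            total += math.log(mass / len(prohibited))
        else:
            total += math.log((1.0 - EPS) / len(permitted))
    return total


def mcmc_norms(observations, steps, seed=0):
    """Independence sampler: propose from the prior, accept by likelihood ratio."""
    rng = random.Random(seed)
    current = sample("NORM", rng)
    current_ll = log_likelihood(prohibited_set(current), observations)
    visits = Counter()
    for _ in range(steps):
        proposal = sample("NORM", rng)
        proposal_ll = log_likelihood(prohibited_set(proposal), observations)
        if rng.random() < math.exp(min(0.0, proposal_ll - current_ll)):
            current, current_ll = proposal, proposal_ll
        visits[prohibited_set(current)] += 1
    return visits


# Observed behaviour never touches red objects.
visits = mcmc_norms(["pick_blue", "drop_blue"] * 10, steps=3000)
most_visited_norm = visits.most_common(1)[0][0]
```

Because proposals come from the prior, the prior terms cancel in the acceptance ratio and only the likelihood ratio remains. The recursive `ACTS` rule means the candidate space is unbounded, so candidates are never enumerated in advance; they are only sampled.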

DSGPT: Domain-Specific Generative Pre-Training of Transformers for Text Generation in E-commerce Title and Review Summarization

Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, DOI 10.1145/3404835.3463037, 2021

Author(s): Xueying Zhang, Yunjiang Jiang, Yue Shang, Zhaomeng Cheng, Chi Zhang, ...

Keyword(s): Text Generation, Domain Specific, Review Summarization

BiLabel-Specific Features for Multi-Label Classification

ACM Transactions on Knowledge Discovery from Data, DOI 10.1145/3458283, 2021, Vol 16 (1), pp. 1-23

Author(s): Min-Ling Zhang, Jun-Peng Fang, Yi-Bo Wang

Keyword(s): Predictive Models, Comparative Studies, State Of The Art, Classification Model, Generation Process, Prototype Selection, Class Label, Benchmark Datasets, Label Correlations, Class Labels

In multi-label classification, the task is to induce predictive models that can assign a set of relevant labels to an unseen instance. The strategy of label-specific features has been widely employed in learning from multi-label examples, where the classification model for predicting the relevancy of each class label is induced based on its tailored features rather than the original features. Existing approaches work by generating a group of tailored features for each class label independently, so label correlations are not fully considered in the label-specific feature generation process. In this article, we extend the existing strategy by proposing a simple yet effective approach based on BiLabel-specific features. Specifically, a group of tailored features is generated for a pair of class labels with heuristic prototype selection and embedding. Thereafter, predictions of classifiers induced by BiLabel-specific features are ensembled to determine the relevancy of each class label for an unseen instance. To thoroughly evaluate the BiLabel-specific features strategy, extensive experiments are conducted over a total of 35 benchmark datasets. Comparative studies against state-of-the-art label-specific features techniques clearly validate the superiority of utilizing BiLabel-specific features to yield stronger generalization performance for multi-label classification.

Searching for universal model of amyloid signaling motifs using probabilistic context-free grammars

BMC Bioinformatics, DOI 10.1186/s12859-021-04139-y, 2021, Vol 22 (1)

Author(s): Witold Dyrka, Marlena Gąsior-Głogowska, Monika Szefczyk, Natalia Szulc

Keyword(s): Functional Relationship, High Sensitivity, Alternative Methods, Discriminative Power, Context Free Grammar, Protein Motifs, Functional Features, Universal Grammars, Context Free, Probabilistic Context

Background: Amyloid signaling motifs are a class of protein motifs which share basic structural and functional features despite the lack of clear sequence homology. They are hard to detect in large sequence databases either with the alignment-based profile methods (due to short length and diversity) or with generic amyloid- and prion-finding tools (due to insufficient discriminative power). We propose to address the challenge with a machine learning grammatical model capable of generalizing over diverse collections of unaligned yet related motifs. Results: First, we introduce and test improvements to our probabilistic context-free grammar framework for protein sequences that allow for inferring more sophisticated models achieving high sensitivity at low false positive rates. Then, we infer universal grammars for a collection of recently identified bacterial amyloid signaling motifs and demonstrate that the method is capable of generalizing by successfully searching for related motifs in fungi. The results are compared to available alternative methods. Finally, we conduct spectroscopy and staining analyses of selected peptides to verify their structural and functional relationship. Conclusions: While the profile HMMs remain the method of choice for modeling homologous sets of sequences, PCFGs seem more suitable for building meta-family descriptors and extrapolating beyond the seed sample.
