Consistent Unsupervised Estimators for Anchored PCFGs

2020 · Vol 8 · pp. 409-422
Author(s): Alexander Clark, Nathanaël Fijalkow

Learning probabilistic context-free grammars (PCFGs) from strings has been a classic problem in computational linguistics since Horning (1969). Here we present an algorithm based on distributional learning that is a consistent estimator for a large class of PCFGs satisfying certain natural conditions, including being anchored (Stratos et al., 2016). We proceed via a reparameterization of (top-down) PCFGs that we call a bottom-up weighted context-free grammar. We show that if the grammar is anchored and satisfies additional restrictions on its ambiguity, then its parameters can be related directly to distributional properties of the anchoring strings. We establish the asymptotic correctness of a naive estimator and present simulations on synthetic data showing that algorithms based on this approach have good finite-sample behavior.
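The anchoring condition is what makes a naive, counting-based estimator possible: roughly, each nonterminal has an anchor terminal that only it can generate, so empirical frequencies of anchor substrings stand in for the distributional quantities to which the bottom-up parameters are related. Below is a minimal sketch of the counting primitive such an estimator would plug in; the function name, the toy corpus, and the anchors "b" and "c" are illustrative, and the paper's exact parameter formulas are not reproduced here.

from collections import Counter

def substring_expectations(corpus, max_len=2):
    """Empirical expected count per string, E_hat[u], of every substring u of
    up to max_len tokens. Under anchoring (roughly: the anchor terminal a_A is
    emitted only by nonterminal A), counts of anchor substrings are the kind of
    distributional quantity a naive plug-in estimator would use in place of
    the true expectations.
    """
    counts = Counter()
    for sent in corpus:
        for i in range(len(sent)):
            for j in range(i + 1, min(i + max_len, len(sent)) + 1):
                counts[sent[i:j]] += 1
    n = len(corpus)
    return {u: c / n for u, c in counts.items()}

# Toy corpus of tokenized strings; "b" and "c" play the role of (hypothetical)
# anchors of two nonterminals. The paper's actual estimator combines such
# quantities according to its bottom-up parameterization, not reproduced here.
corpus = [("b", "c"), ("b", "b", "c"), ("c",)]
E_hat = substring_expectations(corpus)
print(E_hat[("b", "c")], E_hat[("b",)])  # 2/3 and 3/3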

2007 · Vol 33 (4) · pp. 477-491
Author(s): Noah A. Smith, Mark Johnson

This article studies the relationship between weighted context-free grammars (WCFGs), where each production is associated with a positive real-valued weight, and probabilistic context-free grammars (PCFGs), where the weights of the productions associated with a nonterminal are constrained to sum to one. Because the class of WCFGs properly includes the PCFGs, one might expect that WCFGs can describe distributions that PCFGs cannot. However, Z. Chi (1999, Computational Linguistics, 25(1):131–160) and S. P. Abney, D. A. McAllester, and F. Pereira (1999, in Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 542–549, College Park, MD) proved that every WCFG distribution is equivalent to some PCFG distribution. We extend their results to conditional distributions, and show that every WCFG conditional distribution of parses given strings is also the conditional distribution defined by some PCFG, even when the WCFG's partition function diverges. This shows that any parsing or labeling accuracy improvement from conditional estimation of WCFGs or conditional random fields (CRFs) over joint estimation of PCFGs or hidden Markov models (HMMs) is due to the estimation procedure rather than the change in model class, because PCFGs and HMMs are exactly as expressive as WCFGs and chain-structured CRFs, respectively.
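For readers who want the convergent case in symbols, the Chi (1999) / Abney et al. (1999) construction that this article builds on can be sketched as follows (notation mine; the article's extension to divergent partition functions and to conditional distributions needs more care than is shown here):

% Renormalization of a convergent WCFG into an equivalent PCFG
% (construction of Chi 1999 and Abney et al. 1999; notation mine).
% Inside partition function of each nonterminal (Z_a = 1 for terminals a):
\[
  Z_A \;=\; \sum_{A \to X_1 \cdots X_k} \theta(A \to X_1 \cdots X_k)
            \prod_{i=1}^{k} Z_{X_i}
\]
% Rule probabilities of the equivalent PCFG:
\[
  p(A \to X_1 \cdots X_k) \;=\;
    \frac{\theta(A \to X_1 \cdots X_k) \prod_{i=1}^{k} Z_{X_i}}{Z_A}
\]
% These sum to one over the expansions of each A, and for any derivation
% tree t the product of the p's telescopes to
% \( \bigl(\prod_{r \in t} \theta(r)\bigr) / Z_S \),
% i.e. the WCFG tree weight renormalized by the start symbol's partition
% function, so the two grammars define the same distribution over trees.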


2021 · Vol 22 (1)
Author(s): Witold Dyrka, Marlena Gąsior-Głogowska, Monika Szefczyk, Natalia Szulc

Background: Amyloid signaling motifs are a class of protein motifs that share basic structural and functional features despite the lack of clear sequence homology. They are hard to detect in large sequence databases, either with alignment-based profile methods (due to their short length and diversity) or with generic amyloid- and prion-finding tools (due to insufficient discriminative power). We propose to address the challenge with a machine-learning grammatical model capable of generalizing over diverse collections of unaligned yet related motifs.

Results: First, we introduce and test improvements to our probabilistic context-free grammar (PCFG) framework for protein sequences that allow for inferring more sophisticated models achieving high sensitivity at low false positive rates. Then, we infer universal grammars for a collection of recently identified bacterial amyloid signaling motifs and demonstrate that the method is capable of generalizing by successfully searching for related motifs in fungi. The results are compared to available alternative methods. Finally, we conduct spectroscopy and staining analyses of selected peptides to verify their structural and functional relationship.

Conclusions: While profile HMMs remain the method of choice for modeling homologous sets of sequences, PCFGs seem more suitable for building meta-family descriptors and for extrapolating beyond the seed sample.
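The grammar-inference machinery is specific to the authors' framework, but the scoring step it relies on, assigning a likelihood to an unaligned sequence under a PCFG and thresholding it when scanning for motif hits, is generic. A minimal sketch of that step, assuming a grammar in Chomsky normal form and a toy two-letter alphabet of my own invention (the real framework operates on amino-acid alphabets and learned grammars):

from collections import defaultdict

def inside_probability(seq, lexical, binary, start="S"):
    """Inside (CYK-style) algorithm: total probability that `start` derives
    `seq` under a PCFG in Chomsky normal form.

    lexical: {(A, a): P(A -> a)}          unary emission rules
    binary:  {(A, (B, C)): P(A -> B C)}   binary rules
    """
    n = len(seq)
    # inside[i][j][A] = probability that A derives seq[i:j]
    inside = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, a in enumerate(seq):
        for (A, term), p in lexical.items():
            if term == a:
                inside[i][i + 1][A] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for (A, (B, C)), p in binary.items():
                    left = inside[i][k].get(B, 0.0)
                    right = inside[k][j].get(C, 0.0)
                    if left and right:
                        inside[i][j][A] += p * left * right
    return inside[0][n].get(start, 0.0)

# Toy grammar over a two-letter "amino acid" alphabet (entirely hypothetical):
lexical = {("X", "N"): 0.6, ("X", "Q"): 0.4, ("Y", "N"): 0.5, ("Y", "Q"): 0.5}
binary = {("S", ("X", "Y")): 1.0}
print(inside_probability("NQ", lexical, binary))  # 0.6 * 0.5 = 0.3

In practice one would work in log space and compare the score of a scanned window against a null model before calling a hit.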


2013 · Vol 39 (1) · pp. 57-85
Author(s): Alexander Fraser, Helmut Schmid, Richárd Farkas, Renjing Wang, Hinrich Schütze

We study constituent parsing of German, a morphologically rich and less-configurational language. We use a probabilistic context-free treebank grammar that has been adapted to the rich morphology of German through markovization and special features added to its productions. We evaluate the impact of adding lexical knowledge. Then we examine both monolingual and bilingual approaches to parse reranking. Our reranking parser is the new state of the art in constituency parsing of the TIGER Treebank. We conclude with an analysis and lessons learned that apply to parsing other morphologically rich and less-configurational languages.
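Markovization here refers to the standard treebank-grammar transformation that breaks flat productions into binary ones whose intermediate symbols remember only a bounded amount of horizontal context, which keeps the induced grammar from becoming too sparse. A generic sketch of order-h horizontal markovization follows; the helper name and the example NP production are illustrative, and the specific feature annotations used for the TIGER grammar are the paper's own.

def horizontal_markovize(lhs, rhs, h=2):
    """Binarize a flat production lhs -> rhs into right-branching binary rules
    whose intermediate symbols remember only the last h generated siblings
    (order-h horizontal markovization).
    """
    if len(rhs) <= 2:
        return [(lhs, tuple(rhs))]
    rules = []
    current = lhs
    seen = []
    for child in rhs[:-2]:
        seen.append(child)
        nxt = f"{lhs}|<{','.join(seen[-h:])}>"   # bounded horizontal context
        rules.append((current, (child, nxt)))
        current = nxt
    rules.append((current, (rhs[-2], rhs[-1])))
    return rules

# A hypothetical flat German NP production, binarized with markov order 2:
for rule in horizontal_markovize("NP", ["ART", "ADJA", "ADJA", "NN", "PP"], h=2):
    print(rule)
# ('NP', ('ART', 'NP|<ART>'))
# ('NP|<ART>', ('ADJA', 'NP|<ART,ADJA>'))
# ('NP|<ART,ADJA>', ('ADJA', 'NP|<ADJA,ADJA>'))
# ('NP|<ADJA,ADJA>', ('NN', 'PP'))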

