Dependency Parsing of Turkish

2008 ◽ Vol 34 (3) ◽ pp. 357-389
Author(s): Gülşen Eryiğit, Joakim Nivre, Kemal Oflazer

The suitability of different parsing methods for different languages is an important topic in syntactic parsing. Especially lesser-studied languages, typologically different from the languages for which methods have originally been developed, pose interesting challenges in this respect. This article presents an investigation of data-driven dependency parsing of Turkish, an agglutinative, free constituent order language that can be seen as the representative of a wider class of languages of similar type. Our investigations show that morphological structure plays an essential role in finding syntactic relations in such a language. In particular, we show that employing sublexical units called inflectional groups, rather than word forms, as the basic parsing units improves parsing accuracy. We test our claim on two different parsing methods, one based on a probabilistic model with beam search and the other based on discriminative classifiers and a deterministic parsing strategy, and show that the usefulness of sublexical units holds regardless of the parsing method. We examine the impact of morphological and lexical information in detail and show that, properly used, this kind of information can improve parsing accuracy substantially. Applying the techniques presented in this article, we achieve the highest reported accuracy for parsing the Turkish Treebank.
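
To make the idea of sublexical parsing units concrete, here is a minimal Python sketch of representing a word as a sequence of inflectional groups (IGs) and exposing each IG as its own parsing unit. The example word, its segmentation, and the feature labels are simplified, hypothetical illustrations, not actual Turkish Treebank analyses.

```python
# A minimal sketch of using inflectional groups (IGs) rather than word forms
# as parsing units. The example word and its segmentation are illustrative,
# hypothetical analyses, not taken from the Turkish Treebank.

from dataclasses import dataclass
from typing import List

@dataclass
class InflectionalGroup:
    stem: str                 # surface stem for the first IG, empty otherwise
    pos: str                  # part of speech of this IG
    inflections: List[str]    # inflectional features within this IG

def word_to_parsing_units(word_igs: List[InflectionalGroup]) -> List[InflectionalGroup]:
    """Each IG becomes its own parsing unit; a dependency can then attach
    to an individual IG instead of to the whole word form."""
    return list(word_igs)

# "evdekiler" ~ "those in the house": a noun IG plus a derived pronominal IG
# (the segmentation shown here is a simplified, hypothetical analysis)
word = [
    InflectionalGroup("ev", "Noun", ["A3sg", "Loc"]),   # ev+de  "in the house"
    InflectionalGroup("", "Pron", ["A3pl", "Nom"]),     # +ki+ler "those ..."
]

for i, ig in enumerate(word_to_parsing_units(word), 1):
    print(i, ig.pos, ig.inflections)
```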

2013 ◽ Vol 1 ◽ pp. 301-314
Author(s): Weiwei Sun, Xiaojun Wan

We present a comparative study of transition-, graph-, and PCFG-based models aimed at illuminating more precisely the likely contribution of CFGs to improving Chinese dependency parsing accuracy, especially through the combination of heterogeneous models. Inspired by the impact of a constituency grammar on dependency parsing, we propose several strategies for acquiring pseudo CFGs from dependency annotations alone. Compared to linguistic grammars learned from rich phrase-structure treebanks, well-designed pseudo grammars achieve similar parsing accuracy and make an equivalent contribution to parser ensembles. Moreover, pseudo grammars increase the diversity of the base models and therefore, together with all the other models, further improve system combination. Based on automatic POS tagging, our final model achieves a UAS of 87.23%, a significant improvement over the state of the art.
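
As an illustration of how constituency-like structure can be read off dependency annotations alone, the sketch below derives one naive kind of pseudo production: each head word and its dependents form a flat constituent labelled by the head's POS tag. This is an assumed, simplified strategy for illustration only; the paper investigates several more refined ones.

```python
# A naive way to read pseudo CFG productions off a dependency tree: the head
# plus its dependents form one flat constituent labelled by the head's POS.
# Illustrative only; not the strategies actually evaluated in the paper.

from collections import defaultdict

def pseudo_productions(tokens, heads, tags):
    """tokens/heads/tags are parallel lists; heads[i] is the index of
    token i's head (-1 for the root). Returns CFG-like productions."""
    children = defaultdict(list)
    for i, h in enumerate(heads):
        children[h].append(i)
    prods = []
    for h in range(len(tokens)):
        if children[h]:
            rhs = []
            for i in sorted(children[h] + [h]):
                # dependents are rewritten as phrases, the head as its POS tag
                rhs.append(tags[i] + "P" if i != h else tags[i])
            prods.append((tags[h] + "P", tuple(rhs)))
    return prods

tokens = ["She", "reads", "books"]
tags   = ["PRP", "VBZ", "NNS"]
heads  = [1, -1, 1]                      # "reads" heads both arguments
print(pseudo_productions(tokens, heads, tags))
# [('VBZP', ('PRPP', 'VBZ', 'NNSP'))]
```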


2013 ◽ Vol 39 (1) ◽ pp. 5-13
Author(s): Miguel Ballesteros, Joakim Nivre

Dependency trees used in syntactic parsing often include a root node representing a dummy word prefixed or suffixed to the sentence, a device that is generally considered a mere technical convenience and is tacitly assumed to have no impact on empirical results. We demonstrate that this assumption is false and that the accuracy of data-driven dependency parsers can in fact be sensitive to the existence and placement of the dummy root node. In particular, we show that a greedy, left-to-right, arc-eager transition-based parser consistently performs worse when the dummy root node is placed at the beginning of the sentence (following the current convention in data-driven dependency parsing) than when it is placed at the end or omitted completely. Control experiments with an arc-standard transition-based parser and an arc-factored graph-based parser reveal no consistent preferences but nevertheless exhibit considerable variation in results depending on root placement. We conclude that the treatment of dummy root nodes in data-driven dependency parsing is an underestimated source of variation in experiments and may also be a parameter worth tuning for some parsers.
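
The sketch below illustrates the three conventions compared in the article: prefixing a dummy root, suffixing it, or omitting it. The token and index conventions (e.g. `<ROOT>`, `None` for the dummy node's own head) are placeholders, not those of any particular treebank format.

```python
# A minimal sketch of the three root-placement conventions compared in the
# article: dummy root first, dummy root last, or no explicit root node.

def attach_dummy_root(tokens, heads, placement="start"):
    """heads[i] == -1 marks the sentence root. Returns (tokens, heads)
    rewritten according to the requested convention."""
    if placement == "none":                      # keep -1 as an implicit root
        return list(tokens), list(heads)
    if placement == "start":                     # ROOT becomes token 0
        new_heads = [0 if h == -1 else h + 1 for h in heads]
        return ["<ROOT>"] + list(tokens), [None] + new_heads
    if placement == "end":                       # ROOT becomes the last token
        root_idx = len(tokens)
        new_heads = [root_idx if h == -1 else h for h in heads]
        return list(tokens) + ["<ROOT>"], new_heads + [None]
    raise ValueError(placement)

tokens = ["Economic", "news", "had", "effects"]
heads  = [1, 2, -1, 2]
print(attach_dummy_root(tokens, heads, "start"))
print(attach_dummy_root(tokens, heads, "end"))
```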


2015 ◽ pp. 20
Author(s): Stig-Arne Grönroos, Kristiina Jokinen, Katri Hiovain, Mikko Kurimo, Sami Virpioja

Many Uralic languages have a rich morphological structure but lack the morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation of the North Sámi language with a large unannotated corpus and a small amount of human-annotated word forms selected using an active learning approach. For statistical learning, we use the semi-supervised Morfessor Baseline and FlatCat methods. After annotating 237 words with our active learning setup, we improve morph boundary recall by over 20% with no loss of precision.
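
A minimal sketch of the semi-supervised training setup, using the Morfessor 2.0 Python package: a large unannotated corpus plus a small annotation dictionary. File names and the annotation entries are placeholders, and the exact method signatures may differ between library versions, so treat this as an assumed usage pattern rather than the authors' scripts.

```python
# Semi-supervised Morfessor Baseline training with a small annotated set,
# in the spirit of the setup described above. API calls follow the
# Morfessor 2.0 Python package as commonly documented; verify against the
# installed version. File names and annotations are placeholders.

import morfessor

io = morfessor.MorfessorIO()

# Large unannotated corpus (running text, one word occurrence at a time).
train_data = list(io.read_corpus_file("unannotated_corpus.txt"))

# A few hundred human-annotated segmentations, e.g. collected via active
# learning. Entries below are placeholders, not real North Sami analyses.
annotations = {
    "wordformone": [("wordform", "one")],
    "wordformtwo": [("wordform", "two")],
}

model = morfessor.BaselineModel()
model.load_data(train_data)
model.set_annotations(annotations)   # switches on the semi-supervised mode
model.train_batch()

segments, cost = model.viterbi_segment("wordformthree")
print(segments)
```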


2016 ◽ Vol 4 ◽ pp. 47-72
Author(s): Stig-Arne Grönroos, Katri Hiovain, Peter Smit, Ilona Rauhala, Kristiina Jokinen, ...

Many Uralic languages have a rich morphological structure but lack the morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small amount of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word-initial and word-final substrings. For North Sámi, we collect a set of human-annotated data. With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection.
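
To make the coverage-based query strategy concrete, here is an illustrative greedy selection over word-initial and word-final substrings: each round picks the unannotated word whose substrings cover the largest not-yet-covered frequency mass. The scoring function and the toy Finnish corpus are assumptions for illustration, not the authors' exact implementation.

```python
# An illustrative coverage-based query strategy over word-initial and
# word-final substrings, reconstructed for illustration only.

from collections import Counter

def substrings(word, max_len=4):
    """Word-initial and word-final substrings up to max_len characters."""
    subs = set()
    for k in range(1, min(max_len, len(word)) + 1):
        subs.add(("INI", word[:k]))
        subs.add(("FIN", word[-k:]))
    return subs

def select_for_annotation(corpus_counts, budget):
    """Greedy coverage maximisation over substring frequency mass."""
    freq = Counter()
    for word, count in corpus_counts.items():
        for sub in substrings(word):
            freq[sub] += count
    covered, selected = set(), []
    candidates = set(corpus_counts)
    for _ in range(budget):
        best = max(
            candidates,
            key=lambda w: sum(freq[s] for s in substrings(w) - covered),
        )
        selected.append(best)
        covered |= substrings(best)
        candidates.remove(best)
    return selected

corpus = {"taloissa": 12, "talossa": 30, "autoissa": 8, "autolla": 15}
print(select_for_annotation(corpus, budget=2))
```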


Author(s): Shumin Shi, Dan Luo, Xing Wu, Congjun Long, Heyan Huang

Dependency parsing is an important task for Natural Language Processing (NLP). However, a mature parser requires a large treebank for training, which is still extremely costly to create. Tibetan is an extremely low-resource language for NLP: no Tibetan dependency treebank is available, and such resources are currently built only through manual annotation. Furthermore, there is little related research on treebank construction. We propose a novel method of multi-level chunk-based syntactic parsing to perform constituent-to-dependency treebank conversion for Tibetan under these resource-scarce conditions. Our method mines more dependencies from Tibetan sentences, builds a high-quality Tibetan dependency tree corpus, and makes fuller use of the inherent regularities of the language itself. We train dependency parsing models on the dependency treebank obtained by this preliminary conversion. The model achieves 86.5% accuracy, 96% LAS, and 97.85% UAS, exceeding the best results of existing conversion methods. The experimental results show that our method is well suited to low-resource settings: it not only addresses the scarcity of Tibetan dependency treebanks but also avoids needless manual annotation. The method embodies a strongly knowledge-guided approach to linguistic analysis, which is of great significance for advancing research on Tibetan information processing.
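
The sketch below shows the general shape of a constituent-to-dependency conversion using a head table: each constituent selects a head child, and the lexical heads of the remaining children attach to it. The labels and head rules are hypothetical placeholders and do not reproduce the multi-level chunk rules developed for Tibetan in this work.

```python
# Constituent-to-dependency conversion with a head table. The labels and
# head rules below are placeholder examples, not the article's Tibetan rules.

HEAD_RULES = {"S": "VP", "VP": "V", "NP": "N"}   # hypothetical head table

def lexical_head(tree):
    """tree is (label, children) for internal nodes or (tag, token_index) leaves."""
    label, children = tree
    if isinstance(children, int):                 # leaf: (POS tag, token index)
        return children
    preferred = HEAD_RULES.get(label)
    head_child = next((c for c in children if c[0] == preferred), children[-1])
    return lexical_head(head_child)

def to_dependencies(tree, deps=None):
    deps = [] if deps is None else deps
    label, children = tree
    if isinstance(children, int):
        return deps
    h = lexical_head(tree)
    for child in children:
        ch = lexical_head(child)
        if ch != h:
            deps.append((ch, h))                  # (dependent index, head index)
        to_dependencies(child, deps)
    return deps

# "(S (NP (N 0)) (VP (NP (N 1)) (V 2)))" as nested tuples
tree = ("S", [("NP", [("N", 0)]), ("VP", [("NP", [("N", 1)]), ("V", 2)])])
print(to_dependencies(tree))   # [(0, 2), (1, 2)]
```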


Geosciences ◽ 2021 ◽ Vol 11 (2) ◽ pp. 99
Author(s): Yueqi Gu, Orhun Aydin, Jacqueline Sosa

Post-earthquake relief zone planning is a multidisciplinary optimization problem, which requires delineating zones that minimize the loss of life and property. In this study, we offer an end-to-end workflow to define relief zone suitability and equitable relief service zones for Los Angeles (LA) County. In particular, we address the impact of a tsunami, given LA's high spatial complexity in terms of population clustering along the coastline and a complicated inland fault system. We design data-driven earthquake relief zones with a wide variety of inputs, including geological features, population, and public safety. Data-driven zones were generated by solving the p-median problem with the Teitz–Bart algorithm without any a priori knowledge of optimal relief zones. We define the metrics for determining the optimal number of relief zones as part of the proposed workflow. Finally, we measure the impacts of a tsunami in LA County by comparing data-driven relief zone maps for a case with a tsunami and a case without one. Our results show that the impact of the tsunami on the relief zones can extend up to 160 km inland from the study area.
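
For readers unfamiliar with the optimization step, here is a compact sketch of the Teitz–Bart vertex-substitution heuristic for the p-median problem, applied to synthetic distances and weights. The data and parameter choices are illustrative, not those of the LA County study.

```python
# Teitz-Bart vertex-substitution heuristic for the p-median problem:
# start from p candidate facilities and keep swapping a facility for a
# non-facility whenever the swap lowers total weighted distance.
# Distances and weights below are synthetic, purely for illustration.

import itertools, random

def total_cost(facilities, demand, dist):
    return sum(w * min(dist[i][f] for f in facilities) for i, w in demand.items())

def teitz_bart(points, demand, dist, p, seed=0):
    rng = random.Random(seed)
    facilities = set(rng.sample(sorted(points), p))
    improved = True
    while improved:
        improved = False
        for f, candidate in itertools.product(sorted(facilities),
                                              sorted(points - facilities)):
            trial = (facilities - {f}) | {candidate}
            if total_cost(trial, demand, dist) < total_cost(facilities, demand, dist):
                facilities, improved = trial, True
                break
    return facilities

# 5 demand points on a line, equal weights, 2 relief zones
points = set(range(5))
demand = {i: 1.0 for i in points}
dist = [[abs(i - j) for j in range(5)] for i in range(5)]
print(sorted(teitz_bart(points, demand, dist, p=2)))
```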


2021 ◽ Vol 11 (7) ◽ pp. 3110
Author(s): Karina Gibert, Xavier Angerri

In this paper, the results of the INSESS-COVID19 project are presented, as part of a special call aimed at helping with the COVID-19 crisis in Catalonia. The technological infrastructure and methodology developed in this project allow the quick screening of a territory for a rapid and reliable diagnosis in the face of an unexpected situation, by providing relevant decisional information to support informed decision-making and strategy and policy design. One of the challenges of the project was to extract valuable information from direct participatory processes, in which specific target profiles of citizens are consulted, and to distribute the participation across the whole territory. Having many variables with a moderate number of participating citizens (in this case about 1,000) implies a risk of violating statistical secrecy when multivariate relationships are analyzed, thus putting at risk the anonymity of the participants, as well as their safety when vulnerable populations are involved, as is the case in INSESS-COVID19. In this paper, the entire data-driven methodology developed in the project is presented, and the treatment of small population subgroups for preserving statistical secrecy is described. The methodology is reusable with any other underlying questionnaire, since the data science and reporting parts are fully automated.
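
As a simple illustration of the disclosure problem, the sketch below suppresses any cross-tabulation cell whose count falls below a threshold before reporting. Both the threshold and the suppression rule are assumptions for illustration and are not the specific small-subgroup treatment developed in INSESS-COVID19.

```python
# Protecting statistical secrecy in a cross-tabulation: cells below a
# disclosure threshold are suppressed before the report is generated.
# Threshold and rule are illustrative assumptions only.

import pandas as pd

THRESHOLD = 5   # minimum respondents per reported cell (assumed value)

def safe_crosstab(df, row, col, threshold=THRESHOLD):
    table = pd.crosstab(df[row], df[col])
    return table.where(table >= threshold, other="<suppressed>")

answers = pd.DataFrame({
    "age_group":  ["18-30"] * 12 + ["65+"] * 3,
    "needs_help": ["yes", "no"] * 6 + ["yes", "yes", "no"],
})
print(safe_crosstab(answers, "age_group", "needs_help"))
```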


2021
Author(s): Senthil Krishnababu, Omar Valero, Roger Wells

Data-driven technologies are revolutionising the engineering sector by providing new ways of performing day-to-day tasks throughout the life cycle of a product as it progresses through manufacture to build, qualification test, field operation, and maintenance. Significant increases in data transfer speeds, combined with cost-effective data storage and ever-increasing computational power, provide the building blocks that enable companies to adopt data-driven technologies such as data analytics, IoT, and machine learning. Improved business operational efficiency and more responsive customer support provide the incentives for business investment. Digital twins, which leverage these technologies in their various forms to converge physics- and data-driven models, are therefore being widely adopted. A high-fidelity multi-physics digital twin (HFDT) is introduced that digitally replicates a gas turbine as it is built, based on part and build data, using advanced component and assembly models. The HFDT, among other benefits, enables data-driven assessments to be carried out during manufacture and assembly for each turbine, allowing these processes to be optimised and the impact of variability or process change to be readily evaluated. On delivery of the turbine and its associated HFDT to the service support team, the HFDT supports the evaluation of in-service performance deterioration, the impact of field interventions and repair, and the changes in operating characteristics resulting from overhaul and turbine upgrade, thus creating a cradle-to-grave, physics- and data-driven twin of the gas turbine asset. In this paper, one branch of the HFDT, using a power turbine module, is first presented. This involves simultaneous modelling of the gas path and the solid using high-fidelity CFD and FEA, which converts the cold geometry to hot running conditions in order to assess the impact of various manufacturing and build variabilities. It is shown that this process can be executed within reasonable time frames, enabling the creation of an HFDT for each turbine during manufacture and assembly, and allowing it to be transferred to the service team for deployment during field operations. Following this, it is shown how data-driven technologies are used in conjunction with the HFDT to improve predictions of engine performance from early build information. The example shows how a higher degree of confidence is achieved through the development of an artificial neural network relating the compressor tip-gap feature to its effect on overall compressor efficiency.
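
The following sketch shows the general form of such a surrogate model: a small neural network fitted to a synthetic relationship between a compressor tip-gap feature and an overall efficiency delta. The data, network size, and trend are invented for illustration and bear no relation to the engine data discussed in the paper.

```python
# A surrogate neural network relating a compressor tip-gap feature to an
# overall efficiency delta. Data are synthetic; this is not the paper's model.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
tip_gap = rng.uniform(0.2, 1.2, size=(500, 1))           # mm, synthetic builds
# assumed trend: larger gaps cost efficiency, with measurement noise
d_eff = -1.5 * tip_gap[:, 0] + 0.3 * tip_gap[:, 0] ** 2 + rng.normal(0, 0.05, 500)

model = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=5000, random_state=0)
model.fit(tip_gap, d_eff)

new_build = np.array([[0.45]])                            # tip gap from build data
print("predicted efficiency delta [%]:", model.predict(new_build))
```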


2018 ◽ Vol 146 (4) ◽ pp. 1197-1218
Author(s): Michèle De La Chevrotière, John Harlim

This paper demonstrates the efficacy of data-driven localization mappings for assimilating satellite-like observations in a dynamical system of intermediate complexity. In particular, a sparse network of synthetic brightness temperature measurements is simulated using an idealized radiative transfer model and assimilated into the monsoon–Hadley multicloud model, a nonlinear stochastic model containing several thousand model coordinates. A serial ensemble Kalman filter is implemented in which the empirical correlation statistics are improved using localization maps obtained from a supervised learning algorithm. The impact of the localization mappings is assessed in perfect-model observing system simulation experiments (OSSEs) as well as in the presence of model errors resulting from the misspecification of key convective closure parameters. In perfect-model OSSEs, the localization mappings, which use adjacent correlations to improve the correlation estimated from small ensemble sizes, produce robust, accurate analysis estimates. In the presence of model error, the filter skills of the localization maps trained on perfect- and imperfect-model data are comparable.
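
The sketch below shows a single serial ensemble Kalman filter update in which the sample covariance between the observed quantity and each state variable is tapered by localization weights. A simple distance-based taper stands in here for the supervised-learning localization mapping described in the paper, and the ensemble and observation are synthetic.

```python
# One serial EnKF update with a localized Kalman gain. The localization
# weights below are a distance-based placeholder standing in for a learned
# localization map; dynamics and observation are synthetic.

import numpy as np

def serial_enkf_update(ens, obs_val, obs_var, H, loc_weights, rng):
    """ens: (n_state, n_members); H picks out the observed state component."""
    n_state, n_mem = ens.shape
    y_ens = H @ ens                                       # observed ensemble (n_mem,)
    y_mean = y_ens.mean()
    x_mean = ens.mean(axis=1, keepdims=True)
    cov_xy = (ens - x_mean) @ (y_ens - y_mean) / (n_mem - 1)
    var_y = np.var(y_ens, ddof=1)
    gain = loc_weights * cov_xy / (var_y + obs_var)       # localized Kalman gain
    perturbed = obs_val + rng.normal(0, np.sqrt(obs_var), n_mem)
    return ens + np.outer(gain, perturbed - y_ens)

rng = np.random.default_rng(1)
n_state, n_mem = 40, 10
ens = rng.normal(size=(n_state, n_mem))
H = np.zeros(n_state); H[20] = 1.0                        # observe variable 20
dist = np.abs(np.arange(n_state) - 20)
loc_weights = np.exp(-(dist / 5.0) ** 2)                  # placeholder localization map
updated = serial_enkf_update(ens, obs_val=0.5, obs_var=0.1, H=H,
                             loc_weights=loc_weights, rng=rng)
print(updated.shape)
```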

