Computer mapping of language data

Author(s):  
William A. Kretzschmar
2020 ◽  
Vol 51 (2) ◽  
pp. 479-493
Author(s):  
Jenny A. Roberts ◽  
Evelyn P. Altenberg ◽  
Madison Hunter

Purpose: The results of automatic machine scoring of the Index of Productive Syntax from the Computerized Language ANalysis (CLAN) tools of the Child Language Data Exchange System of TalkBank (MacWhinney, 2000) were compared to manual scoring to determine the accuracy of the machine-scored method. Method: Twenty transcripts of 10 children, drawn from archival data of the Weismer Corpus from the Child Language Data Exchange System at 30 and 42 months, were examined. Measures of absolute point difference and point-to-point accuracy were compared, as well as points erroneously given and missed. Two new measures for evaluating automatic scoring of the Index of Productive Syntax were introduced, Machine Item Accuracy (MIA) and Cascade Failure Rate, which further analyze points erroneously given and missed. Differences in total scores, subscale scores, and individual structures were also reported. Results: Mean absolute point difference between machine and hand scoring was 3.65, point-to-point agreement was 72.6%, and MIA was 74.9%. There were large differences in subscales, with the Noun Phrase and Verb Phrase subscales generally providing greater accuracy and agreement than the Question/Negation and Sentence Structures subscales. There were significantly more erroneous than missed items in machine scoring, attributed to mistagging of elements, imprecise search patterns, and other errors. Cascade failure resulted in an average of 4.65 points lost per transcript. Conclusions: The CLAN program showed relatively inaccurate outcomes in comparison to manual scoring on both traditional and new measures of accuracy. Recommendations for improvement of the program include accounting for second exemplar violations and applying cascaded credit, among other suggestions. The authors propose that research on machine-scored syntax routinely report accuracy measures detailing erroneous and missed scores, including MIA, so that researchers and clinicians are aware of the limitations of a machine-scoring program.
Supplemental Material: https://doi.org/10.23641/asha.11984364
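The two traditional measures named in this abstract can be illustrated with a minimal sketch. It assumes the simplest reading of each: mean absolute point difference is the average of |machine total - hand total| across transcripts, and point-to-point agreement is the percentage of items on which the two scorers gave the same score. The function names and all scores below are hypothetical and are not taken from the study. MIA and Cascade Failure Rate are not sketched because the abstract does not define them.

```python
from statistics import mean


def mean_absolute_point_difference(machine_totals, hand_totals):
    """Average absolute gap between machine and hand total scores."""
    if len(machine_totals) != len(hand_totals):
        raise ValueError("score lists must have the same length")
    return mean(abs(m - h) for m, h in zip(machine_totals, hand_totals))


def point_to_point_agreement(machine_items, hand_items):
    """Percentage of items on which machine and hand scores are identical."""
    if len(machine_items) != len(hand_items):
        raise ValueError("item lists must have the same length")
    matches = sum(m == h for m, h in zip(machine_items, hand_items))
    return 100.0 * matches / len(hand_items)


# Hypothetical total scores for four transcripts.
machine_totals = [60, 72, 55, 48]
hand_totals = [63, 70, 55, 47]
print(mean_absolute_point_difference(machine_totals, hand_totals))  # 1.5

# Hypothetical item-level scores (0, 1, or 2 points) for one transcript.
machine_items = [2, 1, 0, 2, 1]
hand_items = [2, 2, 0, 2, 0]
print(point_to_point_agreement(machine_items, hand_items))  # 60.0
```

Other definitions of point-to-point agreement exist, so a reported figure should always state which items enter the denominator.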


1991 ◽  
Vol 30 (04) ◽  
pp. 275-283 ◽  
Author(s):  
P. M. Pietrzyk

Abstract: Much information about patients is stored in free text. Hence, the computerized processing of medical language data has been a well-known goal of medical informatics, and it has given rise to different paradigms. In Göttingen, a Medical Text Analysis System for German (MediTAS) has been under development for some time; it attempts to combine and extend these paradigms. This article concentrates on the automated syntax analysis of German medical utterances. The investigated text material consists of 8,790 distinct utterances extracted from the summary sections of about 18,400 cytopathological findings reports. The parsing is based on a new approach called Left-Associative Grammar (LAG), developed by Hausser. By considerably extending the LAG approach, most of the grammatical constructions occurring in the text material could be covered.


2019 ◽  
Vol 113 (1) ◽  
pp. 9-30
Author(s):  
Kateřina Rysová ◽  
Magdaléna Rysová ◽  
Michal Novák ◽  
Jiří Mírovský ◽  
Eva Hajičová

Abstract: In this paper, we present the EVALD (Evaluator of Discourse) applications for automated essay scoring. EVALD is the first tool of this type for Czech. It evaluates texts written by both native and non-native speakers of Czech. We first describe the history and current state of automated essay scoring, illustrated by examples of systems for other languages, mainly English. We then focus on the methodology of creating the EVALD applications and describe the datasets used for testing as well as for the supervised training that EVALD builds on. Furthermore, we analyze in detail a sample of newly acquired language data, namely texts written by non-native speakers reaching the threshold level of Czech language acquisition required, e.g., for permanent residence in the Czech Republic, and we focus on linguistic differences between the available text levels. We present the feature set used by EVALD and, based on the analysis, extend it with new spelling features. Finally, we evaluate the overall performance of several variants of EVALD and analyze the collected results.


2017 ◽  
Vol 7 (2) ◽  
pp. 167
Author(s):  
Zainal Abidin

This study describes assimilation in the Bonai Ulakpatian isolect of Riau Province. It is a linguistic study of a sound change in which differing sounds become the same in the position between two vowels in the middle of a word, in an isolect used by the Bonai ethnic group in Ulakpatian Village, Rokan Hulu Regency. The data are utterances of members of the Bonai community, collected using the conversation (interview) and observation methods with elicitation and recording techniques. The data were transcribed phonetically using IPA symbols and compared with the PM lexicon, and conclusions were then drawn. The results show that the Bonai Ulakpatian (BU) isolect has four forms of assimilation in the position between two vowels in the middle of a word: 1) PM *nd/v-v > BU [n]/v-v, 2) PM *ŋg/v-v > BU [ŋ]/v-v, and 3) PM *mb/v-v > BU [m]/v-v, which are total progressive and phonetic assimilations; and 4) PM *nj/v-v > BU [ñ]/v-v, which is a reciprocal and phonemic assimilation.
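The four correspondences reported in this abstract can be read as rewrite rules that apply only between vowels. The sketch below is one way to express them, assuming a five-vowel inventory (a, e, i, o, u) and a plain string transcription; both are simplifications. The function name and the input strings are hypothetical placeholders, not attested PM or BU forms.

```python
import re

# Assumed vowel inventory; the abstract does not list the vowels.
VOWELS = "aeiou"

# Correspondences from the abstract: cluster in PM, reflex in BU.
RULES = [
    ("nd", "n"),
    ("ŋg", "ŋ"),
    ("mb", "m"),
    ("nj", "ñ"),
]


def apply_assimilation(form):
    """Apply each cluster-to-nasal rule where the cluster stands between vowels."""
    for cluster, reflex in RULES:
        # Lookbehind and lookahead require a vowel on each side
        # without consuming it, so the vowels are left unchanged.
        pattern = rf"(?<=[{VOWELS}]){cluster}(?=[{VOWELS}])"
        form = re.sub(pattern, reflex, form)
    return form


# Hypothetical placeholder strings, used only to exercise the rules.
for form in ["tanda", "taŋga", "tamba", "tanja", "ndat"]:
    print(form, "->", apply_assimilation(form))
```

The last placeholder, with the cluster at the start of the word, is left unchanged because the rules are restricted to the intervocalic position.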


2021 ◽  
Vol 21 (2) ◽  
pp. 1-25
Author(s):  
Pin Ni ◽  
Yuming Li ◽  
Gangmin Li ◽  
Victor Chang

Cyber-Physical Systems (CPS), as multi-dimensional complex systems that connect the physical world and the cyber world, have a strong demand for processing large amounts of heterogeneous data. This processing includes Natural Language Inference (NLI) tasks on text from different sources. However, current research on natural language processing in CPS has not explored this area. Therefore, this study proposes a Siamese Network structure that combines Stacked Residual Long Short-Term Memory (bidirectional) with the Attention mechanism and Capsule Network for the NLI module in CPS, which is used to infer the relationship between text/language data from different sources. The model is intended as the basic semantic understanding module in CPS and is evaluated in detail on three main NLI benchmarks. Comparative experiments show that the proposed method achieves competitive performance, has a certain generalization ability, and can balance performance against the number of trained parameters.


2021 ◽  
Vol 14 (2) ◽  
pp. 1-45
Author(s):  
Danielle Bragg ◽  
Naomi Caselli ◽  
Julie A. Hochgesang ◽  
Matt Huenerfauth ◽  
Leah Katz-Hernandez ◽  
...  

Sign language datasets are essential to developing many sign language technologies. In particular, datasets are required for training artificial intelligence (AI) and machine learning (ML) systems. Though the idea of using AI/ML for sign languages is not new, technology has now advanced to a point where developing such sign language technologies is becoming increasingly tractable. This critical juncture provides an opportunity to be thoughtful about an array of Fairness, Accountability, Transparency, and Ethics (FATE) considerations. Sign language datasets typically contain recordings of people signing, which is highly personal. The rights and responsibilities of the parties involved in data collection and storage are also complex and involve individual data contributors, data collectors or owners, and data users who may interact through a variety of exchange and access mechanisms. Deaf community members (and signers, more generally) are also central stakeholders in any end applications of sign language data. The centrality of sign language to deaf culture identity, coupled with a history of oppression, makes usage by technologists particularly sensitive. This piece presents many of these issues that characterize working with sign language AI datasets, based on the authors’ experiences living, working, and studying in this space.


Languages ◽  
2021 ◽  
Vol 6 (3) ◽  
pp. 123
Author(s):  
Thomas A. Leddy-Cecere

The Arabic dialectology literature repeatedly asserts the existence of a macro-level classificatory relationship binding the Arabic speech varieties of the combined Egypto-Sudanic area. This proposal, though oft-encountered, has not previously been formulated in reference to extensive linguistic criteria, but is instead framed primarily on the nonlinguistic premise of historical demographic and genealogical relationships joining the Arabic-speaking communities of the region. The present contribution provides a linguistically based evaluation of this proposed dialectal grouping, to assess whether the postulated dialectal unity is meaningfully borne out by available language data. Isoglosses from the domains of segmental phonology, phonological processes, pronominal morphology, verbal inflection, and syntax are analyzed across six dialects representing Arabic speech in the region. These are shown to offer minimal support for a unified Egypto-Sudanic dialect classification, but instead to indicate a significant north–south differentiation within the sample—a finding further qualified via application of the novel method of Historical Glottometry developed by François and Kalyan. The investigation concludes with reflection on the implications of these results on the understandings of the correspondence between linguistic and human genealogical relationships in the history of Arabic and in dialectological practice more broadly.

