text corpora
Recently Published Documents

TOTAL DOCUMENTS: 456 (five years: 167)
H-INDEX: 20 (five years: 4)
Informatics, 2021, Vol. 18 (4), pp. 7-16
Author(s): S. F. Lipnitsky

Objectives. The problem of automating user information support in a decision-making system at the stage of describing the problem situation is solved. The problem is relevant because significant amounts of information must be collected and processed: when many factors are involved, a person's capabilities are often insufficient to find and organize the necessary information. Solving the problem of user information support at this stage pursues three main goals: building a mathematical model of the corresponding processes; formalizing the model's set of basic concepts; and developing algorithms that implement user interaction with the information system.

Methods. Methods of set theory, probability theory and graph theory are used.

Results. A mathematical model of user information support at the stage of describing a problem situation has been developed. While interacting with the user, the system suggests special sentence and text templates to fill in. Along with the templates, the user receives help texts from the system. These are generated on the basis of a previously developed model of knowledge representation in the form of verbal associations, that is, semantic links between words and phrases corresponding to associative relationships between the entities they designate in the real world.

Conclusion. To implement the proposed model, the following algorithms have been developed: an algorithm for creating a dictionary of communicative fragments; algorithms for creating fragment-slot templates for sentences, texts and subject areas; and an algorithm for user information support. The dictionary of communicative fragments is created in four steps in accordance with their formal definition; at each step, the four conditions of the definition are tested in sequence. Fragment-slot templates of sentences are formed by replacing their basic communicative fragments with slots, and text templates are formed as tuples of the templates of their sentences. Fragment-slot templates of subject areas are created as reductions of binary relations on the sets of sentence templates from the corresponding thematic text corpora. Each thematic corpus of texts defines a certain subject area.
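The abstract names one concrete mechanism, fragment-slot templates formed by replacing communicative fragments with slots. A minimal sketch of that substitution step follows; the fragment list, the slot syntax and the fill logic are illustrative assumptions, not the paper's formalism.

```python
# Minimal sketch of the fragment-slot substitution described above.
# The fragment dictionary, slot syntax and fill logic are illustrative
# assumptions, not the formalism from the paper.

# A tiny "dictionary of communicative fragments" (invented examples).
FRAGMENTS = ["power outage", "pumping station"]

def make_template(sentence: str, fragments: list[str]) -> str:
    """Replace each known communicative fragment with a numbered slot."""
    for i, fragment in enumerate(fragments):
        sentence = sentence.replace(fragment, f"<slot{i}>")
    return sentence

def fill_template(template: str, values: dict[str, str]) -> str:
    """Fill the slots to describe a new problem situation."""
    for slot, value in values.items():
        template = template.replace(f"<{slot}>", value)
    return template

template = make_template("A power outage stopped the pumping station.", FRAGMENTS)
print(template)                                   # A <slot0> stopped the <slot1>.
print(fill_template(template, {"slot0": "flood", "slot1": "water intake"}))

# A text template would then simply be a tuple of such sentence templates.
```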


2021, pp. 016555152110605
Author(s): Gustavo Candela, María-Dolores Sáez, Pilar Escobar, Manuel Marco-Such

In the domain of Galleries, Libraries, Archives and Museums (GLAM) institutions, creative and innovative tools and methodologies for content delivery and user engagement have recently gained international attention. New methods have been proposed to publish digital collections as datasets amenable to computational use. Standardised benchmarks can help broaden the scope of machine-actionable collections and promote cultural and linguistic diversity. In this article, we propose a methodology for selecting datasets for computationally driven research applied to Spanish text corpora. This work seeks to encourage Spanish and Latin American institutions to publish machine-actionable collections that follow best practices and avoid common mistakes.


2021, Vol. 111 (6), pp. 105-136
Author(s): Gudrun Bukies

The topic of this article is 'weight' in a German-Italian language comparison. Which linguistic means are used to refer to weight in German (Gewicht), and what are the Italian equivalents? The material collected is drawn from monolingual German and Italian dictionaries, reference works and text corpora, as well as from bilingual German-Italian dictionaries and text excerpts. The classification of the so-called weight designations, including derivatives, compounds and word combinations, is carried out from an etymological and lexical perspective. In addition to the dictionary entries, German-Italian translation examples show further equivalents of terms and expressions relating to 'weight' in this language pair.


2021, Vol. 12 (4), pp. 48-52
Author(s): Strilets V.

Corpus technologies (corpora of English and Ukrainian texts and tools for their processing) represent modern specialized discourse and facilitate searching for and comparing different units of translation, which makes them a useful tool for both practicing and trainee translators. The purpose of this article is to determine the role and place of corpus technologies in teaching specialized translation, using the oil and gas industry as an example. Comparable and parallel text corpora are characterized. The paper presents methods of applying mono- and bilingual comparable and parallel corpora and corpus managers to acquire knowledge about the genre and stylistic features of texts; to develop skills in distinguishing a term and determining its collocation profile and semantic preference; to analyze translation techniques; and to translate collocations, complex noun constructions, verbal phrases, and abbreviations. Examples of relevant exercises and tasks to be performed at the translation training stage are given. Further research should aim at integrating corpus-based tasks into the translation practice stage, involving the implementation of a translation project.
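To make the kind of corpus query these exercises rely on concrete, here is a minimal sketch of a parallel-corpus concordance lookup. The aligned English-Ukrainian sentence pairs and the `concordance` helper are invented for illustration; real exercises would use a corpus manager over genuine oil and gas texts.

```python
# Minimal sketch of a parallel-corpus concordance lookup. The aligned
# English-Ukrainian pairs are toy data invented for this example.

PARALLEL = [
    ("The drilling rig was moved to the new well site.",
     "Бурову установку перемістили на нову свердловину."),
    ("The drilling mud circulates through the borehole.",
     "Буровий розчин циркулює у стовбурі свердловини."),
    ("The pipeline pressure is monitored continuously.",
     "Тиск у трубопроводі контролюється безперервно."),
]

def concordance(term, corpus):
    """Return aligned pairs whose English side contains the search term."""
    term = term.lower()
    return [(src, tgt) for src, tgt in corpus if term in src.lower()]

# Comparing the hits side by side exposes how a term and its
# collocations are rendered in the other language.
for src, tgt in concordance("drilling", PARALLEL):
    print(f"EN: {src}\nUK: {tgt}\n")
```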


2021
Author(s): Tiago Barbosa de Lima, André C. A. Nascimento, Pericles Miranda, Rafael Ferreira Mello

In Brazil, several minority languages face a serious risk of extinction, and proper documentation of these languages is a fundamental step towards avoiding it. However, for some of those languages, only a small amount of text is digitally accessible. At the same time, the automatic identification of indigenous languages remains challenging, although it could help reveal key similarities among them and connect related languages and dialects. This paper therefore studies the automatic classification of 26 neglected Brazilian native languages, given only a small amount of training data, under both supervised and unsupervised settings. Our findings indicate that applying machine learning models to the analysis of Brazilian indigenous corpora is very promising, and we hope this work encourages more research on the topic in the coming years.
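As a rough sketch of the supervised setting with scarce training data, character n-gram features with a simple classifier are a common language-identification baseline. The toy sentences and labels below are invented; the paper's actual models and data are not reproduced here.

```python
# Minimal sketch of supervised language identification with character
# n-grams, a common baseline when training data is scarce. The toy
# sentences and labels are invented; the paper covers 26 real languages.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["aba kane toro", "toro aba kane",      # hypothetical language A
         "miru selo pata", "pata miru selo"]    # hypothetical language B
labels = ["lang_a", "lang_a", "lang_b", "lang_b"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # char n-grams
    MultinomialNB(),
)
clf.fit(texts, labels)
print(clf.predict(["kane toro aba"]))  # expected: ['lang_a']
```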


Author(s): Nona Naderi, Julien Knafou, Jenny Copara, Patrick Ruch, Douglas Teodoro

The health and life science domains are well known for their wealth of named entities found in large free text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods are proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual deep masked language models and ensembles of them perform across corpora of different health and life science domains (biology, chemistry, and medicine) available in different languages (English and French). Individual deep masked language models, pretrained on external corpora, are fine-tuned on task-specific domain and language corpora and ensembled using classical majority-voting strategies. Experiments show a statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of a 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties, such as length, corpus frequency, and annotation consistency. The results suggest that ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.
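The classical majority-voting step can be stated in a few lines. This is a minimal sketch assuming token-level BIO tags from each model; the example tag sequences are invented, and in the paper the votes come from fine-tuned masked language models.

```python
# Minimal sketch of token-level majority voting over several NER models'
# BIO-tag predictions. The tag sequences below are invented examples.
from collections import Counter

def majority_vote(predictions: list[list[str]]) -> list[str]:
    """Combine per-model tag sequences by majority vote per token."""
    ensemble = []
    for token_tags in zip(*predictions):           # tags for one token across models
        tag, _count = Counter(token_tags).most_common(1)[0]
        ensemble.append(tag)                       # ties break in first-seen order
    return ensemble

model_outputs = [
    ["B-CHEM", "I-CHEM", "O", "B-DISEASE"],   # model 1
    ["B-CHEM", "O",      "O", "B-DISEASE"],   # model 2
    ["B-CHEM", "I-CHEM", "O", "O"],           # model 3
]
print(majority_vote(model_outputs))  # ['B-CHEM', 'I-CHEM', 'O', 'B-DISEASE']
```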


2021, pp. 016224392110544
Author(s): T. E. de Wildt, I. R. van de Poel, E. J. L. Chappin

We propose a new approach for tracing value change. Value change may lead to a mismatch between current value priorities in society and the values for which technologies were designed in the past; energy technologies based on fossil fuels, for example, were developed when sustainability was not considered a very important value. Better anticipation of value change is essential to avoid a lack of social acceptance and moral acceptability of technologies. While value change can be studied historically and qualitatively, we propose a more quantitative approach based on large text corpora and probabilistic topic models, which allow us to trace (new) values that are (still) latent. We demonstrate the approach for five types of value change in technology. The approach is useful for testing hypotheses about value change, such as verifying whether value change has occurred and identifying patterns of value change, and it can be used to trace value change for various technologies and text corpora, including scientific articles, newspaper articles, and policy documents.
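A minimal sketch of what tracing a latent theme over time with a probabilistic topic model can look like, using scikit-learn's LDA as a stand-in; the documents, periods and vocabulary are invented, and the paper's own corpora, model choice and value-detection step are not reproduced here.

```python
# Minimal sketch of tracing a latent theme over time with LDA.
# Documents and the sustainability-flavored vocabulary are invented.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs_by_year = {
    1990: ["cheap fuel supply growth", "fuel cost supply growth"],
    2020: ["sustainable emission reduction", "renewable emission climate"],
}

docs = [d for year_docs in docs_by_year.values() for d in year_docs]
vec = CountVectorizer()
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Average topic weight per period: a topic whose weight rises over time
# may signal an emerging (still latent) value such as sustainability.
for year, year_docs in docs_by_year.items():
    weights = lda.transform(vec.transform(year_docs)).mean(axis=0)
    print(year, weights.round(2))
```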


2021, Vol. 40 (3), pp. 421-440
Author(s): Hanna Lüschow

The use of some basic computer science concepts can expand the possibilities of (manual) graphematic text corpus analysis. With these it can be shown that graphematic variation decreases steadily in printed German texts from 1600 to 1900. While the variability declines continuously at the text-internal level, it decreases faster for the writing system of each decade as a whole. But which changes took place exactly? Which types of variation disappeared quickly, and which persisted? How do we deal with amounts of data too large to process manually? Which aspects become especially important, and which go missing, when working with a large textual base? A measure called entropy quantifies the variability of the spellings of a given word form, lemma, text or subcorpus, with few restrictions but also with less detail in the results. The difference between two spellings can be measured by the Damerau-Levenshtein distance. To a certain degree, automated data handling can also determine the exact changes that took place; these differences can then be counted and ranked. The German Text Archive of the Berlin-Brandenburg Academy of Sciences and Humanities serves as the data source. It offers, for example, orthographic normalization (which is extremely useful), part-of-speech preprocessing and lemmatization. In contrast to many other approaches, establishing today's normed spellings is not the aim of these developments and is therefore not the focus of the research; instead, the differences between individual spellings are of interest. Subsequently, the intra- and extralinguistic factors that caused these developments are to be determined. These methodological findings could then be used to improve research methods in other graphematic fields of interest, e.g. computer-mediated communication.
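Both measures named in the abstract are compact enough to sketch directly. The spelling counts below are invented rather than drawn from the German Text Archive; the distance function is the optimal-string-alignment variant of Damerau-Levenshtein.

```python
# Minimal sketch of the two measures named above. The spelling counts
# are invented; the study draws them from the German Text Archive.
from math import log2

def spelling_entropy(counts: dict[str, int]) -> float:
    """Shannon entropy (bits) of a word form's spelling distribution."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def dl_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

# Invented counts for one word form and its historical spelling variants.
print(spelling_entropy({"theil": 60, "teil": 30, "thail": 10}))  # ≈ 1.30 bits
print(dl_distance("theil", "teil"))  # 1 (one deletion)
```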

