text corpora
Recently Published Documents

TOTAL DOCUMENTS: 456 (five years: 167)
H-INDEX: 20 (five years: 4)
Informatics, 2021, Vol. 18 (4), pp. 7-16
Author(s): S. F. Lipnitsky

Objectives. The problem of automating user information support in a decision-making system at the stage of describing the problem situation is solved. The problem is relevant because significant amounts of information must be collected and processed: when many factors are involved, a person's capabilities are often insufficient to find and organize the necessary information. Solving the problem of user information support at this stage pursues three main goals: building a mathematical model of the corresponding processes; formalizing the model's set of basic concepts; and developing algorithms that implement user interaction with the information system.

Methods. Methods of set theory, probability theory and graph theory are used.

Results. A mathematical model of user information support at the stage of describing a problem situation has been developed. While interacting with the user, the system suggests special sentence and text templates to fill in. Along with the templates, the user receives help texts from the system. These are generated on the basis of a previously developed model of knowledge representation in the form of verbal associations, that is, semantic links between words and phrases corresponding to associative relationships between the entities they designate in the real world.

Conclusion. To implement the proposed model, the following algorithms have been developed: an algorithm for creating a dictionary of communicative fragments; algorithms for creating fragment-slot templates for sentences, texts and subject areas; and an algorithm for user information support. The dictionary of communicative fragments is created in four steps in accordance with their formal definition; at each step, the four conditions of the definition are tested in sequence. Fragment-slot templates of sentences are formed by replacing their basic communicative fragments with slots, and text templates are formed as tuples of the templates of their sentences. Fragment-slot templates of subject areas are created as reductions of binary relations on the sets of sentence templates from the corresponding thematic text corpora. Each thematic corpus of texts defines a certain subject area.
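The abstract names one concrete mechanism, fragment-slot templates formed by replacing communicative fragments with slots. A minimal sketch of that substitution step follows; the fragment list, the slot syntax and the fill logic are illustrative assumptions, not the paper's formalism.

```python
# Minimal sketch of the fragment-slot substitution described above.
# The fragment dictionary, slot syntax and fill logic are illustrative
# assumptions, not the formalism from the paper.

# A tiny "dictionary of communicative fragments" (invented examples).
FRAGMENTS = ["power outage", "pumping station"]

def make_template(sentence: str, fragments: list[str]) -> str:
    """Replace each known communicative fragment with a numbered slot."""
    for i, fragment in enumerate(fragments):
        sentence = sentence.replace(fragment, f"<slot{i}>")
    return sentence

def fill_template(template: str, values: dict[str, str]) -> str:
    """Fill the slots to describe a new problem situation."""
    for slot, value in values.items():
        template = template.replace(f"<{slot}>", value)
    return template

template = make_template("A power outage stopped the pumping station.", FRAGMENTS)
print(template)                                   # A <slot0> stopped the <slot1>.
print(fill_template(template, {"slot0": "flood", "slot1": "water intake"}))

# A text template would then simply be a tuple of such sentence templates.
```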


2021, pp. 016555152110605
Author(s): Gustavo Candela, María-Dolores Sáez, Pilar Escobar, Manuel Marco-Such

In the domain of Galleries, Libraries, Archives and Museums (GLAM) institutions, creative and innovative tools and methodologies for content delivery and user engagement have recently gained international attention. New methods have been proposed to publish digital collections as datasets amenable to computational use. Standardised benchmarks can help broaden the scope of machine-actionable collections and promote cultural and linguistic diversity. In this article, we propose a methodology for selecting datasets for computationally driven research applied to Spanish text corpora. This work seeks to encourage Spanish and Latin American institutions to publish machine-actionable collections that follow best practices and avoid common mistakes.


2021, Vol. 111 (6), pp. 105-136
Author(s): Gudrun Bukies

The topic of this article is 'weight' in a German-Italian language comparison. Which linguistic means are used to refer to weight in German (Gewicht), and what are the Italian equivalents? The material collected is drawn from monolingual German and Italian dictionaries, reference works and text corpora, as well as from bilingual German-Italian dictionaries and text excerpts. The classification of the so-called weight designations, including derivatives, compounds and word combinations, is carried out from an etymological and lexical perspective. In addition to the dictionary entries, German-Italian translation examples show further equivalents of terms and expressions relating to 'weight' in this language pair.


2021, Vol. 12 (4), pp. 48-52
Author(s): Strilets V.

Corpus technologies (corpora of English and Ukrainian texts and tools for their processing) represent modern specialized discourse and facilitate searching for and comparing different units of translation, which makes them a useful tool for both practicing and trainee translators. The purpose of this article is to determine the role and place of corpus technologies in teaching specialized translation, using the oil and gas industry as an example. Comparable and parallel text corpora are characterized. The paper presents methods of applying mono- and bilingual comparable and parallel corpora and corpus managers to acquire knowledge about the genre and stylistic features of texts; to develop skills in distinguishing a term and determining its collocation profile and semantic preference; to analyze translation techniques; and to translate collocations, complex noun constructions, verbal phrases, and abbreviations. Examples of relevant exercises and tasks to be performed at the translation training stage are given. Further research should aim at integrating corpus-based tasks into the translation practice stage, involving the implementation of a translation project.
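To make the kind of corpus query these exercises rely on concrete, here is a minimal sketch of a parallel-corpus concordance lookup. The aligned English-Ukrainian sentence pairs and the `concordance` helper are invented for illustration; real exercises would use a corpus manager over genuine oil and gas texts.

```python
# Minimal sketch of a parallel-corpus concordance lookup. The aligned
# English-Ukrainian pairs are toy data invented for this example.

PARALLEL = [
    ("The drilling rig was moved to the new well site.",
     "Бурову установку перемістили на нову свердловину."),
    ("The drilling mud circulates through the borehole.",
     "Буровий розчин циркулює у стовбурі свердловини."),
    ("The pipeline pressure is monitored continuously.",
     "Тиск у трубопроводі контролюється безперервно."),
]

def concordance(term, corpus):
    """Return aligned pairs whose English side contains the search term."""
    term = term.lower()
    return [(src, tgt) for src, tgt in corpus if term in src.lower()]

# Comparing the hits side by side exposes how a term and its
# collocations are rendered in the other language.
for src, tgt in concordance("drilling", PARALLEL):
    print(f"EN: {src}\nUK: {tgt}\n")
```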


2021
Author(s): Tiago Barbosa de Lima, André C. A. Nascimento, Pericles Miranda, Rafael Ferreira Mello

In Brazil, several minority languages face a serious risk of extinction, and proper documentation of these languages is a fundamental step towards avoiding it. However, for some of those languages, only a small amount of text is digitally accessible. At the same time, the automatic identification of indigenous languages remains challenging, although it could help reveal key similarities among them and connect related languages and dialects. This paper therefore studies the automatic classification of 26 neglected Brazilian native languages, given only a small amount of training data, under both supervised and unsupervised settings. Our findings indicate that applying machine learning models to the analysis of Brazilian indigenous corpora is very promising, and we hope this work encourages more research on the topic in the coming years.
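As a rough sketch of the supervised setting with scarce training data, character n-gram features with a simple classifier are a common language-identification baseline. The toy sentences and labels below are invented; the paper's actual models and data are not reproduced here.

```python
# Minimal sketch of supervised language identification with character
# n-grams, a common baseline when training data is scarce. The toy
# sentences and labels are invented; the paper covers 26 real languages.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["aba kane toro", "toro aba kane",      # hypothetical language A
         "miru selo pata", "pata miru selo"]    # hypothetical language B
labels = ["lang_a", "lang_a", "lang_b", "lang_b"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # char n-grams
    MultinomialNB(),
)
clf.fit(texts, labels)
print(clf.predict(["kane toro aba"]))  # expected: ['lang_a']
```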


Author(s): Nona Naderi, Julien Knafou, Jenny Copara, Patrick Ruch, Douglas Teodoro

The health and life science domains are well known for their wealth of named entities found in large free text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods are proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual deep masked language models and ensembles of them perform across corpora of different health and life science domains (biology, chemistry, and medicine) available in different languages (English and French). Individual deep masked language models, pretrained on external corpora, are fine-tuned on task-specific domain and language corpora and ensembled using classical majority-voting strategies. Experiments show a statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of a 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties, such as length, corpus frequency, and annotation consistency. The results suggest that ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.
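The classical majority-voting step can be stated in a few lines. This is a minimal sketch assuming token-level BIO tags from each model; the example tag sequences are invented, and in the paper the votes come from fine-tuned masked language models.

```python
# Minimal sketch of token-level majority voting over several NER models'
# BIO-tag predictions. The tag sequences below are invented examples.
from collections import Counter

def majority_vote(predictions: list[list[str]]) -> list[str]:
    """Combine per-model tag sequences by majority vote per token."""
    ensemble = []
    for token_tags in zip(*predictions):           # tags for one token across models
        tag, _count = Counter(token_tags).most_common(1)[0]
        ensemble.append(tag)                       # ties break in first-seen order
    return ensemble

model_outputs = [
    ["B-CHEM", "I-CHEM", "O", "B-DISEASE"],   # model 1
    ["B-CHEM", "O",      "O", "B-DISEASE"],   # model 2
    ["B-CHEM", "I-CHEM", "O", "O"],           # model 3
]
print(majority_vote(model_outputs))  # ['B-CHEM', 'I-CHEM', 'O', 'B-DISEASE']
```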


2021, pp. 016224392110544
Author(s): T. E. de Wildt, I. R. van de Poel, E. J. L. Chappin

We propose a new approach for tracing value change. Value change may lead to a mismatch between current value priorities in society and the values for which technologies were designed in the past; energy technologies based on fossil fuels, for example, were developed when sustainability was not considered a very important value. Better anticipation of value change is essential to avoid a lack of social acceptance and moral acceptability of technologies. While value change can be studied historically and qualitatively, we propose a more quantitative approach based on large text corpora and probabilistic topic models, which allow us to trace (new) values that are (still) latent. We demonstrate the approach for five types of value change in technology. The approach is useful for testing hypotheses about value change, such as verifying whether value change has occurred and identifying patterns of value change, and it can be used to trace value change for various technologies and text corpora, including scientific articles, newspaper articles, and policy documents.
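A minimal sketch of what tracing a latent theme over time with a probabilistic topic model can look like, using scikit-learn's LDA as a stand-in; the documents, periods and vocabulary are invented, and the paper's own corpora, model choice and value-detection step are not reproduced here.

```python
# Minimal sketch of tracing a latent theme over time with LDA.
# Documents and the sustainability-flavored vocabulary are invented.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs_by_year = {
    1990: ["cheap fuel supply growth", "fuel cost supply growth"],
    2020: ["sustainable emission reduction", "renewable emission climate"],
}

docs = [d for year_docs in docs_by_year.values() for d in year_docs]
vec = CountVectorizer()
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Average topic weight per period: a topic whose weight rises over time
# may signal an emerging (still latent) value such as sustainability.
for year, year_docs in docs_by_year.items():
    weights = lda.transform(vec.transform(year_docs)).mean(axis=0)
    print(year, weights.round(2))
```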


2021, Vol. 40 (3), pp. 421-440
Author(s): Hanna Lüschow

The use of some basic computer science concepts can expand the possibilities of (manual) graphematic text corpus analysis. With these it can be shown that graphematic variation decreases steadily in printed German texts from 1600 to 1900. While the variability declines continuously at the text-internal level, it decreases faster for the writing system of each decade as a whole. But which changes took place exactly? Which types of variation disappeared quickly, and which persisted? How do we deal with amounts of data too large to process manually? Which aspects become especially important, and which go missing, when working with a large textual base? A measure called entropy quantifies the variability of the spellings of a given word form, lemma, text or subcorpus, with few restrictions but also with less detail in the results. The difference between two spellings can be measured by the Damerau-Levenshtein distance. To a certain degree, automated data handling can also determine the exact changes that took place; these differences can then be counted and ranked. The German Text Archive of the Berlin-Brandenburg Academy of Sciences and Humanities serves as the data source. It offers, for example, orthographic normalization (which is extremely useful), part-of-speech preprocessing and lemmatization. In contrast to many other approaches, establishing today's normed spellings is not the aim of these developments and is therefore not the focus of the research; instead, the differences between individual spellings are of interest. Subsequently, the intra- and extralinguistic factors that caused these developments are to be determined. These methodological findings could then be used to improve research methods in other graphematic fields of interest, e.g. computer-mediated communication.
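Both measures named in the abstract are compact enough to sketch directly. The spelling counts below are invented rather than drawn from the German Text Archive; the distance function is the optimal-string-alignment variant of Damerau-Levenshtein.

```python
# Minimal sketch of the two measures named above. The spelling counts
# are invented; the study draws them from the German Text Archive.
from math import log2

def spelling_entropy(counts: dict[str, int]) -> float:
    """Shannon entropy (bits) of a word form's spelling distribution."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def dl_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

# Invented counts for one word form and its historical spelling variants.
print(spelling_entropy({"theil": 60, "teil": 30, "thail": 10}))  # ≈ 1.30 bits
print(dl_distance("theil", "teil"))  # 1 (one deletion)
```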

