spelling variation
Recently Published Documents

TOTAL DOCUMENTS: 38 (five years: 20)
H-INDEX: 4 (five years: 1)

2021 · Vol 40 (3) · pp. 279-295
Author(s): Anja Voeste

Abstract: In the 15th century, at a time when codification via dictionaries and grammars had not yet taken effect, printers, editors, and compositors were already producing pamphlets and books that had to meet the new requirements of the letterpress, especially as regards the arrangement of white space and uniform line justification (even margins on the left and right). The following analysis investigates five German editions of the Mirabilia Romae (Marvels of the City of Rome), a well-known pilgrim guide, all printed in 1500 for the contemporaneous Jubilee year and thus for short-term sale. The results show that compositors used different means for text alignment: in addition to deviations in line counts and the repositioning of lines, they chose extended or contracted spelling variants, predominantly in the second half of the page. The most frequent variants are abbreviations in the form of tildes; however, only a few spelling patterns with tildes were used. With respect to explanatory processes in a historical perspective, the results call for closer consideration of page format, text layout (mise-en-page), and line justification when evaluating spelling variation in early book printing.


2021 · Vol 40 (3) · pp. 297-323
Author(s): Florian Busch

Abstract: Against the backdrop of the societal differentiation of literacy, the paper investigates spelling variation in digital written communication beyond the binary paradigm of standard and nonstandard. To this end, the paper proposes a formal classification of digital spelling variants and then focuses on the socio-communicative functions of these variants in usage. Theoretically grounded in the notions of register and social indexicality, the paper discusses how spelling variants are metapragmatically ordered by social actors and deployed in text-messaging interactions in order to indicate interpretive context. To investigate these phenomena holistically, the paper furthermore presents a tripartite research framework that addresses digital writing in terms of (i) structural variants, (ii) communicative practice, and (iii) reflexive awareness. This methodological approach is then applied empirically to a data set that includes samples of everyday literacy by 23 German adolescents: informal WhatsApp texting on the one hand and formal school essays on the other. The exemplary analyses focus on phonostylistic spellings (e.g. elisions such as <ich hab> instead of <ich habe>) and graphostylistic spellings (e.g. graphemic substitutions such as <daß> instead of <dass>) in these WhatsApp interactions, reconstructing the metapragmatic status of standard orthography in digital writing. By combining structure-oriented, interactional, and ethnographic perspectives, the paper seeks a disciplinary dialogue, relating concepts from sociolinguistics and linguistic anthropology not only to media linguistics but also to research on writing systems.


Author(s): Dinesh Kumar Prabhakar, Sukomal Pal, Chiranjeev Kumar

With Web 2.0, there has been exponential growth in the number of Web users and the volume of Web content. Most of these users are not only consumers of information but also generators of it. People express themselves in their colloquial languages, but using Roman script (transliteration). These texts are mostly informal and casual, and therefore seldom follow grammar rules; nor is there any prescribed set of spelling rules for transliterated text. This freedom leads to large-scale spelling variation, which is a major challenge in mixed-script information processing. This article studies existing phonetic algorithms for handling spelling variation, points out their limitations, and proposes a novel phonetic encoding approach, in two flavors, for Hindi transliteration. Experiments on Hindi song lyrics retrieval in the mixed-script domain, using three different retrieval models, show that the proposed approaches outperform the existing techniques in a majority of cases (sometimes statistically significantly) on metrics such as nDCG@1, nDCG@5, nDCG@10, MAP, MRR, and Recall.
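
The core idea of a phonetic encoding for transliterated text can be illustrated with a minimal sketch: spellings that plausibly encode the same Hindi pronunciation are collapsed into a single key before indexing and matching. The rules and example words below are invented for illustration and are not the encoding proposed in the article.

```python
import re

def phonetic_key(word: str) -> str:
    """Collapse common Romanized-Hindi spelling alternations into one key.
    Illustrative rules only; not the encoding proposed in the article."""
    w = word.lower()
    w = re.sub(r"aa+", "a", w)        # long vowels spelt double: "pyaar" / "pyar"
    w = re.sub(r"ee+", "i", w)        # "jeena" / "jina"
    w = re.sub(r"oo+", "u", w)        # "doori" / "duri"
    w = w.replace("ph", "f")          # ph/f alternation: "phir" / "fir"
    w = w.replace("w", "v")           # w/v alternation: "dhanyawad" / "dhanyavad"
    w = re.sub(r"(.)\1+", r"\1", w)   # fold any remaining doubled letters
    return w

# Variant spellings of the same word map to one key, so a retrieval system
# can match query and lyric spellings that differ only superficially.
for v in ["dhanyavaad", "dhanyawad", "pyaar", "pyar"]:
    print(v, "->", phonetic_key(v))
```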


2021
Author(s): Alexander Robertson

Languages in earlier stages of development differ from their modern analogues, reflecting syntactic, semantic, and morphological changes over time. The study of these and other phenomena is the major concern of historical linguistics. The development of literacy and advances in technology mean that human language has often been preserved in physical form. Whilst these artefacts will eventually include video and sound recordings, the current lifeblood of historical linguistics is text. The written word is the de facto source of evidence for earlier stages of languages, and "the first-order witnesses to the more distant linguistic past are written texts" (Lass, 1997). This dissertation proposes and evaluates a method for identifying spelling variants in historical documents.


2021
Author(s): Alexander Robertson

Spelling variation in historical text negatively impacts the performance of natural language processing techniques, so normalisation is an important pre-processing step. Current methods fall some way short of perfect accuracy, often requiring large amounts of training data to be effective, and are rarely evaluated against a wide range of historical sources. This thesis evaluates three models: a Hidden Markov Model, which has not previously been used for historical text normalisation; a soft attention Neural Network model, which has previously only been evaluated on a single German dataset; and a hard attention Neural Network model, which is adapted from work on morphological inflection and applied here to historical text normalisation for the first time. Each is evaluated against multiple datasets taken from prior work on historical text normalisation, facilitating direct comparison with that existing work. The hard attention Neural Network model achieves state-of-the-art normalisation accuracy on all datasets, even when the volume of training data is restricted. This work will be of particular interest to researchers working with noisy historical data which they would like to explore using modern computational techniques.
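
For readers unfamiliar with the task, the sketch below shows the shape of historical spelling normalisation as a supervised problem: memorise (historical, modern) training pairs and back off to string similarity for unseen forms. This is a deliberately simple baseline, not the Hidden Markov or attention models evaluated in the thesis, and the training pairs are invented.

```python
from collections import Counter, defaultdict
import difflib

def train(pairs):
    """Build a lookup from historical spellings to their most frequent
    modern forms, plus the modern vocabulary for fallback matching."""
    lookup = defaultdict(Counter)
    vocab = set()
    for hist, mod in pairs:
        lookup[hist][mod] += 1
        vocab.add(mod)
    return lookup, sorted(vocab)

def normalise(word, lookup, vocab):
    if word in lookup:                                  # seen in training
        return lookup[word].most_common(1)[0][0]
    close = difflib.get_close_matches(word, vocab, n=1, cutoff=0.0)
    return close[0] if close else word                  # nearest modern form

# Invented training pairs; real datasets contain thousands of such pairs.
pairs = [("vnto", "unto"), ("haue", "have"), ("loue", "love"), ("ye", "the")]
lookup, vocab = train(pairs)
for w in ["vnto", "giue"]:   # "giue" is unseen and falls back to similarity
    print(w, "->", normalise(w, lookup, vocab))
```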


2021 · Vol 9 (1) · pp. 104-131
Author(s): Lassi Saario, Tanja Säily, Samuli Kaislaniemi, Terttu Nevalainen

This paper discusses the process of part-of-speech tagging the Corpus of Early English Correspondence Extension (CEECE), as well as the end result. The process involved normalisation of historical spelling variation, conversion from a legacy format into TEI-XML, and finally, tokenisation and tagging by the CLAWS software. At each stage, we had to face and work around problems such as whether to retain original spelling variants in the corpus markup, how to implement overlapping hierarchies in XML, and how to calculate the accuracy of tagging in a way that acknowledges errors in tokenisation. The final tagged corpus is estimated to have an accuracy of 94.5 per cent (in the C7 tagset), which is circa two percentage points (pp) lower than that of present-day corpora but respectable for Late Modern English. The most accurate tag groups include pronouns and numerals, whereas adjectives and adverbs are among the least accurate. Normalisation increased the overall accuracy of tagging by circa 3.7 pp. The combination of POS tagging and social metadata will make the corpus attractive to linguists interested in the interplay between language-internal and language-external factors affecting variation and change.
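
One way to calculate tagging accuracy while acknowledging tokenisation errors is to align the gold and automatic token sequences and count any gold token the tokeniser failed to reproduce as an error. The sketch below illustrates that idea; it is not the CEECE evaluation code, and the toy tokens and C7 tags are invented.

```python
from difflib import SequenceMatcher

def tagging_accuracy(gold, auto):
    """gold, auto: lists of (token, tag) pairs. Gold tokens that the
    tokeniser split or merged never align, so they count as errors."""
    matcher = SequenceMatcher(a=[t for t, _ in gold],
                              b=[t for t, _ in auto], autojunk=False)
    correct = 0
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            if gold[block.a + k][1] == auto[block.b + k][1]:
                correct += 1
    return correct / len(gold)

# Toy example: the tokeniser wrongly splits "haue" into two tokens.
gold = [("I", "PPIS1"), ("haue", "VH0"), ("written", "VVN")]
auto = [("I", "PPIS1"), ("ha", "VH0"), ("ue", "NN1"), ("written", "VVN")]
print(f"{tagging_accuracy(gold, auto):.2f}")   # 2 of 3 gold tokens -> 0.67
```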


2020 · Vol Special issue on...
Author(s): Martti Mäkinen

Automated approaches to identifying the authorship of a text have become commonplace in stylometric studies. The current article applies an unsupervised stylometric approach to Middle English documents using the Stylo package in R, in an attempt to distinguish between texts from different dialectal areas. The approach is based on the distribution of character 3-grams generated from the texts of the Corpus of Middle English Local Documents (MELD). The article adopts a middle ground in the study of Middle English spelling variation, between the concept of relational linguistic space and the real linguistic continuum of medieval England. Stylo can distinguish between Middle English dialects by using the less frequent character 3-grams.
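
The underlying measurement is straightforward: each document is reduced to a character 3-gram frequency profile, and profiles are compared by a distance such as cosine. The sketch below illustrates this with invented toy snippets; it uses plain Python rather than Stylo and does not reproduce the article's focus on the less frequent 3-grams.

```python
from collections import Counter
from math import sqrt

def char_trigrams(text):
    text = " ".join(text.lower().split())              # normalise whitespace
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(p, q):
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm

# Invented snippets standing in for MELD documents from different areas.
docs = {
    "northern_a": "yat ye kirke sal be halden",
    "northern_b": "yat ye kirk sal be",
    "southern_a": "that the chirche schal be holde",
}
profiles = {name: char_trigrams(t) for name, t in docs.items()}
for a in docs:
    for b in docs:
        if a < b:
            print(a, b, round(cosine(profiles[a], profiles[b]), 2))
```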


Author(s): Laura García Fernández

Abstract: This article describes the steps and results of the lemmatization of the derived anomalous verbs of Old English. The data have been retrieved from The Dictionary of Old English Web Corpus and searched through Norna, the lexical database of the Nerthus Project. The methodology comprises several steps combining automatic searches on the lemmatizer with manual revision. Part of the results, comprising the verbs beginning with the letters A to H, is compared with the Dictionary of Old English, while the remaining lemmas are checked against the standard Old English dictionaries (Clark-Hall, Sweet, and Bosworth-Toller). The discussion leads to the conclusion that the lemmatization of the verbs of Old English, a language with a remarkable degree of spelling variation, requires considerable manual revision. However, the progressive improvement of automatic searches, based on comparing the initial results with the available lexicographical sources, minimizes the need for manual adjustment.
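
A minimal sketch of the kind of automatic search involved might expand an attested form into plausible spelling variants and look each one up in an index of known forms, deferring anything unmatched to manual revision. The alternation rules and the toy index below are invented for illustration and are not those of the Norna database.

```python
from itertools import product

# Alternation classes that commonly underlie Old English spelling variation.
# Purely illustrative; not the rules used in the lemmatization described above.
ALTERNATIONS = {"þ": "þð", "ð": "þð", "i": "iy", "y": "iy"}

def variants(form):
    options = [ALTERNATIONS.get(c, c) for c in form]
    return {"".join(combo) for combo in product(*options)}

def lemmatise(form, index):
    """index maps attested spellings to lemmas; unmatched forms are
    returned as None so they can be queued for manual revision."""
    for v in variants(form):
        if v in index:
            return index[v]
    return None

# Invented toy index of attested spellings.
index = {"willan": "willan", "don": "dōn"}
for f in ["wyllan", "don", "gedon"]:
    print(f, "->", lemmatise(f, index))
```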


2020 · Vol 2020 (Towards a Digital Ecosystem:...)
Author(s): Thibault Clérice

Tokenization of modern and older Western European languages seems fairly simple, as it mostly relies on markers such as spaces and punctuation. However, when dealing with older sources such as manuscripts written in scripta continua, ancient epigraphy, or medieval manuscripts, (1) such markers are mostly absent and (2) spelling variation and rich morphology make dictionary-based approaches difficult. Applying a convolutional encoding to characters, followed by linear classification of each position as word boundary or in-word sequence, is shown to be effective at tokenizing such inputs. Additionally, the software is released with a simple interface for tokenizing a corpus or generating a training set.
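
The described architecture, a convolutional encoding over characters followed by a per-position linear classification into word-boundary or in-word, can be sketched roughly as follows. This is an untrained toy model with placeholder hyperparameters (PyTorch assumed), not the released software.

```python
import torch
import torch.nn as nn

class BoundaryTagger(nn.Module):
    """Character-level sketch: a convolutional encoder over characters and a
    linear layer that classifies each position as boundary or in-word."""
    def __init__(self, vocab_size, emb_dim=32, channels=64, kernel=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, channels, kernel, padding=kernel // 2)
        self.classify = nn.Linear(channels, 2)    # 0 = in-word, 1 = boundary

    def forward(self, char_ids):                  # char_ids: (batch, length)
        x = self.embed(char_ids).transpose(1, 2)  # -> (batch, emb, length)
        x = torch.relu(self.conv(x)).transpose(1, 2)
        return self.classify(x)                   # -> (batch, length, 2)

# Toy run on scripta-continua-style input with spaces stripped.
text = "inprincipiocreavit"
chars = sorted(set(text))
ids = torch.tensor([[chars.index(c) for c in text]])
model = BoundaryTagger(vocab_size=len(chars))
boundaries = model(ids).argmax(dim=-1)            # untrained, so labels are arbitrary
print(boundaries.shape, boundaries[0].tolist())
```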

