Mining historical texts for diachronic spelling variants

2020 ◽  
Vol 56 (4) ◽  
pp. 629-650
Author(s):  
Filip Graliński ◽  
Krzysztof Jassem

Abstract The paper describes a method for finding diachronic spelling variants in a corpus that consists of historical and modern Polish texts. The procedure applies the Levenshtein distance and a similarity measure determined with a Word2vec model. The method was applied to both words and sub-word units. A sample of spelling variants was manually evaluated and compared against an existing morphological analyser for Polish historical texts. The resulting lists of spelling variants and spelling modernisation rules were used in a text modernisation tool and their contribution was evaluated. The paper also presents an analogous method for finding spelling variants that result from erroneous OCR. The obtained lists of OCR variants and rules may serve for the correction of OCR output.
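
The combination of edit distance and embedding similarity described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the thresholds `max_dist` and `min_sim`, and the hand-made toy vectors are all assumptions made for illustration, and a real run would take its vectors from a trained Word2vec model.

```python
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def variant_candidates(vectors, max_dist=2, min_sim=0.8):
    """Return word pairs that are close both in spelling and in vector space.

    `vectors` maps each word to its embedding; the thresholds are
    hypothetical defaults, not values taken from the paper.
    """
    words = sorted(vectors)
    pairs = []
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            close_spelling = levenshtein(w1, w2) <= max_dist
            if close_spelling and cosine(vectors[w1], vectors[w2]) >= min_sim:
                pairs.append((w1, w2))
    return pairs


# Toy, hand-made vectors (hypothetical; stand-ins for trained embeddings).
toy_vectors = {
    "kray": [0.9, 0.1, 0.0],
    "kraj": [0.8, 0.2, 0.1],
    "krab": [0.0, 0.1, 0.9],
}

print(variant_candidates(toy_vectors))  # [('kraj', 'kray')]
```

In this toy data all three words are one edit apart, so the edit distance alone would pair them all; the vector similarity is what filters out "krab".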

2019 ◽  
Vol 26 (2) ◽  
pp. 163-182 ◽  
Author(s):  
Serge Sharoff

Abstract Some languages have very few NLP resources, while many of them are closely related to better-resourced languages. This paper explores how the similarity between the languages can be utilised by porting resources from better- to lesser-resourced languages. The paper introduces a way of building a representation shared across related languages by combining cross-lingual embedding methods with a lexical similarity measure which is based on the weighted Levenshtein distance. One of the outcomes of the experiments is a Panslavonic embedding space for nine Balto-Slavonic languages. The paper demonstrates that the resulting embedding space helps in such applications as morphological prediction, named-entity recognition and genre classification.
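
A weighted Levenshtein distance differs from the plain one in that individual substitutions can cost less than a full edit. The sketch below shows the general idea only; the cost table `CHEAP` and its values are invented for illustration and are not the weights used in the paper.

```python
def weighted_levenshtein(a, b, sub_cost=None, indel_cost=1.0):
    """Edit distance in which selected substitutions are cheaper than 1.0.

    `sub_cost` maps (char, char) pairs to a substitution cost and is
    treated as symmetric; unlisted substitutions cost 1.0.
    """
    sub_cost = sub_cost or {}
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cost = 0.0
            else:
                # Look the pair up in either order before falling back to 1.0.
                cost = sub_cost.get((ca, cb), sub_cost.get((cb, ca), 1.0))
            curr.append(min(prev[j] + indel_cost,      # deletion
                            curr[j - 1] + indel_cost,  # insertion
                            prev[j - 1] + cost))       # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical weights for letter pairs assumed to correspond across languages.
CHEAP = {("g", "h"): 0.25, ("i", "y"): 0.25}

print(weighted_levenshtein("grad", "hrad", CHEAP))  # 0.25
print(weighted_levenshtein("grad", "hrad"))         # 1.0
```

With such weights, word pairs that differ only by a regular letter correspondence come out much closer than pairs that differ by an arbitrary letter.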


1998 ◽  
Vol 3 (1) ◽  
pp. 115-150 ◽  
Author(s):  
Harold Somers

We describe and experimentally evaluate an alternative algorithm for aligning and extracting vocabulary from parallel texts using recency vectors and a similarity measure based on Levenshtein distance. The work is largely inspired by Fung and McKeown's DK-vec, though we use a simpler algorithm. The technique is tested on two sets of parallel corpora involving English, French, German, Dutch, Spanish, and Japanese. We attempt to evaluate the importance of parameters such as frequency of words chosen as candidates, the effect of different language pairings, and differences between the two corpora.


Author(s):  
Mohana Priya K ◽  
Pooja Ragavi S ◽  
Krishna Priya G

Clustering is the process of grouping objects into subsets that have meaning in the context of a particular problem. It does not rely on predefined classes. It is referred to as an unsupervised learning method because no information is provided about the "right answer" for any of the objects. Many clustering algorithms have been proposed and are used in different applications. Sentence clustering is one of the best clustering techniques. A hierarchical clustering algorithm is applied at multiple levels for accuracy. For tagging, a POS tagger and the Porter stemmer are used. The WordNet dictionary is used to determine similarity by invoking the Jiang-Conrath and cosine similarity measures. Grouping is performed with respect to the highest similarity value, with a mean threshold. This paper incorporates many parameters for finding the similarity between words. To disambiguate words, sense identification is performed for the adjectives and the results are compared. The SemCor and machine learning datasets are employed. Compared with previous results for WSD, our work shows a considerable improvement, reaching 91.2%.


2012 ◽  
Vol 2 ◽  
pp. 107-121
Author(s):  
Lilia Kowkiel ◽  
Arvydas Pacevičius ◽  
Iwona Pietrzkiewicz

Historians and publishers of historical sources face many problems with texts written in different languages and alphabets, created at different times in multilingual areas inhabited by many nations following different religions. Historians of book culture face the same problems with the texts of book inventories and catalogues, which are the primary source of knowledge about the contents of libraries. At present, it is also important that historical texts be published in digital form. This article is part of the discussion on this very important subject.


Informatica ◽  
2018 ◽  
Vol 29 (3) ◽  
pp. 399-420
Author(s):  
Alessia Amelio ◽  
Darko Brodić ◽  
Radmila Janković

2017 ◽  
Vol 3 (2) ◽  
pp. 1-6
Author(s):  
Ferly Gunawan ◽  
M. Ali Fauzi ◽  
Putra Pandu Adikara

The rapid development of mobile applications has led to many applications being created with various uses to meet users' needs. Each application allows users to write reviews of it. The purpose of reviews is to evaluate and improve the quality of the product in the future. To that end, sentiment analysis can be used to classify reviews into positive or negative sentiment. Application reviews often contain misspellings, which make them hard to understand. Misspelled words need to be normalised into their standard forms, so word normalisation is needed to solve the misspelling problem. This study uses word normalisation based on Levenshtein distance. In testing, the highest accuracy was obtained with a split of 70% training data and 30% test data. The highest accuracy, 100%, was obtained with an edit value <=2; the second highest, 96.4%, with an edit value <=1; and the lowest, 66.6%, with edit values <=4 and <=5. Naive Bayes with Levenshtein distance achieved the highest accuracy, 96.9%, compared with 94.4% for Naive Bayes without Levenshtein distance.
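
Word normalisation with an edit threshold, as this abstract describes it, can be sketched as a nearest-neighbour lookup in a list of standard words. The sketch is an assumption about the general technique rather than the study's code: the function `normalize`, the toy lexicon, and the tie-breaking rule (alphabetical order) are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def normalize(word, lexicon, max_edit=2):
    """Return the closest lexicon entry within `max_edit` edits.

    Words already in the lexicon, and words with no entry close enough,
    are returned unchanged. Ties go to the alphabetically first entry.
    """
    if word in lexicon:
        return word
    best, best_dist = word, max_edit + 1
    for candidate in sorted(lexicon):
        dist = levenshtein(word, candidate)
        if dist < best_dist:
            best, best_dist = candidate, dist
    return best


# Toy lexicon of standard forms (hypothetical, for illustration only).
lexicon = {"aplikasi", "bagus", "cepat"}

review = "aplikasu ini bgus"
print(" ".join(normalize(token, lexicon) for token in review.split()))
# aplikasi ini bagus
```

Raising `max_edit` lets more misspellings be corrected but also increases the chance of mapping an unrelated word onto a lexicon entry, which is consistent with the lower accuracy the abstract reports for larger edit values.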


2020 ◽  
Vol 11 (1) ◽  
pp. 1
Author(s):  
Dwi Puji Rahayu ◽  
Asep Yudha Wirajaya

This study presents a historiographic review of the text of the Yellow Tale in the State of Gagelang (hereinafter abbreviated as HSK). The research uses the historical method. The steps used are (1) heuristics; (2) criticism; and (3) historiography. The results show that (1) the HSK text tells about Sunan Kuning to his descendants and the various conflicts involved; (2) the history of the tumult not only describes the conflict between Java and China, but also indicates the involvement of the Dutch colonial side in it; (3) the HSK text is relevant to the history of the Pacer commotion, as illustrated by relevant and interrelated events shared between the two. Until now, the discourse that continues to be "echoed" by the colonial side is that the commotion of Chinatown is a dark history for humanity in the archipelago. In fact, this discourse continues to be reproduced whenever riots erupt in the country, and it is always based on ethnicity, religion, race, and intergroup relations. Thus, the HSK text is an important witness to the history of humanity in the archipelago. In addition, HSK also uses the background of the banner story. This shows that history is not always written by "winners", because the banner story is a folklore that is closely related to the life of the Indonesian people. Therefore, a comprehensive and integral study of HSK and other historical texts is needed to reveal the true historical facts, so that Indonesian people can rediscover the history of their ancestors, both through colonial sources and from the perspective of the nation's own historiography.

