text corpus
Recently Published Documents


Total documents: 332 (five years: 129)
H-index: 11 (five years: 3)

2022 ◽  
Author(s):  
Rob Churchill ◽  
Lisa Singh

Topic models have been applied to everything from books to newspapers to social media posts in an effort to identify the most prevalent themes of a text corpus. We provide an in-depth analysis of unsupervised topic models from their inception to today. We trace the origins of different types of contemporary topic models, beginning in the 1990s, and we compare their proposed algorithms, as well as their different evaluation approaches. Throughout, we also describe settings in which topic models have worked well and areas where new research is needed, setting the stage for the next generation of topic models.


2022 ◽  
Vol 6 (1) ◽  
pp. 4
Author(s):  
Dmitry Soshnikov ◽  
Tatiana Petrova ◽  
Vickie Soshnikova ◽  
Andrey Grunin

Since the beginning of the COVID-19 pandemic almost two years ago, more than 700,000 scientific papers have been published on the subject. An individual researcher cannot possibly get acquainted with such a huge text corpus, so some help from artificial intelligence (AI) is needed. We propose an AI-based tool to help researchers navigate collections of medical papers in a meaningful way and extract knowledge from scientific COVID-19 papers. The main idea of our approach is to get as much semi-structured information from the text corpus as possible, using named entity recognition (NER) with a model called PubMedBERT and the Text Analytics for Health service, and then to store the data in a NoSQL database for fast further processing and insight generation. Additionally, the contexts in which the entities were used (neutral or negative) are determined. Applying NLP and text-based emotion detection (TBED) methods to the COVID-19 text corpus allows us to gain insights into important issues of diagnosis and treatment, such as changes in medical treatment over time, joint treatment strategies using several medications, and the connection between signs and symptoms of coronavirus.


2022 ◽  
Vol 14 (1) ◽  
pp. 0-0

POS (part-of-speech) tagging, a vital step in diverse natural language processing (NLP) tasks, has not drawn much attention in the case of Odia, a computationally under-developed language. The proposed hybrid method yields a robust POS tagger for Odia. Given the rich morphology of the language and the lack of sufficient annotated text corpora, the tagger combines machine learning with linguistic rules. The tagger is trained on a tagged text corpus from the tourism domain and achieves a perceptible improvement in results. Appreciable performance is also observed on news article texts from varied domains. In experiments on Odia, the proposed algorithm outperforms existing methods such as rule-based, hidden Markov model (HMM), maximum entropy (ME), and conditional random field (CRF) taggers.
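To make the idea of a hybrid tagger concrete, here is a minimal sketch that combines a learned component with hand-written rules. It is not the authors' tagger: the learned part is only a most-frequent-tag lexicon (a stand-in for the machine learning model), the suffix rules and tag names are hypothetical, and the toy data is English rather than Odia.

```python
from collections import Counter, defaultdict

def train_lexicon(tagged_sentences):
    """Learn the most frequent tag for each word seen in the tagged corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

# Hypothetical suffix rules (English toy examples, checked in order).
SUFFIX_RULES = [("ing", "VERB"), ("ly", "ADV"), ("s", "NOUN")]

def tag(words, lexicon, rules=SUFFIX_RULES, default="NOUN"):
    """Tag known words from the lexicon; fall back to suffix rules otherwise."""
    tagged = []
    for word in words:
        key = word.lower()
        if key in lexicon:
            chosen = lexicon[key]
        else:
            # First matching suffix rule wins; use the default tag if none match.
            chosen = next((t for suffix, t in rules if key.endswith(suffix)), default)
        tagged.append((word, chosen))
    return tagged

training = [
    [("the", "DET"), ("guide", "NOUN"), ("walks", "VERB")],
    [("the", "DET"), ("temple", "NOUN")],
]
lexicon = train_lexicon(training)
print(tag(["the", "guide", "slowly", "visiting", "temples"], lexicon))
```

The rule fallback is what lets such a tagger cope with words missing from a small annotated corpus, which is the situation the abstract describes for Odia.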


2021 ◽  
Author(s):  
Jianfeng Qu ◽  
Wen Hua ◽  
Dantong Ouyang ◽  
Xiaofang Zhou

2021 ◽  
Vol 22 (4) ◽  
pp. 387-400
Author(s):  
Shashank Srivastav ◽  
Pradeep Kumar Singh ◽  
Divakar Yadav

Use of search on the World Wide Web (WWW) is growing steadily among users around the world. The text corpus on the WWW is growing at an exponential rate, so we need an efficient indexing algorithm that reduces both space and time during the search process. This paper proposes a new technique, called WBTC_PWT, that uses Word-Based Tagging Coding compression implemented with a Parallel Wavelet Tree. WBTC_PWT uses the word-based tagging coding technique to reduce the space complexity of the index, and a parallel wavelet tree to reduce the time it takes to construct indexes. The technique uses compressed pattern matching to minimize search time complexity. All the unique words in the text corpus are divided into levels according to the word frequency table, and a separate wavelet tree is built for each level in parallel. Compared with existing search algorithms based on compressed text, the proposed WBTC_PWT search method is significantly faster and reduces the chance of false matches.
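The level-splitting step can be illustrated with a short sketch. It only shows how a word frequency table might partition the vocabulary into levels and how one structure per level could be built in parallel. The level sizes are a hypothetical parameter, and the per-level structure is a plain position list standing in for a wavelet tree; neither is taken from the paper.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def split_into_levels(tokens, level_sizes=(2, 4)):
    """Rank unique words by descending frequency and cut the ranking into levels."""
    freq = Counter(tokens)
    ranked = sorted(freq, key=lambda w: (-freq[w], w))  # ties broken alphabetically
    levels, start = [], 0
    for size in level_sizes:
        levels.append(ranked[start:start + size])
        start += size
    if start < len(ranked):
        levels.append(ranked[start:])  # remaining rare words form the last level
    return [level for level in levels if level]

def build_level_index(tokens, level_words):
    """Stand-in for a per-level wavelet tree: word -> list of positions in the text."""
    wanted = set(level_words)
    index = {word: [] for word in level_words}
    for position, token in enumerate(tokens):
        if token in wanted:
            index[token].append(position)
    return index

def build_all_levels(tokens, level_sizes=(2, 4)):
    """Build one index per level, with the levels processed concurrently."""
    levels = split_into_levels(tokens, level_sizes)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda level: build_level_index(tokens, level), levels))

tokens = "the cat and the dog and the bird".split()
print(split_into_levels(tokens))
print(build_all_levels(tokens))
```

Because the levels share no words, each per-level structure can be built independently, which is what makes the parallel construction possible.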


2021 ◽  
Vol 30 (1) ◽  
pp. 97-121
Author(s):  
Tien-Ping Tan ◽  
Chai Kim Lim ◽  
Wan Rose Eliza Abdul Rahman

A parallel text corpus is an important resource for building a machine translation (MT) system. Existing resources such as translated documents, bilingual dictionaries, and translated subtitles are excellent sources for constructing a parallel text corpus. A sentence alignment algorithm aligns source and target sentences automatically, because manual sentence alignment is resource-intensive. Over the years, sentence alignment approaches have progressed from sentence length heuristics to statistical lexical models to deep neural networks. Treating alignment as a classification problem is attractive because classification is at the core of machine learning. This paper proposes a parallel long short-term memory network with attention and a convolutional neural network (parallel LSTM+Attention+CNN) for classifying two sentences as parallel or non-parallel. A sliding window approach is also proposed that uses the classifier to align sentences in the source and target languages. The proposed approach was compared with three classifiers, namely a feedforward neural network, a CNN, and a bi-directional LSTM. It was also compared with the BleuAlign sentence alignment system. The classification accuracy of these models was evaluated on a Malay-English parallel text corpus and the UN French-English parallel text corpus. Malay-English sentence alignment performance was then evaluated on research documents and on a very challenging Classical Malay-English document. The proposed classifier obtained more than 80% accuracy in categorizing parallel/non-parallel sentences with a model built from only five thousand parallel training sentences. It also has a higher sentence alignment accuracy than the baseline systems.
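The sliding window idea can be sketched as follows. This is an illustration under stated assumptions, not the paper's system: the scoring function is a toy lexicon-overlap measure standing in for the neural classifier, the window is forward-only (alignments are assumed to be monotone), and the `window` and `threshold` parameters, the lexicon, and the example sentences are all hypothetical.

```python
def align_sentences(source, target, score, window=2, threshold=0.5):
    """Align each source sentence to the best-scoring target sentence in a window."""
    alignments = []
    next_target = 0  # first target position not yet consumed
    for i, src in enumerate(source):
        best_j, best_score = None, None
        # Only look a few sentences ahead of the last aligned target position.
        for j in range(next_target, min(len(target), next_target + window + 1)):
            s = score(src, target[j])
            if s >= threshold and (best_score is None or s > best_score):
                best_j, best_score = j, s
        if best_j is not None:
            alignments.append((i, best_j))
            next_target = best_j + 1
    return alignments

# Toy Malay-English lexicon used only by the stand-in scorer below.
TOY_LEXICON = {"saya": "i", "suka": "like", "kucing": "cats", "hujan": "raining",
               "dia": "she", "membaca": "reads", "buku": "book"}

def overlap_score(src, tgt):
    """Fraction of source words whose toy translation appears in the target."""
    src_words = src.lower().split()
    tgt_words = set(tgt.lower().split())
    hits = sum(1 for w in src_words if TOY_LEXICON.get(w) in tgt_words)
    return hits / len(src_words) if src_words else 0.0

source = ["saya suka kucing", "hujan turun", "dia membaca buku"]
target = ["i like cats", "an unrelated sentence", "it is raining", "she reads a book"]
print(align_sentences(source, target, overlap_score))
```

In a real system the `score` argument would be the probability that the classifier assigns to the "parallel" class; the window keeps the number of classifier calls linear in the document length.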


Author(s):  
Regīna Kvašīte ◽  
Kazimiers Župerka ◽  

The aim of the research is to find out what words are used in Lithuanian and Latvian to name the rural population. The study was performed by applying descriptive, comparative and quantitative methods. The novelty of the article is the presentation of the Lithuanian language material in Latvian, as well as the analysis of the Latvian language material and the comparison of the meanings and use of Lithuanian and Latvian words. The study is sociolinguistic, not normative; therefore, not only systematic but also contextual, situational synonymy is important. The main sources are dictionaries and texts of literary and common languages, synonyms, slang and jargon, the Corpus of the Contemporary Lithuanian Language (Dabartinės lietuvių kalbos tekstynas) and the Latvian language text corpus (Latviešu valodas tekstu korpuss). The Lithuanian word kaimietis (‘a villager’), which has long been a neutral name for a rural resident or a person born in a village, is a synonym for both neutral and stylistically connoted words. The most common synonyms are sodietis (‘a homestead peasant’) and valstietis (‘a peasant’). In this synonym sequence, a peasant is a remote word that includes the concept “kaimo gyventojas” (‘a rural resident’) and the concept “žemdirbys” (‘an agriculturalist’), thus linking the synonym sequence of the word a villager to the word farmer in the synonym sequence ūkininkas (‘a farmer’), laukininkas (‘a field peasant’). Recently, the word kaimietis (‘a villager’) has acquired a second, pejorative meaning: “sakoma apie neišsilavinusį, prasto skonio ir pan. žmogų, kuris nebūtinai kilęs iš kaimo” (‘said of an uneducated person, a person of poor taste, and so on, who does not necessarily come from the countryside’). It is already recorded in the written dictionary of the common language, which indicates that the common connoted meaning in slang is codified.
The word kaimietis (‘a villager’), used in a pejorative sense, appears in the order of words that have a systemic or contextual pejorative meaning, as well as in a despising way: prastuolis, prasčiokas, mužikas, runkelis. The name of the villager in Latvian – the word laucinieks (‘a villager’) – is stylistically neutral, its synonyms consist of the neutral words lauksaimnieks (‘a farmer’) and zemnieks (‘a peasant’). The word zemnieks, similarly to the valstietis (‘a peasant’) in Lithuanian, is the dominant in the order of distant synonyms zemkopis (‘an agriculturalist’) and zemesrūķis [?]. The approach to the synonym sādžinieks (‘a homestead peasant’) is ambiguous: its definition in current dictionaries associates the word either with Latgale or Russia, although according to its origin, it is considered to be a borrowing from the Lithuanian language. The word with root lauk- (from word ‘field’) lauķis [?] is used in a pejorative sense in Latvian (its shade is similar to the Lithuanian words prasčiokas (‘a hick’) and runkelis (‘a person as mindless as a beetroot’)), as well as slang word pāķis [?] and barbarisms – slavism mužiks (‘a kern’), Germanism bauris [?] (in jargon bauers). The material of Lithuanian and Latvian texts shows that in both Lithuanian and Latvian, the words of different connotations are used synonymously in different contexts.


Litera ◽  
2021 ◽  
pp. 227-241
Author(s):  
Anastasia Boginskaya

This research is dedicated to the topic of multivariate translations. Drawing on a text corpus that contains 16 French translations of A. S. Pushkin's novel “Eugene Onegin”, the analysis examines how certain characteristic stylistic patterns are conveyed in the French texts alongside other stylistic techniques of the original, as well as how translations change depending on the poetic form chosen by the translator. The extensive material selected traces the evolution of translators’ approach towards the stylistics of Pushkin's text over time. The article focuses on chapters III and VIII of the novel. Comparative analysis demonstrates the dependence of the stylistic aspect of translation on the poetic form chosen by the translator. Prose translations provide more accurate stylistic equivalents than verse translations. Poetry translations are divided into two groups: 1) accurate compliance with the rhyme pattern of Onegin’s verse; 2) departure from the rhyme pattern of the original. The frequency of transmitting stylistic techniques of the original does not demonstrate significant systematic differences between the two groups. The author determines the consistencies in conveying certain stylistic patterns in various French translations. Periphrases, comparisons, inversions, and metaphors most of the time receive stylistically accurate equivalents in all translations, while metonymy and polysyndeton with the conjunction “and” do not. The scientific novelty lies in the examination of a text corpus that contains virtually all existing full translations of the novel “Eugene Onegin”.


2021 ◽  
Vol 2052 (1) ◽  
pp. 012027
Author(s):  
D V Mikhaylov ◽  
G M Emelyanov

This paper addresses the unity and integrity of the image of a semantic pattern (i.e., a sense standard) revealed phrase by phrase for a text within a topical collection. One phrase here corresponds to an extended natural-language sentence. Affinity to the standard is estimated by classifying the words of each phrase in a text according to their TF-IDF values relative to a text corpus. Texts for the corpus are pre-selected by an expert. The essence of the problem is that each phrase achieves its maximal affinity to the sense standard with respect to an individual corpus document; consequently, it is necessary to estimate the mutual relevance of such documents across the different phrases of the analyzed text. Based on distances between the TF-IDF vectors for the words of a single phrase, obtained relative to different corpus documents, a significance estimate for each such document is introduced in order to choose a pair of mutually relevant documents.
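As a rough illustration of the quantities involved, the sketch below computes, for the words of one phrase, a TF-IDF vector relative to each corpus document and then the Euclidean distances between those vectors. It assumes a plain TF-IDF definition (relative term frequency times the logarithm of inverse document frequency) and whitespace tokenization; the paper's word classification and significance estimate are not reproduced, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(phrase_words, corpus):
    """One vector per corpus document, with one TF-IDF component per phrase word."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each phrase word occurs.
    df = {w: sum(1 for tokens in tokenized if w in tokens) for w in phrase_words}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vector = []
        for w in phrase_words:
            tf = counts[w] / total
            idf = math.log(n_docs / df[w]) if df[w] else 0.0
            vector.append(tf * idf)
        vectors.append(vector)
    return vectors

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def closest_pair(vectors):
    """Indices of the two documents whose TF-IDF vectors are nearest to each other."""
    pairs = [(euclidean(vectors[i], vectors[j]), i, j)
             for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return min(pairs)[1:]

corpus = ["semantic pattern of text", "semantic pattern of phrase", "weather report today"]
phrase = ["semantic", "pattern", "text"]
vectors = tf_idf_vectors(phrase, corpus)
print(vectors)
print(closest_pair(vectors))
```

Note that nearness of vectors alone does not make two documents relevant to the phrase: a document containing none of the phrase words maps to the zero vector, so a significance estimate per document, as the abstract proposes, is still needed on top of the distances.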

