On the unsupervised analysis of domain-specific Chinese texts

2016 · Vol 113 (22) · pp. 6154-6159
Author(s): Ke Deng, Peter K. Bol, Kate J. Li, Jun S. Liu

With the growing availability of digitized text data both publicly and privately, there is a great need for effective computational tools to automatically extract information from texts. Because the Chinese language differs most significantly from alphabet-based languages in not specifying word boundaries, most existing Chinese text-mining methods require a prespecified vocabulary and/or a large relevant training corpus, which may not be available in some applications. We introduce an unsupervised method, top-down word discovery and segmentation (TopWORDS), for simultaneously discovering and segmenting words and phrases from large volumes of unstructured Chinese texts, and propose ways to order discovered words and conduct higher-level context analyses. TopWORDS is particularly useful for mining online and domain-specific texts where the underlying vocabulary is unknown or the texts of interest differ significantly from available training corpora. When outputs from TopWORDS are fed into context analysis tools such as topic modeling, word embedding, and association pattern finding, the results are as good as or better than those obtained from the outputs of a supervised segmentation method.
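TopWORDS learns its word probabilities by EM over an overcomplete candidate dictionary; setting that learning step aside, the segmentation half of the idea can be sketched as a unigram dynamic-programming (Viterbi-style) pass over unsegmented characters. The vocabulary and probabilities below are invented for illustration, not output of the actual method.

```python
# Toy unigram segmenter: given word probabilities (which TopWORDS would
# estimate by EM from raw text), find the most likely segmentation of a
# boundary-free character string by dynamic programming.
import math

def segment(text, word_probs, max_len=4):
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)   # best[i] = (log-prob, split point)
    best[0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            w = text[j:i]
            p = word_probs.get(w)
            if p is not None and best[j][0] + math.log(p) > best[i][0]:
                best[i] = (best[j][0] + math.log(p), j)
    # Backtrack through the recorded split points.
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return words[::-1]

probs = {"中国": 0.3, "人民": 0.3, "中": 0.05, "国": 0.05, "人": 0.05, "民": 0.05}
print(segment("中国人民", probs))  # ['中国', '人民']
```

The two-character words win because one probability of 0.3 beats the product of two probabilities of 0.05 each.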

Entropy · 2020 · Vol 22 (3) · pp. 275
Author(s): Igor A. Bessmertny, Xiaoxi Huang, Aleksei V. Platonov, Chuqiao Yu, Julia A. Koroleva

Search engines are able to find documents containing patterns from a query. This approach works well for alphabetic languages such as English; Chinese, however, is highly context-dependent. A significant problem in Chinese text processing is the absence of blanks between words, so the text must be segmented into words before any other action. Algorithms for Chinese word segmentation should consider context; that is, the segmentation of a word depends on the surrounding ideograms. As existing segmentation algorithms are imperfect, we consider an approach that builds the context from all possible n-grams surrounding the query words. This paper proposes a quantum-inspired approach to ranking Chinese text documents by their relevance to a query. In particular, the approach uses Bell’s test, which measures the quantum entanglement of two words within a context. The contexts of words are built using the hyperspace analogue to language (HAL) algorithm. Experiments performed in three domains demonstrate that the proposed approach provides acceptable results.
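The Bell-test scoring itself is not reproduced here, but the HAL construction the approach builds on is simple: each word accumulates co-occurrence counts with the words in a sliding window before it, weighted so that nearer neighbors count more. A minimal sketch, with an illustrative English token stream in place of segmented Chinese text:

```python
# HAL (hyperspace analogue to language) co-occurrence sketch:
# weight for a neighbor at distance d within a window is window - d + 1.
from collections import defaultdict

def hal_matrix(tokens, window=5):
    m = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i - d < 0:
                break
            m[w][tokens[i - d]] += window - d + 1
    return m

toks = "the cat sat on the mat".split()
m = hal_matrix(toks, window=2)
print(dict(m["sat"]))   # {'cat': 2, 'the': 1}
```

The row and column vectors of this matrix then serve as the word contexts on which any downstream correlation measure operates.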


Author(s): Yanqing Sun, Jianwei Zhang, Marlene Scardamalia

Online discourse from a class of 22 students (11 boys and 11 girls) was analysed to assess advances in conceptual understanding and literacy. The students worked over a two-year period (Grades 3-4), during which they contributed notes to an online Knowledge Building environment—Knowledge Forum®. Contributions revealed that both boys and girls produced a substantial amount of text and graphics, and that their written texts incorporated an increasing proportion of less-frequent, advanced words, including academic vocabulary and domain-specific words from grade levels higher than their own. Brief accounts of classroom discourse indicate how deep understanding and vocabulary growth mutually support each other in online and offline exchanges. The gender differences that were observed show boys doing slightly better than girls, suggesting that Knowledge Building has the potential to help boys overcome weaknesses in literacy.
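The increasing proportion of less-frequent, advanced words in the students' writing can be quantified with a simple lexical-sophistication ratio: the share of tokens falling outside a high-frequency word list. The word list below is a tiny illustrative stand-in for the graded lists such studies actually use.

```python
# Toy lexical-sophistication measure: fraction of tokens not found
# in a reference list of common (high-frequency) words.
def advanced_ratio(tokens, common_words):
    tokens = [t.lower() for t in tokens]
    advanced = [t for t in tokens if t not in common_words]
    return len(advanced) / len(tokens)

common = {"the", "a", "is", "of", "and", "we", "think"}
text = "we think the hypothesis explains the phenomenon".split()
print(round(advanced_ratio(text, common), 2))  # 0.43
```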


Author(s): Yuan Lin, Hongzhi Yu, Fucheng Wan, Tao Xu

1998 · Vol 59 · pp. 57-65
Author(s): Marianne Hermans

The results of the pilot study reported in this article indicate that supplementing biology lessons with children's books does not diminish reading-achievement test scores, and that there seems to be an advantage in domain-specific word knowledge. For 14 weeks, the time normally spent on unsustained silent reading in class was filled by reading on the particular subjects being discussed in the biology lessons. The basic research assumption was that reading various texts on the same subject would not only positively affect the children's knowledge of that subject but would also improve their reading skills and their attitudes towards reading. The experimental group scored significantly better than their peer group on a domain-specific vocabulary test, indicating that the books served as stepping stones for building mental knowledge structures. Tests of the other variables, such as reading skills, yielded no significant differences between the groups. However, post-hoc analysis showed an advantage for pupils from lower social groups: their attitude towards reading improved considerably, in which respect they differed significantly from their peers. The results seem to confirm ideas expressed in the international literature on content-area reading and in aspects of schema theory. By reading the books in combination with the biology lessons, certain schemata could be activated that enable the pupils to better understand the new information and store it firmly in memory. The redundancy of important words appearing in various contexts is a determinant of word knowledge.


2019 · Vol 9 (11) · pp. 2347
Author(s): Hannah Kim, Young-Seob Jeong

As the amount of textual data increases exponentially, it becomes ever more important to develop models that analyze text automatically. Texts may carry various labels such as gender, age, country, and sentiment, and using such labels can benefit several industrial fields, so many studies of text classification have appeared. Recently, the Convolutional Neural Network (CNN) has been adopted for the task of text classification and has shown quite successful results. In this paper, we propose convolutional neural networks for the task of sentiment classification. Through experiments with three well-known datasets, we show that employing consecutive convolutional layers is effective for relatively longer texts, and that our networks outperform other state-of-the-art deep learning models.
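The core idea of "consecutive convolutional layers" is simply stacking 1-D convolutions over the token-embedding sequence before pooling, so that later filters see wider spans of text. A minimal numpy forward pass, with shapes and filter sizes that are illustrative rather than the paper's actual settings:

```python
# Forward-pass sketch: two stacked 1-D convolutions over a sequence of
# token embeddings, ReLU nonlinearity, then global max pooling.
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w):
    # x: (seq_len, in_dim), w: (k, in_dim, out_dim) -> (seq_len-k+1, out_dim)
    k = w.shape[0]
    out = np.stack([np.tensordot(x[i:i + k], w, axes=([0, 1], [0, 1]))
                    for i in range(x.shape[0] - k + 1)])
    return np.maximum(out, 0.0)  # ReLU

x = rng.normal(size=(20, 8))          # 20 tokens, 8-dim embeddings
w1 = rng.normal(size=(3, 8, 16))      # first convolutional layer
w2 = rng.normal(size=(3, 16, 16))     # second, consecutive layer
h = conv1d(conv1d(x, w1), w2)         # (16, 16): shorter, deeper sequence
features = h.max(axis=0)              # global max pooling -> (16,)
print(features.shape)  # (16,)
```

A classifier head (e.g., a dense layer with softmax) would consume `features`; each stacked layer of width 3 grows the receptive field, which is why depth helps on longer texts.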


Author(s): Chuen-Min Huang, Mei-Chen Wu, Ching-Che Chang

Misspellings and misconceptions resulting from similar pronunciations appear frequently in Chinese texts. Without careful proofreading, the situation worsens even with the help of a Chinese input editor. The quality of Chinese writing would be enhanced if an effective automatic error detection and correction mechanism were embedded in text editors, releasing much of the manpower now devoted to proofreading. Until recently, research on automatic error detection and correction of Chinese text has faced many challenges and has performed poorly compared with work on Western text. In view of this prominent phenomenon in Chinese writing, this study proposes a learning model based on Chinese phonemic alphabets. The experimental results demonstrate that the model is effective in finding misspellings and further improves detection and correction rates.
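The phoneme-based intuition can be shown with a toy homophone check: characters sharing a pronunciation are confusion candidates, and a misspelled word is corrected by swapping in a homophone that yields an in-vocabulary word. The pinyin table and vocabulary below are tiny illustrative samples, not the study's actual resources or learning model.

```python
# Toy homophone-substitution corrector. 在/再 are both pronounced "zai",
# so the non-word 在见 suggests the real word 再见 ("goodbye").
PINYIN = {"在": "zai", "再": "zai", "见": "jian", "现": "xian"}
VOCAB = {"再见", "现在"}

def suggest(word):
    """Return in-vocabulary words reachable by swapping one homophone."""
    out = set()
    for i, ch in enumerate(word):
        for alt, py in PINYIN.items():
            if alt != ch and py == PINYIN.get(ch):
                cand = word[:i] + alt + word[i + 1:]
                if cand in VOCAB:
                    out.add(cand)
    return out

print(suggest("在见"))  # {'再见'}
```

A real system would rank candidates with a learned language model rather than accept any dictionary hit.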


Author(s):  
Pratiksha Bongale

Today’s world is largely data-driven, and Machine Learning and Data Mining strategies are put to use to cope with the humongous amounts of data. Traditional ML approaches presume that a model is tested on a dataset drawn from the same domain as its training data. Nevertheless, some real-world situations require machines to provide good results with very little domain-specific training data. This creates room for machines capable of predicting accurately after being trained on easily found data; Transfer Learning is the key to it. It is the scientific art of applying the knowledge gained while learning one task to another task that is similar to it in some way. This article focuses on building a model capable of separating text data into two classes, spam and non-spam, using BERT’s pre-trained model (bert-base-uncased). This pre-trained model was trained on Wikipedia and Book Corpus data, and the goal of this paper is to highlight the model’s ability to transfer the knowledge learned during that training to the task of classifying spam texts from the rest.
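Actually fine-tuning bert-base-uncased requires the Hugging Face transformers library and a model download; as a dependency-free illustration of the same transfer pattern — a frozen feature extractor reused across tasks, with only a small head trained on the new data — here is a toy sketch. The hashing "encoder", dataset, and dimensions are invented for the example and stand in for BERT, not replicate it.

```python
# Transfer-pattern toy: the "encoder" is fixed (never updated), and only
# a logistic-regression head is trained on the target (spam) task.
import numpy as np
import zlib

def frozen_features(text, dim=128):
    """Stand-in for a pretrained encoder: deterministic hashed bag of words."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[zlib.crc32(tok.encode()) % dim] += 1.0
    return v

texts = ["win a free prize now", "free cash click now",
         "meeting moved to friday", "see you at lunch"]
labels = np.array([1, 1, 0, 0])          # 1 = spam
X = np.stack([frozen_features(t) for t in texts])

w, b = np.zeros(X.shape[1]), 0.0         # only the head is trained
for _ in range(500):                     # plain gradient descent
    p = 1 / (1 + np.exp(-(X @ w + b)))
    g = p - labels
    w -= 0.1 * X.T @ g / len(labels)
    b -= 0.1 * g.mean()

preds = (1 / (1 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print(preds.tolist())  # [1, 1, 0, 0]
```

With a real pretrained encoder the features already encode linguistic knowledge, which is what lets a small labeled spam set suffice.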


2020
Author(s): Hegler C. Tissot, Lucas A. Pedebos

Miscarriages are the most common type of pregnancy loss, mostly occurring in the first 12 weeks of pregnancy due to known factors of different natures. Pregnancy risk assessment aims to quantify evidence in order to reduce such maternal morbidities during pregnancy, and personalized decision support systems are the cornerstone of high-quality, patient-centered care that improves diagnosis, treatment selection, and risk assessment. However, the increasing number of patient-level observations and data sparsity require more effective forms of representing clinical knowledge in order to encode known information in a way that enables inference and reasoning. Whereas knowledge embedding representation has been widely explored on open-domain data, there have been few efforts to apply it in the clinical domain. In this study, we discuss differences among multiple embedding strategies, and we demonstrate how these methods can assist in clinical risk assessment of miscarriage both before and, especially, in the earlier pregnancy stages. Our experiments show that simple knowledge embedding approaches that utilize domain-specific metadata perform better than complex embedding strategies, although both improve results relative to a population probabilistic baseline in AUPRC, F1-score, and a proposed normalized version of these evaluation metrics that better reflects accuracy on unbalanced datasets.
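Knowledge embedding scores the plausibility of (head, relation, tail) triples from learned vectors. The study compares several strategies; as a generic illustration only, here is a TransE-style scorer, in which a triple is plausible when head + relation ≈ tail. The entities, relation, and vectors below are invented for the example and are not the paper's model or data.

```python
# TransE-style triple scoring: score = -||h + r - t||, so higher
# (closer to zero) means a more plausible triple.
import numpy as np

emb = {
    "patient_42":  np.array([0.2, 0.9]),
    "smoking":     np.array([0.8, 0.2]),
    "has_risk":    np.array([-0.1, 0.1]),
    "miscarriage": np.array([0.7, 0.3]),
}

def transe_score(h, r, t):
    return -np.linalg.norm(emb[h] + emb[r] - emb[t])

s1 = transe_score("smoking", "has_risk", "miscarriage")
s2 = transe_score("patient_42", "has_risk", "smoking")
print(s1 > s2)  # True: the first triple fits these embeddings better
```

Ranking candidate tails by such scores is one way an embedding model contributes to patient-level risk assessment.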


2018 · Vol 72 (4) · pp. 1021-1058
Author(s): Stephan Peter Bumbacher

Abstract Sinology, as far as textual criticism is concerned, is still in its infancy compared with, e.g., New Testament, classical Greek, or European medieval studies. Whereas virtually every ancient Greek, old English, or early German text – to name but a few – has been the subject of text-critical scrutiny, in many cases even since Renaissance times, the same does not hold true for Chinese works. In the absence of early manuscripts on which they could base themselves, modern editions of classical Chinese texts usually take as their starting point the earliest extant printed versions, which quite often date from Song times and are thus separated by many centuries from the no longer available originals. However, quite often testimonies of ancient texts exist as quotations in works that considerably predate the first printed versions of the texts in question. In view of this fact, virtually every classical Chinese text needs to be systematically re-examined and critically edited by taking into account every available explicit as well as implicit quotation. As the received version of the Zhuang zi 莊子 (Master Zhuang), a text whose origins may lie in the third century BCE, ultimately goes back to Guo Xiang’s 郭象 (ob. 312) editorial activities, and as Ge Hong 葛洪 (283–343) was an author active at about the same time, there is a chance that a pre-Guo Xiang version may have been available to him. Therefore, as a case study, this paper examines the explicit as well as implicit Zhuang zi quotations to be found within Ge Hong’s works, in order to examine this possibility.
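The collation described above rests on locating quotations of one text inside another; a simple mechanical starting point is to list the character n-grams two texts share, then inspect the hits philologically. The sample strings below are invented miniatures; real work would compare the full Zhuang zi against the Ge Hong corpus.

```python
# Candidate-quotation finder: character n-grams common to two texts.
def shared_ngrams(a, b, n=4):
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    return grams_a & grams_b

received = "北冥有魚其名為鯤鯤之大不知其幾千里也"
quoting  = "葛洪引曰北冥有魚其名為鯤云云"
hits = shared_ngrams(received, quoting)
print(len(hits))  # 5 overlapping 4-grams from the shared 8-character span
```

Runs of overlapping hits mark a likely quotation; divergences inside such runs are exactly the variant readings a critical edition must weigh.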

