word segmentation Latest Research Papers

Does morphological complexity affect word segmentation? Evidence from computational modeling

Cognition ◽

10.1016/j.cognition.2021.104960 ◽

2022 ◽

Vol 220 ◽

pp. 104960

Author(s):

Georgia Loukatou ◽

Sabine Stoll ◽

Damian Blasi ◽

Alejandrina Cristia

Keyword(s):

Computational Modeling ◽

Word Segmentation ◽

Morphological Complexity

A Statistical Language Model for Pre-Trained Sequence Labeling: A Case Study on Vietnamese

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3483524 ◽

2022 ◽

Vol 21 (3) ◽

pp. 1-21

Author(s):

Xianwen Liao ◽

Yongzhong Huang ◽

Peng Yang ◽

Lei Chen

Keyword(s):

Language Model ◽

Dynamic Programming Algorithm ◽

Named Entity Recognition ◽

Word Segmentation ◽

Training Data ◽

Entity Recognition ◽

Divide And Conquer ◽

Programming Algorithm ◽

Statistical Language Model ◽

Sequence Labeling

By defining the computable word segmentation unit and studying its probability characteristics, we establish an unsupervised statistical language model (SLM) for a new pre-trained sequence labeling framework in this article. The proposed SLM is an optimization model, and its objective is to maximize the total binding force of all candidate word segmentation units in sentences under the condition of no annotated datasets and vocabularies. To solve SLM, we design a recursive divide-and-conquer dynamic programming algorithm. By integrating SLM with the popular sequence labeling models, Vietnamese word segmentation, part-of-speech tagging and named entity recognition experiments are performed. The experimental results show that our SLM can effectively promote the performance of sequence labeling tasks. Just using less than 10% of training data and without using a dictionary, the performance of our sequence labeling framework is better than the state-of-the-art Vietnamese word segmentation toolkit VnCoreNLP on the cross-dataset test. SLM has no hyper-parameter to be tuned, and it is completely unsupervised and applicable to any other analytic language. Thus, it has good domain adaptability.

How much does prosody help word segmentation? A simulation study on infant-directed speech

Cognition ◽

10.1016/j.cognition.2021.104961 ◽

2022 ◽

Vol 219 ◽

pp. 104961

Author(s):

Bogdan Ludusan ◽

Alejandrina Cristia ◽

Reiko Mazuka ◽

Emmanuel Dupoux

Keyword(s):

Simulation Study ◽

Word Segmentation

A Transformer-based Neural Model for Chinese Word Segmentation and Part-of-Speech Tagging

IJARCCE ◽

10.17148/ijarcce.2021.101201 ◽

2021 ◽

Vol 10 (12) ◽

Author(s):

Xinxin Li

Keyword(s):

Neural Model ◽

Word Segmentation ◽

Chinese Word ◽

Chinese Word Segmentation ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Speech Tagging

Natural Infant-Directed Speech Facilitates Neural Tracking of Prosody

10.31234/osf.io/dnkg2 ◽

2021 ◽

Author(s):

Katharina Menn ◽

Christine Michel ◽

Lars Meyer ◽

Stefanie Hoehl ◽

Claudia Männel

Keyword(s):

Language Acquisition ◽

Frequency Band ◽

Low Frequency ◽

Word Segmentation ◽

Stress Rate ◽

Continuous Speech ◽

Novel Objects ◽

Speech Envelope ◽

Natural Interactions ◽

As If

Infants prefer to be addressed with infant-directed speech (IDS). IDS benefits language acquisition through amplified low-frequency amplitude modulations. It has been reported that this amplification increases electrophysiological tracking of IDS compared to adult-directed speech (ADS). It is still unknown which particular frequency band triggers this effect. Here, we compare tracking at the rates of syllables and prosodic stress, which are both critical to word segmentation and recognition. In mother-infant dyads (n=30), mothers described novel objects to their 9-month-olds while infants' EEG was recorded. For IDS, mothers were instructed to speak to their children as they typically do, while for ADS, mothers described the objects as if speaking with an adult. Phonetic analyses confirmed that pitch features were more prototypically infant-directed in the IDS-condition compared to the ADS-condition. Neural tracking of speech was assessed by speech-brain coherence, which measures the synchronization between speech envelope and EEG. Results revealed significant speech-brain coherence at both syllabic and prosodic stress rates, indicating that infants track speech in IDS and ADS at both rates. We found significantly higher speech-brain coherence for IDS compared to ADS in the prosodic stress rate but not the syllabic rate. This indicates that the IDS benefit arises primarily from enhanced prosodic stress. Thus, neural tracking is sensitive to parents’ speech adaptations during natural interactions, possibly facilitating higher-level inferential processes such as word segmentation from continuous speech.

Does morphological complexity affect word segmentation? Evidence from computational modeling

10.31234/osf.io/hzmyf ◽

2021 ◽

Author(s):

Georgia Loukatou ◽

Sabine Stoll ◽

Damián Ezequiel Blasi ◽

Alejandrina Cristia

Keyword(s):

Computational Modeling ◽

Word Segmentation ◽

Inflectional Morphology ◽

Computational Studies ◽

Morphological Complexity ◽

Artificial Languages ◽

Segmentation Algorithms ◽

Word Boundaries ◽

Continuous Stream ◽

Distributional Cues

How can infants detect where words or morphemes start and end in the continuous stream of speech? Previous computational studies have investigated this question mainly for English, where morpheme and word boundaries are often isomorphic. Yet in many languages, words are often multimorphemic, such that word and morpheme boundaries do not align. Our study employed corpora of two languages that differ in the complexity of inflectional morphology, Chintang (Sino-Tibetan) and Japanese (in Experiment 1), as well as corpora of artificial languages ranging in morphological complexity, as measured by the ratio and distribution of morphemes per word (in Experiments 2 and 3). We used two baselines and three conceptually diverse word segmentation algorithms, two of which rely purely on sublexical information using distributional cues, and one that builds a lexicon. The algorithms’ performance was evaluated on both word- and morpheme-level representations of the corpora.Segmentation results were better for the morphologically simpler languages than for the morphologically more complex languages, in line with the hypothesis that languages with greater inflectional complexity could be more difficult to segment into words. We further show that the effect of morphological complexity is relatively small, compared to that of algorithm and evaluation level. We therefore recommend that infant researchers look for signatures of the different segmentation algorithms and strategies, before looking for differences in infant segmentation landmarks across languages varying in complexity.

Linguistic Annotation of Translated Chinese Texts: Coordinating Theory, Algorithms and Data

Journal of Linguistics/Jazykovedný casopis ◽

10.2478/jazcas-2021-0054 ◽

2021 ◽

Vol 72 (2) ◽

pp. 590-602

Author(s):

Kirill I. Semenov ◽

Armine K. Titizian ◽

Aleksandra O. Piskunova ◽

Yulia O. Korotkova ◽

Alena D. Tsvetkova ◽

...

Keyword(s):

Chinese Text ◽

Text Processing ◽

Word Segmentation ◽

The Other ◽

Pos Tagging ◽

Theoretical Comparison ◽

Linguistic Annotation ◽

Corpus Data ◽

Chinese Texts ◽

The One

Abstract The article tackles the problems of linguistic annotation in the Chinese texts presented in the Ruzhcorp – Russian-Chinese Parallel Corpus of RNC, and the ways to solve them. Particular attention is paid to the processing of Russian loanwords. On the one hand, we present the theoretical comparison of the widespread standards of Chinese text processing. On the other hand, we describe our experiments in three fields: word segmentation, grapheme-to-phoneme conversion, and PoS-tagging, on the specific corpus data that contains many transliterations and loanwords. As a result, we propose the preprocessing pipeline of the Chinese texts, that will be implemented in Ruzhcorp.

BYANJON: A Ground Truth Preparation System for Online Handwritten Bangla Documents

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3464379 ◽

2021 ◽

Vol 20 (6) ◽

pp. 1-16

Author(s):

Shibaprasad Sen ◽

Ankan Bhattacharyya ◽

Ram Sarkar ◽

Kaushik Roy

Keyword(s):

Extraction Procedure ◽

Ground Truth ◽

Word Segmentation ◽

Text Line ◽

Line Extraction ◽

Manual Intervention ◽

Ground Truth Generation ◽

Class Labels ◽

Text Line Extraction ◽

Ground Truth Information

The work reported in this article deals with the ground truth generation scheme for online handwritten Bangla documents at text-line, word, and stroke levels. The aim of the proposed scheme is twofold: firstly, to build a document level database so that future researchers can use the database to do research in this field. Secondly, the ground truth information will help other researchers to evaluate the performance of their algorithms developed for text-line extraction, word extraction, word segmentation, stroke recognition, and word recognition. The reported ground truth generation scheme starts with text-line extraction from the online handwritten Bangla documents, then words extraction from the text-lines, and finally segmentation of those words into basic strokes. After word segmentation, the basic strokes are assigned appropriate class labels by using modified distance-based feature extraction procedure and the MLP ( Multi-layer Perceptron ) classifier. The Unicode for the words are then generated from the sequence of stroke labels. XML files are used to store the stroke, word, and text-line levels ground truth information for the corresponding documents. The proposed system is semi-automatic and each step such as text-line extraction, word extraction, word segmentation, and stroke recognition has been implemented by using different algorithms. Thus, the proposed ground truth generation procedure minimizes huge manual intervention by reducing the number of mouse clicks required to extract text-lines, words from the document, and segment the words into basic strokes. The integrated stroke recognition module also helps to minimize the manual labor needed to assign appropriate stroke labels. The freely available and can be accessed at https://byanjon.herokuapp.com/ .

Traditional Chinese Medicine Text Similarity Calculation Model Based on the Bidirectional Temporal Siamese Network

Evidence-based Complementary and Alternative Medicine ◽

10.1155/2021/2337924 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Jigen Luo ◽

Wangping Xiong ◽

Jianqiang Du ◽

Yingfeng Liu ◽

Jianwen Li ◽

...

Keyword(s):

Chinese Medicine ◽

Traditional Chinese Medicine ◽

Knowledge Integration ◽

Short Term Memory ◽

Calculation Model ◽

Word Segmentation ◽

Text Similarity ◽

Model Based ◽

Siamese Network ◽

Similarity Calculation

The text similarity calculation plays a crucial role as the core work of artificial intelligence commercial applications such as traditional Chinese medicine (TCM) auxiliary diagnosis, intelligent question and answer, and prescription recommendation. However, TCM texts have problems such as short sentence expression, inaccurate word segmentation, strong semantic relevance, high feature dimension, and sparseness. This study comprehensively considers the temporal information of sentence context and proposes a TCM text similarity calculation model based on the bidirectional temporal Siamese network (BTSN). We used the enhanced representation through knowledge integration (ERNIE) pretrained language model to train character vectors instead of word vectors and solved the problem of inaccurate word segmentation in TCM. In the Siamese network, the traditional fully connected neural network was replaced by a deep bidirectional long short-term memory (BLSTM) to capture the contextual semantics of the current word information. The improved similarity BLSTM was used to map the sentence that is to be tested into two sets of low-dimensional numerical vectors. Then, we performed similarity calculation training. Experiments on the two datasets of financial and TCM show that the performance of the BTSN model in this study was better than that of other similarity calculation models. When the number of layers of the BLSTM reached 6 layers, the accuracy of the model was the highest. This verifies that the text similarity calculation model proposed in this study has high engineering value.

Word Segmentation of Offline Handwritten Bangla Text Lines

10.1007/978-981-16-5207-3_46 ◽

2021 ◽

pp. 551-560

Author(s):

Komal Agarwal ◽

Akshat Mantry ◽

Chayan Halder

Keyword(s):

Word Segmentation

word segmentation
Recently Published Documents

TOTAL DOCUMENTS

H-INDEX

Does morphological complexity affect word segmentation? Evidence from computational modeling

A Statistical Language Model for Pre-Trained Sequence Labeling: A Case Study on Vietnamese

How much does prosody help word segmentation? A simulation study on infant-directed speech

A Transformer-based Neural Model for Chinese Word Segmentation and Part-of-Speech Tagging

Natural Infant-Directed Speech Facilitates Neural Tracking of Prosody

Does morphological complexity affect word segmentation? Evidence from computational modeling

Linguistic Annotation of Translated Chinese Texts: Coordinating Theory, Algorithms and Data

BYANJON: A Ground Truth Preparation System for Online Handwritten Bangla Documents

Traditional Chinese Medicine Text Similarity Calculation Model Based on the Bidirectional Temporal Siamese Network

Word Segmentation of Offline Handwritten Bangla Text Lines

Export Citation Format

word segmentationRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

Does morphological complexity affect word segmentation? Evidence from computational modeling

A Statistical Language Model for Pre-Trained Sequence Labeling: A Case Study on Vietnamese

How much does prosody help word segmentation? A simulation study on infant-directed speech

A Transformer-based Neural Model for Chinese Word Segmentation and Part-of-Speech Tagging

Natural Infant-Directed Speech Facilitates Neural Tracking of Prosody

Does morphological complexity affect word segmentation? Evidence from computational modeling

Linguistic Annotation of Translated Chinese Texts: Coordinating Theory, Algorithms and Data

BYANJON: A Ground Truth Preparation System for Online Handwritten Bangla Documents

Traditional Chinese Medicine Text Similarity Calculation Model Based on the Bidirectional Temporal Siamese Network

Word Segmentation of Offline Handwritten Bangla Text Lines

word segmentation
Recently Published Documents