MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language

Word segmentation is an essential task in automatic language processing for languages where there are no explicit word boundary markers, or where space-delimited orthographic words are too coarse-grained. In this paper we introduce the MiNgMatch Segmenter, a fast word segmentation algorithm that reduces the problem of identifying word boundaries to finding the shortest sequence of lexical n-grams matching the input text. To validate our method in a low-resource scenario involving extremely sparse data, we tested it with a small corpus of text in the critically endangered language of the Ainu people living in the northern parts of Japan. Furthermore, we performed a series of experiments comparing our algorithm with systems utilizing state-of-the-art lexical n-gram-based language modelling techniques (namely, a Stupid Backoff model and a model with modified Kneser-Ney smoothing), as well as a neural model performing word segmentation as character sequence labelling. The experimental results demonstrate the high performance of our algorithm, comparable with the other best-performing models. Given its low computational cost and competitive results, we believe that the proposed approach could be extended to other languages, and possibly also to other Natural Language Processing tasks, such as speech recognition.
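
The core idea stated in the abstract, finding the shortest sequence of lexical n-grams that matches the input, can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: the function name `segment`, the `ngrams` mapping (unsegmented string to its segmented form), the `max_len` cap, and the toy English data are all assumptions made for illustration, and the abstract does not say how ties between equally short sequences are resolved.

```python
from typing import Dict, List, Optional, Tuple


def segment(text: str, ngrams: Dict[str, str], max_len: int = 30) -> Optional[List[str]]:
    """Return the shortest sequence of known n-grams whose unsegmented
    forms concatenate to `text`, or None if no full match exists."""
    n = len(text)
    # best[i] = (number of n-grams used, start index of the last n-gram) for text[:i]
    best: List[Optional[Tuple[int, int]]] = [None] * (n + 1)
    best[0] = (0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] is None or text[start:end] not in ngrams:
                continue
            count = best[start][0] + 1
            if best[end] is None or count < best[end][0]:
                best[end] = (count, start)
    if best[n] is None:
        return None
    # Walk the back-pointers from the end of the text to recover the n-grams.
    pieces: List[str] = []
    pos = n
    while pos > 0:
        start = best[pos][1]
        pieces.append(ngrams[text[start:pos]])
        pos = start
    return pieces[::-1]


# Hypothetical toy lexicon: keys are n-grams with spaces removed.
TOY_NGRAMS = {
    "the": "the",
    "cat": "cat",
    "sat": "sat",
    "thecat": "the cat",
    "catsat": "cat sat",
}
result = segment("thecatsat", TOY_NGRAMS)
```

Here `result` covers the input with two n-grams instead of three unigrams, and joining its elements with spaces yields the segmented text. An input that cannot be fully covered by the lexicon returns `None`; a practical system would need a fallback for that case.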

Improving accuracy of Part-of-Speech (POS) tagging using hidden markov model and morphological analysis for Myanmar Language

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v10i2.pp2023-2030 ◽

2020 ◽

Vol 10 (2) ◽

pp. 2023

Author(s):

Dim Lam Cing ◽

Khin Mar Soe

Keyword(s):

Language Processing ◽

High Performance ◽

Markov Models ◽

Hidden Markov ◽

Morphological Structure ◽

Word Segmentation ◽

Pos Tagging ◽

Part Of Speech ◽

Improving Accuracy ◽

Pos Tagger

In Natural Language Processing (NLP), word segmentation and Part-of-Speech (POS) tagging are fundamental tasks. POS information is also necessary in preprocessing for NLP applications such as machine translation (MT) and information retrieval (IR). Currently, many research efforts develop word segmentation and POS tagging separately, using different methods to achieve high performance and accuracy. For the Myanmar language, there are also separate word segmenters and POS taggers based on statistical approaches such as Neural Networks (NN) and Hidden Markov Models (HMMs). However, because of the Myanmar language's complex morphological structure, the out-of-vocabulary (OOV) problem still exists. To avoid errors and improve segmentation by utilizing POS information, segmentation and tagging should be performed at the same time. The main goal of developing a POS tagger for any language is to improve tagging accuracy and remove ambiguity in sentences that arises from language structure. This paper focuses on developing a word segmenter and POS tagger for the Myanmar language, and compares separate word segmentation and POS tagging with joint word segmentation and POS tagging.
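
As a rough illustration of HMM-based tagging, the sketch below decodes the most probable tag sequence with the Viterbi algorithm. It is a generic first-order HMM decoder, not the tagger described in the paper: the function name, the probability tables, the English toy words, and the `floor` value used as crude smoothing for unseen (OOV) words are all illustrative assumptions.

```python
import math
from typing import Dict, List


def viterbi(words: List[str], tags: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]],
            floor: float = 1e-6) -> List[str]:
    """Most probable tag sequence under a first-order HMM, in log space."""
    if not words:
        return []

    def lp(p: float) -> float:
        # Floor probabilities so unseen words or transitions do not give log(0).
        return math.log(max(p, floor))

    # scores[i][t]: best log-probability of any tag path ending in tag t at word i
    scores = [{t: lp(start_p.get(t, 0.0)) + lp(emit_p[t].get(words[0], 0.0))
               for t in tags}]
    back: List[Dict[str, str]] = []
    for w in words[1:]:
        prev = scores[-1]
        cur: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for t in tags:
            best_prev = max(tags, key=lambda s: prev[s] + lp(trans_p[s].get(t, 0.0)))
            cur[t] = (prev[best_prev] + lp(trans_p[best_prev].get(t, 0.0))
                      + lp(emit_p[t].get(w, 0.0)))
            ptr[t] = best_prev
        scores.append(cur)
        back.append(ptr)
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: scores[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical toy model with two tags: N (noun) and V (verb).
TAGS = ["N", "V"]
START = {"N": 0.8, "V": 0.2}
TRANS = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
EMIT = {"N": {"dogs": 0.6, "bark": 0.1}, "V": {"dogs": 0.05, "bark": 0.7}}
tagged = viterbi(["dogs", "bark"], TAGS, START, TRANS, EMIT)
```

A joint segmentation-and-tagging system, as discussed in the abstract, would search over segmentations and tags together; this sketch covers only the tagging step for an already segmented sentence.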

Improving Basic Natural Language Processing Tools for the Ainu Language

Information ◽

10.3390/info10110329 ◽

2019 ◽

Vol 10 (11) ◽

pp. 329

Author(s):

Karol Nowakowski ◽

Michal Ptaszynski ◽

Fumito Masui ◽

Yoshio Momouchi

Keyword(s):

Language Processing ◽

Positive Impact ◽

Word Segmentation ◽

Data Set ◽

Part Of Speech Tagging ◽

Endangered Language ◽

Part Of Speech ◽

Overall Performance ◽

Speech Tagging ◽

Northern Japan

Ainu is a critically endangered language spoken by the native inhabitants of northern Japan. This paper describes our research aimed at the development of technology for automatic processing of text in Ainu. In particular, we improved the existing tools for normalizing old transcriptions, word segmentation, and part-of-speech tagging. In the experiments we applied two Ainu language dictionaries from different domains (literary and colloquial) and created a new data set by combining them. The experiments revealed that expanding the lexicon had a positive impact on the overall performance of our tools, especially with test data unrelated to any of the training sets used.

A Feature-Enriched Neural Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/553 ◽

2017 ◽

Cited By ~ 3

Author(s):

Xinchi Chen ◽

Xipeng Qiu ◽

Xuanjing Huang

Keyword(s):

Language Processing ◽

Neural Model ◽

Word Segmentation ◽

Chinese Word ◽

Chinese Word Segmentation ◽

Neural Network Models ◽

Long Distance ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Speech Tagging

Recently, neural network models for natural language processing tasks have attracted increasing attention for their ability to alleviate the burden of manual feature engineering. However, previous neural models cannot extract complicated feature compositions as well as traditional methods with discrete features can. In this work, we propose a feature-enriched neural model for the joint Chinese word segmentation and part-of-speech tagging task. Specifically, to simulate the feature templates of traditional discrete-feature-based models, we use different filters to model complex compositional features with convolutional and pooling layers, and then utilize long-distance dependency information with a recurrent layer. Experimental results on five different datasets show the effectiveness of our proposed model.

Parameterization of Divalent Cations for Coarse-Grained Simulations

10.26434/chemrxiv.11881716 ◽

2020 ◽

Author(s):

Florencia Klein ◽

Daniela Cáceres-Rojas ◽

Monica Carrasco ◽

Juan Carlos Tapia ◽

Julio Caballero ◽

...

Keyword(s):

Molecular Dynamics ◽

Metal Ions ◽

Molecular Dynamics Simulations ◽

Divalent Cations ◽

Computational Cost ◽

Data Bank ◽

Coarse Grained ◽

Interaction Parameters ◽

Dynamics Simulations ◽

Dynamical Description

Although molecular dynamics simulations allow for the study of interactions among virtually all biomolecular entities, metal ions still pose significant challenges to achieving an accurate structural and dynamical description of many biological assemblies. This is particularly the case for coarse-grained (CG) models. Although the reduced computational cost of CG methods often makes them the technique of choice for the study of large biomolecular systems, the parameterization of metal ions is still very crude or simply not available for the vast majority of CG force fields. Here, we show that incorporating statistical data retrieved from the Protein Data Bank (PDB) to set specific Lennard-Jones interactions can produce structurally accurate CG molecular dynamics simulations. Using this simple approach, we provide a set of interaction parameters for Calcium, Magnesium, and Zinc ions, which cover more than 80% of the metal-bound structures reported in the PDB. Simulations performed using the SIRAH force field on several proteins and DNA systems show that using the present approach it is possible to obtain non-bonded interaction parameters that obviate the use of topological constraints.
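
To make the link between a structural statistic and a Lennard-Jones parameter concrete, the sketch below assumes the standard 12-6 functional form and assumes that a typical ion-ligand contact distance is placed at the energy minimum. Both choices are assumptions: the abstract does not state the functional form or how the PDB statistics are mapped to parameters. The function names and the 0.25 nm distance are purely illustrative.

```python
import math


def lj_energy(r: float, epsilon: float, sigma: float) -> float:
    """12-6 Lennard-Jones energy at separation r (same length unit as sigma)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def sigma_from_contact_distance(r_min: float) -> float:
    """Choose sigma so that the LJ minimum sits at r_min.

    For the 12-6 form the minimum is at r_min = 2**(1/6) * sigma.
    """
    return r_min / 2.0 ** (1.0 / 6.0)


# Hypothetical contact distance (nm) and well depth (arbitrary energy unit).
R_CONTACT = 0.25
EPSILON = 1.0
SIGMA = sigma_from_contact_distance(R_CONTACT)
energy_at_contact = lj_energy(R_CONTACT, EPSILON, SIGMA)
```

With this choice the energy at the contact distance equals the negative well depth, and the energy crosses zero at `r = SIGMA`.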

Emotion Recognition on Edge Devices: Training and Deployment

Sensors ◽

10.3390/s21134496 ◽

2021 ◽

Vol 21 (13) ◽

pp. 4496

Author(s):

Vlad Pandelea ◽

Edoardo Ragusa ◽

Tommaso Apicella ◽

Paolo Gastaldo ◽

Erik Cambria

Keyword(s):

Emotion Recognition ◽

Language Processing ◽

Computational Cost ◽

Sequential Learning ◽

High Quality ◽

Fast Training ◽

Online Sequential Learning ◽

And Performance ◽

Resource Constrained Devices ◽

Constrained Devices

Emotion recognition, among other natural language processing tasks, has greatly benefited from the use of large transformer models. Deploying these models on resource-constrained devices, however, is a major challenge due to their computational cost. In this paper, we show that the combination of large transformers, as high-quality feature extractors, and simple hardware-friendly classifiers based on linear separators can achieve competitive performance while allowing real-time inference and fast training. Various solutions including batch and Online Sequential Learning are analyzed. Additionally, our experiments show that latency and performance can be further improved via dimensionality reduction and pre-training, respectively. The resulting system is implemented on two types of edge device, namely an edge accelerator and two smartphones.

n-Gram Based Language Processing using Twitter Dataset to Identify COVID-19 Patients

Sustainable Cities and Society ◽

10.1016/j.scs.2021.103048 ◽

2021 ◽

pp. 103048

Author(s):

Nidal Nasser ◽

Lutful Karim ◽

Ahmed El Ouadrhiri ◽

Asmaa Ali ◽

Nargis Khan

Keyword(s):

Language Processing ◽

N Gram

Google Play Content Scraping and Knowledge Engineering using Natural Language Processing Techniques with the Analysis of User Reviews

Journal of Intelligent Systems ◽

10.1515/jisys-2019-0197 ◽

2020 ◽

Vol 30 (1) ◽

pp. 192-208 ◽

Cited By ~ 1

Author(s):

Hamza Aldabbas ◽

Abdullah Bajahzar ◽

Meshrif Alruily ◽

Ali Adil Qureshi ◽

Rana M. Amir Latif ◽

...

Keyword(s):

Logistic Regression ◽

Language Processing ◽

Mobile Application ◽

Knowledge Engineering ◽

Machine Learning Algorithms ◽

Application Development ◽

User Reviews ◽

N Gram ◽

Logistic Regression Algorithm ◽

Google Play

Maintaining a competitive edge and evaluating the requirements of a quality app are important in the mobile application market. User feedback on these applications plays an essential role in the mobile application development industry. The rapid growth of web technology has given people an opportunity to interact, express their reviews, and rate and share their feedback about applications. In this paper, we scraped 506,259 user reviews and application ratings from the Google Play Store across 14 different categories. Statistical results were measured using several common machine learning algorithms: Logistic Regression, Random Forest Classifier, and Multinomial Naïve Bayes. Accuracy, precision, recall, and F1 score were used to evaluate bigrams, trigrams, and n-grams, and the results of these algorithms were compared. Each algorithm is analyzed in turn and its results evaluated. We conclude that logistic regression is the best algorithm for review analysis of Google Play Store applications. The results were checked, and the accuracy of the logistic regression algorithm was assessed for classifying reviews into three classes: positive, negative, and neutral.
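
The two building blocks named in the abstract, word n-gram extraction and the precision, recall, and F1 metrics, can be sketched in a few lines of standard-library Python. The function names and the toy review and labels are hypothetical; the classifiers themselves (Logistic Regression, Random Forest, Multinomial Naïve Bayes) are not reproduced here.

```python
from typing import List, Sequence, Tuple


def word_ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous word n-grams of length n, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def precision_recall_f1(y_true: Sequence[str], y_pred: Sequence[str],
                        positive: str) -> Tuple[float, float, float]:
    """One-vs-rest precision, recall, and F1 for the class `positive`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical review text and labels for the three classes in the abstract.
review_tokens = "this app is great".split()
bigrams = word_ngrams(review_tokens, 2)
trigrams = word_ngrams(review_tokens, 3)

y_true = ["positive", "negative", "positive", "neutral"]
y_pred = ["positive", "positive", "negative", "neutral"]
scores = precision_recall_f1(y_true, y_pred, positive="positive")
```

For a three-class problem, the per-class scores would normally be averaged (for example macro-averaged) to obtain a single figure; which averaging the paper uses is not stated in the abstract.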

Efficient Embedded Decoding of Neural Network Language Models in a Machine Translation System

International Journal of Neural Systems ◽

10.1142/s0129065718500077 ◽

2018 ◽

Vol 28 (09) ◽

pp. 1850007

Author(s):

Francisco Zamora-Martinez ◽

Maria Jose Castro-Bleda

Keyword(s):

Neural Network ◽

Machine Translation ◽

Language Processing ◽

Traditional Approach ◽

Computational Cost ◽

Integrated Approach ◽

Language Models ◽

Translation System ◽

Neural Net ◽

Network Language

Neural Network Language Models (NNLMs) are a successful approach to Natural Language Processing tasks such as Machine Translation. In this work we introduce a Statistical Machine Translation (SMT) system that fully integrates NNLMs in the decoding stage, breaking with the traditional approach based on N-best list rescoring. The neural net models (both language models (LMs) and translation models) are fully coupled in the decoding stage, allowing them to influence translation quality more strongly. Computational issues were solved with a novel idea based on memorization and smoothing of the softmax constants to avoid their computation, which introduces a trade-off between LM quality and computational cost. These ideas were studied in a machine translation task with different combinations of neural networks used both as translation models and as target LMs, comparing phrase-based and n-gram-based systems. The integrated approach seems more promising for n-gram-based systems, even with non-full-quality NNLMs.
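
One possible reading of "memorization and smoothing of the softmax constants" is sketched below: the softmax normalization constant is precomputed for a set of contexts, and for any other context the constant of the longest cached suffix is reused instead of being recomputed. This is an interpretation, not the paper's method. The class name, the `logit_fn` interface, the suffix back-off rule, and the toy scoring function are all assumptions.

```python
import math
from typing import Callable, Dict, Iterable, Sequence, Tuple

Context = Tuple[str, ...]


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


class CachedSoftmaxLM:
    """Scores words with unnormalized logits and memorized softmax constants."""

    def __init__(self, logit_fn: Callable[[Context, str], float],
                 vocab: Sequence[str], cached_contexts: Iterable[Context]):
        self.logit_fn = logit_fn
        self.log_z: Dict[Context, float] = {}
        # The empty context is always cached so every lookup can terminate.
        for ctx in [()] + [tuple(c) for c in cached_contexts]:
            self.log_z[ctx] = log_sum_exp([logit_fn(ctx, w) for w in vocab])

    def log_prob(self, context: Sequence[str], word: str) -> float:
        ctx = tuple(context)
        lookup = ctx
        # Assumed smoothing rule: drop the oldest word until a cached constant is found.
        while lookup not in self.log_z:
            lookup = lookup[1:]
        # Only one logit is computed; the full softmax over the vocabulary is skipped.
        return self.logit_fn(ctx, word) - self.log_z[lookup]


def toy_logit(ctx: Context, word: str) -> float:
    """Hypothetical stand-in for a neural network's output-layer activation."""
    return 1.0 if ctx and ctx[-1] == "the" and word == "cat" else 0.0


VOCAB = ["the", "cat", "sat"]
lm = CachedSoftmaxLM(toy_logit, VOCAB, cached_contexts=[("the",)])
```

For cached contexts the probabilities are exactly normalized. For other contexts the reused constant is only an approximation in general, which is one way to picture the trade-off between LM quality and computational cost that the abstract mentions.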

You Only Traverse Twice: A YOTT Placement, Routing, and Timing Approach for CGRAs

ACM Transactions on Embedded Computing Systems ◽

10.1145/3477038 ◽

2021 ◽

Vol 20 (5s) ◽

pp. 1-25

Author(s):

Michael Canesche ◽

Westerley Carvalho ◽

Lucas Reis ◽

Matheus Oliveira ◽

Salles Magalhães ◽

...

Keyword(s):

Execution Time ◽

High Performance ◽

Coarse Grained ◽

Optimal Placement ◽

Greedy Heuristics ◽

High Quality ◽

Solution Quality ◽

Graph Traversal ◽

Trade Offs ◽

Graph Properties

Coarse-grained reconfigurable architecture (CGRA) mapping involves three main steps: placement, routing, and timing. The mapping is an NP-complete problem, and a common strategy is to decouple this process into its independent steps. This work focuses on the placement step, and its aim is to propose a technique that is both reasonably fast and leads to high-performance solutions. Furthermore, a near-optimal placement simplifies the following routing and timing steps. Exact methods cannot find placements in a reasonable execution time as input designs increase in size. Heuristic solutions include meta-heuristics, such as Simulated Annealing (SA), and fast and straightforward greedy heuristics based on graph traversal. However, as these approaches are probabilistic and have a large design space, it is not easy to provide both run-time efficiency and good solution quality. We propose a graph traversal heuristic that provides the best of both: high-quality placements similar to SA and the execution time of graph traversal approaches. Our placement introduces novel ideas based on a “you only traverse twice” (YOTT) approach that performs a two-step graph traversal. The first traversal generates annotated data to guide the second step, which greedily performs the placement, node by node, aided by the annotated data and target architecture constraints. We introduce three new concepts to implement this technique: I/O and reconvergence annotation, degree matching, and look-ahead placement. Our analysis of this approach explores the placement execution time/quality trade-offs. We point out insights on how to analyze graph properties during dataflow mapping. Our results show that YOTT is 60.6×, 9.7×, and 2.3× faster than a high-quality SA, bounding box SA VPR, and multi-single traversal placements, respectively. Furthermore, YOTT reduces the average wire length and the maximal FIFO size (additional timing requirement on CGRAs) to avoid delay mismatches in fully pipelined architectures.

An explicit method to calculate implicit spatial finite differences

Geophysics ◽

10.1190/geo2021-0001.1 ◽

2021 ◽

pp. 1-71

Author(s):

Hongwei Liu ◽

Yi Luo

Keyword(s):

Finite Difference ◽

High Performance ◽

Computational Cost ◽

New Method ◽

Seismic Exploration ◽

Explicit Method ◽

Lu Decomposition ◽

Implicit Methods ◽

Band Matrices ◽

Explicit Finite Difference

The finite-difference solution of the second-order acoustic wave equation is a fundamental algorithm in seismic exploration for seismic forward modeling, imaging, and inversion. Unlike the standard explicit finite difference (EFD) methods that usually suffer from the so-called "saturation effect", the implicit FD methods can obtain much higher accuracy with relatively short operator length. Unfortunately, these implicit methods are not widely used because band matrices need to be solved implicitly, which is not suitable for most high-performance computer architectures. We introduce an explicit method to overcome this limitation by applying explicit causal and anti-causal integrations. We can prove, both analytically and numerically, that the explicit solution is equivalent to the traditional implicit LU decomposition method. In addition, we compare the accuracy of the new method with traditional EFD methods up to 32nd order, and numerical results indicate that the new method is more accurate. In terms of computational cost, the newly proposed method consists of a standard 8th-order EFD plus two causal and anti-causal integrations, which can be applied recursively, and no extra memory is needed. In summary, compared to the standard EFD methods, the new method has a spectral-like accuracy; compared to the traditional LU-decomposition implicit methods, the new method is explicit. It is more suitable for high-performance computing without losing any accuracy.
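
For context on the LU-decomposition baseline mentioned above, the sketch below solves a tridiagonal (band) system with the standard Thomas algorithm, which is LU factorization specialized to tridiagonal matrices. It is written to highlight that the solve consists of one forward (causal) recursion followed by one backward (anti-causal) recursion. This is the textbook algorithm, not the authors' explicit scheme; the function name and the test system are illustrative, and the code assumes a diagonally dominant matrix because it does no pivoting.

```python
from typing import List, Sequence


def solve_tridiagonal(a: Sequence[float], b: Sequence[float],
                      c: Sequence[float], d: Sequence[float]) -> List[float]:
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for x.

    All four sequences have the same length; a[0] and c[-1] lie outside
    the matrix and do not affect the result.
    """
    n = len(d)
    cp = [0.0] * n
    dp = [0.0] * n
    # Forward (causal) recursion: each entry depends only on the previous one.
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    # Backward (anti-causal) recursion: each entry depends only on the next one.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


# Hypothetical 5x5 system with known solution [1, 2, 3, 4, 5].
A_SUB = [0.0, 1.0, 1.0, 1.0, 1.0]
B_DIAG = [4.0, 4.0, 4.0, 4.0, 4.0]
C_SUP = [1.0, 1.0, 1.0, 1.0, 0.0]
D_RHS = [6.0, 12.0, 18.0, 24.0, 24.0]
solution = solve_tridiagonal(A_SUB, B_DIAG, C_SUP, D_RHS)
```

Both sweeps are sequential along the grid axis, which illustrates why implicit schemes map poorly onto highly parallel hardware along that axis.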
