scholarly journals MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language

Information ◽  
2019 ◽  
Vol 10 (10) ◽  
pp. 317 ◽  
Author(s):  
Karol Nowakowski ◽  
Michal Ptaszynski ◽  
Fumito Masui

Word segmentation is an essential task in automatic language processing for languages where there are no explicit word boundary markers, or where space-delimited orthographic words are too coarse-grained. In this paper we introduce the MiNgMatch Segmenter—a fast word segmentation algorithm, which reduces the problem of identifying word boundaries to finding the shortest sequence of lexical n-grams matching the input text. In order to validate our method in a low-resource scenario involving extremely sparse data, we tested it with a small corpus of text in the critically endangered language of the Ainu people living in northern parts of Japan. Furthermore, we performed a series of experiments comparing our algorithm with systems utilizing state-of-the-art lexical n-gram-based language modelling techniques (namely, Stupid Backoff model and a model with modified Kneser-Ney smoothing), as well as a neural model performing word segmentation as character sequence labelling. The experimental results we obtained demonstrate the high performance of our algorithm, comparable with the other best-performing models. Given its low computational cost and competitive results, we believe that the proposed approach could be extended to other languages, and possibly also to other Natural Language Processing tasks, such as speech recognition.

Author(s):  
Dim Lam Cing ◽  
Khin Mar Soe

In Natural Language Processing (NLP), Word segmentation and Part-of-Speech (POS) tagging are fundamental tasks. The POS information is also necessary in NLP’s preprocessing work applications such as machine translation (MT), information retrieval (IR), etc. Currently, there are many research efforts in word segmentation and POS tagging developed separately with different methods to get high performance and accuracy. For Myanmar Language, there are also separate word segmentors and POS taggers based on statistical approaches such as Neural Network (NN) and Hidden Markov Models (HMMs). But, as the Myanmar language's complex morphological structure, the OOV problem still exists. To keep away from error and improve segmentation by utilizing POS data, segmentation and labeling should be possible at the same time.The main goal of developing POS tagger for any Language is to improve accuracy of tagging and remove ambiguity in sentences due to language structure. This paper focuses on developing word segmentation and Part-of- Speech (POS) Tagger for Myanmar Language. This paper presented the comparison of separate word segmentation and POS tagging with joint word segmentation and POS tagging.


Information ◽  
2019 ◽  
Vol 10 (11) ◽  
pp. 329
Author(s):  
Karol Nowakowski ◽  
Michal Ptaszynski ◽  
Fumito Masui ◽  
Yoshio Momouchi

Ainu is a critically endangered language spoken by the native inhabitants of northern Japan. This paper describes our research aimed at the development of technology for automatic processing of text in Ainu. In particular, we improved the existing tools for normalizing old transcriptions, word segmentation, and part-of-speech tagging. In the experiments we applied two Ainu language dictionaries from different domains (literary and colloquial) and created a new data set by combining them. The experiments revealed that expanding the lexicon had a positive impact on the overall performance of our tools, especially with test data unrelated to any of the training sets used.


Author(s):  
Xinchi Chen ◽  
Xipeng Qiu ◽  
Xuanjing Huang

Recently, neural network models for natural language processing tasks have been increasingly focused on for their ability of alleviating the burden of manual feature engineering. However, the previous neural models cannot extract the complicated feature compositions as the traditional methods with discrete features. In this work, we propose a feature-enriched neural model for joint Chinese word segmentation and part-of-speech tagging task. Specifically, to simulate the feature templates of traditional discrete feature based models, we use different filters to model the complex compositional features with convolutional and pooling layer, and then utilize long distance dependency information with recurrent layer. Experimental results on five different datasets show the effectiveness of our proposed model.


2020 ◽  
Author(s):  
Florencia Klein ◽  
Daniela Cáceres-Rojas ◽  
Monica Carrasco ◽  
Juan Carlos Tapia ◽  
Julio Caballero ◽  
...  

<p>Although molecular dynamics simulations allow for the study of interactions among virtually all biomolecular entities, metal ions still pose significant challenges to achieve an accurate structural and dynamical description of many biological assemblies. This is particularly the case for coarse-grained (CG) models. Although the reduced computational cost of CG methods often makes them the technique of choice for the study of large biomolecular systems, the parameterization of metal ions is still very crude or simply not available for the vast majority of CG- force fields. Here, we show that incorporating statistical data retrieved from the Protein Data Bank (PDB) to set specific Lennard-Jones interactions can produce structurally accurate CG molecular dynamics simulations. Using this simple approach, we provide a set of interaction parameters for Calcium, Magnesium, and Zinc ions, which cover more than 80% of the metal-bound structures reported on the PDB. Simulations performed using the SIRAH force field on several proteins and DNA systems show that using the present approach it is possible to obtain non-bonded interaction parameters that obviate the use of topological constraints. </p>


Sensors ◽  
2021 ◽  
Vol 21 (13) ◽  
pp. 4496
Author(s):  
Vlad Pandelea ◽  
Edoardo Ragusa ◽  
Tommaso Apicella ◽  
Paolo Gastaldo ◽  
Erik Cambria

Emotion recognition, among other natural language processing tasks, has greatly benefited from the use of large transformer models. Deploying these models on resource-constrained devices, however, is a major challenge due to their computational cost. In this paper, we show that the combination of large transformers, as high-quality feature extractors, and simple hardware-friendly classifiers based on linear separators can achieve competitive performance while allowing real-time inference and fast training. Various solutions including batch and Online Sequential Learning are analyzed. Additionally, our experiments show that latency and performance can be further improved via dimensionality reduction and pre-training, respectively. The resulting system is implemented on two types of edge device, namely an edge accelerator and two smartphones.


2021 ◽  
pp. 103048
Author(s):  
Nidal Nasser ◽  
Lutful Karim ◽  
Ahmed El Ouadrhiri ◽  
Asmaa Ali ◽  
Nargis Khan
Keyword(s):  

2020 ◽  
Vol 30 (1) ◽  
pp. 192-208 ◽  
Author(s):  
Hamza Aldabbas ◽  
Abdullah Bajahzar ◽  
Meshrif Alruily ◽  
Ali Adil Qureshi ◽  
Rana M. Amir Latif ◽  
...  

Abstract To maintain the competitive edge and evaluating the needs of the quality app is in the mobile application market. The user’s feedback on these applications plays an essential role in the mobile application development industry. The rapid growth of web technology gave people an opportunity to interact and express their review, rate and share their feedback about applications. In this paper we have scrapped 506259 of user reviews and applications rate from Google Play Store from 14 different categories. The statistical information was measured in the results using different of common machine learning algorithms such as the Logistic Regression, Random Forest Classifier, and Multinomial Naïve Bayes. Different parameters including the accuracy, precision, recall, and F1 score were used to evaluate Bigram, Trigram, and N-gram, and the statistical result of these algorithms was compared. The analysis of each algorithm, one by one, is performed, and the result has been evaluated. It is concluded that logistic regression is the best algorithm for review analysis of the Google Play Store applications. The results have been checked scientifically, and it is found that the accuracy of the logistic regression algorithm for analyzing different reviews based on three classes, i.e., positive, negative, and neutral.


2018 ◽  
Vol 28 (09) ◽  
pp. 1850007
Author(s):  
Francisco Zamora-Martinez ◽  
Maria Jose Castro-Bleda

Neural Network Language Models (NNLMs) are a successful approach to Natural Language Processing tasks, such as Machine Translation. We introduce in this work a Statistical Machine Translation (SMT) system which fully integrates NNLMs in the decoding stage, breaking the traditional approach based on [Formula: see text]-best list rescoring. The neural net models (both language models (LMs) and translation models) are fully coupled in the decoding stage, allowing to more strongly influence the translation quality. Computational issues were solved by using a novel idea based on memorization and smoothing of the softmax constants to avoid their computation, which introduces a trade-off between LM quality and computational cost. These ideas were studied in a machine translation task with different combinations of neural networks used both as translation models and as target LMs, comparing phrase-based and [Formula: see text]-gram-based systems, showing that the integrated approach seems more promising for [Formula: see text]-gram-based systems, even with nonfull-quality NNLMs.


2021 ◽  
Vol 20 (5s) ◽  
pp. 1-25
Author(s):  
Michael Canesche ◽  
Westerley Carvalho ◽  
Lucas Reis ◽  
Matheus Oliveira ◽  
Salles Magalhães ◽  
...  

Coarse-grained reconfigurable architecture (CGRA) mapping involves three main steps: placement, routing, and timing. The mapping is an NP-complete problem, and a common strategy is to decouple this process into its independent steps. This work focuses on the placement step, and its aim is to propose a technique that is both reasonably fast and leads to high-performance solutions. Furthermore, a near-optimal placement simplifies the following routing and timing steps. Exact solutions cannot find placements in a reasonable execution time as input designs increase in size. Heuristic solutions include meta-heuristics, such as Simulated Annealing (SA) and fast and straightforward greedy heuristics based on graph traversal. However, as these approaches are probabilistic and have a large design space, it is not easy to provide both run-time efficiency and good solution quality. We propose a graph traversal heuristic that provides the best of both: high-quality placements similar to SA and the execution time of graph traversal approaches. Our placement introduces novel ideas based on “you only traverse twice” (YOTT) approach that performs a two-step graph traversal. The first traversal generates annotated data to guide the second step, which greedily performs the placement, node per node, aided by the annotated data and target architecture constraints. We introduce three new concepts to implement this technique: I/O and reconvergence annotation, degree matching, and look-ahead placement. Our analysis of this approach explores the placement execution time/quality trade-offs. We point out insights on how to analyze graph properties during dataflow mapping. Our results show that YOTT is 60.6 , 9.7 , and 2.3 faster than a high-quality SA, bounding box SA VPR, and multi-single traversal placements, respectively. Furthermore, YOTT reduces the average wire length and the maximal FIFO size (additional timing requirement on CGRAs) to avoid delay mismatches in fully pipelined architectures.


Geophysics ◽  
2021 ◽  
pp. 1-71
Author(s):  
Hongwei Liu ◽  
Yi Luo

The finite-difference solution of the second-order acoustic wave equation is a fundamental algorithm in seismic exploration for seismic forward modeling, imaging, and inversion. Unlike the standard explicit finite difference (EFD) methods that usually suffer from the so-called "saturation effect", the implicit FD methods can obtain much higher accuracy with relatively short operator length. Unfortunately, these implicit methods are not widely used because band matrices need to be solved implicitly, which is not suitable for most high-performance computer architectures. We introduce an explicit method to overcome this limitation by applying explicit causal and anti-causal integrations. We can prove that the explicit solution is equivalent to the traditional implicit LU decomposition method in analytical and numerical ways. In addition, we also compare the accuracy of the new methods with the traditional EFD methods up to 32nd order, and numerical results indicate that the new method is more accurate. In terms of the computational cost, the newly proposed method is standard 8th order EFD plus two causal and anti-causal integrations, which can be applied recursively, and no extra memory is needed. In summary, compared to the standard EFD methods, the new method has a spectral-like accuracy; compared to the traditional LU-decomposition implicit methods, the new method is explicit. It is more suitable for high-performance computing without losing any accuracy.


Sign in / Sign up

Export Citation Format

Share Document