Weighted Finite-State Transducers
Recently Published Documents


TOTAL DOCUMENTS: 69 (FIVE YEARS: 4)

H-INDEX: 11 (FIVE YEARS: 1)

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Bilal Elghadyry ◽  
Faissal Ouardi ◽  
Sébastien Verel

Abstract Weighted finite-state transducers have been shown to be a general and efficient representation in many applications such as text and speech processing, computational biology, and machine learning. The composition of weighted finite-state transducers is a fundamental operation common to these applications. The NP-hardness of the composition computation problem is a challenge that leads us to devise algorithms that remain efficient at large scale when more than two transducers are composed. This paper describes a parallel computation of weighted finite-state transducer composition in the MapReduce framework. To the best of our knowledge, this paper is the first to tackle this task using MapReduce methods. First, we analyze the communication cost of this problem using the model of Afrati et al. Then, we propose three MapReduce methods based respectively on input-alphabet mapping, state mapping, and hybrid mapping. Finally, extensive experiments on a wide range of weighted finite-state transducers are conducted to compare the proposed methods and show their efficiency on large-scale data.
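
Since pairwise composition is the building block that the three MapReduce methods distribute, a minimal single-machine sketch may help fix ideas. This is the textbook product construction over the tropical semiring, written in Python; the data layout and function names are illustrative, not from the paper, and epsilon transitions are ignored for brevity:

```python
from collections import defaultdict, deque

# A toy WFST over the tropical semiring: "times" is +, "plus" is min.
# Representation: {"start": q0, "arcs": {state: [(in, out, weight, next), ...]},
#                  "finals": {state: final_weight}}.
# Epsilon labels are not handled; real toolkits (e.g. OpenFst) use
# composition filters to treat them correctly.

def compose(t1, t2):
    """Product construction: pair up states, matching t1's output labels
    against t2's input labels and combining weights with tropical times (+)."""
    start = (t1["start"], t2["start"])
    arcs, finals = defaultdict(list), {}
    queue, seen = deque([start]), {start}
    while queue:
        q = q1, q2 = queue.popleft()
        if q1 in t1["finals"] and q2 in t2["finals"]:
            finals[q] = t1["finals"][q1] + t2["finals"][q2]
        for i, m1, w1, n1 in t1["arcs"].get(q1, []):
            for m2, o, w2, n2 in t2["arcs"].get(q2, []):
                if m1 == m2:                      # middle labels must agree
                    nxt = (n1, n2)
                    arcs[q].append((i, o, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return {"start": start, "arcs": dict(arcs), "finals": finals}

# Tiny usage example: a:b composed with b:c yields a:c with weight 0.75.
t1 = {"start": 0, "arcs": {0: [("a", "b", 0.5, 1)]}, "finals": {1: 0.0}}
t2 = {"start": 0, "arcs": {0: [("b", "c", 0.25, 1)]}, "finals": {1: 0.0}}
print(compose(t1, t2))
```

The paired-state construction is also why composing a chain of transducers blows up: the state space of the result is a subset of the Cartesian product of the inputs' state spaces, which motivates distributing the work.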


2019 ◽  
Vol 25 (2) ◽  
pp. 307-321 ◽  
Author(s):  
Izaskun Etxeberria ◽  
Iñaki Alegria ◽  
Larraitz Uria

Abstract This paper presents a study of methods for the normalization of historical texts. The aim of these methods is to learn relations between historical and contemporary word forms. We have compiled training and test corpora for different languages and scenarios, and we have tried to interpret the results in light of the features of the corpora and languages. Our proposed method, based on weighted finite-state transducers, is compared to previously published ones. The method learns to map phonological changes using a noisy channel model; it is a simple solution that can exploit a limited amount of supervision to achieve adequate performance. The compiled corpora are made available so that other researchers can compare results on them. Concerning the amount of supervision, we investigate how the size of the training corpus affects the results and identify some factors that help to anticipate the difficulty of the task.
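
As an illustration of the noisy channel idea (not the authors' transducer implementation): a toy normalizer that scores contemporary candidates by a weighted edit distance (the channel) minus the log probability of a lexicon entry (the source). Every cost, word, and probability below is invented for the example:

```python
import math

# Toy noisy-channel normalizer: pick the contemporary word w minimizing
# channel_cost(historical | w) - log P(w). In the paper the channel is a
# weighted finite-state transducer learned from aligned word pairs; here it
# is a plain weighted-edit-distance dynamic program.

SUB_COST = {("v", "u"): 0.3, ("y", "i"): 0.3}  # cheap, historically common swaps
DEFAULT_SUB = 1.0
INDEL = 1.0

def channel_cost(hist, word):
    """Weighted Levenshtein distance standing in for -log P(hist | word)."""
    n, m = len(hist), len(word)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * INDEL
    for j in range(1, m + 1):
        d[0][j] = j * INDEL
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if hist[i - 1] == word[j - 1]:
                sub = 0.0
            else:
                sub = SUB_COST.get((hist[i - 1], word[j - 1]), DEFAULT_SUB)
            d[i][j] = min(d[i - 1][j - 1] + sub,
                          d[i - 1][j] + INDEL,
                          d[i][j - 1] + INDEL)
    return d[n][m]

def normalize(hist, lexicon):
    """lexicon maps contemporary words to probabilities (the source model)."""
    return min(lexicon, key=lambda w: channel_cost(hist, w) - math.log(lexicon[w]))

print(normalize("vnyted", {"united": 0.7, "untied": 0.3}))  # -> united
```

Training such a model amounts to estimating the substitution costs from aligned historical/contemporary pairs, which is exactly where the limited supervision enters.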


2017 ◽  
Author(s):  
Lars Hellsten ◽  
Brian Roark ◽  
Prasoon Goyal ◽  
Cyril Allauzen ◽  
Françoise Beaufays ◽  
...  

2016 ◽  
Author(s):  
Ian Holmes

Abstract We describe a strategy for constructing codes for DNA-based information storage by serial composition of weighted finite-state transducers. The resulting state machines can integrate correction of substitution errors; synchronization by interleaving watermark and periodic marker signals; conversion from binary to ternary, quaternary or mixed-radix sequences via an efficient block code; encoding into a DNA sequence that avoids homopolymer, dinucleotide, or trinucleotide runs and other short local repeats; and detection/correction of errors (including local duplications, burst deletions, and substitutions) that are characteristic of DNA sequencing technologies. We present software implementing these codes, available at https://github.com/ihh/dnastore, with simulation results demonstrating that the generated DNA is free of short repeats and can be accurately decoded even in the presence of substitutions, short duplications and deletions.
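
For intuition about the homopolymer-avoiding encoding step, here is a toy rotating code in a similar spirit: it writes the message in base 3 and emits, at each position, one of the three bases that differ from the previous one, so adjacent bases can never repeat. This is only an illustration; dnastore's actual codes are built by transducer composition and layer watermarking and error correction on top, none of which appears here:

```python
# Toy homopolymer-free code: base-3 digits select among the three bases that
# differ from the previously emitted base. The 0x01 sentinel byte preserves
# the message length across the integer round trip.

BASES = "ACGT"

def encode(data: bytes) -> str:
    n = int.from_bytes(b"\x01" + data, "big")
    digits = []
    while n:
        n, d = divmod(n, 3)
        digits.append(d)                 # base-3, least significant first
    seq, prev = [], "T"                  # sentinel: always exactly 3 choices
    for d in digits:
        choices = [b for b in BASES if b != prev]
        prev = choices[d]
        seq.append(prev)
    return "".join(seq)

def decode(seq: str) -> bytes:
    digits, prev = [], "T"
    for base in seq:
        choices = [b for b in BASES if b != prev]
        digits.append(choices.index(base))
        prev = base
    n = 0
    for d in reversed(digits):           # most significant digit first
        n = n * 3 + d
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return raw[1:]                       # drop the 0x01 sentinel

assert decode(encode(b"hello")) == b"hello"
```

Because each emitted base excludes its predecessor, each position carries log2(3) bits, which is the rate the abstract's binary-to-ternary block code is approximating.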


2014 ◽  
Vol 21 (3) ◽  
pp. 333-353 ◽  
Author(s):  
PETER EBDEN ◽  
RICHARD SPROAT

Abstract This paper describes the Kestrel text normalization system, a component of the Google text-to-speech synthesis (TTS) system. At the core of Kestrel are text-normalization grammars that are compiled into libraries of weighted finite-state transducers (WFSTs). While the use of WFSTs for text normalization is itself not new, Kestrel differs from previous systems in its separation of the initial tokenization and classification phase of analysis from verbalization. Input text is first tokenized, and the different tokens are classified using WFSTs. As part of the classification, detected semiotic classes (expressions such as currency amounts, dates, times, and measure phrases) are parsed into protocol buffers (https://code.google.com/p/protobuf/). The protocol buffers are then verbalized, with possible reordering of the elements, again using WFSTs. This paper describes the architecture of Kestrel and the protocol buffer representations of semiotic classes, and presents some examples of grammars for various languages. We also discuss applications and deployments of Kestrel as part of the Google TTS system, which runs server-side and client-side on multiple devices and is used daily by millions of people in nineteen languages and counting.
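
The classify-then-verbalize flow can be sketched in miniature. This Python stand-in is not Kestrel (whose grammars are compiled WFSTs); a regular expression classifies a money token into a dict that plays the role of the protocol buffer, and a small lookup verbalizes it with field reordering:

```python
import re

# Phase 1: classify tokens into structured records (dicts stand in for
# protocol buffers). Phase 2: verbalize the records. The regex, the number
# grammar, and the example sentence are all invented for illustration.

MONEY = re.compile(r"\$(?P<units>\d+)\.(?P<cents>\d{2})")
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TENS = ["ten", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def num(n):  # toy number-name grammar, 0..99 only
    if n < 10:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens - 1] + ("" if ones == 0 else " " + ONES[ones])

def classify(token):
    m = MONEY.fullmatch(token)
    if m:  # an instance of the "money" semiotic class
        return {"class": "money", "currency": "usd",
                "units": int(m["units"]), "cents": int(m["cents"])}
    return {"class": "token", "text": token}

def verbalize(rec):
    if rec["class"] == "money":
        # Reordering: the "$" sigil precedes the digits in text but the
        # currency word follows the number in speech.
        return f'{num(rec["units"])} dollars {num(rec["cents"])} cents'
    return rec["text"]

text = "it costs $3.50 today"
print(" ".join(verbalize(classify(t)) for t in text.split()))
# -> it costs three dollars fifty cents today
```

The value of the intermediate record is the same as in Kestrel's design: classification decides what a token is once, and language-specific verbalizers only consume the structured fields.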


2014 ◽  
Vol 40 (4) ◽  
pp. 733-761
Author(s):  
Richard Sproat ◽  
Mahsa Yarmohammadi ◽  
Izhak Shafran ◽  
Brian Roark

This paper explores lexicographic semirings and their application to problems in speech and language processing. Specifically, we present two instantiations of binary lexicographic semirings, one involving a pair of tropical weights, and the other a tropical weight paired with a novel string semiring we term the categorial semiring. The first of these is used to yield an exact encoding of backoff models with epsilon transitions. This lexicographic language model semiring allows for off-line optimization of exact models represented as large weighted finite-state transducers in contrast to implicit (on-line) failure transition representations. We present empirical results demonstrating that, even in simple intersection scenarios amenable to the use of failure transitions, the use of the more powerful lexicographic semiring is competitive in terms of time of intersection. The second of these lexicographic semirings is applied to the problem of extracting, from a lattice of word sequences tagged for part of speech, only the single best-scoring part of speech tagging for each word sequence. We do this by incorporating the tags as a categorial weight in the second component of a 〈Tropical, Categorial〉 lexicographic semiring, determinizing the resulting word lattice acceptor in that semiring, and then mapping the tags back as output labels of the word lattice transducer. We compare our approach to a competing method due to Povey et al. (2012).
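
The first of these instantiations, a pair of tropical weights combined lexicographically, is easy to state concretely. A minimal Python sketch follows; the operation names and the example weights are illustrative, not taken from the paper:

```python
import math

# A <Tropical, Tropical> lexicographic semiring: "plus" keeps the
# lexicographically smaller pair, "times" adds componentwise.

ZERO = (math.inf, math.inf)   # additive identity: never wins a min
ONE = (0.0, 0.0)              # multiplicative identity

def plus(a, b):
    """Path selection: smaller first component wins; ties fall to the second."""
    return a if a <= b else b  # Python tuple comparison is lexicographic

def times(a, b):
    """Path extension: tropical times is + in each component."""
    return (a[0] + b[0], a[1] + b[1])

# Two paths with equal primary cost are disambiguated by the secondary
# component, which is how an exact backoff encoding can separate an epsilon
# backoff arc from a direct n-gram arc of the same tropical weight:
p1 = times((2.0, 0.0), (1.0, 1.0))   # ends with the penalized arc
p2 = times((2.5, 0.0), (0.5, 0.0))
print(plus(p1, p2))                   # -> (3.0, 0.0): p2 is preferred
```

Distributivity holds because times adds the same pair to both arguments of plus, which preserves their lexicographic order, so the pair construction is a genuine semiring and standard shortest-distance and determinization algorithms apply unchanged.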

