Automatic Diacritic Recovery with Focus on the Quality of the Training Corpus for Resource-Scarce Languages

Author(s):  
Ikechukwu Ignatius Ayogu
Onoja Abu
2012, Vol 21 (04), pp. 383-403
Author(s):  
ELENA FILATOVA

Wikipedia is used as a training corpus for many information selection tasks: summarization, question answering, etc. The information presented in Wikipedia articles, as well as the order in which it is presented, is treated as the gold standard and is used to improve the quality of information selection systems. However, Wikipedia articles corresponding to the same entry (person, location, event, etc.) written in different languages differ substantially in what information they include. In this paper we analyze the regularities of information overlap among articles about the same Wikipedia entry written in different languages: some facts are covered in the Wikipedia articles in many languages, while others are covered in only a few. We hypothesize that the structure of this information overlap is similar to the information overlap structure (pyramid model) used in summarization evaluation, as well as the overlap/repetition structure used to identify important information in multidocument summarization. We test this hypothesis by building a summarization system based on the information overlap hypothesis. This system summarizes English Wikipedia articles given the articles about the same Wikipedia entries written in other languages. To evaluate the quality of the created summaries, we use Amazon Mechanical Turk as the source of human subjects who can reliably judge the quality of the created text. We also compare the summaries generated according to the information overlap hypothesis against the lead-line baseline, which is considered the most reliable way to generate summaries of Wikipedia articles. The summarization experiment supports the introduced multilingual Wikipedia information overlap hypothesis.
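The pyramid-style scoring the abstract describes can be sketched roughly as follows. This is a toy illustration, not the authors' system: `overlap_summary`, its inputs, and the word-level notion of a "fact" are all assumptions made for the sketch (the paper works with richer cross-lingual content matching).

```python
from collections import Counter

def overlap_summary(english_sents, other_lang_facts, k=3):
    """Rank English sentences by how many other-language article versions
    share their content words -- a pyramid-style support score."""
    # Count, for each content word, how many language versions mention it.
    support = Counter()
    for facts in other_lang_facts:   # one set of (translated) content words per language
        for word in set(facts):
            support[word] += 1
    # Score each sentence by the average cross-lingual support of its words.
    def score(sent):
        words = sent.lower().split()
        return sum(support[w] for w in words) / max(len(words), 1)
    return sorted(english_sents, key=score, reverse=True)[:k]
```

Sentences whose content recurs across many language versions float to the top, mirroring the pyramid intuition that widely repeated information is the most important.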


2018, Vol 25 (1), pp. 171-210
Author(s):  
NILADRI CHATTERJEE
SUSMITA GUPTA

Abstract. For a given training corpus of parallel sentences, the quality of the output produced by a translation system relies heavily on the underlying similarity measurement criteria. A phrase-based machine translation system derives its output through a generative process using a Phrase Table comprising source and target language phrases. Consequently, the more effective the Phrase Table, in terms of its size and the output that may be derived from it, the better the expected outcome of the underlying translation system. However, finding the most similar phrase(s) in a given training corpus that can help generate a good-quality translation poses a serious challenge. In practice, a Phrase Table often contains many parallel phrase entries that are either redundant or do not contribute effectively to the translation results. Identifying these candidate entries and removing them from the Phrase Table not only reduces the size of the Phrase Table but should also improve the processing speed for generating translations. The present paper develops a scheme based on syntactic structure and the marker hypothesis (Green 1979, The necessity of syntax markers: two experiments with artificial languages, Journal of Verbal Learning and Verbal Behavior) for reducing the size of a Phrase Table, without compromising much on the translation quality of the output, by retaining only the non-redundant and meaningful parallel phrases. The proposed scheme is complemented with an appropriate similarity measurement scheme to achieve maximum efficiency in terms of BLEU scores. Although designed for Hindi-to-English machine translation, the overall approach is quite general and is expected to be easily adaptable to other language pairs.
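The marker hypothesis holds that phrases open on closed-class "marker" words (determiners, prepositions, etc.). A minimal sketch of marker-based Phrase Table pruning under that idea might look like the following; the marker list, the entry format, and the filtering rule are illustrative assumptions, not the paper's actual Hindi-English scheme.

```python
# Hypothetical closed-class marker words (English-only, for illustration).
MARKERS = {"the", "a", "an", "in", "on", "of", "to", "with", "for"}

def prune_phrase_table(entries):
    """Drop duplicate phrase pairs and multi-word source phrases that do
    not open on a marker word -- a rough marker-hypothesis filter."""
    seen = set()
    kept = []
    for src, tgt, prob in entries:
        key = (src, tgt)
        if key in seen:
            continue  # redundant duplicate entry
        words = src.split()
        if len(words) > 1 and words[0] not in MARKERS:
            continue  # multi-word phrase not delimited by a marker
        seen.add(key)
        kept.append((src, tgt, prob))
    return kept
```

The point of such a filter is that the retained entries are fewer but still cover well-formed, marker-delimited chunks, shrinking the table without discarding the phrases most useful at decoding time.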


Author(s):  
O. N. Lyashevskaya
L. N. Ostyakova
E. A. Salnikov
O. A. Semenova
...  

The orthographic and morphological heterogeneity of historical texts in premodern Slavic causes many difficulties in POS and morphological tagging. Existing approaches to these tasks achieve state-of-the-art results without normalization, but they remain very sensitive to properties of the training data such as genre and origin. In this paper, we investigate to what extent the heterogeneity and size of the training corpus influence the quality of POS tagging and morphological analysis. We observe that UDPipe trained on different parts of the Middle Russian corpus demonstrates a boost in accuracy when using less training data. We resolve this paradox by analyzing the distribution of POS tags and short words across subcorpora.
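One simple way to quantify the heterogeneity the abstract attributes the paradox to is to compare POS-tag distributions of two subcorpora directly. The sketch below uses total variation distance; the metric choice and the function are assumptions for illustration, not the authors' analysis.

```python
from collections import Counter

def tag_divergence(tags_a, tags_b):
    """Total variation distance between the POS-tag distributions of two
    subcorpora: 0 means identical distributions, 1 means disjoint."""
    ca, cb = Counter(tags_a), Counter(tags_b)
    na, nb = sum(ca.values()), sum(cb.values())
    all_tags = set(ca) | set(cb)
    return 0.5 * sum(abs(ca[t] / na - cb[t] / nb) for t in all_tags)
```

A large divergence between a training subcorpus and the test material would explain why adding more (but mismatched) data can hurt tagger accuracy.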


Author(s):  
Guanhua Chen
Yun Chen
Yong Wang
Victor O.K. Li

Leveraging lexical constraints is important in domain-specific machine translation and interactive machine translation. Previous studies mainly focus on extending the beam search algorithm or augmenting the training corpus by replacing source phrases with their corresponding target translations. These methods either suffer from heavy computation cost during inference or depend on the quality of a bilingual dictionary pre-specified by the user or constructed with statistical machine translation. In response to these problems, we present a conceptually simple and empirically effective data augmentation approach for lexically constrained neural machine translation. Specifically, we construct constraint-aware training data by first randomly sampling phrases of the reference as constraints, and then packing them into the source sentence with a separation symbol. Extensive experiments on several language pairs demonstrate that our approach achieves superior translation results over existing systems, improving the translation of constrained sentences without hurting unconstrained ones.
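The data construction step described here is simple enough to sketch directly: sample contiguous reference phrases as constraints and append them to the source with a separator token. The function name, the `<sep>` symbol, and the sampling parameters are assumptions for illustration; the paper's exact sampling scheme may differ.

```python
import random

def make_constrained_example(source_tokens, reference_tokens,
                             max_constraints=2, sep="<sep>"):
    """Build one constraint-aware training example: sample contiguous
    phrases from the reference and pack them onto the source, each
    preceded by a separation symbol."""
    constraints = []
    for _ in range(random.randint(1, max_constraints)):
        length = random.randint(1, min(3, len(reference_tokens)))
        start = random.randint(0, len(reference_tokens) - length)
        constraints.append(reference_tokens[start:start + length])
    augmented = list(source_tokens)
    for phrase in constraints:
        augmented += [sep] + phrase
    return augmented
```

At training time the model sees the sampled target phrases on the source side and learns to copy them into its output, so at inference the same packing mechanism enforces user-supplied constraints without modifying beam search.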


Author(s):  
K. T. Tokuyasu

During past investigations of immunoferritin localization of intracellular antigens in ultrathin frozen sections, we found that the degree of negative staining required to delineate ultrastructural details was often too dense for the recognition of ferritin particles. The quality of positive staining of ultrathin frozen sections, on the other hand, has generally been far inferior to that attainable in conventional plastic-embedded sections, particularly in the definition of membranes. As we discussed before, a main cause of this difficulty seemed to be the vulnerability of frozen sections to the damaging effects of air-water surface tension at the time of drying of the sections. Indeed, we found that the quality of positive staining is greatly improved when positively stained frozen sections are protected against the effects of surface tension by embedding them in thin layers of mechanically stable materials at the time of drying (unpublished).


Author(s):  
L. D. Jackel

Most production electron beam lithography systems can pattern minimum features a few tenths of a micron across. Linewidth in these systems is usually limited by the quality of the exposing beam and by electron scattering in the resist and substrate. By using a smaller spot along with exposure techniques that minimize scattering and its effects, laboratory e-beam lithography systems can now make features hundredths of a micron wide on standard substrate material. This talk will outline some of these high-resolution e-beam lithography techniques. We first consider parameters of the exposure process that limit resolution in organic resists. For concreteness, suppose that we have a "positive" resist in which exposing electrons break bonds in the resist molecules, thus increasing the exposed resist's solubility in a developer. The attainable resolution is obviously limited by the overall width of the exposing beam, but the spatial distribution of the beam intensity, the beam "profile", also contributes to the resolution. Depending on the local electron dose, more or fewer resist bonds are broken, resulting in faster or slower dissolution in the developer.


Author(s):  
G. Lehmpfuhl

Introduction. In electron microscopic investigations of crystalline specimens, the direct observation of the electron diffraction pattern gives additional information about the specimen. The quality of this information depends on the quality of the crystals or the crystal area contributing to the diffraction pattern. By selected-area diffraction in a conventional electron microscope, specimen areas as small as 1 µm in diameter can be investigated. It is well known that crystal areas of that size, which must be thin enough (on the order of 1000 Å) for electron microscopic investigations, are normally somewhat distorted by bending, or they are not homogeneous. Furthermore, the crystal surface is not well defined over such a large area. These facts cause a reduction of information in the diffraction pattern. The intensity of a diffraction spot, for example, depends on the crystal thickness. If the thickness is not uniform over the investigated area, one observes an averaged intensity, so that the intensity distribution in the diffraction pattern cannot be used for an analysis unless additional information is available.


Author(s):  
K. Shibatomi
T. Yamanoto
H. Koike

In the observation of a thick specimen by means of a transmission electron microscope, the intensity of electrons passing through the objective lens aperture is greatly reduced, so that the image is almost invisible. In addition, it has been reported that chromatic aberration causes deterioration of the image contrast rather than of the resolution. The scanning electron microscope, however, is capable of electrically amplifying the signal of decreasing intensity, and is also free from chromatic aberration, so that the deterioration of image contrast due to the aberration can be prevented. The electrical improvement of the image quality can be carried out by using two attractive features of the SEM: the amplification of a weak input signal forming the image, and the discrimination of the high-level background signal. This paper reports some experimental results on the thickness dependence of the observability and quality of the image in the case of the transmission SEM.

