An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection

2021 · Vol 13 (3) · pp. 1-25
Author(s): Anurag Roy, Shalmoli Ghosh, Kripabandhu Ghosh, Saptarshi Ghosh

A large fraction of textual data available today contains various types of “noise,” such as OCR noise in digitized documents, noise due to the informal writing style of users on microblogging sites, and so on. To enable tasks such as search/retrieval and classification over all the available data, we need robust algorithms for text normalization, i.e., for cleaning different kinds of noise in the text. There have been several efforts towards cleaning or normalizing noisy text; however, many of the existing text normalization methods are supervised and require language-dependent resources or large amounts of training data that are difficult to obtain. We propose an unsupervised algorithm for text normalization that needs no training data or human intervention. The proposed algorithm is applicable to text in different languages and can handle both machine-generated and human-generated noise. Experiments over several standard datasets show that text normalization through the proposed algorithm enables better retrieval and stance detection than several baseline text normalization methods.
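The algorithm itself is not spelled out in the abstract, so the following is only a minimal sketch of the general idea behind unsupervised, language-independent normalization: map rare, noisy token variants onto frequent in-corpus forms within a small edit distance. The frequency threshold and `max_dist` value are hypothetical, not the paper's parameters.

```python
from collections import Counter

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance, O(len(a) * len(b))
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def normalize(tokens, min_count=3, max_dist=1):
    # map each rare token to the most frequent in-corpus form within max_dist edits;
    # no labeled data or language-specific resources are required
    counts = Counter(tokens)
    frequent = [w for w, c in counts.items() if c >= min_count]
    mapping = {}
    for word in counts:
        if counts[word] >= min_count:
            continue
        candidates = [f for f in frequent
                      if abs(len(f) - len(word)) <= max_dist
                      and edit_distance(f, word) <= max_dist]
        if candidates:
            mapping[word] = max(candidates, key=counts.get)
    return [mapping.get(t, t) for t in tokens]
```

For example, in a corpus where "hello" is frequent, the rare variant "helo" (one deletion away) would be rewritten to "hello", while tokens with no close frequent neighbor are left untouched.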

Author(s): Zolzaya Byambadorj, Ryota Nishimura, Altangerel Ayush, Norihide Kitaoka

The huge increase in social media use in recent years has resulted in new forms of social interaction, changing our daily lives. Due to increasing contact between people from different cultures as a result of globalization, there has also been an increase in the use of the Latin alphabet, and as a result a large amount of transliterated text is being used on social media. In this study, we propose a variety of character-level sequence-to-sequence (seq2seq) models for normalizing noisy, transliterated text written in Latin script into Mongolian Cyrillic script, for scenarios in which only a limited amount of training data is available. We applied performance enhancement methods, including various beam search strategies, N-gram-based context adoption, edit distance-based correction, and dictionary-based checking, in novel ways to two basic seq2seq models. We experimentally evaluated these two basic models as well as fourteen enhanced seq2seq models, and compared their noisy text normalization performance with that of a transliteration model and a conventional statistical machine translation (SMT) model. The proposed enhancements improved the robustness of the basic seq2seq models when normalizing out-of-vocabulary (OOV) words, and most of our models achieved higher normalization performance than the conventional method. In our text normalization experiments on the test data, the proposed method that checks each hypothesis during inference achieved the lowest word error rate (WER = 13.41%), 4.51% fewer errors than the conventional SMT method.
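For concreteness, a character-level encoder-decoder of the kind described can be sketched in PyTorch as below. This is a generic GRU skeleton with hypothetical sizes, not the authors' exact architecture, and it omits the beam search, N-gram, edit-distance, and dictionary enhancements.

```python
import torch.nn as nn

class CharSeq2Seq(nn.Module):
    # minimal character-level encoder-decoder for Latin-script -> Cyrillic normalization
    def __init__(self, src_vocab, tgt_vocab, emb=128, hid=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # encode the noisy Latin-script character sequence
        _, h = self.encoder(self.src_emb(src_ids))
        # teacher-forced decoding into Cyrillic characters
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)
        return self.out(dec_out)  # logits over the Cyrillic character vocabulary
```

At inference time, decoding would proceed character by character (greedily or with beam search), feeding each predicted Cyrillic character back into the decoder.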


2021 · Vol 7 (4) · pp. 64
Author(s): Tanguy Ophoff, Cédric Gullentops, Kristof Van Beeck, Toon Goedemé

Object detection models are usually trained and evaluated on highly complicated, challenging academic datasets, which results in deep networks requiring lots of computation. However, many operational use-cases consist of more constrained situations: a limited number of classes to be detected, less intra-class variance, less lighting and background variance, constrained or even fixed camera viewpoints, etc. In these cases, we hypothesize that smaller networks could be used without deteriorating accuracy. However, there are multiple reasons why this does not happen in practice: firstly, overparameterized networks tend to learn better, and secondly, transfer learning is usually used to reduce the necessary amount of training data. In this paper, we investigate how much we can reduce the computational complexity of a standard object detection network in such constrained object detection problems. As a case study, we focus on a well-known single-shot object detector, YoloV2, and combine three different techniques to reduce the computational complexity of the model without reducing its accuracy on our target dataset. To investigate the influence of problem complexity, we compare two datasets: a prototypical academic dataset (Pascal VOC) and a real-life operational one (LWIR person detection). The three optimization steps we exploited are swapping all convolutions for depth-wise separable convolutions, pruning, and weight quantization. The results of our case study indeed substantiate our hypothesis that the more constrained a problem is, the more the network can be optimized. On the constrained operational dataset, combining these optimization techniques allowed us to reduce the computational complexity by a factor of 349, compared to only a factor of 9.8 on the academic dataset. When running a benchmark on an Nvidia Jetson AGX Xavier, our fastest model runs more than 15 times faster than the original YoloV2 model, whilst increasing the accuracy by 5% Average Precision (AP).
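The first optimization step can be illustrated with a drop-in replacement for a standard convolution block. This is a generic PyTorch sketch of a depth-wise separable convolution, not the authors' exact YoloV2 modification; a k x k depth-wise convolution per channel followed by a 1x1 point-wise convolution costs far fewer operations than a full k x k convolution over all channel pairs.

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, k=3, stride=1):
    # depth-wise: one k x k spatial filter per input channel (groups=in_ch)
    # point-wise: 1x1 convolution mixing channels into out_ch
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, k, stride, padding=k // 2, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```

Pruning and weight quantization would then be applied on top of the retrained network to reach the reported complexity reductions.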


2020 · Vol 13 (1) · pp. 23
Author(s): Wei Zhao, William Yamada, Tianxin Li, Matthew Digman, Troy Runge

In recent years, precision agriculture has been researched as a promising means to increase crop production with fewer inputs and meet the growing demand for agricultural products. Computer vision-based crop detection with unmanned aerial vehicle (UAV)-acquired images is a critical tool for precision agriculture. However, object detection using deep learning algorithms relies on a significant amount of manually prelabeled training data as ground truth. Field object detection, such as bale detection, is especially difficult because of (1) long-period image acquisition under different illumination conditions and seasons; (2) limited existing prelabeled data; and (3) few pretrained models and little prior research to use as references. This work increases bale detection accuracy with limited data collection and labeling by building an innovative algorithm pipeline. First, an object detection model is trained using 243 images captured from croplands under good illumination conditions in fall. Then, domain adaptation (DA), a kind of transfer learning, is applied to synthesize training data under diverse environmental conditions with automatic labels. Finally, the object detection model is optimized with the synthesized datasets. The case study shows that the proposed method improves bale detection performance, raising the average recall, mean average precision (mAP), and F measure (F1 score) from 0.59, 0.7, and 0.7 (object detection alone) to 0.93, 0.94, and 0.89 (object detection + DA), respectively. This approach could easily be scaled to many other crop field objects and will significantly contribute to precision agriculture.
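The abstract does not detail the DA method, so the sketch below is only a schematic stand-in: it shows how label-preserving image synthesis yields automatically labeled training data, here with crude brightness shifts in place of a learned domain-adaptation model. The brightness factors are arbitrary.

```python
from PIL import Image, ImageEnhance

def synthesize_variants(image_path, boxes):
    # generate illumination-shifted copies of a labeled image; bounding boxes are
    # unaffected by photometric changes, so the labels transfer automatically
    img = Image.open(image_path).convert("RGB")
    variants = []
    for factor in (0.5, 0.75, 1.25, 1.5):  # stand-ins for season/illumination shifts
        shifted = ImageEnhance.Brightness(img).enhance(factor)
        variants.append((shifted, boxes))  # same boxes, new appearance
    return variants
```

Each synthesized (image, boxes) pair can then be appended to the training set before the final optimization round of the detector.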


2021 · Vol 11 (15) · pp. 6723
Author(s): Ariana Raluca Hategan, Romulus Puscas, Gabriela Cristea, Adriana Dehelean, Francois Guyon, ...

The present work aims to test the potential of Artificial Neural Networks (ANNs) for food authentication. For this purpose, honey was chosen as the working matrix. The samples originated from two countries, Romania (50) and France (53), with floral origins including acacia, linden, honeydew, colza, Galium verum, coriander, sunflower, thyme, raspberry, lavender and chestnut. The ANNs were built on the isotope and elemental content of the investigated honey samples. This approach led to a prediction model for geographical recognition with an accuracy of 96%. Alongside this work, distinct models were developed and tested with the aim of identifying the most suitable configurations for this application. In this regard, improvements were made continuously; the most important consisted in overcoming the unwanted phenomenon of over-fitting observed on the training data set. This was achieved by identifying appropriate values for the number of iterations over the training data and for the size and number of the hidden layers, and by introducing a dropout layer into the configuration of the neural structure. In conclusion, ANNs can be successfully applied in food authenticity control, but with a degree of caution with respect to “over-optimization” of the correct classification percentage for the training sample set, which can lead to an over-fitted model.
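A minimal sketch of such a network is given below, assuming a hypothetical feature count and layer sizes; only the dropout layer mirrors the regularization step the authors describe.

```python
from tensorflow import keras

N_FEATURES = 20  # hypothetical number of isotope + elemental measurements per sample

model = keras.Sequential([
    keras.layers.Input(shape=(N_FEATURES,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dropout(0.3),                    # curbs over-fitting on the training set
    keras.layers.Dense(2, activation="softmax"),  # two classes: Romania vs. France
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The number of training epochs ("iterations over the training data") and the hidden-layer sizes would then be tuned, as the authors report, until training and validation accuracies no longer diverge.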


Author(s): Anastasia V. Kolmogorova

The article analyzes the validity of Internet confession texts as a source of training data for a computer classifier of Russian Internet texts according to their emotional tonality. The classifier, backed by Lövheim's emotional cube model, is expected to detect eight classes of emotions represented in a text or to assign the text to an emotionally neutral class. The first and one of the most important stages of classifier creation is the selection of the training data set — in Machine Learning, the actual dataset used to train the model. The Internet text genres traditionally used in sentiment analysis to train two- or three-tonality classifiers are tweets, film and market reviews, blogs and financial reports. The novelty of our project consists in designing a multiclass classifier, which requires new, non-trivial training data. As such, we have chosen texts from the public group Overheard on the Russian social network VKontakte. As all these texts show similarities, we united them under the genre name “Internet confession.” To characterize the genre, we applied the method of narrative semiotics, describing six positions that form the deep narrative structure of the “Internet confession”: Addresser – a person aware of her/his separateness from society; Addressee – society / public opinion; Subject – a narrator describing his/her emotional state; Object – the person's self-image; Helper – the person's frankness; Adversary – the person's shame. These genre features determine its primary, qualitative advantage: a particular focus on emotionality, whereas more traditional sources of textual data rest on categories such as expressivity (tweets) or axiological estimations (all sorts of reviews). The structural analysis of the texts under discussion also demonstrated several advantages stemming from the technological basis of the Overheard project: the text hashtagging spares the researcher from submitting the whole collection to crowdsourcing assessment; the collection's size is optimal for assessment by experts; and, despite their hyperbolized emotionality, the texts of the Internet confession genre share the stylistic features typical of different types of personal Internet discourse. However, the narrative character of all Internet confession texts implies some restrictions on their use within a sentiment analysis project.
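For concreteness, the intended classifier can be sketched as a nine-class text pipeline (eight Lövheim-cube emotions plus a neutral class). This is a generic scikit-learn sketch, not the project's actual model; the label names and training texts are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# eight Lövheim-cube emotion classes plus a neutral class
LABELS = ["interest", "enjoyment", "surprise", "distress",
          "fear", "anger", "shame", "disgust", "neutral"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
# clf.fit(train_texts, train_labels)  # expert-assessed "Internet confession" texts
# clf.predict(["<a Russian-language text>"])
```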


2019 · Vol 11 (2) · pp. 144
Author(s): Danar Wido Seno, Arief Wibowo

The growth of social media content produces many new words and abbreviations on Twitter, making it increasingly difficult for sentiment analysis to achieve high accuracy on Twitter textual data. In this study, the authors conducted a sentiment analysis of the pairs of candidates for President and Vice President of Indonesia in the 2019 Elections. To obtain higher accuracy and accommodate the evolving textual data on Twitter, the authors combined unsupervised and supervised methods, namely a Lexicon-Based approach and a Support Vector Machine. The study used Twitter data from October 2018, collected with search keywords containing the names of each pair of candidates, totaling 800 records. With these 800 records, the best accuracy obtained was 92.5%, using an 80% training / 20% testing data split, with per-class Precision between 85.7% and 97.2% and per-class Recall between 78.2% and 93.5%. With the Lexicon-Based method labeling the dataset, the Support Vector Machine training data no longer needs to be labeled manually, and the lexicon dictionary can be extended as content on Twitter social media evolves.
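A minimal sketch of this labeling-then-training pipeline is shown below; the lexicon entries are hypothetical placeholders, and the study's actual lexicon, preprocessing, and SVM settings may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

POSITIVE = {"bagus", "hebat", "baik"}   # hypothetical lexicon entries
NEGATIVE = {"buruk", "jelek", "gagal"}

def lexicon_label(tweet):
    # automatic labeling: count positive vs. negative lexicon hits
    tokens = tweet.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative"

def train_svm(tweets):
    labels = [lexicon_label(t) for t in tweets]
    # 80% training / 20% testing split, as in the study
    X_tr, X_te, y_tr, y_te = train_test_split(tweets, labels,
                                              test_size=0.2, random_state=0)
    model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
    model.fit(X_tr, y_tr)
    return model, model.score(X_te, y_te)
```

Because the labels come from the lexicon rather than manual annotation, extending the lexicon dictionary is enough to keep the pipeline current as new Twitter vocabulary appears.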


2010 · Vol 2 (2) · pp. 38-51
Author(s): Marc Halbrügge

Keep it simple - A case study of model development in the context of the Dynamic Stocks and Flows (DSF) task
This paper describes the creation of a cognitive model submitted to the ‘Dynamic Stocks and Flows’ (DSF) modeling challenge. This challenge aims at comparing computational cognitive models of human behavior during an open-ended control task. Participants in the modeling competition were provided with a simulation environment and training data for benchmarking their models, while the actual specification of the competition task was withheld. To meet this challenge, the cognitive model described here was designed and optimized for generalizability. Only two simple assumptions about human problem solving were used to explain the empirical findings in the training data. In-depth analysis of the data set prior to the development of the model led to the dismissal of correlations and other parametric statistics as goodness-of-fit indicators. A new statistical measure based on rank orders and sequence matching techniques is proposed instead. This measure, when applied to the human sample, also identifies clusters of subjects that use different strategies for the task. The acceptability of the fits achieved by the model is verified using permutation tests.
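The paper's exact measure is not reproduced in the abstract; the sketch below only illustrates the combination of rank orders, sequence matching, and permutation testing using standard-library tools.

```python
import difflib
import random

def rank_sequence_fit(model_series, human_series):
    # convert each series to its rank order, then score agreement between the two
    # orderings with a sequence-matching ratio (1.0 = identical rank sequences)
    def rank_order(xs):
        return tuple(sorted(range(len(xs)), key=xs.__getitem__))
    return difflib.SequenceMatcher(None, rank_order(model_series),
                                   rank_order(human_series)).ratio()

def permutation_p_value(model_series, human_series, n_perm=10000, seed=0):
    # permutation test: how often does a shuffled human series fit the model
    # series at least as well as the actually observed one?
    rng = random.Random(seed)
    observed = rank_sequence_fit(model_series, human_series)
    shuffled = list(human_series)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if rank_sequence_fit(model_series, shuffled) >= observed:
            hits += 1
    return hits / n_perm
```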


2021 · Vol 4 (2)
Author(s): Khalifa Ahmed Muiz

S̲h̲īs̲h̲ Maḥal, the magnificent and monumental creation of the Mughals, still stands like a jewel after hundreds of years. The building characteristically excels in its decoration and is best known for its intricate detailing. The evolution and transformation of the decorative arts reached its zenith during the reign of S̲h̲āh Jahan (1628-1658), an era known for delicacy and pure light in white. The aim of this paper is to study the decorative arts that excelled during this golden era. In this regard, S̲h̲īs̲h̲ Maḥal, situated in the Lahore Fort, is taken as a case study. Comprehensive documentation of these decorative arts, including their design, material and technology, developed the baseline inventory for their interpretation and appreciation. The study further explored the transformations and transitions during their refinement, in addition to their description in the historical textual data.
Keywords: decorative techniques, Pietra-dura, S̲h̲āh Jahan, Mumtaz


2012 · Vol 14 (2) · pp. 139
Author(s): Wahyuni Reksoatmodjo, Jogiyanto Hartono, Achmad Djunaedi, Hargo Utomo

Interaction and linkages between business and information technology (IT) strategies remain a primary concern among executives. This study aims to gain an in-depth understanding of how companies achieve alignment and of the policy framework that underlies these efforts, particularly those associated with the most dominant factor contributing to the establishment of strategic alignment, namely IT infrastructure flexibility. For that purpose, the study explored four companies in the oil, electricity, and communication sectors, adopting an interpretive case study approach. Data were gathered using triangulation methods via field interviews, artifacts, document analysis, and direct observation. The textual data were elaborated through intentional analysis to guide the study in exploring the phenomenon. The study identified elements that reflect IT infrastructure flexibility, namely connectivity, compatibility, modularity, IT staff knowledge and skills, and integration. These elements cover both the technical and behavioral dimensions of a company's components that need to be considered during the planning phase.

