Networds: The impact of electronic text-processing utilities on writing

1994 ◽  
Vol 17 (2) ◽  
pp. 127-166
Author(s):  
William Dubie


Author(s):  
Andrei Mikheev

Electronic text is essentially just a sequence of characters, but most text-processing tools operate on linguistic units such as words and sentences. Tokenization is the process of segmenting text into words, and sentence splitting is the process of determining sentence boundaries. In this chapter we describe the major challenges of text tokenization and sentence splitting across different languages, and outline various computational approaches to tackling them.
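The two processes described above can be illustrated with a minimal rule-based sketch. This is not the chapter's method, just a toy example; the abbreviation list and regular expressions are invented for illustration, and production systems handle far more cases (numbers, quotes, language-specific rules).

```python
import re

# Toy abbreviation list (illustrative only) used to suppress false
# sentence boundaries after periods that belong to abbreviations.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def tokenize(text):
    """Split text into word tokens (keeping internal apostrophes) and punctuation tokens."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

def split_sentences(text):
    """Split on sentence-final punctuation followed by whitespace+capital or end of text,
    skipping periods that belong to a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]+(?=\s+[A-Z]|\s*$)", text):
        candidate = text[start:match.end()].strip()
        words = candidate.split()
        # Keep scanning if the candidate ends in a listed abbreviation.
        if words and words[-1].lower() in ABBREVIATIONS:
            continue
        sentences.append(candidate)
        start = match.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences
```

For example, `split_sentences("Dr. Smith left. He returned.")` keeps "Dr." attached to its sentence instead of splitting after it.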


2022 ◽  
Vol 126 ◽  
pp. 107016
Author(s):  
Valeria A. Pfeifer ◽  
Emma L. Armstrong ◽  
Vicky Tzuyin Lai
Author(s):  
Snezhana Sulova ◽  
Boris Bankov

The impact of social networks on our lives keeps increasing because they provide content, generated and controlled by users, that is constantly evolving. They help spread news, statements, ideas and comments very quickly. Social platforms are currently one of the richest sources of customer feedback on a variety of topics. One frequently discussed topic is resort and holiday villages and the tourist services offered there. Customer comments are valuable to both travel planners and tour operators. The accumulation of opinions in the web space is a prerequisite for applying appropriate tools for their computer processing and for extracting useful knowledge from them. When working with unstructured data such as social media messages, there is no universal text-processing algorithm, because each social network and its resources have their own characteristics. In this article, we propose a new approach for automated analysis of a static set of historical user messages about holiday and vacation resorts published on Twitter. The approach is based on natural language processing techniques and the application of machine learning methods. The experiments are conducted using the software product RapidMiner.
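A typical first stage in this kind of tweet analysis is cleaning the raw message text before any machine learning is applied. The sketch below is illustrative only; the cleaning rules and the tiny stopword list are invented for the example, not taken from the article or from RapidMiner.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a full one.
STOPWORDS = {"the", "a", "an", "is", "are", "at", "on", "in", "and", "to"}

def preprocess_tweet(text):
    """Lowercase, strip URLs and @mentions, keep hashtag words, drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)          # remove user mentions
    text = text.replace("#", " ")              # keep the hashtag word, drop '#'
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOPWORDS]
```

The resulting token lists can then be fed to standard feature extraction (e.g. bag-of-words) and classification steps.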


2021 ◽  
Author(s):  
Rianto Rianto ◽  
Achmad Benny Mutiara ◽  
Eri Prasetyo Wibowo ◽  
Paulus Insap Santosa

Abstract Background: Stemming has long been used in data pre-processing to retrieve information by tracking affixed words back to their roots. In the Indonesian setting, existing stemming methods have been studied and shown to achieve high accuracy. However, few stemming methods exist for non-formal Indonesian text processing. This study introduces a new stemming method to address problems in pre-processing non-formal Indonesian text data, and aims to improve the accuracy of text classifier models by strengthening the stemming method. Using the Support Vector Machine algorithm, a text classifier model is developed and its accuracy is evaluated. The experimental evaluation tested 550 Indonesian datasets using two different stemming methods. Findings: The results show that with the proposed stemming method, the text classifier model achieves higher accuracy than with the existing methods, with scores of 0.85 and 0.73, respectively. These results indicate that the proposed stemming method produces a classifier model with a smaller error rate, so it predicts the class of an object more accurately. Conclusion: The existing Indonesian stemming methods are still oriented towards formal Indonesian sentences and are therefore limited when applied to non-formal Indonesian sentences. This motivates developing a corpus that normalizes non-formal Indonesian into formal Indonesian to serve as the basis of a better stemming method. Using this corpus as a stemming resource improves the accuracy of the classifier model. In the future, the proposed corpus and stemming method can be used for various purposes, including text clustering, summarization, hate speech detection, and other Indonesian text-processing applications.
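The two-stage idea described in the conclusion, normalizing non-formal words to formal forms before stemming, can be sketched as follows. The slang dictionary and suffix list below are tiny invented samples for illustration; they are not the authors' corpus or their actual stemming rules.

```python
# Illustrative slang-to-formal normalization dictionary (invented sample).
SLANG_TO_FORMAL = {
    "gak": "tidak",    # informal "not"
    "udah": "sudah",   # informal "already"
    "bgt": "banget",   # informal "very"
}

# A few common Indonesian suffixes (illustrative, not exhaustive).
SUFFIXES = ("kan", "an", "nya", "i")

def normalize(word):
    """Map a non-formal word to its formal equivalent when known."""
    return SLANG_TO_FORMAL.get(word, word)

def stem(word):
    """Normalize first, then strip one suffix if the remainder stays long enough."""
    word = normalize(word)
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

The point of the sketch is the ordering: without the normalization step, slang forms like "gak" never match the formal-language stemming rules at all.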


Target ◽  
2021 ◽  
Author(s):  
Samuel Läubli ◽  
Patrick Simianer ◽  
Joern Wuebker ◽  
Geza Kovacs ◽  
Rico Sennrich ◽  
...  

Abstract Widely used computer-aided translation (CAT) tools divide documents into segments, such as sentences, and arrange them side-by-side in a spreadsheet-like view. We present the first controlled evaluation of these design choices on translator performance, measuring speed and accuracy in three experimental text-processing tasks. We find significant evidence that sentence-by-sentence presentation enables faster text reproduction and within-sentence error identification compared to unsegmented text, and that a top-and-bottom arrangement of source and target sentences enables faster text reproduction compared to a side-by-side arrangement. For revision, on the other hand, we find that presenting unsegmented text results in the highest accuracy and time efficiency. Our findings have direct implications for best practices in designing CAT tools.


Author(s):  
Meftah Mohammed Charaf Eddine

In the field of machine translation of texts, ambiguity, both lexical (dictionary) and structural, remains one of the difficult problems. Researchers in this field use different approaches, the most important of which is machine learning in its various forms. The goal of the approach we propose in this article is to define a new concept of electronic text that is free from any lexical or structural ambiguity. We use a semantic coding system that attaches to the original electronic text (via the text editor interface) the meanings intended by the author: the author specifies the intended meaning of each word that could be a source of ambiguity. The proposed approach can be used with any type of electronic text (text-processing applications, web pages, email text, etc.). In the experiments we have conducted with it, the approach achieves a very high accuracy rate, and we can say that the problem of lexical and structural ambiguity can be completely solved. Under this new concept of electronic text, the text file contains not only the text but also, in the form of symbols, the exact meaning intended by the writer. These semantic symbols are used during machine translation to obtain a translated text completely free of lexical and structural ambiguity.
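One way to picture the semantic-coding idea is inline sense tags attached to ambiguous words. The tag syntax, function names, and sense labels below are invented for this sketch; the article does not specify its encoding format.

```python
import re

def annotate(text, senses):
    """Attach an author-chosen sense tag to each ambiguous word,
    e.g. 'bank' with sense 'fin' becomes 'bank<fin>'."""
    def tag(match):
        word = match.group(0)
        sense = senses.get(word.lower())
        return f"{word}<{sense}>" if sense else word
    return re.sub(r"\w+", tag, text)

def read_annotations(text):
    """Recover (word, sense) pairs from an annotated text, as a
    translation step would before choosing a target-language word."""
    return re.findall(r"(\w+)<(\w+)>", text)
```

A downstream translator reading `bank<fin>` no longer needs to disambiguate between the financial and the river sense: the author has already resolved it.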


2019 ◽  
Vol 31 (1) ◽  
pp. 53-81
Author(s):  
R. Brisco ◽  
R. I. Whitfield ◽  
H. Grierson

Abstract Selection of suitable computer-supported collaborative design (CSCD) technologies is crucial to facilitating successful projects. This paper presents the first systematic method for engineering design teams to evaluate and select the most suitable CSCD technologies by comparing technology functionality against project requirements established in peer-reviewed literature. The paper first presents 220 factors that influence successful CSCD. These factors were then systematically mapped and categorised to create CSCD requirement statements. The novel evaluation and selection method incorporates these requirement statements within a matrix and uses a discourse-analysis text-processing algorithm, applied to data from collaborative projects, to automatically populate the matrix with evidence of how technologies impact the success of CSCD in engineering design teams. The method was validated using data collected across three years of a student global design project. Its impact is the potential to change how engineering design teams consider the technology they use and how the selection of appropriate tools affects the success of their CSCD projects. The CSCD evaluation matrix is the first of its kind, enabling a systematic and justifiable comparison and technology selection, with the aim of best supporting engineering designers' collaborative design activity.
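At its core, the evaluation matrix scores candidate technologies against requirement statements. The sketch below shows only that core idea in the simplest possible form; the requirement phrasing, tool names, and binary satisfied/not-satisfied scoring are invented for illustration and are far cruder than the paper's method of 220 mapped factors.

```python
def score_technologies(requirements, technologies):
    """Rank technologies by how many requirement statements each satisfies.

    requirements: a set of requirement statements.
    technologies: dict mapping a tool name to the set of requirements it meets.
    Returns (name, score) pairs sorted best-first.
    """
    scores = {
        name: sum(1 for req in requirements if req in features)
        for name, features in technologies.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In the paper the matrix cells are populated automatically from project discourse data rather than filled in by hand, but the ranking step works on the same principle.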


IEEE Access ◽  
2021 ◽  
pp. 1-1
Author(s):  
Abdul Ghafoor ◽  
Ali Shariq Imran ◽  
Sher Muhammad Daudpota ◽  
Zenun Kastrati ◽  
Abdullah ◽  
...  
