n-Gram Based Language Processing using Twitter Dataset to Identify COVID-19 Patients

2021 ◽  
pp. 103048
Author(s):  
Nidal Nasser ◽  
Lutful Karim ◽  
Ahmed El Ouadrhiri ◽  
Asmaa Ali ◽  
Nargis Khan
2020 ◽  
Vol 30 (1) ◽  
pp. 192-208 ◽  
Author(s):  
Hamza Aldabbas ◽  
Abdullah Bajahzar ◽  
Meshrif Alruily ◽  
Ali Adil Qureshi ◽  
Rana M. Amir Latif ◽  
...  

Abstract Evaluating user needs and app quality is essential to maintaining a competitive edge in the mobile application market, and users’ feedback on these applications plays an essential role in the mobile application development industry. The rapid growth of web technology has given people the opportunity to interact with applications and to review, rate, and share feedback about them. In this paper we scraped 506,259 user reviews and application ratings from 14 different categories on the Google Play Store. The statistical information was measured using several common machine learning algorithms: Logistic Regression, Random Forest, and Multinomial Naïve Bayes. Accuracy, precision, recall, and F1 score were used to evaluate bigram, trigram, and n-gram features, and the statistical results of these algorithms were compared. Each algorithm was analyzed in turn and its results evaluated. It is concluded that logistic regression is the best algorithm for review analysis of Google Play Store applications, yielding the highest accuracy when classifying reviews into three classes: positive, negative, and neutral.
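The paper's code is not reproduced here; as a rough sketch of the kind of pipeline it describes (TF-IDF n-gram features feeding a logistic regression classifier, via scikit-learn), with placeholder reviews and labels standing in for the scraped dataset:

```python
# Minimal sketch of the described pipeline: n-gram features + logistic regression.
# The reviews and labels below are illustrative placeholders, not the paper's data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ["great app, works perfectly",
           "crashes on startup every time",
           "it is okay, nothing special"]
labels = ["positive", "negative", "neutral"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),  # unigrams through trigrams, as compared in the paper
    LogisticRegression(max_iter=1000),
)
model.fit(reviews, labels)
print(model.predict(["the app keeps crashing"]))  # -> ['negative']
```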


Author(s):  
Saugata Bose ◽  
Ritambhra Korpal

In this chapter, an initiative is proposed in which natural language processing (NLP) techniques and supervised machine learning algorithms are combined to detect external plagiarism. The major emphasis is on constructing a framework that detects plagiarism in monolingual texts by implementing an n-gram frequency comparison approach. The framework is based on 120 characteristics extracted during pre-processing using simple NLP techniques. Filter metrics are then applied to select the most relevant features, and a supervised classification algorithm is used to classify the documents into four levels of plagiarism. A confusion matrix is built to estimate the false positives and false negatives. Finally, the authors show that a C4.5 decision-tree-based classifier outperforms naive Bayes in accuracy. The framework achieved 89% accuracy with low false positive and false negative rates, and it shows higher precision and recall than the passage similarity, sentence similarity, and search space reduction methods.
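A toy illustration of the core n-gram frequency comparison (only this step; not the chapter's 120-feature framework or its C4.5 classifier), assuming simple whitespace tokenization:

```python
# Toy n-gram overlap score between a suspicious document and a source document.
# Illustrates only the n-gram comparison idea, with assumed tokenization.
from collections import Counter

def ngrams(text, n=3):
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_score(suspicious, source, n=3):
    a, b = ngrams(suspicious, n), ngrams(source, n)
    shared = sum((a & b).values())            # multiset intersection of n-grams
    return shared / max(sum(a.values()), 1)   # fraction of suspicious n-grams found in source

print(overlap_score("the quick brown fox jumps",
                    "a quick brown fox jumps high"))  # -> 0.666...
```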


Information ◽  
2019 ◽  
Vol 10 (10) ◽  
pp. 317 ◽  
Author(s):  
Karol Nowakowski ◽  
Michal Ptaszynski ◽  
Fumito Masui

Word segmentation is an essential task in automatic language processing for languages where there are no explicit word boundary markers, or where space-delimited orthographic words are too coarse-grained. In this paper we introduce the MiNgMatch Segmenter—a fast word segmentation algorithm which reduces the problem of identifying word boundaries to finding the shortest sequence of lexical n-grams matching the input text. In order to validate our method in a low-resource scenario involving extremely sparse data, we tested it on a small corpus of text in the critically endangered language of the Ainu people living in northern parts of Japan. Furthermore, we performed a series of experiments comparing our algorithm with systems utilizing state-of-the-art lexical n-gram-based language modelling techniques (namely, the Stupid Backoff model and a model with modified Kneser-Ney smoothing), as well as a neural model performing word segmentation as character sequence labelling. The experimental results we obtained demonstrate the high performance of our algorithm, comparable to that of the best-performing reference models. Given its low computational cost and competitive results, we believe that the proposed approach could be extended to other languages, and possibly also to other Natural Language Processing tasks, such as speech recognition.
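Segmentation as a shortest n-gram cover can be cast as dynamic programming; the following is a much-simplified sketch of that idea, not the authors' published MiNgMatch implementation, and the lexicon entries are toy values:

```python
# Simplified sketch: segment text into the fewest lexicon entries covering it.
# `lexicon` stands in for a table of lexical n-grams; entries are toy values.
def shortest_cover(text, lexicon, max_len=20):
    INF = float("inf")
    best = [0] + [INF] * len(text)      # best[i] = fewest entries covering text[:i]
    back = [None] * (len(text) + 1)     # backpointers for reconstruction
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            if text[j:i] in lexicon and best[j] + 1 < best[i]:
                best[i], back[i] = best[j] + 1, j
    if best[-1] == INF:
        return None                     # no full cover exists
    out, i = [], len(text)
    while i > 0:
        out.append(text[back[i]:i])
        i = back[i]
    return list(reversed(out))

lexicon = {"ku", "kor", "kukor", "seta"}        # toy entries
print(shortest_cover("kukorseta", lexicon))     # -> ['kukor', 'seta']
```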


2020 ◽  
pp. 1-25
Author(s):  
Kamila Polišenská ◽  
Shula Chiat ◽  
Jakub Szewczyk ◽  
Katherine E. Twomey

Abstract Theories of language processing differ with respect to the role of abstract syntax and semantics versus surface-level lexical co-occurrence (n-gram) frequency. The contribution of each of these factors has been demonstrated in previous studies of children and adults, but none have investigated them jointly. This study evaluated the role of all three factors in a sentence repetition task performed by children aged 4–7 and 11–12 years. It was found that semantic plausibility benefitted performance in both age groups; syntactic complexity disadvantaged the younger group but benefitted the older group; while, contrary to previous findings, n-gram frequency did not facilitate, and in a post-hoc analysis even hampered, performance. This new evidence suggests that n-gram frequency effects might be restricted to the highly constrained and frequent n-grams used in previous investigations, and that semantics and morphosyntax play a more powerful role than n-gram frequency, supporting the role of abstract linguistic knowledge in children's sentence processing.
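For readers unfamiliar with the predictor at stake, corpus-derived n-gram frequency is simply a count over sliding windows of words; a minimal illustration with a placeholder corpus (not the study's materials):

```python
# Minimal illustration of corpus-derived n-gram frequency: count sliding windows.
# The corpus string is a placeholder, not the study's stimuli or corpus.
from collections import Counter

corpus = "the dog chased the cat and the cat ran away".split()
bigram_freq = Counter(zip(corpus, corpus[1:]))
print(bigram_freq[("the", "cat")])  # -> 2
```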


First Monday ◽  
2009 ◽  
Author(s):  
Milen Martchev

This comparative study examines aspects of online behaviour exhibited by participants in online discussion groups in the United Kingdom and Japan. The primary data consists of message board threads gathered from U.K. and Japanese Internet forum sites, with the analysis focusing on hyperlinks contained in the forum messages as well as dates and times of posting extracted from the message heads. A 'reading' of hyperlinks is undertaken by consulting N-gram frequencies obtained from each data set, juxtaposed and compared with the help of a coefficient of difference and the chi-square test. Contrasts in Internet surfing patterns, information-gathering preferences, and references to video, audio, pictorial and sexual content are examined; post times are used to compare daily and weekly patterns of posting activity between the two countries. The study also provides an overview of the uses of N-grams in natural language processing and argues for their analytical potential in sociolinguistic and CMC-related research.
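The frequency comparison described can be reproduced in outline with a chi-square test on a 2×2 contingency table; the counts below are placeholders, not figures from the study:

```python
# Sketch of the comparison described: does an N-gram occur at different rates
# in the two forum corpora? Counts and corpus sizes are assumed placeholders.
from scipy.stats import chi2_contingency

ngram_uk, total_uk = 120, 50_000   # hits and corpus size, U.K. forums (assumed)
ngram_jp, total_jp = 40, 60_000    # hits and corpus size, Japanese forums (assumed)

table = [[ngram_uk, total_uk - ngram_uk],
         [ngram_jp, total_jp - ngram_jp]]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.2g}")
```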


2009 ◽  
Vol 08 (02) ◽  
pp. 249-265 ◽  
Author(s):  
Wen Zhang ◽  
Taketoshi Yoshida ◽  
Xijin Tang

As a hybrid of the N-gram in natural language processing and the collocation in statistical linguistics, the multi-word is becoming a hot topic in the area of text mining and information retrieval. In this paper, a study concerning the distribution of multi-words is carried out to explore a theoretical basis for a probabilistic term-weighting scheme. Specifically, the Poisson distribution, the zero-inflated binomial distribution, and the G-distribution are compared on the task of predicting the probabilities of multi-word occurrences, for both technical and nontechnical multi-words. In addition, a rule-based multi-word extraction algorithm is proposed to extract multi-words from texts based on their occurrence patterns and syntactic structures. Our experimental results demonstrate that the G-distribution has the best capability to predict the occurrence frequencies of multi-words, and that the Poisson distribution is comparable to the zero-inflated binomial distribution in estimating multi-word distributions. The outcome of this study validates that burstiness is a universal phenomenon in linguistic count data, applying not only to individual content words but also to multi-words.
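As a point of reference for the simplest of the compared models, a Poisson estimate of a multi-word's occurrence probabilities looks like this (the rate is an assumed placeholder; the G-distribution and zero-inflated binomial are not shown):

```python
# The simplest model compared in the paper: a Poisson estimate of how often a
# multi-word term occurs in a document. The rate parameter is assumed.
from scipy.stats import poisson

rate = 0.8  # average occurrences of the term per document (placeholder)
for k in range(4):
    print(f"P(term occurs {k} times) = {poisson.pmf(k, rate):.3f}")
# Bursty terms violate this model: once a term appears, it tends to reappear,
# which is why a heavier-tailed model such as the G-distribution fits better.
```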


2016 ◽  
Vol 105 (1) ◽  
pp. 63-76
Author(s):  
Theresa Guinard

Abstract Morphological analysis (finding the component morphemes of a word and tagging morphemes with part-of-speech information) is a useful preprocessing step in many natural language processing applications, especially for synthetic languages. Compound words from the constructed language Esperanto are formed by straightforward agglutination, but for many words, there is more than one possible sequence of component morphemes. However, one segmentation is usually more semantically probable than the others. This paper presents a modified n-gram Markov model that finds the most probable segmentation of any Esperanto word, where the model’s states represent morpheme part-of-speech and semantic classes. The overall segmentation accuracy was over 98% for a set of presegmented dictionary words.
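A much-simplified sketch of the idea, scoring candidate segmentations with a bigram Markov model over morpheme classes (the morpheme inventory and transition log-probabilities are toy values, not the paper's model):

```python
# Simplified sketch: pick the Esperanto segmentation whose morpheme-class
# transitions score highest under a bigram Markov model. All values are toy.
import math

MORPHEMES = {"san": "ROOT", "ul": "SUFFIX", "o": "ENDING",
             "sa": "ROOT", "nul": "ROOT"}                    # assumed inventory
TRANS = {("START", "ROOT"): -0.1, ("ROOT", "SUFFIX"): -0.5,
         ("SUFFIX", "ENDING"): -0.2, ("ROOT", "ROOT"): -3.0,
         ("ROOT", "ENDING"): -0.7}                           # assumed log-probs

def best_segmentation(word, pos=0, prev="START"):
    if pos == len(word):
        return 0.0, []
    best_score, best_segs = -math.inf, None
    for end in range(pos + 1, len(word) + 1):
        m = word[pos:end]
        if m in MORPHEMES:
            step = TRANS.get((prev, MORPHEMES[m]), -math.inf)
            score, rest = best_segmentation(word, end, MORPHEMES[m])
            if step + score > best_score:
                best_score, best_segs = step + score, [m] + rest
    return best_score, best_segs

print(best_segmentation("sanulo"))  # ≈ (-0.8, ['san', 'ul', 'o'])
```

Here the implausible split "sa-nul-o" loses to "san-ul-o" because the ROOT→ROOT transition is heavily penalized, mirroring how class transitions disambiguate segmentations.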


2014 ◽  
Vol 9 (3) ◽  
pp. 437-472 ◽  
Author(s):  
Cyrus Shaoul ◽  
R. Harald Baayen ◽  
Chris F. Westbury

What knowledge influences our choice of words when we write or speak? Predicting which word a person will produce next is not easy, even when the linguistic context is known. One task that has been used to assess context-dependent word choice is the fill-in-the-blank task, also called the cloze task. The cloze probability of a specific context is an empirical measure found by asking many people to fill in the blank. In this paper we harness the power of large corpora to look at the influence of corpus-derived probabilistic information from a word's micro-context on word choice. We asked young adults to complete short phrases called n-grams, with up to 20 responses per phrase. The probability of the response word and the conditional probability of the response given the context were predictive of the frequency with which each response was produced. Furthermore, the order in which participants generated multiple completions of the same context was also predicted by the conditional probability. These results suggest that word choice in cloze tasks taps into implicit knowledge of a person's past experience with that word in various contexts. The importance of n-gram conditional probabilities in our analysis is further evidence of implicit knowledge about multi-word sequences and supports theories of language processing that involve anticipating or predicting based on context.
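The conditional probability used as a predictor here can be read directly off corpus n-gram counts; a minimal illustration with placeholder counts:

```python
# P(word | context) estimated from corpus n-gram counts. The counts and the
# example phrase are placeholders, not values from the authors' corpus.
count_context = 12_000     # e.g. corpus occurrences of the 3-gram "a cup of"
count_completion = 9_600   # e.g. corpus occurrences of "a cup of tea"
p_word_given_context = count_completion / count_context
print(p_word_given_context)  # -> 0.8
```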


2021 ◽  
Vol 24 (67) ◽  
pp. 1-17
Author(s):  
Flávio Arthur O. Santos ◽  
Thiago Dias Bispo ◽  
Hendrik Teixeira Macedo ◽  
Cleber Zanchettin

Natural language processing systems have attracted much interest from industry. This branch of study comprises applications such as machine translation, sentiment analysis, named entity recognition, question answering, and others. Word embeddings (i.e., continuous word representations) are an essential module for those applications, generally used as the word representation fed to machine learning models. Popular methods to train word embeddings include GloVe and Word2Vec. They achieve good word representations, despite two limitations: both ignore the morphological information of words and consider only one representation vector per word. As a result, the word embeddings do not properly account for different word contexts and are unaware of a word's inner structure. To mitigate this problem, the FastText method represents each word as a bag of character n-grams: a continuous vector describes each n-gram, and the final word representation is the sum of its character n-gram vectors. Nevertheless, using all character n-grams of a word is a poor approach, since some n-grams have no semantic relation to their words and increase the amount of potentially useless information; it also lengthens the training phase. In this work, we propose a new method for training word embeddings whose goal is to replace FastText's bag of character n-grams with a bag of word morphemes obtained through morphological analysis of the word. Thus, words with similar contexts and morphemes are represented by vectors close to each other. To evaluate our new approach, we performed intrinsic evaluations on 15 different tasks, and the results show competitive performance compared to FastText. Moreover, the proposed model is 40% faster than FastText in the training phase. We also outperform the baseline approaches in extrinsic evaluations, on hate speech detection and NER tasks, using different scenarios.
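A minimal sketch of the proposed representation, summing morpheme vectors in place of FastText's character n-gram vectors (the segmentation and the vector table are illustrative placeholders, not the trained model):

```python
# Sketch: a word vector as the sum of its morpheme vectors, in place of
# FastText's sum over all character n-grams. Vectors here are random stand-ins.
import numpy as np

DIM = 8
rng = np.random.default_rng(0)
morpheme_vectors = {m: rng.normal(size=DIM) for m in ["un", "break", "able"]}

def word_vector(morphemes):
    # Sum the morpheme vectors, as FastText sums character n-gram vectors.
    return np.sum([morpheme_vectors[m] for m in morphemes], axis=0)

v = word_vector(["un", "break", "able"])  # vector for "unbreakable" (assumed segmentation)
print(v.shape)  # -> (8,)
```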


Information ◽  
2020 ◽  
Vol 11 (1) ◽  
pp. 51 ◽  
Author(s):  
Kien Tran ◽  
Hiroshi Sato ◽  
Masao Kubo

The ability to stop malware as soon as it starts spreading will always play an important role in defending computer systems. It would be a huge benefit to organizations as well as society if intelligent defense systems could detect and prevent new types of malware after seeing only a tiny number of samples. The approach introduced in this paper takes advantage of one-shot/few-shot learning algorithms to solve the malware classification problem, using a Memory Augmented Neural Network in combination with natural language processing techniques such as word2vec and n-grams. We embed the malware's API calls, which are very valuable sources of information for identifying malware behavior, in different feature spaces, and then feed them to the one-shot/few-shot learning models. Evaluating the models on two datasets (FFRI 2017 and APIMDS) shows that models with different parameters can yield high accuracy on malware classification with only a few samples. For example, on the APIMDS dataset, the model classified 78.85% of samples correctly after seeing only nine malware samples, and 89.59% after fine-tuning with a few other samples. The results confirm very good accuracy compared to traditional methods and point to a new area of malware research.
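A sketch of the feature-extraction step described, treating each sample's API call sequence as a "sentence" for word2vec (via gensim); the API names and sequences are placeholders, and the memory-augmented few-shot model itself is not shown:

```python
# Sketch of the embedding step: train word2vec over API call sequences so each
# call gets a vector. Sequences below are placeholders, not FFRI/APIMDS data.
from gensim.models import Word2Vec

api_sequences = [
    ["CreateFileW", "WriteFile", "CloseHandle"],
    ["RegOpenKeyExW", "RegSetValueExW", "RegCloseKey"],
    ["CreateFileW", "ReadFile", "CloseHandle"],
]
model = Word2Vec(api_sequences, vector_size=32, window=2, min_count=1, epochs=50)
print(model.wv["CreateFileW"].shape)  # -> (32,); fed downstream to the few-shot model
```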

