Stemming is a common word conflation method that identifies the stem embedded in a word and reduces the word to that stem (root), conflating all morphologically related terms into a single term without performing a complete morphological analysis. This article presents STEMUR, an enhanced stemming algorithm for automatic word conflation in the Urdu language. In addition to handling words with prefixes and suffixes, STEMUR also handles words with infixes. Rather than using a totally unsupervised approach, we used linguistic knowledge to develop a collection of patterns for Urdu infixes, which improves the accuracy of the stems and affixes acquired during the training process. STEMUR also handles English loan words and words with more than one affix. STEMUR is compared with four existing Urdu stemmers, including Assas-Band and the template-based stemmer, which are also implemented in this study. Results are reported on two corpora containing 89,437 and 30,907 words, respectively. They show clear improvements in the strength and accuracy of STEMUR. Using the maximum possible number of infix rules boosted our stemmer's accuracy up to 93.1% and helped us achieve a precision of 98.9%.
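The rule-driven part of such a stemmer can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the affix lists, the single infix pattern, the loan-word set, and the minimum stem length are toy English stand-ins chosen for readability, not STEMUR's Urdu rules, and the order of the steps (loan-word check, then infix patterns, then repeated affix stripping) is one plausible arrangement rather than the published algorithm.

```python
import re

# Hypothetical toy rules (English stand-ins), NOT the rule lists used by STEMUR.
PREFIXES = ("un", "re")
SUFFIXES = ("ness", "ful", "ing", "s")
# An infix rule pairs a whole-word pattern with a template that rebuilds the stem.
INFIX_PATTERNS = [(re.compile(r"(f|t)ee(t|th)"), r"\1oo\2")]
# Loan words are returned unchanged so native affix rules cannot damage them.
LOAN_WORDS = {"computer"}
MIN_STEM_LEN = 3  # assumed guard against stripping a word down to nothing


def strip_one_affix(word):
    """Remove at most one prefix or suffix, keeping at least MIN_STEM_LEN letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            return word[len(prefix):]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[:-len(suffix)]
    return word


def stem(word):
    """Return a stem using loan-word, infix, and affix rules, in that order."""
    if word in LOAN_WORDS:
        return word
    for pattern, template in INFIX_PATTERNS:
        match = pattern.fullmatch(word)
        if match:
            return match.expand(template)
    current = word
    while True:  # repeat so that words carrying several affixes are fully reduced
        shorter = strip_one_affix(current)
        if shorter == current:
            return current
        current = shorter


if __name__ == "__main__":
    for w in ("unhelpfulness", "teeth", "computer", "runs"):
        print(w, "->", stem(w))
```

The loop in `stem` is what lets a word with more than one affix be reduced step by step, while the infix patterns are tried on the whole word because an infix changes the interior of the stem rather than its edges.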
Online Social Networks (OSNs) are a popular platform for communication and collaboration. Spammers are highly active in OSNs, and uncovering them has become one of the most challenging problems in OSNs. Classification-based supervised approaches are the most commonly used method for detecting spammers. Classification-based systems suffer from the limitations of “data labelling”, “spam drift”, “imbalanced datasets” and “data fabrication”. These limitations affect the accuracy of a classifier’s detection. An unsupervised approach does not require labelled datasets. We aim to address the limitations of data labelling and spam drift through an unsupervised approach. We present a purely unsupervised approach for spammer detection based on the peer acceptance of a user in a social network to distinguish spammers from genuine users. The peer acceptance of one user by another is calculated from the interests the two users share across multiple common topics. The main contribution of this paper is the introduction of a purely unsupervised spammer detection approach based on users’ peer acceptance. Our approach does not require labelled training datasets. While it does not better the accuracy of supervised classification-based approaches, our approach is a successful alternative to traditional classifiers for spam detection, achieving an accuracy of 96.9%.
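The idea of peer acceptance can be sketched as follows. The abstract does not give the formula, so this sketch makes its own assumptions: each user is represented as a mapping from topic to a set of interest terms, acceptance between two users is the Jaccard overlap of their interests averaged over the topics they share, and a user's overall score is the mean acceptance from all other users. The function names and the sample profiles are hypothetical.

```python
def peer_acceptance(profile_a, profile_b):
    """Average Jaccard overlap of interests over the topics both users post about.

    Assumed measure for illustration; the paper's exact formula is not stated here.
    """
    shared_topics = profile_a.keys() & profile_b.keys()
    if not shared_topics:
        return 0.0
    total = 0.0
    for topic in shared_topics:
        union = profile_a[topic] | profile_b[topic]
        common = profile_a[topic] & profile_b[topic]
        total += len(common) / len(union) if union else 0.0
    return total / len(shared_topics)


def acceptance_scores(users):
    """Mean peer acceptance each user receives from every other user."""
    scores = {}
    for name, profile in users.items():
        received = [
            peer_acceptance(profile, other)
            for other_name, other in users.items()
            if other_name != name
        ]
        scores[name] = sum(received) / len(received) if received else 0.0
    return scores


# Hypothetical profiles: topic -> set of interest terms.
users = {
    "alice": {"sports": {"football", "tennis"}, "music": {"jazz"}},
    "bob": {"sports": {"football", "tennis"}, "music": {"jazz", "rock"}},
    "promo_account": {"sports": {"casino"}, "deals": {"pills"}},
}

if __name__ == "__main__":
    scores = acceptance_scores(users)
    # No labels are used: the least accepted account is simply ranked lowest.
    for name in sorted(scores, key=scores.get):
        print(name, scores[name])
```

Because the score is computed from the users' own content, no labelled training data is involved; where to place the cut-off between spammers and genuine users is a separate decision that this sketch leaves open.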
Electronic medical records (EMRs) contain a great deal of valuable data about patients, but this data is unstructured. In addition, there is a lack of both labeled medical text data in Russian and tools for automatic annotation. As a result, it is currently hardly feasible for researchers to use EMR text data to train machine learning models in the biomedical domain. We present an unsupervised approach to medical data annotation. Syntactic trees are produced from the initial sentences using morphological and syntactic analyses. In the resulting trees, similar subtrees are grouped using Node2Vec and Word2Vec and labeled using domain vocabularies and Wikidata categories. Using Wikidata categories increased the fraction of labeled sentences 5.5 times compared with labeling with domain vocabularies only. We show on a validation dataset that the proposed labeling method generates meaningful labels correctly for 92.7% of groups. Annotation with domain vocabularies and Wikidata categories covered more than 82% of the sentences in the corpus; when extended with timestamp and event labels, it covered 97% of the sentences. The resulting method can be used to label EMRs in Russian automatically. Additionally, the proposed methodology can be applied to other languages that lack resources for automatic labeling and domain vocabularies.
The current practice of adjusting hearing aids (HA) is tiring and time-consuming for both patients and audiologists. Of hearing-impaired people, 40–50% are not satisfied with their HAs. In addition, good designs of HAs are often avoided since the process of fitting them is exhausting. To improve the fitting process, an unsupervised machine learning (ML) approach is proposed to cluster pure-tone audiograms (PTA). This work applies the spectral clustering (SP) approach to group audiograms according to their similarity in shape. Different SP approaches are tested, and they are evaluated using the Silhouette, Calinski-Harabasz, and Davies-Bouldin criteria values. The Kutools for Excel add-in is used to generate a population of audiograms, which is annotated using the results from SP, and different criteria values are used to evaluate the population clusters. Finally, these clusters are mapped to a standard set of audiograms used in HA characterization. The results indicate that grouping the data into 8 or 10 groups yields clusters with high evaluation criteria values. The evaluation of the population audiogram clusters shows good performance, with a Silhouette coefficient >0.5. This work introduces a new concept for classifying audiograms with an ML algorithm according to the audiograms’ similarity in shape.
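To make the Silhouette criterion concrete, here is a small sketch that computes the mean Silhouette coefficient for a given cluster assignment using only the standard library. It follows the usual definition, s(i) = (b_i - a_i) / max(a_i, b_i), where a_i is the mean distance from point i to the other members of its cluster and b_i is the smallest mean distance to the members of any other cluster. The four audiograms (hearing thresholds in dB HL at four frequencies) and their cluster labels are invented for illustration, and Euclidean distance is an assumed similarity measure; the clustering step itself is not shown.

```python
import math
from collections import defaultdict


def silhouette_score(points, labels):
    """Mean Silhouette coefficient of a labelling, using Euclidean distance."""
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)
    if len(clusters) < 2:
        raise ValueError("the Silhouette coefficient needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        if denominator > 0:
            total += (b - a) / denominator
    return total / len(points)


# Hypothetical audiograms: thresholds (dB HL) at four test frequencies.
audiograms = [
    (20, 20, 20, 20),  # flat
    (25, 25, 25, 25),  # flat
    (20, 40, 60, 80),  # sloping
    (25, 45, 65, 85),  # sloping
]
by_shape = [0, 0, 1, 1]  # groups audiograms with the same shape
mixed = [0, 1, 0, 1]     # deliberately poor grouping for comparison

if __name__ == "__main__":
    print("grouped by shape:", silhouette_score(audiograms, by_shape))
    print("mixed grouping:  ", silhouette_score(audiograms, mixed))
```

On this toy data the shape-based grouping scores well above 0.5 while the mixed grouping scores below zero, which is the kind of contrast the criterion is meant to expose when comparing candidate numbers of clusters.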
We present a method to provide a biologically meaningful representation of the space of protein sequences. While billions of protein sequences are available, organizing this vast amount of information into functional categories is daunting and time-consuming, and the result remains incomplete. Our unsupervised approach combines Transformer protein language models, UMAP graphs, and spectral clustering to create meaningful clusters in the protein space. To demonstrate the meaningfulness of the clusters, we show that they preserve most of the signal present in a dataset of manually curated enzyme protein families.
Online social networking platforms allow people to freely express their ideas, opinions, and emotions, whether negative or positive. Previous studies have examined users’ sentiments on these platforms to study their behaviour in different contexts and for different purposes. The mechanism of collecting public opinion information has led researchers to analyze social media data and automatically classify the polarity of public opinions based on the concise language used in messages such as tweets. In this paper, we extend the preceding work by proposing an unsupervised approach to automatically detect extreme opinions/posts in social networks. We have evaluated our approach on five different social network and media datasets. In this work, we use the semi-supervised approach BERT to check the accuracy of our classified dataset. The latter task shows that, in these datasets, posts that were previously classified as negative or positive are, in fact, extremely negative or positive in many cases.