ML-EC2

Author(s):  
Aakanksha Sharaff ◽  
Naresh Kumar Nagwani

A multi-label variant of email classification named ML-EC2 (multi-label email classification using clustering) is proposed in this work. ML-EC2 is a hybrid algorithm based on text clustering, text classification, frequent-term calculation (based on Latent Dirichlet Allocation), and a taxonomic term-mapping technique. It is an example of classification using a text clustering technique. It studies the setting in which each email cluster represents a single class label while being associated with a set of cluster labels. It is a multi-label, text-clustering-based classification algorithm in which an email cluster can be mapped to more than one email category when its cluster label matches more than one category term. The algorithm is helpful when there is only a vague idea of the topic. The performance parameters Entropy and the Davies-Bouldin Index are used to evaluate the designed algorithm.
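The cluster-to-category mapping step can be illustrated with a minimal sketch, assuming a toy email set and a hypothetical category taxonomy; this is not the authors' ML-EC2 implementation, only the general idea of assigning every matching category to a cluster:

```python
# Illustrative sketch only, not the authors' ML-EC2 algorithm.
# Emails are clustered, each cluster's top terms act as its cluster label,
# and the cluster is mapped to every taxonomy category whose terms match,
# so one cluster can yield multiple labels. Emails and taxonomy are
# hypothetical placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

emails = [
    "meeting agenda for the quarterly budget review",
    "invoice payment overdue, please remit budget funds",
    "football match tickets and weekend sports plans",
]
taxonomy = {"finance": {"budget", "invoice", "payment"},
            "sports": {"football", "match", "tickets"},
            "meetings": {"meeting", "agenda"}}

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(emails)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
terms = np.array(vectorizer.get_feature_names_out())

for c in range(km.n_clusters):
    # top terms of the cluster serve as its cluster label
    top = set(terms[np.argsort(km.cluster_centers_[c])[::-1][:5]])
    # map the cluster to every category whose taxonomy terms overlap the label
    labels = [cat for cat, words in taxonomy.items() if words & top]
    members = [i for i, lab in enumerate(km.labels_) if lab == c]
    print(f"cluster {c}: label terms={sorted(top)} -> categories={labels}, emails={members}")
```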

2020 ◽  
Author(s):  
Jiting Tang ◽  
Saini Yang ◽  
Weiping Wang

In 2019, typhoon Lekima hit China, bringing strong winds and heavy rainfall to nine provinces and municipalities on the northeastern coast of China. According to the Ministry of Emergency Management of the People's Republic of China, Lekima caused 66 direct fatalities, affected 14 million people, and was responsible for a direct economic loss in excess of 50 billion yuan. Current observation technologies include remote sensing and meteorological observation, but they have a long data-collection cycle and little interaction with disaster victims. Social media big data is a new data source for natural disaster research that can provide a technical reference for natural hazard analysis, risk assessment and emergency rescue information management.

We propose a social-media-data-based assessment framework for typhoon-induced floods, which includes five parts:

(1) Data acquisition. Obtain Sina Weibo text and tag attributes based on keywords, time and location.

(2) Spatiotemporal quantitative analysis. Collect public concerns and trends from the perspectives of words, time and space at different scales to judge the impact range of the typhoon-induced flood.

(3) Text classification and multi-source heterogeneous data fusion analysis. Build a hazard-intensity and disaster text classification model with CNNs (Convolutional Neural Networks), then integrate multi-source data, including meteorological monitoring, population and economic data, and disaster reports, for secondary evaluation and correction.

(4) Text clustering and sub-event mining. Extract sub-events with the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) text clustering algorithm for automatic recognition of emergencies (a minimal sketch follows this abstract).

(5) Emotional analysis and crisis management. Use a time-space sequence model and a four-quadrant analysis method to track negative public emotions and find potential crises for emergency management.

This framework is validated with a case study of typhoon Lekima. The results show that social media big data makes up for gaps in data efficiency and spatial coverage. Our framework can assess the influence coverage, hazard intensity, disaster information and emergency needs, and it can trace back the disaster propagation process from the spatiotemporal sequence. The assessment results, after the secondary correction with multi-source data, can be used in an operational system.

The proposed framework can be applied over a wide spatial scope, even with full coverage; it is spatially efficient and can obtain feedback from affected areas and people almost immediately as a disaster occurs. Hence, it has promising potential for large-scale, real-time disaster assessment.
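As referenced in part (4) above, the following is a minimal sketch of BIRCH-based sub-event mining on hypothetical Weibo-style posts, using scikit-learn rather than the authors' pipeline; the keyword filtering, CNN classification and multi-source fusion steps are omitted:

```python
# Minimal sketch of step (4), sub-event mining with BIRCH text clustering.
# The posts are hypothetical placeholders, and the threshold would need
# tuning on a real corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import Birch

posts = [
    "road flooded near the river, cars stranded",
    "power outage in the coastal district after the storm",
    "river embankment overflow, residents evacuated",
    "electricity still down, need generators",
]

X = TfidfVectorizer().fit_transform(posts)
birch = Birch(n_clusters=2, threshold=0.5)   # threshold tuned per corpus
labels = birch.fit_predict(X)

for cluster_id in sorted(set(labels)):
    subevent = [p for p, l in zip(posts, labels) if l == cluster_id]
    print(f"sub-event {cluster_id}: {subevent}")
```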


2019 ◽  
Vol 17 (2) ◽  
pp. 241-249
Author(s):  
Yangyang Li ◽  
Bo Liu

The short, sparse nature of short texts, together with synonyms and homonyms, is the main obstacle to short-text classification. In recent years, research on short-text classification has focused on expanding short texts but has rarely guaranteed the validity of the expanded words. This study proposes a new method to weaken these effects without external knowledge. The proposed method analyses short texts with a topic model based on Latent Dirichlet Allocation (LDA), represents each short text with a vector space model, and presents a new way to adjust the vectors of short texts. In the experiments, two open short-text data sets, composed of Google News and web search snippets, are used to evaluate the classification performance and demonstrate the effectiveness of our method.
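A hedged sketch of the general idea, using scikit-learn's LDA on toy short texts, is shown below; the adjustment rule is an illustrative choice, not the authors' exact formula:

```python
# Sketch: fit an LDA topic model on short texts and use topic-word weights
# to adjust (enrich) each text's vector space representation, giving nonzero
# weight to related words that are absent from the short text.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = ["apple unveils new phone", "stocks fall on rate fears",
         "new phone camera review", "markets rally after rate cut"]

vec = CountVectorizer()
X = vec.fit_transform(texts)                       # term-frequency VSM
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

doc_topics = lda.transform(X)                      # (n_docs, n_topics)
topic_words = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# Adjusted vector: original counts plus expected topic-word mass.
alpha = 0.5                                        # illustrative mixing weight
X_adj = X.toarray() + alpha * doc_topics @ topic_words
print(X_adj.round(2))
```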


2019 ◽  
Author(s):  
Ayoub Bagheri ◽  
Daniel Oberski ◽  
Arjan Sammani ◽  
Peter G.M. van der Heijden ◽  
Folkert W. Asselbergs

Abstract

Background: With the increasing use of unstructured text in electronic health records, extracting useful related information has become a necessity. Text classification can be applied to extract patients' medical history from clinical notes. However, the sparsity of short clinical notes, that is, excessively small word counts in the text, can lead to large classification errors. Previous studies demonstrated that natural language processing (NLP) can be useful for the text classification of clinical outcomes. We propose incorporating knowledge from unlabeled data, as this may alleviate the problem of short, noisy, sparse text.

Results: The software package SALTClass (short and long text classifier) is a machine learning NLP toolkit. It uses seven clustering algorithms, namely latent Dirichlet allocation, K-Means, MiniBatchKMeans, BIRCH, MeanShift, DBSCAN, and GMM. Smoothing methods are applied to the resulting cluster information to enrich the representation of sparse text. For the subsequent prediction step, SALTClass can be used on either the original document-term matrix or in an enrichment pipeline. To this end, ten different supervised classifiers have also been integrated into SALTClass. We demonstrate the effectiveness of the SALTClass NLP toolkit in the identification of patients' family history in a Dutch clinical cardiovascular text corpus from University Medical Center Utrecht, the Netherlands.

Conclusions: The considerable amount of unstructured short text in healthcare applications, particularly in clinical cardiovascular notes, has created an urgent need for tools that can parse specific information from text reports. Using machine learning algorithms to enrich short text can improve the representation for further applications.

Availability: SALTClass can be downloaded as a Python package from the Python Package Index (PyPI) at https://pypi.org/project/saltclass and from GitHub at https://github.com/bagheria/saltclass.
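The enrichment idea can be sketched with plain scikit-learn (not the SALTClass API itself, which is documented at the links above); the clinical notes, labels and smoothing weight below are hypothetical:

```python
# Illustrative sketch of cluster-based enrichment for sparse text: cluster
# the notes, then smooth each document vector toward its cluster centroid so
# that very short notes borrow related vocabulary, before classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

notes = ["father had myocardial infarction", "no family history of heart disease",
         "mother treated for hypertension", "family history negative"]
labels = [1, 0, 1, 0]   # 1 = positive family history (hypothetical)

X = TfidfVectorizer().fit_transform(notes).toarray()

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
beta = 0.3               # illustrative smoothing weight
X_enriched = (1 - beta) * X + beta * km.cluster_centers_[km.labels_]

clf = LogisticRegression().fit(X_enriched, labels)
print(clf.predict(X_enriched))
```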


Author(s):  
Zhihua Wei ◽  
Duoqian Miao ◽  
Ruizhi Wang ◽  
Zhifei Zhang

Text representation is the prerequisite of various document processing tasks, such as information retrieval, text classification, and text clustering. It has been studied intensively for the past few years, and many excellent models have been designed. However, the performance of these models suffers from data sparseness. Existing smoothing techniques usually use statistical theory or linguistic information to assign a uniform distribution to absent words; they take no account of the real word distribution and do not distinguish between words. In this chapter, a method based on a soft computing theory, Tolerance Rough Set theory, is proposed: it uses the upper- and lower-approximation machinery of Rough Set theory to assign different values to absent words in different approximation regions. Theoretically, our algorithms can estimate the smoothing value of an absent word according to its relation to the existing words. Text classification experiments using the Vector Space Model (VSM) and the Latent Dirichlet Allocation (LDA) model on public corpora show that our algorithms greatly improve the performance of the text representation models, especially on unbalanced corpora.
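A rough, illustrative sketch of the idea follows: terms that co-occur with a document's terms form (an approximation of) its upper approximation and receive a small nonzero smoothing weight instead of zero. The co-occurrence threshold and the weight are assumptions, not the chapter's exact scheme:

```python
# Sketch of tolerance-rough-set-style smoothing on toy documents: absent
# words that are tolerance-related (here: co-occur at least once) to the
# document's present words get a small weight; unrelated words stay zero.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["rough set theory for text", "text representation with vector space",
        "latent topics improve representation", "rough approximation of vague sets"]

vec = CountVectorizer()
X = vec.fit_transform(docs).toarray()
cooc = (X.T @ X) > 0          # term-term co-occurrence relation
np.fill_diagonal(cooc, False)
theta_weight = 0.2            # illustrative smoothing value

X_smooth = X.astype(float)
for d in range(X.shape[0]):
    present = X[d] > 0
    # terms absent from document d but related to at least one present term
    upper = (~present) & cooc[present].any(axis=0)
    X_smooth[d, upper] = theta_weight

print(np.array(vec.get_feature_names_out())[X_smooth[0] > 0])
```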


Author(s):  
Na Zheng ◽  
Jie Yu Wu

A clustering method based on Latent Dirichlet Allocation and the VSM model for computing text similarity is presented. The LDA topic model and the VSM vector-space weighting strategy are each used to calculate a text similarity, and a linear combination of the two results gives the final similarity. The k-means clustering algorithm is then chosen for the cluster analysis. This not only addresses the loss of deep semantic information in traditional text clustering, but also the problem that LDA alone cannot distinguish texts because of excessive dimensionality reduction. Deep semantic information is thus mined from the text, and clustering efficiency is improved. Comparisons with traditional methods show that this algorithm improves the performance of text clustering.
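A sketch of the combined-similarity idea is shown below under stated assumptions: similarities are computed from TF-IDF (VSM) and from LDA topic distributions and linearly combined; because scikit-learn's KMeans operates on feature vectors rather than a precomputed similarity matrix, the clustering step here concatenates the two weighted representations as an approximation of that combination:

```python
# Sketch: combined VSM + LDA text similarity, followed by k-means clustering.
# The texts and the mixing weight are illustrative placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

texts = ["stock markets rally", "team wins the final match",
         "interest rates and inflation", "coach praises the players"]
w = 0.6                                          # weight of the VSM similarity

X_vsm = TfidfVectorizer().fit_transform(texts)
X_counts = CountVectorizer().fit_transform(texts)
X_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X_counts)

# linear combination of the two similarities
sim = w * cosine_similarity(X_vsm) + (1 - w) * cosine_similarity(X_lda)
print(sim.round(2))

# k-means on the concatenated, weighted representations (approximation)
features = np.hstack([w * X_vsm.toarray(), (1 - w) * X_lda])
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features))
```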


2006 ◽  
Vol 05 (03) ◽  
pp. 211-222
Author(s):  
Imad Rahal ◽  
Hassan Najadat ◽  
William Perrizo

The importance of text mining stems from the availability of huge volumes of text databases holding a wealth of valuable information that needs to be mined. Text mining is a broad area encompassing many finer branches, one of which is text categorisation or text classification. Text categorisation is the process of assigning class labels to documents based entirely on their textual contents: given a document d, we are asked to find its subject matter or class label Ci. In this paper, an optimised k-Nearest Neighbours classifier that uses discretisation, P-tree technology, and dimensionality reduction to achieve a high degree of accuracy, space utilisation and time efficiency is proposed. One of the fundamental contributions of this work is that, as new samples arrive, the proposed classifier can find the k nearest neighbours of the new sample in the training space without a single database scan.
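For orientation, a baseline k-nearest-neighbours text classifier on TF-IDF vectors is sketched below; the paper's P-tree-based optimisation, discretisation and dimensionality reduction are not reproduced, and the documents and labels are hypothetical:

```python
# Baseline sketch only: standard kNN text categorisation with scikit-learn,
# the scheme the paper's P-tree classifier optimises.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

train_docs = ["goal scored in the final minute", "parliament passes new bill",
              "striker signs for the club", "election results announced"]
train_labels = ["sports", "politics", "sports", "politics"]

knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
knn.fit(train_docs, train_labels)
print(knn.predict(["late goal decides the match"]))   # expected: sports
```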

