Automatic Genre Categorization of Emails into predefined categories using machine learning

Author(s):  
Vinod Kumar Bhalla et al.

In today’s dynamic world, there is a need for fast, efficient, and reliable means of communication. To meet these requirements, the email system was developed, and it became popular with the invention of the WWW. Today, email is used extensively for official, business, and personal communication. On average, an individual user receives 50-60 emails each day, and managing them is becoming a burden. There is therefore a need for an effective and reliable means of organizing emails for easy and fast retrieval. This paper proposes an efficient approach that classifies emails into predefined genres. The research shows that classifying emails greatly improves efficiency and saves the time and effort needed to manage them. The results are very encouraging: over 90% of emails are categorized correctly. Email genres are predefined, and a corresponding keyword list is generated for each. The tf-idf frequency of the keywords in an email decides its genre. An SVM is used as a multiclass classifier. The need for negative training data is removed, as the proposed classifier works on the principle of one class against the rest.
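
The keyword-scoring step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the genre names, the keyword lists, the smoothed idf formula, and the helper names (`tokenize`, `genre_scores`, `predict_genre`) are all assumptions, and a simple highest-score rule stands in for the SVM stage that the abstract mentions.

```python
import math
import re
from collections import Counter

# Hypothetical genres and keyword lists (illustrative only, not from the paper).
GENRE_KEYWORDS = {
    "business": ["invoice", "payment", "contract"],
    "personal": ["dinner", "birthday", "weekend"],
    "official": ["meeting", "agenda", "deadline"],
}


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


def idf(term, corpus_token_sets):
    """Smoothed inverse document frequency of a term over the corpus."""
    df = sum(1 for doc in corpus_token_sets if term in doc)
    return math.log((1 + len(corpus_token_sets)) / (1 + df)) + 1.0


def genre_scores(email, corpus):
    """Sum the tf-idf weights of each genre's keywords found in the email."""
    corpus_token_sets = [set(tokenize(doc)) for doc in corpus]
    tokens = tokenize(email)
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return {
        genre: sum((counts[k] / total) * idf(k, corpus_token_sets) for k in keywords)
        for genre, keywords in GENRE_KEYWORDS.items()
    }


def predict_genre(email, corpus):
    """Pick the genre with the highest score (stand-in for the SVM stage)."""
    scores = genre_scores(email, corpus)
    return max(scores, key=scores.get)


corpus = [
    "Please send the invoice and confirm payment",
    "Dinner on the weekend for my birthday",
    "Meeting agenda and deadline attached",
]
print(predict_genre("The invoice payment is overdue", corpus))
```

Each genre is scored independently of the others, which loosely mirrors the one-against-the-rest idea. In a fuller pipeline, the per-keyword tf-idf values would presumably be passed as features to a trained classifier instead of being summed.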

Sensors ◽  
2021 ◽  
Vol 21 (7) ◽  
pp. 2503
Author(s):  
Taro Suzuki ◽  
Yoshiharu Amano

This paper proposes a method for detecting non-line-of-sight (NLOS) multipath, which causes large positioning errors in a global navigation satellite system (GNSS). We use GNSS signal correlation output, which is the most primitive GNSS signal processing output, to detect NLOS multipath based on machine learning. The shape of the multi-correlator outputs is distorted by NLOS multipath, and features of this shape are used to discriminate NLOS multipath. We implement two supervised learning methods, a support vector machine (SVM) and a neural network (NN), and compare their performance. In addition, we propose an automated method of collecting LOS and NLOS training data for machine learning. The evaluation of the proposed NLOS detection method in an urban environment confirmed that the NN was better than the SVM, and 97.7% of NLOS signals were correctly discriminated.


2021 ◽  
Author(s):  
Julia Kaltenborn ◽  
Viviane Clay ◽  
Amy R. Macfarlane ◽  
Joshua Michael Lloyd King ◽  
Martin Schneebeli

Snow-layer classification is an essential diagnostic task for a wide variety of cryospheric science and climate research applications. Traditionally, these measurements are made in snow pits, requiring trained operators and a substantial time commitment. The SnowMicroPen (SMP), a portable high-resolution snow penetrometer, has been demonstrated as a capable tool for rapid snow grain classification and layer type segmentation through statistical inversion of its mechanical signal. The manual classification of the SMP profiles requires time and training and becomes infeasible for large datasets.

Here, we introduce a novel set of SMP measurements collected during the MOSAiC expedition and apply Machine Learning (ML) algorithms to automatically classify and segment SMP profiles of snow on Arctic sea ice. To this end, different supervised and unsupervised ML methods, including Random Forests, Support Vector Machines, Artificial Neural Networks, and k-means Clustering, are compared. A subsequent segmentation of the classified data results in distinct layers and snow grain markers for the SMP profiles. The models are trained with the dataset by King et al. (2020) and the MOSAiC SMP dataset. The MOSAiC dataset is a unique and extensive dataset characterizing seasonal and spatial variation of snow on the central Arctic sea-ice.

We will test and compare the different algorithms and evaluate the algorithms’ effectiveness based on the need for initial dataset labeling, execution speed, and ease of implementation. In particular, we will compare supervised to unsupervised methods, which are distinguished by their need for labeled training data.

The implementation of different ML algorithms for SMP profile classification could provide a fast and automatic grain type classification and snow layer segmentation. Based on the gained knowledge from the algorithms’ comparison, a tool can be built to provide scientists from different fields with an immediate SMP profile classification and segmentation.

King, J., Howell, S., Brady, M., Toose, P., Derksen, C., Haas, C., & Beckers, J. (2020). Local-scale variability of snow density on Arctic sea ice. The Cryosphere, 14(12), 4323-4339, https://doi.org/10.5194/tc-14-4323-2020.


2020 ◽  
Vol 12 (12) ◽  
pp. 2049
Author(s):  
Joongbin Lim ◽  
Kyoung-Min Kim ◽  
Eun-Hee Kim ◽  
Ri Jin

The most recent forest-type map of the Korean Peninsula was produced in 1910. Maps of South Korea alone have been produced since 1972; however, the forest-type information of North Korea, which is an inaccessible region, is not known due to the separation after the Korean War. In this study, we developed a model to classify the five dominant tree species in North Korea (Korean red pine, Korean pine, Japanese larch, needle fir, and oak) using satellite data and machine-learning techniques. The model was applied to the Gwangneung Forest area in South Korea; the Mt. Baekdu area of China, which borders North Korea; and Goseong-gun, at the border of South Korea and North Korea, to evaluate the model’s applicability to North Korea. An accuracy of 83% was achieved in the classification of the Gwangneung Forest area. In classifying forest types in the Mt. Baekdu area and Goseong-gun, even higher accuracies of 91% and 90% were achieved, respectively. These results confirm the model’s regional applicability. To expand the model for application to North Korea, a new model was developed by integrating training data from the three study areas. The integrated model’s classification of forest types in Goseong-gun (South Korea) was relatively accurate (80%); thus, the model was used to produce a map of the predicted dominant tree species in Goseong-gun (North Korea).


Author(s):  
Xiaoming Li ◽  
Yan Sun ◽  
Qiang Zhang

In this paper, we develop a novel method to extract sea ice cover (i.e., discrimination/classification of sea ice and open water) using Sentinel-1 (S1) cross-polarization (vertical-horizontal, VH, or horizontal-vertical, HV) data in extra wide (EW) swath mode, based on the support vector machine (SVM) machine learning algorithm. The classification basis includes the S1 radar backscatter coefficients and texture features that are calculated from S1 data using the gray level co-occurrence matrix (GLCM). Unlike previous methods, in which appropriate samples are manually selected to train the SVM to classify sea ice and open water, we propose a method for the unsupervised generation of training samples based on two GLCM texture features, entropy and homogeneity, which have contrasting characteristics over sea ice and open water. We eliminate most of the uncertainty in selecting training samples for machine learning and achieve automatic classification of sea ice and open water using S1 EW data. The comparison shows good agreement between the SAR-derived sea ice cover obtained with the proposed method and a visual inspection, with an accuracy of approximately 90%-95% based on a few cases. In addition, compared with the analyzed sea ice cover data of the Ice Mapping System (IMS) for 728 S1 EW images, the accuracy of the sea ice cover extracted from S1 data is more than 80%.
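
The two GLCM texture features named above can be illustrated with a small pure-Python sketch. It uses common textbook definitions (entropy as the negative sum of p·ln p, homogeneity as the sum of p / (1 + (i − j)²)). The abstract does not state the exact definitions, pixel offsets, or window sizes used by the authors, so those choices and the function names here are assumptions.

```python
import math


def glcm(image, levels, dx=1, dy=0):
    """Normalized gray level co-occurrence matrix for one pixel offset.

    `image` is a list of rows of integer gray levels in [0, levels).
    The image is assumed to be large enough to contain at least one pixel pair.
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(sum(row) for row in counts)
    return [[v / total for v in row] for row in counts]


def entropy(p):
    """Texture entropy: high when co-occurrences are spread over many pairs."""
    return -sum(v * math.log(v) for row in p for v in row if v > 0)


def homogeneity(p):
    """Texture homogeneity: high when neighbouring pixels have similar levels."""
    return sum(
        v / (1 + (i - j) ** 2)
        for i, row in enumerate(p)
        for j, v in enumerate(row)
    )


smooth = [[0, 0], [0, 0]]   # uniform patch
rough = [[0, 1], [1, 0]]    # alternating patch
for name, patch in (("smooth", smooth), ("rough", rough)):
    p = glcm(patch, levels=2)
    print(name, entropy(p), homogeneity(p))
```

On these toy patches, the uniform one has low entropy and high homogeneity and the alternating one the reverse. That is the kind of contrast the abstract relies on to generate training samples without manual selection.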


2020 ◽  
Author(s):  
Tim Henning ◽  
Benjamin Bergner ◽  
Christoph Lippert

Instance segmentation is a common task in quantitative cell analysis. While there are many machine learning approaches to this task, the training process typically requires a large amount of manually annotated data. We present HistoFlow, software for annotation-efficient training of deep learning models for cell segmentation and analysis with an interactive user interface. It provides an assisted annotation tool to quickly draw and correct cell boundaries and use biomarkers as weak annotations. It also enables the user to create artificial training data to lower the labeling effort. We employ a universal U-Net neural network architecture that allows accurate instance segmentation and the classification of phenotypes in only a single pass of the network. Transfer learning is available through the user interface to adapt trained models to new tissue types. We demonstrate HistoFlow for fluorescence breast cancer images. The models trained using only artificial data perform comparably to those trained with time-consuming manual annotations. They outperform traditional cell segmentation algorithms and match state-of-the-art machine learning approaches. A user test shows that cells can be annotated six times faster than without the assistance of our annotation tool. Extending a segmentation model for classification of epithelial cells can be done using only 50 to 1500 annotations. Our results show that, contrary to previous assumptions, it is possible to interactively train a deep learning model in a matter of minutes without many manual annotations.


2021 ◽  
Vol 503 (1) ◽  
pp. 484-497
Author(s):  
F Pérez-Galarce ◽  
K Pichara ◽  
P Huijse ◽  
M Catelan ◽  
D Mery

Machine learning has achieved an important role in the automatic classification of variable stars, and several classifiers have been proposed over the last decade. These classifiers have achieved impressive performance in several astronomical catalogues. However, some scientific articles have also shown that the training data therein contain multiple sources of bias. Hence, the performance of those classifiers on objects not belonging to the training data is uncertain, potentially resulting in the selection of incorrect models. Besides, it gives rise to the deployment of misleading classifiers. An example of the latter is the creation of open-source labelled catalogues with biased predictions. In this paper, we develop a method based on an informative marginal likelihood to evaluate variable star classifiers. We collect deterministic rules that are based on physical descriptors of RR Lyrae stars, and then, to mitigate the biases, we introduce those rules into the marginal likelihood estimation. We perform experiments with a set of Bayesian logistic regressions, which are trained to classify RR Lyraes, and we find that our method outperforms traditional non-informative cross-validation strategies, even when penalized models are assessed. Our methodology provides a more rigorous alternative to assess machine learning models using astronomical knowledge. From this approach, applications to other classes of variable stars and algorithmic improvements can be developed.


Author(s):  
Muhamad Soleh ◽  
Naufal Ammar ◽  
Indrati Sukmadi

Machine learning is a field of computer science that studies how computers can learn from data to improve their intelligence. Machine learning includes many classification methods, such as Neural Networks, Support Vector Machines, and Logistic Regression. In this study, classification was carried out using the Logistic Regression method for cases of diabetes. Diabetes is an increase in glucose in the bloodstream due to a lack of insulin, which is responsible for the transfer of glucose from the blood to tissues or cells. This study aims to improve on a previous paper. The data used are the same as in the previous studies, namely the Pima Indian Diabetes Dataset. The study comprises several stages: pre-processing, processing, evaluation, and website-based application development. The data were divided into two parts: 75% for training and 25% for testing. The evaluation yields an accuracy of 80%, which is better than the 75.97% reported in the previous paper.
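
The 75%/25% split and the accuracy evaluation described above can be sketched as follows. This is a toy illustration under stated assumptions: the data are synthetic one-feature points rather than the Pima Indian Diabetes Dataset, the logistic regression is a bare gradient-descent version written for the example, and the function names and hyperparameters are hypothetical.

```python
import math
import random


def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, lr=0.1, epochs=200):
    """Fit weight w and bias b of a one-feature logistic regression
    by full-batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in rows) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in rows) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def accuracy(rows, w, b):
    """Fraction of rows whose thresholded prediction matches the label."""
    correct = sum(1 for x, y in rows if (sigmoid(w * x + b) >= 0.5) == (y == 1))
    return correct / len(rows)


# Synthetic, already-centred feature: class 0 in [-2, -1], class 1 in [1, 2].
data = [(-2 + i / 19, 0) for i in range(20)] + [(1 + i / 19, 1) for i in range(20)]
train, test = train_test_split(data)
w, b = fit_logistic(train)
print(len(train), len(test), accuracy(test, w, b))
```

Because the two synthetic classes are well separated, the held-out accuracy here is perfect. Real clinical data such as the diabetes set overlap much more, which is why the accuracies reported in the abstract are in the 75-80% range.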


2019 ◽  
Vol 116 (36) ◽  
pp. 18119-18125 ◽  
Author(s):  
Ryan C. Sartor ◽  
Jaclyn Noshay ◽  
Nathan M. Springer ◽  
Steven P. Briggs

Accurate annotation of plant genomes remains complex due to the presence of many pseudogenes arising from whole-genome duplication-generated redundancy or the capture and movement of gene fragments by transposable elements. Machine learning on genome-wide epigenetic marks, informed by transcriptomic and proteomic training data, could be used to improve annotations through classification of all putative protein-coding genes as either constitutively silent or able to be expressed. Expressed genes were subclassified as able to express both mRNAs and proteins or only RNAs, and CG gene body methylation was associated only with the former subclass. More than 60,000 protein-coding genes have been annotated in the reference genome of maize inbred B73. About two-thirds of these genes are transcribed and are designated the filtered gene set (FGS). Classification of genes by our trained random forest algorithm was accurate and relied only on histone modifications or DNA methylation patterns within the gene body; promoter methylation was unimportant. Other inbred lines are known to transcribe significantly different sets of genes, indicating that the FGS is specific to B73. We accurately classified the sets of transcribed genes in additional inbred lines, arising from inbred-specific DNA methylation patterns. This approach highlights the potential of using chromatin information to improve annotations of functional genes.


2020 ◽  
Vol 10 (8) ◽  
pp. 2758
Author(s):  
Yahir Hernández-Mier ◽  
Marco Aurelio Nuño-Maganda ◽  
Said Polanco-Martagón ◽  
María del Refugio García-Chávez

This work proposes the evaluation of a set of machine learning algorithms and the selection of the most appropriate one for the classification of segmented chromosome images acquired using the Giemsa staining technique (G-banding). The evaluation and selection of the best classification algorithms was carried out on a dataset of 119 Q-banding chromosome images, and the results obtained were then applied to a dataset of 24 G-band chromosome images, manually classified by an expert of the Laboratory of Cytogenetic of the Children’s Hospital of Tamaulipas. The evaluation of 51 classifiers showed that the best classification accuracy for the selected features was obtained by a backpropagation neural network. One of the main contributions of this study is the proposal of a two-stage classification scheme based on the best classifier found by the initial evaluation. In stage 1, chromosome images are classified into three major groups. In stage 2, the output of stage 1 is used as the input of a multiclass classifier. Using this scheme, 82% of the IGB bank samples and 88% of the samples of a bank of images obtained with a Q-band available in the literature consisting of 119 chromosome studies were successfully classified. The proposed work is part of a desktop application that allows cytogeneticists to automatically generate cytogenetic reports.


2011 ◽  
Vol 48-49 ◽  
pp. 1275-1281 ◽  
Author(s):  
Yu Jen Hu ◽  
Yuh Hua Hu ◽  
Jyh Bin Ke

We propose a method for categorizing DNA sequence matrices using FCM (fuzzy cluster means). FCM avoids the errors caused by dimensionality reduction and thereby achieves more comprehensive machine learning. In our experiment, the training data consist of 40 artificial samples, and we verify the proposed method with 182 natural DNA sequences. The results show that the proposed method improves the accuracy of gene classification from 76% to 93%.
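
For orientation, the clustering idea can be sketched with the standard fuzzy c-means update rules, on the assumption that this is the algorithm the abstract refers to as FCM. The abstract does not describe how DNA sequence matrices are encoded, so the sketch uses one-dimensional toy points. The fuzzifier m = 2, the initial centers, and the function names are illustrative assumptions.

```python
def memberships(points, centers, m=2.0):
    """Membership of every point in every cluster; each row sums to 1."""
    power = 2.0 / (m - 1.0)
    u = []
    for x in points:
        dists = [abs(x - c) for c in centers]
        if 0.0 in dists:
            # The point sits exactly on a center: give it full membership there.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            u.append([h / sum(hits) for h in hits])
        else:
            u.append([1.0 / sum((di / dj) ** power for dj in dists) for di in dists])
    return u


def update_centers(points, u, m=2.0):
    """Each center is the mean of the points weighted by membership**m."""
    n_clusters = len(u[0])
    return [
        sum((row[i] ** m) * x for row, x in zip(u, points))
        / sum(row[i] ** m for row in u)
        for i in range(n_clusters)
    ]


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate membership and center updates for a fixed number of rounds."""
    for _ in range(iterations):
        u = memberships(points, centers, m)
        centers = update_centers(points, u, m)
    return centers, memberships(points, centers, m)


points = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
centers, u = fuzzy_c_means(points, centers=[0.9, 5.1])
print(centers)
```

Unlike hard clustering, every point keeps a graded membership in every cluster, so a sample that lies between groups is not forced into one of them before classification.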

