Automatic Genre Categorization of Emails into predefined categories using machine learning

In today’s dynamic world, there is a need for fast, efficient, and reliable means of communication. The email system was developed to meet these requirements, and it became popular with the invention of the WWW. Email is now used extensively for official, business, and personal communication. On average, individual users receive 50-60 emails each day, and managing them is becoming a burden. There is therefore a need for an effective and reliable means of organizing emails for easy and fast retrieval. This paper proposes an efficient approach for classifying emails into predefined genres. The research shows that classifying emails greatly improves efficiency and saves the time and effort needed to manage them. The results are very encouraging: over 90% of emails are categorized correctly. Email genres are predefined and corresponding keyword lists are generated. The tf-idf frequency of the keywords in an email decides its genre. An SVM is used as a multiclass classifier. The need for negative training data is removed because the proposed classifier works on the principle of one class against the rest.
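The keyword-based tf-idf step described in this abstract might look roughly like the following minimal sketch. The genre names, keyword lists, and sample emails are invented for illustration, and a simple highest-score rule stands in for the SVM, so this is not the authors' classifier.

```python
import math
from collections import Counter

# Hypothetical genre keyword lists; the paper's actual lists are not given here.
GENRE_KEYWORDS = {
    "business": ["invoice", "meeting", "contract"],
    "personal": ["birthday", "dinner", "family"],
}

def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    return [w.strip(".,!?;:").lower() for w in text.split() if w.strip(".,!?;:")]

def tf_idf_vector(email_tokens, corpus_tokens, keywords):
    """Return the tf-idf weight of each keyword for one email."""
    counts = Counter(email_tokens)
    n_docs = len(corpus_tokens)
    vector = []
    for kw in keywords:
        tf = counts[kw] / max(len(email_tokens), 1)
        df = sum(1 for doc in corpus_tokens if kw in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed idf, one common variant
        vector.append(tf * idf)
    return vector

def predict_genre(email, corpus):
    """Score every genre by its summed keyword tf-idf and pick the highest."""
    corpus_tokens = [tokenize(doc) for doc in corpus]
    tokens = tokenize(email)
    scores = {
        genre: sum(tf_idf_vector(tokens, corpus_tokens, kws))
        for genre, kws in GENRE_KEYWORDS.items()
    }
    return max(scores, key=scores.get), scores

# Made-up example emails.
corpus = [
    "Please sign the contract before the meeting.",
    "Family dinner on Sunday for the birthday!",
    "The invoice for March is attached.",
]
genre, scores = predict_genre(corpus[0], corpus)
print(genre, scores)
```

In a setup closer to the abstract, the per-genre tf-idf vectors would be passed to a one-against-the-rest SVM instead of being summed and compared directly.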


NLOS Multipath Classification of GNSS Signal Correlation Output Using Machine Learning

Sensors ◽

10.3390/s21072503 ◽

2021 ◽

Vol 21 (7) ◽

pp. 2503

Author(s):

Taro Suzuki ◽

Yoshiharu Amano

Keyword(s):

Machine Learning ◽

Satellite System ◽

Training Data ◽

Support Vector ◽

Positioning Errors ◽

Automated Method ◽

Global Navigation Satellite ◽

Better Than ◽

Signal Correlation

This paper proposes a method for detecting non-line-of-sight (NLOS) multipath, which causes large positioning errors in a global navigation satellite system (GNSS). We use GNSS signal correlation output, which is the most primitive GNSS signal processing output, to detect NLOS multipath based on machine learning. The shape of the multi-correlator outputs is distorted due to the NLOS multipath. The features of the shape of the multi-correlator are used to discriminate the NLOS multipath. We implement two supervised learning methods, a support vector machine (SVM) and a neural network (NN), and compare their performance. We also propose an automated method of collecting LOS and NLOS training data for machine learning. The evaluation of the proposed NLOS detection method in an urban environment confirmed that NN was better than SVM, and 97.7% of NLOS signals were correctly discriminated.


A Comparison of Machine Learning Algorithms for the Segmentation and Classification of Snow Micro Penetrometer Profiles on Arctic Sea Ice

10.5194/egusphere-egu21-15637 ◽

2021 ◽

Author(s):

Julia Kaltenborn ◽

Viviane Clay ◽

Amy R. Macfarlane ◽

Joshua Michael Lloyd King ◽

Martin Schneebeli

Keyword(s):

Machine Learning ◽

Sea Ice ◽

Arctic Sea Ice ◽

Machine Learning Algorithms ◽

Training Data ◽

Support Vector ◽

Snow Layer ◽

Arctic Sea ◽

Execution Speed

Snow-layer classification is an essential diagnostic task for a wide variety of cryospheric science and climate research applications. Traditionally, these measurements are made in snow pits, requiring trained operators and a substantial time commitment. The SnowMicroPen (SMP), a portable high-resolution snow penetrometer, has been demonstrated as a capable tool for rapid snow grain classification and layer type segmentation through statistical inversion of its mechanical signal. The manual classification of the SMP profiles requires time and training and becomes infeasible for large datasets. Here, we introduce a novel set of SMP measurements collected during the MOSAiC expedition and apply Machine Learning (ML) algorithms to automatically classify and segment SMP profiles of snow on Arctic sea ice. To this end, different supervised and unsupervised ML methods, including Random Forests, Support Vector Machines, Artificial Neural Networks, and k-means Clustering, are compared. A subsequent segmentation of the classified data results in distinct layers and snow grain markers for the SMP profiles. The models are trained with the dataset by King et al. (2020) and the MOSAiC SMP dataset. The MOSAiC dataset is a unique and extensive dataset characterizing seasonal and spatial variation of snow on the central Arctic sea-ice. We will test and compare the different algorithms and evaluate the algorithms’ effectiveness based on the need for initial dataset labeling, execution speed, and ease of implementation. In particular, we will compare supervised to unsupervised methods, which are distinguished by their need for labeled training data. The implementation of different ML algorithms for SMP profile classification could provide a fast and automatic grain type classification and snow layer segmentation. Based on the gained knowledge from the algorithms’ comparison, a tool can be built to provide scientists from different fields with an immediate SMP profile classification and segmentation.

King, J., Howell, S., Brady, M., Toose, P., Derksen, C., Haas, C., & Beckers, J. (2020). Local-scale variability of snow density on Arctic sea ice. The Cryosphere, 14(12), 4323-4339, https://doi.org/10.5194/tc-14-4323-2020.


Machine Learning for Tree Species Classification Using Sentinel-2 Spectral Information, Crown Texture, and Environmental Variables

Remote Sensing ◽

10.3390/rs12122049 ◽

2020 ◽

Vol 12 (12) ◽

pp. 2049

Author(s):

Joongbin Lim ◽

Kyoung-Min Kim ◽

Eun-Hee Kim ◽

Ri Jin

Keyword(s):

Machine Learning ◽

South Korea ◽

Tree Species ◽

North Korea ◽

Forest Type ◽

Forest Area ◽

Training Data ◽

Machine Learning Techniques ◽

Forest Types

The most recent forest-type map of the Korean Peninsula was produced in 1910. Maps of South Korea alone have been produced since 1972; however, the forest types of North Korea, which is an inaccessible region, are not known due to the separation after the Korean War. In this study, we developed a model to classify the five dominant tree species in North Korea (Korean red pine, Korean pine, Japanese larch, needle fir, and oak) using satellite data and machine-learning techniques. The model was applied to the Gwangneung Forest area in South Korea; the Mt. Baekdu area of China, which borders North Korea; and to Goseong-gun, at the border of South Korea and North Korea, to evaluate the model’s applicability to North Korea. Eighty-three percent accuracy was achieved in the classification of the Gwangneung Forest area. In classifying forest types in the Mt. Baekdu area and Goseong-gun, even higher accuracies of 91% and 90% were achieved, respectively. These results confirm the model’s regional applicability. To expand the model for application to North Korea, a new model was developed by integrating training data from the three study areas. The integrated model’s classification of forest types in Goseong-gun (South Korea) was relatively accurate (80%); thus, the model was utilized to produce a map of the predicted dominant tree species in Goseong-gun (North Korea).


Extraction of Sea Ice Cover by Sentinel-1 SAR Based on SVM with Unsupervised Generation of Training Data

10.20944/preprints202005.0336.v1 ◽

2020 ◽

Author(s):

Xiaoming Li ◽

Yan Sun ◽

Qiang Zhang

Keyword(s):

Machine Learning ◽

Sea Ice ◽

Learning Algorithm ◽

Texture Features ◽

Open Water ◽

Ice Cover ◽

Training Data ◽

Support Vector ◽

Training Samples

In this paper, we focus on developing a novel method to extract sea ice cover (i.e., discrimination/classification of sea ice and open water) using Sentinel-1 (S1) cross-polarization (vertical-horizontal, VH or horizontal-vertical, HV) data in extra wide (EW) swath mode based on the machine learning algorithm support vector machine (SVM). The classification basis includes the S1 radar backscatter coefficients and texture features that are calculated from S1 data using the gray level co-occurrence matrix (GLCM). Unlike previous methods, in which appropriate samples are manually selected to train the SVM to classify sea ice and open water, we propose a method of unsupervised generation of the training samples based on two GLCM texture features, i.e., entropy and homogeneity, that have contrasting characteristics on sea ice and open water. We eliminate most of the uncertainty in selecting training samples for machine learning and achieve automatic classification of sea ice and open water using S1 EW data. The comparison shows good agreement between the SAR-derived sea ice cover obtained with the proposed method and a visual inspection, with an accuracy of approximately 90% - 95% based on a few cases. In addition, compared with the analyzed sea ice cover data of the Ice Mapping System (IMS), based on 728 S1 EW images, the accuracy of the sea ice cover extracted from S1 data is more than 80%.
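The two GLCM texture features named in this abstract, entropy and homogeneity, can be illustrated with a minimal sketch. The tiny integer patches, the single horizontal offset, and the homogeneity weighting 1 / (1 + (i - j)^2) are assumptions chosen for illustration; the paper's window size, offsets, and quantization are not stated here.

```python
import math

def glcm(image, levels, dx=1, dy=0):
    """Normalized, symmetric gray level co-occurrence matrix for one pixel offset."""
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1  # count the pair in both directions
    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("image too small for the chosen offset")
    return [[v / total for v in row] for row in counts]

def entropy(p):
    """GLCM entropy: larger when co-occurrences are spread over many gray-level pairs."""
    return -sum(v * math.log(v) for row in p for v in row if v > 0)

def homogeneity(p):
    """GLCM homogeneity: larger when neighboring pixels have similar gray levels."""
    return sum(v / (1.0 + (i - j) ** 2)
               for i, row in enumerate(p) for j, v in enumerate(row))

# Two made-up 4x4 patches quantized to 4 gray levels.
smooth = [[0, 0, 0, 0] for _ in range(4)]
rough = [[0, 3, 0, 3], [3, 0, 3, 0], [0, 3, 0, 3], [3, 0, 3, 0]]

for name, patch in (("smooth", smooth), ("rough", rough)):
    p = glcm(patch, levels=4)
    print(name, entropy(p), homogeneity(p))
```

Thresholding such contrasting feature values is one conceivable way to pick training samples without manual selection, although the exact selection rule used in the paper is not described in this abstract.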


HistoFlow: Label-Efficient and Interactive Deep Learning Cell Analysis

10.1101/2020.07.15.204891 ◽

2020 ◽

Author(s):

Tim Henning ◽

Benjamin Bergner ◽

Christoph Lippert

Keyword(s):

Machine Learning ◽

Deep Learning ◽

User Interface ◽

Network Architecture ◽

Training Data ◽

Cell Segmentation ◽

Annotation Tool ◽

Cell Analysis ◽

Instance Segmentation

Instance segmentation is a common task in quantitative cell analysis. While many approaches do this using machine learning, the training process typically requires a large amount of manually annotated data. We present HistoFlow, software for annotation-efficient training of deep learning models for cell segmentation and analysis with an interactive user interface. It provides an assisted annotation tool to quickly draw and correct cell boundaries and use biomarkers as weak annotations. It also enables the user to create artificial training data to lower the labeling effort. We employ a universal U-Net neural network architecture that allows accurate instance segmentation and the classification of phenotypes in only a single pass of the network. Transfer learning is available through the user interface to adapt trained models to new tissue types. We demonstrate HistoFlow for fluorescence breast cancer images. The models trained using only artificial data perform comparably to those trained with time-consuming manual annotations. They outperform traditional cell segmentation algorithms and match state-of-the-art machine learning approaches. A user test shows that cells can be annotated six times faster than without the assistance of our annotation tool. Extending a segmentation model for classification of epithelial cells can be done using only 50 to 1500 annotations. Our results show that, contrary to previous assumptions, it is possible to interactively train a deep learning model in a matter of minutes without many manual annotations.


Informative Bayesian model selection for RR Lyrae star classifiers

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/stab320 ◽

2021 ◽

Vol 503 (1) ◽

pp. 484-497

Author(s):

F Pérez-Galarce ◽

K Pichara ◽

P Huijse ◽

M Catelan ◽

D Mery

Keyword(s):

Machine Learning ◽

Marginal Likelihood ◽

Likelihood Estimation ◽

Variable Stars ◽

Training Data ◽

Multiple Sources ◽

RR Lyrae Stars ◽

RR Lyrae ◽

Lyrae Stars

Machine learning has achieved an important role in the automatic classification of variable stars, and several classifiers have been proposed over the last decade. These classifiers have achieved impressive performance in several astronomical catalogues. However, some scientific articles have also shown that the training data therein contain multiple sources of bias. Hence, the performance of those classifiers on objects not belonging to the training data is uncertain, potentially resulting in the selection of incorrect models. Besides, it gives rise to the deployment of misleading classifiers. An example of the latter is the creation of open-source labelled catalogues with biased predictions. In this paper, we develop a method based on an informative marginal likelihood to evaluate variable star classifiers. We collect deterministic rules that are based on physical descriptors of RR Lyrae stars, and then, to mitigate the biases, we introduce those rules into the marginal likelihood estimation. We perform experiments with a set of Bayesian logistic regressions, which are trained to classify RR Lyraes, and we find that our method outperforms traditional non-informative cross-validation strategies, even when penalized models are assessed. Our methodology provides a more rigorous alternative to assess machine learning models using astronomical knowledge. From this approach, applications to other classes of variable stars and algorithmic improvements can be developed.


Website-Based Application for Classification of Diabetes Using Logistic Regression Method

Jurnal Ilmiah Merpati (Menara Penelitian Akademika Teknologi Informasi) ◽

10.24843/jim.2021.v09.i01.p03 ◽

2021 ◽

pp. 23

Author(s):

Muhamad Soleh ◽

Naufal Ammar ◽

Indrati Sukmadi

Keyword(s):

Machine Learning ◽

Regression Method ◽

Training Data ◽

Support Vector ◽

Application Development ◽

Science Field ◽

Logistic Regression ◽

Vector Machines ◽

Logistic Regression Method

Machine learning is a field of computer science that studies how computers can learn from data to improve their intelligence. It includes many classification methods, such as Neural Networks, Support Vector Machines, and Logistic Regression. In this study, classification is carried out using the Logistic Regression method for cases of diabetes. Diabetes is an increase in glucose in the bloodstream due to a lack of insulin, which is responsible for the transfer of glucose from the blood to tissues or cells. This study aims to improve on a previous paper. The data used are the same as in previous studies: the Pima Indian Diabetes Dataset. The study comprises several stages: pre-processing, processing, evaluation, and website-based application development. The data are divided into two parts, 75% for training and 25% for testing. The evaluation yields an accuracy of 80%, which is better than the 75.97% reported in the previous paper.
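The 75%/25% split and logistic regression workflow described in this abstract can be sketched as follows. The data here are synthetic one-feature samples generated for illustration, not the Pima Indian Diabetes Dataset, and the plain stochastic gradient descent fit, learning rate, and epoch count are assumptions rather than the study's settings.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def fit_logistic(train, epochs=200, lr=0.1):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    w = [0.0] * len(train[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in train:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(model, x):
    """Return class 1 when the predicted probability is at least 0.5."""
    w, b = model
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0

def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(model, x) == y for x, y in rows) / len(rows)

# Synthetic stand-in data: 100 samples, one feature, two classes.
rng = random.Random(42)
data = [([rng.gauss(2.0 if label else -2.0, 1.0)], label) for label in [0, 1] * 50]

train, test = train_test_split(data)
model = fit_logistic(train)
test_accuracy = accuracy(model, test)
print(len(train), len(test), test_accuracy)
```

Accuracy is measured only on the held-out 25%, which is what makes a figure such as the reported 80% comparable across studies that use the same dataset and split.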


Identification of the expressome by machine learning on omics data

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1813645116 ◽

2019 ◽

Vol 116 (36) ◽

pp. 18119-18125 ◽

Cited By ~ 10

Author(s):

Ryan C. Sartor ◽

Jaclyn Noshay ◽

Nathan M. Springer ◽

Steven P. Briggs

Keyword(s):

Machine Learning ◽

DNA Methylation ◽

Inbred Lines ◽

Training Data ◽

Gene Body ◽

Gene Body Methylation ◽

Protein Coding ◽

Protein Coding Genes ◽

Methylation Patterns

Accurate annotation of plant genomes remains complex due to the presence of many pseudogenes arising from whole-genome duplication-generated redundancy or the capture and movement of gene fragments by transposable elements. Machine learning on genome-wide epigenetic marks, informed by transcriptomic and proteomic training data, could be used to improve annotations through classification of all putative protein-coding genes as either constitutively silent or able to be expressed. Expressed genes were subclassified as able to express both mRNAs and proteins or only RNAs, and CG gene body methylation was associated only with the former subclass. More than 60,000 protein-coding genes have been annotated in the reference genome of maize inbred B73. About two-thirds of these genes are transcribed and are designated the filtered gene set (FGS). Classification of genes by our trained random forest algorithm was accurate and relied only on histone modifications or DNA methylation patterns within the gene body; promoter methylation was unimportant. Other inbred lines are known to transcribe significantly different sets of genes, indicating that the FGS is specific to B73. We accurately classified the sets of transcribed genes in additional inbred lines, arising from inbred-specific DNA methylation patterns. This approach highlights the potential of using chromatin information to improve annotations of functional genes.


Machine Learning Classifiers Evaluation for Automatic Karyogram Generation from G-Banded Metaphase Images

Applied Sciences ◽

10.3390/app10082758 ◽

2020 ◽

Vol 10 (8) ◽

pp. 2758

Author(s):

Yahir Hernández-Mier ◽

Marco Aurelio Nuño-Maganda ◽

Said Polanco-Martagón ◽

María del Refugio García-Chávez

Keyword(s):

Machine Learning ◽

Classification Scheme ◽

Phase 1 ◽

Staining Technique ◽

Desktop Application ◽

Stage Classification ◽

Multiclass Classifier ◽

Stage 1 ◽

Selection Of

This work evaluates a set of machine learning algorithms and selects the most appropriate one for classifying segmented chromosome images acquired using the Giemsa staining technique (G-banding). The evaluation and selection of the best classification algorithms were carried out on a dataset of 119 Q-banding chromosome images, and the results were then applied to a dataset of 24 G-band chromosome images, manually classified by an expert of the Laboratory of Cytogenetic of the Children’s Hospital of Tamaulipas. An evaluation of 51 classifiers showed that the best classification accuracy for the selected features was obtained by a backpropagation neural network. One of the main contributions of this study is a two-stage classification scheme based on the best classifier found by the initial evaluation. In stage 1, chromosome images are classified into three major groups. In stage 2, the output of stage 1 is used as the input of a multiclass classifier. Using this scheme, 82% of the IGB bank samples and 88% of the samples from a bank of Q-band images available in the literature, consisting of 119 chromosome studies, were successfully classified. The proposed work is part of a desktop application that allows cytogeneticists to automatically generate cytogenetic reports.
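The two-stage scheme in this abstract, where a coarse group prediction feeds a multiclass classifier, can be sketched as below. A nearest-centroid classifier is swapped in for the backpropagation neural network, and the single "relative length" feature, the group names, and the sample values are all hypothetical.

```python
def nearest_centroid_fit(samples):
    """samples: list of (features, label). Returns {label: centroid}."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def nearest_centroid_predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist2(centroids[y]))

def one_hot(group, groups):
    """Encode the stage 1 group as extra features for stage 2."""
    return [1.0 if g == group else 0.0 for g in groups]

def fit_two_stage(samples, group_of):
    """samples: (features, fine_label); group_of maps fine_label to a coarse group."""
    groups = sorted(set(group_of.values()))
    stage1 = nearest_centroid_fit([(x, group_of[y]) for x, y in samples])
    stage2_samples = []
    for x, y in samples:
        g = nearest_centroid_predict(stage1, x)
        stage2_samples.append((x + one_hot(g, groups), y))
    stage2 = nearest_centroid_fit(stage2_samples)
    return stage1, stage2, groups

def predict_two_stage(model, x):
    """Stage 1 picks a coarse group; stage 2 uses it as input for the final class."""
    stage1, stage2, groups = model
    g = nearest_centroid_predict(stage1, x)
    return g, nearest_centroid_predict(stage2, x + one_hot(g, groups))

# Hypothetical training data: one made-up "relative length" feature per image.
group_of = {"chr1": "large", "chr2": "large", "chr10": "medium",
            "chr11": "medium", "chr21": "small", "chr22": "small"}
samples = [([10.0], "chr1"), ([9.0], "chr2"), ([5.5], "chr10"),
           ([5.0], "chr11"), ([2.0], "chr21"), ([1.8], "chr22")]

model = fit_two_stage(samples, group_of)
result = predict_two_stage(model, [9.9])
print(result)
```

Passing the stage 1 output forward as an input feature follows the wording of the abstract; training a separate classifier per group would be a different design and is not what is shown here.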


The Modified DNA Identification Classification on Fuzzy Relation

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.48-49.1275 ◽

2011 ◽

Vol 48-49 ◽

pp. 1275-1281 ◽

Cited By ~ 3

Author(s):

Yu Jen Hu ◽

Yuh Hua Hu ◽

Jyh Bin Ke

Keyword(s):

Machine Learning ◽

DNA Sequences ◽

Fuzzy Relation ◽

Fuzzy Cluster ◽

Training Data ◽

DNA Identification ◽

Modified DNA

We propose a method for categorizing DNA sequence matrices using FCM (fuzzy c-means). FCM avoids the errors caused by dimensionality reduction and further achieves comprehensive machine learning. In our experiment, the training data consist of 40 artificial samples, and we verify the proposed method with 182 natural DNA sequences. The results show that the proposed method improves the accuracy of gene classification from 76% to 93%.
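As a rough illustration of the fuzzy c-means clustering mentioned in this abstract, here is a minimal sketch of the standard algorithm. The one-dimensional points, the fuzzifier m = 2, and the fixed iteration count are assumptions; how the paper encodes DNA sequences as matrices is not described here and is not modeled.

```python
import random

def fuzzy_c_means(points, n_clusters, m=2.0, iters=100, seed=0):
    """Standard fuzzy c-means: returns cluster centers and a membership matrix."""
    rng = random.Random(seed)
    # Random initial memberships, each row normalized to sum to 1.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        s = sum(row)
        u.append([v / s for v in row])
    dim = len(points[0])
    centers = []
    for _ in range(iters):
        # Update each center as a membership-weighted mean of the points.
        centers = []
        for k in range(n_clusters):
            weights = [u[i][k] ** m for i in range(len(points))]
            total = sum(weights)
            centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ])
        # Update memberships from the distances to the new centers.
        for i, p in enumerate(points):
            dists = [max(sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5, 1e-12)
                     for c in centers]
            for k in range(n_clusters):
                u[i][k] = 1.0 / sum((dists[k] / dj) ** (2.0 / (m - 1.0))
                                    for dj in dists)
    return centers, u

# Made-up one-dimensional data forming two well-separated groups.
points = [[0.0], [0.2], [0.1], [5.0], [5.2], [5.1]]
centers, u = fuzzy_c_means(points, n_clusters=2)
labels = [row.index(max(row)) for row in u]
print(sorted(c[0] for c in centers), labels)
```

Unlike hard clustering, every point keeps a degree of membership in each cluster, and a hard label is taken only at the end by choosing the largest membership.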
