Using k-mer Embeddings Learned from a Skip-Gram Based Neural Network as Effective Feature Representation for Building a Cross-Species Prediction Model to Identify DNA N6-Methyladenine Sites in Plant Genomes

Abstract Identification of DNA N6-methyladenine sites has been a very active topic in computational biology because suitable methods to identify them accurately are lacking, especially in plants. Substantial results have been obtained, but only with great effort put into extracting, heuristically searching, or fusing diverse types of features, not to mention a feature selection step. We considered DNA, the human life book, as a book corpus for training DNA language models. K-mer embeddings were then generated from skip-gram neural networks and fed into several ensemble tree-based algorithms. We trained the prediction model on the Rosaceae genome dataset and performed a comprehensive test on 3 plant genome datasets. Our proposed method shows promising performance, with AUC approaching an ideal value on the Rosaceae dataset (0.99), a high score on the Rice dataset (0.95), and improved performance on the Rice dataset, while enjoying an elegant yet efficient feature extraction process.
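The "DNA as a text corpus" idea in this abstract can be illustrated with a minimal sketch. It assumes overlapping k-mers with stride 1 and a symmetric context window; the function names, the value of k, and the window size are hypothetical choices for illustration, since the abstract does not state the settings the authors used.

```python
def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mers (stride 1), the 'words' of the corpus."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def skipgram_pairs(tokens, window=2):
    """Build (center, context) training pairs as a skip-gram model would consume them."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = kmer_tokens("ATGCGA", k=3)       # ['ATG', 'TGC', 'GCG', 'CGA']
pairs = skipgram_pairs(tokens, window=1)  # 6 (center, context) pairs
```

The pairs would then be used to train an embedding model, and the resulting k-mer vectors would serve as input features for the tree-based classifiers; those later stages are not shown here.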


Hierarchical Phoneme Classification for Improved Speech Recognition

Applied Sciences ◽

10.3390/app11010428 ◽

2021 ◽

Vol 11 (1) ◽

pp. 428

Author(s):

Donghoon Oh ◽

Jeong-Sik Park ◽

Ji-Hwan Kim ◽

Gil-Jin Jang

Keyword(s):

Speech Recognition ◽

Language Processing ◽

Confusion Matrix ◽

Critical Factor ◽

Recognition System ◽

Classification Performance ◽

Language Models ◽

Successful Implementation ◽

Phoneme Classification ◽

Improved Performance

Speech recognition consists of converting input sound into a sequence of phonemes, then finding text for the input using language models. Therefore, phoneme classification performance is a critical factor for the successful implementation of a speech recognition system. However, correctly distinguishing phonemes with similar characteristics is still a challenging problem even for state-of-the-art classification methods, and classification errors are hard to recover from in the subsequent language processing steps. This paper proposes a hierarchical phoneme clustering method that applies more suitable recognition models to different phonemes. The phonemes of the TIMIT database are carefully analyzed using a confusion matrix from a baseline speech recognition model. Using automatic phoneme clustering results, a set of phoneme classification models optimized for the generated phoneme groups is constructed and integrated into a hierarchical phoneme classification method. According to the results of a number of phoneme classification experiments, the proposed hierarchical phoneme group models improved performance over the baseline by 3%, 2.1%, 6.0%, and 2.2% for fricative, affricate, stop, and nasal sounds, respectively. The average accuracy was 69.5% and 71.7% for the baseline and proposed hierarchical models, showing a 2.2% overall improvement.
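One way to derive phoneme groups from a baseline confusion matrix is sketched below. This is not the clustering procedure from the paper, which the abstract does not specify; it is a simple threshold-based single-linkage grouping, and the function name, the symmetric confusion rate, the threshold, and the toy counts are all assumptions made for illustration.

```python
def confusion_groups(labels, conf, threshold):
    """Group phonemes whose mutual confusion rate reaches `threshold`.

    conf[i][j] is the number of times true phoneme i was recognized as j.
    Every row is assumed to contain at least one observation.
    """
    n = len(labels)
    row_total = [sum(row) for row in conf]
    parent = list(range(n))  # union-find forest over phoneme indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # Symmetric confusion rate between phonemes i and j.
            rate = (conf[i][j] + conf[j][i]) / (row_total[i] + row_total[j])
            if rate >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(find(i), []).append(label)
    return sorted(groups.values())


# Made-up counts: fricatives s/z and nasals m/n are confused with each other.
labels = ["s", "z", "m", "n"]
conf = [
    [80, 18, 1, 1],
    [20, 78, 1, 1],
    [1, 1, 70, 28],
    [0, 2, 30, 68],
]
print(confusion_groups(labels, conf, threshold=0.1))  # [['m', 'n'], ['s', 'z']]
```

In a hierarchical setup, a first-stage model would pick the group and a group-specific model would then pick the phoneme within it.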


Evaluating Word Similarity Measure of Embeddings Through Binary Classification

Journal of Computer Science Research ◽

10.30564/jcsr.v1i3.1268 ◽

2019 ◽

Vol 1 (3) ◽

Author(s):

A. Aziz Altowayan ◽

Lixin Tao

Keyword(s):

Similarity Measure ◽

Binary Classification ◽

General Purpose ◽

Feature Representation ◽

Entity Recognition ◽

Language Models ◽

Data Set ◽

Word Similarity ◽

Domain Specific ◽

Retrieval Rate

We consider the following problem: given neural language models (embeddings), each of which is trained on an unknown data set, how can we determine which model would provide a better result when used for feature representation in a downstream task such as text classification or entity recognition? In this paper, we assess the word similarity measure by analyzing its impact on word embeddings learned from various datasets and how they perform in a simple classification task. Word representations were learned and assessed under the same conditions. For training word vectors, we used the implementation of Continuous Bag of Words described in [1]. To assess the quality of the vectors, we applied the analogy questions test for word similarity described in the same paper. Further, to measure the retrieval rate of an embedding model, we introduced a new metric (Average Retrieval Error), which measures the percentage of missing words in the model. We observe that a high accuracy on syntactic and semantic similarities between word pairs is not an indicator of better classification results. This observation can be explained by the fact that a domain-specific corpus contributes more to performance than a general-purpose corpus. For reproducibility, we release our experiment scripts and results.
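The abstract describes Average Retrieval Error only as the percentage of words missing from the embedding model. A minimal sketch of such a metric follows; averaging the per-document missing rate is an assumption, as are the function name and the toy data, and the paper's exact definition may differ.

```python
def average_retrieval_error(documents, vocabulary):
    """Mean percentage of tokens per document that have no vector in the model.

    documents:  list of token lists
    vocabulary: set of words the embedding model covers
    """
    rates = []
    for tokens in documents:
        if not tokens:
            continue  # skip empty documents to avoid division by zero
        missing = sum(1 for t in tokens if t not in vocabulary)
        rates.append(100.0 * missing / len(tokens))
    return sum(rates) / len(rates) if rates else 0.0


docs = [["good", "movie"], ["awful", "plot", "twist", "ending"]]
vocab = {"good", "movie", "plot"}
print(average_retrieval_error(docs, vocab))  # 37.5
```

A model trained on a domain-specific corpus would be expected to score lower on this measure for in-domain text than a general-purpose one, which is consistent with the observation in the abstract.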


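Web Page Information Extraction Using a Bootstrapping Approach in Ontology-Based Information Extraction (Ekstraksi Informasi Halaman Web Menggunakan Pendekatan Bootstrapping pada Ontology-Based Information Extraction)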

IJCCS (Indonesian Journal of Computing and Cybernetics Systems) ◽

10.22146/ijccs.7540 ◽

2015 ◽

Vol 9 (2) ◽

pp. 111 ◽

Cited By ~ 1

Author(s):

Erma Susanti ◽

Khabib Mustofa

Keyword(s):

Information Extraction ◽

Language Processing ◽

Semantic Content ◽

Extraction Process ◽

Web Pages ◽

Structured Information ◽

Improved Performance ◽

Types Of Information ◽

Unstructured Information

Abstract Information extraction is a field of natural language processing that converts unstructured text into structured information. Many types of information on the Internet are transmitted in unstructured form via websites, creating the need for technology that analyzes text and finds relevant knowledge as structured information. An example of unstructured information is the main information in the content of web pages. Various approaches to information extraction have been developed by many researchers, using either manual or automatic methods, but their performance still needs to be improved in terms of extraction accuracy and speed. This research proposes an information extraction approach that combines bootstrapping with Ontology-Based Information Extraction (OBIE). Bootstrapping, which uses a small seed of labelled data, minimizes human intervention in the extraction process, while a guiding ontology for extracting classes, properties, and instances provides semantic content for the semantic web. Combining the two approaches is expected to increase the speed of the extraction process and the accuracy of the extraction results. The case study applies the information extraction system to the “LonelyPlanet” dataset. Keywords: information extraction, ontology, bootstrapping, Ontology-Based Information Extraction, OBIE, performance


Transfer Learning for Risk Classification of Social Media Posts: Model Evaluation Study (Preprint)

10.2196/preprints.15371 ◽

2019 ◽

Author(s):

Derek Howard ◽

Marta M Maslej ◽

Justin Lee ◽

Jacob Ritchie ◽

Geoffrey Woollard ◽

...

Keyword(s):

Mental Health ◽

Machine Learning ◽

Social Media ◽

Transfer Learning ◽

Computational Linguistics ◽

Feature Representation ◽

Fine Tuning ◽

Language Models ◽

Universal Sentence ◽

Text Feature

BACKGROUND Mental illness affects a significant portion of the worldwide population. Online mental health forums can provide a supportive environment for those afflicted and also generate a large amount of data that can be mined to predict mental health states using machine learning methods. OBJECTIVE This study aimed to benchmark multiple methods of text feature representation for social media posts and compare their downstream use with automated machine learning (AutoML) tools. We tested on datasets that contain posts labeled for perceived suicide risk or moderator attention in the context of self-harm. Specifically, we assessed the ability of the methods to prioritize posts that a moderator would identify for immediate response. METHODS We used 1588 labeled posts from the Computational Linguistics and Clinical Psychology (CLPsych) 2017 shared task collected from the Reachout.com forum. Posts were represented using lexicon-based tools, including Valence Aware Dictionary and sEntiment Reasoner, Empath, and Linguistic Inquiry and Word Count, and also using pretrained artificial neural network models, including DeepMoji, Universal Sentence Encoder, and Generative Pretrained Transformer-1 (GPT-1). We used Tree-based Optimization Tool and Auto-Sklearn as AutoML tools to generate classifiers to triage the posts. RESULTS The top-performing system used features derived from the GPT-1 model, which was fine-tuned on over 150,000 unlabeled posts from Reachout.com. Our top system had a macroaveraged F1 score of 0.572, providing a new state-of-the-art result on the CLPsych 2017 task. This was achieved without additional information from metadata or preceding posts. Error analyses revealed that this top system often misses expressions of hopelessness. In addition, we have presented visualizations that aid in the understanding of the learned classifiers. 
CONCLUSIONS In this study, we found that transfer learning is an effective strategy for predicting risk with relatively little labeled data and noted that fine-tuning of pretrained language models provides further gains when large amounts of unlabeled text are available.
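The macroaveraged F1 score reported in the results gives every class equal weight regardless of how many posts it contains. A minimal sketch of the computation is shown below; the function name and the toy labels are illustrative and are not taken from the study.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
score = macro_f1(y_true, y_pred)  # F1(a) = 2/3, F1(b) = 4/5, mean = 11/15
```

Because rare classes count as much as common ones, this metric suits triage tasks where the urgent category is a small fraction of the posts.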


Resampled dimensional reduction for feature representation in machine learning

10.21203/rs.3.pex-1636/v1 ◽

2021 ◽

Author(s):

Herdiantri Sufriyana ◽

Yu Wei Wu ◽

Emily Chia-Yu Su

Keyword(s):

Machine Learning ◽

Parameter Estimation ◽

Prediction Model ◽

Sample Size ◽

Dimensional Reduction ◽

Latent Variables ◽

Feature Representation ◽

Estimated Parameters ◽

Representation Technique ◽

Selection Of

Abstract We aimed to provide a resampling protocol for dimensional reduction that results in a few latent variables. Its application focuses on, but is not limited to, developing a machine learning prediction model, where it increases the sample size relative to the number of candidate predictors. With this feature representation technique, one can improve generalization by preventing latent variables from overfitting the data used to conduct the dimensional reduction. However, this technique may require more computational capacity and time. The key stages consist of derivation of latent variables from multiple resampling subsets, parameter estimation of latent variables in the population, and selection of latent variables transformed by the estimated parameters.


Machine Learning and Prediction-Based Resource Management in IoT Considering Qos

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1705.078219 ◽

2019 ◽

Vol 8 (2) ◽

pp. 687-694

Keyword(s):

Machine Learning ◽

Prediction Model ◽

Short Term Memory ◽

Real Data ◽

Computing Methods ◽

Energy Utilization ◽

Real Field ◽

Proposed Model ◽

Improved Performance ◽

Video Sensors

The Internet of Things (IoT) is one of the fast-growing technology paradigms used in every sector, and Quality of Service (QoS) is a critical component of such systems from the usage perspective of ProSumers (producers and consumers). Most recent research on QoS in IoT has used Machine Learning (ML) techniques as one of the computing methods for improved performance and solutions. The adoption of Machine Learning and its methodologies has become a common trend and need in every technology and domain area, including open source frameworks, task-specific algorithms, and AI and ML techniques. In this work we propose an ML-based prediction model for resource optimization in the IoT environment for QoS provisioning. The proposed methodology is implemented using a multi-layer neural network (MNN) for Long Short Term Memory (LSTM) learning in a layered IoT environment. The model considers resources such as bandwidth and energy as QoS parameters and provides the required QoS through efficient utilization of these resources. The performance of the proposed model is evaluated in a real field implementation on a civil construction project, where real data is collected using video sensors and mobile devices as edge nodes. The prediction model is observed to improve bandwidth and energy utilization, in turn providing the required QoS in the IoT environment.


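Separation of Active Compounds in Sausage Fruit (Kigelia africana) by Solid-Liquid Extraction (Leaching) (Pemisahan Bahan Aktif dalam Buah Sosis (Kigelia africana) dengan Metode Ekstraksi Padat-Cair (Leaching))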

FLUIDA ◽

10.35313/fluida.v13i1.1603 ◽

2020 ◽

Vol 13 (1) ◽

pp. 17-23

Author(s):

Ahmad Fauzan ◽

Mukhtar Ghozali ◽

Tri Reksa Saputra ◽

Heni Khautsar Muchtari ◽

Maria Rosa Mistika Mopa

Keyword(s):

Solvent Extraction ◽

Raw Materials ◽

Human Life ◽

Operating Time ◽

Soxhlet Extraction ◽

Extraction Process ◽

Chemical Substances ◽

Solid Ratio ◽

Natural Substances ◽

Qualitative Test

ABSTRACT The cosmetic and pharmaceutical industries are two examples of industries that currently use many chemical substances as raw materials for their products. However, there has been some research on natural substances to replace chemical substances as raw materials. One of these natural substances is sausage fruit. Sausage fruit contains bioactive compounds that are important for human life, such as flavonoids, iridoids, naphthoquinones, and coumarins. In this research, the extraction of bioactive compounds from sausage fruit with the soxhlet extraction process was studied. The purpose of the experiment was to determine the best type of solvent, solvent-to-solid ratio, and operating time by varying the type of solvent (methanol and 96% ethanol), the ratio of solvent volume to solid weight (8:1, 10:1, 12:1), and the operating time (1, 2, and 3 hours). The experiment found the following optimal extraction conditions: methanol as the solvent, an extraction time of 3 hours, and a solvent-to-solid ratio of 10:1, with a yield of 33.12%. A qualitative test showed that the extract contains flavonoids, iridoids, naphthoquinones, and coumarins. A quantitative test showed that the methanol extract contains 5168 ppm of flavonoids. Keywords: Sausage fruit, extraction, flavonoid, iridoid, naphthoquinone


Prediction of Hormone-Binding Proteins Based on K-mer Feature Representation and Naive Bayes

Frontiers in Genetics ◽

10.3389/fgene.2021.797641 ◽

2021 ◽

Vol 12 ◽

Author(s):

Yuxin Guo ◽

Liping Hou ◽

Wen Zhu ◽

Peng Wang

Keyword(s):

Prediction Model ◽

Naive Bayes ◽

Experimental Period ◽

Naïve Bayes ◽

Feature Representation ◽

Hormone Binding ◽

Hormone Binding Protein ◽

Representation Method ◽

Optimal Feature

Hormone binding protein (HBP) is a soluble carrier protein that interacts selectively with different types of hormones and has various effects on the body’s life activities. HBPs play an important role in the growth process of organisms, but their specific role is still unclear. Therefore, correctly identifying HBPs is the first step towards understanding and studying their biological function. However, due to their high cost and long experimental period, it is difficult for traditional biochemical experiments to correctly identify HBPs from an increasing number of proteins, so the real characterization of HBPs has become a challenging task for researchers. To measure the effectiveness of HBPs, an accurate and reliable prediction model for their identification is desirable. In this paper, we construct the prediction model HBP_NB. First, HBPs data were collected from the UniProt database, and a dataset was established. Then, based on the established high-quality dataset, the k-mer (K = 3) feature representation method was used to extract features. Second, the feature selection algorithm was used to reduce the dimensionality of the extracted features and select the appropriate optimal feature set. Finally, the selected features are input into Naive Bayes to construct the prediction model, and the model is evaluated by using 10-fold cross-validation. The final results were 95.45% accuracy, 94.17% sensitivity and 96.73% specificity. These results indicate that our model is feasible and effective.
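The pipeline in this abstract (k-mer features with K = 3, then Naive Bayes) can be sketched in a few lines. This is not the HBP_NB implementation: the class name is hypothetical, the multinomial variant with Laplace smoothing is an assumption because the abstract does not name the Naive Bayes variant, the feature selection step is omitted, and the training strings are made-up toy sequences rather than UniProt data.

```python
import math
from collections import Counter


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


class KmerNaiveBayes:
    """Multinomial Naive Bayes over k-mer counts with Laplace smoothing."""

    def __init__(self, k=3, alpha=1.0):
        self.k = k
        self.alpha = alpha

    def fit(self, seqs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for seq, label in zip(seqs, labels):
            self.counts[label].update(kmer_counts(seq, self.k))
        self.vocab = set().union(*self.counts.values())
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, seq):
        feats = kmer_counts(seq, self.k)

        def log_score(c):
            denom = self.totals[c] + self.alpha * len(self.vocab)
            score = math.log(self.prior[c])
            for kmer, n in feats.items():
                if kmer in self.vocab:  # k-mers unseen in training are skipped
                    score += n * math.log((self.counts[c][kmer] + self.alpha) / denom)
            return score

        return max(self.classes, key=log_score)


# Toy strings only; real inputs would be protein sequences with known labels.
train_seqs = ["LKLKLKLK", "LKLKLK", "GSGSGSGS", "GSGSGS"]
train_labels = ["HBP", "HBP", "non-HBP", "non-HBP"]
model = KmerNaiveBayes(k=3).fit(train_seqs, train_labels)
print(model.predict("KLKL"))  # HBP
print(model.predict("SGSG"))  # non-HBP
```

With 20 amino acids, K = 3 yields up to 8,000 possible k-mers, which is why a feature selection step before the classifier, as described in the abstract, is a natural addition.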


Robust predictions of specialized metabolism genes through machine learning

10.1101/304873 ◽

2018 ◽

Author(s):

Bethany M. Moore ◽

Peipei Wang ◽

Pengxiang Fan ◽

Bryan Leong ◽

Craig A. Schenck ◽

...

Keyword(s):

Machine Learning ◽

Prediction Model ◽

Gene Networks ◽

Experimental Studies ◽

Plant Genome ◽

Primary Metabolism ◽

Specialized Metabolism ◽

Specialized Metabolites ◽

General Metabolism ◽

Pattern Sequence

Abstract Plant specialized metabolism (SM) enzymes produce lineage-specific metabolites with important ecological, evolutionary, and biotechnological implications. Using Arabidopsis thaliana as a model, we identified distinguishing characteristics of SM and GM (general metabolism, traditionally referred to as primary metabolism) genes through a detailed study of features including duplication pattern, sequence conservation, transcription, protein domain content, and gene network properties. Analysis of multiple sets of benchmark genes revealed that SM genes tend to be tandemly duplicated, co-expressed with their paralogs, narrowly expressed at lower levels, less conserved, and less well connected in gene networks relative to GM genes. Although the values of each of these features significantly differed between SM and GM genes, any single feature was ineffective at predicting SM from GM genes. Using machine learning methods to integrate all features, a well performing prediction model was established with a true positive rate of 0.87 and a true negative rate of 0.71. In addition, 86% of known SM genes not used to create the machine learning model were predicted as SM genes, further demonstrating its accuracy. We also demonstrated that the model could be further improved when we distinguished between SM, GM, and junction genes responsible for reactions shared by SM and GM pathways. Application of the prediction model led to the identification of 1,217 A. thaliana genes with previously unknown functions, providing a global, high-confidence estimate of SM gene content in a plant genome.
Significance Specialized metabolites are critical for plant-environment interactions, e.g., attracting pollinators or defending against herbivores, and are important sources of plant-based pharmaceuticals. However, it is unclear what proportion of enzyme-encoding genes play roles in specialized metabolism (SM) as opposed to general metabolism (GM) in any plant species. This is because of the diversity of specialized metabolites and the considerable number of incompletely characterized pathways responsible for their production. In addition, SM gene ancestors frequently played roles in GM. We evaluate features distinguishing SM and GM genes and build a computational model that accurately predicts SM genes. Our predictions provide candidates for experimental studies, and our modeling approach can be applied to other species that produce medicinally or industrially useful compounds.
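The true positive rate and true negative rate quoted above can be computed from binary predictions as in the sketch below. The function name and the toy labels are illustrative only; 1 stands for an SM gene and 0 for a GM gene by assumption.

```python
def tpr_tnr(y_true, y_pred, positive=1):
    """Return (true positive rate, true negative rate) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # share of positives recovered
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # share of negatives recovered
    return tpr, tnr


gene_true = [1, 1, 1, 0, 0]
gene_pred = [1, 1, 0, 0, 1]
rates = tpr_tnr(gene_true, gene_pred)  # TPR = 2/3, TNR = 1/2
```

Reporting both rates, rather than accuracy alone, shows how the model trades off missed SM genes against GM genes wrongly flagged as SM.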


Prediction of Flash Flood using Rainfall by MLP Classifier

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.f9880.059120 ◽

2020 ◽

Vol 9 (1) ◽

pp. 425-429

Keyword(s):

Prediction Model ◽

Nearest Neighbor ◽

Weather Forecasting ◽

Flash Flood ◽

Human Life ◽

Cost Effective ◽

Rainfall Data ◽

Support Vector ◽

K Nearest Neighbor ◽

Flood Prediction

Floods are among the most damaging natural disasters. A flood can result in a huge loss of human lives and property. It can also affect agricultural land and destroy cultivated crops and trees. A flood can occur as a result of surface runoff from melting snow, prolonged rain, inadequate drainage of rainwater, or the collapse of dams. Today people have destroyed rivers and lakes and have turned natural water storage pools into buildings and construction land. Flash floods can develop quickly, within a few hours, compared with a regular flood. Research on flood prediction has advanced in order to reduce the loss of human life, property damage, and other flood-related problems. Machine learning methods are widely used to build efficient prediction models for weather forecasting. This advancement in prediction systems provides cost-effective solutions and better performance. In this paper, a prediction model is constructed using rainfall data to predict the occurrence of floods due to rainfall. The model predicts whether a “flood may happen or not” based on the rainfall range for particular locations. Indian district rainfall data is used to build the prediction model. The dataset is used to train various algorithms, including Linear Regression, K-Nearest Neighbor, Support Vector Machine, and Multilayer Perceptron (MLP). Among these, the MLP algorithm performed best, with the highest accuracy of 97.40%. The MLP flash flood prediction model can help climate scientists predict floods during a heavy downpour.
