ON THE USE OF MULTIPLE INSTANCE LEARNING FOR DATA CLASSIFICATION

2021 ◽  
Vol 25 (Special) ◽  
pp. 1-127-1-137
Author(s):  
Nibras Z. Salih ◽  
◽  
Walaa Khalaf ◽  

In the multiple-instance learning framework, instances are arranged into bags. Each bag contains several instances; labels are not available for individual instances, only for each bag. In single-instance learning, by contrast, each instance consists of a single feature vector and carries its own label. This paper examines the distinction between these paradigms to see whether it is appropriate to cast the problem within a multiple-instance framework. In single-instance learning, two datasets (a students' dataset and the iris dataset) are classified using the Naïve Bayes Classifier (NBC), Multilayer Perceptron (MLP), Support Vector Machine (SVM), and Sequential Minimal Optimization (SMO), while SimpleMI, MIWrapper, and MIBoost are used in multiple-instance learning. Leave-One-Out Cross-Validation (LOOCV) and five- and ten-fold cross-validation (5-CV, 10-CV) are used to evaluate the classification results. A comparison of the results shows that several algorithms are more effective for classification in the multiple-instance setting. The most suitable algorithm for the students' dataset is MIBoost with MLP under LOOCV, with an accuracy of 75%, whereas for the iris dataset it is SimpleMI with SMO under 10-CV, with an accuracy of 99.33%.
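As a minimal sketch of how a bag-level problem can be handed to a single-instance learner, the snippet below collapses each bag into the arithmetic mean of its instance vectors, which is one of the simplest summarisation strategies used by SimpleMI-style wrappers. The toy bags, the labels, and the function name `summarize_bag` are illustrative assumptions, not data or code from the paper.

```python
from statistics import fmean

def summarize_bag(bag):
    """Collapse a bag (list of equal-length feature vectors) into one mean vector."""
    return [fmean(column) for column in zip(*bag)]

# Hypothetical toy data: each bag has a label, its instances do not.
bags = [
    [[1.0, 2.0], [3.0, 4.0]],              # bag 0
    [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]],  # bag 1
]
bag_labels = ["pass", "fail"]

# One feature vector per bag, ready for any single-instance classifier.
single_instance_data = [(summarize_bag(b), y) for b, y in zip(bags, bag_labels)]
```

After this step every bag is an ordinary labelled feature vector, so a learner such as SMO or MLP could be trained on `single_instance_data` without any knowledge of the bag structure.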

2019 ◽  
Vol 11 (21) ◽  
pp. 2512 ◽  
Author(s):  
Nicolas Karasiak ◽  
Jean-François Dejoux ◽  
Mathieu Fauvel ◽  
Jérôme Willm ◽  
Claude Monteil ◽  
...  

Mapping forest composition using multiseasonal optical time series remains a challenge. Highly contrasting results are reported from one study to another, suggesting that the drivers of classification errors are still under-explored. We evaluated the performance of single-year Formosat-2 time series in discriminating tree species in temperate forests in France and investigated how predictions vary statistically and spatially across multiple years. Our objective was to better estimate the impact of spatial autocorrelation in the validation data on measurement accuracy and to understand which drivers in the time series are responsible for classification errors. The experiments were based on 10 Formosat-2 image time series irregularly acquired during the seasonal vegetation cycle from 2006 to 2014. Because of heavy cloud cover in 2006, an alternative 2006 time series using only cloud-free images was added. Thirteen tree species were classified in each single-year dataset using the Support Vector Machine (SVM) algorithm. Performance was assessed using a spatial leave-one-out cross-validation (SLOO-CV) strategy, thereby guaranteeing full independence of the validation samples, and compared with standard non-spatial leave-one-out cross-validation (LOO-CV). The results show relatively close statistical performance from one year to the next despite the differences between the annual time series. Good agreement between years was observed in monospecific plantations of broadleaf species, versus high disparity in other forests composed of different species. A strong positive bias in the accuracy assessment (up to 0.4 of Overall Accuracy (OA)) was also found when spatial dependence in the validation data was not removed. Using the SLOO-CV approach, the average OA values per year ranged from 0.48 for 2006 to 0.60 for 2013, which satisfactorily represents the spatial instability of species prediction between years.
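The idea behind spatial leave-one-out can be sketched as follows: for each held-out sample, training samples that lie within some buffer distance are discarded, so that spatially autocorrelated neighbours cannot leak into training. The coordinates, the `buffer` value, and the function name are illustrative assumptions; the distance threshold actually used in the study is not given in the abstract.

```python
from math import dist

def spatial_loo_splits(coords, buffer):
    """Yield (test_index, train_indices) pairs for spatial leave-one-out.

    Training samples within `buffer` of the test sample are excluded.
    """
    for i, test_xy in enumerate(coords):
        train = [j for j, xy in enumerate(coords)
                 if j != i and dist(test_xy, xy) > buffer]
        yield i, train

# Hypothetical sample locations (e.g. metres in a projected grid).
coords = [(0, 0), (10, 0), (200, 0), (205, 5)]
splits = list(spatial_loo_splits(coords, buffer=50))
```

With distinct sample locations, setting `buffer=0` keeps every other sample in the training set, which corresponds to the standard non-spatial LOO-CV used as the comparison.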


2012 ◽  
Vol 229-231 ◽  
pp. 2276-2279
Author(s):  
Yu An Pan ◽  
Xuan Xiao ◽  
Pu Wang

Antimicrobial peptides (AMPs) are potent, broad-spectrum antibiotics that show potential as novel therapeutic agents. Because identifying new AMPs by experiment is both time-consuming and laborious, this paper approaches the problem through pattern recognition. The work has two main parts. First, up to six kinds of physicochemical property values are selected to encode the AMP sequence as a physical-chemical property matrix (PCM), and an auto and cross covariance transformation is then performed to extract features from the PCM as the representation of the AMP sequence. Second, these feature vectors are fed to a Support Vector Machine (SVM) classifier for training and for recognizing new query AMPs. On a newly constructed AMP benchmark dataset, an overall classification accuracy of about 96% was achieved under rigorous leave-one-out cross-validation. For convenience, a user-friendly web server, AMPpred, has been established at http://icpr.jci.jx.cn/bioinfo/AMPpred. It is anticipated that this online predictor may become a useful bioinformatics tool for molecular biology and drug development, and that its approach will further stimulate the development of methods for predicting peptide attributes.
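For illustration, the auto covariance part of such a transformation can be sketched as below. It uses a commonly cited definition, AC(lag, j) = 1/(L - lag) · Σ (P[i][j] - mean_j)(P[i+lag][j] - mean_j), summed over positions i. Whether the paper uses exactly this normalisation is an assumption, and the tiny matrix is made up.

```python
def auto_covariance(pcm, lag):
    """Auto covariance of each property column of a sequence-by-property matrix.

    pcm[i][j] is the value of property j at sequence position i.
    """
    length = len(pcm)
    if not 0 < lag < length:
        raise ValueError("lag must be between 1 and len(pcm) - 1")
    features = []
    for column in zip(*pcm):
        mean = sum(column) / length
        total = sum((column[i] - mean) * (column[i + lag] - mean)
                    for i in range(length - lag))
        features.append(total / (length - lag))
    return features

# Hypothetical 4-residue peptide encoded with 2 properties.
pcm = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
ac_lag1 = auto_covariance(pcm, lag=1)
```

The practical appeal of this kind of transformation is that it yields one value per property and lag, so peptides of different lengths map to feature vectors of the same size.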


2012 ◽  
Vol 554-556 ◽  
pp. 1628-1631 ◽  
Author(s):  
Tian Hong Gu ◽  
Wei Lv ◽  
Xia Shao ◽  
Wen Cong Lu

Based on the N, O, H and C element contents of objects detected by γ-ray resonance, a support vector classification (SVC) method was used to construct a model for distinguishing high energy materials (HEMs) from ordinary ones. The prediction accuracy was 95.9% in the leave-one-out cross-validation (LOOCV) test. The results indicate that the performance of the SVC model is good enough to detect HEMs in the presence of ordinary materials for security-checking purposes.
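The leave-one-out procedure itself is independent of the classifier, and can be sketched as below. To keep the example self-contained, a nearest-centroid rule stands in for the SVC (it is not the method of the paper), and the four element-content vectors are invented for illustration.

```python
from math import dist

def nearest_centroid_predict(train_x, train_y, query):
    """Stand-in classifier (NOT an SVC): predict the class with the closest mean."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(centroids, key=lambda label: dist(centroids[label], query))

def loocv_accuracy(xs, ys, predict):
    """Hold out each sample once, train on the rest, and return the hit rate."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)

# Hypothetical element-content fractions (N, O, H, C); not the paper's data.
xs = [[0.30, 0.40, 0.05, 0.25], [0.28, 0.42, 0.04, 0.26],
      [0.02, 0.20, 0.10, 0.68], [0.01, 0.25, 0.12, 0.62]]
ys = ["HEM", "HEM", "ordinary", "ordinary"]
accuracy = loocv_accuracy(xs, ys, nearest_centroid_predict)
```

Any function with the same `(train_x, train_y, query)` signature could be passed as `predict`, so an SVC from a machine-learning library could be swapped in without changing the evaluation loop.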


2021 ◽  
Author(s):  
Ιωάννης Μήνου

The greatest challenge for modern computing systems is undoubtedly the efficient storage and retrieval of very large volumes of data. This need has emerged in recent years because of the explosion of data observed on the internet, and it is becoming increasingly important because of the very wide range of information that can be extracted. The field of healthcare and medical data is continuously and rapidly evolving. The use of Big Data in healthcare provides valuable information, as it offers unlimited possibilities for effective storage, processing, SQL queries, and analysis of medical data. The purpose of this dissertation is to study knowledge mining techniques for large-volume data in the field of Health. In parallel, the research aims to study statistical and computational algorithms for analyzing large volumes of health data that result in the production of new knowledge and the extraction of statistically significant information for health professionals. Finally, the dissertation investigates the knowledge of Health Informatics scientists and health professionals regarding Big Data. In this doctoral dissertation, a literature review of the concept of Big Data was carried out. The review covers the definition of Big Data, their characteristics, and their advantages and disadvantages in healthcare. It then addresses the implementation and storage mechanisms of Big Data. In addition, it covers systems for analyzing and processing large volumes of data, programming languages for Big Data, and data mining in healthcare. It also covers the use of Big Data in Europe and worldwide. Finally, the basic principles of the GDPR are presented, along with how it relates to Big Data in healthcare.
Two empirical studies were also conducted. The first study aimed to record the views of Health Informatics scientists on Big Data technology. Data were collected using a questionnaire. The statistical analysis showed a positive response from the sample toward Big Data technology. The second study aimed to record the views of Health Professionals on Big Data technology. Data were collected using a questionnaire. The statistical analysis did not give adequate answers, as the respondents showed a positive attitude toward Big Data while answering that they do not know much about this technology. The last part of the dissertation covers the development of prediction methods for diagnosing patients with cardiovascular diseases. The prediction methods used are: Logistic Regression, Naive Bayes Classifier, Decision Trees, the K-Nearest Neighbors algorithm, the SVM (Support Vector Machine) algorithm, and Random Forest. The development included all stages of data preprocessing, and specific metrics were used to measure the performance of the classifiers. Finally, the classifiers' performance was improved using cross-validation, and the class imbalance problem was also addressed using the SMOTE method.
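The core idea of SMOTE can be sketched in a few lines: a synthetic minority-class sample is placed at a random position on the segment between an existing minority sample and one of its nearest minority-class neighbours. The sketch below is a simplified illustration under that assumption, with made-up points and a hypothetical function name; it is not the implementation used in the dissertation, and real work would normally rely on a tested library.

```python
import random
from math import dist

def smote_sample(minority, k, rng):
    """Create one synthetic point between a minority sample and a near neighbour."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

# Hypothetical minority-class points (two features each).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0]]
rng = random.Random(0)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(3)]
```

Because each synthetic point is an interpolation between two real minority samples, it always stays inside the region already occupied by the minority class. Oversampling of this kind is usually applied to the training folds only, so that synthetic points do not leak into the evaluation data.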


Cells ◽  
2019 ◽  
Vol 8 (9) ◽  
pp. 1040 ◽  
Author(s):  
Li Zhang ◽  
Xing Chen ◽  
Jun Yin

The important role of microRNAs (miRNAs) in the formation, development, diagnosis, and treatment of diseases has recently attracted much attention among researchers. In this study, we present an unsupervised deep learning model, the variational autoencoder for miRNA–disease association prediction (VAEMDA). By combining the integrated miRNA similarity and the integrated disease similarity, respectively, with known miRNA–disease associations, we constructed two spliced matrices, each of which was used to train a variational autoencoder (VAE). The final predicted association scores between miRNAs and diseases were obtained by integrating the scores from the two trained VAE models. Unlike previous models, VAEMDA can avoid the noise introduced by random selection of negative samples and reveal associations between miRNAs and diseases from the perspective of data distribution. Compared with previous methods, VAEMDA obtained higher areas under the receiver operating characteristic curve (AUCs) of 0.9118, 0.8652, and 0.9091 ± 0.0065 in global leave-one-out cross validation (LOOCV), local LOOCV, and five-fold cross validation, respectively. Further, the AUCs of VAEMDA were 0.8250 and 0.8237 in global leave-one-disease-out cross validation (LODOCV) and local LODOCV, respectively. In three different types of case studies on three important diseases, the results showed that most of the top 50 potentially associated miRNAs were verified by databases and the literature.
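Since the results above are reported as AUCs, a brief sketch of how an AUC can be computed from prediction scores may help. It uses the rank interpretation of the AUC: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as half. The scores and labels are invented, and this is a generic illustration rather than the evaluation code of the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical association scores; 1 marks a known association.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
value = auc(scores, labels)
```

In this toy case three of the four positive-negative pairs are ordered correctly, so `value` is 0.75. The pairwise form is quadratic in the number of samples and is meant for clarity, not for large score matrices.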


2014 ◽  
Vol 136 (3) ◽  
Author(s):  
Jie Zhang ◽  
Souma Chowdhury ◽  
Ali Mehmani ◽  
Achille Messac

This paper investigates the characterization of the uncertainty in the prediction of surrogate models. In the practice of engineering, where predictive models are pervasively used, the knowledge of the level of modeling error in any region of the design space is uniquely helpful for design exploration and model improvement. The lack of methods that can explore the spatial variation of surrogate error levels in a wide variety of surrogates (i.e., model-independent methods) leaves an important gap in our ability to perform design domain exploration. We develop a novel framework, called domain segmentation based on uncertainty in the surrogate (DSUS), to segregate the design domain based on the level of local errors. The errors in the surrogate estimation are classified into physically meaningful classes based on the user's understanding of the system and/or the accuracy requirements for the concerned system analysis. The leave-one-out cross-validation technique is used to quantify the local errors. A support vector machine (SVM) is implemented to determine the boundaries between error classes, and to classify any new design point into the pertinent error class. We also investigate the effectiveness of the leave-one-out cross-validation technique in providing a local error measure, through comparison with actual local errors. The utility of the DSUS framework is illustrated using two different surrogate modeling methods: (i) the Kriging method and (ii) the adaptive hybrid functions (AHF). The DSUS framework is applied to a series of standard test problems and engineering problems. In these case studies, the DSUS framework is observed to provide reasonable accuracy in classifying the design space based on error levels. More than 90% of the test points are accurately classified into the appropriate error classes.


2018 ◽  
Vol 5 (2) ◽  
pp. 234-254 ◽  
Author(s):  
Inzar Salfikar ◽  
Indra Adji Sulistijono ◽  
Achmad Basuki

Finding victims at a disaster site is the primary goal of Search-and-Rescue (SAR) operations. Many technologies have been developed through research for locating disaster victims with aerial imaging, but most of them struggle to detect victims at tsunami disaster sites, where victims and backgrounds look similar. This research collects post-tsunami aerial imagery from the internet to build a dataset and a model for detecting tsunami disaster victims. The dataset is built from the distances between the features of every sample, computed with the Histogram-of-Oriented-Gradients (HOG) method. We use the longest distance to collect samples from photos and generate victim and non-victim samples. We define the sample-collection steps by measuring the HOG feature distance across all samples: the samples separated by the longest distance are taken as candidates for the dataset and are then manually classified as victim (positive) or non-victim (negative). The tsunami victim dataset was then re-analyzed using Leave-One-Out (LOO) cross-validation with a Support Vector Machine (SVM). The experimental results on two test photos show 61.70% precision, 77.60% accuracy, 74.36% recall, and a 67.44% F-measure in distinguishing victim (positive) from non-victim (negative) samples.
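The reported F-measure can be checked against the reported precision and recall. The short sketch below assumes the F-measure is the balanced F1 score, i.e. the harmonic mean of precision and recall; the abstract does not state which variant was used.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the abstract.
f1 = f_measure(0.6170, 0.7436)
```

Under that assumption `f1` evaluates to about 0.6744, which is consistent with the 67.44% stated in the results.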


2019 ◽  
Vol 3 (1) ◽  
pp. 54-62
Author(s):  
Razi Aziz Syahputro ◽  
Widodo ◽  
Hamidillah Ajie

This research is motivated by the need for a classification system to help the Department of Electrical Engineering, particularly the PTIK Study Program, classify thesis titles by area of specialization. Before building the system, several existing classification algorithms needed to be considered; this study therefore selected 3 of the top 10 algorithms according to ICDM in 2006. Classifying short text documents such as student thesis titles is harder than classifying long text documents, because the fewer the words, the harder the classification. The aim of this study is therefore to determine which algorithm is most effective for classifying thesis titles. The study consists of several stages: data collection, grouping of the data through a questionnaire completed by expert lecturers, text pre-processing, term weighting using the vector space model and tf-idf, evaluation with k-fold cross-validation, classification using k-nearest neighbor, the naïve Bayes classifier, and support vector machine, and analysis with a confusion matrix. The experiments used 266 thesis titles from PTIK UNJ students of the 2010-2013 cohorts, with the most recent data coming from thesis defenses in semester 105 (odd semester 2016/2017). Among these algorithms, the most efficient was found to be the support vector machine, with an accuracy of 82% over 10 trials.
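The term-weighting stage can be illustrated with a small tf-idf sketch. It assumes raw term counts and idf = log(N / df), which is only one of several common tf-idf variants; the abstract does not say which one was used, and the three tokenised titles are invented examples.

```python
from collections import Counter
from math import log

def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document.

    Uses raw term frequency and idf = log(N / df); other variants exist.
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({term: count * log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return weights

# Hypothetical, already pre-processed thesis titles.
titles = [
    ["sistem", "informasi", "akademik"],
    ["sistem", "pakar", "diagnosa"],
    ["jaringan", "komputer"],
]
vectors = tf_idf(titles)
```

A word such as "sistem" that appears in two of the three titles receives a lower weight than a word unique to one title. This matters for short texts, where each title contributes only a handful of terms to the vector space model.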


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3561 ◽  
Author(s):  
Ravindra Kumar ◽  
Bandana Kumari ◽  
Manish Kumar

Background: The endoplasmic reticulum plays an important role in many cellular processes, including protein synthesis, folding, and post-translational processing of newly synthesized proteins. It is also the site of quality control for misfolded proteins and the entry point of extracellular proteins into the secretory pathway. Hence, at any given time, the endoplasmic reticulum contains two different cohorts of proteins: (i) proteins involved in endoplasmic reticulum-specific functions, which reside in the lumen of the endoplasmic reticulum and are called endoplasmic reticulum resident proteins, and (ii) proteins that are in the process of moving to the extracellular space. Thus, endoplasmic reticulum resident proteins must somehow be distinguished from newly synthesized secretory proteins, which pass through the endoplasmic reticulum on their way out of the cell. Only about 50% of the proteins used as training data in this study had an endoplasmic reticulum retention signal, which shows that these signals are not necessarily present in all endoplasmic reticulum resident proteins. This also strongly indicates a role for additional factors in the retention of endoplasmic reticulum-specific proteins inside the endoplasmic reticulum. Methods: This is a support vector machine based method, in which different forms of protein features were used as inputs to the support vector machine to develop the prediction models. During training, the leave-one-out approach to cross-validation was used. Maximum performance was obtained with a combination of amino acid compositions of different parts of the proteins. Results: In this study, we report a novel support vector machine based method for predicting endoplasmic reticulum resident proteins, named ERPred. During training we achieved a maximum accuracy of 81.42% with the leave-one-out approach to cross-validation. When evaluated on an independent dataset, ERPred achieved a sensitivity of 72.31% and a specificity of 83.69%.
We have also annotated six different proteomes to predict candidate endoplasmic reticulum resident proteins in them. A webserver, ERPred, was developed to make the method available to the scientific community; it can be accessed at http://proteininformatics.org/mkumar/erpred/index.html. Discussion: We found that, out of 124 proteins in the training dataset, only 66 had endoplasmic reticulum retention signals, which shows that these signals are not an absolute necessity for endoplasmic reticulum resident proteins to remain inside the endoplasmic reticulum. This observation also strongly indicates a role for additional factors in the retention of proteins inside the endoplasmic reticulum. Our proposed predictor, ERPred, is a signal-independent tool. It is tuned for the prediction of endoplasmic reticulum resident proteins even if the query protein does not contain a specific ER-retention signal.
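The kind of feature referred to in the Methods can be sketched as follows: the amino acid composition is the fraction of each of the 20 standard residues in a sequence, and compositions of different parts of a protein can be concatenated into one vector. How ERPred actually splits the proteins is not stated in the abstract, so the equal-parts split, the function names, and the toy sequence below are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Fraction of each of the 20 standard amino acids in a protein sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]

def split_composition(sequence, parts=2):
    """Concatenate the compositions of `parts` consecutive sequence segments."""
    if len(sequence) < parts:
        raise ValueError("sequence is shorter than the number of parts")
    bounds = [len(sequence) * i // parts for i in range(parts + 1)]
    features = []
    for start, end in zip(bounds, bounds[1:]):
        features.extend(amino_acid_composition(sequence[start:end]))
    return features

# Hypothetical sequence ending in the well-known KDEL retention motif.
features = split_composition("MKAAGKDEL", parts=2)
```

The result has 20 values per segment (40 here), regardless of protein length, which makes it directly usable as a fixed-size input to a support vector machine. Because the features describe overall composition rather than a specific motif, a predictor built on them does not depend on the presence of a retention signal.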

