Highlighted Document Image Classification

2021 ◽  
Vol 2021 (29) ◽  
pp. 154-159
Author(s):  
Yafei Mao ◽  
Yufang Sun ◽  
Peter Bauer ◽  
Todd Harris ◽  
Mark Shaw ◽  
...  

Many studies address document image classification, but most are neither designed for resource-constrained devices such as printers nor focused on documents with highlighter pen marks. To enable printers to better discriminate highlighted documents, we designed a set of features in CIE Lch(a* b*) space for use with a support vector machine. The features include two gamut-based features and six low-level color features. The gamut-based features are obtained by first identifying the highlight pixels and then computing the distance from those pixels to the boundary of the printer gamut. The low-level color features are built upon the color distribution information of the image blocks. The best subset of the existing and new features is constructed by sequential forward floating selection (SFFS). Leave-one-out cross-validation is performed on a dataset of 400 document images to evaluate the effectiveness of the classification model. The cross-validation results indicate significant improvements over the baseline highlighted document classification model.
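As a rough illustration of working in the cylindrical Lch(a* b*) representation, the sketch below converts L*a*b* values to lightness, chroma, and hue using the standard relations C = sqrt(a² + b²) and h = atan2(b, a). The `is_highlight_candidate` helper and its thresholds are hypothetical placeholders for illustration only; the abstract does not state how highlight pixels are actually identified.

```python
import math


def lab_to_lch(L, a, b):
    """Convert CIE L*a*b* to cylindrical L*C*h, with hue in degrees in [0, 360)."""
    chroma = math.hypot(a, b)                       # C = sqrt(a^2 + b^2)
    hue = math.degrees(math.atan2(b, a)) % 360.0    # h = atan2(b, a)
    return L, chroma, hue


def is_highlight_candidate(L, a, b, min_lightness=70.0, min_chroma=40.0):
    """Hypothetical test: bright, strongly saturated pixels are flagged.

    The thresholds are assumptions chosen for illustration, not values
    taken from the paper.
    """
    _, chroma, _ = lab_to_lch(L, a, b)
    return L >= min_lightness and chroma >= min_chroma
```

Working with chroma and hue rather than raw a* and b* keeps "how saturated" and "which color" as separate quantities, which is convenient when reasoning about distance to a gamut boundary.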

Author(s):  
Ting Liu ◽  
Jia-Mao Chen ◽  
Dan Zhang ◽  
Qian Zhang ◽  
Bowen Peng ◽  
...  

Apolipoproteins are a group of plasma proteins associated with a variety of diseases, such as hyperlipidemia, atherosclerosis, Alzheimer’s disease, and diabetes. To investigate the function of apolipoproteins and to develop effective targets for related diseases, it is necessary to identify and classify apolipoproteins accurately. Although apolipoproteins can be identified accurately through biochemical experiments, such experiments are expensive and time-consuming. This work aims to establish an efficient and accurate prediction model for recognizing apolipoproteins and their subfamilies. We first constructed a high-quality benchmark dataset of 270 apolipoproteins and 535 non-apolipoproteins. Based on this dataset, pseudo-amino acid composition (PseAAC) and composition of k-spaced amino acid pairs (CKSAAP) were used as input vectors. To improve prediction accuracy and eliminate redundant information, analysis of variance (ANOVA) was used to rank the features, and incremental feature selection was used to obtain the best feature subset. A support vector machine (SVM) was used to construct the classification model, which achieved an accuracy of 97.27%, a sensitivity of 96.30%, and a specificity of 97.76% for discriminating apolipoproteins from non-apolipoproteins in 10-fold cross-validation. The same process was repeated to generate a second model for predicting apolipoprotein subfamilies, which achieved an overall accuracy of 95.93% in 10-fold cross-validation. Based on the proposed model, a convenient webserver called ApoPred was established, which can be freely accessed at http://tang-biolab.com/server/ApoPred/service.html. We expect that this work will contribute to apolipoprotein function research and drug development for the relevant diseases.
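The ranking-then-growing procedure described above can be sketched as follows. This is a minimal illustration assuming the usual one-way ANOVA F statistic (between-class mean square over within-class mean square) and an incremental search that adds ranked features one at a time. The `evaluate` callback is a hypothetical stand-in for whatever cross-validated classifier score is used; it is not the authors' code.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {y: sum(g) / len(g) for y, g in groups.items()}
    # Between-class and within-class mean squares.
    between = sum(len(g) * (means[y] - grand_mean) ** 2
                  for y, g in groups.items()) / (k - 1)
    within = sum((v - means[y]) ** 2
                 for y, g in groups.items() for v in g) / (n - k)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(X, y):
    """Return feature indices sorted from highest to lowest F statistic."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def incremental_feature_selection(ranking, evaluate):
    """Grow the subset along the ranking and keep the best-scoring prefix.

    `evaluate(subset)` is an assumed callback returning a score such as
    cross-validated accuracy for the given list of feature indices.
    """
    best_subset, best_score = [], float("-inf")
    for size in range(1, len(ranking) + 1):
        subset = ranking[:size]
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```

Because only prefixes of the ranking are evaluated, the search costs one model evaluation per feature rather than one per possible subset.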


2020 ◽  
Vol 15 ◽  
Author(s):  
Chun Qiu ◽  
Sai Li ◽  
Shenghui Yang ◽  
Lin Wang ◽  
Aihui Zeng ◽  
...  

Aim: To identify genes related to the mechanisms underlying the occurrence of glioma and to build a prediction model for glioblastomas. Background: The morbidity and mortality of glioblastomas are very high, which seriously endangers human health. At present, many investigations of gliomas aim mainly to understand the cause and mechanism of these tumors at the molecular level and to explore clinical diagnosis and treatment methods. However, there is still no effective method for early diagnosis, prevention, or treatment of this disease. Methods: First, gene expression profiles were downloaded from GEO. Then, differentially expressed genes (DEGs) between the disease samples and the control samples were identified. After that, GO and KEGG enrichment analyses of the DEGs were performed with DAVID. Furthermore, the correlation-based feature subset (CFS) method was applied to select key DEGs. Finally, a classification model separating the glioblastoma samples from the controls was built with a Support Vector Machine (SVM) based on the selected key genes. Results and Discussion: Thirty-six DEGs, including 17 upregulated and 19 downregulated genes, were selected by the CFS method as the feature genes for the classification model between the glioma samples and the control samples. The accuracy of the classification model was 76.25% in a 10-fold cross-validation test and 70.3% in an independent set test. In addition, PPP2R2B and CYBB were also among the top 5 hub genes screened by the protein–protein interaction (PPI) network. Conclusions: This study indicated that the CFS method is a useful tool for identifying key genes in glioblastomas. We also predicted that genes such as PPP2R2B and CYBB might be potential biomarkers for the diagnosis of glioblastomas.


2016 ◽  
Vol 36 (suppl_1) ◽  
Author(s):  
Hua Tang ◽  
Hao Lin

Objective: Apolipoproteins are of great physiological importance and are associated with diseases such as dyslipidemia, thrombogenesis and angiocardiopathy. Apolipoproteins have therefore emerged as key risk markers and important research targets, yet the types of apolipoproteins have not been fully elucidated. Accurate identification of apolipoproteins is crucial to the understanding of cardiovascular diseases and to drug design. The aim of this study is to develop a powerful model to precisely identify apolipoproteins. Approach and Results: We manually collected from UniProt a non-redundant dataset of 53 apolipoproteins and 136 non-apolipoproteins with sequence identity of less than 40%. After formulating the protein sequence samples with g-gap dipeptide composition (here g = 1 to 10), analysis of variance (ANOVA) was adopted to find the feature subset that achieves the best accuracy. A Support Vector Machine (SVM) was then used to perform classification. The predictive model was evaluated using five-fold cross-validation, which yielded a sensitivity of 96.2%, a specificity of 99.3%, and an accuracy of 98.4%. The study indicated that the proposed method could be a feasible means of conducting preliminary analyses of apolipoproteins. Conclusion: We demonstrated that apolipoproteins can be predicted from their primary sequences. We also discovered a distinctive dipeptide distribution in apolipoproteins. These findings open new perspectives for improving apolipoprotein prediction by considering the specific dipeptides. We expect that these findings will help to improve drug development for angiocardiopathy. Key words: Apolipoproteins, Angiocardiopathy, Support Vector Machine
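To make the g-gap dipeptide encoding concrete, here is a minimal sketch. It assumes the common convention that g is the number of residues lying between the two paired positions, and that each of the 400 pair frequencies is normalized by the number of valid pairs counted. Pairs containing non-standard residue letters are skipped, which is a choice made for this illustration rather than something stated in the abstract.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order that defines the feature layout.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def g_gap_dipeptide_composition(sequence, g):
    """Return a 400-dimensional frequency vector of residue pairs separated by g residues."""
    seq = sequence.upper()
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:          # skip pairs with non-standard letters
            counts[pair] += 1
            total += 1
    if total == 0:                  # sequence too short for this gap
        return [0.0] * len(PAIRS)
    return [counts[p] / total for p in PAIRS]
```

For example, `g_gap_dipeptide_composition("ACAC", 1)` pairs position 0 with 2 and position 1 with 3, so only the "AA" and "CC" entries are non-zero. Repeating the call for each g in the range of interest gives one 400-dimensional vector per gap.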


Author(s):  
Alok Kumar Shukla ◽  
Pradeep Singh ◽  
Manu Vardhan

The explosion of high-dimensional datasets in scientific repositories has encouraged interdisciplinary research in data mining, pattern recognition and bioinformatics. The fundamental problem for any individual Feature Selection (FS) method is to extract informative features for the classification model, and to detect malignant disease, at low computational cost. In addition, existing FS approaches overlook the fact that, for a given cardinality, there can be several subsets with similar information. This paper introduces a novel hybrid FS algorithm for classification problems, called Filter-Wrapper Feature Selection (FWFS), which addresses the limitations of existing methods. In the proposed model, the front-end filter ranking method, Conditional Mutual Information Maximization (CMIM), selects the highly ranked feature subset, while the subsequent Binary Genetic Algorithm (BGA) accelerates the search for significant feature subsets. One of the merits of the proposed method is that, unlike an exhaustive method, it speeds up the FS procedure without sacrificing classification accuracy on the reduced dataset when a learning model is applied to the selected feature subsets. The efficacy of the proposed FWFS method is examined with a Naive Bayes (NB) classifier, which serves as the fitness function. The effectiveness of the selected feature subset is evaluated using several classifiers on five biological datasets and five UCI datasets of varied dimensionality and number of instances. The experimental results show that the proposed method significantly reduces the number of features and outperforms the existing methods. For the microarray datasets, the lowest classification accuracy is 61.24% on the SRBCT dataset and the highest is 99.32% on the Diffuse large B-cell lymphoma (DLBCL) dataset. For the UCI datasets, the lowest classification accuracy is 40.04% on Lymphography using k-nearest neighbor (k-NN) and the highest is 99.05% on Ionosphere using a support vector machine (SVM).


2020 ◽  
Vol 43 (1) ◽  
pp. 103-125
Author(s):  
Yi Zhong ◽  
Jianghua He ◽  
Prabhakar Chalise

With the advent of high-throughput technologies, high-dimensional datasets are increasingly available. This has not only opened up new insight into biological systems but also posed analytical challenges. One important problem is the selection of an informative feature subset and the prediction of future outcomes. It is crucial that models are not overfitted and give accurate results with new data. In addition, reliable identification of informative features with high predictive power (feature selection) is of interest in clinical settings. We propose a two-step framework for feature selection and classification model construction, which uses a nested and repeated cross-validation method. We evaluated our approach using both simulated data and two publicly available gene expression datasets. The proposed method showed comparatively better predictive accuracy for new cases than the standard cross-validation method.
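The general shape of nested, repeated cross-validation can be sketched with index bookkeeping alone. In this minimal illustration the outer folds are reserved for estimating performance, while feature selection and tuning would be carried out only on the inner folds built from each outer training set. The fold counts, the number of repeats, and the function names are assumptions for this sketch, not details taken from the paper.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle the indices and deal them into k roughly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=5, inner_k=3, repeats=2, seed=0):
    """Yield (repeat, outer_train, outer_test, inner_splits) tuples.

    inner_splits is a list of (inner_train, inner_validation) pairs drawn
    only from outer_train, so the outer test fold is never seen during
    feature selection or tuning.
    """
    rng = random.Random(seed)
    for rep in range(repeats):
        outer_folds = k_fold_indices(range(n_samples), outer_k, rng)
        for o, outer_test in enumerate(outer_folds):
            outer_train = [i for f, fold in enumerate(outer_folds)
                           if f != o for i in fold]
            inner_folds = k_fold_indices(outer_train, inner_k, rng)
            inner_splits = []
            for j, inner_val in enumerate(inner_folds):
                inner_train = [i for f, fold in enumerate(inner_folds)
                               if f != j for i in fold]
                inner_splits.append((inner_train, inner_val))
            yield rep, outer_train, outer_test, inner_splits
```

Keeping the selection step inside the inner loop is what prevents the optimistic bias that arises when features are chosen on the full dataset before cross-validating.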


2019 ◽  
Vol 11 (21) ◽  
pp. 2512 ◽  
Author(s):  
Nicolas Karasiak ◽  
Jean-François Dejoux ◽  
Mathieu Fauvel ◽  
Jérôme Willm ◽  
Claude Monteil ◽  
...  

Mapping forest composition using multiseasonal optical time series remains a challenge. Highly contrasting results are reported from one study to another, suggesting that the drivers of classification errors are still under-explored. We evaluated the performance of single-year Formosat-2 time series in discriminating tree species in temperate forests in France and investigated how predictions vary statistically and spatially across multiple years. Our objective was to better estimate the impact of spatial autocorrelation in the validation data on measurement accuracy and to understand which drivers in the time series are responsible for classification errors. The experiments were based on 10 Formosat-2 image time series irregularly acquired during the seasonal vegetation cycle from 2006 to 2014. Because of heavy cloud cover in 2006, an alternative 2006 time series using only cloud-free images was added. Thirteen tree species were classified in each single-year dataset using the Support Vector Machine (SVM) algorithm. Performance was assessed using a spatial leave-one-out cross-validation (SLOO-CV) strategy, thereby guaranteeing full independence of the validation samples, and compared with standard non-spatial leave-one-out cross-validation (LOO-CV). The results show relatively close statistical performance from one year to the next despite the differences between the annual time series. Good agreement between years was observed in monospecific plantations of broadleaf species, versus high disparity in other forests composed of different species. A strong positive bias in the accuracy assessment (up to 0.4 of Overall Accuracy (OA)) was also found when spatial dependence in the validation data was not removed. Using the SLOO-CV approach, the average OA values per year ranged from 0.48 for 2006 to 0.60 for 2013, which satisfactorily represents the spatial instability of species prediction between years.
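The difference between the two validation strategies can be sketched as follows. This is a minimal illustration assuming that spatial leave-one-out means dropping from the training set every sample that lies within some buffer distance of the held-out sample. The `min_distance` parameter is hypothetical; the abstract does not say how the separation distance is chosen.

```python
import math


def standard_loo_splits(n_samples):
    """Yield (test_index, train_indices) for ordinary leave-one-out."""
    for i in range(n_samples):
        yield i, [j for j in range(n_samples) if j != i]


def spatial_loo_splits(coords, min_distance):
    """Yield (test_index, train_indices) with a spatial buffer around the test sample.

    coords is a list of (x, y) positions; training samples closer than or
    equal to min_distance to the held-out sample are excluded.
    """
    for i, (xi, yi) in enumerate(coords):
        train = [j for j, (xj, yj) in enumerate(coords)
                 if j != i and math.hypot(xi - xj, yi - yj) > min_distance]
        yield i, train
```

With samples at (0, 0), (1, 0), (10, 0) and (11, 0) and a buffer of 5, the held-out sample at (0, 0) is trained only on the two distant samples, whereas ordinary leave-one-out would also train on its close neighbor at (1, 0).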


2020 ◽  
Vol 2020 ◽  
pp. 1-14
Author(s):  
Gaoteng Yuan ◽  
Yihui Liu ◽  
Wei Huang ◽  
Bing Hu

Purpose. The objective of this study is to investigate the use of texture analysis (TA) of enhanced magnetic resonance imaging (MRI) scans, together with machine learning methods, for distinguishing different grades of breast invasive ductal carcinoma (IDC). Preoperative prediction of the grade of IDC can provide a reference for choosing among clinical treatments, so it has important practical value in the clinic. Methods. First, a breast cancer segmentation model based on the discrete wavelet transform (DWT) and the K-means algorithm is proposed. Second, TA is performed and Gabor wavelet analysis is used to extract the texture features of the tumor in the MRI image. Then, according to the distance relationship between the features, key features are sorted and feature subsets are selected. Finally, the feature subset is classified using a support vector machine, with parameters adjusted to achieve the best classification performance. Results. By selecting key features for classification, the accuracy of the classification model can reach 81.33%. Under 3-, 4-, and 5-fold cross-validation, the prediction accuracy of the support vector machine model ranges from 77.79% to 81.94%. Conclusion. The pathological grading of IDC can be predicted and evaluated by texture analysis and feature extraction of breast tumors. This method can provide valuable information for doctors’ clinical diagnosis. With further development, the model shows high potential for practical clinical use.


2014 ◽  
Vol 615 ◽  
pp. 194-197
Author(s):  
Zhen Yuan Tu ◽  
Fang Hua Ning ◽  
Wu Jia Yu

In practice, it is difficult for a Support Vector Machine (SVM) to achieve both a relatively high recognition rate and a fast recognition speed. To address this limitation, in this paper we build an SVM classification model that incorporates numerical characteristics. We use readings of rotary natural meters as the test samples and apply positioning, preprocessing, feature point extraction, classification and a series of other operations to the numeric region of the dial. Then, following the idea of cross-validation, we repeatedly optimize the SVM parameters. Finally, after a comprehensive comparison of the effects that various performance factors have on the experimental outputs, we offer an explanation of the outputs from different perspectives.


2012 ◽  
Vol 229-231 ◽  
pp. 2276-2279
Author(s):  
Yu An Pan ◽  
Xuan Xiao ◽  
Pu Wang

Antimicrobial peptides (AMPs) are potent, broad-spectrum antibiotics which demonstrate potential as novel therapeutic agents. Because it is both time-consuming and laborious to identify new AMPs by experiment, this paper tries to solve this problem by pattern recognition. The work has two main parts. First, up to six kinds of physicochemical property values are selected to encode the AMP sequence as a physical-chemical property matrix (PCM), and an auto and cross covariance transformation is then performed to extract features from the PCM to represent the AMP sequence. Second, these feature vectors are input to a Support Vector Machine (SVM) classifier for training and for recognizing new query AMPs. For a newly constructed AMP benchmark dataset, an overall classification accuracy of about 96% was achieved in rigorous leave-one-out cross-validation. For convenience, a user-friendly web server, AMPpred, has been established at http://icpr.jci.jx.cn/bioinfo/AMPpred. It is anticipated that this online predictor may become a useful bioinformatics tool for molecular biology and drug development. Its novel approach may also further stimulate the development of methods for predicting peptide attributes.
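The auto and cross covariance step can be illustrated with a short sketch. It assumes the commonly used definition in which, for each lag, the mean-centered value of property j1 at position i is multiplied by the mean-centered value of property j2 at position i + lag and averaged over the sequence. The case j1 = j2 gives the auto covariance terms and j1 ≠ j2 the cross covariance terms. The matrix layout (one row per residue, one column per property) and the `max_lag` parameter are assumptions for this illustration, and no specific property scale is implied.

```python
def acc_features(pcm, max_lag):
    """Auto and cross covariance features of a property matrix.

    pcm: list of rows, one row per residue and one column per property.
    Returns max_lag * p * p values, ordered by lag, then j1, then j2.
    """
    n = len(pcm)
    p = len(pcm[0])
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    # Center each property column on its mean over the sequence.
    means = [sum(row[j] for row in pcm) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in pcm]
    features = []
    for lag in range(1, max_lag + 1):
        for j1 in range(p):
            for j2 in range(p):
                total = sum(centered[i][j1] * centered[i + lag][j2]
                            for i in range(n - lag))
                features.append(total / (n - lag))
    return features
```

The output length depends only on the number of properties and the maximum lag, so peptides of different lengths map to feature vectors of the same size, which is what a fixed-input classifier such as an SVM requires.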


2012 ◽  
Vol 554-556 ◽  
pp. 1628-1631 ◽  
Author(s):  
Tian Hong Gu ◽  
Wei Lv ◽  
Xia Shao ◽  
Wen Cong Lu

Based on the N, O, H and C element contents of objects detected by γ-ray resonance, a support vector classification (SVC) method was used to construct a model for distinguishing high energy materials (HEMs) from ordinary ones. The prediction accuracy was 95.9% in the leave-one-out cross-validation (LOOCV) test. The results indicated that the performance of the SVC model is good enough to detect HEMs in the presence of ordinary materials for security checking purposes.

