Study on Consistency Analysis in Text Categorization

2014 ◽  
Vol 539 ◽  
pp. 181-184
Author(s):  
Wan Li Zuo ◽  
Zhi Yan Wang ◽  
Ning Ma ◽  
Hong Liang

Accurate text classification is a basic prerequisite for efficiently extracting information of various types from the Web and making proper use of network resources. This paper proposes a new text classification method based on consistency analysis. Consistency analysis is an iterative algorithm that trains different (weak) classifiers on the same training set and then combines them to test how consistently the various classification methods label the same text, thereby exploiting the knowledge of each type of classifier. The method determines the weight of each sample according to whether that sample was classified correctly in the current round, as well as the accuracy of the last overall classification, and then passes the reweighted data set to the next classifier for training. Finally, the classifiers obtained during training are combined into the final decision classifier. A classifier built with consistency analysis can discard unnecessary characteristics of the training data and focus on the key training samples. Experimental results show that the average accuracy of this method is 91.0%, with an average recall rate of 88.1%.
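The reweighting scheme described above is essentially boosting: misclassified samples gain weight before the next round, and the rounds are combined by weighted vote. A minimal sketch, assuming decision stumps as the weak classifiers (the abstract does not name the weak learner):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boosted_ensemble(X, y, n_rounds=10):
    """Iteratively train weak classifiers (decision stumps here) on a
    reweighted training set, then combine them by weighted vote.
    Labels y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)               # start with uniform weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()          # weighted error this round
        if err == 0:                      # perfect stump: keep it, stop
            stumps.append(stump)
            alphas.append(1.0)
            break
        if err >= 0.5:                    # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)    # raise weight of mistakes only
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)

    def predict(Xq):
        votes = sum(a * s.predict(Xq) for a, s in zip(alphas, stumps))
        return np.sign(votes)
    return predict
```

In a text setting, `X` would be term-frequency or tf-idf vectors of the documents.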

2014 ◽  
Vol 644-650 ◽  
pp. 2395-2398
Author(s):  
Jian Si Ren

The development of the Internet and of digital libraries has triggered many text categorization methods. Finding desired information accurately and in a timely manner is becoming increasingly important, and automatic text categorization can help achieve this goal. In general, text classifiers are implemented with traditional classification methods such as Naive Bayes (NB). ARC-BC (Associative Rule-based Classifier by Category) can be used for text categorization by dividing text documents into subsets in which all documents belong to the same category and generating associative classification rules for each subset. This classifier differs from previous methods in that it consists of discovered association rules between words and categories extracted from the training set. To train and test the classifier, we constructed training and testing data sets by selecting documents from Yahoo. The experimental results show that ARC-BC-based text categorization is efficient and effective, and comparable to text categorization based on the Naive Bayes algorithm.
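A toy sketch of the ARC-BC idea: mine word → category rules per category subset, keeping those that meet support and confidence thresholds, then score new documents by their matching rules. The single-word antecedents and the threshold values are simplifying assumptions, not the paper's exact rule miner:

```python
from collections import Counter, defaultdict

def mine_rules(docs, min_support=0.3, min_confidence=0.6):
    """For each category, count in how many of its documents each word
    occurs; keep word -> category rules meeting support (within the
    category subset) and confidence (over all documents)."""
    word_in_cat = defaultdict(Counter)   # category -> word -> doc count
    word_total = Counter()               # word -> doc count over all docs
    cat_size = Counter()
    for text, cat in docs:
        words = set(text.lower().split())
        cat_size[cat] += 1
        for w in words:
            word_in_cat[cat][w] += 1
            word_total[w] += 1
    rules = []
    for cat, counts in word_in_cat.items():
        for w, n in counts.items():
            support = n / cat_size[cat]       # within-category support
            confidence = n / word_total[w]    # P(category | word)
            if support >= min_support and confidence >= min_confidence:
                rules.append((w, cat, support, confidence))
    return rules

def classify(text, rules):
    """Score each category by the confidence of its matching rules."""
    words = set(text.lower().split())
    scores = Counter()
    for w, cat, _, conf in rules:
        if w in words:
            scores[cat] += conf
    return scores.most_common(1)[0][0] if scores else None
```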


2017 ◽  
Vol 45 (2) ◽  
pp. 66-74
Author(s):  
Yufeng Ma ◽  
Long Xia ◽  
Wenqi Shen ◽  
Mi Zhou ◽  
Weiguo Fan

Purpose The purpose of this paper is the automatic classification of TV series reviews into generic categories. Design/methodology/approach The authors' main approach is to replace specific role and actor names in reviews with surrogates, making the reviews more generic; in addition, feature selection techniques and several kinds of classifiers are incorporated. Findings With role and actor names replaced by generic tags, the experimental results showed that the model generalizes well to unseen TV series, compared with reviews that keep the original names. Research limitations/implications The model presented in this paper must be built on top of an existing knowledge base such as Baidu Encyclopedia, and such a database takes a lot of work to construct. Practical implications In a digital information supply chain where reviews are part of the information being transported or exchanged, the model presented in this paper can automatically identify individual reviews according to different requirements and support information sharing. Originality/value One original contribution is the surrogate-based approach for making reviews more generic. The authors also built a review data set of popular Chinese TV series, which includes eight generic category labels for each review.
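The surrogate step can be sketched as dictionary-driven replacement. The entity table below is hypothetical; the paper derives role and actor names from a knowledge base such as Baidu Encyclopedia rather than hand-listing them:

```python
import re

# Hypothetical entity table mapping specific names to generic surrogate tags.
ENTITY_TAGS = {
    "Zhang Wei": "<ACTOR>",
    "Detective Li": "<ROLE>",
}

def make_generic(review, entity_tags=ENTITY_TAGS):
    """Replace specific actor/role names with generic surrogate tags so a
    classifier trained on one series can transfer to another."""
    # Replace longer names first so substrings are not clobbered.
    for name, tag in sorted(entity_tags.items(),
                            key=lambda kv: len(kv[0]), reverse=True):
        review = re.sub(re.escape(name), tag, review)
    return review
```

The generic reviews are then vectorized and fed to the downstream classifiers as usual.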


2016 ◽  
Vol 2016 (4) ◽  
pp. 21-36 ◽  
Author(s):  
Tao Wang ◽  
Ian Goldberg

Abstract Website fingerprinting allows a local, passive observer monitoring a web-browsing client’s encrypted channel to determine her web activity. Previous attacks have shown that website fingerprinting could be a threat to anonymity networks such as Tor under laboratory conditions. However, there are significant differences between laboratory conditions and realistic conditions. First, in laboratory tests we collect the training data set together with the testing data set, so the training data set is fresh, but an attacker may not be able to maintain a fresh data set. Second, laboratory packet sequences correspond to a single page each, but for realistic packet sequences the split between pages is not obvious. Third, packet sequences may include background noise from other types of web traffic. These differences adversely affect website fingerprinting under realistic conditions. In this paper, we tackle these three problems to bridge the gap between laboratory and realistic conditions for website fingerprinting. We show that we can maintain a fresh training set with minimal resources. We demonstrate several classification-based techniques that allow us to split full packet sequences effectively into sequences corresponding to a single page each. We describe several new algorithms for tackling background noise. With our techniques, we are able to build the first website fingerprinting system that can operate directly on packet sequences collected in the wild.
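The splitting problem the paper tackles can be illustrated with a naive time-gap heuristic: cut the packet trace wherever traffic pauses. This is only a baseline for intuition, not the paper's classification-based splitting, which handles the realistic case where page loads overlap or gaps are ambiguous:

```python
def split_by_gap(packets, max_gap=1.0):
    """Split a (timestamp, direction) packet trace into page-sized
    segments wherever the inter-packet gap exceeds max_gap seconds.
    A naive baseline only; the paper trains classifiers to locate
    split points instead."""
    if not packets:
        return []
    segments, current = [], [packets[0]]
    for prev, pkt in zip(packets, packets[1:]):
        if pkt[0] - prev[0] > max_gap:
            segments.append(current)
            current = []
        current.append(pkt)
    segments.append(current)
    return segments
```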


2013 ◽  
Vol 2013 ◽  
pp. 1-9 ◽  
Author(s):  
R. Manjula Devi ◽  
S. Kuppuswami ◽  
R. C. Suganthe

Artificial neural networks have been extensively used as training models for solving pattern recognition tasks. However, training a very large data set with a complex neural network requires excessively long training times. In this correspondence, a new fast Linear Adaptive Skipping Training (LAST) algorithm for training artificial neural networks (ANNs) is introduced. The core idea of this paper is to improve the training speed of an ANN by presenting only the input samples that were not classified correctly in the previous epoch, thereby dynamically reducing the number of input samples presented to the network at every epoch without affecting the network's accuracy. Shrinking the training set in this way reduces the training time, thereby improving training speed. The LAST algorithm also determines how many epochs a particular input sample should skip, depending on how consistently that sample is classified correctly. LAST can be incorporated into any supervised training algorithm. Experimental results show that the training speed attained by the LAST algorithm is considerably higher than that of other conventional training algorithms.
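A sketch of the skipping schedule, under two assumptions not fixed by the abstract: the skip count grows linearly with the number of consecutive correct classifications, and the model is any estimator with a `partial_fit`/`predict` interface (e.g. scikit-learn's `SGDClassifier`):

```python
import numpy as np

def last_style_training(X, y, model, n_epochs=20):
    """Epoch loop that skips samples the model already classifies
    correctly. Each consecutive correct classification lengthens the
    sample's skip linearly; a misclassification resets it."""
    n = len(y)
    skip = np.zeros(n, dtype=int)     # epochs each sample still sits out
    streak = np.zeros(n, dtype=int)   # consecutive correct classifications
    classes = np.unique(y)
    for _ in range(n_epochs):
        active = skip == 0
        skip[~active] -= 1            # count down the skipped samples
        if not active.any():
            continue
        model.partial_fit(X[active], y[active], classes=classes)
        correct = model.predict(X[active]) == y[active]
        idx = np.where(active)[0]
        streak[idx[correct]] += 1
        streak[idx[~correct]] = 0
        skip[idx[correct]] = streak[idx[correct]]  # skip grows with streak
    return model
```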


Dose-Response ◽  
2019 ◽  
Vol 17 (4) ◽  
pp. 155932581989417 ◽  
Author(s):  
Zhi Huang ◽  
Jie Liu ◽  
Liang Luo ◽  
Pan Sheng ◽  
Biao Wang ◽  
...  

Background: Plenty of evidence suggests that autophagy plays a crucial role in the biological processes of cancers. This study aimed to screen autophagy-related genes (ARGs) and establish a novel scoring system for colorectal cancer (CRC). Methods: ARG sequencing data and the corresponding clinical data for CRC from The Cancer Genome Atlas were used as the training data set. The GSE39582 data set from the Gene Expression Omnibus was used as the validation set. An autophagy-related signature was developed in the training set using univariate Cox analysis followed by stepwise multivariate Cox analysis, and assessed in the validation set. We then analyzed the function and pathways of the ARGs using the Gene Ontology and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. Finally, a prognostic nomogram combining the autophagy-related risk score and clinicopathological characteristics was developed from the multivariate Cox analysis. Results: After univariate and multivariate analysis, 3 ARGs were used to construct the autophagy-related signature. The KEGG pathway analyses showed several significantly enriched oncological signatures, such as the p53 signaling pathway, apoptosis, human cytomegalovirus infection, platinum drug resistance, necroptosis, and the ErbB signaling pathway. Patients were divided into high- and low-risk groups, and high-risk patients had significantly shorter overall survival (OS) than low-risk patients in both the training and validation sets. Furthermore, a nomogram for predicting 3- and 5-year OS was established based on the autophagy-based risk score and clinicopathologic factors. The area under the curve and the calibration curves indicated that the nomogram predicted with good accuracy. Conclusions: Our proposed autophagy-based signature has important prognostic value and may provide a promising tool for the development of personalized therapy.
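The risk score behind such a signature is a linear combination of expression values weighted by the multivariate Cox coefficients, with the median score as the high/low cut-off. The gene names and coefficients below are placeholders, not the three ARGs reported in the paper:

```python
import numpy as np

def autophagy_risk_groups(expr, coefs):
    """Risk score = sum over signature genes of (Cox coefficient x
    expression); patients above the median score form the high-risk
    group. Gene names and coefficients here are illustrative only."""
    score = sum(coefs[g] * np.asarray(expr[g], float) for g in coefs)
    high_risk = score > np.median(score)
    return score, high_risk
```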


Author(s):  
GOZDE UNAL ◽  
GAURAV SHARMA ◽  
REINER ESCHBACH

Photography, lithography, xerography, and inkjet printing are the dominant technologies for color printing. Images produced on these different media are often scanned, either for copying or to create an electronic representation. For improved color calibration during scanning, identifying the medium from the scanned image data is desirable. In this paper, we propose an efficient algorithm for automated classification of input media into four major classes: photographic, lithographic, xerographic, and inkjet. Our technique exploits the strong correlation between the type of input medium and the spatial statistics of the corresponding images, as observed in the scanned images. We adopt ideas from the spatial statistics literature and design two spatial statistical measures, of dispersion and periodicity, which are computed over spatial point patterns generated from blocks of the scanned image and whose distributions provide the features for making a decision. We utilized extensive training data to determine well-separated decision regions for classifying the input media, and validated and tested our classification technique on an independent, extensive data set. The results demonstrate that the proposed method distinguishes between the different media with high reliability.
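As an illustration of a dispersion measure for a spatial point pattern, the classic Clark-Evans nearest-neighbour ratio can be sketched as follows; the paper designs its own dispersion and periodicity statistics, so this is a generic stand-in, not the authors' measure:

```python
import numpy as np

def clark_evans_index(points, area):
    """Dispersion of a 2-D point pattern: ratio of the observed mean
    nearest-neighbour distance to that expected under complete spatial
    randomness (~1 random, <1 clustered, >1 regular)."""
    pts = np.asarray(points, float)
    # pairwise distances, with the self-distance masked out
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    observed = d.min(axis=1).mean()
    expected = 0.5 / np.sqrt(len(pts) / area)
    return observed / expected
```

Halftone dots from lithographic or xerographic prints would yield regular (periodic) patterns, while photographic grain looks closer to random.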


2019 ◽  
Vol 1 ◽  
pp. 1-1
Author(s):  
Tee-Ann Teo

<p><strong>Abstract.</strong> Deep Learning is a kind of Machine Learning technology that utilizes a deep neural network to learn a promising model from a large training data set. The Convolutional Neural Network (CNN) has been successfully applied to image segmentation and classification with highly accurate results. A CNN applies multiple kernels (also called filters) to extract image features via convolution, and it can determine multiscale features through multiple layers of convolution and pooling. The variety of the training data plays an important role in obtaining a reliable CNN model. Benchmark training data for road mark extraction, such as the KITTI Vision Benchmark Suite, mainly focus on close-range imagery, because a close-range image is easier to obtain than an airborne image. This study aims to transfer road mark training data from a mobile lidar system to aerial orthoimages in Fully Convolutional Networks (FCNs). Transferring the training data from a ground-based system to an airborne system may reduce the effort of producing a large training data set.</p><p>This study uses FCN technology and aerial orthoimages to localize road marks in road regions. The road regions are first extracted from a 2-D large-scale vector map. The input aerial orthoimage has a 10 cm spatial resolution, and the non-road regions are masked out before road mark localization. The training data are road mark polygons, originally digitized from ground-based mobile lidar and prepared for road mark extraction with a mobile mapping system. This study reuses these training data and applies them to road mark extraction from the aerial orthoimage. The digitized training road marks are then transformed to road polygons based on mapping coordinates. 
Because the detail of ground-based lidar is much better than that of the airborne system, the partially occluded parking lot in the aerial orthoimage can also be obtained from the ground-based system. The labels (also called annotations) for the FCN comprise road region, non-road region, and road mark. The size of a training batch is 500 pixels by 500 pixels (50 m by 50 m on the ground), and 75 batches in total are used for training. After the FCN training stage, an independent aerial orthoimage (Figure 1a) is used to predict the road marks. The FCN results provide initial regions for road marks (Figure 1b). Road marks usually show higher reflectance than road asphalt, so this study uses this characteristic to refine the road marks (Figure 1c) with a binary classification inside each initial road mark region.</p><p>Comparing the automatically extracted road marks (Figure 1c) with the manually digitized road marks (Figure 1d) shows that most road marks can be extracted using the training set from the ground-based system. This study also selected an area of 600 m × 200 m for quantitative analysis. Of the 371 reference road marks, 332 were extracted by the proposed scheme, a completeness of 89%. The preliminary experiment demonstrated that most road marks can be successfully extracted by the proposed scheme; hence, training data from a ground-based mapping system can be utilized with airborne orthoimagery of similar spatial resolution.</p>
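The reflectance-based refinement step can be sketched as a brightness threshold inside the FCN-predicted regions. The mean intensity is used here as a stand-in threshold; the study's exact binary classifier is not specified:

```python
import numpy as np

def refine_road_marks(gray, region_mask):
    """Keep only the brighter pixels inside the FCN-predicted regions,
    exploiting the higher reflectance of road marks over asphalt.
    gray: 2-D intensity image; region_mask: boolean mask of initial
    road mark regions from the FCN."""
    vals = gray[region_mask]
    if vals.size == 0:
        return np.zeros_like(region_mask)
    return region_mask & (gray > vals.mean())
```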


2019 ◽  
Vol 8 (4) ◽  
pp. 407-417
Author(s):  
Inas Hasimah ◽  
Moch. Abdul Mukid ◽  
Hasbi Yasin

House credit (KPR) is a credit facility for buying a house, or for other consumptive needs, with the house as collateral. The collateral for a KPR is the house to be purchased; the collateral for a KPR multiguna take over is a house already owned by the debtor, who then transfers the KPR to another financial institution. Extending credit to a prospective debtor involves a credit application and a credit analysis, through which the debtor's ability to repay the credit is assessed. The final decision on a credit application is classified as approved or refused. k-Nearest Neighbor with attribute weighting by the Global Gini Diversity Index is a statistical method that can be used to classify the credit decision for a prospective debtor. This research uses 2443 records of prospective KPR multiguna take over debtors from 2018, with the credit decision as the dependent variable and four selected independent variables: home ownership status, job, loan amount, and income. The best classification result of k-NN with Global Gini Diversity Index weighting was obtained with an 80% training set and a 20% testing set at k=7, giving an APER of 0.0798 and an accuracy of 92.02%. Keywords: KPR Multiguna Take Over, Classification, KNN by Global Gini Diversity Index weighting, Evaluation of Classification
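Attribute-weighted k-NN can be sketched as follows, assuming the Gini-derived attribute weights have already been computed (their derivation is not shown here); the default k=7 echoes the paper's best setting:

```python
import numpy as np
from collections import Counter

def weighted_knn_predict(X_train, y_train, x, weights, k=7):
    """k-NN where per-attribute weights (e.g. from a Gini diversity
    index) scale each feature inside the Euclidean distance, so more
    informative attributes dominate the neighbourhood."""
    w = np.sqrt(np.asarray(weights, float))   # sqrt so weights act on squared terms
    d = np.linalg.norm((np.asarray(X_train, float) - np.asarray(x, float)) * w,
                       axis=1)
    nearest = np.argsort(d)[:k]               # indices of the k closest records
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]
```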


2004 ◽  
Vol 43 (02) ◽  
pp. 192-201 ◽  
Author(s):  
R. E. Abdel-Aal

Summary Objectives: To introduce abductive network classifier committees as an ensemble method for improving classification accuracy in medical diagnosis. While neural networks allow many ways to introduce enough diversity among member models to improve performance when forming a committee, the self-organizing, automatic-stopping nature, and learning approach used by abductive networks are not very conducive for this purpose. We explore ways of overcoming this limitation and demonstrate improved classification on three standard medical datasets. Methods: Two standard 2-class medical datasets (Pima Indians Diabetes and Heart Disease) and a 6-class dataset (Dermatology) were used to investigate ways of training abductive networks with adequate independence, as well as methods of combining their outputs to form a network that improves performance beyond that of single models. Results: Two- or three-member committees of models trained on completely or partially different subsets of training data and using simple output combination methods achieve improvements between 2 and 5 percentage points in the classification accuracy over the best single model developed using the full training set. Conclusions: Varying model complexity alone gives abductive network models that are too correlated to ensure enough diversity for forming a useful committee. Diversity achieved through training member networks on independent subsets of the training data outweighs limitations of the smaller training set for each, resulting in net gain in committee performance. As such models train faster and can be trained in parallel, this can also speed up classifier development.
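The simple output-combination step can be sketched as majority voting over predicted labels or averaging of class probabilities across committee members; these are generic schemes, not necessarily the exact ones the paper evaluates:

```python
import numpy as np

def committee_predict(members, X, mode="vote"):
    """Combine member classifiers either by majority vote over their
    predicted labels or by averaging their class-probability outputs."""
    if mode == "vote":
        preds = np.stack([m.predict(X) for m in members])  # members x samples
        # majority label per sample (column)
        return np.apply_along_axis(
            lambda col: np.bincount(col).argmax(), 0, preds)
    probs = np.mean([m.predict_proba(X) for m in members], axis=0)
    return probs.argmax(axis=1)
```

Training each member on a different subset of the training data, as the paper does, is what supplies the diversity that makes the combination worthwhile.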


Soil Research ◽  
2000 ◽  
Vol 38 (4) ◽  
pp. 867 ◽  
Author(s):  
G. K. Summerell ◽  
T. I. Dowling ◽  
D. P. Richardson ◽  
J. Walker ◽  
B. Lees

Parna is a wind-blown clay, mobilised from inland Australia as the result of a series of intermittent high wind events during the Quaternary. Parna can be recognised on the basis of colour, texture, distributional patterns, and pedology. Parna deposits have been recorded across a wide area of south eastern Australia and have influenced the local pedology and hydrology. In some cases parna has increased soil sodicity and the potential for dryland salinisation. Predicting its spatial distribution is useful when considering agricultural potential and in assessing the risk and spatial spread of dryland salinity. Here we present the results of modelling to predict its local distribution in an area covering 291 km2 in the Young district of NSW. Two conceptual models of parna deposition and subsequent redistribution were used to develop a current parna distribution map: (a) deposition = f(topography, aspect) after assuming that interactions of rainfall, vegetation, and wind speed were relatively the same at the local scale; (b) removal or retention = f (slope angle, catchment size, slope length) as a representation of the erosive energy of gravity. Five landscape variables, elevation, aspect, slope, flow accumulation, and flow length, were derived from a 20 m digital elevation model (DEM). A training set of parna deposits was established using air photos and field survey from limited exposures in the Young district of NSW. These areas were digitised and converted to a grid of areas of parna and no-parna. This training set for parna and the 5 landscape variable grids were processed in the IDRISI for WINDOWS Geographic Information System (GIS). Spatial relationships between the parna and no-parna deposits and the 5 landscape variables were extracted from this training set. This information was imported into an inductive learning program called KnowledgeSEEKER. 
A decision tree was built by recursive partitioning of the data set, using Chi-square tests for categorical variables and an F test for continuous variables, to best replicate the training data classification of ‘parna’ and ‘no-parna’. The rules derived from this process were applied to the study area to predict the occurrence of parna in the broader landscape. Predictions were field checked and the rules adjusted until they best represented the occurrence of parna in the field. The final model showed predictions of parna deposits as follows: (i) higher elevations in the Young landscape were the dominant sites of parna deposits; (ii) thicker deposits of parna occurred on the windward south-west and north-west; (iii) thinner deposits occurred on the leeward side of a central ridge feature; (iv) because the training set concentrated around the major central ridge feature, poorer predictions were obtained on gently undulating country.
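The recursive-partitioning step can be sketched with a CART-style tree; scikit-learn's Gini criterion stands in for KnowledgeSEEKER's Chi-square/F-test splits, and the samples below are hypothetical toy values of the five landscape variables, not the study's data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy samples of the five landscape variables:
# [elevation (m), aspect (deg), slope (deg), flow accumulation, flow length]
X = np.array([
    [420, 225, 3, 10, 120],    # high windward site  -> parna
    [415, 240, 4, 12, 150],    # high windward site  -> parna
    [300, 90, 2, 400, 900],    # low leeward site    -> no parna
    [310, 70, 1, 380, 800],    # low leeward site    -> no parna
])
y = np.array([1, 1, 0, 0])     # 1 = parna, 0 = no parna

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# apply the learned rules to an unseen grid cell
print(tree.predict([[430, 230, 3, 15, 130]]))   # -> [1] (parna predicted)
```

In the study this prediction step would run over every DEM grid cell to map parna across the broader landscape.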

