WEB PAGES CLASSIFICATION USING DOMAIN ONTOLOGY AND CLUSTERING

Author(s):  
SIMA SOLTANI ◽  
AHMAD ABDOLLAHZADEH BARFOROUSH

Transferring current websites to the Semantic Web via ontology population is a research area in which classification plays the main role. Existing classification algorithms, executed at a single level, are insufficient for web data. Moreover, because of the variety in the content and structure of even common-domain websites, training data for these classification algorithms is scarce. In this paper, we present three experiments: (1) using the information in the domain ontology about the layers of classes to train classifiers (layered classification), which improves classification accuracy by up to 10%; (2) addressing the shortage of training data by using clustering as a preprocessing step; and (3) using ensembles to benefit from both methods. Beyond the accuracy improvements in these experiments, we found that in the ensemble we can dispense with a complex classification algorithm such as SVM and instead use a simple classifier such as naïve Bayes while achieving comparable accuracy.
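A minimal sketch of the layered-classification idea described above: a top-level classifier picks the coarse ontology class, then a per-branch classifier refines it to a leaf class. The corpus, class names, and choice of TF-IDF with naïve Bayes are illustrative assumptions, not the paper's exact setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy corpus: (text, top-level ontology class, leaf class)
data = [
    ("goalkeeper saves penalty kick", "sport", "football"),
    ("striker scores winning goal", "sport", "football"),
    ("pitcher throws a fastball strike", "sport", "baseball"),
    ("batter hits a home run", "sport", "baseball"),
    ("stocks fall on inflation fears", "finance", "markets"),
    ("index futures rally sharply", "finance", "markets"),
    ("central bank raises interest rates", "finance", "policy"),
    ("regulator tightens lending rules", "finance", "policy"),
]
texts, tops, leaves = zip(*data)

# layer 1: one classifier for the top-level ontology classes
top_clf = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, tops)

# layer 2: one classifier per top-level branch for its leaf classes
leaf_clf = {}
for branch in set(tops):
    idx = [i for i, t in enumerate(tops) if t == branch]
    leaf_clf[branch] = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(
        [texts[i] for i in idx], [leaves[i] for i in idx])

def classify(doc):
    branch = top_clf.predict([doc])[0]
    return branch, leaf_clf[branch].predict([doc])[0]

print(classify("the striker scored a goal"))
```

Training a separate classifier per branch is what lets each layer specialise on the finer distinctions inside its own subtree of the ontology.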

2017 ◽  
Vol 1 (3) ◽  
pp. 42-58 ◽  
Author(s):  
Frederique Lang ◽  
Diego Chavarro ◽  
Yuxian Liu

Abstract
Purpose: The authors aim to test the performance of a set of machine learning algorithms that could improve the process of data cleaning when building datasets.
Design/methodology/approach: The paper is centered on cleaning datasets gathered from publishers and online resources by the use of specific keywords. In this case, we analyzed data from the Web of Science. The accuracy of various forms of automatic classification was tested against manual coding in order to determine their usefulness for data collection and cleaning. We assessed the performance of seven supervised classification algorithms (Support Vector Machine (SVM), Scaled Linear Discriminant Analysis, Lasso and elastic-net regularized generalized linear models, Maximum Entropy, Regression Tree, Boosting, and Random Forest) and analyzed two properties: accuracy and recall. We assessed not only each algorithm individually but also their combinations through a voting scheme. We also tested the performance of these algorithms with different sizes of training data. When assessing the performance of different combinations, we used an indicator of coverage to account for the agreement and disagreement on classification between algorithms.
Findings: We found that the performance of the algorithms varies with the size of the training sample. For the classification exercise in this paper, the best-performing algorithms were SVM and Boosting. The combination of these two algorithms achieved high agreement on coverage and was highly accurate. This combination performs well with a small training dataset (10%), which may reduce the manual work needed for classification tasks.
Research limitations: The dataset gathered has significantly more records related to the topic of interest than to unrelated topics. This may affect the performance of some algorithms, especially their identification of unrelated papers.
Practical implications: Although the classification achieved by these means is not completely accurate, the amount of manual coding needed can be greatly reduced by using classification algorithms, which is of great help when the dataset is big. With the help of accuracy, recall, and coverage measures, it is possible to estimate the error involved in this classification, which could open the possibility of incorporating these algorithms into software specifically designed for data cleaning and classification.
Originality/value: We analyzed the performance of seven algorithms and whether combinations of these algorithms improve accuracy in data collection. Use of these algorithms could reduce the time needed for manual data cleaning.
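A sketch of the voting idea with a coverage indicator, assuming an SVM and a boosted ensemble as the two combined classifiers. The synthetic dataset, the 10% training split, and the specific estimators are illustrative assumptions, not the paper's data or exact configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# synthetic stand-in for the labelled bibliographic records
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# small (10%) training split, mirroring the reported finding
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.1, random_state=0)

svm = SVC().fit(X_tr, y_tr)
boost = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

p_svm, p_boost = svm.predict(X_te), boost.predict(X_te)
agree = p_svm == p_boost
coverage = agree.mean()                              # share of records both agree on
acc_on_agreed = (p_svm[agree] == y_te[agree]).mean() # accuracy where they agree
print(f"coverage={coverage:.2f}, accuracy on agreed records={acc_on_agreed:.2f}")
```

Records on which the two classifiers disagree (the complement of coverage) are the natural candidates to send back to manual coding.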


Author(s):  
Zirui (Raymond) Huang ◽  
Ali Arian ◽  
Yuqiu (Rachael) Yuan ◽  
Yi-Chang Chiu

An increasingly emphasized research area is the forecast of short-term traffic conditions for nonrecurring traffic dynamics caused by random highway incidents such as crashes or roadway closures. This research proposes a prediction framework that focuses on training a machine learning (ML) model to predict the speed heatmap associated with incidents. Heatmaps contain ideal information depicting the spatiotemporal characteristics of incident-induced impacts and are suitable objects for ML models to understand and predict. Because of the sparsity of incident data in the real world, we propose a simulation approach to rapidly expand the training dataset, thus speeding up the model training process. A conditional deep convolutional generative adversarial network is employed to predict the speed heatmap, and the mesoscopic dynamic traffic assignment model DynusT is used to generate a large volume of training data. The evaluation shows that the proposed model captures both the tonal and spatial distribution of pixel values, with 80.19% similarity between the predicted and actual heatmaps. To the best of our knowledge, this is one of the first attempts in the literature to train ML to predict a heatmap representation of incident-induced spatiotemporal impact and to speed up training via simulation.
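One plausible reading of a percentage similarity between a predicted and an actual speed heatmap is mean per-pixel agreement. The metric below, the speed range, and the heatmap shapes are all assumptions for illustration; the paper does not specify its similarity formula here.

```python
import numpy as np

def heatmap_similarity(pred, actual, max_speed=120.0):
    """1 minus the mean absolute pixel error, normalised by the speed range."""
    pred = np.clip(pred, 0.0, max_speed)
    actual = np.clip(actual, 0.0, max_speed)
    return 1.0 - np.mean(np.abs(pred - actual)) / max_speed

rng = np.random.default_rng(0)
actual = rng.uniform(0, 120, size=(64, 64))       # time x space speed heatmap
pred = actual + rng.normal(0, 5, size=(64, 64))   # a near-perfect prediction
print(f"similarity = {heatmap_similarity(pred, actual):.2%}")
```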


2020 ◽  
Vol 27 ◽  
Author(s):  
Zaheer Ullah Khan ◽  
Dechang Pi

Background: S-sulfenylation (S-sulphenylation, or sulfenic acid formation) of proteins is a special kind of post-translational modification that plays an important role in various physiological and pathological processes such as cytokine signaling, transcriptional regulation, and apoptosis. Given this significance, and to complement existing wet-lab methods, several computational models have been developed for predicting sulfenylation cysteine (SC) sites. However, the performance of these models has been unsatisfactory due to inefficient feature schemes, severe class imbalance, and the lack of an intelligent learning engine. Objective: In this study, our motivation is to establish a strong and novel computational predictor that discriminates sulfenylation from non-sulfenylation sites. Methods: We report an innovative bioinformatics predictor, named DeepSSPred, in which the encoded features are obtained via an n-segmented hybrid feature scheme, and the synthetic minority oversampling technique (SMOTE) is employed to cope with the severe imbalance between SC sites (minority class) and non-SC sites (majority class). A state-of-the-art 2D convolutional neural network was validated and authenticated via a rigorous 10-fold cross-validation procedure. Results: Following the proposed framework, the strong discrete representation of the feature space, the machine learning engine, and the unbiased presentation of the underlying training data yielded an excellent model that outperforms all existing studies; the proposed approach is 6% higher in MCC than the first-best method, which did not provide sufficient details on an independent dataset. Compared with the second-best method, the model obtained increases of 7.5% in accuracy, 1.22% in Sn, 12.91% in Sp, and 13.12% in MCC on the training data, and 12.13% in ACC, 27.25% in Sn, 2.25% in Sp, and 30.37% in MCC on an independent dataset. These empirical analyses show the superior performance of the proposed model over both the training and independent datasets in comparison with existing studies. Conclusion: In this research, we have developed a novel sequence-based automated predictor for SC sites, called DeepSSPred. The empirical results on the training and independent validation datasets reveal the efficacy of the proposed model. The good performance of DeepSSPred is due to several factors, such as the novel discriminative feature encoding scheme, the SMOTE technique, and the careful construction of the prediction model through a tuned 2D-CNN classifier. We believe this work provides insight for further prediction of S-sulfenylation characteristics and functionalities, and we hope the developed predictor will be significantly helpful for large-scale discrimination of unknown SC sites in particular and for designing new pharmaceutical drugs in general.
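A minimal SMOTE-style oversampling sketch of the resampling idea the study relies on: synthesise new minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The feature dimensions and counts are invented; this is not the paper's implementation.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # k nearest minority neighbours of sample i (excluding itself)
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]
        j = rng.choice(nn)
        lam = rng.random()  # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

rng = np.random.default_rng(1)
minority = rng.normal(0, 1, size=(20, 8))   # e.g. 20 SC-site feature vectors
synth = smote_like(minority, n_new=80)      # oversample toward class balance
print(synth.shape)
```

Because each synthetic point is a convex combination of two real minority samples, the oversampled class stays inside the minority feature region rather than drifting into the majority class.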


2020 ◽  
Vol 12 (9) ◽  
pp. 1418
Author(s):  
Runmin Dong ◽  
Cong Li ◽  
Haohuan Fu ◽  
Jie Wang ◽  
Weijia Li ◽  
...  

Substantial progress has been made in the field of large-area land cover mapping as the spatial resolution of remotely sensed data increases. However, a significant amount of human labor is still required to label images for training and testing purposes, especially in high-resolution (e.g., 3-m) land cover mapping. In this research, we propose a solution that can produce 3-m resolution land cover maps on a national scale without human labeling effort. First, using public 10-m resolution land cover maps as an imperfect training dataset, we propose a deep-learning-based approach that can effectively transfer the existing knowledge. Then, we improve the efficiency of our method through a network pruning process for national-scale land cover mapping. Our proposed method can take the state-of-the-art 10-m resolution land cover maps (with an accuracy of 81.24% for China) as the training data, enable a transfer learning process that produces 3-m resolution land cover maps, and further improve the overall accuracy (OA) to 86.34% for China. We present detailed results obtained over three megacities in China to demonstrate the effectiveness of our proposed approach for 3-m resolution large-area land cover mapping.
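The efficiency step above relies on network pruning. Magnitude pruning is one common scheme; the sketch below zeroes the smallest-magnitude weights of a layer, an assumption for illustration since the paper's exact pruning procedure is not described here.

```python
import numpy as np

def prune_by_magnitude(w, sparsity=0.5):
    """Zero the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # threshold = k-th smallest absolute weight
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    pruned = w.copy()
    pruned[np.abs(pruned) <= thresh] = 0.0
    return pruned

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))            # a dense layer's weight matrix
p = prune_by_magnitude(w, sparsity=0.5)
print(f"sparsity after pruning: {np.mean(p == 0):.2f}")
```

Zeroed weights can then be skipped at inference time, which is what makes the pruned network cheaper to run at national scale.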


2015 ◽  
Vol 32 (7) ◽  
pp. 1341-1355 ◽  
Author(s):  
S. J. Rennie ◽  
M. Curtis ◽  
J. Peter ◽  
A. W. Seed ◽  
P. J. Steinle ◽  
...  

Abstract. The Australian Bureau of Meteorology’s operational weather radar network comprises a heterogeneous radar collection covering diverse geography and climate. A naïve Bayes classifier has been developed to identify a range of common echo types observed with these radars. The success of the classifier has been evaluated against its training dataset and by routine monitoring. The training data indicate that more than 90% of precipitation may be identified correctly. The echo types most difficult to distinguish from rainfall are smoke, chaff, and anomalous propagation ground and sea clutter. Their impact depends on their climatological frequency. Small quantities of frequently misclassified persistent echo (like permanent ground clutter or insects) can also cause quality control issues. The Bayes classifier is demonstrated to perform better than a simple threshold method, particularly for reducing misclassification of clutter as precipitation. However, the result depends on finding a balance between excluding precipitation and including erroneous echo. Unlike many single-polarization classifiers that are only intended to extract precipitation echo, the Bayes classifier also discriminates types of nonprecipitation echo. Therefore, the classifier provides the means to utilize clear air echo for applications like data assimilation, and the class information will permit separate data handling of different echo types.
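A toy illustration of the naïve Bayes echo-typing idea, using invented features (reflectivity, Doppler velocity, spatial texture) and synthetic class distributions; the Bureau's operational feature set and classes are richer than this.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n = 200
# synthetic training samples per echo type: [reflectivity dBZ, velocity m/s, texture]
precip = np.column_stack([rng.normal(35, 8, n), rng.normal(5, 4, n), rng.normal(2, 1, n)])
clutter = np.column_stack([rng.normal(45, 10, n), rng.normal(0, 0.5, n), rng.normal(8, 2, n)])
clear_air = np.column_stack([rng.normal(5, 3, n), rng.normal(3, 3, n), rng.normal(1, 1, n)])

X = np.vstack([precip, clutter, clear_air])
y = ["precipitation"] * n + ["clutter"] * n + ["clear air"] * n

clf = GaussianNB().fit(X, y)
print(clf.predict([[38.0, 6.0, 2.0]]))   # a precipitation-like echo
```

Because the classifier outputs a class per echo rather than a keep/discard flag, nonprecipitation echoes such as clear air remain available for downstream uses like data assimilation.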


Author(s):  
Sarmad Mahar ◽  
Sahar Zafar ◽  
Kamran Nishat

Headnotes are the precise explanation and summary of legal points in an issued judgment. Law journals hire experienced lawyers to write these headnotes, which help the reader quickly determine the issue discussed in the case. Headnotes comprise two parts: the first states the topic discussed in the judgment, and the second contains a summary of that judgment. In this thesis, we design, develop, and evaluate headnote prediction using machine learning, without human involvement. We divided this task into a two-step process. In the first step, we predict the law points used in the judgment by using text classification algorithms. The second step generates a summary of the judgment using text summarization techniques. To achieve this, we created a databank by extracting data from different law sources in Pakistan, and we labelled the training data based on Pakistani law websites. We tested different feature extraction methods on judiciary data to improve our system, and using these methods we developed a dictionary of terminology for ease of reference and utility. Our approach achieves 65% accuracy using Linear Support Vector Classification with trigrams and without a stemmer. Using active learning, our system can continuously improve its accuracy as users of the system provide more labelled examples.
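A sketch matching the reported classification setup: Linear Support Vector Classification over word n-grams up to trigrams, with no stemming. The texts and labels are invented toy examples, not the thesis's databank.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "the accused was granted bail pending trial",
    "bail application dismissed for the accused",
    "the sale deed of the property was held void",
    "title to the property passed under the deed",
]
labels = ["bail", "bail", "property", "property"]

# unigrams through trigrams, raw tokens (no stemmer), linear SVM
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["bail was refused by the court"]))
```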


2020 ◽  
Vol 10 (6) ◽  
pp. 2104
Author(s):  
Michał Tomaszewski ◽  
Paweł Michalski ◽  
Jakub Osuchowski

This article presents an analysis of the effectiveness of object detection in digital images with a limited quantity of input data. The possibility of using a limited learning set was achieved by developing a detailed scenario of the task, which strictly defined the conditions of detector operation in the considered case of a convolutional neural network. The described solution utilizes known deep neural network architectures in the process of learning and object detection. The article compares detection results from the most popular deep neural networks while maintaining a limited training set composed of a specific number of selected images from diagnostic video. The analyzed input material was recorded during an inspection flight conducted along high-voltage lines, and the object detector was built for a power insulator. The main contribution of the presented paper is the evidence that a limited training set (in our case, just 60 training frames) can be used for object detection, assuming an outdoor scenario with low variability of environmental conditions. Deciding which network will generate the best result for such a limited training set is not trivial. The conducted research suggests that deep neural networks achieve different levels of effectiveness depending on the amount of training data. The best results were obtained for two convolutional neural networks: the faster region-based convolutional neural network (Faster R-CNN) and the region-based fully convolutional network (R-FCN). Faster R-CNN reached the highest AP (average precision), at a level of 0.8 for 60 frames. The R-FCN model attained a worse AP; however, the number of input samples had a significantly lower influence on its results than on those of the other CNN models, which, in the authors’ assessment, is a desirable feature given a limited training set.
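For reference, AP (the metric the comparison is based on) summarises the precision-recall trade-off of a detector's ranked confidences. The labels and scores below are invented for illustration.

```python
from sklearn.metrics import average_precision_score

# 1 = frame contains a correctly matched insulator detection, 0 = it does not
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.1]  # detector confidences
print(f"AP = {average_precision_score(y_true, scores):.2f}")
```

A detector with AP near 0.8, as reported for Faster R-CNN here, maintains high precision over most of the recall range.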


2020 ◽  
Author(s):  
Tanweer Alam ◽  
Mohamed Benaida

Building an innovative blockchain-based architecture on the Internet of Things (IoT) platform for the education system could be an enticing mechanism to boost communication efficiency within the 5G network. Wireless networking has been a major research area, allowing people to communicate without wires; it was established at the start of the Internet by retrieving web pages to connect one computer to another. Moreover, high-speed, intelligent, powerful networks with numerous contemporary technologies, such as low power consumption, are now available to connect devices to each other. In this situation, the extension of fog features to physical things under the IoT is enabled. One of the complex tasks in the area of mobile communications is to design a new virtualization framework based on blockchain across the Internet of Things architecture. The goal of this research is to present a new study for an educational system that connects blockchain to the Internet of Things, keeping things on the internet cryptographically secure. This research combines an improved blockchain with the IoT to create an efficient interaction system between students, teachers, employers, developers, facilitators, and accreditors on the Internet. A detailed evaluation of the specified framework is presented.
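A minimal sketch of the chaining idea behind a blockchain ledger, as an illustration of why such a system keeps educational records tamper-evident; this is not the paper's framework, and the event payloads are invented.

```python
import hashlib
import json

def block_hash(data, prev_hash):
    """Deterministic hash over a block's payload and its predecessor's hash."""
    payload = json.dumps({"data": data, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def make_block(data, prev_hash):
    return {"data": data, "prev": prev_hash, "hash": block_hash(data, prev_hash)}

def chain_is_valid(chain):
    for i, b in enumerate(chain):
        if b["hash"] != block_hash(b["data"], b["prev"]):
            return False                     # block contents were altered
        if i > 0 and b["prev"] != chain[i - 1]["hash"]:
            return False                     # linkage to predecessor broken
    return True

genesis = make_block({"event": "course enrolment"}, "0" * 64)
chain = [genesis, make_block({"event": "credential issued"}, genesis["hash"])]
print(chain_is_valid(chain))                 # True
chain[0]["data"]["event"] = "tampered"
print(chain_is_valid(chain))                 # False: stored hash no longer matches
```

Because each block commits to its predecessor's hash, altering any record invalidates every later block, which is the property that makes the ledger trustworthy across students, teachers, and accreditors.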


Author(s):  
M. Kölle ◽  
V. Walter ◽  
S. Schmohl ◽  
U. Soergel

Abstract. Automated semantic interpretation of 3D point clouds is crucial for many tasks in the domain of geospatial data analysis. For this purpose, labeled training data is required, which often has to be provided manually by experts. One approach to minimize the cost of human interaction is Active Learning (AL). The aim is to process only the subset of an unlabeled dataset that is particularly helpful with respect to class separation: a machine identifies informative instances, which are then labeled by humans, thereby increasing the performance of the machine. In order to completely avoid involvement of an expert, this time-consuming annotation can be outsourced via crowdsourcing. We therefore propose an approach combining AL with paid crowdsourcing. Although incorporating human interaction, our method can run fully automatically, so that only an unlabeled dataset and a fixed financial budget for the payment of the crowdworkers need to be provided. We conduct multiple iteration steps of the AL process on the ISPRS Vaihingen 3D Semantic Labeling benchmark dataset (V3D) and especially evaluate the performance of the crowd when labeling 3D points. We prove our concept by using labels derived from our crowd-based AL method for classifying the test dataset. The analysis shows that, with only 0.4% of the training dataset labeled by the crowd at a cost of less than $145, both our trained Random Forest and sparse 3D CNN classifiers differ in Overall Accuracy by less than 3 percentage points from the same classifiers trained on the complete V3D training set.
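A sketch of the AL loop described above, using margin-based uncertainty sampling with a Random Forest; here the held-out labels play the role of the crowd "oracle". The synthetic data, batch size, and number of iterations are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in for an unlabeled point-cloud feature set
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# tiny seed set with both classes represented
labeled = list(np.where(y == 0)[0][:10]) + list(np.where(y == 1)[0][:10])
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(5):                                 # five AL iterations
    clf = RandomForestClassifier(random_state=0).fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])
    margin = np.abs(proba[:, 0] - proba[:, 1])     # small margin = uncertain
    pick = [pool[i] for i in np.argsort(margin)[:10]]  # 10 most uncertain points
    labeled += pick                                # the "oracle" labels them
    pool = [i for i in pool if i not in pick]

print(f"labeled {len(labeled)} of {len(X)} points")
```

Spending the labeling budget only on the most uncertain points is what lets a small labeled fraction approach the accuracy of training on the full set.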


Author(s):  
C. Koetsier ◽  
T. Peters ◽  
M. Sester

Abstract. Estimating vehicle poses is crucial for generating precise movement trajectories from (surveillance) camera data; additionally, for real-time applications this task has to be solved efficiently. In this paper we introduce a deep convolutional neural network for pose estimation of vehicles from image patches. For a given 2D image patch, our approach estimates the 2D coordinates of the pixel representing the exact center ground point (cx, cy) and the orientation of the vehicle, represented by the elevation angle (e) of the camera with respect to the vehicle’s center ground point and the azimuth rotation (a) of the vehicle with respect to the camera. Training an accurate model requires a large and diverse training dataset, and collecting and labeling such a large amount of data is very time-consuming and expensive. Given the lack of a sufficient amount of real training data, we furthermore show that rendered 3D vehicle models with artificially generated textures are nearly adequate for training.
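A hypothetical encoding of the pose targets named above: the centre ground point (cx, cy) plus elevation, with the azimuth wrapped as (sin a, cos a) so a regression loss has no discontinuity at the 0°/360° seam. This encoding is an illustrative assumption, not the paper's stated output parameterisation.

```python
import numpy as np

def encode_pose(cx, cy, elevation_deg, azimuth_deg):
    """Pack pose targets; azimuth becomes a (sin, cos) pair."""
    a = np.deg2rad(azimuth_deg)
    return np.array([cx, cy, np.deg2rad(elevation_deg), np.sin(a), np.cos(a)])

def decode_azimuth(target):
    """Recover the azimuth in [0, 360) degrees from the (sin, cos) pair."""
    return np.rad2deg(np.arctan2(target[3], target[4])) % 360

t = encode_pose(cx=0.5, cy=0.7, elevation_deg=30, azimuth_deg=350)
print(f"recovered azimuth: {decode_azimuth(t):.1f} deg")
```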

