A tree-based learning approach for document structure analysis and its application to web search

2014 ◽  
Vol 21 (4) ◽  
pp. 569-605 ◽  
Author(s):  
F. CANAN PEMBE ◽  
TUNGA GÜNGÖR

Abstract: In this paper, we study the problem of structural analysis of Web documents, aiming at extracting the sectional hierarchy of a document. In general, a document can be represented as a hierarchy of sections and subsections with corresponding headings and subheadings. We developed two machine learning models: a heading extraction model and a hierarchy extraction model. Heading extraction was formulated as a classification problem, whereas a tree-based learning approach was employed in hierarchy extraction. For this purpose, we developed an incremental learning algorithm based on support vector machines and perceptrons. The models were evaluated in detail with respect to the performance of the heading and hierarchy extraction tasks. For comparison, a baseline rule-based approach was used that relies on heuristics and HTML document object model tree processing. The machine learning approach, which is fully automatic, outperformed the rule-based approach. We also analyzed the effect of document structuring on automatic summarization in the context of Web search. The results of the task-based evaluation on TREC queries showed that structured summaries are superior to unstructured summaries in terms of both accuracy and user ratings, and enable users to determine the relevance of search results more accurately than search engine snippets.
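To make the target of hierarchy extraction concrete, the following minimal sketch nests a flat list of headings into a section tree. It is not the paper's learning algorithm: it assumes the heading levels are already known and are at least 1, and the names `Section` and `build_hierarchy` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """One node of the sectional hierarchy: a heading plus its subsections."""
    title: str
    level: int
    children: list["Section"] = field(default_factory=list)


def build_hierarchy(headings: list[tuple[int, str]]) -> Section:
    """Nest (level, title) headings under a synthetic root of level 0.

    Assumption: levels are given and are >= 1 (1 for a top-level heading,
    2 for a subheading, and so on). In the paper, both the headings and
    their hierarchy are predicted by learned models instead.
    """
    root = Section(title="<document>", level=0)
    stack = [root]  # path from the root to the most recently added section
    for level, title in headings:
        # Close every open section that is not a proper ancestor of this one.
        while stack[-1].level >= level:
            stack.pop()
        node = Section(title=title, level=level)
        stack[-1].children.append(node)
        stack.append(node)
    return root
```

Calling `build_hierarchy([(1, "Intro"), (2, "Background"), (1, "Method")])` would place "Background" under "Intro" and leave "Method" as a sibling of "Intro".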

2018 ◽  
Vol 31 (07) ◽  
pp. 937-945 ◽  
Author(s):  
Massimiliano Grassi ◽  
David A. Loewenstein ◽  
Daniela Caldirola ◽  
Koen Schruers ◽  
Ranjan Duara ◽  
...  

Abstract. Background: In a previous study, we developed a high-performing and clinically translatable machine learning algorithm for predicting three-year conversion to Alzheimer’s disease (AD) in subjects with Mild Cognitive Impairment (MCI) and Pre-mild Cognitive Impairment. Further tests are necessary to demonstrate its accuracy when applied to subjects not used in the original training process. In this study, we aimed to provide preliminary evidence of this via a transfer learning approach. Methods: We initially employed the same baseline information (i.e. clinical and neuropsychological test scores, cardiovascular risk indexes, and a visual rating scale for brain atrophy) and the same machine learning technique (support vector machine with radial-basis function kernel) used in our previous study to retrain the algorithm to discriminate between participants with AD (n = 75) and normal cognition (n = 197). Then, the algorithm was applied to perform the original task of predicting the three-year conversion to AD in the sample of 61 MCI subjects that we used in the previous study. Results: Even after the retraining, the algorithm demonstrated a significant predictive performance in the MCI sample (AUC = 0.821, 95% CI bootstrap = 0.705–0.912, best balanced accuracy = 0.779, sensitivity = 0.852, specificity = 0.706). Conclusions: These results provide initial indirect evidence that our original algorithm can also make relevant generalized predictions when applied to new MCI individuals. This motivates future efforts to bring the algorithm to sufficient levels of optimization and trustworthiness that will allow its application in both clinical and research settings.
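As a quick consistency check on the reported metrics, balanced accuracy is the mean of sensitivity and specificity. The short sketch below recomputes it from the two values given in the abstract; the function name is hypothetical.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of the true-positive rate and the true-negative rate."""
    return (sensitivity + specificity) / 2


# Sensitivity and specificity as reported in the abstract for the MCI sample.
bac = balanced_accuracy(sensitivity=0.852, specificity=0.706)
print(f"{bac:.3f}")  # 0.779
```

The result matches the reported best balanced accuracy of 0.779.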


2021 ◽  
Vol 3 (3) ◽  
pp. 542-558
Author(s):  
Lijuan Tan ◽  
Jinzhu Lu ◽  
Huanyu Jiang

Tomato production can be greatly reduced by various diseases, such as bacterial spot, early blight, and leaf mold. Rapid recognition and timely treatment of diseases can minimize tomato production loss. Many researchers (across institutes, laboratories, and universities) have developed and examined traditional machine learning (ML) and deep learning (DL) algorithms for plant disease classification. However, through a survey of past work, we found no studies comparing the classification performance of ML and DL for the tomato disease classification problem. The performance and outcomes of different traditional ML and DL (a subset of ML) methods may vary depending on the datasets used and the tasks to be solved. This study aimed to identify the most suitable ML/DL models for the PlantVillage tomato dataset and the tomato disease classification problem. For the machine learning algorithms, we used different methods to extract disease features manually. We extracted a total of 52 texture features using local binary pattern (LBP) and gray level co-occurrence matrix (GLCM) methods, and 105 color features using color moment and color histogram methods. Among all the feature extraction methods, the COLOR+GLCM method obtained the best result. By comparing the different methods, we found that the metrics (accuracy, precision, recall, F1 score) of the tested deep learning networks (AlexNet, VGG16, ResNet34, EfficientNet-b0, and MobileNetV2) were all better than those of the tested machine learning algorithms (support vector machine (SVM), k-nearest neighbor (kNN), and random forest (RF)). Furthermore, we found that, for our dataset and classification task, among the tested ML/DL algorithms, the ResNet34 network obtained the best results, with accuracy of 99.7%, precision of 99.6%, recall of 99.7%, and F1 score of 99.7%.
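For readers unfamiliar with GLCM texture features, here is a minimal pure-Python sketch of a normalized co-occurrence matrix and one derived feature (contrast). It illustrates the general technique only and is not the authors' feature pipeline; the function names, the single horizontal offset, and the tiny example image are assumptions, and the image must contain at least one valid pixel pair for the chosen offset.

```python
def glcm(image, levels, offset=(0, 1)):
    """Normalized gray level co-occurrence matrix for one pixel offset.

    `image` is a list of rows of integer gray levels in range(levels).
    """
    d_row, d_col = offset
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(map(sum, counts))  # number of valid pixel pairs
    return [[n / total for n in row] for row in counts]


def contrast(p):
    """GLCM contrast: sum over (i, j) of p[i][j] * (i - j)**2."""
    return sum(p[i][j] * (i - j) ** 2
               for i in range(len(p)) for j in range(len(p)))


# Hypothetical 2x3 image with three gray levels.
tiny = [[0, 0, 1],
        [1, 2, 2]]
print(contrast(glcm(tiny, levels=3)))  # 0.5
```

In practice, several offsets and several statistics (contrast, homogeneity, energy, correlation) are usually combined into one texture feature vector.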


Author(s):  
Padmavathi .S ◽  
M. Chidambaram

Text classification has become more significant in managing and organizing text data due to the tremendous growth of online information. It classifies documents into a fixed number of predefined categories. The rule-based approach and the machine learning approach are the two ways of performing text classification. In the rule-based approach, documents are classified according to manually defined rules. In the machine learning approach, the classification rules, or classifier, are derived automatically from example documents; this approach offers higher recall and a faster process. This paper presents an investigation of text classification using different machine learning techniques.
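The contrast between the two approaches can be sketched in a few lines. The rule-based function below uses hand-picked keywords, while the learned classifier (a multinomial Naive Bayes with Laplace smoothing, chosen here only as a simple illustration) derives its parameters from example documents. The keywords, category names, and training texts are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def rule_based(text):
    """Rule-based approach: a manually defined rule (hypothetical keywords)."""
    return "spam" if {"buy", "cheap"} & set(text.split()) else "ham"


def train(examples):
    """Machine learning approach: estimate word counts per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Pick the category with the highest Laplace-smoothed log probability."""
    word_counts, doc_counts, vocab = model
    n_docs = sum(doc_counts.values())

    def score(label):
        total = sum(word_counts[label].values())
        log_p = math.log(doc_counts[label] / n_docs)
        for word in text.split():
            log_p += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        return log_p

    return max(doc_counts, key=score)


# Hypothetical example documents.
model = train([
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda attached", "ham"),
    ("project meeting tomorrow", "ham"),
])
print(classify("buy cheap pills", model))   # spam
print(classify("meeting tomorrow", model))  # ham
```

Changing the categories in the learned classifier only requires new example documents, whereas the rule-based function has to be rewritten by hand.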


2019 ◽  
Vol 23 (1) ◽  
pp. 12-21 ◽  
Author(s):  
Shikha N. Khera ◽  
Divya

The information technology (IT) industry in India has been facing a systemic issue of high attrition in the past few years, resulting in monetary and knowledge-based losses to companies. The aim of this research is to develop a model to predict employee attrition and give organizations the opportunity to address any issues and improve retention. A predictive model was developed based on a supervised machine learning algorithm, the support vector machine (SVM). Archival employee data (consisting of 22 input features) were collected from the Human Resource databases of three IT companies in India, including the employees' employment status (response variable) at the time of collection. Results from the confusion matrix for the SVM model showed an accuracy of 85 per cent. The results also show that the model performs better in predicting who will leave the firm than in predicting who will stay.


2021 ◽  
Vol 10 (5) ◽  
pp. 992
Author(s):  
Martina Barchitta ◽  
Andrea Maugeri ◽  
Giuliana Favara ◽  
Paolo Marco Riela ◽  
Giovanni Gallo ◽  
...  

Patients in intensive care units (ICUs) are at higher risk of worsened prognosis and mortality. Here, we aimed to evaluate the ability of the Simplified Acute Physiology Score (SAPS II) to predict the risk of 7-day mortality, and to test a machine learning algorithm that combines the SAPS II with additional patient characteristics at ICU admission. We used data from the “Italian Nosocomial Infections Surveillance in Intensive Care Units” network. A Support Vector Machines (SVM) algorithm was used to classify 3782 patients according to sex, patient’s origin, type of ICU admission, non-surgical treatment for acute coronary disease, surgical intervention, SAPS II, presence of invasive devices, trauma, impaired immunity, antibiotic therapy and onset of HAI. The accuracy of SAPS II in distinguishing patients who died from those who did not was 69.3%, with an Area Under the Curve (AUC) of 0.678. Using the SVM algorithm, instead, we achieved an accuracy of 83.5% and an AUC of 0.896. Notably, SAPS II was the variable that weighed most in the model, and its removal resulted in an AUC of 0.653 and an accuracy of 68.4%. Overall, these findings suggest that the present SVM model is a useful tool for early prediction of patients at higher risk of death at ICU admission.
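The AUC values quoted above can be read as the probability that a randomly chosen patient who died receives a higher risk score than a randomly chosen survivor. The sketch below computes AUC directly from that definition (the Mann-Whitney formulation); the function name and the example scores are hypothetical and unrelated to the study data.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney U formulation of AUC).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical risk scores: 8 of the 9 positive/negative pairs are ordered correctly.
print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

This pairwise form is quadratic in the number of samples; rank-based implementations give the same value more efficiently on large cohorts.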


2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Linda A. Antonucci ◽  
Alessandra Raio ◽  
Giulio Pergola ◽  
Barbara Gelao ◽  
Marco Papalino ◽  
...  

Abstract. Background: Recent views posited that negative parenting and attachment insecurity can be considered as general environmental factors of vulnerability for psychosis, specifically for individuals diagnosed with psychosis (PSY). Furthermore, evidence highlighted a tight relationship between attachment style and social cognition abilities, a key PSY behavioral phenotype. The aim of this study is to generate a machine learning algorithm based on the perceived quality of parenting and attachment style-related features to discriminate between PSY and healthy controls (HC) and to investigate its ability to track PSY early stages and risk conditions, as well as its association with social cognition performance. Methods: Perceived maternal and paternal parenting, as well as attachment anxiety and avoidance scores, were used to train a model separating 71 HC from 34 PSY (20 individuals diagnosed with schizophrenia + 14 diagnosed with bipolar disorder with psychotic manifestations) using support vector classification and repeated nested cross-validation. We then validated this model on independent datasets including individuals at the early stages of disease (ESD, i.e. first episode of psychosis or depression, or at-risk mental state for psychosis) and with familial high risk for PSY (FHR, i.e. having a first-degree relative suffering from psychosis). Then, we performed factorial analyses to test the group x classification rate interaction on emotion perception, social inference and managing of emotions abilities. Results: The perceived parenting and attachment-based machine learning model discriminated PSY from HC with a Balanced Accuracy (BAC) of 72.2%. Slightly lower classification performance was measured in the ESD sample (HC-ESD BAC = 63.5%), while the model could not discriminate between FHR and HC (BAC = 44.2%).
We observed a significant group x classification interaction in PSY and HC from the discovery sample on emotion perception and on the ability to manage emotions (both p = 0.02). The interaction on managing of emotion abilities was replicated in the ESD and HC validation sample (p = 0.03). Conclusion: Our results suggest that parenting and attachment-related variables bear significant classification power when applied to both PSY and its early stages and are associated with variability in emotion processing. These variables could therefore be useful in psychosis early recognition programs aimed at softening the psychosis-associated disability.
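Repeated nested cross-validation, as named in the Methods, keeps model selection (inner folds) separate from performance estimation (outer folds). The sketch below only generates the index splits; the fold counts, repeat count, and function names are assumptions, while the sample size of 105 corresponds to the 71 HC plus 34 PSY in the abstract.

```python
import random


def k_fold(indices, k, rng):
    """Shuffle indices and yield (train, test) pairs for k folds."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def repeated_nested_cv(n_samples, outer_k, inner_k, repeats, seed=0):
    """Yield (repeat, outer_train, outer_test, inner_splits).

    Inner splits are drawn only from the outer training set, so the outer
    test fold never influences model selection.
    """
    rng = random.Random(seed)
    for rep in range(repeats):
        for outer_train, outer_test in k_fold(list(range(n_samples)), outer_k, rng):
            inner_splits = list(k_fold(outer_train, inner_k, rng))
            yield rep, outer_train, outer_test, inner_splits


# 105 samples = 71 HC + 34 PSY; the fold and repeat counts are hypothetical.
for rep, outer_train, outer_test, inner in repeated_nested_cv(105, 5, 4, repeats=2):
    pass  # tune hyperparameters on `inner`, then score once on `outer_test`
```

A stratified variant that preserves the HC/PSY ratio in each fold would be a natural refinement for an imbalanced sample like this one.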


2021 ◽  
pp. 1-17
Author(s):  
Ahmed Al-Tarawneh ◽  
Ja’afer Al-Saraireh

Twitter is one of the most popular platforms used to share and post ideas. Hackers and anonymous attackers use such platforms maliciously, and their behavior can be used to predict the risk of future attacks by gathering and classifying hackers’ tweets using machine-learning techniques. Previous approaches for detecting infected tweets are based on human effort or text analysis, and are therefore limited in capturing the hidden meaning between the lines of tweets. The main aim of this research paper is to enhance the efficiency of hacker detection on the Twitter platform using the complex networks technique with adapted machine learning algorithms. This work presents a methodology that collects a list of users, together with followers who share their posts and have similar interests, from a hackers’ community on Twitter. The list is built from a set of suggested keywords that are commonly used by hackers in their tweets. After that, a complex network is generated for all users to find relations among them in terms of network centrality, closeness, and betweenness. After extracting these values, a dataset of the most influential users in the hacker community is assembled. Subsequently, tweets belonging to users in the extracted dataset are gathered and classified into positive and negative classes. The output of this process is used in a machine learning process by applying different algorithms. This research builds and investigates an accurate dataset containing real users who belong to a hackers’ community. Correctly classified instances were measured for accuracy using the average values of the K-nearest neighbor, Naive Bayes, Random Tree, and support vector machine techniques, demonstrating about 90% and 88% accuracy for cross-validation and percentage split, respectively.
Consequently, the proposed network cyber Twitter model is able to detect hackers, and determine if tweets pose a risk to future institutions and individuals to provide early warning of possible attacks.
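To illustrate the kind of network measures mentioned above, the following sketch computes degree and closeness centrality on a small undirected graph stored as an adjacency dictionary. It is a generic illustration, not the authors' pipeline; betweenness is omitted for brevity, and the graph and account names are hypothetical.

```python
from collections import deque


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def closeness_centrality(graph):
    """(reachable nodes) / (sum of shortest-path distances), via BFS."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


# Hypothetical star-shaped follower network centered on account "a".
graph = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
print(closeness_centrality(graph)["a"])  # 1.0
print(closeness_centrality(graph)["b"])  # 0.6
```

Ranking users by such scores is one simple way to shortlist the most influential accounts before collecting their tweets.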


Sensors ◽  
2021 ◽  
Vol 21 (2) ◽  
pp. 617
Author(s):  
Umer Saeed ◽  
Young-Doo Lee ◽  
Sana Ullah Jan ◽  
Insoo Koo

As key components of Cyber-Physical Systems, sensors are susceptible to failures due to complex environments, low-quality production, and aging. When defective, sensors either stop communicating or convey incorrect information. These unsteady situations threaten the safety, economy, and reliability of a system. The objective of this study is to construct a lightweight machine learning-based fault detection and diagnostic system within the limited energy resources, memory, and computation of a Wireless Sensor Network (WSN). In this paper, a Context-Aware Fault Diagnostic (CAFD) scheme is proposed based on an ensemble learning algorithm called Extra-Trees. To evaluate the performance of the proposed scheme, a realistic WSN scenario composed of humidity and temperature sensor observations is replicated with extremely low-intensity faults. Six commonly occurring types of sensor fault are considered: drift, hard-over/bias, spike, erratic/precision degradation, stuck, and data-loss. The proposed CAFD scheme demonstrates the ability to accurately detect and diagnose low-intensity sensor faults in a timely manner. Moreover, the efficiency of the Extra-Trees algorithm in terms of diagnostic accuracy, F1-score, ROC-AUC, and training time is demonstrated by comparison with cutting-edge machine learning algorithms: a Support Vector Machine and a Neural Network.
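Four of the six fault types listed above can be sketched as simple transformations of a clean signal. The definitions below are common illustrative forms and are assumptions, not the paper's exact fault models; the function names, parameters, and temperature readings are hypothetical.

```python
def inject_bias(signal, start, offset):
    """Bias fault: a constant offset is added from `start` onward."""
    return [x + offset if i >= start else x for i, x in enumerate(signal)]


def inject_drift(signal, start, slope):
    """Drift fault: the error grows linearly after `start`."""
    return [x + slope * (i - start) if i >= start else x
            for i, x in enumerate(signal)]


def inject_spike(signal, positions, magnitude):
    """Spike fault: isolated samples jump by `magnitude`."""
    hit = set(positions)
    return [x + magnitude if i in hit else x for i, x in enumerate(signal)]


def inject_stuck(signal, start):
    """Stuck fault: the reading freezes at its value at `start`."""
    return [signal[start] if i >= start else x for i, x in enumerate(signal)]


# Hypothetical temperature readings in degrees Celsius.
clean = [20.0, 20.5, 21.0, 21.5, 22.0]
print(inject_drift(clean, start=2, slope=0.5))  # [20.0, 20.5, 21.0, 22.0, 23.0]
```

Labeled signals produced this way could serve as training data for a fault classifier; smaller offsets and slopes correspond to the low-intensity faults the abstract emphasizes.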

