scholarly journals Scalable and Generalizable Social Bot Detection through Data Selection

2020 ◽  
Vol 34 (01) ◽  
pp. 1096-1103 ◽  
Author(s):  
Kai-Cheng Yang ◽  
Onur Varol ◽  
Pik-Mai Hui ◽  
Filippo Menczer

Efficient and reliable social bot classification is crucial for detecting information manipulation on social media. Despite rapid development, state-of-the-art bot detection models still face generalization and scalability challenges, which greatly limit their applications. In this paper we propose a framework that uses minimal account metadata, enabling efficient analysis that scales up to handle the full stream of public tweets of Twitter in real time. To ensure model accuracy, we build a rich collection of labeled datasets for training and validation. We deploy a strict validation system so that model performance on unseen datasets is also optimized, in addition to traditional cross-validation. We find that strategically selecting a subset of training data yields better model accuracy and generalization than exhaustively training on all available data. Thanks to the simplicity of the proposed model, its logic can be interpreted to provide insights into social bot characteristics.

2017 ◽  
Vol 3 ◽  
pp. e137 ◽  
Author(s):  
Mona Alshahrani ◽  
Othman Soufan ◽  
Arturo Magana-Mora ◽  
Vladimir B. Bajic

Background Artificial neural networks (ANNs) are a robust class of machine learning models and are a frequent choice for solving classification problems. However, determining the structure of the ANNs is not trivial as a large number of weights (connection links) may lead to overfitting the training data. Although several ANN pruning algorithms have been proposed for the simplification of ANNs, these algorithms are not able to efficiently cope with intricate ANN structures required for complex classification problems. Methods We developed DANNP, a web-based tool, that implements parallelized versions of several ANN pruning algorithms. The DANNP tool uses a modified version of the Fast Compressed Neural Network software implemented in C++ to considerably enhance the running time of the ANN pruning algorithms we implemented. In addition to the performance evaluation of the pruned ANNs, we systematically compared the set of features that remained in the pruned ANN with those obtained by different state-of-the-art feature selection (FS) methods. Results Although the ANN pruning algorithms are not entirely parallelizable, DANNP was able to speed up the ANN pruning up to eight times on a 32-core machine, compared to the serial implementations. To assess the impact of the ANN pruning by DANNP tool, we used 16 datasets from different domains. In eight out of the 16 datasets, DANNP significantly reduced the number of weights by 70%–99%, while maintaining a competitive or better model performance compared to the unpruned ANN. Finally, we used a naïve Bayes classifier derived with the features selected as a byproduct of the ANN pruning and demonstrated that its accuracy is comparable to those obtained by the classifiers trained with the features selected by several state-of-the-art FS methods. The FS ranking methodology proposed in this study allows the users to identify the most discriminant features of the problem at hand. To the best of our knowledge, DANNP (publicly available at www.cbrc.kaust.edu.sa/dannp) is the only available and on-line accessible tool that provides multiple parallelized ANN pruning options. Datasets and DANNP code can be obtained at www.cbrc.kaust.edu.sa/dannp/data.php and https://doi.org/10.5281/zenodo.1001086.


Author(s):  
Penghui Wei ◽  
Wenji Mao ◽  
Guandan Chen

Analyzing public attitudes plays an important role in opinion mining systems. Stance detection aims to determine from a text whether its author is in favor of, against, or neutral towards a given target. One challenge of this task is that a text may not explicitly express an attitude towards the target, but existing approaches utilize target content alone to build models. Moreover, although weakly supervised approaches have been proposed to ease the burden of manually annotating largescale training data, such approaches are confronted with noisy labeling problem. To address the above two issues, in this paper, we propose a Topic-Aware Reinforced Model (TARM) for weakly supervised stance detection. Our model consists of two complementary components: (1) a detection network that incorporates target-related topic information into representation learning for identifying stance effectively; (2) a policy network that learns to eliminate noisy instances from auto-labeled data based on off-policy reinforcement learning. Two networks are alternately optimized to improve each other’s performances. Experimental results demonstrate that our proposed model TARM outperforms the state-of-the-art approaches.


Diagnostics ◽  
2021 ◽  
Vol 11 (12) ◽  
pp. 2288
Author(s):  
Kaixiang Su ◽  
Jiao Wu ◽  
Dongxiao Gu ◽  
Shanlin Yang ◽  
Shuyuan Deng ◽  
...  

Increasingly, machine learning methods have been applied to aid in diagnosis with good results. However, some complex models can confuse physicians because they are difficult to understand, while data differences across diagnostic tasks and institutions can cause model performance fluctuations. To address this challenge, we combined the Deep Ensemble Model (DEM) and tree-structured Parzen Estimator (TPE) and proposed an adaptive deep ensemble learning method (TPE-DEM) for dynamic evolving diagnostic task scenarios. Different from previous research that focuses on achieving better performance with a fixed structure model, our proposed model uses TPE to efficiently aggregate simple models more easily understood by physicians and require less training data. In addition, our proposed model can choose the optimal number of layers for the model and the type and number of basic learners to achieve the best performance in different diagnostic task scenarios based on the data distribution and characteristics of the current diagnostic task. We tested our model on one dataset constructed with a partner hospital and five UCI public datasets with different characteristics and volumes based on various diagnostic tasks. Our performance evaluation results show that our proposed model outperforms other baseline models on different datasets. Our study provides a novel approach for simple and understandable machine learning models in tasks with variable datasets and feature sets, and the findings have important implications for the application of machine learning models in computer-aided diagnosis.


2020 ◽  
Vol 10 (23) ◽  
pp. 8346
Author(s):  
Ni Jiang ◽  
Feihong Yu

Cell counting is a fundamental part of biomedical and pathological research. Predicting a density map is the mainstream method to count cells. As an easy-trained and well-generalized model, the random forest is often used to learn the cell images and predict the density maps. However, it cannot predict the data that are beyond the training data, which may result in underestimation. To overcome this problem, we propose a cell counting framework to predict the density map by detecting cells. The cell counting framework contains two parts: the training data preparation and the detection framework. The former makes sure that the cells can be detected even when overlapping, and the latter makes sure the count result accurate and robust. The proposed method uses multiple random forests to predict various probability maps where the cells can be detected by Hessian matrix. Take all the detection results into consideration to get the density map and achieve better performance. We conducted experiments on three public cell datasets. Experimental results showed that the proposed model performs better than the traditional random forest (RF) in terms of accuracy and robustness, and even superior to some state-of-the-art deep learning models. Especially when the training data are small, which is the usual case in cell counting, the count errors on VGG cells, and MBM cells were decreased from 3.4 to 2.9, from 11.3 to 9.3, respectively. The proposed model can obtain the lowest count error and achieves state-of-the-art.


2021 ◽  
Vol 17 (12) ◽  
pp. e1009655
Author(s):  
Lei Li ◽  
Yu-Tian Wang ◽  
Cun-Mei Ji ◽  
Chun-Hou Zheng ◽  
Jian-Cheng Ni ◽  
...  

microRNAs (miRNAs) are small non-coding RNAs related to a number of complicated biological processes. A growing body of studies have suggested that miRNAs are closely associated with many human diseases. It is meaningful to consider disease-related miRNAs as potential biomarkers, which could greatly contribute to understanding the mechanisms of complex diseases and benefit the prevention, detection, diagnosis and treatment of extraordinary diseases. In this study, we presented a novel model named Graph Convolutional Autoencoder for miRNA-Disease Association Prediction (GCAEMDA). In the proposed model, we utilized miRNA-miRNA similarities, disease-disease similarities and verified miRNA-disease associations to construct a heterogeneous network, which is applied to learn the embeddings of miRNAs and diseases. In addition, we separately constructed miRNA-based and disease-based sub-networks. Combining the embeddings of miRNAs and diseases, graph convolution autoencoder (GCAE) is utilized to calculate association scores of miRNA-disease on two sub-networks, respectively. Furthermore, we obtained final prediction scores between miRNAs and diseases by adopting an average ensemble way to integrate the prediction scores from two types of subnetworks. To indicate the accuracy of GCAEMDA, we applied different cross validation methods to evaluate our model whose performance were better than the state-of-the-art models. Case studies on a common human diseases were also implemented to prove the effectiveness of GCAEMDA. The results demonstrated that GCAEMDA were beneficial to infer potential associations of miRNA-disease.


Information ◽  
2021 ◽  
Vol 12 (11) ◽  
pp. 487
Author(s):  
Sohaib Al-Yadumi ◽  
Wei-Wei Goh ◽  
Ee-Xion Tan ◽  
Noor Zaman Jhanjhi ◽  
Patrice Boursier

Ontology matching is a rapidly emerging topic crucial for semantic web effort, data integration, and interoperability. Semantic heterogeneity is one of the most challenging aspects of ontology matching. Consequently, background knowledge (BK) resources are utilized to bridge the semantic gap between the ontologies. Generic BK approaches use a single matcher to discover correspondences between entities from different ontologies. However, the Ontology Alignment Evaluation Initiative (OAEI) results show that not all matchers identify the same correct mappings. Moreover, none of the matchers can obtain good results across all matching tasks. This study proposes a novel BK multimatcher approach for improving ontology matching by effectively generating and combining mappings from biomedical ontologies. Aggregation strategies to create more effective mappings are discussed. Then, a matcher path confidence measure that helps select the most promising paths using the final mapping selection algorithm is proposed. The proposed model performance is tested using the Anatomy and Large Biomed tracks offered by the OAEI 2020. Results show that higher recall levels have been obtained. Moreover, the F-measure values achieved with our model are comparable with those obtained by the state of the art matchers.


2020 ◽  
Author(s):  
Anika Gebauer ◽  
Ali Sakhaee ◽  
Axel Don ◽  
Mareike Ließ

<p>In order to assess the carbon and water storage capacity of agricultural soils at national scale (Germany), spatially continuous, high-resolution soil information on the particle size distribution is an essential requirement. Machine learning models are good at computing complex, composite non-linear functions. They can be trained on point data to relate soil properties (response variable) to approximations of soil forming factors (predictors). Finally, the obtained models can be used for spatial soil property predictions.</p><p>We developed models for topsoil texture regionalization using two powerful algorithms: the boosted regression trees machine learning algorithm, and the differential evolution algorithm applied for parameter tuning. Texture data (clay, silt, sand) originated from two sources: (1) the new soil database of the German Agricultural Soil Inventory (BZE), and (2) the well-known, publicly available database of the European Land Use / Cover Land Survey (LUCAS). BZE texture data results from an eight-kilometer sampling raster (2991 sampling points). LUCAS data from soils under agricultural use (Germany) comprises 1377 sampling points. The predictor datasets included DEM-based topography variables, information on the geographic position, and legacy maps of soil systematic units. In a first step, a nested five-fold cross-validation approach was used to tune and train models on the BZE data. In a second step, the amount of training data was increased by adding two-thirds of the LUCAS data. Model performance was evaluated by (1) cross-validation (R<sub>CV</sub>²), and (2) by using the remaining LUCAS data as an independent external test set (R<sub>external</sub>²).</p><p>Models trained on the BZE data were able to predict the nation-wide spatial distribution of clay, silt and sand (R<sub>CV</sub>² = 0.57 – 0.76; R<sub>external</sub>² = 0.68 – 0.83). Model performance was further enhanced by adding the LUCAS data to the training dataset.</p>


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Meiyu Li ◽  
Fenghui Lian ◽  
Chunyu Wang ◽  
Shuxu Guo

Abstract Background A novel multi-level pyramidal pooling residual U-Net with adversarial mechanism was proposed for organ segmentation from medical imaging, and was conducted on the challenging NIH Pancreas-CT dataset. Methods The 82 pancreatic contrast-enhanced abdominal CT volumes were split via four-fold cross validation to test the model performance. In order to achieve accurate segmentation, we firstly involved residual learning into an adversarial U-Net to achieve a better gradient information flow for improving segmentation performance. Then, we introduced a multi-level pyramidal pooling module (MLPP), where a novel pyramidal pooling was involved to gather contextual information for segmentation, then four groups of structures consisted of a different number of pyramidal pooling blocks were proposed to search for the structure with the optimal performance, and two types of pooling blocks were applied in the experimental section to further assess the robustness of MLPP for pancreas segmentation. For evaluation, Dice similarity coefficient (DSC) and recall were used as the metrics in this work. Results The proposed method preceded the baseline network 5.30% and 6.16% on metrics DSC and recall, and achieved competitive results compared with the-state-of-art methods. Conclusions Our algorithm showed great segmentation performance even on the particularly challenging pancreas dataset, this indicates that the proposed model is a satisfactory and promising segmentor.


2021 ◽  
Vol 11 (11) ◽  
pp. 661
Author(s):  
Marah Alhalabi ◽  
Mohammed Ghazal ◽  
Fasila Haneefa ◽  
Jawad Yousaf ◽  
Ayman El-Baz

Resolving circuit diagrams is a regular part of learning for school and university students from engineering backgrounds. Simulating circuits is usually done manually by creating circuit diagrams on circuit tools, which is a time-consuming and tedious process. We propose an innovative method of simulating circuits from hand-drawn diagrams using smartphones through an image recognition system. This method allows students to use their smartphones to capture images instead of creating circuit diagrams before simulation. Our contribution lies in building a circuit recognition system using a deep learning capsule networks algorithm. The developed system receives an image captured by a smartphone that undergoes preprocessing, region proposal, classification, and node detection to get a Netlist and exports it to a circuit simulator program for simulation. We aim to improve engineering education using smartphones by (1) achieving higher accuracy using less training data with capsule networks and (2) developing a comprehensive system that captures hand-drawn circuit diagrams and produces circuit simulation results. We use 400 samples per class and report an accuracy of 96% for stratified 5-fold cross-validation. Through testing, we identify the optimum distance for taking circuit images to be 10 to 20 cm. Our proposed model can identify components of different scales and rotations.


2021 ◽  
Vol 13 (22) ◽  
pp. 12906
Author(s):  
Majed G. Alharbi ◽  
Ahmed Stohy ◽  
Mohammed Elhenawy ◽  
Mahmoud Masoud ◽  
Hamiden Abd El-Wahed Khalifa

This paper introduces a time efficient deep learning-based solution to the traveling salesman problem with time window (TSPTW). Our goal is to reduce the total tour length traveled by -*the agent without violating any time limitations. This will aid in decreasing the time required to supply any type of service, as well as lowering the emissions produced by automobiles, allowing our planet to recover from air pollution emissions. The proposed model is a variation of the pointer networks that has a better ability to encode the TSPTW problems. The model proposed in this paper is inspired from our previous work that introduces a hybrid context encoder and a multi attention decoder. The hybrid encoder primarily comprises the transformer encoder and the graph encoder; these encoders encode the feature vector before passing it to the attention decoder layer. The decoder consists of transformer context and graph context as well. The output attentions from the two decoders are aggregated and used to select the following step in the trip. To the best of our knowledge, our network is the first neural model that will be able to solve medium-size TSPTW problems. Moreover, we conducted sensitivity analysis to explore how the model performance changes as the time window width in the training and testing data changes. The experimental work shows that our proposed model outperforms the state-of-the-art model for TSPTW of sizes 20, 50 and 100 nodes/cities. We expect that our model will become state-of-the-art methodology for solving the TSPTW problems.


Sign in / Sign up

Export Citation Format

Share Document