What You Say and How You Say it: Joint Modeling of Topics and Discourse in Microblog Conversations

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00267 ◽

2019 ◽

Vol 7 ◽

pp. 267-281 ◽

Cited By ~ 4

Author(s):

Jichuan Zeng ◽

Jing Li ◽

Yulan He ◽

Cuiyun Gao ◽

Michael R. Lyu ◽

...

Keyword(s):

Neural Model ◽

Joint Modeling ◽

Data Sets

This paper presents an unsupervised framework for jointly modeling topic content and discourse behavior in microblog conversations. Concretely, we propose a neural model to discover word clusters indicating what a conversation concerns (i.e., topics) and those reflecting how participants voice their opinions (i.e., discourse).1Extensive experiments show that our model can yield both coherent topics and meaningful discourse behavior. Further study shows that our topic and discourse representations can benefit the classification of microblog messages, especially when they are jointly trained with the classifier.Our data sets and code are available at: http://github.com/zengjichuan/Topic_Disc .

Download Full-text

Classification of jujube defects in small data sets based on transfer learning

Neural Computing and Applications ◽

10.1007/s00521-021-05715-2 ◽

2021 ◽

Author(s):

Jianping Ju ◽

Hong Zheng ◽

Xiaohang Xu ◽

Zhongyuan Guo ◽

Zhaohui Zheng ◽

...

Keyword(s):

Transfer Learning ◽

Loss Function ◽

Training Model ◽

Parameter Distribution ◽

Test Accuracy ◽

Small Data ◽

Data Sets ◽

Data Set ◽

Small Data Sets

AbstractAlthough convolutional neural networks have achieved success in the field of image classification, there are still challenges in the field of agricultural product quality sorting such as machine vision-based jujube defects detection. The performance of jujube defect detection mainly depends on the feature extraction and the classifier used. Due to the diversity of the jujube materials and the variability of the testing environment, the traditional method of manually extracting the features often fails to meet the requirements of practical application. In this paper, a jujube sorting model in small data sets based on convolutional neural network and transfer learning is proposed to meet the actual demand of jujube defects detection. Firstly, the original images collected from the actual jujube sorting production line were pre-processed, and the data were augmented to establish a data set of five categories of jujube defects. The original CNN model is then improved by embedding the SE module and using the triplet loss function and the center loss function to replace the softmax loss function. Finally, the depth pre-training model on the ImageNet image data set was used to conduct training on the jujube defects data set, so that the parameters of the pre-training model could fit the parameter distribution of the jujube defects image, and the parameter distribution was transferred to the jujube defects data set to complete the transfer of the model and realize the detection and classification of the jujube defects. The classification results are visualized by heatmap through the analysis of classification accuracy and confusion matrix compared with the comparison models. The experimental results show that the SE-ResNet50-CL model optimizes the fine-grained classification problem of jujube defect recognition, and the test accuracy reaches 94.15%. The model has good stability and high recognition accuracy in complex environments.

Download Full-text

Detection and Classification of Anomalies in Large Data Sets on the Basis of Information Granules

IEEE Transactions on Fuzzy Systems ◽

10.1109/tfuzz.2021.3076265 ◽

2021 ◽

pp. 1-1

Author(s):

Adam Kiersztyn ◽

Pawe Karczmarek ◽

Krystyna Kiersztyn ◽

Witold Pedrycz

Keyword(s):

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Information Granules

Download Full-text

Software Benchmark—Classification Tree Algorithms for Cell Atlases Annotation Using Single-Cell RNA-Sequencing Data

Microbiology Research ◽

10.3390/microbiolres12020022 ◽

2021 ◽

Vol 12 (2) ◽

pp. 317-334

Author(s):

Omar Alaqeeli ◽

Li Xing ◽

Xuekui Zhang

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Classification Tree ◽

Area Under The Curve ◽

Data Sets ◽

Sequencing Data ◽

Single Cell Rna Sequencing ◽

Tree Algorithms ◽

R Packages

Classification tree is a widely used machine learning method. It has multiple implementations as R packages; rpart, ctree, evtree, tree and C5.0. The details of these implementations are not the same, and hence their performances differ from one application to another. We are interested in their performance in the classification of cells using the single-cell RNA-Sequencing data. In this paper, we conducted a benchmark study using 22 Single-Cell RNA-sequencing data sets. Using cross-validation, we compare packages’ prediction performances based on their Precision, Recall, F1-score, Area Under the Curve (AUC). We also compared the Complexity and Run-time of these R packages. Our study shows that rpart and evtree have the best Precision; evtree is the best in Recall, F1-score and AUC; C5.0 prefers more complex trees; tree is consistently much faster than others, although its complexity is often higher than others.

Download Full-text

A proposed procedure for the analysis of large phytosociological data sets in the classification of South African grasslands

Koedoe ◽

10.4102/koedoe.v38i1.303 ◽

1995 ◽

Vol 38 (1) ◽

Cited By ~ 6

Author(s):

G.J. Bredenkamp ◽

H. Bezuidenhout

Keyword(s):

South African ◽

Plant Communities ◽

Vegetation Type ◽

Second Step ◽

Data Sets ◽

Vegetation Types ◽

Related Plant ◽

Single Cluster ◽

Synoptic Table

A procedure for the effective classification of large phytosociological data sets, and the combination of many data sets from various parts of the South African grasslands is demonstrated. The procedure suggests a region by region or project by project treatment of the data. The analyses are performed step by step to effectively bring together all releves of similar or related plant communities. The first step involves a separate numerical classification of each subset (region), and subsequent refinement by Braun- Blanquet procedures. The resulting plant communities are summarised in a single synoptic table, by calculating a synoptic value for each species in each community. In the second step all communities in the synoptic table are classified by numerical analysis, to bring related communities from different regions or studies together in a single cluster. After refinement of these clusters by Braun-Blanquet procedures, broad vegetation types are identified. As a third step phytosociological tables are compiled for each iden- tified broad vegetation type, and a comprehensive abstract hierarchy constructed.

Download Full-text

The classification of imbalanced large data sets based on MapReduce and ensemble of ELM classifiers

International Journal of Machine Learning and Cybernetics ◽

10.1007/s13042-015-0478-7 ◽

2015 ◽

Vol 8 (3) ◽

pp. 1009-1017 ◽

Cited By ~ 54

Author(s):

Junhai Zhai ◽

Sufang Zhang ◽

Chenxi Wang

Keyword(s):

Large Data ◽

Large Data Sets ◽

Data Sets

Download Full-text

Sequential Sampling for Estimation and Classification of the Incidence of Hop Powdery Mildew II: Cone Sampling

Plant Disease ◽

10.1094/pdis-91-8-1013 ◽

2007 ◽

Vol 91 (8) ◽

pp. 1013-1020 ◽

Cited By ~ 8

Author(s):

David H. Gent ◽

William W. Turechek ◽

Walter F. Mahaffee

Keyword(s):

Powdery Mildew ◽

Binomial Distribution ◽

Disease Incidence ◽

Sequential Sampling ◽

Model Construction ◽

Data Sets ◽

Data Set ◽

Sampling Plans ◽

Simulated Sampling

Sequential sampling models for estimation and classification of the incidence of powdery mildew (caused by Podosphaera macularis) on hop (Humulus lupulus) cones were developed using parameter estimates of the binary power law derived from the analysis of 221 transect data sets (model construction data set) collected from 41 hop yards sampled in Oregon and Washington from 2000 to 2005. Stop lines, models that determine when sufficient information has been collected to estimate mean disease incidence and stop sampling, for sequential estimation were validated by bootstrap simulation using a subset of 21 model construction data sets and simulated sampling of an additional 13 model construction data sets. Achieved coefficient of variation (C) approached the prespecified C as the estimated disease incidence, [Formula: see text], increased, although achieving a C of 0.1 was not possible for data sets in which [Formula: see text] < 0.03 with the number of sampling units evaluated in this study. The 95% confidence interval of the median difference between [Formula: see text] of each yard (achieved by sequential sampling) and the true p of the original data set included 0 for all 21 data sets evaluated at levels of C of 0.1 and 0.2. For sequential classification, operating characteristic (OC) and average sample number (ASN) curves of the sequential sampling plans obtained by bootstrap analysis and simulated sampling were similar to the OC and ASN values determined by Monte Carlo simulation. Correct decisions of whether disease incidence was above or below prespecified thresholds (pt) were made for 84.6 or 100% of the data sets during simulated sampling when stop lines were determined assuming a binomial or beta-binomial distribution of disease incidence, respectively. However, the higher proportion of correct decisions obtained by assuming a beta-binomial distribution of disease incidence required, on average, sampling 3.9 more plants per sampling round to classify disease incidence compared with the binomial distribution. Use of these sequential sampling plans may aid growers in deciding the order in which to harvest hop yards to minimize the risk of a condition called “cone early maturity” caused by late-season infection of cones by P. macularis. Also, sequential sampling could aid in research efforts, such as efficacy trials, where many hop cones are assessed to determine disease incidence.

Download Full-text

An Improved Algorithm for SVMs Classification of Imbalanced Data Sets

Engineering Applications of Neural Networks - Communications in Computer and Information Science ◽

10.1007/978-3-642-03969-0_11 ◽

2009 ◽

pp. 108-118 ◽

Cited By ~ 6

Author(s):

Cristiano Leite Castro ◽

Mateus Araujo Carvalho ◽

Antônio Padua Braga

Keyword(s):

Imbalanced Data ◽

Data Sets ◽

Imbalanced Data Sets ◽

Improved Algorithm

Download Full-text

Decision Tree: A Machine Learning for Intrusion Detection

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.f1234.0486s419 ◽

2019 ◽

Vol 8 (6S4) ◽

pp. 1126-1130

Keyword(s):

Machine Learning ◽

Intrusion Detection ◽

Detection System ◽

Research Work ◽

Machine Learning Techniques ◽

Data Sets ◽

Legitimate User ◽

Learning Techniques ◽

Three Stages

The Intrusion is a major threat to unauthorized data or legal network using the legitimate user identity or any of the back doors and vulnerabilities in the network. IDS mechanisms are developed to detect the intrusions at various levels. The objective of the research work is to improve the Intrusion Detection System performance by applying machine learning techniques based on decision trees for detection and classification of attacks. The methodology adapted will process the datasets in three stages. The experimentation is conducted on KDDCUP99 data sets based on number of features. The Bayesian three modes are analyzed for different sized data sets based upon total number of attacks. The time consumed by the classifier to build the model is analyzed and the accuracy is done.

Download Full-text

ADVANCED CLASSIFICATION OF OPTICAL AND SAR IMAGES FOR URBAN LAND COVER MAPPING

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xliii-b3-2020-1417-2020 ◽

2020 ◽

Vol XLIII-B3-2020 ◽

pp. 1417-1421

Author(s):

D. Amarsaikhan

Keyword(s):

Land Cover ◽

Urban Land ◽

Landsat 8 ◽

Data Sets ◽

Contextual Knowledge ◽

Rule Based ◽

Land Cover Types ◽

Urban Land Cover ◽

Land Cover Map

Abstract. The aim of this research is to classify urban land cover types using an advanced classification method. As the input bands to the classification, the features derived from Landsat 8 and Sentinel 1A SAR data sets are used. To extract the reliable urban land cover information from the optical and SAR features, a rule-based classification algorithm that uses spatial thresholds defined from the contextual knowledge is constructed. The result of the constructed method is compared with the results of a standard classification technique and it indicates a higher accuracy. Overall, the study demonstrates that the multisource data sets can considerably improve the classification of urban land cover types and the rule-based method is a powerful tool to produce a reliable land cover map.

Download Full-text

Distributed training and scalability for the particle clustering method UCluster

EPJ Web of Conferences ◽

10.1051/epjconf/202125102054 ◽

2021 ◽

Vol 251 ◽

pp. 02054

Author(s):

Olga Sunneborn Gudnadottir ◽

Daniel Gedon ◽

Colin Desmarais ◽

Karl Bengtsson Bernander ◽

Raazesh Sainudiin ◽

...

Keyword(s):

Particle Physics ◽

Hadron Collider ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Training Time ◽

Distributed Training ◽

Machine Learning Methods ◽

Multi Class Classification

In recent years, machine-learning methods have become increasingly important for the experiments at the Large Hadron Collider (LHC). They are utilised in everything from trigger systems to reconstruction and data analysis. The recent UCluster method is a general model providing unsupervised clustering of particle physics data, that can be easily modified to provide solutions for a variety of different decision problems. In the current paper, we improve on the UCluster method by adding the option of training the model in a scalable and distributed fashion, and thereby extending its utility to learn from arbitrarily large data sets. UCluster combines a graph-based neural network called ABCnet with a clustering step, using a combined loss function in the training phase. The original code is publicly available in TensorFlow v1.14 and has previously been trained on a single GPU. It shows a clustering accuracy of 81% when applied to the problem of multi-class classification of simulated jet events. Our implementation adds the distributed training functionality by utilising the Horovod distributed training framework, which necessitated a migration of the code to TensorFlow v2. Together with using parquet files for splitting data up between different compute nodes, the distributed training makes the model scalable to any amount of input data, something that will be essential for use with real LHC data sets. We find that the model is well suited for distributed training, with the training time decreasing in direct relation to the number of GPU’s used. However, further improvements by a more exhaustive and possibly distributed hyper-parameter search is required in order to achieve the reported accuracy of the original UCluster method.

Download Full-text