Categorizing Multi-Label Product Questionnaires through SVM Based Click stream

In this paper, Question Categorization (QC) has been studied most primarily in order to understand customers' search intention. In both of these searches, the items in the question list relate to the category label belonging to the taxonomy tree that is being examined. Despite this, search queries about the product usually vary depending on what is vague, and introduce new products over time, seasonal trends and narrow. Traditional supervised approaches to E-Commerce QC are not possible due to the high volume of traffic and high cost for manual annotation in E-Commerce search engines. Here, clickstream data is utilized to determine the effectiveness of a channel's marketplace. So, using the customer's click concept, to collect large-scale question categorization data, this paper uses unsupervised methods that means SVM algorithm is mainly used in this system. Here the data is in the multiclass and multi-label classifier is used to classify them. This paper gets on a large multi-label data set with specific and individual queries from a specific category. In this paper, a comparison of different sophisticated text classifiers is viewed. This paper calculates the micro-F1 scores of top and leaf, which are considered to be a linear SVM-ensemble.

Download Full-text

A Consensus Algorithm for Linear Support Vector Machines

Management Science ◽

10.1287/mnsc.2021.4042 ◽

2021 ◽

Author(s):

Haimonti Dutta

Keyword(s):

Big Data ◽

Large Scale ◽

Descent Method ◽

Consensus Algorithm ◽

Stochastic Gradient Descent ◽

Support Vector ◽

Gradient Descent Method ◽

Scalable Algorithms ◽

Data Set ◽

Svm Algorithm

In the era of big data, an important weapon in a machine learning researcher’s arsenal is a scalable support vector machine (SVM) algorithm. Traditional algorithms for learning SVMs scale superlinearly with the training set size, which becomes infeasible quickly for large data sets. In recent years, scalable algorithms have been designed which study the primal or dual formulations of the problem. These often suggest a way to decompose the problem and facilitate development of distributed algorithms. In this paper, we present a distributed algorithm for learning linear SVMs in the primal form for binary classification called the gossip-based subgradient (GADGET) SVM. The algorithm is designed such that it can be executed locally on sites of a distributed system. Each site processes its local homogeneously partitioned data and learns a primal SVM model; it then gossips with random neighbors about the classifier learnt and uses this information to update the model. To learn the model, the SVM optimization problem is solved using several techniques, including a gradient estimation procedure, stochastic gradient descent method, and several variants including minibatches of varying sizes. Our theoretical results indicate that the rate at which the GADGET SVM algorithm converges to the global optimum at each site is dominated by an [Formula: see text] term, where λ measures the degree of convexity of the function at the site. Empirical results suggest that this anytime algorithm—where the quality of results improve gradually as computation time increases—has performance comparable to its centralized, pseudodistributed, and other state-of-the-art gossip-based SVM solvers. It is at least 1.5 times (often several orders of magnitude) faster than other gossip-based SVM solvers known in literature and has a message complexity of O(d) per iteration, where d represents the number of features of the data set. Finally, a large-scale case study is presented wherein the consensus-based SVM algorithm is used to predict failures of advanced mechanical components in a chocolate manufacturing process using more than one million data points. This paper was accepted by J. George Shanthikumar, big data analytics.

Download Full-text

ProGen:Provenance database generator for large-scale data set

Journal of Computer Applications ◽

10.3724/sp.j.1087.2008.02737 ◽

2009 ◽

Vol 28 (11) ◽

pp. 2737-2740

Author(s):

Xiao ZHANG ◽

Shan WANG ◽

Na LIAN

Keyword(s):

Large Scale ◽

Data Set ◽

Large Scale Data ◽

Scale Data

Download Full-text

Integrative Data Analysis from a Unifying Research Synthesis Perspective

10.1093/oso/9780190676001.003.0020 ◽

2018 ◽

Author(s):

Eun-Young Mun ◽

Anne E. Ray

Keyword(s):

Data Analysis ◽

Large Scale ◽

Research Synthesis ◽

Alcohol Intervention ◽

Data Set ◽

Integrative Data Analysis ◽

Level Data ◽

Model Complex ◽

Wide Range ◽

Individual Participant

Integrative data analysis (IDA) is a promising new approach in psychological research and has been well received in the field of alcohol research. This chapter provides a larger unifying research synthesis framework for IDA. Major advantages of IDA of individual participant-level data include better and more flexible ways to examine subgroups, model complex relationships, deal with methodological and clinical heterogeneity, and examine infrequently occurring behaviors. However, between-study heterogeneity in measures, designs, and samples and systematic study-level missing data are significant barriers to IDA and, more broadly, to large-scale research synthesis. Based on the authors’ experience working on the Project INTEGRATE data set, which combined individual participant-level data from 24 independent college brief alcohol intervention studies, it is also recognized that IDA investigations require a wide range of expertise and considerable resources and that some minimum standards for reporting IDA studies may be needed to improve transparency and quality of evidence.

Download Full-text

Financial distress determinants among SMEs: empirical evidence from Sweden

Journal of Economic Studies ◽

10.1108/jes-01-2019-0030 ◽

2020 ◽

Vol 47 (3) ◽

pp. 547-560 ◽

Cited By ~ 1

Author(s):

Darush Yazdanfar ◽

Peter Öhman

Keyword(s):

Financial Crisis ◽

Financial Distress ◽

Large Scale ◽

Global Financial Crisis ◽

Binary Logistic Regression ◽

Data Availability ◽

Cross Sectional ◽

Data Set ◽

Content Type ◽

The Global Financial Crisis

PurposeThe purpose of this study is to empirically investigate determinants of financial distress among small and medium-sized enterprises (SMEs) during the global financial crisis and post-crisis periods.Design/methodology/approachSeveral statistical methods, including multiple binary logistic regression, were used to analyse a longitudinal cross-sectional panel data set of 3,865 Swedish SMEs operating in five industries over the 2008–2015 period.FindingsThe results suggest that financial distress is influenced by macroeconomic conditions (i.e. the global financial crisis) and, in particular, by various firm-specific characteristics (i.e. performance, financial leverage and financial distress in previous year). However, firm size and industry affiliation have no significant relationship with financial distress.Research limitationsDue to data availability, this study is limited to a sample of Swedish SMEs in five industries covering eight years. Further research could examine the generalizability of these findings by investigating other firms operating in other industries and other countries.Originality/valueThis study is the first to examine determinants of financial distress among SMEs operating in Sweden using data from a large-scale longitudinal cross-sectional database.

Download Full-text

Correlation between the structure and skin permeability of compounds

Scientific Reports ◽

10.1038/s41598-021-89587-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Ruolan Zeng ◽

Jiyong Deng ◽

Limin Dang ◽

Xinliang Yu

Keyword(s):

Large Data ◽

Qsar Model ◽

Coefficient Of Determination ◽

Support Vector ◽

Skin Permeability ◽

Data Set ◽

Test Set ◽

Svm Algorithm ◽

Svm Model ◽

Toxicity Relationship

AbstractA three-descriptor quantitative structure–activity/toxicity relationship (QSAR/QSTR) model was developed for the skin permeability of a sufficiently large data set consisting of 274 compounds, by applying support vector machine (SVM) together with genetic algorithm. The optimal SVM model possesses the coefficient of determination R2 of 0.946 and root mean square (rms) error of 0.253 for the training set of 139 compounds; and a R2 of 0.872 and rms of 0.302 for the test set of 135 compounds. Compared with other models reported in the literature, our SVM model shows better statistical performance in a model that deals with more samples in the test set. Therefore, applying a SVM algorithm to develop a nonlinear QSAR model for skin permeability was achieved.

Download Full-text

Relationship between large-scale ionospheric field-aligned currents and electron/ion precipitations: DMSP observations

Earth Planets and Space ◽

10.1186/s40623-020-01286-z ◽

2020 ◽

Vol 72 (1) ◽

Author(s):

Chao Xiong ◽

Claudia Stolle ◽

Patrick Alken ◽

Jan Rauberg

Keyword(s):

Time Distribution ◽

Large Scale ◽

Particle Flux ◽

Particle Energy ◽

Lower Latitude ◽

Particle Precipitation ◽

Data Set ◽

The Mean ◽

Two Parameters ◽

Meteorological Satellite

Abstract In this study, we have derived field-aligned currents (FACs) from magnetometers onboard the Defense Meteorological Satellite Project (DMSP) satellites. The magnetic latitude versus local time distribution of FACs from DMSP shows comparable dependences with previous findings on the intensity and orientation of interplanetary magnetic field (IMF) By and Bz components, which confirms the reliability of DMSP FAC data set. With simultaneous measurements of precipitating particles from DMSP, we further investigate the relation between large-scale FACs and precipitating particles. Our result shows that precipitation electron and ion fluxes both increase in magnitude and extend to lower latitude for enhanced southward IMF Bz, which is similar to the behavior of FACs. Under weak northward and southward Bz conditions, the locations of the R2 current maxima, at both dusk and dawn sides and in both hemispheres, are found to be close to the maxima of the particle energy fluxes; while for the same IMF conditions, R1 currents are displaced further to the respective particle flux peaks. Largest displacement (about 3.5°) is found between the downward R1 current and ion flux peak at the dawn side. Our results suggest that there exists systematic differences in locations of electron/ion precipitation and large-scale upward/downward FACs. As outlined by the statistical mean of these two parameters, the FAC peaks enclose the particle energy flux peaks in an auroral band at both dusk and dawn sides. Our comparisons also found that particle precipitation at dawn and dusk and in both hemispheres maximizes near the mean R2 current peaks. The particle precipitation flux maxima closer to the R1 current peaks are lower in magnitude. This is opposite to the known feature that R1 currents are on average stronger than R2 currents.

Download Full-text

Galaxy spin direction distribution in HST and SDSS show similar large-scale asymmetry

Publications of the Astronomical Society of Australia ◽

10.1017/pasa.2020.46 ◽

2020 ◽

Vol 37 ◽

Author(s):

Lior Shamir

Keyword(s):

Large Scale ◽

Spiral Galaxies ◽

Hubble Space Telescope ◽

Gravitational Interaction ◽

Large Data ◽

Sloan Digital Sky Survey ◽

Data Sets ◽

Dipole Axis ◽

Data Set ◽

The Asymmetry

Abstract Several recent observations using large data sets of galaxies showed non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to have gravitational interaction. Here, a data set of $\sim8.7\cdot10^3$ spiral galaxies imaged by Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey. The two data sets contain different galaxies at different redshift ranges, and each data set was annotated using a different annotation method. The results show that both data sets show a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to cosine dependence shows a dipole axis with probabilities of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^{\rm o},\delta=47^{\rm o})$ and is well within the $1\sigma$ error range compared to the location of the most likely dipole axis in the SDSS galaxies with $z>0.15$ , identified at $(\alpha=71^{\rm o},\delta=61^{\rm o})$ .

Download Full-text

COVIDSenti: A Large-Scale Benchmark Twitter Data Set for COVID-19 Sentiment Analysis

IEEE Transactions on Computational Social Systems ◽

10.1109/tcss.2021.3051189 ◽

2021 ◽

pp. 1-13

Author(s):

Usman Naseem ◽

Imran Razzak ◽

Matloob Khushi ◽

Peter W. Eklund ◽

Jinman Kim

Keyword(s):

Sentiment Analysis ◽

Large Scale ◽

Data Set ◽

Twitter Data

Download Full-text

Distributed learning with indefinite kernels

Analysis and Applications ◽

10.1142/s021953051850032x ◽

2019 ◽

Vol 17 (06) ◽

pp. 947-975 ◽

Cited By ~ 2

Author(s):

Lei Shi

Keyword(s):

Large Scale ◽

Substantial Reduction ◽

Computation Time ◽

Distributed Learning ◽

Rates Of Convergence ◽

Regression Problem ◽

Data Set ◽

Regularization Scheme ◽

Original Algorithm ◽

Indefinite Kernel

We investigate the distributed learning with coefficient-based regularization scheme under the framework of kernel regression methods. Compared with the classical kernel ridge regression (KRR), the algorithm under consideration does not require the kernel function to be positive semi-definite and hence provides a simple paradigm for designing indefinite kernel methods. The distributed learning approach partitions a massive data set into several disjoint data subsets, and then produces a global estimator by taking an average of the local estimator on each data subset. Easy exercisable partitions and performing algorithm on each subset in parallel lead to a substantial reduction in computation time versus the standard approach of performing the original algorithm on the entire samples. We establish the first mini-max optimal rates of convergence for distributed coefficient-based regularization scheme with indefinite kernels. We thus demonstrate that compared with distributed KRR, the concerned algorithm is more flexible and effective in regression problem for large-scale data sets.

Download Full-text

Visual Analysis of Biomarkers Reveals Differences in Lipid Profiles and Liver Enzymes before and after Gastric Sleeve and Bypass

Obesity Facts ◽

10.1159/000510401 ◽

2021 ◽

pp. 1-11

Author(s):

Marijn Marthe Georgine van Berckel ◽

Saskia L.M. van Loon ◽

Arjen-Kars Boer ◽

Volkher Scharnhorst ◽

Simon W. Nienhuijs

Keyword(s):

Bariatric Surgery ◽

Visual Analytics ◽

Biochemical Markers ◽

Visual Analysis ◽

Large Population ◽

High Volume ◽

Cholesterol Concentration ◽

Health State ◽

Data Set ◽

Before And After

Introduction: Bariatric surgery results in both intentional and unintentional metabolic changes. In a high-volume bariatric center, extensive laboratory panels are used to monitor these changes pre- and postoperatively. Consecutive measurements of relevant biochemical markers allow exploration of the health state of bariatric patients and comparison of different patient groups. Objective: The objective of this study is to compare biomarker distributions over time between 2 common bariatric procedures, i.e., sleeve gastrectomy (SG) and gastric bypass (RYGB), using visual analytics. Methods: Both pre- and postsurgical (6, 12, and 24 months) data of all patients who underwent primary bariatric surgery were collected retrospectively. The distribution and evolution of different biochemical markers were compared before and after surgery using asymmetric beanplots in order to evaluate the effect of primary SG and RYGB. A beanplot is an alternative to the boxplot that allows an easy and thorough visual comparison of univariate data. Results: In total, 1,237 patients (659 SG and 578 RYGB) were included. The sleeve and bypass groups were comparable in terms of age and the prevalence of comorbidities. The mean presurgical BMI and the percentage of males were higher in the sleeve group. The effect of surgery on lowering of glycated hemoglobin was similar for both surgery types. After RYGB surgery, the decrease in the cholesterol concentration was larger than after SG. The enzymatic activity of aspartate aminotransferase, alanine aminotransferase, and alkaline phosphate in sleeve patients was higher presurgically but lower postsurgically compared to bypass values. Conclusions: Beanplots allow intuitive visualization of population distributions. Analysis of this large population-based data set using beanplots suggests comparable efficacies of both types of surgery in reducing diabetes. RYGB surgery reduced dyslipidemia more effectively than SG. The trend toward a larger decrease in liver enzyme activities following SG is a subject for further investigation.

Download Full-text