Improved Parsimonious Topic Modeling Based on the Bayesian Information Criterion

Entropy ◽  
2020 ◽  
Vol 22 (3) ◽  
pp. 326
Author(s):  
Hang Wang ◽  
David Miller

In a previous work, a parsimonious topic model (PTM) was proposed for text corpora. In that work, unlike LDA, the model determined a subset of salient words for each topic, with topic-specific probabilities, while the rest of the words in the dictionary were explained by a universal shared model. Further, in LDA all topics are in principle present in every document. In contrast, PTM gives a sparse topic representation, determining the (small) subset of relevant topics for each document. A customized Bayesian information criterion (BIC) was derived, balancing model complexity and goodness of fit, with the BIC minimized to jointly determine the entire model (the topic-specific words, document-specific topics, all model parameter values, and the total number of topics) in a wholly unsupervised fashion. In the present work, several important modeling and algorithmic (parameter learning) extensions of PTM are proposed. First, we modify the BIC objective function using a lossless coding scheme with low modeling cost for describing words that are non-salient for all topics; such words are essentially identified as wholly noisy/uninformative. This approach increases the PTM's model sparsity, which in turn allows model selection of more topics, at lower BIC cost, than the original PTM. Second, in the original PTM model learning strategy, word switches were updated sequentially, which is myopic and susceptible to finding poor locally optimal solutions. Here, instead, we jointly optimize all the switches that correspond to the same word (across topics). This approach jointly optimizes many more parameters at each step than the original PTM and thus, in principle, should be less susceptible to poor local minima. Results on several document data sets show that our proposed method outperformed the original PTM with respect to multiple performance measures and gave a sparser topic model representation than the original PTM.
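The abstract does not reproduce the customized BIC's coding-cost terms; as a minimal sketch of the trade-off it describes, the toy Python snippet below scores candidate topic counts with the classical BIC, -2 ln L + k ln n, and selects the minimizer. The log-likelihoods and parameter counts are invented for illustration.

```python
import numpy as np

def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Classical BIC: -2 ln L + k ln n (lower is better)."""
    return -2.0 * log_likelihood + n_params * np.log(n_obs)

# Hypothetical fitted log-likelihoods for models with 5..40 topics;
# richer models fit better but pay a larger complexity penalty.
n_obs = 10_000          # e.g., total word tokens in the corpus
candidates = {
    5:  (-61200.0,  5 * 120),   # (log-likelihood, free parameters)
    10: (-59800.0, 10 * 120),
    20: (-59100.0, 20 * 120),
    40: (-58900.0, 40 * 120),
}

scores = {k: bic(ll, p, n_obs) for k, (ll, p) in candidates.items()}
best = min(scores, key=scores.get)
print(f"selected number of topics: {best}")
```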

Author(s):  
Barinaadaa John Nwikpe

A new single-parameter probability distribution, named the Tornumonkpe distribution, is derived in this paper. The new model is a blend of gamma(2, θ) and gamma(3, θ) distributions. The shape of its density for different values of the parameter is shown. Mathematical expressions are given for the moment generating function, the first three raw moments, the second and third moments about the mean, the distribution of order statistics, the coefficient of variation, and the coefficient of skewness. The parameter of the new distribution is estimated by the method of maximum likelihood. The goodness of fit of the Tornumonkpe distribution is established by fitting it to three real-life data sets. Using -2lnL, the Bayesian information criterion (BIC), and the Akaike information criterion (AIC) as criteria for selecting the best-fitting model, it is shown that the new distribution outperforms the one-parameter exponential, Shanker, and Amarendra distributions for the data sets used.
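The paper's exact density is not reproduced in the abstract; in particular, the mixing weight of the gamma(2, θ) and gamma(3, θ) components is not stated. The sketch below therefore assumes an illustrative fixed weight W (a hypothetical stand-in, not the Tornumonkpe weight) and fits the single parameter θ by maximum likelihood with SciPy, reporting AIC and BIC as in the paper's comparisons.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

# Hypothetical stand-in: the true Tornumonkpe mixing weight is not given
# in the abstract, so we assume an illustrative fixed weight W.
W = 0.5

def pdf(x, theta):
    """Assumed blend of gamma(shape=2) and gamma(shape=3) densities,
    both with rate theta (scale = 1/theta)."""
    g2 = stats.gamma.pdf(x, a=2, scale=1.0 / theta)
    g3 = stats.gamma.pdf(x, a=3, scale=1.0 / theta)
    return W * g2 + (1.0 - W) * g3

def neg_log_lik(theta, data):
    return -np.sum(np.log(pdf(data, theta)))

rng = np.random.default_rng(0)
data = rng.gamma(shape=2.5, scale=0.8, size=200)   # toy positive data

res = minimize_scalar(neg_log_lik, args=(data,),
                      bounds=(1e-6, 50.0), method="bounded")
theta_hat = res.x
k = 1                                              # one free parameter
nll = neg_log_lik(theta_hat, data)
aic = 2 * k + 2 * nll
bic = k * np.log(len(data)) + 2 * nll
print(f"theta_hat={theta_hat:.4f}  AIC={aic:.2f}  BIC={bic:.2f}")
```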


Author(s):  
Jiawei Chen ◽  
Yinghui (Catherine) Yang ◽  
Hongyan Liu

In recent years, more and more platforms have emerged on which both buyers and sellers can write reviews for each other. These bilateral reviews are important information sources in the decision-making processes of both buyers and sellers. In this study, we develop a comprehensive relational topic modeling approach that analyzes bilateral reviews for better online transaction prediction. The prediction results enable a platform to increase the chance that a buyer and seller reach a transaction by presenting buyers with offerings that are more likely to lead to one. Within the framework of the relational topic model, we embed a topic structure with both shared and corpus-specific topics to better handle text corpora generated from different sources. Our model facilitates the extraction, from different document collections, of the appropriate topic structure, which helps enhance transaction prediction performance. Comprehensive experiments conducted on real-world data sets collected from sharing-economy platforms demonstrate that our new model significantly outperforms other alternatives. The robust results obtained from multiple sets of comparisons demonstrate the value of bilateral reviews when they are processed properly. Our approach can be applied to many platforms where bilateral reviews are available.
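The model's full generative specification is beyond the abstract; as a rough sketch of the shared-plus-corpus-specific topic structure it describes, the toy generator below draws each document's topic mixture over a pool containing topics shared across the two review corpora and topics private to one corpus. All sizes and names are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
V = 50                       # vocabulary size
K_shared, K_private = 3, 2   # shared topics vs. corpus-specific topics

# Topic-word distributions: one shared block plus a private block per corpus.
shared = rng.dirichlet(np.ones(V), size=K_shared)
private = {c: rng.dirichlet(np.ones(V), size=K_private)
           for c in ("buyer_reviews", "seller_reviews")}

def generate_doc(corpus: str, n_words: int = 40) -> np.ndarray:
    """Draw a toy document from the shared + corpus-specific topic pool."""
    topics = np.vstack([shared, private[corpus]])
    theta = rng.dirichlet(np.ones(len(topics)))   # document-topic mixture
    z = rng.choice(len(topics), size=n_words, p=theta)
    return np.array([rng.choice(V, p=topics[t]) for t in z])

doc = generate_doc("buyer_reviews")
print(doc[:10])   # word ids of the first ten tokens
```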


2003 ◽  
Vol 15 (7) ◽  
pp. 1691-1714 ◽  
Author(s):  
Vladimir Cherkassky ◽  
Yunqian Ma

We discuss empirical comparison of analytical methods for model selection. Currently, there is no consensus on the best method for finite-sample estimation problems, even for the simple case of linear estimators. This article presents empirical comparisons between classical statistical methods (the Akaike information criterion (AIC) and the Bayesian information criterion (BIC)) and the structural risk minimization (SRM) method, based on Vapnik-Chervonenkis (VC) theory, for regression problems. Our study is motivated by empirical comparisons in Hastie, Tibshirani, and Friedman (2001), which claim that the SRM method performs poorly for model selection and suggest that AIC yields superior predictive performance. Hence, we present empirical comparisons for various data sets and different types of estimators (linear, subset selection, and k-nearest neighbor regression). Our results demonstrate the practical advantages of VC-based model selection: it consistently outperforms AIC for all data sets. In our study, the SRM and BIC methods show similar predictive performance. The discrepancy between our results and those of Hastie et al. (2001), obtained using the same data, is caused by methodological drawbacks in that work, especially a loose interpretation and application of the SRM method. Hence, we discuss methodological issues important for meaningful comparisons and for practical application of the SRM method. We also point out the importance of accurate estimation of model complexity (VC dimension) for empirical comparisons, and propose a new practical estimate of model complexity for k-nearest neighbor regression.
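As a concrete reminder of the two classical criteria being compared (this is not the authors' SRM implementation), the sketch below selects a polynomial degree for a noisy regression problem by AIC and BIC under a Gaussian noise model; BIC's ln(n) penalty typically selects a smaller model than AIC.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 60
x = np.linspace(-1, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)   # toy target

def fit_poly(deg):
    """Fit a degree-`deg` polynomial and return (AIC, BIC)."""
    coef = np.polyfit(x, y, deg)
    resid = y - np.polyval(coef, x)
    sigma2 = np.mean(resid ** 2)                    # MLE noise variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = deg + 2                                     # coefficients + variance
    return 2 * k - 2 * log_lik, k * np.log(n) - 2 * log_lik

for deg in range(1, 10):
    aic, bic = fit_poly(deg)
    print(f"degree {deg}: AIC={aic:8.2f}  BIC={bic:8.2f}")
```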


2018 ◽  
Vol 45 (5) ◽  
pp. 351-365 ◽  
Author(s):  
Ismaila Ba ◽  
Fahim Ashkar

We recommend methods for discriminating between some three-parameter distributions used in hydro-meteorological frequency modeling. Discrimination is between pairs of models from the group comprising the generalized extreme value (GEV), Pearson type III (P3), and generalized logistic (GLO) distributions. To assess the fit of these distributions to data, the Akaike information criterion, the Bayesian information criterion, and (or) goodness-of-fit measures are commonly employed. However, it is difficult to estimate the discrimination power and bias of these methods when they are used with three-parameter distributions. Consequently, we propose two alternative tools and assess their performance. Both tools are based on a sample transformation to normality followed by the application of a powerful statistic for testing normality, such as the Shapiro-Wilk or the probability plot correlation coefficient statistic. While arriving at recommendations for discriminating between the (GEV, GLO) and (P3, GLO) pairs of models, we show that the discrimination power between the P3 and GEV distributions can be rather low.
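The authors' specific transformation to normality is not detailed in the abstract; the sketch below illustrates the general two-step recipe with an assumed probability-integral transform through a fitted candidate CDF (here a GEV fitted by SciPy), followed by the Shapiro-Wilk statistic. Fitting before testing distorts the nominal null distribution, which is part of why tailored tools are needed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = stats.genextreme.rvs(c=-0.1, size=100, random_state=rng)  # toy GEV data

# Step 1: transform to (approximate) normality via the fitted candidate
# model's CDF followed by the standard normal quantile function.
shape, loc, scale = stats.genextreme.fit(sample)
u = stats.genextreme.cdf(sample, shape, loc=loc, scale=scale)
z = stats.norm.ppf(np.clip(u, 1e-10, 1 - 1e-10))

# Step 2: test normality of the transformed sample with Shapiro-Wilk.
stat, p_value = stats.shapiro(z)
print(f"Shapiro-Wilk W={stat:.4f}, p={p_value:.4f}")
# A small p-value would cast doubt on the candidate distribution.
```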


2021 ◽  
Vol 9 (2) ◽  
pp. 404-409
Author(s):  
K Prashant Gokul, et al.

Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model learning have been based on a maximum likelihood objective. Efficient algorithms exist that attempt to approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but these algorithms are not practical because they are inefficient and not robust to violations of model assumptions. In this work, we propose to combine statistical topic modeling with pattern mining techniques to produce pattern-based topic models that enhance the semantic representations of conventional word-based topic models. Using the proposed pattern-based topic model, users' preferences can be modeled with multiple topics, each of which is represented by semantically rich patterns. A novel information filtering model is proposed here, in which user information needs are expressed in terms of multiple topics, each represented by patterns. The algorithm produces results comparable to the best implementations while running orders of magnitude faster.
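The abstract gives no concrete algorithm; a minimal sketch of the pattern-based representation it describes might store, for each topic, a set of weighted word patterns and rank topics for a document by weighted pattern coverage. All patterns and weights below are invented for illustration.

```python
# Minimal sketch (not the paper's algorithm): topics represented by
# frequent word patterns; topics ranked for a document by the total
# weight of patterns fully contained in it.
topics = {
    "camera": [({"battery", "life"}, 0.9), ({"lens", "zoom"}, 0.7)],
    "shipping": [({"arrived", "late"}, 0.8), ({"fast", "delivery"}, 0.6)],
}

def score(doc_words: set, patterns) -> float:
    """Sum the weights of patterns that are subsets of the document."""
    return sum(w for pat, w in patterns if pat <= doc_words)

doc = {"the", "battery", "life", "is", "great", "fast", "delivery"}
ranking = sorted(topics, key=lambda t: score(doc, topics[t]), reverse=True)
print(ranking)   # topics most relevant to this document first
```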


Econometrics ◽  
2021 ◽  
Vol 9 (1) ◽  
pp. 10
Author(s):  
Šárka Hudecová ◽  
Marie Hušková ◽  
Simos G. Meintanis

This article considers goodness-of-fit tests for bivariate INAR and bivariate Poisson autoregression models. The test statistics are based on an L2-type distance between two estimators of the probability generating function of the observations: one entirely nonparametric and the other semiparametric, computed under the corresponding null hypothesis. The asymptotic distribution of the proposed test statistics is derived both under the null hypotheses and under alternatives, and consistency is proved. The case of testing bivariate generalized Poisson autoregression, and the extension of the methods to dimensions higher than two, are also discussed. The finite-sample performance of a parametric bootstrap version of the tests is illustrated via a series of Monte Carlo experiments. The article concludes with applications to real data sets and a discussion.
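The bivariate construction is beyond the abstract, but in one dimension the nonparametric ingredient is the empirical probability generating function g_hat(u) = (1/n) * sum_i u**X_i. The sketch below computes a toy L2-type statistic against a semiparametric Poisson PGF, exp(lambda_hat * (u - 1)), on a grid; in practice the critical value would come from the parametric bootstrap the authors study.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.poisson(2.0, size=200)                  # toy count data

u = np.linspace(0.0, 1.0, 101)                  # evaluation grid on [0, 1]

# Nonparametric estimate: g_hat(u) = (1/n) * sum_i u ** X_i.
g_emp = np.array([np.mean(np.float64(ui) ** x) for ui in u])

# Semiparametric estimate under a Poisson null: exp(lambda_hat * (u - 1)).
lam = x.mean()
g_null = np.exp(lam * (u - 1.0))

# L2-type statistic: n times the integrated squared difference (Riemann sum).
stat = len(x) * np.sum((g_emp - g_null) ** 2) * (u[1] - u[0])
print(f"L2-type statistic: {stat:.5f}")
```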


2021 ◽  
Vol 5 (1) ◽  
pp. 10
Author(s):  
Mark Levene

A bootstrap-based hypothesis test of the goodness-of-fit for the marginal distribution of a time series is presented. Two metrics, the empirical survival Jensen–Shannon divergence (ESJS) and the Kolmogorov–Smirnov two-sample test statistic (KS2), are compared on four data sets—three stablecoin time series and a Bitcoin time series. We demonstrate that, after applying first-order differencing, all the data sets fit heavy-tailed α-stable distributions with 1<α<2 at the 95% confidence level. Moreover, ESJS is more powerful than KS2 on these data sets, since the widths of the derived confidence intervals for KS2 are, proportionately, much larger than those of ESJS.
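The ESJS definition is not repeated in the abstract; the sketch below assumes a Jensen-Shannon-style divergence applied to the two empirical survival curves (an assumed form, not the paper's verbatim definition) and compares it with SciPy's two-sample Kolmogorov-Smirnov statistic on toy heavy-tailed samples.

```python
import numpy as np
from scipy import stats

def surv(sample, grid):
    """Empirical survival function evaluated on a grid."""
    return 1.0 - np.searchsorted(np.sort(sample), grid, side="right") / len(sample)

def esjs(a, b, n_grid=512):
    """Assumed ESJS form: JS-style divergence of the two survival curves,
    integrated over a common grid."""
    grid = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), n_grid)
    sa, sb = surv(a, grid), surv(b, grid)
    m = 0.5 * (sa + sb)
    eps = 1e-12                                  # avoid log(0)
    kl_a = sa * np.log((sa + eps) / (m + eps))
    kl_b = sb * np.log((sb + eps) / (m + eps))
    dx = grid[1] - grid[0]
    return 0.5 * np.sum(kl_a + kl_b) * dx

rng = np.random.default_rng(11)
a = rng.standard_t(df=3, size=500)               # toy heavy-tailed samples
b = rng.standard_t(df=3, size=500)

print("ESJS :", esjs(a, b))
print("KS2  :", stats.ks_2samp(a, b).statistic)
```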


2014 ◽  
Vol 2014 ◽  
pp. 1-7 ◽  
Author(s):  
Dinesh Verma ◽  
Shishir Kumar

Nowadays, software developers face challenges in minimizing the number of defects introduced during software development. Using the defect density parameter, developers can identify opportunities for improvement in the product. Since the total number of defects depends on module size, the optimal module size must be calculated to minimize defect density. In this paper, an improved model is formulated that captures the relationship between defect density and variable module size. This relationship can be used to optimize overall defect density through an effective distribution of module sizes. Three available data sets related to this problem have been examined with the proposed model, taking distinct values of the variables and parameters subject to certain constraints on the parameters. A curve-fitting method is used to obtain the module size with minimum defect density. Goodness-of-fit measures have been computed to validate the proposed model on the data sets. Defect density can be optimized by an effective distribution of module sizes: larger modules can be broken into smaller ones, and smaller modules can be merged, to minimize overall defect density.
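The paper's fitted functional form is not given in the abstract; a common assumption for such U-shaped behavior is D(s) = a/s + b + c*s over module size s, where a/s captures fixed per-module overhead and c*s the growing complexity of large modules. SciPy's curve_fit can fit it, and the minimizing size is sqrt(a/c), as sketched below with invented data.

```python
import numpy as np
from scipy.optimize import curve_fit

def defect_density(s, a, b, c):
    """Assumed U-shaped model: overhead a/s, baseline b, growth c*s."""
    return a / s + b + c * s

# Toy (module size, defects per KLOC) observations standing in for real data.
size = np.array([20, 50, 100, 200, 400, 800, 1600], dtype=float)
density = np.array([9.8, 5.1, 3.6, 3.1, 3.4, 4.6, 7.2])

(a, b, c), _ = curve_fit(defect_density, size, density, p0=(100.0, 1.0, 0.01))

s_opt = np.sqrt(a / c)   # minimizer of a/s + b + c*s (set derivative to zero)
print(f"fitted a={a:.2f}, b={b:.2f}, c={c:.4f}; optimal size approx {s_opt:.0f}")
```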

