Model Selection for Non-Negative Tensor Factorization with Minimum Description Length

Entropy ◽  
2019 ◽  
Vol 21 (7) ◽  
pp. 632
Author(s):  
Yunhui Fu ◽  
Shin Matsushima ◽  
Kenji Yamanishi

Non-negative tensor factorization (NTF) is a widely used multi-way analysis approach that factorizes a high-order non-negative data tensor into several non-negative factor matrices. In NTF, the non-negative rank has to be predetermined to specify the model and it greatly influences the factorized matrices. However, its value is conventionally determined by specialists’ insights or trial and error. This paper proposes a novel rank selection criterion for NTF on the basis of the minimum description length (MDL) principle. Our methodology is unique in that (1) we apply the MDL principle on tensor slices to overcome a problem caused by the imbalance between the number of elements in a data tensor and that in factor matrices, and (2) we employ the normalized maximum likelihood (NML) code-length for histogram densities. We employ synthetic and real data to empirically demonstrate that our method outperforms other criteria in terms of accuracies for estimating true ranks and for completing missing values. We further show that our method can produce ranks suitable for knowledge discovery.
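To make concrete what the rank controls, here is a minimal sketch that rebuilds a three-way tensor from non-negative factor matrices. It assumes the CP (sum of rank-one terms) form of NTF, which the abstract does not spell out, and it does not implement the MDL criterion itself. The function name `cp_reconstruct` and the example matrices are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]
Tensor3 = List[List[List[float]]]


def cp_reconstruct(a: Matrix, b: Matrix, c: Matrix) -> Tensor3:
    """Rebuild a 3-way tensor from three factor matrices.

    Each matrix has one row per index of its mode and one column per
    component; the shared number of columns is the rank of the model.
    """
    rank = len(a[0])
    # All factor matrices must have the same number of columns (the rank).
    assert all(len(row) == rank for m in (a, b, c) for row in m)
    return [
        [
            [sum(a[i][r] * b[j][r] * c[k][r] for r in range(rank))
             for k in range(len(c))]
            for j in range(len(b))
        ]
        for i in range(len(a))
    ]


# Hypothetical rank-2 factors for a 2 x 2 x 2 tensor (all entries non-negative).
A = [[1.0, 0.0], [0.0, 2.0]]
B = [[1.0, 1.0], [0.0, 3.0]]
C = [[2.0, 0.0], [1.0, 1.0]]
tensor = cp_reconstruct(A, B, C)
```

A rank selection criterion, such as the one the paper proposes, decides how many columns these factor matrices should have; too few columns underfit the data tensor and too many fit noise.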


Author(s):  
Mehrnaz Najafi ◽  
Lifang He ◽  
Philip S. Yu

With the increasing popularity of streaming tensor data such as video and audio, tensor factorization and completion have attracted much attention recently in this area. Existing work usually assumes that streaming tensors only grow in one mode. However, in many real-world scenarios, tensors may grow in multiple modes (or dimensions), i.e., multi-aspect streaming tensors. Standard streaming methods cannot handle this type of data directly. Moreover, due to inevitable system errors, data may be contaminated by outliers, which cause significant deviations from real data values and make such research particularly challenging. In this paper, we propose a novel method for Outlier-Robust Multi-Aspect Streaming Tensor Completion and Factorization (OR-MSTC), a technique capable of dealing with missing values and outliers in multi-aspect streaming tensor data. The key idea is to decompose the tensor structure into an underlying low-rank clean tensor and a structured-sparse error (outlier) tensor, along with a weighting tensor to mask missing data. We also develop an efficient algorithm to solve the non-convex and non-smooth optimization problem of OR-MSTC. Experimental results on various real-world datasets show the superiority of the proposed method over the baselines and its robustness against outliers.



2017 ◽  
Vol 11 (1) ◽  
pp. 2-15 ◽  
Author(s):  
René Michel ◽  
Igor Schnakenburg ◽  
Tobias von Martens

Purpose: This paper aims to address the effective selection of customers for direct marketing campaigns. It introduces a new method to forecast campaign-related uplifts (also known as incremental response modeling or net scoring). By means of these uplifts, only the most responsive customers are targeted by a campaign. This paper also aims to calculate the financial impact of the new approach compared to the classical (gross) scoring methods.
Design/methodology/approach: First, gross and net scoring approaches to customer selection for direct marketing campaigns are compared. After that, it is shown how net scoring can be applied in practice with regard to different strategic objectives. Then, a new statistic for net scoring based on decision trees is developed. Finally, a business case based on real data from the financial sector is calculated to compare gross and net scoring approaches.
Findings: Whereas gross scoring focuses on customers with a high probability of purchase, regardless of whether they are targeted by a campaign, net scoring identifies those customers who are most responsive to campaigns. A common scoring procedure, decision trees, can be enhanced by the new statistic to forecast those campaign-related uplifts. The business case shows that the selected scoring method has a relevant impact on economic indicators.
Practical implications: The contribution of net scoring to campaign effectiveness and efficiency is shown by the business case. Furthermore, this paper suggests a framework for customer selection, given strategic objectives, e.g. minimizing costs or maximizing (gross or lift)-added value, and presents a new statistic that can be applied to common scoring procedures.
Originality/value: Despite its leverage on the effectiveness of marketing campaigns, only a few contributions have addressed net scores so far. The new χ2-statistic is a straightforward approach to the enhancement of decision trees for net scoring. Furthermore, this paper is the first to address the application of net scoring with regard to different strategic objectives.
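The difference between gross and net scoring can be illustrated with a small sketch. It computes, per customer segment, the response rate among targeted customers (gross) and the difference between targeted and non-targeted response rates (net, or uplift). The segments, records and function names are hypothetical, and this is not the paper's χ2-based decision tree statistic, only the basic uplift quantity it forecasts.

```python
from typing import Dict, List, Tuple

# Each record is (segment, was_targeted, purchased). Illustrative data only.
Record = Tuple[str, bool, bool]


def response_rate(records: List[Record], targeted: bool) -> float:
    """Share of purchasers among the targeted (or the non-targeted) records."""
    group = [r for r in records if r[1] == targeted]
    return sum(1 for r in group if r[2]) / len(group) if group else 0.0


def scores_by_segment(records: List[Record]) -> Dict[str, Tuple[float, float]]:
    """Return {segment: (gross score, net score)}."""
    out: Dict[str, Tuple[float, float]] = {}
    for segment in sorted({r[0] for r in records}):
        seg = [r for r in records if r[0] == segment]
        gross = response_rate(seg, True)
        net = gross - response_rate(seg, False)  # uplift caused by the campaign
        out[segment] = (gross, net)
    return out


records: List[Record] = (
    [("loyal", True, True)] * 3 + [("loyal", True, False)]
    + [("loyal", False, True)] * 3 + [("loyal", False, False)]
    + [("persuadable", True, True)] * 2 + [("persuadable", True, False)] * 2
    + [("persuadable", False, False)] * 4
)
scores = scores_by_segment(records)
```

In this toy data the "loyal" segment has the higher gross score (0.75 against 0.5) but zero uplift, because its members buy at the same rate without the campaign, while the "persuadable" segment has an uplift of 0.5. Gross scoring would target the first segment and net scoring the second.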



2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Mar Rodríguez-Girondo ◽  
Niels van den Berg ◽  
Michel H. Hof ◽  
Marian Beekman ◽  
Eline Slagboom

Background: Although human longevity tends to cluster within families, genetic studies on longevity have had limited success in identifying longevity loci. One of the main causes of this limited success is the selection of participants. Studies generally include sporadically long-lived individuals, i.e. individuals with the longevity phenotype but without a genetic predisposition for longevity. The inclusion of these individuals causes phenotype heterogeneity, which results in power reduction and bias. A way to avoid sporadically long-lived individuals and reduce sample heterogeneity is to include family history of longevity as a selection criterion using a longevity family score. A main challenge when developing family scores is the large differences in family size, which arise from real differences in sibship sizes or from missing data.
Methods: We discussed the statistical properties of two existing longevity family scores, the Family Longevity Selection Score (FLoSS) and the Longevity Relatives Count (LRC) score, and we evaluated their performance in dealing with differential family size. We proposed a new longevity family score, the mLRC score, an extension of the LRC based on random effects modeling, which is robust to family size and missing values. The performance of the new mLRC as a selection tool was evaluated in an intensive simulation study and illustrated in a large real dataset, the Historical Sample of the Netherlands (HSN).
Results: Empirical scores such as the FLoSS and LRC cannot properly deal with differential family size and missing data. Our simulation study showed that mLRC is not affected by family size and provides more accurate selections of long-lived families. The analysis of 1105 sibships of the Historical Sample of the Netherlands showed that the selection of long-lived individuals based on the mLRC score predicts excess survival in the validation set better than the selection based on the LRC score.
Conclusions: Model-based score systems such as the mLRC score help to reduce heterogeneity in the selection of long-lived families. The power of future studies into the genetics of longevity can likely be improved, and their bias reduced, by selecting long-lived cases using the mLRC.



2021 ◽  
Author(s):  
Rosa F Ropero ◽  
M Julia Flores ◽  
Rafael Rumí

Environmental data often contain missing values or lack information, which makes modelling tasks difficult. Under the framework of the SAICMA Research Project, a flood risk management system is modelled for the Andalusian Mediterranean catchment using information from the Andalusian Hydrological System. Hourly data were collected from October 2011 to September 2020 and present two issues:
- In the Guadarranque River, there are no data for the dam level variable from May to August 2020, probably because of sensor damage.
- No information about river level is collected in the lower part of the Guadiaro River, which makes it difficult to estimate flood risk in the coastal area.
To avoid removing the dam variable (or those missing months) from the entire model, or even abandoning the modelling of one river system, this abstract aims to provide modelling solutions based on Bayesian networks (BNs) that overcome these limitations.
Guadarranque River. Missing values.
The dataset contains 75687 observations for 6 continuous variables. BN regression models based on fixed structures (Naïve Bayes, NB, and Tree Augmented Naïve Bayes, TAN) were learnt using the complete dataset (until September 2019) with the aim of predicting the dam level variable as accurately as possible. A scenario was run with data from October 2019 to March 2020, and the predictions for the target variable were compared with the real data. Results show that both NB (RMSE: 6.29) and TAN (RMSE: 5.74) are able to predict the behaviour of the target variable.
In addition, a BN based on expert’s structural learning was learnt with the real data and with both datasets with values imputed by NB and TAN. Results show that models learnt with imputed data (NB: 3.33; TAN: 3.07) improve on the error rate of the model learnt with real data (4.26).
Guadiaro River. Lack of information.
The dataset contains 73636 observations with 14 continuous variables. Since rainfall variables present a high percentage of zero values (over 94%), they were discretised using the equal-frequency method with 4 intervals. The aim is to predict flooding risk in the coastal area, but no data are collected from this area. Thus, an unsupervised classification based on hybrid BNs was performed. Here, the target variable classifies all observations into a set of homogeneous groups and gives, for each observation, the probability of belonging to each group. Results show a total of 3 groups:
- Group 0, “Normal situation”: rainfall values equal to 0 and a very low mean river level.
- Group 1, “Storm situation”: mean rainfall values are over 0.3 mm, and the means of all river level variables are double those of group 0.
- Group 2, “Extreme situation”: both rainfall and river level means are the highest, far above those of the two previous groups.
Although validation shows that this methodology is able to identify extreme events, further work is needed. To this end, data from the autumn-winter season (October 2020 to March 2021) will be used. With this new information, it will be possible to check whether the latest extreme events (the flooding event in December and Storm Filomena in January) are identified.
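The equal-frequency discretisation mentioned for the rainfall variables can be sketched as follows. This is a generic illustration under assumed data, not the authors' implementation; the helper names `equal_frequency_edges` and `assign_bin` and the sample values are hypothetical.

```python
from typing import List


def equal_frequency_edges(values: List[float], bins: int) -> List[float]:
    """Return the bins - 1 interior cut points that split the sorted values
    into groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * b) // bins] for b in range(1, bins)]


def assign_bin(value: float, edges: List[float]) -> int:
    """Index of the interval a value falls into (0 is the lowest interval)."""
    return sum(1 for edge in edges if value >= edge)


# Hypothetical hourly rainfall readings in mm.
rainfall = [0.0, 0.1, 0.2, 0.4, 0.8, 1.5, 3.0, 6.0]
edges = equal_frequency_edges(rainfall, bins=4)
labels = [assign_bin(v, edges) for v in rainfall]
```

With these eight distinct values each of the four intervals receives two observations. When a variable is dominated by a single repeated value, as with rainfall series that are mostly zero, several cut points can coincide and some intervals end up empty; how the authors handled such ties is not stated in the abstract.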



Biometrika ◽  
2016 ◽  
Vol 103 (1) ◽  
pp. 175-187 ◽  
Author(s):  
Jun Shao ◽  
Lei Wang

Abstract To estimate unknown population parameters based on data having nonignorable missing values with a semiparametric exponential tilting propensity, Kim & Yu (2011) assumed that the tilting parameter is known or can be estimated from external data, in order to avoid the identifiability issue. To remove this serious limitation on the methodology, we use an instrument, i.e., a covariate related to the study variable but unrelated to the missing data propensity, to construct some estimating equations. Because these estimating equations are semiparametric, we profile the nonparametric component using a kernel-type estimator and then estimate the tilting parameter based on the profiled estimating equations and the generalized method of moments. Once the tilting parameter is estimated, so is the propensity, and then other population parameters can be estimated using the inverse propensity weighting approach. Consistency and asymptotic normality of the proposed estimators are established. The finite-sample performance of the estimators is studied through simulation, and a real-data example is also presented.
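The final step described above, estimating a population mean by inverse propensity weighting once the propensity is available, can be sketched as below. The sketch takes the response propensities as given; it does not implement the exponential tilting model, the kernel-based profiling or the generalized method of moments step, and the normalized (ratio) form of the estimator is an assumption made here for illustration. All names and data are hypothetical.

```python
from typing import List, Optional


def ipw_mean(y: List[Optional[float]], propensity: List[float]) -> float:
    """Normalized inverse-propensity-weighted mean of the observed values.

    y[i] is None when the study variable is missing; propensity[i] is the
    (estimated) probability that unit i responds.
    """
    numerator = 0.0
    denominator = 0.0
    for value, p in zip(y, propensity):
        if value is not None:          # only respondents contribute
            numerator += value / p     # each stands in for 1 / p units
            denominator += 1.0 / p
    return numerator / denominator


# Hypothetical sample: the last unit did not respond.
y = [2.0, 4.0, 6.0, None]
propensity = [0.5, 0.5, 1.0, 0.5]
estimate = ipw_mean(y, propensity)
naive = sum(v for v in y if v is not None) / 3
```

Here the naive mean of the respondents is 4.0, while the weighted estimate is 3.6, because the units with a low response probability are weighted up to represent similar units that did not respond.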



2020 ◽  
Vol 18 (2) ◽  
pp. 58-73
Author(s):  
Halimin Herjanto ◽  
Alexandra Chilicki ◽  
Chidchanok Anantamongkolkul ◽  
Erin McGuinness ◽  
...  

Consumers commonly use online e-reviews to obtain information and guidance. E-reviews have therefore become an important barometer for evaluating products and, more importantly, for making purchasing decisions, including decisions about hotel selection. For hospitality industry marketers, the information in e-reviews is particularly important for interpreting and understanding consumer-specific needs. The current study brings valuable awareness to the limited academic research into hotel selection criteria among solo-traveling females. TripAdvisor’s top 25 list of cost-efficient hotels worldwide received 345 total consumer reviews. Noteworthy findings of the current study show unique selection criteria considered important to the solo-traveling female, including a hotel-provided cell phone programmed with local emergency contact information and a nearby or on-premises automated teller machine. Study results also indicate that stewardship service, such as intimate and personalized service from hotel staff who “go the extra mile”, is an important selection criterion. Research limitations and implications are also discussed.



Author(s):  
A. M. M. Al-Naggar ◽  
R. M. Abd El-Salam ◽  
M. R. A. Hovny ◽  
Walaa Y. S. Yaseen

Information on heritability and trait association in crops assists breeders in allocating the resources necessary to select effectively for desired traits and to achieve maximum genetic gain with little time and resources. The objectives of this investigation were to determine the amount of genetic variability, heritability, genetic advance and strength of association of yield-related traits among sorghum lines under different environments in Egypt. Twenty-five sorghum B-lines were evaluated in six environments, comprising two locations in Egypt (Giza and Shandaweel) in two years and two planting dates in one location (Giza). A randomized complete block design with three replications was used in each environment. Significant variation was observed among sorghum lines for all studied traits in all environments. Across environments, grain yield/plant (GYPP) showed positive and significant correlations with number of grains/plant (r = 0.71), days to flowering (r = 0.47), 1000-grain weight (r = 0.16) and plant height (PH) (r = 0.19). In general, the estimates of the phenotypic coefficient of variation (PCV) were higher than those of the genotypic coefficient of variation (GCV). Combined across the six environments, the highest PCV and GCV were shown by PH (95.14 and 43.57%, respectively), followed by GYPP (36.42 and 30.78%, respectively), indicating that selection for high values of these traits would be effective. GYPP and PH showed high heritability associated with high genetic advance from selection, indicating good prospects for improving these traits via selection procedures. The study concluded that PH is a good selection criterion for GYPP and that selection for tall sorghum plants would therefore increase grain yield.
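For readers unfamiliar with the quantities named in this abstract, the sketch below computes them from variance components using the definitions commonly found in plant breeding texts. Whether the authors used exactly these formulas is an assumption; the variance components and trait mean are made-up numbers, and `k = 2.06` is the usual selection differential for a 5% selection intensity.

```python
import math


def coefficient_of_variation(variance: float, mean: float) -> float:
    """PCV or GCV in percent, depending on which variance is passed in."""
    return 100.0 * math.sqrt(variance) / mean


def broad_sense_heritability(var_genotypic: float, var_phenotypic: float) -> float:
    """Share of phenotypic variance attributable to genotype."""
    return var_genotypic / var_phenotypic


def genetic_advance(var_genotypic: float, var_phenotypic: float,
                    k: float = 2.06) -> float:
    """Expected gain from selection, in the units of the trait."""
    h2 = broad_sense_heritability(var_genotypic, var_phenotypic)
    return k * h2 * math.sqrt(var_phenotypic)


# Hypothetical variance components for one trait.
var_g, var_p, trait_mean = 16.0, 25.0, 50.0
gcv = coefficient_of_variation(var_g, trait_mean)
pcv = coefficient_of_variation(var_p, trait_mean)
h2 = broad_sense_heritability(var_g, var_p)
ga = genetic_advance(var_g, var_p)
```

Because phenotypic variance is the genotypic variance plus environmental and error components, PCV is at least as large as GCV under these definitions, which matches the pattern reported in the abstract.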



Author(s):  
Miroslav Hudec ◽  
Miljan Vučetić ◽  
Mirko Vujošević

Data mining methods based on fuzzy logic have been developed recently and have become an increasingly important research area. In this chapter, the authors examine possibilities for discovering potentially useful knowledge from relational databases by integrating fuzzy functional dependencies and linguistic summaries. Both methods use fuzzy logic tools for data analysis and for the acquisition and representation of expert knowledge. Fuzzy functional dependencies can detect whether a dependency between two examined attributes exists in the whole database. If a dependency exists only between parts of the examined attributes' domains, fuzzy functional dependencies cannot detect its character. Linguistic summaries are a convenient method for revealing this kind of dependency. Using fuzzy functional dependencies and linguistic summaries in a complementary way can mine valuable information from relational databases. Mining intensities of dependencies between database attributes can support decision making, reduce the number of attributes in databases, and help estimate missing values. The proposed approach is evaluated with case studies using real data from official statistics. Strengths and weaknesses of the described methods are discussed. At the end of the chapter, topics for further research are outlined.
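A linguistic summary of the kind discussed here has the form "most records with property R have property S", and its truth degree can be computed from fuzzy membership values. The sketch below follows the widely used formulation with a relative quantifier; the chapter's exact definitions may differ, and the ramp chosen for "most" (0 below 0.5, 1 above 0.9) and the membership values are assumptions made for illustration.

```python
from typing import List


def most(proportion: float) -> float:
    """Hypothetical membership function of the relative quantifier 'most'."""
    if proportion <= 0.5:
        return 0.0
    if proportion >= 0.9:
        return 1.0
    return (proportion - 0.5) / 0.4


def summary_truth(mu_r: List[float], mu_s: List[float]) -> float:
    """Truth degree of 'most records that are R are also S'.

    mu_r[i] and mu_s[i] are the degrees to which record i satisfies the
    restriction R and the summarizer S; min is used as the fuzzy 'and'.
    """
    matched = sum(min(r, s) for r, s in zip(mu_r, mu_s))
    restricted = sum(mu_r)
    return most(matched / restricted) if restricted > 0 else 0.0


# Hypothetical membership degrees for four records.
mu_high_x = [1.0, 1.0, 0.5, 0.0]   # degree to which attribute X is "high"
mu_high_y = [1.0, 0.5, 0.5, 1.0]   # degree to which attribute Y is "high"
truth = summary_truth(mu_high_x, mu_high_y)
```

A truth degree near 1 for such a summary indicates a dependency that holds on part of the attribute domain (here, the "high" region), which is the situation the abstract says fuzzy functional dependencies alone cannot characterize.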


