Graph Regularized Robust Matrix Factorization and Its Application on Student Grade Prediction

2020 ◽  
Vol 10 (5) ◽  
pp. 1755 ◽  
Author(s):  
Yupei Zhang ◽  
Yue Yun ◽  
Huan Dai ◽  
Jiaqi Cui ◽  
Xuequn Shang

Student grade prediction (SGP) is an important educational problem for designing personalized strategies of teaching and learning. Many studies adopt the technique of matrix factorization (MF). However, their methods often focus on the grade records while ignoring side information, such as student backgrounds and relationships. To this end, in this paper, we propose a new MF method, called graph regularized robust matrix factorization (GRMF), building on a recent robust MF formulation. GRMF integrates two side graphs, built on the side data of students and courses, into the objective of robust low-rank MF. As a result, the learned features of students and courses capture more prior knowledge from the educational context, leading to better grade predictions. The resulting objective problem can be effectively optimized by the Majorization-Minimization (MM) algorithm. In addition, GRMF not only yields features specific to the education domain but also handles missing, noisy, and corrupted data. To verify our method, we test GRMF on two public data sets for rating prediction and image recovery. Finally, we apply GRMF to educational data from our university, which is composed of 1325 students and 832 courses. The extensive experimental results show that GRMF is robust to various data problems and learns more effective features in comparison with other methods. Moreover, GRMF also delivers higher prediction accuracy than other methods on our educational data set. This technique can facilitate personalized teaching and learning in higher education.
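The abstract does not reproduce the objective, but a plausible form of a graph regularized robust MF objective consistent with the description above (an l1 data term for robustness plus Laplacian regularizers from the two side graphs; all symbols here are illustrative assumptions, not the paper's exact notation) is:

```latex
\min_{U,V}\;
  \bigl\lVert P_{\Omega}\bigl(X - U V^{\top}\bigr) \bigr\rVert_{1}
  \;+\; \lambda_{s}\,\operatorname{tr}\bigl(U^{\top} L_{s}\, U\bigr)
  \;+\; \lambda_{c}\,\operatorname{tr}\bigl(V^{\top} L_{c}\, V\bigr)
```

where X is the student-by-course grade matrix, P_Omega masks unobserved grades, U and V hold the learned student and course features, and L_s, L_c are the Laplacians of the student and course side graphs; an MM scheme then minimizes a smooth surrogate of the l1 term at each iteration.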


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Jiawei Lian ◽  
Junhong He ◽  
Yun Niu ◽  
Tianze Wang

Purpose The current popular image processing technologies based on convolutional neural networks have the characteristics of heavy computation, high storage cost and low accuracy for tiny defect detection, which conflicts with the high real-time performance and accuracy, and the limited computing and storage resources, required by industrial applications. Therefore, an improved YOLOv4, named YOLOv4-Defect, is proposed to solve the above problems. Design/methodology/approach On the one hand, this study applies multi-dimensional compression to the feature extraction network of YOLOv4 to simplify the model, and improves its feature extraction ability through knowledge distillation. On the other hand, a prediction scale with a finer receptive field is added to optimize the model structure, which improves the detection performance for tiny defects. Findings The effectiveness of the method is verified on the public data sets NEU-CLS and DAGM 2007, and on a steel ingot data set collected in an actual industrial setting. The experimental results demonstrate that the proposed YOLOv4-Defect method greatly improves recognition efficiency and accuracy while reducing the size and computational cost of the model. Originality/value This paper proposes an improved YOLOv4, named YOLOv4-Defect, for surface defect detection; it is suited to industrial scenarios with limited storage and computing resources, and meets requirements for high real-time performance and precision.
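The paper's exact distillation setup is not given in the abstract; a minimal sketch of a generic knowledge-distillation loss of the kind described (teacher = full YOLOv4, student = compressed model) follows. The temperature T, weight alpha, and the KL-divergence soft-target term are common conventions, not details taken from the paper; detection distillation in practice often also matches intermediate feature maps.

```python
# Hedged sketch of a generic knowledge-distillation loss (PyTorch).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, detection_loss,
                      T=4.0, alpha=0.5):
    """Blend the student's own detection loss with a soft-target term."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between teacher and student distributions,
    # scaled by T^2 as in Hinton et al. (2015).
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * T * T
    return alpha * kd + (1.0 - alpha) * detection_loss
```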



2011 ◽  
pp. 24-32 ◽  
Author(s):  
Nicoleta Rogovschi ◽  
Mustapha Lebbah ◽  
Younès Bennani

Most traditional clustering algorithms are limited to handling data sets that contain either continuous or categorical variables. However, data sets with mixed types of variables are common in the data mining field. In this paper we introduce a weighted self-organizing map for clustering, analysis and visualization of mixed (continuous/binary) data. The weights and prototypes are learned simultaneously, ensuring an optimized data clustering. The higher a variable's weight, the more the clustering algorithm takes into account the information carried by that variable. The learning of these topological maps is combined with a weighting process over the different variables, computing weights that influence the quality of the clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, the Zoo data set and three other mixed data sets. The results show a good quality of the topological ordering and homogeneous clustering.
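As a concrete illustration, here is a minimal sketch of one plausible weighted dissimilarity for mixed continuous/binary data of the kind such a weighted SOM would minimize. The split into squared-Euclidean and Hamming parts and the weight vector w are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def weighted_mixed_distance(x, prototype, w, is_binary):
    """Per-variable weighted distance between a sample and a map prototype.

    x, prototype : 1-D numeric arrays of equal length (binary vars coded 0/1)
    w            : nonnegative per-variable weights (higher = more influence)
    is_binary    : boolean mask marking the binary variables
    """
    diff = x - prototype
    cont = np.where(~is_binary, diff ** 2, 0.0)        # squared Euclidean part
    binm = np.where(is_binary, x != prototype, 0.0)    # Hamming (mismatch) part
    return float(np.sum(w * (cont + binm)))
```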





2019 ◽  
Vol 18 ◽  
pp. 117693511989029
Author(s):  
James LT Dalgleish ◽  
Yonghong Wang ◽  
Jack Zhu ◽  
Paul S Meltzer

Motivation: DNA copy number (CN) data are a fast-growing source of information used in basic and translational cancer research. Most CN segmentation data are presented without regard to the relationship between chromosomal regions. We offer both a toolkit to help scientists without programming experience visually explore the CN interactome and a package that constructs CN interactomes from publicly available data sets. Results: The CNVScope visualization, based on a publicly available neuroblastoma CN data set, clearly displays a distinct CN interaction in the region of MYCN, a canonical frequent amplicon target in this cancer. Exploration of the data rapidly identified cis and trans events, including a strong anticorrelation between 11q loss and 17q gain, with the region of 11q loss bounded by the cell cycle regulator CCND1. Availability: The Shiny application is readily available for use at http://cnvscope.nci.nih.gov/ , and the package can be downloaded from CRAN ( https://cran.r-project.org/package=CNVScope ), where help pages and vignettes are located. A newer version is available on the GitHub site ( https://github.com/jamesdalg/CNVScope/ ), which features an animated tutorial. The CNVScope package can be locally installed using instructions on the GitHub site for Windows and Macintosh systems. This CN analysis package also runs on a Linux high-performance computing cluster, with options for multinode and multiprocessor analysis of CN variant data. The Shiny application can be started using a single command (which will automatically install the public data package).
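CNVScope itself is an R package; as a language-neutral illustration of the core idea only (not its implementation), the "CN interactome" described above can be thought of as a matrix of relationships, such as correlations, between chromosomal regions across samples:

```python
# Hedged sketch: a bins x bins correlation matrix over copy-number data.
import numpy as np

def cn_interactome(cn_matrix):
    """cn_matrix: samples x genomic_bins array of copy-number values.

    Returns a bins x bins correlation matrix; strong off-diagonal structure
    suggests cis/trans CN interactions (e.g., 11q loss vs. 17q gain).
    """
    return np.corrcoef(cn_matrix, rowvar=False)

# Example with random data standing in for segmented CN calls:
rng = np.random.default_rng(0)
demo = rng.normal(size=(40, 100))   # 40 samples, 100 genomic bins
print(cn_interactome(demo).shape)   # (100, 100)
```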



2006 ◽  
Vol 95 (4) ◽  
pp. 2199-2212 ◽  
Author(s):  
Matthew C. Tresch ◽  
Vincent C. K. Cheung ◽  
Andrea d'Avella

Several recent studies have used matrix factorization algorithms to assess the hypothesis that behaviors might be produced through the combination of a small number of muscle synergies. Although generally agreeing in their basic conclusions, these studies have used a range of different algorithms, making their interpretation and integration difficult. We therefore compared the performance of these different algorithms on both simulated and experimental data sets. We focused on the ability of these algorithms to identify the set of synergies underlying a data set. All data sets consisted of nonnegative values, reflecting the nonnegative data of muscle activation patterns. We found that the performance of principal component analysis (PCA) was generally lower than that of the other algorithms in identifying muscle synergies. Factor analysis (FA) with varimax rotation was better than PCA, and was generally at the same level as independent component analysis (ICA) and nonnegative matrix factorization (NMF). ICA performed very well on data sets corrupted by constant-variance Gaussian noise, but was impaired on data sets with signal-dependent noise and when synergy activation coefficients were correlated. NMF performed similarly to ICA and FA on data sets with signal-dependent noise and was generally robust across data sets. The best algorithms were ICA applied to the subspace defined by PCA (ICAPCA) and a version of probabilistic ICA with nonnegativity constraints (pICA). We also evaluated some commonly used criteria to identify the number of synergies underlying a data set, finding that only likelihood ratios based on factor analysis identified the correct number of synergies for data sets with signal-dependent noise in some cases. We then proposed an ad hoc procedure, finding that it was able to identify the correct number in a larger number of cases. Finally, we applied these methods to an experimentally obtained data set. The best-performing algorithms (FA, ICA, NMF, ICAPCA, pICA) identified synergies very similar to one another. Based on these results, we discuss guidelines for using factorization algorithms to analyze muscle activation patterns. More generally, the ability of several algorithms to identify the correct muscle synergies and activation coefficients in simulated data, combined with their consistency when applied to physiological data sets, suggests that the muscle synergies found by a particular algorithm are not an artifact of that algorithm, but reflect basic aspects of the organization of muscle activation patterns underlying behaviors.
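For readers wanting to reproduce the flavor of this comparison, here is a minimal sketch using scikit-learn stand-ins for several of the algorithms compared (PCA, FA with varimax rotation, FastICA, NMF) on a simulated nonnegative trials-by-muscles matrix; the sizes, noise level, and matching criterion are illustrative assumptions, not the paper's protocol.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis, FastICA, NMF

rng = np.random.default_rng(1)
n_trials, n_muscles, n_synergies = 200, 16, 4
C_true = rng.random((n_trials, n_synergies))        # activation coefficients
S_true = rng.random((n_synergies, n_muscles))       # muscle synergies
X = C_true @ S_true + 0.01 * rng.random((n_trials, n_muscles))  # nonnegative

models = {
    "PCA": PCA(n_components=n_synergies),
    "FA (varimax)": FactorAnalysis(n_components=n_synergies, rotation="varimax"),
    "ICA": FastICA(n_components=n_synergies, random_state=0),
    "NMF": NMF(n_components=n_synergies, init="nndsvda", max_iter=1000),
}
for name, model in models.items():
    model.fit(X)
    # Muscle patterns: for ICA they live in the mixing matrix; the other
    # models expose them directly as rows of components_.
    patterns = model.mixing_.T if hasattr(model, "mixing_") else model.components_
    # Score recovery by matching each estimated pattern to its
    # best-correlated true synergy.
    best_r = [max(abs(np.corrcoef(p, s)[0, 1]) for s in S_true)
              for p in patterns]
    print(f"{name}: mean |r| to true synergies = {np.mean(best_r):.2f}")
```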



Author(s):  
MUSTAPHA LEBBAH ◽  
YOUNÈS BENNANI ◽  
NICOLETA ROGOVSCHI

This paper introduces a probabilistic self-organizing map for topographic clustering, analysis and visualization of multivariate binary data, or of categorical data using binary coding. We propose a probabilistic formalism dedicated to binary data in which cells are represented by a Bernoulli distribution. Each cell is characterized by a prototype with the same binary coding as used in the data space and by the probability of differing from this prototype. The learning algorithm we propose, a Bernoulli self-organizing map, is an application of the standard EM algorithm. We illustrate the power of this method with six data sets taken from a public data set repository. The results show a good quality of the topological ordering and homogeneous clustering.
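From the description, each cell c carries a binary prototype w_c and a deviation probability epsilon_c; a natural form of the per-cell likelihood, written here as an assumption consistent with the abstract rather than a formula taken from the paper, is:

```latex
p(\mathbf{x} \mid \mathbf{w}_c, \varepsilon_c)
  \;=\; \prod_{j=1}^{d}
    \varepsilon_c^{\,[x_j \neq w_{cj}]}\,
    (1-\varepsilon_c)^{\,[x_j = w_{cj}]},
\qquad 0 < \varepsilon_c < \tfrac{1}{2},
```

where [.] is the indicator function. EM then alternates between posterior assignments of data points to cells (E-step) and updates of the prototypes w_c and deviation probabilities epsilon_c (M-step), with the map topology entering through neighborhood-weighted responsibilities.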



2021 ◽  
Vol 37 (1) ◽  
pp. 71-89
Author(s):  
Vu-Tuan Dang ◽  
Viet-Vu Vu ◽  
Hong-Quan Do ◽  
Thi Kieu Oanh Le

During the past few years, semi-supervised clustering has emerged as an interesting new direction in machine learning research. In a semi-supervised clustering algorithm, the clustering results can be significantly improved by using side information, which is available from or collected from users. There are two main kinds of side information that can be used by semi-supervised clustering algorithms: class labels, called seeds, and pairwise constraints. The first semi-supervised clustering algorithm was introduced in 2000, and since then many algorithms have been presented in the literature. However, it is not easy to use both types of side information in the same algorithm. To address this problem, this paper proposes a semi-supervised graph-based clustering algorithm, called MCSSGC, that uses both seeds and constraints in the clustering process. Moreover, we introduce a simple but efficient active learning method, named KMMFFQS, to collect constraints that can boost the performance of MCSSGC. To verify the effectiveness of the proposed algorithm, we conducted a series of experiments not only on real data sets from UCI, but also on a document data set used in an information extraction task on Vietnamese documents. The obtained results show that the proposed algorithm can significantly improve the clustering process compared to some recent algorithms.
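MCSSGC itself is graph-based; as a simpler illustration of how the two kinds of side information are typically represented and enforced, here is a hedged sketch in the style of constrained k-means (COP-KMeans), explicitly not the authors' algorithm, with made-up indices:

```python
import numpy as np

n_points = 10
seeds = {0: 0, 7: 1}            # labeled points ("seeds"): index -> class
must_link = [(2, 3)]            # pairs that must share a cluster
cannot_link = [(2, 9)]          # pairs that must not share a cluster

labels = np.full(n_points, -1)  # -1 = unassigned
for p, c in seeds.items():      # seeds fix their own cluster up front
    labels[p] = c

def violates(point, cluster):
    """True if putting `point` into `cluster` breaks a pairwise constraint."""
    for a, b in must_link:
        if point in (a, b):
            other = b if point == a else a
            if labels[other] not in (-1, cluster):
                return True
    for a, b in cannot_link:
        if point in (a, b):
            other = b if point == a else a
            if labels[other] == cluster:
                return True
    return False

# Inside a clustering loop, a point is assigned only to its nearest
# cluster for which this check passes.
print(violates(3, 1))  # False while point 2 is still unassigned
```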



Author(s):  
Yun-Young Hwang et al.

In order to make public data more useful, it is necessary to provide related data sets that meet users' needs. We introduce a method for linking datasets: deriving linkages between fields of the structured datasets provided by public data portals. We defined a dataset and the connectivity between datasets; this connectivity is based on the metadata of the dataset and on the linkage between the actual data field names and values. We constructed standard field names and, based on this standard, established the relationships between datasets. This paper covers 31,692 structured datasets (as of May 31, 2020) among the public data portal datasets, from which we extracted 1,185,846 field names. Analysis of the field names showed that those related to spatial information were the most common, at 35%. This paper verified the method of deriving relations between data sets, focusing on the field names classified as spatial information. For this reason, we defined spatial standard field names. To derive similar field names, we extracted field names related to space, such as locations, coordinates, addresses, and zip codes used in public datasets. The standard field names for spatial information were designed and achieved a 43% cooperation rate over the 31,692 datasets. In the future, we plan to apply additional similar field names to improve the data set cooperation rate for the spatial information standard.
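A minimal sketch of one simple way to map raw field names onto standard spatial field names, as the abstract describes; the standard-name list, the normalization rules, and the similarity threshold are illustrative assumptions, not the authors' specification.

```python
from difflib import SequenceMatcher

STANDARD_SPATIAL = ["latitude", "longitude", "address", "zip_code"]

def normalize(name):
    """Lowercase and unify separators so near-duplicates compare equal."""
    return name.strip().lower().replace("-", "_").replace(" ", "_")

def match_standard(field, threshold=0.75):
    """Return the best-matching standard field name, or None."""
    field = normalize(field)
    best, score = None, 0.0
    for std in STANDARD_SPATIAL:
        s = SequenceMatcher(None, field, std).ratio()
        if s > score:
            best, score = std, s
    return best if score >= threshold else None

print(match_standard("Longitude "))   # -> "longitude"
print(match_standard("zipcode"))      # -> "zip_code"
```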



2020 ◽  
Vol 498 (3) ◽  
pp. 3440-3451
Author(s):  
Alan F Heavens ◽  
Elena Sellentin ◽  
Andrew H Jaffe

Bringing a high-dimensional data set into science-ready shape is a formidable challenge that often necessitates data compression. Compression has accordingly become a key consideration for contemporary cosmology, affecting public data releases and reanalyses searching for new physics. However, data compression optimized for a particular model can suppress signs of new physics, or even remove them altogether. We therefore provide a solution for exploring new physics during data compression. In particular, we store additional agnostic compressed data points, selected to enable precise constraints on non-standard physics at a later date. Our procedure is based on the maximal compression of the MOPED algorithm, which optimally filters the data with respect to a baseline model. We select additional filters, based on a generalized principal component analysis, which are carefully constructed to scout for new physics at high precision and speed. We refer to the augmented set of filters as MOPED-PC. They enable an analytic computation of Bayesian evidence that may indicate the presence of new physics, and fast analytic estimates of best-fitting parameters when adopting a specific non-standard theory, without further expensive MCMC analysis. As there may be large numbers of non-standard theories, the speed of the method becomes essential. Should no new physics be found, then our approach preserves the precision of the standard parameters. As a result, we achieve very rapid and maximally precise constraints on standard and non-standard physics, with a technique that scales well to high-dimensional data sets.
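For context, the MOPED filters referred to above compress a data vector x with covariance C and model mean mu(theta) into one number y_m = b_m^T x per parameter. As standardly written in the original algorithm (Heavens, Jimenez & Lahav 2000), the first filter and its Gram-Schmidt successors are:

```latex
\mathbf{b}_1 =
  \frac{\mathsf{C}^{-1}\boldsymbol{\mu}_{,1}}
       {\sqrt{\boldsymbol{\mu}_{,1}^{\top}\mathsf{C}^{-1}\boldsymbol{\mu}_{,1}}},
\qquad
\mathbf{b}_m =
  \frac{\mathsf{C}^{-1}\boldsymbol{\mu}_{,m}
        - \sum_{q=1}^{m-1}\bigl(\boldsymbol{\mu}_{,m}^{\top}\mathbf{b}_q\bigr)\mathbf{b}_q}
       {\sqrt{\boldsymbol{\mu}_{,m}^{\top}\mathsf{C}^{-1}\boldsymbol{\mu}_{,m}
        - \sum_{q=1}^{m-1}\bigl(\boldsymbol{\mu}_{,m}^{\top}\mathbf{b}_q\bigr)^{2}}},
```

where mu_{,m} denotes the derivative of the mean with respect to parameter theta_m. MOPED-PC augments this set with additional principal-component-based filters that retain sensitivity to departures from the baseline model.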



2021 ◽  
Vol 35 (6) ◽  
pp. 137-146
Author(s):  
Haeyoon Lee ◽  
Muheon Jeong ◽  
Inseon Park

The purpose of this study is to obtain implications through a comparative analysis of the disaster safety datasets and services of representative public data portals in Korea and Japan. Comparative standards were established first. Then, weight analysis of dataset components by disaster type and safety-management stage, trend analysis through text mining of dataset descriptions, and analysis of data quality and portal services were performed. As a result, public data sets were fewer in Korea, both numerically and proportionally, than in Japan. Japan had a high proportion of disaster preparation and recovery datasets in terms of disaster safety management, while Korea had a high proportion of prevention datasets. In addition, in terms of disaster response collaboration, Korea mostly has material management and resource support, whereas Japan has a high proportion of emergency recovery and situation management of damaged facilities. In terms of data quality, Japan has many datasets rated at level four on the Berners-Lee scale, whereas Korea has a high proportion of datasets rated at level three; Korea, however, has a better data format for big-data utilization. Portal services mainly center on natural disasters in Japan, but on social disasters in Korea. The results of this study provide a reference for the future direction of disaster safety public data portals in Korea.


