An Empirical Study of a Linear Regression Combiner on Multi-class Data Sets

After an era of struggling with data collection, the problem today is how to process the vast amounts of information now available. Scientists and researchers regard Big Data as probably the most essential topic in computing science today. The term Big Data describes huge volumes of data that may exist in any structure, which makes it difficult for standard processing approaches to mine useful information from such large data sets. Classification in Big Data is a procedure for summarizing data sets based on various examples. Different classification frameworks exist to help classify data collections. The methods discussed in the chapter include Multi-Layer Perceptron, Linear Regression, C4.5, CART, J48, SVM, ID3, Random Forest, and KNN. The aim of this chapter is to provide a comprehensive evaluation of commonly used classification methods.

SYSTEMATIC VARIATION NORMALIZATION IN MICROARRAY DATA TO GET GENE EXPRESSION COMPARISON UNBIASED

Journal of Bioinformatics and Computational Biology ◽ 10.1142/s0219720005001028 ◽ 2005 ◽ Vol 03 (02) ◽ pp. 225-241 ◽ Cited By ~ 13

Author(s): JEFF W. CHOU ◽ RICHARD S. PAULES ◽ PIERRE R. BUSHEL

Keyword(s): Gene Expression ◽ Linear Regression ◽ Microarray Data ◽ Expression Patterns ◽ Microarray Gene Expression Data ◽ Systematic Variation ◽ Data Sets ◽ Microarray Gene Expression ◽ Pixel Intensity ◽ Non Linear

Normalization removes or minimizes the biases of systematic variation that exist in experimental data sets. This study presents a systematic variation normalization (SVN) procedure for removing systematic variation in two-channel microarray gene expression data. Based on an analysis of how systematic variation contributes to variability in microarray data sets, our normalization procedure includes background subtraction determined from the distribution of pixel intensity values from each data acquisition channel and log conversion, linear or non-linear regression, restoration or transformation, and multiarray normalization. When a non-linear regression is required, an empirical polynomial approximation approach is used. Either the high terminated points or their averaged values in the distributions of the pixel intensity values observed in control channels may be used for rescaling multiarray datasets. These pre-processing steps remove systematic variation in the data attributable to variability in microarray slides, assay batches, the array process, or experimenters. Biologically meaningful comparisons of gene expression patterns between control and test channels, or among multiple arrays, are therefore unbiased when made on normalized rather than unnormalized datasets.
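
The first steps of the procedure described above (background subtraction, log conversion, and a linear or polynomial regression between channels) can be sketched roughly as follows. This is an illustrative sketch, not the authors' SVN implementation: the function name `normalize_two_channel`, the use of a fixed background value per channel, the clipping at 1, and the synthetic intensities are all assumptions, and the restoration and multiarray rescaling steps are omitted.

```python
import numpy as np


def normalize_two_channel(control, test, bg_control, bg_test, degree=1):
    """Return log2 test intensities with the fitted channel trend removed.

    bg_control and bg_test are background levels supplied by the caller
    (assumption: the abstract derives them from the pixel intensity
    distributions, which is not reproduced here).
    """
    # Background subtraction; clip at 1 so the logarithm stays defined.
    c = np.clip(np.asarray(control, dtype=float) - bg_control, 1.0, None)
    t = np.clip(np.asarray(test, dtype=float) - bg_test, 1.0, None)
    log_c, log_t = np.log2(c), np.log2(t)
    # Least-squares fit of log test on log control; degree=1 is linear,
    # a higher degree gives a polynomial approximation.
    coeffs = np.polyfit(log_c, log_t, deg=degree)
    fitted = np.polyval(coeffs, log_c)
    # Residuals are the log intensities after removing the fitted trend.
    return log_t - fitted


# Synthetic data: the test channel is a scaled copy of the control channel.
rng = np.random.default_rng(0)
true_signal = rng.uniform(200.0, 20000.0, size=500)
control = true_signal + 100.0
test = 1.8 * true_signal + 150.0
log_ratio = normalize_two_channel(control, test, 100.0, 150.0)
```

In this synthetic case the only difference between channels is a constant scale factor and a background offset, so the residuals come out numerically close to zero.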

Prediction of Fetal Hemoglobin in Sickle Cell Anemia Using a Genetic Risk Score

Blood ◽ 10.1182/blood.v120.21.3216.3216 ◽ 2012 ◽ Vol 120 (21) ◽ pp. 3216-3216

Author(s): Jacqueline N Milton ◽ Paola Sebastiani ◽ Clinton T. Baldwin ◽ Efthymia Melista ◽ Victor R. Gordeuk ◽ ...

Keyword(s): Linear Regression ◽ Sickle Cell ◽ Sickle Cell Anemia ◽ Genetic Risk ◽ Genetic Risk Score ◽ Fetal Hemoglobin ◽ P Value ◽ Data Sets ◽ Genome Wide ◽ Hbf Level

Abstract 3216. Fetal hemoglobin (HbF) is the major genetic modifier of the clinical course of sickle cell anemia (homozygosity for HBB glu6val). HbF level is also an important predictor of mortality. If it were possible to know at birth the HbF level likely to be present after stabilization of this measurement at about age 5 years, then an improved prognosis might be given and HbF-inducing treatments better informed. Levels of HbF in adults are highly heritable, and the production of HbF is genetically regulated by several quantitative trait loci and by genetic elements linked to the HBB gene cluster. One of the most popular approaches to genetic risk prediction uses a summary of the risk alleles in the form of a genetic risk score (GRS) that is used as a covariate of the genetic prediction model. We present the development of a GRS for HbF in 841 patients from the Cooperative Study of Sickle Cell Disease (CSSCD) cohort and assess its ability to predict HbF values in three independent cohorts: PUSH (N=77), Walk-PHaSST (N=181), and C-Data from the Comprehensive Sickle Cell Centers program (N=127). We used the results of a genome-wide association study (GWAS) of HbF in sickle cell anemia, in which patients were genotyped using the 610K Illumina array, and the association of each of the ∼550K SNPs with HbF was tested using a linear regression model with gender-adjusted additive genetic effects. To build the GRS, we sorted SNPs by increasing p-value, starting from the most significant SNP associated with HbF (rs766432, p-value = 2.61 × 10^-21), and pruned the list by removing SNPs in high LD (r^2 > 0.8). We then used this list of SNPs to generate a sequence of nested GRS. We started with the GRS that included only the most significant SNP and generated the second GRS by adding the second SNP from the list. The third GRS was generated by adding the third SNP from the list to the second GRS, and so on.
We repeated this analysis including up to 10,000 SNPs (p-value < .02185) and hence generated 10,000 GRS for each of the subjects in the CSSCD. Each of these GRS was included as a covariate in a linear regression model, and the regression coefficients of the resulting 10,000 linear regression models were estimated using least squares methods in the CSSCD data. The predictive value of these GRS models was then evaluated in three independent cohorts. In this evaluation, we computed the 10,000 GRS for each subject in each data set and then used the 10,000 regression models estimated in the CSSCD data set to compute the expected HbF value of patients, given their GRS. We then assessed the predictive accuracy by computing the correlation between the observed and predicted values of HbF. To produce more stable predictions, we also created ensembles of predictive models. An ensemble of the first 14 GRS models, including 14 SNPs, had the best predictive value in all 3 data sets and explained 23.4% of the variability in HbF; the correlation between the predicted HbF and observed HbF was 0.44, 0.28, and 0.39 in the three different cohorts. Of these 14 SNPs, 6 were located in BCL11A; other SNPs were located in the olfactory receptor region and in chromosome 11p15 and the site of the HBB gene cluster and were found previously to be associated with HbF. We next compared these results to predictive models in which we included gender, coincident alpha thalassemia, and HBB haplotypes for prediction. The model including gender and alpha thalassemia explained only 2.6% of the variability of HbF in the discovery cohort, the model including HBB haplotypes explained 2.35% of the variability of HbF in the discovery cohort, and neither model showed a significant correlation between the predicted and observed HbF in the three other cohorts. In addition, combining the non-genetic information with the GRS did not help to explain more of the variability in HbF.
With as few as 14 SNPs we can explain more of the variability in HbF and do a better job of prediction than with other non-genetic risk factors or genome-wide significant SNPs; however, we still cannot explain all of the variability in HbF that is due to heritability. These results suggest that knowing the genotype of a few SNPs can help to predict HbF levels after they have stabilized. Prediction of HbF at an early age has the potential to help foretell some features of the severity of the clinical course of the disease and to help optimize the clinical management of patients. Disclosures: No relevant conflicts of interest to declare.
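
The nested GRS construction and its evaluation by correlation, as described in the abstract above, might look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the study's code: the score is taken to be an unweighted risk-allele count, the SNP columns are assumed to be already sorted by p-value and LD-pruned, the function names are hypothetical, the data are synthetic, and the ensemble step is left out.

```python
import numpy as np


def fit_nested_grs_models(genotypes, phenotype, max_snps):
    """Fit one simple linear regression per nested genetic risk score.

    genotypes: (n_subjects, n_snps) array of risk-allele counts (0, 1, 2).
    Returns a list of (intercept, slope) pairs, one per nested score.
    """
    # Score k is the unweighted allele count over the first k + 1 SNPs.
    scores = np.cumsum(genotypes[:, :max_snps], axis=1)
    models = []
    for k in range(scores.shape[1]):
        # np.polyfit returns the highest-degree coefficient first.
        slope, intercept = np.polyfit(scores[:, k], phenotype, deg=1)
        models.append((intercept, slope))
    return models


def validation_correlations(genotypes, phenotype, models):
    """Correlate observed values with predictions from each fitted model."""
    scores = np.cumsum(genotypes[:, :len(models)], axis=1)
    correlations = []
    for k, (intercept, slope) in enumerate(models):
        predicted = intercept + slope * scores[:, k]
        correlations.append(np.corrcoef(predicted, phenotype)[0, 1])
    return np.array(correlations)


# Synthetic illustration only: 5 causal SNPs followed by 15 null SNPs.
rng = np.random.default_rng(1)


def simulate(n_subjects):
    g = rng.integers(0, 3, size=(n_subjects, 20))
    y = g[:, :5].sum(axis=1) + rng.normal(0.0, 1.0, size=n_subjects)
    return g, y


train_g, train_y = simulate(300)   # stands in for the discovery cohort
test_g, test_y = simulate(200)     # stands in for an independent cohort
models = fit_nested_grs_models(train_g, train_y, max_snps=20)
correlations = validation_correlations(test_g, test_y, models)
```

Each entry of `correlations` plays the role of the predicted-versus-observed correlation for one nested score, so comparing entries shows how predictive accuracy changes as SNPs are added.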

Sellmeier fits with linear regression; multiple data sets; dispersion formulas for helium

Applied Optics ◽ 10.1364/ao.22.002906 ◽ 1983 ◽ Vol 22 (18) ◽ pp. 2906 ◽ Cited By ~ 11

Author(s): Edson R. Peck

Keyword(s): Linear Regression ◽ Data Sets ◽ Multiple Data ◽ Multiple Data Sets

Statistical Models for the Twinning Rate

Acta geneticae medicae et gemellologiae twin research ◽ 10.1017/s000156600000605x ◽ 1987 ◽ Vol 36 (3) ◽ pp. 297-312 ◽ Cited By ~ 13

Author(s): J.O. Fellman ◽ A.W. Eriksson

Keyword(s): Linear Regression ◽ Statistical Models ◽ Maternal Age ◽ Regression Models ◽ Model Building ◽ Data Sets ◽ Linear Regression Models ◽ Linear Regression Technique ◽ Secular Decline ◽ Disaggregated Data

Linear regression models are used to explain variations in twinning rates. Data sets from different countries are analysed, with maternal age, parity, and marital status as the main regressors. The model-building technique is also used to study the secular decline in the twinning rate. The linear regression technique makes it possible to compare the effects of different factors, but the method requires sufficiently disaggregated data.

Mahatma Gandhi National Rural Employment Guarantee Act (MGNREGA): A Tool for Employment Generation

International Journal of Social Sciences and Management ◽ 10.3126/ijssm.v3i4.15974 ◽ 2016 ◽ Vol 3 (4) ◽ pp. 281-286

Author(s): Lamaan Sami ◽ Anas Khan

Keyword(s): Linear Regression ◽ Empirical Study ◽ Significant Role ◽ Personal Interview ◽ Mahatma Gandhi ◽ Rural Employment ◽ Employment Generation ◽ Employment Increase ◽ Employment Guarantee ◽ The Impact

This empirical study examines the impact of MGNREGA on generating employment for the poor in selected districts in India. Data were collected through personal interviews and analyzed using linear regression. The analysis revealed that MGNREGA played a significant role in generating employment and in increasing the income and consumption of respondents in the selected districts. Int. J. Soc. Sc. Manage. Vol. 3, Issue-4: 281-286

Analysis of Influence Financial Ratios on Sharia Banking Performance in Indonesia (Empirical Study at Bank Muamalat Indonesia, Bank Syariah Mandiri, and Bank Mega Syariah)

Global Review of Islamic Economics and Business ◽ 10.14421/grieb.2016.042-06 ◽ 2016 ◽ Vol 4 (2) ◽ pp. 135

Author(s): Shulhah Nurullaily

Keyword(s): United States ◽ Regression Analysis ◽ Linear Regression ◽ Empirical Study ◽ Multiple Linear Regression ◽ Linear Regression Analysis ◽ Multiple Linear Regression Analysis ◽ The United States ◽ Negative Effect ◽ The Impact

This study examines the performance of Sharia banking in Indonesia after growth slowed due to the impact of the United States crisis of 2008/2009. The factors used to measure the performance of Sharia banking, which is represented by ROA, are CAR, NPF, BOPO, NM, and FDR. The research uses multiple linear regression analysis on a sample comprising Bank Muamalat, Bank Mega Syariah, and Bank Syariah Mandiri, covering the period from the first quarter of 2008 to the fourth quarter of 2011. The results show that NM and FDR have a significant positive effect on ROA, BOPO has a significant negative effect on ROA, and CAR and NPF have no influence on ROA.
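
The multiple linear regression named in this abstract can be written out as a sketch. The abstract does not state the exact specification, so the pooled form below, the bank index i, the quarter index t, the coefficient names, and the error term are assumptions made for illustration.

```latex
\mathrm{ROA}_{it} = \beta_0 + \beta_1\,\mathrm{CAR}_{it} + \beta_2\,\mathrm{NPF}_{it}
  + \beta_3\,\mathrm{BOPO}_{it} + \beta_4\,\mathrm{NM}_{it}
  + \beta_5\,\mathrm{FDR}_{it} + \varepsilon_{it}
```

Under this reading, the reported findings correspond to positive and significant estimates of \beta_4 and \beta_5 (NM, FDR), a negative and significant estimate of \beta_3 (BOPO), and estimates of \beta_1 and \beta_2 (CAR, NPF) that are not significantly different from zero. If the three banks were pooled over the 16 quarters from 2008 to 2011, the sample would contain 48 bank-quarter observations.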

Duplicate Question Detection in Stack Overflow: A Reproducibility Study

10.7287/peerj.preprints.26555 ◽ 2018

Author(s): Rodrigo F G Silva ◽ Klerisson V Paixao ◽ Marcelo de A. Maia

Keyword(s): Empirical Study ◽ Scientific Literature ◽ Recall Rate ◽ Data Sets ◽ Continuous Growth ◽ Stack Overflow ◽ Reproducibility Study ◽ Question And Answer ◽ Over Time

Stack Overflow has become a fundamental element of the developer toolset. This growing influence has been accompanied by an effort from the Stack Overflow community to maintain the quality of its content. One of the problems that jeopardizes that quality is the continuous growth of duplicated questions. To solve this problem, prior works focused on automatically detecting duplicated questions. Two important solutions are DupPredictor and Dupe. Despite reporting significant results, neither work makes its implementation publicly available, hindering subsequent works in the scientific literature that rely on them. We conducted an empirical study as a reproduction of DupPredictor and Dupe. Our results, which were not robust when attempted with different sets of tools and data sets, show that the barriers to reproducing these approaches are high. Furthermore, when applied to more recent data, we observe a performance decay in both of our reproductions in terms of recall-rate over time, as the number of questions increases. Our findings suggest that subsequent works concerning the detection of duplicated questions in Question and Answer communities require more investigation to assert their findings.
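
The abstract reports performance as recall-rate but does not define it. One common reading in duplicate detection is the fraction of duplicate questions whose original ("master") question appears among the top k ranked candidates. The sketch below assumes that definition; the function name, the dictionary layout, and the sample identifiers are hypothetical and are not taken from DupPredictor or Dupe.

```python
def recall_rate_at_k(ranked_candidates, true_masters, k):
    """Fraction of duplicates with at least one true master in the top k.

    ranked_candidates: dict mapping a duplicate question id to its list of
        candidate ids, best match first.
    true_masters: dict mapping a duplicate question id to the set of ids
        it actually duplicates.
    """
    if not true_masters:
        return 0.0
    hits = 0
    for question_id, masters in true_masters.items():
        top_k = ranked_candidates.get(question_id, [])[:k]
        if any(candidate in masters for candidate in top_k):
            hits += 1
    return hits / len(true_masters)


# Toy example with three duplicate questions.
rankings = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
    "q3": ["g", "h", "i"],
}
masters = {"q1": {"b"}, "q2": {"x"}, "q3": {"g"}}
recall_at_1 = recall_rate_at_k(rankings, masters, 1)
recall_at_2 = recall_rate_at_k(rankings, masters, 2)
```

In the toy data only "q3" has its master ranked first, "q1" has it ranked second, and "q2" never retrieves it, so the measure rises from one third at k = 1 to two thirds at k = 2.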

PENGARUH STRUKTUR MODAL DAN PROFITABILITAS TERHADAP KEBIJAKAN DIVIDEN (The Effect of Capital Structure and Profitability on Dividend Policy)

Jurnal Manajemen & Bisnis Kreatif ◽ 10.36805/manajemen.v5i2.1031 ◽ 2020 ◽ Vol 5 (2) ◽ pp. 67-80

Author(s): Indah Anggraeni Paramitha ◽ Lisdawati

Keyword(s): Data Analysis ◽ Linear Regression ◽ Empirical Study ◽ Capital Structure ◽ Research Method ◽ Dividend Policy ◽ Secondary Data ◽ Financial Statements ◽ Dividend Payout ◽ Equity Ratio

The purpose of this study was to determine, and provide empirical evidence of, the effect of Capital Structure and Profitability, both partially and simultaneously, on Dividend Policy at PT. Mayora Indah, Tbk for the period 2011-2017. Capital Structure is measured by the debt-to-equity ratio (DER), Profitability by return on assets (ROA), and Dividend Policy by the dividend payout ratio (DPR). The research method uses secondary data in the form of the financial statements of PT. Mayora Indah, Tbk, and the data analysis uses multiple linear regression, classical assumption tests (normality, multicollinearity, heteroscedasticity, and autocorrelation tests), and hypothesis tests. The results showed that Capital Structure and Profitability have a partial and simultaneous effect on Dividend Policy. Keywords: dividend policy, profitability, capital structure.
