Adaptive lasso with weights based on normalized filtering scores in molecular big data

2020 ◽  
Vol 19 (04) ◽  
pp. 2040010 ◽  
Author(s):  
Abhijeet R. Patil ◽  
Byung-Kwon Park ◽  
Sangjin Kim

Molecular big data are highly correlated, and many genes are unrelated to the outcome of interest. The performance of classification methods relies mainly on the selection of significant genes. Sparse regularized regression (SRR) models using the least absolute shrinkage and selection operator (lasso) and the adaptive lasso (alasso) are popular for gene selection and classification. However, they become difficult to apply when the genes are highly correlated. Here, we propose a modified adaptive lasso whose weights are derived from ranking-based feature selection (RFS) methods, making it capable of handling highly correlated gene expression data. First, an RFS method such as Fisher's score (FS), Chi-square (CS), or information gain (IG) is employed to discard unimportant genes, and the top-ranked genes are retained through the sure independence screening (SIS) criterion. The scores of the ranked genes are normalized and assigned as weights to the alasso method, yielding the most significant genes, which were shown to be biologically related to the cancer type and helped attain higher classification performance. With synthetic data and a real microarray application, we demonstrate that the proposed alasso with RFS weights is a better approach than other known methods, such as alasso with ridge or marginal maximum likelihood estimation (MMLE) filtering, and lasso and alasso without filtering. Accuracy, area under the receiver operating characteristic curve (AUROC), and geometric mean (GM) are used to evaluate model performance.
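A minimal Python sketch of the weighting idea, not the authors' implementation: genes are ranked by Fisher score, the top genes are kept via an SIS-style cutoff, and the normalized scores are turned into adaptive-lasso penalty weights by rescaling the design matrix before a standard L1 fit. The reciprocal-score penalty, the n/log(n) cutoff, and all names are illustrative assumptions.

```python
# Sketch (assumptions noted above), not the authors' code.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fisher_scores(X, y):
    """Two-class Fisher score for each gene (column of X)."""
    X0, X1 = X[y == 0], X[y == 1]
    return (X0.mean(0) - X1.mean(0)) ** 2 / (X0.var(0) + X1.var(0) + 1e-12)

def alasso_with_rfs_weights(X, y, n_keep=None):
    n, p = X.shape
    scores = fisher_scores(X, y)
    n_keep = n_keep or int(n / np.log(n))        # SIS-style cutoff (assumption)
    keep = np.argsort(scores)[::-1][:n_keep]     # top-ranked genes
    w = scores[keep] / scores[keep].sum()        # normalized scores as weights
    X_scaled = X[:, keep] * w                    # larger weight -> weaker penalty
    model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    model.fit(X_scaled, y)
    beta = model.coef_.ravel() * w               # map coefficients back
    return keep, beta

# Usage: keep, beta = alasso_with_rfs_weights(X_train, y_train)
```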

2021 ◽  
Vol 18 (2) ◽  
pp. 172988142199958
Author(s):  
Larkin Folsom ◽  
Masahiro Ono ◽  
Kyohei Otsu ◽  
Hyoshin Park

Mission-critical exploration of uncertain environments requires reliable and robust mechanisms for achieving information gain. Typical measures of information gain, such as Shannon entropy and KL divergence, either fail to distinguish between different bimodal probability distributions or introduce bias toward one mode of a bimodal distribution. Using a standard deviation (SD) metric reduces bias while retaining the ability to distinguish between higher- and lower-risk distributions. Areas of high SD can be safely explored through observation with an autonomous Mars Helicopter, allowing safer and faster path plans for ground-based rovers. First, this study presents a single-agent, information-theoretic, utility-based path-planning method for a highly correlated uncertain environment. Then, an information-theoretic two-stage multiagent rapidly exploring random tree framework is presented, which guides the Mars Helicopter through regions of high SD to reduce uncertainty for the rover. In a Monte Carlo simulation, we compare our information-theoretic framework with a rover-only approach and a naive approach in which the helicopter scouts ahead of the rover along its planned path. Finally, the model is demonstrated in a case study on the Jezero region of Mars. Results show that the information-theoretic helicopter improves the rover's travel time on average compared with the rover alone or with the helicopter scouting ahead along the rover's initially planned route.
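A brief sketch of the SD metric on a gridded belief, assuming a per-cell bimodal traversal-cost distribution represented by samples; the selection rule (send the scout to the k highest-SD cells) is an illustrative stand-in for the paper's utility-based planner, not its actual algorithm.

```python
# Sketch under the assumptions above.
import numpy as np

rng = np.random.default_rng(0)

# Belief over traversal cost at each grid cell, represented by posterior samples
# drawn from a bimodal mixture (ambiguous "safe" vs. "hazardous" terrain).
samples = np.where(rng.random((50, 50, 200)) < 0.5,
                   rng.normal(1.0, 0.1, (50, 50, 200)),   # "safe" mode
                   rng.normal(5.0, 0.5, (50, 50, 200)))   # "hazardous" mode

sd_map = samples.std(axis=2)   # SD metric: large where the belief is most uncertain

# Send the scout to the k cells with highest SD; observing them collapses the
# belief there before the rover's path is re-planned.
k = 10
flat = np.argsort(sd_map.ravel())[::-1][:k]
waypoints = np.column_stack(np.unravel_index(flat, sd_map.shape))
print(waypoints)
```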


2010 ◽  
Vol 9 ◽  
pp. CIN.S3794 ◽  
Author(s):  
Xiaosheng Wang ◽  
Osamu Gotoh

Gene selection is of vital importance in the molecular classification of cancer using high-dimensional gene expression data. Because of the distinct characteristics inherent to specific cancerous gene expression profiles, developing flexible and robust feature selection methods is crucial. We investigated the properties of a feature selection approach proposed in our previous work, which generalizes the feature selection method based on the depended degree of attribute in rough sets. We compared this method with established methods (the depended degree, chi-square, information gain, Relief-F, and symmetric uncertainty) and analyzed its properties through a series of classification experiments. The results revealed that our method is more robust and more widely applicable than the canonical depended-degree-based method. Moreover, it is comparable to the other four commonly used methods. More importantly, the method can reveal the inherent classification difficulty of different gene expression datasets, reflecting the inherent biology of specific cancers.
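For context, a small sketch of the classical rough-set depended degree of an attribute that the method generalizes: the fraction of samples whose equivalence class under a discretized gene is pure with respect to the class label. The discretization and toy values are assumptions for illustration.

```python
# Sketch of the classical depended degree (dependency degree) in rough sets.
import numpy as np

def depended_degree(attr_values, labels):
    """attr_values: discretized expression levels of one gene; labels: classes."""
    attr_values = np.asarray(attr_values)
    labels = np.asarray(labels)
    consistent = 0
    for v in np.unique(attr_values):
        block = labels[attr_values == v]      # equivalence class for value v
        if np.unique(block).size == 1:        # pure block -> in the positive region
            consistent += block.size
    return consistent / labels.size

# Example: one gene discretized to {low, mid, high}
gene = ["low", "low", "high", "mid", "high", "mid"]
y    = [0,     0,     1,      0,     1,      1]
print(depended_degree(gene, y))   # 0.666..., since the 'mid' block is impure
```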


2008 ◽  
Vol 26 (13) ◽  
pp. 2139-2146 ◽  
Author(s):  
Alex A. Adjei ◽  
Roger B. Cohen ◽  
Wilbur Franklin ◽  
Clive Morris ◽  
David Wilson ◽  
...  

Purpose To assess the tolerability, pharmacokinetics (PKs), and pharmacodynamics (PDs) of the mitogen-activated protein kinase kinase (MEK) 1/2 inhibitor AZD6244 (ARRY-142886) in patients with advanced cancer. Patients and Methods In part A, patients received escalating doses to determine the maximum-tolerated dose (MTD). In both parts, blood samples were collected to assess PK and PD parameters. In part B, patients were stratified by cancer type (melanoma v other) and randomly assigned to receive the MTD or 50% MTD. Biopsies were collected to determine inhibition of ERK phosphorylation, Ki-67 expression, and BRAF, KRAS, and NRAS mutations. Results Fifty-seven patients were enrolled. MTD in part A was 200 mg bid, but this dose was discontinued in part B because of toxicity. The 50% MTD (100 mg bid) was well tolerated. Rash was the most frequent and dose-limiting toxicity. Most other adverse events were grade 1 or 2. The PKs were less than dose proportional, with a median half-life of approximately 8 hours and inhibition of ERK phosphorylation in peripheral-blood mononuclear cells at all dose levels. Paired tumor biopsies demonstrated reduced ERK phosphorylation (geometric mean, 79%). Five of 20 patients demonstrated ≥ 50% inhibition of Ki-67 expression, and RAF or RAS mutations were detected in 10 of 26 assessable tumor samples. Nine patients had stable disease (SD) for ≥ 5 months, including two patients with SD for 19 (thyroid cancer) and 22 (uveal melanoma plus renal cancer) 28-day cycles. Conclusion AZD6244 was well tolerated with target inhibition demonstrated at the recommended phase II dose. PK analyses supported twice-daily dosing. Prolonged SD was seen in a variety of advanced cancers. Phase II studies are ongoing.


Author(s):  
Subrata Mukherjee ◽  
Xuhui Huang ◽  
Lalita Udpa ◽  
Yiming Deng

Abstract Systems in service continue to degrade with the passage of time. Pipelines are among the most common systems that wear away with usage. For public safety, it is of utmost importance to monitor pipelines and detect new defects within them. Magnetic flux leakage (MFL) testing is a widely used nondestructive evaluation (NDE) technique for defect detection within pipelines, particularly those composed of ferromagnetic materials. Pipeline inspection gauge (PIG) procedures based on line-scans or 2D-scans can collect accurate MFL readings for defect detection. However, in real-world applications involving large pipe-sectors, such extensive scanning techniques are extremely time-consuming and costly. In this paper, we develop a fast and inexpensive methodology that does not need MFL readings at all the points used in traditional PIG procedures, yet achieves defect detection with similar accuracy. We consider an under-sampling based scheme that collects MFL at uniformly chosen random scan-points over large lattices instead of extensive PIG scans over all lattice points. Based on the readings at the chosen random scan points, we use Kriging to reconstruct MFL readings over the entire pipe-sector. Thereafter, we apply thresholding-based segmentation to the reconstructed data to detect defective areas. We demonstrate the applicability of our methodology on synthetic data generated using popular finite element models as well as on MFL data collected via laboratory experiments. In these experiments, spanning a wide range of defect types, the proposed MFL-based NDE methodology shows operating characteristics within the acceptable threshold of traditional PIG-based methods, thus providing an extremely cost-effective and fast procedure with competitive error rates that can be used to scan massive pipeline sectors.
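A hedged sketch of the pipeline on synthetic data: sample a random subset of lattice points, reconstruct the field with Gaussian-process regression as a Kriging stand-in, then threshold. The kernel, sampling fraction, threshold, and synthetic field are illustrative assumptions, not the paper's settings.

```python
# Sketch under the assumptions above.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)

# Synthetic "MFL" field on a 40x40 lattice with one defect-like bump.
xx, yy = np.meshgrid(np.linspace(0, 1, 40), np.linspace(0, 1, 40))
field = np.exp(-((xx - 0.6) ** 2 + (yy - 0.3) ** 2) / 0.01)

grid = np.column_stack([xx.ravel(), yy.ravel()])
idx = rng.choice(grid.shape[0], size=200, replace=False)   # ~12% of lattice points
X_obs, y_obs = grid[idx], field.ravel()[idx]

# Gaussian-process regression as a Kriging-style interpolator.
gp = GaussianProcessRegressor(kernel=RBF(0.1) + WhiteKernel(1e-4), normalize_y=True)
gp.fit(X_obs, y_obs)
recon = gp.predict(grid).reshape(field.shape)

defect_mask = recon > 0.5 * recon.max()    # simple thresholding-based segmentation
print("flagged cells:", defect_mask.sum())
```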


2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Tawfiq Hasanin ◽  
Taghi M. Khoshgoftaar ◽  
Joffrey L. Leevy ◽  
Richard A. Bauder

Abstract Severe class imbalance between majority and minority classes in Big Data can bias the predictive performance of Machine Learning algorithms toward the majority (negative) class. Where the minority (positive) class holds greater value than the majority (negative) class and false negatives incur a greater penalty than false positives, this bias may lead to adverse consequences. Our paper incorporates two case studies, each utilizing three learners, six sampling approaches, two performance metrics, and five sampled distribution ratios, to uniquely investigate the effect of severe class imbalance on Big Data analytics. The learners (Gradient-Boosted Trees, Logistic Regression, Random Forest) were implemented within the Apache Spark framework. The first case study is based on a Medicare fraud detection dataset. The second case study, unlike the first, includes training data from one source (SlowlorisBig Dataset) and test data from a separate source (POST dataset). Results from the Medicare case study are not conclusive regarding the best sampling approach under the Area Under the Receiver Operating Characteristic Curve and Geometric Mean performance metrics, although Random Undersampling performs adequately. For the SlowlorisBig case study, Random Undersampling convincingly outperforms the other five sampling approaches (Random Oversampling, Synthetic Minority Over-sampling TEchnique, SMOTE-borderline1, SMOTE-borderline2, ADAptive SYNthetic) on both metrics. Based on its classification performance in both case studies, Random Undersampling is the best choice, as it yields models trained on a significantly smaller number of samples, thus reducing computational burden and training time.
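A compact scikit-learn sketch of the core comparison, standing in for the authors' Apache Spark pipeline: random undersampling of the majority class to a 50:50 ratio, then AUROC and the geometric mean of class-wise recalls. The toy data, learner, and target ratio are assumptions.

```python
# Sketch under the assumptions above (not the Spark implementation).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Random undersampling: keep all positives, sample an equal number of negatives.
rng = np.random.default_rng(0)
pos = np.flatnonzero(y_tr == 1)
neg = rng.choice(np.flatnonzero(y_tr == 0), size=pos.size, replace=False)
idx = np.concatenate([pos, neg])

clf = RandomForestClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
proba = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)

auc = roc_auc_score(y_te, proba)                                # AUROC
gmean = np.sqrt(recall_score(y_te, pred) *                      # GM of recalls
                recall_score(y_te, pred, pos_label=0))
print(f"AUROC={auc:.3f}  GM={gmean:.3f}")
```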


Author(s):  
Baoyun Xia ◽  
Benjamin C Blount ◽  
Tonya Guillot ◽  
Christina Brosius ◽  
Yao Li ◽  
...  

Abstract Introduction The tobacco-specific nitrosamines (TSNAs) are an important group of carcinogens found in tobacco and tobacco smoke. To describe and characterize the levels of TSNAs in the Population Assessment of Tobacco and Health (PATH) Study Wave 1 (2013–2014), we present four biomarkers of TSNA exposure: N′-nitrosonornicotine, N′-nitrosoanabasine, N′-nitrosoanatabine, and 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanol (NNAL), the primary urinary metabolite of 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone. Methods We measured total TSNAs in urine from 11 522 adults using automated solid-phase extraction coupled to isotope dilution liquid chromatography–tandem mass spectrometry. After exclusions in the current analysis, we retained 11 004 NNAL results, 10 753 N′-nitrosonornicotine results, 10 919 N′-nitrosoanatabine results, and 10 996 N′-nitrosoanabasine results for data analysis. Geometric means and correlations were calculated using SAS and SUDAAN. Results TSNA concentrations were associated with the choice of tobacco product and frequency of use. Among established, every day, exclusive tobacco product users, the geometric mean urinary NNAL concentration was highest for smokeless tobacco users (993.3 ng/g creatinine; 95% confidence interval [CI]: 839.2, 1147.3), followed by all types of combustible tobacco product users (285.4 ng/g creatinine; 95% CI: 267.9, 303.0), poly tobacco users (278.6 ng/g creatinine; 95% CI: 254.9, 302.2), and e-cigarette product users (6.3 ng/g creatinine; 95% CI: 4.7, 7.9). TSNA concentrations were higher in every day users than in intermittent users for all tobacco product groups. Among single product users, exposure to TSNAs differed by sex, age, race/ethnicity, and education. Urinary TSNAs and nicotine metabolite biomarkers were also highly correlated. Conclusions We have provided PATH Study estimates of TSNA exposure among US adult users of a variety of tobacco products. These data can inform future tobacco product and human exposure evaluations and related regulatory activities.
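As an unweighted illustration only (the study's estimates use SAS and SUDAAN with survey weights), a geometric mean and its 95% CI can be computed on the log scale; the toy NNAL values below are invented.

```python
# Unweighted sketch; not the study's survey-weighted estimation.
import numpy as np
from scipy import stats

nnal_ng_per_g_creatinine = np.array([650.0, 910.0, 1200.0, 480.0, 1500.0, 820.0])

logs = np.log(nnal_ng_per_g_creatinine)
gm = np.exp(logs.mean())                          # geometric mean
half = stats.t.ppf(0.975, logs.size - 1) * logs.std(ddof=1) / np.sqrt(logs.size)
ci = np.exp(logs.mean() - half), np.exp(logs.mean() + half)
print(f"GM = {gm:.1f} ng/g creatinine, 95% CI ({ci[0]:.1f}, {ci[1]:.1f})")
```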


2019 ◽  
Vol 12 (1) ◽  
pp. 106 ◽  
Author(s):  
Romulus Costache ◽  
Quoc Bao Pham ◽  
Ehsan Sharifi ◽  
Nguyen Thi Thuy Linh ◽  
S.I. Abba ◽  
...  

Given the significant increase in the negative effects of flash-floods worldwide, the main goal of this research is to evaluate the power of the Analytical Hierarchy Process (AHP), k-Nearest Neighbors (kNN), and K-Star (KS) algorithms and their ensembles in flash-flood susceptibility mapping. To train the two stand-alone models and their ensembles, the areas affected in the past by torrential phenomena are first identified using remote sensing techniques. Approximately 70% of these areas are used as a training data set along with 10 flash-flood predictors. It should be remarked that remote sensing techniques play a crucial role in obtaining eight of the 10 flash-flood conditioning factors. The predictive capability of the predictors is evaluated through the Information Gain Ratio (IGR) method. As expected, the slope angle proves to be the factor with the highest predictive capability. The application of the AHP model involves the construction of ten pair-wise comparison matrices for calculating the normalized weight of each flash-flood predictor. The computed weights are used as input data in the kNN–AHP and KS–AHP ensemble models for calculating the Flash-Flood Potential Index (FFPI). The FFPI is also determined with the kNN and KS stand-alone models. The performance of the models is evaluated using statistical metrics (i.e., sensitivity, specificity, and accuracy), while the results are validated by constructing Receiver Operating Characteristic (ROC) curves, computing Area Under the Curve (AUC) values, and calculating the density of torrential pixels within the FFPI classes. Overall, the best performance is obtained by the kNN–AHP ensemble model.
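A short sketch of the AHP weighting step, assuming a hypothetical 3x3 pairwise comparison matrix rather than the study's ten matrices: the normalized weights come from the principal eigenvector and are checked with Saaty's consistency ratio.

```python
# Sketch with an invented pairwise matrix for three hypothetical predictors.
import numpy as np

# Pairwise importance of three hypothetical predictors (slope, land use, rainfall).
A = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])

vals, vecs = np.linalg.eig(A)
k = np.argmax(vals.real)
weights = vecs[:, k].real
weights = weights / weights.sum()            # normalized AHP weights

n = A.shape[0]
ci = (vals.real[k] - n) / (n - 1)            # consistency index
cr = ci / 0.58                               # Saaty's random index RI = 0.58 for n = 3
print("weights:", np.round(weights, 3), "CR:", round(cr, 3))
```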

