Background. Our publication of the pathways of topological rank analysis (PoTRA) algorithm demonstrated a novel approach that uses Google's PageRank algorithm to analyze gene expression networks and identify biological pathways significantly disrupted in hepatocellular carcinoma. Before the PoTRA algorithm can be applied to other cancer gene expression data sets, which vary in size and in normal:tumor composition, two questions must be answered: (1) What is the optimal normal:tumor sample ratio? (2) What is the minimum number of samples that should be used for PoTRA analysis? To address these questions, the average standard deviation (SD) in the ranks of PoTRA-ranked, mRNA-mediated dysregulated pathways was studied using randomly sampled data sets of various normal:tumor ratios and sizes drawn from the TCGA Breast Invasive Carcinoma (TCGA-BRCA) project.
Methods. To identify the optimal normal:tumor sample ratio, the SD analysis used random combinations of normal and tumor samples at 1:N normal:tumor ratios (1:1, 1:2, 1:3, 1:5, 1:7, and 1:9). To identify the minimum sample size, normal and tumor samples were randomly resampled at balanced sizes of (3 vs 3), (5 vs 5), (10 vs 10), (25 vs 25), (50 vs 50), (75 vs 75), (100 vs 100), and (113 vs 113).
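The resampling design described in the Methods can be illustrated with a minimal Python sketch. It is not the PoTRA implementation: the function names (`rank_pathways`, `average_rank_sd`, `toy_score`), the number of repeats, the seed, and the sample identifiers are all hypothetical, and `toy_score` is a placeholder for whatever pathway scoring PoTRA performs. The sketch only shows how an average rank SD could be computed over repeated random draws at a chosen normal:tumor ratio.

```python
import random
import statistics


def rank_pathways(scores):
    """Map each pathway name to its rank (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def average_rank_sd(normal_ids, tumor_ids, n_normal, n_tumor,
                    score_fn, repeats=50, seed=0):
    """Average, over pathways, of the SD of each pathway's rank across draws.

    score_fn(normals, tumors) must return {pathway_name: score}. It stands in
    for the real pathway analysis, which is NOT implemented here.
    """
    rng = random.Random(seed)
    ranks_by_pathway = {}
    for _ in range(repeats):
        # Draw samples without replacement at the requested normal:tumor sizes.
        normals = rng.sample(normal_ids, n_normal)
        tumors = rng.sample(tumor_ids, n_tumor)
        for name, rank in rank_pathways(score_fn(normals, tumors)).items():
            ranks_by_pathway.setdefault(name, []).append(rank)
    # Population SD of each pathway's rank, then the mean across pathways.
    per_pathway_sd = [statistics.pstdev(r) for r in ranks_by_pathway.values()]
    return statistics.mean(per_pathway_sd)


def toy_score(normals, tumors):
    """Hypothetical placeholder scoring; it has no biological meaning."""
    return {
        f"pathway_{k}": (sum(tumors) * (k + 1)) % 7 - (sum(normals) * (k + 2)) % 5
        for k in range(5)
    }


if __name__ == "__main__":
    normal_ids = list(range(100))        # hypothetical normal sample IDs
    tumor_ids = list(range(100, 1000))   # hypothetical tumor sample IDs
    for n_fold in (1, 2, 3, 5, 7, 9):    # the 1:N ratios listed in the Methods
        sd = average_rank_sd(normal_ids, tumor_ids, 50, 50 * n_fold, toy_score)
        print(f"1:{n_fold} average rank SD = {sd:.3f}")
```

The same function covers the sample-size experiment by holding the ratio at 1:1 and varying `n_normal` and `n_tumor` together (3, 5, 10, and so on).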
Results. The analysis suggests that the 1:1 ratio achieves the lowest average rank variation and that the average rank variation reaches a steady state at a sample size of 50 normal and 50 tumor samples.
Conclusion. Future applications of the PoTRA algorithm to gene expression data sets such as TCGA should use balanced data sets with at least 50 normal and 50 tumor samples to ensure the most robust performance.