Effect of Number of Markers and of Sample Size on the Prediction Accuracy of Whole Genome Prediction Methods

Friday, October 28, 2011
Hall 1-2 (San Jose Convention Center)
Kamil Suliveres, HS, Sciences and Technology, Universidad Metropolitana (UMET), Puerto Rico, San Juan, PR
Gustavo de los Campos, PhD, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL
Hemant Tiwari, PhD, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL
Personalized medicine requires accurate predictions of genetic predisposition to disease. Recent studies suggest that Whole Genome Prediction (WGP) could be an effective method for developing such predictions. However, little is known about the factors affecting the accuracy of WGP. Using data from the Framingham Heart Study, we evaluated how the number of markers (p) and the number of individuals (n) used to train models affect the prediction accuracy of WGP applied to adult height adjusted for cohort, sex and age. Models were evaluated over a grid of values of p (p=0.2K to 400K, K=thousand) and of n (n=3K, 5K, 7K, 8.5K), and prediction accuracy (R2=squared correlation) was evaluated on 10 different testing datasets, each comprising 480-520 individuals. Using p=400K and n=8.5K, R2 averaged 0.28. This is slightly higher than what other studies using WGP have found for this trait (R2=0.25, Makowsky et al., PLoS Genetics, 2011) and much higher than what has been obtained using models based on subsets of SNPs selected from GWAS (R2=0.10, Lango Allen et al., Nature, 2010). R2 increased monotonically with both p and n: using n=8.5K, R2=.11, .18, .26, .28 for p=3.5K, 14K, 100K and 400K, respectively; with p=400K, R2=.16, .17, .22, .25, .28 for n=3K, 5K, 7K and 8.5K, respectively. We conclude that a large number of variants is needed to capture genetic differences for this trait and that WGP can potentially yield accurate predictions of complex traits; however, a large sample size is required to realize this potential.
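
The accuracy measure described above (R2 as a squared correlation, averaged over several testing datasets) can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes the correlation is a Pearson correlation between observed and predicted phenotypes, and the function names `squared_correlation` and `mean_r2` are hypothetical.

```python
def squared_correlation(observed, predicted):
    """Squared Pearson correlation between observed and predicted values.

    Assumption: R2 is taken as the square of the Pearson correlation
    between observed and predicted phenotypes within one testing set.
    """
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n

    # Sums of cross-products and squares of deviations from the means.
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")

    return (cov * cov) / (var_o * var_p)


def mean_r2(test_sets):
    """Average R2 over an iterable of (observed, predicted) testing sets."""
    scores = [squared_correlation(obs, pred) for obs, pred in test_sets]
    if not scores:
        raise ValueError("no testing sets supplied")
    return sum(scores) / len(scores)
```

Under these assumptions, each of the 10 testing datasets would contribute one `(observed, predicted)` pair, and `mean_r2` would return the averaged accuracy for one combination of p and n.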