Machine Learning accurately predicts human height from genetic information

Stephen Hsu reports on his team's work applying novel machine learning methods (“compressed sensing”) to ~500,000 genomes from UK Biobank, resulting in an accurate predictor of human height that uses information from thousands of SNPs.

Hsu has also predicted that, with the gene sequences of about 1 million people, it should be possible to construct a good genomic predictor for cognitive ability and identify most of the associated common SNPs. Now that his team has gotten close to a good genetic predictor of height, his prediction about intelligence seems more credible. One challenge is that intelligence cannot be measured as precisely and unambiguously as height.

Hsu summarizes the results in three points:

1. The actual heights of most individuals in our replication tests are within a few centimeters of their predicted height.

2. The variance captured by the predictor is similar to the estimated GCTA-GREML SNP heritability. Thus, our results resolve the missing heritability problem for common SNPs.

3. Out-of-sample validation on ARIC individuals (a US cohort) shows the predictor works on that population as well. The SNPs activated in the predictor overlap with previous GWAS hits from GIANT.
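Points 1 and 2 are two views of the same number. For a linear predictor, the fraction of variance captured is the square of the correlation between predicted and actual values, and the typical miss shrinks by a factor of sqrt(1 - r^2). The small sketch below works this through using the ~0.65 correlation quoted in the paper's abstract. The 7 cm height standard deviation is an illustrative assumption, not a figure from the paper, and the function names are made up for this example.

```python
import math


def variance_explained(r):
    """Fraction of trait variance captured by a predictor with correlation r."""
    return r * r


def residual_sd(r, trait_sd):
    """Standard deviation of (actual - predicted) after the best linear
    rescaling of the predictor, for a trait with standard deviation trait_sd."""
    return trait_sd * math.sqrt(1.0 - r * r)


r = 0.65            # predicted-vs-actual correlation quoted in the abstract
height_sd_cm = 7.0  # ASSUMPTION: illustrative height SD, not taken from the paper

print(f"variance explained: {variance_explained(r):.2f}")          # 0.42
print(f"residual SD: {residual_sd(r, height_sd_cm):.1f} cm")        # 5.3 cm
```

Under that assumed spread, a correlation of 0.65 corresponds to roughly 42 percent of variance and a typical prediction error of about 5 cm. A different assumed standard deviation scales the second number proportionally.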

bioRxiv – Accurate Genomic Prediction Of Human Height

We construct genomic predictors for heritable and extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). Replication tests show that these predictors capture, respectively, ~40, 20, and 9 percent of total variance for the three traits. For example, predicted heights correlate ~0.65 with actual height; actual heights of most individuals in validation samples are within a few cm of the prediction. The variance captured for height is comparable to the estimated SNP heritability from GCTA (GREML) analysis, and seems to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability for the SNPs used. Thus, our results resolve the common SNP portion of the “missing heritability” problem – i.e., the gap between prediction R-squared and SNP heritability. The ~20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common SNPs. Our primary dataset is the UK Biobank cohort, comprised of almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier GWAS for out-of-sample validation of our results.
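Neither the abstract nor Hsu's summary spells out the algorithm beyond "compressed sensing" and high dimensional statistics. One standard way to get a sparse predictor of this kind is L1-penalized regression (LASSO), and the toy sketch below assumes that approach. Everything in it is hypothetical: the simulated genotypes, the three causal SNP indices and their effect sizes, the penalty `lam`, and the helper names. It is not the team's pipeline, only an illustration of how an L1 penalty "activates" a small subset of SNPs and how the resulting predictor is scored by its correlation on held-out individuals.

```python
import random


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=50):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b, since b starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of SNP j with the residual that excludes SNP j's own term
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                beta[j] = new_bj
    return beta


def simulate(n, p, causal, rng, noise_sd=1.0):
    """Toy data: centered 0/1/2 allele counts and a sparse additive phenotype."""
    X = [[(rng.random() < 0.3) + (rng.random() < 0.3) - 0.6 for _ in range(p)]
         for _ in range(n)]
    y = [sum(effect * row[j] for j, effect in causal.items()) + rng.gauss(0.0, noise_sd)
         for row in X]
    return X, y


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5


rng = random.Random(0)
causal = {3: 1.0, 17: -0.8, 42: 0.6}  # HYPOTHETICAL causal SNPs and effect sizes
X_train, y_train = simulate(300, 60, causal, rng)
X_test, y_test = simulate(200, 60, causal, rng)  # held-out individuals

beta = lasso_coordinate_descent(X_train, y_train, lam=0.1)
active = [j for j, b in enumerate(beta) if b != 0.0]
pred = [sum(b * x for b, x in zip(beta, row)) for row in X_test]
r = pearson(pred, y_test)

print("activated SNPs:", active)
print(f"out-of-sample correlation: {r:.2f}, variance explained: {r * r:.2f}")
```

Raising `lam` activates fewer SNPs, while lowering it admits more, including ones with no real effect. The real analysis faces the same trade-off at a vastly larger scale, with hundreds of thousands of individuals and SNPs.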