SNP CLICK: An analysis of how machine learning can help us understand our genes
By Dinah Landsman
With the advent of the digital revolution, it is no surprise that scientific research and inquiry skyrocketed at unprecedented frequency. The world’s best minds were suddenly connected at the drop of a hat, and possibilities seemed limitless. With the introduction of the internet, so too came the sophistication of algorithms and what tech savvy individuals like to call “deep learning”. Deep learning allows us to investigate wide data sets, draw immediate comparisons, and refine statistical analysis. Deep learning shows immense promise in solving genetic problems that until the sequencing of the human genome had remained mysteries. It’s important to consider how the sophistication of genetic algorithms can help researchers with precision medicine and new approaches to understanding our genetic sequence.
Since the Human Genome Project, genetic information of all organisms has been analyzed with various methods, including what's known as a “Genome Wide Association Study”. A GWAS takes into account small variations in individual gene sets, called Single Nucleotide Polymorphisms, or SNPS. They are qualified as observational studies of these variants compared with the entire genome to see if any of them are associated with heritable traits. On a population level, it can be difficult to determine the frequency of a SNP, or if it’s population specific. Often, disease and phenotype associated SNPs are included in genetic risk factor calculations. Modern medicine has been tasked with figuring out how to interpret these variants as relevant both for an individual and a population. Thus far, a significant obstacle has been an insufficient P-value threshold, or skewed probability predictions, that prohibits the SNP analysis from providing accurate comparisons or polygenic risk score models. There has been no mathematical approach capable of evaluating and comparing all four to five million SNPs per person. That is, until scientists figured out that they could harness machine learning to raise the P-value threshold, and form complex associations between sets all at the same time.
Machines can approach polygenic risk calculations using supervised modeling based on SNP data which involves training the pre-set learning algorithms to map relationships between individual sample genotype data and the associated disease. The machine can interpret patterns, form connections, and spin a complex web of SNP associations. Deep learning relies on a formulaic approach to genome wide associations to create patterns and neural networks that allow it to interpret relevant SNP data to predict inheritance and common traits. Optimal predictive power for the target disease or phenotype is achieved by mapping the pattern of the selective features. This also makes the data transmissible and statistically significant for other such studies. Machine learning isn’t perfect, but scientists have found that it is an excellent tool for confronting massive data sets. One such technique is called LayerWise Relevance Propagation.
Layer Wise Relevance Propagation (LRP) is the best way to interpret the decisions made by complicated algorithms regarding heritability and predicted phenotype. For a machine learning model to be able to interpret data and generalize accurately, LRP can be used to ensure that its decisions are substantiated by relevant input data samples. Thus far, LRP has been widely utilized to explain convolution neural networks. Very often, it is able to explain its decisions by signifying which input features are relevant to its prediction, and how they compare to the entire data set. LRP in the future will potentially be expanded to non-euclidean domains as well to display mechanisms of molecular interactions. When applied to SNP associations, LRP has immense potential for providing a universal baseline for understanding gene inheritance and how to prevent or detect the inheritance of particular phenotypes and diseases. The genome-wide associations provide massive data sets for machine learning, and LRP functions as an extremely coherent way to understand deep learning predictions and reasoning.
Artificial intelligence and machine learning are at their core, tools to help perfect the scientific method, and allow for massive strides in genetics research and practice. They are extremely promising for the future of disease prevention, and post far reaching implications for the development of a safer, more educated world.