The overall goal is to predict autoimmune disease risk from the gut microbial community composition of Swedish children, measured in fecal samples collected at 1 year of age. Samples from 1748 participants enrolled in the All Babies in Sweden (ABIS) study were obtained, processed, sequenced, and analyzed by Dr. Eric Triplett’s lab. Microbial community composition was calculated as the relative abundance of each bacterial taxonomic unit in each sample. Of the participants, 94 have been diagnosed with an autoimmune disease, including type 1 diabetes, celiac disease, juvenile idiopathic arthritis, and a handful of related disorders. This project will use deep learning algorithms to predict disease risk from the existing microbial community composition data.
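As a small illustration of the relative-abundance calculation, the sketch below divides each taxon's read count by the sample's total. The function name, the taxa, and the counts are hypothetical placeholders, not values from the ABIS data, and the lab's actual processing pipeline may differ.

```python
def relative_abundance(counts):
    """Convert raw per-taxon read counts for one sample into proportions."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical read counts for a single sample (illustrative values only).
sample = {"Bacteroides": 500, "Bifidobacterium": 300, "Escherichia": 200}
abundances = relative_abundance(sample)
```

Each sample then becomes a vector of non-negative values that sum to 1, which is the form of data the prediction models would consume.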
Unfortunately, deep learning methods typically require orders of magnitude more training data than is available in this study, a common problem when applying complex statistical models to limited biomedical data. To overcome this data shortage, we propose to use approaches from game theory to ‘bootstrap’ the existing data, essentially generating ‘new’ data that is statistically indistinguishable from the existing data. This ‘new’ data will then be combined with the existing data and used to train disease-risk prediction models.
The approach we will apply is called a Generative Adversarial Network (GAN). This approach pits two neural networks against each other in a ‘competitive game’, the result of which is a generator capable of fabricating highly realistic data samples from random noise. The ‘game’ consists of a generator network that fabricates ‘fake’ data samples, and a discriminator network whose job is to determine whether a specific data sample is real or fake. The generator-discriminator pair is trained using backpropagation, allowing the generator to get better and better at generating fake data, while the discriminator gets better and better at determining which samples are real versus fake. If all goes well, the generator will ‘learn’ the statistical distribution underlying the real data samples so well that the discriminator (or any complex statistical model) is unable to distinguish real from fake data samples.
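The alternating training game can be sketched on a deliberately tiny problem. The example below is a toy under stated assumptions, not the proposed model: the ‘real’ data are one-dimensional draws from a Gaussian, the generator is a single linear map G(z) = a*z + b, the discriminator is a logistic unit D(x) = sigmoid(w*x + c), and the gradients are written out by hand instead of using a deep learning library. All names and hyperparameters are hypothetical.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, lr=0.01, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator parameters: G(z) = a*z + b
    w, c = 0.1, 0.0   # discriminator parameters: D(x) = sigmoid(w*x + c)
    for _ in range(steps):
        x_real = rng.gauss(4.0, 1.0)   # stand-in for one real data sample
        z = rng.gauss(0.0, 1.0)        # random noise fed to the generator
        x_fake = a * z + b

        # Discriminator step: gradient descent on
        # -[log D(x_real) + log(1 - D(x_fake))].
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        grad_w = -(1.0 - d_real) * x_real + d_fake * x_fake
        grad_c = -(1.0 - d_real) + d_fake
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: gradient descent on -log D(G(z)),
        # backpropagated through the just-updated discriminator.
        d_fake = sigmoid(w * x_fake + c)
        grad_x = -(1.0 - d_fake) * w   # d(loss)/d(x_fake)
        a -= lr * grad_x * z
        b -= lr * grad_x
    return (a, b), (w, c)


(gen_a, gen_b), (disc_w, disc_c) = train_toy_gan()
```

The structure of the loop, one discriminator update followed by one generator update driven by the discriminator's gradient, is what carries over to larger networks. A model this small is not expected to converge cleanly, and GAN training in general can oscillate.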
The specific aim of this USP project proposal is to develop, implement and evaluate a generative adversarial network capable of generating reliable microbial community data samples, conditioned on whether the microbial community is associated with elevated disease risk. Ultimately, I hope to use the software developed in this research in future endeavors as a method to study the effects of various factors on the human microbial community as well as on human health and well-being.
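One common way to condition a generator on a label is to append a one-hot encoding of the label to the noise vector. The sketch below shows only the forward pass of such a generator with untrained random weights. It assumes a single linear layer followed by a softmax, so that every output is a valid relative-abundance vector; that output layer, the two-class label, and all names and dimensions are illustrative assumptions, not the architecture this project will necessarily use.

```python
import math
import random


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_conditional_generator(noise_dim, n_taxa, n_classes=2, seed=0):
    rng = random.Random(seed)
    in_dim = noise_dim + n_classes
    # Untrained random weights: one row of input weights per output taxon.
    weights = [[rng.gauss(0.0, 0.5) for _ in range(in_dim)]
               for _ in range(n_taxa)]

    def generate(risk_label):
        z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
        one_hot = [1.0 if k == risk_label else 0.0 for k in range(n_classes)]
        x = z + one_hot   # condition by concatenating the label to the noise
        logits = [sum(wi * xi for wi, xi in zip(row, x)) for row in weights]
        return softmax(logits)   # non-negative values that sum to 1

    return generate


# Hypothetical sizes: 8 noise dimensions, 5 taxa; label 1 = elevated risk.
generate = make_conditional_generator(noise_dim=8, n_taxa=5)
profile = generate(risk_label=1)
```

In a full conditional GAN the discriminator would receive the same label alongside each sample, so that the generator is rewarded for producing profiles that are realistic for the requested risk class.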