NIH-Funded Project Aims to Build a ‘Google’ for Biomedical Data
Nineteen institutions across the country are working to integrate years of data, ranging from electronic health records to genomic sequences.
July 31, 2019 | By Ruth Hailu | STAT
Every year, the National Institutes of Health spends billions of dollars on biomedical research, ranging from basic science investigations into cell processes to clinical trials. The results are published in journals and presented at academic meetings, and then researchers, building on their findings, move on to their next projects.
But what happens to the data that’s collected, and what more could we learn from it? If we aggregated all the data from countless years of research, might we learn something new about ourselves, the diseases that afflict us, and possible treatments?
That’s the hope behind the Biomedical Data Translator program, launched by the NIH in 2016: to create a “Google” for biomedical data that could sift through hundreds of separate data sources to help researchers connect “dots” in datasets with distinct formats and peculiarities.
“There is a lot of information that is currently available, through publications, and through databases … and, at the end of the day, it’s really too much for a human to be trying to mine through and make sense of,” said Christine Colvis, who leads the translator initiative at the National Center for Advancing Translational Sciences.
The program has awarded about $17.5 million to 19 institutions across the country that are working to integrate years of data, ranging from electronic health records to genomic sequences, that had previously been spread across a variety of platforms, and then to apply new machine learning tools to help organize and reason through the wealth of information.
This means that, unlike Google, the Translator would be able to make connections between datasets that had not previously been associated with each other. In the words of Colvis, the “translator would find that A is connected to B and that B is connected to C … and so on and show you how A is connected to Z using over 100 sources of data.”
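The chaining Colvis describes is, at heart, a path search over a knowledge graph whose edges come from different data sources. A minimal sketch of the idea, with a toy graph and purely illustrative concept names (the Translator's actual data model and scale are far richer):

```python
from collections import deque

# Toy knowledge graph: each edge links a concept in one dataset to a
# related concept in another. Names here are invented for illustration.
edges = {
    "gene_A": ["protein_B"],
    "protein_B": ["pathway_C"],
    "pathway_C": ["disease_Z"],
}

def find_path(graph, start, goal):
    """Breadth-first search returning one chain of connections, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(find_path(edges, "gene_A", "disease_Z"))
# → ['gene_A', 'protein_B', 'pathway_C', 'disease_Z']
```

Each hop in the returned chain could, in principle, come from a different database, which is what makes the format-standardization work described below a prerequisite.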
The size and collaborative nature of the project are part of what makes it unique, but also part of what makes it successful, according to Colvis. Each group attends online meetings three or four times a week and travels to “hackathons,” held twice a year, where all the teams gather for a week to troubleshoot and discuss ideas in person.
“This project is actually trying to do a technical feasibility study: Is it feasible, first of all, to combine the different data sources into something common enough where you can ask questions,” said Will Byrd, a member of the team working on the project at the University of Alabama at Birmingham. “And once you’ve done that, is it then possible to do reasoning over this so you can then extract patterns or latent information and do things like drug re-purposing or drug discovery? That’s the hope.”
That initial feasibility phase of the project is set to end this year.
The selection process was also unusual, requiring applicants to work out the answers to a series of puzzles.
“You could go to the website and solve the puzzle, and each puzzle you solved unlocked one page of a PDF document that was the request for proposals. If you solved all the puzzles, then you could submit,” Byrd said. “They were trying to attract teams that had certain types of problem-solving skills.”
Some of the participating institutions work on data architecture, which includes standardizing the format of the various datasets.
“The problem is [that] these databases are not in a common format — they were never designed to work together,” Byrd said. “NIH has spent billions of dollars to collect these data, but people can’t access it. If you are a biologist or a physician-scientist, you have no chance, unless you’re like an expert programmer.”
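The kind of field-level standardization this implies can be sketched in a few lines. In this hypothetical example (all field names are invented, not drawn from any actual Translator source), two databases describe the same drug-gene relationship in incompatible vocabularies, and a per-source mapping renders them comparable:

```python
# Two hypothetical records describing the same drug, pulled from sources
# that were never designed to work together (field names are invented).
record_from_source_1 = {"drug_name": "aspirin", "target": "PTGS2"}
record_from_source_2 = {"compound": "aspirin", "protein_hit": "PTGS2"}

# A per-source mapping of local field names into one shared vocabulary.
FIELD_MAPS = {
    "source_1": {"drug_name": "chemical", "target": "gene"},
    "source_2": {"compound": "chemical", "protein_hit": "gene"},
}

def normalize(record, source):
    """Rename a record's fields into the common schema."""
    return {FIELD_MAPS[source][key]: value for key, value in record.items()}

# Once normalized, records from different databases can be compared directly.
a = normalize(record_from_source_1, "source_1")
b = normalize(record_from_source_2, "source_2")
print(a == b)  # → True
```

Real biomedical integration also has to reconcile identifiers and ontologies, not just field names, but the principle is the same: agree on a common representation before reasoning across sources.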
Others are building off of this work to create a platform for researchers to sift through the information and interpret the results of their searches — work that would usually require a computational scientist. This platform could help researchers understand the mechanisms behind rare diseases and find treatments for them, and also help explain why certain drugs and compounds work.
Stefano Rensi, a research engineer who works on the project at Stanford University, said that their group’s main focus is to create a tool to help drug industry researchers scan the scientific literature to better understand the biological mechanisms of compounds they’ve identified as possible therapies.
“Maybe you do a large-scale medical screen and you get a bunch of hits and basically it tells you what chemicals are working, but it doesn’t necessarily tell you why they’re working,” Rensi said. “You’d be surprised the number of times that drugs and compounds come out, and how little we actually know about what they do.”
There are other projects that focus directly on improving patient outcomes, such as an AI system being developed at UAB. That team’s leader, Matt Might, recently used this AI to help find what was causing his son’s life-threatening symptoms.
While the institutions’ research areas may differ, all involved emphasized the importance of the collaborative nature of the project.
“It’s basically like having a menu of great ideas from all the smartest people around the country,” Rensi said.