Transforming Open Access Biomedical Data into New Drugs and Diagnostics

Atul Butte

Atul Butte is being honored as a Champion of Change for the vision he has demonstrated and for his commitment to open science.

The past 20 years have seen amazing changes in biomedical research. Gone are the days of sequencing a small piece of DNA, or measuring the expression level of one gene, or studying one protein at a time. Scientists can now sequence an individual's entire DNA, measure the levels of every gene, and study nearly every protein, all simultaneously. Moreover, scientists perform these measurements using commercial tools and services, which are all positive outcomes from the Human Genome Project. These tools, services, and discoveries enable scientists to learn what the differences might be between individuals with disease and those without, and how we might treat those diseases.

But what is truly stunning is that, in many cases, today scientists share their raw measurements on the Internet. 

The impetus to share data comes from many directions.  Scientists share their data to enable others to reproduce their discoveries, while journal editors believe shared data helps readers trust their publications.  Biotechnology companies release data to help scientists understand their measurement platforms. And funding agencies are asking researchers to make their data publicly accessible to promote its reuse. Earlier this year, the White House Office of Science and Technology Policy directed Federal agencies with significant R&D grants and awards to ensure their recipients make their work publicly available within one year of publication.  This policy also applies to the digital data created by scientists.

Open scientific data is an amazing public resource.  Making scientific data publicly accessible costs only a small, marginal amount over its scientific creation. Public data enables transparency and reproducibility in science. Data is infinitely shareable without diminishment in its value. 

In fact, data takes on greater value when it is intersected with other data sets. Clinical trial data can be reanalyzed to find the subsets of patients who benefit most from specific drugs. DNA sequencing data from thousands of individuals can be used to learn what is “normal” and help us interpret DNA from an afflicted patient. And open data on health care costs, utilization, quality, and errors can all be integrated into apps of the future, enabling patients and consumers to make better data-driven decisions.

In my lab, we have found that combining publicly available molecular measurements made by a dozen independent researchers on the same medical condition, such as preterm birth, can yield a reliable set of diagnostic markers that would not be obvious to each researcher working separately.  We also have seen that open measurements of diseases can be integrated with measurements of drug effects, resulting in new ways to use those drugs to treat conditions.  Finding new ways to use existing drugs could help get therapies to patients with rare diseases.  And these data-driven drugs and diagnostics can even form the basis of new businesses and ventures in medicine.

In this way, open scientific data is a kind of power platform to be leveraged. But perhaps open scientific data is best described as a means to thaw “frozen discoveries,” meaning that focusing light and energy on existing knowledge can release those discoveries. This yields the new drugs, diagnostics, and knowledge we still sorely need in medicine.

I am honored to be recognized as an Open Science Champion of Change, and I thank my daughter Kimi for inspiring my work; my lab and collaborators for their efforts in making these important discoveries; my wife, Gini Deshpande, for being a life-partner and collaborator in developing my science and launching our ventures; and scientists everywhere for sharing their data with the public and for creating the tools to enable others to do so.

Atul Butte is an Associate Professor in Pediatrics and Genetics at Stanford University, and the principal investigator of ImmPort, the long-term, sustainable data warehouse for re-use of immunological data funded by the National Institute of Allergy and Infectious Diseases. He is also a founder of Personalis, which provides clinical interpretation of whole genome sequences; of Carmenta, which uses public data to discover diagnostics for life-threatening conditions in pregnancy; and of NuMedii, which uses public big data to find new uses for drugs.
