Medicine’s Big Problem with Big Data: Information Hoarding

Information that may offer medical insights has been locked away in the filing cabinets of doctors’ offices.

Jun 7, 2014, 11:25 AM UTC

Researchers at IBM, Berg Pharma, Memorial Sloan Kettering, UC Berkeley and other institutions are exploring how artificial intelligence and big data can be used to develop better treatments for diseases (as we explored in a separate story on Saturday).

But one of the biggest challenges for making full use of these computational tools in medicine is that vast amounts of data have been locked away — or never digitized in the first place.

The results of earlier research efforts or the experiences of individual patients are often trapped in the archives of pharmaceutical companies or the paper filing cabinets of doctors’ offices.

Patient privacy issues, competitive interests and the sheer lack of electronic records have prevented information sharing that could potentially reveal broader patterns in what appeared to any single doctor like an isolated incident.

When you can analyze clinical trials, genomic data and electronic medical records for 100,000 patients, “you see patterns that you don’t notice in a couple,” said Michael Keiser, an instructor at the UC San Francisco School of Medicine.

Given that promise, a number of organizations are beginning to pull together medical data sources.

Late last year, the American Society of Clinical Oncology announced the initial development of CancerLinQ, a “rapid learning system” that allows researchers to enter, access and analyze anonymized medical records of cancer patients.

Similarly, in April the CEO Roundtable on Cancer, a nonprofit representing major pharmaceutical companies, announced the launch of Project Data Sphere. It’s an open platform populated with clinical datasets from earlier Phase III studies conducted by AstraZeneca, Bayer, Celgene, Memorial Sloan Kettering, Pfizer, Sanofi and others.

The data has been harmonized and scrubbed of patient-identifying details, enabling independent researchers or those working for life sciences companies to use it freely. They have access to built-in analytical tools, or can plug the data into their own software.

It might uncover little-known drug candidates that showed some effectiveness against certain mutations, but were basically abandoned when they didn’t directly attack the principal target of a particular study, said Dr. Martin Murphy, chief executive of the CEO Roundtable on Cancer.

In some cases, it could also eliminate the need for control groups — those who receive the standard of care plus a placebo instead of the experimental treatment — since earlier studies have already indicated the outcomes for those patients. (That would be an important development because the fear of receiving a placebo is a major reason many patients decide against participating in clinical trials.)

The effort is happening now in part because of improving technology and in part because companies are coming around to the view that they’ll all be better off with the insights gleaned from this pooled data.

“It’s a recognition that it’s costing a lot more money to develop another drug,” Murphy said. “The low-hanging fruit was long ago harvested.”

Other information-sharing efforts include the Global Alliance for Genomics and Health, the molecular databases maintained by EMBL-EBI and the National Institutes of Health’s Biomarkers Consortium.

Meanwhile, last month Google Ventures led a $130 million round in Flatiron Health, which has built an “oncology cloud” that aggregates information from billing systems and electronic medical records.

The system makes sense of data stored in inconsistent and unstructured formats from doctors’ offices and hospitals, to enable analysis of what’s happening across broad cancer patient populations. Ideally it can highlight what’s working for which types of cancer patients.

“Flatiron is focused on what we (and the industry) call ‘real world’ patient clinical data, whereby we’re trying to aggregate and organize data on the 96 percent of patients who do not participate in a prospective clinical trial,” co-founder Nat Turner said in an email.

“To really understand what’s working and how others are treating and what outcomes are being achieved, institutions should be open to de-identified data sharing and anonymous benchmarking, which is part of the Flatiron vision,” he said.

To be sure, there is good reason to proceed with some caution here. Medical information is highly sensitive, so any privacy risks demand careful consideration.

Supposedly “de-identified” data has proven to be anything but on several notable occasions in the past. And electronic medical records have been compromised already.

But to the degree that there’s a social tradeoff here, many come down firmly on the side of: let’s try to save lives. Old habits and out-of-date regulations still mean the shift isn’t happening nearly fast enough, if you ask David Patterson, a professor of computer science at UC Berkeley who is developing machine learning tools for cancer research.

“Those of us in the computer field are used to Internet time and Moore’s law,” he said. “For me as an outsider, it’s very frustrating that we can’t get bureaucratic agreement so that we can collect lots of data sets together.”

“Patient privacy is important but so is making progress on cancer,” he said. “The upside of collecting lots of information together is we can make progress on this terrible disease.”

No one interviewed for this article could point to a breakthrough treatment produced by these techniques to date. After all, the tools are new, the data sets are just coming together and clinical trials take years.

But nearly all agreed researchers are on the verge of something big.

“The tips of your shoes are just poking over the edge of the peaks,” Murphy said. “No one has been over this before in cancer.”

This article originally appeared on Recode.net.
