Here's a typical AI study: "Rule-based and machine learning algorithms identify patients with systemic sclerosis accurately in the electronic health record," in which 3 million "de-identified" charts were scanned to assess machine learning methods for identifying a rare disease called systemic sclerosis.

This particular study was done at Vanderbilt University Medical Center. After review by their Institutional Review Board, the researchers used a system called the Synthetic Derivative, a de-identified collection of EHR data from over 3 million people spanning several decades.

Medically, these studies may well be quite useful. In this case, for example, an AI algorithm may one day be able to identify rare diseases such as systemic sclerosis earlier, perhaps leading to earlier intervention and even improved long-term outcomes.

I also don’t doubt the legality of studies such as these. No doubt, such a fine medical institution met all the appropriate privacy and HIPAA criteria.

My question is more fundamental, and it is simply this: using this study as an example, is EHR de-identification sufficient to use data from 3 million people, some of it collected over decades, well before EHRs even existed, without all of these people giving their explicit, knowledgeable consent?

That’s a tough question, and I am not sure the Vanderbilt Institutional Review Board, or the many similar internal ethics review boards in every research organization, healthcare organization, and high-tech AI company in the country, are sufficiently neutral to make this judgment.

I don’t question the integrity of any ethics or review board members; I have found these folks are generally selected carefully and have a high level of judgment and credibility.

Rather, I am curious about their conceptual neutrality, and their paradigm attachment.

Can any research institution whose principal tool is Big Data mining select committee members who are fundamentally opposed to the concept of the sufficiency of de-identification, a position that would effectively rule out virtually all ongoing healthcare Big Data mining operations?

Well, the cat is probably already out of the bag on this particular topic, and even if there were a public debate, I doubt anything would change. The embedded hunger for data is just too great.

But there are other cats in other bags waiting to be opened, and I would be cautious about who is opening them.