80%.

I run across this number all the time in articles about the potential use of artificial intelligence in healthcare.

It is the percentage of “unstructured” data contained in electronic health records (EHRs) – data which is prime real estate for Natural Language Processing (NLP) to develop. 

But where does this number come from? Is there a specific research study which has a clear definition of structured and unstructured data – and a large enough EHR data set – to make this statement with reasonable certainty?

Well, after several hours on PubMed, I think the answer is no.*

This 80% is said to be the “industry consensus”, and seems to be mostly driven by companies such as IBM (“Providers need new tools to make sense of unstructured data”).

I am not suggesting that they are wrong. Undoubtedly, the people at IBM Watson’s Health Cloud, which has recently partnered with the likes of Apple, Johnson & Johnson, and Medtronic, know a lot about health data.

But what I am suggesting is that the definition of a “problem” in healthcare – in this case, the “untapped unstructured data problem” within our EHRs – shouldn’t be left strictly to the corporations to define.

Remember, corporations have a different word for problem – and it’s called market opportunity.   

They are not unbiased. 

So, take this 80% number with a grain of salt, and when you hear it, dive in a little deeper.

How do you define structured versus unstructured data? Is there such a thing as semi-structured data? Is there any overlap between the two?

You may be surprised by the answers.

*If anyone knows of a peer-reviewed study which shows the 80% number, then please let me know via Twitter.