The Chaos Theory of Data Quality
“One of those issues that is always a source of frustration in the enterprise,” explained Michael Vizard in his recent IT Business Edge blog post, The Never Ending War for Data Quality, “is the quality of the data we spend so much time and money processing. The quest to make sure we have high quality data is nothing short of a never-ending battle between the forces of order and the chaos that envelops every attempt to organize anything.”
I have to admit that this is one of my pet peeves. A remarkably common misconception is that the only way to deal with the pervasive nature of “imperfect data” is to somehow magically keep all of the data “perfect” all of the time.
Data frequently contains numerous variations caused by different conventions, lack of standards, omissions, and other inconsistencies. The traditional approach to data quality relies heavily on standardization and other data cleansing efforts to prepare data before it can be used effectively for making business decisions. These preparation activities attempt to parse the data into attributes with a consistent format and standardized values.
“Alas, the war over data quality can never really be won,” explains Vizard. “What can be done is that the number of instances where we have conflicting data and outright errors can be sharply reduced. There’s no shame in having bad data; everybody does. The only real sin is not trying to do anything about it.”
I agree with Vizard on the points that everybody has bad data and that we do need to do something about it.
However, the time is long overdue for us to stop depending on outdated approaches to data quality.
Perfection (especially in data) is impossible to achieve. Intelligent business decisions can be made using imperfect data – without extensive data cleansing. Instead of trying to make the data perfect, we need to focus on enabling enterprise applications to handle the unavoidable reality of imperfect data, which is something that humans do naturally.
Advancements in mathematics and machine learning algorithms provide the capability to adapt to (and overcome) data’s inherent chaos, and enable enterprises to make better data-driven business decisions.
I call this approach the Chaos Theory of Data Quality.
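As a rough illustration of what tolerating imperfect data can look like in practice, here is a minimal Python sketch that matches an incoming name against records that were never cleansed or standardized. It uses the standard library's difflib.SequenceMatcher as a simple stand-in for the more sophisticated algorithms mentioned above. The customer names, the function names, and the 0.6 threshold are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def best_match(query: str, candidates: list[str], threshold: float = 0.6):
    """Return (candidate, score) for the closest candidate, or None.

    None is returned when there are no candidates or when the best
    score falls below the (hypothetical) threshold.
    """
    scored = [(c, similarity(query, c)) for c in candidates]
    if not scored:
        return None
    candidate, score = max(scored, key=lambda pair: pair[1])
    return (candidate, score) if score >= threshold else None


# Hypothetical customer names, deliberately left uncleansed.
customers = ["Acme Corporation", "Globex Inc.", "Initech LLC"]

print(best_match("ACME Corp", customers))  # a variation of an existing name
print(best_match("zzzz", customers))       # nothing similar on file
```

The stored records are left exactly as they arrived. The comparison absorbs the variation at decision time, and the threshold controls how much imperfection the application is willing to accept before treating a record as unmatched.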
