Data Quality Enlightenment

November 30th, 2009

After years of neglect, data quality is slowly moving to the forefront of business technology as both a discipline and a thriving industry.

However, given data quality license revenues are estimated at a relatively minuscule $400 million for 2009 (compared to $17 billion for DBMS license revenues), data quality is not quite center stage yet.

Therefore, in this post I want to discuss the increase in awareness by organizations that is necessary to give data quality its due.  I describe it as the three levels of Data Quality Enlightenment (DQE).

DQE Level 1 – Unaware

Organizations at Level 1 are blissfully unaware that slight discrepancies in their data create the potential for their business processes to fail.

Sometimes, the resulting failure is immediately visible.  Other times, it eventually becomes visible in a downstream application or after some period of time has passed.  Either way, the organization feels the impact of the failure as increased costs or decreased revenue, or both.

Upon finally recognizing the root cause of the problem to be data quality, organizations typically progress to Level 2.

DQE Level 2 – Aware

Organizations at Level 2 have come to realize that they must implement data quality measures to avoid the costs of “bad data.”

The logic usually goes like this – if data is not perfect our business processes can fail, therefore we must make sure that our data is always perfect.  What wonderfully flawed logic!

No matter how hard an organization tries, its data can never be perfect.  Why?  Because by its nature, the data, and access to it, changes over time.

Existing records are updated.  New records are created.  Both of these actions can be performed by existing or new people and by existing or new systems.

With the additional reality that these people and systems can be both internal and external to the organization, the complexity grows exponentially.

Therefore, is it realistic to expect that all data throughout the enterprise will always be kept perfect and standardized the exact same way?

Will humans accessing the data know and use the standard methods?  Will humans always know the exact and correct data they want?  Will multiple applications (within and between organizations) that need to share data use the same standards for data perfection?

Of course not.  Simply put, perpetually perfect data is not possible.  Don’t believe anyone who tells you otherwise.

Yet despite these facts, the majority of the data quality industry is still focused on attempting to achieve data perfection.

The common belief is that the way to data Utopia is by writing rules to parse, standardize and match data.  Of course the different rules have fancy technical names like “deterministic” and “probabilistic” but they all boil down to manual, static rules that need to be created, maintained, and updated in perpetuity.

The rules an organization has in place today for “perfect data” will have to change (update old rules and add new rules) as the data changes.
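To make the maintenance burden concrete, here is a minimal sketch of a static standardization rule feeding a deterministic match.  The abbreviation table and function names are hypothetical, invented only for illustration; real rule sets are far larger.

```python
import re

# Hypothetical static rules: each maps a known variant to a "standard" token.
ABBREVIATIONS = {"st": "street", "st.": "street", "ave": "avenue", "ave.": "avenue"}

def standardize(address: str) -> str:
    """Lowercase the address and expand the abbreviations the rules know about."""
    tokens = re.split(r"\s+", address.strip().lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)

def deterministic_match(a: str, b: str) -> bool:
    """Two records match only if they are identical after standardization."""
    return standardize(a) == standardize(b)

print(deterministic_match("12 Main St.", "12 main street"))  # variant covered by a rule
print(deterministic_match("12 Main Str", "12 main street"))  # "str" has no rule yet
```

The second comparison fails until someone notices the new variant and adds a rule for it, which is exactly the perpetual upkeep described above.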

Unlike Level 1, where organizations quickly realize they must change and progress to Level 2, most organizations at Level 2 get stuck here and never progress to Level 3.

DQE Level 3 – Enlightened

Organizations reach Level 3 when they achieve enlightenment via the “eureka moment” when they realize that getting and keeping data perfect at all times and forever is, fundamentally, an insane idea.

These organizations then seek to find a better way.

That better way is to enable all enterprise applications to function correctly despite the fact that the underlying operational data they use is not perfect.  And to do it without constantly updating and creating rules to parse, standardize, and match data.
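As a rough illustration of an application that tolerates imperfect input, the sketch below looks up a customer by approximate name instead of requiring an exact, pre-cleansed key.  It uses Python's standard `difflib` module as a simple stand-in for a real matching engine; the customer list and the 0.8 cutoff are assumptions chosen for the example.

```python
from difflib import get_close_matches

# Hypothetical stored records; in practice these would come from a database.
CUSTOMERS = ["Jonathan Smith", "Maria Gonzalez", "Wei Chen"]

def tolerant_lookup(query, names=CUSTOMERS, cutoff=0.8):
    """Return the stored name closest to the query, or None if nothing is close enough."""
    lowered = {name.lower(): name for name in names}
    hits = get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

print(tolerant_lookup("Jonathon Smith"))  # misspelled query still finds the record
print(tolerant_lookup("Zzz Qqq"))         # nothing similar, so None
```

No rule about the "Jonathon" misspelling was ever written; the lookup simply accepts that the query and the stored data may differ slightly.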

The enlightened phase has only just begun with a select few organizations reaching Level 3.

Enlightenment is Inevitable

As is often the case, enlightenment comes from a simple yet powerful idea that breaks away from the constraints of conventional thought.

It’s only a matter of time before every enterprise application will no longer assume and require “perfect” data in order to function correctly.

When this finally happens, and it will, everyone will benefit.

Posted in Business, Innovation, Technology

Data: Transparency vs. Quality?

November 23rd, 2009

In my previous post The Challenges of Data Transparency, I discussed the news about the data preparation for Recovery.gov regarding how state and local recipients are spending federal stimulus money.

In the post, I talked about the juxtaposition of data transparency and data quality, and how although missing or incomplete data is a common problem, completeness without any regard for accuracy could possibly do more harm than good.

I asked whether data should be concealed until it has been verified to be of sufficient quality, or should be provided as soon as it becomes available without regard for quality.

This past week, we have been inundated with news reports from numerous media outlets regarding the glaring data quality issues found on Recovery.gov, which would seem to indicate many would answer my question by advocating concealment until the verification of data quality has been performed.

I don’t want to get into some of the more politically charged aspects of the current debate.  I would prefer to pose the question in a more general sense.  When it comes to data, does it fundamentally come down to transparency vs. quality?

From my perspective, the underlying struggle in this debate is the desire to achieve both total data transparency and perfect data quality.  As wonderful as it would be if this were possible, the reality is simply that it is not.

Perfection (especially in data) is impossible to achieve.  Transparency reveals the quality issues naturally inherent in data.  I am not advocating we simply accept the reality of poor quality.  We must take action to identify and overcome data quality issues.

The traditional approach is employing standardization and other data cleansing techniques in an effort to perfect data.  Continuing advancements in mathematics and machine learning algorithms provide the capability to adapt to (and overcome) data’s inherent imperfections.

We must strive for total data transparency balanced with a realistic perspective of data quality.  Transparency provides the necessary access and emerging innovations in quality provide the methods for transforming data into actionable information.

Related Posts

The Chaos Theory of Data Quality

Drowning in Imperfect Data

The Growing Importance of the Algorithm

The Growing Importance of Mathematics

Posted in Economy, Technology

HITECH Challenges

November 16th, 2009

In his recent Internet Evolution article Stimulus Plan Moves Healthcare Tech Center Stage, John Soat reported on the challenges facing healthcare providers under the section of the United States federal stimulus bill known as the Health Information Technology for Economic and Clinical Health (HITECH) Act, which is intended to jumpstart the use of digital technology in the healthcare industry, in particular the use of e-health records.

The article mentions some of the excellent e-health technology efforts under way at General Electric, Google, and Microsoft.  It’s amazing how 20 billion dollars (at least) in government funding can lead some of the biggest companies to focus on building digital technology solutions for the healthcare industry, something that was overdue long before the recent stimulus funding became available.

One of the primary challenges of the e-health evolution is in the area of electronic health records (EHR).  Back in March 2009, I wrote an article for the Executive Healthcare Management (EHM) magazine about lessons from the bleeding edge of EHR.

The evolution of digitizing, storing, and successfully retrieving accurate information necessary for servicing customers has been well underway in other industries for decades.  The evolution of customer data management is continuing and still has challenges to overcome.

However, in the healthcare industry, the customer is primarily a patient and the service being provided is primarily medical treatment.  In many cases, retrieving accurate information can be a matter of life and death.

Duplicate customer data can undermine the effectiveness of sales and marketing programs, causing unnecessary costs and wasteful spending that greatly reduce revenue.

However, duplicates in a master patient index can cause incorrect or outdated information to be used as the basis for medical treatments.  These mistakes can incur costs of a human nature far greater and far more important than costs of a financial nature.

HITECH is indeed presenting the healthcare industry with significant challenges to overcome.  However, these challenges are not simply about modernizing the industry with the latest and greatest technology.

Healthcare is a great example of how innovative technology is fundamentally about improving the quality of human life.

Posted in Technology

Service-Oriented is Future-Oriented

November 9th, 2009

In his recent ebizQ.net article SOA, Phase 2: Toward a Loosely Coupled World, Joe McKendrick declared:

“I am a passionate believer in the power of technology, as an enabler of entrepreneurship and organizational transformation. I have long advocated flattening the organizational hierarchy, and pushing decision-making down to the managers and employees who deal with customers and production on a day-to-day basis.”

I couldn’t agree more.  Nothing has a more powerful effect on an organization’s ability to succeed than putting the right technology into the hands of front line employees.

There is an unstoppable industry trend gaining daily momentum where organizations are increasingly looking for solutions with cloud computing and software-as-a-service (SaaS) as the new paradigm for enterprise architecture.

“Cloud computing is pushing some software vendors to change their models to component delivery,” explains McKendrick.  “This makes plenty of room not only for small start-ups, but also for development shops within traditional enterprises that have great ideas.”

Historically, many of the most powerful new trends in technology originated from small entrepreneurial vendors.  By focusing on enhancing their highly specialized components, they can provide a great source of rapid innovation.  Therefore, small software vendors, whose solutions are designed for deployment using a loosely coupled service-oriented architecture (SOA), may be the industry’s small giants upon whose broad shoulders we will all be standing in the not-too-distant future.

And according to Mohan Sawhney, professor at Northwestern’s Kellogg School of Management:

“The best-run companies are becoming orchestrators of networks of services.  Five years from now, the concept of an application will be obsolete.  They will all be services, combined, mixed, matched and reused as needed.”

Therefore, when it comes to enterprise architecture — service-oriented is future-oriented.

Related Posts

The API and the Innovation of Enterprise Applications

Innovation Recession?

Innovation – Do More with Less

The Cloud brings Commoditization

Posted in Innovation, Technology, Trends

The Chaos Theory of Data Quality

November 2nd, 2009

“One of those issues that is always a source of frustration in the enterprise,” explained Michael Vizard in his recent IT Business Edge blog post, The Never Ending War for Data Quality, “is the quality of the data we spend so much time and money processing.  The quest to make sure we have high quality data is nothing short of a never-ending battle between the forces of order and the chaos that envelopes every attempt to organize anything.”

I have to admit that this is one of my pet peeves.  A remarkably common misconception is that the only way to deal with the pervasive nature of “imperfect data” is to somehow magically keep all of the data “perfect” all of the time.

Data frequently contains numerous variations caused by different conventions, lack of standards, omissions, and other inconsistencies.  The traditional approach to data quality is to heavily rely on standardization and other data cleansing efforts in order to prepare data before it can be effectively used for making business decisions.  These preparation activities attempt to create a consistent format of parsed attributes with standardized values.

“Alas, the war over data quality can never really be won,” explains Vizard.  “What can be done is that the number of instances where we have conflicting data and outright errors can be sharply reduced.  There’s no shame in having bad data; everybody does.  The only real sin is not trying to do anything about it.”

I agree with Vizard on the points that everybody has bad data and that we do need to do something about it.

However, the time is long overdue for us to stop depending on outdated approaches to data quality.

Perfection (especially in data) is impossible to achieve.  Intelligent business decisions can be made using imperfect data – without extensive data cleansing.  Instead of trying to make the data perfect, we need to focus on enabling enterprise applications to handle the unavoidable reality of imperfect data, which is something that humans do naturally.

Advancements in mathematics and machine learning algorithms provide the capability to adapt to (and overcome) data’s inherent chaos, and enable enterprises to make better data-driven business decisions.

I call this approach the Chaos Theory of Data Quality.

Related Posts

The Growing Importance of the Algorithm

The Growing Importance of Mathematics

Adaptive Software

Drowning in Imperfect Data

A Sisyphean Task…

Posted in Technology, Trends

The API and the Innovation of Enterprise Applications

October 26th, 2009

“One of the bigger trends to come down the pike lately,” explained Jim Ericson in his recent Information Management blog post The API is the New Network, “is the proliferation of Web-based application programming interfaces, or APIs, and how network traffic is growing exponentially through APIs.”

More and more organizations continue to look to innovations in cloud computing, software-as-a-service (SaaS), and information as a service, as a new paradigm for enterprise applications.  In a recent press release, Gartner Research identified the Top 10 Strategic Technologies for 2010 and the list includes both cloud computing and client computing.

This is a stark contrast to the traditional approach taken by large technology vendors, who tend to innovate via acquisition in order to offer consolidated enterprise application development platforms with seamlessly integrated components for data quality, data integration, master data management and business intelligence.  This allows the large technology vendors to offer end-to-end solutions and the convenience of one-vendor information technology shopping.

However, does buying everything from one large vendor guarantee a best of breed solution for each individual component?

An API-oriented approach enables a plug-and-play enterprise application strategy.  Under this model, enterprise applications are assembled from best of breed individual components that are loosely coupled via a network of API calls.
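A small sketch of what plug-and-play can look like in code: the application depends only on an interface, and any component that honors it can be swapped in.  The `Matcher` interface and both implementations are hypothetical, and for brevity they run in-process here; in the model described above each one would sit behind a network API call.

```python
from typing import Protocol

class Matcher(Protocol):
    """The only contract the application relies on (hypothetical interface)."""
    def score(self, a: str, b: str) -> float: ...

class ExactMatcher:
    """Component from one vendor: case-insensitive exact comparison."""
    def score(self, a: str, b: str) -> float:
        return 1.0 if a.strip().lower() == b.strip().lower() else 0.0

class TokenOverlapMatcher:
    """Component from another vendor: Jaccard overlap of word tokens."""
    def score(self, a: str, b: str) -> float:
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def is_duplicate(matcher: Matcher, a: str, b: str, threshold: float = 0.5) -> bool:
    """Application code: works with any component that satisfies Matcher."""
    return matcher.score(a, b) >= threshold

for component in (ExactMatcher(), TokenOverlapMatcher()):
    print(type(component).__name__, is_duplicate(component, "Acme Corp Inc", "Acme Corp"))
```

Replacing one component with another requires no change to `is_duplicate`, which is the loose coupling that makes a best of breed selection per component practical.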

Historically, many of the most powerful new trends in technology originated from small entrepreneurial ventures.  Small technology vendors tend to be specialists with a narrow focus that can provide a great source of rapid innovation.

Perhaps we are witnessing the beginning of the reversal of the recent trend of vendor consolidation, and a return to the earlier industry landscape where smaller vendors remained focused on enhancing and improving their highly specialized components.

If the API is indeed the new network, then the innovation of enterprise applications is to be found in collaboration and not consolidation.

Related Posts

Innovation Recession?

Innovation – Do More with Less

The Cloud brings Commoditization

Posted in Innovation, Technology, Trends

MDM: “Golden” Repository or “Fool’s Gold”

October 19th, 2009

Master Data Management (MDM) is the logical extension of a 20 year evolution in data management practice.  The strategic goal for MDM is to provide a single, “golden” repository of mission-critical data that assures all systems, organizations, and users are getting consistent, accurate information to support their needs.

Today, a number of vendors are positioning themselves to take on this challenge with new technologies that purport to make MDM feasible.  Once implemented, MDM promises to maintain real-time, clean, and consistent 360° views of prospects, customers, and products.

However, in her recent IT World Canada article Data quality vendors missing the mark, Kathleen Lau reported on a study by Andy Hayler, President and CEO of the analyst firm The Information Difference, that shows:

“The issue for lack of attention to data quality by MDM vendors is that traditionally these vendors have focused on building systems that digest data quickly, only to later realize such systems were useless if the data being input was bad.”

Amassing poor quality data would appear to be what many MDM “solutions” are actually delivering.  The technology behind many of these systems is powerful and their functionality is impressively robust.

However, simply assuming the underlying data is “good enough” to support the MDM system will only transform a “golden” repository of mission-critical data into an enterprise database of “fool’s gold.”

Posted in Technology

Data Sherpas Needed

October 12th, 2009

In the recent New York Times article Training to Climb an Everest of Digital Data, Ashlee Vance reported on the challenges associated with managing – and deriving value from – massive repositories of data.

“Researchers and workers in fields as diverse as bio-technology, astronomy and computer science,” reports Vance, “will soon find themselves overwhelmed with information.  The next generation of computer scientists has to think in terms of what could be described as Internet scale.  Facebook, for example, uses more than 1 petabyte of storage space to manage its users’ 40 billion photos.  (A petabyte is about 1,000 times as large as a terabyte, and could store about 500 billion pages of text).”

According to Gartner Research, the volume of enterprise data is doubling every 18 months.  This rapid data proliferation is causing day-to-day business challenges to evolve faster than the existing applications (or new applications under development) can react.
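The figures quoted above lend themselves to a quick back-of-the-envelope check.  The sketch below assumes steady exponential growth at the quoted 18-month doubling rate and a decimal petabyte of 10^15 bytes; both are simplifying assumptions for the arithmetic, not claims from Gartner or the article.

```python
DOUBLING_PERIOD_MONTHS = 18  # doubling rate quoted above

def growth_factor(years: float) -> float:
    """How many times larger the data volume is after the given number of years."""
    return 2 ** (years * 12 / DOUBLING_PERIOD_MONTHS)

PETABYTE_BYTES = 10 ** 15            # assumption: decimal petabyte
PAGES_PER_PETABYTE = 500 * 10 ** 9   # "about 500 billion pages of text"
bytes_per_page = PETABYTE_BYTES // PAGES_PER_PETABYTE

for years in (1.5, 3, 6):
    print(f"after {years} years: {growth_factor(years):.0f}x the data")
print(f"implied size of one page of text: {bytes_per_page} bytes")
```

Under these assumptions, data volume is 4 times larger after three years and 16 times larger after six, and the quoted page count works out to roughly 2,000 bytes per page of text.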

“Science these days has basically turned into a data-management problem,” said Jimmy Lin, an associate professor at the University of Maryland, at a recent technology conference.

From the beginning of civilization, mathematics (the language of science) has been central to our advancement.  But our relatively newfound ability to collect massive amounts of digital data has ushered in a new era for leveraging and benefiting from mathematics.

Advancements in machine learning technology using sophisticated mathematical algorithms are providing the capability to not only rapidly process large volumes of data, but more importantly, enable enterprises to make better data-driven business decisions.

According to Vance, companies large and small, as well as universities and government agencies, are “looking for big data experts” capable of scaling today’s digital data mountains.

Perhaps tomorrow we will even see a listing in the classifieds (or more likely in a Twitter status update) that simply reads:

Data Sherpas Needed

Related Posts

The Growing Importance of Mathematics

Adaptive Software

Drowning in Imperfect Data

A Sisyphean Task…

Posted in Business, Technology, Trends

The Challenges of Data Transparency

October 5th, 2009

In the recent Federal Computer Week article Stimulus spenders race to the finish line, Alice Lipowicz reported on the efforts of state and local recipients to prepare detailed reports on how they are spending federal stimulus money.

Among the common concerns cited in the article was dealing with the sheer volume of data, and more importantly, the quality of the data.

“Transparency advocates predict that data quality,” reports Lipowicz, “will be as much of a concern at Recovery.gov as it is at USAspending.gov, which Congress established in 2006 to provide visibility into federal spending.  That site has been plagued with problems such as errors, missing data and mislabeled data.”

Data transparency is definitely a laudable goal, and not just for the government.  Organizations in every industry and of every size need to do a better job of making the data that was used to drive critical business decisions, especially financial decisions, available for review.

Missing or incomplete data is a common problem, but transparency cannot simply mean a massive dump of all available data.

Completeness without any regard for accuracy could possibly do more harm than good.  Data frequently contains numerous variations caused by different conventions, lack of standards, omissions, and other inconsistencies.

An excellent question raised in the article was:

“Data quality has been a problem for years, so why do we keep getting [more data] instead of addressing these priorities?”

I think that this question represents one of the most significant challenges for data transparency.

Should data be concealed until it has been verified to be of sufficient quality?  Or should data be provided as soon as it becomes available without regard for quality?

Please share your thoughts.

Related Posts

Drowning in Imperfect Data

A Sisyphean Task…

Posted in Business, Economy

Fundamental Requirements for Data Matching Models

September 28th, 2009

Data matching determines whether two or more records should be linked, are duplicates, or represent the same entity.  There are many different approaches to data matching.  In machine learning, advanced mathematical techniques are used to construct a data matching model for the way that humans perceive similarity.

In this post, I want to discuss what any data matching model is attempting to achieve – since the reality is that all approaches to data matching construct “models”.

Input from Subject Matter Experts (SME)

Before any data matching model can be constructed, interviews must be conducted with subject matter experts possessing a business understanding of the data.  All organizations in every industry have unique data characteristics and unique data challenges.  An effective data matching model must capture the tacit knowledge of subject matter experts without losing anything in translation when instantiating business knowledge into a technological implementation.

A History Lesson

All data matching models start with a history lesson.  A training set is collected from existing data.  Typically, at least a few hundred records are collected, but a few thousand would be better.  The real distinction of a training set is that it has an answer key.  In other words, it contains annotations indicating which training records should be considered matches, potential matches, and non-matches.  The training set allows developers to create an initial data matching model that produces expected results.
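Here is a toy sketch of what a training set with an answer key might look like, and how an initial model could be derived from it.  The three record pairs, the use of `difflib` for similarity, and the single-threshold "model" are all assumptions made for illustration; real matching models are considerably richer.

```python
from difflib import SequenceMatcher

# Hypothetical training pairs with their answer key (the third element).
TRAINING_SET = [
    ("Jon Smith", "John Smith", "match"),
    ("ACME Corp.", "Acme Corporation", "potential"),
    ("Maria Gonzalez", "Wei Chen", "non-match"),
]

def similarity(a: str, b: str) -> float:
    """Character-level similarity between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fit_threshold(training_set) -> float:
    """Place the cut-off midway between the weakest match and the strongest non-match."""
    lowest_match = min(similarity(a, b) for a, b, label in training_set if label == "match")
    highest_non_match = max(similarity(a, b) for a, b, label in training_set if label == "non-match")
    return (lowest_match + highest_non_match) / 2

threshold = fit_threshold(TRAINING_SET)
for a, b, label in TRAINING_SET:
    print(f"{a!r} vs {b!r}: score={similarity(a, b):.2f}, answer key={label}")
print(f"fitted threshold: {threshold:.2f}")
```

By construction, this model reproduces the answer key for the matches and non-matches it was fitted on, which is the "expected results" stage described above and nothing more.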

Predicting the Future

It is easy to pass a test when you already know the answers – and maybe you even get a few wrong just to avoid suspicion that you may have cheated.  However, the real test is whether you can use what you have learned from the past to predict the future.  The mathematical notion of convergence is used to describe a data matching model that is ready to correctly evaluate records that were not included in its training set.
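One common way to check this is to score the model on annotated pairs it never saw during training.  The sketch below assumes a simple similarity-threshold model (the 0.8 cut-off stands in for whatever was learned earlier) and four hypothetical held-out pairs; it reports precision and recall, two standard measures of how well the predictions hold up.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8  # assumption: cut-off learned earlier from a training set

# Hypothetical held-out pairs, annotated but never used for training.
HELD_OUT = [
    ("Katherine Jones", "Catherine Jones", True),
    ("Robert Brown", "Bob Brown", True),
    ("Alice Wong", "Alan Wong", False),
    ("Wei Chen", "Maria Gonzalez", False),
]

def predict(a: str, b: str) -> bool:
    """Predict a match when character-level similarity reaches the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= THRESHOLD

def evaluate(pairs):
    """Return (precision, recall) of the predictions against the annotations."""
    tp = sum(1 for a, b, truth in pairs if predict(a, b) and truth)
    fp = sum(1 for a, b, truth in pairs if predict(a, b) and not truth)
    fn = sum(1 for a, b, truth in pairs if not predict(a, b) and truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

precision, recall = evaluate(HELD_OUT)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy data the model never flags a false match, but it misses the nickname pair ("Robert" and "Bob"), so recall is only one half.  A model that looked perfect on its training set can still stumble on variations it has not seen before.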

Conclusion

Regardless of the approach being used, all data matching applications construct a data matching model using SMEs and historical training data.  The two challenges are:

  1. How much effort is required to build and maintain the matching model?
  2. How well does the constructed model match input data that it hasn’t previously seen?

An optimal data matching model is accurate, easy to build, easy to update/maintain, and can seamlessly adapt to new data.

Sounds demanding because it is… but with innovation, it can be done!

Related Posts

A Data Matching Benchmark

Matches Created

A more precise, but less certain world

Narrative Fallacy and Data Matching

Posted in Data Matching


About Netrics HD

Data matching is a fundamental operation in many applications, from improving data quality to implementing master data management. Stef Damianakis, CEO of Netrics, a world leader in matching technology, shares his thoughts on the state of the technology and business of data matching.
