Fundamental Requirements for Data Matching Models

September 28th, 2009 by Stefanos Damianakis

Data matching determines whether two or more records should be linked, are duplicates, or represent the same entity. There are many different approaches to data matching. In machine learning, mathematical techniques are used to construct a data matching model that approximates the way humans perceive similarity.

In this post, I want to discuss what any data matching model is attempting to achieve – since the reality is that all approaches to data matching construct “models”.

Input from Subject Matter Experts (SMEs)

Before any data matching model can be constructed, interviews must be conducted with subject matter experts who have a business understanding of the data. Every organization, in every industry, has unique data characteristics and unique data challenges. An effective data matching model must capture the tacit knowledge of subject matter experts without losing anything in translation when that business knowledge is turned into a technological implementation.

A History Lesson

All data matching models start with a history lesson. A training set is collected from existing data. Typically, at least a few hundred records are collected, but a few thousand would be better. What distinguishes a training set is that it has an answer key. In other words, it contains annotations indicating which training records should be considered matches, potential matches, and non-matches. The training set allows developers to create an initial data matching model that produces the expected results.
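As a rough illustration of what a training set with an answer key might look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the Record fields, the five labelled pairs, the field-agreement similarity score, and the rule for choosing cutoffs are all hypothetical, and real matching models use far richer similarity measures. It also assumes the three labels can be cleanly separated by score, which real data rarely allows.

```python
from typing import NamedTuple


class Record(NamedTuple):
    # Hypothetical fields, chosen only for illustration.
    name: str
    city: str
    phone: str
    email: str


def similarity(a: Record, b: Record) -> float:
    """Fraction of fields that agree after trimming and lowercasing."""
    agree = sum(x.strip().lower() == y.strip().lower() for x, y in zip(a, b))
    return agree / len(a)


# Hypothetical answer key: each entry is (record_a, record_b, label).
TRAINING_SET = [
    (Record("John Smith", "Boston", "555-0100", "js@example.com"),
     Record("john smith", "boston", "555-0100", "js@example.com"), "match"),
    (Record("Mary Jones", "Austin", "555-0111", "mj@example.com"),
     Record("Mary Jones", "Austin", "555-0199", "mj@example.com"), "match"),
    (Record("Ann Lee", "Denver", "555-0122", "ann@example.com"),
     Record("Ann Lee", "Denver", "555-0133", "alee@example.org"), "potential"),
    (Record("Bob Ray", "Miami", "555-0144", "bob@example.com"),
     Record("Rob Day", "Miami", "555-0155", "rday@example.org"), "non-match"),
    (Record("Kim Park", "Seattle", "555-0166", "kim@example.com"),
     Record("Tom Hill", "Dallas", "555-0177", "tom@example.org"), "non-match"),
]


def fit_thresholds(training_set):
    """Place each cutoff halfway between adjacent label groups.

    Assumes the labels are separable by score in the training data.
    """
    scores = {"match": [], "potential": [], "non-match": []}
    for a, b, label in training_set:
        scores[label].append(similarity(a, b))
    upper = (min(scores["match"]) + max(scores["potential"])) / 2
    lower = (min(scores["potential"]) + max(scores["non-match"])) / 2
    return lower, upper


def classify(a: Record, b: Record, lower: float, upper: float) -> str:
    """Label a pair of records using the two fitted cutoffs."""
    score = similarity(a, b)
    if score >= upper:
        return "match"
    if score >= lower:
        return "potential"
    return "non-match"


LOWER, UPPER = fit_thresholds(TRAINING_SET)
```

With these five pairs the fitted cutoffs come out to 0.375 and 0.625, and the model reproduces every label in the answer key. That is exactly the "initial model that produces the expected results" described above, and nothing more.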

Predicting the Future

It is easy to pass a test when you already know the answers, and maybe you even get a few wrong just to avoid suspicion that you may have cheated. However, the real test is whether you can use what you have learned from the past to predict the future. The mathematical notion of convergence is used to describe a data matching model that is ready to correctly evaluate records that were not included in its training set (in machine learning, this ability is usually called generalization).
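One common way to check this is to score the model on labelled records it never saw during training. The sketch below is only an illustration under stated assumptions: the similarity scores and labels in HISTORY and FUTURE are made up, and the single-cutoff rule in fit_threshold is a deliberately simple stand-in for whatever a real matching model would learn.

```python
# Hypothetical labelled pairs: (similarity_score, is_match).
# HISTORY is the training set; FUTURE holds records the model has not seen.
HISTORY = [
    (0.95, True), (0.90, True), (0.80, True),
    (0.40, False), (0.30, False), (0.10, False),
]
FUTURE = [
    (0.85, True), (0.70, True),
    (0.35, False), (0.20, False),
]


def errors(examples, threshold):
    """Count pairs whose predicted label disagrees with the answer key."""
    return sum((score >= threshold) != is_match for score, is_match in examples)


def fit_threshold(train):
    """Pick the observed score that makes the fewest training errors."""
    candidates = sorted({score for score, _ in train})
    return min(candidates, key=lambda t: errors(train, t))


def accuracy(examples, threshold):
    """Fraction of pairs labelled correctly at the given cutoff."""
    return 1 - errors(examples, threshold) / len(examples)


threshold = fit_threshold(HISTORY)
train_accuracy = accuracy(HISTORY, threshold)
future_accuracy = accuracy(FUTURE, threshold)
```

On this made-up data the fitted cutoff is 0.8, which scores 100% on HISTORY but only 75% on FUTURE, because the unseen match at 0.70 falls below the cutoff. The gap between the two numbers is the thing to watch: a model that only memorizes its answer key looks perfect on the past and stumbles on the future.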

Conclusion

Regardless of the approach being used, all data matching applications construct a data matching model using SMEs and historical training data. The two challenges are:

  1. How much effort is required to build and maintain the matching model?
  2. How well does the constructed model match input data that it hasn’t previously seen?

An optimal data matching model is accurate, easy to build, easy to update/maintain, and can seamlessly adapt to new data.

Sounds demanding because it is… but with innovation, it can be done!

About Netrics HD

Data matching is a fundamental operation in many applications, from improving data quality to implementing master data management. Stef Damianakis, CEO of Netrics, a world leader in matching technology, shares his thoughts on the state of the technology and business of data matching.
