Fundamental Requirements for Data Matching Models

Monday, September 28th, 2009 @ 10:10 am

Data matching determines whether two or more records should be linked, are duplicates, or represent the same entity.  There are many different approaches to data matching.  In machine learning, mathematical techniques are used to construct a data matching model that approximates the way humans perceive similarity.

In this post, I want to discuss what any data matching model is attempting to achieve – since the reality is that all approaches to data matching construct “models”.

Input from Subject Matter Experts (SMEs)

Before any data matching model can be constructed, interviews must be conducted with subject matter experts possessing a business understanding of the data.  All organizations in every industry have unique data characteristics and unique data challenges.  An effective data matching model must capture the tacit knowledge of subject matter experts without losing anything in translation when instantiating business knowledge into a technological implementation.

A History Lesson

All data matching models start with a history lesson.  A training set is collected from existing data.  Typically, at least a few hundred records are collected, but a few thousand would be better.  The real distinction of a training set is that it has an answer key.  In other words, it contains annotations indicating which training records should be considered matches, potential matches, and non-matches.  The training set allows developers to create an initial data matching model that produces the expected results.
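
To make the idea of an answer key concrete, here is a minimal sketch in Python.  Everything in it is an assumption made for illustration: the record values, the LabeledPair type, the two thresholds, and the use of the standard library's SequenceMatcher as a stand-in similarity measure.  It is not how any particular data matching product works; it only shows a toy model being checked against annotated pairs.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class LabeledPair:
    """One entry in the answer key: two record values plus the expected verdict."""
    left: str
    right: str
    label: str  # "match", "potential", or "non-match"


# Hypothetical, hand-annotated examples; a real training set would hold
# hundreds or thousands of pairs reviewed by subject matter experts.
TRAINING_SET = [
    LabeledPair("Jon Smith", "jon smith", "match"),
    LabeledPair("Jon Smith", "John Smyth", "potential"),
    LabeledPair("Acme Corporation", "Acme Corp", "potential"),
    LabeledPair("Jon Smith", "Zenith Ltd", "non-match"),
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 and 1.0 (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(a: str, b: str, match_at: float = 0.9, potential_at: float = 0.6) -> str:
    """Map a similarity score onto the three labels used in the answer key."""
    score = similarity(a, b)
    if score >= match_at:
        return "match"
    if score >= potential_at:
        return "potential"
    return "non-match"


def agreement(pairs: list[LabeledPair]) -> float:
    """Fraction of pairs where the model's verdict equals the annotation."""
    hits = sum(classify(p.left, p.right) == p.label for p in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    for p in TRAINING_SET:
        print(p.left, "|", p.right, "| expected:", p.label, "| got:", classify(p.left, p.right))
    print("agreement with answer key:", agreement(TRAINING_SET))
```

Wherever the toy model disagrees with the answer key, a developer would adjust the thresholds or the similarity measure, which is the sense in which the training set shapes the initial model.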

Predicting the Future

It is easy to pass a test when you already know the answers, and maybe you even get a few wrong just to avoid suspicion that you cheated.  However, the real test is whether you can use what you have learned from the past to predict the future.  The mathematical notion of convergence is used to describe a data matching model that is ready to correctly evaluate records that were not included in its training set.
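
One common way to check this, sketched below in Python, is to hold back part of the annotated data, tune the model on the rest, and then score it on the records it never saw.  The labeled pairs, the candidate thresholds, and the single-threshold model are all hypothetical and chosen only to keep the example short; SequenceMatcher is again a stand-in for whatever similarity measure a real model would use.

```python
import random
from difflib import SequenceMatcher

# Hypothetical annotated pairs: (left value, right value, is_match).
LABELED = [
    ("Jon Smith", "jon smith", True),
    ("Jon Smith", "John Smith", True),
    ("Acme Corporation", "Acme Corporation Inc", True),
    ("Mary Jones", "Mary Jones-Lee", True),
    ("Jon Smith", "Zenith Ltd", False),
    ("Acme Corporation", "Apex Holdings", False),
    ("Mary Jones", "Mark Johnson", False),
    ("Blue Sky LLC", "Red Rock Inc", False),
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 and 1.0 (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def accuracy(pairs, threshold: float) -> float:
    """Fraction of pairs where 'score >= threshold' agrees with the annotation."""
    correct = sum((similarity(a, b) >= threshold) == is_match for a, b, is_match in pairs)
    return correct / len(pairs)


def fit_threshold(train_pairs, candidates=(0.5, 0.6, 0.7, 0.8, 0.9)) -> float:
    """Pick the candidate threshold that scores best on the training pairs."""
    return max(candidates, key=lambda t: accuracy(train_pairs, t))


def split(pairs, holdout_fraction: float = 0.25, seed: int = 0):
    """Shuffle reproducibly, then set aside the last part as unseen records."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    train, holdout = split(LABELED)
    threshold = fit_threshold(train)
    print("chosen threshold:", threshold)
    print("accuracy on training pairs:", accuracy(train, threshold))
    print("accuracy on unseen pairs:", accuracy(holdout, threshold))
```

If the score on unseen pairs is much lower than the score on training pairs, the model has memorized its answer key rather than learned from it.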

Conclusion

Regardless of the approach being used, all data matching applications construct a data matching model using input from SMEs and historical training data.  The two challenges are:

  1. How much effort is required to build and maintain the matching model?
  2. How well does the constructed model match input data that it hasn’t previously seen?

An optimal data matching model is accurate, easy to build, easy to update/maintain, and can seamlessly adapt to new data.

Sounds demanding because it is… but with innovation, it can be done!

