Fundamental Requirements for Data Matching Models
Data matching determines whether two or more records should be linked, are duplicates, or represent the same entity. There are many different approaches to data matching. In machine learning, mathematical techniques are used to construct a data matching model that approximates the way humans perceive similarity.
In this post, I want to discuss what any data matching model is attempting to achieve, since in reality every approach to data matching constructs a “model”.
Input from Subject Matter Experts (SMEs)
Before any data matching model can be constructed, interviews must be conducted with subject matter experts who have a business understanding of the data. Every organization, in every industry, has unique data characteristics and unique data challenges. An effective data matching model must capture the tacit knowledge of subject matter experts without losing anything in translation when that business knowledge is turned into a technological implementation.
A History Lesson
All data matching models start with a history lesson. A training set is collected from existing data. Typically, at least a few hundred records are collected, but a few thousand would be better. The real distinction of a training set is that it has an answer key. In other words, it contains annotations indicating which training records should be considered matches, potential matches, and non-matches. The training set allows developers to create an initial data matching model that produces the expected results.
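To make the idea of an answer key concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the six sample records, the use of the standard library's `difflib.SequenceMatcher` as the similarity measure, and the rule of placing each cut-off halfway between the average scores of two neighboring labels. A real training set would be far larger, and a real model would likely use a different technique.

```python
from difflib import SequenceMatcher
from statistics import mean

# Hypothetical training set: (record_a, record_b, label).
# The label is the "answer key" supplied by subject matter experts.
TRAINING_SET = [
    ("John Smith", "Jon Smith", "match"),
    ("Acme Corp.", "ACME Corp", "match"),
    ("John Smith", "J. Smyth", "potential match"),
    ("Acme Corp", "Acme Inc", "potential match"),
    ("John Smith", "Mary Jones", "non-match"),
    ("Acme Corp", "Zenith Ltd", "non-match"),
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fit_thresholds(training_set):
    """Derive two cut-offs from the answer key.

    Each cut-off is the midpoint between the mean similarity
    scores of two adjacent labels (an assumed, simplistic rule).
    """
    scores = {"match": [], "potential match": [], "non-match": []}
    for a, b, label in training_set:
        scores[label].append(similarity(a, b))
    lower = (mean(scores["non-match"]) + mean(scores["potential match"])) / 2
    upper = (mean(scores["potential match"]) + mean(scores["match"])) / 2
    return lower, upper


def classify(a: str, b: str, lower: float, upper: float) -> str:
    """Label a pair of records using the fitted cut-offs."""
    score = similarity(a, b)
    if score >= upper:
        return "match"
    if score >= lower:
        return "potential match"
    return "non-match"


lower, upper = fit_thresholds(TRAINING_SET)
for a, b, expected in TRAINING_SET:
    got = classify(a, b, lower, upper)
    print(f"{a!r} vs {b!r}: expected {expected}, got {got}")
```

Running the loop at the end compares the model's output against the answer key, which is exactly the check that an initial model is expected to pass.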
Predicting the Future
It is easy to pass a test when you already know the answers – and maybe you even get a few wrong just to avoid suspicion that you may have cheated. However, the real test is whether you can use what you have learned from the past to predict the future. The mathematical notion of convergence is borrowed to describe a data matching model that is ready to correctly evaluate records that were not included in its training set; in machine learning, this ability is usually called generalization.
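One common way to check this is to hold back some annotated records that the model never sees while it is being built, and then score the model on those. The sketch below assumes a deliberately simple model (a single similarity cut-off based on Python's `difflib.SequenceMatcher`) and a handful of made-up records; the names `TRAIN`, `HOLDOUT`, `fit_threshold`, and `accuracy` are hypothetical and chosen only for this example.

```python
from difflib import SequenceMatcher
from statistics import mean

# Hypothetical annotated pairs: (record_a, record_b, is_match).
TRAIN = [
    ("John Smith", "Jon Smith", True),
    ("Acme Corp.", "ACME Corp", True),
    ("John Smith", "Mary Jones", False),
    ("Acme Corp", "Zenith Ltd", False),
]

# Annotated pairs kept aside; the model is never fitted on these.
HOLDOUT = [
    ("Katherine Jones", "Catherine Jones", True),
    ("Robert Brown", "Bob Brown", True),
    ("Margaret Hill", "Peggy Hill", True),
    ("Globex Inc", "Initech LLC", False),
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fit_threshold(pairs) -> float:
    """Midpoint between the mean match score and the mean non-match score."""
    match_scores = [similarity(a, b) for a, b, is_match in pairs if is_match]
    other_scores = [similarity(a, b) for a, b, is_match in pairs if not is_match]
    return (mean(match_scores) + mean(other_scores)) / 2


def accuracy(pairs, threshold: float) -> float:
    """Fraction of pairs whose predicted label agrees with the answer key."""
    correct = sum(
        (similarity(a, b) >= threshold) == is_match for a, b, is_match in pairs
    )
    return correct / len(pairs)


threshold = fit_threshold(TRAIN)
train_accuracy = accuracy(TRAIN, threshold)
holdout_accuracy = accuracy(HOLDOUT, threshold)
print(f"threshold: {threshold:.2f}")
print(f"accuracy on training pairs: {train_accuracy:.2f}")
print(f"accuracy on held-out pairs: {holdout_accuracy:.2f}")
```

The accuracy on the training pairs only shows that the model can pass a test it has already seen. The accuracy on the held-out pairs is the number that says something about future data, and a large gap between the two suggests the model is not yet ready.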
Conclusion
Regardless of the approach being used, all data matching applications construct a data matching model using SMEs and historical training data. The two challenges are:
- How much effort is required to build and maintain the matching model?
- How well does the constructed model match input data that it hasn’t previously seen?
An optimal data matching model is accurate, easy to build, easy to update/maintain, and can seamlessly adapt to new data.
Sounds demanding because it is… but with innovation, it can be done!
