A Data Matching Benchmark

September 7th, 2009 by Stefanos Damianakis

A benchmark provides the ability to perform consistent and repeatable comparisons by running tests against it.  With many different approaches to data matching and many software vendors offering commercial solutions, an industry standard data matching benchmark would certainly be very useful.

However, in this post, I want to discuss recommendations for establishing your own benchmark for comparing and selecting the best data matching solution for your specific business needs.

Benchmark Data

The only way to truly determine that a data matching solution will satisfy your business requirements is to test it against your actual data.  Therefore, it is critically important to collect real data for benchmarking purposes.  At least a few hundred records should be collected, but a few thousand would be better.  The collected data should be separated into two benchmark data sets:

  1. Training Set
  2. Processing Set

The real distinction between these two data sets is that the training set contains only a small percentage of the total records and that it has an answer key.  In other words, the training set contains annotations indicating which records should be considered matches, potential matches, and non-matches.

The training set lets each competing vendor prepare for the benchmark by developing and testing independently against it.  Its annotations represent the expected results under your business definition of an acceptable match.  Because the training set includes the answer key, vendors can do this preparation without your direct involvement.

The processing set is used for the actual benchmark tests.  It should contain the full volume of collected data but, most importantly, no answer key.  You will of course need an answer key for the processing set yourself, but you should not provide it to the vendors.  The most important aspect of the benchmark is to measure whether the data matching solution can produce the correct results by extrapolating from the training set the correct logic to apply to the processing set.
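As a rough illustration, the split described above could be sketched in Python as follows.  Everything here is an assumption made for the example rather than part of any vendor's tooling: records are plain dictionaries with an "id" field, the answer key maps each record id to one of three hypothetical labels ("match", "potential", "non-match"), the training fraction defaults to 10%, and the processing set is taken to include every collected record.

```python
import random


def split_benchmark(records, answer_key, train_fraction=0.1, seed=42):
    """Split collected records into a training set and a processing set.

    records        -- list of dicts, each with a unique "id" (assumed layout)
    answer_key     -- dict mapping record id to its annotation (assumed layout)
    train_fraction -- share of records given to vendors with annotations
    """
    rng = random.Random(seed)  # fixed seed keeps the split repeatable
    n_train = max(1, round(len(records) * train_fraction))
    train_ids = set(rng.sample([r["id"] for r in records], n_train))

    # Training set: a small sample, shipped WITH its annotations.
    training_set = [
        dict(r, annotation=answer_key[r["id"]])
        for r in records
        if r["id"] in train_ids
    ]

    # Processing set: the full volume, shipped WITHOUT annotations.
    processing_set = [dict(r) for r in records]

    # The complete answer key stays with you and is never sent to vendors.
    private_key = dict(answer_key)
    return training_set, processing_set, private_key
```

Only the first two return values would go to the vendors; the third is kept back for scoring the results.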

Benchmark Measurements

The benchmark results should be reported using the following statistics:

  • Processing Time – execution time for the processing set.
  • Matches – the number of records properly identified as a match.
  • Potential Matches – the number of records properly identified as a potential match.
  • Non-Matches – the number of records properly identified as a non-match.
  • False Positives – the number of records identified as a match that should have been identified as a non-match.
  • False Negatives – the number of records identified as a non-match that should have been identified as a match.
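To make these definitions concrete, here is a minimal Python sketch of how the statistics above might be tallied by comparing a vendor's output against your private answer key.  The label strings, the per-record dictionary layout, and the function names `score` and `timed` are all hypothetical choices for this example.

```python
import time

# Maps each (assumed) label to the statistic counted when it is correct.
CORRECT = {
    "match": "matches",
    "potential": "potential_matches",
    "non-match": "non_matches",
}


def score(answer_key, predicted):
    """Tally the benchmark statistics for one vendor run.

    answer_key -- dict mapping record id to the expected label
    predicted  -- dict mapping record id to the label the vendor produced
    """
    stats = {name: 0 for name in CORRECT.values()}
    stats["false_positives"] = 0
    stats["false_negatives"] = 0

    for record_id, expected in answer_key.items():
        actual = predicted.get(record_id)
        if actual == expected:
            stats[CORRECT[expected]] += 1
        elif actual == "match" and expected == "non-match":
            stats["false_positives"] += 1
        elif actual == "non-match" and expected == "match":
            stats["false_negatives"] += 1
        # Other disagreements (for example "potential" vs. "match") fall
        # outside the listed statistics and are not counted here.
    return stats


def timed(run_matching, processing_set):
    """Run a matching function and return (predictions, elapsed seconds)."""
    start = time.perf_counter()
    predicted = run_matching(processing_set)
    return predicted, time.perf_counter() - start
```

Calling `timed` with each vendor's matching routine and then passing the predictions to `score` would give one comparable row of statistics per solution.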

Benchmark Guidelines

Although benchmark preparation using the training set can be performed off-site, it is highly recommended that the actual benchmark using the processing set always be performed on-site.  Whenever possible, the benchmark should be executed on the same machine.  If this is not possible, then comparable machines should be used.

The Lightning Round

The final test in your data matching benchmark should be to simulate changing business requirements.  To do this, you would change some of the annotations in the training set.  You would also need to have an alternate answer key for the processing set available.  Similar to the original test, you would only provide the training set to the vendors.

This final test should be required to be performed on-site.  The additional benchmark measurement would be the modification time needed to prepare for the second test on the processing set.

Even better, you can make this phase of the benchmark the lightning round – by providing only a single business day for making the changes.

What else do you think should be included in a data matching benchmark?  Have you used a benchmark during vendor selection?

Please share your thoughts and experiences.

About Netrics HD

Data matching is a fundamental operation in many applications, from improving data quality to implementing master data management. Stef Damianakis, CEO of Netrics, a world leader in matching technology, shares his thoughts on the state of the technology and business of data matching.
