Filed under: Data Matching

Fundamental Requirements for Data Matching Models

September 28th, 2009

Data matching determines whether two or more records should be linked, are duplicates, or represent the same entity.  There are many different approaches to data matching.  In Machine Learning, advanced mathematical techniques are used to construct a data matching model of the way that humans perceive similarity.

In this post, I want to discuss what any data matching model is attempting to achieve – since the reality is that all approaches to data matching construct “models”.

Input from Subject Matter Experts (SME)

Before any data matching model can be constructed, interviews must be conducted with subject matter experts possessing a business understanding of the data.  All organizations in every industry have unique data characteristics and unique data challenges.  An effective data matching model must capture the tacit knowledge of subject matter experts without losing anything in translation when instantiating business knowledge into a technological implementation.

A History Lesson

All data matching models start with a history lesson.  A training set is collected from existing data.  Typically, at least a few hundred records are collected, but a few thousand would be better.  The real distinction of a training set is that it has an answer key.  In other words, it contains annotations indicating which training records should be considered matches, potential matches, and non-matches.  The training set allows developers to create an initial data matching model that produces expected results.
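To make the idea of an answer key concrete, here is a minimal Python sketch of an annotated training set.  The `Label` and `AnnotatedPair` names, the field names, and the sample records are all assumptions made up for illustration; they do not come from any particular matching product.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    MATCH = "match"
    POTENTIAL_MATCH = "potential match"
    NON_MATCH = "non-match"


@dataclass(frozen=True)
class AnnotatedPair:
    """Two records plus the answer-key annotation for the pair."""
    record_a: dict
    record_b: dict
    label: Label


# Hypothetical training records; a real set would hold hundreds or thousands.
training_set = [
    AnnotatedPair({"name": "Jane Smith", "city": "Boston"},
                  {"name": "Jane Smith-Brown", "city": "Boston"},
                  Label.MATCH),
    AnnotatedPair({"name": "J. Jackson", "city": "New York"},
                  {"name": "John Jackson", "city": "Newark"},
                  Label.POTENTIAL_MATCH),
    AnnotatedPair({"name": "John Jackson", "city": "New York"},
                  {"name": "Mary Plowman", "city": "Chicago"},
                  Label.NON_MATCH),
]

# How many examples of each annotation the answer key contains.
label_counts = Counter(pair.label for pair in training_set)
```

Counting the labels is a quick sanity check that the answer key covers all three outcomes and not just the easy matches.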

Predicting the Future

It is easy to pass a test when you already know the answers – and maybe you even get a few wrong just to avoid suspicion that you may have cheated.  However, the real test is whether you can use what you have learned from the past to predict the future.  A data matching model is ready when it generalizes, that is, when it correctly evaluates records that were not included in its training set (this is sometimes loosely described as the model having converged).
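One common way to check this is to hold some annotated records back, fit the model on the rest, and score it only on the records it has never seen.  The Python sketch below does that with a deliberately simple stand-in model: a single similarity cutoff built on the standard library's `difflib.SequenceMatcher`.  The sample names and the midpoint rule for choosing the cutoff are assumptions for illustration, not a recommendation for a real matching engine.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fit_threshold(training_pairs):
    """Place the cutoff midway between the weakest match and the strongest non-match."""
    match_scores = [similarity(a, b) for a, b, is_match in training_pairs if is_match]
    non_match_scores = [similarity(a, b) for a, b, is_match in training_pairs if not is_match]
    return (min(match_scores) + max(non_match_scores)) / 2


def accuracy(threshold, pairs):
    """Fraction of pairs whose predicted label agrees with the answer key."""
    correct = sum((similarity(a, b) >= threshold) == is_match for a, b, is_match in pairs)
    return correct / len(pairs)


# Hypothetical annotated pairs: (value_a, value_b, is_match).
training_pairs = [
    ("Jon Smith", "John Smith", True),
    ("Acme Corp", "Acme Corporation", True),
    ("Jane Doe", "Robert Brown", False),
    ("Acme Corp", "Zenith Ltd", False),
]
# Pairs the model never saw while the cutoff was being chosen.
holdout_pairs = [
    ("Katherine Jones", "Catherine Jones", True),
    ("Alice Wong", "Bob Marley", False),
]

threshold = fit_threshold(training_pairs)
training_accuracy = accuracy(threshold, training_pairs)
holdout_accuracy = accuracy(threshold, holdout_pairs)
```

The number that matters is `holdout_accuracy`.  A model that scores well on `training_pairs` but poorly on `holdout_pairs` has memorized the answer key instead of learning from it.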

Conclusion

Regardless of the approach being used, all data matching applications construct a data matching model using SMEs and historical training data.  The two challenges are:

  1. How much effort is required to build and maintain the matching model?
  2. How well does the constructed model match input data that it hasn’t previously seen?

An optimal data matching model is accurate, easy to build, easy to update/maintain, and can seamlessly adapt to new data.

Sounds demanding because it is… but with innovation, it can be done!

Posted in Data Matching | No Comments »

A Data Matching Benchmark

September 7th, 2009

A benchmark provides the ability to perform consistent and repeatable comparisons by running tests against it.  With many different approaches to data matching and many software vendors offering commercial solutions, an industry standard data matching benchmark would certainly be very useful.

However, in this post, I want to discuss recommendations for establishing your own benchmark for comparing and selecting the best data matching solution for your specific business needs.

Benchmark Data

The only way to truly determine that a data matching solution will satisfy your business requirements is to test it against your actual data.  Therefore, it is critically important to collect real data for benchmarking purposes.  At least a few hundred records should be collected, but a few thousand would be better.  The collected data should be separated into two benchmark data sets:

  1. Training Set
  2. Processing Set

The real distinction between these two data sets is that the training set contains only a small percentage of the total records and that it has an answer key.  In other words, the training set contains annotations indicating which records should be considered matches, potential matches, and non-matches.

The training set allows the competing data matching solutions to prepare for the benchmark by performing their own independent development and testing against the training set, which is intended to represent the expected results in alignment with your business definitions of what constitutes acceptable matches.  Since the training set contains the answer key, the vendor can perform this preparation without your direct involvement.

The processing set will be used for the actual benchmark tests.  It should contain the full volume of collected data and, most importantly, no answer key.  Obviously, you will need an answer key for the processing set yourself.  However, you should not provide it to the vendors.  The most important aspect of the benchmark is to measure whether the data matching solution can produce the correct results by extrapolating, from the training set, the correct logic to apply to the processing set.

Benchmark Measurements

The benchmark results should be measured against the following statistics:

  • Processing Time – execution time for the processing set.
  • Matches – the number of records properly identified as a match.
  • Potential Matches – the number of records properly identified as a potential match.
  • Non-Matches – the number of records properly identified as a non-match.
  • False Positives – the number of records identified as a match that should have been identified as a non-match.
  • False Negatives – the number of records identified as a non-match that should have been identified as a match.
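The counts above can be tallied by comparing a vendor's output with your private answer key for the processing set.  Here is a minimal Python sketch; the pair identifiers, the label strings, and the rule that unreported pairs count as non-matches are assumptions chosen for illustration.  Processing time would be measured separately, around the vendor's actual run.

```python
from collections import Counter

MATCH, POTENTIAL, NON_MATCH = "match", "potential match", "non-match"


def score_benchmark(predicted, answer_key):
    """Tally benchmark statistics for record pairs keyed by a pair identifier."""
    stats = Counter()
    for pair_id, truth in answer_key.items():
        # Assumption: a pair the vendor did not report counts as a non-match.
        guess = predicted.get(pair_id, NON_MATCH)
        if guess == truth:
            stats[f"correct {truth}"] += 1
        elif guess == MATCH and truth == NON_MATCH:
            stats["false positive"] += 1
        elif guess == NON_MATCH and truth == MATCH:
            stats["false negative"] += 1
        else:
            stats["other disagreement"] += 1
    return stats


# Hypothetical answer key and vendor output for five record pairs.
answer_key = {1: MATCH, 2: MATCH, 3: POTENTIAL, 4: NON_MATCH, 5: NON_MATCH}
predicted = {1: MATCH, 2: NON_MATCH, 3: POTENTIAL, 4: MATCH, 5: NON_MATCH}

stats = score_benchmark(predicted, answer_key)
```

Running the same function over every vendor's output keeps the comparison consistent and repeatable, which is the whole point of a benchmark.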

Benchmark Guidelines

Although benchmark preparation using the training set can be performed off-site, it is highly recommended that the actual benchmark using the processing set always be performed on-site.  Whenever possible, the benchmark should be executed on the same machine.  If this is not possible, then comparable machines should be used.

The Lightning Round

The final test in your data matching benchmark should be to simulate changing business requirements.  To do this, you would change some of the annotations in the training set.  You would also need to have an alternate answer key for the processing set available.  Similar to the original test, you would only provide the training set to the vendors.

This final test should be required to be performed on-site.  The additional benchmark measurement would be the modification time needed to prepare for the second test on the processing set.

Even better, you can make this phase of the benchmark the lightning round – by providing only a single business day for making the changes.

What else do you think should be included in a data matching benchmark?  Have you used a benchmark during vendor selection?

Please share your thoughts and experiences.


Humans adapting to computers – instead of the other way around

August 24th, 2009

Recently, my wife and I were shopping for eye glasses at a popular national retailer.  After finding frames she liked, we walked up to the sales counter to make our purchase.

“Have you purchased from us before?” the sales clerk asked my wife.  In other words, he was asking if she was already a customer that he should be able to find in their customer database.

“Yes,” my wife responded, “in fact, I bought my current glasses in this very store last year.”

“Excellent,” the sales clerk said as he prepared to enter her information into the computer, “we really appreciate that you have returned to purchase from us again, what’s your last name?”

“Damianakis.”

“Uh-huh. Could I see your driver’s license?”

No, my wife wasn’t getting carded before she could buy a pair of glasses.  The sales clerk wanted to make sure he typed in our last name exactly as it is spelled.

“Hmmm,” the sales clerk muttered while staring at the computer screen, “I can’t seem to find your account.”  To his credit, he decided (I swear that I just stood there quietly smiling) to try intentionally misspelling our last name – five different ways.

“Are you absolutely sure that you purchased your glasses from this store?”

“Yes, absolutely.”

“Oh well,” the sales clerk responded, “I guess I will just have to enter you into the system again.  For the inconvenience, I will take another 5% off the price.”

I couldn’t help but think to myself – I just witnessed the creation of a duplicate customer – despite the diligent efforts of a front line employee!

It can sometimes be difficult to make a compelling business case for data quality.  But what company doesn’t value repeat business?  If your current reports are telling you that only 15% of new sales this year have been from repeat customers, how many of those apparently new customers are, in fact, already customers?

Furthermore, isn’t it time that we get computer systems to adapt to us, instead of us always adapting to their limitations?  In this particular case, the sales clerk knew to try several intentional misspellings but was unable to find the right record.  That’s backwards – the clerk should have entered the information he knew and the computer should have done the hard work to find the right record.

There is a better way!  What are we waiting for?  Let’s eradicate this problem!

Posted in Data Matching, It Happened | No Comments »

Drowning in Imperfect Data

August 17th, 2009

In a recent blog post Your Friend the Algorithm, Paul Barsch explains:

“With exponential trends of data growth and computational power colliding…companies are using technology…and sophisticated mathematical procedures to analyze data and make better decisions.”

Data frequently contains numerous variations caused by different conventions, lack of standards, omissions, and other inconsistencies.

Traditional approaches to data matching have heavily relied on data standardization to prepare records for matching.  This preparation creates a consistent format that allows for more direct comparisons on parsed attributes with standardized values.

However, the problem with the traditional approach is that “the world is literally drowning in data” explains Barsch.  “There’s too much data, and not enough analysis.”

Advancements in data matching technology using sophisticated mathematical algorithms are providing the capability to make better data-driven business decisions – without the prerequisite correction of data’s inherent imperfections.

According to Gartner Research, the volume of enterprise data doubles every 18 months.  There is also a rapidly growing need for real-time analysis of these burgeoning data volumes in order for companies to remain competitive in a constantly evolving marketplace.

“Algorithms help tackle complicated challenges,” explains Barsch.  “As data volumes and decision options increase, algorithms and the systems that run them take on added importance.”

The need to make imperfect data perfectly usable is becoming more important than ever.


Matches Created

August 10th, 2009

Bill James is a baseball writer, historian, and statistician, who is perhaps best known for pioneering the field of sabermetrics, which as he defined it is “the search for objective knowledge about baseball.”

James uses analysis of baseball statistics to evaluate the contribution of an individual baseball player’s performance to their team’s ability to win a game.  For hitters, he believed that “a hitter should be measured by his success in that which he is trying to do…create runs.”

To measure this, James created a new baseball statistic that he called Runs Created:

(Hits + Walks) x Total Bases / (At Bats + Walks)

At the heart of this formula is the premise that a player’s ability to get on base is crucial to their team’s ability to score runs and win games.
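The formula is simple enough to work through by hand or in a few lines of Python.  The season line below is hypothetical, chosen only to show the arithmetic.

```python
def runs_created(hits, walks, total_bases, at_bats):
    """Basic Runs Created: (Hits + Walks) x Total Bases / (At Bats + Walks)."""
    return (hits + walks) * total_bases / (at_bats + walks)


# Hypothetical season line, not a real player's statistics.
rc = runs_created(hits=150, walks=50, total_bases=250, at_bats=500)
# (150 + 50) * 250 / (500 + 50) = 50000 / 550, roughly 90.9 runs
```

Notice that walks appear in both the numerator and the denominator, which is how the formula rewards a hitter for getting on base by any means.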

Although that may sound rather obvious, the formula’s emphasis on statistics not typically considered important (e.g. Walks) was antithetical to baseball’s “conventional wisdom.”

Traditionally, statistics such as Batting Average (Hits / At Bats) and RBI (runs batted in) were considered tried and true techniques for evaluating hitters.

Additionally, there were the “intangibles” observed by scouts and coaches who trusted their “gut” more than nerdy number crunching.

After all, as these experts would argue – baseball is played on a field, not on a calculator.

All of this was detailed in Moneyball: The Art of Winning an Unfair Game, the excellent 2003 book by Michael Lewis.

Matches Created

In data matching, where statistical properties of fields and their values are used to measure the contribution each field makes to the likelihood that a matching record has been found, success should also be measured by what we are trying to do…create matches.

Tried and true techniques continue to be sought for the complex challenge of creating matches, with many of these techniques coming from advanced mathematics.

When you look under the hood of some of these new approaches to data matching, you might find some fields and their statistical properties being used in ways antithetical to “conventional wisdom.”

Initially, your “gut” might tell you these approaches simply don’t sound like they could possibly create acceptable matches.

However, success is truly measured by evaluating the match results – not the data matching techniques.

In some ways, it brings to mind what the 19th century poet John Keats referred to as Negative Capability:

“Capable of being in uncertainties, mysteries, and doubts without any irritable reaching for fact and reason.”

Of course, Keats was advocating an open-mindedness to new concepts in literature and philosophy, where if something speaks to you of a truth that you could accept but not explain, why bother with trying to explain it?

Therefore, if a new approach to data matching creates matches that you can accept, does it really matter what algorithm was used?

Perhaps we should follow Bill James’s lead and create a new statistic called Matches Created?


A more precise, but less certain world

July 6th, 2009

I am reading the excellent book Super Crunchers by Ian Ayres, which has the great subtitle:

Why Thinking-By-Numbers is the New Way To Be Smart

“The heroic conception of expertise,” explains Ayres, “was that of an expert giving settled answers.  People are more likely to think of statistics as infinitely malleable and subject to manipulation.  This is a more precise, but less certain world.  The classical conception of probability is a world of absolutes.”

I couldn’t help but think of the classical approaches to data matching that rely largely on exact matching techniques to determine if two or more records should be linked, are duplicates, or represent the same entity.

“To the classicist, the probability of my currently having cancer is either 0 or 100 percent,” explains Ayres, “but we are all frequentists now.  Experts used to say Yes or No.  Now we have to contend with estimates and probabilities.”

I think that is exactly how many people feel about statistical data matching – they have to contend with estimates and probabilities.

Although a potential match with a statistical probability of less than 100 percent is less certain than a 100% exact match, it is also more precise, because it tells you how reliable its prediction is by providing a confidence level.

“This ability to report a confidence level in predictions underscores one of the most amazing things about the technique,” explains Ayres.  “If the prediction is imprecise (say because of poor or incomplete data), [the statistical technique] itself will be the first one to tell you not to rely on it.  When was the last time you heard a traditional expert [or a classical approach to data matching] tell you the precision of their estimate?”

I believe that when it comes to data matching, we all need to be more skeptical about certainty and more comfortable with precision – and to achieve this, we must continue the pursuit of innovation using mathematical techniques.


Narrative Fallacy and Data Matching

June 12th, 2009

In his excellent 2007 book The Black Swan: The Impact of the Highly Improbable, Nassim Nicholas Taleb used the term narrative fallacy to explain how humans create a story to make their prior predictions sound rational in retrospect.  Taleb was writing about how our general tendency to oversimplify complexity and our lack of appreciation for the role that randomness plays in our lives combine to make us so poor at making predictions, as well as at explaining why our predictions are frequently wrong.

Humans are relatively good at making intuitive judgments but relatively bad at explaining their intuition.  In other words, we usually know what we like but we can’t always explain why.

Data matching determines whether two or more records should be linked, are duplicates, or represent the same entity.  At a high level, there are two approaches to data matching:

  1. Rules Based (e.g., deterministic and probabilistic)
  2. Machine Learning

Rules Based

In the Rules Based approach to data matching, humans are asked to define specific sets of instructions (i.e. rules or matching criteria).

The users are asked to define preliminary rules with a theoretical exercise based on a metadata discussion about what specific fields should be used and how they should be weighted (i.e. how important each individual field is in determining that compared records are a match).  The developers then code a prototype application based on these rules and preliminary data matching results are presented for review.  During the review, users are asked to judge whether or not the matches are correct and explain why.
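A prototype built from such a discussion often boils down to something like the Python sketch below.  The field names, the point values, and the threshold are hypothetical; they stand in for whatever the users agreed on in the metadata discussion.

```python
# Hypothetical field weights (in points) agreed on in a rules workshop.
FIELD_WEIGHTS = {"last_name": 40, "first_name": 20, "postal_code": 30, "phone": 10}
MATCH_THRESHOLD = 70  # Hypothetical cutoff: at least 70 of 100 points.


def rule_score(record_a, record_b, weights=FIELD_WEIGHTS):
    """Sum the weights of the fields on which the two records agree exactly."""
    return sum(
        weight
        for field, weight in weights.items()
        if record_a.get(field) and record_a.get(field) == record_b.get(field)
    )


# Hypothetical records that differ only in the first name.
record_a = {"last_name": "Jackson", "first_name": "John",
            "postal_code": "10001", "phone": "555-0100"}
record_b = {"last_name": "Jackson", "first_name": "Jack",
            "postal_code": "10001", "phone": "555-0100"}

score = rule_score(record_a, record_b)   # 40 + 30 + 10 = 80
is_match = score >= MATCH_THRESHOLD      # True with these weights
```

Every number in `FIELD_WEIGHTS` and `MATCH_THRESHOLD` is something the users have to justify during the review, which is exactly where the debate starts.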

The explanations often lead to a considerable debate.  The users are forced to create a narrative about specific fields and their weights.  Forced to focus on the rules, the users can become so distracted that they actually ignore the data.  Instead of trying to reach agreement on what records should match, the users are trying to reach agreement on what the matching criteria should be.

Based on their feedback, developers make modifications to the rules.  This cycle of “modify the rules, re-run the application, review the results” usually requires a considerable amount of both time and data.  Additionally, the “completed” rule set is neither very effective with new data nor easily adaptive to the addition of new fields.

Machine Learning

In the Machine Learning approach to data matching, humans are asked to annotate data examples.

The users are simply asked to mark examples with a Yes, No, or Maybe – without explaining why.  The developers do not have to create or maintain any rules.  The matching engine automatically learns from the annotated examples and constructs a mathematical model of the way the users perceive similarity.

An optimal model can be quickly constructed using only a few hundred to a few thousand annotated data examples.  The resulting mathematical model automatically uses all of the available fields, determines the optimal weighting for each field, effectively matches new data and adapts to the addition of any new fields.
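To give a feel for what learning from annotated examples means, here is a toy Python sketch that uses a perceptron, one of the simplest learning algorithms.  It is only a stand-in: commercial matching engines use far richer models, and the three fields and seven annotated examples are assumptions made up for illustration.  Each example records whether two records agreed on name, email, and city, plus the user's Yes or No.

```python
FIELDS = ("name", "email", "city")


def predict(weights, bias, features):
    """Predict a match when the weighted field agreement is positive."""
    return sum(w * x for w, x in zip(weights, features)) + bias > 0


def train_perceptron(examples, epochs=20):
    """Learn one weight per field from (agreement_vector, is_match) examples."""
    weights, bias = [0] * len(examples[0][0]), 0
    for _ in range(epochs):
        mistakes = 0
        for features, is_match in examples:
            error = int(is_match) - int(predict(weights, bias, features))
            if error:
                mistakes += 1
                weights = [w + error * x for w, x in zip(weights, features)]
                bias += error
        if mistakes == 0:  # Every annotated example is now classified correctly.
            break
    return weights, bias


# Hypothetical annotations: 1 means the field agreed, 0 means it did not.
examples = [
    ((1, 1, 1), True),
    ((1, 1, 0), True),
    ((0, 1, 1), True),
    ((1, 0, 1), False),
    ((1, 0, 0), False),
    ((0, 0, 1), False),
    ((0, 0, 0), False),
]

weights, bias = train_perceptron(examples)
learned = dict(zip(FIELDS, weights))  # {'name': 0, 'email': 1, 'city': 0}
```

Nobody told the model that email was the deciding field.  It worked that out from the Yes and No annotations alone, and nobody had to explain why.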

Conclusion

By eliminating the possibility of narrative fallacy, Machine Learning is a vastly superior approach to data matching.  By dramatically reducing the amount of human intervention (by both users and developers) required to achieve quality results, Machine Learning saves time and money, and provides the capability to make better data-driven business decisions.

Please share your thoughts and experiences with data matching.


What’s in a Name? (Part III)

June 8th, 2009

Names aren’t just for people.  We use them to describe organizations (i.e. businesses), brands, products, and many more things.  When it comes to data matching, they may be just as important as persons.  For example, if my goal is to match entries in a CRM for a B to B business, I’m going to care as much about the organization, department, and title as I will about the current incumbent’s name.  If the task is to consolidate purchasing, component names and part numbers may be more important than people names.  For this, and many more reasons, it’s important that my matching software work with a broad range of data entities, and not be locked into only the most common.

As a proxy for the kinds of challenges you might face in matching components, I thought it would be entertaining to look at the variations on the Internet for appliance descriptions.  It’s something that’s on my mind, as I’m in the market right now.  My wife loves the “look” of the Fisher & Paykel line:  for those of you who aren’t in the appliance market, F&P is a New Zealand brand that has captured a pretty good share of the high-end appliance market in recent years.

I did a Google search for Fisher & Paykel refrigerators, went onto the F&P site, and copied the first two lines of the “official” description for the model we’re thinking of buying:

Fisher & Paykel E522BRXU 17.6 cu. Ft EZKleen Stainless Steel  (reference string)

Then I went to a series of additional sites and found the closest model, copying the text I found in the first two lines of their listings.  Here’s what I found in the next 4 sites I checked:

  1. Fisher Paykel 17.6 Cu Ft ActiveSmart Stainless Flat Door Left Hinge Refrigerator With Ice And Water Dispenser – E522BLXFDU  (homeappliancecenter.com)
  2. Fisher & Paykel  17.6 Cu. Ft.  Bottom Mount Refrigerator (Color: Stainless) Item #:278747 Model:E522BRXFDU (Lowes.com)
  3. Fisher Paykel E522BRXU 17.6 cu. ft. Freestanding Bottom-Freezer Refrigerator with Active Smart System, Adjustable Glass Shelves, External Water Dispenser and Curved Door Design  (ajmadison.com)
  4. Fisher and Paykel E522BLXU (17.6 cu. ft.) Bottom Freezer Refrigerator (epinions.com)

For me, manually looking up each of these appliances, the variations I found were no problem.  But how well would your matching software perform on the same data set?  Note the variations:

  • 3 variations on brand:  “Fisher & Paykel” (correct), “Fisher Paykel”, “Fisher and Paykel”
  • 4 variations on model number:  E522BRXU, E522BLXU, E522BRXFDU, E522BLXFDU.  These are all basically the same refrigerator, but the R vs. L in the model number denotes a right-hand versus left-hand door hinge, and the FD models have a slightly different door handle.
  • 5 different variations on “17.6 cubic foot”.  Surprisingly to me, this is the most consistent of any piece of information, but it appears 5 different ways:  cu. Ft / Cu Ft / Cu. Ft. / cu. ft. / (cu. ft.)
  • Note that “EZKleen Stainless Steel”, which is the dominant descriptor in the “official” Paykel listing, doesn’t appear in any of the others, though “Stainless” by itself appears in 2 additional descriptions
  • “ActiveSmart”, which appears in the 4th line of the official Paykel listing (and therefore wasn’t quoted here) appears in two descriptions, in two ways:  “Active Smart” and “ActiveSmart”.
  • “Bottom” appears in three descriptions (but not in the original) as:  “Bottom Mount”, “Bottom Freezer”, and “Bottom-Freezer”
  • “Dispenser” appears in two descriptions, in two variations: “Ice and Water Dispenser” / “External Water Dispenser”
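To see what handling these variations by hand involves, here is a Python sketch that pulls brand, capacity, and model number out of the five descriptions with regular expressions.  The patterns are my own assumptions, written to fit only these listings (the third listing is shortened), so treat them as an illustration of the effort involved and not as a general solution.

```python
import re

LISTINGS = [
    "Fisher & Paykel E522BRXU 17.6 cu. Ft EZKleen Stainless Steel",
    "Fisher Paykel 17.6 Cu Ft ActiveSmart Stainless Flat Door Left Hinge Refrigerator "
    "With Ice And Water Dispenser \u2013 E522BLXFDU",
    "Fisher & Paykel  17.6 Cu. Ft.  Bottom Mount Refrigerator (Color: Stainless) "
    "Item #:278747 Model:E522BRXFDU",
    "Fisher Paykel E522BRXU 17.6 cu. ft. Freestanding Bottom-Freezer Refrigerator",
    "Fisher and Paykel E522BLXU (17.6 cu. ft.) Bottom Freezer Refrigerator",
]

# Hand-written patterns covering only the variations seen in these listings.
BRAND = re.compile(r"^fisher\s*(?:&|and)?\s*paykel", re.IGNORECASE)
CAPACITY = re.compile(r"(\d+(?:\.\d+)?)\s*cu\.?\s*ft", re.IGNORECASE)
MODEL = re.compile(r"\bE\d{3}[A-Z]+\b")


def parse_listing(text):
    """Extract a normalized brand, capacity in cubic feet, and model number."""
    capacity = CAPACITY.search(text)
    model = MODEL.search(text)
    return {
        "brand": "Fisher & Paykel" if BRAND.match(text) else None,
        "capacity_cu_ft": float(capacity.group(1)) if capacity else None,
        "model": model.group(0) if model else None,
    }


parsed = [parse_listing(text) for text in LISTINGS]
```

Three patterns cover one brand, one unit of measure, and one model-number scheme.  Each pattern handles only the variations seen so far, and the next supplier's listing may break any of them.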

Now, imagine if these descriptions weren’t off the Internet, but were component descriptions from 5 different assembly operations, and that these operations, collectively, purchased tens of thousands of different components.  Could your software match them automatically?

It’s one thing for software to come with a built-in module customized for person-name matching.  But complex enterprises work with thousands of different data entities, any and all of which require matching.

What’s in a Name? (Part II)

June 1st, 2009

The name of a man is a numbing blow from which he never recovers.  ~Marshall McLuhan

In the first part of this series, we spoke of the difficulties of dealing with proper names, even though we only considered the case when names remained unchanged.  Of course, we all know that people sometimes change their names.

In US and Canadian practice, the most common event signaling a change is marriage, and conventional practice is for the woman to drop her last name and adopt her husband’s.  While this still happens, and is complicated enough to wreak havoc on data quality, the world isn’t nearly that simple any more.  A survey of postings to alt.wedding reveals 8 additional variant naming conventions for when “Jane Smith” marries “Michael Brown”*:

  1. Wife Hyphenates The Two Names (Jane Smith becomes Jane Smith-Brown).
  2. Wife Uses Birth Name as Middle Name (Jane Smith becomes Jane Smith Brown, with no hyphenation).
  3. Husband and Wife Keep Their Own Birth Names (Jane Smith stays Jane Smith)
  4. Wife takes Husband’s Name Socially, Keeps Own Name Professionally (Jane Smith is Jane Smith at work, but Jane Brown otherwise)
  5. Husband takes Wife’s Name (Michael Brown becomes Michael Smith)
  6. Husband and Wife Both Hyphenate (Jane and Michael become The Smith-Browns)
  7. Husband and Wife take Each Other’s Names as Middle Names (Jane becomes Jane Brown Smith, Michael becomes Michael Smith Brown) – last names are still different, but there is the symbolism of having taken each other as part of themselves.
  8. Husband and Wife Pick a New Name

For a moment, consider the impact on data quality of all of these variations.  Take one of the most common (#4) where a woman maintains her maiden name professionally.  This means that you’re stuck with trying to synchronize data records for different names depending on whether Jane views a given relationship as professional or personal.  It may be obvious to Jane which name to use where, but this is unlikely to be transparent to your business.  A financial institution would need to deal with Jane Smith for “professional” credit cards and bank accounts, and Jane Brown for “personal” accounts.  A media company might sell some products to Jane Smith, and others to Jane Brown.  Not to mention the issues of long-standing relationships (pre-marriage) which would have started as Jane Smith and now need to be transitioned, and linked with new ones that now start as Jane Brown (except for the records that need to stay Jane Smith).

Obviously, this one example is just scratching the surface.  When you actually capture the changed name, your own staff can generate additional variations.  Will you actually capture the hyphen, or perhaps add a hyphen that’s not supposed to exist?  Will the middle name actually get fielded into the middle name field, or show up as one of two last names?  Will you replace the old middle name with the new one, or add it?

The many variations here point out the advantage of the Netrics Matching Platform’s approach to data matching, which is to look holistically at the data record and look for similarities wherever they occur.  Then you don’t care much at all if you’ve captured Jane | Smith | Brown  or Jane | Smith Brown or Jane | Smith-Brown.  It’s a powerful approach, made possible by the flexibility and computational simplicity of Netrics’ underlying approach.
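The following Python sketch illustrates the general idea of ignoring field boundaries.  It is not the Netrics algorithm, whose details are not described here; pooling tokens and comparing the sets with Jaccard similarity is a simple technique I am substituting for illustration.

```python
import re


def record_tokens(fields):
    """Pool every field into one set of lowercase tokens, splitting on spaces and hyphens."""
    tokens = set()
    for value in fields:
        tokens.update(t for t in re.split(r"[\s\-]+", value.lower()) if t)
    return tokens


def overlap(fields_a, fields_b):
    """Jaccard similarity of the pooled token sets (1.0 means identical tokens)."""
    a, b = record_tokens(fields_a), record_tokens(fields_b)
    return len(a & b) / len(a | b) if a | b else 0.0


# The three ways of capturing the same name, as tuples of field values.
captured = [
    ("Jane", "Smith", "Brown"),
    ("Jane", "Smith Brown"),
    ("Jane", "Smith-Brown"),
]

scores = [overlap(captured[0], other) for other in captured]  # all 1.0
before_marriage = overlap(("Jane", "Smith"), ("Jane", "Smith-Brown"))  # 2 of 3 tokens
```

Because the comparison never asks which field a token came from, it does not matter whether the hyphen was captured or where the birth name ended up.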

You can also imagine the implication of these new naming variations on the complexity of rules-based systems.  Imagine trying to sort out all 8 variations in a name matching rules-base.  And we haven’t touched on non-US name changing conventions which can be quite different.

Finally, of course, there’s the issue that it’s not just marriage that causes people to change names.  In any culture, someone can adopt a nickname (formally or informally), and this variation can leak into your corporate data.  And there may be other reasons for name changes as well.

A few years ago, Netrics performed some data cleansing work for a hospital in Arizona.  As our own data experts were performing QA testing on the resulting identified duplicates, we thought we had a big problem:  the Netrics Decision Engine was identifying pairs of male records with the same first name and different last names as duplicates.  We were very concerned until we called and spoke with our client.  It turns out that when some Native American males leave the reservation, they adopt an “American” last name to use off the reservation.  The Netrics Decision Engine, which creates its own mathematical model using Machine Learning, had figured out this subtlety in the data – something that we did not know, and more importantly, something that we did not have to ask of our client’s data experts.

Imagine trying to handle that with a probabilistic rules-based system.  Moreover, how can one know all of these cases, for all of the different data sources and all of the different data types, a priori?

Our mantra is: Learn, don’t guess!

*Thanks to M. Elizabeth Hunter and Sonja Kueppers on soc.couples.wedding for compiling this list.


What’s in a Name?

May 26th, 2009

What’s in a Name?

‘Tis but thy name that is my enemy;
… Oh, be some other name!
What’s in a name? that which we call a rose
By any other name would smell as sweet;
So Romeo would, were he not Romeo call’d…
–William Shakespeare, Romeo and Juliet, 1594

When Shakespeare penned this soliloquy for Juliet, the practice of inherited surnames had been fully adopted in England for only about 150 years.  Previously, English practice had varied between given names only (e.g. John), use of father’s name (e.g. John Jackson), profession (John Plowman, John Shepherd), and especially for nobles, names which identified places or events (John Essex).  Echoes of all of these practices can be found in modern surnames, especially those of British origin.

Of course, at the time surnames became established (roughly 1450), greater London, one of the largest cities in the western world, had a population of only about 60,000, compared to roughly 7.5 million today (and a peak population of 8.3 million in 1951).  So it shouldn’t be a surprise that an identification system would have problems when confronted with a population of 100-150x what it was “designed” to handle.  A web search for “John Jackson” in today’s NYC returned 23 names, and more than 300 for the metro area.

Still, it’s a system we’re stuck with.  Just about everything else is either private (e.g. SSN) or subject to change.  One’s name is, broadly speaking, just about the only public identifier that stays with you for your entire life.

Yet as bad as the basic system is, it’s made worse by the entropy of the real world.  Our “John Jackson” might be listed in some databases as Jack Jackson, John E. Jackson, J. Jackson, or J.E. Jackson.  And that’s a pretty standard Anglo-Saxon name.  What’s more difficult to deal with are names which originate outside of the US/UK, which often have multiple spelling variations when transliterated to English spellings.  For example, Muhammad is the most commonly given name in the world (according to the Columbia Encyclopedia).  According to the Social Security Administration, it ranked as the 639th most popular name for newborns in the United States in 2006, not to mention “Mohammad” (589th) and “Mohammed” (633rd).  Indeed, Wikipedia lists 15 additional variations, not including the three just mentioned:  Mohamed, Muhammed, Mahommed, Muhamed, Mehmed, Mehmet, Mohand, Mahometus, Maometto, Moameth, Mahoma, Mukhammad, Maxamed, Mamadou, Makhambet.  Of course, there could be misspellings or typos for any of these names.

No wonder name-based “watch lists” have proven difficult to manage.  For example, the Terrorist Watch list has something like 1.1 million records describing approximately 400,000 unique individuals and has been widely criticized.  For computer name searches to work reliably, two conditions need to be satisfied.  The system needs to:

  1. Search variations automatically.  Imagine an analyst missing a search because he entered only 3 or 4 variations of “Mohamed”, and the record for which he was looking was a 5th.
  2. Filter results accurately.  Obviously, any name database of any size is going to return hundreds of “hits” on any reasonably common name.  On the one hand, presenting hundreds of hits to a typical user is a waste of time, while, on the other hand, filtering out the correct result before presenting it to the user is worse.
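Both conditions can be illustrated with a small Python sketch built on the standard library's `difflib.SequenceMatcher`.  The name list, the 0.6 cutoff, and the result limit are assumptions chosen for illustration; a production system would need a similarity measure and cutoff tuned to its own data.

```python
from difflib import SequenceMatcher


def search_names(query, names, cutoff=0.6, limit=5):
    """Return up to `limit` (score, name) pairs at or above `cutoff`, best first."""
    scored = (
        (SequenceMatcher(None, query.lower(), name.lower()).ratio(), name)
        for name in names
    )
    hits = [(score, name) for score, name in scored if score >= cutoff]
    return sorted(hits, reverse=True)[:limit]


# Hypothetical name database with several transliterations and unrelated names.
NAMES = ["Mohammed", "Muhammed", "Mohamed", "John", "Jackson", "Muhammad"]

results = search_names("Muhammad", NAMES)
found = [name for _, name in results]
```

The analyst types one spelling and the search covers the variations automatically, while the cutoff and the ranking keep unrelated names such as “John” out of the results.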


About Netrics HD

Data matching is a fundamental operation in many applications, from improving data quality to implementing master data management. Stef Damianakis, CEO of Netrics, a world leader in matching technology, shares his thoughts on the state of the technology and business of data matching.
