Narrative Fallacy and Data Matching
In his excellent 2007 book The Black Swan: The Impact of the Highly Improbable, Nassim Nicholas Taleb used the term narrative fallacy to describe how humans create a story to make their prior predictions sound rational in retrospect. Taleb was writing about how our general tendency to oversimplify complexity, combined with our lack of appreciation for the role that randomness plays in our lives, makes us so poor at making predictions and at explaining why our predictions are frequently wrong.
Humans are relatively good at making intuitive judgments but relatively bad at explaining their intuition. In other words, we usually know what we like but we can’t always explain why.
Data matching determines whether two or more records should be linked, are duplicates, or represent the same entity. At a high level, there are two approaches to data matching:
- Rules Based (e.g., deterministic and probabilistic)
- Machine Learning
Rules Based
In the Rules Based approach to data matching, humans are asked to define specific sets of instructions (i.e. rules or matching criteria).
The users are asked to define preliminary rules with a theoretical exercise based on a metadata discussion about what specific fields should be used and how they should be weighted (i.e. how important each individual field is in determining that compared records are a match). The developers then code a prototype application based on these rules and preliminary data matching results are presented for review. During the review, users are asked to judge whether or not the matches are correct and explain why.
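To make the idea of fields and weights concrete, here is a minimal sketch of a weighted rule. The field names, the weights, the threshold and the use of difflib for string similarity are all assumptions chosen for illustration, not taken from any particular matching product.

```python
from difflib import SequenceMatcher

# Hypothetical weights agreed on in a rules workshop (assumed values).
FIELD_WEIGHTS = {"name": 0.5, "street": 0.3, "city": 0.2}
MATCH_THRESHOLD = 0.85  # assumed cut-off; in practice it is tuned during review


def field_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity for two field values, ignoring case."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of the per-field similarities."""
    return sum(
        weight * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in FIELD_WEIGHTS.items()
    )


def is_match(rec_a: dict, rec_b: dict) -> bool:
    """Apply the rule: records match when the weighted score reaches the threshold."""
    return match_score(rec_a, rec_b) >= MATCH_THRESHOLD
```

Every number in this sketch is something the users would have to agree on and then defend, which is exactly where the debate starts.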
The explanations often lead to a considerable debate. The users are forced to create a narrative about specific fields and their weights. Forced to focus on the rules, the users can become so distracted that they actually ignore the data. Instead of trying to reach agreement on what records should match, the users are trying to reach agreement on what the matching criteria should be.
Based on their feedback, developers make modifications to the rules. This cycle of “modify the rules, re-run the application, review the results” usually requires a considerable amount of both time and data. Additionally, the “completed” rule set is neither very effective with new data nor easily adaptable to the addition of new fields.
Machine Learning
In the Machine Learning approach to data matching, humans are asked to annotate data examples.
The users are simply asked to mark examples with a Yes, No, or Maybe – without explaining why. The developers do not have to create or maintain any rules. The matching engine automatically learns from the annotated examples and constructs a mathematical model of the way the users perceive similarity.
An optimal model can be quickly constructed using only a few hundred to a few thousand annotated data examples. The resulting mathematical model automatically uses all of the available fields, determines the optimal weighting for each field, effectively matches new data and adapts to the addition of any new fields.
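As a rough illustration of learning the field weights from annotated examples, the sketch below fits a small logistic regression with plain gradient descent. Logistic regression is only a stand-in here, since the actual model inside a matching engine is not described. The per-field similarity values and the labels are invented toy data, and the Maybe annotations are left out to keep the example short.

```python
import math

# Toy annotations (assumed data): per-field similarities for
# (name, street, city), labelled 1 for Yes and 0 for No.
ANNOTATED = [
    ((0.95, 0.90, 1.0), 1),
    ((0.90, 0.85, 0.2), 1),
    ((0.92, 0.95, 0.9), 1),
    ((0.30, 0.20, 1.0), 0),
    ((0.20, 0.40, 0.9), 0),
    ((0.40, 0.10, 0.1), 0),
]


def predict(weights, bias, features):
    """Estimated probability that a pair with these similarities is a match."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(examples, epochs=2000, learning_rate=0.5):
    """Fit one weight per field plus a bias with stochastic gradient descent."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = predict(weights, bias, features) - label
            bias -= learning_rate * error
            weights = [
                w - learning_rate * error * x for w, x in zip(weights, features)
            ]
    return weights, bias


weights, bias = train_logistic(ANNOTATED)
```

Calling predict(weights, bias, (0.93, 0.90, 0.5)) then scores a new pair without anyone having to state a rule. Nobody typed in a weight; the users only said Yes or No.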
Conclusion
By eliminating the possibility of narrative fallacy, Machine Learning is a vastly superior approach to data matching. By dramatically reducing the amount of human intervention (by both users and developers) required to achieve quality results, Machine Learning saves time and money, and provides the capability to make better data-driven business decisions.
Please share your thoughts and experiences with data matching.

I have worked with these different approaches:
Synonyms: This is in my eyes the most basic approach. You have a list of common translations between words, such as common misspellings, nicknames and so on. This approach of course depends on heavy maintenance and must be reworked for every language/country. It actually works better with English than with other languages, such as the other Germanic ones, where words are concatenated (like ‘Main Street’ being ‘Mainstreet’).
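A synonym list can be as simple as a lookup table applied before comparing. The entries below are a made-up sample; a real list is far longer and has to be maintained per language/country.

```python
# Hypothetical synonym table: nicknames and abbreviations mapped to one form.
SYNONYMS = {
    "bob": "robert",
    "bill": "william",
    "st": "street",
    "st.": "street",
    "rd": "road",
}


def normalize(text: str) -> str:
    """Lower-case the text and replace each known token with its standard form."""
    tokens = text.lower().replace(",", " ").split()
    return " ".join(SYNONYMS.get(token, token) for token in tokens)
```

Two values are then compared after normalisation, so ‘Bob Smith’ and ‘Robert Smith’ become identical strings.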
Match codes: These range from very simple ones to more sophisticated ones, from ignoring vowels, Soundex and Metaphone (for English) to proprietary inventions of all kinds. In my eyes match codes work OK for selecting candidates for matching, but fall a bit short when it comes to actually settling the case.
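Here is a sketch of the simplest kind of match code, one that ignores vowels, used only to select candidate pairs. The code format (first letter plus following consonants, cut to four characters) is an assumption for illustration; it is not Soundex or Metaphone.

```python
from collections import defaultdict
from itertools import combinations

VOWELS = set("aeiouy")  # 'y' is treated as a vowel here, an assumed simplification


def match_code(name: str, length: int = 4) -> str:
    """Keep the first letter, drop later vowels and repeated letters, then truncate."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = [letters[0]]
    for c in letters[1:]:
        if c in VOWELS or c == code[-1]:
            continue
        code.append(c)
    return "".join(code)[:length].upper()


def candidate_pairs(names):
    """Group names by match code and return the pairs worth comparing in detail."""
    blocks = defaultdict(list)
    for name in names:
        blocks[match_code(name)].append(name)
    pairs = []
    for group in blocks.values():
        pairs.extend(combinations(group, 2))
    return pairs
```

With the names Smith, Smyth and Jones, only the pair (Smith, Smyth) comes out as a candidate. Whether that pair really is a match still has to be settled by something else.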
Algorithms: A complex algorithm is a more sophisticated way to settle whether two differently spelled records represent the same real-world entity. You have to deal with truncations, non-phonetic typos, rearranged words and letters and all that jazz. The Levenshtein distance is an example of an algorithm you could use, but such a method is just a fraction of what commercially used algorithms do.
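For reference, the Levenshtein distance mentioned above can be computed with the standard dynamic programming approach. This is a plain textbook version that keeps only two rows of the table.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete from a
                    current[j - 1] + 1,      # insert into a
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]
```

It handles typos and the ‘Main Street’ versus ‘Mainstreet’ case well (distance 1), but rearranged words give a large distance, which is one reason it is only a part of the picture.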
Probabilistic learning: This is in fact a variation of synonyms, but the collection is based not on up-front maintenance but on collecting users' actual decisions when they verify automatic matching. The tool registers the frequency and context of the paired elements in the decisions. This of course requires a substantial collection. I have implemented such a feature at organisations where several people verify matching results every day.
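A very reduced sketch of the idea: count how often users confirmed or rejected a pair of elements, and only trust the ratio once enough decisions exist. The class name, the method names and the minimum of five decisions are hypothetical, and the context of each decision is left out here.

```python
from collections import Counter


class DecisionLog:
    """Collects user verdicts on paired elements (hypothetical helper)."""

    def __init__(self):
        self.confirmed = Counter()
        self.rejected = Counter()

    @staticmethod
    def _key(token_a: str, token_b: str):
        # Order-independent and case-insensitive key for the pair.
        return tuple(sorted((token_a.lower(), token_b.lower())))

    def record(self, token_a: str, token_b: str, is_match: bool) -> None:
        """Register one user decision for this pair of elements."""
        target = self.confirmed if is_match else self.rejected
        target[self._key(token_a, token_b)] += 1

    def match_probability(self, token_a: str, token_b: str, min_decisions: int = 5):
        """Share of confirmed decisions, or None while the collection is too small."""
        key = self._key(token_a, token_b)
        total = self.confirmed[key] + self.rejected[key]
        if total < min_decisions:
            return None
        return self.confirmed[key] / total
```

The None result for rarely seen pairs is the point: without a substantial collection there is nothing to learn from.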
Parsing and standardisation are also often used as supplementary methods to improve the matching. Bringing in more data to support the decision is, in my eyes, also a key to actually settling whether some records represent the same real-world entity. Business and consumer/citizen directories are available in different forms, coverage and depth around the world.