Narrative Fallacy and Data Matching

Friday, June 12th, 2009 @ 3:00 pm

In his excellent 2007 book The Black Swan: The Impact of the Highly Improbable, Nassim Nicholas Taleb used the term narrative fallacy to explain how humans create a story to make their prior predictions sound rational in retrospect.  Taleb was writing about how our general tendency to oversimplify complexity and our lack of appreciation for the role that randomness plays in our lives combine to make us so poor at making predictions, as well as at explaining why our predictions are frequently wrong.

Humans are relatively good at making intuitive judgments but relatively bad at explaining their intuition.  In other words, we usually know what we like but we can’t always explain why.

Data matching determines whether two or more records should be linked, are duplicates, or represent the same entity.  At a high level, there are two approaches to data matching:

  1. Rules Based (e.g., deterministic and probabilistic)
  2. Machine Learning

Rules Based

In the Rules Based approach to data matching, humans are asked to define specific sets of instructions (i.e., rules or matching criteria).

The users are asked to define preliminary rules through a theoretical exercise: a metadata discussion about which specific fields should be used and how they should be weighted (i.e., how important each individual field is in determining whether compared records are a match).  The developers then code a prototype application based on these rules, and preliminary data matching results are presented for review.  During the review, users are asked to judge whether or not the matches are correct and to explain why.
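To make the idea of weighted fields concrete, here is a minimal sketch of what such a prototype might look like.  The field names, the weights, and the threshold are all hypothetical values invented for illustration, and the string comparison from Python's standard difflib module is just one of many possible similarity measures.

```python
from difflib import SequenceMatcher

# Hypothetical rules: which fields to compare and how much each one counts.
FIELD_WEIGHTS = {"name": 0.5, "email": 0.3, "city": 0.2}
# Hypothetical cut-off: weighted scores at or above this count as a match.
MATCH_THRESHOLD = 0.85


def field_similarity(a, b):
    """Similarity in [0, 1] for two field values, ignoring case and padding."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        # Assumption: a missing value is treated as no evidence of a match.
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def match_score(rec_a, rec_b, weights=FIELD_WEIGHTS):
    """Weighted average of the per-field similarities."""
    total = sum(weights.values())
    weighted = sum(
        w * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, w in weights.items()
    )
    return weighted / total


def is_match(rec_a, rec_b, threshold=MATCH_THRESHOLD):
    """Apply the rule: records match when the score reaches the threshold."""
    return match_score(rec_a, rec_b) >= threshold
```

Every number in FIELD_WEIGHTS and MATCH_THRESHOLD is a decision that someone has to propose, defend, and revise during the review.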

The explanations often lead to a considerable debate.  The users are forced to create a narrative about specific fields and their weights.  Forced to focus on the rules, the users can become so distracted that they actually ignore the data.  Instead of trying to reach agreement on what records should match, the users are trying to reach agreement on what the matching criteria should be.

Based on their feedback, developers make modifications to the rules.  This cycle of “modify the rules, re-run the application, review the results” usually requires a considerable amount of both time and data.  Additionally, the “completed” rule set is neither very effective with new data nor easily adaptable to the addition of new fields.

Machine Learning

In the Machine Learning approach to data matching, humans are asked to annotate data examples.

The users are simply asked to mark examples with a Yes, No, or Maybe – without explaining why.  The developers do not have to create or maintain any rules.  The matching engine automatically learns from the annotated examples and constructs a mathematical model of the way the users perceive similarity.

An optimal model can be quickly constructed using only a few hundred to a few thousand annotated data examples.  The resulting mathematical model automatically uses all of the available fields, determines the optimal weighting for each field, effectively matches new data and adapts to the addition of any new fields.
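As a rough illustration of learning field weights from annotated examples, here is a minimal sketch that fits a small logistic regression model by gradient descent.  This is not how any particular matching engine works; the field names are hypothetical, the choice of logistic regression is my own, and skipping the Maybe answers is a simplifying assumption.

```python
import math
from difflib import SequenceMatcher

# Hypothetical field names, chosen only for illustration.
FIELDS = ["name", "email", "city"]


def features(rec_a, rec_b):
    """One similarity score in [0, 1] per field for a pair of records."""
    return [
        SequenceMatcher(None, rec_a[f].lower(), rec_b[f].lower()).ratio()
        for f in FIELDS
    ]


def predict_proba(weights, bias, x):
    """Logistic model: estimated probability that the pair is a match."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(annotated, epochs=2000, lr=0.5):
    """Fit weights by stochastic gradient descent on (features, answer) pairs.

    "Maybe" answers are simply skipped here, which is an assumption of
    this sketch rather than a claim about how any real engine uses them.
    """
    labels = {"Yes": 1.0, "No": 0.0}
    data = [(x, labels[a]) for x, a in annotated if a in labels]
    weights, bias = [0.0] * len(FIELDS), 0.0
    for _ in range(epochs):
        for x, y in data:
            error = predict_proba(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias
```

The users only supply the Yes, No, or Maybe answers; calling train on the annotated pairs produces the weights, and predict_proba then scores new pairs.  Adding a field means adding a name to FIELDS and retraining, rather than reopening a debate about how much that field should count.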

Conclusion

By eliminating the possibility of narrative fallacy, Machine Learning is a vastly superior approach to data matching.  By dramatically reducing the amount of human intervention (by both users and developers) required to achieve quality results, Machine Learning saves time and money, and provides the capability to make better data-driven business decisions.

Please share your thoughts and experiences with data matching.
