Archives – September, 2009
September 28th, 2009
Data matching determines whether two or more records should be linked, are duplicates, or represent the same entity. There are many different approaches to data matching. In Machine Learning, advanced mathematical techniques are used to construct a data matching model for the way that humans perceive similarity.
In this post, I want to discuss what any data matching model is attempting to achieve – since the reality is that all approaches to data matching construct “models”.
Input from Subject Matter Experts (SME)
Before any data matching model can be constructed, interviews must be conducted with subject matter experts possessing a business understanding of the data. All organizations in every industry have unique data characteristics and unique data challenges. An effective data matching model must capture the tacit knowledge of subject matter experts without losing anything in translation when instantiating business knowledge into a technological implementation.
A History Lesson
All data matching models start with a history lesson. A training set is collected from existing data. Typically, at least a few hundred records are collected, but a few thousand would be better. The real distinction of a training set is that it has an answer key. In other words, it contains annotations indicating which training records should be considered matches, potential matches, and non-matches. The training set allows developers to create an initial data matching model that produces expected results.
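To make the idea of an answer key concrete, here is a minimal Python sketch of an annotated training set and a check of how often a candidate model reproduces it. Every name in it (Label, AnnotatedPair, agreement, exact_match_model) and every sample record is hypothetical, invented for illustration, and annotating pairs of records is just one possible way to store the key.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Label(Enum):
    MATCH = "match"
    POTENTIAL_MATCH = "potential match"
    NON_MATCH = "non-match"


@dataclass(frozen=True)
class AnnotatedPair:
    left: str
    right: str
    expected: Label  # the answer key supplied by the annotators


def agreement(model: Callable[[str, str], Label],
              training_set: List[AnnotatedPair]) -> float:
    """Fraction of training pairs where the model reproduces the answer key."""
    if not training_set:
        return 0.0
    hits = sum(1 for pair in training_set
               if model(pair.left, pair.right) is pair.expected)
    return hits / len(training_set)


# Hypothetical records, invented for illustration only.
TRAINING_SET = [
    AnnotatedPair("ACME Corp", "acme corp", Label.MATCH),
    AnnotatedPair("Jon Smith", "John Smith", Label.POTENTIAL_MATCH),
    AnnotatedPair("Jon Smith", "Mary Jones", Label.NON_MATCH),
]


def exact_match_model(left: str, right: str) -> Label:
    """A deliberately naive model: case-insensitive equality only."""
    same = left.strip().lower() == right.strip().lower()
    return Label.MATCH if same else Label.NON_MATCH


print(agreement(exact_match_model, TRAINING_SET))
```

The naive model can never answer "potential match", so it misses the second pair. That is the kind of gap the training set is meant to expose before the model is refined.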
Predicting the Future
It is easy to pass a test when you already know the answers – and maybe you even get a few wrong just to avoid suspicion that you may have cheated. However, the real test is whether you can use what you have learned from the past to predict the future. The mathematical notion of convergence is used to describe a data matching model that is ready to correctly evaluate records that were not included in its training set.
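One common way to check this is to hold some annotated records back and score the model on them only after it has been built. The sketch below is a simplified illustration, not a description of any particular product: it assumes each record pair has already been reduced to a single similarity score, it treats matching as a yes/no decision, and the scores themselves are made up.

```python
from typing import List, Tuple

# (similarity score, whether the answer key says "match")
ScoredPair = Tuple[float, bool]


def accuracy(threshold: float, pairs: List[ScoredPair]) -> float:
    """Fraction of pairs where 'score >= threshold' agrees with the answer key.

    Assumes pairs is non-empty.
    """
    correct = sum(1 for score, is_match in pairs
                  if (score >= threshold) == is_match)
    return correct / len(pairs)


def fit_threshold(training: List[ScoredPair]) -> float:
    """Pick the observed score that best reproduces the training answer key."""
    candidates = sorted({score for score, _ in training})
    return max(candidates, key=lambda t: accuracy(t, training))


# Hypothetical similarity scores, invented for illustration only.
training = [(0.9, True), (0.8, True), (0.4, False), (0.2, False)]
unseen = [(0.85, True), (0.7, True), (0.3, False)]

threshold = fit_threshold(training)
print("threshold:", threshold)
print("accuracy on training set:", accuracy(threshold, training))
print("accuracy on unseen records:", accuracy(threshold, unseen))
```

With these invented numbers the fitted threshold reproduces the training answer key perfectly, yet it gets only two of the three unseen pairs right. Passing the test you studied for says little until the model is scored on records it has not seen.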
Conclusion
Regardless of the approach being used, all data matching applications construct a data matching model using SMEs and historical training data. The two challenges are:
- How much effort is required to build and maintain the matching model
- How well does the constructed model match input data that it hasn’t previously seen
An optimal data matching model is accurate, easy to build, easy to update/maintain, and can seamlessly adapt to new data.
Sounds demanding because it is… but with innovation, it can be done!
Tags: Data Matching
September 21st, 2009
In his absolutely fantastic 2006 Princeton University essay The Algorithm: Idiom of Modern Science, Bernard Chazelle pondered the Holy Grail quest of computer science:
“How to unleash the full computing and modeling power of the Algorithm.”
Chazelle describes how Moore’s Law, commonly paraphrased as computing power doubling roughly every two years, has delayed the rise to prominence of the algorithm, in much the same way that an abundance of relatively cheap oil has delayed the emergence of alternative energy sources.
The Triumph of Mathematics
“To make sense of the world, we have math,” explains Chazelle, and therefore, some might ask: Who needs algorithms?
“It is beyond dispute,” continues Chazelle, “that the dizzying success of 20th century science is, to a large degree, the triumph of mathematics. A page’s worth of math formulas is enough to explain most of the physical phenomena around us: why things fly, fall, float, gravitate, radiate, blow up, etc.”
As Albert Einstein said:
“The most incomprehensible thing about the universe is that it is comprehensible.”
“Granted,” says Chazelle, “Einstein’s assurance that something is comprehensible might not necessarily reassure everyone, but all would agree that the universe speaks in one tongue and one tongue only: mathematics.”
The New Language of Science
“The Algorithm’s coming-of-age as the new language of science,” declares Chazelle, “promises to be the most disruptive scientific development since quantum mechanics.”
Some think of algorithms as simply a way to automate the rapid execution of a task. Although speed is important and the exponential growth of computing power has allowed algorithms to execute faster, the quality of the work performed by the algorithm is vastly more important, especially for algorithms used for complex data analysis in support of critical business decisions.
“The algorithmic paradigm,” explains Chazelle, “is not about what but how to think. Self-reference is associated mostly with self-replication. In the algorithmic world, by contrast, it is the engine powering the complex recursive designs that give abstraction its amazing richness: it is, in fact, the very essence of computing. Should even a fraction of that power be harnessed for modeling purposes, there’s no telling what might happen.”
For example, using graph theory (a branch of theoretical mathematics), algorithms can construct mathematical models for the ways that humans recognize patterns in data. The goal of these algorithms is not to replace human decision makers.
These algorithmically constructed models can be used to automate the rapid execution of analytical tasks providing true decision support for humans to use while navigating today’s challenging business environment, which faces daunting data volumes and a constantly evolving marketplace.
“Some say the Algorithm is poised to become the new New Math, the idiom of modern science,” explains Chazelle. “I say The Sciences They Are A-Changin’ and the Algorithm is Here to Stay. One thing is certain, Moore’s Law has put computing on the map: the Algorithm will now unleash its true potential.”
I completely agree and wholeheartedly echo the closing remark of Chazelle’s essay:
“May the Algorithm’s Force be with you.”
Tags: Innovation, Technology, Trends
September 14th, 2009
For speaking at this year’s Enterprise Data World conference, I received a copy of Stephen Baker’s amazing book The Numerati, which was inspired by his Jan 23, 2006 BusinessWeek article Math Will Rock Your World (one of my all time favorites!).

“When it comes to producing data,” explains Baker, “we’re prolific. The very air we breathe is teeming with motes of information. People with the right smarts can summon meaning from the nearly bottomless sea of data. The key to this process is to find similarities and patterns. We humans do this instinctively.”
Humans Teach, Machines Learn
Advancements in machine learning technology using sophisticated mathematical algorithms are providing the capability to make better data-driven decisions.
“Learning machines swim in numbers,” explains Baker. “The learning process starts with humans…the annotators. Their work is…to teach the machine what we humans know at a glance.”
Therefore, these advancements are not an attempt to replace human knowledge workers. The number crunching capabilities of these advancements will allow us to “gradually evolve from data serfs into data masters.”
Advanced Geometry
There are many mathematical disciplines involved in machine learning. However, perhaps one of the more surprising is advanced geometry.
“Scientists often describe the world of data as a domain of sharp angles, colliding planes, and vectors shooting along endless paths,” explains Baker. “Imagine a vast multidimensional space [with] dozens of markers…each marker occupies its own patch of real estate.”
Imagine each marker representing an individual character within a string of text. Machine learning uses bipartite graphs to allow data to “produce a line – or vector – that intersects with each and every one of its own markers…it’s a little like those grade-school exercises where a child follows a series of numbers or letters with her pencil and ends up with a picture of a puppy or a Christmas tree,” explains Baker.
However, the pictures that bipartite graphs draw are too complex for the three-dimensional world of the human imagination.
“The computer has no trouble depicting [data] as vectors,” continues Baker. “They all run neatly from one dimension through countless others and, more important, through every one of their distinguishing markers. [Data] that resemble each other, naturally enough, are neighbors in this vector space. [Data] that have a lot in common tend to point at similar angles. Each link shared is a line connecting them, a so-called edge. The next step is to calculate the importance of each edge…[edges] given a higher score…those lines on the graph are thicker.”
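The idea that similar data "point at similar angles" can be tried out in a few lines of Python. This sketch is not the bipartite-graph approach described above. It swaps in a simpler and widely used stand-in, cosine similarity over character-bigram counts, and the function names and sample strings are my own choices for illustration.

```python
import math
from collections import Counter


def bigram_vector(text: str) -> Counter:
    """Count overlapping two-character sequences; each distinct bigram is one dimension."""
    lowered = text.lower()
    return Counter(lowered[i:i + 2] for i in range(len(lowered) - 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two count vectors (1.0 means same direction)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # strings shorter than two characters have no direction
    return dot / (norm_a * norm_b)


for left, right in [("Smith", "Smyth"), ("Smith", "Jones")]:
    score = cosine_similarity(bigram_vector(left), bigram_vector(right))
    print(left, right, round(score, 3))
```

"Smith" and "Smyth" share two of their four bigrams ("sm" and "th"), so their vectors sit at a much smaller angle than those of "Smith" and "Jones", which share none.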
A New Era of Applied Mathematics
“The information age that we’re in is going to be an emerging new era of what would be called applied mathematics,” concludes Baker. “Mathematicians are going to dip into the sea of data to form…the mathematical modeling of humanity.”
From the beginning of civilization, mathematics has been central to our advancement. It is, after all, the language of science. But our relatively newfound ability to collect digital data has ushered in a new era for leveraging and benefiting from mathematics.
Tags: Innovation, Technology
September 7th, 2009
A benchmark provides the ability to perform consistent and repeatable comparisons by running tests against it. With many different approaches to data matching and many software vendors offering commercial solutions, an industry standard data matching benchmark would certainly be very useful.
However, in this post, I want to discuss recommendations for establishing your own benchmark for comparing and selecting the best data matching solution for your specific business needs.
Benchmark Data
The only way to truly determine that a data matching solution will satisfy your business requirements is to test it against your actual data. Therefore, it is critically important to collect real data for benchmarking purposes. At least a few hundred records should be collected, but a few thousand would be better. The collected data should be separated into two benchmark data sets:
- Training Set
- Processing Set
The real distinction between these two data sets is that the training set contains only a small percentage of the total records and that it has an answer key. In other words, the training set contains annotations indicating which records should be considered matches, potential matches, and non-matches.
The training set allows the competing data matching solutions to prepare for the benchmark by performing their own independent development and testing against the training set, which is intended to represent the expected results in alignment with your business definitions of what constitutes acceptable matches. Since the training set contains the answer key, the vendor can perform this preparation without your direct involvement.
The processing set will be used for the actual benchmark tests. It should contain the full-volume data collection but, most importantly, no answer key. Obviously, you will need an answer key for the processing set; you simply should not provide it to the vendors. The most important aspect of the benchmark is to measure whether the data matching solution can produce the correct results by extrapolating from the training set the correct logic to apply to the processing set.
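As a rough sketch of how the two data sets might be carved out of one annotated collection, the Python below draws a small random training set (released with its answer key) and keeps the answer key for everything else private. The function name, the 10% default, and the choice to keep the two sets disjoint are all my assumptions. You may prefer the processing set to include every record.

```python
import random
from typing import Dict, Tuple


def split_benchmark(annotated: Dict[int, Tuple[str, str]],
                    training_fraction: float = 0.1,
                    seed: int = 2009):
    """Split {id: (record, annotation)} into the three pieces of a benchmark.

    Returns:
      training_set       -> {id: (record, annotation)}, given to vendors
      processing_records -> {id: record}, given to vendors WITHOUT annotations
      answer_key         -> {id: annotation}, kept private for scoring
    """
    rng = random.Random(seed)  # fixed seed so every vendor gets the same split
    ids = sorted(annotated)
    n_train = max(1, round(len(ids) * training_fraction))
    training_ids = set(rng.sample(ids, n_train))

    training_set = {i: annotated[i] for i in ids if i in training_ids}
    processing_records = {i: annotated[i][0] for i in ids if i not in training_ids}
    answer_key = {i: annotated[i][1] for i in ids if i not in training_ids}
    return training_set, processing_records, answer_key


# Hypothetical annotated collection, invented for illustration only.
collection = {i: (f"record {i}", "match" if i % 2 else "non-match")
              for i in range(20)}
training_set, processing_records, answer_key = split_benchmark(collection)
print(len(training_set), len(processing_records), len(answer_key))
```

Fixing the random seed is a deliberate choice here: every competing vendor receives exactly the same training and processing records, which keeps the comparison repeatable.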
Benchmark Measurements
The benchmark results should be measured against the following statistics:
- Processing Time – execution time for the processing set.
- Matches – the number of records properly identified as a match.
- Potential Matches – the number of records properly identified as a potential match.
- Non-Matches – the number of records properly identified as a non-match.
- False Positives – the number of records identified as a match that should have been identified as a non-match.
- False Negatives – the number of records identified as a non-match that should have been identified as a match.
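The measurements above are simple to compute once you hold both the vendor's results and your private answer key. Here is a minimal Python sketch. The run_benchmark function, the label strings, and the sample data are hypothetical, and it assumes the matcher can be called once per record and returns one of the three labels.

```python
import time
from typing import Callable, Dict, List

MATCH, POTENTIAL, NON_MATCH = "match", "potential match", "non-match"


def run_benchmark(matcher: Callable[[int], str],
                  records: List[int],
                  answer_key: List[str]) -> Dict[str, float]:
    """Time the matcher on the processing set and tally results against the key."""
    if len(records) != len(answer_key):
        raise ValueError("records and answer_key must be the same length")

    start = time.perf_counter()
    predicted = [matcher(record) for record in records]
    elapsed = time.perf_counter() - start

    pairs = list(zip(answer_key, predicted))  # (expected, predicted)
    return {
        "processing_seconds": elapsed,
        "matches": sum(1 for e, p in pairs if e == p == MATCH),
        "potential_matches": sum(1 for e, p in pairs if e == p == POTENTIAL),
        "non_matches": sum(1 for e, p in pairs if e == p == NON_MATCH),
        "false_positives": sum(1 for e, p in pairs
                               if p == MATCH and e == NON_MATCH),
        "false_negatives": sum(1 for e, p in pairs
                               if p == NON_MATCH and e == MATCH),
    }


# Hypothetical answer key and vendor output, invented for illustration only.
answer_key = [MATCH, MATCH, POTENTIAL, NON_MATCH, NON_MATCH, MATCH]
vendor_output = [MATCH, NON_MATCH, POTENTIAL, NON_MATCH, MATCH, MATCH]

stats = run_benchmark(lambda i: vendor_output[i], list(range(6)), answer_key)
for name, value in stats.items():
    print(name, value)
```

Note that this follows the definitions in the list literally, so a potential match that the solution mislabels as a match or a non-match is not counted anywhere. You may want an extra tally for those cases.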
Benchmark Guidelines
Although benchmark preparation using the training set can be performed off-site, it is highly recommended that the actual benchmark using the processing set always be performed on-site. Whenever possible, the benchmark should be executed on the same machine. If this is not possible, then comparable machines should be used.
The Lightning Round
The final test in your data matching benchmark should be to simulate changing business requirements. To do this, you would change some of the annotations in the training set. You would also need to have an alternate answer key for the processing set available. Similar to the original test, you would only provide the training set to the vendors.
This final test should also be performed on-site. The additional benchmark measurement would be the modification time needed to prepare for the second test on the processing set.
Even better, you can make this phase of the benchmark the lightning round – by providing only a single business day for making the changes.
What else do you think should be included in a data matching benchmark? Have you used a benchmark during vendor selection?
Please share your thoughts and experiences.
Tags: Data Matching