Tag: Data Matching

What’s in a Name? (Part II)

June 1st, 2009

The name of a man is a numbing blow from which he never recovers.  ~Marshall McLuhan

In the first part of this series, we spoke of the difficulties of dealing with proper names, even though we only considered the case when names remained unchanged.  Of course, we all know that people sometimes change their names.

In US and Canadian practice, the most common event signaling a change is marriage, and conventional practice is for the woman to drop her last name and adopt her husband’s.  While this still happens, and is complicated enough to wreak havoc on data quality, the world isn’t nearly that simple anymore.  A survey of postings to alt.wedding reveals 8 additional variant naming conventions for when “Jane Smith” marries “Michael Brown”*:

  1. Wife Hyphenates The Two Names (Jane Smith becomes Jane Smith-Brown).
  2. Wife Uses Birth Name as Middle Name (Jane Smith becomes Jane Smith Brown, with no hyphenation).
  3. Husband and Wife Keep Their Own Birth Names (Jane Smith stays Jane Smith).
  4. Wife takes Husband’s Name Socially, Keeps Own Name Professionally (Jane Smith is Jane Smith at work, but Jane Brown otherwise)
  5. Husband takes Wife’s Name (Michael Brown becomes Michael Smith)
  6. Husband and Wife Both Hyphenate (Jane and Michael become The Smith-Browns)
  7. Husband and Wife take Each Other’s Names as Middle Names (Jane becomes Jane Brown Smith, Michael becomes Michael Smith Brown) – last names are still different, but there is the symbolism of having taken each other as part of themselves.
  8. Husband and Wife Pick a New Name.

For a moment, consider the impact on data quality of all of these variations.  Take one of the most common (#4), where a woman maintains her maiden name professionally.  This means that you’re stuck with trying to synchronize data records for different names depending on whether Jane views a given relationship as professional or personal.  It may be obvious to Jane which name to use where, but this is unlikely to be transparent to your business.  A financial institution would need to deal with Jane Smith for “professional” credit cards and bank accounts, and Jane Brown for “personal” accounts.  A media company might sell some products to Jane Smith, and others to Jane Brown.  Not to mention the issues of long-standing relationships (pre-marriage) which would have started as Jane Smith and now need to be transitioned, and linked with new ones that now start as Jane Brown (except for the records that need to stay Jane Smith).

Obviously, this one example is just scratching the surface.  When you actually capture the changed name, your own staff can generate additional variations.  Will you actually capture the hyphen, or perhaps add a hyphen that’s not supposed to exist?  Will the middle name actually get fielded into the middle name field, or show up as one of two last names?  Will you replace the old middle name with the new one, or add it?

The many variations here point out the advantage of the Netrics Matching Platform’s approach to data matching, which is to look holistically at the data record and look for similarities wherever they occur.  Then you don’t care much at all if you’ve captured Jane | Smith | Brown or Jane | Smith Brown or Jane | Smith-Brown.  It’s a powerful approach, made possible by the flexibility and computational simplicity of Netrics’ underlying approach.
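To make the field-independence idea concrete, here is a minimal Python sketch.  It is not how the Netrics Matching Platform works internally; it only illustrates, under simple assumptions, why comparing name tokens across the whole record makes the three capture styles above equivalent.  The helper names (name_tokens, token_overlap) are hypothetical.

```python
import re

def name_tokens(*fields):
    """Collect lowercase name tokens from all fields, ignoring field boundaries."""
    tokens = set()
    for field in fields:
        # Split on whitespace and hyphens so "Smith-Brown" and "Smith Brown"
        # produce the same tokens.
        tokens.update(t for t in re.split(r"[\s\-]+", field.lower()) if t)
    return tokens

def token_overlap(a, b):
    """Jaccard similarity of two token sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# The three ways the same married name might have been keyed in.
records = [
    ("Jane", "Smith", "Brown"),   # first | middle | last
    ("Jane", "Smith Brown"),      # first | two-word last name
    ("Jane", "Smith-Brown"),      # first | hyphenated last name
]
token_sets = [name_tokens(*record) for record in records]

# All three records reduce to the same token set.
print(token_sets[0] == token_sets[1] == token_sets[2])

# A record captured simply as "Jane Brown" still overlaps strongly.
print(token_overlap(token_sets[0], name_tokens("Jane", "Brown")))
```

A real matcher would also have to tolerate typos inside each token, which this sketch does not attempt.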

You can also imagine the implication of these new naming variations on the complexity of rules-based systems.  Imagine trying to sort out all 8 variations in a name matching rules-base.  And we haven’t touched on non-US name changing conventions which can be quite different.
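As a rough illustration of what a rules-based approach is up against, the Python sketch below enumerates the forms Jane’s name could take under the conventions listed earlier.  The function name and rule set are hypothetical, written only for this example; convention 8 (a brand new name) cannot be enumerated at all, and the sketch ignores data-entry errors.

```python
def wife_name_variants(first, birth_last, husband_last):
    """Enumerate name forms the wife might use after marriage (illustrative rules only)."""
    return {
        f"{first} {husband_last}",               # traditional practice
        f"{first} {birth_last}-{husband_last}",  # conventions 1 and 6 (hyphenated)
        f"{first} {birth_last} {husband_last}",  # convention 2 (birth name as middle name)
        f"{first} {birth_last}",                 # conventions 3, 4 (at work) and 5
        f"{first} {husband_last} {birth_last}",  # convention 7 (husband's name as middle name)
    }

variants = wife_name_variants("Jane", "Smith", "Brown")
for name in sorted(variants):
    print(name)
```

Even this toy rule set yields five candidate strings for one person, before accounting for the husband’s record, nicknames, or typos.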

Finally, of course, there’s the issue that it’s not just marriage that causes people to change names.  In any culture, someone can adopt a nickname (formally or informally), and this variation can leak into your corporate data.  And there may be other reasons for name changes as well.

A few years ago, Netrics performed some data cleansing work for a hospital in Arizona.  As our own data experts were performing QA testing on the resulting identified duplicates, we thought we had a big problem:  the Netrics Decision Engine was identifying pairs of male records with the same first name and different last names as duplicates.  We were very concerned until we called and spoke with our client.  It turns out that when some Native American males leave the reservation, they adopt an “American” last name to use off the reservation.  The Netrics Decision Engine, which creates its own mathematical model using machine learning, had figured out this subtlety in the data – something that we did not know and, more importantly, something that we did not have to ask of our client’s data experts.

Imagine trying to handle that with a probabilistic rules-based system.  Moreover, how can one know all of these cases for all of the different data sources and all of the different data types a priori?

Our mantra is: Learn, don’t guess!

*Thanks to M. Elizabeth Hunter and Sonja Kueppers on soc.couples.wedding for compiling this list.


What’s in a Name?

May 26th, 2009

‘Tis but thy name that is my enemy;
… Oh, be some other name!
What’s in a name? that which we call a rose
By any other name would smell as sweet;
So Romeo would, were he not Romeo call’d…
–William Shakespeare, Romeo and Juliet, 1594

When Shakespeare penned this soliloquy for Juliet, the practice of inherited surnames had been fully adopted in England for only about 150 years.  Previously, English practice had varied between given names only (e.g. John), use of father’s name (e.g. John Jackson), profession (John Plowman, John Shepherd), and especially for nobles, names which identified places or events (John Essex).  Echoes of all of these practices can be found in modern surnames, especially those of British origin.

Of course, at the time surnames became established (roughly 1450), greater London, one of the largest cities in the western world, had a population of only about 60,000, compared to roughly 7.5 million today (and a peak population of 8.3 million in 1951).  So it shouldn’t be a surprise that an identification system would have problems when confronted with a population of 100-150x what it was “designed” to handle.  A web search for “John Jackson” in today’s NYC returned 23 names, and more than 300 for the metro area.

Still, it’s a system we’re stuck with.  Just about everything else is either private (e.g. SSN) or subject to change.  One’s name is, broadly speaking, just about the only public identifier that stays with you for your entire life.

Yet as bad as the basic system is, it’s made worse by the entropy of the real world.  Our “John Jackson” might be listed in some databases as Jack Jackson, John E. Jackson, J. Jackson, or J.E. Jackson.  And that’s a pretty standard Anglo-Saxon name.  What’s more difficult to deal with are names which originate outside of the US/UK, which often have multiple spelling variations when transliterated to English spellings.  For example, Muhammad is the most commonly given name in the world (according to the Columbia Encyclopedia).  According to the Social Security Administration, it ranked as the 639th most popular name for newborns in the United States in 2006.  Not to mention “Mohammad” (589th) and “Mohammed” (633rd).  Indeed, Wikipedia lists 15 additional variations, not including the three just mentioned:  Mohamed, Muhammed, Mahommed, Muhamed, Mehmed, Mehmet, Mohand, Mahometus, Maometto, Moameth, Mahoma, Mukhammad, Maxamed, Mamadou, Makhambet.  Of course, there could be misspellings or typos for any of these names.
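To get a feel for how far apart these spellings are, here is a small Python sketch that scores a few of the variants against “Muhammad” using the standard library’s difflib.SequenceMatcher.  This is a generic character-level similarity, chosen here only for illustration; it is not a description of any particular matching product, and the similarity helper is a hypothetical name.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive character similarity in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# A handful of the variants listed above.
variants = ["Mohammad", "Mohammed", "Mohamed", "Muhammed", "Mehmet", "Mamadou"]

for variant in variants:
    print(f"{variant:10s} {similarity('Muhammad', variant):.2f}")
```

Close transliterations such as “Mohammad” score high, while forms like “Mamadou” score much lower, which suggests that plain character similarity alone would not link every variant.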

No wonder name-based “watch lists” have proven difficult to manage.  For example, the Terrorist Watch list has something like 1.1 million records describing approximately 400,000 unique individuals and has been widely criticized.  It’s obvious that for computer name searches to work reliably, two conditions need to be satisfied.  The system needs to:

  1. Search variations automatically.  Imagine an analyst missing a search because he entered only 3 or 4 variations of “Mohamed”, and the record for which he was looking was a 5th.
  2. Filter results accurately.  Obviously, any name database of any size is going to return hundreds of “hits” on any reasonably common name.  On the one hand, presenting hundreds of hits to a typical user is a waste of time, while, on the other hand, filtering out the correct result before presenting it to the user is worse.
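The two conditions above can be sketched in a few lines of Python.  This is a toy illustration under stated assumptions, not a production design: the search_names function, the cutoff and limit parameters, and the sample list are all hypothetical, and the scoring again uses the standard library’s difflib.SequenceMatcher.

```python
from difflib import SequenceMatcher

def search_names(query, names, cutoff=0.7, limit=5):
    """Return up to `limit` (score, name) pairs scoring at least `cutoff`, best first."""
    scored = []
    for name in names:
        # Condition 1: score every name, so spelling variants are found
        # without the analyst having to type each one.
        score = SequenceMatcher(None, query.lower(), name.lower()).ratio()
        if score >= cutoff:
            scored.append((score, name))
    # Condition 2: rank by score and cap the list shown to the user.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]

# Hypothetical mini list, for illustration only.
watch_list = ["Mohamed Ali", "Muhammed Ali", "Mehmet Ali", "Michael Allen", "John Jackson"]

hits = search_names("Mohammed Ali", watch_list)
for score, name in hits:
    print(f"{score:.2f}  {name}")
```

The tension described in the second condition shows up in the cutoff: set it too low and the user wades through irrelevant hits, set it too high and a legitimate variant is dropped before anyone sees it.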
