What’s in a Name?
‘Tis but thy name that is my enemy;
… Oh, be some other name!
What’s in a name? that which we call a rose
By any other name would smell as sweet;
So Romeo would, were he not Romeo call’d…
–William Shakespeare, Romeo and Juliet, 1594
When Shakespeare penned this soliloquy for Juliet, the practice of inherited surnames had been fully adopted in England for only about 150 years. Previously, English practice had varied between given names only (e.g. John), use of the father’s name (e.g. John Jackson), profession (John Plowman, John Shepherd), and, especially for nobles, names which identified places or events (John Essex). Echoes of all of these practices can be found in modern surnames, especially those of British origin.
Of course, at the time surnames became established (roughly 1450), greater London, one of the largest cities in the western world, had a population of only about 60,000, compared to roughly 7.5 million today (and a peak population of 8.3 million in 1951). So it shouldn’t be a surprise that an identification system would have problems when confronted with a population of 100-150x what it was “designed” to handle. A web search for “John Jackson” in today’s NYC returned 23 names, and more than 300 for the metro area.
Still, it’s a system we’re stuck with. Just about everything else is either private (e.g. SSN) or subject to change. Your name is, broadly speaking, just about the only public identifier that stays with you for your entire life.
Yet as bad as the basic system is, it’s made worse by the entropy of the real world. Our “John Jackson” might be listed in some databases as Jack Jackson, John E. Jackson, J. Jackson, or J.E. Jackson. And that’s a pretty standard Anglo-Saxon name. More difficult to deal with are names that originate outside the US/UK, which often have multiple spelling variations when transliterated into English. For example, Muhammad is the most commonly given name in the world (according to the Columbia Encyclopedia). According to the Social Security Administration, it ranked as the 639th most popular name for newborns in the United States in 2006, and that doesn’t count “Mohammad” (589th) or “Mohammed” (633rd). Indeed, Wikipedia lists 15 additional variations, not including the three just mentioned: Mohamed, Muhammed, Mahommed, Muhamed, Mehmed, Mehmet, Mohand, Mahometus, Maometto, Moameth, Mahoma, Mukhammad, Maxamed, Mamadou, Makhambet. Of course, there could also be misspellings or typos for any of these names.
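To make the “John Jackson” problem concrete, here is a minimal Python sketch of how a system might decide that two differently formatted listings could refer to the same person. Everything in it is an assumption for illustration: the `NICKNAMES` table is hypothetical and tiny, the functions `parse`, `token_match`, and `compatible` are made-up names, and the rule “an initial matches any given name starting with that letter” is just one plausible policy.

```python
import re

# Hypothetical nickname table; a real system would need a far larger one.
NICKNAMES = {"jack": "john"}

def parse(name):
    """Split a name into (given-name tokens, surname), lowercased.

    Assumes ASCII letters and a surname in last position, for simplicity.
    """
    tokens = re.findall(r"[a-z]+", name.lower())
    given = [NICKNAMES.get(t, t) for t in tokens[:-1]]
    return given, tokens[-1]

def token_match(a, b):
    """Given-name tokens agree if equal, or if one is an initial of the other."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b

def compatible(name1, name2):
    """True if the two listings could plausibly describe the same person."""
    given1, surname1 = parse(name1)
    given2, surname2 = parse(name2)
    if surname1 != surname2:
        return False
    # Compare given names position by position; zip stops at the shorter
    # list, so a missing middle name or initial is tolerated.
    return all(token_match(a, b) for a, b in zip(given1, given2))

for listing in ["Jack Jackson", "John E. Jackson", "J. Jackson", "J.E. Jackson"]:
    print(listing, compatible("John Jackson", listing))
```

Note the trade-off: under these rules “J. Jackson” is also compatible with “James Jackson”, so a check like this can only say that two listings might match, not that they do.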
No wonder name-based “watch lists” have proven difficult to manage. For example, the Terrorist Watch List has something like 1.1 million records describing approximately 400,000 unique individuals, and it has been widely criticized. For computer name searches to work reliably, two conditions clearly need to be satisfied. The system needs to:
- Search variations automatically. Imagine an analyst missing a search because he entered only 3 or 4 variations of “Mohamed”, and the record for which he was looking was a 5th.
- Filter results accurately. Obviously, any name database of any size is going to return hundreds of “hits” on any reasonably common name. On the one hand, presenting hundreds of hits to a typical user is a waste of time, while, on the other hand, filtering out the correct result before presenting it to the user is worse.
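As a rough illustration of these two conditions, the following Python sketch expands a query into its known spelling variants and then scores and filters the candidates. It is a sketch under stated assumptions, not a description of how any real watch-list system works: `VARIANT_GROUPS` is a hand-built sample drawn from the spellings above, the `expand`, `similarity`, and `search` functions and the 0.85 threshold are hypothetical, and the standard library’s `difflib.SequenceMatcher` stands in for whatever string-similarity measure a production system would use.

```python
from difflib import SequenceMatcher

# Illustrative variant group taken from spellings in the post; not exhaustive.
VARIANT_GROUPS = [
    {"muhammad", "mohammad", "mohammed", "mohamed", "muhammed", "mehmet", "mamadou"},
]

def expand(name):
    """Return every known variant of `name`, or just the name itself."""
    key = name.lower()
    for group in VARIANT_GROUPS:
        if key in group:
            return group
    return {key}

def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the lowercased strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def search(query_given, query_surname, records, threshold=0.85):
    """Search (given, surname) records; return [(score, record)], best first."""
    variants = expand(query_given)
    hits = []
    for given, surname in records:
        if given.lower() in variants:
            given_score = 1.0  # a known variant counts as an exact match
        else:
            # Otherwise tolerate typos: best similarity to any variant.
            given_score = max(similarity(given, v) for v in variants)
        # A record is only as good as its weakest component.
        score = min(given_score, similarity(surname, query_surname))
        if score >= threshold:
            hits.append((score, (given, surname)))
    return sorted(hits, key=lambda hit: hit[0], reverse=True)

records = [("Mehmet", "Ali"), ("Mohammmed", "Ali"), ("John", "Jackson")]
for score, record in search("Muhammad", "Ali", records):
    print(round(score, 2), record)
```

The analyst types one spelling and the variant table does the rest, which addresses the first condition. The threshold addresses the second, and it is exactly where the tension lies: raise it and the list gets shorter but a badly misspelled record may be dropped; lower it and the correct record survives among many more false hits.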

There’s been a lot of flak in the news recently from the ACLU and others about how inaccurate the Terrorist Watch List is. How many of those guys are named some variation of Mohammad? With 18 different variations, no wonder it’s a mess.