<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Narrative Fallacy and Data Matching</title>
	<atom:link href="http://www.netrics.com/blog/narrative-fallacy-and-data-matching/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.netrics.com/blog/narrative-fallacy-and-data-matching/</link>
	<description>A High Definition View of the Business and Technology of Data Matching</description>
	<lastBuildDate>Fri, 12 Jun 2009 21:14:19 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Henrik Liliendahl Sørensen</title>
		<link>http://www.netrics.com/blog/narrative-fallacy-and-data-matching/comment-page-1/#comment-135</link>
		<dc:creator>Henrik Liliendahl Sørensen</dc:creator>
		<pubDate>Fri, 12 Jun 2009 21:14:19 +0000</pubDate>
		<guid isPermaLink="false">http://www.netrics.com/blog/?p=117#comment-135</guid>
		<description>I have worked with these different approaches:

Synonyms: This is in my eyes the most basic approach. You keep a list that maps words to their common variants, such as frequent misspellings and nicknames. This approach depends on heavy maintenance and must be reworked for every language and country. It actually works better with English than with other Germanic languages, where words are concatenated (like ‘Main Street’ being ‘Mainstreet’).

Match codes: These range from very simple to more sophisticated ones: from ignoring vowels, through Soundex and Metaphone (for English), to proprietary schemes of all kinds. In my eyes match codes work OK for selecting candidates for matching, but fall a bit short when it comes to actually settling the case.

Algorithms: A complex algorithm is a more sophisticated way to settle whether two differently spelled records represent the same real-world entity. You have to deal with truncations, non-phonetic typos, rearranged words and letters and all that jazz. The Levenshtein distance is an example of an algorithm you could use, but such a method covers only a fraction of what the commercially used algorithms do.

Probabilistic learning: This is in fact a variation of synonyms, but the collection is built not through up-front maintenance but from users' actual decisions when verifying automatic matches. The tool registers the frequency and context of the paired elements in those decisions. This of course requires a substantial collection. I have implemented such a feature at organisations where several people verify matching results every day.

Parsing and standardisation are often used as supplementary methods to improve the matching. Bringing in more data to support the decision is also, in my eyes, a key to actually settling whether some records represent the same real-world entity. Business and consumer/citizen directories are available in different forms, coverage and depth around the world.</description>
		<content:encoded><![CDATA[<p>I have worked with these different approaches:</p>
<p>Synonyms: This is in my eyes the most basic approach. You keep a list that maps words to their common variants, such as frequent misspellings and nicknames. This approach depends on heavy maintenance and must be reworked for every language and country. It actually works better with English than with other Germanic languages, where words are concatenated (like ‘Main Street’ being ‘Mainstreet’).</p>
<p>Match codes: These range from very simple to more sophisticated ones: from ignoring vowels, through Soundex and Metaphone (for English), to proprietary schemes of all kinds. In my eyes match codes work OK for selecting candidates for matching, but fall a bit short when it comes to actually settling the case.</p>
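<p>To make the idea of a simple match code concrete, here is a minimal Python sketch of the "ignoring vowels" variant, used only to group candidates. The names match_code and candidate_groups are hypothetical, and the rule (keep the first letter, drop later vowels and all non-letters) is an assumption chosen for illustration, not any product's actual scheme.</p>

```python
from collections import defaultdict

VOWELS = set("aeiou")


def match_code(name: str) -> str:
    """Hypothetical match code: keep the first letter, drop later vowels.

    Non-letters (spaces, punctuation, digits) are ignored, so
    'Main Street' and 'Mainstreet' end up with the same code.
    """
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    head, tail = letters[0], letters[1:]
    return (head + "".join(c for c in tail if c not in VOWELS)).upper()


def candidate_groups(names):
    """Group names by match code and keep only groups with several members.

    These groups are candidates only; a stricter comparison still has
    to settle whether the records really are the same entity.
    """
    groups = defaultdict(list)
    for name in names:
        groups[match_code(name)].append(name)
    return {code: members for code, members in groups.items() if len(members) != 1}
```

<p>With this rule, ‘Main Street’ and ‘Mainstreet’ land in the same group, which fits the point above: the code is good at proposing candidates but says nothing final about whether they match.</p>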
<p>Algorithms: A complex algorithm is a more sophisticated way to settle whether two differently spelled records represent the same real-world entity. You have to deal with truncations, non-phonetic typos, rearranged words and letters and all that jazz. The Levenshtein distance is an example of an algorithm you could use, but such a method covers only a fraction of what the commercially used algorithms do.</p>
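<p>For readers who have not seen it, the Levenshtein distance can be sketched in a few lines of Python. This is the textbook dynamic-programming version; the similarity helper and its normalisation by the longer string's length are my own assumptions for illustration, and the comparison is case-sensitive.</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Illustrative score in [0, 1]: 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

<p>Note that this only counts character edits. It does nothing special for rearranged words or truncations, which is part of why it is just a fraction of what is needed.</p>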
<p>Probabilistic learning: This is in fact a variation of synonyms, but the collection is built not through up-front maintenance but from users' actual decisions when verifying automatic matches. The tool registers the frequency and context of the paired elements in those decisions. This of course requires a substantial collection. I have implemented such a feature at organisations where several people verify matching results every day.</p>
<p>Parsing and standardisation are often used as supplementary methods to improve the matching. Bringing in more data to support the decision is also, in my eyes, a key to actually settling whether some records represent the same real-world entity. Business and consumer/citizen directories are available in different forms, coverage and depth around the world.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
