Archive for July, 2010
Following my last post about named entities, someone asked me how to tell “one named entity from another”. This is an interesting problem. For example, I mentioned that people's names sometimes look a good deal like company names because their contexts are sufficiently similar. That raises the question of whether this makes them essentially the same type of entity; in other words, Entity Type may just be a proxy for “Contextual Role”. So you might decide that, for your needs, a “Hiring Entity” can be any one of a set of individuals, either corporate or human. It all depends on the information need.
But what I really wanted to get into today is the notion of a “variant”. Many people seem to confuse “variants” with “synonyms” and furthermore confuse the idea of “name variants” with the idea of “different types of named entities”.
OK, for starters, let’s differentiate variants from synonyms. A “variant” of a name is an alternate graphemic representation (graphonym) of that name – and as such two variants point to the identical individual. Synonyms are not graphemes – they are concepts that are similar in terms of the idea they represent. So for example “pocketbook” and “purse” are synonyms – they are conceptually similar enough to occur in the same grammatical contexts. But they are not graphemes and they do not point to any specific individual in the world.
Now name variants are particularly interesting because it is so tricky to know when you actually have one. That is, how do you know when “Jon Smith” and “John Smith” are variants? To be true variants, they have to point to the SAME individual. Of course “Jon” may be another way of spelling the name “John”, and either may be an “abbreviation” of the name “Jonathan”. But abbreviations start to get into murky territory as “variants”, I think. Many people, for example, use a “formal” version of their name in some contexts and the “abbreviated” or “informal” one in others. So, looking at all the contexts where “Jonathan Smith” occurs, it may start to look fairly dissimilar from “Jon Smith”, even though both strings may point to exactly the same individual in the world.
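To make the point about contexts concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the four toy sentences, the three-word context window, and the helper names `context_vector` and `cosine` are all made up, and bag-of-words cosine similarity is just one simple way of comparing contexts. It is not a recipe for deciding whether two strings are true variants.

```python
import math
import re
from collections import Counter

def context_vector(name, sentences, window=3):
    """Count the words found within `window` tokens of each occurrence of `name`."""
    name_tokens = name.lower().split()
    n = len(name_tokens)
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == name_tokens:
                left = tokens[max(0, i - window):i]
                right = tokens[i + n:i + n + window]
                counts.update(left + right)
    return counts

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented toy corpus: the formal name shows up in "official" contexts,
# the informal one in sports contexts.
sentences = [
    "Jonathan Smith testified before the committee on Tuesday",
    "The committee heard Jonathan Smith on budget matters",
    "Jon Smith scored twice in the weekend game",
    "Fans cheered as Jon Smith left the game",
]

formal = context_vector("Jonathan Smith", sentences)
informal = context_vector("Jon Smith", sentences)
similarity = cosine(formal, informal)
print(f"context similarity: {similarity:.2f}")
```

On this toy data the two context vectors share almost nothing, so the similarity comes out low even if both strings happen to name the same person. That is exactly the trap: context similarity alone can neither confirm nor rule out that two strings are variants.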
These sorts of issues, in my opinion, are just the ones that creators of Semantic Web resources need to be very careful about. They are even more important for search-engine related issues, especially relevance ranking and SEO, yet they are absent from discussions in this community. I welcome your thoughts as always.