Posts Tagged ‘linguistics’
Following my last post about named entities, someone asked me how to tell “one named entity from another”. Well, this is an interesting problem – for example, I mentioned that sometimes person names look a good deal like company names because their contexts are sufficiently similar. And this really raises the question of whether that makes them essentially the same type of entity – in other words, Entity Type may just be a proxy for “Contextual Role”. So maybe you decide that for your needs a “Hiring Entity” can be any one of a set of individuals, either corporate or human. It all depends on the information need.
But what I really wanted to get into today is the notion of a “variant”. Many people seem to confuse “variants” with “synonyms” and furthermore confuse the idea of “name variants” with the idea of “different types of named entities”.
OK, for starters, let’s differentiate variants from synonyms. A “variant” of a name is an alternate graphemic representation (graphonym) of that name – and as such two variants point to the identical individual. Synonyms are not graphemes – they are concepts that are similar in terms of the idea they represent. So for example “pocketbook” and “purse” are synonyms – they are conceptually similar enough to occur in the same grammatical contexts. But they are not graphemes and they do not point to any specific individual in the world.
Now name variants are particularly interesting because it is so tricky to know when you actually have one. That is, how do you know when “Jon Smith” and “John Smith” are variants? To be true variants, they have to point to the SAME individual. Of course “Jon” may be another way of spelling the name “John” and either may be an “abbreviation” of the name “Jonathan”. But abbreviations start to get into murky territory as “variants”, I think. Many people, for example, use a “formal” version of their name in some contexts and the “abbreviated” or “informal” one in others. So, looking at all the contexts where “Jonathan Smith” occurs, it may start to look fairly dissimilar from “Jon Smith” – even though both strings may point to exactly the same individual in the world.
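To make the distinction concrete, here is a rough Python sketch of a string-level check. It can only flag *candidate* variants, because deciding whether two strings really point to the same individual takes contextual evidence that no spelling table can supply. The lookup table and the function names are hypothetical, made up for illustration, and not taken from any particular tool.

```python
# Hypothetical lookup: each spelling maps to the formal given names it might stand for.
# A real resource would be far larger; these three entries are illustrative only.
GIVEN_NAME_FORMS = {
    "jon": {"john", "jonathan"},
    "john": {"john", "jonathan"},
    "jonathan": {"jonathan"},
}


def split_name(full_name):
    """Split a name into (given, surname), lowercased; naive whitespace split."""
    parts = full_name.lower().split()
    given = parts[0] if len(parts) > 1 else ""
    surname = parts[-1] if parts else ""
    return given, surname


def possible_formal_names(given):
    """Return the formal names a spelling could stand for (itself if unknown)."""
    return GIVEN_NAME_FORMS.get(given, {given})


def are_candidate_variants(name_a, name_b):
    """True if two strings COULD name the same person, judged on spelling alone."""
    given_a, surname_a = split_name(name_a)
    given_b, surname_b = split_name(name_b)
    if not surname_a or surname_a != surname_b:
        return False
    # Candidate variants share at least one possible formal given name.
    return bool(possible_formal_names(given_a) & possible_formal_names(given_b))


if __name__ == "__main__":
    for pair in [("Jon Smith", "John Smith"), ("Jon Smith", "Joan Smith")]:
        print(pair, are_candidate_variants(*pair))
```

A positive result here means only “worth comparing the contexts”; two different people can both be called Jon Smith.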
These sorts of issues, in my opinion, are just the ones that creators of Semantic Web resources need to be very careful about. They are even more important for search-engine related issues – especially for relevance ranking and SEO…..yet absent from discussions in this community. I welcome your thoughts as always…..
This week, by request, I am off my philosophy-writing habit for some news from the trenches. Although I want to move off of Sentiment Analysis for this post, thinking more about sentiment analysis started me thinking about Named Entity Recognition. After hearing an interesting talk last month by Satoshi Sekine of NYU, I realized just how many named entity types there are out there in the world and how challenging it is to find them. Finding named entities in unstructured data is only going to get more important as automated sentiment analysis and reputation analysis gain popularity. After all, if you can’t find the entities, you can’t find the topic and all the sophisticated topic-relevant analyses in the world won’t provide you any useful information.
So…what sorts of named entities are there out there? The obvious ones like Person Names, Company Names and Place Names barely scratch the surface. Not only does each of these have up to 100 subcategories, but there are other types of top-level named entities too – such as Product Names, Natural Feature Names (e.g. Mount Everest) and Planet Names. OK, so maybe your NLP tools aren’t running on NASA data…..you still will come across the need at some point to find “niche” named entities – and your average stochastic named-entity tool will miss them. Let’s take the example of Ship names……well, perhaps not the best example because there is ample context to train a tool to find them – but they are also good candidates for rule-based identification. For one thing, they appear in a limited number of events – “sailing” events as subjects and boarding or “attacking-by-pirates” events as objects. They also have nice little unique prefixes such as “S.S.” or “H.M.S.” and almost invariably take the definite determiner “the”. Other types of NEs are not so simple though. Product names in particular are very tricky. They rarely have a reliable canonical form or stable event context.
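As a minimal sketch of what such a rule might look like, the Python below matches the definite determiner, then a ship prefix, then one or more capitalized words. The prefix list and the function name are assumptions chosen for illustration; a real rule set would need a longer prefix list and would also check the event context (sailing, boarding, attacking).

```python
import re

# Illustrative prefix list (an assumption, not an exhaustive inventory).
SHIP_PREFIXES = ["S.S.", "H.M.S."]

_prefix_alternation = "|".join(re.escape(p) for p in SHIP_PREFIXES)

# "the" + prefix + one or more capitalized words, e.g. "the S.S. Minnow".
SHIP_NAME = re.compile(
    r"\b[Tt]he\s+((?:" + _prefix_alternation + r")\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)"
)


def find_ship_names(text):
    """Return prefix-plus-name strings that follow the definite determiner."""
    return [match.group(1) for match in SHIP_NAME.finditer(text)]


if __name__ == "__main__":
    sample = "Pirates attacked the S.S. Minnow near the coast."
    print(find_ship_names(sample))
```

The capitalized-word pattern is deliberately crude: it will over-capture if another capitalized word directly follows the ship name, and it will miss names containing digits or lowercase particles.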
The only way to deal with this dilemma is good old-fashioned context research. Download *lots* of pages and do lots of KWIC (keyword-in-context) searches. Sometimes if you are really lucky, you find a list…..the best reward a hard-working linguist could hope for!
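For anyone who has not run one, a keyword-in-context search just lists every hit of a term with a fixed window of text on either side, so the recurring contexts line up and become easy to eyeball. Here is a bare-bones Python sketch; the function name, the bracket formatting and the default window width are my own assumptions, and a corpus tool would do this far more efficiently over many pages.

```python
import re


def kwic(text, keyword, width=30):
    """Return one concordance line per hit: left context, [keyword], right context."""
    # Collapse newlines and runs of spaces so each hit fits on one line.
    flat = " ".join(text.split())
    # Lookarounds rather than \b so keywords ending in punctuation (e.g. "S.S.") still match.
    pattern = re.compile(r"(?<!\w)" + re.escape(keyword) + r"(?!\w)", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the keywords form a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


if __name__ == "__main__":
    page = "The S.S. Minnow sailed at dawn. Pirates later boarded the Minnow."
    for line in kwic(page, "minnow", width=20):
        print(line)
```

Sorting the resulting lines by their right or left context is the usual next step when hunting for stable patterns around a candidate entity type.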
In this slide show from the Sentiment Analysis Symposium, LBTech founder Leslie Barrett discusses some of the issues that practitioners encounter when doing document-level sentiment analysis on news data. Click the link below to view the show.
Sentiment, News and the Polarity Problem