Archive for June, 2010
This week, by request, I am off my philosophy-writing habit for some news from the trenches. Although I want to move off of Sentiment Analysis for this post, thinking more about sentiment analysis started me thinking about Named Entity Recognition. After hearing an interesting talk last month by Satoshi Sekine of NYU, I realized just how many named entity types there are out there in the world and how challenging it is to find them. Finding named entities in unstructured data is only going to get more important as automated sentiment analysis and reputation analysis gain popularity. After all, if you can’t find the entities, you can’t find the topic, and all the sophisticated topic-relevant analyses in the world won’t give you any useful information.
So…what sorts of named entities are there out there? The obvious ones like Person Names, Company Names and Place Names barely scratch the surface. Not only does each of these have up to 100 subcategories, but there are other types of top-level named entities too, such as Product Names, Natural Feature Names (e.g. Mount Everest) and Planet Names. OK, so maybe your NLP tools aren’t running on NASA data…you will still come across the need at some point to find “niche” named entities, and your average stochastic named-entity tool will miss them. Let’s take the example of Ship Names…well, perhaps not the best example, because there is ample context to train a tool to find them, but they are also good candidates for rule-based identification. For one thing, they appear in a limited number of events: as subjects of “sailing” events and as objects of boarding or “attacking-by-pirates” events. They also have nice little unique prefixes such as “S.S.” or “H.M.S.”, and they are nearly always preceded by the definite determiner “the”. Other types of NEs are not so simple, though. Product names in particular are very tricky. They rarely have a reliable canonical form or a stable event context.
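To make the rule-based idea for ship names concrete, here is a minimal Python sketch. It only uses the two cues mentioned above, the definite determiner and a ship prefix, and it ignores the event context entirely. The prefix list, the function name find_ship_names and the "run of capitalized words" rule for the name itself are my own assumptions for illustration, not a tested grammar.

```python
import re

# Hypothetical prefix list, for illustration only; extend it after looking at real data.
SHIP_PREFIXES = ["S.S.", "H.M.S.", "U.S.S."]

_prefix_alternatives = "|".join(re.escape(p) for p in SHIP_PREFIXES)

# Rule: "the" + ship prefix + one or more capitalized words.
SHIP_RE = re.compile(
    r"\b[Tt]he\s+(?P<prefix>" + _prefix_alternatives + r")\s+"
    r"(?P<name>[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*)"
)


def find_ship_names(text):
    """Return candidate ship names (prefix plus name) found in text."""
    return [
        f"{m.group('prefix')} {m.group('name')}"
        for m in SHIP_RE.finditer(text)
    ]


if __name__ == "__main__":
    sample = "Pirates attacked the S.S. Minnow near the coast."
    print(find_ship_names(sample))
```

A rule like this will over-match when a capitalized word happens to follow the ship name, so in practice you would still want the event cues (sailing, boarding, attacking) as a second filter.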
The only way to deal with this dilemma is good old-fashioned context research. Download *lots* of pages and do lots of KWIC (keyword-in-context) searches. Sometimes, if you are really lucky, you find a list…the best reward a hard-working linguist could hope for!
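For anyone who has not run a KWIC search before, the sketch below shows roughly what one does: for every hit on a keyword, it pulls out a fixed window of text on either side so you can eyeball the contexts. The function name kwic, the width parameter and the sample sentence are assumptions of mine, and the sketch assumes the keyword starts and ends with a letter or digit.

```python
import re


def kwic(text, keyword, width=30):
    """Return (left context, match, right context) for each hit on keyword.

    width is the number of characters kept on each side of the match.
    Matching is case-insensitive and on whole words only.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits


if __name__ == "__main__":
    sample = "We bought the iPad today and the ipad case."
    for left, match, right in kwic(sample, "iPad", width=10):
        # Right-align the left context so the keyword lines up in one column.
        print(f"{left:>10} [{match}] {right}")
```

Lining the hits up in a column like this is what makes recurring context words jump out, which is exactly the evidence you need for writing rules.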