Many people still do not truly understand how the so-called “Semantic Web” relates to their own business and online brand. The idea of the “Semantic Web” itself has unfortunately become overgeneralized due to the media hype surrounding related concepts like artificial intelligence, robotics and machine learning. While these are very interesting fields, and certainly somewhat related to the idea of a semantic web, they are not core to the practical applications of semantic-web-compatible markup and standards. In fact, one needs to know very little about the semantic web to know that data standards (and in fact the very idea of “standardization”) are important to business. Let’s take the very simple idea of optimizing your brand on the internet. Your brand, like anything else online or offline, is introduced to the public through your “message”. Your message, in turn, is crafted in words.
Now, the brand itself will be associated with a core idea, and that core idea of course is part of the message. But there actually is a point when the concept of “semantics” creeps into your brand building. It is the point where you start to concern yourself with how people will find you. Sure, you have a great message, but how will people connect with the message through searches? That’s the key. People find things on the internet through search strings. And those strings need to connect with the way your core ideas are expressed. So you will need to master a key semantic concept in order to facilitate that connection. You will need to understand the “synonyms” that are part of expressing your brand concept. Let’s use an example. Perhaps you are selling a new type of soft drink – a soda brand. The core idea is going to be expressed by “soda”, but we must expand the set of brand words beyond that one. In other words, we can’t assume that everyone looking for the concept soda will use the word ‘soda’. They could use ‘pop’ or ‘cola’ to mean the same thing. They also could use some of your product’s attribute words instead of using the brand word itself, or they could use a more general word such as “drink”. For example, “carbonated beverages”, “soft drinks” and the like would be search terms that you would like to lead to your product, the soda.
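To make the soda example concrete, here is a minimal sketch in Python of expanding a core brand word into the wider set of search terms. The table layout, the group names and the function names are all my own assumptions for illustration; only the words themselves come from the example above.

```python
import re

# Hypothetical term table for the soda example. The words come from the post;
# the structure and the two group names are illustrative assumptions.
BRAND_TERMS = {
    "soda": {
        "synonyms": ["pop", "cola"],
        "related_terms": ["drink", "carbonated beverages", "soft drinks"],
    }
}


def expand_terms(core_word, table=BRAND_TERMS):
    """Return the core word followed by its related search terms, no duplicates."""
    entry = table.get(core_word, {})
    terms = [core_word]
    for group in ("synonyms", "related_terms"):
        for term in entry.get(group, []):
            if term not in terms:
                terms.append(term)
    return terms


def matches_brand(search_string, core_word, table=BRAND_TERMS):
    """True if the search string contains any expanded term as a whole word or phrase."""
    query = search_string.lower()
    for term in expand_terms(core_word, table):
        # Word boundaries keep "pop" from matching inside "popcorn".
        if re.search(r"\b" + re.escape(term) + r"\b", query):
            return True
    return False
```

With this table, a search such as "best pop brands" connects to the soda concept while "popcorn recipes" does not. A real term list would of course come from research into how people actually search, not from a hand-written dictionary.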
So with just the simplest of examples, you can already see that semantics does come into play in branding even on the most basic levels. It is important to understand the generality of terms as well as the concept of synonymy. When things get more complicated, however, such as with a large financial or legal information product, the organization and presentation of data become a large project in themselves, completely separate from branding. This is where data consultants such as LBTech can help. We are data experts specializing in everything from groundbreaking data standards to site optimization.
Following my last post about named entities, someone asked me how to tell “one named entity from another”. Well, this is an interesting problem – for example, I mentioned that sometimes people’s names look a good deal like company names because their contexts are sufficiently similar. And this really raises the question of whether that makes them essentially the same type of entity – in other words, Entity Type may just be a proxy for “Contextual Role”. So maybe you decide that for your needs a “Hiring Entity” can be any one of a set of individuals, either corporate or human. It all depends on the information need.
But what I really wanted to get into today is the notion of a “variant”. Many people seem to confuse “variants” with “synonyms” and furthermore confuse the idea of “name variants” with the idea of “different types of named entities”.
OK, for starters, let’s differentiate variants from synonyms. A “variant” of a name is an alternate graphemic representation (graphonym) of that name – and as such two variants point to the identical individual. Synonyms are not graphemes – they are concepts that are similar in terms of the idea they represent. So for example “pocketbook” and “purse” are synonyms – they are conceptually similar enough to occur in the same grammatical contexts. But they are not graphemes and they do not point to any specific individual in the world.
Now name variants are particularly interesting because it is so tricky to know when you actually have one. That is, how do you know when “Jon Smith” and “John Smith” are variants – to be true variants, they have to point to the SAME individual. Of course “Jon” may be another way of spelling the name “John” and either may be an “abbreviation” of the name “Jonathan”. But abbreviations start to get into murky territory as “variants”, I think. Many people, for example, use a “formal” version of their name in some contexts and the “abbreviated” or “informal” one in others. So, looking at all the contexts where “Jonathan Smith” occurs, it may start to look fairly dissimilar from “Jon Smith” – even though both strings may point to exactly the same individual in the world.
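As a rough illustration of why this is tricky, here is a small Python sketch that flags two name strings as *candidate* variants. The lookup table and function names are hypothetical, and the code only compares strings: it cannot tell you whether the two names really point to the same individual, which is exactly the hard part.

```python
# Hypothetical table: each spelling or short form maps to the full given names
# it could stand for. The entries follow the Jon/John/Jonathan example above.
GIVEN_NAME_FORMS = {
    "jon": {"john", "jonathan"},
    "john": {"john", "jonathan"},
    "jonathan": {"jonathan"},
}


def possible_given_names(given):
    """Return the set of full given names a form might stand for."""
    key = given.lower()
    return GIVEN_NAME_FORMS.get(key, {key})


def are_candidate_variants(name_a, name_b):
    """True if two 'Given Family' strings COULD be variants of one name.

    This is string evidence only. Being true variants additionally requires
    that both strings refer to the same individual, which needs outside evidence.
    """
    given_a, _, family_a = name_a.strip().rpartition(" ")
    given_b, _, family_b = name_b.strip().rpartition(" ")
    if family_a.lower() != family_b.lower():
        return False
    return bool(possible_given_names(given_a) & possible_given_names(given_b))
```

Here "Jon Smith" and "John Smith" come out as candidates, and so do "Jon Smith" and "Jonathan Smith", but the function has no way to rule out two different people who happen to share a name.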
These sorts of issues, in my opinion, are just the ones that creators of Semantic Web resources need to be very careful about. They are even more important for search-engine related issues – especially for relevance ranking and SEO…..yet absent from discussions in this community. I welcome your thoughts as always…..
This week, by request, I am off my philosophy-writing habit for some news from the trenches. Although I want to move off of Sentiment Analysis for this post, thinking more about sentiment analysis started me thinking about Named Entity Recognition. After hearing an interesting talk last month by Satoshi Sekine of NYU, I realized just how many named entity types there are out there in the world and how challenging it is to find them. Finding named entities in unstructured data is only going to get more important as automated sentiment analysis and reputation analysis gain popularity. After all, if you can’t find the entities, you can’t find the topic and all the sophisticated topic-relevant analyses in the world won’t provide you any useful information.
So…what sorts of named-entities are there out there? The obvious ones like Person Names, Company Names and Place Names barely scratch the surface. Not only does each of these have up to 100 subcategories, but there are other types of top-level named entities too – such as Product Names, Natural Feature Names (e.g. Mount Everest) and Planet Names. OK, so maybe your NLP tools aren’t running on NASA data…..you still will come across the need at some point to find “niche” named entities – and your average stochastic named-entity tool will miss them. Let’s take the example of Ship names……well perhaps not the best example because there is ample context to train a tool to find them – but they are also good candidates for rule-based identification. For one thing, they appear in a limited number of events – “sailing” events as subjects and boarding or “attacking-by-pirates” events as objects. They also have nice little unique prefixes such as “S.S.” or “H.M.S.” and also invariably the definite determiner “the”. Other types of NEs are not so simple though. Product names in particular are very tricky. They rarely have a reliable canonical form or stable event context.
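For the ship-name case, a rule along these lines can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the prefix list (I have added “H.M.S.” and “U.S.S.” alongside “S.S.”), the requirement that “the” comes first, and the rule that the name is a run of capitalized words are all simplifications, and the function name is hypothetical.

```python
import re

# Assumed prefix list; "S.S." is from the post, the others are added for illustration.
SHIP_PREFIXES = ["S.S.", "H.M.S.", "U.S.S."]

# "the" + prefix + one or more capitalized words, which are captured as the ship name.
SHIP_PATTERN = re.compile(
    r"\b[Tt]he\s+(?:"
    + "|".join(re.escape(p) for p in SHIP_PREFIXES)
    + r")\s+((?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)*)"
)


def find_ship_names(text):
    """Return the ship names matched by the prefix rule, in order of appearance."""
    return [match.group(1) for match in SHIP_PATTERN.finditer(text)]
```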
The only way to deal with this dilemma is good old-fashioned context research. Download *lots* of pages and do lots of KWIC (keyword-in-context) searches. Sometimes if you are really lucky, you find a list…..the best reward a hard-working linguist could hope for!
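If you have never run a keyword-in-context search, the idea is simple enough to sketch in Python. This is a bare-bones version with a hypothetical function name and a fixed character window; a real concordancer would work over many downloaded pages and usually over tokens, not characters.

```python
import re


def kwic(text, keyword, width=30):
    """Return one line per occurrence: left context, [keyword], right context."""
    # Collapse newlines and repeated spaces so each context fits on one line.
    flat = " ".join(text.split())
    rows = []
    for match in re.finditer(re.escape(keyword), flat, flags=re.IGNORECASE):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        rows.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return rows


if __name__ == "__main__":
    # "Fizzo" is a made-up product name used only for this demonstration.
    sample = "We launched the Fizzo drink last year. Sales of Fizzo rose quickly."
    for row in kwic(sample, "fizzo", width=20):
        print(row)
```

Reading down the aligned column of contexts is what lets you spot the recurring words around a candidate entity.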
In this slide show from the Sentiment Analysis Symposium, LBTech founder Leslie Barrett discusses some of the issues that practitioners encounter when doing document-level sentiment analysis on news data. Click the link below to view the show.
Sentiment, News and the Polarity Problem