Sunday, October 5, 2008

What is semantic about semantic search?

Semantic search is supposed to be about the idea of exploiting the meaning of words rather than treating them as just sequences of letters. But, what does "meaningful" mean? This has been a central question for much of Western philosophy in the 20th Century and we can take advantage of this thought for organizing our approach to semantic search.
One theory is that the meaning of a word is the category of things it refers to. On this theory, we know the meaning of a word when we know what category it references. We can write rules for recognizing categories, for example, that a string of the form "ddd-dd-dddd" (where each d is a digit) refers to the category of social security numbers. The word "unicorn" refers to the category of mythical animals. We can reason about these words because we can have rules about categories and their properties. We have a semantic web of concepts when we link together all of the words and categories into an ontology and a set of rules for dealing with the elements of this ontology.
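As a rough illustration of what such category rules might look like, here is a minimal sketch in Python. Only the "ddd-dd-dddd" shape comes from the text; the rule table, the category names, the time-of-day pattern, and the `categorize` helper are all assumptions made up for this example.

```python
import re

# Hypothetical rule table: each category name maps to a pattern that
# recognizes strings belonging to it. Only the "ddd-dd-dddd" shape comes
# from the text; the names and the second rule are illustrative.
CATEGORY_RULES = {
    "social_security_number": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "time_of_day": re.compile(r"(?:[01]?\d|2[0-3]):[0-5]\d"),
}


def categorize(token):
    """Return the names of every category whose rule matches the whole token."""
    return [name for name, pattern in CATEGORY_RULES.items()
            if pattern.fullmatch(token)]


print(categorize("123-45-6789"))  # ['social_security_number']
print(categorize("14:30"))        # ['time_of_day']
print(categorize("unicorn"))      # [] (no rule covers this word)
```

A word like "unicorn" falls through because no one has written a rule for it, which hints at how much hand-built structure this approach needs.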
We recognize that some words, in the sense of a string of letters, can be ambiguous, for example, "bank" can refer to the side of a river, a financial institution, or to the act of tilting an airplane, but we use context to pick which sense of "bank" is intended. Part of this context often includes a syntactic analysis of the sentences in which the word occurs.
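To make the idea of using context to pick a sense concrete, here is a small sketch that counts how many hand-picked cue words for each sense of "bank" appear in the sentence. It is a deliberately simplified word-overlap heuristic, not a description of any particular system; the cue lists and the `pick_sense` function are hypothetical, and it does no syntactic analysis at all.

```python
# Hypothetical cue words for each sense of "bank"; a real system would
# derive these from a dictionary or a corpus rather than a hand-made table.
SENSE_CUES = {
    "river bank": {"river", "water", "shore", "fishing", "muddy"},
    "financial institution": {"money", "loan", "deposit", "account", "teller"},
    "tilt an airplane": {"pilot", "airplane", "turn", "wing", "runway"},
}


def pick_sense(sentence):
    """Choose the sense whose cue words overlap most with the sentence."""
    words = set(sentence.lower().replace(".", "").split())
    scores = {sense: len(cues & words) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no overlapping cue words there is nothing to decide on.
    return best if scores[best] > 0 else None


print(pick_sense("She opened an account at the bank to deposit money."))
print(pick_sense("The pilot had to bank the airplane sharply."))
```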
Such an approach would allow computational reasoning about a large number of entities. It would allow us to recognize times and place names, for example, and to do things like set up appointments. This is the approach taken by the semantic web and by many of the systems that claim to do semantic search. It is capable of serving many needs, but it is not sufficient for semantic search.
The meaning-as-categorization view implies that there is a fixed set of categories that a word could refer to. Companies have spent years and millions of dollars coming up with a definitive set of categories and rules for identifying them. The problem for semantic search is that people are not limited to reasoning about a fixed set of categories. Human thought is much more flexible than that.

  • People frequently create new categories (palm top computers).

  • People make up ad hoc categories (things to buy my wife as a gift).

  • People often create new words (iPod).

  • People frequently use old words in new ways (Twitter).

  • Words are far more ambiguous than seems apparent (palm, strike).

  • Knowing what category a word belongs to may not be enough to deliver useful, meaningful results (silicon).


Let's take a closer look at these last two items, which are related. The 500 most frequently used words in the English language have an average of 23 definitions each. The word "set" has 464. In addition, the problem of word ambiguity is far more pervasive than seems apparent.
For example, consider this simple sentence:

"The companies have agreed to a brief delay in implementing their agreement."


This sentence does not seem to be particularly ambiguous. You probably had no trouble understanding it. But if you look up each word in the Oxford English Dictionary, you'll find that each word has multiple definitions.





Word            Number of major definitions
The             37
Companies       14
Have            39
Agreed          17
To              54
A               62
Brief           20
Delay           8
In              84
Implementing    8
Their           7
Agreement       9

By this analysis, this simple sentence has almost 8 quadrillion possible interpretations (7,788,584,618,680,320, the product of the definition counts). Even if you take out the stop words (the, have, to, a, in, their), there are still 2,741,760 interpretations. Humans can understand this sentence because they (more or less unconsciously) use each word to focus the interpretations of each other word. Put another way, the meaning of each word is given by its context. The philosopher Wittgenstein came to the conclusion that the meaning of a word is its use in the language, and this analysis argues for the same kind of idea. One conclusion we can draw is that building a complete representation of the meanings of English words is a formidable task. And it would still not be sufficient to identify relevant results, because there is more ambiguity than even the number of definitions suggests.
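The interpretation counts are simply the product of the per-word definition counts, on the assumption that every definition of one word can combine with every definition of the others. A short Python sketch reproduces both figures; the dictionary and variable names are chosen here for illustration, and the numbers are copied from the table above.

```python
from math import prod

# Definition counts per word, copied from the table above.
DEFINITIONS = {
    "the": 37, "companies": 14, "have": 39, "agreed": 17,
    "to": 54, "a": 62, "brief": 20, "delay": 8,
    "in": 84, "implementing": 8, "their": 7, "agreement": 9,
}
STOP_WORDS = {"the", "have", "to", "a", "in", "their"}

# Every definition of one word can pair with every definition of the others,
# so the number of readings is the product of the counts.
all_words = prod(DEFINITIONS.values())
content_words = prod(n for w, n in DEFINITIONS.items() if w not in STOP_WORDS)

print(f"{all_words:,}")      # 7,788,584,618,680,320
print(f"{content_words:,}")  # 2,741,760
```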
A search for the word "silicon," for example, in a broad search engine such as Yahoo returns documents about the properties of the element, alternative biochemistry, and the Silicon Valley City Guide. These topics are not likely to be of much interest to a searcher looking for information from a green perspective. From this point of view, silicon is about solar cells. A green search engine should return documents about solar cells in preference to ones about silicon's atomic structure. It's the same definition of silicon in both searches, but the two searches should give very different results.
This is the way that the Truevert search engine works. It learns the meanings of words the way people do, from how they are used in the language, and it uses words to disambiguate other words. There is much that can be learned about semantic search from philosophers and linguists. Taking advantage of that information can be very helpful in delivering to people the information that they are looking for, rather than just the information that is convenient to compute. Delivering focused search results depends on the ability to understand the meaning of words at a detailed level. This understanding will not come from syntactic analysis or from the construction of elaborate ontologies. It will come from using human-like processes on the documents themselves.

HLR
