Tuesday, April 7, 2009

Semantic Search


We had a great experience last week at Alternative Search Engine Day II. We were on two panels, one on green search (Brian) and one on semantic search (Herb). The gist of our presentation on semantic search can be seen in a slideshare presentation.

One of the things that came out of the discussion on semantic search is the wide variety of approaches that fall under the umbrella of semantic search.

Although much of the attention in semantic search falls on the semantic web and on annotation using things like RDF triples to represent knowledge, there are other approaches, and conflating them only adds to the confusion.

There are at least four approaches to semantic search. Different semantic search engines may use one or more of these approaches. The point of semantic search is to use meaning to improve the user's search experience. For example, one approach is to use contextual analysis to help to disambiguate queries. Does the word "strike," for example, refer to baseball or labor or something else entirely? This approach is a major emphasis of Truevert.

Another approach focuses on reasoning. Given a set of facts that are represented in the system, additional facts can be inferred from them. If the system knows who J.S. Bach's children were, and it knows who each of their children were, then a reasoning system can infer who Bach's grandchildren were. TrueKnowledge presented a system with an emphasis on reasoning at the conference.
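As a rough illustration of this kind of inference, here is a minimal Python sketch. It is not how TrueKnowledge or any other engine actually works; the fact table, the placeholder names, and the function grandchildren are all made up for the example. It simply chains two "child of" facts to derive a "grandchild of" fact that was never stored directly.

```python
# Hypothetical fact base: parent -> list of children.
# The names are placeholders, not real genealogy.
children_of = {
    "Bach": ["Child A", "Child B"],
    "Child A": ["Grandchild 1"],
    "Child B": ["Grandchild 2", "Grandchild 3"],
}


def grandchildren(person, facts):
    """Infer grandchildren by chaining two 'child of' facts."""
    result = []
    for child in facts.get(person, []):
        # Each child's children are the person's grandchildren.
        result.extend(facts.get(child, []))
    return result


print(grandchildren("Bach", children_of))
# ['Grandchild 1', 'Grandchild 2', 'Grandchild 3']
```

The system never stored the grandchildren as facts; they fall out of the rule.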

A number of semantic search engines emphasize natural language understanding. These engines process the content they index and the queries people submit to try to identify the intent of the information. They use the syntax of the sentence and rules to identify people, places, organizations, and so forth. Powerset makes extensive use of natural language understanding.

The fourth approach uses an ontology to represent knowledge about a domain and to expand queries. In this approach, when a user enters a query for a word like "truck," the system adds terms from its ontology (e.g., "vehicle," because a truck is a kind of vehicle) to make the search broader while keeping it focused on the intended sense. This approach is used by a large number of semantic search systems.
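A toy version of ontology-based query expansion might look like the following Python sketch. The tiny IS_A table and the expand_query function are assumptions made for illustration; real ontologies are far larger and real systems weight the added terms rather than simply appending them.

```python
# Hypothetical miniature ontology: term -> broader ("is a kind of") term.
IS_A = {
    "truck": "vehicle",
    "car": "vehicle",
    "vehicle": "machine",
}


def expand_query(term, is_a, max_levels=1):
    """Return the term plus up to max_levels broader terms from the ontology."""
    terms = [term]
    current = term
    for _ in range(max_levels):
        parent = is_a.get(current)
        if parent is None:
            break  # reached the top of the ontology
        terms.append(parent)
        current = parent
    return terms


print(expand_query("truck", IS_A))                # ['truck', 'vehicle']
print(expand_query("truck", IS_A, max_levels=2))  # ['truck', 'vehicle', 'machine']
```

How far up the ontology to climb is a design choice: each extra level broadens the search but also risks pulling in less relevant documents.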

There is not just one approach to semantic search. Most semantic search engines mix and match them in various ways to yield a unique search experience for their users. Each approach has much to contribute. None of the semantic search engines presenting at the conference limited themselves to just one of these.

Semantic search is not a single monolithic tool, either. Different kinds of search are intended to fulfill different kinds of functions. One size does not fit all. There is room for variety.

Finally, if you can think of other approaches to semantic search, please let us know. There are over 120 different terms that are roughly synonyms for the word "think." There are likely to be more than four approaches to the notion of semantic search.

Monday, March 23, 2009

AltSearchEngines Day II

We will be participating in AltSearchEngines Day II next Monday, March 30th, at San Francisco’s Intercontinental Hotel, from 9:00 AM – 5:00 PM.

This is a one-day, grassroots event which brings together the best and brightest minds of the "alternative" search engines. Practically every alternative to the big "G" will be represented, from Microsoft / Powerset to SurfCanyon to Yahoo BOSS to key vertical search categories including Green Search, Health Search, Image Search, and Semantic Search, plus several special interim presentations.

This is a unique opportunity to meet some of the most influential innovators in Search face to face!


If you are in town for Web 2.0, this is a perfect complement to the week!
Please join us.

Register and find more details here.


Tuesday, November 11, 2008

Tuesday, October 28, 2008

Updated site, new layout, added news

We've just updated the site to include green news from around the net.

Sunday, October 5, 2008

What is semantic about semantic search?

Semantic search is supposed to be about exploiting the meaning of words rather than treating them as just sequences of letters. But what does "meaningful" mean? This has been a central question for much of Western philosophy in the 20th century, and we can take advantage of that thinking in organizing our approach to semantic search.
One theory is that the meaning of a word is the category of things it refers to. On this theory, we know the meaning of a word when we know what category it references. We can write rules for recognizing categories, for example, that a string in the form of "ddd-dd-dddd" refers to the category of social security numbers. The word "unicorn" refers to the category of mythical animals. We can reason about these words because we can have rules about categories and their properties. We have a semantic web of concepts when we link together all of the words and categories into an ontology and a set of rules for dealing with the elements of this ontology.
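A category-recognition rule of this kind can be sketched in a few lines of Python. The pattern below is one plausible reading of "ddd-dd-dddd" (each "d" a digit), and the names SSN_PATTERN and find_ssn_like are invented for the example. Note that the rule only recognizes the shape of the string; it says nothing about whether the number is actually a valid social security number.

```python
import re

# "ddd-dd-dddd": three digits, two digits, four digits, separated by hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_like(text):
    """Return every substring whose shape matches ddd-dd-dddd."""
    return SSN_PATTERN.findall(text)


print(find_ssn_like("Call 555-1234 or file under 123-45-6789."))
# ['123-45-6789']
```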
We recognize that some words, in the sense of a string of letters, can be ambiguous, for example, "bank" can refer to the side of a river, a financial institution, or to the act of tilting an airplane, but we use context to pick which sense of "bank" is intended. Part of this context often includes a syntactic analysis of the sentences in which the word occurs.
Such an approach would allow computational reasoning about a large number of entities. It would allow us to recognize times and place names, for example, and to do things like set up appointments. This is the approach taken by the semantic web and by many of the systems that claim to do semantic search. It is capable of serving many needs, but it is not sufficient for semantic search.
The meaning-as-categorization view implies that there is a fixed set of categories that a word could refer to. Companies have spent years and millions of dollars coming up with a definitive set of categories and rules for identifying them. The problem for semantic search is that people are not limited to reasoning about a fixed set of categories. Human thought is much more flexible than that.

  • People frequently create new categories (palm top computers).

  • People make up ad hoc categories (things to buy my wife as a gift).

  • People often create new words (iPod).

  • People frequently use old words in new ways (Twitter).

  • Words are far more ambiguous than seems apparent (palm, strike).

  • Knowing what category a word belongs to may not be enough to deliver useful, meaningful results (silicon).


Let's take a closer look at these last two items, which are related. The 500 most frequently used words in the English language have an average of 23 definitions each. The word "set" has 464. In addition, the problem of word ambiguity is far more pervasive than seems apparent.
Consider, for example, this simple sentence:

"The companies have agreed to a brief delay in implementing their agreement."


This sentence does not seem to be particularly ambiguous. You probably had no trouble understanding it. But if you look up each word in the Oxford English Dictionary, you'll find that each word has multiple definitions.





Word            Number of major definitions
The             37
Companies       14
Have            39
Agreed          17
To              54
A               62
Brief           20
Delay           8
In              84
Implementing    8
Their           7
Agreement       9

By this analysis, this simple sentence has almost 8 quadrillion possible interpretations (7,788,584,618,680,320, the product of the definition counts). Even if you take out the stop words (the, have, to, a, in, their), there are still 2,741,760 interpretations. Humans can understand this sentence because they (more or less unconsciously) use each word to focus the interpretations of every other word. Put another way, the meaning of each word is given by its context. The philosopher Wittgenstein came to the conclusion that the meaning of a word is its use in the language. This analysis also argues for the same kind of idea. One of the conclusions we can draw from this analysis is that having a complete representation of the meanings of English words is a formidable and daunting task. And it would still not be sufficient to identify relevant results. There is more ambiguity than even the number of definitions.
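The interpretation counts are simply the product of the per-word definition counts, assuming each word's sense can be chosen independently of the others. A short Python check, using the counts listed for the example sentence (the variable names are just for this sketch):

```python
from math import prod

# Number of major definitions for each word in the example sentence.
definitions = {
    "the": 37, "companies": 14, "have": 39, "agreed": 17,
    "to": 54, "a": 62, "brief": 20, "delay": 8,
    "in": 84, "implementing": 8, "their": 7, "agreement": 9,
}
stop_words = {"the", "have", "to", "a", "in", "their"}

# Every combination of one sense per word counts as one interpretation.
all_words = prod(definitions.values())
content_only = prod(n for w, n in definitions.items() if w not in stop_words)

print(f"{all_words:,}")     # 7,788,584,618,680,320
print(f"{content_only:,}")  # 2,741,760
```

The first figure is roughly 7.8 × 10^15.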
A search for the word "silicon," for example, in a broad search engine such as Yahoo returns documents about the properties of the element, alternative biochemistry, and the Silicon Valley City Guide. These are not the topics likely to be of much interest to a searcher looking for information from a green perspective. From this point of view, silicon is about solar cells. A green search engine should return documents about solar cells in preference to ones about silicon's atomic structure. It's the same definition of silicon in both searches, but the two searches should give very different results.
This is the way that the Truevert search engine works. It learns the meanings of words the way people do, from how they are used in the language, and it uses words to disambiguate other words. There is much that can be learned about semantic search from philosophers and linguists. Taking advantage of that information can be very helpful in delivering to people the information that they are looking for, rather than just the information that is convenient to compute. Delivering focused search results depends on the ability to understand the meaning of words to a detailed level. This understanding will not come from syntactic analysis or from the construction of elaborate ontologies. It will come from using human-like processes on the documents themselves.

HLR