Moderator: Kevin Ryan, VP, Global Content Director, SES & Search Engine Watch
Speakers:
Amit Kumar, Director Product Management, Yahoo! Search
Nagaraju Bandaru, Co-Founder & CTO, BooRah
Erik Collier, VP, Product Management, Ask.com
Tim McGuiness, VP of Search, hakia.com
Scott Prevost, General Manager & Director of Product, Powerset
First Speaker: Amit Kumar
My job today is to convince you that semantic search is coming closer and closer to reality and that you can participate, and I'm planning on exploring the following:
- What does semantic search mean (to us)?
- Potential user experiences?
- Let's build it together!
What does it mean (to us)? (Understand underlying structured data)
When the web started, most of the content was written by hand: research papers, personal experiences, etc. Over the last 5 years or so, most of the content (70%+) has come from structured databases. Search engines and other companies have recognized that getting to and understanding the structured data is of key importance. Once you have access to that data, helping users understand that data is the next step. Help the users get from “to do” to “done”. These are the guiding principles for Yahoo in terms of semantic search.
Potential User Experiences
(Amit provides a list of ten sample search queries and then shows samples of Yahoo’s semantic search results – the results include images, supplemental links below the images, suggestions for further clarifying the query, constraining the query to specific sources, etc.)
Local results show maps, yelp reviews, store hours, phone, etc. right in the search results.
For the search query util.list, the results are not only relevant, they are also broken down by version of the software (cshel: Cool!)
All of these results are live today. Some of the things are on by default, including blog results, LinkedIn results, Yelp, etc. It’s providing real value to the users.
Yahoo wants your data. Expose your structured data. Use microformats, RDFa markup (preferred). Use datafeeds (based on open RDFa format). Build SearchMonkey applications. In return for going through the trouble to expose your data, Yahoo is seeing a 15% increase in clicks for participating sites.
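As a rough illustration of what "exposing structured data" can mean in practice, here is a minimal sketch of a page fragment marked up with hCard-style microformat class names, plus a small parser that pulls the fields back out. The sample HTML, the `WANTED` field list, and the `extract_hcard` helper are all assumptions made up for this example; this is not Yahoo's crawler or the SearchMonkey API.

```python
from html.parser import HTMLParser

# Hypothetical page fragment using hCard-style class names.
SAMPLE_HTML = """
<div class="vcard">
  <span class="fn org">Example Bistro</span>
  <span class="tel">555-0100</span>
  <span class="locality">San Francisco</span>
</div>
"""

# Fields this toy extractor cares about (an assumption for the example).
WANTED = {"fn", "tel", "locality"}


class HCardParser(HTMLParser):
    """Collects the text of elements whose class names are in WANTED."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = []  # wanted class names on the element being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self._current = [c for c in classes if c in WANTED]

    def handle_data(self, data):
        text = data.strip()
        if text:
            for name in self._current:
                self.fields[name] = text

    def handle_endtag(self, tag):
        self._current = []


def extract_hcard(html):
    """Return a dict of the wanted fields found in the markup."""
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.fields


if __name__ == "__main__":
    print(extract_hcard(SAMPLE_HTML))
```

The point of the sketch is only that once a site labels its name, phone number and city explicitly, a consumer no longer has to guess which string is which.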
Second Speaker: Nagaraju Bandaru
BooRah is a local search company that uses natural language processing to extract intention and affinity from user reviews and blogs from across the Internet. BooRah generates the structured data that helps make the semantic search experience richer.
Ultimately, semantic search is about leveraging information to make the content on the semantic search results more relevant.
Google continues to use “smart indexing”, keywords/attributes, and behind-the-scenes semantics. Yahoo is taking a more open approach, using content markup and open search, and integrating vertical data feeds to enhance its results.
Companies such as Hakia, Aggregate Knowledge, and BooRah are trying to understand the content itself (more so than just the keywords), using natural language search, discovery and recommendation, and sentiment analysis.
Does search behavior ultimately change the user experience? Bandaru thinks so. When people start searching for “Where can I find the best foie gras in San Francisco?” and then start getting results that they like and can use, they’ll start doing those types of searches more frequently.
This doesn’t negate standard SEO best practices because that is still information that the search engines need, and there are still users who do searches “old school”.
Bandaru’s company specializes in sentiment extraction to enhance search. Sentiment extraction summarizes “gobs” of content into things like category-specific scores, normalizes different data sources, and then rolls it all up into an easy search and sort. They also look at inferred metadata, which leverages existing content like reviews, blogs and message boards. All of this increases relevance and content. It helps filter out keyword spam and also provides more context for location-aware mobile applications.
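The "normalize different sources, then roll up into category scores" idea can be sketched in a few lines. This is a toy illustration only: the review records, the source names, and the `roll_up` and `rank_categories` helpers are hypothetical, and the sketch assumes each source already supplies a numeric score on a known scale, which skips the hard part (extracting sentiment from free text) entirely.

```python
from collections import defaultdict

# Hypothetical review records: (source, category, score, scale_max).
# Different sources rate on different scales (5-point vs 10-point).
REVIEWS = [
    ("site_a", "food", 4.0, 5),
    ("site_b", "food", 8.0, 10),
    ("site_a", "service", 2.0, 5),
    ("blog_c", "service", 6.0, 10),
]


def roll_up(reviews):
    """Normalize every score to the 0..1 range, then average per category."""
    by_category = defaultdict(list)
    for _source, category, score, scale_max in reviews:
        by_category[category].append(score / scale_max)
    return {cat: sum(vals) / len(vals) for cat, vals in by_category.items()}


def rank_categories(reviews):
    """Return category names sorted from best to worst rolled-up score."""
    scores = roll_up(reviews)
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    print(roll_up(REVIEWS))
    print(rank_categories(REVIEWS))
```

Once everything is on one scale, the "easy search and sort" part is just an ordinary sort over the rolled-up numbers.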
Third Speaker: Erik Collier
We believe that the true semantic web is quite a ways away. You will start to see it when you find that you don’t have to rephrase your query to get the results you want. Right now, that’s not the case. The burden is on the user to think the way the engine wants them to think. If the meaning of the query is the same, you should get the same results no matter how you phrase your question. The semantic web will understand your intent and give you the answer based on your intent, not based on the exact arrangement of your keywords.
To see examples of how query phrasing and word selection affect the outcome of results, search Google, Yahoo, Live, Ask, Powerset and Hakia with the following queries and note the differences in the results:
What is the population size of Japan?
What is the population of Japan?
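To make the "same meaning, same results" goal concrete, here is a toy sketch in which both phrasings of the Japan query are reduced to the same canonical key before lookup. The stopword list, the `CANONICAL` phrase table, and the `normalize` function are assumptions invented for this example; real engines use far richer linguistic analysis, and nothing here reflects how Ask or any other engine actually works.

```python
import re

# Tiny hypothetical stopword list for this example.
STOPWORDS = {"what", "is", "the", "of", "a", "an"}

# Hypothetical equivalence table: phrase variant -> canonical concept.
CANONICAL = {"population size": "population"}


def normalize(query):
    """Reduce a query to an order-independent set of content words."""
    # Lowercase and replace punctuation with spaces.
    text = re.sub(r"[^a-z0-9 ]", " ", query.lower())
    text = " ".join(text.split())
    # Fold known phrase variants into one canonical concept.
    for variant, concept in CANONICAL.items():
        text = text.replace(variant, concept)
    return frozenset(w for w in text.split() if w not in STOPWORDS)


if __name__ == "__main__":
    q1 = normalize("What is the population size of Japan?")
    q2 = normalize("What is the population of Japan?")
    print(q1 == q2)
```

If the two keys match, a lookup keyed on them returns the same answer for both phrasings, which is the behavior Collier describes as the target.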
Ask is currently using structured content feeds containing all listings for a two-week period. They apply logic based on the actual question (queries for specific actors/titles/etc.), and then based on specific words in the search. They then apply user data like location based on IP.
Kevin Ryan asks Erik Collier: So tell me… we’ve been spending years teaching people to search using caveman speak. How are we going to change the minds of people to get them to switch from the caveman speak to real language search terms?
Erik Collier responds: We’re going to start using a club and hitting them over the head. No really, because Ask.com’s history is using natural language, we feel like we have a leg up on that because we’ve always encouraged our users to ask “real questions”. We are trying to encourage them by phrasing the suggested query refinements in natural language as well.
Fourth Speaker: Kartal Guner, Chief Architect, Hakia.com
Semantic technology embodies cognitive knowledge and operates on concept relations. It paves the way to text-meaning representation and conversational aptitude. Challenges:
- Know-how
- Time constraints/scalability
Current web search suffers from the limitations of statistical methods operating on keywords:
- Dependence on link referrals, behavior tracking, corpus selection, etc.
- Long tail phenomena
- May be vulnerable to external manipulation (miserable failure)
Semantic Search Operations at Hakia:
- Generalization-Specialization
- Parallelization (equivalent meanings)
- Question type detection and relevance – we’ve broken questions down into about 50 different types.
- Categorization
- Compression
- Content characterization (Qdexing)
- Disambiguation (applies to all of the above)
All of these capabilities allow search forms of higher refinement and human interactivity – conversational search (the user asks questions, the search engine asks questions back to help further refine the query).
Their goals are to provide higher relevance for searchers, the freedom to search in natural language and a new search experience – going beyond 10 blue links to coherent text, focused sentences and referenced links.
For advertisers, there’s a higher relevance and better targeting. For SEOs, the goal is to allow them to focus on content rather than keywords.
Fifth Speaker: Scott Prevost
What we’ve heard from most of the speakers is that semantic search is about structured data, and really most of the data on the web today is still unstructured. So Powerset provides structure to that content.
Keyword techniques involve shallow representations of document meaning and user intent. Powerset believes that better relevance can be achieved through improved models of the meaning of documents and queries.
Their “Vision for Search” is two-pronged. The first prong is improving search relevance: applying deep natural language processing to extract semantic features from text and encode them in the index, extracting semantic features from the queries themselves, and then retrieving and ranking documents based on the semantic keywords and other features. The second prong is to improve the user experience.
The semantic impact on relevance is improving recall through word and phrase variations (synonyms, hypernyms and anaphors) and improving precision through linguistic structure and content.
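The recall half of that claim can be illustrated with a toy index that adds synonyms and hypernyms to each document's terms. Everything here is a made-up sketch: the `SYNONYMS` and `HYPERNYMS` tables, the four documents, and the `index_terms` and `search` helpers are assumptions for illustration, anaphora is not covered at all, and this is not Powerset's implementation.

```python
# Hypothetical lexicon for the example.
SYNONYMS = {"automobile": "car"}   # variant -> preferred term
HYPERNYMS = {"sedan": "car"}       # specific term -> broader term

# Hypothetical document collection.
DOCS = {
    "d1": "used car prices",
    "d2": "automobile repair tips",
    "d3": "red sedan for sale",
    "d4": "bicycle lanes downtown",
}


def index_terms(text, semantic):
    """Return the set of terms a document is indexed under."""
    terms = set(text.lower().split())
    if semantic:
        # Add the preferred term for any synonym in the document,
        # and the broader term for any specific term in the document.
        terms |= {SYNONYMS[t] for t in terms if t in SYNONYMS}
        terms |= {HYPERNYMS[t] for t in terms if t in HYPERNYMS}
    return terms


def search(query, semantic):
    """Return ids of documents indexed under the single-word query."""
    q = query.lower()
    return sorted(
        doc_id for doc_id, text in DOCS.items()
        if q in index_terms(text, semantic)
    )


if __name__ == "__main__":
    print(search("car", semantic=False))  # keyword match only
    print(search("car", semantic=True))   # with synonym/hypernym expansion
```

With plain keyword matching, a search for "car" misses the documents that say "automobile" or "sedan"; with the expanded index it finds them while still leaving out the unrelated bicycle document.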
The impact on the user experience includes allowing more natural and flexible querying, regardless of whether you’re searching using keywords, topics, phrases or full questions. Powerset also highlights the relevant part of the results for the user.
Live blogging provided by Cshel.