In this post, I will try to explain the factors that can impact the relevance score in enterprise search. First of all, you have to hand it to the Microsoft Research team and the MSN internet search team for coming up with such an incredible search engine that actually works. Enterprise search is a feature available with the MOSS Enterprise Search license. It is a true search engine that crawls and keeps track of not only SharePoint data and files but also external data and files, including network shares, LOB systems (via BDC), people search via LDAP, My Sites, etc. The best part is the extensibility of the Search Center: you can add new content sources, scopes, and managed property mappings, and make creative use of XSLT to format and display search results in an enterprise-themed layout. The most misunderstood aspect of enterprise search is relevance. Simply put, the relevance score is calculated by a number of content relevance algorithms. Based on the keywords a user types in, the search results are sorted in order of relevance score. There are a number of factors that can impact the relevance score.
Click Distance:
A web-based application uses hyperlinks to connect one item to another. The number of clicks it takes a user to get to a document or search result impacts the relevance score. Needless to say, the result with the smallest click distance tends to appear at the top of the search results. The crawler looks at the authoritative pages (pages that contain topic-specific, unique content) and all the links that connect the pages, and assigns each page a relevance score. The closer a page is to an authoritative page, the higher the relevance score assigned to it.
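To make the idea concrete, here is a minimal sketch of how click distance could be computed with a breadth-first search over a link graph. It is only an illustration, not how MOSS actually implements it: the `click_distances` function and the site URLs below are hypothetical.

```python
from collections import deque


def click_distances(links, authoritative):
    """Return the fewest clicks needed to reach each page from any authoritative page."""
    distance = {page: 0 for page in authoritative}
    queue = deque(authoritative)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            # The first time a page is reached is also the shortest route to it.
            if target not in distance:
                distance[target] = distance[page] + 1
                queue.append(target)
    return distance


# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "/": ["/hr", "/it"],
    "/hr": ["/hr/policies"],
    "/hr/policies": ["/hr/policies/leave.docx"],
    "/it": [],
}

distances = click_distances(links, ["/"])
for page, clicks in sorted(distances.items(), key=lambda item: item[1]):
    print(clicks, page)
```

Pages with a smaller number here would be the ones that, all else being equal, tend to score higher.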
URL Depth:
URL depth also impacts the relevance score. The more slashes (/) there are in the URL, the lower the relevance score.
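As a rough illustration, URL depth can be approximated by counting the path segments of a URL. The `url_depth` helper and the example URLs are assumptions made for this sketch; the exact way the search engine measures depth is not described here.

```python
from urllib.parse import urlparse


def url_depth(url):
    """Count the non-empty path segments in a URL."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


# Hypothetical URLs on an intranet portal.
urls = [
    "http://portal/sites/hr/policies/leave.docx",
    "http://portal/docs.aspx",
    "http://portal/sites/hr",
]

# Shallower URLs come first, mirroring the idea that they score higher.
by_depth = sorted(urls, key=url_depth)
for url in by_depth:
    print(url_depth(url), url)
```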
Hyperlink Anchor Text:
The text used to describe a hyperlink is called anchor text, as in <a href="docs.aspx">My Documents</a>. In this example, "My Documents" is the anchor text used to describe the URL "docs.aspx". Anchor text plays no role in deciding whether a result appears for a search query, but it plays an important role in assigning the relevance score. In other words, if a user searches for "crawl" and no document contains the keyword "crawl" (even though there are hyperlinks labeled "Crawl"), the search query ignores the link descriptions and no results are displayed.
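For readers who want to see what a crawler might pull out of a link, the sketch below collects (URL, anchor text) pairs from HTML using Python's standard `html.parser` module. The `AnchorTextCollector` class is a hypothetical name for this example and is not part of MOSS.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None   # href of the link currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


collector = AnchorTextCollector()
collector.feed('<a href="docs.aspx">My Documents</a>')
anchors = collector.anchors
print(anchors)
```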
Document Title:
Each document has some metadata inherently stored with it, including the author's name, the last modified date, and the title. The title plays an important role in assigning the relevance score. Most people do not pay much attention to the title, but the search engine is smart enough to ignore the default title assigned by the editing tool and look at the first page to establish title relevance. For example, PowerPoint files often carry a default title of "Slide 1", but the search engine ignores that and tries to read the first slide to assign title relevance.
File Type Biasing:
Simply put, file type biasing means that some document types are favored over others when search results are ranked. The default ranking order in Enterprise Search is:
- Web pages
- PowerPoint presentations (ppt, pptx)
- Word documents (doc, docx)
- XML files (xml)
- Excel workbooks (xls, xlsx)
- Plain text files (txt)
- SharePoint list items
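The ordering above can be sketched as a simple lookup that sorts results by file type. This is a simplification: the real engine presumably blends file type into an overall score rather than sorting strictly by type. The extension mapping (including treating .aspx, .htm and .html as web pages and anything unrecognized as a list item) and the file names are assumptions for illustration only.

```python
import os

# Lower rank means the type is favored; the order follows the list above.
FILE_TYPE_RANK = {
    ".aspx": 0, ".htm": 0, ".html": 0,   # web pages (assumed extensions)
    ".ppt": 1, ".pptx": 1,
    ".doc": 2, ".docx": 2,
    ".xml": 3,
    ".xls": 4, ".xlsx": 4,
    ".txt": 5,
}
LIST_ITEM_RANK = 6  # assumption: anything unrecognized is treated like a list item


def file_type_rank(name):
    """Look up the bias rank of a result from its file extension."""
    extension = os.path.splitext(name)[1].lower()
    return FILE_TYPE_RANK.get(extension, LIST_ITEM_RANK)


# Hypothetical results that are otherwise equally relevant.
results = ["budget.xlsx", "notes.txt", "overview.pptx", "home.aspx", "plan.docx"]
ordered = sorted(results, key=file_type_rank)
print(ordered)
```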
It is really important to understand how the MOSS search engine assigns relevance scores and places links and documents at the top of the search results. In future posts, I will talk about best bets and how they can help highlight some of the common search keywords in an organization.