Developing an SEO-Friendly Website: Content Delivery and Search Spider Control (part 2)


3. How to Display Different Content to Search Engines Versus Visitors

A variety of strategies exist to segment content delivery. The most basic is to serve content that is not meant for the engines in unspiderable formats (e.g., placing text in images, Flash files, plug-ins, etc.). You should not use these formats for the purpose of cloaking. You should use them only if they bring a substantial end-user benefit (such as an improved user experience). In such cases, you may want to show the search engines the same content in a search-spider-readable format. When you’re trying to show the engines something you don’t want visitors to see, you can use CSS formatting styles (preferably not display:none, as the engines may have filters to watch specifically for this), JavaScript, user-agent, cookie, or session-based delivery, or perhaps most effectively, IP delivery (showing content based on the visitor’s IP address).
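As an illustration of the legitimate use case described above (giving spiders a readable equivalent of content that is otherwise locked in an unspiderable format), here is a minimal, hypothetical sketch of user-agent-based delivery in Python. The crawler substrings and template filenames are assumptions made purely for the example, not a production recipe, and serving spiders materially different content from what visitors see is exactly what the engines consider cloaking:

# Hypothetical sketch: choose a template based on the requesting User-Agent.
# The crawler substrings and template filenames below are illustrative only.

KNOWN_CRAWLERS = ("googlebot", "bingbot", "slurp")

def pick_template(user_agent):
    """Return the template to serve for a given User-Agent header."""
    ua = (user_agent or "").lower()
    if any(bot in ua for bot in KNOWN_CRAWLERS):
        # Crawler: serve the crawlable HTML/text equivalent of the rich-media page.
        return "product_text.html"
    # Regular visitor: serve the rich-media (e.g., Flash) version.
    return "product_rich.html"

if __name__ == "__main__":
    print(pick_template("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # product_text.html
    print(pick_template("Mozilla/5.0 (Windows NT 6.1; rv:5.0)"))     # product_rich.html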

Be very wary when employing cloaking techniques such as those just described. The search engines expressly prohibit these practices in their guidelines, and though there is leeway based on intent and user experience (e.g., your site is using cloaking to improve the quality of the user’s experience, not to game the search engines), the engines do take these tactics seriously and may penalize or ban sites that implement them inappropriately or with the intention of manipulation.

3.1. The robots.txt file

This file is located on the root level of your domain (e.g., http://www.yourdomain.com/robots.txt), and it is a highly versatile tool for controlling what the spiders are permitted to access on your site. You can use robots.txt to:

  • Prevent crawlers from accessing nonpublic parts of your website

  • Block search engines from accessing index scripts, utilities, or other types of code

  • Avoid the indexation of duplicate content on a website, such as “print” versions of HTML pages, or various sort orders for product catalogs

  • Auto-discover XML Sitemaps

The robots.txt file must reside in the root directory, and the filename must be entirely in lowercase (robots.txt, not Robots.txt, or other variations including uppercase letters). Any other name or location will not be seen as valid by the search engines. The file must also be entirely in text format (not in HTML format).

Telling a search engine robot not to access a page in robots.txt prevents the crawler from fetching that page at all. Figure 2 illustrates what happens when a search engine robot sees a direction in robots.txt not to crawl a web page.

Figure 2. Impact of robots.txt


In essence, the page will not be crawled, so links on the page cannot pass link juice to other pages, since the search engine never sees those links. However, the page can still appear in the search engine’s index. This can happen if other pages on the Web link to it. Of course, the search engine will not have very much information on the page since it cannot read it, and will rely mainly on the anchor text and other signals from the pages linking to it to determine what the page may be about. The resulting search listings end up being pretty sparse when you see them in the Google index, as shown in Figure 3.

Figure 3. SERPs for pages that are listed in robots.txt


Figure 3 shows the results for the Google query site:news.yahoo.com/topics/ inurl:page. This is not a normal query that a user would enter, but you can see what the results look like. Only the URL is listed, and there is no description. This is because the spiders aren’t permitted to read the page to get that data. In today’s algorithms, these types of pages don’t rank very high because their relevance scores tend to be quite low for any normal queries.

Google, Yahoo!, Bing, Ask, and nearly all of the legitimate crawlers on the Web will follow the instructions you set out in the robots.txt file. Commands in robots.txt are primarily used to prevent spiders from accessing pages and subfolders on a site, though they have other options as well. Note that subdomains require their own robots.txt files, as do files that reside on an https: server.
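For example (the store hostname is illustrative), directives in http://www.yourdomain.com/robots.txt govern only www.yourdomain.com; a subdomain such as store.yourdomain.com is governed only by its own file at http://store.yourdomain.com/robots.txt.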

3.1.1. Syntax of the robots.txt file

The basic syntax of robots.txt is fairly simple. You identify a robot by its user agent name, such as “Googlebot”, and then specify the actions that apply to it on the lines that follow. Here are the major actions you can specify:

  • Disallow: the pages you want to block the bots from accessing (as many disallow lines as needed)

  • Noindex: the pages you want a search engine to block and not index (or de-index if previously indexed); this is unofficially supported by Google and unsupported by Yahoo! and Bing

Some other restrictions apply:

  • Each User-Agent/Disallow group should be separated by a blank line; however, no blank lines should exist within a group (between the User-Agent line and the last Disallow).

  • The hash symbol (#) may be used for comments within a robots.txt file, where everything after # on that line will be ignored. This may be used either for whole lines or for the end of lines.

  • Directories and filenames are case-sensitive: “private”, “Private”, and “PRIVATE” are all uniquely different to search engines.

Here is an example of a robots.txt file:

User-agent: Googlebot
Disallow:

User-agent: msnbot
Disallow: /

# Block all robots from tmp and logs directories
User-agent: *
Disallow: /tmp/
Disallow: /logs # for directories and files called logs

The preceding example will do the following:

  • Allow “Googlebot” to go anywhere.

  • Prevent “msnbot” from crawling any part of the site.

  • Block all other robots from the /tmp/ directory and from directories or files whose paths begin with /logs (e.g., /logs/ or /logs.php).

Notice that the behavior of Googlebot is not affected by instructions such as Disallow: /, which in this example applies only to msnbot. Because Googlebot has its own group of instructions in robots.txt, it ignores directives aimed at all robots (i.e., the User-agent: * group).

One common problem that novice webmasters run into occurs when they have SSL installed so that their pages may be served via HTTP and HTTPS. A robots.txt file at http://www.yourdomain.com/robots.txt will not be interpreted by search engines as guiding their crawl behavior on https://www.yourdomain.com. To do this, you need to create an additional robots.txt file at https://www.yourdomain.com/robots.txt. So, if you want to allow crawling of all pages served from your HTTP server and prevent crawling of all pages from your HTTPS server, you would need to implement the following:

For HTTP:

User-agent: *
Disallow:

For HTTPS:

User-agent: *
Disallow: /

These are the most basic aspects of robots.txt files, but there are more advanced techniques as well. Some of these methods are supported by only some of the engines, as detailed in the list that follows:


Crawl delay

Crawl delay is supported by Yahoo!, Bing, and Ask. It instructs a crawler to wait the specified number of seconds between crawling pages. The goal with the directive is to reduce the load on the publisher’s server:

User-agent: msnbot
Crawl-delay: 5


Pattern matching

Pattern matching appears to be usable by Google, Yahoo!, and Bing. The value of pattern matching is considerable. You can do some basic pattern matching using the asterisk wildcard character. Here is how you can use pattern matching to block access to all subdirectories that begin with private (e.g., /private1/, /private2/, /private3/, etc.):

User-agent: Googlebot
Disallow: /private*/

You can match the end of the string using the dollar sign ($). For example, to block URLs that end with .asp:

User-agent: Googlebot
Disallow: /*.asp$

You may wish to prevent the robots from accessing any URLs that contain parameters in them. To block access to all URLs that include a question mark (?), simply use the question mark:

User-agent: *
Disallow: /*?*

The pattern-matching capabilities of robots.txt are more limited than those of programming languages such as Perl, so the question mark does not have any special meaning and can be treated like any other character.


Allow directive

The Allow directive appears to be supported only by Google, Yahoo!, and Ask. It works in the opposite way from the Disallow directive, providing the ability to specifically call out directories or pages that may be crawled. When implemented, it can partially override a previous Disallow directive. This may be beneficial after large sections of the site have been disallowed, or if the entire site itself has been disallowed.

Here is an example that allows Googlebot into only the google directory:

User-agent: Googlebot
Disallow: /
Allow: /google/


Noindex directive

This directive works in the same way as the meta robots noindex command (which we will discuss shortly) and tells the search engines to explicitly exclude a page from the index. Since a Disallow directive prevents crawling but not indexing, this can be a very useful feature to ensure that the pages don’t show in search results. However, as of October 2009, only Google supports this directive in robots.txt.
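For illustration, such a directive is written in the same form as a Disallow line (the directory name here is just an example):

User-agent: Googlebot
Noindex: /internal-search-results/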


Sitemaps

We discussed XML Sitemaps at the beginning of this chapter. You can use robots.txt to provide an autodiscovery mechanism for the spider to find the XML Sitemap file. The search engines can be told to find the file with one simple line in the robots.txt file:

Sitemap: sitemap_location

The sitemap_location should be the complete URL to the Sitemap, such as http://www.yourdomain.com/sitemap.xml. You can place this anywhere in your file.
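For example, a simple robots.txt file that combines Sitemap autodiscovery with a basic crawl restriction might look like this (the URL and directory are illustrative):

Sitemap: http://www.yourdomain.com/sitemap.xml

User-agent: *
Disallow: /tmp/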

For full instructions on how to apply robots.txt, see Robotstxt.org. You may also find it valuable to use Dave Naylor’s robots.txt generation tool to save time and heartache (http://www.davidnaylor.co.uk/the-robotstxt-builder-a-new-tool.html).

You should use great care when making changes to robots.txt. A simple typing error can, for example, suddenly tell the search engines to stop crawling any part of your site. After updating your robots.txt file, it is always a good idea to check it with the robots.txt testing tool in Google Webmaster Tools.
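As a supplementary check, you can also test a robots.txt file programmatically. The following is a minimal sketch using Python’s standard urllib.robotparser module; the domain and paths are placeholders:

from urllib.robotparser import RobotFileParser

# Point the parser at the live robots.txt file and download it.
parser = RobotFileParser()
parser.set_url("http://www.yourdomain.com/robots.txt")
parser.read()

# Ask whether specific URLs may be fetched by a given user agent.
print(parser.can_fetch("Googlebot", "http://www.yourdomain.com/"))
print(parser.can_fetch("*", "http://www.yourdomain.com/tmp/index.html"))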