3. How to Display Different Content to Search Engines Versus
Visitors
A variety of strategies exist to segment content delivery. The
most basic is to serve content that is not meant for the engines in
unspiderable formats (e.g., placing text in images, Flash files,
plug-ins, etc.). You should not use these formats for the purpose of
cloaking. You should use them only if they bring a substantial end-user
benefit (such as an improved user experience). In such cases, you may
want to show the search engines the same content in a
search-spider-readable format. When you’re trying to show the engines
something you don’t want visitors to see, you can use CSS formatting
styles (preferably not display:none,
as the engines may have filters to watch specifically for this),
JavaScript, user-agent, cookie, or session-based delivery, or perhaps
most effectively, IP delivery (showing content based on the visitor’s IP
address).
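If you do use user-agent-based delivery for a legitimate end-user reason, the server-side logic amounts to inspecting the User-Agent header and choosing what to return. The following Python sketch illustrates the mechanism only; the crawler tokens and template names are assumptions for illustration, not a complete or current list, and the cautions in the next paragraph apply:
KNOWN_CRAWLER_TOKENS = ("googlebot", "msnbot", "slurp")  # assumed tokens for the Google, MSN, and Yahoo! crawlers

def is_search_crawler(user_agent: str) -> bool:
    # Rough check: does the User-Agent header contain a known crawler token?
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_CRAWLER_TOKENS)

def choose_template(user_agent: str) -> str:
    # Send crawlers the spider-readable HTML version; send people the rich (e.g., Flash) version.
    return "article_text.html" if is_search_crawler(user_agent) else "article_flash.html"

print(choose_template("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # article_text.html
print(choose_template("Mozilla/5.0 (Windows NT 6.1; en-US)"))      # article_flash.html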
Be very wary when employing cloaking such as that which we just
described. The search engines expressly prohibit these practices in
their guidelines, and though there is leeway based on intent and user
experience (e.g., your site is using cloaking to improve the quality of
the user’s experience, not to game the search engines), the engines do
take these tactics seriously and may penalize or ban sites that
implement them inappropriately or with the intention of
manipulation.
3.1. The robots.txt file
This file is located on the root level of your domain (e.g.,
http://www.yourdomain.com/robots.txt), and it is
a highly versatile tool for controlling what the spiders are permitted
to access on your site. You can use robots.txt to:
Prevent crawlers from accessing nonpublic parts of your
website
Block search engines from accessing index scripts,
utilities, or other types of code
Avoid the indexation of duplicate content on a website, such
as “print” versions of HTML pages, or various sort orders for
product catalogs
Auto-discover XML Sitemaps
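For example, a robots.txt file covering these uses might look like the following (the directory names here are placeholders for your own site's structure):
User-agent: *
Disallow: /admin/          # nonpublic section of the site
Disallow: /cgi-bin/        # scripts and utilities
Disallow: /catalog/print/  # duplicate "print" versions of pages

Sitemap: http://www.yourdomain.com/sitemap.xml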
The robots.txt file must
reside in the root directory, and the filename must be entirely in
lowercase (robots.txt, not
Robots.txt or other variations that include
uppercase letters). Any other name or location will not be seen as
valid by the search engines. The file must also be entirely in text
format (not in HTML format).
When you use robots.txt to tell a search engine robot not to access a
page, the crawler simply will not fetch that page. Figure 2 illustrates what happens when the
search engine robot sees a direction in robots.txt not to crawl a web page.
In essence, the page will not be crawled, so links on the page
cannot pass link juice to other pages since the search engine does not
see the links. However, the page can be in the search engine index.
This can happen if other pages on the Web link to the page. Of course,
the search engine will not have very much information on the page
since it cannot read it, and will rely mainly on the anchor text and
other signals from the pages linking to it to determine what the page
may be about. Any resulting search listings end up being pretty sparse
when you see them in the Google index, as shown in Figure 3.
Figure 3
shows the results for the Google query
site:news.yahoo.com/topics/ inurl:page. This is
not a normal query that a user would enter, but you can see what the
results look like. Only the URL is listed, and there is no
description. This is because the spiders aren’t permitted to read the
page to get that data. In today’s algorithms, these types of pages
don’t rank very high because their relevance scores tend to be quite
low for any normal queries.
Google, Yahoo!, Bing, Ask, and nearly all of the legitimate
crawlers on the Web will follow the instructions you set out in the
robots.txt file. Commands in
robots.txt are primarily used to
prevent spiders from accessing pages and subfolders on a site, though
they have other options as well. Note that subdomains require their
own robots.txt files, as do files
that reside on an https: server.
3.1.1. Syntax of the robots.txt file
The basic syntax of robots.txt is fairly simple. You specify
a robot name, such as “googlebot”, and then you specify an action.
The robot is identified by user agent, and then the actions are
specified on the lines that follow. Here are the major actions you
can specify:
Disallow: the pages you
want to block the bots from accessing (as many disallow lines as
needed)
Noindex: the pages you
want a search engine to block and not index
(or de-index if previously indexed); this is unofficially
supported by Google and unsupported by Yahoo! and Bing
Some other restrictions apply:
Each User-Agent/Disallow group should be separated by a
blank line; however, no blank lines should exist within a group
(between the User-Agent line and the last Disallow).
The hash symbol (#) may be used for comments within a
robots.txt file, where
everything after # on that line will be ignored. This may be
used either for whole lines or for the end of lines.
Directories and filenames are case-sensitive: “private”,
“Private”, and “PRIVATE” are all treated as different paths by search
engines.
Here is an example of a robots.txt file:
User-agent: Googlebot
Disallow:
User-agent: msnbot
Disallow: /
# Block all robots from tmp and logs directories
User-agent: *
Disallow: /tmp/
Disallow: /logs # for directories and files called logs
The preceding example will do the following:
Allow “Googlebot” to go anywhere.
Prevent “msnbot” from crawling any part of the
site.
Block all robots (other than Googlebot) from visiting the
/tmp/ directory or
directories or files called /logs (e.g., /logs or logs.php).
Notice that the behavior of Googlebot is not affected by
instructions such as Disallow: /.
Since Googlebot has its own set of instructions in robots.txt, it will ignore directives
labeled as being for all robots (i.e., those that use an asterisk).
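If you wanted Googlebot to observe the same restrictions on /tmp/ and /logs, you would need to repeat those lines within its own group, for example:
User-agent: Googlebot
Disallow: /tmp/
Disallow: /logs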
One common problem that novice webmasters run into occurs when
they have SSL installed so that their pages may be served via HTTP
and HTTPS. A robots.txt file at
http://www.yourdomain.com/robots.txt will not
be interpreted by search engines as guiding their crawl behavior on
https://www.yourdomain.com. To control crawling of your HTTPS pages, you
need to create an additional robots.txt file at
https://www.yourdomain.com/robots.txt. So, if
you want to allow crawling of all pages served from your HTTP server
and prevent crawling of all pages from your HTTPS server, you would
need to implement the following:
For HTTP:
User-agent: *
Disallow:
For HTTPS:
User-agent: *
Disallow: /
These are the most basic aspects of robots.txt files, but there are more
advanced techniques as well. Some of these methods are supported by
only some of the engines, as detailed in the list that
follows:
Crawl delay
Crawl delay is supported by Yahoo!, Bing, and Ask. It
instructs a crawler to wait the specified number of seconds
between crawling pages. The goal with the directive is to
reduce the load on the publisher’s server:
User-agent: msnbot
Crawl-delay: 5
Pattern matching
Pattern matching appears to be usable by Google, Yahoo!,
and Bing. The value of pattern matching is considerable. You
can do some basic pattern matching using the asterisk wildcard
character. Here is how you can use pattern matching to block
access to all subdirectories that begin with
private (e.g., /private1/, /private2/, /private3/, etc.):
User-agent: Googlebot
Disallow: /private*/
You can match the end of the string using the dollar
sign ($). For example, to block URLs that end with
.asp:
User-agent: Googlebot
Disallow: /*.asp$
You may wish to prevent the robots from accessing any
URLs that contain parameters in them. To block access to all
URLs that include a question mark (?), simply use the question
mark:
User-agent: *
Disallow: /*?*
The pattern-matching capabilities of robots.txt are more limited than
those of programming languages such as Perl, so the question
mark does not have any special meaning and can be treated like
any other character.
Allow directive
The Allow directive
appears to be supported only by Google, Yahoo!, and Ask. It
works the opposite of the Disallow directive and provides the
ability to specifically call out directories or pages that may
be crawled. When this is implemented it can partially override
a previous Disallow
directive. This may be beneficial after large sections of the
site have been disallowed, or if the entire site itself has
been disallowed.
Here is an example that allows Googlebot into only the
google directory:
User-agent: Googlebot
Disallow: /
Allow: /google/
Noindex directive
This directive works in the same way as the meta
robots
noindex command (which we will
discuss shortly) and tells the search engines to explicitly
exclude a page from the index. Since a Disallow directive prevents crawling
but not indexing, this can be a very useful feature to ensure
that the pages don’t show in search results. However, as of
October 2009, only Google supports this directive in robots.txt.
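For example, the following lines (the /internal-search/ path is only a placeholder) combine the two directives so that Google would neither crawl nor index that section of the site:
User-agent: Googlebot
Disallow: /internal-search/
Noindex: /internal-search/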
Sitemaps
We discussed XML Sitemaps at the beginning of this
chapter. You can use robots.txt to provide an
autodiscovery mechanism for the spider to find the XML Sitemap
file. The search engines can be told to find the file with one
simple line in the robots.txt file:
Sitemap: sitemap_location
The sitemap_location
should be the complete URL to the Sitemap, such as
http://www.yourdomain.com/sitemap.xml.
You can place this anywhere in your file.
For full instructions on how to apply robots.txt, see Robotstxt.org (http://www.robotstxt.org). You
may also find it valuable to use Dave Naylor’s robots.txt generation
tool to save time and heartache (http://www.davidnaylor.co.uk/the-robotstxt-builder-a-new-tool.html).
You should use great care when making changes to robots.txt. A simple typing error can,
for example, suddenly tell the search engines to no longer crawl any
part of your site. After updating your robots.txt file, it is always a good idea
to check it with the Test robots.txt tool in Google Webmaster Tools.
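You can also sanity-check your rules programmatically. The short Python sketch below uses the standard library's robotparser module to test whether a few representative URLs may be fetched under the live file; the domain and paths are placeholders, and note that robotparser understands only the core Disallow/Allow syntax, not every extension described earlier:
from urllib import robotparser  # Python 3 standard library

rp = robotparser.RobotFileParser()
rp.set_url("http://www.yourdomain.com/robots.txt")  # placeholder domain
rp.read()  # fetch and parse the live robots.txt file

# Spot-check a few URLs against the rules that apply to all robots ("*")
for path in ("/", "/tmp/report.html", "/logs"):
    allowed = rp.can_fetch("*", "http://www.yourdomain.com" + path)
    print(path, "allowed" if allowed else "blocked")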