3.4. The canonical tag
In February 2009, Google, Yahoo!, and Microsoft announced a new
tag known as the canonical tag.
This tag was a new construct designed explicitly for identifying and
dealing with duplicate content. Implementation is simple and looks
like this:
<link rel="canonical" href="http://www.seomoz.org/blog" />
This tag is meant to tell Yahoo!, Bing, and Google that the page
in question should be treated as though it were a copy of the URL
http://www.seomoz.org/blog and that all of the
link and content metrics the engines apply should technically flow
back to that URL (see Figure 8).
The canonical URL tag
is similar in many ways to a 301 redirect from an SEO
perspective. In essence, you’re telling the engines that multiple
pages should be considered as one (which a 301 does), without actually
redirecting visitors to the new URL (often saving your development
staff trouble). There are some differences, though:
Whereas a 301 redirect points all traffic (bots and human
visitors), the canonical URL
tag is just for engines, meaning you can still separately track
visitors to the unique URL versions.
A 301 is a much stronger signal that multiple pages have a
single, canonical source. Although the engines are certainly
planning to support this new tag and trust the intent of site
owners, there will be limitations. Content analysis and other
algorithmic metrics will be applied to ensure that a site owner
hasn’t mistakenly or manipulatively applied the tag, and you can
certainly expect to see mistaken use of the canonical tag, resulting in the engines
maintaining those separate URLs in their indexes.
301s carry cross-domain functionality, meaning you can
redirect a page at Domain1.com to Domain2.com and carry over those
search engine metrics. This is not the case
with the canonical URL tag,
which operates exclusively on a single root domain (it will carry
over across subfolders and subdomains).
We will discuss some applications for this tag later in this
chapter. In general practice, the best solution is to resolve the
duplicate content problems at their core, and eliminate them if you
can. This is because the canonical
tag is not guaranteed to work. However, it is not always possible to
resolve the issues by other means, and the canonical tag provides a very effective
backup plan.
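To make the mechanics concrete, here is a minimal sketch of generating the tag server-side so that query-string variants of a page all declare the same clean URL as canonical; the canonical_link helper and the example parameters are illustrative assumptions, not part of the engines' specification.

from urllib.parse import urlsplit

CANONICAL_BASE = "http://www.seomoz.org"  # assumed canonical host for this sketch

def canonical_link(requested_url: str) -> str:
    """Return the canonical link element for any variant of a URL."""
    clean_path = urlsplit(requested_url).path  # drops ?sessionid=..., tracking parameters, etc.
    return f'<link rel="canonical" href="{CANONICAL_BASE}{clean_path}" />'

print(canonical_link("/blog?sessionid=123&print=1"))
# <link rel="canonical" href="http://www.seomoz.org/blog" />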
3.5. Blocking and cloaking by IP address range
You can block particular bots by restricting entire IP addresses or
ranges at the server level. Most of the
major engines crawl from a limited number of IP ranges, making it
possible to identify them and restrict access. This technique is,
ironically, popular with webmasters who mistakenly assume that search
engine spiders are spammers attempting to steal their content, and
thus block the IP ranges to restrict access and save bandwidth. Use
caution when blocking bots, and make sure you’re not restricting
access to a spider that could bring benefits, either from search
traffic or from link attribution.
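As an illustration of what such a server-side restriction might look like, here is a minimal sketch in Python; the is_blocked helper and the CIDR ranges are placeholders invented for the example, not real crawler or spammer ranges.

from ipaddress import ip_address, ip_network

# Hypothetical ranges to deny; real ranges would come from your own logs
# or the engines' published crawler documentation.
BLOCKED_RANGES = [
    ip_network("203.0.113.0/24"),
    ip_network("198.51.100.0/24"),
]

def is_blocked(remote_ip: str) -> bool:
    """Return True if the requesting IP falls inside a blocked range."""
    addr = ip_address(remote_ip)
    return any(addr in net for net in BLOCKED_RANGES)

# A request handler could then refuse blocked visitors:
if is_blocked("203.0.113.45"):
    print("403 Forbidden")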
3.6. Blocking and cloaking by user agent
At the server level, it is possible to detect user agents and
restrict their access to pages or websites based on their declaration
of identity. For example, if your site detects what looks like a rogue
bot, you might double-check its identity before allowing access. The search
engines all use a similar protocol to verify their user agents via the
Web: a reverse DNS lookup followed by a corresponding forward DNS→IP
lookup. An example for Google would look like this:
> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
A reverse DNS lookup by itself may be insufficient, because a
spoofer could configure its own reverse DNS to point to xyz.googlebot.com or any other
hostname; the follow-up forward lookup confirms that the name actually
resolves back to the requesting IP address.
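The same two-step check can be scripted. The following is a minimal sketch using only Python's standard library; the verify_crawler function and the .googlebot.com suffix test are our own illustration (each engine documents its own crawler hostnames), with the Googlebot IP taken from the example above.

import socket

def verify_crawler(ip: str, expected_suffix: str = ".googlebot.com") -> bool:
    """Reverse-resolve the IP, check the hostname, then confirm it maps back."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)          # reverse DNS
    except socket.herror:
        return False
    if not hostname.endswith(expected_suffix):
        return False
    try:
        _name, _aliases, forward_ips = socket.gethostbyname_ex(hostname)   # forward DNS
    except socket.gaierror:
        return False
    return ip in forward_ips  # the hostname must resolve back to the same address

print(verify_crawler("66.249.66.1"))  # the Googlebot address from the example above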
3.7. Using iframes
Sometimes there’s a certain piece of content on a web page (or a
persistent piece of content throughout a site) that you’d prefer
search engines didn’t see. As we discussed earlier in this chapter,
clever use of iframes can come in handy, as Figure 9
illustrates.
The concept is simple: by using iframes, you can embed content
from another URL onto any page of your choosing. By then blocking
spider access to the URL the iframe pulls in with robots.txt, you ensure that the search
engines won’t “see” this content on your page. Websites may do this
for many reasons, including avoiding duplicate content problems,
reducing the page size for search engines, or lowering the number of
crawlable links on a page (to help control the flow of link
juice).
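As an illustration, assuming the iframed content lives under a hypothetical /iframe-content/ path on www.example.com, the robots.txt rule and a quick check of it with Python's standard robots.txt parser might look like this:

from urllib.robotparser import RobotFileParser

# Assumed robots.txt rules for the domain that hosts the iframed content.
robots_rules = """
User-agent: *
Disallow: /iframe-content/
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_rules)

# The page itself would embed something like:
#   <iframe src="http://www.example.com/iframe-content/promo.html"></iframe>
print(parser.can_fetch("Googlebot", "http://www.example.com/iframe-content/promo.html"))  # False
print(parser.can_fetch("Googlebot", "http://www.example.com/blog/"))                      # True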
3.8. Hiding text in images
As we discussed previously, the major search engines still have
very little capacity to read text in images (and the processing power
required makes for a severe barrier). Hiding content inside images
isn’t generally advisable, though, as it can be impractical on
alternative devices (mobile devices in particular) and inaccessible to
users of assistive technologies such as screen readers.
3.9. Hiding text in Java applets
As with text in images, the content inside Java applets is not
easily parsed by the search engines, though using them as a tool to
hide text would certainly be a strange choice.
3.10. Forcing form submission
Search engines will not submit HTML forms in an attempt to
access the information retrieved from a search or submission. Thus, if
you keep content behind a forced-form submission and never link to it
externally, your content will remain out of the engines (as Figure 10
demonstrates).
The problem comes when content behind forms earns links outside
your control, as when bloggers, journalists, or researchers decide to
link to the pages in your archives without your knowledge. Thus,
although form submission may keep the engines at bay, make sure that
anything truly sensitive has additional protection (e.g., through
robots.txt or meta
robots).
3.11. Using login/password protection
Password protection of any kind will effectively prevent search
engines from accessing content, as will any form of human-verification
requirement, such as a CAPTCHA (a box that asks you to retype a
letter/number combination to gain access). The
major engines won’t try to guess passwords or bypass these
systems.
3.12. Removing URLs from a search engine’s index
A secondary, post-indexing tactic, URL removal is possible at
most of the major search engines through verification of your site and
the use of the engines’ tools. For example, Yahoo! allows you to
remove URLs through its Site Explorer system (http://help.yahoo.com/l/us/yahoo/search/siteexplorer/delete/siteexplorer-46.html),
and Google offers a similar service (https://www.google.com/webmasters/tools/removals)
through Webmaster Central. Microsoft’s Bing search engine may soon
carry support for this as well.