Developing an SEO-Friendly Website: Content Delivery and Search Spider Control (part 3)

1/4/2011 9:17:36 AM
3.4. The canonical tag

In February 2009, Google, Yahoo!, and Microsoft announced a new tag known as the canonical tag. This tag was a new construct designed explicitly for the purpose of identifying and dealing with duplicate content. Implementation is very simple and looks like this:

<link rel="canonical" href="" />

This tag is meant to tell Yahoo!, Bing, and Google that the page in question should be treated as though it were a copy of the URL specified in the href attribute, and that all of the link and content metrics the engines apply should flow back to that URL (see Figure 8).

Figure 8. How search engines look at the canonical tag

The canonical URL tag attribute is similar in many ways to a 301 redirect from an SEO perspective. In essence, you’re telling the engines that multiple pages should be considered as one (which a 301 does), without actually redirecting visitors to the new URL (often saving your development staff trouble). There are some differences, though:

  • Whereas a 301 redirect points all traffic (bots and human visitors), the canonical URL tag is just for engines, meaning you can still separately track visitors to the unique URL versions.

  • A 301 is a much stronger signal that multiple pages have a single, canonical source. Although the engines are certainly planning to support this new tag and trust the intent of site owners, there will be limitations. Content analysis and other algorithmic metrics will be applied to ensure that a site owner hasn’t mistakenly or manipulatively applied the tag, and you can certainly expect to see mistaken use of the canonical tag, resulting in the engines maintaining those separate URLs in their indexes.

  • 301s carry cross-domain functionality, meaning you can redirect a page on one domain to a page on another and carry over those search engine metrics. This is not the case with the canonical URL tag, which operates exclusively within a single root domain (though it will carry over across subfolders and subdomains).

We will discuss some applications for this tag later in this chapter. In general practice, the best solution is to resolve the duplicate content problems at their core, and eliminate them if you can. This is because the canonical tag is not guaranteed to work. However, it is not always possible to resolve the issues by other means, and the canonical tag provides a very effective backup plan.
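For illustration, a duplicate print-friendly version of a page might declare its canonical counterpart like this (the URLs below are hypothetical, not from the original text):

```html
<!-- Hypothetical: served on http://www.example.com/product?print=1,
     this tells the engines that the plain product page is the
     canonical copy and should receive the consolidated metrics. -->
<link rel="canonical" href="http://www.example.com/product" />
```

A 301 redirect to the same URL would consolidate the metrics too, but it would also move human visitors off the print version; the tag leaves visitors where they are.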

3.5. Blocking and cloaking by IP address range

You can block particular bots through server-side restrictions on specific IP addresses or ranges. Most of the major engines crawl from a limited number of IP ranges, making it possible to identify them and restrict access. This technique is, ironically, popular with webmasters who mistakenly assume that search engine spiders are spammers attempting to steal their content, and who therefore block the engines’ IP ranges to restrict access and save bandwidth. Use caution when blocking bots, and make sure you’re not restricting access to a spider that could bring benefits, either from search traffic or from link attribution.
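As a hedged sketch of what such a restriction can look like on Apache 2.4 (the 192.0.2.0/24 range below is a documentation-only placeholder, not a real crawler’s range; substitute a range you have actually verified):

```apache
# Hypothetical .htaccess fragment: allow everyone except one
# verified-unwanted IP range (192.0.2.0/24 is a placeholder).
<RequireAll>
    Require all granted
    Require not ip 192.0.2.0/24
</RequireAll>
```

Because engine IP ranges change over time, rules like this need periodic review or you risk silently blocking a legitimate crawler.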

3.6. Blocking and cloaking by user agent

At the server level, it is possible to detect user agents and restrict their access to pages or websites based on their declared identity. As an example, if a website detects a rogue bot, you might want to verify its identity before allowing access. The search engines all support a similar protocol for verifying their crawlers: a reverse DNS lookup on the visiting IP address, followed by a corresponding forward DNS→IP lookup on the returned hostname. An example for Google would look like this:

> host [crawler IP address]
[reverse record] domain name pointer [crawler hostname]

> host [crawler hostname]
[crawler hostname] has address [crawler IP address]

A reverse DNS lookup by itself may be insufficient, because a spoofer could set up reverse DNS to point to a hostname on any domain of its choosing; only the matching forward lookup confirms that the hostname really maps back to the visiting IP.
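The two lookups can be scripted. The sketch below (a minimal Python illustration; the hostname suffixes are assumptions to re-check against each engine’s current documentation) accepts a visitor only if the reverse-resolved hostname belongs to a known engine domain and the forward lookup maps back to the same IP:

```python
import socket

# Hostname suffixes the major engines have published for their crawlers.
# Treat this tuple as an assumption; confirm it against current docs.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def hostname_is_trusted(hostname: str) -> bool:
    """Pure check: does the reverse-DNS hostname end in a known suffix?"""
    return hostname.rstrip(".").endswith(TRUSTED_SUFFIXES)

def is_verified_crawler(ip: str) -> bool:
    """Reverse-resolve the IP, vet the hostname, then forward-resolve the
    hostname and require that it maps back to the same IP. This defeats
    spoofed reverse-DNS records, which the forward lookup cannot confirm."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse: IP -> name
        if not hostname_is_trusted(hostname):
            return False
        _, _, addresses = socket.gethostbyname_ex(hostname)  # forward: name -> IPs
        return ip in addresses
    except OSError:
        return False  # lookup failed; refuse to trust the visitor
```

A server would call `is_verified_crawler()` with the connecting IP before granting a self-declared "Googlebot" user agent any special treatment.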

3.7. Using iframes

Sometimes there’s a certain piece of content on a web page (or a persistent piece of content throughout a site) that you’d prefer search engines didn’t see. As we discussed earlier in this chapter, clever use of iframes can come in handy, as Figure 9 illustrates.

Figure 9. Using iframes to prevent indexing of content

The concept is simple: by using iframes, you can embed content from another URL onto any page of your choosing. By then blocking spider access to the iframe with robots.txt, you ensure that the search engines won’t “see” this content on your page. Websites may do this for many reasons, including avoiding duplicate content problems, reducing the page size for search engines, or lowering the number of crawlable links on a page (to help control the flow of link juice).
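A minimal sketch of the pattern (the paths here are hypothetical): the page embeds content from a blocked directory, and robots.txt keeps spiders out of that directory, so the framed content never counts as part of the host page:

```html
<!-- On the host page: pull in the content you don't want indexed. -->
<iframe src="/no-crawl/duplicate-description.html" width="600" height="300">
</iframe>

<!-- In robots.txt (a separate file at the site root):
User-agent: *
Disallow: /no-crawl/
-->
```

Human visitors see the framed content as part of the page; spiders that obey robots.txt never fetch the iframe’s source URL.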

3.8. Hiding text in images

As we discussed previously, the major search engines still have very little capacity to read text in images (the processing power required remains a severe barrier). Hiding content inside images isn’t generally advisable, though, as it can be impractical on alternative devices (mobile, in particular) and inaccessible to assistive technologies such as screen readers.

3.9. Hiding text in Java applets

As with text in images, the content inside Java applets is not easily parsed by the search engines, though using them as a tool to hide text would certainly be a strange choice.

3.10. Forcing form submission

Search engines will not submit HTML forms in an attempt to access the information behind them. Thus, if you keep content behind a forced form submission and never link to it externally, your content will remain out of the engines’ indexes (as Figure 10 demonstrates).

Figure 10. Use of forms, which are unreadable by crawlers

The problem comes when content behind forms earns links outside your control, as when bloggers, journalists, or researchers decide to link to the pages in your archives without your knowledge. Thus, although form submission may keep the engines at bay, make sure that anything truly sensitive has additional protection (e.g., through robots.txt or meta robots).

3.11. Using login/password protection

Password protection of any kind will effectively prevent any search engines from accessing content, as will any form of human-verification requirements, such as CAPTCHAs (the boxes that request the copying of letter/number combinations to gain access). The major engines won’t try to guess passwords or bypass these systems.

3.12. Removing URLs from a search engine’s index

A secondary, post-indexing tactic, URL removal is possible at most of the major search engines through verification of your site and the use of the engines’ tools. For example, Yahoo! allows you to remove URLs through its Site Explorer system, and Google offers a similar service through Webmaster Central. Microsoft’s Bing search engine may soon carry support for this as well.