Developing an SEO-Friendly Website: Redirects

1/6/2011 9:06:20 AM

A redirect is used to indicate when content has moved from one location to another. For example, you may have some content at http://www.yourdomain.com/old.html and decide to restructure your site. As a result of this move, your content may move to http://www.yourdomain.com/critical-keyword.html.

Once a redirect is implemented, users who go to the old versions of your pages (perhaps via a bookmark they kept for the page) will be sent to the new versions of those pages. Without the redirect, the user would get a Page Not Found (404) error. With the redirect, the web server tells the incoming user agent (whether a browser or a spider) to instead fetch the requested content from the new URL.

1. Why and When to Redirect

Redirects are also important for letting search engines know when you have moved content. If you move content without a redirect, the search engines will continue to have the old URL in their index and return it in their search results. The solution to this is to implement a redirect. Here are some scenarios in which you may end up needing to implement redirects:

You have old content that expires, so you remove it.
You find that you have broken URLs that have links and traffic.
You change your hosting company.
You change your CMS.
You want to implement a canonical redirect (redirect all pages on http://yourdomain.com to http://www.yourdomain.com).
You change the URLs where your existing content can be found for any reason.

Not all of these scenarios require a redirect. For example, you can change hosting companies without impacting any of the URLs used to find content on your site, in which case no redirect is required. However, any scenario in which any of your URLs change is a scenario in which you need to implement redirects.

2. Good and Bad Redirects

It turns out that there are many ways to perform a redirect. Not all are created equal. The basic reason for this is that there are two major types of redirects that can be implemented, tied specifically to the HTTP status code returned by the web server to the browser. These are:

“301 moved permanently”: This status code tells the browser (or search engine crawler) that the resource has been permanently moved to another location, and there is no intent to ever bring it back.
“302 moved temporarily”: This status code tells the browser (or search engine crawler) that the resource has been temporarily moved to another location, and that the move should not be treated as permanent.

Both forms of redirect send a human or a search engine crawler to the new location, but the search engines interpret these two HTTP status codes in very different ways. When a crawler sees a 301 HTTP status code, it assumes it should pass the historical link juice (and any other metrics) from the old page to the new one. When a search engine crawler sees a 302 HTTP status code, it assumes it should not pass the historical link juice from the old page to the new one. In addition, the 301 redirect will lead the search engine to remove the old page from the index and replace it with the new one.

The preservation of historical link juice is critical in the world of SEO. For example, imagine you had 1,000 links to http://www.yourolddomain.com and you decided to relocate everything to http://www.yournewdomain.com. If you used redirects that returned a 302 status code, you would be starting your link-building efforts from scratch again. In addition, the old version of the page may remain in the index and compete for search rankings in the search engines.

It should also be noted that there can be redirects that pass no status code, or the wrong status code, such as a 404 error (i.e., page not found) or a 200 OK (page loaded successfully). These are also problematic, and should be avoided. Whenever you permanently move a page to a new location, make sure the redirect definitively returns a 301 HTTP status code.

3. Methods for URL Redirecting and Rewriting

There are many possible ways to implement redirects. On Apache web servers (normally present on machines running Unix or Linux as the operating system), it is possible to implement redirects quite simply in a standard file called .htaccess using the Redirect and RedirectMatch directives (learn more about this file format at http://httpd.apache.org/docs/2.2/howto/htaccess.html). More advanced directives known as rewrite rules can be employed as well using the Apache module known as mod_rewrite, which we will discuss in a moment.

On web servers running Microsoft IIS (http://www.iis.net/), different methods are provided for implementing redirects. The basic method for doing these redirects is through the IIS console (you can read more about this at http://www.mcanerin.com/EN/articles/301-redirect-IIS.asp). People with IIS servers can also make use of a text file with directives provided they use an ISAPI plug-in such as ISAPI_Rewrite (http://www.isapirewrite.com/), and this scripting language offers capabilities similar to those of Apache’s mod_rewrite module.

Many programmers use other techniques for implementing redirects. This can be done directly in programming languages such as Perl, PHP, ASP, and JavaScript. The key thing that the programmer must do, if he implements redirects in this fashion, is to make sure the HTTP status code returned by the web server is a 301. You can check the header returned with the Firefox plug-in Live HTTP Headers.
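
The same check can also be scripted. The following Python sketch is only an illustration: the handler class, the helper names, and the use of a throwaway local server are assumptions made for this example, not part of any particular site setup. It answers requests for /old.html with a 301 and then fetches that path without following the redirect, so the status code and Location header can be inspected directly.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical destination, taken from the example URLs used in this article.
NEW_URL = "http://www.yourdomain.com/critical-keyword.html"


class RedirectHandler(BaseHTTPRequestHandler):
    """Answers /old.html with a permanent redirect and everything else with a 404."""

    def do_GET(self):
        if self.path == "/old.html":
            self.send_response(301)  # explicit permanent redirect
            self.send_header("Location", NEW_URL)
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_status(host, port, path):
    """Request a path and return (status, Location) without following redirects."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def demo():
    """Start the local server on a free port, check /old.html, then shut down."""
    server = HTTPServer(("127.0.0.1", 0), RedirectHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return fetch_status("127.0.0.1", server.server_port, "/old.html")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

Pointing fetch_status at a real host and port instead of the local demo server would report whatever status code that server actually returns for the old URL.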

Another method that you can use to implement a redirect occurs at the page level, via the meta refresh tag, which looks something like this:

<meta http-equiv="refresh" content="5;url=http://www.yourdomain.com/newlocation.htm" />

The first parameter in the content section in the preceding statement, the number 5, indicates the number of seconds the browser should wait before redirecting the user to the indicated page. This gets used in scenarios where the publisher wants to display a page letting the user know that he is going to get redirected to a different page than the one he requested.

The problem is that most meta refreshes are treated as though they are a 302 redirect. The sole exception to this is if you specify a redirect delay of 0 seconds. You will have to give up your helpful page telling the user that you are moving him, but the search engines appear to treat this as though it were a 301 redirect (to be safe, the best practice is simply to use a 301 redirect if at all possible).

3.1. Mod_rewrite and ISAPI_Rewrite for URL rewriting and redirecting

There is much more to discuss on this topic than we can reasonably address in this book. The following description is intended only as an introduction to help orient more technical readers, including web developers and site webmasters, on how rewrites and redirects function. To skip this technical discussion, proceed to Section 4, “Redirecting a Home Page Index File Without Looping.”

Mod_rewrite for Apache and ISAPI_Rewrite for Microsoft IIS Server offer very powerful ways to rewrite your URLs. Here are some reasons for using these powerful tools:

You have changed your URL structure on your site so that content has moved from one location to another. This can happen when you change your CMS, or change your site organization for any reason.
You want to map your search-engine-unfriendly URLs into friendlier ones.

If you are running Apache as your web server, you would place directives known as rewrite rules within your .htaccess file or your Apache configuration file (e.g., httpd.conf or the site-specific config file in the sites_conf directory). Similarly, if you are running IIS Server, you’d use an ISAPI plug-in such as ISAPI_Rewrite and place rules in an httpd.ini config file.

Note that rules can differ slightly on ISAPI_Rewrite compared to mod_rewrite, and the following discussion focuses on mod_rewrite. Your .htaccess file would start with:

RewriteEngine on
RewriteBase /

You should omit the second line if you’re adding the rewrites to your server config file, since RewriteBase is supported only in .htaccess. We’re using RewriteBase here so that you won’t have to have ^/ at the beginning of all the rules, just ^ (we will discuss regular expressions in a moment).

After this step, the rewrite rules are implemented. Perhaps you want to have requests for product page URLs of the format http://www.yourdomain.com/products/123 display the content found at http://www.yourdomain.com/get_product.php?id=123, without the URL changing in the Location bar of the user’s browser and without you having to recode the get_product.php script. Of course, this doesn’t replace all occurrences of dynamic URLs within the links contained on all the site pages; that’s a separate issue. You can accomplish this first part with a single rewrite rule, like so:

RewriteRule ^products/([0-9]+)/?$ /get_product.php?id=$1 [L]

The preceding example tells the web server that all requests that come into the /products/ directory should be mapped into requests to /get_product.php, using the numeric path segment that follows /products/ as the id parameter for the PHP script.

The ^ signifies the start of the URL following the domain, $ signifies the end of the URL, [0-9] signifies a numerical digit, and the + immediately following it means one or more occurrences of a digit. Similarly, the ? immediately following the / means zero or one occurrence of a slash character. The () puts whatever is wrapped within it into memory. You can then access what’s been stored in memory with $1 (i.e., what is in the first set of parentheses). Not surprisingly, if you included a second set of parentheses in the rule, you’d access that with $2, and so on. The [L] flag saves on server processing by telling the rewrite engine to stop if it matches on that rule. Otherwise, all the remaining rules will be run as well.
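
To see those pieces working together, here is a small Python sketch that applies the same pattern to a few paths. It is an approximation for illustration only: Python’s re module stands in for the rewrite engine, and the function name and the sample paths other than products/123 are assumptions made for this example.

```python
import re

# Python stand-in for: RewriteRule ^products/([0-9]+)/?$ /get_product.php?id=$1 [L]
PRODUCT_RULE = re.compile(r"^products/([0-9]+)/?$")


def rewrite_product_path(path):
    """Return the internal target for a products URL, or None if the rule does not match.

    `path` is the part of the URL after the leading slash, which is what the
    pattern is tested against in an .htaccess file with RewriteBase / set.
    """
    match = PRODUCT_RULE.match(path)
    if match is None:
        return None
    # group(1) plays the role of $1 in the rewrite rule.
    return "/get_product.php?id=" + match.group(1)


print(rewrite_product_path("products/123"))          # /get_product.php?id=123
print(rewrite_product_path("products/123/"))         # /get_product.php?id=123
print(rewrite_product_path("products/abc"))          # None
print(rewrite_product_path("products/123/reviews"))  # None
```

The last two calls return None because [0-9]+ accepts only digits and the $ anchor rejects anything after the optional trailing slash.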

Here’s a slightly more complex example, where requests for the friendly URL http://www.yourdomain.com/4/123.htm are mapped internally to a dynamic URL of the format http://www.yourdomain.com/webapp/wcs/stores/servlet/ProductDisplay?storeId=10001&catalogId=10001&langId=-1&categoryID=4&productID=123:

RewriteRule ^([^/]+)/([^/]+)\.htm$ /webapp/wcs/stores/servlet/ProductDisplay?storeId=10001&catalogId=10001&langId=-1&categoryID=$1&productID=$2 [QSA,L]

The [^/] signifies any character other than a slash. That’s because, within square brackets, ^ is interpreted as not. The [QSA] flag is for when you don’t want the query string dropped (like when you want a tracking parameter preserved).
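
The effect of two capture groups combined with [QSA] can be sketched in Python as follows. This is a simplified model rather than the rewrite engine itself: the function name is hypothetical, and the query-string handling imitates only the basic [QSA] behavior of appending the original query string after the one in the rule.

```python
import re

# Python stand-in for the pattern ^([^/]+)/([^/]+)\.htm$
RULE = re.compile(r"^([^/]+)/([^/]+)\.htm$")

TARGET = (
    "/webapp/wcs/stores/servlet/ProductDisplay"
    "?storeId=10001&catalogId=10001&langId=-1"
    "&categoryID={category}&productID={product}"
)


def rewrite(path, query_string=""):
    """Map a friendly path to the dynamic URL, keeping any original query string."""
    match = RULE.match(path)
    if match is None:
        return None
    # group(1) and group(2) correspond to $1 and $2 in the rewrite rule.
    target = TARGET.format(category=match.group(1), product=match.group(2))
    if query_string:
        # Modeled [QSA] behavior: the original query string is appended.
        target += "&" + query_string
    return target


print(rewrite("4/123.htm"))
print(rewrite("4/123.htm", "source=banner_ad1"))
```

Without the appended query string, a tracking parameter such as source=banner_ad1 would be lost as soon as the rule supplied its own query string.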

To write good rewrite rules you will need to become a master of pattern matching (which is simply another way to describe the use of regular expressions). Here are some of the most important special characters and how the rewrite engine interprets them:

* means 0 or more of the immediately preceding character.

+ means 1 or more of the immediately preceding character.

? means 0 or 1 occurrence of the immediately preceding character.

^ means the beginning of the string.

$ means the end of the string.

. means any character (i.e., it acts as a wildcard).

\ “escapes” the character that follows; for example, \. means the dot is not meant to be a wildcard, but an actual character.

^ inside [] brackets means not; for example, [^/] means not slash.

It is incredibly easy to make errors in regular expressions. Some of the common gotchas that lead to unintentional substring matches include:

Using .* when you should be using .+ since .* can match on nothing.
Not “escaping” with a backslash a special character that you don’t want interpreted, as when you specify . instead of \. and you really meant the dot character rather than any character (thus, default.htm would also match on default-htm, whereas default\.htm would match only on default.htm).
Omitting ^ or $ on the assumption that the start or end is implied (thus, default\.htm would match on mydefault.html whereas ^default\.htm$ would match only on default.htm).
Using “greedy” expressions that will match on all occurrences rather than stopping at the first occurrence.
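
The first three gotchas are easy to reproduce. The Python sketch below uses the re module as a stand-in for the rewrite engine; the helper name and the products/ sample string are assumptions made for this illustration.

```python
import re


def matches(pattern, text):
    """True if the pattern matches anywhere in the text (anchored only if the pattern says so)."""
    return re.search(pattern, text) is not None


# Unescaped dot: "." accepts any single character, not just a literal dot.
print(matches(r"^default.htm$", "default-htm"))   # True
print(matches(r"^default\.htm$", "default-htm"))  # False
print(matches(r"^default\.htm$", "default.htm"))  # True

# Missing anchors: the pattern may match in the middle of a longer string.
print(matches(r"default\.htm", "mydefault.html"))    # True
print(matches(r"^default\.htm$", "mydefault.html"))  # False

# ".*" can match nothing at all, while ".+" needs at least one character.
print(matches(r"^products/.*$", "products/"))  # True
print(matches(r"^products/.+$", "products/"))  # False
```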

The easiest way to illustrate what we mean by greedy is to provide an example:

RewriteRule ^(.*)/?index\.html$ /$1/ [L,R=301]

This will redirect requests for http://www.yourdomain.com/blah/index.html to http://www.yourdomain.com/blah//. This is probably not what was intended. Why did this happen? Because .* will capture the slash character within it before the /? gets to see it. Thankfully, there’s an easy fix. Simply use a negated character class such as [^/]* or the non-greedy .*? instead of .* to do your matching. For example, use ^(.*?)/? instead of ^(.*)/? or [^/]+/[^/]+ instead of .*/.*.

So, to correct the preceding rule you could use the following:

RewriteRule ^(.*?)/?index\.html$ /$1/ [L,R=301]

When wouldn’t you use the following?

RewriteRule ^([^/]*)/?index\.html$ /$1/ [L,R=301]

This is more limited because it will match only on URLs with one directory. URLs containing multiple subdirectories, such as http://www.yourdomain.com/store/cheese/swiss/wheel/index.html, would not match.
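
The three variants can be compared side by side with a short Python sketch. Python’s re module treats greedy and non-greedy quantifiers the same way for these patterns, but the helper function and constant names are assumptions made for this illustration.

```python
import re

GREEDY = r"^(.*)/?index\.html$"
NON_GREEDY = r"^(.*?)/?index\.html$"
ONE_DIRECTORY = r"^([^/]*)/?index\.html$"


def redirect_target(pattern, path):
    """Return the /$1/ destination the rule would produce, or None if it does not match."""
    match = re.match(pattern, path)
    if match is None:
        return None
    return "/" + match.group(1) + "/"


deep = "store/cheese/swiss/wheel/index.html"

print(redirect_target(GREEDY, "blah/index.html"))         # /blah//
print(redirect_target(NON_GREEDY, "blah/index.html"))     # /blah/
print(redirect_target(ONE_DIRECTORY, "blah/index.html"))  # /blah/
print(redirect_target(NON_GREEDY, deep))                  # /store/cheese/swiss/wheel/
print(redirect_target(ONE_DIRECTORY, deep))               # None
```

The greedy version keeps the slash inside the captured group, which is where the doubled slash comes from; the one-directory version avoids that but gives up on deeper paths.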

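As you might imagine, testing/debugging is a big part of URL rewriting. When debugging, the RewriteLog and RewriteLogLevel directives are your friends! Set the RewriteLogLevel to 4 or more to start seeing what the rewrite engine is up to when it interprets your rules. Both directives belong in the server configuration rather than in .htaccess, and they exist only through Apache 2.2; Apache 2.4 replaced them with per-module logging through the LogLevel directive (for example, LogLevel alert rewrite:trace4).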

By the way, the [R=301] flag in the last few examples—as you might guess—tells the rewrite engine to do a 301 redirect instead of a standard rewrite.

There’s another handy directive to use in conjunction with RewriteRule, called RewriteCond. You would use RewriteCond if you are trying to match on something in the query string, the domain name, or other things not present between the domain name and the question mark in the URL (which is what RewriteRule looks at).

Note that neither RewriteRule nor RewriteCond can access what is in the anchor part of a URL (that is, whatever follows a #) because that is used internally by the browser and is not sent to the server as part of the request. The following RewriteCond example checks that the hostname does not match the canonical one before it will allow the rewrite rule that follows to be executed:

RewriteCond %{HTTP_HOST} !^www\.yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1 [L,R=301]

Note the exclamation point at the beginning of the regular expression. The rewrite engine interprets that as not.

For any hostname other than www.yourdomain.com, a 301 redirect is issued to the equivalent canonical URL on the www subdomain. The [NC] flag makes the rewrite condition case-insensitive. Where is the [QSA] flag so that the query string is preserved, you might ask? It is not needed when redirecting; it is implied.

If you don’t want a query string retained on a rewrite rule with a redirect, put a question mark at the end of the destination URL in the rule, like so:

RewriteCond %{HTTP_HOST} !^www\.yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1? [L,R=301]

Why not use ^yourdomain\.com$ instead? Consider:

RewriteCond %{HTTP_HOST} ^yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1? [L,R=301]

That would not have matched on typo domains, such as “yourdoamin.com”, that the DNS server and virtual host would be set to respond to (assuming that misspelling was a domain you registered and owned).
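
The difference between the negated test and the positive one can be sketched in Python. This is a simplified model with hypothetical function and parameter names; it mirrors the negated, case-insensitive hostname condition and the optional dropping of the query string.

```python
import re

CANONICAL_HOST = "www.yourdomain.com"

# Same test as: RewriteCond %{HTTP_HOST} !^www\.yourdomain\.com$ [NC]
CANONICAL_RE = re.compile(r"^www\.yourdomain\.com$", re.IGNORECASE)


def canonical_redirect(host, path, query_string="", keep_query=True):
    """Return the Location for a 301 redirect, or None when the host is already canonical."""
    if CANONICAL_RE.match(host):
        return None  # the negated condition fails, so the rule is skipped
    location = "http://" + CANONICAL_HOST + "/" + path
    if keep_query and query_string:
        location += "?" + query_string
    return location


print(canonical_redirect("yourdomain.com", "about.html"))
print(canonical_redirect("yourdoamin.com", "about.html"))
print(canonical_redirect("WWW.YourDomain.com", "about.html"))
print(canonical_redirect("yourdomain.com", "about.html", "source=banner_ad1", keep_query=False))
```

Because the condition is negated, the typo domain is redirected just like the bare domain. A positive test on ^yourdomain\.com$ would have let it through untouched.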

Under what circumstances might you want to omit the query string from the redirected URL, as we did in the preceding two examples? When a session ID or a tracking parameter (such as source=banner_ad1) needs to be dropped. Retaining a tracking parameter after the redirect is not only unnecessary (because the original URL with the tracking code appended would have been recorded in your access logfiles as it was being accessed); it is also undesirable from a canonicalization standpoint. What if you wanted to drop the tracking parameter from the redirected URL, but retain the other parameters in the query string? Here’s how you’d do it for static URLs:

RewriteCond %{QUERY_STRING} ^source=[a-z0-9_]*$
RewriteRule ^(.*)$ /$1? [L,R=301]

And for dynamic URLs:

RewriteCond %{QUERY_STRING} ^(.+)&source=[a-z0-9_]+(&?.*)$
RewriteRule ^(.*)$ /$1?%1%2 [L,R=301]
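
The two conditions can be tried out against sample query strings with the Python sketch below. It is an illustration under stated assumptions: the function name is hypothetical, and the character class here also admits an underscore so that the value banner_ad1 from the text is matched in full.

```python
import re

# Query string consisting only of the tracking parameter (the static URL case).
ONLY_SOURCE = re.compile(r"^source=[a-z0-9_]*$")

# Tracking parameter preceded by other parameters (the dynamic URL case).
SOURCE_AFTER_OTHERS = re.compile(r"^(.+)&source=[a-z0-9_]+(&?.*)$")


def strip_source(query_string):
    """Return the query string without the source parameter, or None if no redirect is needed."""
    if ONLY_SOURCE.match(query_string):
        return ""
    match = SOURCE_AFTER_OTHERS.match(query_string)
    if match:
        # group(1) and group(2) correspond to %1 and %2 in the rewrite rule.
        return match.group(1) + match.group(2)
    return None


print(strip_source("source=banner_ad1"))                   # (empty string)
print(strip_source("id=123&source=banner_ad1"))            # id=123
print(strip_source("id=123&source=banner_ad1&color=red"))  # id=123&color=red
print(strip_source("id=123"))                              # None
print(strip_source("source=banner_ad1&id=123"))            # None
```

The last call shows a limitation of this pair of patterns: a source parameter that comes first and is followed by other parameters is not matched by either condition.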

Need to do some fancy stuff with cookies before redirecting the user? Invoke a script that cookies the user and then 301s him to the canonical URL:

RewriteCond %{QUERY_STRING} ^source=([a-z0-9_]*)$
RewriteRule ^(.*)$ /cookiefirst.php?source=%1&dest=$1 [L]

Note the lack of a [R=301] flag in the preceding code. That’s on purpose. There’s no need to expose this script to the user. Use a rewrite and let the script itself send the 301 after it has done its work.

Other canonicalization issues worth correcting with rewrite rules and the [R=301] flag include when the engines index online catalog pages under HTTPS URLs, and URLs missing a trailing slash that should be there. First, the HTTPS fix:

# redirect online catalog pages in the /catalog/ directory if HTTPS
RewriteCond %{HTTPS} on
RewriteRule ^catalog/(.*) http://www.yourdomain.com/catalog/$1 [L,R=301]

Note that if your secure server is separate from your main server, you can skip the RewriteCond line.

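Now to append the trailing slash (as written, this rule applies to any URL that does not already end in a slash, so restrict the pattern if your site also serves file URLs such as page.html):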

RewriteRule ^(.*[^/])$ /$1/ [L,R=301]

After completing a URL rewriting project to migrate from dynamic URLs to static, you’ll want to phase out the dynamic URLs not just by replacing all occurrences of the legacy URLs on your site, but also by 301-redirecting the legacy dynamic URLs to their static equivalents. That way, any inbound links pointing to the retired URLs will end up leading both spiders and humans to the correct new URL—thus ensuring that the new URLs are the ones that are indexed, blogged about, linked to, and bookmarked, and the old URLs will be removed from the index as well. Generally, here’s how you’d accomplish that:

RewriteCond %{QUERY_STRING} id=([0-9]+)
RewriteRule ^get_product\.php$ /products/%1? [L,R=301]

However, you’ll get an infinite loop of recursive redirects if you’re not careful. One quick-and-dirty way to avoid that situation is to add a nonsense parameter to the destination URL for the rewrite and ensure that this nonsense parameter isn’t present before doing the redirect. Specifically:

RewriteCond %{QUERY_STRING} id=([0-9]+)
RewriteCond %{QUERY_STRING} !blah=blah
RewriteRule ^get_product\.php$ /products/%1? [L,R=301]
RewriteRule ^products/([0-9]+)/?$ /get_product.php?id=$1&blah=blah [L]

Notice that the example used two RewriteCond lines, stacked on top of each other. All rewrite conditions listed together in the same block will be “ANDed” together. If you wanted the conditions to be “ORed”, it would require the use of the [OR] flag.
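
To see why the guard parameter stops the loop, the rule pair can be modeled in a few lines of Python. This is a deliberately simplified simulation, not the rewrite engine: the function names are hypothetical, the friendly URL is modeled as products/123, and it assumes (as happens in .htaccess context) that the rules are applied again to the result of an internal rewrite.

```python
import re


def step(path, query, guarded):
    """Apply the rule pair once and return (action, path, query)."""
    id_match = re.search(r"id=([0-9]+)", query)
    blocked = guarded and "blah=blah" in query
    if path == "get_product.php" and id_match and not blocked:
        # External 301 to the friendly URL; the trailing ? drops the query string.
        return ("redirect", "products/" + id_match.group(1), "")
    product_match = re.match(r"^products/([0-9]+)/?$", path)
    if product_match:
        # Internal rewrite back to the script, optionally tagged with the guard.
        query = "id=" + product_match.group(1) + ("&blah=blah" if guarded else "")
        return ("rewrite", "get_product.php", query)
    return ("serve", path, query)


def count_redirects(path, query, guarded, limit=10):
    """Count 301s issued before content is served; reaching `limit` means a loop."""
    redirects = 0
    while redirects < limit:
        action, path, query = step(path, query, guarded)
        if action == "serve":
            return redirects
        if action == "redirect":
            redirects += 1
    return redirects


print(count_redirects("get_product.php", "id=123", guarded=False))  # 10 (loop, cut off at the limit)
print(count_redirects("get_product.php", "id=123", guarded=True))   # 1
```

With the guard in place, the internally rewritten request carries blah=blah, so the second condition fails and the script is served after a single redirect.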

4. Redirecting a Home Page Index File Without Looping

Many websites link to their own home page in a form similar to http://www.yourdomain.com/index.html. The problem with that is that most incoming links to the site’s home page specify http://www.yourdomain.com, thus dividing the link juice coming into the site between two URLs. Once a publisher realizes this, they will want to fix their internal links and then 301 redirect http://www.yourdomain.com/index.html to http://www.yourdomain.com/, but there will be problems with recursive redirects that develop if this is not done correctly.

When someone comes to your website by typing in http://www.yourdomain.com, the DNS system of the Internet helps the browser locate the web server for your website. How, then, does the web server decide what to show to the browser? It turns out that it does this by loading a file from the hard drive of the web server for your website.

When no file is specified (i.e., as in the preceding example, only the domain name is specified), the web server loads a file that is known as the default file. This is often a file with a name such as index.html, index.htm, index.shtml, index.php, or default.asp.

The filename can actually be anything, but most web servers default to one type of filename or another. Where the problem comes in is that many CMSs will expose both forms of your home page, both http://www.yourdomain.com and http://www.yourdomain.com/index.php.

Perhaps all the pages on the site link only to http://www.yourdomain.com/index.php, but given human nature, most of the links to your home page that third parties give you will most likely point at http://www.yourdomain.com/. This can create a duplicate content problem if the search engine now sees two versions of your home page and thinks they are separate, but duplicate, documents. Google is pretty smart at figuring out this particular issue, but it is best to not rely on that.

Since you learned how to do 301 redirects, you might conclude that the solution is to 301-redirect http://www.yourdomain.com/index.php to http://www.yourdomain.com/. Sounds good, right? Unfortunately, there is a big problem with this.

What happens is the server sees the request for http://www.yourdomain.com/index.php and then sees that it is supposed to 301-redirect that to http://www.yourdomain.com/, so it does. But when it loads http://www.yourdomain.com/ it retrieves the default filename (index.php) and proceeds to load http://www.yourdomain.com/index.php. Then it sees that you want to redirect that to http://www.yourdomain.com/, and it creates an infinite loop.

4.1. The default document redirect solution

The solution that follows is specific to the preceding index.php example. You will need to plug in the appropriate default filename for your own web server.

Copy the contents of index.php to another file. For this example, we’ll be using sitehome.php.
Create an Apache DirectoryIndex directive for your document root. Set it to sitehome.php. Do not set the directive on a serverwide level; otherwise, it may cause problems with other folders that still need to use index.php as a directory index.
Put this in an .htaccess file in your document root: DirectoryIndex sitehome.php. Or, if you aren’t using per-directory context files, put this in your httpd.conf:
```
<Directory /your/document/root/examplesite.com/>
 DirectoryIndex sitehome.php
</Directory>
```
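Clear out the contents of your original index.php file. Insert this line of code, which passes the 301 status explicitly because PHP’s header() otherwise sends a 302 along with a Location header:
```
<?php header("Location: http://www.yourdomain.com/", true, 301); exit; ?>
```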

This sets it up so that index.php is not a directory index file (i.e., the default filename). It forces sitehome.php to be read when someone types in the canonical URL (http://www.yourdomain.com). Any requests to index.php from old links can now be 301-redirected while avoiding an infinite loop.
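
The resulting behavior can be summarized with a toy model in Python. Everything here is a sketch under assumptions: the function names are hypothetical, only the two home page URLs are modeled, and any other path is simply treated as not found.

```python
from urllib.parse import urlparse

DIRECTORY_INDEX = "sitehome.php"
CANONICAL_HOME = "http://www.yourdomain.com/"


def handle_request(path):
    """Return (status, detail) for a request path under the setup described above."""
    if path == "/":
        # The directory index is served internally; the visible URL stays "/".
        return (200, DIRECTORY_INDEX)
    if path == "/index.php":
        # index.php now contains only the redirect back to the canonical URL.
        return (301, CANONICAL_HOME)
    return (404, None)


def follow(path, max_hops=5):
    """Follow redirects the way a browser would; return (status, detail, hops)."""
    hops = 0
    status, detail = handle_request(path)
    while status == 301 and hops < max_hops:
        hops += 1
        path = urlparse(detail).path or "/"
        status, detail = handle_request(path)
    return status, detail, hops


print(follow("/index.php"))  # (200, 'sitehome.php', 1)
print(follow("/"))           # (200, 'sitehome.php', 0)
```

A request for /index.php takes exactly one hop and then stops, because the file served for / is no longer the file that issues the redirect.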

If you are using a CMS, you also need to make sure when you are done with this that all the internal links now go to the canonical URL, http://www.yourdomain.com. If for any reason the CMS started to point to http://www.yourdomain.com/sitehome.php, the original problem of two URLs for the home page would return, forcing you to go through this entire process again.
