A redirect is used to indicate when content has moved from one
location to another. For example, you may have some content at
http://www.yourdomain.com/old.html and decide to
restructure your site. As a result of this move, your content may move to
http://www.yourdomain.com/critical-keyword.html.
Once a redirect is implemented, users who go to the old versions of
your pages (perhaps via a bookmark they kept for the page) will be sent to
the new versions of those pages. Without the redirect, the user would get
a Page Not Found (404) error. With the redirect, the web server tells the
incoming user agent (whether a browser or a spider) to instead fetch the
requested content from the new URL.
1. Why and When to Redirect
Redirects are also important for letting search engines know when
you have moved content. After a move, the search engines will continue
to have the old URL in their index and return it in their search
results. The solution to this is to implement a redirect. Here are some
scenarios in which you may end up needing to implement redirects:
- You have old content that expires, so you remove it.
- You find that you have broken URLs that have links and
traffic.
- You change your hosting company.
- You change your CMS.
- You want to implement a canonical redirect (redirect all pages
on http://yourdomain.com to
http://www.yourdomain.com).
- You change the URLs where your existing content can be found
for any reason.
Not all of these scenarios require a redirect. For example, you
can change hosting companies without impacting any of the URLs used to
find content on your site, in which case no redirect is required.
However, any scenario in which any of your URLs change is a scenario in
which you need to implement redirects.
2. Good and Bad Redirects
It turns out that there are many ways to perform a redirect. Not
all are created equal. The basic reason for this is that there are two
major types of redirects that can be implemented, tied specifically to
the HTTP status code returned by the web server to the browser. These
are:
“301 moved permanently”
This status code tells the browser (or search engine
crawler) that the resource has been permanently moved to another
location, and there is no intent to ever bring it back.
“302 moved temporarily”
This status code tells the browser (or search engine
crawler) that the resource has been temporarily moved to another
location, and that the move should not be treated as
permanent.
Both forms of redirect send a human or a search engine crawler to
the new location, but the search engines interpret these two HTTP status
codes in very different ways. When a crawler sees a 301 HTTP status
code, it assumes it should pass the historical link juice (and any other
metrics) from the old page to the new one. When a search engine crawler
sees a 302 HTTP status code, it assumes it should not pass the
historical link juice from the old page to the new one. In addition, the
301 redirect will lead the search engine to remove the old page from the
index and replace it with the new one.
The preservation of historical link juice is critical in the
world of SEO. For example, imagine you had 1,000 links to
http://www.yourolddomain.com and you decided to
relocate everything to
http://www.yournewdomain.com. If you used redirects
that returned a 302 status code, you would be starting your
link-building efforts from scratch again. In addition, the old version
of the page may remain in the index and compete for search rankings in
the search engines.
It should also be noted that there can be redirects that pass no
status code, or the wrong status code, such as a 404 error (i.e., page
not found) or a 200 OK (page loaded successfully). These are also
problematic, and should be avoided. Whenever you permanently move a
page to a new location, make sure the redirect returns a
301 HTTP status code.
3. Methods for URL Redirecting and Rewriting
There are many possible ways to implement redirects. On Apache web
servers (normally present on machines running Unix or Linux as the
operating system), it is possible to implement redirects quite simply in
a standard file called .htaccess
using the Redirect and RedirectMatch directives (learn more about
this file format at http://httpd.apache.org/docs/2.2/howto/htaccess.html).
More advanced directives known as rewrite rules can
be employed as well using the Apache module known as mod_rewrite, which
we will discuss in a moment.
On web servers running Microsoft IIS (http://www.iis.net/), different methods are provided for
implementing redirects. The basic method for doing these redirects is
through the IIS console (you can read more about this at http://www.mcanerin.com/EN/articles/301-redirect-IIS.asp).
People with IIS servers can also make use of a text file with directives
provided they use an ISAPI plug-in such as ISAPI_Rewrite (http://www.isapirewrite.com/), and this scripting
language offers capabilities similar to those of Apache’s mod_rewrite
module.
Many programmers use other techniques for implementing redirects.
This can be done directly in programming languages such as Perl, PHP,
ASP, and JavaScript. The key thing that the programmer must do, if he
implements redirects in this fashion, is to make sure the HTTP status
code returned by the web server is a 301. You can check the header
returned with the Firefox plug-in Live HTTP
Headers.
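The status code can also be read with a short script instead of a browser plug-in. The following is a minimal sketch in Python using only the standard library. The function names fetch_status and describe_status are illustrative and do not come from any tool mentioned here.

```python
import http.client


def fetch_status(host, path="/", port=80):
    """Send a HEAD request and return (status code, Location header).

    http.client does not follow redirects, so the status returned is the
    one the server sent for this exact URL.
    """
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def describe_status(status):
    """Summarize what a status code means for a URL that has moved."""
    if status == 301:
        return "permanent redirect"
    if status == 302:
        return "temporary redirect"
    if status == 200:
        return "no redirect (page served)"
    if status == 404:
        return "no redirect (page not found)"
    return "other"
```

Calling fetch_status("www.yourdomain.com", "/old.html") against a live server would return the status code together with the redirect target, and describe_status turns that code into a short label.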
Another method that you can use to implement a redirect occurs at
the page level, via the meta refresh
tag, which looks something like this:
<meta http-equiv="refresh" content="5;url=http://www.yourdomain.com/newlocation.htm" />
The first parameter in the content section in the preceding
statement, the number 5, indicates the number of seconds the browser
should wait before redirecting the user to the indicated page. This gets
used in scenarios where the publisher wants to display a page letting
the user know that he is going to get redirected to a different page
than the one he requested.
The problem is that most meta refreshes are treated as though they
are a 302 redirect. The sole exception to this is if you specify a
redirect delay of 0 seconds. You will have to give up your helpful page
telling the user that you are moving him, but the search engines appear
to treat this as though it were a 301 redirect (to be safe, the best
practice is simply to use a 301 redirect if at all possible).
3.1. Mod_rewrite and ISAPI_Rewrite for URL rewriting and redirecting
There is much more to discuss on this topic than we can
reasonably address in this book. The following description is intended
only as an introduction to help orient more technical readers,
including web developers and site webmasters, on how rewrites and
redirects function. To skip this technical discussion, proceed to
Section 6.10.4.
Mod_rewrite for Apache and ISAPI_Rewrite for Microsoft IIS
Server offer very powerful ways to rewrite your URLs. Here are some
reasons for using these powerful tools:
- You have changed your URL structure on your site so that
content has moved from one location to another. This can happen
when you change your CMS, or change your site organization for any
reason.
- You want to map your search-engine-unfriendly URLs into
friendlier ones.
If you are running Apache as your web server, you would place
directives known as rewrite rules within your
.htaccess file or your Apache
configuration file (e.g., httpd.conf or the site-specific config file
in the sites_conf directory).
Similarly, if you are running IIS Server, you’d use an ISAPI plug-in
such as ISAPI_Rewrite and place rules in an httpd.ini config file.
Note that rules can differ slightly on ISAPI_Rewrite compared to
mod_rewrite, and the following discussion focuses on mod_rewrite. Your
.htaccess file would start
with:
RewriteEngine on
RewriteBase /
You should omit the second line if you’re adding the rewrites to
your server config file, since RewriteBase is supported only in .htaccess. We’re using RewriteBase here so that you won’t have to
have ^/ at the beginning of all the
rules, just ^ (we will discuss
regular expressions in a moment).
After this step, the rewrite rules are implemented. Perhaps you
want to have requests for product page URLs of the format
http://www.yourdomain.com/products/123 to display
the content found at
http://www.yourdomain.com/get_product.php?id=123,
without the URL changing in the Location bar of the user’s browser and
without you having to recode the get_product.php script. Of course, this
doesn’t replace all occurrences of dynamic URLs within the links
contained on all the site pages; that’s a separate issue. You can
accomplish this first part with a single rewrite rule, like so:
RewriteRule ^products/([0-9]+)/?$ /get_product.php?id=$1 [L]
The preceding example tells the web server that all requests
that come into the /products/
directory should be mapped into requests to /get_product.php, while using the subfolder
of /products/
as a parameter for the PHP script.
The ^ signifies the start of
the URL following the domain, $
signifies the end of the URL, [0-9]
signifies a numerical digit, and the + immediately following it means one or more
occurrences of a digit. Similarly, the ? immediately following the / means zero or one occurrence of a slash
character. The () puts whatever is
wrapped within it into memory. You
can then access what’s been stored in memory with $1 (i.e., what is in the first set of
parentheses). Not surprisingly, if you included a second set of
parentheses in the rule, you’d access that with $2, and so on. The [L] flag saves on server processing by
telling the rewrite engine to stop if it matches on that rule.
Otherwise, all the remaining rules will be run as well.
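The pattern in this rule can be tried outside the web server. The following is a minimal sketch in Python, whose re module treats these particular constructs the same way. The function name rewrite_product_path is illustrative, and the path is given without its leading slash, as it would be seen in an .htaccess file.

```python
import re

# Same pattern as the rewrite rule above.
PRODUCT_RULE = re.compile(r"^products/([0-9]+)/?$")


def rewrite_product_path(path):
    """Return the internal target for a product URL, or None if no match."""
    match = PRODUCT_RULE.match(path)
    if match is None:
        return None
    # group(1) plays the role of $1 in the rewrite rule.
    return "/get_product.php?id=" + match.group(1)


print(rewrite_product_path("products/123"))   # /get_product.php?id=123
print(rewrite_product_path("products/123/"))  # /get_product.php?id=123
print(rewrite_product_path("products/abc"))   # None
```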
Here’s a slightly more complex example, where requests for URLs of the format
http://www.yourdomain.com/4/123.htm
would be rewritten internally to
http://www.yourdomain.com/webapp/wcs/stores/servlet/ProductDisplay?storeId=10001&catalogId=10001&langId=-1&categoryID=4&productID=123:
RewriteRule ^([^/]+)/([^/]+)\.htm$ /webapp/wcs/stores/servlet/ProductDisplay?storeId=10001&catalogId=10001&langId=-1&categoryID=$1&productID=$2 [QSA,L]
The [^/] signifies any
character other than a slash. That’s because, within square brackets,
^ is interpreted as
not. The [QSA]
flag is for when you don’t want the query string dropped (like when
you want a tracking parameter preserved).
To write good rewrite rules you will need to become a master of
pattern matching (which is simply another way to
describe the use of regular expressions). Here are some of the most
important special characters and how the rewrite engine interprets
them:
- * means 0 or more of the immediately
preceding character.
- + means 1 or more of the immediately
preceding character.
- ? means 0 or 1 occurrence of the
immediately preceding character.
- ^ means the beginning of the
string.
- $ means the end of the string.
- . means any character (i.e., it acts as
a wildcard).
- \ “escapes” the character that follows;
for example, \. means the dot is not meant to
be a wildcard, but an actual character.
- ^ inside []
brackets means not; for example,
[^/] means not
slash.
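A few of these characters can be seen in action with a short script. The following is a minimal sketch in Python, whose re module interprets these characters the same way. The helper name matches is illustrative.

```python
import re


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


# . is a wildcard, so an unescaped dot also matches other characters.
print(matches(r"^default.htm$", "defaultxhtm"))   # True
print(matches(r"^default\.htm$", "defaultxhtm"))  # False

# Without ^ and $ the pattern can match a substring.
print(matches(r"default\.htm", "mydefault.html"))    # True
print(matches(r"^default\.htm$", "mydefault.html"))  # False

# * allows zero occurrences, + requires at least one.
print(matches(r"^a*$", ""))  # True
print(matches(r"^a+$", ""))  # False

# [^/] matches any single character except a slash.
print(matches(r"^[^/]+$", "store"))         # True
print(matches(r"^[^/]+$", "store/cheese"))  # False
```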
It is incredibly easy to make errors in regular expressions.
Some of the common gotchas that lead to unintentional substring
matches include:
- Using .* when you should be using
.+ since .* can match on
nothing.
- Not “escaping” with a backslash a special character that you
don’t want interpreted, as when you specify .
instead of \. and you really meant the dot
character rather than any character (thus, default.htm would match on defaultxhtm, whereas default\.htm would match only on
default.htm).
- Omitting ^ or $ on
the assumption that the start or end is implied (thus, default\.htm would match on mydefault.html whereas ^default\.htm$ would match only on
default.htm).
- Using “greedy” expressions that will match on all
occurrences rather than stopping at the first occurrence.
The easiest way to illustrate what we mean by greedy is to
provide an example:
RewriteRule ^(.*)/?index\.html$ /$1/ [L,R=301]
This will redirect requests for
http://www.yourdomain.com/blah/index.html to
http://www.yourdomain.com/blah//. This is
probably not what was intended. Why did this happen? Because
.* will capture the slash character within it
before the /? gets to see it. Thankfully, there’s
an easy fix. Simply use [^/]* or
.*? instead of .* to do your
matching. For example, use ^(.*?)/? instead of
^(.*)/? or [^/]*/[^/]*
instead of .*/.*.
So, to correct the preceding rule you could use the
following:
RewriteRule ^(.*?)/?index\.html$ /$1/ [L,R=301]
When wouldn’t you use the following?
RewriteRule ^([^/]*)/?index\.html$ /$1/ [L,R=301]
This is more limited because it will match only on URLs with one
directory. URLs containing multiple subdirectories, such as
http://www.yourdomain.com/store/cheese/swiss/wheel/index.html,
would not match.
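The difference between the three patterns can be checked with a short script. The following is a minimal sketch in Python, whose re module handles greedy and non-greedy matching the same way for these patterns. The names PATTERNS and redirect_target are illustrative.

```python
import re

PATTERNS = {
    "greedy": r"^(.*)/?index\.html$",
    "lazy": r"^(.*?)/?index\.html$",
    "no-slash": r"^([^/]*)/?index\.html$",
}


def redirect_target(pattern, path):
    """Apply a pattern and build the /$1/ target, or None if no match."""
    match = re.match(pattern, path)
    if match is None:
        return None
    return "/" + match.group(1) + "/"


for name, pattern in PATTERNS.items():
    print(name,
          redirect_target(pattern, "blah/index.html"),
          redirect_target(pattern, "store/cheese/index.html"))
# greedy /blah// /store/cheese//
# lazy /blah/ /store/cheese/
# no-slash /blah/ None
```

The greedy pattern swallows the slash into the captured group, the non-greedy pattern leaves it for /? to consume, and the negated character class refuses to match paths with more than one directory.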
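As you might imagine, testing and debugging are a big part of URL
rewriting. When debugging, the RewriteLog and RewriteLogLevel directives are your friends!
Set the RewriteLogLevel to 4 or
more to start seeing what the rewrite engine is up to when it
interprets your rules. (These two directives apply to Apache 2.2 and earlier; Apache 2.4 replaced them with the LogLevel directive, as in LogLevel alert rewrite:trace3.)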
By the way, the [R=301] flag
in the last few examples—as you might guess—tells the rewrite engine
to do a 301 redirect instead of a standard rewrite.
There’s another handy directive to use in conjunction with
RewriteRule, called RewriteCond. You would use RewriteCond if you are trying to match on
something in the query string, the domain name, or other things not
present between the domain name and the question mark in the URL
(which is what RewriteRule looks
at).
Note that neither RewriteRule
nor RewriteCond can access what is
in the anchor part of a URL—that is, whatever follows a #—because that
is used internally by the browser and is not sent to the server as
part of the request. The following RewriteCond example checks
the hostname before it will allow the rewrite rule that
follows to be executed:
RewriteCond %{HTTP_HOST} !^www\.yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1 [L,R=301]
Note the exclamation point at the beginning of the regular
expression. The rewrite engine interprets that as
not.
For any hostname other than
http://www.yourdomain.com, a 301 redirect is
issued to the equivalent canonical URL on the www subdomain. The
[NC] flag makes the rewrite
condition case-insensitive. Where is the [QSA] flag so that the query string is
preserved, you might ask? It is not needed when redirecting; it is
implied.
If you don’t want a query string retained on a rewrite rule with
a redirect, put a question mark at the end of the destination URL in
the rule, like so:
RewriteCond %{HTTP_HOST} !^www\.yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1? [L,R=301]
Why not use ^yourdomain\.com$
instead? Consider:
RewriteCond %{HTTP_HOST} ^yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1? [L,R=301]
That would not have matched on typo domains, such as
“yourdoamin.com”, that the DNS server and virtual host would be set to
respond to (assuming that misspelling was a domain you registered and
owned).
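The difference between the negated condition and the positive one can be sketched in a few lines. The following is a minimal sketch in Python that mimics only the hostname test, not Apache itself. The function names and the sample path are illustrative.

```python
import re

CANONICAL_HOST = "www.yourdomain.com"

# re.IGNORECASE mirrors the [NC] flag on the rewrite conditions.
IS_CANONICAL = re.compile(r"^www\.yourdomain\.com$", re.IGNORECASE)
IS_BARE_DOMAIN = re.compile(r"^yourdomain\.com$", re.IGNORECASE)


def redirect_if_not_canonical(host, path):
    """Negated condition: redirect every host except the canonical one."""
    if IS_CANONICAL.match(host):
        return None
    return "http://" + CANONICAL_HOST + "/" + path


def redirect_if_bare_domain(host, path):
    """Positive condition: redirect only the bare domain."""
    if IS_BARE_DOMAIN.match(host):
        return "http://" + CANONICAL_HOST + "/" + path
    return None


for host in ("yourdomain.com", "yourdoamin.com", "WWW.YourDomain.com"):
    print(host,
          redirect_if_not_canonical(host, "page.html"),
          redirect_if_bare_domain(host, "page.html"))
```

Only the negated form redirects the misspelled host; both forms leave the canonical host alone regardless of letter case.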
Under what circumstances might you want to omit the query string
from the redirected URL, as we did in the preceding two examples? When
a session ID or a tracking parameter (such as source=banner_ad1) needs to be dropped.
Retaining a tracking parameter after the redirect is not only
unnecessary (because the original URL with the source code appended
would have been recorded in your access logfiles as it was being
accessed); it is also undesirable from a canonicalization standpoint.
What if you wanted to drop the tracking parameter from the redirected
URL, but retain the other parameters in the query string? Here’s how
you’d do it for static URLs:
RewriteCond %{QUERY_STRING} ^source=[a-z0-9_]*$
RewriteRule ^(.*)$ /$1? [L,R=301]
And for dynamic URLs:
RewriteCond %{QUERY_STRING} ^(.+)&source=[a-z0-9_]+(&?.*)$
RewriteRule ^(.*)$ /$1?%1%2 [L,R=301]
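The effect of these two conditions on a query string can be sketched as follows. This is a minimal sketch in Python that models only the pattern matching. The function name strip_source is illustrative, and the character class here includes an underscore, an assumption made so that a value such as banner_ad1 is matched in full.

```python
import re

# Assumption: underscore included so that values like banner_ad1 match.
ONLY_SOURCE = re.compile(r"^source=[a-z0-9_]*$")
SOURCE_AFTER_OTHERS = re.compile(r"^(.+)&source=[a-z0-9_]+(&?.*)$")


def strip_source(query_string):
    """Return the query string without the source parameter.

    Returns None when neither pattern matches (no redirect is issued).
    """
    if ONLY_SOURCE.match(query_string):
        return ""
    match = SOURCE_AFTER_OTHERS.match(query_string)
    if match:
        # group(1) and group(2) correspond to %1 and %2 in the rule.
        return match.group(1) + match.group(2)
    return None


print(strip_source("source=banner_ad1"))                 # (empty string)
print(strip_source("id=5&source=banner_ad1"))            # id=5
print(strip_source("id=5&source=banner_ad1&color=red"))  # id=5&color=red
print(strip_source("id=5"))                              # None
```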
Need to do some fancy stuff with cookies before redirecting the
user? Invoke a script that cookies the user and then 301s him to the
canonical URL:
RewriteCond %{QUERY_STRING} ^source=([a-z0-9_]*)$
RewriteRule ^(.*)$ /cookiefirst.php?source=%1&dest=$1 [L]
Note the lack of a [R=301]
flag in the preceding code. That’s on purpose. There’s no need to
expose this script to the user. Use a rewrite and let the script
itself send the 301 after it has done its work.
Other canonicalization issues worth correcting with rewrite
rules and the [R=301] flag include
when the engines index online catalog pages under HTTPS URLs, and URLs
missing a trailing slash that should be there. First, the HTTPS
fix:
# redirect online catalog pages in the /catalog/ directory if HTTPS
RewriteCond %{HTTPS} on
RewriteRule ^catalog/(.*) http://www.yourdomain.com/catalog/$1 [L,R=301]
Note that if your secure server is separate from your main
server, you can skip the RewriteCond line.
Now to append the trailing slash:
RewriteRule ^(.*[^/])$ /$1/ [L,R=301]
After completing a URL rewriting project to migrate from dynamic
URLs to static, you’ll want to phase out the dynamic URLs not just by
replacing all occurrences of the legacy URLs on your site, but also by
301-redirecting the legacy dynamic URLs to their static equivalents.
That way, any inbound links pointing to the retired URLs will end up
leading both spiders and humans to the correct new URL—thus ensuring
that the new URLs are the ones that are indexed, blogged about, linked
to, and bookmarked, and the old URLs will be removed from the index as
well. Generally, here’s how you’d accomplish that:
RewriteCond %{QUERY_STRING} id=([0-9]+)
RewriteRule ^get_product\.php$ /products/%1.html? [L,R=301]
However, you’ll get an infinite loop of recursive redirects if
you’re not careful. One quick-and-dirty way to avoid that situation is
to add a nonsense parameter to the destination URL for the rewrite and
ensure that this nonsense parameter isn’t present before doing the
redirect. Specifically:
RewriteCond %{QUERY_STRING} id=([0-9]+)
RewriteCond %{QUERY_STRING} !blah=blah
RewriteRule ^get_product\.php$ /products/%1.html? [L,R=301]
RewriteRule ^products/([0-9]+)\.html$ /get_product.php?id=$1&blah=blah [L]
Notice that the example used two RewriteCond lines, stacked on top of each
other. All rewrite conditions listed together in the same block will
be “ANDed” together. If you wanted the conditions to be “ORed”, it
would require the use of the [OR]
flag.
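To see why the marker parameter stops the loop, the request flow can be traced step by step. The following is a simplified model in Python, not a reproduction of how Apache processes rules. The function names process and trace are illustrative, and the second pattern is written to match the .html URL that the redirect produces, which is an assumption of this sketch.

```python
import re


def process(path, query):
    """Apply the two rules once; return (action, path, query)."""
    id_match = re.search(r"id=([0-9]+)", query)
    has_marker = re.search(r"blah=blah", query) is not None

    # Rule 1: both conditions must hold (they are ANDed together).
    if re.match(r"^get_product\.php$", path) and id_match and not has_marker:
        # The trailing ? in the rule drops the original query string.
        return "redirect", "products/" + id_match.group(1) + ".html", ""

    # Rule 2: internal rewrite that adds the marker parameter.
    product = re.match(r"^products/([0-9]+)\.html$", path)
    if product:
        new_query = "id=" + product.group(1) + "&blah=blah"
        return "rewrite", "get_product.php", new_query

    return "serve", path, query


def trace(path, query, limit=10):
    """Follow redirects and rewrites until a request is served."""
    steps = []
    for _ in range(limit):
        action, path, query = process(path, query)
        steps.append(action)
        if action == "serve":
            return steps
    return steps + ["loop"]


print(trace("get_product.php", "id=123"))  # ['redirect', 'rewrite', 'serve']
```

Because the rewritten request carries blah=blah, the second condition fails on the next pass and the script is served instead of being redirected again.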
4. Redirecting a Home Page Index File Without Looping
Many websites link to their own home page in a form similar to
http://www.yourdomain.com/index.html. The problem
with that is that most incoming links to the site’s home page specify
http://www.yourdomain.com, thus dividing the link
juice into the site. Once a publisher realizes this, they will want to
fix their internal links and then 301 redirect
http://www.yourdomain.com/index.html to
http://www.yourdomain.com/, but there will be
problems with recursive redirects that develop if this is not done
correctly.
When someone comes to your website by typing in
http://www.yourdomain.com, the DNS system of the
Internet helps the browser locate the web server for your website. How,
then, does the web server decide what to show to the browser? It turns
out that it does this by loading a file from the hard drive of the web
server for your website.
When no file is specified (i.e., as in the preceding example, only
the domain name is specified), the web server loads a file that is known
as the default file. This is often a file with a name such as index.html, index.htm, index.shtml, index.php, or default.asp.
The filename can actually be anything, but most web servers
default to one type of filename or another. Where the problem comes in
is that many CMSs will expose both forms of your home page, both
http://www.yourdomain.com and
http://www.yourdomain.com/index.php.
Perhaps all the pages on the site link only to
http://www.yourdomain.com/index.php, but given
human nature, most of the links to your home page that third parties
give you will most likely point at
http://www.yourdomain.com/. This can create a
duplicate content problem if the search engine now sees two versions of
your home page and thinks they are separate, but duplicate, documents.
Google is pretty smart at figuring out this particular issue, but it is
best to not rely on that.
Since you learned how to do 301 redirects, you might conclude that
the solution is to 301-redirect
http://www.yourdomain.com/index.php to
http://www.yourdomain.com/. Sounds good, right?
Unfortunately, there is a big problem with this.
What happens is the server sees the request for
http://www.yourdomain.com/index.php and then sees
that it is supposed to 301-redirect that to
http://www.yourdomain.com/, so it does. But when it
loads http://www.yourdomain.com/ it retrieves the
default filename (index.php) and
proceeds to load
http://www.yourdomain.com/index.php. Then it sees
that you want to redirect that to
http://www.yourdomain.com/, and it creates an
infinite loop.
4.1. The default document redirect solution
The solution that follows is specific to the preceding index.php example. You will need to plug in
the appropriate default filename for your own web server.
Copy the contents of index.php to another file. For this
example, we’ll be using sitehome.php.
Create an Apache DirectoryIndex directive for your
document root. Set it to sitehome.php. Do not set the directive
on a serverwide level; otherwise, it may cause problems with other
folders that still need to use index.php as a directory index.
Put this in an .htaccess file in your document root:
DirectoryIndex sitehome.php
Or, if you aren’t using per-directory context files, put this in
your httpd.conf:
<Directory /your/document/root/examplesite.com/>
DirectoryIndex sitehome.php
</Directory>
Clear out the contents of your original index.php file. Insert this line of
code:
<?php header("Location: http://www.yourdomain.com/", true, 301); exit; ?>
This sets it up so that index.php is not a directory index file
(i.e., the default filename). It forces sitehome.php to be read when someone types
in the canonical URL (http://www.yourdomain.com).
Any requests to index.php from
old links can now be 301-redirected while avoiding an infinite
loop.
If you are using a CMS, you also need to make sure when you are
done with this that all the internal links now go to the canonical
URL, http://www.yourdomain.com. If for any reason
the CMS started to point to
http://www.yourdomain.com/sitehome.php the loop
problem would return, forcing you to go through this entire process
again.