When giteinbrittany.com isn't www.giteinbrittany.com - using mod_rewrite to correct the URL
Hopefully the Blog title has added a bit of mystique to my first Blog rambling of the new year (today was my first day back at work after the break so I can't honestly say "Happy New Year" to anyone as I'm reluctantly back in the office ...)
It's been pointed out to me that the home page for our French vacation rental website, www.giteinbrittany.com, is actually available via four different URLs:
http://www.giteinbrittany.com
http://giteinbrittany.com
http://www.giteinbrittany.com/index.html
http://giteinbrittany.com/index.html
In other words, the whole of the site is available both with and without the www prefix and, as if this wasn't bad enough, because the index.html page is also served by default when just the domain name is accessed, there are potentially four URLs for the home page.
The upshot is that, depending upon how other websites have linked to mine, I can have up to four indexable entries in the search engine databases for the home page and two for every other page.
Now, although the search engines do tend to work out the best correlation, they rarely do this 100% correctly (some of the different Google data centres are returning different results, for instance). And because I have the home page internally linked to /index.html within the site rather than just /, there is the additional negative effect of page rank dilution.
So what to do?
Well, the solution is a bit of website management I'd not really fiddled with much before: the .htaccess file, and in particular changing the website configuration so that attempts to access these different URLs are simply and automatically redirected by my hosting provider to a single page.
If you have a really good memory you may remember that back in July 2006 I first created a simple .htaccess file to prevent people browsing the subdirectories of my website. I achieved that with a .htaccess file that contained the following rows:
<Files .htaccess>
order allow,deny
deny from all
</Files>
IndexIgnore */*
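As an aside, IndexIgnore only hides entries from an automatically generated directory listing; the listing page itself can still be served. A minimal alternative sketch, assuming the hosting provider allows the Options directive to be overridden in .htaccess (nothing in the original setup confirms this), is to switch listings off altogether:

```apache
# Assumption: the host permits "Options" overrides in .htaccess (AllowOverride Options).
# Turns off auto-generated directory listings; a directory without an index page
# then returns a 403 Forbidden response instead of a file list.
Options -Indexes
```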
The new trick is to use the mod_rewrite directives in the .htaccess file to cause the web server to take the URL that's requested and rewrite it to what you want it to be.
So if someone requests http://giteinbrittany.com you can respond with http://www.giteinbrittany.com and, what's even cleverer, you can (if you want to) tell the client browser that you have done this by returning an HTTP 301 status code, which says that the page they requested has been permanently moved to the new address. At first I was a bit worried about this, as returning what looked like an error code didn't seem to me to be a good idea. But after quite a bit of Googling I found that a 301 isn't really an error at all: it's the redirect response that browsers and search engines expect, and it has the benefit that the search engines will then automatically link the 'input' URL to the 'output' URL, thus removing at a stroke the problem of having duplicate search engine entries for the same website content.
By the by, if you want to temporarily redirect users to a different web page (perhaps because a section of the website is unavailable for some short-term reason), you can return a 302 status code instead.
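As a rough illustration of the difference, the status code is just a flag on a mod_rewrite rule. In this sketch the bookings/ section and the maintenance.html page are made-up names, not real parts of the site:

```apache
# Hypothetical paths: "bookings/" and "maintenance.html" are invented for illustration.
RewriteEngine on
# R=302 marks the redirect as temporary, so the original URL should stay the one
# that search engines index; L stops any further rewrite rules being applied.
RewriteRule ^bookings/ /maintenance.html [R=302,L]
```

Swapping R=302 for R=301 would turn the same rule into a permanent redirect.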
There are loads of instructions and tutorials on how to use mod_rewrite out there on the web, so I won't repeat them here. Instead I'll point out a few that I found useful: over at workingwith.me.uk there's a simple beginner's guide to mod_rewrite that explains the basic structure and usage of the .htaccess entries, at HTMLsource there's an explanation of regular expressions in mod_rewrite, and on Stephen Hargrove's Blog he shows how to redirect from the non-www site to the www version.
So putting these different tutorials together I ended up with an extended .htaccess file like this:
<Files .htaccess>
order allow,deny
deny from all
</Files>
IndexIgnore */*
RewriteEngine on
RewriteBase /
### re-direct non-www to www
RewriteCond %{HTTP_HOST} ^giteinbrittany\.com [NC]
RewriteRule ^(.*)$ http://www.giteinbrittany.com/$1 [R=301,NC]
The first part of the .htaccess file is exactly as before. Then the RewriteEngine and RewriteBase directives switch on the rewrite engine (mod_rewrite is an optional web server module, and rewriting is off until you enable it) and define the base URL path against which the rewrite rules are applied.
The RewriteCond line says to match any request where HTTP_HOST (i.e. the host name part of the URL you are accessing) starts with giteinbrittany.com (the caret, or 'hat', anchors the match to the start of the string). The NC flag simply makes the match case-insensitive (so GiTeInBrItTaNy is matched just as giteinbrittany is).
Then, for all requests that match the RewriteCond line, the RewriteRule is applied. It takes any requested page (the ^(.*)$ means match any page name, which is captured as $1) and redirects to the same page prefixed by http://www.giteinbrittany.com/. The R=301 flag means respond with an HTTP 301 'moved permanently' status code, and again NC means make the match case-insensitive.
So, for example, when you request http://giteinbrittany.com/fred.html, RewriteCond matches HTTP_HOST against giteinbrittany.com, then the RewriteRule fires for fred.html and, instead of the page itself, returns a 301 response pointing at http://www.giteinbrittany.com/fred.html, which the browser then fetches automatically.
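A variation that is often seen avoids hard-coding the domain by reusing the requested host name in the target. It is sketched here under the assumption that every host name pointing at the site should gain a www prefix, and is an illustration rather than what is running on the site:

```apache
RewriteEngine on
RewriteBase /
# Skip requests that arrive without a host name at all
RewriteCond %{HTTP_HOST} .
# Fire only when the requested host name does not already start with "www."
RewriteCond %{HTTP_HOST} !^www\. [NC]
# Reuse the requested host name in the target; $1 is the page captured by (.*)
# L stops processing further rules once the redirect has been issued
RewriteRule ^(.*)$ http://www.%{HTTP_HOST}/$1 [R=301,L]
```

With this in place, a request for http://giteinbrittany.com/fred.html would be answered with a 301 pointing at http://www.giteinbrittany.com/fred.html, just as with the hard-coded rule.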
It all sounds a bit complicated, but believe me it does work, and so far it has worked seamlessly.
To see all this in action, take a look at the web server responses shown by StepForth's HTTP viewer server tool.
And you can see it working not only for giteinbrittany.com: try entering other URLs such as bbc.co.uk and you will see the same 301 response and automatic page redirection being returned as well.
So this fixes the bigger problem of dual page ranks for www- and non-www pages, but still leaves the issue of the default home page and index.html being separately indexed.
For this one I'm still thinking about the right answer. I've been advised that I should redirect all index.html requests to the default home page (/) and then change the website navigation structure to match, but that isn't trivial in Rational Application Developer, which I use to build the site. Alternatively, I could use a similar mod_rewrite rule to send index.html requests straight to the default home page. But if I don't also change the navigation structure, every internal navigation link will still point to '/index.html', which then gets 301'd to '/' ... and that just seems wrong to me.
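For reference, the mod_rewrite option is commonly sketched along the lines below. It is an untested illustration, not something running on the site. The condition checks THE_REQUEST, the original request line sent by the browser, so that only explicit requests for index.html are redirected. The server's own internal mapping of / to index.html is left alone, which avoids a redirect loop.

```apache
RewriteEngine on
RewriteBase /
# Only act when the browser itself asked for ".../index.html"
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html\ HTTP/
# Redirect to the containing directory; $1 holds any leading sub-directory path
RewriteRule ^(([^/]+/)*)index\.html$ http://www.giteinbrittany.com/$1 [R=301,L]
```

Internal navigation links that still point to /index.html would keep working with a rule like this, but each click would cost one extra redirect until the links themselves are changed to /.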
I think I will celebrate the success I have achieved and will ponder this secondary problem a bit further ...
Labels: Website