Friendly URLs — Possibly all of what makes a good URL structure
Long story short: I strongly suggest following the convention of organizing files and directories in a file system. This is what users are already familiar with, and since the majority of them aren’t willing to learn anything new, you’re better off keeping your URLs this way. Among other things, friendly URLs have one additional benefit: when a URL is easy to read, people are more likely to click it. For example, a reference left in the comments section of a popular news site can drive a fairly large number of new users to your website. More people on a website is what makes a webmaster happy.
Friendly URLs generator
First, and probably most important: if you want to make your URLs pretty, you need to generate the static parts of the address from various input. It can be the headline of a news article or the name of a file; in short, whatever you’d use as a page title. Let’s start with a short piece of code written in Python. After that I’ll explain what the code does step by step, followed by a list of notes on all the other aspects.
If your favorite language supports Unicode (hint: PHP doesn’t), you shouldn’t have much trouble implementing such a function in it; it’s crystal clear. Don’t shy away from regular expressions, they’ll be explained too. See:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import re
foo = "good example"
bar = u"*!Iñtërnâtiônàlizætiøn!*"
def pretty_url(text):
    text = text.replace("_", "-")  # 1: underscores become hyphens
    regex = re.compile(r"\W+", re.U)  # 2: Unicode-aware "non-word" pattern
    text = re.sub(regex, "-", text)
    # 3: trim the ends; re.U keeps a trailing non-ASCII letter intact on Python 2
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.U)
    text = text.lower()  # 4
    return text

print(pretty_url(foo))  # Outputs "good-example"
print(pretty_url(bar))  # Outputs "iñtërnâtiônàlizætiøn"
- Step 1: replace all underscores with hyphens. Otherwise the regex special sequence \W would leave them untouched: \W matches any non-alphanumeric character except the underscore.
- Step 2: we could skip step 1 and just use [^a-zA-Z0-9] instead, but that would turn every non-ASCII letter into a hyphen. These two lines read as follows: “compile a regular expression pattern with the Unicode flag set, then replace every run of one or more matching characters with a hyphen”.
- Step 3: there’s always a risk of leaving a hyphen at the beginning or the end of the string. Literally, this regex means: “replace one or more non-alphanumeric characters with an empty string, but only at the start OR the end of the string”.
- Step 4: convert whatever is left to lowercase.
This is a “prettified” version of a function I used for generating urlslugs on a file sharing website, which is not a nice source of input if you ask me. And you know what, it’s almost perfect: it handles pretty much everything except really weird things such as KoЯn or Qu33n5_0f_th3_5t0n3_4g3. I didn’t make these up; they are real-world examples (band names, actually). Such a wonderful piece of evidence for Troutman’s Fifth Programming Postulate: “If the input editor has been designed to reject all bad input, an ingenious idiot will discover a method to get bad data past it”.
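To see what the four steps actually do with awkward input, including the two band names, here is a small self-contained check. It restates the function so it runs on its own under Python 3; the sample strings other than the band names are my own made-up examples.

```python
import re

NON_WORD = re.compile(r"\W+", re.U)

def pretty_url(text):
    """Same four steps as the function above, restated to run standalone."""
    text = text.replace("_", "-")                       # 1: \W does not match "_"
    text = NON_WORD.sub("-", text)                      # 2: each non-word run -> one hyphen
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.U)   # 3: trim leftover hyphens at the ends
    return text.lower()                                 # 4

# Hypothetical sample titles, plus the two band names from the text.
samples = [
    "  Hello, World!  ",
    "snake_case_file_name.tar.gz",
    "KoЯn",
    "Qu33n5_0f_th3_5t0n3_4g3",
    "!!!",
]

for sample in samples:
    print(repr(sample), "->", repr(pretty_url(sample)))
```

Note that the band names do produce valid slugs; they just aren’t any more readable than the input. Punctuation-only input such as "!!!" collapses to an empty string, so a caller would probably want a fallback (a content ID, say) for that case.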
Further details & more explanations
James Gardner covered the problem of choosing a good URL structure with an excellent list of tips in the Pylons Book. Even though the chapter feels pretty comprehensive, I felt I could point out a couple more.
Here’s my take on the topic; I tried to put together everything that comes to mind:
- URLs should describe the contents of your website: if users know what they are going to see, they are more likely to click. Search engines like them this way, too. Both things mean more users coming to your site; therefore, instead of (or in addition to) a content ID, use an urlslug generated from the page title.
- URLs need to be short, so strip all the unnecessary parts. Visitors really don’t need to know that your news controller has a view action in it if it’s the default one. Unless you run a website processing large chunks of data, dates and ID numbers aren’t necessary either.
- Stripping everything but the urlslug is overkill, though. How would people know which pages are located at the top of your site’s navigation structure? Which of them are more important than others?
- Separate words with hyphens. No commas, colons, semicolons or any other punctuation marks; only hyphens. But what about the underscore, you may ask. Glad you asked: no underscores. You can’t register a domain name with an underscore in it. Only hyphens. If you use underscores, you create a mess instead of following the convention of keeping things simple.
- Also, people tend to share URLs via comments. And a hell of a lot of programmers haven’t grasped regular expressions yet, so if you let your imagination run too wild, their half-assed URL parsing function will break the address, sending people to nothing but a 404 error. This is not a good thing.
- Following the file-system-like convention, every URL of yours should look like a static one. No place for dynamic-looking parts. No question marks, no ampersands, no equals signs. Static. Just like in a file system. If you need to pass something via the GET method, do it with a fancy URL rewrite, separating things with slashes. Queries separated by slashes are more readable and easier to type than, say, index.php?param=value&otherparam=other%20value%20and&will=this%20ever%20end.
- Every page on your site should be available under one, and only one, URL. Search engines don’t like the same content being served from different URLs, since links pointing at an address are how search algorithms measure the value of a particular page. A canonical URL is an ugly workaround for muddle-headed programmers, and isn’t as good as the one-unique-URL-per-page approach.
- Stick to lowercase characters. This makes your URLs easier to read. Also, since URL paths are case-sensitive, you’re better off not leaving people guessing whether you used an upper-case or a lower-case character somewhere. Otherwise you may end up with one page indexed under more than one address in the search engine index, and you already know that’s a bad thing.
- Since all you serve is plain HTML, consider ending your URLs with an .html extension. The exception is the navigation parts, such as category indexes or pagination: following the file-system-like convention, these play the role of directories, organizing your URLs in a nice, user-friendly logical structure.
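To make the tips above a bit more concrete, here is a minimal sketch of how the file-system-like layout and the one-URL-per-page rule could look in Python 3. The function names (slugify, page_url, section_url, redirect_target) and the page/N/ pagination scheme are my own assumptions for illustration; they don’t come from any particular framework.

```python
import re

def slugify(title):
    """Lowercase, hyphen-separated slug; underscores are treated as separators too."""
    slug = re.sub(r"[\W_]+", "-", title, flags=re.U)
    return slug.strip("-").lower()

def page_url(sections, title):
    """Sections act as directories; the page itself looks like an .html file."""
    parts = [slugify(s) for s in sections] + [slugify(title) + ".html"]
    return "/" + "/".join(parts)

def section_url(sections, page=1):
    """Navigation pages (category index, pagination) end with a slash, like directories."""
    url = "/" + "".join(slugify(s) + "/" for s in sections)
    if page > 1:
        url += "page/%d/" % page  # assumed pagination scheme
    return url

def canonical_path(requested):
    """One address per page: collapse repeated slashes and lowercase the path."""
    return re.sub(r"/{2,}", "/", requested).lower()

def redirect_target(requested):
    """Return None if the path is already canonical, else where a 301 should point."""
    canonical = canonical_path(requested)
    return None if canonical == requested else canonical

print(page_url(["Articles", "Web Design"], "Friendly URLs, explained"))
print(section_url(["Articles", "Web Design"], page=2))
print(redirect_target("/Articles//Web-Design/"))
```

Redirecting every non-canonical variant to the single lowercase address is one way to get the one-URL-per-page behavior without relying on a canonical URL hint. This sketch only looks at the path; a real site would also have to decide what to do with query strings and percent-encoded characters.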
Friendly URLs handling
There are several different approaches to URL rewriting. Since the mechanisms used for mapping URLs are described in great detail elsewhere, I’ll just leave a few references here so you can compare different approaches to the fancy URLs problem.
- Routing in Ruby on Rails
- Pylons Routing (improved version of the Ruby on Rails routing system)
- mod_rewrite (Apache module described as “the Swiss Army knife of URL manipulation”, which says pretty much everything)
- Django URL dispatcher
- Zope Virtual Host Monster
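All of these tools share the same basic idea: match a static-looking path against a list of patterns and hand the captured parts to some code. Here is a rough, framework-free sketch of that idea in Python 3. It is not how any of the tools above is implemented, and the handler names and URL layout are hypothetical.

```python
import re

def show_article(section, slug):
    # Hypothetical handler: a real one would look the article up and render it.
    return "article %s in section %s" % (slug, section)

def list_section(section, page="1"):
    # Hypothetical handler for directory-like navigation pages.
    return "section %s, page %s" % (section, page)

# Each route maps a static-looking path to a handler; the first match wins.
ROUTES = [
    (re.compile(r"^/(?P<section>[\w-]+)/(?P<slug>[\w-]+)\.html$", re.U), show_article),
    (re.compile(r"^/(?P<section>[\w-]+)/page/(?P<page>\d+)/$", re.U), list_section),
    (re.compile(r"^/(?P<section>[\w-]+)/$", re.U), list_section),
]

def dispatch(path):
    """Call the handler of the first matching route with its named groups."""
    for pattern, handler in ROUTES:
        match = pattern.match(path)
        if match:
            return handler(**match.groupdict())
    return None  # no route matched; the caller would answer with a 404

print(dispatch("/news/friendly-urls.html"))
print(dispatch("/news/page/2/"))
print(dispatch("/index.php?param=value"))
```

The named groups play the role of the GET parameters: what would have been ?section=news&page=2 becomes /news/page/2/, and anything that doesn’t fit a known pattern simply falls through to a 404.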
Disclaimer
Remember: everything written here is just my humble attempt to cover the topic comprehensively, addressing the possible issues along the way with a short explanation each. It’s not a standard, just a convention. There’s no need to argue why the underscore is better than the hyphen or the like; you’re free to do whatever feels right for you at a given moment.
Just make sure it really is a well-considered decision. When it comes to URL structure, fixing what seemed right back then after a year or so might be more problematic than one would expect. You will need to trouble yourself with redirects, and on top of that, Google still hasn’t made such an operation painless; years go by and not much has changed in that matter.
Please send in your input, for I am willing to add some more.