Thursday, March 15, 2007

RegExp good practices

After not finding anything from a Google search on "RegExp good practices", I figured someone should write something about it. Here's what I think are important tips when writing code with regular expressions:

  1. Avoid them if possible - regexp bugs are hard to trace
  2. Be careful with .* and negatives ([^...]) - matching too much is the leading cause of regexp bugs
  3. Comment regular expressions - the longer it is, the more ellaborate you should be
  4. Never ever regex document.body.innerHTML!

1) The main reason why I recommend staying away from regular expressions is that they are difficult to test, especially when used to handle input.

2) More often than not, developers don't test all cases. What happens if there are french characters? Line breaks? Some may argue that there are testing frameworks out there to help, but let's face it, not everyone uses them, and even if you do, complex regular expressions can still sometimes get away with obscure unusual character-sequence matches that only a regexp guru would be able to pick up.

3) All code should, of course, be commented, but reality is far from perfect. The least you as a developer should do is comment regular expressions, simply because they are so cryptic by nature. I find that "translating" the regexp into plain english and then stating what it is supposed to accomplish is reasonable.

4) I hack a lot while experimenting, and I'll tell you this first hand: Regular expressions are for small strings only! If your project absolutely requires regex'ing through paragraphs and paragraphs of content, you're probably neglecting your server-side capabilities, and heading towards a glacially slow dead-end.

Last thoughts

I'm mostly talking with javascript in mind here, but I think it's safe to say that these regexp rules apply in any environment. If you have any to add, feel free to post a comment.

No comments:

Post a Comment