
Quoted Strings with Regular Expressions.

It’s a deceptively simple programming task. How do you capture a string of characters between single or double quotes?

Several times I’ve needed to grab name / value pairs of data entered by the user. For example, title="This is the Title", weight="28", etc. Using regular expressions, this would seem to be straightforward. The regular expression would simply match a string of characters in a manner similar to this:

(\w+)="([^"]*)"
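As a quick illustration, the pattern can be tried in Python. The choice of Python is an assumption made for this sketch, since the regular expression itself is not tied to any one language; the sample input and the names SIMPLE and pairs are invented for the example.

```python
import re

# The simple pattern: a word, an equals sign, then a double-quoted
# value that contains no quote characters.
SIMPLE = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical user input, modeled on the examples in the text.
text = 'title="This is the Title", weight="28"'

# findall returns one (name, value) tuple per match.
pairs = SIMPLE.findall(text)
print(pairs)  # [('title', 'This is the Title'), ('weight', '28')]
```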

[Figure: a visual representation of what the above regular expression matches.]

The Problem

However, what if our string of characters looks like this?

title="A Review of \"A Tale of Two Cities\""

Here the string of characters we want to capture itself has quotes in it; these quotes have been “escaped” using the backslash character \. What we want to “grab” out of the above string is everything between the first and last quotes, so we get the string A Review of "A Tale of Two Cities". However, our original regular expression won’t grab that. It only matches up to the first quote character it encounters: the string A Review of \.
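The failure is easy to reproduce. The following is a minimal Python sketch (the language and the variable names are assumptions of this example) that runs the original pattern against the escaped title:

```python
import re

# The original pattern, which knows nothing about backslash escapes.
SIMPLE = re.compile(r'(\w+)="([^"]*)"')

# A raw string, so the backslashes reach the regex engine untouched.
text = r'title="A Review of \"A Tale of Two Cities\""'

match = SIMPLE.search(text)
captured = match.group(2)

# The match ends at the first quote character, escaped or not.
print(captured)  # A Review of \
```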

This is a problem: we need a regular expression that accepts any number of “escaped characters” (that is, a backslash followed by another character) inside the quotes. It needs to work even with something like this:

title="A Review of \"A Tale of Two Cities\" and \"Moby Dick\""

The Solution

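The trick is to divide the string into consistently divisible parts. The string above, for example, divides into an (optional) initial string of characters, followed by repeated groupings that each start with an escaped character (here \") and continue with any ordinary characters that follow it, like so:

A Review of
\"A Tale of Two Cities
\" and
\"Moby Dick
\"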

By constructing a regular expression that matches this sequence zero or more times, we now have a solution. Our original regular expression, now modified, looks like this:

(\w+)="([^"\\]*(\\.[^"\\]*)*)"

[Figure: a visual representation of the modified regular expression.]

I’ve created a command-line program, written in PHP, that demonstrates this in action.
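As an independent illustration (this is not the PHP program mentioned above), here is a minimal Python sketch of the modified regular expression in use. The function name parse_pairs, the sample input, and the final step that strips the backslashes from the captured value are all assumptions added for the example.

```python
import re

# The modified pattern: an optional run of ordinary characters, then
# zero or more groups of (escaped character + run of ordinary characters).
QUOTED = re.compile(r'(\w+)="([^"\\]*(\\.[^"\\]*)*)"')

def parse_pairs(text):
    """Return (name, value) pairs, with backslash escapes removed from values."""
    pairs = []
    for m in QUOTED.finditer(text):
        name, raw = m.group(1), m.group(2)
        # Drop the backslash in front of each escaped character.
        value = re.sub(r'\\(.)', r'\1', raw)
        pairs.append((name, value))
    return pairs

# A raw string, so the backslashes reach the regex engine untouched.
sample = r'title="A Review of \"A Tale of Two Cities\" and \"Moby Dick\"" weight="28"'

for name, value in parse_pairs(sample):
    print(f'{name}: {value}')
# title: A Review of "A Tale of Two Cities" and "Moby Dick"
# weight: 28
```

The raw capture (group 2) still contains the backslashes; whether to strip them afterwards depends on what the calling code expects.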

This should work too: (\w+)="((\\.|[^\\])*)"

I think it will match anything that isn’t a backslash, and when it does find a backslash it must be followed by exactly one character AND end on a quote.

Posted by: Phil Harnish at May 18, 2005 03:59 PM

(Note: for clarity’s sake, I’ll refer to the backslash escape character as Q).

Phil’s suggested regex, (\w+)="((Q.|[^Q])*)", has two problems:

  1. By using the | operator inside the repeated group, the regex is less efficient: the engine has to choose between the two alternatives at every character.
  2. It is greedy, and matches more than the intended match. Because [^Q] also matches quote characters, it will match past the intended closing quote if there is another quote character following on the same line.

Posted by: Trent at May 23, 2005 09:20 AM
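To illustrate the second problem listed in the comment above, here is a minimal Python sketch (an added illustration, not part of the comment thread). It writes the suggested pattern with a real backslash in place of Q; the sample line and the variable names are assumptions of the example.

```python
import re

# The suggested alternation pattern, with a real backslash instead of Q.
ALTERNATION = re.compile(r'(\w+)="((\\.|[^\\])*)"')
# The article's pattern, for comparison.
UNROLLED = re.compile(r'(\w+)="([^"\\]*(\\.[^"\\]*)*)"')

# Two pairs on the same line, with no escapes at all.
line = 'title="One" weight="28"'

overshoot = ALTERNATION.search(line).group(2)
intended = UNROLLED.search(line).group(2)

print(overshoot)  # One" weight="28
print(intended)   # One
```

The alternation pattern runs through to the last quote on the line because [^\\] accepts quote characters; the article's pattern excludes them from its character class and so stops at the first unescaped quote.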

I think that if you were going to use:

(\w+)="((\\.|[^\\])*)"

then it should have been:

(\w+)="((\\.|[^"])*)"

Now it is looking for an escaped character OR a non-quote. So it will stop when it finds a quote that isn’t escaped.

But the point about performance is important too.

Or does this cause other problems that I don’t see?

I personally like the one without the | because it can be used with sed and grep.

Posted by: Jerry Jeremiah at April 28, 2006 05:01 PM
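The variant suggested in the comment above can be checked with a short Python sketch (an added illustration, not part of the comment thread; the test strings and variable names are assumptions). On well-formed input it behaves like the article's pattern. One difference that shows up in this sketch is with a value that is never closed: because [^"] also accepts a backslash, the engine can backtrack and treat the backslash of a trailing \" as an ordinary character.

```python
import re

# The variant from the comment: an escaped character OR any non-quote.
VARIANT = re.compile(r'(\w+)="((\\.|[^"])*)"')
# The article's pattern, for comparison.
UNROLLED = re.compile(r'(\w+)="([^"\\]*(\\.[^"\\]*)*)"')

# Well-formed input: both patterns stop at the unescaped closing quote.
good = r'title="A \"B\"" weight="28"'
variant_good = VARIANT.search(good).group(2)
unrolled_good = UNROLLED.search(good).group(2)
print(variant_good == unrolled_good)  # True

# Unterminated value: the only quote after the opening one is escaped.
broken = r'x="abc\"'
variant_broken = VARIANT.search(broken)
unrolled_broken = UNROLLED.search(broken)

print(variant_broken.group(2))  # abc\
print(unrolled_broken)          # None
```

Whether accepting the unterminated value matters depends on how strictly the input needs to be validated.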
