Php recursive regex problem

I want to parse SQL statements with extra notations. I am trying to use the recursive
notation to match nested parentheses. The following example reduces the problem to
the simplest form:

<?php
header("Content-type: text/plain");

function doMatch($re, $string)
{            
   printf("\
using \\"%s\\":\
", $re);
   if (preg_match($re, $string, $matches))
   {
      printf("%d Matches\
",count($matches));
      for ($i=0; $i<count($matches); $i++)
         printf("%d: %s\
", $i, $matches[$i]);
   }         
   else          
      printf("No match (boo hoo)\
");
}            
                                                                                                                                                                                   
$string = "<click:blah,blah> (SELECT GROUP_CONCAT(s.fname) FROM Scouts) AS bogus";
$reThatWorks = "/\\((?:([^()]+)|(?R))*\\)/";
$reThatFails = "/ \\((?:([^()]+)|(?R))*\\)/";

doMatch($reThatWorks,$string);
doMatch($reThatFails,$string);
?>


This is the output:

using "/\\((?:([^()]+)|(?R))*\\)/":
2 Matches
0: (SELECT GROUP_CONCAT(s.fname) FROM Scouts)
1:  FROM Scouts

using "/ \\((?:([^()]+)|(?R))*\\)/":
No match (boo hoo)

The only difference between the regexes is the failing regex tries to match a space in
front of the nested parentheses part. If fails with a space after as well.

Ultimately, I’d like to match the contents of the <…>, content up to optional “AS alias”,
and the alias if used. For now, I’d be happy if I could understand why the space causes
failure, and a workaround, if possible.

Thanks in advance (if anyone takes this on)
Chuck Jungmann

PS. I know I could do this by iterating through each character, but using regexes would
be more flexible.
or after the regex for the nested parentheses prevents a match.

Can you include the desired space in square brackets, as a literal?


"/[ ]\\((?:([^()]+)|(?R))*\\)[ ]/":

Thank you for your reply. I did try the , but it still fails to match.

I have played with it a bit and found that one or more . (wildcard) in place of
the space character will match, but any specific character causes it to fail.

Another thing was removing the left-paren “(” from the character class
results in a match, but it’s not the right one, as it doesn’t handle the recursion,
and stops the match at the first right-paren “)”.

Although I’m still interested in why it fails and how to fix it, I decided to put
in the non-recursive match for up to three nested levels:

\w*\s*\(([^()]+|\(([^()]+|\(([^()]+)\))\))*\)

to match a parenthetical expression:
(SELECT GROUP_CONCAT(name) FROM Scouts)

or a function:
CONCAT(fname,’ ', lname)

Since I posted, I’ve read some things that suggest that recursive matching
steps outside of what one should expect from regular expressions. It could
be that the exceptional nature of recursion prevents it from playing nice
with the rest of the statement.

I think the recursion is acting greedily (I don’t quite understand it, but that’s the best way I can put it) and trying to match everything in parentheses - so in the part “GROUP_CONCAT(name)” there is no space in front of “(name)” which causes it to fail.

try:

$re = "#\\s*\\((?:([^\\(\\)]+)|(?R))*\\)#";

aamonkey,

First, thanks for your reply. It helped me learn something new about regular expressions.

At first, I thought that your example was identical to mine, except for the escaping the parentheses in the character class. As I continued to look for differences, I noticed the critical one: you used \s* where I had used ’ ’ or \s+, knowing that my string included a space. Further experimentation showed that completely optional (either * or ?) preceding text would still allow a match, but anything specific fails.

Here’s my mostly working regex that matches the line in the example:
$re = “/(<([^>]+)>\s*)?((?:\((?:([^()]+)|(?R))*\))(\s+as\s+(\w+))?)/i”;

What I want from the match is to save the contents of <…>, and the alias that follows the “as” and return what follows the <…> ($matches[3] for this regex). I don’t care about the parentheses except that I need to know when they end to match the proper “as.” The problem is that the regex only matches when the <blah,blah> is optional, and I’m only interested in the match if the <blah,blah> part is there. I know this is a small difference, and that I can work around it by checking for $matches[2], but it seems like there should be a way to positively match the <…> and still recursively match the parentheses.

You speculated that perhaps the lack of a space before the parenthesis after GROUP_CONCAT might be the problem, and I worked with that assumption for a while, and I don’t think it matters. I tried putting an optional space at several locations within the recursive part, but nothing seems to make it possible to positively match <…>.

I dont understand what you’re trying to get out of your parsing.

More specifically define your input, and your intended output.

Will there only be one SQL statement in the string? What parts do you actually want to capture? Will the <> bit always be first?

I am parsing complete SQL statements. In defining the statements, I use the angle-brackets to pass extra information to the program that uses the SQL output. I am using preg_replace_callback() with the callback in a closure to extract the text within the angle-brackets, putting them into an associative array where the key is the either the field name or the alias. For matches, I return the text that follows the angle-brackets. So…I need to capture the contents of <> and the fieldname or alias. In the example, I obviously need to capture the alias because there will be no field name for the parenthetical phrase.

The <> is always first, and it defines an action or attribute of the field in the output. The <> can also precede the SELECT to define an action or attribute of the row. I only parse a single SQL statement at a time.

This fragment I’m working on is a small part of a larger regular expression, but it stands alone for the purposes of figuring out how, or why it doesn’t, work.