Preg match syntax issue?


$String = 'page=4&amp;searchId=2" title="Go to page 4">4</a>, <a href="Results.jsp?page=5&amp;searchId=2" title="Go to page 5">5</a>, <a href="Results.jsp?page=6&amp;searchId=2" title="Go to page 6">6</a>, <a href="Results.jsp?page=7&amp;searchId=2" title="Go to page 7">7</a>, <a href="Results.jsp?page=8&amp;searchId=2" title="Go to page 8">8</a> [<a href="Results.jsp?page=2&amp;searchId=2">Next</a><a href="Results.jsp?page=180&amp;searchId=2">Last</a>]</span>';
preg_match_all('#page=180&amp;searchId=2">(.*?)</a>#', $String, $Values);
print_r($Values);

returns “Last” in the array, as expected.


$String = 'page=4&amp;searchId=2" title="Go to page 4">4</a>, <a href="Results.jsp?page=5&amp;searchId=2" title="Go to page 5">5</a>, <a href="Results.jsp?page=6&amp;searchId=2" title="Go to page 6">6</a>, <a href="Results.jsp?page=7&amp;searchId=2" title="Go to page 7">7</a>, <a href="Results.jsp?page=8&amp;searchId=2" title="Go to page 8">8</a> [<a href="Results.jsp?page=2&amp;searchId=2">Next</a><a href="Results.jsp?page=180&amp;searchId=2">Last</a>]</span>';
preg_match_all('#page=(.*?)&amp;searchId=2">Last</a>#', $String, $Values);
print_r($Values);

doesn’t return 180. Anyone know why?

Because (.*) matches everything you throw at it, it’s very greedy. If you change your code to indicate you’re only interested in numbers it works just fine:


$String = 'page=4&amp;searchId=2" title="Go to page 4">4</a>, <a href="Results.jsp?page=5&amp;searchId=2" title="Go to page 5">5</a>, <a href="Results.jsp?page=6&amp;searchId=2" title="Go to page 6">6</a>, <a href="Results.jsp?page=7&amp;searchId=2" title="Go to page 7">7</a>, <a href="Results.jsp?page=8&amp;searchId=2" title="Go to page 8">8</a> [<a href="Results.jsp?page=2&amp;searchId=2">Next</a><a href="Results.jsp?page=180&amp;searchId=2">Last</a>]</span>'; 
preg_match_all('#page=(\\d+)&amp;searchId=2">Last</a>#', $String, $Values); 
print_r($Values);  


Array
(
    [0] => Array
        (
            [0] => page=180&amp;searchId=2">Last</a>
        )

    [1] => Array
        (
            [0] => 180
        )

)

Hi Scallio,

I used ([0-9]{0,4}) and it worked too. Yours is a bit nicer though. I still don’t understand why my original didn’t work, because I figured it would match any string between “page=” and “&amp;searchId=2">Last</a>”. Obviously, that’s not so. I just need to learn the syntax better :S lol

\d is shorthand for [0-9], and + means ‘1 or more times’ (the same as {1,}), whereas {0,4} means ‘between 0 and 4 times’. So in effect they do the same thing here, yes, except that mine won’t accept an empty number and will accept numbers longer than 4 digits, whereas yours won’t.
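To make the difference between the two quantifiers concrete, here is a small sketch. The sample inputs are made up for illustration and are not from the original page; the trailing & in the patterns just stands in for whatever follows the number.

```php
<?php
// Hypothetical sample inputs (not taken from the thread's $String).
$samples = ['page=180&', 'page=&', 'page=12345&'];

$results = [];
foreach ($samples as $s) {
    $results[$s] = [
        // \d+ : one or more digits, no upper limit
        'd_plus'    => preg_match('#page=(\d+)&#', $s, $m1) ? $m1[1] : null,
        // [0-9]{0,4} : zero to four digits
        'zero_four' => preg_match('#page=([0-9]{0,4})&#', $s, $m2) ? $m2[1] : null,
    ];
}

// null means "no match"; an empty string means "matched zero digits".
var_dump($results);
```

With these inputs, only {0,4} accepts the empty number, and only \d+ accepts the five-digit one.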

It does, just not the page= you were expecting :wink:


page=[COLOR="#FF0000"]4&amp;searchId=2" title="Go to page 4">4</a>, <a href="Results.jsp?page=5&amp;searchId=2" title="Go to page 5">5</a>, <a href="Results.jsp?page=6&amp;searchId=2" title="Go to page 6">6</a>, <a href="Results.jsp?page=7&amp;searchId=2" title="Go to page 7">7</a>, <a href="Results.jsp?page=8&amp;searchId=2" title="Go to page 8">8</a> [<a href="Results.jsp?page=2&amp;searchId=2">Next</a><a href="Results.jsp?page=180[/COLOR]&amp;searchId=2">Last</a>]</span>

The part in red is matched by your regex since (.*) will just grab everything and anything. Think about it.

(except his regex was (.*?), which makes it NON greedy…)

Yes, but that only works going forward. I.e., if you have a string like /this/is/a/string and you match /this/(.*)/ it will match [color=red]/this/is/a/[/color]string, i.e., everything up until the last / in the string.
Whereas /this/(.*?)/, i.e., making it non-greedy, will match [color=red]/this/is/[/color]a/string, i.e., it stops directly after the first slash it finds and doesn’t “eat” other slashes in between.

With the problem of the OP however he wants to match as little as possible before the subject string, as far as I know there is nothing you can do to make that happen. Making the .* non-greedy in his case has no effect whatsoever.
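For anyone who wants to try the /this/is/a/string example, a minimal sketch follows. It uses # as the delimiter so the slashes in the pattern don’t need escaping; the variable names are just for illustration.

```php
<?php
$subject = '/this/is/a/string';

// Greedy: .* grabs as much as it can, then backs off to the LAST slash.
preg_match('#/this/(.*)/#', $subject, $greedy);

// Non-greedy: .*? grows one character at a time and stops at the FIRST slash.
preg_match('#/this/(.*?)/#', $subject, $lazy);

echo $greedy[0], "\n"; // /this/is/a/
echo $lazy[0], "\n";   // /this/is/
```

In both cases the match starts at the same place; the quantifier only changes where it ends.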

[ot] > With the problem of the OP however he wants to match as little as possible before the subject string, as far as I know there is nothing you can do to make that happen.

It could be done, assuming I’m understanding what you’re looking for correctly. However, in this case, using \d is the right and proper thing to be doing.[/ot]
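The poster above doesn’t say how it could be done, so the following is only a guess at two possible approaches, not necessarily what was meant. The $String here is a trimmed-down version of the one in the question (last few links only), shortened for readability.

```php
<?php
// Trimmed-down version of the thread's $String (assumption: last links only).
$String = '<a href="Results.jsp?page=8&amp;searchId=2" title="Go to page 8">8</a> [<a href="Results.jsp?page=2&amp;searchId=2">Next</a><a href="Results.jsp?page=180&amp;searchId=2">Last</a>]</span>';

// Option A: a leading greedy .* runs to the end of the string and backtracks,
// so the first page= that lets the whole pattern succeed is the one nearest "Last".
preg_match('#.*page=(.*?)&amp;searchId=2">Last</a>#', $String, $a);

// Option B: [^&]* cannot cross an ampersand, so the capture group
// cannot swallow the earlier links.
preg_match('#page=([^&]*)&amp;searchId=2">Last</a>#', $String, $b);

echo $a[1], "\n"; // 180
echo $b[1], "\n"; // 180
```

Option A would need the s modifier if the subject contained newlines. For a value that is known to be a number, \d+ is still the simpler choice.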

biglittle, it looks like your confusion arises from not quite understanding how PCRE (the regex library used for the preg_* functions) chooses what to return.

Put simply, it returns the first valid match (of course, if there is one). The subject string is searched from left to right, character by character, when looking for a match.

Given your regex, upon reaching the very first [COLOR="#006400"]page=[/COLOR] and matching it against the regex’s [COLOR="#B22222"]page=[/COLOR], things are looking good. The next part is then executed, the [COLOR="#B22222"](.*?)[/COLOR], which happily eats up everything that it can with an eye to still getting a successful match of the whole regex. Since you only ask that what comes after the [COLOR="#B22222"](.*?)[/COLOR] be the literal [COLOR="#006400"]&amp;searchId=2">Last</a>[/COLOR], then it eats up everything to that point.

As an aside, a greedy version like [COLOR="#B22222"](.*)[/COLOR] would continue looking through the whole subject string after noticing that [COLOR="#006400"]&amp;searchId=2">Last</a>[/COLOR] had been seen. It’s greedy and wants to eat as much as possible. In your case, since [COLOR="#006400"]&amp;searchId=2">Last</a>[/COLOR] does not occur later in the string, both greedy and non-greedy would eat the same amount. The only difference is how much of the string is examined after finding that part of the string.

So, after [COLOR="#B22222"](.*?)[/COLOR] noms everything that it can, the rest of the regex goes on to try and get matched. The [COLOR="#006400"]&amp;searchId=2">Last</a>[/COLOR] is there at this stage so the regex has found its first match. At this point, nothing else is done. The match is returned and processing of the subject string stops immediately. A different regex engine, POSIX, would continue on in the string to try and find any more matches and would return the longest (leftmost) match possible (POSIX doesn’t have the concept of greedy/non-greedy): in your case, there isn’t a longer match from the initial [COLOR="#006400"]page=[/COLOR] starting point. However, PCRE gives up at the very first match that it can find.

Hopefully that hasn’t confused you entirely. In short, PCRE finds the first matching part of the subject string possible.

A final point: since you were using preg_match_all(), after finding the first match the subject string is examined again starting at the ending point of the previous match (i.e., between [COLOR="#006400"]>[/COLOR] and [COLOR="#006400"]][/COLOR] near the end of the string). From this point, the rest of the string (only [COLOR="#006400"]]</span>[/COLOR]) does not match, so only the one match is pushed into the array.
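To see the leftmost match for yourself, a short sketch like the one below prints what the original pattern actually captures. Again, the $String is a trimmed-down version of the one in the question, so the first page= here is page=8 rather than page=4.

```php
<?php
// Trimmed-down version of the thread's $String (assumption: last links only).
$String = '<a href="Results.jsp?page=8&amp;searchId=2" title="Go to page 8">8</a> [<a href="Results.jsp?page=2&amp;searchId=2">Next</a><a href="Results.jsp?page=180&amp;searchId=2">Last</a>]</span>';

// The original pattern from the question.
$count = preg_match_all('#page=(.*?)&amp;searchId=2">Last</a>#', $String, $Values);

echo $count, "\n";          // number of matches found
echo $Values[1][0], "\n";   // everything from the FIRST page= up to "...page=180"
```

The capture begins right after the first page= in the string and ends with page=180, which is why the array does contain 180, just buried at the end of a much longer string.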