Of course, I’ll try and explain.
The regex is essentially split into two parts: one fairly simple (but extremely useful), and the other a bit less simple, mostly because it’s a bit ugly. The parts are separated by the alternation operator (a pipe, |), so the main regex can match either of the two alternatives. Those two parts are a) \G0 and b) (?<=^|1)0(?=0{0,3}(?:1|$)). I’ll cover part b first (because part a won’t make much sense without it).
Matching (?<=^|1)0(?=0{0,3}(?:1|$))
Taking this to pieces, there are three main parts:
(?<=^|1)
0
(?=0{0,3}(?:1|$))
The first and last of these use lookarounds (a lookbehind and a lookahead, respectively). I’ll call the three pieces a, b and c below; these letters are separate from the two alternatives a) and b) above.
Part a. looks “behind” the current matching position for either the start of the subject string or a 1. So given the subject string above, this part succeeds at the start of the string and immediately after any 1.
Part b. matches just a 0. So building up what can be matched, that’s only a zero preceded by the start of the string or a 1. Easy enough so far?
Part c. is a little more complex. It looks ahead (after the 0) to see if there are between 0 and 3 more 0s followed by either a 1 or the end of the string. This is the part that limits the number of sequential 0s to between 1 and 4 inclusive (or as the OP stated, “all zero sequences whose length less than 5”). If that’s not clear, here are a few examples. Say we just matched a zero (in grey) and want to check this lookahead:
[color=grey]0[color=green]000[/color][color=red]0[/color][/color]
= FAIL
because there is a fourth zero after the zero from part b.
[color=grey]0[color=green]1[/color][/color]
= PASS
because there is a following 1
[color=grey]0[color=green]<end of string>[/color][/color]
= PASS
because the zero was at the end of the string
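If you want to check examples like these yourself, here is a minimal PHP sketch. The helper name lookaheadPasses is made up for illustration, and the pattern pins the first 0 to the start of the subject with ^ purely to keep the test simple; that anchor is an assumption of the sketch, not part of the original regex.

```php
<?php
// Hypothetical helper: does a 0 at the very start of $subject satisfy
// the lookahead (?=0{0,3}(?:1|$)) from the explanation above?
function lookaheadPasses(string $subject): bool
{
    return preg_match('/^0(?=0{0,3}(?:1|$))/', $subject) === 1;
}

foreach (['00000', '01', '0', '00001'] as $subject) {
    echo $subject, ' = ', lookaheadPasses($subject) ? 'PASS' : 'FAIL', "\n";
}
// 00000 = FAIL  (four more 0s follow the first one)
// 01 = PASS     (a 1 follows)
// 0 = PASS      (end of string follows)
// 00001 = PASS  (three more 0s, then a 1)
```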
That is the end of the complicated part of the main regex. So in English, it matches:
- any 0,
- either at the start of the string or preceded by a 1,
- and followed by up to three more 0s and then either a 1 or the end of the string.
Visually, this part would match as follows:
[color=grey][color=green]0[/color]0011[color=green]0[/color]1[color=green]0[/color]011000001[color=green]0[/color]11[color=green]0[/color]111[color=green]0[/color]1[color=green]0[/color]000111000001[color=green]0[/color]000[/color]
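The same result can be reproduced with a short PHP sketch that runs only this alternative over the string shown above. The replacement character x is an arbitrary marker chosen for the demo, not something from the original question.

```php
<?php
// Subject string copied from the highlighted example above.
$subject = '0001101001100000101101110100001110000010000';

// Only the lookaround alternative, without the \G0 part.
$partB = '/(?<=^|1)0(?=0{0,3}(?:1|$))/';

// Mark every match with an 'x' so the matched positions are easy to see.
$marked = preg_replace($partB, 'x', $subject);
echo $marked, "\n";
// x0011x1x011000001x11x111x1x000111000001x000
```

Only the first 0 of each short run is marked; both runs of five 0s are left alone.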
Great, but that only matches the first 0 in each sequence of up to four 0s! This is where the super-concise other alternative comes in.
Matching \G0
The \G start-of-match assertion is the key here, and will take some explaining. It is a special check which is only true when the current matching position is at the start point of the match.
The start point of the match is the point at which the current matching run starts (“well, duh” some might say). In practice, this means either the very start of the whole process (when the start point is the beginning of the string) or the point where matching resumes after a replacement (essentially the point just after the text that was replaced).
So this part matches a 0 which is at the start of the subject string (aside: for the observant reader, this means the ^ alternative in the lookbehind for the other part is redundant!) or at the point where matching starts again after a replacement (so, right after matching the first 0 in a sequence).
Again, let’s describe this visually. Given a subject string of ababcabab, let’s see what happens:
preg_replace('/\\Gab/', '|$0', 'ababcabab')
gives
[color=green]|ab|ab[/color]cabab
(replaced parts highlighted in green)
preg_replace('/ab/', '|$0', 'ababcabab')
gives [color=green]|ab[/color][color=green]|ab[/color]c[color=green]|ab[/color][color=green]|ab[/color]
The difference above is caused by the \G, which means that the letters ab can only be matched at a start point. After matching the second ab (the one before the c), the start point is just before that letter c. The c does not match the regex, so it is skipped and the next character is examined, but this is no longer the match start point, so \G fails.
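Here are the two calls side by side as a runnable sketch, in case you want to try them. The variable names are just for illustration.

```php
<?php
$subject = 'ababcabab';

// With \G, a match may only begin where the previous match ended
// (or at the very start of the subject).
$anchored = preg_replace('/\Gab/', '|$0', $subject);

// Without \G, every occurrence of "ab" is matched.
$unanchored = preg_replace('/ab/', '|$0', $subject);

echo $anchored, "\n";   // |ab|abcabab
echo $unanchored, "\n"; // |ab|abc|ab|ab
```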
So back to \G0, it matches:
- any 0,
- at the start point of a match (i.e. the start of the string, or immediately following a replacement)
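A quick way to see \G0 on its own is the sketch below. The two short subjects and the x marker are invented for the demo.

```php
<?php
// \G0 alone: consume 0s only while they form an unbroken chain
// from the point where matching started.
$leading = preg_replace('/\G0/', 'x', '00100');
$blocked = preg_replace('/\G0/', 'x', '100');

echo $leading, "\n"; // xx100
echo $blocked, "\n"; // 100
```

In the first call the chain is broken by the 1, so the later 0s are never at a start point. In the second call the very first character is a 1, so nothing matches at all.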
Putting the pieces together
Our full regex looks for:
- any 0, that is
[list]
- at the start of the string or preceded by a 1, and
- followed by up to three more 0s and then a 1 or the end of the string;
[/list]
or
[list]
- immediately after a 0 that has just been matched.
[/list]
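To see both alternatives working together, here is a sketch that runs the full regex over the example string used earlier. The x replacement is an assumption for the demo; substitute whatever replacement you actually need.

```php
<?php
$subject = '0001101001100000101101110100001110000010000';

// Full regex: \G0 continues a run whose first 0 was already matched
// by the lookaround alternative (or that starts the subject).
$pattern = '/\G0|(?<=^|1)0(?=0{0,3}(?:1|$))/';

$result = preg_replace($pattern, 'x', $subject);
echo $result, "\n";
// xxx11x1xx11000001x11x111x1xxxx111000001xxxx

// Edge case: a long run at the very start of the subject.
$edge = preg_replace($pattern, 'x', '000001');
echo $edge, "\n";
// xxxxx1
```

Every run of one to four 0s is replaced in full, while the two runs of five 0s in the middle are untouched. The second call is worth checking against your own data: because \G also holds at the very start of the subject, a run of five or more 0s at the beginning of the string is consumed as well.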