preg_replace, more than 4

system · March 10, 2011, 6:10am

I have a string containing sequences of 0s and 1s. I need to replace every sequence of zeros whose length is less than 5 with 1s of the same length. Zero sequences of length 5 or more should be left as is.

For example

source : 11000001100010011000001
result : 11000001111111111000001

<snip/>

chris_upjohn · March 10, 2011, 6:35am

Try the below, it should work fine.

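$numbers = '00011010011000001011011101';
echo $numbers . '<br />';
// Match runs of one to four 0s that are not part of a longer run,
// and replace each run with the same number of 1s.
$numbers = preg_replace_callback('/(?<!0)0{1,4}(?!0)/', function ($m) {
    return str_repeat('1', strlen($m[0]));
}, $numbers);
echo $numbers;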

rpkamp · March 10, 2011, 10:01am


$string = '11000001100010011000001';
// Note: the /e modifier was deprecated in PHP 5.5 and removed in PHP 7.
$pattern = '/(0+)/e';
$replacement = "strlen('\\1') > 4 ? '\\1' : str_repeat('1', strlen('\\1'))";

echo preg_replace(
  $pattern,
  $replacement,
  $string
);

salathe · March 10, 2011, 8:00pm

ScallioXTX, I would always advise using preg_replace_callback() rather than the evil modifier!


$numbers     = '0001101001100000101101110100001110000010000';
$pattern     = '/0+/'; 
$replacement = function ($m) {
    if (strlen($m[0]) < 5) {
        return strtr($m[0], '0', '1');
    }
    return $m[0];
};

echo preg_replace_callback($pattern, $replacement, $numbers);
// 1111111111100000111111111111111110000011111

This could also be done with plain replacement by crafting a regex which looks only for 0s within a sequence of between one and four 0s.


$numbers = '0001101001100000101101110100001110000010000';
echo preg_replace('/\G0|(?<=^|1)0(?=0{0,3}(?:1|$))/', '1', $numbers);
// 1111111111100000111111111111111110000011111

However, just because something can be done does not mean it should be. But for education purposes, go wild. (:

rpkamp · March 10, 2011, 8:22pm

I agree, preg_replace_callback is nicer, I’m not entirely sure why I suggested /e instead!

And I thought I was pretty okay with regex, but this is going over my head. Would you mind giving a breakdown of what the different parts do?

salathe · March 12, 2011, 2:55pm

Of course, I’ll try and explain.

The regex is essentially split into two parts, one fairly simple (but extremely useful) and the other a bit less simple but mostly because it’s a bit ugly. Those parts are separated by the alternation operator (a pipe, |) such that the main regex can match either of the two alternatives. Those two parts are a) \G0 and b) (?<=^|1)0(?=0{0,3}(?:1|$)). I’ll cover part b first (because part a won’t make much sense without it).

Matching (?<=^|1)0(?=0{0,3}(?:1|$))

Taking this to pieces, there are three main parts:

a. (?<=^|1)
b. 0
c. (?=0{0,3}(?:1|$))

The two complicated parts use lookarounds (lookbehind, and lookahead, respectively).

Part a. looks “behind” the current matching position for either the start of the subject string, or a number 1. So given the subject string above, this part would match successfully at the start of the string and immediately after any number 1s.

Part b. matches just a number 0. So building up what can be matched, that’s only a zero preceded by the start of the string or a number 1. Easy enough so far?

Part c. is a little more complex. It looks ahead (after the number 0) to see whether there are between 0 and 3 number 0s followed by either a number 1 or the end of the string. This is the part that limits the number of sequential 0s to between 1 and 4 inclusive (or, as the OP put it, zero sequences whose length is less than 5). If that’s not clear, here are a few examples. Say we have just matched a zero (shown first) and want to check this lookahead against the text that follows it (shown in brackets):

0[0000] = FAIL
because there is a fourth zero after the zero from part b.
0[1] = PASS
because there is a following 1.
0[<end of string>] = PASS
because the zero was at the end of the string.
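
The three cases above can be checked directly. This is only a minimal sketch and not part of the original regex: the ^ anchor is added so that the first character is the one being tested, and the helper name passesLookahead is made up for illustration.

```php
<?php
// Hypothetical helper: does the leading 0 of $s satisfy the lookahead from part c?
// The ^ anchor is an addition so that only the first character is tested.
function passesLookahead($s)
{
    return preg_match('/^0(?=0{0,3}(?:1|$))/', $s) === 1;
}

var_dump(passesLookahead('00000')); // four more 0s follow: bool(false)
var_dump(passesLookahead('01'));    // a 1 follows: bool(true)
var_dump(passesLookahead('0'));     // end of string follows: bool(true)
```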

That is the end of the complicated part of the main regex. So in English, it matches:

any 0,
either at the start of the string or preceded by a 1,
and followed by up to three more 0s and then either a 1 or the end of the string.

Visually, this part would match as follows (each matched 0 is marked with a ^ underneath):
0001101001100000101101110100001110000010000
^    ^ ^         ^  ^   ^ ^            ^
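
To see the same positions as numbers, the lookaround alternative can be run on its own with preg_match_all() and PREG_OFFSET_CAPTURE. This is a rough sketch for illustration; the variable names are arbitrary.

```php
<?php
$numbers = '0001101001100000101101110100001110000010000';

// The lookaround alternative on its own: a 0 at the start of the string or
// after a 1, with at most three more 0s before the next 1 or the end.
$pattern = '/(?<=^|1)0(?=0{0,3}(?:1|$))/';

preg_match_all($pattern, $numbers, $matches, PREG_OFFSET_CAPTURE);

// Each entry of $matches[0] is a pair: [matched text, offset in the subject].
$offsets = array();
foreach ($matches[0] as $match) {
    $offsets[] = $match[1];
}

echo implode(' ', $offsets);
// 0 5 7 17 20 24 26 39
```

Only the first 0 of each short run shows up, and the two runs of five 0s (starting at offsets 11 and 33) are absent.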

Great, but that only matches the first 0 in each sequence of up to four 0s! This is where the super-concise other alternative comes in.

Matching \G0

The \G start of match assertion is the key here, and will take some explaining. This is a special check which is only true when the current matching position is at the start point of the match.

The start point of the match is the point at which the current matching run starts (“well, duh” some might say). In practice, this means either the very start of the whole process (when the start point is the beginning of the string) or the point at which matching resumes after a replacement (essentially the point just after the text that was replaced).

So this part matches a 0 which is at the start of the subject string (aside: for the observant reader, this means the ^ alternative in the lookbehind for the other part is redundant!) or immediately following the point where matching starts again after a replacement (so, after matching the first 0 in a sequence).

Again, let’s describe this visually. Given a subject string of ababcabab, let’s see what happens (each replaced part is preceded by the inserted |):

preg_replace('/\Gab/', '|$0', 'ababcabab') gives |ab|abcabab
preg_replace('/ab/', '|$0', 'ababcabab') gives |ab|abc|ab|ab

The difference above is caused by the \G, which means that the letters ab can only be matched at a start point. After the second ab has been matched, the start point is just before the letter c. The c does not match the regex, so it is skipped and the next character is examined, but that position is no longer the match start point, so \G fails.
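
The same behaviour can be observed by counting matches with preg_match_all(). The sketch below is an illustration only. It also adds one extra case, a subject that does not begin with ab, where the \G version finds nothing at all because the first attempt already fails at the start point.

```php
<?php
$subject = 'ababcabab';

// With \G, each match must begin exactly where the previous one ended
// (or at the start of the subject for the first attempt).
$anchored = preg_match_all('/\Gab/', $subject, $m1);

// Without \G, the engine is free to skip over the "c" and carry on.
$unanchored = preg_match_all('/ab/', $subject, $m2);

echo $anchored . ' vs ' . $unanchored . "\n";
// 2 vs 4

// Extra case: the subject starts with "c", so \G never lines up with an "ab".
echo preg_replace('/\Gab/', '|$0', 'cabab') . "\n";
// cabab
```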

So back to \G0, it matches:

any 0,
at the start point of a match (i.e. at the very start of the string or immediately following a replacement)

Putting the pieces together

Our full regex looks for:

any 0 that is
    at the start of the string or preceded by a 1, and
    followed by up to three more 0s and then either a 1 or the end of the string;
or that
    immediately follows a 0 matched by one of the above.
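
One edge case is worth checking. Because \G is also true at the very start of the subject, a string that begins with five or more 0s has its whole leading run replaced by the regex as written. A common idiom for excluding that position is \G(?!^). The sketch below is not from the original answer; it compares the two patterns on a made-up subject. Note that with this variant the ^ alternative in the lookbehind is no longer redundant.

```php
<?php
$original = '/\G0|(?<=^|1)0(?=0{0,3}(?:1|$))/';
// Assumed variant: (?!^) stops \G from matching at the start of the subject.
$variant  = '/\G(?!^)0|(?<=^|1)0(?=0{0,3}(?:1|$))/';

// Made-up subject: a run of six 0s at the start, then a run of two 0s.
$subject = '0000001001';

echo preg_replace($original, '1', $subject), "\n";
// 1111111111
echo preg_replace($variant, '1', $subject), "\n";
// 0000001111
```

On the sample string used earlier in the thread, which starts with a run of only three 0s, both patterns give the same result.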

rpkamp · March 14, 2011, 12:56am

I’ve read it several times and I think I get it. So, if we call \G0 a and (?<=^|1)0(?=0{0,3}(?:1|$)) b, am I correct in stating the following happens?


0001101001100000101101110100001110000010000
aaa  b ba  baaaa b  b   b baaa   baaaa baaa

I put an a below every zero that will be replaced by a 1 as per part a and, analogously, a b below every 0 that will be replaced by a 1 as per part b.

Is the above correct?

salathe · March 14, 2011, 7:36am

This kind of thing can take a while to grasp (my explanation probably didn’t help). If you think that you get it, then awesome! (:

The idea is correct, but the sequences of 5 zeros would not be matched (they are marked with dashes below).


0001101001100000101101110100001110000010000
aaa  b ba  ----- b  b   b baaa   ----- baaa
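
The labelling can also be produced by the regex engine itself. This is a rough sketch, assuming each alternative is wrapped in its own capturing group (a change made here purely so that the callback can tell the two apart).

```php
<?php
$numbers = '0001101001100000101101110100001110000010000';

// Same regex, with each alternative wrapped in its own capturing group
// so the callback can tell which one matched.
$pattern = '/(\G0)|((?<=^|1)0(?=0{0,3}(?:1|$)))/';

$labelled = preg_replace_callback($pattern, function ($m) {
    // $m[1] is non-empty only when the \G0 alternative matched.
    return $m[1] !== '' ? 'a' : 'b';
}, $numbers);

echo $numbers, "\n";
// Show every digit that was left alone as a dot so only the labels stand out.
echo strtr($labelled, '01', '..'), "\n";
// aaa..b.ba........b..b...b.baaa.........baaa
```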

rpkamp · March 14, 2011, 10:06am

Ah, I already knew most of the concepts (except for \G, but you explained that well!), but just never saw them in such a complex setting.
After reading your explanation about 4 or 5 times (because it’s such a complex subject, not because your explanation is bad, it isn’t!) I’m pretty sure I fully get what it does now.

Yes, of course

Thanks, Salathe!