Regex: remove everything not div tags

Hi,

Searched but haven’t found a solution to this.
I want to remove everything from html code that is not a <div> or </div> tag (opening or closing).
Since this matches the divs:

<div.*?>|</div>

I thought I could just negate it somehow, such as:

[^(<div.*?>)]|[^(</div>)]

(does not work)

Any ideas? :)
Cheers

I want to remove everything from html code that is not a <div> or </div> tag (opening or closing).

Am I misreading this? Surely you will then just be left with a string containing only <div>s and </div>s, which doesn’t seem much use.

This works for me:


~</(?!div).*?>|<(?!/)(?!div).*?>~is

Use in PHP as follows:


$some_string = preg_replace('~</(?!div).*?>|<(?!/)(?!div).*?>~is', '', $some_html);

Breakdown of this regex:

~ - Start regex
</ - match </ literally
(?!div) - Negative lookahead for the literal string div
.*? - match anything, lazily. Shouldn’t be needed here, but without it the regex doesn’t work !?
> - match > literally
| - OR match the following:
< - match < literally
(?!/) - Negative lookahead for the literal string /
(?!div) - Negative lookahead for the literal string div
.*? - match anything, lazily.
> - match > literally
~ - End regex
is - Modifiers: Case Insensitive (i) and Single Line mode (s)

Single line mode is to also remove HTML that spans multiple lines, like

<script language="javascript"
src="/some/path/to/some/javascript.js">
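
To make the effect of the s modifier concrete, here is a minimal sketch that runs the same regex with and without it. The sample markup is made up for illustration; only the regex comes from this post.

```php
<?php
// Hypothetical input: a <script> tag whose attributes span two lines.
$html = "<div><script language=\"javascript\"\n"
      . "src=\"/some/path/to/some/javascript.js\"></script>text</div>";

// With s, the dot also matches the newline, so the whole opening tag is removed.
$with_s = preg_replace('~</(?!div).*?>|<(?!/)(?!div).*?>~is', '', $html);

// Without s, .*? cannot cross the line break, so the opening <script ...> tag survives.
$without_s = preg_replace('~</(?!div).*?>|<(?!/)(?!div).*?>~i', '', $html);

var_dump($with_s);    // string(15) "<div>text</div>"
var_dump($without_s); // still contains the two-line <script ...> opening tag
```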

For info on negative lookahead, see here: http://www.regular-expressions.info/lookaround.html

Hope that helps :)

Agreed with Phillip. Is this the story where someone asks how to move a mountain because they want to lay a pipeline from point A to point B?

How I understood it is that the OP wished to remove all tags except for div tags, thus leaving everything outside tags (content) and div tags intact. Which is exactly what my regex provided in post #3 does :)

I’ll have to see it to understand it then.


$some_html = <<<HT
<div id="some_div"><a href="#">some link</a></div><hr /><abbr>PM</abbr>
HT;

$some_string = preg_replace('~</(?!div).*?>|<(?!/)(?!div).*?>~is', '', $some_html); 

var_dump(htmlentities($some_string));

/* OUTPUT (as rendered in a browser; the raw string has the entities encoded, hence 58):
string(58) "<div id="some_div">some link</div>PM"
*/

Does that help? :)

If that’s the type of input, no bizarre nesting or whatever, and never actually outputted to a real HTML page, then yes.
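
To show the kind of input I mean, a rough sketch with two made-up snippets: one with a > inside an attribute value, and one with a tag whose name merely starts with "div" (<divider> is a hypothetical custom tag, not a real HTML element). The regex is the one from the post above.

```php
<?php
$regex = '~</(?!div).*?>|<(?!/)(?!div).*?>~is';

// A ">" inside an attribute ends the lazy match early and leaves debris behind.
$a = preg_replace($regex, '', '<div><a title="1 > 0">x</a></div>');
var_dump($a); // string(16) "<div> 0">x</div>"

// (?!div) also protects any tag name that only begins with "div".
$b = preg_replace($regex, '', '<div><divider>y</divider></div>');
var_dump($b); // string(31) "<div><divider>y</divider></div>"
```

Adding a word boundary, as in (?!div\b), should take care of the second case; the first one is the sort of thing a regex alone does not handle well.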

.*? - match anything, lazily. Shouldn’t be needed here, but without it the regex doesn’t work !?

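Because the lookahead doesn’t consume anything, it just looks? After (?!div) the regex is still sitting right behind the </, so something like .*? has to eat the tag name before the > can match. But I’m probably misunderstanding that question too.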
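
A quick way to check this, as a small sketch (the test string is just an example):

```php
<?php
$tag = '</span>';

// Lookahead only: right after "</" the next character would have to be ">".
$without = preg_match('~</(?!div)>~', $tag);
var_dump($without); // int(0)

// With .*? the tag name is actually consumed, so ">" can match.
$with = preg_match('~</(?!div).*?>~', $tag, $m);
var_dump($with);  // int(1)
var_dump($m[0]);  // string(7) "</span>"
```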

That makes sense. Thank you for that :)

Thanks ScallioXTX! That’s pretty much what I was after. And thanks for the detailed explanation. I remember look-ahead now, but it’s been a while. Thanks also to the other comments.

I was, in fact, trying to get a string containing only div tags (as mentioned by philip). The reason for this is that when examining (e.g. WordPress) generated pages it can be useful to have a skeleton outline of the (potentially bloated) div structure. This can be done by hand, of course, but seems to be against the spirit of computing :)

Since the tags contain id and class properties, which are useful to know, combining the regex from Scallio with the following gives a visual guide viewable in a browser, showing the nesting and naming of each div without other clutter:

preg_replace('~<div(.*?)>~is', "<div\$1>\n\$1<br>\n", $some_html);
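
For the bare skeleton (div tags only, no content), another option is to skip the negation and collect the matches instead. A minimal sketch, assuming tags without a stray > inside attribute values; the sample markup and the $skeleton name are made up for illustration:

```php
<?php
// Hypothetical sample markup.
$some_html = '<div id="a"><p>text</p><div class="b"><span>x</span></div></div>';

// Collect every opening or closing div tag; \b keeps names like "divider" out.
preg_match_all('~</?div\b[^>]*>~i', $some_html, $matches);

// One tag per line gives a rough outline of the structure.
$skeleton = implode("\n", $matches[0]);
echo $skeleton, "\n";
```

On the sample above this prints the four div tags in document order, one per line, with their id and class attributes preserved.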

Is this the story where someone asks how to move a mountain because they want to lay a pipeline from point A to point B?

Possibly :). Although I knew it would be relatively straightforward to some. I know there are various tools for examining source code, but this seems like a fair use of regexps and can be done in a text editor.

Cheers

They can be, just be careful. Regular expressions work on regular languages. HTML isn’t a regular language. Meaning, for small things, a regex will be fine, but when there’s complicated nesting and possibly strange content floating around, you’ll want to check by hand afterwards if it matters.
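
If it does matter, a parser-based outline is one alternative. The sketch below uses PHP's DOMDocument (it needs the DOM extension); outline_divs() and the sample markup are hypothetical names and data for illustration, not something from this thread.

```php
<?php
// Recursively list each <div> with its id and class, indented by nesting depth.
function outline_divs(DOMNode $node, int $depth = 0): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        if (!($child instanceof DOMElement)) {
            continue; // skip text nodes, comments, etc.
        }
        $next = $depth;
        if (strtolower($child->tagName) === 'div') {
            $id    = $child->getAttribute('id');    // '' when the attribute is absent
            $class = $child->getAttribute('class');
            $out .= str_repeat('  ', $depth) . 'div'
                  . ($id !== '' ? '#' . $id : '')
                  . ($class !== '' ? '.' . $class : '')
                  . "\n";
            $next = $depth + 1; // only divs add a nesting level
        }
        $out .= outline_divs($child, $next);
    }
    return $out;
}

$html = '<div id="page"><p>intro</p><div class="post"><span>x</span>'
      . '<div class="meta"></div></div></div>';

libxml_use_internal_errors(true); // real-world markup is rarely perfectly valid
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$outline = outline_divs($doc);
echo $outline;
```

For the sample input this prints div#page, then div.post indented one level, then div.meta indented two levels. Because the parser tracks nesting itself, attribute values containing > or tag names that merely start with "div" do not confuse it.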

I actually find that the hierarchical HTML view shown when you use the “Inspect element” contextual menu option, in for example Chrome and Firefox, is invaluable for this.