Split string with HTML tags

bewise · July 11, 2011, 9:15am

Hello All,

I'm looking to split a string containing HTML tags into two sections: the first with a limited number of characters, the second with the remainder of the original string.

I'm currently using the following to split the string:


public static string SplitWord(string x, int length)
{
    if (x.Length > length)
    {
        x = x.Substring(0, length);
    }
    return x;
}

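// "text" is the original HTML string ("string" itself is a reserved word in C#)
string1 = SplitWord(text, 1000);
// Guard the length: Substring(1000) throws if text is shorter than 1000 characters
string2 = text.Length > 1000 ? text.Substring(1000) : "";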

With my current solution, it will occasionally split an HTML tag in two. For example:

string1 = "<di"
string2 = "v>"

or

string1 = "<div>"
string2 = "</div>"

Is there a way of dividing a string into two parts while making sure the split falls after a closing HTML tag?

Ideal solution:
string1 = "<div></div>"
string2 = ""

PhilipToop · July 12, 2011, 11:01am

I believe what you are after is a regular expression:

var strText="<div>when <b>in doubt</b> do nothing</div>";
var strBits = strText.match(/<[^> ]+[^>]*>[^<]*/g);

This will split the string so you end up with an array:
strBits[0] = "<div>when "
strBits[1] = "<b>in doubt"
strBits[2] = "</b> do nothing"
strBits[3] = "</div>"
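
Since the original question uses C#, here is a rough sketch of how the same pattern could pick a cut point that sits on a chunk boundary. The class and method names (TagBoundarySplitter, Split) are made up for illustration, and it assumes simple markup with no ">" characters inside attribute values. It only stops a tag being cut in half; it does not keep an opening tag together with its closing tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagBoundarySplitter
{
    // Same pattern as the JavaScript example: one tag plus the text up to the next '<'.
    private static readonly Regex Chunk = new Regex(@"<[^> ]+[^>]*>[^<]*");

    // Returns { head, tail }. The cut point advances one chunk at a time until
    // the head holds at least `length` characters, so it falls on a chunk boundary.
    public static string[] Split(string html, int length)
    {
        MatchCollection chunks = Chunk.Matches(html);

        // No tags at all: fall back to a plain character cut.
        int cut = chunks.Count == 0 ? Math.Min(length, html.Length) : 0;

        foreach (Match chunk in chunks)
        {
            if (cut >= length) break;
            cut = chunk.Index + chunk.Length;
        }
        return new[] { html.Substring(0, cut), html.Substring(cut) };
    }
}
```

With the sample string above and a length of 12, the head comes back as "<div>when <b>in doubt" and the tail as "</b> do nothing</div>". Note that the head can run past the requested length, because the cut only moves forward to the end of the current chunk.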

EnderMB · July 12, 2011, 11:26am

Never, ever, ever use a regular expression or string manipulation when dealing with HTML. Tasks like this are what HTML parsers like HtmlAgilityPack were made for!

If you can get to grips with a bit of XPath, then you can separate the content from the HTML, do your necessary splits and then rebuild the HTML around it however you wish.
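
As a minimal sketch of that idea (not a finished solution), something like the following could do the split on whole top-level nodes. It assumes the HtmlAgilityPack NuGet package is referenced; HtmlNodeSplitter and Split are hypothetical names. It walks the top-level child nodes rather than using XPath, and counts markup characters the same way the original Substring approach does.

```csharp
using System.Text;
using HtmlAgilityPack; // third-party NuGet package, not part of the .NET base class library

public static class HtmlNodeSplitter
{
    // Returns { head, tail }. Whole top-level nodes are appended to the head
    // until it holds at least `length` characters; the rest go to the tail.
    public static string[] Split(string html, int length)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var head = new StringBuilder();
        var tail = new StringBuilder();
        foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
        {
            // OuterHtml includes the node's own opening and closing tags,
            // so an element is never separated from its closing tag.
            StringBuilder target = head.Length < length ? head : tail;
            target.Append(node.OuterHtml);
        }
        return new[] { head.ToString(), tail.ToString() };
    }
}
```

Because a node is only ever moved as a whole, the head can overshoot the requested length, which matches the "ideal solution" in the question. If everything sits inside one wrapper element, the whole document ends up in the head, so a real implementation would probably need to descend into child nodes.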

PhilipToop · July 12, 2011, 12:00pm

The question related to JavaScript. HtmlAgilityPack says it is a .NET code library, so it is server-side and therefore not appropriate in this instance.

EnderMB · July 12, 2011, 12:24pm

This thread is in the .NET forum, which is the reason I posted a .NET library, unless I’m missing something?

Regardless, it’s a matter of language. HTML is too complex a language to parse with Regular Expressions, as this question on SO shows.

“I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar - you can’t possibly hope to make this work. But many will try, some will claim success and others will find the fault and totally mess you up.”

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

PhilipToop · July 12, 2011, 1:14pm

“This thread is in the .NET forum,”

Indeed it is - my mistake.

williamjerry · July 14, 2011, 3:40am

I think only a regular expression can help in this case; otherwise the code will get very heavy.

EnderMB · July 14, 2011, 6:34am

Read my post again.

It is a linguistic fact that a regular expression is not capable of handling HTML.

honeymonster · July 14, 2011, 7:12am

Ahem. Actually, with the .NET extensions to regular expressions (specifically balancing groups, which let you create expressions that match nesting levels), I would argue that it can be done.

It will not be pretty - and you are correct to point to alternative solutions. I’m just being obnoxious.
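
For the curious, here is a rough sketch of what that looks like with .NET balancing groups. It only handles nested <div> elements, and the names (BalancedDivDemo, FirstBalancedDiv) are invented for this example; it is an illustration of the feature, not a recommendation.

```csharp
using System.Text.RegularExpressions;

public static class BalancedDivDemo
{
    // (?<open>...) pushes a capture for every nested <div>, (?<-open>...) pops one
    // for every </div>, and (?(open)(?!)) fails the match if any <div> is still open.
    private static readonly Regex BalancedDiv = new Regex(
        @"<div\b[^>]*>" +
        @"(?:(?<open><div\b[^>]*>)|(?<-open></div>)|(?!</?div\b)[\s\S])*" +
        @"(?(open)(?!))</div>");

    // Returns the first <div>...</div> block whose nested divs are balanced,
    // or null if there is none.
    public static string FirstBalancedDiv(string html)
    {
        Match m = BalancedDiv.Match(html);
        return m.Success ? m.Value : null;
    }
}
```

Given "<div>a<div>b</div>c</div><p>tail</p>", this returns "<div>a<div>b</div>c</div>", so a split could be placed after the match. It says nothing about other tags, comments, or malformed markup, which is where this approach gets ugly.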

EnderMB · July 14, 2011, 8:12am

I think this SO post (despite it relating to JSON) covers my opinion on that method.

“Some systems offer extensions to regular expressions that kinda-sorta handle balanced expressions. However they’re all ugly hacks, they’re all unportable, and they’re all ultimately the wrong tool for the job.”

I’ve used extensions before (admittedly not with .NET) and it was far more trouble than it was worth, and it didn’t handle wild HTML code very well. More often than not if you’re performing a simple string task on a bit of HTML then the HtmlAgilityPack will do it in a couple of lines. I’d argue that it’s the best .NET library I’ve ever used, and like many developers it brings me great pain to see anyone using a regular expression to parse HTML.