Split string with HTML tags

Hello All,

I'm looking to split a string containing HTML tags into two sections: the first limited to a set number of characters, the second holding the remainder of the original string.

I'm currently using the following to split the string:


public static string SplitWord(string x, int length)
{
    if (x.Length > length)
    {
        x = x.Substring(0, length);
    }
    return x;
}

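// "input" is the original HTML string ("string" is a reserved word in C#).
string1 = SplitWord(input, 1000);
// Guard against inputs shorter than 1000 characters, where Substring(1000) would throw.
string2 = input.Length > 1000 ? input.Substring(1000) : "";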


With my current solution, the split occasionally lands in the middle of an HTML tag. For example:

string1 = "<di"
string2 = "v>"

or between an opening tag and its closing tag:

string1 = "<div>"
string2 = "</div>"

Is there a way of dividing a string into two parts while making sure the split happens after a closing HTML tag?

Ideal solution:
string1 = "<div></div>"
string2 = ""

I believe what you are after is a regular expression:

var strText="<div>when <b>in doubt</b> do nothing</div>";
var strBits = strText.match(/<[^> ]+[^>]*>[^<]*/g);

This will split the string so you end up with an array:
strBits[0] = "<div>when "
strBits[1] = "<b>in doubt"
strBits[2] = "</b> do nothing"
strBits[3] = "</div>"
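Since the original question is in C#, here is a rough sketch of how the same pattern might be used from .NET to pick a split point that never falls inside a tag. The class name `TagBoundarySplitter` and the `Split` helper are made up for illustration; only `Regex`, `Match` and `Substring` are real framework calls. It only avoids cutting a tag in half. It does not guarantee that each half has balanced opening and closing tags.

```csharp
using System.Text.RegularExpressions;

public static class TagBoundarySplitter
{
    // Same pattern as the JavaScript snippet: one tag plus the text that follows it.
    private static readonly Regex Chunk = new Regex(@"<[^> ]+[^>]*>[^<]*");

    // Hypothetical helper (not from the thread): splits at the last chunk
    // boundary that fits within maxLength characters.
    public static (string First, string Second) Split(string html, int maxLength)
    {
        if (html.Length <= maxLength)
        {
            return (html, "");
        }

        int splitAt = 0;
        foreach (Match m in Chunk.Matches(html))
        {
            int end = m.Index + m.Length;
            if (end > maxLength)
            {
                break; // this chunk would overshoot the limit
            }
            splitAt = end;
        }

        // If even the first chunk is too long, splitAt stays 0 and First is empty.
        return (html.Substring(0, splitAt), html.Substring(splitAt));
    }
}
```

With the sample string above and a limit of 12, `Split` returns "<div>when " as the first part and "<b>in doubt</b> do nothing</div>" as the second.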

Never, ever, ever use a regular expression or string manipulation when dealing with HTML. Tasks like this are what HTML parsers like HtmlAgilityPack were made for!

If you can get to grips with a bit of XPath, then you can separate the content from the HTML, do your necessary splits and then rebuild the HTML around it however you wish.
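As a minimal sketch of the parser route, assuming the HtmlAgilityPack NuGet package is referenced: load the fragment, walk its top-level nodes, and keep appending whole nodes to the first part until the next one would exceed the limit. The class `HtmlSplitter` and method `SplitAtNodeBoundary` are hypothetical names for this example. It splits only between top-level nodes and does not descend into a node that is itself longer than the limit.

```csharp
using System.Text;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class HtmlSplitter
{
    // Hypothetical helper: every node lands whole in one part or the other,
    // so neither part ends in the middle of an element.
    public static (string First, string Second) SplitAtNodeBoundary(string html, int maxLength)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var first = new StringBuilder();
        var second = new StringBuilder();
        bool limitReached = false;

        foreach (HtmlNode node in doc.DocumentNode.ChildNodes)
        {
            string markup = node.OuterHtml;
            if (!limitReached && first.Length + markup.Length <= maxLength)
            {
                first.Append(markup);
            }
            else
            {
                // Once one node does not fit, everything after it goes to the second part.
                limitReached = true;
                second.Append(markup);
            }
        }

        return (first.ToString(), second.ToString());
    }
}
```

Calling `HtmlSplitter.SplitAtNodeBoundary(html, 1000)` would take the place of the `SplitWord` / `Substring` pair in the question. If the content is wrapped in a single outer element, the loop would need to run over that element's `ChildNodes` instead.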

The question related to JavaScript. HtmlAgilityPack says it is a .NET code library, so it runs server side and is therefore not appropriate in this instance.

This thread is in the .NET forum, which is the reason I posted a .NET library, unless I’m missing something?

Regardless, it’s a matter of language. HTML is too complex a language to parse with Regular Expressions, as this question on SO shows.

I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar - you can’t possibly hope to make this work. But many will try, some will claim success and others will find the fault and totally mess you up.

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
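To make the nesting problem concrete, here is a small sketch (the class and method names are invented for the example) showing how an ordinary pattern goes wrong on nested or repeated elements. A lazy match stops at the first closing tag it sees, and a greedy match runs to the last one, and neither counts nesting levels.

```csharp
using System.Text.RegularExpressions;

public static class NestingDemo
{
    // Lazy: an opening <div>, the shortest possible run, then a closing </div>.
    public static string LazyMatch(string html)
    {
        return Regex.Match(html, "<div>.*?</div>", RegexOptions.Singleline).Value;
    }

    // Greedy: an opening <div>, the longest possible run, then a closing </div>.
    public static string GreedyMatch(string html)
    {
        return Regex.Match(html, "<div>.*</div>", RegexOptions.Singleline).Value;
    }
}
```

`LazyMatch("<div><div>inner</div>outer</div>")` returns "<div><div>inner</div>", which has two opening tags and only one closing tag. `GreedyMatch("<div>a</div> and <div>b</div>")` returns the whole string, fusing two separate elements into one match.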

This thread is in the .NET forum,

Indeed it is - my mistake.

I think only a regular expression can help in this case; otherwise the code will get very heavy.

Read my post again.

It is a linguistic fact that a regular expression is not capable of handling HTML.

Ahem. Actually, with the .NET extensions to regular expressions (specifically balancing groups, which let you create expressions that match nesting levels) I would argue that it can be done.

It will not be pretty - and you are correct to point to alternative solutions. I’m just being obnoxious.
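For anyone curious what that looks like, below is a rough sketch of a .NET balancing-group pattern that matches one balanced `<div>` element. The class name, the group name `Depth`, and the pattern itself are illustrative assumptions, not something posted in this thread. It only knows about `<div>`, and it ignores comments, attribute values containing `>`, and malformed markup, which is a fair measure of how far it is from a real parser.

```csharp
using System.Text.RegularExpressions;

public static class BalancedDivDemo
{
    // Illustrative pattern only: it tracks <div> nesting and nothing else.
    private static readonly Regex BalancedDiv = new Regex(@"
        <div\b[^>]*>                      # outermost opening tag
        (?>
            <div\b[^>]*> (?<Depth>)       # nested opener: push onto the Depth stack
          | </div> (?<-Depth>)            # nested closer: pop, fails if the stack is empty
          | (?!<div\b[^>]*>|</div>) .     # any other single character
        )*
        (?(Depth)(?!))                    # fail if a nested opener is still unclosed
        </div>                            # closer that pairs with the outermost opener
        ",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the first balanced <div>...</div> element, or "" if there is none.
    public static string FirstBalancedDiv(string html)
    {
        return BalancedDiv.Match(html).Value;
    }
}
```

`FirstBalancedDiv("<div><div>a</div>b</div>tail")` returns "<div><div>a</div>b</div>", so the nesting is counted correctly here. Given the unbalanced input "<div><div>a</div>", it falls back to the inner "<div>a</div>".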

I think this SO post (despite it relating to JSON) covers my opinion on that method.

“Some systems offer extensions to regular expressions that kinda-sorta handle balanced expressions. However they’re all ugly hacks, they’re all unportable, and they’re all ultimately the wrong tool for the job.”

I’ve used extensions before (admittedly not with .NET) and it was far more trouble than it was worth, and it didn’t handle wild HTML code very well. More often than not if you’re performing a simple string task on a bit of HTML then the HtmlAgilityPack will do it in a couple of lines. I’d argue that it’s the best .NET library I’ve ever used, and like many developers it brings me great pain to see anyone using a regular expression to parse HTML.