How do I prevent backtracking in this regular expression?

carlosbcg · January 7, 2012, 3:49am

Hi there Perl users!

I am trying to do some parsing with PHP, but I figure you Perl coders are probably more adept at regular expressions than my PHP compatriots, so I am asking here.

Here’s the problem…

The test data is as follows:

</span> <a href="/someurl_here">Learn more</a> </div> <div id="pocs1"> Hi there. </div> <div id="pocs2">Press Enter.</div> </div> <div id="pets" style="color:#767676;display:none;font-size:9pt;margin:5px 0 0 8px">Press Enter.</div> </td> </tr> </table> </div> </div> </form> </div> <div id="asdfasdfsrchdsc"> </div> <div id="asdfsdb"> </div> <a href="http://www.domain.com/clubsinfo/cheese/cheeses_2/monthly_products.asp?itemid=30005&year=2009" class=l onmousedown="return rwt(this"><div id="nossln"></div> <div id="subform_ctrl"> </div> </div> <div id="holiday"> </div> <div id="appbar"> <div id="ab_name"><span></span></div> <div><div id=asdf>Page 8 <nobr> (0.18 seconds) </nobr></div></div> <ol id="ab_ctls"><li class="ab_ctl" id="ab_ctl_ss"><div'

Just a bunch of gibberish. But within it you will notice there are two links.

<a href="/someurl_here"

and

<a href="http://www.domain.com/clubsinfo/cheese/cheeses_2/monthly_products.asp?itemid=30005&year=2009"

What I want to do is capture ONLY the last link inside the quotation marks.

When I use the regular expression…

<a href="(.*?)"\sclass=l

I end up capturing everything from the first <a up to class=l, which includes both links.

How do I prevent my regular expression from backtracking to the first <a?

I have been beating my head against this for hours and have tried all kinds of ?!, ?=, ?>, ?<, and all manner of stuff and none of it works.

Would really appreciate any insight or tips you all could give me. Thanks!

Carlos

jurn · January 7, 2012, 6:06am

Hi Carlos,

You can change the '.' in your regexp, which matches any character,
to [^"], which matches any character except a double quote (").

Jurn
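
For anyone who finds this thread later, here is a minimal sketch of why the change works. Strictly speaking, the engine is not backtracking to the first <a. It simply tries the leftmost starting position first, and since '.' is allowed to match quotation marks, the lazy (.*?) keeps growing from the first link until it finally reaches "\sclass=l. With [^"] the capture cannot cross a closing quote, so the attempt at the first link fails and the engine moves on to the second one. The sketch is in Python only so that it runs standalone; the pattern should carry over unchanged to PHP's preg_match or to Perl. The html string is a shortened stand-in for the test data, assuming straight quotes.

```python
import re

# Shortened stand-in for the test data in the question (straight quotes assumed).
html = (
    '</span> <a href="/someurl_here">Learn more</a> </div> '
    '<div id="pocs1"> Hi there. </div> '
    '<a href="http://www.domain.com/clubsinfo/cheese/cheeses_2/'
    'monthly_products.asp?itemid=30005&year=2009" class=l '
    'onmousedown="return rwt(this">'
)

# Original pattern: '.' may match quotes, so the capture starts at the
# first link and runs on until the text before class=l.
lazy = re.search(r'<a href="(.*?)"\sclass=l', html)

# Suggested pattern: [^"] cannot match a quote, so the capture has to
# stay inside a single href value.
negated = re.search(r'<a href="([^"]*?)"\sclass=l', html)

print(lazy.group(1))     # starts at /someurl_here and spans both links
print(negated.group(1))  # only the URL of the link followed by class=l
```

For anything beyond a quick one-off extraction like this, an HTML parser is usually the safer tool, but for this test data the negated character class is enough.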

carlosbcg · January 7, 2012, 8:03pm

Thanks very, very much! That did it! Much simpler than the gibberish that one sees when searching for ways to extract links from HTML on the Internet.

Carlos