Getting exact word regex

Hi!

I have a list of keywords that I want to extract from a string that will contain HTML. I only want to match the exact word on its own, for example:

String = "This is a long string <a href="">cat</a> link. I love cats. I love cat.";

Word = "cat";

I only want to match the word “cat” when it’s on its own. I don’t want it when it’s part of a link, or part of another word such as “cats”. Just “cat”. I will probably also need to handle punctuation like full stops and commas.

My current code is:


preg_match_all('/\\b'.$word.'\\b/i', $string, $matches);

This sort of works, except that it also matches the word when it is inside a link (which I don’t want to happen).

Any ideas for this? I feel like it should be simple but I’m a bit stuck haha.

Thank you to anyone who helps :slight_smile:

I don’t understand the question. Which of the three “cat”s (numbers 1, 2, and 3) do you want to match, and which not, in this string?

“This is a long string <a href="">cat</a> link. I love cats. I love cat.”

The 3rd cat only, sorry if that didn’t make sense X)

You could feed it back into itself with an if/else statement.

First you’ll do a preg_match for “href” and if it appears, then the script ends because it’s a link, which you don’t want. If it does not appear, then you do the preg_match again for “cat”.

So it will return true for “cat”, “cats”, and “tabby-cat.jpg”, but not for <a href="cat.html">.

If you use a single regular expression instead, you could write a pattern that accepts “cat” but rejects it when the string contains “href”. See http://www.tutorialspoint.com/php/php_regular_expression.htm

	<h3>cats.html</h3>
	<?php
	$string = "<a href='cats.html'>";
	$pattern = '/cat/i';
	// Any string containing "href" is treated as a link and rejected outright.
	if (preg_match('/href/i', $string))
	{
		echo "This is a link containing 'cat'.";
		exit();
	}
	else
	{
		// No link present, so a plain search for the keyword is enough.
		if (preg_match($pattern, $string))
		{
			echo "The string contains 'cat'";
		}
		else
		{
			echo "The string does not contain 'cat'";
		}
	}
	?>
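
For the single-pattern route mentioned above, one option is PCRE's backtracking control verbs. The following is only a sketch, not code from this thread: the function name `matchOutsideTags` is made up for illustration, and it assumes reasonably well-formed markup (no `>` inside attribute values, no nested anchors). The first two alternatives consume whole `<a>...</a>` spans and any other tag, then `(*SKIP)(*FAIL)` discards them, so only a bare word outside of those can match.

```php
<?php
// Hypothetical helper (not from the thread): skip anchors and tags,
// then match the bare word wherever it is left over.
function matchOutsideTags(string $html, string $word): array
{
    $pattern = '~<a\b[^>]*>.*?</a>(*SKIP)(*FAIL)'   // whole <a ...>...</a> spans
             . '|<[^>]*>(*SKIP)(*FAIL)'             // any other tag, attributes included
             . '|\b' . preg_quote($word, '~') . '\b~is';

    preg_match_all($pattern, $html, $matches, PREG_OFFSET_CAPTURE);

    // Each entry is [matched text, byte offset in $html].
    return $matches[0];
}

$string = 'This is a long string <a href="">cat</a> link. I love cats. I love cat.';
$hits = matchOutsideTags($string, 'cat');
```

With the example string, this should leave only the final “cat” in `$hits`. Unlike the href check, it works per occurrence, so a post can contain links and still have its unlinked keywords found.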

This is how I’d do it


$string = 'This is a long string <a href="">cat</a> link. I love cats. I love cat.';

$tmp = $string;

while( ($x = stripos($tmp, '<a ')) !== FALSE && ($y = stripos($tmp, '</a>')) !== FALSE ) {
    $tmp = substr($tmp,0,$x).str_repeat('*',$y-$x).substr($tmp,$y);
}

$word = 'cat';

preg_match_all('/\\b'.$word.'\\b/i', $tmp, $matches, PREG_OFFSET_CAPTURE);

Slight correction:

Change this:

$tmp = substr($tmp,0,$x).str_repeat('*',$y-$x).substr($tmp,$y);

for this:

$tmp = substr($tmp,0,$x).str_repeat('*',$y+4-$x).substr($tmp,$y+4);
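
Putting the masking idea and the correction together, a self-contained version might look like the sketch below. The function names `maskAnchors` and `findStandaloneWord` are invented here for illustration. Two small changes are assumptions of mine rather than part of the posts above: the search for `</a>` starts at `$x`, so a stray closing tag earlier in the string cannot produce a negative repeat count, and the keyword goes through `preg_quote` in case it contains regex metacharacters. The `+ 4` is the length of `</a>`, which is why the closing tag gets masked as well.

```php
<?php
// Hypothetical helper: overwrite every <a ...>...</a> span with '*' so that
// offsets in the masked copy still line up with the original string.
function maskAnchors(string $html): string
{
    $masked = $html;
    $from = 0;
    while (($x = stripos($masked, '<a ', $from)) !== false
        && ($y = stripos($masked, '</a>', $x)) !== false) {
        $end = $y + 4; // 4 = strlen('</a>')
        $masked = substr($masked, 0, $x)
                . str_repeat('*', $end - $x)
                . substr($masked, $end);
        $from = $end;
    }
    return $masked;
}

// Hypothetical helper: whole-word, case-insensitive matches outside links.
function findStandaloneWord(string $html, string $word): array
{
    $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
    preg_match_all($pattern, maskAnchors($html), $matches, PREG_OFFSET_CAPTURE);
    return $matches[0]; // list of [matched text, offset]
}

$string = 'This is a long string <a href="">cat</a> link. I love cats. I love cat.';
$found = findStandaloneWord($string, 'cat');
```

Because the masked copy has the same length as the original, the offsets in `$found` can be used directly against `$string`. Note that this only recognises anchors written as `<a ` followed by attributes; other tags and attribute values are not masked.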

Hi guys,

Thanks for your responses!

I don’t think either of those solutions will solve my problem. The string could contain pretty much anything (any HTML, or any random stuff; it’s a user’s WordPress post). The process is to go through the whole post looking for the single word on its own. I think the second solution is closer to what I want, but not exactly, since a user could have already linked “cat”, and I don’t want to remove that link and then link it again. I only want to link the word if it’s on its own and untouched (if that makes sense).

So to clarify the process is:

  1. Get the user’s submitted WordPress post which can contain HTML.
  2. Look through the post for single keywords (in this case, the example was cat).
  3. If the cat is completely on its own (no HTML surrounding it, not part of another word) then link it.
  4. Continue until end of the WordPress post.

Thank you again guys for your help!

I ended up using a solution that involves DOMDocument. It seems to work well so far; I just need to test it against lots of different scenarios to see if it breaks. Thanks again :slight_smile:
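
The actual code was not posted, so the following is only a guess at what a DOMDocument-based approach could look like, not the solution that was used. The function name `linkKeyword`, the wrapper id `kw-wrap`, and the `/tag/cat` URL are all made up for illustration. The idea is to parse the post, visit only text nodes that have no `<a>` ancestor, and wrap whole-word matches in a new link. Character encoding handling for non-ASCII posts is left out of this sketch.

```php
<?php
// Hypothetical helper: link standalone occurrences of $word that are not
// already inside an <a> element. A sketch, not the code used in the thread.
function linkKeyword(string $html, string $word, string $url): string
{
    $previous = libxml_use_internal_errors(true); // silence warnings on odd markup
    $doc = new DOMDocument();
    // Wrap the fragment so it can be located again after parsing.
    $doc->loadHTML('<div id="kw-wrap">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $pattern = '/\b(' . preg_quote($word, '/') . ')\b/i';

    // Text nodes inside the wrapper that have no <a> ancestor.
    $nodes = $xpath->query('//div[@id="kw-wrap"]//text()[not(ancestor::a)]');
    foreach (iterator_to_array($nodes) as $textNode) {
        // Odd indexes hold the captured keyword, even indexes the text between.
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // keyword not present in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('href', $url);
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialise only the wrapper's children, not the added html/body shell.
    $out = '';
    foreach ($xpath->query('//div[@id="kw-wrap"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$linked = linkKeyword('I love cats. <a href="/x">cat</a> and cat.', 'cat', '/tag/cat');
```

In this example the existing link is left alone, “cats” is not touched, and only the last “cat” is wrapped. A real post would probably also need to skip text inside elements such as `<script>`, `<style>`, or `<code>`, which this sketch does not do.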