Regex to extract whole sentences that contain a certain word

As per the title, this is my failed attempt.

Target string:

Start of sentence one. This is a wordmatch one two three four. Another, sentence here.

Regular expression:


\b[A-Z].*?(wordmatch).*?\b

Expected match:


This is a wordmatch one two three four

Actual match:


Start of sentence one. This is a wordmatch

I’m a bit stumped on this one. Is there a nice punctuation escape character I’m missing out on? I understand why the word boundary won’t work, but am I really going to have to create a character class to try and guess punctuation rules?

You have to imagine that you are a PC. What defines a sentence for a machine?
A machine won’t analyze the context of a piece of text to decide whether it is a sentence or not, so we need boundaries. The classic boundaries for a sentence are punctuation marks, so unfortunately you will have to go down that road…


$str = 'Start of sentence one. This is a wordmatch one two three four. Another, sentence here.';
$regex = '/[A-Z][^\\.;]*(wordmatch)[^\\.;]*/';

if (preg_match($regex, $str, $match))
    echo $match[0];

Thanks Fristi. Now try this:


$string = 'Having using Kaspersky Antivirus in the past, and been highly impressed, I found myself looking for a new antivirus for a freshly built PC. I had been using AVG 7.5 for the last year, and after becoming fed up of being nagged to use the paid version of AVG8 I decided to try the latest offering from Kaspersky Labs, in the form of the 2009 version of their anti-virus software.';
$regex = '/[A-Z][^\\.;]*(virus)[^\\.;]*/';

if (preg_match_all($regex, $string, $match))
    print_r($match);

See how it’s incorrectly matching the second sentence.

Do you want the exact word virus matched or any word that contains the word virus?

What do you want the regex to match in this example?

Hi fristi,

I want it to match any whole sentence that begins with, ends with or contains a string. In this case the string is virus. So it should match any whole sentence that contains the word virus. In this case it will be:

Having using Kaspersky Antivirus in the past, and been highly impressed, I found myself looking for a new antivirus for a freshly built PC

I had been using AVG 7.5 for the last year, and after becoming fed up of being nagged to use the paid version of AVG8 I decided to try the latest offering from Kaspersky Labs, in the form of the 2009 version of their anti-virus software

I.e. both sentences, because they both contain the word virus.

The problem with the second one is that it contains 7.5, so the parser ends the previous sentence at the 7 and can’t start the next one, because the next character is a 5 instead of a capital letter. This is a tricky one, since a . doesn’t always mean a sentence boundary.

I don’t know if it can be done, I’ll look into it.
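One possible direction, as a rough sketch only: treat a dot as a sentence boundary unless it sits between two digits. The function names `splitSentences` and `sentencesContaining` are made up for this illustration, and the code assumes PHP 7.4+ (arrow functions). It still breaks on abbreviations such as "e.g." and on numbers written like ".5".

```php
<?php
// Sketch: split on . ! ? ; but skip a dot that has a digit on both sides (7.5).
function splitSentences(string $text): array
{
    // (?<!\d)\.  a dot NOT preceded by a digit
    // \.(?!\d)   a dot NOT followed by a digit
    // Together they reject only the digit-dot-digit case.
    $boundary = '/(?:(?<!\d)\.|\.(?!\d)|[!?;])+\s*/';
    $parts = preg_split($boundary, $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_map('trim', $parts);
}

// Keep only the sentences that contain $keyword (case-insensitive).
function sentencesContaining(string $text, string $keyword): array
{
    return array_values(array_filter(
        splitSentences($text),
        fn (string $s): bool => stripos($s, $keyword) !== false
    ));
}

$text = 'I built a new PC. I had been using AVG 7.5 for the last year. No match here.';
print_r(sentencesContaining($text, 'avg'));
```

With the sample text above, only the middle sentence is returned, and "7.5" stays in one piece.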

Just to give you a bit of background on what I’m trying to do, so you don’t think it’s a fruitless exercise: I’m doing a MySQL fulltext search, but I want to show a helpful search snippet in my results, preferably highlighting the match in bold like Google does. It’d be easy to find the match using strpos and then just pick 50 chars either side, but I want the search snippet to have some context. If you look at Google, their snippets all start with the beginning of a sentence, rather than mid-sentence. They also don’t cut off parts of any words when they cut off the snippet.

Therefore, this is what I’m attempting to do, and this regex is the start of it.
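For comparison, the simpler "strpos plus 50 chars either side" idea can at least be made word-safe. The sketch below is only an illustration: `buildSnippet` is a hypothetical helper, it works on bytes (so it is not multibyte-safe), and it does not try to start at the beginning of a sentence.

```php
<?php
// Hypothetical helper: cut a window around the first occurrence of $keyword,
// widen it to whole words, and wrap each occurrence in <b> tags.
function buildSnippet(string $text, string $keyword, int $radius = 50): ?string
{
    $pos = stripos($text, $keyword);
    if ($pos === false) {
        return null; // keyword not present
    }

    $start = max(0, $pos - $radius);
    $end   = min(strlen($text), $pos + strlen($keyword) + $radius);

    // Move $start back to the beginning of the word it landed in.
    while ($start > 0 && !ctype_space($text[$start - 1])) {
        $start--;
    }
    // Move $end forward to the end of the word it landed in.
    while ($end < strlen($text) && !ctype_space($text[$end])) {
        $end++;
    }

    // Escape for HTML first, then highlight case-insensitively.
    $snippet = htmlspecialchars(substr($text, $start, $end - $start));
    $pattern = '/' . preg_quote(htmlspecialchars($keyword), '/') . '/i';
    return preg_replace($pattern, '<b>$0</b>', $snippet);
}

echo buildSnippet('one two three virus four five six', 'virus', 3);
// prints: three <b>virus</b> four
```

The radius is applied before the window is widened, so the snippet can be slightly longer than twice the radius plus the keyword.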

Let’s keep our fingers crossed


$string = 'Having using Kaspersky Antivirus in the past, and been highly impressed,
           I found myself looking for a new antivirus for a freshly built PC.
           I had been using AVG 7.5 for the last year, and after becoming fed up of
           being nagged to use the paid version of AVG8 I decided to try the latest
           offering from Kaspersky Labs, in the form of the 2009 version of their anti-virus software.';


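// Sentence pattern: a capital letter, then anything up to the next . ; ? or !
$regex = '/[A-Z][^.;?!]*(virus)[^.;?!]*/';
// Temporarily turn decimal points (7.5) into commas so they don't end a sentence.
$string = preg_replace('/(\d+)\.(\d+)/', '$1,$2', $string);

if (preg_match_all($regex, $string, $match)) {

    // Restore the decimal points in the matched sentences.
    // Caveat: this also rewrites any genuine comma that sits between two digits.
    foreach ($match[0] as &$str)
        $str = preg_replace('/(\d+),(\d+)/', '$1.$2', $str);
    unset($str); // break the reference left behind by foreach

    print_r($match);
}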

The problem remains that until you can describe exactly what a “sentence” is, you cannot hope to instruct the computer to isolate it for you.

Say you came across a badly written sentence of 2000 chars: you wouldn’t want to display it all, would you?

And if you came across a sentence of just 2 words, that would not convey much information to the user either, would it?

What you are trying to do is also described as creating a "Document Surrogate", one of the things I found out about in search patterns (http://paulgeraghty.posterous.com/search-patterns).

There seem to be two ways to go on this. You either:

a) try and constrain your mysql full text search in the database first - only bring back x chars from the table

b) bring back everything from the table

If you decide on a), explode on . and choose the array item which contains the word.
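As a minimal sketch of that explode approach (the function name `firstPieceContaining` is just for illustration):

```php
<?php
// Sketch of option a): split on '.' and return the first piece containing $word.
function firstPieceContaining(string $text, string $word): ?string
{
    foreach (explode('.', $text) as $piece) {
        if (stripos($piece, $word) !== false) {
            return trim($piece);
        }
    }
    return null; // no piece contains the word
}

echo firstPieceContaining(
    'Start of sentence one. This is a wordmatch one two three four. Another, sentence here.',
    'wordmatch'
);
// prints: This is a wordmatch one two three four
```

Note that a plain explode on . has the same weakness discussed earlier in the thread: text such as "AVG 7.5" is cut in two at the decimal point.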

If you elect for b), you are potentially able to show EACH full-text scoring term. E.g. if you have 2 articles mentioning virus, and the first says virus once but the second says it twice, then MySQL will score the second article higher than the first, so shouldn’t you display BOTH words in some context? E.g.

Your results:

  1. “I was going to buy a virus checker and thought, hell why bother? Just connect to the web when they are asleep. Take that virus.”

  2. “I think I caught the virus when working in the laundry, all those sleeves.”
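One way to show every occurrence with a little context, sketched under assumptions: `allContexts` is a hypothetical helper that grabs up to N whole words either side of each word containing the term. If two occurrences sit closer together than N words, the second one ends up inside the first match instead of getting its own entry.

```php
<?php
// Hypothetical helper: return each occurrence of $word with up to
// $wordsEachSide whole words of context before and after it.
function allContexts(string $text, string $word, int $wordsEachSide = 3): array
{
    $w = preg_quote($word, '/');
    $n = $wordsEachSide;
    // (?:\S+\s+){0,n}  up to n words before
    // \S*word\S*       the word containing the term (e.g. "anti-virus")
    // (?:\s+\S+){0,n}  up to n words after
    $pattern = '/(?:\S+\s+){0,' . $n . '}\S*' . $w . '\S*(?:\s+\S+){0,' . $n . '}/i';
    preg_match_all($pattern, $text, $m);
    return $m[0];
}

print_r(allContexts(
    'I was going to buy a virus checker and thought, hell why bother? Take that virus.',
    'virus',
    2
));
```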

It’s less about sentences, more about what conveys the most information to your users while staying manageable.

<?php

$str = 'Having used Kaspersky Antivirus in the past, and been highly impressed,
           I found myself looking for a new antivirus for a freshly built PC!!!
           This is a sentence that should not match&hellip;
           I\'d been using AVG 7.5 for the last year, and after becoming fed up of
           being nagged to use the paid version of AVG8 I decided to try the latest
           offering from Kaspersky Labs, in the form of the 2009 version of their anti-virus software.';




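// A sentence boundary: a run of ! ? . ; characters, or an HTML ellipsis entity.
$bound = '(?:[!?.;]+|&hellip;)';
// Sentence body: any non-boundary, non-digit character, or a number such as 7.5 or 2009.
$filler = '(?:[^!?.;\d]|\d*\.?\d+)*';
$keyword = 'virus';
// Prepend "!" so the first sentence also has a boundary in front of it.
preg_match_all("#{$bound}({$filler}{$keyword}{$filler})(?={$bound})#si", "!$str", $matches);
echo $str, '<hr/><pre>', print_r($matches, true), '</pre>';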

?>