Google News - How does it do it?

footballud · April 10, 2010, 9:56am

I've posted this in here mainly because PHP is the only language I know well.

I have news on my site, which is pretty basic: I added a couple of RSS feeds, and I have a script that loops through them and matches the stories against a list of team names. What I'm finding is that I end up with a load of duplicated stories, since, as you might expect, all the major sites post a similar story about the same thing. I want to group these together. I've been thinking about it for a while and can't really get a picture of the best way to go about it.

The simplest method I came up with was this: if a story matches the same team and the same player and is published within an hour of the original, then the two may well be about the same thing. But this seemed a bit crap, to be honest.
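For reference, a minimal sketch of that rule, assuming each story has already been reduced to a team, a player and a publication time. The array keys, the function name and the sample values are all made up for illustration:

```php
<?php
// Hypothetical story records: 'team', 'player' and 'published' (a Unix timestamp)
// are assumed keys, filled in by whatever matches feed items to team names.
function probablySameStory(array $a, array $b, int $windowSeconds = 3600): bool
{
    return $a['team'] === $b['team']
        && $a['player'] === $b['player']
        && abs($a['published'] - $b['published']) <= $windowSeconds;
}

// Made-up sample data.
$original = array(
    'team'      => 'Team A',
    'player'    => 'Player X',
    'published' => strtotime('2010-04-10 09:00:00 UTC'),
);
$later = array(
    'team'      => 'Team A',
    'player'    => 'Player X',
    'published' => strtotime('2010-04-10 09:40:00 UTC'),
);

echo probablySameStory($original, $later) ? "same story?\n" : "different\n";
```

The weakness is visible in the code: two unrelated stories about the same player within the hour would be merged, and a late follow-up would be missed.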

I wondered if anyone could explain how Google News groups its stories. I've done a bit of searching on the net, and there are several sites that explain the principle, and I can see that it finds matching stories, but nothing that goes into how it is done technically. I'm not looking for anyone to give me code or anything like that (I'm quite looking forward to having a go at coding it myself); I was more hoping to get a bit of a discussion going about how it can be done.

Thanks

Paul_Wilkins · April 10, 2010, 12:06pm

You may want to start by having a look at the similar_text function.
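A rough sketch of how it might be used on headlines. The helper name is made up, and lower-casing before comparing is just an assumption about what helps here:

```php
<?php
// Hypothetical helper: percentage similarity (0 to 100) between two headlines.
function headlineSimilarity(string $a, string $b): float
{
    // similar_text() returns the number of matching characters and writes
    // the similarity as a percentage into its third (by-reference) argument.
    similar_text(strtolower($a), strtolower($b), $percent);
    return $percent;
}

$base = 'Free mugs with every purchase';
foreach (array('Mugs, free with every purchase', 'Anthony likes sweets') as $other) {
    printf("%s: %.1f%%\n", $other, headlineSimilarity($base, $other));
}
```

Where to draw the line between "similar" and "not similar" would need testing against real headlines.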

AnthonySterling · April 10, 2010, 12:13pm

Hm, interesting…


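// Sample headlines, some of which describe the same story.
$headlines = array(
    'Anthony likes M&Ms',
    'Anthony likes sweets',
    'Free mugs with every purchase',
    'Sport, its a mugs game',
    'Mugs, free with every purchase',
);

// Maximum Levenshtein distance (single-character edits) at which two
// headlines are treated as possibly related.
$tolerance = 17;

foreach ($headlines as $headline) {
    // Escape the headline so characters such as & are valid in the HTML output.
    echo '<h1>', htmlspecialchars($headline), '</h1>';
    echo '<p>Possibly related stories:-</p>';
    echo '<ul>';
    foreach ($headlines as $related) {
        // Skip the headline itself; list every other headline within the tolerance.
        if ($headline !== $related && levenshtein($headline, $related) <= $tolerance) {
            echo '<li>', htmlspecialchars($related), '</li>';
        }
    }
    echo '</ul>';
}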


[B]Anthony likes M&Ms[/B]
Possibly related stories:-
    * Anthony likes sweets

[B]Anthony likes sweets[/B]
Possibly related stories:-
    * Anthony likes M&Ms

[B]Free mugs with every purchase[/B]
Possibly related stories:-
    * Mugs, free with every purchase

[B]Sport, its a mugs game[/B]
Possibly related stories:-
     *

[B]Mugs, free with every purchase[/B]
Possibly related stories:-
    * Free mugs with every purchase

footballud · April 10, 2010, 9:59pm

That's really interesting. I had tried the levenshtein function a while back on something different and dismissed it, but I had not considered testing whether the distance falls within a certain tolerance. From that experiment, it definitely seems to offer some potential.
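One thing to watch with a fixed tolerance is that the same number of edits means something different for a short headline than for a long one. A minimal sketch of one possible workaround, assuming it makes sense to divide the distance by the length of the longer headline. The function name and the 0.4 cut-off are illustrative guesses, not tuned values:

```php
<?php
// Hypothetical helper: Levenshtein distance scaled to the range
// 0.0 (identical) to 1.0 (nothing in common).
// Note: levenshtein() and strlen() both count bytes, not multi-byte characters.
function normalisedDistance(string $a, string $b): float
{
    $longest = max(strlen($a), strlen($b));
    if ($longest === 0) {
        return 0.0; // two empty strings are identical
    }
    return levenshtein($a, $b) / $longest;
}

$cutOff = 0.4; // assumed threshold; would need tuning against real headlines

$a = 'Free mugs with every purchase';
$b = 'Mugs, free with every purchase';
echo normalisedDistance($a, $b) <= $cutOff ? "possibly related\n" : "unrelated\n";
```

Before PHP 8.0, levenshtein() only accepts strings of up to 255 characters, which is fine for headlines but not for full article text.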

I'll take a look at the similar_text function as well. I've been looking at a couple of sites that do news comparison, and they could definitely be using some form of that levenshtein function.

Anybody else have any suggestions? I'll post back on the current ones in a day or two, once I've added them to my current script. I know Google are notoriously secretive, but has anyone read anything about Google News?