I've posted this here mainly because PHP is the only language I know well.
I have a fairly basic news section on my site. I added a couple of RSS feeds, and I have a script that loops through them and matches the stories against a list of team names. What I'm finding is that I end up with a load of duplicated stories: as you might expect, all the major sites post a similar story about the same thing. I want to try to group these together. I've been thinking about it for a while and can't really picture the best way to go about it.
The simplest method I came up with was this: if a story matches the same team and the same player and was published within an hour of the original, then the two may well be about the same thing. That seemed a bit crap, to be honest.
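For reference, that team/player/time-window rule could be sketched roughly as below. The array keys (`team`, `player`, `published`) and the function name `probablySameStory` are made up for illustration and are not taken from any existing script; `published` is assumed to be a Unix timestamp.

```php
<?php
// Hypothetical helper: two stories are treated as duplicates when they share
// a team and a player and were published within $windowSeconds of each other.
function probablySameStory(array $a, array $b, int $windowSeconds = 3600): bool
{
    return $a['team'] === $b['team']
        && $a['player'] === $b['player']
        && abs($a['published'] - $b['published']) <= $windowSeconds;
}

// Made-up sample records; timestamps are plain integers for simplicity.
$first  = array('team' => 'Team A', 'player' => 'Player X', 'published' => 1000000000);
$second = array('team' => 'Team A', 'player' => 'Player X', 'published' => 1000001800); // 30 minutes later
$third  = array('team' => 'Team A', 'player' => 'Player X', 'published' => 1000007200); // 2 hours later

var_dump(probablySameStory($first, $second));
var_dump(probablySameStory($first, $third));
```

The obvious weakness is that two unrelated stories about the same player in the same hour would still be grouped, which is why some comparison of the text itself is probably needed as well.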
I wondered if anyone could explain how Google News groups its stories. I've done a bit of searching on the net, and several sites explain the principle, so I can see that it finds matching stories, but nothing goes into how it is done technically. I'm not looking for anyone to give me code or anything like that; I'm quite looking forward to having a go at coding it myself. I was more hoping to get a bit of a discussion going about how it can be done.
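// Sample headlines to compare against one another.
$headlines = array(
    'Anthony likes M&Ms',
    'Anthony likes sweets',
    'Free mugs with every purchase',
    'Sport, its a mugs game',
    'Mugs, free with every purchase',
);

// Maximum Levenshtein distance at which two headlines count as related.
$tolerance = 17;

foreach ($headlines as $headline) {
    // Escape for HTML output so characters such as & display correctly.
    echo '<h1>', htmlspecialchars($headline), '</h1>';
    echo '<p>Possibly related stories:-</p>';
    echo '<ul>';
    foreach ($headlines as $related) {
        // Skip the headline itself; list any other headline within tolerance.
        if ($headline !== $related && levenshtein($headline, $related) <= $tolerance) {
            echo '<li>', htmlspecialchars($related), '</li>';
        }
    }
    echo '</ul>';
}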
[B]Anthony likes M&Ms[/B]
Possibly related stories:-
* Anthony likes sweets
[B]Anthony likes sweets[/B]
Possibly related stories:-
* Anthony likes M&Ms
[B]Free mugs with every purchase[/B]
Possibly related stories:-
* Mugs, free with every purchase
[B]Sport, its a mugs game[/B]
Possibly related stories:-
*
[B]Mugs, free with every purchase[/B]
Possibly related stories:-
* Free mugs with every purchase
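One possible refinement, offered as a sketch rather than a tested recipe: a fixed tolerance allows proportionally more difference in a short headline than in a long one, so the distance could be divided by the length of the longer headline after stripping case and punctuation. The function names `normaliseHeadline` and `headlineDistanceRatio` are hypothetical, and any cut-off ratio would need tuning against real feeds.

```php
<?php
// Lowercase the headline, drop anything that is not a letter, digit or space,
// and collapse runs of whitespace.
function normaliseHeadline(string $headline): string
{
    $headline = strtolower($headline);
    $headline = preg_replace('/[^a-z0-9 ]+/', '', $headline);
    return trim(preg_replace('/\s+/', ' ', $headline));
}

// Levenshtein distance divided by the longer normalised length:
// 0.0 means identical after normalisation, 1.0 means nothing lines up.
function headlineDistanceRatio(string $a, string $b): float
{
    $a = normaliseHeadline($a);
    $b = normaliseHeadline($b);
    $longest = max(strlen($a), strlen($b));
    if ($longest === 0) {
        return 0.0; // both headlines were empty after normalisation
    }
    return levenshtein($a, $b) / $longest;
}

echo headlineDistanceRatio('Anthony likes M&Ms', 'Anthony likes sweets'), "\n";
echo headlineDistanceRatio('Free mugs with every purchase', 'FREE MUGS, with every purchase!'), "\n";
```

Note that before PHP 8.0, levenshtein() only accepts strings of up to 255 characters, which is fine for headlines but not for whole articles.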
That's really interesting. I had tried the levenshtein function a while back on something different and dismissed it, but I had not considered testing whether the distance falls within a certain tolerance. From that experiment it definitely seems to offer some potential.
I'll take a look at the similar_text function as well. I've been looking at a couple of sites that do news comparison, and they could definitely be using some form of that levenshtein function.
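As a rough idea of how similar_text() might slot into the same kind of loop, here is a minimal sketch. It relies on the function's optional third argument, which receives the similarity as a percentage; the helper names `similarityPercent` and `relatedHeadlines` and the 70% threshold are assumptions for illustration only.

```php
<?php
// Similarity of two headlines as a percentage (0 to 100), ignoring case.
function similarityPercent(string $a, string $b): float
{
    similar_text(strtolower($a), strtolower($b), $percent);
    return (float) $percent;
}

// Return every candidate (other than the headline itself) whose similarity
// to $headline is at least $minPercent.
function relatedHeadlines(string $headline, array $candidates, float $minPercent = 70.0): array
{
    $related = array();
    foreach ($candidates as $candidate) {
        if ($candidate !== $headline && similarityPercent($headline, $candidate) >= $minPercent) {
            $related[] = $candidate;
        }
    }
    return $related;
}

$headlines = array(
    'Free mugs with every purchase',
    'Mugs, free with every purchase',
    'Sport, its a mugs game',
);
print_r(relatedHeadlines($headlines[0], $headlines));
```

A percentage has the advantage of being independent of headline length, though similar_text() is slower than levenshtein() on long strings.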
If anybody else has any suggestions, please post them. I'll report back on the current ones in a day or two, once I've added them to my script. I know Google are notoriously secretive, but has anyone read anything about Google News?