I run an article directory. I was wondering how I could check an article that is currently pending review against the published articles in our database. What I want to do is put the body of the pending article into a variable. That much is easy and I know how to do it. Then I would like to compare that variable against the other content in our database and have it output a percentage showing how close it is to any article already published there.
So basically, let's say:
$body="This is the text in the article that is pending.";
How could I take this variable, run a query to check the database for similar articles, and have it return the highest percentage match?
For example: If the above variable is an exact duplicate of an article in my database it should tell me there was a 100% match. Or, if it was close to another article written it may return 35% match.
Any direction or guidance with this would be greatly appreciated.
I found this on this website and it seems like they were trying to do something similar.
$body1="This is the text in the article that is pending.";
$body2="This is the pending text in the article, that is.";
I think you have to be clear about which of these you want:
a) a 100% match, because all the words are the same, albeit in a different order
b) a straight "diff" operation which tells you, character for character, how many changes are needed for body1 to match body2 (probably only about a 30% match). This is roughly what levenshtein() does, although it returns a count of edits rather than a percentage.
If it is a) you want, then it appears to involve splitting the string up into words?
Something must already exist to do this; it sounds as if it would be useful in many scenarios.
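For what it's worth, here is a rough sketch of option a): split both strings into words and count how many words they share, ignoring order. The function names words() and wordMatchPercent() are made up for this example, and it assumes plain English text (lowercase ASCII letters, digits and apostrophes), so treat it as a starting point rather than a finished solution.

```php
<?php
// Split a string into lowercase words, dropping punctuation.
// Assumption: plain ASCII English text.
function words(string $text): array
{
    preg_match_all("/[a-z0-9']+/", strtolower($text), $matches);
    return $matches[0];
}

// Percentage of words the two texts have in common, ignoring word order.
// Repeated words are counted as many times as they appear in both texts.
function wordMatchPercent(string $a, string $b): float
{
    $countsA = array_count_values(words($a));
    $countsB = array_count_values(words($b));

    $shared = 0;
    foreach ($countsA as $word => $n) {
        $shared += min($n, $countsB[$word] ?? 0);
    }

    // Divide by the longer text's word count so extra words lower the score.
    $total = max(array_sum($countsA), array_sum($countsB));
    return $total === 0 ? 0.0 : 100.0 * $shared / $total;
}

$body1 = "This is the text in the article that is pending.";
$body2 = "This is the pending text in the article, that is.";
echo wordMatchPercent($body1, $body2), "\n";
```

Because body1 and body2 contain exactly the same words, this scores them as a full match even though the order differs. Whether that is what you want for detecting duplicates is a separate question, since two quite different sentences can share most of their words.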
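PHP does ship two built-in functions that get close: similar_text(), which can report a percentage directly, and levenshtein(), which returns an edit distance that you have to convert yourself. The sketch below shows both; levenshteinPercent() is a helper name invented for this example, and dividing by the longer string's length is just one reasonable way to turn the distance into a percentage. Note that before PHP 8.0 levenshtein() only accepted strings of up to 255 characters, and both functions can get slow on long texts, so full article bodies may need a different approach.

```php
<?php
$body1 = "This is the text in the article that is pending.";
$body2 = "This is the pending text in the article, that is.";

// similar_text() returns the number of matching characters and writes
// the similarity percentage into its third (by-reference) argument.
similar_text($body1, $body2, $similarTextPercent);

// levenshtein() returns the number of single-character edits needed to
// turn one string into the other. Scale it by the longer string's length
// to get a rough percentage (an assumption, not an official formula).
function levenshteinPercent(string $a, string $b): float
{
    $longest = max(strlen($a), strlen($b));
    if ($longest === 0) {
        return 100.0; // two empty strings are identical
    }
    return 100.0 * (1 - levenshtein($a, $b) / $longest);
}

printf("similar_text: %.1f%%\n", $similarTextPercent);
printf("levenshtein:  %.1f%%\n", levenshteinPercent($body1, $body2));
```

Both of these compare character by character, so they behave like option b): reordering the words lowers the score even when every word is the same.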
(I'm no greater expert on this than I was in the post you referred to)
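Coming back to the original question of getting the highest match out of everything already published, a minimal sketch could loop over the published bodies and keep the best score. Here the $published array is a stand-in for rows fetched from the database (the IDs and texts are invented), closestMatch() is a hypothetical helper, and similar_text() is used only because it is built in; any of the scoring approaches could be swapped in.

```php
<?php
// Returns [articleId, percent] for the closest published article,
// or [null, 0.0] when there is nothing to compare against.
function closestMatch(string $pending, array $published): array
{
    $bestId = null;
    $bestPercent = 0.0;
    foreach ($published as $id => $body) {
        // similar_text() writes the percentage into $percent by reference.
        similar_text($pending, $body, $percent);
        if ($bestId === null || $percent > $bestPercent) {
            $bestId = $id;
            $bestPercent = $percent;
        }
    }
    return [$bestId, $bestPercent];
}

// Hypothetical rows; in practice these would come from a SELECT
// of each published article's id and body.
$published = [
    101 => "An unrelated article about gardening in small spaces.",
    102 => "This is the text in the article that is pending.",
];

$body = "This is the text in the article that is pending.";
[$id, $percent] = closestMatch($body, $published);
printf("Closest match: article %d at %.0f%%\n", $id, $percent);
```

Comparing the pending article against every published one in PHP will not scale well to a large directory, so in practice you would probably want to narrow the candidates first (for example with a full-text search in the database) and only score that shortlist.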