Find similar patterns and extract data

I’m trying to extract “three” and/or “(3)” from all of these patterns. The strings are never the same length and the words are sometimes misspelled, so a simple preg_match pattern won’t work.

There are a lot of similar words that precede the string though. A perfect sentence is “PRIMARY TERM: This lease shall remain in force for a primary term of three (3) years from the effective date hereof, and as long thereafter…”

My initial thoughts are to use similar_text and preg_match, but I haven’t thought up a good way just yet. Any ideas how this could be done?


$String = '.... remain in force for a primy term OL three (3) years from the effective date hereof, and as lo....';
$String2 = '.... *JE~ in force for a primary torm of three (*B years from the effective date hereof, .......';
$String3 = '.... remain in farce for a primary term OL threA (3 years from the effective date hereof, and as lo....';

A common element seems to be “(”, so look for the word just before it.
Note: this doesn’t do any spell checking.

<?php
$String = '.... remain in force for a primy term OL three (3) years from the effective date hereof, and as lo....';
$String2 = '.... *JE~ in force for a primary torm of three (*B years from the effective date hereof, .......';
$String3 = '.... remain in farce for a primary term OL threA (3 years from the effective date hereof, and as lo....';

$words = explode(" ", $String3);
$keys = array();
foreach($words as $k => $word){	
	if (strpos($word,'(') !== false) {
	    $keys[] = $k-1;
	}
}
foreach($keys as $k){
	echo "{$words[$k]}<br />";
}
?>

Hmm, I should have put a 4th string in there, as it could be “three ^*B) yars fr0m”. I need to figure out how many years it is. Odds are that either the “three” or the “3” will come out intact, and there’s a good chance both will. So I’ll compare them: if three = 3 then it’s definitely 3 years, but if only one of them is readable as a number then I’ll use that one.
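A rough sketch of that comparison, in case it helps. The function name guessYears and the returned array shape are made up for illustration, and returning no value when the word and the digit disagree is an assumption, since that case isn't covered above.

```php
<?php
// Hypothetical helper: looks for a number word and a standalone digit in $text.
// Returns array('years' => int|null, 'certain' => bool).
function guessYears(string $text): array
{
    $words = array(1=>'one', 2=>'two', 3=>'three', 4=>'four', 5=>'five',
                   6=>'six', 7=>'seven', 8=>'eight', 9=>'nine', 10=>'ten');
    $fromDigit = null;
    $fromWord  = null;

    // A standalone one- or two-digit number, e.g. the 3 in "(3)" or "(3 years".
    // The \b anchors keep it from picking up the 0 inside "fr0m".
    if (preg_match('/\b(\d{1,2})\b/', $text, $m)) {
        $fromDigit = (int) $m[1];
    }

    // A correctly spelled number word, matched as a whole word.
    foreach ($words as $int => $word) {
        if (preg_match('/\b' . $word . '\b/i', $text)) {
            $fromWord = $int;
            break;
        }
    }

    if ($fromDigit !== null && $fromWord !== null) {
        // Both readable: certain only when they agree.
        // Assumption: on disagreement, report nothing so it can be checked by hand.
        $agree = ($fromDigit === $fromWord);
        return array('years' => $agree ? $fromDigit : null, 'certain' => $agree);
    }

    // Only one (or neither) readable: use whichever exists, flagged as uncertain.
    return array('years' => $fromDigit ?? $fromWord, 'certain' => false);
}

$result = guessYears('primary term of three (3) years from');
echo $result['years'], $result['certain'] ? ' (both agree)' : ' (single source)', "\n";
```

Only exact number words are recognised here, so “threA” would have to be rescued by its digit.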

Also, the string is just part of a huge document, so “(” could appear anywhere else. It’s OCR too, so it decides where “(” comes and goes. My real goal is to match something similar to “remain in force for a primary term of (*submatch) from”. Sorry, I could have been more specific from the beginning.

Well here’s how I did it…


$Matches = array(1=>'one', 2=>'two', 3=>'three', 4=>'four', 5=>'five', 6=>'six', 7=>'seven', 8=>'eight', 9=>'nine', 10=>'ten');
$Patterns = array('primary term of', 'remain in force for', 'this lease shall');
$String = '';
foreach($Patterns as $Pattern) {
	$Pos = stripos($Document, $Pattern);	// LOOK FOR AT LEAST 1 PATTERN (CASE-INSENSITIVE)
	if($Pos !== false) {
		$String = substr($Document, $Pos + strlen($Pattern), 100);	// GRAB THE NEXT 100 CHARACTERS
		break;
	}
}
$Year = '';
foreach($Matches as $Int => $Word) {
	// SEE WHICH WORD OR NUMBER EXISTS; \b STOPS 1 FROM MATCHING INSIDE 10
	if(preg_match("/\b$Int\b/", $String) || preg_match("/\b$Word\b/i", $String)) {
		$Year = $Int;
		break;
	}
}