Replacing Terms in PHP

devbanana · February 26, 2013, 4:54pm

Hello,

I’m a bit stumped about the best way to go about this.

I’m creating a module for Drupal that filters text and replaces certain words with links. I have a list of the terms to be replaced in a text file.

I tried using strtr, but it replaces matching substrings anywhere, including inside longer words, not just whole words. So for instance, the word “since” was being replaced with “sin” as a link followed by “ce”, which is obviously not what I want.
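
The behaviour is easy to reproduce in isolation. The sketch below uses a made-up one-entry term list and plain HTML anchor markup as placeholders; neither is taken from the actual module.

```php
<?php
// Hypothetical term list: term => replacement markup (placeholder link).
$pairs = array('sin' => '<a href="/sin">sin</a>');

// strtr() with an array replaces every occurrence of each key,
// whether or not the occurrence is a whole word.
$result = strtr('Ever since the fall, sin abounds.', $pairs);

echo $result, "\n";
```

Both the standalone “sin” and the first three letters of “since” end up wrapped in the link.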

I could use preg_replace, putting \b (word boundary) around each term, but that is far too slow.
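
A minimal sketch of that per-term approach, assuming a small made-up term list and a placeholder [link] tag (both hypothetical), might look like this. It shows why the cost grows with the list: the text is scanned once for every term.

```php
<?php
// Hypothetical term list: term => page title.
$terms = array('sin' => 'Sin', 'grace' => 'Grace');
$text  = 'Ever since the fall, sin and grace.';

// One preg_replace() call per term: whole words only, but the full
// text is rescanned for each entry in the list.
foreach ($terms as $term => $title) {
    $pattern = '/\b' . preg_quote($term, '/') . '\b/i';
    $text = preg_replace($pattern, '[link=' . $title . ']$0[/link]', $text);
}

echo $text, "\n";
```

Here “since” is left alone because there is no word boundary after “sin”. A further caveat of looping like this is that later terms can match inside tags inserted by earlier iterations.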

So what is the best way for me to do this?

devbanana · February 26, 2013, 8:18pm

OK, here’s what I have so far. $terms is an array where the key is the term to replace, and the value is the title of the page to link to.


    // Sort the terms from longest to shortest
    $t = array_keys($terms);
    usort($t, function ($a, $b)
    {
        return strlen($b) - strlen($a);
    });

    $words = array();
    foreach ($t as $term)
    {
        // Is this in our text?
        if (stripos($text, $term) !== false) {
            // Escape the term, including the "/" pattern delimiter
            $words[] = preg_quote($term, '/');
        }
    }

    // Skip the replace when no term was found; an empty alternation
    // would match at every word boundary
    if (!empty($words)) {
        $text = preg_replace_callback('/\\b(' .
            implode('|', $words) .
            ')\\b/i', function ($matches) use ($terms)
            {
                return "[cathenlink=" .
                    $terms[strtolower($matches[1])] . "]" .
                    $matches[1] . "[/cathenlink]";
            },
            $text);
    }

The [cathenlink] tag is a BB code tag I created to link to the appropriate page, given the title of the page.
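
To see the effect of the longest-first ordering on a small input, here is a self-contained run of the same technique. The two-entry $terms array and its page titles are made up for illustration (the real list comes from the text file), and the keys are lowercase because the callback looks terms up via strtolower().

```php
<?php
// Hypothetical sample data; keys are lowercase to match the strtolower() lookup.
$terms = array(
    'holy'        => 'Holiness',
    'holy spirit' => 'Holy Ghost',
);
$text = 'The Holy Spirit is holy.';

// Longest terms first, so "holy spirit" is tried before "holy".
$t = array_keys($terms);
usort($t, function ($a, $b) {
    return strlen($b) - strlen($a);
});

// Keep only terms that occur in the text, escaped for use in a pattern.
$words = array();
foreach ($t as $term) {
    if (stripos($text, $term) !== false) {
        $words[] = preg_quote($term, '/');
    }
}

$linked = preg_replace_callback(
    '/\b(' . implode('|', $words) . ')\b/i',
    function ($matches) use ($terms) {
        return '[cathenlink=' . $terms[strtolower($matches[1])] . ']'
            . $matches[1] . '[/cathenlink]';
    },
    $text
);

echo $linked, "\n";
```

With the shorter term listed first, the alternation would match “Holy” on its own and never reach the two-word term.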

Now, this is still taking a while to parse. There are 30241 possible replacements. I think it helps that I only include the ones that actually exist in the page, but it is still quite a lot.

Is there any way to reduce the time? At this point, this code is working, but it is extremely slow.

devbanana · February 27, 2013, 4:31pm

Still nothing?

Let me try to explain one more time:

I’m writing a text filter, to convert certain words into links.

I have a MySQL table with each term and its corresponding words to link. This table currently has 30,126 rows, so every bit of optimization is necessary.

I have a function that does the linking. It has changed since my last post, so I will show it again below.

In order to reduce the number of terms to work through, I did a simple LIKE in the database to at least narrow down the terms. This has made it faster, but it is still pretty slow.
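
What such a LIKE prefilter selects can be mimicked in plain PHP, which also shows its limit: it is a substring test, so it narrows the list but still lets through terms that only occur inside longer words. The term list below is made up, and stripos() only approximates MySQL's collation-dependent case handling.

```php
<?php
// Hypothetical stand-in for the terms table.
$all_terms = array('sin', 'grace', 'altar', 'pope');
$line = 'Ever since then, grace has been given.';

// Roughly what "WHERE :text LIKE CONCAT('%', term, '%')" keeps:
// every term that appears anywhere in the line as a substring.
$candidates = array_values(array_filter($all_terms, function ($term) use ($line) {
    return stripos($line, $term) !== false;
}));

print_r($candidates);
```

“sin” survives the prefilter because of “since”; it is the \b check in the regular expression that rejects it afterwards.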

Here’s my function. Note that this is Drupal, so the database is queried with db_query:

function ce_execute_filter($text)
{

    // If text is empty, return as-is
    if (!$text) {
        return $text;
    }

    // Split by paragraph. The capture group, together with
    // PREG_SPLIT_DELIM_CAPTURE, keeps each run of newlines as its own fragment
    $lines = preg_split('/(\\n+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Contains the parsed and linked text
    $linked_text = '';

    foreach ($lines as $line)
    {

        // If this fragment is only one or more newline characters,
        // add it to $linked_text and continue without parsing
        if (preg_match('/^\\n+$/', $line)) {
            $linked_text .= $line;
            continue;
        }

        // Select any terms that might be in this line
        // Ordered by descending length of term,
        // so that the longest terms get replaced first
        $result = db_query('SELECT title, term FROM {catholic_encyclopedia_terms} ' .
            "WHERE :text LIKE CONCAT('%', CONCAT(term, '%')) " .
            'GROUP BY term ' .
            'ORDER BY char_length(term) DESC',
            array(
                ':text' => $line
            ))
            ->fetchAll();

        // Array with lowercase term as key, title of entry as value
        $terms = array();

        // Array of the terms only in descending order of length
        $ordered_terms = array();

        foreach ($result as $r)
        {
            $terms[strtolower($r->term)] = $r->title;
            // Escape the term, including the "/" pattern delimiter
            $ordered_terms[] = preg_quote($r->term, '/');
        }

        // If no terms were returned, add the line and continue without parsing.
        if (empty($ordered_terms)) {
            $linked_text .= $line;
            continue;
        }

        // Do the replace
        // Get the regexp by joining $ordered_terms with |
        $line = preg_replace_callback('/\\b(' .
            implode('|', $ordered_terms) .
            ')\\b/i', function ($matches) use ($terms)
            {
                if ($matches[1]) {
                    return "[cathenlink=" .
                        $terms[strtolower($matches[1])] . "]" .
                        $matches[1] . "[/cathenlink]";
                }
                // Otherwise leave the matched text unchanged
                return $matches[0];
            },
            $line);

        $linked_text .= $line;
    }

    return $linked_text;
}

The cathenlink thing is a tag that will be replaced by an actual link later in the text.

OK, so this is getting closer, but still I want to figure out how to make it faster.
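
One direction that might be worth measuring, offered only as a sketch and not as something tested against the real table, is to split the text into word fragments and look each one up in an array keyed by lowercase term, instead of building a large alternation. The function name link_single_words and the sample data are hypothetical, and the sketch assumes single-word terms made of word characters only; multi-word terms, or terms containing hyphens or apostrophes, would need extra handling.

```php
<?php
// Hypothetical sample data: lowercase single-word term => page title.
$terms = array('sin' => 'Sin', 'grace' => 'Grace');

// Hypothetical helper: links whole words found in $terms.
function link_single_words($text, array $terms)
{
    // Split into alternating non-word / word fragments, keeping both.
    $parts = preg_split('/(\w+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        $key = strtolower($part);
        // One array lookup per fragment instead of a large alternation.
        if (isset($terms[$key])) {
            $parts[$i] = '[cathenlink=' . $terms[$key] . ']' . $part . '[/cathenlink]';
        }
    }

    return implode('', $parts);
}

echo link_single_words('Ever since the fall, Sin and grace.', $terms), "\n";
```

Because only complete word fragments are looked up, “since” is never confused with “sin”, and the work per line depends on the number of words in the text rather than on the number of terms.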