Redirect Old URLs with redirect.php

Since my boss firmly believes that the best websites are redesigned yearly (and all attempts to dissuade him merely steel his resolve), I need to maintain a sizable array of redirected pages. I have basically used this article as my resource:

However, when I check old pages through Google Webmaster Tools, I am still getting about 60 crawl errors. Those URLs are in my array, but when I check, yes indeed, they still throw a 404 error.

I have configured everything exactly as this article outlines. URLs above and below some of these lines work just fine.

Does anyone have any idea what might be going wrong?

Hi TMacFarlane,

Are you getting 404 errors for your Old or New urls?

If the answer is your old URLs, then they should get the 404 error; the trick is capturing the 404ed URL, slicing out the important parts, and then redirecting to the closest match in the new site. The old URLs should always generate a 404 error; you are counting on this to trigger your custom 404.php page, which will handle the redirection.

So do you have this flow:

  1. A URL that links to a non-existent page is clicked.
  2. Apache/IIS serves the custom 404.php page.
  3. At the top of the 404.php page you have included the redirector code; if it can successfully create a match, it redirects and exits the script before the ‘not found’ code is reached; otherwise the ‘Not found’ page content loads.
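
The flow above could be sketched roughly like this. It is only a minimal illustration, not the article's code: the map entries, the function name find_redirect() and the exact-path lookup are all assumptions made up for the example.

```php
<?php
// Hypothetical old-to-new map; these paths are placeholders, not from the thread.
$redirects = array(
    '/old-contact.htm' => '/contact.php',
    '/old-blog.htm'    => '/blog.php',
);

/**
 * Return the new address for a requested URI, or null when nothing matches.
 * (find_redirect is an illustrative name, not part of any library.)
 */
function find_redirect($requestUri, array $redirects)
{
    // Compare on the lowercased path only, ignoring any query string.
    $path = strtolower((string) parse_url($requestUri, PHP_URL_PATH));
    return isset($redirects[$path]) ? $redirects[$path] : null;
}

// Step 3 of the flow: this runs at the very top of 404.php, before any output.
if (PHP_SAPI !== 'cli' && isset($_SERVER['REQUEST_URI'])) {
    $target = find_redirect($_SERVER['REQUEST_URI'], $redirects);
    if ($target !== null) {
        // Replace the 404 status with a permanent redirect and stop.
        header('Location: ' . $target, true, 301);
        exit;
    }
    // No match: fall through so the 'Not found' content below is served.
}
```

Keeping the lookup in a function that returns null on a miss makes it easy to try individual URLs from the command line without sending any headers.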

Hope this helps
Steve

I am getting the 404 errors from the OLD URL! While the 404 may be happening in the background, the redirect should kick in so that the user/bot does not experience it (they actually should end up experiencing a 301 Moved Permanently). The 404.php page is set up just as explained (I can provide the code, but I don’t think this is necessary, as some URLs work and others do not). We are essentially on the same page about this.

I am afraid the part of your message: “if it successfully can create a match it redirects” may be the crux of the issue here, and I cannot provide a good example in the hundreds of lines of why certain ones do not match. Be that as it may, I have combed through the array for any misplaced single-quotes or commas (thankfully, Dreamweaver highlights when I leave one of these out.) But here might be a good example:

Somehow, our TESTING directory ended up being crawled. So I added the following line at the end of the array in redirect.php:

'TESTING' => '/'

This should mean that if I type in my address bar “www.mysite.com/TESTING” I should naturally go to www.mysite.com/ – right? nope. Why?

Hi,

I threw this together.

The URL would normally come from strtolower($_SERVER['REQUEST_URI']), but I was using the posted form to test out different URLs:


<?php
if($_POST){
  if($_POST['requested_url']){
   /* List of good urls */
   $new_urls = array(
  'http://www.mysite.com/blog.php'
  , 'http://www.mysite.com/contact.php'
  , 'http://www.mysite.com/article1/story.php'
  , 'http://www.mysite.com/article2/story.php'
  , 'http://www.mysite.com/article1/story.php'
  , 'http://www.mysite.com/blog/article1/story.php'
 );

  $old_url = strtolower(htmlentities($_POST['requested_url'])); 
   $o_Url = new Url($old_url, $new_urls);
   $matched_new_url = $o_Url->run();
   if($matched_new_url){
     $o_Redirect = new Redirect();
     $o_Redirect->setRedirect($matched_new_url);
     $o_Redirect->go();
     exit;
   } else {
     //do 404.php
   }
 }
} 


class Url {
  protected $old_url;
  protected $new_urls;
  protected $host;
  protected $path_segments; // declared here because the constructor assigns it
  protected $hit_list;
  
  public function __construct($old_url, $new_urls){
    $this->old_url = $old_url;
    $this->new_urls = $new_urls;
    $this->host = '';
    $this->path_segments = array();
    $this->hit_list = array();
  }
  public function run(){
    $match = null; // stays null when the path has no search terms
    $result = $this->parseUrl();
    if($result == 'Search Terms Found'){
      $match = $this->setClosestMatch();
    }
    return $match;
  }
  protected function parseUrl(){
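    $old_url_parts = parse_url($this->old_url);
    // REQUEST_URI is path-only, so 'host' (and sometimes 'path') may be absent
    $this->host = isset($old_url_parts['host']) ? $old_url_parts['host'] : '';
    $this->path_segments = explode('/', isset($old_url_parts['path']) ? $old_url_parts['path'] : '');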
    /* Trim extensions from the end 
     * of url for easier comparison
     */
    $last_part_num = count( $this->path_segments) - 1;
    /* This approach won't work if you 
     * have domains like http://www.mysite.com/first.part.php
     */
    $last_part = explode('.', $this->path_segments[$last_part_num]);
    foreach($last_part as $part){
      switch($part){
        case 'php':;
        case 'htm' :;
        case 'html':;
          break;
        default:
          // reset the last part to the page name minus the .php
           $this->path_segments[$last_part_num] = $part;
      }
    }
    if($last_part_num > 0){
      // path has search terms
      return 'Search Terms Found';
    } else {
       // path does not have search terms
      return 'No Search Terms';
    }
  }
  protected function setClosestMatch(){
    $hit_list = array();
    foreach($this->new_urls as $url){
      $hit_list[$url] = 0; //set the current url's hit list to empty (no matches yet)
      $url_parts = parse_url($url);
      $path_segments =  explode('/', $url_parts['path']);
      $last_part_num = count($path_segments) - 1;
      $last_part = explode('.', $path_segments[$last_part_num]);
      foreach($last_part as $part){
        switch($part){
          case 'php':;
          case 'htm' :;
          case 'html':;
            break;
          default:
            // reset the last part to the page name minus the .php
             $path_segments[$last_part_num] = $part;
        }
      }
      /*
       * Check which item has the most hits 
       */
      foreach($this->path_segments as $old_part){
         if(in_array($old_part, $path_segments)){
           $hit_list[$url]++;
         }
      }
     
    }
    asort($hit_list); // sort array to get the highest number of hits to the end of the array
    end($hit_list); // push the array pointer to the last array item
    return key($hit_list); // return the web address of the new site which best matches
  }
  
}

class Redirect {
    protected $redirect;
    protected $o_Session;
    protected $default_redirect;
    
    function __construct() {
       $this->default_redirect = 'http://www.mysite.com';
  }

  public function setRedirect($redirect) {
    $this->redirect = $redirect;
  }
  public function go() {
    $this->redirect();
  }
  protected function redirect() {
    // send a 301 (moved permanently) instead of PHP's default 302
    header('Location: '.$this->redirect, true, 301);
    exit();
  }
  public function getRedirector(array $urls = null){
    $default_urls = array(
      'http://www.google.ca'
      ,'http://ca.yahoo.com/'
      ,'http://www.bing.com/?cc=ca'
    );
    if($urls != null){
      shuffle($urls);
      return $urls[0];
    } else {
      shuffle($default_urls);
      return $default_urls[0];
   }
 }
}
?>
<html>
<head></head>
<body>
 <form action='' method='POST'>
   <input type='text' name='requested_url'>
   <input type='submit' value='redirect'>
 </form>
</body>

</html>
 


This worked for me… it pulls relevant keywords out of the posted URL and then redirects. Again, instead of using the if($_POST){} code, you could simply test strtolower($_SERVER['REQUEST_URI']) and it would work.

Please feel free to ask any questions; hope it helps.

Steve

Currently, redirect.php assigns “strtolower($_SERVER['REQUEST_URI']);” to the variable “$oldurl”. It works, just not on every item in the array “$redir”.

I am sorry, but I fail to follow. How exactly do I test “strtolower($_SERVER['REQUEST_URI']);”?

Hi,

In the article ‘How to Avoid 404s and Redirect Old URLs in PHP’ in step three it says:

We’ll place our redirection code in another file named redirect.php, to keep the functionality separate from the 404 content.

Add the following code at the top of your 404.php file just after the

<?php
include('redirect.php'); 

Now create redirect.php in the website root and add the following code:

<?php  
// current address  
$oldurl = strtolower($_SERVER['REQUEST_URI']);  
// new redirect address  
$newurl = '';

This code gets the URL that the user tried but Apache/IIS could not find, so Apache/IIS has loaded the 404.php page and passed the original URL request via the $_SERVER['REQUEST_URI'] parameter. If you have done as the article suggested and created a redirect.php, then you would start with $oldurl = strtolower($_SERVER['REQUEST_URI']); in the first line.
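
One thing worth checking, given that $oldurl is lowercased: a map key that contains capitals (such as the 'TESTING' entry mentioned earlier) can never be found inside it if the comparison is case-sensitive. The thread does not show the lookup loop itself, so the strpos() comparison below is an assumption, and both function names are made up for this sketch.

```php
<?php
/**
 * Assumed lookup: the first map key found inside the lowercased request wins.
 * (The real loop in redirect.php is not shown in this thread.)
 */
function lookup_case_sensitive($requestUri, array $map)
{
    $oldurl = strtolower($requestUri);
    foreach ($map as $old => $new) {
        if (strpos($oldurl, $old) !== false) {
            return $new;
        }
    }
    return null;
}

/** Same lookup, but the key is lowercased as well before comparing. */
function lookup_case_folded($requestUri, array $map)
{
    $oldurl = strtolower($requestUri);
    foreach ($map as $old => $new) {
        if (strpos($oldurl, strtolower($old)) !== false) {
            return $new;
        }
    }
    return null;
}

$map = array('TESTING' => '/');

var_dump(lookup_case_sensitive('/TESTING', $map)); // NULL: 'TESTING' is not in '/testing'
var_dump(lookup_case_folded('/TESTING', $map));    // string(1) "/"
```

If the real loop compares this way, any entry in $redir with an uppercase letter would silently fall through to the 404 content while its all-lowercase neighbours keep working.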

I then wrote some code that you might like to use as it will automatically assign the best match rather than you having to do it via an associative array. All it needs is a list of valid links on your newest site.

I have modified the code to be less rigid; it no longer needs an exact match. It will do an exact match or a regular expression search on the text that makes up the path of the URL. So try doing this with the code I provide:

  1. Create the 404.php and have your web server send not-found URLs to this page.
  2. Ensure that redirect.php is the first include (the first thing) in the 404.php file, and ensure that you set $old_url to strtolower(htmlentities($_POST['requested_url']));
  3. Then list your correct URLs, those that you want old links to redirect to, in the $new_urls array. List them with their full path, including the ‘http://’ or ‘https://’, the domain, the directory path and the file name with extension, like: http://www.mysite.com/contact_us.php
  4. Then make sure all the code I have written is in redirect.php; the code below includes steps 2 and 3, so you can simply copy it in its entirety into redirect.php.

The redirect.php code:


<?php
$old_url = strtolower(htmlentities($_POST['requested_url']));

/* Change these to your site urls that you want to redirect 
 *to if an appropriate match is found
 */
$new_urls = array(
  'http://www.mysite.com/blog.php'
  , 'http://www.mysite.com/contact.php'
  , 'http://www.mysite.com/article1/story.php'
  , 'http://www.el.net/sign_in.php'
  , 'http://www.mysite.com/article2/story.php'
  , 'http://www.mysite.com/article1/story.php'
  , 'http://www.mysite.com/blog/article1/story.php'
 );

$o_Url = new GetMatchedUrl($old_url, $new_urls);
$redirection_url = $o_Url->run();

/* Change this to the default page you want
 * to redirect to if no redirect is set 
 */
$default_redirect = 'http://www.mysite.com';
$o_Redirector = new Redirector($default_redirect);
$o_Redirector->setRedirect($redirection_url);
$o_Redirector->go();

/*******************************/
/********** CLASSES *************/
/*******************************/

Class Redirector {
    protected $redirect;
    protected $o_Session;
     protected $default_redirect;
     function __construct($default_redirect) {
       $this->default_redirect = $default_redirect;
    }
    public function setRedirect($redirect) {
     $this->redirect = $redirect;
    }
     public function go() {
         $this->redirect();
     }
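    protected function redirect() {
      // fall back to the default page when no match was found
      $target = $this->redirect ? $this->redirect : $this->default_redirect;
      // send a 301 (moved permanently) instead of PHP's default 302
      header('Location: '.$target, true, 301);
      exit();
    }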
    public function getRedirector(array $urls = null){
        $default_urls = array(
            'http://www.google.ca'
            ,'http://ca.yahoo.com/'
            ,'http://www.bing.com/?cc=ca'
        );
        if($urls != null){
            shuffle($urls);
            return $urls[0];
        } else {
            shuffle($default_urls);
            return $default_urls[0];
        }
     }
}

Class GetMatchedUrl {
  protected $old_url;
  protected $new_urls;
  protected $host;
  protected $path_segments; // declared here because the constructor assigns it
  protected $hit_list;
  
  public function __construct($old_url, $new_urls){
    $this->old_url = $old_url;
    $this->new_urls = $new_urls;
    $this->host = '';
    $this->path_segments = array();
    $this->hit_list = array();
  }
  public function run(){
    $match = null; // stays null when the path has no search terms
    $result = $this->parseUrl();
    if($result == 'Search Terms Found'){
      $match = $this->setClosestMatch();
    }
    return $match;
  }
  protected function parseUrl(){
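    $old_url_parts = parse_url($this->old_url);
    // REQUEST_URI is path-only, so 'host' (and sometimes 'path') may be absent
    $this->host = isset($old_url_parts['host']) ? $old_url_parts['host'] : '';
    $this->path_segments = explode('/', isset($old_url_parts['path']) ? $old_url_parts['path'] : '');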
    /* Trim extensions from the end 
     * of url for easier comparison
     */
    $last_part_num = count( $this->path_segments) - 1;
    /* This approach won't work if you 
     * have domains like http://www.mysite.com/first.part.php
     */
    $last_part = explode('.', $this->path_segments[$last_part_num]);
    foreach($last_part as $part){
      switch($part){
        case 'php':;
        case 'htm' :;
        case 'html':;
          break;
        default:
          // reset the last part to the page name minus the .php
           $this->path_segments[$last_part_num] = $part;
      }
    }
    if($last_part_num > 0){
      // path has search terms
      return 'Search Terms Found';
    } else {
       // path does not have search terms
      return 'No Search Terms';
    }
  }
  protected function setClosestMatch(){
    $hit_list = array();
    foreach($this->new_urls as $url){
      $hit_list[$url] = 0; //set the current url's hit list to empty (no matches yet)
      $url_parts = parse_url($url);
      $path_segments =  explode('/', $url_parts['path']);
      $last_part_num = count($path_segments) - 1;
      $last_part = explode('.', $path_segments[$last_part_num]);
      foreach($last_part as $part){
        switch($part){
          case 'php':;
          case 'htm' :;
          case 'html':;
          case 'shtml':;
            break;
          default:
            // reset the last part to the page name minus the .php
             $path_segments[$last_part_num] = $part;
        }
      }
      /*
       * Check which item has the most hits 
       */
      foreach($this->path_segments as $old_part){
         if(in_array($old_part, $path_segments)){
           $hit_list[$url]++;
          } elseif($old_part !== '' && preg_grep('/'.preg_quote($old_part, '/').'/', $path_segments)){
           // The simple regex matches an old part that appears inside a new part
           // (empty parts are skipped and regex characters are escaped)
           // i.e. old url www.oldsite.com/contact.htm will be
           // matched to www.newsite.com/contact_us.php
           // or it will match www.same_domain_but_changed_page_names.net/contact.shtml
           $hit_list[$url]++;
         }
      }
     
    }
    // sort array to get the highest number of hits to the end of the array
    asort($hit_list);
    // push the array pointer to the last array item
    end($hit_list);
    // return the web address of the new site which best matches
    return key($hit_list); 
  } 
}
?>

 

I configured my test Apache server with a custom 404.php page and included a redirector.php file at the top of it. I ran the same code as shown above, only with my domain’s URLs, and then tested a bunch of changed domain names; it successfully matched most of the time. When it didn’t, it gave me the custom 404.php error page, which lets visitors click a link to the main site. It worked nicely.

Hope this helps.

Steve

BTW, if you don’t want a TEST or another directory indexed by robots, you can add this to each file using a single include:


<head><?php require_once('no_follow.html')?></head>

 <!-- no_follow.html-->
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

You could also consider a robots.txt file. It is not meant as a means of security; it is merely a ‘don’t enter’ sign, but well-behaved search robots will respect it. So you could create a robots.txt file at the root of your site and then do:


User-Agent: *
Disallow: /TEST/

Steve

Oh!

I think I understand now. What you are saying is to replace the code in my “redirect.php” file with your code (and just enter the names of my changed pages in the array, and replace the form language ($_POST) with the server call). Then the server will do a search-and-match based on the keywords in the URL. I will give it a try; it might take more than a week.

I do use a robots.txt, and I usually disallow all kinds of things, but this case slipped through the cracks. Don’t tell my boss!

Thanks so much for the explicit directions.

I guess the part that bothers me the most about this is the vagueness of the pattern matching. It seems like a search engine, where you end up with a bunch of porn sites while looking for something innocuous. For instance:

I’ve been tagging with comments the items in the array that haven’t been redirecting, and some of them deal with awards pages. Previously, our awards pages were prefixed with “awd-” and later they were prefixed with “award-”. Now they have been moved to their own folder, called “Awards” (to feed a breadcrumb) and the old URLs in my array that were prefixed “award-” are being redirected no problem. I worry that this type of misdirection will only be exacerbated if I introduce a keyword-matching system to the game.

As I write this, I begin to wonder if some kind of pattern-matching misdirection isn’t to blame for my present “redirect.php” file breaking. But I do want something exact, if only to give me the feeling that I am micromanaging this better. What do you say?
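
On the worry about misdirection: with a substring lookup where the first matching key wins, a short key such as 'award-' can shadow a longer, more specific one listed after it. That would explain wrong destinations rather than 404s, and it assumes the array is checked in order with strpos(), which the thread does not confirm. The entries and function names below are hypothetical.

```php
<?php
// Hypothetical entries; the real page names are not shown in the thread.
$redir = array(
    'award-'     => '/Awards/',
    'award-2009' => '/Awards/2009.php',
);

/** Assumed behaviour: return the target of the first key found in the URL. */
function first_match($oldurl, array $map)
{
    foreach ($map as $old => $new) {
        if (strpos($oldurl, $old) !== false) {
            return $new;
        }
    }
    return null;
}

/** Try the longest keys first so the most specific entry wins. */
function most_specific_match($oldurl, array $map)
{
    // Sort a local copy of the map by key length, longest first.
    uksort($map, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    return first_match($oldurl, $map);
}

var_dump(first_match('/award-2009.htm', $redir));         // string(8) "/Awards/"
var_dump(most_specific_match('/award-2009.htm', $redir)); // string(16) "/Awards/2009.php"
```

Sorting by key length keeps the hand-maintained array exact and predictable, without introducing any keyword scoring.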

OK–

I’ve actually had the spare time to try this. Every old URL goes to the first item in my $new_urls array (/About/index.php). That seems the same as just writing a redirect to my homepage into my .htaccess file.

Hi,

The way that the parsing of the url works in the code I gave you is this:

  1. it takes a full url and breaks it down into parts. So for instance http://www.mysite.com/foo/bar/file1.php will be separated into ‘http://’, ‘www.mysite.com’, ‘/foo/bar/file1.php’
  2. The path ‘/foo/bar/file1.php’ is then split into parts: ‘foo’, ‘bar’, ‘file1.php’
  3. The end file then has the extension removed like ‘file1.php’ becomes ‘file1’
  4. These path attributes are stored in an array.
  5. The array is looped through to see if there are exact matches in the path. For each exact match a hit count is increased by 1. If it is not an exact match, it then checks whether the old path piece appears inside one of the new path pieces, so if the path piece in the old site was ‘sign’ and in the new site it is ‘sign_in’, it will generate a hit for the new URL that contains ‘sign_in’.
  6. For examples like the one you gave, where the pages used to be prefixed with awd- and later award-, this exact or word-within-the-word regex will not work. You could build special conditions for edge cases. In the code below I modified the setClosestMatch() method to test for ‘awd-’, and the code successfully redirected http://www.mysite.com/awd-bestd.htm to http://www.mysite.com/award-best_distance.html. Although edge cases mean hard-coding, this approach lets you rename pages in the future without adding edge cases, since they will redirect with exact or word-in-word matching. Here is the modified setClosestMatch() method (see the changes in the bottom inner foreach loop):

protected function setClosestMatch(){
    $hit_list = array();
    foreach($this->new_urls as $url){
      $hit_list[$url] = 0; //set the current url's hit list to empty (no matches yet)
      $url_parts = parse_url($url);
      $path_segments =  explode('/', $url_parts['path']);
      $last_part_num = count($path_segments) - 1;
      $last_part = explode('.', $path_segments[$last_part_num]);
      foreach($last_part as $part){
        switch($part){
          case 'php':;
          case 'htm' :;
          case 'html':;
            break;
          default:
            // reset the last part to the page name minus the .php
             $path_segments[$last_part_num] = $part;
        }
      }
      /*
       * Check which item has the most hits 
       */
      foreach($this->path_segments as $old_part){
         if(in_array($old_part, $path_segments)){
           $hit_list[$url]++;
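         // skip empty parts and escape regex characters before the partial match
         } elseif($old_part !== '' && preg_grep('/'.preg_quote($old_part, '/').'/', $path_segments)){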
           $hit_list[$url]++;
         } elseif(preg_match("/awd\\-/", $old_part)){
           if(preg_match('/award\\-/', $url)){
             $hit_list[$url]++;
           }
         }
      }
    }
    asort($hit_list); // sort array to get the highest number of hits to the end of the array
    end($hit_list); // push the array pointer to the last array item
    return key($hit_list); // return the web address of the new site which best matches
  }
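
The parsing steps described above can also be sketched as standalone functions, which makes one detail easier to see: explode('/', '/foo/bar') yields a leading empty string, and an empty string is present in every candidate's segments too, so unless it is filtered out every candidate appears to pick up at least one hit. This is a simplified illustration under that reading, not a drop-in replacement; path_terms(), score() and best_match() are made-up names, and only exact term matches are counted.

```php
<?php
/**
 * Split a URL path into lowercase search terms, dropping empty segments
 * and the extension of the final segment.
 */
function path_terms($url)
{
    $path = (string) parse_url($url, PHP_URL_PATH);
    // Filter out the empty strings produced by leading or doubled slashes.
    $segments = array_values(array_filter(explode('/', strtolower($path)), 'strlen'));
    if ($segments) {
        $last = count($segments) - 1;
        $segments[$last] = pathinfo($segments[$last], PATHINFO_FILENAME);
    }
    return $segments;
}

/** Count how many old terms appear exactly among the new URL's terms. */
function score($oldUrl, $newUrl)
{
    return count(array_intersect(path_terms($oldUrl), path_terms($newUrl)));
}

/** Return the best-scoring candidate, or null when nothing scores above zero. */
function best_match($oldUrl, array $newUrls)
{
    $best = null;
    $bestScore = 0;
    foreach ($newUrls as $candidate) {
        $candidateScore = score($oldUrl, $candidate);
        if ($candidateScore > $bestScore) {
            $best = $candidate;
            $bestScore = $candidateScore;
        }
    }
    return $best;
}

$new_urls = array(
    'http://www.mysite.com/About/index.php',
    'http://www.mysite.com/foo/bar/file1.php',
);

print_r(path_terms('http://www.mysite.com/foo/bar/file1.php')); // foo, bar, file1
var_dump(best_match('/foo/old/file1.htm', $new_urls)); // the /foo/bar/file1.php entry
var_dump(best_match('/nothing/here.htm', $new_urls));  // NULL, so 404.php can show its content
```

Returning null on a zero score lets the 404 page load normally instead of sending every unmatched request to whichever URL happens to sort last.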

You are right that if none of your pages match it will just redirect to the main page so it would be cleaner/faster to just use a .htaccess redirect.

Regards,
Steve

Thanks very much for this solution, Steve, but I am afraid it is not what I am looking for. My redirect.php file is not really broken. I have managed to winnow my crawl errors from 349 to 19. Of those 19, four are just fine if they hit a 404. That leaves just fifteen that are aggravating me. I just wonder why my array won’t catch them: what is it about those dogs that won’t hunt? Furthermore, I can’t help but wonder how many outliers there will be in your method.

I’ve always shied away from regular expression pattern-matching, which is probably what makes me more of a front-end guy. However, if I find another use for search functions, I will be able to refer to this code. I am also extremely grateful to you for modeling a better method of scripting (private vs. public functions, etc.)

Plus, you gotta know, it is extremely difficult for me to let go of a practice I have been diligently managing for the past year.

In the end, I was just hoping that someone could point out why those select few entries were not redirecting as appointed. I haven’t been able to isolate any syntax errors, or any site-level priorities, or anything like that.

Thanks very much for your careful and diligent support. I sincerely hope you don’t view it as casting pearls before swine, or that it was in vain.

Hi TMacFarlane,

Hey no problem, I am glad that in the future you may be able to use this code, but please know that I enjoyed doing this as it let me play with some things I hadn’t done before so I did get something valuable out of it.

Maybe you can post the 15 or so links that don’t get redirected, and we can help you with regex pattern(s) that address them?

Regards,
Steve