Rewrite encoded URL

Hello.
For some unknown reason, some sites link to mine with an encoded URL (according to Google Webmaster Tools).
Something like http://domain.tld/file.php%3Fid%3DUSERid%26cat%3Dsmth instead of http://domain.tld/file.php?id=USERid&cat=smth
I’ve tried several things (conditions, rules) in .htaccess and httpd.conf to rewrite the encoded URLs.
One approach is to use RewriteRule ^file.php(.*)$ index.php?qs=$1 and handle $_GET['qs'] in index.php.
Is it possible to rewrite the encoded URL? Or at least to match %3F and the other encoded characters?
Thanks

Marcianos,

First, WELCOME to SitePoint!

Second, yes, but not the way you’re doing it because a RewriteRule can only examine the {REQUEST_URI} string (i.e., NOT the query string).
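
To illustrate the distinction, here is a minimal sketch for an .htaccess file in the document root. The file name file.php and the id parameter come from the original question; the environment variable name USER_ID is made up for the example. The RewriteRule pattern only ever sees the path, while the query string has to be tested through a RewriteCond:

```apache
RewriteEngine On

# The condition inspects the query string (everything after the literal "?").
RewriteCond %{QUERY_STRING} (^|&)id=([^&]+)
# The rule pattern sees only the path ("file.php"), never the query string.
# "-" means no substitution; %2 is the id value captured by the condition.
RewriteRule ^file\.php$ - [E=USER_ID:%2]
```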

Third, why does it matter whether the query string is encoded like that? Apache can handle that and so can PHP. IMHO, just create the proper URIs for your website and let Google worry about storing them on their server.

If you want to continue, please have a look through the tutorial article linked in my signature and know that encoded characters must be contained (in their natural form) in a character range definition (except the ? which is ONLY permitted as the demarcation between a URI and query string). If you still have questions, come back here (please PM me, too, as I’m not here as often as I had been when on staff).

Regards,

DK

First, WELCOME to SitePoint!
Thank you David!

Second, yes, but not the way you’re doing it because a RewriteRule can only examine the {REQUEST_URI} string (i.e., NOT the query string).

Maybe I was not clear. Google Webmaster Tools reports some 404 errors and says the referrers are third-party sites.
I never understand when a string is encoded but displayed decoded by a web browser, mail client or whatever, and whether it is then sent to the server as I see it, as the client received it, or in some other form.
The fact is that GWT shows the address http://domain.tld/file.php%3Fid%3DUSERid%26cat%3Dsmth getting a 404 error from my server. I cannot fix the origin of the problem (the external sites), so to avoid losing backlinks I want to rewrite that string to a good URL.

Third, why does it matter whether the query string is encoded like that? Apache can handle that and so can PHP. IMHO, just create the proper URIs for your website and let Google worry about storing them on their server.

I can’t understand why my Apache does not handle that. My URIs are OK. As Google says, the ‘bad’ addresses come from other sites.

David, I found something that may be the key.
In my httpd.conf
RewriteCond %{HTTP_HOST} ^domain\.com
RewriteRule ^(.*)$ http://www.domain.com$1 [R=permanent,L]

More precisely, the closest problematic URI is http://www.domain.tld/file.php%3Fid%3DUSERid%26cat%3Dsmth (with www), which returns a 404.
When I removed ‘www’, the returned URL was OK.
What do you suggest?
Thank you

Just looking around, I see the same issue at ubuntuforums.org, but the other way round.
Its regular URIs are without ‘www’.
In FF, if I replace ubuntuforums.org/showthread.php?p=1195677 with ubuntuforums.org/showthread.php%3Fp%3D1195677, the thread is not displayed; I get the main page instead (its ‘404’ page).
If I add www., the URI is redirected to the first form and the thread is displayed.

Marcianos,

The point of my first post coding comment was that the (.*) above CANNOT access the query string. Fortunately, you’ve dropped that.

Add the / as above OR avoid the question of the / between the domain and path by

...
RewriteRule .? http://www.domain.com%{REQUEST_URI} [R=301,L]

After all, your (.*) is already available as the {REQUEST_URI} variable.

Most hosts have configured their DNS servers to include both the www and non-www version of their client domains - yours has apparently not. Therefore, check my signature’s tutorial Example Code section for code to force either the www or non-www requests.

Regards,

DK

 RewriteRule ^file.php(.*)$ index.php?qs=$1

The point of my first post coding comment was that the (.*) above CANNOT access the query string. Fortunately, you’ve dropped that.

In this particular case there’s no query string because Apache does not recognize a ‘?’; there’s only a %3F. So (.*) gets whatever comes after file.php and sends it as a query string (qs). It is working fine now as a partial solution.

 RewriteRule ^(.*)$ http://www.domain.com/$1 [R=permanent,L]

Actually I already have ‘…com$1’, but it has to be ‘…com/$1’ if the rule is placed in .htaccess instead of httpd.conf.
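
The difference comes from what the pattern is matched against in each context. The sketch below uses the domain.com placeholder from this thread and shows the two variants side by side; only the one that matches where the rule actually lives should be used:

```apache
# Server or virtual-host context (httpd.conf): the pattern is matched against
# the full URL-path, which begins with "/", so $1 already carries the slash.
RewriteCond %{HTTP_HOST} ^domain\.com$ [NC]
RewriteRule ^(.*)$ http://www.domain.com$1 [R=301,L]

# Per-directory context (.htaccess in the document root): the leading "/"
# is stripped before matching, so the substitution has to supply it.
RewriteCond %{HTTP_HOST} ^domain\.com$ [NC]
RewriteRule ^(.*)$ http://www.domain.com/$1 [R=301,L]
```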

I replaced both of those httpd.conf lines with the output of your Code Generator (nice!):

RewriteCond %{HTTP_HOST} !www\. [NC]
RewriteRule .? http://www.%{HTTP_HOST}%{REQUEST_URI} [R=301,L]

Some other (not all) directives in httpd.conf:

ServerAlias www.domain.com ...
# Handle TRACE request
RewriteCond %{REQUEST_METHOD} ^TRACE
RewriteRule .? - [F]

# retain query string
RewriteCond %{QUERY_STARING} !''
RewriteRule .? %{REQUEST_URI}? [QSA,L]
<Directory /path/to/www>
Options Indexes IncludesNOEXEC FollowSymLinks +ExecCGI
allow from all
AllowOverride All
...
</Directory>
...

There are also some directives in .htaccess (waiting to be moved), such as IndexIgnore, Limit GET POST, Limit PUT DELETE, GZIP compression, Expires headers, cache headers, and rewrite rules from the old site to the new one plus some shortcuts.

Also, both domain.com and www.domain.com have ‘A’ records pointing to the same IP.

The problem persists.
Thank you!

Marcianos,

David,

I copied and pasted “staring” from … http://datakoncepts.com/mrg.php
Anyway, although I don’t understand what this rule is for, I included it just to see whether the encoded string would be decoded by Apache. No luck.
I don’t know what else to try; maybe creating test.domain.com and starting with almost no rules.
Thank you,
Marcianos

I have a subdomain for testing
httpd.conf is the default from the server admin control panel. There is no .htaccess file.

SuexecUserGroup "#567" "#524"
ServerName test.domain.com
DocumentRoot /path/to/www/dir
ErrorLog /path/to…
CustomLog /path/to… combined
ScriptAlias /cgi-bin/ /path/to…/cgi-bin/
DirectoryIndex index.html index.htm index.php index.php4 index.php5
<Directory /path/to/www/dir>
Options -Indexes +IncludesNOEXEC +FollowSymLinks +ExecCGI
allow from all
AllowOverride All
AddHandler fcgid-script .php
AddHandler fcgid-script .php5
FCGIWrapper /path/to…/fcgi-bin/php5.fcgi .php
FCGIWrapper /path/to…/fcgi-bin/php5.fcgi .php5
</Directory>
<Directory /path/to…/cgi-bin>
allow from all
</Directory>

RewriteEngine on
RemoveHandler .php
RemoveHandler .php5
IPCCommTimeout 46

I’ve uploaded test.php containing
<?php if (isset($_GET['qstring'])) echo $_GET['qstring']; ?>

From FF: http://test.domain.com/test.php?qstring=12345
FF displays ‘12345’.

From FF: http://test.domain.com/test.php%3Fqstring%3D12345
FF displays ‘The requested URL /test.php?qstring=12345 was not found on this server’
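
The wording of the error message is a hint: Apache reports /test.php?qstring=12345 as the missing URL, which suggests that %3F was decoded into a literal ? inside the path, so the server looked for a file of that name and saw no query string at all. Under that assumption, a rule along these lines might redirect such requests. It is an untested sketch for an .htaccess file in the document root (in httpd.conf the pattern would start with ^/test\.php instead). Recent Apache 2.4 releases (2.4.60 and later, as far as I know) refuse this kind of rewrite unless the UnsafeAllow3F flag is added, so check the mod_rewrite documentation for the version in use.

```apache
RewriteEngine On

# RewriteRule patterns are matched against the %-decoded URL-path, so an
# encoded %3F arrives here as a literal "?" and must be escaped in the regex.
# The "?" in the substitution then starts a real query string.
RewriteRule ^test\.php\?(.+)$ /test.php?$1 [R=301,L]
```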

In this particular case there’s no query string because Apache does not recognize a ‘?’; there’s only a %3F. So (.*) gets whatever comes after file.php and sends it as a query string (qs). It is working fine now as a partial solution.

WRONG! Apache knows that %3F is the ? character and will treat everything after it as a query string. Because the query string is NOT available to the regex in a RewriteRule (only available in a RewriteCond statement and only when specified), your (.*) will NEVER match anything (unless you’ve enabled MultiViews which, IMHO, is a dumb thing to do). Test it and look at the value (null every time) that $1 returns.

Well, sorry, I guess I misspoke.
These are the facts:

If I point FF at www.domain.com/dir/file.php%3Fid%3Dbob
I get a 404 error.

In .htaccess I add
RewriteRule ^dir/file\.php(.*) index.php?qs=$1

Retry www.domain.com/dir/file.php%3Fid%3Dbob from the web browser:
index.php is displayed.
FF URI: www.domain.com/index.php?qs=%3Fid=bob (yes, %3f and ‘=’)

I add these lines into index.php to handle qs


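// Recover the original query string that the rewrite rule passed in as "qs".
if (isset($_SERVER['QUERY_STRING'])) {
	// str_ireplace() is case-insensitive, so both "%3f" and "%3F" are stripped.
	$qs = urldecode(str_ireplace("qs=%3f", "", $_SERVER['QUERY_STRING']));
	if (strstr($qs, "id")) {
		header("Location: file.php?$qs");
		exit;
	}
}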

That acts like asking for www.domain.com/dir/file.php?id=bob

Marcianos,

I confess to a lack of objectivity with regard to altering links you have no control over. I don’t consider this to be much of a problem at all because:

  1. You can’t change the incoming links (at their source)

  2. Apache can (and DOES) “translate” encoded characters like the ones you’ve shown to deal with them

Therefore, your efforts are like a tempest in a teacup!

That’s not to say that they MUST be ignored, though, as I thought I’d explained that encoded characters can be readily altered by matching within regular expressions by including the ACTUAL character within a character range definition. Similar to the ‘change a character’ or ‘change an extension’ sample codes within my signature’s tutorial, it would look like this:

# you cannot match %3f as that's ? and does not exist in
# either the {REQUEST_URI} or {QUERY_STRING}
# ? is the marker used to separate the {REQUEST_URI} from the {QUERY_STRING}

# replace %3d with =
RewriteCond %{QUERY_STRING} (.*)[=](.*)
RewriteRule .? %{REQUEST_URI}?%1=%2 [L]

# replace %26 with &
RewriteCond %{QUERY_STRING} (.*)[&](.*)
RewriteRule .? %{REQUEST_URI}?%1&%2 [L]

Beyond that (which is probably not thoroughly explained in the tutorial), EVERYTHING you need is there.

Regards,

DK

1. You can’t change the incoming links (at their source)
I agree.

Therefore, your efforts are like a tempest in a teacup!
I also agree but you know, at this time it is something like we say in Spanish ‘to remove the sting’.

# replace %3d with =
RewriteCond %{QUERY_STRING} (.*)[=](.*)
RewriteRule .? %{REQUEST_URI}?%1=%2 [L]

# replace %26 with &
RewriteCond %{QUERY_STRING} (.*)[&](.*)
RewriteRule .? %{REQUEST_URI}?%1&%2 [L]

does not work as expected in FF, Chromium, or Safari. Nothing changes (still a 404).
I also tried replacing each URL-encoded character (? = &) individually. None of them was replaced by its decoded value (nor was the page displayed as it would be if the substitution had been applied without being shown in the address bar).

I understand ‘?’ is the separator of REQUEST_URI and QUERY_STRING and it is not ‘detected’

From my example (the partial solution) it seems that Apache does not interpret %3F as ‘?’ (nor as a string to match in a regular expression).
RewriteRule ^dir/file\.php(.*) index.php?qs=$1 applied to www.domain.com/dir/file.php%3Fid%3Dbob
sends the query string qs=%3fid=bob to the user’s browser.
So I take it that the 404 error is displayed because Apache cannot find the file dir/file.php%3F…
BTW, it is hard for me to understand what is received encoded and then displayed or processed decoded (or vice versa) on both the client and the server.
Thank you,
M
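
One way to at least detect the encoded form is %{THE_REQUEST}, which holds the request line as the client sent it, before any decoding. The following is only a sketch that flags such requests without redirecting them; the environment variable name ENCODED_QMARK is invented for the example:

```apache
RewriteEngine On

# THE_REQUEST looks like: GET /dir/file.php%3Fid%3Dbob HTTP/1.1
# It is not percent-decoded, so "%3F" can be matched as plain text.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s[^?\s]*%3F [NC]
# Flag the request without changing the URL.
RewriteRule ^ - [E=ENCODED_QMARK:1]
```

The flag could then be logged or tested in a later condition to see how many incoming links actually arrive in the encoded form.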

M,

Did I really do that? Sorry, change the [L]'s to [R=301,L] so that you’ll see the redirections.

Regards,

DK

David,

I had already changed those flags. I added the rules at the bottom of the httpd.conf for test.domain.com (post #10).
test.domain.com%3Fqstring=12345 -> “Not Found test.domain.com?qstring=12345”

Thanks,
M.