Nervous about UTF-8 breaking my code

I would like to support UTF-8, but I have been told that I would have to use different string functions as a result (see http://us3.php.net/mbstring).

Below is some code that I fear might “blow up” if I switch to UTF-8…


	// Trim all Form data.
	$trimmed = array_map('trim', $_POST);

	// ************************
	// Validate Form Data.		*
	// ************************

	// Validate First Name.
	if (empty($trimmed['firstName'])){
		// No First Name.
		$errors['firstName'] = 'Enter your First Name.';
	}else{
		// First Name Exists.
		if (preg_match('#^[A-Z \'.-]{2,30}$#i', $trimmed['firstName'])){
			// Valid First Name.
			$firstName = $trimmed['firstName'];
		}else{
			// Invalid First Name.
			$errors['firstName'] = 'First Name must be 2-30 characters (A-Z \' . -)';
		}
	}//End of VALIDATE FIRST NAME


	// Validate Username.
	if (empty($trimmed['username'])){
		// No Username.
		$errors['username'] = 'Enter your Username.';
	}else{
		// Username Exists.
		if (preg_match('~(?x)							# Comments Mode
						^						# Beginning of String Anchor
						(?=.{8,30}$)				# Ensure Length is 8-30 Characters
						.*						# Match Anything
						$						# End of String Anchor
						~i', $trimmed['username'])){
			// Valid Username.


			// ******************************
			// Check Username Availability.	*
			// ******************************

			// Build query.
			$q1 = 'SELECT id
					FROM member
					WHERE username=?';

			// Prepare statement.
			$stmt1 = mysqli_prepare($dbc, $q1);

			// Bind variable to query.
			mysqli_stmt_bind_param($stmt1, 's', $trimmed['username']);

			// Execute query.
			mysqli_stmt_execute($stmt1);

			// Store results.
			mysqli_stmt_store_result($stmt1);

			// Check # of Records Returned.
			if (mysqli_stmt_num_rows($stmt1)>0){
				// Duplicate Username.
				$errors['username'] = 'This Username is taken.  Try again.';
			}else{
				// Unique Username.
				$username = $trimmed['username'];
			}
		}else{
			// Invalid Username.
			$errors['username'] = 'Username must be 8-30 characters.';
		}
	}//End of VALIDATE USERNAME

There seem to be three areas where I could run into issues…

1.) array_map

2.) preg_match

3.) Prepared Statements

I see there is a Multi-Byte Regex, but I am not sure how easily it would translate to my code?!

And as far as everything else, well, I just don’t know.

It would be nice to have a more “International” website/support, but I am wondering if I will break all of my code and expose my website to all kinds of Security Vulnerabilities by switching to UTF-8??

Any suggestions? :-/

Thanks,

Debbie

Using UTF-8 is a big pain in the butt. It’s a shame PHP doesn’t have native support for multi-byte characters.

It would be good to have a discussion on the topic because there are things I’m not even sure about, such as what you mention with the parameterized queries.

In addition to these considerations, you have to make sure your HTML, CSS, and PHP files are saved in UTF-8 format without a BOM (byte order mark). The server should send a header indicating that the content is UTF-8 before delivering any content, and of course the charset declared on the page should also be UTF-8. Then, when doing anything with a MySQL database, you need to be certain that your database uses a UTF-8 character set and collation, that your text fields are UTF-8, and that your connection is UTF-8, which should probably be set with mysqli_set_charset() (mysql_set_charset() in the old mysql extension) or by running a SET NAMES query, just to be certain.
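
The checklist above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not a definitive setup: the helper names has_utf8_bom() and use_utf8_connection() are hypothetical, the connection handle is assumed to be an existing mysqli object, and the 'utf8mb4' charset name is an assumption (older MySQL setups use 'utf8').

```php
<?php
// Hypothetical helper: true if the given bytes start with the UTF-8 BOM
// (EF BB BF). Handy for checking that a file was saved "without BOM".
function has_utf8_bom(string $bytes): bool
{
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

// Hypothetical helper: switch an existing mysqli connection to UTF-8.
// The default charset name 'utf8mb4' is an assumption.
function use_utf8_connection(mysqli $dbc, string $charset = 'utf8mb4'): bool
{
    return mysqli_set_charset($dbc, $charset);
}

// Tell the browser the response is UTF-8, before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}
```

A check such as has_utf8_bom(file_get_contents(__FILE__)) could be run once over each source file; the page itself would still need a matching charset declaration in its markup.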

Did I forget anything? :confused:

Funny you should say that, because after starting this thread, that very same realization came to me…

Personally I think this whole topic is a wild goose chase and a waste of my time. (At least at this point.)

But more than being a pain in the *ss to implement UTF-8 support, I fear that I’ll open up a dozen new “Attack Vectors” and become a sitting target!!

Based on that fear, and what you said above, I think my time is better served elsewhere…

Thanks,

Debbie

If you want to internationalize your site, you don’t have a whole lot of choice. If you are dealing exclusively with an American audience, then it won’t matter too much. Many, if not most, open source scripts are supporting UTF-8 these days. It’s the direction of the future.

I thought I read somewhere that PHP6 has been delayed because providing native support for multibyte strings requires twice as much memory and slows PHP down too much. :confused:

I’m not aware of any security vulnerabilities as a result of improper use of multibyte string functions. I know that if you don’t use the correct function, it can produce undesired results. Doing a single byte string comparison function on a multibyte string, for instance.
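
As a small illustration of those "undesired results", here is a sketch comparing the byte-oriented functions with their mbstring counterparts. It assumes the mbstring extension is loaded, and the sample name is made up for the example.

```php
<?php
// "Zoë", written with explicit byte escapes so the example does not depend
// on how this file is saved: "ë" is the two bytes C3 AB in UTF-8.
$name = "Zo\xC3\xAB";

$byteLength = strlen($name);              // counts bytes
$charLength = mb_strlen($name, 'UTF-8');  // counts characters (needs mbstring)

// substr() cuts by byte and can split a character in half;
// mb_substr() cuts on character boundaries.
$byteCut = substr($name, 0, 3);              // "Zo" plus the first byte of "ë"
$charCut = mb_substr($name, 0, 3, 'UTF-8');  // the whole name

printf("%d bytes, %d characters\n", $byteLength, $charLength);
```

The half-cut string in $byteCut is no longer valid UTF-8, which is the kind of result that shows up later as garbled output.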

PHP offers an overloading feature where the mb (multibyte) version of the string function will be called when the single byte version is used. Maybe that could be helpful to you.

http://www.php.net/manual/en/mbstring.overload.php

You might often find it difficult to get an existing PHP application to work in a given multibyte environment. This happens because most PHP applications out there are written with the standard string functions such as substr(), which are known to not properly handle multibyte-encoded strings.

mbstring supports a ‘function overloading’ feature which enables you to add multibyte awareness to such an application without code modification by overloading multibyte counterparts on the standard string functions. For example, mb_substr() is called instead of substr() if function overloading is enabled. This feature makes it easy to port applications that only support single-byte encodings to a multibyte environment in many cases.

I think my intentions were good, but I think I’m over my head on this one, and there isn’t a business justification. (I hardly have any visitors to my website as it stands?!)

Many, if not most, open source scripts are supporting UTF-8 these days. It’s the direction of the future.

But for now, it seems like there are other features and functionality that are more important.

I’m not aware of any security vulnerabilities as a result of improper use of multibyte string functions. I know that if you don’t use the correct function, it can produce undesired results. Doing a single byte string comparison function on a multibyte string, for instance.

I have read about hackers injecting suspect hexadecimal strings into code that wasn’t set up right and then they take over.

The whole DYNAMIC MULTI-byte thing just makes me uneasy. (It’s a byte, no wait, it’s 3 bytes?!)

Based on your earlier post, there are A LOT of things that I’d have to take into consideration, including making sure the pages get sent in the right format, making sure my code and functions are coded properly, and making sure the database is set up right. Plus I may have to do some things with my hosting environment?!

I dunno… Sounds really complicated… ;-/

Maybe if my site takes off then I can re-visit this.

Thanks,

Debbie

UTF-8 is actually a very difficult topic. Take it slow and make your site work with latin-1 encoding for now.
There is a well-known set of steps, things you must do, in order to make your site UTF-8 compatible.
It’s perfectly OK to convert your site later. Who knows, maybe PHP will have native UTF-8 support later, maybe in a year or so.
Right now there is the mbstring extension to do all your string manipulation, and as for regex, the PCRE in PHP actually does understand multibyte strings (with the /u pattern modifier).
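
To illustrate the PCRE point, here is a hedged sketch of how the "u" pattern modifier changes what "." counts. The helper name username_length_ok() is hypothetical, and the \p{L}\p{M} name pattern is only one assumption about what "international" letters might mean; it will still reject some legitimate names.

```php
<?php
// Hypothetical helper: the same 8-30 length rule as the username check,
// but with the "u" modifier so "." matches one UTF-8 character, not one byte.
// Returns false for subjects that are not valid UTF-8.
function username_length_ok(string $username): bool
{
    return preg_match('~^.{8,30}$~u', $username) === 1;
}

// Four "é" characters, which take eight bytes in UTF-8.
$short = str_repeat("\xC3\xA9", 4);

$withoutU = preg_match('~^.{8,30}$~', $short);   // 1: eight bytes pass as eight "characters"
$withU    = preg_match('~^.{8,30}$~u', $short);  // 0: only four characters

// Assumption: allow any Unicode letter or combining mark instead of A-Z.
$nameOk = preg_match('~^[\p{L}\p{M} \'.-]{2,30}$~u', "Zo\xC3\xAB");  // 1
```

Without the modifier the pattern still runs, it just counts bytes, so a length limit silently means something different for non-ASCII input.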

Now for your code sample, I don’t know where you got these validations, but I don’t like this:
if (preg_match('#^[A-Z \'.-]{2,30}$#i', $trimmed['firstName']))

It looks wrong; it looks like the . (dot) should be escaped too, like this: \.
Also, for UTF-8 this is not going to work, since a first name may have many other valid characters other than A through Z. Really, I don’t know of any good way to write a first name validation that allows for names in all possible languages that can be written using UTF-8 encoding. I think you will have to give up the idea of being able to just validate a first name with a simple regular expression. It’s just not realistic.

I’ve seen many of your posts, Debbie; you really worry about too many things. Just take it one step at a time: make your site work, then add more advanced features.

Yep. I feel the fear!! :shifty:

make your site work with latin-1 encoding for now.

Which specific one?

And some people have said that Latin is limited.

Is there a way to “have my cake and eat it too”? That is benefits of UTF-8 without all of the migration pains?!

There is a well-known set of steps, things you must do, in order to make your site UTF-8 compatible.
It’s perfectly OK to convert your site later. Who knows, maybe PHP will have native UTF-8 support later, maybe in a year or so.

That is sorta what I am thinking. (UTF-8 should be a lower priority on my website than some other Features.)

Right now there is the mbstring extension to do all your string manipulation, and as for regex, the PCRE in PHP actually does understand multibyte strings (with the /u pattern modifier).

Okay.

Now for your code sample, I don’t know where you got these validations, but I don’t like this:
if (preg_match('#^[A-Z \'.-]{2,30}$#i', $trimmed['firstName']))

It looks wrong; it looks like the . (dot) should be escaped too, like this: \.

Periods don’t need escaping, but obviously a single quote does.

Also, for UTF-8 this is not going to work, since a first name may have many other valid characters other than A through Z. Really, I don’t know of any good way to write a first name validation that allows for names in all possible languages that can be written using UTF-8 encoding. I think you will have to give up the idea of being able to just validate a first name with a simple regular expression. It’s just not realistic.

That was just an example of where I am using Regex and where I fear Multi-Byte would blow it up.

Obviously if I am supporting all UTF-8 characters, it would be dumb to restrict things to A-Z!! :stuck_out_tongue:

I’ve seen many of your posts, Debbie, you really worry about too many things.

Hard not to worry when I am a newbie and there are lots of bad people out there trying to take me down…

Just take it one step at a time, make your site work, then add more advanced features.

That is what I am trying to do, and why I am backing off UTF-8 for now.

Thanks,

Debbie


Periods don’t need escaping, but obviously a single quote does.

In a regular expression, a single dot has a special meaning: it matches any character. So it does need escaping, unless you really mean to say “any character”.

Wrong! The period is in a CLASS so it does NOT need escaping…

Debbie
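
For anyone who wants to check the character-class point for themselves, here is a short sketch. The variable names and sample inputs are made up for the example; the last pattern is the first-name pattern from the original code.

```php
<?php
// Outside a character class, "." matches any character (except newline).
$bareDot = preg_match('#^.$#', 'a');             // 1

// Inside a character class, "." is a literal period.
$classDotVsLetter = preg_match('#^[.]$#', 'a');  // 0
$classDotVsPeriod = preg_match('#^[.]$#', '.');  // 1

// Escaping it inside the class is allowed but changes nothing.
$escapedInClass = preg_match('#^[\.]$#', '.');   // 1

// The first-name pattern: letters, space, apostrophe, period, hyphen.
$pattern  = '#^[A-Z \'.-]{2,30}$#i';
$accepted = preg_match($pattern, "O'Neil-Smith Jr.");  // 1
$rejected = preg_match($pattern, 'a1');                // 0: digits are not in the class
```

The trailing hyphen is also literal, because a hyphen at the very end of a class cannot form a range.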