Best way to sanitize Email input for SQL

I’m working on a script for a subscription form. The user inputs their name, Email address and CAPTCHA. The data will go into a MySql database and a cron job will send them what they subscribed to.
My question is, what’s the best way to sanitize the Email address?
Should I use something like this?
Or is there a better way.
For the name I’m just using preg replace:

$name = preg_replace('#[^A-Za-z 0-9]#i', '', $_POST['Name']);

Which I think will suffice there, unless anyone knows better.

I’d go for the filter function. email addresses can get really complex, so a simplistic regexp would probably discard otherwise correct email addresses.

to sanitize it? [FPHP]filter_var[/FPHP] is fine.
validating it? That’s a wholllle other ball of annoyance.

Also… is name an actual name, or a nickname? Cause… i’m not familiar with that many people with numbers in their real name.

Yes, I could probably remove the numbers.

Good to know that is safe.

Does filter_var with FILTER_VALIDATE_EMAIL work for this?
This is what I have so far:

$email = filter_var($_POST['EmailAddr'], FILTER_SANITIZE_EMAIL);
if (filter_var($email, FILTER_VALIDATE_EMAIL)) { //Do stuff }
else { //Don't do stuff }

Depends on your definition of ‘work’.

FILTER_VALIDATE_EMAIL will work for the vast majority of use cases - actual, human-entered emails.

PHP’s filter_var uses the following regular expression to validate emails:

/^(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){255,})(?!(?:(?:\x22?\x5C[\x00-\x7E]\x22?)|(?:\x22?[^\x5C\x22]\x22?)){65,}@)(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+)|(?:\x22(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|(?:\x5C[\x00-\x7F]))*\x22))(?:\.(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+)|(?:\x22(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|(?:\x5C[\x00-\x7F]))*\x22)))*@(?:(?:(?!.*[^.]{64,})(?:(?:(?:xn--)?[a-z0-9]+(?:-[a-z0-9]+)*\.){1,126}){1,}(?:(?:[a-z][a-z0-9]*)|(?:(?:xn--)[a-z0-9]+))(?:-[a-z0-9]+)*)|(?:\[(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){7})|(?:(?!(?:.*[a-f0-9][:\]]){7,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?)))|(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){5}:)|(?:(?!(?:.*[a-f0-9]:){5,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3}:)?)))?(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))(?:\.(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))){3}))\]))$/iD

Yes, that’s ugly as sin. But it’s effective in -almost- every case - it has the best ‘correct’ ratio of given attempts to solve the problem.

Noted exceptions include that it thinks “”@sitepoint.com is valid, and blar@123.111.101.100 to be invalid.

There are a few more, but you get the point - they’re edge cases at best.

I’ve come to the decision that validating an email address is pointless. If a user genuinely wants to receive mail from a site, then they’ll give you a valid and working email. But if they don’t want our spam, yet we force them to give an email anyway, then they’re going to give something like blah_blah_fake_fake_fake_eat_all_the_spam@whatd.you.expect.com. An email can be valid yet obviously fake. I think the better solution is to not require an email unnecessarily. It should always be an optional piece of information that the user can provide if and when they decide they actually want to receive something from you.

1 Like

I do see your point on this, but in this case, the form is for signing up to an Email service, therefore providing a valid Email address is necessary, not doing so would be rather pointless.
No one is holding a gun to their head, forcing them to fill the form in, the user can choose to fill the form if they want our Emails, or not do so if they don’t.

Quite true, there is not a lot we can do about those. But I don’t see any harm in filtering out invalid addresses which would otherwise populate my database, forcing the cron job to go through them all every time it fires off.

Those are the ones I’m interested in, anything else can go…

I guess it will do. I don’t fancy the prospect of my writing my own regex to validate Email addresses.

To confirm that the email address is actually valid you need to send an email address with a registration link in the email. When they click on that link you then know that the email address is valid. If they don’t click within a set time (possibly a couple of days) you discard it as invalid.

2 Likes

Of course. Now you mention it, I have seen this before with forums and suchlike. It is one extra step for the user to carry out, but it will validate the phony addresses the do pass a validation filter.

Well… sort of. For example, I very frequently use disposable email services such as Mailinator, where_literally_any_random_email@mailinator.com will work. If the user doesn’t want to give their email, then they’re not going to give their email, and no amount of validation or verification will change that.

Ok, so it’s not 100% watertight, but those two levels of validation have got to be better than none at all. Bearing in mind the sole purpose of the form is to subscribe to emails, users who don’t want to give their address don’t fill in the form at all. But if they want to receive the emails, then are required to give a valid address.

So then why would we make the user jump through verification hoops?

That’s my point. If the user wants to give their address, then verification is redundant. And if the user doesn’t want to give their address, then verification won’t fix that.

Though to emphasize the point, we’re talking purely about -verification-. Sanitization is definitely a must.

1 Like

Or just escaping works too. Which we ought to be doing anyway. :wink:

The reason I dont say just escaping, especially when it comes to email, is that the script would be feeding that string into another program - which we cannot trust to sanitize the input.

If the mail daemon has a vulnerability to it, and we dont sanitize…

No it isn’t - it prevents someone else giving your email address to sign you up without your permission.

It is also a legal requirement in some countries such as Australia where NOT validating that the owner of the email address really wants to receive future emails from you automatically defines all your emails as SPAM.

3 Likes

What escape code do you use for emails passed to the mail daemon so that it will treat all values correctly as email addresses even if they are not?

sanitization should at least render malicious code injection into a marginally nonmalicious string. Depending on how permissive the sanitization is, of course.

For reference, Sanitizing an email using PHP’s filter has this description:

Remove all characters except letters, digits and !#$%&'*+-/=?^_`{|}~@.[]

(Note that this sanitization is more restrictive than validation’s regex)

The same goes for Germany.

To me, I think an email verification should happen through a real email. The user adds an email address, wanting whatever service it is you offer, but you send an email to that user with a link to a web page, which states something like, “You are now opting in for our service, thank you very much! Click the button below to approve”. It is a step more, but then the user’s approval can be logged and stored for prosperity for laws like in Germany and Australia and the email address is 100% verified.

Scott

2 Likes

Apart from felgall’s reasoning why the verification is not redundant there’s one more pragmatic reason: people are known to make mistakes and a certain percentage of people will misspell their email in such a way as to change them into an invalid format - and in such cases validation will prevent collecting of invalid emails.

For the same reason I stated above I don’t think sanitizing emails makes any sense at all. If an unexpected invalid character appears in an email then there is 80% probability that the user made a mistake by providing an invalid character instead of a proper one and in this case automatic stripping of characters will not change the email to a valid one. Emails should be validated as soon as possible and if they don’t pass validation then users should re-enter them. Once they are validated then sanitization is not necessary.

Even if email addresses came from an unknown or external source I have no control over I still wouldn’t trust sanitization to turn the bad ones into good ones. If they don’t validate I simply discard them - and maybe mark them to be corrected by a human.