Best way to sanitize Email input for SQL

Or just a ‘confirm email’ box… reducing this to an === check instead of a regex. Also captures a better percentage of the ‘typoed’ errors - lemonjiuce@sitepoint.com is still “valid”.

Sanitization is not designed to change bad ones into good ones.

1 Like

Certainly, but a second email box is a usability problem - or at least an annoyance. Personally I hate those confirm email fields and I always just copy and paste to them.

Sure, so what is sanitization supposed to do for emails if validation is implemented (which I think should be)?

I personally don’t like this process. I, for instance, copy and paste the second email entry and if I didn’t see the mistake in the first entry, I probably won’t see it in the second.

I am more for the thinking, people should simply be smart and careful when entering such data and the two email entry fields veers away from that kind of thinking IMHO. It also adds another step into the process. Something better might be a little javascript, which asks for a verification of the entry, before the email address is sent to the server. It could be over a confirmation box or, in the case of registration, a small blendover of the field with the email address rerendered with a check box next to it. Still, all of that is saying, “dear user, we think you might be too stupid to add your email address in our field correctly, so we are making sure you are sure you are doing it right.” It is a poor message.

Scott

By that logic, never comment your code; programmers should know what they’re doing.
Never do any verification of input, because people cant make mistakes.

Or do what every person who’s ever commented code, written documentation, or done any form of user-interaction support does: Assume your audience are monkeys hitting buttons randomly.

Not completely. I’d still have the verification I mentioned earlier. Just the UX shouldn’t give the user the message, “We don’t care about you, our user, because we think you are most likely a monkey hitting buttons randomly.” :smile:

Scott

Well, then thats where you decide your balance between having to tell the user what you know to be true, and having to add complexity and processing time to your code to do all the extra things you have to do to counteract it without them being aware.

Personally, i’ll stick a confirm box in there.

What added complexity? What if they confirm the wrong email, like I might do with copy and paste? I am a monkey sometimes. LOL! The end result is the same: a confirmation (verification) email, which goes nowhere and an unverified email address. The user would still go through trying to resend a new confirmation email. It will fail to arrive too. You would make note to the user, that if the email doesn’t arrive, to start the process over. No verification of an email address could catch all stupid mistakes I can make, that is for sure. :smiley:

Scott

As stated before, you’re changing === into preg_match.

If they confirm the wrong email, then you’re outside both cases - neither sanitization/validation or a confirm box will catch it.

I have my browser set up to do this automatically - the browser simply enters my email address in all email fields for me. I also have a tendency to think that the author of such a page is an idiot and to leave when they ask for the same information twice because they don’t believe my browser entered it properly the first time.

If you want people to review what they have entered then display it ALL again on the following screen and ask them to confirm. Don’t annoy them by making them enter the same thing twice (password is a special case as that is not visible to be able to check it).

1 Like

The sanitization is a separate issue from validation, it is to remove any security risks, such as SQL injections. I’m no expert on this, but it is my belief that this should be done at the very first instance, before the input can be passed on to any variable, script or db.
The validation on the other hand is to guard against other things, such as people making a typing error, people or bots which for some reason beyond my understanding desire to enter something other than a email address in the form, people who find amusement in entering someone else’s or a fictional address into the form.
While I take the point that the validation methods may not be 100% infallible, I believe they should be effective in the vast majority of cases, which makes them far from redundant IMO.

I don’t like to make users jump though hoops, but the sad, unfortunate state of the web today means we don’t just have to consider the users, but also the abusers who spoil the web for everyone. They make life difficult for developers by forcing the need for security, which can in turn spoil the UX for honest users. I don’t like it, but that’s the way it is. The initial validation filter FILTER_VALIDATE_EMAIL is not really a “hoop” for the user at all, as the server does all the work, the user only need enter a valid address, which is not a big ask.
The confirmation email, I agree is a little bit of a hoop, but should be quite effective at weeding out phonies.

Well, in many cases this is correct approach (however, SQL injection is not avoided with sanitization but with escaping of values or with prepared statements) but not always because there may be drawbacks to premature sanitization. Let’s focus on the specific case of email addresses - by sanitizing before validating you are effectively weakening the validation process because the validation script does not receive exactly the data that came from the user/client. The sanitization process may strip certain invalid characters from the input and then the validation process may regard it as valid while in fact it was not - because it contained invalid characters!

On the other hand, if you validate the email address first then if it passes validation you can be 100% sure it is correct (in the sense of correct format) and therefore there’s no need to sanitize it. That was my point.

1 Like

You should validate everything entered by the user themselves.

You should sanitize data that comes from sources where a user could tamper with it but is normally not expected to. The purpose of sanitizing is to prevent attacks made via such tampering. The info will no longer be valid if it is tampered with but at least after it is sanitized it can do no harm.

These steps come before any other processing. You only need one or the other depending on whether the data originated with the user or whether it originated elsewhere and is expected to be valid provided that an abuser doesn’t taper with it.

In the case of email addresses there is another process that some also call validation that comes later - others call it verification. That is the step where the user having received an email at the address they entered then clicks on a link in the email to confirm that it was them who entered it in the first place. This is the step that blocks abusers from entering millions of valid email addresses that don’t belong to them.

As an email address is not actually confirmed for use until passing through this last step there is no point in attempting to check for typos in what is originally entered as in that case they will not receive the verification email and will know almost straight away that they must have typed it wrong and can go back to correct it.

I’d argue that we should (almost) always escape rather than sanitize. If a user tampers with a string to inject SQL, for example, then we shouldn’t strip quotes and backslashes and other SQL characters. Or if a user tampers to inject <script>s, then we shouldn’t strip less than signs. Because there are lots of legitimate uses for those characters. And the same is true for emails. We should escape rather than delete.

I agree with Lemon that validating should come first. If a user enters a malformed address, then they should get a message back informing them. If instead you sanitize first, turning illegal characters into a valid address, then the user is never informed they made a mistake nor given a chance to fix it before submitting the rest of the form.

1 Like

What, exactly, is the legitimate use of a less than sign in a first name? In a phone number? In pretty much -any- text field? I think by far there would be more situations where a less than sign is NOT legitimate than there are those that are.

In the end, I would much, much rather strip down things I know I dont want stored. Saves space in my database, saves bandwidth in retrieving records, if the data is freely displayed to others it saves their page load times to not have to sanitize-on-display…inserting the data is a one-time operation. displaying is a multi-time.

Sure, there are characters that don’t make sense in these cases - so that’s where validation is a perfect tool for preventing them with the added benefit of letting the user know immediately about bad data.

Stripping down things leads to incorrect information and lack of awareness. If there is a real attempt to inject js, SQL, or any other bad data, then I’d rather have those things recorded in the database because that’s what allows me to know what is going on. I think stripping is a bad thing in most cases - there are valid uses for stripping but it should be an exception rather than the rule.

If you don’t want the data I entered to be stored in your database then let me know about it instead of stripping it behind my back. Otherwise there are situations that occur from time to time where a comment doesn’t make sense because apparently a system stripped a part of what the poster wrote. Sometimes people write really strange things in form fields because they don’t know any better - like explaining delivery methods in the surname field. It’s good to have all those data for reference instead of getting rid of them.

Well, in the times of multi-megabyte pages consisting of oversized scripts, unoptimized graphics and auto-playing videos, the few characters saved by sanitizing - is it even worth considering? :smiley:

1 Like

They are completely unrelated things.

Sanitizing is an input process that makes sure data is made to be valid before you try and process it at all. It is a substitute for validating where the data doesn’t come from user input but could potentially be tampered with by a mis-user.

Escaping is an output function that is done immediately before the data is fed into something where the data could be mistaken as code. Where the data can be kept separate from the code then escaping is nonsense. Escaping is always reative to the use you are making of the data on this specific occasion.

Each of these should be used where it is needed and not used where it is not needed. So comparing sanitizing and escaping and saying that one is more common than the other is nonsense.

Escaping is only necessary when the data and code get jumbled together and the data can legitimately contain something that can be mistaken for code. Then you need to escape your VALID data. If they can be kept separate or the data can not validly contain anything that can be mistaken for code then there is no escape function that handles the situation as there is nothing to escape.

If you process data without sanitizing or validating it first then your entire script is open to abuse and not just the final destination for the data.

If you’re sanitizing as a form of security to prevent tampering, then no, they’re not unrelated. You should use escaping, not sanitizing, to prevent tampering. Then validating/sanitizing (of which we should overwhelmingly favor validating) can be used to make sure data is meaningful to your application.

I think there are some good points being made here, but some may be tangential to the OP “email input”.

For an email address, it should be both valid (syntax) and “real” (verified / confirmed)
If it doesn’t make it past those tests then I don’t see why it should ever get anywhere near being inserted into a database. BUT it might be nice to let the user know there was a problem.

To me, sanitization is most important in ensuring a mal query won’t get near the database and escaping is most important when submitted content is used. eg. if a post has <script> I want it output as &lt;script&rt; so it displays the tag, not run it.

Take that exact same idea and apply it to SQL, then you don’t need sanitization to safeguard your database queries.

What about Bobby Drop Tables?