sandypodenco — 2013-02-04T04:51:27-05:00 — #1
I have searched but I can't find a simple method to remove links of the form www.website.com or .co.uk etc from a string. I have found regular expressions that remove urls that start with http:// but not straightforward www ones. Any suggestions?
This is the code I have got to remove urls from a string called $data:
$data = preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~|$!:,.;]*[A-Z0-9+&@#\/%=~|$]/i', '', $data);
This simply deletes any links, so it will remove for example http://www.junkwebsite.com/ but not www.junkwebsite.com.
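For illustration, the behaviour of that pattern can be reproduced with Python's `re` module (the thread's code is PHP/PCRE, but the syntax carries over here; the sample string is made up):

```python
import re

# The scheme-only pattern from the post above: it requires http(s)://,
# ftp:// or file://, so bare "www." links are left untouched.
pattern = r'\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~|$!:,.;]*[A-Z0-9+&@#/%=~|$]'

data = "visit http://www.junkwebsite.com/ or www.junkwebsite.com today"
result = re.sub(pattern, '', data, flags=re.IGNORECASE)
print(result)  # the http:// link is gone, the bare www. one survives
```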
Thanks for any suggestions.
2ndmouse — 2013-02-04T05:40:27-05:00 — #2
Would it be acceptable to simply disable the link by stripping dots and slashes?
sandypodenco — 2013-02-04T05:57:13-05:00 — #3
Thanks for the reply. I need to remove any link in its entirety. There will also be HTML in the string so I can't simply remove any HTML. I don't want to leave a link that is human readable either.
cpradio — 2013-02-04T05:59:33-05:00 — #4
I haven't had a chance to fully test this one yet (I will later), but this is my quick attempt before my morning coffee.
$data = preg_replace('/\\b((https?|ftp|file):\\/\\/|www\\.)[-A-Z0-9+&@#\\/%?=~_|$!:,.;]*[A-Z0-9+&@#\\/%=~_|$]/i', '', $data);
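In Python terms (an illustrative translation of the PCRE above; the sample text is invented), the added `www\.` alternative now catches bare `www.` links as well:

```python
import re

# Same as the scheme-only pattern, plus a "www." alternative in the prefix.
pattern = r'\b((https?|ftp|file)://|www\.)[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]'

data = "visit http://www.junkwebsite.com/ and www.junkwebsite.com today"
result = re.sub(pattern, '', data, flags=re.IGNORECASE)
print(result)  # both links are removed; ordinary words stay
```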
cups — 2013-02-04T06:57:58-05:00 — #5
Do you actually want to remove all <a href=""></a> tags from the string?
sandypodenco — 2013-02-04T09:25:33-05:00 — #6
Thanks for that. That gets rid of pretty much any link, the only exception being if someone types in website.com for example. Would it be easy to catch anything with a .com or .co.uk etc on the end as well?
sandypodenco — 2013-02-04T09:27:31-05:00 — #7
No. I just want to get rid of any actual websites, human or machine readable. What I want to avoid is someone adding a link or saying something like "check out this great website xxxxxx.com". I don't mind if the <a> tags are still there afterwards.
michael_morris1 — 2013-02-04T09:45:28-05:00 — #8
Not really possible. Humans can parse out www dot example dot com quite easily after all. You'll never come up with an expression that stops all possible ways of including a link reference in a message. You can strip anchor tags and perhaps any string ending in .com, but a persistent spammer will find a way to include the link.
cpradio — 2013-02-04T09:57:20-05:00 — #9
You could always use strip_tags, giving it a list of tags you want to allow
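For reference, PHP's `strip_tags($html, '<b>')` removes all tags except the listed ones. A rough Python analogue of that behaviour (illustrative only; like `strip_tags` itself it leaves attributes on allowed tags alone, so it is not an XSS filter on its own):

```python
import re

def strip_tags(html, allowed=()):
    """Remove HTML tags except those whose names are in `allowed`
    (a rough analogue of PHP's strip_tags)."""
    def repl(m):
        name = m.group(1).lower()
        return m.group(0) if name in allowed else ''
    return re.sub(r'</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>', repl, html)

print(strip_tags('<p>Hi <a href="http://x.com">link</a> <b>bold</b></p>',
                 allowed={'b'}))  # -> Hi link <b>bold</b>
```

Note that the link's visible text survives even though the `<a>` tag is stripped, which is why the human-readable case discussed earlier needs separate handling.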
cpradio — 2013-02-04T10:00:12-05:00 — #10
You could try this. It makes the scheme/www prefix optional, but requires a dot plus a TLD-style ending (.com, .co.uk and so on) so that ordinary words aren't stripped:
$data = preg_replace('/\\b((https?|ftp|file):\\/\\/|www\\.)?[A-Z0-9][-A-Z0-9]*(\\.[-A-Z0-9]+)*\\.[A-Z]{2,6}\\b(\\/[-A-Z0-9+&@#\\/%?=~_|$!:,.;]*)?/i', '', $data);
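Sketching the bare-domain idea in Python (an illustrative variant rather than an exact translation: it demands a dot followed by a 2-6 letter TLD-like ending, so ordinary words are not swallowed; the sample text is made up):

```python
import re

# Optional scheme/www prefix, then a host that must end in ".tld" (2-6 letters),
# then an optional path. Plain words without a dot are left alone.
pattern = (r'\b((https?|ftp|file)://|www\.)?[A-Z0-9][-A-Z0-9]*(\.[-A-Z0-9]+)*'
           r'\.[A-Z]{2,6}\b(/[-A-Z0-9+&@#/%?=~_|$!:,.;]*)?')

data = "check out website.com and junk.co.uk for great deals"
result = re.sub(pattern, '', data, flags=re.IGNORECASE)
print(result)  # the domains are removed, the surrounding words survive
```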
cups — 2013-02-04T11:16:23-05:00 — #11
Yeah, that was the solution I was angling towards - seems to offer a greater degree of protection too.
I wonder how effective it really is as a security measure, though?
cpradio — 2013-02-04T11:24:32-05:00 — #12
Not very, as you can put onmouseover attributes on almost anything and get XSS attacks of varying severity. That's why it is very important NOT to allow your users to write HTML that will be displayed directly, and instead to force the use of shortcodes (as WordPress does) or bbcodes (as forums do). You can then use strip_tags and keep full control of the output of the shortcodes and bbcodes.
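The escape-first-then-expand approach described here can be sketched like this (a minimal illustration with a made-up two-tag bbcode set, not a production parser):

```python
import html
import re

def render_bbcode(text):
    """Escape ALL user-supplied HTML first, then expand a small allowlist
    of bbcodes into known-safe markup, so the output stays fully controlled."""
    safe = html.escape(text)  # e.g. <img onmouseover=...> becomes inert text
    safe = re.sub(r'\[b\](.*?)\[/b\]', r'<strong>\1</strong>', safe)
    safe = re.sub(r'\[i\](.*?)\[/i\]', r'<em>\1</em>', safe)
    return safe

print(render_bbcode('[b]hi[/b] <img src=x onmouseover=alert(1)>'))
```

Because every raw tag is escaped before the bbcodes are expanded, the only HTML that ever reaches the page is what the renderer itself emits.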
sandypodenco — 2013-02-04T14:45:53-05:00 — #13
Thanks for that. What is it supposed to do differently from the code above?
sandypodenco — 2013-02-04T14:49:17-05:00 — #14
Agreed. These are all users who have registered and have had their identities verified in some way. I just want to make it impossible to add a working link and harder to add a human readable link, which the code does do quite well.
michael_morris1 — 2013-02-04T14:56:31-05:00 — #15
Outside of spammers, why is link sharing among users a problem for your site?
cpradio — 2013-02-04T14:57:48-05:00 — #16
It should remove website.com or .co.uk, whereas the original didn't.
sandypodenco — 2013-02-04T15:22:40-05:00 — #17
Because the pages are open to visitors as well as the users and I don't want visitors to the site distracted by endless links to other sites. If someone wants to advertise something there are paid options.
michael_morris1 — 2013-02-04T15:41:40-05:00 — #18
Well, your site, your rules. I don't know enough about the place to know what will and won't work in your specific case, but in general the more you try to control what users can post the more likely they will simply choose to post elsewhere. It's one thing to disallow link tags, but disallowing mentions of other domains is the sort of thing that would send me to your competitor pretty much immediately.
sandypodenco — 2013-02-04T15:56:28-05:00 — #19
I understand the point you are making but that's not a problem. Their information should be on their page, not on another page that they want to link to - there is no legitimate reason to want to add a link and I doubt if anyone will complain. In fact if I automatically remove links, they are less likely to be annoyed than if I manually removed them later. Rules is rules!