dez — 2010-10-16T07:07:28-04:00 — #1
Ok, first of all, I'm fully aware that if I put text into HTML, someone could easily use that text on their own website. Now, supposing, just supposing, that you've spent many, many years researching something and then put those results on your website, and you wanted to make it as hard as possible for someone to copy that text, how would you do it, BUT still have the text spiderable by bots, please?
Any help appreciated.
felgall — 2010-10-17T16:35:18-04:00 — #2
The most effective way to achieve what the OP is asking for is to use a PDF with copy protection turned on. Breaking that copy protection may be possible, but it is still far more effective than anything that can be done in HTML, apart from suggestions like those already made, which would make the page totally unusable to a large number of potential visitors while still being as easy to bypass as the protection in a PDF (for example, the alternate-letter approach could easily be defeated by taking a screenshot and feeding it through a decent OCR).
mittineague — 2010-10-17T16:28:45-04:00 — #3
I am unfamiliar with that thread. I like the idea - but only as an exercise in programming.
I imagine it would be fairly easy to do if the text were monospace; otherwise the layers would look a mess.
And you could throw accessibility out the window. Imagine a screen reader user getting something like
<div id="layer1">W l o e t y w b i e e l f e o r a , b t d n t y u d r o y a y h n !</div>
<div id="layer2"> e c m o m e s t . F e r e t e d u o ' o a e c p n t i g </div>
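Roughly, the splitting could be sketched like this (a toy Python sketch; the function names are made up for illustration, not from the original thread):

```python
from itertools import zip_longest

def split_layers(text):
    """Split text into two 'layers' of alternating characters."""
    return text[::2], text[1::2]

def merge_layers(layer1, layer2):
    """Interleave the two layers back into the original text."""
    return "".join(a + b for a, b in zip_longest(layer1, layer2, fillvalue=""))

original = "Welcome to my website!"
layer1, layer2 = split_layers(original)
# Each layer alone is the gibberish a screen reader would announce;
# only when the two layers are stacked visually does the text read normally.
assert merge_layers(layer1, layer2) == original
```

Copying either div on its own gets you exactly the nonsense shown above, which is also what assistive technology and any sensible bot would see.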
stevie_d — 2010-10-17T13:22:19-04:00 — #4
I remember that technique being discussed; the unanimous conclusion was that if you thought it was a good idea to inflict that on people, you probably shouldn't be let out on your own!
oddz — 2010-10-16T20:40:05-04:00 — #5
Taking the idea of using an image further, you could program a page to generate the text as an image, then serve that based on the user agent. That would surely stop idiots.
If I had to do this I would write a script that takes several arguments and builds out an image of the text via an external script, similar to how you would generate images from the database or from behind the site root. Some useful arguments would be, perhaps, the font, width, etc., so that I can control the layout of the text and have it fit the design. Perhaps even take it further and integrate the concept of columns into the mix. I would then cache each unique image generated, to avoid the intensive process of building out the image on every request after the first.
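A rough sketch of the caching side in Python (the render step is stubbed out - in practice it would call an image library such as Pillow; all the names here are made up for illustration):

```python
import hashlib

_image_cache = {}  # in production this would be files on disk or a CDN

def _render(text, font, width):
    # Stand-in for the expensive image-generation step (e.g. drawing the
    # text with an image library). Here it just returns placeholder bytes.
    return f"[image:{font}/{width}] {text}".encode()

def text_image(text, font="serif", width=600):
    """Return the rendered image for these exact arguments, building it
    only on the first request and serving the cached copy afterwards."""
    key = hashlib.sha256(f"{font}|{width}|{text}".encode()).hexdigest()
    if key not in _image_cache:
        _image_cache[key] = _render(text, font, width)
    return _image_cache[key]
```

Hashing the full argument list means each unique (text, font, width) combination is rendered exactly once.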
dez — 2010-10-16T13:32:32-04:00 — #6
Thanks all, the answers were expected. So, yep, all content can be transformed into text, but some methods make that harder for other people than others. Which methods would be the hardest to transform, but still be spiderable, please?
One other thing, what does the bit below mean ?
"published content = shared content = IP is the only way to protect it"
bluedreamer — 2010-10-16T11:02:48-04:00 — #7
I think this goes back to the old saying - "if you don't want it copied, don't publish it". Not what you wanted to hear, but it's well-nigh impossible to protect any web content.
spvince — 2010-10-16T13:45:29-04:00 — #8
Here's another idea for you which may be a little 'out-of-the-box' but would still achieve what you need, and may in fact be a better option.
Create an eBook from your research content, and put it onto Amazon.
This way you can still have all the keyword terms for spiders within the description, and let Amazon cover the security / copy protection side of things.
Amazon predicts that it will sell more e-books than paperbacks by the end of next year, so you could gain a few pennies in the process.
Hope that helps,
system — 2010-10-16T13:38:53-04:00 — #9
Protecting it will generally break search engines/spiders, just as it usually breaks accessibility. HTML is just not designed for this, and attempting to do so is a total waste of time and effort.
"IP is the only way to protect it" means intellectual property rights - copyright it (which is instant/free for web publishing now) and sue people that copy it.
ANY other approach and you are just spinning your wheels over nothing.
mittineague — 2010-10-16T08:20:25-04:00 — #10
AFAIK, PDF files can be crawled but only the "text" portion. Equivalent to reading the text as rendered in HTML - i.e. no tag attribute values
spvince — 2010-10-16T07:15:57-04:00 — #11
Display all text as images, but with 'Alt' text??
dez — 2010-10-16T08:10:12-04:00 — #12
Thanks Vince and Mittineague, it's appreciated. The image alt text would be tricky with so much text.
How about pdf's ? Are they spiderable ?
system — 2010-10-16T12:15:00-04:00 — #13
image = ocr to text = text
pdf text = edit-copy protection removed = text
pdf image = ocr to text = text
what's left: flash, silverlight, video. all of which can be transcribed/transformed if needed.
published content = shared content = IP is the only way to protect it
system — 2010-10-16T12:06:03-04:00 — #14
In fact, it defeats the POINT of publishing on the Internet.
stevie_d — 2010-10-16T18:04:31-04:00 — #15
To be honest, the harder you make it to copy, the more likely people are to copy and re-publish it, purely out of spite. There are many legitimate reasons for people wanting to highlight and copy text that don't involve plagiarism, and if you try to interfere with that in any way then you and your site will become very unpopular. And the people who have copied your content and made it accessible on their websites will get all of your traffic.
If people want to copy it, they will find a way to copy it. If you're going to publish it, you can't stop them, and it isn't worth trying - and that is particularly true if you want it spiderable. Any method you use will make it more difficult for people to read and interact with the site in the way that they want to.
Sure, some people may copy what you've written. Is that the end of the world?
spacephoenix — 2010-10-17T16:08:31-04:00 — #16
Can you remember what the title of that thread was, or which forum it was in? I tried to find it via Google but haven't had much luck so far.
stevie_d — 2010-10-17T19:32:49-04:00 — #17
Yes, that's pretty much exactly how it went. And yes, you would have to use monospace fonts. I think you would also have to use a non-breaking space for every replaced letter, to make sure that lines break nicely between words.
spacephoenix — 2010-10-16T22:35:12-04:00 — #18
One technique I read about a couple of years ago (I think it may have been a thread somewhere in the SitePoint forums) was the use of two layers, or something like that: they had alternate letters on each, and when combined the whole of the text was viewable. It could probably be bypassed, though, via screen-grabs and OCR.
Possibly felgall or AlexDawson might be able to remember what the technique was called.
mittineague — 2010-10-16T07:33:17-04:00 — #19
IMHO any effort you expend towards this will be wasted. If spider bots can get the text, so can scraper bots. Once anything is online it's "out there" forever.
So what would I do?
Hire a Lawyer and sue anyone that used it illegally.
system — 2010-10-16T14:14:30-04:00 — #20
research usually means discoveries. if it's innovation, then a patent will keep you safe.
if you're selling a solution, take example from those promoting "10 ways to get rich". they blab about it w/o saying anything tangible, but they manage to slip in all the key words. then, based on a subscription, you buy or read the methods.
same for you: blab about it, slipping in all the key words for the bots to find in a normal web page, but keep the essential part out of it. don't bother with pdfs or images. build a subscription or buying mechanism for those wanting the essentials.
or just put it out in the open, using a licence.