Extracting all data from the web: text, video, images, etc.

There is a website, www.howstuffworks.com, and I want to extract all of its data (text, audio, video, Flash files and all) and put it into my own site. Can you recommend any website or tool that extracts data like that?

It sounds like you are talking about stealing that site’s content. Is that so, or is this your own site, or …? :-/

No, some sites allow you to copy their material.

However, HowStuffWorks is very clear that you may not republish any of it.

You may ~not~ copy their content except for your own personal use.

The materials available through the Discovery Sites are the property of Discovery or its licensors, and are protected by copyright, trademark and other intellectual property laws. You are free to display and print for your personal, non-commercial use information you receive through the Discovery Sites. But you may not otherwise reproduce any of the materials without the prior written consent of the owner. You may not distribute copies of materials found on the Discovery Sites in any form (including by e-mail or other electronic means), without prior written permission from the owner.

You cannot use content from howstuffworks.com to build your own website without prior written permission from the site's admin staff. It actually says so in the Terms & Conditions, but you have to read almost the entire page to find that out.

However, you could build a website that contains only links, with short descriptions, to howstuffworks.com and to similar websites with tutorial videos, such as YouTube. That falls within the 'fair use' rules, because what appears on your site would not be a duplicate copy and would not contain all the information the user needs. If the site were actually powered by something like the Bing or Google custom search API, the 'safe harbor' agreement they operate under gives you immunity from any claims of copyright infringement, because your website would be just a pocket-size version of the larger search engine, set up to provide specific search information.
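For what it's worth, here is a minimal Python sketch of that "links plus short descriptions" approach, assuming you have set up a Google Programmable Search Engine limited to the sites you want to point at and have an API key. The key, cx value and query below are placeholders, and the third-party requests package is assumed to be installed; this is an illustration, not the only way to do it.

import requests

API_KEY = "YOUR_API_KEY"         # placeholder
CX_ID = "YOUR_SEARCH_ENGINE_ID"  # placeholder

def search_links(query, limit=5):
    # Ask the Custom Search JSON API for results and return only pointers:
    # title, link and a short snippet, never the full page content.
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX_ID, "q": query},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])[:limit]
    return [(item["title"], item["link"], item.get("snippet", "")) for item in items]

for title, link, snippet in search_links("how car engines work"):
    print(title)
    print("  " + link)
    print("  " + snippet)

Each result is just a title, a link and a snippet, which is the kind of pointer page described above rather than a copy of anyone's content.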

Sadly, websites built like this do not perform very well in my experience. Having a forum as well, so users can generate content, helps, but it will take a while to build visitor traffic.

Unless you are an attorney, please stop referencing the Fair Use Rule because you have it all wrong.

Don't do it. How would you feel if you worked very hard on a site and then somebody just took your content? That is stealing.

M'lud, now I am confused… lol!
So are you saying that if I were an attorney, my statement would have been correct or acceptable to you?

No, I'm saying that if you were an attorney you would have given the appropriate information to begin with. Fair use is a very touchy subject and is decided case by case by a judge in a court of law. However, one of the major aspects of fair use is that it is for non-profit, educational or journalistic use only. That, in itself, is a pretty basic summary.

However, spouting copyright law on a public forum with no references given to its credibility is just, IMO, bad judgement and we don’t recommend that here. Law is built on fact, not opinion.

This particular conversation has run its course. If you really want to discuss legal issues, please visit the business and legal forum to do so… and while you are there, you might do a search on fair use, as it is a topic that has been covered several times in that forum. This forum and this discussion are about white-hat practices in acquiring and adding legitimate content to a website.

Could that actually be possible?
To copy a whole website's videos, pictures, content, etc. for your own personal use?
If so, how?

Copy/paste… page by page, image by image, video by video. But why on earth would anyone want to do that when it is much easier just to bookmark a site in a browser?

You know the answer to your question. Use it for personal reference, when you have no internet and only a computer. Just plug in your handy-dandy flash drive and boom!
Voilà… you have yourself a professionally working website that doesn't use the internet. Lol, I was just kidding with all this nonsense. But isn't there another way than copying images and so on one by one?

You can use the “Save page as” (or similar) option under the “File” menu in your browser to save a complete page. (I'm using Firefox, but I presume other browsers are the same.) I don't know whether it will save scripts; I've never tried it with a page that uses them. I don't know of any way to save an entire site.
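If you'd rather do the same thing from a script than from the browser menu, here is a rough Python sketch (standard library only) that saves one page's HTML for personal offline reference. The URL and filename are placeholders, and, like "Save page as", it grabs only the HTML itself, not images or scripts.

import urllib.request

def save_page(url, filename):
    # Fetch a single page and write the raw HTML to a local file.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        html = resp.read()
    with open(filename, "wb") as f:
        f.write(html)

save_page("https://www.example.com/some-article.html", "some-article.html")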

You certainly can do that, but it hotlinks everything from the site, so if something changes, the page will as well. For instance, removed images will show up as placeholders instead of the images. Additionally, hot-linking is pretty poor web etiquette. And finally, trying to do that on a large site such as How Stuff Works would be a nightmare.

There is absolutely nothing I can think of that you would accomplish by downloading someone else's whole site for your own use when all you have to do is open your browser and click on a bookmark to revisit it.

When I first started web design (which wasn’t so long ago), there were programs that purported to download a whole site for you. But I honestly don’t see the point—not for any legitimate reasons, anyhow. In the future, HTML5 sites may offer a download facility that allows you to view the content offline, but that’s a choice the site owner has to make.

What's the point of scraping that site? Duplicate content penalties plus a possible lawsuit = not a happy situation. One possible legal use is not to republish or syndicate it, but to review its structure or make an archival copy for personal use. I'm not sure about the latter, though.

Firefox does save the images, but not background images - they don’t appear at all. I’ve only really used it where I’ve wanted to print something - generally a knitting pattern - and it’s not been set up in a way that will print well, e.g. the pattern is a narrow column that prints over umpteen pages. Then I sometimes save the page and edit my local version to print in a more useful manner. I don’t need to keep coming back to it.

Unless you are doing something like TechnoBear has mentioned, web scraping and web copying are just bad and in most cases illegal. As Shyflower indicated, hot-linking is also bad because it steals the bandwidth that someone else pays for to host the images and content, not to mention that taking this information can in many cases also be illegal. Companies are increasingly prosecuting sites that have scraped content, as scrape-checking bots are becoming very good at finding stolen material. You don't want to get mixed up in this unless it is for personal use, or you are VERY clear that the site owner allows it and that they haven't stolen their content from somewhere else. How do you know or guarantee that? I don't think you can.

You want to use httrack, but as other posters have mentioned, this might not be the best idea.
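For reference, the basic invocation is along the lines of httrack "https://www.example.com/" -O ./mirror, where -O sets the local folder the copy goes into; the URL here is just a placeholder, and again, this only makes sense where the site's terms allow a personal copy.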

I use webcontent extractor; it simply downloads all the information into tables so that you can reuse it anywhere easily.
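If you would rather not rely on a commercial tool, the same "extract into a table" idea can be sketched in a few lines of Python using the third-party requests and BeautifulSoup packages. The URL is a placeholder, the example only pulls link text and addresses into a CSV file, and as above it only belongs on sites whose terms permit it.

import csv
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/"
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

with open("links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "href"])
    for a in soup.find_all("a", href=True):
        # One row per link: the anchor text and the address it points to.
        writer.writerow([a.get_text(strip=True), a["href"]])

The resulting CSV opens in any spreadsheet, which is roughly the kind of table output the tool above produces.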