sergiozambrano — 2012-03-11T23:03:26-04:00 — #1
I'd like to make an online index of existing web pages.
The website is not mine, and it doesn't have a search tool, nor will it have one anytime soon.
I can download all the pages to my local computer and turn them into WordPress pages (I'm good at that, but not at SQL), but I think my missing link is how to correlate the content with the real online page. An existing tool or system for indexing pages would probably fill that gap, because I only need the content to create the index. After that, the content is useless.
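For what it's worth, the "correlate the content with the real online page" part can be sketched as a tiny inverted index that maps each word to the URLs it appears on. This is only an illustration: `build_index`, `search`, and the example.com URLs are made-up names, and the page text is assumed to be already stripped of HTML.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose text contains it.

    `pages` is a dict of {online_url: page_text}; keying by the real URL
    is what ties a search hit back to the live page.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical sample data, not taken from the real site.
sample_pages = {
    "http://example.com/a": "Hello board members",
    "http://example.com/b": "Board meeting notes",
}
sample_index = build_index(sample_pages)
```

Once the index exists, the downloaded text itself is no longer needed; only the word-to-URL mapping has to be kept.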
rammurtee — 2012-03-12T17:30:49-04:00 — #2
You can check whether the website has a sitemap. WordPress blogs and most other blogs generally have one, which you can access at url/sitemap.xml. If it's not a blog, then you have a real problem.
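If a sitemap does exist, pulling the page URLs out of it is short. A minimal sketch, assuming the file follows the standard sitemaps.org format; `extract_sitemap_urls` and the sample XML are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


# Hypothetical sitemap content; a real one would be downloaded first.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>"""

urls = extract_sitemap_urls(SAMPLE_SITEMAP)
```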
sergiozambrano — 2012-03-15T08:14:51-04:00 — #3
OK, I've given a PHP script a try:
The problem is… I can't get it to identify itself with a browser's user agent, and it keeps obeying the robots.txt file, failing to index the pages marked as disallowed… or at least so says the error message. I tried changing the if conditionals in a few places to make it not find the robots file, or ignore it, but it didn't work. Sphider-plus behaved the same way.
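As a rough illustration of the user-agent part: robots.txt is only consulted when the crawler's own code chooses to read it, so a plain HTTP request never looks at it. A minimal sketch using Python's standard library; `build_browser_request` and the user-agent string are assumptions for the example, not something taken from Sphider:

```python
import urllib.request

# Example browser-like user-agent string (an assumption; any real browser's string works).
BROWSER_USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:10.0) Gecko/20100101 Firefox/10.0"


def build_browser_request(url, user_agent=BROWSER_USER_AGENT):
    """Build a request that identifies itself with a browser's user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download one page. robots.txt is never read by this code."""
    with urllib.request.urlopen(build_browser_request(url), timeout=30) as response:
        return response.read()


# Hypothetical URL; building the request does not touch the network.
request = build_browser_request("http://example.com/org_board.p_main")
```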
If anyone knows how to do it, I'd appreciate the tip.
tongdonny — 2012-03-15T11:02:37-04:00 — #4
WordPress blogs and most other blogs have a sitemap, which you can access at url/sitemap.xml.
sitemapgenerator — 2012-03-24T13:10:02-04:00 — #5
If you switch off easy mode in A1 Sitemap Generator, you can configure the "webmaster filters" tab to ignore nofollow, noindex, robots.txt etc. (My guess is that other crawler/sitemapper solutions may have similar options if you look for them. Try asking the developer of the script/program you use!)
sergiozambrano — 2012-03-26T08:31:11-04:00 — #6
Just to stop the ball of comments not related to my case:
As I said before, the site is not mine.
It doesn't allow robots somehow, because I can browse the pages but sphider can't get them.
The site is OLD and custom made. It's not WordPress, nor does it use plugins.
If it does, I don't have access to it to edit a thing.
Let's start from the fact that the site is as I described, there's no way to change it, and I'll access it from an external server.
It might have a "search" feature, but I don't know the query strings/variables commonly used by old forum platforms, so I can't test for it. (Again, I think it was completely custom made, but you know, there have always been trends in programming.)
Can you tell?
The URLs end with (e.g.) /org_board.show_msg?an_msg_id=1787112
and the main boards page is at /org_board.p_main
Can you guess the search query?
john_betong — 2012-03-26T08:42:26-04:00 — #7
Take a look at Xenu. Its reporting on external sites is comprehensive, so it may help.
sitemapgenerator — 2012-03-26T09:56:29-04:00 — #8
It sounds straightforward. (But might not be.) No matter what crawler software you end up using, you should make sure it ignores robots.txt and nofollow/noindex instructions. Also, from your description, the website's URLs have non-standard file extensions, which means you should probably remove the file extension list(s) in whatever crawler software you use and rely on MIME types instead. (That would be relevant for my suggestion, at least.) If you have trouble indexing the website with some crawler tool, I recommend you contact the developer of the tool and ask.
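To make the MIME type point concrete: a URL such as /org_board.show_msg has no useful extension, so the decision has to come from the Content-Type header of the response. A minimal sketch; `is_html_response` is a made-up helper, not an option of any particular crawler:

```python
# Media types treated as indexable pages in this sketch (an assumption).
HTML_MEDIA_TYPES = ("text/html", "application/xhtml+xml")


def is_html_response(content_type):
    """Decide from a Content-Type header value whether the body is an HTML page.

    Parameters such as "; charset=ISO-8859-1" are ignored, and the
    comparison is case-insensitive.
    """
    if not content_type:
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_MEDIA_TYPES
```

A crawler built this way would request each URL, check the header, and only index the body when the check passes.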
sergiozambrano — 2012-03-27T05:06:12-04:00 — #9
Stupidly, I didn't check HOW the links appear, just where they pointed.
At least I know what the pages are called, and I can increment the query string while downloading.
Is there any PHP script or Mac software (or Firefox/Chrome extension?) to download web pages from a URL range?
That won't index the original pages but I'll be able to create a DB I can work with.
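The "increment the query string while downloading" idea could look roughly like this. It is only a sketch: the example.com host, the ID range, the one-second delay, and the function names are assumptions; only the /org_board.show_msg?an_msg_id= pattern comes from the URLs mentioned earlier in the thread.

```python
import time
import urllib.request
from urllib.parse import urlencode


def message_urls(base_url, first_id, last_id):
    """Yield (message_id, url) pairs for every ID in the inclusive range."""
    for msg_id in range(first_id, last_id + 1):
        query = urlencode({"an_msg_id": msg_id})
        yield msg_id, base_url + "/org_board.show_msg?" + query


def fetch_page(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_range(base_url, first_id, last_id, fetch=fetch_page, delay=1.0):
    """Download a range of message pages, keeping each URL next to its content.

    IDs that fail to download (missing messages, network errors) are skipped.
    """
    pages = {}
    for msg_id, url in message_urls(base_url, first_id, last_id):
        try:
            pages[msg_id] = (url, fetch(url))
        except OSError:  # urllib's URLError and HTTPError are OSError subclasses
            continue
        time.sleep(delay)  # pause between requests so the old server isn't hammered
    return pages
```

Storing the original URL alongside each downloaded page is what later lets a local index point back at the real online page.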
jake_arkinstall — 2012-03-27T06:39:18-04:00 — #10
I haven't come across one, but it'll do you some good to learn how to do it with PHP yourself. Dependence on available software to do a basic task like that is never a good thing.