How do I create an index from an existing website?

sergiozambrano · March 12, 2012, 3:03am

I’d like to make an online index of existing web pages.
The website is not mine, but it doesn’t have a search tool and won’t have one anytime soon.

I can download them all to my local computer and turn them into WordPress pages (I’m good at that, but not at SQL), but I think my missing link is how to correlate the content with the real online page. If I had an existing tool or system to index pages, that would probably fill the gap, because I don’t really need the content other than to create the index. After that, the content is useless.

Any idea?

rammurtee · March 12, 2012, 9:30pm

You can check whether the website has a sitemap. Generally, WordPress blogs and most other blogs have a sitemap, which you can access at url/sitemap.xml. If it’s not a blog, then you have a real problem.
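As a minimal sketch of what reading such a sitemap could look like (Python is used here only for illustration; the sample XML and the function name `urls_from_sitemap` are assumptions, not something from the site in question), the page URLs sit in the `<loc>` elements of the standard sitemap format:

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


# Hypothetical sitemap content, standing in for whatever url/sitemap.xml returns.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>"""

for url in urls_from_sitemap(sample):
    print(url)
```

If the site has no sitemap at all, this approach does not apply and the pages have to be discovered another way.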

sergiozambrano · March 15, 2012, 12:14pm

OK, I’ve tried a PHP script:
The problem is… I can’t get it to identify itself with a browser’s user agent, and it keeps obeying the robots.txt file, failing to index the pages marked as disallowed… or at least that’s what the error message says. I tried changing the if conditionals in a few places to make it not find the robots file, or ignore it, but it didn’t work. sphider-plus behaved the same way.
If anyone knows how to do it, I’d appreciate the tip.
Thanks.

tongdonny · March 15, 2012, 3:02pm

WordPress blogs and most other blogs have a sitemap, which you can access at url/sitemap.xml.

SitemapGenerator · March 24, 2012, 5:10pm

If you switch off easy mode in A1 Sitemap Generator, you can configure the “webmaster filters” tab to ignore nofollow, noindex, robots.txt etc. (My guess is that other crawler/sitemapper solutions may have similar options if you look for them. Try asking the developer of the script/program you use!)

sergiozambrano · March 26, 2012, 12:31pm

Just to stop the ball of comments not related to my case:

As I said before, the site is not mine.
It seems to block robots somehow, because I can browse the pages but Sphider can’t fetch them.
The site is OLD and custom made. It’s not WordPress, nor does it use plugins.
Even if it did, I don’t have access to edit anything.

Let’s start from the fact that the site is already as I said, no way to change it, and I’ll access it from an external server.

It might have a “search” feature, but I don’t know the query strings/variables commonly used by old forum platforms, so I can’t test for it. (Again, I think it was completely custom made, but you know, there have always been trends in programming.)

Can you tell?
The URLs end with (e.g.) /org_board.show_msg?an_msg_id=1787112
and the main boards page is at /org_board.p_main
Can you guess the search query?

John_Betong · March 26, 2012, 12:42pm

Take a look at Xenu; its reporting on external sites is comprehensive, so it may help.

SitemapGenerator · March 26, 2012, 1:56pm

It sounds straightforward. (But might not be.) No matter what crawler software you end up using, you should make sure it ignores robots.txt and nofollow/noindex instructions. Also, from your description, your website URLs have non-standard file extensions, which means you should probably remove the file extension list(s) in whatever crawler software you use and depend on MIME types instead. (That would e.g. be relevant for my suggestion at least.) If you have trouble indexing your website with some crawler tool, I recommend you contact the developer of the tool and ask.
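To illustrate the MIME type point, here is a small sketch in Python (an assumption for illustration; the helper name `is_html` is hypothetical and not part of any crawler mentioned here). It decides whether a response is an HTML page from its `Content-Type` header rather than from the URL’s extension, which matters for URLs like /org_board.show_msg?an_msg_id=1787112 that have no recognisable extension:

```python
def is_html(content_type):
    """Return True if a Content-Type header value denotes an HTML page."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=ISO-8859-1" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in ("text/html", "application/xhtml+xml")


# The header value would normally come from the HTTP response, e.g.
# urllib.request.urlopen(url).headers.get("Content-Type").
print(is_html("text/html; charset=ISO-8859-1"))
print(is_html("image/gif"))
```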

sergiozambrano · March 27, 2012, 9:06am

Stupidly, I didn’t check HOW the links work, just where they pointed.
It seems the links open the pages I want with JavaScript, which Sphider can’t process.

At least I know how the pages are called, and I can increment the query string while downloading.

Is there any PHP script or Mac software (or Firefox/Chrome extension?) to download web pages from a URL range?
That won’t index the original pages but I’ll be able to create a DB I can work with.

Any idea?

Jake_Arkinstall · March 27, 2012, 10:39am

I haven’t come across one, but it’ll do you some good to learn how to do it with PHP yourself. Dependence on available software to do a basic task like that is never a good thing.
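As a rough sketch of how little is involved (written in Python rather than PHP purely for illustration; the host `example.com`, the function names, and the one-second delay are all assumptions), the loop below increments the `an_msg_id` query parameter and stores each page keyed by its original URL, so the downloaded content stays linked to the real online page:

```python
import time
import urllib.request
from urllib.parse import urlencode

# Hypothetical host; only the path and parameter name come from the thread.
BASE = "http://example.com/org_board.show_msg"


def message_urls(first_id, last_id):
    """Yield one URL per message id in the inclusive range."""
    for msg_id in range(first_id, last_id + 1):
        yield BASE + "?" + urlencode({"an_msg_id": msg_id})


def download_range(first_id, last_id, delay=1.0):
    """Fetch each page and return a dict mapping URL to raw page bytes."""
    pages = {}
    for url in message_urls(first_id, last_id):
        # Send a browser-like User-Agent header with the request.
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                pages[url] = resp.read()
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            print("skipped", url, exc)
        time.sleep(delay)  # pause between requests to avoid hammering the server
    return pages
```

Calling `download_range(1787112, 1787120)` would attempt nine pages; ids that do not exist are reported and skipped rather than stopping the run.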