Hi, I'm having a problem developing a system that will collect data from other sites/databases and put it into my own DB. I'll try to explain with an example:
I have a table called BOOKS, with a user-defined structure, like this:
ID | Author | Year | Title
1 | J. Smith | 1992 | "Something"
Now I let the user type in the address of a website; let's say it contains 30 new books. What I want is to fetch all of them and reproduce them in the BOOKS table. Doing this by analyzing the content and putting it into the corresponding table fields could turn out to be a very, very tough job, since the table structure and content are user-defined. Perhaps letting the user select author by author, year by year, title by title for all of the books could be a solution, but a very inefficient one if the number of books reaches, say, 100.
What would you recommend here - how should I handle this tricky situation with minimum involvement from the user, without designing a content-understanding robot? I appreciate any help.
It sounds like there are multiple parts to your question:
How to handle parsing content into data bits? This is pretty tough, though not impossible. If all you have is HTML to look at, and the structure of the code isn't consistent (given that it's coming from different websites), how are you going to determine which bit of information is the title, which is the author, and so on? Breaking the content into usable pieces of information is your first challenge.
Storing that information in the database will be the easiest part here. You would have to clean the data of special characters (or use parameterized queries, which handle that for you) and then just run an INSERT query against your MySQL database. It sounds like you're also referring to sorting and organizing that data in the database for users to search and view. With a couple of carefully crafted queries you can handle sorting, pagination, and even text-based searches.
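To illustrate the "easy part": a minimal sketch of inserting a parsed book with a parameterized query. The thread is about PHP/MySQL, but the sketch below uses Python with an in-memory SQLite table just so it's self-contained; the same placeholder idea applies with mysqli/PDO in PHP.

```python
import sqlite3

# In-memory SQLite stands in for the real MySQL BOOKS table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (id INTEGER PRIMARY KEY, author TEXT, year INTEGER, title TEXT)"
)

def insert_book(conn, author, year, title):
    # Placeholders let the driver handle quoting and special characters,
    # so no manual escaping of the scraped text is needed.
    conn.execute(
        "INSERT INTO books (author, year, title) VALUES (?, ?, ?)",
        (author.strip(), year, title.strip()),
    )

insert_book(conn, "J. Smith ", 1992, '"Something"')
row = conn.execute("SELECT author, year, title FROM books").fetchone()
```

The point is that once the data is broken into (author, year, title) pieces, the database side is routine.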
So which one are you in need of help with, #1 or #2?
Yes, thanks for the quick reply. You're right about the first part, and that is definitely the part I'm interested in solving. Determining what is what before inserting it into the DB is the main step that needs to be understood; unfortunately, I'm not sure how to do it without recognizing words. I was thinking of some sort of user interaction that would make this job easier - perhaps you could throw out some ideas?
Well it certainly sounds like a fun project, but it may be beyond my expertise. So, take what I say with a grain of salt.
I would start by getting very familiar with regular expressions. I'm assuming you'll be using PHP for this project, given that you've posted in the PHP forum. If regular expressions are new or foreign to you, don't freak out. They're not the easiest thing in the world to grasp, at least they weren't for me. Take your best shot and the folks around these forums will be happy to help you out, though they're not going to do all the heavy lifting for you. http://www.lmgtfy.com/?q=php+regular+expressions
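Here's a tiny sketch of what a regex-based extraction looks like, using a made-up HTML snippet from one hypothetical site whose markup we already know. It's written in Python for brevity (the thread assumes PHP; `preg_match_all` with the same pattern works the same way). Note that a pattern like this only works for one site's markup, which is exactly the limitation being discussed.

```python
import re

# Hypothetical markup from one specific site; the class names are
# invented for illustration.
html = (
    '<li class="book"><span class="title">Dune</span> by '
    '<span class="author">F. Herbert</span></li>'
)

# Named groups pull out the pieces we care about.
pattern = re.compile(
    r'<span class="title">(?P<title>[^<]+)</span> by '
    r'<span class="author">(?P<author>[^<]+)</span>'
)

books = [m.groupdict() for m in pattern.finditer(html)]
```

As soon as the site changes its markup, the pattern breaks, so each source needs its own pattern and ongoing maintenance.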
You have two options. You could either spend 6 years writing algorithms to decipher the content of any HTML page and recognize author's names, book titles, ISBN numbers, etc. - or you could figure out how to crowd source your efforts. Coming up with a way to allow people to help you identify the content on a page would probably be ideal. There have been some popular attempts at tasks such as this, some more successful than others. I remember when Google wanted help tagging images for searches. They created some sort of game or activity which allowed users to quickly and easily add tags or descriptions to images they found. Clever, but it wasn't immensely popular. In fact, they recently shut down the service entirely. Another example, which is wildly successful, is reCaptcha. They decided that instead of just making up curly and distorted words for users to attempt to read in order to prove they were human, they would present users with real words from a scanned book. Essentially they crowd sourced humans to form the most powerful OCR software available. It worked so well that Google bought them. My point is that you may want to come up with a reason for your users to help you out here.
As far as the technology behind something like that... My thought is that maybe you give the user a toolbar or browser extension by which they can draw boxes around or highlight text on a particular web page. They can then label that text as a book title, an author, etc. Maybe by logging this information they're also adding that book to their wish list or something like that. Again, give them a purpose to participate. This isn't going to be easy.
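To make the highlight-and-label idea concrete, here's a sketch of the server side receiving what such an extension might send back. The payload shape and field names are invented for illustration; Python is used for brevity, but the same validation would be a few lines of PHP.

```python
import json

# Hypothetical payload a highlight-and-label browser extension might
# POST back after the user boxes some text and tags it.
payload = json.dumps({
    "url": "http://example.com/books",
    "selections": [
        {"label": "title", "text": "Dune"},
        {"label": "author", "text": "F. Herbert"},
    ],
})

ALLOWED_LABELS = {"title", "author", "year"}

def parse_labeled_selections(raw):
    """Turn the extension's labeled selections into one record,
    ignoring any labels the server doesn't recognize."""
    data = json.loads(raw)
    record = {}
    for sel in data["selections"]:
        if sel["label"] in ALLOWED_LABELS:
            record[sel["label"]] = sel["text"].strip()
    return record

record = parse_labeled_selections(payload)
```

The user does the hard "which text is the author?" step, and the server only has to validate and store labeled fields.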
Maybe it's time to think backwards into the situation. Is the end goal to build a giant database of books? If so, somebody has probably already beat you to it, for example, every public library. Depending on how much that database is worth to you, you could probably purchase such a list.
To do this properly, you would need to analyze every source and develop what would likely be a separate algorithm for scraping each page - perhaps even more than one algorithm per source if there isn't a pattern in how things are marked up. That sounds daunting and is in no way an easy task, because you are essentially asking to build a content-understanding robot.

The only other reliable solution would be to see whether the source sites have some type of feed you can hook into. Even then you would probably need separate translators for every website, but at least the data would be defined in an XML form to some extent.

Otherwise, there would have to be a mechanism for the user to assign context to the content you extract. I don't really see any other way, considering HTML lacks context at the level of granularity required to achieve your end goals - and even if HTML did provide that level of context, not everyone would follow it anyway. It sounds like something destined to be riddled with bugs and failure given the limitations of the technology, unless you hand off work to the user in some way. But then you have to rely on user input, and users are pretty unreliable.

That said, how many source sites do you have in mind at this time? If it's a small number - 2 or 3 - writing separate algorithms to scrape the pages might be a feasible option. If you want someone to be able to type in ANY URL, though, what you require is going to be just about impossible without some type of involvement from the user.
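The "separate translators for every website" idea can be sketched like this: each source gets its own small function that maps that site's feed markup onto one shared record shape. The feed format and tag names below are invented for illustration (Python again for brevity; the PHP equivalent would use SimpleXML).

```python
import xml.etree.ElementTree as ET

# A made-up feed from one hypothetical source site.
feed = """<catalog>
  <item><name>Dune</name><writer>F. Herbert</writer><published>1965</published></item>
</catalog>"""

def translate_example_site(xml_text):
    """Per-site translator: this site's <name>/<writer>/<published>
    tags become the common title/author/year record shape."""
    records = []
    for item in ET.fromstring(xml_text).findall("item"):
        records.append({
            "title": item.findtext("name"),
            "author": item.findtext("writer"),
            "year": int(item.findtext("published")),
        })
    return records

books = translate_example_site(feed)
```

A second site with different tags would get its own translator function, but everything downstream only ever sees the common record shape.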
I'd take a look at some of the popular book APIs and see if they can help you.
Maybe make some browser-specific add-ons using JS.
The user highlights a book title, the add-on captures it, you do a lookup against an API, the user then perhaps has to identify which actual title/ISBN they mean, and then it sends you the details.
Either way, I think you will find you need to capture ISBNs.
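If you do end up capturing ISBNs, it's worth validating them before any API lookup, since both ISBN-10 and ISBN-13 carry a check digit. A sketch (Python for brevity; the arithmetic is trivial to port to PHP):

```python
def valid_isbn10(isbn):
    """ISBN-10: weighted sum with weights 10..1 must be divisible by 11;
    the last character may be 'X', meaning 10."""
    s = isbn.replace("-", "")
    if len(s) != 10:
        return False
    total = 0
    for i, ch in enumerate(s):
        if ch == "X" and i == 9:
            v = 10
        elif ch.isdigit():
            v = int(ch)
        else:
            return False
        total += (10 - i) * v
    return total % 11 == 0

def valid_isbn13(isbn):
    """ISBN-13: digits weighted alternately 1 and 3; the sum must be
    divisible by 10."""
    s = isbn.replace("-", "")
    if len(s) != 13 or not s.isdigit():
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(s))
    return total % 10 == 0
```

Rejecting mistyped ISBNs up front saves a wasted API call and avoids storing junk identifiers.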
Yes, it's meant to be for any site. Even if it were 2-3 sites, I don't think it would be good to rely on static methods, because I can imagine a designer completely changing the look of a site, which would break the previous algorithms - at least that's how I think about it.
No, the table of books was just an example. I'm thinking of collecting data like publications, conferences attended, etc., so the content could be anything from A to Z - I see no other choice than user involvement.
Anyway, thanks for all the replies. You've given me a subject to think about - I'll probably try something with the highlighting of content, perhaps, combined with regular expressions.
As mentioned before, the problem is that the HTML layout of all these sites can vary, and sites can change, so the scripts would need to be updated. If you have problems with it, you could contact one of the developers who are familiar with such tasks (there are some of them). Firefox or Chrome add-ons might also help.
Just throwing this thought into the pot, but not every site wants you to do this. It's actually against the terms of service of larger sites to scrape information from their databases to keep in your own. (Not that they can really prevent you from doing it, but it's still against their terms.)