Rebuilding Process Information Thread
#15
(01-07-2014, 02:18 PM)Phaze Wrote: On the subject of retrieving data, wouldn't it be theoretically possible to make a DOM-parsing program that auto-downloads the cached pages from Google and parses the data for you? I've never done this sort of thing before* but if it's really gonna take ages to do by hand, it might be worth it for me or one of the staff to look into.

*While I haven't scraped pages automatically before, I once started (but never finished) a DOM-parsing program in PHP to clean up saved pages from a forum, archive them neatly, and fix the broken CSS.


And a chat with Dazz tells me that Google rejects crawling-like activity. Oh well '_;

I wrote a scraper, but the IP address I was using got banned pretty quickly. Even if I randomized the IP and the interval between requests, Google looks for patterns in the search requests and will block too many similar requests with a CAPTCHA. The funny thing is, the biggest web crawler in the world doesn't let you crawl their servers. Huh.
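
For what it's worth, the parsing half is the easy bit once the cached pages are saved to disk by hand. Here's a rough sketch in Python using only the standard library. It is not the scraper mentioned above and it makes no requests to Google; it only reads HTML you already have. The class name `post_body` and the function name `extract_posts` are made up for illustration, so swap in whatever the saved pages actually use.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collects the text of every <div> whose class list contains target_class."""

    def __init__(self, target_class="post_body"):  # "post_body" is a hypothetical class name
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # > 0 while we are inside a matching <div>
        self.current = []   # text fragments of the post being read
        self.posts = []     # finished posts, one string each

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            # Nested <div> inside a post: track it so we find the right closing tag.
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.current = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace so each post is one clean line.
                self.posts.append(" ".join("".join(self.current).split()))

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)


def extract_posts(html, target_class="post_body"):
    """Return the text of each matching <div> in an HTML string."""
    parser = PostExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.posts
```

To use it on a saved page, read the file into a string and pass it in, e.g. `extract_posts(open("saved_page.html", encoding="utf-8").read())`, where `saved_page.html` is a placeholder filename. Only `<div>` nesting is tracked, so badly broken markup may still need manual cleanup.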


Messages In This Thread
Rebuilding Process Information Thread - by Dazz - 01-07-2014, 11:44 AM
RE: Rebuilding Process Information Thread - by Raz - 01-07-2014, 03:50 PM
