The VG Resource

Full Version: Rebuilding Process Information Thread
I'll be posting any information about what has gone down on my Twitter, but I'll keep people informed here on anything that might need direct input. I'll be referring to this thread from Twitter if I feel I need to make a longer post.

Keep up with my Twitter feed for more information:
https://twitter.com/TheVGResource

We'll get there, don't worry... We've had worse before. :)


We've added a donation button to the home page to hopefully get some assistance in covering the revenue lost from the ads that we won't be displaying.
http://www.spriters-resource.com/

I hate to ask for money, because I've always been able to cover the costs with revenue... This is why I removed the donation buttons from the site to start with. But I'm seriously begging here.
Like I said, let me know if there's anything I can do to help. If this had happened before Christmas, I would have been able to toss some money your way, but my hours have been severely cut. If I can rustle something up I'll let you know.
What needs to happen? Is it basically a matter of labeling, organizing, and giving proper credit, to a bunch of sprite sheets saved with a number name like 609145.png? I could see how that would be a nightmare.
(01-07-2014, 01:12 PM)Kedric Wrote: [ -> ]What needs to happen? Is it basically a matter of labeling, organizing, and giving proper credit, to a bunch of sprite sheets saved with a number name like 609145.png? I could see how that would be a nightmare.

Actually, submitters are still intact. We just need names, sections, and game info associated with each sheet.
Is there any way any of us could help speed up the process further?

Dazz seems to already be doing what I was going to suggest and looking at the internet archives for tSR :p
Yea, Dazz said he is using the Google-cached version of the site to recover missing information by running code against it. Will this get back all the names, sections, and game info for all the sheets?

I know you guys don't know me because I don't participate in the community much but I am literally on this website hours every day. I want to help if there is anything I can do.
At the moment, our plan (which is working SO FAR) will reduce the workload from having to load 60,000+ cached pages to only needing to load 5,000.
We're hoping this works; if not, again, I'll keep you all posted.
To put some context into what we're doing:


We've developed a bookmarklet (a bookmark that runs some JavaScript on a page) which, when used on a cached tSR game page, will give a text box of information from the game. This includes the game's identifier (which we're using the URL for, as it is always unique to the page in question), the sheet's ID number (from the URL linking to the sheet's page), the sheet's name (from the alt tag used on the icon), and the category it is under. We will then take this information and dump it into another page that processes it and inserts all of it into the database where the sheets previously were.
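For anyone curious, the sketch below shows roughly the kind of function a bookmarklet like this might run (in practice it would be minified into a single javascript: URL). To be clear, the selectors and page structure here are my guesses at the cached tSR markup, not the actual code:

Code:
(function () {
    // Sketch only - the selectors below are assumptions about the cached
    // tSR game-page markup.
    var out = [];
    var game = location.href;   // the game's identifier: unique per page
    var category = '';
    // Walk the page in document order so each sheet link gets attributed
    // to the most recent category heading above it.
    var nodes = document.querySelectorAll('h2, a[href*="/sheet/"]');
    for (var i = 0; i < nodes.length; i++) {
        var n = nodes[i];
        if (n.tagName === 'H2') { category = n.textContent; continue; }
        var m = n.href.match(/sheet\/(\d+)/);   // sheet ID from the URL
        var img = n.querySelector('img');       // icon whose alt = sheet name
        if (m && img) out.push([game, m[1], img.alt, category].join('\t'));
    }
    // Dump everything into a text box so it can be copied into the
    // processing page in one go.
    var box = document.createElement('textarea');
    box.rows = 10;
    box.cols = 80;
    box.value = out.join('\n');
    document.body.insertBefore(box, document.body.firstChild);
    box.select();
})();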

Using this method, we have to manually view every game on the site through Google's cache. However, Pete created a script that automated generating the URLs for these, so we have a list of the URLs to access. We simply have to go to the URL, click the bookmarklet, copy the text, paste it into a processing page, and hit enter. This will give us the missing information.
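To illustrate the URL step (this isn't Pete's actual script, and the file names are made up): Google's cache could be reached through a fixed URL prefix at the time, so building the list from a set of game-page URLs might look like:

Code:
// Turn a list of tSR game-page URLs into Google cache URLs.
var fs = require('fs');

var CACHE_PREFIX = 'https://webcache.googleusercontent.com/search?q=cache:';

// Read one game URL per line (hypothetical input file).
var games = fs.readFileSync('game_urls.txt', 'utf8')
    .split('\n')
    .filter(function (line) { return line.trim().length > 0; });

var cacheUrls = games.map(function (url) {
    return CACHE_PREFIX + encodeURIComponent(url.trim());
});

fs.writeFileSync('cache_urls.txt', cacheUrls.join('\n'));
console.log('Wrote ' + cacheUrls.length + ' cache URLs');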

However, we are still going to be missing the most vital piece of information on the site... The number of times somebody has viewed a sheet. I don't know how we'll be able to go on from losing this information... We'll just have to band together and cry at night, telling stories of how Pokémon Black/White 2 had over 200,000 hits on multiple sheets. Boohoo.

tl;dr - we've made a codey thing that grabs shit from Google.
Decided to donate some dosh, but I'm honestly surprised that there wasn't any backup of, as you said, up to 10 years of information.

Lesson learned, eh?


Read your tweet about automation failing. I guess you need to keep an eye on it in future!
This is outstanding. Great job on the script, Pete!

Is there a way to use Microsoft Excel's "Gather Data - From Web" (aka "web query") feature to get each bookmarklet result on a different sheet? Actually, that might not be very helpful. I have used that technique before to get data from multiple websites organized in a single sheet. However, I imagine it would not help much in actually getting the data back into the site - I am probably just demonstrating my ignorance of HTML and JavaScript. :-)

Is the most time-consuming part going to be running the bookmarklet on each cached page, or inputting the retrieved data back into the website?
On the subject of retrieving data, wouldn't it be theoretically possible to make a DOM-parsing program that will auto-download the cached pages for you from Google and parse the data? I've never done this sort of thing before,* but if it's really gonna take ages to do, it might be worth it for me or one of the staff to look into.

*While I haven't done anything to scrape pages automatically, I once made a (never-finished) DOM-parsing program in PHP to clean up saved pages from a forum, archiving them neatly and fixing the broken CSS.


And a chat with Dazz tells me that Google rejects crawling-like activity. Oh well '_;
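For what it's worth, once the pages are saved locally the parsing half is straightforward. A rough sketch of the idea in Node using the cheerio library (the directory layout and selectors are assumptions, same as with the bookmarklet above):

Code:
// Parse already-downloaded cache pages and pull out the same fields the
// bookmarklet grabs. Assumes the pages are saved under saved_pages/.
var fs = require('fs');
var cheerio = require('cheerio');

fs.readdirSync('saved_pages').forEach(function (file) {
    var html = fs.readFileSync('saved_pages/' + file, 'utf8');
    var $ = cheerio.load(html);
    $('a[href*="/sheet/"]').each(function () {
        var m = ($(this).attr('href') || '').match(/sheet\/(\d+)/);
        var name = $(this).find('img').attr('alt');  // sheet name from alt text
        if (m && name) console.log(m[1] + '\t' + name);
    });
});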
Whatever can't be gotten from the cache, I'm always here to help with.
I'm definitely in to help. I'll do all my submissions, obviously, and then the games I know, 'cause there's plenty of 'em that I don't.
MJ, it... doesn't work like that. It's just clicking buttons.
(01-07-2014, 02:18 PM)Phaze Wrote: [ -> ]On the subject of retrieving data, wouldn't it be theoretically possible to make a DOM-parsing program that will auto-download the cached pages for you from Google and parse the data? I've never done this sort of thing before,* but if it's really gonna take ages to do, it might be worth it for me or one of the staff to look into.

*While I haven't done anything to scrape pages automatically, I once made a (never-finished) DOM-parsing program in PHP to clean up saved pages from a forum, archiving them neatly and fixing the broken CSS.


And a chat with Dazz tells me that Google rejects crawling-like activity. Oh well '_;

I wrote a scraper, but the IP address I was using got banned pretty quickly. Even though I randomized the IP and the interval between requests, Google looks for patterns in the search requests and will block too many similar requests with a CAPTCHA. The funny thing is, the biggest web crawler in the world doesn't let you crawl their servers. Huh.
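A simplified sketch of the kind of throttled loop I mean (not my actual scraper; the file name and timings are illustrative) - even pauses like these weren't enough to dodge the pattern detection:

Code:
// Fetch cached pages one at a time with a random 5-15 second pause.
var fs = require('fs');
var https = require('https');

var urls = fs.readFileSync('cache_urls.txt', 'utf8')
    .split('\n')
    .filter(Boolean);

function fetchNext(i) {
    if (i >= urls.length) return;
    https.get(urls[i], function (res) {
        var html = '';
        res.on('data', function (chunk) { html += chunk; });
        res.on('end', function () {
            // A CAPTCHA page (or a 429/503 status) means we've been flagged.
            console.log(res.statusCode + ' ' + urls[i]);
            var delay = 5000 + Math.random() * 10000;
            setTimeout(function () { fetchNext(i + 1); }, delay);
        });
    }).on('error', function (err) {
        console.error(err.message);
    });
}

fetchNext(0);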