Difference between revisions of "Ispygames"

From Archiveteam
Line 76: Line 76:
* http://flashpoint.ign.com - siliconvalleypark - grabbed
* http://flashpoint.ign.com - siliconvalleypark - grabbed
* http://beaterator.ign.com - siliconvalleypark - grabbed
* http://beaterator.ign.com - siliconvalleypark - grabbed
* http://aivlev.dev.m.au.ign.com - password protected dev site
* http://aivlev.dev.m.ie.ign.com - password protected dev site
* http://aivlev.dev.m.ign.com - password protected dev site
* http://aivlev.dev.m.uk.ign.com - password protected dev site
* http://apassey.dev.m.ign.com - password protected dev site


=== Ready to grab ===
=== Ready to grab ===
* http://au.bestof.ign.com
* http://au.retro.ign.com
* http://au.sports.ign.com
* http://bestof.ign.com
* http://code.ign.com




=== untested ===
=== untested ===
* http://911.ign.com
* http://adtools.ign.com - blank
* http://adtools.ign.com - blank
* http://aivlev.dev.m.au.ign.com
* http://aivlev.dev.m.ie.ign.com
* http://aivlev.dev.m.ign.com
* http://aivlev.dev.m.uk.ign.com
* http://apassey.dev.m.ign.com
* http://au.bestof.ign.com
* http://au.microsites.ign.com
* http://au.microsites.ign.com
* http://au.retro.ign.com
* http://au.sports.ign.com
* http://au.top100.ign.com
* http://au.top100.ign.com
* http://au.video.ign.com
* http://au.video.ign.com
* http://beacon.ign.com
* http://beacon.ign.com
* http://bestofe3.ign.com
* http://bestof.ign.com
* http://blockbuster.ign.com
* http://blockbuster.ign.com
* http://broadband.ign.com
* http://broadband.ign.com
Line 103: Line 102:
* http://championshipgamingseries.ign.com
* http://championshipgamingseries.ign.com
* http://championsonline.ign.com
* http://championsonline.ign.com
* http://code.ign.com
* http://cohvault.ign.com
* http://cohvault.ign.com
* http://comiccon.ign.com
* http://comiccon.ign.com
Line 115: Line 113:
* http://downloads.ign.com
* http://downloads.ign.com
* http://dragonica.ign.com
* http://dragonica.ign.com
* http://dsi.ign.com
* http://dsvault.ign.com
* http://dsvault.ign.com
* http://emailpreferences.ign.com
* http://emailpreferences.ign.com
Line 345: Line 342:
* http://ddovault.ign.com -> http://dndvault.ign.com/
* http://ddovault.ign.com -> http://dndvault.ign.com/
* http://bigworldvault.ign.com -> http://vault.ign.com
* http://bigworldvault.ign.com -> http://vault.ign.com
* http://911.ign.com -> http://tickets.ign-inc.com/
* http://bestofe3.ign.com -> http://games.ign.com/bestofe3.html
* http://dsi.ign.com -> http://ds.ign.com/dsi/


== Gamespy Domains ==
== Gamespy Domains ==

Revision as of 02:09, 3 April 2013

The News

IGN hit with layoffs, 1UP, UGO and GameSpy shutting down
1UP, UGO and GameSpy to be shut down

The Problems

  • Once you start digging around these sites, you find a mess of inconsistent URL schemes, with content scattered everywhere.
  • Some files are being hosted on MediaFire.
  • Based on tests, the larger and older a site is, the more a wget crawl misses because of its URL scheme.

What we know

  • We already have a list of almost all the domains involved
  • A clean list, with dups and bad domains removed, is already being processed and will be posted here when complete.
  • Most of the sites are not that big, but a few are huge.
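
One way to clean the raw domain list could look like the sketch below. It is only a sketch: the function name clean_domains is made up here, and it assumes the raw list has one entry per line in the same style as the lists on this page (optional "* " bullet, optional scheme, optional trailing note). It only drops dups and malformed lines; it does not check whether a domain still resolves.

```shell
# Hypothetical helper (not from the original page): read a raw host list on
# stdin and print a sorted, de-duplicated list of bare host names.
clean_domains() {
  # 1. lower-case everything
  # 2. strip a leading "* " bullet, the http:// or https:// scheme,
  #    and anything after the host name (path, " - note", " -> target")
  # 3. keep only lines that look like a host name
  # 4. sort and drop duplicates
  tr '[:upper:]' '[:lower:]' \
    | sed -e 's/^[* ]*//' -e 's#^https\{0,1\}://##' -e 's#[/ ].*$##' \
    | grep -E '^[a-z0-9.-]+\.[a-z]+$' \
    | sort -u
}
```

Usage would be something like clean_domains < raw_list.txt > clean_list.txt, where both file names are placeholders.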

The plan

  • Save the sites and related content
  • Back up the Twitter feeds for any associated accounts. All My Tweets just takes a username and returns the maximum number of tweets possible.


wget test command

This is for the GameSpy sites.

USER_AGENT="Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)"
SAVE_HOST="http://planetdoom.gamespy.com"
WARC_NAME="warc_name"

# --domains expects bare host names, so the http:// scheme is stripped from SAVE_HOST.
wget -e robots=off --mirror --page-requisites \
--waitretry 5 --timeout 60 --tries 5 --wait 2 \
--warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" \
-U "$USER_AGENT" "$SAVE_HOST" \
--span-hosts --domains="${SAVE_HOST#http://},pcmedia.gamespy.com,pnmedia.gamespy.com,pspmedia.gamespy.com,oystatic.ignimgs.com"

Try this for the IGN and UGO sites.

USER_AGENT="Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)"
SAVE_HOST="http://ve3d.ign.com"
WARC_NAME="warc_name"

wget -e robots=off --mirror --page-requisites \
--waitretry 5 --timeout 60 --tries 5 --wait 2 \
--warc-header "operator: Archive Team" --warc-cdx --warc-file="$WARC_NAME" \
-U "$USER_AGENT" "$SAVE_HOST"
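
To run the same test command over many sites, a loop along these lines might work. This is a rough sketch, not a tested grab script: warc_name_for and grab_all are made-up names, and it assumes a plain text file with one start URL per line. Each site gets its own WARC named after its host.

```shell
USER_AGENT="Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)"

# Hypothetical helper: derive a WARC base name from a URL by dropping the
# scheme and any path, e.g. http://ve3d.ign.com -> ve3d.ign.com
warc_name_for() {
  host="${1#*://}"    # drop the scheme, if any
  host="${host%%/*}"  # drop everything from the first slash
  printf '%s\n' "$host"
}

# Hypothetical helper: run the test command above once per URL listed in
# the file given as "$1", skipping empty lines.
grab_all() {
  while IFS= read -r url; do
    [ -n "$url" ] || continue
    wget -e robots=off --mirror --page-requisites \
      --waitretry 5 --timeout 60 --tries 5 --wait 2 \
      --warc-header "operator: Archive Team" --warc-cdx \
      --warc-file="$(warc_name_for "$url")" \
      -U "$USER_AGENT" "$url"
  done < "$1"
}
```

Calling grab_all ready_to_grab.txt (a placeholder file name) would then work through the list one site at a time.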

IGN domains

Ready to grab


untested

These might be asset only hosting sites

Redirects

Gamespy Domains

Ready to grab

In Progress