Difference between revisions of "Audit2014"
Revision as of 06:15, 14 December 2014
We've uploaded a bunch of stuff: https://archive.org/search.php?query=subject:Archiveteam
Let's go through the list and make sure it's categorized, has decent metadata, etc.
Many of our uploads are quite large, and have been broken into many items on Archive.org. We'll group them together here and verify each set all at once.
Things to check
- Collection
  - Are all the related items grouped into a collection?
- Description
  - Can a visitor figure out what each item represents? Items in a collection don't need to repeat the description of the collection, but it'd be nice if they had a sentence or two, and information about how the item differs from the other items in the collection ("MP3s from earbits.com, files starting with c." from the Earbits items is a good example.)
- Inclusion
  - Are all the related items included in the same collection?
- Categorization
  - Can a visitor find the item by browsing the collections?
- Cross-references
  - Can a visitor find other items in a set, starting at any item in the set? Can a visitor find the index of a large set starting from any part of it?
- Indexing
  - If the item is a collection of sub-items, is one of these sub-items an index of the others? (This is a complicated thing to check for and to create when it doesn't exist, so we can come back to this after we've checked the rest.)
- Your suggestion here
  - this is just off the top of my head.
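Some of these checks can be partly automated. The following is a minimal sketch, not an official tool: it assumes an item's metadata is available as a dict with `collection`, `description`, and `subject` fields (each possibly a string or a list), and the function name `audit_item` and the five-word threshold are made up for illustration. It only covers the Collection, Description, and keyword checks; cross-references and indexing still need a human.

```python
def as_list(value):
    """Normalize a metadata field that may be missing, a string, or a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def audit_item(metadata, expected_collection, min_words=5):
    """Return a list of checklist problems for one item's metadata dict.

    `min_words` is an arbitrary threshold for "a sentence or two".
    """
    problems = []
    if expected_collection not in as_list(metadata.get("collection")):
        problems.append("not in collection " + expected_collection)
    description = " ".join(as_list(metadata.get("description"))).strip()
    if len(description.split()) < min_words:
        problems.append("description missing or too short")
    if not as_list(metadata.get("subject")):
        problems.append("no keywords")
    return problems


# Hypothetical metadata, for illustration only.
good = {
    "collection": ["archiveteam_earbits"],
    "description": "MP3s from earbits.com, files starting with c.",
    "subject": "archiveteam; earbits",
}
bare = {"collection": "opensource"}

print(audit_item(good, "archiveteam_earbits"))  # no problems found
print(audit_item(bare, "archiveteam_earbits"))  # all three problems
```

An empty list means the item passed these three checks, not that it is fully audited.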
Current Sub-Collections at Archive Team
Collection | Status | Auditor | Item Count | Has an Index | Description of Audit |
---|---|---|---|---|---|
No Category (earbits) | Unaudited | | 98 | Yes | The items are not in a collection. Most items are WARCs; the rest need additional work if anyone is going to be able to find the exact MP3 they want. |
archiveteam_ptch | Audited | db48x | 50 | No | Collection has a great description, but no categories. Items in the collection are WARCs. One item is not included in the collection: deathy-s3-test-ptch |
archiveteam_flowerpot | Audited | db48x | 406 | No | The description of the collection is anemic, but each item is well-identified. |
github_files | Audited | db48x | 1 | No | In pretty bad shape. Only one item in the collection, and that's only half the data. Was the rest never uploaded? Has no description, keywords or other metadata. Other GitHub items could be included, such as this repository index, and these other file downloads. |
justintv | Audited | db48x | 189 | | Decent description, but no other metadata. There are 51 other 'justintv' items, but none of them look to be from us. |
archiveteam_mochimedia | Audited | db48x | 9 | No | Collection includes Mochi's notice about the shutdown, but no other context. The items are all WARCs, and all have CDXs and JSON indexes, but there's no overall index. An index can be easily generated from this 26MB JSON file. --chfoo |
archivebot | Unaudited | ||||
archiveteam_yahooblogs and archiveteam_yahooblog | Audited | db48x | 49 | No | Collection description is just the shutdown notice (and apparently quite a brief one at that) with no other context. Items are all WARCs, and all have CDXs and JSON indexes, but there's no overall index. One item is orphaned in a collection of its own; apparently caused by a typo in the collection name. |
archiveteam-splinder | Unaudited | ||||
archiveteam-picplz | Audited | db48x | 141 | Yes | The collection description is just the shutdown message, with no other context. Items are tarballs containing WARCs. There is an index, but it's not a part of the collection ([1]). There's also a search page for the index, which is great. |
archiveteam_puush | Audited | db48x | 1781 | | The collection description is just the shutdown notice, but it's better than average; it includes some context. The items are all WARCs with CDXs, but there's no central index. |
archiveteam_upcoming | Audited | dashcloud1 | 142 | No | The collection description only describes the site, not the items themselves. Individual items have no description of any kind. |
archiveteam_randomfandom | Audited | dashcloud1 | 42 | Yes | Short collection description, but has an index, and every collection item is well described. Index is located right on collection page. |
archiveteam_antecedents | Audited | db48x | 46 | N/A | This collection represents multiple sites, rather than multiple parts of a single large site. The collection description is quite brief, but each item appears to have a paragraph describing what the site is/was, as well as some basic metadata such as keywords. All the items appear to be WARCs with CDXs |
archiveteam_jazzhands | Audited | db48x | 443 | No | This one is a collection of items from multiple sites, but those sites are also broken up into multiple items based on when they were scanned. The items have brief descriptions and some keywords, and are WARCs with CDXs. A good way to improve this would be to make collections for each site as subcollections. |
archiveteam-mobileme-hero | Unaudited | | 4007 | Yes (source) | |
archiveteam_myopera | Audited | dashcloud1 | 155 | No | Collection page has a nice description of the site and the items. The items all appear to be WARCs, and have no descriptions or keywords of any kind on them. |
archiveteam_bebo | Unaudited | ||||
archiveteam_dogster | Audited | jscott | 55 | ??? | Collection well described. Wayback Machine-Ready WARCs, all integrated. |
hyves | Unaudited | ||||
archiveteam_wretch | Unaudited | ||||
archiveteam_xanga | Unaudited | ||||
twitterstream | Unaudited | ||||
pastebinpastes | Unaudited | ||||
archiveteam-googlegroups-th | Unaudited | ||||
archiveteam_zapd | Unaudited | ||||
archiveteam_patch | Unaudited | ||||
archiveteam_posterous | Unaudited | ||||
archiveteam_greader | Unaudited | ||||
archiveteam_ignsites | Unaudited | ||||
archiveteam_g4tv_forums | Unaudited | ||||
archiveteam-yahoovideo | Unaudited | ||||
archive-team-friendster | Unaudited | ||||
archiveteam_formspring | Unaudited | ||||
archiveteam_yahoo_messages | Unaudited | ||||
archiveteam_punchfork | Unaudited | ||||
yahoo_korea_blogs | Unaudited | ||||
archiveteam-cinch | Unaudited | ||||
archiveteam_dailybooth | Unaudited | ||||
archiveteam_weblognl | Unaudited | ||||
stage6 | Unaudited | ||||
googlegroups-part2 | Unaudited | ||||
archiveteam-btinternet | Unaudited | ||||
archiveteam-qaudio-archive | Unaudited | ||||
webshots-freeze-frame | Unaudited | ||||
tabblo-archive | Unaudited | ||||
archiveteam-fortunecity | Unaudited | ||||
2012-04-30-wikimedia-images-snapshot | Unaudited | ||||
archiveteam-anyhub | Unaudited | ||||
archiveteam-fileplanet | Unaudited | ||||
archiveteam-umich-save | Unaudited | ||||
archiveteam-geocities | Unaudited | ||||
archiveteam-fire | Unaudited | ||||
archiveteam-mypodcast | Unaudited | ||||
archiveteam-googlegroups | Unaudited | ||||
isohunt dumps 1 2 3 | Unaudited | | | | These are not yet in a dedicated collection, and have never been post-processed. Some of the .torrent files may actually be error pages. This needs work, and proper full auditing. |
No Category (streetfiles) | Unaudited | ||||
archiveteam_yahoovoices | Unaudited | ||||
archiveteam_twitchtv | Unaudited | | | Yes (source) | |
archiveteam_fotopedia | Unaudited | ||||
archiveteam_canvas | Unaudited | ||||
archiveteam_ancestry | Unaudited |
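
The isohunt row above notes that some of the .torrent files may actually be error pages. One cheap way to triage them, sketched here as a heuristic and not as a real parser: a .torrent file is a bencoded dictionary, so it starts with `d` and contains an `info` key (encoded as `4:info`), while an error page usually starts with `<`. The function name `classify_download` is invented for this example, and the sample bytes are fake.

```python
def classify_download(data: bytes) -> str:
    """Guess whether downloaded bytes are a torrent, an HTML page, or neither.

    Heuristic only: it does not decode or validate the bencoding.
    """
    if data.startswith(b"d") and b"4:info" in data:
        return "torrent"
    if data.lstrip()[:1] == b"<":
        return "html"
    return "unknown"


# Fake samples for illustration; real files would be read with open(path, "rb").
fake_torrent = (
    b"d8:announce15:http://x.test/a"
    b"4:infod6:lengthi1e4:name1:a12:piece lengthi1e6:pieces0:ee"
)
error_page = b"\n<!DOCTYPE html><html><body>404 Not Found</body></html>"

print(classify_download(fake_torrent))
print(classify_download(error_page))
print(classify_download(b""))
```

Anything reported as "html" or "unknown" would still need a manual look before being dropped from an item.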
In progress???
But what happened after? Where are the archives?
- BerliOS
- Deletionpedia
- Delicious
- ExtraTorrent
- Free ProHosting
- Google Video
- Ispygames
- Len Sassaman Project
- Lulu Poetry
- Prodigy.net
- Resedagboken
- ScreenshotsDatabase.com
- Spanish Revolution: Is this finished?
- University of Michigan personal webpages
- Wallbase
- Wallhaven
- Webmonkey
- Widgetbox
- Windows Live Spaces
Oddities, Mislocations, and To Do
- https://archive.org/search.php?query=earbits Earbits gathering is in the wrong place and needs additional versions.
- The wiki front page needs updating
To be moved to better collection
Orphaned Canv.as
- https://archive.org/details/archiveteam_canvas_20140812090142
- https://archive.org/details/archiveteam_canvas_20140812144024
- https://archive.org/details/archiveteam_canvas_20140815175099
- https://archive.org/details/archiveteam_canvas_20140812085210
- https://archive.org/details/archiveteam_canvas_20140812090945
Orphaned Twitch.tv
- https://archive.org/details/archiveteam_twitchtv_20140811223313
- https://archive.org/details/archiveteam_twitchtv_espesgrab
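
The orphans above share a prefix with an existing collection name, so candidates for a move can be suggested mechanically. This is a rough sketch under that assumption only; `identifier_from_url` and `suggest_collection` are hypothetical helper names, and items whose identifiers don't start with a collection name (most of the lists below) will simply get no suggestion.

```python
def identifier_from_url(url):
    """Take the item identifier from an archive.org /details/ URL."""
    return url.rstrip("/").rsplit("/details/", 1)[-1]


def suggest_collection(identifier, collections):
    """Return the longest collection name that prefixes the identifier, or None."""
    matches = [
        name for name in collections
        if identifier.startswith(name + "_") or identifier.startswith(name + "-")
    ]
    return max(matches, key=len) if matches else None


# Collection names taken from the table above.
collections = ["archiveteam_canvas", "archiveteam_twitchtv", "archiveteam_myopera"]

for url in [
    "https://archive.org/details/archiveteam_canvas_20140812090142",
    "https://archive.org/details/archiveteam_twitchtv_espesgrab",
    "https://archive.org/details/pouet.com_full_grab",
]:
    item = identifier_from_url(url)
    print(item, "->", suggest_collection(item, collections))
```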
WARC
- https://archive.org/details/pouet.com_full_grab no WARC file visible for me
- https://archive.org/details/archiveteam_punchfork_archive-archive
- https://archive.org/details/sg1archive.com_forums_20140708
- https://archive.org/details/2013_misc_warcs_02
- https://archive.org/details/2013_misc_warcs_01
- https://archive.org/details/site-donkeyboytripodcom
- https://archive.org/details/site-homeswipnetseclubnintendo007
- https://archive.org/details/site-homeswipnetsecpg
- https://archive.org/details/site-homeswipnetsegamemaster
- https://archive.org/details/homeswipnetsenestabs
- https://archive.org/details/Site-homeswipnetsew-62848
- https://archive.org/details/site-homeswipnetsesofiasgbc
- https://archive.org/details/site-homeswipnetsexcheatsdk
- https://archive.org/details/site-home2swipnetsew26120
- https://archive.org/details/site-home3.swipnet.se-w38081
- https://archive.org/details/site-home4swipnetse-w42641
- https://archive.org/details/site-home4swipnetse-w46722
- https://archive.org/details/site-homeswipnetsefredde2000
- https://archive.org/details/ubuntuone-panicgrab-20140405
- https://archive.org/details/myopera-forums-1700001-1800000
- https://archive.org/details/myopera-forums-1800001-1823192
- https://archive.org/details/rawporter.s3.amazonaws.com_20140616_partial
- https://archive.org/details/technet.microsoft.com-panicgrab-20130706
- https://archive.org/details/isohunt_facebook_page_snapshot WARC and other formats
- https://archive.org/details/Misc.yero.orgMusic
- https://archive.org/details/telinco.co.uk_pages
- https://archive.org/details/tribes_forum_emergency_grab
- https://archive.org/details/isohunt-20131019-mithrandir-extra
- https://archive.org/details/cscope.us-google-pdfs-grab-20130312
- https://archive.org/details/cscope.us-google-pdfs-grab-20130520
- https://archive.org/details/PinkTentacle
- https://archive.org/details/journalstar.com_sports_local_20120730.warc
- https://archive.org/details/www.battleforthenet.com-panicgrab-20140718
- https://archive.org/details/theopeninter.net-panicgrab-20140718
- https://archive.org/details/startupsfornetneutrality.org-panicgrab-20140718
- https://archive.org/details/net.net-panicgrab-20140718
- https://archive.org/details/wwdctimer.com-panicgrab-20140731
- https://archive.org/details/xn--19g.com-panicgrab-20140731
- https://archive.org/details/chromercise.com-panicgrab-20140731
- https://archive.org/details/hiddenfromgoogle.com-panicgrab-20140731
- https://archive.org/details/orteil.dashnet.org-panicgrab-20140731
- https://archive.org/details/pingus.seul.org-panicgrab-20140731
- https://archive.org/details/tux4kids.alioth.debian.org-panicgrab-20140731
- https://archive.org/details/tuxkart.sourceforge.net-panicgrab-20140731
- https://archive.org/details/assets.minecraft.net-panicgrab-20140807
- https://archive.org/details/bmf.*rustedmagick.com-cr-panicgrab-20140808 (remove asterisk, spam filter doesn't like this link)
- https://archive.org/details/tppx.herokuapp.com-panicgrab-20140808
- https://archive.org/details/nintendo-warcs
- https://archive.org/details/www.battleforthenet.com-panicgrab-20140912
- https://archive.org/details/mojang.com-notch-panicgrab-20140912
- https://archive.org/details/http.lists.xiph.org.ad78c6615d420894
- https://archive.org/details/legowracers.4t2portfolio.co.uk-panicgrab-20141007
- https://archive.org/details/2014.oct.29G3.warc (Geometer's Sketchpad installers)
- https://archive.org/details/inw-begun-2014.oct.26-p6-00001.warc (ef.inwards.com snapshot)
- https://archive.org/details/Dsoi4Jan2014.megawarc.json (WARCs turned out to seem to be corrupt)
- https://archive.org/details/bds-9oct2013
- https://archive.org/details/Hogislandeducators2011.wikispaces.comWARCSnapshot9October2013
- https://archive.org/details/warcs-as-of-26jany2014
- https://archive.org/details/00001DlUMkUFTWc.info (WARCs and a bunch of other stuff)
- https://archive.org/details/D3jan2014.megawarc.json
- https://archive.org/details/MicrosoftDemandsTakedownOfMicrosoftSpyGuide.html (WARCs and other stuff)
- https://archive.org/details/warc-9aug2014
- https://archive.org/details/27may2014warcset
- https://archive.org/details/13jany2014warcs
- https://archive.org/details/mcspotlight.org-20141030
- https://archive.org/details/dr_static.s3.amazonaws.com-panicgrab-20140929
- https://archive.org/details/cc2014.oct.31-00000.warc (Songs from DJ Contacreast's website)
- https://archive.org/search.php?query=collection%3Aamjbarreldata (This is a collection of WARCs (not in Wayback at present, as far as I know) from my attempt at writing a distributed-computing website-specific archival tool, sort of a cross between Majestic-12 and Archivebot. Not sure if it's appropriate to list here, but it's a thing. Feel free to remove it if it's not....)
FTP
- https://archive.org/details/ftp.idsoftware.com
- https://archive.org/details/ftp.lucasarts.com-20130427
- https://archive.org/details/ftp.santronics.com
- https://archive.org/details/2014.02.ftp.inf.tuDresden.deAtari
- https://archive.org/details/2014.0102.ftp.festo.com
- https://archive.org/details/wa-begun-ul-27jany2014amn (This should probably be darked, it looks to me like it's someone's misconfigured home NAS)
- https://archive.org/details/2014.0102.mail.digipro.rs
Misc
- https://archive.org/details/archiveteam-picplz-index
- https://archive.org/details/Posterous.comHostnames
- https://archive.org/details/YahooBlogSitemaps20131216071927
- https://archive.org/details/archiveteam-mobileme-index
- https://archive.org/details/archiveteam-twitter-stream-2014-05
- https://archive.org/details/ESPNForumsPanicgrab
- https://archive.org/details/rawporter-grab
- https://archive.org/details/bitsnoop-dump
- https://archive.org/details/CaliforniaFinanceLobbyData
- https://archive.org/details/ArchiveteamWarriorV220121008Hyperv
- https://archive.org/details/HowFlickr.comLookedLikeIn2010-APlaceOfWorshipOnFlickr-Photo
- https://archive.org/details/myopera_shutdown_notice
- https://archive.org/details/UsenetSci.space.news2003-2012
- https://archive.org/details/Usenet_rec.food.recipesArchive2003-2012
- https://archive.org/details/MirrorOfSiteOrtodoxiesiviata.blogspot.com
- https://archive.org/details/carti.itarea.org
- https://archive.org/details/ovmk_story
- https://archive.org/details/ti_guidebook_en
- https://archive.org/details/ti_guidebook_fr
- https://archive.org/details/ti_guidebook_de
- https://archive.org/details/myopera_usernames_FIXED.7z
- https://archive.org/details/DubaiWikipediaPageOn2012-09-06
- https://archive.org/details/digpicz-2008-07-30-website
- https://archive.org/details/site-wwwangelfirecomazdixieden
- https://archive.org/details/ArkiverCrawlsPack0004
- https://archive.org/details/ArkiverCrawlsPack0005
- https://archive.org/details/ArkiverCrawlsPack0007
- https://archive.org/details/ArkiverCrawlsPack0008
- https://archive.org/details/laptops-manuals-dump-from-tim.id.au-20121111
- https://archive.org/details/paste_lisp_org
- https://archive.org/details/MtGoxSituationCrisisStrategyDraft
- https://archive.org/details/MtGoxBusinessPlan20142017
- https://archive.org/details/nyt_innovation_2014
- https://archive.org/details/slackware-irc-logs
- https://archive.org/details/thekeep_bbs
- https://archive.org/details/mail.google.com-saved-1Oct2014
- https://archive.org/details/madden_giferator_scrape_1-100000
- https://archive.org/details/madden_giferator_scrape_100001-200000
- https://archive.org/details/madden_giferator_scrape_200001-300000
- https://archive.org/details/Data2September2013.tar (Gunnerkrigg Court homepage comments snapshots)
- https://archive.org/details/shipwretched-items
- https://archive.org/details/fotodisco-raw-items
- https://archive.org/details/quizilladisco-raw-items
- https://archive.org/details/qwikidisco-raw-items
- https://archive.org/details/twitpicdisco-raw-items
- https://archive.org/details/maemo-fremantle-ovi
URLTeam
- Upload the latest official torrent release. Done! URLTeamTorrentRelease2013July
- Upload the Dropbox files in the URLTeam wiki page table that are *not* in the latest release. Done!
- user:chfoo needs access to the URLTeam collection, OR someone needs to move the items as needed.
Missing
- Yahoo!_Blog: What happened to the Vietnam archives? Does anyone have a copy or at least a blurry screenshot of the Korean shutdown notice?