Dev/Seesaw


Writing a Seesaw content grab script is the most challenging and fun aspect of the infrastructure.

What an Archive Team Project Contains

Once the Git repository has been created, be sure to include the following files:

pipeline.py

This file contains the Seesaw client code for the project.

README.{md,rst,txt}

This file contains:
* brief information about the project
* instructions on how to run the scripts manually

A template is available here: standalone-readme-template.

[Project Name Here].lua (optional)

This is the Lua script used by Wget-Lua.

warrior-install.sh (optional)

This file is executed by the Warrior to install extra libraries needed by the project. Example: punchfork-grab warrior-install.sh.

wget-lua-warrior (optional)

This executable is a build of Wget-Lua for the warrior environment.

get-wget-lua.sh (optional)

Build scripts for Wget-Lua for those running scripts manually.

The repository is pulled in by the Warrior, or cloned by those who want to run the scripts manually.

Writing a pipeline.py (Seesaw Client)

The Seesaw client is a specific set of tasks that must be performed on each item. Think of it as a template of instructions. Typically, the file is called pipeline.py and uses the Seesaw library.

The pipeline file will typically use Wget with Lua scripting. Wget+Lua is a web crawler.

The Lua script is provided as an argument to Wget within the pipeline file. It controls fine-grained operations within Wget, such as rejecting unneeded URLs or adding more URLs as they are discovered.

The goal of the pipeline is to download, make WARC files, and upload them.

Quick Definitions

item

a work unit

pipeline

a series of tasks in an item

task

a step in getting the item done
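The three definitions above can be sketched in plain Python. This is illustrative only and not the real Seesaw classes (those live in the seesaw package); the class and key names here are made up for the sketch:

```python
# A minimal sketch of how an item (work unit) flows through a
# pipeline (series of tasks); NOT the real Seesaw implementation.
class Task(object):
    def process(self, item):
        raise NotImplementedError

class Pipeline(object):
    def __init__(self, *tasks):
        self.tasks = tasks

    def run(self, item):
        # each task is one step toward getting the item done
        for task in self.tasks:
            task.process(item)

class AddGreeting(Task):
    def process(self, item):
        # per-item data is stored on the item itself
        item["greeting"] = "hello, " + item["item_name"]

item = {"item_name": "exampleuser"}
Pipeline(AddGreeting()).run(item)
```

In real pipelines the item is a richer object provided by Seesaw, but the flow is the same: one work unit passed through a fixed series of steps.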

Recommended reading.

Installation

You will need:

  • Python 2.6/2.7
  • Lua
  • Wget with Lua hooks

Typically, you can install these on Ubuntu by running:

sudo apt-get install build-essential lua5.1 liblua5.1-0-dev python python-setuptools python-dev openssl libssl-dev python-pip make libgnutls-dev zlib1g-dev
sudo pip install seesaw

You will also need Wget with Lua. There is an Ubuntu PPA or you can build it yourself:

./get-wget-lua.sh

Grab a recent build script from here.

The pipeline file

The pipeline file typically includes:

  • A line that checks the minimum seesaw version required
  • Copy-and-pasted monkey patches if needed
  • A routine to find Wget Lua
  • A version number in the form of YYYYMMDD.NN
  • Misc constants
  • Custom Tasks:
    • PrepareDirectories
    • MoveFiles
  • Project information saved into the project variable
  • Instructions on how to deal with the item saved into the pipeline variable
  • An undeclared downloader variable which will be filled in by the Seesaw library

It is important to remember that each Task is a template on how to deal with each Item. Specific item variables should not be stored on a Task; rather, they should be saved onto the item: item["my_data"] = "hello".
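A quick sketch (plain Python, no seesaw dependency) of why per-item state belongs on the item: a single Task instance processes every item, so attributes stored on the task are shared and get overwritten. The class names here are hypothetical:

```python
# WRONG vs. correct placement of per-item state.
class BadTask(object):
    def process(self, item):
        # WRONG: the next item clobbers this shared attribute
        self.current_name = item["item_name"]

class GoodTask(object):
    def process(self, item):
        # Correct: the data is scoped to this particular item
        item["my_data"] = "hello"

bad, good = BadTask(), GoodTask()
item_a = {"item_name": "a"}
item_b = {"item_name": "b"}
for item in (item_a, item_b):
    bad.process(item)
    good.process(item)
# bad.current_name is now "b", regardless of which item you ask about
```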

Minimum Seesaw Version Check

if StrictVersion(seesaw.__version__) < StrictVersion("0.0.15"):
    raise Exception("This pipeline needs seesaw version 0.0.15 or higher.")

This check prevents manual script users from running an obsolete version of Seesaw. The Warrior will always upgrade to the latest Seesaw version if dictated by the Tracker's projects.json file.

Version 0.0.15 is the supported legacy version, but relying on the latest version of Seesaw, as listed on the Python Package Index, is recommended.

Monkey Patches

Monkey patches such as AsyncPopenFixed are only provided for legacy versions of Seesaw.

Routine to find Wget-Lua

WGET_LUA = find_executable(
    "Wget+Lua",
    ["GNU Wget 1.14.lua.20130523-9a5c"],
    [
        "./wget-lua",
        "./wget-lua-warrior",
        "./wget-lua-local",
        "../wget-lua",
        "../../wget-lua",
        "/home/warrior/wget-lua",
        "/usr/bin/wget-lua"
    ]
)

if not WGET_LUA:
    raise Exception("No usable Wget+Lua found.")

This routine is a sanity check that aborts the script early if Wget+Lua cannot be found. Omit it if your project does not use Wget+Lua.
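Roughly, the utility tries each candidate path and returns the first executable whose version output matches. This is only a sketch of the idea (the real implementation is in the Seesaw library and may differ in signature and behavior):

```python
# Illustrative sketch of an executable finder; not Seesaw's actual code.
import os
import subprocess

def find_executable_sketch(name, version_strings, paths):
    """Return the first path that exists, runs, and reports a
    version containing one of the expected version strings."""
    for path in paths:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            try:
                output = subprocess.check_output([path, "--version"])
            except (OSError, subprocess.CalledProcessError):
                continue  # not runnable; try the next candidate
            text = output.decode("utf-8", "replace")
            if any(v in text for v in version_strings):
                return path
    return None
```

Pinning on an exact version string (as the snippet above does with "GNU Wget 1.14.lua.20130523-9a5c") guarantees every downloader runs the same build.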

Script Version

VERSION = "20131129.00"

This constant, used within the pipeline, is sent to the Tracker and should be embedded within the WARC files. It is used for accounting purposes:

  • Tracker admins can check the logs for faulty grab scripts and requeue the faulty items.
  • Tracker admins can require the user to upgrade the scripts.

Always change the version whenever you make a non-cosmetic change. Note that defining this constant does nothing by itself; be sure it is actually used within the pipeline.

Misc constants

USER_AGENT = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27 ArchiveTeam"
TRACKER_ID = "posterous"
TRACKER_HOST = "tracker.archiveteam.org"

Defining constants such as USER_AGENT and TRACKER_HOST is good practice: it keeps the configuration in one place and the code clean.

Check IP address

This task checks the IP address to ensure the user is not behind a proxy or firewall. Sometimes websites are censored or the user is behind a captive portal (like a coffeeshop wifi) which will ruin results.

class CheckIP(SimpleTask):
    def __init__(self):
        SimpleTask.__init__(self, "CheckIP")
        self._counter = 0

    def process(self, item):
        # Check the IP address only occasionally
        if self._counter <= 0:
            ip_str = socket.gethostbyname('example.com')
            if ip_str not in ['1.2.3.4', '1.2.3.6']:
                item.log_output('Got IP address: %s' % ip_str)
                item.log_output(
                    'Are you behind a firewall/proxy? That is a big no-no!')
                raise Exception(
                    'Are you behind a firewall/proxy? That is a big no-no!')
            self._counter = 10
        else:
            self._counter -= 1

PrepareDirectories & MoveFiles

class PrepareDirectories(SimpleTask):
  """
  A task that creates temporary directories and initializes filenames.

  It initializes these directories, based on the previously set item_name:
    item["item_dir"] = "%{data_dir}/%{item_name}"
    item["warc_file_base"] = "%{warc_prefix}-%{item_name}-%{timestamp}"

  These attributes are used in the following tasks, e.g., the Wget call.

  * set warc_prefix to the project name.
  * item["data_dir"] is set by the environment: it points to a working
    directory reserved for this item.
  * use item["item_dir"] for temporary files
  """
  def __init__(self, warc_prefix):
    SimpleTask.__init__(self, "PrepareDirectories")
    self.warc_prefix = warc_prefix

  def process(self, item):
    item_name = item["item_name"]
    dirname = "/".join(( item["data_dir"], item_name ))

    if os.path.isdir(dirname):
      shutil.rmtree(dirname)
    os.makedirs(dirname)

    item["item_dir"] = dirname
    item["warc_file_base"] = "%s-%s-%s" % (self.warc_prefix, item_name, time.strftime("%Y%m%d-%H%M%S"))

    open("%(item_dir)s/%(warc_file_base)s.warc.gz" % item, "w").close()


class MoveFiles(SimpleTask):
  """
  After downloading, this task moves the warc file from the
  item["item_dir"] directory to the item["data_dir"], and removes
  the files in the item["item_dir"] directory.
  """
  def __init__(self):
    SimpleTask.__init__(self, "MoveFiles")

  def process(self, item):
    os.rename("%(item_dir)s/%(warc_file_base)s.warc.gz" % item,
              "%(data_dir)s/%(warc_file_base)s.warc.gz" % item)

    shutil.rmtree("%(item_dir)s" % item)

These tasks are "tradition" (meaning, they are copied and pasted, then modified to fit each project) for managing temporary files.

Note that PrepareDirectories creates an empty warc.gz file, since later tasks expect one to exist.
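The placeholder file is created with the same %-interpolation used throughout the pipeline. Here is that step in isolation, with hypothetical item values (tempfile is used so the sketch is self-contained):

```python
# Touching a zero-byte warc.gz placeholder, as PrepareDirectories does.
import os
import tempfile

item = {
    "item_dir": tempfile.mkdtemp(),          # stands in for the item's temp dir
    "warc_file_base": "posterous.com-exampleuser-20131129-000000",
}

# open(...).close() is the traditional Python way to "touch" a file
open("%(item_dir)s/%(warc_file_base)s.warc.gz" % item, "w").close()
```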

project variable

project = Project(
  title = "Posterous",
  project_html = """
    <img class="project-logo"
      alt="Posterous Logo"
      src="http://archiveteam.org/images/6/6c/Posterous_logo.png"
      height="50"/>
    <h2>Posterous.com
      <span class="links">
        <a href="http://www.posterous.com/">Website</a> · 
        <a href="http://tracker.archiveteam.org/posterous/">Leaderboard</a>
      </span>
    </h2>
    <p><i>Posterous</i> is closing on April 30th, 2013</p>
  """
   , utc_deadline = datetime.datetime(2013, 04, 30, 23, 59, 0)
)

This variable is used within the Warrior to show the HTML at the top of the page.

Note that this could potentially be used to show important messages using <p class="projectBroadcastMessage"></p>. However, manual script users will not see anything related to this variable, so you may want to print out any important messages instead.

pipeline variable

Here's a real chunk of code.

pipeline = Pipeline(
  # request an item from the tracker (using the universal-tracker protocol)
  # the downloader variable will be set by the warrior environment
  #
  # this task will wait for an item and sets item["item_name"] to the item name
  # before finishing
  GetItemFromTracker("http://%s/%s" % (TRACKER_HOST, TRACKER_ID), downloader, VERSION),

  # create the directories and initialize the filenames (see above)
  # warc_prefix is the first part of the warc filename
  #
  # this task will set item["item_dir"] and item["warc_file_base"]
  PrepareDirectories(warc_prefix="posterous.com"),

  # execute Wget+Lua
  #
  # the ItemInterpolation() objects are resolved during runtime
  # (when there is an Item with values that can be added to the strings)
  WgetDownload([ WGET_LUA,
      "-U", USER_AGENT,
      "-nv",
      "-o", ItemInterpolation("%(item_dir)s/wget.log"),
      "--no-check-certificate",
      "--output-document", ItemInterpolation("%(item_dir)s/wget.tmp"),
      "--truncate-output",
      "-e", "robots=off",
      "--rotate-dns",
      "--recursive", "--level=inf",
      "--page-requisites",
      "--span-hosts", 
      "--domains", ItemInterpolation("%(item_name)s,s3.amazonaws.com,files.posterous.com,"
        "getfile.posterous.com,getfile0.posterous.com,getfile1.posterous.com,"
        "getfile2.posterous.com,getfile3.posterous.com,getfile4.posterous.com,"
        "getfile5.posterous.com,getfile6.posterous.com,getfile7.posterous.com,"
        "getfile8.posterous.com,getfile9.posterous.com,getfile10.posterous.com"),
      "--reject-regex", r"\.com/login",
      "--timeout", "60",
      "--tries", "20",
      "--waitretry", "5",
      "--lua-script", "posterous.lua",
      "--warc-file", ItemInterpolation("%(item_dir)s/%(warc_file_base)s"),
      "--warc-header", "operator: Archive Team",
      "--warc-header", "posterous-dld-script-version: " + VERSION,
      "--warc-header", ItemInterpolation("posterous-user: %(item_name)s"),
      ItemInterpolation("http://%(item_name)s/")
    ],
    max_tries = 2,
    # check this: which Wget exit codes count as a success?
    accept_on_exit_code = [ 0, 8 ],
  ),

  # this will set the item["stats"] string that is sent to the tracker (see below)
  PrepareStatsForTracker(
    # there are a few normal values that need to be sent
    defaults = { "downloader": downloader, "version": VERSION },
    # this is used for the size counter on the tracker:
    # the groups should correspond with the groups set configured on the tracker
    file_groups = {
      # there can be multiple groups with multiple files
      # file sizes are measured per group
      "data": [ ItemInterpolation("%(item_dir)s/%(warc_file_base)s.warc.gz") ]
    },
  ),

  # remove the temporary files, move the warc file from
  # item["item_dir"] to item["data_dir"]
  MoveFiles(),
  
  # there can be multiple items in the pipeline, but this wrapper ensures
  # that there is only one item uploading at a time
  #
  # the NumberConfigValue can be changed in the configuration panel
  LimitConcurrent(NumberConfigValue(min=1, max=4, default="1",
    name="shared:rsync_threads", title="Rsync threads", 
    description="The maximum number of concurrent uploads."),
    # this upload task asks the tracker for an upload target
    # this can be HTTP or rsync and can be changed in the tracker admin panel
    UploadWithTracker(
      "http://%s/%s" % (TRACKER_HOST, TRACKER_ID),
      downloader = downloader,
      version = VERSION,
      # list the files that should be uploaded.
      # this may include directory names.
      # note: HTTP uploads will only upload the first file on this list
      files = [
        ItemInterpolation("%(data_dir)s/%(warc_file_base)s.warc.gz")
      ],
      # the relative path for the rsync command
      # (this defines if the files are uploaded to a subdirectory on the server)
      rsync_target_source_path = ItemInterpolation("%(data_dir)s/"),
      # extra rsync parameters (probably standard)
      rsync_extra_args = [
        "--recursive",
        "--partial",
        "--partial-dir", ".rsync-tmp"
      ]
    ),
  ),

  # if the item passed every task, notify the tracker and report the statistics
  SendDoneToTracker(
    tracker_url = "http://%s/%s" % (TRACKER_HOST, TRACKER_ID),
    stats = ItemValue("stats")
  )
)

It's pretty big.

Notice:

  • the downloader variable should be left undefined
  • ItemInterpolation holds some magic. ItemInterpolation("%(item_dir)s/wget.log").realize(item) executes "%(item_dir)s/wget.log" % item, which gives us item["item_dir"] + "/wget.log"
  • --output-document concatenates everything into a single temporary file.
  • --truncate-output is a Wget+Lua option. It turns --output-document into a temporary-file option: Wget downloads to the file, the URLs are extracted, and then the file is truncated to 0 bytes.
  • the use of -e robots=off because robots.txt is bad
  • --lua-script posterous.lua specifies the Lua script that controls Wget
  • NumberConfigValue adds another setting to the Warrior's advanced settings page
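The ItemInterpolation behavior described above is plain Python %-formatting against the item's values, which can be seen directly (the paths here are hypothetical):

```python
# What ItemInterpolation("%(item_dir)s/wget.log") resolves to at runtime:
# the template string is %-formatted with the item's key/value pairs.
item = {"item_dir": "/data/exampleuser"}
path = "%(item_dir)s/wget.log" % item
```

Any key set earlier in the pipeline (item_dir, warc_file_base, item_name, ...) can therefore appear in a later task's argument strings.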

Lua Script

The Lua script is like a parasite controlling and modifying Wget's behavior from within.

Generally, scripts will want to use:

  1. download_child_p
  2. httploop_result
  3. get_urls

download_child_p

This hook is useful for advanced URL accepting and rejecting. Although Wget supports regular expressions in its command-line options, they can get messy. Note that Lua supports only a small subset of regular expressions, called patterns.

httploop_result

This hook is useful for checking if we have been banned or implementing our own --wait.

Here is a practical example that delays Wget for a minute on a ban or server overload, waits approximately 1 second between normal requests, and applies no delay on a content delivery network:

wget.callbacks.httploop_result = function(url, err, http_stat)
  local sleep_time = 60
  local status_code = http_stat["statcode"]

  if status_code == 420 or status_code >= 500 then
    if status_code == 420 then
      io.stdout:write("\nBanned (code "..http_stat.statcode.."). Sleeping for ".. sleep_time .." seconds.\n")
    else
      io.stdout:write("\nServer angered! (code "..http_stat.statcode.."). Sleeping for ".. sleep_time .." seconds.\n")
    end

    io.stdout:flush()

    -- Execute the UNIX sleep command (since Lua does not have its own delay function)
    -- Note that wget has its own linear backoff to this time as well
    os.execute("sleep " .. sleep_time)

    -- Tells wget to try again
    return wget.actions.CONTINUE

  else
    -- We're okay; sleep a bit (if we have to) and continue
    local sleep_time = 1.0 * (math.random(75, 125) / 100.0)

    if string.match(url["url"], "website-cdn%.net") then
      -- We should be able to go fast on images since that's what a web browser does
      sleep_time = 0
    end

    if sleep_time > 0.001 then
      os.execute("sleep " .. sleep_time)
    end

    -- Tells wget to resume normal behavior
    return wget.actions.NOTHING
  end
end
  • You will likely want to be cautious and use the wget.actions.CONTINUE action to cover a wide range of cases, since Wget may treat a temporary server overload as a permanent error.
  • Yahoo! likes to use status 999 to indicate a temporary ban.

get_urls

This hook is used to add additional URLs.

This example injects URLs to simulate JavaScript requests:

wget.callbacks.get_urls = function(file, url, is_css, iri)
  local urls = {}

  -- Read the downloaded file so it can be scanned for image IDs
  local f = io.open(file)
  local html = f:read("*all")
  f:close()

  for image_id in string.gmatch(html, "([a-zA-Z0-9]-)/image_thumb.png") do
    table.insert(urls, {
      url="http://example.com/photo_viewer.php?imageid="..image_id,
      post_data="crf_token=deadbeef"
    })
  end

  return urls
end

It can also be used to display a progress message:

url_count = 0

wget.callbacks.get_urls = function(file, url, is_css, iri)
  url_count = url_count + 1
  if url_count % 5 == 0 then
    io.stdout:write("\r - Downloaded "..url_count.." URLs.")
    io.stdout:flush()
  end

  -- No extra URLs to add
  return {}
end

Useful Snippets

Read the first 4 kilobytes of a file:

read_file_short = function(file)
  if file then
    local f = io.open(file)
    if f then
      local data = f:read(4096)
      f:close()
      return data or ""
    end
  end
  return ""
end

Run a pipeline.py

To run a pipeline file, run the command:

run-pipeline pipeline.py YOUR_NICKNAME

For more options, run:

run-pipeline --help

External Links

  • Developer Documentation