Difference between revisions of "Dev/Tracker"

(The Tracker: add info about tracker logs)
m (Reverted edits by Megalanya1 (talk) to last revision by Start)
 
(24 intermediate revisions by 5 users not shown)
Line 1: Line 1:
This article describes how to set up your own [[tracker]] and projects.
+
This article describes how to set up your own '''[[tracker]]''' just like the official Archive Team tracker. Use this guide only if you want to do a full test of the infrastructure.
  
You don't need to be a computer scientist to set up your own tracker, but you will need to be comfortable with:
+
'''Note:''' A virtual machine appliance is available at [https://github.com/ArchiveTeam/archiveteam-dev-env ArchiveTeam/archiveteam-dev-env] which contains a ready-to-use tracker.
  
* Environment:
+
Installation will cover:
** Ubuntu/Debian
+
 
** running scripts
+
* Environment: Ubuntu/Debian
** compiling programs
 
 
* Languages:
 
* Languages:
 
** Python
 
** Python
 
** Ruby
 
** Ruby
 
** JavaScript
 
** JavaScript
** Lua
 
** Shell scripting like Bash
 
 
* Web:  
 
* Web:  
 
** Nginx
 
** Nginx
 
** Phusion Passenger
 
** Phusion Passenger
** Rails
 
 
** Redis
 
** Redis
 
** Node.js
 
** Node.js
Line 25: Line 21:
 
** Wget
 
** Wget
 
** regular expressions
 
** regular expressions
 
There are 3 components to the tracker and project system:
 
 
# The Seesaw client
 
# The Rsync server
 
# The Tracker server
 
 
== Writing a Seesaw Client ==
 
 
The Seesaw client is a specific set of tasks that must be done within an item. Think of it as a template of instructions. Typically, the file is called pipeline.py. The pipeline file uses the [https://github.com/ArchiveTeam/seesaw-kit Seesaw Library].
 
 
The pipeline file will typically use [[Wget_with_Lua_hooks|Wget with Lua]] scripting. The Lua script is provided as an argument to Wget within the pipeline file. It controls fine-grained operations within Wget, such as rejecting unneeded URLs or adding more URLs as they are discovered.
 
 
Take a look at the grab scripts in recent [https://github.com/ArchiveTeam/ Archive Team repositories] for examples of clients.
 
 
=== Installation ===
 
 
You will need:
 
 
* Python 2.7
 
* Lua
 
* Wget with Lua hooks
 
 
Typically, you can install these by running:
 
 
sudo apt-get install build-essential lua5.1 liblua5.1-0-dev python python-setuptools python-dev openssl libssl-dev python-pip make
 
sudo pip install seesaw
 
 
You will also need Wget with Lua. Look into recent repositories for the following script and run it:
 
 
./get-wget-lua.sh
 
 
=== The pipeline file ===
 
 
The pipeline file typically includes:
 
 
* Copy and pasted monkey patches
 
* A routine to find Wget Lua
 
* A version number in the form of <code>YYYYMMDD.NN</code>
 
* Tracker hostname
 
* Custom Tasks:
 
** PrepareDirectories
 
** MoveFiles
 
* Project information saved into the <code>project</code> variable
 
* Instructions on how to deal with the item saved into the <code>pipeline</code> variable
 
* An undeclared <code>downloader</code> variable which will be filled in by the Seesaw library
 
 
It is important to remember that each Task is a template for how to deal with each Item. Item-specific variables should not be stored on a Task; they should be saved onto the Item instead.
 
 
To run a pipeline file, run the command:
 
 
run-pipeline pipeline.py YOUR_NICKNAME
 
 
== Setup the Rsync target ==
 
 
The Rsync target consists of disk space, Rsync, and WARC packing scripts in a dedicated user account.
 
 
Create the system user account dedicated for the Rsync target:
 
 
sudo adduser --system --group --shell /bin/bash archiveteam
 
 
Log in as archiveteam:
 
 
sudo -u archiveteam -i
 
 
Create a place to store the uploads:
 
 
mkdir -p PROJECT_NAME/incoming-uploads/
 
 
You may log out of archiveteam at this point.
 
 
=== Rsync ===
 
 
You will need to install Rsync:
 
 
sudo apt-get install rsync
 
 
Once rsync is installed, you will need to edit the rsync configuration file. If no <code>rsyncd.conf</code> exists in <code>/etc</code>, copy it from <code>/usr/share/doc/rsync/examples/rsyncd.conf</code>
 
 
Rsync uses a concept of "modules", which can be thought of as namespaces. If you have copied the example file, you can modify the example ftp module to fit your new project. You may want to name the module after the project.
 
 
You will also need to include:
 
 
* path = /home/archiveteam/PROJECT_NAME/incoming-uploads/
 
* read only = no
 
* uid = archiveteam
 
* gid = archiveteam
 
 
Make Rsync start up as daemon on boot up by editing <code>/etc/default/rsync</code>. Ensure it reads
 
 
RSYNC_ENABLE=true
 
 
Start up the Rsync daemon:
 
 
sudo invoke-rc.d rsync start
 
 
=== The Megawarc Factory ===
 
 
The [https://github.com/ArchiveTeam/archiveteam-megawarc-factory Megawarc Factory] is a set of scripts that package and bundle up all the uploaded WARC files that are received.
 
 
If Git, Curl, or Screen is not yet installed, install it now:
 
 
sudo apt-get install git curl screen
 
 
Log in as archiveteam and download the scripts needed:
 
 
git clone https://github.com/ArchiveTeam/archiveteam-megawarc-factory.git
 
cd archiveteam-megawarc-factory/
 
git clone https://github.com/alard/megawarc.git
 
cd
 
 
Let's begin to populate the configuration file:
 
 
cp archiveteam-megawarc-factory/config.example.sh PROJECT_NAME/config.sh
 
nano PROJECT_NAME/config.sh
 
 
Going through the config.sh:
 
 
* MEGABYTES_PER_CHUNK denotes how big the megawarc files should be. Typically it should be set to 50GB, but if you really don't have the space, you can use smaller files such as 10GB.
 
* IA_AUTH is your [http://archive.org/help/abouts3.txt Internet Archive S3-like] API [http://archive.org/account/s3.php authentication keys].
 
* IA_COLLECTION, IA_ITEM_TITLE, IA_ITEM_PREFIX, FILE_PREFIX all should have the todos replaced with the project name.
 
* FS1_BASE_DIR should be set to /home/archiveteam/PROJECT_NAME/
 
* FS2_BASE_DIR should be set to same as above or another location.
 
* COMPLETED_DIR should be left empty (i.e., "") if the uploaded file is to be deleted.
 
 
Bother someone, or ask politely, about getting permission to upload your files to the collection archiveteam_PROJECT_NAME. You can ask on #archiveteam on EFNet.
 
 
Let's run the Megawarc Factory. First, create a sentinel file:
 
 
cd PROJECT_NAME
 
touch RUN
 
 
You can run the Megawarc Factory in Screen. The 3 scripts will run in separate command shells within one Screen session:
 
 
screen
 
./chunk-multiple
 
CTRL+A c
 
ionice -c 2 -n 6 nice -n 19 ./pack-multiple
 
CTRL+A c
 
./upload-multiple
 
CTRL+A d
 
 
Here are a few Screen pointers:
 
 
* screen -r will resume an existing screen session
 
* CTRL+A c creates a new command window
 
* CTRL+A SPACE switches to the next window
 
* CTRL+A " shows you a list of windows
 
* CTRL+A d leaves, or detaches, the screen session
 
 
To stop the Megawarc Factory, remove the sentinel file:
 
 
rm RUN
 
 
You can log out of the archiveteam account now.
 
  
 
== The Tracker ==
 
== The Tracker ==
Line 191: Line 32:
 
=== Redis ===
 
=== Redis ===
  
Redis is database stored in memory. So, engineer your item names so you do not have to load many items at once. Redis saves its database periodicly into a file located at /var/lib/redis/6379/dump.rdb. It is safe to copy the file, e.g., for backups.
+
Redis is database stored in memory. So, item names should be engineered to be memory efficient. Redis saves its database periodically into a file located at /var/lib/redis/6379/dump.rdb. It is safe to copy the file, e.g., for backups.
  
 
To install Redis, you may follow these [http://redis.io/topics/quickstart quickstart instructions], but we'll show you how.
 
To install Redis, you may follow these [http://redis.io/topics/quickstart quickstart instructions], but we'll show you how.
Line 204: Line 45:
 
Now install the server:
 
Now install the server:
  
  sudo ./utils/install_server.sh
+
sudo make install
 +
cd utils
 +
  sudo ./install_server.sh
  
 
Note, by default, it runs as root. Let's stop it and make it run under www-data:
 
Note, by default, it runs as root. Let's stop it and make it run under www-data:
Line 271: Line 114:
 
A list of things needed to be installed will be shown. Log out of the tracker account, install them, and log back into the tracker account.
 
A list of things needed to be installed will be shown. Log out of the tracker account, install them, and log back into the tracker account.
  
Install Ruby and Rails:
+
Install Ruby and Bundler:
  
  rvm install 2.0
+
  rvm install 2.2.2
 
  rvm rubygems current
 
  rvm rubygems current
  gem install rails
+
  gem install bundler
gem install bundle
 
  
 
Install Passenger:
 
Install Passenger:
Line 295: Line 137:
 
  passenger_enabled on;
 
  passenger_enabled on;
 
  client_max_body_size 15M;
 
  client_max_body_size 15M;
 +
 +
The logs will get big so we'll use logrotate. Save this into <code>/home/tracker/logrotate.conf</code>:
 +
 +
/home/tracker/nginx/logs/error.log
 +
/home/tracker/nginx/logs/access.log {
 +
      daily
 +
      rotate 10
 +
      copytruncate
 +
      delaycompress
 +
      compress
 +
      notifempty
 +
      missingok
 +
      size 10M
 +
}
 +
 +
To call logrotate, we'll add an entry using crontab:
 +
 +
crontab -e
 +
 +
Now add the following line:
 +
 +
@daily /usr/sbin/logrotate --state /home/tracker/.logrotate.state /home/tracker/logrotate.conf
  
 
Log out of the tracker account at this point.
 
Log out of the tracker account at this point.
  
Let's create an Upstart configuration file to start up Nginx. Save this into <code>/etc/init/ngixn-tracker.conf</code>:
+
Let's create an Upstart configuration file to start up Nginx. Save this into <code>/etc/init/nginx-tracker.conf</code>:
  
 
  description "nginx http daemon"
 
  description "nginx http daemon"
Line 311: Line 175:
 
   
 
   
 
  exec /home/tracker/nginx/sbin/nginx -c /home/tracker/nginx/conf/nginx.conf -g "daemon off;"
 
  exec /home/tracker/nginx/sbin/nginx -c /home/tracker/nginx/conf/nginx.conf -g "daemon off;"
 +
 +
Or, if you use Systemd, put this into <code>/lib/systemd/system/nginx-tracker.service</code>:
 +
 +
[Unit]
 +
Description="nginx http daemon"
 +
 +
[Service]
 +
Type=simple
 +
ExecStart=/home/tracker/nginx/sbin/nginx -c /home/tracker/nginx/conf/nginx.conf -g "daemon off;"
  
 
=== Tracker ===
 
=== Tracker ===
Line 344: Line 217:
 
  }
 
  }
  
Now we may need to fix an issue with Passenger forking after the Redis connection has been made. Please see https://github.com/ArchiveTeam/universal-tracker/issues/5 for more information.
+
* Now we may need to fix an issue with Passenger forking after the Redis connection has been made. Please see https://github.com/ArchiveTeam/universal-tracker/issues/5 for more information.
 +
* There is also an issue with non-ASCII names. See https://github.com/ArchiveTeam/universal-tracker/issues/7.
  
 
Now install the necessary gems:
 
Now install the necessary gems:
Line 382: Line 256:
 
Install the Node.js libraries needed:
 
Install the Node.js libraries needed:
  
  npm install socket.io
+
  npm install
  npm install redis
+
 
 +
If you get an error while installing hiredis, you may need to provide Debian's "nodejs" as "node". Symlink "node" to the nodejs executable and try again.
  
 
Log out of the tracker account at this point.
 
Log out of the tracker account at this point.
Line 398: Line 273:
 
   
 
   
 
  exec node /home/tracker/broadcaster/server.js
 
  exec node /home/tracker/broadcaster/server.js
 +
 +
Or, for Systemd, put this into <code>/lib/systemd/system/nodejs-tracker.service</code>:
 +
 +
[Unit]
 +
Description="tracker nodejs daemon"
 +
 +
[Service]
 +
Type=forking
 +
Group=tracker
 +
User=tracker
 +
ExecStart=/usr/bin/js /home/tracker/broadcaster/server.js
  
 
=== Tracker Setup ===
 
=== Tracker Setup ===
Line 403: Line 289:
 
Start up the Tracker and Broadcaster:
 
Start up the Tracker and Broadcaster:
  
 +
Upstart:
 
  sudo start nginx-tracker
 
  sudo start nginx-tracker
 
  sudo start nodejs-tracker
 
  sudo start nodejs-tracker
  
You now need to configure the tracker. Open up your web browser and visit http://localhost/.
+
Systemd:
 +
sudo systemctl start nginx-tracker
 +
sudo systemctl start nodejs-tracker
 +
 
 +
You now need to configure the tracker. Open up your web browser and visit http://localhost/global-admin/.
  
 
* In Global-Admin→Configuration→Live logging host, specify the public location of the Node.js app. By default, it uses port 8080.
 
* In Global-Admin→Configuration→Live logging host, specify the public location of the Node.js app. By default, it uses port 8080.
Line 415: Line 306:
  
 
* If you followed this guide, the rsync location is defined as <code>rsync://HOSTNAME/PROJECT_NAME/:downloader/</code>
 
* If you followed this guide, the rsync location is defined as <code>rsync://HOSTNAME/PROJECT_NAME/:downloader/</code>
 +
* The '''''trailing slash''''' within the rsync URL is very important. Without it, files will not be uploaded within the directory.
  
=== Claims ===
+
==== Claims ====
  
 
You probably want to have Cron clearing out old claims. The Tracker includes a Ruby script that will do that for you. By default, it removes claims older than 6 hours. You may want to change that for big items by creating a copy of the script for each project.
 
You probably want to have Cron clearing out old claims. The Tracker includes a Ruby script that will do that for you. By default, it removes claims older than 6 hours. You may want to change that for big items by creating a copy of the script for each project.
Line 434: Line 326:
 
  0 */6 * * * cd /home/tracker/universal-tracker && WHICH_RUBY scripts/release-stale.rb PROJECT_NAME
 
  0 */6 * * * cd /home/tracker/universal-tracker && WHICH_RUBY scripts/release-stale.rb PROJECT_NAME
  
=== Logs ===
+
==== Logs ====
  
 
Since the Tracker stores logs into Redis, it will use up memory quickly. <code>log-drainer.rb</code> continuously writes the logs into a text file:
 
Since the Tracker stores logs into Redis, it will use up memory quickly. <code>log-drainer.rb</code> continuously writes the logs into a text file:
Line 446: Line 338:
  
 
  @daily find /home/tracker/universal-tracker/logs/ -iname "*.log" -mtime +2 -exec xz {} \;
 
  @daily find /home/tracker/universal-tracker/logs/ -iname "*.log" -mtime +2 -exec xz {} \;
 +
 +
==== Reducing memory usage ====
 +
 +
The Passenger Ruby module may use up too much memory. You can add the following lines to your nginx config. Add these inside the <code>http</code> block:
 +
 +
passenger_max_pool_size 2;
 +
passenger_max_requests 10000;
 +
 +
The first line allows spawning maximum of 2 processes. The second line restarts Passenger after 10,000 requests to free memory caused by memory leaks.
 +
 +
{{devnav}}
 +
 +
{{Navigation box}}

Latest revision as of 16:26, 17 January 2017

This article describes how to set up your own tracker just like the official Archive Team tracker. Use this guide only if you want to do a full test of the infrastructure.

Note: A virtual machine appliance is available at ArchiveTeam/archiveteam-dev-env which contains a ready-to-use tracker.

Installation will cover:

  • Environment: Ubuntu/Debian
  • Languages:
    • Python
    • Ruby
    • JavaScript
  • Web:
    • Nginx
    • Phusion Passenger
    • Redis
    • Node.js
  • Tools:
    • Screen
    • Rsync
    • Git
    • Wget
    • regular expressions

The Tracker

The Tracker manages what items are claimed by users that run the Seesaw client. It also shows a pretty leaderboard.

Let's create a dedicated account to run the web server and tracker:

sudo adduser --system --group --shell /bin/bash tracker

Redis

Redis is a database stored in memory, so item names should be engineered to be memory efficient. Redis periodically saves its database into a file located at /var/lib/redis/6379/dump.rdb. It is safe to copy the file, e.g., for backups.
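As a rough illustration of what "memory efficient" item names can mean, here is a minimal Python sketch that compares one item per user ID against one item per block of IDs. The "user:" prefix and the range format are made up for this example and are not a tracker convention; the sketch only counts the bytes of the names themselves, not Redis's own overhead.

```python
# Illustrative only: the "user:" prefix and the range format below are
# assumptions for this example, not something the tracker requires.

def per_user_items(start, stop):
    """One item per user ID, e.g. 'user:1000'."""
    return ["user:%d" % n for n in range(start, stop)]


def range_items(start, stop, size):
    """One item per block of IDs, e.g. 'user:1000-1099'."""
    return [
        "user:%d-%d" % (lo, min(lo + size, stop) - 1)
        for lo in range(start, stop, size)
    ]


def name_bytes(items):
    """Total bytes used by the item names alone (ASCII names assumed)."""
    return sum(len(name.encode("ascii")) for name in items)


if __name__ == "__main__":
    singles = per_user_items(0, 100000)
    blocks = range_items(0, 100000, 100)
    print(len(singles), "items,", name_bytes(singles), "bytes of names")
    print(len(blocks), "items,", name_bytes(blocks), "bytes of names")
```

Grouping IDs into blocks means far fewer items for the tracker to hold, at the cost of each item representing more work for the client.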

To install Redis, you may follow these quickstart instructions, but we'll show you how.

These steps are from the quickstart guide:

wget http://download.redis.io/redis-stable.tar.gz
tar xvzf redis-stable.tar.gz
cd redis-stable
make

Now install the server:

sudo make install
cd utils
sudo ./install_server.sh

Note that, by default, Redis runs as root. Let's stop it and make it run under www-data:

sudo invoke-rc.d redis_6379 stop
sudo adduser --system --group www-data
sudo chown -R www-data:www-data /var/lib/redis/6379/
sudo chown -R www-data:www-data /var/log/redis_6379.log

Edit the config file /etc/redis/6379.conf so that it contains options like:

bind 127.0.0.1
pidfile /var/run/shm/redis_6379.pid

Now tell the startup script to run it as www-data:

sudo nano /etc/init.d/redis_6379

Change the EXEC and CLIEXEC variables to use sudo -u www-data -g www-data:

EXEC="sudo -u www-data -g www-data /usr/local/bin/redis-server"
CLIEXEC="sudo -u www-data -g www-data /usr/local/bin/redis-cli"
PIDFILE=/var/run/shm/redis_6379.pid

To avoid catastrophe with background saves failing on fork() (Redis needs lots of memory), run:

sudo sysctl vm.overcommit_memory=1

The above setting will be lost after reboot. Add this line to /etc/sysctl.conf:

vm.overcommit_memory=1
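If you want to double-check that the setting really made it into the file, a small Python sketch like the following can scan the text of sysctl.conf. The function name is hypothetical, and the parsing is deliberately simple: it assumes "key = value" lines and treats anything after a "#" as a comment.

```python
def overcommit_enabled(sysctl_text):
    """Return True if the text sets vm.overcommit_memory to 1.

    If the key appears more than once, the last occurrence wins.
    """
    value = None
    for line in sysctl_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, _, setting = line.partition("=")
        if key.strip() == "vm.overcommit_memory":
            value = setting.strip()
    return value == "1"


if __name__ == "__main__":
    sample = "# Redis background saves\nvm.overcommit_memory=1\n"
    print(overcommit_enabled(sample))
```

To check the real file, read /etc/sysctl.conf and pass its contents to the function.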

The log file will get big so we need a logrotate config. Create one at /etc/logrotate.d/redis with the config:

/var/log/redis_*.log {
      daily
      rotate 10
      copytruncate
      delaycompress
      compress
      notifempty
      missingok
      size 10M
}

Start up Redis again using:

sudo invoke-rc.d redis_6379 start

Nginx with Passenger

Nginx is a web server. Phusion Passenger is a module within Nginx that runs Rails applications.

There is a guide on how to install Nginx with Passenger; the following instructions are similar.

Log in as tracker:

sudo -u tracker -i

We'll use RVM to install Ruby libraries:

curl -L get.rvm.io | bash -s stable
source ~/.rvm/scripts/rvm
rvm requirements

A list of packages that need to be installed will be shown. Log out of the tracker account, install them, and log back into the tracker account.

Install Ruby and Bundler:

rvm install 2.2.2
rvm rubygems current
gem install bundler

Install Passenger:

gem install passenger

Install Nginx. This command will download, compile, and install a basic Nginx server:

passenger-install-nginx-module

Use the following prefix for Nginx installation:

/home/tracker/nginx/

Point Nginx at the location of the tracker software (to be installed later). Edit nginx/conf/nginx.conf and use the following lines in the "location /" block:

root /home/tracker/universal-tracker/public;
passenger_enabled on;
client_max_body_size 15M;

The logs will get big so we'll use logrotate. Save this into /home/tracker/logrotate.conf:

/home/tracker/nginx/logs/error.log
/home/tracker/nginx/logs/access.log {
     daily
     rotate 10
     copytruncate
     delaycompress
     compress
     notifempty
     missingok
     size 10M
}

To call logrotate, we'll add an entry using crontab:

crontab -e

Now add the following line:

@daily /usr/sbin/logrotate --state /home/tracker/.logrotate.state /home/tracker/logrotate.conf

Log out of the tracker account at this point.

Let's create an Upstart configuration file to start up Nginx. Save this into /etc/init/nginx-tracker.conf:

description "nginx http daemon"

start on runlevel [2]
stop on runlevel [016]

setuid tracker
setgid tracker

console output

exec /home/tracker/nginx/sbin/nginx -c /home/tracker/nginx/conf/nginx.conf -g "daemon off;"

Or, if you use Systemd, put this into /lib/systemd/system/nginx-tracker.service:

[Unit]
Description="nginx http daemon"

[Service]
Type=simple
ExecStart=/home/tracker/nginx/sbin/nginx -c /home/tracker/nginx/conf/nginx.conf -g "daemon off;"

Tracker

Log into the tracker account.

Download the Tracker software:

git clone https://github.com/ArchiveTeam/universal-tracker.git

We'll need to configure the location of Redis. Copy the config file:

cp universal-tracker/config/redis.json.example universal-tracker/config/redis.json

Add a "production" object into the JSON file. Here is an example:

{
  "development": {
    "host": "127.0.0.1",
    "port": 6379,
    "db":   13
  },
  "test": {
    "host": "127.0.0.1",
    "port": 6379,
    "db":   14
  },
  "production": {
    "host":"127.0.0.1",
    "port":6379,
    "db": 1
  }
}
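A typo in this file is easy to make, so here is a minimal Python sketch that checks the JSON parses and that the "production" object is present. The function name is hypothetical, and the list of required keys is simply taken from the example above; the tracker itself may accept other settings.

```python
import json

# Keys taken from the example in this guide (an assumption, not a spec).
REQUIRED_KEYS = ("host", "port", "db")


def check_redis_config(text, environment="production"):
    """Return the settings for one environment, or raise ValueError."""
    config = json.loads(text)
    if environment not in config:
        raise ValueError("missing %r object" % environment)
    settings = config[environment]
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise ValueError("missing keys: %s" % ", ".join(missing))
    return settings


if __name__ == "__main__":
    sample = '{"production": {"host": "127.0.0.1", "port": 6379, "db": 1}}'
    print(check_redis_config(sample))
```

To check the real file, pass in the contents of universal-tracker/config/redis.json instead of the sample string.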

Now install the necessary gems:

cd universal-tracker
bundle install

Log out of the tracker account at this point.

Node.js

Node.js is required to run the fancy leaderboard using WebSockets. We'll use NPM to manage the Node.js libraries:

sudo apt-get install npm

Log into the tracker account.

Now, we manually edit the Node.js program because it has problems:

cp -R universal-tracker/broadcaster .
nano broadcaster/server.js

Modify the env and trackerConfig variables to something like this:

var env = {
    tracker_config: {
        redis_pubsub_channel: "tracker-log"
    },
    redis_db: 1
};
var trackerConfig = env['tracker_config'];

You also need to modify the "transports" configuration by adding websocket. The new line should look like this:

  io.set("transports", ["websocket", "xhr-polling"]);

Install the Node.js libraries needed:

npm install

If you get an error while installing hiredis, you may need to provide Debian's "nodejs" as "node". Symlink "node" to the nodejs executable and try again.

Log out of the tracker account at this point.

Create an Upstart file at /etc/init/nodejs-tracker.conf:

description "tracker nodejs daemon"

start on runlevel [2]
stop on runlevel [016]

setuid tracker
setgid tracker

exec node /home/tracker/broadcaster/server.js

Or, for Systemd, put this into /lib/systemd/system/nodejs-tracker.service:

[Unit]
Description="tracker nodejs daemon"

[Service]
Type=simple
Group=tracker
User=tracker
ExecStart=/usr/bin/js /home/tracker/broadcaster/server.js

Tracker Setup

Start up the Tracker and Broadcaster:

Upstart:

sudo start nginx-tracker
sudo start nodejs-tracker

Systemd:

sudo systemctl start nginx-tracker
sudo systemctl start nodejs-tracker

You now need to configure the tracker. Open up your web browser and visit http://localhost/global-admin/.

  • In Global-Admin→Configuration→Live logging host, specify the public location of the Node.js app. By default, it uses port 8080.

You are now free to manage the tracker.

Notes:

  • If you followed this guide, the rsync location is defined as rsync://HOSTNAME/PROJECT_NAME/:downloader/
  • The trailing slash within the rsync URL is very important. Without it, files will not be uploaded within the directory.
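To make the trailing-slash rule concrete, here is a small Python sketch that refuses a template without the final slash and then fills in the placeholder. It assumes that :downloader is replaced with the downloader's nickname; the function name is hypothetical and this is not the tracker's own code.

```python
def upload_target(template, downloader):
    """Fill in ':downloader' after checking for the trailing slash."""
    if not template.endswith("/"):
        raise ValueError("rsync URL must end with a slash: %r" % template)
    # Assumption: the placeholder stands for the downloader's nickname.
    return template.replace(":downloader", downloader)


if __name__ == "__main__":
    print(upload_target("rsync://HOSTNAME/PROJECT_NAME/:downloader/", "alice"))
```

Running it prints the URL with "alice" in place of the placeholder; dropping the final slash from the template raises an error instead.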

Claims

You probably want to have Cron clearing out old claims. The Tracker includes a Ruby script that will do that for you. By default, it removes claims older than 6 hours. You may want to change that for big items by creating a copy of the script for each project.

To set up Cron, log in as the tracker account, and run:

which ruby

Take note of which Ruby executable is used.

Now edit the Cron table:

crontab -e

Add the following line, which runs release-stale.rb every 6 hours. Replace WHICH_RUBY with the path of the Ruby executable noted above:

0 */6 * * * cd /home/tracker/universal-tracker && WHICH_RUBY scripts/release-stale.rb PROJECT_NAME
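The idea behind releasing stale claims can be sketched in a few lines of Python. This is only an illustration of the "older than 6 hours" rule, not a port of release-stale.rb: the function name and the dictionary layout (item name mapped to the Unix time it was claimed) are assumptions, and the tracker's real storage may look different.

```python
SIX_HOURS = 6 * 60 * 60  # default age mentioned in this guide, in seconds


def stale_claims(claims, now, max_age=SIX_HOURS):
    """Return the item names whose claim is older than max_age seconds.

    `claims` maps item name -> Unix timestamp of the claim (hypothetical
    layout, used here only for illustration).
    """
    return sorted(
        item for item, claimed_at in claims.items()
        if now - claimed_at > max_age
    )


if __name__ == "__main__":
    claims = {"item-a": 0, "item-b": 20000, "item-c": 21600}
    print(stale_claims(claims, now=43200))
```

For projects with big items, a larger max_age plays the same role as editing the threshold in a per-project copy of the script.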

Logs

Since the Tracker stores logs into Redis, it will use up memory quickly. log-drainer.rb continuously writes the logs into a text file:

mkdir -p /home/tracker/universal-tracker/logs/
cd /home/tracker/universal-tracker && ruby scripts/log-drainer.rb

Pressing CTRL+C will stop it. Run this within a Screen session.

This crontab entry will compress the log files that haven't been modified in two days:

@daily find /home/tracker/universal-tracker/logs/ -iname "*.log" -mtime +2 -exec xz {} \;
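One detail of that entry is easy to miss: find's "-mtime +2" test discards fractions of a day, so a file only matches once it is at least three full days old. The following Python sketch mimics that rule for a given file age; the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def matches_mtime_plus(age_seconds, days=2):
    """Mimic find's '-mtime +N' test for a file of the given age.

    find divides the age by 24 hours, discards the fraction, and
    requires the result to be strictly greater than N.
    """
    return age_seconds // SECONDS_PER_DAY > days


if __name__ == "__main__":
    for hours in (47, 60, 71, 72):
        print(hours, "hours old:", matches_mtime_plus(hours * 3600))
```

If logs should be compressed sooner, lower the number in the crontab entry accordingly.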

Reducing memory usage

The Passenger Ruby module may use up too much memory. To limit it, add the following lines inside the http block of your nginx config:

passenger_max_pool_size 2;
passenger_max_requests 10000;

The first line allows a maximum of 2 application processes to be spawned. The second line restarts an application process after it has served 10,000 requests, which frees memory lost to leaks.

