Difference between revisions of "ArchiveTeam Warrior"

Revision as of 01:51, 11 August 2014

What is the Archive Team Warrior?


The Archive Team Warrior is a virtual archiving appliance. You can run it to help with the ArchiveTeam archiving efforts. It will download sites and upload them to our archive — and it’s really easy to do!

The warrior is a virtual machine, so there is no risk to your computer. The warrior will only use your bandwidth and some of your disk space. It will get tasks from and report progress to the Tracker.

Basic usage

The warrior runs on Windows, OS X and Linux using a virtual machine. You'll need one of:

Quick start instructions for VirtualBox

  1. Download the appliance (174MB).
  2. Launch VirtualBox
  3. In VirtualBox, click File > Import Appliance and open the file.
  4. Start the virtual machine.
    • It will fetch the latest updates and will eventually tell you to start your web browser.
  5. Using your regular web browser, visit http://localhost:8001/
  6. On the left, click "Your settings".
  7. Choose a username - we'll show your progress on the leaderboard.
  8. On the left, click the "Available projects" tab and pick a project to work on.
    • Even better: select "ArchiveTeam's Choice" to let your warrior work on the most urgent project.


Alternative virtual machines

Thanks to user-effort, there are alternatives:

Please note that these alternatives are not in widespread use by our warriors, so we may not be able to help with either issues or advanced usage.

Warrior FAQ

Why am I seeing a message that no item was received?

It means that there is no work available. This happens for several reasons:

  • The project has just finished and someone is inspecting the work done. If a problem is discovered, items may be re-queued and more work will become available.
  • In rare cases, you have been banned by a tracker administrator because you were requesting too much work or your internet connection is "unclean". We prefer connections from many public IP addresses, use of non-captive DNS servers, and no proxies/firewalls.

Why am I seeing a message about rate limiting?

Keep in mind that although downloading the internet for digital preservation and fun are the primary goals of all Archive Team activities, serious stress on the target's server may occur. The rate limit is imposed by a tracker administrator and should not be subverted.

Help! The warrior is eating all my bandwidth!

You can limit the warrior's bandwidth quite easily in VirtualBox, as long as you are running a relatively recent version. The option is not available in the GUI, however.

The command

VBoxManage bandwidthctl archiveteam-warrior-2 --name Limit --add network --limit 3

will limit the warrior instance called archiveteam-warrior-2 (currently the default name of the warrior VM) to 3 Mb/s. Adjust as needed.

In the latest version of VirtualBox on Windows, the syntax appears to have changed. The correct command now seems to be:

VBoxManage bandwidthctl archiveteam-warrior-2 add netlimit --type network --limit 3

For more info, consult the VirtualBox manual (Chapter 6, Section 9).
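As far as the VirtualBox manual describes it, creating a bandwidth group is only half of the job: the group also has to be attached to a network adapter before it limits anything. The sketch below strings the steps together using the newer syntax. It assumes the VM is powered off (modifyvm does not work on a running VM), that the warrior uses network adapter 1, and that a lowercase "m" suffix means megabits per second. The function name and the group name "Limit" are just labels for illustration.

```shell
#!/bin/sh
# Sketch: create a network bandwidth group, attach it to NIC 1, and list
# the result. Assumes the VM is powered off and uses network adapter 1.
limit_warrior_bandwidth() {
    vm="$1"      # VM name, e.g. archiveteam-warrior-2
    mbit="$2"    # limit in megabits per second, e.g. 3

    # Create the group, attach it to adapter 1, then show what is configured.
    VBoxManage bandwidthctl "$vm" add Limit --type network --limit "${mbit}m" &&
    VBoxManage modifyvm "$vm" --nicbandwidthgroup1 Limit &&
    VBoxManage bandwidthctl "$vm" list
}
```

Call it as limit_warrior_bandwidth archiveteam-warrior-2 3. Note that, according to the manual, the limit applies to traffic sent by the VM; received traffic is not limited.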

NAT sucks! I want directly-bridged networking!

Simples! (If you're running Linux, that is.)

VBoxManage modifyvm "archiveteam-warrior-2" --nic1 bridged
VBoxManage modifyvm "archiveteam-warrior-2" --bridgeadapter1 eth0

(We presume you want to bind to eth0. Adjust as required. :))
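If you are not sure what your adapter is called, VirtualBox can list the host interfaces it is able to bridge to. The helper below is a small sketch that prints just the interface names; the function name is made up for illustration, and it assumes the usual "Name: ..." lines in the output of VBoxManage list bridgedifs.

```shell
#!/bin/sh
# Sketch: print the names of host interfaces VirtualBox can bridge to.
# Assumes each interface is reported on a line starting with "Name:".
list_bridge_candidates() {
    VBoxManage list bridgedifs | sed -n 's/^Name:[[:space:]]*//p'
}
```

Pick one of the printed names and use it in place of eth0 above.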

I turned my warrior off. Will those tasks be lost?

If you've killed your warrior instances, the work your warrior did has been lost. However, the tasks will be returned to the pool after a period of time. If you want, you can alert the admins via IRC of what's happened, and they can clear the claims your username may have made. This isn't very important on most projects, though.

I need to disconnect my internet / reboot my PC, but I don't want to lose work.

If you pause/suspend the warrior instance, most projects will allow resuming of work in progress when you unsuspend the warrior instance.
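On VirtualBox, suspending can also be done from the command line. The following is a minimal sketch, assuming the default VM name is passed in; the function names are hypothetical. It uses savestate rather than pause, since a saved state is written to disk and so should survive a reboot of the host.

```shell
#!/bin/sh
# Sketch: save the warrior's state to disk, and start it again later.
suspend_warrior() {
    # Writes the VM state to disk and stops the VM.
    VBoxManage controlvm "$1" savestate
}

resume_warrior() {
    # Starting a VM that has a saved state resumes from that state.
    VBoxManage startvm "$1"
}
```

For example: suspend_warrior archiveteam-warrior-2 before rebooting, and resume_warrior archiveteam-warrior-2 afterwards.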

I told the warrior to shutdown from the interface but nothing has changed! What gives?

The warrior will attempt to finish the currently running tasks before shutting down. If you need to shut down right away, go ahead. Your progress will be lost, but the jobs will eventually cycle out to another user.

How much disk space will the warrior use?

Short answer: it depends on the project.

Long answer: because each project defines an item differently, the warrior may be downloading a small file or a whole subsection of a website. The virtual machine is configured by default to use 60GB as an absolute maximum. Any unused virtual machine disk space is not used on the host computer. You may, however, run the virtual machine on less than 60GB if you like to live dangerously. We're downloading the internet after all!

The secondary disk is using up space even though it's not running a project.

Virtual machine disk images do not behave like regular files: a dynamically allocated image grows as the guest writes data, but it does not shrink again when the guest deletes that data. There are several ways to reclaim space:

  • Delete the second disk and put back an empty disk. The warrior should reformat the second disk.
  • Delete the entire warrior appliance and re-import it.
  • Use the zerofree program and then clone the disk image. Reattach the cloned disk image.

I can't connect to localhost.

The appliance includes a configuration that sets up port forwarding to the guest machine on port 8001 so you can access the interface through your web browser. If this does not work, double-check the virtual machine's network settings.
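On VirtualBox, the forwarding rules can be inspected, and re-added if missing, from the command line. This is only a sketch: the function names and the rule name "warrior-web" are made up, and it assumes the warrior uses NAT on adapter 1 and that the web interface listens on guest port 8001. The modifyvm step needs the VM to be powered off.

```shell
#!/bin/sh
# Sketch: inspect and (re)create the NAT port-forwarding rule for the
# warrior's web interface. Assumes NAT on adapter 1 and guest port 8001.
show_forwarding() {
    # Prints lines such as Forwarding(0)="name,tcp,hostip,hostport,,guestport".
    # Prints nothing if no rules are configured.
    VBoxManage showvminfo "$1" --machinereadable | grep '^Forwarding'
}

add_web_forwarding() {
    # Rule format: name,protocol,host IP,host port,guest IP,guest port.
    # The VM must be powered off for modifyvm to succeed.
    VBoxManage modifyvm "$1" --natpf1 "warrior-web,tcp,127.0.0.1,8001,,8001"
}
```

Run show_forwarding archiveteam-warrior-2 first; if nothing is printed, add_web_forwarding archiveteam-warrior-2 adds a rule for http://localhost:8001/.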

The warrior can't connect to the internet.

The virtual machine may have picked up the address of the local DNS cache on your computer, which the virtual machine cannot reach.

If you experience this on VirtualBox, see this question and answer.

I'm looking at the text scrolling by and I notice some errors. rsync is not working.

Uh-oh! Something is not right. Notify us immediately in the appropriate IRC channel.

I'm looking at the leaderboard. What's that icon beside the username?

That's just the warrior logo. It means that the person is using the warrior. Those without the icon are running the scripts manually.

What's that guy doing in the logo?

The place is on fire! But don't worry, he safely escaped with the rescued data in his arms.

I want to log in to the virtual machine. How do I do this?

Unless you know what you are doing, you should not need to do this. But if you want to, the username is root and the password is archiveteam. Then, you can execute sudo -u warrior -i to log in as the warrior user.

Press ALT+F3 to switch to virtual console number 3. Use ALT+Left or ALT+Right to switch between virtual consoles. There are 6 virtual consoles in total. Consoles 1 and 2 are reserved for the warrior.

The warrior seems to have too much overhead. I can't run a VM in a VPS!

You don't need to run a virtual machine. If you are managing a VPS, it's likely you are comfortable with some Linux stuff. Projects can be run manually. Consult the project wiki page or the source code repository readme file.

Why a virtual machine in the first place?

The virtual machine is a quick, safe, and easy way for newcomers to help us out. It offers many features:

  • Graphical interface
  • Automatically selects which project is important to run
  • Self-updating software infrastructure
  • Allows for unattended use
  • In case of software faults, your machine is not ruined
  • Restarts itself in case of runaway programs
  • Runs on Windows, Mac, and Linux painlessly
  • Ensures consistency in the archived data regardless of your machine's quirks

If you have suggestions for improving this system, please talk to us as described below.

I'm running the scripts manually in a VPS but it says the code is out of date a while later

It happens when a bug in the scripts is discovered. Bugs are unavoidable especially when the server is out of our control.

If you are good with scripting, try scripting run-pipeline with --max-items N and git pull in a loop. Or better yet, help us code an auto-updating-outside-the-VM feature.
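A rough sketch of such a loop is shown below. It is not an official script. It assumes you run it from inside the project's git checkout, that the pipeline file is called pipeline.py, and that you pass your own nickname; the function name, the STOP_LOOP file, and the RETRY_DELAY variable are hypothetical conveniences added here.

```shell
#!/bin/sh
# Sketch: pull the latest project code, run a bounded batch of items, repeat.
# Assumptions (not from the FAQ): the current directory is the project's git
# checkout, the pipeline file is pipeline.py, and creating a file named
# STOP_LOOP ends the loop once the current batch has finished.
update_and_run() {
    nick="$1"        # your leaderboard name
    max_items="$2"   # items per batch before checking for new code

    while [ ! -e STOP_LOOP ]; do
        # Fetch any script fixes before starting the next batch.
        git pull || return 1
        # If the pipeline exits with an error, wait before retrying
        # instead of spinning in a tight loop.
        run-pipeline pipeline.py --max-items "$max_items" "$nick" ||
            sleep "${RETRY_DELAY:-60}"
    done
}
```

For example, update_and_run yournick 10 updates the code after every 10 items. A small batch size picks up fixes sooner; a larger one restarts the pipeline less often.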

I just imported the ova image and the warrior is stuck on "Preparing the data partition"

This issue has cropped up before and we do not know what causes it. It is recommended to just delete the warrior image and import the ova again. Testing shows that such a reimport works in the majority of cases.

Why is the default project not working? / Why is a manual project not in the Warrior yet?

Sorry. Sometimes the administrators are too busy...

Why are there no projects?

If there are no projects showing, you can help us write one. No projects does not mean there is nothing left to archive!

Where can I file a bug or a feature request?

If the issue is related to the warrior's web interface or the library that grab scripts are using, see seesaw-kit issues. Other issues should be filed into their own repositories.

I still have a question!

Talk to us on IRC. Use #warrior for specific warrior questions or #archiveteam for general questions.

Projects

Previous and current warrior projects:

Project | Status | Began | Finished | Result | Archive Location
MobileMe | Archive Posted | April 3, 2012 | Aug 8, 2012 | Success | archive index user lookup
FortuneCity | Archive Posted | April 4, 2012 | April 11, 2012 | Partial Success | archive user lookup
Tabblo | Archive Posted | May 23, 2012 | May 26, 2012 | Success | archive user lookup
Picplz | Archive Posted | June 3, 2012 | June 15, 2012 | | archive index user lookup
Tumblr (test project) | Archive Posted | August 9, 2012 | August 19, 2012 | | archive (tar) archive (warc)
Cinch.FM | Archive Posted | August 20, 2012 | August 22, 2012 | Success | archive
City of Heroes | Archive Posted | September 3, 2012 | December 1, 2012 | Success | www forums 1 2 3 4 5
Webshots | Archive Posted | October 4, 2012 | November 18, 2012 | | index
BT Internet | Archive Posted | October 10, 2012 | November 2, 2012 | Success | archive
Daily Booth | Archive Posted | November 19, 2012 | December 29, 2012 | | archive lookup
GitHub Downloads | Archive Posted | December 13, 2012 | December 17, 2012 | Success | archive index
Yahoo! Blog | Archive Posted | January 8, 2013 | January 19, 2013 | | archive
weblog.nl | Archive Posted | January 19, 2013 | February 2, 2013 | | archive lookup
URLTeam | Active | | | | all releases
Punchfork | Archive Posted | January 11, 2013 | March 6, 2013 | | archive user lookup
Xanga | Downloads Paused | January 22, 2013 | February 16, 2013 | | archive user lookup user list
Posterous | Archive Posted | February 23, 2013 | June 29, 2013 | | archive
Storylane | Downloads Finished | March 8, 2013 | March 15, 2013 | |
Yahoo! Messages | Archive Posted | March 20, 2013 | March 31, 2013 | | archive
Formspring | Archive Posted | March 24, 2013 | September 19, 2013 | Success | archive
Yahoo Upcoming | Archive Posted | April 20, 2013 | April 25, 2013 | | archive
Streetfiles.org | Archive Posted | April 28, 2013 | April 30, 2013 | Partial | archive
Xanga | Downloads Paused | June 21, 2013 | August 31, 2013 | | archive
Zapd | Archive Posted | October 1, 2013 | October 8, 2013 | Success | archive
Blip.tv | Hiatus | October 11, 2013 | | |
Hyves | Archives Posted | November 10, 2013 | December 2, 2013 | Success | archive
Wretch & Yahoo! Blog | Archives Posted | December 17, 2013 | January 9, 2014 | Partial | wretch Yahoo Blog
Dogster | Archives Posted | February 7, 2014 | February 16, 2014 | Success | archive
My Opera | Archives Posted | February 16, 2014 | March 3, 2014 | Success | archive
Bebo | Hiatus | February 18, 2014 | | | archive
Viddler | Cancelled | February 21, 2014 | February 27, 2014 | Qualified Success |
Justin.tv | Archives Posted | June 5, 2014 | June 15, 2014 | Success | archive
Yahoo! Voices | Archives Posted | July 28, 2014 | July 31, 2014 | Success | archive
Fotopedia | Downloads Finished | August 5, 2014 | August 7, 2014 | Success |
Twitch.tv | Active | August 9, 2014 | | |
Canv.as | Active | August 11, 2014 | | |

Status

In Development
a future project
Active
start up a Warrior and join the fun; this one is in progress right now
Downloads Finished
we've finished downloading the data
Archived
the collected data has been properly archived
Archive Posted
the archive is available for download

Result

Success
downloaded all of the data and posted the archive publicly
Qualified Success
either we couldn't get all of the data, or the archive can't be made public
Failure
the site closed before we could download anything

Are you a coder?

Like the warrior? Interested in how it works under the hood? Got software skills? Help us improve it!