Dev/New Project

Starting a new project is a giant leap into getting things done.

Remember that usability is very important in archives. Grabbing the API endpoints and web pages that the website itself uses internally will make the archives easier to browse. Check how playback will look in the Wayback Machine; most notably, support for any request method other than GET is flaky at best.

Website Structure

Take a good look at how the website is structured:

  • Is everything hosted under one domain name?
  • Is there a throttling system?
  • How can I discover usernames or page IDs?
  • Is there an API?
  • Is there a sitemap.xml?
  • Can I guess URLs by incrementing a value?
  • Does disabling cookies or using specific cookies affect anything?
  • Does the website break if you make special requests?
  • Can you Google site:example.com for some URLs?
    • Hint: site:example.com inurl:show_thread
  • Is it a video? Try get-flash-videos
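If the site does publish a sitemap.xml, its URLs can be pulled out with a few lines of code. The following is a minimal sketch, assuming the sitemap follows the standard sitemaps.org schema; the function name `extract_sitemap_urls` and the sample URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample; a real sitemap would be downloaded from the site.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/show_thread?id=1</loc></url>
  <url><loc>https://example.com/show_thread?id=2</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE):
        print(url)
```

A sitemap index lists further sitemaps in its own loc elements, so the same function can be applied again to each document it points to.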

JavaScript

JavaScript is a pain.

  • Check to see if there's a noscript or mobile version.
  • Use a web inspector to observe its behavior and simulate POST requests made by the scripts. (This is often much easier than trying to read the JavaScript yourself.)
  • Scrape URLs from JavaScript templates with regular expressions.
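As a rough illustration of the regular-expression approach, the sketch below pulls quoted absolute URLs and root-relative paths out of a piece of JavaScript. The pattern, the function name `scrape_urls`, and the sample script are assumptions for this example; real sites usually need a pattern tuned to their own templates.

```python
import re

# Matches http(s) URLs and root-relative paths inside quoted JS strings.
URL_IN_STRING = re.compile(r"""["'](https?://[^"'\s]+|/[^"'\s]*)["']""")


def scrape_urls(js_source):
    """Return unique URL-like string literals, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_IN_STRING.finditer(js_source):
        url = match.group(1)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical script fragment, not taken from any real website.
SAMPLE_JS = '''
var api = "/api/v1/thread/" + threadId;
var avatar = 'https://static.example.com/avatars/' + user + '.png';
fetch("/api/v1/thread/" + threadId);
'''

if __name__ == "__main__":
    for url in scrape_urls(SAMPLE_JS):
        print(url)
```

The scraped strings are often only URL prefixes; the remaining parts (an ID or username) still have to be filled in from the item being grabbed.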

Static Assets

Websites sometimes do not host static media such as images and stylesheets under their primary domain name. Be sure to take those into consideration.

If there are a lot of assets, or these assets need to be deduplicated, you may want to queue them as separate items on the tracker side. To do this, you can use the 'backfeed', an API endpoint that lets you queue items, which are then deduplicated with a Bloom filter.
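To give a feel for how Bloom-filter deduplication behaves, here is a small self-contained sketch. It is not the backfeed's actual implementation or API; the class, the sizes, and the `queue_new_items` helper are all hypothetical and exist only to show the idea.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def queue_new_items(candidates, seen):
    """Return the candidates not seen before, marking them as seen."""
    fresh = []
    for item in candidates:
        if item not in seen:
            seen.add(item)
            fresh.append(item)
    return fresh


if __name__ == "__main__":
    seen = BloomFilter()
    print(queue_new_items(["asset:a.jpg", "asset:b.jpg", "asset:a.jpg"], seen))
```

The trade-off is that a Bloom filter uses a fixed amount of memory regardless of item name length, at the cost of occasionally treating a new item as already seen.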

IP Address Bans & Throttling

Find out if there is IP address banning. Use a sacrificial IP address if you need to. Many providers charge by the hour, allowing you to make such tests for a tiny cost.

Items

See also: Dev/Seesaw#Quick Definitions

Once you determine the website structure, you need to determine how to split the work into units efficiently, each identified by an item name. An item name is a short string describing the work unit, for example, a username. Some projects have multiple potential things that an item could represent; these are most commonly written as type:value. For example, user:foo, or asset:bar.jpg.

Because the Tracker uses Redis as its database, memory usage is a concern. The maximum number of items supported ranges from 5,000,000 to 10,000,000 depending on the item name length. Keep in mind that the todo and done queues are offloaded to disk, so memory usage for those is not as much of a problem.

  • If a user site is USERNAME.example.com, a good candidate is USERNAME.
    • Be careful of large subdomain sites.
  • If the content is by some numerical ID, consider whether ranges of IDs are appropriate.
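The type:value convention and the idea of ID ranges can be sketched as follows. The `thread` type, the `start-end` range syntax, and all function names are assumptions made for this example, not a format the Tracker requires.

```python
def parse_item_name(item_name):
    """Split 'type:value' into (type, value), splitting on the first colon."""
    item_type, sep, value = item_name.partition(":")
    if not sep or not item_type or not value:
        raise ValueError(f"expected type:value, got {item_name!r}")
    return item_type, value


def id_range_items(first_id, last_id, chunk_size, item_type="thread"):
    """Yield item names like 'thread:1-100' covering first_id..last_id."""
    for start in range(first_id, last_id + 1, chunk_size):
        end = min(start + chunk_size - 1, last_id)
        yield f"{item_type}:{start}-{end}"


def expand_id_range(value):
    """Turn a 'start-end' value back into the individual IDs (inclusive)."""
    start, end = value.split("-")
    return range(int(start), int(end) + 1)


if __name__ == "__main__":
    for name in id_range_items(1, 250, 100):
        item_type, value = parse_item_name(name)
        print(name, "covers", len(expand_id_range(value)), "IDs")
```

Grouping IDs into ranges keeps the number of items well below the Tracker's limit: 250 IDs become 3 items here instead of 250.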

Call for Action

  • ProTip™: Get things done.

Wiki Page

Ensure there is documentation on this wiki about the project.

Include:

  • an overview of the website
  • the shutdown notice
  • "how to help" instructions
  • a (future) link to the archives

Writing Grab Scripts

If you do not have permission to create a repository under Archive Team's account, please ask on IRC. You can always create one on GitHub under your own user, and it can be transferred later (though make sure that the project is okayed before you put in the effort!).

For detailed information about what goes inside grab scripts, take a look at writing Seesaw scripts.

Tracker Access

If you do not have permission to access the Tracker, please see Tracker#People.

IRC Channel

Archive Team uses per-project IRC channels to reduce noise in the main channel. Each channel also serves as the project's technical support channel.

IRC channel names must be humorous.

  • If an employee of the website in danger appears on the channel, please do cooperate.

Project Management

Successful projects are a result of successful management. See Project Management for details.

Getting Attention

Many Twitter followers? Got connections? Become a loudmouth!

Otherwise, take initiative yourself and encourage other team members to take initiative.
