PANDA

From Archiveteam

PANDA is a tool for dumping a website from a sitemap, a URL shortener, or any website whose content is ordered (alpha)numerically.

PANDA itself is divided into three executables:

  1. PANDA-DL, which dumps all URLs listed in a text file, one per line (separated by Unix newlines, "\n").
  2. PANDA-SH (incomplete), which dumps URL shortening websites and other websites where URLs are numerically (or alphabetically) ordered (e.g. "00000" to "zzzzz").
  3. PANDA-SP, which parses a sitemap and outputs a list of URLs usable by PANDA-DL.

The first two are only compatible with Linux and (maybe) macOS; if you want to run them on Windows, you'll have to use Cygwin or WSL.
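The kind of ordered identifier space that PANDA-SH targets ("00000" to "zzzzz") can be illustrated with a short sketch. This is not PANDA-SH's code, which is incomplete and not shown on this page; the alphabet (digits followed by lowercase letters), the function names and the base URL are all assumptions made for illustration.

```python
import itertools
import string

# Assumed alphabet: digits, then lowercase letters, so that the first
# identifier of length 5 is "00000" and the last one is "zzzzz".
ALPHABET = string.digits + string.ascii_lowercase


def candidate_ids(length, alphabet=ALPHABET):
    """Yield every identifier of the given length, in alphabet order."""
    for chars in itertools.product(alphabet, repeat=length):
        yield "".join(chars)


def candidate_urls(base_url, length):
    """Yield base_url followed by each identifier.

    base_url is a hypothetical placeholder, not a real shortener.
    """
    for ident in candidate_ids(length):
        yield base_url + ident
```

With this alphabet, five characters give 36**5 = 60,466,176 identifiers, so the generator is lazy rather than building the whole list in memory. The resulting URLs could be written one per line to a file for PANDA-DL.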

Usage

PANDA-DL

To use PANDA-DL you must specify a file containing full URLs and (optionally) a number representing how many URLs to process at the same time.

Once started, the program dumps the URLs in parallel, 16 at a time by default.

When dumping a URL, PANDA-DL uses the "wget" command and downloads both a WARC file and the regular file.
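The behaviour described above can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not PANDA-DL's actual implementation: the function names and the "dump-NNNNNN" WARC naming scheme are hypothetical. The only wget feature relied on is the --warc-file option, which takes a filename prefix and writes a WARC alongside the normally downloaded file.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def parse_url_lines(text):
    """Split newline-separated text into URLs, skipping blank lines."""
    return [line.strip() for line in text.split("\n") if line.strip()]


def build_wget_command(url, index):
    """Build a wget command that saves the page and a WARC record of it.

    The "dump-NNNNNN" prefix is an assumed naming scheme; wget appends
    ".warc.gz" to the prefix given to --warc-file.
    """
    return ["wget", "--warc-file=dump-%06d" % index, url]


def dump_all(urls, workers=16):
    """Run one wget per URL, at most `workers` at a time.

    Returns the wget exit codes in the same order as `urls`.
    """
    def fetch(item):
        index, url = item
        result = subprocess.run(build_wget_command(url, index), check=False)
        return result.returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, enumerate(urls)))
```

Calling dump_all(parse_url_lines(open("urls.txt").read()), workers=16) would process a hypothetical urls.txt with sixteen downloads in flight. Threads are sufficient here because each worker spends its time waiting on an external wget process.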

Bugs

As of Version A, PANDA-DL supports at most 1,000,000 lines per input file.

PANDA-SP

PANDA-SP is written in Python; to use it, specify one or more XML Sitemap files, and PANDA-SP will print all the URLs they contain.
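For readers unfamiliar with the sitemap format, the extraction step can be sketched with the Python standard library. This is an illustration rather than PANDA-SP's own source: the function names are hypothetical, and the sketch assumes the files use the standard sitemaps.org 0.9 namespace.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol, in ElementTree's
# "{uri}tag" notation.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_data):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_data)
    urls = []
    for loc in root.iter(SITEMAP_NS + "loc"):
        text = (loc.text or "").strip()
        if text:
            urls.append(text)
    return urls


def print_sitemaps(paths):
    """Print the URLs from each sitemap file, one per line."""
    for path in paths:
        with open(path, "rb") as handle:
            for url in sitemap_urls(handle.read()):
                print(url)
```

Because the output is one URL per line, it can be redirected to a file and passed to PANDA-DL. Note that a sitemap index file also uses <loc> elements, but there they point to further sitemaps rather than to pages, so this sketch would list those sitemap addresses as-is.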