Template:CTA URL lists

From Archiveteam

Options:

  • regex, required: the PCRE regular expression to use for filtering. It is wrapped in single quotes in the example grep command.
    • Technically, this parameter can be omitted, but doing so is intended only for use on URLs.
  • broad, optional: if non-empty, adds a note that the regex is intentionally broad.

Example:

{{CTA URL lists|regex = <nowiki>\S*(foo|bar)\S*</nowiki>|broad = yes}}

renders as:

How to help if you have lists of URLs

For other ArchiveTeam projects that can use this kind of help, see Projects requiring URL lists.

This project requires lists of URLs for content on the target website. If you have a source of URLs, please:

  1. Use the PCRE regular expression \S*(foo|bar)\S* for filtering.
    • Note that this regex is intentionally broad to cover many different URL formats. Please do not try to use a more narrow pattern, as it may miss valid URLs. We can always filter or transform the results as needed later.
    • Enable case-insensitive matching (e.g. grep's -i) to catch URLs with capitalization.
    • If using grep or similar, enable text matching (-a or --text) to catch URLs in files with apparent binary data.
    • Example command (GNU grep): grep -Pahoi '\S*(foo|bar)\S*' FILENAME FILENAME...
  2. If the output exceeds a few megabytes, compress it, preferably using zstd -10.
  3. Give the file a descriptive name and upload it to https://transfer.archivete.am/.
  4. Share the resulting URL in the project IRC channel.
    • If you wish your list to remain private, please get in touch with a channel op (e.g. arkiver or JustAnotherArchivist). Items generated from your list will still be processed publicly, but they will be mixed in with all other items and channel logs will not associate them with you.
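
Where GNU grep is not available, the filtering in step 1 can be approximated in Python. The following is a minimal sketch, not part of the template or any ArchiveTeam tooling. It assumes the example pattern \S*(foo|bar)\S* from the template call above, and the names PATTERN, extract_matches and main are hypothetical. Python's re module is not PCRE, so a project's regex may need adjusting; the simple example pattern should behave the same way. Matching on bytes stands in for grep's -a, re.IGNORECASE for -i, and emitting only the matched text for -o.

```python
import re
import sys

# Example pattern from the template call; substitute the project's own regex.
# A bytes pattern lets files containing binary data be scanned (like grep -a),
# and IGNORECASE stands in for grep -i.
PATTERN = re.compile(rb"\S*(?:foo|bar)\S*", re.IGNORECASE)


def extract_matches(data: bytes) -> list[bytes]:
    """Return only the matched text, one entry per match (like grep -o)."""
    return [m.group(0) for m in PATTERN.finditer(data)]


def main(paths: list[str]) -> None:
    # Read each file line by line in binary mode and print every match
    # on its own line, without file name prefixes (like grep -h).
    for path in paths:
        with open(path, "rb") as handle:
            for line in handle:
                for match in extract_matches(line):
                    sys.stdout.buffer.write(match + b"\n")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved under a name of your choice, the script takes the input files as arguments and writes matches to standard output, which can then be redirected to a file and compressed as described in step 2.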