Difference between revisions of "Template:CTA URL lists"

From Archiveteam
(Clarify regex type and add comment on case-insensitive matching)
(move category link to hat position; tighten up prose)
Line 1: Line 1:
<includeonly>== How to help if you have lists of URLs ==
<includeonly>== How to help if you have lists of URLs ==
: ''For other ArchiveTeam projects that can use this kind of help, see [[:Category:Projects requiring URL lists|Projects requiring URL lists]].''
This project requires lists of URLs for content on the target website. If you have a source of URLs, please:
This project requires lists of URLs for content on the target website. If you have a source of URLs, please:
{{ #if: {{{regex|}}} |
{{ #if: {{{regex|}}} |
# Use the PCRE regular expression <code>{{{regex}}}</code> for filtering.{{ #if: {{{broad|}}} |
# Use the PCRE regular expression <code>{{{regex}}}</code> for filtering.{{ #if: {{{broad|}}} |
#* Note that this regex is intentionally broad to cover many different URL formats. Please do not try to use a more narrow pattern as it may miss valid URLs. We can always filter or transform the results as needed later.}}
#* Note that this regex is intentionally broad to cover many different URL formats. Please do not try to use a more narrow pattern, as it may miss valid URLs. We can always filter or transform the results as needed later.}}
#* Enable case-insensitive matching (e.g. <code>-i</code> option on <code>grep</code>) to not miss URLs with capitalised domains or similar.
#* Enable case-insensitive matching (e.g. grep's <code>-i</code>) to catch URLs with capitalization.
#* If you use <code>grep</code>, remember to include the <code>-a</code> (aka <code>--text</code> on GNU grep) option to ensure it will continue searching for matches when encountering binary data.
#* If using grep or similar, enable text matching (<code>-a</code> or <code>--text</code>) to catch URLs in files with apparent binary data.
#* Example command (GNU grep): <code>grep -Pahoi '{{{regex}}}' FILENAME FILENAME...</code>}}
#* Example command (GNU grep): <code>grep -Pahoi '{{{regex}}}' FILENAME FILENAME...</code>}}
# If the {{ #if: {{{regex|}}} | output | list }} exceeds a few megabytes, please compress it, preferably using <code>zstd -10</code>.
# If the {{ #if: {{{regex|}}} | output | list }} exceeds a few megabytes, please compress it, preferably using <code>zstd -10</code>.
# Upload the file to https://transfer.archivete.am/.
# Upload the file to https://transfer.archivete.am/.
# Share the resulting URL in the project IRC channel.
# Share the resulting URL in the project IRC channel.
#* If you would like to keep the list non-public instead, e.g. for privacy reasons or for not wanting to be publicly associated with it, please get in touch with a channel op (e.g. [[User:Arkiver|arkiver]] or [[User:JustAnotherArchivist|JustAnotherArchivist]]). Note that the items generated from your list would still be processed publicly, of course, but they would be mixed with everything else.
#* If you wish your list to remain private, please get in touch with a channel op (e.g. [[User:Arkiver|arkiver]] or [[User:JustAnotherArchivist|JustAnotherArchivist]]). Items generated from your list will still be processed publicly, but they will be mixed in with all other items and channel logs will not associate them with you.{{ #if: {{{suppresscategory|}}} ||[[Category:Projects requiring URL lists]]}}</includeonly><noinclude>
 
See also [[:Category:Projects requiring URL lists]] for other ArchiveTeam projects that necessitate URL lists.{{ #if: {{{suppresscategory|}}} ||[[Category:Projects requiring URL lists]]}}</includeonly><noinclude>
Options:
Options:



Revision as of 00:06, 4 January 2024

Options:

  • regex, required: the PCRE regular expression to use for filtering. It is wrapped in single quotes in the example grep command, so it must not contain a single quote itself.
    • Technically, this is not strictly required: if it is empty, the filtering step is omitted and the instructions refer to the "list" rather than the "output".
  • broad, optional: if non-empty, adds a note that the regex is intentionally broad.
  • suppresscategory, optional: if non-empty, the page is not added to Category:Projects requiring URL lists.

Example:

{{CTA URL lists|regex = <nowiki>\S*(foo|bar)\S*</nowiki>|broad = yes}}

renders as:

How to help if you have lists of URLs

For other ArchiveTeam projects that can use this kind of help, see Projects requiring URL lists.

This project requires lists of URLs for content on the target website. If you have a source of URLs, please:

  1. Use the PCRE regular expression \S*(foo|bar)\S* for filtering.
    • Note that this regex is intentionally broad to cover many different URL formats. Please do not try to use a more narrow pattern, as it may miss valid URLs. We can always filter or transform the results as needed later.
    • Enable case-insensitive matching (e.g. grep's -i) to catch URLs with capitalization.
    • If using grep or similar, enable text matching (-a or --text) to catch URLs in files with apparent binary data.
    • Example command (GNU grep): grep -Pahoi '\S*(foo|bar)\S*' FILENAME FILENAME...
  2. If the output exceeds a few megabytes, please compress it, preferably using zstd -10.
  3. Upload the file to https://transfer.archivete.am/.
  4. Share the resulting URL in the project IRC channel.
    • If you wish your list to remain private, please get in touch with a channel op (e.g. arkiver or JustAnotherArchivist). Items generated from your list will still be processed publicly, but they will be mixed in with all other items and channel logs will not associate them with you.
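For contributors without GNU grep, the filtering in step 1 can be approximated in a few lines of Python. This is only a sketch of what `grep -Pahoi` does, under some assumptions: the function name `extract_matches` is hypothetical, and Python's `re` module is not PCRE, so a project regex that relies on PCRE-only syntax may need adjusting. Files are read as bytes so that apparent binary data does not stop the search, and matching is case-insensitive.

```python
import re
from typing import Iterable, Iterator


def extract_matches(paths: Iterable[str], pattern: str) -> Iterator[bytes]:
    """Yield every non-empty match of `pattern` found in the given files.

    Hypothetical helper that roughly mirrors `grep -Pahoi PATTERN FILE...`.
    """
    # A bytes pattern never fails on undecodable data (the role of grep's -a);
    # re.IGNORECASE covers ASCII letters only (the role of grep's -i).
    regex = re.compile(pattern.encode("utf-8"), re.IGNORECASE)
    for path in paths:
        with open(path, "rb") as handle:
            # Iterating a binary file yields chunks split on b"\n",
            # which approximates grep's line-by-line scan.
            for line in handle:
                for match in regex.finditer(line.rstrip(b"\n")):
                    # Only the matched text is emitted, without the file
                    # name (the roles of -o and -h); empty matches are skipped.
                    if match.group(0):
                        yield match.group(0)
```

Writing each yielded match followed by a newline to a file opened in binary mode gives a list that can then be compressed and uploaded as described in steps 2 and 3.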