Difference between revisions of "Mozilla Addons"

From Archiveteam
(A first attempt to document the site structure a bit)
(9 intermediate revisions by 2 users not shown)
Line 3: Line 3:
| URL = https://addons.mozilla.org/
| URL = https://addons.mozilla.org/
| project_status = {{specialcase}}
| project_status = {{specialcase}}
| archiving_status = {{saved}} <small>(addon files)</small><br />{{upcoming}} <small>(website, warrior)</small><br />{{inprogress}} <small>(website, JAA)</small>
| archiving_status = {{saved}} <small>(addon files)</small><br />{{saved}} <small>(website)</small>
| irc = outofammo
| irc = outofammo
| image = Amo_screenshot_2018-08-22.png
| image = Amo_screenshot_2018-08-22.png
| lead = [[User:Arkiver]], [[User:JustAnotherArchivist]]
| lead = [[User:JustAnotherArchivist]]
}}
}}


'''Mozilla Addons''', also known as '''AMO''' (from its domain, addons.mozilla.org), is a website run by the Mozilla Foundation which hosts extensions and themes for Firefox, Thunderbird, and other Mozilla software.
'''Mozilla Addons''', also known as '''AMO''' (from its domain, addons.mozilla.org), is a website run by the Mozilla Foundation which hosts extensions and themes for Firefox, Thunderbird, and other Mozilla software.


Extensions used to be based on XPI until the introduction of WebExtensions around 2016. Since Firefox 57 and Thunderbird 58, only WebExtensions are supported. XPI-based addons (called "legacy") are deprecated but still supported until the end-of-life of Firefox 52 ESR in September 2018. The legacy addons will be removed from AMO in early October 2018<ref>https://blog.mozilla.org/addons/2017/10/03/legacy-add-on-support-on-firefox-esr/#comment-224382</ref><ref>https://blog.mozilla.org/addons/2018/08/21/timeline-for-disabling-legacy-firefox-add-ons/</ref>.
Extensions used to be based on XPI until the introduction of WebExtensions around 2016. Since Firefox 57, only WebExtensions are supported. XPI-based addons (called "legacy") were deprecated but still supported until the end-of-life of Firefox 52 ESR in September 2018. The legacy addons were planned to be removed from AMO in early October 2018<ref>https://blog.mozilla.org/addons/2017/10/03/legacy-add-on-support-on-firefox-esr/#comment-224382</ref><ref>https://blog.mozilla.org/addons/2018/08/21/timeline-for-disabling-legacy-firefox-add-ons/</ref>.


== Website structure ==
== Website structure ==
Line 20: Line 20:
To track addon installations, AMO uses a <code>src</code> parameter everywhere on the site. There are ''at least'' 59 possible values for this parameter<ref>https://addons-server.readthedocs.io/en/latest/topics/api/download_sources.html</ref>.
To track addon installations, AMO uses a <code>src</code> parameter everywhere on the site. There are ''at least'' 59 possible values for this parameter<ref>https://addons-server.readthedocs.io/en/latest/topics/api/download_sources.html</ref>.


Addon download links have the general format <code>https://addons.mozilla.org/firefox/downloads/file/$FILEID/$FILENAME?src=$SRC</code>. Note that file IDs are separate from addon and version IDs. The filename typically contains the slug and a version identifier. When AMO detects that you're using a version of Firefox that is incompatible with an addon, it displays a "download anyway" link, which in additiona contains a <code>type:attachment</code> path segment between the file ID and the filename (i.e. <code>.../file/$FILEID/type:attachment/$FILENAME...</code>). All download URLs redirect to a CDN at addons.cdn.mozilla.net; the <code>type:attachment</code> is also reflected in that CDN URL as <code>_attachments</code> (which then inserts a <code>Content-Disposition</code> header); the <code>src</code> parameter is not included in the redirect target.
Addon download links have the general format <code>https://addons.mozilla.org/firefox/downloads/file/$FILEID/$FILENAME?src=$SRC</code>. Note that file IDs are separate from addon and version IDs. The filename typically contains the slug and a version identifier. When AMO detects that you're using a version of Firefox that is incompatible with an addon, it displays a "download anyway" link, which in addition contains a <code>type:attachment</code> path segment between the file ID and the filename (i.e. <code>.../file/$FILEID/type:attachment/$FILENAME...</code>). All download URLs redirect to a CDN at addons.cdn.mozilla.net; the <code>type:attachment</code> is also reflected in that CDN URL as <code>_attachments</code> (which then inserts a <code>Content-Disposition</code> header); the <code>src</code> parameter is not included in the redirect target.


Besides the actual addon files, AMO also hosts preview screenshots, reviews, version history (including changelogs), statistics, and in some cases additional pages (e.g. privacy policy) for each addon. The review page only displays the most recent review of any particular user, and one needs to follow an extra link to discover a user's earlier reviews for the same addon.
Besides the actual addon files, AMO also hosts preview screenshots, reviews, version history (including changelogs), statistics, and in some cases additional pages (e.g. privacy policy) for each addon. The review page only displays the most recent review of any particular user, and one needs to follow an extra link to discover a user's earlier reviews for the same addon.
Line 33: Line 33:


== Archival ==
== Archival ==
* There were two attempts to archive AMO through [[ArchiveBot]]. One ran from 2017-08-29 until early December 2017, the other was started on 2018-07-29 and vanished sometime in August 2018.
* There were two (proper) attempts to archive AMO through [[ArchiveBot]]. {{Job|4aa66jgox1pg1gp6gxzkgthiq}} ran from 2017-08-29 until early December 2017, and {{Job|xew9sjj59osltx5oyjr6n9rg}} was started on 2018-07-29 and vanished sometime in August 2018.
* All addon files (both from AMO for Firefox/Firefox Android and from addons.thunderbird.net for Thunderbird/Seamonkey) were downloaded by [[User:JustAnotherArchivist]] between 2018-09-14 and 2018-09-16.
* All addon files (both from AMO for Firefox/Firefox Android and from addons.thunderbird.net for Thunderbird/Seamonkey) were downloaded by [[User:JustAnotherArchivist]] between 2018-09-14 and 2018-09-16.
* The amo-links-getter list linked above is being downloaded through [[ArchiveBot]] as {{Job|akifc65k7kfhpdhfbveh79v1c}} (started on 2018-09-30).
* The amo-links-getter list linked above was downloaded through [[ArchiveBot]] as {{Job|akifc65k7kfhpdhfbveh79v1c}} (started on 2018-09-30, finished on 2018-10-07).
* The old, "classic desktop" AMO website – minus downloads and <code>src</code> parameter variations, but including version history, reviews, and API data – is being grabbed by [[User:JustAnotherArchivist]] since 2018-09-30.
* The old, "classic desktop" AMO website was grabbed by [[User:JustAnotherArchivist]] in October/November 2018.
* A warrior project for the website is in preparation.
** The website – minus downloads and <code>src</code> parameter variations, but including version history, reviews, and API data – was grabbed between 2018-09-30 and 2018-10-13 (see [[#JustAnotherArchivist.27s_website_grab.2C_part_1|below]] for details).
** The <code>src</code> parameter variations and downloads as well as addon collections were grabbed between 2018-10-15 and 2018-10-20 (see [[#JustAnotherArchivist.27s_website_grab.2C_part_2|below]] for details).
** A wpull grab of the skeleton of the old website (with some special handling of the locale variations in the URLs) was done between 2018-10-15 and 2018-10-19. "Skeleton" here means the categories, tags, etc.; the addons themselves as well as user profiles are excluded.
*** Specifically, case variations of <code>/en-US/</code> are normalised to this capitalisation. There is some bug in AMO which leads to links using <code>/en-us/</code>, <code>/eN-uS/</code>, etc. Unfortunately, this means that some links will be broken, but that's unavoidable without retrieving the entire site 16 times...
*** Any URLs with a path starting with <code>/en-US/(firefox|android)/(addon|user)/</code> or <code>/(firefox|android)/downloads/</code> as well as all locales other than en-US (af, ar, bg, ...) are ignored.
*** In the [https://addons.mozilla.org/en-US/firefox/search/ search], combinations between the filters on the left or with the sorting are ignored.
** All of this data can be found on the Internet Archive at {{IA item|addons.mozilla.org_legacy_201810}}.
* A warrior project for the website was in the works ([https://github.com/ArchiveTeam/firefox-addons-grab repository]) but never active.
 
=== JustAnotherArchivist's website grab, part 1 ===
General notes:
* Any URL starting with <code>https://addons.mozilla.org/en-US/firefox/addon/ADDONID/</code> redirects to a URL using the slug instead. Only the <code>ADDONID</code> URLs are listed below for brevity, but of course the redirect target with the slug was also grabbed in all cases.
* For all API resources, both the v3 and the v4 versions were retrieved, but only the v3 URL is given below for brevity. Unless otherwise noted, you can simply replace <code>v3</code> with <code>v4</code> in those URLs to get the v4 URL.
 
For all addon IDs between 0 and 1009999 (largest existing ID as of 2018-10-13 is 1003947), these URLs are covered:
* addon detail API endpoint (<code>https://services.addons.mozilla.org/api/v3/addons/addon/ADDONID/</code>)
* addon page (<code>https://addons.mozilla.org/en-US/firefox/addon/ADDONID/</code>)
** This URL may redirect to addons.thunderbird.net for Thunderbird addons. In that case, all redirects on addons.mozilla.org are kept, but the addons.thunderbird.net page itself is not grabbed, and the addon is ignored.
** If this URL returns a 404 or another error (e.g. disabled addon), the addon is ignored.
* the "more" subpage which is loaded through JavaScript (<code>https://addons.mozilla.org/en-US/firefox/addon/ADDONID/more</code>, must be requested with the header <code>X-Requested-With: XMLHttpRequest</code>)
* the addon-specific images, i.e. icons (in both resolutions, 32x32 px and 64x64 px) and preview images (thumbnail and full resolution), extracted from both the page and the API response (just to be sure)
* addon detail API endpoint with the slug and/or the GUID instead of the addon ID if possible (i.e. if the slug and/or GUID could be determined)
* version history
** initial page (<code>https://addons.mozilla.org/en-US/firefox/addon/ADDONID/versions/</code>)
** pagination (<code>https://addons.mozilla.org/en-US/firefox/addon/ADDONID/versions/?page=N</code>; page=1 always retrieved even if there is no pagination)
** API endpoint (<code>https://services.addons.mozilla.org/api/v3/addons/addon/ADDONID/versions/</code> and <code>https://services.addons.mozilla.org/api/v3/addons/addon/ADDONID/versions/?page=1</code> + all following pages until the <code>next</code> field is empty/null)
* versions
** API endpoint for each version (<code>https://services.addons.mozilla.org/api/v3/addons/addon/ADDONID/versions/VERSIONID/</code>, where the version IDs were collected from the API history pagination)
** page redirect for each version (<code>https://addons.mozilla.org/en-US/firefox/addon/SLUG/versions/VERSIONSTRING</code>, collected during the pagination traversal on the website)
* reviews/ratings
** initial page + pagination as described above for the version history (<code>https://addons.mozilla.org/en-US/firefox/addon/ADDONID/reviews/[?page=N]</code>)
** API endpoint including further pages according to the <code>next</code> field (<code>https://services.addons.mozilla.org/api/v3/reviews/review/?addon=ADDONID</code> and <code>https://services.addons.mozilla.org/api/v4/ratings/rating/?addon=ADDONID</code>)
** API endpoint for each version of the addon + further pages according to <code>next</code> (<code>https://services.addons.mozilla.org/api/v3/reviews/review/?addon=ADDONID&version=VERSIONID</code>)
** individual review page (<code>https://addons.mozilla.org/en-US/firefox/addon/ADDONID/reviews/REVIEWID/</code>)
** individual review API endpoint (<code>https://services.addons.mozilla.org/api/v3/reviews/review/REVIEWID/</code> and <code>https://services.addons.mozilla.org/api/v4/ratings/rating/REVIEWID/</code>)
** page(s) for users who wrote multiple reviews for an addon (<code>https://addons.mozilla.org/en-US/firefox/addon/ADDONID/reviews/user:USERID</code>; also pagination with <code>?page=N</code> if available, though that doesn't seem to be the case anywhere)
* statistics
** page (<code>https://addons.mozilla.org/en-US/firefox/addon/ADDONID/statistics/</code>)
** data (<code>https://addons.mozilla.org/en-US/firefox/addon/SLUG/statistics/DATASET-day-YEAR0101-YEAR1231.json</code>)
*** Here, <code>DATASET</code> was each of <code>('overview', 'apps', 'locales', 'os', 'versions', 'statuses', 'sources', 'downloads')</code>, and <code>YEAR</code> started from 2018 and went back until the returned data was empty.
* any other subpage of the addon which is linked on the addon page and starts with <code>https://addons.mozilla.org/en-US/firefox/addon/ADDONID/</code> or <code>https://addons.mozilla.org/en-US/firefox/addon/SLUG/</code>, e.g. privacy policy
* feature compatibility API endpoint (<code>https://services.addons.mozilla.org/api/v3/addons/addon/ADDONID/feature_compatibility/</code>)
* EULA and privacy policy API endpoint (<code>https://services.addons.mozilla.org/api/v3/addons/addon/ADDONID/eula_policy/</code>)
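
The two iteration patterns in the list above (following the API's <code>next</code> field, and walking the statistics datasets backwards year by year) can be sketched roughly as follows. This is only an illustration and not the code used for the grab: the function names, the injected <code>fetch</code> callables, and the <code>min_year</code> safety bound are assumptions, and stopping each dataset independently at its first empty year is a guess at the exact behaviour.

```python
from typing import Any, Callable, Iterator

DATASETS = ("overview", "apps", "locales", "os",
            "versions", "statuses", "sources", "downloads")


def walk_pages(first_url: str,
               fetch_json: Callable[[str], dict]) -> Iterator[dict]:
    """Yield API pages, following 'next' until it is empty/null."""
    url = first_url
    while url:
        page = fetch_json(url)   # hypothetical fetcher returning parsed JSON
        yield page
        url = page.get("next")


def stats_url(slug: str, dataset: str, year: int) -> str:
    """Build the per-year statistics URL in the format given above."""
    return ("https://addons.mozilla.org/en-US/firefox/addon/"
            f"{slug}/statistics/{dataset}-day-{year}0101-{year}1231.json")


def walk_stats(slug: str, fetch: Callable[[str], Any],
               start_year: int = 2018, min_year: int = 2000) -> Iterator[str]:
    """Yield every statistics URL requested for one addon.

    min_year is an assumed guard against looping forever; it is not
    part of the description above.
    """
    for dataset in DATASETS:
        for year in range(start_year, min_year - 1, -1):
            url = stats_url(slug, dataset, year)
            data = fetch(url)    # hypothetical fetcher; falsy means "empty"
            yield url            # the empty response was still requested
            if not data:
                break
```

Passing the fetchers in as arguments keeps the traversal logic separate from the HTTP client, so it can be exercised without touching the network.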
 
Furthermore, during the relevant stages above (addon page, "more", addon detail API endpoint, and reviews pages and API endpoints), usernames were extracted, and the user profiles were afterwards retrieved as well:
* user profile page using the username (<code>https://addons.mozilla.org/en-US/firefox/user/USERNAME/</code>)
* if it can be found on that page, the same thing with the user ID (<code>https://addons.mozilla.org/en-US/firefox/user/USERID/</code>; the abuse report button is used for extracting the user ID)
* avatar if provided (somewhere under <code>https://addons.cdn.mozilla.net/user-media/userpics/</code>)
* pagination for reviews, if necessary (<code>https://addons.mozilla.org/en-US/firefox/user/USERNAME/?page=N</code> and <code>https://addons.mozilla.org/en-US/firefox/user/USERID/?page=N</code>)
 
=== JustAnotherArchivist's website grab, part 2 ===
This grab covers the variations of the <code>src</code> URL parameter on the addon page and the downloads themselves with that parameter. It again operates on addon IDs. It also covers collections.
 
==== src variations and downloads ====
* For each addon ID, the grab first checks whether the addon needs to be processed in this way. This could've been integrated into part 1, but it's tricky and time-consuming to do these checks after the fact, so we simply re-retrieve the API addon detail endpoint. Nonexistent and theme addons are skipped; note that themes do not use the <code>src</code> tracking parameter since their installation works very differently and there are no downloadable files either, so everything below is unnecessary for them.
* For each variation of <code>src</code>, <code>https://addons.mozilla.org/en-US/firefox/addon/ADDONID/?src=SRC</code> is retrieved. <code>SRC</code> is empty or one of the 58 values listed [https://addons-server.readthedocs.io/en/latest/topics/api/download_sources.html in the documentation] with the exception of <code>collection</code> and <code>version-history</code>; the former is handled below, and the latter is only used on the version history page but not on links to the addon page. (<code>version-history</code> is implicitly handled below.)
* The version history page(s) are retrieved as described in part 1: <code>https://addons.mozilla.org/en-US/firefox/addon/ADDONID/versions/[?page=N]</code>
* From all of the above pages, download links are collected. There are a few different formats:
** <code>https://addons.mozilla.org/firefox/downloads/latest/SLUG/addon-ADDONID-latest.EXT?src=SRC</code> – this is used by the install button at the top of the addon page and also on other pages (e.g. category listings).
** <code>https://addons.mozilla.org/firefox/downloads/file/FILEID/FILE.EXT?src=SRC</code> – this appears in the version information at the bottom of the addon page and in the version history.
** For both of these formats, there are also URLs containing a <code>type:attachment</code> path segment. These are "download anyway" links for when a browser is incompatible with an addon version.
** All four URLs are actually redirects to the CDN; the <code>src</code> parameter is fortunately not passed on to the CDN, so only two requests to the CDN (for the presence and absence of <code>type:attachment</code>) are necessary. The file is identical in both cases; the only difference is a <code>Content-Disposition</code> header to force a download.
 
==== Collections ====
Collection retrieval operates on users and is based on the users discovered in part 1 (i.e. covers all addon developers and reviewers).
 
* The list of collections by a user is retrieved: <code>https://addons.mozilla.org/en-US/firefox/collections/USERNAME/[?page=N]</code>
* Each collection: <code>https://addons.mozilla.org/en-US/firefox/collections/USERNAME/COLLSLUG/[?page=N]</code>
* Each addon page linked from the collection and containing a <code>src</code> parameter is retrieved; this covers URLs such as <code>https://addons.mozilla.org/en-US/firefox/addon/decentraleyes/?src=collection&collection_id=4a02c848-8be7-44ff-bc1c-f1c2d8dddf86</code> from [https://addons.mozilla.org/en-US/firefox/collections/mozilla/privacy-matters/ this collection].
* For each download link appearing in either the collection or on the addon page, the redirect to the CDN is retrieved (but not followed).


== References ==
== References ==
<references/>
<references/>

Revision as of 16:42, 12 November 2018

Mozilla Addons
Amo screenshot 2018-08-22.png
URL https://addons.mozilla.org/
Status Special case
Archiving status Saved! (addon files)
Saved! (website)
Archiving type Unknown
IRC channel #outofammo (on hackint)
Project lead User:JustAnotherArchivist

Mozilla Addons, also known as AMO (from its domain, addons.mozilla.org), is a website run by the Mozilla Foundation which hosts extensions and themes for Firefox, Thunderbird, and other Mozilla software.

Extensions used to be based on XPI until the introduction of WebExtensions around 2016. Since Firefox 57, only WebExtensions are supported. XPI-based addons (called "legacy") were deprecated but still supported until the end-of-life of Firefox 52 ESR in September 2018. The legacy addons were planned to be removed from AMO in early October 2018[1][2].

Website structure

As of September 2018, there are two different versions of AMO: the old version, called "classic desktop" on the website, and a redesigned new site. The two mostly serve the same content; the most important difference is that the new site does not serve user profile pages for non-developers while the old site does. The switching between the two sites happens through a cookie called mamo (modern AMO?); when it is set to off, the old site is served; when it's on or unset, the new site is served.
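
As a rough illustration of the cookie switch described above, the following sketch builds (but does not send) a request that asks for one site version or the other. The helper name amo_request is made up for this example; only the cookie name mamo and its off/on values come from the description.

```python
from urllib.request import Request

AMO_ROOT = "https://addons.mozilla.org/"


def amo_request(url: str, classic: bool = True) -> Request:
    """Build a request that selects the old ("classic desktop") or new AMO site."""
    # mamo=off selects the old site; "on" (or no cookie at all) selects the new one.
    value = "off" if classic else "on"
    return Request(url, headers={"Cookie": f"mamo={value}"})


req = amo_request(AMO_ROOT)
print(req.get_header("Cookie"))
```

The request object can then be passed to urllib.request.urlopen, or the same header can be set in whichever HTTP client is actually used.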

AMO uses numeric IDs and slugs for addon identification. (GUIDs are also used, but only in the API and internally in Firefox.) These IDs are shared with Thunderbird and Seamonkey addons, which used to be hosted on AMO but have since been moved to addons.thunderbird.net (which only exists in the "old" form; there is a "view the new site" link in the footer, but it doesn't have any effect as of 2018-09-30).

To track addon installations, AMO uses a src parameter everywhere on the site. There are at least 59 possible values for this parameter[3].

Addon download links have the general format https://addons.mozilla.org/firefox/downloads/file/$FILEID/$FILENAME?src=$SRC. Note that file IDs are separate from addon and version IDs. The filename typically contains the slug and a version identifier. When AMO detects that you're using a version of Firefox that is incompatible with an addon, it displays a "download anyway" link, which in addition contains a type:attachment path segment between the file ID and the filename (i.e. .../file/$FILEID/type:attachment/$FILENAME...). All download URLs redirect to a CDN at addons.cdn.mozilla.net; the type:attachment is also reflected in that CDN URL as _attachments (which then inserts a Content-Disposition header); the src parameter is not included in the redirect target.
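
A minimal sketch of how such download URLs are put together, assuming nothing beyond the format given above. The function name, the file ID, the filename, and the src value in the usage note are placeholders.

```python
from urllib.parse import quote, urlencode

DOWNLOAD_BASE = "https://addons.mozilla.org/firefox/downloads/file"


def download_url(file_id: int, filename: str, src: str = "",
                 attachment: bool = False) -> str:
    """Build an AMO download URL in the general format described above."""
    parts = [DOWNLOAD_BASE, str(file_id)]
    if attachment:
        # "download anyway" links carry this extra path segment
        parts.append("type:attachment")
    parts.append(quote(filename))
    url = "/".join(parts)
    if src:
        url += "?" + urlencode({"src": src})
    return url
```

For example, download_url(123, "example-1.0.xpi", src="search", attachment=True) produces a URL with type:attachment between the file ID and the filename.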

Besides the actual addon files, AMO also hosts preview screenshots, reviews, version history (including changelogs), statistics, and in some cases additional pages (e.g. privacy policy) for each addon. The review page only displays the most recent review of any particular user, and one needs to follow an extra link to discover a user's earlier reviews for the same addon.

Note that AMO does not only host extensions but also themes. These consist simply of a JSON object which provides the URLs for the relevant images and some additional settings (e.g. text colour), i.e. there is no real download for them.

The AMO API versions 3 and 4 are documented here and here, respectively.

Utilities

  • amo-links-getter: Both Wget and the Warrior are ineffective at downloading the site completely; in addition, many redundant links are not recognised as redirects, so the same content gets downloaded several times. This set of scripts stores all the links in a SQLite database to be downloaded later.
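
The general idea of collecting links in SQLite before downloading them can be sketched as below. This is not amo-links-getter's actual schema or code; the table layout and function names are invented purely for illustration.

```python
import sqlite3
from typing import Iterable, List


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a link database with a hypothetical one-table schema."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        " url TEXT PRIMARY KEY,"
        " fetched INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def add_links(db: sqlite3.Connection, urls: Iterable[str]) -> None:
    """Store links; the primary key silently drops duplicates."""
    with db:  # commits on success, rolls back on error
        db.executemany(
            "INSERT OR IGNORE INTO links (url) VALUES (?)",
            [(u,) for u in urls],
        )


def pending(db: sqlite3.Connection) -> List[str]:
    """Return the links that still need to be downloaded."""
    rows = db.execute("SELECT url FROM links WHERE fetched = 0 ORDER BY url")
    return [row[0] for row in rows]
```

Using the URL as primary key means a link discovered many times is only queued once, which addresses the redundancy problem mentioned above at the storage level.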

Archival

  • There were two (proper) attempts to archive AMO through ArchiveBot. job:4aa66jgox1pg1gp6gxzkgthiq ran from 2017-08-29 until early December 2017, and job:xew9sjj59osltx5oyjr6n9rg was started on 2018-07-29 and vanished sometime in August 2018.
  • All addon files (both from AMO for Firefox/Firefox Android and from addons.thunderbird.net for Thunderbird/Seamonkey) were downloaded by User:JustAnotherArchivist between 2018-09-14 and 2018-09-16.
  • The amo-links-getter list linked above was downloaded through ArchiveBot as job:akifc65k7kfhpdhfbveh79v1c (started on 2018-09-30, finished on 2018-10-07).
  • The old, "classic desktop" AMO website was grabbed by User:JustAnotherArchivist in October/November 2018.
    • The website – minus downloads and src parameter variations, but including version history, reviews, and API data – was grabbed between 2018-09-30 and 2018-10-13 (see below for details).
    • The src parameter variations and downloads as well as addon collections were grabbed between 2018-10-15 and 2018-10-20 (see below for details).
    • A wpull grab of the skeleton of the old website (with some special handling of the locale variations in the URLs) was done between 2018-10-15 and 2018-10-19. "Skeleton" here means the categories, tags, etc.; the addons themselves as well as user profiles are excluded.
      • Specifically, case variations of /en-US/ are normalised to this capitalisation. There is some bug in AMO which leads to links using /en-us/, /eN-uS/, etc. Unfortunately, this means that some links will be broken, but that's unavoidable without retrieving the entire site 16 times...
      • Any URLs with a path starting with /en-US/(firefox|android)/(addon|user)/ or /(firefox|android)/downloads/ as well as all locales other than en-US (af, ar, bg, ...) are ignored.
      • In the search, combinations between the filters on the left or with the sorting are ignored.
    • All of this data can be found on the Internet Archive at addons.mozilla.org_legacy_201810.
  • A warrior project for the website was in the works (repository) but never active.
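
The URL handling of the wpull skeleton grab described above might look roughly like this. It is a sketch under assumptions rather than the actual wpull hook: the function names are made up, and the exclusion of locales other than en-US is left out because the full locale list is not given here. The figure of 16 follows from "en-us" having four letters, each of which can be upper or lower case (2^4 = 16).

```python
import re
from itertools import product
from urllib.parse import urlsplit, urlunsplit

# Any capitalisation of /en-us/ at the start of the path
LOCALE_RE = re.compile(r"^/en-us/", re.IGNORECASE)

# Paths excluded from the skeleton grab, as listed above
IGNORED_RE = re.compile(
    r"^/en-US/(firefox|android)/(addon|user)/"
    r"|^/(firefox|android)/downloads/"
)


def normalise(url: str) -> str:
    """Rewrite case variations of /en-US/ to the canonical capitalisation."""
    parts = urlsplit(url)
    path = LOCALE_RE.sub("/en-US/", parts.path)
    return urlunsplit(parts._replace(path=path))


def is_ignored(url: str) -> bool:
    """True if the (normalised) URL falls under one of the ignored prefixes."""
    return IGNORED_RE.search(urlsplit(normalise(url)).path) is not None


# All capitalisations of the four letters in "en-us"
variants = {f"{a}{b}-{c}{d}"
            for a, b, c, d in product(*[(ch.lower(), ch.upper()) for ch in "enus"])}
print(len(variants))  # 16
```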

JustAnotherArchivist's website grab, part 1

General notes:

  • Any URL starting with https://addons.mozilla.org/en-US/firefox/addon/ADDONID/ redirects to a URL using the slug instead. Only the ADDONID URLs are listed below for brevity, but of course the redirect target with the slug was also grabbed in all cases.
  • For all API resources, both the v3 and the v4 versions were retrieved, but only the v3 URL is given below for brevity. Unless otherwise noted, you can simply replace v3 with v4 in those URLs to get the v4 URL.
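
A minimal sketch of the v3-to-v4 substitution mentioned above. The function name is made up; the one special case handled here, reviews/review becoming ratings/rating, is taken from the review endpoints listed for this grab, and there may be other exceptions that this sketch does not know about.

```python
def v4_url(v3_url: str) -> str:
    """Derive the v4 API URL from a v3 one (sketch; exceptions may exist)."""
    url = v3_url.replace("/api/v3/", "/api/v4/", 1)
    # The review endpoints are named differently in v4.
    return url.replace("/reviews/review/", "/ratings/rating/", 1)


print(v4_url("https://services.addons.mozilla.org/api/v3/addons/addon/ADDONID/"))
```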

For all addon IDs between 0 and 1009999 (largest existing ID as of 2018-10-13 is 1003947), these URLs are covered:

Furthermore, during the relevant stages above (addon page, "more", addon detail API endpoint, and reviews pages and API endpoints), usernames were extracted, and the user profiles were afterwards retrieved as well:

JustAnotherArchivist's website grab, part 2

This grab covers the variations of the src URL parameter on the addon page and the downloads themselves with that parameter. It again operates on addon IDs. It also covers collections.

src variations and downloads

  • For each addon ID, the grab first checks whether the addon needs to be processed in this way. This could've been integrated into part 1, but it's tricky and time-consuming to do these checks after the fact, so we simply re-retrieve the API addon detail endpoint. Nonexistent and theme addons are skipped; note that themes do not use the src tracking parameter since their installation works very differently and there are no downloadable files either, so everything below is unnecessary for them.
  • For each variation of src, https://addons.mozilla.org/en-US/firefox/addon/ADDONID/?src=SRC is retrieved. SRC is empty or one of the 58 values listed in the documentation with the exception of collection and version-history; the former is handled below, and the latter is only used on the version history page but not on links to the addon page. (version-history is implicitly handled below.)
  • The version history page(s) are retrieved as described in part 1: https://addons.mozilla.org/en-US/firefox/addon/ADDONID/versions/[?page=N]
  • From all of the above pages, download links are collected. There are a few different formats:
    • https://addons.mozilla.org/firefox/downloads/latest/SLUG/addon-ADDONID-latest.EXT?src=SRC – this is used by the install button at the top of the addon page and also on other pages (e.g. category listings).
    • https://addons.mozilla.org/firefox/downloads/file/FILEID/FILE.EXT?src=SRC – this appears in the version information at the bottom of the addon page and in the version history.
    • For both of these formats, there are also URLs containing a type:attachment path segment. These are "download anyway" links for when a browser is incompatible with an addon version.
    • All four URLs are actually redirects to the CDN; the src parameter is fortunately not passed on to the CDN, so only two requests to the CDN (for the presence and absence of type:attachment) are necessary. The file is identical in both cases; the only difference is a Content-Disposition header to force a download.
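
Because the src parameter does not reach the CDN, many download links collapse onto very few distinct files. The following sketch shows one way to count them by stripping src from each link; the helper names and the sample URLs in the usage note are hypothetical, and the real CDN URL is only known from the redirect itself.

```python
from typing import Iterable, List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_src(url: str) -> str:
    """Remove the src tracking parameter, keeping any other query parameters."""
    parts = urlsplit(url)
    query = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "src"]
    return urlunsplit(parts._replace(query=urlencode(query)))


def distinct_targets(urls: Iterable[str]) -> List[str]:
    """Collapse download links that differ only in src."""
    return sorted({strip_src(u) for u in urls})
```

For one file linked with two src values, with and without type:attachment, distinct_targets returns two entries for the four links, matching the two CDN requests mentioned above.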

Collections

Collection retrieval operates on users and is based on the users discovered in part 1 (i.e. covers all addon developers and reviewers).

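References

  1. https://blog.mozilla.org/addons/2017/10/03/legacy-add-on-support-on-firefox-esr/#comment-224382
  2. https://blog.mozilla.org/addons/2018/08/21/timeline-for-disabling-legacy-firefox-add-ons/
  3. https://addons-server.readthedocs.io/en/latest/topics/api/download_sources.html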