Difference between revisions of "Yahoo! Groups"

From Archiveteam
Line 13: Line 13:
It's been stable for a long time (since the late 90s), long enough for some specialised software to be developed to do backups of it. (Not many other websites can say ''that''.)
It's been stable for a long time (since the late 90s), long enough for some specialised software to be developed to do backups of it. (Not many other websites can say ''that''.)


== Python Yahoo! Group Archiver ==  
== Statistics ==


The [https://github.com/csaftoiu/yahoo-groups-backup yahoo-groups-backup] is a Python script which allows a scraping of the group. So far only messages are scraped. It puts all the info and metadata (both rendered message body and raw email) into a Mongo database, and provides a script to dump a static version of the site that can be read off of the filesystem. It works with Neo and with private groups by clunkily using Selenium to do the scraping.
As of 2019-10-16 the [https://groups.yahoo.com/neo/dir directory] lists 5619351 groups. 2752112 of them have been discovered. 1483853 (54%) have public message archives with an estimated number of 2.1 billion messages (1389 messages per group on average so far). 1.8 billion messages (86%) have been archived as of 2018-10-28.


Another Python-based Archiver is [https://github.com/andrewferguson/YahooGroups-Archiver YahooGroups-Archiver], which is a simple Python script to dump the messages into individual JSON files. No further processing of the messages is done to preserve them in the format Yahoo uses for displaying them. Private groups can be archived by providing the contents of two cookies that Yahoo uses to verify a logged-in user.
The following graphs are slightly outdated:


Yet another Python-based Archiver is https://github.com/philpem/yahoo-group-archiver.
[[File:Yahoo_groups_date_created.png‎]]
[[File:Yahoo_groups_messages_per_group.png‎]]
[[File:Yahoo_groups_post_date.png‎]]


== Perl Yahoo! Group Archiver ==
== Private groups of interest ==


Update: Apparently since Yahoo! Groups changed to the neo interface the script no longer functions and is no longer actively maintained.
{| class="wikitable"
! Group
! Notes
! Admin contact attempted?
|-
| [https://groups.yahoo.com/neo/groups/numberactivation/info numberactivation]
| see [https://trendingpress.com/some-of-the-uks-phone-number-infrastructure-relies-on-yahoo-groups-which-is-shutting-down/ all] [https://reclaimthenet.org/ofcom-oftel-uk-phone-numbers-yahoo-groups/ the] [https://www.axios.com/yahoo-groups-ofcom-cell-phone-number-porting-51949f81-446e-4b4b-82eb-26790146e9a0.html press] [https://techupdatess.com/some-of-the-uks-phone-number-infrastructure-relies-on-yahoo-groups-the-verge/ coverage]
| Not yet; [https://www.whatdotheyknow.com/request/all_data_held_in_yahoo_groups_us FOI request] made
|-
| [https://groups.yahoo.com/neo/groups/hpslash/info hpslash]
| see [https://fanlore.org/wiki/Hpslash_%28mailing_list%29 Fanlore page]
| Not yet
|}


<s>The [http://sourceforge.net/projects/grabyahoogroup/ Yahoo Group Archiver] is a Perl script which allows an export of "the messages (without the attachments), everything from the files section and all the images from the photo section along with their hierarchy on Yahoo".
Potentially relevant: [https://fanlore.org/wiki/Category:Yahoo!_Groups List of groups with Fanlore pages] (contains both private and public groups)
 
It appears that, if you get the "Couldn't get message count" error when trying to use it, the solution is to edit the yahoo2maildir.pl file and replace the bottom line <code>my $url = $HTTP::URI_CLASS->new($redirect, $base)->abs($base);</code> (under the heading <code>sub GetJSRedirect</code>) with <code><nowiki>my $url = "http://groups.yahoo.com/group/$group/messages/$begin_msgid"; </nowiki></code>
 
More frustratingly, it appears that Yahoo blocks your IP temporarily after hitting some invisible limit of data downloaded (the Archiver will continue to "download" messages for a bit, ending up with a bunch of 0-byte files, then stop completely). It's unknown if there is a solution.


Also: sometimes, some of the downloaded messages, in the middle of an otherwise normal batch, are 0 in size - almost as if Yahoo blocked your IP for a few seconds, then stopped. Watch out for these so that you can re-download them later.</s>
== Site structure ==
 
== Site Structure ==


There’s a convenient JSON API. May require logging in and joining a group to use all endpoints:
There’s a convenient JSON API. May require logging in and joining a group to use all endpoints:
Line 61: Line 69:
Note that all paginated responses are limited to the first 500 results and do not return anything new beyond that.
Note that all paginated responses are limited to the first 500 results and do not return anything new beyond that.


== Statistics ==
== Python Yahoo! Group archivers ==  


As of 2019-10-16 the [https://groups.yahoo.com/neo/dir directory] lists 5619351 groups. 2752112 of them have been discovered. 1483853 (54%) have public message archives with an estimated number of 2.1 billion messages (1389 messages per group on average so far). 1.8 billion messages (86%) have been archived as of 2018-10-28.
* [https://github.com/IgnoredAmbience/yahoo-group-archiver/network/members yahoo-group-archiver] scrapes a group using the JSON API and (for private endpoints) the two cookies Yahoo uses to verify a logged-in user. Relevant forks include [https://github.com/Frankkkkk/yahoo-group-archiver Frankkkkk] and [https://github.com/nsapa/yahoo-group-archiver nsapa]. Needs merging. Various branches have support (largely untested) for file attachments, photos, links, folders, and events.
 
The following graphs are slightly outdated:


[[File:Yahoo_groups_date_created.png‎]]
* [https://github.com/andrewferguson/YahooGroups-Archiver YahooGroups-Archiver] is similar, but scrapes only messages (not files or any other data). It is not currently under active development.
[[File:Yahoo_groups_messages_per_group.png‎]]
[[File:Yahoo_groups_post_date.png‎]]


== Private groups of interest ==
* [https://github.com/csaftoiu/yahoo-groups-backup yahoo-groups-backup] scrapes a group using Selenium, storing message info and metadata (both rendered message body and raw email) into a Mongo database. It also provides a script to dump its data to static HTML pages that can be viewed in the browser.
 
* [https://groups.yahoo.com/neo/groups/numberactivation/info numberactivation]: see [https://trendingpress.com/some-of-the-uks-phone-number-infrastructure-relies-on-yahoo-groups-which-is-shutting-down/ all] [https://reclaimthenet.org/ofcom-oftel-uk-phone-numbers-yahoo-groups/ the] [https://www.axios.com/yahoo-groups-ofcom-cell-phone-number-porting-51949f81-446e-4b4b-82eb-26790146e9a0.html press] [https://techupdatess.com/some-of-the-uks-phone-number-infrastructure-relies-on-yahoo-groups-the-verge/ coverage]. A [https://www.whatdotheyknow.com/request/all_data_held_in_yahoo_groups_us FOI request] has been made to try and get the data.
 
* [https://groups.yahoo.com/neo/groups/hpslash/info hpslash]: see [https://fanlore.org/wiki/Hpslash_%28mailing_list%29 Fanlore page]
 
Potentially relevant: [https://fanlore.org/wiki/Category:Yahoo!_Groups List of groups with Fanlore pages] (contains both private and public groups)


== Software for backups ==
== Other archivers ==


* [http://sourceforge.net/projects/grabyahoogroup/ Yahoo Group Archiver], Sourceforge
* [http://sourceforge.net/projects/grabyahoogroup/ Yahoo Group Archiver]: Perl, defunct.
* Is there a Windows thing out there?


== External Links ==
== External Links ==

Revision as of 10:45, 19 October 2019

Yahoo! Groups
URL: http://groups.yahoo.com/
Status: Online!
Archiving status: In progress...
Archiving type: Unknown
IRC channel: #yahoosucks (on hackint)

Yahoo! Groups is Yahoo's mailing list and discussion group service; it's the result of the acquisition of eGroups and its merger with Yahoo! Clubs.

It's been stable for a long time (since the late 90s), long enough for some specialised software to be developed to do backups of it. (Not many other websites can say that.)

Statistics

As of 2019-10-16 the directory lists 5619351 groups. 2752112 of them have been discovered. 1483853 (54%) have public message archives with an estimated number of 2.1 billion messages (1389 messages per group on average so far). 1.8 billion messages (86%) have been archived as of 2018-10-28.
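
The percentages above can be re-derived from the raw counts. The following is a minimal sketch, not part of the original tracking tooling; the variable names are illustrative only. It assumes the 54% is relative to the discovered groups (not to all listed groups) and that the 86% is relative to the rounded 2.1 billion estimate, which is how the numbers appear to fit together.

```python
# Figures quoted in the paragraph above.
discovered_groups = 2_752_112
public_archive_groups = 1_483_853
avg_messages_per_group = 1_389
archived_messages = 1.8e9  # "1.8 billion", already rounded in the text

# Share of discovered groups that have public message archives.
public_share_pct = round(100 * public_archive_groups / discovered_groups)

# Estimated total messages: groups with public archives times the running average.
estimated_messages = public_archive_groups * avg_messages_per_group
estimated_billions = round(estimated_messages / 1e9, 1)

# Share of the (rounded) estimate that has been archived.
archived_share_pct = round(100 * archived_messages / (estimated_billions * 1e9))

print(f"{public_share_pct}% public, ~{estimated_billions} billion messages, "
      f"{archived_share_pct}% archived")
```

Because the average is computed "so far", the estimate will shift as more groups are discovered.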

The following graphs are slightly outdated:

[Graphs: groups by date created; messages per group; messages by post date]

Private groups of interest

{| class="wikitable"
! Group
! Notes
! Admin contact attempted?
|-
| numberactivation
| see all the press coverage
| Not yet; FOI request made
|-
| hpslash
| see Fanlore page
| Not yet
|}

Potentially relevant: List of groups with Fanlore pages (contains both private and public groups)

Site structure

There’s a convenient JSON API. Using all of its endpoints may require logging in and joining the group.

Note that all paginated responses are limited to the first 500 results and do not return anything new beyond that.
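
A scraper therefore has to stop paging once a request brings back nothing new, and should treat a result set that reaches the cap as possibly incomplete. The following is a minimal sketch of such a loop. The `fetch_page(start, count)` callable, the `"id"` key, and the page size are hypothetical stand-ins, since the real endpoints and their parameters are not described here.

```python
from typing import Callable, Dict, List, Tuple

RESULT_CAP = 500  # paginated responses stop yielding new results after this many


def collect_pages(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 100,
) -> Tuple[List[Dict], bool]:
    """Page through results until a request returns nothing new.

    fetch_page(start, count) is a hypothetical callable that returns a list
    of dicts, each carrying a unique "id" key (an assumption for this sketch).
    Returns the collected items and a flag saying whether the cap was reached,
    in which case more results may exist that the API will not hand out.
    """
    seen: Dict[object, Dict] = {}
    start = 0
    while True:
        items = fetch_page(start, page_size)
        new_items = [item for item in items if item["id"] not in seen]
        if not new_items:
            break  # empty page or only repeats: nothing more to get
        for item in new_items:
            seen[item["id"]] = item
        start += len(items)
    collected = list(seen.values())
    return collected, len(collected) >= RESULT_CAP
```

When the returned flag is true, the listing should be recorded as truncated rather than complete.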

Python Yahoo! Group archivers

  • yahoo-group-archiver scrapes a group using the JSON API and (for private endpoints) the two cookies Yahoo uses to verify a logged-in user. Relevant forks include Frankkkkk and nsapa. Needs merging. Various branches have support (largely untested) for file attachments, photos, links, folders, and events.
  • YahooGroups-Archiver is similar, but scrapes only messages (not files or any other data). It is not currently under active development.
  • yahoo-groups-backup scrapes a group using Selenium, storing message info and metadata (both rendered message body and raw email) into a Mongo database. It also provides a script to dump its data to static HTML pages that can be viewed in the browser.
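
For the cookie-based archivers, the logged-in session is carried by sending the two cookie values with every request. The following is a rough sketch of that idea using only the standard library. The cookie names `T` and `Y` are an assumption (the text above does not name them), the function name is hypothetical, and the URL is left to the caller because the endpoints are not listed here.

```python
import urllib.request


def build_authenticated_request(url: str, cookie_t: str, cookie_y: str) -> urllib.request.Request:
    """Build a request that carries the two login cookies.

    Assumption: the cookies are named "T" and "Y". The values would be copied
    from a browser session that is logged in and has joined the group.
    """
    cookie_header = f"T={cookie_t}; Y={cookie_y}"
    return urllib.request.Request(url, headers={"Cookie": cookie_header})


# Usage (no network access happens until the request is opened):
# req = build_authenticated_request(api_url, t_value, y_value)
# with urllib.request.urlopen(req) as resp:
#     body = resp.read()
```

Treat the cookie values like passwords: they grant access to the account until the session expires.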

Other archivers

  • Yahoo Group Archiver: Perl, defunct.
