Difference between revisions of "Talk:Ex.ua"

From Archiveteam

Revision as of 13:03, 10 December 2016

I (i336_) am adding this here so I can provide updated information in a stream-of-consciousness format and not need to think too much about the way the text is presented. (This saves me the extra effort I'd otherwise need to format this w/ a refined writing style.)

Anybody is welcome to transcribe this data onto the page itself. Once there's nothing new here this can be emptied.

If I have anything to share it will be here - I will endeavor to use this as a scratchpad to share in. I also tend to treat IRC as micro-twitter when I'm nervous, so you _probably_ won't generally need to ask me what's new. :P


r_view

My first archive target was the `http://ex.ua/filelist/{i}.xspf` URL pattern because it's what ex.ua advertised when the site was functional, it's what I learned, and it's what I added to my little downloader shell script.

After paging through the full set of results crawled by the Wayback Machine to see what I could find, I started poking through the site's JavaScript code and discovered `r_view`.

The XSPF approach should be discarded, as r_view is superior in every way. Unlike the XSPF, it includes the author of the post (critical!!), the md5sum and size of every file (verifiable downloads! yay!), and the text description associated with the post, which would otherwise have been lost.

Use this from now on. `http://ex.ua/r_view/{i}`

Example: http://www.ex.ua/r_view/94350260
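A minimal sketch of generating these URLs for a range of IDs, in Python. The helper names (`r_view_url`, `r_view_urls`) and the default `www.ex.ua` host are my own choices for illustration; only the `/r_view/{i}` pattern comes from the notes above.

```python
def r_view_url(item_id: int, host: str = "www.ex.ua") -> str:
    """Build the r_view URL for one numeric ID (hypothetical helper)."""
    if item_id < 0:
        raise ValueError("item_id must be non-negative")
    return f"http://{host}/r_view/{item_id}"


def r_view_urls(first_id: int, last_id: int):
    """Yield r_view URLs for every ID from first_id to last_id inclusive."""
    for item_id in range(first_id, last_id + 1):
        yield r_view_url(item_id)


if __name__ == "__main__":
    for url in r_view_urls(94350260, 94350262):
        print(url)
```

This only builds the URLs; fetching, rate limiting and storage are left to whatever downloader is in use.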


RSS

I remembered RSS was a thing while browsing some HTML. Example: http://www.ex.ua/rss/17247363

Unfortunately it only returns a handful of results.

But I discovered this! http://www.ex.ua/rss/17247363?count=100

100 is the max.

This is important, as it is the only way I have found to archive lists of *collections*. (Using r_view or .xspf shows you an empty object with a picture in it. They clearly weren't built to handle this datatype.)
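A small Python sketch of building the RSS URL with `count` pinned to the observed maximum. The function name `rss_url` and the clamping behaviour are assumptions of mine; the only facts taken from above are the `/rss/{id}` path, the `count` parameter, and the limit of 100.

```python
from urllib.parse import urlencode

RSS_MAX_COUNT = 100  # largest value that was observed to work


def rss_url(item_id: int, count: int = RSS_MAX_COUNT) -> str:
    """Build an RSS URL for an ID, clamping count into 1..RSS_MAX_COUNT."""
    count = max(1, min(count, RSS_MAX_COUNT))
    return f"http://www.ex.ua/rss/{item_id}?" + urlencode({"count": count})


if __name__ == "__main__":
    print(rss_url(17247363))
```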


Files, folders and "collections"

Files are /get/... URLs.

Folders are lists of files.

Collections are groups of folders. I'm using the term "collections" because I don't know what else to call them.


RSS vs r_view

As I said before, the r_view/XSPF method doesn't seem to handle certain folders (the collections described above), while RSS does.

An example:

r_view: http://www.ex.ua/r_view/17247363

XSPF: http://www.ex.ua/filelist/17247363.xspf

RSS: http://www.ex.ua/rss/17247363?count=100

Only RSS can list collections. However, it only seems to be able to return the first 100 items; I have not found any pagination options.

NOTE: Notice how, in the r_view output for the example above, the `file_upload_id` tag's `preview` parameter has the same filename as the `picture` tag's `URL` parameter? Please don't quote me on this, but this might be a consistent way to detect collections.
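A rough Python sketch of that heuristic, very much unverified. It assumes the r_view response parses as XML and that `file_upload_id` and `picture` are elements carrying `preview` and `URL` attributes; the exact document structure is a guess, and the sample strings in the usage section are made up purely to exercise the function.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def _filename(value):
    """Return the last path component of a URL or path, or None if empty."""
    if not value:
        return None
    return posixpath.basename(urlparse(value).path) or None


def looks_like_collection(r_view_xml: str) -> bool:
    """Guess whether an r_view response describes a collection.

    Assumption: returns True when the `preview` attribute of the first
    `file_upload_id` element names the same file as the `URL` attribute
    of the first `picture` element.
    """
    root = ET.fromstring(r_view_xml)
    upload = next(root.iter("file_upload_id"), None)
    picture = next(root.iter("picture"), None)
    if upload is None or picture is None:
        return False
    preview_name = _filename(upload.get("preview"))
    picture_name = _filename(picture.get("URL"))
    return preview_name is not None and preview_name == picture_name


if __name__ == "__main__":
    # Made-up sample, NOT real ex.ua output.
    sample = (
        '<object>'
        '<picture URL="http://example.invalid/show/1/cover.jpg"/>'
        '<file_upload_id preview="cover.jpg"/>'
        '</object>'
    )
    print(looks_like_collection(sample))
```

If the real responses use different element or attribute names, only the two `root.iter(...)` lookups and the two `.get(...)` calls should need changing.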


r_search_hint

Found this first (also hiding in the JS). It's cute but not useful. `http://ex.ua/r_search_hint?s=flac`


r_search

I lopped `_hint` off the above URL experimentally, and raised the roof in IRC when it worked. Using this we can search for anything we want! It returns a list of matching IDs.

This is not useful for automated archiving, but critical for manual prioritization.

After trying a bunch of URLs I found `count` worked with RSS, so I tried it with this URL as well. `count` can go up to 500 here :D
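For manual prioritization, a search URL can be put together like this in Python. The `s` parameter is carried over from the `r_search_hint` example and `count` from the note above; the function name `r_search_url` and the clamping to 500 are my own assumptions.

```python
from urllib.parse import urlencode

SEARCH_MAX_COUNT = 500  # largest value that was observed to work


def r_search_url(query: str, count: int = SEARCH_MAX_COUNT) -> str:
    """Build an r_search URL, clamping count into 1..SEARCH_MAX_COUNT."""
    count = max(1, min(count, SEARCH_MAX_COUNT))
    return "http://ex.ua/r_search?" + urlencode({"s": query, "count": count})


if __name__ == "__main__":
    print(r_search_url("flac"))
```

`urlencode` takes care of escaping spaces and non-ASCII characters in the search term.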


user_say

One of the few HTML links that still work. Examples are `http://www.ex.ua/user_say?login=888_888` and `http://www.ex.ua/user_say?login=big_bud`.

You can specify `p` to select the page number, zero-indexed. (So `p=1` is page two.)

You can specify `per` to set the number of items per page; 200 is the max. (Use a consistent `per` for every request: the site doesn't use an "after=..."-type system, so changing `per` shifts which items each `p` refers to.) The website (i.e. in a browser) seems to remember the `per` setting you use, FWIW.
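A Python sketch of paging through one user's `user_say` listing with a fixed `per`. The function name `user_say_page_urls` is hypothetical, and the number of pages to request has to be supplied by the caller, since I haven't noted any way to learn the total up front.

```python
from urllib.parse import urlencode

USER_SAY_MAX_PER = 200  # largest page size that was observed to work


def user_say_page_urls(login: str, pages: int, per: int = USER_SAY_MAX_PER):
    """Yield user_say URLs for pages 0..pages-1 (p is zero-indexed).

    The same `per` is used for every page so that page boundaries stay
    stable across requests.
    """
    per = max(1, min(per, USER_SAY_MAX_PER))
    for p in range(pages):
        query = urlencode({"login": login, "p": p, "per": per})
        yield f"http://www.ex.ua/user_say?{query}"


if __name__ == "__main__":
    for url in user_say_page_urls("big_bud", 3):
        print(url)
```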


user_recommend

This is similar to user_say except it seems to be for things a user likes. It appears to be possible to switch this off, and some users have turned it off. I would consider this lower priority to crawl, if at all.


Recommended strategy

  1. Use `r_view` with bruteforced IDs to fetch contents of folders. Consider also fetching the image that gets returned (nice-to-have, but not important).
  2. Use userlist(s) with `user_say` to fetch archives of what users have written about. I understand this will include their comments and the files they have publicly released.
  3. Check for an RSS feed on every folder ID that seems to be a collection (going off the heuristic suggested above, along with your own ideas), or alternatively do RSS scans on every ID. I consider the RSS feeds a nice-to-have "extra" rather than part of the core crawling operation: collections simply provide a way to group folders together, you're getting the folder list anyway, and we don't yet know how to return more than the 100 most recent RSS items, so the feeds are not particularly comprehensive.
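The three steps above could be turned into an ordered work queue along these lines. This is only a Python sketch: `crawl_plan` and its arguments are hypothetical, the ID range, user list and collection IDs all have to come from elsewhere, and nothing is actually fetched.

```python
from urllib.parse import quote


def crawl_plan(id_range, logins, collection_ids, pages_per_user=1):
    """Yield (kind, url) pairs in the priority order suggested above."""
    # Step 1: r_view for every bruteforced ID.
    for item_id in id_range:
        yield ("r_view", f"http://www.ex.ua/r_view/{item_id}")
    # Step 2: user_say pages for every known login, fixed per=200.
    for login in logins:
        for p in range(pages_per_user):
            yield (
                "user_say",
                f"http://www.ex.ua/user_say?login={quote(login)}&p={p}&per=200",
            )
    # Step 3 (nice-to-have): RSS for IDs that look like collections.
    for item_id in collection_ids:
        yield ("rss", f"http://www.ex.ua/rss/{item_id}?count=100")


if __name__ == "__main__":
    plan = crawl_plan(range(17247363, 17247365), ["big_bud"], [17247363])
    for kind, url in plan:
        print(kind, url)
```

Keeping the plan as a generator means a huge ID range never has to be held in memory at once.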