Talk:Ex.ua

From Archiveteam
Revision as of 13:14, 10 December 2016 by I336 (talk | contribs)

I (i336_) am adding this here so I can provide updated information in a stream-of-consciousness format without needing to think too much about how the text is presented. (This saves me the extra effort I'd otherwise need to spend formatting this with a refined writing style.)

Anybody is welcome to transcribe this data onto the page itself. Once there's nothing new here this can be emptied.

Anything I have to share will be put here; I'll use this page as a scratchpad. I also tend to treat IRC as micro-Twitter when I'm nervous, so you _probably_ won't need to ask me what's new. :P


r_view

My first archive target was the http://ex.ua/filelist/{i}.xspf URL pattern because it's what ex.ua advertised when the site was functional, it's what I learned, and it's what I added to my little downloader shell script.

After paging through the full set of results crawled by the Wayback Machine to see what I could find, I started poking through the site's JavaScript code and discovered r_view.

The XSPF approach should be discarded, as r_view is superior in every way. Unlike XSPF, it includes the author of the post (critical!!), the md5sum and size of every file (verifiable downloads! yay!), and the text description associated with the post, which would otherwise be lost.

Use this pattern from now on: http://ex.ua/r_view/{i}

Example: http://www.ex.ua/r_view/94350260
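A quick sketch of what a bruteforced r_view scan might look like. Only the URL pattern comes from the notes above; the helper names are my own invention:

```python
def r_view_url(post_id):
    """Return the r_view URL for a numeric post ID (pattern from the notes)."""
    return "http://ex.ua/r_view/%d" % post_id

def r_view_targets(start_id, end_id):
    """Yield r_view URLs for a contiguous ID range (bruteforce scan)."""
    for post_id in range(start_id, end_id + 1):
        yield r_view_url(post_id)
```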


RSS

I remembered RSS was a thing while browsing some HTML. Example: http://www.ex.ua/rss/17247363

Unfortunately it only returns a handful of results.

But I discovered this! http://www.ex.ua/rss/17247363?count=100

100 is the max.

This is important, as it is the only way I have found to archive lists of *collections*. (Using r_view or .xspf shows you an empty object with a picture in it. They clearly weren't built to handle this datatype.)
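The RSS fetch URL with its apparent 100-item cap can be sketched like this (the clamp reflects the observed server-side maximum; the function name is mine):

```python
def rss_url(folder_id, count=100):
    """Build an RSS URL for a folder/collection ID.

    100 appears to be the server-side maximum for count, so clamp to it.
    """
    return "http://www.ex.ua/rss/%d?count=%d" % (folder_id, min(count, 100))
```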


Files, folders and "collections"

Files are /get/... URLs.

Folders are lists of files.

Collections are groups of folders. I'm using the term "collections" because I don't know what else to call them.


RSS vs r_view

As I said before, the r_view/XSPF method doesn't seem to handle certain folders, while RSS does.

An example:

r_view: http://www.ex.ua/r_view/17247363

XSPF: http://www.ex.ua/filelist/17247363.xspf (sorry, this link triggers a download)

RSS: http://www.ex.ua/rss/17247363?count=100

Only RSS can list collections. However it only seems to be able to return the first 100; I have not found any pagination options.

NOTE: Notice how in the r_view URL the file_upload_id tag's preview parameter had the same filename as the picture tag's URL parameter? Please don't quote me on this, but this might be a consistent way to detect collections.
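A sketch of that heuristic. NOTE: the tag and attribute names here are guesses based on the observation above (I'm inventing a plausible response shape); verify against real r_view responses before relying on this:

```python
import xml.etree.ElementTree as ET
from os.path import basename

def looks_like_collection(r_view_xml):
    """Heuristic from the note above: if the file_upload_id tag's preview
    filename matches the picture tag's url filename, the ID may be a
    collection rather than a plain folder. Tag/attribute names are guesses."""
    root = ET.fromstring(r_view_xml)
    upload = root.find(".//file_upload_id")
    picture = root.find(".//picture")
    if upload is None or picture is None:
        return False
    preview_name = basename(upload.get("preview", ""))
    picture_name = basename(picture.get("url", ""))
    return preview_name != "" and preview_name == picture_name

# Toy example with an invented response shape:
sample = ('<r_view><file_upload_id preview="/load/abc.jpg"/>'
          '<picture url="/show/abc.jpg"/></r_view>')
```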


r_search_hint

Found this first (also hiding in the JS). It's cute but not useful. http://ex.ua/r_search_hint?s=flac


r_search

I lopped _hint off the above URL experimentally, and raised the roof in IRC when it worked.

Using this we can search for anything we want! It returns a text list of matching IDs.

This is not especially useful for automated archiving, but critical for manual prioritization.

After trying a bunch of URLs I found count worked with RSS, so I tried it with this URL as well. count can go up to 500 here :D
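The search URL builder might look like this (the s and count parameters are from the notes above; I'm assuming count is clamped server-side at 500, so the sketch clamps it client-side too):

```python
import urllib.parse

def r_search_url(query, count=500):
    """Build an r_search URL; count appears to max out at 500 here
    (vs. 100 for RSS), so clamp to that."""
    return "http://ex.ua/r_search?s=%s&count=%d" % (
        urllib.parse.quote_plus(query), min(count, 500))
```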


user_say

One of the few HTML links that still work. Examples are http://www.ex.ua/user_say?login=888_888 and http://www.ex.ua/user_say?login=big-bud.

You can specify p to select the page number, zero-indexed. (So p=1 is page two.)

You can specify per to set the number of items per page; 200 is the max. (Use a consistent `per` across all searches, as the site doesn't use an after=...-style cursor.) The website (i.e. in a browser) seems to remember the `per` setting you use, FWIW.
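Paging through a user's posts with a fixed `per` could be sketched like this (the login/p/per parameters are from the notes; the page-count argument is a placeholder since I don't know how to query the total):

```python
def user_say_pages(login, pages, per=200):
    """Yield user_say URLs for the first `pages` pages of a user's posts.

    p is zero-indexed and per maxes out at 200; keep per constant across
    a crawl, since the site has no cursor-style pagination.
    """
    for p in range(pages):
        yield "http://www.ex.ua/user_say?login=%s&p=%d&per=%d" % (
            login, p, min(per, 200))
```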


user_recommend

This is similar to user_say except it seems to be for things a user likes. It appears to be possible to switch this off, and some users have turned it off. I would consider this lower priority to crawl, if at all.


Recommended strategy

  • Use r_view with bruteforced IDs to fetch contents of folders. Consider also fetching the image that gets returned (nice-to-have, but not important).
  • Use userlist(s) with user_say to fetch archives of what users have written about. I understand this will include their comments and the files they have publicly released.
  • Check for an RSS feed on every folder ID that seems to be a collection (using the heuristics suggested above, along with your own ideas), or alternatively do RSS scans on every ID. I consider the RSS feeds a nice-to-have extra rather than part of the core crawling operation: collections simply group folders together, you're getting the folder list anyway, and we don't yet know how to return more than the 100 most recent RSS items.
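The strategy above could be assembled into a flat fetch list like this. This is only a sketch under stated assumptions: the URL patterns are from the notes, but the pages-per-user count is a placeholder, since I don't know how to query the real total:

```python
def crawl_plan(id_range, logins, per=200, pages_per_user=5):
    """Build a flat URL list for the recommended strategy: r_view for
    every ID in the bruteforce range, then user_say pages for each
    known login. pages_per_user is a placeholder, not a real limit."""
    urls = ["http://ex.ua/r_view/%d" % i for i in id_range]
    for login in logins:
        for p in range(pages_per_user):
            urls.append("http://www.ex.ua/user_say?login=%s&p=%d&per=%d"
                        % (login, p, per))
    return urls
```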