
I (i336_) am adding this here so I can provide updated information in a stream-of-consciousness format without needing to think too much about how the text is presented. (This saves me the extra time I'd otherwise spend polishing the writing.)

Anybody is welcome to transcribe this data onto the page itself. Once there's nothing new here this can be emptied.

Anything I have to share will be put here; I'll endeavor to use this as a shared scratchpad. I also tend to treat IRC as micro-Twitter when I'm nervous, so you ''probably'' won't need to ask me what's new. :P


NOTE - I have a 110k-line-long userlist. Ask me for the URL.


== rover.info ==

While sniffing around some Ubuntu packages I found references to an "ex.ua-uploader.pl"; when I googled it, one of the results was from rover.info, with an oddly similar URL structure to ex.ua.

Visiting http://rover.info/ showed me an all-but-blank page, but I could see it was running ex.ua's backend - the UI is identical.

So, I tried logging in with my ex.ua account, and it... worked. It accepted my ex.ua credentials, and sent me to rover.info/user/i336_.

Next, for the fun of it I tried accessing an object like http://rover.info/94350260 AND IT WORKED! rover.info DOES NOT REDIRECT!!!

The other big difference is that rover.info doesn't have user uploads hidden on user pages! So http://rover.info/user/Coco is much, much bigger than http://ex.ua/user/Coco :) and this will be very useful to save.

Another big difference is that with rover.info we can archive all of the conversations on the site. I thought the what.cd-like aspects of ex.ua were going to be lost; if we can crawl the site successfully, they won't be.

Note: something to consider - rover.info/search also says "Search is not possible", just like ex.ua does. I think a reasonable interpretation of this is that EX is not going to keep rover.info going past the 31st, and that the whole thing is going to go away. So the show isn't over yet. :P

An aside:

I found it amusing that unlike ex.ua (190.115.31.5 -> ddos-guard.net), rover.info (77.120.115.179 -> 179.115.120.77.colo.static.dcvolia.com) isn't using DDoS protection. Something to keep in mind.

Also, now we know what ISP EX are/were using. Volia must've gotten enough complaints to fill a book... something very, very interesting to keep in mind, especially considering their cheapest offering ("Intel Atom D510/1GB/250GB + UNLIM 100 Mbit/s") is $15.31/mo USD...!


== Logging in ==

With a username of abc and a password of def,

 $ curl 'http://rover.info/login' --data 'login=abc&password=def&flag_not_ip_assign=1' -vvvv

The other parameter, flag_not_ip_assign, corresponds to the "without recognition of IP" option on the homepage. This probably makes no difference to the DDoS protection systems, but hey... why not ;) although if you think doing this would look suspicious, then maybe not.

You get a ukey cookie back. If you implement a standard keep-everything cookie jar, that's fine, but ukey is the only cookie you need to supply to authenticate. I don't think it changes.
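
For completeness, the same login as a Python sketch (assuming the requests library; untested against the live site, and abc/def are placeholders as above):

 import requests
 
 s = requests.Session()
 s.post("http://rover.info/login",
        data={"login": "abc", "password": "def", "flag_not_ip_assign": 1})
 ukey = s.cookies.get("ukey")  # the only cookie you actually need
 # from here on, cookies={"ukey": ukey} on every request is enough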

Once you've made an account in a browser:

* change your language at the top-right, for convenience
* the second-last option under settings (the link immediately to the left of the language dropdown) says "disallow requests to files with limited access facilities" - maybe disabling this will mean we can scrape at least references to things even if we can't download them. I have it turned on, and in http://rover.info/view_includes/14091675 I see "no access to object ... !" messages amongst the replies. Maybe seeing this is not useful (we can't derive anything from the ID). Or maybe it is? I don't know.


== User accounts ==

This is going to be the biggest issue.

"...Comrade, exua_archiver is logged in from 3,287 different IP addresses, and is currently using 648Mbps of bandwidth."

Hopefully this does not happen.

Hopefully we do not need to ask the question of "okay, what happens if we try to batch-create accounts?"

Hopefully.

I'll leave it to you to make an account to use for the warrior, since it only takes 20 seconds. You can borrow one of my accounts if you want though.


== Strategy for rover.info ==

* Add rover.info/user/<userid> to the user-discovery system
* Add rover.info/{i} to the bruteforcing system
* Add rover.info/view_comments/{i} to the bruteforcing system - this uses the p and per semantics I describe below
* Consider adding rover.info/view_includes/{i} - I think this shows the references to an object, and may turn up otherwise hidden content.
* Write a scraper that looks for <code>/user/[A-Za-z0-9_-]+</code> (<-- I'm 99.9% confident that covers all usernames) in all returned data, and add the extra usernames to the base list. (A sketch follows this list.)
* Scrape avatars off either just the user pages or everywhere (user avatars are linked to in almost all pages)
* For r_view, you could distribute the load between ex.ua and rover.info :) - in theory. See note about access below.
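
A minimal sketch of that scraper bullet (my own regex, hedged: it's the character class above with a <code>+</code> added, since usernames are longer than one character):

 import re
 
 USER_RE = re.compile(r'/user/([A-Za-z0-9_-]+)')
 
 def extract_usernames(html):
     # Pull every /user/<name> reference out of a fetched page and
     # return the unique names, ready for the user-discovery queue.
     return set(USER_RE.findall(html))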


== Object access with ex.ua vs rover.info ==

Compare http://www.ex.ua/get/967129 with http://rover.info/get/967129 (warning, this will download an 8GB MKV, be ready to hit cancel)

rover.info allows access to things ex.ua has blocked for some reason. (I'm logged into ex.ua, so having a login isn't an issue.)


== Request minimization ==

I say just request everything off ex.ua *and* rover.info, but that means something like 7ish requests per ID - and with IDs running to eight digits, call it 100 million of them, that means '''at least''' 700 million requests. I fear the database's load alarms will go off...

I think the XSPFs can ''probably'' go, although I'll definitely hear arguments for why they should stay.

I think the resized (<code>?[0-9]+</code>) avatar requests can go.

The hard decision is whether to replace ex.ua with rover.info entirely. It looks like it'd work 100%, but I'm hesitant. I don't know what to do here.


== Some notes/considerations ==

ex.ua/rover.info use an ID system for all "objects", including conversations, threads, collections, and folders. By bruteforcing all of them, we'll get everything.

However, this will not get the *structure* of interconnected threads. If you look at http://rover.info/91320826, you'll see it has "Beginning of discussion:" and "Last respond:" links that point to other IDs. Woohoo! We can now... only say that ID A relates to ID B, "somehow". We can't go further than that.

So we need to fetch the view_comments links and then chase all the pages, so that we archive the threads in order and in context. We can recover the ordering from the timestamps, but that doesn't preserve the site hierarchy.




I would definitely appreciate extra eyes on the site itself, looking for interesting interconnections between content. I'm having that "I'm sure there's something I've not thought of here..." thing happening, but that may just be nerves.

If you (yes, you, this applies to everyone) feel like wasting a bit of time, just make an account and click around, keeping in mind the need to preserve the semantic structure of the site. If you think of anything, say it in IRC. It may be useful!


== The p/per system ==

Knowing when you're on the last page is important.

Have a look at the navigation area at the top of these 3 pages:

http://rover.info/78522682?r=1&p=21&per=5

http://rover.info/78522682?r=1&p=22&per=5

http://rover.info/78522682?r=1&p=23&per=5

Here are a few things that change when you're on the last page:

* It ''looks'' like the tooltip for the go-to-the-end right-facing arrow contains a Russian string that ends with the number of the last ''item'' in the page. (As in, if there are 152 items on a page, the number will be 152.) This will match the ''yyy'' in the <code>xxx..yyy</code> number in bold text in the middle of the left and right arrows.
* The go-right and go-right-to-end arrows have no text in the middle of them
* Both the go-right and -to-end arrows are no longer wrapped in an <code><a href...</code>.
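
To make that usable, here's a rough Python page-chaser built on the third signal (the next-page anchor disappearing). The regex is my guess at the markup, not something I've verified against the real HTML - treat it as pseudocode with working syntax:

 import re
 import requests
 
 def fetch_all_pages(object_id, ukey, per=200):
     # Chase p=0, 1, 2, ... until the page stops linking to p+1.
     # Assumption: on the last page the go-right arrows lose their
     # <a href...> wrapper, so no anchor should point at the next p.
     pages, p = [], 0
     while True:
         r = requests.get("http://rover.info/%d" % object_id,
                          params={"r": 1, "p": p, "per": per},
                          cookies={"ukey": ukey})
         pages.append(r.text)
         if not re.search(r"[?&;]p=%d\b" % (p + 1), r.text):
             break  # no link to the next page: this was the last one
         p += 1
     return pages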

"The more techniques you use, the less chance for failure when working at scale"...?


== r_view ==

My first archive target was the http://ex.ua/filelist/{i}.xspf URL pattern because it's what ex.ua advertised when the site was functional, it's what I learned, and it's what I added to my little downloader shell script.

After paging through the full set of results crawled by the Wayback Machine to see what I could find, I started poking through the site's JavaScript code and discovered r_view.

The XSPF approach should be discarded as r_view is superior in every way. Unlike XSPF, it includes the author of the post (critical!!), md5sum and size of every file (verifiable downloads! yay!), and the text description associated with the post, which was going to be lost.

Use this from now on: http://ex.ua/r_view/{i}

Example: http://www.ex.ua/r_view/94350260
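
In sketch form, a bruteforce worker for this is just the following (hedged: the status-code handling is a guess, and you may want rover.info instead, per the access notes above):

 import requests
 
 def fetch_r_view(i, ukey):
     # One r_view document per object ID: author, md5sums, sizes
     # and the description all come back in this single response.
     r = requests.get("http://ex.ua/r_view/%d" % i,
                      cookies={"ukey": ukey})
     return r.text if r.status_code == 200 else None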


== RSS ==

I remembered RSS was a thing while browsing some HTML. Example: http://www.ex.ua/rss/17247363

Unfortunately it only returns a handful of results.

But I discovered this! http://www.ex.ua/rss/17247363?count=100

100 is the max.

This is important, as it is the only way I have found to archive lists of *collections*. (Using r_view or .xspf shows you an empty object with a picture in it. They clearly weren't built to handle this datatype.)


== Files, folders and "collections" ==

Files are /get/... URLs.

Folders are lists of files.

Collections are groups of folders. I'm using the term "collections" because I don't know what else to call them.


== RSS vs r_view ==

As I said before, the r_view/XSPF method doesn't seem to handle the collection datatype, while RSS does.

An example:

r_view: http://www.ex.ua/r_view/17247363

XSPF: http://www.ex.ua/filelist/17247363.xspf (sorry, this link downloads itself)

RSS: http://www.ex.ua/rss/17247363?count=100

Only RSS can list collections. However it only seems to be able to return the first 100; I have not found any pagination options.

NOTE: Notice how in the r_view URL the <code>file_upload_id</code> tag's <code>preview</code> parameter had the same filename as the <code>picture</code> tag's URL parameter? Please don't quote me on this, but this might be a consistent way to detect collections.


== r_search_hint ==

Found this first (also hiding in the JS). It's cute but not useful. http://ex.ua/r_search_hint?s=flac


== r_search ==

I lopped _hint off the above URL experimentally, and raised the roof in IRC when it worked.

Using this we can search for anything we want! It returns a text list of matching IDs.

This is not especially useful for automated archiving, but critical for manual prioritization.

After trying a bunch of URLs I found count worked with RSS, so I tried it with this URL as well. count can go up to 500 here :D

EDIT: count can go up to 1000. Yeah, crazy. The sad thing is, this is nigh unusable :(
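
(So, presumably, something like http://ex.ua/r_search?s=flac&count=1000 is the biggest single bite you can take - I'm inferring the s/count combination from the _hint URL above plus the count experiments, so check it before relying on it.)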


== user_say ==

One of the few HTML links that still work. Examples are http://www.ex.ua/user_say?login=888_888 and http://www.ex.ua/user_say?login=big-bud.

You can specify <code>p</code> to select the page number, zero-indexed. (So <code>p=1</code> is page two.)

You can specify <code>per</code> to set the number of items per page; 200 is the max. (Use a consistent <code>per</code> with all searches, as the site doesn't use an <code>after=...</code>-type system.) The website (i.e. in a browser) seems to remember the <code>per</code> setting you use, FWIW.
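
(So, for example, http://www.ex.ua/user_say?login=888_888&p=2&per=200 should be page three of that first user's posts, at 200 items per page.)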


== user_recommend ==

This is similar to user_say except it seems to be for things a user likes. It appears to be possible to switch this off, and some users have turned it off. I would consider this lower priority to crawl, if at all.


== Recommended strategy ==

* Use r_view with bruteforced IDs to fetch the contents of folders. Consider also fetching the image that gets returned (nice-to-have, but not important).
* Use userlist(s) with user_say to fetch archives of what users have written about. I understand this will include their comments and the files they have publicly released.
* Check for an RSS feed on every folder ID that seems to be a collection (going off the suggested heuristics above, along with your own ideas) - or alternatively do RSS scans on every ID. I consider the RSS feeds a nice-to-have "extra" rather than part of the crawling operation: collections simply provide a way to group folders together, you're getting the folder list anyway, and we don't yet know how to return more than the 100 most recent RSS items, so it's not particularly introspective. (A rough driver sketch follows this list.)
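
To tie those bullets together, a rough driver in the same Python-sketch spirit as above. Everything here is a placeholder built from the notes on this page; the last-page test and the collection heuristic in particular are unverified guesses:

 import requests
 
 ROOT = "http://rover.info"  # or http://www.ex.ua - see the access notes above
 
 def looks_like_collection(view_xml):
     # Placeholder for the preview-filename heuristic from "RSS vs
     # r_view" above; verify against real r_view output before use.
     return "file_upload_id" in view_xml
 
 def crawl_user(login, ukey, per=200):
     # Page through user_say using the p/per semantics described above.
     p = 0
     while True:
         r = requests.get(ROOT + "/user_say",
                          params={"login": login, "p": p, "per": per},
                          cookies={"ukey": ukey})
         yield r.text
         if ("p=%d" % (p + 1)) not in r.text:  # hedged last-page test
             break
         p += 1
 
 def crawl_object(i, ukey):
     # r_view for the folder contents; fetch RSS only if it looks like
     # a collection, since that's the one datatype r_view can't list.
     docs = [requests.get("%s/r_view/%d" % (ROOT, i),
                          cookies={"ukey": ukey}).text]
     if looks_like_collection(docs[0]):
         docs.append(requests.get("%s/rss/%d" % (ROOT, i),
                                  params={"count": 100}).text)
     return docs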