From Archiveteam
URL [IA] [WebCite] []
Project status Online!
Archiving status Not saved yet
Project source Unknown
Project tracker Unknown
IRC channel #archiveteam

Docplayer is a document sharing platform. Registered users can upload an unlimited number of documents. The uploaded content ranges from teaching materials through legal papers to advertising ephemera.

The service started sometime in 2015 and is growing very quickly: as of June 2016, more than 17 million documents seem to have been uploaded.

The service is available in several countries, each under its own TLD but with the same layout: Italy, Hungary, Spain, Poland, the Netherlands, France, Turkey and Brazil, to mention just a few.

Vital signs

The site seems to be working fine. However, it is quite new, it probably hosts a lot of copyrighted content, and its general look does not inspire trust, so the writer of these lines doubts that it can be relied on to stay up. (Maybe this is just paranoia, but paranoia is one of ArchiveTeam's core concepts anyway.)

Because it is quite an important site, perhaps it could be scraped continuously for already uploaded content, in case something happens?

Site structure

Unfortunately, the original documents are protected with reCAPTCHA, so they will be a tough nut to crack. Fortunately, each document page already contains a "transcript" of the file (as raw text), which is better than nothing.

Document pages are numbered incrementally, but each URL also carries the document title as a suffix.

So documents cannot be discovered by brute force, but fortunately users can, because user page URLs have no suffix, and each user's page lists that user's documents.

Interestingly, the number of users is also high: the newest user ID is about the same as the newest document ID.
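Incremental user discovery could be sketched roughly as follows. This is only an illustration: the host name `docplayer.example` and the `/user/<id>/` path template are assumptions made for the example, not the site's confirmed URL layout, and the function only builds candidate URLs without fetching anything.

```python
from typing import Iterator

# Assumed, illustrative URL template; the real path layout is not confirmed here.
USER_URL_TEMPLATE = "http://{host}/user/{user_id}/"


def user_page_urls(host: str, first_id: int, last_id: int) -> Iterator[str]:
    """Yield one candidate user-page URL per numeric ID in [first_id, last_id]."""
    if first_id < 1 or last_id < first_id:
        raise ValueError("expected 1 <= first_id <= last_id")
    for user_id in range(first_id, last_id + 1):
        yield USER_URL_TEMPLATE.format(host=host, user_id=user_id)


if __name__ == "__main__":
    # "docplayer.example" is a placeholder host, not a real mirror.
    for url in user_page_urls("docplayer.example", 1, 3):
        print(url)
```

Each generated URL would then be fetched and its document links extracted; how those links are marked up on the page is not covered here.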

IDs seem to be unique internationally: the same numeric user ID opened under two different TLDs corresponds to the same user. So incremental user discovery is possible from one place, but there is a problem: a document page can be opened only under the TLD of the country in which it was uploaded. For example, a Hungarian document appears only under the .hu TLD; under any other TLD it gives a 404. (This also means that the corresponding Hungarian user's page opens on any TLD, but everywhere except the Hungarian one the link to her document leads to a 404 page, so it is effectively broken.)

So discovery could be done either with search engines (including the site's own), which give valid results, or by doing one incremental user discovery and then trying each document on ALL TLDs. (The latter is somewhat inefficient, but it covers EVERYTHING.)
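The "try every TLD" approach might look like the sketch below. The host names and the document path are placeholders chosen for the example, and `get_status` is a hypothetical callable (for instance a thin wrapper around an HTTP client) that returns the status code for a URL. It is passed in so that the logic can be tried without touching the network.

```python
from typing import Callable, Iterable, List, Optional
from urllib.parse import urlsplit, urlunsplit


def candidate_urls(document_url: str, hosts: Iterable[str]) -> List[str]:
    """Rewrite one document URL onto every host, keeping scheme, path and query."""
    parts = urlsplit(document_url)
    return [
        urlunsplit((parts.scheme, host, parts.path, parts.query, ""))
        for host in hosts
    ]


def find_live_url(
    document_url: str,
    hosts: Iterable[str],
    get_status: Callable[[str], int],
) -> Optional[str]:
    """Return the first candidate URL that answers 200, or None if none does."""
    for url in candidate_urls(document_url, hosts):
        if get_status(url) == 200:
            return url
    return None


if __name__ == "__main__":
    # Placeholder hosts and a fake status function standing in for real requests.
    hosts = ["docplayer.hu", "docplayer.it"]
    fake_status = lambda url: 200 if "docplayer.hu" in url else 404
    print(find_live_url("http://docplayer.it/123-Some-title.html", hosts, fake_status))
```

In the worst case this costs one request per host for every document, which is the inefficiency mentioned above.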

SUMMARY: Discovery is simple, and at the cost of some inefficiency probably everything can be found, but without defeating the captcha only a low-quality result (the raw-text transcript) is achievable.