Posterous

From Archiveteam
Revision as of 02:51, 13 March 2013 by Chazchaz101 (talk | contribs) (→‎Seesaw script (for advanced users): Specifying an IP address isn't required)
Posterous
URL http://posterous.com
Status Closing
Archiving status In progress...
Archiving type Unknown
Project tracker here
IRC channel #preposterus (on hackint)

Posterous is a blogging platform started in May 2008. It was acquired by Twitter on March 12, 2012 and will shut down on April 30, 2013 (see the announcement). See Posterous#Warrior below for how to help.

Frequently Asked Questions

It's going down! How can I help?

Glad you're interested! First and foremost, consider running our prepared Virtual Machine. Please see Posterous#Warrior down below.

What do you guys need? A huge fat pipe, a.k.a. bandwidth?

Needed/Wanted: Interested volunteers in general and IP addresses. A lot of bandwidth isn't needed, per se. You don't need a fat monster pipe/Internet tube to help out.

Can I donate some cash instead?

Not really, at least not to ArchiveTeam specifically. If you feel like you could let go of a few buckaroos, consider donating to the Internet Archive. They're awesome and do awesome things, just like us! (Yes, you're included in "us" - you're here, reading already!)

Why aren't we fetching more/bashing the shit out of Posterous to get done already?!

Easy, tiger! We all love bringing a web service to its knees, but we want to get as much as possible out of Posterous.

We've currently rate limited the project, and we continue to adjust the limits and try out new tactics. Unfortunately, we've already managed to bring Posterous to its knees a good few times.

The problem is that Posterous is not designed for the load it's currently getting, especially from us. They've designed Posterous so that the front-ends hit a cache with content before hitting the back-end. OK, but why isn't that helping? We're going through *all* of their accounts and posts, which ruins the cache. That means nearly every request falls through to Posterous's back-end, which can't take the request rate at all, so requests return bad data or no data if we go too fast. Please keep this in mind.

The warrior tells me to ask to run the Posterous project?

Yes, yes - it does indeed. If you've read this - feel free to click on it and go on. This message/warning/notice was introduced earlier in the archiving project when we got banned a lot. Other Warrior projects haven't been this bitchy about banning - therefore the notice is still there.

Will I get banned?

If you help out by running the ArchiveTeam Warrior - it's very unlikely that your IP will get banned. Our objective is to get as much of Posterous as possible. Therefore we have taken measures, and continue to take measures, to rate limit, check for errors, retry, and back off when appropriate to ensure we get as much as possible. There's also some magic over at Posterous's end; we won't go into details here though.

If, however, you are starting your own hammer-Posterous project with your own code, or are running too many concurrent jobs (for example with the standalone code mentioned below, the seesaw script for advanced users), then yes: it's very likely you'll get banned.

How do I know if I got banned?

If in doubt, shoot off a request to [1] from the same IP as your Warrior VM. Or log on to your Warrior and run "curl -v http://www.posterous.com". If you don't even get a connection, you're almost certainly banned.
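
The curl check above can be wrapped in a small script. This is only a sketch: the function names are made up for illustration, and reading a failed connection as "banned" is an assumption (your own network could be down too). Exit codes 7 (failed to connect) and 28 (timed out) are standard curl exit codes. Note that a 5XX response still counts as a successful connection here.

```shell
#!/bin/sh
# Hypothetical helper: turn a curl exit code into a rough verdict.
# Treating 7 (failed to connect) and 28 (timed out) as a possible ban
# is this sketch's assumption, not something Posterous documents.
interpret_curl_exit() {
    case "$1" in
        0)    echo "connected: probably not banned" ;;
        7|28) echo "no connection: possibly banned" ;;
        *)    echo "inconclusive: curl exit code $1" ;;
    esac
}

# Probe the site from this machine and print the verdict.
# Without --fail, curl exits 0 even on HTTP 5XX, so server errors
# are reported as "connected".
check_posterous_ban() {
    curl --silent --output /dev/null --max-time 15 http://www.posterous.com
    interpret_curl_exit "$?"
}
```

Run check_posterous_ban on the Warrior (or on any machine behind the same IP) to get a one-line verdict.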

How long am I banned, if banned?

Good question! No good answer! Next!

At the beginning of the crawl, individual IPs were banned for days - if we're not mistaken, a week or so. After experimenting with... overloading... Posterous from different IPs, the ban time has shortened.

The answer is: Hours to weeks. It's unclear.

OK, I'm running the Warrior - I'm getting 502/5XX errors!!

That's not a question.

Posterous will gag out 50X's occasionally - we've taken measures to back off for a period of time and retry a certain number of times. It's alright.

This does not mean your IP has been banned.
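
For the curious, "back off and retry" looks roughly like the sketch below. It is not the Warrior's actual code: the function name, MAX_TRIES and BASE_DELAY are illustrative assumptions, and the real project's limits may differ.

```shell
#!/bin/sh
# Run a command up to MAX_TRIES times, doubling the wait after each failure.
# MAX_TRIES and BASE_DELAY are illustrative values, not the Warrior's settings.
MAX_TRIES=${MAX_TRIES:-5}
BASE_DELAY=${BASE_DELAY:-10}

retry_with_backoff() {
    attempt=1
    delay=$BASE_DELAY
    while :; do
        "$@" && return 0    # success: stop retrying
        if [ "$attempt" -ge "$MAX_TRIES" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        echo "attempt $attempt failed; sleeping ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))
    done
}
```

For example, "retry_with_backoff curl --fail --silent --output page.html http://example.posterous.com/" would retry on a 50X, because curl's --fail option makes HTTP errors return a non-zero exit code. The hostname is a placeholder.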

Uh, so.. looks like there's plenty of spam on Posterous?

Yep, but we don't care. Grab it all. It's not our thing to decide what gets saved and not - especially if we have the chance to save it all.

Maybe it'll be useful for a spam researcher in the future. Maybe not.


Where can I see the project status?

You can see the status at [[2]] - which is the dashboard for this project.

Cool! So you're almost done with this?

Sadly, no! Not all hostnames are tracked on the dashboard, because of certain limitations in the current tracker/dashboard. We've unloaded a lot of the users/items. In total, we believe there to be about 10 million hostnames/sites/users.

The tracker/status dashboard is barfing! Or giving 502 Bad Gateways. What's up?

The tracker/dashboard is a bit fragile - so please don't link to it too widely. It's not optimized for heavy page loads. It is functional, however, and the source code is freely available on [GitHub] - feel free to look into that, and if you see anything that can be improved, submit a pull request.

Our tracker admins will of course kick it back to life if it's acting up. Please join our IRC channel for status updates regarding the tracker and such.

My userstats seem to be reset on the dashboard, what gives?

The user details are cached for a set amount of time, and we've had the caching act up a few times. Please rest assured that every submitted work item DOES get counted and does get in. If you see your username getting submissions and then the total resetting - feel free to poke us in the IRC channel (anyone with a @). We'll kick the cache in the butt, and your stats will show as they should. This shouldn't happen all that often though.

How do I know if my posterous favorite blogobongobloggo will be fetched?

There's no super nice way, but if you go to Posterous#Site List Grab below, you can grab the hostname list that we've spidered forth and check by opening it and searching for your username/hostname. Or you could just use 'grep' on it.
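
A grep lookup could look like the sketch below. It assumes the downloaded list is a plain text file with one hostname per line, which may not match the real file's layout. The function name is made up for illustration.

```shell
#!/bin/sh
# Usage: in_hostname_list HOSTNAME LISTFILE
# Succeeds when HOSTNAME appears as a complete line of LISTFILE.
# Assumes one hostname per line; the real list's layout may differ.
in_hostname_list() {
    # -F: literal match (dots are not wildcards), -x: whole line only,
    # -q: no output, just an exit status, -i: ignore case
    grep -Fxqi -- "$1" "$2"
}
```

For example: in_hostname_list blogobongobloggo.posterous.com hostnames.txt && echo "on the list" (hostnames.txt is a placeholder for wherever you saved the list).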

Can I opt out? I don't want to be saved!

Tough luck, it's already public - that's why we're grabbing it. Besides, don't be embarrassed! We all learn through history - let the history be.

This is cool and all, but where the fuck is the data going?

We'll make sure this data stays public after it's been downloaded. We'll make sure that the awesome duders and duduetters at [Internet Archive] get a copy for sure. We're grabbing all the Posterous sites in an Internet Archive-friendly file format called WARC (Web ARChive) - so they should be able to put this into the Wayback Machine - if they'd like to.

So, my Warrior doesn't get networking with VirtualBox on Ubuntu, what gives?

You should do the following:

VBoxManage modifyvm "archiveteam-warrior-2" --natdnshostresolver1 on
VBoxManage modifyvm "archiveteam-warrior-2" --natdnsproxy1 on

Thanks and shout-outs go to hdevalenc


How to help

Warrior

You can help by installing and running the ArchiveTeam Warrior and selecting the "posterous" project. The Warrior is a virtual machine you can run seamlessly in VirtualBox to help out.

Seesaw script (for advanced users)

Download:

git clone https://github.com/ArchiveTeam/posterous-grab.git

Follow instructions to install seesaw and edit script for IP address.

For wget: run ./get-wget-lua.sh

If you are on a box with more than one public IP address, you can place an IP address after --bind-address= on line 175. Example: "--bind-address=192.168.1.1",

Commands:

git clone https://github.com/ArchiveTeam/posterous-grab.git
cd posterous-grab
git clone https://github.com/ArchiveTeam/seesaw-kit
cd seesaw-kit
sudo pip install -r requirements.txt
sudo pip install seesaw
cd ../
chmod +x get-wget-lua.sh && ./get-wget-lua.sh
run-pipeline --concurrent 1 --address <your_ip_address> pipeline.py <your_username>

Site List Grab

We have assembled a list of Posterous sites that need grabbing. Total found: 9898986

http://archive.org/details/2013-02-22-posterous-hostname-list

Tools: git

Goal

We found about 9.9 million possible Posterous accounts. After filtering out the banned/spam accounts we have 6,677,720 left.

They close April 30th, 2013. We have 50 days left and 1,200,000 accounts downloaded.

60 sec * 60 min * 24 hours = 86,400 seconds a day

(6,677,720 - 1,200,000) / 86,400 = 63.4 days at 1 account per second.

63.4 days / 50 days left = 1.268 accounts per second, which rounds up to 2 accounts per second actually needed.

Not all accounts are the same size, and we have had outages, so a safe target is 3x that figure: 6 full accounts per second to be sure of getting all of Posterous before it shuts down. This also assumes that we will not have to redownload any accounts at the end.
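
The estimate above can be recomputed as the numbers change. The sketch below just redoes the same arithmetic with awk. The function name and output format are made up for illustration, and the safety factor of 3 is the rule of thumb from this section, not a measured value.

```shell
#!/bin/sh
# Recompute the required download rate.
# $1 = accounts to fetch in total, $2 = accounts already downloaded,
# $3 = days left, $4 = safety factor
required_rate() {
    awk -v total="$1" -v fetched="$2" -v days="$3" -v safety="$4" 'BEGIN {
        remaining   = total - fetched
        days_at_one = remaining / 86400               # days needed at 1 account/s
        raw         = days_at_one / days              # accounts/s to finish in time
        rounded     = int(raw) + (raw > int(raw))     # round up to a whole number
        printf "%.1f days at 1/s, %.3f/s needed, rounded up to %d/s, %d/s with safety factor\n",
               days_at_one, raw, rounded, rounded * safety
    }'
}

required_rate 6677720 1200000 50 3
# prints: 63.4 days at 1/s, 1.268/s needed, rounded up to 2/s, 6/s with safety factor
```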