Difference between revisions of "Posterous"

(Adding some sort of FAQ)
Revision as of 00:13, 13 March 2013

Posterous
URL: http://posterous.com
Status: Closing
Archiving status: In progress...
Archiving type: Unknown
Project tracker: http://tracker.archiveteam.org/posterous/
IRC channel: #preposterus (on hackint)

Posterous is a blogging platform started in May 2008. It was acquired by Twitter on March 12, 2012 and will shut down on April 30, 2013 (announcement: http://blog.posterous.com/thanks-from-posterous). See the Warrior section below for how to help.

Frequently Asked Questions

It's going down! How can I help?

Glad you're interested! First and foremost, consider running our prepared virtual machine. Please see the Warrior section below.

What do you guys need? A huge fat pipe, a.k.a. bandwidth?

Needed/wanted: interested volunteers in general, and IP addresses. Bandwidth per se isn't actually needed. You don't need a fat monster pipe/Internet tube to help out.

Why aren't we fetching more/bashing the shit out of Posterous to get done already?!

Easy, tiger! We all love bringing a web service to its knees, but we want to get as much as possible out of Posterous.

We've currently rate limited the project, and we continue to adjust the limits and try out new tactics. Unfortunately, we have already brought Posterous to its knees a good few times.

The problem is that Posterous is not designed for the load it's currently getting, especially from us. Posterous is designed so that the front-ends hit a cache of content before hitting the back-end. OK, but why isn't that helping? We're going through *all* of their accounts and posts, so nearly every request misses the cache and lands on the back-end. The back-end can't take that request rate, so requests return bad data or no data if we go too fast. Please keep this in mind.
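To see why a full crawl defeats a cache, here is a toy simulation. It is only a sketch: it assumes a least-recently-used cache, and the cache size, blog names and request counts are made up for illustration. We don't know how the Posterous cache actually works.

```python
from collections import OrderedDict


def hit_rate(requests, capacity):
    """Replay requests against an LRU cache; return the fraction served from cache."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for key in requests:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / total if total else 0.0


# Hypothetical numbers: 100 cache slots, 10,000 requests.
CAPACITY = 100
# Normal readers keep coming back to the same 50 popular blogs.
reader_traffic = [f"blog-{i % 50}" for i in range(10_000)]
# An archive crawl visits each of 10,000 blogs exactly once.
crawl_traffic = [f"blog-{i}" for i in range(10_000)]

print(f"readers: {hit_rate(reader_traffic, CAPACITY):.1%} cache hits")  # 99.5%
print(f"crawl:   {hit_rate(crawl_traffic, CAPACITY):.1%} cache hits")   # 0.0%
```

In this toy model the readers almost never reach the back-end, while every single crawl request does.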

The Warrior tells me to ask before I run the Posterous project?

Yes, yes, it does indeed. If you've read this, feel free to click on it and go on. This message/warning/notice was introduced earlier in the archiving project when we got banned a lot. Other Warrior projects haven't been this bitchy about banning, so the notice is still there for this one.

Will I get banned?

If you help out by running the ArchiveTeam Warrior, it's very unlikely that your IP will get banned. Our objective is to get as much of Posterous as possible. Therefore we have taken measures, and continue to take measures, to rate limit, check for errors, retry, and back off when appropriate. There's also some magic over at the Posterous end, but we won't go into details here.

If, however, you are starting your own "Hammer Posterous Silly" project with your own code, or are running too many concurrent jobs (for example with the standalone code mentioned below, the seesaw script for advanced users), then yes. It's very likely you'll get banned.
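If you do write your own code, the least you can do is throttle it. Below is a minimal sketch of a request throttle. The Throttle class and the 0.5 second interval are made up for illustration; this is not the Warrior's actual rate limiter.

```python
import time


class Throttle:
    """Allow at most one request every min_interval seconds."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed, then reserve the slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval


# Fake clock so the example runs instantly; sleeping just advances it.
fake_now = [0.0]


def fake_sleep(seconds):
    fake_now[0] += seconds


throttle = Throttle(0.5, clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(4):
    throttle.wait()  # a real script would fetch one page after each wait
print(fake_now[0])  # 1.5: four requests spread over 1.5 simulated seconds
```

With the default clock and sleep arguments the same class throttles in real time. Call throttle.wait() before every request.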

OK, I'm running the Warrior - I'm getting 502/5XX errors!!

That's not a question.

Posterous will gag out 50X errors occasionally. We've taken measures to back off for a period of time and to retry a certain number of times. It's alright.
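For the curious, backing off and retrying looks roughly like the sketch below. It assumes exponential delays and a made-up fetch callable that returns a (status, body) pair. The real delays and retry counts used by the project are not listed on this page.

```python
import time


def fetch_with_backoff(fetch, url, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-5xx status or the tries run out.

    fetch is any callable returning (status_code, body).
    """
    status, body = None, None
    for attempt in range(max_tries):
        status, body = fetch(url)
        if status < 500:
            return status, body
        if attempt < max_tries - 1:
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ... between tries
    return status, body  # still failing: hand the last error back to the caller


# Stand-in for a real HTTP call: fails twice with 502, then succeeds.
responses = iter([(502, ""), (502, ""), (200, "ok")])
delays = []  # record the sleeps instead of actually waiting
result = fetch_with_backoff(
    lambda url: next(responses), "http://example.invalid/", sleep=delays.append
)
print(result, delays)  # (200, 'ok') [1.0, 2.0]
```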

Where can I see the project status?

You can see the status at http://tracker.archiveteam.org/posterous/, which is the dashboard for this project.

Cool! So you're almost done with this?

Sadly, no! Not all hostnames are tracked on the dashboard, because of certain limitations in the current tracker/dashboard. We've unloaded a lot of the users/items. In total, we believe there to be about 10 million hostnames/sites/users.

The tracker/status dashboard is barfing! Or giving 502 Bad Gateways. What's up?

The tracker/dashboard is a bit fragile, so please don't link to it too widely. It's not optimized for heavy page loads. It is functional, however, and the source code is freely available on GitHub (https://github.com/ArchiveTeam/universal-tracker). Feel free to look into it, and if you see anything that can be improved, submit a pull request.

Our tracker admins will of course kick it back to life if it's acting up. Please join our IRC channel for status updates regarding the tracker and such.

My user stats seem to have been reset on the dashboard, what gives?

The user details are cached for a set amount of time, and we've had the caching act up a few times. Please rest assured that every submitted work item DOES get counted and does get in. If you see your username getting submissions and then the total resetting, feel free to poke us in the IRC channel (anyone with a @).


How to help

Warrior

You can help by installing and running the ArchiveTeam Warrior and selecting the "posterous" project. The Warrior is a virtual machine that you can run in VirtualBox to help out.

Seesaw script (for advanced users)

Download:

git clone https://github.com/ArchiveTeam/posterous-grab.git

Follow the instructions to install seesaw and edit the script to set your IP address.

For wget: run ./get-wget-lua.sh

Commands:

Make sure you place an IP address after --bind-address= on line 175 of the script. Example: "--bind-address=192.168.1.1",

git clone http://github.com/ArchiveTeam/posterous-grab.git
cd posterous-grab
git clone http://github.com/ArchiveTeam/seesaw-kit
cd seesaw-kit
sudo pip install -r requirements.txt
sudo pip install seesaw
cd ../
chmod +x get-wget-lua.sh && ./get-wget-lua.sh
run-pipeline --concurrent 1 --address <your_ip_address> pipeline.py <your_username>

Site List Grab

We have assembled a list of Posterous sites that need grabbing. Total found: 9,898,986

http://archive.org/details/2013-02-22-posterous-hostname-list

Tools: git

Goal

We found 9.8 million possible Posterous accounts. After filtering out the banned/spam accounts we have 6,677,720 left.

Posterous closes on April 30th, 2013. We have 50 days left and 1,200,000 accounts downloaded.

60 sec * 60 min * 24 hours = 86,400 seconds a day

(6,677,720 - 1,200,000) / 86,400 = 63.4 days at 1 account per second.

63.4 days (at 1 fetch per second) / 50 days left = 1.268, which rounds up to 2 accounts per second actually needed.

Taking into account that not all accounts are the same size, and the outages we have already had, a safe number would be 3x the above. So we need to download 6 full accounts per second to be confident of getting all of Posterous before it shuts down. This also assumes that we will not have to redownload any accounts at the end.
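The arithmetic above can be rechecked, or rerun as the numbers change, with a few lines of Python. The variable names are ours. The figures are the ones quoted in this section, and the 3x safety factor is the rule of thumb stated above.

```python
import math

SECONDS_PER_DAY = 60 * 60 * 24   # 86,400
accounts_total = 6_677_720       # after filtering out banned/spam accounts
accounts_done = 1_200_000
days_left = 50
safety_factor = 3                # margin for large accounts and outages

remaining = accounts_total - accounts_done
days_at_one_per_second = remaining / SECONDS_PER_DAY
needed_rate = days_at_one_per_second / days_left
target_rate = math.ceil(needed_rate) * safety_factor

print(f"{remaining:,} accounts remaining")
print(f"{days_at_one_per_second:.1f} days at 1 account per second")
print(f"{needed_rate:.3f} accounts per second needed, rounded up to {math.ceil(needed_rate)}")
print(f"{target_rate} accounts per second with a {safety_factor}x safety margin")
```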