Lulu Poetry, also known as Poetry.com, announced on May 1, 2011 that it would shut down only days later, on May 4, deleting all 14 million poems. Archive Team members instantly amassed to figure out how to help and aim their LOICs at it. (By the way, that means their crawlers, not DDoS cannons.)


==Site Structure==

The urls appear to be flexible and sequential:
(12:13:09 AM) closure: http://www.poetry.com/poems/archiveteam-bitches/3535201/ , heh, look at that, you can just put in any number you like I think
(12:15:16 AM) closure: http://www.poetry.com/user/allofthem/7936443/ same for the users
There are apparently over 14 million poems. The numbers go up to <nowiki>http://www.poetry.com/user/whatever/14712220</nowiki>, though interspersed are urls without poems (author deletions?).
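
Before claiming a range it can be worth spot-checking a few numbers to see what the server sends back. A minimal sketch, assuming any slug text is accepted (as in the examples above) and that missing poems come back with something other than HTTP 200 (that part is a guess, so eyeball a few responses yourself):

<pre>
# Probe a few poem numbers and report the HTTP status for each.
# Assumes the slug ("archiveteam" here) is ignored by the server and that
# deleted or missing poems return something other than 200; both are guesses.
for n in 3535201 7936443 14712220; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "http://www.poetry.com/poems/archiveteam/$n/")
    echo "$n -> HTTP $code"
done
</pre>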

==Howto==

Claim a range of numbers in the Coordination table below, then work through the steps here for that range.

Generate a hotlist by running this, editing in your start and end numbers: <tt>perl -le 'print "<nowiki>http://www.poetry.com/poems/archiveteam/$_/</nowiki>" for 1000000..2000000' > hotlist</tt>
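
If perl isn't on the box doing the grabbing, a plain shell loop builds the same hotlist; the numbers are just the example range from the command above, so substitute your claimed block:

<pre>
# Same idea without perl: one URL per line for the claimed number range.
seq 1000000 2000000 | while read n; do
    echo "http://www.poetry.com/poems/archiveteam/$n/"
done > hotlist
</pre>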

Split the hotlist into sublists of 10,000 urls each (about 100 pieces for a million-poem range): <tt>split -l 10000 hotlist</tt> (a bare <tt>split hotlist</tt> cuts at 1,000 lines and runs out of two-letter suffixes on a list this big).
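
Before firing off wget, a quick sanity check that the split came out as expected:

<pre>
# split names its pieces xaa, xab, ... by default; each should hold
# 10,000 URLs, with any remainder landing in the last file.
wc -l x??
</pre>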

Run wget on each sublist: <tt>wget -x -i xaa</tt> (then <tt>xab</tt>, <tt>xac</tt>, and so on).

To avoid putting too many files in one directory, which some filesystems will choke on, move into a fresh subdirectory before running wget on each sublist.

For the daring, here's how to run wget on all the sublists in parallel, each in its own subdir: <tt>for x in x??; do mkdir $x.dir; cd $x.dir; wget -x -i ../$x & cd ..; done</tt>
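
If chaining cd's inside a backgrounded loop feels fragile, the same thing can be written with a subshell so each wget keeps its own working directory and the outer shell never moves; this is just a sketch of the same approach:

<pre>
# One backgrounded wget per sublist, each writing into its own directory.
# The subshell confines the cd, so the outer shell stays where it started.
for x in x??; do
    mkdir -p "$x.dir"
    ( cd "$x.dir" && wget -x -i "../$x" ) &
done
wait   # return only once every wget has finished
</pre>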

==Coordination==

Who is handling which chunks of urls?
{| class="wikitable"
! IRC name !! Starting number !! Ending number !! Progress
|-
| closure || 0 || 999,999 || in progress
|-
| jag || 1,000,000 || 2,000,000 || in progress
|-
| notakp || 2,000,000 || 3,000,000 || in progress
|-
| no2pencil || 3,000,000 || 3,999,999 || in progress
|-
| Qwerty01 || 7,000,000 || 7,999,999 || getting started
|-
| Awsm || ??? || ??? || in progress?
|-
| underscor || ??? || ??? || in progress?
|-
| BlueMax || ??? || ??? || in progress?
|-
| d8uv || ??? || ??? || in progress?
|-
| SketchCow || ??? || ??? || in progress?
|-
| [yourusernamehere] || x,000,000 || x,000,000 || in progress
|}