Lulu Poetry
Revision as of 07:09, 2 May 2011
Lulu Poetry (Poetry.com) announced on May 1, 2011 that it would close four days later on May 4, deleting all 14 million poems. Archive Team members instantly amassed to find out how to help and aim their LOICs at it. (By the way, I actually mean their crawlers, not DDoS cannons.)
Site Structure
The URLs appear to be flexible and sequential:
(12:13:09 AM) closure: http://www.poetry.com/poems/archiveteam-bitches/3535201/ , heh, look at that, you can just put in any number you like I think
(12:15:16 AM) closure: http://www.poetry.com/user/allofthem/7936443/ same for the users
There are apparently over 14 million poems. The numbers go up to http://www.poetry.com/user/whatever/14712220, though URLs without poems (author deletions?) are interspersed.
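The IRC notes above suggest the slug segment is decorative and only the trailing numeric ID selects a page. A minimal sketch of building probe URLs on that assumption (the slug text here is arbitrary, and the three IDs are the ones quoted above):

```shell
# The slug between /poems/ and the numeric ID appears to be arbitrary; only
# the ID selects a poem (per the IRC notes above). Build a few probe URLs:
urls=$(for id in 3535201 7936443 14712220; do
  printf 'http://www.poetry.com/poems/any-slug-will-do/%s/\n' "$id"
done)
echo "$urls"
```

Fetching any of these (e.g. with wget) should land on the same poem regardless of the slug text, which is what makes the sequential crawl below possible.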
Howto
Claim a range of numbers below.
Generate a hotlist by running this, editing in your start and end number: perl -le 'print "http://www.poetry.com/poems/archiveteam/$_/" for 1000000..2000000' > hotlist
Split the hotlist into sublists: split -l 10000 hotlist gives about 100 sublists for a million-URL range (plain split defaults to 1,000 lines per file, which would give about 1,000 sublists).
Run wget on each sublist: wget -x -i xaa
To avoid getting too many files in one directory, which some filesystems will choke on, we recommend moving into a new subdirectory before running wget on each sublist.
For the daring, here's how to run all wgets on all the sublists in parallel, in subdirs: for x in x??; do mkdir $x.dir; cd $x.dir; wget -x -i ../$x & cd ..; done (note there is no semicolon after the &, since & already terminates the command).
- Recommended wget command: wget -E -k -T 8 -o logfile.log -nv -nc -x -i urls.txt
(please improve if you have better ideas)
short | long version | meaning |
---|---|---|
-E | --adjust-extension | adds ".html" to files that are html but didn't originally end in .html |
-k | --convert-links | change links in html files to point to the local versions of the resources |
-T | --timeout= | if it gets hung for this long (in seconds), it'll retry instead of sitting waiting |
-o | --output-file | use the following filename as a log file instead of printing to screen |
-nv | --no-verbose | don't write every little thing to the log file |
-nc | --no-clobber | if a file is already present on disk, skip it instead of re-downloading it |
-x | --force-directories | force it to create a hierarchy of directories mirroring the hierarchy in the url structure |
-i | --input-file | use the following filename as a source of urls to download |
Coordination
Who is handling which chunks of URLs?
IRC name | starting number | ending number | Progress |
---|---|---|---|
closure | 0 | 200,000 | complete |
closure | 200,000 | 999,999 | in progress |
jag | 1,000,000 | 2,000,000 | in progress |
notakp | 2,000,000 | 3,000,000 | in progress |
no2pencil | 3,000,000 | 3,999,999 | in progress |
d8uv | 4,000,000 | 4,499,999 | in progress |
Qwerty01 | 4,500,000 | 4,999,999 | getting started |
Awsm | ??? | ??? | in progress? |
underscor | ??? | ??? | in progress? |
BlueMax | ??? | ??? | in progress? |
SketchCow | ??? | ??? | in progress? |
[yourusernamehere] | x,000,000 | x,999,999 | in progress |
[seriouslyeditme] | x,000,000 | x,999,999 | in progress |