Lulu Poetry
Lulu Poetry, or Poetry.com, announced on May 1, 2011 that it would close four days later on May 4, deleting all 14 million poems. Archive Team members instantly assembled to figure out how to help and aim their LOICs at it. (By the way, I actually mean their crawlers, not DDoS cannons.)
News
For everyone who left wget running last night: we noticed that the site goes down periodically, serving a page announcing "site maintenance" instead of the poem page wget was asking for. So we're having to find those files, delete them, and re-download them. See Tools for more info.
Site Structure
The urls appear to be flexible and sequential:
(12:13:09 AM) closure: http://www.poetry.com/poems/archiveteam-bitches/3535201/ , heh, look at that, you can just put in any number you like I think
(12:15:16 AM) closure: http://www.poetry.com/user/allofthem/7936443/ same for the users
There are apparently over 14 million poems. The numbers go up to http://www.poetry.com/user/whatever/14712220, though interspersed are urls without poems (author deletions?).
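If you want to sanity-check the ID space before claiming a range, a rough probe like the one below works, assuming curl is installed; the IDs are the ones quoted above. Note that missing poems might come back as an error status or as a normal-looking page, so eyeball a few responses by hand before trusting the numbers.

# Probe a few poem IDs and report the HTTP status each one returns.
# (A sketch only -- not part of the actual grab.)
for id in 3535201 7936443 14712220; do
    code=`curl -s -o /dev/null -w '%{http_code}' "http://www.poetry.com/poems/archiveteam/${id}/"`
    echo "${id}: HTTP ${code}"
done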
Howto
- Claim a range of numbers below.
- Generate a hotlist of urls for wget to download by running this, editing in your start and end number: perl -le 'print "http://www.poetry.com/poems/archiveteam/$_/" for 1000000..2000000' > hotlist
- Split the hotlist into roughly 100 sublists of 10,000 urls each: split -l 10000 hotlist (plain split defaults to 1,000-line chunks, which would leave you with about 1,000 files instead)
- Run wget on each sublist: wget -x -i xaa
- To avoid getting too many files in one directory, which some filesystems will choke on, we recommend moving into a new subdirectory before running each wget on its sublist.
- For the daring, here's how to run all the wgets on all the sublists in parallel, each in its own subdir (a fuller sketch follows this list): for x in x??; do mkdir $x.dir; cd $x.dir; wget -x -i ../$x & cd ..; done
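Putting the steps above together, here's a minimal end-to-end sketch for one claimed range. It assumes bash, GNU coreutils, and the recommended wget flags listed below; START and END are placeholders for whatever range you claim in the Coordination table.

# Grab one claimed range of poem IDs, ~100 sublists in parallel.
START=1000000    # edit these to the range you claimed
END=2000000

# 1. Generate the hotlist of poem urls for the range.
perl -le "print \"http://www.poetry.com/poems/archiveteam/\$_/\" for $START..$END" > hotlist

# 2. Split it into sublists of 10,000 urls each (named xaa, xab, ...).
split -l 10000 hotlist

# 3. One wget per sublist, each in its own subdirectory so no single
#    directory collects millions of files. Drop the & to run them one at a time.
for x in x??; do
    mkdir "$x.dir"
    (cd "$x.dir" && wget -E -k -T 8 -o logfile.log -nv -nc -x -i "../$x") &
done
wait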
Important note: Everyone's getting a lot of 500 errors, probably because we're whacking their server. Because of this, make sure you keep all your log files. Then you can search them later to generate a list of urls to retry. Suggested ways to do this:
- run the command grep -h -B1 "ERROR 500" *.log | grep ^http | sed 's/:$//'
- use a perl script (or the shell sketch below) to search your download directory for missing folders in the sequence
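Here's a shell sketch of that missing-item check, done with find/comm instead of perl. It assumes the x??.dir layout from the Howto, the directory structure wget -x creates, and that START/END are the range you claimed.

# List every url in your claimed range that has nothing saved on disk yet,
# so it can be fed back to wget -x -i as a retry hotlist.
START=1000000    # edit these to the range you claimed
END=2000000

# Poem ids we actually have something saved for (maintenance pages included;
# those are handled separately in the Tools section).
find x??.dir -type f -path '*/poems/archiveteam/*' \
    | sed 's|.*/archiveteam/||; s|/.*||' | sort -u > have

# Poem ids we were supposed to fetch.
seq $START $END | sort > want

# Ids in 'want' but not in 'have' become the retry hotlist.
comm -23 want have \
    | sed 's|^|http://www.poetry.com/poems/archiveteam/|; s|$|/|' > retry-hotlist

Feed retry-hotlist back into the same recommended wget command when you're done.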
Recommended wget command: wget -E -k -T 8 -o logfile.log -nv -nc -x -i urls.txt
short | long version | meaning |
---|---|---|
-E | --adjust-extension | adds ".html" to files that are html but didn't originally end in .html |
-k | --convert-links | change links in html files to point to the local versions of the resources |
-T | --timeout= | if it gets hung for this long (in seconds), it'll retry instead of sitting waiting |
-o | --output-file | use the following filename as a log file instead of printing to screen |
-nv | --no-verbose | don't write every little thing to the log file |
-nc | --no-clobber | if a file is already present on disk, skip it instead of re-downloading it |
-x | --force-directories | force it to create a hierarchy of directories mirroring the hierarchy in the url structure |
-i | --input-file | use the following filename as a source of urls to download |
Coordination
Who is handling which chunks of urls?
IRC name | starting number | ending number | Progress |
---|---|---|---|
closure | 0 | 200,000 | complete |
closure | 200,000 | 999,999 | in progress |
jag | 1,000,000 | 2,000,000 | in progress |
notakp | 2,000,000 | 3,000,000 | in progress |
no2pencil | 3,000,000 | 3,999,999 | in progress |
d8uv | 4,000,000 | 4,499,999 | in progress |
Qwerty01 | 4,500,000 | 4,999,999 | getting started |
Awsm | ??? | ??? | in progress? |
underscor | ??? | ??? | in progress? |
BlueMax | ??? | ??? | in progress? |
SketchCow | ??? | ??? | in progress? |
[yourusernamehere] | x,000,000 | x,999,999 | in progress |
[seriouslyeditme] | x,000,000 | x,999,999 | in progress |
Tools
When the site is under "site maintenance," it gives wget a maintenance notice instead of the poem page. That's worse than an outright failure: you end up with a complete html file that just has the wrong content in it. You can find those files using this command: find x??.dir -type f -print0 | xargs --null grep -l "performing site maintenance"
For detecting and fixing these maintenance pages, I created the following correction script (note that it expects a flat directory of files named poem<number>.html):

# Re-download every poem whose saved page is really a maintenance notice.
flist=`grep "currently performing site maintenance" *.html | cut -d: -f1`
x=0
for file in ${flist}; do
    if [ -f ${file} ]; then
        echo correcting ${file}
        # Characters 5-11 of poemNNNNNNN.html are the poem number.
        html=`echo ${file} | cut -c5-11`
        wget -E http://www.poetry.com/poems/archiveteam/${html}/ -O poem${html}.html 2>/dev/null
        echo done...
        x=`expr ${x} + 1`
    fi
done
if [ ${x} -eq 0 ]; then
    echo Directory clean
else
    echo ${x} files corrected
fi
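The script above targets that flat layout. If you used the x??.dir layout from the Howto instead, a loop along these lines does the same job: delete each saved maintenance page and re-fetch its real url into the right subdirectory. This is only a sketch; it assumes bash, GNU grep/xargs (for the NUL-separated filenames), and the path structure wget -x creates.

# Find every saved "site maintenance" page under the x??.dir tree,
# reconstruct its original url from the on-disk path, and re-fetch it.
find x??.dir -type f -print0 \
    | xargs --null grep -lZ "performing site maintenance" \
    | while IFS= read -r -d '' f; do
          # f looks like: xaa.dir/www.poetry.com/poems/archiveteam/1234567/index.html
          url="http://$(dirname "${f#*.dir/}")/"
          dir="${f%%/www.poetry.com/*}"
          echo "re-fetching ${url}"
          rm -f "${f}"
          (cd "${dir}" && wget -E -T 8 -nv -x "${url}")
      done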