User:SadDM/External images in forums

From Archiveteam

It's pretty common for forums to embed images hosted on external sites, and just grabbing the forum doesn't capture them. I've found that the best way to get them is on a second pass. The trick is extracting a list of external image URLs from the initial .warc. The following does a pretty good job (it probably misses a bunch of edge cases, such as img tags where src isn't the first attribute, but it's better than nothing):

zgrep -a '<img' first_pass.warc.gz | sed 's/<img src="/\n/g' | grep -a '^http' | sed 's/\([^"][^"]*\).*/\1/' | sort -u
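For some of those edge cases, here is a rough sketch of a slightly more tolerant variant. It is not a tested replacement for the one-liner above. The function name extract_img_urls and the file names in the usage comment are made up for illustration, and it assumes GNU grep and sed. It accepts src in any attribute position and with either quote style, but still misses img tags split across lines and unquoted src values.

```shell
# Hypothetical helper: read HTML (or a decompressed WARC) on stdin and
# print the unique absolute http(s) image URLs found in <img> tags.
extract_img_urls() {
  # 1. Put every <img ...> tag on its own line (-a treats binary input as text).
  grep -a -o -i '<img[^>]*>' |
    # 2. Keep only the src attribute, quoted with " or '.
    #    The leading whitespace keeps data-src and similar from matching.
    grep -a -o -i "[[:space:]]src=[\"'][^\"']*[\"']" |
    # 3. Strip the leading ' src=' plus opening quote, then the closing quote.
    sed -e 's/^.[Ss][Rr][Cc]=.//' -e 's/.$//' |
    # 4. Drop relative links; those were already grabbed with the forum itself.
    grep -a -E -i '^https?://' |
    # 5. De-duplicate (C locale for a stable ordering).
    LC_ALL=C sort -u
}

# Usage (assumed file names):
#   gzip -dc first_pass.warc.gz | extract_img_urls > external_images.txt
```

Like the one-liner, this lists every absolute URL, including images hosted on the forum's own domain. Those should already be in the first .warc, so filter them out if the duplicates bother you.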

Redirect that output to a file and run wget over it with -i, writing a second .warc. Combine the two .warcs and you're good to go.
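The second pass and the combining step might look something like the sketch below. The function names and file names are placeholders. It assumes a wget new enough to support --warc-file (1.14 or later), and it relies on concatenated gzip files forming a valid gzip stream, so the .warc.gz files can be joined with cat.

```shell
# Hypothetical helper: fetch every URL listed in file $1 and record the
# traffic in $2.warc.gz (wget appends the .warc.gz extension itself).
second_pass() {
  wget -i "$1" --warc-file="$2"
}

# Hypothetical helper: concatenate gzipped WARCs into a single file.
# First argument is the output file, the rest are the inputs, in order.
combine_warcs() {
  out=$1
  shift
  cat "$@" > "$out"
}

# Usage (assumed file names):
#   second_pass external_images.txt second_pass
#   combine_warcs combined.warc.gz first_pass.warc.gz second_pass.warc.gz
```

wget also saves the downloaded images to the current directory, so you may want to run the second pass in a scratch directory.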