4chan/4plebs

archive.4plebs.org

URL	archive.4plebs.org
Status	Online!
Archiving status	Saved!
Archiving type	Unknown
IRC channel	#archiveteam-bs (on hackint)
Data	@fourplebs

Status: Online
Saves Images?: Yes

4plebs is shedding all full-sized images dating from before April 2014, about 240 GB of data, due to storage limits. We need to retrieve this data and put it on the Internet Archive for safekeeping.

The Bibliotheca Anonoma has received the pruned images from 4plebs via tar piping, and will upload them to the Internet Archive shortly.

List of Boards

Board	Date of Oldest Thread	Pedigree
/adv/ - Advice	2014-01-12	4plebs.org
/f/ - Flash	2014-03-15	4plebs.org
/hr/ - High Resolution	2012-12-01 (around birth)	4plebs.org
/o/ - Auto	2013-03-13	4plebs.org
/pol/ - Politically Incorrect	2013-10-28	4plebs.org
/s4s/ - Shit 4chan Says	2013-10-05	4plebs.org
/sp/ - Sports	2012-06-11	4plebs.org <- not4plebs.org <- (late 2014 threads lost) <- Archive.moe <- foolz.us
/tg/ - Traditional Games	2011-06-26	4plebs.org
/trv/ - Travel	2012-07-02	4plebs.org
/x/ - Paranormal	2013-04-01	4plebs.org

Method 1: Web Scraping

Using wget, we simply scrape the images off the server. It's not elegant, but it works, and thankfully the admin has provided lists of the images to be removed, one per board (change the board name in the list URL to get another board's list). Scraping everything will take at least a month, and that's assuming the boards are scraped in parallel. The following bash script is used:

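#!/bin/bash
# Board to scrape; change this to fetch another board's list
board="tg"
# Download the admin-provided list of images scheduled for removal
wget "http://img.4plebs.org/boards/$board/image/to_be_removed_in_order.txt"
# Turn the relative paths (./...) into full URLs; double quotes so that $board is expanded
sed -e "s|^\./|http://img.4plebs.org/boards/$board/image/|g" -i to_be_removed_in_order.txt
# Fetch every URL in the background: resume partial files, skip existing ones, wait 1 second between requests
wget -b --tries=10 -nc -c -i to_be_removed_in_order.txt --user-agent="Bibliotheca Anonoma Website Archiver/1.1 (+http://github.com/bibanon/bibanon/wiki)" -w 1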
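
Because the scrape runs in the background and can be interrupted, it can help to check afterwards which listed images never arrived. The following is only a sketch: the function name missing_images is hypothetical, and it assumes wget saved each image into one flat directory under the file's basename, which is wget's default when neither -x nor -r is given.

```shell
#!/bin/bash
# Print every URL from the list whose file is absent (or empty) in the download directory.
# Assumption: each image was saved under its basename in a single flat directory.
missing_images() {
    local list="$1" dir="$2" url
    while IFS= read -r url; do
        # Skip blank lines in the list
        [ -n "$url" ] || continue
        # ${url##*/} strips everything up to the last slash, leaving the file name
        [ -s "$dir/${url##*/}" ] || printf '%s\n' "$url"
    done < "$list"
}
```

Running missing_images to_be_removed_in_order.txt . and redirecting the output to a new file would give a shorter list that could be passed to wget -i for a second pass.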

Web Scraping ETA

Below are rough estimates for scraping time, procedurally calculated based on the amount of images listed.

These were intentionally overestimated to make sure my VPS actually had enough space and time, but the time estimates turned out to be too high by roughly a factor of 10: scraping /s4s/ took 15h 30m for 5.7 GB of data, not 6 days. Maybe the average download time should be 1.1 seconds, rather than 2? The 1 second delay is only there to be polite.

Assumes:

2 second Average Download Time (includes 1 second delay)
600KB Average filesize for regular boards
3MB Average filesize for high resolution boards
8MB Average filesize for /f/lashes
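
The arithmetic behind these estimates can be sketched as below. The function names are hypothetical and the original calculation script is not shown on this page, so this is a reconstruction. With integer division, 600 KB per image reproduces the listed /adv/ size exactly, while the listed /adv/ timespan of 72 hours corresponds to 20 seconds per image rather than the stated 2; that the estimates were computed with 20 is an inference from the numbers, not something the page states.

```shell
#!/bin/bash
# Estimated scrape time in whole hours: images * seconds_per_image / 3600
estimate_hours() {
    local images="$1" secs_per_image="$2"
    echo $(( images * secs_per_image / 3600 ))
}

# Estimated size in whole MB: images * average_size_in_KB / 1024
estimate_mb() {
    local images="$1" avg_kb="$2"
    echo $(( images * avg_kb / 1024 ))
}

estimate_hours 12973 2     # /adv/ at the stated 2 seconds per image -> 7
estimate_hours 12973 20    # /adv/ at 20 seconds per image -> 72
estimate_mb 12973 600      # /adv/ at 600 KB per image -> 7601
```

At 2 seconds per image the /adv/ estimate comes out at about 7 hours, which is close to the actual timespan recorded for that board.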

Total

Total Amount of Images: 372123
Total Estimated Size: 244789 MB (or) 240 GB
Total Estimated Timespan:
Parallel: 1 month (30 days)
Sequential: 2063 hours (or) 85 days

/adv/

Status: Scraping - 2015-09-20

Amount of Images: 12973
Estimated Timespan: 72 hours (or) 3 days
Estimated Size: 7601 MB (or) 8 GB
- Actual Timespan: 7h 18m 10s
- Actual Size: 3.3G

/hr/

Amount of Images: 11082
Estimated Timespan: 61 hours (or) 2 days
Estimated Size: 33246 MB (or) 33 GB

/f/

Nothing to be pruned?

/o/

Amount of Images: 37437
Estimated Timespan: 207 hours (or) 8 days
Estimated Size: 21935 MB (or) 22 GB

/pol/

Amount of Images: 107115
Estimated Timespan: 595 hours (or) 24 days
Estimated Size: 62762 MB (or) 62 GB

/s4s/

Status: Scraping - 2015-09-20

Amount of Images: 29504
Estimated Timespan: 163 hours (or) 6 days
Estimated Size: 17287 MB (or) 17 GB
- Actual Timespan: 15h 30m
- Actual Size: 5.7G

/sp/

Nothing to be pruned?

/tg/

Amount of Images: 60556
Estimated Timespan: 336 hours (or) 14 days
Estimated Size: 35482 MB (or) 34 GB

/trv/

Status: Saved! - 2015-09-21

Amount of Images: 1713
Estimated Timespan: 9 hours
Estimated Size: 1003 MB (or) 1 GB
- Actual Timespan: 1h 6m 32s
- Actual Size: 1.1G

/tv/

Amount of Images: 99399
Estimated Timespan: 552 hours (or) 23 days
Estimated Size: 58241 MB (or) 57 GB

/x/

Status: Saved! - 2015-09-20

Amount of Images: 12344
Estimated Timespan: 68 hours (or) 2 days
Estimated Size: 7232 MB (or) 7 GB
- Actual Timespan: 6h 55m 29s
- Actual Size: 3.4G
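
The actual timespans recorded for /adv/, /s4s/, /trv/ and /x/ can be turned into a measured average time per image, which is a quick way to check the 2 second assumption. This is a minimal sketch; the function name seconds_per_image is hypothetical, and awk is used only because bash has no floating point arithmetic.

```shell
#!/bin/bash
# Average seconds per image, given an actual timespan (hours, minutes, seconds)
# and the number of images scraped in that time.
seconds_per_image() {
    local h="$1" m="$2" s="$3" images="$4"
    awk -v h="$h" -v m="$m" -v s="$s" -v n="$images" \
        'BEGIN { printf "%.2f\n", (h * 3600 + m * 60 + s) / n }'
}

seconds_per_image 7 18 10 12973    # /adv/  -> 2.03
seconds_per_image 15 30 0 29504    # /s4s/  -> 1.89
seconds_per_image 1 6 32 1713      # /trv/  -> 2.33
seconds_per_image 6 55 29 12344    # /x/    -> 2.02
```

All four boards land close to 2 seconds per image, including the 1 second delay.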

Method 2: tar Piping

Web scraping eats up bandwidth and takes quite a long time. A better method is to pipe a tar archive from their host server straight to our (dedicated) server over ssh. The archive is never written to disk on the host server; it is unpacked directly on the remote server as the stream arrives.

That way, the host server doesn't have to store a redundant backup that could be massive. Instead, it just spits the data at our server directly.

tar -cf - /path/to/dir | ssh remote_server 'tar -xvf - -C /absolute/path/to/remotedir'

This would take only about a week to transfer 240 GB of data, and it reduces the overhead on the web server: instead of answering requests for roughly 370,000 individual files, it sends one continuous stream of data.
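
The pipe can be tried out locally before pointing it at a real server. The sketch below is an assumption-laden stand-in, not the command that was actually run: the function name tar_pipe_copy is hypothetical, and the ssh hop is replaced by a second local tar so that it works without any remote host.

```shell
#!/bin/bash
# Copy the contents of src_dir into dest_dir through a tar pipe.
# The receiving tar runs locally here; in the real transfer it would be
# wrapped in ssh on the remote server.
tar_pipe_copy() {
    local src_dir="$1" dest_dir="$2"
    mkdir -p "$dest_dir"
    # -C makes archive paths relative to src_dir; "-f -" means stdout on the
    # sending side and stdin on the receiving side.
    tar -cf - -C "$src_dir" . | tar -xf - -C "$dest_dir"
}
```

Using -C on the sending side keeps the archived paths relative, so the files land directly under the destination directory instead of under a copy of the full source path.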

Source: StackOverflow - Tar piping
