Talk:Yahoo! Groups

From Archiveteam
Revision as of 09:03, 15 December 2019 by LeighRoberts (talk | contribs) (LeighRoberts' initial impressions)

My first impressions as we wind things up:

Individual Warrior runner advice:
- Find out what the maximum item size is. Your available disk must be at least twice that size, plus 10%.
- Find out what the median item size is. Add 20%, then multiply by 2 because of the WARC file! Divide your available disk space by that figure to estimate how many containers you can run.
- Workers will get stuck on the largest jobs, so even that sizing might not be conservative enough.
- The Docker Warrior needed about 50 MB of memory per container. I did not get figures for yahoo-group-archiver, but I imagine it would be similar.
- Scale slowly: don’t fire off all your containers at once, because they all need to load the same requirements.
- Start about 1/4 of your planned containers and wait 15-30 minutes to check CPU and memory use before starting the next 1/4.
- Check your disk space regularly. Is there Grafana setup advice for this?
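The sizing rules above can be sketched as a small calculation. This is only my reading of them: I take "twice that big + 10%" to mean 2.2 times the largest item, and I assume the median-based figure is the disk needed per container. The function names and the example sizes are hypothetical, not taken from the project.

```python
GB = 10**9  # decimal gigabytes; illustrative unit choice


def min_disk_bytes(max_item_bytes):
    """Minimum free disk: twice the largest item, plus 10% (rounded up)."""
    return (max_item_bytes * 220 + 99) // 100


def max_containers(available_disk_bytes, median_item_bytes):
    """Median item + 20%, doubled for the WARC file, divided into the disk."""
    per_container = (median_item_bytes * 240 + 99) // 100
    return available_disk_bytes // per_container


def memory_needed_mb(containers, per_container_mb=50):
    """Rough memory estimate, using the ~50 MB/container Docker Warrior figure."""
    return containers * per_container_mb


def startup_batches(planned, stages=4):
    """Split the planned containers into roughly equal batches (quarters by default)."""
    base, extra = divmod(planned, stages)
    return [base + (1 if i < extra else 0) for i in range(stages)]


if __name__ == "__main__":
    # Hypothetical numbers: 10 GB largest item, 1 GB median item, 100 GB free disk.
    needed = min_disk_bytes(10 * GB)
    count = max_containers(100 * GB, 1 * GB)
    print("minimum disk (GB):", needed // GB)
    print("containers:", count)
    print("memory (MB):", memory_needed_mb(count))
    print("start in batches of:", startup_batches(count))
```

Since workers get stuck on the largest jobs, treat the container count as an upper bound rather than a target, and wait 15-30 minutes between batches.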

Specific advice for this project that would have been useful:
- Recommend that people keep two categories of group-joiner accounts: one for public groups, and one for private groups we hope to preserve (project specific).

Technical improvements:
- Queue management is super important.
- We need a way to add new tracker capacity as needed.
- Make reports.pl part of the Dockerfile and run it automatically.
- Instructions don’t *exactly* work for Python:
       - apt install pip3
       - location of the run-pipeline3
       - might need to pip3 install setuptools first to get requirements.txt to run all the way

(probably most important) Organizational improvements:

- Have clear information available to all participants about who is in charge of what, who can perform operations (changing limits on trackers/targets, seeing and resetting jobs), what times they’re generally available, and their contact information (if not publicly distributed, then at least shared with two or three others who can act as alternate PoCs).
- Determine the maximum number of clients the tracker(s) can feasibly serve (it’s counterproductive for too many clients to be “competing” for jobs).