Talk:Yahoo! Groups

From Archiveteam

LeighRoberts' notes and first impressions

Individual Warrior runner advice

  • Find out the size of the largest item. Your available disk must be at least twice that size, plus 10%.
  • Find out the median item size. Add 20%, then multiply by 2 to allow for the WARC file. Divide your available disk space by the result to estimate how many workers you can run at once.
  • Workers will get stuck on the largest jobs, so even that sizing might not be conservative enough.
  • The Docker Warrior needed about 50 MB of memory per container. I did not get figures for yahoo-group-archiver, but I imagine it would be similar.
  • Scale slowly - don’t fire off all your containers at once, because they all need to load the same requirements.
  • Start about 1/4 of your planned containers and wait 15-30 minutes to check CPU and memory use before starting the next 1/4.
  • Check your disk space regularly
  • Grafana setup advice for this?
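
The sizing and staged-start advice above can be sketched in Python. This is only an illustration, not part of the Warrior or yahoo-group-archiver: the function names are made up, "twice that big + 10%" is read here as 2 × 1.10 × the largest item, and the four stages come from the "start about 1/4" suggestion. Sizes can be in any unit as long as it is used consistently.

```python
def min_disk(max_item: int) -> int:
    """Minimum disk for the largest item: twice its size plus 10% (assumed reading)."""
    # Integer arithmetic (x * 22 / 10, rounded up) avoids float rounding surprises.
    return (max_item * 22 + 9) // 10


def max_concurrent_workers(disk: int, median_item: int, max_item: int) -> int:
    """Estimate how many workers fit on the disk.

    Per-worker budget is the median item + 20%, doubled for the WARC file,
    i.e. 2.4 x the median. Returns 0 if the disk cannot hold the largest item.
    """
    if median_item <= 0:
        raise ValueError("median_item must be positive")
    if disk < min_disk(max_item):
        return 0
    return (disk * 10) // (median_item * 24)


def rollout_batches(planned: int, stages: int = 4) -> list[int]:
    """Split the planned container count into roughly equal start-up batches."""
    base, extra = divmod(planned, stages)
    batches = [base + (1 if i < extra else 0) for i in range(stages)]
    return [b for b in batches if b > 0]  # drop empty stages for small counts


if __name__ == "__main__":
    # Hypothetical figures in GB: 240 free, 10 median item, 50 largest item.
    workers = max_concurrent_workers(disk=240, median_item=10, max_item=50)
    print(workers, rollout_batches(workers))
```

With those hypothetical figures the estimate is 10 workers, started in batches of 3, 3, 2, and 2, with a 15-30 minute check of CPU and memory between batches. Since workers tend to get stuck on the largest jobs, treat the number as an upper bound rather than a target.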

Specific advice for this project that would have been useful

  • Recommend that people keep two categories of group-joiner accounts: one for public groups and one for private groups we hope to preserve (project specific)
  • Make the maximum object size visible to all participants (Grafana?)

Technical improvements

  • Queue management is super important
  • A way to add new tracker capacity as needed
  • Make reports.pl part of the Dockerfile and run automatically
  • Instructions don’t *exactly* work for Python
    • apt install python3-pip (on Debian/Ubuntu this is the package that provides pip3; there is no package named pip3)
    • location of the run-pipeline3
    • might need to pip3 install setuptools first to get requirements.txt to run all the way
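
As a rough pre-flight check for the Python issues above, the following sketch reports whether pip and setuptools are importable before requirements.txt is installed. The helper name missing_modules is hypothetical, and the check only confirms that the modules can be found, not that their versions are suitable.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["pip", "setuptools"])
    if missing:
        print("Install these before installing requirements.txt:", ", ".join(missing))
    else:
        print("pip and setuptools are importable")
```

Run it with the same python3 interpreter that will run the pipeline, since a different interpreter may see a different set of installed packages.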

(probably most important) Organizational improvements

  • Have clear information available to all participants about who is in charge of what, who can perform operations (changing limits on trackers/targets, seeing and resetting jobs), when they are generally available, and how to contact them (if contact details are not publicly distributed, share them with at least two or three others who can act as alternate PoCs)
  • Determine the maximum number of clients the tracker(s) can feasibly serve (it’s counterproductive and demoralizing for too many clients to be “competing” for jobs)