Difference between revisions of "Talk:Yahoo! Groups"

From Archiveteam
(LeighRoberts' initial impressions)
 
m (Fixed formatting)
 
Latest revision as of 09:08, 15 December 2019

LeighRoberts' notes and first impressions

Individual Warrior runner advice

  • Find out what the maximum item size is. Your available disk space must be at least twice that, plus 10%.
  • Find out what the median item size is. Add 20%, then multiply by 2 because of the WARC file. Divide your available disk space by the result to estimate how many items you can work on at once.
  • Workers will get stuck on the largest jobs, so even that sizing might not be conservative enough.
  • The Docker Warrior needed about 50 MB/container of memory. I did not get figures on yahoo-group-archiver, but I imagine it would be similar.
  • Scale slowly - don’t fire off all your containers at once, because they all need to load the same requirements.
  • Start about 1/4 of your planned containers and wait 15-30 minutes to check CPU and memory use before starting the next 1/4.
  • Check your disk space regularly
  • Grafana setup advice for this?
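
The sizing rules above can be worked through as a short calculation. This is only a sketch of one reading of the notes: "twice that big + 10%" is taken to mean 2 × max × 1.10, the 50 MB per container figure is the rough one quoted above, and all function names are made up for illustration.

```python
import shutil
from typing import List


def min_disk_for_largest_item(max_item_bytes: float) -> float:
    """Disk needed for the largest item: twice its size, plus 10% (assumed reading)."""
    return 2 * max_item_bytes * 1.10


def max_concurrent_items(free_bytes: float, median_item_bytes: float) -> int:
    """Free disk divided by (median + 20%) * 2; the factor of 2 covers the WARC file."""
    per_item = median_item_bytes * 1.20 * 2
    return int(free_bytes // per_item)


def estimated_memory_mb(containers: int, per_container_mb: int = 50) -> int:
    """Rough memory estimate using the ~50 MB/container figure from the notes."""
    return containers * per_container_mb


def rollout_batches(planned: int, stages: int = 4) -> List[int]:
    """Split the planned container count into roughly equal stages (about 1/4 each)."""
    base, extra = divmod(planned, stages)
    return [base + (1 if i < extra else 0) for i in range(stages)]


if __name__ == "__main__":
    GIB = 1024 ** 3
    free = shutil.disk_usage(".").free  # free bytes on the current volume
    median_item = 1 * GIB               # placeholder: look up the real median
    largest_item = 20 * GIB             # placeholder: look up the real maximum

    print("largest item fits:", free >= min_disk_for_largest_item(largest_item))
    workers = max_concurrent_items(free, median_item)
    print("workers by disk:", workers)
    print("memory estimate (MB):", estimated_memory_mb(workers))
    print("start in batches of:", rollout_batches(workers))
```

Between batches, wait the 15-30 minutes suggested above and check CPU, memory and free disk before starting the next one. Since workers can get stuck on the largest jobs, treat the worker count as an upper bound rather than a target.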

Specific advice for this project that would have been useful

  • Recommend that people keep two categories of group-joiner accounts: one for public groups and one for private groups we hope to preserve (project specific)
  • Make the maximum object size available to all (Grafana?)

Technical improvements

  • Queue management is super important
  • We need a way to add new tracker capacity as needed
  • Make reports.pl part of the Dockerfile and run it automatically
  • Instructions don’t *exactly* work for Python
    • pip3 has to be installed through apt first (on Debian and Ubuntu the package is named python3-pip)
    • location of the run-pipeline3
    • you might need to pip3 install setuptools first so that installing from requirements.txt runs all the way through

(probably most important) Organizational improvements

  • Have clear information available to all participants about who is in charge of what, who can perform operations (changing limits on trackers/targets, seeing and resetting jobs), what times they’re generally available, and how to contact them (if contact information is not publicly distributed, share it with at least two or three others who can act as alternate PoCs)
  • Determine the maximum number of clients the tracker(s) can feasibly serve (it’s counterproductive and demoralizing for too many clients to be “competing” for jobs)