Difference between revisions of "Talk:Yahoo! Groups"
LeighRoberts (talk | contribs) (LeighRoberts' initial impressions) |
LeighRoberts (talk | contribs) m (Fixed formatting) |
Latest revision as of 09:08, 15 December 2019
LeighRoberts' notes and first impressions
Individual Warrior runner advice
- Find out the size of the largest item. Your available disk space must be at least twice that size, plus 10%.
- Find out the size of the median item. Add 20%, then multiply by 2 because of the WARC file. Divide your available disk space by the result to estimate how many items you can work on at once.
- Workers will get stuck on the largest jobs, so even that sizing might not be conservative enough.
- The Docker Warrior needed about 50 MB/container of memory. I did not get figures on yahoo-group-archiver, but I imagine it would be similar.
- Scale slowly - don’t fire off all your containers at once, because they all need to load the same requirements.
- Start about 1/4 of your planned containers and wait 15-30 minutes to check CPU and memory use before starting the next 1/4.
- Check your disk space regularly
- Grafana setup advice for this?
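
The sizing and scaling rules in the list above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names are made up for this note, and "twice that big + 10%" is read here as 2 x 1.10 = 2.2 times the largest item, which is an assumption about what was meant. Sizes are in whatever unit you pass in (bytes, MB, GB), and integer math is used to avoid rounding surprises.

```python
import shutil


def min_disk_for_largest_item(max_item_size: int) -> int:
    """Twice the largest item plus 10% (assumed to mean 2 * 1.10), rounded up."""
    return (max_item_size * 220 + 99) // 100


def per_item_footprint(median_item_size: int) -> int:
    """Median item plus 20%, doubled for the WARC file, rounded up."""
    return (median_item_size * 240 + 99) // 100


def max_concurrent_items(available_disk: int, median_item_size: int) -> int:
    """How many median-sized items fit in the available disk at once."""
    return available_disk // per_item_footprint(median_item_size)


def staged_batches(planned_containers: int, stages: int = 4) -> list[int]:
    """Split the planned containers into roughly equal batches (quarters by default)."""
    base, extra = divmod(planned_containers, stages)
    batches = [base + (1 if i < extra else 0) for i in range(stages)]
    return [b for b in batches if b > 0]


def free_disk_bytes(path: str = ".") -> int:
    """Free space on the filesystem holding `path`, for regular checks."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    # Hypothetical numbers, in MB, purely for illustration.
    print("Minimum disk for largest item:", min_disk_for_largest_item(10_000))
    print("Concurrent items on 500 GB:", max_concurrent_items(500_000, 2_000))
    print("Start containers in batches of:", staged_batches(10))
    print("Free bytes here:", free_disk_bytes("."))
```

Because workers tend to get stuck on the largest jobs, treat the concurrent-item figure as an upper bound rather than a target, and wait 15-30 minutes between batches to check CPU and memory use.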
Specific advice for this project that would have been useful
- Recommend that people keep two categories of group-joiner accounts: one for public groups, and one for private groups we hope to preserve (project specific)
- Maximum object size available to all (Grafana?)
Technical improvements
- queue management is super important
- way to add new tracker capacity as needed
- Make reports.pl part of the Dockerfile and run automatically
- Instructions don’t *exactly* work for Python:
  - apt install pip3
  - location of the run-pipeline3
  - might need to pip3 install setuptools first to get requirements.txt to run all the way
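
One way to catch the setuptools problem before it interrupts the requirements.txt install is a small preflight check. The sketch below is not part of the project's instructions; `missing_modules` is a hypothetical helper name, and it only reports whether `pip` and `setuptools` are importable by the Python interpreter that runs it.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["pip", "setuptools"])
    if missing:
        print("Install these before running requirements.txt:", ", ".join(missing))
    else:
        print("pip and setuptools are both importable.")
```

Run it with the same python3 that will run the pipeline, since a different interpreter may have a different set of installed packages.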
(probably most important) Organizational improvements
- make clear information available to all participants: who is in charge of what; who can perform operations (changing limits on trackers/targets, seeing and resetting jobs); when they are generally available; and how to contact them. If contact information is not distributed publicly, share it with at least two or three others who can act as alternate PoCs.
- determine the maximum number of clients the tracker(s) can feasibly serve (it’s counterproductive and demoralizing for too many clients to be “competing” for jobs)