@nraynaud Bizarre, isn't it? I'm very grateful for your efforts.
Some more news: concurrency seems to be one big challenge. Things improve dramatically if concurrency is set to 1; as soon as something else is running in parallel, we run into the socket failures. I'm going to try your change on another box to see if the outcomes are different, but in summary, here is what I'm seeing so far:
- Concurrency = 1: works fine the first time, then fails occasionally (20% of the time?) on subsequent runs
- Concurrency > 1: almost impossible to get it to run; sometimes one or two VMs back up OK, but not enough to be predictable, and never the entire group
- After any failure: impossible to get a clean run again until the S3 target has been cleaned entirely
So it appears that there is perhaps some kind of lock contention when more than one stream is running, and additionally some kind of conflict when a run has terminated prematurely and the target is therefore not in its expected state on the next run.