DustyArmstrong

DustyArmstrong

@planedrop Not opposed to cloud of course, but it's a network with no internet!

DustyArmstrong

@julien-f Thank you and thank you for the quick resolution, you guys rock.

DustyArmstrong

@olivierlambert Yes but didn't see any changes for backup/logs which is where the issue seemed to arise from. I pull the status from each log entry not the job info itself.

@julien-f awesome, thanks!

DustyArmstrong

Testing the agent out on Arch Linux (mainly due to the spotty 'support' in the AUR/generally) and it is working fine - better than what I had before (which did not report VM info properly). I've set it up as a systemd service to replace the previous one I had, also working as expected.

This would be fun to contribute towards.

DustyArmstrong

For anyone who comes across this, you can just add an exception for your management page and shift will work on the console.

Settings > Privacy & Security > Enhanced Tracking Protection > Manage Exceptions > Add the site url e.g. https://xo.fqdn.com.

DustyArmstrong

Testing the agent out on Arch Linux (mainly due to the spotty 'support' in the AUR/generally) and it is working fine - better than what I had before (which did not report VM info properly). I've set it up as a systemd service to replace the previous one I had, also working as expected.

This would be fun to contribute towards.

DustyArmstrong

@Berrick Ah cool, I was anticipating that doing it live would've caused boot issues but I think I've just got a bug in my head about how it actually works, I'm conflating two separate things based on an incorrect interpretation of a previous experience I had once with XCP.

Thanks!

DustyArmstrong

@Berrick Thanks for your post. Did you export the VHD, run the commands against it then re-import it, or did you just do this on the live VM? Presumably you exported and re-imported like Andrew did. I think I understand what the process would be either way, but it would still be good to confirm with someone who already had success what your sum total steps were.

DustyArmstrong

For anyone who comes across this, you can just add an exception for your management page and shift will work on the console.

Settings > Privacy & Security > Enhanced Tracking Protection > Manage Exceptions > Add the site url e.g. https://xo.fqdn.com.

DustyArmstrong

Updated all my hosts but ended up with a bunch of stuck tasks for API host calls, didn't seem too healthy! It looks like they were stuck, kept seeing a host unhealthy power state repeatedly pop up and disappear.

I opted to select all tasks and delete, same with my logs (I monitor externally anyway) which appears to have resolved this for the moment. I no longer see these mdadm logs being generated and everything appears normal.

DustyArmstrong

@stormi thanks for the reply, the output is (on both hosts):

mdadm: cannot open /dev/md127

I do have a 3rd host that does make use of a software RAID, but that also outputs nothing for /dev/md127.

I am updating the hosts today so it's possible they're just so far behind.

DustyArmstrong

I'm getting tons of Mdadm errors from Xen Orchestra, but not really sure why.

host.getMdadmHealth
{
  "id": "d2de9e76-ffbf-4640-9d68-43178c7c4006"
}
{
  "code": "-1",
  "params": [
    "Command '['mdadm', '--detail', '/dev/md127']' returned non-zero exit status 1",
    "",
    "Traceback (most recent call last):
  File \"/etc/xapi.d/plugins/xcpngutils/__init__.py\", line 101, in wrapper
    return func(*args, **kwds)
  File \"/etc/xapi.d/plugins/raid.py\", line 21, in check_raid_pool
    result = run_command(['mdadm', '--detail', device])
  File \"/etc/xapi.d/plugins/xcpngutils/__init__.py\", line 70, in run_command
    raise subprocess.CalledProcessError(process.returncode, command, None)
CalledProcessError: Command '['mdadm', '--detail', '/dev/md127']' returned non-zero exit status 1
"
  ],
  "task": {
    "uuid": "34429da6-56ee-9b5c-c465-b0493920b3f4",
    "name_label": "Async.host.call_plugin",
    "name_description": "",
    "allowed_operations": [],
    "current_operations": {},
    "created": "20250117T09:42:09Z",
    "finished": "20250117T09:42:09Z",
    "status": "failure",
    "resident_on": "OpaqueRef:f0015d71-0ac1-4a79-bf0d-3700f79ba394",
    "progress": 1,
    "type": "<none/>",
    "result": "",
    "error_info": [
      "-1",
      "Command '['mdadm', '--detail', '/dev/md127']' returned non-zero exit status 1",
      "",
      "Traceback (most recent call last):
  File \"/etc/xapi.d/plugins/xcpngutils/__init__.py\", line 101, in wrapper
    return func(*args, **kwds)
  File \"/etc/xapi.d/plugins/raid.py\", line 21, in check_raid_pool
    result = run_command(['mdadm', '--detail', device])
  File \"/etc/xapi.d/plugins/xcpngutils/__init__.py\", line 70, in run_command
    raise subprocess.CalledProcessError(process.returncode, command, None)
CalledProcessError: Command '['mdadm', '--detail', '/dev/md127']' returned non-zero exit status 1
"
    ],
    "other_config": {},
    "subtask_of": "OpaqueRef:NULL",
    "subtasks": [],
    "backtrace": "(((process xapi)(filename ocaml/xapi-client/client.ml)(line 7))((process xapi)(filename ocaml/xapi-client/client.ml)(line 19))((process xapi)(filename ocaml/xapi-client/client.ml)(line 8780))((process xapi)(filename lib/xapi-stdext-pervasives/pervasiveext.ml)(line 24))((process xapi)(filename ocaml/xapi/rbac.ml)(line 205))((process xapi)(filename ocaml/xapi/server_helpers.ml)(line 95)))"
  },
  "message": "-1(Command '['mdadm', '--detail', '/dev/md127']' returned non-zero exit status 1, , Traceback (most recent call last):
  File \"/etc/xapi.d/plugins/xcpngutils/__init__.py\", line 101, in wrapper
    return func(*args, **kwds)
  File \"/etc/xapi.d/plugins/raid.py\", line 21, in check_raid_pool
    result = run_command(['mdadm', '--detail', device])
  File \"/etc/xapi.d/plugins/xcpngutils/__init__.py\", line 70, in run_command
    raise subprocess.CalledProcessError(process.returncode, command, None)
CalledProcessError: Command '['mdadm', '--detail', '/dev/md127']' returned non-zero exit status 1
)",
  "name": "XapiError",
  "stack": "XapiError: -1(Command '['mdadm', '--detail', '/dev/md127']' returned non-zero exit status 1, , Traceback (most recent call last):
  File \"/etc/xapi.d/plugins/xcpngutils/__init__.py\", line 101, in wrapper
    return func(*args, **kwds)
  File \"/etc/xapi.d/plugins/raid.py\", line 21, in check_raid_pool
    result = run_command(['mdadm', '--detail', device])
  File \"/etc/xapi.d/plugins/xcpngutils/__init__.py\", line 70, in run_command
    raise subprocess.CalledProcessError(process.returncode, command, None)
CalledProcessError: Command '['mdadm', '--detail', '/dev/md127']' returned non-zero exit status 1
)
    at Function.wrap (file:///home/node/xen-orchestra/packages/xen-api/_XapiError.mjs:16:12)
    at default (file:///home/node/xen-orchestra/packages/xen-api/_getTaskResult.mjs:13:29)
    at Xapi._addRecordToCache (file:///home/node/xen-orchestra/packages/xen-api/index.mjs:1068:24)
    at file:///home/node/xen-orchestra/packages/xen-api/index.mjs:1102:14
    at Array.forEach (<anonymous>)
    at Xapi._processEvents (file:///home/node/xen-orchestra/packages/xen-api/index.mjs:1092:12)
    at Xapi._watchEvents (file:///home/node/xen-orchestra/packages/xen-api/index.mjs:1265:14)"
}

Neither host with ID 2de9e76-ffbf-4640-9d68-43178c7c4006 or f0015d71-0ac1-4a79-bf0d-3700f79ba394 are using a software RAID. It may be because I haven't updated the hosts in quite some time. There is no output on either host for cat /proc/mdstat.

Is there a way I can just turn off this check?

DustyArmstrong

An update, if anyone ever comes across this via search engine.

Turns out it was my container's timezone. The image was set to pure UTC, no timezone, by default, so I believe when it was writing files to my network storage it introduced a discrepancy. My network share was recording the file metadata accurately to real-time, so I assume when it came time to do another backup, the file time XO expected was different, making it think it was "stale" or still being "held".

Have now run both scheduled metadata and VM backups without any errors .

In summary: make sure your time, date and timezone are set correctly!

DustyArmstrong

@magran17 thanks Mark.

My config backup runs on a Tuesday and my VMs Friday night, so that happened last night. It did fail at first with the lockfile error as expected but then was successful on the retry. My concurrency is currently set to 2, I did have it on 1 originally but it doesn't seem to make a difference.

I use Ronivay's image too, it seems to work but yeah just these random 3 errors that I can only get rid of by blowing away all my backups/schedules and starting the chain(s) again.

I'm not really sure why it happens, I can only assume rebooting/updating causes some sort of cache to break in the way I have it set up. I am running it very unintended way (Raspi4 ARM64 using Binfmt emulation of x86) so I can't really expect perfection. It's slightly slow but it works super well other than this!

DustyArmstrong

Update: this seems to happen every time I reboot the server or, in particular, update XO. I get the same 3 errors and have to rebuild my backup schedules from scratch each time. Once rebuilt they run perfectly until the next time I update. It may be because I run it in Docker, I'm not sure, but I'd love to understand what causes this and if there's any way to rectify without the rebuild. I don't really understand it and would appreciate any insight.

I get the following 3 problems every time.

EEXIST - this happens on my configuration backups.

Error: EEXIST: file already exists, open '/run/xo-server/mounts/f5bb7b65-ddea-496b-b193-878f19ba137c/xo-config-backups/d166d7fa-5101-4aff-9e9d-11fb58ec1694/20240819T140003Z/data.json'

ENOENT - this also happens on my configuration backups, on the same job.

Error: ENOENT: no such file or directory, rmdir '/run/xo-server/mounts/f5bb7b65-ddea-496b-b193-878f19ba137c/xo-pool-metadata-backups/d166d7fa-5101-4aff-9e9d-11fb58ec1694/ff3e6fa0-6552-e96a-989c-fc8db748d984/20240729T140002Z'

LOCKFILE HELD - This happens on my VM incremental backups. This log is from a prior run a while ago, but I expect my next run will do this as I rebooted.

>> the writer IncrementalRemoteWriter has failed the step writer.beforeBackup() with error Lock file is already being held. It won't be used anymore in this job execution.
Retry the VM backup due to an error
the writer IncrementalRemoteWriter has failed the step writer.beforeBackup() with error Lock file is already being held. It won't be used anymore in this job execution.

Start: 2024-06-29 01:01
End: 2024-06-29 01:41
Duration: 41 minutes
Error: Lock file is already being held

I only have one schedule for config and one schedule for VMs. The files for the config backup don't change, I don't reboot or anything mid-backup, but it seems to totally break the chain. For the VMs, I only have one backup schedule so there should never be another job running which has the lockfile held. Something about restarting the container causes an issue - it feels like something is being cached here but the cache isn't flushed on restart so it leaves some sort of zombified file(s) behind.

DustyArmstrong

@DustyArmstrong

Best posts made by DustyArmstrong

Latest posts made by DustyArmstrong