VM shuts down on the 24th of every month

asai

Greetings,

We have a VM that has hard shut down on the 24th of every month for the last 3 months. There's nothing in the crontab (CentOS 7), and the Xen logs show entries like the ones below. Can anyone give us some advice on where to start looking for solutions here? Thank you for any assistance you can render.

Feb 24 23:59:42 monota xenopsd-xc: [debug|monota|7 ||xenops] EPOLL error on domain-17, close QMP socket
Feb 24 23:59:43 monota xenopsd-xc: [debug|monota|17 |Parallel:task=3310620.atoms=2.(VBD.unplug vm=6e91bf5a-6f02-4597-6e37-30698b4e539c)|xenops] Device.Generic.hard_shutdown about to blow away backend and error paths
Feb 24 23:59:43 monota xenopsd-xc: [debug|monota|15 |Parallel:task=3310620.atoms=2.(VBD.unplug vm=6e91bf5a-6f02-4597-6e37-30698b4e539c)|xenops] Device.Generic.hard_shutdown about to blow away backend and error paths
Feb 24 23:59:43 monota xenopsd-xc: [debug|monota|17 |Parallel:task=3310620.atoms=2.(VBD.unplug vm=6e91bf5a-6f02-4597-6e37-30698b4e539c)|xenops] xenstore-rm /local/domain/0/error/backend/vbd3/17
Feb 24 23:59:43 monota xenopsd-xc: [debug|monota|17 |Parallel:task=3310620.atoms=2.(VBD.unplug vm=6e91bf5a-6f02-4597-6e37-30698b4e539c)|xenops] xenstore-rm /local/domain/17/error/device/vbd/768
Feb 24 23:59:43 monota xenopsd-xc: [debug|monota|15 |Parallel:task=3310620.atoms=2.(VBD.unplug vm=6e91bf5a-6f02-4597-6e37-30698b4e539c)|xenops] xenstore-rm /local/domain/0/error/backend/vbd3/17
Feb 24 23:59:43 monota xenopsd-xc: [debug|monota|15 |Parallel:task=3310620.atoms=2.(VBD.unplug vm=6e91bf5a-6f02-4597-6e37-30698b4e539c)|xenops] xenstore-rm /local/domain/17/error/device/vbd/5696
Feb 24 23:59:44 monota xenopsd-xc: [debug|monota|10 |events|xenops] Device.Generic.hard_shutdown about to blow away backend and error paths
Feb 24 23:59:44 monota xenopsd-xc: [debug|monota|10 |events|xenops] xenstore-rm /local/domain/0/error/backend/vif/17
Feb 24 23:59:44 monota xenopsd-xc: [debug|monota|10 |events|xenops] xenstore-rm /local/domain/17/error/device/vif/0
Feb 24 23:59:45 monota xapi: [debug|monota|561 |xapi events D:34fe87c9d9cc|helpers] Helpers.call_api_functions failed to logout: Server_error(SESSION_INVALID, [ OpaqueRef:ce31d273-1eaf-440b-8632-69a6282b278c ]) (ignoring)
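
One way to check whether the shutdowns really cluster on one day of the month is to scan the host log for hard_shutdown entries and count them per date. Below is a minimal Python sketch, assuming syslog-style lines like the ones above. The function name hard_shutdown_days is made up for illustration and is not part of any Xen tooling; the sample lines are copied from the excerpt above.

```python
import re
from collections import Counter

# Matches the leading "Mon DD HH:MM:SS" syslog timestamp seen in the excerpt.
TIMESTAMP = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s")


def hard_shutdown_days(lines):
    """Count log lines mentioning hard_shutdown, keyed by (month, day).

    Hypothetical helper: it only looks at the text of each line and
    assumes the timestamp format shown in the excerpt.
    """
    counts = Counter()
    for line in lines:
        if "hard_shutdown" not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            month, day, _clock = match.groups()
            counts[(month, int(day))] += 1
    return counts


# Three lines taken from the excerpt above; only two mention hard_shutdown.
sample = [
    "Feb 24 23:59:42 monota xenopsd-xc: [debug|monota|7 ||xenops] EPOLL error on domain-17, close QMP socket",
    "Feb 24 23:59:43 monota xenopsd-xc: [debug|monota|17 |Parallel:task=3310620.atoms=2.(VBD.unplug vm=6e91bf5a-6f02-4597-6e37-30698b4e539c)|xenops] Device.Generic.hard_shutdown about to blow away backend and error paths",
    "Feb 24 23:59:44 monota xenopsd-xc: [debug|monota|10 |events|xenops] Device.Generic.hard_shutdown about to blow away backend and error paths",
]

print(hard_shutdown_days(sample))
```

To run it against a real log, pass an open file object instead of the sample list. Rotated logs would have to be fed in separately, and syslog timestamps carry no year, so entries from different years would be merged under the same (month, day) key.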

olivierlambert

Do you have any logs in the guest operating system?

Danp

Are you using XO? Any backups or jobs scheduled to run only on the 24th?

asai

@danp and @olivierlambert,

Thanks for the response. There are no logs in the guest VM, just a cutoff in /var/log/messages.

I am using XO with Delta Backups, but over the last 3 months they did not consistently run on the 24th; only last month did a backup happen on the 24th. They're scheduled for Mon., Wed., and Fri.

Yes, this is a very odd thing. I've been running Xen VMs since 2007 and have never seen this kind of thing.

D_J

@asai Did you ever figure it out? Just happened to me today (oddly enough, on the 24th) with the same messages in the log...

asai

@d_j, man that's weird.

It hasn't happened again, but it happened 3 months in a row. Dec. - Feb.

Super weird.

D_J

@asai Thanks for the response! Not only do I have the same messages, but the C drive was wiped. It was almost as if every file had been deleted except those that were open. The server has 6 other drives, which weren't affected. I suspect an issue with the Storage Repository, because it's the only drive on that repository. I do have a weekly backup that runs on Sundays through Xen Orchestra. It's extremely odd that it happened on the 24th; I was only googling the error message from the log! Ugh!

I ordered a new server ($$$$) which I'll transition all the VMs to and then I'll wipe and reload the current one and test/validate it from scratch since I don't trust it at this point.

dredknight

Hey everyone,
I bumped into this topic because we had a similar issue with one of our test VMs. A full day of logs is attached: logs.tgz.txt

We found out that on the 5th of April one of the VMs shut down. You can see the issues in the log after 14:10:00.

Run the following to see the specific messages:
grep error <the log file>
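
To narrow that grep output to the window mentioned above (after 14:10:00), a small filter along these lines might help. This is only a sketch: it assumes every line starts with a syslog-style "Mon DD HH:MM:SS" prefix, the function name errors_after is hypothetical, and the sample lines are made up for illustration, not taken from the attached logs.

```python
from datetime import time


def errors_after(lines, cutoff=time(14, 10, 0)):
    """Return lines containing 'error' whose HH:MM:SS stamp is at or after cutoff.

    Case-sensitive substring match, like plain `grep error`. Assumes the
    third whitespace-separated field is the HH:MM:SS timestamp.
    """
    hits = []
    for line in lines:
        if "error" not in line:
            continue
        fields = line.split()
        if len(fields) < 3:
            continue
        try:
            stamp = time.fromisoformat(fields[2])
        except ValueError:
            continue  # third field is not an HH:MM:SS timestamp
        if stamp >= cutoff:
            hits.append(line)
    return hits


# Made-up lines for illustration only (not from the real log).
sample = [
    "Apr  5 14:09:59 host demo: error before the window (made-up line)",
    "Apr  5 14:10:05 host demo: error inside the window (made-up line)",
    "Apr  5 14:12:00 host demo: nothing to see (made-up line)",
]

for hit in errors_after(sample):
    print(hit)
```

The comparison uses only the time of day, so it is meant for a single day's log such as the one attached; a multi-day file would also need the date fields checked.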

We are still investigating and are not yet sure what the problem is, but it seems to be related to storage.

XCP is on the latest version, 8.2.1.

dredknight

Reuploaded the logs because tgz was not an allowed format. I added a .txt extension, so just remove it and extract.

olivierlambert

Hi,

I'm not sure it's related at all. Do you have HA enabled in your pool? How many hosts do you have? What's your shared storage?

dredknight

@olivierlambert this is just 1 single host in a single cluster. Only local SSD storage.

It is managed by CloudStack. We are doing high-performance tests on that server, so there is nothing of value on it. I thought the logs could help find the issue, if one actually exists.

olivierlambert

It's hard to tell with just one log file. I can see that the VM is ordered to shut down (which doesn't sound like a problem), but I can't tell why.

dredknight

@olivierlambert we couldn't find a reason either.
We will run more tests in the following weeks and report if we find anything of value.