@DustinB I forgot to mention that I did look for fan firmware, and I see nothing in Dell's downloads for the R630 that indicates there is any fan-related firmware at all. That's why I started trying to tweak the settings in the BIOS and iDRAC related to power and cooling, to see if I could get it to go back to the way it was.
-
RE: Seeking advice on debugging unexplained change in server fan speed
@DustinB Nothing useful yet. I rebooted the servers and explored the BIOS a bit to see if there were any relevant settings, or at least to tweak some things and see if that would reset whatever went wrong in the mid-December reboot. While doing that I found that one of the two impacted servers was a version behind on both the BIOS and the iDRAC, so I updated both. Unfortunately, that made no change to the fan speeds.
I've been out sick all of this week so far, but I'll be looking into this more when I get back to the office. I've read about ways to manually control the fans, but I'd rather not depend on a script running somewhere that makes those kinds of decisions; I'd much rather have iDRAC, or whatever normally controls the fans, handle it like it used to.
-
RE: Seeking advice on debugging unexplained change in server fan speed
@DustinB I wish I had asked the question here earlier. I asked it a little while ago on ServerFault.com, figuring that was the best place for this question since it has nothing to do with XCP-ng. Nobody has answered and one person even downvoted it without saying why.
If you use ServerFault and you answer over there, I'll mark it as an answer if this works, so you can get some internet points.
https://serverfault.com/questions/1169753/what-might-cause-server-fans-to-double-in-rpm-after-a-simple-reboot
-
RE: Seeking advice on debugging unexplained change in server fan speed
@DustinB Interesting, I'll see if there's fan firmware I can update. It's so strange that they were fine and a reboot made them do this. One of the systems is running the fans at full speed, which gives them a high-pitched whine; it's rather annoying, and probably not great for the fans either, I imagine.
-
Seeking advice on debugging unexplained change in server fan speed
Back in mid-December I came into the office on a weekend to test my power-outage handling via NUT. I unplugged the UPSs, monitored all the VMs getting shut down and then the servers shut down. I never allowed the UPSs to totally lose power, just kept them unplugged long enough to trigger the server shutdown.
I have two PowerEdge R630s and one R730. When I rebooted the servers, the R630s seemed louder than normal. That's typical on startup, but they continued to be louder once booted. The R730 did not seem any different.
I have LibreNMS set up to monitor the servers, and the fan-speed graphs confirmed my impression that they're louder. The fan speeds have doubled on one server and quadrupled on the other, but the CPU workload has not changed at all.
The other server is even more dramatic and it is the more lightly loaded of the servers.
As you can see, the fan speeds have remained high ever since the reboot. Over this last weekend we had a power outage, so the servers shut down. After rebooting, the fans are still running fast, so it wasn't just a matter of needing another reboot to fix this.
LibreNMS isn't capturing CPU usage for some reason but here's the CPU usage from XO. It has not changed significantly in months.
The system board and CPU temps dropped at the same time of course, with all that extra airflow. Note, those temps are in F, not C.
Any ideas of things to look for in the BIOS, iDRAC, and/or LibreNMS that might indicate why this changed? There were no BIOS or other updates associated with that December reboot, and another reboot has not changed it back. Is there possibly a BIOS setting that tells the server to run the fans at full speed, and could it somehow have changed on its own?
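In case it helps, these are the kinds of spot checks I've been doing from a shell, assuming ipmitool and racadm can reach the iDRAC. The IP and credentials below are placeholders, and the racadm attribute group name may differ on iDRAC8.

```
# Read current fan RPMs and temperatures over IPMI (IP/credentials are placeholders).
ipmitool -I lanplus -H 192.168.0.120 -U root -P 'calvin' sdr type fan
ipmitool -I lanplus -H 192.168.0.120 -U root -P 'calvin' sdr type temperature

# Check the iDRAC thermal/fan settings; the exact attribute group may vary by iDRAC version.
racadm -r 192.168.0.120 -u root -p 'calvin' get system.thermalsettings
```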
Our servers are near our offices so this significant increase in sound output annoys people. I don't mind when servers are loud because they need to be loud but doubling the noise without any reason is quite annoying.
-
RE: Why does emergencyShutdown take a lot longer than shutting down the host from the console?
@manilx For `host.stop`, what does `bypassBackupCheck` do? I think `bypassEvacuate` is pretty clear: that must mean "don't try to migrate running VMs to another server before stopping." I feel like I would want to bypass whatever the backup check is, because I've lost power and the server needs to be shut down.
-
RE: Why does emergencyShutdown take a lot longer than shutting down the host from the console?
@manilx That is great to hear, thank you.
I will definitely share it with the community. Still trying to iron out some of the wrinkles. Testing requires me to let all my servers get shut down so I'm limited in how frequently I can test the solution.
This weekend was my third pull-the-plug test and it was the closest to totally working. In fact, I think this test did totally work, but due to network changes I had to shut down xcp anyway, because I've found it gets really mad if you change anything about its network while it's running. It was while using the physical console to shut it down that I was shocked at how fast it powered off, and that's why I posted to ask about the difference.
I think changing my script to just use `host.stop` will resolve my last concerns about the script. Having a faster shutdown for xcp might also let me go back to my original design, where I let the more important VMs live a bit longer. I originally staged when VMs got shut down so the important ones could survive a 5 or 10 minute power outage. It turns out that with xcp taking 8 minutes to shut down after the VMs were down, I had to change my script to start closing everything as soon as it was clear this wasn't just a small power blip.

When I post it, though, everyone will need to recognize that I'm no bash coder. I write code, but this is the only bash stuff I've done, so it could be rough.
-
RE: Why does emergencyShutdown take a lot longer than shutting down the host from the console?
@olivierlambert I understood what `emergencyShutdownHost` does; I was surprised that it seems to take a long time even when all the VMs were stopped before executing it. There should be nothing to suspend, but it still takes about 8 minutes for the server to finish the shutdown.

I will start using `host.stop` instead of `host.emergencyShutdownHost` for the hosts that have no running VMs. Realistically, when having NUT shut it down, I'd rather the host just issue a clean shutdown command to any running VMs. I'm not sure what `host.stop` will do if there is a running VM: if it would politely ask it to stop, then that would be perfect; if it yanks the virtual power cord, then I wouldn't like that.
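For the hosts that are already empty, this is the rough shape of what I'm planning to drop into the NUT script. It's only a sketch: it assumes xo-cli is already registered against the XO instance, the host UUID is a placeholder, and I haven't confirmed whether the boolean flag needs the json: prefix.

```
# Sketch of the host-shutdown step in my NUT script (untested).
# HOST_UUID is a placeholder; xo-cli must already be registered.
HOST_UUID="00000000-0000-0000-0000-000000000000"

# Skip evacuation since the VMs are already shut down; not sure whether a plain
# "true" works here or whether the json: prefix is required.
xo-cli host.stop id="$HOST_UUID" bypassEvacuate=json:true
```

If that behaves, the script would just loop over the empty hosts and call this before the final local shutdown.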
The ideal is that no host has a running VM by the time I want to shut it down, but since NUT runs in a VM at the moment, one host will have that one running VM. I'll be in a bit of a race condition if I issue `host.stop` immediately followed by `shutdown now`. It's virtual murder-suicide, but the murderer's life depends on the murdered. Can Linux shut down before xcp kills it? Might be an interesting test.
-
RE: Performing automated shutdown during a power failure using a USB-UPS with NUT - XCP-ng 8.2
@nomad What do you mean by the grand reconfiguration? Is there a new version that changes how things work?
-
RE: Why does emergencyShutdown take a lot longer than shutting down the host from the console?
@olivierlambert No, I don't use any VM as an SR in XO. Other than the local storage SRs, DVD, removable media and such, the only SRs are hosted on a Synology array or on my UNRAID server.
-
Why does emergencyShutdown take a lot longer than shutting down the host from the console?
This weekend I was working on my servers and I needed to shut them down. I closed all the VMs, then from the physical console I used the UI to tell the host to shut down. It shut down in about 30 to 60 seconds and powered off.
When I test my NUT scripts, they shut down all the VMs and then issue the `host.emergencyShutdown` command, and it takes the hosts about 8 minutes to shut down.
Any reason for that difference? Is there a command I can issue through the xo-cli that would cause the faster shutdown?
Another advantage of the console shutdown is that it actually turns the power off, while emergencyShutdown shuts everything down but doesn't power off the hardware; at least, it never has for me.
-
RE: XO instance UI unreachable during backups
@Danp Ah, thank you for that.
I need to restructure things a bit but I was already thinking I would do that. The issue is that this VM also runs NUT so it's the last VM running before shutting down the servers. I reduced the memory because it takes a LONG time to suspend a VM with 16GB RAM but doesn't take long to shut one down. Between the 8 to 10 minutes it takes for XCP-ng to shut down and the time it takes to suspend a VM with 16GB RAM, I don't think my batteries will last that long.
I'll have to move NUT into a leaner VM that doesn't handle backups. That's something I was thinking I would do anyway, because if there were a power outage during a backup I don't think my NUT script would be able to do what it needs to do. Judging by the cron job I run to make sure my xo-cli registration is still good, the xo-cli calls won't run when the VM is hammered like this.
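(For context, that check is just something like the following; the log path is a placeholder and the exact xo-cli flags depend on the version installed.)

```
# Cron-driven sanity check: if xo-cli can't talk to the XO server, log it so I
# know the registration needs redoing. Log path is a placeholder.
if ! xo-cli --list-commands > /dev/null 2>&1; then
    echo "$(date): xo-cli registration check failed" >> /var/log/xo-cli-check.log
fi
```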
Thanks for helping me understand why this happens and how to fix it.
-
XO instance UI unreachable during backups
I've noticed recently that when my backups are running, they totally slam the CPUs and the web UI is inaccessible. What can I do to improve this? Should I give the instance more than 4 CPU cores? I used to give the instance 16GB RAM but it never went higher than 2GB so I reduced it. Could that cause this?
I can still SSH into the instance but I have little way to know how much of the backup is complete or which backups are finished. I've seen this multiple times a week for the last month or so.
This goes on for hours, and without the web UI I can't even gauge how much time might be left. I came in this weekend because I'm trying to improve my network setup, hopefully to help with things like this, but I can't shut down the servers and tear the network apart when XO is at some unknown point in the backup. I'm going to try to come back tomorrow and see if it's finished; well, I'll be smarter tomorrow and check the status from home first.
Currently running XO from source commit 1bc0f (two commits behind current due to taking a couple days off last week).
-
RE: NVIDIA Tesla M40 for AI work in Ubuntu VM - is it a good idea?
@gskger said in NVIDIA Tesla M40 for AI work in Ubuntu VM - is it a good idea?:
Nvidia RTX A2000 12GB
I am curious how your testing goes. That sounds like a great card. Not as expensive as the T4 so might be more reasonable for me to consider.
-
RE: Moving management network to another adapter and backups now fail
I'm bummed to hear that it isn't tolerant of changes. When I set up xcp originally, I gave it the first 10Gb port as the management interface, and that's on the main LAN, no VLAN. Now I want to move management off of the main LAN and onto a dedicated VLAN on the second 10Gb port. I've been nervous to make that change because I don't want to break something; it seems that concern was well founded. I was actually planning on posting today to ask how to best move the management interface into a VLAN on a separate port.
It feels like I just have to live with everything on the same port, and I won't be able to isolate the management or backup traffic like I want to. Maybe I could move the backups onto a separate VLAN, or does that traffic go through the management interface? I think I need to dive back into the docs.
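If I do eventually work up the nerve, my understanding is that the move would look roughly like this from the host CLI. This is completely untested on my part, and the UUIDs, VLAN tag, and addresses are placeholders.

```
# Create a network plus a tagged VLAN on the second 10Gb PIF (UUIDs are placeholders).
xe network-create name-label="mgmt-vlan"
xe vlan-create network-uuid=<new-network-uuid> pif-uuid=<second-10gb-pif-uuid> vlan=50

# Give the new VLAN PIF an address, then point management at it.
xe pif-reconfigure-ip uuid=<vlan-pif-uuid> mode=static IP=10.0.50.10 netmask=255.255.255.0 gateway=10.0.50.1
xe host-management-reconfigure pif-uuid=<vlan-pif-uuid>
```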
-
RE: How can I duplicate backup settings to a different XO instance? Should I?
@austinw I have three instances, all on different hosts, one on a host that isn't running XCP.
Is that a weird form of the 3, 2, 1 rule for backups? 3 instances on 2 different host OSs, 1 outside the xcp pool.
I've had times when I needed to tweak the XO instance, so it's definitely nice to have another one to do it with. I just realized that if that XO instance went down, like if I lost the host it's running on, then I'd have a tougher time restoring backups because the other instances don't know about them. Note: the host running that XO instance is not one of my main hosts for the pool and doesn't run any business-critical VMs, lest someone chastise me for handling backups on the hardware that's running the VMs being backed up.
-
How can I duplicate backup settings to a different XO instance? Should I?
I have XO running in a few different places. One of them handles backups, and I think of that XO as my main instance. It occurred to me that if I lost that instance for some reason, it might be tough to restore one of my backups. So, does it make sense to somehow duplicate the backup configuration to the other XO installs so they can see the backups, or would they just be confused because they have no record of the ID used for the backup?
Would it make sense to use the XO Config under settings to just copy the entire XO config from one install to another? Is there a downside to that?
It's my intent that the various XO installs all manage the same pool. I certainly don't want them all performing the same backups though so I'd obviously have to disable those backups on the other installs.
Good idea? Bad idea?
-
RE: NVIDIA Tesla M40 for AI work in Ubuntu VM - is it a good idea?
@gskger Yeah, looks like it would be too tight. Ouch, those T4s are an order of magnitude more expensive. I'm definitely not interested in going that route.
-
RE: NVIDIA Tesla M40 for AI work in Ubuntu VM - is it a good idea?
@gskger I'm so glad you put pics of your server in that other post. I looked at it last month when you posted it but when I looked at it again yesterday, I realized that those cards might not fit in my R730xd because I have the center drive rack with 4 extra internal drives. I'm concerned that one of those cards would not work with those drives in place since they basically totally cover the memory and CPUs. I'm also concerned that with those center drives, the airflow out of the row of case fans is less direct and might cause heat problems for a video card like that.