XCP-ng
    • Categories
    • Recent
    • Tags
    • Popular
    • Users
    • Groups
    • Register
    • Login

    error -104

    Scheduled Pinned Locked Moved Xen Orchestra
    21 Posts 7 Posters 4.9k Views 6 Watching
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • olivierlambertO Offline
      olivierlambert Vates πŸͺ Co-Founder CEO
      last edited by

      Wow, that's far from any hint on the error log πŸ€”

      Nice catch then.

      1 Reply Last reply Reply Quote 0
      • wyatt-madeW Offline
        wyatt-made @ptunstall
        last edited by

        @ptunstall What commands did you run to resolve this issue? I think I'm having the same issue. I "hid" 4 GPUs from dom0, and now I'm getting the -104 error in Xen Orchestra. I used /opt/xensource/libexec/xen-cmdline –delete-dom0 xen-pciback.hide to try and un-hide the devices from dom0, and using xl pci-assignable-list which still shows one of the GPUs as assignable. Even after removing the GPUs from the server, I'm getting the same result.

        CC @olivierlambert

        tjkreidlT 1 Reply Last reply Reply Quote 0
        • olivierlambertO Offline
          olivierlambert Vates πŸͺ Co-Founder CEO
          last edited by

          Really hard to tell like this, do you have XAPI running?

          wyatt-madeW 1 Reply Last reply Reply Quote 0
          • wyatt-madeW Offline
            wyatt-made @olivierlambert
            last edited by

            @olivierlambert It should be, the server was functional prior to the config changes, and has been rebooted multiple times.

            1 Reply Last reply Reply Quote 0
            • olivierlambertO Offline
              olivierlambert Vates πŸͺ Co-Founder CEO
              last edited by

              Check if it runs and take a look at the usual logs πŸ™‚

              wyatt-madeW 1 Reply Last reply Reply Quote 0
              • wyatt-madeW Offline
                wyatt-made @olivierlambert
                last edited by

                @olivierlambert XAPI is running. The error in Xen Orchestra is connect ECONNREFUSED [ip address of server:port]. I've verified the ip and credentials to the server.

                Restarting the toolstack from the console said it was done, but looking at daemon.log it looks like it's failing to start?20221115_094031-min.jpg

                M 1 Reply Last reply Reply Quote 0
                • olivierlambertO Offline
                  olivierlambert Vates πŸͺ Co-Founder CEO
                  last edited by

                  I suppose rebooting the host doesn't change anything, right?

                  wyatt-madeW 1 Reply Last reply Reply Quote 0
                  • wyatt-madeW Offline
                    wyatt-made @olivierlambert
                    last edited by

                    @olivierlambert Correct. I tried to migrate the 2 VMs off that host to another remote host, but I'm getting the error "This VIF was not mapped to a destination Network in VM.migrate_send operation" or the connection to the server is lost, or "Error: Connection refused (calling connect )", even after restarting the toolstack.

                    wyatt-madeW 1 Reply Last reply Reply Quote 0
                    • P Offline
                      ptunstall
                      last edited by

                      While I was able to solve this issue the first time it popped up for us by returning the GPUs back to the DOM, this issue happened again 2 weeks ago for us and I was unable to get it to work again. We had to re-install the HOST entirely to get it to work. I'm sure this is a user error on our part by missing something. I'd very much like to know the proper workflow to solve this as XCP-ng is our backbone to our entire virtual VFX production suite.

                      We used this command to push the GPUs back to the DOM

                      /opt/xensource/libexec/xen-cmdline --delete-dom0 xen-pciback.hide
                      
                      1 Reply Last reply Reply Quote 0
                      • M Offline
                        mal @wyatt-made
                        last edited by

                        @wyatt-made - The "INVALIDARGUMENT" is the same error that I've got here

                        1 Reply Last reply Reply Quote 0
                        • wyatt-madeW Offline
                          wyatt-made @wyatt-made
                          last edited by

                          @wyatt-made In case anyone in the future comes across this looking for answers: I was able to "resolve" the issue by doing an "upgrade" to the same XCP-ng version using the installation media. This way I was able to preserve any VMs sitting on the hypervisor. This did wipe any kernel settings that were changed as stated in the Compute documentation, but that's the point of the reinstall.

                          1 Reply Last reply Reply Quote 1
                          • P Offline
                            ptunstall
                            last edited by

                            We just encountered this again.

                            I added 2 new GPUs to the node and removed 1 (unused) NIC. Nothing else was changed in the system. Just 3 PCIe changes. The already installed and assigned GPUs were not removed or changed at all, full error:

                            server.enable
                            {
                              "id": "565d1ea8-582c-4596-ae1f-d96f95ef2c37"
                            }
                            {
                              "errno": -104,
                              "code": "ECONNRESET",
                              "syscall": "write",
                              "url": "https://10.169.4.124/jsonrpc",
                              "call": {
                                "method": "session.login_with_password",
                                "params": "* obfuscated *"
                              },
                              "message": "write ECONNRESET",
                              "name": "Error",
                              "stack": "Error: write ECONNRESET
                                at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:94:16)
                                at WriteWrap.callbackTrampoline (node:internal/async_hooks:130:17)"
                            }
                            

                            I can SSH into the node without issue.

                            I was looking over this: https://xcp-ng.org/docs/api.html

                            Tried this:

                            xe-toolstack-restart
                            

                            I get this error now:

                            server.enable
                            {
                              "id": "88698db1-9b95-4ca8-b690-98395145f282"
                            }
                            {
                              "errno": -111,
                              "code": "ECONNREFUSED",
                              "syscall": "connect",
                              "address": "10.169.4.124",
                              "port": 443,
                              "url": "https://10.169.4.124/jsonrpc",
                              "call": {
                                "method": "session.login_with_password",
                                "params": "* obfuscated *"
                              },
                              "message": "connect ECONNREFUSED 10.169.4.124:443",
                              "name": "Error",
                              "stack": "Error: connect ECONNREFUSED 10.169.4.124:443
                                at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1300:16)
                                at TCPConnectWrap.callbackTrampoline (node:internal/async_hooks:130:17)"
                            }
                            

                            I will try this suggested same version upgrade and report back.

                            1 Reply Last reply Reply Quote 0
                            • P Offline
                              ptunstall
                              last edited by

                              Additionally I noticed that when SSHed into the node and working with the CLI xe commands some of them don't go through:

                              [16:19 gpuhost05 ~]# xe vm-list
                              uuid ( RO)           : b8e7a3c8-e68e-ac45-2dec-b04b4fc5426b
                                   name-label ( RW): vast-ws14
                                  power-state ( RO): halted
                              
                              
                              uuid ( RO)           : c6b78b22-1153-4622-a5a1-1a0880b2d68f
                                   name-label ( RW): Control domain on host: gpuhost05
                                  power-state ( RO): running
                              
                              
                              [16:19 gpuhost05 ~]# xe vm-list
                              uuid ( RO)           : b8e7a3c8-e68e-ac45-2dec-b04b4fc5426b
                                   name-label ( RW): vast-ws14
                                  power-state ( RO): halted
                              
                              
                              uuid ( RO)           : c6b78b22-1153-4622-a5a1-1a0880b2d68f
                                   name-label ( RW): Control domain on host: gpuhost05
                                  power-state ( RO): running
                              
                              
                              [16:19 gpuhost05 ~]# xe vm-list
                              uuid ( RO)           : b8e7a3c8-e68e-ac45-2dec-b04b4fc5426b
                                   name-label ( RW): vast-ws14
                                  power-state ( RO): halted
                              
                              
                              uuid ( RO)           : c6b78b22-1153-4622-a5a1-1a0880b2d68f
                                   name-label ( RW): Control domain on host: gpuhost05
                                  power-state ( RO): running
                              
                              
                              [16:19 gpuhost05 ~]# xe vm-list
                              Error: Connection refused (calling connect )
                              [16:19 gpuhost05 ~]#
                              

                              I try to start a VM manually:

                              [16:11 gpuhost05 ~]# xe vm-start uuid=b8e7a3c8-e68e-ac45-2dec-b04b4fc5426b
                              Lost connection to the server.
                              [16:12 gpuhost05 ~]# xe vm-start uuid=b8e7a3c8-e68e-ac45-2dec-b04b4fc5426b
                              Lost connection to the server.
                              [16:12 gpuhost05 ~]# xe vm-start uuid=b8e7a3c8-e68e-ac45-2dec-b04b4fc5426b
                              Lost connection to the server.
                              [16:12 gpuhost05 ~]# xe vm-list
                              uuid ( RO)           : b8e7a3c8-e68e-ac45-2dec-b04b4fc5426b
                                   name-label ( RW): vast-ws14
                                  power-state ( RO): halted
                              
                              
                              uuid ( RO)           : c6b78b22-1153-4622-a5a1-1a0880b2d68f
                                   name-label ( RW): Control domain on host: gpuhost05
                                  power-state ( RO): running
                              
                              
                              [16:12 gpuhost05 ~]# xe vm-start uuid=b8e7a3c8-e68e-ac45-2dec-b04b4fc5426b
                              Lost connection to the server.
                              
                              1 Reply Last reply Reply Quote 0
                              • tjkreidlT Offline
                                tjkreidl Ambassador @wyatt-made
                                last edited by

                                @wyatt-made said in error -104:

                                GPU

                                Do you have the use memory above 4G decoding disabled in the BIOS settings?

                                https://nvidia.custhelp.com/app/answers/detail/a_id/4119/~/incorrect-bios-settings-on-a-server-when-used-with-a-hypervisor-can-cause-mmio

                                P W 2 Replies Last reply Reply Quote 0
                                • P Offline
                                  ptunstall @tjkreidl
                                  last edited by

                                  @tjkreidl Yes, This node had 12GPUs in it running in a bare metal environment a year ago before being repurposed.

                                  T 1 Reply Last reply Reply Quote 0
                                  • W Offline
                                    wawa @tjkreidl
                                    last edited by

                                    @tjkreidl I do have that option enabled. This is also passing the entire GPU through to a VM, and using an AMD GPU.

                                    1 Reply Last reply Reply Quote 0
                                    • T Offline
                                      tuxen Top contributor @ptunstall
                                      last edited by tuxen

                                      @ptunstall when the GPU was pushed back to dom0, did you also remove the PCI address from the VM config?

                                      What's the output of:

                                      xe vm-param-get uuid=<...> param-name=other-config

                                      ?

                                      P 1 Reply Last reply Reply Quote 0
                                      • P Offline
                                        ptunstall @tuxen
                                        last edited by

                                        @tuxen No GPUs were removed. only 2 were added. The only PCIE item removed was a NIC but I didn't remove it from dom0 or assign it to any VMs, it was just in the system.

                                        1 Reply Last reply Reply Quote 0
                                        • First post
                                          Last post