DNS queries during backup job
@olivierlambert said in DNS queries during backup job:
Okay I would be curious to see if you have a similar behavior on XOA
I can have a look at work during the week.
I have investigated a bit, and indeed Node does not cache DNS queries and calls system methods directly (e.g.
I've created a test branch which improves the situation: https://github.com/vatesfr/xen-orchestra/pull/6196
But I'm wondering if it's the right approach; maybe this responsibility should be left to the system and we should add nscd to our XOAs.
Let me know if you have any opinions on this or feedback on my branch.
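For readers who want a picture of what an in-process cache can look like, here is a minimal sketch. It is not the code from the PR: the helper name `createCachedLookup`, the 60 second TTL and the cache key are assumptions made for illustration, and it ignores options other than `family` and `all`.

```javascript
const dns = require('node:dns');

// Hypothetical helper (not the PR's implementation): wraps a function with the
// same signature as dns.lookup and remembers successful answers for ttlMs.
function createCachedLookup(lookup = dns.lookup, ttlMs = 60_000) {
  const cache = new Map(); // key -> { expires, args }

  return function cachedLookup(hostname, options, callback) {
    // dns.lookup accepts (hostname, callback) and (hostname, family, callback).
    if (typeof options === 'function') {
      callback = options;
      options = {};
    }
    if (typeof options === 'number') {
      options = { family: options };
    }

    // Assumption: family and all are enough to tell cached answers apart.
    const key = JSON.stringify([hostname, options.family ?? 0, options.all ?? false]);
    const hit = cache.get(key);
    if (hit !== undefined && hit.expires > Date.now()) {
      // Answer asynchronously, as the real dns.lookup always does.
      process.nextTick(callback, null, ...hit.args);
      return;
    }

    lookup(hostname, options, (err, ...args) => {
      if (err == null) {
        cache.set(key, { expires: Date.now() + ttlMs, args });
      }
      callback(err, ...args);
    });
  };
}
```

Such a function can be passed as the `lookup` option of `net.connect`, `http.request` or `https.request`, so only the requests that opt in are affected. Errors are not cached here, which keeps a temporary resolver failure from sticking around.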
I'll put this to test and see tomorrow what the DNS query stats look like.
Just my two cents, but I feel one shouldn't "fix" a flaw or bad behaviour in an application by relying on an external dependency to deal with it, especially if it's fixable. Sure, using something like nscd in XOA would kind of fix the issue there, but wouldn't the possible perf issues still exist in Node? I'm not competent to review the code, so I can't say anything about the actual implementation in the feature branch.
It's not trivial to decide where to put that "frontier". XOA is meant to be an entire system, not just with XO code, but also the updater and other things.
For the DNS thing, I have to admit I don't know yet what the best practice is. I suppose it also depends on where you want to draw the line between leaving "non-core" features (i.e. DNS caching) to the system and doing them internally. Should we also implement other "system" stuff? It's not trivial to answer that.
I think the main point to focus on here is that XO is doing totally unnecessary DNS queries with excessive frequency. I don't see this as implementing a non-core feature, but as a fix to the logic of how the application figures out where to connect and how often. How exactly to do that and what options there are is outside my knowledge.
IMHO applications in general don't have internal DNS caching; they rely on system-provided functionality. With that in mind, it is sensible to use a system package rather than a fix inside XO code, especially considering XO can run on platforms other than XOA.
Andrew:
@Forza I agree that the OS is responsible for caching host records. The real question is why is XO doing so many lookups repeatedly. Maybe it is actually a Node problem (in addition to code issues).
In most applications, once a socket is opened to a host it stays open and does not need another lookup until it is closed and a new connection is made. If XO or Node is stateless and opens a new connection for each block read/write (or group of blocks), then it may do a lot of lookups. The mass lookups seem to be a sign of a lot of overhead that could be reduced to improve performance.
Yes, nscd can be a host query (DNS) cache solution (for XO source and XOA) but can the code be improved to reduce overhead and improve general performance?
Here is a quick MRTG image of DNS requests. You can see when I enabled nscd, which caches lookup requests (hint: Sunday night):
@Andrew said in DNS queries during backup job:
If XO or Node is stateless and opens a new connection for each block read/write (or group of blocks), then it may do a lot of lookups. The mass lookups seem to be a sign of a lot of overhead that could be reduced to improve performance.
I agree that's a good question (for @julien-f I assume)
@julien-f this changed the situation from thousands of queries in minutes to no noticeable spike in query graphs during backup job, so huge improvement.
Although it is nice that there are workarounds for the DNS spikes with either nscd or the in-process DNS cache, I think the DNS spikes are a symptom of a whole different issue.
I think we can safely assume that each DNS lookup corresponds to one attempt at establishing a TCP connection. If so, there is some code somewhere that spawns an awful lot of short-lived connections instead of reusing or pooling them, with all the issues that follow in that area (insufficient ulimit NOFILE, connections in TIME_WAIT, exhaustion of client ports, etc.).
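To illustrate what reusing connections means in Node terms, here is a small sketch that counts how many TCP connections a local test server sees for a series of sequential requests. It says nothing about how XO actually talks to XAPI: the helper name `countConnections` is hypothetical, and it uses plain HTTP on 127.0.0.1 only to stay self-contained.

```javascript
const http = require('node:http');

// Hypothetical helper: sends `count` sequential GET requests through `agent`
// to a throwaway local server and resolves with the number of TCP
// connections that server accepted.
function countConnections(agent, count) {
  return new Promise((resolve, reject) => {
    let connections = 0;
    const server = http.createServer((req, res) => res.end('ok'));
    server.on('connection', () => { connections++; });

    server.listen(0, '127.0.0.1', () => {
      const { port } = server.address();
      let done = 0;

      const next = () => {
        if (done === count) {
          agent.destroy(); // close any idle pooled sockets
          server.close(() => resolve(connections));
          return;
        }
        http.get({ host: '127.0.0.1', port, agent }, res => {
          res.resume(); // drain the body so the socket can be released
          res.on('end', () => {
            done++;
            // Wait one turn so the agent has taken the socket back into its pool.
            setImmediate(next);
          });
        }).on('error', reject);
      };
      next();
    });
  });
}
```

Calling it with `new http.Agent({ keepAlive: true })` should show the requests sharing a connection, while `new http.Agent({ keepAlive: false })` opens a new one per request. A hostname only needs to be resolved when a new socket is opened, which is why pooling would also cut down lookups if they were tied to connections.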
@hoerup I agree with your analysis, not sure how easy it will be to fix, we'll investigate.
Did some further testing to see whether the number of DNS queries correlates with the number of actual connections made to the host. This doesn't seem to be the case, which is even more interesting. Some results below.
Ran an incremental run of a delta backup, which took 9 minutes in total:
- Number of DNS queries: close to 7k
- Number of HTTPS connects logged to host IP address: 478
- Number of HTTPS connects/disconnects logged in total to host IP address: 955
Connection counts were about the same with an installation from the dns.lookup branch provided by @julien-f above, without the large number of DNS queries, obviously.
@ronivay are all DNS queries for the same host and record?
ronivay:
Yep. Same domain; it asks for A and AAAA at the same time, each being an individual query, obviously.
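Putting the reported numbers together as back-of-the-envelope arithmetic (the 7000 is a rounding of "close to 7k", so the result is only approximate):

```javascript
// Figures reported above for the 9-minute incremental run.
const dnsQueries = 7000;   // "close to 7k", treated as approximate
const httpsConnects = 478; // HTTPS connects logged to the host IP address

// Each lookup showed up as one A query plus one AAAA query.
const queriesPerLookup = 2;

const lookups = dnsQueries / queriesPerLookup;
const lookupsPerConnect = lookups / httpsConnects;

console.log(lookups);                      // 3500
console.log(lookupsPerConnect.toFixed(1)); // 7.3
```

So even after halving for the A/AAAA pairs, there are roughly seven lookups for every logged connect, which fits the observation that the queries do not map one-to-one onto connections.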
Also XAPI (so on host side) doesn't support HTTP/2.
The DNS cache has been merged; keep us posted if you have any issues.