Full disclosure: I'm not a fan of systemd. I started working with Linux in the late '90s and watched it grow from a marginalized operating system into the most dominant operating system in the datacenter. I've lived through so many "year of the Linux desktop" years that I remember when it wasn't a joke. From my vantage point, administering Linux servers professionally for nearly 20 years, systemd is Linux on the desktop at the cost of Linux in the datacenter.
Why do I feel this way? It's mostly the reinvention and incorrect reimplementation of core UNIX tools and modalities. There's a lot of information on systemd out there, and a lot of bias involved, so today I'm not going to talk about that. I am going to address a critical mistake in the systemd-resolved daemon, which implements DNS lookups for systems running systemd.
I'll jump right to the workaround. If you're running a system that uses systemd, you should probably configure systemd-resolved to use a single DNS resolver, 127.0.0.1, and run Unbound there. There are many resources on how to configure and run Unbound, but the best is Calomel's Unbound Tutorial. If you need to maintain consistent, reliable DNS resolution that's compatible with previous versions of Linux, the only way to do that is to have a single DNS server in /etc/resolv.conf.
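As a sketch of that workaround (the listen address, access-control range, and upstream forwarder here are assumptions for illustration; adjust them for your network):

```
# /etc/resolv.conf -- point everything at the local Unbound instance
nameserver 127.0.0.1

# /etc/unbound/unbound.conf -- minimal sketch, not a production config
server:
    interface: 127.0.0.1
    access-control: 127.0.0.0/8 allow
forward-zone:
    name: "."
    forward-addr: 192.168.1.1   # hypothetical upstream resolver
```

With a single local resolver in /etc/resolv.conf, Unbound owns retries, caching, and upstream failover, and the resolver library's behavior stays trivially predictable.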
Why This Matters
This thread on systemd-resolved explains the issue. Yes, putting external DNS servers into your internal servers' /etc/resolv.conf is not great form, but that's completely missing the point exposed in this bug report.
systemd-resolved is implementing state tracking against a stateless protocol.
Not only that, but it does it poorly. In the cases described by the commenters, a temporary blip in connectivity to internal DNS servers wound up blacklisting them indefinitely. In my nearly 20 years as a Linux admin, I've seen nearly every junior admin come up with the same idea after their first DNS outage: "Why don't we just keep track of which DNS servers respond and then ignore the ones that are failing?" It sounds great, but because DNS is a stateless protocol by design, distinguishing a "working server" from a "not working server" is profoundly more difficult than issuing an HTTP request to a status handler. It's complicated.
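To see why this failure mode bites, here is a toy model of naive resolver health tracking — not systemd-resolved's actual code, and the server addresses are made up — where a single timeout marks a server bad and nothing ever un-marks it:

```python
# Toy model: one transient timeout blacklists a server, and because nothing
# re-probes it, the blacklist entry is effectively permanent.

def run(queries, blacklist):
    """Route each query to the first server not in the blacklist."""
    servers = ["10.0.0.1", "10.0.0.2"]    # hypothetical internal resolvers
    routed = []
    for ok in queries:                    # ok=False: 10.0.0.1 times out
        for s in servers:
            if s in blacklist:
                continue
            if s == "10.0.0.1" and not ok:
                blacklist.add(s)          # one blip -> marked bad forever
                continue
            routed.append(s)
            break
    return routed

blacklist = set()
# One transient timeout in the middle; the server is healthy afterwards.
history = run([True, False, True, True, True], blacklist)
# After the single blip, every later query avoids the healthy 10.0.0.1.
```

The blip lasts one query, but the healthy server never gets traffic again — the state outlives the failure it was supposed to track.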
The Old Behavior
There are a lot of misconceptions about glibc's resolver library, so I'm hoping to squash a bit of that and explain how most name resolution works on most Linux systems. Yes, it's possible to use a different resolver library, and those libraries may not implement resolution the same way. However, I want to talk about the glibc resolver and how it interacts with /etc/resolv.conf on a stock CentOS 6 system and every UNIX and Linux prior.
Here's a sample /etc/resolv.conf:

```
search edgeofsanity.net
nameserver 192.168.1.1
nameserver 192.168.1.2
nameserver 192.168.1.3
```
First things first: /etc/resolv.conf supports only three nameservers (MAXNS in glibc is 3), and any further servers are ignored. I've seen up to eight servers in resolv.conf files administered by experienced, knowledgeable folks. Remember, only the first three are ever queried.
So what happens with this resolv.conf? Well, if 192.168.1.1 is responding to queries, it will always be used to resolve every query. If a query passes the timeout (the default is 5 seconds) without a response, the query will be resent to 192.168.1.1 once more before advancing to 192.168.1.2. These counters are tracked internally by the process running the resolver library; they are not global counters, they are local to each process. This particular failure case is also per-query, meaning each DNS query will have to time out twice against 192.168.1.1 before advancing to 192.168.1.2.
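The retry schedule just described reduces to simple arithmetic. This is a sketch of the model from the text (5-second default timeout, the first server retried once before advancing), not a reimplementation of glibc's internals:

```python
# Per-query delay spent on dead servers before a live one is tried,
# following the retry schedule described in the text.

def worst_case_delay(servers_down, timeout=5, attempts=2):
    """Seconds burned on unresponsive servers before a live one is queried."""
    return servers_down * attempts * timeout

# 192.168.1.1 down, defaults: two 5-second timeouts before 192.168.1.2
# is ever tried, so every query pays 10 seconds.
delay = worst_case_delay(servers_down=1)
```

The same arithmetic with `timeout=1, attempts=1` gives the 1-second floor discussed below when those options are set.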
Why? Well, a timeout could happen for any number of reasons. A timeout of a nameserver for one query doesn't predict a timeout in the future to the same server for the same query. It's complicated.
What this configuration guarantees is that every query will take at least 10 seconds to resolve if 192.168.1.1 is down. This is less than ideal, so we can improve that a little by adding options.
```
search edgeofsanity.net
nameserver 192.168.1.1
nameserver 192.168.1.2
nameserver 192.168.1.3
options timeout:1 attempts:1
```

With timeout set to 1 second and attempts set to 1, we'll try 192.168.1.2 if 192.168.1.1 doesn't respond within 1 second. Again, this is per-query, per-process, so every query will always try 192.168.1.1 before moving on to 192.168.1.2, because, repeat after me, "a timeout of a single query to a single DNS server cannot predict that even the same query to the same server will time out at any point in the future."
This improves the failure case when 192.168.1.1 becomes unavailable, but it's still 1+ seconds for every DNS query, which is unacceptably slow for any web-scale service. There's another option we can introduce to decrease the impact an unavailable DNS server has on our servers:
```
search edgeofsanity.net
nameserver 192.168.1.1
nameserver 192.168.1.2
nameserver 192.168.1.3
options timeout:1 attempts:1 rotate
```

We introduce the rotate option to the config file. If you were to run this:

```
while true; do getent hosts www.google.com; done
```

You'd probably be surprised to see EVERY query going to 192.168.1.1. Maybe you can guess why that is?
That's right! The rotate option is per-process, so each time we run getent we start a new process, which starts at the first nameserver for its first query and continues on to the next server for the next query. Failures per-query are still processed the same way.

If you had a failure of 192.168.1.1, you'd have more than 33% of DNS queries taking 1+ seconds to resolve. Why? Again, rotate is per-process, so long-running processes will rotate through the bad server every 3 queries. However, every new process will always start at the beginning of the list.
The New Behavior
OK, so what's described in the GitHub issue is systemd-resolved's author deciding to break a fundamental design of DNS resolution on UNIX systems. Servers are never skipped in the previous glibc resolver world. This is because, and I'll say it again, a timeout for a single DNS query to a single DNS server does not predict a timeout for that same query to that same server at any point in the future. The systemd-resolved behavior adds this state to a stateless protocol, which leads to unpredictable and inconsistent behavior in one of the lowest-level, most misunderstood, and most critical components in your infrastructure.
There is a way to work around this: if every DNS server in the list is marked as problematic, systemd-resolved falls back to the default behavior of going through every server in the list and resetting their state. The easiest way to ensure this happens is to list a single nameserver in your configuration; with only one server, any failure marks the whole list bad, which short-circuits the state-tracking logic.
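As a sketch, the single upstream can be set in /etc/systemd/resolved.conf (the 127.0.0.1 address assumes a local caching resolver such as Unbound is listening there, per the workaround above):

```
# /etc/systemd/resolved.conf
[Resolve]
DNS=127.0.0.1
```

After editing, restart the service (e.g. `systemctl restart systemd-resolved`) for the change to take effect.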
I'm not going to bash systemd or any of its authors or maintainers. They're doing their best to solve hard problems. I disagree fundamentally with their direction and assumptions, but they're writing code and dealing with angry communities, and I won't pile on. However, this behavior is fundamentally different from everything else in the space and represents what I fear is a naivety about, and disinterest in, the problem space. If you administer Linux systems professionally, you need to be aware of this difference and how it will impact your infrastructure if there are issues with upstream DNS providers.
It's entirely possible this change in behavior will have little or no impact on your infrastructure. It's still important to understand the difference, as DNS is so often either impacted by, or impacting, the availability of your services.
First, I got something wrong. In the case where we have rotate enabled and 3 nameservers, approximately 50% of queries will take 1+ seconds to resolve. This is because the state isn't magic; it's a simple pointer that's incremented each time. Consider: query #1 goes to 192.168.1.1, it times out, the pointer is advanced to 192.168.1.2, and it succeeds. Query #2 comes in, the pointer is advanced to 192.168.1.3, and it succeeds. Query #3 comes in, the pointer is advanced to 192.168.1.1, it times out, and moves on to 192.168.1.2. Rinse and repeat.
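That pointer arithmetic can be checked with a small simulation. This is a toy model of the rotate pointer exactly as described above, not glibc's actual code; 192.168.1.1 is the dead server:

```python
# Toy model of the per-process `rotate` pointer described above: each query
# starts at the server after the one that answered the previous query.

SERVERS = ["192.168.1.1", "192.168.1.2", "192.168.1.3"]
DEAD = "192.168.1.1"

def simulate(num_queries):
    """Return how many queries hit the dead server (and eat a timeout)."""
    slow = 0
    pointer = 0                      # index of the server to try first
    for _ in range(num_queries):
        start = pointer
        if SERVERS[start] == DEAD:   # timeout, advance to the next server
            slow += 1
            start = (start + 1) % len(SERVERS)
        # next query starts at the server after the one that answered
        pointer = (start + 1) % len(SERVERS)
    return slow

# In a long-running process, half of all queries pay the timeout.
slow = simulate(1000)
```

The cycle is two queries long (dead, live answer, live answer from the same pass skipped by the pointer), so exactly half of a long-running process's queries eat the 1-second timeout — the 50% figure above, not 33%.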
Second, if you want to bypass systemd-resolved entirely, you can also remove resolve from the hosts line in /etc/nsswitch.conf.