[Maarten Van Horenbeeck] [Information Security] [Resources]

DNS Quality

Capacity Planning
Optimizing BIND Performance
Optimizing your zones

The internet is built upon a correctly functioning Domain Name System. Yet research has indicated that quite a few .com domains leave a high proportion of queries unanswered. For most commercial organisations, this is an unacceptable situation. In this document, I will try to give some tips and pointers on how to develop a good, scalable and future-proof DNS strategy.

This document is based entirely on BIND 9.x nameservers. This is not due to any disapproval of other nameservers. On the contrary, I am convinced that many smaller DNS servers, such as djbdns, may be much better suited to certain goals than BIND. However, BIND is where my own experience is most robust, and it is also a DNS server which has proven itself on numerous occasions (it runs, for example, on most of the root name servers).

Capacity planning

Decent capacity planning is of great importance to the overall delivery of any DNS service. BIND 9.x has two built-in mechanisms that make capacity planning easier, and a third can be implemented with additional software.

First of all, although it comes with a performance cost, BIND can log a great deal of detail about the actions performed by the daemon. An example of detailed logging in named.conf:
logging {
      channel "default_syslog" {
            // Send most of the named messages to syslog.
            syslog local2;
            severity debug;
      };
      channel audit_log {
            // Send the security related messages to a separate file.
            file "/var/log/named.log";
            severity debug;
            print-time yes;
      };
      category default { default_syslog; };
      category general { default_syslog; };
      category security { audit_log; default_syslog; };
      category config { default_syslog; };
      category resolver { audit_log; };
      category xfer-in { audit_log; };
      category xfer-out { audit_log; };
      category notify { audit_log; };
      category client { audit_log; };
      category network { audit_log; };
      category update { audit_log; };
      category queries { audit_log; };
      category lame-servers { audit_log; };
};
This logging produces a lot of valuable output, which can be processed with dedicated log analysis programs. A good use of these logs, especially in a split DNS environment running on the same nameservers (that is, where a nameserver authoritative for the internet zones is also authoritative for intranet domains), is to measure the number of internal DNS lookups against the number of external ones. In a normal situation, this should roughly follow the 80-20 rule (known from the world of management): one of the two groups, internal or external, accounts for only about 20% of the queries. If the division is closer to 50-50, it would be better to split the two functions across physically separate servers.
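
As a rough illustration of that measurement, the Python sketch below counts internal versus external clients in a query log. It is only a sketch under stated assumptions: the layout matched by QUERY_LINE is an assumed form of the "queries" category output (the exact layout differs between BIND versions), and INTERNAL_NETWORKS, count_queries and internal_share are hypothetical names and ranges chosen for the example.

```python
import ipaddress
import re

# Hypothetical internal address ranges; replace these with your own networks.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.0.0/16"),
]

# Assumed log layout: "... client [@0x...] <address>#<port>...: query: <name> ..."
QUERY_LINE = re.compile(r"client (?:@\S+ )?([0-9A-Fa-f:.]+)#\d+.*query:")


def is_internal(address):
    """Return True if the client address falls inside an internal network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in INTERNAL_NETWORKS)


def count_queries(lines):
    """Count query log lines per client group; other lines are skipped."""
    counts = {"internal": 0, "external": 0}
    for line in lines:
        match = QUERY_LINE.search(line)
        if not match:
            continue
        try:
            internal = is_internal(match.group(1))
        except ValueError:
            # The captured text was not a valid IP address; skip the line.
            continue
        counts["internal" if internal else "external"] += 1
    return counts


def internal_share(counts):
    """Fraction of all counted queries that came from internal clients."""
    total = counts["internal"] + counts["external"]
    return counts["internal"] / total if total else 0.0
```

Feeding the lines of /var/log/named.log to count_queries and checking whether internal_share is closer to 0.2 (or 0.8) than to 0.5 gives a first indication of whether the two functions should be split.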

Another way to look at the request load your name server is bearing is to enable access to the BIND daemon through rndc, the name server control utility. After enabling access (by creating a key and adding the corresponding access statements to named.conf), you can obtain statistics with the following command:

rndc stats

To have BIND dump its statistics to a file of your choice, an additional statement is needed in the options block of named.conf:

statistics-file "/var/log/named.stats";

An example of the statistics dumped by this command (the description after each counter is an annotation and is not part of the actual output):
+++ Statistics Dump +++ (1075046707)
success   37973  Successfully answered queries
referral    403  Number of referred queries
nxrrset    5783  A record of this specific type did not exist for the name
nxdomain   7054  Domain did not exist
recursion  8099  All queries leading to recursion
failure     348  Failure responses
--- Statistics Dump --- (1075046707)
It would be very easy to write a short cron job which runs "rndc stats" hourly and then calculates the average load of the server over a period of several hours. By cross-checking the total number of nameserver responses with the total number of inbound DNS queries (which can, for example, be measured on a perimeter router or firewall), it is possible to calculate the number of "dropped" queries: queries which were never replied to, not even with an error message. This is a very important number to know in capacity planning, as it is directly related to the number of clients which will not be able to access the websites and services whose names are served by your DNS servers.

Optimizing BIND Performance

There are quite a few ways to optimize the performance of a BIND nameserver. As always, much of this comes down to raw CPU power, and a multiprocessor box is definitely advised for machines which are to serve huge zones. In that case, compile BIND with --enable-threads, so that named can run multiple threads on different CPUs.

On the configuration level, it's important not to allow your nameserver to do lookups for zones for which it isn't authoritative. These lookups, called recursive lookups, needlessly consume CPU power and bandwidth. They can be disabled with the entry "recursion no;" in the options block of your named.conf.

Also make sure that your machine is correctly configured to enable high speed network communications.

Optimizing your zones
Unlike the changes above, the following changes are made per zone and directly affect only that specific zone. However, as some of them can make quite a difference for larger domains, they may improve general name service performance for all zones served by the server.

Some of these comments are not strictly performance related. However, in the long run, security and general "quality" improvements will lead to a better overall level of service and fewer performance issues.

Timing information for the zone is also very important, mainly due to traffic concerns. Almost all of the values which follow are set in the SOA (Start of Authority) record of the zone, but some can also be set for each individual RR (Resource Record).

Refresh & Retry
Both of these values affect only the maintainers of the primary and secondary nameservers. REFRESH defines how often the secondaries check the primary nameserver for updates to the zone, and RETRY defines how long a secondary waits before trying again after a failed check. When hosting many thousands of zones, setting these to low values will increase network traffic between the secondaries and the primary. RFC 1912 advises 1200-43200 seconds for the REFRESH value, and a good value for RETRY is about 120-7200 seconds. A notable exception is when NOTIFY is being used. In that case, the primary automatically sends a message to all secondaries whenever the zone changes, and REFRESH can then be set very high in order to decrease network traffic. Keep in mind, however, that the periodic refresh remains a useful backup if the NOTIFY mechanism fails.

Expire
This value was introduced to ensure stability of the zone data. It is the length of time for which the secondary nameservers will keep serving the zone without a successful refresh from the primary; after that, the zone expires and is no longer served by the secondaries. A good value is anywhere between 2 and 4 weeks (1209600 to 2419200 seconds). There is little reason to go for anything except the highest value, as it will not really lead to an increase in traffic, and a higher value will only result in a more stable zone.
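
To make the ranges quoted so far easier to apply, here is a small Python sketch that checks a zone's REFRESH, RETRY and EXPIRE values against them. The RECOMMENDED table simply restates the numbers from the text; the function name check_soa_timers and the dictionary layout are hypothetical, chosen only for this example.

```python
# Recommended ranges in seconds, as quoted in the text.
RECOMMENDED = {
    "refresh": (1200, 43200),
    "retry": (120, 7200),
    "expire": (1209600, 2419200),
}


def check_soa_timers(timers):
    """Return a list of warnings for timers that are missing or out of range."""
    warnings = []
    for name, (low, high) in RECOMMENDED.items():
        value = timers.get(name)
        if value is None:
            warnings.append(f"{name}: missing")
        elif not low <= value <= high:
            warnings.append(f"{name}: {value} outside {low}-{high}")
    return warnings
```

A warning is a hint rather than an error: a REFRESH above the range, for instance, may be a deliberate choice on zones where NOTIFY is in use.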

Minimum TTL
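The Minimum TTL is also set in the SOA (although other TTLs can be defined for each resource record). It defines how long a resolver will cache the response given: this is the time (in seconds) it takes for a resolver to pick up a new response if it has already cached a prior one. Note that since RFC 2308, BIND 9 uses this SOA field as the TTL for negative responses (such as NXDOMAIN), while the default TTL for records is normally set with the $TTL directive.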