[lacnog] Article: On the Time Value of Security Features in DNS

Sat Sep 28 17:42:00 BRT 2013

FYI

Source:
<http://www.circleid.com/posts/20130913_on_the_time_value_of_security_features_in_dns/>

---- cut here ----
On the Time Value of Security Features in DNS
Sep 13, 2013 10:45 AM PDT
By Paul Vixie

There are some real problems in DNS, related to the general absence of
Source Address Validation (SAV) on many networks connected to the
Internet. The core of the Internet is aware of destinations but blind to
sources. If an attacker on ISP A wants to forge the source IP address of
someone at University B when transmitting a packet toward Company C,
that packet is likely to be delivered complete and intact, including its
forged IP source address. Many otherwise sensible people spend a lot of
time and airline miles trying to improve this situation — I want to
shout out to Paul Ferguson and Daniel Senie for their excellent work on
BCP38 about 15 years ago. My own modest contribution is SAC004, a
pointy-haired-boss (PHB) description of the problem, which just had its
ten year anniversary during which we all agreed — rather morbidly — that
the problem had gotten nothing but worse in the ten years since publication.

The problems created for the Domain Name System (DNS) by the general
lack of SAV are simply hellish. DNS is both fragile and dangerous
because of the general lack of SAV — by which I mean that DNS can be
injured easily and that it is easy to use DNS as a weapon to injure
others. The main problems related to the general lack of SAV are:

    1. Indirect packet-bombing. Since anyone can impersonate anyone else
at the Internet's most fundamental packet-layer level, a DNS server who
receives a stream of requests from some impersonated victim will answer
those requests, therefore transmitting to the victim some response
traffic they did not solicit and can neither avoid nor shut off. Given
recent advances in DNS such as DNSSEC, responses are a lot larger than
they used to be — so the responses received by the victim can currently
be up to 70 times the size of the requests originally forged by the
attacker. The ratio between the size of the forged request and the
unsolicited response is called the "amplification factor". Solution: DNS
Response Rate Limiting (RRL), created in 2012 by Vernon Schryver and
Paul Vixie, allows a server to loosely keep track of repeated queries
and avoid answering query flows that would not have come from a
legitimate client. The decision criterion is that a legitimate client
would have stopped asking when they got the answer we sent earlier. We
still need universal SAV, but RRL gives us some breathing space.

    2. Datagram related cache poisoning. Since anyone can impersonate
anyone else at the Internet's most fundamental packet-layer level, a DNS
questioner who receives a stream of almost-matching answers to one of
their outstanding questions will quietly and efficiently sort through
that stream, waiting for a response having the correct 16-bit
transaction ID (TXID) as well as a correct 16-bit UDP source port. An
attacker adds their responses to this stream, making guesses at the
16-bit TXID and 16-bit UDP port. In less time than we'd like, the law of
averages gives the attacker their "in", and their poisonously wrong
answer will have the right pair of 16-bit identifying marks. The
questioner in this scenario is an ISP or university or company name
server answering for a large local population, so, the poisonously wrong
answer will be cached and shared with the local population. Solution:
UDP Source Port Randomization (SPR), invented by Dan Bernstein and
brought to bear on this problem by Dan Kaminsky. SPR expands the size of
the random target from 1-in-65,000 to 1-in-7,000,000, and lengthens the
average successful attack from "minutes at 100Mbit/sec" to "days at
100Mbit/sec". Note that SPR is only a band-aid — the real fix for all
forms of cache poisoning is DNSSEC.

    3. Fragmentation related cache poisoning. Modern DNS depends on
large UDP datagrams that will not necessarily fit into a single packet
on the wire. The Internet handles this by splitting the UDP datagram
into multiple IP packets called "fragments", where the first fragment
has the front of the datagram, the middle fragments if any have more of
the datagram, and the last fragment has the end of the datagram. The
security problem created by this is that the 16-bit DNS transaction ID
(TXID) and 16-bit UDP source port are only present in the first
fragment. The tie that binds all these fragments together so that they
can be reassembled at the destination is a 16-bit IP ID. Our earlier
experiences tell us that because SAV is not widely deployed, attackers
can forge the IP source address of any packet, including faked middle
and final fragments. And they only need to guess a 16-bit number in this
case, which means their target is 1-in-65000 again, just like DNS itself
was back before we deployed SPR. Solution: stop using fragmentation for
DNS, unless the message is signed with DNSSEC or Transaction Signatures
(TSIG). This is a nice neat solution, since the main reason we need
large UDP datagrams in modern DNS is because of DNSSEC. Fragments are
not a concern when data is crypto-authentically signed by DNSSEC or
TSIG, because forging a middle or final fragment will result in a
reassembled datagram with bad signatures.
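The arithmetic behind the three items above can be sketched in a few
lines of Python. This is only an illustration: the function names and
the sample packet sizes are assumptions made for this example, and the
only figures taken from the text are the 16-bit field widths.

```python
def amplification_factor(request_bytes: int, response_bytes: int) -> float:
    """Ratio of the unsolicited response size to the forged request size."""
    return response_bytes / request_bytes


def guess_success_probability(id_bits: int, forged_packets: int) -> float:
    """Chance that at least one of `forged_packets` independent random
    guesses matches a uniformly chosen identifier of `id_bits` bits."""
    space = 2 ** id_bits
    return 1.0 - (1.0 - 1.0 / space) ** forged_packets


if __name__ == "__main__":
    # Hypothetical sizes: a 64-byte query drawing a 3,000-byte signed answer.
    print(f"amplification: {amplification_factor(64, 3000):.1f}x")
    # A single 16-bit field, such as the TXID alone or a fragment's IP ID.
    for n in (1_000, 65_536, 200_000):
        print(n, round(guess_success_probability(16, n), 3))
```

The second function shows why a lone 16-bit identifier is weak: the
attacker's odds climb quickly with the number of forged packets that
arrive while the question is still outstanding.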

DNS-based packet-bombing attacks are all the rage. They are often
directed at web servers or IRC servers or gaming servers, but just as
often at other DNS servers. Some of the attacks are motivated by anger
or amusement, and others by ransom or protection demands. "Nice online
gambling establishment you've got there… it'd be a real shame if
something bad happened to it."

SAV, where deployed, stops DNS packet-bombing attacks, and also stops
off-path DNS poisoning attacks. However, SAV has to be deployed on the
attacker's network, and is therefore not a viable defense. Things could
improve if a lot of operators all over the Internet decided to make
common cause with each other and all do the right and necessary thing by
deploying SAV locally even though the beneficiaries would only be users
of other networks, and no one could catch them if they cheated. This
outcome seems unlikely but several of us will keep on trying to get the
word out about it.

DNSSEC, where deployed, stops all known DNS poisoning attacks, on-path
or off-path. However, it has to be deployed both by a domain owner and
by name server operators. There are tens of millions of old
DNSSEC-incapable name servers out there, all outside the domain owner's
influence. We are at the time of this writing sixteen years into the
DNSSEC effort, and universal deployment does not seem near, but several
of us will keep trying to get the word out about it.

These are real problems that your colleagues in the Internet engineering
and operations worlds are wrestling with every week or so, if not every
day. There are however weaker problems, less real, often spoken of,
sometimes reported, but also disputed. A widely circulated vulnerability
report in recent weeks has occasioned this article, and before we get to
less well known wrong-headed beliefs about what's wrong with the DNS and
how to fix them, let's get right at the headlines.

RRL Slip Frames

It's been observed, correctly, that DNS Response Rate Limiting (RRL)
interacts poorly with UDP Source Port Randomization (SPR) as a fix for
Kaminsky-style DNS cache poisoning. What happens is that each response
which is deliberately dropped by RRL lengthens the time window during
which an attacker can flood a questioner with possibly-matching answers
to an outstanding question. Simply put, under RRL, the lifetime of a
question can be ~30 seconds, whereas without RRL, the lifetime is ~30
milliseconds. Because of the law of averages, this ~1000x increase in
the length of the time window yields a corresponding improvement in
attack effectiveness. The proposed solution, with which RRL's creators do
not agree, is to make the default RRL "slip" value be 1 rather than 2. In
this configuration there would be no dropped responses, only Slip frames
(TC=1 responses) urging a questioner — if there is one — to retry with TCP.
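The effect of the "slip" value can be sketched as follows. This is a
minimal model of the behavior described above, not the code of any real
name server: the function names are hypothetical, and it assumes that
every response being counted has already exceeded the rate limit.

```python
from collections import Counter


def rrl_action(limited_count: int, slip: int) -> str:
    """Decide the fate of the Nth (1-based) response that is over the limit.

    slip=0 drops everything, slip=1 truncates everything, and slip=N
    truncates every Nth rate-limited response while dropping the rest.
    """
    if slip > 0 and limited_count % slip == 0:
        return "slip"  # send a small TC=1 reply urging a retry over TCP
    return "drop"      # send nothing at all


def summarize(total_limited: int, slip: int) -> Counter:
    """Count how many rate-limited responses are slipped versus dropped."""
    return Counter(rrl_action(n, slip) for n in range(1, total_limited + 1))


if __name__ == "__main__":
    for slip in (1, 2):
        print(f"slip={slip}:", dict(summarize(10, slip)))
```

With "slip=2" half of the rate-limited responses vanish entirely, while
with "slip=1" every one of them still produces a packet on the wire.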

This proposal causes more problems than it solves, and is in fact
unnecessary, because the problem described is quite livable. Here are
the details.

First let's consider the impact on the packet-bombing victim if a
reflecting name server uses RRL with "slip=1" such that there are no
dropped responses, only Slip frames. This victim will see a drop in bits
per second but no drop in packets per second. This is because Slip
frames are smaller than real answers; in fact, Slip frames are identical
in size to the question, and so a "slip=1" proponent might validly call
this configuration attenuative — the attack heard by the victim will
have fewer bits in it thanks to RRL. We are however directly aware of a
vast number of routers, switches, servers, name servers, firewalls, and
other on-path devices whose principal bottleneck is packets, not bits.
That is, these devices might be able to receive or forward five hundred
megabits per second (500 Mbit/sec) of large packets but only fifty
megabits per second (50 Mbit/sec) of small packets. This is weak
engineering on their part but we don't get to judge the manufacturers or
the operators of these weak devices — we must take them into account
when planning our defense. The creators of DNS RRL did take account of
these common limitations, and our conclusion was, RRL must be
attenuative in packets per second, not just in bits per second, in order
to serve its intended purpose. Real operational experience has shown
that "slip=2" makes a server unattractive as a denial-of-service
reflector, whereas "slip=1" does not.
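The difference between attenuating bits and attenuating packets can be
put into numbers. The sketch below is a simplification under stated
assumptions: the function name, the 64-byte query, the 3,000-byte answer
and the 100,000 forged queries per second are all hypothetical, and
every forged query is treated as already over the rate limit. The one
fact taken from the text is that a Slip frame is the size of the question.

```python
def victim_load(forged_qps, query_bytes, answer_bytes, slip=None):
    """Return (packets/sec, bits/sec) arriving at the victim.

    slip=None models a reflector without RRL; slip=1 or slip=2 models
    RRL where only every `slip`-th rate-limited response is sent, as a
    question-sized TC=1 Slip frame.
    """
    if slip is None:
        return forged_qps, forged_qps * answer_bytes * 8
    slipped = forged_qps // slip
    return slipped, slipped * query_bytes * 8


if __name__ == "__main__":
    for label, slip in (("no RRL", None), ("slip=1", 1), ("slip=2", 2)):
        pps, bps = victim_load(100_000, 64, 3000, slip)
        print(f"{label}: {pps} packets/sec, {bps / 1e6:.1f} Mbit/sec")
```

Under these assumptions "slip=1" cuts the bit rate sharply but leaves
the packet rate untouched, whereas "slip=2" halves the packet rate too.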

Second let's examine the real client who would like to make a real query
during a packet-bombing attack in which their IP address is being forged
in query storms sent to an interesting authority name server who has
enabled RRL. This client (i.e., victim) is hearing whatever reflective
debris results from the attack, like a large number of unsolicited
responses. That debris may be enough to saturate the path from the
reflecting server to the victim, but let's assume for a moment that
there's enough capacity for the victim to ask a real question and get a
real answer even in the midst of this storm. So the victim will not only
hear the attack flow but will have its real and legitimate questions
swept up in the resulting RRL dragnet. Where "slip=2" as the RRL
designers recommend, the victim will see a mixture of dropped responses
and Slip frames. When it sees a drop it will retry with UDP, whereas
when it sees a Slip it will retry with TCP. This mix of retry type,
roughly half UDP and half TCP, is the best case scenario, because the
victim has a great chance of acquiring its real answer after which it
will stop asking the attack-similar question. A pure TCP fallback
strategy would be less reliable due to the fragility of TCP/DNS, about
which, more will be said below. Since RRL's goals are to both avoid
congestion and preserve content reachability, the default "slip=2"
really is far better than "slip=1".

Finally let's look at detectability. It's been observed that an
authority server who enables RRL with the default "slip=2" configuration
will see its names more vulnerable to Kaminsky-style DNS poisoning
attacks. The general increase is from "days of 100 Mbit/sec blast" to
"hours of 100 Mbit/sec blast". This increase cannot be simply ignored,
but on consideration of the attack surface, we find that a recursive
name server that is unmonitored to the extent that a fat 100 Mbit/second
blast can go unnoticed for many hours, is a recursive name server that
has bigger problems of its own, and creates bigger problems for all of
us than increased susceptibility to DNS poisoning. We urge the operators
of such recursive servers to either close their server to public access,
or to install a firewall between their server and the rest of us,
and in either case to please deploy DNSSEC validation. We do not believe
that there are a lot of authority server operators lying awake nights
right now worrying about the difference between "slip=2" and "slip=1" —
because they are too busy lying awake nights thinking about the tens of
millions of recursive name servers that are either open to the public
Internet, or which have not been patched for SPR, or both. The "slip=2"
problem, if any, is specific to certain names, and still requires many
hours of uninterrupted 100 Mbit/sec blasting from the attacker to the
victim in order to have a chance at success. This level of threat is
beneath concern for Internet infrastructure operators.

Use of TCP

The designers, implementers, and operators of DNS infrastructure are
often exhorted, "why don't we just use TCP?" The attraction of TCP is
obvious — it is not susceptible to SAV attacks. A TCP packet whose IP
source address is forged has no impact on anybody, since the attacker's
inability to hear the victim's response prevents TCP from "starting up"
and from consuming any server resources. However, the reasons not to use
TCP are just as obvious. DNS uses UDP by default, and TCP as a fallback,
and this design element was in no way accidental, and is not subject to
change at this late date. Let's explore the reasons.

First there's total transaction time. DNS/UDP is a single round trip
protocol: a question goes out, the answer comes back. Even if the answer
is fragmented and therefore contains several packets, those packets will
be minimum spaced, back to back, thus fitting into a single round trip
time (RTT). TCP by comparison is a 3xRTT protocol, requiring a minimum
of three round trips to exchange a question and an answer. The SYN goes
out, the SYN+ACK comes back, an ACK+question goes out, then an
ACK+response comes back, a FIN comes back, a FIN+ACK goes out, and
finally an ACK comes back. With even moderate RTTs measured in the 50
millisecond to 100 millisecond range, DNS/TCP has far lower throughput
than DNS/UDP simply because of the speed of light and the laws of
physics and the number of round trips involved. It's reasonable to
expect a small 1U Linux rack-mount server to handle 100 Kq/sec of pure
DNS/UDP but only 5 Kq/sec or less when tested with pure DNS/TCP.
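The round-trip arithmetic can be sketched as below. The one-round-trip
and three-round-trip counts come from the paragraph above; the function
names are hypothetical, and the model covers only the latency seen by a
single client issuing one query at a time. It does not model the server
capacity figures quoted above, which depend on per-connection overhead.

```python
UDP_ROUND_TRIPS = 1  # question out, answer back
TCP_ROUND_TRIPS = 3  # setup, question and answer, teardown


def transaction_ms(rtt_ms: float, round_trips: int) -> float:
    """Wall-clock time for one complete transaction."""
    return rtt_ms * round_trips


def serial_rate(rtt_ms: float, round_trips: int) -> float:
    """Transactions per second for a client that waits for each one."""
    return 1000.0 / transaction_ms(rtt_ms, round_trips)


if __name__ == "__main__":
    for rtt in (50, 100):
        udp = serial_rate(rtt, UDP_ROUND_TRIPS)
        tcp = serial_rate(rtt, TCP_ROUND_TRIPS)
        print(f"RTT {rtt} ms: UDP {udp:.1f}/sec, TCP {tcp:.1f}/sec")
```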

It's been argued that the extra round trips of using TCP for DNS can be
amortized over many queries, thus if you leave the TCP sessions open you
can send many queries and receive many answers at a rate closer to one
transaction per RTT. This observation is not without merit, but it does
not hold up. DNS servers, especially older ones, are typically limited to
a few dozen open TCP sessions per server, whereas the number of high
volume flows handled by a busy name server is in the thousands or tens
of thousands. This is, at least, an orders-of-magnitude problem. But in
addition, we have the problem of the DNS specification, which specifies
that the initiator will close the TCP session when its work is complete
and that a server shall not unilaterally close such sessions even for
resource exhaustion reasons without first waiting about 30 seconds. This
makes such servers vulnerable to trivial TCP exhaustion attacks, where
any attacker can at very low cost acquire and hold all of a server's TCP
resources. Since many attackers have denial of service as one of their
goals, we as defenders know that any strategy which urges more
transactions toward TCP dependency, is itself too fragile.
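How cheap such an exhaustion attack is can be estimated with one
division. This is a back-of-the-envelope sketch: the function name and
the figure of 48 session slots are assumptions chosen to stand in for "a
few dozen", and the 30-second hold time is the approximate wait
mentioned above.

```python
def connections_per_second_to_exhaust(session_slots: int,
                                      hold_seconds: float) -> float:
    """New idle connections per second an attacker must open to keep
    every slot occupied, if the server frees a slot only after
    `hold_seconds` of inactivity."""
    return session_slots / hold_seconds


if __name__ == "__main__":
    rate = connections_per_second_to_exhaust(48, 30.0)
    print(f"{rate:.1f} new connections per second hold all 48 slots")
```

Under these assumptions fewer than two new connections per second are
enough to lock legitimate TCP clients out of the server.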

A new protocol could be designed that would not have these problems, or
indeed a change could be published to the DNS specification which gave
us a more robust session handling mechanism for DNS/TCP. If not for the
tens of millions of existing servers who will behave in the old way, and
the estimated two decades before a change like this could reach critical
mass, the idea of fixing DNS/TCP would have some merit.

Query Type ANY

Many recent packet-bombing attacks have used query type ANY in order to
incite a reflecting server to respond with a large (amplified) answer.
Query type ANY was designed for diagnostic purposes and has no real
operational use, and so, some inventive server operators have modified
their name servers to drop queries of type ANY. This is short sighted in
two ways. First, it is not necessary to use query type ANY to get a
server to send a large answer — many other query types such as TXT or
even MX are capable of generating large answers that provide excellent
amplification. In addition, the advent of DNSSEC means that all answers
are far larger than they used to be, especially for negative answers
which are DNSSEC's largest kind.

Even more importantly, security is a matter of economics. Attackers and
defenders are trying to drive their opponent's costs up while driving
their own costs down — that's the game. This move, where a server
operator modifies their name server to drop queries of type ANY, is
exactly wrong. Any attacker who is in the least way inconvenienced by
this change can simply change their attack to use a different query
type, at which point the defense has no easy next step. Secure defense
must be designed according to the attacker's resulting alternatives and
the defender's resulting costs for each of those.

State Tension

There is a necessary tension between performance and safety. DNS is an
incredibly large and busy global system involving hundreds of millions
of agents and many billions of transactions every day. This system
scales because it uses UDP, a stateless transport protocol that requires
only one round trip per transaction and no inter-transaction state. As a
result of DNS's primary dependency on UDP, and due to the lack of
universal deployment of SAV, DNS's extreme performance comes along with
extreme fragility and extreme danger. If we want less fragility and less
danger then we're going to have to add state somewhere. RRL adds
opportunistic light weight state to servers, allowing them to avoid
serving as reflecting amplifiers in today's common packet-bombing
attacks. TCP adds required heavy weight state, at a cost far higher than
we can accept given DNS's massive size and transaction load. Other forms
of state are possible, and Donald Eastlake proposed an opportunistic
medium weight method in 2007 that's worth another look: DNS Cookies.

In Eastlake's DNS Cookie proposal, a requestor can include a large
random number in the clear text of a request, and a responder can echo
this back and append its own large random number in the clear text of a
response. There is no privacy or secrecy offered. Once each side has
proved to the other that they are adding these random numbers to their
messages, each other-side can opportunistically drop any message lacking
the correct random number for that endpoint. This is problematic due to
NAT, where the nature of an endpoint is no longer simply "the IP address
you and they both think they are using", but that's a detail. There are
other details which also need work, but what's plain by now is that the
conclusion reached in 2007 was wrong. The IETF DNS Extensions Working
Group (DNSEXT) determined in 2007 that this proposal was too complex for
its use case. What we know now that we did not know then is that the use
case is every DNS server and every DNS transaction. Let's reconsider,
noting that the roll-out can be incremental — there's no flag day and no
fork lift.
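The exchange described above can be sketched in Python. This is a toy
model and not the wire format of the Eastlake draft: the class names,
the 8-byte cookie length and the single responder-wide cookie are all
assumptions made for illustration.

```python
import hmac
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    payload: str
    client_cookie: bytes                   # requestor's random number, in clear text
    server_cookie: Optional[bytes] = None  # responder's random number, if present


class Responder:
    def __init__(self) -> None:
        self.cookie = secrets.token_bytes(8)

    def answer(self, query: Message) -> Message:
        # Echo the requestor's number and append our own.
        return Message("answer to " + query.payload,
                       query.client_cookie, self.cookie)


class Requestor:
    def __init__(self) -> None:
        self.cookie = secrets.token_bytes(8)

    def ask(self, payload: str) -> Message:
        return Message(payload, self.cookie)

    def accept(self, response: Message) -> bool:
        # Drop any response that does not echo our random number.
        return hmac.compare_digest(response.client_cookie, self.cookie)


if __name__ == "__main__":
    client, server = Requestor(), Responder()
    reply = server.answer(client.ask("www.example.com A"))
    forged = Message("poison", secrets.token_bytes(8), server.cookie)
    print(client.accept(reply), client.accept(forged))
```

An off-path attacker who cannot see the requestor's random number has to
guess all of its bits, which is the same kind of protection SPR gives
but with a far larger search space.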

Conclusion

Secure design isn't just preventing the problems you can think of or
that you're having. It also doesn't mean preventing problems that you've
heard anecdotally that other people have thought of or might be having.
Security is about economics. One design is more secure than another if
its risks and risk related costs are lower, which means some thought has
to be given to the costs an attacker would have. Changing a design for
security reasons requires careful cost analysis of the defender's
alternatives, and careful benefit analysis of what the attacker's
alternatives would be. Goodness in defense comes from reducing the
defender's costs and possibly raising the attacker's costs.

More importantly, the Internet is 30 years old and yet just beginning.
Our designs should take the past into account and should learn from
present day experiences, but we must look primarily to the future. The
largest part of the cost:benefit curve, and the greatest area under
same, is in the future. When we push for solutions to today's problems
we run the risk of re-fighting the previous war and also the risk of
driving the defender's costs up for no good reason because we didn't
change the attacker's costs enough.

A DNS authority server operator whose intent in deploying RRL is to make
their servers less attractive as reflectors for packet-bombing attacks
will want the default Slip value of "2" since this attenuates both bits
and packets. Any change in the risk of that authority's names being
poisoned as a result of RRL will be so small as to be of academic
interest only.

Glossary

DNS, Domain Name System: a universal, distributed, reliable, autonomous,
hierarchical naming system used by the Internet.

DNSSEC, Secure DNS: a security layer for DNS data, which protects source
data against modification by third parties.

PHB, Pointy Haired Boss: a Dilbert reference; a manager who needs
everything boiled down to, or slightly beyond, the high level essentials.

RRL, Response Rate Limiting: a server side method of detecting and
blunting some kinds of packet-bombing attacks.

RTT, Round Trip Time: the total transmission delays from the initiator
to the responder and back again (depends on speed of light).

SAV, Source Address Validation: filtering of packets leaving edge (such
as ISP, university, or commercial) networks whose IP source-address is
not local.

SPR, Source Port Randomization: a deliberate fuzzing of the UDP port
number, to avoid prediction by off-path attackers.

TSIG, Transaction Signatures: a security layer for DNS transactions,
which protects messages against modification by third parties.

Bibliography

BCP38, “Network Ingress Filtering”, IETF RFC #2827 and BCP #38, Paul
Ferguson and Daniel Senie, May 2000.

SAC004, “Securing the Edge”, ICANN SSAC Document #4, Paul Vixie, October
2002.

“Domain Name System (DNS) Cookies”, IETF DNSEXT I-D revision #2, Donald
Eastlake, August 2007.

DNS RRL, “DNS Response Rate Limiting”, Paul Vixie and Vernon Schryver,
April 2012.

“Fragmentation Considered Poisonous”, Amir Herzberg and Haya Shulman,
May 2012.

“DNS amplification attacks and open DNS resolvers”, CERT.be, September 2013.

“Blocking DNS Messages is Dangerous”, Mr. Florian MAURY, DNS-OARC,
October 2013.

NCSC-2013-0597, “Rate limiting of DNS responses caused vulnerability”,
NCSC.NL, September 2013.

By Paul Vixie, CEO, Farsight Security.

---- cut here ----
-- 
Fernando Gont
SI6 Networks
e-mail: fgont at si6networks.com
PGP Fingerprint: 6666 31C6 D484 63B2 8FB1 E3C4 AE25 0D55 1D4E 7492