FIXED: Amazon EC2 vulnerable to UDP flood attack

UPDATE 2009-10-12: I’m happy to let you know this post is no longer relevant. The Amazon AWS team successfully deployed a fix, and the scenario used to simulate a Denial of Service attack with a UDP flood no longer applies. All that in less than 24 hours after the link was published on Twitter. Good job!

Original post follows.

The unfortunate events surrounding the DDoS attack against BitBucket kicked off heated discussions about the nature of this vulnerability. While Amazon officially acknowledged it as a single isolated incident, many others started asking why it happened in the first place:

  • Was BitBucket’s security group configuration set to block UDP traffic?
  • Why didn’t they have better visibility of the ongoing attack?
  • Is this really Amazon’s fault?

Both personal and professional interest led me to find out more. Having designed a series of tests to replicate this scenario, I started the first instance and set up the target environment.

instance : c1.medium (us-east-1d)
EBS volume : 200 GB attached to (/dev/sdf)
monitoring : vmstat, netstat, iptraf, Amazon CloudWatch
security group : allowed SSH only (port 22/TCP)

The UDP flood was generated from a second instance (c1.medium) using a simple Perl script, which managed to produce a whopping 650 Mbit per second of traffic (according to iptraf) using 1KB packets sent to random ports on the target IP.
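To put that figure in perspective, here is a rough back-of-the-envelope conversion from bandwidth to packet rate. It is only a sketch: it assumes "1KB" means 1024 bytes of payload and "650 Mbit" means 650 × 10^6 bits, and it ignores UDP/IP header overhead. The function name is made up for illustration.

```python
# Assumptions (not stated in the post): 1KB = 1024 bytes of payload,
# 650 Mbit/s = 650 * 10**6 bits per second, protocol headers ignored.
RATE_BITS_PER_SEC = 650 * 10**6
PACKET_BYTES = 1024


def packets_per_second(rate_bits_per_sec: int, packet_bytes: int) -> float:
    """Convert a bit rate into a packet rate for fixed-size packets."""
    return rate_bits_per_sec / (packet_bytes * 8)


pps = packets_per_second(RATE_BITS_PER_SEC, PACKET_BYTES)
print(f"{pps:,.0f} packets per second")
```

Under those assumptions a single c1.medium was pushing somewhere around 79,000 packets every second at the target.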

Test 1. Simply letting the flood run was successful in the sense that nothing was visible on the target machine. Still surprised by the traffic level generated on the source box, I pointed the UDP flood at another machine, one with a security group allowing UDP traffic (ports 0 – 65535), to check whether the traffic could reach another box. And it could. Not only within the same availability zone, but also across different ones (tested us-east-1c and us-east-1b).

Test 2 consisted of formatting the prepared EBS volume, with 5 samples for each scenario: with and without the UDP flood.

No traffic (1m15s)
UDP Flood (2m54s)

During the test there was only a moderate increase in IO wait (somewhere between 2 – 4%).
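The two format timings above translate into a slowdown factor. A minimal sketch of the arithmetic, assuming the durations are in the "XmYs" form shown (the helper name is hypothetical):

```python
import re


def to_seconds(duration: str) -> int:
    """Parse a duration written like '1m15s' into whole seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", duration)
    if match is None:
        raise ValueError(f"unexpected duration format: {duration!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


baseline = to_seconds("1m15s")  # format time with no traffic
flooded = to_seconds("2m54s")   # format time under UDP flood
slowdown = flooded / baseline

print(f"{baseline}s vs {flooded}s, slowdown factor {slowdown:.2f}")
```

So formatting took more than twice as long under the flood, even though the IO wait figures looked harmless.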

Test 3. Bonnie++ performance test of the EBS volume. Running with no incoming traffic, it took around 8 minutes to produce a quite reasonable report. Having switched on the UDP flood, I repeated the same test, expecting to see results in a similar time. Fifteen minutes later, bonnie still hadn’t finished even the third step (rewriting). Another 10 minutes without any significant progress prompted me to investigate what was going on. The box was performing virtually no IO operations, and the time spent waiting for IO hit 100% on every second reading (1s delay). Bingo!

To verify whether the problem was really caused by the incoming UDP flood, I stopped the traffic for a brief interval (around 7 seconds) and monitored the box using vmstat:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
0  1 893480  11272   3112 1699240    0    0     0     0   10   11  0  0 66 34
0  1 893480  11272   3112 1699240    0    0     0     0    9    6  0  0  0 100
0  1 893480  11272   3112 1699240    0    0     0     0   10    9  0  0 67 33
0  1 893480  11272   3112 1699240    0    0     0     0   11    8  0  0  0 100
0  1 893480   8824   3100 1700052    0    0 23808 24864  962  697  0  1 68 31
0  1 893480  12284   3084 1697988    0    0 16384 16576  711  424  0  2  4 93
0  1 893480   9020   3084 1700088    0    0 20480 20720  817  563  0  1 68 31
0  1 893480  10432   3072 1700192    0    0 20864 20720  907  612  0  4  5 90
0  1 893480  10976   3040 1699724    0    0 15620 12432  588  423  0  1 68 31
0  1 893480  10872   3044 1698556    0    0 12676 16576  600  350  0  2  2 96
0  1 893480  10328   3024 1700676    0    0 19976 16576  761  535  0  1 68 31
0  1 893480  12408   3004 1698096    0    0  8708 12432  457  254  0  1  4 95
0  1 893480  12408   3004 1698096    0    0     0     0    9    7  0  0 67 33
0  1 893480  11636   3004 1699120    0    0  1024     0   38   38  0  0  0 100
0  1 893480  10548   3004 1700420    0    0  1280     0   47   45  0  0 66 33
0  1 893480  10188   3004 1700756    0    0  3584  4144  195  110  0  0  0 100
0  1 893480  10120   2992 1697968    0    0  6404  8288  256  205  0  0 67 33
0  1 893480  12468   2992 1696864    0    0  8064  8288  343  250  0  0  2 98
0  1 893480  11720   2972 1696984    0    0 12420 12432  495  333  0  0 67 32
0  1 893480  10136   2976 1700800    0    0  6916  4144  321  190  0  0  0 100
0  1 893480  11972   2956 1698820    0    0  4096  4144  161  117  0  0 67 33
0  1 893480  11364   2960 1699480    0    0  3844  4144  200  126  0  0  1 99
0  1 893480  11432   2960 1699480    0    0  2944  4144  160   91  0  0 66 34
0  1 893480  11156   2960 1699820    0    0   256     0   18   12  0  0  0 100
0  1 893480  10884   2960 1700020    0    0   256     0   17   17  0  0 66 34
0  1 893480  10856   2960 1700076    0    0     0     0    9    8  0  0  0 100
0  1 893480  10856   2960 1700076    0    0     0     0    9    9  0  0 67 33

As you can see in the fifth row of readings, IO traffic resumed (the block-in and block-out counters jump from zero to more than 20,000), roughly correlating with the moment the incoming traffic stopped. Seven seconds later, with the UDP traffic back on, the box tried to keep up for another quarter of a minute before giving up.
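If you would rather not eyeball the numbers, the same observation can be checked with a few lines of Python. This is only a sketch: it assumes the standard vmstat column order (r, b, swpd, free, buff, cache, si, so, bi, bo, in, cs, us, sy, id, wa), uses just the first six readings from the output above, and the function names are invented for illustration.

```python
# Assumed standard vmstat column order; not spelled out in the capture.
COLUMNS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
           "bi", "bo", "in", "cs", "us", "sy", "id", "wa"]

# First six readings copied from the vmstat output in this post.
SAMPLE = """\
0  1 893480  11272   3112 1699240    0    0     0     0   10   11  0  0 66 34
0  1 893480  11272   3112 1699240    0    0     0     0    9    6  0  0  0 100
0  1 893480  11272   3112 1699240    0    0     0     0   10    9  0  0 67 33
0  1 893480  11272   3112 1699240    0    0     0     0   11    8  0  0  0 100
0  1 893480   8824   3100 1700052    0    0 23808 24864  962  697  0  1 68 31
0  1 893480  12284   3084 1697988    0    0 16384 16576  711  424  0  2  4 93
"""


def parse_rows(text):
    """Turn vmstat data lines into dicts, skipping header or odd lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS) or not fields[0].isdigit():
            continue
        rows.append(dict(zip(COLUMNS, map(int, fields))))
    return rows


def first_active_row(rows):
    """Return the 1-based index of the first reading with block IO, if any."""
    for number, row in enumerate(rows, start=1):
        if row["bi"] > 0 or row["bo"] > 0:
            return number
    return None


rows = parse_rows(SAMPLE)
resumed_at = first_active_row(rows)
stalled = [row for row in rows if row["bi"] == 0 and row["bo"] == 0]
print(f"IO resumed at reading {resumed_at}; {len(stalled)} stalled readings before it")
```

The stalled readings all show zero blocks in and out, with IO wait alternating between roughly a third and 100%.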

And the CloudWatch graphs? Nothing! Based on my notes, the first bonnie run occurred at 10:40, I switched on the UDP flood at 10:50, and I started the second bonnie run at 10:52. My patience ran out before 11:30, where there’s a small peak caused by an interactive iptraf session.

At this point there was no reason to continue testing. All IO operations to and from the EBS volume seemed to be blocked by UDP traffic generated by a single instance!

Conclusion

The BitBucket guys had every reason to be angry. Blocking UDP in the security group configuration only hides the problem. Contrary to Jesper Nøhr’s statement, no peaks were visible during this experiment in the paid monitoring service, Amazon CloudWatch (see above), which was probably all the information available to AWS first-line support.

This corresponds to the ‘black box’ described by Jesper. Looking back on the results it’s obvious that

  • on-demand network capacity backfired in this case
  • security group configuration is most likely applied on the host system
  • the host architecture seems to share the same network interface(s) for regular network traffic and for traffic to and from EBS volumes. Even though instances have only a single network interface, I would expect this separation to be implemented on the host system. Segregation of network traffic is one of the first lessons learned in highly exposed clustered environments.
  • a week after the attack, there still isn’t any fix in place. Hello, Amazon?!?!

To be fair, this was the first incident of such magnitude. Let’s hope the Amazon AWS team comes up with an architectural fix before somebody uses the vulnerability in a much wider and more devastating attack. In the meantime, the only workaround we can apply is to hide our instances as much as we can. Load balancers and proxies in front of the worker instances should be enough, as long as they don’t share the same host machine.

Have a good weekend and good luck protecting your instances’ IPs!

PS: Who had the same dark thought as I just had? What about S3?

[UPDATE 2009-10-11 7:00pm] c1.xlarge instances are able to generate a UDP flood at a rate of 800 Mbps. My guess is that Amazon AWS is running 1 Gbps network infrastructure.