FIXED: Amazon EC2 vulnerable to UDP flood attack
UPDATE 2009-10-12: I’m happy to let you know this post is no longer relevant. The Amazon AWS team successfully deployed a fix, and the scenario used to simulate a Denial of Service attack using a UDP flood isn’t applicable anymore. All that in less than 24 hours after the link was published on Twitter. Good job!
Original post follows.
The unfortunate events surrounding the DDoS attack against BitBucket kicked off heated discussions about the nature of this vulnerability. While Amazon officially acknowledged this to be a single isolated incident, many others started asking why it happened in the first place:
- Was BitBucket’s security group configuration set to block UDP traffic?
- How come they didn’t have better visibility of the ongoing attack?
- Is this really Amazon’s fault?
Both personal and professional interest led me to find out more. Having designed a series of tests to replicate this scenario, I started the first instance and set up the target environment:
instance : c1.medium (us-east-1d)
EBS volume : 200 GB attached to (/dev/sdf)
monitoring : vmstat, netstat, iptraf, Amazon CloudWatch
security group : allowed SSH only (port 22/TCP)
The UDP flood was generated from a second instance (c1.medium) using a simple Perl script, which managed to generate a whopping 650 Mbit/s of traffic (according to iptraf) using 1 KB packets sent to random ports on the target IP.
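For a sense of scale, the packet rate implied by those figures can be estimated in a few lines of Python. This is only a back-of-the-envelope sketch: it assumes that 1 KB means 1024 bytes per packet, that the iptraf figure means 10^6 bits per second, and it ignores UDP/IP header overhead. The function name is made up for illustration.

```python
def packets_per_second(rate_mbit: float, packet_bytes: int) -> float:
    """Estimate packets per second for a given bit rate and packet size."""
    bits_per_packet = packet_bytes * 8
    return rate_mbit * 1_000_000 / bits_per_packet


# Figures from the test above: 650 Mbit/s using 1 KB packets.
# Assumption: 1 KB = 1024 bytes, headers not counted.
pps = packets_per_second(650, 1024)
print(f"roughly {pps:,.0f} packets per second")
```

Under those assumptions the source box was pushing somewhere around 79,000 packets every second at the target.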
Test 1. Simply letting the flood run was successful in the sense that nothing was visible on the target machine. Still surprised by the traffic level generated on the source box, I pointed the UDP flood at another machine, one with a security group allowing UDP traffic (ports 0 to 65535), to check whether the traffic was able to reach another box. And it was. Not only from the same availability zone, but also from different ones (tested us-east-1c and us-east-1b).
Test 2. Formatting the prepared EBS volume, with 5 samples for each scenario: with and without the UDP flood.
No traffic (1m15s)
UDP Flood (2m54s)
During the test there was only a moderate increase in IO wait (somewhere between 2% and 4%).
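To put the two format timings side by side, here is a small Python sketch that converts them to seconds and computes the slowdown. The helper name and the duration format it accepts are assumptions made for this illustration; only the two timings come from the test.

```python
import re


def parse_duration(text: str) -> int:
    """Convert a duration such as '1m15s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


baseline = parse_duration("1m15s")  # no traffic
flooded = parse_duration("2m54s")   # during the UDP flood
slowdown = flooded / baseline
print(f"{baseline}s vs {flooded}s: {slowdown:.2f}x slower")
```

In other words, formatting took a bit more than twice as long under the flood, even though IO wait barely moved.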
Test 3. Bonnie++ performance test of the EBS volume. Running with no incoming traffic, it took around 8 minutes to produce a quite reasonable report. Having switched on the UDP flood, I repeated the same tests, expecting to see results in a similar time. Fifteen minutes later, bonnie still hadn’t even finished the third step (rewriting). Another 10 minutes without any significant progress prompted me to look into what was going on. The box was performing virtually no IO operations, and the time spent waiting for IO hit 100% on every second reading (1s delay). Bingo!
To verify that the problem was really caused by the incoming UDP flood, I stopped the traffic for a brief interval (around 7 seconds) and monitored the box using vmstat:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b   swpd   free  buff   cache  si  so     bi     bo   in   cs  us  sy  id   wa
0 1 893480  11272  3112 1699240   0   0      0      0   10   11   0   0  66   34
0 1 893480  11272  3112 1699240   0   0      0      0    9    6   0   0   0  100
0 1 893480  11272  3112 1699240   0   0      0      0   10    9   0   0  67   33
0 1 893480  11272  3112 1699240   0   0      0      0   11    8   0   0   0  100
0 1 893480   8824  3100 1700052   0   0  23808  24864  962  697   0   1  68   31
0 1 893480  12284  3084 1697988   0   0  16384  16576  711  424   0   2   4   93
0 1 893480   9020  3084 1700088   0   0  20480  20720  817  563   0   1  68   31
0 1 893480  10432  3072 1700192   0   0  20864  20720  907  612   0   4   5   90
0 1 893480  10976  3040 1699724   0   0  15620  12432  588  423   0   1  68   31
0 1 893480  10872  3044 1698556   0   0  12676  16576  600  350   0   2   2   96
0 1 893480  10328  3024 1700676   0   0  19976  16576  761  535   0   1  68   31
0 1 893480  12408  3004 1698096   0   0   8708  12432  457  254   0   1   4   95
0 1 893480  12408  3004 1698096   0   0      0      0    9    7   0   0  67   33
0 1 893480  11636  3004 1699120   0   0   1024      0   38   38   0   0   0  100
0 1 893480  10548  3004 1700420   0   0   1280      0   47   45   0   0  66   33
0 1 893480  10188  3004 1700756   0   0   3584   4144  195  110   0   0   0  100
0 1 893480  10120  2992 1697968   0   0   6404   8288  256  205   0   0  67   33
0 1 893480  12468  2992 1696864   0   0   8064   8288  343  250   0   0   2   98
0 1 893480  11720  2972 1696984   0   0  12420  12432  495  333   0   0  67   32
0 1 893480  10136  2976 1700800   0   0   6916   4144  321  190   0   0   0  100
0 1 893480  11972  2956 1698820   0   0   4096   4144  161  117   0   0  67   33
0 1 893480  11364  2960 1699480   0   0   3844   4144  200  126   0   0   1   99
0 1 893480  11432  2960 1699480   0   0   2944   4144  160   91   0   0  66   34
0 1 893480  11156  2960 1699820   0   0    256      0   18   12   0   0   0  100
0 1 893480  10884  2960 1700020   0   0    256      0   17   17   0   0  66   34
0 1 893480  10856  2960 1700076   0   0      0      0    9    8   0   0   0  100
0 1 893480  10856  2960 1700076   0   0      0      0    9    9   0   0  67   33
As you can see in the fifth sample, IO traffic resumed, roughly correlating with the moment the incoming traffic stopped. Seven seconds later, with the UDP traffic back on, the box tried to keep up for another quarter of a minute before giving up.
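The same reading of the vmstat output can be done mechanically. The Python sketch below takes a handful of rows copied from the output above and flags the samples in which no blocks moved at all. It assumes the usual vmstat column order (r b swpd free buff cache si so bi bo in cs us sy id wa); the constant and function names are invented for this example.

```python
# Column positions in a vmstat data row (assumed standard layout):
# r b swpd free buff cache si so bi bo in cs us sy id wa
BI, BO, WA = 8, 9, 15

# A few rows copied from the vmstat output above.
SAMPLES = """\
0 1 893480 11272 3112 1699240 0 0 0 0 11 8 0 0 0 100
0 1 893480 8824 3100 1700052 0 0 23808 24864 962 697 0 1 68 31
0 1 893480 12408 3004 1698096 0 0 8708 12432 457 254 0 1 4 95
0 1 893480 12408 3004 1698096 0 0 0 0 9 7 0 0 67 33
"""


def summarise(row: str) -> dict:
    """Pull the block IO and IO-wait figures out of one vmstat row."""
    fields = [int(value) for value in row.split()]
    return {
        "blocks_in": fields[BI],
        "blocks_out": fields[BO],
        "io_wait_pct": fields[WA],
        # "stalled" here simply means no blocks read or written in the sample.
        "stalled": fields[BI] == 0 and fields[BO] == 0,
    }


summaries = [summarise(line) for line in SAMPLES.splitlines()]
for number, summary in enumerate(summaries, start=1):
    state = "stalled" if summary["stalled"] else "moving data"
    print(number, state, summary)
```

The first and last of these rows show zero blocks in and out, while the two in between fall inside the short window when the flood was paused.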
And in CloudWatch? Nothing! According to my notes, the first bonnie run started at 10:40, I switched on the UDP flood at 10:50, and I started the second bonnie run at 10:52. My patience ran out before 11:30, where there’s a small peak caused by an interactive iptraf session.
At this point there was no reason to continue testing. All IO operations to and from the EBS volume seemed to be blocked by UDP traffic generated by a single instance!
Conclusion
The BitBucket guys had every reason to be angry. Blocking UDP in the security group configuration only hides the problem. Contrary to Jesper Nøhr’s statement, no peaks were visible during this experiment in the paid monitoring service, Amazon CloudWatch (see above), which was probably all the information available to AWS first-line support.
This corresponds to the ‘black box’ described by Jesper. Looking back on the results, it’s obvious that:
- on-demand network capacity backfired in this case
- security group configuration is most likely applied on the host system
- the host architecture seems to share the same network interface(s) for regular network traffic and for network traffic to and from EBS volumes. Even though instances have only a single network interface, I would expect this separation to be implemented on the host system. Segregation of network traffic is one of the first lessons learned in highly exposed clustered environments.
- a week after the attack, there still isn’t any fix in place. Hello, Amazon?!?!
To be fair, it’s been the first incident of such magnitude. Let’s hope the Amazon AWS team will come up with an architecture fix before somebody uses the vulnerability in a much wider and more devastating attack. In the meantime, the only workaround we can apply is to hide our instances as much as we can. Load balancers and proxies in front of the worker instances should be enough, as long as they don’t share the same host machine.
Have a good weekend and good luck protecting your instances’ IPs!
PS: who had the same dark thought as I just had? What about S3?
[UPDATE 2009-10-11 7:00pm] c1.xlarge instances are able to generate a UDP flood at a rate of 800 Mbps. I guess Amazon AWS is running a 1 Gbps network infrastructure.