On February 28, 2018, GitHub, one of the leading developer collaboration sites, was deluged with inbound traffic from a DDoS attack that peaked at 1.35 Tbps, or roughly 127 Mpps, which may be the largest recorded denial-of-service attack to date.
The attack was carried out by exploiting open memcached servers to amplify traffic toward the target, a technique known as Memcrashing: small requests with a spoofed source address are sent to exposed servers, which answer with far larger responses aimed at the victim. While the volume of traffic generated by the DDoS attack was impressive, perhaps more so was that the resulting outage lasted only about 10 minutes.
The quick response came from GitHub's network monitoring systems, which detected the traffic anomaly and notified engineers, who then acted to mitigate the attack, as Sam Kottler explains in an incident report on GitHub Engineering:
“At 17:21 UTC our network monitoring system detected an anomaly in the ratio of ingress to egress traffic and notified the on-call engineer and others in our chat system.”
“Given the increase in inbound transit bandwidth to over 100Gbps in one of our facilities, the decision was made to move traffic to Akamai, who could help provide additional edge network capacity. At 17:26 UTC the command was initiated via our ChatOps tooling to withdraw BGP announcements over transit providers and announce AS36459 exclusively over our links to Akamai. Routes reconverged in the next few minutes and access control lists mitigated the attack at their border. Monitoring of transit bandwidth levels and load balancer response codes indicated a full recovery at 17:30 UTC. At 17:34 UTC routes to internet exchanges were withdrawn as a follow-up to shift an additional 40Gbps away from our edge.”
“Making GitHub’s edge infrastructure more resilient to current and future conditions of the internet and less dependent upon human involvement requires better automated intervention. We’re investigating the use of our monitoring infrastructure to automate enabling DDoS mitigation providers and will continue to measure our response times to incidents like this with a goal of reducing mean time to recovery (MTTR).” – Sam Kottler via GitHub Engineering
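The report credits detection to an anomaly in the ratio of ingress to egress traffic. To make that idea concrete, here is a minimal Python sketch of one way such a check could work. It is not GitHub's monitoring system: the class name `RatioAnomalyDetector`, the window size, the threshold, and the traffic figures other than the 100 Gbps ingress level are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


class RatioAnomalyDetector:
    """Hypothetical detector: flags a sample whose ingress/egress ratio
    sits far above the recent baseline. All defaults are illustrative."""

    def __init__(self, window=60, threshold=4.0, min_samples=10):
        self.history = deque(maxlen=window)  # recent "normal" ratios
        self.threshold = threshold           # deviations above baseline that count as anomalous
        self.min_samples = min_samples       # baseline size needed before alerting

    def observe(self, ingress_bps, egress_bps):
        """Record one sample and return True if it looks anomalous."""
        ratio = ingress_bps / max(egress_bps, 1.0)  # guard against division by zero
        anomalous = False
        if len(self.history) >= self.min_samples:
            baseline = mean(self.history)
            # Floor the spread so a perfectly flat baseline does not alert on noise.
            spread = max(stdev(self.history), 0.05 * baseline)
            anomalous = (ratio - baseline) / spread > self.threshold
        if not anomalous:
            # Only normal samples update the baseline, so an ongoing
            # attack does not gradually become the new "normal".
            self.history.append(ratio)
        return anomalous


if __name__ == "__main__":
    detector = RatioAnomalyDetector()
    # Assumed steady state: 10 Gbps in, 40 Gbps out.
    for _ in range(20):
        detector.observe(10e9, 40e9)
    # Ingress jumps to 100 Gbps, the level cited in the report for one facility.
    if detector.observe(100e9, 40e9):
        print("anomaly: notify the on-call engineer")
```

A site that mostly serves content normally sends more than it receives, so a sudden inversion of that ratio is a strong signal. Keeping attack samples out of the baseline is a deliberate choice here; a production system would also need to handle legitimate shifts in traffic mix.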