For some customers, JIRA and some other Atlassian products have been down for an entire week. Some are reporting that Atlassian says it could be another 2 weeks until the products are back up and running. That would make it worse than Roblox's 3-day outage back in October 2021. Why so many outages?
We don't know the full story behind Atlassian's outage yet, but both outages seem to be run-of-the-mill engineering issues. No nefarious hacks or exploits, no third-party or cloud provider downtime.
Roblox doesn't use the public cloud, whereas Atlassian's outage affects only cloud customers (on-prem deployments are functioning correctly). I believe that companies like Roblox will have trouble keeping up in a cloud services world where the bar is always being raised, but these outages aren't always a cloud issue.
The length of the Meta outage was due in small part to remote work: after a network misconfiguration took DNS offline, engineers couldn't access the internal tools and networks used to debug and remediate the problem. Maybe there's an opportunity to rethink infrastructure in a world where much of site reliability work is done completely remotely, which brings new failure modes of its own.
One thing companies are learning from Atlassian's radio silence on the outage: communication matters. Many customers have been left in the dark, and we'll see whether they use this as an opportunity to move some workflows off the product.