Facebook said a series of mistakes during network maintenance caused the outage that took its services offline on Monday.
Facebook’s family of apps, including Instagram, WhatsApp and Messenger, went offline for more than five hours as employees scrambled to fix the damage. More than 3.5 billion people around the world use Facebook services to communicate with friends and family, distribute political messages, and expand their businesses through advertising and community outreach.
Santosh Janardhan, Facebook’s vice president of infrastructure, wrote in a blog post that the problem began in a network that Facebook calls its “backbone,” which connects its data centers all over the world.
During network maintenance, a command was issued to assess the backbone’s available capacity. But the command backfired, shutting down the network and cutting Facebook’s data centers off from one another, Mr. Janardhan said. A testing tool designed to catch erroneous commands failed to detect the error, he added.
But that was only the beginning of the problems. “This change caused a complete disconnection of server connections between our data centers and the internet,” Mr. Janardhan wrote. “And that complete loss of connection caused a second crash that made things worse.”
With Facebook’s data centers offline, the company’s internet address management servers were also unavailable. “This makes it impossible for the rest of the internet to find our servers,” Mr. Janardhan said.
As the scope of the outage became clear, Facebook engineers struggled to restore access because the company’s data centers were heavily guarded and employees couldn’t get in immediately, Mr. Janardhan wrote.
Once the engineers were inside Facebook’s data centers and started working, they were able to restore the network. But they needed to bring the servers back online slowly so as not to overwhelm the system, Mr. Janardhan said.
The company planned to study how outages happen and create drills that would allow employees to practice repairing Facebook’s systems more quickly, he added.