Entering the so-called pit of the Syslab reveals a nook of orderly chaos. Amid the constant backdrop of students chattering elsewhere in the lab, Jefferson’s Student Systems Administrators (Sysadmins) cluster in an alcove of chairs, wires, and computer screens glowing with lines upon lines of code, tossing ideas back and forth in a technical jargon they each speak fluently. Here, they have spent the past week working to revive the lab’s services.
It all began at 3:30 in the afternoon on Sunday, Sept. 23, when senior Sysadmin Dylan Jones attempted to update the Syslab’s file storage system. Powered by a service known as Ceph, the storage system is the linchpin of the Syslab ecosystem, housing the data at the heart of every service the lab offers.
“The storage system is redundant, fault-tolerant, and self-healing,” Jones said. “So normally it’s able to recover from hardware failures or somebody smash[ing] a server with a sledgehammer.”
This time, however, after Jones updated the storage servers, the data they contained suddenly became inaccessible. Since these servers are the centralized system powering the entire lab, all Syslab services—including Ion, Webmail, email forwarding, Director, workstations, and library proxy—went down as a result.
“For the first 48 hours after the incident started, I just devoted all the time I had to it,” Jones said. “I was able to put off homework until later: ‘Just one more hour and it’ll be fixed.’ But after 48 hours passed, I knew that wasn’t going to happen.”
At that point, the entire Sysadmin team was working to bring the lab back up. As the resident experts on the file storage system, Jones and senior Sysadmin Omkar Kulkarni initially shouldered the brunt of the work involving Ceph. However, to ensure too much pressure didn’t fall on any one person, the process remained a team effort.
“We try to distribute responsibility. Different people are in charge of different tasks based on what their expertise is,” Sysadmin adviser Patrick White said. “One thing you want is for everybody in the group to feel like they’re contributing to the solution, so everybody’s busy. Nobody is the only person primarily responsible for fixing a thing.”
As part of that collaborative effort, many of the Sysadmins have been working several hours a day to restore the lab, using lunches, 8th periods, and time at home.
“We are doing everything we can. Every member we have is working as hard as they can, as many hours as they can, to get all of this resolved,” Jones said. “We’re working with people now from universities that do this professionally; we’re using every resource available to try and get this issue fixed.”
White and Jones have been in contact with an expert from Indiana University to resolve the issue as quickly and efficiently as possible.
“Currently we are in emergency response mode. Everybody is working on anything they can handle,” junior Sysadmin Keegan Lanzillotta said. “We’re all doing different things trying to get up as many services as we can as quickly as possible. Some people are still working on the data storage array; other people are focused on particular services.”
Along with this divide-and-conquer strategy, they have been using a triage approach to determine which services to tackle first. Ion took top priority, with the Sysadmins meeting a Tuesday, Sept. 25, deadline to fix it in time for 8th period on Wednesday.
“We understand that these services are critical, so there is certainly pressure to get things working,” Lanzillotta said. “But we understand that pressure. No one has ever found it necessary to tell us that we need to work harder.”
While their teamwork and sense of urgency have paid off in the successful restoration of Ion, library proxy, workstations, and Webmail, several services are still down, with no clear timeline for when they will return. Further, each currently functioning service is running as a new instance rather than connecting back to the centralized file storage system, meaning that the overarching challenge of fixing that system still looms. Given the weight of this workload, many Sysadmins have had to make sacrifices regarding free time, sleep, and even homework.
“People definitely have a hard time sometimes. We try to keep everything within 8th period so that you don’t have to do too much at home, but when things like this happen, you spend a little more time working on Sysadmin stuff and less time working on schoolwork,” Kulkarni said. “Even during school, you’re not always thinking about what’s happening in the classroom that you’re in, but more about what’s going to happen if anything else breaks, or how long it will take for things to get back up.”
Despite these stresses and sacrifices, the team remains focused on pushing forward toward a solution.
“I have not heard any of the Sysadmins openly blaming anybody else or criticizing. When somebody tries something that doesn’t work, somebody else comes in with a new suggestion without saying ‘you idiot, why would you even try that,’” White said. “They have been completely supportive of each other in this process. That doesn’t always happen. When you have critical failures like this, it pushes teams to their limits.”
Perhaps what defines this team is its ability to push back, countering failure with resolve and difficulty with passion.
“One advantage that [Sysadmins] have is they love it,” White said. “You can see getting this stuff to work is their passion, and any person who’s spending countless hours on their passion is a happy person.”