Good day, everyone,
The recent downtime of the buzzen/relay websites:
First off, I apologize that it took so long to figure out the cause of the issue, but it wasn't a straightforward reboot and fix.
On Thursday, I was running a backup of the site files for Buzzen and Relay, and was then going to grab a backup of the databases (ironically, to protect against this sort of thing). While it was backing up the Buzzen files, the backup errored out. I was then informed that Buzzen was showing a different website. This happens occasionally because of how the system is set up with the panel we use. It uses Docker containers to segregate users from each other, but it reaches those containers through a floating IP (think of how your modem is assigned an IP by your provider when it boots). Sometimes the IP assigned to a container changes, and a simple command usually fixes this by pointing each domain back to the right floating IP.
Well, this time, that command didn't work. So I rebooted the server, since that should have started the system up and assigned everything correctly. Instead, the server booted up but didn't start any services, including SSH (which I need in order to access the server and fix or update things). This is where the rabbit hole begins: troubleshooting took time, first to get access to the server at all, and then to figure out what had happened.
First, I got the server into rescue mode and checked the file system. Everything appeared correct: all files were still in place and had the right permissions. I rebooted the server so it would boot from the hard drive, but there was still no SSH. I finally got in via a KVM console, which is a backdoor to the server that our data center provides. It makes it as if I'm sitting right in front of the server even though it's a province away. I found that SSH had a permissions issue, which took a bit to sort through and fix. But SSH still wasn't starting.
So now it was time to search through the server logs. Some of them were quite large and didn't shed any light that I could see. So, via the console, I started looking at the journal output (think of it as a diary for the server: it logs everything from booting up, to firewall blocks, to application errors). This process was quite long; the journal was 6.6 million lines. When I got to the point in the log where things went wrong on the server, it gave me a clue about what was happening: 1) there were some partition errors on the hard drive, and 2) the server was trying to mount a partition that shouldn't have been there. I fixed the partition errors with a simple command, but it took a while to figure out why the server was mounting something it shouldn't have been. Once I fixed that, the server booted right up, and SSH and the other services started as normal.
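For anyone curious what narrowing down a journal of that size can look like, here is a minimal sketch in Python. It is not the exact procedure used on the server. It assumes the journal has already been exported to a plain-text file, and the keywords and the function names (find_suspect_lines, scan_file) are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical keywords: the actual journal messages are not quoted in this post.
KEYWORDS = ("mount", "partition", "failed")


def find_suspect_lines(
    lines: Iterable[str], keywords: Tuple[str, ...] = KEYWORDS
) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every line containing any keyword.

    Matching is case-insensitive. Lines are consumed one at a time, so a
    multi-million-line file never has to fit in memory at once.
    """
    lowered = tuple(k.lower() for k in keywords)
    for number, line in enumerate(lines, start=1):
        text = line.lower()
        if any(k in text for k in lowered):
            yield number, line.rstrip("\n")


def scan_file(path: str) -> None:
    """Print matching lines from a text export of the journal (path is an assumption)."""
    with open(path, errors="replace") as handle:
        for number, line in find_suspect_lines(handle):
            print(f"{number}: {line}")
```

Calling scan_file on the exported file prints only the lines that mention one of the keywords, along with their line numbers, which gives a much shorter list to read through by hand.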
Buzzen was working again, so I started the firewall, which I had disabled during the troubleshooting. I had forgotten that doing this causes an issue with the floating IP service, so people were seeing the yellow Bad Gateway page again. I ran the command to fix that issue, and it worked for everything except Buzzen and Relay. When I checked the output for the Docker container, there was an issue with a port that the Buzzen container uses. I researched and tried various fixes, but nothing corrected the issue until this morning, when I got it working.
As of now, the sites are up and running, and I've grabbed a backup of the databases and site files, so we have them.
Going forward, to help protect against this sort of thing happening, we will be taking a couple of steps.
First, I will be doing some work and plan on bringing the server down again in a few days to redo the system from scratch: reinstalling the operating system and reinstalling the control panel with the newest version, which has some significant changes, including automatic backups of the websites and databases.
Second, I will be teaching chain how to make periodic backups, and between the two of us, we will grab backups and keep them on our external hard drives for safekeeping. That way, if anything does happen, we will have fairly current backups of the sites and can get things up and running in a few hours with minimal loss of data.
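As a rough idea of what a periodic backup of the site files could look like, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the procedure we have settled on: the function name backup_directory and both paths are hypothetical, and it covers site files only. The databases would need their own dump step, which is not shown here.

```python
import tarfile
from datetime import datetime
from pathlib import Path


def backup_directory(source_dir: str, dest_dir: str) -> Path:
    """Create a timestamped .tar.gz of source_dir inside dest_dir and return its path."""
    source = Path(source_dir).resolve()
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # A timestamp in the file name keeps older backups from being overwritten.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{source.name}-{stamp}.tar.gz"

    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=source.name)

    # Re-open the archive to confirm it is readable and not empty.
    with tarfile.open(archive, "r:gz") as tar:
        if not tar.getnames():
            raise RuntimeError(f"backup {archive} is empty")

    return archive
```

For example, backup_directory("/path/to/site", "/path/to/external-drive/backups") would write an archive named after the site folder and the current date and time. Both paths here are placeholders.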
Again, I do apologize for the extended downtime.
Wes,
Buzzen Administration