DevOps Day 09
Real-World Production Troubleshooting
Today’s task felt like a real production support scenario. A critical application was down because the database service failed. My role wasn’t to create something new, but to investigate, diagnose, and fix the issue.
It was a great exercise in methodical troubleshooting. I followed clues from general errors to detailed logs, tested different hypotheses, and finally found the root cause — which was deeper than it first appeared.
Table of Contents

- The Task
- My Troubleshooting Journey: A Step-by-Step Solution
- Why Did I Do This? (The "What & Why")
- Deep Dive: The Hierarchy of Troubleshooting
- Common Pitfalls
- Exploring the Commands Used
The Task
- The Nautilus application was down. The production support team had identified that the `mariadb` service was not running on the database server (`stdb01`). My task was to investigate the issue and bring the database service back online.
My Troubleshooting Journey: A Step-by-Step Solution
- My path to resolving this outage involved a multi-step investigation.
Step 1: Initial Investigation
- First, I logged into the database server (`peter@stdb01`) and confirmed the problem using `systemctl`.

  ```bash
  sudo systemctl status mariadb
  # Output confirmed the service was "inactive (dead)"
  ```

- My first attempt to fix it with a simple `start` command failed, which told me the problem was not a simple crash.

  ```bash
  sudo systemctl start mariadb
  # Output: Job for mariadb.service failed... See "journalctl -xeu mariadb.service" for details.
  ```
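As a side note, `systemctl` exits non-zero for units that aren't running, so this first check can also be scripted. A minimal sketch:

```bash
# systemctl is-active exits 0 only when the unit is running,
# which makes it a quick scripted health check
if ! systemctl is-active --quiet mariadb; then
    echo "mariadb is down; showing recent logs..."
    journalctl -u mariadb.service -n 20 --no-pager
fi
```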
Step 2: Digging into the Logs
- Following the error message's advice, I checked the detailed logs.

  ```bash
  journalctl -xeu mariadb.service
  ```

  The logs were filled with "Operation not permitted" and "Failed to mount" errors. These were `systemd` errors, not MariaDB errors, pointing to a problem with the underlying server environment, likely file permissions or a missing resource.
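For reference, a few standard `journalctl` variations that narrow the output further (same unit name as above):

```bash
# Only error-level (and worse) messages for this unit
journalctl -u mariadb.service -p err --no-pager

# Limit output to entries around the time of the failure
journalctl -u mariadb.service --since "15 minutes ago"

# Follow new log lines live while retrying the start
journalctl -u mariadb.service -f
```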
Step 3: Testing Hypotheses
- Hypothesis 1: Disk Space. My first thought was that the disk was full, as a database will refuse to start without space to write.

  ```bash
  df -h
  ```

  This showed plenty of free space. Hypothesis was incorrect (though see the note after this step for a refinement on disk checks).
- Hypothesis 2: Incorrect Directory Ownership. My next thought was that the data directory (`/var/lib/mysql`) was owned by the wrong user (e.g., `root` instead of `mysql`).

  ```bash
  ls -ld /var/lib/mysql
  ```
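One refinement to the disk-space check from Hypothesis 1: a filesystem can run out of inodes while `df -h` still reports free space, so when I suspect "disk full" I now check both. A minimal sketch:

```bash
# Free blocks (what df -h reports)
df -h /var/lib

# Free inodes; IUse% at 100% blocks new files
# even when block space remains
df -i /var/lib
```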
Step 4: Discovering the True Root Cause
- That `ls -ld` check returned the ultimate clue:

  ```
  ls: cannot access '/var/lib/mysql': No such file or directory
  ```

- The problem wasn't wrong permissions; the entire data directory was missing.
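In hindsight, this also explains the "Failed to mount" lines from Step 2: packaged MariaDB units typically include systemd sandboxing directives (such as `ProtectSystem`), and systemd can fail to set up that sandbox when a path it references doesn't exist. Two commands I'd use to confirm this kind of environment problem; a sketch, not part of the original fix:

```bash
# Show the full unit file, including any sandboxing directives
systemctl cat mariadb.service

# Walk each component of the path to see which one is missing
namei -l /var/lib/mysql
```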
Step 5: The Final Solution
- With the root cause identified, I executed the full recovery procedure.

- Create the missing directory:

  ```bash
  sudo mkdir /var/lib/mysql
  ```

- Set the correct ownership so the `mysql` user could access it:

  ```bash
  sudo chown mysql:mysql /var/lib/mysql
  ```

- Initialize the database structure by running the installation script. This creates the necessary system tables in the new directory.

  ```bash
  sudo mysql_install_db --user=mysql
  ```

- Start the service. With all prerequisites in place, the service could finally start.

  ```bash
  sudo systemctl start mariadb
  ```

- Enable the service to ensure it starts after a reboot.

  ```bash
  sudo systemctl enable mariadb
  ```

- Final verification:

  ```bash
  sudo systemctl status mariadb
  # Output now showed "active (running)"
  ```
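As an extra check beyond `systemctl status`, I could have confirmed the server was actually answering queries. A minimal sketch, assuming the default unix-socket authentication that a fresh install sets up for root:

```bash
# Ask the server for a status summary over the local socket
sudo mysqladmin status

# Run a trivial query to prove logins and the system tables work
sudo mysql -e "SELECT VERSION();"
```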
Why Did I Do This? (The "What & Why")
- `systemd` and `systemctl`: `systemd` is the main service manager in modern Linux. `systemctl` is the tool to start, stop, enable, or check services, a core sysadmin skill.
- `journalctl`: Lets you view detailed system logs. When `systemctl status` isn't enough, `journalctl` helps find exact error messages.
- Root Cause Analysis: This task taught me to go beyond fixing symptoms. The service was down, but the real problem was a missing directory. Restarting alone wouldn't have solved it.
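A few related invocations I want to remember from this exercise (all standard `systemctl`/`journalctl` usage, shown as a quick reference):

```bash
systemctl is-active mariadb      # prints "active" or "inactive"
systemctl is-enabled mariadb     # will it start at boot?
systemctl list-units --failed    # every failed unit on the host
journalctl -b -p err             # all error-level logs since boot
```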
Deep Dive: The Hierarchy of Troubleshooting
My process followed a logical hierarchy, moving from general to specific.
1. Confirm the problem: Is the service really down? (`systemctl status`)
2. Attempt a simple fix: Will a simple restart work? (`systemctl start`)
3. Gather more data: The simple fix failed; find out why. (`journalctl`)
4. Form a hypothesis: Based on the logs (permission errors), I suspected an environmental issue. My first guess was disk space.
5. Test the hypothesis: Was the disk full? (`df -h`) No.
6. Refine the hypothesis: If not disk space, what else could cause permission errors? Incorrect file ownership.
7. Test the refined hypothesis: Who owns the data directory? (`ls -ld`)
8. Discover the root cause: The directory doesn't even exist.
9. Implement the full solution: Rebuild the environment (`mkdir`, `chown`, `mysql_install_db`) and then start the service.
This methodical process is key to solving complex issues efficiently.
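To make the hierarchy concrete, the first few rungs can even be scripted. A minimal sketch I put together afterwards; the service name is a parameter and nothing here is specific to this incident:

```bash
#!/usr/bin/env bash
# Minimal service-triage sketch following the hierarchy above.
# Usage: ./triage.sh mariadb
set -u
svc="${1:?usage: $0 <service>}"

# 1. Confirm the problem
if systemctl is-active --quiet "$svc"; then
    echo "$svc is running; nothing to do."
    exit 0
fi

# 2. Attempt a simple fix
echo "$svc is down; trying a simple start..."
if sudo systemctl start "$svc"; then
    echo "Start succeeded; $svc is back up."
    exit 0
fi

# 3. Gather more data to form a hypothesis
echo "Start failed; collecting evidence:"
journalctl -u "$svc" -n 30 --no-pager   # recent service logs
df -h                                   # disk space (Hypothesis 1)
df -i                                   # inode usage
```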
Common Pitfalls
- Not Reading Error Messages: The `start` command failed but explicitly said to run `journalctl`. Ignoring this advice would leave you guessing.
- Stopping at the First Clue: The `journalctl` logs showed "Operation not permitted" errors, which could lead one to only check `chown`. But the real problem was a level deeper: the directory itself was gone.
Exploring the Commands Used
- `sudo systemctl status mariadb`: Checks the current status of the service.
- `sudo systemctl start mariadb`: Attempts to start the service.
- `sudo systemctl enable mariadb`: Configures the service to start on boot.
- `journalctl -xeu mariadb.service`: Displays detailed, service-specific logs to find the root cause of a failure.
- `df -h`: Checks disk space usage in a human-readable format.
- `ls -ld [directory]`: Lists the details of a directory itself, including its owner.
- `sudo mkdir [directory]`: Creates a new directory.
- `sudo chown mysql:mysql [directory]`: Changes the owner and group of a directory (add `-R` to apply it recursively).
- `sudo mysql_install_db --user=mysql`: A MariaDB/MySQL-specific command that initializes an empty data directory and creates the system tables.
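One extra note: on recent MariaDB releases (10.4 and later), the same initialization script also ships under the project's own name, so you may see either form:

```bash
# Equivalent on newer MariaDB versions
sudo mariadb-install-db --user=mysql
```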