They handed you the keys and a pat on the back. “Good luck,” they said, with a look that was equal parts pity and relief. You’re the new sysadmin, the new SRE, the new DevOps guy. And you’ve just inherited a steaming pile of undocumented, unmanaged, and unstable infrastructure.
Don’t panic. I’ve been there. More times than I care to count, starting from my first encounter with a closet full of overheating Sun SPARCstations in ‘98. The names change, the tech gets “better,” but the situation is always the same: you’re walking into a digital minefield.
Before you start unplugging things (and believe me, the temptation is strong), here’s your three-step plan for the first 48 hours. This isn’t about boiling the ocean; it’s about triage.
Step 1: Observe and Document (Don’t Touch Anything)
Your first instinct is to start fixing things. Resist it. You don’t know what you don’t know. Your predecessor, for all their faults, was the only one who knew that the ancient billing_export.sh script must run at 2:17 AM on the third Tuesday of the month, or the entire accounting department’s reporting goes up in flames.
For the next two days, your job is to be a ghost.
- Inventory Everything: Physical servers, VMs, cloud instances, network gear, storage arrays. Use `nmap`, `arp-scan`, and the cloud provider’s console. Give everything a name, even if it’s just `mystery-box-in-rack-3` (a rough starting sweep is sketched after this list).
- Map the Network: How does data flow? What talks to what? Draw it out. Use a whiteboard, a notebook, `graphviz`, I don’t care. Just get it out of the abstract and into a format you can see.
- Listen to the System: Check the logs. All of them. `dmesg`, `/var/log/syslog`, application logs, firewall logs. Look for the patterns, the recurring errors, the weird cron jobs that keep exiting with status 127.
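If you want somewhere concrete to start, here’s a minimal discovery sweep. The subnet and interface are placeholders, and it assumes you’re actually allowed to scan the network you just inherited, so adjust before you run anything:

```bash
# First-pass host discovery. SUBNET and eth0 are examples -- substitute
# whatever your new network actually uses.
SUBNET="192.168.1.0/24"
OUTDIR="$HOME/inventory/$(date +%Y-%m-%d)"
mkdir -p "$OUTDIR"

# Ping scan only (-sn): who answers at all, without poking at ports yet.
nmap -sn "$SUBNET" -oN "$OUTDIR/nmap-ping-sweep.txt"

# Layer 2 view of the local segment; catches hosts that drop ICMP.
sudo arp-scan --interface=eth0 --localnet > "$OUTDIR/arp-scan.txt"
```

Date-stamp everything. In a week you’ll want to diff today’s picture against whatever crawls out of the woodwork next.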
You are an archaeologist, and this is your dig site. Don’t break the artifacts.
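While you’re digging, dump every scheduled job you can find; that’s where landmines like billing_export.sh hide. A rough sketch, assuming root on a Linux box running classic cron (systemd timers get their own listing at the end):

```bash
# Every user crontab (reading other users' entries needs root).
for u in $(cut -d: -f1 /etc/passwd); do
    echo "### crontab for $u"
    crontab -l -u "$u" 2>/dev/null
done

# System-wide cron entries.
cat /etc/crontab
ls -l /etc/cron.d/ /etc/cron.daily/ /etc/cron.weekly/ /etc/cron.monthly/

# Anything your predecessor migrated to systemd timers.
systemctl list-timers --all
```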
Step 2: Identify the Crown Jewels
Not all systems are created equal. You need to figure out which ones would get you fired if they went down. Is it the customer database? The e-commerce frontend? The Active Directory controller?
Find the systems that directly make the company money, or that directly support the ones that do. These are your “Crown Jewels.” Your initial efforts must be entirely focused on ensuring their continued, uninterrupted existence. Everything else is secondary.
Ask the business people if you have to. “Which system going down would cause you the most pain?” They’ll tell you.
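While you wait for their answers, the sockets don’t lie. A read-only spot check, run on the box you suspect matters, shows who is actually connected to it right now. Port 5432 (PostgreSQL) is only an example; swap in whatever the service listens on:

```bash
# List established TCP connections to the service port. Purely read-only.
# 5432 is an example -- use the port your suspected crown jewel serves.
ss -tn state established '( sport = :5432 )'
```

A dozen application servers hanging off that port tells you more than any stale wiki page will.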
Step 3: The First Backup
You have no idea what the backup situation is. Assume it’s either non-existent or hasn’t been tested since the last ice age.
Before you make a single change, you will perform a backup of the Crown Jewels. It doesn’t have to be elegant. It can be a simple dd to an NFS mount, a database dump to a local file, or a cloud snapshot.
# Don't just copy-paste this, you fool. Understand it.
# For a PostgreSQL database, for example:
pg_dump -U someuser -h db.example.com -Fc a_critical_database > /mnt/backups/db-$(date +%Y-%m-%d).dump
Verify it. A backup that hasn’t been tested is just a collection of bits. Can you restore it to a test system? Do you know how to restore it?
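For the pg_dump example above, a minimal restore drill might look like the following. It assumes you have a scratch PostgreSQL instance you’re allowed to write to; the host, user, and table names are placeholders:

```bash
# Restore into a throwaway database -- never the production one.
createdb -U someuser -h scratch-db.example.com scratchdb
pg_restore -U someuser -h scratch-db.example.com -d scratchdb \
    /mnt/backups/db-$(date +%Y-%m-%d).dump

# Sanity check: does the restored copy actually contain rows?
psql -U someuser -h scratch-db.example.com -d scratchdb \
    -c "SELECT count(*) FROM some_table_you_know_is_populated;"
```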
That’s it. That’s your first 48 hours. You haven’t “fixed” anything, but you’ve done something more important. You’ve established a baseline. You’ve reduced your ignorance. You’ve taken the first step from being a victim of the system to being its master.
Now the real work begins.