SATA disk recovery on Linux

About a year ago (Oct 2008) I replaced a dying Seagate Barracuda 200GB IDE drive with a Barracuda 1TB SATA drive. That entailed upgrading from Fedora Core 4 to FC9 to get SATA support (ASUS A8B MX motherboard). As I recall, the IDE drive, with a Reiser 3 filesystem, failed gradually - I would get I/O errors on odd files, run fsck, clean up a bunch of errors, and repeat. Eventually it became obvious there was a hardware issue, at which point I copied everything to ext3 on the new drive. I kept the old drive online for some months until it started impacting the running system. If a file had no I/O errors and had passed fsck, it worked normally.

Recently I started having problems with the new drive. I didn't read the logs properly at first, and thought the odd sda entries I'd seen in passing were something to do with inserting/unplugging a memory stick. I didn't initially get I/O errors, but I noticed that reading a "man" page was sometimes slow - the first page would appear quickly, but the colon at the bottom of the page, indicating the page was fully parsed, would take a while to appear. Finally, while unpacking a big zip file with a few thousand entries, I got solid errors trying to create directories.

When I rebooted the machine to try to clear the errors, it was very slow to come up. Looking at the logs, it seemed that the SATA driver was getting errors and then re-initializing the drive each time, which took a while. The log had entries like:

kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: ata3.00: irq_stat 0x40000000
kernel: ata3.00: cmd c8/00:08:99:90:aa/00:00:00:00:00/e7 tag 0 dma 4096 in
kernel:         res 51/40:00:99:90:aa/00:00:07:00:00/07 Emask 0x9 (media error)
kernel: ata3.00: status: { DRDY ERR }
kernel: ata3.00: error: { UNC }
kernel: ata3.00: configured for UDMA/133
kernel: ata3: EH complete

kernel: ata3.00: configured for UDMA/133
kernel: sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
kernel: sd 2:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
kernel: Descriptor sense data with sense descriptors (in hex):
kernel:        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
kernel: sd 2:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
kernel: end_request: I/O error, dev sda, sector 128618649
kernel: sd 2:0:0:0: [sda] 1953525168 512-byte hardware sectors (1000205 MB)
kernel: sd 2:0:0:0: [sda] Write Protect is off
kernel: sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
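
The failing sector can be read straight out of the "cmd" and "res" lines. Here is a minimal Python sketch, assuming the register bytes are printed in libata's usual order (command/feature:nsect:lbal:lbam:lbah, then five high-order bytes, then device) and that the command is a 28-bit one such as c8 (READ DMA). The function name is made up for illustration.

```python
import re

# Twelve hex register bytes as printed in a libata "cmd" or "res" line:
# command/feature:nsect:lbal:lbam:lbah/hob bytes (five of them)/device
TASKFILE_RE = re.compile(
    r"\b(cmd|res)\s+"
    r"([0-9a-f]{2})/([0-9a-f]{2}(?::[0-9a-f]{2}){4})/"
    r"([0-9a-f]{2}(?::[0-9a-f]{2}){4})/([0-9a-f]{2})"
)


def lba28_from_taskfile(line):
    """Return the 28-bit LBA encoded in a libata cmd/res log line."""
    match = TASKFILE_RE.search(line)
    if match is None:
        raise ValueError("no taskfile registers found in line")
    # Second group of bytes: feature, sector count, then LBA bits 0-23.
    _feature, _nsect, lbal, lbam, lbah = (
        int(byte, 16) for byte in match.group(3).split(":")
    )
    # For 28-bit commands the low nibble of the device register
    # carries LBA bits 24-27.
    device = int(match.group(5), 16)
    return ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
```

For the cmd line above this gives 128618649, the same sector that the end_request line reports.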

I bought two new SATA disks (I may get a warranty refund on one, hopefully), and installed one alongside the failing one. I rebooted and, as before, initialized the new disk and copied files across with rsync. I chose ext4, which seemed to be fully supported in my now-updated FC9 system (2.6.27.19-78 kernel).

I did the rsync copy in pieces, rather than saying "rsync -a /". The old disk had a primary boot partition, and a volume group containing another 4 partitions - /, /home, swap and /two. I had never quite got the hang of the volume manager, particularly when there's an fsck error on boot and the message says "disk error on /dev/sda, enter password for disk check" and I can't find the disk to check. So I reverted to the old scheme of directly mounted partitions - even if I can't resize them, it's easy enough to move stuff around or make symlinks, and 1TB is not exactly tight yet.

The /home partition copy went OK - I did it by subdirectory and finished with a general "rsync -a /home", and there were no errors or hangs. But when it came to copying /boot and /, the copy process would hit a bad sector and hang while the driver retried. If I got fed up I would reboot, or power-cycle the machine, then try again avoiding the affected directories. The disk seemed to get worse with use, eventually giving an I/O error on a top-level directory such as /usr, but it would recover after a reboot.

When I retrieved my FC9 rescue disk from work the next day, I found that it would not mount the ext4 partitions I had copied everything to. I also had an older Maxtor 40GB drive still working (predating the 200GB Seagate, in fact). I installed a minimal FC9 on that, booted it and did "yum update kernel". That replaced the original 2.6.25-14 kernel with 2.6.27.25-78, which could mount the ext4 partitions.

I wanted, if at all possible, to recover my original root disk, with all configs, updates and additions (yum, manual rpm and install-from-source stuff). I have odd things backed up on other computers, the Maxtor disk etc. but no 1TB image. So doing a fresh install was a last resort. Accordingly I set about making the recovered copy of / bootable. There were many files missing, some critical. I had three options for each - recover it from the bad disk, get a new package online with yum, or find a copy elsewhere on the disk or on the install DVD.

Every time I rebooted, there was a good chance that the bad disk would not come up, and the two disks sometimes appeared in a different order (sda vs. sdb). This got a bit confusing. The old fstab referred to partitions by volume, the new one by UUID, and fdisk/fsck by device. I ended up relabelling the partitions and using labels in fstab.

After making the Maxtor disk bootable in BIOS and recovering as much as I could, I made the new Seagate bootable and edited grub.conf to choose between the Maxtor drive, the old kernel on the new / partition, the new kernel etc. I ended up with several unbootable kernel sets; some probably because drivers were missing from /lib/modules/ and others probably just config errors. Fortunately there were always one or two working sets to recover from.

Another thing that got confusing was that the two network interfaces would come up in a different order, or one would be missing.
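
On the disk-ordering problem: labels stay put when sda and sdb swap, which is why they are handy in fstab. Below is a minimal Python sketch showing which device each label currently points at, and what a label-based fstab entry looks like. The function names and the label "newroot" are invented for illustration, and the fstab options are just common defaults.

```python
import os


def label_map(by_label_dir="/dev/disk/by-label"):
    """Return {label: device path} by resolving the by-label symlinks."""
    mapping = {}
    for name in sorted(os.listdir(by_label_dir)):
        link = os.path.join(by_label_dir, name)
        # Each entry is a symlink to whichever device node the disk
        # landed on for this boot.
        mapping[name] = os.path.realpath(link)
    return mapping


def fstab_line(label, mount_point, fstype="ext4", passno=2):
    """Build an fstab entry that mounts by label rather than by device."""
    return "LABEL=%s\t%s\t%s\tdefaults\t1 %d" % (label, mount_point, fstype, passno)
```

Calling fstab_line("newroot", "/", passno=1) produces a line that keeps working whichever order the disks appear in.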

It was fairly obvious that the recovered / partition was not going to boot easily. So I prepared another partition to be the new / by installing the "filesystem" RPM on it, running MAKEDEV (as I recall) to recreate devices (I had not tried to rsync /dev from the bad disk), and then copying files from the initial copy - in effect using one partition as a staging area to create the other.

Finally I got the recovered system to boot with init=/bin/sh, and was able to fix a few missing libraries etc. so that most critical programs would run. I then looked for a more systematic method than recovering files by hand. The RPM database was OK (not corrupted), so "rpm -V" could check whether any files were missing or corrupted. As it turned out, none were corrupted, but a good many were missing. (As explained, when rsync hit a bad file it would typically hang and then block the directory until a power-cycle, making it difficult to traverse directories with bad entries.)

To preserve existing configs, I could re-install packages with "rpm -U xxx.rpm --force". That worked OK provided I had an RPM, e.g. from the install DVD. For packages updated with yum, I had typically had the cache disabled and so had no copy of the RPM. Yum appears to have no equivalent to --force; I could not re-install the same version, and did not want to delete packages (configs would usually have been saved as .rpmsave, but I'd have had to locate these and reload them). For many packages in fedora-update-newkey, I was able to retrieve the RPM from a mirror with wget, and then reinstall with rpm --force. I made a script to verify packages and then, for those that had missing files, try first the install DVD and then wget. This salvaged a reasonable number of packages, but left those that had been retrieved from a different repository or installed by hand.
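
The verification step could be sketched roughly as follows. This is not the original script, just a minimal Python illustration that assumes "rpm -V" flags absent files with lines beginning "missing"; the function names are invented.

```python
import subprocess


def missing_files(verify_output):
    """Extract the paths that 'rpm -V' reported as missing."""
    paths = []
    for line in verify_output.splitlines():
        if line.startswith("missing"):
            # The path follows the marker (and an optional file-type
            # letter such as "c" for config files).
            start = line.find(" /")
            if start != -1:
                paths.append(line[start + 1:].rstrip())
    return paths


def packages_with_missing_files(root="/"):
    """Map each installed package to its missing files, if it has any."""
    names = subprocess.run(
        ["rpm", "--root", root, "-qa"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    damaged = {}
    for name in names:
        # rpm -V exits non-zero when it finds problems, so the exit
        # status is deliberately not checked here.
        result = subprocess.run(
            ["rpm", "--root", root, "-V", name],
            capture_output=True, text=True,
        )
        missing = missing_files(result.stdout)
        if missing:
            damaged[name] = missing
    return damaged
```

The keys of the returned dictionary are the packages worth looking for on the install DVD or a mirror.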

Since I could not find a way to force a yum install of an existing package, I came up with a workaround. I created a dummy system on a spare partition, with RPM database, /var/cache/yum etc, and did "yum --installroot /xxx install yyy.rpm". This left the latest RPM in the yum cache, which I was then able to reinstall on the real system with "rpm -U --force". I scripted this too - I initially had about 1700 packages installed, about 1500 of them with one or more missing files.

In retrospect, it seems likely that "yumdownloader" would have simplified this somewhat.
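
The second half of the workaround (picking the downloaded RPMs out of the dummy system's cache and reinstalling them) might look something like this in Python. It is a sketch under assumptions: that the cache lives under var/cache/yum inside the install root, and that every .rpm found there is wanted. The function names are invented.

```python
import os


def cached_rpms(install_root):
    """List the .rpm files left in the yum cache under install_root."""
    cache_dir = os.path.join(install_root, "var", "cache", "yum")
    found = []
    for dirpath, _dirnames, filenames in os.walk(cache_dir):
        for name in filenames:
            if name.endswith(".rpm"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def reinstall_command(rpm_path):
    """Build the command that reinstalls a package over itself."""
    return ["rpm", "-U", "--force", rpm_path]
```

Printing the commands first and only then handing them to subprocess.run() makes for a cheap dry run.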

By this time I had pretty much recovered everything from the original Fedora Core 9, updated to the latest version. I also had most of the packages from other repositories. That left a few packages I had manually downloaded, or built as RPM, where I had the RPM saved on /home. I was mostly able to reinstall these. There are probably still a number of tarball-built packages broken, which I shall fix as I need them.

It was possible to check the updatedb database on the bad disk to see what files were originally present; however, since the disk was gradually going bad, I suspect that the indexer was not finding everything.
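
Such a comparison could be sketched as below, assuming the old file list has already been dumped to a list of absolute paths (for instance from "locate -d" output; the exact invocation is left open here). The function name is invented.

```python
import os


def missing_from_copy(old_paths, new_root):
    """Return the paths from the old file list that are absent under new_root."""
    missing = []
    for path in old_paths:
        candidate = os.path.join(new_root, path.lstrip("/"))
        # lexists() counts a dangling symlink as present, since the
        # link itself was copied even if its target was not.
        if not os.path.lexists(candidate):
            missing.append(path)
    return missing
```

Anything it returns was on the old disk when the index was built but did not make it across.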

During this process I was trying to keep notes - pasting commands etc. - and kept finding I'd left the notes on an unmounted partition. It proved easier to keep them on a memory stick, along with copies of grub.conf and fstab.

Useful commands and locations:

"fdisk -l" lists available disks, by device
/dev/disk/by-* lists disks by UUID, label etc.
"e2label" changes the label on ext2/3/4 filesystems, as will "tune2fs -L"
Logical volumes are listed under /dev/VolGroup*; fsck works on those
"mount -l" gives disk labels of mounted filesystems
"rpm -r" (or "--root") uses a given filesystem root; will list the RPM database on a different disk
"yum --installroot" uses a given filesystem root; will install packages on a different disk
The "filesystem" RPM package creates basic directories such as /dev, /var etc.
"chcon --reference" will set the SELinux properties on a file from a reference file
"ls -lZ" lists SELinux properties
"locate -d" searches an updatedb database on a different disk

Conclusions

Given enough persistence, it is possible to recover an RPM-based system with missing files.
I don't really want to do this again.
The typical Linux filesystem is a mess - no easy way to separate distro, add-on and user files onto separate disks. Or for that matter read-write and read-only. SELinux makes this worse by, for instance, setting different attributes on /home so that putting webserver directories there becomes difficult.
Having a backup of the RPM database is a good idea - even if you don't have a backup of the actual packages, they can be found online.
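
For the last point, a dated tarball is probably enough. This is a minimal Python sketch, assuming the database is in the usual /var/lib/rpm and that nothing is installing packages while it runs; the function name and archive naming are invented.

```python
import os
import tarfile
import time


def backup_rpm_db(dest_dir, rpm_db="/var/lib/rpm", stamp=None):
    """Write a dated tar.gz copy of the RPM database into dest_dir."""
    stamp = stamp or time.strftime("%Y%m%d")
    archive = os.path.join(dest_dir, "rpmdb-%s.tar.gz" % stamp)
    with tarfile.open(archive, "w:gz") as tar:
        # Store the files under "rpm/" so the archive can be unpacked
        # anywhere without overwriting a live database.
        tar.add(rpm_db, arcname="rpm")
    return archive
```

Pointing dest_dir at a different physical disk is the whole point; a copy on the same drive would fail along with it.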

What next?

I intend to replace the bad disk with a second identical SATA disk. I will probably configure this for nightly backups, rather than RAID, as this gives easy recovery from accidental erasure and screw-ups - provided the backup cycle has not yet run, I can easily recover individual files from the second disk. (I use this scheme elsewhere.) However, this is not proof against local disaster (fire, meteorite...) or theft.

It would be nice if I could use more RPM packages, but I currently lack the skill to create my own RPMs of complex software. I also use a lot of Perl modules from CPAN; I may look at tools such as cpanspec or cpan2rpm.

In reality, I'll probably keep the same unmaintainable muddle, because it's just too hard to be systematic when using software from multiple sources with different philosophies. I may try to keep better backups of the RPM database.