SATA disk recovery on Linux

About a year ago (Oct 2008) I replaced a dying Seagate Barracuda 200GB IDE drive with a Barracuda 1TB SATA drive. That entailed upgrading from Fedora Core 4 to FC9 to get SATA support (ASUS A8B MX motherboard). As I recall, the IDE drive with its Reiser 3 filesystem failed gradually: I would get I/O errors on odd files, run fsck, clean up a bunch of errors, and repeat until it became obvious there was a hardware issue, at which point I copied everything to ext3 on the new drive, keeping the old one online for some months until it started impacting the running system. If a file had no I/O errors and had passed fsck, it worked normally.

Recently I started having problems with the new drive. I didn't read the logs properly at first, and thought the odd sda entries I'd seen in passing were something to do with inserting/unplugging a memory stick. I didn't initially get I/O errors, but I noticed that reading a "man" page was sometimes slow: the first page would appear quickly, but the colon at the bottom of the page indicating the page was fully parsed would take a while to appear. Finally, while unpacking a big zip file with a few thousand entries, I got solid errors trying to create directories. When I rebooted the machine to try to clear the errors, it was very slow to come up. Looking at the logs, it seemed that the SATA driver was getting errors and then re-initializing the drive each time, which took a while.
The log had entries like:

    kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    kernel: ata3.00: irq_stat 0x40000000
    kernel: ata3.00: cmd c8/00:08:99:90:aa/00:00:00:00:00/e7 tag 0 dma 4096 in
    kernel: res 51/40:00:99:90:aa/00:00:07:00:00/07 Emask 0x9 (media error)
    kernel: ata3.00: status: { DRDY ERR }
    kernel: ata3.00: error: { UNC }
    kernel: ata3.00: configured for UDMA/133
    kernel: ata3: EH complete
    kernel: ata3.00: configured for UDMA/133
    kernel: sd 2:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
    kernel: sd 2:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
    kernel: Descriptor sense data with sense descriptors (in hex):
    kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
    kernel: sd 2:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
    kernel: end_request: I/O error, dev sda, sector 128618649
    kernel: sd 2:0:0:0: [sda] 1953525168 512-byte hardware sectors (1000205 MB)
    kernel: sd 2:0:0:0: [sda] Write Protect is off
    kernel: sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

I bought 2 new SATA disks (I may get a warranty refund on 1, hopefully), and installed one alongside the failing one. I rebooted and, as before, initialized the new disk and copied files across with rsync. I chose ext4, which seemed to be fully supported in my now-updated FC9 system (2.6.27.19-78 kernel). I did an rsync copy in pieces, rather than saying "rsync -a /".

The old disk had a boot primary partition, and a volume group containing another 4 partitions - /, /home, swap and /two. I had never quite got the hang of the volume manager, particularly when there's an fsck error on boot and the message says "disk error on /dev/sda, enter password for disk check" and I can't find the disk to check. So I reverted to the old scheme of directly mounted partitions - even if I can't resize them, it's easy enough to move stuff around or make symlinks, and 1TB is not exactly tight yet.
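Copying "in pieces" can be sketched roughly as below. This is only an illustration, not the exact commands used at the time: the function name copy_in_pieces and the mount point in the usage note are made up, and the RSYNC variable exists only so that the loop can be dry-run with something harmless such as echo.

```shell
# Copy a tree one top-level entry at a time, so that a hang or an I/O
# error in one subdirectory does not stall the whole transfer.
# RSYNC may be overridden (for example RSYNC=echo) for a dry run.
RSYNC=${RSYNC:-rsync}

copy_in_pieces() {
    src=$1      # source directory, e.g. /home on the failing disk
    dst=$2      # destination directory on the new disk (must exist)
    failed=""
    for entry in "$src"/* "$src"/.[!.]*; do
        # Skip glob patterns that matched nothing
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        # -a preserves permissions, ownership, timestamps and symlinks
        if ! $RSYNC -a "$entry" "$dst"/; then
            failed="$failed $entry"
        fi
    done
    # Report the entries that need another attempt
    [ -z "$failed" ] || echo "retry later:$failed" >&2
}
```

For example, "copy_in_pieces /home /mnt/new/home" (a hypothetical mount point) copies each entry under /home in turn; a final plain "rsync -a" over the whole tree then picks up anything that changed in the meantime.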
The /home partition copy went OK - although I did it by subdirectory, and finished with a general "rsync -a /home", there were no errors or hangs. But when it came to copying /boot and /, the copy process would hit a bad sector and hang while the driver retried. If I got fed up I would reboot, or power-cycle the machine, then try again avoiding the affected directories. The disk seemed to get worse with use, eventually giving an I/O error on a top-level directory such as /usr, but recovering after a reboot.

When I retrieved my FC9 rescue disk from work the next day, I found that it would not mount the ext4 partitions I had copied everything to. I also had an older Maxtor 40GB disk still working (predating the 200GB Seagate, in fact). I installed a minimal FC9 on that, booted it and did "yum update kernel". That gave me 2.6.27.25-78 in place of the original 2.6.25-14, and the newer kernel could mount the ext4 partitions.

I wanted if at all possible to recover my original root disk, with all configs, updates and additions (yum, manual rpm and install-from-source stuff). I have odd things backed up on other computers, the Maxtor disk etc. but no 1TB image. So doing a fresh install was a last resort. Accordingly I set about making the recovered copy of / bootable. There were many files missing, some critical. I had 3 possibilities - recover them from the bad disk, get a new package online with yum, or find a copy elsewhere on the disk or on the install DVD.
Every time I rebooted, there was a good chance that the bad disk would not come up, and
the two disks sometimes appeared in a different order (sda vs. sdb). This got a bit confusing.
The old fstab referred to partitions by volume, the new one by UUID, and fdisk/fsck by device.
I ended up relabelling them and using labels in fstab.
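As a rough sketch of the label-based scheme: e2label sets the label on an ext2/3/4 filesystem, and fstab can then refer to LABEL=... instead of a device name. The device names, label names and the helper fstab_line below are all invented for illustration; the real values were not recorded.

```shell
# Print an fstab line that mounts a filesystem by label rather than by
# device name, so the entry survives sda and sdb swapping places.
# Arguments: label, mount point, filesystem type, fsck pass number.
fstab_line() {
    printf 'LABEL=%s\t%s\t%s\tdefaults\t1 %s\n' "$1" "$2" "$3" "$4"
}

# The labels themselves are written with e2label, for example:
#   e2label /dev/sdb2 newroot    # hypothetical device and label
#   e2label /dev/sdb3 newhome
# and "blkid" lists the labels and UUIDs currently visible.

fstab_line newroot /     ext4 1
fstab_line newhome /home ext4 2
```

The root filesystem takes fsck pass 1 and the others pass 2, as is usual in fstab.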
It was fairly obvious that the recovered / partition was not going to boot easily. So
I prepared another partition to be the new / by installing the "filesystem" RPM on it,
running MAKEDEV, as I recall, to recreate devices (I had not tried to rsync /dev from the bad disk),
and then copying files from the initial copy, in effect using one partition as a staging area to
create the other.
Since I could not find a way to force a yum install of an existing package,
I came up with a workaround. I created a
dummy system on a spare partition, with RPM database, /var/cache/yum etc, and did
"yum --installroot /xxx install yyy.rpm". This left the latest RPM in the yum cache,
which I was then able to reinstall on the real system with "rpm -U --force". I scripted this, too:
I initially had about 1700 packages installed, about 1500 with one or more missing files.
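The original script is not reproduced here, but the workaround could be sketched along these lines. The function names and the scratch-root path are hypothetical, and the sketch assumes that yum keeps the downloaded RPM in the scratch root's cache (keepcache=1 in the yum.conf it reads). "rpm -V" reports an absent file on a line beginning with "missing", which is what the first helper looks for.

```shell
# RPM and YUM may be overridden (for example with echo) for a dry run.
RPM=${RPM:-rpm}
YUM=${YUM:-yum}

# Succeeds if "rpm -V" output on stdin reports at least one missing file.
has_missing_files() {
    grep -q '^missing'
}

# List installed packages that have one or more missing files.
packages_with_missing_files() {
    $RPM -qa --qf '%{NAME}\n' | sort -u | while read -r pkg; do
        if $RPM -V "$pkg" 2>/dev/null | has_missing_files; then
            echo "$pkg"
        fi
    done
}

# Fetch the current RPM through a throwaway install root, then force it
# onto the real system. $1 = scratch root, $2 = package name.
refetch_and_reinstall() {
    scratch=$1
    pkg=$2
    $YUM -y --installroot="$scratch" install "$pkg" || return 1
    # Assumption: the downloaded RPM is still in the scratch root's cache
    find "$scratch/var/cache/yum" -name "$pkg-[0-9]*.rpm" |
        while read -r file; do
            $RPM -U --force "$file"
        done
}
```

A loop such as "packages_with_missing_files | while read -r p; do refetch_and_reinstall /xxx "$p"; done" would then work through the damaged packages, with /xxx standing for the dummy system's partition.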
By this time I had pretty much recovered everything from the original Fedora Core 9, updated to the latest version. I also had most of the packages from other repositories. That left a few packages I had manually downloaded, or built as RPMs, where I had the RPM saved on /home. I was mostly able to reinstall these. There are probably still a number of tarball-built packages broken, which I shall fix as I need them.

It was possible to check the updatedb database on the bad disk to see what files were originally present; however, since the disk was gradually going bad, I suspect that the indexer was not finding everything.

During this process I was trying to keep notes - pasting commands etc. - and kept finding I'd left the notes on an unmounted partition. It proved easier to use a memory stick, also for keeping copies of grub.conf and fstab.
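Checking the old updatedb database against the recovered tree amounts to comparing two lists of paths. A minimal sketch follows; the helper name missing_paths, the mount points and the database location (/var/lib/mlocate/mlocate.db, the usual mlocate default) are assumptions rather than details taken from my notes.

```shell
# Print the paths that appear in the first list but not in the second.
# Both arguments are files with one path per line, in any order.
missing_paths() {
    before_sorted=$(mktemp) && after_sorted=$(mktemp) || return 1
    LC_ALL=C sort -u "$1" > "$before_sorted"
    LC_ALL=C sort -u "$2" > "$after_sorted"
    # comm -23 keeps only the lines unique to the first input
    LC_ALL=C comm -23 "$before_sorted" "$after_sorted"
    rm -f "$before_sorted" "$after_sorted"
}

# Possible inputs (all paths are hypothetical):
#   locate -d /mnt/bad/var/lib/mlocate/mlocate.db '*' > before.txt
#   (cd /mnt/new && find . -xdev | sed 's/^\.//') > after.txt
#   missing_paths before.txt after.txt
```

Any path the indexer never saw on the failing disk will not show up in the result, so the output is a lower bound on what is missing.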
Useful commands and locations:
Conclusions
What next?

I intend to replace the bad disk with a second identical SATA disk. I will probably configure this for nightly backups, rather than RAID, as this gives easy recovery from accidental erasure and screw-ups - providing the backup cycle has not yet run, I can easily recover individual files from the second disk. (I use this scheme elsewhere). However, this is not proof against local disaster (fire, meteorite...) or theft.

It would be nice if I could use more RPM packages, but I currently lack the skill to create my own RPMs of complex software. I also use a lot of Perl modules from CPAN; I may look at tools such as cpanspec or cpan2rpm. In reality, I'll probably keep the same unmaintainable muddle, because it's just too hard to be systematic when using software from multiple sources with different philosophies. I may try to keep better backups of the RPM database.
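A nightly mirror that also keeps a copy of the RPM database might look something like the sketch below. Nothing here is set up yet: the function name, the /backup mount point, the directory list and the cron entry are all placeholders, and copying /var/lib/rpm while an rpm or yum transaction is running may not give a consistent copy.

```shell
# RSYNC and RPM may be overridden (for example with echo) for a dry run.
RSYNC=${RSYNC:-rsync}
RPM=${RPM:-rpm}

# Mirror the listed directories onto the backup disk, then save a copy
# of the RPM database and a plain-text package list alongside them.
# $1 = backup mount point, remaining arguments = directories to mirror.
nightly_backup() {
    dest=$1
    shift
    for dir in "$@"; do
        mkdir -p "$dest$dir"
        # -a preserves metadata; --delete makes the copy an exact mirror
        $RSYNC -a --delete "$dir"/ "$dest$dir"/ || echo "backup failed: $dir" >&2
    done
    $RSYNC -a --delete /var/lib/rpm/ "$dest/rpmdb"/ || echo "backup failed: rpm database" >&2
    $RPM -qa | sort > "$dest/rpm-qa.txt"
}

# Hypothetical /etc/cron.d entry, assuming the function is wrapped in a
# script at this path:
#   30 3 * * * root /usr/local/sbin/nightly-backup
```

Because --delete mirrors removals as well, a file erased by accident survives on the second disk only until the next run, which matches the recovery window described above.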