Kurt Roeckx's journal

Finding which sector belongs to which file on a RAID device.
11th November 2008

I got this nice message (and some others) in my log today:
smartd: Device: /dev/sdd, 2 Currently unreadable (pending) sectors

After running a smartctl -t long /dev/sdd smartctl now reports:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       8719
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0

[...]
SMART Error Log Version: 1
ATA Error Count: 10 (device log contains only the most recent five errors)
[...]
Error 10 occurred at disk power-on lifetime: 8692 hours (362 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 78 e5 89 e0  Error: UNC 8 sectors at LBA = 0x0089e578 = 9037176

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 78 e5 89 2f 00  45d+06:36:27.880  READ DMA EXT
  25 00 08 70 e5 89 2f 00  45d+06:36:27.798  READ DMA EXT
  25 00 08 68 e5 89 2f 00  45d+06:36:27.308  READ DMA EXT
  25 00 08 e0 77 e5 35 00  45d+06:36:27.295  READ DMA EXT
  25 00 08 60 e5 89 2f 00  45d+06:36:26.653  READ DMA EXT

[...]
Error 9 occurred at disk power-on lifetime: 8692 hours (362 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 60 e5 89 e0  Error: UNC 8 sectors at LBA = 0x0089e560 = 9037152

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      8715         797566305
# 2  Extended offline    Completed without error       00%      8198         -

Notice that 197 Current_Pending_Sector indicates that there are 2 sectors with a problem.

Note that the SMART error log shows LBA sectors 9037176 (0x89e578) and 9037152 (0x89e560) but it's limited to 0xffffff. The self test logs shows us the proper LBA error at 797566305 (0x2f89e561) which has a 0x2f in front, and also shows us that it's the second sector in the block of 8 we tried to read.

The kernel log also shows:

ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata4.00: (BMDMA2 stat 0xc0009)
ata4.00: cmd 25/00:08:60:e5:89/00:00:2f:00:00/e0 tag 0 cdb 0x0 data 4096 in
         res 51/40:00:61:e5:89/00:00:2f:00:00/f0 Emask 0x9 (media error)

So I've tried to reproduce the error with:

# dd if=/dev/sdd of=797566305 skip=797566305 bs=512 count=1 iflag=direct
1+0 records in
1+0 records out
512 bytes (512 B) copied, 1.27727 s, 0.4 kB/s

Notice that it can actually read that block, it just seems to take a while.

And the error log now contains:

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 61 e5 89 e0  Error: UNC 1 sectors at LBA = 0x0089e561 = 9037153

Because it's part of a RAID device I could just let the whole disk resync, but I was a little curious which files were in it. So I found this document that explains on how to find which files has the problem. But it only has examples covering ext2/ext3 on a partition and LVM and nothing about RAID device. So it looks like someone using RAID 0 will have a hard time finding the file that's having a problem.

So following the document we start with:

# fdisk -lu /dev/sdd

Disk /dev/sdd: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Disk identifier: 0x000578f9

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1   *          63      385559      192748+  fd  Linux raid autodetect
/dev/sdd2       972864270   976768064     1951897+  fd  Linux raid autodetect
/dev/sdd3          385560   972864269   486239355   fd  Linux raid autodetect

Partition table entries are not in disk order

In /proc/mdstat we see:

md1 : active raid5 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      1458717696 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

So sdd3 is part of md1 and I have an ext3 filesytsem on that. Tune2fs -l |grep Block gives:

Block count:              364679424
Block size:               4096
Blocks per group:         32768

mdadm tells me:

# mdadm -QE /dev/sdd3
/dev/sdd3:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 747b071a:a70f723b:c3a9ed44:15845cbf
  Creation Time : Mon Nov 12 20:13:51 2007
     Raid Level : raid5
  Used Dev Size : 486239232 (463.71 GiB 497.91 GB)
     Array Size : 1458717696 (1391.14 GiB 1493.73 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1

    Update Time : Mon Nov 10 10:57:28 2008
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 4b3147f4 - correct
         Events : 78

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     3       8       51        3      active sync   /dev/sdd3

   0     0       8        3        0      active sync   /dev/sda3
   1     1       8       19        1      active sync   /dev/sdb3
   2     2       8       35        2      active sync   /dev/sdc3
   3     3       8       51        3      active sync   /dev/sdd3

I've been told that 0.90 stores the superblock at the end of the device, so that makes it a little easier. And this seems to agree:

# cat /sys/block/md1/md/dev-sdd3/offset
0

We have a left-symmetric layout, so in raid5_compute_sector() we see

	stripe = chunk_number / data_disks;
	*dd_idx = chunk_number % data_disks;
[...]
	case ALGORITHM_LEFT_SYMMETRIC:
		*pd_idx = data_disks - stripe % raid_disks;
		*dd_idx = (*pd_idx + 1 + *dd_idx) % raid_disks;

So it looks like the chunks are located on the disk like:

stripesdasdbsdcsdd
0D 0D 1D 2P 0-2
1D 4D 5P 3-5D 3
2D 8P 6-8D 6D 7
3P 9-11D 9D 10D 11
4D 12D 13D 14P 12-14

With each of those 64K.

The problem is at 797566305 and the partition started at 385560. So that's at sector 797180745 inside partition sdd3. Each sector is 512 bytes and a chunk is 65536 bytes which gets us 128 sectors per chunk. If we want to know which the stride the sector is on we do: 797180745/128=6227974.5703125

This means that the stripe is number 6227974 and that it's sector 0.5703125*128=73 on sdd3 within that chunk.

The parity is on disk: pd_idx = 3 - 6227974 % 4 = 1, being sdb.

To get the chunk we need to multiply with the number of data disks (3) and since it's the forth disk and the parity is on the second disk we need to add 1 and get: 6227974*3+1=18683923

So calculating that back to sectors we multiply it with the number of sectors per chunk again and add our offset in the chunk: 18683923*128+73=2391542217

To test that our math is any good we try:

# dd if=/dev/md1 of=797566305.2 skip=2391542217 bs=512 count=1 iflag=direct
1+0 records in
1+0 records out
512 bytes (512 B) copied, 5.99103 s, 0.1 kB/s

We see an error message in the kernel log, there are 2 new errors in the error log, so it seems that everything is calculated correctly. The files are also identical.

The file system is in blocks of 4096 bytes, so for the file system this is block 2391542217*512/4096=298942777.125

Then we use debugfs:

# debugfs
debugfs 1.41.2 (02-Oct-2008)
debugfs:  open /dev/md1
debugfs:  icheck 298942777
Block   Inode number
298942777       <block not found>

Which isn't making any sense to me, since I get that error about every 2 hour when a certain cronjob runs. 298942777 / 32768 (Blocks per group) = 9123.00955. So I wonder if ext3 is keeping some meta data there and debugfs doesn't tell that?

Anyway, at some point I was running a smartctl -t long /dev/sdd again because I couldn't reproduce the error anymore, the cronjob started, and I then found alot of errors in my log again and now it included:

raid5:md1: read error corrected (8 sectors at 797180744 on sdd3)

It's so much fun that disks always behaves the same trying to read the same sector.

Smartctl now reports:

  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0

I assume the kernel removed the pending sector by writing to it. The other pending error was already gone for some time, I assume new data got written to that one.

Contact: Kurt Roeckx <kurt@roeckx.be>