When an OSD goes down, its log shows the following:
f049a9f5700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f049a9f5700 time
os/FileStore.cc: 2850: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)
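From assert(allow_eio || !m_filestore_fail_eio || got != -5): when the caller does not allow EIO (allow_eio is false) and fail_eio is enabled in the configuration (m_filestore_fail_eio is true), any I/O error on the read (got == -5, i.e. -EIO) makes the assert fail and the OSD aborts.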
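The assert condition can be restated as a small predicate. The Python function below is a hypothetical model of that single line, not Ceph code; it assumes -5 is -EIO (true on Linux) and uses 4096 as an arbitrary successful read length.

```python
import errno


def filestore_read_assert_holds(allow_eio, fail_eio, got):
    """Model of assert(allow_eio || !m_filestore_fail_eio || got != -5).

    allow_eio: the caller tolerates EIO.
    fail_eio:  stands in for m_filestore_fail_eio.
    got:       return value of the read (bytes read, or a negative errno).
    """
    return allow_eio or not fail_eio or got != -errno.EIO  # -errno.EIO == -5 on Linux


# Enumerate every combination: only allow_eio=False, fail_eio=True, got=-EIO fails.
for allow_eio in (False, True):
    for fail_eio in (False, True):
        for got in (4096, -errno.EIO):  # 4096: arbitrary successful read length
            ok = filestore_read_assert_holds(allow_eio, fail_eio, got)
            print(f"allow_eio={allow_eio!s:<5} fail_eio={fail_eio!s:<5} got={got:>5} -> "
                  + ("ok" if ok else "FAILED assert"))
```

Whichever flag is flipped, the predicate only decides whether the EIO aborts the OSD; it does not make the unreadable sector readable.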
Next, inspect the kernel messages for the disk backing this OSD:
# dmesg -T | grep sdh
sd 0:2:7:0: [sdh] 3904294912 512-byte logical blocks: (1.99 TB/1.81 TiB)
sd 0:2:7:0: [sdh] Write Protect is off
sd 0:2:7:0: [sdh] Mode Sense: 1f 00 00 08
sd 0:2:7:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sdh: unknown partition table
sd 0:2:7:0: [sdh] Attached SCSI disk
XFS (sdh): Mounting V4 Filesystem
XFS (sdh): Ending clean mount
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh] CDB:
end_request: I/O error, dev sdh, sector 1100294168
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh] CDB:
end_request: I/O error, dev sdh, sector 1100294400
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh] CDB:
end_request: I/O error, dev sdh, sector 1100294400
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh] CDB:
end_request: I/O error, dev sdh, sector 1100294168
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh] CDB:
end_request: I/O error, dev sdh, sector 1100294400
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh] CDB:
end_request: I/O error, dev sdh, sector 1100294400
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh] CDB:
end_request: I/O error, dev sdh, sector 1100294352
Buffer I/O error on device sdh, logical block 1100294352
Buffer I/O error on device sdh, logical block 1100294353
Buffer I/O error on device sdh, logical block 1100294354
Buffer I/O error on device sdh, logical block 1100294355
Buffer I/O error on device sdh, logical block 1100294356
Buffer I/O error on device sdh, logical block 1100294357
Buffer I/O error on device sdh, logical block 1100294358
Buffer I/O error on device sdh, logical block 1100294359
Buffer I/O error on device sdh, logical block 1100294360
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh] CDB:
end_request: I/O error, dev sdh, sector 1100294405
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh] CDB:
end_request: I/O error, dev sdh, sector 1100294405
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh]
sd 0:2:7:0: [sdh] CDB:
end_request: I/O error, dev sdh, sector 1100294405
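To see which sectors the kernel actually complained about, the error lines can be reduced to distinct numbers. This is a minimal sketch using Python's re module; io_error_numbers is a hypothetical helper, and the sample text is the end_request and Buffer I/O error lines copied from the dmesg output above, with the bare "sd 0:2:7:0: [sdh]" lines dropped.

```python
import re

# Error lines copied from the dmesg output above (bare "sd 0:2:7:0: [sdh]" lines omitted).
DMESG = """\
end_request: I/O error, dev sdh, sector 1100294168
end_request: I/O error, dev sdh, sector 1100294400
end_request: I/O error, dev sdh, sector 1100294400
end_request: I/O error, dev sdh, sector 1100294168
end_request: I/O error, dev sdh, sector 1100294400
end_request: I/O error, dev sdh, sector 1100294400
end_request: I/O error, dev sdh, sector 1100294352
Buffer I/O error on device sdh, logical block 1100294352
Buffer I/O error on device sdh, logical block 1100294353
Buffer I/O error on device sdh, logical block 1100294354
Buffer I/O error on device sdh, logical block 1100294355
Buffer I/O error on device sdh, logical block 1100294356
Buffer I/O error on device sdh, logical block 1100294357
Buffer I/O error on device sdh, logical block 1100294358
Buffer I/O error on device sdh, logical block 1100294359
Buffer I/O error on device sdh, logical block 1100294360
end_request: I/O error, dev sdh, sector 1100294405
end_request: I/O error, dev sdh, sector 1100294405
end_request: I/O error, dev sdh, sector 1100294405
"""

# Group 1 captures the device name, group 2 the sector / logical block number.
END_REQUEST = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
BUFFER_ERR = re.compile(r"Buffer I/O error on device (\w+), logical block (\d+)")


def io_error_numbers(text, pattern, dev):
    """Return the distinct numbers matched by `pattern` for device `dev`, sorted."""
    return sorted({int(m.group(2)) for m in pattern.finditer(text) if m.group(1) == dev})


print("end_request sectors:", io_error_numbers(DMESG, END_REQUEST, "sdh"))
print("buffer error blocks:", io_error_numbers(DMESG, BUFFER_ERR, "sdh"))
```

Because the patterns are not anchored to the start of a line, the same function also works on raw dmesg -T output, where every line carries a timestamp prefix.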
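The kernel log shows repeated I/O errors on sdh. Take a range around the reported sectors and scan it with badblocks. With -b 512 the block size matches the disk's 512-byte logical blocks, so badblocks block numbers are the same as the kernel's sector numbers; the two trailing arguments are last_block and first_block, in that order, and without -n or -w badblocks runs its default read-only test: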
badblocks -v -s -b 512 -o /root/badblocks.txt /dev/sdh 1100300000 1100290000
Checking blocks 1100290000 to 1100300000
Checking for bad blocks (read-only test): done
Pass completed, 8 bad blocks found. (8/0/0 errors)
[root@host20 ~]# cat badblocks.txt
1100294400
1100294401
1100294402
1100294403
1100294404
1100294405
1100294406
1100294407
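The scan window in the command above can be derived from the kernel-reported sectors. The helper below, badblocks_scan_args, is hypothetical (it is not part of badblocks or e2fsprogs); it rounds the lowest and highest reported sectors to multiples of 10,000, an assumption that happens to reproduce the range used here, and emits the range in badblocks' last_block, first_block order.

```python
import shlex


def badblocks_scan_args(sectors, device, step=10_000, block_size=512,
                        outfile="/root/badblocks.txt"):
    """Hypothetical helper: build a read-only badblocks command covering `sectors`.

    Assumes block_size equals the disk's logical sector size (512 here), so
    badblocks block numbers and kernel sector numbers coincide.
    """
    first = (min(sectors) // step) * step    # round down to a multiple of step
    last = -(-max(sectors) // step) * step   # integer ceiling to a multiple of step
    # badblocks expects: device [last_block] [first_block]
    return ["badblocks", "-v", "-s", "-b", str(block_size), "-o", outfile,
            device, str(last), str(first)]


# Distinct end_request sectors from the dmesg output.
sectors = [1100294168, 1100294352, 1100294400, 1100294405]
print(shlex.join(badblocks_scan_args(sectors, "/dev/sdh")))
# badblocks -v -s -b 512 -o /root/badblocks.txt /dev/sdh 1100300000 1100290000
```

Using integer arithmetic for the ceiling avoids float rounding on sector numbers this large; shlex.join (Python 3.8+) quotes arguments only when needed.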
Blocks 1100294400–1100294407 were detected, but 1100294352–1100294360 were not (the badblocks run used only the read-only test).
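Comparing the two lists makes the gap explicit. The sketch below uses only numbers from this article and assumes both lists are in 512-byte units (which -b 512 ensures for badblocks); split_by_scan is a hypothetical helper.

```python
def split_by_scan(reported, found):
    """Split kernel-reported numbers into (found by badblocks, not found), both sorted."""
    reported = set(reported)
    return sorted(reported & found), sorted(reported - found)


# All numbers below are copied from this article (512-byte units).
badblocks_found = set(range(1100294400, 1100294408))   # badblocks.txt: 1100294400-1100294407
end_request_sectors = [1100294168, 1100294400, 1100294352, 1100294405]
buffer_error_blocks = range(1100294352, 1100294361)    # Buffer I/O error: 1100294352-1100294360

print(split_by_scan(end_request_sectors, badblocks_found))
# ([1100294400, 1100294405], [1100294168, 1100294352])
hit, miss = split_by_scan(buffer_error_blocks, badblocks_found)
print(hit, len(miss))
# [] 9
```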