Comment 22 for bug 1668129

Revision history for this message
Dan Streetman (ddstreet) wrote :

On an i3 instance in east-1, where i can reproduce fairly easily, the errors i'm getting unfortunately don't help. the nvme controller is failing some requests, but it isn't providing any useful info about why it doesn't like the requests. for example, here is some debug I added:

[ 1464.634709] nvme nvme0: invalid field command_id 3eb qid 5 cmd_type 1 cmd_flags 4001 data_dir 1 status 2002

the controller is failing a request with error 2 "invalid field", and sets the "more error data" flag 0x2000. So I pulled the error log page, which is supposed to provide more data about why the request failed.

[ 1464.634836] nvme nvme0: error log entry: count 5d5281a qid 5 command_id 3eb status 2002 byte ff bit ff lba 0 ns 1 vendor 0 csi 0

the nvme controller error log is supposed to provide details about the failure, but this provides no new info; the qid and command_id match the failure above, but the error location fields (byte and bit) which are supposed to point to the specific byte/bit in the request that the controller doesn't like, are set to 0xffff which means "If the error is not specific to a particular command then this field shall be set to FFFFh" - so that's totally unhelpful.

Instead of trying to determine what the controller's unhappy about, I'll try bisecting with an older kernel, to find the commit that introduces the failure.