o2image fails on s390x

Bug #1745155 reported by Andreas Hasenack
This bug affects 1 person
Affects                   Status         Importance   Assigned to   Milestone
OCFS2 Tools               Fix Released   Unknown
Ubuntu on IBM z Systems   Fix Released   Undecided    Unassigned
ocfs2-tools (Ubuntu)      Fix Released   Medium       Unassigned
  Eoan                    Won't Fix      Medium       Unassigned
  Focal                   Fix Released   Medium       Unassigned

Bug Description

o2image fails on s390x:

dd if=/dev/zero of=/tmp/disk bs=1M count=200
losetup --find --show /tmp/disk
mkfs.ocfs2 --cluster-stack=o2cb --cluster-name=ocfs2 /dev/loop0 # loop dev found in prev step

Then this command:
o2image /dev/loop0 /tmp/disk.image

Results in:
Segmentation fault (core dumped)

dmesg:
[ 862.642556] ocfs2: Registered cluster interface o2cb
[ 870.880635] User process fault: interruption code 003b ilc:3 in o2image[10c180000+2e000]
[ 870.880643] Failing address: 0000000000000000 TEID: 0000000000000800
[ 870.880644] Fault in primary space mode while using user ASCE.
[ 870.880646] AS:000000003d8f81c7 R3:0000000000000024
[ 870.880650] CPU: 0 PID: 1484 Comm: o2image Not tainted 4.13.0-30-generic #33-Ubuntu
[ 870.880651] Hardware name: IBM 2964 N63 400 (KVM/Linux)
[ 870.880652] task: 000000003cb81200 task.stack: 000000003d50c000
[ 870.880653] User PSW : 0705000180000000 000000010c184212
[ 870.880654] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:0 PM:0 RI:0 EA:3
[ 870.880655] User GPRS: 0000000144f0cc10 0000000000000001 0000000000000001 0000000000000000
[ 870.880655] 0000000000000000 0000000144ef6090 0000000144f13cc0 0000000100000000
[ 870.880656] 0000000144ef6000 0000000144ef3280 0000000144f13cd8 0000000000037ee8
[ 870.880656] 000003ff965a6000 000003ffe5e7e410 000000010c183bc6 000003ffe5e7e370
[ 870.880663] User Code: 000000010c184202: b9080034 agr %r3,%r4
                          000000010c184206: c02b00000007 nilf %r2,7
                         #000000010c18420c: eb21200000df sllk %r2,%r1,0(%r2)
                         >000000010c184212: e31030000090 llgc %r1,0(%r3)
                          000000010c184218: b9f61042 ork %r4,%r2,%r1
                          000000010c18421c: 1421 nr %r2,%r1
                          000000010c18421e: 42403000 stc %r4,0(%r3)
                          000000010c184222: 1322 lcr %r2,%r2
[ 870.880672] Last Breaking-Event-Address:
[ 870.880675] [<000000010c18e4ca>] 0x10c18e4ca

Upstream issue:
https://github.com/markfasheh/ocfs2-tools/issues/22

This was triggered by our ocfs2-tools dep8 tests: http://autopkgtest.ubuntu.com/packages/o/ocfs2-tools/bionic/s390x

Changed in ocfs2-tools:
status: Unknown → New
tags: added: ubuntu-ha
Changed in ocfs2-tools (Ubuntu):
status: New → Incomplete
status: Incomplete → Confirmed
importance: Undecided → Medium
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Based on:

https://github.com/markfasheh/ocfs2-tools/issues/22#issuecomment-452931021

I'm skipping autopkgtests for all:

DEB_BUILD_ARCH_ENDIAN=big

And checking if any other parts of ocfs2-tools are buggy for big endian arches.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Weird,

We have:

https://docs.oracle.com/cd/E37670_01/E37355/html/ol_about_ocfs2.html

Saying: "Support for heterogeneous clusters of nodes with a mixture of 32-bit and 64-bit, little-endian (x86, x86_64, ia64) and big-endian (ppc64) architectures."

But we have it passing for all little endian arches (including ppc64el)... and not for s390x (the only current big-endian arch we have support for).

Despite what Oracle docs say, this function looks suspicious:

int ocfs2_set_bit(int nr,void * addr)
{
        int mask, retval;
        unsigned char *ADDR = (unsigned char *) addr;

        ADDR += nr >> 3;
        mask = 1 << (nr & 0x07);
        retval = (mask & *ADDR) != 0;
        *ADDR |= mask;
        return retval;
}

So, despite using single *u8 for ADDR, we have:

*u8 ADR = (u8 *) (u32 *) addr;

and that might explain why addr is 0;

Also, we have:

(mask & *ADDR) -> a dereference without a cast (especially important if we are differentiating between little and big endian).
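
For illustration only, here is a minimal sketch of the same byte-wise bit operation with a guard against a NULL bitmap, which is what the "Failing address: 0000000000000000" line in dmesg suggests was dereferenced. The name checked_set_bit is hypothetical; this is not part of ocfs2-tools and not the upstream fix. It also shows that indexing the bitmap one byte at a time does not by itself depend on host byte order.

```c
#include <stddef.h>

/* Hypothetical helper, not an ocfs2-tools function: same byte-wise logic
 * as ocfs2_set_bit, but it refuses a NULL bitmap instead of faulting. */
static int checked_set_bit(unsigned int nr, void *addr)
{
    unsigned char *byte;
    unsigned char mask;
    int was_set;

    if (addr == NULL)
        return -1;                        /* no bitmap to update */

    byte = (unsigned char *)addr + (nr >> 3);   /* byte holding bit nr */
    mask = (unsigned char)(1u << (nr & 0x07));  /* bit inside that byte */
    was_set = (*byte & mask) != 0;              /* previous bit state */
    *byte |= mask;
    return was_set;
}
```

A caller would treat -1 as "bitmap missing" and 0 or 1 as the previous state of the bit.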

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

(gdb) bt
#0 0x000002aa0001baa4 in ocfs2_image_mark_bitmap (ofs=0x2aa0005a2c0, blkno=236353) at image.c:254
#1 0x000002aa00007678 in traverse_extents (ofs=0x2aa0005a2c0, el=0x2aa0007a8c0) at o2image.c:117
#2 0x000002aa00007faa in traverse_inode (ofs=0x2aa0005a2c0, inode=10) at o2image.c:317
#3 0x000002aa00007474 in traverse_group_desc (ofs=0x2aa0005a2c0, grp=0x2aa00077c00, dump_type=2, bpc=4) at o2image.c:76
#4 0x000002aa00007876 in traverse_chains (ofs=0x2aa0005a2c0, cl=0x2aa00076cc0, dump_type=2) at o2image.c:155
#5 0x000002aa00007e7c in traverse_inode (ofs=0x2aa0005a2c0, inode=12) at o2image.c:291
#6 0x000002aa00008f38 in scan_raw_disk (ofs=0x2aa0005a2c0) at o2image.c:633
#7 0x000002aa000096fc in main (argc=3, argv=0x3fffffffc18) at o2image.c:780

ofs seems to be broken:

(gdb) print *(ofs)
$6 = {
  fs_devname = 0xffffffffffffffff <error: Cannot access memory at address 0xffffffffffffffff>,
  fs_flags = 4294967295,
  fs_io = 0xffffffffffffffff,
  fs_super = 0xffffffffffffffff,
  fs_orig_super = 0xffffffffffffffff,
  fs_blocksize = 4294967295,
  fs_clustersize = 4294967295,
  fs_clusters = 4294967295,
  fs_blocks = 18446744073709551615,
  fs_umask = 4294967295,
  fs_root_blkno = 18446744073709551615,
  fs_sysdir_blkno = 18446744073709551615,
  fs_first_cg_blkno = 18446744073709551615,
  uuid_str = '\377' <repeats 33 times>,
  fs_cluster_alloc = 0xffffffffffffffff,
  fs_inode_allocs = 0xffffffffffffffff,
  fs_system_inode_alloc = 0xffffffffffffffff,
  fs_eb_allocs = 0xffffffffffffffff,
  fs_system_eb_alloc = 0xffffffffffffffff,
  fs_dlm_ctxt = 0xffffffffffffffff,
  ost = 0x10002aa0005d040,
  qinfo = {{
      qi_inode = 0x0,
      flags = 0,
      qi_info = {
        dqi_bgrace = 0,
        dqi_igrace = 0,
        dqi_syncms = 0,
        dqi_blocks = 0,
        dqi_free_blk = 0,
        dqi_free_entry = 0
      }
    }, {
      qi_inode = 0x0,
      flags = 0,
      qi_info = {
        dqi_bgrace = 0,
        dqi_igrace = 0,
        dqi_syncms = 0,
        dqi_blocks = 0,
        dqi_free_blk = 0,
        dqi_free_entry = 0
      }
    }},
  fs_private = 0x0
}

and its pointer is passed along the whole stack trace, starting at frame #6 (scan_raw_disk).

It comes from:

* ocfs2_open is modified to be aware of OCFS2_FLAG_IMAGE_FILE.
* open routine allocates ocfs2_image_state and loads the bitmap if
* OCFS2_FLAG_IMAGE_FILE flag is passed in

ret = ocfs2_open(src_file,OCFS2_FLAG_RO|OCFS2_FLAG_NO_ECC_CHECKS|open_flags, 0,0, &ofs);

And it is likely related to how the bitmap is laid out in the image file on a big endian arch.

ofs->ost pointer is not broken, because it comes from a malloc call after ofs was "allegedly" correctly read (which is not true in big endian arches):

ret = ocfs2_malloc0(sizeof(struct ocfs2_image_state),&ofs->ost);
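
As a sketch of the kind of conversion that avoids this class of problem: assuming a field is stored little-endian on disk, it can be assembled byte by byte so the result does not depend on the host. The helper name le64_from_bytes is hypothetical and is not an ocfs2-tools function; whether a missing conversion like this is the actual cause here is still only a suspicion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an ocfs2-tools function: decode a 64-bit
 * little-endian field from a raw buffer, independent of host byte order. */
static uint64_t le64_from_bytes(const unsigned char *buf)
{
    uint64_t value = 0;
    size_t i;

    /* buf[0] is the least significant byte, buf[7] the most significant. */
    for (i = 0; i < 8; i++)
        value |= (uint64_t)buf[i] << (8 * i);
    return value;
}
```

Reading the same eight bytes through a plain (uint64_t *) cast would give a byte-swapped value on s390x, which is the kind of difference suspected above.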

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

With that, and after talking to some other engineers, we decided that, instead of whitelisting little endian arches, it is best to make the build process fail whenever this can occur. Right now, after investigation, the problem appears to be limited to how bitmaps are laid out in the image file (and mapped using (char *) without any endian conversion).

So, currently a test like:

union un {
    uint8_t u8;
    uint16_t u16;
    uint32_t u32;
    uint64_t u64;
};

union un u = { .u64 = 0x7b };

printf("0x%x\n", (uint) u.u8);
printf("0x%x\n", (uint) u.u16);
printf("0x%x\n", (uint) u.u32);
printf("0x%x\n", (uint) u.u64);

And verifying that all 4 variables have the same "0x7b" contents is enough to make sure we are running in a little endian arch.
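
A self-contained version of that check might look like the sketch below. The names probe and host_is_little_endian are made up for illustration; this is not the code from the merge request.

```c
#include <stdint.h>

/* All members share the same starting address. */
union probe {
    uint8_t  u8;
    uint16_t u16;
    uint32_t u32;
    uint64_t u64;
};

/* Hypothetical helper: returns 1 when every narrower member still reads
 * 0x7b after storing 0x7b through the widest one, which only happens
 * when the least significant byte is stored first (little endian). */
static int host_is_little_endian(void)
{
    union probe p = { .u64 = 0x7b };

    return p.u8 == 0x7b && p.u16 == 0x7b && p.u32 == 0x7b;
}
```

On a big endian host such as s390x the 0x7b byte sits at the end of the 8-byte value, so u8, u16 and u32 all read 0 and the function returns 0.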

Changed in ocfs2-tools (Ubuntu):
status: Confirmed → In Progress
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

The merge request:

https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/ocfs2-tools/+git/ocfs2-tools/+merge/372384

Can serve as a request to remove s390x from the supported architectures for this package's binaries.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Talking to @paelzer, we both agreed that whitelisting architectures is the best alternative, rather than adding a little/big endian test, which would add complexity and a bigger delta to be maintained.

Meanwhile, we got ocfs2-tools added to the autopkgtest blacklist:

https://bazaar.launchpad.net/~ubuntu-release/britney/hints-ubuntu/revision/3907

Meaning that this "change" lost "momentum". Nevertheless, in the next merge, I'll whitelist the appropriate architectures, so we don't face migration problems again (and since ocfs2-tools is part of the bigger HA effort).

Changed in ocfs2-tools (Ubuntu):
status: In Progress → Confirmed
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
Changed in ocfs2-tools (Ubuntu Eoan):
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Eoan needs fixing for the s390x migration for its next SRU.

Adding block-proposed-eoan as tag.

tags: added: block-proposed-eoan
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Change has been accepted upstream:

https://salsa.debian.org/ha-team/ocfs2-tools/merge_requests/2/commits

commit 08c247a
Author: Valentin Vidic <email address hidden>
Date: Fri Dec 6 18:16:59 2019

    Merge branch 'master' into 'master'

    debian/control: specify only supported architectures (LP: #1745155)

    See merge request ha-team/ocfs2-tools!2

    (cherry picked from commit c9406a8d5cec753398dfa2d6d554392aa1d15693)

    3a67253c debian/control: specify only supported architectures (LP: #1745155)

But not released in a new version yet.

We have an almost finished merge:

https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/ocfs2-tools/+git/ocfs2-tools/+merge/376188

Fixing this issue.

Changed in ocfs2-tools (Ubuntu Focal):
status: Confirmed → In Progress
Changed in ocfs2-tools (Ubuntu Eoan):
status: Confirmed → Won't Fix
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Frank,

Do you want to ask IBM if they are interested in this? I just removed s390x from the arch list (Debian and Ubuntu), based on the upstream discussion. Unsure if they're willing to make it work on s390x.

Best,

Rafael

Frank Heimes (fheimes)
tags: added: s390x
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ocfs2-tools - 1.8.6-2ubuntu1

---------------
ocfs2-tools (1.8.6-2ubuntu1) focal; urgency=medium

  * debian/control: specify only supported architectures (LP: #1745155)

 -- Rafael David Tinoco <email address hidden> Fri, 29 Nov 2019 13:12:00 +0000

Changed in ocfs2-tools (Ubuntu Focal):
status: In Progress → Fix Released
Changed in ocfs2-tools (Ubuntu Focal):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Changed in ocfs2-tools (Ubuntu):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Changed in ocfs2-tools:
status: New → Fix Released
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: New → Fix Released