Sometimes vmcore tarball is corrupted when generated using /proc/sysrq-trigger

Bug #2042803 reported by Zhixiong Chi
Affects     Status          Importance   Assigned to     Milestone
StarlingX   Fix Released    Medium       Zhixiong Chi

Bug Description

Brief Description
-----------------
After generating a vmcore with 'echo c > /proc/sysrq-trigger', unpacking the resulting tarball fails with the error 'Unexpected EOF in archive'.

sysadmin@controller-0:~$ tar xzvf vmcore_first.tar.1.gz
202309280221/
202309280221/dmesg.202309280221
202309280221/dump.202309280221
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
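
To narrow down where the truncation sits, the gzip layer and the tar layer can be tested separately. The output above contains only tar errors and no gzip error, which would be consistent with an intact gzip stream wrapped around an incomplete tar file. The following is a minimal sketch, assuming GNU tar and gzip are available; `check_vmcore_archive` is a hypothetical helper name and not part of StarlingX.

```shell
# Hypothetical helper: report which layer of a .tar.gz archive is damaged.
# Returns 0 if both layers are intact, 1 if the gzip stream is bad,
# 2 if the gzip stream is fine but the tar archive inside is incomplete.
check_vmcore_archive() {
    archive="$1"
    # gzip -t decompresses to nowhere and verifies the stream and its CRC.
    if ! gzip -t "$archive" 2>/dev/null; then
        echo "gzip stream truncated or corrupt: $archive"
        return 1
    fi
    # tar -tzf lists every member, so it must read the archive to the end.
    if ! tar -tzf "$archive" >/dev/null 2>&1; then
        echo "tar archive incomplete: $archive"
        return 2
    fi
    echo "archive OK: $archive"
    return 0
}
```

For example, `check_vmcore_archive /var/log/crash/vmcore_first.tar.1.gz` could be run before copying the tarball elsewhere.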

Severity
--------
Minor

Steps to Reproduce
------------------
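(1) Open a root shell (for example with 'sudo bash'), then run 'echo c > /proc/sysrq-trigger' in that shell to crash the kernel and generate a vmcore.

(2) After the system reboots, unpack the vmcore tarball under '/var/log/crash' using the command 'tar xzvf vmcore_first.tar.1.gz'.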

Expected Behavior
------------------
The vmcore tarball unpacks without errors, and the extracted dump can be loaded with the crash tool.

Actual Behavior
----------------
Intermittently, tar aborts with 'Unexpected EOF in archive', and the crash tool reports the extracted dump as truncated/incomplete.

Reproducibility
---------------
Not 100% Reproducible

System Configuration
--------------------

Branch/Pull Time/Commit
-----------------------

Last Pass
---------

Timestamp/Logs
--------------
The raw size of the vmcore is 198M, but the vmcore tarball's size is just 40M.

2023-09-28T02:22:05.747 controller-0 crashDumpMgr: notice new vmcore detected (size:207424137:198M) ; /var/log/crash avail:7256076000:6.8G

sysadmin@controller-0:~$ ls -l /var/log/crash/
total 41768
-rw-r--r-- 1 root root 42769829 Sep 28 02:22 vmcore_first.tar.1.gz
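
The byte counts in the log line and in the `ls` output can be cross-checked against the rounded figures quoted above. This is a minimal sketch, assuming awk is available and that "M" in the log means MiB; `to_mib` is a hypothetical helper name.

```shell
# Hypothetical helper: convert a byte count to MiB with one decimal place.
to_mib() {
    awk -v bytes="$1" 'BEGIN { printf "%.1f\n", bytes / 1048576 }'
}

to_mib 207424137   # raw vmcore size from the crashDumpMgr log line
to_mib 42769829    # compressed tarball size from ls -l
```

This prints 197.8 and 40.8. Because the tarball is gzip-compressed, the smaller size alone does not prove truncation; the tar and crash errors do.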

  Unpacking it fails with the error 'Unexpected EOF in archive'.

sysadmin@controller-0:~$ cp /var/log/crash/vmcore_first.tar.1.gz .
sysadmin@controller-0:~$ tar xzvf vmcore_first.tar.1.gz
202309280221/
202309280221/dmesg.202309280221
202309280221/dump.202309280221
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

  Loading the extracted dump with the crash tool reports that the file is truncated/incomplete.

sysadmin@controller-0:~$ crash test/usr/lib/debug/vmlinux-5.10.0-6-amd64 202309280221/dump.202309280221
crash 7.2.9
Copyright (C) 2002-2020 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
please wait... (sorting flat format data: 99%)
crash: read error: 202309280221/dump.202309280221 (flat format): truncated/incomplete
crash: 202309280221/dump.202309280221: not a supported file format

Test Activity
-------------

Workaround
----------

Changed in starlingx:
assignee: nobody → Zhixiong Chi (zhixiongchi)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to utilities (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/utilities/+/900151

OpenStack Infra (hudson-openstack) wrote : Fix merged to utilities (master)

Reviewed: https://review.opendev.org/c/starlingx/utilities/+/900151
Committed: https://opendev.org/starlingx/utilities/commit/ef0fc6807920ffda3afa7a628c3e52dcd91e1418
Submitter: "Zuul (22348)"
Branch: master

commit ef0fc6807920ffda3afa7a628c3e52dcd91e1418
Author: Zhixiong Chi <email address hidden>
Date: Tue Oct 31 19:14:15 2023 -0700

    Reorder logmgmt after crash-dump-manager.service

    Sometimes we got a truncated vmcore file after sysrq magic key.
    This is most likely not an issue with generating the vmcore file
    itself, but rather an issue generating the tar archive.

    Assume that the system has crashed, and the kdump kernel has
    booted up, and the vmcore file has been saved to "/var/crash/".
    The system then reboots into "normal mode". During the normal
    boot-up, the order of the logmgmt and crashDumpMgr systemd
    services is not constrained to the correct order via systemd
    unit dependencies.

    This results in the following:
    Sometimes, after a crash, logmgmt daemon is started before
    crashDumpMgr can save the vmcore file tar archive from /var/crash
    to /var/log/crash/vmcore*.tar. Because of this, logrotate called
    by the logmgmt daemon grabs a partial copy of the vmcore tar file,
    and then the logrotate compresses the partial vmcore tar file into
    vmcore*.tar.gz.

    All this to say, this issue is fixed by ensuring that logmgmt
    daemon is started after crashDumpMgr finishes archiving the vmcore
    into a tar file. After that, logrotate started by logmgmt daemon
    will further compress the vmcore tar file into the vmcore tar.gz
    file.

    Now we add a systemd dependency on crash-dump-manager to the
    logmgmt service so that the logmgmt daemon waits for the
    crashDumpMgr daemon to finish starting.

    TestPlan:
    PASS: build-pkgs; build-image
    PASS: Jenkins Installation
    PASS: Execute the testcycle 'sudo tee <<<c /proc/sysrq-trigger'
          and check the vmcore file after reboot.

    Closes-Bug: 2042803

    Change-Id: I7b7f8adeda0a1bb16529e2fab7fc4f1dfb87b932
    Signed-off-by: M. Vefa Bicakci <email address hidden>
    Signed-off-by: Zhixiong Chi <email address hidden>
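
The ordering constraint described in the commit message can be checked on a running system by inspecting the `After=` property of the log management unit. This is a minimal sketch, assuming the unit is named `logmgmt.service` (the commit names only `crash-dump-manager.service` explicitly); `ordered_after` is a hypothetical helper, and the sample list is illustrative rather than captured from a real system.

```shell
# ordered_after LIST UNIT
# Succeeds if UNIT appears as a whole word in the space-separated LIST.
ordered_after() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a live system the list could be read with (unit name is an assumption):
#   after_list=$(systemctl show -p After --value logmgmt.service)
after_list="basic.target crash-dump-manager.service"   # sample value for illustration

if ordered_after "$after_list" "crash-dump-manager.service"; then
    echo "ordered after crash-dump-manager.service"
else
    echo "no ordering dependency found"
fi
```

Note that `After=` only delays startup until the other unit has finished starting; what "finished" means depends on that unit's service type.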

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.distro.other