installation fails creating user on large multipath system - multipathd show command times out in curtin log with 163 paths

Bug #1878041 reported by bugproxy
14
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Release Notes for Ubuntu
Undecided
Unassigned
The Ubuntu-power-systems project
Medium
Canonical Foundations Team
curtin
Medium
Unassigned
subiquity
Undecided
Unassigned
multipath-tools (Ubuntu)
Undecided
Unassigned

Bug Description

While on the creating user screen the installation fails. There are no messages to the screen. The system is a power 9 ppc64le system with a large number of multipath disks. The installer is the latest pending focal, updated to 20.05.1.

Logs taken from the system are attached. This bug follows LP 1873728, but seems to be a different failure.

Related branches

Revision history for this message
bugproxy (bugproxy) wrote : tarball of log files from installer

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-185835 severity-medium targetmilestone-inin20041
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → debian-installer (Ubuntu)
Frank Heimes (fheimes)
no longer affects: debian-installer (Ubuntu)
Changed in ubuntu-power-systems:
importance: Undecided → Medium
assignee: nobody → Canonical Foundations Team (canonical-foundations)
Frank Heimes (fheimes)
tags: added: installer
Revision history for this message
Dimitri John Ledkov (xnox) wrote : Re: installation fails creating user on large multipath system

Command: ['multipathd', 'show', 'maps', 'raw', 'format', "name=%n multipath='%w' sysfs='%d' paths='%N'"]
Exit code: 1
Reason: -
Stdout: timeout receiving packet

Stderr: ''
Unexpected error while running command.
Command: ['multipathd', 'show', 'maps', 'raw', 'format', "name=%n multipath='%w' sysfs='%d' paths='%N'"]
Exit code: 1
Reason: -
Stdout: timeout receiving packet

Stderr: ''
curtin: Installation failed with exception: Unexpected error while running command.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

Where large is:

In [11]: probedata=json.loads(open('probe-data.json').read())
In [12]: len(probedata['multipath']['paths'])
Out[12]: 163

So just 163 paths.

@IBM how many total paths do you actually expect on that system?

summary: - installation fails creating user on large multipath system
+ installation fails creating user on large multipath system - multipathd
+ show command times out in curtin log with 163 paths
Revision history for this message
Frank Heimes (fheimes) wrote :

I guess it would also be interesting to run probert manually and see how long it takes:
time probert
respectively:
time probert --storage

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

2020-05-11 05:32:39,767 DEBUG root:39 start: subiquity/Filesystem/_probe/probe_once: restricted=False
2020-05-11 05:32:52,099 DEBUG root:39 finish: subiquity/Filesystem/_probe/probe_once: SUCCESS: restricted=False

probably 14 seconds or so

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

From multipath.conf.5 manpage:

uxsock_timeout

CLI receive timeout in milliseconds. For larger systems CLI
commands might timeout before the multipathd lock is released and
the CLI command can be pro‐ cessed. This will result in errors
like "timeout receiving packet" to be re‐ turned from CLI
commands. In these cases it is recommended to increase the CLI
timeout to avoid those issues.

The default is: 1000

Revision history for this message
Dan Watkins (oddbloke) wrote :

It sounds like we need to bump this timeout in curtin. My assumption is that it will have no effect on multipath commands unless they were previously timing out, so I assume that we can set this globally.

Changed in curtin:
status: New → Triaged
importance: Undecided → Medium
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: New → Triaged
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-05-18 02:06 EDT-------
Sorry for the delay, from time probert --storage

real 0m16.505s
user 0m2.796s
sys 0m2.169s

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

I guess we need to up the limit for probing too -- currently we time it out after 15 seconds.

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

> So just 163 paths.
> @IBM how many total paths do you actually expect on that system?

Note that 163 is a prime number, which seems like an unusual setup. I expected for most devices to have the same number of paths, or like to have a predictable divisible number by 2/3/4 or some such.

Marking this ticket incomplete, and it will self close, unless you provide us with more information.

Changed in subiquity:
status: New → Incomplete
Changed in curtin:
status: Triaged → Incomplete
Changed in ubuntu-power-systems:
status: Triaged → Incomplete
Revision history for this message
Thierry FAUCK (thierry-j) wrote :

I could access a system with a lot of I/O and multipath disks, and I could choose a multipath disk and say continue but when installer asks for user name system crashes and when to send report to Canonical (what I did once).
To be complete I also had errors on nve disk but I could definitely check that the disk is broken and it is not related to our issue.

For info, I see in full report there are device P: /devices/virtual/block/dm-88

In that report searching for multipath I also found :

[ 1285.250755] device-mapper: table: 253:72: multipath: error getting device
 [ 1285.250791] device-mapper: ioctl: error adding target to table
 [ 1285.263657] device-mapper: table: 253:72: multipath: error getting device
 [ 1285.263674] device-mapper: ioctl: error adding target to table
 [ 1286.359539] device-mapper: table: 253:80: multipath: error getting device
 [ 1286.359569] device-mapper: ioctl: error adding target to table
 [ 1287.327763] device-mapper: table: 253:87: multipath: error getting device
 [ 1287.327793] device-mapper: ioctl: error adding target to table
 [ 1287.341100] device-mapper: table: 253:87: multipath: error getting device
 [ 1287.341120] device-mapper: ioctl: error adding target to table
 [ 1287.350460] device-mapper: table: 253:87: multipath: error getting device
 [ 1287.350481] device-mapper: ioctl: error adding target to table

File "/snap/subiquity/1876/lib/python3.6/site-packages/curtin/block/multipath .py", line 18, in _extract_mpath_data
     mp_dict = util.load_shell_content(line, add_empty=True)
   File "/snap/subiquity/1876/lib/python3.6/site-packages/curtin/util.py", line 1232, in load_shell_content
     key, value = line.split("=", 1)
 ValueError: not enough values to unpack (expected 2, got 1)
UdevDb:

Revision history for this message
Ryan Harper (raharper) wrote :

Hi, can you provide the output of the following commands and attach to the bug?

multipathd show maps raw format "name=%n multipath='%w' sysfs='%d' paths='%N'"
multipathd show patsh raw format "device='%d' serial='%z' multipath='%m' host_wwpn='%N' target_wwnn='%n' host_wwpn='%R' target_wwpn='%r' host_adapter='%a'"

Revision history for this message
Thierry FAUCK (thierry-j) wrote :

boston113 has 5 disks with 4 paths each. That's pretty much the smallest FC SAN we'd ever set up.

Revision history for this message
Thierry FAUCK (thierry-j) wrote :
Download full text (3.7 KiB)

There are a lot of SCSI disks as far as I understand ( 68), but they are not mulitpath capable

Here are the commands I am issuing to get the installer:
wget http://cdimage.ubuntu.com/ubuntu-server/daily-live/current/groovy-live-server-ppc64el.iso
mkdir iso
mount -o loop -t iso9660 groovy-live-server-ppc64el.iso iso
kexec -l ./iso/casper/vmlinux --initrd=./iso/casper/initrd.gz --append="ip=xx::xx.xx.xxx.x:255.255.255.0:ltc-boston113:enP1p9s0f0:none:x.x.x.x rl=htt:/xx.xxx.xx.xx/ubuntu-server/daily-live/current/groovy-live-server-ppc64el.iso --- quiet "
kexec -e
The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
[ 98.496213] kexec_core: Starting new kernel
[257090.353861465,5] OPAL: Switch to big-endian OS
[257093.325518345,5] OPAL: Switch to little-endian OS
[ 5.503384] blk_update_request: protection error, dev nvme0c0n1, sector 3124995968 op 0x0:(READ) flags 0x4090700 phys_seg 1 prio class 0
[ 5.504824] blk_update_request: protection error, dev nvme0c0n1, sector 3124995968 op 0x0:(READ) flags 0x4010000 phys_seg 1 prio class 0
[ 5.504876] Buffer I/O error on dev nvme0n1p2, logical block 24413887, async page read
[ 5.506553] blk_update_request: protection error, dev nvme0c0n1, sector 3124995968 op 0x0:(READ) flags 0x4090700 phys_seg 1 prio class 0
[ 5.508097] blk_update_request: protection error, dev nvme0c0n1, sector 3124995968 op 0x0:(READ) flags 0x4010000 phys_seg 1 prio class 0
[ 5.508141] Buffer I/O error on dev nvme0n1p2, logical block 24413887, async page read
------------------- these messages are related to a disk used as secure boot --------------------
multiple messages
mount: mounting /dev/sdaz on /cdrom failed: Device or resource busy
....
Two methods available for IP configuration:
  * static: for static IP configuration
  * dhcp: for automatic IP configuration
static dhcp (default 'dhcp'):
vlan id (optional):
 http://cdimage.ubuntu.com/ubuntu-server/daily-live/current/groovy-live-server-ppc64el.iso (default)
url:
.....
groovy-live-server-p 97% |******************************* | 905M 0:00:01 ETA
groovy-live-server-p 99% |******************************* | 920M 0:00:00 ETA
groovy-live-server-p 100% |********************************| 925M 0:00:00 ETA

[ 702.806817] blk_update_request: protection error, dev nvme0c0n1, sector 3124995840 op 0x0:(READ) flags 0x4010000 phys_seg 1 prio class 0
[ 702.808426] blk_update_request: protection error, dev nvme0c0n1, sector 3124995840 op 0x0:(READ) flags 0x4010000 phys_seg 1 prio class 0
Connecting to plymouth: Connection refused

[FAILED] Failed to start LVM event activation on device 129:99.
[FAILED] Failed to start LVM event activation on device 129:115.
[FAILED] Failed to start LVM event activation on device 130:19.
[ 727.083442] blk_update_request: protection error, dev nvme0c0n1, sector 3124995840 op 0x0:(READ) flags 0x4010000 phys_seg 1 prio class 0
[ 727.085050] blk_update_request: protection error, dev nvme0c0n1, sector 3124995840 op 0x0:(READ) flags 0x4010000 phys_seg 1 prio class 0
[ 734.032080] device-mapper: table: 253:74: multipath: error getting device
[ 734.042066] device-mapper: table: 253:74: multipath: erro...

Read more...

Revision history for this message
Thierry FAUCK (thierry-j) wrote :
Download full text (4.4 KiB)

WHen I select a multipath device I get

                                                                                ┌────────────────────────────────────────────────────────────────────────┐  │ │  │ Sorry, there was a problem completing the installation. │  │ │  │ Information is being collected from the system that will help the │  │ developers diagnose the report. │  │ │  │ |--•--| │ : Removing leading `/' from member names │
 └───────────────────────────────────tar: Removing leading `/' from hard link targets

                                                                              ┌────────────────────────────────────────────────────────────────────────┐  │ │  │ Sorry, there was a problem completing the installation. │  │ │  │ [ View full report ] │  │ │  │ If you want to help improve the installer, you can send an error │  │ report. │  │ │  │ [ Send to Canonical ] │  │ │  │ Do you want to try starting the installation again? │  │ │  │ [ Restart the installer ] │  │ │  │ [ Close report ] │  │ │  └────────────────────────────────────────────────────────────────────────┘
no way to go to shell and get the log nad crash info - it has been sent to canonical !!!
 │ [Close report ] │  │ │  └────────────────────────────────────────────────────────────────────────┘  generating crash report
tar: Removing leading `/' from member names
tar: Removing leading `/' from hard link targets

python error .....
and comes back to
 ┌────────────────────────────────────────────────────────────────────────┐  │ ...

Read more...

Revision history for this message
Thierry FAUCK (thierry-j) wrote :

I also checked that with a SCSI disk (not multipath) even if there are 88 SCSI disks on that system, installation is proceding.

Revision history for this message
Ryan Harper (raharper) wrote :

The installer is multipath aware; so in the case that there are not any multipath devices in the system, then the output from those commands will not include any maps or paths to multipath devices.

However, the error you posted suggested that *something* was returned from those commands and we failed to parse that; thus we're keenly interested in running those commands exactly to capture the output.

I believe you can use F2 to enter debug shell on the installer, from there you can run those commands and hopefully capture the output in some way.

Revision history for this message
Thierry FAUCK (thierry-j) wrote :
Download full text (4.8 KiB)

multipathd show maps raw format "name=%n multipath='%w' s
name=mpatha multipath='360050768028211d8b000000000000061' sysfs='dm-11' paths='4'
name=mpathb multipath='360050768028211d8b000000000000062' sysfs='dm-14' paths='4'
name=mpathc multipath='360050768028211d8b000000000000063' sysfs='dm-15' paths='4'
name=mpathd multipath='360050768028211d8b000000000000064' sysfs='dm-16' paths='4'
name=mpathe multipath='360050768028211d8b000000000000067' sysfs='dm-17' paths='4'
name=mpathf multipath='35000c500940da6ab' sysfs='dm-21' paths='2'
name=mpathg multipath='35000c500940d0763' sysfs='dm-24' paths='2'
name=mpathh multipath='35000c500940ced27' sysfs='dm-25' paths='2'
name=mpathi multipath='35000c500940da2a7' sysfs='dm-26' paths='2'
name=mpathj multipath='35000c50093ff901f' sysfs='dm-27' paths='2'
name=mpathk multipath='35000c500940d38db' sysfs='dm-28' paths='2'
name=mpathl multipath='35000c500940d35d7' sysfs='dm-29' paths='2'
name=mpathm multipath='35000c500940d367f' sysfs='dm-30' paths='2'
name=mpathn multipath='35000c500940d238b' sysfs='dm-31' paths='2'
name=mpatho multipath='35000c50093f63a3f' sysfs='dm-32' paths='2'
name=mpathp multipath='35000c500940cf547' sysfs='dm-33' paths='2'
name=mpathq multipath='35000c500940d155f' sysfs='dm-34' paths='2'
name=mpathr multipath='35000c50093f63ccb' sysfs='dm-35' paths='2'
name=mpaths multipath='35000c500940d57bf' sysfs='dm-36' paths='2'
name=mpatht multipath='35000c50093f1854b' sysfs='dm-37' paths='2'
name=mpathu multipath='35000c50093f666d3' sysfs='dm-38' paths='2'
name=mpathv multipath='35000c500940da247' sysfs='dm-39' paths='2'
name=mpathw multipath='35000c50094143b53' sysfs='dm-40' paths='2'
name=mpathx multipath='35000c500940d3233' sysfs='dm-41' paths='2'
name=mpathy multipath='35000c500940d0597' sysfs='dm-42' paths='2'
name=mpathz multipath='35000c500940d1beb' sysfs='dm-43' paths='2'
name=mpathaa multipath='35000c500940cd137' sysfs='dm-44' paths='2'
name=mpathab multipath='35000c500940d13f3' sysfs='dm-45' paths='2'
name=mpathac multipath='35000c50093f15937' sysfs='dm-46' paths='2'
name=mpathad multipath='35000c500940d2d53' sysfs='dm-47' paths='2'
name=mpathae multipath='35000c500940d45db' sysfs='dm-48' paths='2'
name=mpathaf multipath='35000c50093ff59af' sysfs='dm-49' paths='2'
name=mpathag multipath='35000c500940d2743' sysfs='dm-50' paths='2'
name=mpathah multipath='35000c500940db9b7' sysfs='dm-51' paths='2'
name=mpathai multipath='35000c500940db8ef' sysfs='dm-52' paths='2'
name=mpathaj multipath='35000c500940db813' sysfs='dm-53' paths='2'
name=mpathak multipath='35000c50093ff5673' sysfs='dm-54' paths='2'
name=mpathal multipath='35000c50093f725bb' sysfs='dm-55' paths='2'
name=mpatham multipath='35000c500940d4c17' sysfs='dm-56' paths='2'
name=mpathan multipath='35000c500940dcfdf' sysfs='dm-57' paths='2'
name=mpathao multipath='35000c500940d22e3' sysfs='dm-58' paths='2'
name=mpathap multipath='35000c500940d41f7' sysfs='dm-59' paths='2'
name=mpathaq multipath='35000c500940d45eb' sysfs='dm-60' paths='2'
name=mpathar multipath='35000c500940d05af' sysfs='dm-61' paths='2'
name=mpathas multipath='35000c500940d47a7' sysfs='dm-62' paths='2'
name=mpathat multipath='35000c500940cede3' sysfs='dm-63' paths='2'
name=m...

Read more...

Revision history for this message
Thierry FAUCK (thierry-j) wrote :
Revision history for this message
Thierry FAUCK (thierry-j) wrote :
Revision history for this message
Thierry FAUCK (thierry-j) wrote :
Revision history for this message
Ryan Harper (raharper) wrote :

> name=nqn.1994-11.com.samsung:nvme:PM1725a:HHHL:S3RVNA0J300208 :nsid.1 multipath='eui.335256304a3002080025384100000001' sysfs='nvme0n1' paths='1'

That's the culprit.

NVME multipath ...

@Dimitri
Was there another bug related to blacklisting NVME multipath by default?

Why does the nvme multipath 'name' field include colon separated fields?

Ryan Harper (raharper)
Changed in curtin:
status: Incomplete → In Progress
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: Incomplete → In Progress
Revision history for this message
Server Team CI bot (server-team-bot) wrote :

This bug is fixed with commit 48703979 to curtin on branch master.
To view that commit see the following URL:
https://git.launchpad.net/curtin/commit/?id=48703979

Changed in curtin:
status: In Progress → Fix Committed
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
Revision history for this message
Thierry FAUCK (thierry-j) wrote :

I spent some time trying to test that small patch and the result is not good as far as I can tell

- I copy /snap/subiquity/1939/lib/python3.6/site-packages/curtin to /usr/lib/python3.8/site-packages and coming back to installer and I get a message

================================================================================
 SSH Setup [tar: Removing leading `/' from member names================================================
  lqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqk  x x  x Sorry, there was a problem completing the installation. x

Am I doing something bad (for example should I need another ISO, ....) ? Any other way to test the installer ?

Revision history for this message
Patricia Domingues (patriciasd) wrote :

Thierry, the curtin fix is not yet available on the server daily-live images.
I'd say copying over curtin manually is not preferable.
I think it would be worth retesting it, just if you have a smaller multipath system.
If you don't, we could wait until the fix is published to Ubuntu Server images.

Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

@thierry-j, a fix for this should now be available from the subiquity edge channel.

There are instructions on how to do an edge channel subiquity installation here: https://discourse.ubuntu.com/t/how-to-test-the-latest-version-of-subiquity/12428. But let us know if you run into difficulties.

Would you be able to try that out on your "massively multipathed" system and let us know the result?

Many thanks.

Changed in subiquity:
status: Incomplete → In Progress
Revision history for this message
Thierry FAUCK (thierry-j) wrote : Re: [Bug 1878041] Re: installation fails creating user on large multipath system - multipathd show command times out in curtin log with 163 paths

On 8/25/2020 7:48 PM, Andrew Cloke wrote:
> @thierry-j, a fix for this should now be available from the subiquity
> edge channel.
>
> There are instructions on how to do an edge channel subiquity
> installation here: https://discourse.ubuntu.com/t/how-to-test-the-
> latest-version-of-subiquity/12428. But let us know if you run into
> difficulties.
>
> Would you be able to try that out on your "massively multipathed" system
> and let us know the result?
>
> Many thanks.
>
>
> ** Changed in: subiquity
> Status: Incomplete => In Progress
>
Installer uprade fails looping with messages:

--   File
"/snap/subiquity/2000/lib/python3.6/site-packages/subiquity/core.py",
line 274, in run     super().run()
   File
"/snap/subiquity/2000/lib/python3.6/site-packages/subiquitycore/core.py",
line 488, in run    self.urwid_loop.run()
   File
"/snap/subiquity/2000/usr/lib/python3/dist-packages/urwid/main_loop.py",
line 286, in run     self._run()
   File
"/snap/subiquity/2000/usr/lib/python3/dist-packages/urwid/main_loop.py",
line 384, in _run     self.event_loop.run()
   File
"/snap/subiquity/2000/usr/lib/python3/dist-packages/urwid/main_loop.py",
line 1484, in run     reraise(*exc_info)
   File
"/snap/subiquity/2000/usr/lib/python3/dist-packages/urwid/compat.py",
line 58, in reraise     raise value
   File "/snap/subiquity/2000/usr/lib/python3.6/asyncio/events.py", line
145, in _run
     self._callback(*self._args)
   File
"/snap/subiquity/2000/lib/python3.6/site-packages/subiquity/core.py",
line 366, in select_initial_screen
     self.error_reporter.start_loading_reports()
   File
"/snap/subiquity/2000/lib/python3.6/site-packages/subiquity/common/errorreport.py",
line 349, in start_loading_reports
     os.makedirs(self.crash_directory, exist_ok=True)
AttributeError: 'ErrorReporter' object has no attribute 'crash_directory'

So I couldn't go further and test
Regards

__________________________________________________________________________
thf - Thierry Fauck - <email address hidden>>
/pubkey: 4096R/FCC181CE/
/fingerprint: 5CCF 6B82 DE4E E72A A40B B63E A153 BF4F FCC1 81CE/

Revision history for this message
Patricia Domingues (patriciasd) wrote :

Thierry, thanks for replying it.
I've added a more detailed instructions on how to install the latest Subiquity version:

https://people.canonical.com/~patriciasd/ppc64le/power_subiquity

I was able to install it:
```
  latest/edge: 20.07.1+git29.fc51315d 2020-09-02 (2004) 54MB classic
installed: 20.07.1+git29.fc51315d (2004) 54MB classic
```

Could you please try following the link above?
Let me know if you have any questions.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

The AttributeError: 'ErrorReporter' object has no attribute 'crash_directory' crash should be fixed now in edge btw.

Revision history for this message
Thierry FAUCK (thierry-j) wrote :

Update of installer is now successful, but the system still crash when selecting
when I press enter to confirm 'file system destruction/creation

' after defining file systems I get an error and when I say

'send to canonical' I get messages

Revision history for this message
Thierry FAUCK (thierry-j) wrote :
Revision history for this message
Thierry FAUCK (thierry-j) wrote :
Revision history for this message
Thierry FAUCK (thierry-j) wrote :
Revision history for this message
Thierry FAUCK (thierry-j) wrote :
Revision history for this message
Thierry FAUCK (thierry-j) wrote :
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

FWIW the failure is now this:

 curtin.util.ProcessExecutionError: Unexpected error while running command.
 Command: ['multipathd', 'show', 'maps', 'raw', 'format', "name=%n multipath='%w' sysfs='%d' paths='%N'"]
 Exit code: 1
 Reason: -
 Stdout: timeout receiving packet

 Stderr: ''

So it sounds like the timeout mentioned in comment #7 still needs to be raised?

Changed in ubuntu-power-systems:
status: Fix Committed → In Progress
Changed in curtin:
status: Fix Committed → In Progress
Revision history for this message
Ryan Harper (raharper) wrote :

For the timeout, I'm wondering why curtin should be setting this value? It's not accessible via the multipathd command line, rather it's a multipath.conf setting with a default.

I don't think it makes sense for curtin to muck with multipath.conf temporarily when running one or more of the multipathd commands.

Should multipath packaging provide a different default value? If not in the package, where and why?

Changed in curtin:
status: In Progress → Incomplete
Revision history for this message
Dimitri John Ledkov (xnox) wrote :

I do think multipath should be bumped.

Revision history for this message
Frank Heimes (fheimes) wrote :

For 20.10 a 'known issue' entry was added to the release notes:
"LP :#1878041 - In case of multipath systems with huge amounts of paths, the installer may hit a timeout."

Changed in ubuntu-release-notes:
status: New → Fix Released
tags: added: fr-885
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Looking at this a bit more carefully, I think we can remove the call that is failing and replace it with some udev checking instead. I'm actually a bit surprised it fails -- a very similar call was made by probert earlier in the run and that seems to have worked fine (although I can't . I wonder if it is conflicting over the multipathd lock with multipath calls made by udev in response to device changes or something like that. Do you still have access to this system for testing? I can probably make a snap for testing with tomorrow.

Revision history for this message
Thierry FAUCK (thierry-j) wrote :

Tried to boot hirsute dated of 01/14 - it failed to initramfs

Booted with 20.10 suiquity-channel=edge - and it failed again - sent to canonical the failure using the installer

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Hi, can you try again with suiquity-channel=edge/lp1878041 please? If it fails, and it's possible, can you attach the .crash file here? Error reports to the tracker get truncated in ways that make them a bit less useful for this sort of thing (it is also hard to find a specific report).

The initramfs failure sounds worrying, can you file a separate bug about that?

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2021-01-21 04:14 EDT-------
still crash during install after choice of disk - as asked here is the content of /var/crash

Revision history for this message
bugproxy (bugproxy) wrote : crash from install of 01/21

------- Comment (attachment only) From <email address hidden> 2021-01-21 04:16 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2021-01-21 04:49 EDT-------
By the way, I checked with hirsute and system now boots and stops at same level !!!

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Thanks for that. The failure is different now (which suggests my changes have helped):

 Running command ['sgdisk', '--new', '2:18432:2424585727', '--typecode=2:8300', '/dev/mapper/mpathr'] with allowed return codes [0] (capture=True)
 An error occured handling 'partition-1': ProcessExecutionError - Unexpected error while running command.
 Command: ['sgdisk', '--new', '2:18432:2424585727', '--typecode=2:8300', '/dev/mapper/mpathr']
 Exit code: 4
 Reason: -
 Stdout: ''
 Stderr: Could not create partition 2 from 18432 to 2424585727
         Could not change partition 2's type code to 8300!
         Error encountered; not saving changes.

I think curtin has some bugs with multipath disks where the block size is 4096.

The end of the attempted partition in bytes is 2424585727*4096 = 9931103137792, larger than the reported size of the disk, 9931038130176.

The first call to sgdisk was

['sgdisk', '--new', '1:256:2303', '--typecode=1:4100', '/dev/mapper/mpathr']

So simply based on this, I would expect the start sector of the second partition to be 2304, not 18432. Chasing code a bit, I think the function calc_dm_partition_info (https://git.launchpad.net/curtin/tree/curtin/commands/block_meta.py#n666) is doing something wrong wrt sector sizes but it's too late at night for me to really even try to see what. Back tomorrow with a fresher brain.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Actually lets try something simple, can you try suiquity-channel=edge/mwhudson-hack when this build finishes (probably in about 20 minutes, but I'm going to bed): https://launchpad.net/~mwhudson/+snap/subiquity-hack/+build/1267203

Also I'm glad to hear today's hirsute daily works!

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

This bug is fixed with commit ea15dfa9 to curtin on branch master.
To view that commit see the following URL:
https://git.launchpad.net/curtin/commit/?id=ea15dfa9

Changed in curtin:
status: Incomplete → Fix Committed
Revision history for this message
bugproxy (bugproxy) wrote : crash of 01/22

------- Comment (attachment only) From <email address hidden> 2021-01-22 05:07 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : crash itself

------- Comment (attachment only) From <email address hidden> 2021-01-22 05:09 EDT-------

Revision history for this message
Thierry FAUCK (thierry-j) wrote :

When I reach the FILE SYSTEM SUMMARY panel and say 'continue' I get the 'an error occured during installation' message within a minute.
I was running the parameters : snap refresh subiquity --channel=edge/mwhudson-hack
I didn't apply to update to the new version of the installer as I believe it would override the previous parameter. I wait for 5mn at FILE SYSTEM SUMMARY and nothing happen, so it sound slike it is when I validate the problem occures. Attached you'll find content of crash.

Revision history for this message
Thierry FAUCK (thierry-j) wrote :
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Hm that's failing in the same sort of way. I've just pushed a new version (0+git.5ef5b352) to edge/mwhudson-hack, can you try that? Hopefully that will work better, or at least have more informative logs.

Revision history for this message
Thierry FAUCK (thierry-j) wrote :

unfortunately same behaviour

Revision history for this message
Thierry FAUCK (thierry-j) wrote :
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla
Download full text (5.5 KiB)

------- Comment From <email address hidden> 2021-01-26 14:33 EDT-------
Log contains:
get_blockdev_sector_size: (log=4096, phys=4096)
Running command ['fdasd', '--table', '/dev/dm-36'] with allowed return codes [0] (capture=False)
Checking if /dev/dm-36 is a swap device
Found swap magic: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
Running command ['udevadm', 'info', '--query=property', '--export', '/dev/dm-36'] with allowed return codes [0] (capture=True)
/dev/dm-36 is multipath device? False
Running command ['multipath', '-c', '/dev/dm-36'] with allowed return codes [0] (capture=True)
/dev/dm-36 is multipath device member? False
wiping superblock on /dev/dm-36
wiping /dev/dm-36 attempt 1/4
wiping 1M on /dev/dm-36 at offsets [0, -1048576]
successfully wiped device /dev/dm-36 on attempt 1/4
shutdown running on holder type: 'disk' syspath: '/sys/class/block/dm-34'
Running command ['lsblk', '--noheadings', '--bytes', '--pairs', '--output=ALIGNMENT,DISC-ALN,DISC-GRAN,DISC-MAX,DISC-ZERO,FSTYPE,GROUP,KNAME,LABEL,LOG-SEC,MAJ:MIN,MIN-IO,MODE,MODEL,MOUNTPOINT,NAME,OPT-IO,OWNER,PHY-SEC,RM,RO,ROTA,RQ-SIZE,SIZE,STATE,TYPE,UUID', '/dev/dm-34'] with allowed return codes [0] (capture=True)
get_blockdev_sector_size: info:
{
"dm-34": {
"ALIGNMENT": "0",
"DISC-ALN": "0",
"DISC-GRAN": "0",
"DISC-MAX": "0",
"DISC-ZERO": "0",
"FSTYPE": "",
"GROUP": "disk",
"KNAME": "dm-34",
"LABEL": "",
"LOG-SEC": "4096",
"MAJ:MIN": "253:34",
"MIN-IO": "4096",
"MODE": "brw-rw----",
"MODEL": "",
"MOUNTPOINT": "",
"NAME": "mpathr",
"OPT-IO": "0",
"OWNER": "root",
"PHY-SEC": "4096",
"RM": "0",
"RO": "0",
"ROTA": "1",
"RQ-SIZE": "256",
"SIZE": "9931038130176",
"STATE": "running",
"TYPE": "mpath",
"UUID": "",
"device_path": "/dev/dm-34"
},
"dm-35": {
"ALIGNMENT": "0",
"DISC-ALN": "0",
"DISC-GRAN": "0",
"DISC-MAX": "0",
"DISC-ZERO": "0",
"FSTYPE": "",
"GROUP": "disk",
"KNAME": "dm-35",
"LABEL": "",
"LOG-SEC": "4096",
"MAJ:MIN": "253:35",
"MIN-IO": "4096",
"MODE": "brw-rw----",
"MODEL": "",
"MOUNTPOINT": "",
"NAME": "mpathr-part1",
"OPT-IO": "0",
"OWNER": "root",
"PHY-SEC": "4096",
"RM": "0",
"RO": "0",
"ROTA": "1",
"RQ-SIZE": "128",
"SIZE": "8388608",
"STATE": "running",
"TYPE": "part",
"UUID": "",
"device_path": "/dev/dm-35"
},
"dm-36": {
"ALIGNMENT": "0",
"DISC-ALN": "0",
"DISC-GRAN": "0",
"DISC-MAX": "0",
"DISC-ZERO": "0",
"FSTYPE": "",
"GROUP": "disk",
"KNAME": "dm-36",
"LABEL": "",
"LOG-SEC": "4096",
"MAJ:MIN": "253:36",
"MIN-IO": "4096",
"MODE": "brw-rw----",
"MODEL": "",
"MOUNTPOINT": "",
"NAME": "mpathr-part2",
"OPT-IO": "0",
"OWNER": "root",
"PHY-SEC": "4096",
"RM": "0",
"RO": "0",
"ROTA": "1",
"RQ-SIZE": "128",
"SIZE": "1073741824",
"STATE": "running",
"TYPE": "part",
"UUID": "",
"device_path": "/dev/dm-36"
}
}
get_blockdev_sector_size: (log=4096, phys=4096)
Checking if /dev/dm-34 is a swap device
Found swap magic: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
Running command ['udevadm', 'info', '--query=property', '--export', '/dev/dm-34'] with allowed return codes [0] (capture=True)
/dev/dm-34 is multipath device? True
Running command ['multipathd', 'show', 'maps', 'raw', 'format', "name='%n' multipath='%w' sysfs='%d' paths='%N'"] with allowed return codes [0] (capture=True...

Read more...

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

It looks like the snap didn't get updated to the version from edge/mwhudson-hack for that run?

Revision history for this message
Patricia Domingues (patriciasd) wrote :

Hi Thierry thanks again for testing this,

From your crash files, I noticed you're using an image built on October -2020-10-22 14:33 (url=http://cdimage.ubuntu.com/releases/20.10/release/ubuntu-20.10-live-server-ppc64el.iso) and also subiquity is an old version:
` 2021-01-26 08:10:48,735 INFO subiquity:135 Starting Subiquity revision 2101 `
```
SnapRevision: 2101
SnapUpdated: False
SnapVersion: 20.09.1+git1.67496d12
SourcePackage: subiquity
```
so, I'm going to ask you for trying again following these steps, please:

https://people.canonical.com/~patriciasd/ppc64le/power_subiquityJAN26

with this latest 20.10 URL: http://cdimage.ubuntu.com/ubuntu-server/daily-live/20210126/hirsute-live-server-ppc64el.iso

For my own reference I've tested it on P9 witherspoon (bobone).

Let me know if you have any questions.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2021-01-26 16:10 EDT-------
Bu in the crash file we get the info: Kernel command line: ip=9.40.193.115::9.40.193.1:255.255.255.0:ltc-boston113:enP1p9s0f0:none:9.3.1.200 url=http://cdimage.ubuntu.com/releases/20.10/release/ubuntu-20.10-live-server-ppc64el.iso snap refresh subiquity --channel=edge/mwhudson-hack --- quiet

or in the meta file:
[ 1.572051] This architecture does not have kernel memory protection.
[ 1.572057] Run /init as init process
[ 1.572059] with arguments:
[ 1.572060] /init
[ 1.572061] snap
[ 1.572062] refresh
[ 1.572063] subiquity
[ 1.572064] ---
[ 1.572065] with environment:
[ 1.572066] HOME=/
[ 1.572067] TERM=linux
[ 1.572068] ip=9.40.193.115::9.40.193.1:255.255.255.0:ltc-boston113:enP1p9s0f0:none:9.3.1.200
[ 1.572069] url=http://cdimage.ubuntu.com/releases/20.10/release/ubuntu-20.10-live-server-ppc64el.iso
[ 1.572070] --channel=edge/mwhudson-hack

Did I mistyped the parameters ?

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Yes, it should be subiquity-channel=edge/mwhudson-hack on the kernel command line

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote : Fixed in curtin version 21.2.

This bug is believed to be fixed in curtin in version 21.2. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in curtin:
status: Fix Committed → Fix Released
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2021-01-27 02:58 EDT-------
Retried with parameter Kernel command line: ip=9.40.193.115::9.40.193.1:255.255.255.0:ltc-boston113:enP1p9s0f0:none:9.3.1.200 url=http://cdimage.ubuntu.com/releases/20.10/release/ubuntu-20.10-live-server-ppc64el.iso snap refresh subiquity-channel=edge/mwhudson-hack --- quiet
I saw the subiquity build related to the mwhudson-hack , so I expect this time it is good
same behaviour for failure

Revision history for this message
bugproxy (bugproxy) wrote : crash

------- Comment (attachment only) From <email address hidden> 2021-01-27 03:00 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : meta

------- Comment (attachment only) From <email address hidden> 2021-01-27 03:02 EDT-------

Revision history for this message
Patricia Domingues (patriciasd) wrote :

Thanks again Thierry, there's a new subiquity version available.
Please, could you try with Hirsute/21.04 -
This URL:
`http://cdimage.ubuntu.com/ubuntu-server/daily-live/current/hirsute-live-server-ppc64el.iso` and you do not need to add `subiquity-channel=...` on kernel command line.

Revision history for this message
bugproxy (bugproxy) wrote : crash for hirsute

------- Comment on attachment From <email address hidden> 2021-01-28 04:40 EDT-------

today's crash

Revision history for this message
bugproxy (bugproxy) wrote : meta

------- Comment (attachment only) From <email address hidden> 2021-01-28 04:40 EDT-------

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

The crash file is interesting, it seems that we have many many udev timeouts getting stuck for more than 120s.

I'm now not sure if these are timeouts in lvm calls of udev. or somewhere else.

It would be interesting to see if one can complete the install without lvm, with just ext4.

Also not sure how to make this more resilient. Cause clearly things are not broken, it's just there is a lot of things to do. Hence timeouts are getting in the way, rather than being helpful.

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2021-01-28 12:03 EDT-------
Comments and time out are related to NVME devices
[ 130.332163] blk_update_request: protection error, dev nvme0c0n1, sector 3124995968 op 0x0:(READ) flags 0x2090700 phys_seg 1 prio class 0
[ 130.333596] blk_update_request: protection error, dev nvme0c0n1, sector 3124995968 op 0x0:(READ) flags 0x2010000 phys_seg 1 prio class 0
[ 130.333639] Buffer I/O error on dev nvme0n1p2, logical block 24413887, async page read

Apparently not using LVM seems to work .

May be the problem was to use a device with already LVM in there !

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

thank you for that.

i am tempted to close out the multipath package task; and open an lvm package task.

i'm not sure if we allow reusing existing lvms, if we do, that would be also an intersting datapoint test case.

normally lvm commands do create udev cookies, and trigger udev to settle with a timeout. I do wonder, if curtin should be doing something more clever here. Like lookaside at all the prober data and try to disable that behaviour and instead do udev triggers / waiting by itself possibly with larger timeouts.

or if we can somehow "speed things up" by deactivating all the multipaths and all the "unused" drives not selected for the installation.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Oh so the last crash with LVM was this silly thing: https://bugs.launchpad.net/subiquity/+bug/1905412. I really need to fix this.

I think given that this bug report has fixed two real problems and has resulted in a successful install without LVM, I'm going to close it and any new problems can be handed in a new report. Feel free to reopen if you disagree :)

Also, thanks for all your testing with this system, it's definitely made the installer better!

Changed in ubuntu-power-systems:
status: In Progress → Fix Released
Changed in subiquity:
status: In Progress → Fix Released
Changed in multipath-tools (Ubuntu):
status: New → Invalid
Revision history for this message
Frank Heimes (fheimes) wrote :

Agreeing to Michaels suggestion.
Marking LP 1905412 as affecting the Power project.

To post a comment you must log in.