Error: ('timed out',) - can not deploy large images with Ironic

Bug #1997581 reported by Narinder Gupta
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Charm Guide
Fix Released
Undecided
Unassigned
OpenStack Ironic API Charm
Invalid
Undecided
Unassigned
OpenStack Ironic Conductor Charm
Fix Committed
Medium
Felipe Reyes
Ussuri
Fix Released
Undecided
Unassigned
Victoria
Fix Released
Undecided
Unassigned
Wallaby
Fix Released
Undecided
Unassigned
Xena
Fix Released
Undecided
Unassigned
Yoga
Fix Released
Undecided
Unassigned
Zed
Fix Released
Undecided
Unassigned

Bug Description

Deploy Ussuri Ironic on Focal and after configuraitons can deploy the Operating systems like ubuntu, centos etc. But can not deploy windows iamges as it shows timeout error.

Error: 10 12:56:55 debian ironic-python-agent[1690]: 2022-11-10 12:56:55.056 1690 WARNING ironic_python_agent.extensions.standby [-] Image download failed, 0 of 9: Error downloading image: Download of image 7276f86b-2f0a-4539-b193-bbdad96d181f failed: Unable to write image to /tmp/7276f86b-2f0a-4539-b193-bbdad96d181f. Error: ('timed out',): ironic_python_agent.errors.ImageDownloadError: Error downloading image: Download of image 7276f86b-2f0a-4539-b193-bbdad96d181f failed: Unable to write image to /tmp/7276f86b-2f0a-4539-b193-bbdad96d181f. Error: ('timed out',)

While we can deploy other images without any issue and also wget works for the link it timeouts.

Revision history for this message
Felipe Reyes (freyes) wrote : Re: [Bug 1997581] [NEW] can not deploy windows image with Itonic

>
> Error: 10 12:56:55 debian ironic-python-agent[1690]: 2022-11-10
> 12:56:55.056 1690 WARNING ironic_python_agent.extensions.standby [-]
> Image download failed, 0 of 9: Error downloading image: Download of
> image 7276f86b-2f0a-4539-b193-bbdad96d181f failed: Unable to write image
> to /tmp/7276f86b-2f0a-4539-b193-bbdad96d181f. Error: ('timed out',):
> ironic_python_agent.errors.ImageDownloadError: Error downloading image:
> Download of image 7276f86b-2f0a-4539-b193-bbdad96d181f failed: Unable to
> write image to /tmp/7276f86b-2f0a-4539-b193-bbdad96d181f. Error: ('timed
> out',)
>

Could you share the size of the Windows images?, how long is it taking to get the "timed out"
error?.

Revision history for this message
Narinder Gupta (narindergupta) wrote : Re: can not deploy windows image with Itonic

Windows image size is 9.14 GB and checked in /temp folder up to 4GB it download then times out for rest. this is main error. Unable to write image to /tmp/7276f86b-2f0a-4539-b193-bbdad96d181f. Error: ('timed out',)

Revision history for this message
Narinder Gupta (narindergupta) wrote :

uploading the logs from ironic.

summary: - can not deploy windows image with Itonic
+ can not deploy windows image with Ironic
Revision history for this message
Gabriel Samfira (gabriel-samfira) wrote : Re: can not deploy windows image with Ironic

Hi Narinder!

A few steps we could try to debug this:

1) Would you mind converting the Windows image to RAW (it's a good idea to have all images as RAW, as that's the native ceph storage - allows CoW) and uploading it to glance again. This would avoid downloading the image to /tmp and allow ironic to stream it directly to the block device. It should be considerably faster.

2) It would also be interesting to know what kind of download speeds you have. Looking at the log you attached, it seems you're getting a "connect timeout" while attempting to download the image. This would suggest connectivity issues between the node that ironic agent is running on and trying to install, and swift. Perhaps a misconfigured port? But if you have around 4 GB downloaded, this may just be a misleading message (?)

3) As a debug step, could you try to ssh into the deploying instance and attempt to manually download the image using the same temp URL ironic agent is trying to fetch? Curl/wget should work.

Revision history for this message
Narinder Gupta (narindergupta) wrote (last edit ):

please find the answers as part of debugging effort we tried few things as mentioned below.

1) Would you mind converting the Windows image to RAW (it's a good idea to have all images as RAW, as that's the native ceph storage - allows CoW) and uploading it to glance again. This would avoid downloading the image to /tmp and allow ironic to stream it directly to the block device. It should be considerably faster.
Let me try that and will get back to you.

2) Well we have good speed on the instances and i can confirm that manual down just works fine using the same temp URL. As far as port misconfiguration is concerned we have verified the same multiple time and linux images on the same server just works fine.

3)We tried point 3 and I can confirm that manual down just works fine using the same temp URL on the same instance where timeout is seeing. Also Linux images just works fine.

Revision history for this message
Felipe Reyes (freyes) wrote :

@Narinder, any chance of getting the logs of radosgw?, because looking at the logs provided in comment #3 , I see many attempts of ironic-python-agent to download the image although the connection fails with "timed out", then when it finally is able to connect to radosgw, it gets a 401 error "Received status code 401 from https://radosgw2-i... expected 200. Response body: AccessDenied".

https://paste.ubuntu.com/p/Gbg2pQRJkQ/

Revision history for this message
Narinder Gupta (narindergupta) wrote :

One more note is that, today we built the oracle linux image with 15G size, that image is also times out with same error message. Could you check for any time out value set for image download which can be increased.

https://bugs.launchpad.net/ironic-python-agent/+bug/1661328

Revision history for this message
Narinder Gupta (narindergupta) wrote (last edit ):

Raw image did not work as well even when we change the config in Ironic with stream raw images = true

Jan 11 20:05:51 debian ironic-python-agent[1692]: 2023-01-11 20:05:51.982 1692 WARNING ironic_python_agent.extensions.standby [-] Image download failed, 0 of 20: Error downloading image: Download of image 6a7f827d-b10f-4f07-be0c-d510fa4415a4 failed: Unable to write image to device /dev/sda. Error: ('timed out',): ironic_python_agent.errors.ImageDownloadError: Error downloading image: Download of image 6a7f827d-b10f-4f07-be0c-d510fa4415a4 failed: Unable to write image to device /dev/sda. Error: ('timed out',)

Jan 11 20:06:12 debian ironic-python-agent[1692]: 2023-01-11 20:06:12.005 1692 INFO ironic_python_agent.extensions.standby [-] Attempting to download image from https://radosgwurl:443/swift/v1/AUTH_2815f61d907a424d86503e5e94f2d27b/glance/6a7f827d-b10f-4f07-be0c-d510fa4415a4?temp_url_sig=4ca8363b6c01048c8fe14cfc368d7bde46411251&temp_url_expires=1673468247

Revision history for this message
Gabriel Samfira (gabriel-samfira) wrote (last edit ):

@Narinder

From the node being deployed, could you try:
(Edit: Correction, try arping from a node on the same L2 network as the radosGW service.)

arping radosgwhostname # or the VIP of the service

The way it's behaving, it sounds like either the VIP is erroneously configured on more than one container/instance (happened to me a few times with the hacluster charm), or more than one system on your network has the same IP configured. While the ARP table has the proper MAC:IP, you are able to download without issue, and when it flips, it times out/exhibits "connect timeout", because you loose the TCP session. Small images can squeeze in between the moments the ARP "poisoning" happens. But larger images will get interrupted when it happens.

I cannot say for certain without access to your system, but from the look of the logs, it's possible this is what you're seeing.

Revision history for this message
Felipe Reyes (freyes) wrote :

I'm marking this bug as incomplete while the information requested is collected and shared.

Data needed:

- Run arping from a node on the same L2 network as the radosGW service to verify VIP is correctly configured.
- Collect logs from the node that's attempting to be deployed + radosgw, ideally setting radosgw loglevel to 3, but 2 would be OK too, it all depends on the free disk space available and how busy your radosgw is serving requests
  juju config ceph-radosgw loglevel=3

Changed in charm-ironic-api:
status: New → Incomplete
summary: - can not deploy windows image with Ironic
+ Error: ('timed out',) - can not deploy large images with Ironic
Revision history for this message
Narinder Gupta (narindergupta) wrote :

logs from IPA agents

Revision history for this message
Narinder Gupta (narindergupta) wrote :

logs from rados gw units

Revision history for this message
Narinder Gupta (narindergupta) wrote :

logs from Rados GW units

Revision history for this message
Felipe Reyes (freyes) wrote :

the logs from IPA ( https://bugs.launchpad.net/charm-ironic-api/+bug/1997581/comments/11 ) are from November 28th, so we can't correlate them with radosgw.

Although I was able to do some digging on radosgw:

I see just 1 request url (with 381 hits) that seems to be coming from ipa

$ grep GET ceph-client.rgw.iadaz02sasgtw0*.log | awk '{print $11}' | sort | uniq -c
...
    381 /swift/v1/AUTH_2815f61d907a424d86503e5e94f2d27b/glance/6a7f827d-b10f-4f07-be0c-d510fa4415a4?temp_url_sig=467d0136a17c2b5644d61cb4043a122a448b181c&temp_url_expires=1673623768
...

There are 189 requests per unit that returned a 401 error, and only 3 requests succeeded (code 200 on GET)

$ grep 467d0136a17c2b5644d61cb4043a122a448b181 ceph-client.rgw.iadaz02sasgtw01.log | awk '{print $10, $13}' | sort | uniq -c
      1 "GET 200
    189 "GET 401
$ grep 467d0136a17c2b5644d61cb4043a122a448b181 ceph-client.rgw.iadaz02sasgtw02.log | awk '{print $10, $13}' | sort | uniq -c
      2 "GET 200
    189 "GET 401
      1 "HEAD 200

it would be great to have sosreports from radosgw units to check other aspects of the system.

Felipe Reyes (freyes)
Changed in charm-ironic-api:
status: Incomplete → New
Revision history for this message
Felipe Reyes (freyes) wrote :

James found that there is a config option in ironic called [glance]swift_temp_url_duration, this option's default is 20m, it precisely indicates that if deployments get a 401 error when trying to download the image, this time should be increased.

I was checking the charms and we have no facilities to update this in the charm, I will prepare a custom charm with a very high timeout so you can test this.

[0] https://docs.openstack.org/ironic/latest/configuration/config.html#glance.swift_temp_url_duration

Revision history for this message
Felipe Reyes (freyes) wrote :

Here I'm attaching a charm that contains a ironic.conf template with [glance].swift_temp_url_duration set to 24hr, this is a very large timeout for testing purposes, if you can test the charm in your would be great, please let me know how it works.

juju upgrade-charm --path ./ironic-conductor-lp1997581.0.charm ironic-conductor

```
diff --git a/src/templates/train/ironic.conf b/src/templates/train/ironic.conf
index 4c5a59f..bd548d7 100644
--- a/src/templates/train/ironic.conf
+++ b/src/templates/train/ironic.conf
@@ -49,6 +49,7 @@ swift_container = glance
 {% if options.temp_url_secret -%}
 swift_temp_url_key = {{ options.temp_url_secret }}
 {% endif %}
+swift_temp_url_duration = 86400 # 24hr

 [swift]
 {% include "parts/service-auth" %}
```

Revision history for this message
Felipe Reyes (freyes) wrote :

note: I'm adding a task on this bug for charm-ironic-conductor because we will want to expose [glance].swift_temp_url_duration in the config options in any case, since it's an important knob for operators.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ironic-conductor (master)
Changed in charm-ironic-conductor:
status: New → In Progress
Felipe Reyes (freyes)
Changed in charm-ironic-conductor:
importance: Undecided → Medium
assignee: nobody → Felipe Reyes (freyes)
Revision history for this message
Felipe Reyes (freyes) wrote :

Closing the loop on downloading large images, the issue is specific to the environment and not a software bug, there seems to be a firewall in the middle between the ironic's network and ceph-radosgw frontend that's shutting down the connection based on an undetermined criteria, it was possible to download the image successfully using "curl" many times with the flag "-C -" to continue from where it was left, an image of 16GiB was completed in 5 attempts.

In any case the charm needs to expose the swift_temp_url_duration config so it can be tweaked by operators with slow to POST servers.

Changed in charm-ironic-api:
status: New → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ironic-conductor (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ironic-conductor/+/871117
Committed: https://opendev.org/openstack/charm-ironic-conductor/commit/0cd0912668675128dd1d838783c8e452465f7a3b
Submitter: "Zuul (22348)"
Branch: master

commit 0cd0912668675128dd1d838783c8e452465f7a3b
Author: Felipe Reyes <email address hidden>
Date: Thu Jan 19 11:17:12 2023 -0300

    Add config option swift-temp-url-duration

    This config option allows operators to increase the duration of
    temporary URLs handed to ironic-python-agent to download images when
    using the 'direct' deployment interface[0].

    The rendered file (ironic.conf) will only include
    [glance].swift_temp_url_duration when it's overriden by the
    operator, otherwise the upstream's default is used by omitting the key
    defition.

    [0] https://docs.openstack.org/ironic/latest/admin/interfaces/deploy.html#direct-deploy

    Closes-Bug: #1997581
    Change-Id: I6a4a582b199e5137a9e5fd185882acec20424f3a

Changed in charm-ironic-conductor:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ironic-conductor (stable/zed)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ironic-conductor (stable/yoga)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ironic-conductor (stable/xena)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ironic-conductor (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/charm-ironic-conductor/+/872308

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ironic-conductor (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/charm-ironic-conductor/+/872309

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ironic-conductor (stable/ussuri)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ironic-conductor (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/charm-ironic-conductor/+/872437
Committed: https://opendev.org/openstack/charm-ironic-conductor/commit/dbba6ae1d3e477282846f3190affea36b341fca1
Submitter: "Zuul (22348)"
Branch: stable/zed

commit dbba6ae1d3e477282846f3190affea36b341fca1
Author: Felipe Reyes <email address hidden>
Date: Thu Jan 19 11:17:12 2023 -0300

    Add config option swift-temp-url-duration

    This config option allows operators to increase the duration of
    temporary URLs handed to ironic-python-agent to download images when
    using the 'direct' deployment interface[0].

    The rendered file (ironic.conf) will only include
    [glance].swift_temp_url_duration when it's overriden by the
    operator, otherwise the upstream's default is used by omitting the key
    defition.

    [0] https://docs.openstack.org/ironic/latest/admin/interfaces/deploy.html#direct-deploy

    Closes-Bug: #1997581
    Change-Id: I6a4a582b199e5137a9e5fd185882acec20424f3a
    (cherry picked from commit 0cd0912668675128dd1d838783c8e452465f7a3b)

tags: added: in-stable-zed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ironic-conductor (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/charm-ironic-conductor/+/872438
Committed: https://opendev.org/openstack/charm-ironic-conductor/commit/22998bb693cfceaa485dc23c607f89e6f267da56
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 22998bb693cfceaa485dc23c607f89e6f267da56
Author: Felipe Reyes <email address hidden>
Date: Thu Jan 19 11:17:12 2023 -0300

    Add config option swift-temp-url-duration

    This config option allows operators to increase the duration of
    temporary URLs handed to ironic-python-agent to download images when
    using the 'direct' deployment interface[0].

    The rendered file (ironic.conf) will only include
    [glance].swift_temp_url_duration when it's overriden by the
    operator, otherwise the upstream's default is used by omitting the key
    defition.

    [0] https://docs.openstack.org/ironic/latest/admin/interfaces/deploy.html#direct-deploy

    Closes-Bug: #1997581
    Change-Id: I6a4a582b199e5137a9e5fd185882acec20424f3a
    (cherry picked from commit 0cd0912668675128dd1d838783c8e452465f7a3b)
    (cherry picked from commit dbba6ae1d3e477282846f3190affea36b341fca1)

tags: added: in-stable-yoga
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ironic-conductor (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/charm-ironic-conductor/+/872440
Committed: https://opendev.org/openstack/charm-ironic-conductor/commit/f5787906ac185a13fa474d343ad1441215fd6b29
Submitter: "Zuul (22348)"
Branch: stable/xena

commit f5787906ac185a13fa474d343ad1441215fd6b29
Author: Felipe Reyes <email address hidden>
Date: Thu Jan 19 11:17:12 2023 -0300

    Add config option swift-temp-url-duration

    This config option allows operators to increase the duration of
    temporary URLs handed to ironic-python-agent to download images when
    using the 'direct' deployment interface[0].

    The rendered file (ironic.conf) will only include
    [glance].swift_temp_url_duration when it's overriden by the
    operator, otherwise the upstream's default is used by omitting the key
    defition.

    [0] https://docs.openstack.org/ironic/latest/admin/interfaces/deploy.html#direct-deploy

    Closes-Bug: #1997581
    Change-Id: I6a4a582b199e5137a9e5fd185882acec20424f3a
    (cherry picked from commit 0cd0912668675128dd1d838783c8e452465f7a3b)
    (cherry picked from commit dbba6ae1d3e477282846f3190affea36b341fca1)
    (cherry picked from commit 22998bb693cfceaa485dc23c607f89e6f267da56)

tags: added: in-stable-xena
tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ironic-conductor (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/charm-ironic-conductor/+/872308
Committed: https://opendev.org/openstack/charm-ironic-conductor/commit/31ad9aeceb0c005279f63fd830b5341e39e12fa1
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 31ad9aeceb0c005279f63fd830b5341e39e12fa1
Author: Felipe Reyes <email address hidden>
Date: Thu Jan 19 11:17:12 2023 -0300

    Add config option swift-temp-url-duration

    This config option allows operators to increase the duration of
    temporary URLs handed to ironic-python-agent to download images when
    using the 'direct' deployment interface[0].

    The rendered file (ironic.conf) will only include
    [glance].swift_temp_url_duration when it's overriden by the
    operator, otherwise the upstream's default is used by omitting the key
    defition.

    [0] https://docs.openstack.org/ironic/latest/admin/interfaces/deploy.html#direct-deploy

    Closes-Bug: #1997581
    Change-Id: I6a4a582b199e5137a9e5fd185882acec20424f3a
    (cherry picked from commit 0cd0912668675128dd1d838783c8e452465f7a3b)
    (cherry picked from commit dbba6ae1d3e477282846f3190affea36b341fca1)
    (cherry picked from commit 22998bb693cfceaa485dc23c607f89e6f267da56)

tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ironic-conductor (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/charm-ironic-conductor/+/872309
Committed: https://opendev.org/openstack/charm-ironic-conductor/commit/9c83c508fc29a43bbb9fee43d3efe4edbf8e7cd1
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 9c83c508fc29a43bbb9fee43d3efe4edbf8e7cd1
Author: Felipe Reyes <email address hidden>
Date: Thu Jan 19 11:17:12 2023 -0300

    Add config option swift-temp-url-duration

    This config option allows operators to increase the duration of
    temporary URLs handed to ironic-python-agent to download images when
    using the 'direct' deployment interface[0].

    The rendered file (ironic.conf) will only include
    [glance].swift_temp_url_duration when it's overriden by the
    operator, otherwise the upstream's default is used by omitting the key
    defition.

    [0] https://docs.openstack.org/ironic/latest/admin/interfaces/deploy.html#direct-deploy

    Closes-Bug: #1997581
    Change-Id: I6a4a582b199e5137a9e5fd185882acec20424f3a
    (cherry picked from commit 0cd0912668675128dd1d838783c8e452465f7a3b)
    (cherry picked from commit dbba6ae1d3e477282846f3190affea36b341fca1)
    (cherry picked from commit 22998bb693cfceaa485dc23c607f89e6f267da56)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ironic-conductor (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/charm-ironic-conductor/+/872450
Committed: https://opendev.org/openstack/charm-ironic-conductor/commit/aadb3d974be9a8372abb1c15e3b353e192593f44
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit aadb3d974be9a8372abb1c15e3b353e192593f44
Author: Felipe Reyes <email address hidden>
Date: Thu Jan 19 11:17:12 2023 -0300

    Add config option swift-temp-url-duration

    This config option allows operators to increase the duration of
    temporary URLs handed to ironic-python-agent to download images when
    using the 'direct' deployment interface[0].

    The rendered file (ironic.conf) will only include
    [glance].swift_temp_url_duration when it's overriden by the
    operator, otherwise the upstream's default is used by omitting the key
    defition.

    [0] https://docs.openstack.org/ironic/latest/admin/interfaces/deploy.html#direct-deploy

    Closes-Bug: #1997581
    Change-Id: I6a4a582b199e5137a9e5fd185882acec20424f3a
    (cherry picked from commit 0cd0912668675128dd1d838783c8e452465f7a3b)
    (cherry picked from commit dbba6ae1d3e477282846f3190affea36b341fca1)
    (cherry picked from commit 22998bb693cfceaa485dc23c607f89e6f267da56)

Felipe Reyes (freyes)
Changed in charm-ironic-conductor:
status: Fix Committed → Fix Released
Felipe Reyes (freyes)
no longer affects: charm-ironic-api/ussuri
no longer affects: charm-ironic-api/victoria
no longer affects: charm-ironic-api/wallaby
no longer affects: charm-ironic-api/xena
no longer affects: charm-ironic-api/yoga
no longer affects: charm-ironic-api/zed
Changed in charm-ironic-conductor:
status: Fix Released → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-guide (master)
Changed in charm-guide:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-guide (master)

Reviewed: https://review.opendev.org/c/openstack/charm-guide/+/880243
Committed: https://opendev.org/openstack/charm-guide/commit/871df7f062be15fa388855f11b1a0f690bcd7f2a
Submitter: "Zuul (22348)"
Branch: master

commit 871df7f062be15fa388855f11b1a0f690bcd7f2a
Author: Felipe Reyes <email address hidden>
Date: Wed Apr 12 16:49:30 2023 -0400

    ironic-conductor charm: Temporary url timeout

    Closes-Bug: #1997581
    Change-Id: Ib13508008c26eecd63ac57e31fa7d8176f3ee444

Changed in charm-guide:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.