Netapp: dfm lun list refresh not up-to-date
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Cinder | Fix Released | Undecided | Ben Swartzlander | | |
Bug Description
Short summary:
When creating a volume from a snapshot, the operation may fail with the error "No entry in LUN table for volume". The LUN representing the volume was created correctly on the filer, but DFM did not refresh its LUN list in time, which causes the error.
Detailed description:
Openstack Netapp driver is calling DFM apis to manage LUNs. If the SOAP call goes directly to DFM (like create_volume), then DFM will correctly update its internal LUN list. However, in some cases (like create_
Consequences of which is that 'create_
Versions:
* Openstack Netapp driver from latest Folsom branch
* DFM version 5.1
* OnTap: 7.3.6P5
Notes:
* After several attempts to refresh the DFM LUN list, the new LUN eventually appears, but it is quite unpredictable when.
* This issue applies to 7-mode. The code for Cluster mode appears to be different.
* If DFM is supposed to return an up-to-date list once the refresh has completed, then this is a DFM problem (because it does not). Otherwise, it should be handled in the driver.
Symptoms:
* The 'lun show' command on the filer shows the up-to-date list.
* 'dfm lun list' shows DFM's internal LUN list, which may be "behind" and miss recently created LUNs.
Steps to reproduce:
I have written a simple script (attached) which imitates the driver's code for refreshing the LUN list.
1) Inside the script, update DFM's API URL and credentials (same as in nova.conf)
2) run interactively on the host with access to DFM
# python -i /tmp/lun_refresh.py
2013-01-02 17:04:18,524 INFO Soap client init..
2013-01-02 17:04:22,444 INFO Soap client init done.
2013-01-02 17:04:23,523 DEBUG Discovered 3 datasets and 5 LUNs
>>>
>>> ds = d._get_
>>> host_id = '<filers-hostname>'
3) on Netapp filer, create 'test2' LUN manually
> lun create -s 1g -t linux /vol/OpenStack_
4) switch to python cli and run refresh
>>> d._refresh_
2013-01-02 17:04:59,697 INFO Starting refresh..
2013-01-02 17:05:14,782 INFO calling TimestampList
2013-01-02 17:05:29,884 INFO calling TimestampList
2013-01-02 17:05:44,987 INFO calling TimestampList
2013-01-02 17:06:00,087 INFO calling TimestampList
2013-01-02 17:06:15,187 INFO calling TimestampList
2013-01-02 17:06:15,272 INFO Finished refresh..
2013-01-02 17:06:15,529 INFO DFM lun refresh FOUND volume "test2"
(GOOD case, new lun is there after first refresh..)
5) repeat steps 3) and 4) for other test LUNs
Netapp:
> lun create -s 1g -t linux /vol/OpenStack_
python cli:
>>> d._refresh_
2013-01-02 17:06:28,025 INFO Starting refresh..
2013-01-02 17:06:43,094 INFO calling TimestampList
2013-01-02 17:06:43,179 INFO Finished refresh..
2013-01-02 17:06:43,438 DEBUG DFM lun refresh did not return volume "test3" (1/3)
2013-01-02 17:06:43,438 INFO Starting refresh..
2013-01-02 17:06:58,512 INFO calling TimestampList
2013-01-02 17:06:58,597 INFO Finished refresh..
2013-01-02 17:06:58,865 DEBUG DFM lun refresh did not return volume "test3" (2/3)
2013-01-02 17:06:58,865 INFO Starting refresh..
2013-01-02 17:07:13,946 INFO calling TimestampList
2013-01-02 17:07:14,032 INFO Finished refresh..
2013-01-02 17:07:14,290 INFO DFM lun refresh FOUND volume "test3"
(BAD case, first two DFM lun refreshes did NOT return newly created LUN, third one worked)
Dirty workaround:
Just repeat the DFM LUN refresh several times and stop when the new LUN appears in the list. This is what '_refresh_
I will probably create a proper patch against Folsom later on. The logic will be the same as in the attached script (minus the DEBUG messages). However, I believe this workaround is quite lame: repeating the same operation N times still does not guarantee that it will work. I have seen cases where the DFM LUN list stayed out of date for minutes (!) even when repeatedly calling refresh via SOAP and/or 'dfm lun discover ...'. It would be nice if somebody could look into DFM itself to find out why the refresh does not work.
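A minimal sketch of the retry logic described above, not the attached script itself. The `refresh` and `list_luns` callables are hypothetical stand-ins for the SOAP refresh call and the LUN listing; the attempt count and delay are assumptions as well.

```python
import time


def wait_for_lun(refresh, list_luns, lun_name, attempts=3, delay=0.0,
                 sleep=time.sleep):
    """Refresh up to `attempts` times until `lun_name` shows up.

    `refresh` and `list_luns` are hypothetical callables supplied by the
    caller; they are not part of the driver or of any DFM API.
    Returns the 1-based attempt on which the LUN appeared, or None.
    """
    for attempt in range(1, attempts + 1):
        refresh()
        if lun_name in list_luns():
            return attempt
        if attempt < attempts:
            # Pause before asking DFM to refresh again.
            sleep(delay)
    return None
```

As noted above, a bounded retry only narrows the window; a `None` result still has to be handled by the caller.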
Regards,
Brano Zarnovican
affects: | nova → cinder |
tags: | added: driver |
Changed in cinder: | |
assignee: | nobody → Rushi Agrawal (rushiagr) |
Changed in cinder: | |
assignee: | Rushi Agrawal (rushiagr) → John Griffith (john-griffith) |
Changed in cinder: | |
assignee: | John Griffith (john-griffith) → Ben Swartzlander (bswartz) |
Changed in cinder: | |
milestone: | none → grizzly-3 |
status: | Fix Committed → Fix Released |
Changed in cinder: | |
milestone: | grizzly-3 → 2013.1 |
tags: | added: essex-backport-netapp-potential folsom-backport-netapp-potential |
tags: | added: navneet-netapp-backport |
tags: | removed: essex-backport-netapp-potential |
More info about the DFM refresh problem..
1) A request to refresh the LUN list is ignored if it is issued too "close" to the end of the previous refresh. The grace period seems to be around a minute. During this period DFM returns the timestamp of the previous monitor execution. Currently, the driver runs DfmObjectRefresh, but it does not check whether the timestamp is in the past, only that it is non-zero.
https://github.com/openstack/nova/blob/stable/folsom/nova/volume/netapp.py#L870
Fix/workaround: Repeat DfmObjectRefresh requests until the returned timestamp is later than the first execution of the refresh.
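The timestamp check could look roughly like the sketch below. `request_refresh` and `get_timestamp` are hypothetical callables wrapping the refresh request and the monitor timestamp lookup. Taking the timestamp read before the first request as the baseline is an assumption made here so that the host and DFM clocks need not agree. The 15-second poll interval mirrors the spacing seen in the logs above.

```python
import time


def refresh_until_newer(request_refresh, get_timestamp, max_attempts=10,
                        poll_interval=15.0, sleep=time.sleep):
    """Re-issue the refresh until the monitor timestamp moves forward.

    Both callables are hypothetical wrappers, not real driver methods.
    Returns the new timestamp, or raises TimeoutError if it never advances.
    """
    # Timestamp of the last monitor run before we asked for anything.
    baseline = get_timestamp()
    for _ in range(max_attempts):
        request_refresh()
        sleep(poll_interval)
        current = get_timestamp()
        # A non-zero but unchanged timestamp means the request was ignored.
        if current > baseline:
            return current
    raise TimeoutError("monitor timestamp did not advance past %r" % (baseline,))
```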
2) (Theory) Running the 'lun' monitor will not refresh the LUN list if the new LUN is inside a qtree that has not been discovered yet. Even if you run DfmObjectRefresh(.., ChildType='lun_path') multiple times and let it finish correctly, the new LUN still won't appear. It looks like the qtree (and its LUN) will appear only after the 'file_system' monitor has been executed. This monitor is NOT triggered by the ChildType='lun_path' parameter.
Fix/workaround: Explicitly trigger both 'file_system' and 'lun' monitors.
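A sketch of that ordering, assuming a hypothetical `run_monitor(host_id, monitor_name)` callable that triggers one monitor and blocks until it has finished. How the 'file_system' monitor is actually requested through the API is not shown here and would need to be checked against DFM.

```python
def refresh_filesystem_then_luns(run_monitor, host_id,
                                 monitors=("file_system", "lun")):
    """Trigger each monitor in order and return the names that completed.

    `run_monitor` is a hypothetical callable, not a DFM or driver API.
    'file_system' runs first so that new qtrees are known before the
    'lun' monitor looks for LUNs inside them (per the theory above).
    """
    completed = []
    for name in monitors:
        run_monitor(host_id, name)
        completed.append(name)
    return completed
```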