Bug #1036318 “OPAC timeout within the client” : Bugs : Evergreen

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-10-17:

#1

We have had reports of this from our staff, complete with screen shots. However, I have been unable to reproduce the reported behavior.

Basically, what we've had reported is that staff are sometimes prompted with a tpac login screen when trying to place a hold in the staff client.

Yesterday, I tried reproducing this in production and in development and was unsuccessful. I would leave my staff client sessions logged in and retry the tpac functions at increasing intervals of 10, 15, 20, and 30 minutes. Not once was I prompted to login. On development, I even left the client with the tpac screen up and did nothing else.

There have been some issues with different time outs in the past when staff use operator change in the client. Are your staff using this, by chance? (I thought that this operator change timeout might have been fixed, but I am not 100% sure.)

Jason

Revision history for this message

Tim Spindler (tspindler-cwmars) wrote on 2012-10-17:

#2

Jason,

I know at least internally, we have had it happen on staff who are not switching users so I know that at least in all instances it does not happen when changing users but I do not know enough to say that it may not be a component. Again, this is in 2.2.0 so I'm not sure if other work may have fixed this issue in later versions.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-10-17:

#3

Our reports have been with recent branches based on master.

I'm doing an experiment today to see if having more than 1 tab open at a time affects it. I'm leaving search results open in a tab that I'm not touching for 4 hours to see if that tpac session timeouts separately from the rest of my staff client session.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2012-12-13:

#4

Since it seems to happen sporadically and only on our busy production server, I wonder if it isn't memcached starvation.

We should devise a test for that condition, perhaps using a very small memcached setting.

Changed in evergreen:
milestone:	none → 2.4.0-alpha
status:	New → Confirmed
importance:	Undecided → Medium

Ben Shum (bshum) on 2013-03-03

Changed in evergreen:
milestone:	2.4.0-alpha1 → 2.4.0-beta

Ben Shum (bshum) on 2013-03-17

Changed in evergreen:
milestone:	2.4.0-beta → 2.4.0-rc

Revision history for this message

Robert J Jackson (rjackson-deactivatedaccount) wrote on 2013-04-02:

#5

Evergreen Indiana consortium is running 2.2 and we are receiving sporadic reports of this issue. It involves the staff attempting to place a hold for a patron within the staff client and receiving the TPAC login screen when the Search is initiated. Log files do not contain anything useful regarding the issue. Latest reported incident there were 0 log entries for the time frame captured on screen print showing the login screen.

Ben Shum (bshum) on 2013-04-27

summary:	- OPAC timeout withn the client + OPAC timeout within the client
Changed in evergreen:
milestone:	2.4.0-rc → none

Revision history for this message

Michele Morgan (mmorgan) wrote on 2013-05-09:

#6

Just to add our experience, hoping the more info is out there, the better.

We have had several libraries reporting this on our 2.2.8 system. The unpredictable timeouts impact service to patrons since the staff member needs to take the time to re-login to the client.

We've also been unable to pin down the exact cause of this behavior, libraries usually notice that the opac within the client has timed out because they no longer see item barcode links within the client, and they are not able to place holds for patrons, instead being presented with the patron login screen. This does not happen at predictable intervals.

I did some testing with timeouts on our training server running 2.3.6. I set the OPAC inactivity timeout to 10 seconds. I did not see the 10 second timeout affect the opac in the staff client at all. It really doesn't seem like this loss of opac authentication in the staff client is due to any of the timeout settings. Maybe some sequence of actions on the workstation causes the authentication to be dropped. I have seen it happen on my own workstation at central site on occasion, but have not been able to link it to a cause. I did capture a screencast of tabs I had open on one occasion:

http://screencast.com/t/KZ2pb5EUWbl

We've asked staff members at our libraries to provide more info on workstation activity.

Revision history for this message

Galen Charlton (gmc) wrote on 2013-05-17:

#7

One suggestion for finding occurrences of this is to configure Apache to log user agents (e.g., by using the "combined" log format rather than "common"). Once that's done, the following grep should turn them up:

grep 'GET /eg/opac/login?redirect_to=%2Feg%2Fopac.*oils_xulrunner' {apache_access_log}

Revision history for this message

Galen Charlton (gmc) wrote on 2013-05-17:

#8

One thing that would be very useful to know: if a staff member is presented with the TPac login prompt, what happens if they immediately open a new staff client tab and do (say) a patron search?

If the staff client informs them that their session has expired and prompts them to enter their credentials, that would indicate that something went wrong with the session -- for example, memcached starvation as Jason suggested might result in a session key getting evicted.

If the staff client lets the user proceed with the patron search without prompting for credentials, that would indicate that the staff session is OK, and would suggest (among other possibilities) that the something went wrong with the OPAC 'ses' cookie.

Something else that would help when this happens is opening the Javascript Console in a new tab, then evaluating this:

var consoleService = Components.classes['@mozilla.org/consoleservice;1'].getService(Components.interfaces.nsIConsoleService);
var cookieMgr = Components.classes["@mozilla.org/cookiemanager;1"].getService(Components.interfaces.nsICookieManager);
for (var e = cookieMgr.enumerator; e.hasMoreElements();) {
var cookie = e.getNext().QueryInterface(Components.interfaces.nsICookie);
if (cookie.name == 'ses') consoleService.logStringMessage("Found OPAC ses cookie: " +cookie.host + ";" + cookie.name + "=" + cookie.value + "\n");
}
var data_cache = Components.classes["@open-ils.org/openils_data_cache;1"].getService();
var ses = data_cache.wrappedJSObject.data.session;
consoleService.logStringMessage("Staff client session is " + ses.key + " (login type = " + ses.login_type + "; permanent: " + ses.is_perm + ")");

This will spit back (in the console log) a list of what the staff clients considers to be its user session, both from the cookie used for the embedded OPAC and the staff client's stash.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2013-05-21:

#9

Mainly for jeff_, here's a screenshot.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2013-05-21:

#10

screen shot Edit (154.8 KiB, image/png)

Here's a smaller screen shot with slightly better redaction.

Revision history for this message

Simon Mai (simonmai) wrote on 2013-07-11:

#11

Hi,
Our libs have been getting this issue also. I "played" around and have been able to reproduce the reported behavior as below steps:
1. Go to Local Setting Editor, define the Security: Staff Inactivity Timeout : 18
2. Close the client and log off
3. Login client again
4. Patron Search/Search a patron/Holds/Place Hold/Search a book (Don't click place hold)
5. Wait ... (count down 18, 17, 16 ..... 1)
6. The result screen will be returned to the homepage
7. Search the book again, and click Place Hold. At this time, the patron login screen will be showed up !!!

My opinion: this issue relates to Staff client session. It happens when we place a hold in the staff client getting timeout (expired session). For some reason, the patron login screen is showed up, instead of the staff client 's prompt screen that prompts staff to enter their credentials.

A quick solution for now: I will setup the "Security: Staff Inactivity Timeout" as 99999999.... to make sure the client unable to get timeout. Let's see if this issue happen again or not.
PS: I'm using 2.3.6 client for testing out.

Revision history for this message

Simon Mai (simonmai) wrote on 2013-07-11:

#12

expire.jpg Edit (513.7 KiB, image/jpeg)

screen shot

Ben Shum (bshum) on 2013-08-22

no longer affects:

evergreen/2.2

Revision history for this message

Beth Longwell (blongwel) wrote on 2013-09-26:

#13

We are on Evergreen 2.4.0 and two sites have reported this issue within the last two days. A restart of the session resolves the issue. One site indicated that they have to restart 3 times a day because of the login prompt they are getting when trying to place a patron hold.

Revision history for this message

Kate Butler (kreb) wrote on 2013-10-09:

#14

We have been getting this issue, but it seems to only happen when someone uses the operator change from within the staff client and chooses the 'temporary' setting. When that session times out, it does not restore the previous login credentials for the opac.

Revision history for this message

Jeff Godin (jgodin) wrote on 2013-11-05:

#15

Looking at a recent report of this on a 2.2 system, the staff user logs in around 9 AM, is active until early afternoon, then in the later afternoon after a few hours of inactivity, the staff client presents the auth token it received earlier that morning, and the session is not found in memcached.

In the staff client, activity was taking place mostly in the Item Status interface, and the staff member received the usual prompt to re-authenticate.

The period of inactive time in the above example was shorter than the value of Staff Login Inactivity Timeout.

From the data available to me at present, this seems like memcached eviction due to memory exhaustion within the slab for items of this size.

On the staff client side of things, a re-auth works as expected. TPAC may need to be taught how to re-auth for staff, as other comments have suggested.

The long-term solution may be database-backed sessions, where memcached is used for caching but where an evicted session key may fall back to the (slower) database lookup to refresh the cache.

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2013-11-06:

#16

Jeff Godin's monologue from IRC today:

(10:02:44 AM) jeff: csharp: if you can, i'd start with finding the "successful login" log entry for the user + workstation in question for the day that they had the problem, then look at log entries having that auth token.
(10:03:14 AM) jeff: in your case, it sounds like it will be a little noisy due to their xul interfaces still working.
(10:04:09 AM) jeff: oh.
(10:04:20 AM) jeff: hey, this might be it.
(10:04:30 AM) jeff: how long is your staff login timeout?
(10:05:26 AM) jeff: can you trace apache access logs back to individual machines, or are they all behind NAT with a bunch of other machines?
(10:05:53 AM) RoganH left the room (quit: Ping timeout: 265 seconds).
(10:06:18 AM) jeff: say your staff login timeout is 3600 seconds. tpac pages will all get a meta refresh tag that redirects to the logout page after 3600 seconds.
(10:06:32 AM) jeff: if they have a tab open and idle for an hour, it'll log their tpac session out, killing the cookie.
(10:07:04 AM) jeff: even if they have been active in another tpac tab during the meantime.
(10:09:40 AM) jeff: so if you can go from individual machine to apache log entries, i'll bet you see a logout.
(10:09:53 AM) jeff: (a tpac logout url request from the client)
(10:10:29 AM) jeff: if your timeout is so long that they couldn't possibly have had a tpac tab open and idle for that long after logging in, then that's probably not it.
(10:11:11 AM) jeff: arguably, staff tpac pages shouldn't have that meta refresh, and also shouldn't have a logout link.

The above appears to be what is happening in the majority of our cases.

To put it succinctly, staff have a tpac search tab open and idle for more than the staff activity timeout. When that refresh kicks in, the cookie that holds their credentials for the TPAC session within the staff client gets killed. However, the actual authtoken used by the staff client still works because its expiration is reset as they continue to use the staff client. This would explain why the tpac sessions within the staf client ask for a login but the staff client doesn't.

One way of mitigating this would be to remove the refresh from tpac pages loaded in the staff client. I'll throw up a branch in a few that attempts to do this.

I don't want to steal Jeff's thunder or speak for him, but I believe he is also looking into a more comprehensive fix that involves adjusting or not using the cookie in the staff client.

Inevitably, though, I think losing the refresh on the tpac pages in the staff client is part of the solution, so doing so won't be wasted effort.

Jeff Godin's monologue from IRC today:

(10:02:44 AM) jeff: csharp: if you can, i'd start with finding the "successful login" log entry for the user + workstation in question for the day that they had the problem, then look at log entries having that auth token.
(10:03:14 AM) jeff: in your case, it sounds like it will be a little noisy due to their xul interfaces still working.
(10:04:09 AM) jeff: oh.
(10:04:20 AM) jeff: hey, this might be it.
(10:04:30 AM) jeff: how long is your staff login timeout?
(10:05:26 AM) jeff: can you trace apache access logs back to individual machines, or are they all behind NAT with a bunch of other machines?
(10:05:53 AM) RoganH left the room (quit: Ping timeout: 265 seconds).
(10:06:18 AM) jeff: say your staff login timeout is 3600 seconds. tpac pages will all get a meta refresh tag that redirects to the logout page after 3600 seconds.
(10:06:32 AM) jeff: if they have a tab open and idle for an hour, it'll log their tpac session out, killing the cookie.
(10:07:04 AM) jeff: even if they have been active in another tpac tab during the meantime.
(10:09:40 AM) jeff: so if you can go from individual machine to apache log entries, i'll bet you see a logout.
(10:09:53 AM) jeff: (a tpac logout url request from the client)
(10:10:29 AM) jeff: if your timeout is so long that they couldn't possibly have had a tpac tab open and idle for that long after logging in, then that's probably not it.
(10:11:11 AM) jeff: arguably, staff tpac pages shouldn't have that meta refresh, and also shouldn't have a logout link.

The above appears to be what is happening in the majority of our cases.

To put it succinctly, staff have a tpac search tab open and idle for more than the staff activity timeout. When that refresh kicks in, the cookie that holds their credentials for the TPAC session within the staff client gets killed. However, the actual authtoken used by the staff client still works because its expiration is reset as they continue to use the staff client. This would explain why the tpac sessions within the staf client ask for a login but the staff client doesn't.

One way of mitigating this would be to remove the refresh from tpac pages loaded in the staff client. I'll throw up a branch in a few that attempts to do this.

I don't want to steal Jeff's thunder or speak for him, but I believe he is also looking into a more comprehensive fix that involves adjusting or not using the cookie in the staff client.

Inevitably, though, I think losing the refresh on the tpac pages in the staff client is part of the solution, so  doing so won't be wasted effort.

Revision history for this message

Jeff Godin (jgodin) wrote on 2013-11-06:

#17

To add a little to what Jason pasted and stated (thanks!) in comment 16:

There are two scenarios for symptoms here. The first is more common, and more closely matches the original reports in this bug:

In a staff client with at least one tab open to the TPAC catalog interface, all existing and new TPAC interfaces lose their login session, while other interfaces such as checkin / checkout / item status remain functional, and do not prompt the user to re-authenticate. The most common means of recovery here is logging out and then back into the staff client. At least one of the TPAC tabs will be at the default /eg/opac/home URL.

This is demonstrated well by the screencast included by Michele Morgan in comment 6, above: http://screencast.com/t/KZ2pb5EUWbl -- tabs 1 and 4 seem to have redirected to the TPAC "home" page.

This seems most likely to be caused by the following:

Given a default auth.staff_timeout value of 7200 seconds (2 hours), any TPAC page in the staff client includes a meta http-equiv refresh to the TPAC logout URL, timed for 7200 seconds. If you have a TPAC tab open in the background sitting idle, no matter what you're doing in other tabs, the idle tab eventually reaches the end of that 7200 second timer, and loads the TPAC logout page. This clears the cookies in the staff client, breaking any browser interfaces which rely on the cookie to authenticate (such as TPAC). The logout process clears the cookie, then redirects to the TPAC "home" page.

The fix: Don't include the refresh in TPAC when TPAC is being used within the staff client. This is a one-line template change, working branch located at user/jeff/lp1036318_opac_timeout_in_staff_client

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/jeff/lp1036318_opac_timeout_in_staff_client

The second scenario for symptoms is where both the TPAC and non-TPAC interfaces in the staff client stop working, and while the TPAC interface does not prompt the user to re-authenticate, the XUL interfaces do prompt to re-authenticate, and after successful re-authentication any new or existing TPAC tabs are once more logged in (after a reload or new page navigation in the case of existing TPAC tabs).

The second scenario seems most likely to be caused by early memcached eviction of the user's session -- OR a normal expiry of the user's session. Better handling of session expiry in TPAC, and increasing session durability in the face of memcached restarts or slab memory exhaustion might be distinct bugs.

Thanks to everyone who added thoughts to this bug, and thanks especially to Chris Sharp and Jason Stephenson for helping in confirming symptoms and testing fixes.

I'll open a new wishlist bug on the session durability issue.

Shall we also have a bug on enabling TPAC to re-authenticate after session timeout, or leave that issue as a part of this bug?

Either way, the change in the working branch mentioned above should stand on its own, and can be reviewed and signed off by any interested party -- Jason?

To add a little to what Jason pasted and stated (thanks!) in comment 16:

There are two scenarios for symptoms here. The first is more common, and more closely matches the original reports in this bug:

In a staff client with at least one tab open to the TPAC catalog interface, all existing and new TPAC interfaces lose their login session, while other interfaces such as checkin / checkout / item status remain functional, and do not prompt the user to re-authenticate. The most common means of recovery here is logging out and then back into the staff client. At least one of the TPAC tabs will be at the default /eg/opac/home URL.

This is demonstrated well by the screencast included by Michele Morgan in comment 6, above: http://screencast.com/t/KZ2pb5EUWbl -- tabs 1 and 4 seem to have redirected to the TPAC "home" page.

This seems most likely to be caused by the following:

Given a default auth.staff_timeout value of 7200 seconds (2 hours), any TPAC page in the staff client includes a meta http-equiv refresh to the TPAC logout URL, timed for 7200 seconds. If you have a TPAC tab open in the background sitting idle, no matter what you're doing in other tabs, the idle tab eventually reaches the end of that 7200 second timer, and loads the TPAC logout page. This clears the cookies in the staff client, breaking any browser interfaces which rely on the cookie to authenticate (such as TPAC). The logout process clears the cookie, then redirects to the TPAC "home" page.

The fix: Don't include the refresh in TPAC when TPAC is being used within the staff client. This is a one-line template change, working branch located at user/jeff/lp1036318_opac_timeout_in_staff_client

http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/jeff/lp1036318_opac_timeout_in_staff_client

The second scenario for symptoms is where both the TPAC and non-TPAC interfaces in the staff client stop working, and while the TPAC interface does not prompt the user to re-authenticate, the XUL interfaces do prompt to re-authenticate, and after successful re-authentication any new or existing TPAC tabs are once more logged in (after a reload or new page navigation in the case of existing TPAC tabs).

The second scenario seems most likely to be caused by early memcached eviction of the user's session -- OR a normal expiry of the user's session. Better handling of session expiry in TPAC, and increasing session durability in the face of memcached restarts or slab memory exhaustion might be distinct bugs.

Thanks to everyone who added thoughts to this bug, and thanks especially to Chris Sharp and Jason Stephenson for helping in confirming symptoms and testing fixes.

I'll open a new wishlist bug on the session durability issue.

Shall we also have a bug on enabling TPAC to re-authenticate after session timeout, or leave that issue as a part of this bug?

Either way, the change in the working branch mentioned above should stand on its own, and can be reviewed and signed off by any interested party -- Jason?

tags:

added: pullrequest

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2013-11-06:

#18

I agree that the session durability should be a separate bug.

I also think that re-authentication of the tpac/staff client when the staff client session has completely timed out should be another ticket. I've never really experienced it that much until today and there are a lot of things not to like about how it works, even when restricted to the staff client interface without tpac involved.

I'll take some more time to review the code in this branch, but I'm already convinced it is ready to be committed.

Do we need some kind of unit/integration test for this? If so, what would that look like?

Revision history for this message

Jason Stephenson (jstephenson) wrote on 2013-11-07:

#19

It's a tiny change, and it cherry-picks cleanly to 2.3 and 2.4, so in she goes!

Thanks, Jeff and everyone else who worked on this problem by sharing information, etc.

Changed in evergreen:
status:	Confirmed → Fix Committed
milestone:	none → 2.5.0

Dan Wells (dbw2) on 2013-11-11

Changed in evergreen:
status:	Fix Committed → Fix Released

Evergreen

OPAC timeout within the client

Bug Description

Duplicates of this bug

Other bug subscribers

Bug attachments

Remote bug watches

	Status	Importance	Assigned to	Milestone
Evergreen	Fix Released	Medium	Unassigned	Evergreen 2.5.0
2.3	Fix Released	Medium	Unassigned	Evergreen 2.3.12
2.4	Fix Released	Medium	Unassigned	Evergreen 2.4.4