Comment 18 for bug 1640518

Andrew Morrow (acmorrow) wrote :

OK, I upgraded valgrind to 3.12 on the POWER machine and I can now get it to run meaningfully. We are seeing many error reports of the following form:

[js_test:fsm_all_sharded_replication] 2016-11-10T16:19:58.396+0000 s40019| ==34604== Thread 50:
[js_test:fsm_all_sharded_replication] 2016-11-10T16:19:58.396+0000 s40019| ==34604== Invalid read of size 2
[js_test:fsm_all_sharded_replication] 2016-11-10T16:19:58.396+0000 s40019| ==34604== at 0x4F2AD20: __lll_unlock_elision (elision-unlock.c:36)
[js_test:fsm_all_sharded_replication] 2016-11-10T16:19:58.396+0000 s40019| ==34604== by 0x4F1DB07: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:64)
[js_test:fsm_all_sharded_replication] 2016-11-10T16:19:58.396+0000 s40019| ==34604== by 0x4F1DB07: pthread_mutex_unlock (pthread_mutex_unlock.c:314)

Or

[js_test:fsm_all_sharded_replication] 2016-11-10T16:20:43.998+0000 s40019| ==34604== Invalid write of size 2
[js_test:fsm_all_sharded_replication] 2016-11-10T16:20:43.998+0000 s40019| ==34604== at 0x4F2AD30: __lll_unlock_elision (elision-unlock.c:37)
[js_test:fsm_all_sharded_replication] 2016-11-10T16:20:43.998+0000 s40019| ==34604== by 0x4F1DB07: __pthread_mutex_unlock_usercnt (pthread_mutex_unlock.c:64)
[js_test:fsm_all_sharded_replication] 2016-11-10T16:20:43.998+0000 s40019| ==34604== by 0x4F1DB07: pthread_mutex_unlock (pthread_mutex_unlock.c:314)
[js_test:fsm_all_sharded_replication] 2016-11-10T16:20:43.999+0000 s40019| ==34604== by 0xD803C7: operator()<const mongo::executor::TaskExecutor::RemoteCommandCallbackArgs&, long unsigned int&, void> (functional:600)

In all cases, the invalid read or write appears to target a freed block. Frequently, the reported address has an odd-looking alignment ('Address 0x...e'). So, this is very interesting.

Another engineer and I took a close look at one of these instances, and we do not believe there is any way that the mutex could be accessed after it was deleted.
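For reference, here is a minimal sketch (hypothetical code, not ours) of one POSIX-legal pattern that I understand can produce exactly this kind of report under lock elision: the last user of a mutex destroys and frees it immediately after observing, under the lock, that the other thread is done, while the other thread may still be executing the tail of pthread_mutex_unlock.

```c
#include <pthread.h>
#include <stdlib.h>

struct pkt { pthread_mutex_t mtx; int done; };

static void *worker(void *arg) {
    struct pkt *p = arg;
    pthread_mutex_lock(&p->mtx);
    p->done = 1;
    /* The instant this unlock makes the mutex available, the main thread
     * can acquire it, see done == 1, and destroy/free the mutex, even
     * though this thread may still be inside pthread_mutex_unlock. If the
     * elision unlock path touches the mutex after releasing it, that read
     * lands in freed memory, which is what valgrind would flag. */
    pthread_mutex_unlock(&p->mtx);
    return NULL;
}

static int demo(void) {
    struct pkt *p = malloc(sizeof *p);
    if (!p) return -1;
    pthread_mutex_init(&p->mtx, NULL);
    p->done = 0;

    pthread_t t;
    if (pthread_create(&t, NULL, worker, p) != 0) { free(p); return -1; }

    for (;;) {                        /* wait for the worker's store */
        pthread_mutex_lock(&p->mtx);
        int d = p->done;
        pthread_mutex_unlock(&p->mtx);
        if (d) break;
    }
    pthread_mutex_destroy(&p->mtx);   /* POSIX-legal: mutex is unlocked */
    free(p);
    pthread_join(t, NULL);
    return 0;
}
```

If glibc's __lll_unlock_elision reads the mutex after the release point, a pattern like this would report an invalid read even though the application's use of the mutex is correct.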

Is there a way to disable the libc lock elision code (an environment variable or other similar setting)? We would like to see whether these reports persist after disabling lock elision. If they do, then the problem is almost certainly a logic error in our code that we are missing. If, on the other hand, the valgrind reports go away with lock elision disabled, that would be evidence that lock elision is at fault for the stack corruption we are observing, at which point I would retry our original repro.