Alternatively, one could use the GCC intrinsics (the __sync_* builtins). They are more conservative about memory barriers, which I believe is more correct in any case: it is not safe to let the compiler or the instruction scheduler move memory accesses into the ldrex/strex critical region. Memory barriers aside, the intrinsic-based implementation should be equally fast. (Although there is no equivalent of fetchAndStore, the only actual use cases store 0, and I've special-cased that in this implementation using __sync_fetch_and_and().)
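
For illustration, here is a minimal sketch of what the intrinsic-based operations might look like. The helper names (fetch_and_add, test_and_set, fetch_and_store_zero) are hypothetical and are not taken from the actual patch; only the __sync_* builtins themselves are GCC's. The last helper shows the store-0 special case: ANDing with 0 always leaves 0 behind and returns the previous value.

```c
/* Sketch only: helper names are made up for this example.
 * The __sync_* builtins are documented by GCC as full barriers. */

/* Atomically add delta to *p; returns the value *p held before the add. */
static int fetch_and_add(volatile int *p, int delta)
{
    return __sync_fetch_and_add(p, delta);
}

/* Atomically set *p to desired if it currently equals expected.
 * Returns nonzero on success, 0 if *p held some other value. */
static int test_and_set(volatile int *p, int expected, int desired)
{
    return __sync_bool_compare_and_swap(p, expected, desired);
}

/* Special case of fetch-and-store where the stored value is 0:
 * (old & 0) is always 0, and the builtin returns the old value. */
static int fetch_and_store_zero(volatile int *p)
{
    return __sync_fetch_and_and(p, 0);
}
```

With a counter holding 5, fetch_and_add(&v, 3) would return 5 and leave 8; a later fetch_and_store_zero(&v) would return 8 and leave 0. This trick only covers storing 0, so a general fetchAndStore would need something else, such as a compare-and-swap loop.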