•Both conditional branches are forward branches, so they are properly predicted not to be taken (to match the common case of no contention for the lock).
•The OR writes its result to a second register; this allows the OR and the BLBS to be interchanged if that would give a faster instruction schedule.
•Other operate instructions (from the critical section) may be scheduled into the LDQ_L..STQ_C sequence, so long as they do not fault or trap and they give correct results if repeated; other memory or operate instructions may be scheduled between the STQ_C and BEQ.
•The memory barrier instructions are discussed in Section 5.5.4. It is correct to substitute WMB for the second MB only if:
–All data locations that are read or written in the critical section are accessed only after acquiring a software lock by using lock_variable (and before releasing the software lock).
–For each read u of shared data in the critical section, there is a write v such that:
1. v is BEFORE the WMB.
2. v follows u in processor issue sequence (see Section 5.6.1.1).
3. v either depends on u (see Section 5.6.1.7) or overlaps u (see Section 5.6.1), or both.
–Both lock_variable and all the shared data are in memory-like regions (or lock_variable and all the shared data are in non-memory-like regions).
Generally, the substitution of a WMB for the second MB increases performance.
•An ordinary STQ instruction is used to clear the lock_variable.
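For readers working in portable code, the acquisition attempt and release described in these notes can be approximated with C11 atomics. This is a minimal sketch under stated assumptions, not the manual's code. The function names try_acquire and release are hypothetical. The lock word is assumed to use bit 0 as the "locked" flag, as the BLBS test implies. A compare-and-swap stands in for the LDQ_L/STQ_C pair, and the correspondence is only approximate: STQ_C fails if the location may have been written since the LDQ_L, while a compare-and-swap fails only if the value differs (or, in its weak form, spuriously). Acquire ordering on success plays the role of the first MB, and the release store plays the role of the second MB plus the ordinary STQ.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One acquisition attempt on a hypothetical lock word (bit 0 set = locked).
   Returns true if the lock was acquired. */
static bool try_acquire(atomic_ulong *lock_variable) {
    /* Read the current value (the role LDQ_L plays). */
    unsigned long old = atomic_load_explicit(lock_variable, memory_order_relaxed);
    /* Early exit if bit 0 is already set (the role of BLBS); no write occurs. */
    if (old & 1UL)
        return false;
    /* Build the new value in a separate variable (the role of OR into a second register). */
    unsigned long desired = old | 1UL;
    /* Conditional store: a weak CAS may fail spuriously, as STQ_C may.
       Acquire ordering on success stands in for the first MB. */
    return atomic_compare_exchange_weak_explicit(lock_variable, &old, desired,
                                                 memory_order_acquire,
                                                 memory_order_relaxed);
}

/* Release: an ordinary (non-conditional) store of zero. Release ordering keeps
   the critical section's earlier loads and stores before this store. */
static void release(atomic_ulong *lock_variable) {
    atomic_store_explicit(lock_variable, 0UL, memory_order_release);
}
```

Because a held lock is detected by the bit test before any conditional store is attempted, a busy lock causes no write. The release store orders both earlier loads and earlier stores, so it corresponds to the full second MB rather than the WMB; substituting a weaker barrier relies on the conditions listed above.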
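It would be a performance mistake to spin-wait by repeating the full LDQ_L..STQ_C sequence (to move the BLBS after the BEQ), because that sequence may repeatedly change the software lock_variable from "locked" to "locked", with each write causing extra access delays in all other caches that contain the lock_variable. In the extreme, spin-waits that contain writes may deadlock as follows: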
If, when one processor spins with writes, another processor is modifying (not changing) the lock_variable, then the writes on the first processor may cause the STx_C of the modify on the second processor always to fail.
This deadlock situation is avoided by:
•Having only one processor execute a store (no STx_C), or
•Having no write in the spin loop, or
•Doing a write only if the shared variable actually changes state (1 → 1 does not change state).
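The second avoidance strategy, keeping writes out of the spin loop, is commonly called "test-and-test-and-set". The sketch below assumes C11 atomics and POSIX threads. The names spin_acquire, spin_release, worker, and run_demo, and the lock encoding (0 free, 1 held), are hypothetical illustrations rather than anything defined by the architecture. Waiting threads spin with loads only and attempt the conditional write only after observing the lock free, so the write they attempt would change the state 0 → 1 rather than 1 → 1, which is also in the spirit of the third strategy.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS 16

static atomic_ulong lock_word;          /* static storage: zero-initialized, lock free */
static unsigned long shared_counter;    /* protected by lock_word */
static unsigned long iterations_per_thread;

/* Acquire: spin with plain loads (no writes) while the lock is held, and try
   the conditional write only after seeing the lock free. */
static void spin_acquire(atomic_ulong *l) {
    for (;;) {
        while (atomic_load_explicit(l, memory_order_relaxed) != 0) {
            /* read-only spin: no stores to the lock word here */
        }
        unsigned long expected = 0;
        if (atomic_compare_exchange_weak_explicit(l, &expected, 1UL,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return;
        /* CAS lost a race (or failed spuriously): go back to read-only spinning. */
    }
}

/* Release with an ordinary store of zero, ordered after the critical section. */
static void spin_release(atomic_ulong *l) {
    atomic_store_explicit(l, 0UL, memory_order_release);
}

/* Hypothetical worker: increments the shared counter under the lock. */
static void *worker(void *arg) {
    (void)arg;
    for (unsigned long i = 0; i < iterations_per_thread; i++) {
        spin_acquire(&lock_word);
        shared_counter++;               /* critical section */
        spin_release(&lock_word);
    }
    return NULL;
}

/* Hypothetical demo: start nthreads workers and return the final counter. */
static unsigned long run_demo(int nthreads, unsigned long iters) {
    pthread_t tids[MAX_THREADS];
    int started = 0;
    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;
    shared_counter = 0;
    iterations_per_thread = iters;
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[started], NULL, worker, NULL) == 0)
            started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    return shared_counter;
}
```

For example, run_demo(4, 20000) starts four threads that each increment the counter 20000 times under the lock, and with correct mutual exclusion it returns 80000.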