Actually, both Intel and AMD engineers have been talking about making the
documentation _stricter_, rather than looser. The actual CPUs apparently
already are pretty damn strict, because being only as strict as the docs
imply just ends up exposing too many potential problems in software that
didn't really follow the rules.

For example, it's quite possible to do loads out of order, but guarantee
that the result is 100% equivalent to a totally in-order machine. One
way you do that is to keep track of the cacheline for any speculative
loads, and if it gets invalidated before the speculative instruction has
completed, you just throw the speculation away.

End result: you can do any amount of speculation you damn well please at a
micro-architectural level, but if the speculation would ever have been
architecturally _visible_, it never happens!

(Yeah, that is just me in my non-professional capacity as a hw engineer
wanna-be: I'm not saying that that is necessarily what Intel or AMD
actually ever do, and they may have other approaches entirely).
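The cacheline-tracking idea above can be mimicked with a toy software model. This is only a sketch under loud assumptions: the names (`spec_load`, `spec_issue`, `spec_invalidate`, `spec_complete`) and the 64-byte line size are made up for illustration, and nothing here claims to describe what any real Intel or AMD core does.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SHIFT 6    /* assumption: 64-byte cachelines */

/* One in-flight speculative load (hypothetical bookkeeping). */
struct spec_load {
    const uint32_t *addr;   /* where the load reads from */
    uintptr_t line;         /* cacheline being tracked */
    uint32_t value;         /* value seen when the load ran early */
    bool squashed;          /* line was invalidated before completion */
};

static uintptr_t line_of(const void *p)
{
    return (uintptr_t)p >> LINE_SHIFT;
}

/* Run the load early and remember which cacheline it touched. */
static struct spec_load spec_issue(const uint32_t *addr)
{
    struct spec_load l = { addr, line_of(addr), *addr, false };
    return l;
}

/* Another CPU took the line away: matching speculation is now suspect. */
static void spec_invalidate(struct spec_load *l, uintptr_t line)
{
    if (l->line == line)
        l->squashed = true;
}

/* Complete in program order: keep the early value only if the line
 * survived, otherwise throw the speculation away and redo the load. */
static uint32_t spec_complete(struct spec_load *l)
{
    if (l->squashed) {
        l->value = *l->addr;
        l->squashed = false;
    }
    return l->value;
}
```

If the line is invalidated between `spec_issue()` and `spec_complete()`, the caller only ever sees the reloaded value, so the early load is never visible in the result.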

Stores never "leak up". They only ever leak down (i.e. past subsequent loads
or stores), so you don't need to worry about them. That's actually already
documented (although not in those terms), and if it wasn't true, then we
couldn't do the spin unlock with just a regular store anyway.

(There's basically never any reason to "speculate" stores before other mem
ops. It's hard, and pointless. Stores you want to just buffer and move as
_late_ as possible, loads you want to speculate and move as _early_ as
possible. Anything else doesn't make sense).
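As a rough illustration of the "spin unlock with just a regular store" point, here is a userspace sketch using C11 atomics. It is not the kernel's implementation: the `spinlock` struct and the `spin_trylock`/`spin_lock`/`spin_unlock` names are stand-ins chosen for the example. The assumption that matters is that stores never move up past earlier accesses, which is exactly what a release store asks for.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct spinlock {
    atomic_int locked;      /* 0 = free, 1 = held */
};

/* Taking the lock needs an atomic read-modify-write. */
static bool spin_trylock(struct spinlock *l)
{
    return atomic_exchange_explicit(&l->locked, 1,
                                    memory_order_acquire) == 0;
}

static void spin_lock(struct spinlock *l)
{
    while (!spin_trylock(l)) {
        /* Spin on plain loads until the lock looks free. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
    }
}

/* Unlock is just a store. With release ordering no earlier access can
 * move below it, and on x86 compilers emit an ordinary mov for this. */
static void spin_unlock(struct spinlock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The asymmetry is the point: the lock side pays for an atomic exchange, while the unlock side needs no locked instruction and no fence on x86.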

So I'm fairly sure that the only thing you really need to worry about
here is the load-load ordering (the load for the spinlock compare vs
any loads "inside" the spinlock), and I'm reasonably certain that no
existing x86 (and likely no future x86) will make load-load reordering
effects architecturally visible, even if the implementation may reorder
*internally* when it's not possible to see it in the end result.
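To make the load-load shape concrete, here is a small sketch with the same structure as "check the lock word, then read what it protects": one load decides whether a later load is valid. The `mailbox` type and function names are hypothetical, invented for this example. In C11 the dependence on ordering is spelled out with an acquire load; on x86 that compiles to a plain mov, which only works because the hardware does not make load-load reordering visible.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct mailbox {
    int payload;            /* plain data, written before the flag */
    atomic_int ready;       /* 0 = empty, 1 = payload is valid */
};

/* Writer: store the data, then store the flag (stores stay in order). */
static void mailbox_publish(struct mailbox *m, int value)
{
    m->payload = value;
    atomic_store_explicit(&m->ready, 1, memory_order_release);
}

/* Reader: load the flag, then load the data. Correctness depends on the
 * second load not appearing to run before the first one. */
static bool mailbox_read(struct mailbox *m, int *out)
{
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return false;
    *out = m->payload;
    return true;
}
```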
Linus