I am not sure Christoph was referring to actual instructions.
For static percpu (vmlinux or modules), I was suggesting:
vmlinux: (offset31 computed by the linker at vmlinux link edit time)
	incl %gs:offset31
modules: (offset31 computed at module load time by the module loader)
	incl %gs:offset31
(If we make sure all this stuff is allocated in first chunk)
And for dynamic percpu:
	movq field(%rdi),%rax
	incl %gs:(%rax) /* full 64-bit 'offsets' */
I understood (but might be wrong again) that %gs itself could not be used with an offset > 2GB, because
of the way the %gs segment is set up. So in the 'dynamic percpu' case, %rax should not exceed 2^31.
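To make the 2^31 limit concrete, here is a minimal C sketch of the range check implied above. fits_offset31() is a hypothetical name, not an existing kernel helper, and it assumes the limit really is 2^31 as I described, which may be wrong.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not an existing kernel function).
 * Returns true if 'off' fits in the non-negative 31-bit range
 * called "offset31" above, i.e. 0 <= off < 2^31.
 * Assumption: the %gs-relative offset must stay below 2^31.
 */
static bool fits_offset31(uint64_t off)
{
	return off < (UINT64_C(1) << 31);
}
```

Under that assumption, an allocator for dynamic percpu could run such a check on the value it stores in 'field' before handing it out.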
--