Wednesday, 25 December 2013

x86_x32 instruction set

There was an article on Slashdot today
on whether the x32 instruction set architecture is of any use. The
comments were equally divided - some loving it for embedded use where every
byte or cycle counts, and some saying it's pointless.

In general, I think support for this is a good thing. We mostly
won't use it on our desktops, and probably not even on our Androids -
the scope for confusion, missing libraries, and other stuff is huge.

For those that don't know - the Intel/AMD chips support classic i386
32-bit mode binaries and instructions, whilst x86_64 supports full
64-bit operation. i386 mode is important (or was important) in the
transition to full 64-bit chips, since at the time, the OS, compilers,
and apps didn't support 64-bit and not everything was recompiled "overnight".

Getting a new architecture ready is a lot of work - we see this today
in the race for ARM-64. Very typically, what is needed is the OS to support the
architecture, followed by the compilers. Actually, the compilers have
to come first to support the OS, so there can be a standoff until the
two stabilise.

Next comes the glibc library which provides the interface to the OS,
and pretty much a bottom-up recompile of every library (X11, GTK, GNOME,
KDE, Qt, etc.) and lastly the apps; initially the core OS/distribution apps,
and then vendor apps.

This can take a while to stabilise across the full suite of apps - maybe
1-2 years, optimistically.

Given the maturity of 64-bit architectures, nobody is in a race to
support the x32 variant. The x32 variant is simply a 4GB address space
version of x86_64. All the instructions and registers are available
from the 64-bit architecture, but the address space is limited, which
in turn reduces the size of pointers from 8 bytes to 4. Hence, smaller memory demands.
In today's multigigabyte desktops, that's not a big issue. But this can lead
to smaller binaries - smaller footprints, faster to load, fewer
pages of memory, and fewer TLB and cache misses. The fact that more
code can exist in the instruction cache is important and can give additional
gains.
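
To put a number on the pointer-size effect, here is a minimal sketch that models the layout of a C struct under natural alignment rules. The tree node (three pointers plus a 4-byte key) is a hypothetical example, not a structure taken from any real application, and the helper names are made up for illustration.

```python
def struct_size(fields):
    """Size of a C-style struct given (size, alignment) pairs,
    assuming each field is padded to its natural alignment."""
    offset = 0
    max_align = 1
    for size, align in fields:
        # Pad so this field starts on a multiple of its alignment.
        offset = (offset + align - 1) // align * align
        offset += size
        max_align = max(max_align, align)
    # Pad the tail so that arrays of the struct stay aligned.
    return (offset + max_align - 1) // max_align * max_align


def tree_node(pointer_size):
    """Hypothetical node: left, right and parent pointers plus an int key."""
    ptr = (pointer_size, pointer_size)
    return struct_size([ptr, ptr, ptr, (4, 4)])


size_x86_64 = tree_node(8)  # 8-byte pointers
size_x32 = tree_node(4)     # 4-byte pointers
print(size_x86_64, size_x32)
```

Under these assumptions the node halves in size, which is the best case; a structure that is mostly character data would barely change.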

In general, the reported gains are small - maybe a few percent,
unless the app is "pointer heavy". Many apps
have large datasets in memory that are mostly raw data (e.g. XML or text,
or web pages) and gain little; but sophisticated
apps will have big complex structures with lots of pointers, so the savings
may be good.

I tried recompiling CRiSP as a new linux-x86_x32 architecture to see
what the real world effect is. (Side note: I updated my ptrace implementation
to support the x32 architecture - it required very few lines of code to
do so; the kernel system call interface is nearly identical to x86_64,
bar a quirk of how syscalls are encoded).
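
One such encoding difference, assumed here to be the quirk in question, is that x32 system calls carry an extra flag bit (__X32_SYSCALL_BIT, 0x40000000 in the Linux headers) on top of the syscall number. A tracer reading the raw number then has to mask that bit off. The sketch below shows only the masking step; decode_syscall is a hypothetical helper, not code from CRiSP.

```python
X32_SYSCALL_BIT = 0x40000000  # value of __X32_SYSCALL_BIT in the Linux headers


def decode_syscall(raw_number):
    """Split a raw syscall number, as a tracer would read it from the
    tracee's registers, into (abi, number)."""
    if raw_number & X32_SYSCALL_BIT:
        return ("x32", raw_number & ~X32_SYSCALL_BIT)
    return ("x86_64", raw_number)


print(decode_syscall(1))           # write() issued by a 64-bit process
print(decode_syscall(0x40000001))  # write() issued by an x32 process
```

A real tracer would also need to size pointers and longs as 4 bytes when decoding arguments, which this sketch leaves out.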

Recompiling (and defining) a new crisp variant took around 5-10 minutes
of effort - although I had to put a workaround in for <sys/sysctl.h>
which has a pragma complaining that the sysctl() system call is not
supported for x32 architecture apps. (I don't think I use that, but
had to conditionalise out the #include).

The results are interesting - x32 seems to win, compared to x86_64. The
results suggest a 5-10% performance improvement. I attach
two runs of my performance benchmark in CRiSP macros - what these macros
do doesn't matter much, except to note each test attempts to run for
5s, counting how many of certain operations can be performed. How
this would translate into real world performance is not worth a comparison.
CRiSP is pretty efficient anyhow, and if you run CRiSP, your CPU is
going to be mostly idle; and when it isn't, maybe you get up to 5%
performance improvement - i.e. you wouldn't notice it. If
we were optimising, for example, battery performance on a mobile,
tablet or laptop, then having x32 could be beneficial (not just
for CRiSP, but the OS and all the other apps you use). Imagine
an extra 30-60 mins of battery life on your portable device! Worth
having, but may not happen.

Anyhow, here's the relevant benchmark data:


   text    data     bss     dec     hex filename
1379459   50844   70212 1500515  16e563 bin.linux-x86_32/cr
1424517   93000   78024 1595541  185895 bin.linux-x86_64/cr
1349955   52064   70692 1472711  1678c7 bin.linux-x86_x32/cr


The above shows the sizes of the 'cr' executable - the x32 variant
certainly wins. (Due to more registers and better instruction layout
compared to the x86_32 variant).
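
For a rough sense of scale, the size figures can be turned into percentages relative to the x86_64 build. This is a small sketch using only the numbers in the table; the helper names are made up for illustration.

```python
# (text, data, bss) in bytes, copied from the size output above.
sizes = {
    "x86_32": (1379459, 50844, 70212),
    "x86_64": (1424517, 93000, 78024),
    "x86_x32": (1349955, 52064, 70692),
}


def total(arch):
    """Sum of text + data + bss, i.e. the 'dec' column."""
    return sum(sizes[arch])


def saving_vs_64(arch):
    """How much smaller (in percent) a build is than the x86_64 one."""
    return 100.0 * (total("x86_64") - total(arch)) / total("x86_64")


for arch in sizes:
    print(f"{arch:8s} {total(arch):8d} bytes  {saving_vs_64(arch):5.1f}% smaller")
```

On these figures the x32 binary comes out roughly 7.7% smaller than the x86_64 one overall.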

Here are the CPU benchmarks. I didn't run the x86_32 benchmark,
since most people will run pure 64-bit desktops and use
the 64-bit binary. One final note: CRiSP compiled nicely, and
identically compared to the two other linux architectures.
*Except* for missing X11 libraries - i.e. I cannot build a "crisp"
windowed GUI version of CRiSP (maybe I haven't installed the relevant
packages, so I will take a look after this blog post to see if it
is viable).


PERF3: 25 December 2013 23:14 v11.0.22a -- linux-x86_x32
 1) loop           Time: 5.00  3,635,000/sec
 2) macro_list     Time: 5.01     58,200/sec
 3) command_list   Time: 5.00      4,185/sec
 4) strcat         Time: 5.00     77,000/sec
 5) listcat        Time: 5.01        240/sec
 6) string_assign  Time: 5.00    448,800/sec
 7) get_nth        Time: 4.99      8,890/sec
 8) put_nth        Time: 5.00    142,340/sec
 9) if             Time: 5.00    559,000/sec
10) trim           Time: 5.00    143,800/sec
11) compress       Time: 5.00    116,000/sec
12) loop_float     Time: 5.00  2,950,000/sec
13) edit_file      Time: 5.00     80,600/sec
14) edit_file2     Time: 5.00      2,592/sec
15) macro_call     Time: 5.00     74,100/sec
16) gsub           Time: 5.00    226,300/sec
17) sieve          Time: 4.99          0/sec
Total: 100.980000 Elapsed: 1:42



PERF3: 25 December 2013 23:20 v11.0.22a -- linux-x86_64
 1) loop           Time: 5.00  3,300,000/sec
 2) macro_list     Time: 5.00     44,450/sec
 3) command_list   Time: 5.00      3,360/sec
 4) strcat         Time: 5.00     72,500/sec
 5) listcat        Time: 5.02        212/sec
 6) string_assign  Time: 5.00    437,200/sec
 7) get_nth        Time: 5.00      8,435/sec
 8) put_nth        Time: 4.99    129,000/sec
 9) if             Time: 5.00    546,000/sec
10) trim           Time: 5.00    129,300/sec
11) compress       Time: 5.00    118,600/sec
12) loop_float     Time: 5.00  2,790,000/sec
13) edit_file      Time: 5.00     88,600/sec
14) edit_file2     Time: 5.01      2,340/sec
15) macro_call     Time: 5.00     66,100/sec
16) gsub           Time: 5.00    187,100/sec
17) sieve          Time: 5.00          0/sec
Total: 101.420000 Elapsed: 1:42
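
The two runs are easier to compare as per-test percentage differences. The sketch below does that with the rates copied from the listings above; it is a single pair of runs, so treat the output as indicative rather than statistically sound.

```python
# Rates (operations/sec) copied from the two PERF3 runs above.
# sieve is left out because both runs report 0/sec.
x32 = {
    "loop": 3635000, "macro_list": 58200, "command_list": 4185,
    "strcat": 77000, "listcat": 240, "string_assign": 448800,
    "get_nth": 8890, "put_nth": 142340, "if": 559000, "trim": 143800,
    "compress": 116000, "loop_float": 2950000, "edit_file": 80600,
    "edit_file2": 2592, "macro_call": 74100, "gsub": 226300,
}
x64 = {
    "loop": 3300000, "macro_list": 44450, "command_list": 3360,
    "strcat": 72500, "listcat": 212, "string_assign": 437200,
    "get_nth": 8435, "put_nth": 129000, "if": 546000, "trim": 129300,
    "compress": 118600, "loop_float": 2790000, "edit_file": 88600,
    "edit_file2": 2340, "macro_call": 66100, "gsub": 187100,
}


def percent_change(new, old):
    """Relative change of new versus old, in percent."""
    return 100.0 * (new - old) / old


change = {name: percent_change(x32[name], x64[name]) for name in x32}
slower = [name for name in x32 if change[name] < 0]

for name, pct in change.items():
    print(f"{name:14s} {pct:+6.1f}%")
print("x32 slower on:", slower)
```

On this data x32 is ahead on 14 of the 16 tests with a non-zero rate, and behind on compress and edit_file.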


Post created by CRiSP v11.0.22a-b6663

