Monday, 14 April 2014

CMD.EXE - a bug (and one of mine fixed!)

Lets get my embarrassment out of the way first. The last
release of CRiSP for Windows (11.0.29) contained a silly bug.
The installer defaulted to "C:\Program Files" instead of
"C:\Program Files\CRiSP". After people mentioned that the install
was going to the wrong choice of "Program Files" vs "Program Files (x86)",
I switched to using %ProgramFiles% so that the C runtime and Win API
would do the right thing - depending whether you installed win32 or win64
version of CRiSP.

I had been delaying resolving an issue with the uninstaller, which
had been failing to remove the final registry entry - due to the
fact that the setup.exe tool was distinguishing between a win32 vs
a win64 install. (CRiSP still comes as a Win32 or Win64 application,
despite all Windows systems supporting 64-bit as default; theres no real
need for the 32-bit version, except it may be 0.5% faster or smaller
in memory use; CRiSP will mostly fit into the L2 CPU cache).

As a reminder for history, CRiSP started life on an 80286 Unix system
and was ported to Windows 3.1! Much of the internal structure hasnt
changed in all these years, except to improve the bugs and the
GUI.

Since we are talking about Windows, someone posted an issue on CRiSP
and on investigation, its a bug in CMD.EXE. Very amazing. CMD.EXE
is so old - dating back to Windows NT 3.51, and before. Its a bizarre
application because it lets you have an MS-DOS style console window
and supports a console API for DOS like apps. "CR.EXE" uses that
to give the same BRIEF look'n'feel dating back to the early 1990s.

But a bug report - in 2014 ?!

CRiSP has a feature to display "[EOF]" as the last line in the file.
You can turn it on (e.g. "set eof" at the Command: prompt). It displays
in reverse video.

If CRiSP window borders are turned off (Alt-F1), then you will likely
notice strange lines on the screen as you scroll around.

This is being caused because when in reverse video, CMD.EXE is adding
one extra pixel to the left and right of the marked region. You
can see this effect by compiling and running the following test
program. The "[EOF]" is shown at line 10 and 12 of the console.
If you arrange to enter text at the C:\ prompt as you get near
to the "[EOF]" you will see the extra pixel disappear. Its easier to use
a tool like ZOOMIN.EXE or "xmag" on Unix, to examine closerly at the
delta.

Of course, CMD.EXE's bug is my bug, so I get to fix it, despite almost
no-one in the known universe caring about this, except people who
use CR.EXE (including myself).

Makes your heart bleed, doesnt it?!


#include <windows.h>

int main(int argc, char **argv)
{ int i, j;
HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
int num_written;
COORD coord;
WORD attribs[80];

for (j = 0; j < 2; j++) {
coord.X = 30;
coord.Y = 10 + 2 * j;

WriteConsoleOutputCharacter(h,
"[EOF]", 5, coord, &num_written);

attribs[0] = FOREGROUND_RED | FOREGROUND_GREEN | FOREGROUND_BLUE | BACKGROUND_GREEN;
attribs[0] = BACKGROUND_GREEN | BACKGROUND_INTENSITY;
if (j == 1)
attribs[0] = BACKGROUND_GREEN ;
for (i = 1; i < 5; i++)
attribs[i] = attribs[0];
WriteConsoleOutputAttribute(h, attribs, 5, coord, &num_written);
}
}
c:\> cl test.c
c:\> cls
c:\> test


Post created by CRiSP v11.0.30a-b6722


Thursday, 10 April 2014

rss.pl - time for a change

The DynDNS address I am using is likely to disappear soon.
I have created a couple of replacement ones (most likely crisp.publicvm.com).
If you are using the service at crisp.dyndns-server.com:3000, then
please update your bookmark(s). I'll announce here anything
"final". (I dont know if I will need to renew my DNS or if I will forget).

FYI, I added '/q' over the weekend - now I am being even more frugal!
Instead of 2-4 x 100KB of data, I can possibly get away with 2-4 x 20K of
data. At this rate, my PAYG SIM is going to last me a *year*!

(Just taken me days to resolve a google/cookie conflict so I can actually
post this!)

Post created by CRiSP v11.0.29a-b6718


Sunday, 23 March 2014

Why I love Xen

I dont. But I thought my prior blog required balance :-)

I think I found the Xen + DTrace culprit. I have a couple
of routines in dtrace which invoke the CLI and STI instructions.
They are a bit naughty, and designed to shield dtrace from the complex
number of enable/disable interrupt primitives in the kernel.
(The kernel has multiple layers of interrupt functions and macros -
to try and hide the implementation details of each cpu, along with
preemption and scheduling functions).

I spent a lot of time thinking about DTrace and Xen - what was
I doing wrong. I spent a lot of time not thinking about DTrace and Xen.
As much as I wanted and needed help, to find this horrible bug, I couldnt
even phrase the right question.

The evidence was clear: occasional single step traps would not properly
single step - it was as if the single-step trap flag was not set
in the FLAGS register. I tried molesting the code in the interrupt
routines - checking very careful the assembler code.

The Linux kernel has two sets of interrupts for many of the core
traps (int1, int3 and others). Which one is used depends on whether
Xen is active or not. When Xen is active, there are two extra registers
on the interrupt stack.

After much playing with the dtrace code and the linkage from dtrace to
the underlying kernel...I started to investigate something else.

I noticed the dtrace dumps were caused at the point we switched from
one cpu to another. My VirtualBox VM had 4 cpus. So I set it to 1 cpu.
DTrace worked remarkably better - but soon enough, we still core dumped
on:


$ dtrace -n fbt::sys_chdir:


in one window and


$ while true
do
cd /
done


in the other. The fact that we had a single-step trap, which was not
single stepping is really curious. So I started grepping for CLI/STI
in the driver code, and came across my old friends dtrace_enable_interrupts()
and dtrace_disable_interrupts().

These functions use assembler to execute the CLI/STI instructions.
Now, what if we had done the wrong thing? Xen wouldnt really know
(except by accident) that we had executed these instructions and we
may be running without correct disabling interrupts. Xen is a hypervisor
and if it doesnt know whats going on, all bets are off.

So I replaced the CLI/STI with arch_local_irq_restore and
arch_local_irq_save, and it works! I had gotten away with this for
a small duration and this is the Xen issue (I hope its the only
issue). Whilst writing this blog, my test case ran with no problems.

Now I can formalise the change and push out a new release.

I still dont like Xen - I would rather have my VM die and tell me
that something is wrong immediately, not lie through its teeth that
the guest code is working when it isnt.

Post created by CRiSP v11.0.27a-b6714


Friday, 21 March 2014

Why I dislike Xen

I have a hate/hate relationship for Xen. All forms of virtualisation
*except* Xen, emulate a CPU in a way where it is extremely difficult to
tell the difference. Xen replaces certain instructions with an API
and a portal to the outside world. If you look at such instructions
as CLI and STI - classic 8080 instructions (yes, all the way back to the
original 8-bit cpus), they work in a well understood way.

Fast forward to Xen. Those instructions dont exist - nor any other
instruction which can affect the virtual state of the CPU. Instead,
the kernels are compiled to call a function via the hypercall portal.

Now, if you do the wrong thing - you are on your own. You wont know what you
did wrong. It is so easy to crash the VM *host* if you are not careful.

So, lets look at dtrace. Dtrace appears to work well. But if you are
on Xen, as of this writing, the results are erratic - after a few
probes, you will crash the process, or the kernel, or the VM.

Heres an example of a process crash:


[ 2390.769639] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[ 2390.769640] IP: [<ffffffffa01cd193>] cpu_core_exec+0x133/0xfffffffffffbcfa0 [dtracedrv]
[ 2390.769641] PGD 17319067 PUD 172b9067 PMD 0
[ 2390.772090] Oops: 0002 [#1] SMP
[ 2390.772090] CPU 1
[ 2390.772090] Modules linked in: dtracedrv(PO) nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc psmouse microcode serio_raw i2c_piix4 e1000 [last unloaded: dtracedrv]
[ 2390.772090]
[ 2390.772090] Pid: 3057, comm: bash Tainted: P W O 3.4.0 #1 innotek GmbH VirtualBox/VirtualBox
[ 2390.772090] RIP: e030:[<ffffffffa01cd193>] [<ffffffffa01cd193>] cpu_core_exec+0x133/0xfffffffffffbcfa0 [dtracedrv]
[ 2390.772090] RSP: e02b:ffff8800173d5f78 EFLAGS: 00010193


When a dtrace probe fires (because we replaced the instruction with a breakpoint),
we have to single step the replaced instruction. What is
happening above is that the single step is not honored.
The EFLAGS register shows bit 0x100 is set - a single step is in
effect. But Xen has decided it isnt. So the instruction
at the buffer location continues after the instruction. Heres
the disassembly:


(gdb) x/i 0xffffffffa01cd190
0xffffffffa01cd190: push %rbp
0xffffffffa01cd191: nop
0xffffffffa01cd192: nop <=== shouldnt get here
0xffffffffa01cd193: add %al,(%rax) <=== but we died here


Such sweet joy to figure out what "rule" I broke. I had
reported in a prior blog entry that the Xen documentation
is very poor.

More than likely I am setting the trap flag by bit-mask operation
on the flags and not by telling Xen what we did.

Off to find the code which is stopping Ubuntu 12.04 from working
reliably.

Post created by CRiSP v11.0.27a-b6714


Monday, 17 March 2014

Malloc(RSS)

I recently wrote about the rss.pl RSS news crawler I wrote. Majority
of it took a few hours to write - the goal was to reduce my mobile
bandwidth and avoid annoying adverts being injected into pages. I set
my sights very low - this doesnt need to be a technically difficult
piece of software, and it gave me something to play with - something
I have done numerous times over the years. (CRiSP incorporates a
news side-bar, for instance; and CRiSP used to support a web
spider - it still does, but theres less need for it in the
day of superfast broadband - compared to the early modems it
was written against).

I can successfully report that rss.pl exceeded my desire to go on
a diet. Previously I was approaching 500MB/month of bandwidth use
and it was out of control, due to the obnoxious way adverts were
being generated (they were mutilating the HTML of a page - after
I had paid the cost of downloading the page - and effectively
paying twice for a page, whilst denying compression). In addition,
the mechanism would add significant delay - I was lucky to read
1 or 2 web pages in multiple minutes.

With rss.pl, I do a page refresh maybe 2 or 3 times a day. The
single (compressed) page load of ~90kb (~250kb uncompressed) lets
me wade through maybe 100-200 news items over the course of a few
hours - I can read far more, without annoying delays.

In one month, I hit 60MB of use - I had set myself a diet of 100MB/month
but I exceeded that. I would have hit 50MB or lower, but occasionally
I use my device for email.

Anyway, I started adding stats to the service. Allows me to fine
tune how it works. I noticed other people use it (by all means do -
rss.pl - but
I reserve the right to take it down or restrict usage if I find any abuse).

One of the things I am debating is how much history to carry. I currently
allow for the last 2000 new sites in a cache, but of the 10-12 sites
I poll, at varying frequencies, some sites are exposing old items.
(I dont use date aging - yet - simply URL history). The working set
is about 700-800 news items, so within 2 or so days, I will be exposed
to stale news items popping through. Its interesting watching the effect.
For example, www.osnews.com is not a fast updating site - maybe 1 item
per day, yet my code is allowing multiple old items to pop through.
"The Old New Thing" (http://blogs.msdn.com/b/oldnewthing/rss.aspx) is a
great site, and updates at one update per day. Yet very old items are
peeping through.

Somehow, the algorithm for retrieving sites, and the cycle update
from these sites is causing a kind of "shearing" effect - something
typically seen with a fragmented heap (hence the reference to malloc) or
locality issues in a page swapping algorithm.

Its as if the sinusoidal wave of each news feed is leading to some
interesting effects.

I can just increase my history limit - doing so wastes more CPU
cycles on my server (so I now need to use a better data structure
than what I presently do). Maybe I will relent and do
If-Modified-Since requests and use the timestamps to keep everything
fresh.

At the moment, I discard older news items and pages I generate, but
I may decide to keep them for long term archival - for more data mining
expeditions.


Post created by CRiSP v11.0.27a-b6712


Sunday, 9 March 2014

Latest DTrace news

I havent much worked on DTrace recently - having been looking at
some side projects (adding multiple cursor support to CRiSP, my
rss.pl enhancements, and various bugs/enhancements in CRiSP).

Of course, when I stop looking at dtrace, after a while, a trickle
of things come in. I thought I would summarise some of my "TODO" items.

First, a user reported an issue on a 3.2 kernel. Seems like a Xen
hosted kernel is very unreliable. Although he reported an issue with
an fbt::: torture test, I could reproduce with a task switch
function - even after a few probes, suggesting the basic handling of an
IRET instruction may be having problems on Xen. (Xen does paravirtualisation
which is another way of saying, all the work is on the user and
not itself, to virtualise. I really do not like Xen - its a work
generation scheme!)

Next was an issue with the PID provider. I did start working on this
and found some missing functionality in libproc/libdtrace. I started
working on this, but its not ready for consumption. I have included
the partial work in the next release. At issue is running a process
and intercepting the shlib events so that dtrace can look at the
shlib and add probes.

Next up is the 3.12 kernel. I usually leave it a while before touching
new kernels - Ubuntu 14.04 is due in a few weeks and I can kill two
birds with one stone by waiting for that to come out.

Another annoyance is the libdwarf vs libdw confusion which prevents
ctfconvert from being built. This is mostly irrelevent unless people
want to use kernel struct's in their probes. I need to discard
both libraries and use readelf and parse the output from it - its
the only portable way to do what is necessary vs the confusion
inherent in both dwarf libraries.

So, its unlikely I will release a new dtrace release for a few more
weeks, unless I magically fix one or more of the issues above.


Post created by CRiSP v11.0.26a-b6707


Sunday, 2 March 2014

candycrush - file format

After all the recent discussions about candycrush, I thought I would
test it out - see what the fuss is about.

Its a nicely put together game - delivering a "Bejewelled" experience
which gets you hooked quite early on. Starts really easy and then
becomes challenging. You quickly run out of lives. Lives are
given back to you at a rate of 1 every 30 mins. If you leave the game
for a few hours you can have a few lives to retry the levels again,
until you run out.

Great so far. With over 300 levels - the game will keep you hooked
for a long time. But paying for the extras and in-app purchases can soon
be expensive.

Looking at all the cheats on the web - there seems to be two:
one is to use facebook - login/deinstall/reinstall, and you can
bump up your lives.

The other is to download a "save" file which is premodified to give
you 999 lives and other stuff.

I took a quick look and located that (for Android), the
file of interest is something like

/data/data/com.king.candycrushsaga/app_storage/save_NNN.dat

where NNN is some unique number for you.

Looking at this file in a binary editor shows some form of binary structure.
Given that I was at level 3, the following pattern showed itself:


0002f0: 0003 0000 00c0 1200 0003 0000 00ac 1c00 ................
000300: 0002 0000 00b4 3c00 0003 0000 003c 8700 ......<......<..
000310: 0003 0000 0020 f800 0003 0000 0034 fb02 ..... .......4..
000320: 0003 0000 00ec fe00 0003 0000 00d4 4301 ..............C.
000330: 0003 0000 00ac 3e01 0002 0000 0038 ee01 ......>......8..
000340: 0003 0000 00a8 5b01 0002 0000 005c 5c01 ......[......\\.
000350: 0003 0000 0008 7d02 0003 0000 0018 c801 ......}.........
000360: 0003 0000 00dc 6c02 0002 0000 0010 0d01 ......l.........
000370: 0003 0000 008c 1103 0003 0000 0014 ed03 ................
000380: 0003 0000 00e4 9800 0002 0000 0088 2b01 ..............+.


After that is a sea of all zeroes. I modified those "3"'s - at 8 byte
intervals, and uploaded to my device.

And yes - it now shows that I have completed (or started) those levels.
I am assuming the format here is - number of stars, and a high score account.
Not clear what the 0000 is after the stars - maybe the stars are
an int (i.e. 4 bytes) and 4 bytes for the score.

I presume theres lots more stuff in there to find, including a timer
or other attributes you can earn/pay for.


Post created by CRiSP v11.0.25a-b6698