Friday 2 July 2010

DTrace / CTF Hashing function

Am playing with hashing functions. Why? Because I feel like it. No
real reason, other than trying to speed up my full text search engine.
(Why am I writing a full text search engine? Err...no reason either, but,
because I can, and because I want to better understand some of the issues
around this, and because I may be able to incorporate it into CRiSP (its
bundled with CRiSP), but also, because it may be useful on its own).

Anyway, was browsing ctf_hash.c in the DTrace package, and there seems
to be something wrong with the hash function:

static ulong_t
ctf_hash_compute(const char *key, size_t len)
{
  ulong_t g, h = 0;
  const char *p, *q = key + len;
  size_t n = 0;

  for (p = key; p < q; p++, n++) {
    h = (h << 4) + *p;

    if ((g = (h & 0xf0000000)) != 0) {
      h ^= (g >> 24);
      h ^= g;
      }
    }

  return (h);
}


The first problem is that variable "n" is unused.


The second problem seems to be that is this a bad hashing function?
"h" is basically the bytes of the key, shifted by 4 each time.


For short strings (less than 7 chars), we neve execute the XOR lines.
And the hash doesnt seem very strong or distributed. I am guessing
the goal here is that for most strings in the kernel, or an application,
they are generally longer than 7 bytes. So, some experimentation
is required to see if the distribution is flat.


Secondly, since we know the size of the string (len), the loop
can be unrolled considerably. For strings less then 7, we could avoid
the complex "g" and/test/jump. Something like this:



static ulong_t
ctf_hash_compute(const char *key, size_t len)
{
  ulong_t g, h = 0;
  const char *p, *q = key + len;

  if (len <= 7) {
   for (p = key; p < q; p++) {
     h = (h << 4) + *p;
   }
  } else {
    for (p = key; p < q; p++) {
      h = (h << 4) + *p;

      if ((g = (h & 0xf0000000)) != 0) {
        h ^= (g >> 24);
        h ^= g;
      }
    }
  }

  return (h);
}



Post created by CRiSP v10.0.2a-b5871

No comments:

Post a Comment