Compile-Time Hash String Generation

Pope Kim Jan 11, 2012

Motivation

Recently, I worked on a code-base where everything was referenced by strings. It was quite expensive as you can imagine, so to avoid the cost of string comparison, hash values(ints) were used. This system had a giant hash string registry, which generates a hash value and stores both the hash and actual strings whenever the code sees a new string, well all during run-time.

I personally did not enjoy this system for a number of reasons:

  • it takes up too much memory
    • almost 90% of those strings were never really used as real char * string representation & they were simply just look-up keys
    • so no need to store actual string representations: just hash values were enough
    • then we can generate hash values offline
  • hash string registry was quite slow to be multithread-safe. New strings can be registered by any thread, so we had to have a locking mechanism, which turned out to be a bottle neck time to time
  • strings in the executable or data files are easy to read by anyone with a hex editor. Maybe it makes attackers' life a bit easier?

So I thought of doing something better…… (as follows)

We Need Two Different String Types

I think we should adopt two different ways of handling strings for those 90% and 10%.

1. Pure char* String (10%)

Any string that needs actual string representations should be in this format:

  • filenames that the game needs to load sometime (unless your filenames are hashed, too. If you wanna go nuts sure.. but I think it's too extreme)
  • any string that needs to be printed out on the screen. You really don't want your users to read some hex number like 0xFFFFF, 0x888888 or 0x000000, right? (looking at the picture below.. I think it might be okay… lol)

    cat Image Source:icanhascheeszburger.com

Since they would be only 10% of the string for most games, probably you can just do strcmp() on them instead of generating and storing hash values for them. But if you really insist, you can still do the similar thing to what our hash string manger used to do. (still better than having 10 times more strings stored on memory)

2. Hash-Valued String (90%)

So all the other strings, of which char* representations are never used, will be just a simple hash value, an int. A good example is any strings

  • only used for comparison
  • only used as look-up key

#1(pure char* string) is very straight forward, sofrom now on I'll only focus on #2(Hash-Valued String) in this post.

Choosing a Good Hash Function: x65599

We all know what hash function is. Just in case you don't, it's just a function that will "hopefully" generate unique integer values for different strings. (I said "hopefully" because there is chance to have a same vale for tow different strings: we call it hash collision.) Once you have a unique hash value for each strings, you can simply compare two strings by those hash values instead of comparing every single characters in them What you get out of this? It's faster and simpler.

Then how do you generate a unique hash value for each string? Well, therea lot of different hash functions. Some have less collisions while the others have more. Some are faster while the others are slow. You can choose whatever based on your need. For me, a hash function for game should:

  • have near-zero collision
  • versatile enough to be used as run-time function: you will possibly need toconcatenate strings and then generate hash value to compare in your game
  • be fast enough to be used in game

I did some research and found this amazing comparison chart by Chris Savoie, and I decided the best one for me is x65599. (Also I liked the name quite a lot. Anything that starts with X = a badass)

I also looked into the implementation (shown below). It's simple. You keep multiplying 65599. A-Ha! that's why it's called x65599 :D

// A hash function with multiplier 65599 (from Red Dragon book)
unsigned int generateHash(const char *string, size_t len)
{
    unsigned int hash = 0;
    for(size_t i = 0; i < len; ++i)
    {
       hash = 65599 * hash + string[i];
    }
    return hash ^ (hash >> 16);
}

Alright, now I picked the hash function. Let's move onto what to do in tools-side and code-side.

Saving Data from Tools

If there is any tool that used to save useless char* values into the game data, it's time to change it to save out a hash value, instead. It's simple. Just call that function from your tool (or implement same function if the tools use different language from your game).

Simply done. Moving onto in-game code.

Generating Compile-time Hash Value in Code

Alright let's say, I want to find a bone named "funny_bone". I would usually write a code like this:

bones.find("funny_bone");

but now the tools export a hash value for this string, so I have to write something like this:

const char * boneToFind = "funny_bone";
bones.find( generateHash(boneToFind, strlen(boneToFind) );

Sure, it would work. But wouldn't the string "funny_bone" into our executable? If so, it will use the same or more amount of memory than our old string registry. But this string is a const string, and compiler "might" know about it. Can we make sure the compiler does all thenecessarycomputation during compile time and somehow magically spits out assembly codeequivalentto this?

// yup 0XF1C66FD7F is the real hash value for "funny_bone"
bones.find(0xF1C6FD7F);  

I think we can if these two things can happen:

  • inlining the hash function: if compiler choose to not inline the hash function, there is no way it can calculate an int value during the compile-time. So this is a must. inline keyword is kinda a guideline on most compilers, but I don't think it's a hard thing to achieve.
  • unrolling the hash loop: we are dealing with constant length string, so if we have a way to tell compiler to unroll this and do all the calculation during compile-time, it would work.

Help Inlining generateHash(const char*, size_t)

First off, I want to make sure theimplementationof generateHash(const char*, size_t) function is available for inlining when the compiler compiles the code. Also I want to allow programmers to use a simple macro without calling strlen(const char *) specifically. So I decided to add a define like this:

#define HASH_STRING(str) generateHash(str, strlen(str));

With this, my hash.h looks like this:

// compiile-time hash string test
// author: Pope Kim (www.popekim.com)

#include <string.h>
#define HASH_STRING(str) generateHash(str, strlen(str));

// A hash function with multiplier 65599 (from Red Dragon book).
// we put this implementation into header file so that compiler can
// always inline it.
inline unsigned int generateHash(const char *string, size_t len)
{
    unsigned int hash = 0;
    for(size_t i = 0; i < len; ++i)
    {
       hash = 65599 * hash + string[i];
    }
    return hash ^ (hash >> 16);
}

Test Code

Now I made a test code to see if different compilers and optimization options can flat down HASH_STRING(str) into a single integer value in compile-time.

This is my main.cpp:

// compiile-time hash string test
// author: Pope Kim (www.popekim.com)

#include <stdio.h>

#include "hash.h"

int main(int args, char** argv)
{
    unsigned int hashValue = HASH_STRING("funny_bone");
    printf("hash value is 0x%8x\n", hashValue);


    return 0;
}

Now, let's see if the compilers do a good job on this.

With Visual Studio 2010 SP1

I've created a quick win32 console application project and tested it with different optimization flag. This is what I did.

  1. Change project configuration to Release
  2. To output assembly file, I went to Project Properties > C/C++ > Output Files > Assembler Output, and chose Assembly-Only Listing (/FA)
  3. Then to test different optimization flag, I went to Project Properties > C/C++ > Optimization and compiled with the following settings:

Disabled (/Od)

Two findings:

  • generateHash() function call is not inlined, as expected
  • interestingly, strlen() is optimized: look at push 10
_main      ; COMDAT
; File e:\temp\x65599\x65599\main.cpp
; Line 11
 push ebp
 mov ebp, esp
 push ecx
; Line 12
---> push 10     ; 0000000aH
 push OFFSET $SG-5
---> call [email protected]@[email protected]  ; generateHash
 add esp, 8
 mov DWORD PTR _hashValue$[ebp], eax
; Line 13
 mov eax, DWORD PTR _hashValue$[ebp]
 push eax
 push OFFSET $SG-6
 call DWORD PTR __imp__printf
 add esp, 8<div>; Line 15
 xor eax, eax
; Line 16
 mov esp, ebp
 pop ebp
 ret 0
_main ENDP

Minimize Size(/O1)

same as disabled, but the function call is inlined at least.

_main      ; COMDAT
; File e:\temp\x65599\x65599\main.cpp
; Line 12
 xor ecx, ecx
 xor eax, eax
[email protected]:
 movsx edx, BYTE PTR [email protected][email protected]@[email protected][eax]
 imul ecx, 65599    ; 0001003fH
 add ecx, edx
 inc eax
---> cmp eax, 10     ; 0000000aH
---> jb SHORT [email protected]
 mov eax, ecx
 shr eax, 16     ; 00000010H
 xor eax, ecx
; Line 13
 push eax
 push OFFSET [email protected][email protected]@[email protected]
 call DWORD PTR __imp__printf
 pop ecx
 pop ecx
; Line 15
 xor eax, eax
; Line 16
 ret 0
_main ENDP

Maximize Speed(/O2)

You see the first line? push -238617217 which is 0xF1C6FDF? IT IS ALL FLATTENED DOWN TO AN INT!

; Line 13
---> push -238617217    ; f1c6fd7fH
 push OFFSET [email protected][email protected]@[email protected]
 call DWORD PTR __imp__printf
 add esp, 8
; Line 15
 xor eax, eax
; Line 16
 ret 0
_main ENDP

So I opened the resulting .exe file in a text editor to see if there's any string named "funny_bone". And I couldn't find any! YAY!

notepad

Full Optimization(/Ox)

Walla! again!

_main      ; COMDAT
; File e:\temp\x65599\x65599\main.cpp
; Line 13
---> push -238617217    ; f1c6fd7fH
 push OFFSET $SG-6
 call DWORD PTR __imp__printf
 add esp, 8
; Line 15
 xor eax, eax
; Line 16
 ret 0
_main ENDP

But when I searched "funny_bone" in the exe file…

there is funny bone WTF? Why are you there, funny_bone?

Seriously? Full Optimization option doesn't strip unused strings? To be sure I tested with another simple program, but the result was same. Even this very simple string doesn't get stripped out with this compiler option.

void idiot()
{
 const char* idiot = "OMG";
}

I talked to this to Karl, and he found String Pooling option is off for /Ox by default while it is on for both /O1 and /O2. Strange eh? As soon as I enabled C/C++ > Code Generation > Enable String Polling the string disappeared. YAY!

With G++

How can I finish a test without trying G++? I used G++ 4.5.3 to test this with this compiler flags:

g++ *.cpp -pedantic -Wall -S <optimization-flag>

Note that -S flag is to stop generating after assembler

-O0

-O0 flag means no optimization pretty much, so i didn't expect anything. Still strlen() function seems to be converted to 10, at least.

LFE4:
 .def ___main; .scl 2; .type 32; .endef
 .section .rdata,"dr"
LC0:
 .ascii "funny_bone\0"
LC1:
 .ascii "hash value is 0x%8x\12\0"
 .text
.globl _main
 .def _main; .scl 2; .type 32; .endef
_main:
LFB5:
 pushl %ebp
LCFI4:
 movl %esp, %ebp
LCFI5:
 andl $-16, %esp
LCFI6:
 subl $32, %esp
LCFI7:
 call ___main
---> movl $10, 4(%esp)
 movl $LC0, (%esp)
---> call __Z12generateHashPKcj
 movl %eax, 28(%esp)
 movl 28(%esp), %eax
 movl %eax, 4(%esp)
 movl $LC1, (%esp)
 call _printf
 movl $0, %eax
 leave
LCFI8:
 ret

-O1

now generateHash() function is inlined too. But still doing all the calculation.

.def ___main; .scl 2; .type 32; .endef
 .section .rdata,"dr"
LC0:
 .ascii "funny_bone\0"
LC1:
 .ascii "hash value is 0x%8x\12\0"
 .text
.globl _main
 .def _main; .scl 2; .type 32; .endef
_main:
LFB5:
 pushl %ebp
LCFI0:
 movl %esp, %ebp
LCFI1:
 andl $-16, %esp
LCFI2:
 pushl %ebx
LCFI3:
 subl $28, %esp
LCFI4:
 call ___main
 movl $LC0, %eax
 movl $LC0+10, %ebx
 movl $0, %edx
L2:
 imull $65599, %edx, %edx
 movsbl (%eax), %ecx
 addl %ecx, %edx
 addl $1, %eax
---> cmpl %ebx, %eax
---> jne L2
 movl %edx, %eax
 shrl $16, %eax
 xorl %eax, %edx
 movl %edx, 4(%esp)
 movl $LC1, (%esp)
 call _printf
 movl $0, %eax
 addl $28, %esp
 popl %ebx
LCFI5:
 movl %ebp, %esp
LCFI6:
 popl %ebp
LCFI7:
 ret

-O2

Same as -O1. Of course, loop unroll optimization is only enabled with -O3.

 .def ___main; .scl 2; .type 32; .endef
 .section .rdata,"dr"
LC0:
 .ascii "funny_bone\0"
LC1:
 .ascii "hash value is 0x%8x\12\0"
 .text
 .p2align 4,,15
.globl _main
 .def _main; .scl 2; .type 32; .endef
_main:
LFB5:
 pushl %ebp
LCFI0:
 movl %esp, %ebp
LCFI1:
 andl $-16, %esp
LCFI2:
 subl $16, %esp
LCFI3:
 call ___main
 movl $LC0, %eax
 xorl %edx, %edx
 .p2align 4,,7
L2:
 imull $65599, %edx, %edx
 movsbl (%eax), %ecx
 addl $1, %eax
 addl %ecx, %edx
---> cmpl $LC0+10, %eax
---> jne L2
 movl %edx, %eax
 shrl $16, %eax
 xorl %edx, %eax
 movl %eax, 4(%esp)
 movl $LC1, (%esp)
 call _printf
 xorl %eax, %eax
 leave
LCFI4:
 ret

Also I tried to search the string in the exe file, but it was not there.. YAY!

-O3

Finally… Seeing movl $-238617217, 4(%esp)? It's flattened down!

 .def ___main; .scl 2; .type 32; .endef
 .section .rdata,"dr"
LC0:
 .ascii "hash value is 0x%8x\12\0"
 .text
 .p2align 4,,15
.globl _main
 .def _main; .scl 2; .type 32; .endef
_main:
LFB5:
 pushl %ebp
LCFI0:
 movl %esp, %ebp
LCFI1:
 andl $-16, %esp
LCFI2:
 subl $16, %esp
LCFI3:
 call ___main
---> movl $-238617217, 4(%esp)</u></i>
 movl $LC0, (%esp)
 call _printf
 xorl %eax, %eax
 leave
LCFI4:
 ret

-Os

-Os stands for size. And it didn't do good :(

LFE4:
 .def ___main; .scl 2; .type 32; .endef
 .section .rdata,"dr"
LC0:
 .ascii "funny_bone\0"
LC1:
 .ascii "hash value is 0x%8x\12\0"
 .text
.globl _main
 .def _main; .scl 2; .type 32; .endef
_main:
LFB5:
 pushl %ebp
LCFI7:
 movl %esp, %ebp
LCFI8:
 andl $-16, %esp
LCFI9:
 subl $16, %esp
LCFI10:
 call ___main
 movl $10, 4(%esp)
 movl $LC0, (%esp)
---> call __Z12generateHashPKcj
 movl $LC1, (%esp)
 movl %eax, 4(%esp)
 call _printf
 xorl %eax, %eax
 leave
LCFI11:
 ret

Quick Summary

So let me quickly summarize which optimization flags you should use for VS 2010 SP1 and G++ to use this amazing(maybe a bit stupid if you are cynical :P) trick.

Visual Studio 2010 SP1

  • /O2
  • /Ox with Enable String Pooling on

G++ 4.5.3

  • -O3

Debugging

Alright. It's good we removed the string representations from code. The size of the executable is smaller, but it's horrible for debugging. When I got a crash on a bone "named" 0xF1C6FD7F, how do I know which bone I have to look at in 3DS Max? ARRGHH.. Okay, so apparently we really need char* string for at least debugging!

So should I revert everything I did so far? I don't want to. :) This data is only useful for debugging, so I should devise a way of debugging it without affecting the disc build. Actually, it is a very easy problem. I just need a string database, which has both pairs of <hash key, char*> for all the strings saved out from the data baking tools and used in our code.

Generating Debug String Database

In what format, should I store our string database? Using a light-weight SQL DB file is definitely an option. What about SQL-lite? I heard good things about it. Or I can simply outputs a list of pairs of <int, char*> values into a text file. I would probably choose a plain text file, or compressed text file, to make it easier to load the file in C++ codebase. Whatever it is, the filename will bedebug.string_db.

Saving Debug String DB from Tools

So now I just need to change my tools to open debug.string_db file and insert any new string entries into this while it's saving game-ready data.

Very simply done!

Saving Debug String DB from Codebase

Then what do we do with the strings in the code? Luckily enough, I defined HASH_STRING("") macro. I can just write a C# or python script searching through all the text files in my codebase to find any char* strings wrapped by that macro. Regular expression? sure. Then I would run same hash generation code to get the hash key, and write this intodebug.string_dbfile.

Not very simple. But easy and fast enough…. done!

Looking Up String Values

So now the question is how we can easily find what string it is while we are in Visual Studio.

String Lookup Tool

Do you remember using DirectX Error Lookup tool that comes with DirectX SDK? It looks like this:

DX Error Lookup Tool

Maybe I can write a tool like this. This tool would simply opendebug.string_dbfile and find any hash value I type in. Yeah I can justcopy and paste the hash value from Visual Studio Watch window into this. Yup, it's a bit annoying but it's easy to make.

Visual Studio Plug-in?

Next thought I have is something I'm not sure if it's possible because I don't know too much about making a Visual Studio plug-in.

If a Visual Studio plug-in is allowed to read a text file(or a SQL DB) and modifies what shows up in Watch window, maybe I can do this. It's just something I need to take some time and see if it's possible.. I haven't done it so far. I think I'll be fine with the string lookup tool for now

In-Game Hash/String Registry

Or I can make a debug-only hash/string registry, which simply loadsdebug.string_dbfile. Then anyone can find out the actual string representation easily in the code. This will be only available on debug build, and the loading code will simply disappear in the disc build.

Maybe I'm Crazy But

While I was thinking about the last option for looking up string values, I found this is actually very close to how the localization database works. You have a string id(key) and the value of the given language. If you want to swap to different language you will just load different localization database file. In the files, all the keys are same as before, just the string values are different.

So, if one decides to implement the in-game hash/string registry option for debugging, maybe he can use the same architecture for localization? I don't know too much about localization database, so I might be just being stupid. I just had this crazy idea. heh =)

One BIG Problem, Though

I was excited for a while. Then Noel, my buddie and one of the best programmers I've met in this industry, asked me if the loop unroll works if it the string length is longer. So I did a quick test.

With Visual Studio 2010 SP1

I found Visual Studio 2010 SP1 works only up to 10 characters. This is the result I got when I used 11 character-long string "funny_bone1".

_main      ; COMDAT

; File e:\temp\x65599\main.cpp
; Line 12
 xor ecx, ecx
 xor eax, eax
 npad 12
[email protected]:
 movsx edx, BYTE PTR $SG-5[eax]
 imul ecx, 65599    ; 0001003fH
 inc eax
 add ecx, edx
---> cmp eax, 11     ; 0000000bH
---> jb SHORT [email protected]
 mov eax, ecx
 shr eax, 16     ; 00000010H
 xor eax, ecx
; Line 13
 push eax
 push OFFSET $SG-6
 call DWORD PTR __imp__printf
 add esp, 8
; Line 15
 xor eax, eax
; Line 16
 ret 0
_main ENDP

With G++

G++ did a better job. G++ works up to 17 characters. This is the result I got when i used 18 character-long string "funny_bone12345678"

 .def _main; .scl 2; .type 32; .endef
_main:
LFB5:
 pushl %ebp
LCFI0:
 movl %esp, %ebp
LCFI1:
 andl $-16, %esp
LCFI2:
 subl $16, %esp
LCFI3:
 call ___main
 movl $LC0, %eax
 xorl %edx, %edx
 .p2align 4,,7
L2:
 imull $65599, %edx, %edx
 movsbl (%eax), %ecx
 addl $1, %eax
 addl %ecx, %edx
---> cmpl $LC0+18, %eax
---> jne L2
 movl %edx, %eax
 shrl $16, %eax
 xorl %edx, %eax
 movl %eax, 4(%esp)
 movl $LC1, (%esp)
 call _printf
 xorl %eax, %eax
 leave
LCFI4:
 ret
<div>

What Does This Mean?</b>

This means it will optimize strings with length equal to or less than 10 or 17 depending on your compilers. For other strings, it'll do all the calculation during run-time. Is it still worth it? I think so. You will just need to follow this guideline:

  • if possible, keep look-up key strings short
  • don't call HASH_STRING() macro multiple times on a same string. In other words, cache the hash value somewhere(e.g, as a member variable of an object)

Also there's a chance that future compilers do a better job at unrolling this. Up to 64 characters would be nice. (Unfortunately, VS 2011 Preview doesn't do any better job….)

Or constexpr from C++11 can be used for this one day? It's not even being supported by VS 2011 Preview… so whateva……

The best solution would be a compiler switch like this:

inline unsigned int generateHash(const char *string, size_t len)
{
    unsigned int hash = 0;

    #pragma unroll
    for(size_t i = 0; i < len; ++i)
    {
        hash = 65599 * hash + string[i];
    }
    return hash ^ (hash >> 16);
}

This would unroll the loop if len is constant. I believe the IBM compiler supports something similar to this. And we have this kinda switch in HLSL. So why can't we have it in our C++ compilers?

Microsoft, can we please have this option, pretty please?

Walkaround (for now)

Since my original was posted, Mikkel Gjoel let me know Humus had a way to generate compile-time hash more than this limit.

I tested it quickly and it works great. I tested up to 64 chars! There is an usability issue: you can't have a generic generateHash(const char*) function with specialized generateHash(const char &(string)[N]); functions together. Compiler gets confused. So now I have to make two different versions.

From the altdevblogaday link posted in the comment, I found a nice way to get around the striked-through issue:

struct ConstCharWrapper
{
  inline ConstCharWrapper(const char* str) : m_str(str) {}
  const char* m_str;
};

inline unsigned int generateHash(ConstCharWrapper wrapper, size_t len)
{
  const char* string = wrapper.m_str;

  // same as before
}

Also I found VS 2010 is not smart enough to know these two are same, thus not flattening down the first even with my way: (g++ is smart enough to know it).

#1

const char * const funny = "funny_bone";
unsigned int hashValue = HASH_STRING(funny);

#2

unsigned int hashValue = HASH_STRING("funny_bone");

But using #1 confuses regex when generating the debug string DB anyways, so probably I can just ignore it. But how would I force educate other programmers to use the first form if the string is const? That's something I'm trying to figure out.

Anyways, if I figure out a nice way to use this new info in a less painful way, I'll write another post. This post has gotten too long already.. ugggh…