NOTE: This is for Windows hosts only!
Just posting some ideas from Dave Aitel here (so they're not forgotten).
"I've always wondered at the use of MD5 for file determination of malware. Seems like it's time for something a bit more of a curved function than that. You want to determine not only file identity, but file closeness. Personally I'd probably unpack them, then design a vector of and then I'd just do vector differences from each other. Another option is to run them in a sandbox, and just record their use of APIs as a vector.
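(Aside, not part of Dave's text: a rough sketch of the "API usage as a vector" idea. It assumes each sandbox run gives you a plain list of API names. The traces and the function names `api_vector` and `cosine_similarity` are made up for illustration, and cosine similarity is only one possible closeness measure, not necessarily the one Dave had in mind.)

```python
import math
from collections import Counter


def api_vector(trace):
    """Turn a recorded trace (list of API names) into a sparse count vector."""
    return Counter(trace)


def cosine_similarity(a, b):
    """Return the cosine similarity of two sparse count vectors (0.0 to 1.0)."""
    dot = sum(a[name] * b[name] for name in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty trace is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical traces; real ones would come from a sandbox log.
sample_a = ["CreateFileW", "WriteFile", "RegSetValueExW", "CreateProcessW"]
sample_b = ["CreateFileW", "WriteFile", "WriteFile", "RegSetValueExW", "CreateProcessW"]
sample_c = ["GetMessageW", "DispatchMessageW", "CreateFileW"]

vec_a, vec_b, vec_c = (api_vector(t) for t in (sample_a, sample_b, sample_c))

print("a vs b:", round(cosine_similarity(vec_a, vec_b), 3))
print("a vs c:", round(cosine_similarity(vec_a, vec_c), 3))
```

Unlike an MD5 match, the score degrades gradually: a variant that makes one extra `WriteFile` call still lands close to the original, while an unrelated program scores low.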
You can probably devolve each API call into a tuple and use that as a direction in an N-dimensional space and do some simple pattern matching as your HIDS as well. That way your HIDS would not only recognize one