Cohesive Digests for Ints and Floats
While developing sdxtra, I ran into the problem of calculating cohesive digests for numeric values, agnostic to their concrete types. This post is a brief summary of the solution I came up with.
The Cohesive Digesting Problem
Numeric Values and Their Representations
In most programming languages, numeric values are represented by various types, such as float64, int32, uint64, etc., to meet different requirements of precision, range, and memory usage. Internally, these types may differ in bit length, layout, or signedness. Therefore, even when referring to the same number in mathematics, they may have different binary representations.
package main
import (
    "fmt"
    "math"
)
func main() {
    fmt.Printf("%016x %x\n", uint64(42), math.Float64bits(42.0))
    // 000000000000002a 4045000000000000
}
Digesting
Digesting is a process that maps arbitrary data to a fixed-length string, called a digest. It ensures that identical values are always associated with the same digest string, while different values hopefully map to distinct ones. This property enables us to compare values by comparing their digests, which is useful in many scenarios such as data deduplication, data indexing, etc.
During the digesting process, one first converts the input data into a byte sequence following a designated scheme, from which a well-established algorithm such as SHA-256 then computes the digest string. The scheme should be chosen carefully, so that the resulting byte sequence faithfully encodes the information in the input data.
package main
import (
    "crypto/sha256"
    "encoding/binary"
    "fmt"
)
type Foo struct {
    A uint32
    B string
}
func main() {
    foo := Foo{42, "hello"}
    h := sha256.New()
    h.Write(binary.BigEndian.AppendUint32(nil, foo.A))
    h.Write([]byte(foo.B))
    fmt.Printf("%x\n", h.Sum(nil))
}
Above is a simple example of digesting a struct Foo using the SHA-256 algorithm. We first encode the uint32 value foo.A in 4-byte big-endian format, then append all bytes of the string value foo.B and finalize the byte sequence.
Type Leakage in Digesting
Straightforward as it seems, this code produces a digest that records more than we expect. The scheme encodes not only the value 42 but also the type information of uint32: the 4-byte length and the big-endian format together shape the byte sequence. If foo.A is later changed to int64 or float32, the digest string changes, even though the value is still 42.
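To make the leakage concrete, here is a small sketch of my own (not part of the original example) that digests the mathematically identical value 42 twice, once encoded as a 4-byte uint32 and once as an 8-byte float64; the two digests differ solely because of the encoding:
package main
import (
    "crypto/sha256"
    "encoding/binary"
    "fmt"
    "math"
)
func main() {
    // The same mathematical value 42, digested as a 4-byte uint32...
    h1 := sha256.New()
    h1.Write(binary.BigEndian.AppendUint32(nil, 42))
    // ...and as an 8-byte float64.
    h2 := sha256.New()
    h2.Write(binary.BigEndian.AppendUint64(nil, math.Float64bits(42.0)))
    // The digests differ, purely because of the encoding.
    fmt.Printf("%x\n%x\n", h1.Sum(nil), h2.Sum(nil))
}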
Well… this might frustrate our users, most of whom are unaware of the distinction between ints and floats. Imagine they query our application with the same set of parameters, only to find that resources that existed yesterday are gone now, owing to a midnight upgrade that changed the field type.
The Problem
Thus we have the cohesive digesting problem: how do we calculate digests for numeric values such that their types do not leak into the resulting digests?
To simplify the problem, this post focuses on developing an injective mapping scheme for converting uint64 and float64 to byte sequences, regardless of which of the two types holds the value. Towards this goal, there are two possible solutions:
- Convert uint64 and float64 to some “super type” X, then convert X to a byte sequence.
- Develop respective dedicated schemes for uint64 and float64.
Attempt 1: Find “Super Type” X
The first attempt is natural: we purge the type information by introducing a “super type” X. This type should be able to cover all possible values of uint64 and float64, and should be convertible from either of them without loss of information.
We are going to investigate two candidates: float64 and big.Float.
float64 as Super Type
One might easily think of float64 as the super type, since it has a wide value range of $-10^{308} \sim 10^{308}$, seemingly enough to cover all possible uint64 values.
This is, however, not true. In fact, the distribution of float64s becomes “sparse” for large values, due to limited precision. Only integers $\leq 2^{53}$ can be represented exactly in float64, while larger ones are rounded to the nearest representable value. The following code snippet demonstrates that large integers are not well preserved when coerced into float64:
package main
func main() {
    println(float64(1<<53) == float64(1<<53+1)) // true
}
Obviously, float64 must be crossed out from the list.
big.Float as Super Type
If float64 is not suitable, how about a type with more precision? big.Float is a handy type in Go that emulates arbitrary-precision floating-point numbers. One can tweak the precision of big.Float so that it covers all possible uint64 and float64 values, then convert the value to a byte sequence.
func WriteUint64(v uint64, h hash.Hash) {
    var f big.Float
    f.SetPrec(128)
    f.SetUint64(v)
    b, _ := f.GobEncode()
    h.Write(b)
}
func WriteFloat64(v float64, h hash.Hash) {
    var f big.Float
    f.SetPrec(128)
    f.SetFloat64(v)
    b, _ := f.GobEncode()
    h.Write(b)
}
Here we set the precision of big.Float to 128 bits, such that conversion from either uint64 or float64 to big.Float won't be truncated. We then serialize the big.Float value using the GobEncode method, which converts the value to a byte sequence.
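As a quick sanity check (a sketch of mine, not from the original code), digesting 42 via a uint64-backed big.Float and via a float64-backed one should yield the same bytes, and hence the same digest:
package main
import (
    "crypto/sha256"
    "fmt"
    "math/big"
)
// digest hashes the gob encoding of a big.Float with SHA-256.
func digest(f *big.Float) string {
    b, _ := f.GobEncode()
    return fmt.Sprintf("%x", sha256.Sum256(b))
}
func main() {
    var a, b big.Float
    a.SetPrec(128).SetUint64(42)
    b.SetPrec(128).SetFloat64(42.0)
    // Both lines should print the same digest, since 42 (uint64) and
    // 42.0 (float64) collapse to the same 128-bit big.Float value.
    fmt.Println(digest(&a))
    fmt.Println(digest(&b))
}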
This solution is neat, yet it has several drawbacks:
- GobEncode is designed for serialization, not digesting, with no guarantee of an injection between values and byte sequences (though the current implementation happens to provide one).
- It depends on the specific implementation of GobEncode, which might change in the future.
- It cannot handle NaN values: SetFloat64 panics if the input is NaN (see the sketch below). If NaN support is a must, this solution is not applicable.
- It is hard to replicate this solution in other languages. As math/big is unique to Go, parts of the system written in other languages will struggle to align with this specific behavior.
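The NaN limitation is easy to reproduce. The following sketch (mine, not from the post) recovers from the panic that SetFloat64 raises when given a NaN:
package main
import (
    "fmt"
    "math"
    "math/big"
)
func main() {
    defer func() {
        // big.Float.SetFloat64 panics with big.ErrNaN for NaN inputs.
        if r := recover(); r != nil {
            fmt.Println("recovered:", r)
        }
    }()
    var f big.Float
    f.SetPrec(128)
    f.SetFloat64(math.NaN()) // panics here
    fmt.Println("never reached")
}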
Of course, one could roll their own digesting scheme based on big.Float. I instead chose to develop a dedicated scheme for uint64 and float64, as described in the next section.
Attempt 2 (Final): Develop Dedicated Schemes
To avoid the drawbacks of the first attempt, I have developed a customized scheme that cohesively maps a number x to a 17-byte sequence b. The scheme is designed as follows:
- b[0] is the category flag. It could be 0 (zero), 1 (positive), 2 (negative), 3 (positive inf), 4 (negative inf) or 5 (NaN), indicating the characteristic of the digested number.
- If x is zero or not finite (b[0] is 0, 3, 4, 5),
  - let b[1:17] be all zeros;
- If x is non-zero finite and abs(x) < 2^64,
  - let b[1:9] be the little-endian representation of floor(abs(x)),
  - let b[9:17] be the IEEE-754 representation of frac(abs(x));
- If x is non-zero finite and abs(x) >= 2^64,
  - let b[1:9] be the IEEE-754 representation of abs(x),
  - let b[9:17] be all zeros.
A reference implementation in Go is as follows (adapted from sdxtra/…/helper.go):
const (
    zero     = 0
    positive = 1
    negative = 2
    inf      = 3
    neginf   = 4
    nan      = 5
)
func WriteFloat64(x float64, h hash.Hash) {
    buf := make([]byte, 17)
    switch {
    case x == 0:
        buf[0] = zero
    case math.IsInf(x, 1):
        buf[0] = inf
    case math.IsInf(x, -1):
        buf[0] = neginf
    case math.IsNaN(x):
        buf[0] = nan
    default:
        goto NORMAL
    }
    // Zero or non-finite: buf[1:17] stays all zeros.
    h.Write(buf)
    return
NORMAL:
    if x < 0 {
        buf[0] = negative
        x = -x
    } else {
        buf[0] = positive
    }
    i, f := math.Modf(x)
    if i < 1<<64 {
        // abs(x) < 2^64: little-endian integral part.
        binary.LittleEndian.PutUint64(buf[1:], uint64(i))
    } else {
        // abs(x) >= 2^64: IEEE-754 bits of abs(x); the fraction is zero here.
        binary.LittleEndian.PutUint64(buf[1:], math.Float64bits(x))
    }
    // IEEE-754 bits of the fractional part.
    binary.LittleEndian.PutUint64(buf[9:], math.Float64bits(f))
    h.Write(buf)
}
func WriteUint64(x uint64, h hash.Hash) {
    buf := make([]byte, 17)
    if x > 0 {
        buf[0] = positive
        binary.LittleEndian.PutUint64(buf[1:], x)
    }
    // For x == 0 the buffer stays all zeros (category flag zero);
    // buf[9:17] is always zero, matching a zero fractional part.
    h.Write(buf)
}
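Below is a quick usage sketch (mine, not part of the original helpers); it assumes WriteFloat64 and WriteUint64 above are defined in the same package. Digesting 42 through either entry point should yield identical digests, while a genuinely different value such as 42.5 yields a different one:
package main
// Assumes WriteFloat64 and WriteUint64 from the reference
// implementation above are compiled into this package.
import (
    "crypto/sha256"
    "fmt"
)
func main() {
    h1 := sha256.New()
    WriteUint64(42, h1)
    h2 := sha256.New()
    WriteFloat64(42.0, h2)
    h3 := sha256.New()
    WriteFloat64(42.5, h3)
    // h1 and h2 should print the same digest; h3 should differ.
    fmt.Printf("%x\n%x\n%x\n", h1.Sum(nil), h2.Sum(nil), h3.Sum(nil))
}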
Discussion on the Scheme and Implementation
There are several points worth discussing about the scheme and its implementation:
Category Flag
This scheme uses the first byte to signify special values, which helps handle some quirky cases of IEEE-754 floating-point numbers, including the following (both are demonstrated in the sketch after this list):
- Zeros: IEEE-754 has both positive and negative zeros, which differ in bit representation but should be treated as the same value.
- NaNs: IEEE-754 has multiple representations for NaN, all of which should be treated as the same value during digesting.
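Here is a tiny sketch of mine illustrating both quirks: the raw IEEE-754 bits differ even though the values compare equal (for zeros) or should be considered the same during digesting (for NaNs):
package main
import (
    "fmt"
    "math"
)
func main() {
    negZero := math.Copysign(0, -1)
    // +0 and -0 compare equal, yet their bit patterns differ.
    fmt.Println(0.0 == negZero) // true
    fmt.Printf("%016x %016x\n", math.Float64bits(0), math.Float64bits(negZero))
    // Two distinct NaN bit patterns; both should digest identically.
    nan1 := math.Float64frombits(0x7ff8000000000001)
    nan2 := math.Float64frombits(0x7ff8000000000002)
    fmt.Println(math.IsNaN(nan1), math.IsNaN(nan2)) // true true
    fmt.Printf("%016x %016x\n", math.Float64bits(nan1), math.Float64bits(nan2))
}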
Short Path for uint64
The scheme enables direct conversion from uint64 to a byte sequence, without the need to convert to float64 first. This saves some computation and speeds up the digesting process.
math.Modf
For non-zero finite values below 2^64, the scheme decomposes the number into its integral and fractional parts. This operation has several advantages:
- It forms an injection; that is, different values of x correspond to different pairs (floor(x), frac(x)).
- It can be implemented via bit operations in most programming languages, with good efficiency and consistent behavior across implementations.
One can verify these properties by referring to the IEEE-754 standard.
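As an informal check (my own snippet), one can also verify that the integral/fractional pair reconstructs the original value exactly, which is what makes the decomposition injective:
package main
import (
    "fmt"
    "math"
)
func main() {
    samples := []float64{0.5, 1.25, 3.0, 1e15 + 0.5, 123456789.987654321}
    for _, x := range samples {
        i, f := math.Modf(x)
        // Modf guarantees i + f == x exactly, so x is recoverable from
        // the pair (i, f): the mapping x -> (i, f) is injective.
        fmt.Println(x, i, f, i+f == x)
    }
}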
Author: hsfzxjy.
Link: .
License: CC BY-NC-ND 4.0.
All rights reserved by the author.
Commercial use of this post in any form is NOT permitted.
Non-commercial use of this post should be attributed with this block of text.