Cohesive Digests for Ints and Floats
While developing sdxtra, I ran into the problem of calculating cohesive digests for numeric values, in a way agnostic to their concrete types. This post discusses the problem and presents solutions in Go, though the ideas apply to other languages as well.
The Cohesive Digesting Problem
Numeric Values and Their Representations
In most programming languages, numeric values are represented by various types, such as float64, int32, uint64, etc., to meet different requirements of precision, range, and memory usage. Internally, these types differ in bit length, layout, and signedness. Therefore, even when referring to the same mathematical number, they may have different bit representations.
package main

import (
	"fmt"
	"math"
)

func main() {
	fmt.Printf("%016x %x\n", uint64(42), math.Float64bits(42.0))
	// 000000000000002a 4045000000000000
}
Digesting
Digesting is a process that maps arbitrary data to a fixed-length string, called a digest.
Digests are deterministic and discriminative. Deterministic means the same input data always generates the same digest string; discriminative means different input data will, hopefully, generate different digest strings. Although collisions are inevitable by the pigeonhole principle, modern digesting algorithms like SHA-256 are designed to make them vanishingly improbable, which enables the use of digests in scenarios like data deduplication and data indexing.
Modern digesting algorithms are byte-oriented. During the digesting process, one first converts the input data into a byte sequence following a designated scheme, from which an algorithm like SHA-256 then calculates the digest string. The scheme should be carefully chosen, so that the resulting byte sequence faithfully reflects the information in the input data.
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

type Foo struct {
	A uint32
	B string
}

func main() {
	foo := Foo{42, "hello"}
	h := sha256.New()
	h.Write(binary.BigEndian.AppendUint32(nil, foo.A))
	h.Write([]byte(foo.B))
	fmt.Printf("%x\n", h.Sum(nil))
}
Above is a simple example of digesting a struct Foo using the SHA-256 algorithm. The scheme for byte-sequence generation is straightforward: we first encode the uint32 value foo.A in a 4-byte big-endian format, then append all bytes of the string value foo.B to the end, finalizing the byte sequence.
Type Leakage in Digesting
One might easily derive such a digesting scheme, which, however, can encode more information than we expect.
The problem lies in type-information leakage. For the field foo.A, we not only encode the numeric value 42; the way we encode it also reveals the type of the field – the 4-byte length and the big-endian format together constitute the byte sequence. If we later switch to int64 or float32 but keep encoding by bit representation, the digest changes accordingly.
Well… this might frustrate our users, most of whom are unaware of the distinction between ints and floats. Imagine they query our application with the same set of parameters, only to find that resources that existed yesterday are gone now, owing to a midnight upgrade that modified a field's type. What a nightmare.
The Problem
Thereby we have the cohesive digesting problem: how do we calculate digests for numeric values such that their types do not leak into the digested results?
As said, the problem is equivalent to finding a type-agnostic injection that converts numeric values to byte sequences. To simplify matters, this post focuses on handling the uint64 and float64 types, as they are the most generic numeric types in Go. Towards this goal, there are two possible solutions:
- Convert uint64 and float64 to some “super type” X, then convert X to a byte sequence.
- Develop dedicated schemes for uint64 and float64 respectively.
Attempt 1: Find “Super Type” X
The first idea is natural – we purge the type information by introducing a “super type” X. This type should cover all possible values of uint64 and float64, and conversions into it should lose no information.
Considering that integers are a subset of the real numbers, we will investigate two candidates: float64 and big.Float.
float64 as Super Type
One might easily think of float64 as the super type, as it has a wide value range of $-10^{308} \sim 10^{308}$, seemingly enough to cover all possible uint64 values.
This is however not true. In fact, the distribution of float64 values becomes “sparse” as magnitudes grow, due to limited precision. Only integers $\leq 2^{53}$ can be represented exactly in float64; larger ones are rounded to the nearest representable value. The following code snippet demonstrates that large integers are not preserved when coerced into float64:
package main

func main() {
	println(float64(1<<53) == float64(1<<53+1)) // true
}
Obviously, float64 must be crossed out from the list.
big.Float as Super Type
If float64 is not suitable, how about types with more precision?
big.Float is a handy type in Go that emulates arbitrary-precision floating-point numbers. One can tweak the precision with (big.Float).SetPrec to cover all possible uint64 and float64 values, then convert the value to a byte sequence.
func WriteUint64(v uint64, h hash.Hash) {
	var f big.Float
	f.SetPrec(128)
	f.SetUint64(v)
	b, _ := f.GobEncode()
	h.Write(b)
}

func WriteFloat64(v float64, h hash.Hash) {
	var f big.Float
	f.SetPrec(128)
	f.SetFloat64(v)
	b, _ := f.GobEncode()
	h.Write(b)
}
Here we set the precision of big.Float to 128 bits, so that conversion from either uint64 or float64 to big.Float is never truncated. We then serialize the big.Float value with its GobEncode method, which converts the value to a byte sequence.
This solution is neat, yet it has several drawbacks:
- GobEncode is designed for serialization, not digesting, with no guarantee of an injection between values and byte sequences (though it holds for now).
- It depends on the specific implementation of GobEncode, which might change in the future.
- It cannot handle NaN values: SetFloat64 panics if the input is NaN. If NaN support is a must, the solution is not applicable.
- It is hard to replicate this solution in other languages. As math/big is unique to Go, parts of the system written in other languages will struggle to align with this specific behavior.
Of course, one could roll their own digesting scheme based on big.Float. I instead chose to develop a dedicated scheme for uint64 and float64, as described in the next section.
Attempt 2 (Final): Develop Dedicated Schemes
To avoid the drawbacks of the first attempt, I have developed a customized scheme that cohesively maps a number x to a 17-byte sequence b. The scheme is designed as follows:
- b[0] is the category flag. It can be 0 (zero), 1 (positive), 2 (negative), 3 (positive inf), 4 (negative inf) or 5 (NaN), indicating the characteristic of the digested number.
- If x is zero or not finite (b[0] is 0, 3, 4 or 5), let b[1:17] be all zeros.
- If x is non-zero finite and abs(x) < 2^64,
  - let b[1:9] be the little-endian representation of floor(abs(x)),
  - let b[9:17] be the IEEE-754 representation of frac(abs(x)).
- If x is non-zero finite and abs(x) >= 2^64,
  - let b[1:9] be the IEEE-754 representation of abs(x),
  - let b[9:17] be all zeros.
A reference implementation in Go is as follows (adapted from sdxtra/…/helper.go):
const (
	zero     = 0
	positive = 1
	negative = 2
	inf      = 3
	neginf   = 4
	nan      = 5
)

func WriteFloat64(x float64, h hash.Hash) {
	buf := make([]byte, 17)
	switch {
	case x == 0:
		buf[0] = zero
	case math.IsInf(x, 1):
		buf[0] = inf
	case math.IsInf(x, -1):
		buf[0] = neginf
	case math.IsNaN(x):
		buf[0] = nan
	default:
		goto NORMAL
	}
	h.Write(buf)
	return

NORMAL:
	if x < 0 {
		buf[0] = negative
		x = -x
	} else {
		buf[0] = positive
	}
	i, f := math.Modf(x)
	if i < 1<<64 {
		binary.LittleEndian.PutUint64(buf[1:], uint64(i))
	} else {
		binary.LittleEndian.PutUint64(buf[1:], math.Float64bits(x))
	}
	binary.LittleEndian.PutUint64(buf[9:], math.Float64bits(f))
	h.Write(buf)
}
func WriteUint64(x uint64, h hash.Hash) {
	buf := make([]byte, 17)
	if x > 0 {
		buf[0] = positive
		binary.LittleEndian.PutUint64(buf[1:], x)
	}
	h.Write(buf)
}
Discussion on the Scheme and Implementation
There are several points worth discussing about the final scheme and its implementation:
Category Flag
This scheme uses the first byte to signify special values, which helps handle some quirky cases of IEEE-754 floating-point numbers:
- Zeros: IEEE-754 has both positive and negative zeros, which differ in bit representation but should be treated as the same value.
- NaNs: IEEE-754 has multiple bit representations for NaN, all of which should be treated as the same value during digesting.
Short-path for Uint64
The scheme enables direct conversion from uint64 to a byte sequence, without first converting to float64. This saves some computation and accelerates the digesting process.
math.Modf
For non-zero finite values, the scheme decomposes the number into its integral and fractional parts. This operation has several advantages:
- It forms an injection: different values of x correspond to different pairs (floor(x), frac(x)).
- It can be implemented via bit operations in most programming languages, with promising efficiency and strong consistency.
One can verify these properties by referring to the IEEE-754 standard.
Author: hsfzxjy.
Link: .
License: CC BY-NC-ND 4.0.
All rights reserved by the author.
Commercial use of this post in any form is NOT permitted.
Non-commercial use of this post should be attributed with this block of text.