Cohesive Digests for Ints and Floats

While developing sdxtra, I ran into the problem of calculating cohesive digests for numeric values, agnostic to their concrete types. This post is a brief summary of the solution I came up with.

The Cohesive Digesting Problem

Numeric Values and Their Representations

In most programming languages, numeric values are represented by a variety of types, such as float64, int32, uint64, etc., to meet different requirements of precision, range, and memory usage. Internally, these types differ in bit length, layout, and signedness. Therefore, even when they refer to the same mathematical number, their binary representations may differ.

package main

import (
	"fmt"
	"math"
)

func main() {
	fmt.Printf("%016x %x\n", uint64(42), math.Float64bits(42.0))
	// 000000000000002a 4045000000000000
}

Digesting

Digesting is a process that maps arbitrary data to a fixed-length string, called a digest. It ensures that identical values are always associated with the same digest string, while different values hopefully map to distinct ones. This property lets us compare values by comparing their digests, which is useful in many scenarios such as data deduplication and data indexing.

During the digesting process, one first converts the input data into a byte sequence following a designated scheme, and a well-established algorithm such as SHA-256 then calculates the digest string from that sequence. The scheme should be chosen carefully, so that the resulting byte sequence faithfully encodes the information of the input data.

package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

type Foo struct {
	A uint32
	B string
}

func main() {
	foo := Foo{42, "hello"}
	h := sha256.New()
	h.Write(binary.BigEndian.AppendUint32(nil, foo.A))
	h.Write([]byte(foo.B))
	fmt.Printf("%x\n", h.Sum(nil))
}

Above is a simple example of digesting a struct Foo with the SHA-256 algorithm. We first encode the uint32 value foo.A in 4-byte big-endian format, then append all bytes of the string value foo.B and finalize the byte sequence.

Type Leakage in Digesting

Straightforward as it seems, this code makes the digest record more than we expect. The scheme encodes not only the value 42 but also the type information of uint32 – the 4-byte length and the big-endian format both shape the byte sequence. If foo.A is later changed to int64 or float32, the digest changes, even though the value is still 42.

Well… this might frustrate our users, most of whom are unaware of the distinction between ints and floats. Imagine they query our application with the same set of parameters, only to find that resources that existed yesterday are gone today, owing to a midnight upgrade that changed the field type.

The Problem

Thus we arrive at the cohesive digesting problem: how do we calculate digests for numeric values such that their types do not leak into the results?

To simplify the problem, this post focuses on developing an injective mapping scheme that converts uint64 and float64 values to byte sequences, regardless of their types. Towards this goal, there are two possible directions:

  1. Convert uint64 and float64 to some “super type” X, then convert X to byte sequence.
  2. Develop respective dedicated schemes for uint64 and float64.

Attempt 1: Find “Super Type” X

The first attempt is natural – we purge the type information by introducing a “super type” X. This type should be able to cover all possible values of uint64 and float64, and should be able to convert between them without loss of information.

We are going to investigate two candidates: float64 and big.Float.

float64 as Super Type

One might easily think of float64 as the super type, as it has a wide value range of roughly $-10^{308} \sim 10^{308}$, seemingly enough to cover all possible uint64 values.

This is, however, not true. The distribution of float64 values becomes “sparse” as magnitudes grow, due to limited precision. Only integers with absolute value $\leq 2^{53}$ can be represented exactly in float64; larger ones are rounded to the nearest representable value. The following code snippet demonstrates that large integers are not preserved when coerced into float64:

package main

func main() {
	println(float64(1<<53) == float64(1<<53+1)) // true
}

Obviously, float64 must be crossed out from the list.

big.Float as Super Type

If float64 is not suitable, how about types with more precision?

big.Float is a handy type in Go's math/big package that emulates arbitrary-precision floating-point numbers. One can raise the precision of big.Float so that it covers all possible uint64 and float64 values, then convert the value to a byte sequence.

func WriteUint64(v uint64, h hash.Hash) {
	var f big.Float
	f.SetPrec(128)
	f.SetUint64(v)
	b, _ := f.GobEncode()
	h.Write(b)
}

func WriteFloat64(v float64, h hash.Hash) {
	var f big.Float
	f.SetPrec(128)
	f.SetFloat64(v)
	b, _ := f.GobEncode()
	h.Write(b)
}

Here we set the precision of big.Float to 128 bits, such that conversion from either uint64 or float64 to big.Float won’t be truncated. We then serialize the big.Float value using GobEncode method, which converts the value to a byte sequence.

This solution is neat, yet with several drawbacks:

  • GobEncode is designed for serialization, not digesting; there is no guarantee of an injection between values and byte sequences (though it happens to hold for now).
  • It depends on the specific implementation of GobEncode, which might change in the future.
  • It cannot handle NaN values: SetFloat64 panics if the input is NaN. If NaN support is a must, the solution is not applicable.
  • It is hard to replicate in other languages. Since math/big is unique to Go, parts of the system written in other languages will struggle to align with this specific behavior.

Of course, one could roll their own digesting scheme based on big.Float. I instead chose to develop a dedicated scheme for uint64 and float64, as described in the next section.

Attempt 2 (Final): Develop Dedicated Schemes

To avoid the drawbacks of the first attempt, I have developed a customized scheme that cohesively maps a number x to a 17-byte sequence b. The scheme is designed as follows:

  • b[0] is the category flag. It is one of 0 (zero), 1 (positive), 2 (negative), 3 (positive inf), 4 (negative inf), 5 (NaN), indicating the characteristic of the digested number.
  • If x is zero or not finite (b[0] is 0, 3, 4, 5),
    • let b[1:17] be all zeros.
  • If x is non-zero, finite, and abs(x) < 2^64,
    • let b[1:9] be the little-endian representation of floor(abs(x)),
    • let b[9:17] be the IEEE-754 representation of frac(abs(x)).
  • If x is non-zero, finite, and abs(x) >= 2^64,
    • let b[1:9] be the IEEE-754 representation of abs(x),
    • let b[9:17] be all zeros.

A reference implementation in Go is as follows (adapted from sdxtra/…/helper.go):

const (
	zero     = 0
	positive = 1
	negative = 2
	inf      = 3
	neginf   = 4
	nan      = 5
)

func WriteFloat64(x float64, h hash.Hash) {
	buf := make([]byte, 17)
	switch {
	case x == 0:
		buf[0] = zero
	case math.IsInf(x, 1):
		buf[0] = inf
	case math.IsInf(x, -1):
		buf[0] = neginf
	case math.IsNaN(x):
		buf[0] = nan
	default:
		goto NORMAL
	}
	h.Write(buf)
	return
NORMAL:
	if x < 0 {
		buf[0] = negative
		x = -x
	} else {
		buf[0] = positive
	}
	i, f := math.Modf(x)
	if i < 1<<64 {
		binary.LittleEndian.PutUint64(buf[1:], uint64(i))
	} else {
		binary.LittleEndian.PutUint64(buf[1:], math.Float64bits(x))
	}
	binary.LittleEndian.PutUint64(buf[9:], math.Float64bits(f))
	h.Write(buf)
}

func WriteUint64(x uint64, h hash.Hash) {
	buf := make([]byte, 17)
	if x > 0 {
		buf[0] = positive
		binary.LittleEndian.PutUint64(buf[1:], x)
	}
	h.Write(buf)
}

Discussion on the Scheme and Implementation

There are several points worth discussing about the scheme and within the implementation:

Category Flag

This scheme uses the first byte to signify special values, which helps handle some quirky cases of IEEE-754 floating-point numbers, including:

  1. Zeros. IEEE-754 has both positive and negative zeros, which differ in bit representation but should be treated as the same value.
  2. NaNs. IEEE-754 has multiple bit representations for NaN, all of which should be treated as the same value during digesting.

Short-path for Uint64

The scheme enables direct conversion from uint64 to a byte sequence, without first converting to float64. This saves some computation, and it also preserves uint64 values above 2^53 exactly, which a detour through float64 could not.

math.Modf

For some values, the scheme involves decomposition into their integral and fractional parts. This operation has several advantages:

  1. It forms an injection: distinct values of x correspond to distinct pairs (floor(x), frac(x)).
  2. It can be implemented via bit operations in most programming languages, efficiently and consistently across platforms.

One can try to verify these properties by referring to IEEE-754 standards.


Author: hsfzxjy.
Link: .
License: CC BY-NC-ND 4.0.
All rights reserved by the author.
Commercial use of this post in any form is NOT permitted.
Non-commercial use of this post should be attributed with this block of text.
