Cohesive Digests for Ints and Floats

While developing sdxtra, I ran into the problem of calculating cohesive digests for numeric values, agnostic to their concrete types. This post is a brief summary of the solution I came up with.

The Cohesive Digesting Problem

Numeric Values and Their Representations

In most programming languages, numeric values are represented by a variety of types, such as float64, int32, uint64, etc., to meet different requirements of precision, range, and memory usage. Internally, these types differ in bit length, layout, and signedness. Therefore, even when they refer to the same mathematical number, their binary representations may differ.

package main

import (
	"fmt"
	"math"
)

func main() {
	fmt.Printf("%016x %x\n", uint64(42), math.Float64bits(42.0))
	// 000000000000002a 4045000000000000
}

Digesting

Digesting is a process that maps arbitrary data to a fixed-length string, called a digest. It ensures that identical values are always associated with the same digest string, while different values hopefully map to distinct ones. This property lets us compare values by comparing their digests, which is useful in many scenarios such as data deduplication and data indexing.

During the digesting process, one first converts the input data into a byte sequence following a designated scheme, and a well-established algorithm such as SHA-256 then calculates the digest string from that sequence. The scheme should be chosen carefully, so that the resulting byte sequence faithfully encodes the information of the input data.

package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

type Foo struct {
	A uint32
	B string
}

func main() {
	foo := Foo{42, "hello"}
	h := sha256.New()
	h.Write(binary.BigEndian.AppendUint32(nil, foo.A))
	h.Write([]byte(foo.B))
	fmt.Printf("%x\n", h.Sum(nil))
}

Above is a simple example of digesting a struct Foo with the SHA-256 algorithm. We first encode the uint32 value foo.A in 4-byte big-endian format, then append all bytes of the string value foo.B and finalize the byte sequence.

Type Leakage in Digesting

Straightforward as it seems, this code makes the digest record more than we expect. The scheme encodes not only the value 42 but also the type information of uint32 – the 4-byte length and the big-endian format both shape the byte sequence. If foo.A is later changed to int64 or float32, the digest changes, even though the value is still 42.

Well… this might frustrate our users, most of whom are unaware of the distinction between ints and floats. Imagine they query our application with the same set of parameters, only to find that resources that existed yesterday are gone today, owing to a midnight upgrade that changed the field type.

The Problem

Thus we arrive at the cohesive digesting problem: how do we calculate digests for numeric values such that their types do not leak into the results?

To simplify the problem, this post focuses on developing an injective mapping scheme that converts uint64 and float64 values to byte sequences, regardless of their types. Towards this goal, there are two possible directions:

  1. Convert uint64 and float64 to some “super type” X, then convert X to byte sequence.
  2. Develop respective dedicated schemes for uint64 and float64.

Attempt 1: Find “Super Type” X

The first attempt is natural – we purge the type information by introducing a “super type” X. This type should be able to cover all possible values of uint64 and float64, and should be able to convert between them without loss of information.

We are going to investigate two candidates: float64 and big.Float.

float64 as Super Type

One might easily think of float64 as the super type, as it has a wide value range of roughly $-10^{308} \sim 10^{308}$, seemingly enough to cover all possible uint64 values.

This is, however, not true. The distribution of float64 values becomes “sparse” as magnitudes grow, due to limited precision. Only integers with absolute value $\leq 2^{53}$ can be represented exactly in float64; larger ones are rounded to the nearest representable value. The following code snippet demonstrates that large integers are not preserved when coerced into float64:

package main

func main() {
	println(float64(1<<53) == float64(1<<53+1)) // true
}

Obviously, float64 must be crossed out from the list.

big.Float as Super Type

If float64 is not suitable, how about types with more precision?

big.Float is a handy type in Go's math/big package that emulates arbitrary-precision floating-point numbers. One can raise the precision of big.Float so that it covers all possible uint64 and float64 values, then convert the value to a byte sequence.

func WriteUint64(v uint64, h hash.Hash) {
	var f big.Float
	f.SetPrec(128)
	f.SetUint64(v)
	b, _ := f.GobEncode()
	h.Write(b)
}

func WriteFloat64(v float64, h hash.Hash) {
	var f big.Float
	f.SetPrec(128)
	f.SetFloat64(v)
	b, _ := f.GobEncode()
	h.Write(b)
}

Here we set the precision of big.Float to 128 bits, such that conversion from either uint64 or float64 to big.Float won’t be truncated. We then serialize the big.Float value using GobEncode method, which converts the value to a byte sequence.

This solution is neat, yet with several drawbacks:

  • GobEncode is designed for serialization, not digesting; there is no guarantee of an injection between values and byte sequences (though it happens to hold for now).
  • It depends on the specific implementation of GobEncode, which might change in the future.
  • It cannot handle NaN values: SetFloat64 panics if the input is NaN. If NaN support is a must, the solution is not applicable.
  • It is hard to replicate in other languages. Since math/big is unique to Go, parts of the system written in other languages will struggle to align with this specific behavior.

Of course, one could roll their own digesting scheme based on big.Float. I instead chose to develop a dedicated scheme for uint64 and float64, as described in the next section.

Attempt 2 (Final): Develop Dedicated Schemes

To avoid the drawbacks of the first attempt, I have developed a customized scheme that cohesively maps a number x to a 17-byte sequence b. The scheme is designed as follows:

  • b[0] is the category flag. It is one of 0 (zero), 1 (positive), 2 (negative), 3 (positive inf), 4 (negative inf), 5 (NaN), indicating the characteristic of the digested number.
  • If x is zero or not finite (b[0] is 0, 3, 4, 5),
    • let b[1:17] be all zeros.
  • If x is non-zero, finite, and abs(x) < 2^64,
    • let b[1:9] be the little-endian representation of floor(abs(x)),
    • let b[9:17] be the IEEE-754 representation of frac(abs(x)).
  • If x is non-zero, finite, and abs(x) >= 2^64,
    • let b[1:9] be the IEEE-754 representation of abs(x),
    • let b[9:17] be all zeros.

A reference implementation in Go is as follows (adapted from sdxtra/…/helper.go):

const (
	zero     = 0
	positive = 1
	negative = 2
	inf      = 3
	neginf   = 4
	nan      = 5
)

func WriteFloat64(x float64, h hash.Hash) {
	buf := make([]byte, 17)
	switch {
	case x == 0:
		buf[0] = zero
	case math.IsInf(x, 1):
		buf[0] = inf
	case math.IsInf(x, -1):
		buf[0] = neginf
	case math.IsNaN(x):
		buf[0] = nan
	default:
		goto NORMAL
	}
	h.Write(buf)
	return
NORMAL:
	if x < 0 {
		buf[0] = negative
		x = -x
	} else {
		buf[0] = positive
	}
	i, f := math.Modf(x)
	if i < 1<<64 {
		binary.LittleEndian.PutUint64(buf[1:], uint64(i))
	} else {
		binary.LittleEndian.PutUint64(buf[1:], math.Float64bits(x))
	}
	binary.LittleEndian.PutUint64(buf[9:], math.Float64bits(f))
	h.Write(buf)
}

func WriteUint64(x uint64, h hash.Hash) {
	buf := make([]byte, 17)
	if x > 0 {
		buf[0] = positive
		binary.LittleEndian.PutUint64(buf[1:], x)
	}
	h.Write(buf)
}

Discussion on the Scheme and Implementation

There are several points worth discussing about the scheme and within the implementation:

Category Flag

This scheme uses the first byte to signify special values, which helps handle some quirky cases of IEEE-754 floating-point numbers, including:

  1. Zeros. IEEE-754 has both positive and negative zeros, which differ in bit representation but should be treated as the same value.
  2. NaNs. IEEE-754 has multiple bit representations for NaN, all of which should be treated as the same value during digesting.

Short-path for Uint64

The scheme enables direct conversion from uint64 to a byte sequence, without first converting to float64. This saves some computation, and it also preserves uint64 values above 2^53 exactly, which a detour through float64 could not.

math.Modf

For some values, the scheme involves decomposition into their integral and fractional parts. This operation has several advantages:

  1. It forms an injection: distinct values of x correspond to distinct pairs (floor(x), frac(x)).
  2. It can be implemented via bit operations in most programming languages, efficiently and consistently across platforms.

One can try to verify these properties by referring to IEEE-754 standards.


Author: hsfzxjy.
Link: .
License: CC BY-NC-ND 4.0.
All rights reserved by the author.
Commercial use of this post in any form is NOT permitted.
Non-commercial use of this post should be attributed with this block of text.
