
2023-01-30

Can you use fixed-point?

I do scientific computing. Mostly in C++, which offers a host of places to have problems, but that isn't what I want to talk about today. Instead I want to talk about language-independent issues with math.

From a science or engineering point of view, the formulae we look up and the equations we write down and manipulate all assume we're working with real or complex values (or at least with integers). Notably, all those sets are infinite, and we represent a working subset of them with not-very-big sets of bits. And that's where the trouble sets in.

Now, many languages provide types that offer a "floating-point" representation of some of the reals. Think of a binary version of scientific notation: $1.xyz \times 2^{abc}$. In modern times this stuff is well standardized, with most hardware implementing IEEE 754.

A not-at-all-exhaustive list of the common problems with floating-point representations includes

  • Easy to write down fractions like $\frac{1}{3}$ don't have exact representations in floating point because the format is finite.
  • Worse, even fractions like $\frac{1}{5}$ that have finite representations in decimal notation don't have one in binary notation and so are also inexactly represented.
  • The associative and distributive rules for basic operations like addition and multiplication no longer hold in general. (Commutativity survives in IEEE 754, but that isn't enough to let you rearrange expressions safely.)
  • It takes special care to ensure that you can accurately round-trip an in-memory value through a textual representation.
  • As a result of the above, it is very easy to write down an expression that holds with an equals sign on paper, but whose two sides, computed in code and compared with ==, come out unequal.
  • Because of library differences in IO routines and in some math functions, code that works correctly on one machine/compiler combination can break when ported to a different machine/compiler combination, even if both implement the same standard!
As a result of these and other details, floating-point math is notoriously hard to use correctly. All the more so if you worry about unreasonable inputs (as you must).

We use floating-point math anyway because it supports values over a huge range of magnitudes (for the number of bits used in the representation) and often has a fast, hardware-supported implementation. Still, if you know the use domain well enough, you can sometimes select a more limited range of necessary values and use fixed-point math to avoid some of the problems with floating-point.

Recently at work we dealt with "it's not comparing right" problems for angles on the sphere by coding azimuth and elevation as integer numbers of arc-minutes. That provides more than sufficient precision for our needs, gets along nicely with the domain practice of describing angles in degrees, means that each value fits into a 16-bit field, and can be reliably round-tripped through a customer-specified text format.

Alas, many languages (including most that are promoted for scientific computing) don't have built-in types or library support for fixed point, so it isn't always practical: you have to ask how you will implement any special functions you need before you make that choice.

But it is worth asking right at the start.