I do scientific computing. Mostly in C++, which offers a host of places to
have problems, but that isn't what I want to talk about today. Instead I want to
talk about language-independent issues with math.
From a science or engineering point of view, the formulae that we look up
and the equations we write down and manipulate all assume we're working with real
or complex values (or at least with integers). Notably, all of those sets are
infinite, and we represent a working subset of them with not-very-big sets of
bits. And that's where the trouble sets in.
Now, many languages provide types that offer a "floating-point"
representation of some of the reals. Think of a binary version of scientific
notation: $1.xyz \times 2^{abc}$. In modern times this stuff is actually well
standardized, with most hardware implementing IEEE 754.
A not-at-all-exhaustive list of the common problems with floating-point
representations includes:
- Easy-to-write-down fractions like $\frac{1}{3}$ don't have exact
representations in floating point because the format is finite.
- Worse, even fractions like $\frac{1}{5}$ that have finite representations
in decimal notation don't have one in binary notation, and so are also
represented inexactly.
- While addition and multiplication remain commutative, the associative and
distributive rules are lost: the grouping of operations changes how intermediate
results round, and thus changes the answer.
- It takes special care to ensure that you can accurately round-trip an
in-memory value through a textual representation.
- As a result of the above, it is very easy to write down an expression
that has an equals sign in the middle on paper, but when you compute the two
sides in code and compare them with `==`, it returns false.
- Because of library differences in I/O routines and in some functions, even
if you get it right on one machine/compiler combination, it can break when
ported to a different machine/compiler combination, even if both implement the
same standard!
As a result of these and other details, floating-point math is notoriously hard
to use correctly, all the more so if you worry about unreasonable inputs (as you
must).
We use floating-point math anyway because it supports values over a huge
range of magnitudes (for the number of bits used in the representation) and
often has a fast, hardware-supported implementation. Still, sometimes, if you
know the use domain well enough, you can select a more limited range of
necessary values and use fixed-point math to avoid some of the problems with
floating-point.
Recently at work we dealt with "it's not comparing right" problems for
angles on the sphere by coding azimuth and elevation as integer numbers of
arc-minutes, which provides more than sufficient precision for our needs, gets
along nicely with the domain practice of describing angles in degrees, means
that each value fits into a 16-bit field, and can be reliably round-tripped
through a customer-specified text format.
Alas, many languages (including most that are promoted for scientific computing)
don't have built-in types or library support for fixed point, so it isn't always
practical: you have to ask how you will implement any special functions you need
before you make that choice.
But it is worth asking right at the start.