These notes discuss how to convert between decimal floating-point values and the IEEE-754 binary representation used by the float type in Java, C, C++, and many other languages. To be clear, these notes discuss only interconversions, not operations on floating-point numbers (e.g., addition, multiplication, etc.). Operations on floating-point numbers are much more complex than their equivalent operations on decimal numbers, mostly due to the need for rounding and concerns about precision loss. As such, we do not cover these operations.
The IEEE-754 standard was developed as a standardized representation of floating-point numbers in binary. Before the standard there were many incompatible implementations which all suffered from their own unique quirks. IEEE-754 attempts to alleviate some of these quirks, though it has some quirks of its own.
From unsigned and two's complement binary numbers, you're already used to the problem of not having enough bits to represent a given value.
This same problem arises with the IEEE-754 standard, where a value may be too large or too small to be represented.
An additional twist is that a number may not be encodable exactly in IEEE-754, leading to a loss of precision.
For example, in the format we will soon discuss (specifically binary32), the value 0.1 cannot be encoded exactly. Attempts to encode this value will instead produce the closest possible encoding, which happens to be 0.100000001490116119384765625.
This is admittedly very close, though the precision loss can become substantial when we start to perform operations on these numbers.
We won't worry about those sorts of problems in class, but there are ways to mitigate them, for the curious.
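For the curious, this exact value is easy to see from code. Below is a minimal Java sketch (the class name is just for illustration); BigDecimal's double constructor prints the exact decimal expansion of the stored binary value:

```java
import java.math.BigDecimal;

public class ExactFloatValue {
    public static void main(String[] args) {
        // Widening the float literal 0.1f to double preserves the stored
        // binary32 value exactly, and new BigDecimal(double) shows that
        // value's exact decimal expansion.
        System.out.println(new BigDecimal(0.1f));
        // Prints: 0.100000001490116119384765625
    }
}
```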
While the IEEE-754 standard defines several different floating-point representations, two of these stand out in popularity:

- binary32, the internal representation of float in Java, C, C++, and many others. This uses 32 bits.
- binary64, the internal representation of double in Java, C, C++, and many others. This uses 64 bits.
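As a quick aside, Java exposes these bit widths directly, which can serve as a sanity check. A small sketch, assuming nothing beyond the standard library:

```java
public class FloatWidths {
    public static void main(String[] args) {
        // The number of bits used by Java's float (binary32) and double (binary64).
        System.out.println(Float.SIZE);   // 32
        System.out.println(Double.SIZE);  // 64
    }
}
```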
While the above two representations are separate, they work very similarly internally.
In fact, the only real difference lies in the number of bits they use in their representation of floating-point numbers.
Because binary64 uses twice as many bits as binary32, it can encode larger values more precisely.
Because these two formats work in basically the same way, we will only work with the binary32 representation in this class. With half as many bits, this will mean substantially less work to do than with the binary64 representation (though it may still be a lot of work).
To be clear, however, these two formats work the same way.
Floating-Point Decimal to binary32
Step-by-step instructions follow which discuss how to convert from decimal floating-point values to an equivalent binary representation in binary32.
These instructions are similar to those presented here, though the step numbers are not one-to-one (the instructions below use more steps, though the process is the same).
Additionally, an automatic conversion is available online if you want to experiment a bit with some different numbers.
Note that this converter will only give you the final result, whereas in the lab I ask for the results of all the intermediate steps.
Step 1: Determine the Sign Bit

If the number is positive, then the sign bit will be 0. If the number is negative, then the sign bit will be 1. For the number zero, both positive and negative zero are possible, and these are considered different values (a quirk of using sign bits).
Step 2: Convert the Integral Portion to Unsigned Binary

Convert the integral portion of the floating-point value to unsigned binary (not two's complement). The integral portion is the part of the number before the decimal point. For example, if the number to convert is -0.75, then 0 is the integral portion, and its unsigned binary representation is simply 0. As another example, if the number to convert is 127.99, then the integral portion would be 127, and its unsigned binary representation is 1111111.
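If you want to check this conversion, Java's Integer.toBinaryString produces exactly these unsigned digits. A small sketch using the two examples above:

```java
public class IntegralPortion {
    public static void main(String[] args) {
        // The integral portions of -0.75 and 127.99 are 0 and 127, respectively.
        System.out.println(Integer.toBinaryString(0));   // 0
        System.out.println(Integer.toBinaryString(127)); // 1111111
    }
}
```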
Step 3: Convert the Fractional Portion to Binary

The fractional portion of the number must also be converted to binary, though the conversion process is much different from what you're used to. The algorithm you'll use is based on performing repeated multiplications by 2, and then checking whether the result is >= 1.0.
If the result is >= 1.0, then a 1 is recorded for the binary fractional component, and the leading 1 is chopped off the result. If the result is < 1.0, then a 0 is recorded for the binary fractional component, and the result is kept as-is. The recorded bits are built up left-to-right. The result keeps getting chained along in this way until one of the following is true:
- The result of the multiplication is exactly 1.0
- 23 iterations of this process have occurred
With the first possible terminating condition (the result is exactly 1.0), this means that the fractional component has been represented without any loss of precision. With the second possible terminating condition (23 iterations have passed), this means that we ran out of room in the final result, which can hold at most 23 fractional bits.
In this case, precision loss occurs (an unfortunate consequence of using a finite number of bits).
To see this algorithm in action, let's see how this works for the conversion of 0.75.
A table is shown below showing each iteration as it progresses.
| Iteration | Calculation | >= 1.0? | Output Bit |
|---|---|---|---|
| 1 | 0.75 * 2 = 1.5 | yes | 1 |
| 2 | 0.5 * 2 = 1.0 | yes | 1 |

Final value: (top-to-bottom, left-to-right) 11
In this case, the algorithm terminated with a precise result (it reached exactly 1.0).
Another example follows, this time for the conversion of 0.68:
| Iteration | Calculation | >= 1.0? | Output Bit |
|---|---|---|---|
| 1 | 0.68 * 2 = 1.36 | yes | 1 |
| 2 | 0.36 * 2 = 0.72 | no | 0 |
| 3 | 0.72 * 2 = 1.44 | yes | 1 |
| 4 | 0.44 * 2 = 0.88 | no | 0 |
| 5 | 0.88 * 2 = 1.76 | yes | 1 |
| 6 | 0.76 * 2 = 1.52 | yes | 1 |
| 7 | 0.52 * 2 = 1.04 | yes | 1 |
| 8 | 0.04 * 2 = 0.08 | no | 0 |
| 9 | 0.08 * 2 = 0.16 | no | 0 |
| 10 | 0.16 * 2 = 0.32 | no | 0 |
| 11 | 0.32 * 2 = 0.64 | no | 0 |
| 12 | 0.64 * 2 = 1.28 | yes | 1 |
| 13 | 0.28 * 2 = 0.56 | no | 0 |
| 14 | 0.56 * 2 = 1.12 | yes | 1 |
| 15 | 0.12 * 2 = 0.24 | no | 0 |
| 16 | 0.24 * 2 = 0.48 | no | 0 |
| 17 | 0.48 * 2 = 0.96 | no | 0 |
| 18 | 0.96 * 2 = 1.92 | yes | 1 |
| 19 | 0.92 * 2 = 1.84 | yes | 1 |
| 20 | 0.84 * 2 = 1.68 | yes | 1 |
| 21 | 0.68 * 2 = 1.36 | yes | 1 |
| 22 | 0.36 * 2 = 0.72 | no | 0 |
| 23 | 0.72 * 2 = 1.44 | yes | 1 |

Final value: (top-to-bottom, left-to-right) 10101110000101000111101
In this case, the algorithm terminated after 23 iterations without a precise result.
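For reference, the repeated-multiplication algorithm is short to express in code. Below is a minimal Java sketch (the method name fractionBits is just for illustration); note that it uses double arithmetic for the intermediate products, whereas the tables above use exact decimal arithmetic, so for some inputs the last few bits could differ:

```java
public class FractionToBinary {
    // Convert the fractional portion of a number to binary, producing at most maxBits bits.
    static String fractionBits(double fraction, int maxBits) {
        StringBuilder bits = new StringBuilder();
        for (int i = 0; i < maxBits && fraction != 0.0; i++) {
            fraction *= 2.0;            // multiply by 2
            if (fraction >= 1.0) {
                bits.append('1');       // record a 1 and chop off the leading 1
                fraction -= 1.0;
            } else {
                bits.append('0');       // record a 0 and keep the result as-is
            }
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println(fractionBits(0.75, 23)); // 11
        System.out.println(fractionBits(0.68, 23)); // 10101110000101000111101
    }
}
```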
While the above algorithm may seem strange, this is actually the inverse of the usual algorithm for converting decimal to binary, which works by repeated divisions by 2, constructing the value right-to-left. With the above algorithm, the intuition is that now we are working on the right-hand side of the decimal point. As such, we multiply instead of divide, and we construct left-to-right.
Each one of the bits encodes a value which is a power of two, just as with normal unsigned binary, where we work with ever-increasing powers of two as we move to the left.
However, because with fractions we are working on the right-hand side of the decimal point, the exponents become negative.
To better illustrate this, consider the binary floating-point number 1010.1101 (this is not a typical representation, but it serves our purposes).
Each bit has values as such:
Bits: | 1 | 0 | 1 | 0 | . | 1 | 1 | 0 | 1 |
Values: | 1 * 2^3 | 0 * 2^2 | 1 * 2^1 | 0 * 2^0 | . | 1 * 2^-1 | 1 * 2^-2 | 0 * 2^-3 | 1 * 2^-4 |
Using the result of 11 from our first conversion example of 0.75, this has the following representation:
Bits: | 0 | . | 1 | 1 |
Values: | 0 * 2^0 | . | 1 * 2^-1 | 1 * 2^-2 |
Sure enough, (0 * 2^0) + (1 * 2^-1) + (1 * 2^-2) = 0.75.
Step 4: Normalize the Value

A trick to encode an extra bit is to make it so that the binary scientific representation is always of the form 1.XXXX * 2^YYYY. That is, a 1 always leads, so there is no need to explicitly encode it. In order to encode this properly, we need to move the decimal point to a position where it is immediately after the first 1, and then record exactly how we moved it.
To see this in action, consider again the example of 0.75, which is encoded in binary as such (not IEEE-754 notation):
0.11
In order to make the decimal point be after the first 1, we will need to move it one position to the right, like so:
1.1
Most importantly, we need to record that we moved the decimal point by one position to the right.
Moves to the right result in negative exponents, and moves to the left result in positive exponents.
In this case, because we moved the decimal point one position to the right, the recorded exponent should be -1.
As another example, consider the following binary floating point representation (again, not IEEE-754):
1111111.11100
In this case, we need to move the decimal point six positions to the left to make this begin with a single 1, like so:
1.11111111100
Because this moves six positions to the left, the recorded exponent should be 6.
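Java can report this normalized exponent directly via Math.getExponent, which makes for a handy cross-check. A small sketch using the earlier examples:

```java
public class NormalizedExponent {
    public static void main(String[] args) {
        // Math.getExponent returns the unbiased binary exponent of the
        // float's normalized representation.
        System.out.println(Math.getExponent(0.75f));   // -1
        System.out.println(Math.getExponent(127.99f)); //  6
    }
}
```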
Step 5: Add Bias to the Exponent

Internally, IEEE-754 values store their exponents in an unsigned representation, which may seem odd considering that the exponent can be negative. Negative exponents are accommodated by using a biased representation, wherein a pre-set number is always subtracted from the given unsigned number. Because the given unsigned number may be less than this number, this allows for negative values to be effectively encoded without resorting to two's complement. Specifically, for the binary32 representation, the number 127 will be subtracted from anything encoded in the exponent field of the IEEE-754 number. As such, in this step, we need to add 127 to the normalized exponent value from the previous step.
To see this in action, if we recorded an exponent of -1 in the previous step, then the result of this step should be 126 (-1 + 127 = 126). Similarly, if we recorded an exponent of 6 in the previous step, then the result of this step should be 133 (6 + 127 = 133).
Step 6: Convert the Biased Exponent to 8-Bit Unsigned Binary

The biased exponent value from the previous step must be converted into unsigned binary, using the usual process. The result must be exactly 8 bits. It should not be possible to need more than 8 bits. If fewer than 8 bits are needed in this conversion process, then leading zeros must be added to the front of the result to produce an 8-bit value.
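In code, the bias and the zero-padding can be expressed in a couple of lines. A minimal Java sketch, continuing the -1 exponent example:

```java
public class BiasedExponentBits {
    public static void main(String[] args) {
        int biased = -1 + 127;  // bias the normalized exponent: 126
        // Convert to binary and left-pad with zeros to exactly 8 bits.
        String bits = String.format("%8s", Integer.toBinaryString(biased)).replace(' ', '0');
        System.out.println(bits);  // 01111110
    }
}
```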
Step 7: Determine the 23-Bit Mantissa

After step 4, there are a bunch of bits after the normalized decimal point. These bits will become the mantissa (note that we ignore the bits to the left of the decimal point; normalization allows us to do this, because there should always be just a 1 there). We need exactly 23 mantissa bits.
If fewer than 23 mantissa bits follow the decimal point, and the algorithm in step 3 ended with a result that wasn't 1.0, then continue the algorithm in step 3 until enough bits are produced. If that's still not enough (the algorithm eventually reached 1.0 before we had enough bits, or it had already ended with 1.0), then the right side is padded with zeros until 23 bits are reached.
If there are more than 23 bits after the decimal point in step 4, then these extra bits are simply cut off from the right. For example, if we had 26 bits to the right of the decimal point, then the last three would need to be cut off to get us to 23 bits. Note that in this case we will necessarily lose some precision.
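The padding/truncation rule can also be written as a short helper. A minimal Java sketch (toMantissa23 is a hypothetical helper name, not part of any library):

```java
public class MantissaBits {
    // Pad with zeros on the right, or cut off extra bits, so exactly 23 bits remain.
    static String toMantissa23(String fractionBits) {
        if (fractionBits.length() >= 23) {
            return fractionBits.substring(0, 23);  // cut extra bits off the right
        }
        StringBuilder padded = new StringBuilder(fractionBits);
        while (padded.length() < 23) {
            padded.append('0');                    // pad the right side with zeros
        }
        return padded.toString();
    }

    public static void main(String[] args) {
        System.out.println(toMantissa23("1"));  // 10000000000000000000000
    }
}
```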
Step 8: Put It All Together

The sign bit from step 1 will be the first bit of the final result. The next 8 bits will be the exponent from step 6. The last 23 bits will be the mantissa from step 7.
The result will be a 32-bit number encoded in IEEE-754 binary32 format, assuming no mistakes were made in the process.
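A hand conversion can be checked against Java itself: Float.floatToIntBits returns the raw binary32 encoding of a float. A small sketch, using -0.75 (sign 1, biased exponent 01111110, mantissa 1 followed by 22 zeros):

```java
public class CheckEncoding {
    public static void main(String[] args) {
        int bits = Float.floatToIntBits(-0.75f);
        // Left-pad to the full 32 bits so the sign, exponent, and mantissa line up.
        String s = String.format("%32s", Integer.toBinaryString(bits)).replace(' ', '0');
        System.out.println(s);  // 10111111010000000000000000000000
    }
}
```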
binary32 to Floating-Point Decimal
The reverse process, that of going from IEEE-754 binary32 to floating-point decimal, is much simpler.
Steps for this process follow.
Step 1: Determine the Sign

If the first bit is a 1, then the result will be negative. If the first bit is a 0, then the result will be positive.
Step 2: Extract the Biased Exponent

The eight bits following the sign bit encode the exponent. Extract these eight bits, and then convert them to an unsigned decimal integer. To be clear, this value is not in two's complement.
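In code, all three fields (including the mantissa used in step 4 below) can be pulled apart with shifts and masks. A minimal Java sketch, again using -0.75 as the input:

```java
public class ExtractFields {
    public static void main(String[] args) {
        int bits = Float.floatToIntBits(-0.75f);
        int sign     = (bits >>> 31) & 0x1;       // 1 bit
        int exponent = (bits >>> 23) & 0xFF;      // 8 bits, still biased by 127
        int mantissa = bits & 0x7FFFFF;           // low 23 bits
        System.out.println(sign);      // 1
        System.out.println(exponent);  // 126
        System.out.println(mantissa);  // 4194304 (binary: 1 followed by 22 zeros)
    }
}
```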
Step 3: Remove the Bias

The result from the previous step is biased, with a bias of 127. As such, you'll need to subtract 127 from this value. For example, if the biased exponent was 126, then the result of this step should be -1 (126 - 127 = -1).
Step 4: Compute the Value of the Mantissa

The last 23 bits of the number encode the mantissa. The leftmost bit (B1) has value B1 * 2^-1, the next-to-leftmost bit (B2) has value B2 * 2^-2, and so on, following the same pattern as shown in step 3 of the previous section.
To see this in action, consider a mantissa beginning with 1101, followed by 19 trailing zeros (for a total of 23 bits). The calculation for this would be:

(1 * 2^-1) + (1 * 2^-2) + (0 * 2^-3) + (1 * 2^-4) = 0.8125
Step 5: Compute the Overall Magnitude

Using the mantissa calculation from the previous step, as well as the unbiased exponent from step 3, the overall magnitude of the number will be:

(1 + mantissa) * 2^(unbiased_exponent)
For example, if the mantissa were 0.8125, and the exponent was 2, then the magnitude should be 7.25 ((1 + 0.8125) * 2^2).
Step 6: Apply the Sign

If the number is positive (determined from step 1), then the magnitude from the previous step is the final result. If the number is negative (from step 1), then negate the magnitude from the previous step. This negated magnitude is the final result.
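Going the other way, Float.intBitsToFloat decodes a raw 32-bit pattern as a binary32 value, so it can confirm a step-by-step decoding. A small sketch using the example above (sign 0, biased exponent 129, mantissa 1101 followed by 19 zeros):

```java
public class CheckDecoding {
    public static void main(String[] args) {
        // Binary literal: 1 sign bit, 8 exponent bits, 23 mantissa bits.
        int bits = 0b0_10000001_11010000000000000000000;
        System.out.println(Float.intBitsToFloat(bits));  // 7.25
    }
}
```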
The above rules cover the usual sort of numbers we wish to represent. However, there are still some cases which are not accounted for above. We won't get into these cases in this class, though a birds-eye view of them is presented below for the curious.
- Zero: encoded with exponent and mantissa fields consisting of nothing but zeros. Both positive zero and negative zero exist, thanks to the sign bit.
- Infinity: A finite representation of both positive and negative infinity exists. Infinity is encoded with an exponent of all ones and a mantissa of all zeros. The sign bit encodes whether the infinity is positive or negative.
- NaN: Not-a-number. This is a special value that indicates either an indeterminate value or the result of a nonsensical operation (e.g., 0.0 / 0.0). These are encoded with an exponent of all ones and a non-zero mantissa.
- Subnormal numbers: These are specifically for representing values very close to zero; they fill in the gap between 0 and the smallest normalized value, allowing the standard to represent numbers closer to 0 than normalized encodings permit.
These are useful because this range tends to be especially important for a wide variety of applications, including statistics.
These are encoded by an exponent of all zeros and a non-zero mantissa.
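The special encodings can also be inspected from Java, using the same floatToIntBits trick as before. A small sketch (note that Integer.toBinaryString drops leading zero bits, so the positive sign bit does not appear in the output):

```java
public class SpecialValues {
    public static void main(String[] args) {
        // Infinity: exponent all ones, mantissa all zeros.
        System.out.println(Integer.toBinaryString(Float.floatToIntBits(Float.POSITIVE_INFINITY)));
        // 1111111100000000000000000000000

        // NaN: exponent all ones, non-zero mantissa.
        System.out.println(Integer.toBinaryString(Float.floatToIntBits(Float.NaN)));
        // 1111111110000000000000000000000
    }
}
```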