
Floating point number


This page is about standard implementations and the concept. For platform information, please see the relevant page


Floating point numbers are a way to represent real numbers in computer memory, which (usually) operates on binary digits. Unlike fixed point numbers, which have a fixed number of digits before and after the radix point, floating point numbers have a fixed number of significant digits, the ‘accurate leading digits’ that carry an approximation to some value. Floating point numbers are inherently imprecise because they store an approximation to a value and not, in general, the value itself. This requires care when writing code in both low-level and high-level languages, to avoid precision errors that may break your program.
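A minimal Python sketch of the kind of precision error described above. It assumes the interpreter's float is an IEEE 754 double, which is the case for CPython on common platforms. Comparing with a tolerance via math.isclose is one common workaround, not the only one.

```python
import math
from decimal import Decimal

total = 0.1 + 0.2

# Neither 0.1 nor 0.2 has an exact binary representation, so the sum
# is not exactly the double closest to 0.3.
print(total == 0.3)

# Decimal(float) shows the exact value that is actually stored for 0.1.
print(Decimal(0.1))

# Comparing with a tolerance avoids the surprise.
print(math.isclose(total, 0.3))
```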

Because floating point numbers have a different representation than integers, they cannot be processed with ordinary integer instructions. Luckily, modern processors come with a dedicated on-chip FPU (Floating Point Unit) that performs floating point calculations quickly in hardware. On x86 the FPU and its instructions are named ‘x87’ for historical reasons. Embedded devices may or may not come with an FPU, depending on their complexity and price range. Check your spec! (Before FPUs became an affordable upgrade for desktop computers, floating point emulation libraries were either present or installable on computers. Sometimes these libraries even required manual installation by the end user!)

This article will explain the IEEE 754 standard for floating point number representation. This standard is extremely common, not x86-specific and has multiple ‘precisions’ or representation sizes which are all similar in how they work except for the number of bits used.


All floating point numbers are a variation on the theme of x = m * b^e. That is, they store a mantissa m, whose value is limited to the range from 1 (inclusive) to the base b (exclusive), and an exponent e; the number is formed by raising the constant base b to the exponent and multiplying the result by the mantissa.

In most implementations the base is 2 or 10. The latter is called "decimal float" and is a bit of an outlier. There are many software implementations of it, but in hardware it has only been implemented on the IBM POWER6 and newer, the IBM System z9 and newer, and the Fujitsu SPARC64. However, most of these also remain compatible with "binary float" (as base 2 floating point is called), and almost all hardware with FP support, including PCs, supports binary float.

Therefore the rest of this page will focus on IEEE 754 binary floating point.

IEEE 754

This standard defines the format of binary floating point numbers. All FP numbers are bit structures with one sign bit in the highest position, followed by a group of exponent bits, with the remaining bits forming the mantissa. Since the bits of each field only form an unsigned integer, some interpretation is necessary.

If all exponent bits are set, the value is special. If the mantissa is zero, the value is infinite; otherwise the value is not-a-number (NaN). On platforms where this matters, the highest mantissa bit selects between quiet NaN and signaling NaN. Using a signaling NaN in an operation is meant to raise an invalid-operation exception, although whether a program ever sees it depends on the platform and on whether that exception is masked.

Otherwise the exponent is stored as a biased integer. The bias is a constant with all bits set that is one bit shorter than the exponent field (127 for an 8-bit exponent), and the exponent stored in the FP number is the actual exponent plus this bias. This allows negative exponents to be stored without two’s complement or a separate sign.

Since the mantissa of a binary float has to be between 1 and 2, the leading "1." of the mantissa is implicit. The stored mantissa bits are the first bits after the binary point. Therefore, if the mantissa field is read as an integer, it has to be divided by 2 raised to the bit width of the mantissa, and then 1 is added. The only special case is an exponent field of all zeros: the implicit integer part is then 0 instead of 1. Such numbers are not in normal form and are therefore called "subnormal".
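The decoding rules above can be sketched in Python for the 32-bit single-precision layout (8 exponent bits, 23 mantissa bits). The function name decode_float32 is a hypothetical helper introduced for illustration. One detail is an addition not spelled out in the text: for subnormal numbers IEEE 754 uses the exponent 1 - bias rather than 0 - bias.

```python
EXP_BITS = 8
MAN_BITS = 23
BIAS = (1 << (EXP_BITS - 1)) - 1  # all bits set, one bit shorter: 127


def decode_float32(bits):
    """Interpret a 32-bit unsigned integer as an IEEE 754 single."""
    sign = bits >> (EXP_BITS + MAN_BITS)
    exponent = (bits >> MAN_BITS) & ((1 << EXP_BITS) - 1)
    mantissa = bits & ((1 << MAN_BITS) - 1)

    # All exponent bits set: infinity or NaN.
    if exponent == (1 << EXP_BITS) - 1:
        if mantissa == 0:
            return float("-inf") if sign else float("inf")
        return float("nan")

    # Read the mantissa field as an integer and scale it into [0, 1).
    fraction = mantissa / (1 << MAN_BITS)

    if exponent == 0:
        # Subnormal: implicit integer part is 0, exponent is 1 - bias.
        value = fraction * 2.0 ** (1 - BIAS)
    else:
        # Normal: implicit leading "1." and a biased exponent.
        value = (1 + fraction) * 2.0 ** (exponent - BIAS)

    return -value if sign else value


print(decode_float32(0x3F800000))
print(decode_float32(0xC1480000))
```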

Fixed Point and Floating Point Number Representations

Digital computers use the binary number system to represent all types of information. Alphanumeric characters, like numbers, are represented using binary digits (i.e., 0 and 1). Digital representations are easier to design and to store, and they offer greater accuracy and precision.

There are various number systems that can be used to represent numbers, for example the binary, octal, decimal, and hexadecimal number systems. The binary number system is the most relevant and popular one for representing numbers in digital computer systems.

Storing Real Numbers:

There are two major approaches to storing real numbers (i.e., numbers with a fractional component) in modern computing: (i) fixed point notation and (ii) floating point notation. In fixed point notation there is a fixed number of digits after the decimal point, whereas floating point notation allows a varying number of digits after the decimal point.

Fixed-Point Representation:

This representation has a fixed number of bits for the integer part and for the fractional part. For example, if the fixed-point format is IIII.FFFF (in decimal digits), the smallest non-zero value you can store is 0000.0001 and the largest is 9999.9999. A fixed-point number has three parts: the sign field, the integer field, and the fractional field.

We can represent these numbers using:

  • Sign-magnitude representation: range from -(2^(k-1) - 1) to (2^(k-1) - 1), for k bits.
  • 1’s complement representation: range from -(2^(k-1) - 1) to (2^(k-1) - 1), for k bits.
  • 2’s complement representation: range from -(2^(k-1)) to (2^(k-1) - 1), for k bits.

2’s complement representation is preferred in computer systems because it has a single, unambiguous representation of zero and makes arithmetic operations easier.
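The three ranges listed above can be checked for a concrete width with a short Python sketch. The function name signed_ranges is hypothetical and only evaluates the formulas; it does not implement the encodings themselves.

```python
def signed_ranges(k):
    """Return (min, max) for each k-bit signed representation."""
    limit = 2 ** (k - 1)
    return {
        "sign-magnitude": (-(limit - 1), limit - 1),
        "ones-complement": (-(limit - 1), limit - 1),
        "twos-complement": (-limit, limit - 1),
    }


for name, (low, high) in signed_ranges(8).items():
    print(f"{name}: {low} .. {high}")
```

Sign-magnitude and 1’s complement each spend one bit pattern on a second zero, which is why 2’s complement reaches one value further on the negative side.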

Example: Assume a 32-bit format that reserves 1 bit for the sign, 15 bits for the integer part and 16 bits for the fractional part.

Then -43.625 is represented as follows:

1 000000000101011 1010000000000000

Here 0 is used to represent + and 1 is used to represent -. 000000000101011 is the 15-bit binary value for decimal 43 and 1010000000000000 is the 16-bit binary value for the fraction 0.625.
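A minimal Python sketch of this encoding, assuming the sign/integer/fraction layout of the example (1 + 15 + 16 bits). The function name encode_fixed and its parameters are hypothetical.

```python
def encode_fixed(value, int_bits=15, frac_bits=16):
    """Encode value as 'sign integer fraction' bit strings."""
    sign = 1 if value < 0 else 0

    # Scale by 2**frac_bits so the fractional part becomes an integer.
    scaled = round(abs(value) * (1 << frac_bits))
    if scaled >= 1 << (int_bits + frac_bits):
        raise OverflowError("value does not fit in this format")

    integer = scaled >> frac_bits
    fraction = scaled & ((1 << frac_bits) - 1)
    return f"{sign:b} {integer:0{int_bits}b} {fraction:0{frac_bits}b}"


print(encode_fixed(-43.625))
```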

The advantage of a fixed-point representation is performance; the disadvantage is the relatively limited range of values it can represent. It is therefore usually inadequate for numerical analysis, as it does not provide enough range and accuracy. A number whose representation exceeds 32 bits would have to be stored inexactly.

In the 32-bit format given above, the smallest positive number is 2^-16 ≈ 0.000015 and the largest positive number is (2^15 - 1) + (1 - 2^-16) = 2^15 - 2^-16 ≈ 32768. The gap between adjacent numbers is 2^-16 everywhere.

The position of the radix point is fixed by the format; moving it left or right means choosing different widths for the integer and fractional fields.
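These limits can be verified with exact rational arithmetic. The sketch below uses Python's fractions module and assumes the 15-bit integer and 16-bit fraction fields from the example.

```python
from fractions import Fraction

INT_BITS = 15
FRAC_BITS = 16

# The spacing between adjacent fixed-point values.
step = Fraction(1, 2 ** FRAC_BITS)

smallest = step
# All integer bits set plus all fraction bits set.
largest = (2 ** INT_BITS - 1) + (1 - step)

print(float(smallest))
print(float(largest))
```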

Floating-Point Representation:

This representation does not reserve a specific number of bits for the integer part or the fractional part. Instead it reserves a certain number of bits for the number (called the mantissa or significand) and a certain number of bits to say where within that number the decimal place sits (called the exponent).

A floating-point number has two parts. The first part is a signed fixed-point number called the mantissa. The second part designates the position of the decimal (or binary) point and is called the exponent. The fixed-point mantissa may be a fraction or an integer. A floating-point number is always interpreted as a number of the form m × r^e.

Only the mantissa m and the exponent e are physically represented in the register (including their signs). A floating-point binary number is represented in the same manner, except that it uses base 2 for the exponent. A floating-point number is said to be normalized if the most significant digit of the mantissa is 1.

So, the actual number is (-1)^s × (1 + m) × 2^(e - Bias), where s is the sign bit, m is the mantissa, e is the exponent value, and Bias is the bias number.

Note that, in general, a signed mantissa and exponent can be represented in sign-magnitude, one’s complement, or two’s complement form.

The floating point representation is more flexible. Any non-zero number x can be written in the normalized form ±(1.b1b2b3...)_2 × 2^n.

Example: Suppose number is using 32-bit format: the 1 bit sign bit, 8 bits for signed exponent, and 23 bits for the fractional part. The leading bit 1 is not stored (as it is always 1 for a normalized number) and is referred to as a “hidden bit”.

Then -53.5 is normalized as -53.5 = (-110101.1)_2 = (-1.101011)_2 × 2^5, which is represented as follows:

1 00000101 10101100000000000000000

Here 00000101 is the 8-bit binary value of the exponent +5, and the 23-bit field holds the fraction bits 101011 padded with zeros.

Note that the 8-bit exponent field is used to store integer exponents -126 ≤ n ≤ 127.
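The normalization step can be reproduced in Python with math.frexp, which returns a mantissa in [0.5, 1); doubling it gives the 1.xxx form used here. The function name normalize is a hypothetical helper and expects a non-zero input.

```python
import math


def normalize(x):
    """Return (sign, mantissa in [1, 2), exponent) for non-zero x."""
    m, e = math.frexp(abs(x))  # abs(x) == m * 2**e with 0.5 <= m < 1
    sign = 1 if x < 0 else 0
    return sign, m * 2, e - 1


sign, mantissa, exponent = normalize(-53.5)

# Drop the hidden leading 1 and keep the 23 stored fraction bits.
fraction_bits = format(int((mantissa - 1) * 2 ** 23), "023b")

print(sign, mantissa, exponent)
print(fraction_bits)
```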

The smallest normalized positive number that fits into 32 bits is (1.00000000000000000000000)_2 × 2^-126 = 2^-126 ≈ 1.18 × 10^-38, and the largest normalized positive number that fits into 32 bits is (1.11111111111111111111111)_2 × 2^127 = (2^24 - 1) × 2^104 ≈ 3.40 × 10^38.
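As a cross-check, the sketch below packs these two values as IEEE 754 singles with Python's struct module and prints their bit patterns. Note that it uses the biased IEEE 754 encoding of the exponent, so the stored exponent fields are 1 and 254.

```python
import struct

smallest = 2.0 ** -126
largest = float((2 ** 24 - 1) * 2 ** 104)


def float32_bits(x):
    """Return the 32-bit pattern of x stored as an IEEE 754 single."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


print(f"{smallest:.3e} -> {float32_bits(smallest):#010x}")
print(f"{largest:.3e} -> {float32_bits(largest):#010x}")
```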

The precision of a floating-point format is the number of positions reserved for binary digits plus one (for the hidden bit). In the examples considered here the precision is 23+1=24.

The gap between 1 and the next normalized floating-point number is known as machine epsilon. For the example above the gap is (1 + 2^-23) - 1 = 2^-23. This is not the same as the smallest positive floating-point number, because floating-point numbers are spaced non-uniformly, unlike fixed-point numbers.

Note that numbers with a non-terminating binary expansion cannot be represented exactly in floating point. For example, 1/3 = (0.010101...)_2 cannot be a floating-point number because its binary representation is non-terminating.
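A short Python sketch of machine epsilon for the 32-bit format. Python floats are doubles, so the hypothetical helper to_float32 rounds a value through single precision with the struct module.

```python
import struct


def to_float32(x):
    """Round a Python float to the nearest IEEE 754 single."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


eps = 2.0 ** -23

# 1 + eps is the next single after 1.0, so it survives the round trip.
print(to_float32(1.0 + eps) > 1.0)

# A quarter of eps is lost: the result rounds back to exactly 1.0.
print(to_float32(1.0 + eps / 4) == 1.0)

# The smallest normalized positive single is far smaller than eps.
print(2.0 ** -126 < eps)
```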
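The effect can be made visible with Python's fractions module, which converts a float to the exact rational number it stores. This sketch assumes Python floats are IEEE 754 doubles.

```python
from fractions import Fraction

stored = Fraction(1 / 3)  # the exact value held by the float
exact = Fraction(1, 3)

# The stored value is close to 1/3 but not equal to it.
print(stored == exact)
print(float(abs(stored - exact)))

# Every finite binary float has a power-of-two denominator.
print(stored.denominator)

# A terminating binary fraction such as 0.625 is stored exactly.
print(Fraction(0.625) == Fraction(5, 8))
```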

IEEE Floating point Number Representation:

IEEE (Institute of Electrical and Electronics Engineers) has standardized floating-point representation as follows.

The actual number is (-1)^s × (1 + m) × 2^(e - Bias), where s is the sign bit, m is the mantissa, e is the exponent value, and Bias is the bias number. The sign bit is 0 for a positive number and 1 for a negative number. The exponent is stored in biased form rather than in two’s complement.

According to IEEE 754 standard, the floating-point number is represented in following ways:

  • Half Precision (16 bit): 1 sign bit, 5 bit exponent, and 10 bit mantissa
  • Single Precision (32 bit): 1 sign bit, 8 bit exponent, and 23 bit mantissa
  • Double Precision (64 bit): 1 sign bit, 11 bit exponent, and 52 bit mantissa
  • Quadruple Precision (128 bit): 1 sign bit, 15 bit exponent, and 112 bit mantissa
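The field widths listed above determine the total size, the bias, and the precision of each format. A small Python sketch that derives them; the names FORMATS and describe are hypothetical, and the bias formula 2^(k-1) - 1 is the standard IEEE 754 one for a k-bit exponent.

```python
# name: (exponent bits, mantissa bits)
FORMATS = {
    "half": (5, 10),
    "single": (8, 23),
    "double": (11, 52),
    "quadruple": (15, 112),
}


def describe(exp_bits, man_bits):
    """Derive size, bias and precision from the field widths."""
    return {
        "total_bits": 1 + exp_bits + man_bits,  # plus the sign bit
        "bias": 2 ** (exp_bits - 1) - 1,
        "precision": man_bits + 1,  # plus the hidden bit
    }


for name, widths in FORMATS.items():
    print(name, describe(*widths))
```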

Special Value Representation:

The IEEE 754 standard defines some special values, which are identified by particular values of the exponent and mantissa fields.

  • All the exponent bits 0 with all mantissa bits 0 represents 0. If sign bit is 0, then +0, else -0.
  • All the exponent bits 1 with all mantissa bits 0 represents infinity. If sign bit is 0, then +∞, else -∞.
  • All the exponent bits 0 and mantissa bits non-zero represents a denormalized (subnormal) number.
  • All the exponent bits 1 and mantissa bits non-zero represents an error value, NaN (not a number).
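The four rules above can be sketched as a small classifier over the raw bit pattern. The function name classify is hypothetical, and the default widths assume single precision.

```python
def classify(bits, exp_bits=8, man_bits=23):
    """Classify a raw bit pattern by its exponent and mantissa fields."""
    exp_mask = (1 << exp_bits) - 1
    exponent = (bits >> man_bits) & exp_mask
    mantissa = bits & ((1 << man_bits) - 1)

    if exponent == 0:
        return "zero" if mantissa == 0 else "subnormal"
    if exponent == exp_mask:
        return "infinity" if mantissa == 0 else "nan"
    return "normal"


for pattern in (0x00000000, 0x80000000, 0x7F800000, 0x00000001, 0x7FC00000):
    print(f"{pattern:#010x}: {classify(pattern)}")
```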

Lesson 33. Floating Point Data Types: float, double and long double

Updated: Dec 29, 2019

In this lesson we look at the floating point data types, their precision and range, what exponent notation is and how it is used, as well as rounding errors and what nan and inf are.

Floating point data types

Integer data types are great for working with whole numbers, but there are fractional numbers too. This is where floating point data types come in. A variable of such a type can hold real numbers with a fractional part, for example 4320.0, -3.33 or 0.01226. Why is the point "floating"? Because the point can move ("float") between the digits, separating the integer and fractional parts of the value.


There are three floating point data types: float, double and long double. As with the integer types, C++ defines only their minimum size. Floating point types are always signed (i.e. they can hold both positive and negative numbers).

Declaring variables of the different floating point types:

If you want to use a whole number with a floating point variable, write a zero after the decimal point. This makes it possible to tell integer values apart from floating point values:

Note that floating point literals are of type double by default. An "f" at the end of the number means type float.

Exponent notation

Exponent notation is very useful for writing long numbers in a short form. Numbers in exponent notation look like this: mantissa × 10^exponent. For example, consider the expression 1.2 × 10^4. The value 1.2 is the mantissa (also called the "significand"), and 4 is the exponent (also called the "order"). The result of this expression is 12,000.

Usually, in exponent notation, the integer part contains only one digit, and all the others are written after the decimal point (in the fractional part).

Consider the mass of the Earth. In decimal notation it is written as 5973600000000000000000000 kg. That is a very large number (too large, in fact, to fit in an 8-byte integer variable). The number is even hard to read (are there 19 or 20 zeros?). Using exponent notation, the mass of the Earth can be written as 5.9736 × 10^24 kg, which is much easier to take in. Another advantage of exponent notation is that it simplifies comparing two very large or very small numbers: it is often enough to compare their exponents.

In C++ the letter e / E means that the number 10 is raised to the power that follows the letter. For example, 1.2 × 10^4 is equivalent to 1.2e4, and the value 5.9736 × 10^24 can also be written as 5.9736e24.

For numbers smaller than one the exponent can be negative. For example, 5e-2 is equivalent to 5 × 10^-2, which in turn means 5 / 10^2, or 0.05. The mass of an electron is 9.1093822e-31 kg.

In practice, exponent notation can be used in assignments:

Floating-point Numbers

Scalars of type float are stored using four bytes (32-bits). The format used follows the IEEE-754 standard.

A floating-point number is expressed as the product of two parts: the mantissa and a power of two. For example:

±mantissa × 2^exponent

The mantissa represents the actual binary digits of the floating-point number.

The power of two is represented by the exponent. The stored form of the exponent is an 8-bit value from 0 to 255. The actual value of the exponent is calculated by subtracting 127 from the stored value (0 to 255) giving a range of –127 to +128.

The mantissa is a 24-bit value (representing about seven decimal digits) whose most significant bit (MSB) is always 1 and is, therefore, not stored. There is also a sign bit that indicates whether the floating-point number is positive or negative.

Floating-point numbers are stored on byte boundaries in the following format:

Zero is a special value denoted with an exponent field of 0 and a mantissa of 0.

Using the above format, the floating-point number -12.5 is stored as a hexadecimal value of 0xC1480000. In memory, this value appears as follows:

It is fairly simple to convert floating-point numbers to and from their hexadecimal storage equivalents. The following example demonstrates how this is done for the value -12.5 shown above.

The floating-point storage representation is not an intuitive format. To convert it to a floating-point number, the bits must be separated into the sign, exponent, and mantissa fields of the storage format. For 0xC1480000 this gives:

1 10000010 10010000000000000000000

From this, you can determine the following:

  • The sign bit is 1, indicating a negative number.
  • The exponent value is 10000010 binary or 130 decimal. Subtracting 127 from 130 leaves 3, which is the actual exponent.
  • The mantissa appears as the binary number 10010000000000000000000.

There is an understood binary point at the left of the mantissa that is always preceded by a 1. This digit is omitted from the stored form of the floating-point number. Adding 1 and the binary point to the beginning of the mantissa gives 1.10010000000000000000000.

To adjust the mantissa for the exponent, move the binary point to the left for negative exponent values or to the right for positive exponent values. Since the exponent is three, the mantissa becomes 1100.10000000000000000000.

The result is a binary floating-point number. Binary digits to the left of the decimal point represent the power of two corresponding to their position. For example, 1100 represents (1 × 2^3) + (1 × 2^2) + (0 × 2^1) + (0 × 2^0), which is 12.

Binary digits to the right of the decimal point also represent the power of two corresponding to their position. However, the powers are negative. For example, .100... represents (1 × 2^-1) + (0 × 2^-2) + (0 × 2^-3) + ..., which equals .5.

The sum of these values is 12.5. Because the sign bit was set, this number should be negative.

So, the hexadecimal value 0xC1480000 is -12.5.
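The same conversion can be sketched in Python. The struct module is used only to confirm the result; the field slicing follows the 1/8/23-bit layout described above, and big-endian byte order is an assumption made for readability.

```python
import struct

raw = 0xC1480000
bits = f"{raw:032b}"

# Split into sign (1 bit), exponent (8 bits) and mantissa (23 bits).
sign, exponent, mantissa = bits[0], bits[1:9], bits[9:]
print(sign, exponent, mantissa)

# Rebuild the value by hand: hidden 1, stored fraction, unbiased exponent.
fraction = int(mantissa, 2) / 2 ** 23
manual = (1 + fraction) * 2 ** (int(exponent, 2) - 127)
if sign == "1":
    manual = -manual

# Let struct interpret the same four bytes as a single-precision float.
value = struct.unpack(">f", raw.to_bytes(4, "big"))[0]
print(manual, value)
```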

Floating point number

Floating point numbers (also known as ‘real numbers’) give a certain freedom in being able to represent both very large and very small numbers in the confines of a 32 bit word (that’s a double word in our PLCs). Up until this point the range of numbers we were able to represent with a double word would be from 0 to 4,294,967,295. Floating point on the other hand allows a representation of 0.0000000000000001 as well as +/-1,000,000,000,000. It allows for such large numbers that we can even keep track of the US national debt.

Floating point gives us an easy way to deal with fractions. Before, a word could only represent an integer, that is, a whole number. We’d have to use some tricks to maybe imply a decimal point. For instance, a number like 2300 in a word could be taken to represent 23.00 if the decimal point is "implied" to be in the 1/100th place. This might be all we need but it can get a bit tricky when it comes to math where we want to retain a remainder. The trick is to get some sort of format where the decimal point can "float" around the number.


Real Numbers in the Real World

At this point let’s deal with an example. In this case we’re using an Automation Direct DL250 PLC which conveniently has the ability to handle real numbers (floating point). Our PLC is reading a pressure transducer input whose max reading is 250 psi. In our PLC the max number is represented by 4095 (FFF in hex). So essentially to get our real world reading we would need to divide the raw reading by 16.38 (4095 max reading / 250 max pressure). This is easily done with real numbers but our reading is in decimal. So the BTOR instruction is used to convert the decimal number to a real number format. Then we use the special DIVR instruction to divide it by a real number and get our reading. The resulting ladder logic would look like below.
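The scaling arithmetic can be sketched in Python. This is not ladder logic and does not model the BTOR or DIVR instructions; the names below are hypothetical and only illustrate the calculation.

```python
MAX_COUNT = 4095          # raw reading at full scale (FFF hex)
MAX_PRESSURE_PSI = 250.0  # transducer full-scale pressure

# Raw counts per psi: 4095 / 250 = 16.38
COUNTS_PER_PSI = MAX_COUNT / MAX_PRESSURE_PSI


def counts_to_psi(raw):
    """Convert an integer raw reading to a pressure in psi."""
    # float(raw) plays the role of the integer-to-real conversion.
    return float(raw) / COUNTS_PER_PSI


print(counts_to_psi(4095))
print(counts_to_psi(2048))
```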

If you’re a complete newbie at this and don’t understand the ladder logic then don’t worry about that. We’ll get into ladder later. Just understand that when you need to deal in fractions you’ll most likely want to turn to real number formats in the PLC instruction set.

Sinking Deeper into Floating Point Numbers

Floating point is basically a representation of scientific notation. Oh yeah? What’s scientific notation? Scientific notation represents numbers as a base number and an exponent. For example, 123.456 would be 1.23456 × 10^2. That 10 with a little 2 above is telling us to move the decimal point two places to the right to get the real number. Another example, 0.0123 would be 1.23 × 10^-2. That little -2 indicates we move the decimal point in the opposite direction, to the left. (Just a heads up, in the PLC you may be able to use scientific notation but in a different form like 1.23456E2, which is the same as the first example.) The number 10 here means we’re dealing in decimal. We could just as easily do scientific notation in hexadecimal (123.ABC × 16^2) or even binary (1.0101 × 2^2; this binary one becomes important later on).
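The E form mentioned above is also how many programming languages write scientific notation. A small Python sketch, used here only for illustration; the PLC's own syntax may differ.

```python
# Parsing the E form gives the same value as the plain decimal literal.
print(float("1.23456E2") == 123.456)

# Formatting with "e" produces the normalized mantissa and exponent.
print(f"{123.456:.5e}")
print(f"{0.0123:.2e}")

# Binary scientific notation: 1.0101 (binary) times 2**2.
mantissa = 1 + 0 / 2 + 1 / 4 + 0 / 8 + 1 / 16
print(mantissa * 2 ** 2)
```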

The Format

At some point in history a bunch of geeks got together and agreed upon a certain format or layout for a 32-bit floating point number. Due to a lack of originality, it officially became called "IEEE Standard 754". Here it is in all its glory.

The exponent is the same as our little number above the 10 in scientific notation. It tells us which way the decimal should go so it needs to be positive (go to the right) or negative (go to the left). Here we are again trying to deal with negative numbers but in this case the geeks decided to use what’s called a bias or offset of 127. Basically this means that at a value of 127 the exponent is 0. Any number below 127 will cause a negative exponent. Any number above 127 will be a positive exponent. So a stored value of 200 indicates an exponent of 73 (200-127).

The mantissa (or significand, if that is any easier to say) represents the precision bits of the number. In our example above it was the 1.23456 part of the scientific notation.

The final nomenclature in scientific notation would be: (sign) mantissa × base^exponent

Normally the base would be 10 but in this case it will be 2 since we are only dealing in binary. Since it’s in base 2 (or binary) there’s a little optimization trick that can be done to save one bit. Waste not, want not, you know. The trick comes about by realizing that scientific notation allows us to write numbers in many different ways. Consider how the number five can be

These are all the same number. Floating point numbers are typically in a normalized form with one non-zero digit to the left of the decimal point (i.e. 5.00 × 10^0 or 4.0 × 10^3). The exponent is always adjusted to make this happen. In terms of using binary we’ll always have a 1 in front (i.e. 1.0 × 2^3). You wouldn’t have 0.1 × 2^4 as it wouldn’t be normalized. So in this case it’s always safe to assume that the leading digit is a 1 and therefore we don’t have to store it. That makes the mantissa actually 24 bits long when all we have are 23 bits of storage. Ah, what we do to save one bit.

WARNING: It’s Not a Perfect World

With all this power using floating point you are probably thinking, "I’ll just use it all the time". There’s a problem though, as this method can actually lose some precision. In many cases the loss will be negligible and it is well worth using real numbers. In other cases, though, it could cause significant errors. So beware.

Consider what would happen if the mantissa part of the floating point format was actually longer than 24 bits. Something has to give, and what happens is that the end is truncated, that is, the extra bits are cut off and lost.

Here’s an example of a 32-bit number

11110000 11001100 10101010 00001111 which would be 4039944719 in decimal

In floating point with only 24 bits it would have to be

1.1110000 11001100 10101010 × 2^31 which when converted back would be

11110000 11001100 10101010 00000000 and therefore 4039944704 in decimal.

That’s a difference of 15. During normal math this might not be of concern but if you are accumulating and totalizing values then that kind of error could really make the bean counters mad. This is simply a case of knowing your limitations.
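The loss shown above can be reproduced in Python by pushing the integer through a 32-bit float with the struct module. The helper name through_float32 is hypothetical. IEEE 754 hardware normally rounds to nearest rather than truncating, but for this particular value both give the same result because the discarded bits (00001111) are less than half a step.

```python
import struct


def through_float32(n):
    """Store an integer in a 32-bit float and read it back."""
    packed = struct.pack(">f", float(n))
    return int(struct.unpack(">f", packed)[0])


original = 0b11110000_11001100_10101010_00001111
stored = through_float32(original)

print(original)
print(stored)
print(original - stored)
```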

Glutton for Punishment: Further Reading

There’s more on this subject concerning things like double precision, overflow, zero and ‘not a number’ which you can read about in these excellent articles.
