1 Introduction
-
If \(x = 2.20\) and \(y < 2.27\overline{72}\), then \(xy < 5.01\). If \(y \ge 2.27\overline{72}\) instead, then \(xy \ge 5.01\).
-
If \(x = 2.50\) and \(y = 2.00\), then \(xy = 5.00\).
-
An algorithm to efficiently compute the next normal floating-point factor of any floating-point number in radix 2.
-
This algorithm is also partially applicable in higher bases. We do not assume a particular radix in any of our results.
-
-
When the above algorithm is applicable, an efficient procedure for finding optimal interval bounds on variables of floating-point constraints of the form \(x \otimes y = z\).
-
A demonstration that “round-trip” rounding errors have useful, nontrivial piecewise polynomial extensions to the reals.
-
A characterization of upward and downward floating-point rounding in terms of the remainder of floored division.
-
A floating-point predicate for deciding, in any base, whether a number is floating-point factor of another floating-point number.
-
Results on solving certain kinds of interval constraints over integer remainders.
1.1 Related Work
1.2 Outline
2 Preliminaries
2.1 Floating-Point Arithmetic
2.1.1 Floating-Point Numbers, Briefly
2.1.2 Definitions and Terminology
-
Suppose \({|m|} \ge 1\). Then \({|m \beta ^e|} \ge \beta ^{e_{\textrm{min}}}\) and \(\left\lfloor \log _\beta {|m|}\right\rfloor = 0\), and thus \(E(m \beta ^e) = e\).
-
Suppose \(e = {e_{\textrm{min}}}\) and \({|m|} < 1\). Then \({|m \beta ^e|} < \beta ^{e_{\textrm{min}}}\) and so we have \(E(m \beta ^e) = {e_{\textrm{min}}}= e\) immediately by the definition of E.
-
Suppose \(e > {e_{\textrm{min}}}\) and \({|m|} < 1\). If \({|m \beta ^e|} \ge \beta ^{e_{\textrm{min}}}\), then \(\left\lfloor \log _\beta {|m|}\right\rfloor < 0\) and hence \(E(m \beta ^e) = \left\lfloor \log _\beta {|m|}\right\rfloor + e < e\). If \({|m \beta ^e|} < \beta ^{e_{\textrm{min}}}\) instead, then \(E(m \beta ^e) = {e_{\textrm{min}}}< e\).
2.2 The Classical Iterative Procedure
\(\cdot \) | 1.2 | 1.3 | 1.4 | \(\otimes \) | 1.2 | 1.3 | 1.4 |
---|---|---|---|---|---|---|---|
1.2 | 1.44 | 1.56 | 1.68 | 1.2 | 1.4 | 1.6 | 1.7 |
1.3 | 1.56 | 1.69 | 1.82 | 1.3 | 1.6 | 1.7 | 1.8 |
3 Simplifying the Problem
3.1 Sufficient and Necessary Conditions
-
Suppose x is positive. Multiplying by x, it follows thatand by the monotonicity of rounding, we have$$\begin{aligned} x \min Y \le xy \le x \max Y, \end{aligned}$$Combining these bounds with our previous bounds on \({{\,\textrm{fl}\,}}(xy)\), we obtain$$\begin{aligned} x \otimes \min Y \le {{\,\textrm{fl}\,}}(xy) \le x \otimes \max Y. \end{aligned}$$and hence the result holds.$$\begin{aligned} x \otimes \min Y \le&{{\,\textrm{fl}\,}}(xy) \le \max Z,\\ \min Z \le&{{\,\textrm{fl}\,}}(xy) \le x \otimes \max Y, \end{aligned}$$
-
Suppose x is negative. Then, similarly, we haveTherefore, since \({{\,\textrm{fl}\,}}(xy) \in Z\), it follows that$$\begin{aligned} x \otimes \min Y \ge {{\,\textrm{fl}\,}}(xy) \ge x \otimes \max Y. \end{aligned}$$and thus we obtain the result.$$\begin{aligned} x \otimes \min Y \ge&{{\,\textrm{fl}\,}}(xy) \ge \min Z,\\ \max Z \ge&{{\,\textrm{fl}\,}}(xy) \ge x \otimes \max Y, \end{aligned}$$
3.2 Narrowing Down Infeasibility
4 Analyzing the Error
4.1 Further Preliminaries
4.1.1 The Mod Operator
4.1.2 Floating-Point Numbers with Unbounded exponents
4.1.3 Plausibility
4.2 Relating Feasibility, Plausibility, and Rounding Errors
-
Suppose \({|m_z|} < m_x\). Then by Lemma 2.12, we have \({{\,\textrm{ufp}\,}}(m_z/m_x) = 1/\beta \). Then, since \(x < \beta {{\,\textrm{ufp}\,}}(x) \in {{\,\textrm{Plaus}\,}}(Z)\), we have \(t \le \beta {{\,\textrm{ufp}\,}}(x)\).
-
Suppose \(t < \beta {{\,\textrm{ufp}\,}}(x)\). Then, since \({{\,\textrm{ufp}\,}}(x) \le x \le t\), we have \({{\,\textrm{ufp}\,}}(t) = {{\,\textrm{ufp}\,}}(x)\) and hence \({|m_z|} < m_x \le m_t\). Therefore, by Lemma 2.12, we have \({{\,\textrm{ufp}\,}}(m_z/m_t) = 1/\beta \) as well, as desired.
-
Suppose \(t = \beta {{\,\textrm{ufp}\,}}(x)\) instead. Then,$$\begin{aligned} {{\,\textrm{ufp}\,}}(z/t)&= {{\,\textrm{ufp}\,}}\left( \frac{z}{\beta {{\,\textrm{ufp}\,}}(x)}\right) \\&= \frac{1}{\beta } \cdot \frac{{{\,\textrm{ufp}\,}}(z)}{{{\,\textrm{ufp}\,}}(x)}\\&= {{\,\textrm{ufp}\,}}(m_z/m_x) {{\,\textrm{ufp}\,}}(z)/{{\,\textrm{ufp}\,}}(x)\\&= {{\,\textrm{ufp}\,}}(z/x). \end{aligned}$$
-
-
Suppose \(0 < m_x \le {|m_z|}\) instead. Then, since \({|m_z|} {{\,\textrm{ufp}\,}}(x) \in {{\,\textrm{Plaus}\,}}(Z)\), we haveThus \({{\,\textrm{ufp}\,}}(t) = {{\,\textrm{ufp}\,}}(x)\) and hence$$\begin{aligned} x \le t \le {|m_z|} {{\,\textrm{ufp}\,}}(x) < \beta {{\,\textrm{ufp}\,}}(x). \end{aligned}$$Thus by Lemma 2.12, we also have \({{\,\textrm{ufp}\,}}(m_z/m_x) = {{\,\textrm{ufp}\,}}(m_z/m_t) = 1\), as desired.$$\begin{aligned} m_x \le m_t = t/{{\,\textrm{ufp}\,}}(x) \le {|m_z|}. \end{aligned}$$
-
Suppose \(-{|m_z|} \le m_x < 0\) instead. Then, since \(-{{\,\textrm{ufp}\,}}(x) \in {{\,\textrm{Plaus}\,}}(Z)\), we haveso \({{\,\textrm{ufp}\,}}(t) = {{\,\textrm{ufp}\,}}(x)\) and hence$$\begin{aligned} x \le t \le -{{\,\textrm{ufp}\,}}(x), \end{aligned}$$Therefore, by Lemma 2.12, we have \({{\,\textrm{ufp}\,}}(m_z/m_x) = {{\,\textrm{ufp}\,}}(m_z/m_t) = 1\), as desired.$$\begin{aligned} -{|m_z|} \le m_x \le m_t \le -1. \end{aligned}$$
-
Suppose \(m_x < -{|m_z|}\) instead. Then \({{\,\textrm{ufp}\,}}(m_z/m_x) = 1/\beta \) by Lemma 2.12, and since \(-{|m_z|} {{\,\textrm{ufp}\,}}(x) \in {{\,\textrm{Plaus}\,}}(Z)\), we haveand thus \({{\,\textrm{ufp}\,}}(x) = {{\,\textrm{ufp}\,}}(t)\).$$\begin{aligned} x \le t \le -{|m_z|} {{\,\textrm{ufp}\,}}(x), \end{aligned}$$
-
Suppose \(t < -{|m_z|} {{\,\textrm{ufp}\,}}(x)\). Thenand hence \({{\,\textrm{ufp}\,}}(z/t) = {{\,\textrm{ufp}\,}}(z/x)\), as desired.$$\begin{aligned} {{\,\textrm{ufp}\,}}(m_z/m_x) = {{\,\textrm{ufp}\,}}(m_z/m_t) = 1/\beta , \end{aligned}$$
-
Suppose \(t = -{|m_z|} {{\,\textrm{ufp}\,}}(x)\) instead. Then,hence z is an integer multiple of \(t {{\,\textrm{ulp}\,}}(z/x)\). Thus$$\begin{aligned} t {{\,\textrm{ulp}\,}}(z/x)&= -{|m_z|} {{\,\textrm{ufp}\,}}(x) {{\,\textrm{ulp}\,}}(z/x)\\&= -{|m_z|} {{\,\textrm{ufp}\,}}(x) {{\,\textrm{ufp}\,}}(m_z/m_x) \frac{{{\,\textrm{ufp}\,}}(z)}{{{\,\textrm{ufp}\,}}(x)} \beta ^{1-p} \\&= -{|m_z|}{{\,\textrm{ufp}\,}}(z)\beta ^{1-p}\\&= -\frac{{|z|}}{\beta ^{p-1}}, \end{aligned}$$and so the rounding error is zero. Similarly, since$$\begin{aligned} (z \bmod t {{\,\textrm{ulp}\,}}(z/x)) = (-z \bmod t {{\,\textrm{ulp}\,}}(z/x)) = 0, \end{aligned}$$it is a (possibly negated) power of \(\beta \), and thus in \(\mathbb {F}^*_\infty \). Therefore,$$\begin{aligned} z/t = \frac{z}{-{|m_z|} {{\,\textrm{ufp}\,}}(x)} = -{{\,\textrm{sgn}\,}}(z)\frac{{{\,\textrm{ufp}\,}}(z)}{{{\,\textrm{ufp}\,}}(x)}, \end{aligned}$$exactly, and hence the result follows.$$\begin{aligned} t {{\,\textrm{RD}\,}}_\infty (z/t) = t {{\,\textrm{RU}\,}}_\infty (z/t) = t \cdot \frac{z}{t} = z \end{aligned}$$
-
-
Suppose \(0 \in {{\,\textrm{Feas}\,}}(Z)\). Then \(0 \otimes y \in Z\) for some \(y \in \mathbb {F}\) and hence \(0 \in Z\). Thus \(0 \otimes x = 0 \in Z\), and so \(x \in {{\,\textrm{Feas}\,}}(Z)\). Therefore \(a \le x\), and hence \(x = a = b\).
-
Suppose \(+\infty \in {{\,\textrm{Feas}\,}}(Z)\). Then \(+\infty \otimes y \in Z\) for some nonzero \(y \in {\overline{\mathbb {F}}}\). If \(y > 0\), then \(+\infty \otimes y = +\infty = +\infty \otimes x\), and so \(x \in {{\,\textrm{Feas}\,}}(Z)\). If \(y < 0\), then \(+\infty \otimes y = -\infty = -\infty \otimes x\), and so \(x \in {{\,\textrm{Feas}\,}}(Z)\) again. Thus \(a \le x\) and so \(x = a = b\).
-
Suppose a is negative. Then \(b \le a < 0\), and thus \(0< {|a|} \le {|b|} < {|z|}/\max \mathbb {F}\). Since \(a \in {{\,\textrm{Feas}\,}}(Z)\) by definition, Lemma 4.21 implies that \(b \in {{\,\textrm{Feas}\,}}(Z)\) also and thus \(a = b\).
-
Suppose a is positive. Then \([x, 0] \cap {{\,\textrm{Feas}\,}}(Z)\) is empty, and since \({{\,\textrm{Feas}\,}}(Z)\) is closed under negation, it follows that \(a > {|x|}\). Thus, since \(a \in {{\,\textrm{Plaus}\,}}(Z)\), we have \(a \in C\) and hence \(c \le a \le \max \mathbb {F}\). Therefore \(c \in {{\,\textrm{Feas}\,}}(Z)\) by our earlier result, so \(a \le c\) and therefore \(a = c\), as desired.
5 Minimizing Remainders of a Constant Divided by a Variable
5.1 Conditions for a Solution via Extrapolation
-
either \((-ab \bmod {\hat{x}}_1) \in I\) or \(-(ab \bmod {\hat{x}}_1) \in I\), and
-
either \((ab \bmod {\hat{x}}_2) \in I\) or \(-(-ab \bmod {\hat{x}}_2) \in I\).
-
Suppose \(f({\left\lceil \llceil x\right\rceil \rrceil })\) and x have the same sign. Then \(-f(\left\lfloor x\right\rfloor )\) also has the same sign, and \(-f({\left\lceil \llceil x\right\rceil \rrceil })\) and \(f(\left\lfloor x\right\rfloor )\) have the opposite sign. Therefore,and thusHence either , and also either \((ab \bmod \left\lfloor x\right\rfloor ) \in I\) or \(-(-ab \bmod {\left\lceil \llceil x\right\rceil \rrceil }) \in I\).$$\begin{aligned} -f({\left\lceil \llceil x\right\rceil \rrceil })&= -(-ab \bmod {\left\lceil \llceil x\right\rceil \rrceil }),\\ f(\left\lfloor x\right\rfloor )&= -(ab \bmod \left\lfloor x\right\rfloor ). \end{aligned}$$
-
Suppose instead \(f({\left\lceil \llceil x\right\rceil \rrceil })\) and x have opposite signs. Then \(f(\left\lfloor x\right\rfloor )\) and \(-f({\left\lceil \llceil x\right\rceil \rrceil })\) have the same sign as x, and hence \(-f(\left\lfloor x\right\rfloor )\) has the opposite sign of x. Therefore,and thus$$\begin{aligned} f(\left\lfloor x\right\rfloor ) = (f(\left\lfloor x\right\rfloor ) \bmod \left\lfloor x\right\rfloor ) = (-ab \bmod \left\lfloor x\right\rfloor ),\\ -f({\left\lceil \llceil x\right\rceil \rrceil }) = (-f({\left\lceil \llceil x\right\rceil \rrceil }) \bmod {\left\lceil \llceil x\right\rceil \rrceil }) = (ab \bmod {\left\lceil \llceil x\right\rceil \rrceil }), \end{aligned}$$Hence either \((-ab \bmod \left\lfloor x\right\rfloor ) \in I\) or \(-(ab \bmod {\left\lceil \llceil x\right\rceil \rrceil }) \in I\), and also either \((ab \bmod {\left\lceil \llceil x\right\rceil \rrceil }) \in I\) or \(-(-ab \bmod \left\lfloor x\right\rfloor ) \in I\).$$\begin{aligned} -f(\left\lfloor x\right\rfloor ) = -(-ab \bmod \left\lfloor x\right\rfloor ),\\ f({\left\lceil \llceil x\right\rceil \rrceil }) = -(ab \bmod {\left\lceil \llceil x\right\rceil \rrceil }). \end{aligned}$$
-
Suppose \(M_z > 0\). Then \(b = M_z\), and hence by Lemma 5.6, we have either \(-(ab \bmod {\hat{r}}) = -(M_z\beta ^{p-k} \bmod {\hat{r}}) \in I \beta ^{p-k}\) or \((-ab \bmod {\hat{r}}) = (-M_z\beta ^{p-k} \bmod {\hat{r}}) \in I \beta ^{p-k}\) for some \({\hat{r}} \in \{ \left\lfloor r\right\rfloor , {\left\lceil \llceil r\right\rceil \rrceil } \}\) and hence \(M \le {\hat{r}} \le {\left\lceil \llceil r\right\rceil \rrceil }\) by Theorem 4.23.
-
Suppose \(M_z < 0\). Then \(b = -M_z\) and by Lemma 5.6 we have either \((ab \bmod {\hat{r}}) = (-M_z\beta ^{p-k} \bmod {\hat{r}}) \in I \beta ^{p-k}\) or \(-(-ab \bmod {\hat{r}}) = -(M_z\beta ^{p-k} \bmod {\hat{r}}) \in I \beta ^{p-k}\) for some \({\hat{r}} \in \{ \left\lfloor r\right\rfloor , {\left\lceil \llceil r\right\rceil \rrceil } \}\) and hence \(M \le {\hat{r}} \le {\left\lceil \llceil r\right\rceil \rrceil }\) by Theorem 4.23.
6 Algorithms
7 Conclusion
-
What is the optimal time complexity for computing feasible numbers in general (i.e. in all bases and with subnormals)? Can we asymptotically outperform brute force search? Does an efficient algorithm exist for this task?
-
Although the algorithms in this paper are relatively brief, the work to develop them is much longer. Is there are shorter derivation of this solution?
-
The results of this paper rely on a mixture of integer and floating-point reasoning. Can these results be stated naturally using only floating-point numbers? If so, is there a version of this algorithm which uses only floating-point arithmetic?
-
Separately, Sect. 5 contains results which may be useful for solving mod constraints beyond the scope of this paper. Can they be applied more broadly?