
Basic demonstrator for mem-mode heatmap/op flagging. #18

Draft
Mittagskogel wants to merge 3 commits into main from feature/heatmap


@Mittagskogel

Collaborator

Revive tracking for individual floating-point errors in mem-mode.

@Mittagskogel

Collaborator Author

@lucasmnd this is the basic proof-of-concept for tracing errors to individual operations and building a heatmap of errors. In this state, the idea is quite primitive: compute the absolute error of each truncated operation result against the double-precision result, then divide by the truncated result to get a relative error. See Mpfr.cpp lines 435-450. I've added a few extra print statements here to show what's going on; you will want to remove them if you plan to use this code with more than a few dozen operations.
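
For readers who want the idea in isolation, here is a minimal standalone sketch of the per-operation check described above. It is not the code in Mpfr.cpp: `round_to_bits`, `OpError`, `check_op`, and the `threshold` parameter are hypothetical names made up for illustration, and the rounding helper is only a stand-in for the MPFR-based truncation.

```cpp
#include <cmath>
#include <limits>

// Hypothetical stand-in for truncated arithmetic: round x to
// `mantissa_bits` significant bits (round-to-nearest). This only makes the
// sketch self-contained; it is not how Raptor truncates values.
double round_to_bits(double x, int mantissa_bits) {
  if (x == 0.0 || !std::isfinite(x)) return x;
  int exp = 0;
  double frac = std::frexp(x, &exp);               // x = frac * 2^exp, 0.5 <= |frac| < 1
  double scaled = std::ldexp(frac, mantissa_bits); // shift the kept bits left of the point
  return std::ldexp(std::nearbyint(scaled), exp - mantissa_bits);
}

// Hypothetical result record for one operation.
struct OpError {
  double abs_err; // |exact - trunc|
  double rel_err; // abs_err / |trunc|
  bool flagged;   // true if rel_err exceeds the threshold
};

// Compare a truncated result against its double-precision shadow value.
// `threshold` is an assumed parameter; the real value lives in Mpfr.cpp.
OpError check_op(double exact, double trunc, double threshold) {
  OpError e;
  e.abs_err = std::fabs(exact - trunc);
  if (trunc != 0.0) {
    e.rel_err = e.abs_err / std::fabs(trunc);
  } else {
    // Avoid 0/0: a zero result with zero error is exact, otherwise unbounded.
    e.rel_err = (e.abs_err == 0.0) ? 0.0 : std::numeric_limits<double>::infinity();
  }
  e.flagged = e.rel_err > threshold;
  return e;
}
```

With an assumed threshold of 2e-4, `check_op(5.0e-4, 4.997253e-4, 2e-4)` is flagged (relative error of roughly 5.5e-4), while `check_op(5.0005, 5.0, 2e-4)` is not (roughly 1e-4).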

Calling __raptor_fprt_op_dump_status(unsigned n) will print the top n instances where the relative error was larger than the threshold defined in Mpfr.cpp line 281.

I've added a quick example in examples/heatmap/heatmap.cpp for you to play around with. After re-building (and re-installing) Raptor, you can build it with raptor-clang++ ./heatmap.cpp -O3 -g -o heatmap. Then try something like ./heatmap 5.0 and you should see the following:

fmul: trunc = 1.500000e+01 err = 0.000000e+00 err/trunc = 0.000000e+00
fmul: trunc = 4.997253e-04 err = 2.746582e-07 err/trunc = 5.496183e-04 (flagged)
fadd: trunc = 5.000000e+00 err = 5.000000e-04 err/trunc = 1.000000e-04
fadd: trunc = 2.000000e+01 err = 5.000000e-04 err/trunc = 2.500000e-05
Exact: 20.0005
Truncated: 20
Information about top 4 operations.
./heatmap.cpp:17:23: 1xfmul L1 Error Norm: 2.74658e-07 Number of violations: 1 Ignored 0 times.
./heatmap.cpp:18:24: 1xfadd L1 Error Norm: 0.0005 Number of violations: 0 Ignored 0 times.
./heatmap.cpp:16:21: 1xfmul L1 Error Norm: 0 Number of violations: 0 Ignored 0 times.
./heatmap.cpp:20:16: 1xfadd L1 Error Norm: 0.0005 Number of violations: 0 Ignored 0 times.
4 ops were truncated.

So you can see that the multiplication in * frac in line 17 of heatmap.cpp got flagged for its large relative error (about 5.5e-04). The bad addition (large + small fp) in line 18 doesn't get flagged because its relative error stays small (1.0e-04), even though its absolute error (5.0e-04) is far larger.

A few more details:

  • It only works in mem-mode because it exploits the __raptor_fp struct defined in Trace.cpp.
  • RAPTOR_FPRT_ENABLE_SHADOW_RESIDUALS needs to be enabled in Mpfr.cpp (for this demo, I've turned it on permanently in line 276).
  • The code snippet to calculate and flag the error is currently duplicated for each operation type (__RAPTOR_MPFR_SINGOP, __RAPTOR_MPFR_BIN, etc.).
  • Statistics are grouped by debug location (-g). If you call foo() twice, you will see that the number of "violations" increases.
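
To illustrate the grouping in the last point, here is a rough sketch of how per-location statistics could be accumulated and ranked. It is an assumption-laden illustration, not the runtime's actual data structure: `LocStats`, `OpTable`, `record`, and `top` are hypothetical names, and the ranking (violations first, then L1 error norm) is a guess rather than the order __raptor_fprt_op_dump_status uses.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-location counters, mirroring the fields in the dump.
struct LocStats {
  unsigned long count = 0;      // how many ops were seen at this location
  double l1_err = 0.0;          // sum of absolute errors (L1 error norm)
  unsigned long violations = 0; // how many ops exceeded the threshold
};

// Hypothetical table keyed by a debug location such as "file:line:col".
class OpTable {
public:
  using Entry = std::pair<std::string, LocStats>;

  // Fold one operation into the entry for its source location.
  void record(const std::string &loc, double abs_err, bool flagged) {
    LocStats &s = table_[loc];
    s.count += 1;
    s.l1_err += std::fabs(abs_err);
    if (flagged) s.violations += 1;
  }

  // Return at most n entries. Assumed ranking: most violations first,
  // ties broken by larger L1 error norm.
  std::vector<Entry> top(std::size_t n) const {
    std::vector<Entry> v(table_.begin(), table_.end());
    std::stable_sort(v.begin(), v.end(), [](const Entry &a, const Entry &b) {
      if (a.second.violations != b.second.violations)
        return a.second.violations > b.second.violations;
      return a.second.l1_err > b.second.l1_err;
    });
    if (v.size() > n) v.resize(n);
    return v;
  }

private:
  std::map<std::string, LocStats> table_;
};
```

Because the key is the source location, executing the same line twice (for example by calling foo() twice) adds to the same entry, so its count and violation total grow instead of producing a second row.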

This is very much a prototype. I'm looking forward to getting some insights into your use-case and building on your feedback to turn this into a proper feature.
