{-# LANGUAGE GADTs #-}
module CmmSink (
     cmmSink
  ) where

import GhcPrelude

import Cmm
import CmmOpt
import CmmLive
import CmmUtils
import DynFlags
import Unique
import UniqFM

-- Compact sets for membership tests of local variables.

emptyLRegSet :: LRegSet
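The elided set operations fit a compact integer set; a minimal standalone sketch, assuming the set is keyed by each register's integer unique (the helper names below are illustrative, not necessarily the module's own):

```haskell
import qualified Data.IntSet as IntSet

-- A compact set of local-register keys; in the real module the key would
-- be the register's Unique.  Minimal standalone sketch.
type LRegSet = IntSet.IntSet

emptyLRegSet :: LRegSet
emptyLRegSet = IntSet.empty

nullLRegSet :: LRegSet -> Bool
nullLRegSet = IntSet.null

insertLRegSet :: Int -> LRegSet -> LRegSet
insertLRegSet = IntSet.insert

elemLRegSet :: Int -> LRegSet -> Bool
elemLRegSet = IntSet.member
```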

-- -----------------------------------------------------------------------------
-- Sinking and inlining

-- This is an optimisation pass that
--  (a) moves assignments closer to their uses, to reduce register pressure
--  (b) pushes assignments into a single branch of a conditional if possible
--  (c) inlines assignments to registers that are mentioned only once
--  (d) discards dead assignments
--
-- This tightens up lots of register-heavy code.  It is particularly
-- helpful in the Cmm generated by the Stg->Cmm code generator, in
-- which every function starts with a copyIn sequence like:
--
--    x1 = R1
--    x2 = Sp[8]
--    x3 = Sp[16]
--    if (Sp - 32 < SpLim) then L1 else L2
--
-- we really want to push the x1..x3 assignments into the L2 branch.
--
-- Algorithm:
--
--  * Start by doing liveness analysis.
--
--  * Keep a list of assignments A; earlier ones may refer to later ones.
--    Currently we only sink assignments to local registers, because we
--    don't have liveness information about global registers.
--
--  * Walk forwards through the graph, look at each node N:
--
--    * If it is a dead assignment, i.e. assignment to a register that is
--      not used after N, discard it.
--
--    * Try to inline based on the current list of assignments:
--      * If any assignments in A (1) occur only once in N, and (2) are
--        not live after N, inline the assignment and remove it
--        from A.
--
--      * If an assignment in A is cheap (RHS is a local register), then
--        inline the assignment and keep it in A in case it is used afterwards.
--
--      * Otherwise don't inline.
--
--    * If N is an assignment to a local register, pick up the assignment
--      and add it to A.
--
--    * If N is not an assignment to a local register:
--      * remove any assignments from A that conflict with N, and
--        place them before N in the current block.  We call this
--        "dropping" the assignments.
--
--      * An assignment conflicts with N if it:
--        - assigns to a register mentioned in N
--        - mentions a register assigned by N
--        - reads from memory written by N
--      * do this recursively, dropping dependent assignments
--
--    * At an exit node:
--      * drop any assignments that are live on more than one successor
--        and are not trivial
--      * if any successor has more than one predecessor (a join-point),
--        drop everything live in that successor.  Since we only propagate
--        assignments that are not dead at the successor, we will therefore
--        eliminate all assignments dead at this point.  Thus analysis of a
--        join-point will always begin with an empty list of assignments.
--
--
-- As a result of the above algorithm, sinking deletes some dead assignments
-- (transitively, even).  This isn't as good as removeDeadAssignments,
-- but it's much cheaper.
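The dropping step above can be sketched on a toy representation. `Assig`, `conflictsWith` and `dropConflicting` below are hypothetical stand-ins, covering only the register-based conflict rules (no memory check):

```haskell
-- A pending assignment r := e, represented by its LHS register and the
-- registers its RHS reads.  Names are illustrative, not GHC's types.
type Reg = String

data Assig = Assig { lhs :: Reg, rhsReads :: [Reg] }
  deriving (Eq, Show)

-- An assignment conflicts with a node writing registers 'ws' and reading
-- registers 'rs' if it assigns to a register the node mentions, or
-- mentions a register the node assigns to.
conflictsWith :: [Reg] -> [Reg] -> Assig -> Bool
conflictsWith ws rs (Assig l es) =
  l `elem` ws || l `elem` rs || any (`elem` ws) es

-- Split the pending list (in reverse program order: earlier entries may
-- refer to later ones) into assignments that must be dropped before the
-- node and those that may keep sinking.  Dropping is transitive: if a
-- dropped assignment reads a register, the pending assignment defining
-- that register is dropped too.
dropConflicting :: [Reg] -> [Reg] -> [Assig] -> ([Assig], [Assig])
dropConflicting ws rs = go []
  where
    go dropped [] = (reverse dropped, [])
    go dropped (a:rest)
      | conflictsWith ws rs a || any (\d -> lhs a `elem` rhsReads d) dropped
      = go (a:dropped) rest
      | otherwise
      = let (ds, keeps) = go dropped rest in (ds, a : keeps)
```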

-- -----------------------------------------------------------------------------
-- things that we aren't optimising very well yet.
--
-- -----------
-- (1) From GHC's FastString.hashStr:
--
--  s2ay:
--      if ((_s2an::I64 == _s2ao::I64) >= 1) goto c2gn; else goto c2gp;
--  c2gn:
--      R1 = _s2au::I64;
--      call (I64[Sp])(R1) args: 8, res: 0, upd: 8;
--  c2gp:
--      _s2cO::I64 = %MO_S_Rem_W64(%MO_UU_Conv_W8_W64(I8[_s2aq::I64 + (_s2an::I64 << 0)]) + _s2au::I64 * 128,
--                                 4091);
--      _s2an::I64 = _s2an::I64 + 1;
--      _s2au::I64 = _s2cO::I64;
--      goto s2ay;
--
-- a nice loop, but we didn't eliminate the silly assignment at the end.
-- See Note [dependent assignments], which would probably fix this.
-- This is #8336 on Trac.
--
-- -----------
-- (2) From stg_atomically_frame in PrimOps.cmm
--
-- We have a diamond control flow:
--
--     x = ...
--       |
--      / \
--     A   B
--      \ /
--       |
--    use of x
--
-- Now x won't be sunk down to its use, because we won't push it into
-- both branches of the conditional.  We certainly do have to check
-- that we can sink it past all the code in both A and B, but having
-- discovered that, we could sink it to its use.
--
-- -----------------------------------------------------------------------------

-- Assignment caches AbsMem, an abstraction of the memory read by
-- the RHS of the assignment.

-- A sequence of assignments; kept in *reverse* order
-- So the list [ x=e1, y=e2 ] means the sequence of assignments
--     y = e2
--     x = e1

  where
    liveness = cmmLocalLiveness dflags graph
    blocks = revPostorder graph
    join_pts = findJoinPoints blocks

    sink _ [] = []

      -- pprTrace "sink" (ppr lbl) $
      where
        lbl = entryLabel b

        -- Annotate the middle nodes with the registers live *after*
        -- the node.  This will help us decide whether we can inline
        -- an assignment in the current node or not.

        -- Now sink and inline in this block

        -- We cannot sink into join points (successors with more than
        -- one predecessor), so identify the join points and the set
        -- of registers live in them.

        -- We do not want to sink an assignment into multiple branches,
        -- so identify the set of registers live in multiple successors.
        -- This is made more complicated because when we sink an assignment
        -- into one branch, this might change the set of registers that are
        -- now live in multiple branches.
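The "live in multiple successors" set can be sketched by counting, per register, how many successor live sets contain it; `liveInMulti` below is a hypothetical standalone helper, not the pass's actual code:

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- Keep the registers that appear in two or more successor live sets.
liveInMulti :: Ord r => [Set.Set r] -> Set.Set r
liveInMulti live_sets =
  Map.keysSet (Map.filter (> 1) counts)
  where
    -- how many live sets mention each register
    counts = Map.unionsWith (+)
               [ Map.fromSet (const (1 :: Int)) s | s <- live_sets ]
```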

        live_in_multi live_sets r =

        -- Now, drop any assignments that we will not sink any further.

          where
            should_drop = conflicts dflags a final_last
                       || r `Set.member` live_in_joins

            live_rhs = foldRegsUsed dflags extendRegSet emptyRegSet rhs

{- TODO: enable this later, when we have some good tests in place to
   measure the effect and tune it.

-- small: an expression we don't mind duplicating
isSmall :: CmmExpr -> Bool
isSmall (CmmReg (CmmLocal _)) = True --
isSmall (CmmLit _) = True
isSmall (CmmMachOp (MO_Add _) [x,y]) = isTrivial x && isTrivial y
isSmall (CmmRegOff (CmmLocal _) _) = True
isSmall _ = False
-}

--
-- We allow duplication of trivial expressions: registers (both local and
-- global) and literals.
--

-- GlobalRegs that are loads from BaseReg are not trivial

--
-- annotate each node with the set of registers live *after* the node
--

--
-- Find the blocks that have multiple predecessors (join points)
--

  where

--
-- filter the list of assignments to remove any assignments that
-- are not live in a continuation.
--

      where
          -- Note that we must keep assignments that are
          -- referred to by other assignments we have
          -- already kept.

-- -----------------------------------------------------------------------------
-- Walk through the nodes of a block, sinking and inlining assignments
-- as we go.
--
-- On input we pass in:
--    * the list of nodes in the block
--    * a list of assignments that appeared *before* this block and
--      that are being sunk.
--
-- On output we get:
--    * a new block
--    * a list of assignments that will be placed *after* that block.
--

walk :: DynFlags
     -> [(LocalRegSet, CmmNode O O)]    -- nodes of the block, annotated with
                                        -- the set of registers live *after*
                                        -- this node.
     -> Assignments                     -- The current list of
                                        -- assignments we are sinking.
                                        -- Earlier assignments may refer
                                        -- to later ones.
     -> ( Block CmmNode O O             -- The new block
        , Assignments                   -- Assignments to sink further
        )

walk dflags nodes assigs = go nodes emptyBlock assigs

  where

        -- discard dead assignment

      where
        node1 = constantFoldNode dflags node

--
-- Heuristic to decide whether to pick up and sink an assignment.
-- Currently we pick up all assignments to local registers.  It might
-- be profitable to sink assignments to global regs too, but the
-- liveness analysis doesn't track those (yet) so we can't.
--

shouldSink _ _other = Nothing

--
-- Discard dead assignments.  This doesn't do as good a job as
-- removeDeadAssignments, because it would need multiple passes
-- to get all the dead code, but it catches the common case of
-- superfluous reloads from the stack that the stack allocator
-- leaves behind.
--
-- Also we catch "r = r" here.  You might think it would fall
-- out of inlining, but the inliner will see that r is live
-- after the instruction and choose not to inline r in the rhs.
--

shouldDiscard node live

dropAssignments dflags should_drop state assigs
  where

      where

-- -----------------------------------------------------------------------------
-- Try to inline assignments into a node.
-- This also does constant folding for primops, since
-- inlining opens up opportunities for doing so.

tryToInline
   :: DynFlags
                      -- node.  We cannot inline anything
                      -- that is live after the node, unless
                      -- it is small enough to duplicate.
   -> (
        CmmNode O x   -- New node
      )

tryToInline dflags live node assigs = go usages node emptyLRegSet assigs
 where
   usages = foldLocalRegsUsed dflags addUsage emptyUFM node

        | cannot_inline            = dont_inline
        | occurs_once              = inline_and_discard
        | isTrivial dflags rhs     = inline_and_keep
        where
          discard = go usages node skipped rest

                        usages rhs
          -- we must not inline anything that is mentioned in the RHS
          -- of a binding that we have already skipped, so we set the
          -- usages of the regs on the RHS to 2.
                 || l `elemLRegSet` skipped

          l_usages = lookupUFM usages l
          l_live   = l `elemRegSet` live

          -- inl_exp is where the inlining actually takes place!
                = cmmOffset dflags rhs off
                  -- re-constant fold after inlining
          inl_exp other = other

{- Note [improveConditional]

cmmMachOpFold tries to simplify conditionals to turn things like
  (a == b) != 1
into
  (a != b)
but there's one case it can't handle: when the comparison is over
floating-point values, we can't invert it, because floating-point
comparisons aren't invertible (because of NaNs).

But we *can* optimise this conditional by swapping the true and false
branches.  Given
  CmmCondBranch ((a >## b) != 1) t f
we can turn it into
  CmmCondBranch (a >## b) f t

So here we catch conditionals that weren't optimised by cmmMachOpFold,
and apply the above transformation to eliminate the comparison against 1.

It's tempting to just turn every != into == and then let cmmMachOpFold
do its thing, but that risks changing a nice fall-through conditional
into one that requires two jumps (see swapcond_last in CmmContFlowOpt),
so instead we carefully look for just the cases where we can eliminate
a comparison.
-}
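The transformation can be sketched on toy types; `Expr`, `Branch` and `improveCond` below are simplified stand-ins for the Cmm types, not GHC's constructors:

```haskell
-- Simplified stand-ins for CmmExpr / CmmCondBranch: a comparison node
-- and a "!= 1" wrapper.
data Expr = Var String
          | CmpGt Expr Expr      -- a >## b : a comparison we cannot invert
          | NeOne Expr           -- e != 1
  deriving (Eq, Show)

data Branch = CondBranch Expr String String   -- scrutinee, true lbl, false lbl
  deriving (Eq, Show)

-- If the scrutinee has shape (cmp != 1) where cmp is itself a comparison,
-- drop the "!= 1" and swap the branch targets instead of inverting cmp.
improveCond :: Branch -> Branch
improveCond (CondBranch (NeOne cmp@(CmpGt _ _)) t f) = CondBranch cmp f t
improveCond other = other
```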

improveConditional
  | neLike mop, isComparisonExpr x
  where

improveConditional other = other

-- Note [dependent assignments]
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--
-- If our assignment list looks like
--
--    [ y = e,  x = ... y ... ]
--
-- We cannot inline x.  Remember this list is really in reverse order,
-- so it means  x = ... y ...; y = e
--
-- Hence if we inline x, the outer assignment to y will capture the
-- reference in x's right hand side.
--
-- In this case we should rename the y in x's right-hand side,
-- i.e. change the list to [ y = e, x = ... y1 ..., y1 = y ]
-- Now we can go ahead and inline x.
--
-- For now we do nothing, because this would require putting
-- everything inside UniqSM.
--
-- One more variant of this (#7366):
--
--    [ y = e, y = z ]
--
-- If we don't want to inline y = e, because y is used many times, we
-- might still be tempted to inline y = z (because we always inline
-- trivial rhs's).  But of course we can't, because y is equal to e,
-- not z.
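A tiny evaluator makes the capture concrete: executing { x = y; y = 2 } and then reading x must see the *old* y, which is exactly what naively inlining x (as `y`) at a later use site would break. All names below are illustrative:

```haskell
import qualified Data.Map.Strict as Map

-- Toy registers-to-values environment and expressions.
data Expr = Var String | Lit Int deriving Show

eval :: Map.Map String Int -> Expr -> Int
eval env (Var v) = env Map.! v
eval _   (Lit n) = n

-- Run assignments in program order, updating the environment.
runSeq :: Map.Map String Int -> [(String, Expr)] -> Map.Map String Int
runSeq = foldl (\env (r, e) -> Map.insert r (eval env e) env)
```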

-- Note [discard during inlining]
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-- Opportunities to discard assignments sometimes appear after we've
-- done some inlining.  Here's an example:
--
--      x = R1;
--      y = P64[x + 7];
--      z = P64[x + 15];
--      /* z is dead */
--      R1 = y & (-8);
--
-- The x assignment is trivial, so we inline it in the RHS of y, and
-- keep both x and y.  z gets dropped because it is dead, then we
-- inline y, and we have a dead assignment to x.  If we don't notice
-- that x is dead in tryToInline, we end up retaining it.

        f _ z = z

-- we don't inline into CmmUnsafeForeignCall if the expression refers
-- to global registers.  This is a HACK to avoid global registers
-- clashing with C argument-passing registers; really the back-end
-- ought to be able to handle it properly, but currently neither PprC
-- nor the NCG can do it.  See Note [Register parameter passing].
-- See also StgCmmForeign:load_args_into_temps.

-- -----------------------------------------------------------------------------

-- | @conflicts (r,e) node@ is @False@ if and only if the assignment
-- @r = e@ can be safely commuted past statement @node@.

  -- (1) node defines registers used by rhs of assignment. This catches
  --     assignments and all three kinds of calls. See Note [Sinking and calls]

  -- (2) node uses register defined by assignment

  -- (3) a store to an address conflicts with a read of the same memory

  -- (4) an assignment to Hp/Sp conflicts with a heap/stack read respectively

  -- (5) foreign calls clobber heap: see Note [Foreign calls clobber heap]

  -- (6) native calls clobber any memory

  -- (7) otherwise, no conflict
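Checks (1)-(3) can be sketched over plain sets plus a four-point memory abstraction (a hypothetical simplification; the real `conflicts` also handles the Sp/Hp cases and calls):

```haskell
import qualified Data.Set as Set

-- A toy memory abstraction: heap and stack reads/writes never alias each
-- other, NoMem aliases nothing.
data MemAbs = NoM | Heap | Stack | AnyM deriving (Eq, Show)

memConflict :: MemAbs -> MemAbs -> Bool
memConflict NoM _      = False
memConflict _   NoM    = False
memConflict Heap Stack = False
memConflict Stack Heap = False
memConflict _ _        = True

-- (r, regs read by e, memory read by e)  vs
-- (node defs, node uses, memory written by node)
conflictsT :: (String, Set.Set String, MemAbs)
           -> (Set.Set String, Set.Set String, MemAbs)
           -> Bool
conflictsT (r, uses, memReads) (defs, nodeUses, memWrites) =
     not (Set.null (uses `Set.intersection` defs))  -- (1) node defines a reg used by e
  || r `Set.member` nodeUses                        -- (2) node uses the reg we define
  || r `Set.member` defs                            --     or redefines it
  || memConflict memReads memWrites                 -- (3) store overlaps our read
```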

-- Returns True if node defines any global registers that are used in the
-- Cmm expression
globalRegistersConflict :: DynFlags -> CmmExpr -> CmmNode e x -> Bool
globalRegistersConflict dflags expr node =
    foldRegsDefd dflags (\b r -> b || regUsedIn dflags (CmmGlobal r) expr)
                 False node

-- Returns True if node defines any local registers that are used in the
-- Cmm expression
localRegistersConflict :: DynFlags -> CmmExpr -> CmmNode e x -> Bool
localRegistersConflict dflags expr node =
    foldRegsDefd dflags (\b r -> b || regUsedIn dflags (CmmLocal r) expr)
                 False node

-- Note [Sinking and calls]
-- ~~~~~~~~~~~~~~~~~~~~~~~~
--
-- We have three kinds of calls: normal (CmmCall), safe foreign (CmmForeignCall)
-- and unsafe foreign (CmmUnsafeForeignCall).  We perform the sinking pass
-- after stack layout (see Note [Sinking after stack layout]), which leads to
-- two invariants related to calls:
--
--   a) during the stack layout phase all safe foreign calls are turned into
--      unsafe foreign calls (see Note [Lower safe foreign calls]).  This
--      means that we will never encounter a CmmForeignCall node when running
--      sinking after stack layout
--
--   b) stack layout saves all variables live across a call on the stack
--      just before making a call (remember that we are not sinking
--      assignments to stack locations):
--
--       L1:
--          x = R1
--          P64[Sp - 16] = L2
--          P64[Sp - 8]  = x
--          Sp = Sp - 16
--          call f() returns L2
--       L2:
--
-- We will attempt to sink { x = R1 } but we will detect a conflict with
-- { P64[Sp - 8] = x } and hence we will drop { x = R1 } without even
-- checking whether it conflicts with { call f() }.  In this way we will
-- never need to check whether any assignment conflicts with CmmCall.
-- Remember that we still need to check for potential memory conflicts.
--
-- So the result is that we only need to worry about CmmUnsafeForeignCall
-- nodes when checking conflicts (see Note [Unsafe foreign calls clobber
-- caller-save registers]).  This assumption holds only when we do sinking
-- after stack layout.  If we run it before stack layout we need to check
-- for possible conflicts with all three kinds of calls.  Our `conflicts`
-- function does that by using the generic foldRegsDefd and foldRegsUsed
-- functions defined in the DefinerOfRegs and UserOfRegs typeclasses.
--

-- An abstraction of memory read or written.
data AbsMem
  = NoMem            -- no memory accessed
  | AnyMem           -- arbitrary memory
  | HeapMem          -- definitely heap memory
  | StackMem         -- definitely stack memory
  | SpMem            -- <size>[Sp+n]
       {-# UNPACK #-} !Int -- offset
       {-# UNPACK #-} !Int -- width

-- Having SpMem is important because it lets us float loads from Sp
-- past stores to Sp as long as they don't overlap, and this helps to
-- unravel some long sequences of
--    x1 = [Sp + 8]
--    x2 = [Sp + 16]
--    ...
--    [Sp + 8]  = xi
--    [Sp + 16] = xj
--
-- Note that SpMem is invalidated if Sp is changed, but the definition
-- of 'conflicts' above handles that.
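The non-overlap test that lets Sp loads float past Sp stores amounts to half-open interval disjointness; a sketch with byte offsets and widths (a hypothetical helper — in the pass itself this check lives inside `conflicts`):

```haskell
-- Two Sp-relative accesses <w1>[Sp+o1] and <w2>[Sp+o2] conflict only if
-- their byte ranges [o, o+w) intersect.
spOverlap :: (Int, Int) -> (Int, Int) -> Bool
spOverlap (o1, w1) (o2, w2) = o1 < o2 + w2 && o2 < o1 + w1
```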

-- ToDo: this won't currently fix the following commonly occurring code:
--    x1 = [R1 + 8]
--    x2 = [R1 + 16]
--    ..
--    [Hp - 8]  = x1
--    [Hp - 16] = x2
--    ..
--
-- because [R1 + 8] and [Hp - 8] are both HeapMem.  We know that
-- assignments to [Hp + n] do not conflict with any other heap memory,
-- but this is tricky to nail down.  What if we had
--
--   x = Hp + n
--   [x] = ...
--
-- the store to [x] should be "new heap", not "old heap".
-- Furthermore, you could imagine that if we started inlining
-- functions in Cmm then there might well be reads of heap memory
-- that was written in the same basic block.  To take advantage of
-- non-aliasing of heap memory we will have to be more clever.

-- Note [Foreign calls clobber heap]
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--
-- It is tempting to say that foreign calls clobber only
-- non-heap/stack memory, but unfortunately we break this invariant in
-- the RTS.  For example, in stg_catch_retry_frame we call
-- stmCommitNestedTransaction() which modifies the contents of the
-- TRec it is passed (this actually caused incorrect code to be
-- generated).
--
-- Since the invariant is true for the majority of foreign calls,
-- perhaps we ought to have a special annotation for calls that can
-- modify heap/stack memory.  For now we just use the conservative
-- definition here.
--
-- Some CallishMachOps imply a memory barrier, e.g. AtomicRMW, and
-- therefore we should never float any memory operations across one of
-- these calls.

bothMems :: AbsMem -> AbsMem -> AbsMem
bothMems NoMem    x         = x
bothMems x        NoMem     = x
bothMems HeapMem  HeapMem   = HeapMem
bothMems StackMem StackMem  = StackMem
bothMems (SpMem o1 w1) (SpMem o2 w2)
  | o1 == o2  = SpMem o1 (max w1 w2)
  | otherwise = StackMem
bothMems SpMem{}  StackMem  = StackMem
bothMems StackMem SpMem{}   = StackMem
bothMems _         _        = AnyMem
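To see the join as a lattice operation, here is a self-contained toy copy (primed names to avoid clashing with the module's types; the SpMem fields are assumed to be offset and width in bytes):

```haskell
-- Standalone sketch of the AbsMem join, for experimentation.
data AbsMem' = NoMem' | AnyMem' | HeapMem' | StackMem'
             | SpMem' Int Int      -- offset, width in bytes
  deriving (Eq, Show)

both' :: AbsMem' -> AbsMem' -> AbsMem'
both' NoMem' x = x
both' x NoMem' = x
both' HeapMem'  HeapMem'  = HeapMem'
both' StackMem' StackMem' = StackMem'
both' (SpMem' o1 w1) (SpMem' o2 w2)
  | o1 == o2  = SpMem' o1 (max w1 w2)  -- same slot: widen
  | otherwise = StackMem'              -- different slots: some stack memory
both' SpMem'{}  StackMem' = StackMem'
both' StackMem' SpMem'{}  = StackMem'
both' _ _ = AnyMem'
```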

exprMem :: DynFlags -> CmmExpr -> AbsMem
exprMem dflags (CmmLoad addr w)  = bothMems (loadAddr dflags addr (typeWidth w))
                                            (exprMem dflags addr)
exprMem _ _                      = NoMem

loadAddr :: DynFlags -> CmmExpr -> Width -> AbsMem
loadAddr dflags e w =
  case e of
    CmmRegOff r i -> regAddr dflags r i w
    _other | regUsedIn dflags spReg e -> StackMem
           | otherwise                -> AnyMem

regAddr :: DynFlags -> CmmReg -> Int -> Width -> AbsMem
regAddr dflags r _ _ | isGcPtrType (cmmRegType dflags r) = HeapMem -- yay! GCPtr pays for itself
regAddr _ _ _ _ = AnyMem

{-
Note [Inline GlobalRegs?]

Should we freely inline GlobalRegs?

Actually it doesn't make a huge amount of difference either way, so we
*do* currently treat GlobalRegs as "trivial" and inline them
everywhere, but for what it's worth, here is what I discovered when I
(SimonM) looked into this:

Common sense says we should not inline GlobalRegs, because when we
have

  x = R1

the register allocator will coalesce this assignment, generating no
code, and simply record the fact that x is bound to $rbx (or
whatever).  Furthermore, if we were to sink this assignment, then the
range of code over which R1 is live increases, and the range of code
over which x is live decreases.  All things being equal, it is better
for x to be live than R1, because R1 is a fixed register whereas x can
live in any register.  So we should neither sink nor inline 'x = R1'.

However, not inlining GlobalRegs can have surprising
consequences. e.g. (cgrun020)

  c3EN:
      _s3DB::P64 = R1;
      _c3ES::P64 = _s3DB::P64 & 7;
      if (_c3ES::P64 >= 2) goto c3EU; else goto c3EV;
  c3EU:
      _s3DD::P64 = P64[_s3DB::P64 + 6];
      _s3DE::P64 = P64[_s3DB::P64 + 14];
      I64[Sp - 8] = c3F0;
      R1 = _s3DE::P64;
      P64[Sp] = _s3DD::P64;

inlining the GlobalReg gives:

  c3EN:
      if (R1 & 7 >= 2) goto c3EU; else goto c3EV;
  c3EU:
      I64[Sp - 8] = c3F0;
      _s3DD::P64 = P64[R1 + 6];
      R1 = P64[R1 + 14];
      P64[Sp] = _s3DD::P64;

but if we don't inline the GlobalReg, instead we get:

      _s3DB::P64 = R1;
      if (_s3DB::P64 & 7 >= 2) goto c3EU; else goto c3EV;
  c3EU:
      I64[Sp - 8] = c3F0;
      R1 = P64[_s3DB::P64 + 14];
      P64[Sp] = P64[_s3DB::P64 + 6];

This looks better - we managed to inline _s3DD - but in fact it
generates an extra reg-reg move:

.Lc3EU:
        movq $c3F0_info,-8(%rbp)
        movq %rbx,%rax
        movq 14(%rbx),%rbx
        movq 6(%rax),%rax
        movq %rax,(%rbp)

because _s3DB is now live across the R1 assignment, and so we lost the
benefit of coalescing.

Who is at fault here?  Perhaps if we knew that _s3DB was an alias for
R1, then we would not sink a reference to _s3DB past the R1
assignment.  Or perhaps we *should* do that - we might gain by sinking
it, despite losing the coalescing opportunity.

Sometimes not inlining global registers wins by virtue of the rule
about not inlining into arguments of a foreign call, e.g. (T7163) this
is what happens when we inline F1:

  _s3L2::F32 = F1;
  _c3O3::F32 = %MO_F_Mul_W32(F1, 10.0 :: W32);
  (_s3L7::F32) = call "ccall" arg hints: [] result hints: [] rintFloat(_c3O3::F32);

but if we don't inline F1:

  (_s3L7::F32) = call "ccall" arg hints: [] result hints: [] rintFloat(%MO_F_Mul_W32(_s3L2::F32,
                                                                                     10.0 :: W32));
-}