{-# LANGUAGE GADTs #-}

module CmmSink ( cmmSink ) where

import GhcPrelude

import Cmm
import CmmOpt
import CmmLive
import CmmUtils
import DynFlags
import Unique
import UniqFM

-- Compact sets for membership tests of local variables.

emptyLRegSet :: LRegSet

-- -----------------------------------------------------------------------------
-- Sinking and inlining

-- This is an optimisation pass that
--  (a) moves assignments closer to their uses, to reduce register pressure
--  (b) pushes assignments into a single branch of a conditional if possible
--  (c) inlines assignments to registers that are mentioned only once
--  (d) discards dead assignments
--
-- This tightens up lots of register-heavy code.  It is particularly
-- helpful in the Cmm generated by the Stg->Cmm code generator, in
-- which every function starts with a copyIn sequence like:
--
--    x1 = R1
--    x2 = Sp[8]
--    x3 = Sp[16]
--    if (Sp - 32 < SpLim) then L1 else L2
--
-- we really want to push the x1..x3 assignments into the L2 branch.
--
-- Algorithm:
--
--  * Start by doing liveness analysis.
--
--  * Keep a list of assignments A; earlier ones may refer to later ones.
--    Currently we only sink assignments to local registers, because we don't
--    have liveness information about global registers.
--
--  * Walk forwards through the graph, look at each node N:
--
--    * If it is a dead assignment, i.e. assignment to a register that is
--      not used after N, discard it.
--
--    * Try to inline based on current list of assignments
--      * If any assignments in A (1) occur only once in N, and (2) are
--        not live after N, inline the assignment and remove it
--        from A.
--
--      * If an assignment in A is cheap (RHS is local register), then
--        inline the assignment and keep it in A in case it is used afterwards.
--
--      * Otherwise don't inline.
--
--    * If N is assignment to a local register pick up the assignment
--      and add it to A.
--
--    * If N is not an assignment to a local register:
--      * remove any assignments from A that conflict with N, and
--        place them before N in the current block.  We call this
--        "dropping" the assignments.
--
--      * An assignment conflicts with N if it:
--        - assigns to a register mentioned in N
--        - mentions a register assigned by N
--        - reads from memory written by N
--      * do this recursively, dropping dependent assignments
--
--    * At an exit node:
--      * drop any assignments that are live on more than one successor
--        and are not trivial
--      * if any successor has more than one predecessor (a join-point),
--        drop everything live in that successor.  Since we only propagate
--        assignments that are not dead at the successor, we will therefore
--        eliminate all assignments dead at this point.  Thus analysis of a
--        join-point will always begin with an empty list of assignments.
--
--
-- As a result of the above algorithm, sinking deletes some dead assignments
-- (transitively, even).  This isn't as good as removeDeadAssignments,
-- but it's much cheaper.
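The "dropping" step above can be sketched on a toy model. All names here (`Assig`, `dropConflicting`) are illustrative only; the real pass works over CmmNode with DynFlags-aware register folds and also checks memory conflicts:

```haskell
-- A toy model of the "dropping" step: assignments being sunk are kept
-- in reverse order, and any that conflict with the current node are
-- emitted (dropped) before it, together with their dependencies.
import Data.List (partition)

type Reg   = String
data Assig = Assig { lhs :: Reg, rhsRegs :: [Reg] } deriving (Eq, Show)

-- An assignment conflicts with a node that defines 'defs' and uses 'uses'
-- if it assigns a register the node uses, or mentions one the node defines.
conflictsWith :: [Reg] -> [Reg] -> Assig -> Bool
conflictsWith defs uses a =
  lhs a `elem` uses || any (`elem` defs) (rhsRegs a)

-- Split the pending list into assignments to drop before the node and
-- those that may keep sinking.  Dropping is recursive: anything whose
-- RHS mentions a dropped LHS is dropped too.
dropConflicting :: [Reg] -> [Reg] -> [Assig] -> ([Assig], [Assig])
dropConflicting defs uses assigs = go (partition (conflictsWith defs uses) assigs)
  where
    go (dropped, kept)
      | null more = (dropped, kept)
      | otherwise = go (dropped ++ more, kept')
      where
        droppedLhs    = map lhs dropped
        (more, kept') = partition (\a -> any (`elem` droppedLhs) (rhsRegs a)) kept

main :: IO ()
main = do
  -- pending (reverse order): x2 = x1 + 1; x1 = R1
  let pending = [Assig "x2" ["x1"], Assig "x1" ["R1"]]
  -- a node that redefines R1: x1 conflicts directly, x2 depends on x1
  let (dropped, kept) = dropConflicting ["R1"] [] pending
  print (map lhs dropped)
  print (map lhs kept)
```

Note how `x2` is dropped even though it does not itself conflict with the node, because its RHS mentions the dropped `x1`.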

-- -----------------------------------------------------------------------------
-- things that we aren't optimising very well yet.
--
-- -----------
-- (1) From GHC's FastString.hashStr:
--
--  s2ay:
--      if ((_s2an::I64 == _s2ao::I64) >= 1) goto c2gn; else goto c2gp;
--  c2gn:
--      R1 = _s2au::I64;
--      call (I64[Sp])(R1) args: 8, res: 0, upd: 8;
--  c2gp:
--      _s2cO::I64 = %MO_S_Rem_W64(%MO_UU_Conv_W8_W64(I8[_s2aq::I64 + (_s2an::I64 << 0)]) + _s2au::I64 * 128,
--                                 4091);
--      _s2an::I64 = _s2an::I64 + 1;
--      _s2au::I64 = _s2cO::I64;
--      goto s2ay;
--
-- a nice loop, but we didn't eliminate the silly assignment at the end.
-- See Note [dependent assignments], which would probably fix this.
-- This is #8336.
--
-- -----------
-- (2) From stg_atomically_frame in PrimOps.cmm
--
-- We have a diamond control flow:
--
--     x = ...
--       |
--      / \
--     A   B
--      \ /
--       |
--    use of x
--
-- Now x won't be sunk down to its use, because we won't push it into
-- both branches of the conditional.  We certainly do have to check
-- that we can sink it past all the code in both A and B, but having
-- discovered that, we could sink it to its use.
--
-- -----------------------------------------------------------------------------

-- Assignment caches AbsMem, an abstraction of the memory read by
-- the RHS of the assignment.

-- A sequence of assignments; kept in *reverse* order
-- So the list [ x=e1, y=e2 ] means the sequence of assignments
--     y = e2
--     x = e1

 where
  liveness = cmmLocalLiveness dflags graph

  blocks = revPostorder graph

  join_pts = findJoinPoints blocks

  sink _ [] = []
  -- pprTrace "sink" (ppr lbl) $
   where
    lbl = entryLabel b

    -- Annotate the middle nodes with the registers live *after*
    -- the node.  This will help us decide whether we can inline
    -- an assignment in the current node or not.

    -- Now sink and inline in this block

    -- We cannot sink into join points (successors with more than
    -- one predecessor), so identify the join points and the set
    -- of registers live in them.

    -- We do not want to sink an assignment into multiple branches,
    -- so identify the set of registers live in multiple successors.
    -- This is made more complicated because when we sink an assignment
    -- into one branch, this might change the set of registers that are
    -- now live in multiple branches.

    live_in_multi live_sets r =

    -- Now, drop any assignments that we will not sink any further.
      where
        should_drop =  conflicts dflags a final_last
                    || r `Set.member` live_in_joins

    live_rhs = foldRegsUsed dflags extendRegSet emptyRegSet rhs

{- TODO: enable this later, when we have some good tests in place to
   measure the effect and tune it.

-- small: an expression we don't mind duplicating
isSmall :: CmmExpr -> Bool
isSmall (CmmReg (CmmLocal _)) = True
isSmall (CmmLit _) = True
isSmall (CmmMachOp (MO_Add _) [x,y]) = isTrivial x && isTrivial y
isSmall (CmmRegOff (CmmLocal _) _) = True
isSmall _ = False
-}

--
-- We allow duplication of trivial expressions: registers (both local and
-- global) and literals.
--

-- GlobalRegs that are loads from BaseReg are not trivial
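The triviality test can be sketched on a simplified expression type. This is a hypothetical model (`Expr`, `isTrivialToy` are not GHC names); the real `isTrivial` works over CmmExpr and consults DynFlags to recognise GlobalRegs that are really loads from BaseReg:

```haskell
-- A sketch of the triviality test on a simplified expression type.
data Expr
  = LocalReg String       -- a local virtual register
  | GlobalReg String      -- e.g. R1, Sp, Hp
  | Lit Int               -- a literal
  | Load Expr             -- a memory load
  | MachOp String [Expr]  -- a primitive operation
  deriving Show

-- Trivial expressions may be duplicated freely when inlining:
-- registers (both local and global) and literals; nothing else.
isTrivialToy :: Expr -> Bool
isTrivialToy (LocalReg _)  = True
isTrivialToy (GlobalReg r) = r /= "BaseReg"  -- crude stand-in for the BaseReg check
isTrivialToy (Lit _)       = True
isTrivialToy _             = False

main :: IO ()
main = mapM_ (print . isTrivialToy)
  [ LocalReg "x1", GlobalReg "R1", Lit 42
  , Load (GlobalReg "Sp"), MachOp "add" [LocalReg "x1", Lit 1] ]
```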

--
-- annotate each node with the set of registers live *after* the node
--

--
-- Find the blocks that have multiple successors (join points)
--
  where

--
-- filter the list of assignments to remove any assignments that
-- are not live in a continuation.
--
  where
        -- Note that we must keep assignments that are
        -- referred to by other assignments we have
        -- already kept.

-- -----------------------------------------------------------------------------
-- Walk through the nodes of a block, sinking and inlining assignments
-- as we go.
--
-- On input we pass in a:
--    * list of nodes in the block
--    * a list of assignments that appeared *before* this block and
--      that are being sunk.
--
-- On output we get:
--    * a new block
--    * a list of assignments that will be placed *after* that block.
--

walk :: DynFlags
                        -- the set of registers live *after*
                        -- this node.
                        -- assignments we are sinking.
                        -- Earlier assignments may refer
                        -- to later ones.
        )

walk dflags nodes assigs = go nodes emptyBlock assigs
  where
        -- discard dead assignment
      where
        node1 = constantFoldNode dflags node

--
-- Heuristic to decide whether to pick up and sink an assignment.
-- Currently we pick up all assignments to local registers.  It might
-- be profitable to sink assignments to global regs too, but the
-- liveness analysis doesn't track those (yet) so we can't.
--
shouldSink _ _other = Nothing

--
-- discard dead assignments.  This doesn't do as good a job as
-- removeDeadAssignments, because it would need multiple passes
-- to get all the dead code, but it catches the common case of
-- superfluous reloads from the stack that the stack allocator
-- leaves behind.
--
-- Also we catch "r = r" here.  You might think it would fall
-- out of inlining, but the inliner will see that r is live
-- after the instruction and choose not to inline r in the rhs.
--
shouldDiscard node live

dropAssignments dflags should_drop state assigs
  where
      where

-- -----------------------------------------------------------------------------
-- Try to inline assignments into a node.
-- This also does constant folding for primops, since
-- inlining opens up opportunities for doing so.

tryToInline
   :: DynFlags
                        -- node.  We cannot inline anything
                        -- that is live after the node, unless
                        -- it is small enough to duplicate.
   -> (
        CmmNode O x     -- New node
      )

tryToInline dflags live node assigs = go usages node emptyLRegSet assigs
 where
   usages = foldLocalRegsUsed dflags addUsage emptyUFM node

        | cannot_inline        = dont_inline
        | occurs_once          = inline_and_discard
        | isTrivial dflags rhs = inline_and_keep
        where
          discard = go usages node skipped rest
                    usages rhs
          -- we must not inline anything that is mentioned in the RHS
          -- of a binding that we have already skipped, so we set the
          -- usages of the regs on the RHS to 2.
                     || l `elemLRegSet` skipped
          l_usages = lookupUFM usages l
          l_live   = l `elemRegSet` live

          -- inl_exp is where the inlining actually takes place!
            = cmmOffset dflags rhs off
            -- re-constant fold after inlining
          inl_exp other = other

{- Note [improveConditional]

cmmMachOpFold tries to simplify conditionals to turn things like
  (a == b) != 1
into
  (a != b)
but there's one case it can't handle: when the comparison is over
floating-point values, we can't invert it, because floating-point
comparisons aren't invertible (because of NaNs).

But we *can* optimise this conditional by swapping the true and false
branches.  Given
  CmmCondBranch ((a >## b) != 1) t f
we can turn it into
  CmmCondBranch (a >## b) f t

So here we catch conditionals that weren't optimised by cmmMachOpFold,
and apply the above transformation to eliminate the comparison against 1.

It's tempting to just turn every != into == and then let cmmMachOpFold
do its thing, but that risks changing a nice fall-through conditional
into one that requires two jumps (see swapcond_last in
CmmContFlowOpt), so instead we carefully look for just the cases where
we can eliminate a comparison.
-}
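The branch-swapping transformation from the Note can be sketched on a toy conditional type. The types and names below are illustrative only; the real code pattern-matches on CmmCondBranch and CmmMachOp nodes:

```haskell
-- Sketch of Note [improveConditional]: when the scrutinee is (cmp != 1)
-- and cmp is itself a comparison, drop the "!= 1" and swap the branches
-- instead of inverting the (possibly floating-point) comparison.
data Cond
  = Cmp String Cond Cond   -- a comparison, e.g. Cmp ">##" a b
  | NeLit1 Cond            -- (e != 1)
  | Leaf String
  deriving (Eq, Show)

data Branch = CondBranch Cond String String  -- scrutinee, true label, false label
  deriving (Eq, Show)

isComparison :: Cond -> Bool
isComparison (Cmp _ _ _) = True
isComparison _           = False

improveConditionalToy :: Branch -> Branch
improveConditionalToy (CondBranch (NeLit1 x) t f)
  | isComparison x = CondBranch x f t    -- swap branches, drop the != 1
improveConditionalToy other = other

main :: IO ()
main = print (improveConditionalToy
                (CondBranch (NeLit1 (Cmp ">##" (Leaf "a") (Leaf "b"))) "t" "f"))
```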

improveConditional
  | neLike mop, isComparisonExpr x
  where
improveConditional other = other

-- Note [dependent assignments]
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--
-- If our assignment list looks like
--
--    [ y = e,  x = ... y ... ]
--
-- We cannot inline x.  Remember this list is really in reverse order,
-- so it means  x = ... y ...; y = e
--
-- Hence if we inline x, the outer assignment to y will capture the
-- reference in x's right hand side.
--
-- In this case we should rename the y in x's right-hand side,
-- i.e. change the list to [ y = e, x = ... y1 ..., y1 = y ]
-- Now we can go ahead and inline x.
--
-- For now we do nothing, because this would require putting
-- everything inside UniqSM.
--
-- One more variant of this (#7366):
--
--    [ y = e, y = z ]
--
-- If we don't want to inline y = e, because y is used many times, we
-- might still be tempted to inline y = z (because we always inline
-- trivial rhs's).  But of course we can't, because y is equal to e,
-- not z.

-- Note [discard during inlining]
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-- Opportunities to discard assignments sometimes appear after we've
-- done some inlining.  Here's an example:
--
--      x = R1;
--      y = P64[x + 7];
--      z = P64[x + 15];
--      /* z is dead */
--      R1 = y & (-8);
--
-- The x assignment is trivial, so we inline it in the RHS of y, and
-- keep both x and y.  z gets dropped because it is dead, then we
-- inline y, and we have a dead assignment to x.  If we don't notice
-- that x is dead in tryToInline, we end up retaining it.

f _ z = z

-- we don't inline into CmmUnsafeForeignCall if the expression refers
-- to global registers.  This is a HACK to avoid global registers
-- clashing with C argument-passing registers, really the back-end
-- ought to be able to handle it properly, but currently neither PprC
-- nor the NCG can do it.  See Note [Register parameter passing]
-- See also StgCmmForeign:load_args_into_temps.

-- -----------------------------------------------------------------------------

-- | @conflicts (r,e) node@ is @False@ if and only if the assignment
-- @r = e@ can be safely commuted past statement @node@.

  -- (1) node defines registers used by rhs of assignment. This catches
  --     assignments and all three kinds of calls. See Note [Sinking and calls]

  -- (2) node uses register defined by assignment

  -- (3) a store to an address conflicts with a read of the same memory

  -- (4) an assignment to Hp/Sp conflicts with a heap/stack read respectively

  -- (5) foreign calls clobber heap: see Note [Foreign calls clobber heap]

  -- (6) native calls clobber any memory

  -- (7) otherwise, no conflict

-- Returns True if node defines any global registers that are used in the
-- Cmm expression
globalRegistersConflict dflags expr node =
        False node

-- Returns True if node defines any local registers that are used in the
-- Cmm expression
localRegistersConflict dflags expr node =
        False node
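The register-based conflict rules above can be sketched on a toy node type. Everything here (`Node`, `conflictsToy`, `Mem`) is a hypothetical simplification; the real `conflicts` folds over CmmNode with foldRegsDefd/foldRegsUsed and uses AbsMem for the memory side:

```haskell
-- A toy version of the conflict test between a pending assignment
-- (r = e) and a node it would be commuted past.
data Mem = NoMemT | HeapMemT | StackMemT | AnyMemT deriving (Eq, Show)

data Node = Node
  { nodeDefs   :: [String]  -- registers the node defines
  , nodeUses   :: [String]  -- registers the node uses
  , nodeWrites :: Mem       -- memory the node writes
  }

-- conservative memory overlap: NoMem never overlaps, AnyMem always does
memConflict :: Mem -> Mem -> Bool
memConflict NoMemT  _       = False
memConflict _       NoMemT  = False
memConflict AnyMemT _       = True
memConflict _       AnyMemT = True
memConflict a       b       = a == b

-- (r, usedRegs, readsMem) models the assignment  r = e
conflictsToy :: (String, [String], Mem) -> Node -> Bool
conflictsToy (r, used, readsMem) node =
     any (`elem` nodeDefs node) used        -- (1) node defines a reg used by e
  || r `elem` nodeUses node                 -- (2) node uses the reg we assign
  || r `elem` nodeDefs node                 --     or redefines it
  || memConflict (nodeWrites node) readsMem -- (3) node writes memory e reads
                                            -- otherwise, no conflict

main :: IO ()
main = do
  let assig = ("x", ["R1"], HeapMemT)               -- x = <heap read involving R1>
  print (conflictsToy assig (Node ["R1"] [] NoMemT))  -- conflicts via rule (1)
  print (conflictsToy assig (Node [] [] StackMemT))   -- no conflict
```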

-- Note [Sinking and calls]
-- ~~~~~~~~~~~~~~~~~~~~~~~~
--
-- We have three kinds of calls: normal (CmmCall), safe foreign
-- (CmmForeignCall) and unsafe foreign (CmmUnsafeForeignCall).  We
-- perform the sinking pass after stack layout (see Note [Sinking after
-- stack layout]) which leads to two invariants related to calls:
--
--   a) during the stack layout phase all safe foreign calls are turned
--      into unsafe foreign calls (see Note [Lower safe foreign calls]).
--      This means that we will never encounter a CmmForeignCall node
--      when running sinking after stack layout
--
--   b) stack layout saves all variables live across a call on the stack
--      just before making a call (remember we are not sinking assignments
--      to stack):
--
--       L1:
--          x = R1
--          P64[Sp - 16] = L2
--          P64[Sp - 8]  = x
--          Sp = Sp - 16
--          call f() returns L2
--       L2:
--
--      We will attempt to sink { x = R1 } but we will detect conflict with
--      { P64[Sp - 8] = x } and hence we will drop { x = R1 } without even
--      checking whether it conflicts with { call f() }.  In this way we will
--      never need to check any assignment conflicts with CmmCall.  Remember
--      that we still need to check for potential memory conflicts.
--
-- So the result is that we only need to worry about CmmUnsafeForeignCall
-- nodes when checking conflicts (see Note [Unsafe foreign calls clobber
-- caller-save registers]).  This assumption holds only when we do sinking
-- after stack layout.  If we run it before stack layout we need to check
-- for possible conflicts with all three kinds of calls.  Our `conflicts`
-- function does that by using the generic foldRegsDefd and foldRegsUsed
-- functions defined in the DefinerOfRegs and UserOfRegs typeclasses.
--

-- An abstraction of memory read or written.
data AbsMem
  = NoMem            -- no memory accessed
  | AnyMem           -- arbitrary memory
  | HeapMem          -- definitely heap memory
  | StackMem         -- definitely stack memory
  | SpMem            -- <size>[Sp+n]

-- Having SpMem is important because it lets us float loads from Sp
-- past stores to Sp as long as they don't overlap, and this helps to
-- unravel some long sequences of
--    x1 = [Sp + 8]
--    x2 = [Sp + 16]
--    ...
--    [Sp + 8]  = xi
--    [Sp + 16] = xj
--
-- Note that SpMem is invalidated if Sp is changed, but the definition
-- of 'conflicts' above handles that.
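The point of giving SpMem an offset and width is that two Sp-relative accesses conflict only when their byte ranges overlap. A minimal sketch of that overlap test, using toy names (`SpAccess`, `spOverlap`; the real SpMem lives inside AbsMem and is combined with bothMems):

```haskell
-- SpMemT offset width, in bytes: models <width>[Sp + offset]
data SpAccess = SpMemT Int Int deriving Show

-- Two Sp-relative accesses conflict iff their half-open byte
-- intervals [offset, offset + width) intersect.
spOverlap :: SpAccess -> SpAccess -> Bool
spOverlap (SpMemT o1 w1) (SpMemT o2 w2) =
  o1 < o2 + w2 && o2 < o1 + w1

main :: IO ()
main = do
  print (spOverlap (SpMemT 8 8) (SpMemT 16 8))  -- disjoint: [8,16) vs [16,24)
  print (spOverlap (SpMemT 8 8) (SpMemT 12 8))  -- overlapping: [8,16) vs [12,20)
```

This is what lets a load from `[Sp + 8]` float past a store to `[Sp + 16]` in the sequences shown above.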

-- ToDo: this won't currently fix the following commonly occurring code:
--    x1 = [R1 + 8]
--    x2 = [R1 + 16]
--    ..
--    [Hp - 8]  = x1
--    [Hp - 16] = x2
--    ..
--
-- because [R1 + 8] and [Hp - 8] are both HeapMem.  We know that
-- assignments to [Hp + n] do not conflict with any other heap memory,
-- but this is tricky to nail down.  What if we had
--
--    x = Hp + n
--    [x] = ...
--
-- the store to [x] should be "new heap", not "old heap".
-- Furthermore, you could imagine that if we started inlining
-- functions in Cmm then there might well be reads of heap memory
-- that was written in the same basic block.  To take advantage of
-- non-aliasing of heap memory we will have to be more clever.

-- Note [Foreign calls clobber heap]
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--
-- It is tempting to say that foreign calls clobber only
-- non-heap/stack memory, but unfortunately we break this invariant in
-- the RTS.  For example, in stg_catch_retry_frame we call
-- stmCommitNestedTransaction() which modifies the contents of the
-- TRec it is passed (this actually caused incorrect code to be
-- generated).
--
-- Since the invariant is true for the majority of foreign calls,
-- perhaps we ought to have a special annotation for calls that can
-- modify heap/stack memory.  For now we just use the conservative
-- definition here.
--
-- Some CallishMachOps imply a memory barrier, e.g. AtomicRMW, and
-- therefore we should never float any memory operations across one of
-- these calls.

bothMems :: AbsMem -> AbsMem -> AbsMem
bothMems NoMem    x         = x
bothMems x        NoMem     = x
bothMems HeapMem  HeapMem   = HeapMem
bothMems StackMem StackMem  = StackMem
bothMems StackMem SpMem{}   = StackMem
bothMems _        _         = AnyMem

exprMem dflags (CmmLoad addr w) = bothMems (loadAddr dflags addr (typeWidth w))
                                           (exprMem dflags addr)
exprMem _ _ = NoMem

loadAddr dflags e w =
    CmmRegOff r i -> regAddr dflags r i w
    _other | regUsedIn dflags spReg e -> StackMem

regAddr dflags r _ _ | isGcPtrType (cmmRegType dflags r) = HeapMem -- yay! GCPtr pays for itself
regAddr _ _ _ _ = AnyMem

{-
Note [Inline GlobalRegs?]

Should we freely inline GlobalRegs?

Actually it doesn't make a huge amount of difference either way, so we
*do* currently treat GlobalRegs as "trivial" and inline them
everywhere, but for what it's worth, here is what I discovered when I
(SimonM) looked into this:

Common sense says we should not inline GlobalRegs, because when we
have

  x = R1

the register allocator will coalesce this assignment, generating no
code, and simply record the fact that x is bound to $rbx (or
whatever).  Furthermore, if we were to sink this assignment, then the
range of code over which R1 is live increases, and the range of code
over which x is live decreases.  All things being equal, it is better
for x to be live than R1, because R1 is a fixed register whereas x can
live in any register.  So we should neither sink nor inline 'x = R1'.

However, not inlining GlobalRegs can have surprising
consequences.  e.g. (cgrun020)

  c3EN:
      _s3DB::P64 = R1;
      _c3ES::P64 = _s3DB::P64 & 7;
      if (_c3ES::P64 >= 2) goto c3EU; else goto c3EV;
  c3EU:
      _s3DD::P64 = P64[_s3DB::P64 + 6];
      _s3DE::P64 = P64[_s3DB::P64 + 14];
      I64[Sp - 8] = c3F0;
      R1 = _s3DE::P64;
      P64[Sp] = _s3DD::P64;

inlining the GlobalReg gives:

  c3EN:
      if (R1 & 7 >= 2) goto c3EU; else goto c3EV;
  c3EU:
      I64[Sp - 8] = c3F0;
      _s3DD::P64 = P64[R1 + 6];
      R1 = P64[R1 + 14];
      P64[Sp] = _s3DD::P64;

but if we don't inline the GlobalReg, instead we get:

      _s3DB::P64 = R1;
      if (_s3DB::P64 & 7 >= 2) goto c3EU; else goto c3EV;
  c3EU:
      I64[Sp - 8] = c3F0;
      R1 = P64[_s3DB::P64 + 14];
      P64[Sp] = P64[_s3DB::P64 + 6];

This looks better - we managed to inline _s3DD - but in fact it
generates an extra reg-reg move:

.Lc3EU:
        movq $c3F0_info,-8(%rbp)
        movq %rbx,%rax
        movq 14(%rbx),%rbx
        movq 6(%rax),%rax
        movq %rax,(%rbp)

because _s3DB is now live across the R1 assignment, we lost the
benefit of coalescing.

Who is at fault here?  Perhaps if we knew that _s3DB was an alias for
R1, then we would not sink a reference to _s3DB past the R1
assignment.  Or perhaps we *should* do that - we might gain by sinking
it, despite losing the coalescing opportunity.

Sometimes not inlining global registers wins by virtue of the rule
about not inlining into arguments of a foreign call, e.g. (T7163) this
is what happens when we inlined F1:

  _s3L2::F32 = F1;
  _c3O3::F32 = %MO_F_Mul_W32(F1, 10.0 :: W32);
  (_s3L7::F32) = call "ccall" arg hints: [] result hints: [] rintFloat(_c3O3::F32);

but if we don't inline F1:

  (_s3L7::F32) = call "ccall" arg hints: [] result hints: [] rintFloat(%MO_F_Mul_W32(_s3L2::F32,
                                                                                     10.0 :: W32));
-}