-------------------------------------------- Notes on nofib programs -------------------------------------------- Every time I do performance tuning on GHC I re-discover wrinkles in particular nofib programs. I intend to record these wrinkles herein. NOTE: With 4.09 we introduced Unicode. That means that (C# x) always allocates, whereas it didn't before. So allocations go up a bit. --------------------------------------- Imaginary suite --------------------------------------- integrate ~~~~~~~~~ integrate1D is strict in its second argument 'u', but it also passes 'u' to function 'f'. Hence it now does some reboxing, which pushes up allocation slightly. gen_regexps ~~~~~~~~~~~ I found that there were some very bad loss-of-arity cases in PrelShow. In particular, we had: showl "" = showChar '"' s showl ('"':xs) = showString "\\\"" . showl xs showl (x:xs) = showLitChar x . showl xs Trouble is, we get showl = \xs -> case xs of ... (x:xs) -> let f = showLitChar x g = showl xs in \s -> f (g s) which is TERRIBLE. We can't spot that showLitChar has arity 2 because it looks like this: ...other eqns... showLitChar c = showString ('\\' : asciiTab!!ord c) notice that the (asciiTab!!orc c) is outside the \s, so GHC can't rewrite it to showLitChar c = \s -> showString ('\\' : asciiTab!!ord c) s So I've changed PrelShow.showLitChar to use explicit \s. Even then, showl doesn't work, because GHC can't see that showl xs can be pushed inside the \s. So I've put an explict \s there too. showl "" s = showChar '"' s showl ('"':xs) s = showString "\\\"" (showl xs s) showl (x:xs) s = showLitChar x (showl xs s) Net result: imaginary/gen_regexps more than halves in allocation! x2n1 ~~~~ Relies quite heavily on specialisations for Num (Complex Double). If this test suddenly gets worse by a factor of 2 or so, chances are that specialisation is broken. --------------------------------------- Spectral suite --------------------------------------- rewrite ~~~~~~~ It's important to inline p_ident. There's a very delicate CSE in p_expr p_expr = seQ q_op [p_term1, p_op, p_term2] ## p_term3 (where all the pterm1,2,3 are really just p_term). This expands into p_expr s = case p_term1 s of Nothing -> p_term3 s Just (v,s') -> case p_op s' of ... Nothing -> p_term3 s Notice that if the bit before (##) fails, we call p_term3 s, and if CSE notices we can re-use the result of the first call. But that depends on how much inlining takes place. In particular, the follow-up calls p_term3 can get hidden inside join points, which don't "see" the first call. cse ~~~ The functions 'go' and 'go1', called from draw, get arity 1, but they should get arity 2. We need a better arity analysis, a la Gill. 'go' looks like this: go_r1og = \ (ds_aGI :: [GHC.Base.Char]) -> case ds_aGI of wild_aGJ { [] -> z_r1oe; : y_aGN ys_aGO -> let { xs7_s1i8 :: GHC.Prim.Int# -> [GHC.Base.Char] [Str: DmdType] xs7_s1i8 = go_r1og ys_aGO } in \ (m_XWf :: GHC.Prim.Int#) -> case GHC.Prim.<=# m_XWf 1 of wild1_aSI { GHC.Base.False -> GHC.Base.: @ GHC.Base.Char y_aGN (xs7_s1i8 (GHC.Prim.-# m_XWf 1)); GHC.Base.True -> GHC.Base.: @ GHC.Base.Char y_aGN lvl13_r1n0 } } Notice the 'let' which stops the lambda moving out. Eliza ~~~~~ In June 2002, GHC 5.04 emitted four successive NOTE: Simplifier still going after 4 iterations; bailing out. messages. I suspect that the simplifer is looping somehow. Expert ~~~~~~ In spectral/expert/Search.ask there's a statically visible CSE. Catching this depends almost entirely on chance, which is a pity. Fish ~~~~ The performance of fish depends crucially on inlining scale_vec2. It turns out to be right on the edge of GHC's normal threshold size, so small changes to the compiler can knock it to and fro. [Sept 08] The new instance-declaration story makes show slightly worse. Look at Main.showl in the simplified core. You'll see this: a79_aRZ :: GHC.Types.Int -> (GHC.Types.Int, GHC.Types.Int, GHC.Types.Int, GHC.Types.Int) -> GHC.Show.ShowS [Str: DmdType] a79_aRZ = GHC.Show.showsPrec9 @ Int @ Int @ Int @ Int GHC.Show.$f16 GHC.Show.$f16 GHC.Show.$f16 GHC.Show.$f16 Rec { showl_sWK :: [Main.Line_segment] -> String -> String [Arity 2 Str: DmdType SL] showl_sWK = \ (ds_dQJ :: [Main.Line_segment]) (s_alB :: GHC.Base.String) -> case ds_dQJ of wild_X1d [ALWAYS Just A] { [] -> GHC.Types.: @ GHC.Types.Char lvl_sYT s_alB; : x_alD [ALWAYS Just L] xs_alF [ALWAYS Just L] -> GHC.Base.++ @ GHC.Types.Char a_s12R (a79_aRZ a_s1uS x_alD (showl_sWK xs_alF s_alB)) } end Rec } Note the non-inlined call to GHC.Show.showsPrec9, which looks like this: showsPrec9 :: forall a b c d. (Show a, Show b, Show c, Show d) => GHC.Types.Int -> (a, b, c, d) -> GHC.Show.ShowS {- Arity: 4 Strictness: LLLL Unfolding: (\ @ a b c d $dShow :: GHC.Show.Show a $dShow1 :: GHC.Show.Show b $dShow2 :: GHC.Show.Show c $dShow3 :: GHC.Show.Show d -> let { shows1 :: d -> GHC.Show.ShowS = GHC.Show.shows @ d $dShow3 } in let { shows2 :: c -> GHC.Show.ShowS = GHC.Show.shows @ c $dShow2 } in let { shows3 :: b -> GHC.Show.ShowS = GHC.Show.shows @ b $dShow1 } in let { shows4 :: a -> GHC.Show.ShowS = GHC.Show.shows @ a $dShow } in \ ds :: GHC.Types.Int ds1 :: (a, b, c, d) s :: GHC.Base.String -> case @ GHC.Base.String ds1 of wild { (a79, b, c, d) -> GHC.Show.show_tuple (GHC.Types.: @ GHC.Show.ShowS (shows4 a79) (GHC.Types.: @ GHC.Show.ShowS (shows3 b) (GHC.Types.: @ GHC.Show.ShowS (shows2 c) (GHC.Types.: @ GHC.Show.ShowS (shows1 d) (GHC.Types.[] @ GHC.Show.ShowS))))) s }) -} We would do better to inpline showsPrec9 but it looks too big. Before it was inlined regardless by the instance-decl stuff. So perf drops slightly. Integer ~~~~~~~ A good benchmark for beating on big-integer arithmetic Knights ~~~~~~~ In knights/KnightHeuristic, we don't find that possibleMoves is strict (with important knock-on effects) unless we apply rules before floating out the literal list [A,B,C...]. Similarly, in f_se (F_Cmp ...) in listcompr (but a smaller effect) Lambda ~~~~~~ This program shows the cost of the non-eta-expanded lambdas that arise from a state monad. mandel2 ~~~~~~~ check_perim's several calls to point_colour lead to opportunities for CSE which may be more or less well taken. Mandel ~~~~~~ Relies heavily on having a specialised version of Complex.magnitude (:: Complex Double -> Double) available. Mandel.mandelset has a go-loop inside an enumDeltaToIntegerFB, out of which 4.08.2 managed to float a constant expression, but HEAD did not. I think this is because the pre-let-floating simplification did too little inlining; in particular, it did not inline windowToViewport Multiplier ~~~~~~~~~~ In spectral/multiplier, we have xor = lift21 forceBit f where f :: Bit -> Bit -> Bit f 0 0 = 0 f 0 1 = 1 f 1 0 = 1 f 1 1 = 0 Trouble is, f is CPR'd, and that means that instead of returning the constants I# 0, I# 1, it returns 0,1 and then boxes them. So allocation goes up. I don't see a way around this. Parstof ~~~~~~~ spectral/hartel/parstof ends up saying case (unpackCString "x") of { c:cs -> ... } quite a bit. We should spot these and behave accordingly. Power ~~~~~ With GHC 4.08, for some reason the arithmetic defaults to Double. The right thing is to default to Rational, which accounts for the big increase in runtime after 4.08 Puzzle ~~~~~~ The main function is 'transfer'. It has some complicated join points, and a big issue is the full laziness can float out many small MFEs that then make much bigger closures. It's quite delicate: small changes can make big differences, and I spent far too long gazing at it. I found that in my experimental proto 4.09 compiler I had let ds = go xs in let $j = .... ds ... in case e of p1 -> $j p2 -> $j p3 -> ds But ds wasn't getting inlined because we couldn't spot that it was actually only being used once. Keith's analyser should find this! Also, making concat into a good producer made a large gain. My proto 4.09 still allocates more, partly because of more full laziness relative to 4.08; I don't know why that happens Extra allocation is happening in 5.02 as well; perhaps for the same reasons. There is at least one instance of floating that prevents fusion; namely the enumerated lists in 'transfer'. Sphere ~~~~~~ A key function is vecsub, which looks like this (after w/w) $wvecsub = \ ww :: Double ww1 :: Double ww2 :: Double ww3 :: Double ww4 :: Double ww5 :: Double -> let { a = case ww of wild { D# x -> case ww3 of wild1 { D# y -> let { a1 = -## x y } in $wD# a1 } } } in let { a1 = case ww1 of wild { D# x -> case ww4 of wild1 { D# y -> let { a2 = -## x y } in $wD# a2 } } } in let { a2 = case ww2 of wild { D# x -> case ww5 of wild1 { D# y -> let { a3 = -## x y } in $wD# a3 } } } in (# a, a1, a2 #) Currently it gets guidance: IF_ARGS 6 [2 2 2 2 2 2] 25 4 It occurs in a context like case $wvecsub a b c d e f of (# p, q, r #) -> case p of D# p' -> case q of D# q' -> case r of D# r' -> So it would be much, much better to inline it. But the result-discounting doesn't "see" that the three results are each taken apart, and that's the problem. One question is this: why did minusDouble get inlined? If it didn't then $vecsub would look much smaller: $wvecsub = \ ww :: Double ww1 :: Double ww2 :: Double ww3 :: Double ww4 :: Double ww5 :: Double -> let { a = minusDouble ww ww3 } in let { a1 = minusDouble ww1 ww4 } in let { a2 = minusDouble ww2 ww5 } in (# a, a1, a2 #) Reason minusDouble gets inlined: because we are trying to get update in place. So I added a flag -funfolding-update-in-place to enable this "optimisation". Omitting the flag gives much better inlining for $wvecsub at least. Sphere also does 60,000 calls to hPutStr, so I/O plays a major role. Currently this I/O does a *lot* of allocation, much of it since the adddition of thread-safety. treejoin ~~~~~~~~ Does a lot of IO.readFile. In GHC.IO.Encoding.UTF8 the demand analyser sees a strict function with type a_s1gj :: GHC.IO.Buffer.Buffer GHC.Word.Word8 -> GHC.IO.Buffer.Buffer GHC.Types.Char -> GHC.Prim.State# GHC.Prim.RealWorld -> (# GHC.Prim.State# GHC.Prim.RealWorld, (GHC.IO.Encoding.Types.CodingProgress, GHC.IO.Buffer.Buffer GHC.Word.Word8, GHC.IO.Buffer.Buffer GHC.Types.Char) #) Unboxing both Buffer arguments makes a HUGE difference (halves allocation); but that makes the worker function have 12 arguments. A good reason for unboxing even if the worker gets a lot of args. sorting ~~~~~~~ Same issue with GHC.IO.Encoding.UTF8 as treejoin --------------------------------------- Real suite --------------------------------------- gg ~~ Same issue with GHC.IO.Encoding.UTF8 as treejoin Maillist ~~~~~~~~ Uses appendFile repeatedly rather than opening the output file once, which leads to numerous file opens/closes. Allocations will rise with the new I/O subsystem in 5.02 because the I/O buffer will be re-allocated on the heap for each open, whereas previously it would be allocated on the C heap and therefore not show up in the stats.