Make the example a little more complex
[ghc.git] / docs / backpack / backpack-impl.tex
1 \documentclass{article}
2
3 \usepackage{graphicx} %[pdftex] OR [dvips]
4 \usepackage{fullpage}
5 \usepackage{float}
6 \usepackage{titling}
7 \usepackage{hyperref}
8 \usepackage{tikz}
9 \usetikzlibrary{arrows}
10 \usetikzlibrary{positioning}
11 \setlength{\droptitle}{-6em}
12
13 \newcommand{\ghcfile}[1]{\textsl{#1}}
14
15 \title{Implementing Backpack}
16
17 \begin{document}
18
19 \maketitle
20
21 The purpose of this document is to describe an implementation path
22 for Backpack in GHC\@.
23
24 We start off by outlining the current architecture of GHC, ghc-pkg and Cabal,
25 which constitute the existing packaging system. We then state what our subgoals
26 are, since there are many similar sounding but different problems to solve. Next,
27 we describe the ``probably correct'' implementation plan, and finish off with
28 some open design questions. This is intended to be an evolving design document,
29 so please contribute!
30
31 \section{Current packaging architecture}
32
33 The overall architecture is described in Figure~\ref{fig:arch}.
34
35 \begin{figure}[H]
36 \center{\scalebox{0.8}{\includegraphics{arch.png}}}
37 \label{fig:arch}\caption{Architecture of GHC, ghc-pkg and Cabal. Green bits indicate additions from upcoming IHG work, red bits indicate additions from Backpack. Orange indicates a Haskell library.}
38 \end{figure}
39
40 Here, arrows indicate dependencies from one component to another. Color
41 coding is as follows: orange components are libaries, green components
42 are to be added with the IHG work, red components are to be added with
43 Backpack. (Thus, black and orange can be considered the current)
44
45 \subsection{Installed package database}
46
47 Starting from the bottom, we have the \emph{installed package database}
48 (actually a collection of such databases), which stores information
49 about what packages have been installed are thus available to be
50 compiled against. There is both a global database (for the system
51 administrator) and a local database (for end users), which can be
52 updated independently. One way to think about the package database
53 is as a \emph{cache of object code}. In principle, one could compile
54 any piece of code by repeatedly recompiling all of its dependencies;
55 the installed package database describes when this can be bypassed.
56
57 \begin{figure}[H]
58 \center{\scalebox{0.8}{\includegraphics{pkgdb.png}}}
59 \label{fig:pkgdb}\caption{Anatomy of a package database.}
60 \end{figure}
61
62 In Figure~\ref{fig:pkgdb}, we show the structure of a package database.
63 The installed package are created from a Cabal file through the process
64 of dependency resolution and compilation. In database terms, the primary key
65 of a package database is the InstalledPackageId
66 (Figure~\ref{fig:current-pkgid}). This ID uniquely identifies an
67 instance of an installed package. The PackageId omits the ABI hash and
68 is used to qualify linker exported symbols: the current value of this
69 parameter is communicated to GHC using the \verb|-package-id| flag.
70
71 In principle, packages with different PackageIds should be linkable
72 together in the same compiled program, whereas packages with the same
73 PackageId are not (even if they have different InstalledPackageIds). In
74 practice, GHC is currently only able to select one version of a package,
75 as it clears out all old versions of the package in
76 \ghcfile{compiler/main/Package.lhs}:applyPackageFlag.
77
78 \begin{figure}
79 \center{\begin{tabular}{r l}
80 PackageId & package name, package version \\
81 InstalledPackageId & PackageId, ABI hash \\
82 \end{tabular}}
83 \label{fig:current-pkgid}\caption{Current structure of package identifiers.}
84 \end{figure}
85
86 The database entry itself contains the information from the installed package ID,
87 as well as information such as what dependencies it was linked against, where
88 its compiled code and interface files live, its compilation flags, what modules
89 it exposes, etc. Much of this information is only relevant to Cabal; GHC
90 uses a subset of the information in the package database.
91
92 \subsection{GHC}
93
94 The two programs which access the package database directly are GHC
95 proper (for compilation) and ghc-pkg (which is a general purpose
96 command line tool for manipulating the database.) GHC relies on
97 the package database in the following ways:
98
99 \begin{itemize}
100 \item It imports the local and global package databases into
101 its runtime database, and applies modifications to the exposed
102 and trusted status of the entries via the flags \verb|-package|
103 and others (\ghcfile{compiler/main/Packages.lhs}). The internal
104 package state can be seen at \verb|-v4| or higher.
105 \item It uses this package database to find the location of module
106 interfaces when it attempts to load the module info of an external
107 module (\ghcfile{compiler/iface/LoadIface.hs}).
108 \end{itemize}
109
110 GHC itself performs a type checking phase, which generates an interface
111 file representing the module (so that later invocations of GHC can load the type
112 of a module), and then after compilation projects object files and linked archives
113 for programs to use.
114
115 \paragraph{Original names} Original names are an important design pattern
116 in GHC\@.
117 Sometimes, a name can be exposed in an hi file even if its module
118 wasn't exposed. Here is an example (compiled in package R):
119
120 \begin{verbatim}
121 module X where
122 import Internal (f)
123 g = f
124
125 module Internal where
126 import Internal.Total (f)
127 \end{verbatim}
128
129 Then in X.hi:
130
131 \begin{verbatim}
132 g = <R.id, Internal.Total, f> (this is the original name)
133 \end{verbatim}
134
135 (The reason we refer to the package as R.id is because it's the
136 full package ID, and not just R).
137
138 \subsection{hs-boot}
139
140 \verb|hs-boot| is a special mechanism used to support recursive linking
141 of modules within a package, today. Suppose I have a recursive module
142 dependency between modules and A and B. I break one of\ldots
143
144 (ToDo: describe how hs-boot mechanism works)
145
146 \subsection{Cabal}
147
148 Cabal is the build system for GHC, we can think of it as parsing a Cabal
149 file describing a package, and then making (possibly multiple)
150 invocations to GHC to perform the appropriate compilation. What
151 information does Cabal pass onto GHC\@? One can get an idea for this by
152 looking at a prototypical command line that Cabal invokes GHC with:
153
154 \begin{verbatim}
155 ghc --make
156 -package-name myapp-0.1
157 -hide-all-packages
158 -package-id containers-0.9-ABCD
159 Module1 Module2
160 \end{verbatim}
161
162 There are a few things going on here. First, Cabal has to tell GHC
163 what the name of the package it's compiling (otherwise, GHC can't appropriately
164 generate symbols that other code referring to this package might generate).
165 There are also a number of commands which configure its in-memory view of
166 the package database (GHC's view of the package database may not directly
167 correspond to what is on disk). There's also an optimization here: in principle,
168 GHC can compile each module one-by-one, but instead we use the \verb|--make| flag
169 because this allows GHC to reuse some data structures, resulting in a nontrivial
170 speedup.
171
172 (ToDo: describe cabal-install/sandbox)
173
174 \section{Goals}
175
176 Here are some of the high-level goals which motivate our improvements to
177 the module system.
178
179 \begin{itemize}
180 \item Solve \emph{Cabal hell}, a situation which occurs when conflicting
181 version ranges on a wide range of dependencies leads to a situation
182 where it is impossible to satisfy the constraints. We're seeking
183 to solve this problem in two ways: first, we want to support
184 multiple instances of containers-2.9 in the database which are
185 compiled with different dependencies (and even link them
186 together), and second, we want to abolish (often inaccurate)
187 version ranges and move to a regime where packages depend on
188 signatures.
189
190 \item Support \emph{hermetic builds with sharing}. A hermetic build
191 system is one which simulates rebuilding every package whenever
192 it is built; on the other hand, actually rebuilding every time
193 is extremely inefficient (but what happens in practice with
194 Cabal sandboxes). We seek to solve this problem with the IHG work,
195 by allowing multiple instances of a package in the database, where
196 the only difference is compilation parameters. We don't care
197 about being able to link these together in a single program.
198
199 \item Support \emph{module-level pluggability} as an alternative to
200 existing (poor) usage of type classes. The canonical example are
201 strings, where a library might want to either use the convenient
202 but inefficient native strings, or the efficient packed Text data
203 type, but would really like to avoid having to say \verb|StringLike s => ...|
204 in all of their type signatures. While we do not plan on supporting
205 separate compilation, Cabal should understand how to automatically
206 recompile these ``indefinite'' packages when they are instantiated
207 with a new plugin.
208
209 \item Support \emph{separate modular development}, where a library and
210 an application using the library can be developed and type-checked
211 separately, intermediated by an interface. This is applicable in
212 the pluggability example as well, where we want to ensure that all
213 of the $M \times N$ configurations of libraries versus applications
214 type check, by only running the typechecker $M + N$ times. A closely
215 related concern is related toolchain support for extracting a signature
216 from an existing implementation, as current Haskell culture is averse
217 to explicitly writing separate signature files.
218
219 \item Subsume existing support for \emph{mutually recursive modules},
220 without the double vision problem.
221 \end{itemize}
222
223 A \emph{non-goal} is to allow users to upgrade upstream libraries
224 without recompiling downstream. This is an ABI concern and we're not
225 going to worry about it.
226
227 \section{Aside: Recent IHG work}\label{sec:ihg}
228
229 The IHG project has allocated some funds to relax the package instance
230 constraint in the package database, so that multiple instances can be
231 stored, but now the user of GHC must explicitly list package-IDs to be
232 linked against. In the far future, it would be expected that tools like
233 Cabal automatically handle instance selection among a large number of
234 instances, but this is subtle and so this work is only to do some
235 foundational work, allowing a package database to optionally relax the
236 unique package-version requirement, and utilize environment files to
237 select which packages should be used. See Duncan's email for more
238 details on the proposal.
239
240 To implement this:
241
242 \begin{enumerate}
243
244 \item Remove the ``removal step'' when registering a package (with a flag)
245
246 \item Check \ghcfile{compiler/main/Packages.lhs}:mkPackagesState to look out for shadowing
247 within a database. We believe it already does the right thing, since
248 we already need to handle shadowing between the local and global database.
249
250 \end{enumerate}
251
252 Once these changes are implemented, we can program multiple instances by
253 using \verb|-hide-all-packages -package-id ...|, even if there is no
254 high-level tool support.
255
256 Actually, this concern is orthogonal to the purposes of Backpack, if
257 we redefine PackageId appropriately.
258
259 \paragraph{The ABI hash} Currently, InstalledPackageId
260 is constructed of a package, version and ABI hash
261 (generateRegistrationInfo in
262 \ghcfile{libraries/Cabal/Cabal/Distribution/Simple/Register.hs}). The
263 use of an ABI hash is a bit of GHC-specific hack introduced in 2009,
264 intended to make sure these installed package IDs are unique. While
265 this is quite clever, using the ABI is actually a bit inflexible, as one
266 might reasonably want to have multiple copies of a package with the same
267 ABI but different source code changes.\footnote{In practice, our ABIs
268 are so unstable that it doesn't really matter.}
269
270 In Figure~\ref{fig:proposed-pkgid}, there is an alternate logical
271 representation of InstalledPackageId which attempts to extricate the
272 notion of ABI compatibility from what actually might uniquely identify a
273 package beyond its PackageId. We imagine these components to be:
274
275 \begin{itemize}
276 \item A hash of the source code (so one can register different
277 in-development versions without having to bump the version
278 number);
279 \item Compilation way (profiling? dynamic?)
280 \item Compilation flags (such as compilation way, optimization,
281 profiling settings)\footnote{This is a little undefined on a package bases, because in principle the flags could be varied on a per-file basis. More likely this will be approximated against the relevant fields in the Cabal file as well as arguments passed to Cabal.};
282 \end{itemize}
283
284 A historical note: in the 2012 GSoC project to allow multiple instances
285 of a package to be installed at the same time, use of \emph{random
286 numbers} was used to workaround the inability to get an ABI early
287 enough. We are not using this plan.
288
289 \section{Concrete physical identity = PackageId + Module name + Module dependencies}\label{sec:ipi}
290
291 \begin{figure}
292 \center{\begin{tabular}{r l}
293 PackageId & hash of ``package name, package version, dependency resolution, \emph{module} environment'' \\
294 InstalledPackageId & hash of ``PackageId, source code, way, compiler flags'' \\
295 \end{tabular}}
296 \label{fig:proposed-pkgid}\caption{Proposed new structure of package identifiers.}
297 \end{figure}
298
299 In Backpack, there needs to be some mechanism for assigning
300 \emph{physical module identities} to modules, which are essential for
301 typechecking Backpack packages, since they let us tell if two types are
302 equal or not.
303
304 In GHC, our method of testing whether or not two types are equal is by
305 comparing their original names, which is a tuple of a PackageId and the
306 module name (summarized in Figure~\ref{fig:current-pkgid}). If it looks
307 like a duck and quacks like a duck, it is a duck: we might reasonable
308 imagine that \emph{concrete physical identity = PackageId and module
309 name}. But is this sufficient to represent Backpack's physical module
310 identities?
311
312 If we look at how physical module identities are allocated in Paper Backpack,
313 we can see that each module corresponds to an identity constructor (in
314 Backpack this is the package name, some representation of the module
315 source, and a disambiguating integer if module source is duplicated)
316 \emph{as well as} physical module identities which were applied to
317 this constructor. The current PackageId + Module name scheme adequately encodes
318 the constructor (in particular, a single package can't have multiple modules
319 with the same name), but it does not encode the \emph{module} dependencies.
320 We have to add these to our original names. There are three ways we can do
321 this:
322
323 \begin{enumerate}
324 \item Augment original names to have a third component. This is less than ideal because it requires pervasive changes to all parts of the compiler.
325 \item Augment the module name to record module dependency information. This is a bit of an odd place to put the information, because our module names correspond quite concretely to the actual source file a user wrote some source in.
326 \item Augment the PackageId to record module dependency information. For now, this is the regime we will use; however, see Section~\ref{sec:flatten} for why the issue is more subtle.
327 \end{enumerate}
328
329 (XXX Figure~\ref{fig:proposed-pkgid} is out of date now.)
330
331 And now, the complications\ldots
332
333 \paragraph{Relaxing package selection restrictions} As mentioned
334 previously, GHC is unable to select multiple packages with the same
335 package name (but different PackageIds). This restriction needs to be
336 lifted. For backwards compatibility reasons, it's probably better to
337 not overload \verb|-package-id| but add a new flag, maybe \verb|-force-package-id|;
338 we also need to disable the old version masking behavior. This is orthogonal
339 to the IHG work, which wants to allow multiple \emph{InstalledPackageIds} in the
340 \emph{database} (here, we want to allow multiple \emph{PackageIds} in \emph{compiled code}).
341
342 \paragraph{Linker symbols} As we increase the amount of information in
343 PackageId, it's important to be careful about the length of these IDs,
344 as they are used for exported linker symbols (e.g.
345 \verb|base_TextziReadziLex_zdwvalDig_info|). Very long symbol names
346 hurt compile and link time, object file sizes, GHCi startup time,
347 dynamic linking, and make gdb hard to use. As such, the current plan is
348 to do away with full package names and versions, and instead use just a
349 base-62 encoded hash, perhaps with the first four characters of the package
350 name for user-friendliness.
351
352 \paragraph{Wired-in names} One annoying thing to remember is that GHC
353 has wired-in names, which refer to packages without any version. Now
354 these wired names also have to accomodate dependency trees. A
355 suggested approach is to have a fixed table from these wired names to
356 package IDs; alternately we can use something like the special \verb|inplace|
357 version number.
358
359 \paragraph{Version numbers} The interaction of Backpack's package polymorphism
360 (the ability to mixin different implementations to holes) and Cabal's dependency
361 resolution mechanism (which permits a name libfoo to be resolved to a specific
362 version libfoo-0.1) can be subtle: there are in fact \emph{two} levels of
363 indirection going on here.
364
365 The simplest way to think about the distinction is as follows. When I write
366 a Backpack package which does \verb|include foobar|, I have not actually
367 written a Paper Backpack package. Instead, I have written a \emph{pre-package},
368 which Cabal's dependency solver then takes and rewrites all package references
369 into versioned references \verb|include foobar-0.1|, which now corresponds to
370 a Paper Backpack package. For example, this is a pre-package:
371
372 \begin{verbatim}
373 package foo where
374 include bar
375 \end{verbatim}
376
377 and this is a Paper Backpack package:
378
379 \begin{verbatim}
380 package foo-0.3(bar-0.1) where
381 include bar-0.1(baz-0.2)
382 \end{verbatim}
383
384 Notably, we must \emph{rename} the package to include information about
385 how we resolved all of the inner package references, and if these inner
386 package references had dependencies, those must be included too! In
387 effect, the \emph{dependency resolution} must be encoded into the package ID,
388 along with the existing Backpack \emph{physical identity regular tree}.
389 Phew!
390
391 \paragraph{Free variables (or, what is a non-concrete physical
392 identity?)} Physical identities in their full generality are permitted
393 to have free variables, which represent holes. Handling this is a
394 tricky question, and we defer it to Section~\ref{sec:typechecking-indefinite}, when
395 we talk about packages with holes.
396
397 \section{The granularity of packages versus system libraries}\label{sec:flatten}
398
399 In the original design of GHC's packaging system, the intent was that
400 a package be compiled into a system library, e.g. \verb|libHSbase-4.7.1.0.so|
401 or \verb|libHSbase-4.7.1.0.a|. Dependencies between packages translate simply
402 into dependencies between system libraries.
403
404 Interestingly enough, Paper Backpack specifies dependency tracking at
405 the \emph{module} level, which is a finer granularity than previously
406 implemented by GHC\@. Here is a Backpack package which demonstrates
407 this type of module-level dependency:
408
409 \begin{verbatim}
410 package a where
411 P :: ...
412 Q :: [ x :: Int ]
413 A = [ import P; ... ]
414 B = [ import Q; ... ]
415 package p where
416 P = [ ... ]
417 package x where
418 include p
419 Q = [ x = 0 ]
420 include a
421 package y where
422 include p
423 Q = [ x = 1 ]
424 include a
425 \end{verbatim}
426
427 The operant question: is module
428 \verb|A| from package \verb|x| the same as module \verb|A| from package
429 \verb|y|? In Paper Backpack, the answer is yes.
430
431 From an end-user perspective, why is this desirable? One answer is that
432 it permits users to modularize third-party packages which were not
433 packaged with modularity in mind, but happen to be modular. In this
434 case, package \verb|a| could have been implemented as two separate
435 packages. More generally, when libraries ship with test-cases, they currently have to
436 split these test-cases to separate packages, so as to not introduce
437 spurious dependencies with various test frameworks, which the user may
438 not have or care about. If dependencies are done on a per-module basis,
439 as long as the user doesn't import a test module, they never gain the
440 extra dependency. Another situation is when a library also happens to
441 have a helper utility module which doesn't have any of the dependencies
442 of the primary library. And yet another situation is when we want to
443 define type class instances for data types without orphan instances, but
444 without pulling in a pile of packages defining type classes for things
445 an end user doesn't care about.
446
447 From an implementation perspective, however, this answer is quite distressing:
448
449 \begin{enumerate}
450 \item When I install package \verb|x| and then install package
451 \verb|y|, what should happen to the installed module files for
452 \verb|A|? Currently, the installed package database organizes
453 these files at the package level, so it is not clear where these
454 files should live; surely they shouldn't be duplicated!
455 Similarly, when I typecheck, I need to ensure that the original
456 name for \verb|x:A| and \verb|y:A| are the same: thus, the
457 PackageId I assign cannot be \verb|x| or \verb|y|.
458
459 \item When I create a \verb|libHSx.so| and a \verb|libHSy.so|, which
460 dynamic library contains the object files for \verb|A|? Either
461 of these dynamic libraries could be used alone, so both must
462 contain \verb|A|. But an application could also use both at the
463 same time, at which point a program will see two copies of the
464 program text for \verb|A|.
465 \end{enumerate}
466
467 The first problem can be solved by ``flattening'' the package database,
468 so that instead of organizing object files by library, the database
469 is just a directory of physical module identities to objects/interfaces.
470 Installed packages are now represented as (possibly overlapping) sets over
471 this store of modules. However, the second problem is far more persistent:
472 if a package is a dynamic library, we still are unable to get the sharing of
473 object files necessary. (This problem isn't present for static linking, where
474 one doesn't actually have to link against \verb|libHSx.a|, the object files
475 are just fine.)
476
477 In the next few sections, we describe a few different designs for Backpack which
478 vary the granularity of packages. In general, the coarser the granularity,
479 the closer to the current package-library correspondence the design is, but
480 at the cost of giving up more sharing. Finer granularity requires smaller
481 and smaller corresponding libraries. We'll use the following running example.
482
483 \begin{verbatim}
484 package p where
485 A :: [ a :: Bool ]
486 B = [ import A; b = a ]
487 C = [ c = True ]
488
489 package q where
490 A = [ a = False ]
491 include p
492 D = [ import A; import C; d = a && c ]
493 E = [ import D; import C; e = c && d ]
494 F = [ import B; f = b ]
495 \end{verbatim}
496
497 As a notational convenience, we'll omit the full physical identity of a
498 module when it's clear from context. We'll start by recapping Paper Backpack,
499 which has the finest granularity and coarsen.
500
501 \subsection{One library per physical module identity}
502
503 To truly implement a Backpack design, we must generate a library per
504 physical module identity. The physical module identities for
505 our running example are:
506
507 \begin{verbatim}
508 expanded unexpanded
509 A -> q:A q:A
510 B -> p:B(q:A) p:B($A)
511 C -> p:C p:C
512 D -> q:D(q:A,p:C) q:D($A,$C)
513 E -> q:E(q:D(q:A,p:C),p:C) q:E($D,$C)
514 F -> q:F(p:B(q:A)) q:F($B)
515 \end{verbatim}
516
517 This results in the following
518 libraries (and their respective dependencies): \\
519
520 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=6cm,
521 thick,m/.style={rectangle,draw,font=\sffamily\Large\bfseries}]
522 \node[m] (1) {libHSq:A.so};
523 \node[m] (2) [left of=1,fill=blue!20] {libHSp:B(q:A).so};
524 \node[m] (3) [below=0.5cm of 1,fill=blue!20] {libHSp:C.so};
525 \node[m] (4) [left of=3] {libHSq:D.so};
526 \node[m] (5) [below=0.5cm of 4] {libHSq:E.so};
527 \node[m] (6) [left of=2] {libHSq:F.so};
528 \path
529 (2) edge node {} (1)
530 (4) edge node {} (1)
531 (4) edge node {} (3)
532 (5) edge node {} (3)
533 (5) edge node {} (4)
534 (6) edge node {} (2)
535 ;
536 \end{tikzpicture}
537
538 Notice that there is no ``ordering'' that the packages could have been
539 built in; rather, we walk the package tree, building modules as we encounter
540 them (thus, the instantiated package \verb|p| in light blue is fully built before all
541 the modules of \verb|q| are finished).
542
543 Also, note that in the case of module \verb|B|, we had to include its
544 dependency in the name of the library. This is because package \verb|p| could
545 be instantiated a different implementation of logical module \verb|A|: the
546 library must specify which one it's compiled with! Otherwise, in the following situation:
547
548 \begin{verbatim}
549 package x where
550 A :: [ a :: Bool ]
551 B = [ import A; b = a ]
552 package y where
553 A1 = [ a = True ]
554 A2 = [ a = False ]
555 package linker1 where
556 include y
557 include x (A as A1, B as B1)
558 package linker2 where
559 include y
560 include x (A as A2, B as B2)
561 \end{verbatim}
562
563 both instantiations of \verb|B| (\verb|x:B(y:A1)| and \verb|x:B(y:A2)|)
564 would be given the same name \verb|x:B|.
565 Technically, all of the other
566 modules should also record this information, but as an optimization, only
567 recording the assignments for \emph{holes} is necessary.
568
569 \subsection{One library per package identity}
570
571 In this world, we simplify physical module identities to ``only contain
572 package information'', and not full physical module identities. These
573 simplified identities we call \emph{package identities}. One perspective
574 is that we're coalescing together the libraries created from the
575 previous scheme. However, another perspective is we are taking our
576 packages and \emph{slicing} them into pieces which are actually
577 considered libraries.
578
579 Before we jump in earnest into specifying how package identities are
580 computed, it's useful to first specify what some desirable properties
581 of any such slicing scheme are:
582
583 \begin{itemize}
584
585 \item \emph{No junk.} Given an assignment of a module to a library,
586 loading that library minimize the number of transitive
587 dependencies which were irrelevant for the module. A good
588 heuristic is that it's OK to load unused code from the same
589 package (after all, the whole point is to have multiple modules
590 in a single library), but it's not OK to cause unused external packages to
591 be loaded.
592
593 \item \emph{Caching.} When compiling a package, we compile libraries
594 for a set of package identifiers. Compilation of any other
595 packages which produce overlapping identifiers must be able to
596 reuse the libraries from the original compilation.
597
598 \item \emph{Economy.} Subject to the previous two constraints, we
599 should minimize the number of package identifiers required
600 to compile any given package (otherwise, per physical module
601 identifier is a valid solution.)
602
603 \end{itemize}
604
605 Here is an example mapping of package identities:
606
607 \begin{verbatim}
608 physical mod identity pkg identity
609 A -> q:A q
610 B -> p:B(q:A) p(q:A)
611 C -> p:C p
612 D -> q:D(q:A,p:C) q(p)
613 E -> q:E(q:D(q:A,p:C),p:C) q(p)
614 F -> q:F(p:B(q:A)) q(p(q:A))
615 \end{verbatim}
616
617 Just as in the previous section, we still have to record the explicit
618 module choice for holes. However, it's worth noting that the
619 \verb|include| induced dependency for \verb|q(p)| has \emph{not} been
620 refined, since \verb|C| is not a hole in package \verb|q|. It is not
621 possible for another package to generate a compiled module that should
622 be placed in \verb|q(p)|, since \verb|q| can't be reinstantiated and a
623 new package would have a different package name.
624
625 The process also calculates a dependency list for the package
626 identities, which does not necessarily coincide with the tree structure
627 of the package identities. Here is the resulting library graph: \\
628
629 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=6cm,
630 thick,m/.style={rectangle,draw,font=\sffamily\Large\bfseries}]
631 \node[m] (1) {libHSq.so (A)};
632 \node[m] (2) [left of=1,fill=blue!20] {libHSp(q:A).so (B)};
633 \node[m] (3) [below=0.5cm of 1,fill=blue!20] {libHSp.so (C)};
634 \node[m] (4) [left of=3] {libHSq(p).so (D,E)};
635 \node[m] (5) [left of=2] {libHSq(p(q:A)).so (F)};
636 \path
637 (2) edge node [right] {} (1)
638 (4) edge node [right] {} (3)
639 (4) edge node [right] {} (1)
640 (5) edge node [right] {} (2)
641 ;
642 \end{tikzpicture}
643
644 We can observe the following changes that \verb|D| and \verb|E| have
645 been placed in the same library. This is because they both have a
646 dependency on library \verb|p|, internal dependencies not withstanding.
647
648 \paragraph{So what's the algorithm?} While intuitively, it seems like
649 one could simply erase module names from physical module identities and
650 get package identities, things are a bit trickier than that: internal
651 dependencies could cause us to fail to place two modules together which
652 should be in the same package. So\ldots I don't actually know what the
653 algorithm to be used here is.
654
655 \subsection{One library per definite package}
656
657 In this world, we create a dynamic library per definite package (package with
658 no holes). Indefinite packages don't get compiled into libraries, the code
659 is duplicated and type equality is only seen if a type came from the same
660 definite package. In the running example, we only generate \verb|libHSq.so|
661 which exports all of the modules (\verb|p| is indefinite), and if another
662 package instantiates \verb|p|, the instances of \verb|C| will be considered
663 different. \\
664
665 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=6cm,
666 thick,m/.style={rectangle,draw,font=\sffamily\Large\bfseries}]
667 \node[m] (1) {libHSq.so (A,B,C,D,E)};
668 \end{tikzpicture}
669
670 As a refinement, if the instantiations of an indefinite package's holes
671 live in libraries which do not have a mutually recursive dependency on
672 the indefinite package, then it can be instantiated. In the previous
673 example, this was not true: hole \verb|A| in package \verb|p| was
674 instantiated with \verb|q:A|, but package \verb|q| has a dependency
675 on \verb|p|. However, we could break the dependence by separating \verb|A|
676 into another package:
677
678 \begin{verbatim}
679 package a where
680 A = [ a = False ]
681 package q where
682 include a
683 include p
684 D = [ import C; d = c ]
685 E = [ import C; e = c ]
686 \end{verbatim}
687
688 Now the library dependencies look like: \\
689
690 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=6cm,
691 thick,m/.style={rectangle,draw,font=\sffamily\Large\bfseries}]
692 \node[m] (1) {libHSa.so (A)};
693 \node[m] (2) [below=1cm, left of=1, fill=blue!20] {libHSp(a:a).so (B,C)};
694 \node[m] (3) [above=1cm, left of=2] {libHSq.so (D,E)};
695 \path
696 (2) edge node [right] {} (1)
697 (3) edge node [right] {} (1)
698 (3) edge node [right] {} (2);
699 \end{tikzpicture}
700
701 Intuitively, indefinite packages can share code, if all of the holes are
702 placed in standalone packages. Otherwise, the indefinite package is
703 generatively created. Notice, as was the case in the previous section,
704 that we have to keep information about specific modules in the names
705 of instantiated modules.
706
707 \subsection{Commentary}
708
709 A library per physical module identity obviously won't work for real systems,
710 so the primary choice is between a library per package identity or
711 definite package. The benefits of package identities:
712
713 \begin{itemize}
714 \item More programs type-check (since type equality holds when packages
715 are reinstantiated),
716 \item Packages are automatically sliced into their most modular form,
717 allowing package writers to avoid having to manually split up modules.
718 \end{itemize}
719
720 The benefits of definite packages:
721
722 \begin{itemize}
723 \item We actually know how to implement it (the algorithm in the other
724 case is not worked out, and we haven't even begun to talk about
725 mutual recursion),
726 \item It is the smallest delta with GHC today: e.g., \verb|cabal install|
727 always results in the installation of one library,
728 \item Monolithic library files mean that we pay less cost of
729 crossing the boundaries of dynamic libraries,
730 \item Requiring users to place hole instantiations in separate packages
731 to be shared is not a wholly unreasonable stipulation.
732 \end{itemize}
733
734 One major open question which could be empirically tested is this: for the
735 current package ecosystem, how many subpackages would slicing create?
736 This is an important experiment to run on any proposed slicing algorithm.
737
738 \section{Exposed modules should allow external modules}\label{sec:reexport}
739
740 In Backpack, the definition of a package consists of a logical context,
741 which maps logical module names to physical module names. These do not
742 necessarily coincide, since some physical modules may have been defined
743 in other packages and mixed into this package. This mapping specifies
744 what modules other packages including this package can access.
745 However, in the current installed package database, we have exposed-modules which
746 specify what modules are accessible, but we assume that the current
747 package is responsible for providing these modules.
748
749 To implement Backpack, we have to extend this exposed-modules (``Export declarations''
750 on Figure~\ref{fig:pkgdb}). Rather
751 than a list of logical module names, we provide a new list of tuples:
752 the exported logical module name and original physical module name (this
753 is in two parts: the InstalledPackageId and the original module name).
754 For example, an traditional module export is simply (Name, my-pkg-id, Name);
755 a renamed module is (NewName, my-pkg-id, OldName), and an external module
756 is (Name, external-pkg-id, Name).
757
758 As an example:
759
760 \begin{verbatim}
761 package P where
762 M = ...
763 N = ...
764 package Q (M, R, T)
765 include P (N -> R)
766 T = ...
767 \end{verbatim}
768
769 And now if we look at Q\@:
770
771 \begin{verbatim}
772 exposed-modules:
773 <M, P.id, M>
774 <R, P.id, N>
775 <T, Q.id, T>
776 \end{verbatim}
777
778 When we compile Q, and the interface file gets generated, we have
779 to generate identifiers for each of the exposed modules. These should
780 be calculated to directly refer to the ``original name'' of each them;
781 so for example M and R point directly to package P, but they also
782 include the original name they had in the original definition.
783
784 \paragraph{Error messages} When it is possible for multiple physical
785 entities to have the same ``user-friendly'' name, this can result in a
786 very confusing situation if both entities have to be referred to in the
787 same message. This is especially true when renaming is in place:
788 you not only want to print out the name in scope, but probably some indication
789 of what the original name is. In short, when it comes to error messages, tread with care!
790
791 \section{Shapeless Backpack}\label{sec:simplifying-backpack}
792
793 Backpack as currently defined always requires a \emph{shaping} pass,
794 which calculates the shapes of all modules defined in a package.
795 The shaping pass is critical to the solution of the double-vision problem
796 in recursive module linking, but it also presents a number of unpalatable
797 implementation problems:
798
799 \begin{itemize}
800
801 \item \emph{Shaping is a lot of work.} A module shape specifies the
802 providence of all data types and identifiers defined by a
803 module. To calculate this, we must preprocess and parse all
804 modules, even before we do the type-checking pass. (Fortunately,
805 shaping doesn't require a full parse of a module, only enough
806 to get identifiers, which makes it a slightly more expensive
807 version of \verb|ghc -M|.)
808
809 \item \emph{Shaping must be done upfront.} In the current Backpack
810 design, all shapes must be computed before any typechecking can
811 occur. While performing the shaping pass upfront is necessary
812 in order to solve the double vision problem (where a module
813 identity may be influenced by later definitions), it means
814 that GHC must first do a shaping pass, and then revisit every module and
815 compile them proper. Nor is it (easily) possible to skip the
816 shaping pass when it is unnecessary, as one might expect to be
817 the case in the absence of mutual recursion. Shaping is not
818 a ``pay as you go'' language feature.
819
820 \item \emph{GHC can't compile all programs shaping accepts.} Shaping
821 accepts programs that GHC, with its current hs-boot mechanism, cannot
822 compile. In particular, GHC requires that any data type or function
823 in a signature actually be \emph{defined} in the module corresponding
824 to that file (i.e., an original name can be assigned to these entities
825 immediately.) Shaping permits unrestricted exports to implement
826 modules; this shows up in the formalism as $\beta$ module variables.
827
828 \item \emph{Shaping encourages inefficient program organization.}
829 Shaping is designed to enable mutually recursive modules, but as
830 currently implemented, mutual recursion is less efficient than
831 code without recursive dependencies. Programmers should avoid
832 this code organization, except when it is absolutely necessary.
833
834 \item \emph{GHC is architecturally ill-suited for directly
835 implementing shaping.} Shaping implies that GHC's internal
836 concept of an ``original name'' be extended to accommodate
837 module variables. This is an extremely invasive change to all
838 aspects of GHC, since the original names assumption is baked
839 quite deeply into the compiler. Plausible implementations of
840 shaping requires all these variables to be skolemized outside
841 of GHC\@.
842
843 \end{itemize}
844
845 To be clear, the shaping pass is fundamentally necessary for some
846 Backpack packages. Here is the example which convinced Simon:
847
848 \begin{verbatim}
849 package p where
850 A :: [data T; f :: T -> T]
851 B = [export T(MkT), h; import A(f); data T = MkT; h x = f MkT]
852 A = [export T(MkT), f, h; import B; f MkT = MkT]
853 \end{verbatim}
854
855 The key to this example is that B \emph{may or may not typecheck} depending
856 on the definition of A. Because A reexports B's definition T, B will
857 typecheck; but if A defined T on its own, B would not typecheck. Thus,
858 we \emph{cannot} typecheck B until we have done some analysis of A (the
859 shaping analysis!)
860
861 Thus, it is beneficial (from an optimization point of view) to
862 consider a subset of Backpack for which shaping is not necessary.
863 Here is a programming discipline which does just that, which we will call the \textbf{linking restriction}: \emph{Module implementations must be declared before
864 signatures.} Formally, this restriction modifies the rule for merging
865 polarized module shapes ($\widetilde{\tau}_1^{m_1} \oplus \widetilde{\tau}_2^{m_2}$) so that
866 $\widetilde{\tau}_1^- \oplus \widetilde{\tau}_2^+$ is always undefined.\footnote{This seemed to be the crispest way of defining the restriction, although this means an error happens a bit later than I'd like it to: I'd prefer if we errored while merging logical contexts, but we don't know what is a hole at that point.}
867
868 Here is an example of the linking restriction. Consider these two packages:
869
870 \begin{verbatim}
871 package random where
872 System.Random = [ ... ].hs
873
874 package monte-carlo where
875 System.Random :: ...
876 System.MonteCarlo = [ ... ].hs
877 \end{verbatim}
878
879 Here, random is a definite package which may have been compiled ahead
880 of time; monte-carlo is an indefinite package with a dependency
881 on any package which provides \verb|System.Random|.
882
883 Now, to link these two applications together, only one ordering
884 is permissible:
885
886 \begin{verbatim}
887 package myapp where
888 include random
889 include monte-carlo
890 \end{verbatim}
891
892 If myapp wants to provide its own random implementation, it can do so:
893
894 \begin{verbatim}
895 package myapp2 where
896 System.Random = [ ... ].hs
897 include monte-carlo
898 \end{verbatim}
899
900 In both cases, all of \verb|monte-carlo|'s holes have been filled in by the time
901 it is included. The alternate ordering is not allowed.
902
903 Why does this discipline prevent mutually recursive modules? Intuitively,
904 a hole is the mechanism by which we can refer to an implementation
905 before it is defined; otherwise, we can only refer to definitions which
906 preceed our definition. If there are never any holes \emph{which get filled},
907 implementation links can only go backwards, ruling out circularity.
908
909 It's easy to see how mutual recursion can occur if we break this discipline:
910
911 \begin{verbatim}
912 package myapp2 where
913 include monte-carlo
914 System.Random = [ import System.MonteCarlo ].hs
915 \end{verbatim}
916
917 \subsection{Typechecking of definite modules without shaping}
918
919 If we are not carrying out a shaping pass, we need to be able to calculate
920 $\widetilde{\Xi}_{\mathsf{pkg}}$ on the fly. In the case that we are
921 compiling a package---there will be no holes in the final package---we
922 can show that shaping is unnecessary quite easily, since with the
923 linking restriction, everything is definite from the get-go.
924
925 Observe the following invariant: at any given step of the module
926 bindings, the physical context $\widetilde{\Phi}$ contains no
927 holes. We can thus conclude that there are no module variables in any
928 type shapes. As the only time a previously calculated package shape can
929 change is due to unification, the incrementally computed shape is in
930 fact the true one.
931
932 As far as the implementation is concerned, we never have to worry
933 about handling module variables; we only need to do extra typechecks
934 against (renamed) interface files.
935
936 \subsection{Compilation of definite modules}\label{sec:compiling-definite}
937
938 Of course, we still have to compile the code, and this includes any
939 subpackages which we have mixed in the dependencies to make them fully
940 definite. Let's take the following set of packages as an example:
941
942 \begin{verbatim}
943 package pkg-a where
944 A = ...
945 B = ... -- this code is ignored
946 package pgk-b where -- indefinite package
947 A :: ...
948 B = [ import A; ... ]
949 package pkg-c where
950 include pkg-a (A)
951 include pkg-b
952 C = [ import B; ... ]
953 \end{verbatim}
954
955 With the linking invariant, we can simply walk the Backpack package ``tree'',
956 compiling each of its dependencies. Let's walk through it explicitly.\footnote{To simplify matters, we assume that there is only one instance of any
957 PackageId in the database, so we omit the unique-ifying hashes from the
958 ghc-pkg registration commands; we ignore the existence of version numbers
959 and Cabal's dependency solver; finally, we do the compilation in
960 one-shot mode, even though Cabal in practice will use the Make mode.}
961
962 First, we have to build \verb|pkg-a|. This package has no dependencies
963 of any kind, so compiling is much like compiling ordinary Haskell. If
964 it was already installed in the database, we wouldn't even bother compiling it.
965
966 \begin{verbatim}
967 ADEPS = # empty!
968 ghc -c A.hs -package-name pkg-a-ADEPS
969 ghc -c B.hs -package-name pkg-a-ADEPS
970 # install and register pkg-a-ADEPS
971 \end{verbatim}
972
973 Next, we have to build \verb|pkg-b|. This package has a hole \verb|A|,
974 intuitively, it depends on package A. This is done in two steps:
975 first we check if the signature given for the hole matches up with the
976 actual implementation provided, and then we build the module properly.
977
978 \begin{verbatim}
979 BDEPS = "A -> pkg-a-ADEPS:A"
980 # This doesn't actually create an hi-boot file
981 ghc -c A.hs-boot -package-name pkg-b-BDEPS -module-env BDEPS
982 ghc -c B.hs -package-name pkg-b-BDEPS -module-env BDEPS
983 # install and register pkg-b-BDEPS
984 \end{verbatim}
985
986 Notably, these commands diverge slightly from the traditional compilation process.
987 Traditionally, we would specify the flags
988 \verb|-hide-all-packages| \verb|-package-id package-a-ADEPS| in order to
989 let GHC know that it should look at this package to find modules,
990 including \verb|A|. However, if we did this here, we would also pull in
991 module B, which is incorrect, as this module was thinned out in the parent
992 package description. Conceptually, this package is being compiled in the
993 context of some module environment \verb|BDEPS| (a logical context, in Backpack lingo)
994 which maps modules to original names and is utilized by the module finder to
995 lookup the import in \verb|B.hs|.
996
997 Finally, we created all of the necessary subpackages, and we can compile
998 our package proper.
999
1000 \begin{verbatim}
1001 CDEPS = # empty!!
1002 ghc -c C.hs -package-name pkg-c-CDEPS -hide-all-packages \
1003 -package pkg-a-ADEPS -hide-module B \
1004 -package pkg-b-BDEPS
1005 # install and register package pkg-c-CDEPS
1006 \end{verbatim}
1007
1008 This mostly resembles traditional compilation, but there are a few
1009 interesting things to note. First, GHC needs to know about thinning/renaming
1010 in the package description (here, it's transmitted using the \verb|-hide-module|
1011 command, intended to apply to the most recent package definition).\footnote{Concrete
1012 command line syntax is, of course, up for discussion.} Second, even though C
1013 ``depends'' on subpackages, these do not show in its package-name identifier,
1014 e.g. CDEPS\@. This is because this package \emph{chose} the values of ADEPS and BDEPS
1015 explicitly (by including the packages in this particular order), so there are no
1016 degrees of freedom.\footnote{In the presence of a Cabal-style dependency solver
1017 which associates a-0.1 with a concrete identifier a, these choices need to be
1018 recorded in the package ID.} Finally, in principle, we could have also used
1019 the \verb|-module-env| flag to communicate how to lookup the B import (notice
1020 that the \verb|-package pkg-a-ADEPS| argument is a bit useless because we
1021 never end up using the import.) I will talk a little more about the tradeoffs
1022 shortly.
1023
1024 Overall, there are a few important things to notice about this architecture.
1025 First, because the \verb|pkg-b-BDEPS| product is installed, if in another package
1026 build we instantiate the indefinite module B with exactly the same \verb|pkg-a|
1027 implementation, we can skip the compilation process and reuse the version.
1028 This is because the calculated \verb|BDEPS| will be the same, and thus the package
1029 IDs will be the same.
1030
1031 XXX ToDo: actually write down pseudocode algorithm for this
1032
1033 \paragraph{Module environment or package flags?} In the previous
1034 section, I presented two ways by which one can tweak the behavior of
1035 GHC's module finder, which is responsible for resolving \verb|import B|
1036 into an actual physical module. The first, \verb|-module-env| is to
1037 explicitly describe a full mapping from module names to original names;
1038 the second, \verb|-package| with \verb|-hide-module| and
1039 \verb|-rename-module|, is to load packages as before but apply
1040 thinning/renaming as necessary.
1041
1042 In general, it seems that using \verb|-package| is desirable, because its
1043 input size is smaller. (If a package exports 200 modules, we would be obligated
1044 to list all of them in a module environment.) However, a tricky situation
1045 occurs when some of these modules come from a parent package:
1046
1047 \begin{verbatim}
1048 package myapp2 where
1049 System.Random = [ ... ].hs
1050 include monte-carlo
1051 App = [ ... ].hs
1052 \end{verbatim}
1053
1054 Here, monte-carlo depends on a ``subpart of the myapp2 package'', and it's
1055 not entirely clear how monte-carlo should be represented in the installed
1056 package database: should myapp2 be carved up into pieces so that subparts
1057 of its package description can be installed to the database? A package
1058 stub like this would never used by any other package, it is strictly local.
1059
1060 On the other hand, if one uses a module environment for specifying how
1061 \verb|monte-carlo| should handle \verb|System.Random|, we don't actually
1062 have to create a stub package: all we have to do is make sure GHC knows
1063 how to find the module with this original name. To make things better,
1064 the size of the module environment will only be as long as all of the
1065 holes visible to the module in the package description, so the user will
1066 have explicitly asked for all of this pluggability.
1067
1068 The conclusion seems clear: package granularity for modules from includes,
1069 and module environments for modules from parent packages!
1070
1071 \paragraph{An appealing but incorrect alternative} In this section,
1072 we briefly describe an alternate compilation strategy that might
1073 sound good, but isn't so good. The basic idea is, when compiling
1074 \verb|pkg-c|, to compile all of its indefinite packages as if the
1075 package were one single, big package.
1076 (Of course, if we encounter a
1077 definite package, don't bother recompiling it; just use it directly.)
1078 In particular, we maintain the invariant that any code generated will
1079 export symbols under the current package's namespace. So the identifier
1080 \verb|b| in the example becomes a symbol \verb|pkg-c_pkg-b_B_b| rather
1081 than \verb|pkg-b_B_b| (package subqualification is necessary because
1082 package C may define its own B module after thinning out the import.)
1083
1084 The fatal problem with this proposal is that it doesn't implement applicative
1085 semantics beyond compilation units. While modules within a single
1086 compilation will get reused, if there is another package:
1087
1088 \begin{verbatim}
1089 package pkg-d where
1090 include pkg-a
1091 include pkg-b
1092 \end{verbatim}
1093
1094 when it is compiled by itself, it will generate its own instance of B,
1095 even though it should be the same as C. This is bad news.
1096
1097 \subsection{Restricted recursive modules ala hs-boot}\label{sec:hs-boot-restrict}
1098
1099 It should be possible to support GHC-style mutual recursion using the
1100 Backpack formalism immediately using hs-boot files. However, to avoid
1101 the need for a shaping pass, we must adopt an existing constraint that
1102 already applies to hs-boot files: \emph{at the time we define a signature,
1103 we must know what the original name for all data types is}. In practice,
1104 GHC enforces this by stating that: (1) an hs-boot file must be
1105 accompanied with an implementation, and (2) the implementation must
1106 in fact define (and not reexport) all of the declarations in the signature.
1107
1108 Why does this not require a shaping pass? The reason is that the
1109 signature is not really polymorphic: we require that the $\alpha$ module
1110 variable be resolved to a concrete module later in the same package, and
1111 that all the $\beta$ module variables be unified with $\alpha$. Thus, we
1112 know ahead of time the original names and don't need to deal with any
1113 renaming.\footnote{This strategy doesn't completely resolve the problem
1114 of cross-package mutual recursion, because we need to first compile a
1115 bit of the first package (signatures), then the second package, and then
1116 the rest of the first package.}
1117
1118 Compiling packages in this way gives the tantalizing possibility
1119 of true separate compilation: the only thing we don't know is what the actual
1120 package name of an indefinite package will be, and what the correct references
1121 to have are. This is a very minor change to the assembly, so one could conceive
1122 of dynamically rewriting these references at the linking stage. But
1123 separate compilation achieved in this fashion would not be able to take
1124 advantage of cross-module optimizations.
1125
1126 \section{Shaped Backpack}
1127
1128 Despite the simplicity of shapeless Backpack with the linking
1129 restriction in the absence of holes, we will find that when we have
1130 holes, it will be very difficult to do type-checking without
1131 some form of shaping. This section is very much a work in progress,
1132 but the ability to typecheck against holes, even with the linking restriction,
1133 is a very important part of modular separate development, so we will need
1134 to support it at some ponit.
1135
1136 \subsection{Efficient shaping}
1137
1138 (These are Edward's opinion, he hasn't convinced other folks that this is
1139 the right way to do it.)
1140
1141 In this section, I want to argue that, although shaping constitutes
1142 a pre-pass which must be run before compilation in earnest, it is only
1143 about as bad as the dependency resolution analysis that GHC already does
1144 in \verb|ghc -M| or \verb|ghc --make|.
1145
1146 In Paper Backpack, what information does shaping compute? It looks at
1147 exports, imports, data declarations and value declarations (but not the
1148 actual expressions associated with these values.) As a matter of fact,
1149 GHC already must look at the imports associated with a package in order
1150 to determine the dependency graph, so that it can have some order to compile
1151 modules in. There is a specialized parser which just parses these statements,
1152 and then ignores the rest of the file.
1153
1154 A bit of background: the \emph{renamer} is responsible for resolving
1155 imports and figuring out where all of these entities actually come from.
1156 SPJ would really like to avoid having to run the renamer in order to perform
1157 a shaping pass.
1158
1159 \paragraph{Is it necessary to run the Renamer to do shaping?}
1160 Edward and Scott believe the answer is no, well, partially.
1161 Shaping needs to know the original names of all entities exposed by a
1162 module/signature. Then it needs to know (a) which entities a module/signature
1163 defines/declares locally and (b) which entities that module/signature exports.
1164 The former, (a), can be determined by a straightforward inspection of a parse
1165 tree of the source file.\footnote{Note that no expression or type parsing
1166 is necessary. We only need names of local values, data types, and data
1167 constructors.} The latter, (b), is a bit trickier. Right now it's the Renamer
1168 that interprets imports and exports into original names, so we would still
1169 rely on that implementation. However, the Renamer does other, harder things
1170 that we don't need, so ideally we could factor out the import/export
1171 resolution from the Renamer for use in shaping.
1172
1173 Unfortunately the Renamer's import resolution analyzes \verb|.hi| files, but for
1174 local modules, which haven't yet been typechecked, we don't have those.
1175 Instead, we could use a new file format, \verb|.hsi| files, to store the shape of
1176 a locally defined module. (Defined packages are bundled with their shapes,
1177 so included modules have \verb|.hsi| files as well.) (What about the logical
1178 vs.~physical distinction in file names?) If we refactor the import/export
1179 resolution code, could we rewrite it to generically operate on both
1180 \verb|.hi| files and \verb|.hsi| files?
1181
1182 Alternatively, rather than storing shapes on a per-source basis, we could
1183 store (in memory) the entire package shape. Similarly, included packages
1184 could have a single shape file for the entire package. Although this approach
1185 would make shaping non-incremental, since an entire package's shape would
1186 be recomputed any time a constituent module's shape changes, we do not expect
1187 shaping to be all that expensive.
1188
1189 \subsection{Typechecking of indefinite modules}\label{sec:typechecking-indefinite}
1190
1191 Recall in our argument in the definite case, where we showed there are
1192 no holes in the physical context. With indefinite modules, this is no
1193 longer true. While (with the linking restriction) these holes will never
1194 be linked against a physical implementation, they may be linked against
1195 other signatures. (Note: while disallowing signature linking would
1196 solve our problem, it would disallow a wide array of useful instances of
1197 signature reuse, for example, a package mylib that implements both
1198 mylib-1x-sig and mylib-2x-sig.)
1199
1200 With holes, we must handle module variables, and we sometimes must unify them:
1201
1202 \begin{verbatim}
1203 package p where
1204 A :: [ data A ]
1205 package q where
1206 A :: [ data A ]
1207 package r where
1208 include p
1209 include q
1210 \end{verbatim}
1211
1212 In this package, it is not possible to a priori assign original names to
1213 module A in p and q, because in package r, they should have the same
1214 original name. When signature linking occurs, unification may occur,
1215 which means we have to rename all relevant original names. (A similar
1216 situation occurs when a module is typechecked against a signature.)
1217
1218 An invariant which would be nice to have is this: when typechecking a
1219 signature or including a package, we may apply renaming to the entities
1220 being brought into context. But once we've picked an original name for
1221 our entities, no further renaming should be necessary. (Formally, in the
1222 unification for semantic object shapes, apply the unifier to the second
1223 shape, but not the first one.)
1224
1225 However, there are plenty of counterexamples here:
1226
1227 \begin{verbatim}
1228 package p where
1229 A :: [ data A ]
1230 B :: [ data A ]
1231 M = ...
1232 A = B
1233 \end{verbatim}
1234
1235 In this package, does module M know that A.A and B.A are type equal? In
1236 fact, the shaping pass will have assigned equal module identities to A
1237 and B, so M \emph{equates these types}, despite the aliasing occurring
1238 after the fact.
1239
1240 We can make this example more sophisticated, by having a later
1241 subpackage which causes the aliasing; now, the decision is not even a
1242 local one (on the other hand, the equality should be evident by inspection
1243 of the package interface associated with q):
1244
1245 \begin{verbatim}
1246 package p where
1247 A :: [ data A ]
1248 B :: [ data A ]
1249 package q where
1250 A :: [ data A ]
1251 B = A
1252 package r where
1253 include p
1254 include q
1255 \end{verbatim}
1256
1257 Another possibility is that it might be acceptable to do a mini-shaping
1258 pass, without parsing modules or signatures, \emph{simply} looking at
1259 names and aliases. But logical names are not the only mechanism by
1260 which unification may occur:
1261
1262 \begin{verbatim}
1263 package p where
1264 C :: [ data A ]
1265 A = [ data A = A ]
1266 B :: [ import A(A) ]
1267 C = B
1268 \end{verbatim}
1269
1270 It is easy to conclude that the original names of C and B are the same. But
1271 more importantly, C.A must be given the original name of p:A.A. This can only
1272 be discovered by looking at the signature definition for B. In any case, it
1273 is worth noting that this situation parallels the situation with hs-boot
1274 files (although there is no mutual recursion here).
1275
1276 The conclusion is that you will probably, in fact, have to do real
1277 shaping in order to typecheck all of these examples.
1278
1279 \paragraph{Hey, these signature imports are kind of tricky\ldots}
1280
1281 When signatures and modules are interleaved, the interaction can be
1282 complex. Here is an example:
1283
1284 \begin{verbatim}
1285 package p where
1286 C :: [ data A ]
1287 M = [ import C; ... ]
1288 A = [ import M; data A = A ]
1289 C :: [ import A(A) ]
1290 \end{verbatim}
1291
1292 Here, the second signature for C refers to a module implementation A
1293 (this is permissible: it simply means that the original name for p:C.A
1294 is p:A.A). But wait! A relies on M, and M relies on C. Circularity?
1295 Fortunately not: a client of package p will find it impossible to have
1296 the hole C implemented in advance, since they will need to get their hands on module
1297 A\ldots but it will not be defined prior to package p.
1298
1299 In any case, however, it would be good to emit a warning if a package
1300 cannot be compiled without mutual recursion.
1301
1302 \subsection{Incremental typechecking}
1303 We want to typecheck modules incrementally, i.e., when something changes in
1304 a package, we only want to re-typecheck the modules that care about that
1305 change. GHC already does this today.%
1306 \footnote{\url{https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/RecompilationAvoidance}}
1307 Is the same mechanism sufficient for Backpack? Edward and Scott think that it
1308 is, mostly. Our conjecture is that a module should be re-typechecked if the
1309 existing mechanism says it should \emph{or} if the logical shape
1310 context (which maps logical names to physical names) has changed. The latter
1311 condition is due to aliases that affect typechecking of modules.
1312
1313 Let's look again at an example from before:
1314 \begin{verbatim}
1315 package p where
1316 A :: [ data A ]
1317 B :: [ data A ]
1318 M = [ import A; import B; ... ]
1319 \end{verbatim}
1320 Let's say that \verb|M| is typechecked successfully. Now we add an alias binding
1321 at the end of the package, \verb|A = B|. Does \verb|M| need to be
1322 re-typechecked? Yes! (Well, it seems so, but let's just assert ``yes'' for now.
1323 Certainly in the reverse case---if we remove the alias and then ask---this
1324 is true, since \verb|M| might have depended on the two \verb|A| types
1325 being the same.)
1326 The logical shape context changed to say that \verb|A| and
1327 \verb|B| now map to the same physical module identity. But does the existing
1328 recompilation avoidance mechanism say that \verb|M| should be re-typechecked?
1329 It's unclear. The \verb|.hi| file for \verb|M| records that it imported \verb|A| and
1330 \verb|B| with particular ABIs, but does it also know about the physical module
1331 identities (or rather, original module names) of these modules?
1332
1333 Scott thinks this highlights the need for us to get our story straight about
1334 the connection between logical names, physical module identities, and file
1335 names!
1336
1337
1338 \subsection{Installing indefinite packages}\label{sec:installing-indefinite}
1339
1340 If an indefinite package contains no code at all, we only need
1341 to install the interface file for the signatures. However, if
1342 they include code, we must provide all of the
1343 ingredients necessary to compile them when the holes are linked against
1344 actual implementations. (Figure~\ref{fig:pkgdb})
1345
1346 \paragraph{Source tarball or preprocessed source?} What is the representation of the source that is saved is. There
1347 are a number of possible choices:
1348
1349 \begin{itemize}
1350 \item The original tarballs downloaded from Hackage,
1351 \item Preprocessed source files,
1352 \item Some sort of internal, type-checked representation of Haskell code (maybe the output of the desugarer).
1353 \end{itemize}
1354
1355 Storing the tarballs is the simplest and most straightforward mechanism,
1356 but we will have to be very certain that we can recompile the module
1357 later in precisely the same we compiled it originally, to ensure the hi
1358 files match up (fortunately, it should be simple to perform an optional
1359 sanity check before proceeding.) The appeal of saving preprocessed
1360 source, or even the IRs, is that this is conceptually this is exactly
1361 what an indefinite package is: we have paused the compilation process
1362 partway, intending to finish it later. However, our compilation strategy
1363 for definite packages requires us to run this step using a \emph{different}
1364 choice of original names, so it's unclear how much work could actually be reused.
1365
1366 \section{Surface syntax}
1367
1368 In the Backpack paper, a brand new module language is presented, with
1369 syntax for inline modules and signatures. This syntax is probably worth implementing,
1370 because it makes it easy to specify compatibility packages, whose module
1371 definitions in general may be very short:
1372
1373 \begin{verbatim}
1374 package ishake-0.12-shake-0.13 where
1375 include shake-0.13
1376 Development.Shake.Sys = Development.Shake.Cmd
1377 Development.Shake = [ (**>) = (&>) ; (*>>) = (|*>)]
1378 Development.Shake.Rule = [ defaultPriority = rule . priority 0.5 ]
1379 include ishake-0.12
1380 \end{verbatim}
1381
1382 However, there are a few things that are less than ideal about the
1383 surface syntax proposed by Paper Backpack:
1384
1385 \begin{itemize}
1386 \item It's completely different from the current method users
1387 specify packages. There's nothing wrong with this per se
1388 (one simply needs to support both formats) but the smaller
1389 the delta, the easier the new packaging format is to explain
1390 and implement.
1391
1392 \item Sometimes order matters (relative ordering of signatures and
1393 module implementations), and other times it does not (aliases).
1394 This can be confusing for users.
1395
1396 \item Users have to order module definitions topologically,
1397 whereas in current Cabal modules can be listed in any order, and
1398 GHC figures out an appropriate order to compile them.
1399 \end{itemize}
1400
1401 Here is an alternative proposal, closely based on Cabal syntax. Given
1402 the following Backpack definition:
1403
1404 \begin{verbatim}
1405 package libfoo(A, B, C, Foo) where
1406 include base
1407 -- renaming and thinning
1408 include libfoo (Foo, Bar as Baz)
1409 -- holes
1410 A :: [ a :: Bool ].hsig
1411 A2 :: [ b :: Bool ].hsig
1412 -- normal module
1413 B = [
1414 import {-# SOURCE #-} A
1415 import Foo
1416 import Baz
1417 ...
1418 ].hs
1419 -- recursively linked pair of modules, one is private
1420 C :: [ data C ].hsig
1421 D = [ import {-# SOURCE #-} C; data D = D C ].hs
1422 C = [ import D; data C = C D ].hs
1423 -- alias
1424 A = A2
1425 \end{verbatim}
1426
1427 We can write the following Cabal-like syntax instead (where
1428 all of the signatures and modules are placed in appropriately
1429 named files):
1430
1431 \begin{verbatim}
1432 package: libfoo
1433 ...
1434 build-depends: base, libfoo (Foo, Bar as Baz)
1435 holes: A A2 -- deferred for now
1436 exposed-modules: Foo B C
1437 aliases: A = A2
1438 other-modules: D
1439 \end{verbatim}
1440
1441 Notably, all of these lists are \emph{insensitive} to ordering!
1442 The key idea is use of the \verb|{-# SOURCE #-}| pragma, which
1443 is enough to solve the important ordering constraint between
1444 signatures and modules.
1445
1446 Here is how the elaboration works. For simplicity, in this algorithm
1447 description, we assume all packages being compiled have no holes
1448 (including \verb|build-depends| packages). Later, we'll discuss how to
1449 extend the algorithm to handle holes in both subpackages and the main
1450 package itself.
1451
1452 \begin{enumerate}
1453
1454 \item At the top-level with \verb|package| $p$ and
1455 \verb|exposed-modules| $ms$, record \verb|package p (ms) where|
1456
1457 \item For each package $p$ with thinning/renaming $ms$ in
1458 \verb|build-depends|, record a \verb|include p (ms)| in the
1459 Backpack package. The ordering of these includes does not
1460 matter, since none of these packages have holes.
1461
1462 \item Take all modules $m$ in \verb|other-modules| and
1463 \verb|exposed-modules| which were not exported by build
1464 dependencies, and create a directed graph where hs and hs-boot
1465 files are nodes and imports are edges (the target of an edge is
1466 an hs file if it is a normal import, and an hs-boot file if it
1467 is a SOURCE import). Topologically sort this graph, erroring if
1468 this graph contains cycles (even with recursive modules, the
1469 cycle should have been broken by an hs-boot file). For each
1470 node, in this order, record \verb|M = [ ... ]| or \verb|M :: [ ... ]|
1471 depending on whether or not it is an hs or hs-boot. If possible,
1472 sort signatures before implementations when there is no constraint
1473 otherwise.
1474
1475 \end{enumerate}
1476
1477 Here is a simple example which shows how SOURCE can be used to disambiguate
1478 between two important cases. Suppose we have these modules:
1479
1480 \begin{verbatim}
1481 -- A1.hs
1482 import {-# SOURCE #-} B
1483
1484 -- A2.hs
1485 import B
1486
1487 -- B.hs
1488 x = True
1489
1490 -- B.hs-boot
1491 x :: Bool
1492 \end{verbatim}
1493
1494 Then we translate the following packages as follows:
1495
1496 \begin{verbatim}
1497 exposed-modules: A1 B
1498 -- translates to
1499 B :: [ x :: Bool ]
1500 A1 = [ import B ]
1501 B = [ x = True ]
1502 \end{verbatim}
1503
1504 but
1505
1506 \begin{verbatim}
1507 exposed-modules: A2 B
1508 -- translates to
1509 B = [ x = True ]
1510 B :: [ x :: Bool ]
1511 A2 = [ import B ]
1512 \end{verbatim}
1513
1514 The import controls placement between signature and module, and in A1 it
1515 forces B's signature to be sorted before B's implementation (whereas in
1516 the second section, there is no constraint so we preferentially place
1517 the B's implementation first)
1518
1519 \paragraph{Holes in the database} In the presence of holes,
1520 \verb|build-depends| resolution becomes more complicated. First,
1521 let's consider the case where the package we are building is
1522 definite, but the package database contains indefinite packages with holes.
1523 In order to maintain the linking restriction, we now have to order packages
1524 from step (2) of the previous elaboration. We can do this by creating
1525 a directed graph, where nodes are packages and edges are from holes to the
1526 package which implements them. If there is a cycle, this indicates a mutually
1527 recursive package. In the absence of cycles, a topological sorting of this
1528 graph preserves the linking invariant.
1529
1530 One subtlety to consider is the fact that an entry in \verb|build-depends|
1531 can affect how a hole is instantiated by another entry. This might be a
1532 bit weird to users, who might like to explicitly say how holes are
1533 filled when instantiating a package. Food for thought, surface syntax wise.
1534
1535 \paragraph{Holes in the package} Actually, this is quite simple: the
1536 ordering of includes goes as before, but some indefinite packages in the
1537 database are less constrained as they're ``dependencies'' are fulfilled
1538 by the holes at the top-level of this package. It's also worth noting
1539 that some dependencies will go unresolved, since the following package
1540 is valid:
1541
1542 \begin{verbatim}
1543 package a where
1544 A :: ...
1545 package b where
1546 include a
1547 \end{verbatim}
1548
1549 \paragraph{Multiple signatures} In Backpack syntax, it's possible to
1550 define a signature multiple times, which is necessary for mutually
1551 recursive signatures:
1552
1553 \begin{verbatim}
1554 package a where
1555 A :: [ data A ]
1556 B :: [ import A; data B = B A ]
1557 A :: [ import B; data A = A B ]
1558 \end{verbatim}
1559
1560 Critically, notice that we can see the constructors for both module B and A
1561 after the signatures are linked together. This is not possible in GHC
1562 today, but could be possible by permitting multiple hs-boot files. Now
1563 the SOURCE pragma indicating an import must \emph{disambiguate} which
1564 hs-boot file it intends to include. This might be one way of doing it:
1565
1566 \begin{verbatim}
1567 -- A.hs-boot2
1568 data A
1569
1570 -- B.hs-boot
1571 import {-# SOURCE hs-boot2 #-} A
1572
1573 -- A.hs-boot
1574 import {-# SOURCE hs-boot #-} B
1575 \end{verbatim}
1576
1577 \paragraph{Explicit or implicit reexports} One annoying property of
1578 this proposal is that, looking at the \verb|exposed-modules| list, it is
1579 not immediately clear what source files one would expect to find in the
1580 current package. It's not obvious what the proper way to go about doing
1581 this is.
1582
1583 \paragraph{Better syntax for SOURCE} If we enshrine the SOURCE import
1584 as a way of solving Backpack ordering problems, it would be nice to have
1585 some better syntax for it. One possibility is:
1586
1587 \begin{verbatim}
1588 abstract import Data.Foo
1589 \end{verbatim}
1590
1591 which makes it clear that this module is pluggable, typechecking against
1592 a signature. Note that this only indicates how type checking should be
1593 done: when actually compiling the module we will compile against the
1594 interface file for the true implementation of the module.
1595
1596 It's worth noting that the SOURCE annotation was originally made a
1597 pragma because, in principle, it should have been possible to compile
1598 some recursive modules without needing the hs-boot file at all. But if
1599 we're moving towards boot files as signatures, this concern is less
1600 relevant.
1601
1602 \section{Bits and bobs}
1603
1604 \subsection{Abstract type synonyms}
1605
1606 In Paper Backpack, abstract type synonyms are not permitted, because GHC doesn't
1607 understand how to deal with them. The purpose of this section is to describe
1608 one particularly nastiness of abstract type synonyms, by way of the occurs check:
1609
1610 \begin{verbatim}
1611 A :: [ type T ]
1612 B :: [ import qualified A; type T = [A.T] ]
1613 \end{verbatim}
1614
1615 At this point, it is illegal for \verb|A = B|, otherwise this type synonym would
1616 fail the occurs check. This seems like pretty bad news, since every instance
1617 of the occurs check in the type-checker could constitute a module inequality.
1618
1619 \subsection{Type families}
1620
1621 Like type classes, type families must not overlap (and this is a question of
1622 type safety!)
1623
1624 A more subtle question is compatibility and apartness of type family
1625 equations. Under these checks, aliasing of modules can fail if it causes
1626 two type families to be identified, but their definitions are not apart.
1627 Here is a simple example:
1628
1629 \begin{verbatim}
1630 A :: [
1631 type family F a :: *
1632 type instance F Int = Char
1633 ]
1634 B :: [
1635 type family F a :: *
1636 type instance F Int = Bool
1637 ]
1638 \end{verbatim}
1639
1640 Now it is illegal for \verb|A = B|, because when the type families are
1641 unified, the instances now fail the apartness check. However, if the second
1642 instance was \verb|F Int = Char|, the families would be able to link together.
1643
1644 To make matters worse, an implementation may define more axioms than are
1645 visible in the signature:
1646
1647 \begin{verbatim}
1648 package a where
1649 A :: [
1650 type family F a :: *
1651 type instance F Int = Bool
1652 ]
1653 package b where
1654 include a
1655 B = [
1656 import A
1657 type instance F Bool = Bool
1658 ...
1659 ]
1660 package c where
1661 A = [
1662 type family F a :: *
1663 type instance F Int = Bool
1664 type instance F Bool = Int
1665 ]
1666 include b
1667 \end{verbatim}
1668
1669 It would seem that private axioms cannot be naively supported. Is
1670 there any way that thinning axioms could be made to work?
1671
1672 \section{Open questions}\label{sec:open-questions}
1673
1674 Here are open problems about the implementation which still require
1675 hashing out.
1676
1677 \begin{itemize}
1678
1679 \item In Section~\ref{sec:simplifying-backpack}, we argued that we
1680 could implement Backpack without needing a shaping pass. We're
1681 pretty certain that this will work for typechecking and
1682 compiling fully definite packages with no recursive linking, but
1683 in Section~\ref{sec:typechecking-indefinite}, we described some
1684 of the prevailing difficulties of supporting signature linking.
1685 Renaming is not an insurmountable problem, but backwards flow of
1686 shaping information can be, and it is unclear how best to
1687 accommodate this. This is probably the most important problem
1688 to overcome.
1689
1690 \item In Section~\ref{sec:installing-indefinite}, a few choices for how to
1691 store source code were pitched, however, there is not consensus on which
1692 one is best.
1693
1694 \item What is the impact of the multiplicity of PackageIds on
1695 dependency solving in Cabal? Old questions of what to prefer
1696 when multiple package-versions are available (Cabal originally
1697 only needed to solve this between different versions of the same
1698 package, preferring the oldest version), but with signatures,
1699 there are more choices. Should there be a complex solver that
1700 does all signature solving, or a preprocessing step that puts
1701 things back into the original Cabal version. Authors may want
1702 to suggest policy for what packages should actually link against
1703 signatures (so a crypto library doesn't accidentally link
1704 against a null cipher package).
1705
1706 \end{itemize}
1707
1708 \end{document}