9be64652b329cade665444f2657b313da7d319a3
[ghc.git] / docs / backpack / backpack-impl.tex
1 \documentclass{article}
2
3 \usepackage{graphicx} %[pdftex] OR [dvips]
4 \usepackage{fullpage}
5 \usepackage{float}
6 \usepackage{titling}
7 \setlength{\droptitle}{-6em}
8
9 \newcommand{\ghcfile}[1]{\textsl{#1}}
10
11 \title{Implementing Backpack}
12
13 \begin{document}
14
15 \maketitle
16
17 The purpose of this document is to describe an implementation path
18 for Backpack~\cite{Kilpatrick:2014:BRH:2535838.2535884} in GHC\@.
19
20 We start off by outlining the current architecture of GHC, ghc-pkg and Cabal,
21 which constitute the existing packaging system. We then state what our subgoals
22 are, since there are many similar sounding but different problems to solve. Next,
23 we describe the ``probably correct'' implementation plan, and finish off with
24 some open design questions. This is intended to be an evolving design document,
25 so please contribute!
26
27 \section{Current packaging architecture}
28
29 The overall architecture is described in Figure~\ref{fig:arch}.
30
31 \begin{figure}[H]
32 \center{\scalebox{0.8}{\includegraphics{arch.png}}}
33 \label{fig:arch}\caption{Architecture of GHC, ghc-pkg and Cabal. Green bits indicate additions from upcoming IHG work, red bits indicate additions from Backpack. Orange indicates a Haskell library.}
34 \end{figure}
35
36 Here, arrows indicate dependencies from one component to another. Color
37 coding is as follows: orange components are libaries, green components
38 are to be added with the IHG work, red components are to be added with
39 Backpack. (Thus, black and orange can be considered the current)
40
41 \subsection{Installed package database}
42
43 Starting from the bottom, we have the \emph{installed package database}
44 (actually a collection of such databases), which stores information
45 about what packages have been installed are thus available to be
46 compiled against. There is both a global database (for the system
47 administrator) and a local database (for end users), which can be
48 updated independently. One way to think about the package database
49 is as a \emph{cache of object code}. In principle, one could compile
50 any piece of code by repeatedly recompiling all of its dependencies;
51 the installed package database describes when this can be bypassed.
52
53 \begin{figure}[H]
54 \center{\scalebox{0.8}{\includegraphics{pkgdb.png}}}
55 \label{fig:pkgdb}\caption{Anatomy of a package database.}
56 \end{figure}
57
58 In Figure~\ref{fig:pkgdb}, we show the structure of a package database.
59 The installed package are created from a Cabal file through the process
60 of dependency resolution and compilation. In database terms, the primary key
61 of a package database is the InstalledPackageId
62 (Figure~\ref{fig:current-pkgid}). This ID uniquely identifies an
63 instance of an installed package. The PackageId omits the ABI hash and
64 is used to qualify linker exported symbols: the current value of this
65 parameter is communicated to GHC using the \verb|-package-id| flag.
66
67 In principle, packages with different PackageIds should be linkable
68 together in the same compiled program, whereas packages with the same
69 PackageId are not (even if they have different InstalledPackageIds). In
70 practice, GHC is currently only able to select one version of a package,
71 as it clears out all old versions of the package in
72 \ghcfile{compiler/main/Package.lhs}:applyPackageFlag.
73
74 \begin{figure}
75 \center{\begin{tabular}{r l}
76 PackageId & package name, package version \\
77 InstalledPackageId & PackageId, ABI hash \\
78 \end{tabular}}
79 \label{fig:current-pkgid}\caption{Current structure of package identifiers.}
80 \end{figure}
81
82 The database entry itself contains the information from the installed package ID,
83 as well as information such as what dependencies it was linked against, where
84 its compiled code and interface files live, its compilation flags, what modules
85 it exposes, etc. Much of this information is only relevant to Cabal; GHC
86 uses a subset of the information in the package database.
87
88 \subsection{GHC}
89
90 The two programs which access the package database directly are GHC
91 proper (for compilation) and ghc-pkg (which is a general purpose
92 command line tool for manipulating the database.) GHC relies on
93 the package database in the following ways:
94
95 \begin{itemize}
96 \item It imports the local and global package databases into
97 its runtime database, and applies modifications to the exposed
98 and trusted status of the entries via the flags \verb|-package|
99 and others (\ghcfile{compiler/main/Packages.lhs}). The internal
100 package state can be seen at \verb|-v4| or higher.
101 \item It uses this package database to find the location of module
102 interfaces when it attempts to load the module info of an external
103 module (\ghcfile{compiler/iface/LoadIface.hs}).
104 \end{itemize}
105
106 GHC itself performs a type checking phase, which generates an interface
107 file representing the module (so that later invocations of GHC can load the type
108 of a module), and then after compilation projects object files and linked archives
109 for programs to use.
110
111 \subsection{hs-boot}
112
113 \verb|hs-boot| is a special mechanism used to support recursive linking
114 of modules within a package, today. Suppose I have a recursive module
115 dependency between modules and A and B. I break one of\ldots
116
117 (ToDo: describe how hs-boot mechanism works)
118
119 \subsection{Cabal}
120
121 Cabal is the build system for GHC, we can think of it as parsing a Cabal
122 file describing a package, and then making (possibly multiple)
123 invocations to GHC to perform the appropriate compilation. What
124 information does Cabal pass onto GHC\@? One can get an idea for this by
125 looking at a prototypical command line that Cabal invokes GHC with:
126
127 \begin{verbatim}
128 ghc --make
129 -package-name myapp-0.1
130 -hide-all-packages
131 -package-id containers-0.9-ABCD
132 Module1 Module2
133 \end{verbatim}
134
135 There are a few things going on here. First, Cabal has to tell GHC
136 what the name of the package it's compiling (otherwise, GHC can't appropriately
137 generate symbols that other code referring to this package might generate).
138 There are also a number of commands which configure its in-memory view of
139 the package database (GHC's view of the package database may not directly
140 correspond to what is on disk). There's also an optimization here: in principle,
141 GHC can compile each module one-by-one, but instead we use the \verb|--make| flag
142 because this allows GHC to reuse some data structures, resulting in a nontrivial
143 speedup.
144
145 (ToDo: describe cabal-install/sandbox)
146
147 \section{Goals}
148
149 There are actually a number of different goals we have for modifying the
150 packaging system, some of which are subsets of the Backpack system.
151
152 \begin{itemize}
153 \item As a prerequisite, support multiple instances of containers-2.9 \emph{in the
154 package database}. These instances may be compiled against
155 different dependencies, have the same dependencies but different
156 source files (as when a package is being developed), or be
157 compiled with different options. It is less important to allow
158 these instances to be linkable together.\footnote{Actually, I think
159 this is completely orthogonal to Backpack, since we are going to treat
160 dependencies as part of the package ID, so they would be considered
161 separate entries in the package database anyway.}
162
163 \item Support typechecking a library against a module interface
164 as opposed to an actual implementation. This would be useful
165 for moving towards a model where Cabal package dependency versions
166 are replaced with proper signature packages. % See Section~\ref{sec:signature-packages} for more details.
167
168 \item Support compiling the aforementioned libraries with actual implementations.
169 It is \emph{not} a goal to be able to compile a library while only
170 partially providing dependencies, and it is low priority to support
171 mutually recursive implementations of these implementations.
172
173 \iffalse%
174 \item Support insertion of backwards compatibility shims for packages
175 that are using old versions of packages, so that you can continue
176 using them without having to patch them manually. This is a
177 stylized use-case of Backpack features. See Section~LINKME for
178 more details.
179
180 \item Some easy-to-implement subset of the functionality provided by
181 packages with holes (Backpack proper). This includes support
182 of linking an executable containing multiple packages with the
183 same package name but different PackageIds.
184 \fi
185 \end{itemize}
186
187 A lower priority goal is to actually allow multiple instances of
188 containers-2.9 to be linked together in the same executable
189 program.\footnote{In particular, this requires changes to how linker
190 symbols are assigned. However, this feature is important to
191 implement a number of Backpack features.}
192
193 A \emph{non-goal} is to allow users to upgrade upstream libraries
194 without recompiling downstream. This is an ABI concern and we're not
195 going to worry about it.
196
197 \section{Aside: Recent IHG work}\label{sec:ihg}
198
199 The IHG project has allocated some funds to relax the package instance
200 constraint in the package database, so that multiple instances can be
201 stored, but now the user of GHC must explicitly list package-IDs to be
202 linked against. In the far future, it would be expected that tools like
203 Cabal automatically handle instance selection among a large number of
204 instances, but this is subtle and so this work is only to do some
205 foundational work, allowing a package database to optionally relax the
206 unique package-version requirement, and utilize environment files to
207 select which packages should be used. See Duncan's email for more
208 details on the proposal.
209
210 To implement this:
211
212 \begin{enumerate}
213
214 \item Remove the ``removal step'' when registering a package (with a flag)
215
216 \item Check \ghcfile{compiler/main/Packages.lhs}:mkPackagesState to look out for shadowing
217 within a database. We believe it already does the right thing, since
218 we already need to handle shadowing between the local and global database.
219
220 \end{enumerate}
221
222 Once these changes are implemented, we can program multiple instances by
223 using \verb|-hide-all-packages -package-id ...|, even if there is no
224 high-level tool support.
225
226 Actually, this concern is orthogonal to the purposes of Backpack, if
227 we redefine PackageId appropriately.
228
229 \paragraph{The ABI hash} Currently, InstalledPackageId
230 is constructed of a package, version and ABI hash
231 (generateRegistrationInfo in
232 \ghcfile{libraries/Cabal/Cabal/Distribution/Simple/Register.hs}). The
233 use of an ABI hash is a bit of GHC-specific hack introduced in 2009,
234 intended to make sure these installed package IDs are unique. While
235 this is quite clever, using the ABI is actually a bit inflexible, as one
236 might reasonably want to have multiple copies of a package with the same
237 ABI but different source code changes.\footnote{In practice, our ABIs
238 are so unstable that it doesn't really matter.}
239
240 In Figure~\ref{fig:proposed-pkgid}, there is an alternate logical
241 representation of InstalledPackageId which attempts to extricate the
242 notion of ABI compatibility from what actually might uniquely identify a
243 package beyond its PackageId. We imagine these components to be:
244
245 \begin{itemize}
246 \item A hash of the source code (so one can register different
247 in-development versions without having to bump the version
248 number);
249 \item Compilation way (profiling? dynamic?)
250 \item Compilation flags (such as compilation way, optimization,
251 profiling settings)\footnote{This is a little undefined on a package bases, because in principle the flags could be varied on a per-file basis. More likely this will be approximated against the relevant fields in the Cabal file as well as arguments passed to Cabal.};
252 \end{itemize}
253
254 A historical note: in the 2012 GSoC project to allow multiple instances
255 of a package to be installed at the same time, use of \emph{random
256 numbers} was used to workaround the inability to get an ABI early
257 enough. We are not using this plan.
258
259 \section{Infrastructural improvements}
260
261 There are some infrastructural improvements that must be made before
262 Backpack proper can be implemented. These additions are described in
263 red in the architectural diagrams. The current structure of this
264 section is to describe the additions bottom up.
265
266 \subsection{Concrete physical identity = PackageId$^*$ + Module name}\label{sec:ipi}
267
268 \begin{figure}
269 \center{\begin{tabular}{r l}
270 PackageId & hash of ``package name, package version, PackageIds of dependent packages'' \\
271 InstalledPackageId & hash of ``PackageId, source code, way, compiler flags'' \\
272 \end{tabular}}
273 \label{fig:proposed-pkgid}\caption{Proposed new structure of package identifiers.}
274 \end{figure}
275
276 In Backpack, there needs to be some mechanism for assigning
277 \emph{physical module identities} to modules, which are essential for
278 typechecking Backpack packages, since they let us tell if two types are
279 equal or not. In the paper, the physical identity was represented as the
280 package that constructed it as well as some representation of the module
281 source. We can simplify this slightly: in current Cabal packages, we
282 require that modules always be given a package-unique logical name;
283 thus, physical identities can be simply represented as a PackageId plus
284 module name. (See \ghcfile{compiler/basicTypes/Module.lhs:Module})
285 In fact, this coincides with how GHC already internally handles
286 the problem of type equality: it appeals to an \emph{original name}
287 which is, presently, a PackageId and the module name.
288
289 However, with the current representation of PackageIds, this is
290 insufficient: a package is not just its name, but also the regular
291 tree representing all of its package dependencies. Thus, we have
292 to adjust the representation of a PackageId so that it includes this
293 regular tree, as seen Figure~\ref{fig:proposed-pkgid}. Since this
294 tree in general may be quite long, it needs to be shortened in some way,
295 such as by hashing.
296
297 And now, the complications\ldots
298
299 \paragraph{Relaxing package selection restrictions} As mentioned
300 previously, GHC is unable to select multiple packages with the same
301 package name (but different PackageIds). This restriction needs to be
302 lifted. For backwards compatibility reasons, it's probably better to
303 not overload \verb|-package-id| but add a new flag, maybe \verb|-force-package-id|;
304 we also need to disable the old version masking behavior. This is orthogonal
305 to the IHG work, which wants to allow multiple InstalledPackageIds in the
306 \emph{database} (here, we want to allow multiple PackageIds in compiled code).
307
308 \paragraph{Linker symbols} As we increase the amount of information in
309 PackageId, it's important to be careful about the length of these IDs,
310 as they are used for exported linker symbols (e.g.
311 \verb|base_TextziReadziLex_zdwvalDig_info|). Very long symbol names
312 hurt compile and link time, object file sizes, GHCi startup time,
313 dynamic linking, and make gdb hard to use. As such, the current plan is
314 to do away with full package names and versions, and instead use just a
315 base-62 encoded hash, perhaps with the first four characters of the package
316 name for user-friendliness.
317
318 Edward: I'm still partial to a short hash of the dependency bits (or
319 even Simon's registry of short strings to dependency trees), and keeping
320 everything else the same.
321
322 \paragraph{Wired-in names} One annoying thing to remember is that GHC
323 has wired-in names, which refer to packages without any version. Now
324 these wired names also have to accomodate dependency trees. A
325 suggested approach is to have a fixed table from these wired names to
326 package IDs; alternately we can use something like the special \verb|inplace|
327 version number.
328
329 \paragraph{Free variables (or, what is a non-concrete physical
330 identity?)} Physical identities in their full generality are permitted
331 to have free variables, which represent holes. Handling this is a
332 tricky question, and we defer it to Section~\ref{sec:variables}, when
333 we talk about packages with holes.
334
335 \subsection{Exposed modules should allow external modules}\label{sec:reexport}
336
337 In Backpack, the definition of a package consists of a logical context,
338 which maps logical module names to physical module names. These do not
339 necessarily coincide, since some physical modules may have been defined
340 in other packages and mixed into this package. This mapping specifies
341 what modules other packages including this package can access.
342 However, in the current installed package database, we have exposed-modules which
343 specify what modules are accessible, but we assume that the current
344 package is responsible for providing these modules.
345
346 To implement Backpack, we have to extend this exposed-modules (``Export declarations''
347 on Figure~\ref{fig:pkgdb}). Rather
348 than a list of logical module names, we provide a new list of tuples:
349 the exported logical module name and original physical module name (this
350 is in two parts: the InstalledPackageId and the original module name).
351 For example, an traditional module export is simply (Name, my-pkg-id, Name);
352 a renamed module is (NewName, my-pkg-id, OldName), and an external module
353 is (Name, external-pkg-id, Name).
354
355 \section{Simplifying Backpack}\label{sec:simplifying-backpack}
356
357 Backpack as currently defined always requires a \emph{shaping} pass,
358 which calculates the shapes of all modules defined in a package.
359 The shaping pass is critical to the solution of the double-vision problem
360 in recursive module linking, but it also presents a number of unpalatable
361 implementation problems:
362
363 \begin{itemize}
364
365 \item A shaping pass means that it's not possible to compile modules
366 in a package one-by-one, even if there is \emph{no mutual
367 recursion} at all in the package. Without mutual recursion, the
368 shaping pass should not be necessary. (Maybe this is not true!)
369 Always enforcing two-passes means Backpack is not a ``pay as you
370 go'' language feature.
371
372 \item The calculated module shapes are quite intricate: in particular,
373 they contain information about the providences of all data types
374 and identifiers defined by a module. To run a shape analysis, we
375 must preprocess and parse all modules\ldots and then parse them
376 again later, in the second, type-checking pass; furthermore, all
377 of this information must be passed to GHC later.
378
379 \item The conclusions of the shaping pass are \emph{incompatible} with
380 what types of mutual recursion GHC can actually compile today
381 via hs-boot: primarily, GHC requires that any data type or function
382 defined in an hs-boot file also be \emph{defined} in its corresponding
383 hs file. (In this way, we can assign an original name to it immediately,
384 which simplifies compilation immensely.)
385
386 \item The mode of program organization Backpack encourages, mutually
387 recursive modules, is less efficient than code without recursive
388 dependencies: thus programmers have good motivation to avoid
389 this design pattern when possible.
390
391 \end{itemize}
392
393 The shaping pass is fundamentally necessary for some types of Backpack packages.
394 Here is the example which convinced Simon:
395
396 \begin{verbatim}
397 package p where
398 A :: [data T; f :: T -> T]
399 B = [export T(MkT), h; import A(f); data T = MkT; h x = f MkT]
400 A = [export T(MkT), f, h; import B; f MkT = MkT]
401 \end{verbatim}
402
403 The key is to observe that B \emph{may or may not typecheck} depending
404 on the definition of A. Because A reexports B's definition T, B will
405 typecheck; but if A defined T on its own, B would not typecheck. Thus,
406 we \emph{cannot} typecheck B until we have done some analysis of A (the
407 shaping analysis!)
408
409 All of these restrictions point to the desirability of a subset of
410 Backpack which does not allow mutually recursive modules, and
411 consequently does not require a shaping pass.
412
413 Here is the programming discipline, called the \textbf{linking restriction}: \emph{Module implementations
414 must be declared before signatures.} Alternately, $+ \oplus - \neq - \oplus +$. Here's an example:
415
416 \begin{verbatim}
417 package random where
418 System.Random = [ ... ].hs
419
420 package monte-carlo where
421 System.Random :: ...
422 System.MonteCarlo = [ ... ].hs
423 \end{verbatim}
424
425 Now, to link these two applications together, only one ordering
426 is permissible:
427
428 \begin{verbatim}
429 package myapp where
430 include random
431 include monte-carlo
432 \end{verbatim}
433
434 or even (if myapp wants to provide its own random library):
435
436 \begin{verbatim}
437 package myapp2 where
438 System.Random = [ ... ].hs
439 include monte-carlo
440 \end{verbatim}
441
442 That is, all of \verb|monte-carlo|'s holes have been filled in by the time
443 it is included. The alternate ordering is not allowed, as it could have
444 resulted in a recursive
445
446 Why does this discipline prevent mutually recursive modules? Intuitively,
447 the signature is the mechanism by which we can get a handle on an implementation
448 before it is defined, thus allowing circularity. If there is never an earlier handle
449 to an implementation, circularity is not possible. A simple example of
450 this violation is in the reordered version of \verb|myapp2|:
451
452 \begin{verbatim}
453 package myapp2 where
454 include monte-carlo
455 System.Random = [ import System.MonteCarlo ].hs
456 \end{verbatim}
457
458 (Nota bene: this restriction applies to aliasing to, which can induce
459 linking as well).
460
461 \subsection{Why isn't shaping necessary?}
462
463 (ToDo: have this formally setup and everything)
464
465 In the case that we are compiling a definite package (there will be no leftover
466 holes), the argument is quite simple. In vanilla Backpack, when a
467 signature is defined, we have to generate a pile of module variables
468 which may get unified with physical module identities. However, with
469 the linking restriction, we only ever link a signature with a
470 \emph{fully instantiated} module definition. Thus, all module variables
471 can be immediately eliminated, and all we need to do is maintain an
472 updating logical context (i.e., module environment) so that later
473 modules know how to resolve imports.
474
475 When we are typechecking an indefinite package (there may be leftover
476 holes), things are a little more subtle. When a signature is defined,
477 we \emph{know} that it will never be unified against a proper module
478 name. So we can assign it some fresh original name.
479
480 Edward: What I find a bit dodgy is that there still are circumstances
481 where we will link a signature with a signature. For example, if I have:
482
483 \begin{verbatim}
484 package a where
485 A :: [ data A ]
486 B :: [ data A ]
487 M = ...
488 A = B
489 Q = ...
490 \end{verbatim}
491
492 We haven't violated the linking rule, but if we chose separate original
493 names for A and B, they need to be identified after the alias.
494
495 Fortunately, Q should typecheck as if the As are the same, but M should
496 typecheck as if they are distinct. There is some ordering here, and our
497 claim (to be formally verified) is that these unifications can be
498 resolved as we process them.
499
500 \subsection{Installing indefinite packages}\label{sec:indefinite-packages}
501
502 If an indefinite package contains no code at all, we only need
503 to install the interface file for the signatures. However, if
504 they include code, we must provide all of the
505 ingredients necessary to compile them when the holes are linked against
506 actual implementations. (Figure~\ref{fig:pkgdb})
507
508 \paragraph{Source tarball or preprocessed source?} An ancillary design choice
509 to be made is what the representation of the source that is saved is. There
510 are a number of possible choices:
511
512 \begin{itemize}
513 \item The original tarballs downloaded from Hackage,
514 \item Preprocessed source files,
515 \item Some sort of internal, type-checked representation of Haskell code (maybe the output of the desugarer).
516 \end{itemize}
517
518 Storing the tarballs is the simplest and most straightforward mechanism,
519 but we will have to be very certain that we can recompile the module
520 later in precisely the same we compiled it originally, to ensure the hi
521 files match up (fortunately, it should be simple to perform an optional
522 sanity check before proceeding.) The appeal of saving preprocessed
523 source, or even the IRs, is that this is conceptually this is exactly
524 what an indefinite package is: we have paused the compilation process
525 partway, intending to finish it later. It allows us to reuse the work
526 done preprocessing or typechecking. However, it may be the case that
527 these phases work differently once library dependencies are known; typechecking
528 especially, see Section~\ref{sec:variables}.
529
530 \subsection{Compilation}
531
532 When we need to compile an indefinite package (since all of its
533 dependencies have been found), there seem to be two implementation
534 strategies possible. Here is a very simple example to consider for the
535 various cases:
536
537 \begin{verbatim}
538 package pkg-a where
539 A = ...
540 package pgk-b where -- indefinite package
541 A :: ...
542 B = [ b = ... ]
543 package pkg-c where
544 include pkg-a
545 include pkg-b
546 \end{verbatim}
547
548 \paragraph{The ``downstream'' proposal} This is Simon's favorite
549 proposal. Because of the linking invariant, if we traverse the Backpack
550 specification, any given module we need to compile will have all of its
551 dependencies compiled (since it could only depend on the dependency if
552 there was a signature, and the signature could not have been linked
553 unless it was implemented previously.) So we just go ahead and compile
554 everything one-by-one, as if the package was a single big package. (Of
555 course, if we encounter a definite package, don't bother recompiling
556 it; just use it directly.) In
557 particular, we maintain the invariant that any code generated will
558 export symbols under the current package's namespace. So the identifier
559 \verb|b| in the example becomes a symbol \verb|pkg-c_pkg-b_B_b| rather
560 than \verb|pkg-b_B_b| (package subqualification is necessary because
561 package C may define its own B module after thinning out the import.)
562
563 One big problem with this proposal is that it doesn't implement applicative
564 semantics. If there is another package:
565
566 \begin{verbatim}
567 package pkg-d where
568 include pkg-a
569 include pkg-b
570 \end{verbatim}
571
572 this will generate its own instance of B, even though it should be the same
573 as C. Simon was willing to entertain the idea that, well, as long as the
574 type-checker is able to figure out they are the same, then it might be OK
575 if we accidentally generate two copies of the code (provided they actually
576 are the same).
577
578 \paragraph{The ``upstream'' proposal}
579 The problem with the ``downstream'' proposal is that it always recompiles
580 all of the indefinite parts of the package, even if some of them should
581 be shared in some sense. Hypothetically, a fully linked version of an
582 indefinite package should be considered a package in its own right
583 (in particular, it is associated with a physical module identity in Backpack).
584 So all we need to do is store these installed packages somewhere, probably
585 the local package database. If we recompile the same dependency chain,
586 the installed package can be simply reused. These products do not have
587 to be exposed.
588
589 One problem with this proposal is in this previous example:
590
591 \begin{verbatim}
592 package myapp2 where
593 System.Random = [ ... ].hs
594 include monte-carlo
595 \end{verbatim}
596
597 Here, monte-carlo depends on a ``subpart of the myapp2 package'', and it's
598 not entirely clear how to express this.
599
600 \subsection{Restricted recursive modules ala hs-boot}
601
602 It should be possible to support GHC-style mutual recursion using the
603 Backpack formalism immediately using hs-boot files. However, to avoid
604 the need for a shaping pass, we must adopt an existing constraint that
605 already applies to hs-boot files: \emph{at the time we define a signature,
606 we must know what the original name for all data types is}. We then
607 compile modules as usual, but compiling against the signature as if it
608 were an hs-boot file.
609
610 (ToDo: Figure out why this eliminates the shaping pass)
611
612 Compiling packages in this way gives the tantalizing possibility
613 of true separate compilation: the only thing we don't know is what the actual
614 package name of an indefinite package will be, and what the correct references
615 to have are. This is a very minor change to the assembly, so one could conceive
616 of dynamically rewriting these references at the linking stage.
617
618 \iffalse%
619
620 \section{Module variables and original names}\label{sec:variables}
621
622 In Backpack, the physical identities of modules are in fact trees of
623 variables, constructors and recursive module identities. As we
624 mentioned in Section~\ref{sec:ipi}, for concrete modules where there are
625 no free variables, physical identity maps nicely to an original name.
626 For packages with holes, the original names mechanism needs to be more
627 flexible; we need to first run a shaping pass in order to figure out if
628 a module is actually equal to another.
629
630 The direct approach is to replace original names with Backpack's
631 physical identities and use unification to do comparisons. However,
632 this would require extremely invasive changes to all aspects of GHC, as
633 the original names assumption is baked quite deeply into the compiler:
634 we'd like to avoid upsetting this applecart if possible.
635
636 One key insight is that when we can only compile when all of the holes
637 are eliminated; thus all parts of the compilation pipeline beyond
638 typechecking can, in fact, assume that they will only be dealing with
639 concrete module identities. One interpretation of this fact is that
640 we might actually be able to implement unification properly.
641
642 This gives us more latitude for how to
643 deal with the identities during typechecking: we may, for example,
644 adopt a conservative analysis (or an optional, dangerously liberal one),
645 secure in the knowledge that when we compile the final concrete module,
646 we will typecheck again with everything put together properly.
647
648 The next subsection describes one proposal for dealing with this issue,
649 and in what ways it is unsound. We imagine there are other strategies
650
651 \subsection{Holes have physical module identities associated with definition site}
652
653 When compiling a module with a hole, we create a set of interface files
654 from the signature. In the current architecture of GHC, we implicitly
655 assume that all interface files are associated with an \emph{original
656 name} (see below); thus, this means that it is very natural to assign an
657 original name to every hole. If we have a signature package:
658
659 \begin{verbatim}
660 package foo-1x-sig where
661 Foo :: ...
662 \end{verbatim}
663
664 Then the physical identity associated with Foo would be
665 \verb|foo-1x-sig_Foo| (rather than a fresh module variable $\alpha$).
666
667 This has implications for signature linking. Take this example from
668 Section 2.3 of the Backpack paper:
669
670 \begin{verbatim}
671 package yourlib where
672 Prelude :: [data List a = ...]
673 Yours :: [import Prelude; ...]
674
675 package mylib where
676 Prelude :: [data Bool = ...]
677 include yourlib
678 Mine = [import Prelude; import Yours; ...]
679 \end{verbatim}
680
681 The first order of business is that we have to merge the interfaces of
682 the two Prelude holes, which are associated with different generated
683 physical identities (\verb|yourlib_Prelude| and \verb|mylib_Prelude|),
684 and furthermore, \emph{assign an original name to it}. If we pick,
685 for example, \verb|mylib_Prelude|, none of the types defined in \verb|yourlib|
686 will unify in \verb|Mine|; thus, we would need to rename the original names
687 in \verb|yourlib|.\footnote{Edward: I don't think this is really the
688 right way to talk about this: Backpack requires a shaping pass and we should
689 be talking about it}
690
691 % ezyang: So, I thought I knew what I was talking about here, but actually
692 % the text here needs to be put in the context of shaping, so I need to
693 % think about this some more
694
695 \paragraph{Instantiation and reuse}
696
697 If we only ever linked any written signature with a signle implementation, this would actually
698 be great: when it is time to properly compile a module with holes, we
699 type check against the true original name of the new module (and just
700 do a consistency check between the signature and the implementation, modulo
701 differences in the physical identity)---a case of typechecking proceeding
702 differently between the initial typecheck and final compilation.
703
704 However, conflating a variable with an original name is very bad business.
705 Consider the graph library
706
707 However, Backpack supports the linking of two signatures together, which
708 now presents a formidable difficulty: the original names of these two
709 signatures now must be identified. Here is a real-world example where
710 this shows up:
711
712 \begin{verbatim}
713 package foo-1x-sig where
714 Foo :: ...
715
716 package foo-2x-sig where
717 include foo-1x-sig
718 Foo :: ...
719
720 package foo-2.1 where
721 include foo-2x-sig
722 Foo = ...
723
724 package A where
725 include foo-1x-sig
726 A = ...
727
728 package B where
729 include foo-2x-sig
730 B = ...
731
732 package C where
733 include A
734 include B
735 C = ...
736
737 package D where
738 include C
739 include foo-2.1
740 \end{verbatim}
741
742 Here, we have two signatures for the \verb|foo| library, one of which
743 is a subset of another (but they are ostensibly backwards compatible).
744 Package A only uses the 1.x interface, but package B uses the 2.x interface.
745 Package C uses both of these packages (and implicitly links the two signatures
746 together), and package D finally links in an actual implementation of the
747 library.
748
749
750
751 However, directly modifying original names in this fashion
752 is likely to be too invasive of a change, since the current compiler has
753 the assumption that original names can always be checked for equality
754 is deeply wired in.
755
756 Aliasing of signatures means that it is no longer the case that
757 original name means type equality. We were not able to convince
758 Simon of any satisfactory resolution. Strawman proposal is to
759 extending original names to also be variables probably won't work
760 because it is so deeply wired, but it's difficult to construct hi
761 files so that everything works out (quadratic).
762
763
764 There are some problems with respect to what occurs when two
765 distinct signatures are linked together (aliasing), we talk these problems in
766 Section~\ref{sec:open-questions}.
767
768 \fi
769
770 \subsection{Original names} Original names are an important design pattern
771 in GHC\@.
772 Sometimes, a name can be exposed in an hi file even if its module
773 wasn't exposed. Here is an example (compiled in package R):
774
775 \begin{verbatim}
776 module X where
777 import Internal (f)
778 g = f
779
780 module Internal where
781 import Internal.Total (f)
782 \end{verbatim}
783
784 Then in X.hi:
785
786 \begin{verbatim}
787 g = <R.id, Internal.Total, f> (this is the original name)
788 \end{verbatim}
789
790 (The reason we refer to the package as R.id is because it's the
791 full package ID, and not just R).
792
793 How might internal names work with Backpack?
794
795 \begin{verbatim}
796 package P where
797 M = ...
798 N = ...
799 package Q (M, R, T)
800 include P (N -> R)
801 T = ...
802 \end{verbatim}
803
804 And now if we look at Q\@:
805
806 \begin{verbatim}
807 exposed-modules:
808 M -> <P.id, M>
809 R -> <P.id, N>
810 T -> <Q.id, T>
811 \end{verbatim}
812
813 When we compile Q, and the interface file gets generated, we have
814 to generate identifiers for each of the exposed modules. These should
815 be calculated to directly refer to the ``original name'' of each them;
816 so for example M and R point directly to package P, but they also
817 include the original name they had in the original definition.
818
819 \section{Shaping}\label{sec:shaping}
820
821 Here is a hyper-simplified picture of how Backpack in its full glory might look.
822
823 Recall in Figure~\ref{fig:pkgdb}, we have Cabal which is responsible for
824 taking Cabal files and invoking GHC to build an installed package. The
825 command line looks something like:
826
827 \begin{verbatim}
828 ghc --make
829 -package-name a-0.1
830 -hide-all-packages
831 -package-id containers-0.9-ABCD
832 Module1 Module2
833 \end{verbatim}
834
835 We notice a few things, first that GHC accepts some flags which
836 configure its view of the package database (really it's just loaded
837 in-memory), so that when we perform an import, GHC knows how to find the
838 module (\verb|findModule|).
839
840 The claim is that we can implement Backpack by adding an extra pass
841 to Cabal which we will call the shaper, which performs the shaping pass
842 as described by the Backpack paper. In full generality, the \emph{module shape}
843 is what needs to be passed from Cabal to GHC, but in our simplified example
844 we're just going to deal with what SPJ calls the ``module environment'' (which
845 roughly corresponds to a logical shape context in the paper.) This module
846 environment maps logical module names to the \emph{original names}, i.e.
847 overloading the behavior of how we find modules on imports.
848
849 We'll go by way of an example:
850
851 \begin{verbatim}
852 package a1 where
853 A :: [ data T; f : T -> T ]
854 package a2 where
855 A :: [ data T; g : T -> T ]
856 package b
857 include a1
858 include a2
859 A = [ ... ] -- call physical identity b-2.7:A
860 \end{verbatim}
861
862 At the end of the day, we'd like package b to be compiled with a map from
863 logical name A to physical identity \verb|b-2.7:A|. But what about packages
864 a1 and a2? Since these have holes, we don't really know what their physical
865 identities are; but that doesn't mean we don't talk about them; instead,
866 we have a module variable for the holes. These holes have signatures, which can be thought
867 of as the interface files. So we have:\footnote{The type signatures are not quite right because they need provenance information, but never-mind that for now.} \\
868
869 \noindent
870 \verb|a1:| $\forall \alpha.\ \mathrm{A} \mapsto \alpha \mbox{\texttt{\ ::\ [ data T\@; f :\ T -> T ]}}$ \\ % chktex 26
871 \verb|a2:| $\forall \alpha.\ \mathrm{A} \mapsto \alpha \mbox{\texttt{\ ::\ [ data T\@; g :\ T -> T ]}}$ % chktex 26
872
873 Now lets consider what happens as we do the shaping analysis on package b.
874 We process each line of the module one-by-one, and initially, we start
875 with an empty module environment. When we process the first include, we
876 now know about a mapping $A \mapsto \beta$, but we don't know the physical
877 identity of the module so we gin up a module variable for now. The second
878 include, we have to instantiate the universal quantifier with $\beta$ because
879 signature linking demands the modules be the same. Finally, when we hit the
880 module definition, we see that $A \mapsto \mathtt{b-2.7:A}$.
881
882 Now, importantly, remember that there were hi files associated with each
883 of the signatures, and these hi files may contained original names
884 corresponding to our module variables. As we processed the mapping, we
885 also built up a renaming of physical variable names to their true
886 symbols, so we need to rename the respective $\alpha$s in each interface
887 file to point ot \texttt{b-2.7:A}, as we ended up seeing.
888
889 (SPJ and I went through a few other examples, including the case of recursive
890 module linking:
891
892 \begin{verbatim}
893 package a1 where
894 A :: ...
895 B = [import A]
896 package b where
897 include a1
898 A = [import B]
899 \end{verbatim}
900
901 and verified it worked. Note that recursively linked modules do not get
902 inlining! We ust need to know what the original name of what we're linking
903 against is, so we can stick in the right symbols.)
904
905 Now, the most paper-like thing one an do in this circumstance is to pass
906 over the renaming, but we'd like to avoid having command line arguments
907 that are too long. So SPJ's proposal is that, instead, we have a \emph{module
908 interface file} (or \emph{module manifest}) which collects the various information
909 that we might be interested in about module providence and renaming. Now,
910 in order to program GHC to have the right logical namespace, we only need
911 to list the appropriate package manifests but with the appropriate thinning
912 and renaming. This could be shorter, and maps more nicely onto the existing
913 package command line flags. We'd have to make sure this works out, but it's
914 exactly analogous to how we do interface files.
915
916 \section{Open questions}\label{sec:open-questions}
917
918 Here are open problems about the implementation which still require
919 hashing out.
920
921 \begin{itemize}
922 \item Aliasing of signatures means that it is no longer the case that
923 original name means type equality. We were not able to convince
924 Simon of any satisfactory resolution. Strawman proposal is to
925 extending original names to also be variables probably won't work
926 because it is so deeply wired, but it's difficult to construct hi
927 files so that everything works out (quadratic).
928
929 \item Relationship between linker names and InstalledPackageId? The reason
930 the obvious thing to do is use all of InstalledPackageId for linker
931 name, but this breaks recompilation. So some of these things
932 should go in the linker name, and not others (yes package, yes
933 version, yes some compile flags (definitely ways), eventually
934 dependency package IDs, what about cabal build flags). This is
935 approximately an ABI hash, but it's computable before compilation.
936 This worries Simon because now there are two names, but maybe
937 the DB can solve that problem---unfortunately, GHC doesn't ever
938 register during compilation; only later.
939
940 Simon also thought we should use shorter names for linker
941 names and InstallPkgIds. This appears to be orthogonal.
942
943 \item In this example:
944
945 \begin{verbatim}
946 package A where
947 A = ...
948 package A2 where
949 A2 = ...
950 package B (B)
951 A :: ...
952 B = ...
953 package C where
954 include A
955 include B
956 package D where
957 include A
958 include B
959 package E where
960 include C (B as CB)
961 include D (B as DB)
962 \end{verbatim}
963
964 Do the seperate instantiations of B exist as seperate artifacts
965 in the database, or do they get constructed on the fly by
966 definite packages which include them? The former is conceptually
967 nicer (identity of types work, closer to Backpack semantics), but
968 the latter is easier for linker names. (Simon's first inclination
969 is to construct it on the fly.)
970
971 There was another example, showing that we need to solve this
972 problem even for indefinite combinations of indefinite modules.
973 You can get to it by modifying the earlier example so that C and
974 D still have holes, which E does not fill.
975
976 \item We have to store the preprocessed sources for indefinite packages.
977 This is hard when we're constructing packages on the fly.
978
979 \item What is the impact on dependency solving in Cabal? Old questions
980 of what to prefer when multiple package-versions are available
981 (Cabal originally only needed to solve this between different
982 versions of the same package, preferring the oldest version), but
983 with signatures, there are more choices. Should there be a
984 complex solver that does all signature solving, or a preprocessing
985 step that puts things back into the original Cabal version.
986 Authors may want to suggest policy for what packages should actually
987 link against signatures (so a crypto library doesn't accidentally
988 link against a null cipher package).
989 \end{itemize}
990
991 \section{Immediate tasks}
992
993 At this point in time, it seems we can identify the following set
994 of non-controversial tasks which can be started immediately.
995
996 \begin{itemize}
997 \item Relax the package database constraint to allow multiple
998 instances of package-version. (Section~\ref{sec:ihg})
999 \item Propagate the use of \verb|InstalledPackageId| instead of
1000 package IDs for typechecking. (Section~\ref{sec:ipi})
1001 \item Implement export declarations in package format, so
1002 packages can reexport modules from other packages. (Section~\ref{sec:reexport})
1003 \end{itemize}
1004
1005 The aliasing problem is probably the most important open problem
1006 blocking the rest of the system.
1007
1008 \bibliographystyle{plain}
1009 \bibliography{backpack-impl}
1010
1011 \end{document}