[ghc.git] / docs / backpack / backpack-impl.tex
1 \documentclass{article}
3 \usepackage{graphicx} %[pdftex] OR [dvips]
4 \usepackage{fullpage}
5 \usepackage{float}
6 \usepackage{titling}
7 \setlength{\droptitle}{-6em}
9 \newcommand{\ghcfile}[1]{\textsl{#1}}
11 \title{Implementing Backpack}
13 \begin{document}
15 \maketitle
17 The purpose of this document is to describe an implementation path
18 for Backpack~\cite{Kilpatrick:2014:BRH:2535838.2535884} in GHC\@.
20 We start off by outlining the current architecture of GHC, ghc-pkg and Cabal,
21 which constitute the existing packaging system. We then state what our subgoals
22 are, since there are many similar sounding but different problems to solve. Next,
23 we describe the ``probably correct'' implementation plan, and finish off with
24 some open design questions. This is intended to be an evolving design document,
25 so please contribute!
27 \section{Current packaging architecture}
29 The overall architecture is described in Figure~\ref{fig:arch}.
31 \begin{figure}[H]
32 \center{\scalebox{0.8}{\includegraphics{arch.png}}}
33 \label{fig:arch}\caption{Architecture of GHC, ghc-pkg and Cabal. Green bits indicate additions from upcoming IHG work, red bits indicate additions from Backpack. Orange indicates a Haskell library.}
34 \end{figure}
36 Here, arrows indicate dependencies from one component to another. Color
37 coding is as follows: orange components are libaries, green components
38 are to be added with the IHG work, red components are to be added with
39 Backpack. (Thus, black and orange can be considered the current)
41 \subsection{Installed package database}
43 Starting from the bottom, we have the \emph{installed package database}
44 (actually a collection of such databases), which stores information
45 about what packages have been installed are thus available to be
46 compiled against. There is both a global database (for the system
47 administrator) and a local database (for end users), which can be
48 updated independently. One way to think about the package database
49 is as a \emph{cache of object code}. In principle, one could compile
50 any piece of code by repeatedly recompiling all of its dependencies;
51 the installed package database describes when this can be bypassed.
53 \begin{figure}[H]
54 \center{\scalebox{0.8}{\includegraphics{pkgdb.png}}}
55 \label{fig:pkgdb}\caption{Anatomy of a package database.}
56 \end{figure}
58 In Figure~\ref{fig:pkgdb}, we show the structure of a package database.
59 The installed package are created from a Cabal file through the process
60 of dependency resolution and compilation. In database terms, the primary key
61 of a package database is the InstalledPackageId
62 (Figure~\ref{fig:current-pkgid}). This ID uniquely identifies an
63 instance of an installed package. The PackageId omits the ABI hash and
64 is used to qualify linker exported symbols: the current value of this
65 parameter is communicated to GHC using the \verb|-package-id| flag.
67 In principle, packages with different PackageIds should be linkable
68 together in the same compiled program, whereas packages with the same
69 PackageId are not (even if they have different InstalledPackageIds). In
70 practice, GHC is currently only able to select one version of a package,
71 as it clears out all old versions of the package in
72 \ghcfile{compiler/main/Package.lhs}:applyPackageFlag.
74 \begin{figure}
75 \center{\begin{tabular}{r l}
76 PackageId & package name, package version \\
77 InstalledPackageId & PackageId, ABI hash \\
78 \end{tabular}}
79 \label{fig:current-pkgid}\caption{Current structure of package identifiers.}
80 \end{figure}
82 The database entry itself contains the information from the installed package ID,
83 as well as information such as what dependencies it was linked against, where
84 its compiled code and interface files live, its compilation flags, what modules
85 it exposes, etc. Much of this information is only relevant to Cabal; GHC
86 uses a subset of the information in the package database.
88 \subsection{GHC}
90 The two programs which access the package database directly are GHC
91 proper (for compilation) and ghc-pkg (which is a general purpose
92 command line tool for manipulating the database.) GHC relies on
93 the package database in the following ways:
95 \begin{itemize}
96 \item It imports the local and global package databases into
97 its runtime database, and applies modifications to the exposed
98 and trusted status of the entries via the flags \verb|-package|
99 and others (\ghcfile{compiler/main/Packages.lhs}). The internal
100 package state can be seen at \verb|-v4| or higher.
101 \item It uses this package database to find the location of module
102 interfaces when it attempts to load the module info of an external
103 module (\ghcfile{compiler/iface/LoadIface.hs}).
104 \end{itemize}
106 GHC itself performs a type checking phase, which generates an interface
107 file representing the module (so that later invocations of GHC can load the type
108 of a module), and then after compilation projects object files and linked archives
109 for programs to use.
111 \subsection{hs-boot}
113 \verb|hs-boot| is a special mechanism used to support recursive linking
114 of modules within a package, today. Suppose I have a recursive module
115 dependency between modules and A and B. I break one of\ldots
117 (ToDo: describe how hs-boot mechanism works)
119 \subsection{Cabal}
121 Cabal is the build system for GHC, we can think of it as parsing a Cabal
122 file describing a package, and then making (possibly multiple)
123 invocations to GHC to perform the appropriate compilation. What
124 information does Cabal pass onto GHC\@? One can get an idea for this by
125 looking at a prototypical command line that Cabal invokes GHC with:
127 \begin{verbatim}
128 ghc --make
129 -package-name myapp-0.1
130 -hide-all-packages
131 -package-id containers-0.9-ABCD
132 Module1 Module2
133 \end{verbatim}
135 There are a few things going on here. First, Cabal has to tell GHC
136 what the name of the package it's compiling (otherwise, GHC can't appropriately
137 generate symbols that other code referring to this package might generate).
138 There are also a number of commands which configure its in-memory view of
139 the package database (GHC's view of the package database may not directly
140 correspond to what is on disk). There's also an optimization here: in principle,
141 GHC can compile each module one-by-one, but instead we use the \verb|--make| flag
142 because this allows GHC to reuse some data structures, resulting in a nontrivial
143 speedup.
145 (ToDo: describe cabal-install/sandbox)
147 \section{Goals}
149 There are actually a number of different goals we have for modifying the
150 packaging system, some of which are subsets of the Backpack system.
152 \begin{itemize}
153 \item As a prerequisite, support multiple instances of containers-2.9 \emph{in the
154 package database}. These instances may be compiled against
155 different dependencies, have the same dependencies but different
156 source files (as when a package is being developed), or be
157 compiled with different options. It is less important to allow
158 these instances to be linkable together.\footnote{Actually, I think
159 this is completely orthogonal to Backpack, since we are going to treat
160 dependencies as part of the package ID, so they would be considered
161 separate entries in the package database anyway.}
163 \item Support typechecking a library against a module interface
164 as opposed to an actual implementation. This would be useful
165 for moving towards a model where Cabal package dependency versions
166 are replaced with proper signature packages. % See Section~\ref{sec:signature-packages} for more details.
168 \item Support compiling the aforementioned libraries with actual implementations.
169 It is \emph{not} a goal to be able to compile a library while only
170 partially providing dependencies, and it is low priority to support
171 mutually recursive implementations of these implementations.
173 \iffalse%
174 \item Support insertion of backwards compatibility shims for packages
175 that are using old versions of packages, so that you can continue
176 using them without having to patch them manually. This is a
177 stylized use-case of Backpack features. See Section~LINKME for
178 more details.
180 \item Some easy-to-implement subset of the functionality provided by
181 packages with holes (Backpack proper). This includes support
182 of linking an executable containing multiple packages with the
183 same package name but different PackageIds.
184 \fi
185 \end{itemize}
187 A lower priority goal is to actually allow multiple instances of
188 containers-2.9 to be linked together in the same executable
189 program.\footnote{In particular, this requires changes to how linker
190 symbols are assigned. However, this feature is important to
191 implement a number of Backpack features.}
193 A \emph{non-goal} is to allow users to upgrade upstream libraries
194 without recompiling downstream. This is an ABI concern and we're not
195 going to worry about it.
197 \section{Aside: Recent IHG work}\label{sec:ihg}
199 The IHG project has allocated some funds to relax the package instance
200 constraint in the package database, so that multiple instances can be
201 stored, but now the user of GHC must explicitly list package-IDs to be
202 linked against. In the far future, it would be expected that tools like
203 Cabal automatically handle instance selection among a large number of
204 instances, but this is subtle and so this work is only to do some
205 foundational work, allowing a package database to optionally relax the
206 unique package-version requirement, and utilize environment files to
207 select which packages should be used. See Duncan's email for more
208 details on the proposal.
210 To implement this:
212 \begin{enumerate}
214 \item Remove the ``removal step'' when registering a package (with a flag)
216 \item Check \ghcfile{compiler/main/Packages.lhs}:mkPackagesState to look out for shadowing
217 within a database. We believe it already does the right thing, since
218 we already need to handle shadowing between the local and global database.
220 \end{enumerate}
222 Once these changes are implemented, we can program multiple instances by
223 using \verb|-hide-all-packages -package-id ...|, even if there is no
224 high-level tool support.
226 Actually, this concern is orthogonal to the purposes of Backpack, if
227 we redefine PackageId appropriately.
229 \paragraph{The ABI hash} Currently, InstalledPackageId
230 is constructed of a package, version and ABI hash
231 (generateRegistrationInfo in
232 \ghcfile{libraries/Cabal/Cabal/Distribution/Simple/Register.hs}). The
233 use of an ABI hash is a bit of GHC-specific hack introduced in 2009,
234 intended to make sure these installed package IDs are unique. While
235 this is quite clever, using the ABI is actually a bit inflexible, as one
236 might reasonably want to have multiple copies of a package with the same
237 ABI but different source code changes.\footnote{In practice, our ABIs
238 are so unstable that it doesn't really matter.}
240 In Figure~\ref{fig:proposed-pkgid}, there is an alternate logical
241 representation of InstalledPackageId which attempts to extricate the
242 notion of ABI compatibility from what actually might uniquely identify a
243 package beyond its PackageId. We imagine these components to be:
245 \begin{itemize}
246 \item A hash of the source code (so one can register different
247 in-development versions without having to bump the version
248 number);
249 \item Compilation way (profiling? dynamic?)
250 \item Compilation flags (such as compilation way, optimization,
251 profiling settings)\footnote{This is a little undefined on a package bases, because in principle the flags could be varied on a per-file basis. More likely this will be approximated against the relevant fields in the Cabal file as well as arguments passed to Cabal.};
252 \end{itemize}
254 A historical note: in the 2012 GSoC project to allow multiple instances
255 of a package to be installed at the same time, use of \emph{random
256 numbers} was used to workaround the inability to get an ABI early
257 enough. We are not using this plan.
259 \section{Infrastructural improvements}
261 There are some infrastructural improvements that must be made before
262 Backpack proper can be implemented. These additions are described in
263 red in the architectural diagrams. The current structure of this
264 section is to describe the additions bottom up.
266 \subsection{Concrete physical identity = PackageId$^*$ + Module name}\label{sec:ipi}
268 \begin{figure}
269 \center{\begin{tabular}{r l}
270 PackageId & hash of ``package name, package version, PackageIds of dependent packages'' \\
271 InstalledPackageId & hash of ``PackageId, source code, way, compiler flags'' \\
272 \end{tabular}}
273 \label{fig:proposed-pkgid}\caption{Proposed new structure of package identifiers.}
274 \end{figure}
276 In Backpack, there needs to be some mechanism for assigning
277 \emph{physical module identities} to modules, which are essential for
278 typechecking Backpack packages, since they let us tell if two types are
279 equal or not. In the paper, the physical identity was represented as the
280 package that constructed it as well as some representation of the module
281 source. We can simplify this slightly: in current Cabal packages, we
282 require that modules always be given a package-unique logical name;
283 thus, physical identities can be simply represented as a PackageId plus
284 module name. (See \ghcfile{compiler/basicTypes/Module.lhs:Module})
285 In fact, this coincides with how GHC already internally handles
286 the problem of type equality: it appeals to an \emph{original name}
287 which is, presently, a PackageId and the module name.
289 However, with the current representation of PackageIds, this is
290 insufficient: a package is not just its name, but also the regular
291 tree representing all of its package dependencies. Thus, we have
292 to adjust the representation of a PackageId so that it includes this
293 regular tree, as seen Figure~\ref{fig:proposed-pkgid}. Since this
294 tree in general may be quite long, it needs to be shortened in some way,
295 such as by hashing.
297 And now, the complications\ldots
299 \paragraph{Relaxing package selection restrictions} As mentioned
300 previously, GHC is unable to select multiple packages with the same
301 package name (but different PackageIds). This restriction needs to be
302 lifted. For backwards compatibility reasons, it's probably better to
303 not overload \verb|-package-id| but add a new flag, maybe \verb|-force-package-id|;
304 we also need to disable the old version masking behavior. This is orthogonal
305 to the IHG work, which wants to allow multiple InstalledPackageIds in the
306 \emph{database} (here, we want to allow multiple PackageIds in compiled code).
308 \paragraph{Linker symbols} As we increase the amount of information in
309 PackageId, it's important to be careful about the length of these IDs,
310 as they are used for exported linker symbols (e.g.
311 \verb|base_TextziReadziLex_zdwvalDig_info|). Very long symbol names
312 hurt compile and link time, object file sizes, GHCi startup time,
313 dynamic linking, and make gdb hard to use. As such, the current plan is
314 to do away with full package names and versions, and instead use just a
315 base-62 encoded hash, perhaps with the first four characters of the package
316 name for user-friendliness.
318 Edward: I'm still partial to a short hash of the dependency bits (or
319 even Simon's registry of short strings to dependency trees), and keeping
320 everything else the same.
322 \paragraph{Wired-in names} One annoying thing to remember is that GHC
323 has wired-in names, which refer to packages without any version. Now
324 these wired names also have to accomodate dependency trees. A
325 suggested approach is to have a fixed table from these wired names to
326 package IDs; alternately we can use something like the special \verb|inplace|
327 version number.
329 \paragraph{Free variables (or, what is a non-concrete physical
330 identity?)} Physical identities in their full generality are permitted
331 to have free variables, which represent holes. Handling this is a
332 tricky question, and we defer it to Section~\ref{sec:variables}, when
333 we talk about packages with holes.
335 \subsection{Exposed modules should allow external modules}\label{sec:reexport}
337 In Backpack, the definition of a package consists of a logical context,
338 which maps logical module names to physical module names. These do not
339 necessarily coincide, since some physical modules may have been defined
340 in other packages and mixed into this package. This mapping specifies
341 what modules other packages including this package can access.
342 However, in the current installed package database, we have exposed-modules which
343 specify what modules are accessible, but we assume that the current
344 package is responsible for providing these modules.
346 To implement Backpack, we have to extend this exposed-modules (``Export declarations''
347 on Figure~\ref{fig:pkgdb}). Rather
348 than a list of logical module names, we provide a new list of tuples:
349 the exported logical module name and original physical module name (this
350 is in two parts: the InstalledPackageId and the original module name).
351 For example, an traditional module export is simply (Name, my-pkg-id, Name);
352 a renamed module is (NewName, my-pkg-id, OldName), and an external module
353 is (Name, external-pkg-id, Name).
355 \section{Simplifying Backpack}\label{sec:simplifying-backpack}
357 Backpack as currently defined always requires a \emph{shaping} pass,
358 which calculates the shapes of all modules defined in a package.
359 The shaping pass is critical to the solution of the double-vision problem
360 in recursive module linking, but it also presents a number of unpalatable
361 implementation problems:
363 \begin{itemize}
365 \item A shaping pass means that it's not possible to compile modules
366 in a package one-by-one, even if there is \emph{no mutual
367 recursion} at all in the package. Without mutual recursion, the
368 shaping pass should not be necessary. (Maybe this is not true!)
369 Always enforcing two-passes means Backpack is not a ``pay as you
370 go'' language feature.
372 \item The calculated module shapes are quite intricate: in particular,
373 they contain information about the providences of all data types
374 and identifiers defined by a module. To run a shape analysis, we
375 must preprocess and parse all modules\ldots and then parse them
376 again later, in the second, type-checking pass; furthermore, all
377 of this information must be passed to GHC later.
379 \item The conclusions of the shaping pass are \emph{incompatible} with
380 what types of mutual recursion GHC can actually compile today
381 via hs-boot: primarily, GHC requires that any data type or function
382 defined in an hs-boot file also be \emph{defined} in its corresponding
383 hs file. (In this way, we can assign an original name to it immediately,
384 which simplifies compilation immensely.)
386 \item The mode of program organization Backpack encourages, mutually
387 recursive modules, is less efficient than code without recursive
388 dependencies: thus programmers have good motivation to avoid
389 this design pattern when possible.
391 \end{itemize}
393 The shaping pass is fundamentally necessary for some types of Backpack packages.
394 Here is the example which convinced Simon:
396 \begin{verbatim}
397 package p where
398 A :: [data T; f :: T -> T]
399 B = [export T(MkT), h; import A(f); data T = MkT; h x = f MkT]
400 A = [export T(MkT), f, h; import B; f MkT = MkT]
401 \end{verbatim}
403 The key is to observe that B \emph{may or may not typecheck} depending
404 on the definition of A. Because A reexports B's definition T, B will
405 typecheck; but if A defined T on its own, B would not typecheck. Thus,
406 we \emph{cannot} typecheck B until we have done some analysis of A (the
407 shaping analysis!)
409 All of these restrictions point to the desirability of a subset of
410 Backpack which does not allow mutually recursive modules, and
411 consequently does not require a shaping pass.
413 Here is the programming discipline, called the \textbf{linking restriction}: \emph{Module implementations
414 must be declared before signatures.} Alternately, $+ \oplus - \neq - \oplus +$. Here's an example:
416 \begin{verbatim}
417 package random where
418 System.Random = [ ... ].hs
420 package monte-carlo where
421 System.Random :: ...
422 System.MonteCarlo = [ ... ].hs
423 \end{verbatim}
425 Now, to link these two applications together, only one ordering
426 is permissible:
428 \begin{verbatim}
429 package myapp where
430 include random
431 include monte-carlo
432 \end{verbatim}
434 or even (if myapp wants to provide its own random library):
436 \begin{verbatim}
437 package myapp2 where
438 System.Random = [ ... ].hs
439 include monte-carlo
440 \end{verbatim}
442 That is, all of \verb|monte-carlo|'s holes have been filled in by the time
443 it is included. The alternate ordering is not allowed, as it could have
444 resulted in a recursive
446 Why does this discipline prevent mutually recursive modules? Intuitively,
447 the signature is the mechanism by which we can get a handle on an implementation
448 before it is defined, thus allowing circularity. If there is never an earlier handle
449 to an implementation, circularity is not possible. A simple example of
450 this violation is in the reordered version of \verb|myapp2|:
452 \begin{verbatim}
453 package myapp2 where
454 include monte-carlo
455 System.Random = [ import System.MonteCarlo ].hs
456 \end{verbatim}
458 (Nota bene: this restriction applies to aliasing to, which can induce
459 linking as well).
461 \subsection{Why isn't shaping necessary?}
463 (ToDo: have this formally setup and everything)
465 In the case that we are compiling a definite package (there will be no leftover
466 holes), the argument is quite simple. In vanilla Backpack, when a
467 signature is defined, we have to generate a pile of module variables
468 which may get unified with physical module identities. However, with
469 the linking restriction, we only ever link a signature with a
470 \emph{fully instantiated} module definition. Thus, all module variables
471 can be immediately eliminated, and all we need to do is maintain an
472 updating logical context (i.e., module environment) so that later
473 modules know how to resolve imports.
475 When we are typechecking an indefinite package (there may be leftover
476 holes), things are a little more subtle. When a signature is defined,
477 we \emph{know} that it will never be unified against a proper module
478 name. So we can assign it some fresh original name.
480 Edward: What I find a bit dodgy is that there still are circumstances
481 where we will link a signature with a signature. For example, if I have:
483 \begin{verbatim}
484 package a where
485 A :: [ data A ]
486 B :: [ data A ]
487 M = ...
488 A = B
489 Q = ...
490 \end{verbatim}
492 We haven't violated the linking rule, but if we chose separate original
493 names for A and B, they need to be identified after the alias.
495 Fortunately, Q should typecheck as if the As are the same, but M should
496 typecheck as if they are distinct. There is some ordering here, and our
497 claim (to be formally verified) is that these unifications can be
498 resolved as we process them.
500 \subsection{Installing indefinite packages}\label{sec:indefinite-packages}
502 If an indefinite package contains no code at all, we only need
503 to install the interface file for the signatures. However, if
504 they include code, we must provide all of the
505 ingredients necessary to compile them when the holes are linked against
506 actual implementations. (Figure~\ref{fig:pkgdb})
508 \paragraph{Source tarball or preprocessed source?} An ancillary design choice
509 to be made is what the representation of the source that is saved is. There
510 are a number of possible choices:
512 \begin{itemize}
513 \item The original tarballs downloaded from Hackage,
514 \item Preprocessed source files,
515 \item Some sort of internal, type-checked representation of Haskell code (maybe the output of the desugarer).
516 \end{itemize}
518 Storing the tarballs is the simplest and most straightforward mechanism,
519 but we will have to be very certain that we can recompile the module
520 later in precisely the same we compiled it originally, to ensure the hi
521 files match up (fortunately, it should be simple to perform an optional
522 sanity check before proceeding.) The appeal of saving preprocessed
523 source, or even the IRs, is that this is conceptually this is exactly
524 what an indefinite package is: we have paused the compilation process
525 partway, intending to finish it later. It allows us to reuse the work
526 done preprocessing or typechecking. However, it may be the case that
527 these phases work differently once library dependencies are known; typechecking
528 especially, see Section~\ref{sec:variables}.
530 \subsection{Compilation}
532 When we need to compile an indefinite package (since all of its
533 dependencies have been found), there seem to be two implementation
534 strategies possible. Here is a very simple example to consider for the
535 various cases:
537 \begin{verbatim}
538 package pkg-a where
539 A = ...
540 package pgk-b where -- indefinite package
541 A :: ...
542 B = [ b = ... ]
543 package pkg-c where
544 include pkg-a
545 include pkg-b
546 \end{verbatim}
548 \paragraph{The ``downstream'' proposal} This is Simon's favorite
549 proposal. Because of the linking invariant, if we traverse the Backpack
550 specification, any given module we need to compile will have all of its
551 dependencies compiled (since it could only depend on the dependency if
552 there was a signature, and the signature could not have been linked
553 unless it was implemented previously.) So we just go ahead and compile
554 everything one-by-one, as if the package was a single big package. (Of
555 course, if we encounter a definite package, don't bother recompiling
556 it; just use it directly.) In
557 particular, we maintain the invariant that any code generated will
558 export symbols under the current package's namespace. So the identifier
559 \verb|b| in the example becomes a symbol \verb|pkg-c_pkg-b_B_b| rather
560 than \verb|pkg-b_B_b| (package subqualification is necessary because
561 package C may define its own B module after thinning out the import.)
563 One big problem with this proposal is that it doesn't implement applicative
564 semantics. If there is another package:
566 \begin{verbatim}
567 package pkg-d where
568 include pkg-a
569 include pkg-b
570 \end{verbatim}
572 this will generate its own instance of B, even though it should be the same
573 as C. Simon was willing to entertain the idea that, well, as long as the
574 type-checker is able to figure out they are the same, then it might be OK
575 if we accidentally generate two copies of the code (provided they actually
576 are the same).
578 \paragraph{The ``upstream'' proposal}
579 The problem with the ``downstream'' proposal is that it always recompiles
580 all of the indefinite parts of the package, even if some of them should
581 be shared in some sense. Hypothetically, a fully linked version of an
582 indefinite package should be considered a package in its own right
583 (in particular, it is associated with a physical module identity in Backpack).
584 So all we need to do is store these installed packages somewhere, probably
585 the local package database. If we recompile the same dependency chain,
586 the installed package can be simply reused. These products do not have
587 to be exposed.
589 One problem with this proposal is in this previous example:
591 \begin{verbatim}
592 package myapp2 where
593 System.Random = [ ... ].hs
594 include monte-carlo
595 \end{verbatim}
597 Here, monte-carlo depends on a ``subpart of the myapp2 package'', and it's
598 not entirely clear how to express this.
600 \subsection{Restricted recursive modules ala hs-boot}
602 It should be possible to support GHC-style mutual recursion using the
603 Backpack formalism immediately using hs-boot files. However, to avoid
604 the need for a shaping pass, we must adopt an existing constraint that
605 already applies to hs-boot files: \emph{at the time we define a signature,
606 we must know what the original name for all data types is}. We then
607 compile modules as usual, but compiling against the signature as if it
608 were an hs-boot file.
610 (ToDo: Figure out why this eliminates the shaping pass)
612 Compiling packages in this way gives the tantalizing possibility
613 of true separate compilation: the only thing we don't know is what the actual
614 package name of an indefinite package will be, and what the correct references
615 to have are. This is a very minor change to the assembly, so one could conceive
616 of dynamically rewriting these references at the linking stage.
618 \iffalse%
620 \section{Module variables and original names}\label{sec:variables}
622 In Backpack, the physical identities of modules are in fact trees of
623 variables, constructors and recursive module identities. As we
624 mentioned in Section~\ref{sec:ipi}, for concrete modules where there are
625 no free variables, physical identity maps nicely to an original name.
626 For packages with holes, the original names mechanism needs to be more
627 flexible; we need to first run a shaping pass in order to figure out if
628 a module is actually equal to another.
630 The direct approach is to replace original names with Backpack's
631 physical identities and use unification to do comparisons. However,
632 this would require extremely invasive changes to all aspects of GHC, as
633 the original names assumption is baked quite deeply into the compiler:
634 we'd like to avoid upsetting this applecart if possible.
636 One key insight is that when we can only compile when all of the holes
637 are eliminated; thus all parts of the compilation pipeline beyond
638 typechecking can, in fact, assume that they will only be dealing with
639 concrete module identities. One interpretation of this fact is that
640 we might actually be able to implement unification properly.
642 This gives us more latitude for how to
643 deal with the identities during typechecking: we may, for example,
644 adopt a conservative analysis (or an optional, dangerously liberal one),
645 secure in the knowledge that when we compile the final concrete module,
646 we will typecheck again with everything put together properly.
648 The next subsection describes one proposal for dealing with this issue,
649 and in what ways it is unsound. We imagine there are other strategies
651 \subsection{Holes have physical module identities associated with definition site}
653 When compiling a module with a hole, we create a set of interface files
654 from the signature. In the current architecture of GHC, we implicitly
655 assume that all interface files are associated with an \emph{original
656 name} (see below); thus, this means that it is very natural to assign an
657 original name to every hole. If we have a signature package:
659 \begin{verbatim}
660 package foo-1x-sig where
661 Foo :: ...
662 \end{verbatim}
664 Then the physical identity associated with Foo would be
665 \verb|foo-1x-sig_Foo| (rather than a fresh module variable $\alpha$).
667 This has implications for signature linking. Take this example from
668 Section 2.3 of the Backpack paper:
670 \begin{verbatim}
671 package yourlib where
672 Prelude :: [data List a = ...]
673 Yours :: [import Prelude; ...]
675 package mylib where
676 Prelude :: [data Bool = ...]
677 include yourlib
678 Mine = [import Prelude; import Yours; ...]
679 \end{verbatim}
681 The first order of business is that we have to merge the interfaces of
682 the two Prelude holes, which are associated with different generated
683 physical identities (\verb|yourlib_Prelude| and \verb|mylib_Prelude|),
684 and furthermore, \emph{assign an original name to it}. If we pick,
685 for example, \verb|mylib_Prelude|, none of the types defined in \verb|yourlib|
686 will unify in \verb|Mine|; thus, we would need to rename the original names
687 in \verb|yourlib|.\footnote{Edward: I don't think this is really the
688 right way to talk about this: Backpack requires a shaping pass and we should
689 be talking about it}
691 % ezyang: So, I thought I knew what I was talking about here, but actually
692 % the text here needs to be put in the context of shaping, so I need to
693 % think about this some more
695 \paragraph{Instantiation and reuse}
697 If we only ever linked any written signature with a signle implementation, this would actually
698 be great: when it is time to properly compile a module with holes, we
699 type check against the true original name of the new module (and just
700 do a consistency check between the signature and the implementation, modulo
701 differences in the physical identity)---a case of typechecking proceeding
702 differently between the initial typecheck and final compilation.
704 However, conflating a variable with an original name is very bad business.
705 Consider the graph library
707 However, Backpack supports the linking of two signatures together, which
708 now presents a formidable difficulty: the original names of these two
709 signatures now must be identified. Here is a real-world example where
710 this shows up:
712 \begin{verbatim}
713 package foo-1x-sig where
714 Foo :: ...
716 package foo-2x-sig where
717 include foo-1x-sig
718 Foo :: ...
720 package foo-2.1 where
721 include foo-2x-sig
722 Foo = ...
724 package A where
725 include foo-1x-sig
726 A = ...
728 package B where
729 include foo-2x-sig
730 B = ...
732 package C where
733 include A
734 include B
735 C = ...
737 package D where
738 include C
739 include foo-2.1
740 \end{verbatim}
742 Here, we have two signatures for the \verb|foo| library, one of which
743 is a subset of another (but they are ostensibly backwards compatible).
744 Package A only uses the 1.x interface, but package B uses the 2.x interface.
745 Package C uses both of these packages (and implicitly links the two signatures
746 together), and package D finally links in an actual implementation of the
747 library.
751 However, directly modifying original names in this fashion
752 is likely to be too invasive of a change, since the current compiler has
753 the assumption that original names can always be checked for equality
754 is deeply wired in.
756 Aliasing of signatures means that it is no longer the case that
757 original name means type equality. We were not able to convince
758 Simon of any satisfactory resolution. Strawman proposal is to
759 extending original names to also be variables probably won't work
760 because it is so deeply wired, but it's difficult to construct hi
761 files so that everything works out (quadratic).
764 There are some problems with respect to what occurs when two
765 distinct signatures are linked together (aliasing), we talk these problems in
766 Section~\ref{sec:open-questions}.
768 \fi
770 \subsection{Original names} Original names are an important design pattern
771 in GHC\@.
772 Sometimes, a name can be exposed in an hi file even if its module
773 wasn't exposed. Here is an example (compiled in package R):
775 \begin{verbatim}
776 module X where
777 import Internal (f)
778 g = f
780 module Internal where
781 import Internal.Total (f)
782 \end{verbatim}
784 Then in X.hi:
786 \begin{verbatim}
787 g = <R.id, Internal.Total, f> (this is the original name)
788 \end{verbatim}
790 (The reason we refer to the package as R.id is because it's the
791 full package ID, and not just R).
793 How might internal names work with Backpack?
795 \begin{verbatim}
796 package P where
797 M = ...
798 N = ...
799 package Q (M, R, T)
800 include P (N -> R)
801 T = ...
802 \end{verbatim}
804 And now if we look at Q\@:
806 \begin{verbatim}
807 exposed-modules:
808 M -> <P.id, M>
809 R -> <P.id, N>
810 T -> <Q.id, T>
811 \end{verbatim}
813 When we compile Q, and the interface file gets generated, we have
814 to generate identifiers for each of the exposed modules. These should
815 be calculated to directly refer to the ``original name'' of each them;
816 so for example M and R point directly to package P, but they also
817 include the original name they had in the original definition.
819 \section{Shaping}\label{sec:shaping}
821 Here is a hyper-simplified picture of how Backpack in its full glory might look.
823 Recall in Figure~\ref{fig:pkgdb}, we have Cabal which is responsible for
824 taking Cabal files and invoking GHC to build an installed package. The
825 command line looks something like:
827 \begin{verbatim}
828 ghc --make
829 -package-name a-0.1
830 -hide-all-packages
831 -package-id containers-0.9-ABCD
832 Module1 Module2
833 \end{verbatim}
835 We notice a few things, first that GHC accepts some flags which
836 configure its view of the package database (really it's just loaded
837 in-memory), so that when we perform an import, GHC knows how to find the
838 module (\verb|findModule|).
840 The claim is that we can implement Backpack by adding an extra pass
841 to Cabal which we will call the shaper, which performs the shaping pass
842 as described by the Backpack paper. In full generality, the \emph{module shape}
843 is what needs to be passed from Cabal to GHC, but in our simplified example
844 we're just going to deal with what SPJ calls the ``module environment'' (which
845 roughly corresponds to a logical shape context in the paper.) This module
846 environment maps logical module names to the \emph{original names}, i.e.
847 overloading the behavior of how we find modules on imports.
849 We'll go by way of an example:
851 \begin{verbatim}
852 package a1 where
853 A :: [ data T; f : T -> T ]
854 package a2 where
855 A :: [ data T; g : T -> T ]
856 package b
857 include a1
858 include a2
859 A = [ ... ] -- call physical identity b-2.7:A
860 \end{verbatim}
862 At the end of the day, we'd like package b to be compiled with a map from
863 logical name A to physical identity \verb|b-2.7:A|. But what about packages
864 a1 and a2? Since these have holes, we don't really know what their physical
865 identities are; but that doesn't mean we don't talk about them; instead,
866 we have a module variable for the holes. These holes have signatures, which can be thought
867 of as the interface files. So we have:\footnote{The type signatures are not quite right because they need provenance information, but never-mind that for now.} \\
869 \noindent
870 \verb|a1:| $\forall \alpha.\ \mathrm{A} \mapsto \alpha \mbox{\texttt{\ ::\ [ data T\@; f :\ T -> T ]}}$ \\ % chktex 26
871 \verb|a2:| $\forall \alpha.\ \mathrm{A} \mapsto \alpha \mbox{\texttt{\ ::\ [ data T\@; g :\ T -> T ]}}$ % chktex 26
873 Now lets consider what happens as we do the shaping analysis on package b.
874 We process each line of the module one-by-one, and initially, we start
875 with an empty module environment. When we process the first include, we
876 now know about a mapping $A \mapsto \beta$, but we don't know the physical
877 identity of the module so we gin up a module variable for now. The second
878 include, we have to instantiate the universal quantifier with $\beta$ because
879 signature linking demands the modules be the same. Finally, when we hit the
880 module definition, we see that $A \mapsto \mathtt{b-2.7:A}$.
882 Now, importantly, remember that there were hi files associated with each
883 of the signatures, and these hi files may contained original names
884 corresponding to our module variables. As we processed the mapping, we
885 also built up a renaming of physical variable names to their true
886 symbols, so we need to rename the respective $\alpha$s in each interface
887 file to point ot \texttt{b-2.7:A}, as we ended up seeing.
889 (SPJ and I went through a few other examples, including the case of recursive
890 module linking:
892 \begin{verbatim}
893 package a1 where
894 A :: ...
895 B = [import A]
896 package b where
897 include a1
898 A = [import B]
899 \end{verbatim}
901 and verified it worked. Note that recursively linked modules do not get
902 inlining! We ust need to know what the original name of what we're linking
903 against is, so we can stick in the right symbols.)
905 Now, the most paper-like thing one an do in this circumstance is to pass
906 over the renaming, but we'd like to avoid having command line arguments
907 that are too long. So SPJ's proposal is that, instead, we have a \emph{module
908 interface file} (or \emph{module manifest}) which collects the various information
909 that we might be interested in about module providence and renaming. Now,
910 in order to program GHC to have the right logical namespace, we only need
911 to list the appropriate package manifests but with the appropriate thinning
912 and renaming. This could be shorter, and maps more nicely onto the existing
913 package command line flags. We'd have to make sure this works out, but it's
914 exactly analogous to how we do interface files.
916 \section{Open questions}\label{sec:open-questions}
918 Here are open problems about the implementation which still require
919 hashing out.
921 \begin{itemize}
922 \item Aliasing of signatures means that it is no longer the case that
923 original name means type equality. We were not able to convince
924 Simon of any satisfactory resolution. Strawman proposal is to
925 extending original names to also be variables probably won't work
926 because it is so deeply wired, but it's difficult to construct hi
927 files so that everything works out (quadratic).
929 \item Relationship between linker names and InstalledPackageId? The reason
930 the obvious thing to do is use all of InstalledPackageId for linker
931 name, but this breaks recompilation. So some of these things
932 should go in the linker name, and not others (yes package, yes
933 version, yes some compile flags (definitely ways), eventually
934 dependency package IDs, what about cabal build flags). This is
935 approximately an ABI hash, but it's computable before compilation.
936 This worries Simon because now there are two names, but maybe
937 the DB can solve that problem---unfortunately, GHC doesn't ever
938 register during compilation; only later.
940 Simon also thought we should use shorter names for linker
941 names and InstallPkgIds. This appears to be orthogonal.
943 \item In this example:
945 \begin{verbatim}
946 package A where
947 A = ...
948 package A2 where
949 A2 = ...
950 package B (B)
951 A :: ...
952 B = ...
953 package C where
954 include A
955 include B
956 package D where
957 include A
958 include B
959 package E where
960 include C (B as CB)
961 include D (B as DB)
962 \end{verbatim}
964 Do the seperate instantiations of B exist as seperate artifacts
965 in the database, or do they get constructed on the fly by
966 definite packages which include them? The former is conceptually
967 nicer (identity of types work, closer to Backpack semantics), but
968 the latter is easier for linker names. (Simon's first inclination
969 is to construct it on the fly.)
971 There was another example, showing that we need to solve this
972 problem even for indefinite combinations of indefinite modules.
973 You can get to it by modifying the earlier example so that C and
974 D still have holes, which E does not fill.
976 \item We have to store the preprocessed sources for indefinite packages.
977 This is hard when we're constructing packages on the fly.
979 \item What is the impact on dependency solving in Cabal? Old questions
980 of what to prefer when multiple package-versions are available
981 (Cabal originally only needed to solve this between different
982 versions of the same package, preferring the oldest version), but
983 with signatures, there are more choices. Should there be a
984 complex solver that does all signature solving, or a preprocessing
985 step that puts things back into the original Cabal version.
986 Authors may want to suggest policy for what packages should actually
987 link against signatures (so a crypto library doesn't accidentally
988 link against a null cipher package).
989 \end{itemize}
991 \section{Immediate tasks}
993 At this point in time, it seems we can identify the following set
994 of non-controversial tasks which can be started immediately.
996 \begin{itemize}
997 \item Relax the package database constraint to allow multiple
998 instances of package-version. (Section~\ref{sec:ihg})
999 \item Propagate the use of \verb|InstalledPackageId| instead of
1000 package IDs for typechecking. (Section~\ref{sec:ipi})
1001 \item Implement export declarations in package format, so
1002 packages can reexport modules from other packages. (Section~\ref{sec:reexport})
1003 \end{itemize}
1005 The aliasing problem is probably the most important open problem
1006 blocking the rest of the system.
1008 \bibliographystyle{plain}
1009 \bibliography{backpack-impl}
1011 \end{document}