[ghc.git] / docs / backpack / backpack-impl.tex
1 \documentclass{article}
3 \usepackage{graphicx} %[pdftex] OR [dvips]
4 \usepackage{fullpage}
5 \usepackage{float}
6 \usepackage{titling}
7 \setlength{\droptitle}{-6em}
9 \newcommand{\ghcfile}[1]{\textsl{#1}}
11 \title{Implementing Backpack}
13 \begin{document}
15 \maketitle
17 The purpose of this document is to describe an implementation path
18 for Backpack~\cite{Kilpatrick:2014:BRH:2535838.2535884} in GHC\@.
20 We start off by outlining the current architecture of GHC, ghc-pkg and Cabal,
21 which constitute the existing packaging system. We then state what our subgoals
22 are, since there are many similar sounding but different problems to solve. Next,
23 we describe the ``probably correct'' implementation plan, and finish off with
24 some open design questions. This is intended to be an evolving design document,
25 so please contribute!
27 \section{Current packaging architecture}
29 The overall architecture is described in Figure~\ref{fig:arch}.
31 \begin{figure}[H]
32 \center{\scalebox{0.8}{\includegraphics{arch.png}}}
33 \label{fig:arch}\caption{Architecture of GHC, ghc-pkg and Cabal. Green bits indicate additions from upcoming IHG work, red bits indicate additions from Backpack. Orange indicates a Haskell library.}
34 \end{figure}
36 Here, arrows indicate dependencies from one component to another. Color
37 coding is as follows: orange components are libaries, green components
38 are to be added with the IHG work, red components are to be added with
39 Backpack. (Thus, black and orange can be considered the current)
41 \subsection{Installed package database}
43 Starting from the bottom, we have the \emph{installed package database}
44 (actually a collection of such databases), which stores information
45 about what packages have been installed are thus available to be
46 compiled against. There is both a global database (for the system
47 administrator) and a local database (for end users), which can be
48 updated independently. One way to think about the package database
49 is as a \emph{cache of object code}. In principle, one could compile
50 any piece of code by repeatedly recompiling all of its dependencies;
51 the installed package database describes when this can be bypassed.
53 \begin{figure}[H]
54 \center{\scalebox{0.8}{\includegraphics{pkgdb.png}}}
55 \label{fig:pkgdb}\caption{Anatomy of a package database.}
56 \end{figure}
58 In Figure~\ref{fig:pkgdb}, we show the structure of a package database.
59 The installed package are created from a Cabal file through the process
60 of dependency resolution and compilation. In database terms, the primary key
61 of a package database is the InstalledPackageId
62 (Figure~\ref{fig:current-pkgid}). This ID uniquely identifies an
63 instance of an installed package. The PackageId omits the ABI hash and
64 is used to qualify linker exported symbols: the current value of this
65 parameter is communicated to GHC using the \verb|-package-id| flag.
67 In principle, packages with different PackageIds should be linkable
68 together in the same compiled program, whereas packages with the same
69 PackageId are not (even if they have different InstalledPackageIds). In
70 practice, GHC is currently only able to select one version of a package,
71 as it clears out all old versions of the package in
72 \ghcfile{compiler/main/Package.lhs}:applyPackageFlag.
74 \begin{figure}
75 \center{\begin{tabular}{r l}
76 PackageId & package name, package version \\
77 InstalledPackageId & PackageId, ABI hash \\
78 \end{tabular}}
79 \label{fig:current-pkgid}\caption{Current structure of package identifiers.}
80 \end{figure}
82 The database entry itself contains the information from the installed package ID,
83 as well as information such as what dependencies it was linked against, where
84 its compiled code and interface files live, its compilation flags, what modules
85 it exposes, etc. Much of this information is only relevant to Cabal; GHC
86 uses a subset of the information in the package database.
88 \subsection{GHC}
90 The two programs which access the package database directly are GHC
91 proper (for compilation) and ghc-pkg (which is a general purpose
92 command line tool for manipulating the database.) GHC relies on
93 the package database in the following ways:
95 \begin{itemize}
96 \item It imports the local and global package databases into
97 its runtime database, and applies modifications to the exposed
98 and trusted status of the entries via the flags \verb|-package|
99 and others (\ghcfile{compiler/main/Packages.lhs}). The internal
100 package state can be seen at \verb|-v4| or higher.
101 \item It uses this package database to find the location of module
102 interfaces when it attempts to load the module info of an external
103 module (\ghcfile{compiler/iface/LoadIface.hs}).
104 \end{itemize}
106 GHC itself performs a type checking phase, which generates an interface
107 file representing the module (so that later invocations of GHC can load the type
108 of a module), and then after compilation projects object files and linked archives
109 for programs to use.
111 \paragraph{Original names} Original names are an important design pattern
112 in GHC\@.
113 Sometimes, a name can be exposed in an hi file even if its module
114 wasn't exposed. Here is an example (compiled in package R):
116 \begin{verbatim}
117 module X where
118 import Internal (f)
119 g = f
121 module Internal where
122 import Internal.Total (f)
123 \end{verbatim}
125 Then in X.hi:
127 \begin{verbatim}
128 g = <R.id, Internal.Total, f> (this is the original name)
129 \end{verbatim}
131 (The reason we refer to the package as R.id is because it's the
132 full package ID, and not just R).
134 \subsection{hs-boot}
136 \verb|hs-boot| is a special mechanism used to support recursive linking
137 of modules within a package, today. Suppose I have a recursive module
138 dependency between modules and A and B. I break one of\ldots
140 (ToDo: describe how hs-boot mechanism works)
142 \subsection{Cabal}
144 Cabal is the build system for GHC, we can think of it as parsing a Cabal
145 file describing a package, and then making (possibly multiple)
146 invocations to GHC to perform the appropriate compilation. What
147 information does Cabal pass onto GHC\@? One can get an idea for this by
148 looking at a prototypical command line that Cabal invokes GHC with:
150 \begin{verbatim}
151 ghc --make
152 -package-name myapp-0.1
153 -hide-all-packages
154 -package-id containers-0.9-ABCD
155 Module1 Module2
156 \end{verbatim}
158 There are a few things going on here. First, Cabal has to tell GHC
159 what the name of the package it's compiling (otherwise, GHC can't appropriately
160 generate symbols that other code referring to this package might generate).
161 There are also a number of commands which configure its in-memory view of
162 the package database (GHC's view of the package database may not directly
163 correspond to what is on disk). There's also an optimization here: in principle,
164 GHC can compile each module one-by-one, but instead we use the \verb|--make| flag
165 because this allows GHC to reuse some data structures, resulting in a nontrivial
166 speedup.
168 (ToDo: describe cabal-install/sandbox)
170 \section{Goals}
172 There are actually a number of different goals we have for modifying the
173 packaging system, some of which are subsets of the Backpack system.
175 \begin{itemize}
176 \item As a prerequisite, support multiple instances of containers-2.9 \emph{in the
177 package database}. These instances may be compiled against
178 different dependencies, have the same dependencies but different
179 source files (as when a package is being developed), or be
180 compiled with different options. It is less important to allow
181 these instances to be linkable together.\footnote{Actually, I think
182 this is completely orthogonal to Backpack, since we are going to treat
183 dependencies as part of the package ID, so they would be considered
184 separate entries in the package database anyway.}
186 \item Support typechecking a library against a module interface
187 as opposed to an actual implementation. This includes not
188 only support proper, but also toolchain support for generating
189 and keeping interface files up-to-date, and taking advantage
190 of this discipline to augment Cabal package dependency resolution
191 with proper signature packages. % See Section~\ref{sec:signature-packages} for more details.
193 \item Support compiling the aforementioned libraries with actual implementations.
194 It is \emph{not} a goal to be able to compile a library while only
195 partially providing dependencies, and it is low priority to support
196 mutually recursive implementations of these implementations.
197 \end{itemize}
199 A lower priority goal is to actually allow multiple instances of
200 containers-2.9 to be linked together in the same executable
201 program.\footnote{In particular, this requires changes to how linker
202 symbols are assigned. However, this feature is important to
203 implement a number of Backpack features.}
205 A \emph{non-goal} is to allow users to upgrade upstream libraries
206 without recompiling downstream. This is an ABI concern and we're not
207 going to worry about it.
209 \section{Aside: Recent IHG work}\label{sec:ihg}
211 The IHG project has allocated some funds to relax the package instance
212 constraint in the package database, so that multiple instances can be
213 stored, but now the user of GHC must explicitly list package-IDs to be
214 linked against. In the far future, it would be expected that tools like
215 Cabal automatically handle instance selection among a large number of
216 instances, but this is subtle and so this work is only to do some
217 foundational work, allowing a package database to optionally relax the
218 unique package-version requirement, and utilize environment files to
219 select which packages should be used. See Duncan's email for more
220 details on the proposal.
222 To implement this:
224 \begin{enumerate}
226 \item Remove the ``removal step'' when registering a package (with a flag)
228 \item Check \ghcfile{compiler/main/Packages.lhs}:mkPackagesState to look out for shadowing
229 within a database. We believe it already does the right thing, since
230 we already need to handle shadowing between the local and global database.
232 \end{enumerate}
234 Once these changes are implemented, we can program multiple instances by
235 using \verb|-hide-all-packages -package-id ...|, even if there is no
236 high-level tool support.
238 Actually, this concern is orthogonal to the purposes of Backpack, if
239 we redefine PackageId appropriately.
241 \paragraph{The ABI hash} Currently, InstalledPackageId
242 is constructed of a package, version and ABI hash
243 (generateRegistrationInfo in
244 \ghcfile{libraries/Cabal/Cabal/Distribution/Simple/Register.hs}). The
245 use of an ABI hash is a bit of GHC-specific hack introduced in 2009,
246 intended to make sure these installed package IDs are unique. While
247 this is quite clever, using the ABI is actually a bit inflexible, as one
248 might reasonably want to have multiple copies of a package with the same
249 ABI but different source code changes.\footnote{In practice, our ABIs
250 are so unstable that it doesn't really matter.}
252 In Figure~\ref{fig:proposed-pkgid}, there is an alternate logical
253 representation of InstalledPackageId which attempts to extricate the
254 notion of ABI compatibility from what actually might uniquely identify a
255 package beyond its PackageId. We imagine these components to be:
257 \begin{itemize}
258 \item A hash of the source code (so one can register different
259 in-development versions without having to bump the version
260 number);
261 \item Compilation way (profiling? dynamic?)
262 \item Compilation flags (such as compilation way, optimization,
263 profiling settings)\footnote{This is a little undefined on a package bases, because in principle the flags could be varied on a per-file basis. More likely this will be approximated against the relevant fields in the Cabal file as well as arguments passed to Cabal.};
264 \end{itemize}
266 A historical note: in the 2012 GSoC project to allow multiple instances
267 of a package to be installed at the same time, use of \emph{random
268 numbers} was used to workaround the inability to get an ABI early
269 enough. We are not using this plan.
271 \section{Infrastructural improvements}
273 There are some infrastructural improvements that must be made before
274 Backpack proper can be implemented. These additions are described in
275 red in the architectural diagrams. The current structure of this
276 section is to describe the additions bottom up.
278 \subsection{Concrete physical identity = PackageId + Module name + Module dependencies}\label{sec:ipi}
280 \begin{figure}
281 \center{\begin{tabular}{r l}
282 PackageId & hash of ``package name, package version, dependency resolution, \emph{module} environment'' \\
283 InstalledPackageId & hash of ``PackageId, source code, way, compiler flags'' \\
284 \end{tabular}}
285 \label{fig:proposed-pkgid}\caption{Proposed new structure of package identifiers.}
286 \end{figure}
288 In Backpack, there needs to be some mechanism for assigning
289 \emph{physical module identities} to modules, which are essential for
290 typechecking Backpack packages, since they let us tell if two types are
291 equal or not.
293 In GHC, our method of testing whether or not two types are equal is by
294 comparing their original names, which is a tuple of a PackageId and the
295 module name (summarized in Figure~\ref{fig:current-pkgid}). If it looks
296 like a duck and quacks like a duck, it is a duck: we might reasonable
297 imagine that \emph{concrete physical identity = PackageId and module
298 name}. But is this sufficient to represent Backpack's physical module
299 identities?
301 If we look at how physical module identities are allocated in Paper Backpack,
302 we can see that each module corresponds to an identity constructor (in
303 Backpack this is the package name, some representation of the module
304 source, and a disambiguating integer if module source is duplicated)
305 \emph{as well as} physical module identities which were applied to
306 this constructor. The current PackageId + Module name scheme adequately encodes
307 the constructor (in particular, a single package can't have multiple modules
308 with the same name), but it does not encode the \emph{module} dependencies.
309 We have to add these to our original names. There are three ways we can do
310 this:
312 \begin{enumerate}
313 \item Augment original names to have a third component. This is less than ideal because it requires pervasive changes to all parts of the compiler.
314 \item Augment the module name to record module dependency information. This is a bit of an odd place to put the information, because our module names correspond quite concretely to the actual source file a user wrote some source in.
315 \item Augment the PackageId to record module dependency information. For now, this is the regime we will use; however, see Section~\ref{sec:flatten} for why the issue is more subtle.
316 \end{enumerate}
318 (XXX Figure~\ref{fig:proposed-pkgid} is out of date now.)
320 And now, the complications\ldots
322 \paragraph{Relaxing package selection restrictions} As mentioned
323 previously, GHC is unable to select multiple packages with the same
324 package name (but different PackageIds). This restriction needs to be
325 lifted. For backwards compatibility reasons, it's probably better to
326 not overload \verb|-package-id| but add a new flag, maybe \verb|-force-package-id|;
327 we also need to disable the old version masking behavior. This is orthogonal
328 to the IHG work, which wants to allow multiple \emph{InstalledPackageIds} in the
329 \emph{database} (here, we want to allow multiple \emph{PackageIds} in \emph{compiled code}).
331 \paragraph{Linker symbols} As we increase the amount of information in
332 PackageId, it's important to be careful about the length of these IDs,
333 as they are used for exported linker symbols (e.g.
334 \verb|base_TextziReadziLex_zdwvalDig_info|). Very long symbol names
335 hurt compile and link time, object file sizes, GHCi startup time,
336 dynamic linking, and make gdb hard to use. As such, the current plan is
337 to do away with full package names and versions, and instead use just a
338 base-62 encoded hash, perhaps with the first four characters of the package
339 name for user-friendliness.
341 Edward: I'm still partial to a short hash of the dependency bits (or
342 even Simon's registry of short strings to dependency trees), and keeping
343 everything else the same.
345 \paragraph{Wired-in names} One annoying thing to remember is that GHC
346 has wired-in names, which refer to packages without any version. Now
347 these wired names also have to accomodate dependency trees. A
348 suggested approach is to have a fixed table from these wired names to
349 package IDs; alternately we can use something like the special \verb|inplace|
350 version number.
352 \paragraph{Version numbers} The interaction of Backpack's package polymorphism
353 (the ability to mixin different implementations to holes) and Cabal's dependency
354 resolution mechanism (which permits a name libfoo to be resolved to a specific
355 version libfoo-0.1) can be subtle: there are in fact \emph{two} levels of
356 indirection going on here.
358 The simplest way to think about the distinction is as follows. When I write
359 a Backpack package which does \verb|include foobar|, I have not actually
360 written a Paper Backpack package. Instead, I have written a \emph{pre-package},
361 which Cabal's dependency solver then takes and rewrites all package references
362 into versioned references \verb|include foobar-0.1|, which now corresponds to
363 a Paper Backpack package. For example, this is a pre-package:
365 \begin{verbatim}
366 package foo where
367 include bar
368 \end{verbatim}
370 and this is a Paper Backpack package:
372 \begin{verbatim}
373 package foo-0.3(bar-0.1) where
374 include bar-0.1(baz-0.2)
375 \end{verbatim}
377 Notably, we must \emph{rename} the package to include information about
378 how we resolved all of the inner package references, and if these inner
379 package references had dependencies, those must be included too! In
380 effect, the \emph{dependency resolution} must be encoded into the package ID,
381 along with the existing Backpack \emph{physical identity regular tree}.
382 Phew!
384 \paragraph{Free variables (or, what is a non-concrete physical
385 identity?)} Physical identities in their full generality are permitted
386 to have free variables, which represent holes. Handling this is a
387 tricky question, and we defer it to Section~\ref{sec:typechecking-indefinite}, when
388 we talk about packages with holes.
390 \subsection{Flatten the package database}\label{sec:flatten}
392 In the previous section, I implied that the \emph{module} dependencies
393 could be recorded with the \emph{package} ID\@. Type error? Yes!
395 In the old world order, we think of dependencies as being associated with
396 packages as a whole. However, Paper Backpack specifies dependency tracking
397 at the \emph{module} level, which means that different compilations of
398 an indefinite package may result in modules which should be identified.
399 Here is a Backpack package which demonstrates the issue:
401 \begin{verbatim}
402 package a where
403 P :: ...
404 Q :: [ x :: Int ]
405 A = [ import P; ... ]
406 B = [ import Q; ... ]
407 package p where
408 P = [ ... ]
409 package x where
410 include p
411 Q = [ x = 0 ]
412 include a
413 package y where
414 include p
415 Q = [ x = 1 ]
416 include a
417 \end{verbatim}
419 Here, we have a package \verb|a| which probably should have been defined as
420 two separate packages (since \verb|A| only relies on \verb|P| and \verb|B|
421 only relies on \verb|Q|), but the functionality has been glommed together.
422 Then, the downstream packages \verb|x| and \verb|y| fill in the holes using the
423 same implementation of \verb|P|, but differing implementations of \verb|Q|.
424 (For more explanation about how we would go about compiling this set of
425 packages, please see Section~\ref{sec:compiling-definite}.)
427 Is module \verb|A| from package \verb|x| the same as module \verb|A|
428 from package \verb|y|? In Paper Backpack, the answer is yes.
429 However, presently, the installed package database acts as a cache at the \emph{package}
430 level; code is only shared if it comes from the same package. Can we share
431 packages \verb|x| and \verb|y|? No!
432 \verb|x:B| is \emph{not} the same module as \verb|y:B| (they are using differing
433 versions of \verb|Q|). The upshot is that we are in an awkward position,
434 where package \verb|a| contains some modules which must be distinct, and other
435 modules which must be unified over several installs.
437 The theory, thus, demands that we flatten the package database, so that
438 we no longer insist that all compiled code that is produced when a
439 package is installed live together in a single directory: instead,
440 \emph{the installed package database is a directory of physical module
441 identities to objects/interfaces}. Installed packages are represented
442 as (possibly overlapping) sets over this store of modules.
444 \paragraph{Do we really care about this use case?} Scott gave me one
445 good argument why this is a nice feature to have: it permits users to
446 modularize third-party packages which were not packaged with modularity
447 in mind, but happen to be modular. For example, when libraries ship
448 with test-cases, they currently have to split these test-cases to separate
449 packages, so as to not introduce spurious dependencies with various
450 test frameworks, which the user may not have or care about. If dependencies
451 are done on a per-module basis, as long as the user doesn't import a test
452 module, they never gain the extra dependency. Another situation is when
453 a library also happens to have a helper utility module which doesn't have
454 any of the dependencies of the primary library.
456 One could argue that people already understand it is good practice to
457 separate such cases into separate packages, and there is no pressing
458 need to allow for sloppy habits. The counterargument, however, is that
459 you are often not in control of these third-party libraries, and the
460 more control in the end-user's hands, the better.
462 \paragraph{Operating system linking} When the package database is flattened
463 into a collection of modules, it becomes less clear how to generate library
464 files corresponding to a package. One possibility is to simply take the
465 set of files corresponding to a package and wrap it up into a library.
466 If an end-user links against two libraries which include the same object file,
467 the linker needs to suppress the symbols associated with one of the instances
468 of the object file (it's identical, right?)\footnote{It may not actually be
469 possible to do this in the static linking case: one must refer to the actual object
470 files}.
472 If your modules are truly applicative, this might even work OK\@. However, you will
473 be a sad panda if there is any unsafe, mutable global state in the shared
474 object (since the object files will each get separate data segments for this
475 global state): a generative semantics.\footnote{Even if we did get this right,
476 which would be possible when statically linking these modules together, the
477 interaction could still be surprising: Backpack can remodularize modules, but
478 it can't remodularize values inside a module, so if a module has a dependency
479 but some global state in the module doesn't, the resulting behavior can be
480 surprising. Perhaps the moral of the story really is, ``Don't do side effects
481 in an applicative module system! No really!''} \\
483 \subsection{Alternatives to flattening package DB}
484 Flattening can be seen as one of four different representations of packages
485 at the OS/library level. While it promotes maximal sharing of identical
486 modules, it is perhaps too fine-grained for most purposes.
487 \emph{TODO: Describe the alternatives.}
489 \paragraph{Package slicing} Instead of changing the package database,
490 we automatically
491 slice a single package into multiple packages, so that the sliced
492 packages have dependencies which accurately reflect their constitutent
493 modules. For a well modularized package, the slicing operation should
494 be a no-op. This would also be useful in some other situations (see the
495 \verb|-module-env| discussion in Section~\ref{sec:compiling-definite}).
496 In fact, which slice a module should be placed in can be automatically
497 calculated by taking the package name with the regular tree
498 associated with the module (Section~\ref{sec:ipi}).
500 A minor downside of package slicing is in a dynamically linked environment,
501 we pay a performance cost when we have to jump from one dynamic library
502 to another, and slicing could introduce are large number of separate
503 dynamic libraries which need to be switched between each other.
505 Edward likes this proposal.
507 \paragraph{Changing Backpack to disallow fine-grained dependencies}
509 Another perspective is that we fell into a trap when we decided that
510 dependencies should be calculated per-module. What if, instead, we
511 expanded the dependency of each module to cover \emph{all signatures}
512 in the package? Then the dependency tree would always be the same and
513 the package would be shared only if all holes were precisely the same.
514 Our motivating example, then, would fail to witness sharing.
516 This might be the simplest thing to do, but it is a change in the
517 Backpack semantics, and rules out modularization without splitting a package
518 into multiple packages. Maybe Scott can give other reasons why this
519 would not be so good. SPJ is quite keen on this plan.
521 \subsection{Exposed modules should allow external modules}\label{sec:reexport}
523 In Backpack, the definition of a package consists of a logical context,
524 which maps logical module names to physical module names. These do not
525 necessarily coincide, since some physical modules may have been defined
526 in other packages and mixed into this package. This mapping specifies
527 what modules other packages including this package can access.
528 However, in the current installed package database, we have exposed-modules which
529 specify what modules are accessible, but we assume that the current
530 package is responsible for providing these modules.
532 To implement Backpack, we have to extend this exposed-modules (``Export declarations''
533 on Figure~\ref{fig:pkgdb}). Rather
534 than a list of logical module names, we provide a new list of tuples:
535 the exported logical module name and original physical module name (this
536 is in two parts: the InstalledPackageId and the original module name).
537 For example, an traditional module export is simply (Name, my-pkg-id, Name);
538 a renamed module is (NewName, my-pkg-id, OldName), and an external module
539 is (Name, external-pkg-id, Name).
541 As an example:
543 \begin{verbatim}
544 package P where
545 M = ...
546 N = ...
547 package Q (M, R, T)
548 include P (N -> R)
549 T = ...
550 \end{verbatim}
552 And now if we look at Q\@:
554 \begin{verbatim}
555 exposed-modules:
556 <M, P.id, M>
557 <R, P.id, N>
558 <T, Q.id, T>
559 \end{verbatim}
561 When we compile Q, and the interface file gets generated, we have
562 to generate identifiers for each of the exposed modules. These should
563 be calculated to directly refer to the ``original name'' of each them;
564 so for example M and R point directly to package P, but they also
565 include the original name they had in the original definition.
567 \paragraph{Error messages} When it is possible for multiple physical
568 entities to have the same ``user-friendly'' name, this can result in a
569 very confusing situation if both entities have to be referred to in the
570 same message. This is especially true when renaming is in place:
571 you not only want to print out the name in scope, but probably some indication
572 of what the original name is. In short, when it comes to error messages, tread with care!
574 \section{Shapeless Backpack}\label{sec:simplifying-backpack}
576 Backpack as currently defined always requires a \emph{shaping} pass,
577 which calculates the shapes of all modules defined in a package.
578 The shaping pass is critical to the solution of the double-vision problem
579 in recursive module linking, but it also presents a number of unpalatable
580 implementation problems:
582 \begin{itemize}
584 \item \emph{Shaping is a lot of work.} A module shape specifies the
585 providence of all data types and identifiers defined by a
586 module. To calculate this, we must preprocess and parse all
587 modules, even before we do the type-checking pass. (Fortunately,
588 shaping doesn't require a full parse of a module, only enough
589 to get identifiers, which makes it a slightly more expensive
590 version of \verb|ghc -M|.)
592 \item \emph{Shaping must be done upfront.} In the current Backpack
593 design, all shapes must be computed before any typechecking can
594 occur. While performing the shaping pass upfront is necessary
595 in order to solve the double vision problem (where a module
596 identity may be influenced by later definitions), it means
597 that GHC must first do a shaping pass, and then revisit every module and
598 compile them proper. Nor is it (easily) possible to skip the
599 shaping pass when it is unnecessary, as one might expect to be
600 the case in the absence of mutual recursion. Shaping is not
601 a ``pay as you go'' language feature.
603 \item \emph{GHC can't compile all programs shaping accepts.} Shaping
604 accepts programs that GHC, with its current hs-boot mechanism, cannot
605 compile. In particular, GHC requires that any data type or function
606 in a signature actually be \emph{defined} in the module corresponding
607 to that file (i.e., an original name can be assigned to these entities
608 immediately.) Shaping permits unrestricted exports to implement
609 modules; this shows up in the formalism as $\beta$ module variables.
611 \item \emph{Shaping encourages inefficient program organization.}
612 Shaping is designed to enable mutually recursive modules, but as
613 currently implemented, mutual recursion is less efficient than
614 code without recursive dependencies. Programmers should avoid
615 this code organization, except when it is absolutely necessary.
617 \item \emph{GHC is architecturally ill-suited for directly
618 implementing shaping.} Shaping implies that GHC's internal
619 concept of an ``original name'' be extended to accommodate
620 module variables. This is an extremely invasive change to all
621 aspects of GHC, since the original names assumption is baked
622 quite deeply into the compiler. Plausible implementations of
623 shaping requires all these variables to be skolemized outside
624 of GHC\@.
626 \end{itemize}
628 To be clear, the shaping pass is fundamentally necessary for some
629 Backpack packages. Here is the example which convinced Simon:
631 \begin{verbatim}
632 package p where
633 A :: [data T; f :: T -> T]
634 B = [export T(MkT), h; import A(f); data T = MkT; h x = f MkT]
635 A = [export T(MkT), f, h; import B; f MkT = MkT]
636 \end{verbatim}
638 The key to this example is that B \emph{may or may not typecheck} depending
639 on the definition of A. Because A reexports B's definition T, B will
640 typecheck; but if A defined T on its own, B would not typecheck. Thus,
641 we \emph{cannot} typecheck B until we have done some analysis of A (the
642 shaping analysis!)
644 Thus, it is beneficial (from an optimization point of view) to
645 consider a subset of Backpack for which shaping is not necessary.
646 Here is a programming discipline which does just that, which we will call the \textbf{linking restriction}: \emph{Module implementations must be declared before
647 signatures.} Formally, this restriction modifies the rule for merging
648 polarized module shapes ($\widetilde{\tau}_1^{m_1} \oplus \widetilde{\tau}_2^{m_2}$) so that
649 $\widetilde{\tau}_1^- \oplus \widetilde{\tau}_2^+$ is always undefined.\footnote{This seemed to be the crispest way of defining the restriction, although this means an error happens a bit later than I'd like it to: I'd prefer if we errored while merging logical contexts, but we don't know what is a hole at that point.}
651 Here is an example of the linking restriction. Consider these two packages:
653 \begin{verbatim}
654 package random where
655 System.Random = [ ... ].hs
657 package monte-carlo where
658 System.Random :: ...
659 System.MonteCarlo = [ ... ].hs
660 \end{verbatim}
662 Here, random is a definite package which may have been compiled ahead
663 of time; monte-carlo is an indefinite package with a dependency
664 on any package which provides \verb|System.Random|.
666 Now, to link these two applications together, only one ordering
667 is permissible:
669 \begin{verbatim}
670 package myapp where
671 include random
672 include monte-carlo
673 \end{verbatim}
675 If myapp wants to provide its own random implementation, it can do so:
677 \begin{verbatim}
678 package myapp2 where
679 System.Random = [ ... ].hs
680 include monte-carlo
681 \end{verbatim}
683 In both cases, all of \verb|monte-carlo|'s holes have been filled in by the time
684 it is included. The alternate ordering is not allowed.
686 Why does this discipline prevent mutually recursive modules? Intuitively,
687 a hole is the mechanism by which we can refer to an implementation
688 before it is defined; otherwise, we can only refer to definitions which
689 preceed our definition. If there are never any holes \emph{which get filled},
690 implementation links can only go backwards, ruling out circularity.
692 It's easy to see how mutual recursion can occur if we break this discipline:
694 \begin{verbatim}
695 package myapp2 where
696 include monte-carlo
697 System.Random = [ import System.MonteCarlo ].hs
698 \end{verbatim}
700 \subsection{Typechecking of definite modules without shaping}
702 If we are not carrying out a shaping pass, we need to be able to calculate
703 $\widetilde{\Xi}_{\mathsf{pkg}}$ on the fly. In the case that we are
704 compiling a package---there will be no holes in the final package---we
705 can show that shaping is unnecessary quite easily, since with the
706 linking restriction, everything is definite from the get-go.
708 Observe the following invariant: at any given step of the module
709 bindings, the physical context $\widetilde{\Phi}$ contains no
710 holes. We can thus conclude that there are no module variables in any
711 type shapes. As the only time a previously calculated package shape can
712 change is due to unification, the incrementally computed shape is in
713 fact the true one.
715 As far as the implementation is concerned, we never have to worry
716 about handling module variables; we only need to do extra typechecks
717 against (renamed) interface files.
719 \subsection{Compilation of definite modules}\label{sec:compiling-definite}
721 Of course, we still have to compile the code, and this includes any
722 subpackages which we have mixed in the dependencies to make them fully
723 definite. Let's take the following set of packages as an example:
725 \begin{verbatim}
726 package pkg-a where
727 A = ...
728 B = ... -- this code is ignored
729 package pgk-b where -- indefinite package
730 A :: ...
731 B = [ import A; ... ]
732 package pkg-c where
733 include pkg-a (A)
734 include pkg-b
735 C = [ import B; ... ]
736 \end{verbatim}
738 With the linking invariant, we can simply walk the Backpack package ``tree'',
739 compiling each of its dependencies. Let's walk through it explicitly.\footnote{To simplify matters, we assume that there is only one instance of any
740 PackageId in the database, so we omit the unique-ifying hashes from the
741 ghc-pkg registration commands; we ignore the existence of version numbers
742 and Cabal's dependency solver; finally, we do the compilation in
743 one-shot mode, even though Cabal in practice will use the Make mode.}
745 First, we have to build \verb|pkg-a|. This package has no dependencies
746 of any kind, so compiling is much like compiling ordinary Haskell. If
747 it was already installed in the database, we wouldn't even bother compiling it.
749 \begin{verbatim}
750 ADEPS = # empty!
751 ghc -c A.hs -package-name pkg-a-ADEPS
752 ghc -c B.hs -package-name pkg-a-ADEPS
753 # install and register pkg-a-ADEPS
754 \end{verbatim}
756 Next, we have to build \verb|pkg-b|. This package has a hole \verb|A|,
757 intuitively, it depends on package A. This is done in two steps:
758 first we check if the signature given for the hole matches up with the
759 actual implementation provided, and then we build the module properly.
761 \begin{verbatim}
762 BDEPS = "A -> pkg-a-ADEPS:A"
763 # This doesn't actually create an hi-boot file
764 ghc -c A.hs-boot -package-name pkg-b-BDEPS -module-env BDEPS
765 ghc -c B.hs -package-name pkg-b-BDEPS -module-env BDEPS
766 # install and register pkg-b-BDEPS
767 \end{verbatim}
769 Notably, these commands diverge slightly from the traditional compilation process.
770 Traditionally, we would specify the flags
771 \verb|-hide-all-packages| \verb|-package-id package-a-ADEPS| in order to
772 let GHC know that it should look at this package to find modules,
773 including \verb|A|. However, if we did this here, we would also pull in
774 module B, which is incorrect, as this module was thinned out in the parent
775 package description. Conceptually, this package is being compiled in the
776 context of some module environment \verb|BDEPS| (a logical context, in Backpack lingo)
777 which maps modules to original names and is utilized by the module finder to
778 lookup the import in \verb|B.hs|.
780 Finally, we created all of the necessary subpackages, and we can compile
781 our package proper.
783 \begin{verbatim}
784 CDEPS = # empty!!
785 ghc -c C.hs -package-name pkg-c-CDEPS -hide-all-packages \
786 -package pkg-a-ADEPS -hide-module B \
787 -package pkg-b-BDEPS
788 # install and register package pkg-c-CDEPS
789 \end{verbatim}
791 This mostly resembles traditional compilation, but there are a few
792 interesting things to note. First, GHC needs to know about thinning/renaming
793 in the package description (here, it's transmitted using the \verb|-hide-module|
794 command, intended to apply to the most recent package definition).\footnote{Concrete
795 command line syntax is, of course, up for discussion.} Second, even though C
796 ``depends'' on subpackages, these do not show in its package-name identifier,
797 e.g. CDEPS\@. This is because this package \emph{chose} the values of ADEPS and BDEPS
798 explicitly (by including the packages in this particular order), so there are no
799 degrees of freedom.\footnote{In the presence of a Cabal-style dependency solver
800 which associates a-0.1 with a concrete identifier a, these choices need to be
801 recorded in the package ID.} Finally, in principle, we could have also used
802 the \verb|-module-env| flag to communicate how to lookup the B import (notice
803 that the \verb|-package pkg-a-ADEPS| argument is a bit useless because we
804 never end up using the import.) I will talk a little more about the tradeoffs
805 shortly.
807 Overall, there are a few important things to notice about this architecture.
808 First, because the \verb|pkg-b-BDEPS| product is installed, if in another package
809 build we instantiate the indefinite module B with exactly the same \verb|pkg-a|
810 implementation, we can skip the compilation process and reuse the version.
811 This is because the calculated \verb|BDEPS| will be the same, and thus the package
812 IDs will be the same.
814 XXX ToDo: actually write down pseudocode algorithm for this
816 \paragraph{Module environment or package flags?} In the previous
817 section, I presented two ways by which one can tweak the behavior of
818 GHC's module finder, which is responsible for resolving \verb|import B|
819 into an actual physical module. The first, \verb|-module-env| is to
820 explicitly describe a full mapping from module names to original names;
821 the second, \verb|-package| with \verb|-hide-module| and
822 \verb|-rename-module|, is to load packages as before but apply
823 thinning/renaming as necessary.
825 In general, it seems that using \verb|-package| is desirable, because its
826 input size is smaller. (If a package exports 200 modules, we would be obligated
827 to list all of them in a module environment.) However, a tricky situation
828 occurs when some of these modules come from a parent package:
830 \begin{verbatim}
831 package myapp2 where
832 System.Random = [ ... ].hs
833 include monte-carlo
834 App = [ ... ].hs
835 \end{verbatim}
837 Here, monte-carlo depends on a ``subpart of the myapp2 package'', and it's
838 not entirely clear how monte-carlo should be represented in the installed
839 package database: should myapp2 be carved up into pieces so that subparts
840 of its package description can be installed to the database? A package
841 stub like this would never used by any other package, it is strictly local.
843 On the other hand, if one uses a module environment for specifying how
844 \verb|monte-carlo| should handle \verb|System.Random|, we don't actually
845 have to create a stub package: all we have to do is make sure GHC knows
846 how to find the module with this original name. To make things better,
847 the size of the module environment will only be as long as all of the
848 holes visible to the module in the package description, so the user will
849 have explicitly asked for all of this pluggability.
851 The conclusion seems clear: package granularity for modules from includes,
852 and module environments for modules from parent packages!
854 \paragraph{An appealing but incorrect alternative} In this section,
855 we briefly describe an alternate compilation strategy that might
856 sound good, but isn't so good. The basic idea is, when compiling
857 \verb|pkg-c|, to compile all of its indefinite packages as if the
858 package were one single, big package.
859 (Of course, if we encounter a
860 definite package, don't bother recompiling it; just use it directly.)
861 In particular, we maintain the invariant that any code generated will
862 export symbols under the current package's namespace. So the identifier
863 \verb|b| in the example becomes a symbol \verb|pkg-c_pkg-b_B_b| rather
864 than \verb|pkg-b_B_b| (package subqualification is necessary because
865 package C may define its own B module after thinning out the import.)
867 The fatal problem with this proposal is that it doesn't implement applicative
868 semantics beyond compilation units. While modules within a single
869 compilation will get reused, if there is another package:
871 \begin{verbatim}
872 package pkg-d where
873 include pkg-a
874 include pkg-b
875 \end{verbatim}
877 when it is compiled by itself, it will generate its own instance of B,
878 even though it should be the same as C. This is bad news.
880 \subsection{Restricted recursive modules ala hs-boot}\label{sec:hs-boot-restrict}
882 It should be possible to support GHC-style mutual recursion using the
883 Backpack formalism immediately using hs-boot files. However, to avoid
884 the need for a shaping pass, we must adopt an existing constraint that
885 already applies to hs-boot files: \emph{at the time we define a signature,
886 we must know what the original name for all data types is}. In practice,
887 GHC enforces this by stating that: (1) an hs-boot file must be
888 accompanied with an implementation, and (2) the implementation must
889 in fact define (and not reexport) all of the declarations in the signature.
891 Why does this not require a shaping pass? The reason is that the
892 signature is not really polymorphic: we require that the $\alpha$ module
893 variable be resolved to a concrete module later in the same package, and
894 that all the $\beta$ module variables be unified with $\alpha$. Thus, we
895 know ahead of time the original names and don't need to deal with any
896 renaming.\footnote{This strategy doesn't completely resolve the problem
897 of cross-package mutual recursion, because we need to first compile a
898 bit of the first package (signatures), then the second package, and then
899 the rest of the first package.}
901 Compiling packages in this way gives the tantalizing possibility
902 of true separate compilation: the only thing we don't know is what the actual
903 package name of an indefinite package will be, and what the correct references
904 to have are. This is a very minor change to the assembly, so one could conceive
905 of dynamically rewriting these references at the linking stage. But
906 separate compilation achieved in this fashion would not be able to take
907 advantage of cross-module optimizations.
909 \section{Shaped Backpack}
911 Despite the simplicity of shapeless Backpack with the linking
912 restriction in the absence of holes, we will find that when we have
913 holes, it will be very difficult to do type-checking without
914 some form of shaping. This section is very much a work in progress,
915 but the ability to typecheck against holes, even with the linking restriction,
916 is a very important part of modular separate development, so we will need
917 to support it at some ponit.
919 \subsection{Efficient shaping}
921 (These are Edward's opinion, he hasn't convinced other folks that this is
922 the right way to do it.)
924 In this section, I want to argue that, although shaping constitutes
925 a pre-pass which must be run before compilation in earnest, it is only
926 about as bad as the dependency resolution analysis that GHC already does
927 in \verb|ghc -M| or \verb|ghc --make|.
929 In Paper Backpack, what information does shaping compute? It looks at
930 exports, imports, data declarations and value declarations (but not the
931 actual expressions associated with these values.) As a matter of fact,
932 GHC already must look at the imports associated with a package in order
933 to determine the dependency graph, so that it can have some order to compile
934 modules in. There is a specialized parser which just parses these statements,
935 and then ignores the rest of the file.
937 A bit of background: the \emph{renamer} is responsible for resolving
938 imports and figuring out where all of these entities actually come from.
939 SPJ would really like to avoid having to run the renamer in order to perform
940 a shaping pass.
942 \paragraph{Is it necessary to run the Renamer to do shaping?}
943 Edward and Scott believe the answer is no, well, partially.
944 Shaping needs to know the original names of all entities exposed by a
945 module/signature. Then it needs to know (a) which entities a module/signature
946 defines/declares locally and (b) which entities that module/signature exports.
947 The former, (a), can be determined by a straightforward inspection of a parse
948 tree of the source file.\footnote{Note that no expression or type parsing
949 is necessary. We only need names of local values, data types, and data
950 constructors.} The latter, (b), is a bit trickier. Right now it's the Renamer
951 that interprets imports and exports into original names, so we would still
952 rely on that implementation. However, the Renamer does other, harder things
953 that we don't need, so ideally we could factor out the import/export
954 resolution from the Renamer for use in shaping.
956 Unfortunately the Renamer's import resolution analyzes .hi files, but for
957 local modules, which haven't yet been typechecked, we don't have those.
958 Instead, we could use a new file format, .hsi files, to store the shape of
959 a locally defined module. (Defined packages are bundled with their shapes,
960 so included modules have .hsi files as well.) (What about the logical
961 vs.~physical distinction in file names?) If we refactor the import/export
962 resolution code, could we rewrite it to generically operate on both
963 .hi files and .hsi files?
965 Alternatively, rather than storing shapes on a per-source basis, we could
966 store (in memory) the entire package shape. Similarly, included packages
967 could have a single shape file for the entire package. Although this approach
968 would make shaping non-incremental, since an entire package's shape would
969 be recomputed any time a constituent module's shape changes, we do not expect
970 shaping to be all that expensive.
972 \subsection{Typechecking of indefinite modules}\label{sec:typechecking-indefinite}
974 Recall in our argument in the definite case, where we showed there are
975 no holes in the physical context. With indefinite modules, this is no
976 longer true. While (with the linking restriction) these holes will never
977 be linked against a physical implementation, they may be linked against
978 other signatures. (Note: while disallowing signature linking would
979 solve our problem, it would disallow a wide array of useful instances of
980 signature reuse, for example, a package mylib that implements both
981 mylib-1x-sig and mylib-2x-sig.)
983 With holes, we must handle module variables, and we sometimes must unify them:
985 \begin{verbatim}
986 package p where
987 A :: [ data A ]
988 package q where
989 A :: [ data A ]
990 package r where
991 include p
992 include q
993 \end{verbatim}
995 In this package, it is not possible to a priori assign original names to
996 module A in p and q, because in package r, they should have the same
997 original name. When signature linking occurs, unification may occur,
998 which means we have to rename all relevant original names. (A similar
999 situation occurs when a module is typechecked against a signature.)
1001 An invariant which would be nice to have is this: when typechecking a
1002 signature or including a package, we may apply renaming to the entities
1003 being brought into context. But once we've picked an original name for
1004 our entities, no further renaming should be necessary. (Formally, in the
1005 unification for semantic object shapes, apply the unifier to the second
1006 shape, but not the first one.)
1008 However, there are plenty of counterexamples here:
1010 \begin{verbatim}
1011 package p where
1012 A :: [ data A ]
1013 B :: [ data A ]
1014 M = ...
1015 A = B
1016 \end{verbatim}
1018 In this package, does module M know that A.A and B.A are type equal? In
1019 fact, the shaping pass will have assigned equal module identities to A
1020 and B, so M \emph{equates these types}, despite the aliasing occurring
1021 after the fact.
1023 We can make this example more sophisticated, by having a later
1024 subpackage which causes the aliasing; now, the decision is not even a
1025 local one (on the other hand, the equality should be evident by inspection
1026 of the package interface associated with q):
1028 \begin{verbatim}
1029 package p where
1030 A :: [ data A ]
1031 B :: [ data A ]
1032 package q where
1033 A :: [ data A ]
1034 B = A
1035 package r where
1036 include p
1037 include q
1038 \end{verbatim}
1040 Another possibility is that it might be acceptable to do a mini-shaping
1041 pass, without parsing modules or signatures, \emph{simply} looking at
1042 names and aliases. But logical names are not the only mechanism by
1043 which unification may occur:
1045 \begin{verbatim}
1046 package p where
1047 C :: [ data A ]
1048 A = [ data A = A ]
1049 B :: [ import A(A) ]
1050 C = B
1051 \end{verbatim}
1053 It is easy to conclude that the original names of C and B are the same. But
1054 more importantly, C.A must be given the original name of p:A.A. This can only
1055 be discovered by looking at the signature definition for B. In any case, it
1056 is worth noting that this situation parallels the situation with hs-boot
1057 files (although there is no mutual recursion here).
1059 The conclusion is that you will probably, in fact, have to do real
1060 shaping in order to typecheck all of these examples.
1062 \paragraph{Hey, these signature imports are kind of tricky\ldots}
1064 When signatures and modules are interleaved, the interaction can be
1065 complex. Here is an example:
1067 \begin{verbatim}
1068 package p where
1069 C :: [ data A ]
1070 M = [ import C; ... ]
1071 A = [ import M; data A = A ]
1072 C :: [ import A(A) ]
1073 \end{verbatim}
1075 Here, the second signature for C refers to a module implementation A
1076 (this is permissible: it simply means that the original name for p:C.A
1077 is p:A.A). But wait! A relies on M, and M relies on C. Circularity?
1078 Fortunately not: a client of package p will find it impossible to have
1079 the hole C implemented in advance, since they will need to get their hands on module
1080 A\ldots but it will not be defined prior to package p.
1082 In any case, however, it would be good to emit a warning if a package
1083 cannot be compiled without mutual recursion.
1085 \subsection{Incremental typechecking}
1086 We want to typecheck modules incrementally, i.e., when something changes in
1087 a package, we only want to re-typecheck the modules that care about that
1088 change. GHC already does this today.%
1089 \footnote{\url{https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/RecompilationAvoidance}}
1090 Is the same mechanism sufficient for Backpack? Edward and Scott think that it
1091 is, mostly. Our conjecture is that a module should be re-typechecked if the
1092 existing mechanism says it should \emph{or} if the logical shape
1093 context (which maps logical names to physical names) has changed. The latter
1094 condition is due to aliases that affect typechecking of modules.
1096 Let's look again at an example from before:
1097 \begin{verbatim}
1098 package p where
1099 A :: [ data A ]
1100 B :: [ data A ]
1101 M = [ import A; import B; ... ]
1102 \end{verbatim}
1103 Let's say that \verb|M| is typechecked successfully. Now we add an alias binding
1104 at the end of the package, \verb|A = B|. Does \verb|M| need to be
1105 re-typechecked? Yes! (Well, it seems so, but let's just assert ``yes'' for now.
1106 Certainly in the reverse case---if we remove the alias and then ask---this
1107 is true, since \verb|M| might have depended on the two \verb|A| types
1108 being the same.)
1109 The logical shape context changed to say that \verb|A| and
1110 \verb|B| now map to the same physical module identity. But does the existing
1111 recompilation avoidance mechanism say that \verb|M| should be re-typechecked?
1112 It's unclear. The .hi file for \verb|M| records that it imported \verb|A| and
1113 \verb|B| with particular ABIs, but does it also know about the physical module
1114 identities (or rather, original module names) of these modules?
1116 Scott thinks this highlights the need for us to get our story straight about
1117 the connection between logical names, physical module identities, and file
1118 names!
1121 \subsection{Installing indefinite packages}\label{sec:installing-indefinite}
1123 If an indefinite package contains no code at all, we only need
1124 to install the interface file for the signatures. However, if
1125 they include code, we must provide all of the
1126 ingredients necessary to compile them when the holes are linked against
1127 actual implementations. (Figure~\ref{fig:pkgdb})
1129 \paragraph{Source tarball or preprocessed source?} What is the representation of the source that is saved is. There
1130 are a number of possible choices:
1132 \begin{itemize}
1133 \item The original tarballs downloaded from Hackage,
1134 \item Preprocessed source files,
1135 \item Some sort of internal, type-checked representation of Haskell code (maybe the output of the desugarer).
1136 \end{itemize}
1138 Storing the tarballs is the simplest and most straightforward mechanism,
1139 but we will have to be very certain that we can recompile the module
1140 later in precisely the same we compiled it originally, to ensure the hi
1141 files match up (fortunately, it should be simple to perform an optional
1142 sanity check before proceeding.) The appeal of saving preprocessed
1143 source, or even the IRs, is that this is conceptually this is exactly
1144 what an indefinite package is: we have paused the compilation process
1145 partway, intending to finish it later. However, our compilation strategy
1146 for definite packages requires us to run this step using a \emph{different}
1147 choice of original names, so it's unclear how much work could actually be reused.
1149 \section{Surface syntax}
1151 In the Backpack paper, a brand new module language is presented, with
1152 syntax for inline modules and signatures. This syntax is probably worth implementing,
1153 because it makes it easy to specify compatibility packages, whose module
1154 definitions in general may be very short:
1156 \begin{verbatim}
1157 package ishake-0.12-shake-0.13 where
1158 include shake-0.13
1159 Development.Shake.Sys = Development.Shake.Cmd
1160 Development.Shake = [ (**>) = (&>) ; (*>>) = (|*>)]
1161 Development.Shake.Rule = [ defaultPriority = rule . priority 0.5 ]
1162 include ishake-0.12
1163 \end{verbatim}
1165 However, there are a few things that are less than ideal about the
1166 surface syntax proposed by Paper Backpack:
1168 \begin{itemize}
1169 \item It's completely different from the current method users
1170 specify packages. There's nothing wrong with this per se
1171 (one simply needs to support both formats) but the smaller
1172 the delta, the easier the new packaging format is to explain
1173 and implement.
1175 \item Sometimes order matters (relative ordering of signatures and
1176 module implementations), and other times it does not (aliases).
1177 This can be confusing for users.
1179 \item Users have to order module definitions topologically,
1180 whereas in current Cabal modules can be listed in any order, and
1181 GHC figures out an appropriate order to compile them.
1182 \end{itemize}
1184 Here is an alternative proposal, closely based on Cabal syntax. Given
1185 the following Backpack definition:
1187 \begin{verbatim}
1188 package libfoo(A, B, C, Foo) where
1189 include base
1190 -- renaming and thinning
1191 include libfoo (Foo, Bar as Baz)
1192 -- holes
1193 A :: [ a :: Bool ].hsig
1194 A2 :: [ b :: Bool ].hsig
1195 -- normal module
1196 B = [
1197 import {-# SOURCE #-} A
1198 import Foo
1199 import Baz
1200 ...
1201 ].hs
1202 -- recursively linked pair of modules, one is private
1203 C :: [ data C ].hsig
1204 D = [ import {-# SOURCE #-} C; data D = D C ].hs
1205 C = [ import D; data C = C D ].hs
1206 -- alias
1207 A = A2
1208 \end{verbatim}
1210 We can write the following Cabal-like syntax instead (where
1211 all of the signatures and modules are placed in appropriately
1212 named files):
1214 \begin{verbatim}
1215 package: libfoo
1216 ...
1217 build-depends: base, libfoo (Foo, Bar as Baz)
1218 holes: A A2 -- deferred for now
1219 exposed-modules: Foo B C
1220 aliases: A = A2
1221 other-modules: D
1222 \end{verbatim}
1224 Notably, all of these lists are \emph{insensitive} to ordering!
1225 The key idea is use of the \verb|{-# SOURCE #-}| pragma, which
1226 is enough to solve the important ordering constraint between
1227 signatures and modules.
1229 Here is how the elaboration works. For simplicity, in this algorithm
1230 description, we assume all packages being compiled have no holes
1231 (including \verb|build-depends| packages). Later, we'll discuss how to
1232 extend the algorithm to handle holes in both subpackages and the main
1233 package itself.
1235 \begin{enumerate}
1237 \item At the top-level with \verb|package| $p$ and
1238 \verb|exposed-modules| $ms$, record \verb|package p (ms) where|
1240 \item For each package $p$ with thinning/renaming $ms$ in
1241 \verb|build-depends|, record a \verb|include p (ms)| in the
1242 Backpack package. The ordering of these includes does not
1243 matter, since none of these packages have holes.
1245 \item Take all modules $m$ in \verb|other-modules| and
1246 \verb|exposed-modules| which were not exported by build
1247 dependencies, and create a directed graph where hs and hs-boot
1248 files are nodes and imports are edges (the target of an edge is
1249 an hs file if it is a normal import, and an hs-boot file if it
1250 is a SOURCE import). Topologically sort this graph, erroring if
1251 this graph contains cycles (even with recursive modules, the
1252 cycle should have been broken by an hs-boot file). For each
1253 node, in this order, record \verb|M = [ ... ]| or \verb|M :: [ ... ]|
1254 depending on whether or not it is an hs or hs-boot. If possible,
1255 sort signatures before implementations when there is no constraint
1256 otherwise.
1258 \end{enumerate}
1260 Here is a simple example which shows how SOURCE can be used to disambiguate
1261 between two important cases. Suppose we have these modules:
1263 \begin{verbatim}
1264 -- A1.hs
1265 import {-# SOURCE #-} B
1267 -- A2.hs
1268 import B
1270 -- B.hs
1271 x = True
1273 -- B.hs-boot
1274 x :: Bool
1275 \end{verbatim}
1277 Then we translate the following packages as follows:
1279 \begin{verbatim}
1280 exposed-modules: A1 B
1281 -- translates to
1282 B :: [ x :: Bool ]
1283 A1 = [ import B ]
1284 B = [ x = True ]
1285 \end{verbatim}
1287 but
1289 \begin{verbatim}
1290 exposed-modules: A2 B
1291 -- translates to
1292 B = [ x = True ]
1293 B :: [ x :: Bool ]
1294 A2 = [ import B ]
1295 \end{verbatim}
1297 The import controls placement between signature and module, and in A1 it
1298 forces B's signature to be sorted before B's implementation (whereas in
1299 the second section, there is no constraint so we preferentially place
1300 the B's implementation first)
1302 \paragraph{Holes in the database} In the presence of holes,
1303 \verb|build-depends| resolution becomes more complicated. First,
1304 let's consider the case where the package we are building is
1305 definite, but the package database contains indefinite packages with holes.
1306 In order to maintain the linking restriction, we now have to order packages
1307 from step (2) of the previous elaboration. We can do this by creating
1308 a directed graph, where nodes are packages and edges are from holes to the
1309 package which implements them. If there is a cycle, this indicates a mutually
1310 recursive package. In the absence of cycles, a topological sorting of this
1311 graph preserves the linking invariant.
1313 One subtlety to consider is the fact that an entry in \verb|build-depends|
1314 can affect how a hole is instantiated by another entry. This might be a
1315 bit weird to users, who might like to explicitly say how holes are
1316 filled when instantiating a package. Food for thought, surface syntax wise.
1318 \paragraph{Holes in the package} Actually, this is quite simple: the
1319 ordering of includes goes as before, but some indefinite packages in the
1320 database are less constrained as they're ``dependencies'' are fulfilled
1321 by the holes at the top-level of this package. It's also worth noting
1322 that some dependencies will go unresolved, since the following package
1323 is valid:
1325 \begin{verbatim}
1326 package a where
1327 A :: ...
1328 package b where
1329 include a
1330 \end{verbatim}
1332 \paragraph{Multiple signatures} In Backpack syntax, it's possible to
1333 define a signature multiple times, which is necessary for mutually
1334 recursive signatures:
1336 \begin{verbatim}
1337 package a where
1338 A :: [ data A ]
1339 B :: [ import A; data B = B A ]
1340 A :: [ import B; data A = A B ]
1341 \end{verbatim}
1343 Critically, notice that we can see the constructors for both module B and A
1344 after the signatures are linked together. This is not possible in GHC
1345 today, but could be possible by permitting multiple hs-boot files. Now
1346 the SOURCE pragma indicating an import must \emph{disambiguate} which
1347 hs-boot file it intends to include. This might be one way of doing it:
1349 \begin{verbatim}
1350 -- A.hs-boot2
1351 data A
1353 -- B.hs-boot
1354 import {-# SOURCE hs-boot2 #-} A
1356 -- A.hs-boot
1357 import {-# SOURCE hs-boot #-} B
1358 \end{verbatim}
1360 \paragraph{Explicit or implicit reexports} One annoying property of
1361 this proposal is that, looking at the \verb|exposed-modules| list, it is
1362 not immediately clear what source files one would expect to find in the
1363 current package. It's not obvious what the proper way to go about doing
1364 this is.
1366 \paragraph{Better syntax for SOURCE} If we enshrine the SOURCE import
1367 as a way of solving Backpack ordering problems, it would be nice to have
1368 some better syntax for it. One possibility is:
1370 \begin{verbatim}
1371 abstract import Data.Foo
1372 \end{verbatim}
1374 which makes it clear that this module is pluggable, typechecking against
1375 a signature. Note that this only indicates how type checking should be
1376 done: when actually compiling the module we will compile against the
1377 interface file for the true implementation of the module.
1379 It's worth noting that the SOURCE annotation was originally made a
1380 pragma because, in principle, it should have been possible to compile
1381 some recursive modules without needing the hs-boot file at all. But if
1382 we're moving towards boot files as signatures, this concern is less
1383 relevant.
1385 \section{Bits and bobs}
1387 \subsection{Abstract type synonyms}
1389 In Paper Backpack, abstract type synonyms are not permitted, because GHC doesn't
1390 understand how to deal with them. The purpose of this section is to describe
1391 one particularly nastiness of abstract type synonyms, by way of the occurs check:
1393 \begin{verbatim}
1394 A :: [ type T ]
1395 B :: [ import qualified A; type T = [A.T] ]
1396 \end{verbatim}
1398 At this point, it is illegal for \verb|A = B|, otherwise this type synonym would
1399 fail the occurs check. This seems like pretty bad news, since every instance
1400 of the occurs check in the type-checker could constitute a module inequality.
1402 \subsection{Type families}
1404 Like type classes, type families must not overlap (and this is a question of
1405 type safety!)
1407 A more subtle question is compatibility and apartness of type family
1408 equations. Under these checks, aliasing of modules can fail if it causes
1409 two type families to be identified, but their definitions are not apart.
1410 Here is a simple example:
1412 \begin{verbatim}
1413 A :: [
1414 type family F a :: *
1415 type instance F Int = Char
1416 ]
1417 B :: [
1418 type family F a :: *
1419 type instance F Int = Bool
1420 ]
1421 \end{verbatim}
1423 Now it is illegal for \verb|A = B|, because when the type families are
1424 unified, the instances now fail the apartness check. However, if the second
1425 instance was \verb|F Int = Char|, the families would be able to link together.
1427 To make matters worse, an implementation may define more axioms than are
1428 visible in the signature:
1430 \begin{verbatim}
1431 package a where
1432 A :: [
1433 type family F a :: *
1434 type instance F Int = Bool
1435 ]
1436 package b where
1437 include a
1438 B = [
1439 import A
1440 type instance F Bool = Bool
1441 ...
1442 ]
1443 package c where
1444 A = [
1445 type family F a :: *
1446 type instance F Int = Bool
1447 type instance F Bool = Int
1448 ]
1449 include b
1450 \end{verbatim}
1452 It would seem that private axioms cannot be naively supported. Is
1453 there any way that thinning axioms could be made to work?
1455 \section{Open questions}\label{sec:open-questions}
1457 Here are open problems about the implementation which still require
1458 hashing out.
1460 \begin{itemize}
1462 \item In Section~\ref{sec:simplifying-backpack}, we argued that we
1463 could implement Backpack without needing a shaping pass. We're
1464 pretty certain that this will work for typechecking and
1465 compiling fully definite packages with no recursive linking, but
1466 in Section~\ref{sec:typechecking-indefinite}, we described some
1467 of the prevailing difficulties of supporting signature linking.
1468 Renaming is not an insurmountable problem, but backwards flow of
1469 shaping information can be, and it is unclear how best to
1470 accommodate this. This is probably the most important problem
1471 to overcome.
1473 \item In Section~\ref{sec:installing-indefinite}, a few choices for how to
1474 store source code were pitched, however, there is not consensus on which
1475 one is best.
1477 \item What is the impact of the multiplicity of PackageIds on
1478 dependency solving in Cabal? Old questions of what to prefer
1479 when multiple package-versions are available (Cabal originally
1480 only needed to solve this between different versions of the same
1481 package, preferring the oldest version), but with signatures,
1482 there are more choices. Should there be a complex solver that
1483 does all signature solving, or a preprocessing step that puts
1484 things back into the original Cabal version. Authors may want
1485 to suggest policy for what packages should actually link against
1486 signatures (so a crypto library doesn't accidentally link
1487 against a null cipher package).
1489 \end{itemize}
1491 \bibliographystyle{plain}
1492 \bibliography{backpack-impl}
1494 \end{document}