Lots of rewrites to further move toward new world order
[ghc.git] / docs / backpack / backpack-impl.tex
1 \documentclass{article}
2
3 \usepackage{graphicx} %[pdftex] OR [dvips]
4 \usepackage{fullpage}
5 \usepackage{float}
6 \usepackage{titling}
7 \setlength{\droptitle}{-6em}
8
9 \newcommand{\ghcfile}[1]{\textsl{#1}}
10
11 \title{Implementing Backpack}
12
13 \begin{document}
14
15 \maketitle
16
17 The purpose of this document is to describe an implementation path
18 for Backpack~\cite{Kilpatrick:2014:BRH:2535838.2535884} in GHC\@.
19
20 We start off by outlining the current architecture of GHC, ghc-pkg and Cabal,
21 which constitute the existing packaging system. We then state what our subgoals
22 are, since there are many similar sounding but different problems to solve. Next,
23 we describe the ``probably correct'' implementation plan, and finish off with
24 some open design questions. This is intended to be an evolving design document,
25 so please contribute!
26
27 \section{Current packaging architecture}
28
29 The overall architecture is described in Figure~\ref{fig:arch}.
30
31 \begin{figure}[H]
32 \center{\scalebox{0.8}{\includegraphics{arch.png}}}
33 \label{fig:arch}\caption{Architecture of GHC, ghc-pkg and Cabal. Green bits indicate additions from upcoming IHG work, red bits indicate additions from Backpack. Orange indicates a Haskell library.}
34 \end{figure}
35
36 Here, arrows indicate dependencies from one component to another. Color
37 coding is as follows: orange components are libaries, green components
38 are to be added with the IHG work, red components are to be added with
39 Backpack. (Thus, black and orange can be considered the current)
40
41 \subsection{Installed package database}
42
43 Starting from the bottom, we have the \emph{installed package database}
44 (actually a collection of such databases), which stores information
45 about what packages have been installed are thus available to be
46 compiled against. There is both a global database (for the system
47 administrator) and a local database (for end users), which can be
48 updated independently. One way to think about the package database
49 is as a \emph{cache of object code}. In principle, one could compile
50 any piece of code by repeatedly recompiling all of its dependencies;
51 the installed package database describes when this can be bypassed.
52
53 \begin{figure}[H]
54 \center{\scalebox{0.8}{\includegraphics{pkgdb.png}}}
55 \label{fig:pkgdb}\caption{Anatomy of a package database.}
56 \end{figure}
57
58 In Figure~\ref{fig:pkgdb}, we show the structure of a package database.
59 The installed package are created from a Cabal file through the process
60 of dependency resolution and compilation. In database terms, the primary key
61 of a package database is the InstalledPackageId
62 (Figure~\ref{fig:current-pkgid}). This ID uniquely identifies an
63 instance of an installed package. The PackageId omits the ABI hash and
64 is used to qualify linker exported symbols: the current value of this
65 parameter is communicated to GHC using the \verb|-package-id| flag.
66
67 In principle, packages with different PackageIds should be linkable
68 together in the same compiled program, whereas packages with the same
69 PackageId are not (even if they have different InstalledPackageIds). In
70 practice, GHC is currently only able to select one version of a package,
71 as it clears out all old versions of the package in
72 \ghcfile{compiler/main/Package.lhs}:applyPackageFlag.
73
74 \begin{figure}
75 \center{\begin{tabular}{r l}
76 PackageId & package name, package version \\
77 InstalledPackageId & PackageId, ABI hash \\
78 \end{tabular}}
79 \label{fig:current-pkgid}\caption{Current structure of package identifiers.}
80 \end{figure}
81
82 The database entry itself contains the information from the installed package ID,
83 as well as information such as what dependencies it was linked against, where
84 its compiled code and interface files live, its compilation flags, what modules
85 it exposes, etc. Much of this information is only relevant to Cabal; GHC
86 uses a subset of the information in the package database.
87
88 \subsection{GHC}
89
90 The two programs which access the package database directly are GHC
91 proper (for compilation) and ghc-pkg (which is a general purpose
92 command line tool for manipulating the database.) GHC relies on
93 the package database in the following ways:
94
95 \begin{itemize}
96 \item It imports the local and global package databases into
97 its runtime database, and applies modifications to the exposed
98 and trusted status of the entries via the flags \verb|-package|
99 and others (\ghcfile{compiler/main/Packages.lhs}). The internal
100 package state can be seen at \verb|-v4| or higher.
101 \item It uses this package database to find the location of module
102 interfaces when it attempts to load the module info of an external
103 module (\ghcfile{compiler/iface/LoadIface.hs}).
104 \end{itemize}
105
106 GHC itself performs a type checking phase, which generates an interface
107 file representing the module (so that later invocations of GHC can load the type
108 of a module), and then after compilation projects object files and linked archives
109 for programs to use.
110
111 \paragraph{Original names} Original names are an important design pattern
112 in GHC\@.
113 Sometimes, a name can be exposed in an hi file even if its module
114 wasn't exposed. Here is an example (compiled in package R):
115
116 \begin{verbatim}
117 module X where
118 import Internal (f)
119 g = f
120
121 module Internal where
122 import Internal.Total (f)
123 \end{verbatim}
124
125 Then in X.hi:
126
127 \begin{verbatim}
128 g = <R.id, Internal.Total, f> (this is the original name)
129 \end{verbatim}
130
131 (The reason we refer to the package as R.id is because it's the
132 full package ID, and not just R).
133
134 \subsection{hs-boot}
135
136 \verb|hs-boot| is a special mechanism used to support recursive linking
137 of modules within a package, today. Suppose I have a recursive module
138 dependency between modules and A and B. I break one of\ldots
139
140 (ToDo: describe how hs-boot mechanism works)
141
142 \subsection{Cabal}
143
144 Cabal is the build system for GHC, we can think of it as parsing a Cabal
145 file describing a package, and then making (possibly multiple)
146 invocations to GHC to perform the appropriate compilation. What
147 information does Cabal pass onto GHC\@? One can get an idea for this by
148 looking at a prototypical command line that Cabal invokes GHC with:
149
150 \begin{verbatim}
151 ghc --make
152 -package-name myapp-0.1
153 -hide-all-packages
154 -package-id containers-0.9-ABCD
155 Module1 Module2
156 \end{verbatim}
157
158 There are a few things going on here. First, Cabal has to tell GHC
159 what the name of the package it's compiling (otherwise, GHC can't appropriately
160 generate symbols that other code referring to this package might generate).
161 There are also a number of commands which configure its in-memory view of
162 the package database (GHC's view of the package database may not directly
163 correspond to what is on disk). There's also an optimization here: in principle,
164 GHC can compile each module one-by-one, but instead we use the \verb|--make| flag
165 because this allows GHC to reuse some data structures, resulting in a nontrivial
166 speedup.
167
168 (ToDo: describe cabal-install/sandbox)
169
170 \section{Goals}
171
172 There are actually a number of different goals we have for modifying the
173 packaging system, some of which are subsets of the Backpack system.
174
175 \begin{itemize}
176 \item As a prerequisite, support multiple instances of containers-2.9 \emph{in the
177 package database}. These instances may be compiled against
178 different dependencies, have the same dependencies but different
179 source files (as when a package is being developed), or be
180 compiled with different options. It is less important to allow
181 these instances to be linkable together.\footnote{Actually, I think
182 this is completely orthogonal to Backpack, since we are going to treat
183 dependencies as part of the package ID, so they would be considered
184 separate entries in the package database anyway.}
185
186 \item Support typechecking a library against a module interface
187 as opposed to an actual implementation. This would be useful
188 for moving towards a model where Cabal package dependency versions
189 are replaced with proper signature packages. % See Section~\ref{sec:signature-packages} for more details.
190
191 \item Support compiling the aforementioned libraries with actual implementations.
192 It is \emph{not} a goal to be able to compile a library while only
193 partially providing dependencies, and it is low priority to support
194 mutually recursive implementations of these implementations.
195
196 \iffalse%
197 \item Support insertion of backwards compatibility shims for packages
198 that are using old versions of packages, so that you can continue
199 using them without having to patch them manually. This is a
200 stylized use-case of Backpack features. See Section~LINKME for
201 more details.
202
203 \item Some easy-to-implement subset of the functionality provided by
204 packages with holes (Backpack proper). This includes support
205 of linking an executable containing multiple packages with the
206 same package name but different PackageIds.
207 \fi
208 \end{itemize}
209
210 A lower priority goal is to actually allow multiple instances of
211 containers-2.9 to be linked together in the same executable
212 program.\footnote{In particular, this requires changes to how linker
213 symbols are assigned. However, this feature is important to
214 implement a number of Backpack features.}
215
216 A \emph{non-goal} is to allow users to upgrade upstream libraries
217 without recompiling downstream. This is an ABI concern and we're not
218 going to worry about it.
219
220 \section{Aside: Recent IHG work}\label{sec:ihg}
221
222 The IHG project has allocated some funds to relax the package instance
223 constraint in the package database, so that multiple instances can be
224 stored, but now the user of GHC must explicitly list package-IDs to be
225 linked against. In the far future, it would be expected that tools like
226 Cabal automatically handle instance selection among a large number of
227 instances, but this is subtle and so this work is only to do some
228 foundational work, allowing a package database to optionally relax the
229 unique package-version requirement, and utilize environment files to
230 select which packages should be used. See Duncan's email for more
231 details on the proposal.
232
233 To implement this:
234
235 \begin{enumerate}
236
237 \item Remove the ``removal step'' when registering a package (with a flag)
238
239 \item Check \ghcfile{compiler/main/Packages.lhs}:mkPackagesState to look out for shadowing
240 within a database. We believe it already does the right thing, since
241 we already need to handle shadowing between the local and global database.
242
243 \end{enumerate}
244
245 Once these changes are implemented, we can program multiple instances by
246 using \verb|-hide-all-packages -package-id ...|, even if there is no
247 high-level tool support.
248
249 Actually, this concern is orthogonal to the purposes of Backpack, if
250 we redefine PackageId appropriately.
251
252 \paragraph{The ABI hash} Currently, InstalledPackageId
253 is constructed of a package, version and ABI hash
254 (generateRegistrationInfo in
255 \ghcfile{libraries/Cabal/Cabal/Distribution/Simple/Register.hs}). The
256 use of an ABI hash is a bit of GHC-specific hack introduced in 2009,
257 intended to make sure these installed package IDs are unique. While
258 this is quite clever, using the ABI is actually a bit inflexible, as one
259 might reasonably want to have multiple copies of a package with the same
260 ABI but different source code changes.\footnote{In practice, our ABIs
261 are so unstable that it doesn't really matter.}
262
263 In Figure~\ref{fig:proposed-pkgid}, there is an alternate logical
264 representation of InstalledPackageId which attempts to extricate the
265 notion of ABI compatibility from what actually might uniquely identify a
266 package beyond its PackageId. We imagine these components to be:
267
268 \begin{itemize}
269 \item A hash of the source code (so one can register different
270 in-development versions without having to bump the version
271 number);
272 \item Compilation way (profiling? dynamic?)
273 \item Compilation flags (such as compilation way, optimization,
274 profiling settings)\footnote{This is a little undefined on a package bases, because in principle the flags could be varied on a per-file basis. More likely this will be approximated against the relevant fields in the Cabal file as well as arguments passed to Cabal.};
275 \end{itemize}
276
277 A historical note: in the 2012 GSoC project to allow multiple instances
278 of a package to be installed at the same time, use of \emph{random
279 numbers} was used to workaround the inability to get an ABI early
280 enough. We are not using this plan.
281
282 \section{Infrastructural improvements}
283
284 There are some infrastructural improvements that must be made before
285 Backpack proper can be implemented. These additions are described in
286 red in the architectural diagrams. The current structure of this
287 section is to describe the additions bottom up.
288
289 \subsection{Concrete physical identity = PackageId$^*$ + Module name}\label{sec:ipi}
290
291 \begin{figure}
292 \center{\begin{tabular}{r l}
293 PackageId & hash of ``package name, package version, PackageIds of dependent packages'' \\
294 InstalledPackageId & hash of ``PackageId, source code, way, compiler flags'' \\
295 \end{tabular}}
296 \label{fig:proposed-pkgid}\caption{Proposed new structure of package identifiers.}
297 \end{figure}
298
299 In Backpack, there needs to be some mechanism for assigning
300 \emph{physical module identities} to modules, which are essential for
301 typechecking Backpack packages, since they let us tell if two types are
302 equal or not. In the paper, the physical identity was represented as the
303 package that constructed it as well as some representation of the module
304 source. We can simplify this slightly: in current Cabal packages, we
305 require that modules always be given a package-unique logical name;
306 thus, physical identities can be simply represented as a PackageId plus
307 module name. (See \ghcfile{compiler/basicTypes/Module.lhs:Module})
308 In fact, this coincides with how GHC already internally handles
309 the problem of type equality: it appeals to an \emph{original name}
310 which is, presently, a PackageId and the module name.
311
312 However, with the current representation of PackageIds, this is
313 insufficient: a package is not just its name, but also the regular
314 tree representing all of its package dependencies. Thus, we have
315 to adjust the representation of a PackageId so that it includes this
316 regular tree, as seen Figure~\ref{fig:proposed-pkgid}. Since this
317 tree in general may be quite long, it needs to be shortened in some way,
318 such as by hashing.
319
320 And now, the complications\ldots
321
322 \paragraph{Relaxing package selection restrictions} As mentioned
323 previously, GHC is unable to select multiple packages with the same
324 package name (but different PackageIds). This restriction needs to be
325 lifted. For backwards compatibility reasons, it's probably better to
326 not overload \verb|-package-id| but add a new flag, maybe \verb|-force-package-id|;
327 we also need to disable the old version masking behavior. This is orthogonal
328 to the IHG work, which wants to allow multiple InstalledPackageIds in the
329 \emph{database} (here, we want to allow multiple PackageIds in compiled code).
330
331 \paragraph{Linker symbols} As we increase the amount of information in
332 PackageId, it's important to be careful about the length of these IDs,
333 as they are used for exported linker symbols (e.g.
334 \verb|base_TextziReadziLex_zdwvalDig_info|). Very long symbol names
335 hurt compile and link time, object file sizes, GHCi startup time,
336 dynamic linking, and make gdb hard to use. As such, the current plan is
337 to do away with full package names and versions, and instead use just a
338 base-62 encoded hash, perhaps with the first four characters of the package
339 name for user-friendliness.
340
341 Edward: I'm still partial to a short hash of the dependency bits (or
342 even Simon's registry of short strings to dependency trees), and keeping
343 everything else the same.
344
345 \paragraph{Wired-in names} One annoying thing to remember is that GHC
346 has wired-in names, which refer to packages without any version. Now
347 these wired names also have to accomodate dependency trees. A
348 suggested approach is to have a fixed table from these wired names to
349 package IDs; alternately we can use something like the special \verb|inplace|
350 version number.
351
352 \paragraph{Free variables (or, what is a non-concrete physical
353 identity?)} Physical identities in their full generality are permitted
354 to have free variables, which represent holes. Handling this is a
355 tricky question, and we defer it to Section~\ref{sec:typechecking-indefinite}, when
356 we talk about packages with holes.
357
358 \subsection{Exposed modules should allow external modules}\label{sec:reexport}
359
360 In Backpack, the definition of a package consists of a logical context,
361 which maps logical module names to physical module names. These do not
362 necessarily coincide, since some physical modules may have been defined
363 in other packages and mixed into this package. This mapping specifies
364 what modules other packages including this package can access.
365 However, in the current installed package database, we have exposed-modules which
366 specify what modules are accessible, but we assume that the current
367 package is responsible for providing these modules.
368
369 To implement Backpack, we have to extend this exposed-modules (``Export declarations''
370 on Figure~\ref{fig:pkgdb}). Rather
371 than a list of logical module names, we provide a new list of tuples:
372 the exported logical module name and original physical module name (this
373 is in two parts: the InstalledPackageId and the original module name).
374 For example, an traditional module export is simply (Name, my-pkg-id, Name);
375 a renamed module is (NewName, my-pkg-id, OldName), and an external module
376 is (Name, external-pkg-id, Name).
377
378 As an example:
379
380 \begin{verbatim}
381 package P where
382 M = ...
383 N = ...
384 package Q (M, R, T)
385 include P (N -> R)
386 T = ...
387 \end{verbatim}
388
389 And now if we look at Q\@:
390
391 \begin{verbatim}
392 exposed-modules:
393 <M, P.id, M>
394 <R, P.id, N>
395 <T, Q.id, T>
396 \end{verbatim}
397
398 When we compile Q, and the interface file gets generated, we have
399 to generate identifiers for each of the exposed modules. These should
400 be calculated to directly refer to the ``original name'' of each them;
401 so for example M and R point directly to package P, but they also
402 include the original name they had in the original definition.
403
404 \section{Simplifying Backpack}\label{sec:simplifying-backpack}
405
406 Backpack as currently defined always requires a \emph{shaping} pass,
407 which calculates the shapes of all modules defined in a package.
408 The shaping pass is critical to the solution of the double-vision problem
409 in recursive module linking, but it also presents a number of unpalatable
410 implementation problems:
411
412 \begin{itemize}
413
414 \item \emph{Shaping is a lot of work.} A module shape specifies the
415 providence of all data types and identifiers defined by a
416 module. To calculate this, we must preprocess and parse all
417 modules, even before we do the type-checking pass.
418
419 \item \emph{Shaping must be done upfront.} In the current Backpack
420 design, all shapes must be computed before any typechecking can
421 occur. While performing the shaping pass upfront is necessary
422 in order to solve the double vision problem (where a module
423 identity may be influenced by later definitions), it also means
424 that either GHC must compile an entire package in one go, or it
425 must first do a shaping pass, and then revisit every module and
426 compile them proper. Nor is it (easily) possible to skip the
427 shaping pass when it is unnecessary, as one might expect to be
428 the case in the absence of mutual recursion. Shaping is not
429 a ``pay as you go'' language feature.
430
431 \item \emph{GHC can't compile all programs shaping accepts.} Shaping
432 accepts programs that GHC, with its current hs-boot mechanism, cannot
433 compile. In particular, GHC requires that any data type or function
434 in a signature actually be \emph{defined} in the module corresponding
435 to that file (i.e., an original name can be assigned to these entities
436 immediately.) Shaping permits unrestricted exports to implement
437 modules; this shows up in the formalism as $\beta$ module variables.
438
439 \item \emph{Shaping encourages inefficient program organization.}
440 Shaping is designed to enable mutually recursive modules, but as
441 currently implemented, mutual recursion is less efficient than
442 code without recursive dependencies. Programmers should avoid
443 this code organization, except when it is absolutely necessary.
444
445 \item \emph{GHC is architecturally ill-suited for shaping.} Shaping
446 implies that GHC's internal concept of an ``original name'' be
447 extended to accommodate module variables. This is an extremely
448 invasive change to all aspects of GHC, since the original names
449 assumption is baked quite deeply into the compiler.
450
451 \end{itemize}
452
453 To be clear, the shaping pass is fundamentally necessary for some
454 Backpack packages. Here is the example which convinced Simon:
455
456 \begin{verbatim}
457 package p where
458 A :: [data T; f :: T -> T]
459 B = [export T(MkT), h; import A(f); data T = MkT; h x = f MkT]
460 A = [export T(MkT), f, h; import B; f MkT = MkT]
461 \end{verbatim}
462
463 The key to this example is that B \emph{may or may not typecheck} depending
464 on the definition of A. Because A reexports B's definition T, B will
465 typecheck; but if A defined T on its own, B would not typecheck. Thus,
466 we \emph{cannot} typecheck B until we have done some analysis of A (the
467 shaping analysis!)
468
469 So, if we want to jettison the shaping analysis, we'd like a subset
470 of Backpack which does not allow mutually recursive modules.
471 Here is a programming discipline which does just that, which we will call the \textbf{linking restriction}: \emph{Module implementations must be declared before
472 signatures.} Formally, this restriction modifies the rule for merging
473 polarized module shapes ($\widetilde{\tau}_1^{m_1} \oplus \widetilde{\tau}_2^{m_2}$) so that
474 $\widetilde{\tau}_1^- \oplus \widetilde{\tau}_2^+$ is always undefined.\footnote{This seemed to be the crispest way of defining the restriction, although this means an error happens a bit later than I'd like it to: I'd prefer if we errored while merging logical contexts, but we don't know what is a hole at that point.}
475
476 Here is an example of the linking restriction. Consider these two packages:
477
478 \begin{verbatim}
479 package random where
480 System.Random = [ ... ].hs
481
482 package monte-carlo where
483 System.Random :: ...
484 System.MonteCarlo = [ ... ].hs
485 \end{verbatim}
486
487 Here, random is a definite package which may have been compiled ahead
488 of time; monte-carlo is an indefinite package with a dependency
489 on any package which provides \verb|System.Random|.
490
491 Now, to link these two applications together, only one ordering
492 is permissible:
493
494 \begin{verbatim}
495 package myapp where
496 include random
497 include monte-carlo
498 \end{verbatim}
499
500 If myapp wants to provide its own random implementation, it can do so:
501
502 \begin{verbatim}
503 package myapp2 where
504 System.Random = [ ... ].hs
505 include monte-carlo
506 \end{verbatim}
507
508 In both cases, all of \verb|monte-carlo|'s holes have been filled in by the time
509 it is included. The alternate ordering is not allowed.
510
511 Why does this discipline prevent mutually recursive modules? Intuitively,
512 a hole is the mechanism by which we can refer to an implementation
513 before it is defined; otherwise, we can only refer to definitions which
514 preceed our definition. If there are never any holes \emph{which get filled},
515 implementation links can only go backwards, ruling out circularity.
516
517 It's easy to see how mutual recursion can occur if we break this discipline:
518
519 \begin{verbatim}
520 package myapp2 where
521 include monte-carlo
522 System.Random = [ import System.MonteCarlo ].hs
523 \end{verbatim}
524
525 \subsection{Typechecking of definite modules without shaping}
526
527 If we are not carrying out a shaping pass, we need to be able to calculate
528 $\widetilde{\Xi}_{\mathsf{pkg}}$ on the fly. In the case that we are
529 compiling a package---there will be no holes in the final package---we
530 can show that shaping is unnecessary quite easily, since with the
531 linking restriction, everything is definite from the get-go.
532
533 Observe the following invariant: at any given step of the module
534 bindings, the physical context $\widetilde{\Phi}$ contains no
535 holes. We can thus conclude that there are no module variables in any
536 type shapes. As the only time a previously calculated package shape can
537 change is due to unification, the incrementally computed shape is in
538 fact the true one.
539
540 As far as the implementation is concerned, we never have to worry
541 about handling module variables; we only need to do extra typechecks
542 against (renamed) interface files.
543
544 \subsection{Compilation of definite modules}\label{sec:compiling-definite}
545
546 Of course, we still have to compile the code,
547 and this includes any subpackages which we have mixed in the dependencies
548 to make them fully definite. Let's take the following package as an example:
549
550 \begin{verbatim}
551 package pkg-a where
552 A = ...
553 package pgk-b where -- indefinite package
554 A :: ...
555 B = [ b = ... ]
556 package pkg-c where
557 include pkg-a
558 include pkg-b
559 \end{verbatim}
560
561 There seem to be two schools of thought for how compilation should proceed.
562
563 \paragraph{The ``downstream'' proposal} This is Simon's favorite
564 proposal. Because of the linking invariant, if we traverse the Backpack
565 specification, any given module we need to compile will have all of its
566 dependencies compiled (since it could only depend on the dependency if
567 there was a signature, and the signature could not have been linked
568 unless it was implemented previously.) So we just go ahead and compile
569 everything one-by-one while traversing the package tree, as if the
570 package was a single big package. (Of course, if we encounter a
571 definite package, don't bother recompiling it; just use it directly.)
572 In particular, we maintain the invariant that any code generated will
573 export symbols under the current package's namespace. So the identifier
574 \verb|b| in the example becomes a symbol \verb|pkg-c_pkg-b_B_b| rather
575 than \verb|pkg-b_B_b| (package subqualification is necessary because
576 package C may define its own B module after thinning out the import.)
577
578 One big problem with this proposal is that it doesn't implement applicative
579 semantics. If there is another package:
580
581 \begin{verbatim}
582 package pkg-d where
583 include pkg-a
584 include pkg-b
585 \end{verbatim}
586
587 this will generate its own instance of B, even though it should be the same
588 as C. Simon was willing to entertain the idea that, well, as long as the
589 type-checker is able to figure out they are the same, then it might be OK
590 if we accidentally generate two copies of the code (provided they actually
591 are the same).
592
593 \paragraph{The ``upstream'' proposal}
594 The problem with the ``downstream'' proposal is that it always recompiles
595 all of the indefinite parts of the package, even if some of them should
596 be shared in some sense. Hypothetically, a fully linked version of an
597 indefinite package should be considered a package in its own right
598 (in particular, it is associated with a physical module identity in Backpack).
599 So all we need to do is store these installed packages somewhere, probably
600 the local package database. If we recompile the same dependency chain,
601 the installed package can be simply reused. These products do not have
602 to be exposed.
603
604 One problem with this proposal is in this example:
605
606 \begin{verbatim}
607 package myapp2 where
608 System.Random = [ ... ].hs
609 include monte-carlo
610 \end{verbatim}
611
612 Here, monte-carlo depends on a ``subpart of the myapp2 package'', and it's
613 not entirely clear how monte-carlo should be represented in the installed
614 package database: should myapp2 be carved up into pieces so that subparts
615 of its package description can be installed to the database?
616
617 Another reason you might not be so keen about this proposal is the fact
618 that we have to hit the package database, despite the fact that these
619 are all ostensibly local build products. (But perhaps not!)
620
621 \subsection{Typechecking of indefinite modules}\label{sec:typechecking-indefinite}
622
623 When we are typechecking a package that contains some holes, some extra
624 care must be taken. We can no longer assume that there are no holes in
625 the physical context, and while these holes will never be linked against
626 a physical implementation, they may be linked against other signatures.
627 (Note: while disallowing signature linking would solve our problem, it
628 would disallow a wide array of useful instances of signature reuse, for
629 example, a package mylib that implements both mylib-1x-sig and mylib-2x-sig.)
630
631 With holes, we must handle module variables, and we sometimes must unify them:
632
633 \begin{verbatim}
634 package p where
635 A :: [ data A ]
636 package q where
637 A :: [ data A ]
638 package r where
639 include p
640 include q
641 \end{verbatim}
642
643 In this package, it is not possible to a priori assign original names to
644 module A in p and q, because in package r, they should have the same
645 original name. When signature linking occurs, unification may occur,
646 which means we have to rename all relevant original names. (A similar
647 situation occurs when a module is typechecked against a signature.)
648
649 An invariant which would be nice to have is this: when typechecking a
650 signature or including a package, we may apply renaming to the entities
651 being brought into context. But once we've picked an original name for
652 our entities, no further renaming should be necessary. (Formally, in the
653 unification for semantic object shapes, apply the unifier to the second
654 shape, but not the first one.)
655
656 However, there are plenty of counterexamples here:
657
658 \begin{verbatim}
659 package p where
660 A :: [ data A ]
661 B :: [ data A ]
662 M = ...
663 A = B
664 \end{verbatim}
665
666 In this package, does module M know that A.A and B.A are type equal? In
667 fact, the shaping pass will have assigned equal module identities to A
668 and B, so M \emph{equates these types}, despite the aliasing occurring
669 after the fact.
670
671 We can make this example more sophisticated, by having a later
672 subpackage which causes the aliasing; now, the decision is not even a
673 local one (on the other hand, the equality should be evident by inspection
674 of the package interface associated with q):
675
676 \begin{verbatim}
677 package p where
678 A :: [ data A ]
679 B :: [ data A ]
680 package q where
681 A :: [ data A ]
682 B = A
683 package r where
684 include p
685 include q
686 \end{verbatim}
687
688 Another possibility is that it might be acceptable to do a small shaping
689 pass, without parsing modules or signatures, \emph{simply} looking at
690 names and aliases. But logical names are not the only mechanism by
691 which unification may occur:
692
693 \begin{verbatim}
694 package p where
695 C :: [ data A ]
696 A = [ data A = A ]
697 B :: [ import A(A) ]
698 C = B
699 \end{verbatim}
700
701 It is easy to conclude that the original names of C and B are the same. But
702 more importantly, C.A must be given the original name of p:A.A. This can only
703 be discovered by looking at the signature definition for B. In any case, it
704 is worth noting that this situation parallels the situation with hs-boot
705 files (although there is no mutual recursion here). See Section~\ref{sec:hs-boot-restirct}
706 for more details.
707
708 \paragraph{Hey, these signature imports are kind of tricky\ldots}
709
710 When signatures and modules are interleaved, the interaction can be
711 complex. Here is an example:
712
713 \begin{verbatim}
714 package p where
715 C :: [ data A ]
716 M = [ import C; ... ]
717 A = [ import M; data A = A ]
718 C :: [ import A(A) ]
719 \end{verbatim}
720
721 Here, the second signature for C refers to a module implementation A
722 (this is permissible: it simply means that the original name for p:C.A
723 is p:A.A). But wait! A relies on M, and M relies on C. Circularity?
724 Fortunately not: a client of package p will find it impossible to have
725 the hole C implemented in advance, since they will need to get their hands on module
726 A\ldots but it will not be defined prior to package p.
727
728 In any case, however, it would be good to emit a warning if a package
729 cannot be compiled without mutual recursion.
730
731 \subsection{Installing indefinite packages}\label{sec:installing-indefinite}
732
733 If an indefinite package contains no code at all, we only need
734 to install the interface file for the signatures. However, if
735 they include code, we must provide all of the
736 ingredients necessary to compile them when the holes are linked against
737 actual implementations. (Figure~\ref{fig:pkgdb})
738
739 \paragraph{Source tarball or preprocessed source?} What is the representation of the source that is saved is. There
740 are a number of possible choices:
741
742 \begin{itemize}
743 \item The original tarballs downloaded from Hackage,
744 \item Preprocessed source files,
745 \item Some sort of internal, type-checked representation of Haskell code (maybe the output of the desugarer).
746 \end{itemize}
747
748 Storing the tarballs is the simplest and most straightforward mechanism,
749 but we will have to be very certain that we can recompile the module
750 later in precisely the same we compiled it originally, to ensure the hi
751 files match up (fortunately, it should be simple to perform an optional
752 sanity check before proceeding.) The appeal of saving preprocessed
753 source, or even the IRs, is that this is conceptually this is exactly
754 what an indefinite package is: we have paused the compilation process
755 partway, intending to finish it later. However, our compilation strategy
756 for definite packages requires us to run this step using a \emph{different}
757 choice of original names, so it's unclear how much work could actually be reused.
758
759 \subsection{Restricted recursive modules ala hs-boot}\label{sec:hs-boot-restrict}
760
761 It should be possible to support GHC-style mutual recursion using the
762 Backpack formalism immediately using hs-boot files. However, to avoid
763 the need for a shaping pass, we must adopt an existing constraint that
764 already applies to hs-boot files: \emph{at the time we define a signature,
765 we must know what the original name for all data types is}. We then
766 compile modules as usual, but compiling against the signature as if it
767 were an hs-boot file.
768
769 (ToDo: Figure out why this eliminates the shaping pass)
770
771 Compiling packages in this way gives the tantalizing possibility
772 of true separate compilation: the only thing we don't know is what the actual
773 package name of an indefinite package will be, and what the correct references
774 to have are. This is a very minor change to the assembly, so one could conceive
775 of dynamically rewriting these references at the linking stage.
776
777 \section{Implementation plan}
778
779 \section{Open questions}\label{sec:open-questions}
780
781 Here are open problems about the implementation which still require
782 hashing out.
783
784 \begin{itemize}
785
786 \item In Section~\ref{sec:simplifying-backpack}, we argued that we
787 could implement Backpack without needing a shaping pass. We're
788 pretty certain that this will work for typechecking and
789 compiling fully definite packages with no recursive linking, but
790 in Section~\ref{sec:typechecking-indefinite}, we described some
791 of the prevailing difficulties of supporting signature linking.
792 Renaming is not an insurmountable problem, but backwards flow of
793 shaping information can be, and it is unclear how best to
794 accommodate this. This is probably the most important problem
795 to overcome.
796
797 \item In Section~\ref{sec:compiling-definite}, we describe two strategies
798 for compiling packages with no holes (but which depend on indefinite
799 packages). Simon and Edward still don't agree which proposal is best (Simon
800 in the downstream camp, Edward is in the upstream camp.)
801
802 \item In Section~\ref{sec:installing-indefinite}, a few choices for how to
803 store source code were pitched, however, there is not consensus on which
804 one is best.
805
806 \item What is the impact of the multiplicity of PackageIds on
807 dependency solving in Cabal? Old questions of what to prefer
808 when multiple package-versions are available (Cabal originally
809 only needed to solve this between different versions of the same
810 package, preferring the oldest version), but with signatures,
811 there are more choices. Should there be a complex solver that
812 does all signature solving, or a preprocessing step that puts
813 things back into the original Cabal version. Authors may want
814 to suggest policy for what packages should actually link against
815 signatures (so a crypto library doesn't accidentally link
816 against a null cipher package).
817
818 \end{itemize}
819
820 \bibliographystyle{plain}
821 \bibliography{backpack-impl}
822
823 \end{document}