ecf047c8ca568c40299919011dfe7d55b6d3e237
[ghc.git] / docs / backpack / backpack-impl.tex
1 \documentclass{article}
2
3 \usepackage{graphicx} %[pdftex] OR [dvips]
4 \usepackage{fullpage}
5 \usepackage{float}
6 \usepackage{titling}
7 \usepackage{hyperref}
8 \usepackage{tikz}
9 \usetikzlibrary{arrows}
10 \usetikzlibrary{positioning}
11 \setlength{\droptitle}{-6em}
12
13 \newcommand{\ghcfile}[1]{\textsl{#1}}
14
15 \title{Implementing Backpack}
16
17 \begin{document}
18
19 \maketitle
20
21 The purpose of this document is to describe an implementation path
22 for Backpack~\cite{Kilpatrick:2014:BRH:2535838.2535884} in GHC\@.
23
24 We start off by outlining the current architecture of GHC, ghc-pkg and Cabal,
25 which constitute the existing packaging system. We then state what our subgoals
26 are, since there are many similar sounding but different problems to solve. Next,
27 we describe the ``probably correct'' implementation plan, and finish off with
28 some open design questions. This is intended to be an evolving design document,
29 so please contribute!
30
31 \section{Current packaging architecture}
32
33 The overall architecture is described in Figure~\ref{fig:arch}.
34
35 \begin{figure}[H]
36 \center{\scalebox{0.8}{\includegraphics{arch.png}}}
37 \label{fig:arch}\caption{Architecture of GHC, ghc-pkg and Cabal. Green bits indicate additions from upcoming IHG work, red bits indicate additions from Backpack. Orange indicates a Haskell library.}
38 \end{figure}
39
40 Here, arrows indicate dependencies from one component to another. Color
41 coding is as follows: orange components are libaries, green components
42 are to be added with the IHG work, red components are to be added with
43 Backpack. (Thus, black and orange can be considered the current)
44
45 \subsection{Installed package database}
46
47 Starting from the bottom, we have the \emph{installed package database}
48 (actually a collection of such databases), which stores information
49 about what packages have been installed are thus available to be
50 compiled against. There is both a global database (for the system
51 administrator) and a local database (for end users), which can be
52 updated independently. One way to think about the package database
53 is as a \emph{cache of object code}. In principle, one could compile
54 any piece of code by repeatedly recompiling all of its dependencies;
55 the installed package database describes when this can be bypassed.
56
57 \begin{figure}[H]
58 \center{\scalebox{0.8}{\includegraphics{pkgdb.png}}}
59 \label{fig:pkgdb}\caption{Anatomy of a package database.}
60 \end{figure}
61
62 In Figure~\ref{fig:pkgdb}, we show the structure of a package database.
63 The installed package are created from a Cabal file through the process
64 of dependency resolution and compilation. In database terms, the primary key
65 of a package database is the InstalledPackageId
66 (Figure~\ref{fig:current-pkgid}). This ID uniquely identifies an
67 instance of an installed package. The PackageId omits the ABI hash and
68 is used to qualify linker exported symbols: the current value of this
69 parameter is communicated to GHC using the \verb|-package-id| flag.
70
71 In principle, packages with different PackageIds should be linkable
72 together in the same compiled program, whereas packages with the same
73 PackageId are not (even if they have different InstalledPackageIds). In
74 practice, GHC is currently only able to select one version of a package,
75 as it clears out all old versions of the package in
76 \ghcfile{compiler/main/Package.lhs}:applyPackageFlag.
77
78 \begin{figure}
79 \center{\begin{tabular}{r l}
80 PackageId & package name, package version \\
81 InstalledPackageId & PackageId, ABI hash \\
82 \end{tabular}}
83 \label{fig:current-pkgid}\caption{Current structure of package identifiers.}
84 \end{figure}
85
86 The database entry itself contains the information from the installed package ID,
87 as well as information such as what dependencies it was linked against, where
88 its compiled code and interface files live, its compilation flags, what modules
89 it exposes, etc. Much of this information is only relevant to Cabal; GHC
90 uses a subset of the information in the package database.
91
92 \subsection{GHC}
93
94 The two programs which access the package database directly are GHC
95 proper (for compilation) and ghc-pkg (which is a general purpose
96 command line tool for manipulating the database.) GHC relies on
97 the package database in the following ways:
98
99 \begin{itemize}
100 \item It imports the local and global package databases into
101 its runtime database, and applies modifications to the exposed
102 and trusted status of the entries via the flags \verb|-package|
103 and others (\ghcfile{compiler/main/Packages.lhs}). The internal
104 package state can be seen at \verb|-v4| or higher.
105 \item It uses this package database to find the location of module
106 interfaces when it attempts to load the module info of an external
107 module (\ghcfile{compiler/iface/LoadIface.hs}).
108 \end{itemize}
109
110 GHC itself performs a type checking phase, which generates an interface
111 file representing the module (so that later invocations of GHC can load the type
112 of a module), and then after compilation projects object files and linked archives
113 for programs to use.
114
115 \paragraph{Original names} Original names are an important design pattern
116 in GHC\@.
117 Sometimes, a name can be exposed in an hi file even if its module
118 wasn't exposed. Here is an example (compiled in package R):
119
120 \begin{verbatim}
121 module X where
122 import Internal (f)
123 g = f
124
125 module Internal where
126 import Internal.Total (f)
127 \end{verbatim}
128
129 Then in X.hi:
130
131 \begin{verbatim}
132 g = <R.id, Internal.Total, f> (this is the original name)
133 \end{verbatim}
134
135 (The reason we refer to the package as R.id is because it's the
136 full package ID, and not just R).
137
138 \subsection{hs-boot}
139
140 \verb|hs-boot| is a special mechanism used to support recursive linking
141 of modules within a package, today. Suppose I have a recursive module
142 dependency between modules and A and B. I break one of\ldots
143
144 (ToDo: describe how hs-boot mechanism works)
145
146 \subsection{Cabal}
147
148 Cabal is the build system for GHC, we can think of it as parsing a Cabal
149 file describing a package, and then making (possibly multiple)
150 invocations to GHC to perform the appropriate compilation. What
151 information does Cabal pass onto GHC\@? One can get an idea for this by
152 looking at a prototypical command line that Cabal invokes GHC with:
153
154 \begin{verbatim}
155 ghc --make
156 -package-name myapp-0.1
157 -hide-all-packages
158 -package-id containers-0.9-ABCD
159 Module1 Module2
160 \end{verbatim}
161
162 There are a few things going on here. First, Cabal has to tell GHC
163 what the name of the package it's compiling (otherwise, GHC can't appropriately
164 generate symbols that other code referring to this package might generate).
165 There are also a number of commands which configure its in-memory view of
166 the package database (GHC's view of the package database may not directly
167 correspond to what is on disk). There's also an optimization here: in principle,
168 GHC can compile each module one-by-one, but instead we use the \verb|--make| flag
169 because this allows GHC to reuse some data structures, resulting in a nontrivial
170 speedup.
171
172 (ToDo: describe cabal-install/sandbox)
173
174 \section{Goals}
175
176 Here are some of the high-level goals which motivate our improvements to
177 the module system.
178
179 \begin{itemize}
180 \item Solve \emph{Cabal hell}, a situation which occurs when conflicting
181 version ranges on a wide range of dependencies leads to a situation
182 where it is impossible to satisfy the constraints. We're seeking
183 to solve this problem in two ways: first, we want to support
184 multiple instances of containers-2.9 in the database which are
185 compiled with different dependencies (and even link them
186 together), and second, we want to abolish (often inaccurate)
187 version ranges and move to a regime where packages depend on
188 signatures.
189
190 \item Support \emph{hermetic builds with sharing}. A hermetic build
191 system is one which simulates rebuilding every package whenever
192 it is built; on the other hand, actually rebuilding every time
193 is extremely inefficient (but what happens in practice with
194 Cabal sandboxes). We seek to solve this problem with the IHG work,
195 by allowing multiple instances of a package in the database, where
196 the only difference is compilation parameters. We don't care
197 about being able to link these together in a single program.
198
199 \item Support \emph{module-level pluggability} as an alternative to
200 existing (poor) usage of type classes. The canonical example are
201 strings, where a library might want to either use the convenient
202 but inefficient native strings, or the efficient packed Text data
203 type, but would really like to avoid having to say \verb|StringLike s => ...|
204 in all of their type signatures. While we do not plan on supporting
205 separate compilation, Cabal should understand how to automatically
206 recompile these ``indefinite'' packages when they are instantiated
207 with a new plugin.
208
209 \item Support \emph{separate modular development}, where a library and
210 an application using the library can be developed and type-checked
211 separately, intermediated by an interface. This is applicable in
212 the pluggability example as well, where we want to ensure that all
213 of the $M \times N$ configurations of libraries versus applications
214 type check, by only running the typechecker $M + N$ times. A closely
215 related concern is related toolchain support for extracting a signature
216 from an existing implementation, as current Haskell culture is averse
217 to explicitly writing separate signature files.
218
219 \item Subsume existing support for \emph{mutually recursive modules},
220 without the double vision problem.
221 \end{itemize}
222
223 A \emph{non-goal} is to allow users to upgrade upstream libraries
224 without recompiling downstream. This is an ABI concern and we're not
225 going to worry about it.
226
227 \section{Aside: Recent IHG work}\label{sec:ihg}
228
229 The IHG project has allocated some funds to relax the package instance
230 constraint in the package database, so that multiple instances can be
231 stored, but now the user of GHC must explicitly list package-IDs to be
232 linked against. In the far future, it would be expected that tools like
233 Cabal automatically handle instance selection among a large number of
234 instances, but this is subtle and so this work is only to do some
235 foundational work, allowing a package database to optionally relax the
236 unique package-version requirement, and utilize environment files to
237 select which packages should be used. See Duncan's email for more
238 details on the proposal.
239
240 To implement this:
241
242 \begin{enumerate}
243
244 \item Remove the ``removal step'' when registering a package (with a flag)
245
246 \item Check \ghcfile{compiler/main/Packages.lhs}:mkPackagesState to look out for shadowing
247 within a database. We believe it already does the right thing, since
248 we already need to handle shadowing between the local and global database.
249
250 \end{enumerate}
251
252 Once these changes are implemented, we can program multiple instances by
253 using \verb|-hide-all-packages -package-id ...|, even if there is no
254 high-level tool support.
255
256 Actually, this concern is orthogonal to the purposes of Backpack, if
257 we redefine PackageId appropriately.
258
259 \paragraph{The ABI hash} Currently, InstalledPackageId
260 is constructed of a package, version and ABI hash
261 (generateRegistrationInfo in
262 \ghcfile{libraries/Cabal/Cabal/Distribution/Simple/Register.hs}). The
263 use of an ABI hash is a bit of GHC-specific hack introduced in 2009,
264 intended to make sure these installed package IDs are unique. While
265 this is quite clever, using the ABI is actually a bit inflexible, as one
266 might reasonably want to have multiple copies of a package with the same
267 ABI but different source code changes.\footnote{In practice, our ABIs
268 are so unstable that it doesn't really matter.}
269
270 In Figure~\ref{fig:proposed-pkgid}, there is an alternate logical
271 representation of InstalledPackageId which attempts to extricate the
272 notion of ABI compatibility from what actually might uniquely identify a
273 package beyond its PackageId. We imagine these components to be:
274
275 \begin{itemize}
276 \item A hash of the source code (so one can register different
277 in-development versions without having to bump the version
278 number);
279 \item Compilation way (profiling? dynamic?)
280 \item Compilation flags (such as compilation way, optimization,
281 profiling settings)\footnote{This is a little undefined on a package bases, because in principle the flags could be varied on a per-file basis. More likely this will be approximated against the relevant fields in the Cabal file as well as arguments passed to Cabal.};
282 \end{itemize}
283
284 A historical note: in the 2012 GSoC project to allow multiple instances
285 of a package to be installed at the same time, use of \emph{random
286 numbers} was used to workaround the inability to get an ABI early
287 enough. We are not using this plan.
288
289 \section{Concrete physical identity = PackageId + Module name + Module dependencies}\label{sec:ipi}
290
291 \begin{figure}
292 \center{\begin{tabular}{r l}
293 PackageId & hash of ``package name, package version, dependency resolution, \emph{module} environment'' \\
294 InstalledPackageId & hash of ``PackageId, source code, way, compiler flags'' \\
295 \end{tabular}}
296 \label{fig:proposed-pkgid}\caption{Proposed new structure of package identifiers.}
297 \end{figure}
298
299 In Backpack, there needs to be some mechanism for assigning
300 \emph{physical module identities} to modules, which are essential for
301 typechecking Backpack packages, since they let us tell if two types are
302 equal or not.
303
304 In GHC, our method of testing whether or not two types are equal is by
305 comparing their original names, which is a tuple of a PackageId and the
306 module name (summarized in Figure~\ref{fig:current-pkgid}). If it looks
307 like a duck and quacks like a duck, it is a duck: we might reasonable
308 imagine that \emph{concrete physical identity = PackageId and module
309 name}. But is this sufficient to represent Backpack's physical module
310 identities?
311
312 If we look at how physical module identities are allocated in Paper Backpack,
313 we can see that each module corresponds to an identity constructor (in
314 Backpack this is the package name, some representation of the module
315 source, and a disambiguating integer if module source is duplicated)
316 \emph{as well as} physical module identities which were applied to
317 this constructor. The current PackageId + Module name scheme adequately encodes
318 the constructor (in particular, a single package can't have multiple modules
319 with the same name), but it does not encode the \emph{module} dependencies.
320 We have to add these to our original names. There are three ways we can do
321 this:
322
323 \begin{enumerate}
324 \item Augment original names to have a third component. This is less than ideal because it requires pervasive changes to all parts of the compiler.
325 \item Augment the module name to record module dependency information. This is a bit of an odd place to put the information, because our module names correspond quite concretely to the actual source file a user wrote some source in.
326 \item Augment the PackageId to record module dependency information. For now, this is the regime we will use; however, see Section~\ref{sec:flatten} for why the issue is more subtle.
327 \end{enumerate}
328
329 (XXX Figure~\ref{fig:proposed-pkgid} is out of date now.)
330
331 And now, the complications\ldots
332
333 \paragraph{Relaxing package selection restrictions} As mentioned
334 previously, GHC is unable to select multiple packages with the same
335 package name (but different PackageIds). This restriction needs to be
336 lifted. For backwards compatibility reasons, it's probably better to
337 not overload \verb|-package-id| but add a new flag, maybe \verb|-force-package-id|;
338 we also need to disable the old version masking behavior. This is orthogonal
339 to the IHG work, which wants to allow multiple \emph{InstalledPackageIds} in the
340 \emph{database} (here, we want to allow multiple \emph{PackageIds} in \emph{compiled code}).
341
342 \paragraph{Linker symbols} As we increase the amount of information in
343 PackageId, it's important to be careful about the length of these IDs,
344 as they are used for exported linker symbols (e.g.
345 \verb|base_TextziReadziLex_zdwvalDig_info|). Very long symbol names
346 hurt compile and link time, object file sizes, GHCi startup time,
347 dynamic linking, and make gdb hard to use. As such, the current plan is
348 to do away with full package names and versions, and instead use just a
349 base-62 encoded hash, perhaps with the first four characters of the package
350 name for user-friendliness.
351
352 \paragraph{Wired-in names} One annoying thing to remember is that GHC
353 has wired-in names, which refer to packages without any version. Now
354 these wired names also have to accomodate dependency trees. A
355 suggested approach is to have a fixed table from these wired names to
356 package IDs; alternately we can use something like the special \verb|inplace|
357 version number.
358
359 \paragraph{Version numbers} The interaction of Backpack's package polymorphism
360 (the ability to mixin different implementations to holes) and Cabal's dependency
361 resolution mechanism (which permits a name libfoo to be resolved to a specific
362 version libfoo-0.1) can be subtle: there are in fact \emph{two} levels of
363 indirection going on here.
364
365 The simplest way to think about the distinction is as follows. When I write
366 a Backpack package which does \verb|include foobar|, I have not actually
367 written a Paper Backpack package. Instead, I have written a \emph{pre-package},
368 which Cabal's dependency solver then takes and rewrites all package references
369 into versioned references \verb|include foobar-0.1|, which now corresponds to
370 a Paper Backpack package. For example, this is a pre-package:
371
372 \begin{verbatim}
373 package foo where
374 include bar
375 \end{verbatim}
376
377 and this is a Paper Backpack package:
378
379 \begin{verbatim}
380 package foo-0.3(bar-0.1) where
381 include bar-0.1(baz-0.2)
382 \end{verbatim}
383
384 Notably, we must \emph{rename} the package to include information about
385 how we resolved all of the inner package references, and if these inner
386 package references had dependencies, those must be included too! In
387 effect, the \emph{dependency resolution} must be encoded into the package ID,
388 along with the existing Backpack \emph{physical identity regular tree}.
389 Phew!
390
391 \paragraph{Free variables (or, what is a non-concrete physical
392 identity?)} Physical identities in their full generality are permitted
393 to have free variables, which represent holes. Handling this is a
394 tricky question, and we defer it to Section~\ref{sec:typechecking-indefinite}, when
395 we talk about packages with holes.
396
397 \section{The granularity of packages versus system libraries}\label{sec:flatten}
398
399 In the original design of GHC's packaging system, the intent was that
400 a package be compiled into a system library, e.g. \verb|libHSbase-4.7.1.0.so|
401 or \verb|libHSbase-4.7.1.0.a|. Dependencies between packages translate simply
402 into dependencies between system libraries.
403
404 Interestingly enough, Paper Backpack specifies dependency tracking at
405 the \emph{module} level, which is a finer granularity than we have been
406 tracking previously. Here is a Backpack package which demonstrates the
407 issue:
408
409 \begin{verbatim}
410 package a where
411 P :: ...
412 Q :: [ x :: Int ]
413 A = [ import P; ... ]
414 B = [ import Q; ... ]
415 package p where
416 P = [ ... ]
417 package x where
418 include p
419 Q = [ x = 0 ]
420 include a
421 package y where
422 include p
423 Q = [ x = 1 ]
424 include a
425 \end{verbatim}
426
427 Here, we have a package \verb|a| which probably should have been defined
428 as two separate packages (since \verb|A| only relies on \verb|P| and
429 \verb|B| only relies on \verb|Q|), but the functionality has been
430 glommed together. Then, the downstream packages \verb|x| and \verb|y|
431 fill in the holes using the same implementation of \verb|P|, but
432 differing implementations of \verb|Q|. (For more explanation about how
433 we would go about compiling this set of packages, please see
434 Section~\ref{sec:compiling-definite}.) The operant question: is module
435 \verb|A| from package \verb|x| the same as module \verb|A| from package
436 \verb|y|? In Paper Backpack, the answer is yes.
437
438 From an end-user perspective, why is this desirable? One answer is that
439 it permits users to modularize third-party packages which were not
440 packaged with modularity in mind, but happen to be modular. For
441 example, when libraries ship with test-cases, they currently have to
442 split these test-cases to separate packages, so as to not introduce
443 spurious dependencies with various test frameworks, which the user may
444 not have or care about. If dependencies are done on a per-module basis,
445 as long as the user doesn't import a test module, they never gain the
446 extra dependency. Another situation is when a library also happens to
447 have a helper utility module which doesn't have any of the dependencies
448 of the primary library. And yet another situation is when we want to
449 define type class instances for data types without orphan instances, but
450 without pulling in a pile of packages defining type classes for things
451 an end user doesn't care about.
452
453 From an implementation perspective, however, this answer is quite distressing:
454
455 \begin{enumerate}
456 \item When I install package \verb|x| and then install package
457 \verb|y|, what should happen to the installed module files for
458 \verb|A|? Currently, the installed package database organizes
459 these files at the package level, so it is not clear where these
460 files should live---we might even accidentally duplicate them!
461 Similarly, when I typecheck, I need to ensure that the original
462 name for \verb|x:A| and \verb|y:A| are the same: thus, the
463 PackageId I assign cannot be \verb|x| or \verb|y|.
464
465 \item When I create a \verb|libHSx.so| and a \verb|libHSy.so|, which
466 dynamic library contains the object files for \verb|A|? Either
467 of these dynamic libraries could be used alone, so both must
468 contain \verb|A|. But an application could also use both at the
469 same time, at which point a program will see two copies of the
470 program text for \verb|A|.
471 \end{enumerate}
472
473 The first problem can be solved by ``flattening'' the package database,
474 so that instead of organizing object files by library, the database
475 is just a directory of physical module identities to objects/interfaces.
476 Instaled packages are now represented as (possibly overlapping) sets over
477 this store of modules. However, the second problem is far more persistent:
478 if a package is a dynamic library, we still are unable to get the sharing of
479 object files necessary. (This problem isn't present for static linking, where
480 one doesn't actually have to link against \verb|libHSx.a|, the object files
481 are just fine.)
482
483 In the next few sections, we describe a few different designs for Backpack which
484 vary the granularity of packages. In general, the coarser the granularity,
485 the closer to the current package-library correspondence the design is, but
486 at the cost of giving up more sharing. Finer granularity requires smaller
487 and smaller corresponding libraries. We'll use the following running example.
488
489 \begin{verbatim}
490 package p where
491 A :: [ a :: Bool ]
492 B = [ import A; b = a ]
493 C = [ c = True ]
494
495 package q where
496 A = [ a = False ]
497 include p
498 D = [ import C; d = c ]
499 E = [ import C; e = c ]
500 \end{verbatim}
501
502 As a notational convenience, we'll omit the full physical identity of a
503 module when it's clear from context. We'll start by recapping the
504 finest granularity, and coarsen.
505
506 \subsection{One library per physical module identity}
507
508 In this world, we generate a library per physical module identity.
509 Briefly, the physical module identities for our running example
510 are \verb|q:A|, \verb|p:B(q:A)|, \verb|p:C|, \verb|q:D(p:C)| and \verb|q:E(p:C)|.
511 This results in the following libraries (and their respective
512 dependencies): \\
513
514 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=6cm,
515 thick,m/.style={rectangle,draw,font=\sffamily\Large\bfseries}]
516 \node[m] (1) {libHSq:A.so};
517 \node[m] (2) [left of=1] {libHSp:B.so};
518 \node[m] (3) [below=0.3cm of 1] {libHSp:C.so};
519 \node[m] (4) [left of=3] {libHSq:D.so};
520 \node[m] (5) [below=0.3cm of 4] {libHSq:E.so};
521 \path
522 (2) edge node {} (1)
523 (4) edge node {} (3)
524 (5) edge node {} (3)
525 ;
526 \end{tikzpicture}
527
528 Notice that there is no ``ordering'' that the packages could have been
529 built in; rather, we walk the package tree, building modules as we encounter
530 them (thus, the instantiated package \verb|p| is fully built before \verb|q|).
531
532 \subsection{One library per package identity}
533
534 In this world, we simplify physical module identities in the following
535 way: we the module names and only consider the package components of
536 physical module identities to identify what library a package should
537 live in. In the running example, the simplified physical module identities
538 are as follows: \verb|q|, \verb|p(q)|, \verb|p| and \verb|q(p)|. This
539 results in a very similar module graph: \\
540
541 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=6cm,
542 thick,m/.style={rectangle,draw,font=\sffamily\Large\bfseries}]
543 \node[m] (1) {libHSq.so (A)};
544 \node[m] (2) [left of=1] {libHSp(q).so (B)};
545 \node[m] (3) [below=0.5cm] {libHSp.so (C)};
546 \node[m] (4) [left of=3] {libHSq(p).so (D,E)};
547 \path
548 (2) edge node [right] {} (1)
549 (4) edge node [right] {} (3);
550 \end{tikzpicture}
551
552 The primary difference seen here is that modules \verb|q:D(p:C)| and \verb|q:E(p:C)| have
553 been placed in the same library, \verb|libHSq(p).so|, since they both erase
554 to \verb|q(p)|.
555
556 This scheme preserves all of the type-level sharing properties that Backpack
557 requires, since modules with identical physical identities are guaranteed
558 to be placed into the same library. However, there are a few cases where
559 one might need to add extra code to a library ``after the fact'', usually
560 when renaming is involved:
561
562 \begin{verbatim}
563 package p where
564 A :: [ a :: Bool ]
565 B = [ import A; b = a ]
566 package q where
567 A1 = [ a = True ]
568 A2 = [ a = False ]
569 package linker1 where
570 include q
571 include p (A as A1, B as B1)
572 \end{verbatim}
573
574 Installing \verb|linker1| results in libraries \verb|libHSq.so| (A1 and A2)
575 as well as \verb|libHSp(q).so| (physical identity \verb|p:B(q:A1)|---note the 1).
576
577 However, now, suppose we install this package:
578
579 \begin{verbatim}
580 package linker2 where
581 include q
582 include p (A as A2, B as B2)
583 \end{verbatim}
584
585 We get a new module B with physical identity \verb|p:B(q:A2)|.
586 Unfortunately, this erases to \verb|p(q)|, so we are obligated to also
587 place it in \verb|libHSq(p).so|.
588
589 \subsection{One library per package identity (more aggressive)}
590
591 (ToDo: Remember what this did)
592
593 \subsection{One library per definite package}
594
595 \begin{tikzpicture}[->,>=stealth',shorten >=1pt,auto,node distance=6cm,
596 thick,m/.style={rectangle,draw,font=\sffamily\Large\bfseries}]
597 \node[m] (1) {libHSq.so (A,B,C,D,E)};
598 \end{tikzpicture}
599
600 In this world, we create a dynamic library per definite package (package with
601 no holes). Indefinite packages don't get compiled into libraries, the code
602 is duplicated and type equality is only seen if a type came from the same
603 definite package. In the running example, we only generate \verb|libHSq.so|
604 which exports all of the modules (\verb|p| is indefinite), and if another
605 package instantiates \verb|p|, the instances of \verb|C| will be considered
606 different.
607
608 (ToDo: Why is this world problematic?)
609
610 \paragraph{Operating system linking} When the package database is flattened
611 into a collection of modules, it becomes less clear how to generate library
612 files corresponding to a package. One possibility is to simply take the
613 set of files corresponding to a package and wrap it up into a library.
614 If an end-user links against two libraries which include the same object file,
615 the linker needs to suppress the symbols associated with one of the instances
616 of the object file (it's identical, right?)\footnote{It may not actually be
617 possible to do this in the static linking case: one must refer to the actual object
618 files}.
619
620 If your modules are truly applicative, this might even work OK\@. However, you will
621 be a sad panda if there is any unsafe, mutable global state in the shared
622 object (since the object files will each get separate data segments for this
623 global state): a generative semantics.\footnote{Even if we did get this right,
624 which would be possible when statically linking these modules together, the
625 interaction could still be surprising: Backpack can remodularize modules, but
626 it can't remodularize values inside a module, so if a module has a dependency
627 but some global state in the module doesn't, the resulting behavior can be
628 surprising. Perhaps the moral of the story really is, ``Don't do side effects
629 in an applicative module system! No really!''} \\
630
631 \subsection{Alternatives to flattening package DB}
632 Flattening can be seen as one of four different representations of packages
633 at the OS/library level. While it promotes maximal sharing of identical
634 modules, it is perhaps too fine-grained for most purposes.
635 \emph{ToDo: Describe the alternatives.}
636
637 \paragraph{Package slicing} Instead of changing the package database,
638 we automatically
639 slice a single package into multiple packages, so that the sliced
640 packages have dependencies which accurately reflect their constitutent
641 modules. For a well modularized package, the slicing operation should
642 be a no-op. This would also be useful in some other situations (see the
643 \verb|-module-env| discussion in Section~\ref{sec:compiling-definite}).
644 In fact, which slice a module should be placed in can be automatically
645 calculated by taking the package name with the regular tree
646 associated with the module (Section~\ref{sec:ipi}).
647
648 A minor downside of package slicing is in a dynamically linked environment,
649 we pay a performance cost when we have to jump from one dynamic library
650 to another, and slicing could introduce are large number of separate
651 dynamic libraries which need to be switched between each other.
652
653 Edward likes this proposal.
654
655 \paragraph{Changing Backpack to disallow fine-grained dependencies}
656
657 Another perspective is that we fell into a trap when we decided that
658 dependencies should be calculated per-module. What if, instead, we
659 expanded the dependency of each module to cover \emph{all signatures}
660 in the package? Then the dependency tree would always be the same and
661 the package would be shared only if all holes were precisely the same.
662 Our motivating example, then, would fail to witness sharing.
663
664 This might be the simplest thing to do, but it is a change in the
665 Backpack semantics, and rules out modularization without splitting a package
666 into multiple packages. Maybe Scott can give other reasons why this
667 would not be so good. SPJ is quite keen on this plan.
668
669 \section{Exposed modules should allow external modules}\label{sec:reexport}
670
671 In Backpack, the definition of a package consists of a logical context,
672 which maps logical module names to physical module names. These do not
673 necessarily coincide, since some physical modules may have been defined
674 in other packages and mixed into this package. This mapping specifies
675 what modules other packages including this package can access.
676 However, in the current installed package database, we have exposed-modules which
677 specify what modules are accessible, but we assume that the current
678 package is responsible for providing these modules.
679
680 To implement Backpack, we have to extend this exposed-modules (``Export declarations''
681 on Figure~\ref{fig:pkgdb}). Rather
682 than a list of logical module names, we provide a new list of tuples:
683 the exported logical module name and original physical module name (this
684 is in two parts: the InstalledPackageId and the original module name).
685 For example, an traditional module export is simply (Name, my-pkg-id, Name);
686 a renamed module is (NewName, my-pkg-id, OldName), and an external module
687 is (Name, external-pkg-id, Name).
688
689 As an example:
690
691 \begin{verbatim}
692 package P where
693 M = ...
694 N = ...
695 package Q (M, R, T)
696 include P (N -> R)
697 T = ...
698 \end{verbatim}
699
700 And now if we look at Q\@:
701
702 \begin{verbatim}
703 exposed-modules:
704 <M, P.id, M>
705 <R, P.id, N>
706 <T, Q.id, T>
707 \end{verbatim}
708
709 When we compile Q, and the interface file gets generated, we have
710 to generate identifiers for each of the exposed modules. These should
711 be calculated to directly refer to the ``original name'' of each them;
712 so for example M and R point directly to package P, but they also
713 include the original name they had in the original definition.
714
715 \paragraph{Error messages} When it is possible for multiple physical
716 entities to have the same ``user-friendly'' name, this can result in a
717 very confusing situation if both entities have to be referred to in the
718 same message. This is especially true when renaming is in place:
719 you not only want to print out the name in scope, but probably some indication
720 of what the original name is. In short, when it comes to error messages, tread with care!
721
722 \section{Shapeless Backpack}\label{sec:simplifying-backpack}
723
724 Backpack as currently defined always requires a \emph{shaping} pass,
725 which calculates the shapes of all modules defined in a package.
726 The shaping pass is critical to the solution of the double-vision problem
727 in recursive module linking, but it also presents a number of unpalatable
728 implementation problems:
729
730 \begin{itemize}
731
732 \item \emph{Shaping is a lot of work.} A module shape specifies the
733 providence of all data types and identifiers defined by a
734 module. To calculate this, we must preprocess and parse all
735 modules, even before we do the type-checking pass. (Fortunately,
736 shaping doesn't require a full parse of a module, only enough
737 to get identifiers, which makes it a slightly more expensive
738 version of \verb|ghc -M|.)
739
740 \item \emph{Shaping must be done upfront.} In the current Backpack
741 design, all shapes must be computed before any typechecking can
742 occur. While performing the shaping pass upfront is necessary
743 in order to solve the double vision problem (where a module
744 identity may be influenced by later definitions), it means
745 that GHC must first do a shaping pass, and then revisit every module and
746 compile them proper. Nor is it (easily) possible to skip the
747 shaping pass when it is unnecessary, as one might expect to be
748 the case in the absence of mutual recursion. Shaping is not
749 a ``pay as you go'' language feature.
750
751 \item \emph{GHC can't compile all programs shaping accepts.} Shaping
752 accepts programs that GHC, with its current hs-boot mechanism, cannot
753 compile. In particular, GHC requires that any data type or function
754 in a signature actually be \emph{defined} in the module corresponding
755 to that file (i.e., an original name can be assigned to these entities
756 immediately.) Shaping permits unrestricted exports to implement
757 modules; this shows up in the formalism as $\beta$ module variables.
758
759 \item \emph{Shaping encourages inefficient program organization.}
760 Shaping is designed to enable mutually recursive modules, but as
761 currently implemented, mutual recursion is less efficient than
762 code without recursive dependencies. Programmers should avoid
763 this code organization, except when it is absolutely necessary.
764
765 \item \emph{GHC is architecturally ill-suited for directly
766 implementing shaping.} Shaping implies that GHC's internal
767 concept of an ``original name'' be extended to accommodate
768 module variables. This is an extremely invasive change to all
769 aspects of GHC, since the original names assumption is baked
770 quite deeply into the compiler. Plausible implementations of
771 shaping requires all these variables to be skolemized outside
772 of GHC\@.
773
774 \end{itemize}
775
776 To be clear, the shaping pass is fundamentally necessary for some
777 Backpack packages. Here is the example which convinced Simon:
778
779 \begin{verbatim}
780 package p where
781 A :: [data T; f :: T -> T]
782 B = [export T(MkT), h; import A(f); data T = MkT; h x = f MkT]
783 A = [export T(MkT), f, h; import B; f MkT = MkT]
784 \end{verbatim}
785
786 The key to this example is that B \emph{may or may not typecheck} depending
787 on the definition of A. Because A reexports B's definition T, B will
788 typecheck; but if A defined T on its own, B would not typecheck. Thus,
789 we \emph{cannot} typecheck B until we have done some analysis of A (the
790 shaping analysis!)
791
792 Thus, it is beneficial (from an optimization point of view) to
793 consider a subset of Backpack for which shaping is not necessary.
794 Here is a programming discipline which does just that, which we will call the \textbf{linking restriction}: \emph{Module implementations must be declared before
795 signatures.} Formally, this restriction modifies the rule for merging
796 polarized module shapes ($\widetilde{\tau}_1^{m_1} \oplus \widetilde{\tau}_2^{m_2}$) so that
797 $\widetilde{\tau}_1^- \oplus \widetilde{\tau}_2^+$ is always undefined.\footnote{This seemed to be the crispest way of defining the restriction, although this means an error happens a bit later than I'd like it to: I'd prefer if we errored while merging logical contexts, but we don't know what is a hole at that point.}
798
799 Here is an example of the linking restriction. Consider these two packages:
800
801 \begin{verbatim}
802 package random where
803 System.Random = [ ... ].hs
804
805 package monte-carlo where
806 System.Random :: ...
807 System.MonteCarlo = [ ... ].hs
808 \end{verbatim}
809
810 Here, random is a definite package which may have been compiled ahead
811 of time; monte-carlo is an indefinite package with a dependency
812 on any package which provides \verb|System.Random|.
813
814 Now, to link these two applications together, only one ordering
815 is permissible:
816
817 \begin{verbatim}
818 package myapp where
819 include random
820 include monte-carlo
821 \end{verbatim}
822
823 If myapp wants to provide its own random implementation, it can do so:
824
825 \begin{verbatim}
826 package myapp2 where
827 System.Random = [ ... ].hs
828 include monte-carlo
829 \end{verbatim}
830
831 In both cases, all of \verb|monte-carlo|'s holes have been filled in by the time
832 it is included. The alternate ordering is not allowed.
833
834 Why does this discipline prevent mutually recursive modules? Intuitively,
835 a hole is the mechanism by which we can refer to an implementation
836 before it is defined; otherwise, we can only refer to definitions which
837 preceed our definition. If there are never any holes \emph{which get filled},
838 implementation links can only go backwards, ruling out circularity.
839
840 It's easy to see how mutual recursion can occur if we break this discipline:
841
842 \begin{verbatim}
843 package myapp2 where
844 include monte-carlo
845 System.Random = [ import System.MonteCarlo ].hs
846 \end{verbatim}
847
848 \subsection{Typechecking of definite modules without shaping}
849
850 If we are not carrying out a shaping pass, we need to be able to calculate
851 $\widetilde{\Xi}_{\mathsf{pkg}}$ on the fly. In the case that we are
852 compiling a package---there will be no holes in the final package---we
853 can show that shaping is unnecessary quite easily, since with the
854 linking restriction, everything is definite from the get-go.
855
856 Observe the following invariant: at any given step of the module
857 bindings, the physical context $\widetilde{\Phi}$ contains no
858 holes. We can thus conclude that there are no module variables in any
859 type shapes. As the only time a previously calculated package shape can
860 change is due to unification, the incrementally computed shape is in
861 fact the true one.
862
863 As far as the implementation is concerned, we never have to worry
864 about handling module variables; we only need to do extra typechecks
865 against (renamed) interface files.
866
867 \subsection{Compilation of definite modules}\label{sec:compiling-definite}
868
869 Of course, we still have to compile the code, and this includes any
870 subpackages which we have mixed in the dependencies to make them fully
871 definite. Let's take the following set of packages as an example:
872
873 \begin{verbatim}
874 package pkg-a where
875 A = ...
876 B = ... -- this code is ignored
877 package pgk-b where -- indefinite package
878 A :: ...
879 B = [ import A; ... ]
880 package pkg-c where
881 include pkg-a (A)
882 include pkg-b
883 C = [ import B; ... ]
884 \end{verbatim}
885
886 With the linking invariant, we can simply walk the Backpack package ``tree'',
887 compiling each of its dependencies. Let's walk through it explicitly.\footnote{To simplify matters, we assume that there is only one instance of any
888 PackageId in the database, so we omit the unique-ifying hashes from the
889 ghc-pkg registration commands; we ignore the existence of version numbers
890 and Cabal's dependency solver; finally, we do the compilation in
891 one-shot mode, even though Cabal in practice will use the Make mode.}
892
893 First, we have to build \verb|pkg-a|. This package has no dependencies
894 of any kind, so compiling is much like compiling ordinary Haskell. If
895 it was already installed in the database, we wouldn't even bother compiling it.
896
897 \begin{verbatim}
898 ADEPS = # empty!
899 ghc -c A.hs -package-name pkg-a-ADEPS
900 ghc -c B.hs -package-name pkg-a-ADEPS
901 # install and register pkg-a-ADEPS
902 \end{verbatim}
903
904 Next, we have to build \verb|pkg-b|. This package has a hole \verb|A|,
905 intuitively, it depends on package A. This is done in two steps:
906 first we check if the signature given for the hole matches up with the
907 actual implementation provided, and then we build the module properly.
908
909 \begin{verbatim}
910 BDEPS = "A -> pkg-a-ADEPS:A"
911 # This doesn't actually create an hi-boot file
912 ghc -c A.hs-boot -package-name pkg-b-BDEPS -module-env BDEPS
913 ghc -c B.hs -package-name pkg-b-BDEPS -module-env BDEPS
914 # install and register pkg-b-BDEPS
915 \end{verbatim}
916
917 Notably, these commands diverge slightly from the traditional compilation process.
918 Traditionally, we would specify the flags
919 \verb|-hide-all-packages| \verb|-package-id package-a-ADEPS| in order to
920 let GHC know that it should look at this package to find modules,
921 including \verb|A|. However, if we did this here, we would also pull in
922 module B, which is incorrect, as this module was thinned out in the parent
923 package description. Conceptually, this package is being compiled in the
924 context of some module environment \verb|BDEPS| (a logical context, in Backpack lingo)
925 which maps modules to original names and is utilized by the module finder to
926 lookup the import in \verb|B.hs|.
927
928 Finally, we created all of the necessary subpackages, and we can compile
929 our package proper.
930
931 \begin{verbatim}
932 CDEPS = # empty!!
933 ghc -c C.hs -package-name pkg-c-CDEPS -hide-all-packages \
934 -package pkg-a-ADEPS -hide-module B \
935 -package pkg-b-BDEPS
936 # install and register package pkg-c-CDEPS
937 \end{verbatim}
938
939 This mostly resembles traditional compilation, but there are a few
940 interesting things to note. First, GHC needs to know about thinning/renaming
941 in the package description (here, it's transmitted using the \verb|-hide-module|
942 command, intended to apply to the most recent package definition).\footnote{Concrete
943 command line syntax is, of course, up for discussion.} Second, even though C
944 ``depends'' on subpackages, these do not show in its package-name identifier,
945 e.g. CDEPS\@. This is because this package \emph{chose} the values of ADEPS and BDEPS
946 explicitly (by including the packages in this particular order), so there are no
947 degrees of freedom.\footnote{In the presence of a Cabal-style dependency solver
948 which associates a-0.1 with a concrete identifier a, these choices need to be
949 recorded in the package ID.} Finally, in principle, we could have also used
950 the \verb|-module-env| flag to communicate how to lookup the B import (notice
951 that the \verb|-package pkg-a-ADEPS| argument is a bit useless because we
952 never end up using the import.) I will talk a little more about the tradeoffs
953 shortly.
954
955 Overall, there are a few important things to notice about this architecture.
956 First, because the \verb|pkg-b-BDEPS| product is installed, if in another package
957 build we instantiate the indefinite module B with exactly the same \verb|pkg-a|
958 implementation, we can skip the compilation process and reuse the version.
959 This is because the calculated \verb|BDEPS| will be the same, and thus the package
960 IDs will be the same.
961
962 XXX ToDo: actually write down pseudocode algorithm for this
963
964 \paragraph{Module environment or package flags?} In the previous
965 section, I presented two ways by which one can tweak the behavior of
966 GHC's module finder, which is responsible for resolving \verb|import B|
967 into an actual physical module. The first, \verb|-module-env| is to
968 explicitly describe a full mapping from module names to original names;
969 the second, \verb|-package| with \verb|-hide-module| and
970 \verb|-rename-module|, is to load packages as before but apply
971 thinning/renaming as necessary.
972
973 In general, it seems that using \verb|-package| is desirable, because its
974 input size is smaller. (If a package exports 200 modules, we would be obligated
975 to list all of them in a module environment.) However, a tricky situation
976 occurs when some of these modules come from a parent package:
977
978 \begin{verbatim}
979 package myapp2 where
980 System.Random = [ ... ].hs
981 include monte-carlo
982 App = [ ... ].hs
983 \end{verbatim}
984
985 Here, monte-carlo depends on a ``subpart of the myapp2 package'', and it's
986 not entirely clear how monte-carlo should be represented in the installed
987 package database: should myapp2 be carved up into pieces so that subparts
988 of its package description can be installed to the database? A package
989 stub like this would never used by any other package, it is strictly local.
990
991 On the other hand, if one uses a module environment for specifying how
992 \verb|monte-carlo| should handle \verb|System.Random|, we don't actually
993 have to create a stub package: all we have to do is make sure GHC knows
994 how to find the module with this original name. To make things better,
995 the size of the module environment will only be as long as all of the
996 holes visible to the module in the package description, so the user will
997 have explicitly asked for all of this pluggability.
998
999 The conclusion seems clear: package granularity for modules from includes,
1000 and module environments for modules from parent packages!
1001
1002 \paragraph{An appealing but incorrect alternative} In this section,
1003 we briefly describe an alternate compilation strategy that might
1004 sound good, but isn't so good. The basic idea is, when compiling
1005 \verb|pkg-c|, to compile all of its indefinite packages as if the
1006 package were one single, big package.
1007 (Of course, if we encounter a
1008 definite package, don't bother recompiling it; just use it directly.)
1009 In particular, we maintain the invariant that any code generated will
1010 export symbols under the current package's namespace. So the identifier
1011 \verb|b| in the example becomes a symbol \verb|pkg-c_pkg-b_B_b| rather
1012 than \verb|pkg-b_B_b| (package subqualification is necessary because
1013 package C may define its own B module after thinning out the import.)
1014
1015 The fatal problem with this proposal is that it doesn't implement applicative
1016 semantics beyond compilation units. While modules within a single
1017 compilation will get reused, if there is another package:
1018
1019 \begin{verbatim}
1020 package pkg-d where
1021 include pkg-a
1022 include pkg-b
1023 \end{verbatim}
1024
1025 when it is compiled by itself, it will generate its own instance of B,
1026 even though it should be the same as C. This is bad news.
1027
1028 \subsection{Restricted recursive modules ala hs-boot}\label{sec:hs-boot-restrict}
1029
1030 It should be possible to support GHC-style mutual recursion using the
1031 Backpack formalism immediately using hs-boot files. However, to avoid
1032 the need for a shaping pass, we must adopt an existing constraint that
1033 already applies to hs-boot files: \emph{at the time we define a signature,
1034 we must know what the original name for all data types is}. In practice,
1035 GHC enforces this by stating that: (1) an hs-boot file must be
1036 accompanied with an implementation, and (2) the implementation must
1037 in fact define (and not reexport) all of the declarations in the signature.
1038
1039 Why does this not require a shaping pass? The reason is that the
1040 signature is not really polymorphic: we require that the $\alpha$ module
1041 variable be resolved to a concrete module later in the same package, and
1042 that all the $\beta$ module variables be unified with $\alpha$. Thus, we
1043 know ahead of time the original names and don't need to deal with any
1044 renaming.\footnote{This strategy doesn't completely resolve the problem
1045 of cross-package mutual recursion, because we need to first compile a
1046 bit of the first package (signatures), then the second package, and then
1047 the rest of the first package.}
1048
1049 Compiling packages in this way gives the tantalizing possibility
1050 of true separate compilation: the only thing we don't know is what the actual
1051 package name of an indefinite package will be, and what the correct references
1052 to have are. This is a very minor change to the assembly, so one could conceive
1053 of dynamically rewriting these references at the linking stage. But
1054 separate compilation achieved in this fashion would not be able to take
1055 advantage of cross-module optimizations.
1056
1057 \section{Shaped Backpack}
1058
1059 Despite the simplicity of shapeless Backpack with the linking
1060 restriction in the absence of holes, we will find that when we have
1061 holes, it will be very difficult to do type-checking without
1062 some form of shaping. This section is very much a work in progress,
1063 but the ability to typecheck against holes, even with the linking restriction,
1064 is a very important part of modular separate development, so we will need
1065 to support it at some ponit.
1066
1067 \subsection{Efficient shaping}
1068
1069 (These are Edward's opinion, he hasn't convinced other folks that this is
1070 the right way to do it.)
1071
1072 In this section, I want to argue that, although shaping constitutes
1073 a pre-pass which must be run before compilation in earnest, it is only
1074 about as bad as the dependency resolution analysis that GHC already does
1075 in \verb|ghc -M| or \verb|ghc --make|.
1076
1077 In Paper Backpack, what information does shaping compute? It looks at
1078 exports, imports, data declarations and value declarations (but not the
1079 actual expressions associated with these values.) As a matter of fact,
1080 GHC already must look at the imports associated with a package in order
1081 to determine the dependency graph, so that it can have some order to compile
1082 modules in. There is a specialized parser which just parses these statements,
1083 and then ignores the rest of the file.
1084
1085 A bit of background: the \emph{renamer} is responsible for resolving
1086 imports and figuring out where all of these entities actually come from.
1087 SPJ would really like to avoid having to run the renamer in order to perform
1088 a shaping pass.
1089
1090 \paragraph{Is it necessary to run the Renamer to do shaping?}
1091 Edward and Scott believe the answer is no, well, partially.
1092 Shaping needs to know the original names of all entities exposed by a
1093 module/signature. Then it needs to know (a) which entities a module/signature
1094 defines/declares locally and (b) which entities that module/signature exports.
1095 The former, (a), can be determined by a straightforward inspection of a parse
1096 tree of the source file.\footnote{Note that no expression or type parsing
1097 is necessary. We only need names of local values, data types, and data
1098 constructors.} The latter, (b), is a bit trickier. Right now it's the Renamer
1099 that interprets imports and exports into original names, so we would still
1100 rely on that implementation. However, the Renamer does other, harder things
1101 that we don't need, so ideally we could factor out the import/export
1102 resolution from the Renamer for use in shaping.
1103
1104 Unfortunately the Renamer's import resolution analyzes \verb|.hi| files, but for
1105 local modules, which haven't yet been typechecked, we don't have those.
1106 Instead, we could use a new file format, \verb|.hsi| files, to store the shape of
1107 a locally defined module. (Defined packages are bundled with their shapes,
1108 so included modules have \verb|.hsi| files as well.) (What about the logical
1109 vs.~physical distinction in file names?) If we refactor the import/export
1110 resolution code, could we rewrite it to generically operate on both
1111 \verb|.hi| files and \verb|.hsi| files?
1112
1113 Alternatively, rather than storing shapes on a per-source basis, we could
1114 store (in memory) the entire package shape. Similarly, included packages
1115 could have a single shape file for the entire package. Although this approach
1116 would make shaping non-incremental, since an entire package's shape would
1117 be recomputed any time a constituent module's shape changes, we do not expect
1118 shaping to be all that expensive.
1119
1120 \subsection{Typechecking of indefinite modules}\label{sec:typechecking-indefinite}
1121
1122 Recall in our argument in the definite case, where we showed there are
1123 no holes in the physical context. With indefinite modules, this is no
1124 longer true. While (with the linking restriction) these holes will never
1125 be linked against a physical implementation, they may be linked against
1126 other signatures. (Note: while disallowing signature linking would
1127 solve our problem, it would disallow a wide array of useful instances of
1128 signature reuse, for example, a package mylib that implements both
1129 mylib-1x-sig and mylib-2x-sig.)
1130
1131 With holes, we must handle module variables, and we sometimes must unify them:
1132
1133 \begin{verbatim}
1134 package p where
1135 A :: [ data A ]
1136 package q where
1137 A :: [ data A ]
1138 package r where
1139 include p
1140 include q
1141 \end{verbatim}
1142
1143 In this package, it is not possible to a priori assign original names to
1144 module A in p and q, because in package r, they should have the same
1145 original name. When signature linking occurs, unification may occur,
1146 which means we have to rename all relevant original names. (A similar
1147 situation occurs when a module is typechecked against a signature.)
1148
1149 An invariant which would be nice to have is this: when typechecking a
1150 signature or including a package, we may apply renaming to the entities
1151 being brought into context. But once we've picked an original name for
1152 our entities, no further renaming should be necessary. (Formally, in the
1153 unification for semantic object shapes, apply the unifier to the second
1154 shape, but not the first one.)
1155
1156 However, there are plenty of counterexamples here:
1157
1158 \begin{verbatim}
1159 package p where
1160 A :: [ data A ]
1161 B :: [ data A ]
1162 M = ...
1163 A = B
1164 \end{verbatim}
1165
1166 In this package, does module M know that A.A and B.A are type equal? In
1167 fact, the shaping pass will have assigned equal module identities to A
1168 and B, so M \emph{equates these types}, despite the aliasing occurring
1169 after the fact.
1170
1171 We can make this example more sophisticated, by having a later
1172 subpackage which causes the aliasing; now, the decision is not even a
1173 local one (on the other hand, the equality should be evident by inspection
1174 of the package interface associated with q):
1175
1176 \begin{verbatim}
1177 package p where
1178 A :: [ data A ]
1179 B :: [ data A ]
1180 package q where
1181 A :: [ data A ]
1182 B = A
1183 package r where
1184 include p
1185 include q
1186 \end{verbatim}
1187
1188 Another possibility is that it might be acceptable to do a mini-shaping
1189 pass, without parsing modules or signatures, \emph{simply} looking at
1190 names and aliases. But logical names are not the only mechanism by
1191 which unification may occur:
1192
1193 \begin{verbatim}
1194 package p where
1195 C :: [ data A ]
1196 A = [ data A = A ]
1197 B :: [ import A(A) ]
1198 C = B
1199 \end{verbatim}
1200
1201 It is easy to conclude that the original names of C and B are the same. But
1202 more importantly, C.A must be given the original name of p:A.A. This can only
1203 be discovered by looking at the signature definition for B. In any case, it
1204 is worth noting that this situation parallels the situation with hs-boot
1205 files (although there is no mutual recursion here).
1206
1207 The conclusion is that you will probably, in fact, have to do real
1208 shaping in order to typecheck all of these examples.
1209
1210 \paragraph{Hey, these signature imports are kind of tricky\ldots}
1211
1212 When signatures and modules are interleaved, the interaction can be
1213 complex. Here is an example:
1214
1215 \begin{verbatim}
1216 package p where
1217 C :: [ data A ]
1218 M = [ import C; ... ]
1219 A = [ import M; data A = A ]
1220 C :: [ import A(A) ]
1221 \end{verbatim}
1222
1223 Here, the second signature for C refers to a module implementation A
1224 (this is permissible: it simply means that the original name for p:C.A
1225 is p:A.A). But wait! A relies on M, and M relies on C. Circularity?
1226 Fortunately not: a client of package p will find it impossible to have
1227 the hole C implemented in advance, since they will need to get their hands on module
1228 A\ldots but it will not be defined prior to package p.
1229
1230 In any case, however, it would be good to emit a warning if a package
1231 cannot be compiled without mutual recursion.
1232
1233 \subsection{Incremental typechecking}
1234 We want to typecheck modules incrementally, i.e., when something changes in
1235 a package, we only want to re-typecheck the modules that care about that
1236 change. GHC already does this today.%
1237 \footnote{\url{https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/RecompilationAvoidance}}
1238 Is the same mechanism sufficient for Backpack? Edward and Scott think that it
1239 is, mostly. Our conjecture is that a module should be re-typechecked if the
1240 existing mechanism says it should \emph{or} if the logical shape
1241 context (which maps logical names to physical names) has changed. The latter
1242 condition is due to aliases that affect typechecking of modules.
1243
1244 Let's look again at an example from before:
1245 \begin{verbatim}
1246 package p where
1247 A :: [ data A ]
1248 B :: [ data A ]
1249 M = [ import A; import B; ... ]
1250 \end{verbatim}
1251 Let's say that \verb|M| is typechecked successfully. Now we add an alias binding
1252 at the end of the package, \verb|A = B|. Does \verb|M| need to be
1253 re-typechecked? Yes! (Well, it seems so, but let's just assert ``yes'' for now.
1254 Certainly in the reverse case---if we remove the alias and then ask---this
1255 is true, since \verb|M| might have depended on the two \verb|A| types
1256 being the same.)
1257 The logical shape context changed to say that \verb|A| and
1258 \verb|B| now map to the same physical module identity. But does the existing
1259 recompilation avoidance mechanism say that \verb|M| should be re-typechecked?
1260 It's unclear. The \verb|.hi| file for \verb|M| records that it imported \verb|A| and
1261 \verb|B| with particular ABIs, but does it also know about the physical module
1262 identities (or rather, original module names) of these modules?
1263
1264 Scott thinks this highlights the need for us to get our story straight about
1265 the connection between logical names, physical module identities, and file
1266 names!
1267
1268
1269 \subsection{Installing indefinite packages}\label{sec:installing-indefinite}
1270
1271 If an indefinite package contains no code at all, we only need
1272 to install the interface file for the signatures. However, if
1273 they include code, we must provide all of the
1274 ingredients necessary to compile them when the holes are linked against
1275 actual implementations. (Figure~\ref{fig:pkgdb})
1276
1277 \paragraph{Source tarball or preprocessed source?} What is the representation of the source that is saved is. There
1278 are a number of possible choices:
1279
1280 \begin{itemize}
1281 \item The original tarballs downloaded from Hackage,
1282 \item Preprocessed source files,
1283 \item Some sort of internal, type-checked representation of Haskell code (maybe the output of the desugarer).
1284 \end{itemize}
1285
1286 Storing the tarballs is the simplest and most straightforward mechanism,
1287 but we will have to be very certain that we can recompile the module
1288 later in precisely the same we compiled it originally, to ensure the hi
1289 files match up (fortunately, it should be simple to perform an optional
1290 sanity check before proceeding.) The appeal of saving preprocessed
1291 source, or even the IRs, is that this is conceptually this is exactly
1292 what an indefinite package is: we have paused the compilation process
1293 partway, intending to finish it later. However, our compilation strategy
1294 for definite packages requires us to run this step using a \emph{different}
1295 choice of original names, so it's unclear how much work could actually be reused.
1296
1297 \section{Surface syntax}
1298
1299 In the Backpack paper, a brand new module language is presented, with
1300 syntax for inline modules and signatures. This syntax is probably worth implementing,
1301 because it makes it easy to specify compatibility packages, whose module
1302 definitions in general may be very short:
1303
1304 \begin{verbatim}
1305 package ishake-0.12-shake-0.13 where
1306 include shake-0.13
1307 Development.Shake.Sys = Development.Shake.Cmd
1308 Development.Shake = [ (**>) = (&>) ; (*>>) = (|*>)]
1309 Development.Shake.Rule = [ defaultPriority = rule . priority 0.5 ]
1310 include ishake-0.12
1311 \end{verbatim}
1312
1313 However, there are a few things that are less than ideal about the
1314 surface syntax proposed by Paper Backpack:
1315
1316 \begin{itemize}
1317 \item It's completely different from the current method users
1318 specify packages. There's nothing wrong with this per se
1319 (one simply needs to support both formats) but the smaller
1320 the delta, the easier the new packaging format is to explain
1321 and implement.
1322
1323 \item Sometimes order matters (relative ordering of signatures and
1324 module implementations), and other times it does not (aliases).
1325 This can be confusing for users.
1326
1327 \item Users have to order module definitions topologically,
1328 whereas in current Cabal modules can be listed in any order, and
1329 GHC figures out an appropriate order to compile them.
1330 \end{itemize}
1331
1332 Here is an alternative proposal, closely based on Cabal syntax. Given
1333 the following Backpack definition:
1334
1335 \begin{verbatim}
1336 package libfoo(A, B, C, Foo) where
1337 include base
1338 -- renaming and thinning
1339 include libfoo (Foo, Bar as Baz)
1340 -- holes
1341 A :: [ a :: Bool ].hsig
1342 A2 :: [ b :: Bool ].hsig
1343 -- normal module
1344 B = [
1345 import {-# SOURCE #-} A
1346 import Foo
1347 import Baz
1348 ...
1349 ].hs
1350 -- recursively linked pair of modules, one is private
1351 C :: [ data C ].hsig
1352 D = [ import {-# SOURCE #-} C; data D = D C ].hs
1353 C = [ import D; data C = C D ].hs
1354 -- alias
1355 A = A2
1356 \end{verbatim}
1357
1358 We can write the following Cabal-like syntax instead (where
1359 all of the signatures and modules are placed in appropriately
1360 named files):
1361
1362 \begin{verbatim}
1363 package: libfoo
1364 ...
1365 build-depends: base, libfoo (Foo, Bar as Baz)
1366 holes: A A2 -- deferred for now
1367 exposed-modules: Foo B C
1368 aliases: A = A2
1369 other-modules: D
1370 \end{verbatim}
1371
1372 Notably, all of these lists are \emph{insensitive} to ordering!
1373 The key idea is use of the \verb|{-# SOURCE #-}| pragma, which
1374 is enough to solve the important ordering constraint between
1375 signatures and modules.
1376
1377 Here is how the elaboration works. For simplicity, in this algorithm
1378 description, we assume all packages being compiled have no holes
1379 (including \verb|build-depends| packages). Later, we'll discuss how to
1380 extend the algorithm to handle holes in both subpackages and the main
1381 package itself.
1382
1383 \begin{enumerate}
1384
1385 \item At the top-level with \verb|package| $p$ and
1386 \verb|exposed-modules| $ms$, record \verb|package p (ms) where|
1387
1388 \item For each package $p$ with thinning/renaming $ms$ in
1389 \verb|build-depends|, record a \verb|include p (ms)| in the
1390 Backpack package. The ordering of these includes does not
1391 matter, since none of these packages have holes.
1392
1393 \item Take all modules $m$ in \verb|other-modules| and
1394 \verb|exposed-modules| which were not exported by build
1395 dependencies, and create a directed graph where hs and hs-boot
1396 files are nodes and imports are edges (the target of an edge is
1397 an hs file if it is a normal import, and an hs-boot file if it
1398 is a SOURCE import). Topologically sort this graph, erroring if
1399 this graph contains cycles (even with recursive modules, the
1400 cycle should have been broken by an hs-boot file). For each
1401 node, in this order, record \verb|M = [ ... ]| or \verb|M :: [ ... ]|
1402 depending on whether or not it is an hs or hs-boot. If possible,
1403 sort signatures before implementations when there is no constraint
1404 otherwise.
1405
1406 \end{enumerate}
1407
1408 Here is a simple example which shows how SOURCE can be used to disambiguate
1409 between two important cases. Suppose we have these modules:
1410
1411 \begin{verbatim}
1412 -- A1.hs
1413 import {-# SOURCE #-} B
1414
1415 -- A2.hs
1416 import B
1417
1418 -- B.hs
1419 x = True
1420
1421 -- B.hs-boot
1422 x :: Bool
1423 \end{verbatim}
1424
1425 Then we translate the following packages as follows:
1426
1427 \begin{verbatim}
1428 exposed-modules: A1 B
1429 -- translates to
1430 B :: [ x :: Bool ]
1431 A1 = [ import B ]
1432 B = [ x = True ]
1433 \end{verbatim}
1434
1435 but
1436
1437 \begin{verbatim}
1438 exposed-modules: A2 B
1439 -- translates to
1440 B = [ x = True ]
1441 B :: [ x :: Bool ]
1442 A2 = [ import B ]
1443 \end{verbatim}
1444
1445 The import controls placement between signature and module, and in A1 it
1446 forces B's signature to be sorted before B's implementation (whereas in
1447 the second section, there is no constraint so we preferentially place
1448 the B's implementation first)
1449
1450 \paragraph{Holes in the database} In the presence of holes,
1451 \verb|build-depends| resolution becomes more complicated. First,
1452 let's consider the case where the package we are building is
1453 definite, but the package database contains indefinite packages with holes.
1454 In order to maintain the linking restriction, we now have to order packages
1455 from step (2) of the previous elaboration. We can do this by creating
1456 a directed graph, where nodes are packages and edges are from holes to the
1457 package which implements them. If there is a cycle, this indicates a mutually
1458 recursive package. In the absence of cycles, a topological sorting of this
1459 graph preserves the linking invariant.
1460
1461 One subtlety to consider is the fact that an entry in \verb|build-depends|
1462 can affect how a hole is instantiated by another entry. This might be a
1463 bit weird to users, who might like to explicitly say how holes are
1464 filled when instantiating a package. Food for thought, surface syntax wise.
1465
1466 \paragraph{Holes in the package} Actually, this is quite simple: the
1467 ordering of includes goes as before, but some indefinite packages in the
1468 database are less constrained as they're ``dependencies'' are fulfilled
1469 by the holes at the top-level of this package. It's also worth noting
1470 that some dependencies will go unresolved, since the following package
1471 is valid:
1472
1473 \begin{verbatim}
1474 package a where
1475 A :: ...
1476 package b where
1477 include a
1478 \end{verbatim}
1479
1480 \paragraph{Multiple signatures} In Backpack syntax, it's possible to
1481 define a signature multiple times, which is necessary for mutually
1482 recursive signatures:
1483
1484 \begin{verbatim}
1485 package a where
1486 A :: [ data A ]
1487 B :: [ import A; data B = B A ]
1488 A :: [ import B; data A = A B ]
1489 \end{verbatim}
1490
1491 Critically, notice that we can see the constructors for both module B and A
1492 after the signatures are linked together. This is not possible in GHC
1493 today, but could be possible by permitting multiple hs-boot files. Now
1494 the SOURCE pragma indicating an import must \emph{disambiguate} which
1495 hs-boot file it intends to include. This might be one way of doing it:
1496
1497 \begin{verbatim}
1498 -- A.hs-boot2
1499 data A
1500
1501 -- B.hs-boot
1502 import {-# SOURCE hs-boot2 #-} A
1503
1504 -- A.hs-boot
1505 import {-# SOURCE hs-boot #-} B
1506 \end{verbatim}
1507
1508 \paragraph{Explicit or implicit reexports} One annoying property of
1509 this proposal is that, looking at the \verb|exposed-modules| list, it is
1510 not immediately clear what source files one would expect to find in the
1511 current package. It's not obvious what the proper way to go about doing
1512 this is.
1513
1514 \paragraph{Better syntax for SOURCE} If we enshrine the SOURCE import
1515 as a way of solving Backpack ordering problems, it would be nice to have
1516 some better syntax for it. One possibility is:
1517
1518 \begin{verbatim}
1519 abstract import Data.Foo
1520 \end{verbatim}
1521
1522 which makes it clear that this module is pluggable, typechecking against
1523 a signature. Note that this only indicates how type checking should be
1524 done: when actually compiling the module we will compile against the
1525 interface file for the true implementation of the module.
1526
1527 It's worth noting that the SOURCE annotation was originally made a
1528 pragma because, in principle, it should have been possible to compile
1529 some recursive modules without needing the hs-boot file at all. But if
1530 we're moving towards boot files as signatures, this concern is less
1531 relevant.
1532
1533 \section{Bits and bobs}
1534
1535 \subsection{Abstract type synonyms}
1536
1537 In Paper Backpack, abstract type synonyms are not permitted, because GHC doesn't
1538 understand how to deal with them. The purpose of this section is to describe
1539 one particularly nastiness of abstract type synonyms, by way of the occurs check:
1540
1541 \begin{verbatim}
1542 A :: [ type T ]
1543 B :: [ import qualified A; type T = [A.T] ]
1544 \end{verbatim}
1545
1546 At this point, it is illegal for \verb|A = B|, otherwise this type synonym would
1547 fail the occurs check. This seems like pretty bad news, since every instance
1548 of the occurs check in the type-checker could constitute a module inequality.
1549
1550 \subsection{Type families}
1551
1552 Like type classes, type families must not overlap (and this is a question of
1553 type safety!)
1554
1555 A more subtle question is compatibility and apartness of type family
1556 equations. Under these checks, aliasing of modules can fail if it causes
1557 two type families to be identified, but their definitions are not apart.
1558 Here is a simple example:
1559
1560 \begin{verbatim}
1561 A :: [
1562 type family F a :: *
1563 type instance F Int = Char
1564 ]
1565 B :: [
1566 type family F a :: *
1567 type instance F Int = Bool
1568 ]
1569 \end{verbatim}
1570
1571 Now it is illegal for \verb|A = B|, because when the type families are
1572 unified, the instances now fail the apartness check. However, if the second
1573 instance was \verb|F Int = Char|, the families would be able to link together.
1574
1575 To make matters worse, an implementation may define more axioms than are
1576 visible in the signature:
1577
1578 \begin{verbatim}
1579 package a where
1580 A :: [
1581 type family F a :: *
1582 type instance F Int = Bool
1583 ]
1584 package b where
1585 include a
1586 B = [
1587 import A
1588 type instance F Bool = Bool
1589 ...
1590 ]
1591 package c where
1592 A = [
1593 type family F a :: *
1594 type instance F Int = Bool
1595 type instance F Bool = Int
1596 ]
1597 include b
1598 \end{verbatim}
1599
1600 It would seem that private axioms cannot be naively supported. Is
1601 there any way that thinning axioms could be made to work?
1602
1603 \section{Open questions}\label{sec:open-questions}
1604
1605 Here are open problems about the implementation which still require
1606 hashing out.
1607
1608 \begin{itemize}
1609
1610 \item In Section~\ref{sec:simplifying-backpack}, we argued that we
1611 could implement Backpack without needing a shaping pass. We're
1612 pretty certain that this will work for typechecking and
1613 compiling fully definite packages with no recursive linking, but
1614 in Section~\ref{sec:typechecking-indefinite}, we described some
1615 of the prevailing difficulties of supporting signature linking.
1616 Renaming is not an insurmountable problem, but backwards flow of
1617 shaping information can be, and it is unclear how best to
1618 accommodate this. This is probably the most important problem
1619 to overcome.
1620
1621 \item In Section~\ref{sec:installing-indefinite}, a few choices for how to
1622 store source code were pitched, however, there is not consensus on which
1623 one is best.
1624
1625 \item What is the impact of the multiplicity of PackageIds on
1626 dependency solving in Cabal? Old questions of what to prefer
1627 when multiple package-versions are available (Cabal originally
1628 only needed to solve this between different versions of the same
1629 package, preferring the oldest version), but with signatures,
1630 there are more choices. Should there be a complex solver that
1631 does all signature solving, or a preprocessing step that puts
1632 things back into the original Cabal version. Authors may want
1633 to suggest policy for what packages should actually link against
1634 signatures (so a crypto library doesn't accidentally link
1635 against a null cipher package).
1636
1637 \end{itemize}
1638
1639 \bibliographystyle{plain}
1640 \bibliography{backpack-impl}
1641
1642 \end{document}