1 \documentclass{article}
2
3 \usepackage{pifont}
4 \usepackage{graphicx} %[pdftex] OR [dvips]
5 \usepackage{fullpage}
6 \usepackage{wrapfig}
7 \usepackage{float}
8 \usepackage{titling}
9 \usepackage{hyperref}
10 \usepackage{tikz}
11 \usepackage{color}
12 \usepackage{footnote}
13 \usepackage{float}
14 \usepackage{algorithm}
15 \usepackage{algpseudocode}
16 \usetikzlibrary{arrows}
17 \usetikzlibrary{positioning}
18 \setlength{\droptitle}{-6em}
19
20 \input{commands-new-new.tex}
21
22 \newcommand{\nuAA}{\nu_\mathit{AA}}
23 \newcommand{\nuAB}{\nu_\mathit{AB}}
24 \newcommand{\nuGA}{\nu_\mathit{GA}}
25 \newcommand{\nuGB}{\nu_\mathit{GB}}
26 \newcommand{\betaPL}{\beta_\mathit{PL}}
27 \newcommand{\betaAA}{\beta_\mathit{AA}}
28 \newcommand{\betaAS}{\beta_\mathit{AS}}
29 \newcommand{\thinandalso}{\hspace{.45cm}}
30 \newcommand{\thinnerandalso}{\hspace{.38cm}}
31
32 \input{commands-rebindings.tex}
33
34 \newcommand{\var}[1]{\textsf{#1}}
35
36 \newcommand{\ghcfile}[1]{\textsl{#1}}
37
38 \title{Implementing Backpack}
39
40 \begin{document}
41
42 \maketitle
43
44 The purpose of this document is to describe an implementation path
45 for Backpack in GHC\@.
46
47 \tableofcontents
48
49 \section{What we are trying to solve}
50
51 While the current ecosystem has proved itself serviceable for many years,
there are a number of major problems which cause significant headaches
53 for many users. Here are some of them:
54
55 \subsection{Package reinstalls are destructive}\label{sec:destructive}
56
57 When attempting to install a new package, you might get an error like
58 this:
59
60 \begin{verbatim}
61 $ cabal install hakyll
62 cabal: The following packages are likely to be broken by the reinstalls:
63 pandoc-1.9.4.5
64 Graphalyze-0.14.0.0
65 Use --force-reinstalls if you want to install anyway.
66 \end{verbatim}
67
68 While this error message is understandable if you're really trying to
69 reinstall a package, it is quite surprising that it can occur even if
70 you didn't ask for any reinstalls!
71
72 The underlying cause of this problem is related to an invariant Cabal
73 currently enforces on a package database: there can only be one instance
74 of a package for any given package name and version. This means that it
75 is not possible to install a package multiple times, compiled against
76 different dependencies. However, sometimes, reinstalling a package with
77 different dependencies is the only way to fulfill version bounds of a
78 package! For example: say we have three packages \pname{a}, \pname{b}
79 and \pname{c}. \pname{b-1.0} is the only version of \pname{b}
80 available, and it has been installed and compiled against \pname{c-1.0}.
81 Later, the user installs an updated version \pname{c-1.1} and then
82 attempts to install \pname{a}, which depends on the specific versions
83 \pname{c-1.1} and \pname{b-1.0}. We \emph{cannot} use the already
84 installed version of \pname{b-1.0}, which depends on the wrong version
85 of \pname{c}, so our only choice is to reinstall \pname{b-1.0} compiled
86 against \pname{c-1.1}. This will break any packages, e.g. \pname{d},
which were built against the previously installed \pname{b-1.0}.
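
To make the scenario concrete, here is a sketch of the constraints involved
(the \texttt{build-depends} stanzas are hypothetical):

\begin{verbatim}
-- already installed:
--   c-1.0, c-1.1
--   b-1.0         built against c-1.0
--   d             built against b-1.0 (and thus c-1.0)
-- new request:
--   a             build-depends: b == 1.0, c == 1.1
-- the only solution reinstalls b-1.0 against c-1.1, breaking d
\end{verbatim}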
88
89 Our solution to this problem is to \emph{abolish} destructive package
90 installs, and allow a package to be installed multiple times with the same
91 package name and version. However, allowing this poses some interesting
92 user interface problems, since package IDs are now no longer unambiguous
93 identifiers.
94
95 \subsection{Version bounds are often over/under-constrained}
96
97 When attempting to install a new package, Cabal might fail in this way:
98
99 \begin{verbatim}
100 $ cabal install hledger-0.18
101 Resolving dependencies...
102 cabal: Could not resolve dependencies:
103 # pile of output
104 \end{verbatim}
105
106 There are a number of possible reasons why this could occur, but usually
107 it's because some of the packages involved have over-constrained version
bounds, which result in an unsatisfiable set of constraints (or,
at least, Cabal gave up backtracking before it found a solution). To
110 add insult to injury, most of the time the bound is nonsense and removing
111 it would result in a working compilation. In fact, this situation is
112 so common that Cabal has a flag \verb|--allow-newer| which lets you
113 override the package upper bounds.
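
For instance, if the failure above really were caused by an overly tight
upper bound, one might retry the install with the bounds relaxed (a
hypothetical invocation of this flag):

\begin{verbatim}
$ cabal install hledger-0.18 --allow-newer
\end{verbatim}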
114
However, the flip side is when Cabal finds a satisfying set, but your
116 compilation fails with a type error. Here, you had an under-constrained
117 set of version bounds which didn't actually reflect the compatible
118 versions of a package, and Cabal picked a version of the package which
119 was incompatible.
120
121 Our solution to this problem is to use signatures instead of version
122 numbers as the primary mechanism by which compatibility is determined:
123 e.g., if it typechecks, it's a valid choice. Version numbers can still
124 be used to reflect semantic changes not seen in the types (in
125 particular, ruling out buggy versions of a package is a useful
126 operation), but these bounds are empirical observations and can be
127 collected after-the-fact.
128
129 \subsection{It is difficult to support multiple implementations of a type}
130
This problem is perhaps best described by referring to a particular
instance of it in Haskell's ecosystem: the \texttt{String} data type. Haskell,
by default, implements strings as linked lists of characters. Many
libraries use \texttt{String}, because it's very convenient to program
against. However, this representation is also very \emph{slow}, so there
are alternative implementations such as \texttt{Text}, which implements
strings as efficient packed byte arrays.
139
140 Now, suppose you are writing a library and you don't care if the user of
141 your library is using \texttt{String} or \texttt{Text}. However, you
142 don't want to rewrite your library twice to support both data types:
143 rather, you'd like to rely on some \emph{common interface} between the
144 two types, and let the user instantiate the implementation. The only
145 way to do this in today's Haskell is using type classes; however, this
146 necessitates rewriting all type signatures from a nice \texttt{String ->
147 String} to \texttt{StringLike s => s -> s}. The result is less readable,
requires a large number of trivial edits to type signatures, and might
149 even be less efficient, if GHC does not appropriately specialize your code
150 written in this style.
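
As a small sketch of the type-class workaround (the \texttt{StringLike}
class here is made up for illustration, not a real library API), every
signature that used to mention \texttt{String} must be generalized:

\begin{verbatim}
class StringLike s where
  toUpper :: s -> s   -- hypothetical method

-- before the rewrite:  shout :: String -> String
shout :: StringLike s => s -> s
shout = toUpper
\end{verbatim}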
151
152 Our solution to this problem is to introduce a new mechanism of
153 pluggability: module holes, which let us use types and functions from a
module \texttt{Data.String} as before, but defer choosing \emph{which}
module should be used as the implementation until some later point (or
instantiate the code multiple times with different choices).
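
A sketch of what this might look like in Backpack syntax (the module and
entity names are illustrative only):

\begin{verbatim}
package mylib where
  Data.String :: [ data Str; toUpper :: Str -> Str ]
  MyLib = [ import Data.String; shout :: Str -> Str; shout = toUpper ]
\end{verbatim}

A user can then instantiate \texttt{Data.String} with a \texttt{String}-backed
or \texttt{Text}-backed implementation without any change to \texttt{MyLib}.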
157
158 \subsection{Fast moving APIs are difficult to develop/develop against}
159
Most packages that are uploaded to Hackage have authors who pay
some amount of attention to backwards compatibility and avoid making egregious
breaking changes. However, a package like \verb|ghc-api| follows a
very different model: the library is treated by its developers as an
internal component of an application (GHC), and is frequently refactored
in a way that changes its outward-facing interface.
166
167 Arguably, an application like GHC should design a stable API and
168 maintain backwards compatibility against it. However, this is a lot of
169 work (including refactoring) which is only being done slowly, and in the
170 meantime, the dump of all the modules gives users the functionality they
want (even if it keeps breaking with every release).
172
173 One could say that the core problem is there is no way for users to
174 easily communicate to GHC authors what parts of the API they rely on. A
175 developer of GHC who is refactoring an interface will often rely on the
176 typechecker to let them know which parts of the codebase they need to
177 follow and update, and often could say precisely how to update code to
178 use the new interface. User applications, which live out of tree, don't
179 receive this level of attention.
180
181 Our solution is to make it possible to typecheck the GHC API against a
signature. Important consumers can publish which subsets of the GHC API
they rely on, and developers of GHC, as part of their normal build
184 process, type-check against these signatures. If the signature breaks,
185 a developer can either do the refactoring differently to avoid the
186 compatibility-break, or document how to update code to use the new API\@.
187
188 \section{Backpack in a nutshell}
189
190 For a more in-depth tutorial about Backpack's features, check out Section 2
191 of the original Backpack paper. In this section, we briefly review the
192 most important points of Backpack's design.
193
194 \paragraph{Thinning and renaming at the module level}
195 A user can specify a build dependency which only exposes a subset of
modules (possibly under different names). By itself, it's a way for the
197 user to resolve ambiguous module imports at the package level, without
198 having to use the \texttt{PackageImports} syntax extension.
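
For example (the package and module names here are illustrative), a build
dependency can be thinned and renamed like this:

\begin{verbatim}
package myapp where
  include containers (Data.Map as Map, Data.Set)
  Main = [ import Map; import Data.Set; ... ]
\end{verbatim}

Only \m{Data.Map} (visible as \m{Map}) and \m{Data.Set} are brought into
scope; every other module of \pname{containers} is hidden from \pname{myapp}.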
199
200 \paragraph{Holes (abstract module definitions)} The core
201 component of Backpack's support for \emph{separate modular development}
202 is the ability to specify abstract module bindings, or holes, which give
203 users of the module an obligation to provide an implementation which
204 fulfills the signature of the hole. In this example:
205
206 \begin{verbatim}
207 package p where
208 A :: [ ... ]
209 B = [ import A; ... ]
210 \end{verbatim}
211
212 \verb|p| is an \emph{indefinite package}, which cannot be compiled until
213 an implementation of \m{A} is provided. However, we can still type check
214 \m{B} without any implementation of \m{A}, by type checking it against
215 the signature. Holes can be put into signature packages and included
216 (depended upon) by other packages to reuse definitions of signatures.
217
218 \paragraph{Filling in holes with an implementation}
219 A hole in an indefinite package can be instantiated in a \emph{mix-in}
220 style: namely, if a signature and an implementation have the same name,
221 they are linked together:
222
223 \begin{verbatim}
224 package q where
225 A = [ ... ]
226 include p -- has signature A
227 \end{verbatim}
228
Renaming is often used to arrange for a signature and an implementation
to have the same name, so that they are linked together.
231 An indefinite package can be instantiated multiple times with different
232 implementations: the \emph{applicativity} of Backpack means that if
233 a package is instantiated separately with the same module, the results
234 are type equal:
235
236 \begin{verbatim}
237 package q' where
238 A = [ ... ]
239 include p (A, B as B1)
240 include p (A, B as B2)
241 -- B1 and B2 are equivalent
242 \end{verbatim}
243
244 \paragraph{Combining signatures together}
Unlike implementations, it is valid for multiple signatures with the
same name to be in scope.
247
248 \begin{verbatim}
249 package a-sig where
250 A :: [ ... ]
251 package a-sig2 where
252 A :: [ ... ]
253 package q where
254 include a-sig
255 include a-sig2
256 B = [ import A; ... ]
257 \end{verbatim}
258
259 These signatures \emph{merge} together, providing the union of the
260 functionality (assuming the types of individual entities are
261 compatible.) Backpack has a very simple merging algorithm: types must
262 match exactly to be compatible (\emph{width} subtyping).
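
To illustrate the merge, suppose the two signatures were (contents invented
for the example):

\begin{verbatim}
package a-sig where
  A :: [ data X; f :: X -> X ]
package a-sig2 where
  A :: [ data X; g :: X -> Bool ]
\end{verbatim}

Then the merged requirement on \m{A} in \pname{q} is
\texttt{[ data X; f :: X -> X; g :: X -> Bool ]}, and any implementation
filling \m{A} must provide all of these entities at exactly these types.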
263
264 \clearpage
265
266 \section{Module and package identity}
267
268 \begin{figure}[H]
269 \begin{tabular}{p{0.45\textwidth} p{0.45\textwidth}}
270 \begin{verbatim}
271 package p where
272 A :: [ data X ]
273 P = [ import A; data Y = Y X ]
274 package q where
275 A1 = [ data X = X1 ]
276 A2 = [ data X = X2 ]
277 include p (A as A1, P as P1)
278 include p (A as A2, P as P2)
279 \end{verbatim}
280 &
281 \begin{verbatim}
282 package p where
283 A :: [ data X ]
284 P = [ data T = T ] -- no A import!
285 package q where
286 A1 = [ data X = X1 ]
287 A2 = [ data X = X2 ]
288 include p (A as A1, P as P1)
289 include p (A as A2, P as P2)
290 \end{verbatim}
291 \\
292 (a) Type equality must consider holes\ldots &
293 (b) \ldots but how do we track dependencies? \\
294 \end{tabular}
295 \caption{Two similar examples}\label{fig:simple-ex}
296 \end{figure}
297
298 One of the central questions one encounters when type checking Haskell
299 code is: \emph{when are two types equal}? In ordinary Haskell, the
300 answer is simple: ``They are equal if their \emph{original names} (i.e.,
301 where they were originally defined) are the same.'' However, in
302 Backpack, the situation is murkier due to the presence of \emph{holes}.
303 Consider the pair of examples in Figure~\ref{fig:simple-ex}.
In Figure~\ref{fig:simple-ex}a, the types \m{P1}.\verb|Y| and \m{P2}.\verb|Y| should not be
considered equal, even though na\"\i vely their original names are both
\pname{p}:\m{P}.\verb|Y|, since their arguments are different \verb|X|'s!
307 On the other hand, if we instantiated \pname{p} twice with the same \m{A}
308 (e.g., change the second include to \texttt{include p (A as A1, P as P2)}),
we might consider the two resulting \verb|Y|'s
equal: an \emph{applicative} semantics of instantiation. In
311 Figure~\ref{fig:simple-ex}b, we see that even though \m{A} was instantiated differently,
312 we might reasonably wonder if \texttt{T} should still be considered the same,
313 since it has no dependence on the actual choice of \m{A}.
314
315 In fact, there are quite a few different choices that can be made here.
Figures~\ref{fig:applicativity}~and~\ref{fig:granularity} summarize the various
choices on two axes: the granularity of applicativity (under what circumstances
do we consider two types equal) and the granularity of dependency (under what
circumstances do we consider two types distinct). A \ding{52} means the design we have chosen
answers the question affirmatively---\ding{54}, negatively---but all of these choices
are valid points in the design space.
322
323 \subsection{The granularity of applicativity}
324
325 An applicative semantics of package instantiation states that if a package is
326 instantiated with the ``same arguments'', then the resulting entities it defines
327 should also be considered equal. Because Backpack uses \emph{mix-in modules},
328 it is very natural to consider the arguments of a package instantiation as the
329 modules, as shown in Figure~\ref{fig:applicativity}b: the same module \m{A} is
330 linked for both instantiations, so \m{P1} and \m{P2} are considered equal.
331
However, if we consider the situation at a finer granularity, we might say, ``Well,
333 for a declaration \texttt{data Y = Y X}, only the definition of type \verb|X| matters.
334 If they are the same, then \verb|Y| is the same.'' In that case, we might accept
335 that in Figure~\ref{fig:applicativity}a, even though \pname{p} is instantiated
336 with different modules, at the end of the day, the important component \verb|X| is
337 the same in both cases, so \verb|Y| should also be the same. This is a sort of
338 ``extreme'' view of modular development, where every declaration is desugared
339 into a separate module. In our design, we will be a bit more conservative, and
340 continue with module level applicativity, in the same manner as Paper Backpack.
341
342 \paragraph{Implementation considerations}
343 Compiling Figure~\ref{fig:applicativity}b to dynamic libraries poses an
344 interesting challenge, if every package compiles to a dynamic library.
345 When we compile package \pname{q}, the libraries we end up producing are \pname{q}
346 and an instance of \pname{p} (instantiated with \pname{q}:\m{A}). Furthermore,
347 \pname{q} refers to code in \pname{p} (the import in \m{Q}), and vice versa (the usage
348 of the instantiated hole \m{A}). When building static libraries, this circular
349 dependency doesn't matter: when we link the executable, we can resolve all
350 of the symbols in one go. However, when the libraries in question are
351 dynamic libraries \verb|libHSq.so| and \verb|libHSp(q:A).so|, we now have
352 a \emph{circular dependency} between the two dynamic libraries, and most dynamic
353 linkers will not be able to load either of these libraries.
354
355 To break the circularity in Figure~\ref{fig:applicativity}b, we have to \emph{inline}
356 the entire module \m{A} into the instance of \pname{p}. Since the code is exactly
357 the same, we can still consider the instance of \m{A} in \pname{q} and in \pname{p}
358 type equal. However, in Figure~\ref{fig:applicativity}c, applicativity has been
359 done at a coarser level: although we are using Backpack's module mixin syntax,
360 morally, this example is filling in the holes with the \emph{package} \pname{a}
361 (rather than a module). In this case, we can achieve code sharing, since
362 \pname{p} can refer directly to \pname{a}, breaking the circularity.
363
364 \newcolumntype{C}{>{\centering\arraybackslash}p{0.3\textwidth}}
365 \begin{savenotes}
366 \begin{figure}
367 \begin{tabular}{C C C}
368 \begin{verbatim}
369 package q where
370 A = [ data X = X ]
371 A1 = [ import A; x = 0 ]
372 A2 = [ import A; x = 1 ]
373 include p (A as A1, P as P1)
374 include p (A as A2, P as P2)
375 Q = [ import P1; ... ]
376 \end{verbatim}
377 &
378 \begin{verbatim}
379 package q where
380 A = [ data X = X ]
381
382
383 include p (A, P as P1)
384 include p (A, P as P2)
385 Q = [ import P1; ... ]
386 \end{verbatim}
387 &
388 \begin{verbatim}
389 package a where
390 A = [ data X = X ]
391 package q where
392 include a
393 include p (A, P as P1)
394 include p (A, P as P2)
395 Q = [ import P1; ... ]
396 \end{verbatim}
397 \\
398 (a) Declaration applicativity \ding{54} &
399 (b) Module applicativity \ding{52} &
400 (c) Package applicativity \ding{52} \\
401 \end{tabular}
402 \caption{Choices of granularity of applicativity on \pname{p}: given \texttt{data Y = Y X}, is \m{P1}.\texttt{Y} equal to \m{P2}.\texttt{Y}?}\label{fig:applicativity}
403 \end{figure}
404 \end{savenotes}
405
406 \subsection{The granularity of dependency}
407
408 \begin{savenotes}
409 \newcolumntype{C}{>{\centering\arraybackslash}p{0.3\textwidth}}
410 \begin{figure}
411 \begin{tabular}{C C C}
412 \begin{verbatim}
413 package p(A,P) where
414 A :: [ data X ]
415 P = [
416 import A
417 data T = T
418 data Y = Y X
419 ]
420 \end{verbatim}
421 &
422 \begin{verbatim}
423 package p(A,P) where
424 A :: [ data X ]
425 B = [ data T = T ]
426 C = [
427 import A
428 data Y = Y X
429 ]
430 P = [
431 import B
432 import C
433 ]
434 \end{verbatim}
435 &
436 \begin{verbatim}
437 package b where
438 B = [ data T = T ]
439 package c where
440 A :: [ data X ]
441 C = [
442 import A
443 data Y = Y X
444 ]
445 package p(A,P) where
446 include b; include c
447 P = [ import B; import C ]
448 \end{verbatim}
449 \\
450 (a) Declaration granularity \ding{54} &
451 (b) Module granularity \ding{54} &
452 (c) Package granularity \ding{52} \\
453 \end{tabular}
454 \caption{Choices of granularity for dependency: is the identity of \texttt{T} independent of how \m{A} is instantiated?}\label{fig:granularity}
455 \end{figure}
456
457 \end{savenotes}
458
In the previous section, we considered \emph{what} sort of entities may serve as
the arguments of an instantiation; in this section we consider \emph{which} entities are
actually counted among the dependencies of the declaration/module/package we're writing.
462 Figure~\ref{fig:granularity} contains a series of examples which exemplify the choice
463 of whether or not to collect dependencies on a per-declaration, per-module or per-package basis:
464
465 \begin{itemize}
466 \item Package-level granularity states that the modules in a package are
467 considered to depend on \emph{all} of the holes in the package, even if
468 the hole is never imported. Figure~\ref{fig:granularity}c is factored so that
469 \verb|T| is defined in a distinct package \pname{b} with no holes, so no matter
470 the choice of \m{A}, \m{B}.\verb|T| will be the same. On the other hand, in
471 Figure~\ref{fig:granularity}b, there is a hole in the package defining \m{B},
472 so the identity of \verb|T| will depend on the choice of \m{A}.
473
474 \item Module-level granularity states that each module has its own dependency,
475 computed by looking at its import statements. In this setting, \verb|T| in Figure~\ref{fig:granularity}b
476 is independent of \m{A}, since the hole is never imported in \m{B}. But once again, in
477 Figure~\ref{fig:granularity}a, there is an import in the module defining \verb|T|,
478 so the identity of \verb|T| once again depends on the choice of \m{A}.
479
480 \item Finally, at the finest level of granularity, one could chop up \pname{p} in
481 Figure~\ref{fig:granularity}a, looking at the type declaration-level dependency
482 to suss out whether or not \verb|T| depends on \m{A}. It doesn't refer to
483 anything in \m{A}, so it is always considered the same.
484 \end{itemize}
485
It is well worth noting that the system described by Paper Backpack tracks dependencies per module;
however, we have decided to implement tracking per package instead:
a coarser granularity which accepts fewer programs.
489
490 Is a finer form of granularity \emph{better?} Not necessarily! For
491 one, we can always split packages into further subpackages (as was done
492 in Figure~\ref{fig:granularity}c) which better reflect the internal hole
493 dependencies, so it is always possible to rewrite a program to make it
494 typecheck---just with more packages. Additionally, the finer the
495 granularity of dependency, the more work I have to do to understand what
496 the identities of entities in a module are. In Paper Backpack, I have
497 to understand the imports of all modules in a package; with
498 declaration-granularity, I have to understand the entire code. This is
499 a lot of work for the developer to think about; a more granular model is
500 easier to remember and reason about. Finally, package-level granularity
501 is much easier to implement, as it preserves the previous compilation
502 model, \emph{one library per package}. At a fine level of granularity, we
503 may end up repeatedly compiling a module which actually should be considered
504 ``the same'' as any other instance of it.
505
506 Nevertheless, finer granularity can be desirable from an end-user perspective.
507 Usually, these circumstances arise when library-writers are forced to split their
508 components into many separate packages, when they would much rather have written
509 a single package. For example, if I define a data type in my library, and would
510 like to define a \verb|Lens| instance for it, I would create a new package just
511 for the instance, in order to avoid saddling users who aren't interested in lenses
512 with an extra dependency. Another example is test suites, which have dependencies
513 on various test frameworks that a user won't care about if they are not planning
514 on testing the code. (Cabal has a special case for this, allowing the user
515 to write effectively multiple packages in a single Cabal file.)
516
517 \subsection{Summary}
518
519 We can summarize all of the various schemes by describing the internal data
520 types that would be defined by GHC under each regime. First, we have
521 the shared data structures, which correspond closely to what users are
522 used to seeing:
523
524 \begin{verbatim}
525 <pkg-name> ::= containers, ...
<pkg-version> ::= 1.0, ...
527 <pkg-id> ::= <pkg-name>-<pkg-version>
528 <mod-name> ::= Data.Set, ...
529 <occ> ::= empty, ...
530 \end{verbatim}
531
532 Changing the \textbf{granularity of applicativity} modifies how we represent the
533 list of dependencies associated with an entity. With module applicativity,
534 we list module identities (not yet defined); with declaration applicativity
535 we actually list the original names (i.e., ids).
536
537 \begin{verbatim}
538 <deps> ::= <id>, ... # Declaration applicativity
539 <deps> ::= <module>, ... # Module applicativity
540 \end{verbatim}
541
542 Changing the \textbf{granularity of dependency} affects how we compute
543 the lists of dependencies, and what entities are well defined:
544
545 \begin{verbatim}
546 # Package-level granularity
547 <pkg-key> ::= hash(<pkg-id> + <deps for pkg>)
548 <module> ::= <pkg-key> : <mod-name>
549 <id> ::= <module> . <occ>
550
551 # Module-level granularity
552 <pkg-key> not defined
553 <module> ::= hash(<pkg-id> : <mod-name> + <deps for mod>)
<id>          ::= <module> . <occ>
555
556 # Declaration-level granularity
557 <pkg-key> not defined
558 <module> not defined
559 <id> ::= hash(<pkg-id> : <mod-name> . <occ> + <deps for decl>)
560 \end{verbatim}
561
Notice that as we increase the granularity, the notions of a ``package'' and a ``module''
become undefined. This is because, for example, with module-level granularity, a single
``package'' may result in several modules, each of which has different sets of
565 dependencies. It doesn't make much sense to refer to the package as a monolithic entity,
566 because the point of splitting up the dependencies was so that if a user relies only
567 on a single module, it has a correspondingly restricted set of dependencies.
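
As a worked example of the package-level scheme, consider
Figure~\ref{fig:applicativity}b (the version numbers below are invented for
illustration):

\begin{verbatim}
key(q)         = hash(q-1.0)                        # q has no holes
key(p at A=q)  = hash(p-1.0 + {A -> key(q):A})
P1 = P2        = key(p at A=q) : P
\end{verbatim}

Both includes of \pname{p} are instantiated with the same module
\pname{q}:\m{A}, so they hash to the same package key, and
\m{P1}.\texttt{Y} and \m{P2}.\texttt{Y} are type equal.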
568 \subsection{The new scheme, formally}
569
570 \begin{wrapfigure}{R}{0.5\textwidth}
571 \begin{myfig}
572 \[
573 \begin{array}{@{}lr@{\;}c@{\;}l@{}}
574 \text{Package Names (\texttt{PkgName})} & P &\in& \mathit{PkgNames} \\
575 \text{Module Path Names (\texttt{ModName})} & p &\in& \mathit{ModPaths} \\
576 \text{Module Identity Vars} & \alpha,\beta &\in& \mathit{IdentVars} \\
577 \text{Package Key (\texttt{PackageKey})} & \K &::=& P(\vec{p\mapsto\nu}) \\
578 \text{Module Identities (\texttt{Module})} & \nu &::=&
579 \alpha ~|~
580 \K\colon\! p \\
581 \text{Module Identity Substs} & \phi,\theta &::=&
582 \{\vec{\alpha \coloneqq \nu}\} \\
583 \end{array}
584 \]
585 \caption{Module Identities}
586 \label{fig:mod-idents}
587 \end{myfig}
588 \end{wrapfigure}
589
590 In this section, we give a formal treatment of our choice in the design space, in the
591 same style as the Backpack paper, but omitting mutual recursion, as it follows straightforwardly.
592 Physical module
593 identities $\nu$, the \texttt{Module} component of \emph{original names} in GHC, are either (1) \emph{variables} $\alpha$, which are
594 used to represent holes\footnote{In practice, these will just be fresh paths in a special package key for variables.} or (2) a concrete module $p$ defined in package
595 $P$, with holes instantiated with other module identities (might be
596 empty)\footnote{In Paper Backpack, we would refer to just $P$:$p$ as the identity
597 constructor. However, we've written the subterms specifically next to $P$ to highlight the semantic difference of these terms.}.
598
599 As in traditional Haskell, every package contains a number of module
600 files at some module path $p$; within a package these paths are
601 guaranteed to be unique.\footnote{In Paper Backpack, the module expressions themselves are used to refer to globally unique identifiers for each literal. This makes the metatheory simpler, but for implementation purposes it is convenient to conflate the \emph{original} module path that a module is defined at with its physical identity.} When we write inline module definitions, we assume
602 that they are immediately assigned to a module path $p$ which is incorporated
603 into their identity. A module identity $\nu$ simply augments this
604 with subterms $\vec{p\mapsto\nu}$ representing how \emph{all} holes in the package $P$
605 were instantiated.\footnote{In Paper Backpack, we do not distinguish between holes/non-holes, and we consider all imports of the \emph{module}, not the package.} This naming is stable because the current Backpack surface syntax does not allow a logical path in a package
606 to be undefined. A package key is $P(\vec{p\mapsto\nu})$.
607
608 Here is the very first example from
609 Section 2 of the original Backpack paper, \pname{ab-1}:
610
611 \begin{example}
612 \Pdef{ab-1}{
613 \Pmod{A}{x = True}
614 \Pmod{B}{\Mimp{A}; y = not x}
615 % \Pmodd{C}{\mname{A}}
616 }
617 \end{example}
618
619 The identities of \m{A} and \m{B} are
620 \pname{ab-1}:\mname{A} and \pname{ab-1}:\mname{B}, respectively.\footnote{In Paper Backpack, the identity for \mname{B} records its import of \mname{A}, but since it is definite, this is strictly redundant.} In a package with holes, each
621 hole (within the package definition) gets a fresh variable as its
622 identity, and all of the holes associated with package $P$ are recorded. Consider \pname{abcd-holes-1}:
623
624 \begin{example}
625 \Pdef{abcd-holes-1}{
626 \Psig{A}{x :: Bool} % chktex 26
627 \Psig{B}{y :: Bool} % chktex 26
628 \Pmod{C}{x = False}
629 \Pmodbig{D}{
630 \Mimpq{A}\\
631 \Mimpq{C}\\
632 % \Mexp{\m{A}.x, z}\\
633 z = \m{A}.x \&\& \m{C}.x
634 }
635 }
636 \end{example}
637
638 The identities of the four modules
639 are, in order, $\alpha_a$, $\alpha_b$, $\pname{abcd-holes-1}(\alpha_a,\alpha_b)$:\mname{C}, and
640 $\pname{abcd-holes-1}(\alpha_a,\alpha_b)$:\mname{D}.\footnote{In Paper Backpack, the granularity is at the module level, so the subterms of \mname{C} and \mname{D} can differ.} We include both $\alpha_a$ and $\alpha_b$ in both \mname{C} and \mname{D}, regardless of the imports. When we link the package against an implementation of the hole, these variables are replaced with the identities of the modules we linked against.
641
642 Shaping proceeds in the same way as in Paper Backpack, except that the
643 shaping judgment must also accept the package key
644 $P(\vec{p\mapsto\alpha})$ so we can create identifiers with
645 \textsf{mkident}. This implies we must know ahead of time what the holes
646 of a package are.
647
648 \paragraph{A full Backpack comparison}
649 If you're curious about how the rest of the Backpack examples translate,
650 look no further than this section.
651
652 First, consider the module identities in the \m{Graph} instantiations in
653 \pname{multinst}, shown in Figure 2 of the original Backpack paper.
654 In the definition of \pname{structures}, assume that the variables for
655 \m{Prelude} and \m{Array} are $\alpha_P$ and $\alpha_A$ respectively.
656 The identity of \m{Graph} is $\pname{structures}(\alpha_P, \alpha_A)$:\m{Graph}. Similarly, the identities of the two array implementations
657 are $\nu_{AA} = \pname{arrays-a}(\alpha_P)$:\m{Array} and
658 $\nu_{AB} = \pname{arrays-b}(\alpha_P)$:\m{Array}.\footnote{Notice that the subterms coincide with Paper Backpack! A sign that module level granularity is not necessary for many use-cases.}
659
660 The package \pname{graph-a} is more interesting because it
661 \emph{links} the packages \pname{arrays-a} and \pname{structures}
662 together, with the implementation of \m{Array} from \pname{arrays-a}
663 \emph{instantiating} the hole \m{Array} from \pname{structures}. This
664 linking is reflected in the identity of the \m{Graph} module in
665 \pname{graph-a}: whereas in \pname{structures} it was $\nu_G =
666 \pname{structures}(\alpha_P, \alpha_A)$:\m{Graph}, in \pname{graph-a} it is
667 $\nu_{GA} = \nu_G[\nu_{AA}/\alpha_A] = \pname{structures}(\alpha_P, \nu_{AA})$:\m{Graph}. Similarly, the identity of \m{Graph} in
668 \pname{graph-b} is $\nu_{GB} = \nu_G[\nu_{AB}/\alpha_A] =
669 \pname{structures}(\alpha_P, \nu_{AB})$:\m{Graph}. Thus, linking consists
670 of substituting the variable identity of a hole by the concrete
671 identity of the module filling that hole.
672
673 Lastly, \pname{multinst} makes use of both of these \m{Graph}
674 modules, under the aliases \m{GA} and \m{GB}, respectively.
675 Consequently, in the \m{Client} module, \code{\m{GA}.G} and
676 \code{\m{GB}.G} will be correctly viewed as distinct types since they
677 originate in modules with distinct identities.
678
679 As \pname{multinst} illustrates, module identities effectively encode
680 dependency graphs at the package level.\footnote{In Paper Backpack, module identities
681 encode dependency graphs at the module level. In both cases, however, what is being
682 depended on is always a module.} Like in Paper Backpack, we have an \emph{applicative}
683 semantics of instantiation, and the applicativity example in Figure 3 of the
684 Backpack paper still type checks. However, because we are operating at a coarser
685 granularity, modules may have spurious dependencies on holes that they don't
actually depend on, which means fewer type equalities may hold.
687
688
689 \subsection{Cabal dependency resolution}
690
691 Currently, when we compile a Cabal
692 package, Cabal goes ahead and resolves \verb|build-depends| entries with actual
693 implementations, which we compile against. A planned addition to the package key,
694 independent of Backpack, is to record the transitive dependency tree selected
695 during this dependency resolution process, so that we can install \pname{libfoo-1.0}
696 twice compiled against different versions of its dependencies.
What is the relationship between this transitive dependency tree of \emph{packages}
and the subterms of our package identities, which are \emph{modules}? Does one
subsume the other? In fact, these are separate mechanisms---two levels of indirection,
so to speak.
701
702 To illustrate, suppose I write a Cabal file with \verb|build-depends: foobar|. A reasonable assumption is that this translates into a
703 Backpack package which has \verb|include foobar|. However, this is not
704 actually a Paper Backpack package: Cabal's dependency solver has to
705 rewrite all of these package references into versioned references
706 \verb|include foobar-0.1|. For example, this is a pre-package:
707
708 \begin{verbatim}
709 package foo where
710 include bar
711 \end{verbatim}
712
713 and this is a Paper Backpack package:
714
715 \begin{verbatim}
716 package foo-0.3[bar-0.1[baz-0.2]] where
717 include bar-0.1[baz-0.2]
718 \end{verbatim}
719
720 This tree is very similar to the one tracking dependencies for holes,
721 but we must record this tree \emph{even} when our package has no holes.
722 % As a final example, the full module
723 % identity of \m{B1} in Figure~\ref{fig:granularity} may actually be $\pname{p-0.9(q-1.0[p-0.9]:A1)}$:\m{B}.
724
725 \paragraph{Linker symbols} As we increase the amount of information in
726 PackageId, it's important to be careful about the length of these IDs,
727 as they are used for exported linker symbols (e.g.
728 \verb|base_TextziReadziLex_zdwvalDig_info|). Very long symbol names
729 hurt compile and link time, object file sizes, GHCi startup time,
730 dynamic linking, and make gdb hard to use. As such, we are going to
731 do away with full package names and versions and instead use just a
732 base-62 encoded hash, with the first five characters of the package
733 name for user-friendliness.
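
For example, whereas today a symbol looks like
\verb|base_TextziReadziLex_zdwvalDig_info|, under the proposed scheme it
might look something like the following (the hash is fabricated for
illustration):

\begin{verbatim}
conta_5FeQMsGZW2iMrVkLBOz9mc_DataziSet_empty_info
\end{verbatim}

where \texttt{conta} is the first five characters of the package name
\pname{containers} and the rest of the prefix is the base-62 encoded hash of
the package key.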
734
735 \subsection{Package selection}
736
737 When I fire up \texttt{ghci} with no arguments, GHC somehow creates
738 out of thin air some consistent set of packages, whose modules I can
739 load using \texttt{:m}. This functionality is extremely handy for
740 exploratory work, but actually GHC has to work quite hard in order
741 to generate this set of packages, the contents of which are all
742 dumped into a global namespace. For example, GHC doesn't have access
743 to Cabal's dependency solver, nor does it know \emph{which} packages
744 the user is going to ask for, so it can't just run a constraint solver,
745 get a set of consistent packages to offer and provide them to the user.\footnote{Some might
746 argue that depending on a global environment in this fashion is wrong, because
when you perform a build in this way, you have absolutely no idea what
748 dependencies you actually ended up using. But the fact remains that for
749 end users, this functionality is very useful.}
750
751 To make matters worse, while in the current design of the package database,
752 a package is uniquely identified by its package name and version, in
753 the Backpack design, it is \emph{mandatory} that we support multiple
754 packages installed in the database with the same package name and version,
755 and this can result in complications in the user model. This further
756 complicates GHC's default package selection algorithm.
757
758 In this section, we describe how the current algorithm operates (including
759 what invariants it tries to uphold and where it goes wrong), and how
760 to replace the algorithm to handle generalization to
761 multiple instances in the package database. We'll also try to tease
762 apart the relationship between package keys and installed package IDs in
763 the database.
764
765 \paragraph{The current algorithm} Abstractly, GHC's current package
766 selection algorithm operates as follows. For every package name, select
767 the package with the latest version (recall that this is unique) which
768 is also \emph{valid}. A package is valid if:
769
770 \begin{itemize}
771 \item It exists in the package database,
772 \item All of its dependencies are valid,
773 \item It is not shadowed by a package with the same package ID\footnote{Recall that currently, a package ID uniquely identifies a package in the package database} in
774 another package database (unless it is in the transitive closure
775 of a package named by \texttt{-package-id}), and
776 \item It is not ignored with \texttt{-ignore-package}.
777 \end{itemize}
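
The following pseudo-Haskell sketches this selection logic (the types and
field names are invented, not GHC's actual data structures, and the
\texttt{-package-id} exception above is omitted for simplicity):

\begin{verbatim}
import Data.List (maximumBy, nub)
import Data.Ord  (comparing)

-- Invented stand-in for an installed package record.
data PackageConfig = PackageConfig
  { pkgName     :: String
  , pkgVersion  :: [Int]
  , installedId :: String
  , depends     :: [String]  -- installed IDs of dependencies
  , ignored     :: Bool      -- hit by -ignore-package
  , shadowed    :: Bool      -- shadowed by a package in another database
  }

-- A package is valid if it is not ignored or shadowed and all of its
-- dependencies exist in the database and are themselves valid.
isValid :: [PackageConfig] -> PackageConfig -> Bool
isValid db p =
  not (ignored p) && not (shadowed p) &&
  all (\d -> any (\q -> installedId q == d && isValid db q) db) (depends p)

-- For every package name, expose the valid instance with the highest version.
selectDefault :: [PackageConfig] -> [PackageConfig]
selectDefault db =
  [ maximumBy (comparing pkgVersion) cands
  | name <- nub (map pkgName db)
  , let cands = [ p | p <- db, pkgName p == name, isValid db p ]
  , not (null cands) ]
\end{verbatim}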
778
Package validity is probably the minimal criterion for GHC to ensure
780 that it can actually \emph{use} a package. If the package is missing,
781 GHC can't find the interface files or object code associated with the
782 package. Ignoring packages is a way of pretending that a package is
783 missing from the database.
784
785 Package validity is also a very weak criterion. Another criterion we
786 might hope holds is \emph{consistency}: when we consider the transitive
787 closure of all selected packages, for any given package name, there
788 should only be one instance providing that package. It is trivially
789 easy to break this property: suppose that I have packages \pname{a-1.0},
790 \pname{b-1.0} compiled against \pname{a-1.0}, and \pname{a-1.1}. GHC
791 will happily load \pname{b-1.0} and \pname{a-1.1} together in the same
792 interactive session (they are both valid and the latest versions), even
793 though \pname{b-1.0}'s dependency is inconsistent with another package
794 that was loaded. The user will notice if they attempt to treat entities
795 from \pname{a} reexported by \pname{b-1.0} and entities from
796 \pname{a-1.1} as type equal. Here is one user who had this problem:
797 \url{http://stackoverflow.com/questions/12576817/}. In some cases, the
798 problem is easy to work around (there is only one offending package
799 which just needs to be hidden), but if the divergence is deep in two
800 separate dependency hierarchies, it is often easier to just blow away
801 the package database and try again.
802
803 Perversely, destructive reinstallation helps prevent these sorts of
804 inconsistent databases. While inconsistencies can arise when multiple
805 versions of a package are installed, multiple versions will frequently
806 lead to the necessity of reinstalls. In the previous example, if a user
807 attempts to Cabal install a package which depends on \pname{a-1.1} and
808 \pname{b-1.0}, Cabal's dependency solver will propose reinstalling
809 \pname{b-1.0} compiled against \pname{a-1.1}, in order to get a
810 consistent set of dependencies. If this reinstall is accepted, we
811 invalidate all packages in the database which were previously installed
812 against \pname{b-1.0} and \pname{a-1.0}, excluding them from GHC's
813 selection process and making it more likely that the user will see a
814 consistent view of the database.
815
816 \paragraph{Enforcing consistent dependencies} From the user's
817 perspective, it would be desirable if GHC never loaded a set of packages
818 whose dependencies were inconsistent.
819 There are two ways we can go
820 about doing this. First, we can improve GHC's logic so that it doesn't
821 pick an inconsistent set. However, as a point of design, we'd like to
822 keep whatever resolution GHC does as simple as possible (in an ideal
823 world, we'd skip the validity checks entirely, but they ended up being
necessary to prevent a broken database from stopping GHC from starting up
825 at all). In particular, GHC should \emph{not} learn how to do
826 backtracking constraint solving: that's in the domain of Cabal. Second,
827 we can modify the logic of Cabal to enforce that the package database is
828 always kept in a consistent state, similar to the consistency check
829 Cabal applies to sandboxes, where it refuses to install a package to a
830 sandbox if the resulting dependencies would not be consistent.
831
The second alternative is appealing, but Cabal sandboxes are currently
833 designed for small, self-contained single projects, as opposed to the
834 global ``universe'' that a default environment is intended to provide.
835 For example, with a Cabal sandbox environment, it's impossible to
836 upgrade a dependency to a new version without blowing away the sandbox
837 and starting again. To support upgrades, Cabal needs to do some work:
838 when a new version is put in the default set, all of the
839 reverse-dependencies of the old version are now inconsistent. Cabal
840 should offer to hide these packages or reinstall them compiled against
841 the latest version. Furthermore, because we in general may not have write
842 access to all visible package databases, this visibility information
843 must be independent of the package databases themselves.
844
845 As a nice bonus, Cabal should also be able to snapshot the older
846 environment which captures the state of the universe prior to the
847 installation, in case the user wants to revert back.
848
849 \paragraph{Modifying the default environment} Currently, after GHC
850 calculates the default package environment, a user may further modify
851 the environment by passing package flags to GHC, which can be used to
852 explicitly hide or expose packages. How do these flags interact with
853 our Cabal-managed environments? Hiding packages is simple enough,
854 but exposing packages is a bit dicier. If a user asks for a different
855 version of a package than in the default set, it will probably be
856 inconsistent with the rest of the dependencies. Cabal would have to
857 be consulted to figure out a maximal set of consistent packages with
858 the constraints given. Alternatively, we could just supply the package
859 with no claims of consistency.
860
However, this use-case is rare. Usually, a user exposes a package not
because they want a specific version: the package was hidden simply because we're not
interested in loading it by default (\pname{ghc-api} is the canonical
example, since it dumps a lot of modules into the top-level namespace).
865 If we distinguish packages which are consistent but hidden, their
866 loads can be handled appropriately.
867
868 \paragraph{Consistency in Backpack} We have stated as an implicit
869 assumption that if we have both \pname{foo-1.0} and \pname{foo-1.1}
870 available, only one should be loaded at a time. What are the
871 consequences if both of these packages are loaded at the same time? An
872 import of \m{Data.Foo} provided by both packages would be ambiguous and
the user might find that some type equalities they expect to hold do not.
874 However, the result is not \emph{unsound}: indeed, we might imagine a
875 user purposely wanting two different versions of a library in the same
876 program, renaming the modules they provided so that they could be
877 referred to unambiguously. As another example, suppose that we have an
878 indefinite package with a hole that is instantiated multiple times. In
879 this case, a user absolutely may want to refer to both instantiations,
880 once again renaming modules so that they have unique names.
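
For example, a user could explicitly include both versions and rename their
modules apart (a sketch; the precise surface syntax for naming a particular
package instance is not settled):

\begin{verbatim}
package my-experiment where
  include foo-1.0 (Data.Foo as Data.Foo.Old)
  include foo-1.1 (Data.Foo as Data.Foo.New)
  Compare = [ import qualified Data.Foo.Old
            ; import qualified Data.Foo.New
            ; ... ]
\end{verbatim}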
881
882 There are two consequences of this. First, while the default package
883 set may enforce consistency, a user should still be able to explicitly
884 ask for a package instance, renamed so that its modules don't conflict,
885 and then use it in their program. Second, instantiated indefinite packages
886 should \emph{never} be placed in the default set, since it's impossible
887 to know which instantiation is the one the user prefers. A definite package
888 can reexport an instantiated module under an unambiguous name if the user
889 so pleases.
890
891 \paragraph{Shadowing, installed package IDs, ABI hashes and package
keys} Shadowing plays an important role in maintaining the soundness of
893 compilation; call this the \emph{compatibility} of the package set. The
894 problem it addresses is when there are two distinct implementations of a
895 module, but because their package ID (or package key, in the new world
896 order) are the same, they are considered type equal. It is absolutely
897 wrong for a single program to include both implementations
898 simultaneously (the symbols would conflict and GHC would incorrectly
899 conclude things were type equal when they're not), so \emph{shadowing}'s
900 job is to ensure that only one instance is picked, and all the other
901 instances considered invalid (and their reverse-dependencies, etc.)
902 Recall that in current GHC, within a package database, a package
903 instance is uniquely identified by its package ID\@; thus, shadowing
904 only needs to take place between package databases. An interesting
905 corner case is when the same package ID occurs in both databases, but
906 the installed package IDs are the \emph{same}. Because the installed
907 package ID is currently simply an ABI hash, we skip shadowing, because
908 the packages are---in principle---interchangeable.
909
910 There are currently a number of proposed changes to this state of affairs:
911
912 \begin{itemize}
913 \item Change installed package IDs to not be based on ABI hashes.
914 ABI hashes have a number of disadvantages as identifiers for
915 packages in the database. First, they cannot be computed until
after compilation, which caused some headaches for the multi-instance
GSoC project a few years ago. Second, it's not really true that
918 programs with identical ABI hashes are interchangeable: a new
919 package may be ABI compatible but have different semantics.
920 Thus, installed package IDs are a poor unique identifier for
921 packages in the package database. However, because GHC does not
922 give ABI stability guarantees, it would not be possible to
923 assume from here that packages with the same installed package
924 ID are ABI compatible.
925
926 \item Relaxing the uniqueness constraint on package IDs. There are
927 actually two things that could be done here. First, since we
928 have augmented package IDs with dependency resolution
929 information to form package keys, we could simply state that
930 package keys uniquely identify a package in a database.
931 Shadowing rules can be implemented in the same way as before, by
932 preferring the instance topmost on the stack. Second, we could
933 also allow \emph{same-database} shadowing: that is, not even
934 package keys are guaranteed to be unique in a database: instead,
935 installed package IDs are the sole unique identifier of a
936 package. This architecture is Nix inspired, as the intent is
937 to keep all package information in a centralized database.
938 \end{itemize}
939
940 Without mandatory package environments, same-database shadowing is a bad
941 idea, because GHC now has no idea how to resolve shadowing. Conflicting
942 installed package IDs can be simulated by placing them in multiple
943 package databases (in principle, the databases can be concatenated together
and treated as a single monolithic database).
945
946 \section{Shapeless Backpack}\label{sec:simplifying-backpack}
947
948 Backpack as currently defined always requires a \emph{shaping} pass,
949 which calculates the shapes of all modules defined in a package.
950 The shaping pass is critical to the solution of the double-vision problem
951 in recursive module linking, but it also presents a number of unpalatable
952 implementation problems:
953
954 \begin{itemize}
955
956 \item \emph{Shaping is a lot of work.} A module shape specifies the
provenance of all data types and identifiers defined by a
958 module. To calculate this, we must preprocess and parse all
959 modules, even before we do the type-checking pass. (Fortunately,
960 shaping doesn't require a full parse of a module, only enough
961 to get identifiers. However, it does have to understand import
962 statements at the same level of detail as GHC's renamer.)
963
964 \item \emph{Shaping must be done upfront.} In the current Backpack
965 design, all shapes must be computed before any typechecking can
966 occur. While performing the shaping pass upfront is necessary
967 in order to solve the double vision problem (where a module
968 identity may be influenced by later definitions), it means
969 that GHC must first do a shaping pass, and then revisit every module and
970 compile them proper. Nor is it (easily) possible to skip the
971 shaping pass when it is unnecessary, as one might expect to be
972 the case in the absence of mutual recursion. Shaping is not
973 a ``pay as you go'' language feature.
974
975 \item \emph{GHC can't compile all programs shaping accepts.} Shaping
976 accepts programs that GHC, with its current hs-boot mechanism, cannot
977 compile. In particular, GHC requires that any data type or function
978 in a signature actually be \emph{defined} in the module corresponding
979 to that file (i.e., an original name can be assigned to these entities
980 immediately.) Shaping permits unrestricted exports to implement
981 modules; this shows up in the formalism as $\beta$ module variables.
982
983 \item \emph{Shaping encourages inefficient program organization.}
984 Shaping is designed to enable mutually recursive modules, but as
985 currently implemented, mutual recursion is less efficient than
986 code without recursive dependencies. Programmers should avoid
987 this code organization, except when it is absolutely necessary.
988
989 \item \emph{GHC is architecturally ill-suited for directly
990 implementing shaping.} Shaping implies that GHC's internal
991 concept of an ``original name'' be extended to accommodate
992 module variables. This is an extremely invasive change to all
993 aspects of GHC, since the original names assumption is baked
994 quite deeply into the compiler. Plausible implementations of
shaping require all these variables to be skolemized outside
996 of GHC\@.
997
998 \end{itemize}
999
1000 To be clear, the shaping pass is fundamentally necessary for some
1001 Backpack packages. Here is the example which convinced Simon:
1002
1003 \begin{verbatim}
1004 package p where
1005 A :: [data T; f :: T -> T]
1006 B = [export T(MkT), h; import A(f); data T = MkT; h x = f MkT]
1007 A = [export T(MkT), f, h; import B; f MkT = MkT]
1008 \end{verbatim}
1009
1010 The key to this example is that B \emph{may or may not typecheck} depending
on the definition of A. Because A reexports B's definition of T, B will
1012 typecheck; but if A defined T on its own, B would not typecheck. Thus,
1013 we \emph{cannot} typecheck B until we have done some analysis of A (the
1014 shaping analysis!)
1015
1016 Thus, it is beneficial (from an optimization point of view) to
1017 consider a subset of Backpack for which shaping is not necessary.
1018 Here is a programming discipline which does just that, which we will call the \textbf{linking restriction}: \emph{Module implementations must be declared before
1019 signatures.} Formally, this restriction modifies the rule for merging
1020 polarized module shapes ($\widetilde{\tau}_1^{m_1} \oplus \widetilde{\tau}_2^{m_2}$) so that
1021 $\widetilde{\tau}_1^- \oplus \widetilde{\tau}_2^+$ is always undefined.\footnote{This seemed to be the crispest way of defining the restriction, although this means an error happens a bit later than I'd like it to: I'd prefer if we errored while merging logical contexts, but we don't know what is a hole at that point.}
1022
1023 Here is an example of the linking restriction. Consider these two packages:
1024
1025 \begin{verbatim}
1026 package random where
1027 System.Random = [ ... ].hs
1028
1029 package monte-carlo where
1030 System.Random :: ...
1031 System.MonteCarlo = [ ... ].hs
1032 \end{verbatim}
1033
1034 Here, random is a definite package which may have been compiled ahead
1035 of time; monte-carlo is an indefinite package with a dependency
1036 on any package which provides \verb|System.Random|.
1037
1038 Now, to link these two applications together, only one ordering
1039 is permissible:
1040
1041 \begin{verbatim}
1042 package myapp where
1043 include random
1044 include monte-carlo
1045 \end{verbatim}
1046
1047 If myapp wants to provide its own random implementation, it can do so:
1048
1049 \begin{verbatim}
1050 package myapp2 where
1051 System.Random = [ ... ].hs
1052 include monte-carlo
1053 \end{verbatim}
1054
1055 In both cases, all of \verb|monte-carlo|'s holes have been filled in by the time
1056 it is included. The alternate ordering is not allowed.
1057
1058 Why does this discipline prevent mutually recursive modules? Intuitively,
1059 a hole is the mechanism by which we can refer to an implementation
1060 before it is defined; otherwise, we can only refer to definitions which
precede our definition. If there are never any holes \emph{which get filled},
1062 implementation links can only go backwards, ruling out circularity.
1063
1064 It's easy to see how mutual recursion can occur if we break this discipline:
1065
1066 \begin{verbatim}
1067 package myapp2 where
1068 include monte-carlo
1069 System.Random = [ import System.MonteCarlo ].hs
1070 \end{verbatim}
1071
1072 \subsection{Typechecking of definite modules without shaping}
1073
1074 If we are not carrying out a shaping pass, we need to be able to calculate
1075 $\widetilde{\Xi}_{\mathsf{pkg}}$ on the fly. In the case that we are
1076 compiling a package---there will be no holes in the final package---we
1077 can show that shaping is unnecessary quite easily, since with the
1078 linking restriction, everything is definite from the get-go.
1079
1080 Observe the following invariant: at any given step of the module
1081 bindings, the physical context $\widetilde{\Phi}$ contains no
1082 holes. We can thus conclude that there are no module variables in any
1083 type shapes. As the only time a previously calculated package shape can
1084 change is due to unification, the incrementally computed shape is in
1085 fact the true one.
1086
1087 As far as the implementation is concerned, we never have to worry
1088 about handling module variables; we only need to do extra typechecks
1089 against (renamed) interface files.
1090
1091 \subsection{Compiling definite packages}\label{sec:compiling}
1092
1093 % New definitions
1094 \algnewcommand\algorithmicswitch{\textbf{switch}}
1095 \algnewcommand\algorithmiccase{\textbf{case}}
1096 \algnewcommand\algorithmicassert{\texttt{assert}}
1097 % New "environments"
1098 \algdef{SE}[SWITCH]{Switch}{EndSwitch}[1]{\algorithmicswitch\ #1\ \algorithmicdo}{\algorithmicend\ \algorithmicswitch}%
1099 \algdef{SE}[CASE]{Case}{EndCase}[1]{\algorithmiccase\ ``#1''}{\algorithmicend\ \algorithmiccase}%
1100 \algtext*{EndSwitch}%
1101 \algtext*{EndCase}%
1102
1103 \begin{algorithm}
1104 \caption{Compilation of definite packages (assume \texttt{-hide-all-packages} on all \texttt{ghc} invocations)}\label{alg:compile}
1105 \begin{algorithmic}
1106 \Procedure{Compile}{\textbf{package} $P$ \textbf{where} $\vec{B}$, $H$, $db$}\Comment{}$H$ maps hole module names to identities
1107 \State$flags\gets \nil$
1108 \State$\mathcal{K}$ $\gets$ \Call{Hash}{$P + H$}
1109 \State%
1110 In-place register the package $\mathcal{K}$ in $db$
1111 \For{$B$ \textbf{in} $\vec{B}$}
1112 \Case{$p = p\texttt{.hs}$}
1113 \State\Call{Exec}{\texttt{ghc -c} $p$\texttt{.hs} \texttt{-package-db} $db$ \texttt{-package-name} $\mathcal{K}$ $flags$}
1114 \EndCase%
1115 \Case{$p$ $\cc$ $p$\texttt{.hsig}}
1116 \State\Call{Exec}{\texttt{ghc -c} $p$\texttt{.hsig} \texttt{-package-db} $db$ \texttt{--sig-of} $H(p)$ $flags$}
1117 \EndCase%
1118 \Case{$p = p'$}
1119 \State$flags\gets flags$ \texttt{-alias} $p$ $p'$
1120 \EndCase%
1121 \Case{\Cinc{$P'$} $\langle\vec{p_H\mapsto p_H'}, \vec{p\mapsto p'} \rangle$}
1122 \State\textbf{let} $H'(p_H) = $ \Call{Exec}{\texttt{ghc --resolve-module} $p_H'$ \texttt{-package-db} $db$ $flags$}
1123 \State$\mathcal{K}'\gets$ \Call{Compile}{$P'$, $H'$, $db$}\Comment{}Nota bene: not $flags$
1124 \State$flags\gets flags$ \texttt{-package} $\mathcal{K}'$ $\langle\vec{p\mapsto p'}\rangle$
1125 \EndCase%
1126 \EndFor%
1127 \State%
1128 Remove $\mathcal{K}$ from $db$
1129 \State%
1130 Install the complete package $\mathcal{K}$ to the global database
1131 \State\Return$\mathcal{K}$
1132 \EndProcedure%
1133 \end{algorithmic}
1134 \end{algorithm}
1135
1136 The full recursive procedure for compiling a Backpack package using
one-shot compilation is given in Algorithm~\ref{alg:compile}. We
1138 recursively walk through Backpack descriptions, processing each line by
1139 invoking GHC and/or modifying our package state. Here is a more
1140 in-depth description of the algorithm, line-by-line:
1141
1142 \paragraph{The parameters} To compile a package description for package
1143 $P$, we need to know $H$, the mapping of holes $p_H$ in package $P$ to
1144 physical module identities $\nu$ which are implementing them; this
1145 mapping is used to calculate the package key $\mathcal{K}$ for the
1146 package in question. Furthermore, we have an inplace package database
1147 $db$ in which we will register intermediate build results, including
1148 partially compiled parent packages which may provide implementations
1149 of holes for packages they include.
1150
1151 \subsection{Compiling implementations}
1152
1153 We compile modules in the same way we do today, but with some extra
1154 package visibility $flags$, which let GHC know how to resolve imports
1155 and look up original names. We'll describe what the new flags are
1156 and also discuss some subtleties with module lookup.
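
To make this concrete, a single module binding such as
\verb|System.MonteCarlo = [ ... ].hs| might elaborate to an invocation
along the following lines (the package key, database path, and the
\texttt{-package base} flag are invented for illustration; they stand in
for $\mathcal{K}$, $db$, and the accumulated $flags$):

\begin{verbatim}
ghc -c System/MonteCarlo.hs \
    -hide-all-packages \
    -package-db dist/inplace.db \
    -package-name montecarlo_KEY \
    -package base
\end{verbatim}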
1157
1158 \paragraph{In-place registration} Perhaps surprisingly, we start
1159 compilation by registering the (uncompiled) package in the in-place
1160 package database. This registration does not expose packages, and is
1161 purely intended to inform the compilation of subpackages where to
1162 find modules that are provided by the parent (in-progress) package,
as well as provide auxiliary information, e.g., the package name
1164 and version for error reporting. The pre-registration trick is an old
1165 one used by the GHC build system; the key invariant to look out for
1166 is that we shouldn't reference original names in modules that haven't
1167 been built yet. This is enforced by our manual tracking of holes in
1168 $H$: a module can't occur in $H$ unless it's already been compiled!
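
As a sketch, the in-place registration for the package being compiled might
record something like the following (the field names follow the installed
package database format; the key and paths are invented for illustration):

\begin{verbatim}
name:            montecarlo
version:         1.0
id:              montecarlo_KEY
exposed:         False
exposed-modules: System.MonteCarlo
import-dirs:     dist/build
\end{verbatim}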
1169
1170 \paragraph{New package resolution algorithm} Currently, invocations
1171 of \texttt{-package} and similar flags have the result of \emph{hiding}
1172 other exposed packages with the same name. However, this is not going
1173 to work for Backpack: an indefinite package may get loaded multiple times
1174 with different instantiations, and it might even make sense to load multiple
1175 versions of the same package simultaneously, as long as their modules
1176 are renamed to not conflict.
1177
1178 Thus, we impose the following behavior change: when
1179 \texttt{-hide-all-packages} is specified, we do \emph{not} automatically
1180 hide packages with the same name as a package specified by
1181 \texttt{-package} (or a similar flag): they are all included, even if
1182 there are conflicts. To deal with conflicts, we augment the syntax of
1183 \texttt{-package} to also accept a list of thinnings and renamings, e.g.
1184 \texttt{-package} \pname{containers} $\langle\m{Data.Set},
1185 \m{Data.Map}\mapsto \m{Map}\rangle$ says to make visible for import
1186 \m{Data.Set} and \m{Map} (which is \m{Data.Map} renamed.) This means
1187 that
1188 \texttt{-package} \pname{containers-0.9} $\langle\m{Data.Set}\mapsto
1189 \m{Set09}\rangle$ \texttt{-package} \pname{containers-0.8}
1190 $\langle\m{Data.Set}\mapsto \m{Set08}\rangle$ now uses both packages
1191 concurrently (previously, GHC would hide one of them.)
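
For example, the two-\pname{containers} scenario above might be spelled on
the command line roughly as follows (the concrete syntax of the
thinning/renaming list is illustrative only):

\begin{verbatim}
ghc -c M.hs -hide-all-packages \
    -package "containers-0.9 (Data.Set as Set09)" \
    -package "containers-0.8 (Data.Set as Set08)"
\end{verbatim}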
1192
1193 Additionally, it's important to note that two packages exporting the
1194 same module do not \emph{necessarily} cause a conflict; the modules
1195 may be linkable. For example, \texttt{-package} \pname{containers} $\langle\m{Data.Set}\rangle$
1196 \texttt{-package} \pname{containers} $\langle\m{Data.Set}\rangle$ is fine, because
1197 precisely the same implementation of \m{Data.Set} is loaded in both cases.
1198 A similar situation can occur with signatures:
1199
1200 \begin{verbatim}
1201 package p where
1202 A :: [ x :: Int ]
package q where
1204 include p
1205 A :: [ y :: Int ]
1206 B = [ import A; z = x + y ] -- *
1207 package r where
1208 A = [ x = 0; y = 0 ]
1209 include q
1210 \end{verbatim}
1211
1212 Here, both \pname{p} and \pname{q} are visible when compiling the starred
1213 module, which compiles with the flags \texttt{-package} \pname{p}, but there
are two interface files available for \m{A}: one defined locally, and one from \pname{p}.
Both of these interface files \emph{forward} to the original implementation in
\pname{r} (more on this in the ``Compiling signatures'' section), so rather than
1217 reporting an ambiguous import, we instead have to merge the two interface files
1218 together and use the result as the interface for the module. (This could be done
1219 on the fly, or we could generate merged interface files as we go along.)
1220
Note that we do not need to merge signatures with an implementation; in such
cases, we should just use the implementation's interface. For example:
1223
1224 \begin{verbatim}
1225 package p where
1226 A :: ...
1227 package q where
1228 A = ...
1229 include p
1230 B = [ import A ] -- *
1231 \end{verbatim}
1232
1233 Here, \m{A} is available both from \pname{p} and \pname{q}, but the use in the
1234 starred module should be done with respect to the full implementation.
1235
1236 \paragraph{The \texttt{-alias} flag} We introduce a new flag
1237 \texttt{-alias} for aliasing modules. Aliasing is analogous to
1238 the merging that can occur when we include packages, but it also applies
1239 to modules which are locally defined. When we alias a module $p$ with
1240 $p'$, we require that $p'$ exists in the current module mapping, and then
we attempt to add an entry for it under the name $p$. If there is no mapping for
1242 $p$, this succeeds; otherwise, we apply the same conflict resolution algorithm.
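
For instance, an alias binding in a Backpack package simply extends the
flags passed to the GHC invocations for all subsequent bindings (a sketch;
the package and module names are invented):

\begin{verbatim}
package myapp3 where
    System.Random = [ ... ].hs
    System.Random.Compat = System.Random
        -- subsequent invocations receive:
        --   -alias System.Random.Compat System.Random
    include monte-carlo
\end{verbatim}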
1243
1244 \subsection{Compiling signatures}
1245
1246 Signature compilation is triggered when we compile a signature file.
This mode is similar to how we process \verb|hs-boot| files, except
1248 we pass an extra flag \verb|--sig-of| which specifies what the
1249 identity of the actual implementation of the signature is (according to our $H$
1250 mapping). This is guaranteed to exist, due to the linking
1251 restriction, although it may be in a partially registered package
in $db$. If the implementation is \emph{not} already exposed under the same name
as the signature, we output an \texttt{hisig} file which, for all declarations the
signature exposes, forwards their definitions to the original
implementation file. The intent is that any code in the current package
which compiles against this signature will use this \texttt{hisig} file,
not the original \texttt{hi} file.
For example, the \texttt{hisig} file produced when compiling the starred signature below
1259 points to the implementation in package \pname{q}.
1260
1261 \begin{verbatim}
1262 package p where
1263 A :: ... -- *
1264 B = [ import A; ... ]
1265 package q where
1266 A = [ ... ]
1267 include p
1268 \end{verbatim}
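
Concretely, compiling the starred signature when \pname{p} is included from
\pname{q} might look like the following sketch, where \texttt{q\_KEY} stands
for the (invented) package key of \pname{q} and the textual syntax for
naming the implementing module is only illustrative:

\begin{verbatim}
ghc -c A.hsig --sig-of "q_KEY:A" \
    -hide-all-packages \
    -package-db dist/inplace.db \
    -package-name p_KEY
\end{verbatim}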
1269
1270 \paragraph{Sometimes \texttt{hisig} is unnecessary}
1271 In the following package:
1272
1273 \begin{verbatim}
1274 package p where
1275 P = ...
1276 P :: ...
1277 \end{verbatim}
1278
1279 Paper Backpack specifies that we check the signature \m{P} against implementation
1280 \m{P}, but otherwise no changes are made (i.e., the signature does not narrow
1281 the implementation.) In this case, it is not necessary to generate an \texttt{hisig} file;
1282 the original interface file suffices.
1283
1284 \paragraph{Multiple signatures} As a simplification, we assume that there
1285 is only one signature per logical name in a package. (This prevents
1286 us from expressing mutual recursion in signatures, but let's not worry
1287 about it for now.)
1288
\paragraph{Restricted recursive modules \`a la hs-boot}\label{sec:hs-boot-restrict}
1290 When we compile an \texttt{hsig} file without any \texttt{--sig-of} flag (because
1291 no implementation is known), we fall back to old-style GHC mutual recursion.
1292 Na\"\i vely, a shaping pass would be necessary;
1293 so we adopt an existing constraint that
1294 already applies to hs-boot files: \emph{at the time we define a signature,
we must know what the original names of all data types are}. In practice,
GHC enforces this by requiring that: (1) an hs-boot file must be
accompanied by an implementation, and (2) the implementation must
in fact define (and not reexport) all of the declarations in the signature.
We can discover if a signature is intended to break a recursive module loop
when we discover that $p \notin H$; in this case, we fall back to the
1301 old hs-boot behavior. (Alternatively, the user can explicitly ask for it.)
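
For reference, the existing hs-boot discipline that we fall back to looks
like the following; note that \texttt{A.hs} itself defines \texttt{T}
rather than reexporting it:

\begin{verbatim}
-- A.hs-boot
module A where
data T

-- B.hs
module B where
import {-# SOURCE #-} A
f :: T -> T
f x = x

-- A.hs
module A where
import B
data T = MkT
\end{verbatim}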
1302
1303 Why does this not require a shaping pass? The reason is that the
1304 signature is not really polymorphic: we require that the $\alpha$ module
1305 variable be resolved to a concrete module later in the same package, and
1306 that all the $\beta$ module variables be unified with $\alpha$. Thus, we
1307 know ahead of time the original names and don't need to deal with any
1308 renaming.\footnote{This strategy doesn't completely resolve the problem
1309 of cross-package mutual recursion, because we need to first compile a
1310 bit of the first package (signatures), then the second package, and then
1311 the rest of the first package.}
1312
1313 Compiling packages in this way gives the tantalizing possibility
1314 of true separate compilation: the only thing we don't know is what the actual
package name of an indefinite package will be, and thus what the correct
references are. This is a very minor change to the assembly, so one could conceive
1317 of dynamically rewriting these references at the linking stage. But
1318 separate compilation achieved in this fashion would not be able to take
1319 advantage of cross-module optimizations.
1320
1321 \subsection{Compiling includes}
1322
Includes are the most interesting part of the compilation process: we have to
calculate $H'$, which records how the holes of the subpackage we are including are
filled, and modify our flags to make the exports of the include visible to subsequently
1326 compiled modules. We consider the case with renaming, since includes with
1327 no renaming are straightforward.
1328
First, we assume that we know \emph{a priori} what the holes $p_H$ of a
package are (either by some sort of pre-pass, or by explicit
1331 declaration.) For each of their \emph{renamed targets} $p'_H$, we look
1332 up the module in the current $flags$ environment, retrieving the
1333 physical module identity by consulting GHC with the
1334 \texttt{--resolve-module} flag and storing it in $H'$. (This can be done in batch.)
1335 For example:
1336
1337 \begin{verbatim}
1338 package p where
1339 A :: ...
1340 ...
1341 package q where
1342 A = [ ... ]
1343 B = [ ... ]
1344 include p (A as B)
1345 \end{verbatim}
1346
When computing the entry $H'(\m{A})$, we run the command \texttt{ghc --resolve-module} \m{B}.
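
A sketch of what this invocation might look like, with an invented package
key for \pname{q}; the output format shown is an assumption:

\begin{verbatim}
$ ghc --resolve-module B -package-db dist/inplace.db ...
q_KEY:B
\end{verbatim}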
1348
1349 Next, we recursively call \textsc{Compile} with the computed $H'$.
1350 Note that the entries in $H$ may refer to modules which would not be
1351 picked up by $flags$, but they will be registered in the inplace
1352 package database $db$.
1353 For example, in this situation:
1354
1355 \begin{verbatim}
1356 package p where
1357 B :: ...
1358 C = [ import B; ... ]
1359 package q where
1360 A = [ ... ]
1361 B = [ import A; ... ]
1362 include p
1363 D = [ import C; ... ]
1364 \end{verbatim}
1365
1366 When we recursively process package \pname{p}, $H$ will refer to
1367 \pname{q}:\m{B}, and we need to know where to find it (\pname{q} is only
1368 partially processed and so is in the inplace package database.)
1369 Furthermore, the interface file for \m{B} may refer to \pname{q}:\m{A},
1370 and thus we likewise need to know how to find its interface file.
1371
Note that the inplace package database is not expected to expose any
packages. Otherwise, this example would improperly compile:
1374
1375 \begin{verbatim}
1376 package p where
1377 B = [ import A; ... ]
1378 package q where
1379 A = ...
1380 include p
1381 \end{verbatim}
1382
1383 \pname{p} does not compile on its own, so it should not compile if it is
1384 recursively invoked from \pname{q}. However, if we exposed the modules
1385 of the partially registered package \pname{q}, \m{A} is now suddenly
1386 resolvable.
1387
Finally, once the subpackage is compiled, we can add it to our $flags$ so that later
modules we compile see its (appropriately thinned and renamed) modules, just as
with aliasing.
1391
1392 \paragraph{Absence of an \texttt{hi} file}
1393 It is important that \texttt{--resolve-module} truly looks up the \emph{implementor}
1394 of a module, and not just a signature which is providing it at some name.
1395 Sometimes, a little extra work is necessary to compute this, for example:
1396
1397 \begin{verbatim}
1398 package p where
1399 A :: [ y :: Int ]
1400 package q where
1401 A :: [ x :: Int ]
1402 include p -- *
1403 package r where
1404 A = [ x = 0; y = 1 ]
1405 include q
1406 \end{verbatim}
1407
1408 When computing $H'$ for the starred include, our $flags$ only include
1409 \texttt{-package-dir} \pname{r} $cwd_r$ $\langle\rangle$: with a thinning
1410 that excludes all modules! The only interface file we can pick up with these
$flags$ is the local definition of \m{A}. However, we \emph{absolutely should not}
set $H'(\m{A})=\pname{q}:\m{A}$; if we did, then we would incorrectly
1413 conclude when compiling the signature in \pname{p} that the implementation
1414 doesn't export enough identifiers to fulfill the signature (\texttt{y} is not
1415 available from just the signature in \pname{q}). Instead, we have to look
1416 up the original implementor of \m{A} in \pname{r}, and use that in $H'$.
1417
1418 \subsection{Commentary}
1419
\paragraph{Just because it compiles doesn't mean the individual packages typecheck}
1421 The compilation mechanism described is slightly more permissive than vanilla Backpack.
1422 Here is a simple example:
1423
1424 \begin{verbatim}
1425 package p where
1426 A :: [ data T = T ]
1427 B :: [ data T = T ]
1428 C = [
1429 import A
1430 import B
1431 x = A.T :: B.T
1432 ]
1433 package q where
1434 A = [ data T = T ]
1435 B = A
1436 include p
1437 \end{verbatim}
1438
1439 Here, we incorrectly decide \m{A}\verb|.T| and \m{B}\verb|.T| are type
1440 equal when typechecking \m{C}, because the \verb|hisig| files we
1441 generate for them all point to the same original implementation. However,
1442 \pname{p} should not typecheck.
1443
The problem here is that type checking asks ``does it typecheck with respect
to all possible instantiations of the holes?'', whereas compilation asks
``does it compile with respect to this particular instantiation of the holes?''
1447 In the absence of a shaping pass, this problem is unavoidable.
1448
1449 \section{Shaped Backpack}
1450
1451 Despite the simplicity of shapeless Backpack with the linking
1452 restriction in the absence of holes, we will find that when we have
1453 holes, it will be very difficult to do type-checking without
1454 some form of shaping. This section is very much a work in progress,
1455 but the ability to typecheck against holes, even with the linking restriction,
1456 is a very important part of modular separate development, so we will need
1457 to support it at some point.
1458
1459 \subsection{Efficient shaping}
1460
(This is Edward's opinion; he hasn't convinced other folks that this is
1462 the right way to do it.)
1463
1464 In this section, I want to argue that, although shaping constitutes
1465 a pre-pass which must be run before compilation in earnest, it is only
1466 about as bad as the dependency resolution analysis that GHC already does
1467 in \verb|ghc -M| or \verb|ghc --make|.
1468
1469 In Paper Backpack, what information does shaping compute? It looks at
1470 exports, imports, data declarations and value declarations (but not the
1471 actual expressions associated with these values.) As a matter of fact,
1472 GHC already must look at the imports associated with a package in order
1473 to determine the dependency graph, so that it can have some order to compile
1474 modules in. There is a specialized parser which just parses these statements,
1475 and then ignores the rest of the file.
1476
1477 A bit of background: the \emph{renamer} is responsible for resolving
1478 imports and figuring out where all of these entities actually come from.
1479 SPJ would really like to avoid having to run the renamer in order to perform
1480 a shaping pass.
1481
1482 \paragraph{Is it necessary to run the Renamer to do shaping?}
1483 Edward and Scott believe the answer is no, well, partially.
1484 Shaping needs to know the original names of all entities exposed by a
1485 module/signature. Then it needs to know (a) which entities a module/signature
1486 defines/declares locally and (b) which entities that module/signature exports.
1487 The former, (a), can be determined by a straightforward inspection of a parse
1488 tree of the source file.\footnote{Note that no expression or type parsing
1489 is necessary. We only need names of local values, data types, and data
1490 constructors.} The latter, (b), is a bit trickier. Right now it's the Renamer
1491 that interprets imports and exports into original names, so we would still
1492 rely on that implementation. However, the Renamer does other, harder things
1493 that we don't need, so ideally we could factor out the import/export
1494 resolution from the Renamer for use in shaping.
1495
1496 Unfortunately the Renamer's import resolution analyzes \verb|.hi| files, but for
1497 local modules, which haven't yet been typechecked, we don't have those.
1498 Instead, we could use a new file format, \verb|.hsi| files, to store the shape of
1499 a locally defined module. (Defined packages are bundled with their shapes,
1500 so included modules have \verb|.hsi| files as well.) (What about the logical
1501 vs.~physical distinction in file names?) If we refactor the import/export
1502 resolution code, could we rewrite it to generically operate on both
1503 \verb|.hi| files and \verb|.hsi| files?
1504
1505 Alternatively, rather than storing shapes on a per-source basis, we could
1506 store (in memory) the entire package shape. Similarly, included packages
1507 could have a single shape file for the entire package. Although this approach
1508 would make shaping non-incremental, since an entire package's shape would
1509 be recomputed any time a constituent module's shape changes, we do not expect
1510 shaping to be all that expensive.
1511
1512 \subsection{Typechecking of indefinite modules}\label{sec:typechecking-indefinite}
1513
Recall our argument in the definite case, where we showed there are
1515 no holes in the physical context. With indefinite modules, this is no
1516 longer true. While (with the linking restriction) these holes will never
1517 be linked against a physical implementation, they may be linked against
1518 other signatures. (Note: while disallowing signature linking would
1519 solve our problem, it would disallow a wide array of useful instances of
1520 signature reuse, for example, a package mylib that implements both
1521 mylib-1x-sig and mylib-2x-sig.)
1522
1523 With holes, we must handle module variables, and we sometimes must unify them:
1524
1525 \begin{verbatim}
1526 package p where
1527 A :: [ data A ]
1528 package q where
1529 A :: [ data A ]
1530 package r where
1531 include p
1532 include q
1533 \end{verbatim}
1534
1535 In this package, it is not possible to a priori assign original names to
1536 module A in p and q, because in package r, they should have the same
1537 original name. When signature linking occurs, unification may occur,
1538 which means we have to rename all relevant original names. (A similar
1539 situation occurs when a module is typechecked against a signature.)
1540
1541 An invariant which would be nice to have is this: when typechecking a
1542 signature or including a package, we may apply renaming to the entities
1543 being brought into context. But once we've picked an original name for
1544 our entities, no further renaming should be necessary. (Formally, in the
1545 unification for semantic object shapes, apply the unifier to the second
1546 shape, but not the first one.)
1547
1548 However, there are plenty of counterexamples here:
1549
1550 \begin{verbatim}
1551 package p where
1552 A :: [ data A ]
1553 B :: [ data A ]
1554 M = ...
1555 A = B
1556 \end{verbatim}
1557
1558 In this package, does module M know that A.A and B.A are type equal? In
1559 fact, the shaping pass will have assigned equal module identities to A
1560 and B, so M \emph{equates these types}, despite the aliasing occurring
1561 after the fact.
1562
1563 We can make this example more sophisticated, by having a later
1564 subpackage which causes the aliasing; now, the decision is not even a
1565 local one (on the other hand, the equality should be evident by inspection
1566 of the package interface associated with q):
1567
1568 \begin{verbatim}
1569 package p where
1570 A :: [ data A ]
1571 B :: [ data A ]
1572 package q where
1573 A :: [ data A ]
1574 B = A
1575 package r where
1576 include p
1577 include q
1578 \end{verbatim}
1579
1580 Another possibility is that it might be acceptable to do a mini-shaping
1581 pass, without parsing modules or signatures, \emph{simply} looking at
1582 names and aliases. But logical names are not the only mechanism by
1583 which unification may occur:
1584
1585 \begin{verbatim}
1586 package p where
1587 C :: [ data A ]
1588 A = [ data A = A ]
1589 B :: [ import A(A) ]
1590 C = B
1591 \end{verbatim}
1592
1593 It is easy to conclude that the original names of C and B are the same. But
1594 more importantly, C.A must be given the original name of p:A.A. This can only
1595 be discovered by looking at the signature definition for B. In any case, it
1596 is worth noting that this situation parallels the situation with hs-boot
1597 files (although there is no mutual recursion here).
1598
1599 The conclusion is that you will probably, in fact, have to do real
1600 shaping in order to typecheck all of these examples.
1601
1602 \paragraph{Hey, these signature imports are kind of tricky\ldots}
1603
1604 When signatures and modules are interleaved, the interaction can be
1605 complex. Here is an example:
1606
1607 \begin{verbatim}
1608 package p where
1609 C :: [ data A ]
1610 M = [ import C; ... ]
1611 A = [ import M; data A = A ]
1612 C :: [ import A(A) ]
1613 \end{verbatim}
1614
1615 Here, the second signature for C refers to a module implementation A
1616 (this is permissible: it simply means that the original name for p:C.A
1617 is p:A.A). But wait! A relies on M, and M relies on C. Circularity?
1618 Fortunately not: a client of package p will find it impossible to have
1619 the hole C implemented in advance, since they will need to get their hands on module
1620 A\ldots but it will not be defined prior to package p.
1621
1622 In any case, however, it would be good to emit a warning if a package
1623 cannot be compiled without mutual recursion.
1624
1625 \subsection{Rename on entry}
1626
1627 Consider the following example:
1628
1629 \begin{verbatim}
1630 package p where
1631 A :: [ data T = T ]
1632 B = [ import A; x = T ]
1633 package q where
1634 C :: ...
1635 A = [ data T = T ]
1636 include p
1637 D = [
1638 import qualified A
1639 import qualified B
1640 import C
1641 x = B.T :: A.T
1642 ]
1643 \end{verbatim}
1644
1645 We are interested in type-checking \pname{q}, which is an indefinite package
1646 on account of the uninstantiated hole \m{C}. Furthermore, let's suppose that
1647 \pname{p} has already been independently typechecked, and its interface files
1648 installed in some global location with $\alpha_A$ used as the module identity
1649 of \m{A}. (To simplify this example, we'll assume $\beta_{AT}=\alpha_A$.)
1650
1651 The first three lines of \pname{q} type check in the normal way, but \m{D}
1652 now poses a problem: if we load the interface file for \m{B} the normal way,
1653 we will get a reference to type \texttt{T} with the original name $\alpha_A$.\texttt{T},
1654 whereas from \m{A} we have an original name \pname{q}:\m{A}.\texttt{T}.
1655
1656 Let's suppose that we already have the result of a shaping pass, which
1657 maps our identity variables to their true identities.
1658 Let's consider the possible options here:
1659
1660 \begin{itemize}
1661 \item We could re-typecheck \pname{p}, feeding it the correct instantiations
1662 for its variables. However, this seems wasteful: we typechecked the
1663 package already, and up-to-renaming, the interface files are exactly
1664 what we need to type check our application.
1665 \item We could make copies of all the interface files, renamed to have the
1666 right original names. This also seems wasteful: why should we have to
1667 create a new copy of every interface file in a library we depend on?
1668 \item When \emph{reading in} the interface file to GHC, we could apply the
1669 renaming according to the shaping pass and store that in memory.
1670 \end{itemize}
1671
That last solution is pretty appealing; however, there are still circumstances in which
we need to create new interface files; these exactly mirror the cases described
1674 in Section~\ref{sec:compiling}.
1675
1676 \subsection{Incremental typechecking}
1677 We want to typecheck modules incrementally, i.e., when something changes in
1678 a package, we only want to re-typecheck the modules that care about that
1679 change. GHC already does this today.%
1680 \footnote{\url{https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/RecompilationAvoidance}}
1681 Is the same mechanism sufficient for Backpack? Edward and Scott think that it
1682 is, mostly. Our conjecture is that a module should be re-typechecked if the
1683 existing mechanism says it should \emph{or} if the logical shape
1684 context (which maps logical names to physical names) has changed. The latter
1685 condition is due to aliases that affect typechecking of modules.
1686
1687 Let's look again at an example from before:
1688 \begin{verbatim}
1689 package p where
1690 A :: [ data A ]
1691 B :: [ data A ]
1692 M = [ import A; import B; ... ]
1693 \end{verbatim}
1694 Let's say that \verb|M| is typechecked successfully. Now we add an alias binding
1695 at the end of the package, \verb|A = B|. Does \verb|M| need to be
1696 re-typechecked? Yes! (Well, it seems so, but let's just assert ``yes'' for now.
1697 Certainly in the reverse case---if we remove the alias and then ask---this
1698 is true, since \verb|M| might have depended on the two \verb|A| types
1699 being the same.)
1700 The logical shape context changed to say that \verb|A| and
1701 \verb|B| now map to the same physical module identity. But does the existing
1702 recompilation avoidance mechanism say that \verb|M| should be re-typechecked?
1703 It's unclear. The \verb|.hi| file for \verb|M| records that it imported \verb|A| and
1704 \verb|B| with particular ABIs, but does it also know about the physical module
1705 identities (or rather, original module names) of these modules?
1706
1707 Scott thinks this highlights the need for us to get our story straight about
1708 the connection between logical names, physical module identities, and file
1709 names!
1710
1711
1712 \subsection{Installing indefinite packages}\label{sec:installing-indefinite}
1713
1714 If an indefinite package contains no code at all, we only need
to install the interface files for the signatures. However, if
it includes code, we must provide all of the
ingredients necessary to compile it when the holes are linked against
1718 actual implementations. (Figure~\ref{fig:pkgdb})
1719
\paragraph{Source tarball or preprocessed source?} What representation of the source should be saved? There
1721 are a number of possible choices:
1722
1723 \begin{itemize}
1724 \item The original tarballs downloaded from Hackage,
1725 \item Preprocessed source files,
1726 \item Some sort of internal, type-checked representation of Haskell code (maybe the output of the desugarer).
1727 \end{itemize}
1728
1729 Storing the tarballs is the simplest and most straightforward mechanism,
1730 but we will have to be very certain that we can recompile the module
later in precisely the same way we compiled it originally, to ensure the hi
files match up (fortunately, it should be simple to perform an optional
sanity check before proceeding.) The appeal of saving preprocessed
source, or even the IRs, is that this is conceptually exactly
1735 what an indefinite package is: we have paused the compilation process
1736 partway, intending to finish it later. However, our compilation strategy
1737 for definite packages requires us to run this step using a \emph{different}
1738 choice of original names, so it's unclear how much work could actually be reused.
1739
1740 \section{Surface syntax}
1741
1742 In the Backpack paper, a brand new module language is presented, with
1743 syntax for inline modules and signatures. This syntax is probably worth implementing,
1744 because it makes it easy to specify compatibility packages, whose module
1745 definitions in general may be very short:
1746
1747 \begin{verbatim}
1748 package ishake-0.12-shake-0.13 where
1749 include shake-0.13
1750 Development.Shake.Sys = Development.Shake.Cmd
1751 Development.Shake = [ (**>) = (&>) ; (*>>) = (|*>)]
1752 Development.Shake.Rule = [ defaultPriority = rule . priority 0.5 ]
1753 include ishake-0.12
1754 \end{verbatim}
1755
1756 However, there are a few things that are less than ideal about the
1757 surface syntax proposed by Paper Backpack:
1758
1759 \begin{itemize}
\item It's completely different from the current method by which users
1761 specify packages. There's nothing wrong with this per se
1762 (one simply needs to support both formats) but the smaller
1763 the delta, the easier the new packaging format is to explain
1764 and implement.
1765
1766 \item Sometimes order matters (relative ordering of signatures and
1767 module implementations), and other times it does not (aliases).
1768 This can be confusing for users.
1769
1770 \item Users have to order module definitions topologically,
1771 whereas in current Cabal modules can be listed in any order, and
1772 GHC figures out an appropriate order to compile them.
1773 \end{itemize}
1774
1775 Here is an alternative proposal, closely based on Cabal syntax. Given
1776 the following Backpack definition:
1777
1778 \begin{verbatim}
1779 package libfoo(A, B, C, Foo) where
1780 include base
1781 -- renaming and thinning
1782 include libfoo (Foo, Bar as Baz)
1783 -- holes
1784 A :: [ a :: Bool ].hsig
1785 A2 :: [ b :: Bool ].hsig
1786 -- normal module
1787 B = [
1788 import {-# SOURCE #-} A
1789 import Foo
1790 import Baz
1791 ...
1792 ].hs
1793 -- recursively linked pair of modules, one is private
1794 C :: [ data C ].hsig
1795 D = [ import {-# SOURCE #-} C; data D = D C ].hs
1796 C = [ import D; data C = C D ].hs
1797 -- alias
1798 A = A2
1799 \end{verbatim}
1800
1801 We can write the following Cabal-like syntax instead (where
1802 all of the signatures and modules are placed in appropriately
1803 named files):
1804
1805 \begin{verbatim}
1806 package: libfoo
1807 ...
1808 build-depends: base, libfoo (Foo, Bar as Baz)
1809 holes: A A2 -- deferred for now
1810 exposed-modules: Foo B C
1811 aliases: A = A2
1812 other-modules: D
1813 \end{verbatim}
1814
1815 Notably, all of these lists are \emph{insensitive} to ordering!
The key idea is the use of the \verb|{-# SOURCE #-}| pragma, which
1817 is enough to solve the important ordering constraint between
1818 signatures and modules.
1819
1820 Here is how the elaboration works. For simplicity, in this algorithm
1821 description, we assume all packages being compiled have no holes
1822 (including \verb|build-depends| packages). Later, we'll discuss how to
1823 extend the algorithm to handle holes in both subpackages and the main
1824 package itself.
1825
1826 \begin{enumerate}
1827
1828 \item At the top-level with \verb|package| $p$ and
1829 \verb|exposed-modules| $ms$, record \verb|package p (ms) where|
1830
1831 \item For each package $p$ with thinning/renaming $ms$ in
1832 \verb|build-depends|, record a \verb|include p (ms)| in the
1833 Backpack package. The ordering of these includes does not
1834 matter, since none of these packages have holes.
1835
1836 \item Take all modules $m$ in \verb|other-modules| and
1837 \verb|exposed-modules| which were not exported by build
1838 dependencies, and create a directed graph where hs and hs-boot
1839 files are nodes and imports are edges (the target of an edge is
1840 an hs file if it is a normal import, and an hs-boot file if it
1841 is a SOURCE import). Topologically sort this graph, erroring if
1842 this graph contains cycles (even with recursive modules, the
1843 cycle should have been broken by an hs-boot file). For each
1844 node, in this order, record \verb|M = [ ... ]| or \verb|M :: [ ... ]|
1845 depending on whether or not it is an hs or hs-boot. If possible,
1846 sort signatures before implementations when there is no constraint
1847 otherwise.
1848
1849 \end{enumerate}
1850
1851 Here is a simple example which shows how SOURCE can be used to disambiguate
1852 between two important cases. Suppose we have these modules:
1853
1854 \begin{verbatim}
1855 -- A1.hs
1856 import {-# SOURCE #-} B
1857
1858 -- A2.hs
1859 import B
1860
1861 -- B.hs
1862 x = True
1863
1864 -- B.hs-boot
1865 x :: Bool
1866 \end{verbatim}
1867
1868 Then we translate the following packages as follows:
1869
1870 \begin{verbatim}
1871 exposed-modules: A1 B
1872 -- translates to
1873 B :: [ x :: Bool ]
1874 A1 = [ import B ]
1875 B = [ x = True ]
1876 \end{verbatim}
1877
1878 but
1879
1880 \begin{verbatim}
1881 exposed-modules: A2 B
1882 -- translates to
1883 B = [ x = True ]
1884 B :: [ x :: Bool ]
1885 A2 = [ import B ]
1886 \end{verbatim}
1887
The import controls placement between signature and module: in A1 it
forces B's signature to be sorted before B's implementation (whereas in
the second case, there is no constraint, so we preferentially place
B's implementation first).
1892
1893 \paragraph{Holes in the database} In the presence of holes,
1894 \verb|build-depends| resolution becomes more complicated. First,
1895 let's consider the case where the package we are building is
1896 definite, but the package database contains indefinite packages with holes.
1897 In order to maintain the linking restriction, we now have to order packages
1898 from step (2) of the previous elaboration. We can do this by creating
1899 a directed graph, where nodes are packages and edges are from holes to the
1900 package which implements them. If there is a cycle, this indicates a mutually
1901 recursive package. In the absence of cycles, a topological sorting of this
1902 graph preserves the linking invariant.
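
For example, if \pname{sigpkg} has a hole \m{A} which \pname{impl} fills,
the edge from the hole to its implementor forces the elaborated includes
into the following order (the package names are invented for illustration):

\begin{verbatim}
package mypkg where
    include impl      -- provides A
    include sigpkg    -- hole A is filled by impl's A
    ...
\end{verbatim}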
1903
1904 One subtlety to consider is the fact that an entry in \verb|build-depends|
1905 can affect how a hole is instantiated by another entry. This might be a
1906 bit weird to users, who might like to explicitly say how holes are
1907 filled when instantiating a package. Food for thought, surface syntax wise.
1908
1909 \paragraph{Holes in the package} Actually, this is quite simple: the
1910 ordering of includes goes as before, but some indefinite packages in the
database are less constrained, as their ``dependencies'' are fulfilled
1912 by the holes at the top-level of this package. It's also worth noting
1913 that some dependencies will go unresolved, since the following package
1914 is valid:
1915
1916 \begin{verbatim}
1917 package a where
1918 A :: ...
1919 package b where
1920 include a
1921 \end{verbatim}
1922
1923 \paragraph{Multiple signatures} In Backpack syntax, it's possible to
1924 define a signature multiple times, which is necessary for mutually
1925 recursive signatures:
1926
1927 \begin{verbatim}
1928 package a where
1929 A :: [ data A ]
1930 B :: [ import A; data B = B A ]
1931 A :: [ import B; data A = A B ]
1932 \end{verbatim}
1933
Critically, notice that we can see the constructors for both modules B and A
1935 after the signatures are linked together. This is not possible in GHC
1936 today, but could be possible by permitting multiple hs-boot files. Now
1937 the SOURCE pragma indicating an import must \emph{disambiguate} which
1938 hs-boot file it intends to include. This might be one way of doing it:
1939
1940 \begin{verbatim}
1941 -- A.hs-boot2
1942 data A
1943
1944 -- B.hs-boot
1945 import {-# SOURCE hs-boot2 #-} A
1946
1947 -- A.hs-boot
1948 import {-# SOURCE hs-boot #-} B
1949 \end{verbatim}
1950
1951 \paragraph{Explicit or implicit reexports} One annoying property of
1952 this proposal is that, looking at the \verb|exposed-modules| list, it is
1953 not immediately clear what source files one would expect to find in the
1954 current package. It's not obvious what the proper way to go about doing
1955 this is.
1956
1957 \paragraph{Better syntax for SOURCE} If we enshrine the SOURCE import
1958 as a way of solving Backpack ordering problems, it would be nice to have
1959 some better syntax for it. One possibility is:
1960
1961 \begin{verbatim}
1962 abstract import Data.Foo
1963 \end{verbatim}
1964
1965 which makes it clear that this module is pluggable, typechecking against
1966 a signature. Note that this only indicates how type checking should be
1967 done: when actually compiling the module we will compile against the
1968 interface file for the true implementation of the module.
1969
1970 It's worth noting that the SOURCE annotation was originally made a
1971 pragma because, in principle, it should have been possible to compile
1972 some recursive modules without needing the hs-boot file at all. But if
1973 we're moving towards boot files as signatures, this concern is less
1974 relevant.
1975
1976 \section{Type classes and type families}
1977
1978 \subsection{Background}
1979
1980 Before we talk about how to support type classes in Backpack, it's first
1981 worth talking about what we are trying to achieve in the design. Most
1982 would agree that \emph{type safety} is the cardinal law that should be
1983 preserved (in the sense that segfaults should not be possible), but
1984 there are many instances of ``bad behavior'' (top level mutable state,
1985 weakening of abstraction guarantees, ambiguous instance resolution, etc)
1986 which various Haskellers may disagree on the necessity of ruling out.
1987
1988 With this in mind, it is worth summarizing what kind of guarantees are
1989 presently given by GHC with regards to type classes and type families,
1990 as well as characterizing the \emph{cultural} expectations of the
1991 Haskell community.
1992
1993 \paragraph{Type classes} When discussing type class systems, there are
1994 several properties that one may talk about.
1995 A set of instances is \emph{confluent} if, no matter what order
1996 constraint solving is performed, GHC will terminate with a canonical set
1997 of constraints that must be satisfied for any given use of a type class.
1998 In other words, confluence says that we won't conclude that a program
1999 doesn't type check just because we swapped in a different constraint
2000 solving algorithm.
2001
2002 Confluence's closely related twin is \emph{coherence} (defined in ``Type
2003 classes: exploring the design space''). This property states that
2004 ``every different valid typing derivation of a program leads to a
2005 resulting program that has the same dynamic semantics.'' Why could
2006 differing typing derivations result in different dynamic semantics? The
2007 answer is that context reduction, which picks out type class instances,
2008 elaborates into concrete choices of dictionaries in the generated code.
2009 Confluence is a prerequisite for coherence, since one
2010 can hardly talk about the dynamic semantics of a program that doesn't
2011 type check.
2012
2013 In the vernacular, confluence and coherence are often incorrectly used
2014 to refer to another related property: \emph{global uniqueness of instances},
2015 which states that in a fully compiled program, for any type, there is at most one
2016 instance resolution for a given type class. Languages with local type
2017 class instances such as Scala generally do not have this property, and
2018 this assumption is frequently used for abstraction.
2019
2020 So, what properties does GHC enforce, in practice?
In the absence of any type system extensions, GHC employs a set of
2022 rules (described in ``Exploring the design space'') to ensure that type
2023 class resolution is confluent and coherent. Intuitively, it achieves
2024 this by having a very simple constraint solving algorithm (generate
2025 wanted constraints and solve wanted constraints) and then requiring the
2026 set of instances to be \emph{nonoverlapping}, ensuring there is only
ever one way to solve a wanted constraint. Nonoverlap is a
more stringent restriction than either confluence or coherence, and
via the \verb|OverlappingInstances| and \verb|IncoherentInstances| extensions, GHC
2030 allows a user to relax this restriction ``if they know what they're doing.''
2031
2032 Surprisingly, however, GHC does \emph{not} enforce global uniqueness of
2033 instances. Imported instances are not checked for overlap until we
2034 attempt to use them for instance resolution. Consider the following program:
2035
2036 \begin{verbatim}
2037 -- T.hs
2038 data T = T
2039 -- A.hs
2040 import T
2041 instance Eq T where
2042 -- B.hs
2043 import T
2044 instance Eq T where
2045 -- C.hs
2046 import A
2047 import B
2048 \end{verbatim}
2049
2050 When compiled with one-shot compilation, \verb|C| will not report
2051 overlapping instances unless we actually attempt to use the \verb|Eq|
2052 instance in C.\footnote{When using batch compilation, GHC reuses the
2053 instance database and is actually able to detect the duplicated
2054 instance when compiling B. But if you run it again, recompilation
2055 avoidance skips A, and it finishes compiling! See this bug:
2056 \url{https://ghc.haskell.org/trac/ghc/ticket/5316}} This is by
2057 design\footnote{\url{https://ghc.haskell.org/trac/ghc/ticket/2356}}:
2058 ensuring that there are no overlapping instances eagerly requires
2059 eagerly reading all the interface files a module may depend on.
2060
2061 We might summarize these three properties in the following manner.
2062 Culturally, the Haskell community expects \emph{global uniqueness of instances}
2063 to hold: the implicit global database of instances should be
2064 confluent and coherent. GHC, however, does not enforce uniqueness of
2065 instances: instead, it merely guarantees that the \emph{subset} of the
2066 instance database it uses when it compiles any given module is confluent and coherent. GHC does do some
2067 tests when an instance is declared to see if it would result in overlap
2068 with visible instances, but the check is by no means
2069 perfect\footnote{\url{https://ghc.haskell.org/trac/ghc/ticket/9288}};
2070 truly, \emph{type-class constraint resolution} has the final word. One
2071 mitigating factor is that in the absence of \emph{orphan instances}, GHC is
2072 guaranteed to eagerly notice when the instance database has overlap.\footnote{Assuming that the instance declaration checks actually worked\ldots}
2073
2074 Clearly, the fact that GHC's lazy behavior is surprising to most
2075 Haskellers means that the lazy check is mostly good enough: a user
2076 is likely to discover overlapping instances one way or another.
2077 However, it is relatively simple to construct example programs which
2078 violate global uniqueness of instances in an observable way:
2079
2080 \begin{verbatim}
2081 -- A.hs
2082 module A where
2083 data U = X | Y deriving (Eq, Show)
2084
2085 -- B.hs
2086 module B where
2087 import Data.Set
2088 import A
2089
2090 instance Ord U where
2091 compare X X = EQ
2092 compare X Y = LT
2093 compare Y X = GT
2094 compare Y Y = EQ
2095
2096 ins :: U -> Set U -> Set U
2097 ins = insert
2098
2099 -- C.hs
2100 module C where
2101 import Data.Set
2102 import A
2103
2104 instance Ord U where
2105 compare X X = EQ
2106 compare X Y = GT
2107 compare Y X = LT
2108 compare Y Y = EQ
2109
2110 ins' :: U -> Set U -> Set U
2111 ins' = insert
2112
2113 -- D.hs
2114 module Main where
2115 import Data.Set
2116 import A
2117 import B
2118 import C
2119
2120 test :: Set U
2121 test = ins' X $ ins X $ ins Y $ empty
2122
2123 main :: IO ()
2124 main = print test
2125
2126 -- OUTPUT
2127 $ ghc -Wall -XSafe -fforce-recomp --make D.hs
2128 [1 of 4] Compiling A ( A.hs, A.o )
2129 [2 of 4] Compiling B ( B.hs, B.o )
2130
2131 B.hs:5:10: Warning: Orphan instance: instance [safe] Ord U
2132 [3 of 4] Compiling C ( C.hs, C.o )
2133
2134 C.hs:5:10: Warning: Orphan instance: instance [safe] Ord U
2135 [4 of 4] Compiling Main ( D.hs, D.o )
2136 Linking D ...
2137 $ ./D
2138 fromList [X,Y,X]
2139 \end{verbatim}
2140
2141 Locally, all type class resolution was coherent: in the subset of
2142 instances each module had visible, type class resolution could be done
2143 unambiguously. Furthermore, the types of \verb|ins| and \verb|ins'|
discharge type class resolution, so that in \verb|D|, where the instance database
is now overlapping, no further resolution occurs, and the error is never detected.
2146
2147 It is easy to dismiss this example as an implementation wart in GHC, and
2148 continue pretending that global uniqueness of instances holds. However,
the problem with \emph{global uniqueness of instances} is that it is
2150 inherently nonmodular: you might find yourself unable to compose two
2151 components because they accidentally defined the same type class
2152 instance, even though these instances are plumbed deep in the
2153 implementation details of the components.
2154
2155 As it turns out, there is already another feature in Haskell which
2156 must enforce global uniqueness, to prevent segfaults.
2157 We now turn to type classes' close cousin: type families.
2158
2159 \paragraph{Type families} With type families, confluence is the primary
2160 property of interest. (Coherence is not of much interest because type
2161 families are elaborated into coercions, which don't have any
computational content.) Rather than considering the set of
constraints we reduce to, confluence for type families concerns the
reduction of type family applications. The overlap checks for type families
2165 can be quite sophisticated, especially in the case of closed type
2166 families.
2167
2168 Unlike type classes, however, GHC \emph{does} check the non-overlap
2169 of type families eagerly. The analogous program does \emph{not} type check:
2170
2171 \begin{verbatim}
2172 -- F.hs
2173 type family F a :: *
2174 -- A.hs
2175 import F
2176 type instance F Bool = Int
2177 -- B.hs
2178 import F
2179 type instance F Bool = Bool
2180 -- C.hs
2181 import A
2182 import B
2183 \end{verbatim}
2184
2185 The reason is that it is \emph{unsound} to ever allow any overlap
2186 (unlike in the case of type classes where it just leads to incoherence.)
2187 Thus, whereas one might imagine dropping the global uniqueness of instances
2188 invariant for type classes, it is absolutely necessary to perform global
2189 enforcement here. There's no way around it!
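
As a small illustration of what is at stake, a single visible instance for
\verb|F Bool| already licenses conversions between \verb|F Bool| and its
right-hand side; if a second, conflicting instance were ever visible
elsewhere, composing the two views would coerce between unrelated types.
(This fragment is only a sketch of the danger; it compiles as-is today.)

\begin{verbatim}
{-# LANGUAGE TypeFamilies #-}
module FamExample where

type family F a
type instance F Bool = Int

-- With 'F Bool = Int' visible, this is just the identity:
toInt :: F Bool -> Int
toInt = id

-- If another module could also declare 'type instance F Bool = Char',
-- the analogous 'F Bool -> Char' there, composed with this view,
-- would amount to coercing Int to Char.
\end{verbatim}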
2190
2191 \subsection{Local type classes}
2192
2193 Here, we say \textbf{NO} to global uniqueness.
2194
2195 This design is perhaps best discussed in relation to modular type
2196 classes, which shares many similar properties. Instances are now
2197 treated as first class objects (in MTCs, they are simply modules)---we
2198 may explicitly hide or include instances for type class resolution (in
2199 MTCs, this is done via the \verb|using| top-level declaration). This is
2200 essentially what was sketched in Section 5 of the original Backpack
2201 paper. As a simple example:
2202
2203 \begin{verbatim}
2204 package p where
2205 A :: [ data T = T ]
2206 B = [ import A; instance Eq T where ... ]
2207
2208 package q where
2209 A = [ data T = T; instance Eq T where ... ]
2210 include p
2211 \end{verbatim}
2212
2213 Here, \verb|B| does not see the extra instance declared by \verb|A|,
2214 because it was thinned from its signature of \verb|A| (and thus never
2215 declared canonical.) To declare an instance without making it
canonical, it must be placed in a separate (unimported) module.
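
A sketch of what that might look like; the extra module \verb|AOrphan| is
purely illustrative:

\begin{verbatim}
package q2 where
    A = [ data T = T ]
    AOrphan = [ import A; instance Eq T where ... ]
    include p
    -- modules that do not import AOrphan (such as p's B)
    -- do not see the Eq T instance
\end{verbatim}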
2217
2218 Like modular type classes, Backpack does not give rise to incoherence,
2219 because instance visibility can only be changed at the top level module
2220 language, where it is already considered best practice to provide
2221 explicit signatures. Here is the example used in the Modular Type
2222 Classes paper to demonstrate the problem:
2223
2224 \begin{verbatim}
2225 structure A = using EqInt1 in
2226 struct ...fun f x = eq(x,x)... end
2227 structure B = using EqInt2 in
2228 struct ...val y = A.f(3)... end
2229 \end{verbatim}
2230
2231 Is the type of f \verb|int -> bool|, or does it have a type-class
2232 constraint? Because type checking proceeds over the entire program, ML
2233 could hypothetically pick either. However, ported to Haskell, the
2234 example looks like this:
2235
2236 \begin{verbatim}
2237 EqInt1 :: [ instance Eq Int ]
2238 EqInt2 :: [ instance Eq Int ]
2239 A = [
2240 import EqInt1
2241 f x = x == x
2242 ]
2243 B = [
2244 import EqInt2
2245 import A hiding (instance Eq Int)
2246 y = f 3
2247 ]
2248 \end{verbatim}
2249
2250 There may be ambiguity, yes, but it can be easily resolved by the
2251 addition of a top-level type signature to \verb|f|, which is considered
2252 best-practice anyway. Additionally, Haskell users are trained to expect
2253 a particular inference for \verb|f| in any case (the polymorphic one).
2254
2255 Here is another example which might be considered surprising:
2256
2257 \begin{verbatim}
2258 package p where
2259 A :: [ data T = T ]
2260 B :: [ data T = T ]
2261 C = [
2262 import qualified A
2263 import qualified B
2264 instance Show A.T where show T = "A"
2265 instance Show B.T where show T = "B"
2266 x :: String
2267 x = show A.T ++ show B.T
2268 ]
2269 \end{verbatim}
2270
2271 In the original Backpack paper, it was implied that module \verb|C|
2272 should not type check if \verb|A.T = B.T| (failing at link time).
2273 However, if we set aside, for a moment, the issue that anyone who
2274 imports \verb|C| in such a context will now have overlapping instances,
2275 there is no reason in principle why the module itself should be
problematic. Here is the example in MTCs, which (I have it on good word from
Derek) does type check.
2278
2279 \begin{verbatim}
2280 signature SIG = sig
2281 type t
2282 val mk : t
2283 end
2284 signature SHOW = sig
2285 type t
2286 val show : t -> string
2287 end
2288 functor Example (A : SIG) (B : SIG) =
2289 let structure ShowA : SHOW = struct
2290 type t = A.t
2291 fun show _ = "A"
2292 end in
2293 let structure ShowB : SHOW = struct
2294 type t = B.t
2295 fun show _ = "B"
2296 end in
2297 using ShowA, ShowB in
2298 struct
2299 val x = show A.mk ++ show B.mk
2300 end : sig val x : string end
2301 \end{verbatim}
2302
2303 The moral of the story is, even though in a later context the instances
2304 are overlapping, inside the functor, the type-class resolution is unambiguous
2305 and should be done (so \verb|x = "AB"|).
2306
2307 Up until this point, we've argued why MTCs and this Backpack design are similar.
2308 However, there is an important sociological difference between modular type-classes
2309 and this proposed scheme for Backpack. In the presentation ``Why Applicative
2310 Functors Matter'', Derek mentions the canonical example of defining a set:
2311
2312 \begin{verbatim}
2313 signature ORD = sig type t; val cmp : t -> t -> bool end
2314 signature SET = sig type t; type elem;
2315 val empty : t;
2316 val insert : elem -> t -> t ...
2317 end
2318 functor MkSet (X : ORD) :> SET where type elem = X.t
2319 = struct ... end
2320 \end{verbatim}
2321
2322 This is actually very different from how sets tend to be defined in
2323 Haskell today. If we directly encoded this in Backpack, it would
2324 look like this:
2325
2326 \begin{verbatim}
2327 package mk-set where
2328 X :: [
2329 data T
2330 cmp :: T -> T-> Bool
2331 ]
2332 Set :: [
2333 data Set
2334 empty :: Set
2335 insert :: T -> Set -> Set
2336 ]
2337 Set = [
2338 import X
2339 ...
2340 ]
2341 \end{verbatim}
2342
2343 It's also informative to consider how MTCs would encode set as it is written
2344 today in Haskell:
2345
2346 \begin{verbatim}
2347 signature ORD = sig type t; val cmp : t -> t -> bool end
2348 signature SET = sig type 'a t;
2349 val empty : 'a t;
2350 val insert : (X : ORD) => X.t -> X.t t -> X.t t
2351 end
2352 struct MkSet :> SET = struct ... end
2353 \end{verbatim}
2354
Here, it is clear that while functor instantiation occurs for
implementations, it does not occur for types. This is a big limitation
2357 with the Haskell approach, and it explains why Haskellers, in practice,
2358 find global uniqueness of instances so desirable.
2359
2360 Implementation-wise, this requires some subtle modifications to how we
2361 do type class resolution. Type checking of indefinite modules works as
before, but when we go to actually compile them against explicit
2363 implementations, we need to ``forget'' that two types are equal when
2364 doing instance resolution. This could probably be implemented by
2365 associating type class instances with the original name that was
2366 utilized when typechecking, so that we can resolve ambiguous matches
2367 against types which have the same original name now that we are
2368 compiling.
2369
2370 As we've mentioned previously, this strategy is unsound for type families.
2371
2372 \subsection{Globally unique}
2373
2374 Here, we say \textbf{YES} to global uniqueness.
2375
2376 When we require the global uniqueness of instances (either because
2377 that's the type class design we chose, or because we're considering
2378 the problem of type families), we will need to reject declarations like the
2379 one cited above when \verb|A.T = B.T|:
2380
2381 \begin{verbatim}
2382 A :: [ data T ]
2383 B :: [ data T ]
2384 C :: [
2385 import qualified A
2386 import qualified B
2387 instance Show A.T where show T = "A"
2388 instance Show B.T where show T = "B"
2389 ]
2390 \end{verbatim}
2391
2392 The paper mentions that a link-time check is sufficient to prevent this
2393 case from arising. While in the previous section, we've argued why this
2394 is actually unnecessary when local instances are allowed, the link-time
2395 check is a good match in the case of global instances, because any
2396 instance \emph{must} be declared in the signature. The scheme proceeds
2397 as follows: when some instances are typechecked initially, we type check
them as if all of the variable module identities were distinct. Then, when
we perform linking (we \verb|include| or we unify some module
identities), we check again to see if we've discovered some instance
2401 overlap. This linking check is akin to the eager check that is
2402 performed today for type families; it would need to be implemented for
type classes as well. However, there is a twist: we are \emph{redoing}
2404 the overlap check now that some identities have been unified.
2405
As an implementation trick, one could defer the check until \verb|C|
2407 is compiled, keeping in line with GHC's lazy ``don't check for overlap
2408 until the use site.'' (Once again, unsound for type families.)

\paragraph{What about module inequalities?} An older proposal was for
signatures to contain ``module inequalities'', i.e., assertions that two
modules are not equal. (Technically: we need to be able to apply this
assertion to $\beta$ module variables, since \verb|A != B| while
\verb|A.T = B.T|.) Currently, Edward thinks that module inequalities
are of only marginal utility with local instances (i.e., not enough to
justify the implementation cost) and not useful at all in the world of
global instances!

With local instances, module inequalities could be useful to statically
rule out examples like \verb|show A.T ++ show B.T|. Because such uses
are not necessarily reflected in the signature, it would be a violation
of separate module development to try to divine the constraint from the
implementation itself. I claim this is of limited utility, however, because,
as we mentioned earlier, we can compile these ``incoherent'' modules perfectly
coherently. With global instances, all instances must be in the signature, so
while it might be aesthetically displeasing to have the signature impose
extra restrictions on linking identities, we can carry this out without
violating the linking restriction.
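
For reference, such an assertion might have looked something like the
following. This is purely hypothetical syntax (no such feature exists), and,
per the parenthetical above, the assertion would really need to constrain
the type identities (the $\beta$ variables) rather than just the module
names:

\begin{verbatim}
C :: [
    import qualified A
    import qualified B
    -- hypothetical: assert that A.T and B.T can never be instantiated
    -- to the same type, so the two instances below can never both
    -- apply to a single type
    assert A.T /= B.T
    instance Show A.T where show _ = "A"
    instance Show B.T where show _ = "B"
]
\end{verbatim}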

\section{Bits and bobs}

\subsection{Abstract type synonyms}

In Paper Backpack, abstract type synonyms are not permitted, because GHC doesn't
understand how to deal with them. The purpose of this section is to describe
one particular nastiness of abstract type synonyms, by way of the occurs check:

\begin{verbatim}
A :: [ type T ]
B :: [ import qualified A; type T = [A.T] ]
\end{verbatim}

At this point, it is illegal for \verb|A = B|; otherwise, this type synonym
would fail the occurs check. This seems like pretty bad news, since every
instance of the occurs check in the type-checker could constitute a module
inequality.
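
To see the problem concretely: if \verb|A| and \verb|B| were linked, the
synonym in \verb|B| would unfold to a reference to itself, which is exactly
the sort of cyclic synonym GHC already rejects:

\begin{verbatim}
-- With A = B, B's synonym becomes
type T = [T]
-- which GHC rejects ("Cycle in type synonym declarations"); the occurs
-- check is what detects this cycle.
\end{verbatim}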

\section{Open questions}\label{sec:open-questions}

Here are open problems about the implementation which still require
hashing out.

\begin{itemize}

\item In Section~\ref{sec:simplifying-backpack}, we argued that we
could implement Backpack without needing a shaping pass. We're
pretty certain that this will work for typechecking and
compiling fully definite packages with no recursive linking, but
in Section~\ref{sec:typechecking-indefinite}, we described some
of the prevailing difficulties of supporting signature linking.
Renaming is not an insurmountable problem, but backwards flow of
shaping information can be, and it is unclear how best to
accommodate this. This is probably the most important problem
to overcome.

\item In Section~\ref{sec:installing-indefinite}, a few choices for how to
store source code were pitched; however, there is no consensus on which
one is best.

\item What is the impact of the multiplicity of PackageIds on
dependency solving in Cabal? Old questions of what to prefer
when multiple package-versions are available arise again (Cabal
originally only needed to solve this between different versions of
the same package, preferring the oldest version), but with signatures,
there are more choices. Should there be a complex solver that
does all signature solving, or a preprocessing step that reduces
the problem to one the original Cabal solver can handle? Authors may
want to suggest policy for what packages should actually link against
signatures (so a crypto library doesn't accidentally link
against a null cipher package).

\end{itemize}

\end{document}