\documentclass{article}

\usepackage{pifont}
\usepackage{graphicx} %[pdftex] OR [dvips]
\usepackage{fullpage}
\usepackage{wrapfig}
\usepackage{float}
\usepackage{titling}
\usepackage{hyperref}
\usepackage{tikz}
\usepackage{color}
\usepackage{footnote}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usetikzlibrary{arrows}
\usetikzlibrary{positioning}
\setlength{\droptitle}{-6em}

\input{commands-new-new.tex}

\newcommand{\nuAA}{\nu_\mathit{AA}}
\newcommand{\nuAB}{\nu_\mathit{AB}}
\newcommand{\nuGA}{\nu_\mathit{GA}}
\newcommand{\nuGB}{\nu_\mathit{GB}}
\newcommand{\betaPL}{\beta_\mathit{PL}}
\newcommand{\betaAA}{\beta_\mathit{AA}}
\newcommand{\betaAS}{\beta_\mathit{AS}}
\newcommand{\thinandalso}{\hspace{.45cm}}
\newcommand{\thinnerandalso}{\hspace{.38cm}}

\input{commands-rebindings.tex}

\newcommand{\var}[1]{\textsf{#1}}

\newcommand{\ghcfile}[1]{\textsl{#1}}

\title{Implementing Backpack}

\begin{document}

\maketitle

The purpose of this document is to describe an implementation path
for Backpack in GHC\@.

\tableofcontents
\section{What we are trying to solve}

While the current ecosystem has proved itself serviceable for many years,
there are a number of major problems that cause significant headaches
for many users. Here are some of them:

\subsection{Package reinstalls are destructive}\label{sec:destructive}

When attempting to install a new package, you might get an error like
this:

\begin{verbatim}
$ cabal install hakyll
cabal: The following packages are likely to be broken by the reinstalls:
pandoc-1.9.4.5
Graphalyze-0.14.0.0
Use --force-reinstalls if you want to install anyway.
\end{verbatim}

While this error message is understandable if you're really trying to
reinstall a package, it is quite surprising that it can occur even if
you didn't ask for any reinstalls!

The underlying cause of this problem is related to an invariant Cabal
currently enforces on a package database: there can only be one instance
of a package for any given package name and version. This means that it
is not possible to install a package multiple times, compiled against
different dependencies. However, sometimes, reinstalling a package with
different dependencies is the only way to fulfill version bounds of a
package! For example: say we have three packages \pname{a}, \pname{b}
and \pname{c}. \pname{b-1.0} is the only version of \pname{b}
available, and it has been installed and compiled against \pname{c-1.0}.
Later, the user installs an updated version \pname{c-1.1} and then
attempts to install \pname{a}, which depends on the specific versions
\pname{c-1.1} and \pname{b-1.0}. We \emph{cannot} use the already
installed version of \pname{b-1.0}, which depends on the wrong version
of \pname{c}, so our only choice is to reinstall \pname{b-1.0} compiled
against \pname{c-1.1}. This will break any packages, e.g. \pname{d},
which were built against the old version of \pname{b-1.0}.

Our solution to this problem is to \emph{abolish} destructive package
installs, and allow a package to be installed multiple times with the same
package name and version. However, allowing this poses some interesting
user interface problems, since package IDs are now no longer unambiguous
identifiers.

\subsection{Version bounds are often over/under-constrained}

When attempting to install a new package, Cabal might fail in this way:

\begin{verbatim}
$ cabal install hledger-0.18
Resolving dependencies...
cabal: Could not resolve dependencies:
# pile of output
\end{verbatim}

There are a number of possible reasons why this could occur, but usually
it's because some of the packages involved have over-constrained version
bounds, which result in an unsatisfiable set of constraints (or,
at least, Cabal gave up backtracking before it found a solution.) To
add insult to injury, most of the time the bound is nonsense and removing
it would result in a working compilation. In fact, this situation is
so common that Cabal has a flag \verb|--allow-newer| which lets you
override the package upper bounds.

However, the flip side is when Cabal finds a satisfying set, but your
compilation fails with a type error. Here, you had an under-constrained
set of version bounds which didn't actually reflect the compatible
versions of a package, and Cabal picked a version of the package which
was incompatible.

Our solution to this problem is to use signatures instead of version
numbers as the primary mechanism by which compatibility is determined:
e.g., if it typechecks, it's a valid choice. Version numbers can still
be used to reflect semantic changes not seen in the types (in
particular, ruling out buggy versions of a package is a useful
operation), but these bounds are empirical observations and can be
collected after-the-fact.

\subsection{It is difficult to support multiple implementations of a type}

This problem is perhaps best described by referring to a particular
instance of it in Haskell's ecosystem: the \texttt{String} data type.
Haskell, by default, implements strings as linked lists of characters.
Many libraries use \texttt{String}, because it's very
convenient to program against. However, this representation is also
very \emph{slow}, so there are alternative implementations such as
\texttt{Text} which provide efficient, packed byte arrays.

Now, suppose you are writing a library and you don't care whether the user of
your library is using \texttt{String} or \texttt{Text}. However, you
don't want to write your library twice to support both data types:
rather, you'd like to rely on some \emph{common interface} between the
two types, and let the user instantiate the implementation. The only
way to do this in today's Haskell is using type classes; however, this
necessitates rewriting all type signatures from a nice \texttt{String ->
String} to \texttt{StringLike s => s -> s}. The result is less readable,
requires a large number of trivial edits to type signatures, and might
even be less efficient, if GHC does not appropriately specialize your code
written in this style.

Our solution to this problem is to introduce a new mechanism of
pluggability: module holes, which let us use types and functions from a
module \texttt{Data.String} as before, but defer choosing \emph{what}
module should be used in the implementation to some later point (or
instantiate the code multiple times with different choices.)
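
As a sketch of how this could look in Backpack syntax (the package and
module names here are hypothetical, chosen purely for illustration), a
library declares \texttt{Data.String} as a hole and programs against it
with plain, class-free signatures; a client then fills the hole with a
concrete implementation:

\begin{verbatim}
package my-lib where
    Data.String :: [
        data String
        append :: String -> String -> String
    ]
    My.Lib = [
        import Data.String
        -- plain String -> String signatures, no StringLike contexts
        double :: String -> String
        double s = append s s
    ]

package my-lib-text where
    Data.String = [ {- Text-backed implementation -} ]
    include my-lib
\end{verbatim}

A second client could instead link a list-of-characters implementation
of \texttt{Data.String}, instantiating \texttt{My.Lib} a second time
without any edits to its type signatures.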

\subsection{Fast-moving APIs are difficult to develop/develop against}

Most packages that are uploaded to Hackage have authors who pay
some amount of attention to backwards compatibility and avoid making egregious
breaking changes. However, a package like \verb|ghc-api| follows a
very different model: the library is treated by its developers as an
internal component of an application (GHC), and is frequently refactored
in a way that changes its outward-facing interface.

Arguably, an application like GHC should design a stable API and
maintain backwards compatibility against it. However, this is a lot of
work (including refactoring) which is only being done slowly, and in the
meantime, the dump of all the modules gives users the functionality they
want (even if it keeps breaking every version.)

One could say that the core problem is there is no way for users to
easily communicate to GHC authors what parts of the API they rely on. A
developer of GHC who is refactoring an interface will often rely on the
typechecker to let them know which parts of the codebase they need to
follow and update, and often could say precisely how to update code to
use the new interface. User applications, which live out of tree, don't
receive this level of attention.

Our solution is to make it possible to typecheck the GHC API against a
signature. Important consumers can publish what subsets of the GHC API
they rely on, and developers of GHC, as part of their normal build
process, type-check against these signatures. If the signature breaks,
a developer can either do the refactoring differently to avoid the
compatibility break, or document how to update code to use the new API\@.

\section{Backpack in a nutshell}

For a more in-depth tutorial about Backpack's features, check out Section 2
of the original Backpack paper. In this section, we briefly review the
most important points of Backpack's design.

\paragraph{Thinning and renaming at the module level}
A user can specify a build dependency which only exposes a subset of
modules (possibly under different names.) By itself, it's a way for the
user to resolve ambiguous module imports at the package level, without
having to use the \texttt{PackageImports} syntax extension.

\paragraph{Holes (abstract module definitions)} The core
component of Backpack's support for \emph{separate modular development}
is the ability to specify abstract module bindings, or holes, which give
users of the module an obligation to provide an implementation which
fulfills the signature of the hole. In this example:

\begin{verbatim}
package p where
    A :: [ ... ]
    B = [ import A; ... ]
\end{verbatim}

\verb|p| is an \emph{indefinite package}, which cannot be compiled until
an implementation of \m{A} is provided. However, we can still type check
\m{B} without any implementation of \m{A}, by type checking it against
the signature. Holes can be put into signature packages and included
(depended upon) by other packages to reuse definitions of signatures.

\paragraph{Filling in holes with an implementation}
A hole in an indefinite package can be instantiated in a \emph{mix-in}
style: namely, if a signature and an implementation have the same name,
they are linked together:

\begin{verbatim}
package q where
    A = [ ... ]
    include p -- has signature A
\end{verbatim}

Renaming is often used to give a module (or a hole) a new name, so that a
signature and an implementation coincide and are linked together.
An indefinite package can be instantiated multiple times with different
implementations: the \emph{applicativity} of Backpack means that if
a package is instantiated separately with the same module, the results
are type equal:

\begin{verbatim}
package q' where
    A = [ ... ]
    include p (A, B as B1)
    include p (A, B as B2)
    -- B1 and B2 are equivalent
\end{verbatim}

\paragraph{Combining signatures together}
Unlike implementations, it's valid for multiple signatures with the
same name to be in scope.

\begin{verbatim}
package a-sig where
    A :: [ ... ]
package a-sig2 where
    A :: [ ... ]
package q where
    include a-sig
    include a-sig2
    B = [ import A; ... ]
\end{verbatim}

These signatures \emph{merge} together, providing the union of the
functionality (assuming the types of individual entities are
compatible.) Backpack has a very simple merging algorithm: types must
match exactly to be compatible (\emph{width} subtyping).

\clearpage

\section{Module and package identity}

\begin{figure}[H]
\begin{tabular}{p{0.45\textwidth} p{0.45\textwidth}}
\begin{verbatim}
package p where
    A :: [ data X ]
    P = [ import A; data Y = Y X ]
package q where
    A1 = [ data X = X1 ]
    A2 = [ data X = X2 ]
    include p (A as A1, P as P1)
    include p (A as A2, P as P2)
\end{verbatim}
&
\begin{verbatim}
package p where
    A :: [ data X ]
    P = [ data T = T ] -- no A import!
package q where
    A1 = [ data X = X1 ]
    A2 = [ data X = X2 ]
    include p (A as A1, P as P1)
    include p (A as A2, P as P2)
\end{verbatim}
\\
(a) Type equality must consider holes\ldots &
(b) \ldots but how do we track dependencies? \\
\end{tabular}
\caption{Two similar examples}\label{fig:simple-ex}
\end{figure}

One of the central questions one encounters when type checking Haskell
code is: \emph{when are two types equal}? In ordinary Haskell, the
answer is simple: ``They are equal if their \emph{original names} (i.e.,
where they were originally defined) are the same.'' However, in
Backpack, the situation is murkier due to the presence of \emph{holes}.
Consider the pair of examples in Figure~\ref{fig:simple-ex}.
In Figure~\ref{fig:simple-ex}a, the types \m{P1}.\verb|Y| and \m{P2}.\verb|Y| should not be
considered equal, even though na\"\i vely their original names are
\pname{p}:\m{P}.\verb|Y|, since their arguments are different \verb|X|'s!
On the other hand, if we instantiated \pname{p} twice with the same \m{A}
(e.g., change the second include to \texttt{include p (A as A1, P as P2)}),
we might consider the two resulting \verb|Y|'s
equal, per an \emph{applicative} semantics of instantiation. In
Figure~\ref{fig:simple-ex}b, we see that even though \m{A} was instantiated differently,
we might reasonably wonder if \texttt{T} should still be considered the same,
since it has no dependence on the actual choice of \m{A}.

In fact, there are quite a few different choices that can be made here.
Figures~\ref{fig:applicativity}~and~\ref{fig:granularity} summarize the various
choices on two axes: the granularity of applicativity (under what circumstances
do we consider two types equal) and the granularity of dependency (under what
circumstances do we consider two types not equal)? A \ding{52} means the design we have chosen
answers the question affirmatively---\ding{54}, negatively---but all of these choices
are valid points in the design space.

\subsection{The granularity of applicativity}

An applicative semantics of package instantiation states that if a package is
instantiated with the ``same arguments'', then the resulting entities it defines
should also be considered equal. Because Backpack uses \emph{mix-in modules},
it is very natural to consider the arguments of a package instantiation to be
modules, as shown in Figure~\ref{fig:applicativity}b: the same module \m{A} is
linked for both instantiations, so \m{P1} and \m{P2} are considered equal.

However, if we consider the situation at a finer granularity, we might say, ``Well,
for a declaration \texttt{data Y = Y X}, only the definition of type \verb|X| matters.
If they are the same, then \verb|Y| is the same.'' In that case, we might accept
that in Figure~\ref{fig:applicativity}a, even though \pname{p} is instantiated
with different modules, at the end of the day, the important component \verb|X| is
the same in both cases, so \verb|Y| should also be the same. This is a sort of
``extreme'' view of modular development, where every declaration is desugared
into a separate module. In our design, we will be a bit more conservative, and
continue with module-level applicativity, in the same manner as Paper Backpack.

\paragraph{Implementation considerations}
Compiling Figure~\ref{fig:applicativity}b to dynamic libraries poses an
interesting challenge, if every package compiles to a dynamic library.
When we compile package \pname{q}, the libraries we end up producing are \pname{q}
and an instance of \pname{p} (instantiated with \pname{q}:\m{A}). Furthermore,
\pname{q} refers to code in \pname{p} (the import in \m{Q}), and vice versa (the usage
of the instantiated hole \m{A}). When building static libraries, this circular
dependency doesn't matter: when we link the executable, we can resolve all
of the symbols in one go. However, when the libraries in question are
dynamic libraries \verb|libHSq.so| and \verb|libHSp(q:A).so|, we now have
a \emph{circular dependency} between the two dynamic libraries, and most dynamic
linkers will not be able to load either of these libraries.

To break the circularity in Figure~\ref{fig:applicativity}b, we have to \emph{inline}
the entire module \m{A} into the instance of \pname{p}. Since the code is exactly
the same, we can still consider the instance of \m{A} in \pname{q} and in \pname{p}
type equal. However, in Figure~\ref{fig:applicativity}c, applicativity has been
done at a coarser level: although we are using Backpack's module mixin syntax,
morally, this example is filling in the holes with the \emph{package} \pname{a}
(rather than a module). In this case, we can achieve code sharing, since
\pname{p} can refer directly to \pname{a}, breaking the circularity.

\newcolumntype{C}{>{\centering\arraybackslash}p{0.3\textwidth}}
\begin{savenotes}
\begin{figure}
\begin{tabular}{C C C}
\begin{verbatim}
package q where
    A = [ data X = X ]
    A1 = [ import A; x = 0 ]
    A2 = [ import A; x = 1 ]
    include p (A as A1, P as P1)
    include p (A as A2, P as P2)
    Q = [ import P1; ... ]
\end{verbatim}
&
\begin{verbatim}
package q where
    A = [ data X = X ]


    include p (A, P as P1)
    include p (A, P as P2)
    Q = [ import P1; ... ]
\end{verbatim}
&
\begin{verbatim}
package a where
    A = [ data X = X ]
package q where
    include a
    include p (A, P as P1)
    include p (A, P as P2)
    Q = [ import P1; ... ]
\end{verbatim}
\\
(a) Declaration applicativity \ding{54} &
(b) Module applicativity \ding{52} &
(c) Package applicativity \ding{52} \\
\end{tabular}
\caption{Choices of granularity of applicativity on \pname{p}: given \texttt{data Y = Y X}, is \m{P1}.\texttt{Y} equal to \m{P2}.\texttt{Y}?}\label{fig:applicativity}
\end{figure}
\end{savenotes}

\subsection{The granularity of dependency}

\begin{savenotes}
\newcolumntype{C}{>{\centering\arraybackslash}p{0.3\textwidth}}
\begin{figure}
\begin{tabular}{C C C}
\begin{verbatim}
package p(A,P) where
    A :: [ data X ]
    P = [
        import A
        data T = T
        data Y = Y X
    ]
\end{verbatim}
&
\begin{verbatim}
package p(A,P) where
    A :: [ data X ]
    B = [ data T = T ]
    C = [
        import A
        data Y = Y X
    ]
    P = [
        import B
        import C
    ]
\end{verbatim}
&
\begin{verbatim}
package b where
    B = [ data T = T ]
package c where
    A :: [ data X ]
    C = [
        import A
        data Y = Y X
    ]
package p(A,P) where
    include b; include c
    P = [ import B; import C ]
\end{verbatim}
\\
(a) Declaration granularity \ding{54} &
(b) Module granularity \ding{54} &
(c) Package granularity \ding{52} \\
\end{tabular}
\caption{Choices of granularity for dependency: is the identity of \texttt{T} independent of how \m{A} is instantiated?}\label{fig:granularity}
\end{figure}

\end{savenotes}

In the previous section, we considered \emph{what} sorts of entities may serve
as the arguments of an instantiation; in this section we consider \emph{which}
entities are actually recorded as dependencies of the declaration/module/package we're writing.
Figure~\ref{fig:granularity} contains a series of examples which exemplify the choice
of whether to collect dependencies on a per-declaration, per-module or per-package basis:

\begin{itemize}
\item Package-level granularity states that the modules in a package are
considered to depend on \emph{all} of the holes in the package, even if
a hole is never imported. Figure~\ref{fig:granularity}c is factored so that
\verb|T| is defined in a distinct package \pname{b} with no holes, so no matter
the choice of \m{A}, \m{B}.\verb|T| will be the same. On the other hand, in
Figure~\ref{fig:granularity}b, there is a hole in the package defining \m{B},
so the identity of \verb|T| will depend on the choice of \m{A}.

\item Module-level granularity states that each module has its own dependency list,
computed by looking at its import statements. In this setting, \verb|T| in Figure~\ref{fig:granularity}b
is independent of \m{A}, since the hole is never imported in \m{B}. But once again, in
Figure~\ref{fig:granularity}a, there is an import in the module defining \verb|T|,
so the identity of \verb|T| once again depends on the choice of \m{A}.

\item Finally, at the finest level of granularity, one could chop up \pname{p} in
Figure~\ref{fig:granularity}a, looking at declaration-level dependencies
to suss out whether or not \verb|T| depends on \m{A}. It doesn't refer to
anything in \m{A}, so it is always considered the same.
\end{itemize}

It is well worth noting that the system described by Paper Backpack tracks dependencies per module;
however, we have decided that we will implement tracking per package instead:
a coarser granularity which accepts fewer programs.

Is a finer form of granularity \emph{better?} Not necessarily! For
one, we can always split packages into further subpackages (as was done
in Figure~\ref{fig:granularity}c) which better reflect the internal hole
dependencies, so it is always possible to rewrite a program to make it
typecheck---just with more packages. Additionally, the finer the
granularity of dependency, the more work I have to do to understand what
the identities of entities in a module are. In Paper Backpack, I have
to understand the imports of all modules in a package; with
declaration-granularity, I have to understand the entire code. This is
a lot of work for the developer to think about; a coarser model is
easier to remember and reason about. Finally, package-level granularity
is much easier to implement, as it preserves the previous compilation
model, \emph{one library per package}. At a finer level of granularity, we
may end up repeatedly compiling a module which actually should be considered
``the same'' as any other instance of it.

Nevertheless, finer granularity can be desirable from an end-user perspective.
Usually, these circumstances arise when library-writers are forced to split their
components into many separate packages, when they would much rather have written
a single package. For example, if I define a data type in my library, and would
like to define a \verb|Lens| instance for it, I would create a new package just
for the instance, in order to avoid saddling users who aren't interested in lenses
with an extra dependency. Another example is test suites, which have dependencies
on various test frameworks that a user won't care about if they are not planning
on testing the code. (Cabal has a special case for this, allowing the user
to write effectively multiple packages in a single Cabal file.)

\subsection{Summary}

We can summarize all of the various schemes by describing the internal data
types that would be defined by GHC under each regime. First, we have
the shared data structures, which correspond closely to what users are
used to seeing:

\begin{verbatim}
<pkg-name>    ::= containers, ...
<pkg-version> ::= 1.0, ...
<pkg-id>      ::= <pkg-name>-<pkg-version>
<mod-name>    ::= Data.Set, ...
<occ>         ::= empty, ...
\end{verbatim}

Changing the \textbf{granularity of applicativity} modifies how we represent the
list of dependencies associated with an entity. With module applicativity,
we list module identities (not yet defined); with declaration applicativity
we actually list the original names (i.e., ids).

\begin{verbatim}
<deps> ::= <id>, ...        # Declaration applicativity
<deps> ::= <module>, ...    # Module applicativity
\end{verbatim}

Changing the \textbf{granularity of dependency} affects how we compute
the lists of dependencies, and what entities are well defined:

\begin{verbatim}
# Package-level granularity
<pkg-key> ::= hash(<pkg-id> + <deps for pkg>)
<module>  ::= <pkg-key> : <mod-name>
<id>      ::= <module> . <occ>

# Module-level granularity
<pkg-key> not defined
<module>  ::= hash(<pkg-id> : <mod-name> + <deps for mod>)
<id>      ::= <module> . <occ>

# Declaration-level granularity
<pkg-key> not defined
<module>  not defined
<id>      ::= hash(<pkg-id> : <mod-name> . <occ> + <deps for decl>)
\end{verbatim}

Notice that as we increase the granularity, the notions of a ``package'' and a ``module''
become undefined. This is because, for example, with module-level granularity, a single
``package'' may result in several modules, each of which has a different set of
dependencies. It doesn't make much sense to refer to the package as a monolithic entity,
because the point of splitting up the dependencies was so that if a user relies only
on a single module, it has a correspondingly restricted set of dependencies.

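To make the three regimes concrete, here is a schematic worked example
(the \texttt{hash(...)} terms are symbolic, not actual hash outputs, and
the names are hypothetical): consider a package \pname{p-1.0} with a
single hole \m{H}, instantiated with \texttt{q():Impl}, and a module
\m{M} containing a declaration \texttt{f}, where \m{M} never imports
\m{H}:

\begin{verbatim}
# Package-level granularity: M depends on the choice of H,
# even though it never imports it
<pkg-key> = hash(p-1.0 + {H -> q():Impl})
<module>  = <pkg-key> : M

# Module-level granularity: M has no import of H, so its
# dependency list is empty
<module>  = hash(p-1.0 : M + {})

# Declaration-level granularity: each declaration carries its
# own (here empty) dependency list
<id>      = hash(p-1.0 : M . f + {})
\end{verbatim}

Under package-level granularity, instantiating \pname{p-1.0} with two
different choices for \m{H} yields two distinct \m{M}'s; under the finer
regimes the two \m{M}'s coincide.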
\subsection{The new scheme, formally}

\begin{wrapfigure}{R}{0.5\textwidth}
\begin{myfig}
\[
\begin{array}{@{}lr@{\;}c@{\;}l@{}}
  \text{Package Names (\texttt{PkgName})} & P &\in& \mathit{PkgNames} \\
  \text{Module Path Names (\texttt{ModName})} & p &\in& \mathit{ModPaths} \\
  \text{Module Identity Vars} & \alpha,\beta &\in& \mathit{IdentVars} \\
  \text{Package Key (\texttt{PackageKey})} & \K &::=& P(\vec{p\mapsto\nu}) \\
  \text{Module Identities (\texttt{Module})} & \nu &::=&
    \alpha ~|~
    \K\colon\! p \\
  \text{Module Identity Substs} & \phi,\theta &::=&
    \{\vec{\alpha \coloneqq \nu}\} \\
\end{array}
\]
\caption{Module Identities}
\label{fig:mod-idents}
\end{myfig}
\end{wrapfigure}

In this section, we give a formal treatment of our choice in the design space, in the
same style as the Backpack paper, but omitting mutual recursion, as it follows straightforwardly.
Physical module
identities $\nu$, the \texttt{Module} component of \emph{original names} in GHC, are either (1) \emph{variables} $\alpha$, which are
used to represent holes\footnote{In practice, these will just be fresh paths in a special package key for variables.} or (2) a concrete module $p$ defined in package
$P$, with holes instantiated with other module identities (which might be
empty)\footnote{In Paper Backpack, we would refer to just $P$:$p$ as the identity
constructor. However, we've written the subterms specifically next to $P$ to highlight the semantic difference of these terms.}.

As in traditional Haskell, every package contains a number of module
files at some module path $p$; within a package these paths are
guaranteed to be unique.\footnote{In Paper Backpack, the module expressions themselves are used to refer to globally unique identifiers for each literal. This makes the metatheory simpler, but for implementation purposes it is convenient to conflate the \emph{original} module path that a module is defined at with its physical identity.} When we write inline module definitions, we assume
that they are immediately assigned to a module path $p$ which is incorporated
into their identity. A module identity $\nu$ simply augments this
with subterms $\vec{p\mapsto\nu}$ representing how \emph{all} holes in the package $P$
were instantiated.\footnote{In Paper Backpack, we do not distinguish between holes/non-holes, and we consider all imports of the \emph{module}, not the package.} This naming is stable because the current Backpack surface syntax does not allow a logical path in a package
to be undefined. A package key is $P(\vec{p\mapsto\nu})$.

Here is the very first example from
Section 2 of the original Backpack paper, \pname{ab-1}:

\begin{example}
\Pdef{ab-1}{
  \Pmod{A}{x = True}
  \Pmod{B}{\Mimp{A}; y = not x}
% \Pmodd{C}{\mname{A}}
}
\end{example}

The identities of \m{A} and \m{B} are
\pname{ab-1}:\mname{A} and \pname{ab-1}:\mname{B}, respectively.\footnote{In Paper Backpack, the identity for \mname{B} records its import of \mname{A}, but since it is definite, this is strictly redundant.} In a package with holes, each
hole (within the package definition) gets a fresh variable as its
identity, and all of the holes associated with package $P$ are recorded. Consider \pname{abcd-holes-1}:

\begin{example}
\Pdef{abcd-holes-1}{
  \Psig{A}{x :: Bool} % chktex 26
  \Psig{B}{y :: Bool} % chktex 26
  \Pmod{C}{x = False}
  \Pmodbig{D}{
    \Mimpq{A}\\
    \Mimpq{C}\\
%   \Mexp{\m{A}.x, z}\\
    z = \m{A}.x \&\& \m{C}.x
  }
}
\end{example}

The identities of the four modules
are, in order, $\alpha_a$, $\alpha_b$, $\pname{abcd-holes-1}(\alpha_a,\alpha_b)$:\mname{C}, and
$\pname{abcd-holes-1}(\alpha_a,\alpha_b)$:\mname{D}.\footnote{In Paper Backpack, the granularity is at the module level, so the subterms of \mname{C} and \mname{D} can differ.} We include both $\alpha_a$ and $\alpha_b$ in both \mname{C} and \mname{D}, regardless of the imports. When we link the package against an implementation of the hole, these variables are replaced with the identities of the modules we linked against.

Shaping proceeds in the same way as in Paper Backpack, except that the
shaping judgment must also accept the package key
$P(\vec{p\mapsto\alpha})$ so we can create identifiers with
\textsf{mkident}. This implies we must know ahead of time what the holes
of a package are.

\paragraph{A full Backpack comparison}
If you're curious about how the rest of the Backpack examples translate,
look no further than this section.

First, consider the module identities in the \m{Graph} instantiations in
\pname{multinst}, shown in Figure 2 of the original Backpack paper.
In the definition of \pname{structures}, assume that the variables for
\m{Prelude} and \m{Array} are $\alpha_P$ and $\alpha_A$ respectively.
The identity of \m{Graph} is $\pname{structures}(\alpha_P, \alpha_A)$:\m{Graph}. Similarly, the identities of the two array implementations
are $\nu_{AA} = \pname{arrays-a}(\alpha_P)$:\m{Array} and
$\nu_{AB} = \pname{arrays-b}(\alpha_P)$:\m{Array}.\footnote{Notice that the subterms coincide with Paper Backpack! A sign that module-level granularity is not necessary for many use-cases.}

The package \pname{graph-a} is more interesting because it
\emph{links} the packages \pname{arrays-a} and \pname{structures}
together, with the implementation of \m{Array} from \pname{arrays-a}
\emph{instantiating} the hole \m{Array} from \pname{structures}. This
linking is reflected in the identity of the \m{Graph} module in
\pname{graph-a}: whereas in \pname{structures} it was $\nu_G =
\pname{structures}(\alpha_P, \alpha_A)$:\m{Graph}, in \pname{graph-a} it is
$\nu_{GA} = \nu_G[\nu_{AA}/\alpha_A] = \pname{structures}(\alpha_P, \nu_{AA})$:\m{Graph}. Similarly, the identity of \m{Graph} in
\pname{graph-b} is $\nu_{GB} = \nu_G[\nu_{AB}/\alpha_A] =
\pname{structures}(\alpha_P, \nu_{AB})$:\m{Graph}. Thus, linking consists
of substituting the variable identity of a hole by the concrete
identity of the module filling that hole.
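
Schematically, then, linking \pname{graph-a} is nothing more than
applying a substitution to the identity of \m{Graph} (the notation here
simply restates the identities above in plain text):

\begin{verbatim}
in structures:  nu_G  = structures(alpha_P, alpha_A):Graph
fill the hole:  alpha_A := nu_AA = arrays-a(alpha_P):Array
in graph-a:     nu_GA = structures(alpha_P, arrays-a(alpha_P):Array):Graph
\end{verbatim}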
672
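To make the substitution step concrete, here is a small sketch of linking as substitution of a hole's variable identity. All names and data types here are illustrative; GHC's actual representation of module identities differs.

```python
# Sketch of module identities with holes, and linking as substitution.
from dataclasses import dataclass

@dataclass(frozen=True)
class Hole:               # a module variable, e.g. alpha_A
    name: str

@dataclass(frozen=True)
class Module:             # nu = P(args):ModName
    package: str
    args: tuple           # instantiations of the package's holes, in order
    module: str

def subst(ident, hole, filler):
    """Substitute `filler` for `hole` inside a module identity."""
    if isinstance(ident, Hole):
        return filler if ident == hole else ident
    return Module(ident.package,
                  tuple(subst(a, hole, filler) for a in ident.args),
                  ident.module)

alpha_P, alpha_A = Hole("alpha_P"), Hole("alpha_A")
nu_G  = Module("structures", (alpha_P, alpha_A), "Graph")
nu_AA = Module("arrays-a", (alpha_P,), "Array")
# Linking arrays-a into structures: structures(alpha_P, arrays-a(alpha_P)):Graph
nu_GA = subst(nu_G, alpha_A, nu_AA)
```

Instantiating the same hole with \pname{arrays-b} instead yields a structurally distinct identity, which is exactly why \code{\m{GA}.G} and \code{\m{GB}.G} below are distinct types.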
Lastly, \pname{multinst} makes use of both of these \m{Graph}
modules, under the aliases \m{GA} and \m{GB}, respectively.
Consequently, in the \m{Client} module, \code{\m{GA}.G} and
\code{\m{GB}.G} will be correctly viewed as distinct types since they
originate in modules with distinct identities.

As \pname{multinst} illustrates, module identities effectively encode
dependency graphs at the package level.\footnote{In Paper Backpack, module identities
encode dependency graphs at the module level. In both cases, however, what is being
depended on is always a module.} As in Paper Backpack, we have an \emph{applicative}
semantics of instantiation, and the applicativity example in Figure 3 of the
Backpack paper still type checks. However, because we are operating at a coarser
granularity, modules may have spurious dependencies on holes that they don't
actually depend on, which means fewer type equalities may hold.


\subsection{Cabal dependency resolution}

Currently, when we compile a Cabal
package, Cabal goes ahead and resolves \verb|build-depends| entries with actual
implementations, which we compile against. A planned addition to the package key,
independent of Backpack, is to record the transitive dependency tree selected
during this dependency resolution process, so that we can install \pname{libfoo-1.0}
twice compiled against different versions of its dependencies.
What is the relationship between this transitive dependency tree of \emph{packages}
and the subterms of our package identities, which are \emph{modules}? Does one
subsume the other? In fact, these are separate mechanisms---two levels of indirection,
so to speak.

To illustrate, suppose I write a Cabal file with \verb|build-depends: foobar|. A reasonable assumption is that this translates into a
Backpack package which has \verb|include foobar|. However, this is not
actually a Paper Backpack package: Cabal's dependency solver has to
rewrite all of these package references into versioned references
\verb|include foobar-0.1|. For example, this is a pre-package:

\begin{verbatim}
package foo where
include bar
\end{verbatim}

and this is a Paper Backpack package:

\begin{verbatim}
package foo-0.3[bar-0.1[baz-0.2]] where
include bar-0.1[baz-0.2]
\end{verbatim}

This tree is very similar to the one tracking dependencies for holes,
but we must record this tree \emph{even} when our package has no holes.
% As a final example, the full module
% identity of \m{B1} in Figure~\ref{fig:granularity} may actually be $\pname{p-0.9(q-1.0[p-0.9]:A1)}$:\m{B}.

\paragraph{Linker symbols} As we increase the amount of information in
PackageId, it's important to be careful about the length of these IDs,
as they are used for exported linker symbols (e.g.
\verb|base_TextziReadziLex_zdwvalDig_info|). Very long symbol names
hurt compile and link time, object file sizes, GHCi startup time, and
dynamic linking, and they make gdb hard to use. As such, we are going to
do away with full package names and versions and instead use just a
base-62 encoded hash, with the first five characters of the package
name for user-friendliness.

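As a sketch of what such a key computation might look like: the hash inputs, truncation, and byte-to-base-62 conversion below are illustrative assumptions, not GHC's actual scheme; only the overall shape (short base-62 hash prefixed by the first five characters of the package name) follows the text above.

```python
# Sketch: derive a short, linker-friendly package key from a package
# name/version plus the keys of its resolved dependencies.
import hashlib

BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def base62(n):
    """Encode a non-negative integer in base 62."""
    if n == 0:
        return BASE62[0]
    digits = []
    while n:
        n, r = divmod(n, 62)
        digits.append(BASE62[r])
    return "".join(reversed(digits))

def package_key(name, version, dep_keys, hash_len=20):
    # Hash the package ID together with its resolved dependency keys,
    # so the same name/version built against different deps gets a
    # different key (illustrative payload format).
    payload = "\0".join([name, version] + sorted(dep_keys)).encode()
    h = int.from_bytes(hashlib.sha256(payload).digest()[:12], "big")
    # First five characters of the package name for user-friendliness.
    return name[:5] + "_" + base62(h)[:hash_len]
```
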
\subsection{Package selection}

When I fire up \texttt{ghci} with no arguments, GHC somehow creates
out of thin air some consistent set of packages, whose modules I can
load using \texttt{:m}. This functionality is extremely handy for
exploratory work, but actually GHC has to work quite hard in order
to generate this set of packages, the contents of which are all
dumped into a global namespace. For example, GHC doesn't have access
to Cabal's dependency solver, nor does it know \emph{which} packages
the user is going to ask for, so it can't just run a constraint solver,
get a set of consistent packages to offer and provide them to the user.\footnote{Some might
argue that depending on a global environment in this fashion is wrong, because
when you perform a build in this way, you have absolutely no idea what
dependencies you actually ended up using. But the fact remains that for
end users, this functionality is very useful.}

To make matters worse, while in the current design of the package database
a package is uniquely identified by its package name and version, in
the Backpack design it is \emph{mandatory} that we support multiple
packages installed in the database with the same package name and version,
and this can result in complications in the user model. This further
complicates GHC's default package selection algorithm.

In this section, we describe how the current algorithm operates (including
what invariants it tries to uphold and where it goes wrong), and how
to replace the algorithm to handle the generalization to
multiple instances in the package database. We'll also try to tease
apart the relationship between package keys and installed package IDs in
the database.

\paragraph{The current algorithm} Abstractly, GHC's current package
selection algorithm operates as follows. For every package name, select
the package with the latest version (recall that this is unique) which
is also \emph{valid}. A package is valid if:

\begin{itemize}
\item It exists in the package database,
\item All of its dependencies are valid,
\item It is not shadowed by a package with the same package ID\footnote{Recall that currently, a package ID uniquely identifies a package in the package database.} in
another package database (unless it is in the transitive closure
of a package named by \texttt{-package-id}), and
\item It is not ignored with \texttt{-ignore-package}.
\end{itemize}

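The validity check above can be sketched in a few lines. This is a simplification: the \texttt{-package-id} transitive-closure exception is omitted, and shadowing is modeled simply by letting the topmost database in the stack win; the dictionary layout is an assumption for illustration.

```python
# Sketch of GHC's package validity check. A database maps package ID
# (e.g. "a-1.0") to a record with a "depends" list of package IDs.
# db_stack is ordered bottom (global) to top (user).

def valid_packages(db_stack, ignored=frozenset()):
    """Return the set of package IDs GHC would consider valid."""
    seen = {}
    # Topmost database shadows packages with the same ID lower down.
    for db in reversed(db_stack):
        for pid, info in db.items():
            seen.setdefault(pid, info)

    def ok(pid, visiting=frozenset()):
        # Invalid if ignored, missing from every database, or cyclic.
        if pid in ignored or pid not in seen or pid in visiting:
            return False
        # Valid only if every dependency is itself valid.
        return all(ok(d, visiting | {pid}) for d in seen[pid]["depends"])

    return {pid for pid in seen if ok(pid)}
```
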
Package validity is probably the minimal criterion for GHC to ensure
that it can actually \emph{use} a package. If the package is missing,
GHC can't find the interface files or object code associated with the
package. Ignoring packages is a way of pretending that a package is
missing from the database.

Package validity is also a very weak criterion. Another criterion we
might hope holds is \emph{consistency}: when we consider the transitive
closure of all selected packages, for any given package name, there
should only be one instance providing that package. It is trivially
easy to break this property: suppose that I have packages \pname{a-1.0},
\pname{b-1.0} compiled against \pname{a-1.0}, and \pname{a-1.1}. GHC
will happily load \pname{b-1.0} and \pname{a-1.1} together in the same
interactive session (they are both valid and the latest versions), even
though \pname{b-1.0}'s dependency is inconsistent with another package
that was loaded. The user will notice if they attempt to treat entities
from \pname{a} reexported by \pname{b-1.0} and entities from
\pname{a-1.1} as type equal. Here is one user who had this problem:
\url{http://stackoverflow.com/questions/12576817/}. In some cases, the
problem is easy to work around (there is only one offending package
which just needs to be hidden), but if the divergence is deep in two
separate dependency hierarchies, it is often easier to just blow away
the package database and try again.

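Checking the consistency property is mechanical; here is a sketch, under the simplifying assumption that package IDs are \texttt{name-version} strings and that we have the dependency table in hand.

```python
# Sketch: detect the consistency violation described above by walking
# the transitive closure of the selected packages and grouping by name.

def transitive_closure(pkgs, depends):
    """All packages reachable from `pkgs` via the `depends` table."""
    todo, seen = list(pkgs), set()
    while todo:
        p = todo.pop()
        if p not in seen:
            seen.add(p)
            todo.extend(depends[p])
    return seen

def inconsistencies(selected, depends):
    """Map each package name to its versions, keeping names with > 1."""
    by_name = {}
    for pid in transitive_closure(selected, depends):
        name = pid.rsplit("-", 1)[0]
        by_name.setdefault(name, set()).add(pid)
    return {n: vs for n, vs in by_name.items() if len(vs) > 1}

depends = {"a-1.0": [], "a-1.1": [], "b-1.0": ["a-1.0"]}
# GHC's latest-version selection picks b-1.0 and a-1.1, but b-1.0
# was compiled against a-1.0, so two instances of "a" are in scope:
bad = inconsistencies({"b-1.0", "a-1.1"}, depends)
```
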
Perversely, destructive reinstallation helps prevent these sorts of
inconsistent databases. While inconsistencies can arise when multiple
versions of a package are installed, multiple versions will frequently
lead to the necessity of reinstalls. In the previous example, if a user
attempts to Cabal install a package which depends on \pname{a-1.1} and
\pname{b-1.0}, Cabal's dependency solver will propose reinstalling
\pname{b-1.0} compiled against \pname{a-1.1}, in order to get a
consistent set of dependencies. If this reinstall is accepted, we
invalidate all packages in the database which were previously installed
against \pname{b-1.0} and \pname{a-1.0}, excluding them from GHC's
selection process and making it more likely that the user will see a
consistent view of the database.

\paragraph{Enforcing consistent dependencies} From the user's
perspective, it would be desirable if GHC never loaded a set of packages
whose dependencies were inconsistent.
There are two ways we can go
about doing this. First, we can improve GHC's logic so that it doesn't
pick an inconsistent set. However, as a point of design, we'd like to
keep whatever resolution GHC does as simple as possible (in an ideal
world, we'd skip the validity checks entirely, but they ended up being
necessary to prevent a broken database from stopping GHC from starting up
at all). In particular, GHC should \emph{not} learn how to do
backtracking constraint solving: that's in the domain of Cabal. Second,
we can modify the logic of Cabal to enforce that the package database is
always kept in a consistent state, similar to the consistency check
Cabal applies to sandboxes, where it refuses to install a package to a
sandbox if the resulting dependencies would not be consistent.

The second alternative is appealing, but Cabal sandboxes are currently
designed for small, self-contained single projects, as opposed to the
global ``universe'' that a default environment is intended to provide.
For example, with a Cabal sandbox environment, it's impossible to
upgrade a dependency to a new version without blowing away the sandbox
and starting again. To support upgrades, Cabal needs to do some work:
when a new version is put in the default set, all of the
reverse-dependencies of the old version are now inconsistent. Cabal
should offer to hide these packages or reinstall them compiled against
the latest version. Furthermore, because we in general may not have write
access to all visible package databases, this visibility information
must be independent of the package databases themselves.

As a nice bonus, Cabal should also be able to snapshot the older
environment, capturing the state of the universe prior to the
installation, in case the user wants to revert back.

\paragraph{Modifying the default environment} Currently, after GHC
calculates the default package environment, a user may further modify
the environment by passing package flags to GHC, which can be used to
explicitly hide or expose packages. How do these flags interact with
our Cabal-managed environments? Hiding packages is simple enough,
but exposing packages is a bit dicier. If a user asks for a different
version of a package than in the default set, it will probably be
inconsistent with the rest of the dependencies. Cabal would have to
be consulted to figure out a maximal set of consistent packages given
the constraints. Alternatively, we could just supply the package
with no claims of consistency.
However, this use case is rare. Usually, it's not that the user wants a
specific version: the package is hidden simply because we're not
interested in loading it by default (\pname{ghc-api} is the canonical
example, since it dumps a lot of modules in the top-level namespace).
If we distinguish packages which are consistent but hidden, their
loads can be handled appropriately.

\paragraph{Consistency in Backpack} We have stated as an implicit
assumption that if we have both \pname{foo-1.0} and \pname{foo-1.1}
available, only one should be loaded at a time. What are the
consequences if both of these packages are loaded at the same time? An
import of \m{Data.Foo} provided by both packages would be ambiguous and
the user might find some type equalities they expect to hold would not.
However, the result is not \emph{unsound}: indeed, we might imagine a
user purposely wanting two different versions of a library in the same
program, renaming the modules they provided so that they could be
referred to unambiguously. As another example, suppose that we have an
indefinite package with a hole that is instantiated multiple times. In
this case, a user absolutely may want to refer to both instantiations,
once again renaming modules so that they have unique names.

There are two consequences of this. First, while the default package
set may enforce consistency, a user should still be able to explicitly
ask for a package instance, renamed so that its modules don't conflict,
and then use it in their program. Second, instantiated indefinite packages
should \emph{never} be placed in the default set, since it's impossible
to know which instantiation is the one the user prefers. A definite package
can reexport an instantiated module under an unambiguous name if the user
so pleases.

\paragraph{Shadowing, installed package IDs, ABI hashes and package
keys} Shadowing plays an important role in maintaining the soundness of
compilation; call this the \emph{compatibility} of the package set. The
problem it addresses is when there are two distinct implementations of a
module, but because their package IDs (or package keys, in the new world
order) are the same, they are considered type equal. It is absolutely
wrong for a single program to include both implementations
simultaneously (the symbols would conflict and GHC would incorrectly
conclude things were type equal when they're not), so \emph{shadowing}'s
job is to ensure that only one instance is picked, and all the other
instances are considered invalid (along with their reverse dependencies, etc.)
Recall that in current GHC, within a package database, a package
instance is uniquely identified by its package ID\@; thus, shadowing
only needs to take place between package databases. An interesting
corner case is when the same package ID occurs in both databases, but
the installed package IDs are the \emph{same}. Because the installed
package ID is currently simply an ABI hash, we skip shadowing, because
the packages are---in principle---interchangeable.

There are currently a number of proposed changes to this state of affairs:

\begin{itemize}
\item Change installed package IDs to not be based on ABI hashes.
ABI hashes have a number of disadvantages as identifiers for
packages in the database. First, they cannot be computed until
after compilation, which gave the multi-instance GSoC project a
few years ago some headaches. Second, it's not really true that
programs with identical ABI hashes are interchangeable: a new
package may be ABI compatible but have different semantics.
Thus, ABI hashes are a poor unique identifier for
packages in the package database. However, because GHC does not
give ABI stability guarantees, it would no longer be possible to
assume that packages with the same installed package
ID are ABI compatible.

\item Relax the uniqueness constraint on package IDs. There are
actually two things that could be done here. First, since we
have augmented package IDs with dependency resolution
information to form package keys, we could simply state that
package keys uniquely identify a package in a database.
Shadowing rules can be implemented in the same way as before, by
preferring the instance topmost on the stack. Second, we could
also allow \emph{same-database} shadowing: that is, not even
package keys are guaranteed to be unique in a database; instead,
installed package IDs are the sole unique identifier of a
package. This architecture is Nix-inspired, as the intent is
to keep all package information in a centralized database.
\end{itemize}

Without mandatory package environments, same-database shadowing is a bad
idea, because GHC then has no idea how to resolve shadowing. Conflicting
installed package IDs can be simulated by placing them in multiple
package databases (in principle, the databases can be concatenated together
and treated as a single monolithic database.)

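A sketch of shadowing between stacked databases, including the corner case above where identical installed package IDs make shadowing unnecessary. The database representation (package key mapped to installed package ID) is an illustrative assumption.

```python
# Sketch: resolve shadowing across a stack of package databases.
# An instance lower in the stack is shadowed by an instance with the
# same key higher up -- unless their installed package IDs (IPIDs)
# agree, in which case the instances are interchangeable.

def resolve(db_stack):
    """db_stack: list of databases, bottom first.
    Each database maps package key -> installed package ID."""
    visible, shadowed = {}, []
    for db in db_stack:                    # topmost databases win
        for key, ipid in db.items():
            if key in visible and visible[key] != ipid:
                shadowed.append(visible[key])   # genuinely shadowed
            visible[key] = ipid
    return visible, shadowed

global_db = {"foo-1.0-keyA": "ipid-foo-1", "base-keyB": "ipid-base"}
user_db   = {"foo-1.0-keyA": "ipid-foo-2"}  # rebuilt locally: shadows
visible, shadowed = resolve([global_db, user_db])
```
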
\section{Shapeless Backpack}\label{sec:simplifying-backpack}

Backpack as currently defined always requires a \emph{shaping} pass,
which calculates the shapes of all modules defined in a package.
The shaping pass is critical to the solution of the double-vision problem
in recursive module linking, but it also presents a number of unpalatable
implementation problems:

\begin{itemize}

\item \emph{Shaping is a lot of work.} A module shape specifies the
provenance of all data types and identifiers defined by a
module. To calculate this, we must preprocess and parse all
modules, even before we do the type-checking pass. (Fortunately,
shaping doesn't require a full parse of a module, only enough
to get identifiers. However, it does have to understand import
statements at the same level of detail as GHC's renamer.)

\item \emph{Shaping must be done upfront.} In the current Backpack
design, all shapes must be computed before any typechecking can
occur. While performing the shaping pass upfront is necessary
in order to solve the double-vision problem (where a module
identity may be influenced by later definitions), it means
that GHC must first do a shaping pass, and then revisit every module and
compile it properly. Nor is it (easily) possible to skip the
shaping pass when it is unnecessary, as one might expect to be
the case in the absence of mutual recursion. Shaping is not
a ``pay as you go'' language feature.

\item \emph{GHC can't compile all programs shaping accepts.} Shaping
accepts programs that GHC, with its current hs-boot mechanism, cannot
compile. In particular, GHC requires that any data type or function
in a signature actually be \emph{defined} in the module corresponding
to that file (i.e., an original name can be assigned to these entities
immediately.) Shaping permits unrestricted exports to implement
modules; this shows up in the formalism as $\beta$ module variables.

\item \emph{Shaping encourages inefficient program organization.}
Shaping is designed to enable mutually recursive modules, but as
currently implemented, mutual recursion is less efficient than
code without recursive dependencies. Programmers should avoid
this code organization, except when it is absolutely necessary.

\item \emph{GHC is architecturally ill-suited for directly
implementing shaping.} Shaping implies that GHC's internal
concept of an ``original name'' be extended to accommodate
module variables. This is an extremely invasive change to all
aspects of GHC, since the original-names assumption is baked
quite deeply into the compiler. Plausible implementations of
shaping require all these variables to be skolemized outside
of GHC\@.

\end{itemize}

To be clear, the shaping pass is fundamentally necessary for some
Backpack packages. Here is the example which convinced Simon:

\begin{verbatim}
package p where
A :: [data T; f :: T -> T]
B = [export T(MkT), h; import A(f); data T = MkT; h x = f MkT]
A = [export T(MkT), f, h; import B; f MkT = MkT]
\end{verbatim}

The key to this example is that B \emph{may or may not typecheck} depending
on the definition of A. Because A reexports B's definition of T, B will
typecheck; but if A defined T on its own, B would not typecheck. Thus,
we \emph{cannot} typecheck B until we have done some analysis of A (the
shaping analysis!)

Thus, it is beneficial (from an optimization point of view) to
consider a subset of Backpack for which shaping is not necessary.
Here is a programming discipline which does just that, which we will call the \textbf{linking restriction}: \emph{Module implementations must be declared before
signatures.} Formally, this restriction modifies the rule for merging
polarized module shapes ($\widetilde{\tau}_1^{m_1} \oplus \widetilde{\tau}_2^{m_2}$) so that
$\widetilde{\tau}_1^- \oplus \widetilde{\tau}_2^+$ is always undefined.\footnote{This seemed to be the crispest way of defining the restriction, although this means an error happens a bit later than I'd like it to: I'd prefer if we errored while merging logical contexts, but we don't know what is a hole at that point.}

Here is an example of the linking restriction. Consider these two packages:

\begin{verbatim}
package random where
System.Random = [ ... ].hs

package monte-carlo where
System.Random :: ...
System.MonteCarlo = [ ... ].hs
\end{verbatim}

Here, random is a definite package which may have been compiled ahead
of time; monte-carlo is an indefinite package with a dependency
on any package which provides \verb|System.Random|.

Now, to link these two packages together, only one ordering
is permissible:

\begin{verbatim}
package myapp where
include random
include monte-carlo
\end{verbatim}

If myapp wants to provide its own random implementation, it can do so:

\begin{verbatim}
package myapp2 where
System.Random = [ ... ].hs
include monte-carlo
\end{verbatim}

In both cases, all of \verb|monte-carlo|'s holes have been filled in by the time
it is included. The alternate ordering is not allowed.

Why does this discipline prevent mutually recursive modules? Intuitively,
a hole is the mechanism by which we can refer to an implementation
before it is defined; otherwise, we can only refer to definitions which
precede our definition. If there are never any holes \emph{which get filled},
implementation links can only go backwards, ruling out circularity.

It's easy to see how mutual recursion can occur if we break this discipline:

\begin{verbatim}
package myapp2 where
include monte-carlo
System.Random = [ import System.MonteCarlo ].hs
\end{verbatim}

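Because the linking restriction is a syntactic discipline, it can be checked by a single pass over a package's bindings. Here is a sketch; the tuple encoding of binding forms is a simplification of real Backpack descriptions, and a signature that is never filled (an indefinite package like \verb|monte-carlo|) is deliberately allowed.

```python
# Sketch: flag implementations that fill an already-declared signature,
# i.e. the disallowed tau^- (+) tau^+ merge from the text.
# Binding forms (simplified):
#   ("mod", name)                      -- module implementation
#   ("sig", name)                      -- signature
#   ("include", provides, requires)    -- included package

def check_linking_restriction(bindings):
    holes, implemented, errors = set(), set(), []
    for b in bindings:
        kind = b[0]
        if kind == "sig":
            if b[1] not in implemented:
                holes.add(b[1])           # stays a hole: fine by itself
        elif kind == "mod":
            if b[1] in holes:
                errors.append(b[1])       # filling a hole after the fact
            implemented.add(b[1])
        elif kind == "include":
            _, provides, requires = b
            for p in provides:
                if p in holes:
                    errors.append(p)      # include fills an earlier hole
                implemented.add(p)
            for r in requires:
                if r not in implemented:
                    holes.add(r)
    return errors
```

Running this on the examples above, \texttt{myapp} and \texttt{myapp2} pass, while the final (mutually recursive) version of \texttt{myapp2} is rejected because \verb|System.Random| fills a hole declared by the earlier include.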
\subsection{Typechecking of definite modules without shaping}

If we are not carrying out a shaping pass, we need to be able to calculate
$\widetilde{\Xi}_{\mathsf{pkg}}$ on the fly. In the case that we are
compiling a package---there will be no holes in the final package---we
can show that shaping is unnecessary quite easily, since with the
linking restriction, everything is definite from the get-go.

Observe the following invariant: at any given step of the module
bindings, the physical context $\widetilde{\Phi}$ contains no
holes. We can thus conclude that there are no module variables in any
type shapes. As the only time a previously calculated package shape can
change is due to unification, the incrementally computed shape is in
fact the true one.

As far as the implementation is concerned, we never have to worry
about handling module variables; we only need to do extra typechecks
against (renamed) interface files.

\subsection{Compiling definite packages}\label{sec:compiling}

% New definitions
\algnewcommand\algorithmicswitch{\textbf{switch}}
\algnewcommand\algorithmiccase{\textbf{case}}
\algnewcommand\algorithmicassert{\texttt{assert}}
% New "environments"
\algdef{SE}[SWITCH]{Switch}{EndSwitch}[1]{\algorithmicswitch\ #1\ \algorithmicdo}{\algorithmicend\ \algorithmicswitch}%
\algdef{SE}[CASE]{Case}{EndCase}[1]{\algorithmiccase\ ``#1''}{\algorithmicend\ \algorithmiccase}%
\algtext*{EndSwitch}%
\algtext*{EndCase}%

\begin{algorithm}
\caption{Compilation of definite packages (assume \texttt{-hide-all-packages} on all \texttt{ghc} invocations)}\label{alg:compile}
\begin{algorithmic}
\Procedure{Compile}{\textbf{package} $P$ \textbf{where} $\vec{B}$, $H$, $db$}\Comment{}$H$ maps hole module names to identities
\State$flags\gets \nil$
\State$\mathcal{K}$ $\gets$ \Call{Hash}{$P + H$}
\State%
In-place register the package $\mathcal{K}$ in $db$
\For{$B$ \textbf{in} $\vec{B}$}
\Case{$p = p\texttt{.hs}$}
\State\Call{Exec}{\texttt{ghc -c} $p$\texttt{.hs} \texttt{-package-db} $db$ \texttt{-package-name} $\mathcal{K}$ $flags$}
\EndCase%
\Case{$p$ $\cc$ $p$\texttt{.hsig}}
\State\Call{Exec}{\texttt{ghc -c} $p$\texttt{.hsig} \texttt{-package-db} $db$ \texttt{--sig-of} $H(p)$ $flags$}
\EndCase%
\Case{$p = p'$}
\State$flags\gets flags$ \texttt{-alias} $p$ $p'$ \texttt{-package-db} $db$
\EndCase%
\Case{\Cinc{$P'$} $\langle\vec{p_H\mapsto p_H'}, \vec{p\mapsto p'} \rangle$}
\State\textbf{let} $H'(p_H) = $ \Call{Exec}{\texttt{ghc --resolve-module} $p_H'$ \texttt{-package-db} $db$ $flags$}
\State$\mathcal{K}'\gets$ \Call{Compile}{$P'$, $H'$, $db$}\Comment{}Nota bene: not $flags$
\State$flags\gets flags$ \texttt{-package} $\mathcal{K}'$ $\langle\vec{p\mapsto p'}\rangle$
\EndCase%
\EndFor%
\State%
Remove $\mathcal{K}$ from $db$
\State%
Install the complete package $\mathcal{K}$ to the global database
\State\Return$\mathcal{K}$
\EndProcedure%
\end{algorithmic}
\end{algorithm}

The full recursive procedure for compiling a Backpack package is given
in Algorithm~\ref{alg:compile}. We recursively walk through Backpack descriptions,
processing each line by invoking GHC and/or modifying our package state.
Here is a more in-depth description of the algorithm, line by line:

\paragraph{The parameters} To compile a package description for package
$P$, we need to know $H$, the mapping of holes $p_H$ in package $P$ to
physical module identities $\nu$ which are implementing them; this
mapping is used to calculate the package key $\mathcal{K}$ for the
package in question. Furthermore, we have an in-place package database
$db$ in which we will register intermediate build results, including
partially compiled parent packages which may provide implementations
of holes for packages they include.

\subsection{Compiling implementations}

We compile modules in the same way we do today, but with some extra
package visibility $flags$, which let GHC know how to resolve imports
and look up original names. We'll describe what the new flags are
and also discuss some subtleties with module lookup.

\paragraph{In-place registration} Perhaps surprisingly, we start
compilation by registering the (uncompiled) package in the in-place
package database. This registration does not expose packages, and is
purely intended to inform the compilation of subpackages where to
find modules that are provided by the parent (in-progress) package,
as well as to provide auxiliary information, e.g., the package name
and version for error reporting. The pre-registration trick is an old
one used by the GHC build system; the key invariant to look out for
is that we shouldn't reference original names in modules that haven't
been built yet. This is enforced by our manual tracking of holes in
$H$: a module can't occur in $H$ unless it's already been compiled!

\paragraph{New package resolution algorithm} Currently, invocations
of \texttt{-package} and similar flags have the result of \emph{hiding}
other exposed packages with the same name. However, this is not going
to work for Backpack: an indefinite package may get loaded multiple times
with different instantiations, and it might even make sense to load multiple
versions of the same package simultaneously, as long as their modules
are renamed so as not to conflict.

Thus, we impose the following behavior change: when
\texttt{-hide-all-packages} is specified, we do \emph{not} automatically
hide packages with the same name as a package specified by
\texttt{-package} (or a similar flag): they are all included, even if
there are conflicts. To deal with conflicts, we augment the syntax of
\texttt{-package} to also accept a list of thinnings and renamings, e.g.
\texttt{-package} \pname{containers} $\langle\m{Data.Set},
\m{Data.Map}\mapsto \m{Map}\rangle$ says to make visible for import
\m{Data.Set} and \m{Map} (which is \m{Data.Map} renamed.) This means
that
\texttt{-package} \pname{containers-0.9} $\langle\m{Data.Set}\mapsto
\m{Set09}\rangle$ \texttt{-package} \pname{containers-0.8}
$\langle\m{Data.Set}\mapsto \m{Set08}\rangle$ now uses both packages
concurrently (previously, GHC would hide one of them.)

Additionally, it's important to note that two packages exporting the
same module do not \emph{necessarily} cause a conflict; the modules
may be linkable. For example, \texttt{-package} \pname{containers} $\langle\m{Data.Set}\rangle$
\texttt{-package} \pname{containers} $\langle\m{Data.Set}\rangle$ is fine, because
precisely the same implementation of \m{Data.Set} is loaded in both cases.
A similar situation can occur with signatures:

\begin{verbatim}
package p where
A :: [ x :: Int ]
package q where
include p
A :: [ y :: Int ]
B = [ import A; z = x + y ] -- *
package r where
A = [ x = 0; y = 0 ]
include q
\end{verbatim}

Here, both \pname{p} and \pname{q} are visible when compiling the starred
module, which compiles with the flag \texttt{-package} \pname{p}, but there
are two interface files available: one available locally, and one from \pname{p}.
Both of these interface files are \emph{forwarding} to the original implementation
in \pname{r} (more on this in ``Compiling signatures'' below), so rather than
reporting an ambiguous import, we instead have to merge the two interface files
together and use the result as the interface for the module. (This could be done
on the fly, or we could generate merged interface files as we go along.)

Note that we need not merge signatures with an implementation: we just
use the implementation interface. E.g.

\begin{verbatim}
package p where
A :: ...
package q where
A = ...
include p
B = [ import A ] -- *
\end{verbatim}

Here, \m{A} is available both from \pname{p} and \pname{q}, but the use in the
starred module should be resolved with respect to the full implementation.

\paragraph{The \texttt{-alias} flag} We introduce a new flag
\texttt{-alias} for aliasing modules. Aliasing is analogous to
the merging that can occur when we include packages, but it also applies
to modules which are locally defined. When we alias a module $p$ with
$p'$, we require that $p'$ exists in the current module mapping, and then
we attempt to add an entry for it at entry $p$. If there is no mapping for
$p$, this succeeds; otherwise, we apply the same conflict resolution algorithm.

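The mapping update performed by \texttt{-alias} might be sketched as follows. The representation of the module mapping and the error handling are illustrative assumptions; only the merge rule (identical identities coexist, distinct ones conflict) comes from the text.

```python
# Sketch of the module-mapping update for `-alias p p'`: identical
# module identities merge silently; distinct ones are a conflict.

def alias(module_map, p, p_prime):
    """Make module name `p` an alias for the module bound to `p_prime`."""
    if p_prime not in module_map:
        raise KeyError(f"-alias: {p_prime} is not in scope")
    target = module_map[p_prime]
    if p in module_map and module_map[p] != target:
        raise ValueError(f"-alias: conflicting bindings for {p}")
    # Either p was unbound, or it was already bound to the same identity.
    return {**module_map, p: target}
```
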
1243 \subsection{Compiling signatures}
1244
1245 Signature compilation is triggered when we compile a signature file.
1246 This mode similar to how we process \verb|hs-boot| files, except
1247 we pass an extra flag \verb|--sig-of| which specifies what the
1248 identity of the actual implementation of the signature is (according to our $H$
1249 mapping). This is guaranteed to exist, due to the linking
1250 restriction, although it may be in a partially registered package
1251 in $db$. If the module is \emph{not} exposed under the name of the
1252 \texttt{hisig} file, we output an \texttt{hisig} file which, for all declarations the
1253 signature exposes, forwards their definitions to the original
1254 implementation file. The intent is that any code in the current package
1255 which compiles against this signature will use this \texttt{hisig} file,
1256 not the original \texttt{hi} file.
1257 For example, the \texttt{hisig} file produced when compiling the starred interface
1258 points to the implementation in package \pname{q}.
1259
1260 \begin{verbatim}
1261 package p where
1262 A :: ... -- *
1263 B = [ import A; ... ]
1264 package q where
1265 A = [ ... ]
1266 include p
1267 \end{verbatim}
1268
1269 \paragraph{Sometimes \texttt{hisig} is unnecessary}
1270 In the following package:
1271
1272 \begin{verbatim}
1273 package p where
1274 P = ...
1275 P :: ...
1276 \end{verbatim}
1277
1278 Paper Backpack specifies that we check the signature \m{P} against implementation
1279 \m{P}, but otherwise no changes are made (i.e., the signature does not narrow
1280 the implementation.) In this case, it is not necessary to generate an \texttt{hisig} file;
1281 the original interface file suffices.
1282
1283 \paragraph{Multiple signatures} As a simplification, we assume that there
1284 is only one signature per logical name in a package. (This prevents
1285 us from expressing mutual recursion in signatures, but let's not worry
1286 about it for now.)
1287
1288 \paragraph{Restricted recursive modules \`a la hs-boot}\label{sec:hs-boot-restrict}
1289 When we compile an \texttt{hsig} file without any \texttt{--sig-of} flag (because
1290 no implementation is known), we fall back to old-style GHC mutual recursion.
1291 Na\"\i vely, a shaping pass would be necessary;
1292 so we adopt an existing constraint that
1293 already applies to hs-boot files: \emph{at the time we define a signature,
1294 we must know what the original name for all data types is}. In practice,
1295 GHC enforces this by stating that: (1) an hs-boot file must be
1296 accompanied with an implementation, and (2) the implementation must
1297 in fact define (and not reexport) all of the declarations in the signature.
1298 We can discover if a signature is intended to break a recursive module loop
1299 when we discover that $p\notin flags_H$; in this case, we fall back to the
1300 old hs-boot behavior. (Alternatively, the user can explicitly ask for it.)
1301
1302 Why does this not require a shaping pass? The reason is that the
1303 signature is not really polymorphic: we require that the $\alpha$ module
1304 variable be resolved to a concrete module later in the same package, and
1305 that all the $\beta$ module variables be unified with $\alpha$. Thus, we
1306 know ahead of time the original names and don't need to deal with any
1307 renaming.\footnote{This strategy doesn't completely resolve the problem
1308 of cross-package mutual recursion, because we need to first compile a
1309 bit of the first package (signatures), then the second package, and then
1310 the rest of the first package.}
1311
1312 Compiling packages in this way gives the tantalizing possibility
1313 of true separate compilation: the only thing we don't know is what the actual
1314 package name of an indefinite package will be, and thus what the correct
1315 references are. This is a very minor change to the assembly, so one could conceive
1316 of dynamically rewriting these references at the linking stage. But
1317 separate compilation achieved in this fashion would not be able to take
1318 advantage of cross-module optimizations.
1319
1320 \subsection{Compiling aliases}
1321
1322 Aliasing simply adds an extra \texttt{-alias} flag to the compilation flags. To
1323 eagerly report errors, we run a special command \texttt{ghc --check} which checks
1324 to make sure $flags$ is consistent (e.g., no linking conflicts.)
1325
1326 \subsection{Compiling includes}
1327
1328 Includes are the most interesting part of the compilation process, as we have to
1329 calculate how the holes of the subpackage we are filling in are compiled ($H'$)
1330 and modify our flags to make the exports of the include visible to subsequently
1331 compiled modules. We consider the case with renaming, since includes with
1332 no renaming are straightforward.
1333
1334 First, we assume that we know \emph{a priori} what the holes of a
1335 package $p_H$ are (either by some sort of pre-pass, or explicit
1336 declaration.) For each of their \emph{renamed targets} $p'_H$, we look
1337 up the module in the current $flags$ environment, retrieving the
1338 physical module identity by consulting GHC with the
1339 \texttt{--resolve-module} flag and storing it in $H'$. (This can be done in batch.)
1340 For example:
1341
1342 \begin{verbatim}
1343 package p where
1344 A :: ...
1345 ...
1346 package q where
1347 A = [ ... ]
1348 B = [ ... ]
1349 include p (A as B)
1350 \end{verbatim}
1351
1352 When computing the entry $H'(\m{A})$, we run the command \texttt{ghc --resolve-module} \m{B}.
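The batch computation of $H'$ can be sketched as a pure lookup (hypothetical types and names; in reality each lookup is a \texttt{ghc --resolve-module} invocation against the current $flags$):

```haskell
import qualified Data.Map as Map

type ModName  = String
type ModuleId = String

-- For each hole of the included package, resolve its renamed target in
-- the current module environment; a missing target is reported as-is.
computeH' :: Map.Map ModName ModName   -- hole -> renamed target
          -> Map.Map ModName ModuleId  -- current module environment
          -> Either ModName (Map.Map ModName ModuleId)
computeH' renaming env = traverse resolve renaming
  where
    resolve target = maybe (Left target) Right (Map.lookup target env)
```

In the example, the hole \m{A} is renamed to \m{B}, so $H'(\m{A})$ is whatever physical identity \m{B} resolves to in the including package.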
1353
1354 Next, we recursively call \textsc{Compile} with the computed $H'$.
1355 Note that the entries in $H$ may refer to modules which would not be
1356 picked up by $flags$, but they will be registered in the inplace
1357 package database $db$.
1358 For example, in this situation:
1359
1360 \begin{verbatim}
1361 package p where
1362 B :: ...
1363 C = [ import B; ... ]
1364 package q where
1365 A = [ ... ]
1366 B = [ import A; ... ]
1367 include p
1368 D = [ import C; ... ]
1369 \end{verbatim}
1370
1371 When we recursively process package \pname{p}, $H$ will refer to
1372 \pname{q}:\m{B}, and we need to know where to find it (\pname{q} is only
1373 partially processed and so is in the inplace package database.)
1374 Furthermore, the interface file for \m{B} may refer to \pname{q}:\m{A},
1375 and thus we likewise need to know how to find its interface file.
1376
1377 Note that the inplace package database is not expected to expose any
1378 packages. Otherwise, this example would improperly compile:
1379
1380 \begin{verbatim}
1381 package p where
1382 B = [ import A; ... ]
1383 package q where
1384 A = ...
1385 include p
1386 \end{verbatim}
1387
1388 \pname{p} does not compile on its own, so it should not compile if it is
1389 recursively invoked from \pname{q}. However, if we exposed the modules
1390 of the partially registered package \pname{q}, \m{A} is now suddenly
1391 resolvable.
1392
1393 Finally, once the subpackage is compiled, we can add it to our $flags$ so that later
1394 modules we compile can see its (appropriately thinned and renamed) modules, much as
1395 with aliasing.
1396
1397 \paragraph{Absence of an \texttt{hi} file}
1398 It is important that \texttt{--resolve-module} truly looks up the \emph{implementor}
1399 of a module, and not just a signature which is providing it at some name.
1400 Sometimes, a little extra work is necessary to compute this, for example:
1401
1402 \begin{verbatim}
1403 package p where
1404 A :: [ y :: Int ]
1405 package q where
1406 A :: [ x :: Int ]
1407 include p -- *
1408 package r where
1409 A = [ x = 0; y = 1 ]
1410 include q
1411 \end{verbatim}
1412
1413 When computing $H'$ for the starred include, our $flags$ only include
1414 \texttt{-package-dir} \pname{r} $cwd_r$ $\langle\rangle$: with a thinning
1415 that excludes all modules! The only interface file we can pick up with these
1416 $flags$ is the local definition of \m{A}. However, we \emph{absolutely} should
1417 \emph{not} set $H'(\m{A})=\pname{q}:\m{A}$; if we do so, then we will incorrectly
1418 conclude when compiling the signature in \pname{p} that the implementation
1419 doesn't export enough identifiers to fulfill the signature (\texttt{y} is not
1420 available from just the signature in \pname{q}). Instead, we have to look
1421 up the original implementor of \m{A} in \pname{r}, and use that in $H'$.
1422
1423 \subsection{Commentary}
1424
1425 \paragraph{Just because it compiled, doesn't mean the individual packages type check}
1426 The compilation mechanism described is slightly more permissive than vanilla Backpack.
1427 Here is a simple example:
1428
1429 \begin{verbatim}
1430 package p where
1431 A :: [ data T = T ]
1432 B :: [ data T = T ]
1433 C = [
1434 import A
1435 import B
1436 x = A.T :: B.T
1437 ]
1438 package q where
1439 A = [ data T = T ]
1440 B = A
1441 include p
1442 \end{verbatim}
1443
1444 Here, we incorrectly decide \m{A}\verb|.T| and \m{B}\verb|.T| are type
1445 equal when typechecking \m{C}, because the \verb|hisig| files we
1446 generate for them all point to the same original implementation. However,
1447 \pname{p} should not typecheck.
1448
1449 The problem here is that type checking asks ``does it compile with respect
1450 to all possible instantiations of the holes'', whereas compilation asks
1451 ``does it compile with respect to this particular instantiation of holes.''
1452 It's a bit unavoidable, really.
1453
1454 \section{Shaped Backpack}
1455
1456 Despite the simplicity of shapeless Backpack with the linking
1457 restriction in the absence of holes, we will find that when we have
1458 holes, it will be very difficult to do type-checking without
1459 some form of shaping. This section is very much a work in progress,
1460 but the ability to typecheck against holes, even with the linking restriction,
1461 is a very important part of modular separate development, so we will need
1462 to support it at some point.
1463
1464 \subsection{Efficient shaping}
1465
1466 (This is Edward's opinion; he hasn't convinced other folks that this is
1467 the right way to do it.)
1468
1469 In this section, I want to argue that, although shaping constitutes
1470 a pre-pass which must be run before compilation in earnest, it is only
1471 about as bad as the dependency resolution analysis that GHC already does
1472 in \verb|ghc -M| or \verb|ghc --make|.
1473
1474 In Paper Backpack, what information does shaping compute? It looks at
1475 exports, imports, data declarations and value declarations (but not the
1476 actual expressions associated with these values.) As a matter of fact,
1477 GHC already must look at the imports associated with a package in order
1478 to determine the dependency graph, so that it can have some order to compile
1479 modules in. There is a specialized parser which just parses these statements,
1480 and then ignores the rest of the file.
1481
1482 A bit of background: the \emph{renamer} is responsible for resolving
1483 imports and figuring out where all of these entities actually come from.
1484 SPJ would really like to avoid having to run the renamer in order to perform
1485 a shaping pass.
1486
1487 \paragraph{Is it necessary to run the Renamer to do shaping?}
1488 Edward and Scott believe the answer is no, well, partially.
1489 Shaping needs to know the original names of all entities exposed by a
1490 module/signature. Then it needs to know (a) which entities a module/signature
1491 defines/declares locally and (b) which entities that module/signature exports.
1492 The former, (a), can be determined by a straightforward inspection of a parse
1493 tree of the source file.\footnote{Note that no expression or type parsing
1494 is necessary. We only need names of local values, data types, and data
1495 constructors.} The latter, (b), is a bit trickier. Right now it's the Renamer
1496 that interprets imports and exports into original names, so we would still
1497 rely on that implementation. However, the Renamer does other, harder things
1498 that we don't need, so ideally we could factor out the import/export
1499 resolution from the Renamer for use in shaping.
1500
1501 Unfortunately the Renamer's import resolution analyzes \verb|.hi| files, but for
1502 local modules, which haven't yet been typechecked, we don't have those.
1503 Instead, we could use a new file format, \verb|.hsi| files, to store the shape of
1504 a locally defined module. (Defined packages are bundled with their shapes,
1505 so included modules have \verb|.hsi| files as well.) (What about the logical
1506 vs.~physical distinction in file names?) If we refactor the import/export
1507 resolution code, could we rewrite it to generically operate on both
1508 \verb|.hi| files and \verb|.hsi| files?
1509
1510 Alternatively, rather than storing shapes on a per-source basis, we could
1511 store (in memory) the entire package shape. Similarly, included packages
1512 could have a single shape file for the entire package. Although this approach
1513 would make shaping non-incremental, since an entire package's shape would
1514 be recomputed any time a constituent module's shape changes, we do not expect
1515 shaping to be all that expensive.
1516
1517 \subsection{Typechecking of indefinite modules}\label{sec:typechecking-indefinite}
1518
1519 Recall our argument in the definite case, where we showed there are
1520 no holes in the physical context. With indefinite modules, this is no
1521 longer true. While (with the linking restriction) these holes will never
1522 be linked against a physical implementation, they may be linked against
1523 other signatures. (Note: while disallowing signature linking would
1524 solve our problem, it would disallow a wide array of useful instances of
1525 signature reuse, for example, a package mylib that implements both
1526 mylib-1x-sig and mylib-2x-sig.)
1527
1528 With holes, we must handle module variables, and we sometimes must unify them:
1529
1530 \begin{verbatim}
1531 package p where
1532 A :: [ data A ]
1533 package q where
1534 A :: [ data A ]
1535 package r where
1536 include p
1537 include q
1538 \end{verbatim}
1539
1540 In this package, it is not possible to assign original names \emph{a priori} to
1541 module A in p and q, because in package r, they should have the same
1542 original name. When signature linking occurs, unification may occur,
1543 which means we have to rename all relevant original names. (A similar
1544 situation occurs when a module is typechecked against a signature.)
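The unification itself can be pictured as a naive union-find over module identity variables. This is only a sketch under simplified assumptions (GHC would additionally have to propagate the resulting renaming into original names):

```haskell
import qualified Data.Map as Map

-- A substitution as parent links: a variable absent from the map is
-- its own representative.
type Subst = Map.Map String String

-- Follow parent links to the representative.
find :: Subst -> String -> String
find s x = maybe x (find s) (Map.lookup x s)

-- Unify two module identity variables by linking their representatives.
unify :: String -> String -> Subst -> Subst
unify a b s
  | ra == rb  = s
  | otherwise = Map.insert ra rb s
  where
    ra = find s a
    rb = find s b
```

After unifying the identity variables for \m{A} in \pname{p} and \pname{q}, both resolve to a single representative, which is what lets package r link the two signatures.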
1545
1546 An invariant which would be nice to have is this: when typechecking a
1547 signature or including a package, we may apply renaming to the entities
1548 being brought into context. But once we've picked an original name for
1549 our entities, no further renaming should be necessary. (Formally, in the
1550 unification for semantic object shapes, apply the unifier to the second
1551 shape, but not the first one.)
1552
1553 However, there are plenty of counterexamples here:
1554
1555 \begin{verbatim}
1556 package p where
1557 A :: [ data A ]
1558 B :: [ data A ]
1559 M = ...
1560 A = B
1561 \end{verbatim}
1562
1563 In this package, does module M know that A.A and B.A are type equal? In
1564 fact, the shaping pass will have assigned equal module identities to A
1565 and B, so M \emph{equates these types}, despite the aliasing occurring
1566 after the fact.
1567
1568 We can make this example more sophisticated, by having a later
1569 subpackage which causes the aliasing; now, the decision is not even a
1570 local one (on the other hand, the equality should be evident by inspection
1571 of the package interface associated with q):
1572
1573 \begin{verbatim}
1574 package p where
1575 A :: [ data A ]
1576 B :: [ data A ]
1577 package q where
1578 A :: [ data A ]
1579 B = A
1580 package r where
1581 include p
1582 include q
1583 \end{verbatim}
1584
1585 Another possibility is that it might be acceptable to do a mini-shaping
1586 pass, without parsing modules or signatures, \emph{simply} looking at
1587 names and aliases. But logical names are not the only mechanism by
1588 which unification may occur:
1589
1590 \begin{verbatim}
1591 package p where
1592 C :: [ data A ]
1593 A = [ data A = A ]
1594 B :: [ import A(A) ]
1595 C = B
1596 \end{verbatim}
1597
1598 It is easy to conclude that the original names of C and B are the same. But
1599 more importantly, C.A must be given the original name of p:A.A. This can only
1600 be discovered by looking at the signature definition for B. In any case, it
1601 is worth noting that this situation parallels the situation with hs-boot
1602 files (although there is no mutual recursion here).
1603
1604 The conclusion is that you will probably, in fact, have to do real
1605 shaping in order to typecheck all of these examples.
1606
1607 \paragraph{Hey, these signature imports are kind of tricky\ldots}
1608
1609 When signatures and modules are interleaved, the interaction can be
1610 complex. Here is an example:
1611
1612 \begin{verbatim}
1613 package p where
1614 C :: [ data A ]
1615 M = [ import C; ... ]
1616 A = [ import M; data A = A ]
1617 C :: [ import A(A) ]
1618 \end{verbatim}
1619
1620 Here, the second signature for C refers to a module implementation A
1621 (this is permissible: it simply means that the original name for p:C.A
1622 is p:A.A). But wait! A relies on M, and M relies on C. Circularity?
1623 Fortunately not: a client of package p will find it impossible to have
1624 the hole C implemented in advance, since they will need to get their hands on module
1625 A\ldots but it will not be defined prior to package p.
1626
1627 In any case, however, it would be good to emit a warning if a package
1628 cannot be compiled without mutual recursion.
1629
1630 \subsection{Rename on entry}
1631
1632 Consider the following example:
1633
1634 \begin{verbatim}
1635 package p where
1636 A :: [ data T = T ]
1637 B = [ import A; x = T ]
1638 package q where
1639 C :: ...
1640 A = [ data T = T ]
1641 include p
1642 D = [
1643 import qualified A
1644 import qualified B
1645 import C
1646 x = B.T :: A.T
1647 ]
1648 \end{verbatim}
1649
1650 We are interested in type-checking \pname{q}, which is an indefinite package
1651 on account of the uninstantiated hole \m{C}. Furthermore, let's suppose that
1652 \pname{p} has already been independently typechecked, and its interface files
1653 installed in some global location with $\alpha_A$ used as the module identity
1654 of \m{A}. (To simplify this example, we'll assume $\beta_{AT}=\alpha_A$.)
1655
1656 The first three lines of \pname{q} type check in the normal way, but \m{D}
1657 now poses a problem: if we load the interface file for \m{B} the normal way,
1658 we will get a reference to type \texttt{T} with the original name $\alpha_A$.\texttt{T},
1659 whereas from \m{A} we have an original name \pname{q}:\m{A}.\texttt{T}.
1660
1661 Let's suppose that we already have the result of a shaping pass, which
1662 maps our identity variables to their true identities.
1663 Let's consider the possible options here:
1664
1665 \begin{itemize}
1666 \item We could re-typecheck \pname{p}, feeding it the correct instantiations
1667 for its variables. However, this seems wasteful: we typechecked the
1668 package already, and up-to-renaming, the interface files are exactly
1669 what we need to type check our application.
1670 \item We could make copies of all the interface files, renamed to have the
1671 right original names. This also seems wasteful: why should we have to
1672 create a new copy of every interface file in a library we depend on?
1673 \item When \emph{reading in} the interface file to GHC, we could apply the
1674 renaming according to the shaping pass and store that in memory.
1675 \end{itemize}
1676
1677 That last solution is pretty appealing; however, there are still circumstances
1678 where we need to create new interface files; these exactly mirror the cases described
1679 in Section~\ref{sec:compiling}.
1680
1681 \subsection{Incremental typechecking}
1682 We want to typecheck modules incrementally, i.e., when something changes in
1683 a package, we only want to re-typecheck the modules that care about that
1684 change. GHC already does this today.%
1685 \footnote{\url{https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/RecompilationAvoidance}}
1686 Is the same mechanism sufficient for Backpack? Edward and Scott think that it
1687 is, mostly. Our conjecture is that a module should be re-typechecked if the
1688 existing mechanism says it should \emph{or} if the logical shape
1689 context (which maps logical names to physical names) has changed. The latter
1690 condition is due to aliases that affect typechecking of modules.
1691
1692 Let's look again at an example from before:
1693 \begin{verbatim}
1694 package p where
1695 A :: [ data A ]
1696 B :: [ data A ]
1697 M = [ import A; import B; ... ]
1698 \end{verbatim}
1699 Let's say that \verb|M| is typechecked successfully. Now we add an alias binding
1700 at the end of the package, \verb|A = B|. Does \verb|M| need to be
1701 re-typechecked? Yes! (Well, it seems so, but let's just assert ``yes'' for now.
1702 Certainly in the reverse case---if we remove the alias and then ask---this
1703 is true, since \verb|M| might have depended on the two \verb|A| types
1704 being the same.)
1705 The logical shape context changed to say that \verb|A| and
1706 \verb|B| now map to the same physical module identity. But does the existing
1707 recompilation avoidance mechanism say that \verb|M| should be re-typechecked?
1708 It's unclear. The \verb|.hi| file for \verb|M| records that it imported \verb|A| and
1709 \verb|B| with particular ABIs, but does it also know about the physical module
1710 identities (or rather, original module names) of these modules?
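The conjectured condition can be stated as a one-liner (all names here are hypothetical; \texttt{usual} stands for the verdict of the existing recompilation-avoidance check):

```haskell
import qualified Data.Map as Map

-- Re-typecheck a module if the usual check fires, OR if the logical
-- shape context changed for any logical name the module imports.
needsRetypecheck :: Bool                   -- verdict of the usual check
                 -> Map.Map String String  -- old logical shape context
                 -> Map.Map String String  -- new logical shape context
                 -> [String]               -- logical names the module imports
                 -> Bool
needsRetypecheck usual old new imports =
  usual || any (\m -> Map.lookup m old /= Map.lookup m new) imports
```

Adding the alias \verb|A = B| changes the physical identity that \verb|B| maps to, so any module importing \verb|B| would be re-typechecked even if its ABI-based check passes.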
1711
1712 Scott thinks this highlights the need for us to get our story straight about
1713 the connection between logical names, physical module identities, and file
1714 names!
1715
1716
1717 \subsection{Installing indefinite packages}\label{sec:installing-indefinite}
1718
1719 If an indefinite package contains no code at all, we only need
1720 to install the interface file for the signatures. However, if
1721 they include code, we must provide all of the
1722 ingredients necessary to compile them when the holes are linked against
1723 actual implementations. (Figure~\ref{fig:pkgdb})
1724
1725 \paragraph{Source tarball or preprocessed source?} What representation of the source should be saved? There
1726 are a number of possible choices:
1727
1728 \begin{itemize}
1729 \item The original tarballs downloaded from Hackage,
1730 \item Preprocessed source files,
1731 \item Some sort of internal, type-checked representation of Haskell code (maybe the output of the desugarer).
1732 \end{itemize}
1733
1734 Storing the tarballs is the simplest and most straightforward mechanism,
1735 but we will have to be very certain that we can recompile the module
1736 later in precisely the same way we compiled it originally, to ensure the hi
1737 files match up (fortunately, it should be simple to perform an optional
1738 sanity check before proceeding.) The appeal of saving preprocessed
1739 source, or even the IRs, is that conceptually this is exactly
1740 what an indefinite package is: we have paused the compilation process
1741 partway, intending to finish it later. However, our compilation strategy
1742 for definite packages requires us to run this step using a \emph{different}
1743 choice of original names, so it's unclear how much work could actually be reused.
1744
1745 \section{Surface syntax}
1746
1747 In the Backpack paper, a brand new module language is presented, with
1748 syntax for inline modules and signatures. This syntax is probably worth implementing,
1749 because it makes it easy to specify compatibility packages, whose module
1750 definitions in general may be very short:
1751
1752 \begin{verbatim}
1753 package ishake-0.12-shake-0.13 where
1754 include shake-0.13
1755 Development.Shake.Sys = Development.Shake.Cmd
1756 Development.Shake = [ (**>) = (&>) ; (*>>) = (|*>)]
1757 Development.Shake.Rule = [ defaultPriority = rule . priority 0.5 ]
1758 include ishake-0.12
1759 \end{verbatim}
1760
1761 However, there are a few things that are less than ideal about the
1762 surface syntax proposed by Paper Backpack:
1763
1764 \begin{itemize}
1765 \item It's completely different from the current method by which users
1766 specify packages. There's nothing wrong with this per se
1767 (one simply needs to support both formats) but the smaller
1768 the delta, the easier the new packaging format is to explain
1769 and implement.
1770
1771 \item Sometimes order matters (relative ordering of signatures and
1772 module implementations), and other times it does not (aliases).
1773 This can be confusing for users.
1774
1775 \item Users have to order module definitions topologically,
1776 whereas in current Cabal modules can be listed in any order, and
1777 GHC figures out an appropriate order to compile them.
1778 \end{itemize}
1779
1780 Here is an alternative proposal, closely based on Cabal syntax. Given
1781 the following Backpack definition:
1782
1783 \begin{verbatim}
1784 package libfoo(A, B, C, Foo) where
1785 include base
1786 -- renaming and thinning
1787 include libfoo (Foo, Bar as Baz)
1788 -- holes
1789 A :: [ a :: Bool ].hsig
1790 A2 :: [ b :: Bool ].hsig
1791 -- normal module
1792 B = [
1793 import {-# SOURCE #-} A
1794 import Foo
1795 import Baz
1796 ...
1797 ].hs
1798 -- recursively linked pair of modules, one is private
1799 C :: [ data C ].hsig
1800 D = [ import {-# SOURCE #-} C; data D = D C ].hs
1801 C = [ import D; data C = C D ].hs
1802 -- alias
1803 A = A2
1804 \end{verbatim}
1805
1806 We can write the following Cabal-like syntax instead (where
1807 all of the signatures and modules are placed in appropriately
1808 named files):
1809
1810 \begin{verbatim}
1811 package: libfoo
1812 ...
1813 build-depends: base, libfoo (Foo, Bar as Baz)
1814 holes: A A2 -- deferred for now
1815 exposed-modules: Foo B C
1816 aliases: A = A2
1817 other-modules: D
1818 \end{verbatim}
1819
1820 Notably, all of these lists are \emph{insensitive} to ordering!
1821 The key idea is the use of the \verb|{-# SOURCE #-}| pragma, which
1822 is enough to solve the important ordering constraint between
1823 signatures and modules.
1824
1825 Here is how the elaboration works. For simplicity, in this algorithm
1826 description, we assume all packages being compiled have no holes
1827 (including \verb|build-depends| packages). Later, we'll discuss how to
1828 extend the algorithm to handle holes in both subpackages and the main
1829 package itself.
1830
1831 \begin{enumerate}
1832
1833 \item At the top-level with \verb|package| $p$ and
1834 \verb|exposed-modules| $ms$, record \verb|package p (ms) where|
1835
1836 \item For each package $p$ with thinning/renaming $ms$ in
1837 \verb|build-depends|, record a \verb|include p (ms)| in the
1838 Backpack package. The ordering of these includes does not
1839 matter, since none of these packages have holes.
1840
1841 \item Take all modules $m$ in \verb|other-modules| and
1842 \verb|exposed-modules| which were not exported by build
1843 dependencies, and create a directed graph where hs and hs-boot
1844 files are nodes and imports are edges (the target of an edge is
1845 an hs file if it is a normal import, and an hs-boot file if it
1846 is a SOURCE import). Topologically sort this graph, erroring if
1847 this graph contains cycles (even with recursive modules, the
1848 cycle should have been broken by an hs-boot file). For each
1849 node, in this order, record \verb|M = [ ... ]| or \verb|M :: [ ... ]|
1850 depending on whether or not it is an hs or hs-boot. If possible,
1851 sort signatures before implementations when there is no constraint
1852 otherwise.
1853
1854 \end{enumerate}
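Step (3) is an ordinary depth-first topological sort. Here is a sketch (the construction of nodes and edges from the import scanner is assumed; nodes are file names, and an edge points from a file to a file it imports):

```haskell
import Control.Monad (foldM)
import qualified Data.Map as Map
import qualified Data.Set as Set

-- Topologically sort a graph whose edges point at dependencies.
-- Returns the nodes with every dependency preceding its dependents,
-- or a node that participates in a cycle.
topoSort :: Ord a => Map.Map a [a] -> Either a [a]
topoSort g = reverse . snd <$> foldM visit (Set.empty, []) (Map.keys g)
  where
    visit st = go Set.empty st
    go path st@(done, _) n
      | n `Set.member` path = Left n           -- cycle detected
      | n `Set.member` done = Right st
      | otherwise = do
          (done', out') <- foldM (go (Set.insert n path)) st deps
          Right (Set.insert n done', n : out')
      where deps = Map.findWithDefault [] n g
```

Applied to files \texttt{A1.hs}, \texttt{B.hs}, and \texttt{B.hs-boot} where \texttt{A1.hs} has a SOURCE import of \texttt{B}, this places \texttt{B.hs-boot} before \texttt{A1.hs}; the relative order of unconstrained nodes is arbitrary in this sketch rather than following the signatures-first tie-break.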
1855
1856 Here is a simple example which shows how SOURCE can be used to disambiguate
1857 between two important cases. Suppose we have these modules:
1858
1859 \begin{verbatim}
1860 -- A1.hs
1861 import {-# SOURCE #-} B
1862
1863 -- A2.hs
1864 import B
1865
1866 -- B.hs
1867 x = True
1868
1869 -- B.hs-boot
1870 x :: Bool
1871 \end{verbatim}
1872
1873 Then we translate the following packages as follows:
1874
1875 \begin{verbatim}
1876 exposed-modules: A1 B
1877 -- translates to
1878 B :: [ x :: Bool ]
1879 A1 = [ import B ]
1880 B = [ x = True ]
1881 \end{verbatim}
1882
1883 but
1884
1885 \begin{verbatim}
1886 exposed-modules: A2 B
1887 -- translates to
1888 B = [ x = True ]
1889 B :: [ x :: Bool ]
1890 A2 = [ import B ]
1891 \end{verbatim}
1892
1893 The import controls placement between signature and module: in A1 it
1894 forces B's signature to be sorted before B's implementation (whereas in
1895 the second case, there is no constraint, so we preferentially place
1896 B's implementation first).
1897
1898 \paragraph{Holes in the database} In the presence of holes,
1899 \verb|build-depends| resolution becomes more complicated. First,
1900 let's consider the case where the package we are building is
1901 definite, but the package database contains indefinite packages with holes.
1902 In order to maintain the linking restriction, we now have to order packages
1903 from step (2) of the previous elaboration. We can do this by creating
1904 a directed graph, where nodes are packages and edges are from holes to the
1905 package which implements them. If there is a cycle, this indicates a mutually
1906 recursive package. In the absence of cycles, a topological sorting of this
1907 graph preserves the linking invariant.
1908
1909 One subtlety to consider is the fact that an entry in \verb|build-depends|
1910 can affect how a hole is instantiated by another entry. This might be a
1911 bit weird to users, who might like to explicitly say how holes are
1912 filled when instantiating a package. Food for thought, surface syntax wise.
1913
1914 \paragraph{Holes in the package} Actually, this is quite simple: the
1915 ordering of includes goes as before, but some indefinite packages in the
1916 database are less constrained, as their ``dependencies'' are fulfilled
1917 by the holes at the top-level of this package. It's also worth noting
1918 that some dependencies will go unresolved, since the following package
1919 is valid:
1920
1921 \begin{verbatim}
1922 package a where
1923 A :: ...
1924 package b where
1925 include a
1926 \end{verbatim}
1927
1928 \paragraph{Multiple signatures} In Backpack syntax, it's possible to
1929 define a signature multiple times, which is necessary for mutually
1930 recursive signatures:
1931
1932 \begin{verbatim}
1933 package a where
1934 A :: [ data A ]
1935 B :: [ import A; data B = B A ]
1936 A :: [ import B; data A = A B ]
1937 \end{verbatim}
1938
1939 Critically, notice that we can see the constructors for both module B and A
1940 after the signatures are linked together. This is not possible in GHC
1941 today, but could be possible by permitting multiple hs-boot files. Now
1942 the SOURCE pragma indicating an import must \emph{disambiguate} which
1943 hs-boot file it intends to include. This might be one way of doing it:
1944
1945 \begin{verbatim}
1946 -- A.hs-boot2
1947 data A
1948
1949 -- B.hs-boot
1950 import {-# SOURCE hs-boot2 #-} A
1951
1952 -- A.hs-boot
1953 import {-# SOURCE hs-boot #-} B
1954 \end{verbatim}

\paragraph{Explicit or implicit reexports} One annoying property of
this proposal is that, looking at the \verb|exposed-modules| list, it is
not immediately clear what source files one would expect to find in the
current package. It is not obvious what the proper way to address this
is.

\paragraph{Better syntax for SOURCE} If we enshrine the SOURCE import
as a way of solving Backpack ordering problems, it would be nice to have
some better syntax for it. One possibility is:

\begin{verbatim}
abstract import Data.Foo
\end{verbatim}

which makes it clear that this module is pluggable, typechecking against
a signature. Note that this only indicates how type checking should be
done: when actually compiling the module we will compile against the
interface file for the true implementation of the module.

It's worth noting that the SOURCE annotation was originally made a
pragma because, in principle, it should have been possible to compile
some recursive modules without needing the hs-boot file at all. But if
we're moving towards boot files as signatures, this concern is less
relevant.

\section{Type classes and type families}

\subsection{Background}

Before we talk about how to support type classes in Backpack, it's first
worth talking about what we are trying to achieve in the design. Most
would agree that \emph{type safety} is the cardinal law that should be
preserved (in the sense that segfaults should not be possible), but
there are many instances of ``bad behavior'' (top-level mutable state,
weakening of abstraction guarantees, ambiguous instance resolution,
etc.) which various Haskellers may disagree on the necessity of ruling out.

With this in mind, it is worth summarizing what kinds of guarantees are
presently given by GHC with regard to type classes and type families,
as well as characterizing the \emph{cultural} expectations of the
Haskell community.

\paragraph{Type classes} When discussing type class systems, there are
several properties that one may talk about.
A set of instances is \emph{confluent} if, no matter in what order
constraint solving is performed, GHC will terminate with a canonical set
of constraints that must be satisfied for any given use of a type class.
In other words, confluence says that we won't conclude that a program
doesn't type check just because we swapped in a different constraint
solving algorithm.

Confluence's closely related twin is \emph{coherence} (defined in ``Type
classes: exploring the design space''). This property states that
``every different valid typing derivation of a program leads to a
resulting program that has the same dynamic semantics.'' Why could
differing typing derivations result in different dynamic semantics? The
answer is that context reduction, which picks out type class instances,
elaborates into concrete choices of dictionaries in the generated code.
Confluence is a prerequisite for coherence, since one
can hardly talk about the dynamic semantics of a program that doesn't
type check.

In the vernacular, confluence and coherence are often incorrectly used
to refer to another related property: \emph{global uniqueness of instances},
which states that in a fully compiled program, for any type, there is at most one
instance resolution for a given type class. Languages with local type
class instances, such as Scala, generally do not have this property, and
this assumption is frequently used for abstraction.

So, what properties does GHC enforce, in practice?
In the absence of any type system extensions, GHC employs a set of
rules (described in ``Exploring the design space'') to ensure that type
class resolution is confluent and coherent. Intuitively, it achieves
this by having a very simple constraint solving algorithm (generate
wanted constraints and solve wanted constraints) and then requiring the
set of instances to be \emph{nonoverlapping}, ensuring there is only
ever one way to solve a wanted constraint. Non-overlap is a
more stringent restriction than either confluence or coherence, and
via the \verb|OverlappingInstances| and \verb|IncoherentInstances| extensions, GHC
allows a user to relax this restriction ``if they know what they're doing.''
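
To make the relaxation concrete, here is a small, self-contained sketch
(invented for this document; the \verb|Describe| class and its instances are
hypothetical, not drawn from any library) of how \verb|OverlappingInstances|
admits a program the nonoverlap rule would reject, while keeping resolution
deterministic by choosing the most specific matching instance:

\begin{verbatim}
{-# LANGUAGE FlexibleInstances, OverlappingInstances #-}
module Main where

class Describe a where
    describe :: a -> String

-- Catch-all instance: matches every type.
instance Describe a where
    describe _ = "something"

-- More specific instance: overlaps the catch-all.
instance Describe Int where
    describe _ = "an Int"

main :: IO ()
main = do
    putStrLn (describe (3 :: Int))  -- most specific instance wins: "an Int"
    putStrLn (describe True)        -- only the catch-all matches: "something"
\end{verbatim}

Note that resolution here is still coherent: at every use site there is a
unique most specific instance, so the elaborated dictionaries are determined.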

Surprisingly, however, GHC does \emph{not} enforce global uniqueness of
instances. Imported instances are not checked for overlap until we
attempt to use them for instance resolution. Consider the following program:

\begin{verbatim}
-- T.hs
data T = T
-- A.hs
import T
instance Eq T where
-- B.hs
import T
instance Eq T where
-- C.hs
import A
import B
\end{verbatim}

When compiled with one-shot compilation, \verb|C| will not report
overlapping instances unless we actually attempt to use the \verb|Eq|
instance in C.\footnote{When using batch compilation, GHC reuses the
instance database and is actually able to detect the duplicated
instance when compiling B. But if you run it again, recompilation
avoidance skips A, and it finishes compiling! See this bug:
\url{https://ghc.haskell.org/trac/ghc/ticket/5316}} This is by
design\footnote{\url{https://ghc.haskell.org/trac/ghc/ticket/2356}}:
checking eagerly that there are no overlapping instances requires
reading all the interface files a module may depend on.

We might summarize these three properties in the following manner.
Culturally, the Haskell community expects \emph{global uniqueness of instances}
to hold: the implicit global database of instances should be
confluent and coherent. GHC, however, does not enforce uniqueness of
instances: instead, it merely guarantees that the \emph{subset} of the
instance database it uses when it compiles any given module is
confluent and coherent. GHC does do some
tests when an instance is declared to see if it would result in overlap
with visible instances, but the check is by no means
perfect\footnote{\url{https://ghc.haskell.org/trac/ghc/ticket/9288}};
truly, \emph{type-class constraint resolution} has the final word. One
mitigating factor is that in the absence of \emph{orphan instances}, GHC is
guaranteed to eagerly notice when the instance database has overlap.\footnote{Assuming that the instance declaration checks actually worked\ldots}

Clearly, the fact that GHC's lazy behavior is surprising to most
Haskellers means that the lazy check is mostly good enough: a user
is likely to discover overlapping instances one way or another.
However, it is relatively simple to construct example programs which
violate global uniqueness of instances in an observable way:

\begin{verbatim}
-- A.hs
module A where
data U = X | Y deriving (Eq, Show)

-- B.hs
module B where
import Data.Set
import A

instance Ord U where
    compare X X = EQ
    compare X Y = LT
    compare Y X = GT
    compare Y Y = EQ

ins :: U -> Set U -> Set U
ins = insert

-- C.hs
module C where
import Data.Set
import A

instance Ord U where
    compare X X = EQ
    compare X Y = GT
    compare Y X = LT
    compare Y Y = EQ

ins' :: U -> Set U -> Set U
ins' = insert

-- D.hs
module Main where
import Data.Set
import A
import B
import C

test :: Set U
test = ins' X $ ins X $ ins Y $ empty

main :: IO ()
main = print test

-- OUTPUT
$ ghc -Wall -XSafe -fforce-recomp --make D.hs
[1 of 4] Compiling A ( A.hs, A.o )
[2 of 4] Compiling B ( B.hs, B.o )

B.hs:5:10: Warning: Orphan instance: instance [safe] Ord U
[3 of 4] Compiling C ( C.hs, C.o )

C.hs:5:10: Warning: Orphan instance: instance [safe] Ord U
[4 of 4] Compiling Main ( D.hs, D.o )
Linking D ...
$ ./D
fromList [X,Y,X]
\end{verbatim}

Locally, all type class resolution was coherent: in the subset of
instances each module had visible, type class resolution could be done
unambiguously. Furthermore, the types of \verb|ins| and \verb|ins'|
discharge type class resolution, so that in \verb|D|, when the database
is now overlapping, no resolution occurs, and the error is never found.

It is easy to dismiss this example as an implementation wart in GHC, and
continue pretending that global uniqueness of instances holds. However,
the problem with \emph{global uniqueness of instances} is that it is
inherently nonmodular: you might find yourself unable to compose two
components because they accidentally defined the same type class
instance, even though these instances are plumbed deep in the
implementation details of the components.

As it turns out, there is already another feature in Haskell which
must enforce global uniqueness, to prevent segfaults.
We now turn to type classes' close cousin: type families.

\paragraph{Type families} With type families, confluence is the primary
property of interest. (Coherence is not of much interest because type
families are elaborated into coercions, which don't have any
computational content.) Rather than considering the set of
constraints we reduce to, confluence for type families considers the
reduction of the type families themselves. The overlap checks for type families
can be quite sophisticated, especially in the case of closed type
families.
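
As a concrete illustration (a made-up example, not drawn from the Backpack
papers), a closed type family may contain overlapping equations precisely
because GHC tries them top to bottom, so reduction stays confluent:

\begin{verbatim}
{-# LANGUAGE DataKinds, TypeFamilies, TypeOperators #-}
module Main where

import Data.Proxy (Proxy (..))

-- The first two equations overlap (both match a non-empty list);
-- the ordering of a closed family disambiguates them.
type family Elem x xs :: Bool where
    Elem x (x ': rest) = 'True
    Elem x (y ': rest) = Elem x rest
    Elem x '[]         = 'False

-- Type checks only because Elem Int '[Bool, Int] reduces to 'True.
witness :: Proxy (Elem Int '[Bool, Int])
witness = Proxy :: Proxy 'True

main :: IO ()
main = witness `seq` putStrLn "Elem Int '[Bool, Int] ~ 'True"
\end{verbatim}

Open type family instances, by contrast, may come from different modules, which
is exactly why their non-overlap must be enforced globally.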

Unlike type classes, however, GHC \emph{does} check the non-overlap
of type families eagerly. The analogous program does \emph{not} type check:

\begin{verbatim}
-- F.hs
type family F a :: *
-- A.hs
import F
type instance F Bool = Int
-- B.hs
import F
type instance F Bool = Bool
-- C.hs
import A
import B
\end{verbatim}

The reason is that it is \emph{unsound} to ever allow any overlap
(unlike in the case of type classes, where it just leads to incoherence).
Thus, whereas one might imagine dropping the global-uniqueness-of-instances
invariant for type classes, it is absolutely necessary to perform global
enforcement here. There's no way around it!

\subsection{Local type classes}

Here, we say \textbf{NO} to global uniqueness.

This design is perhaps best discussed in relation to modular type
classes, which share many similar properties. Instances are now
treated as first-class objects (in MTCs, they are simply modules)---we
may explicitly hide or include instances for type class resolution (in
MTCs, this is done via the \verb|using| top-level declaration). This is
essentially what was sketched in Section 5 of the original Backpack
paper. As a simple example:

\begin{verbatim}
package p where
    A :: [ data T = T ]
    B = [ import A; instance Eq T where ... ]

package q where
    A = [ data T = T; instance Eq T where ... ]
    include p
\end{verbatim}

Here, \verb|B| does not see the extra instance declared by \verb|A|,
because it was thinned from its signature of \verb|A| (and thus never
declared canonical). To declare an instance without making it
canonical, it must be placed in a separate (unimported) module.

Like modular type classes, Backpack does not give rise to incoherence,
because instance visibility can only be changed at the top-level module
language, where it is already considered best practice to provide
explicit signatures. Here is the example used in the Modular Type
Classes paper to demonstrate the problem:

\begin{verbatim}
structure A = using EqInt1 in
              struct ...fun f x = eq(x,x)... end
structure B = using EqInt2 in
              struct ...val y = A.f(3)... end
\end{verbatim}

Is the type of \verb|f| \verb|int -> bool|, or does it have a type-class
constraint? Because type checking proceeds over the entire program, ML
could hypothetically pick either. However, ported to Haskell, the
example looks like this:

\begin{verbatim}
EqInt1 :: [ instance Eq Int ]
EqInt2 :: [ instance Eq Int ]
A = [
    import EqInt1
    f x = x == x
]
B = [
    import EqInt2
    import A hiding (instance Eq Int)
    y = f 3
]
\end{verbatim}

There may be ambiguity, yes, but it can be easily resolved by the
addition of a top-level type signature to \verb|f|, which is considered
best practice anyway. Additionally, Haskell users are trained to expect
a particular inference for \verb|f| in any case (the polymorphic one).
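
In a hypothetical rendering of this fragment as plain Haskell (the module and
names are invented for illustration), the explicit signature pins down the
polymorphic reading of \verb|f|, so no choice of visible \verb|Eq Int|
instance can change its type:

\begin{verbatim}
module Main where

-- The explicit signature settles the question the ML example raises:
-- f is polymorphic, not specialized by whichever Eq Int is in scope.
f :: Eq a => a -> Bool
f x = x == x

y :: Bool
y = f (3 :: Int)

main :: IO ()
main = print y  -- True
\end{verbatim}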

Here is another example which might be considered surprising:

\begin{verbatim}
package p where
    A :: [ data T = T ]
    B :: [ data T = T ]
    C = [
        import qualified A
        import qualified B
        instance Show A.T where show T = "A"
        instance Show B.T where show T = "B"
        x :: String
        x = show A.T ++ show B.T
    ]
\end{verbatim}

In the original Backpack paper, it was implied that module \verb|C|
should not type check if \verb|A.T = B.T| (failing at link time).
However, if we set aside, for a moment, the issue that anyone who
imports \verb|C| in such a context will now have overlapping instances,
there is no reason in principle why the module itself should be
problematic. Here is the example in MTCs, which I have it on good word
from Derek does type check.

\begin{verbatim}
signature SIG = sig
    type t
    val mk : t
end
signature SHOW = sig
    type t
    val show : t -> string
end
functor Example (A : SIG) (B : SIG) =
    let structure ShowA : SHOW = struct
        type t = A.t
        fun show _ = "A"
    end in
    let structure ShowB : SHOW = struct
        type t = B.t
        fun show _ = "B"
    end in
    using ShowA, ShowB in
    struct
        val x = show A.mk ++ show B.mk
    end : sig val x : string end
\end{verbatim}

The moral of the story is: even though in a later context the instances
are overlapping, inside the functor the type-class resolution is unambiguous
and should be done (so \verb|x = "AB"|).

Up until this point, we've argued why MTCs and this Backpack design are similar.
However, there is an important sociological difference between modular type
classes and this proposed scheme for Backpack. In the presentation ``Why Applicative
Functors Matter'', Derek mentions the canonical example of defining a set:

\begin{verbatim}
signature ORD = sig type t; val cmp : t -> t -> bool end
signature SET = sig type t; type elem;
                    val empty : t;
                    val insert : elem -> t -> t ...
                end
functor MkSet (X : ORD) :> SET where type elem = X.t
    = struct ... end
\end{verbatim}

This is actually very different from how sets tend to be defined in
Haskell today. If we directly encoded this in Backpack, it would
look like this:

\begin{verbatim}
package mk-set where
    X :: [
        data T
        cmp :: T -> T -> Bool
    ]
    Set :: [
        data Set
        empty :: Set
        insert :: T -> Set -> Set
    ]
    Set = [
        import X
        ...
    ]
\end{verbatim}

It's also informative to consider how MTCs would encode a set as it is written
today in Haskell:

\begin{verbatim}
signature ORD = sig type t; val cmp : t -> t -> bool end
signature SET = sig type 'a t;
                    val empty : 'a t;
                    val insert : (X : ORD) => X.t -> X.t t -> X.t t
                end
struct MkSet :> SET = struct ... end
\end{verbatim}

Here, it is clear to see that while functor instantiation occurs for
the implementation, it does not occur for the types. This is a big limitation
of the Haskell approach, and it explains why Haskellers, in practice,
find global uniqueness of instances so desirable.

Implementation-wise, this requires some subtle modifications to how we
do type class resolution. Type checking of indefinite modules works as
before, but when we go to actually compile them against explicit
implementations, we need to ``forget'' that two types are equal when
doing instance resolution. This could probably be implemented by
associating type class instances with the original name that was
utilized when typechecking, so that we can resolve ambiguous matches
against types which have the same original name now that we are
compiling.

As we've mentioned previously, this strategy is unsound for type families.

\subsection{Globally unique}

Here, we say \textbf{YES} to global uniqueness.

When we require the global uniqueness of instances (either because
that's the type class design we chose, or because we're considering
the problem of type families), we will need to reject declarations like the
one cited above when \verb|A.T = B.T|:

\begin{verbatim}
A :: [ data T ]
B :: [ data T ]
C :: [
    import qualified A
    import qualified B
    instance Show A.T where show T = "A"
    instance Show B.T where show T = "B"
]
\end{verbatim}

The paper mentions that a link-time check is sufficient to prevent this
case from arising. While in the previous section we argued why this
is actually unnecessary when local instances are allowed, the link-time
check is a good match in the case of global instances, because any
instance \emph{must} be declared in the signature. The scheme proceeds
as follows: when some instances are typechecked initially, we type check
them as if all variable module identities were distinct. Then, when
we perform linking (we \verb|include| or we unify some module
identities), we check again to see if we've discovered some instance
overlap. This linking check is akin to the eager check that is
performed today for type families; it would need to be implemented for
type classes as well. However, there is a twist: we are \emph{redoing}
the overlap check now that some identities have been unified.

As an implementation trick, one could defer the check until \verb|C|
is compiled, keeping in line with GHC's lazy policy of not checking for
overlap until the use site. (Once again, this is unsound for type families.)

\paragraph{What about module inequalities?} An older proposal was for
signatures to contain ``module inequalities'', i.e., assertions that two
modules are not equal. (Technically: we need to be able to apply this
assertion to $\beta$ module variables, since \verb|A != B| while
\verb|A.T = B.T|.) Currently, Edward thinks that module inequalities
are of only marginal utility with local instances (i.e., not enough to
justify the implementation cost) and not useful at all in the world of
global instances!

With local instances, module inequalities could be useful to statically
rule out examples like \verb|show A.T ++ show B.T|. Because such uses
are not necessarily reflected in the signature, it would be a violation
of separate module development to try to divine the constraint from the
implementation itself. I claim this is of limited utility, however, because,
as we mentioned earlier, we can compile these ``incoherent'' modules perfectly
coherently. With global instances, all instances must be in the signature, so
while it might be aesthetically displeasing to have the signature impose
extra restrictions on linking identities, we can carry this out without
violating the linking restriction.

\section{Bits and bobs}

\subsection{Abstract type synonyms}

In Paper Backpack, abstract type synonyms are not permitted, because GHC doesn't
understand how to deal with them. The purpose of this section is to describe
one particular nastiness of abstract type synonyms, by way of the occurs check:

\begin{verbatim}
A :: [ type T ]
B :: [ import qualified A; type T = [A.T] ]
\end{verbatim}

At this point, it is illegal for \verb|A = B|, as this type synonym would
otherwise fail the occurs check. This seems like pretty bad news, since every
instance of the occurs check in the type checker could constitute a module
inequality.

\section{Open questions}\label{sec:open-questions}

Here are open problems about the implementation which still require
hashing out.

\begin{itemize}

    \item In Section~\ref{sec:simplifying-backpack}, we argued that we
        could implement Backpack without needing a shaping pass. We're
        pretty certain that this will work for typechecking and
        compiling fully definite packages with no recursive linking, but
        in Section~\ref{sec:typechecking-indefinite}, we described some
        of the prevailing difficulties of supporting signature linking.
        Renaming is not an insurmountable problem, but backwards flow of
        shaping information can be, and it is unclear how best to
        accommodate this. This is probably the most important problem
        to overcome.

    \item In Section~\ref{sec:installing-indefinite}, a few choices for how to
        store source code were pitched; however, there is no consensus on which
        one is best.

    \item What is the impact of the multiplicity of PackageIds on
        dependency solving in Cabal? Old questions of what to prefer
        when multiple package-versions are available arise again (Cabal
        originally only needed to solve this between different versions
        of the same package, preferring the oldest version), but with
        signatures, there are more choices. Should there be a complex
        solver that does all signature solving, or a preprocessing step
        that puts things back into the form of the original Cabal
        problem? Authors may want to suggest policy for what packages
        should actually link against signatures (so a crypto library
        doesn't accidentally link against a null cipher package).

\end{itemize}

\end{document}