In progress Backpack implementation docs.
[ghc.git] / docs / backpack / backpack-impl.tex
1 \documentclass{article}
2
3 \usepackage{graphicx} %[pdftex] OR [dvips]
4 \usepackage{fullpage}
5 \usepackage{float}
6 \usepackage{titling}
7 \setlength{\droptitle}{-6em}
8
9 \newcommand{\ghcfile}[1]{\textsl{#1}}
10
11 \title{Implementing Backpack}
12
13 \begin{document}
14
15 \maketitle
16
17 The purpose of this document is to describe an implementation path
18 for Backpack~\cite{Kilpatrick:2014:BRH:2535838.2535884} in GHC\@.
19
20 We start off by outlining the current architecture of GHC, ghc-pkg and Cabal,
21 which constitute the existing packaging system. We then state what our subgoals
22 are, since there are many similar sounding but different problems to solve. Next,
23 we describe the ``probably correct'' implementation plan, and finish off with
24 some open design questions. This is intended to be an evolving design document,
25 so please contribute!
26
27 \section{Current packaging architecture}
28
29 The overall architecture is described in Figure~\ref{fig:arch} (ignore
30 the red and green bits for now).
31
32 \begin{figure}[H]
33 \center{\scalebox{0.8}{\includegraphics{arch.png}}}
34 \label{fig:arch}\caption{Architecture of GHC, ghc-pkg and Cabal. Green bits indicate additions from upcoming IHG work, red bits indicate additions from Backpack. Orange indicates a Haskell library.}
35 \end{figure}
36
37 Here, arrows indicate dependencies from one component to another.
38 (insert architectural description here)
39
40 A particularly important component of this picture is the package
41 database, sketched in Figure~\ref{fig:pkgdb}.
42
43 \begin{figure}[H]
44 \center{\scalebox{0.8}{\includegraphics{pkgdb.png}}}
45 \label{fig:pkgdb}\caption{Anatomy of a package database and a package identifier.}
46 \end{figure}
47
48 An installed package is calculated from a Cabal file through the process
49 of dependency resolution and compilation. We can think of it a
50 database, whose primary key is the InstalledPackageId, which presently
51 is the package name, the package version, and the ABI hash (nota bene,
52 the diagram disagrees with this text: it shows the version of installed
53 package IDs which we'd like to move towards.) These IDs uniquely
54 identify an instance of an installed package. A mere PackageId omits
55 the ABI hash.
56
57 The database entry itself contains the information from the installed package ID,
58 as well as information such as what dependencies it was linked against, where
59 its compiled code and interface files live, its compilation flags, what modules
60 it exposes, etc. Much of this information is only relevant to Cabal; GHC
61 uses a subset of the information in the package database.
62
63 \section{Goals}
64
65 There are actually a number of different goals we have for modifying the
66 packaging system.
67
68 \begin{itemize}
69 \item Support multiple instances of containers-2.9 \emph{in the
70 package database}. These instances may be compiled against
71 different dependencies, have the same dependencies but different
72 source files (as when a package is being developed), or be
73 compiled with different options. It is less important to allow
74 these instances to be linkable together.
75
76 \item Some easy-to-implement subset of the functionality provided by
77 packages with holes (Backpack proper).
78 \end{itemize}
79
80 A lower priority goal is to actually allow multiple instances of
81 containers-2.9 to be linked together in the same executable
82 program.\footnote{In particular, this requires changes to how linker symbols
83 are assigned. However, this feature is important to implement a number
84 of Backpack features.}
85
86 A \emph{non-goal} is to allow users to upgrade upstream libraries
87 without recompiling downstream. This is an ABI concern and we're not
88 going to worry about it.
89
90 \section{Aside: Recent IHG work}
91
92 The IHG project has allocated some funds to relax the package instance
93 constraint in the package database, so that multiple instances can be
94 stored, but now the user of GHC must explicitly list package-IDs to be
95 linked against. In the far future, it would be expected that tools like
96 Cabal automatically handle instance selection among a large number of
97 instances, but this is subtle and so this work is only to do some
98 foundational work, allowing a package database to optionally relax the
99 unique package-version requirement, and utilize environment files to
100 select which packages should be used. See Duncan's email for more
101 details on the proposal.
102
103 For the purpose of Backpack, the only relevant part of this proposal
104 is the relaxation of package databases to allow multiple
105
106 \section{Adding Backpack to GHC}
107
108 Backpack additions are described in red in the architectural diagrams.
109 The current structure of this section is to describe the additions bottom up.
110
111 \subsection{Use InstalledPackageId instead PackageId in typechecking}
112
113 In Backpack, there needs to be some mechanism for assigning
114 \emph{physical module identities} to modules, which are essential for
115 typechecking Backpack packages, since they let us tell if two types are
116 equal or not. In the paper, the physical identity was represented as the
117 package that constructed it as well as some representation of the module
118 source. We can simplify this slightly: in current Cabal packages, we
119 require that modules always be given a package-unique logical name;
120 thus, physical identities can be simply represented as a PackageId plus
121 module name. (See \ghcfile{compiler/basicTypes/Module.lhs:Module})
122
123 However, this is not enough if we allow multiple instances of a package
124 in the package database: now we may incorrectly conclude that types
125 defined by different instances of the same version of a package are
126 equal: this would be especially fatal if the two packages were linked
127 against different underlying libraries. Thus, a physical module name
128 should be represented as an InstalledPackageId (which uniquely
129 identifies an installed package) as well as the original logical name.
130
131 \paragraph{Note about linker symbols} Module is currently used for
132 typechecking, and then once again
133
134 Currently, InstalledPackageId is constructed of a package, version and ABI
135 hash. The use of an ABI hash is a bit of hack, mostly to make sure these
136 installed package IDs are unique. In Figure~\ref{fig:pkgdb}, an alternate
137 logical representation of InstalledPackageId is suggested using
138
139 ----
140
141 Simon gave some nice explanations of original names in GHC
142
143 Sometimes, a name can be exposed in an hi file even if its module
144 wasn't exposed. Example in package R:
145
146 module X where
147 import Internal (f)
148 g = f
149
150 module Internal where
151 import Internal.Total (f)
152
153 Then in X.hi:
154
155 g = <R.id, Internal.Total, f> (this is the original name)
156
157 (The reason we refer to the package as R.id is because it's the
158 full package ID, and not just R).
159
160 How might internal names work with Backpack?
161
162 package P where
163 M = ...
164 N = ...
165 package Q (M, R, T)
166 include P (N -> R)
167 T = ...
168
169 Q; exposed modules
170 M -> <P.id, M>
171 R -> <P.id, N>
172 T -> <Q.id, T>
173
174 When we compile Q, and the interface file gets generated, we have
175 to generate identifiers for each of the exposed modules. These should
176 be calculated to directly refer to the "original name" of each them;
177 so for example M and R point directly to package P, but they also
178 include the original name they had in the original definition.
179
180 ----
181
182 Differing intuitions: GHC internals versus Backpack abstraction
183
184 ----
185
186 Refactoring necessary:
187
188 - PackageId in GHC needs to be InstalledPackageId. I get these IDs
189 from -package-name when I build a package, and these are baked
190 into the hi files and linker names. To the type checker, this
191 IS exactly what a package is. But see open question about linker
192 names...
193
194 There appears to already be a conversion, probably a newtype,
195 from package name to linker names, according to Duncan.
196
197 THE PLAN: To remain BC, we have a flag named -package-name which
198 is used for both. So now I maintain two different values,
199 -package-name sets both, and then I have another flag for setting
200 one separately.
201
202 Watch out: GHC plays tricks with wired-in names. Suggested is a
203 table from wired names to package IDs (constructed with the
204 package environment)
205
206 ghc-pkg
207
208 - Remove the "removal step" when registering a package (with a flag)
209
210 - Check main/Packages.lhs:mkPackagesState to look out for shadowing
211 within a database, it might already do the right thing (key idea
212 is that we already do something sensible merging package databases
213 together, reuse that)
214
215 - Experiment using -hide-all-packages -package-id ... flags explicitly
216
217 \section{Open questions}
218
219 Here are open problems about the implementation which still require
220 hashing out.
221
222 - Aliasing of signatures means that it is no longer the case that
223 original name means type equality. We were not able to convince
224 Simon of any satisfactory resolution. Strawman proposal is to
225 extending original names to also be variables probably won't work
226 because it is so deeply wired, but it's difficult to construct hi
227 files so that everything works out (quadratic).
228
229 - Relationship between linker names and InstalledPackageId? The reason
230 the obvious thing to do is use all of InstalledPackageId for linker
231 name, but this breaks recompilation. So some of these things
232 should go in the linker name, and not others (yes package, yes
233 version, yes some compile flags (definitely ways), eventually
234 dependency package IDs, what about cabal build flags). This is
235 approximately an ABI hash, but it's computable before compilation.
236 This worries Simon because now there are two names, but maybe
237 the DB can solve that problem--unfortunately, GHC doesn't ever
238 register during compilation; only later.
239
240 Simon also thought we should use shorter names for linker
241 names and InstallPkgIds. This appears to be orthogonal.
242
243 - In this example:
244
245 package A where
246 A = [ ... ]
247 package A2 where
248 A2 = [ ... ]
249 package B (B)
250 ...
251
252 Do the seperate instantiations of B exist as seperate artifacts
253 in the database, or do they get constructed on the fly by
254 definite packages which include them? The former is conceptually
255 nicer (identity of types work, closer to Backpack semantics), but
256 the latter is easier for linker names. (Simon's first inclination
257 is to construct it on the fly.)
258
259 There was another example, showing that we need to solve this
260 problem even for indefinite combinations of indefinite modules.
261 You can get to it by modifying the earlier example so that C and
262 D still have holes, which E does not fill.
263
264 - We have to store the preprocessed sources for indefinite packages.
265 This is hard when we're constructing packages on the fly.
266
267 - What is the impact on dependency solving in Cabal? Old questions
268 of what to prefer when multiple package-versions are available
269 (Cabal originally only needed to solve this between different
270 versions of the same package, preferring the oldest version), but
271 with signatures, there are more choices. Should there be a
272 complex solver that does all signature solving, or a preprocessing
273 step that puts things back into the original Cabal version.
274 Authors may want to suggest policy for what packages should actually
275 link against signatures (so a crypto library doesn't accidentally
276 link against a null cipher package).
277
278 \section{Immediate tasks}
279
280 At this point in time, it seems we can identify the following set
281 of non-controversial tasks which can be started immediately.
282
283 \begin{itemize}
284 \item Relax the package database constraint to allow multiple
285 instances of package-version.
286 \item Propagate the use of \verb|InstalledPackageId| instead of
287 package IDs for typechecking (but not for linking, yet).
288 \item Implement export declarations in package format, so
289 packages can reexport modules from other packages.
290 \end{itemize}
291
292 The aliasing problem is probably the most important open problem
293 blocking the rest of the system.
294
295 \bibliographystyle{plain}
296 \bibliography{backpack-impl}
297
298 \end{document}