Commit 48dea94

author
Martin Sandve Alnæs
committed
Merge pull request #1 from martinal/notebook-diff-v2
Update notebook-diff.md
2 parents 71b2467 + afc7615 commit 48dea94

1 file changed

+180
-82
lines changed

notebook-diff/notebook-diff.md

Lines changed: 180 additions & 82 deletions
@@ -4,15 +4,21 @@
44

55
Diffing and merging notebooks is not properly handled by standard line-based diff and merge tools.
66

7+
78
## Proposed Enhancement
89

910
* Make a package containing tools for diff and merge of notebooks
10-
* Include a command line api
11-
* Pretty-printing of diffs for command line display
12-
* Command line tools for interactive resolution of merge conflicts
13-
* Make the merge tool git compatible
14-
* Make a web gui for displaying notebook diffs
15-
* Make a web gui for interactive resolution of merge conflicts
11+
* Command line functionality:
12+
- A command nbdiff with diff output as json or pretty-printed to console
13+
- A command nbmerge which should be git compatible
14+
- Command line tools for interactive resolution of merge conflicts
15+
- Optional launching of web gui for interactive resolution of merge conflicts
16+
- All command line functionality is also available through the Python package
17+
* Web gui functionality:
18+
- A simple server with a web api to access diff and merge functionality
19+
- A web gui for displaying notebook diffs
20+
- A web gui for displaying notebook merge conflicts
21+
- A web gui for interactive resolution of notebook merge conflicts
1622
* Plugin framework for mime type specific diffing
1723

1824

@@ -27,133 +33,225 @@ testing. At the core of it all is the diff algorithms, which
2733
must handle not only text in source cells but also a number of
2834
data formats based on mime types in output cells.
2935

36+
Cell source is usually the primary content, and output can presumably
37+
be regenerated. In general it is not possible to guarantee that merged
38+
sources and merged output are consistent or make any kind of
39+
sense. For many use cases options to silently drop output instead of
40+
requiring conflict resolution will produce a smoother workflow.
41+
However, such data loss should only happen when explicitly requested.
42+
43+
44+
### Basic use cases of notebook diff
45+
46+
* View difference between two versions of a file:
47+
`nbdiff base.ipynb remote.ipynb`
48+
* Store difference between two versions of a file to a patch file:
49+
`nbdiff base.ipynb remote.ipynb patch.json`
50+
* Compute diff of notebooks for use in a regression test framework:
51+
```
52+
import nbdime
53+
di = nbdime.diff_notebooks(a, b)
54+
assert not di
55+
```
56+
* View difference of output cells after re-executing notebook
3057

31-
### Basic diffing use cases
3258

33-
* View difference between versions of a file
59+
Variations will be added on demand with arguments to the nbdiff command, e.g.:
60+
3461
* View diff of sources only
3562
* View diff of output cells (basic text diff of output cells, image diff with external tool)
3663

3764

38-
### Version control use cases
39-
40-
Most commonly, cell source is the primary content,
41-
and output can presumably be regenerated. Indeed, it
42-
is not possible to guarantee that merged sources and
43-
merged output is consistent or makes any kind of sense.
65+
### Basic use cases of notebook merge
4466

4567
The main use case for the merge tool will be a git-compatible command-line merge tool:
46-
68+
```
4769
nbmerge base.ipynb local.ipynb remote.ipynb merged.ipynb
48-
49-
and a web gui for conflict resolution. Ideally the web gui can
50-
reuse as much as possible from jupyter notebook. An initial
51-
version of conflict resolution can be to output a notebook with
52-
conflicts marked within cells, to be manually edited as a regular
53-
jupyter notebook.
70+
```
71+
which can be called from git and launch a console tool or web gui for conflict resolution if needed.
72+
Ideally the web gui can reuse as much as possible from Jupyter Notebook.
5473

5574
Goals:
5675

5776
* Trouble free automatic merge when no merge conflicts occur
58-
* Optional behaviour to drop conflicting output
77+
* Optional behaviour to drop conflicting output, execution counts, and possibly other secondary data
5978
* Easy to use interactive conflict resolution
6079

61-
Not planning (for now):
6280

63-
* Merge of arbitrary output cell content
81+
### Notes on initial implementation
82+
83+
* An initial version of diff gui can simply show e.g. two differing
84+
images side by side, but later versions should do something more
85+
clever.
86+
87+
* An initial version of merge can simply join or optionally delete
88+
conflicting output.
89+
90+
* An initial version of conflict resolution can be to output a
91+
notebook with conflicts marked within cells, to be manually edited
92+
as a regular Jupyter notebook.
93+
94+
95+
## Diff format
96+
97+
The diff object represents the difference between two objects A and
98+
B as a list of operations (ops) to apply to A to obtain B. Each
99+
operation is represented as a dict with at least two items:
100+
```
101+
{ "op": <opname>, "key": <key> }
102+
```
103+
The objects A and B are either mappings (dicts) or sequences (lists),
104+
and different sets of ops are legal for mappings and for sequences.
105+
Depending on the op, the operation dict usually contains an
106+
additional argument, documented below.
107+
108+
109+
### Diff format for mappings
110+
111+
For mappings, the key is always a string. Valid ops are:
112+
113+
* `{ "op": "remove", "key": <string> }`: delete existing value at key
114+
* `{ "op": "add", "key": <string>, "value": <value> }`: insert new value at key not previously existing
115+
* `{ "op": "replace", "key": <string>, "value": <value> }`: replace existing value at key with new value
116+
* `{ "op": "patch", "key": <string>, "diff": <diffobject> }`: patch existing value at key with another diffobject
117+
118+
119+
### Diff format for sequences (list and string)
120+
121+
For sequences the key is always an integer index. This index is
122+
relative to object A of length N. Valid ops are:
123+
124+
* `{ "op": "removerange", "key": <int>, "length": <n> }`: delete the values A[key:key+length]
125+
* `{ "op": "addrange", "key": <int>, "valuelist": <values> }`: insert new items from valuelist before A[key], at end if key=len(A)
126+
* `{ "op": "patch", "key": <int>, "diff": <diffobject> }`: patch existing value at key with another diffobject
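
To make the semantics of the ops above concrete, the following is a minimal sketch of how such a diff could be applied to a base object. The function name `apply_diff` and its helpers are hypothetical and not part of any existing package. The sketch assumes that sequence indices refer to the base object A, as stated above, and that at most one op of each kind is given per index.

```python
import copy


def apply_diff(obj, diff):
    """Return a patched copy of obj (dict, list or str); obj is not modified."""
    if isinstance(obj, dict):
        return _apply_to_mapping(obj, diff)
    return _apply_to_sequence(obj, diff)


def _apply_to_mapping(a, diff):
    b = copy.deepcopy(a)
    for entry in diff:
        op, key = entry["op"], entry["key"]
        if op == "remove":
            del b[key]
        elif op in ("add", "replace"):
            b[key] = copy.deepcopy(entry["value"])
        elif op == "patch":
            # Recurse into the nested diff object
            b[key] = apply_diff(a[key], entry["diff"])
        else:
            raise ValueError(f"invalid mapping op: {op!r}")
    return b


def _apply_to_sequence(a, diff):
    out = []
    pos = 0  # first index of the base sequence not yet copied or removed
    # Stable sort: ops sharing a key keep their original relative order
    for entry in sorted(diff, key=lambda e: e["key"]):
        op, key = entry["op"], entry["key"]
        out.extend(a[pos:key])  # copy untouched base items before this op
        pos = max(pos, key)
        if op == "addrange":
            out.extend(entry["valuelist"])
        elif op == "removerange":
            pos = key + entry["length"]  # skip the removed base items
        elif op == "patch":
            out.append(apply_diff(a[key], entry["diff"]))
            pos = key + 1
        else:
            raise ValueError(f"invalid sequence op: {op!r}")
    out.extend(a[pos:])  # copy the untouched tail
    return "".join(out) if isinstance(a, str) else copy.deepcopy(out)
```

Because all indices refer to the base sequence, the ops can be processed in a single left-to-right pass without tracking how earlier ops shifted later positions.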
127+
128+
129+
### Relation to JSONPatch
130+
131+
The above described diff representation has similarities with the
132+
JSONPatch standard but is different in some significant ways:
64133

65-
Open questions:
134+
* JSONPatch contains operations "move", "copy", "test" not used by
135+
nbdime, and nbdime contains operations "addrange", "removerange", and
136+
"patch" not in JSONPatch.
66137

67-
* Is it important to track source lines moving between cells?
138+
* Instead of providing a recursive "patch" op, JSONPatch uses a deep
139+
JSON pointer based "path" item in each operation instead of the "key"
140+
item nbdime uses. This way JSONPatch can represent the diff object as
141+
a single list instead of the 'tree' of lists that nbdime uses. The
142+
advantage of the recursive approach is that e.g. all changes to a cell
143+
are grouped and do not need to be collected.
68144

69-
Should make a collection of tricky corner cases, and
70-
run merge tools on test cases from e.g. git if possible.
145+
* JSONPatch uses indices that relate to the intermediate (partially
146+
patched) object, meaning transformation number n cannot be interpreted
147+
without going through the transformations up to n-1. In nbdime the
148+
indices relate to the base object, which means 'delete cell 7' means
149+
deleting cell 7 of the base notebook independently of the previous
150+
transformations in the diff.
71151

152+
A conversion function can fairly easily be implemented.
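
As an illustration of what such a conversion might look like, here is a rough sketch that flattens a nested diff into a list of JSON Patch operations. The name `to_json_patch` is hypothetical. The sketch assumes that "patch" ops are only applied to dicts and lists (JSON Patch cannot express character-level changes inside a string), and it translates base-relative indices into the running indices JSON Patch expects by tracking an offset.

```python
def _escape(key):
    """Escape a dict key for use as a JSON Pointer reference token."""
    return key.replace("~", "~0").replace("/", "~1")


def to_json_patch(diff, path=""):
    """Flatten a nested diff into a flat list of JSON Patch style operations."""
    is_sequence = bool(diff) and all(isinstance(e["key"], int) for e in diff)
    if is_sequence:
        # Process by base index; at equal index, insertions come first so
        # that the offset below is correct for the ops that follow.
        diff = sorted(diff, key=lambda e: (e["key"], e["op"] != "addrange"))
    ops = []
    offset = 0  # shift between base index and index in the partially patched list
    for e in diff:
        op, key = e["op"], e["key"]
        if op in ("add", "replace"):
            ops.append({"op": op, "path": f"{path}/{_escape(key)}", "value": e["value"]})
        elif op == "remove":
            ops.append({"op": "remove", "path": f"{path}/{_escape(key)}"})
        elif op == "addrange":
            for i, value in enumerate(e["valuelist"]):
                ops.append({"op": "add", "path": f"{path}/{key + offset + i}", "value": value})
            offset += len(e["valuelist"])
        elif op == "removerange":
            # Removing repeatedly at the same index deletes a contiguous range
            for _ in range(e["length"]):
                ops.append({"op": "remove", "path": f"{path}/{key + offset}"})
            offset -= e["length"]
        elif op == "patch":
            token = key + offset if is_sequence else _escape(key)
            ops.extend(to_json_patch(e["diff"], f"{path}/{token}"))
        else:
            raise ValueError(f"invalid op: {op!r}")
    return ops
```

The grouping of changes per cell is lost in the flat output, which is the trade-off described in the bullet points above.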
72153

73-
### Regression testing use cases
74154

75-
* View difference of output cells after re-running cells
155+
## High level diff algorithm approach
76156

157+
The package will contain both generic and notebook-specific variants of diff algorithms.
77158

78-
### Diff format
159+
The generic diff algorithms will handle most json-compatible objects:
79160

80-
A preliminary diff format has been defined, where the diff result is itself a json object.
81-
The details of this format is being refined. For examples of concrete diff
82-
objects, see e.g. the test suite for patch.
161+
* Arbitrary nested structures of dicts and lists are allowed
83162

163+
* Leaf values can be any strings and numbers
84164

85-
#### Diff format for dicts (current)
165+
* Dict keys must always be strings
86166

87-
A diff of two dicts is a list of diff entries:
167+
The generic variants will by extension produce correct diffs for
168+
notebooks, but the notebook-specific variants aim to produce more
169+
meaningful diffs. "Meaningful" is a subjective concept and the
170+
algorithm descriptions below are therefore fairly high-level with
171+
many details left up to the implementation.
88172

89-
key = string
90-
entry = [action, key] | [action, key, argument]
91-
diff = [entry0, entry1, ...]
92173

93-
A dict diff entry is a list of action and argument (except for deletion):
94174

95-
* ["-", key]: delete value at key
96-
* ["+", key, newvalue]: insert newvalue at key
97-
* ["!", key, diff]: patch value at key with diff
98-
* [":", key, newvalue]: replace value at key with newvalue
175+
### Handling nested structures by alignment and recursion
99176

177+
The diff of objects A and B is computed recursively, handling dicts
178+
and lists with different algorithms.
100179

101-
#### Diff format for dicts (alternative)
102180

103-
A diff of two dicts is itself a dict mapping string keys to diff entries:
181+
### Diff approach for dicts
104182

105-
key = string
106-
entry = [action] | [action, argument]
107-
diff = {key0: entry0, key1: entry1, ...}
183+
When computing the diff of two dicts, items are always aligned by key
184+
value, i.e. under no circumstances are values under different keys
185+
compared or diffed. This makes both diff and merge quite
186+
straightforward. Modified values that are both a list or both a
187+
dict will be diffed recursively, with the diff object recording a
188+
"patch" operation. Any other modified leaf values are considered
189+
replaced.
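
The key-aligned approach can be sketched roughly as follows. The name `diff_dicts` and the optional `diff_lists` argument are hypothetical: list handling is delegated to a caller-supplied function, and when none is given this sketch simply treats modified lists as replaced.

```python
def diff_dicts(a, b, diff_lists=None):
    """Compute a list of ops that transforms dict a into dict b."""
    ops = []
    for key in sorted(set(a) | set(b)):
        if key not in b:
            ops.append({"op": "remove", "key": key})
        elif key not in a:
            ops.append({"op": "add", "key": key, "value": b[key]})
        elif a[key] != b[key]:
            av, bv = a[key], b[key]
            if isinstance(av, dict) and isinstance(bv, dict):
                # Both sides are dicts: recurse and record a "patch" op
                ops.append({"op": "patch", "key": key,
                            "diff": diff_dicts(av, bv, diff_lists)})
            elif isinstance(av, list) and isinstance(bv, list) and diff_lists:
                # Both sides are lists: delegate to the supplied list differ
                ops.append({"op": "patch", "key": key, "diff": diff_lists(av, bv)})
            else:
                # Anything else is considered replaced
                ops.append({"op": "replace", "key": key, "value": bv})
    return ops
```

Since values are only ever compared under the same key, swapping the values of two keys yields two "replace" ops rather than any kind of move.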
108190

109-
A dict diff entry is a list of action and argument (except for deletion):
110191

111-
* ["-"]: delete value at key
112-
* ["+", newvalue]: insert newvalue at key
113-
* ["!", diff]: patch value at key with diff
114-
* [":", newvalue]: replace value at key with newvalue
192+
### Diff approach for lists
115193

194+
We wish to diff sequences and also recurse and diff aligned elements
195+
within the sequences. The core approach is to first align elements,
196+
requiring some heuristic for comparing elements, and then recursively
197+
diff the elements that are determined equal. *These heuristics will
198+
contain the bulk of the notebook-specific diff algorithm
199+
customizations.*
116200

117-
#### Diff format for sequences (list and string)
201+
The most used approach for computing line-based diffs of source code is
202+
to solve the longest common subsequence (LCS) problem or some
203+
variation of it. We extend the vanilla LCS problem by allowing
204+
customizable predicates for approximate equality of two items,
205+
allowing e.g. a source cell predicate to determine that two pieces of
206+
source code are approximately equal and should be considered the same
207+
cell, or an output cell predicate to determine that two bitmap images
208+
are almost equal.
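
A minimal sketch of this idea, assuming a plain O(N*M) dynamic programming LCS rather than any optimized algorithm, could look like the following. The names `lcs_pairs`, `diff_lists`, `similar` and `diff_item` are hypothetical. `similar` is the approximate equality predicate, and `diff_item` computes the nested diff for two aligned items that are similar but not identical.

```python
def lcs_pairs(a, b, similar):
    """Return index pairs (i, j) of a longest common subsequence under `similar`."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if similar(a[i], b[j]):
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if similar(a[i], b[j]):
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def diff_lists(a, b, similar, diff_item):
    """Align a and b with `similar`, then emit ops with indices relative to a."""
    ops = []
    i = j = 0
    # A sentinel pair past both ends flushes the trailing gap
    for ai, bj in lcs_pairs(a, b, similar) + [(len(a), len(b))]:
        if bj > j:  # unmatched items in b: insert before a[i]
            ops.append({"op": "addrange", "key": i, "valuelist": b[j:bj]})
        if ai > i:  # unmatched items in a: remove a[i:ai]
            ops.append({"op": "removerange", "key": i, "length": ai - i})
        if ai < len(a) and a[ai] != b[bj]:  # aligned but modified: recurse
            ops.append({"op": "patch", "key": ai, "diff": diff_item(a[ai], b[bj])})
        i, j = ai + 1, bj + 1
    return ops
```

For example, a source cell predicate might treat two cells as the same cell when their sources start with the same token, while the item differ records what changed inside the aligned cells.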
118209

119-
A diff of two sequences is an ordered list of diff entries:
210+
In addition we have an experimental multilevel algorithm that employs
211+
a basic LCS algorithm with a sequence of increasingly relaxed equality
212+
predicates, allowing e.g. prioritizing equality of source+output over
213+
just equality of source. Note that determining good heuristics and
214+
refining the above mentioned algorithms will be a significant part of
215+
the work and some experimentation must be allowed. In particular the
216+
behaviour of the multilevel approach must be investigated further and
217+
other approaches could be considered.
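
The multilevel algorithm is not specified in detail here, so the following is only one possible reading of it, not a description of the actual implementation. In this sketch (all names hypothetical), items are first aligned with the strictest predicate, and the unmatched gaps between the resulting anchors are then aligned recursively with the remaining, more relaxed predicates.

```python
def lcs_pairs(a, b, similar):
    """Return index pairs (i, j) of a longest common subsequence under `similar`."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if similar(a[i], b[j]):
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if similar(a[i], b[j]):
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def multilevel_pairs(a, b, predicates):
    """Align a and b using predicates ordered from strictest to most relaxed."""
    if not predicates or not a or not b:
        return []
    strict, relaxed = predicates[0], predicates[1:]
    pairs = []
    i = j = 0
    # A sentinel pair past both ends makes the trailing gap get processed too
    for ai, bj in lcs_pairs(a, b, strict) + [(len(a), len(b))]:
        # Items in the gap before this anchor are retried with looser predicates
        for gi, gj in multilevel_pairs(a[i:ai], b[j:bj], relaxed):
            pairs.append((i + gi, j + gj))
        if ai < len(a):
            pairs.append((ai, bj))
        i, j = ai + 1, bj + 1
    return pairs
```

With cells modelled as `(source, output)` tuples and the predicates "source and output equal" followed by "source equal", a cell is matched to an exact copy of itself in preference to an earlier cell that only shares its source.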
120218

121-
index = integer
122-
entry = [action, index] | [action, index, argument]
123-
diff = [entry0, entry1, ...]
124219

125-
A sequence diff entry is a list of action, index and argument (except for deletion):
220+
### Note about the potential addition of a "move" transformation
126221

127-
* ["-", index]: delete entry at index
128-
* ["+", index, newvalue]: insert single newvalue before index
129-
* ["--", index, n]: delete n entries starting at index
130-
* ["++", index, newvalues]: insert sequence newvalues before index
131-
* ["!", index, diff]: patch value at index with diff
222+
In the current implementation there is no "move" operation.
223+
Furthermore we make some assumptions on the structure of the json
224+
objects and what kind of transformations are meaningful in a diff.
132225

133-
Possible simplifications:
226+
Items swapping position in a list will be considered added and removed
227+
instead of moved, but in a future iteration adding a "move" operation
228+
is an option to be considered. The main use case for this would be to
229+
resolve merges without conflicts when cells in a notebook are
230+
reordered on one side and modified on the other side.
134231

135-
* Remove single-item "-", "+" and rename "--" and "++" to single-letter.
136-
* OR remove "--" and "++" and stick with just single-item versions.
232+
Even if we add the move operation, values will never be moved between
233+
keys in a dict, e.g.:
137234

235+
diff({"a":"x", "b":"y"}, {"a":"y", "b":"x"})
138236

139-
Note: The currently implemented sequence diff algorithm is
140-
based on a brute force O(N^2) longest common subsequence (LCS)
141-
algorithm, this will be rewritten in terms of a faster algorithm
142-
such as Myers O(ND) LCS based diff algorithm, optionally
143-
using Pythons difflib for use cases it can handle.
144-
In particular difflib does not handle custom compare predicate,
145-
which we need to e.g. identify almost equal cells within sequences
146-
of cells in a notebook.
237+
will be:
147238

239+
[{"op": "replace", "key": "a", "value": "y"},
240+
{"op": "replace", "key": "b", "value": "x"}]
148241

149-
### Merge format
242+
In a notebook context this means for example that data will never be
243+
considered to move across input cells and output cells.
150244

151-
The merge process should return two things: The merge result and the conflicts.
152245

153-
A format for representing merge conflicts is work in progress.
246+
## Merge format
154247

155-
Each transformation in the base->local and base->remote diffs must either
156-
end up in the merge result or be recorded in the conflicts representation.
248+
A merge takes as input a base object (notebook) and local and remote
249+
objects (notebooks) that are modified versions of base. The merge
250+
computes the diffs base->local and base->remote and tries to apply all
251+
changes from each diff to base. The merge returns a merged object
252+
(notebook) that contains all successfully applied changes from both sides,
253+
and two diff objects merged->local and merged->remote which contain
254+
the remaining conflicting changes that need manual resolution.
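
The shape of this result can be illustrated with a deliberately simplified sketch that merges flat dicts only, without recursion into nested values. The name `merge_dicts` is hypothetical. In this sketch a conflicting key keeps its base value in the merged object, and the two returned diffs record what each side wanted to do with it.

```python
_MISSING = object()  # marker for "key not present"


def _op(old, new, key):
    """Build the mapping op that turns value `old` at key into `new`."""
    if new is _MISSING:
        return {"op": "remove", "key": key}
    if old is _MISSING:
        return {"op": "add", "key": key, "value": new}
    return {"op": "replace", "key": key, "value": new}


def merge_dicts(base, local, remote):
    """Return (merged, merged_to_local, merged_to_remote) for flat dicts."""
    merged, to_local, to_remote = {}, [], []
    for key in sorted(set(base) | set(local) | set(remote)):
        b, l, r = (d.get(key, _MISSING) for d in (base, local, remote))
        if l == r:      # unchanged, or both sides made the same change
            result = l
        elif l == b:    # only remote changed this key
            result = r
        elif r == b:    # only local changed this key
            result = l
        else:           # both changed it differently: conflict
            result = b
            to_local.append(_op(b, l, key))
            to_remote.append(_op(b, r, key))
        if result is not _MISSING:
            merged[key] = result
    return merged, to_local, to_remote
```

When both returned diffs are empty the merge was conflict free. Otherwise each pair of entries describes one decision left for manual resolution.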
157255

158256

159257
## Pros and Cons
@@ -163,7 +261,7 @@ Pros associated with this implementation include:
163261
* Possibility to use notebooks for self-documenting regression tests
164262

165263
Cons associated with this implementation include:
166-
* Vanilla git installs will not receive the improved behaviour
264+
* Vanilla git installs will not receive the improved behaviour, i.e. this will require installation of the package. To reduce the weight of this issue the package should avoid unneeded heavy dependencies.
167265

168266

169267
## Interested Contributors
