From 9ae4eb28daee7807a18a6330920f3778c7050e60 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?B=C3=A9n=C3=A9dikt=20Tran?= <10796600+picnixz@users.noreply.github.com> Date: Wed, 21 Aug 2024 11:31:05 +0200 Subject: [PATCH 1/7] improve internal docs on `co_linetable` --- InternalDocs/locations.md | 114 ++++++++++++++++++++++++++++++++------ 1 file changed, 96 insertions(+), 18 deletions(-) diff --git a/InternalDocs/locations.md b/InternalDocs/locations.md index 91a7824e2a8e4d..a4ff8f12b7c174 100644 --- a/InternalDocs/locations.md +++ b/InternalDocs/locations.md @@ -5,10 +5,12 @@ representation of the source code positions of instructions, which are returned by the `co_positions()` iterator. `co_linetable` consists of a sequence of location entries. -Each entry starts with a byte with the most significant bit set, followed by zero or more bytes with most significant bit unset. +Each entry starts with a byte with the most significant bit set, followed by +zero or more bytes with most significant bit unset. Each entry contains the following information: -* The number of code units covered by this entry (length) + +* The number of code units covered by this entry (length). * The start line * The end line * The start column @@ -16,9 +18,9 @@ Each entry contains the following information: The first byte has the following format: -Bit 7 | Bits 3-6 | Bits 0-2 - ---- | ---- | ---- - 1 | Code | Length (in code units) - 1 +| Bit 7 | Bits 3-6 | Bits 0-2 | +|-------|----------|----------------------------| +| 1 | Code | Length (in code units) - 1 | The codes are enumerated in the `_PyCodeLocationInfoKind` enum. @@ -33,37 +35,113 @@ Each chunk but the last has bit 6 set. For example: * 63 is encoded as `0x3f` -* 200 is encoded as `0x48`, `0x03` +* 200 is encoded as `0x48`, `0x03` since ``200 = (0x03 << 6) | 0x48``. 
+ +The following helper can be used to convert an integer into a `varint`: + +```py +def write_varint(s): + ret = [] + while s >= 64: + ret.append((s & 0x3F) | 0x40) + s >>= 6 + ret.append(s & 0x3F) + return bytes(ret) +``` + +To convert a `varint` into an unsigned integer: + +```py +def read_varint(chunks): + ret = 0 + for chunk in reversed(chunks): + ret = (ret << 6) | (chunk & 0x3F) + return ret +``` ### Signed integers (svarint) Signed integers are encoded by converting them to unsigned integers, using the following function: -```Python -def convert(s): + +```py +def write_svarint(s): if s < 0: - return ((-s)<<1) | 1 + uval = ((-s) << 1) | 1 else: - return (s<<1) + uval = s << 1 + return write_varint(uval) +``` + +To convert a `svarint` into a signed integer: + +```py +def read_svarint(s): + uval = read_varint(s) + return -(uval >> 1) if uval & 1 else (uval >> 1) ``` ## Location entries The meaning of the codes and the following bytes are as follows: -Code | Meaning | Start line | End line | Start column | End column - ---- | ---- | ---- | ---- | ---- | ---- - 0-9 | Short form | Δ 0 | Δ 0 | See below | See below - 10-12 | One line form | Δ (code - 10) | Δ 0 | unsigned byte | unsigned byte - 13 | No column info | Δ svarint | Δ 0 | None | None - 14 | Long form | Δ svarint | Δ varint | varint | varint - 15 | No location | None | None | None | None +| Code | Meaning | Start line | End line | Start column | End column | +|-------|----------------|---------------|----------|---------------|---------------| +| 0-9 | Short form | Δ 0 | Δ 0 | See below | See below | +| 10-12 | One line form | Δ (code - 10) | Δ 0 | unsigned byte | unsigned byte | +| 13 | No column info | Δ svarint | Δ 0 | None | None | +| 14 | Long form | Δ svarint | Δ varint | varint | varint | +| 15 | No location | None | None | None | None | The Δ means the value is encoded as a delta from another value: + * Start line: Delta from the previous start line, or `co_firstlineno` for the first entry. 
* End line: Delta from the start line +Note that the indexation of the start and end column values are assumed to +start from 1 and are absolute but that `dis.Positions` is using 0-based values +for the column start and end offsets, when available. + +When constructing artificial `co_linetable` values, only non-None values should +be specified. For instance: + +```py +def foo(): + pass + +co_firstlineno = 42 +foo.__code__ = foo.__code__.replace( + co_firstlineno=co_firstlineno, + co_linetable=bytes([ + # RESUME + (1 << 7) | (13 << 3) | (1 - 1), + # sentinel # no column info # number of units - 1 + *write_svarint(2), # start line delta + # RETURN_CONST (None) + (1 << 7) | (14 << 3) | (1 - 1), + # sentinel # has column info # number of units - 1 + *write_svarint(5), # relative start line delta + *write_varint(12), # end line delta + *write_varint(3), # start column (starts from 1) + *write_varint(8), # end column (starts from 1) + ]) +) + +instructions = list(dis.get_instructions(foo)) +assert len(instructions) == 2 + +assert instructions[0].opname == 'RESUME' +assert instructions[1].opname == 'RETURN_CONST' + +ip0, ip1 = instructions[0].positions, instructions[1].positions +assert ip0 == (co_firstlineno + 2, co_firstlineno + 2, None, None) +assert ip1 == (ip0.lineno + 5, ip1.lineno + 12, (3 - 1), (8 - 1)) +``` + ### The short forms -Codes 0-9 are the short forms. The short form consists of two bytes, the second byte holding additional column information. The code is the start column divided by 8 (and rounded down). +Codes 0-9 are the short forms. The short form consists of two bytes, +the second byte holding additional column information. The code is the +start column divided by 8 (and rounded down). 
+ * Start column: `(code*8) + ((second_byte>>4)&7)` * End column: `start_column + (second_byte&15)` From 3db4e670c9eb11a73988e90109b207ed57ca1dd0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?B=C3=A9n=C3=A9dikt=20Tran?= <10796600+picnixz@users.noreply.github.com> Date: Wed, 21 Aug 2024 13:01:07 +0200 Subject: [PATCH 2/7] improve docs --- InternalDocs/locations.md | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/InternalDocs/locations.md b/InternalDocs/locations.md index a4ff8f12b7c174..b9decf11478092 100644 --- a/InternalDocs/locations.md +++ b/InternalDocs/locations.md @@ -97,9 +97,16 @@ The Δ means the value is encoded as a delta from another value: * Start line: Delta from the previous start line, or `co_firstlineno` for the first entry. * End line: Delta from the start line -Note that the indexation of the start and end column values are assumed to -start from 1 and are absolute but that `dis.Positions` is using 0-based values -for the column start and end offsets, when available. +### The short forms + +Codes 0-9 are the short forms. The short form consists of two bytes, +the second byte holding additional column information. The code is the +start column divided by 8 (and rounded down). + +* Start column: `(code*8) + ((second_byte>>4)&7)` +* End column: `start_column + (second_byte&15)` + +## Artificial constructions When constructing artificial `co_linetable` values, only non-None values should be specified. For instance: @@ -137,11 +144,6 @@ assert ip0 == (co_firstlineno + 2, co_firstlineno + 2, None, None) assert ip1 == (ip0.lineno + 5, ip1.lineno + 12, (3 - 1), (8 - 1)) ``` -### The short forms - -Codes 0-9 are the short forms. The short form consists of two bytes, -the second byte holding additional column information. The code is the -start column divided by 8 (and rounded down). 
- -* Start column: `(code*8) + ((second_byte>>4)&7)` -* End column: `start_column + (second_byte&15)` +Note that the indexation of the start and end column values are assumed to +start from 1 and are absolute but that `dis.Positions` is using 0-based values +for the column start and end offsets, when available. From 40281635f702e33628bf71e458664be6ac0fe24c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?B=C3=A9n=C3=A9dikt=20Tran?= <10796600+picnixz@users.noreply.github.com> Date: Wed, 21 Aug 2024 17:16:13 +0200 Subject: [PATCH 3/7] address review --- InternalDocs/locations.md | 36 +++++++++++++++++------------------- 1 file changed, 17 insertions(+), 19 deletions(-) diff --git a/InternalDocs/locations.md b/InternalDocs/locations.md index b9decf11478092..3663b2d13dd44c 100644 --- a/InternalDocs/locations.md +++ b/InternalDocs/locations.md @@ -10,7 +10,7 @@ zero or more bytes with most significant bit unset. Each entry contains the following information: -* The number of code units covered by this entry (length). 
+* The number of code units covered by this entry (length) * The start line * The end line * The start column @@ -40,7 +40,7 @@ For example: The following helper can be used to convert an integer into a `varint`: ```py -def write_varint(s): +def encode_varint(s): ret = [] while s >= 64: ret.append((s & 0x3F) | 0x40) s >>= 6 @@ -52,7 +52,7 @@ def write_varint(s): To convert a `varint` into an unsigned integer: ```py -def read_varint(chunks): +def decode_varint(chunks): ret = 0 for chunk in reversed(chunks): ret = (ret << 6) | (chunk & 0x3F) @@ -64,19 +64,17 @@ def read_varint(chunks): Signed integers are encoded by converting them to unsigned integers, using the following function: ```py -def write_svarint(s): +def svarint_to_varint(s): if s < 0: - uval = ((-s) << 1) | 1 + return ((-s) << 1) | 1 else: - uval = s << 1 - return write_varint(uval) + return s << 1 ``` -To convert a `svarint` into a signed integer: +To convert a varint into a signed integer: ```py -def read_svarint(s): - uval = read_varint(s) +def varint_to_svarint(uval): return -(uval >> 1) if uval & 1 else (uval >> 1) ``` @@ -120,16 +118,16 @@ foo.__code__ = foo.__code__.replace( co_firstlineno=co_firstlineno, co_linetable=bytes([ # RESUME - (1 << 7) | (13 << 3) | (1 - 1), - # sentinel # no column info # number of units - 1 - *write_svarint(2), # start line delta + (1 << 7) | (13 << 3) | (1 - 1), + # sentinel # no column info # number of units - 1 + *encode_varint(svarint_to_varint(2)), # relative start line delta # RETURN_CONST (None) - (1 << 7) | (14 << 3) | (1 - 1), - # sentinel # has column info # number of units - 1 - *write_svarint(5), # relative start line delta - *write_varint(12), # end line delta - *write_varint(3), # start column (starts from 1) - *write_varint(8), # end column (starts from 1) + (1 << 7) | (14 << 3) | (1 - 1), + # sentinel # has column info # number of units - 1 + *encode_varint(svarint_to_varint(5)), # relative start line delta + *encode_varint(12), # end line delta + 
*encode_varint(3), # start column (starts from 1) + *encode_varint(8), # end column (starts from 1) ]) ) From 6632fa9af91d719d966d96804c92168c4e64840f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?B=C3=A9n=C3=A9dikt=20Tran?= <10796600+picnixz@users.noreply.github.com> Date: Wed, 21 Aug 2024 17:19:38 +0200 Subject: [PATCH 4/7] add missing imports --- InternalDocs/locations.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/InternalDocs/locations.md b/InternalDocs/locations.md index 3663b2d13dd44c..c58edcf6e2df10 100644 --- a/InternalDocs/locations.md +++ b/InternalDocs/locations.md @@ -110,6 +110,8 @@ When constructing artificial `co_linetable` values, only non-None values should be specified. For instance: ```py +import dis + def foo(): pass From 588e81cdd90656d313a6a24093346c1460041a88 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?B=C3=A9n=C3=A9dikt=20Tran?= <10796600+picnixz@users.noreply.github.com> Date: Thu, 22 Aug 2024 12:00:56 +0200 Subject: [PATCH 5/7] simplify --- InternalDocs/locations.md | 44 --------------------------------------- 1 file changed, 44 deletions(-) diff --git a/InternalDocs/locations.md b/InternalDocs/locations.md index c58edcf6e2df10..c57b6946fbe9fb 100644 --- a/InternalDocs/locations.md +++ b/InternalDocs/locations.md @@ -103,47 +103,3 @@ start column divided by 8 (and rounded down). * Start column: `(code*8) + ((second_byte>>4)&7)` * End column: `start_column + (second_byte&15)` - -## Artificial constructions - -When constructing artificial `co_linetable` values, only non-None values should -be specified. 
For instance: - -```py -import dis - -def foo(): - pass - -co_firstlineno = 42 -foo.__code__ = foo.__code__.replace( - co_firstlineno=co_firstlineno, - co_linetable=bytes([ - # RESUME - (1 << 7) | (13 << 3) | (1 - 1), - # sentinel # no column info # number of units - 1 - *encode_varint(svarint_to_varint(2)), # relative start line delta - # RETURN_CONST (None) - (1 << 7) | (14 << 3) | (1 - 1), - # sentinel # has column info # number of units - 1 - *encode_varint(svarint_to_varint(5)), # relative start line delta - *encode_varint(12), # end line delta - *encode_varint(3), # start column (starts from 1) - *encode_varint(8), # end column (starts from 1) - ]) -) - -instructions = list(dis.get_instructions(foo)) -assert len(instructions) == 2 - -assert instructions[0].opname == 'RESUME' -assert instructions[1].opname == 'RETURN_CONST' - -ip0, ip1 = instructions[0].positions, instructions[1].positions -assert ip0 == (co_firstlineno + 2, co_firstlineno + 2, None, None) -assert ip1 == (ip0.lineno + 5, ip1.lineno + 12, (3 - 1), (8 - 1)) -``` - -Note that the indexation of the start and end column values are assumed to -start from 1 and are absolute but that `dis.Positions` is using 0-based values -for the column start and end offsets, when available. From 0ce4c26776e56d9e63739eb24c89343e0902e664 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?B=C3=A9n=C3=A9dikt=20Tran?= <10796600+picnixz@users.noreply.github.com> Date: Fri, 29 Nov 2024 14:26:50 +0100 Subject: [PATCH 6/7] update docs --- InternalDocs/code_objects.md | 87 +++++++++++++++++++++++++----------- 1 file changed, 61 insertions(+), 26 deletions(-) diff --git a/InternalDocs/code_objects.md b/InternalDocs/code_objects.md index bee4a9d0a08915..7fc620e04d38be 100644 --- a/InternalDocs/code_objects.md +++ b/InternalDocs/code_objects.md @@ -1,4 +1,3 @@ - # Code objects A `CodeObject` is a builtin Python type that represents a compiled executable, @@ -43,7 +42,7 @@ so a compact format is very important. 
Note that traceback objects don't store all this information -- they store the start line number, for backward compatibility, and the "last instruction" value. The rest can be computed from the last instruction (`tb_lasti`) with the help of the -locations table. For Python code, there is a convenience method +locations table. For Python code, there is a convenience method [`codeobject.co_positions`](https://docs.python.org/dev/reference/datamodel.html#codeobject.co_positions) which returns an iterator of `({line}, {endline}, {column}, {endcolumn})` tuples, one per instruction. @@ -75,9 +74,11 @@ returned by the `co_positions()` iterator. > See [`Objects/lnotab_notes.txt`](../Objects/lnotab_notes.txt) for more details. `co_linetable` consists of a sequence of location entries. -Each entry starts with a byte with the most significant bit set, followed by zero or more bytes with the most significant bit unset. +Each entry starts with a byte with the most significant bit set, followed by +zero or more bytes with the most significant bit unset. Each entry contains the following information: + * The number of code units covered by this entry (length) * The start line * The end line @@ -86,54 +87,88 @@ Each entry contains the following information: The first byte has the following format: -Bit 7 | Bits 3-6 | Bits 0-2 - ---- | ---- | ---- - 1 | Code | Length (in code units) - 1 +| Bit 7 | Bits 3-6 | Bits 0-2 | +|-------|----------|----------------------------| +| 1 | Code | Length (in code units) - 1 | The codes are enumerated in the `_PyCodeLocationInfoKind` enum. -## Variable-length integer encodings +#### Variable-length integer encodings -Integers are often encoded using a variable-length integer encoding +Integers are often encoded using a variable length integer encoding -### Unsigned integers (`varint`) +##### Unsigned integers (`varint`) Unsigned integers are encoded in 6-bit chunks, least significant first. Each chunk but the last has bit 6 set. 
For example: * 63 is encoded as `0x3f` -* 200 is encoded as `0x48`, `0x03` +* 200 is encoded as `0x48`, `0x03` since ``200 = (0x03 << 6) | (0x48 & 0x3f)``. + +The following helper can be used to convert an integer into a `varint`: + +```py +def encode_varint(s): + ret = [] + while s >= 64: + ret.append((s & 0x3F) | 0x40) + s >>= 6 + ret.append(s & 0x3F) + return bytes(ret) +``` + +To convert a `varint` into an unsigned integer: + +```py +def decode_varint(chunks): + ret = 0 + for chunk in reversed(chunks): + ret = (ret << 6) | (chunk & 0x3F) + return ret +``` -### Signed integers (`svarint`) +##### Signed integers (`svarint`) Signed integers are encoded by converting them to unsigned integers, using the following function: -```Python -def convert(s): + +```py +def svarint_to_varint(s): if s < 0: - return ((-s)<<1) | 1 + return ((-s) << 1) | 1 else: - return (s<<1) + return s << 1 +``` + +To convert a `varint` into a signed integer: + +```py +def varint_to_svarint(uval): + return -(uval >> 1) if uval & 1 else (uval >> 1) ``` -*Location entries* +#### Location entries The meaning of the codes and the following bytes are as follows: -Code | Meaning | Start line | End line | Start column | End column - ---- | ---- | ---- | ---- | ---- | ---- - 0-9 | Short form | Δ 0 | Δ 0 | See below | See below - 10-12 | One line form | Δ (code - 10) | Δ 0 | unsigned byte | unsigned byte - 13 | No column info | Δ svarint | Δ 0 | None | None - 14 | Long form | Δ svarint | Δ varint | varint | varint - 15 | No location | None | None | None | None +| Code | Meaning | Start line | End line | Start column | End column | +|-------|----------------|---------------|----------|---------------|---------------| +| 0-9 | Short form | Δ 0 | Δ 0 | See below | See below | +| 10-12 | One line form | Δ (code - 10) | Δ 0 | unsigned byte | unsigned byte | +| 13 | No column info | Δ svarint | Δ 0 | None | None | +| 14 | Long form | Δ svarint | Δ varint | varint | varint | +| 15 | No location | None | None | None | 
None | The Δ means the value is encoded as a delta from another value: + * Start line: Delta from the previous start line, or `co_firstlineno` for the first entry. -* End line: Delta from the start line +* End line: Delta from the start line. + +##### The short forms -*The short forms* +Codes 0-9 are the short forms. The short form consists of two bytes, +the second byte holding additional column information. The code is the +start column divided by 8 (and rounded down). -Codes 0-9 are the short forms. The short form consists of two bytes, the second byte holding additional column information. The code is the start column divided by 8 (and rounded down). * Start column: `(code*8) + ((second_byte>>4)&7)` * End column: `start_column + (second_byte&15)` From 910fd7f2ad0d3bafeb3b26f187347b9871cc4836 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?B=C3=A9n=C3=A9dikt=20Tran?= <10796600+picnixz@users.noreply.github.com> Date: Fri, 29 Nov 2024 14:29:02 +0100 Subject: [PATCH 7/7] flatten sections --- InternalDocs/code_objects.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/InternalDocs/code_objects.md b/InternalDocs/code_objects.md index 7fc620e04d38be..d4e28c6b238b48 100644 --- a/InternalDocs/code_objects.md +++ b/InternalDocs/code_objects.md @@ -93,11 +93,11 @@ The first byte has the following format: The codes are enumerated in the `_PyCodeLocationInfoKind` enum. -#### Variable-length integer encodings +### Variable-length integer encodings Integers are often encoded using a variable length integer encoding -##### Unsigned integers (`varint`) +#### Unsigned integers (`varint`) Unsigned integers are encoded in 6-bit chunks, least significant first. Each chunk but the last has bit 6 set. 
@@ -128,7 +128,7 @@ def decode_varint(chunks): return ret ``` -##### Signed integers (`svarint`) +#### Signed integers (`svarint`) Signed integers are encoded by converting them to unsigned integers, using the following function: @@ -147,7 +147,7 @@ def varint_to_svarint(uval): return -(uval >> 1) if uval & 1 else (uval >> 1) ``` -#### Location entries +### Location entries The meaning of the codes and the following bytes are as follows: @@ -164,7 +164,7 @@ The Δ means the value is encoded as a delta from another value: * Start line: Delta from the previous start line, or `co_firstlineno` for the first entry. * End line: Delta from the start line. -##### The short forms +### The short forms Codes 0-9 are the short forms. The short form consists of two bytes, the second byte holding additional column information. The code is the