Commit 42e7af8

Bugfix accounting for Variation Selector 16 (#97)
Closes #96

- Add a new table, `VS16_NARROW_TO_WIDE`. It has only one version, "9.0.0". It defines a set of characters that are otherwise Narrow, like '0', that become Wide when combined with `U+FE0F`, "VARIATION SELECTOR 16".
- Change the `wcwidth.wcswidth()` function: it now tracks the "last measured character" and, on U+FE0F, checks that character in table VS16_NARROW_TO_WIDE and, if it matches, adds 1 to the measured width.
- Add `verify-table-integrity.py`, an unrelated file from previous work in #91 that should have been included there.
- New tests: the latest 'emoji-zwj-sequences.txt' and 'emoji-variation-sequences.txt' files are fetched by update-tables.py and placed in the 'tests/' folder, and are now used by automatic tests in test_emoji_zwj.py. This helps ensure 100% compatibility with all latest known emoji sequences.
- Fix issue with codecov.io token.

Note: A single "9.0.0" version is used because of ambiguity in legacy releases of the emoji variation sequences files. They are so ambiguous that very few terminals get it right! See https://ucs-detect.readthedocs.io/results.html for testing results. I believe that U+FE0F is something of a "fixup" for early emojis. I don't expect any new U+FE0F sequences to be published.
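The `wcswidth()` change described above can be sketched roughly as follows. This is a simplified illustration, not the code from this commit: the table is a tiny hand-picked subset standing in for `VS16_NARROW_TO_WIDE`, the names `VS16_SUBSET`, `in_table` and `sketch_wcswidth` are hypothetical, and every character other than U+FE0F is assumed to be one cell wide.

```python
# Hypothetical stand-in for VS16_NARROW_TO_WIDE["9.0.0"]: inclusive
# (start, end) ranges of characters that VS16 turns from narrow to wide.
# Only a small illustrative subset is listed here.
VS16_SUBSET = (
    (0x0023, 0x0023),  # NUMBER SIGN
    (0x002A, 0x002A),  # ASTERISK
    (0x0030, 0x0039),  # DIGIT ZERO..DIGIT NINE
)

VS16 = 0xFE0F  # VARIATION SELECTOR 16


def in_table(ucs, table):
    """Return True if code point ``ucs`` falls in any inclusive range."""
    return any(start <= ucs <= end for start, end in table)


def sketch_wcswidth(text):
    """Measure ``text``, widening a narrow character followed by VS16.

    Assumption: every character except VS16 measures exactly 1 cell.
    """
    width = 0
    last_measured = None
    for char in text:
        if ord(char) == VS16:
            # VS16 itself is zero-width; it may widen the previous character.
            if last_measured is not None and in_table(ord(last_measured), VS16_SUBSET):
                width += 1
            # Forget the character so a repeated VS16 cannot widen it twice.
            last_measured = None
            continue
        width += 1
        last_measured = char
    return width
```

Under these assumptions, `sketch_wcswidth("0")` measures one cell while `sketch_wcswidth("0\ufe0f")` measures two, and a letter such as `"a"` followed by U+FE0F stays at one cell because it is not in the table.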
1 parent f4368e3 commit 42e7af8

15 files changed: +3077 -208 lines changed

.github/workflows/ci.yml

Lines changed: 4 additions & 0 deletions
@@ -136,6 +136,8 @@ jobs:
 
       - name: Upload coverage to Codecov
         uses: codecov/codecov-action@v3
+        env:
+          CODECOV_TOKEN: ${{secrets.CODECOV_TOKEN}}
 
       - name: Fail if coverage is <100%.
         run: |
@@ -148,3 +150,5 @@ jobs:
           name: html-report
           path: htmlcov
         if: ${{ failure() }}
+        env:
+          CODECOV_TOKEN: ${{secrets.CODECOV_TOKEN}}

bin/update-tables.py

Lines changed: 279 additions & 102 deletions
Large diffs are not rendered by default.

bin/verify-table-integrity.py

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
+#!/usr/bin/env python3
+"""
+This is a small script to make an inquiry into the version history of unicode data tables, and to
+validate conflicts in the tables as they are published:
+
+- check for individual code point definitions that change in subsequent releases,
+  these should be considered before attempting to reduce the size of our versioned
+  tables without a careful incremental change description. Each "violation" is
+  logged as INFO.
+- check that a codepoint in the 'zero' table is not present in the 'wide' table
+  and vice versa. This is logged as ERROR and causes the program to exit 1.
+
+Some examples of the first kind,
+
+1.
+
+    value 0x1f93b in table WIDE_EASTASIAN version 12.1.0 is not defined in 13.0.0 from range ('0x1f90d', '0x1f971')
+    value 0x1f946 in table WIDE_EASTASIAN version 12.1.0 is not defined in 13.0.0 from range ('0x1f90d', '0x1f971')
+
+   two characters were changed from 'W' to 'N':
+
+    -EastAsianWidth-12.0.0.txt:1F90D..1F971;W # So [101] WHITE HEART..YAWNING FACE
+    +EastAsianWidth-12.1.0.txt:1F90C..1F93A;W # So [47] PINCHED FINGERS..FENCER
+    +EastAsianWidth-12.1.0.txt:1F93B;N # So MODERN PENTATHLON
+    +EastAsianWidth-12.1.0.txt:1F93C..1F945;W # So [10] WRESTLERS..GOAL NET
+    +EastAsianWidth-12.1.0.txt:1F946;N # So RIFLE
+    +EastAsianWidth-12.1.0.txt:1F947..1F978;W # So [50] FIRST PLACE MEDAL..DISGUISED FACE
+
+   Similarly, for the output,
+
+    value 0x11a3 in table WIDE_EASTASIAN version 6.1.0 is not defined in 6.2.0 from range ('0x11a3', '0x11a7')
+    ...
+    value 0x11fe in table WIDE_EASTASIAN version 6.1.0 is not defined in 6.2.0 from range ('0x11fa', '0x11ff')
+
+   Category code was changed from 'W' to 'N':
+
+    -EastAsianWidth-6.1.0.txt:11A3;W # HANGUL JUNGSEONG A-EU
+    +EastAsianWidth-6.2.0.txt:11A3;N # HANGUL JUNGSEONG A-EU
+
+
+2.
+
+    value 0x1cf2 in table ZERO_WIDTH version 11.0.0 is not defined in 12.0.0 from range ('0x1cf2', '0x1cf4')
+    value 0x1cf3 in table ZERO_WIDTH version 11.0.0 is not defined in 12.0.0 from range ('0x1cf2', '0x1cf4')
+
+   Category code was changed from 'Mc' to 'Lo':
+
+    -DerivedGeneralCategory-11.0.0.txt:1CF2..1CF3 ; Mc # [2] VEDIC SIGN ARDHAVISARGA..VEDIC SIGN ROTATED ARDHAVISARGA
+    +DerivedGeneralCategory-12.0.0.txt:1CEE..1CF3 ; Lo # [6] VEDIC SIGN HEXIFORM LONG ANUSVARA..VEDIC SIGN ROTATED ARDHAVISARGA
+
+   Similarly, for the output,
+
+    value 0x19b0 in table ZERO_WIDTH version 7.0.0 is not defined in 8.0.0 from range ('0x19b0', '0x19c0')
+    ...
+    value 0x19c8 in table ZERO_WIDTH version 7.0.0 is not defined in 8.0.0 from range ('0x19c8', '0x19c9')
+
+   Category code was changed from 'Mc' to 'Lo':
+
+    -DerivedGeneralCategory-7.0.0.txt:19B0..19C0 ; Mc # [17] NEW TAI LUE VOWEL SIGN VOWEL SHORTENER..NEW TAI LUE VOWEL SIGN IY
+    +DerivedGeneralCategory-8.0.0.txt:19B0..19C9 ; Lo # [26] NEW TAI LUE VOWEL SIGN VOWEL SHORTENER..NEW TAI LUE TONE MARK-2
+"""
+# std imports
+import logging
+
+
+def main(log: logging.Logger):
+    # local
+    from wcwidth import ZERO_WIDTH, WIDE_EASTASIAN, _bisearch, list_versions
+    reversed_uni_versions = list(reversed(list_versions()))
+    tables = {'ZERO_WIDTH': ZERO_WIDTH,
+              'WIDE_EASTASIAN': WIDE_EASTASIAN}
+    errors = 0
+    for idx, version in enumerate(reversed_uni_versions):
+        if idx == 0:
+            continue
+        next_version = reversed_uni_versions[idx - 1]
+        for table_name, table in tables.items():
+            next_table = table[next_version]
+            curr_table = table[version]
+            other_table_name = 'WIDE_EASTASIAN' if table_name == 'ZERO_WIDTH' else 'ZERO_WIDTH'
+            other_table = tables[other_table_name][version]
+            for start_range, stop_range in curr_table:
+                for unichar_n in range(start_range, stop_range + 1):  # table ranges are inclusive
+                    if not _bisearch(unichar_n, next_table):
+                        log.info(f'value {hex(unichar_n)} in table_name={table_name}'
+                                 f' version={version} is not defined in next_version={next_version}'
+                                 f' from inclusive range {hex(start_range)}-{hex(stop_range)}')
+                    if _bisearch(unichar_n, other_table):
+                        log.error(f'value {hex(unichar_n)} in table_name={table_name}'
+                                  f' version={version} is duplicated in other_table_name={other_table_name}'
+                                  f' from inclusive range {hex(start_range)}-{hex(stop_range)}')
+                        errors += 1
+    if errors:
+        log.error(f'{errors} errors, exit 1')
+        exit(1)
+
+
+if __name__ == '__main__':
+    _logfmt = '%(levelname)s %(filename)s:%(lineno)d %(message)s'
+    logging.basicConfig(level="INFO", format=_logfmt, force=True)
+    log = logging.getLogger()
+    main(log)
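The script leans on `_bisearch`, a private helper imported from wcwidth, to test whether a code point lies in a table of inclusive ranges. A minimal sketch of that kind of lookup, and of the zero/wide overlap check built on it, might look like the following. The names `bisearch_sketch` and `find_overlaps` are hypothetical, and the tables are assumed to be sorted, non-overlapping tuples of inclusive `(start, end)` pairs.

```python
def bisearch_sketch(ucs, table):
    """Binary search: return 1 if ``ucs`` is inside any inclusive range, else 0.

    Assumes ``table`` is sorted and its (start, end) ranges do not overlap.
    """
    lbound, ubound = 0, len(table) - 1
    if not table or ucs < table[0][0] or ucs > table[ubound][1]:
        return 0
    while ubound >= lbound:
        mid = (lbound + ubound) // 2
        if ucs > table[mid][1]:
            lbound = mid + 1      # look in the upper half
        elif ucs < table[mid][0]:
            ubound = mid - 1      # look in the lower half
        else:
            return 1              # start <= ucs <= end
    return 0


def find_overlaps(table, other_table):
    """List every code point of ``table`` that also appears in ``other_table``."""
    return [ucs
            for start, end in table
            for ucs in range(start, end + 1)  # + 1: the end value is included
            if bisearch_sketch(ucs, other_table)]
```

For example, with `zero = ((0x300, 0x302),)` and `wide = ((0x302, 0x304),)`, `find_overlaps(zero, wide)` reports `0x302`. The shared value sits at the end of a range, so iterating with an exclusive stop would skip it.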

docs/intro.rst

Lines changed: 6 additions & 0 deletions
@@ -216,6 +216,11 @@ Other Languages
 =======
 History
 =======
+0.2.10 *2023-11-08*
+  * **Bugfix** accounting of some kinds of emoji sequences using U+FE0F
+    Variation Selector 16 (`PR #97`_).
+  * **Updated** `Specification <Specification_from_pypi_>`_.
+
 0.2.9 *2023-10-30*
   * **Bugfix** zero-width characters used in Emoji ZWJ sequences, Balinese,
     Jamo, Devanagari, Tamil, Kannada and others (`PR #91`_).
@@ -319,6 +324,7 @@ https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c::
 .. _`PR #35`: https://github.com/jquast/wcwidth/pull/35
 .. _`PR #82`: https://github.com/jquast/wcwidth/pull/82
 .. _`PR #91`: https://github.com/jquast/wcwidth/pull/91
+.. _`PR #97`: https://github.com/jquast/wcwidth/pull/97
 .. _`jquast/blessed`: https://github.com/jquast/blessed
 .. _`selectel/pyte`: https://github.com/selectel/pyte
 .. _`thomasballinger/curtsies`: https://github.com/thomasballinger/curtsies

docs/specs.rst

Lines changed: 4 additions & 0 deletions
@@ -52,3 +52,7 @@ Category codes of Nonspacing Mark (``Mn``) and Spacing Mark (``Mc``).
 
 Any characters of Modifier Symbol category, ``'Sk'`` where ``'FULLWIDTH'`` is
 present in comment of unicode data file, approx. 3 characters.
+
+Any character in sequence with U+FE0F (Variation Selector 16) defined by
+Emoji Variation Sequences txt as ``emoji style``.
+
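To make the "emoji style" rule concrete, here is a rough sketch of how lines in the `emoji-variation-sequences.txt` format could be reduced to a range table of base characters. It is not the logic of `update-tables.py`, whose diff is not shown here. The function names are hypothetical, the sample lines are typed in for illustration, and the sketch does not filter for characters that are otherwise narrow, which the commit message says the real table does.

```python
VS16 = 0xFE0F  # VARIATION SELECTOR 16


def parse_vs16_bases(lines):
    """Collect base code points of '<base> FE0F ; emoji style;' entries."""
    bases = set()
    for line in lines:
        data = line.split('#', 1)[0].strip()   # drop trailing comment
        if not data:
            continue                           # blank or comment-only line
        fields = [field.strip() for field in data.split(';')]
        if len(fields) < 2 or fields[1] != 'emoji style':
            continue                           # e.g. 'text style' (FE0E) entries
        codepoints = [int(cp, 16) for cp in fields[0].split()]
        if len(codepoints) == 2 and codepoints[1] == VS16:
            bases.add(codepoints[0])
    return bases


def to_ranges(values):
    """Merge code points into sorted, inclusive (start, end) ranges."""
    ranges = []
    for value in sorted(values):
        if ranges and value == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], value)  # extend the current range
        else:
            ranges.append((value, value))        # start a new range
    return tuple(ranges)


# Illustrative sample in the data file's line format.
SAMPLE = [
    "# emoji-variation-sequences.txt (excerpt-style sample)",
    "0030 FE0E  ; text style;  # (1.1) DIGIT ZERO",
    "0030 FE0F  ; emoji style; # (1.1) DIGIT ZERO",
    "0031 FE0F  ; emoji style; # (1.1) DIGIT ONE",
    "00A9 FE0F  ; emoji style; # (1.1) COPYRIGHT SIGN",
]
```

With this sample, `to_ranges(parse_vs16_bases(SAMPLE))` merges the two adjacent digits into one inclusive range and keeps the copyright sign as a range of its own.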

docs/unicode_version.rst

Lines changed: 6 additions & 0 deletions
@@ -121,3 +121,9 @@ release files:
 ``EastAsianWidth-15.1.0.txt``
   *Date: 2023-07-28, 23:34:08 GMT*
 
+``emoji-variation-sequences-12.0.0.txt``
+  *Date: 2019-01-15, 12:10:05 GMT*
+
+``emoji-variation-sequences-15.1.0.txt``
+  *Date: 2023-02-01, 02:22:54 GMT*
+

setup.py

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ def main():
     setuptools.setup(
         name='wcwidth',
         # NOTE: manually manage __version__ in wcwidth/__init__.py !
-        version='0.2.9',
+        version='0.2.10',
         description=(
             "Measures the displayed width of unicode strings in a terminal"),
         long_description=codecs.open(
