gh-127833: Reword and expand the Notation section #134443
Changes from 5 commits
211f83d
ec90d40
3f3a0db
160ff42
231e2ba
fa17c00
327b90d
3c5caa6
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -90,44 +90,112 @@ Notation | |
|
||
.. index:: BNF, grammar, syntax, notation | ||
|
||
The descriptions of lexical analysis and syntax use a modified | ||
`Backus–Naur form (BNF) <https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form>`_ grammar | ||
notation. This uses the following style of definition: | ||
|
||
.. productionlist:: notation | ||
name: `lc_letter` (`lc_letter` | "_")* | ||
lc_letter: "a"..."z" | ||
|
||
The first line says that a ``name`` is an ``lc_letter`` followed by a sequence | ||
of zero or more ``lc_letter``\ s and underscores. An ``lc_letter`` in turn is | ||
any of the single characters ``'a'`` through ``'z'``. (This rule is actually | ||
adhered to for the names defined in lexical and grammar rules in this document.) | ||
|
||
Each rule begins with a name (which is the name defined by the rule) and | ||
``::=``. A vertical bar (``|``) is used to separate alternatives; it is the | ||
least binding operator in this notation. A star (``*``) means zero or more | ||
repetitions of the preceding item; likewise, a plus (``+``) means one or more | ||
repetitions, and a phrase enclosed in square brackets (``[ ]``) means zero or | ||
one occurrences (in other words, the enclosed phrase is optional). The ``*`` | ||
and ``+`` operators bind as tightly as possible; parentheses are used for | ||
grouping. Literal strings are enclosed in quotes. White space is only | ||
meaningful to separate tokens. Rules are normally contained on a single line; | ||
rules with many alternatives may be formatted alternatively with each line after | ||
the first beginning with a vertical bar. | ||
|
||
.. index:: lexical definitions, ASCII | ||
|
||
In lexical definitions (as the example above), two more conventions are used: | ||
Two literal characters separated by three dots mean a choice of any single | ||
character in the given (inclusive) range of ASCII characters. A phrase between | ||
angular brackets (``<...>``) gives an informal description of the symbol | ||
defined; e.g., this could be used to describe the notion of 'control character' | ||
if needed. | ||
|
||
Even though the notation used is almost the same, there is a big difference | ||
between the meaning of lexical and syntactic definitions: a lexical definition | ||
operates on the individual characters of the input source, while a syntax | ||
definition operates on the stream of tokens generated by the lexical analysis. | ||
All uses of BNF in the next chapter ("Lexical Analysis") are lexical | ||
definitions; uses in subsequent chapters are syntactic definitions. | ||
|
||
The descriptions of lexical analysis use a grammar notation that is a mixture | ||
of `EBNF <https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_ | ||
and `PEG <https://en.wikipedia.org/wiki/Parsing_expression_grammar>`_. | ||
For example: | ||
|
||
.. grammar-snippet:: | ||
:group: notation | ||
|
||
name: `letter` (`letter` | `digit` | "_")* | ||
letter: "a"..."z" | "A"..."Z" | ||
digit: "0"..."9" | ||
|
||
In this example, the first line says that a ``name`` is a ``letter`` followed | ||
by a sequence of zero or more ``letter``\ s, ``digit``\ s, and underscores. | ||
A ``letter`` in turn is any of the single characters ``'a'`` through | |
``'z'`` or ``'A'`` through ``'Z'``; a ``digit`` is a single character from | |
``'0'`` to ``'9'``. | |
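
As a rough illustration of how the example rules read, the three rules can be translated into a regular expression. This is only a sketch of the example grammar above, not Python's actual identifier rule; ``NAME_RE`` and ``is_name`` are hypothetical names introduced here.

```python
import re

# Assumed translation of the example grammar into a regular expression:
#   letter: "a"..."z" | "A"..."Z"            ->  [a-zA-Z]
#   digit:  "0"..."9"                        ->  [0-9]
#   name:   letter (letter | digit | "_")*   ->  letter, then zero or more of the three
NAME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")


def is_name(text: str) -> bool:
    """Return True if the whole string matches the example ``name`` rule."""
    return NAME_RE.fullmatch(text) is not None


print(is_name("spam_42"))   # True: starts with a letter
print(is_name("_private"))  # False: the first character must be a letter
print(is_name("9lives"))    # False: a digit cannot come first
```

Note that the example rule is stricter than real Python identifiers, which may start with an underscore.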
|
||
Each rule begins with a name (which identifies the rule that's being defined) | ||
followed by a colon, ``:``. | ||
The definition to the right of the colon uses the following syntax elements: | ||
|
||
* ``name``: A name refers to another rule. | ||
Where possible, it is a link to the rule's definition. | ||
Review comment: What possibilities exist, other than referencing another rule?
Reply: Rules that aren't defined by formal grammar in the docs, for example Also: tokens are currently unlinked, and many don't have a (lexical) grammar rule so they should eventually link to prose. |
||
|
||
* ``TOKEN``: An uppercase name refers to a :term:`token`. | ||
For the purposes of grammar definitions, tokens are the same as rules. | ||
|
||
* ``"text"``, ``'text'``: Text in single or double quotes must match literally | ||
(without the quotes). The type of quote is chosen according to the meaning | ||
of ``text``: | ||
|
||
* ``'if'``: A name in single quotes denotes a :ref:`keyword <keywords>`. | ||
* ``"case"``: A name in double quotes denotes a | ||
:ref:`soft keyword <soft-keywords>`. | |
* ``'@'``: A non-letter symbol in single quotes denotes an | ||
:py:data:`~token.OP` token, that is, a :ref:`delimiter <delimiters>` or | ||
:ref:`operator <operators>`. | ||
|
||
* ``"a"..."z"``: Two literal characters separated by three dots mean a choice | ||
of any single character in the given (inclusive) range of ASCII characters. | ||
* ``<...>``: A phrase between angle brackets gives an informal description | |
of the matched symbol (for example, ``<any ASCII character except "\">``), | ||
or an abbreviation that is defined in nearby text (for example, ``<Lu>``). | ||
Review comment: Should we be mentioning somewhere that these are not part of the actual Python grammar, but part of the notation to make it easier to describe specific constructs? Maybe as part of the first paragraph that says that this is a mixture of EBNF and PEG?
Reply: I moved them to the end, and (re-)added notes that they're only used in the lexer definitions. |
||
* ``e1 e2``: Items separated only by whitespace denote a sequence. | ||
Here, ``e1`` must be followed by ``e2``. | ||
* ``e1 | e2``: A vertical bar is used to separate alternatives. | ||
It denotes PEG's "ordered choice": if ``e1`` matches, ``e2`` is | ||
not considered. | ||
In traditional PEG grammars, this is written as a slash, ``/``, rather than | ||
a vertical bar. | ||
See :pep:`617` for more background and details. | ||
* ``e*``: A star means zero or more repetitions of the preceding item. | ||
* ``e+``: Likewise, a plus means one or more repetitions. | ||
* ``[e]``: A phrase enclosed in square brackets means zero or | ||
one occurrences. In other words, the enclosed phrase is optional. | ||
* ``e?``: A question mark has exactly the same meaning as square brackets: | ||
the preceding item is optional. | ||
* ``(e)``: Parentheses are used for grouping. | ||
|
||
The unary operators (``*``, ``+``, ``?``) bind as tightly as possible; | ||
the vertical bar (``|``) binds most loosely. | ||
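
The operators listed above can be sketched as a handful of toy parser combinators. This is a minimal illustration of the notation's semantics, assuming a parser is a function from ``(text, pos)`` to a new position or ``None``; it is not how CPython's parser is implemented, and all names (``lit``, ``seq``, ``choice``, ``star``, ``plus``, ``opt``, ``full_match``) are hypothetical.

```python
from typing import Callable, Optional

# A parser takes (text, pos) and returns the position after the match,
# or None if it does not match at that position.
Parser = Callable[[str, int], Optional[int]]


def lit(s: str) -> Parser:
    """"text": match the literal string s."""
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def seq(*parsers: Parser) -> Parser:
    """e1 e2: each item must match, one after the other."""
    def parse(text: str, pos: int) -> Optional[int]:
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def choice(*parsers: Parser) -> Parser:
    """e1 | e2: ordered choice; the first alternative that matches wins."""
    def parse(text: str, pos: int) -> Optional[int]:
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end  # later alternatives are never considered
        return None
    return parse


def star(p: Parser) -> Parser:
    """e*: zero or more repetitions (always succeeds)."""
    def parse(text: str, pos: int) -> Optional[int]:
        while True:
            end = p(text, pos)
            if end is None or end == pos:  # stop on failure or empty match
                return pos
            pos = end
    return parse


def plus(p: Parser) -> Parser:
    """e+: one or more repetitions."""
    return seq(p, star(p))


def opt(p: Parser) -> Parser:
    """[e] or e?: zero or one occurrence."""
    return choice(p, lit(""))


def full_match(p: Parser, text: str) -> bool:
    return p(text, 0) == len(text)


# ("a" | "ab") "c": once "a" matches, "ab" is not tried, so "abc" fails.
rule_a = seq(choice(lit("a"), lit("ab")), lit("c"))
# ("ab" | "a") "c": putting the longer alternative first accepts both inputs.
rule_b = seq(choice(lit("ab"), lit("a")), lit("c"))
# "a" "b"* | "c" groups as ("a" ("b"*)) | "c", because * binds tightest
# and | binds most loosely.
rule_c = choice(seq(lit("a"), star(lit("b"))), lit("c"))

print(full_match(rule_a, "abc"))   # False
print(full_match(rule_b, "abc"))   # True
print(full_match(rule_b, "ac"))    # True
print(full_match(rule_c, "abbb"))  # True
print(full_match(rule_c, "c"))     # True
```

The ``rule_a`` case shows why the order of alternatives matters with ordered choice: a shorter alternative listed first can hide a longer one.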
|
||
Whitespace is only meaningful to separate tokens. | |
|
||
Rules are normally contained on a single line, but rules that are too long | ||
may be wrapped: | ||
|
||
.. grammar-snippet:: | ||
:group: notation | ||
|
||
literal: stringliteral | bytesliteral | ||
| integer | floatnumber | imagnumber | ||
|
||
Alternatively, rules may be formatted with the first line ending at the colon, | ||
and each alternative beginning with a vertical bar on a new line. | ||
For example: | ||
|
||
|
||
.. grammar-snippet:: | ||
:group: notation-alt | ||
|
||
literal: | ||
| stringliteral | ||
| bytesliteral | ||
| integer | ||
| floatnumber | ||
| imagnumber | ||
|
||
This does *not* mean that there is an empty first alternative. | ||
|
||
.. index:: lexical definitions | ||
|
||
There is some difference between *lexical* and *syntactic* analysis: | ||
the :term:`lexical analyzer` operates on the individual characters of the | ||
input source, while the *parser* (syntactic analyzer) operates on the stream | ||
of :term:`tokens <token>` generated by the lexical analysis. | ||
However, in some cases the exact boundary between the two phases is a | ||
CPython implementation detail. | ||
|
||
The practical difference between the two is that in *lexical* definitions, | ||
all whitespace is significant. | ||
The lexical analyzer :ref:`discards <whitespace>` all whitespace that is not | ||
converted to tokens like :data:`token.INDENT` or :data:`~token.NEWLINE`. | ||
*Syntactic* definitions then use these tokens, rather than source characters. | ||
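
The token stream described above can be inspected with the standard library's ``tokenize`` module. This is a small sketch; the module's output is close to what the parser consumes but not guaranteed to be identical (for example, it also reports comments), and the ``source`` string is an arbitrary example chosen here.

```python
import io
import token
import tokenize

# Extra spaces around "=" are deliberate: they do not show up as tokens.
source = "if x:\n    y  =  1\n"

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
names = [token.tok_name[tok.type] for tok in tokens]

for tok in tokens:
    print(token.tok_name[tok.type], repr(tok.string))

# The spaces inside the line are gone; the line break and the leading
# indentation survive only as NEWLINE, INDENT and DEDENT tokens.
print(names)
```

For this input, the token types are ``NAME``, ``NAME``, ``OP``, ``NEWLINE``, ``INDENT``, ``NAME``, ``OP``, ``NUMBER``, ``NEWLINE``, ``DEDENT``, ``ENDMARKER``.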
|
||
This documentation uses the same BNF grammar for both styles of definitions. | ||
All uses of BNF in the next chapter (“Lexical Analysis”) are lexical definitions; | ||
uses in subsequent chapters are syntactic definitions. |