Multiline block comments with extraneous characters after the MULTI_LINE[1] indicator throw out the C-style parser #343

@lbruno


LLVM has these beautiful box-comments in which the SPDX licensing info gets inserted, leading to this error:

Could not parse 'Apache-2.0 WITH LLVM-exception *|'

Note: the above has multiple spaces before *|, but GitHub ate them.

Example:

/*==-- clang-c/BuildSystem.h - Utilities for use by build systems -*- C -*-===*\
|*                                                                            *|
|* Part of the LLVM Project, under the Apache License v2.0 with LLVM          *|
|* Exceptions.                                                                *|
|* See https://llvm.org/LICENSE.txt for license information.                  *|
|* SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception                    *|
|*                                                                            *|
|*===----------------------------------------------------------------------===*|
|*                                                                            *|
|* This header provides various utilities for use by build systems.           *|
|*                                                                            *|
\*===----------------------------------------------------------------------===*/

I'd exclude whatever matches the regex *\*[|/]$, but that's not how the code works in:

for line in lines:
    if cls.MULTI_LINE[1]:
        possible_line = line.lstrip(cls.INDENT_BEFORE_MIDDLE)
        prefix = cls.MULTI_LINE[1]
        if possible_line.startswith(prefix):
            line = possible_line.lstrip(prefix)
        else:
            _LOGGER.debug(
                "'%s' does not contain a middle comment marker", line
            )
    result.append(line)
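
A minimal sketch of the regex idea mentioned before the excerpt, assuming the intended pattern is "optional whitespace, then *, then | or / at the end of the line" (the leading whitespace in the pattern may have been lost in rendering). The names TRAILING_BOX_EDGE and strip_box_edge are hypothetical and not part of the project:

```python
import re

# Assumed pattern: a run of whitespace, a literal '*', then '|' or '/'
# at the very end of the line (the right-hand edge of a box comment).
TRAILING_BOX_EDGE = re.compile(r"\s*\*[|/]$")


def strip_box_edge(line: str) -> str:
    """Remove a trailing box edge such as '   *|' or '*/' if present."""
    return TRAILING_BOX_EDGE.sub("", line)


spdx_line = (
    "|* SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception"
    "                    *|"
)
cleaned = strip_box_edge(spdx_line)
```

Lines without such an edge pass through unchanged, so this would only affect box-style comments. The leading |* is untouched here; handling it is a separate question.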

I'd add a block that, when cls.MULTI_LINE[1] is non-empty, runs possible_line.rsplit(cls.MULTI_LINE[1], maxsplit=1)[0] and tries to parse that result instead. I think this would not require changes to the later lines 191 and 196, but the coupling between these blocks is a bit fragile:

last = last.rstrip(cls.MULTI_LINE[2])
last = last.rstrip(cls.INDENT_BEFORE_END)
last = last.strip()
if cls.MULTI_LINE[1] and last.startswith(cls.MULTI_LINE[1]):
    last = last.lstrip(cls.MULTI_LINE[1])
    last = last.lstrip()

To address this use case of extraneous characters, I think the best approach is to drop the trailing MULTI_LINE[1] and anything after it. If, however, the character isn't the expected one (say, the * is missing from this example), the comment parser breaks in the same way as today. I don't think that's a regression, just a sign that the parser is fragile; thoughts on that are welcome, but I'd say I'm proposing a strict improvement, albeit an incomplete one.
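
A rough sketch of the rsplit idea, written as a standalone function rather than a patch. It assumes the C style uses "*" as MULTI_LINE[1] and " " as INDENT_BEFORE_MIDDLE; strip_middle_line and its parameters are hypothetical stand-ins for the class attributes. The trailing trim runs only after the leading marker has been removed, since otherwise an ordinary " * text" line would lose all of its content:

```python
def strip_middle_line(
    line: str, middle: str = "*", indent_before_middle: str = " "
) -> str:
    """Strip a leading middle marker, then drop a trailing one and its tail.

    `middle` stands in for cls.MULTI_LINE[1] and `indent_before_middle`
    for cls.INDENT_BEFORE_MIDDLE (both values are assumptions).
    """
    possible_line = line.lstrip(indent_before_middle)
    if middle and possible_line.startswith(middle):
        # Remove exactly one leading marker (slicing, not lstrip).
        line = possible_line[len(middle):]
    if middle and middle in line:
        # Proposed step: keep only what precedes the last marker.
        line = line.rsplit(middle, maxsplit=1)[0]
    return line.rstrip()


boxed = (
    "|* SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception"
    "                    *|"
)
result = strip_middle_line(boxed)
```

One caveat with this sketch: any remaining marker character counts as a trailing edge, so a comment line such as " * computes a * b" would be cut short after "a". Restricting the trim to markers at the very end of the line would avoid that.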

Thanks in advance!


Labels: bug (Something isn't working)
