Description
LLVM has these beautiful box-comments in which the SPDX licensing info gets inserted, leading to this error:
Could not parse 'Apache-2.0 WITH LLVM-exception *|'
Note: the above has multiple spaces prior to *|, but GitHub ate them.
Example:
/*==-- clang-c/BuildSystem.h - Utilities for use by build systems -*- C -*-===*\
|* *|
|* Part of the LLVM Project, under the Apache License v2.0 with LLVM *|
|* Exceptions. *|
|* See https://llvm.org/LICENSE.txt for license information. *|
|* SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception *|
|* *|
|*===----------------------------------------------------------------------===*|
|* *|
|* This header provides various utilities for use by build systems. *|
|* *|
\*===----------------------------------------------------------------------===*/
I'd exclude the regex *\*[|/]$, but that's not how the code works in:
reuse-tool/src/reuse/_comment.py
Lines 172 to 182 in e704086
    for line in lines:
        if cls.MULTI_LINE[1]:
            possible_line = line.lstrip(cls.INDENT_BEFORE_MIDDLE)
            prefix = cls.MULTI_LINE[1]
            if possible_line.startswith(prefix):
                line = possible_line.lstrip(prefix)
            else:
                _LOGGER.debug(
                    "'%s' does not contain a middle comment marker", line
                )
        result.append(line)
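To see why the trailing *| survives, the loop body quoted above can be replayed on a single line. This is only a minimal sketch: the values of MIDDLE and INDENT_BEFORE_MIDDLE are assumptions standing in for cls.MULTI_LINE[1] and cls.INDENT_BEFORE_MIDDLE of a C-style comment, and strip_middle is a hypothetical helper, not a function in reuse-tool.

```python
# Assumed stand-ins for cls.MULTI_LINE[1] and cls.INDENT_BEFORE_MIDDLE
# of a C-style comment; not copied from reuse-tool.
MIDDLE = "*"
INDENT_BEFORE_MIDDLE = " "


def strip_middle(line):
    """Replay the quoted loop body for one line (hypothetical helper)."""
    possible_line = line.lstrip(INDENT_BEFORE_MIDDLE)
    if possible_line.startswith(MIDDLE):
        # str.lstrip treats its argument as a set of characters to remove.
        return possible_line.lstrip(MIDDLE)
    # No leading middle marker: the line is passed through untouched.
    return line


llvm_line = "|* SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception   *|"
plain_line = " * SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception"

print(repr(strip_middle(llvm_line)))
print(repr(strip_middle(plain_line)))
```

Under these assumptions the box-comment line starts with | rather than *, so it is returned unchanged and the trailing *| is still there when the license expression is parsed.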
I'd add a block that, when cls.MULTI_LINE[1] is non-empty, runs possible_line.rsplit(cls.MULTI_LINE[1], maxsplit=1)[0] and tries to parse that instead. I think this would >not< require changes to the later lines 191 to 196, but I think the coupling between these blocks is a bit fragile:
reuse-tool/src/reuse/_comment.py
Lines 191 to 196 in e704086
    last = last.rstrip(cls.MULTI_LINE[2])
    last = last.rstrip(cls.INDENT_BEFORE_END)
    last = last.strip()
    if cls.MULTI_LINE[1] and last.startswith(cls.MULTI_LINE[1]):
        last = last.lstrip(cls.MULTI_LINE[1])
        last = last.lstrip()
To address this use case of extraneous characters, I think the best approach is to drop anything after the expected MULTI_LINE[1]. If, however, the character isn't the expected one (say, the * is missing from this example), the comment parser breaks in the same way as it does today. I don't think that's a regression, just a sign that the parser is fragile; thoughts on that are welcome, but I'd say I'm proposing a strict improvement, albeit an incomplete one.
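The proposal above could be sketched roughly as follows. The function name drop_trailing_decoration and the default middle="*" are hypothetical, and the two guards (only cut when text precedes the last marker and nothing alphanumeric follows it) are my own assumption to avoid cutting at a leading marker; a bare rsplit would do that on a line with no trailing marker.

```python
def drop_trailing_decoration(line, middle="*"):
    """Drop the last middle marker and whatever follows it (hypothetical sketch)."""
    if not middle:
        return line
    # rpartition splits at the LAST occurrence, like rsplit(middle, maxsplit=1).
    head, sep, tail = line.rpartition(middle)
    has_text_before = any(c.isalnum() for c in head)
    only_decoration_after = not any(c.isalnum() for c in tail)
    if sep and has_text_before and only_decoration_after:
        return head.rstrip()
    # Marker missing, or it is the leading one: leave the line as it is today.
    return line


boxed = "|* SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception   *|"
print(repr(drop_trailing_decoration(boxed)))
```

If the trailing marker is not the expected one, the line comes back unchanged, which matches the "no regression, merely incomplete" behaviour described above.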
Thanks in advance!