Multiline block comments with extraneous characters after the MULTI_LINE[1] indicator throw out the C-style parser

LLVM has these beautiful box-comments in which the SPDX licensing info gets inserted, leading to this error:

`Could not parse 'Apache-2.0 WITH LLVM-exception                    *|'`

Note: the above has multiple spaces prior to `*|`, but GitHub ate them.

Example: 
```
/*==-- clang-c/BuildSystem.h - Utilities for use by build systems -*- C -*-===*\
|*                                                                            *|
|* Part of the LLVM Project, under the Apache License v2.0 with LLVM          *|
|* Exceptions.                                                                *|
|* See https://llvm.org/LICENSE.txt for license information.                  *|
|* SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception                    *|
|*                                                                            *|
|*===----------------------------------------------------------------------===*|
|*                                                                            *|
|* This header provides various utilities for use by build systems.           *|
|*                                                                            *|
\*===----------------------------------------------------------------------===*/
```

I'd strip anything matching the regex ` *\*[|/]$` from the end of each line, but that's not how the code works in:
https://github.com/fsfe/reuse-tool/blob/e7040863a265db9b2314a2f29b3fcfd59f5cae51/src/reuse/_comment.py#L172-L182
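
For illustration, here is roughly what the regex idea would do to the failing line. This is only a sketch: `TRAILING_BOX` and `strip_box_edge` are hypothetical names, not anything that exists in reuse-tool.

```python
import re

# Hypothetical pattern from the issue text: optional spaces, then `*`
# followed by `|` or `/`, anchored at the end of the line.
TRAILING_BOX = re.compile(r" *\*[|/]$")


def strip_box_edge(line: str) -> str:
    """Remove a trailing box edge such as '   *|' or '*/' from one line."""
    return TRAILING_BOX.sub("", line)


spdx_line = (
    "|* SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception"
    "                    *|"
)
print(repr(strip_box_edge(spdx_line)))
```

Lines without such an edge pass through unchanged, so the pattern would only affect box-style comments.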

I'd add a block that, when `cls.MULTI_LINE[1]` is non-empty, runs `possible_line.rsplit(cls.MULTI_LINE[1], maxsplit=1)[0]` and tries to parse that instead. I think this would *not* require changes to the later lines 191-196, but the coupling between these blocks seems a bit fragile to me:
https://github.com/fsfe/reuse-tool/blob/e7040863a265db9b2314a2f29b3fcfd59f5cae51/src/reuse/_comment.py#L191-L196
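
A minimal sketch of that `rsplit` idea, written as a standalone function rather than a patch to `_comment.py`. The helper name `drop_after_middle_marker` is hypothetical, and `"*"` is assumed to be the C-style value of `MULTI_LINE[1]`.

```python
def drop_after_middle_marker(possible_line: str, middle: str) -> str:
    """Keep only the text before the last occurrence of `middle`.

    If `middle` is empty or does not occur, the line is returned unchanged.
    """
    if not middle:
        # str.rsplit raises ValueError on an empty separator.
        return possible_line
    return possible_line.rsplit(middle, maxsplit=1)[0]


MIDDLE = "*"  # assumed value of MULTI_LINE[1] for C-style comments

line = "|* SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception          *|"
print(repr(drop_after_middle_marker(line, MIDDLE).rstrip()))
```

One caveat with this sketch: on a line whose only `*` is the leading one (for example `|* text` with no closing edge), the split cuts at that leading marker and leaves just `|`. A real patch would probably need to apply the split only after the leading marker has been dealt with.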

To address this use case of extraneous characters, I think the best approach is to drop anything after the expected `MULTI_LINE[1]`. If the character isn't the expected one (say, the `*` is missing from this example), the comment parser breaks the same way it does today. I don't think that's a regression, just a fragile parser. Thoughts on that are welcome, but I'd say I'm proposing a strict improvement, merely an incomplete one.

Thanks in advance!


    for line in lines:
        if cls.MULTI_LINE[1]:
            possible_line = line.lstrip(cls.INDENT_BEFORE_MIDDLE)
            prefix = cls.MULTI_LINE[1]
            if possible_line.startswith(prefix):
                line = possible_line.lstrip(prefix)
            else:
                _LOGGER.debug(
                    "'%s' does not contain a middle comment marker", line
                )
        result.append(line)
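
To see how this loop treats the LLVM line, here is a standalone approximation. The constants are assumptions about the C-style class (`MULTI_LINE = ("/*", "*", "*/")` and `INDENT_BEFORE_MIDDLE = " "`), and `strip_middle_markers` is a hypothetical wrapper, not a function in reuse-tool.

```python
import logging

_LOGGER = logging.getLogger(__name__)

# Assumed C-style values; check them against the class in _comment.py.
MULTI_LINE = ("/*", "*", "*/")
INDENT_BEFORE_MIDDLE = " "


def strip_middle_markers(lines):
    """Mirror the middle-line loop: strip a leading marker, keep the rest."""
    result = []
    for line in lines:
        if MULTI_LINE[1]:
            possible_line = line.lstrip(INDENT_BEFORE_MIDDLE)
            prefix = MULTI_LINE[1]
            if possible_line.startswith(prefix):
                line = possible_line.lstrip(prefix)
            else:
                _LOGGER.debug(
                    "'%s' does not contain a middle comment marker", line
                )
        result.append(line)
    return result


boxed = "|* SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception      *|"
plain = " * SPDX-License-Identifier: MIT"
for out in strip_middle_markers([boxed, plain]):
    print(repr(out))
```

Under these assumptions the boxed line comes back untouched, trailing `*|` included, because it starts with `|` rather than `*`. That is consistent with the parse error quoted at the top.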

    last = last.rstrip(cls.MULTI_LINE[2])
    last = last.rstrip(cls.INDENT_BEFORE_END)
    last = last.strip()
    if cls.MULTI_LINE[1] and last.startswith(cls.MULTI_LINE[1]):
        last = last.lstrip(cls.MULTI_LINE[1])
        last = last.lstrip()

Multiline block comments with extraneous characters after the MULTI_LINE[1] indicator throw out the C-style parser #343
