Skip to content

[LLDB] Add Lexer (with tests) for DIL (Data Inspection Language). #123521

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Feb 5, 2025
Merged
132 changes: 132 additions & 0 deletions lldb/include/lldb/ValueObject/DILLexer.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,132 @@
//===-- DILLexer.h ----------------------------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef LLDB_VALUEOBJECT_DILLEXER_H_
#define LLDB_VALUEOBJECT_DILLEXER_H_
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe our headers don't have the ending underscore.

Suggested change
#ifndef LLDB_VALUEOBJECT_DILLEXER_H_
#define LLDB_VALUEOBJECT_DILLEXER_H_
#ifndef LLDB_VALUEOBJECT_DILLEXER_H
#define LLDB_VALUEOBJECT_DILLEXER_H


#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/iterator_range.h"
#include "llvm/Support/Error.h"
#include <cstdint>
#include <limits.h>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It fits in nicer with the other headers (though maybe you don't need it if you remove the UINT_MAX thing below)

Suggested change
#include <limits.h>
#include <climits>

#include <memory>
#include <string>
#include <vector>

namespace lldb_private::dil {

/// Class defining the tokens generated by the DIL lexer and used by the
/// DIL parser.
class Token {
public:
enum Kind {
coloncolon,
eof,
identifier,
l_paren,
r_paren,
unknown,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the only remaining use of the "unknown" token is to construct "empty" tokens in the tests. I think all of those cases could be avoided by just declaring the Token variable later in the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works for this PR, but the parser has a token member variable, where it stores the current token it's working on. When I create the parser I have to initialize the member variable token to some value; the only one that makes sense is "unknown".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe that variable should be optional<Token> then? Though I'm not sure why it's needed as things could just call lexer.GetCurrentToken() instead. In either case, I'd like to remove this from this patch as its not needed here. If it turns out to be the best solution to the parsers needs, we can add the extra type then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just thought of an alternative implementation I think I can use that will make this work. :-) I'll test that and if it works I'll get rid of the 'unknown' token type.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unknown token is still here. Are you sure you uploaded the right version of the patch?

};

Token(Kind kind, std::string spelling, uint32_t start)
: m_kind(kind), m_spelling(std::move(spelling)), m_start_pos(start) {}

Kind GetKind() const { return m_kind; }

std::string GetSpelling() const { return m_spelling; }

bool Is(Kind kind) const { return m_kind == kind; }

bool IsNot(Kind kind) const { return m_kind != kind; }

bool IsOneOf(Kind kind1, Kind kind2) const { return Is(kind1) || Is(kind2); }

template <typename... Ts> bool IsOneOf(Kind kind, Ts... Ks) const {
return Is(kind) || IsOneOf(Ks...);
}

uint32_t GetLocation() const { return m_start_pos; }

static llvm::StringRef GetTokenName(Kind kind);

private:
Kind m_kind;
std::string m_spelling;
uint32_t m_start_pos; // within entire expression string
};

/// Class for doing the simple lexing required by DIL.
class DILLexer {
public:
/// Lexes all the tokens in expr and calls the private constructor
/// with the lexed tokens.
static llvm::Expected<DILLexer> Create(llvm::StringRef expr);

/// Return the current token to be handled by the DIL parser.
const Token &GetCurrentToken() { return m_lexed_tokens[m_tokens_idx]; }

/// Advance the current token position by N.
void Advance(uint32_t N = 1) {
// UINT_MAX means uninitialized, no "current" position, so move to start.
if (m_tokens_idx == UINT_MAX)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just have the lexer start at token zero straight away?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really...this index indicates which token the parser is currently working on. We shouldn't set the index to zero until the parser asks for the first token (& starts parsing with it).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And why is that the case? I don't find this argument particularly convincing. All it does is create a weird "before the first token" state, which you have to account for both in the implementation of the lexer functions and in the functions that use that. I think things would be much simpler if you could just always assume that the lexer has a valid "current" position, and I think that's basically we have the eof pseudo-token. If there was some value in having a "before-the first token" state, then i think it'd be better to have a bof pseudo-token to match that (but I doubt that's the case).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok; the alternative I mentioned above should make this work too (fingers crossed).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I'd hope that all of the UINT_MAX business can go away if we really start at token zero.

m_tokens_idx = 0;
else if (m_tokens_idx + N >= m_lexed_tokens.size())
// N is too large; advance to the end of the lexed tokens.
m_tokens_idx = m_lexed_tokens.size() - 1;
else
m_tokens_idx += N;
}

/// Return the lexed token N positions ahead of the 'current' token
/// being handled by the DIL parser.
const Token &LookAhead(uint32_t N) {
if (m_tokens_idx + N < m_lexed_tokens.size())
return m_lexed_tokens[m_tokens_idx + N];

// Last token should be an 'eof' token.
return m_lexed_tokens.back();
}

/// Return the index for the 'current' token being handled by the DIL parser.
uint32_t GetCurrentTokenIdx() { return m_tokens_idx; }

/// Set the index for the 'current' token (to be handled by the parser)
/// to a particular position. Used for either committing 'look ahead' parsing
/// or rolling back tentative parsing.
void ResetTokenIdx(uint32_t new_value) {
assert(new_value == UINT_MAX || new_value < m_lexed_tokens.size());
m_tokens_idx = new_value;
}

uint32_t NumLexedTokens() { return m_lexed_tokens.size(); }

private:
DILLexer(llvm::StringRef dil_expr, std::vector<Token> lexed_tokens)
: m_expr(dil_expr), m_lexed_tokens(std::move(lexed_tokens)),
m_tokens_idx(UINT_MAX), m_eof_token(Token(Token::eof, "", 0)) {}

static llvm::Expected<Token> Lex(llvm::StringRef expr,
llvm::StringRef &remainder);

// The input string we are lexing & parsing.
llvm::StringRef m_expr;

// Holds all of the tokens lexed so far.
std::vector<Token> m_lexed_tokens;

// Index into m_lexed_tokens; indicates which token the DIL parser is
// currently trying to parse/handle.
uint32_t m_tokens_idx;

// "eof" token; to be returned by lexer when 'look ahead' fails.
Token m_eof_token;
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// "eof" token; to be returned by lexer when 'look ahead' fails.
Token m_eof_token;
};
};


} // namespace lldb_private::dil

#endif // LLDB_VALUEOBJECT_DILLEXER_H_
1 change: 1 addition & 0 deletions lldb/source/ValueObject/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
add_lldb_library(lldbValueObject
DILLexer.cpp
ValueObject.cpp
ValueObjectCast.cpp
ValueObjectChild.cpp
Expand Down
128 changes: 128 additions & 0 deletions lldb/source/ValueObject/DILLexer.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,128 @@
//===-- DILLexer.cpp ------------------------------------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
// This implements the recursive descent parser for the Data Inspection
// Language (DIL), and its helper functions, which will eventually underlie the
// 'frame variable' command. The language that this parser recognizes is
// described in lldb/docs/dil-expr-lang.ebnf
//
//===----------------------------------------------------------------------===//

#include "lldb/ValueObject/DILLexer.h"
#include "lldb/Utility/Status.h"
#include "llvm/ADT/StringSwitch.h"

namespace lldb_private::dil {

llvm::StringRef Token::GetTokenName(Kind kind) {
switch (kind) {
case Kind::coloncolon:
return "coloncolon";
case Kind::eof:
return "eof";
case Kind::identifier:
return "identifier";
case Kind::l_paren:
return "l_paren";
case Kind::r_paren:
return "r_paren";
case Kind::unknown:
return "unknown";
}
}

static bool IsLetter(char c) {
return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z');
}

static bool IsDigit(char c) { return '0' <= c && c <= '9'; }

// A word starts with a letter, underscore, or dollar sign, followed by
// letters ('a'..'z','A'..'Z'), digits ('0'..'9'), and/or underscores.
static std::optional<llvm::StringRef> IsWord(llvm::StringRef expr,
llvm::StringRef &remainder) {
llvm::StringRef::iterator cur_pos = expr.end() - remainder.size();
llvm::StringRef::iterator start = cur_pos;
bool dollar_start = false;

// Must not start with a digit.
if (cur_pos == expr.end() || IsDigit(*cur_pos))
return std::nullopt;

// First character *may* be a '$', for a register name or convenience
// variable.
if (*cur_pos == '$') {
dollar_start = true;
++cur_pos;
}

// Contains only letters, digits or underscores
for (; cur_pos != expr.end(); ++cur_pos) {
char c = *cur_pos;
if (!IsLetter(c) && !IsDigit(c) && c != '_')
break;
}

// If first char is '$', make sure there's at least one mare char, or it's
// invalid.
if (dollar_start && (cur_pos - start <= 1)) {
cur_pos = start;
return std::nullopt;
}

if (cur_pos == start)
return std::nullopt;

llvm::StringRef word = expr.substr(start - expr.begin(), cur_pos - start);
if (remainder.consume_front(word))
return word;

return std::nullopt;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be equivalent to your code, but I'm only suggesting this to show how code like this can be written in a shorter and more stringref-y fashion. I think the actual algorithm will have to change, as it has a very narrow definition of identifiers.

Suggested change
llvm::StringRef::iterator cur_pos = expr.end() - remainder.size();
llvm::StringRef::iterator start = cur_pos;
bool dollar_start = false;
// Must not start with a digit.
if (cur_pos == expr.end() || IsDigit(*cur_pos))
return std::nullopt;
// First character *may* be a '$', for a register name or convenience
// variable.
if (*cur_pos == '$') {
dollar_start = true;
++cur_pos;
}
// Contains only letters, digits or underscores
for (; cur_pos != expr.end(); ++cur_pos) {
char c = *cur_pos;
if (!IsLetter(c) && !IsDigit(c) && c != '_')
break;
}
// If first char is '$', make sure there's at least one mare char, or it's
// invalid.
if (dollar_start && (cur_pos - start <= 1)) {
cur_pos = start;
return std::nullopt;
}
if (cur_pos == start)
return std::nullopt;
llvm::StringRef word = expr.substr(start - expr.begin(), cur_pos - start);
if (remainder.consume_front(word))
return word;
return std::nullopt;
const char *start = remainder.data();
remainder.consume_front("$"); // initial '$' is valid
remainder = remainder.drop_while([](char c){ return IsDigit(c) || IsLetter(c) || c=='_'; });
llvm::StringRef candidate(start, remainder.data()-start);
if (candidate.empty() || candidate == "$" || IsDigit(candidate[0]))
return std::nullopt;
return candidate;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Not very surprisingly) I find it easier to understand my code than your code, so I would prefer to keep my code. But if you really want me to use something like your code instead I will ...

One question: When constructing 'candidate', you've already removed all the characters that should comprise the candidate word from remainder. Since those characters are no longer there, doesn't that make 'start' invalid? (It's pointing to something that's been removed from 'remainder'?). How can 'remainder.data() - start' possibly work at this point?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key thing to realize here is that a StringRef is just a fancy name for a pair<const char *, size_t>. On its own, it doesn't make anything valid. It's only valid as long as the string it points to is valid. And the same goes for the validity pointers you get from the StringRef. This is just a pointer to the parsed string, which will remain valid for as long as that string is around -- regardless of what happens to the StringRef.

That said, I realized a different problem with this code. It will end up modifying the remainder even in the failure case. So here's a slightly different version which avoids that. This one isn't completely equivalent, as I've dropped the requirement on the positioning of the dollar sign, but that's something I very strongly believe we should do anyway.

StringRef candidate = remainder.take_while([](char c) { return IsDigit(c) || IsLetter(c) || c=='_' || c=='$'; });
if (candidate.empty() || IsDigit(candidate[0]))
  return std::nullopt;
remainder.drop_front(candidate.size());
return candidate;

}

llvm::Expected<DILLexer> DILLexer::Create(llvm::StringRef expr) {
std::vector<Token> tokens;
llvm::StringRef remainder = expr;
do {
if (llvm::Expected<Token> t = Lex(expr, remainder)) {
tokens.push_back(std::move(*t));
} else {
return t.takeError();
}
} while (tokens.back().GetKind() != Token::eof);
return DILLexer(expr, std::move(tokens));
}

llvm::Expected<Token> DILLexer::Lex(llvm::StringRef expr,
llvm::StringRef &remainder) {
// Skip over whitespace (spaces).
remainder = remainder.ltrim();
llvm::StringRef::iterator cur_pos = expr.end() - remainder.size();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
llvm::StringRef::iterator cur_pos = expr.end() - remainder.size();
llvm::StringRef::iterator cur_pos = remainder.begin();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your suggestion is not correct. "cur_pos" is supposed to refer to the position of the currently-being-lexed word/token/etc in the original input string, not its position in remainder.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that's the same thing. Both stringrefs are just pointers to the same underlying string. StringRef::iterator is just a fancy name for const char *. Would it look better to you if this was written as const char *cur_pos = remainder.data(); ?

Or even better, since this is only used constructing the position value below (I'm ignoring the check below), you can just drop this and construct position directly as remainder.data()-expr.data()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Both stringrefs are pointers to the same underlying string" -- I did not know that. SO...there's a string (const char *) in memory, that both StringRefs point to, and updating the StringRefs never updates the underlying memory contents? Hm...it would be nice if there was documentation somewhere that made this clear (the doxygen stuff does NOT). Thanks for the explanation & clarification.


// Check to see if we've reached the end of our input string.
if (remainder.empty() || cur_pos == expr.end())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (remainder.empty() || cur_pos == expr.end())
if (remainder.empty())

If the two conditions aren't equivalent, I'd like to know why.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because they're checking on two different StringRefs? If this function has been called properly (with the correct arguments), then the two probablye should be equivalent. But I don't like to make assumptions like that without checking them...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I see code like this I start to wonder whether there's an edge case I'm missing. Defensive checks may make sense on some external API which is called from a bunch of places, but here we're talking about a private function that's only called from a single place. At best that deserves an assertion (you could e.g. say assert(expr.end() == remainder.end()) at the beginning of the function.

return Token(Token::eof, "", (uint32_t)expr.size());

uint32_t position = cur_pos - expr.begin();
std::optional<llvm::StringRef> maybe_word = IsWord(expr, remainder);
if (maybe_word)
return Token(Token::identifier, maybe_word->str(), position);

constexpr std::pair<Token::Kind, const char *> operators[] = {
{Token::l_paren, "("},
{Token::r_paren, ")"},
{Token::coloncolon, "::"},
};
for (auto [kind, str] : operators) {
if (remainder.consume_front(str))
return Token(kind, str, position);
}

// Unrecognized character(s) in string; unable to lex it.
return llvm::createStringError("Unable to lex input string");
}

} // namespace lldb_private::dil
2 changes: 2 additions & 0 deletions lldb/unittests/ValueObject/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -1,10 +1,12 @@
add_lldb_unittest(LLDBValueObjectTests
DumpValueObjectOptionsTests.cpp
DILLexerTests.cpp

LINK_LIBS
lldbValueObject
lldbPluginPlatformLinux
lldbPluginScriptInterpreterNone
LLVMTestingSupport

LINK_COMPONENTS
Support
Expand Down
Loading