Minciunescu Vlad-Andrei / 151
Faculty of Mathematics and Informatics
University of Bucharest
A Finite Automata Engine for Regular Expression Processing
-
Programming language: C++20
-
This project implements a complete pipeline for processing regular expressions using formal language and automata theory principles.
-
It supports converting infix regex expressions to postfix notation, constructing epsilon-NFAs via Thompson's algorithm, transforming NFAs into DFAs using subset construction, and simulating input word acceptance.
-
The application is structured around a modular class architecture and uses JSON-based configuration files for automated testing via a Tester singleton.
-
Core components include FiniteAutomaton, RegToken, and utility classes like StateNode, StateCluster, and PostfixConverter.
-
The project is built with modern C++ practices and leverages the nlohmann_json library for configuration parsing.
- Regular expressions converter from infix to postfix form;
- NFA construction using Thompson's Algorithm, from postfix form;
- Conversion from NFA to DFA using subset construction;
- DFA simulator
- Program testing using configuration prompts stored in JSON files
- The program consists of test verification by executing specific tasks, so the core object of the application is
Tester
(singleton). - Warning! The project uses an external library for JSON files parsing (nlohmann_json). It is included through CMakeLists.txt configuration file, which should guarantee its properly linking:
include(FetchContent)
FetchContent_Declare(
json
URL https://github.com/nlohmann/json/releases/download/v3.11.2/json.tar.xz
)
FetchContent_MakeAvailable(json)
target_link_libraries(REGExLFA PRIVATE nlohmann_json::nlohmann_json)
-
The
tester
class initializes and runs user tests.init()
: Retrieves data about the existing tests from all JSON files in theREGExLFA/tests
directory. The location of this directory is a private attribute of theTester
class and can be changed in the code.run()
: Accesses the retrieved data about each test (stored in an unordered_map) and compares the program's results with the expected ones. It provides the comparison result and deletes all run tests at the end.
-
The
PostfixConverter
class providesgetPostfix(expression)
, which converts an expression from infix to postfix notation.add_concat_symbols
: Adds concatenation symbols to the infix form of the regular expression, according to the validation expression rules.to_postfix
: After applying the concatenation symbols, the postfix form is obtained from the infix form, calculated using the Shunting-Yard algorithm.
-
The
FiniteAutomaton
class encapsulates the logic and structure of a finite automaton, supporting construction from configuration files or postfix regular expressions, and enabling conversion to an NFA or DFA.
-
A
FiniteAutomaton
object can be initialized in two ways:- From a configuration file:
- Defines:
- Alphabet (
Sigma
) - States (with optional start/final indicators)
- Transitions
- Alphabet (
- Defines:
- From a regular expression in postfix form:
- Automatically builds an NFA using the regex.
- From a configuration file:
-
FiniteAutomaton(const std::string &file);
- Parses the file and sets up the automaton:
- Calls
setSigma
,setStates
,setTransitions
, andsetStartState
.
- Calls
- Parses the file and sets up the automaton:
-
sigma
: Set of characters representing the input alphabet.states
: Vector of all state instances.stateMap
: Maps state names to shared pointers.startState
: Pointer to the initial state.
-
-
bool process(const std::string& word) const
:- Checks if a given string is accepted by the automaton.
-
void setSigma(const std::unordered_set<char> &sigma)
:- Sets the alphabet explicitly.
-
std::unordered_set<char> getSigma()
:- Returns the alphabet.
-
static FiniteAutomaton* buildFromRegex(const std::string& postfix)
:- Builds an NFA from a regular expression in postfix notation.
-
void setSigma(const std::vector<std::string> &sigma_)
:- Extracts and sets the alphabet from configuration data.
-
void setStates(const std::vector<std::string> &stateLines)
:- Initializes states and identifies start/final states.
-
void setTransitions(const std::vector<std::string> &transitions)
:- Establishes transitions between states.
-
void setStartState()
:- Sets the start state based on initial state marker.
-
bool inSigma(const char &symbol) const
:- Checks if a character belongs to the alphabet.
-
bool isNondeterministic() const
:- Returns true if any state has multiple transitions for a symbol.
-
-
-
static void DFS( const std::shared_ptr<StateNode>& origin, const std::function<void(std::shared_ptr<StateNode>)>& action, const std::function<bool(const std::pair<std::shared_ptr<StateNode>, std::optional<char>>&)>& validate);
- Performs a depth-first search over the state graph.
- Applies a user-defined
action
to each visited node. - Uses
validate
to control which transitions are followed.
-
static std::vector<StateCluster> LambdaScope(const std::vector<std::shared_ptr<StateNode>>& nodes);
- Computes epsilon (lambda) closures for all states.
- Each
StateCluster
represents the full epsilon-closure of a state.
-
static std::unordered_set<char> extractSigmaFromRegex(const std::string& postfix);
- Extracts valid alphabet characters from a postfix regex.
- Ignores regex operators like
+
,|
,*
,.
etc.
-
FiniteAutomaton* FiniteAutomaton::buildFromRegex(const std::string &postfix);
Builds a finite automaton (FA) — specifically an epsilon-NFA — from a postfix regular expression.
- Initialization:
- Creates a new
FiniteAutomaton
object. - Extracts and sets the alphabet (
Sigma
) from the postfix regex.
- Postfix Tokenization:
- Calls
RegToken::getENFAToken
to generate a linked structure of state transitions based on the postfix regex.
- Graph Traversal:
- Applies depth-first search (
DFS
) from the start node to collect all state nodes in an indexed vector.
- Lambda Closures:
- Computes epsilon closures (lambda transitions) using
LambdaScope
.
- Expansion Table:
- For each state and each symbol in the alphabet, computes where the automaton would go (post-closure) — this forms a DFA-style transition table.
- State Mapping:
- Assigns integer indices to unique state clusters (i.e., sets of NFA states).
- Builds a transition map (
transition_rules
) between these indices.
- FA State Creation:
- Creates FA states named
q0
,q1
, ..., with properinitial
andfinal
flags. - Final states are identified by checking if they contain the original NFA’s
endNode
.
- Transition Linking:
- Links the created states using the populated
transition_rules
.
- Returns:
- The constructed
FiniteAutomaton
instance.
-
bool FiniteAutomaton::process(const std::string& word) const;
- Starts from
startState
. - Iterates through each symbol in the input
word
. - For each symbol:
- Attempts to find a transition from the current state using the symbol.
- If no transition is found, returns
false
. - Otherwise, moves to the next state. - After processing all symbols:
- Returns
true
if the final state is marked asfinal
,false
otherwise.
This method assumes the FA is deterministic (DFA), as it does not handle multiple possible transitions or epsilon transitions.
-
friend std::ostream& operator<<(std::ostream& os, const FiniteAutomaton& fa);
- Prints a formatted representation of the automaton:
- Type (NFA or DFA)
- Alphabet
- States and final states
- Start state
- Transition map for each state
- Prints a formatted representation of the automaton:
- Converts a postfix regex into an epsilon-NFA (ENFA).
- Constructs state graph using
RegToken::getENFAToken()
. - Applies
LambdaScope
to compute closures. - Builds transition table.
- Assigns names like
q0
,q1
, etc., to new states. - Determines final states based on reachability of the regex's terminal node.
- Designed for extensibility and clarity.
- Uses shared pointers for safe memory management.
- Includes validation and warning mechanisms for input consistency.
- DFA minimization is planned but not yet implemented.
-
-
-
The
RegToken
class is a helper structure used to construct ε-NFAs (epsilon non-deterministic finite automatons) from postfix regular expressions. It encapsulates a pair of start and end nodes for a regex fragment and provides logic to combine and transform these fragments into complex automata structures.-
startNode
: A shared pointer to the startingStateNode
of the regex token.endNode
: A shared pointer to the endingStateNode
of the regex token.
-
-
RegToken()
Default constructor. -
RegToken(std::shared_ptr<StateNode> startNode, std::shared_ptr<StateNode> endNode)
Directly initializes with specific start and end nodes. -
RegToken(char symbol, int &i)
Builds a basic automaton for a single character. It creates two new nodes and connects them using the given symbol. -
RegToken(int &i, const std::shared_ptr<RegToken> &lhs, char operation = 'x', const std::shared_ptr<RegToken>& rhs = nullptr)
Combines existing tokens using regex operations (.
,|
,*
,+
,?
).
-
-
static std::shared_ptr<RegToken> getENFAToken(const std::string& postfix, int& nodesCount);
- Converts a postfix regular expression into an ε-NFA.
- Builds tokens for each character and merges them according to postfix rules.
- Throws runtime error if the expression is malformed.
-
isOperator(char c)
– checks ifc
is a regex operator.isBinary(char c)
– checks ifc
is a binary regex operator (.
or|
).
-
-
.
(Concatenation): Joinslhs
andrhs
by linkinglhs.endNode
torhs.startNode
.|
(Union): Creates a new start and end node with epsilon transitions to/from bothlhs
andrhs
.*
(Kleene Star): Adds looping epsilon transitions for zero or more repetitions.+
(One or more): Similar to*
, but requires at least one pass.?
(Optional): Allows either a single pass or skipping the token entirely.
-
- The class assumes valid postfix expressions.
- Uses
StateNode
and epsilon transitions (symbol' '
) to connect sub-automata.
-
State > Header
The
State
struct models a node (or state) in a finite automaton. It encapsulates the identity and behavior of a single state, including its role (initial/final) and its outgoing transitions.-
-
std::string name
A unique identifier for the state (e.g., "q0", "q1"). -
bool initial = false
Indicates whether this state is the initial (start) state of the automaton. -
bool final = false
Marks the state as final (accepting) if set to true. -
std::unordered_multimap<char, std::shared_ptr<State>> transitions
Stores the outgoing transitions from this state.- The key is a
char
representing the input symbol for the transition. - The value is a
shared_ptr
to the targetState
object. unordered_multimap
allows multiple transitions for the same input symbol (needed for NFAs).
- The key is a
-
-
std::shared_ptr<State>
ensures safe memory management when states are shared across structures like graphs.- This struct is often used in building DFAs and NFAs, where transitions and state roles (start/accepting) are fundamental.
- The use of
unordered_multimap
is especially relevant for non-deterministic automata, where a single input symbol can lead to multiple possible next states.
-
-
StateNode > Header
The
StateNode
class represents a state (node) in a finite automaton, primarily used during the construction of epsilon-NFAs (ENFAs).-
int id
– A unique numeric identifier for the node.std::vector<std::pair<std::shared_ptr<StateNode>, char>> connections
– A list of transitions from this node to others. Each pair includes a target node and a transition symbol ('\0'
represents an epsilon transition).
-
explicit StateNode(int id)
– Constructor that assigns the node ID.void appendConnection(std::shared_ptr<StateNode> node, char connection = '\0')
– Adds a transition to the given node using the specified symbol.bool operator<(const StateNode& node) const
– Enables comparison of nodes by ID (for use in sets or maps).std::string toString() const
– Returns a formatted string representation of the node, e.g.,(3)
.
-
-
The
StateCluster
class represents a set of states used to model state closures or compound states during NFA → DFA conversion.-
-
StateCluster stepWith(char symbol, const std::vector<std::shared_ptr<StateNode>>& id_map) const
- Determines the set of nodes reachable via a transition labeled
symbol
from any node in the current cluster.
- Determines the set of nodes reachable via a transition labeled
-
std::string toString() const
- Returns a string representation of the cluster in the format
{(1),(2),...}
.
- Returns a string representation of the cluster in the format
-
-
-
The
StateAssembler
class is a utility used to incrementally build aStateCluster
.-
std::vector<std::shared_ptr<StateNode>> states
– A collection of nodes that will be part of the resulting cluster.
-
void appendState(std::shared_ptr<StateNode> state)
– Adds a single node to the cluster.void appendState(const StateCluster& nodes)
– Appends all nodes from an existingStateCluster
.StateCluster build() const
– Constructs and returns aStateCluster
from the accumulated nodes.
-