
Untimely event consumption may cause node OOM #5721

@317787106

Description


Rationale

Java-tron can load event plug-ins through configuration files; currently the MongoDB plug-in and the Kafka plug-in are supported. The plug-in implementations live at https://github.com/tronprotocol/event-plugin. The node consumes events and, through the plug-in, serializes and writes them to MongoDB or streams them to Kafka.

All events are cached in a BlockingQueue:

 private BlockingQueue<TriggerCapsule> triggerCapsuleQueue;
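
The field declaration above does not show a capacity; if the queue is constructed without an explicit bound (the plain new LinkedBlockingQueue<>() default, which is an assumption here), offer() is never rejected and the backlog is limited only by the heap. A minimal, self-contained sketch of that failure mode:

  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.LinkedBlockingQueue;

  public class UnboundedQueueDemo {
    public static void main(String[] args) {
      // Unbounded queue: offer() always succeeds, so producers are never throttled.
      BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
      while (true) {
        // With no consumer draining the queue, every element stays reachable
        // and the process eventually dies with java.lang.OutOfMemoryError.
        queue.offer(new byte[1024 * 1024]);
      }
    }
  }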

There are multiple producers, such as org.tron.core.db.Manager#postTransactionTrigger, which writes each transaction's logs to the queue:

 private void postTransactionTrigger(final TransactionCapsule trxCap,
      final BlockCapsule blockCap) {
    TransactionLogTriggerCapsule trx = new TransactionLogTriggerCapsule(trxCap, blockCap);
    trx.setLatestSolidifiedBlockNumber(getDynamicPropertiesStore()
        .getLatestSolidifiedBlockNum());
    if (!triggerCapsuleQueue.offer(trx)) {
      logger.info("Too many triggers, transaction trigger lost: {}.", trxCap.getTransactionId());
    }
  }

But there is only one consumer: org.tron.core.db.Manager#triggerCapsuleProcessLoop:

  private Runnable triggerCapsuleProcessLoop =
      () -> {
        while (isRunTriggerCapsuleProcessThread) {
          try {
            TriggerCapsule triggerCapsule = triggerCapsuleQueue.poll(1, TimeUnit.SECONDS);
            if (triggerCapsule != null) {
              triggerCapsule.processTrigger();
            }
          } catch (InterruptedException ex) {
            logger.info(ex.getMessage());
            Thread.currentThread().interrupt();
          } catch (Throwable throwable) {
            logger.error("Unknown throwable happened in process capsule loop.", throwable);
          }
        }
      };

processTrigger actually serializes events through the 7 APIs of the plug-in interface IPluginEventListener. If consumers are much slower than producers, the queue backs up. After a while, the node experiences frequent full GC, can no longer synchronize or serve external requests, and may even run out of memory (OOM), eventually leading to data loss.
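
For context, the listener contract on the plug-in side looks roughly like the following. This is an illustrative approximation, not the authoritative definition (which lives in the event-plugin repository); there is one handle* method per trigger type the node can emit.

  // Illustrative approximation of the plug-in listener contract; see
  // https://github.com/tronprotocol/event-plugin for the real interface.
  public interface IPluginEventListener {
    void setServerAddress(String address);      // Kafka broker or MongoDB address
    void setTopic(int eventType, String topic); // map a trigger type to a topic/collection
    void setDBConfig(String dbConfig);          // MongoDB credentials (Mongo plug-in only)
    void start();
    void handleBlockEvent(String payload);      // serialized JSON trigger payloads
    void handleTransactionTrigger(String payload);
    void handleContractLogTrigger(String payload);
    void handleContractEventTrigger(String payload);
    void handleSolidityTrigger(String payload);
    void handleSolidityLogTrigger(String payload);
    void handleSolidityEventTrigger(String payload);
  }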

Possible reasons for slow queue consumption include:

  1. Insufficient bandwidth between the fullnode and the MongoDB server.
  2. MongoDB has no field indexes.
  3. MongoDB's unique index is not set correctly (see the index sketch after this list).
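
For reasons 2 and 3, a hedged sketch of what the missing indexes could look like with the MongoDB Java driver; the connection string, database, collection, and field names (blockNumber, transactionId) are hypothetical and must be matched to the actual event schema:

  import com.mongodb.client.MongoClient;
  import com.mongodb.client.MongoClients;
  import com.mongodb.client.MongoCollection;
  import com.mongodb.client.model.IndexOptions;
  import com.mongodb.client.model.Indexes;
  import org.bson.Document;

  public class EventIndexSetup {
    public static void main(String[] args) {
      try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
        MongoCollection<Document> txTriggers =
            client.getDatabase("eventlog").getCollection("transactionTrigger");
        // A plain field index keeps trigger queries from scanning the whole collection.
        txTriggers.createIndex(Indexes.ascending("blockNumber"));
        // A unique index lets duplicate-safe writes hit the index instead of a scan.
        txTriggers.createIndex(Indexes.ascending("transactionId"), new IndexOptions().unique(true));
      }
    }
  }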

Implementation

One possible approach is to set a maximum and a minimum threshold on the queue length and to start a monitoring thread. When the queue length exceeds the maximum, this thread suspends block synchronization and broadcasting and promptly reminds the user to deal with the queue backlog; when the length falls back below the minimum, it resumes synchronization.
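
A minimal sketch of that idea, assuming hypothetical hooks pauseBlockProcessing()/resumeBlockProcessing() standing in for whatever mechanism the node would use to suspend and resume synchronization or broadcasting:

  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.TimeUnit;

  public class TriggerQueueWatchdog implements Runnable {

    private final BlockingQueue<?> triggerCapsuleQueue;
    private final int maxThreshold; // above this, pause sync/broadcast and warn the operator
    private final int minThreshold; // below this, resume normal operation
    private volatile boolean paused = false;

    public TriggerQueueWatchdog(BlockingQueue<?> queue, int maxThreshold, int minThreshold) {
      this.triggerCapsuleQueue = queue;
      this.maxThreshold = maxThreshold;
      this.minThreshold = minThreshold;
    }

    @Override
    public void run() {
      while (!Thread.currentThread().isInterrupted()) {
        int size = triggerCapsuleQueue.size();
        if (!paused && size > maxThreshold) {
          paused = true;
          pauseBlockProcessing();  // hypothetical hook: suspend sync/broadcast, alert the user
        } else if (paused && size < minThreshold) {
          paused = false;
          resumeBlockProcessing(); // hypothetical hook: resume once the backlog drains
        }
        try {
          TimeUnit.SECONDS.sleep(1);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    }

    private void pauseBlockProcessing() { /* hypothetical integration point */ }

    private void resumeBlockProcessing() { /* hypothetical integration point */ }
  }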


Labels

topic:event subscribe (transaction trigger, block trigger, contract event, contract log), type:bug
