"Solana is down, but up only"
Why are 75% of transactions not going through, yet it's only bull from here?
Understand Solana tx lifecycle
First, we start off with a simple diagram:
Unique features making Solana high-throughput
This part is explained in layman's terms
Proof of History
Cryptographic timestamping built into the ledger itself, so the order of events is easy to verify and track
Leader selection
A validator gets elected as leader for a given slot (a fixed period of time), during which it alone produces blocks
Sealevel
Multiple smart contracts can execute at the same time, as long as they don't touch the same accounts (i.e., don't conflict with each other)
Gulf stream
Upcoming leaders can see transactions ahead of their slot, because validators forward them in advance
Tower BFT
Validators vote on the state faster – an optimized consensus algorithm that uses PoH as a clock
Mega Hardware
High-end GPUs, SSDs, and disks to maintain the state at high speed
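To make the Sealevel idea above concrete, here is a minimal sketch of conflict detection between two transactions. The struct and account names are illustrative, not Solana's actual types – the real runtime works with declared account read/write sets in a similar spirit.

```rust
use std::collections::HashSet;

// A simplified view of a transaction's declared account access.
struct TxAccess {
    reads: HashSet<&'static str>,
    writes: HashSet<&'static str>,
}

// Two transactions conflict if either one writes an account
// the other touches (reads or writes).
fn conflicts(a: &TxAccess, b: &TxAccess) -> bool {
    !a.writes.is_disjoint(&b.reads)
        || !a.writes.is_disjoint(&b.writes)
        || !b.writes.is_disjoint(&a.reads)
}

// Non-conflicting transactions can be scheduled on separate cores.
fn can_run_in_parallel(a: &TxAccess, b: &TxAccess) -> bool {
    !conflicts(a, b)
}
```

Because conflicts are decided purely from the declared account sets, the scheduler never has to execute a transaction to know whether it can run alongside another.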
Critical issue on high-throughput networks like Solana
Solana measures network performance in packets per second (pps), because the network is designed to handle high volume (averaging around 500 transactions per second). Ethereum, on the other hand, commonly uses the block gas limit (the maximum amount of computational work in a block), which currently sits around 30 million gas units – roughly 20-23 TPS.
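The gas-to-TPS relationship can be back-of-enveloped. A simple ETH transfer costs 21,000 gas and blocks arrive roughly every 12 seconds, so the theoretical ceiling for pure transfers is well above the observed 20-23 TPS – real blocks are full of much heavier contract calls.

```rust
// Theoretical TPS ceiling given a block gas limit, a per-transaction
// gas cost, and the block time. 21_000 gas is a plain ETH transfer;
// heavier contract calls push observed TPS far below this ceiling.
fn eth_theoretical_tps(gas_limit: u64, gas_per_tx: u64, block_time_secs: u64) -> u64 {
    (gas_limit / gas_per_tx) / block_time_secs
}
```

With the numbers above: 30,000,000 / 21,000 ≈ 1,428 transfers per block, divided by 12 seconds ≈ 119 TPS – the absolute best case for the simplest possible transactions.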
Precursor to the problem: high packet rate
First, what is a packet?
It is the fundamental unit of data transmitted over a network, consisting of a header and a payload. The header contains control information; the payload carries the actual data. When data is sent across the internet, it is divided into smaller packets – this is how IP networking works.
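Here is a toy illustration of that header/payload split and of chunking a larger message into packets. The field names are illustrative, not a real wire format; 1,232 bytes is used as the maximum payload because that is Solana's packet data size (1,280-byte IPv6 MTU minus headers).

```rust
// A toy IP-style packet: control information in the header,
// actual data in the payload.
struct Header {
    source: [u8; 4],      // e.g. an IPv4 source address
    destination: [u8; 4], // e.g. an IPv4 destination address
    payload_len: u16,
}

struct Packet {
    header: Header,
    payload: Vec<u8>,
}

// Split a message into fixed-size packets, as IP networking does.
fn packetize(data: &[u8], max_payload: usize, src: [u8; 4], dst: [u8; 4]) -> Vec<Packet> {
    data.chunks(max_payload)
        .map(|chunk| Packet {
            header: Header {
                source: src,
                destination: dst,
                payload_len: chunk.len() as u16,
            },
            payload: chunk.to_vec(),
        })
        .collect()
}
```

A 3,000-byte message, for instance, would become three packets: two full ones and a 536-byte remainder.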
Second, what is packet-per-second?
As the name suggests – the rate at which network packets are transmitted or processed by a device. 1 million packets per second (1Mpps) means 1 million individual packets are processed every second.
Solana’s transaction rate is measured in packets per second. Today, it receives anywhere between 100,000 and 10 million packets per second due to its high-throughput design.
Limited Time per Packet: If we take the peak of 10 million packets per second (10Mpps) and assume processing is distributed across 10 threads, each thread can only spend 1 microsecond (1µs) on each packet. This is a very tight time constraint, considering the complexity of operations that need to be performed on each packet, such as validation, routing, and consensus-related computations.
Processing Time Constraints: Many operations required for processing packets, such as cryptographic computations, signature verification, and database lookups, can take longer than 1µs. This means that not all operations can be completed within the allotted time, leading to delays and potential bottlenecks.
Packet Dropping: When the system can't keep up with the incoming packet rate, it may start dropping packets to prevent overload. This can be problematic as some of these packets might be more important than others – especially those that are either real transactions, critical consensus or control information. Dropping important packets can lead to issues like network instability, reduced throughput, and delays in transaction processing.
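The per-packet time budget is simple arithmetic, and checking it shows how unforgiving it is: at the quoted peak of 10Mpps spread over 10 threads, each thread gets only 1µs per packet; even at a tenth of that load, the budget is just 10µs.

```rust
// Nanoseconds each thread can spend on one packet, given an overall
// packet rate and a number of worker threads sharing the load evenly.
fn budget_nanos_per_packet(packets_per_sec: u64, threads: u64) -> u64 {
    let per_thread = packets_per_sec / threads; // packets per thread per second
    1_000_000_000 / per_thread                  // nanoseconds per packet
}
```

A single Ed25519 signature verification alone can eat most of a microsecond-scale budget, which is why batching and GPU offload matter so much at these rates.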
For the dummies:
- Imagine a post office with 10 workers handling 1 million letters per day
- Each worker must get through 100,000 letters, which leaves them less than a second per letter
- However, some tasks, like checking addresses or sorting routes, take longer than 1 second.
- To avoid a backlog, workers start discarding letters they can't process in time
- So…. this leads to important letters being thrown away, causing delays and issues for senders and recipients
Similarly, a network system can struggle to handle a massive number of data packets per second, leading to dropped packets and potential communication problems.
In addition…
Packet dropping is further exacerbated by the fact that the single elected leader can only maintain so many connections
Submitting transactions is essentially free, hence the misalignment of incentives – spamming costs an attacker almost nothing
Bots are better at submitting transactions than real users – hence real transactions get crowded out and dropped
Solutions to this may involve optimizing packet processing algorithms, improving hardware capabilities, and implementing smarter packet prioritization and management strategies.
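One of those "smarter prioritization" strategies can be sketched as a bounded priority queue: when the queue is full, the lowest-priority packet is evicted first, so consensus and vote traffic survives a flood of spam. The priorities, labels, and capacity here are illustrative, not any validator's actual parameters.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// A bounded queue that evicts the lowest-priority packet when full.
struct PacketQueue {
    capacity: usize,
    // Reverse turns the max-heap into a min-heap on priority, so the
    // lowest-priority entry is always the cheapest to find and evict.
    heap: BinaryHeap<Reverse<(u64, &'static str)>>,
}

impl PacketQueue {
    fn new(capacity: usize) -> Self {
        Self { capacity, heap: BinaryHeap::new() }
    }

    // Insert a packet; returns the label of the evicted packet, if any.
    fn push(&mut self, priority: u64, label: &'static str) -> Option<&'static str> {
        self.heap.push(Reverse((priority, label)));
        if self.heap.len() > self.capacity {
            self.heap.pop().map(|Reverse((_, l))| l)
        } else {
            None
        }
    }
}
```

With this policy, a low-priority spam packet is dropped to make room for a higher-priority one – instead of the arrival-order dropping described above, where important packets are as likely to be lost as junk.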
Upcoming release
The Agave validator client addresses this issue in the upcoming 1.18 release through changes to its implementation of the QUIC protocol, a transport layer designed to improve the performance of connection-oriented web applications
What is QUIC and key features:
QUIC (Quick UDP Internet Connections) is a transport-layer network protocol, built on top of UDP, that helps manage the large number of packets being transmitted and processed.
How can QUIC help?
The QUIC protocol replaces raw UDP. It provides rate limiting via a unique connection ID (CID) per connection – allowing servers to apply rate limits per CID – and anti-spoofing capabilities that let validators implement advanced anti-DDoS techniques
Multiplexing – QUIC allows multiple streams of data to be transmitted concurrently over a single connection (a significant improvement over TCP, which carries one ordered stream, so one lost packet blocks everything behind it)
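The per-CID rate limiting above can be sketched as a fixed-window counter keyed on the connection ID. The window length and limit are made-up numbers for illustration; a real validator's rate limiter is more sophisticated (and stake-aware, as discussed later).

```rust
use std::collections::HashMap;

// Fixed-window rate limiter keyed on a QUIC connection ID.
struct RateLimiter {
    max_per_window: u32,
    counts: HashMap<u64, u32>, // connection ID -> packets seen this window
}

impl RateLimiter {
    fn new(max_per_window: u32) -> Self {
        Self { max_per_window, counts: HashMap::new() }
    }

    // Returns true if the packet is allowed, false if this CID is throttled.
    fn allow(&mut self, connection_id: u64) -> bool {
        let count = self.counts.entry(connection_id).or_insert(0);
        *count += 1;
        *count <= self.max_per_window
    }

    // Called at the start of each window (e.g. every 100ms).
    fn reset_window(&mut self) {
        self.counts.clear();
    }
}
```

This is exactly what raw UDP cannot offer: with no connection concept and spoofable source addresses, there is no reliable key to rate-limit on.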
Challenges of QUIC
QUIC maintains state for each connection, which can be resource-intensive for nodes to handle
QUIC is encrypted by default, which adds overhead – especially on chains that are already heavy on cryptographic operations
Bursts in transaction volume can be unpredictable for QUIC to handle effectively
Transactions and blocks need to be broadcast globally to all nodes, which can lead to simultaneous congestion across multiple network paths, making it challenging for QUIC's end-to-end congestion control to respond effectively
Priority transactions or messages need to be separately implemented for the QUIC congestion control
Currently, Solana's TPU/QUIC endpoints take far too long to process transactions (for the reasons outlined above)
Overview of Recent Updates
Staked vs Non-Staked Packets: The network now differentiates between packets sent by staked and non-staked nodes, prioritizing or throttling based on the node’s vested interest in the network's integrity.
QUIC Optimization: The implementation of the QUIC protocol now uses SmallVec to aggregate data chunks, reducing memory allocation by one per packet, which enhances packet handling efficiency.
BankingStage Forwarding Filter: A new filter in the BankingStage of Solana’s pipeline tightens transaction forwarding criteria, especially favoring transactions from staked nodes, which may increase throughput and reduce spam.
Data Streams for Staked Nodes: Staked nodes now face stricter requirements for the number of data streams they must manage every 100ms, ensuring robustness and active contribution to network performance.
Streamer QoS for Low Staked Nodes: Nodes with minimal stakes are treated similarly to unstaked nodes in terms of Quality of Service to discourage low contributions while maintaining network participation.
LocalCluster Default Staked Client: The default client configuration in Solana’s LocalCluster now includes a staked status to more accurately simulate real-world testing and development conditions.
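The staked-vs-non-staked differentiation running through these updates can be sketched as a capacity function: a node's share of the leader's inbound stream capacity scales with its share of total stake, with unstaked nodes falling back to a small shared floor. The numbers and formula here are illustrative, not Agave's actual parameters.

```rust
// Illustrative stake-weighted QoS: how many concurrent streams a node
// is allowed, given its stake relative to the total.
fn max_streams_for_node(
    node_stake: u64,
    total_stake: u64,
    staked_capacity: u64, // pool of streams shared among staked nodes
    unstaked_floor: u64,  // small allowance for unstaked nodes
) -> u64 {
    if node_stake == 0 {
        unstaked_floor
    } else {
        // The floor plus a stake-proportional share of the capacity pool.
        unstaked_floor + staked_capacity * node_stake / total_stake
    }
}
```

The design intent is the incentive alignment described above: spamming capacity now requires locking up stake, so the cost of flooding the leader is no longer zero.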