Help Request: Apache Ignite Appears to Lose Data When a Single Node Goes Down

阿华AIGC实验室

2026-5-12

Troubleshooting Data Loss Issue in Apache Ignite Cluster

Let's break down the possible causes and solutions for the data count drop you're seeing when a node goes down. With a proper backup configuration this shouldn't happen, so let's dig into the details.

First, Understand the Behavior

When you shut down one node and immediately see a count of 375M instead of 500M, this is likely tied to the timing of cluster failure detection or to a misconfiguration rather than to actual data loss (restarting the node brings the count back). Here's how to debug and fix it:

1. Wait for Cluster Failure Detection to Complete

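Your cluster has networkTimeout set to 50 seconds. Long timeouts like this can delay how quickly Ignite notices that a node is gone (failure detection itself is normally governed by failureDetectionTimeout, which defaults to 10 seconds, unless the discovery SPI's individual timeouts are set explicitly). If you query data right after shutting down the node, the cluster may not yet have detected the failure and promoted the backup partitions to primary.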

Fix: After shutting down a node, wait at least 60 seconds (to account for the 50s timeout plus some buffer) before querying data. Check the Ignite logs for messages such as Node left topology or Node FAILED, followed by a new Topology snapshot line, to confirm that the cluster has registered the departure and finished failover.
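Rather than sleeping for a fixed 60 seconds, a test can poll until the cluster reports the expected state. The helper below is a minimal sketch in plain Java; the class and method names are hypothetical, and the condition is left to the caller. In a live cluster the condition might compare ignite.cluster().forServers().nodes().size() with the expected number of remaining servers, but that wiring is an assumption and is not shown here.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper: polls a caller-supplied condition until it holds or a timeout expires.
final class FailoverWait {
    private FailoverWait() {}

    /**
     * Returns true as soon as the condition holds, or false if the timeout
     * elapses (or the thread is interrupted) before that happens.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            // Subtraction keeps the comparison safe if nanoTime() wraps around.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

With the 50s timeout from the question, a call such as waitUntil(condition, Duration.ofSeconds(60), Duration.ofSeconds(1)) would give the cluster the same 60-second budget while returning early when failover finishes sooner.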

2. Verify Node Role Configuration

Your cache uses an AttributeNodeFilter that only allows nodes with ROLE=data.compute to store data. If:

- The node you shut down was the only one with this role (unlikely, since you have 3 nodes), or
- One of your backup nodes doesn't have the ROLE=data.compute attribute

Then the backup partitions won't be available when the primary node goes down, leading to apparent data loss.

Fix:
- Check the attributes of all 3 nodes, for example with Ignite Web Console or programmatically through ClusterNode.attributes(). Ensure every node has the ROLE=data.compute attribute set in its configuration.
- Double-check that the node filter is correctly applied to the cache (use the console to view the cache's actual configuration).
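The attribute check can be automated once the attributes of each node are collected. The sketch below is plain Java and works on ordinary maps; the class name, method name, and node labels are hypothetical. In a real cluster the per-node maps could come from ClusterNode.attributes() for each node in ignite.cluster().forServers().nodes(), which is assumed here rather than shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper: reports which nodes would be rejected by a filter on one attribute.
final class RoleCheck {
    private RoleCheck() {}

    /**
     * Returns the sorted names of nodes whose attribute attrName is missing
     * or differs from the expected value.
     */
    static List<String> nodesMissingRole(Map<String, ? extends Map<String, ?>> attrsByNode,
                                         String attrName,
                                         String expected) {
        List<String> missing = new ArrayList<>();
        for (String node : attrsByNode.keySet()) {
            Map<String, ?> attrs = attrsByNode.get(node);
            Object value = attrs == null ? null : attrs.get(attrName);
            if (!expected.equals(value)) {
                missing.add(node);
            }
        }
        Collections.sort(missing);
        return missing;
    }
}
```

An empty result for nodesMissingRole(attrs, "ROLE", "data.compute") means every node passes the filter and can hold primary or backup partitions; any name in the list is a node that cannot.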

3. Check How You're Counting Data

The way you count data might be excluding partitions that are in a transitional state:

- If you're using cache.size(), this method by default counts only primary copies. While a node is down but not yet marked as failed, the entries in its primary partitions are treated as "lost" and are not included in the count.
- If you're using a SQL COUNT(*) query, ensure that the query is configured to read from backups (your readFromBackup=true should enable this, but verify that the query isn't restricted to primary nodes).

Fix:
- To get an accurate count during failover, use cache.size(CachePeekMode.ALL) (note: this counts both primary and backup entries, so with backups=1 and 500M entries you'll see roughly 1,000M in total; adjust your expectation accordingly).
- For SQL queries, explicitly enable reading from backups if needed (though readFromBackup=true should handle this by default).
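To make the expected numbers concrete, the arithmetic behind a CachePeekMode.ALL count can be sketched as follows. This is a simplified model with hypothetical names: it assumes a partitioned cache whose copies are all in place (no rebalancing in progress), and it relies on the fact that a partition has at most one copy per data node, so the number of copies is capped by the number of data nodes.

```java
// Hypothetical helper: estimates how many entries a count over all copies should report.
final class CopyCount {
    private CopyCount() {}

    /**
     * Expected number of entries across primary and backup copies.
     *
     * @param primaryEntries number of distinct entries in the cache
     * @param backups        configured number of backups
     * @param dataNodes      number of nodes allowed to hold the cache's data
     */
    static long expectedAllCopies(long primaryEntries, int backups, int dataNodes) {
        if (primaryEntries < 0 || backups < 0 || dataNodes < 1) {
            throw new IllegalArgumentException("invalid arguments");
        }
        // One primary plus the backups, but never more copies than data nodes.
        int copies = Math.min(1 + backups, dataNodes);
        return Math.multiplyExact(primaryEntries, (long) copies);
    }
}
```

For the 500M entries in the question, backups=1 on 3 data nodes gives 1,000M, and backups=3 on 3 data nodes gives 1,500M rather than 2,000M, because only three copies can be placed.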

4. Validate Cluster Discovery Configuration

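Your TcpDiscoveryVmIpFinder lists only two node addresses, but you're testing with 3 nodes. A static IP finder doesn't strictly have to list every node, because a joining node only needs to reach one live member. Still, if none of the listed addresses is reachable when a node starts or restarts, that node may fail to rejoin the existing cluster, which can lead to incorrect partition distribution or failover behavior.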

Fix: Update the TcpDiscoveryVmIpFinder to include all 3 nodes' addresses (e.g., add <value>x.x.x.3:47500..47509</value> if it's a third machine). Ensure all nodes use the same ipFinder configuration so they can fully discover each other.

5. Confirm Backup Configuration is Applied

Even though you've tried backups=1, 2, and 3, ensure these settings are actually applied to the cache at runtime. Sometimes cache templates don't inherit settings correctly, or the cache is created with overridden parameters.

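Fix: Use the Ignite Web Console or the control.sh --cache list command to inspect the cache's actual configuration, or read it programmatically with cache.getConfiguration(CacheConfiguration.class).getBackups(). Verify that the backups value matches what you set in the template.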

Next Steps After Fixing

Once you resolve the above issues, test node failure with new writes:

- Shut down a node and wait for failover to complete.
- Insert new data into the cache.
- Restart the downed node.
- Verify that the new data is present on all nodes after rebalancing.

With proper configuration, Ignite should seamlessly handle writes during node failure, and the restarted node will sync all missing data via rebalancing.

This question comes from Stack Exchange; it was asked by ashK.