Help needed: Apache Ignite data loss when a single node goes down
Let's break down the possible causes and solutions for the data count drop you're seeing when a node goes down. This shouldn't happen with proper backup configuration, so let's dig into the details.
First, Understand the Behavior
When you shut down one node and immediately see a count of 375M instead of 500M, this is more likely tied to the timing of cluster failure detection or to a misconfiguration than to actual data loss (restarting the node brings the count back). Here's how to debug and fix it:
1. Wait for Cluster Failure Detection to Complete
Your cluster has networkTimeout set to 50 seconds, which affects how long Ignite may take to mark a node as failed. If you query data right after shutting down the node, the cluster may not have detected the failure yet, so the backup partitions have not been promoted to primary.
- Fix: After shutting down a node, wait at least 60 seconds (to account for the 50s timeout plus some buffer) before querying data. Check the Ignite logs for messages like `Node [id=...] left topology` or `Partition ownership changed` to confirm the cluster has finished rebalancing and failover.
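Rather than sleeping for a fixed 60 seconds, a test can poll until the cluster reports the expected state. The sketch below is a plain-Java polling helper and does not touch Ignite itself. `FailoverWait` and `awaitCondition` are hypothetical names. In a real test the condition would wrap a cluster check of your choosing, for example something like `ignite.cluster().forServers().nodes().size() == 2`, which is an assumption about your setup and not something taken from your configuration.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Polls a condition until it holds or a deadline passes (hypothetical helper). */
final class FailoverWait {
    private FailoverWait() {}

    /** Returns true as soon as the condition holds, false on timeout or interruption. */
    static boolean awaitCondition(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            try {
                // Sleep for the poll interval, but never past the deadline.
                long remainingMillis = Math.max(1, remainingNanos / 1_000_000);
                Thread.sleep(Math.min(pollInterval.toMillis(), remainingMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in condition: becomes true after roughly 200 ms.
        // Replace it with a real topology check in an actual test.
        long start = System.nanoTime();
        BooleanSupplier topologySettled =
                () -> System.nanoTime() - start > Duration.ofMillis(200).toNanos();

        boolean ok = awaitCondition(topologySettled, Duration.ofSeconds(60), Duration.ofMillis(50));
        System.out.println(ok ? "topology settled, safe to count" : "timed out waiting for failover");
    }
}
```

Only run the count once `awaitCondition` returns true. A false result means failover did not finish within the timeout, which is worth investigating on its own.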
2. Verify Node Role Configuration
Your cache uses an AttributeNodeFilter that only allows nodes with ROLE=data.compute to store data. If:
- The node you shut down was the only one with this role (unlikely, since you have 3 nodes), or
- One of your backup nodes doesn't have the `ROLE=data.compute` attribute
Then the backup partitions won't be available when the primary node goes down, leading to apparent data loss.
- Fix:
  - Use the `ignite node list` command or Ignite Web Console to check the attributes of all 3 nodes. Ensure every node has the `ROLE=data.compute` attribute set in its configuration.
  - Double-check that the node filter is correctly applied to the cache (use the console to view the cache's actual configuration).
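The attribute check above comes down to a simple comparison, sketched here in plain Java with no Ignite dependency. The node names and the `RoleAudit` class are hypothetical. In a live cluster the attribute maps would come from the nodes themselves (for instance via `ClusterNode.attributes()`), which this sketch does not do.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Lists nodes that an attribute-based node filter would exclude (illustrative only). */
final class RoleAudit {
    private RoleAudit() {}

    static List<String> nodesFailingFilter(Map<String, Map<String, String>> attributesByNode,
                                           String attributeName,
                                           String requiredValue) {
        List<String> failing = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> node : attributesByNode.entrySet()) {
            // A missing or different attribute value means the node cannot hold the cache's data.
            if (!requiredValue.equals(node.getValue().get(attributeName))) {
                failing.add(node.getKey());
            }
        }
        return failing;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> cluster = new LinkedHashMap<>();
        cluster.put("node-1", Map.of("ROLE", "data.compute"));
        cluster.put("node-2", Map.of("ROLE", "data.compute"));
        cluster.put("node-3", Map.of("ROLE", "compute")); // hypothetical misconfiguration

        System.out.println("Nodes excluded by the filter: "
                + nodesFailingFilter(cluster, "ROLE", "data.compute"));
    }
}
```

If any node shows up in that list, the cache effectively has fewer data nodes than you think, and backups may have nowhere to live.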
3. Check How You're Counting Data
The way you count data might be excluding partitions that are in a transitional state:
- If you're using `cache.size()`, this method by default only counts entries on primary nodes. When a node is down but not yet marked as failed, its primary partitions are considered "lost" and not included in the count.
- If you're using a SQL `COUNT(*)` query, ensure that the query is configured to read from backups (your `readFromBackup=true` should enable this, but verify that the query isn't restricted to primary nodes).
- Fix:
  - To get a count that includes backup copies during failover, use `cache.size(CachePeekMode.ALL)` (note: this counts both primary and backup entries, so with `backups=1` a fully rebalanced cache reports roughly twice the number of logical entries, not the 500M you see with the default mode; adjust your expectation accordingly).
  - For SQL queries, explicitly enable reading from backups if needed (though `readFromBackup=true` should handle this by default).
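To make the expected numbers concrete, here is a back-of-the-envelope calculation in plain Java. It assumes a partitioned cache that is fully rebalanced, no near cache, and that the 500M figure from the question is the number of logical entries. `ExpectedCounts` and its methods are hypothetical names, not Ignite API.

```java
/** Rough expectations for cache sizes under primary-only and all-copies counting. */
final class ExpectedCounts {
    private ExpectedCounts() {}

    /** Copies kept per entry: one primary plus backups, capped by the number of data nodes. */
    static int copiesPerEntry(int backups, int dataNodes) {
        return Math.min(backups + 1, dataNodes);
    }

    /** Expected size when only primary copies are counted (the default). */
    static long primarySize(long logicalEntries) {
        return logicalEntries;
    }

    /** Expected size when primary and backup copies are both counted. */
    static long allCopiesSize(long logicalEntries, int backups, int dataNodes) {
        return Math.multiplyExact(logicalEntries, (long) copiesPerEntry(backups, dataNodes));
    }

    public static void main(String[] args) {
        long entries = 500_000_000L; // figure from the question, assumed to be logical entries

        System.out.println("primary only:                   " + primarySize(entries));
        System.out.println("all copies, backups=1, 3 nodes: " + allCopiesSize(entries, 1, 3));
        System.out.println("all copies, backups=3, 3 nodes: " + allCopiesSize(entries, 3, 3));
        System.out.println("all copies, backups=1, 2 nodes: " + allCopiesSize(entries, 1, 2));
    }
}
```

The cap in `copiesPerEntry` reflects that a node never holds two copies of the same partition, so `backups=3` on three data nodes still yields at most three copies per entry.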
4. Validate Cluster Discovery Configuration
Your TcpDiscoveryVmIpFinder lists only two node addresses, but you're testing with 3 nodes. If the third node isn't included in the addresses list, it might not be properly integrated into the cluster, leading to incorrect partition distribution or failover behavior.
- Fix: Update the `TcpDiscoveryVmIpFinder` to include all 3 nodes' addresses (e.g., add `<value>x.x.x.3:47500..47509</value>` if it's a third machine). Ensure all nodes use the same ipFinder configuration so they can fully discover each other.
5. Confirm Backup Configuration is Applied
Even though you've tried backups=1, 2, and 3, ensure these settings are actually applied to the cache at runtime. Sometimes cache templates don't inherit settings correctly, or the cache is created with overridden parameters.
- Fix: Use the Ignite Web Console or `ignite cache list` command to inspect the cache's actual configuration. Verify that the `backups` value matches what you set in the template.
Next Steps After Fixing
Once you resolve the above issues, test node failure with new writes:
- Shut down a node and wait for failover to complete.
- Insert new data into the cache.
- Restart the downed node.
- Verify that the new data is present on all nodes after rebalancing.
With proper configuration, Ignite should seamlessly handle writes during node failure, and the restarted node will sync all missing data via rebalancing.
The question comes from Stack Exchange; asked by ashK.




