I'm Eran Betzalel from ExposeBox. We're migrating our big data infrastructure from Hadoop/HDFS/HBase VMs to a modern stack using GKE, GCS, and K8ssandra. While the move is exciting, we're currently facing a critical issue, and I'd appreciate any insights.
Issue Overview:
- Commit Log Corruption:
  - Our 8-node Cassandra cluster is experiencing frequent commit log corruptions. The most recent incident hit two Cassandra nodes, yet the entire cluster was affected and came to a complete halt.
  - The error message points to a bad header in one of the commit log files, suggesting an incomplete flush or another disk-related issue.
- Kubernetes Node Failure:
  - A Kubernetes node went down around the time the issue occurred. I'd like to understand how this event might have contributed to the corruption and what steps can shield Cassandra from such disruptions.
- Reliability Concerns:
  - It’s puzzling why corruption on just two nodes cascades to affect the whole cluster.
  - I’m looking for recommendations on enhancing cluster resilience. Are there specific configurations or best practices to ensure that issues on individual nodes don’t compromise the entire cluster?
Questions for the Community:
- Root Cause:
  - How can commit log corruption on only two nodes cause a complete cluster halt?
- Resilience Strategies:
  - What are the best practices for configuring K8ssandra to handle node failures and unexpected Kubernetes disruptions?
  - Are there specific settings or architectural changes that can help prevent such commit log issues from propagating cluster-wide?
- Kubernetes Integration:
  - Given that a K8s node failure was detected around the time of the error, how can we make Cassandra more resilient in a dynamic, containerized environment?
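To frame the root-cause question, here is a minimal sketch of the quorum arithmetic I have in mind. It is only an illustration under assumed values: a replication factor of 3, QUORUM consistency, and a simplified ring in which each range is replicated to the owning node and the next two nodes. None of these is confirmed here as our actual configuration, and the function names are hypothetical.

```python
def quorum(rf: int) -> int:
    """Number of replicas that must respond for a QUORUM read or write."""
    return rf // 2 + 1


def replicas_for_range(primary: int, nodes: int, rf: int) -> list[int]:
    """Simplified placement (assumption): the owner plus the next rf-1 nodes on the ring."""
    return [(primary + i) % nodes for i in range(rf)]


def unavailable_ranges(nodes: int, rf: int, down: set[int]) -> list[int]:
    """Return the ranges (by owner index) that can no longer reach QUORUM."""
    needed = quorum(rf)
    lost = []
    for primary in range(nodes):
        alive = [n for n in replicas_for_range(primary, nodes, rf) if n not in down]
        if len(alive) < needed:
            lost.append(primary)
    return lost


if __name__ == "__main__":
    # Two adjacent nodes down: some ranges lose two of their three replicas.
    print(unavailable_ranges(nodes=8, rf=3, down={2, 3}))
    # Two distant nodes down: every range still has two live replicas.
    print(unavailable_ranges(nodes=8, rf=3, down={2, 6}))
```

Under these assumptions, two failed nodes only make some ranges unavailable, and only when they share replicas. That would explain partial failures, but whether it explains a complete halt depends on how the application reacts to failed requests, which is part of what I'm asking.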
Below is the stack trace for your reference:
org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: Encountered bad header at position 8105604 of commit log /opt/cassandra/data/commitlog/CommitLog-8-1740071674398.log, with bad position but valid CRC
CommitLogReplayer.java:536 - Ignoring commit log replay error likely due to incomplete flush to disk
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:865)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:727)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:345)
at org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:208)
at org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:229)
at org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:205)
at org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:147)
at org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:233)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:97)
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:124)
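
To help correlate the corrupted segment with the Kubernetes node failure, here is a small sketch that pulls the numeric fields out of the commit log file name. It assumes the last number in the name is seeded from the node's wall-clock time in milliseconds, so at best it gives a rough hint of when the writing process started, not an exact write time. The helper names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Matches names like CommitLog-8-1740071674398.log (two numeric fields).
SEGMENT_RE = re.compile(r"CommitLog-(\d+)-(\d+)\.log$")


def parse_segment(path: str) -> tuple[int, int]:
    """Return (first_field, segment_id) parsed from a commit log file path."""
    match = SEGMENT_RE.search(path)
    if match is None:
        raise ValueError(f"not a commit log segment name: {path}")
    return int(match.group(1)), int(match.group(2))


def approx_time(segment_id: int) -> datetime:
    """Interpret the id as epoch milliseconds (assumption), truncated to whole seconds, in UTC."""
    return datetime.fromtimestamp(segment_id // 1000, tz=timezone.utc)


if __name__ == "__main__":
    path = "/opt/cassandra/data/commitlog/CommitLog-8-1740071674398.log"
    first_field, segment_id = parse_segment(path)
    print(first_field, segment_id, approx_time(segment_id).isoformat())
```

The resulting timestamp can then be compared with the time the Kubernetes node was reported as down.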