What can we help you with?
This is intended as a place to discuss what we should monitor to decide when to scale up a Kafka cluster.
Where would you expect to find this information?
This comes from our own experience of needing to scale up our Kafka clusters.
In the normal case, we would simply scale up the cluster when disk usage keeps growing and is about to hit 100%, or when CPU/memory are no longer sufficient to handle the traffic.
In our case, however, one or two brokers sometimes simply stop cleaning up their out-of-date segments. The local disks then fill up quickly with these uncleaned segments, and restarting the brokers "seems" to resolve the issue temporarily. However, this can happen every day and we can't keep relying on restarts, especially since, as traffic continues to grow, there will be more brokers whose local segments aren't cleaned up.
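For context, here is a rough sketch of the kind of check we have been thinking about to catch this before the disk fills up: walk a broker's log directory and flag partitions whose oldest segment is well past retention. The log directory path, the retention value, and the use of file mtime as a proxy for segment age are all assumptions (Kafka itself decides retention from record timestamps), so treat it as a heuristic only.

```python
#!/usr/bin/env python3
"""Heuristic check: flag partitions whose oldest .log segment looks much
older than retention, which may mean the broker stopped cleaning them up.
LOG_DIR and RETENTION_HOURS are assumptions -- adjust to your setup."""
import os
import time

LOG_DIR = "/var/lib/kafka/data"   # assumed log.dirs path on the broker
RETENTION_HOURS = 168             # assumed topic retention (7 days)
SLACK_HOURS = 12                  # grace period before calling it "stuck"

now = time.time()
threshold_secs = (RETENTION_HOURS + SLACK_HOURS) * 3600

for partition in sorted(os.listdir(LOG_DIR)):
    pdir = os.path.join(LOG_DIR, partition)
    if not os.path.isdir(pdir):
        continue
    segments = [f for f in os.listdir(pdir) if f.endswith(".log")]
    if not segments:
        continue
    # mtime is only a rough proxy; Kafka retention is based on record timestamps
    oldest_mtime = min(os.path.getmtime(os.path.join(pdir, s)) for s in segments)
    age_hours = (now - oldest_mtime) / 3600
    if now - oldest_mtime > threshold_secs:
        print(f"{partition}: oldest segment ~{age_hours:.0f}h old "
              f"(retention {RETENTION_HOURS}h) -- possibly not being cleaned up")
```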
Details
After trying various approaches, we found that scaling up the cluster (adding more brokers) greatly alleviates the issue (although we still see partitions with uncleaned segments). That naturally raises the question: when should we scale up? Which indicator can tell us it's time to scale up before we're forced to restart brokers? In our case, CPU and memory usage didn't change much after scaling up, but I/O wait dropped from about 35% to 25%, so that looks like an indicator we could use. Beyond that, we didn't find any other indicator that helped in this case.
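To make the I/O wait observation concrete, here is a minimal sketch of how we could sample it ourselves from /proc/stat on a broker host (in practice this would usually come from node_exporter, sar, or similar). The 30% threshold is just a guess sitting between the ~35% we saw before scaling and the ~25% after; it is not a recommended value.

```python
#!/usr/bin/env python3
"""Minimal sketch: sample system-wide iowait% from /proc/stat every 10s.
The warning threshold is an assumption based on our own before/after numbers."""
import time

IOWAIT_WARN_PCT = 30  # assumed threshold, between our observed 35% (bad) and 25% (ok)

def read_cpu_times():
    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq ..."
    with open("/proc/stat") as f:
        return [int(v) for v in f.readline().split()[1:]]

prev = read_cpu_times()
while True:
    time.sleep(10)
    cur = read_cpu_times()
    deltas = [c - p for c, p in zip(cur, prev)]
    prev = cur
    total = sum(deltas)
    iowait_pct = 100.0 * deltas[4] / total if total else 0.0  # index 4 = iowait
    flag = "WARN" if iowait_pct > IOWAIT_WARN_PCT else "ok"
    print(f"[{flag}] iowait over last 10s: {iowait_pct:.1f}%")
```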
Discussion
So, based on the description above: if I/O wait really is the indicator we should use, why does a broker sometimes fail to delete segments even when I/O wait is high? What is a normal range for I/O wait that keeps the system healthy? Are there other indicators that could be used here, such as throughput or IOPS? Any ideas are welcome for discussion. Thank you!
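In case IOPS turns out to be a useful signal, here is a similar sketch for deriving per-device IOPS from /proc/diskstats; the device name is an assumption, and in practice the numbers would normally come from iostat or node_exporter.

```python
#!/usr/bin/env python3
"""Sketch: derive read+write IOPS for one block device from /proc/diskstats.
DEVICE is an assumption -- use whatever disk backs the Kafka log dirs."""
import time

DEVICE = "sda"        # assumed data disk
INTERVAL_SECS = 10

def read_io_completions(device):
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == device:
                # field 4 = reads completed, field 8 = writes completed
                return int(parts[3]), int(parts[7])
    raise ValueError(f"device {device!r} not found in /proc/diskstats")

prev_reads, prev_writes = read_io_completions(DEVICE)
while True:
    time.sleep(INTERVAL_SECS)
    reads, writes = read_io_completions(DEVICE)
    iops = ((reads - prev_reads) + (writes - prev_writes)) / INTERVAL_SECS
    prev_reads, prev_writes = reads, writes
    print(f"{DEVICE}: ~{iops:.0f} IOPS over last {INTERVAL_SECS}s")
```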