[Alert] Discussion about indicators for scaling up Kafka #608

Open
bingkunyangvungle opened this issue Oct 12, 2024 · 0 comments

bingkunyangvungle commented Oct 12, 2024

What can we help you with?

This is a place to discuss what we should monitor to decide when to scale up Kafka.

Where would you expect to find this information?

This comes from our own experience of needing to scale up our Kafka clusters.
Normally, we scale up a cluster when disk usage keeps growing and might hit 100%, or when CPU/memory is not sufficient to handle the traffic.
In our case, the issue is that sometimes one or more brokers simply stop cleaning up out-of-date segments (one or two). The local disk then fills up quickly with these uncleaned segments, and restarting the brokers "seems" to resolve the issue temporarily. However, this can happen every day and we can't keep doing the "restart", especially since, as traffic continues to grow, there will be more of these brokers whose local segments can't be cleaned up.
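To catch the symptom before the disk fills up, one option is to scan a broker's log directory for segment files that are older than the retention period. The sketch below is only an approximation under several assumptions: the function name `find_stale_segments`, the `grace_seconds` margin, and the example path are made up for illustration; it assumes the usual `<log dir>/<topic>-<partition>/*.log` layout; and it uses file modification time, whereas Kafka's time-based retention is driven by record timestamps, so treat the results as a hint rather than proof.

```python
import time
from pathlib import Path


def find_stale_segments(log_dir, retention_seconds, grace_seconds=600, now=None):
    """Return (path, age_in_seconds) for segment files older than retention.

    Assumption: file mtime is used as a stand-in for the segment's newest
    record timestamp, which is what the broker actually compares against.
    grace_seconds is a hypothetical margin to allow for the periodic
    retention check and delayed file deletion.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_seconds - grace_seconds
    stale = []
    # One sub-directory per partition, one .log file per segment.
    for segment in Path(log_dir).glob("*/*.log"):
        mtime = segment.stat().st_mtime
        if mtime < cutoff:
            stale.append((str(segment), now - mtime))
    # Oldest segments first, since they are the most suspicious.
    return sorted(stale, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical path and a 7-day retention; adjust to the broker's config.
    for path, age in find_stale_segments("/var/lib/kafka/data", 7 * 24 * 3600):
        print(f"{age / 3600:.1f}h old: {path}")
```

A count of stale segments per broker that keeps growing between runs could be exported as a metric and alerted on, which would flag an affected broker before a restart becomes necessary.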

Details

After trying various approaches, we found that scaling up the cluster (adding more brokers) greatly alleviates the issue, although we still have partitions with uncleaned segments. That naturally raises the question of when we should scale up: which indicator can tell us to scale up before we have to "restart" to resolve the issue? In our case, CPU and memory usage didn't change much after scaling up, but IO wait dropped from about 35% to 25%, which looks like an indicator we can use. Other than this one, we didn't see any other indicator that could help us in this case.
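For reference, the IO wait percentage mentioned above can be derived on Linux from two samples of the aggregate `cpu` line in `/proc/stat`. This is a minimal sketch, not what any particular monitoring agent does internally: the helper names `parse_cpu_times` and `iowait_percent` and the 5-second sampling interval are assumptions chosen for illustration.

```python
import time


def parse_cpu_times(stat_text):
    """Extract the aggregate CPU counters from the text of /proc/stat.

    Returns up to eight values in kernel order:
    user, nice, system, idle, iowait, irq, softirq, steal.
    """
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


if __name__ == "__main__":
    # Linux only: take two samples a few seconds apart.
    with open("/proc/stat") as f:
        first = parse_cpu_times(f.read())
    time.sleep(5)  # hypothetical sampling interval
    with open("/proc/stat") as f:
        second = parse_cpu_times(f.read())
    print(f"iowait: {iowait_percent(first, second):.1f}%")
```

Sampling this per broker over time would at least show whether the brokers that stop deleting segments are the ones with the highest IO wait.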

Discussion

Based on the description above, if IO wait really is the indicator to use, why does a broker sometimes fail to delete segments when IO wait is high? What is the normal range of IO wait for the system to work well? Are there other indicators that could be used in this case, such as throughput or IOPS? Any ideas are welcome for discussion. Thank you!
