
Our original problem

Last year we had a problem where a rogue piece of software on one server spammed our central Graylog Server with so many messages that it caused problems for other applications.

The main problem was that older, useful messages from other applications were purged earlier than normal, because the index filled up with useless messages from the rogue application.

My suggested fix was to give each application its own index, so that no application could starve any other application of log storage space. This would not need any changes to the applications themselves, only changes inside Graylog. Nothing was done, however, as a new Kubernetes-based Graylog solution was being planned.

The solution we were offered

Fast forward to now, and the replacement Graylog system is in the process of being commissioned.

Initially we were told that every application would have its own independent Graylog server (load balancer, GELF endpoints, Graylog nodes, Elasticsearch cluster) and Graylog front-end website.

The problem is that there is a complex relationship between different applications, and having to go to different Graylog web servers for logs from different applications (graylog-application1.site for logs from application1 and graylog-application2.site for logs from application2, rather than just going to graylog.site) was going to make cross-application searches really difficult.

The revised solution

After this was pointed out, a solution was proposed to group applications together by how likely they are to need to be searched together, so now we expect to be given separate Graylog servers per group (application-group-a.site for applications 1 and 3, application-group-b.site for applications 2, 4 and 5, etc.).

I wonder, though, whether this is necessary or sufficient.

It will mean that many likely cross-application searches will be easy to do, but some of the hardest support problems to solve are those that cross less obvious application boundaries, and those searches will no longer be as easy (and may in fact be impossible if you don't know which applications are involved).

People have argued that just having separate indices on a single central Graylog server does not provide sufficient isolation between application groups. They want to be sure that ingress of messages from one application could never interfere with ingress of messages from other applications, so they want complete isolation between groups.

My problem with this is that it wouldn't help with an application within a group going rogue and spamming that group's Graylog server. If we can find a solution that prevents denial of service within a group, we would also have a solution for a single centralised Graylog service.

I would argue that scaling a single central service horizontally, with more load-balanced Graylog nodes, more Elasticsearch nodes, more GELF endpoints, etc., would be a better solution than having dozens of Graylog servers.

Questions

  • Would separate Graylog servers actually provide the level of isolation (denial-of-service mitigation) that people seem to want, when it is all hosted on the same Kubernetes cluster?

  • Can we provide a similar or better level of isolation with a single central Graylog server than with separate group graylog servers?

  • Do other organisations use Graylog in this way, with many front-end websites, or would a single central website to access all logs be expected?

Essentially, I'm looking either to convince myself that I'm worrying about nothing and that this solution is common, or for arguments to convince people that what we are considering is contrary to best practice and we really shouldn't be doing it.

I would really like to find a solution that works for everyone, but it seems to me at the moment that we are rather throwing the baby out with the bath water with our currently proposed solution.

2 Answers

1

Your concerns and questions about the architecture of your Graylog setup are valid. It's important to find a solution that balances the need for isolation and resource management with ease of use and manageability. Let's break it down question by question.

1. Would separate Graylog servers actually provide the level of isolation (denial of service mitigation) that people seem to want, when it is all hosted on the same Kubernetes cluster?

Separate Graylog servers running on the same Kubernetes cluster can provide some level of isolation, but it's important to understand that they are still sharing resources at the cluster level, including network bandwidth, storage, and potentially CPU and memory. If one application within a group goes rogue and starts spamming messages, it can potentially affect the performance of other Graylog servers on the same cluster due to resource contention. While Kubernetes offers resource management capabilities, it's not a silver bullet for preventing resource-related issues.
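
To make the resource-management point concrete, here is a minimal sketch using the official Kubernetes Python client of the kind of per-namespace cap Kubernetes can enforce; the namespace names and limit values are invented for illustration. Note what it does and does not cover: it caps CPU and memory per namespace, but it cannot isolate shared network bandwidth or a shared storage backend, which is exactly the contention described above.

```python
# Sketch: cap what each Graylog group's namespace can consume, so one noisy
# group cannot exhaust the cluster's CPU/memory. This does NOT isolate shared
# network bandwidth or shared storage. Namespaces and limits are illustrative.
from kubernetes import client, config


def apply_group_quota(namespace: str, cpu: str, memory: str) -> None:
    """Create a ResourceQuota limiting total CPU/memory in a namespace."""
    quota = client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="graylog-group-quota"),
        spec=client.V1ResourceQuotaSpec(
            hard={
                "requests.cpu": cpu,
                "requests.memory": memory,
                "limits.cpu": cpu,
                "limits.memory": memory,
            }
        ),
    )
    client.CoreV1Api().create_namespaced_resource_quota(namespace=namespace, body=quota)


if __name__ == "__main__":
    config.load_kube_config()  # use load_incluster_config() when running in-cluster
    # Hypothetical namespaces, one per application group.
    apply_group_quota("graylog-group-a", cpu="8", memory="32Gi")
    apply_group_quota("graylog-group-b", cpu="8", memory="32Gi")
```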

2. Can we provide a similar or better level of isolation with a single central Graylog server than with separate group Graylog servers?

A single central Graylog server can provide isolation between applications using separate indices, but as you mentioned, it may not prevent a rogue application from overloading the server and causing issues for other applications. To mitigate this, you can consider implementing stricter rate limiting and throttling policies on the inputs or sources generating log messages. Additionally, you can scale the central Graylog server horizontally to handle increased load, which can be more cost-effective than managing multiple separate servers.
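
One way to implement that rate limiting without touching Graylog itself is to put a small relay in front of the GELF input that sheds messages from any single source exceeding its budget. The sketch below is an illustration only; the listen/forward addresses and the limits are assumptions, and this is not a built-in Graylog feature.

```python
# Sketch: a tiny GELF UDP relay that drops datagrams from any single source
# exceeding its per-second budget, so a rogue sender is shed before it ever
# fills an index. Addresses and limits are illustrative assumptions.
import socket
import time
from collections import defaultdict

LISTEN = ("0.0.0.0", 12201)                          # where applications send GELF
FORWARD = ("graylog-gelf.example.internal", 12201)   # the real Graylog GELF input
RATE, BURST = 200.0, 500                             # per-source messages/second, burst size

buckets = defaultdict(lambda: {"tokens": float(BURST), "last": time.monotonic()})


def allow(source_ip: str) -> bool:
    """Token bucket per source IP: refill over time, spend one token per message."""
    bucket = buckets[source_ip]
    now = time.monotonic()
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1:
        bucket["tokens"] -= 1
        return True
    return False


inbound = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
inbound.bind(LISTEN)
outbound = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

while True:
    data, (src_ip, _) = inbound.recvfrom(65535)
    if allow(src_ip):
        outbound.sendto(data, FORWARD)   # within budget: pass the datagram through
    # otherwise the datagram is silently dropped (could also be counted/alerted on)
```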

3. Do other organizations use Graylog in this way, with many front-end websites, or would a single central website to access all logs be expected?

The choice of whether to have multiple front-end websites for different Graylog instances or a single central website depends on the organization's specific needs and preferences. Both approaches are valid and have their own advantages and trade-offs.

  • Multiple Front-end Websites: This approach is suitable when different teams or applications require separate Graylog instances for better isolation, control, and organization. It can simplify access control and provide autonomy to different teams. However, it may require users to switch between different interfaces for cross-application searches.

  • Single Central Website: A single central website for accessing all logs provides a unified view of the entire log infrastructure, making cross-application searches easier. This approach simplifies user access but may require more robust access control and permissions management to ensure data separation between different teams or applications.

Ultimately, the choice between these approaches depends on your organization's priorities regarding isolation, ease of use, resource management, and the specific requirements of your applications and teams.

To find a balanced solution that works for everyone, consider the following:

  • Implement strict rate limiting and throttling policies within each Graylog instance to prevent rogue applications from overwhelming the system.
  • Monitor and alert on resource usage and performance across all Graylog instances to detect and address issues proactively (see the sketch after this list).
  • Document and communicate best practices for log management and usage to all teams.
  • Continue to explore the possibility of a centralized Graylog solution with proper resource scaling and access control mechanisms.
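
As a minimal sketch of the monitoring point above (the Elasticsearch URL and the threshold are assumptions), you could periodically sample index sizes via Elasticsearch's _cat/indices API and flag any index growing much faster than its peers, which is exactly the signature of a rogue application filling an index:

```python
# Sketch: sample Elasticsearch index sizes twice and flag indices that grew
# more than a threshold between samples. URL and threshold are assumptions.
import requests

ES_URL = "http://elasticsearch.example.internal:9200"


def index_sizes() -> dict[str, int]:
    """Return {index_name: size_in_bytes} from the _cat/indices API."""
    resp = requests.get(f"{ES_URL}/_cat/indices", params={"format": "json", "bytes": "b"})
    resp.raise_for_status()
    return {
        row["index"]: int(row["store.size"])
        for row in resp.json()
        if row.get("store.size") is not None
    }


def flag_fast_growers(before: dict[str, int], after: dict[str, int], limit_bytes: int) -> list[str]:
    """List indices that grew by more than limit_bytes between two samples."""
    return [name for name, size in after.items() if size - before.get(name, 0) > limit_bytes]


# Usage (e.g. from a cron job): sample, wait, sample again, then alert.
# first = index_sizes(); time.sleep(300); second = index_sizes()
# for idx in flag_fast_growers(first, second, limit_bytes=500 * 1024 * 1024):
#     print(f"ALERT: index {idx} grew suspiciously fast")
```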

All in all, the best approach may involve a combination: separate Graylog instances for logical groups of applications, combined with robust resource management and monitoring practices to keep the overall log management infrastructure stable. It's also important to involve stakeholders from different teams in the decision-making process, so that the chosen solution aligns with their needs and expectations.

-1

The big advantage offered by Graylog (and similar products) is not log storage in isolation, but the correlation between different log sources and recorded events. By using multiple separate Graylog servers, each collecting different sources, multi-source correlation becomes much harder.

I would suggest using a single Graylog server (or a clustered multi-node setup, if a single node is not sufficient) and multiple dedicated indices with different (and appropriate) rotation policies. You can route traffic to these indices via appropriate stream rules (selecting "Remove matches from Default Stream" so each application's messages are stored only in their own index).
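
To illustrate the sending side of that routing (a sketch only; the GELF endpoint is an assumption, and in my experience Graylog drops the leading underscore of GELF custom fields, so the field arrives as "application", though that is worth verifying on your version), each application tags its messages with a field that a stream rule can match on:

```python
# Sketch: send an uncompressed GELF message over UDP, tagged with an
# "_application" field. A stream rule matching the resulting "application"
# field (with "Remove matches from Default Stream" ticked, and the stream
# writing to its own index) keeps each application inside its own retention
# budget. Endpoint and field values are assumptions.
import json
import socket


def send_gelf(application: str, short_message: str, level: int = 6) -> None:
    """Send one GELF message tagged with the originating application."""
    payload = {
        "version": "1.1",
        "host": socket.gethostname(),
        "short_message": short_message,
        "level": level,                 # syslog severity, 6 = informational
        "_application": application,    # GELF custom field used for stream routing
    }
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(json.dumps(payload).encode("utf-8"),
                ("graylog-gelf.example.internal", 12201))


send_gelf("application1", "user login succeeded")
```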

shodanshok