How is it possible to design and optimize the implementation of SIEM-type infrastructures in complex environments? There are many elements to take into account: the variety of components, infrastructure sizes, limitations on human and financial resources, technological and organizational challenges, internal and external pressures and threats and so on.
This article addresses SIEM and discusses how to build infrastructures that are as efficient and effective as possible. From log collection to real alerts, we will discuss how to run a successful SIEM project when it involves complex or specific elements such as SCADA, IoT, multiple countries, huge networks, heterogeneous and distributed infrastructures, flow concerns, non-resilient links, existing and unknown intrusions, risks of internal and external espionage or sabotage, visitors and internal threats, organizational problems, limited means, etc.
This article is intended to provide feedback from operational experts, but it must also be kept to a reasonable size, so we will take the liberty of bypassing certain details or elements that the attentive reader will recognize. The idea is not to address all the technical points possible in the world, but to share a quick overview. Other ways of thinking or discussions could therefore exist outside of this article.
Prepare your infrastructure
When building a SIEM infrastructure, there are many technical and human challenges, and tools alone cannot meet all your needs. Of course, in the best of all worlds, we could take our time, analyze the entire infrastructure, or even have unlimited budgets and teams.
In reality, when an entity wishes to set up a SIEM project, we generally distinguish between two different cases: those who want a SIEM for somewhat “external” uses (risks around GDPR, traceability/surveillance to be justified for compliance problems, etc.), and those who want a SIEM for “internal” uses (operational aspects of the fight against espionage, etc.).
When the two worlds eventually meet, the decision-makers’ demands will often differ from those of the project sponsor, who may influence choices right down to the technical level. After a complex dance between buyers, lawyers, managers, IT departments, information systems security (ISS) teams, service providers, etc., we can start to set up an efficient infrastructure while optimizing costs. The aim is to avoid a project that ends up mostly managing false positives caused by imperfect technical coverage. The ideal balance is not easy to find, and it may require sacrificing a few elements in order to better balance management and technology.
Some questions that often come up are:
- What is the point of collecting and storing all the syslog messages from network switches, when many of them carry useless event data indicating that some arbitrary interface is up or down?
- What is the point of collecting and storing syslog messages from equipment that has no event data usable for cybersecurity (for example, a Wi-Fi access point that does not share useful technical information)?
These two questions are fairly harmless and simple, but that is exactly the problem with some SIEM projects: you have to know how to arbitrate, because a project that is too extreme could stray away from operational reality. One might want to keep everything without looking at the content, but the company could then need a potentially expensive storage base of unproven utility. Beyond this topic, what we are illustrating here is the art of consensus. In a successful SIEM project, the various decision-makers must be brought together, with experts capable of honoring and understanding individual needs and those of the group, in order to find the highest common denominators offering the broadest basis for compromise on all parameters. Some examples include:
- Financial directors and executives carefully analyzing spending and sales negotiations;
- Admins monitoring the arrival of agents or trace messages in their systems for fear of reduced performance or unnecessary extra work and reconfiguration;
- Developers using SIEM to do something other than security, such as debugging or “clean telemetry”;
- Network administrators apprehensive about bandwidth being consumed by new flows, or by flows that are difficult to estimate in advance;
- Employees, to whom the introduction of positive traceability must be justified: it is not there to monitor what they do, but it may raise fears of surveillance abuse;
- Security teams hoping to finally get a look at their IT infrastructure…
It is obviously a complex art, where the person in charge will find it interesting to surround themselves with efficient, humanly respectful and positive people, in anticipation of situations where historical inner workings and cultural problems (international aspects) may slow down the project.
Unifying the event data
When you want to collect logs, the first problem is the many types of components in an infrastructure. They are sometimes so different that they generate security event data (traces, logs, events) that are a priori incompatible. Some pioneers have come up with ideas on how to treat them in a similar way.
We will cite the example of one standard, even if it is little adopted or known among the most widely deployed business tools in the world. It is called IDMEF, Intrusion Detection Message Exchange Format (RFC 4765, 4766 and 4767) and aims at formatting all event data in a unified way in order to optimize exchanges and analyses.
In practice, when you have limited means and you don’t want to consume CPU to over-process your own raw data, you will try to work directly with it, without going through a large specific standardization process. As a matter of fact, if you don’t like some aspects of XML, you might not appreciate the enrichment and native processing of IDMEF.
Beyond that, what do we really find in proprietary formats? We will mention the two market standards: LEEF with IBM QRadar and CEF with HP ArcSight. The conscientious reader can study the other open formats like MITRE and DMTF, but can also refer to associated expert analyses, for example: [FORMATS]
In general, be reassured that all SIEM products will in any case gather the interesting event data, in their format, to allow you to work on them, whether on known or internal types, in order to bring you the best processing speed and the best storage possible.
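To make the idea of a vendor format more concrete, here is a minimal sketch of a CEF parser. It assumes the publicly documented CEF layout (seven pipe-delimited header fields followed by a key=value extension). The function name `parse_cef` and the dictionary keys are our own choices for illustration, not part of any product API, and the escaping rules are handled only in simplified form.

```python
import re

# Names of the seven pipe-delimited CEF header fields (keys chosen for this sketch).
CEF_HEADER_FIELDS = [
    "version", "device_vendor", "device_product", "device_version",
    "signature_id", "name", "severity",
]

def parse_cef(line):
    """Parse one CEF line into a dict; return None if the line is not CEF."""
    start = line.find("CEF:")
    if start < 0:
        return None
    # Split on pipes that are not escaped with a backslash.
    # Simplification: a header field ending in an escaped backslash is not handled.
    parts = re.split(r"(?<!\\)\|", line[start + 4:], maxsplit=7)
    if len(parts) < 7:
        return None
    event = {
        key: value.replace("\\|", "|").replace("\\\\", "\\")
        for key, value in zip(CEF_HEADER_FIELDS, parts)
    }
    extension = parts[7] if len(parts) > 7 else ""
    # A value runs until the next " key=" or the end of the line.
    pairs = re.findall(r"(\w+)=((?:\\=|[^=])*?)(?=\s+\w+=|\s*$)", extension)
    event["extension"] = {key: value.replace("\\=", "=") for key, value in pairs}
    return event
```

A line may carry a syslog prefix before the `CEF:` marker; the sketch simply skips everything before it.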
In practice, SIEM solutions are generally compatible with all the classic supported formats in order to cover most infrastructures without risk. Depending on our needs, we will therefore be interested in the compatibility of a SIEM with log retrieval formats and methods such as:
- The syslog, as it is supported by almost all products: the old BSD format (RFC 3164) and the current IETF format (RFC 5424, 5425 and 5426);
- Windows-type events, if this operating system is present (the classic system/application/security events, etc., but also those with interesting proprietary applications such as DNS or even DHCP logs, etc.);
- Major standards such as W3C (Web), JSON, XML, CSV, Key-Value format processing where we will appreciate the play on delimiters or escapes especially when installing in environments where any type of event is possible: UCS*, UTF*, etc.;
- Line-by-line reading (logs of many applications) and sometimes multi-line reading, which is often useful for handling proprietary, unknown or otherwise specific formats;
- The ability to use SQL to query databases;
- The ability to recover remote trace files (FTP (?), SSH/SCP/SFTP, GET/POST…) for processing and incorporation;
- The proprietary formats previously mentioned as CEF and LEEF.
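As an illustration of the first item in the list above, the two syslog flavors can be told apart from the first characters of each message. The sketch below is a simplified reading of RFC 3164 and RFC 5424 that ignores many edge cases, structured data in particular. The function name `parse_syslog` and the returned keys are assumptions made for this example.

```python
import re

# RFC 5424: <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID then the rest.
RFC5424 = re.compile(
    r"^<(?P<pri>\d{1,3})>(?P<version>\d{1,2}) (?P<timestamp>\S+) (?P<host>\S+) "
    r"(?P<app>\S+) (?P<procid>\S+) (?P<msgid>\S+) (?P<rest>.*)$"
)
# RFC 3164 (BSD): <PRI>Mmm dd hh:mm:ss HOSTNAME then the rest.
RFC3164 = re.compile(
    r"^<(?P<pri>\d{1,3})>(?P<timestamp>[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<rest>.*)$"
)

def parse_syslog(line):
    """Return a dict describing the message and which format matched."""
    for name, pattern in (("rfc5424", RFC5424), ("rfc3164", RFC3164)):
        match = pattern.match(line)
        if match:
            event = match.groupdict()
            # PRI encodes facility * 8 + severity in both formats.
            event["facility"], event["severity"] = divmod(int(event.pop("pri")), 8)
            event["format"] = name
            return event
    return {"format": "unknown", "rest": line}
```

Anything that matches neither pattern is kept as raw text, which is usually what you want in environments where any type of event is possible.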
There are many other particular formats that exist, even if they are less used in simple SIEM projects, but may sometimes be present if necessary. Without being able to mention them all, here are some examples:
- IBM AIX logs for the audit part (kernel events);
- Apple’s logs with its Apple System Log (ASL) file format;
- The logs of the SUN systems for the Basic Security Module used in audits;
- GELF (Graylog Extended Log Format), which is used when adopting Graylog, and Grok patterns when using Logstash;
- Netflow is also starting to appear in some SIEM-type infrastructures, in order to mix the network aspects to answer the question: who spoke to whom, when, how long, with what “flags” on the session and what protocol, as well as the number and size of packets;
- SNMP traps, which are widely used in the supervisory community;
- Complicated logs from the mainframe world, still very present in some professional circles.
Finally, many logs are available in the cloud, but you have to go and retrieve them regularly, and they often come only in proprietary formats that are sometimes inconsistent. For example, if a company outsources its messaging and collaborative work to a large foreign vendor that is a world leader, it is strongly recommended to regularly record the traces of connections to mailboxes, shared directories, etc. You will be surprised (or not) to find that the mailbox of a domain admin, or of the CEO, is often being read at night from IP addresses thousands of miles away.
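The night-time mailbox access described above could be flagged with a check as simple as the following sketch. It assumes the cloud traces have already been retrieved and normalized into records with hypothetical `user`, `time` and `country` fields. Real provider logs use other names, and mapping an IP address to a country requires an external source that is not shown here.

```python
from datetime import datetime

def suspicious_logins(records, usual_countries, day_start=7, day_end=20):
    """Return the records that are both outside working hours and from an unusual country.

    records: iterable of dicts with hypothetical keys "user", "time" (ISO 8601), "country".
    usual_countries: dict mapping a user to the set of countries they normally connect from.
    Assumption: timestamps are already expressed in the user's local time.
    """
    alerts = []
    for record in records:
        when = datetime.fromisoformat(record["time"])
        at_night = not (day_start <= when.hour < day_end)
        unusual_place = record["country"] not in usual_countries.get(record["user"], set())
        if at_night and unusual_place:
            alerts.append(record)
    return alerts
```

Requiring both conditions keeps the volume low. Depending on the risk appetite, either condition alone could also justify a lower-priority alert.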
Agent or Agent-Less Mode
There are different methods for recovering remote traces from multiple data sources to which particular attention should be paid. For example:
- In “agent-less” mode: you do not need to put an agent on the machine that sends its logs; you usually just need to reconfigure the source to interact with your SIEM. The classic case is syslog, where you tell a source which IP address it should send to;
- In “agent” mode: you need to install one more tool, which will have the mission on the source to retrieve the security event data and send it to your SIEM.
On some operating systems, the native agent-less approach with syslog will not work for you, because you may need to send your logs in a way that is not natively supported, for example with a TLS encryption layer that the product does not offer.
In this case, you will need to find an agent supported on this platform, and it will be responsible for sending the event data. This adds a non-native process to deploy, which will then have to be supervised (its presence, its consumption of local resources, etc.). Some people do not like adding processes to an existing production infrastructure, and it is understandable that the agent-less mode remains the most native and the simplest option, even though it lacks some of the functionality of an agent.
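For an application you control, the agent-less case can be as small as pointing a syslog handler at the collector. The sketch below uses Python's standard `logging.handlers.SysLogHandler` over UDP. The logger name `siem-demo` and the function `make_syslog_logger` are placeholders chosen for this example. Note that this standard handler offers UDP, TCP and Unix sockets but no TLS layer, which is exactly the kind of limitation that can push you toward an agent.

```python
import logging
import logging.handlers

def make_syslog_logger(collector_host, collector_port=514):
    """Return a logger that sends its records to a syslog collector over UDP.

    collector_host / collector_port: address of your SIEM collection point
    (hypothetical values; adapt them to your own infrastructure).
    """
    handler = logging.handlers.SysLogHandler(address=(collector_host, collector_port))
    # The text before the colon plays the role of the syslog TAG.
    handler.setFormatter(logging.Formatter("siem-demo: %(message)s"))
    logger = logging.getLogger("siem-demo")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

Usage is then a one-liner such as `make_syslog_logger("192.0.2.10").warning("failed login for %s", "alice")`, where the address is a documentation placeholder.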
Risks about Limitations
In SIEM projects, there can be a strong economic aspect, especially on the complex subject of trace recovery. Without going into detail, the risks to be controlled include the following:
- The problem with the limit on the number of events per second (EPS) captured by a SIEM: can you accept the risk of not seeing an attack because your license limits the number of lines read by your SIEM?
- The problem with correlations: sometimes limited, optional, not even provided or to be built manually. Can you only collect event data from multiple sources, without thinking about the most useful rules for dealing with the most important attack scenarios?
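The EPS question can be estimated before signing a license by replaying the timestamps of a representative capture. The sketch below is a rough model only: it assumes that any event beyond the licensed rate within a given second is lost, whereas real products often buffer or tolerate bursts. The function name and the report keys are our own.

```python
from collections import Counter

def eps_report(event_timestamps, licensed_eps):
    """Summarize how often a stream of events exceeds a licensed events-per-second rate.

    event_timestamps: iterable of epoch timestamps (seconds, int or float).
    licensed_eps: the EPS limit stated in the license (assumed strict, with no buffering).
    """
    per_second = Counter(int(ts) for ts in event_timestamps)
    over = {second: n for second, n in per_second.items() if n > licensed_eps}
    return {
        "peak_eps": max(per_second.values(), default=0),
        "seconds_over_limit": len(over),
        "events_at_risk": sum(n - licensed_eps for n in over.values()),
    }
```

If the events at risk cluster around the moments when something unusual is happening, which is often when sources log the most, the license limit deserves a second look.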
We will try to avoid integrating mechanically, piece by piece, and instead focus on the whole project, ensuring that the chain remains coherent from the emission of events to their processing, so as not to end up with pieces that are incompatible with the initial needs.
One-off, final or regular intrusion testing by an independent third party can help validate certain decisions, sometimes even during a request for proposal.
Choosing your infrastructure
Of course, a comprehensive and detailed picture of the sources of events and associated flows will need to be determined as efficiently as possible. Let’s take for example someone who has sites in different countries with local authentication services.
They may want to avoid clogging a global network whose remote links have expensive bandwidth (satellite, etc.), only to end up with syslog messages indicating that the printing service has just started on a trainee’s machine. It may be preferable to set up carefully chosen collection points instead, in order to limit the side effects of a link loss and to reduce bandwidth use.
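A local collection point can drop agreed-upon noise before anything crosses the expensive link. The sketch below shows the principle with two illustrative patterns. The patterns themselves are assumptions, and what counts as noise remains a local decision to be validated with the security team.

```python
import re

# Illustrative noise patterns only; each site has to agree on its own list.
NOISE_PATTERNS = [
    re.compile(r"Interface \S+, changed state to (up|down)"),
    re.compile(r"print(ing)? service \S+ started"),
]

def worth_forwarding(line):
    """Return True unless the line matches one of the agreed noise patterns."""
    return not any(pattern.search(line) for pattern in NOISE_PATTERNS)

def filter_batch(lines):
    """Return (lines to forward, number of lines dropped locally)."""
    kept = [line for line in lines if worth_forwarding(line)]
    return kept, len(lines) - len(kept)
```

Reporting the number of dropped lines lets the central site keep an eye on how much is being filtered at each collection point.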
In terms of resilience, it is not uncommon to have to deal with complex situations. Connections can fail anywhere: on a ship at sea, at a factory caught in a sandstorm, or simply through classic network outages. If a network link is lost, falling back to a backup link can be difficult, because backup links may be restricted at the flow level, which limits the actions one can take. Another interesting possibility is to run the correlations as close to the sources as possible, in order to build security alerts that are kept in a local data cache.
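The idea of keeping alerts in a cache while a link is down can be sketched as a small store-and-forward queue. The class name `StoreAndForward` and the `send` callable are hypothetical. The sketch assumes that `send` raises `OSError` (or a subclass such as `ConnectionError`) when the link is unavailable, and it keeps the buffer in memory only, whereas a real collector would persist it to disk.

```python
from collections import deque

class StoreAndForward:
    """Buffer events locally and forward them in order when the link allows it."""

    def __init__(self, send, max_events=10000):
        self.send = send  # callable(event); assumed to raise OSError when the link is down
        # Bounded buffer: when it is full, the oldest event is discarded.
        self.buffer = deque(maxlen=max_events)

    def submit(self, event):
        """Queue an event, then try to empty the queue."""
        self.buffer.append(event)
        return self.flush()

    def flush(self):
        """Send buffered events oldest first; stop at the first failure."""
        while self.buffer:
            try:
                self.send(self.buffer[0])
            except OSError:
                return False  # link still down; keep everything for later
            self.buffer.popleft()  # removed only after a successful send
        return True
```

The bounded size is a deliberate trade-off: on a long outage the site loses its oldest events rather than filling its disk or memory, which is one more reason to cache correlated alerts rather than raw logs.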
Compressing, Indexing, Integrity
Without going into the details of every possible technical solution, once the collection of logs is in place, you will inevitably have choices to make about how to store them. By efficiently organizing these records, the answer to the problem of log usage is already being prepared.
- If you store too much, it may take more time to work on these items.
- If you don’t store enough, you risk losing useful data the day an incident occurs.
- If you index the data so that you can dig into these stores efficiently afterwards, the index itself can sometimes take up a problematic amount of space.
Of course, when looking at the cost of storage, one realizes that it may be very interesting to compress this data rather than record everything raw.
Nevertheless, for legal reasons, it must be ensured that this storage is carried out without damaging the integrity of this data, so as not to have any doubts about the evidence. Fortunately, compression is not a danger, but if you want to add signature mechanisms to stored messages for legal or trust reasons (e.g. via HMAC), you may encounter real performance problems.
On a small network, with few logs to store, this is reasonably achieved, but it costs a certain amount of integration time. On a large network, with a lot of input data, the cost of this signature will be carried over to the hardware to be used (CPU, or even network), as well as the associated resilience, because some solutions require a round trip to an entity that signs, and the latter then becomes a potential Single Point Of Failure.
Therefore it’s not impossible and it’s even interesting (on paper) to sign all the logs, but at the operational level, for those who want an efficient solution that is measured against the real security risks, when you really have a lot of data and other problems to deal with, choices have to be made.
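To give an idea of what signing stored messages involves, here is a minimal sketch using HMAC-SHA256 from the Python standard library. Chaining each tag to the previous one is our own addition, so that a deleted line is detected as well as a modified one; it is not something the approach above requires. Key management, which is the hard part in practice, is deliberately left out.

```python
import hashlib
import hmac

def sign_chain(lines, key):
    """Return a list of (line, tag); each tag also covers the previous tag."""
    previous = b""
    signed = []
    for line in lines:
        tag = hmac.new(key, previous + line.encode("utf-8"), hashlib.sha256).hexdigest()
        signed.append((line, tag))
        previous = bytes.fromhex(tag)
    return signed

def verify_chain(signed, key):
    """Return None if the chain is intact, else the index of the first bad entry."""
    previous = b""
    for index, (line, tag) in enumerate(signed):
        expected = hmac.new(key, previous + line.encode("utf-8"), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, tag):
            return index
        previous = bytes.fromhex(tag)
    return None
```

Even in this local form there is one HMAC computation per line. With a remote signing entity, each of those becomes a network round trip, which is where the performance and single-point-of-failure concerns come from.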
Storage and Hardware
When handling so much data, you may also want to make sure you don’t lose anything. You will then have to look at the cost aspects:
- Do you want to make backups of the disks containing the security event data?
- Do you want to minimize the risk of failure on disks containing security event data with RAID or other solutions?
In general, decision making on this point is not only technical but also financial, and losing raw logs would be a concern. At a minimum, security alerts should be stored elsewhere (incident tickets, alerts derived from raw log analysis, etc.), so that a failure of the raw data silos closest to the field becomes less critical.
Encryption and Robustness
Let’s imagine what happens when you deploy SIEM servers on high-risk sites, for example in distant countries where you do not control all the parameters: the equipment, the hypervisors and the personnel (trainees, subcontractors, various nationalities, non-agent authorizations, espionage cases, strong competition).
By default, it is preferable not to deploy a SIEM on operating systems that have neither encrypted data zones nor encrypted system zones, because the level of guarantee against local attacks would be much lower.
In addition, applications and operating systems should not use vulnerable layers, and it is recommended to harden your system infrastructure. You will avoid having appliances where everything would run without limitation on only one account (root). If you have to set up a SIEM with insecure products or operating systems, it may work, but again, the exposure area would be larger and it is always better to minimize the risks, because you never know where you will be targeted. One must make sure to set up different areas with advanced and thoughtful controls.
Analyze and alert
Now let’s suppose that you have interesting event data from the field, stored intelligently, with useful and protected information. You will want to be able to work on these elements, for example to look for evidence of intrusions or attacks.
Some trace messages are quite easy to analyze, because they immediately correspond to a problem. Sometimes it will be necessary to choose what is acceptable or not acceptable in an entity, knowing that this may also depend on the context.
Let’s take a simple example: a network interface on a company’s front-end VPN gateway enters promiscuous mode at night. Is it an intrusion, with someone intercepting the decrypted streams of your employees?
Jun 10 01:27:55 frontvpn kernel: device tun0 entered promiscuous mode
Maybe some administrators are officially working on it, for example to understand a failure, or to perform migration, optimization or monitoring. Even if we try to look and see if an administrator has officially logged in, maybe his session has been compromised or his workstation is being used as a rebound to go to sensitive machines.
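A first-level rule for the log line above could look like the following sketch. The set of hosts under planned maintenance and the night hours are hypothetical inputs that would come from your own change-management process. As noted above, a planned intervention lowers the priority but does not prove that the session is legitimate.

```python
import re

PROMISCUOUS = re.compile(
    r"^(?P<month>[A-Z][a-z]{2}) +(?P<day>\d{1,2}) (?P<hour>\d{2}):\d{2}:\d{2} "
    r"(?P<host>\S+) kernel: device (?P<device>\S+) entered promiscuous mode$"
)

def triage_promiscuous(line, planned_maintenance_hosts=frozenset(), night=(22, 6)):
    """Return a small triage dict for a promiscuous-mode event, or None if no match.

    planned_maintenance_hosts: hosts with a declared intervention (hypothetical input).
    night: (start_hour, end_hour) of the period considered as night.
    """
    match = PROMISCUOUS.match(line)
    if match is None:
        return None
    hour = int(match["hour"])
    at_night = hour >= night[0] or hour < night[1]
    planned = match["host"] in planned_maintenance_hosts
    if planned:
        severity = "low"  # still worth a look: the admin session itself may be compromised
    elif at_night:
        severity = "high"
    else:
        severity = "medium"
    return {
        "host": match["host"],
        "device": match["device"],
        "at_night": at_night,
        "planned": planned,
        "severity": severity,
    }
```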
Unfortunately, not everything that is contextual is easily visible in the logs. Getting the answers to all the questions can be very costly, and the real effectiveness against current attacks is reduced.
The actual usefulness of certain correlations
Would you couple your physical access control with a SIEM just to know whether people are in a building? To find out whether a person is on leave or absent, you would probably need direct or API access to HR databases. But would you really want to link your SIEM with such information (and would you have the right to do so in certain countries, or with certain employee representative bodies, etc.)?
It may be better to focus efforts on more important security issues, such as making sure that workstations, or websites backed by sensitive databases, cannot be easily hacked, rather than setting up data factories that are very complex to maintain over time. Everyone will decide according to their needs and their desire for technical realism: the number of sources, the complexity of correlations, real efficiency against the attacks of the moment, cost of maintenance (formats that vendors can change overnight, etc.).
In general, several types of correlation are needed, like gears of different sizes. Fast ones are necessary for events concentrated in a short time window, while slower ones are important for other types of attack. When handling several million events per day, for example, searching for slow activity such as a brute-force attack spread over a long period is more difficult. You need good algorithms, but you also need hardware resources. Most detections use thresholds: if an authentication-failure detection is set to 10 per minute for a given account, for example, then an attacker who stays below this threshold could remain invisible. This explains the idea behind the logic of gears with different speeds.
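The gear logic can be sketched as several sliding windows evaluated over the same stream of authentication failures. The fast gear uses the 10-per-minute threshold mentioned above. The slow gear (50 failures in 24 hours) is an arbitrary value chosen for the illustration. The sketch assumes that timestamps arrive in increasing order and keeps everything in memory, which is precisely what becomes expensive at several million events per day.

```python
from collections import defaultdict, deque

class GearedDetector:
    """Count authentication failures per account over several window sizes."""

    def __init__(self, gears):
        # gears: list of (name, window_seconds, threshold)
        self.gears = gears
        self.longest = max(window for _, window, _ in gears)
        self.failures = defaultdict(deque)  # account -> timestamps of recent failures

    def add_failure(self, account, ts):
        """Record one failure at time ts (seconds); return the names of the gears that fire."""
        history = self.failures[account]
        history.append(ts)
        # Forget failures older than the slowest gear's window.
        while history and history[0] <= ts - self.longest:
            history.popleft()
        fired = []
        for name, window, threshold in self.gears:
            count = sum(1 for t in history if t > ts - window)
            if count >= threshold:
                fired.append(name)
        return fired
```

With `GearedDetector([("fast", 60, 10), ("slow", 86400, 50)])`, an attacker trying one password every 10 seconds never triggers the fast gear (6 failures per minute) but is caught by the slow one at the 50th attempt.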
In the best cases, a SIEM comes with default correlation rules that already work in many classic environments, in order to find the usual traces of technical anomalies and classic attacks on Windows, Unix, etc. This avoids complex integration work in which you reason only in terms of attack scenarios and then possibly miss the real technical intrusions.
Finally, it is important to have a console, unified or not, in order to manage heterogeneous event data and to have a real cyber-surveillance. Advanced filtering features and queries on alerts or raw data are very useful for SOC teams.
The overall objective is to move towards a solution that can answer the question: who did what, when, how, where, to where, and even why, for any size of monitored infrastructure. The ability to generate useful reports and statistics for meetings should not limit the technical challenges. It will therefore be necessary to find tools that offer answers to your needs according to your means: justifying compliance, detecting attacks, etc.
Other articles in this MISC file talk about monitoring topics, so these SOC aspects will not be discussed in more detail here, although other organizational topics would be useful to consider, such as RACI matrices for who manages what on a SIEM project, etc.
Before spending your energy, motivation and means, the humble author will conclude by saying that you should not believe that a SIEM alone will contribute to the fight against cyberattacks.
Talk to offensive experts from any company, and they will share their thoughts on these defensive components, the classic SIEM or the classic NIDS. Unfortunately, many unproven arguments try to promote log processing as an absolute solution. Yet everyone already knows that when a machine is compromised, there is not always useful trace data to reveal it. Many attacks leave no log lines, so even if you put all your energy into a large SIEM to centralize and process data, you will remain limited by the initial sensors on the systems and applications. Without good logs, correlation rules and analysts, the monitoring infrastructure is not very efficient.
Setting up a SIEM project is like dealing with data “pipes”; a bit like a complex plumbing situation, where you need to organize and balance treatments, controls, robustness, resilience and even monitoring.
You will therefore try to ensure encryption and robustness, to determine and organize only the data worth keeping so as to optimize its future use, and to avoid unnecessary additional costs by combining efficiency and accuracy.
Of course, there are differences depending on the size of the SIEM. When you process billions of logs from a proxy of a large global account, with all the requests to the Internet, discovering an espionage operation will be complex considering all the places to hide in, for example even on sub-domains of large sites among the most visited in the world.
Article under license (CC BY-NC-ND) – published in MISC Magazine No. 100 – November 2018
[FORMATS] Guillaume Hiet, Hervé Debar, Selim Menouar and Vérène Houdebine, « Étude comparative des formats d’alertes », C&ESAR (Computer & Electronics Security Applications Rendez-vous), November 2015