Last week we looked at Splunk hardware and identified the different pieces of a typical organisation's deployment. This week we're going to discuss data retention, and next week we'll tie it all together in a Standalone deployment.
Let's begin with how data becomes searchable: data is sent to the indexer, where it goes through the Parsing Pipeline followed by the Indexing Pipeline. Read How Indexing Works for more detail, but essentially:
1) Data is forwarded to an indexer.
2) Data is parsed, fields are extracted, and the data is turned into events.
3) Events are indexed, making the information searchable.
4) Raw data and events are compressed and written to disk.
Note: if using a Heavy Forwarder, part of this process will be done before sending data to the Indexer. See the link above for more details.
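For context, here's a minimal sketch of what the forwarding side of step 1 can look like in outputs.conf on a forwarder; the hostnames are placeholders, and 9997 is simply the conventional receiving port:

```
# outputs.conf on a forwarder -- minimal sketch, hostnames are placeholders
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = indexer01.example.com:9997, indexer02.example.com:9997
```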
Regarding licensing: in a Standalone deployment, the license lives on the Splunk server itself and there's no need for pooling. In a Distributed deployment, you will have a License Master, and all other servers will be License Slaves. In most situations you will have one license pool; however, a license's volume can also be divided into smaller pools as required.
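As a sketch, pointing a License Slave at its License Master is a one-stanza change in server.conf; the hostname below is a placeholder, and 8089 is the default management port:

```
# server.conf on a License Slave -- hostname is a placeholder
[license]
master_uri = https://license-master.example.com:8089
```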
Now that we've got an idea of how searchable events are created, let's cover where and how they're stored.
"Splunk Enterprise stores indexed data in buckets, which are directories containing both the data and index files into the data. An index typically consists of many buckets, organised by age of the data." Reference: Buckets and Indexer Clustering
Hot Bucket: as we now know, data comes into the Indexer and is written to disk once made searchable. While it's being written to disk, it's actually being stored in a hot bucket. Once the bucket reaches its maximum size, the ‘lid is closed’: the bucket is renamed and rolls to a warm bucket. A real-time search actually reads from a hot bucket while it's being written to; as you can imagine, this is costly to the system's performance, and a scheduled search running every ten minutes or so is often more effective.
Warm Bucket: a warm bucket holds slightly older, but still current, data. These buckets are no longer written to and can only be read. Both hot and warm buckets should be kept on the fastest storage you have.
Cold Bucket: again read-only. When the index reaches a certain size, or a set number of warm buckets is reached (both of which can be customised), warm buckets roll over to cold, oldest first. Cold data can be stored in a separate location from the hot/warm buckets and is still searchable.
Frozen: by default, data is deleted from cold storage once it exceeds a set age or once the index exceeds a maximum size. If you change the configuration to archive data instead, Splunk will move the oldest data to frozen storage. This data can be kept on the slowest storage you have, even removable media, but frozen data cannot be searched.
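Putting the whole lifecycle together, here's a hedged indexes.conf sketch for a hypothetical index; every value is illustrative, not a recommendation:

```
# indexes.conf -- illustrative values for a hypothetical index
[my_index]
homePath   = $SPLUNK_DB/my_index/db         # hot + warm buckets (fastest storage)
coldPath   = $SPLUNK_DB/my_index/colddb     # cold buckets (can be slower storage)
thawedPath = $SPLUNK_DB/my_index/thaweddb   # destination for thawed buckets

maxDataSize = auto                  # size at which a hot bucket rolls to warm
maxWarmDBCount = 300                # warm buckets allowed before the oldest roll to cold
frozenTimePeriodInSecs = 15552000   # ~180 days; buckets older than this roll to frozen
maxTotalDataSizeMB = 500000         # hard cap on index size; oldest buckets frozen first
# coldToFrozenDir = /mnt/archive/my_index   # uncomment to archive instead of delete
```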
Note: buckets are named using the timestamps of the newest and oldest events they contain, which is how Splunk knows which buckets to open for a given search. Secondly, because retention is enforced by both age and size, it is possible for data to be deleted before it reaches the configured age if the index hits its maximum size; Splunk will never exceed the maximum index size.
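For illustration, a rolled bucket's directory name encodes those two epoch timestamps plus a local ID, along these lines:

```
db_1389230491_1389230488_5
   ^newest    ^oldest    ^local bucket ID (times are Unix epochs)
```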
Thawed: if you need to investigate data that has been moved to Frozen, thaw it; that is, move it from frozen to thawed storage, which re-enables searching. This can be handy when investigating an incident, or even for network planning. Read Restore Archived Indexed Data for the full procedure.
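The restore procedure is roughly: copy the archived bucket into the index's thaweddb directory, then rebuild it. A hedged sketch, reusing the hypothetical index and bucket from above as placeholder paths; see the linked doc for the authoritative steps:

```
# copy the archived bucket into the thawed path (paths are placeholders)
cp -r /mnt/archive/my_index/db_1389230491_1389230488_5 $SPLUNK_DB/my_index/thaweddb/

# rebuild the bucket's index files so it becomes searchable again
splunk rebuild $SPLUNK_DB/my_index/thaweddb/db_1389230491_1389230488_5
```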
See How the Indexer Stores Indexes for more; there's a great table that summarises the above. Or, if you'd like a more advanced view, see Buckets and Indexer Clusters, which covers buckets and bucket replication in clustered environments.
The last piece I want to mention in this post is how to calculate storage requirements. The best approach, and what Splunk actually recommends, is to deploy a test environment and send as much representative data into it as possible, to give you an idea of what license you're looking at. Splunk Sizing is a helpful starting point and will give you an idea of what to expect, but ultimately it's best to test. There are ways to reduce license usage, but you must decide what is acceptable to your organisation, as once data is deleted, it is not recoverable.
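As a rough worked example, a commonly cited rule of thumb is that indexed data ends up occupying about half its raw size on disk (roughly 15% for the compressed raw data plus 35% for the index files). Treat the figures below as assumptions to validate in your test environment, not guarantees:

```
daily ingest        : 100 GB/day   (assumed)
retention           : 90 days      (assumed)
disk-to-raw ratio   : ~0.5         (rule of thumb: ~15% raw + ~35% index files)
estimated disk need : 100 GB × 90 × 0.5 ≈ 4.5 TB
```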
You cannot create data that was never saved, and you cannot bring back data that was deleted. You can, however, index data that was previously generated but never indexed.
Final note: if you find any inaccuracies, please do let me know.