In an earlier blog post, “Log monitoring and unstructured log data, moving beyond tail -f,” we talked about collecting and working with unstructured log data. We learned that it’s very easy to add data to the Elastic Stack. So far, the only parsing we did was to extract the timestamp from this data so that older data gets backfilled correctly.
We also talked about searching this unstructured data toward the end of the blog. While unstructured data can be incredibly useful when combined with full text search functionality, there are cases where we need a little more structure to use the data to answer our questions.
Schema on write or schema on read — why not both?
Schema on write remains the default way that Elasticsearch handles incoming data: all fields in a document are indexed as the document is ingested. This is what makes running searches in Elastic so fast, regardless of the volume of data returned or the number of queries executed. It’s also a big part of what our users love about Elastic.
Schema on write works really well if you know your data and how it’s structured before ingest. That way, the schema (logical view of data structure) can be fully defined in the index mapping. It also requires sticking to that defined schema when queries are run against the index. In the real world, however, monitoring and telemetry data can often change. New data sources may appear in your environment, for example. An added layer of flexibility to dynamically extract or query new fields after the data has been indexed adds tremendous value, even if it comes at a slight cost to performance.
That’s where schema on read comes in. Data can be quickly ingested in raw form without any indexing, except for certain necessary fields such as timestamp or response codes. Other fields can be created on the fly when queries are run against the data. You don’t need to have intimate knowledge of your data ahead of time, nor do you have to predict all the possible ways that the data may eventually be queried. You can change the data structure at any time, even after the documents have been indexed — a huge benefit of schema on read.
Here’s what’s unique about how Elastic has implemented schema on read. We’ve built runtime fields on the same Elastic platform — the same architecture, the same tools, and the same interfaces you’re already using. There are no new datastores, languages, or components, and there’s no additional procedural overhead. Schema on read and schema on write work well together and seamlessly complement each other, so that you can decide which fields to calculate when a query requires them and which fields to index when your data is ingested into Elasticsearch.
By offering you the best of both worlds on a single stack, we make it easy for you to decide which combination of schema on write and schema on read works best for your specific use cases.
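To make the split between indexed and runtime fields concrete, here is a minimal sketch in Python that only builds and prints an index mapping body, without contacting a cluster. It follows the grok-based runtime field pattern shown in the Elasticsearch documentation. The field names, the `wildcard` type for `message`, and the assumption that log lines are in Apache common log format are illustrative choices, not something taken from this post's data.

```python
import json


def build_mapping():
    """Return a mapping body that mixes schema on write and schema on read.

    Field names and the Apache-style log format are assumptions made for
    illustration only.
    """
    return {
        "mappings": {
            # Schema on write: these fields are indexed at ingest time.
            "properties": {
                "@timestamp": {"type": "date"},
                "message": {"type": "wildcard"},
            },
            # Schema on read: this field is computed when a query asks for it.
            "runtime": {
                "http.clientip": {
                    "type": "ip",
                    "script": {
                        # Painless script: parse the raw message with a grok
                        # pattern and emit the client IP when the line matches.
                        "source": (
                            "String clientip = grok('%{COMMONAPACHELOG}')"
                            ".extract(doc['message'].value)?.clientip; "
                            "if (clientip != null) emit(clientip);"
                        )
                    },
                }
            },
        }
    }


if __name__ == "__main__":
    # Print the JSON body that could be sent when creating an index.
    print(json.dumps(build_mapping(), indent=2))
```

A body like this would typically be sent when an index is created. Because the runtime field is not stored in the index, it can be added, changed, or removed later without reindexing the documents.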
Using runtime fields on the Elastic Stack
Let us start with a quick example.
Using unstructured data, we can easily answer questions like “How many errors did we have in the last 15 minutes?” or “When did we last have error X?” But if we want to ask questions like “What’s the sum of number X that appears in our logs?” or “What are our top 5 errors?”, then we first need to extract the relevant information into fields that we can aggregate on.
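The idea of extracting first and aggregating second can be illustrated outside of Elasticsearch with a few lines of plain Python. This is only a conceptual sketch: the log lines, the `code=` and `bytes=` tokens, and the helper names are all hypothetical and do not reflect the data used in this post.

```python
import re
from collections import Counter

# Hypothetical raw log lines, stored as unstructured text.
RAW_LOGS = [
    "2021-04-01T10:00:00Z ERROR code=E42 bytes=120 disk full",
    "2021-04-01T10:00:05Z INFO code=OK bytes=300 request served",
    "2021-04-01T10:00:09Z ERROR code=E42 bytes=80 disk full",
    "2021-04-01T10:00:12Z ERROR code=E7 bytes=50 timeout",
    "a malformed line that does not match the pattern",
]

# Pattern applied at read time to pull structured fields out of each line.
PATTERN = re.compile(r"(?P<level>\w+) code=(?P<code>\w+) bytes=(?P<bytes>\d+)")


def extract(line):
    """Return the extracted fields as a dict, or None if the line does not match."""
    match = PATTERN.search(line)
    return match.groupdict() if match else None


def top_errors(lines, n):
    """Count error codes across ERROR lines and return the n most common."""
    counts = Counter()
    for line in lines:
        fields = extract(line)
        if fields and fields["level"] == "ERROR":
            counts[fields["code"]] += 1
    return counts.most_common(n)


def total_bytes(lines):
    """Sum the extracted bytes value over every line that matches."""
    return sum(int(f["bytes"]) for f in map(extract, lines) if f)


if __name__ == "__main__":
    print(top_errors(RAW_LOGS, 5))
    print(total_bytes(RAW_LOGS))
```

Full text search alone could find the lines that mention `E42`, but ranking the codes or summing a number only becomes possible once those values exist as separate fields.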
If you’ve followed along with our last blog, our data in the cluster now looks like this: