- Published on
Simplify Your Data System with Logs
- Authors
- Name
- Miles Zarn
Ever stop and think about how many of the things we take for granted just work, and work reliably? For example, there may be a few of us out there who remember early Unix, Linux, or Windows, and how we shuddered when the computer got powered off without the shutdown command, or when the power went out after updating a document without the final save?
In the early days, such an event would crash
the computer, and if you were lucky, the disk filesystem would be intact, and you would have only lost your unsaved work. Unlucky, and you would be using disk utils to rebuild and recover the filesystem integrity so you could boot the computer, ... the worry about the unsaved work!
Eventually that problem was solved, but how? ... Logs.
Filesystems came standard with Dirty Region Logging. Filesystems could always guarantee consistency because changes were written to a log first, then the log was systematically processed. If processing of log was interrupted, then the filesystem could simply recover that last stable checkpoint, and restart processing of changes. Applications began using this basic concept creating automatic logged checkpoints. This simple idea, an ordered, append only log, enables reliable persistence to disk, database reliability, database synchronization to read replicas, distributed consensus voting algorithms and so on.
You could say large scale distributed systems are built around logs used as checkpoints, typically trading Atomicity for Eventual Consistency.
There are ‘hard problems’, and just 'problems'
Often the difference between a "hard problem" and a problem, is how you frame the problem and solution. This post explores why logs are a simple, but great idea, making it easier to build robust applications that provide data high data reliability, and avoid ‘hard problems’.
Usually an application starts out simple, the application and a DB, but then you need to scale it, so you decide to add a cache, then you add an email notification system, but then you need to analyze the data generated, but you can’t add the load to the existing DB, so you setup a long term archival strategy or Hadoop. The evolution continues, the DB can’t index fast enough, so you setup Solr or Redis, and caching with memcached. The next thing you know your apps are managing consistency across two, three or more endpoints. “Happy Path” works awesome, but when things start to go awry, data reliability suffers, unexpected ‘edge cases’ pop out everywhere. What’s worse if that you may have multiple actors trying to push similar state, so now timing can also be an issue. Most of the time these problems go un-noticed, because it is difficult to measure consistency! Then there is that fateful day, when somebody notices some egregious inconsistency, and an investigation reveals un-explainable differences. Or even worse, business KPIs become un-responsive to change, nobody knows why, and there is not one big thing presenting itself as an opportunity.
How did we get here? Sadly, there is no one magic tool to solve all the different data scenarios efficiently, so we utilize specialized infrastructures to transform data into efficient patterns to support a specific use case.
Wait, what? Same data … different form?
Once we come to the realization that the data is the same, then this problem of consistency/reliability becomes super critical. There is a source of truth, and that consistency needs to be maintained across multiple services and use cases. Measuring consistency across the various services and transformations will be difficult, if not impossible.
The Simple Approach
A common approach is to push the problem of consistency to the application, forcing the application to manage state of the data in the various systems. The application that is doing so many things, now also needs to manage consistency in un-happy times. There are many reasons why we may not want to do that but let’s explore them.
The biggest reason might be in-consistency due to race conditions. We will assume the non-trivial problem of synchronization, and managing consistency in 'un-happy path' fault scenarios can be solved.
Consider a Web App running on multiple instances, this creates an opportunity for two sources to operate on the same data in Cache, Search, and DB.
Let’s model this problem with two users.
Here we have two users updating the two systems, but we timing can be significant! The net result is that the DB and Cache are no longer consistent. You can argue that your web services can't experience this problem because the updates are always centric to a single users in a single session. But wait, what about cache updates, or publishing new inventory... This flaw exists and given sufficiently large population of users and services it can cause problems. There are similar scenarios in transactional updates to a single DB unless transaction locking is used. Many times it is not due to the performance cost, and or availability of the feature on large scale databases.
Enter the log
An ordered, sequential store. The simple solutions are the best!
Now the application (producer) can asynchronously write data to the log, and the log infrastructure contains all the pedantic checkpoint, reliability features, and actual persistence of data. The targets (consumers in this case) . Consistency is guaranteed via the ordered, sequenced log!
Thank you for reading the post! I look forward to your comments and feedback!
Up next a breakdown of a simple event logging infrastructure, modeled into components, including real world challenges faced, and solutions implemented Analyze Typical Data Pipeline.
The views in this article are my own based on my experiences in the last 10 years in data systems.