Previously in The Quest for Operations Intelligence, we focused on what log aggregation can deliver and how to improve it. One conclusion was that full situational awareness of IT requires logs, metrics, configuration and event information, correlated for easy one-stop analysis when problems arise.
While we talked about logs, metrics and configuration in depth, we left events undefined at the time. What are events, and what can we use them for in our quest for operations happiness?
Event happiness
Those most affected by this quest are the system administrators, who are on call when things go wrong in your infrastructure. When the call comes in the middle of the night, log aggregation and metrics can save precious time in finding the cause of failure.
The question is, what's happened to bring the system administrator to his post in the deep dark of night?
System monitoring has discovered a failure in the infrastructure and generated an alert, which triggered an action that sent a message asking the sysadmin to respond to the issue.
Depending on the organizational structure, responsibility for monitoring is often spread throughout the organization. It may be handled by the IT ops team itself, by a dedicated monitoring team, by the security team or by a cross-functional group. The core of this activity is to perform checks on as many parts of your infrastructure as possible.
These checks are the unit testing of your IT infrastructure.
They are often pieces of code or scripts that validate that a critical part of the infrastructure is working in a general sense (e.g. checking HTTPD service status by creating TCP connections to ports 80 and 443). These checks can also become very specific, such as downloading the main web page of a server and checking that static objects match a previously recorded sha256 hash.
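To make the two kinds of checks more concrete, here is a minimal sketch in Python. The function names (tcp_port_open, check_httpd, content_matches_hash) are hypothetical and not taken from any particular monitoring tool, and fetching the page content is left out so the example stays self-contained.

```python
import hashlib
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable or unresolvable: the check fails.
        return False


def check_httpd(host: str) -> dict:
    """General check: are the usual HTTP and HTTPS ports accepting connections?"""
    return {port: tcp_port_open(host, port) for port in (80, 443)}


def content_matches_hash(content: bytes, expected_sha256: str) -> bool:
    """Specific check: does the content match a previously recorded SHA-256 digest?"""
    return hashlib.sha256(content).hexdigest() == expected_sha256.lower()
```

A real check would typically run on a schedule and report its result to the monitoring system; here the functions simply return booleans so that whatever calls them can decide when to raise an alert.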
These checks can also use metrics, verifying that some parameters do not go beyond safe thresholds (e.g. CPU usage beyond 90% for more than one minute), or detect certain messages in the generated logs, such as any critical message or a specific message that is known to be the symptom of a coming outage.
Figure 1. Log aggregation enriching checks and monitoring.
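As a rough illustration of metric-based and log-based checks, the sketch below flags a value that stays above a threshold for longer than a given duration and filters log lines against a list of patterns. The names sustained_breach, matching_lines and ALERT_PATTERNS are hypothetical, and the example patterns (including the disk message) are assumptions made for illustration only.

```python
import re
from typing import Iterable, List, Tuple


def sustained_breach(
    samples: Iterable[Tuple[float, float]],
    threshold: float = 90.0,
    min_duration: float = 60.0,
) -> bool:
    """Return True if the value stays above threshold for more than min_duration.

    samples are (timestamp_in_seconds, value) pairs in chronological order.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of this breach
            if timestamp - breach_start > min_duration:
                return True
        else:
            breach_start = None  # back under the threshold, reset
    return False


# Hypothetical patterns: any critical message, plus one known outage symptom.
ALERT_PATTERNS = [
    re.compile(r"\bCRITICAL\b"),
    re.compile(r"disk .* is almost full", re.IGNORECASE),
]


def matching_lines(lines: Iterable[str], patterns=ALERT_PATTERNS) -> List[str]:
    """Return the log lines that match at least one alert pattern."""
    return [line for line in lines if any(p.search(line) for p in patterns)]
```

With CPU samples taken every 15 seconds, for example, a run of readings above 90% only counts as a breach once it has lasted more than a minute, so a short spike does not wake anyone up.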
Beyond monitoring