As Red Hat engineers, we are always looking to incorporate features that empower administrators and decision makers. Our goal is to help them be proactive and efficient, and to maximize the value they get from their infrastructure.

To this end, we are currently working on significantly improving the reporting and metrics API in Red Hat Virtualization Manager, our management platform for virtualized resources. Until recently, Red Hat Virtualization relied on native reports and data warehouse engines to provide access, visualization, and insights into the virtualized and physical infrastructure.

While working on improving the user experience, we identified several areas that needed attention. For example, users often requested real-time monitoring information, but the reports only showed data aggregated in hourly and daily increments. Historical data has its advantages for capacity planning and for analyzing performance over time, but real-time data means being able to address issues faster, minimizing downtime and improving overall performance.

We also received requests to add monitoring for the engine machine itself and for database performance. This is important, of course, for managing the engine's resources and for addressing database issues, for example.

Currently, as seen in the data flow diagram below, the engine is responsible for collecting data from the hosts. This adds load on the engine database, consumes resources, and affects overall engine performance.

image01_metrics.png

Improved Metrics Visualization Benefits

We found that there are now better, more scalable, faster, and distributed solutions for metrics collection and processing. One of the major improvements we decided on is collecting the data directly from the hosts. This lowers the load on the engine database and machine, and improves engine performance.

On each host and on the engine, data will be collected by Collectd, a simple yet powerful daemon that gathers metrics from various sources, such as the operating system, applications, log files, and external devices, and makes them available over the network. The data will then be processed by Fluentd, a data collector that unifies the records and forwards them to a central metrics store.
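To make the flow concrete, here is a minimal sketch of pushing a single metric record into Fluentd from Python. It assumes a Fluentd instance with a forward input listening on localhost:24224 and the fluent-logger and psutil packages; the tag and record fields are illustrative only and are not the actual plugin configuration used in this project.

```python
# Minimal sketch: push one host metric record to a local Fluentd forward input.
# Assumes `pip install fluent-logger psutil` and Fluentd listening on localhost:24224.
# The tag ("collectd.cpu") and the record fields are illustrative only.
import time

import psutil
from fluent import sender

# Create a sender bound to Fluentd's default forward port.
metrics = sender.FluentSender('collectd', host='localhost', port=24224)

# Gather a simple CPU metric for this host (collectd would normally do this).
record = {
    'hostname': 'host01.example.com',   # hypothetical host name
    'plugin': 'cpu',
    'value': psutil.cpu_percent(interval=1),
    'timestamp': int(time.time()),
}

# Emit the record under the tag "collectd.cpu"; Fluentd routes it onward
# to the central metrics store.
if not metrics.emit('cpu', record):
    print(metrics.last_error)
```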

Users will be able to view and analyze the metrics in real time using the visualization tool.

image02_metrics.png

We are very excited about the value this project will provide to our users. In upcoming versions we plan to further integrate it with an easy-to-install metrics store solution, and to add built-in dashboards and alerting.

When choosing a metrics store solution, we focused on the following features: metadata handling, high availability, scalability, federation, JDBC/ODBC support, a down-sampling option, a supported alerting and notification tool, and supported visualization tools that provide self service, a diversity of widgets, and interactivity.

We will continue working on adding more Collectd plugins and on processing additional logs.

In addition, we plan on adding smart management based on the metrics store, that is, triggering engine events according to specific criteria. For example, if we see many if_errors on a host's network interface, we can trigger moving that host to a non-operational state to prevent workload downtime due to network issues.
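As a rough illustration of that idea, the sketch below checks per-interface error counts against a threshold and decides whether a host should be flagged. The query_if_errors and set_host_non_operational functions are hypothetical placeholders, not real Red Hat Virtualization or metrics store APIs, and the threshold is made up; the actual integration is still being designed.

```python
# Hypothetical sketch of a metrics-driven rule. query_if_errors() and
# set_host_non_operational() are placeholders, not real RHV or metrics store APIs.

IF_ERRORS_THRESHOLD = 100  # illustrative threshold for errors seen in the window


def query_if_errors(host, window_minutes=5):
    """Placeholder: would query the metrics store for the if_errors counters
    recorded per network interface on `host` over the last few minutes."""
    raise NotImplementedError


def set_host_non_operational(host, reason):
    """Placeholder: would ask the engine to move `host` to non-operational."""
    raise NotImplementedError


def evaluate_host(host):
    """Flag the host as non-operational if any interface exceeds the threshold."""
    errors_by_interface = query_if_errors(host)
    for interface, error_count in errors_by_interface.items():
        if error_count > IF_ERRORS_THRESHOLD:
            set_host_non_operational(
                host, reason=f'{error_count} if_errors on {interface}')
            return True
    return False
```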

Questions, comments, or feedback on metrics and reporting?  Reach out using the comments section (below).

Yaniv & Shirly