Customized Health Scores for Applications
Overview
Large organizations face significant challenges in assessing their resilience and observability due to the vast number of applications and components spread across various business units. A centralized, nuanced scoring system can greatly enhance leaders’ understanding of their organization’s operational health. The Assessment Scoring Engine offers a sophisticated, adaptable framework that assigns a weighted score to applications based on a comprehensive set of rules tailored to different business use cases.
The engine evaluates each application on multiple factors, such as uptime, performance efficiency, error rates, and recovery time. Each factor can be assigned a different weight based on its importance to the business, allowing for a tailored assessment that aligns with organizational priorities. For instance, an application critical to revenue generation might be scored with a higher emphasis on uptime and transaction speed, while a non-critical reporting tool might place more weight on the accuracy and completeness of data.
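As a rough illustration of how weighting changes the result, the sketch below computes a weighted average over per-factor scores. The 0 to 5 score scale, the factor scores, and both weight profiles are assumptions made for this example; they are not values defined by the engine.

```python
def weighted_score(scores, weights):
    """Return the weighted average of per-factor scores.

    Both arguments are dicts keyed by factor name. Weights do not need
    to sum to 1; they are normalized by their total.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Hypothetical factor scores for one application, on an assumed 0-5 scale.
scores = {"uptime": 4.0, "performance": 3.0, "error_rate": 2.0, "recovery_time": 5.0}

# Two hypothetical weight profiles applied to the same scores.
revenue_critical = {"uptime": 4, "performance": 3, "error_rate": 2, "recovery_time": 1}
equal_weights = {"uptime": 1, "performance": 1, "error_rate": 1, "recovery_time": 1}

print(weighted_score(scores, revenue_critical))  # 3.4
print(weighted_score(scores, equal_weights))     # 3.5
```

The same raw scores produce different results under the two profiles, which is the effect the weighting is meant to capture.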
Once an application receives its weighted score, stakeholders can use this data-driven insight to identify and prioritize improvements. This eliminates guesswork, as the scoring clearly indicates which improvements in specific categories will have the most significant impact on the overall score and, by extension, on the organization’s operational health.
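One way to make that prioritization concrete is to ask how much the overall score would rise if a single category were brought up to the top of the scale. The sketch below does this for a weighted average; the maximum score of 5, the category scores, and the weights are all hypothetical.

```python
MAX_SCORE = 5.0  # assumed top of the scoring scale


def potential_gains(scores, weights, max_score=MAX_SCORE):
    """Rank categories by how much the overall weighted score would rise
    if that category alone were raised to max_score."""
    total_weight = sum(weights.values())
    gains = {
        name: weights[name] * (max_score - scores[name]) / total_weight
        for name in weights
    }
    # Largest potential gain first.
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical category scores and weights for one application.
scores = {"application": 4.5, "infrastructure": 3.0, "observability": 2.0, "operations": 4.0}
weights = {"application": 2, "infrastructure": 3, "observability": 4, "operations": 1}

for name, gain in potential_gains(scores, weights):
    print(f"{name}: +{gain:.2f}")
```

In this example observability ranks first because it combines a low score with a high weight, so it is where improvement work would move the overall score the most.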
Additionally, the Scoring Engine provides Site Reliability Engineering (SRE) teams with a comprehensive view across the organization, enabling them to concentrate their efforts on areas where improvement is most needed. By highlighting applications with lower scores, SRE teams can identify and address the most critical issues. This visibility not only drives targeted improvement efforts but also encourages application owners to proactively manage and enhance their applications’ operational health, knowing that their performance is being monitored and measured.
Problem Statement
Leaders often grapple with identifying and addressing gaps in their operational health, particularly in large entities where numerous applications and systems interplay across varied business units. A critical challenge lies in obtaining a clear, comprehensive view of these gaps and strategically prioritizing remediation efforts to improve resilience effectively. While leaders may be aware of potential vulnerabilities, quantifying the impact and prioritizing actions based on the potential benefit remains a complex endeavor. Manual efforts to quantify gaps are time-consuming and difficult to coordinate across business units.
Problem | Details |
Lack of Visibility | Organizations require a mechanism to distill and interpret vast amounts of data across disparate systems to gain actionable insights into their operational health. The need for a scoring engine becomes evident, one that aggregates and synthesizes metrics from multiple sources, providing a unified view of operational health across the organization. |
Prioritization | Organizations need a system that not only identifies issues but also ranks them by their potential impact. This prioritization is essential to ensure that resources are channeled where they can yield the maximum benefit, especially in resource-constrained environments. |
Automation of Report Generation | The manual generation of reports is time-consuming and prone to errors and delays. An automated system that can generate insightful reports on operational health metrics and their implications can significantly streamline decision-making processes. |
Most application health dashboards provide an aggregate view of metrics, yet they fall short in offering a weighted analysis. In practice, not all metrics hold equal importance; some are critical to an organization’s core functions and resilience. A sophisticated scoring engine that can apply customized weightings to different metrics will provide a more nuanced and actionable understanding of operational health. By determining the weighted impact of each issue, leaders can make informed decisions on where to focus improvement efforts, aligning resource allocation with strategic objectives to enhance overall resilience.
Solution
Definitions
Components – Logical breakdowns of an application, defined by the application team implementing the Resiliency Health Index (RHI). A component can be an application tier, such as Front End or Back End.
Categories – Subsets of a component. Each category contains multiple controls and has a set of rules aligned with it that drive a category score. Examples of categories include application, infrastructure, observability, and operations.
Controls – Subsets of a category. Each control is aligned with a set of rules that produce a combined score, which can be weighted according to the resiliency strategy of the enterprise implementing RHI. Examples of controls for the Observability category are Metrics, Alarms, and Traces, each with rules that evaluate the health of the control.
Resiliency Rule – An automated evaluation that produces a resiliency score. Rules are aligned with controls and categories to add structure and to weight outcomes according to the organization’s business needs. Some examples of rules would be:
- Observability metrics currently in place
- Quality of application logs
- Recovery time during a real-world outage
- Current continuous uptime
- Uptime percentage
- Current application health status
- Current Performance Testing pass rates
- Resiliency Testing pass rates
Background
The Assessment Scoring Engine is an advanced system tailored for large organizations, designed to evaluate applications by assigning scores based on a configurable set of rules and weightings. It starts with defining the business structure and weighting rules, following which the engine aggregates scores from specific rules to derive a comprehensive score for each application component, aiding in decision-making to enhance application performance.
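The roll-up from rule scores to a component score can be sketched as follows. This is a minimal illustration, not the engine's actual algorithm: it assumes a control's score is the plain mean of its rule scores and that controls and categories are combined by weighted average. The `runbooks` control, the scores, and the weights are hypothetical.

```python
from statistics import fmean


def weighted_mean(scores, weights):
    """Weighted average of scores; items without a weight default to 1."""
    total = sum(weights.get(name, 1.0) for name in scores)
    return sum(score * weights.get(name, 1.0) for name, score in scores.items()) / total


def component_score(component, control_weights, category_weights):
    """Roll rule scores up to a single score for one component.

    component maps category name -> control name -> list of rule scores.
    """
    category_scores = {}
    for category, controls in component.items():
        # Assumption: a control's score is the plain mean of its rule scores.
        control_scores = {name: fmean(rules) for name, rules in controls.items()}
        category_scores[category] = weighted_mean(control_scores, control_weights)
    return weighted_mean(category_scores, category_weights)


# Hypothetical rule scores for a single component.
front_end = {
    "observability": {"metrics": [4.0, 2.0], "alarms": [2.0], "traces": [1.0]},
    "operations": {"runbooks": [5.0]},
}
control_weights = {"alarms": 2.0}  # alarms count double; other controls default to 1
category_weights = {"observability": 3.0, "operations": 1.0}

print(component_score(front_end, control_weights, category_weights))  # 2.75
```

Here observability resolves to 2.0 and operations to 5.0, and the heavier observability weight pulls the component score down to 2.75.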
Components
- Infrastructure as Code – Utilizes AWS Cloud Development Kit (CDK) for defining all infrastructure, ensuring automation and consistency.
- AWS Lambda – Acts as the compute backbone, executing the scoring engine’s logic, including data processing and score computation.
- Amazon DynamoDB – Stores essential data like rules, controls, categories, components, and application data, enabling efficient data management.
- Amazon S3 – Used for storing processed reports, providing a secure and scalable storage solution.
Implementation Details
Rule Submission:
- Rules are the fundamental units of measurement within the engine, representing individual score assessments for different components of an application.
- Each rule is part of a hierarchical structure predefined in the engine, ensuring that the data is organized logically according to the engine’s configuration.
- To submit a rule, the upstream application must specify not just the rule’s details but also its context within the hierarchy, including the associated application, component, category, and control names.
- This hierarchical information is crucial because it allows the engine to correctly place and aggregate the rule within the broader context of the organizational structure and application architecture.
- The example JSON request demonstrates how rules are structured and submitted. Each rule includes its name, associated control, category, component, and application names, along with the score it assigns.
- The system is designed to handle multiple rule submissions in a single request, accommodating batch processing for efficiency.
{"rules": [
{"rule_name": "verify_control_average", "control_name": "alarms", "category_name": "application", "component_name": "Commercial_Banking", "application_name": "resiliency-foundation-pipelines", "score": 2.2},
{"rule_name": "verify_control_average_2", "control_name": "alarms", "category_name": "application", "component_name": "Commercial_Banking", "application_name": "resiliency-foundation-pipelines", "score": 2.2},
{"rule_name": "verify_control_average_3", "control_name": "alarms", "category_name": "application", "component_name": "Commercial_Banking", "application_name": "resiliency-foundation-pipelines", "score": 2.2}
]}
Rule Validation and Response:
- Upon receiving rule submissions, the engine validates them against the pre-configured hierarchy. This validation ensures that the rules align with the existing structure, maintaining data integrity and consistency.
- The response to a rule submission includes two key sections: “rules_succeeded” and “rules_failed.” These sections provide immediate feedback on the processing of each submitted rule, detailing which were successfully integrated and which, if any, encountered issues.
- Such feedback is vital for the upstream application to understand the outcome of the submission and take any necessary corrective actions for the rules that failed to process correctly.
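The validation step described above might look something like the sketch below. Only the `rules_succeeded` and `rules_failed` section names and the rule field names come from the text; the representation of the hierarchy as a set of tuples, the shape of each failure entry, and the specific checks are assumptions made for illustration.

```python
REQUIRED_FIELDS = ("rule_name", "control_name", "category_name",
                   "component_name", "application_name", "score")


def validate_rules(payload, hierarchy):
    """Split submitted rules into accepted and rejected lists.

    hierarchy is a set of (application, component, category, control)
    tuples representing the pre-configured structure (assumed format).
    """
    succeeded, failed = [], []
    for rule in payload.get("rules", []):
        missing = [field for field in REQUIRED_FIELDS if field not in rule]
        if missing:
            failed.append({"rule": rule, "reason": f"missing fields: {missing}"})
            continue
        path = (rule["application_name"], rule["component_name"],
                rule["category_name"], rule["control_name"])
        score = rule["score"]
        if path not in hierarchy:
            failed.append({"rule": rule, "reason": "unknown hierarchy path"})
        elif isinstance(score, bool) or not isinstance(score, (int, float)):
            failed.append({"rule": rule, "reason": "score is not a number"})
        else:
            succeeded.append(rule)
    return {"rules_succeeded": succeeded, "rules_failed": failed}


# Hypothetical configured hierarchy with a single valid path.
hierarchy = {("resiliency-foundation-pipelines", "Commercial_Banking", "application", "alarms")}
base = {"category_name": "application", "component_name": "Commercial_Banking",
        "application_name": "resiliency-foundation-pipelines", "score": 2.2}
payload = {"rules": [
    {**base, "rule_name": "verify_control_average", "control_name": "alarms"},
    {**base, "rule_name": "verify_control_average_2", "control_name": "traces"},
]}

result = validate_rules(payload, hierarchy)
print(len(result["rules_succeeded"]), len(result["rules_failed"]))  # 1 1
```

The second rule is rejected because `traces` is not a configured control under that path, and the reason string gives the upstream application something to act on.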
The engine uses two types of configuration file: the application configuration and the business configuration. The application configuration defines how a default application is weighted. The business configuration defines how applications roll up to different business units, as well as which weights are locked for a given business unit.
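To show how the two configurations could interact, the sketch below resolves the effective category weights for an application. The file layout, the key names, the `payments` business unit, and the idea of per-application overrides are all assumptions for this example; the text only establishes that defaults exist and that a business unit can lock certain weights.

```python
# Application configuration: hypothetical default weights for any application.
APPLICATION_CONFIG = {
    "category_weights": {"application": 1.0, "infrastructure": 1.0,
                         "observability": 1.0, "operations": 1.0},
}

# Business configuration: hypothetical roll-up of applications to business
# units, plus the weights each unit locks in.
BUSINESS_CONFIG = {
    "payments": {
        "applications": ["resiliency-foundation-pipelines"],
        "locked_weights": {"observability": 3.0},
    },
}


def effective_weights(app_name, overrides=None):
    """Resolve category weights for an application.

    Precedence, highest first: business-unit locked weights, then the
    application's own overrides, then the defaults.
    """
    weights = dict(APPLICATION_CONFIG["category_weights"])
    weights.update(overrides or {})
    for unit in BUSINESS_CONFIG.values():
        if app_name in unit["applications"]:
            weights.update(unit["locked_weights"])  # locked values always win
    return weights


resolved = effective_weights("resiliency-foundation-pipelines",
                             {"observability": 0.5, "operations": 2.0})
print(resolved["observability"], resolved["operations"])  # 3.0 2.0
```

The override for operations is honored, while the attempted override for observability is ignored because the business unit has locked that weight.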
Figure-01
Solution Mapping
The Assessment Scoring Engine is ideally implemented and managed by a dedicated SRE/DevOps team, leveraging their expertise in system reliability and operational metrics. The primary beneficiaries of this service are application teams focused on optimizing their performance scores and organizational leaders seeking insights into resource allocation for enhanced resilience.
- For Application Teams: The engine provides a detailed assessment of their applications, highlighting areas for improvement and enabling a targeted approach to enhance application resilience and performance.
- For Leaders: It offers a granular view of operational health across different business units, facilitating informed decisions on resource distribution and strategic priorities to improve organizational resilience.
Problem | Solution |
Prioritization of Resiliency Work | The Assessment Scoring Engine addresses this by providing a nuanced scoring mechanism that evaluates applications based on a set of customizable rules and weightings, translating complex data into actionable intelligence for prioritizing resilience enhancements. |
Adaptability | Designed to accommodate dynamic IT environments, the engine supports continuous reassessment and adjustment of resilience strategies, ensuring they remain aligned with current business needs and technological landscapes. |
Report Generation | Reports are generated automatically, eliminating manual report creation, which is prone to errors. |
Summary
The Assessment Scoring Engine is a pivotal tool that aligns with the strategic objectives of enhancing operational resilience within organizations. By providing a structured and data-driven approach to assess and improve application and organizational health, it enables organizations to take targeted actions that directly contribute to the stability and reliability of services. By utilizing AWS services, the engine offers a scalable and secure solution to meet the dynamic needs of modern enterprises.
Involved in these Foundations
- Experiment Broker
- Resiliency Foundations
- Resiliency Health Index