Life Connect

Introduction

Owners: Marwan

Purpose

The purpose of this stress test plan is to evaluate the performance, stability, and scalability of the Call For Rent process under stress load conditions. The objective is to identify potential bottlenecks, weaknesses, and the maximum capacity of the application before failure or significant performance degradation occurs.

Scope

This stress test plan focuses on the key components of the application, including APIs, databases, messaging systems, and other microservices. The tests will simulate concurrent users and increased data processing to ensure the application can handle peak load conditions.

Objectives

  • Determine the maximum load the application can handle.
  • Identify performance bottlenecks and potential failure points.
  • Evaluate the application's behavior under extreme conditions.
  • Measure the system's ability to recover after a failure.
  • Provide insights for scaling the infrastructure.

Environments / Infrastructure

Owners: Yevhenii & Andriy

  • Machines: M1, M2, ...? Cloud or VPS? Specification details: CPU characteristics, memory, network interface.
  • Kubernetes: nodes? Number, size.
  • Services (adb-contracts, adb-accounting, adb-persons, adb-parts): co-located (affinity), memory, CPU, network interface.
  • Mongo DB: number of connections.

Data Set

Owners: Sergey

The idea of the stored datasets is to have up-to-date data sources dedicated to specific use cases, which in our case is performance testing.
At the moment the list consists of three sets of data: the small, the middle, and the big one. It can be extended further on request to cover additional cases.

Upload tools/strategy

The Artillery tool is currently used both for the upload flow and for performance testing.
The project with the Artillery solution can be found in the Bitbucket repository adb-tests-artillery, and the existing README.md file is recommended as a starting point.

Although Artillery is primarily a performance tool, the initial dataset upload is nevertheless sequential. This is both because of the evident chaining of artefacts (Owner -> Part -> RDC, etc.) and because of the need to avoid issues during the data load due to system overload.
Technically, however, the initial load could be carefully optimized (on request) by splitting the large datasets into small independent parts and running them in parallel, as sketched below.
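To make the chaining constraint concrete, the sketch below uploads one Owner -> Part -> RDC -> RC chain strictly in order, with an opt-in parallel mode for independent chains. The endpoint paths and payload shapes are assumptions for illustration only; the real flow lives in the adb-tests-artillery project.

```typescript
// Sketch of the sequential upload order; endpoint paths and payload shapes
// below are hypothetical, not the actual adb-* APIs.
const BASE_URL = process.env.BASE_URL ?? "http://localhost:8080"; // assumed

async function post<T>(path: string, body: unknown): Promise<T> {
  const res = await fetch(`${BASE_URL}${path}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`${path} failed: ${res.status}`);
  return res.json() as Promise<T>;
}

// One chain must be created strictly in order: the Part needs the Owner id,
// the RDC needs the Part id, and so on; this is why the initial load is sequential.
async function uploadChain(owner: { name: string }) {
  const o = await post<{ id: string }>("/owners", owner);               // hypothetical endpoint
  const part = await post<{ id: string }>("/parts", { ownerId: o.id }); // hypothetical endpoint
  const rdc = await post<{ id: string }>("/rdcs", { partId: part.id }); // hypothetical endpoint
  await post("/rcs", { rdcId: rdc.id });                                // hypothetical endpoint
}

// Independent chains could be run in parallel if the system can take the load.
async function uploadAll(owners: { name: string }[], parallel = false) {
  if (parallel) {
    await Promise.all(owners.map(uploadChain));
  } else {
    for (const owner of owners) await uploadChain(owner); // safe, sequential default
  }
}
```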

Dynamic data

When exporting the data from the GDrive storage for a local run, do not be surprised to discover slight differences between files downloaded on different dates.
The reason is the dynamic data organization: some of the data (such as contract start/end dates) is regenerated based on the current date, so that it stays realistic and in sync with data (such as indexes) provided externally by other, dependent sources.

The data exported (and stored) locally will not be changed.
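A minimal sketch of the regeneration idea, assuming a hypothetical contract shape; the actual generator and field names may differ:

```typescript
// Sketch of "dynamic data": contract dates are regenerated relative to the
// current date at export time. The field names here are assumptions.
interface ContractTemplate { id: string; durationMonths: number }
interface Contract { id: string; startDate: string; endDate: string }

function regenerateContractDates(tpl: ContractTemplate, today = new Date()): Contract {
  // Start the contract one month in the past so it is already active on export day.
  const start = new Date(today);
  start.setMonth(start.getMonth() - 1);
  const end = new Date(start);
  end.setMonth(end.getMonth() + tpl.durationMonths);
  return {
    id: tpl.id,
    startDate: start.toISOString().slice(0, 10), // e.g. "2024-05-01"
    endDate: end.toISOString().slice(0, 10),
  };
}

// Two exports made on different dates will therefore differ in these fields,
// while locally stored copies stay frozen.
```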

Small data set details (100RCs)

Location: 154ypEZ5j6rRqNiXfiTfZbIRtgw6RJMaskPQukA6Lm-8

| Items | Quantity |
| --- | --- |
| Organizations | 5 |
| Owners | 50 (10 per Org) |
| Tenants | 50 (10 per Org) |
| Parts | 100 (2 per Owner) |
| RDCs | 100 |
| RCs | 100 |

Middle data set details (500RCs)

Location: 1NMXfQeup9GKo7rLTHImqNjObrFJ--N8bmAVFYlCBV4M

| Items | Quantity |
| --- | --- |
| Organizations | 5 |
| Owners | 50 (10 per Org) |
| Tenants | 50 (10 per Org) |
| Parts | 500 (10 per Owner) |
| RDCs | 500 |
| RCs | 500 |

Big data set details (1000RCs)

Location: 1RLkVcZG2fIqI0VR2mbGg9HLQILsq15LSPPQ54j3yQYw

| Items | Quantity |
| --- | --- |
| Organizations | 5 |
| Owners | 50 (10 per Org) |
| Tenants | 50 (10 per Org) |
| Parts | 1000 (20 per Owner) |
| RDCs | 1000 |
| RCs | 1000 |

Number of orgs & RCs Launch strategy (parallel, consecutive, ...)

Observables

Owners: Yevhenii & Andriy

Performance Metrics

Global

  • Response Time: Average, P95, and P99 response times for each service (a sketch after this list shows how these values can be derived from raw samples).
  • Throughput: Requests per second (RPS) or transactions per second (TPS).
  • Error Rate: Percentage of failed requests or transactions.
  • Network Latency: Time taken for data to travel across the network.
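Artillery reports these values out of the box; the sketch below only illustrates how average, P95, P99, RPS, and error rate are derived from raw samples, in case they ever need to be recomputed from custom logs.

```typescript
// Sketch: derive the global metrics from raw per-request samples.
interface Sample { latencyMs: number; ok: boolean }

// Nearest-rank percentile over a sorted array of latencies.
function percentile(sortedMs: number[], p: number): number {
  const idx = Math.min(sortedMs.length - 1, Math.ceil((p / 100) * sortedMs.length) - 1);
  return sortedMs[Math.max(0, idx)];
}

function summarize(samples: Sample[], durationSec: number) {
  const latencies = samples.map(s => s.latencyMs).sort((a, b) => a - b);
  const failed = samples.filter(s => !s.ok).length;
  return {
    avgMs: latencies.reduce((a, b) => a + b, 0) / latencies.length,
    p95Ms: percentile(latencies, 95),
    p99Ms: percentile(latencies, 99),
    rps: samples.length / durationSec,          // throughput
    errorRate: (failed / samples.length) * 100, // percent of failed requests
  };
}
```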

K8s Infrastructure

  • Resource Utilization: CPU, memory (/node, /pod, /JVM), and disk usage on Kubernetes nodes.
  • Pod Scaling: Number of pods during different stages of the test (autoscaling behavior); a sketch after this list shows one way to poll pod metrics.
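A minimal sketch of polling pod-level CPU/memory usage and pod count through the Kubernetes metrics API, assuming the services run in a hypothetical life-connect namespace; in practice the same data would normally be scraped by the cluster monitoring stack.

```typescript
// Sketch: snapshot pod resource usage via the Kubernetes metrics API.
// The namespace name is an assumption.
import { KubeConfig, Metrics } from "@kubernetes/client-node";

async function snapshot(namespace = "life-connect") {
  const kc = new KubeConfig();
  kc.loadFromDefault(); // uses the local kubeconfig / in-cluster config
  const metrics = new Metrics(kc);

  const podMetrics = await metrics.getPodMetrics(namespace);
  for (const pod of podMetrics.items) {
    for (const c of pod.containers) {
      // usage.cpu and usage.memory are raw Kubernetes quantities (e.g. "250m", "512Mi")
      console.log(pod.metadata.name, c.name, c.usage.cpu, c.usage.memory);
    }
  }
  console.log(`pods reporting metrics: ${podMetrics.items.length}`);
}

snapshot().catch(console.error);
```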

Mongo DB

  • Query Latency: Measure the time taken for read and write queries to execute under load.
  • Throughput: Monitor the number of operations per second (reads, writes, updates).
  • Connection Utilization: Track the number of active connections to the MongoDB instance (see the sketch after this list).
  • Index Performance: Assess index usage and identify slow queries that may benefit from additional indexing.
  • Replica Set Performance: Monitor replication lag and the performance of secondary nodes in a replica set.
  • Disk I/O: Track read/write operations per second and I/O wait times.
  • Cache Utilization: Monitor the efficiency of MongoDB’s internal caching mechanisms (e.g., WiredTiger cache).
  • Memory Usage: Track the memory consumption by MongoDB processes, including virtual and physical memory usage.
  • Locking: Monitor the time MongoDB spends in various types of locks (e.g., global, database, collection).
  • Document Growth: Measure the impact of document growth on performance, particularly for large collections.
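A minimal sketch of pulling a few of these counters directly from serverStatus with the Node.js MongoDB driver; on Atlas most of them are also available in the built-in monitoring dashboards, so this is only a fallback for custom collection. The connection string is an assumption.

```typescript
// Sketch: read connection, throughput, cache, and memory counters from serverStatus.
import { MongoClient } from "mongodb";

async function mongoSnapshot(uri = process.env.MONGO_URI ?? "mongodb://localhost:27017") {
  const client = new MongoClient(uri);
  await client.connect();
  try {
    const status = await client.db("admin").command({ serverStatus: 1 });
    console.log("active connections :", status.connections?.current);
    console.log("available conns    :", status.connections?.available);
    console.log("opcounters         :", status.opcounters); // inserts/queries/updates/deletes
    console.log("wiredTiger cache   :", status.wiredTiger?.cache?.["bytes currently in the cache"]);
    console.log("memory (MB)        :", status.mem);         // resident / virtual memory
  } finally {
    await client.close();
  }
}

mongoSnapshot().catch(console.error);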

Atlas MongoDB Triggers

  • Log Monitoring: Utilize the MongoDB Atlas logs, which can provide information on trigger execution, including start and end times, errors, and execution details.
  • Custom Logs: If needed... TBD
  • Error Handling: Implement error handling within the triggers to catch and log errors, providing visibility into failures.
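As an illustration of the error-handling point, a minimal sketch of an Atlas trigger function (Atlas Functions are plain JavaScript) that logs execution time and surfaces failures; the work inside the try block is a placeholder, not the actual trigger logic.

```typescript
// Hypothetical Atlas trigger function: the logging/error handling around the
// body is the point; the body itself is a placeholder (e.g. forwarding to SNS).
exports = async function (changeEvent) {
  const started = Date.now();
  try {
    // ... actual trigger work goes here ...
    console.log(`trigger ok in ${Date.now() - started} ms`, changeEvent.operationType);
  } catch (err) {
    // Caught errors end up in the Atlas trigger logs and make failures visible.
    console.error("trigger failed:", err, JSON.stringify(changeEvent.documentKey));
    throw err; // rethrow so the execution is recorded as failed
  }
};
```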

Monitoring AWS SNS

  • Message Delivery Success Rate: Track the percentage of successfully delivered messages versus failed deliveries (see the sketch after this list).
  • Delivery Errors: Monitor for any errors in message delivery, such as timeouts or endpoint unreachable errors. AWS CloudWatch can be configured to alert on specific error patterns.
  • Throughput:
    • Messages Published: Monitor the number of messages published to SNS per second/minute to ensure that it can handle the load during stress testing.
    • Message Size: Keep an eye on the size of messages being published, as larger messages can impact delivery time and cost.
  • Subscription Monitoring: Ensure that all subscriptions (e.g., SQS queues) are active and not in a failed state.
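A minimal sketch of pulling a few AWS/SNS CloudWatch metrics for the test window with the AWS SDK for JavaScript; the topic name and region are assumptions, and the same numbers are visible directly in the CloudWatch console.

```typescript
// Sketch: sum SNS publish/delivery/failure counts over the test window.
import { CloudWatchClient, GetMetricStatisticsCommand } from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({ region: "eu-central-1" }); // assumed region

async function snsMetric(metricName: string, topicName: string, start: Date, end: Date) {
  const out = await cw.send(new GetMetricStatisticsCommand({
    Namespace: "AWS/SNS",
    MetricName: metricName, // e.g. NumberOfMessagesPublished
    Dimensions: [{ Name: "TopicName", Value: topicName }],
    StartTime: start,
    EndTime: end,
    Period: 60,             // 1-minute resolution
    Statistics: ["Sum"],
  }));
  return (out.Datapoints ?? []).reduce((sum, d) => sum + (d.Sum ?? 0), 0);
}

async function snsReport(topicName: string, start: Date, end: Date) {
  const published = await snsMetric("NumberOfMessagesPublished", topicName, start, end);
  const delivered = await snsMetric("NumberOfNotificationsDelivered", topicName, start, end);
  const failed = await snsMetric("NumberOfNotificationsFailed", topicName, start, end);
  console.log({ published, delivered, failed, deliveryRate: delivered / Math.max(published, 1) });
}
```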

Monitoring AWS SQS

  • Number of Messages: Monitor the queue length to ensure that it doesn’t grow beyond expected limits.
  • Oldest Message Age: Keep track of the age of the oldest message in the queue to detect delays in processing (see the sketch after this list).
  • Message Processing Rate:
    • Messages Sent and Received: Track the rate at which messages are sent to and received from the SQS queue.
    • Messages Deleted: Monitor the number of messages deleted after processing, which indicates successful handling by consumers.
  • Failed Message Processing: Watch for messages that fail to be processed correctly by consumers and end up in the Dead Letter Queue (DLQ).
  • Message Latency:
    • Receive Latency: Track the time taken for a message to be received and processed after being sent to the queue.
    • End-to-End Latency: Measure the total time from when a message is pushed to SNS by the MongoDB trigger to when it is consumed from SQS.
  • Throttling and Limits:
    • Throttling: Monitor for any throttling issues, particularly if the number of messages exceeds AWS service limits. This is critical during stress tests where load spikes are common.
    • Queue Limits: Keep an eye on queue limits (e.g., maximum number of inflight messages) and ensure you are not hitting them.
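A minimal sketch of polling queue depth during a run; the queue URL and region are assumptions. ApproximateAgeOfOldestMessage is only exposed as a CloudWatch metric, so it would be fetched in the same way as the SNS metrics above.

```typescript
// Sketch: poll SQS backlog and in-flight counts while the test is running.
import { SQSClient, GetQueueAttributesCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: "eu-central-1" }); // assumed region

async function queueDepth(queueUrl: string) {
  const out = await sqs.send(new GetQueueAttributesCommand({
    QueueUrl: queueUrl,
    AttributeNames: [
      "ApproximateNumberOfMessages",           // backlog waiting to be consumed
      "ApproximateNumberOfMessagesNotVisible", // currently in flight
    ],
  }));
  return {
    visible: Number(out.Attributes?.ApproximateNumberOfMessages ?? 0),
    inFlight: Number(out.Attributes?.ApproximateNumberOfMessagesNotVisible ?? 0),
  };
}

// Poll every 10 seconds and watch for a backlog that keeps growing.
setInterval(() => {
  queueDepth(process.env.QUEUE_URL ?? "") // hypothetical env var
    .then(d => console.log(new Date().toISOString(), d))
    .catch(console.error);
}, 10_000);
```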

System Behaviour

  • Service Degradation: Identify any services that degrade or fail under load.
  • Service Recovery: Time taken to recover after a service failure or restart.
  • Load Balancing: Distribution of traffic across nodes/pods.

Logs and Traces

  • Log Analysis: Monitor logs for error patterns, bottlenecks, or unexpected behavior.
  • Tracing: Use distributed tracing tools to monitor service call flows and latency.

  • Mongo DB connections
  • Mongo DB usage (Atlas)
  • Monitor/query logs

Data Assessment

Success Criteria

Owners: Sergey

  • Performance Goals: Response times should remain within acceptable limits under X load.
  • Error Threshold: Error rate should not exceed Y% during peak load.
  • Resource Utilization: CPU and memory utilization should stay below Z% for sustained periods.
  • Scalability: The system should scale up/down automatically and efficiently.
  • Expected Data Results: The system should produce consistent results (events, accounting entries, contract updates, ...).

Analysis and Reporting

Owners: Sergey

  • Data Collection: Gather metrics, logs, and traces for each test scenario.
  • Analysis:
    • Compare observed metrics against baseline and threshold values.
    • Identify patterns or anomalies in system performance.
  • Reporting:
    • Document findings, highlighting key observations and recommendations.
    • Provide visualizations (graphs, charts) to represent performance trends.

Recommendations

Owners: Team

  • Infrastructure Changes: Suggestions for scaling nodes, upgrading hardware, or adjusting configurations.
  • Code Optimization: Identify parts of the codebase that may need optimization to handle higher loads.
  • Follow-up Testing: Propose additional tests based on findings, e.g., targeted tests on identified bottlenecks.