Practical Scale Testing

Everyone wants scalable systems. Arun Saha explores methods for testing scalability.

Scalability is the ability of a system to handle a growing amount of work or its potential to accommodate such growth. It characterizes how resource utilization grows with increasing load [ Wikipedia-a ]. Scalability is not an ‘add-on’ item; it is a quality that determines the lifetime value of the software. It may not be apparent on day one when the focus is usually on the features, but it is very important for the eventual growth. Therefore, from early on, it is important to test the software and verify that the desired scalability properties are present.

What is a desired scalability property? Here is a textbook example. Consider a system that sorts data. In Big-O notation, the time complexity of a good sorting algorithm (for example, quicksort, merge sort, heapsort) is O(N lg N) where N is the number of items being sorted [ Wikipedia-b ]. Thus, a desired scalability property of this system is that the time taken to sort N items is O(N lg N) , i.e., the time consumed is proportional to N lg N . The only way this can be examined is by running inputs of different sizes through the system [ Orr14 ]. With sufficiently large input size, each time the input size is doubled, the time taken is expected to be doubled too. For example, with T1 = k lg k , and T2 = 2k lg (2k) , T2/T1 = 2k lg (2k) / k lg k = 2 + 2 / lg k ~= 2 . If, however, the system is using a sub-optimal sorting algorithm of time complexity O(N ² ) (for example, bubble sort, insertion-sort) then the ratio of the times taken would be quadrupled (since T2/T1 = (2k) ² /k ² = 4 ). How many different sizes to run? See the discussion in ‘Test procedure’, below.

The holy grail of scalability is linear scaling , i.e. the measured metric changes linearly with the changed attribute. With linear scaling, if the measurements are plotted on a graph, they approximately form a straight line.

Scalability testing is a type of non-functional testing; tests are only conducted on systems that are proved functionally correct. In the example of the sorting system above, the time complexity tests are meaningful and attempted only after it is verified that the system can correctly sort input of different sizes including sufficiently large sizes.

There are two main motivations of scalability testing: (a) identify the limits of the system; (b) verify that the system performs well up to the previously identified limits or desired limits. It is important to identify the goals precisely; for example, a system can support 1M users but it may not be able to support 1M users together . In this case, an important thing to know is, how many users can be supported simultaneously.

Usually, the scale tests are carried out as black box tests. In a broad sense, it involves two kinds of tests and measurements: (1) customer-driven (or deployment-driven) and (2) architecture-driven. The former involves real-life situations although not necessarily everyday situations. For example, can a system of cellular towers handle the load when 100K people gather for a special event? The latter are internal-driven tests, which keep loading the system and study its behavior. There is a connection though; the numbers obtained from the architecture-driven tests influence future deployments.

In this article, we present the key characteristics of successful and practical scale testing at the system level. These characteristics can form the basis of system or load tests. We will consider attribute selection, metric selection, and the testing process. These considerations help design scalability tests for a system.

This article is not about microbenchmarking [ StackOverflow ] where a ‘small’ specific section of code (often a class or a function) is tested for performance.

Identify scalability attributes to explore

A system usually has many different knobs and parameters that could be scaled. Some of them are logical entities without any intrinsic limit, for example, the number of users, while others are physical entities tied to some limits, for example, hardware capacity and operating system constraints. From another perspective, some parameters are exposed externally (to the users of the software), such as the maximum size of the items to be sorted, while others are internal, for example, block size for saving data.

As we saw above, gauging the scalability of a single attribute requires running a specific test multiple times. Therefore, to make the best use of available time, staff, and computing resources, it is important to identify and prioritize the attributes whose scalability properties are to be explored. For example, in a document management system, it may be more important to have ‘search’ scalability (since search results are needed instantly) than ‘remove’ scalability (since removal can be performed asynchronously through batch jobs).

There is no point exploring the scalability properties of attributes that are related; choose only those attributes that are independent of each other. For example, the number of simultaneous users and the number of (external) network connections are related (i.e. dependent); it may not make sense to choose both of them as attributes.

The attributes together form a scale-space where each attribute is a dimension and each combination of the attribute values is a point in that space.

Identify a set of metrics for each attribute

Some general metrics that characterize system performance are, for example, throughput, latency, CPU usage, memory usage, network usage, disk usage.

Each kind of system has its own family of metrics. For example, in a storage system, some commonly used metrics are sequential read IOPS (input-output operations per second), sequential write IOPS, random read IOPS, random write IOPS. In a system involving video, a common metric is frames per second. A database system uses queries per second; a networking system uses packets per second (for throughput) and time per packet (for latency).

There are many other metrics capturing other aspects of a system, besides performance. Examples include

time to recover from a disk failure
time to recover from a node failure
time for traffic to converge after a network failure.

The recovery related tests and obtained metrics provide the expected time to recover from a failure situation. They serve as a useful reference to the customers and the operations community.

It is worth noting that certain metrics may be at odds, for example, the design for optimal (i.e. maximum) throughput may not be the same as the design for optimal (i.e. minimum) latency. Moreover, once the metrics are chosen, the system architecture and design tend to optimize in favor of them. Hence, it is essential to identify what is important to the customers.

Test procedure

Scalability exploration is NOT about running a test once under the maximum scale. Rather, it is a study of the system by running the test for multiple iterations , each time with a different combination of attribute values. Without multiple iterations, we can know how the system behaves at a specific load point, but we cannot know how the system behaves with an increase (or decrease) of load – the whole point of scalability.

It is usually not possible to obtain measurements for all possible values of an attribute in a continuous sense. So, the desired range of an attribute is broken down into multiple probe points and the test is conducted to obtain measurements for each one of them.

As an example, the sorting system mentioned above may have only two attributes, the number of input items and the size of individual items, and only one metric, the time consumed. Each iteration can run with a different value of the attribute tuple (#items, item-size), exploring different points in the scale-space. In order to explore scalability properties up to 1M items and 256-byte items, one possibility is to design the iterations as follows: 10 probe points for #items (100K, 200K, … 1M) and 3 probe points for item-size (8B, 64B, and 256B). This way, the number of iterations necessary is 10x3 = 30. In general, the number of iterations necessary is the product of the number of probe points of each attribute. The combinations are the cross product of the probe points. The results of such a test design can be organized as shown in Table 1

Test Input Id	Attributes		Metrics
Test Input Id	No. of entries	Item size in bytes	Time consumed in ms
1	100000	8
2	200000	8
…	…	…
10	1000000	8
11	100000	64
12	200000	64
…	…	…
20	1000000	64
21	100000	256
22	200000	256
23	200000	256
…	…	…
30	1000000	256

Table 1

Depending on the time necessary to run the test, it is possible to run multiple repetitions of a specific iteration and take their mean value as the representative. Other statistical attributes such as median and variance can be considered as well [ Winder17 ]. This provides more confidence in the observed metrics’ values.

If multiple attribute values are changed from one iteration to the next, then it is difficult to reason which attribute is responsible for the change in the metrics’ values. So, ideally, change one attribute at a time from one iteration to the next. However, if that makes the number of iterations too large to be practical, then some additional strategies may be considered, for example, reduce the number of probe points or increase iterations only in those areas where there are sudden jumps in observed metrics’ values.

Preserve history : If the software running the system is undergoing change, then the scale characteristics may change from one version to another. Hence, it is important to save the results on a per-test per-version basis and track if the characteristics change from one version to the next or over a series of versions. If things become worse in one version, then it is usually necessary and useful to compare with the results from the most recent previous version that is available.

The idea is that the (continuous) build machinery and the scale testing machinery proceeds independently. The scale test pipeline usually picks up a build, runs all the scale tests, reports the results, and then goes to pick up the next build. By that time, it is possible that multiple builds have been done and that is okay. Thus, there may not be data available for all builds. If however, there is a need, it is possible to go back, pick up any build such as one that was earlier skipped and run the test on it. Table 2 contains an example of how the test data can be saved.

Test Input Id	Attribute Tuple (#items, item-size)	Build version	Date	Testbed Id	Metrics (time consumed in ms)
10	(1M, 8B)	108	Jan 2, 2018	42	5
10	(1M, 8B)	111	Jan 3, 2018	42	5
10	(1M, 8B)	120	Jan 9, 2018	42	7

Table 2

Since the attributes and the metrics are likely to be different, each test needs a different schema for its table. The test metrics form a time-series on how a particular metric performed over time. Thanks to the preservation, the performance degradation in Build 120 is easily identified. From the history, it can also be concluded that the degradation happened sometime after Build 111. To find out further whether the regression happened in one build or over a sequence of builds, the intermediate builds need to be tested; a binary search pattern may be used.

Lifecycle of a test

A test goes through multiple stages. First, the test is designed. The design involves choosing the input trigger, the output to verify, the attribute set, and the metric set. The design is reviewed by the appropriate stakeholders. Second, the test is run manually. This is a proof of concept to verify that the test design is okay. Third, test software is written to run the test automatically. This involves invoking commands to configure the testbed, load the test software, set up monitoring for the metrics, trigger the input, collect the output, verify if the actual output matches with the expected output, collecting the logs, and report the results. At this point, the test is ready to be activated. It can be used by humans running scale tests as well as the scale test harness where scale tests are invoked automatically.

Scale testing is mostly performed along with the software development, after unit testing, integration testing, and system testing. During the development cycle, the organization can run the scale tests on as many testbeds as are available and as frequently as things permit. The purpose and the nature of data collection are different after the product is released (i.e. in production) and is not covered in this article. The performance numbers obtained from the scale tests on the final released version of the software are (selectively) published externally and saved as a benchmark for later software versions.

Conclusion

Scalability testing is a broad subject and this article merely scratches the surface. It does not cover related topics such as micro-benchmarking, vertical scaling (scale up) vs. horizontal scaling (scale out). The article explains different aspects of managing scalability tests: designing the test (identifying the relevant attributes and metrics), planning the number of iterations and the attribute combinations to use in each iteration, performing the test by running those iterations, saving the results, verify expected scalability properties, and identify regressions.

Acknowledgements

Many thanks to the Overload reviewers and the editor Frances Buontempo for their suggestions that have helped improve this article.

Note: The opinions expressed in this article are solely the author’s – not author’s employers’.

References

[Orr14] Roger Orr, Overload #124 (December 2014) ‘Order Notation in Practice’ https://accu.org/index.php/journals/2043

[StackOverflow] StackOverflow, ‘What is microbenchmarking?’ https://stackoverflow.com/questions/2842695/what-is-microbenchmarking

[Wikipedia-a] Scalability Testing https://en.wikipedia.org/wiki/Scalability_testing

[Wikipedia-b] Sorting algorithm https://en.wikipedia.org/wiki/Sorting_algorithm

[Winder17] Russel Winder, Overload #141 (October 2017) ‘Marking Benches’ https://accu.org/index.php/journals/2427