Author: Matsievsky Nikolay aka sunnybear
Published: 03 October, 2008

How does gzip affect server performance?

A note: this is an English translation of two Russian articles about gzip and CPU costs.

There are a lot of articles about enabling gzip in Apache for HTML, CSS, and JS files, but there is little research on how mod_gzip or mod_deflate affects server performance. We can serve any textual file compressed, reduce its size, and thereby increase the load speed of a website. So this research aims to answer the following questions: how much does enabling gzip slow down the server? Is it reasonable for high-load projects? Are there cases when gzip shouldn't be used?

The model

First of all, we should determine what costs are associated with the compression process itself. We can represent those costs in the following way:

gzip = disk read/write + library initialization + archive creation

Let us assume (and the data below generally confirms this assumption) that the first two parts do not vary with file size (our files range from 500 bytes up to 128 KB) and are more or less constant compared to the last part. In fact, we found that the file system costs do depend on file size, but only very slightly; there will be a few words about this below.

Naturally, the CPU cost of compression itself should be a linear function of file size (the linear assumption gives a smaller error than the alternatives), so we can write our formula in the following way:

gzip = FS + LI + K*size

Here FS is the file system cost, LI is the library initialization plus any other constant cost of the particular gzip implementation, and K is the proportionality coefficient between file size and the CPU cost of gzipping that file.

Test suites

To perform our research we need to run two sets of tests:

  • Gzip tests, to get pairs of numbers "size — gzip cost"
  • File system tests, to get pairs of numbers "size — FS cost"

The only questions left after such a task definition are: why only two sets (equations)? What about the library initialization costs (LI)? With two sets we have a system of linear equations and can easily calculate all required parameters (K and the constant costs). With a third set of tests we would be solving an overdetermined system, which isn't required here and would significantly increase the complexity of the research. And with a statistical approach (a set of tests rather than a single test) we can reduce most errors to a minimum.
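
As an illustration, a minimal Python sketch of that calculation might look like the following (the input pairs are assumed to be the averaged "size — time" results of the two test sets described below; function names and sample numbers are hypothetical):

    # Sketch: recover K and the constant costs from the two test sets.
    # Inputs are assumed to be averaged results: (size in bytes, time in ms)
    # pairs for the "gzip" set and for the "file system only" set.

    def fit_linear(points):
        """Ordinary least squares for y = a + b*x over a list of (x, y) pairs."""
        n = len(points)
        sx = sum(x for x, _ in points)
        sy = sum(y for _, y in points)
        sxx = sum(x * x for x, _ in points)
        sxy = sum(x * y for x, y in points)
        b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        a = (sy - b * sx) / n
        return a, b

    def fit_gzip_model(gzip_pairs, fs_pairs):
        """Subtract the file system time for each size, then fit gzip = const + K*size.
        Returns (constant costs, K)."""
        fs_by_size = dict(fs_pairs)
        cpu_only = [(size, t - fs_by_size[size]) for size, t in gzip_pairs]
        return fit_linear(cpu_only)

    # Hypothetical usage with made-up numbers:
    # const, K = fit_gzip_model([(500, 0.9), (64000, 2.0), (128000, 3.1)],
    #                           [(500, 0.7), (64000, 0.75), (128000, 0.8)])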

For all test sets an ordinary HTML file was taken (to keep the situation very close to a real one). Then its first 500, 1000 ... 128000 bytes were cut into separate files, and all files were gzipped and read/written with built-in OS (FreeBSD, 2.8 GHz) utilities. This way we avoided the overhead of any external programming language.
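
The original tests were driven by the OS's own utilities; a rough Python approximation of the gzip part of the procedure (in-memory compression with the zlib module, an illustrative file name, a smaller iteration count and an assumed 500-byte step) could look like this:

    import time
    import zlib

    # Rough approximation of the gzip measurements: time the compression of
    # successively longer prefixes of an ordinary HTML page. The 500-byte step,
    # the file name and the iteration count are assumptions.
    SIZES = range(500, 128001, 500)
    ITERATIONS = 1000                     # the article used 10000 per set

    def avg_gzip_ms(data, iterations=ITERATIONS):
        """Average time in milliseconds to compress `data` in memory."""
        start = time.perf_counter()
        for _ in range(iterations):
            zlib.compress(data, 6)        # 6 is the default zlib/gzip level
        return (time.perf_counter() - start) * 1000 / iterations

    html = open("page.html", "rb").read()
    gzip_pairs = [(size, avg_gzip_ms(html[:size]))
                  for size in SIZES if size <= len(html)]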

Test results

Gzipping gave us the following graph. You can see that most of the costs are related to the file system, not to gzip itself. Here and below all times are in milliseconds. Each test set contains 10000 iterations.

Figure 1. Gzip costs vs. file size

Now we can add the file system costs to the plot, subtract them from the total costs, and get the following picture:

Figure 2. Gzip and file system costs

Open/read/write costs do depend on file size, but the dependency is very weak for our files. This lets us build a model of CPU costs as a function of file size (using the formula above). So we get the following graph:

Figure 3. Real and modeled gzip costs

A few words about the file system

A question: why do we need additional tests for file system performance? Why can't we simply measure the gzip time for each file size?

An answer: firstly, every web server takes the file from the file system anyway, so these costs are already included in the server's response time (time to first byte). All gzip operations are performed in memory. We only need to measure how much the response time increases when gzip is enabled (i.e. when the server performs some additional operations in memory).

Secondly, not every web server reads from the hard disk directly. Some proxies (squid, nginx) can cache files in memory, so file system costs don't influence their response time. So we need to exclude those costs from our measurements.

What is faster: gzip or broadband?

Our model approximates the data well, so we can use it as a basis for further calculations. What we actually need to know is how much higher (or perhaps lower) the CPU costs are compared to the cost of transferring the bytes eliminated by gzipping. We can plot several graphs to get the final results. For the initial CPU costs a Dual Xeon 2.8 GHz was taken.

The user also spends some time unzipping the archive; we can bound this time by the gzip cost on a 1 GHz PC (unzipping is a simpler procedure than gzipping, so the real time is much smaller). So below the cost of transferring the extra (uncompressed) data is compared with the CPU cost of gzipping that data (in milliseconds), for two different connection speeds (100 Kb/s and 1500 Kb/s) and two different servers (280 MHz and 1 GHz).
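
A back-of-the-envelope version of this comparison can be sketched as follows (the concrete numbers are illustrative assumptions, not the measured values behind Figure 4):

    # Compare the transfer time saved by compression with the CPU time spent on
    # gzip (server) plus gunzip (client). All numbers here are illustrative.

    def transfer_ms(size_bytes, bandwidth_kbps):
        """Time in ms to transfer size_bytes over a link of bandwidth_kbps KB/s."""
        return size_bytes / (bandwidth_kbps * 1024) * 1000

    def gzip_wins(size_bytes, ratio, bandwidth_kbps, server_gzip_ms, client_gunzip_ms):
        """True if the transfer time saved exceeds the CPU time spent.
        `ratio` is compressed size / original size (e.g. 0.25 for typical HTML)."""
        saved_ms = transfer_ms(size_bytes * (1 - ratio), bandwidth_kbps)
        return saved_ms > server_gzip_ms + client_gunzip_ms

    # Example: a 30 KB page compressed to 25% of its size on a 100 KB/s link,
    # with hypothetical 2 ms gzip and 1 ms gunzip costs:
    # gzip_wins(30 * 1024, 0.25, 100, 2.0, 1.0)  -> True (225 ms saved vs 3 ms spent)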

Figure 4. Cost of transferring the extra data vs. gzip CPU costs, for 100/1500 Kb/s connections and 280/1000 MHz servers

Gzip level

At this point it is not quite clear how CPU costs depend on the gzip compression level and how we can forecast them if we know all the other parameters.

So a new set of tests was aimed at determining the dependency between gzip level, CPU costs and file size reduction. Based on this, an analytical tool to calculate the optimal gzip level was created.

Calculations

Sets of 10000 iterations were performed on the server. Gzip times for each gzip level were recorded and then the average over a set was calculated. Then the file system costs were subtracted from this average time (since gzip operations are performed on a bare OS). The difference in file size was also measured. All tests were performed with an HTML file (120 KB).
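
A simplified in-memory version of this per-level test (zlib levels instead of the gzip utility; the file name and iteration count are illustrative) could be:

    import time
    import zlib

    # For each compression level, measure the average time and the resulting
    # size relative to the original. Run on an ~120 KB HTML page as in the article.
    data = open("page.html", "rb").read()
    ITERATIONS = 1000

    for level in range(1, 10):
        start = time.perf_counter()
        for _ in range(ITERATIONS):
            compressed = zlib.compress(data, level)
        avg_ms = (time.perf_counter() - start) * 1000 / ITERATIONS
        percent = len(compressed) * 100 / len(data)
        print(f"level {level}: {avg_ms:.3f} ms, {percent:.1f}% of original size")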

Results

We got the following plot for the "CPU costs — gzip level" dependency. X is the gzip level and Y is the CPU cost (averaged over a set of tests):

Figure 5. CPU costs for different gzip levels

So now we can plot the efficiency of each gzip level in terms of file size reduction (% of the initial file size):

Figure 6. Efficiency of different gzip levels

Conclusion

All conclusions are integrated into this tool: you can set the server CPU, the end user's CPU and the connection speed, and the calculator will give you the optimal gzip level for your case (or tell you that gzip shouldn't be enabled on the server at all). By varying the parameters you can cover a lot of possible cases; the total error is no greater than 10%.

A note: if all files are smaller than 4 KB, gzip shouldn't be enabled because we won't get any significant increase in load speed.
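
The calculator itself isn't reproduced here, but the kind of check it performs can be sketched as follows (the 4 KB cutoff comes from the note above; the per-level table and all other numbers are placeholder assumptions that would have to be filled in from measurements like those in Figures 5 and 6):

    # Sketch of the kind of decision such a calculator makes.
    # level -> (CPU ms per KB of input, compressed size as a fraction of original);
    # the values are placeholders, not measured data.
    LEVEL_TABLE = {1: (0.010, 0.32), 6: (0.015, 0.26), 9: (0.022, 0.25)}

    def best_gzip_level(size_kb, bandwidth_kbps, cpu_scale=1.0):
        """Return the gzip level with the largest net time saving, or None if
        gzip shouldn't be enabled. cpu_scale > 1 models a slower CPU."""
        if size_kb < 4:                   # small files: no significant speed-up
            return None
        best_level, best_saving = None, 0.0
        for level, (ms_per_kb, ratio) in LEVEL_TABLE.items():
            transfer_saved_ms = size_kb * (1 - ratio) / bandwidth_kbps * 1000
            cpu_spent_ms = size_kb * ms_per_kb * cpu_scale
            saving = transfer_saved_ms - cpu_spent_ms
            if saving > best_saving:
                best_level, best_saving = level, saving
        return best_level

    # best_gzip_level(120, 100) returns 9 with these made-up numbers;
    # best_gzip_level(3, 1500) returns None because of the 4 KB cutoff.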

All the figures in this article are quite expressive: with a not-very-fast connection and a slower CPU, enabling gzip will give you a remarkable increase in load speed. If you have an intranet server (with a local network speed above 2 Mb/s), enabling gzip may be unnecessary (likewise if your website consists mostly of very small files). In most cases, though, gzip will be very helpful for your web server.

It should also be mentioned that if we gzip textual files we can serve them faster and free server resources sooner. For high-load projects this can be very important.