30

If all data is essentially just a bit string, then all data can be represented as a number. Because a compression algorithm, c(x), must reduce or keep the same length as the input, the compressed file must be smaller than or equal to the input (and greater than or equal to 0). This can be stated as 0 <= c(x) <= x. For compression to be useful there must also be a decode algorithm, d(x), which returns the original input, i.e. d(c(x)) = x.

Start with the number/data item 0. 0 <= c(0) <= 0, hence c(0) = 0, and d(0) = 0. Then 0 <= c(1) <= 1, and c(1) != 0 as d(0) = 0, so c(1) = 1. We can continue this: 0 <= c(2) <= 2, c(2) != 0 as d(0) = 0, c(2) != 1 as d(1) = 1, so c(2) = 2. This pattern can then repeat forever, showing that without losing any data, no compression algorithm can compress data to a size smaller than the original input.
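
To make the forcing step concrete, here is a small Python sketch (just an illustration; the function name is made up) that replays the same induction:

# Replays the induction above: any injective "compressor" c that satisfies
# 0 <= c(x) <= x is forced to be the identity, so nothing ever shrinks.
def forced_compressor(n):
    used = set()                  # outputs already claimed by smaller inputs
    c = {}
    for x in range(n):
        # the only candidates are 0..x, but everything below x is already taken
        candidates = [y for y in range(x + 1) if y not in used]
        assert candidates == [x]  # c(x) has no choice but to equal x
        c[x] = x
        used.add(x)
    return c

print(forced_compressor(10))      # {0: 0, 1: 1, ..., 9: 9}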

I do understand how some compression algorithms work, such as run-length encoding (RLE), but I cannot see how they avoid this issue. One problem I can see with RLE and similar algorithms is how to transmit plain, unrepeated text. E.g., if x = 8888123, you can compress that to "four 8s, then 123", which is then transmitted as 48123. But then how can you transmit an input that is literally 48123?

And if you then add a special delimiter signal to signify that the "4" is a repeat count, then you are adding even more data. If you use a special signal, such as 1111, then you would transmit 111148123 to represent 8888123, but then how do you transmit an input that is literally 111148123, and so on? This is just an example of why a demonstration of RLE compressing data doesn't show that compression is actually useful.

Mercury
  • 475

7 Answers

119

There can be no algorithm that losslessly compresses all inputs. But there are many algorithms that losslessly compress many inputs. And it turns out that most of the strings we like to operate on have enough internal redundancy that they compress rather well.

Think of it as a Robin Hood-style exchange: the algorithm maps an infinite number of possible inputs to an equally infinite number of possible outputs, so if it maps some strings to shorter strings, it must map some other strings to longer strings. But we can arrange things so that it takes from the dull strings to give to the interesting strings :-)

Kilian Foth
  • 110,899
35

Compression works by making more likely patterns shorter at the cost of less likely patterns

Suppose we have a data format which is a string of 4 possible values, each of which we could represent with 2 bits:

ABAABAACADAADCAAABAAA

Normally, we could store this as 2 bits per character (A = 00, B = 01, C = 10, D = 11). This 21-character string would cost 42 bits in that format, like this:

000100000100001000110000111000000001000000

However, notice that there are more As than any other letter. We can exploit this and instead use a variable number of bits depending on the letter:

  • A as 1
  • B as 001
  • C as 010
  • D as 011

Notice that the code for A has gotten shorter, while every other letter needs a longer code so that it cannot be confused with A. Now, if we encode the string, we'd get this:

10011100111010101111011010111001111

Which is only 35 bits long. Overall, we've saved seven bits.

However, if we tried to encode the string DDDDDD in the compact encoding, its length would be 18 bits instead of the 12 we'd get uncompressed. So for this "less likely" string the compression actually made the string significantly longer.
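
As a quick check, here is a short Python sketch (just an illustration, not part of the original answer) that applies both codes and reproduces the bit counts above:

# The fixed-width and variable-width codes from above.
FIXED = {"A": "00", "B": "01", "C": "10", "D": "11"}
VARIABLE = {"A": "1", "B": "001", "C": "010", "D": "011"}

def encode(text, code):
    return "".join(code[ch] for ch in text)

s = "ABAABAACADAADCAAABAAA"
print(len(encode(s, FIXED)))        # 42 bits
print(len(encode(s, VARIABLE)))     # 35 bits

print(len(encode("DDDDDD", FIXED)))     # 12 bits
print(len(encode("DDDDDD", VARIABLE)))  # 18 bits

Because no codeword in the variable-width code is a prefix of another codeword, the 35-bit string can still be decoded unambiguously.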

Not all compression is based on replacing specific characters. Some compression algorithms use run-length encoding, which requires extra bits when a sub-string is not repeated. Some include a table of replacements in the file itself, which allows the most flexibility but costs the size of the decoding table. Some compression systems have a bit flag that lets you turn off compression altogether, but even in this case you still pay the cost of the flag.

In practice though, most files we compress have a lot of patterns. For example, most text on the internet is printable ASCII, even on foreign-language sites, because of HTML tags and markup. Thus, shortening printable ASCII characters at the cost of non-printable characters and high-value Unicode code points almost always gives significantly shorter output. This is just one example of a very strong pattern that makes compression very effective in the real world.

mousetail
  • 458
19

I recommend learning about Shannon entropy, but the simple answer is: lossless compression can't compress all inputs, but it can compress some inputs, and in practice that is enough.

To demonstrate, I'd like to amend your RLE algorithm: a 0 followed by another digit n indicates that the next n digits are literal and not compressed. The 0 count was free, because we would never need to encode zero repetitions of anything. This means that 48123 can be encoded either as 1418111213 or as 0548123, but we'll only care about the shortest encoding.

Now think of all the possible 9-digit inputs. In the best case, some of them are compressible to 2 digits: 9n, where n is whatever digit is repeated nine times. In the worst case, we need 11 digits to "compress" the sequence, with 2 extra digits to encode that we're not actually going to compress anything.

But what is the average case? If we had a million different inputs, how much storage space would we need in total?

That depends on how common each input is. Imagine the input is actually about detecting some relatively rare event, so that 90% of the inputs are all zeroes, and in 99% of the remaining inputs only one of the digits is non-zero. Then the average output size would only be about 2.5 digits, even though some output values take up 11 digits.
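
A minimal Python sketch of this amended scheme (my own illustration, assuming counts and literal-block lengths are capped at 9 so each fits in one digit):

# Toy digit RLE as amended above. Two candidate encodings are produced:
#   pairs:   "cd" means digit d repeated c times (counts capped at 9)
#   literal: "0n" followed by n literal digits (blocks capped at 9 digits)
# As in the answer, we keep whichever candidate is shorter.
def rle_pairs(digits):
    out, i = "", 0
    while i < len(digits):
        run = 1
        while i + run < len(digits) and digits[i + run] == digits[i] and run < 9:
            run += 1
        out += str(run) + digits[i]
        i += run
    return out

def rle_literal(digits):
    out = ""
    for i in range(0, len(digits), 9):
        block = digits[i:i + 9]
        out += "0" + str(len(block)) + block
    return out

def compress(digits):
    return min(rle_pairs(digits), rle_literal(digits), key=len)

print(compress("000000000"))   # "90"          best case: 2 digits
print(compress("48123"))       # "0548123"     shorter than "1418111213"
print(compress("123456789"))   # "09123456789" worst case: 11 digits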

So it's really the combination of two helpful facts:

  1. While we can't compress all inputs, we can limit how bad the worst case gets.
  2. Most data we care about compressing is relatively low in entropy, which means it is highly compressible.

As an aside, you know how if you're watching a streaming video on like Netflix or YouTube, and there's snow or other busy small details rapidly changing, the resolution drops and everything gets blurry? That's because those types of images have high entropy and are harder to compress, which means the only way to compress that part of the video without it stopping to buffer is to lose information. With lossless compression, we can't make a trade-off like that, which means it's not suitable for streaming video. If we only care about total storage space, then it doesn't matter if there are some parts that take more space, as long as there is enough compression on average.

Jasmijn
  • 1,904
12

For example, if you have a file with 16-bit audio, you can obviously represent it with 16 bits per sample. However, a simple lossless compressor will have an algorithm that predicts the next sample and only stores the difference between the real and the predicted next sample (predictor + corrector). The corrections usually need far fewer than 16 bits. Take a 1,000 Hz sine wave that hasn't been produced 100% cleanly: your lossless compressor will predict a 1,000 Hz sine wave and store the tiny differences between its prediction and the real samples.

And if your audio file contains just random 16-bit values, then lossless compression won't work very well. Note: you can always get away with only one wasted bit. Have bit 0 of your "compressed" data mean "compressed" or "uncompressed", so for any sample that your lossless compression algorithm cannot shrink, you just set bit 0 to 1 and follow with the original data.
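
A rough sketch of the predictor + corrector idea in Python (an illustration only, using a simple linear predictor rather than whatever a real codec would use):

import math

# A 1,000 Hz sine wave quantized to 16-bit samples at a 44,100 Hz sample rate.
RATE = 44100
samples = [round(30000 * math.sin(2 * math.pi * 1000 * t / RATE)) for t in range(200)]

# Linear predictor: guess each sample by extrapolating the previous two,
# then keep only the correction (the residual).
residuals = []
for i, s in enumerate(samples):
    prediction = 0 if i < 2 else 2 * samples[i - 1] - samples[i - 2]
    residuals.append(s - prediction)

print(max(abs(s) for s in samples))        # near 30,000: needs about 16 bits
print(max(abs(r) for r in residuals[2:]))  # a few hundred: needs far fewer bits

The residuals cover a far smaller range than the raw samples, so a following entropy-coding stage can store them in far fewer than 16 bits each.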

gnasher729
  • 49,096
7

If our data were sequences of random values, and longer sequences were less likely than shorter ones, then lossless compression would on average be worse than useless.

But our data is far from random sequences of values. Take for example a piece of English text encoded in ASCII.

The first observation we can make is that not all byte values are used, and of those that are used, some are used far more frequently than others. By assigning shorter codes to frequently used characters and longer codes to less frequently used characters, we can encode the text in a smaller space.

Stepping out a level we can observe that the text is formed of words. Many combinations of letters simply never occur. Stepping out another level we find it's not just words, but combinations of words that are used repeatedly.

General purpose lossless compression algorithms are designed to find and describe patterns in somewhat arbitrary data. They work well in cases where the patterns are easily distinguishable and the distance between repeats of a pattern is not too long.

Most compressors are also designed with an "escape hatch": if no patterns can be found, or if describing the patterns takes more space than the original data did, there is a mechanism for passing the data through with only a small amount of overhead.
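
A sketch of such an escape hatch (an illustration using Python's zlib, not any particular format's actual mechanism): prepend a one-byte flag and fall back to storing the raw data whenever compression doesn't help.

import os
import zlib

# One-byte flag: 1 = deflate-compressed payload follows, 0 = raw payload follows.
def pack(data: bytes) -> bytes:
    compressed = zlib.compress(data)
    if len(compressed) < len(data):
        return b"\x01" + compressed
    return b"\x00" + data                  # escape hatch: store it uncompressed

def unpack(blob: bytes) -> bytes:
    flag, payload = blob[0], blob[1:]
    return zlib.decompress(payload) if flag == 1 else payload

text = b"the quick brown fox jumps over the lazy dog " * 100
noise = os.urandom(4096)                   # incompressible random bytes

print(len(pack(text)), len(text))          # far smaller than the original
print(len(pack(noise)), len(noise))        # only one byte larger than the original
assert unpack(pack(text)) == text and unpack(pack(noise)) == noise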

General purpose compressors tend not to do very well on things like photographs, audio and video, though. The patterns in these types of data do exist, but they are often fuzzier, requiring some "correction" to be applied on top of the pattern. Another problem is that images and video are multi-dimensional: the video is cut into frames and the image is cut into lines, so pixels that are next to each other in the image can end up in distant locations in the file. Special purpose compressors can do better because they know what sorts of patterns to look for and because they can work in multiple dimensions.
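
As a rough sketch of what working in two dimensions buys you (my own illustration, loosely in the spirit of a per-row delta filter rather than any specific format): predicting each pixel from the one directly above it turns a smooth vertical gradient into rows of near-constant values, which a general purpose byte compressor then handles far better.

import zlib

# A smooth 8-bit grayscale gradient: each row is slightly brighter than the last.
WIDTH, HEIGHT = 64, 64
image = [[min(255, 2 * y + x // 16) for x in range(WIDTH)] for y in range(HEIGHT)]

# "Up" filter: replace each pixel with its difference from the pixel directly above.
filtered = [image[0][:]]                   # the first row has no row above it
for y in range(1, HEIGHT):
    filtered.append([(image[y][x] - image[y - 1][x]) % 256 for x in range(WIDTH)])

raw = bytes(v for row in image for v in row)
flt = bytes(v for row in filtered for v in row)

# The filtered bytes are almost all the same value, so a generic one-dimensional
# compressor suddenly finds the two-dimensional pattern easy to exploit.
print(len(zlib.compress(raw)), len(zlib.compress(flt)))   # filtered is much smaller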

Peter Green
  • 2,326
4

Other answers are right: not all data can be compressed. I'm not aiming to replace the accepted answer, but to provide some other descriptive text that might also be useful.

I'd like to expand on the idea, though, that "we can arrange things so that it takes from the dull strings to give to the interesting strings".

It turns out that a lot of data you care about is actually "interesting".

Perhaps the clearest example I can provide is snow. No, I'm not talking about slightly melted ice. I'm talking about "visual static". See Wikipedia's page on "Noise (video)" for some examples.

Now, as it turns out, some people may find looking at such snow to be interesting. In particular, sometimes when old televisions didn't get a full visual signal, but their antenna picked up something, the end result was that people could sometimes make out some vague shapes moving in what was mostly just a bunch of visual gibberish.

Okay, so maybe you think a bunch of random dots of different grey shades is interesting. However, here is the key question: is any one of them any more interesting than any other?

What if the seventeenth pixel from the left and twenty-seven rows down from the top were four shades darker? Would that image be more or less interesting to you? Would you even notice? Even if you were told to check, and you had a way to measure it and did, would you care?

See, an average individual possible screen of pretty randomized snow is, frankly, rather boring, not really any more or less interesting than a similar picture.

Yet, arranged pixels can create glorious images that we care about.

Okay, let's take a look at another example: a sine wave with frequency modulation. This sine wave typically moves upward and downward between values of positive one and negative one. What you typically find is that the wave moves up, then down, then up again. What makes it interesting is how quickly the sine wave is moving up or down.

If the thing the sine wave is measuring produced truly random data instead of useful ("interesting") data, then you wouldn't see an organized wave. You would see dots at random places. Maybe they wouldn't even follow the rule of staying between negative one and positive one.

See, if I told you:

0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5

you see an interesting pattern. If I told you:

0.5 0.2 973 -4087265 0.42 -10 -15 8279827329.328932

Such numbers have no discernible pattern, and so they aren't interesting.

Interesting spreadsheets store information in rows and columns in ways that are interesting. In contrast, one specific instance of really random data is typically not any more interesting than any other piece of truly random data. (And if we don't need to reproduce a specific piece of random data exactly, then lossless compression is probably less desirable than lossy compression, which can compress more aggressively.)

So, when we work with data, we typically work with "interesting" data, which has some sort of discernible patterns. Data compression software tends to look for such patterns.

For instance, English text tends to use ASCII codes from 65 to 90 (A-Z) and 97 to 122 (a-z) and a few others (spaces, periods, sometimes numbers and commas). Much more rarely do you use some of the other characters on a keyboard (like curly braces { }, which may get heavily used in computer programming, but not in typical English writing), and it is quite uncommon to use some of the other byte values, like the character ¼. But while English doesn't typically use such characters, data compression can typically use all of the possible values that a byte can be set to, so there is a larger pool of available codes than the group of characters that needs to be represented. Combine that with concepts like vowels showing up more often, and it turns out that text is typically highly compressible.
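
To put a rough number on that (a sketch of my own, not part of the original answer): the Shannon entropy of a piece of English text estimates how many bits per character an ideal frequency-based coder would need, and it comes out well below the 8 bits per character of plain ASCII storage.

import math
from collections import Counter

text = (
    "So, when we work with data, we typically work with interesting data, "
    "which has some sort of discernible patterns. Data compression software "
    "tends to look for such patterns."
)

counts = Counter(text)
total = len(text)

# Shannon entropy: the average number of bits per character an ideal coder
# would need if it gave shorter codes to more frequent characters.
entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())

print(f"{entropy:.2f} bits per character")                 # roughly 4, not 8
print(len(counts), "distinct characters out of 256 possible byte values")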

Jasmijn's answer describes what FAQS.org's Compression FAQ (section 8) calls the "pigeon hole" principle. Allow me to visualize this a bit:

All possible values with three bits:

000
001
010
011
100
101
110
111

All possible values with two bits or less:

00
01
10
11

You can't compress all eight of the 3-bit possibilities into just two bits of information. If those "two bit" variations represent "compressed data", there are only four possible compressed files that can exist. Those can decompress into at most four of the original/uncompressed 3-bit possibilities. The remaining 3-bit possibilities have no smaller form left, because every smaller form already has a meaning (it decompresses to a different 3-bit value).

Granted, you could try some techniques like also using 1-bit values, but you'll still come up short, and then you probably need to store additional information, like the length of the compressed file. Keep in mind that every single bit you use up, for anything, could have represented twice as many potential files.
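
A tiny Python sketch (just an illustration) that counts both sides of that picture, including the 1-bit outputs mentioned above:

from itertools import product

# Every 3-bit original, and every possible shorter output of 1 or 2 bits.
originals = ["".join(bits) for bits in product("01", repeat=3)]
shorter = ["".join(bits) for n in (1, 2) for bits in product("01", repeat=n)]

print(len(originals))   # 8 files we would like to compress
print(len(shorter))     # only 2 + 4 = 6 possible shorter outputs

# A losslessly decodable scheme needs one distinct output per input, so at
# least 8 - 6 = 2 of the originals cannot be given a shorter form.
print(len(originals) - len(shorter), "originals must stay at 3 bits or grow")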

TOOGAM
  • 149
2

Compression works using data structures. You say that, for example:

8888123

Is compressed as:

48123

But that's not how RLE (or any compression algorithm) works. Instead, the output from your example would look more like this:

[4, 8] [1, 1] [1, 2] [1, 3]

We now have four output symbols, but we need to make some assumptions about our input data to create a system where some symbols are favored at the expense of other symbols.

Let's say that we define the output symbols as:

struct {
  unsigned int length : 4;  // stored run length minus one (runs of 1-16)
  unsigned int value  : 4;  // the value being repeated (0-15)
};

Where we assume that each input value is a digit from 0-9 (technically, 0-15 would fit), and the maximum run length per symbol is 16 (the stored length plus one). Armed with these assumptions, we can then write the output as:

[0011 1000] [0000 0001] [0000 0010] [0000 0011]

Here, by declaring that all values are input as numbers, we can say that we've turned 7 bytes into 4. By constraining the inputs to certain ranges, we improve the compression for most use cases.
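
A quick Python sketch of that packing (my own illustration of the [length, value] byte layout described above; the helper names are made up):

# Pack each (run_length, value) pair into one byte: the high nibble stores
# run_length - 1 (so runs of 1-16 fit), the low nibble stores the value (0-15).
def pack_runs(runs):
    return bytes(((length - 1) << 4) | value for length, value in runs)

def unpack_runs(data):
    return [((byte >> 4) + 1, byte & 0x0F) for byte in data]

runs = [(4, 8), (1, 1), (1, 2), (1, 3)]      # RLE of "8888123"
packed = pack_runs(runs)

print(packed.hex())          # 38010203: 4 bytes instead of 7 digits
print(unpack_runs(packed))   # [(4, 8), (1, 1), (1, 2), (1, 3)]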

Alternatively, consider the concept of limiting ourselves to ASCII codes from 0-127, and treating the value as an ASCII string. We now have a setup where we can use the MSB to indicate if this is a literal value, or a repeating value.

To encode 8888123, we output the data as:

[1000 0011] // Run indicator, stored length 3 = run of 4
[0011 1000] // value '8'
[0011 0001] // No run indicator, value '1'
[0011 0010] // No run indicator, value '2'
[0011 0011] // No run indicator, value '3'

We can also encode values between 128-255 using a longer code:

[1000 0000] // Run indicator, stored length 0 = run of 1
[1010 1010] // value 170

This is suboptimal if we end up with a lot of these higher characters, but our RLE scheme is optimized for repeating characters between codes 0 and 127. As long as that assumption holds true, we win.
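
A small Python sketch of this MSB scheme (my own illustration of the encoding described above, storing run lengths as length minus one so each run byte covers 1-128 repeats):

# Toy RLE with an MSB run indicator, as described above:
#   0xxxxxxx            -> one literal byte in the range 0-127
#   1nnnnnnn, vvvvvvvv  -> byte v repeated n + 1 times (runs of 1-128)
def encode(data: bytes) -> bytes:
    out, i = bytearray(), 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 128:
            run += 1
        if run == 1 and data[i] < 128:
            out.append(data[i])                        # literal, one byte
        else:
            out += bytes([0x80 | (run - 1), data[i]])  # run header + value
        i += run
    return bytes(out)

def decode(data: bytes) -> bytes:
    out, i = bytearray(), 0
    while i < len(data):
        if data[i] < 128:
            out.append(data[i])
            i += 1
        else:
            out += bytes([data[i + 1]]) * ((data[i] & 0x7F) + 1)
            i += 2
    return bytes(out)

sample = b"8888123"
packed = encode(sample)
print(packed.hex())                  # 8338313233 -> 5 bytes instead of 7
assert decode(packed) == sample
print(encode(bytes([170])).hex())    # 80aa -> a single high byte costs 2 bytes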

For any lossless algorithm, we have to decide which attributes we are looking for. PNG, for example, favors images that have particular gradients and patterns. Random noise will not compress well, but most images we want to store are not literally random noise. They have patterns, and those patterns can be represented in smaller states.

To see this in action, try compressing a bitmap of some uniform data (e.g. a simple web graphic), a PNG, and an executable file into a ZIP file. The bitmap will likely gain the greatest compression ratio, the PNG a very small compression ratio (especially if it is a PNG of the bitmap), and the executable will most likely be "stored" rather than compressed at all. Most lossless compression algorithms use some variation of finding patterns and expressing those in a compact form.

GIF remembers previously unseen strings of values to build a dictionary as it encodes the data. The decoder then rebuilds the dictionary by reading the same stream that the encoder produced. PNG works primarily by using DEFLATE, a combination of LZ77 and Huffman coding (essentially the same compression that ZIP uses), and can also apply a filter (e.g. a delta for each row) to improve performance even further. The LZ77 part builds a dictionary with a sliding window, finds patterns, and expresses each pattern as an offset into the sliding window instead of repeating all the bytes.

All of the assumptions built into the various formats presume that we want to encode useful images, such as real-life images or QR codes, and not randomly selected pixels. Each algorithm is chosen so that shorter codes represent more useful/common patterns and longer codes represent useless/less common patterns. As a result, most algorithms can efficiently compress the data they are designed for, at the cost of not being able to efficiently encode useless information.

Note: Yes, I oversimplified some of the algorithms, as this answer was already quite long. Interested readers can just search for specific keywords or specifications.