What data compression should be

December 22, 1992 NYT - EVEN a couple of years ago, 40 megabytes of storage seemed like Santa's sack: a bottomless repository. Today, thanks to color graphics, video, sound and byte-spewing engines like Windows and OS/2, some PC buyers consider 100 megabytes a bare minimum, and even 200MB does not seem greedy. Hard disks are like attics or garages; regardless of size, they soon become cluttered.
PERSONAL COMPUTERS; Of Data Compression and Decompression - The New York Times

Back in the days of the C64 doing the impossible was easy and miracles only took a little bit of time. Very frequently the most advanced programmers would look at a new toy and repeat the infamous mantra "That is impossible" as much as 3 times in a row. Back in those days some one claiming to do something impossible was at the top of the list of things one would want to look at. "Show me!". Today however it is more fashionable to keep chanting the "That is impossible" mantra and to become furious if some one suggests you should witness it with your own eyes. Responses like: "What, are you calling me a liar?!?! How dare you!" are not uncommon.

As I still belong to the "magician" class programmers it is safe to say you don't even have to make a prototype anymore. Of course the truth tends to win over time because it is the truth. I still have numerous ideas that would qualify as "impossible" but they aren't easy tricks and they would require a great deal of investigation, trials and bug fixes and nothing would be guaranteed. Before you can make a bigger and better hamburger you first need to make it, claiming you can just doesn't prove anything. Of course the believe nothing is possible effectively prevents such projects from taking off, realistically looking at the risks I'm not sure if that is a bad thing. Developers are like doing the right thing for all the wrong reasons.

Casting my own dreams aside I always wonder what kind of "crazy" ideas of the kind other people came up with. Most of the time there is at least one person who is even more ambitious about the impossible mission.

In the late 80's rumor spread around about a series of new compression algorithms. People claimed the equivalent of making files disappear entirely. We did get several mainstream products that are of course still in use today, strongly depending on the data files or even whole disks can be zipped down beyond half the original size.

I know the key ingredient is the "depending on the data" part. Video is not just video, photos are not just photos and text isn't just text. In theory a tune played on a guitar could be zipped down to musical annotation and a set of string samples. It would take 100 years of coding and 1000 years of compression using our most advanced supper computers but eventually the guitar compression algorithm could very closely approach the theoretical limit. Strings are of course not just strings and so on.

I know lots of compression tricks I'm sure no one ever thought of simply because they are rather silly, "impossible" and much to ambitious. But still those rumors about those "others" remain in the back of my head. One guy, before he was murdered claimed to be able to zip full size films down to 32kb files. Asking about data loss he argued that the small losses actually made the footage look better. Of course 32 kb is "impossible" I know that, don't worry, but lets not use that as an excuse not to think. The only clue he provided was that we have to remember what it is we are actually looking at. If anything at all it was a highly prophetic statement. What is it that we experience as high resolution imaging?

We don't have to get down to 32kb just jet. Without "dreaming" to much there are several indicators that there are technologies on the market today that already do much better than the mainstream implementations.

Here are 2 examples:

Physical Optics Corporation

In the 1990s, anyone who sent a video clip or tried to download one over a network realized that good-quality video consisted of a very large electronic file, and personal computers PCs took a long time to process it. Bandwidth which determined the amount of data that could be transmitted along a communication network and rate of compression which determined the size of a video file respectively were two critical factors in slowing the process. Video files were usually compressed heavily in order to be transmitted over a limited bandwidth, and consequently lost quality when they were displayed again. Physical Optics Corporation POC planned to develop an object-oriented video compression technology that would enable high-fidelity display of video on the receiver�s computer even when the video was transmitted over limited-bandwidth networks. This would be accomplished by developing the Compressed Live Object Video Interaction Singularity CLOVIS approach combining image manipulation principles and intelligent framing techniques. Simply put, POC planned to break down video image files into 1 key objects which would be rendered in terms of lines and shapes following a mathematical theory and allocated more space in the video file, and 2 peripheral objects for higher compression. After transmission, the key objects could be decoded to display on a screen at high quality. Analyzing video files for selective compression, incorporating intelligent framing technique, and then making it compatible with a PC environment were major challenges in this research. As a small employee-owned company, POC was unable to find a venture capital source that would tolerate such high technical risk and long development time. The company applied for funding under the �Digital Video in Information Networks� focused program of the Advanced Technology Program ATP in 1998, and was awarded cost-shared funding for a three-year program starting in January 1999. The project ended in 2002 with the successful development of the CLOVIS technology for video frame compression based on intelligent frame management IFM and new encoding/decoding software. POC also received two U.S. patents for their work. Within three years of completion of this project, POC built several video surveillance products based on CLOVIS. Of these, the most promising is a new flight data recorder that will record video for the first time in aviation history. Another product is the wearable identification tag for soldiers with audio/video information, which received an award from the U.S. Army.
Physical Optics Corporation - Object-Oriented Video Compression for Networks
IHC� Intelligent Hypercompression is an object oriented wavelet based data compression system for security surveillance system and surveillance video system applications. The unique object selection and extraction features high resolution viewing while conserving bandwidth. High quality images can be selected and viewed with higher data rates and higher frame rates than the significantly reduced (data and frame rate) background. IHC is a programmable compression software and hardware solution which allows the user to select the object or background with variable operating parameters, such as compression ratio, frame rate, and data rate. As compared to frame based systems such as MPEG or fractal, IHC� offers a much higher resolution and image quality of the object (s), while working at a lower bandwidth (phone line).
Video Compression Technology and Data Compression Software


Qbit, LLC, a start-up company based in Bethesda, Maryland, is coming out of �stealth� mode to introduce a radical new approach to lossless data reduction that provides extremely high image compression, but loses not even one bit of the original data. In fact, Z-Image is already achieving a 3 to 5x lossless encoding improvement using interframes, and 10x lossless using intraframes.
Qbit is coming out with lossless compression � alex.moskalyuk
Qbit is a research-focused algorithm development and implementation company that specializes in pattern recognition and compression algorithms that can be applied to a wide variety of commercial applications, including lossless compression/decompression for high resolution frame based images and video, and audio.
Qbit. No bit left behind.

The Stick of Jan Sloot

There is a story (I can't remember who told this story first, but it was probably Martin Gardner) about an alien that lands on Earth. The alien wishes to return home with the knowledge of all books on Earth. Unfortunately, his spaceship is so small that he can only take with him a small stick. So what he does is the following. First, he digitizes each book. This translates each book in one huge, but finite, number. Then he attaches all these numbers to each other, which results in an enormous, but still finite, number. In front of this number he places "0.", turning it into a fraction between zero and one. Setting the length of the stick to one, he makes a small cut in the stick at precisely the fraction. When he returns home, he only needs to measure the place of the cut, to get back the fraction, and thus the number, and thus the codes for all the books.

While it seems genius at first glance, of course this kind of information compression is impossible. However, the only reason that it is impossible, is that the universe consists of a finite number of indivisible particles. Were that not the case, in theory the alien's system could work. Of course, it would be unlikely that it could be implemented in practice, since it needs measuring on a scale so fine that even God would not be able to tell the difference between the stored information and the rubbish next to it.

I was reminded of this story when I encountered the affair of Jan Sloot and his Sloot Digital Coding System (SDCS). Since this affair is mainly known in the Netherlands, I will recant it here.

Translated from UT News:

In 1999, Jan Sloot, a common television repair man, discovered a revolutionary coding system that would make hard disks, CD-ROMs, and other data stores superfluous. All movies ever made would fit on one CD-ROM. Sloot attempted to sell his invention to Philips, but he found that Philips' engineers were not interested.

Roel Pieper, at that time a member of the board of Philips, thought the invention was valuable, and decided to join the inventor and his companions. Pieper and Sloot looked for investors all around the world. Pieper himself invested millions in the project. All investors of The Fifth Force, Sloot's company, hoped to become billionaires.

But the paranoid inventor died on September 11, 1999, one day before the invention would be specified in detail in a contract. All his notes, his prototype, and his source code were lost forever.

Translated from GIDS:

The Sloot Digital Coding System (SDCS) would shake the world: a new alphabet for digital storage that didn't use binary code, but a much more efficient method. The principle behind SDCS seems simple. As a text consists of a limited number of characters, a movie consists of a limited amount of colours and sounds. All those basic data were stored in five algorithms in five memory stores. For movies, each algorithm would have a maximum length of 74 Mb. That's 370 Mb in total: the invention's engine. To start the engine, only a proper key was needed. For every page of a book, for every image in a movie, Sloot calculated a unique code. The concatenation of these codes would again result in a unique code. The final code, the key, would be one kilobyte in length, regardless of the length of the movie or the size of the book.

Roel Pieper, professor in Computer Science, commented on people questioning the possibility of such a high compression (translated from UT News):

"It is not about compression. Everyone is mistaken about that. The principle can be compared with a concept as Adobe-postscript, where sender and receiver know what kind of data recipes can be transferred, without the data itself actually being sent."

So we're back to the alien and his stick. Basically, Roel Pieper explains Sloot's 'invention' as the movies already residing on the computers of both parties, but in a deconstructed form, and a key being transferred that specifies exactly how reconstruction must take place.

In principle, this is possible. But the crux is, how long the key needs to be, and how big the data store of deconstructed forms must be. Sloot claims that a key of one kilobyte, and a data store of less than 400 Mb, would suffice. This sounds unbelievable, and it is, in fact, mathematically impossible. There are many different arguments. I prefer the simplest one:

A key of a limited length (whether it is a kilobyte or a terabyte) can only store a limited number of codes, and therefore can only distinguish a finite number of movies. However, the actual number of possible movies is infinite. For, suppose it were finite; in that case there would be a movie that is the longest. By just adding one extra image to the movie, I would have created a longer movie, which I didn't have before. Ergo, the number of possible movies is infinite. Ergo, any key of limited length cannot distinguish every possible movie.

The SDCS is only possible if keys are allowed to become infinite, or the data store is allowed to become infinite (if the data store already contains all movies ever made, a key consisting of a number can be used to select the movie you want to see -- however, in that case it is impossible to have keys for movies that have not been made yet at the time the data store was constructed). This would, of course, make the idea useless.

The whole problem of Pieper's way of thinking (and of every other person who believes that Sloot actually could compress movies with a factor of two million) is that he believes that the key does not need to contain every detail of a movie. Unfortunately, it does.

In the SDCS, it is claimed that no movies are stored, only basic building blocks of movies, such as colours and sounds. So, when a number is presented to the SDCS, it uses the number to fetch colours and sounds, and constructs a movie out of them. Any movie. No two different movies can have the same number, otherwise they would be the same movie. Every possible movie gets its own unique number. Therefore, I should be able to generate any possible movie by loading some unique number in the SDCS.

Think of it: by placing the right number in the SDCS, I can not only get Orson Welles' Citizen Kane. I can get Citizen Kane in colour! Or Citizen Kane backwards! Or Citizen Kane where the credits misspell the name of Everett Sloane, who plays mr. Bernstein, as Everett Sloot! Or Citizen Kane in Swahili! Or Citizen Kane where 'Rosebud' is a teddy bear! Or Citizen Kane with a cameo of myself! Or the cartoon version of Citizen Kane! Or Citizen Kane where Charles Foster Kane wins the election! Or the gay version of Citizen Kane! Or Citizen Kane where Charles Foster Kane is replaced by Jar Jar Binks!

How many movies are possible that are variations on Citizen Kane? More than fit in a number of one kilobyte, I can tell you. More than there are atoms in the universe. An infinite number, actually.

In the SDCS, the details of every possible movie have to be somewhere. They are either in the number, or in the data store. Since the data store is the same for each movie, the details can't be in the data store. Therefore, everything that differs between movies, must be represented in some way in the numbers. Which, for infinite variations of movies, is impossible with finite numbers.

Would it be possible to store, say, all possible movies up to three hours in length, with a limited resolution and a limited sound quality, in a series of finite numbers? Yes, of course that is possible, since in the digital world, there is no difference between data and numbers, thus data of a limited length is equal to a finite number.

Even though Sloot claimed that one kilobyte was enough to store any movie regardless of its length, the question is warranted whether a system as defined by the SDCS is possible for movies of a limited (but realistic) length. The answer is no. One kilobyte is 1024 bytes, and a byte can take 256 different values. With a code of one kilobyte, to represent any movie up to 90 minutes in length, on average a byte must be able to distinguish between all possible sequences of about 5 seconds. Can a sequence of 5 seconds of any movie at all be identified with a number between 0 and 255? Of course not.

The concept of the SDCS, as described, is equivalent to a universal Turing machine. A universal Turing machine (which is actually quite easy to simulate in a computer), can emulate any possible computer program (including a program that shows Citizen Kane which ends with a pie fight) just by feeding it a number. The problem is that even for trivial programs the numbers quickly get enormous, so that it is impossible to filter the rare useful programs from the universe of chaos consisting of all numbers. Therefore, it isn't viable business to sell people a universal Turing machine as a solution to all their business problems.

Evidently, Jan Sloot was a crank. He showed all the signs attributed to cranks. He was paranoid. He felt himself an expert in a field he had no higher education in. He worked alone. He spent decades on one problem any mathematician could have told him was impossible to solve in the first place. He got angry at people who questioned him. He felt misunderstood. He had delusions of grandeur.

I think Jan Sloot really believed he had found a way to compress movies to one kilobyte. Probably he had imagined something as the alien's stick, something that would be theoretically possible in an alternate world, with infinitely divisible particles, but impossible in our reality. He probably lacked the mathematical insight to see that his 'invention' was a farce. In his prototype, he faked his invention, which is why he refused to let anyone near it, and answered only in mystical vagueness to questions. He probably imagined that, with enough money, he could get his invention to work for real. He probably did not feel himself a charlatan or a crook, but an honest businessman, who needed a starting capital to create the world's greatest invention, make some people incredibly rich, and win a Nobel prize.

What truly amazes me is that a man as Roel Pieper, who is a professor of Computer Science no less, could fall for his story, to the point where he invested a huge amount of capital. If his role in this story is really as reported in the media, his credibility as a computer scientist has been seriously tarnished. In my opinion, the University of Twente, with which Pieper is associated, should at least perform an internal investigation, to assess whether Pieper's position can be maintained.

Incidentally, if anyone is interested, I have devised an invention where I can store any movie ever created on a small stick. I need some money to develop this concept further. If you have a couple of million to spare, give them to me. I'll be glad to build a prototype. We're gonna be rich and famous!

September 28, 2004

© 2004 by Pieter Spronck

Romke Jan Bernhard Sloot (27 August 1945, Groningen—11 July 1999, Nieuwegein) was a Dutch electronics technician, who claimed to have developed a revolutionary data compression technique, the Sloot Digital Coding System, which could compress a complete movie down to 8 kilobytes of data— this is orders of magnitude greater compression than the best currently available technology. Despite the impossibility[citation needed] of such a technique there were investors that saw potential. However, Sloot died of a heart attack one day before an attractive deal was signed with Roel Pieper, former CTO and board member of Philips. The story - including an account of a believable demonstration of the technology - is told in modest detail in Tom Perkins' 2007 book, Valley Boy. Perkins, co-founder of the Silicon Valley venture capital firm Kleiner Perkins, had agreed to invest in the technology when Sloot died; Perkins and Pieper would have proceeded after Sloot's death, but a key piece of the technology, a compiler stored on a floppy disk, had disappeared and despite months of searching was never recovered.
Jan Sloot - Wikipedia, the free encyclopedia