File archiving algorithms
#32866 02/01/15 10:43 AM
Joined: Aug 2009
Likes: 15
OP Online

I've searched, but have been unable to find a comparison of the effectiveness of the various archiving algorithms as respects various types of data.

Trying multiple options to see which one is best every time I need to compress something is a major pain in the butt.

Anybody?

Thanks.


The new Great Equalizer is the SEND button.

In Memory of Harv: Those who can make you believe absurdities can make you commit atrocities. ~Voltaire
Re: File archiving algorithms
artie505 #32883 02/01/15 08:33 PM
Joined: Aug 2009
Likes: 16
Moderator
Online
You can find any number of papers comparing the efficacy of the various data compression algorithms by doing a Google search on "data compression algorithm comparison".

When you speak of archiving, I assume you are thinking of some type of lossless data compression, such as you would use with text, binary code, etc., as opposed to the lossy compression used with graphic images, audio, and video. All the various lossless algorithms depend on statistical redundancy within the data itself, in other words how much of the data is repetitious and how often the repetitions appear in the data set. The various algorithms simply use different methods of identifying and recording the repetitious segments. So regardless of which compression algorithm you choose, the amount of compression you will achieve is heavily dependent on the data itself.
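You can see that dependence for yourself with a few lines of Python (a rough sketch using the built-in zlib module and made-up sample data, not any particular Mac app):

import os
import zlib

repetitive = b"the quick brown fox jumps over the lazy dog\n" * 1000
random_ish = os.urandom(len(repetitive))  # essentially no statistical redundancy

for label, data in (("repetitive text", repetitive), ("random bytes", random_ish)):
    compressed = zlib.compress(data)
    print(f"{label}: {len(data)} -> {len(compressed)} bytes "
          f"({len(compressed) / len(data):.1%} of original)")

The repetitive sample shrinks to a small fraction of its size while the random sample barely budges, even though the identical algorithm handled both.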

Another consideration is how much time and how many CPU cycles you are willing to devote to compression and decompression. Most compression implementations provide a setting for how much compression you wish to achieve, but it is an invariable fact that the greater the compression, the longer the time required to compress and decompress. Even the hoary Zip algorithm provides for that, although few applications that use it give you a way to actually adjust the setting. However, there are several compression tools included in OS X that are accessible, fine tuning included, from the command line. There are also a few GUI utilities in the App Store that handle several different algorithms as well as levels of compression.
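As a rough illustration of the level-versus-time trade-off, here is another Python sketch (/usr/share/dict/words is only a convenient stand-in for whatever you actually want to compress):

import time
import zlib

data = open("/usr/share/dict/words", "rb").read()  # any largish file will do

for level in (1, 6, 9):  # 1 = fastest, 6 = zlib's default, 9 = best compression
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"level {level}: {len(data)} -> {len(compressed)} bytes in {elapsed:.3f}s")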


If we knew what it was we were doing, it wouldn't be called research, would it?

— Albert Einstein
Re: File archiving algorithms
joemikeb #32884 02/01/15 10:00 PM
Joined: Aug 2009
Likes: 15
OP Online

Thanks; the details are much appreciated.

Yes, I'm talking about lossless text, etc. compression, and for that, Keka is pretty robust, including allowing you to select the amount of compression you wish to achieve, even with Zip. (It's got an issue with deleting after archiving that detracts from its overall usefulness, and it doesn't offer as many compression options as other apps I've looked at, but it can create .dmg[s] and .iso[s].)

I'll search later, when I've got more time, and report back.


The new Great Equalizer is the SEND button.

In Memory of Harv: Those who can make you believe absurdities can make you commit atrocities. ~Voltaire
Re: File archiving algorithms
joemikeb #32891 02/02/15 08:11 AM
Joined: Aug 2009
Likes: 15
OP Online

I Googled "data compression algorithm comparison" and got a long list of virtually incomprehensible scholarly papers, but among them was Comparison of compression.

It's not exhaustive, but it gives a good idea of what's what.

Thanks.


The new Great Equalizer is the SEND button.

In Memory of Harv: Those who can make you believe absurdities can make you commit atrocities. ~Voltaire
Re: File archiving algorithms
artie505 #32893 02/02/15 03:55 PM
Joined: Aug 2009
Likes: 16
Moderator
Online
Originally Posted By: artie505
Yes, I'm talking about lossless text, etc. compression, and for that, Keka is pretty robust, including allowing you to select the amount of compression you wish to achieve, even with Zip. (It's got an issue with deleting after archiving that detracts from its overall usefulness, and it doesn't offer as many compression options as other apps I've looked at, but it can create .dmg[s] and .iso[s].)

Maybe I misunderstood what you were looking for when you said "archiving algorithms". Keka is an application, actually a GUI front end to several different archiving and/or compression algorithms. If it is an application you are looking for, Archive Expert offers several different compression algorithms and, perhaps more importantly, the ability to select the level of compression for each.

Originally Posted By: artie505
…among them was Comparison of compression.

Unfortunately, in order to accurately assess what is essentially a mathematical process, it is almost impossible to avoid math, which reduces the comprehension level for many readers. In the article you cite, the author is comparing lossless compression algorithms using a single implementation of each algorithm, on data types most of which are notoriously resistant to lossless compression, such as binary code, audio, and graphics. Another problem with the author's methodology is that no attempt is made to separate the efficiency of the algorithm itself from the efficiency of the code used to implement it. In other words, different implementations of the same algorithm can yield different results in both speed and compression levels. Rather than claiming to compare algorithms, it would be more accurate to say the author is comparing compression applications.
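If you want a quick apples-to-apples test of your own, here is a sketch that pins down one particular set of implementations, the zlib, bz2, and lzma modules built into Python; "sample.txt" is just a placeholder for a file of your own, and a different library implementing the same algorithms could well score differently:

import bz2
import lzma
import zlib

data = open("sample.txt", "rb").read()  # substitute any file you care about

for name, compress in (("zlib (Deflate)", zlib.compress),
                       ("bzip2", bz2.compress),
                       ("LZMA/xz", lzma.compress)):
    out = compress(data)
    print(f"{name}: {len(data)} -> {len(out)} bytes")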

It has long been an axiom in computer science circles that applying lossless compression algorithms to binary code, audio, and graphic data is very likely to make the "compressed" file larger than the original, and at best the amount of compression achieved, if any, is too slight to be worth the CPU cycles, because there is too little repetition in the data set. The data compression algorithms that work for graphic and audio data are invariably "lossy". In truth, lossless data compression works best on text, or as the author of the cited paper calls it, "Office data". That leaves binary code without any truly effective compression technique. Basically, the author has confirmed what any first-year computer science student could have told you.


If we knew what it was we were doing, it wouldn't be called research, would it?

— Albert Einstein
Re: File archiving algorithms
joemikeb #32911 02/03/15 07:47 AM
Joined: Aug 2009
Likes: 15
OP Online

Thanks!

Originally Posted By: joemikeb
Maybe I misunderstood what you were looking for when you said "archiving algorithms". Keka is an application....

Not at all... What I was looking for is a comparison of the effectiveness of the various compression(*) algorithms offered by apps such as Keka and Archive Expert as respects various types of data so I don't have to get involved in experimenting - more out of curiosity than for the potentially minimal benefit to be gained - with "n" algorithms each time I want to compress something. (I mentioned Keka only as what I think is the best example of a compression app that I've run across. I can't d/l Archive Expert, because it's an App Store item, but from what I can see it offers less than Keka and at a significantly higher price. [Time limited offer, only $4.99 now, Original Price is $9.99 v Keka's $1.99.])

Thanks for your insights into my linked doc (...the only one I could both find and understand).

My experience has been that I can't compress pictures: the "compressed" files are equal in size to, or even a bit larger than, the originals. I don't know how the author accomplished his compression; different app than I'm using, perhaps. ("In other words, different implementations of the same algorithm can yield different results in both speed and compression levels.")

So, unless I can find an app that works with them, I guess I'm relegated to sending pics without compression.

In the end, then, I follow "In truth, lossless data compression works best on text, or as the author of the cited paper calls it, 'Office data'", but how does "Office data" differ from "Plain", which appears to compress best?

(*) I think I may have used "archive" and "compress" interchangeably when they're not?


The new Great Equalizer is the SEND button.

In Memory of Harv: Those who can make you believe absurdities can make you commit atrocities. ~Voltaire
Re: File archiving algorithms
artie505 #32913 02/03/15 02:13 PM
Joined: Aug 2009
Likes: 16
Moderator
Online
Pictures can be compressed, but only at the expense of throwing away data, i.e. "lossy" compression such as JPEG and MPEG.

My solution for sending full resolution photographic images is Dropbox, iCloud, or another online file sharing service. They don't reduce either the bandwidth or the CPU cycles required to send/receive the images, but they do get around the maximum file size limitations on most email servers.


If we knew what it was we were doing, it wouldn't be called research, would it?

— Albert Einstein
Re: File archiving algorithms
artie505 #32914 02/03/15 02:38 PM
Joined: Aug 2009
Offline

To add to what JoeMikeB just wrote, the Mail program in Yosemite lets you send up to 5 GB per message to anyone with Mail Drop. When you send a large attachment, recipients get a link to download the file, so it doesn't matter what their email size limits are. iCloud stores the file for free for 30 days, and it doesn't count against your storage space. I haven't personally used it yet, but it sounds good.

Re: File archiving algorithms
artie505 #32919 02/03/15 03:45 PM
Joined: Aug 2009
Offline

Originally Posted By: artie505
I've searched, but have been unable to find a comparison of the effectiveness of the various archiving algorithms as respects various types of data.


Data compression has been an area of study and research for decades. It can be mathematically proven that any compression algorithm you develop will, on the average, create a larger 'compressed' file than the original, even if only by one byte.

This seems rather strange until you realize that "on the average" includes random data. So the goal of a compression program is to identify a pattern that it can replace with a smaller representation of the pattern, or to find multiple occurrences of the same set of data that it can replace with a single copy, or a combination of the two techniques.
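You can watch the random-data case happen with a couple of lines of Python (zlib here, but any lossless compressor behaves the same way on random input):

import os
import zlib

original = os.urandom(1_000_000)         # 1 MB of random bytes: no patterns to exploit
compressed = zlib.compress(original, 9)  # even at maximum effort...
print(len(original), "->", len(compressed))  # ...the output comes out slightly larger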

This relies somewhat heavily on the sort of data you are compressing. Text for example compresses very well due to the variety of patterns and repetitions found in it, as well as the heavy bias in character variety. Databases tend to compress pretty well also, as do many images.

The more the algorithm knows about the data, the better it can compress it. Knowing that there will be regular vertical patterns in a stream of image data (streamed one row at a time), for example, really helps. JPEG, for instance, uses block compression, so knowing vertical as well as horizontal data adjacency is critical. (It's also a LOSSY compression: it won't reproduce the image 100%, and it can even be adjusted for how lossy/small you want the compressed file to be, but a little loss makes for a HUGE improvement in compression.)

Most "general purpose" products (such as zip) will "pre-scan" the data, and make a decision as to which of a variety of compression techniques they have available should be used. Some software may even use a different algorithm for different sections inside a program. An app that programs an accessory may consist of an application, a block of text strings to be displayed in the interface, and an internal storage of the firmware for the device, each of which requires a very different approach. The firmware may already be compressed. Compressed data very rarely can be compressed again for any significant gain. That area the routine may simply store that region of data as-is, with no attempt to compress it.

Sometimes a highly specialized app can do better. An ebook reader for a device with limited storage, like a Nook, may heavily optimize compression for its books, and may even make very minor adjustments to the structure of its ebook files to improve compression. An additional 10-15% compression may be possible with highly targeted programming. Those same routines may not even be capable of accurately compressing a different kind of file.

Most users will use a general purpose program like zip. It's been around for decades and has been tweaked considerably. It's probably the best general-purpose method currently available.


I work for the Department of Redundancy Department
Re: File archiving algorithms
Virtual1 #32931 02/03/15 08:47 PM
Joined: Aug 2009
Likes: 15
OP Online

Thanks for filling in the blanks...making compression more comprehensible.


The new Great Equalizer is the SEND button.

In Memory of Harv: Those who can make you believe absurdities can make you commit atrocities. ~Voltaire
Re: File archiving algorithms
MarkG #32934 02/03/15 10:49 PM
Joined: Aug 2009
Likes: 15
OP Online

Thanks to you and joemike for your input, which I'll file away in a dark corner of my mind in anticipation of a dreaded future.


The new Great Equalizer is the SEND button.

In Memory of Harv: Those who can make you believe absurdities can make you commit atrocities. ~Voltaire
Re: File archiving algorithms
Virtual1 #32936 02/03/15 11:15 PM
Joined: Aug 2009
Likes: 1
Offline

Originally Posted By: Virtual1

Data compression has been an area of study and research for decades. It can be mathematically proven that any compression algorithm you develop will, on the average, create a larger 'compressed' file than the original, even if only by one byte.

This seems rather strange until you realize that "on the average" includes random data. So the goal of a compression program is to identify a pattern that it can replace with a smaller representation of the pattern, or to find multiple occurrences of the same set of data that it can replace with a single copy, or a combination of the two techniques.


To add a bit to what Virtual1 says, it's especially hard to compress images that have already been compressed using things like JPEG compression.

When you compress a regular file, like a text file, it is not at all random. There's such an enormous difference between meaningful language information and randomness that people can look at a string of letters or characters in some utterly unknown alphabet that supposedly represents some utterly unknown language and, just by calculating how much randomness is in it, tell you if it's actually a real language or just gibberish (and if it is a real language, how much information it conveys per block of text)--without knowing anything about the language at all!

Data that appears indistinguishable from randomness is called "high-entropy data." Files that have already been compressed are high-entropy data, because compression works by taking repeating, non-random sequences and replacing them with smaller, random-appearing sequences. Encrypted files are high-entropy data, because the whole point of encryption is to take a file and remove all traces of information from it by making it LOOK random (if you don't know how the encryption works or what the encryption key is). The entire purpose of encryption is to remove any traces of meaningful information from the encrypted text. JPEG pictures are very high entropy--order in the picture is actually permanently and irrevocably removed to facilitate compression.

A purely random sequence of bytes pretty much can't be compressed, at least not meaningfully. That means high-entropy data--already-compressed files, encrypted files, JPEG images, and so on--can't be compressed. There are some compression techniques that are optimized for high-entropy data, but they tend to be modest in their compression when they work at all.

The lower the entropy, the more it's possible to compress. The higher the entropy, the less it's possible to compress.
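If you're curious, entropy is easy to estimate. Here is a rough sketch using only Python's standard library (the word list is just a convenient sample; any text file works); it measures bits of information per byte, first on plain text and then on the compressed version of the same text:

import math
import zlib
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

text = open("/usr/share/dict/words", "rb").read()  # ships with OS X
compressed = zlib.compress(text)

print("plain text:", round(entropy_bits_per_byte(text), 2), "bits per byte")
print("compressed:", round(entropy_bits_per_byte(compressed), 2), "bits per byte")

The lower first number is what leaves room for compression; the compressed output measures close to the eight-bit ceiling, so squeezing it again gets you nowhere.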


Photo gallery, all about me, and more: www.xeromag.com/franklin.html
Re: File archiving algorithms
tacit #32938 02/03/15 11:22 PM
Joined: Aug 2009
Likes: 15
OP Online

Thanks!

(This has evolved into a pretty educational thread.)


The new Great Equalizer is the SEND button.

In Memory of Harv: Those who can make you believe absurdities can make you commit atrocities. ~Voltaire
