Archive for January, 2016

Python zipfile and tarfile differences

A project of mine involves extracting files from .tar.gz and .zip archives as Python streams.

No problem, there are zipfile and tarfile.

As this is Python and the modules do similar things, you might expect they have similar interfaces. Or at least consistent interfaces.

Unfortunately they are annoyingly different. Consider:

task | tarfile | zipfile
open an archive | tarfile.open(filename) | zipfile.ZipFile(filename)
get a list of members of the archive | opened_tarfile.getmembers() | opened_zipfile.infolist()
get name of a member | tarfile_member.name | zipfile_member.filename
get size of a member | tarfile_member.size | zipfile_member.file_size
extract an archive member (create a file on the hard disk) | opened_tarfile.extract(tarfile_member) | opened_zipfile.extract(zipfile_member)
get a member as a file-like Python object | opened_tarfile.extractfile(tarfile_member) | opened_zipfile.open(zipfile_member)
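
To make the comparison concrete, here is a rough sketch of one function that hides the naming differences from the table. The function name iter_archive_members is made up for this example, and using zipfile.is_zipfile() to choose between the two modules is my assumption, not the only way to tell the formats apart. It reads each member fully into memory, which is fine for a demo but not for large files.

```python
import io
import tarfile
import zipfile


def iter_archive_members(path):
    """Yield (name, size, data) for each regular file in a ZIP or tar archive.

    Hypothetical helper for illustration only.
    """
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            for info in archive.infolist():
                if info.filename.endswith("/"):
                    continue  # directory entries end with a slash
                data = archive.open(info).read()
                yield info.filename, info.file_size, data
    else:
        # tarfile.open() in its default mode detects gzip compression itself.
        with tarfile.open(path) as archive:
            for info in archive.getmembers():
                if not info.isfile():
                    continue  # skip directories, links, devices
                data = archive.extractfile(info).read()
                yield info.name, info.size, data
```

The two branches do the same job, and every attribute and method name in them is different.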

There are a few more catches. If you open the ZIP archive with ZipFile(zipfilename) and extract more than one member, each extraction opens and closes the ZIP file separately, so use with open(zipfilename, 'rb') as zipfp: ZipFile(zipfp) instead.
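
A minimal sketch of that workaround, spelled out. The function name read_all_members is invented for this example. The only important part is that ZipFile receives an already opened binary file object rather than a filename.

```python
import zipfile


def read_all_members(zip_path):
    """Return {member name: bytes}, reusing one file handle for all members.

    Hypothetical helper for illustration only.
    """
    contents = {}
    # Open the file ourselves, in binary mode, and hand the object to ZipFile.
    with open(zip_path, "rb") as zipfp:
        with zipfile.ZipFile(zipfp) as archive:
            for info in archive.infolist():
                with archive.open(info) as member:
                    contents[info.filename] = member.read()
    return contents
```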

Also, for tar archives, in Python 3, the result of opened_tarfile.extractfile() inherits from BufferedReader and so supports a context manager. In Python 2 it inherits from object, implements read() itself, and doesn't include the __enter__() and __exit__() methods required to support a context manager. Extracting members out of ZIP archives with opened_zipfile.open() gets a context manager-capable object since Python 2.7.
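
One way around the missing context manager, as far as I can tell, is contextlib.closing(), which only needs the object to have a close() method. The helper name read_tar_member is made up for this sketch.

```python
import contextlib
import io
import tarfile


def read_tar_member(opened_tarfile, tarfile_member):
    """Return the bytes of one tar member, closing the stream afterwards.

    Hypothetical helper for illustration only.
    """
    fileobj = opened_tarfile.extractfile(tarfile_member)
    if fileobj is None:
        # extractfile() returns None for members that are not files or links.
        raise ValueError("not a regular file: %s" % tarfile_member.name)
    # closing() calls fileobj.close() on exit, so __enter__/__exit__
    # are not needed on the object itself.
    with contextlib.closing(fileobj) as stream:
        return stream.read()
```

The same wrapper is harmless on objects that already support with, so it can be used for both modules.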

I understand what happened here: tarfile was added in 2.3 and was different from (and better than) zipfile because tarfile's author didn't like the [zipfile] interface very much. New things were added to zipfile in 2.6 and again in 2.7, and tarfile was improved in Python 3. But that doesn't make it any less annoying to write code that works with both tarfile and zipfile on both Python 2 and Python 3. We're stuck with two frustratingly different interfaces for very similar tasks for a while.