@woju woju commented Jul 14, 2025

Make it possible to extract nested archives, which are used in e.g. Mender artifacts. See commit message for details and a (simplified) example.

FileSection.skip() (see below the diff) uses the 2-argument form of
readinto, so attempting to extract archives recursively throws an error.
This commit adds an optional second argument to fix this problem. After
this commit, it is possible to extract nested archives in roughly this
fashion:

    with open(path, 'rb') as file:
        tar_outer = tarfile.TarFile(fileobj=file)
        for ti_outer in tar_outer:
            tar_inner = tarfile.TarFile(
                fileobj=tar_outer.extractfile(ti_outer))
            for ti_inner in tar_inner:
                ...

Nested archives are used in some embedded contexts, for example Mender
artifacts.

Signed-off-by: Wojciech Porczyk <wojciech.porczyk@connectpoint.pl>
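
For readers who want to try the nested pattern from the commit message, here is a minimal sketch that builds a two-level archive in memory and walks it. It is written against CPython's `tarfile` module, on the assumption that the MicroPython port accepts the same `TarFile(fileobj=...)` and `extractfile()` calls for reading; the helper names `make_tar` and `list_nested` and the member names are hypothetical.

```python
import io
import tarfile


def make_tar(members):
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_nested(fileobj):
    """Return (outer_name, inner_name, payload) for every inner member."""
    found = []
    tar_outer = tarfile.TarFile(fileobj=fileobj)
    for ti_outer in tar_outer:
        inner_file = tar_outer.extractfile(ti_outer)
        if inner_file is None:  # directories and links carry no data
            continue
        # The inner archive reads through the file object of the outer one.
        tar_inner = tarfile.TarFile(fileobj=inner_file)
        for ti_inner in tar_inner:
            member = tar_inner.extractfile(ti_inner)
            found.append((ti_outer.name, ti_inner.name, member.read()))
    return found


# Hypothetical layout: outer archive holding one inner archive with one file.
inner = make_tar({"hello.txt": b"hello"})
outer = make_tar({"inner.tar": inner})
result = list_nested(io.BytesIO(outer))
```

The sketch assumes every outer member is itself a tar archive; a real artifact may mix in plain files that would need to be filtered out first.
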
@woju woju force-pushed the tarfile-nested branch from 927617c to f2217df Compare July 14, 2025 09:02
@dpgeorge
Member

Thanks for the patch. I can see why it's needed.

But, this is not CPython compatible, and we strive to retain compatibility where possible.

Now, file.readinto(buf, size) is also not CPython compatible, and that's really where the trouble begins. Supporting the second argument there means all file-like objects passed into TarFile must support this 2-arg readinto form.

So I suggest to fix it by changing how skip works, so it doesn't use the 2-arg form, eg:

--- a/python-stdlib/tarfile/tarfile/__init__.py
+++ b/python-stdlib/tarfile/tarfile/__init__.py
@@ -55,9 +55,12 @@ class FileSection:
         if sz:
             buf = bytearray(16)
             while sz:
-                s = min(sz, 16)
-                self.f.readinto(buf, s)
-                sz -= s
+                if sz >= 16:
+                    self.f.readinto(buf)
+                    sz -= 16
+                else:
+                    self.f.read(sz)
+                    sz = 0
 
 
 class TarInfo:
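
The logic of the diff above can be sketched as a standalone function that runs on any file-like object offering `read()` and the standard 1-argument `readinto()`. This is an illustration rather than the code in the patch: the name `skip_bytes` and the `chunk` parameter are hypothetical, and unlike the diff it checks how many bytes were actually consumed so that a truncated stream cannot cause an endless loop.

```python
import io


def skip_bytes(f, sz, chunk=16):
    """Discard sz bytes from f; return how many bytes could not be skipped."""
    buf = bytearray(chunk)  # small reusable buffer keeps memory use bounded
    while sz:
        if sz >= chunk:
            # Standard 1-argument form: fills at most len(buf) bytes.
            n = f.readinto(buf)
        else:
            # Tail shorter than the buffer: a plain read of the remainder.
            data = f.read(sz)
            n = len(data) if data else 0
        if not n:  # end of stream reached early
            break
        sz -= n
    return sz


f = io.BytesIO(bytes(range(100)))
left = skip_bytes(f, 40)
next_byte = f.read(1)  # first byte after the skipped region
```

Reading the tail with `read(sz)` allocates at most `chunk - 1` bytes, so the memory bound of the buffered loop is preserved.
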

@dpgeorge
Member

For reference, this used to work, but commit 2ca1527 optimised skip to not use too much memory.

@woju
Author

woju commented Aug 1, 2025

> [...] Supporting the second argument there means all file-like objects passed into TarFile must support this 2-arg readinto form.

I thought this was already the case, because as it is, it doesn't work without 2-argument readinto at all, even the outer tarfile can't be extracted.

> So I suggest to fix it by changing how skip works, so it doesn't use the 2-arg form, eg:

Sure, wilco. I'll also change the title of the PR.

@dpgeorge
Member

dpgeorge commented Aug 1, 2025

> I thought this was already the case, because as it is, it doesn't work without 2-argument readinto at all, even the outer tarfile can't be extracted.

Yes, you're right: every fileobj that it uses must support 2-arg readinto. For the most part that's OK, since all C-based MicroPython streams implement it. But Python-based files/streams won't.

For example, if there's a tar file on the host PC and you run mpremote mount . and then, at the REPL, use tarfile.TarFile to open that file on the mounted host PC, it fails with the same problem this PR is addressing:

  File "tarfile/__init__.py", line 129, in __next__
  File "tarfile/__init__.py", line 106, in next
  File "tarfile/__init__.py", line 59, in skip
TypeError: function takes 2 positional arguments but 3 were given

So, instead of trying to add the 2-arg form to all streams/files, it's better to fix it once here so that skip doesn't use the 2-arg form.
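
The failure mode described above can be reproduced with any stream written in Python that implements only the standard `readinto(buf)`. The sketch below uses a hypothetical `PyStream` class as a stand-in for such a stream; the exact wording of the `TypeError` differs between CPython and MicroPython, so only the exception type is checked.

```python
import io


class PyStream:
    """Hypothetical pure-Python stream exposing only the standard readinto(buf)."""

    def __init__(self, data):
        self._f = io.BytesIO(data)

    def read(self, n=-1):
        return self._f.read(n)

    def readinto(self, buf):
        return self._f.readinto(buf)


stream = PyStream(bytes(32))
buf = bytearray(16)

try:
    stream.readinto(buf, 8)  # 2-argument form, not part of the standard API
    two_arg_ok = True
except TypeError:
    two_arg_ok = False

# The standard 1-argument form works on the same object.
one_arg_count = stream.readinto(buf)
```
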

@woju
Author

woju commented Aug 1, 2025

I agree, yes, this sounds like the correct fix. I'll do that the week after; next week I'm on vacation.
