Notes on unpacking a PyInstaller executable

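Introduction

Analyzing a Python program that has been packaged with PyInstaller is a common task. This post records, step by step, how I unpacked and decompiled a PyInstaller-packaged program named xxxx.exe.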

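Step 1: Identify the packaging tool

The program's icon and file characteristics indicated that it was a Python script packaged with PyInstaller. PyInstaller is a popular packaging tool that bundles a Python script, its dependencies, and an interpreter into a standalone executable.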
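
Besides the icon, a more direct check is to look for the magic bytes of PyInstaller's embedded archive, which is the marker pyinstxtractor itself searches for. The following is a minimal sketch under that assumption; the function name and the file path are placeholders, not part of any tool.

```python
# Magic bytes that mark the CArchive "cookie" inside a PyInstaller executable.
PYINSTALLER_COOKIE = b"MEI\x0c\x0b\x0a\x0b\x0e"


def looks_like_pyinstaller(data: bytes) -> bool:
    """Return True if the cookie magic appears anywhere in the file contents."""
    return PYINSTALLER_COOKIE in data


# Usage (the path is a placeholder):
# with open("xxxx.exe", "rb") as f:
#     print(looks_like_pyinstaller(f.read()))
```

A match is only a strong hint; a modified bootloader could use a different marker.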

Step 2: Unpack with pyinstxtractor

I used pyinstxtractor, an extraction tool written specifically for PyInstaller executables:

python pyinstxtractor.py xxxx.exe

It printed the following output:

[+] Processing xxxx.exe
[+] Pyinstaller version: 2.1+
[+] Python version: 36
[+] Length of package: 5612452 bytes
[+] Found 59 files in CArchive
[+] Beginning extraction...please standby
[+] Possible entry point: pyiboot01_bootstrap.pyc
[+] Possible entry point: xxxx.pyc
[+] Found 133 files in PYZ archive
[+] Successfully extracted pyinstaller archive: xxxx.exe

Extraction produced a directory named xxxx.exe_extracted, which contains all of the extracted pyc and dll files.

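Step 3: Determine the Python version

Decompiling a pyc file correctly requires knowing exactly which Python version produced it. I confirmed the version in two ways:

(1) The extracted directory contains python311.dll, so the program was built with Python 3.11.

(2) To narrow down the exact build, open any pyc file under the PYZ-00.pyz_extracted directory (for example base64.pyc) and look at its header. The first four bytes are A7 0D 0D 0A.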
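
Both checks can be scripted. The sketch below is only an illustration with hypothetical helper names: one function parses a version out of a DLL name such as python311.dll, the other returns the first four bytes of a pyc file as hex.

```python
import re


def version_from_dll_names(names):
    """Return (major, minor) parsed from a name like 'python311.dll', or None."""
    for name in names:
        # One digit for the major version, the remaining digits for the minor.
        match = re.fullmatch(r"python(\d)(\d+)\.dll", name.lower())
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


def magic_hex(pyc_bytes: bytes) -> str:
    """Return the first four bytes of a pyc file as upper-case hex."""
    return pyc_bytes[:4].hex().upper()


# Usage (paths are placeholders):
# import os
# print(version_from_dll_names(os.listdir("xxxx.exe_extracted")))
# with open("xxxx.exe_extracted/PYZ-00.pyz_extracted/base64.pyc", "rb") as f:
#     print(magic_hex(f.read(4)))
```

The pattern deliberately skips python3.dll, which carries no minor version.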

The magic number for each Python version is listed in the official CPython repository: https://github.com/python/cpython/blob/main/Include/internal/pycore_magic_number.h
The entries for Python 3.11 are:

Python 3.11a1 3450 Use exception table for unwinding ("zero cost" exception handling)
Python 3.11a1 3451 (Add CALL_METHOD_KW)
Python 3.11a1 3452 (drop nlocals from marshaled code objects)
Python 3.11a1 3453 (add co_fastlocalnames and co_fastlocalkinds)
Python 3.11a1 3454 (compute cell offsets relative to locals bpo-43693)
Python 3.11a1 3455 (add MAKE_CELL bpo-43693)
Python 3.11a1 3456 (interleave cell args bpo-43693)
Python 3.11a1 3457 (Change localsplus to a bytes object bpo-43693)
Python 3.11a1 3458 (imported objects now don't use LOAD_METHOD/CALL_METHOD)
Python 3.11a1 3459 (PEP 657: add end line numbers and column offsets for instructions)
Python 3.11a1 3460 (Add co_qualname field to PyCodeObject bpo-44530)
Python 3.11a1 3461 (JUMP_ABSOLUTE must jump backwards)
Python 3.11a2 3462 (bpo-44511: remove COPY_DICT_WITHOUT_KEYS, change
MATCH_CLASS and MATCH_KEYS, and add COPY)
Python 3.11a3 3463 (bpo-45711: JUMP_IF_NOT_EXC_MATCH no longer pops the
active exception)
Python 3.11a3 3464 (bpo-45636: Merge numeric BINARY_*INPLACE_* into
BINARY_OP)
Python 3.11a3 3465 (Add COPY_FREE_VARS opcode)
Python 3.11a4 3466 (bpo-45292: PEP-654 except*)
Python 3.11a4 3467 (Change CALL_xxx opcodes)
Python 3.11a4 3468 (Add SEND opcode)
Python 3.11a4 3469 (bpo-45711: remove type, traceback from exc_info)
Python 3.11a4 3470 (bpo-46221: PREP_RERAISE_STAR no longer pushes lasti)
Python 3.11a4 3471 (bpo-46202: remove pop POP_EXCEPT_AND_RERAISE)
Python 3.11a4 3472 (bpo-46009: replace GEN_START with POP_TOP)
Python 3.11a4 3473 (Add POP_JUMP_IF_NOT_NONE/POP_JUMP_IF_NONE opcodes)
Python 3.11a4 3474 (Add RESUME opcode)
Python 3.11a5 3475 (Add RETURN_GENERATOR opcode)
Python 3.11a5 3476 (Add ASYNC_GEN_WRAP opcode)
Python 3.11a5 3477 (Replace DUP_TOP/DUP_TOP_TWO with COPY and
ROT_TWO/ROT_THREE/ROT_FOUR/ROT_N with SWAP)
Python 3.11a5 3478 (New CALL opcodes)
Python 3.11a5 3479 (Add PUSH_NULL opcode)
Python 3.11a5 3480 (New CALL opcodes, second iteration)
Python 3.11a5 3481 (Use inline cache for BINARY_OP)
Python 3.11a5 3482 (Use inline caching for UNPACK_SEQUENCE and LOAD_GLOBAL)
Python 3.11a5 3483 (Use inline caching for COMPARE_OP and BINARY_SUBSCR)
Python 3.11a5 3484 (Use inline caching for LOAD_ATTR, LOAD_METHOD, and
STORE_ATTR)
Python 3.11a5 3485 (Add an oparg to GET_AWAITABLE)
Python 3.11a6 3486 (Use inline caching for PRECALL and CALL)
Python 3.11a6 3487 (Remove the adaptive "oparg counter" mechanism)
Python 3.11a6 3488 (LOAD_GLOBAL can push additional NULL)
Python 3.11a6 3489 (Add JUMP_BACKWARD, remove JUMP_ABSOLUTE)
Python 3.11a6 3490 (remove JUMP_IF_NOT_EXC_MATCH, add CHECK_EXC_MATCH)
Python 3.11a6 3491 (remove JUMP_IF_NOT_EG_MATCH, add CHECK_EG_MATCH,
add JUMP_BACKWARD_NO_INTERRUPT, make JUMP_NO_INTERRUPT virtual)
Python 3.11a7 3492 (make POP_JUMP_IF_NONE/NOT_NONE/TRUE/FALSE relative)
Python 3.11a7 3493 (Make JUMP_IF_TRUE_OR_POP/JUMP_IF_FALSE_OR_POP relative)
Python 3.11a7 3494 (New location info table)
Python 3.11b4 3495 (Set line number of module's RESUME instr to 0 per PEP 626)

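The magic number is 4 bytes of binary data: a 2-byte little-endian integer followed by the bytes \r\n. Given the decimal value from the table, the following code produces the corresponding bytes: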

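# Encode 3495 as two little-endian bytes, then append the fixed b'\r\n' suffix.
MAGIC_NUMBER = (3495).to_bytes(2, 'little') + b'\r\n'
# Read the four bytes as one big-endian integer so the hex matches the file dump.
_RAW_MAGIC_NUMBER = int.from_bytes(MAGIC_NUMBER, 'big')
print(hex(_RAW_MAGIC_NUMBER))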

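This prints 0xa70d0d0a, which matches the bytes at the start of base64.pyc. The magic number is therefore 3495, the value introduced in 3.11b4 and the last 3.11 entry in the table, so the file was compiled by Python 3.11b4 or a later 3.11 release.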
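
The lookup can also be done in the other direction, starting from the bytes found in the file. This is a small sketch with a hypothetical helper name; it assumes the usual layout of a 2-byte little-endian number followed by \r\n.

```python
import importlib.util


def magic_to_number(magic: bytes) -> int:
    """Decode a 4-byte pyc magic into the integer listed in the CPython table."""
    if len(magic) != 4 or magic[2:] != b"\r\n":
        raise ValueError("not a pyc magic value")
    return int.from_bytes(magic[:2], "little")


print(magic_to_number(bytes.fromhex("A70D0D0A")))     # 3495
print(magic_to_number(importlib.util.MAGIC_NUMBER))   # value for the running interpreter
```

If the second line prints the same number as the first, the local interpreter produces pyc files compatible with the target.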

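Step 4: Repair the pyc file header

The file we want to reverse is xxxx.pyc, but it cannot be decompiled directly yet. PyInstaller strips the header from this pyc file, including the magic number and the timestamp, so the header has to be restored by hand before a decompiler will accept the file.
Header of xxxx.pyc: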

E3 00 00 00 00 00 00 00 | 00 00 00 00 00 05 00 00
00 00 00 00 00 F3 18 02 | 00 00 97 00 64 00 64 01
6C 00 5A 00 64 00 64 01 | 6C 01 5A 01 64 00 64 01
6C 02 5A 02 64 00 64 01 | 6C 03 5A 03 64 00 64 01

Header of base64.pyc:

A7 0D 0D 0A 00 00 00 00 | 00 00 00 00 E3 00 00 00
00 00 00 00 00 00 00 00 | 05 00 00 00 00 00 00 00
00 F3 A2 02 00 00 97 00 | 64 00 5A 00 64 01 64 02
6C 01 5A 01 64 01 64 02 | 6C 02 5A 02 64 01 64 02

At first I could not see how the two were related.
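
One way to make the difference between the two dumps concrete is to test for two markers: the \r\n that ends a pyc magic, and the byte 0xE3 with which a marshalled code object starts here (marshal's code tag 'c', 0x63, with the reference flag 0x80 set). The sketch below was not part of the original workflow, and the helper names are hypothetical.

```python
def has_pyc_header(data: bytes) -> bool:
    """True if the data starts with a pyc magic (two bytes followed by CR LF)."""
    return len(data) >= 4 and data[2:4] == b"\r\n"


def starts_with_code_object(data: bytes) -> bool:
    """True if the first byte is marshal's code tag with the reference flag set."""
    return data[:1] == b"\xe3"  # ord('c') | 0x80


# First 16 bytes of each dump shown above.
xxxx_head = bytes.fromhex("E3000000000000000000000000050000")
base64_head = bytes.fromhex("A70D0D0A0000000000000000E3000000")

print(has_pyc_header(xxxx_head), starts_with_code_object(xxxx_head))      # False True
print(has_pyc_header(base64_head), starts_with_code_object(base64_head))  # True False
```

So xxxx.pyc begins directly with the code object, while base64.pyc has header bytes in front of it.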

Next, create a new conda environment whose Python version matches the pyc files above, write a hello world script, and compile it to a pyc file:

python -m py_compile hello_world.py

Now look at the header of the newly compiled file:

A7 0D 0D 0A 00 00 00 00 | 42 3A 64 68 27 01 00 00
E3 00 00 00 00 00 00 00 | 00 00 00 00 00 04 00 00
00 00 00 00 00 F3 A6 00 | 00 00 97 00 64 00 A0 00
00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00

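Now the pattern is visible. The second row of this dump starts with "E3 00 00 00 00 00 00 00", the same bytes that begin xxxx.pyc, and the bytes "00 00 97 00 64 00" in the third row also appear at the same position in xxxx.pyc. The first 16 bytes are therefore the header that xxxx.pyc is missing.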
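
For reference, a pyc header in Python 3.7 and later holds four little-endian 4-byte fields: magic, flags, and then either the source modification time plus source size (when flags is 0) or a source hash. The sketch below decodes the 16 bytes from the hello world dump under that assumption; the function name is hypothetical.

```python
import struct


def parse_pyc_header(header: bytes):
    """Split a 16-byte pyc header (Python 3.7+) into its four fields."""
    magic, flags, mtime, source_size = struct.unpack("<4sIII", header[:16])
    return magic, flags, mtime, source_size


# First row of the hello world dump shown above.
hello_header = bytes.fromhex("A70D0D0A00000000423A646827010000")
magic, flags, mtime, source_size = parse_pyc_header(hello_header)
print(magic.hex(), flags, mtime, source_size)
```

Here flags is 0, so the last two fields are the timestamp and the size of hello_world.py.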

We therefore insert the 16 bytes "A7 0D 0D 0A 00 00 00 00 | 42 3A 64 68 27 01 00 00" at the beginning of xxxx.pyc and save the result as xxxx_patched.pyc.
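
The same patch can be applied with a few lines of Python instead of a hex editor. This is a minimal sketch: the helper name is hypothetical, the file names are placeholders, and it assumes the header is copied from a pyc compiled by a matching interpreter.

```python
def add_pyc_header(raw: bytes, header: bytes) -> bytes:
    """Return the raw marshalled code with a 16-byte pyc header in front of it."""
    if len(header) != 16:
        raise ValueError("expected a 16-byte header")
    if raw[:4] == header[:4]:
        return raw  # already starts with this magic, leave it unchanged
    return header + raw


# Usage (file names are placeholders):
# with open("__pycache__/hello_world.cpython-311.pyc", "rb") as f:
#     header = f.read(16)
# with open("xxxx.pyc", "rb") as f:
#     raw = f.read()
# with open("xxxx_patched.pyc", "wb") as f:
#     f.write(add_pyc_header(raw, header))
```

The magic check makes the function safe to run twice on the same file.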

Step 5: Attempt decompilation

Trying uncompyle6

I first tried to decompile the file with uncompyle6:

uncompyle6 xxxx_patched.pyc

However, uncompyle6 only supports Python versions up to 3.8, and our file is from 3.11, so it cannot be used here.

Trying pycdc

I then switched to another decompiler, pycdc:

git clone https://github.com/zrax/pycdc.git
cd pycdc
mkdir build
cmake -S . -B build
cmake --build build
./pycdc.exe xxxx_patched.pyc

This failed with an error:

Error decompyling xxxx_patched.pyc: vector too long

Using pycdas to get the bytecode

When direct decompilation fails, pycdas, the disassembler that ships with pycdc, can dump the bytecode instead:

./pycdas xxxx_patched.pyc

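I then gave the disassembled bytecode to a large language model (such as DeepSeek) and worked through the reverse engineering by hand with its help, which eventually reconstructed the original Python code.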
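
If an interpreter of the same version as the target is available, the standard library can produce a comparable listing without pycdas. The sketch below is an alternative I am assuming would work for this case, not what was used above; it must run under the Python version that produced the pyc, because marshal data is version-specific.

```python
import dis
import io
import marshal


def disassemble_pyc(pyc_bytes: bytes, header_size: int = 16) -> str:
    """Skip the pyc header, unmarshal the code object, and return its disassembly."""
    code = marshal.loads(pyc_bytes[header_size:])
    out = io.StringIO()
    dis.dis(code, file=out)  # also descends into nested functions and classes
    return out.getvalue()


# Usage (the file name is a placeholder):
# with open("xxxx_patched.pyc", "rb") as f:
#     print(disassemble_pyc(f.read()))
```

Only pass files you are willing to unmarshal: marshal.loads does not execute the code, but it is not hardened against malformed input.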

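Lessons learned

  1. Version matching is critical: both PyInstaller extraction and pyc decompilation depend on matching the exact Python version.
  2. Combine tools: no single tool solves every problem, so be ready to chain several together.
  3. Manual analysis remains necessary: when automated tools fail, reading the bytecode by hand is the last resort.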

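Recommended tools

  1. Extraction: pyinstxtractor
  2. Decompilation: uncompyle6 (Python 3.8 and earlier), pycdc and its companion disassembler pycdas
  3. Auxiliary tools:
    • WinHex (binary analysis)
    • Large language models (explaining bytecode)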