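Introduction
Analyzing Python programs packaged with PyInstaller is a common task. This post records the full process of unpacking and decompiling a PyInstaller-packaged program named xxxx.exe.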
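Step 1: Identify the packaging tool
The program icon and the file's characteristics show that this is a Python script packaged with PyInstaller. PyInstaller is a popular Python packaging tool that turns Python scripts into standalone executables.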
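One concrete file characteristic is the archive cookie that PyInstaller appends to the executable. It begins with the byte string MEI\014\013\012\013\016, the same marker pyinstxtractor searches for. The following is a minimal sketch of that check. The function names are made up for illustration, and reading the whole file into memory is a simplification.

```python
from pathlib import Path

# Marker at the start of PyInstaller's archive cookie (octal 014 013 012 013 016).
PYINSTALLER_COOKIE_MAGIC = b"MEI\x0c\x0b\x0a\x0b\x0e"


def contains_pyinstaller_cookie(data: bytes) -> bool:
    """Return True if the PyInstaller cookie marker occurs anywhere in data."""
    # The cookie sits near the end of the file, so search from the back.
    return data.rfind(PYINSTALLER_COOKIE_MAGIC) != -1


def looks_like_pyinstaller(path: str) -> bool:
    """Hypothetical helper: run the marker check on a file on disk."""
    return contains_pyinstaller_cookie(Path(path).read_bytes())
```

A positive result is only a hint. An executable can contain these bytes by coincidence, and a modified PyInstaller build may use a different marker.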
Step 2: Unpack with pyinstxtractor
I used pyinstxtractor, an unpacker written specifically for PyInstaller:

    python pyinstxtractor.py xxxx.exe

It produced the following output:

    [+] Processing xxxx.exe
    [+] Pyinstaller version: 2.1+
    [+] Python version: 36
    [+] Length of package: 5612452 bytes
    [+] Found 59 files in CArchive
    [+] Beginning extraction...please standby
    [+] Possible entry point: pyiboot01_bootstrap.pyc
    [+] Possible entry point: xxxx.pyc
    [+] Found 133 files in PYZ archive
    [+] Successfully extracted pyinstaller archive: xxxx.exe

After unpacking, a directory named xxxx.exe_extracted is created, containing all the extracted pyc and dll files.
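Step 3: Determine the Python version
To decompile a pyc file correctly, you need to know which Python version produced it. I confirmed this in two ways:
(1) The extracted directory contains python311.dll, which indicates Python 3.11. (The "Python version: 36" line in the pyinstxtractor output does not agree with this; the DLL name and the magic number checked next are the more reliable evidence.)
(2) To pin the version down further, open any pyc file under the PYZ-00.pyz_extracted directory (for example base64.pyc) and look at its header. The first four bytes are A7 0D 0D 0A.
The magic numbers for each version's pyc format are listed in the official CPython repository: https://github.com/python/cpython/blob/main/Include/internal/pycore_magic_number.h
The Python 3.11 portion reads: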

    Python 3.11a1 3450 Use exception table for unwinding ("zero cost" exception handling)
    Python 3.11a1 3451 (Add CALL_METHOD_KW)
    Python 3.11a1 3452 (drop nlocals from marshaled code objects)
    Python 3.11a1 3453 (add co_fastlocalnames and co_fastlocalkinds)
    Python 3.11a1 3454 (compute cell offsets relative to locals bpo-43693)
    Python 3.11a1 3455 (add MAKE_CELL bpo-43693)
    Python 3.11a1 3456 (interleave cell args bpo-43693)
    Python 3.11a1 3457 (Change localsplus to a bytes object bpo-43693)
    Python 3.11a1 3458 (imported objects now don't use LOAD_METHOD/CALL_METHOD)
    Python 3.11a1 3459 (PEP 657: add end line numbers and column offsets for instructions)
    Python 3.11a1 3460 (Add co_qualname field to PyCodeObject bpo-44530)
    Python 3.11a1 3461 (JUMP_ABSOLUTE must jump backwards)
    Python 3.11a2 3462 (bpo-44511: remove COPY_DICT_WITHOUT_KEYS, change MATCH_CLASS and MATCH_KEYS, and add COPY)
    Python 3.11a3 3463 (bpo-45711: JUMP_IF_NOT_EXC_MATCH no longer pops the active exception)
    Python 3.11a3 3464 (bpo-45636: Merge numeric BINARY_*INPLACE_* into BINARY_OP)
    Python 3.11a3 3465 (Add COPY_FREE_VARS opcode)
    Python 3.11a4 3466 (bpo-45292: PEP-654 except*)
    Python 3.11a4 3467 (Change CALL_xxx opcodes)
    Python 3.11a4 3468 (Add SEND opcode)
    Python 3.11a4 3469 (bpo-45711: remove type, traceback from exc_info)
    Python 3.11a4 3470 (bpo-46221: PREP_RERAISE_STAR no longer pushes lasti)
    Python 3.11a4 3471 (bpo-46202: remove pop POP_EXCEPT_AND_RERAISE)
    Python 3.11a4 3472 (bpo-46009: replace GEN_START with POP_TOP)
    Python 3.11a4 3473 (Add POP_JUMP_IF_NOT_NONE/POP_JUMP_IF_NONE opcodes)
    Python 3.11a4 3474 (Add RESUME opcode)
    Python 3.11a5 3475 (Add RETURN_GENERATOR opcode)
    Python 3.11a5 3476 (Add ASYNC_GEN_WRAP opcode)
    Python 3.11a5 3477 (Replace DUP_TOP/DUP_TOP_TWO with COPY and ROT_TWO/ROT_THREE/ROT_FOUR/ROT_N with SWAP)
    Python 3.11a5 3478 (New CALL opcodes)
    Python 3.11a5 3479 (Add PUSH_NULL opcode)
    Python 3.11a5 3480 (New CALL opcodes, second iteration)
    Python 3.11a5 3481 (Use inline cache for BINARY_OP)
    Python 3.11a5 3482 (Use inline caching for UNPACK_SEQUENCE and LOAD_GLOBAL)
    Python 3.11a5 3483 (Use inline caching for COMPARE_OP and BINARY_SUBSCR)
    Python 3.11a5 3484 (Use inline caching for LOAD_ATTR, LOAD_METHOD, and STORE_ATTR)
    Python 3.11a5 3485 (Add an oparg to GET_AWAITABLE)
    Python 3.11a6 3486 (Use inline caching for PRECALL and CALL)
    Python 3.11a6 3487 (Remove the adaptive "oparg counter" mechanism)
    Python 3.11a6 3488 (LOAD_GLOBAL can push additional NULL)
    Python 3.11a6 3489 (Add JUMP_BACKWARD, remove JUMP_ABSOLUTE)
    Python 3.11a6 3490 (remove JUMP_IF_NOT_EXC_MATCH, add CHECK_EXC_MATCH)
    Python 3.11a6 3491 (remove JUMP_IF_NOT_EG_MATCH, add CHECK_EG_MATCH, add JUMP_BACKWARD_NO_INTERRUPT, make JUMP_NO_INTERRUPT virtual)
    Python 3.11a7 3492 (make POP_JUMP_IF_NONE/NOT_NONE/TRUE/FALSE relative)
    Python 3.11a7 3493 (Make JUMP_IF_TRUE_OR_POP/JUMP_IF_FALSE_OR_POP relative)
    Python 3.11a7 3494 (New location info table)
    Python 3.11b4 3495 (Set line number of module's RESUME instr to 0 per PEP 626)

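The magic number is 4 bytes of binary data: the version value as a 2-byte little-endian integer, followed by \r\n. Given a decimal value from the table, the following code produces the corresponding bytes:

    # 2-byte little-endian version value followed by CR LF
    MAGIC_NUMBER = (3495).to_bytes(2, 'little') + b'\r\n'
    # Read the 4 bytes as one big-endian integer so the hex output matches the file order
    _RAW_MAGIC_NUMBER = int.from_bytes(MAGIC_NUMBER, 'big')
    print(hex(_RAW_MAGIC_NUMBER))

This prints 0xa70d0d0a, which matches the header bytes A7 0D 0D 0A. The magic value is therefore 3495, which was introduced in 3.11b4 and remains the magic number of the final 3.11 releases.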
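The same check can be run in the other direction, starting from the header bytes of a pyc file and recovering the decimal value to look up in the table. This is a minimal sketch; magic_from_header is a name made up for illustration.

```python
def magic_from_header(header: bytes) -> int:
    """Return the decimal magic value encoded in the first 4 bytes of a pyc file."""
    # Bytes 3 and 4 of every pyc magic are CR LF; anything else is not a pyc header.
    if len(header) < 4 or header[2:4] != b"\r\n":
        raise ValueError("not a pyc header: bytes 3-4 should be 0D 0A")
    # The first two bytes hold the version value in little-endian order.
    return int.from_bytes(header[:2], "little")


if __name__ == "__main__":
    print(magic_from_header(bytes.fromhex("A70D0D0A")))  # 3495
```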
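Step 4: Repair the pyc file header
The file we want to reverse is xxxx.pyc, but it cannot be decompiled directly yet. PyInstaller strips the header from this pyc file, including the magic number and the timestamp, so the header has to be restored by hand before a decompiler will recognize the file. The steps are as follows.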
First, look at the beginning of xxxx.pyc:

    E3 00 00 00 00 00 00 00 | 00 00 00 00 00 05 00 00
    00 00 00 00 00 F3 18 02 | 00 00 97 00 64 00 64 01
    6C 00 5A 00 64 00 64 01 | 6C 01 5A 01 64 00 64 01
    6C 02 5A 02 64 00 64 01 | 6C 03 5A 03 64 00 64 01

For comparison, the beginning of base64.pyc:

    A7 0D 0D 0A 00 00 00 00 | 00 00 00 00 E3 00 00 00
    00 00 00 00 00 00 00 00 | 05 00 00 00 00 00 00 00
    00 F3 A2 02 00 00 97 00 | 64 00 5A 00 64 01 64 02
    6C 01 5A 01 64 01 64 02 | 6C 02 5A 02 64 01 64 02

At this point I could not see any obvious relationship between the two.
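Next, create a new conda environment whose Python version matches the pyc files above, write a hello-world script, and compile it to a pyc file (py_compile writes the result into the __pycache__ directory):

    python -m py_compile hello_world.py
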
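Now look at the beginning of the compiled file:

    A7 0D 0D 0A 00 00 00 00 | 42 3A 64 68 27 01 00 00
    E3 00 00 00 00 00 00 00 | 00 00 00 00 00 04 00 00
    00 00 00 00 00 F3 A6 00 | 00 00 97 00 64 00 A0 00
    00 00 00 00 00 00 00 00 | 00 00 00 00 00 00 00 00

The pattern is now visible. The second row here ("E3 00 00 00 00 00 00 00 ...") has the same layout as the first row of xxxx.pyc, and the third row ("... F3 A6 00 | 00 00 97 00 64 00 ...") lines up with its second row. In other words, xxxx.pyc begins exactly where the first 16 bytes of a normal pyc file end. Those 16 bytes are the pyc header (magic number, flags, timestamp, and source size), which is the part missing from xxxx.pyc.
We therefore insert "A7 0D 0D 0A 00 00 00 00 | 42 3A 64 68 27 01 00 00" at the beginning of xxxx.pyc and save the result as xxxx_patched.pyc.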
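The manual hex edit can also be scripted. The sketch below copies the first 16 bytes of a reference pyc compiled by the same Python version and prepends them to the headerless data. The 16-byte header size applies to Python 3.7 and later; prepend_header is a hypothetical helper, not part of any tool used here.

```python
HEADER_SIZE = 16  # magic (4) + flags (4) + mtime (4) + source size (4), Python 3.7+


def prepend_header(reference: bytes, headerless: bytes) -> bytes:
    """Return headerless pyc data prefixed with the header of a reference pyc."""
    if len(reference) < HEADER_SIZE:
        raise ValueError("reference pyc is shorter than a pyc header")
    # Only the header of the reference file is reused; its code object is discarded.
    return reference[:HEADER_SIZE] + headerless


if __name__ == "__main__":
    # Byte values taken from the hex dumps in this step.
    ref_header = bytes.fromhex("A7 0D 0D 0A 00 00 00 00 42 3A 64 68 27 01 00 00")
    body_start = bytes.fromhex("E3 00 00 00 00 00 00 00")
    print(prepend_header(ref_header, body_start).hex())
```

To apply it to real files, read the reference pyc and xxxx.pyc as bytes, pass them in that order, and write the returned bytes to xxxx_patched.pyc. The timestamp and size fields copied from the reference file do not describe xxxx.pyc; this sketch assumes, as the manual edit does, that the decompiler does not validate them.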
Step 5: Attempt decompilation
Trying uncompyle6
I first tried decompiling with uncompyle6:

    uncompyle6 xxxx_patched.pyc

However, uncompyle6 only supports Python versions up to 3.8, so it cannot handle our 3.11 file.
Trying pycdc
I then turned to another decompiler, pycdc:

    git clone https://github.com/zrax/pycdc.git
    cd pycdc
    mkdir build
    cmake -S . -B build
    cmake --build build
    ./pycdc.exe xxxx_patched.pyc

This failed with an error:

    Error decompyling xxxx_patched.pyc: vector too long

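Getting the bytecode with pycdas
When direct decompilation fails, the pycdas tool that ships with pycdc can still produce a bytecode listing:

    ./pycdas xxxx_patched.pyc

I then gave the bytecode listing to a large language model (such as DeepSeek) to assist the reverse analysis, and in the end successfully recovered the original Python code.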
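If an interpreter of the matching version (3.11 here) is available, the standard library can produce a comparable listing. This is an alternative sketch, not what pycdas does internally. marshal only reliably loads code objects written by the same Python version, so the function refuses files whose magic number differs from the running interpreter's. The function name and the default 16-byte header size (Python 3.7+) are assumptions of this example.

```python
import dis
import importlib.util
import io
import marshal


def disassemble_pyc(data: bytes, header_size: int = 16) -> str:
    """Return a dis listing for pyc data that carries a standard header."""
    # The marshal format is version specific, so insist on a matching magic number.
    if data[:4] != importlib.util.MAGIC_NUMBER:
        raise ValueError("pyc magic does not match the running interpreter")
    # Everything after the header is a marshalled code object.
    code = marshal.loads(data[header_size:])
    buffer = io.StringIO()
    # dis.dis also descends into nested code objects (functions, classes).
    dis.dis(code, file=buffer)
    return buffer.getvalue()
```

Loading the code object this way does not execute it, but malformed input can still make marshal fail, so it is best run on a copy of the file in a throwaway environment.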
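Lessons learned
- Version matching is critical: both unpacking the PyInstaller archive and decompiling the pyc files depend on knowing the exact Python version.
- Combine tools: no single tool solves every problem, so be ready to chain several together.
- Manual analysis remains essential: when automated tools fail, reading the bytecode yourself is the last resort.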
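Recommended tools
- Unpacker: pyinstxtractor
- Decompilers: pycdc and pycdas (uncompyle6 for Python 3.8 and earlier)
- Auxiliary tools:
  - WinHex (for binary analysis)
  - Large language models (for interpreting bytecode)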