
info
All scripts and other results obtained using this decompiler are available in my repository on GitHub.
A few words about decompilation
I will use the assembly code obtained at the previous stage (see the article Modologic. Dissecting the Pathologic virtual machine). It represents a set of 89 opcodes. In fact, all I have is a set of instructions and file headers.
Decompilation into a higher-level language represents, in fact, cross-compilation. This procedure requires a compiler that takes the source code written in one language as input and converts it into code written in another language. To create such a tool, you have to construct an execution graph for the original code and write a simple emulator for the stack machine. But first things first. My previous experience with Angr described in the article Anger management. Welcome to Angr, a symbolic emulation framework helped me a lot in writing a new decompiler. In fact, I repeated typical solutions used in compilers and code analyzers, and they worked! Kudos to the compilation theory.
Control flow graph
When you analyze some code, the first step is to break it down into basic blocks; each block represents a set of instructions that terminates with a control transfer instruction. Then you can construct a graph consisting of such blocks and reflecting the connections: who transfers control to whom.

The screenshot shows that some blocks end with commands that don’t transfer control. This is because the code section following such a block is used as a jump address: another code (usually a loop) transfers control to the middle of the basic block. In such a case, the block is divided into two blocks so that the second one begins with the address to which control can be transferred.
Therefore, first of all, I go through the entire code and record jump addresses. Whether it’s a CALL
or JMP
instruction, I write down where do they transfer control. After that, the code can be divided into basic blocks. You go from the first instruction to the last one and divide them by known jump addresses. For this purpose, I introduced special fields in the instruction class: metadata storing addresses and jump type (conditional or unconditional).
After dividing the code into basic blocks and creating an array of BasicNode
objects, you can start connecting them (i.e. establishing links: which block does the current one transfer control to). Each basic block contains information about its address (i.e. address of its first instruction) and about jump addresses known to its last instruction.
A graph must have a root: the first instruction that starts the control flow. Headers from the assembler file provide a list of possible entry points. I start recursive analysis from the first basic block; after that, the algorithm goes through all jump addresses and adds links to ‘child’ blocks.
Importantly, I go through each block only once. There is a global seen
list: addresses of blocks already seen by the algorithm. They will be ignored during the repeat analysis. This makes it possible to quickly analyze a recursive graph that might include nested loops.
When all links to ‘child’ blocks are established, the graph is ready. This is an analogue of an abstract syntax tree (AST). From now on, my research will be focused on it.
Function parsing
Game’s proprietary language supports equivalents of CALL
and RET
commands in the regular x86 assembler. However, there are no calling conventions like stdcall
. It’s impossible to determine the number of variables passed as function arguments without complete local stack emulation.
That’s why I didn’t cut out the beginnings of functions as separate graph roots. Each call is considered a regular transfer of control, and the body of the called function is part of the body of the calling function. This trick makes it possible to analyze a function in the context of the code that calls it. After the initial analysis, the function can be detached into a separate graph.
Graph construction example
I will demonstrate all analysis phases using a small fire.
file as an example. Its assembly code looks as follows:
RunOp = 0x6
RunTask = 1
GlobalTasks:
GTASK_0 Params = 0
EVENT_5 Op = 0x3 Vars = ()
GTASK_1 Params = 0
EVENT_6 Op = 0x26 Vars = ()
0x0: @ Hold()
0x1: Pop(0)
0x2: Return(); Pop(0)
0x3: @ StopGroup0()
0x4: Pop(0)
0x5: Return(); Pop(0)
0x6: PushEmpty(object, object)
0x7: PushEmpty(bool)
0x8: Call 0x2c
0x9: Pop(0)
0xa: Pop(1); Push((bool) Stack[-1] == 0)
0xb: IF (Stack[-1] == 0) GOTO 0x11; Pop(1)
0xc: PushEmpty()
0xd: Push(-0, 0); TaskCall(0)
0xe: Call 0x0
0xf: Pop(-0, 0); TaskReturn
0x10: Pop(0)
0x11: Push("fire")
0x12: @ FindParticleSystem(Stack[-1], Stack[-2])
0x13: Pop(1)
0x14: Pop(0); Push((bool) Stack[-1] == 0)
0x15: IF (Stack[-1] == 0) GOTO 0x1a; Pop(1)
0x16: Push("Can't find fire particle system")
0x17: @ Trace(Stack[-1])
0x18: Pop(1)
0x19: Return(); Pop(2)
0x1a: Push(CVector(0.0, 0.0, 0.0))
0x1b: Push(CVector(0.0, 1.0, 0.0))
0x1c: Push((float)0.0)
0x1d: @@ AddSource(Stack[-3], Stack[-2], Stack[-1])
0x1e: Pop(3)
0x1f: @@ Enable()
0x20: Pop(0)
0x21: @ Hold()
0x22: Pop(0)
0x23: GOTO 0x21
0x24: Return(); Pop(2)
0x25: Stack[-1] = 0
0x26: PushEmpty()
0x27: Push(-0, 0); TaskCall(0)
0x28: Call 0x0
0x29: Pop(-0, 0); TaskReturn
0x2a: Pop(0)
0x2b: Return(); Pop(0)
0x2c: PushEmpty(bool, bool)
0x2d: @ IsLoaded(Stack[-1])
0x2e: Pop(0)
0x2f: Stack[-3] = Stack[-1]
0x30: Return(); Pop(2)
It can be seen from the headers that there are three entry points: 0x3
, 0x6
, and 0x26
. After going through the code, you can notice two addresses where the control is transferred using Call
: 0x0
and 0x2C
. These are the function entry points. The initial graph looks as follows:
0x3: @ StopGroup0()0x4: Pop(0)0x5: Return(); Pop(0)0x26: PushEmpty()0x27: Push(-0, 0); TaskCall(0)0x28: Call 0x0 0x0: @ Hold() 0x1: Pop(0) 0x2: Return(); Pop(0) 0x29: Pop(-0, 0); TaskReturn 0x2a: Pop(0) 0x2b: Return(); Pop(0)0x6: PushEmpty(object, object)0x7: PushEmpty(bool)0x8: Call 0x2c 0x2c: PushEmpty(bool, bool) 0x2d: @ IsLoaded(Stack[-1]) 0x2e: Pop(0) 0x2f: Stack[-3] = Stack[-1] 0x30: Return(); Pop(2) 0x9: Pop(0) 0xa: Pop(1); Push((bool) Stack[-1] == 0) 0xb: IF (Stack[-1] == 0) GOTO 0x11; Pop(1) 0xc: PushEmpty() 0xd: Push(-0, 0); TaskCall(0) 0xe: Call 0x0 0x0: @ Hold() 0x1: Pop(0) 0x2: Return(); Pop(0) 0xf: Pop(-0, 0); TaskReturn 0x10: Pop(0) 0x11: Push("fire") 0x12: @ FindParticleSystem(Stack[-1], Stack[-2]) 0x13: Pop(1) 0x14: Pop(0); Push((bool) Stack[-1] == 0) 0x15: IF (Stack[-1] == 0) GOTO 0x1a; Pop(1) 0x16: Push("Can't find fire particle system") 0x17: @ Trace(Stack[-1]) 0x18: Pop(1) 0x19: Return(); Pop(2) 0x1a: Push(CVector(0.0, 0.0, 0.0)) 0x1b: Push(CVector(0.0, 1.0, 0.0)) 0x1c: Push((float)0.0) 0x1d: @@ AddSource(Stack[-3], Stack[-2], Stack[-1]) 0x1e: Pop(3) 0x1f: @@ Enable() 0x20: Pop(0) 0x21: @ Hold() 0x22: Pop(0) 0x23: GOTO 0x21
For each entry point, a graph is constructed separately (i.e. there are three graphs in total). The left indentation denotes the beginning of a child block.
The block at 0x26
has two ‘children’: a function call at 0x0
and the next consecutive instruction 0x29
. In the initial graph, the body of the called function is attached to the ‘parent’ block, but only once per graph.
The block at 0x9
contains two child blocks: the next consecutive 0xc
and a transition to 0x11
. Because of this transition, the block (with no control transfer instructions inside it) at 0xf
is divided into two blocks. Importantly, the block at 0x11
is concurrently a ‘child’ of the block 0xf
as the next consecutive one. But since the rule is “never enter the same block twice”, the graph will be displayed with such a weird nesting order.
Stack simulator
The main problem with decompiled code pertains to relative addresses on the stack. A variable is accessed via an index from the end of the stack, and adding a new variable to the end of the stack breaks old references. The only way to figure out which variable is used where is to go through the execution graph and simulate all operations on the stack.
To do this, I had to elaborate the assembly code decompiler by adding the following information to each instruction: (1) number of variables it takes from the stack; and (2) type of variables it pushes onto the stack.
class CInstructionPop: def __init__(self, r): self.OpCode = 'Pop' self.VarIn = r.uint32() self.PopCount = self.VarIn def __repr__(self): return f'Pop({self.VarIn})'class CInstructionAdd: def __init__(self, r): self.OpCode = 'Add' self.Var1 = r.uint32() self.Var2 = r.uint32() self.TaskVar = r.int8() # signed self.PopCount = self.TaskVar & 0x3F self.PushType = [VAR_TYPE.INT] def __repr__(self): if self.TaskVar >= 0: var_1 = f'Stack[-{self.Var1}]' else: var_1 = f'Stack[{self.Var1} + StackPtr]' if self.TaskVar & 0x40: var_2 = f'Stack[{self.Var2} + StackPtr]' else: var_2 = f'Stack[-{self.Var2}]' return f'Pop({self.PopCount}); Push({var_1} + {var_2});'
The stack simulator goes from the top of the graph. Each instruction adds or removes values from the stack. The trick is that I take two stack snapshots for each instruction: before and after the opcode is executed. And then I add them as metadata to the instruction body. At the next stage of analysis, these stack snapshots will help me to identify the variable used in the instruction.
0x6: PushEmpty(object, object)>before: []>after: [var_0_object, var_1_object]0x7: PushEmpty(bool)>before: [var_0_object, var_1_object]>after: [var_0_object, var_1_object, var_2_bool]
For instance, at the very beginning of the function 0x6
, the stack is empty, but the first command pushes two variables of the object
type onto it. The next instruction adds a variable of the bool
type.
The stack simulator remembers the number of the last created variable; when a new variable is pushed onto the stack, it assigns the next ordinal number to it. A simple solution involving a global counter turned out to be really useful. Each variable receives a unique number, and its usage can be tracked!
Prior to entering the child block, a temporary copy of the stack is created. It’s passed to it as the current state. This ensures that the first child block cannot affect the state of the stack in the second code block. Such a trick was borrowed from Angr: if there is a conditional transition, the script goes through each branch in sequence. This way, it visits each block exactly once and calculates the current state of the stack for each of them.
0x8: Call 0x2c>before: [var_0_object, var_1_object, var_2_bool]>after: [var_0_object, var_1_object, var_2_bool] 0x2c: PushEmpty(bool, bool) >before: [var_0_object, var_1_object, var_2_bool] >after: [var_0_object, var_1_object, var_2_bool, var_3_bool, var_4_bool]
A function call receives a copy of the stack from the caller code. This information will later help to determine the number of arguments passed to the function. It can be assumed that a function call doesn’t change the state of the stack. Fortunately, the language design ensures that the caller cleans up arguments of the called function (i.e. the above assumption is correct). Otherwise, I would have to write a fully-functional emulator for the called code.
Passes
The graph consists of blocks, and each block contains a reference to ‘child’ blocks and a list of assembly instructions. The idea is that information about the code in a higher-level language can be added to instructions. In other words, each low-level instruction should have a handler that replaces it with a higher-level instruction.
if isinstance(instr.opcode, CInstructionMovB): node.instructions[i].hl = HLInstructionMovB(instr)if isinstance(instr.opcode, CInstructionPow): node.instructions[i].hl = HLInstructionPow(instr)if isinstance(instr.opcode, CInstructionPop): node.instructions[i].hl = HLInstructionNop()
For example, interaction with the stack affects the stack simulator, but it must not be reflected in the C code. Therefore, CInstructionPop
is replaced by HLInstructionNop
, an object of the high-level NOP (No OPeration) instruction class that won’t be taken into account when C code is composed.
Interaction with the graph is implemented using a set of ‘passes’ (I borrowed this idea from the LLVM architecture). A pass is a function that is iteratively applied to each block in the graph; it changes the block on the fly or collects information about its code. In addition, the called functions should be ‘detached’ into separate graphs.
The first pass replaces instructions with their high-level language analogues. But it includes several special instances.
if isinstance(instr.opcode, CInstructionJump): node.instructions[i].hl = HLInstructionJump(instr) self.insert_label_list.append(instr.opcode.VarIn)if isinstance(instr.opcode, CInstructionJumpB): node.instructions[i].hl = HLInstructionJumpB(instr) self.insert_label_list.append(instr.opcode.VarIn)
Control transfer instructions within the same function jump to a specific address. There are no addresses in C code; so, I add a new instruction to the code: a label with a unique number. On the next pass, the collected information about labels will be converted into new instructions.
def pass_InsertLabels(self, top_node): if hasattr(self, 'insert_label_list'): for i in self.insert_label_list: self.insert_label(i, top_node) self.insert_label_list = []
The pass that records variables that are in use requires special attention:
def pass_RecordUsedVars(self, node): node.Created = [] node.Used = [] for i in node.instructions: if hasattr(i, 'hl'): if hasattr(i.hl, 'Created'): node.Created += i.hl.Created if hasattr(i.hl, 'Used'): node.Used += i.hl.Used node.Created = list(set(node.Created)) node.Used = list(set(node.Used))
Within each high-level instruction, the following fields containing metadata are collected: (1) what variables it uses; and (2) what variables were created in it.
class HLInstructionPush: def __init__(self, instr): opcode = instr.opcode stack = opcode.stack_snapshot_after stack_before = opcode.stack_snapshot_before self.VarIn = stack_before[-opcode.VarIn] self.VarOut = stack[-1] self.Used = [self.VarIn] self.Created = [self.VarOut] def __repr__(self): return f'{self.VarOut} = {self.VarIn};'
The original instruction contains an index for the variable used by it on the stack. This index is compared with the previously obtained stack snapshot to find out which variable was used and write it to the metadata.
Many instructions require information from both stack snapshots (i.e. before and after their execution) because offsets written in the instruction are valid only for the old snapshot; while the output variable created by the instruction is present only in the new snapshot.
Metadata from the Created
and Used
fields are used to determine input arguments of functions:
def get_func_args(self, top_node): Created = [] Used = [] for node, i in list(parse_graph(top_node, [])): if hasattr(node, 'Created'): Created += node.Created Used += node.Used # Used, but not created Args = list(set(Used) - set(Created)) Args.sort(key=lambda x: x.index) return Args
The script collects all variables created and used inside the function body and subtracts the created variables from the used ones. The remaining variables were obviously passed by external code. There is no other way to understand which variables from the stack act as input arguments for a given function.
A separate pass adds a prologue and epilogue to each function (i.e. instructions that will be rendered into the text as func_44(
). The remaining pass adds left indents for the code.
The final tree printout looks as follows:
task_0_event_5(){ StopGroup0(); // 0x3 Func return 0; // 0x5 Return}task_1_event_6(){ TaskCall(0); // 0x27 TaskCall func_0(); // 0x28 Call TaskReturn(); // 0x29 TaskReturn return 0; // 0x2b Return}main(){ var_0_object = Obj(); var_1_object = Obj(); // 0x6 PushV var_2_bool = 0; // 0x7 PushV func_44(var_2_bool); // 0x8 Call var_5_bool = var_2_bool == 0; // 0xa Not if(var_5_bool == 0) goto Label_17; // 0xb JumpB TaskCall(0); // 0xd TaskCall func_0(); // 0xe Call TaskReturn(); // 0xf TaskReturnLabel_17: var_6_string = "fire"; // 0x11 PushS FindParticleSystem(var_6_string, var_1_object); // 0x12 Func var_7_bool = var_1_object == 0; // 0x14 NullEq if(var_7_bool == 0) goto Label_26; // 0x15 JumpB var_8_string = "Can't find fire particle system"; // 0x16 PushS Trace(var_8_string); // 0x17 Func return 2; // 0x19 ReturnLabel_26: var_9_cvector = CVector(0.0, 0.0, 0.0); // 0x1a PushVec var_10_cvector = CVector(0.0, 1.0, 0.0); // 0x1b PushVec var_11_float = 0.0; // 0x1c PushF AddSource(var_9_cvector, var_10_cvector, var_11_float); // 0x1d ObjFunc Enable(); // 0x1f ObjFuncLabel_33: Hold(); // 0x21 Func goto Label_33; // 0x23 Jump}func_0(){ Hold(); // 0x0 Func return 0; // 0x2 Return}func_44(var_2_bool){ var_3_bool = 0; var_4_bool = 0; // 0x2c PushV IsLoaded(var_4_bool); // 0x2d Func var_2_bool = var_4_bool; // 0x2f Mov return 2; // 0x30 Return}
For debugging purposes, I specify the address and opcode of the original assembler instruction in comments. Of course, decompilation errors are possible, but the code looks quite adequate (at least, types specified when variables are created correspond to the ways these variables are used).
Code compression
I decompiled all scripts in a similar way and identified several errors. Let’s examine the item_milk.
script discussed in the previous article:
main(){ var_0_bool = 0; var_1_string = ""; var_2_float = 0; var_3_float = 0; var_4_float = 0; // 0x0 PushV var_1_string = "hunger"; // 0x1 MovS var_2_float = -0.17; // 0x2 MovF var_3_float = 0; // 0x3 MovI var_4_float = 1; // 0x4 MovI func_8(var_0_bool, var_1_string, var_2_float, var_3_float, var_4_float); // 0x5 Call return 0; // 0x7 Return}func_8(var_0_bool, var_1_string, var_2_float, var_3_float, var_4_float){ var_5_bool = 0; var_6_float = 0; var_7_bool = 0; var_8_float = 0; // 0x8 PushV HasProperty(var_1_string, var_7_bool); // 0x9 Func var_9_bool = var_7_bool == 0; // 0xb Not if(var_9_bool == 0) goto Label_15; // 0xc JumpB var_0_bool = 0; // 0xd MovB return 4; // 0xe ReturnLabel_15: GetProperty(var_1_string, var_8_float); // 0xf Func var_10_float = 0; var_11_float = 0; var_12_float = 0; var_13_float = 0; // 0x11 PushV var_11_float = var_8_float + var_2_float; // 0x12 Add2 var_12_float = var_3_float; // 0x13 Mov var_13_float = var_4_float; // 0x14 Mov func_27(var_10_float, var_11_float, var_12_float, var_13_float); // 0x15 Call SetProperty(var_1_string, var_10_float); // 0x17 Func var_0_bool = 1; // 0x19 MovB return 4; // 0x1a Return}func_27(var_10_float, var_11_float, var_12_float, var_13_float){ var_14_bool = var_11_float < var_12_float; // 0x1c LT if(var_14_bool == 0) goto Label_32; // 0x1d JumpB var_10_float = var_12_float; // 0x1e Mov return 0; // 0x1f ReturnLabel_32: var_15_bool = var_11_float > var_13_float; // 0x20 GT if(var_15_bool == 0) goto Label_36; // 0x21 JumpB var_10_float = var_13_float; // 0x22 Mov return 0; // 0x23 ReturnLabel_36: var_10_float = var_11_float; // 0x24 Mov return 0; // 0x25 Return}
Functions get names based on the address of their first instruction. In the original code, these functions had meaningful names, but they were lost in the course of compilation. The script reconstruction process resembles working in IDA Pro: the analyst adds names of variables and functions. Apparently, decompilation cannot be performed in a different way.
main(){ var_0_bool = 0; func_8(var_0_bool, "hunger", -0.17, 0, 1); return;}func_8(var_0_bool, var_1_string, var_2_float, var_3_float, var_4_float){ var_7_bool = 0; var_8_float = 0; HasProperty(var_1_string, var_7_bool); if(var_7_bool != 0) goto Label_15; var_0_bool = 0; return;Label_15: GetProperty(var_1_string, var_8_float); var_10_float = 0; func_27(var_10_float, var_8_float + var_2_float, var_3_float, var_4_float); SetProperty(var_1_string, var_10_float); var_0_bool = 1; return;}func_27(var_10_float, var_11_float, var_12_float, var_13_float){ if(var_11_float >= var_12_float) goto Label_32; var_10_float = var_12_float; return;Label_32: if(var_11_float <= var_13_float) goto Label_36; var_10_float = var_13_float; return;Label_36: var_10_float = var_11_float; return;}
The code can be significantly simplified by moving definitions of variables to the places where these variables are used. This process can be automated using RD analysis, but the problem is that I don’t have definitions of the APIs used (e.g. SetProperty
). Without function prototypes, it’s impossible to say which arguments a function uses by value and which ones, by reference. For example, HasProperty
takes the first string argument as a value, and writes the return value by reference to the second one. But let’s keep code compression for later…
Conclusions
Writing a cross-compiler turned out to be a difficult, but quite feasible task. The next step could be the inverse challenge: create a compiler that translates C code into assembly bytecode. Yes, decompilation isn’t ideal, but tons of scripts suitable for data mining are already available.
See you soon!

2023.02.21 — Pivoting District: GRE Pivoting over network equipment
Too bad, security admins often don't pay due attention to network equipment, which enables malefactors to hack such devices and gain control over them. What…
Full article →
2022.06.01 — F#ck AMSI! How to bypass Antimalware Scan Interface and infect Windows
Is the phrase "This script contains malicious content and has been blocked by your antivirus software" familiar to you? It's generated by Antimalware Scan Interface…
Full article →
2023.04.04 — Serpent pyramid. Run malware from the EDR blind spots!
In this article, I'll show how to modify a standalone Python interpreter so that you can load malicious dependencies directly into memory using the Pyramid…
Full article →
2022.01.12 — First contact. Attacks against contactless cards
Contactless payment cards are very convenient: you just tap the terminal with your card, and a few seconds later, your phone rings indicating that…
Full article →
2023.07.07 — VERY bad flash drive. BadUSB attack in detail
BadUSB attacks are efficient and deadly. This article explains how to deliver such an attack, describes in detail the preparation of a malicious flash drive required for it,…
Full article →
2022.02.09 — F#ck da Antivirus! How to bypass antiviruses during pentest
Antiviruses are extremely useful tools - but not in situations when you need to remain unnoticed on an attacked network. Today, I will explain how…
Full article →
2022.12.15 — What Challenges To Overcome with the Help of Automated e2e Testing?
This is an external third-party advertising publication. Every good developer will tell you that software development is a complex task. It's a tricky process requiring…
Full article →
2022.06.03 — Challenge the Keemaker! How to bypass antiviruses and inject shellcode into KeePass memory
Recently, I was involved with a challenging pentesting project. Using the KeeThief utility from GhostPack, I tried to extract the master password for the open-source KeePass database…
Full article →
2023.03.26 — Attacks on the DHCP protocol: DHCP starvation, DHCP spoofing, and protection against these techniques
Chances are high that you had dealt with DHCP when configuring a router. But are you aware of risks arising if this protocol is misconfigured on a…
Full article →
2023.01.22 — Top 5 Ways to Use a VPN for Enhanced Online Privacy and Security
This is an external third-party advertising publication. In this period when technology is at its highest level, the importance of privacy and security has grown like never…
Full article →