General Tips for Firmware Reverse Engineering

Preface

These notes were originally compiled years ago as a quick reference. They are somewhat fragmented and do not provide step-by-step procedures, but I continue to update them over time.

In this context, “firmware” refers to raw dumps extracted from storage chips or vendor upgrade packages.

Characteristics of reversing raw firmware:

Acquisition difficulty: Firmware files can be hard to obtain.
Limited resources: There are few public write-ups; you mostly rely on experience and exploration.
No direct execution: You cannot run the firmware directly, making debugging difficult.
Missing symbols: Most symbols are stripped; you often need to manually define code regions for disassembly.
Low obfuscation: Code obfuscation is rarely applied.

Firmware Categories

Based on system architecture, firmware can be broadly categorized into SoC firmware and MCU firmware.

SoC Firmware: Typically consists of a processing unit plus peripherals. The processor’s built-in BootROM loads a bootloader from external Flash; the data in that external Flash is what we consider the firmware. SoC devices typically use SPI NOR flash, NAND flash, or eMMC. SPI flash often stores the bootloader, while NAND flash stores the system kernel and filesystem. For the latter, extraction of the filesystem is key; for the former, the focus is on the boot process. Firmware in SPI flash is often composed of multiple distinct parts, so you cannot simply load a raw dump into IDA Pro and expect it to work.
MCU Firmware: Usually monolithic or split into very few regions. For MCUs using only internal storage, the layout is generally Loader + Application. For MCUs with external storage, you will see an internal Loader + Application, and the external Flash is typically not heavily partitioned.

Extracting Firmware

For NAND flash or other specialized storage media, extraction can require significant effort. Firmware from niche or proprietary MCUs can also be notoriously difficult to extract.

Finding the Load Base Address

When reversing firmware, the first step is usually to determine the load base address. Once the correct base is established, IDA can automatically resolve many cross-references, including strings and jump tables (jpt).

(These are rough notes; ignore them if they don’t apply to your specific case.)

Methods to determine the load base address:

Chip Datasheet: Use the memory map and boot-mode pin configuration to locate the base address.
Public Code: Find open-source code for the chip (e.g., a compatible bootloader) and infer the base address from linker scripts or definitions.
Previous-Stage Loader: Reverse the previous-stage loader to find where it loads the next stage (e.g., U-Boot environment variables or code often contain base address info).
Vector Table (IVT): Interrupt vectors often contain absolute addresses; use them to make an educated guess.
String References: If there is no interrupt vector table, look for pointers to strings that use absolute addresses.
Brute-Force Analysis: Extract all strings, then find all potential reference sites in the code. The base address that yields the most valid cross-references is likely correct.
Runtime Dump: If you have debug access (JTAG/SWD/UART), dump the memory at runtime and see where the firmware header resides.
Pattern Matching: Consider “round” addresses like 0x????0000. Compare the destination addresses of pointers/jumps with the distribution of strings in the file. If the lower bits match, the difference reveals the offset between the current base and the real base.
IDA Trick: If the last 4 hex digits of a file offset (for example, the start of a string or function) match the last 4 hex digits of a pointer-sized data value (DCD), the high bits of that value are likely the high bits of the base address. This assumes the base is aligned to 0x10000.
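The brute-force method from the list above can be sketched in plain Python, outside IDA. This is a minimal illustration, assuming a 32-bit target, 4-byte-aligned pointers, and a base aligned to 0x1000; the function names (string_offsets, guess_base) are made up for this example. It compares every word with every string start, so it is only practical for small images or after sampling.

```python
import re
import struct
from collections import Counter


def string_offsets(image: bytes, min_len: int = 6) -> set:
    """File offsets where a printable ASCII run of at least min_len bytes starts."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return {m.start() for m in pattern.finditer(image)}


def guess_base(image: bytes, align: int = 0x1000, little_endian: bool = True):
    """Rank candidate bases by how many 32-bit words would point at a string start."""
    starts = string_offsets(image)
    fmt = "<I" if little_endian else ">I"
    votes = Counter()
    for off in range(0, len(image) - 3, 4):
        word = struct.unpack_from(fmt, image, off)[0]
        for start in starts:
            base = word - start
            # Keep only non-negative, aligned candidates (alignment is an assumption).
            if base >= 0 and base % align == 0:
                votes[base] += 1
    return votes.most_common(3)
```

The candidate with the most votes is only a hint; confirm it by rebasing and checking that the string cross-references resolve.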

Analyzing Layout

Start with hexdump to visualize the data distribution, then use binwalk to identify the CPU instruction set architecture (ISA) and opcode distribution. If it remains unclear, use a hex editor to analyze byte-frequency distribution.

If the data appears compressed (e.g., high entropy), look for format-specific markers. For example, Lempel-Ziv-Welch (LZW) data in the Unix compress format is marked by the magic bytes 0x1F 0x9D, so look for 0x9D bytes and check the bytes that follow to see whether the stream matches the LZW structure. Reference: List of file signatures.
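A minimal sketch of that check, assuming the Unix compress (.Z) container: two magic bytes 0x1F 0x9D followed by a flags byte whose low five bits give the maximum code width and whose top bit marks block mode. The function name is hypothetical, and vendor-specific LZW variants may not use this header at all.

```python
def find_compress_headers(image: bytes):
    """Yield (offset, max_bits, block_mode) for plausible compress-style LZW headers."""
    pos = image.find(b"\x1f\x9d")
    while pos != -1 and pos + 2 < len(image):
        flags = image[pos + 2]
        max_bits = flags & 0x1F          # maximum LZW code width
        block_mode = bool(flags & 0x80)  # dictionary reset codes enabled
        # Assumed sanity checks: 9..16-bit codes, the two middle flag bits unused.
        if 9 <= max_bits <= 16 and not flags & 0x60:
            yield pos, max_bits, block_mode
        pos = image.find(b"\x1f\x9d", pos + 1)
```

Each hit is a candidate offset to carve from and hand to a decompressor; two matching bytes alone produce false positives in large images.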

Other techniques:

Endianness: Search for contiguous character sequences such as 0123456789abcdefg; if they appear with swapped bytes, the dump uses a different byte order. Some systems (e.g., certain printers) use dual flash chips where one holds “1267” and the other “3489”. In that case you need to interleave the two dumps, using the smallest byte block size that restores the sequence, to reconstruct the binary.
Magic Values: If source code is available, search for magic values from the source code within the firmware to reconstruct the layout.
Differential Analysis: Compare firmware across different versions, or compare the same version with slightly different contents (control-variable method).
Block Similarity: If you only have a single firmware sample, analyze block similarity to locate magic numbers and infer the system structure.
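The dual-chip case from the list above can be sketched as follows. This is an illustration only, assuming two dumps of equal length and a known marker string; interleave and find_layouts are hypothetical helper names.

```python
def interleave(chip_a: bytes, chip_b: bytes, block: int = 1) -> bytes:
    """Merge two dumps by alternating block-sized chunks, taking chip_a first."""
    if len(chip_a) != len(chip_b) or len(chip_a) % block:
        raise ValueError("dumps must have equal length, a multiple of block")
    out = bytearray()
    for off in range(0, len(chip_a), block):
        out += chip_a[off:off + block]
        out += chip_b[off:off + block]
    return bytes(out)


def find_layouts(chip_a: bytes, chip_b: bytes, marker: bytes, blocks=(1, 2, 4)):
    """Return (block, order) pairs whose merged image contains the marker."""
    hits = []
    for block in blocks:
        if len(chip_a) % block:
            continue  # this block size cannot tile the dump evenly
        for order, (first, second) in (("A first", (chip_a, chip_b)),
                                       ("B first", (chip_b, chip_a))):
            if marker in interleave(first, second, block):
                hits.append((block, order))
    return hits
```

For example, find_layouts(dump_a, dump_b, b"0123456789") tries 1-, 2- and 4-byte interleaving in both chip orders and reports the combinations that restore the digit sequence.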

Avoiding Duplicate Regions

I developed a firmware security tool called UFA - Universal Firmware Analysis to help with this.

(Note: I implemented this feature in late 2020.)

Some firmware images contain redundant system copies (e.g., for A/B updates). With UFA (or other tools that visualize entropy), you can quickly identify duplicated regions and avoid analyzing the same code twice.
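If no such tool is at hand, duplicated regions can be approximated by hashing fixed-size blocks. The sketch below is not how UFA works internally (that is not described here); it simply assumes the redundant copies are aligned to the chosen block size, and duplicate_blocks is a hypothetical name.

```python
import hashlib
from collections import defaultdict


def duplicate_blocks(image: bytes, block: int = 0x1000):
    """Group offsets of identical block-sized chunks, ignoring uniform padding."""
    groups = defaultdict(list)
    for off in range(0, len(image) - block + 1, block):
        chunk = image[off:off + block]
        if len(set(chunk)) == 1:
            continue  # erased flash or padding (all 0x00 / 0xFF) is not interesting
        groups[hashlib.sha256(chunk).digest()].append(off)
    return [offsets for offsets in groups.values() if len(offsets) > 1]
```

Runs of consecutive duplicate offsets with a constant distance between the copies usually indicate a mirrored partition.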

Continuous Files & Partially Compressed Files

Partially compressed systems present significant challenges. In day-to-day reversing, you might extract a binary and try to analyze it directly. You see some strings and symbols, but IDA fails to analyze the code flow properly. An entropy graph might reveal that parts of the file are code, while others are compressed data, interspersed with constants (like SHA-512 constants).

Normal compressed data has a consistently high entropy (close to 1). If the image were a standard filesystem, it would be unusual to see large runs of readable strings separated by large blocks of compressed data. By analyzing the previous-stage loader, you can often confirm whether the binary is a continuous file with partial compression.
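An entropy graph of this kind is easy to reproduce. The sketch below computes Shannon entropy per window and normalizes it so that 1.0 corresponds to 8 bits per byte, which is assumed to be the scale meant by "close to 1"; the window size is an arbitrary choice.

```python
import math
from collections import Counter


def normalized_entropy(data: bytes) -> float:
    """Shannon entropy of data, scaled to the range 0.0 .. 1.0."""
    if not data:
        return 0.0
    total = len(data)
    bits = -sum(n / total * math.log2(n / total) for n in Counter(data).values())
    return bits / 8.0


def entropy_profile(image: bytes, window: int = 0x400):
    """List of (offset, entropy) pairs, one per window, suitable for plotting."""
    return [(off, normalized_entropy(image[off:off + window]))
            for off in range(0, len(image), window)]
```

Code typically sits in the middle of the scale, string tables lower, and compressed or encrypted blocks near the top, so the transitions in the profile mark the region boundaries.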

Partial Encryption vs. Partial Compression

When partial encryption and partial compression are combined, analysis becomes extremely confusing.

IoT devices are often resource-constrained. To balance security and user experience (boot time), vendors may use partial encryption. For example, a SquashFS image might fail to unpack. An inexperienced reverser might assume the file is corrupted. A closer analysis might reveal a decryption routine; however, even after decryption, unpacking might still fail. Since SquashFS is compressed by definition, “partial encryption” is harder to spot visually because both look like high-entropy noise.

However, partial encryption differs from full encryption:

Partial Compression/Encryption: Compressed data entropy usually fluctuates within a high range. Regions with fluctuations might indicate “unencrypted leftovers” or metadata inside an otherwise partially encrypted area.

Full Encryption: Fully encrypted data tends to have consistently high randomness, often appearing as a flat, high line on the entropy graph.

Identifying Functions

If the base address is incorrect, IDA often cannot accurately detect code regions or function prologues. In such cases, you can try to blindly recover potential functions to get a foothold.

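import ida_bytes
import ida_funcs
import ida_ida
import idc

def remake_func(opcodes, lastbytes, end_ea=None):
    """Create a function at every `opcodes` match that directly follows `lastbytes`."""
    if end_ea is None:
        end_ea = ida_ida.inf_get_max_ea()  # resolved at call time, not at definition time
    flags = (ida_bytes.BIN_SEARCH_FORWARD | ida_bytes.BIN_SEARCH_NOBREAK
             | ida_bytes.BIN_SEARCH_NOSHOW)
    ea = ida_ida.inf_get_min_ea()
    while ea < end_ea:
        # Byte-image form of bin_search as used in IDA 7.x; newer releases may differ.
        ea = ida_bytes.bin_search(ea, end_ea, opcodes, None, 1, flags)
        if ea == idc.BADADDR:
            break
        # A prologue right after the previous function's epilogue is a likely entry point.
        if ida_bytes.get_bytes(ea - len(lastbytes), len(lastbytes)) == lastbytes:
            ida_funcs.add_func(ea)
            print("0x{:x}: {}".format(ea, idc.GetDisasm(ea)))
        ea += 1

# Example usage (x86): prologue bytes, then the epilogue bytes expected just before them
remake_func(b'\x55\x89\xe5', b'\xc3', 0xFF000000)          # push ebp; mov ebp, esp after ret
remake_func(b'\x55\x31\xC0', b'\xc3', 0xFF000000)          # push ebp; xor eax, eax after ret
remake_func(b'\x55\x89\xe5', b'\xc2\x04\x00', 0xFF000000)  # push ebp; mov ebp, esp after ret 4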

Recovering Common Functions

Proprietary MCU firmware rarely uses standard external libraries; most functionality is statically linked or implemented from scratch. You should first identify frequently used standard functions to build a map of the firmware’s logic:

memcpy
memset
memcmp
mmap
printf
strcpy
malloc / free (kmalloc / kfree in kernel code)

For firmware based on open-source projects, you can use source-based signatures.

Script to find the most-referenced functions:

import idautils
import idc

# Print every function that has more than 30 incoming cross-references.
for f in idautils.Functions():
    name = idc.get_func_name(f)
    func_xref_amount = len(list(idautils.XrefsTo(f)))
    if func_xref_amount > 30:
        print("%s %d" % (name, func_xref_amount))

For open-source MCU firmware, compile your own build using the same toolchain and version if possible. Generate a MAP file or symbols, use FLIRT to create signatures, and then match them against the target firmware to recover function names.

Finding Functions with String References

For firmware where the base address is not aligned to a standard boundary (like 0x1000), guessing the base is difficult. A useful trick involves inspecting string global variables.

First, look at the list of strings in IDA and note the sequence of their offsets.

On x86, the addresses of static data passed as function arguments are often pushed onto the stack, so searching for push instructions is often more effective than searching for mov. In IDA, perform a binary search for the push imm32 opcode (0x68), or search for the immediate values directly. Filter for values whose low digits follow the pattern derived from the string offsets (e.g., 0x??????62, 0x??????97).

If the spacing between the immediate values in the code matches the spacing between the strings, the correct base address becomes obvious.

Base calculation example: 0xFEFA5762 (Immediate Value) - 0x22F62 (String Offset) = 0xFEF82800 (Base Address)
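The same matching can be automated by letting every (immediate, string offset) pair vote for the base it implies. This is a rough sketch, assuming 32-bit x86 code in which string addresses are pushed with the 0x68 (push imm32) opcode; it scans bytes without disassembling, so stray 0x68 bytes add noise that the vote is expected to outweigh. The helper names are hypothetical.

```python
import struct
from collections import Counter


def push_immediates(code: bytes):
    """Every 32-bit little-endian value that follows a 0x68 byte (push imm32)."""
    return [struct.unpack_from("<I", code, i + 1)[0]
            for i in range(len(code) - 4) if code[i] == 0x68]


def vote_base(immediates, string_offsets):
    """Return (base, votes) for the most common immediate-minus-offset difference."""
    votes = Counter()
    for imm in immediates:
        for off in string_offsets:
            if imm >= off:
                votes[imm - off] += 1
    return votes.most_common(1)[0] if votes else None
```

With the values from the example above, an immediate of 0xFEFA5762 paired with the string at offset 0x22F62 votes for 0xFEF82800; the more strings agree on one difference, the more trustworthy the result.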

Fixing Function Cross-References

If you cannot identify the caller of a function, it may be referenced via a jump table. Globally search for immediate values equal to the function’s address.

Note: Sometimes addresses are stored as relative offsets; you must subtract the base address to find the stored value.
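Split Addresses: Sometimes a 32-bit address is built from its low 16 bits and high 16 bits in two instructions, so the full address never appears as a single immediate. On ARM, MOVW sets the low half and MOVT sets the high half:

MOVW Rx, #LowAddr
MOVT Rx, #HighAddr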
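To search for an address that is split this way, the two halves have to be recombined. The sketch below assumes little-endian ARM (A32) code, where the 16-bit immediate of MOVW and MOVT is stored as a 4-bit field plus a 12-bit field; it does not handle Thumb-2, and it pairs instructions linearly per register without following control flow. Both function names are made up for illustration.

```python
import struct


def decode_mov16(word: int):
    """Decode an A32 MOVW/MOVT word; return (mnemonic, rd, imm16) or None."""
    kind = word & 0x0FF00000
    if kind == 0x03000000:
        mnemonic = "MOVW"   # writes the low 16 bits
    elif kind == 0x03400000:
        mnemonic = "MOVT"   # writes the high 16 bits
    else:
        return None
    rd = (word >> 12) & 0xF
    imm16 = ((word >> 4) & 0xF000) | (word & 0xFFF)  # imm4:imm12
    return mnemonic, rd, imm16


def find_split_addresses(code: bytes):
    """Return (offset_of_MOVT, rd, address) for each MOVW/MOVT pair on one register."""
    low_halves = {}
    found = []
    for off in range(0, len(code) - 3, 4):
        decoded = decode_mov16(struct.unpack_from("<I", code, off)[0])
        if decoded is None:
            continue
        mnemonic, rd, imm16 = decoded
        if mnemonic == "MOVW":
            low_halves[rd] = imm16
        elif rd in low_halves:
            found.append((off, rd, (imm16 << 16) | low_halves.pop(rd)))
    return found
```

Filtering the result for a function's address gives candidate call sites or callback registrations that a plain immediate search would miss.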

Niche Architectures

IDA Pro is excellent at disassembling machine code and generating call graphs for common architectures. However, for niche architectures like NEC V850, you often need to manually identify function entry points. Many cross-references will not be automatically recognized and must be created manually.

Another challenge is chip-specific register layouts: RAM, peripheral buses, interface registers, interrupt controllers, etc.

Solution: Consult the datasheet. If the datasheet is not public, look for Board Support Packages (BSPs), scatter files, or code for similar chips.
IDA Config: Add platform-specific configurations to IDA Pro’s cfg files (address map, register names, etc.) to aid analysis.

Reversing by Comparing with Source

If you cannot understand a specific piece of code, find an open-source project with similar functionality. Compile it for the same platform, load the result into IDA Pro, and compare the assembly against your target. This comparative analysis often clarifies the code’s intent.

Emulation

If you face complex obfuscated or mathematical code but only need the input/output behavior, emulate it using Unicorn Engine. It supports common architectures like ARM, MIPS, and PPC, allowing you to execute the code slice in isolation.
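A minimal sketch of such an isolated run, using Unicorn's Python bindings on a single hand-assembled ARM instruction. The load address and the snippet itself are placeholders; a real slice would also need its stack and any data it touches mapped. The imports sit inside the function only so the sketch can be pasted into a larger script that does not otherwise depend on Unicorn.

```python
def emulate_add(a: int, b: int) -> int:
    """Run 'add r0, r0, r1' in Unicorn and return the resulting r0."""
    from unicorn import Uc, UC_ARCH_ARM, UC_MODE_ARM
    from unicorn.arm_const import UC_ARM_REG_R0, UC_ARM_REG_R1

    code = bytes.fromhex("010080e0")  # add r0, r0, r1 (little-endian A32)
    base = 0x10000                    # arbitrary, page-aligned load address

    mu = Uc(UC_ARCH_ARM, UC_MODE_ARM)
    mu.mem_map(base, 0x1000)          # one page is enough for this snippet
    mu.mem_write(base, code)
    mu.reg_write(UC_ARM_REG_R0, a)    # inputs go in through registers
    mu.reg_write(UC_ARM_REG_R1, b)
    mu.emu_start(base, base + len(code))
    return mu.reg_read(UC_ARM_REG_R0)
```

For a real routine, replace the code bytes with the extracted function, set the argument registers according to the calling convention, and stop emulation at the return address.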

Reversing Specific Features

Crypto libraries often rely on specific constant tables (S-boxes, initialization vectors). By searching for these constants, you can identify the algorithms used (AES, SHA, CRC) and locate the functions that use them. Encryption, hashing, and checksum routines are critical checkpoints commonly found during boot, firmware upgrade, and communication phases.

Tools: Use the FindCrypt plugin to quickly locate these constants.
Protocols: For SD/SATA protocols, search for specific Command (CMD) values.
Vehicle Networks: For CAN bus analysis, search for the memory-mapped addresses of CAN registers.
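Outside IDA, the constant search can be approximated with a few well-known byte patterns. This is a tiny stand-in, not a replacement for FindCrypt: the table below holds only a handful of signatures chosen for illustration, and multi-byte constants are listed in one byte order each, so the other order has to be added for targets that store them differently.

```python
import struct

# A few widely documented constants; extend as needed.
SIGNATURES = {
    "AES S-box": bytes.fromhex("637c777bf26b6fc53001672bfed7ab76"),
    "SHA-256 initial hash H0 (big-endian)": struct.pack(">I", 0x6A09E667),
    "SHA-256 round constant K0 (big-endian)": struct.pack(">I", 0x428A2F98),
    "CRC-32 table entry 1 (little-endian)": struct.pack("<I", 0x77073096),
    "MD5/SHA-1 initial state (little-endian)": struct.pack("<I", 0x67452301),
}


def find_constants(image: bytes):
    """Return a sorted list of (offset, signature name) for every match."""
    hits = []
    for name, pattern in SIGNATURES.items():
        pos = image.find(pattern)
        while pos != -1:
            hits.append((pos, name))
            pos = image.find(pattern, pos + 1)
    return sorted(hits)
```

Add the load base to each offset and look up cross-references to that address to reach the routines that use the table.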

IDA Pro “Problems” Tips

In IDA Pro, navigate to View > Open subviews > Problems, and look for:

NONAME
BOUNDS

These items often indicate an instruction using an immediate value that points outside the defined internal segments. These values could be:

Peripheral register addresses.
Valid memory addresses if the firmware base address were set correctly.
Addresses belonging to an external binary (common in bootloaders or multi-stage firmwares).

Tip: If Firmware A’s base is unknown, but Firmware B (a different stage whose layout you already understand) references addresses that appear to fall inside Firmware A, those references can help you calculate Firmware A’s base.

Case Study

Consider an x86 firmware with an unknown base.

Check the Problems view and filter for BOUNDS.
You see many call instructions using relative addressing (e.g., near ptr).
Address 0x7A10A appears. If the file is smaller than 0x40000 bytes, 0x7A10A cannot be a valid raw offset, which implies that a base address is missing.

Clicking one instance reveals that 0xFEF84DE0 is passed as an argument to the function at 0x7A10A. This is likely the address of a global variable, not a peripheral register.

Using the String Reference trick (described earlier), you determine the base is 0xFEF82800.
After rebasing, IDA identifies more functions.
The address 0x7A10A updates to 0xFEFFC90A. If this is still outside the file’s mapped memory, it likely points to an external binary (e.g., a shared library or common boot code).
If you know from another binary that printf is at 0xFEFFC90A, you can map that external binary into your current IDA database.
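The arithmetic in this case study can be checked with a few lines of Python. The sketch assumes an image size of 0x40000 and invents a range for the external binary purely for illustration; neither value comes from a real device, and classify is a hypothetical helper.

```python
def classify(address: int, base: int, size: int, externals=()):
    """Say where an out-of-bounds operand lands once the base is known."""
    if base <= address < base + size:
        return "inside image"
    for name, start, end in externals:
        if start <= address < end:
            return "external: " + name
    return "unknown (peripheral register or unmapped binary)"


BASE = 0xFEF82800           # from the string-reference trick
SIZE = 0x40000              # assumed upper bound on the file size
rebased_call = BASE + 0x7A10A
# Hypothetical address range of the shared boot code, for illustration only.
EXTERNALS = [("shared boot code", 0xFEFF0000, 0xFF000000)]

print(hex(rebased_call))
print(classify(0xFEF84DE0, BASE, SIZE, EXTERNALS))
print(classify(rebased_call, BASE, SIZE, EXTERNALS))
```

An address reported as external tells you which binary to load as an additional segment; one reported as unknown is a candidate for the datasheet's peripheral memory map.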

Adding a Segment in IDA: Be careful; the UI can be tricky.

Press Shift+F7 to open the Segments window.
Right-click -> Add segment.
Set the Start address to the external binary’s base.

Verify there are no overlaps with existing segments.

Load the external binary: File -> Load file -> Additional binary file…
Set the Loading offset to the base address of the new segment.