Locating and Debugging a Kernel Panic Triggered by Running a Linux Application on Custom Hardware

Ahua AIGC Lab (阿华AIGC实验室)

2026-5-28

Debugging Kernel Panic on Custom i.MX6 Hardware: Hardware vs Software Root Cause

Let's break down your kernel panic log and walk through concrete steps to determine whether this is a hardware or a software issue. Since you have already ruled out memory problems through calibration and stress tests, we can focus on more targeted debugging.

Key Observations from the Panic Log

First, let’s parse the critical details from your panic output:

Invalid Kernel Address Access: The kernel tried to access 0x54494156, which lies in the user-space range. On 32-bit ARM Linux the kernel range starts at PAGE_OFFSET: 0x80000000 with the 2G/2G split that your i.MX6 kernel appears to use, or 0xC0000000 with the default 3G/1G split. Either way the address is below it. The empty first-level page table entry (pgd=00000000) confirms that nothing is mapped at this address, which is what triggered the page fault.
Hijacked Execution Flow: The Program Counter (PC) itself holds 0x54494156, which is not a valid kernel code address. The kernel therefore did not merely read bad data; it jumped to a bad address. The most likely cause is that a function pointer (or a saved return address) used in the idle/cpuidle code path was overwritten with this value.
Trigger Context: The panic occurred in swapper/3 (the idle task of CPU 3), right after the CPU entered a cpuidle state. The log also contains a warning from the tw6869 video capture driver: tw6869_querystd: vch1: unknown std detected. This may be unrelated, but it is a clue specific to the failing board and worth following up.
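The address arithmetic above can be checked with a few lines of Python. This is only a sketch: the KERNEL_BASE value assumes a 2G/2G split and should be replaced with the CONFIG_PAGE_OFFSET of your kernel, and the function names are made up for illustration.

```python
import struct

FAULT_ADDR = 0x54494156     # PC / faulting address taken from the panic log
KERNEL_BASE = 0x80000000    # assumption: 2G/2G split; use your CONFIG_PAGE_OFFSET


def ascii_views(value: int) -> dict:
    """Decode a 32-bit value as ASCII in both byte orders."""
    return {
        # Bytes in the order they appear in the printed hex number.
        "as_printed": struct.pack(">I", value).decode("ascii", errors="replace"),
        # Bytes in the order they sit in memory on a little-endian CPU.
        "in_memory_le": struct.pack("<I", value).decode("ascii", errors="replace"),
    }


def is_user_address(value: int, kernel_base: int = KERNEL_BASE) -> bool:
    """True if the address is below the start of the kernel range."""
    return value < kernel_base


if __name__ == "__main__":
    print(ascii_views(FAULT_ADDR))
    print("user-space range:", is_user_address(FAULT_ADDR))
```

Seeing four printable characters in a supposed code address is a strong hint that text data overwrote a pointer, rather than the pointer being off by a few bits.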

Step-by-Step Debugging Plan

Since you have JTAG access to debug the kernel and U-Boot, we can use that to get concrete data instead of guessing.

1. Rule Out Software Configuration & Version Mismatches

First, confirm the working and failing boards are running identical software:

Compare kernel configs: Run zcat /proc/config.gz on both boards (this file exists only if the kernel was built with CONFIG_IKCONFIG_PROC) and diff the outputs. Look for differences in tw6869 driver settings, cpuidle configuration, or any other kernel feature that might affect memory or idle states.
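Verify kernel/driver versions: Ensure the kernel build (the gff4e28b suffix in the version string, where "g" is the git-describe prefix and ff4e28b the commit) and the tw6869 driver code are exactly the same on both boards. Even a small patch or a compile-time difference could introduce memory corruption.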
Test without the problematic app: Does the panic happen when the app isn’t running? If yes, the issue is unrelated to the app and likely tied to driver or idle code. If no, the app is triggering kernel state corruption that leads to the panic later (when the CPU enters idle).
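A plain diff of two config dumps works, but it is noisy when option order or comments differ. The sketch below compares the two configs option by option instead. It assumes you have already saved the output of zcat /proc/config.gz from each board as text; the helper names and the sample options in the demo are illustrative only.

```python
def parse_config(text: str) -> dict:
    """Parse kernel .config text into {option: value}; unset options map to 'n'."""
    options = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            key, value = line.split("=", 1)
            options[key] = value
        elif line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
    return options


def diff_configs(good: dict, bad: dict) -> dict:
    """Return {option: (value_on_good_board, value_on_bad_board)} for every mismatch."""
    return {
        key: (good.get(key), bad.get(key))
        for key in sorted(good.keys() | bad.keys())
        if good.get(key) != bad.get(key)
    }


if __name__ == "__main__":
    # Made-up sample data; in practice read the two saved config files.
    good = parse_config("CONFIG_CPU_IDLE=y\n# CONFIG_DEBUG_SLAB is not set\n")
    bad = parse_config("CONFIG_CPU_IDLE=y\nCONFIG_DEBUG_SLAB=y\n")
    for option, (g, b) in diff_configs(good, bad).items():
        print(f"{option}: working={g} failing={b}")
```

An empty result means the two kernels were built from the same configuration, which shifts suspicion toward the driver sources, the application, or the hardware.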

2. Use JTAG to Inspect Corrupted Kernel State

When the panic hits, use JTAG to dig into the kernel’s internal state:

Check cpuidle function pointers: The call chain leads to cpuidle_enter_state, which calls the enter callback of the selected idle state. Inspect the struct cpuidle_state entries of the cpuidle driver in use for CPU 3 and check whether any function pointer (the enter callback in particular) holds the value 0x54494156. If one does, that is the corrupted pointer causing the jump.
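Trace the corruption source: Read from the most significant byte down, 0x54494156 is ASCII "TIAV"; on little-endian ARM the same word is stored in memory as the bytes "VAIT". Either way it looks like text data, possibly from your app or the tw6869 driver, rather than a pointer. Use JTAG to search kernel memory for this value and see which region it is in. If it sits in a kernel heap or stack area that is being overwritten, a software bug such as a buffer overrun is the most likely explanation.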
Examine the stack frame: The backtrace shows the jump from cpuidle_enter_state to the invalid address. Use JTAG to inspect the stack frame of cpuidle_enter_state and determine whether the jump came from a call through a corrupted pointer, from a corrupted saved return address, or from an exception return that restored a bad PC.
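If your JTAG tool can save a memory region to a file, the search for the corrupted value can be done offline. The following sketch scans a raw dump for the 32-bit word in little-endian byte order and reports the addresses of the hits. The dump contents, the base address, and the function name are assumptions for illustration; how you produce the dump depends on your debugger.

```python
import struct


def find_word(dump: bytes, value: int, base_addr: int = 0,
              aligned_only: bool = True) -> list:
    """Return the addresses at which a little-endian 32-bit value occurs in a dump.

    base_addr is the address the first byte of the dump was read from.
    With aligned_only set, hits that are not 4-byte aligned are skipped,
    since a stored pointer would normally be word-aligned.
    """
    needle = struct.pack("<I", value)
    hits = []
    pos = dump.find(needle)
    while pos != -1:
        addr = base_addr + pos
        if not aligned_only or addr % 4 == 0:
            hits.append(addr)
        pos = dump.find(needle, pos + 1)
    return hits


if __name__ == "__main__":
    # Synthetic dump for demonstration; in practice read the file saved via JTAG.
    demo = b"\x00" * 8 + struct.pack("<I", 0x54494156) + b"\x00" * 4
    for addr in find_word(demo, 0x54494156, base_addr=0x80000000):
        print(hex(addr))
```

Running it with aligned_only=False is also worthwhile: unaligned hits usually mean the bytes are part of a longer string, and the surrounding text can identify which buffer the data came from.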

3. Validate Hardware-Specific Behavior

Even with memory tests passing, hardware differences could be the root cause:

Disable cpuidle temporarily: Add cpuidle.off=1 to the kernel command line. If the panic stops, the problem is tied to entering the idle state. That could mean the failing board has a CPU or power-management hardware defect that shows up on idle entry, but it could also mean the idle path is simply the first code to use a pointer that was corrupted earlier, so treat this result as a hint and not as proof of a hardware fault.
Check tw6869 peripheral wiring: The driver's "unknown std detected" warning suggests that the video input on channel 1 is not delivering a recognizable signal. Compare the failing board against the working one and against the schematics. Look for loose connections, incorrect voltage levels, or missing pull-up/pull-down resistors that could make the device, and in turn the driver, behave erratically.
Verify interrupt routing: Misconfigured interrupts for the tw6869 can cause spurious interrupts that corrupt kernel state. Use JTAG to monitor interrupt activity when running the app and compare it to the working board.
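As a coarser complement to JTAG-level monitoring, interrupt counts can also be compared from user space by capturing /proc/interrupts on both boards after running the same workload for the same time. The sketch below parses two such snapshots and flags interrupt lines that fired far more often on the failing board. The factor of 10, the function names, and the sample data are arbitrary choices for illustration.

```python
def parse_interrupts(text: str) -> dict:
    """Parse /proc/interrupts text into {"<irq> <name>": total count over all CPUs}."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())          # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = [int(f) for f in fields[:ncpus] if f.isdigit()]
        # The last field is usually the driver or device name.
        name = fields[-1] if len(fields) > ncpus else ""
        totals[f"{label.strip()} {name}".strip()] = sum(counts)
    return totals


def suspicious(good: dict, bad: dict, factor: float = 10.0) -> dict:
    """Return lines whose count on the failing board exceeds factor x the working board."""
    flagged = {}
    for key, bad_count in bad.items():
        good_count = good.get(key, 0)
        if bad_count > factor * max(good_count, 1):
            flagged[key] = (good_count, bad_count)
    return flagged


if __name__ == "__main__":
    # Made-up snapshots; real ones come from `cat /proc/interrupts` on each board.
    good = "CPU0 CPU1\n 29: 10 5 GIC 29 Edge twd\n155: 100 0 GIC 155 Level tw6869\n"
    bad = "CPU0 CPU1\n 29: 10 5 GIC 29 Edge twd\n155: 5000 0 GIC 155 Level tw6869\n"
    print(suspicious(parse_interrupts(good), parse_interrupts(bad)))
```

A large imbalance on the tw6869 line would support the spurious-interrupt theory; similar counts on both boards would make it less likely.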

4. Isolate the App’s Impact

Since the panic only occurs when running your specific app:

Strace the app: Run strace -f on the app to log all system calls. Look for unusual ioctls, memory mappings, or interactions with the tw6869 driver that might be triggering kernel state corruption.
Test a minimal app version: Strip down the app to the smallest code snippet that still triggers the panic. This will help you identify exactly which part of the app is causing the issue.
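A long strace log is easier to review after a first pass that summarizes the ioctl traffic. The sketch below counts ioctl requests by name and records which ones returned an error. It assumes the default strace line format for completed calls (lines split by "unfinished"/"resumed" markers under -f are skipped), and the sample log lines are made up for illustration.

```python
import re
from collections import Counter

# Matches completed calls such as: ioctl(5, VIDIOC_QUERYSTD, 0x7ea3c9f8) = 0
IOCTL_RE = re.compile(r"ioctl\((\d+),\s*([^,\s)]+).*\)\s*=\s*(-?\d+)")


def summarize_ioctls(log_text: str):
    """Return (calls, failures): Counters of ioctl requests seen and of those that failed."""
    calls = Counter()
    failures = Counter()
    for line in log_text.splitlines():
        match = IOCTL_RE.search(line)
        if not match:
            continue
        request, ret = match.group(2), int(match.group(3))
        calls[request] += 1
        if ret < 0:
            failures[request] += 1
    return calls, failures


if __name__ == "__main__":
    # Made-up sample; in practice read the file written by `strace -f -o <file>`.
    sample = (
        "[pid   812] ioctl(5, VIDIOC_QUERYSTD, 0x7ea3c9f8) = 0\n"
        "[pid   812] ioctl(5, VIDIOC_S_STD, 0x7ea3c9f8) = -1 EINVAL (Invalid argument)\n"
    )
    calls, failures = summarize_ioctls(sample)
    print("calls:", dict(calls))
    print("failures:", dict(failures))
```

Comparing the summary from the failing board with one from the working board shows quickly whether the app takes a different path through the driver, for example an error-handling branch that is never exercised on good hardware.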

Final Likely Culprits

Based on the log and your setup, the issue is most likely one of two things:

A software bug in the tw6869 driver or cpuidle code that corrupts memory when your app runs, or
A hardware defect with the video peripheral or CPU idle circuitry that causes the kernel to misbehave.

Following these steps should narrow down the root cause: start with software validation, then use JTAG to find the corrupted pointer, and finally test the hardware-specific variables.

The question in this post comes from Stack Exchange; it was asked by Peter_Amond.