zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:
On 2022/1/10 2:50, Huang, Ying wrote:
zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:
On 2022/1/10 1:52, Huang, Ying wrote:
Hi, Zhongjiang,
zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes:
> ANBZ: #80
>
> When sysctl_numa_balancing_mode is set to NUMA_BALANCING_MEMORY_TIERING,
> memory migration between fast and slow nodes is allowed, and pages in
> slow memory reuse the cpupid field. But this causes a problem when
> sysctl_numa_balancing_mode is turned off dynamically.
>
> should_numa_migrate_memory() decides whether a page in slow memory
> should be migrated to fast memory. If NUMA_BALANCING_MEMORY_TIERING has
> just been turned off, it fails to obtain a valid node from the cpupid
> field of slow memory pages, hence it triggers a panic.
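For illustration only, here is a minimal standalone sketch (not kernel code; PID_BITS, CPU_BITS and NR_CPU_IDS are assumed stand-ins for the LAST__PID_MASK/LAST__CPU_MASK layout used by cpupid_to_pid()/cpupid_to_cpu() in include/linux/mm.h and for nr_cpu_ids) of why decoding the reused field as a cpupid can point past the CPUs that actually exist:

#include <stdio.h>

/* Assumed, simplified cpupid layout: low bits pid, next bits cpu. */
#define PID_BITS   8
#define CPU_BITS   13
#define CPU_MASK   ((1 << CPU_BITS) - 1)
#define NR_CPU_IDS 128			/* CPUs present on the machine */

static int cpupid_to_cpu(int cpupid)
{
	return (cpupid >> PID_BITS) & CPU_MASK;
}

int main(void)
{
	/*
	 * In tiering mode the field holds a scan-time value, not a packed
	 * (cpu, pid).  Decoding it anyway, as the unpatched code does once
	 * the mode is switched off, yields a nonsense CPU number.
	 */
	int stale = 0x1fae0;		/* arbitrary timestamp-like value */
	int cpu = cpupid_to_cpu(stale);

	printf("decoded cpu = %d, valid ids are 0..%d\n", cpu, NR_CPU_IDS - 1);
	if (cpu >= NR_CPU_IDS)
		printf("out of range: per-CPU/NUMA lookups with it would fault\n");
	return 0;
}

Per-CPU and per-node lookups keyed by such a CPU number then dereference a bogus address, which is consistent with the oops in should_numa_migrate_memory() shown below.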
Thanks for catching this! Can you share the panic kernel log?
The whole log has been deleted, but the kernel stack is as follows, and it is easily reproduced.
[381959.473850] BUG: unable to handle kernel paging request at ffffffff9670bae0
[381959.474463] PGD 29a620c067 P4D 29a620c067 PUD 29a620d063 PMD 2f3c2ef063 PTE 800fffd6598f4062
[381959.475115] Oops: 0000 [#1] SMP PTI
[381959.475398] CPU: 14 PID: 518 Comm: systemd-journal Kdump: loaded Tainted: G E 4.19.91-001.ali4000_20210617_6287b9d5de_cbp.alios7.x86_64 #1
[381959.476441] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014
[381959.477037] RIP: 0010:should_numa_migrate_memory+0xc3/0x760
[381959.477470] Code: ff 00 00 00 0f 84 89 03 00 00 41 0f b6 b5 f0 05 00 00 39 f1 0f 84 79 03 00 00 c1 f8 08 25 ff 01 00 00 48 8b 04 c5 20 37 16 96 <46> 3b 24 30 74 19 48 83 c4 38 31 c0 5b 5d 41 5c 41 5d 41 5e 41 5f
[381959.478888] RSP: 0000:ffff9ff78d35bd08 EFLAGS: 00010206
[381959.479295] RAX: ffffffff966ec000 RBX: ffffc96b10868980 RCX: 000000000000005c
[381959.479842] RDX: 05169704c2022014 RSI: 0000000000000006 RDI: ffffc96b10868980
[381959.480390] RBP: 0000000000000000 R08: ff80003fffffffff R09: ffff93c63ac01d10
[381959.480936] R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
[381959.481484] R13: ffff93c63af8c740 R14: 000000000001fae0 R15: 0000000000000001
[381959.482032] FS: 00007f89d8602880(0000) GS:ffff93c641b80000(0000) knlGS:0000000000000000
[381959.482650] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[381959.483095] CR2: ffffffff9670bae0 CR3: 0000000f77c12006 CR4: 00000000003606e0
[381959.483646] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[381959.484196] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[381959.484741] Call Trace:
[381959.484965] mpol_misplaced+0x17e/0x200
[381959.485274] do_numa_page+0x1e5/0x288
[381959.485563] handle_mm_fault+0x91b/0x960
[381959.485879] __do_page_fault+0x26b/0x4a0
[381959.486189] do_page_fault+0x32/0x110
[381959.486479] ? async_page_fault+0x8/0x30
[381959.486788] async_page_fault+0x1e/0x30
[381959.487094] RIP: 0033:0x558e2f2e21e0
[381959.487379] Code: 8b 83 c8 00 00 00 48 8b 48 70 48 c1 e9 04 48 85 c9 0f 84 9c 01 00 00 31 d2 48 89 e8 48 f7 f1 48 c1 e2 04 48 03 93 d0 00 00 00 <4c> 8b 3a 4
Thanks! I will try to reproduce this. I have fixed a similar bug before, but apparently I missed this case.
Based on my experience, the panic comes from an invalid CPU number. So I
suggest checking for an invalid CPU number and resetting the CPUPID if
so. With that, we can go back to normal NUMA balancing mode after a
while. This will be an upstream patch too.
The patch has considered that case: the cpupid field has been reset with
the current cpu and pid after do_numa_page(), which means we go back to
normal NUMA balancing mode.
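For reference, the reset being referred to is the unconditional exchange near the top of should_numa_migrate_memory(), kept as context in the patch below:

	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);

It stores a valid (cpu, pid) for later faults, but the value read back into last_cpupid on this fault can still be a stale tiering-mode one.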
With your patch, pages in PMEM cannot be migrated to DRAM in normal NUMA
balancing mode. I have just written a patch as below. It hasn't been
tested yet; it's just to show my idea.
Best Regards,
Huang, Ying
----------------------------8<-------------------------------------
modified include/linux/mm.h
@@ -1440,6 +1440,11 @@ static inline bool __cpupid_match_pid(pid_t task_pid, int cpupid)
return (task_pid & LAST__PID_MASK) == cpupid_to_pid(cpupid);
}
+static inline bool check_cpupid(int cpupid)
+{
+ return cpupid_to_cpu(cpupid) < nr_cpumask_bits;
+}
+
#define cpupid_match_pid(task, cpupid) __cpupid_match_pid(task->pid, cpupid)
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
modified kernel/sched/fair.c
@@ -1597,6 +1597,14 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
+ /*
+ * The cpupid may be invalid when NUMA_BALANCING_MEMORY_TIERING
+ * is disabled dynamically.
+ */
+ if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
+ !node_is_toptier(src_nid) && !check_cpupid(last_cpupid))
+ return false;
+
/*
* Allow first faults or private faults to migrate immediately early in
* the lifetime of a task. The magic number 4 is based on waiting for
@@ -2812,11 +2820,19 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
if (!p->mm)
return;
- /* Numa faults statistics are unnecessary for the slow memory node */
- if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
- !node_is_toptier(mem_node))
+ /*
+ * NUMA faults statistics are unnecessary for the slow memory node.
+ *
+ * And, the cpupid may be invalid when NUMA_BALANCING_MEMORY_TIERING
+ * is disabled dynamically.
+ */
+ if (!node_is_toptier(mem_node) &&
+ (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ||
+ !check_cpupid(last_cpupid)))
return;
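
Not part of the patch: a quick userspace-style harness (same assumed bit layout as the earlier sketch, with NR_CPUMASK_BITS standing in for nr_cpumask_bits) showing the intended behaviour of check_cpupid(), i.e. migration and fault statistics are only trusted when the stored value still decodes to a real CPU:

#include <stdbool.h>
#include <stdio.h>

#define PID_BITS        8
#define CPU_BITS        13
#define CPU_MASK        ((1 << CPU_BITS) - 1)
#define NR_CPUMASK_BITS 128	/* stand-in for nr_cpumask_bits */

static int cpupid_to_cpu(int cpupid)
{
	return (cpupid >> PID_BITS) & CPU_MASK;
}

/* Mirrors the helper above: true when the decoded CPU id is plausible. */
static bool check_cpupid(int cpupid)
{
	return cpupid_to_cpu(cpupid) < NR_CPUMASK_BITS;
}

int main(void)
{
	int genuine = (3 << PID_BITS) | 42;	/* packed cpu 3, pid bits 42 */
	int stale   = 0x1fae0;			/* timestamp-like leftover */

	printf("genuine cpupid valid: %d\n", check_cpupid(genuine));	/* 1 */
	printf("stale cpupid valid:   %d\n", check_cpupid(stale));	/* 0 */
	return 0;
}

With a stale value, should_numa_migrate_memory() bails out with false and task_numa_fault() skips the statistics; the next fault then starts from the freshly written, valid cpupid.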