zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:
On 2022/1/10 1:52 H, Huang, Ying wrote:
Hi, Zhongjiang,
zhongjiang-ali<zhongjiang-ali(a)linux.alibaba.com> writes:
ANBZ: #80
When sysctl_numa_balancing_mode is set to NUMA_BALANCING_MEMORY_TIERING,
memory can be migrated between fast and slow nodes, and pages in slow
memory reuse the cpupid field. This causes a problem when
sysctl_numa_balancing_mode is turned off dynamically:
should_numa_migrate_memory() may still be deciding whether a slow-memory
page should be migrated to fast memory at the moment
NUMA_BALANCING_MEMORY_TIERING is turned off. It then fails to obtain the
correct node from the cpupid field of the slow-memory page, which
triggers the panic.
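
To make the failure mode concrete, here is a minimal userspace sketch of decoding a CPU number from a cpupid-style field. It is not kernel code: the `sketch_` names are hypothetical, and the layout (shift by 8, mask 0x1ff) is an assumption that appears to match the `c1 f8 08` and `25 ff 01 00 00` bytes in the Code: line of the oops quoted in this thread. If the field holds something other than a real cpupid, the decoded CPU can fall outside the valid range, and indexing a per-CPU table with it reads an arbitrary address.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout for this sketch only: the low 8 bits hold PID bits,
 * the next 9 bits hold the CPU number. */
#define SKETCH_PID_BITS 8
#define SKETCH_CPU_MASK 0x1ff

/* Extract the CPU number that a cpupid-style value claims to record. */
int sketch_cpupid_to_cpu(int cpupid)
{
	return (cpupid >> SKETCH_PID_BITS) & SKETCH_CPU_MASK;
}

/* True if the decoded CPU could safely index a table of nr_cpus entries. */
bool sketch_cpupid_cpu_in_range(int cpupid, int nr_cpus)
{
	assert(nr_cpus > 0);
	return sketch_cpupid_to_cpu(cpupid) < nr_cpus;
}
```

With 16 CPUs, a genuine value such as 0x0e2a decodes to CPU 14, while an unrelated value such as 0x1a5c7 decodes to CPU 421, which is out of range.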
Thanks for catching this! Can you share the panic kernel log?
The whole log has been deleted, but the kernel stack is as follows,
and it is easily reproduced.
[381959.473850] BUG: unable to handle kernel paging request at
ffffffff9670bae0
[381959.474463] PGD 29a620c067 P4D 29a620c067 PUD 29a620d063 PMD
2f3c2ef063 PTE 800fffd6598f4062
[381959.475115] Oops: 0000 [#1] SMP PTI
[381959.475398] CPU: 14 PID: 518 Comm: systemd-journal Kdump: loaded
Tainted: G E 4.19.91-001.ali4000_20210617_6287b9d5de_cbp.alios7.x86_64
#1
[381959.476441] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS
8c24b4c 04/01/2014
[381959.477037] RIP: 0010:should_numa_migrate_memory+0xc3/0x760
[381959.477470] Code: ff 00 00 00 0f 84 89 03 00 00 41 0f b6 b5 f0 05
00 00 39 f1 0f 84 79 03 00 00 c1 f8 08 25 ff 01 00 00 48 8b 04 c5 20
37 16 96 <46> 3b 24 30 74 19 48 83 c4 38 31 c0 5b 5d 41 5c 41 5d 41 5e
41 5f
[381959.478888] RSP: 0000:ffff9ff78d35bd08 EFLAGS: 00010206
[381959.479295] RAX: ffffffff966ec000 RBX: ffffc96b10868980 RCX:
000000000000005c
[381959.479842] RDX: 05169704c2022014 RSI: 0000000000000006 RDI:
ffffc96b10868980
[381959.480390] RBP: 0000000000000000 R08: ff80003fffffffff R09:
ffff93c63ac01d10
[381959.480936] R10: 0000000000000002 R11: 0000000000000000 R12:
0000000000000000
[381959.481484] R13: ffff93c63af8c740 R14: 000000000001fae0 R15:
0000000000000001
[381959.482032] FS: 00007f89d8602880(0000) GS:ffff93c641b80000(0000)
knlGS:0000000000000000
[381959.482650] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[381959.483095] CR2: ffffffff9670bae0 CR3: 0000000f77c12006 CR4:
00000000003606e0
[381959.483646] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[381959.484196] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[381959.484741] Call Trace:
[381959.484965] mpol_misplaced+0x17e/0x200
[381959.485274] do_numa_page+0x1e5/0x288
[381959.485563] handle_mm_fault+0x91b/0x960
[381959.485879] __do_page_fault+0x26b/0x4a0
[381959.486189] do_page_fault+0x32/0x110
[381959.486479] ? async_page_fault+0x8/0x30
[381959.486788] async_page_fault+0x1e/0x30
[381959.487094] RIP: 0033:0x558e2f2e21e0
[381959.487379] Code: 8b 83 c8 00 00 00 48 8b 48 70 48 c1 e9 04 48 85
c9 0f 84 9c 01 00 00 31 d2 48 89 e8 48 f7 f1 48 c1 e2 04 48 03 93 d0
00 00 00 <4c> 8b 3a 4
Thanks! I will try to reproduce this. I have fixed a similar bug
before, but apparently I missed this case.
Based on my experience, the panic comes from an invalid CPU number. So I
suggest checking for an invalid CPU number and resetting the CPUPID if
one is found. With that, we can go back to normal NUMA balancing mode
after a while. This will be an upstream patch too.
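
The "check and reset" idea could look roughly like the userspace sketch below. This is only an illustration under stated assumptions, not the upstream patch: `struct sketch_page`, `sketch_checked_last_cpupid()` and `SKETCH_CPUPID_UNSET` are hypothetical names, and the field layout (shift by 8, mask 0x1ff) is assumed for the example.

```c
#include <assert.h>

/* Assumed layout and sentinel, for this sketch only. */
#define SKETCH_PID_BITS     8
#define SKETCH_CPU_MASK     0x1ff
#define SKETCH_CPUPID_UNSET (-1)

/* Hypothetical stand-in for the per-page field discussed in the thread. */
struct sketch_page {
	int last_cpupid;
};

/*
 * Return the stored cpupid if its CPU number is plausible. Otherwise reset
 * the field to "unset" and report that, so the caller treats the access
 * like a first fault instead of indexing per-CPU data with a bogus CPU.
 */
int sketch_checked_last_cpupid(struct sketch_page *page, int nr_cpus)
{
	int cpupid, cpu;

	assert(page);
	cpupid = page->last_cpupid;
	if (cpupid == SKETCH_CPUPID_UNSET)
		return SKETCH_CPUPID_UNSET;

	cpu = (cpupid >> SKETCH_PID_BITS) & SKETCH_CPU_MASK;
	if (cpu >= nr_cpus) {
		page->last_cpupid = SKETCH_CPUPID_UNSET;
		return SKETCH_CPUPID_UNSET;
	}
	return cpupid;
}
```

Resetting the field rather than only bailing out means the stale value is cleaned up the first time it is seen, which fits the "back to normal after a while" behaviour described above.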
Best Regards,
Huang, Ying
>
>> Signed-off-by: zhongjiang-ali<zhongjiang-ali(a)linux.alibaba.com>
>> ---
>> kernel/sched/fair.c | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0184145..6afa935 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3016,6 +3016,14 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
>> last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
>> /*
>> + * Migration will turn off between fast memory and slow node when
>> + * sysctl_numa_balancing_mode disable the feature dynamically.
>> + */
>> + if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
>> + !node_is_toptier(src_nid))
>> + return false;
>> +
>> + /*
>> * Allow first faults or private faults to migrate immediately early in
>> * the lifetime of a task. The magic number 4 is based on waiting for
>> * two full passes of the "multi-stage node selection" test that is