On 2022/1/10 3:36 PM, Huang, Ying wrote:
zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:
On 2022/1/10 2:50 PM, Huang, Ying wrote:
zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:
On 2022/1/10 1:52 PM, Huang, Ying wrote:
> Hi, Zhongjiang,
>
> zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes:
>
>> ANBZ: #80
>>
>> When sysctl_numa_balancing_mode is set to NUMA_BALANCING_MEMORY_TIERING,
>> memory migration between fast and slow nodes is allowed, and pages in
>> slow memory reuse the cpupid field, so it no longer encodes a valid
>> cpu and pid. This introduces an issue when sysctl_numa_balancing_mode
>> is turned off dynamically.
>>
>> should_numa_migrate_memory() decides whether slow memory should be
>> migrated to fast memory. If NUMA_BALANCING_MEMORY_TIERING is turned
>> off concurrently, it fails to obtain a correct node from the cpupid
>> field of a slow memory page, hence it triggers a panic.
> Thanks for catching this! Can you share the panic kernel log?
The whole log has been deleted, but the kernel stack is as follows,
and the issue is easy to reproduce.
[381959.473850] BUG: unable to handle kernel paging request at ffffffff9670bae0
[381959.474463] PGD 29a620c067 P4D 29a620c067 PUD 29a620d063 PMD 2f3c2ef063 PTE 800fffd6598f4062
[381959.475115] Oops: 0000 [#1] SMP PTI
[381959.475398] CPU: 14 PID: 518 Comm: systemd-journal Kdump: loaded Tainted: G E 4.19.91-001.ali4000_20210617_6287b9d5de_cbp.alios7.x86_64 #1
[381959.476441] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014
[381959.477037] RIP: 0010:should_numa_migrate_memory+0xc3/0x760
[381959.477470] Code: ff 00 00 00 0f 84 89 03 00 00 41 0f b6 b5 f0 05 00 00 39 f1 0f 84 79 03 00 00 c1 f8 08 25 ff 01 00 00 48 8b 04 c5 20 37 16 96 <46> 3b 24 30 74 19 48 83 c4 38 31 c0 5b 5d 41 5c 41 5d 41 5e 41 5f
[381959.478888] RSP: 0000:ffff9ff78d35bd08 EFLAGS: 00010206
[381959.479295] RAX: ffffffff966ec000 RBX: ffffc96b10868980 RCX: 000000000000005c
[381959.479842] RDX: 05169704c2022014 RSI: 0000000000000006 RDI: ffffc96b10868980
[381959.480390] RBP: 0000000000000000 R08: ff80003fffffffff R09: ffff93c63ac01d10
[381959.480936] R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
[381959.481484] R13: ffff93c63af8c740 R14: 000000000001fae0 R15: 0000000000000001
[381959.482032] FS: 00007f89d8602880(0000) GS:ffff93c641b80000(0000) knlGS:0000000000000000
[381959.482650] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[381959.483095] CR2: ffffffff9670bae0 CR3: 0000000f77c12006 CR4: 00000000003606e0
[381959.483646] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[381959.484196] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[381959.484741] Call Trace:
[381959.484965] mpol_misplaced+0x17e/0x200
[381959.485274] do_numa_page+0x1e5/0x288
[381959.485563] handle_mm_fault+0x91b/0x960
[381959.485879] __do_page_fault+0x26b/0x4a0
[381959.486189] do_page_fault+0x32/0x110
[381959.486479] ? async_page_fault+0x8/0x30
[381959.486788] async_page_fault+0x1e/0x30
[381959.487094] RIP: 0033:0x558e2f2e21e0
[381959.487379] Code: 8b 83 c8 00 00 00 48 8b 48 70 48 c1 e9 04 48 85 c9 0f 84 9c 01 00 00 31 d2 48 89 e8 48 f7 f1 48 c1 e2 04 48 03 93 d0 00 00 00 <4c> 8b 3a 4
Thanks! I will try to reproduce this. I have fixed a similar bug
before, but apparently I missed this case.
Based on my experience, the panic comes from an invalid CPU number. So
I suggest checking for an invalid CPU number and resetting the cpupid
if so. With that, we can go back to normal NUMA balancing mode after a
while. This will be an upstream patch too.
The patch has considered that case: the cpupid field is reset with the
current cpu and pid after do_numa_page(), which means we go back to
normal NUMA balancing mode.
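The key point is that page_cpupid_xchg_last() stores the fresh value
while returning the old one, so only the first fault after the mode
switch can see a stale value. A minimal userspace sketch of that
exchange semantics (the helper below is a mock-up with a made-up
encoding, not the kernel implementation):

#include <stdio.h>

/* Mock of the cpupid exchange: store the new value, return the old one. */
static int cpupid_xchg_last(int *page_cpupid, int new_cpupid)
{
        int old = *page_cpupid;

        *page_cpupid = new_cpupid;
        return old;
}

int main(void)
{
        int page_cpupid = 0x1fae0;              /* stale bits left by tiering mode */
        int this_cpupid = (14 << 8) | 0x5c;     /* made-up cpu/pid encoding */

        /* First fault after the mode switch still reads the stale value... */
        printf("first fault sees 0x%x (stale)\n",
               cpupid_xchg_last(&page_cpupid, this_cpupid));

        /* ...but the field is now reset, so the next fault sees a valid one. */
        printf("next fault sees 0x%x (valid)\n",
               cpupid_xchg_last(&page_cpupid, this_cpupid));
        return 0;
}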
With your patch, a page in PMEM cannot be migrated to DRAM in normal
NUMA balancing mode. I have just written a patch, below. It hasn't
been tested yet; it is just to show my idea.
I get your point that you want to migrate the page in this context.
IMO, cpupid_to_cpu() does not give the proper cpu here, but a random
value. Depending on it for promotion may not be the correct condition
if DRAM has more than one node. Hence I just avoid migrating to DRAM
in this context.
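For illustration, a minimal userspace sketch of why the leftover bits
are dangerous (the field layout and CPU count are made up for the
example, not the kernel's actual configuration):

#include <stdio.h>

#define CPUPID_CPU_SHIFT        8       /* made-up layout: cpu in bits 8..16 */
#define CPUPID_CPU_MASK         0x1ff
#define NR_CPUS_DEMO            64      /* pretend the machine has 64 CPUs */

static int cpupid_to_cpu(int cpupid)
{
        return (cpupid >> CPUPID_CPU_SHIFT) & CPUPID_CPU_MASK;
}

int main(void)
{
        /* In tiering mode the field held something else, not a cpu/pid. */
        int stale_cpupid = 0x1fae0;
        int cpu = cpupid_to_cpu(stale_cpupid);

        printf("decoded cpu = %d of %d\n", cpu, NR_CPUS_DEMO);
        if (cpu >= NR_CPUS_DEMO)
                printf("out-of-range cpu: using it to pick a node would "
                       "index past the lookup table, as in the oops above\n");
        return 0;
}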
Thanks
zhong jiang
Best Regards,
Huang, Ying
----------------------------8<-------------------------------------
modified include/linux/mm.h
@@ -1440,6 +1440,12 @@ static inline bool __cpupid_match_pid(pid_t task_pid, int cpupid)
 	return (task_pid & LAST__PID_MASK) == cpupid_to_pid(cpupid);
 }
 
+/* True if the cpu encoded in cpupid is a plausible cpu number. */
+static inline bool check_cpupid(int cpupid)
+{
+	return cpupid_to_cpu(cpupid) < nr_cpumask_bits;
+}
+
 #define cpupid_match_pid(task, cpupid) __cpupid_match_pid(task->pid, cpupid)
 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
modified kernel/sched/fair.c
@@ -1597,6 +1597,14 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
 	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
 
+	/*
+	 * The cpupid may be invalid when NUMA_BALANCING_MEMORY_TIERING
+	 * is disabled dynamically.
+	 */
+	if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) &&
+	    !node_is_toptier(src_nid) && !check_cpupid(last_cpupid))
+		return false;
+
 	/*
 	 * Allow first faults or private faults to migrate immediately early in
 	 * the lifetime of a task. The magic number 4 is based on waiting for
@@ -2812,11 +2820,19 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	if (!p->mm)
 		return;
 
-	/* Numa faults statistics are unnecessary for the slow memory node */
-	if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
-	    !node_is_toptier(mem_node))
+	/*
+	 * NUMA faults statistics are unnecessary for the slow memory node.
+	 *
+	 * And, the cpupid may be invalid when NUMA_BALANCING_MEMORY_TIERING
+	 * is disabled dynamically.
+	 */
+	if (!node_is_toptier(mem_node) &&
+	    (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING ||
+	     !check_cpupid(last_cpupid)))
 		return;
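As said, the patch is untested, but the intent of check_cpupid() can
be sanity-checked in user space with stand-in definitions (nothing
below is kernel code; the layout and the nr_cpumask_bits stand-in are
mock-ups):

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUMASK_BITS_DEMO    64      /* stand-in for nr_cpumask_bits */

static int cpupid_to_cpu(int cpupid)
{
        return (cpupid >> 8) & 0x1ff;   /* same made-up layout as above */
}

/* Mirror of the patched helper: true when the encoded cpu is plausible. */
static bool check_cpupid(int cpupid)
{
        return cpupid_to_cpu(cpupid) < NR_CPUMASK_BITS_DEMO;
}

int main(void)
{
        int fresh = (14 << 8) | 0x5c;   /* cpu 14: in range, trusted */
        int stale = 0x1fae0;            /* cpu 506: out of range, rejected */

        printf("fresh: cpu=%d ok=%d\n", cpupid_to_cpu(fresh), check_cpupid(fresh));
        printf("stale: cpu=%d ok=%d\n", cpupid_to_cpu(stale), check_cpupid(stale));
        return 0;
}

The direction of the predicate matters: check_cpupid() returns true
for a plausible cpu, so the !check_cpupid() call sites bail out
exactly when the field still holds leftover tiering data.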