- Pmem - lists.openanolis.cn

by Huang, Ying

Hi, All, We have just released the version 0.8 of the memory tiering kernel in the following URL, https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/log/?h=t… The main changes are as follows (in README-tiering.txt too). Updates in tiering-0.8: - Rebased on v5.15 - Remove cgroup v1 support, we will switch to cgroup v2 support in a future version. If you need cgroup v1 support, please stick with v0.72. - Increase hot threshold quicker if too few pages pass the threshold - Reset hot threshold if workload change is detected - Batch migrate_pages() to reduce TLB shootdown IPIs - Support to decrease hot threshold if the pages just demoted are hot - Support to promote pages asynchronously - Support to wake up kswapd earlier to make promotion more smooth - Add more sysctl knob for experimenting new features - Change the interface to enable NUMA balancing for MPOL_PREFERRED_MANY The recommended configuration is changed too, also in README-tiering.txt. The target of the patchset is upstream, so it uses the rebase policy and refresh the patchset directly instead of changing incrementally. This makes it harder for Alinos to use the patchset ... Best Regards, Huang, Ying

3 years, 3 months

2
1
0 0

Re: [PATCH] mm: tiered: Do not promotion when tiered memory is turned off

by Huang, Ying

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes: > On 2022/1/10 1:52 H, Huang, Ying wrote: >> Hi, Zhongjiang, >> >> zhongjiang-ali<zhongjiang-ali(a)linux.alibaba.com> writes: >> >>> ANBZ: #80 >>> >>> sysctl_numa_balancing_mode is set to NUMA_BALANCING_MEMORY_TIERING >>> allowing memory migration between fast and slow node, and the page >>> of slow memory reuse the cpupid field. But it will bring in the >>> issue when sysctl_numa_balancing_mode is turned off dynamtically. >>> >>> should_numa_migrate_memory will choose whether the slow memory should >>> be migrated to fast memory when NUMA_BALANCING_MEMORY_TIERING is >>> turned off simultaneously. It will fails to obtain the correct node >>> from cpupid field in slow memory. hence it will trigger the panic. >> Thanks for catching this! Can you share the panic kernel log? > > The whole log has been deleted, but the kernel stack is as follows, > and it is easily reproduced. > > [381959.473850] BUG: unable to handle kernel paging request at > ffffffff9670bae0 > [381959.474463] PGD 29a620c067 P4D 29a620c067 PUD 29a620d063 PMD > 2f3c2ef063 PTE 800fffd6598f4062 > [381959.475115] Oops: 0000 [#1] SMP PTI > [381959.475398] CPU: 14 PID: 518 Comm: systemd-journal Kdump: loaded > Tainted: G E 4.19.91-001.ali4000_20210617_6287b9d5de_cbp.alios7.x86_64 > #1 > [381959.476441] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS > 8c24b4c 04/01/2014 > [381959.477037] RIP: 0010:should_numa_migrate_memory+0xc3/0x760 > [381959.477470] Code: ff 00 00 00 0f 84 89 03 00 00 41 0f b6 b5 f0 05 > 00 00 39 f1 0f 84 79 03 00 00 c1 f8 08 25 ff 01 00 00 48 8b 04 c5 20 > 37 16 96 <46> 3b 24 30 74 19 48 83 c4 38 31 c0 5b 5d 41 5c 41 5d 41 5e > 41 5f > [381959.478888] RSP: 0000:ffff9ff78d35bd08 EFLAGS: 00010206 > [381959.479295] RAX: ffffffff966ec000 RBX: ffffc96b10868980 RCX: > 000000000000005c > [381959.479842] RDX: 05169704c2022014 RSI: 0000000000000006 RDI: > ffffc96b10868980 > [381959.480390] RBP: 0000000000000000 R08: ff80003fffffffff R09: > ffff93c63ac01d10 > [381959.480936] R10: 0000000000000002 R11: 0000000000000000 R12: > 0000000000000000 > [381959.481484] R13: ffff93c63af8c740 R14: 000000000001fae0 R15: > 0000000000000001 > [381959.482032] FS: 00007f89d8602880(0000) GS:ffff93c641b80000(0000) > knlGS:0000000000000000 > [381959.482650] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [381959.483095] CR2: ffffffff9670bae0 CR3: 0000000f77c12006 CR4: > 00000000003606e0 > [381959.483646] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [381959.484196] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > 0000000000000400 > [381959.484741] Call Trace: > [381959.484965] mpol_misplaced+0x17e/0x200 > [381959.485274] do_numa_page+0x1e5/0x288 > [381959.485563] handle_mm_fault+0x91b/0x960 > [381959.485879] __do_page_fault+0x26b/0x4a0 > [381959.486189] do_page_fault+0x32/0x110 > [381959.486479] ? async_page_fault+0x8/0x30 > [381959.486788] async_page_fault+0x1e/0x30 > [381959.487094] RIP: 0033:0x558e2f2e21e0 > [381959.487379] Code: 8b 83 c8 00 00 00 48 8b 48 70 48 c1 e9 04 48 85 > c9 0f 84 9c 01 00 00 31 d2 48 89 e8 48 f7 f1 48 c1 e2 04 48 03 93 d0 > 00 00 00 <4c> 8b 3a 4 Thanks! I will try to reproduce this. I have fixed a similar bug before, but apparently I missed this case. Based on my experience, the panic comes from invalid CPU number. So I suggest to check invalid CPU number and reset CPUPID if so. With that, we can go back to normal NUMA balancing mode after a while. This will be a upstream patch too. Best Regards, Huang, Ying >> >>> Signed-off-by: zhongjiang-ali<zhongjiang-ali(a)linux.alibaba.com> >>> --- >>> kernel/sched/fair.c | 8 ++++++++ >>> 1 file changed, 8 insertions(+) >>> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >>> index 0184145..6afa935 100644 >>> --- a/kernel/sched/fair.c >>> +++ b/kernel/sched/fair.c >>> @@ -3016,6 +3016,14 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, >>> last_cpupid = page_cpupid_xchg_last(page, this_cpupid); >>> /* >>> + * Migration will turn off between fast memory and slow node when >>> + * sysctl_numa_balancing_mode disable the feature dynamically. >>> + */ >>> + if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && >>> + !node_is_toptier(src_nid)) >>> + return false; >>> + >>> + /* >>> * Allow first faults or private faults to migrate immediately early in >>> * the lifetime of a task. The magic number 4 is based on waiting for >>> * two full passes of the "multi-stage node selection" test that is

3 years, 3 months

2
16
0 0

[PATCH 0/4] Some improvements for page migration

by Baolin Wang

Hi, This patch set adds THP migration statistics and reduces TLB flush when page migration, as well as fixing the page refcount failure stats. Please help to review. Thanks. Anshuman Khandual (1): mm/vmstat: add events for THP migration without split Baolin Wang (1): anolis: mm: migrate: Move the page refcount failure statistics to the correct place Huang Ying (1): NUMA balancing: reduce TLB flush via delaying mapping on hint page fault Zi Yan (1): mm/migrate: correct thp migration stats Documentation/vm/page_migration.rst | 27 ++++++++++++++++ include/linux/vm_event_item.h | 3 ++ include/trace/events/migrate.h | 17 ++++++++-- mm/memory.c | 53 ++++++++++++++++++------------- mm/migrate.c | 62 +++++++++++++++++++++++++++++-------- mm/vmstat.c | 3 ++ 6 files changed, 127 insertions(+), 38 deletions(-) -- 1.8.3.1

3 years, 3 months

2
9
0 0

[PATCH] mm: memcontrol: set the correct memcg swappiness restriction

by Baolin Wang

ANBZ: #80 commit 37bc3cb9bbef86d1ddbbc789e55b588c8a2cac26 upstream Since commit c843966c556d ("mm: allow swappiness that prefers reclaiming anon over the file workingset") has expended the swappiness value to make swap to be preferred in some systems. We should also change the memcg swappiness restriction to allow memcg swap-preferred. Link: https://lkml.kernel.org/r/d77469b90c45c49953ccbc51e54a1d465bc18f70.16276262… Fixes: c843966c556d ("mm: allow swappiness that prefers reclaiming anon over the file workingset") Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com> Acked-by: Michal Hocko <mhocko(a)suse.com> Cc: Johannes Weiner <hannes(a)cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev(a)gmail.com> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org> --- Note: From mysql testing, we found the pagecache pages used to record logs will bring thrashing, and we can increase the swappiness to mitigate the thrashing by increasing the scanning propotion of anon pages when do demotion, which can improve about 2% performance. --- mm/memcontrol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index df08e95..580ab02 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4520,7 +4520,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css, { struct mem_cgroup *memcg = mem_cgroup_from_css(css); - if (val > 100 || val < -1 || (css->parent && val < 0)) + if (val > 200 || val < -1 || (css->parent && val < 0)) return -EINVAL; if (css->parent) -- 1.8.3.1

3 years, 3 months

2
1
0 0

[PATCH v2] anolis: mm: migrate: Move the page refcount failure statistics to the correct place

by Baolin Wang

ANBZ: #80 Move the page refcount failure statistics to the correct place, since the file pages also need consider mapping and PagePrivate setting. Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com> --- Changes from v1: - Add page count failure stats when failed to freeze the page count. --- mm/migrate.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 1ce6cf4..1128642c 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -458,13 +458,12 @@ int migrate_page_move_mapping(struct address_space *mapping, expected_count += is_device_private_page(page); expected_count += is_device_public_page(page); - if (page_count(page) != expected_count) - count_vm_events(PGMIGRATE_REFCOUNT_FAIL, hpage_nr_pages(page)); - if (!mapping) { /* Anonymous page without mapping */ - if (page_count(page) != expected_count) + if (page_count(page) != expected_count) { + count_vm_events(PGMIGRATE_REFCOUNT_FAIL, hpage_nr_pages(page)); return -EAGAIN; + } /* No turning back from here */ newpage->index = page->index; @@ -488,11 +487,13 @@ int migrate_page_move_mapping(struct address_space *mapping, radix_tree_deref_slot_protected(pslot, &mapping->i_pages.xa_lock) != page) { xa_unlock_irq(&mapping->i_pages); + count_vm_events(PGMIGRATE_REFCOUNT_FAIL, hpage_nr_pages(page)); return -EAGAIN; } if (!page_ref_freeze(page, expected_count)) { xa_unlock_irq(&mapping->i_pages); + count_vm_events(PGMIGRATE_REFCOUNT_FAIL, hpage_nr_pages(page)); return -EAGAIN; } -- 1.8.3.1

3 years, 3 months

1
0
0 0

[PATCH] mm: tiered: Do not promotion when tiered memory is turned off

by zhongjiang-ali

ANBZ: #80 sysctl_numa_balancing_mode is set to NUMA_BALANCING_MEMORY_TIERING allowing memory migration between fast and slow node, and the page of slow memory reuse the cpupid field. But it will bring in the issue when sysctl_numa_balancing_mode is turned off dynamtically. should_numa_migrate_memory will choose whether the slow memory should be migrated to fast memory when NUMA_BALANCING_MEMORY_TIERING is turned off simultaneously. It will fails to obtain the correct node from cpupid field in slow memory. hence it will trigger the panic. Signed-off-by: zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> --- kernel/sched/fair.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0184145..6afa935 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3016,6 +3016,14 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, last_cpupid = page_cpupid_xchg_last(page, this_cpupid); /* + * Migration will turn off between fast memory and slow node when + * sysctl_numa_balancing_mode disable the feature dynamically. + */ + if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING) && + !node_is_toptier(src_nid)) + return false; + + /* * Allow first faults or private faults to migrate immediately early in * the lifetime of a task. The magic number 4 is based on waiting for * two full passes of the "multi-stage node selection" test that is -- 1.8.3.1

3 years, 3 months

3
2
0 0

[PATCH] autonuma: fix watermark checking in migrate_balanced_pgdat()

by Baolin Wang

From: Huang Ying <ying.huang(a)intel.com> ANBZ: #80 commit bfe9d006c971a5daefe7a8b27819ccd497090fd8 upstream When zone_watermark_ok() is called in migrate_balanced_pgdat() to check migration target node, the parameter classzone_idx (for requested zone) is specified as 0 (ZONE_DMA). But when allocating memory for autonuma in alloc_misplaced_dst_page(), the requested zone from GFP flags is ZONE_MOVABLE. That is, the requested zone is different. The size of lowmem_reserve for the different requested zone is different. And this may cause some issues. For example, in the zoneinfo of a test machine as below, Node 0, zone DMA32 pages free 61592 min 29 low 454 high 879 spanned 1044480 present 442306 managed 425921 protection: (0, 0, 62457, 62457, 62457) The free page number of ZONE_DMA32 is greater than "high watermark + lowmem_reserve[ZONE_DMA]", but less than "high watermark + lowmem_reserve[ZONE_MOVABLE]". And because __alloc_pages_node() in alloc_misplaced_dst_page() requests ZONE_MOVABLE, the zone_watermark_ok() on ZONE_DMA32 in migrate_balanced_pgdat() may always return true. So, autonuma may not stop even when memory pressure in node 0 is heavy. To fix the issue, ZONE_MOVABLE is used as parameter to call zone_watermark_ok() in migrate_balanced_pgdat(). This makes it same as requested zone in alloc_misplaced_dst_page(). So that migrate_balanced_pgdat() returns false when memory pressure is heavy. Link: http://lkml.kernel.org/r/20191101075727.26683-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com> Acked-by: Mel Gorman <mgorman(a)suse.de> Cc: Michal Hocko <mhocko(a)suse.com> Cc: Rik van Riel <riel(a)redhat.com> Cc: Peter Zijlstra <peterz(a)infradead.org> Cc: Ingo Molnar <mingo(a)kernel.org> Cc: Dave Hansen <dave.hansen(a)linux.intel.com> Cc: Dan Williams <dan.j.williams(a)intel.com> Cc: Fengguang Wu <fengguang.wu(a)intel.com> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org> Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com> --- Note: this patch fixes the problem that the DRAM node's kswapd is not waked up in time, and improves about 12% with mysql performance testing. --- mm/migrate.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/migrate.c b/mm/migrate.c index 6d25ea0..e2dbf24 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1969,7 +1969,7 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat, int order) /* Avoid waking kswapd by allocating pages to migrate. */ if (!zone_watermark_ok(zone, order, high_wmark_pages(zone), - 0, 0)) + ZONE_MOVABLE, 0)) continue; return true; } -- 1.8.3.1

3 years, 4 months

2
1
0 0

[PATCH] autonuma: fix watermark checking in migrate_balanced_pgdat()

by Baolin Wang

From: Huang Ying <ying.huang(a)intel.com> When zone_watermark_ok() is called in migrate_balanced_pgdat() to check migration target node, the parameter classzone_idx (for requested zone) is specified as 0 (ZONE_DMA). But when allocating memory for autonuma in alloc_misplaced_dst_page(), the requested zone from GFP flags is ZONE_MOVABLE. That is, the requested zone is different. The size of lowmem_reserve for the different requested zone is different. And this may cause some issues. For example, in the zoneinfo of a test machine as below, Node 0, zone DMA32 pages free 61592 min 29 low 454 high 879 spanned 1044480 present 442306 managed 425921 protection: (0, 0, 62457, 62457, 62457) The free page number of ZONE_DMA32 is greater than "high watermark + lowmem_reserve[ZONE_DMA]", but less than "high watermark + lowmem_reserve[ZONE_MOVABLE]". And because __alloc_pages_node() in alloc_misplaced_dst_page() requests ZONE_MOVABLE, the zone_watermark_ok() on ZONE_DMA32 in migrate_balanced_pgdat() may always return true. So, autonuma may not stop even when memory pressure in node 0 is heavy. To fix the issue, ZONE_MOVABLE is used as parameter to call zone_watermark_ok() in migrate_balanced_pgdat(). This makes it same as requested zone in alloc_misplaced_dst_page(). So that migrate_balanced_pgdat() returns false when memory pressure is heavy. Link: http://lkml.kernel.org/r/20191101075727.26683-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com> Acked-by: Mel Gorman <mgorman(a)suse.de> Cc: Michal Hocko <mhocko(a)suse.com> Cc: Rik van Riel <riel(a)redhat.com> Cc: Peter Zijlstra <peterz(a)infradead.org> Cc: Ingo Molnar <mingo(a)kernel.org> Cc: Dave Hansen <dave.hansen(a)linux.intel.com> Cc: Dan Williams <dan.j.williams(a)intel.com> Cc: Fengguang Wu <fengguang.wu(a)intel.com> Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org> --- mm/migrate.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/migrate.c b/mm/migrate.c index a8f87cb..eae1565 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1859,7 +1859,7 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat, if (!zone_watermark_ok(zone, 0, high_wmark_pages(zone) + nr_migrate_pages, - 0, 0)) + ZONE_MOVABLE, 0)) continue; return true; } -- 1.8.3.1

3 years, 4 months

1
1
0 0

[PATCH 0/3] Map page as PROT_NONE after demotion

by Baolin Wang

Hi, This patch set backported several patches from git branch[1], which helps to reduce the overhead of page table entries scanning and identify pages demoted wrongly by mapping the new page as PROT_NONE when page demotion. With this patch set, there are some slight performance improvement with mysql read and write testing. Testing command and results: sysbench /usr/share/sysbench/oltp_read_write.lua --mysql-user=root --mysql-password=root --tables=200 --table-size=1000000 --threads=16 --time=600 --report-interval=10 prepare With patches: queries performed: transactions: 799824 (1331.06 per sec.) queries: 15996480 (26621.19 per sec.) latency: avg: 12.02 95th percentile: 20.37 Without patches: queries performed: transactions: 791806 (1318.89 per sec.) queries: 15836120 (26377.75 per sec.) latency: avg: 12.13 95th percentile: 20.37 [1] https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/log/?h=t… Huang Ying (3): mm, migrate: use flags parameter for remove_migration_ptes() memory tiering: map page as PROT_NONE after demotion memory tiering: control rate limit based on pgpromote_demoted include/linux/mmzone.h | 4 +++ include/linux/page-flags.h | 9 ++++++ include/linux/page_ext.h | 3 ++ include/linux/rmap.h | 8 ++++- include/linux/sched/numa_balancing.h | 61 ++++++++++++++++++++++++++++++++++++ include/trace/events/mmflags.h | 9 +++++- include/trace/events/sched.h | 12 ++++--- kernel/sched/fair.c | 28 +++++++++++++++-- mm/huge_memory.c | 4 +-- mm/mempolicy.c | 2 ++ mm/migrate.c | 59 ++++++++++++++++++++++++++++------ mm/vmstat.c | 1 + 12 files changed, 179 insertions(+), 21 deletions(-) -- 1.8.3.1

3 years, 4 months

2
5
0 0

[PATCH 00/15] More optimization and improvements for tiered memory system

by Baolin Wang

Hi, This patch set backported more optimization and improvements for tiered memory system from git branch [1]. Please find details in each patch. Thanks. Note: Almost patches have no logic conflicts except patch 4, since we should promote pages one by one in K4.19. [1] https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/log/?h=t… Feng Tang (3): mm/migrate: make migrate_misplaced_page works for unmapped file pages memory tiering: support unmapped file cache pages promotion mm/migrate: add vm counters to track reasons of page migration failure Huang Ying (12): memory tiering: add trace point for threshold adjustment memory tiering: add node-vmstat for promote threshold NUMA balancing: migrate dirty file cache pages memory tiering: add file pages promotion counter memory tiering: add file pages demotion counter memory tiering: fix THP failed to be isolated live lock memory tiering: check order in migrate_balanced_pgdat() memory tiering: avoid kswap to be keep in failure state for long memory tiering: accelerate promotion threshold adjustment memory tiering: extend promotion threshold range memory tiering: loosen per second rate limit memory tiering: double hot threshold for write hint page fault include/linux/mempolicy.h | 4 +- include/linux/migrate.h | 11 +++- include/linux/mmzone.h | 5 ++ include/linux/sched/numa_balancing.h | 5 +- include/linux/vm_event_item.h | 5 ++ include/trace/events/sched.h | 24 +++++++++ kernel/sched/fair.c | 63 +++++++++++++++-------- mm/filemap.c | 26 +++++++++- mm/huge_memory.c | 2 +- mm/memory.c | 5 +- mm/mempolicy.c | 6 ++- mm/migrate.c | 98 +++++++++++++++++++++++++----------- mm/mprotect.c | 8 --- mm/page_alloc.c | 3 ++ mm/vmscan.c | 23 ++++++++- mm/vmstat.c | 8 +++ 16 files changed, 227 insertions(+), 69 deletions(-) -- 1.8.3.1

3 years, 4 months

3
19
0 0

2025

2024

2023

2022

2021

Pmem