From: Huang Ying <ying.huang(a)intel.com>
ANBZ: #80
commit a818f5363a0eba04bcff986c64c919d3f44b8017 upstream
In auto NUMA balancing page table scanning, if pte_protnone() is true,
the PTE does not need to be changed because it is already in the target
state, so any further checks on the corresponding struct page are
unnecessary too.

So, if we check pte_protnone() first for each PTE, we can avoid the
unnecessary struct page access and reduce the cache footprint of NUMA
balancing page table scanning.
In a performance test with the pmbench memory-accessing benchmark
(80:20 read/write ratio, normal access address distribution) on a
2-socket Intel server with Optane DC Persistent Memory, perf profiling
shows that the autonuma page table scanning time is reduced from 1.23%
to 0.97% (that is, by 21%) with the patch.
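For illustration, the resulting check order in change_pte_range() looks
like this (a simplified excerpt; see the diff below for the actual
change):

	if (prot_numa) {
		struct page *page;

		/* Avoid TLB flush if possible: a PROT_NONE PTE is
		 * already in the target state.
		 */
		if (pte_protnone(oldpte))
			continue;

		/* struct page is only accessed after the cheap PTE
		 * check, reducing the scan's cache footprint.
		 */
		page = vm_normal_page(vma, addr, oldpte);
		if (!page || PageKsm(page))
			continue;
		...
	}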
Link: http://lkml.kernel.org/r/20191101075727.26683-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Acked-by: Mel Gorman <mgorman(a)suse.de>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Rik van Riel <riel(a)redhat.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Ingo Molnar <mingo(a)kernel.org>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Fengguang Wu <fengguang.wu(a)intel.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
---
mm/mprotect.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6525e96..01681e3 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -84,6 +84,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
int nid;
bool toptier;
+ /* Avoid TLB flush if possible */
+ if (pte_protnone(oldpte))
+ continue;
+
page = vm_normal_page(vma, addr, oldpte);
if (!page || PageKsm(page))
continue;
@@ -93,10 +97,6 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
page_mapcount(page) != 1)
continue;
- /* Avoid TLB flush if possible */
- if (pte_protnone(oldpte))
- continue;
-
/*
* Don't mess with PTEs if page is already on the node
* a single-threaded process is running on.
--
1.8.3.1
Hi all,
I have a patch, referenced below, that reduces TLB shootdowns during
page promotion. If it has not been backported yet, you may want to
consider backporting it to the Anolis kernel.
b99a342d4f11a5455d999b12f5fee42ab6acaf8c
Author: Huang Ying <ying.huang(a)intel.com>
AuthorDate: Thu Apr 29 22:57:41 2021 -0700
Commit: Linus Torvalds <torvalds(a)linux-foundation.org>
CommitDate: Fri Apr 30 11:20:39 2021 -0700
NUMA balancing: reduce TLB flush via delaying mapping on hint page fault
Best Regards,
Huang, Ying
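The core idea of that commit, as a rough sketch (my reconstruction, not
the upstream diff): do_numa_page() no longer makes the PROT_NONE PTE
present before attempting migration, so a successful promotion never
installs, and therefore never has to flush, a mapping for the old page:

	/* Read the PTE without modifying it */
	old_pte = ptep_get(vmf->pte);
	pte = pte_modify(old_pte, vma->vm_page_prot);
	page = vm_normal_page(vma, vmf->address, pte);
	...
	if (migrate_misplaced_page(page, vma, target_nid)) {
		/* Migration succeeded: the old mapping was never made
		 * present, so no TLB flush is needed for it.
		 */
		flags |= TNF_MIGRATED;
	} else {
		/* Migration failed or was skipped: only now restore
		 * the mapping via ptep_modify_prot_start()/commit().
		 */
		goto out_map;
	}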
The MySQL benchmark produces too many promotions because the DRAM node
is not large enough, but frequent promotion decreases performance.
Hence, we prefer remote access over frequent demote/promote traffic.
Huang Ying (3):
mm, migrate: use flags parameter for remove_migration_ptes()
memory tiering: measure whether demoted pages are hot
memory tiering: adjust promotion threshold based on hot pages demoted
include/linux/mmzone.h | 3 ++
include/linux/page-flags.h | 9 ++++++
include/linux/page_ext.h | 3 ++
include/linux/rmap.h | 8 ++++-
include/linux/sched/numa_balancing.h | 62 ++++++++++++++++++++++++++++++++++++
include/linux/sched/sysctl.h | 3 ++
include/trace/events/mmflags.h | 8 ++++-
kernel/sched/fair.c | 27 +++++++++++++---
kernel/sysctl.c | 16 ++++++++++
mm/huge_memory.c | 6 ++--
mm/mempolicy.c | 2 ++
mm/migrate.c | 60 ++++++++++++++++++++++++++++------
mm/vmstat.c | 1 +
13 files changed, 189 insertions(+), 19 deletions(-)
--
1.8.3.1
Currently, the MySQL test case shows that a large number of THPs are
migrated from the PMEM node to the toptier node, which brings in more
pgpromote_demoted events and migration failures. Because PMEM node
memory is marked as PROT_NONE, it is migrated on CPU access as soon as
it becomes hot, and it is unnecessary to migrate THPs to DRAM when DRAM
memory is not large enough, as that only causes more demotions and
promotions.

Hence, this patch forbids THP allocation from the PMEM node. The result
shows about a 3% improvement. The relevant statistics are as follows.
Before applying the patch:
mysql prepare:
pgpromote_demoted 908267
pgmigrate_fail_dst_node_fail 428223
pgmigrate_fail_numa_isolate_fail 460480
mysql run:
pgpromote_demoted 2901105
pgmigrate_fail_dst_node_fail 5653776
pgmigrate_fail_numa_isolate_fail 5686052
After applying the patch:
mysql prepare:
pgpromote_demoted 839297
pgmigrate_fail_dst_node_fail 36585
pgmigrate_fail_numa_isolate_fail 36585
mysql run:
pgpromote_demoted 913828
pgmigrate_fail_dst_node_fail 235863
pgmigrate_fail_numa_isolate_fail 235870
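For reference, all of the counters above are exported via /proc/vmstat.
A minimal user-space reader (the counter names are taken from this tree
and assumed unchanged):

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		static const char *keys[] = {
			"pgpromote_demoted",
			"pgmigrate_fail_dst_node_fail",
			"pgmigrate_fail_numa_isolate_fail",
		};
		char name[128];
		unsigned long long val;
		size_t i;
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f)
			return 1;
		while (fscanf(f, "%127s %llu", name, &val) == 2)
			for (i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
				if (!strcmp(name, keys[i]))
					printf("%s %llu\n", name, val);
		fclose(f);
		return 0;
	}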
Signed-off-by: zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com>
---
mm/page_alloc.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8cfce92..4fff3cd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -461,6 +461,17 @@ static __always_inline int get_pfnblock_migratetype(struct page *page, unsigned
return __get_pfnblock_flags_mask(page, pfn, PB_migrate_end, MIGRATETYPE_MASK);
}
+static inline bool allow_hugepage_allocation(int nid, unsigned int order)
+{
+ if (node_is_toptier(nid))
+ return true;
+
+ if (order != HPAGE_PMD_ORDER)
+ return true;
+
+ return false;
+}
+
/**
* set_pfnblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages
* @page: The page within the block of interest
@@ -3689,6 +3700,9 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
}
}
+ if (!allow_hugepage_allocation(zone_to_nid(zone), order))
+ continue;
+
if (no_fallback && nr_online_nodes > 1 &&
zone != ac->preferred_zoneref->zone) {
int local_nid;
--
1.8.3.1
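For clarity, here is the new helper again with the gating intent
spelled out (comments mine; behavior identical to the diff above). Note
that the check sits in the generic zone iteration, so any PMD-order
allocation, not only THP faults, is steered away from non-toptier
nodes:

	static inline bool allow_hugepage_allocation(int nid, unsigned int order)
	{
		/* DRAM (toptier) nodes may serve any order, including THP */
		if (node_is_toptier(nid))
			return true;

		/* Orders other than PMD size are unaffected on PMEM nodes */
		if (order != HPAGE_PMD_ORDER)
			return true;

		/* Forbid PMD-sized (THP) allocations from PMEM nodes */
		return false;
	}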
Currently, PGPROMOTE_SUCCESS only counts normal pages migrated from the
PMEM node to the toptier node, but a huge page can trigger the same
operation when THP NUMA faults occur. Hence the count misses the pages
migrated in migrate_misplaced_transhuge_page().
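Since PGPROMOTE_SUCCESS is counted in base pages, one promoted THP adds
HPAGE_PMD_NR of them (512 with 4K base pages on x86-64). A trivial
user-space conversion of a counter delta to MiB (placeholder value,
read the real delta from /proc/vmstat):

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		unsigned long long delta = 524288;	/* placeholder: 1024 THPs */
		long page_size = sysconf(_SC_PAGESIZE);

		printf("%llu base pages = %llu MiB\n", delta,
		       delta * page_size >> 20);
		return 0;
	}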
Signed-off-by: zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com>
---
mm/migrate.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index e9adaa7..9d6cac9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2138,7 +2138,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
if (nr_succeeded) {
count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded);
if (!node_is_toptier(page_to_nid(page)) && node_is_toptier(node))
- mod_node_page_state(NODE_DATA(node), PGPROMOTE_SUCCESS,
+ mod_node_page_state(pgdat, PGPROMOTE_SUCCESS,
nr_succeeded);
}
BUG_ON(!list_empty(&migratepages));
@@ -2264,6 +2264,9 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
+ if (!node_is_toptier(page_to_nid(page)) && node_is_toptier(node))
+ mod_node_page_state(pgdat, PGPROMOTE_SUCCESS,
+ HPAGE_PMD_NR);
mod_node_page_state(page_pgdat(page),
NR_ISOLATED_ANON + page_lru,
--
1.8.3.1