[PATCH] memory tiered: Do not use thp in pmem node



zhongjiang-ali

9 Feb 2022

6:29 p.m.

Currently, a MySQL test case shows that a large number of THPs are
migrated from the PMEM node to the top-tier node, which leads to more
pgpromote_demoted events and more migration failures. Because PMEM node
memory is marked prot_none, a page is migrated on CPU access as soon as
it becomes hot, and it is unnecessary to migrate THPs to DRAM when DRAM
memory is insufficient; doing so only causes more demotion and promotion.

Hence, this patch forbids THP allocation on the PMEM node. The results
show about a 3% improvement. The relevant statistics are as follows.

before applying patch:
mysql prepare:
pgpromote_demoted 908267
pgmigrate_fail_dst_node_fail 428223
pgmigrate_fail_numa_isolate_fail 460480

mysql run:
pgpromote_demoted 2901105
pgmigrate_fail_dst_node_fail 5653776
pgmigrate_fail_numa_isolate_fail 5686052

after applying patch:
mysql prepare:
pgpromote_demoted 839297
pgmigrate_fail_dst_node_fail 36585
pgmigrate_fail_numa_isolate_fail 36585

mysql run:
pgpromote_demoted 913828
pgmigrate_fail_dst_node_fail 235863
pgmigrate_fail_numa_isolate_fail 235870

Signed-off-by: zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com>
---
 mm/page_alloc.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8cfce92..4fff3cd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -461,6 +461,17 @@ static __always_inline int get_pfnblock_migratetype(struct page *page, unsigned
 	return __get_pfnblock_flags_mask(page, pfn, PB_migrate_end, MIGRATETYPE_MASK);
 }
 
+static inline bool allow_hugepage_allocation(int nid, unsigned int order)
+{
+	if (node_is_toptier(nid))
+		return true;
+
+	if (order != HPAGE_PMD_ORDER)
+		return true;
+
+	return false;
+}
+
 /**
  * set_pfnblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages
  * @page: The page within the block of interest
@@ -3689,6 +3700,9 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 			}
 		}
 
+		if (!allow_hugepage_allocation(zone_to_nid(zone), order))
+			continue;
+
 		if (no_fallback && nr_online_nodes > 1 &&
 		    zone != ac->preferred_zoneref->zone) {
 			int local_nid;
-- 
1.8.3.1
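To put the counters from the changelog in proportion, the relative reduction can be worked out with a few lines of C. This is only a sketch: the helper names pct_reduction(), report() and report_mysql_run() are made up for illustration, and the numbers are copied from the "mysql run" statistics in the patch description.

```c
#include <assert.h>
#include <stdio.h>

/* Percentage reduction from `before` to `after` (positive means fewer events). */
double pct_reduction(unsigned long before, unsigned long after)
{
	assert(before > 0);
	return 100.0 * ((double)before - (double)after) / (double)before;
}

/* Print one counter's before/after values together with its relative reduction. */
void report(const char *name, unsigned long before, unsigned long after)
{
	printf("%-34s %8lu -> %8lu  (%.1f%% fewer)\n",
	       name, before, after, pct_reduction(before, after));
}

/* Counters from the "mysql run" phase quoted in the patch description. */
void report_mysql_run(void)
{
	report("pgpromote_demoted", 2901105UL, 913828UL);
	report("pgmigrate_fail_dst_node_fail", 5653776UL, 235863UL);
	report("pgmigrate_fail_numa_isolate_fail", 5686052UL, 235870UL);
}
```

Calling report_mysql_run() from a main() prints one line per counter. By this arithmetic, promotions of previously demoted pages drop by roughly two thirds during the run phase, and both migration-failure counters drop by more than 95%.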


Huang, Ying

10 Feb

8:58 a.m.

zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes:

...

It appears that this will disable node reclaiming for THP allocation. So more pages will be allocated in PMEM node because of allocation fallback? Best Regards, Huang, Ying

...

if (no_fallback && nr_online_nodes > 1 && zone != ac->preferred_zoneref->zone) { int local_nid;

zhong jiang

11:19 a.m.

On 2022/2/10 8:58 AM, Huang, Ying wrote:

...

zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes:

It appears that this will disable node reclaiming for THP allocation. So more pages will be allocated in PMEM node because of allocation fallback?

We only allow normal pages to be allocated on the PMEM node, so a THP allocation there will fall back to normal pages. The MySQL test case shows that too many THPs are promoted to the top tier; because top-tier memory is insufficient, this drives up the pgpromote_demoted and dst_node_full counters. In that case we prefer remote access to migrating THPs back and forth between the PMEM and top-tier nodes, which degrades performance.

...

Best Regards,
Huang, Ying

> 		if (no_fallback && nr_online_nodes > 1 &&
> 		    zone != ac->preferred_zoneref->zone) {
> 			int local_nid;

Huang, Ying

1:21 p.m.

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

...

On 2022/2/10 8:58 AM, Huang, Ying wrote:

zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes:

It appears that this will disable node reclaiming for THP allocation. So more pages will be allocated in PMEM node because of allocation fallback?

Maybe we are looking at different source code :-). In the latest upstream code, zone_allows_reclaim() only controls node reclaiming (or zone reclaim). Which repo should I look at?

Memory tiering is challenging for THP. There are not many free pages (a little more than the high watermark) on the DRAM node, so it's hard to allocate THPs there except when workloads start up.

Just curious: does disabling THP help your workload?

Best Regards,
Huang, Ying

...

> Best Regards,
> Huang, Ying
>
>> 		if (no_fallback && nr_online_nodes > 1 &&
>> 		    zone != ac->preferred_zoneref->zone) {
>> 			int local_nid;

Baolin Wang

2:24 p.m.

...

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

On 2022/2/10 8:58 AM, Huang, Ying wrote:

zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes:

It appears that this will disable node reclaiming for THP allocation. So more pages will be allocated in PMEM node because of allocation fallback?

Maybe we are looking at different source code :-). In latest upstream code, zone_allows_reclaim() is to control node reclaiming (or zone reclaim) only. Which repo should I look?

I think you misunderstood the change: it is in get_page_from_freelist(), not in zone_allows_reclaim(). From my understanding, Zhongjiang is trying to disable the memory allocation fallback for THP, right? But won't that cause more demotion if we cannot fall back to the PMEM node?
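The effect of the added check on the zonelist walk can be modelled in plain userspace C. This is a rough sketch, not kernel code: struct model_zone, MODEL_HPAGE_PMD_ORDER (assumed to be 9, i.e. a 2 MiB THP with 4 KiB base pages) and both model_* functions are hypothetical stand-ins, and the free-page comparison only approximates the real watermark checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_HPAGE_PMD_ORDER 9	/* assumption: 2 MiB THP, 4 KiB base pages */

/* Hypothetical stand-in for a zone; the kernel's struct zone is far richer. */
struct model_zone {
	int nid;			/* NUMA node id */
	bool toptier;			/* true for DRAM, false for PMEM in this model */
	unsigned long free_pages;	/* free base pages left in the zone */
};

/* Same decision as the patch's helper: only PMD-order requests on a
 * lower-tier node are refused. */
bool model_allow_hugepage_allocation(const struct model_zone *z, unsigned int order)
{
	if (z->toptier)
		return true;
	return order != MODEL_HPAGE_PMD_ORDER;
}

/* Walk the zones in fallback order and return the index of the first zone
 * that can satisfy the request, or -1 if none can. */
int model_get_page_from_freelist(struct model_zone *zones, size_t nr, unsigned int order)
{
	unsigned long need = 1UL << order;

	assert(order < 8 * sizeof(unsigned long));
	for (size_t i = 0; i < nr; i++) {
		if (!model_allow_hugepage_allocation(&zones[i], order))
			continue;	/* same effect as the "continue" the patch adds */
		if (zones[i].free_pages < need)
			continue;	/* crude stand-in for the watermark checks */
		zones[i].free_pages -= need;
		return (int)i;
	}
	return -1;
}
```

In this model, a PMD-order request with a nearly full DRAM zone returns -1 instead of landing on PMEM, so the caller has to retry with order 0, which the PMEM zone still accepts.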

...

Memory tiering is challenging for THP. There's no many free pages (a little more than high watermark) in DRAM node, so it's hard to allocate THP there except workloads start up. Just curiously, whether disabling THP helps your workload?

In our previous testing, enabling THP improved performance, and our ECS environment enables THP by default.

zhong jiang

3 p.m.

On 2022/2/10 2:24 PM, Baolin Wang wrote:

...

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

On 2022/2/10 8:58 AM, Huang, Ying wrote:

zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes:

> [snip]
> +		if (!allow_hugepage_allocation(zone_to_nid(zone), order))
> +			continue;
> +

It appears that this will disable node reclaiming for THP allocation. So more pages will be allocated in PMEM node because of allocation fallback?

Maybe we are looking at different source code :-). In latest upstream code, zone_allows_reclaim() is to control node reclaiming (or zone reclaim) only. Which repo should I look?

No, I mean the patch should prevent THP allocation on the PMEM node.

...

From our previous testing, enabling THP can benefit the performance. And our ECS environment enabled the THP by default.

Huang, Ying

3:03 p.m.

Baolin Wang <baolin.wang(a)linux.alibaba.com> writes:

...

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

On 2022/2/10 8:58 AM, Huang, Ying wrote:

zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes:

> [snip]
> +		if (!allow_hugepage_allocation(zone_to_nid(zone), order))
> +			continue;
> +

It appears that this will disable node reclaiming for THP allocation. So more pages will be allocated in PMEM node because of allocation fallback?

Maybe we are looking at different source code :-). In latest upstream code, zone_allows_reclaim() is to control node reclaiming (or zone reclaim) only. Which repo should I look?

I think you misunderstood the change, the change is in get_page_from_freelist(), not in zone_allows_reclaim().

OK, I see. I think the `diff` program fooled me:

@@ -3689,6 +3700,9 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 			}
 		}
 
+		if (!allow_hugepage_allocation(zone_to_nid(zone), order))
+			continue;
+
 		if (no_fallback && nr_online_nodes > 1 &&
 		    zone != ac->preferred_zoneref->zone) {
 			int local_nid;

...

From my understanding, Zhongjiang is trying to disable the memory allocation fallback for THP, right?

I think so too now.

...

But that will cause more demotion if we can not fallback to PMEM node?

If a THP fails to be allocated, normal pages will be allocated instead. And it appears that if a THP fails to be demoted (with this patch, it will always fail), the THP will be split too. So we may have far fewer THPs in the system with the patch. Zhongjiang, can you check it?

Another choice is to split the THP if migration fails. Whether to prefer THPs or local/hot normal pages is always a question.

...

From our previous testing, enabling THP can benefit the performance. And our ECS environment enabled the THP by default.

Thanks! Best Regards, Huang, Ying

zhong jiang

4:58 p.m.

On 2022/2/10 3:03 PM, Huang, Ying wrote:

...

Baolin Wang <baolin.wang(a)linux.alibaba.com> writes:

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

On 2022/2/10 8:58 AM, Huang, Ying wrote:

> zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes:
>
>> [snip]
>> +		if (!allow_hugepage_allocation(zone_to_nid(zone), order))
>> +			continue;
>> +
>
> It appears that this will disable node reclaiming for THP allocation.
> So more pages will be allocated in PMEM node because of allocation
> fallback?

We only allow normal pages to be allocated on the PMEM node, so a THP allocation there will fall back to normal pages. The MySQL test case shows that too many THPs are promoted to the top tier; because top-tier memory is insufficient, this drives up the pgpromote_demoted and dst_node_full counters. In that case we prefer remote access to migrating THPs back and forth between the PMEM and top-tier nodes, which degrades performance.

Maybe we are looking at different source code :-). In latest upstream code, zone_allows_reclaim() is to control node reclaiming (or zone reclaim) only. Which repo should I look?

I think you misunderstood the change, the change is in get_page_from_freelist(), not in zone_allows_reclaim().

From my understanding, Zhongjiang is trying to disable the memory allocation fallback for THP, right?

I think so too now.

But that will cause more demotion if we can not fallback to PMEM node?

The patch aims to prevent THP allocation on the PMEM node. I have checked that no THP is created on the PMEM node, which is the intended result. The DRAM node still has plenty of THPs, and pages there can still be collapsed.

...

Another choice is to split THP if migration fails. That's always a question to prefer THP or local/hot normal pages.

Test performance decreases when a large number of THPs sit on the PMEM node: promotion fails more often than it does for normal pages, because DRAM does not have enough free memory, which in turn wakes up kswapd. The result is too many promotion failures and a higher pgpromote_demoted count. And maybe the test case does not really need the whole THP, only a subpage of it.

...

From our previous testing, enabling THP can benefit the performance. And our ECS environment enabled the THP by default.

Thanks! Best Regards, Huang, Ying

Huang, Ying

11 Feb

7:58 a.m.

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

...

On 2022/2/10 3:03 PM, Huang, Ying wrote:

Baolin Wang <baolin.wang(a)linux.alibaba.com> writes:

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

> On 2022/2/10 8:58 AM, Huang, Ying wrote:
>> zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes:
>>
>>> [snip]
>>> +		if (!allow_hugepage_allocation(zone_to_nid(zone), order))
>>> +			continue;
>>> +
>> It appears that this will disable node reclaiming for THP allocation.
>> So more pages will be allocated in PMEM node because of allocation
>> fallback?
> We only allow normal pages to be allocated on the PMEM node, so a THP
> allocation there will fall back to normal pages.
>
> The MySQL test case shows that too many THPs are promoted to the top
> tier; because top-tier memory is insufficient, this drives up the
> pgpromote_demoted and dst_node_full counters. In that case we prefer
> remote access to migrating THPs back and forth between the PMEM and
> top-tier nodes, which degrades performance.

Maybe we are looking at different source code :-). In the latest upstream code, zone_allows_reclaim() only controls node reclaiming (or zone reclaim). Which repo should I look at?

I think you misunderstood the change, the change is in get_page_from_freelist(), not in zone_allows_reclaim().

From my understanding, Zhongjiang is trying to disable the memory allocation fallback for THP, right?

I think so too now.

But that will cause more demotion if we can not fallback to PMEM node?

The patch aims to prevent THP allocation on the PMEM node. I have checked that no THP is created on the PMEM node, which is the intended result. The DRAM node still has plenty of THPs, and pages there can still be collapsed.

Another choice is to split THP if migration fails. That's always a question to prefer THP or local/hot normal pages.

Test performance decreases when a large number of THPs sit on the PMEM node: promotion fails more often than it does for normal pages, because DRAM does not have enough free memory, which in turn wakes up kswapd. The result is too many promotion failures and a higher pgpromote_demoted count. And maybe the test case does not really need the whole THP, only a subpage of it.

Yes. So I suggest trying to fall back to splitting the THP when THP allocation fails on DRAM. Just disable the nosplit logic in migrate_pages().

Best Regards,
Huang, Ying

...

>>> Memory tiering is challenging for THP. There's no many free pages
>>> (a little more than high watermark) in DRAM node, so it's hard to
>>> allocate THP there except workloads start up.
>>> Just curiously, whether disabling THP helps your workload?
>> From our previous testing, enabling THP can benefit the
>> performance. And our ECS environment enabled the THP by default.
> Thanks!
>
> Best Regards,
> Huang, Ying

zhong jiang

3:15 p.m.

On 2022/2/11 7:58 AM, Huang, Ying wrote:

...

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

On 2022/2/10 3:03 PM, Huang, Ying wrote:

Baolin Wang <baolin.wang(a)linux.alibaba.com> writes:

> zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:
>
>> On 2022/2/10 8:58 AM, Huang, Ying wrote:
>>> zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes:
>>>
>>>> [snip]
>>>> +		if (!allow_hugepage_allocation(zone_to_nid(zone), order))
>>>> +			continue;
>>>> +
>>> It appears that this will disable node reclaiming for THP allocation.
>>> So more pages will be allocated in PMEM node because of allocation
>>> fallback?
>> We only allow normal pages to be allocated on the PMEM node, so a THP
>> allocation there will fall back to normal pages.
>>
>> The MySQL test case shows that too many THPs are promoted to the top
>> tier; because top-tier memory is insufficient, this drives up the
>> pgpromote_demoted and dst_node_full counters. In that case we prefer
>> remote access to migrating THPs back and forth between the PMEM and
>> top-tier nodes, which degrades performance.
> Maybe we are looking at different source code :-). In the latest
> upstream code, zone_allows_reclaim() only controls node reclaiming
> (or zone reclaim). Which repo should I look at?

I think you misunderstood the change: it is in get_page_from_freelist(), not in zone_allows_reclaim().

From my understanding, Zhongjiang is trying to disable the memory allocation fallback for THP, right?

I think so too now.

But that will cause more demotion if we can not fallback to PMEM node?

The patch aims to prevent thp allocation in pmem node, I has checked that there are not an thp is created in pmem node which is intended. Dram node still has a lot of thp and can be collapsed.

Another choice is to split THP if migration fails. That's always a question to prefer THP or local/hot normal pages.

Yes. So I suggest to try to fallback to split THP upon THP allocation failure on DRAM. Just disable nosplit logic in migrate_pages().

OK, will do and resend. Thanks.

...

Best Regards,
Huang, Ying

>>>> Memory tiering is challenging for THP. There's no many free pages
>>>> (a little more than high watermark) in DRAM node, so it's hard to
>>>> allocate THP there except workloads start up.
>>>> Just curiously, whether disabling THP helps your workload?
>>> From our previous testing, enabling THP can benefit the
>>> performance. And our ECS environment enabled the THP by default.
>> Thanks!
>>
>> Best Regards,
>> Huang, Ying

zhong jiang

5:19 p.m.

On 2022/2/11 7:58 AM, Huang, Ying wrote:

...

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

On 2022/2/10 3:03 PM, Huang, Ying wrote:

Baolin Wang <baolin.wang(a)linux.alibaba.com> writes:

From my understanding, Zhongjiang is trying to disable the memory allocation fallback for THP, right?

I think so too now.

But that will cause more demotion if we can not fallback to PMEM node?

The patch aims to prevent thp allocation in pmem node, I has checked that there are not an thp is created in pmem node which is intended. Dram node still has a lot of thp and can be collapsed.

Another choice is to split THP if migration fails. That's always a question to prefer THP or local/hot normal pages.

Yes. So I suggest to try to fallback to split THP upon THP allocation failure on DRAM. Just disable nosplit logic in migrate_pages().

Upstream already does what you describe: it falls back to splitting the THP into normal pages when promotion fails to allocate a THP on DRAM. But in practice that is too late, because a large number of THPs have already been created on the PMEM node. That is the root cause of the performance decrease. The solution I propose, disabling THP creation on the PMEM node, may be a little radical. It has the side effect that NUMA balancing will scan more pages, and THP allocations will fall back more frequently when they target the PMEM node.

...

Huang, Ying

6:52 p.m.

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

...

On 2022/2/11 7:58 AM, Huang, Ying wrote:

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

On 2022/2/10 3:03 PM, Huang, Ying wrote:

Baolin Wang <baolin.wang(a)linux.alibaba.com> writes:

>> zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:
>>
>>> On 2022/2/10 8:58 AM, Huang, Ying wrote:
>>>> zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes:
>>>>
>>>>> [snip]
>>>>> +		if (!allow_hugepage_allocation(zone_to_nid(zone), order))
>>>>> +			continue;
>>>>> +
>>>> It appears that this will disable node reclaiming for THP allocation.
>>>> So more pages will be allocated in PMEM node because of allocation
>>>> fallback?
>>> We only allow normal pages to be allocated on the PMEM node, so a THP
>>> allocation there will fall back to normal pages.
>>>
>>> The MySQL test case shows that too many THPs are promoted to the top
>>> tier; because top-tier memory is insufficient, this drives up the
>>> pgpromote_demoted and dst_node_full counters. In that case we prefer
>>> remote access to migrating THPs back and forth between the PMEM and
>>> top-tier nodes, which degrades performance.
>> Maybe we are looking at different source code :-). In the latest
>> upstream code, zone_allows_reclaim() only controls node reclaiming
>> (or zone reclaim). Which repo should I look at?
> I think you misunderstood the change: it is in
> get_page_from_freelist(), not in zone_allows_reclaim().

OK, I see. I think the `diff` program fooled me:

@@ -3689,6 +3700,9 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 			}
 		}
 
+		if (!allow_hugepage_allocation(zone_to_nid(zone), order))
+			continue;
+
 		if (no_fallback && nr_online_nodes > 1 &&
 		    zone != ac->preferred_zoneref->zone) {
 			int local_nid;

> From my understanding, Zhongjiang is trying to disable the memory
> allocation fallback for THP, right?

I think so too now.

> But that will cause more demotion if we can not fallback to PMEM node?

If a THP fails to be allocated, normal pages will be allocated instead. And it appears that if a THP fails to be demoted (with this patch, it will always fail), the THP will be split too. So we may have far fewer THPs in the system with the patch. Zhongjiang, can you check it?

The patch aims to prevent THP allocation on the PMEM node. I have checked that no THP is created on the PMEM node, which is intended. The DRAM node still has a lot of THPs, and pages there can still be collapsed.
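For reference, the check added by the patch can be modeled in user space roughly as follows. This is a minimal sketch, not kernel code: `node_is_toptier_model()` is a hypothetical stand-in for the kernel's `node_is_toptier()` and simply assumes that node 0 is DRAM and every other node is PMEM, and `HPAGE_PMD_ORDER` is hard-coded to 9 (2 MB THP on 4 KB base pages).

```c
#include <stdbool.h>

/* Assumption: 2 MB THP on 4 KB base pages, i.e. order 9. */
#define HPAGE_PMD_ORDER 9

/*
 * Hypothetical stand-in for the kernel's node_is_toptier():
 * node 0 is DRAM (top tier), every other node is PMEM.
 */
bool node_is_toptier_model(int nid)
{
	return nid == 0;
}

/* Same decision as the helper in the patch: only PMD-order
 * allocations on a non-toptier node are refused. */
bool allow_hugepage_allocation(int nid, unsigned int order)
{
	if (node_is_toptier_model(nid))
		return true;

	return order != HPAGE_PMD_ORDER;
}
```

With this model, an order-9 request is refused only for the PMEM node, so the allocator moves on to the next zone and the caller eventually falls back to normal pages.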

Another choice is to split THP if migration fails. That's always a question to prefer THP or local/hot normal pages.

Yes. So I suggest falling back to splitting the THP upon THP allocation failure on DRAM. Just disable the nosplit logic in migrate_pages().

Upstream already does what you describe. It falls back to splitting the THP into normal pages when promotion fails to allocate a THP on DRAM.

Not for NUMA balancing, because migrate_pages() sets bool nosplit = (reason == MR_NUMA_MISPLACED);

Best Regards,
Huang, Ying
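The effect of that flag can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the enum below is a made-up two-value subset named after the kernel's migration reasons, and `split_thp_on_alloc_failure()` is a hypothetical helper, not a function in mm/migrate.c.

```c
#include <stdbool.h>

/* Illustrative subset of migration reasons (names modeled on the
 * kernel's enum migrate_reason; this enum itself is hypothetical). */
enum migrate_reason_model {
	MODEL_MR_COMPACTION,
	MODEL_MR_NUMA_MISPLACED,
};

/*
 * Simplified model of the decision discussed above: when the
 * destination THP cannot be allocated, is the source THP split
 * so that its normal pages can be migrated instead?
 */
bool split_thp_on_alloc_failure(bool is_thp, enum migrate_reason_model reason)
{
	bool nosplit = (reason == MODEL_MR_NUMA_MISPLACED);

	return is_thp && !nosplit;
}
```

In this model a THP migrated for NUMA balancing (which is how promotion is triggered here) is never split on allocation failure, which is why "disable nosplit" is proposed as the alternative.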

...

But in practice it is too late, because a large number of THPs have already been produced on the PMEM node. That is the root cause of the performance decrease. The solution I propose, disabling THP production on the PMEM node, may be a little radical. It has a side effect: NUMA balancing will scan more pages, and THP allocation on the PMEM node will fall back more frequently.

> Best Regards,
> Huang, Ying
>
>>>>> Memory tiering is challenging for THP. There's no many free pages
>>>>> (a little more than high watermark) in DRAM node, so it's hard to allocate
>>>>> THP there except workloads start up.
>>>>> Just curiously, whether disabling THP helps your workload?
>>>> From our previous testing, enabling THP can benefit the
>>>> performance. And our ECS environment enabled the THP by default.
>>> Thanks!
>>>
>>> Best Regards,
>>> Huang, Ying

zhong jiang

14 Feb 2022

12:51 p.m.

On 2022/2/11 6:52 PM, Huang, Ying wrote:

...

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

On 2022/2/11 7:58 H, Huang, Ying wrote:

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

On 2022/2/10 3:03 H, Huang, Ying wrote: > Baolin Wang <baolin.wang(a)linux.alibaba.com> writes: > >>> zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes: >>> >>>> On 2022/2/10 8:58 H, Huang, Ying wrote: >>>>> zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes: >>>>> >>>>>> Currently, Mysql testcase show that a large number of thp are migrated >>>>>> from pmem node to toptier node, it will bring in more pgpromote_demoted >>>>>> and migrated failiure. because pmem node memory is marked as prot_none, >>>>>> it will be migrated by cpu access as soon as possible when it is hot, >>>>>> and it is unnesscessary to migrate thp to dram when dram memory is not >>>>>> enough, which will bring in more demoted and promoted. >>>>>> >>>>>> Hence, the patch forbid the thp to produce in pmem node. the result show >>>>>> about 3% improvements. the relative statistics is as follows. >>>>>> >>>>>> before appling patch: >>>>>> mysql prepare: >>>>>> pgpromote_demoted 908267 >>>>>> pgmigrate_fail_dst_node_fail 428223 >>>>>> pgmigrate_fail_numa_isolate_fail 460480 >>>>>> >>>>>> mysql run: >>>>>> pgpromote_demoted 2901105 >>>>>> pgmigrate_fail_dst_node_fail 5653776 >>>>>> pgmigrate_fail_numa_isolate_fail 5686052 >>>>>> >>>>>> after appling patch: >>>>>> mysql prepare: >>>>>> pgpromote_demoted 839297 >>>>>> pgmigrate_fail_dst_node_fail 36585 >>>>>> pgmigrate_fail_numa_isolate_fail 36585 >>>>>> >>>>>> mysql run: >>>>>> pgpromote_demoted 913828 >>>>>> pgmigrate_fail_dst_node_fail 235863 >>>>>> pgmigrate_fail_numa_isolate_fail 235870 >>>>>> >>>>>> Signed-off-by: zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> >>>>>> --- >>>>>> mm/page_alloc.c | 14 ++++++++++++++ >>>>>> 1 file changed, 14 insertions(+) >>>>>> >>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>>>> index 8cfce92..4fff3cd 100644 >>>>>> --- a/mm/page_alloc.c >>>>>> +++ b/mm/page_alloc.c >>>>>> @@ -461,6 +461,17 @@ static __always_inline int get_pfnblock_migratetype(struct page *page, unsigned >>>>>> return 
__get_pfnblock_flags_mask(page, pfn, PB_migrate_end, MIGRATETYPE_MASK); >>>>>> } >>>>>> +static inline bool allow_hugepage_allocation(int nid, unsigned >>>>>> int order) >>>>>> +{ >>>>>> + if (node_is_toptier(nid)) >>>>>> + return true; >>>>>> + >>>>>> + if (order != HPAGE_PMD_ORDER) >>>>>> + return true; >>>>>> + >>>>>> + return false; >>>>>> +} >>>>>> + >>>>>> /** >>>>>> * set_pfnblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages >>>>>> * @page: The page within the block of interest >>>>>> @@ -3689,6 +3700,9 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) >>>>>> } >>>>>> } >>>>>> + if (!allow_hugepage_allocation(zone_to_nid(zone), >>>>>> order)) >>>>>> + continue; >>>>>> + >>>>> It appears that this will disable node reclaiming for THP allocation. >>>>> So more pages will be allocated in PMEM node because of allocation >>>>> fallback? >>>> We just allow normal pages allocate in pmem node, hence, thp >>>> allocation will fallback to produce more normal pages. >>>> >>>> Mysql testcase show that too many thps is promoted to toptier , >>>> due to toptier memory is not enough, it will bring in >>>> >>>> more pgpromote_deomted and dst_node_full counter increasing. In >>>> that case, we prefer to remote access rather >>>> >>>> than migrate thp between pmem and toptier node frequently, which >>>> will make performance decrease. >>> Maybe we are looking at different source code :-). In latest >>> upstream >>> code, zone_allows_reclaim() is to control node reclaiming (or zone >>> reclaim) only. Which repo should I look? >> I think you misunderstood the change, the change is in >> get_page_from_freelist(), not in zone_allows_reclaim(). > OK, I see. 
I think the `diff` program fools me:
>
> @@ -3689,6 +3700,9 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> }
> }
> + if (!allow_hugepage_allocation(zone_to_nid(zone), order))
> + continue;
> +
> if (no_fallback && nr_online_nodes > 1 &&
> zone != ac->preferred_zoneref->zone) {
> int local_nid;
>
>> From my understanding, Zhongjiang is trying to disable the memory
>> allocation fallback for THP, right?
> I think so too now.
>
>> But that will cause more demotion if we can not fallback to PMEM node?
> If THP fails to be allocated, normal pages will be allocated instead.
> And it appears that if THP is failed to be demoted (with this patch, it
> will always fail), THP will be split too. So we may have much less THP
> in system with the patch. Zhongjiang, Can you check it?

The patch aims to prevent THP allocation on the PMEM node. I have checked that no THP is created on the PMEM node, which is intended. The DRAM node still has a lot of THPs, and pages there can still be collapsed.

> Another choice is to split THP if migration fails. That's always a
> question to prefer THP or local/hot normal pages.

Test performance decreases when there is a large number of THPs on the PMEM node: promotion fails more frequently than it does for normal pages, because DRAM does not have enough free memory and the allocation ends up waking kswapd. The result is too many promotion failures and a high pgpromote_demoted count. Also, the test case may not really need the whole THP, only a subpage of it.

Yes. So I suggest falling back to splitting the THP upon THP allocation failure on DRAM. Just disable the nosplit logic in migrate_pages().

Upstream already does what you describe. It falls back to splitting the THP into normal pages when promotion fails to allocate a THP on DRAM.

Not for NUMA balancing, because migrate_pages() sets bool nosplit = (reason == MR_NUMA_MISPLACED);

Maybe you have misunderstood the problem the patch aims to solve. The current issue is that there are too many THPs on the PMEM node while DRAM is insufficient. In the MySQL benchmark, THP promotion is likely to fail because memory cannot be allocated on DRAM. In that situation promoting normal pages is more suitable, but by then it is too late to fall back to splitting the THP. Also, disabling the nosplit logic in migrate_pages() will bring in more remote accesses. If I am missing something, please correct me. Thanks.

...

Best Regards, Huang, Ying > but in practice, it is too late because the > pmem node has a large lot of thp be produced. > > It is the root case that bring in the performance decrease. > > > The solution I propose maybe a little radical to disable the thp > production in pmem node. It has some side effect > > that numa balancing will scan more pages and fallback more frequenctly > when allocate thp in pmem node. > >> Best Regards, >> Huang, Ying >> >>>>>> Memory tiering is challenging for THP. There's no many free pages >>>>>> (a >>>>>> little more than high watermark) in DRAM node, so it's hard to allocate >>>>>> THP there except workloads start up. >>>>>> Just curiously, whether disabling THP helps your workload? >>>>> From our previous testing, enabling THP can benefit the >>>>> performance. And our ECS environment enabled the THP by default. >>>> Thanks! >>>> >>>> Best Regards, >>>> Huang, Ying

Huang, Ying

1:35 p.m.

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

...

On 2022/2/11 6:52 PM, Huang, Ying wrote:

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

On 2022/2/11 7:58 H, Huang, Ying wrote:

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes: > On 2022/2/10 3:03 H, Huang, Ying wrote: >> Baolin Wang <baolin.wang(a)linux.alibaba.com> writes: >> >>>> zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes: >>>> >>>>> On 2022/2/10 8:58 H, Huang, Ying wrote: >>>>>> zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes: >>>>>> >>>>>>> Currently, Mysql testcase show that a large number of thp are migrated >>>>>>> from pmem node to toptier node, it will bring in more pgpromote_demoted >>>>>>> and migrated failiure. because pmem node memory is marked as prot_none, >>>>>>> it will be migrated by cpu access as soon as possible when it is hot, >>>>>>> and it is unnesscessary to migrate thp to dram when dram memory is not >>>>>>> enough, which will bring in more demoted and promoted. >>>>>>> >>>>>>> Hence, the patch forbid the thp to produce in pmem node. the result show >>>>>>> about 3% improvements. the relative statistics is as follows. >>>>>>> >>>>>>> before appling patch: >>>>>>> mysql prepare: >>>>>>> pgpromote_demoted 908267 >>>>>>> pgmigrate_fail_dst_node_fail 428223 >>>>>>> pgmigrate_fail_numa_isolate_fail 460480 >>>>>>> >>>>>>> mysql run: >>>>>>> pgpromote_demoted 2901105 >>>>>>> pgmigrate_fail_dst_node_fail 5653776 >>>>>>> pgmigrate_fail_numa_isolate_fail 5686052 >>>>>>> >>>>>>> after appling patch: >>>>>>> mysql prepare: >>>>>>> pgpromote_demoted 839297 >>>>>>> pgmigrate_fail_dst_node_fail 36585 >>>>>>> pgmigrate_fail_numa_isolate_fail 36585 >>>>>>> >>>>>>> mysql run: >>>>>>> pgpromote_demoted 913828 >>>>>>> pgmigrate_fail_dst_node_fail 235863 >>>>>>> pgmigrate_fail_numa_isolate_fail 235870 >>>>>>> >>>>>>> Signed-off-by: zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> >>>>>>> --- >>>>>>> mm/page_alloc.c | 14 ++++++++++++++ >>>>>>> 1 file changed, 14 insertions(+) >>>>>>> >>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>>>>> index 8cfce92..4fff3cd 100644 >>>>>>> --- a/mm/page_alloc.c >>>>>>> +++ b/mm/page_alloc.c >>>>>>> @@ -461,6 
+461,17 @@ static __always_inline int get_pfnblock_migratetype(struct page *page, unsigned >>>>>>> return __get_pfnblock_flags_mask(page, pfn, PB_migrate_end, MIGRATETYPE_MASK); >>>>>>> } >>>>>>> +static inline bool allow_hugepage_allocation(int nid, unsigned >>>>>>> int order) >>>>>>> +{ >>>>>>> + if (node_is_toptier(nid)) >>>>>>> + return true; >>>>>>> + >>>>>>> + if (order != HPAGE_PMD_ORDER) >>>>>>> + return true; >>>>>>> + >>>>>>> + return false; >>>>>>> +} >>>>>>> + >>>>>>> /** >>>>>>> * set_pfnblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages >>>>>>> * @page: The page within the block of interest >>>>>>> @@ -3689,6 +3700,9 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) >>>>>>> } >>>>>>> } >>>>>>> + if (!allow_hugepage_allocation(zone_to_nid(zone), >>>>>>> order)) >>>>>>> + continue; >>>>>>> + >>>>>> It appears that this will disable node reclaiming for THP allocation. >>>>>> So more pages will be allocated in PMEM node because of allocation >>>>>> fallback? >>>>> We just allow normal pages allocate in pmem node, hence, thp >>>>> allocation will fallback to produce more normal pages. >>>>> >>>>> Mysql testcase show that too many thps is promoted to toptier , >>>>> due to toptier memory is not enough, it will bring in >>>>> >>>>> more pgpromote_deomted and dst_node_full counter increasing. In >>>>> that case, we prefer to remote access rather >>>>> >>>>> than migrate thp between pmem and toptier node frequently, which >>>>> will make performance decrease. >>>> Maybe we are looking at different source code :-). In latest >>>> upstream >>>> code, zone_allows_reclaim() is to control node reclaiming (or zone >>>> reclaim) only. Which repo should I look? >>> I think you misunderstood the change, the change is in >>> get_page_from_freelist(), not in zone_allows_reclaim(). >> OK, I see. 
I think the `diff` program fools me: >> >> @@ -3689,6 +3700,9 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) >> } >> } >> + if (!allow_hugepage_allocation(zone_to_nid(zone), >> order)) >> + continue; >> + >> if (no_fallback && nr_online_nodes > 1 && >> zone != ac->preferred_zoneref->zone) { >> int local_nid; >> >> >>> From my understanding, Zhongjiang is trying to disable the memory >>> allocation fallback for THP, right? >> I think so too now. >> >>> But that will cause more demotion if we can not fallback to PMEM node? >> If THP fails to be allocated, normal pages will be allocated instead. >> And it appears that if THP is failed to be demoted (with this patch, it >> will always fail), THP will be split too. So we may have much less THP >> in system with the patch. Zhongjiang, Can you check it? > The patch aims to prevent thp allocation in pmem node, I has > checked that there are not an thp is created > > in pmem node which is intended. Dram node still has a lot of > thp and can be collapsed. > >> Another choice is to split THP if migration fails. That's always a >> question to prefer THP or local/hot normal pages. > Test performance will decrease if a large number of thp in pmem > node, promotion will fail more frequently > > relative to normal page allocation because dram memory is not enough > to result in waking up kswapd. > > > hence the influence is too much promotion failure and > pgpromote_demoted. And Maybe thp is not > > really needed for testcase, but an subpage of thp. Yes. So I suggest to try to fallback to split THP upon THP allocation failure on DRAM. Just disable nosplit logic in migrate_pages().

Upstream already does what you describe. It falls back to splitting the THP into normal pages when promotion fails to allocate a THP on DRAM.

Not for NUMA balancing, because migrate_pages() sets bool nosplit = (reason == MR_NUMA_MISPLACED);

Maybe you have misunderstood the problem the patch aims to solve. The current issue is that there are too many THPs on the PMEM node while DRAM is insufficient; in the MySQL benchmark,

IMHO, the problem may be one or more of the following:

1. THP causes high promotion/demotion traffic, which consumes too much PMEM throughput.
2. THP reduces the accuracy of hot/cold page placement: some cold normal pages inside a THP are placed in DRAM, and some warm pages are placed in PMEM.
3. THP in PMEM fails to be promoted to DRAM because a THP cannot be allocated in DRAM. This causes bad hot/cold page placement too.

Per my understanding, you think the real problem is No. 3 above. It would be better to get some statistics to prove that, or whichever other possibility applies.

...

THP promotion is likely to fail because memory cannot be allocated on DRAM. In that situation promoting normal pages is more suitable, but by then it is too late to fall back to splitting the THP.

Why? After splitting the THP, we can still promote the normal pages in it.

Another issue with preventing THP allocation on PMEM is that demoting THP pages from DRAM to PMEM will always fail. In a system without swap, the THP will eventually be split. But I don't think we should rely on this.

Best Regards,
Huang, Ying

...

Also, disabling the nosplit logic in migrate_pages() will bring in more remote accesses. If I am missing something, please correct me. Thanks.

>
> Best Regards,
> Huang, Ying
>
>> but in practice, it is too late because the
>> pmem node has a large lot of thp be produced.
>>
>> It is the root case that bring in the performance decrease.
>>
>> The solution I propose maybe a little radical to disable the thp
>> production in pmem node. It has some side effect
>> that numa balancing will scan more pages and fallback more frequenctly
>> when allocate thp in pmem node.
>>
>>> Best Regards,
>>> Huang, Ying
>>>
>>>>>>> Memory tiering is challenging for THP. There's no many free pages
>>>>>>> (a little more than high watermark) in DRAM node, so it's hard to allocate
>>>>>>> THP there except workloads start up.
>>>>>>> Just curiously, whether disabling THP helps your workload?
>>>>>> From our previous testing, enabling THP can benefit the
>>>>>> performance. And our ECS environment enabled the THP by default.
>>>>> Thanks!
>>>>>
>>>>> Best Regards,
>>>>> Huang, Ying

zhong jiang

3:40 p.m.

On 2022/2/14 1:35 PM, Huang, Ying wrote:

...

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

On 2022/2/11 6:52 PM, Huang, Ying wrote:

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

On 2022/2/11 7:58 H, Huang, Ying wrote: > zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes: > >> On 2022/2/10 3:03 H, Huang, Ying wrote: >>> Baolin Wang <baolin.wang(a)linux.alibaba.com> writes: >>> >>>>> zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes: >>>>> >>>>>> On 2022/2/10 8:58 H, Huang, Ying wrote: >>>>>>> zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes: >>>>>>> >>>>>>>> Currently, Mysql testcase show that a large number of thp are migrated >>>>>>>> from pmem node to toptier node, it will bring in more pgpromote_demoted >>>>>>>> and migrated failiure. because pmem node memory is marked as prot_none, >>>>>>>> it will be migrated by cpu access as soon as possible when it is hot, >>>>>>>> and it is unnesscessary to migrate thp to dram when dram memory is not >>>>>>>> enough, which will bring in more demoted and promoted. >>>>>>>> >>>>>>>> Hence, the patch forbid the thp to produce in pmem node. the result show >>>>>>>> about 3% improvements. the relative statistics is as follows. 
>>>>>>>> >>>>>>>> before appling patch: >>>>>>>> mysql prepare: >>>>>>>> pgpromote_demoted 908267 >>>>>>>> pgmigrate_fail_dst_node_fail 428223 >>>>>>>> pgmigrate_fail_numa_isolate_fail 460480 >>>>>>>> >>>>>>>> mysql run: >>>>>>>> pgpromote_demoted 2901105 >>>>>>>> pgmigrate_fail_dst_node_fail 5653776 >>>>>>>> pgmigrate_fail_numa_isolate_fail 5686052 >>>>>>>> >>>>>>>> after appling patch: >>>>>>>> mysql prepare: >>>>>>>> pgpromote_demoted 839297 >>>>>>>> pgmigrate_fail_dst_node_fail 36585 >>>>>>>> pgmigrate_fail_numa_isolate_fail 36585 >>>>>>>> >>>>>>>> mysql run: >>>>>>>> pgpromote_demoted 913828 >>>>>>>> pgmigrate_fail_dst_node_fail 235863 >>>>>>>> pgmigrate_fail_numa_isolate_fail 235870 >>>>>>>> >>>>>>>> Signed-off-by: zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> >>>>>>>> --- >>>>>>>> mm/page_alloc.c | 14 ++++++++++++++ >>>>>>>> 1 file changed, 14 insertions(+) >>>>>>>> >>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>>>>>> index 8cfce92..4fff3cd 100644 >>>>>>>> --- a/mm/page_alloc.c >>>>>>>> +++ b/mm/page_alloc.c >>>>>>>> @@ -461,6 +461,17 @@ static __always_inline int get_pfnblock_migratetype(struct page *page, unsigned >>>>>>>> return __get_pfnblock_flags_mask(page, pfn, PB_migrate_end, MIGRATETYPE_MASK); >>>>>>>> } >>>>>>>> +static inline bool allow_hugepage_allocation(int nid, unsigned >>>>>>>> int order) >>>>>>>> +{ >>>>>>>> + if (node_is_toptier(nid)) >>>>>>>> + return true; >>>>>>>> + >>>>>>>> + if (order != HPAGE_PMD_ORDER) >>>>>>>> + return true; >>>>>>>> + >>>>>>>> + return false; >>>>>>>> +} >>>>>>>> + >>>>>>>> /** >>>>>>>> * set_pfnblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages >>>>>>>> * @page: The page within the block of interest >>>>>>>> @@ -3689,6 +3700,9 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) >>>>>>>> } >>>>>>>> } >>>>>>>> + if (!allow_hugepage_allocation(zone_to_nid(zone), >>>>>>>> order)) >>>>>>>> + continue; >>>>>>>> + >>>>>>> It appears 
that this will disable node reclaiming for THP allocation. >>>>>>> So more pages will be allocated in PMEM node because of allocation >>>>>>> fallback? >>>>>> We just allow normal pages allocate in pmem node, hence, thp >>>>>> allocation will fallback to produce more normal pages. >>>>>> >>>>>> Mysql testcase show that too many thps is promoted to toptier , >>>>>> due to toptier memory is not enough, it will bring in >>>>>> >>>>>> more pgpromote_deomted and dst_node_full counter increasing. In >>>>>> that case, we prefer to remote access rather >>>>>> >>>>>> than migrate thp between pmem and toptier node frequently, which >>>>>> will make performance decrease. >>>>> Maybe we are looking at different source code :-). In latest >>>>> upstream >>>>> code, zone_allows_reclaim() is to control node reclaiming (or zone >>>>> reclaim) only. Which repo should I look? >>>> I think you misunderstood the change, the change is in >>>> get_page_from_freelist(), not in zone_allows_reclaim(). >>> OK, I see. I think the `diff` program fools me: >>> >>> @@ -3689,6 +3700,9 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) >>> } >>> } >>> + if (!allow_hugepage_allocation(zone_to_nid(zone), >>> order)) >>> + continue; >>> + >>> if (no_fallback && nr_online_nodes > 1 && >>> zone != ac->preferred_zoneref->zone) { >>> int local_nid; >>> >>> >>>> From my understanding, Zhongjiang is trying to disable the memory >>>> allocation fallback for THP, right? >>> I think so too now. >>> >>>> But that will cause more demotion if we can not fallback to PMEM node? >>> If THP fails to be allocated, normal pages will be allocated instead. >>> And it appears that if THP is failed to be demoted (with this patch, it >>> will always fail), THP will be split too. So we may have much less THP >>> in system with the patch. Zhongjiang, Can you check it? 
>> The patch aims to prevent thp allocation in pmem node, I has >> checked that there are not an thp is created >> >> in pmem node which is intended. Dram node still has a lot of >> thp and can be collapsed. >> >>> Another choice is to split THP if migration fails. That's always a >>> question to prefer THP or local/hot normal pages. >> Test performance will decrease if a large number of thp in pmem >> node, promotion will fail more frequently >> >> relative to normal page allocation because dram memory is not enough >> to result in waking up kswapd. >> >> >> hence the influence is too much promotion failure and >> pgpromote_demoted. And Maybe thp is not >> >> really needed for testcase, but an subpage of thp. > Yes. So I suggest to try to fallback to split THP upon THP allocation > failure on DRAM. Just disable nosplit logic in migrate_pages(). The upstream do as you said. It will fallback to split thp into normal page when promotion fail to allocation thp on dram.

Not for NUMA balancing, because migrate_pages() sets bool nosplit = (reason == MR_NUMA_MISPLACED);

Maybe you have misunderstood the problem the patch aims to solve. The current issue is that there are too many THPs on the PMEM node while DRAM is insufficient; in the MySQL benchmark,

I have some statistics from the MySQL test, as shown in the patch.

before applying patch:
mysql prepare:
pgpromote_demoted 908267
pgmigrate_fail_dst_node_fail 428223
pgmigrate_fail_numa_isolate_fail 460480

mysql run:
pgpromote_demoted 2901105
pgmigrate_fail_dst_node_fail 5653776
pgmigrate_fail_numa_isolate_fail 5686052

after applying patch:
mysql prepare:
pgpromote_demoted 839297
pgmigrate_fail_dst_node_fail 36585
pgmigrate_fail_numa_isolate_fail 36585

mysql run:
pgpromote_demoted 913828
pgmigrate_fail_dst_node_fail 235863
pgmigrate_fail_numa_isolate_fail 235870

pgpromote_demoted and pgmigrate_fail_dst_node_fail decrease dramatically.
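To put numbers on "dramatically", the relative drop of each counter can be computed with a small helper. This is an illustrative sketch; `reduction_percent()` is a hypothetical name and not part of the patch. For the "mysql run" phase it gives a drop of roughly 68% for pgpromote_demoted and roughly 96% for pgmigrate_fail_dst_node_fail, while pgpromote_demoted in the "mysql prepare" phase drops by less than 8%.

```c
/*
 * Percentage by which a counter dropped from 'before' to 'after'.
 * Returns 0 when 'before' is 0 to avoid dividing by zero.
 * A negative result means the counter increased.
 */
double reduction_percent(unsigned long before, unsigned long after)
{
	if (before == 0)
		return 0.0;

	return 100.0 * ((double)before - (double)after) / (double)before;
}
```

For example, `reduction_percent(2901105UL, 913828UL)` covers the pgpromote_demoted counter of the run phase.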

...

THP promotion is likely to fail because memory cannot be allocated on DRAM. In that situation promoting normal pages is more suitable, but by then it is too late to fall back to splitting the THP.

Why? After splitting THP, we can still promote the normal pages in the THP.

I mean that we should avoid producing too many THPs on the PMEM node. Maybe we could limit the total number of THPs on the PMEM node?

...

Another issue of preventing allocating THP on PMEM is that demoting THP pages from DRAM to PMEM will always fail. In a system without swap, this will split the THP finally. But I don't think we should rely on this.

IMO, in general, a THP on the PMEM node is accessed through a NUMA hinting fault, which triggers the promotion logic to decide whether to migrate it to DRAM. This results in remote accesses and demote/promote traffic. The root cause is that the DRAM node does not have enough memory. Splitting all THPs on the PMEM node may be a little radical, but I don't see a better solution.

...

Best Regards, Huang, Ying > but It will bring in more remote access when we disable nosplit logic > in migrate_pages. Am I missing something > > , please correct me, Thanks. > >> Best Regards, >> Huang, Ying >> >>> but in practice, it is too late because the >>> pmem node has a large lot of thp be produced. >>> >>> It is the root case that bring in the performance decrease. >>> >>> >>> The solution I propose maybe a little radical to disable the thp >>> production in pmem node. It has some side effect >>> >>> that numa balancing will scan more pages and fallback more frequenctly >>> when allocate thp in pmem node. >>> >>>> Best Regards, >>>> Huang, Ying >>>> >>>>>>>> Memory tiering is challenging for THP. There's no many free pages >>>>>>>> (a >>>>>>>> little more than high watermark) in DRAM node, so it's hard to allocate >>>>>>>> THP there except workloads start up. >>>>>>>> Just curiously, whether disabling THP helps your workload? >>>>>>> From our previous testing, enabling THP can benefit the >>>>>>> performance. And our ECS environment enabled the THP by default. >>>>>> Thanks! >>>>>> >>>>>> Best Regards, >>>>>> Huang, Ying

Huang, Ying

8:26 p.m.

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

...

On 2022/2/14 1:35 PM, Huang, Ying wrote:

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

On 2022/2/11 6:52 PM, Huang, Ying wrote:

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes: > On 2022/2/11 7:58 H, Huang, Ying wrote: >> zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes: >> >>> On 2022/2/10 3:03 H, Huang, Ying wrote: >>>> Baolin Wang <baolin.wang(a)linux.alibaba.com> writes: >>>> >>>>>> zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes: >>>>>> >>>>>>> On 2022/2/10 8:58 H, Huang, Ying wrote: >>>>>>>> zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes: >>>>>>>> >>>>>>>>> Currently, Mysql testcase show that a large number of thp are migrated >>>>>>>>> from pmem node to toptier node, it will bring in more pgpromote_demoted >>>>>>>>> and migrated failiure. because pmem node memory is marked as prot_none, >>>>>>>>> it will be migrated by cpu access as soon as possible when it is hot, >>>>>>>>> and it is unnesscessary to migrate thp to dram when dram memory is not >>>>>>>>> enough, which will bring in more demoted and promoted. >>>>>>>>> >>>>>>>>> Hence, the patch forbid the thp to produce in pmem node. the result show >>>>>>>>> about 3% improvements. the relative statistics is as follows. 
>>>>>>>>> >>>>>>>>> before appling patch: >>>>>>>>> mysql prepare: >>>>>>>>> pgpromote_demoted 908267 >>>>>>>>> pgmigrate_fail_dst_node_fail 428223 >>>>>>>>> pgmigrate_fail_numa_isolate_fail 460480 >>>>>>>>> >>>>>>>>> mysql run: >>>>>>>>> pgpromote_demoted 2901105 >>>>>>>>> pgmigrate_fail_dst_node_fail 5653776 >>>>>>>>> pgmigrate_fail_numa_isolate_fail 5686052 >>>>>>>>> >>>>>>>>> after appling patch: >>>>>>>>> mysql prepare: >>>>>>>>> pgpromote_demoted 839297 >>>>>>>>> pgmigrate_fail_dst_node_fail 36585 >>>>>>>>> pgmigrate_fail_numa_isolate_fail 36585 >>>>>>>>> >>>>>>>>> mysql run: >>>>>>>>> pgpromote_demoted 913828 >>>>>>>>> pgmigrate_fail_dst_node_fail 235863 >>>>>>>>> pgmigrate_fail_numa_isolate_fail 235870 >>>>>>>>> >>>>>>>>> Signed-off-by: zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> >>>>>>>>> --- >>>>>>>>> mm/page_alloc.c | 14 ++++++++++++++ >>>>>>>>> 1 file changed, 14 insertions(+) >>>>>>>>> >>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>>>>>>> index 8cfce92..4fff3cd 100644 >>>>>>>>> --- a/mm/page_alloc.c >>>>>>>>> +++ b/mm/page_alloc.c >>>>>>>>> @@ -461,6 +461,17 @@ static __always_inline int get_pfnblock_migratetype(struct page *page, unsigned >>>>>>>>> return __get_pfnblock_flags_mask(page, pfn, PB_migrate_end, MIGRATETYPE_MASK); >>>>>>>>> } >>>>>>>>> +static inline bool allow_hugepage_allocation(int nid, unsigned >>>>>>>>> int order) >>>>>>>>> +{ >>>>>>>>> + if (node_is_toptier(nid)) >>>>>>>>> + return true; >>>>>>>>> + >>>>>>>>> + if (order != HPAGE_PMD_ORDER) >>>>>>>>> + return true; >>>>>>>>> + >>>>>>>>> + return false; >>>>>>>>> +} >>>>>>>>> + >>>>>>>>> /** >>>>>>>>> * set_pfnblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages >>>>>>>>> * @page: The page within the block of interest >>>>>>>>> @@ -3689,6 +3700,9 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) >>>>>>>>> } >>>>>>>>> } >>>>>>>>> + if (!allow_hugepage_allocation(zone_to_nid(zone), >>>>>>>>> order)) 
>>>>>>>>> + continue; >>>>>>>>> + >>>>>>>> It appears that this will disable node reclaiming for THP allocation. >>>>>>>> So more pages will be allocated in PMEM node because of allocation >>>>>>>> fallback? >>>>>>> We just allow normal pages allocate in pmem node, hence, thp >>>>>>> allocation will fallback to produce more normal pages. >>>>>>> >>>>>>> Mysql testcase show that too many thps is promoted to toptier , >>>>>>> due to toptier memory is not enough, it will bring in >>>>>>> >>>>>>> more pgpromote_deomted and dst_node_full counter increasing. In >>>>>>> that case, we prefer to remote access rather >>>>>>> >>>>>>> than migrate thp between pmem and toptier node frequently, which >>>>>>> will make performance decrease. >>>>>> Maybe we are looking at different source code :-). In latest >>>>>> upstream >>>>>> code, zone_allows_reclaim() is to control node reclaiming (or zone >>>>>> reclaim) only. Which repo should I look? >>>>> I think you misunderstood the change, the change is in >>>>> get_page_from_freelist(), not in zone_allows_reclaim(). >>>> OK, I see. I think the `diff` program fools me: >>>> >>>> @@ -3689,6 +3700,9 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone) >>>> } >>>> } >>>> + if (!allow_hugepage_allocation(zone_to_nid(zone), >>>> order)) >>>> + continue; >>>> + >>>> if (no_fallback && nr_online_nodes > 1 && >>>> zone != ac->preferred_zoneref->zone) { >>>> int local_nid; >>>> >>>> >>>>> From my understanding, Zhongjiang is trying to disable the memory >>>>> allocation fallback for THP, right? >>>> I think so too now. >>>> >>>>> But that will cause more demotion if we can not fallback to PMEM node? >>>> If THP fails to be allocated, normal pages will be allocated instead. >>>> And it appears that if THP is failed to be demoted (with this patch, it >>>> will always fail), THP will be split too. So we may have much less THP >>>> in system with the patch. Zhongjiang, Can you check it? 
>>> The patch aims to prevent thp allocation in pmem node, I has >>> checked that there are not an thp is created >>> >>> in pmem node which is intended. Dram node still has a lot of >>> thp and can be collapsed. >>> >>>> Another choice is to split THP if migration fails. That's always a >>>> question to prefer THP or local/hot normal pages. >>> Test performance will decrease if a large number of thp in pmem >>> node, promotion will fail more frequently >>> >>> relative to normal page allocation because dram memory is not enough >>> to result in waking up kswapd. >>> >>> >>> hence the influence is too much promotion failure and >>> pgpromote_demoted. And Maybe thp is not >>> >>> really needed for testcase, but an subpage of thp. >> Yes. So I suggest to try to fallback to split THP upon THP allocation >> failure on DRAM. Just disable nosplit logic in migrate_pages(). > The upstream do as you said. It will fallback to split thp > into normal page when promotion fail to allocation > > thp on dram. Not for NUMA balancing. Because bool nosplit = (reason == MR_NUMA_MISPLACED);

Maybe you have misunderstood the problem the patch aims to solve. The current issue, as the mysql benchmark test shows, is that there are too many THPs in the pmem node and the DRAM node does not have enough memory.

Thanks for your data. Can you also show pgpromote_success, pgdemote_kswapd, pgdemote_direct?

...

Promoting a THP is likely to fail because memory cannot be allocated in DRAM. In that situation, promoting normal pages is more suitable, but it is too late to fall back to splitting the THP.

Why? After splitting THP, we can still promote the normal pages in the THP.

I mean that we should avoid producing too many THPs in the pmem node. Maybe we can limit the total number of THPs in the pmem node.

IMO, in general, a THP in the pmem node can only be accessed through a NUMA fault. The fault triggers promotion, which decides whether to migrate the THP to DRAM or not, and this results in remote access and demote/promote traffic. The root cause is that the DRAM node does not have enough memory.

DRAM node is always nearly full in our current policy. It appears that kswapd cannot free enough order-9 pages on DRAM node to accommodate promoted THP pages. We can check /proc/buddyinfo to verify that.
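
The `/proc/buddyinfo` check suggested above can be sketched as follows. This is a minimal example under stated assumptions: the function and macro names are hypothetical, order 9 is assumed to be the THP order (as on x86-64 with 4 KiB base pages), and each line is assumed to follow the usual `Node N, zone NAME count0 count1 ...` layout.

```c
#include <stdio.h>
#include <stdlib.h>

#define MODEL_THP_ORDER 9	/* assumed THP order; architecture dependent */

/*
 * Parse one /proc/buddyinfo line. Returns the number of free blocks of
 * order >= min_order, or -1 if the line does not look like buddyinfo.
 * On success *node and zone are filled in.
 */
long free_blocks_at_or_above(const char *line, int min_order,
			     int *node, char zone[32])
{
	int consumed = 0;
	int order = 0;
	long total = 0;
	const char *p;

	if (sscanf(line, "Node %d, zone %31s%n", node, zone, &consumed) != 2)
		return -1;

	/* The remaining fields are free-block counts for order 0, 1, 2, ... */
	p = line + consumed;
	for (;;) {
		char *end;
		unsigned long count = strtoul(p, &end, 10);

		if (end == p)
			break;
		if (order >= min_order)
			total += (long)count;
		order++;
		p = end;
	}
	return order > 0 ? total : -1;
}

/* Print, per zone, how many free blocks could back an order-9 allocation. */
void report_thp_capable_blocks(FILE *fp)
{
	char line[512], zone[32];
	int node;

	while (fgets(line, sizeof(line), fp)) {
		long n = free_blocks_at_or_above(line, MODEL_THP_ORDER,
						 &node, zone);
		if (n >= 0)
			printf("node %d zone %-8s order>=%d free blocks: %ld\n",
			       node, zone, MODEL_THP_ORDER, n);
	}
}
```

Calling `report_thp_capable_blocks()` on a stream opened from `/proc/buddyinfo` while the benchmark runs would show whether the DRAM node has any free blocks of order 9 or larger. A non-zero count does not by itself guarantee that a THP allocation succeeds, since the allocator also applies watermark checks.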

...

Maybe splitting all THPs in the pmem node is a little radical, but it seems to me there is no better solution.

Yes. I think that isn't a final solution. We can try:

1. Adopt some patches from v0.8,

https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?…
https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?…

Then increase the promote watermark to about 100MB and wake up kswapd earlier.

2. If the above doesn't work, we can try to split the to-be-promoted THP if there are enough free pages but no order-9 pages in DRAM.

Best Regards,
Huang, Ying

...

> Best Regards,
> Huang, Ying
>
>> but It will bring in more remote access when we disable nosplit logic
>> in migrate_pages. Am I missing something
>> , please correct me, Thanks.
>>
>>> Best Regards,
>>> Huang, Ying
>>>
>>>> but in practice, it is too late because the
>>>> pmem node has a large lot of thp be produced.
>>>>
>>>> It is the root case that bring in the performance decrease.
>>>>
>>>> The solution I propose maybe a little radical to disable the thp
>>>> production in pmem node. It has some side effect
>>>> that numa balancing will scan more pages and fallback more frequenctly
>>>> when allocate thp in pmem node.
>>>>
>>>>> Best Regards,
>>>>> Huang, Ying
>>>>>
>>>>>>>>> Memory tiering is challenging for THP. There's no many free pages
>>>>>>>>> (a little more than high watermark) in DRAM node, so it's hard to allocate
>>>>>>>>> THP there except workloads start up.
>>>>>>>>> Just curiously, whether disabling THP helps your workload?
>>>>>>>> From our previous testing, enabling THP can benefit the
>>>>>>>> performance. And our ECS environment enabled the THP by default.
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Huang, Ying

zhong jiang

15 Feb

11:49 a.m.

On 2022/2/14 8:26 PM, Huang, Ying wrote:

...

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

On 2022/2/14 1:35, Huang, Ying wrote:

zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:

On 2022/2/11 6:52, Huang, Ying wrote:
> zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:
>
>> On 2022/2/11 7:58, Huang, Ying wrote:
>>> zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:
>>>
>>>> On 2022/2/10 3:03, Huang, Ying wrote:
>>>>> Baolin Wang <baolin.wang(a)linux.alibaba.com> writes:
>>>>>
>>>>>>> zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:
>>>>>>>
>>>>>>>> On 2022/2/10 8:58, Huang, Ying wrote:
>>>>>>>>> zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes:

...

>>> Yes. So I suggest to try to fallback to split THP upon THP allocation
>>> failure on DRAM. Just disable nosplit logic in migrate_pages().
>> The upstream do as you said. It will fallback to split thp
>> into normal page when promotion fail to allocation
>> thp on dram.
> Not for NUMA balancing. Because
>
> bool nosplit = (reason == MR_NUMA_MISPLACED);

Maybe you have misunderstood the problem the patch aims to solve. The current issue, as the mysql benchmark test shows, is that there are too many THPs in the pmem node and the DRAM node does not have enough memory.

Thanks for your data. Can you also show pgpromote_success, pgdemote_kswapd, pgdemote_direct?

before applying patch:

mysql prepare:
pgdemote_kswapd 5419718
pgdemote_direct 133301

mysql run:
pgdemote_kswapd 8464536
pgdemote_direct 137455

after applying patch:

mysql prepare:
pgdemote_kswapd 22631547
pgdemote_direct 161228

mysql run:
pgdemote_kswapd 27817324
pgdemote_direct 161228
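
To make the size of the change easier to read, the before/after counters above can be turned into change factors. The sketch below only does arithmetic on the numbers posted in this message; the struct, array, and function names are hypothetical, and it is not stated in the thread whether the counters were reset between the prepare and run phases, so the factors are indicative only.

```c
#include <assert.h>
#include <stdio.h>

struct counter_sample {
	const char *phase;
	const char *name;
	unsigned long before;	/* value without the patch */
	unsigned long after;	/* value with the patch */
};

/* Values copied from the message above. */
const struct counter_sample demote_samples[] = {
	{ "mysql prepare", "pgdemote_kswapd", 5419718UL, 22631547UL },
	{ "mysql prepare", "pgdemote_direct", 133301UL, 161228UL },
	{ "mysql run", "pgdemote_kswapd", 8464536UL, 27817324UL },
	{ "mysql run", "pgdemote_direct", 137455UL, 161228UL },
};

/* Ratio of the patched value to the unpatched value. */
double change_factor(unsigned long before, unsigned long after)
{
	assert(before > 0);
	return (double)after / (double)before;
}

/* Print one line per counter with its change factor. */
void print_change_factors(void)
{
	size_t n = sizeof(demote_samples) / sizeof(demote_samples[0]);

	for (size_t i = 0; i < n; i++)
		printf("%-14s %-16s x%.2f\n",
		       demote_samples[i].phase, demote_samples[i].name,
		       change_factor(demote_samples[i].before,
				     demote_samples[i].after));
}
```

With these inputs, `pgdemote_kswapd` grows by roughly 4.2x in the prepare phase and roughly 3.3x in the run phase, while `pgdemote_direct` grows by about 1.2x in both.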

...

Promoting a THP is likely to fail because memory cannot be allocated in DRAM. In that situation, promoting normal pages is more suitable, but it is too late to fall back to splitting the THP.

Why? After splitting THP, we can still promote the normal pages in the THP.

I mean that we should avoid producing too many THPs in the pmem node. Maybe we can limit the total number of THPs in the pmem node.

IMO, in general, a THP in the pmem node can only be accessed through a NUMA fault. The fault triggers promotion, which decides whether to migrate the THP to DRAM or not, and this results in remote access and demote/promote traffic. The root cause is that the DRAM node does not have enough memory.

The increase in pgmigrate_fail_dst_node_fail shows that DRAM cannot meet the watermark when a THP is allocated. In the mysql test, the DRAM node is always nearly full.

...

Maybe splitting all THPs in the pmem node is a little radical, but it seems to me there is no better solution.

I worry that increasing the watermark will result in more demote/promote traffic.

...

2. If the above doesn't work, we can try to split the to-be-promoted THP if there are enough free pages but no order-9 pages in DRAM.

The key problem is that the DRAM node does not have enough memory, rather than a lack of order-9 pages to satisfy the THP allocation request.

...


pmem@lists.openanolis.cn

16 comments

Participants (4)

Baolin Wang
Huang, Ying
zhong jiang
zhongjiang-ali