zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:
On 2022/2/11 6:52 H, Huang, Ying wrote:
> zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:
>
>> On 2022/2/11 7:58 H, Huang, Ying wrote:
>>> zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:
>>>
>>>> On 2022/2/10 3:03 H, Huang, Ying wrote:
>>>>> Baolin Wang <baolin.wang(a)linux.alibaba.com> writes:
>>>>>
>>>>>>> zhong jiang <zhongjiang-ali(a)linux.alibaba.com> writes:
>>>>>>>
>>>>>>>> On 2022/2/10 8:58 H, Huang, Ying wrote:
>>>>>>>>> zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com> writes:
>>>>>>>>>
>>>>>>>>>> Currently, a MySQL test case shows that a large number of THPs are
>>>>>>>>>> migrated from the pmem node to the toptier node, which brings in
>>>>>>>>>> more pgpromote_demoted events and migration failures. Because pmem
>>>>>>>>>> node memory is marked as prot_none, a page is migrated on CPU
>>>>>>>>>> access as soon as it becomes hot. It is unnecessary to migrate a
>>>>>>>>>> THP to DRAM when DRAM memory is not enough, since that only
>>>>>>>>>> brings in more demotion and promotion.
>>>>>>>>>>
>>>>>>>>>> Hence, this patch forbids THP allocation on the pmem node. The
>>>>>>>>>> results show about a 3% improvement. The relevant statistics are
>>>>>>>>>> as follows.
>>>>>>>>>>
>>>>>>>>>> before applying patch:
>>>>>>>>>> mysql prepare:
>>>>>>>>>> pgpromote_demoted 908267
>>>>>>>>>> pgmigrate_fail_dst_node_fail 428223
>>>>>>>>>> pgmigrate_fail_numa_isolate_fail 460480
>>>>>>>>>>
>>>>>>>>>> mysql run:
>>>>>>>>>> pgpromote_demoted 2901105
>>>>>>>>>> pgmigrate_fail_dst_node_fail 5653776
>>>>>>>>>> pgmigrate_fail_numa_isolate_fail 5686052
>>>>>>>>>>
>>>>>>>>>> after applying patch:
>>>>>>>>>> mysql prepare:
>>>>>>>>>> pgpromote_demoted 839297
>>>>>>>>>> pgmigrate_fail_dst_node_fail 36585
>>>>>>>>>> pgmigrate_fail_numa_isolate_fail 36585
>>>>>>>>>>
>>>>>>>>>> mysql run:
>>>>>>>>>> pgpromote_demoted 913828
>>>>>>>>>> pgmigrate_fail_dst_node_fail 235863
>>>>>>>>>> pgmigrate_fail_numa_isolate_fail 235870
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: zhongjiang-ali <zhongjiang-ali(a)linux.alibaba.com>
>>>>>>>>>> ---
>>>>>>>>>> mm/page_alloc.c | 14 ++++++++++++++
>>>>>>>>>> 1 file changed, 14 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>>>>> index 8cfce92..4fff3cd 100644
>>>>>>>>>> --- a/mm/page_alloc.c
>>>>>>>>>> +++ b/mm/page_alloc.c
>>>>>>>>>> @@ -461,6 +461,17 @@ static __always_inline int get_pfnblock_migratetype(struct page *page, unsigned
>>>>>>>>>>  	return __get_pfnblock_flags_mask(page, pfn, PB_migrate_end, MIGRATETYPE_MASK);
>>>>>>>>>>  }
>>>>>>>>>> +static inline bool allow_hugepage_allocation(int nid, unsigned int order)
>>>>>>>>>> +{
>>>>>>>>>> +	if (node_is_toptier(nid))
>>>>>>>>>> +		return true;
>>>>>>>>>> +
>>>>>>>>>> +	if (order != HPAGE_PMD_ORDER)
>>>>>>>>>> +		return true;
>>>>>>>>>> +
>>>>>>>>>> +	return false;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  /**
>>>>>>>>>>   * set_pfnblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages
>>>>>>>>>>   * @page: The page within the block of interest
>>>>>>>>>> @@ -3689,6 +3700,9 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>>>>>>>>>>  			}
>>>>>>>>>>  		}
>>>>>>>>>> +		if (!allow_hugepage_allocation(zone_to_nid(zone), order))
>>>>>>>>>> +			continue;
>>>>>>>>>> +
>>>>>>>>> It appears that this will disable node reclaiming for THP
>>>>>>>>> allocation. So more pages will be allocated in PMEM node because
>>>>>>>>> of allocation fallback?
>>>>>>>> We only allow normal pages to be allocated on the pmem node. Hence,
>>>>>>>> THP allocation will fall back to producing more normal pages.
>>>>>>>>
>>>>>>>> The MySQL test case shows that too many THPs are promoted to
>>>>>>>> toptier. Because toptier memory is not enough, this increases the
>>>>>>>> pgpromote_demoted and dst_node_full counters. In that case, we
>>>>>>>> prefer remote access to frequently migrating THPs between the pmem
>>>>>>>> and toptier nodes, which would decrease performance.
>>>>>>> Maybe we are looking at different source code :-). In the latest
>>>>>>> upstream code, zone_allows_reclaim() is to control node reclaiming
>>>>>>> (or zone reclaim) only. Which repo should I look at?
>>>>>> I think you misunderstood the change, the change is in
>>>>>> get_page_from_freelist(), not in zone_allows_reclaim().
>>>>> OK, I see. I think the `diff` program fooled me:
>>>>>
>>>>> @@ -3689,6 +3700,9 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>>>>>  			}
>>>>>  		}
>>>>> +		if (!allow_hugepage_allocation(zone_to_nid(zone), order))
>>>>> +			continue;
>>>>> +
>>>>>  		if (no_fallback && nr_online_nodes > 1 &&
>>>>>  		    zone != ac->preferred_zoneref->zone) {
>>>>>  			int local_nid;
>>>>>
>>>>>
>>>>>> From my understanding, Zhongjiang is trying to disable the memory
>>>>>> allocation fallback for THP, right?
>>>>> I think so too now.
>>>>>
>>>>>> But that will cause more demotion if we can not fallback to PMEM
>>>>>> node?
>>>>> If THP fails to be allocated, normal pages will be allocated instead.
>>>>> And it appears that if a THP fails to be demoted (with this patch, it
>>>>> will always fail), the THP will be split too. So we may have far
>>>>> fewer THPs in the system with the patch. Zhongjiang, can you check it?
>>>> The patch aims to prevent THP allocation on the pmem node. I have
>>>> checked that no THP is created on the pmem node, which is intended.
>>>> The DRAM node still has a lot of THPs, and THPs can still be
>>>> collapsed there.
>>>>
>>>>> Another choice is to split THP if migration fails. That's always a
>>>>> question to prefer THP or local/hot normal pages.
>>>> Test performance will decrease if there are a large number of THPs on
>>>> the pmem node. Promotion will fail more frequently than with normal
>>>> page allocation, because DRAM memory is not enough, which results in
>>>> waking up kswapd.
>>>>
>>>> Hence the consequence is too many promotion failures and too much
>>>> pgpromote_demoted. And maybe the test case does not really need the
>>>> whole THP, but only a subpage of it.
>>> Yes. So I suggest to try to fallback to split THP upon THP allocation
>>> failure on DRAM. Just disable nosplit logic in migrate_pages().
>> Upstream already does what you said. It falls back to splitting the THP
>> into normal pages when promotion fails to allocate a THP on DRAM.
> Not for NUMA balancing. Because
>
> bool nosplit = (reason == MR_NUMA_MISPLACED);
Maybe you have misunderstood the problem the patch aims to solve.
The current issue is that there are too many THPs on the pmem node and
DRAM node memory is not enough, as in the MySQL benchmark test,
IMHO, the problem may be the following. One or some.
1. THP causes high promotion/demotion traffic, it consumes too much PMEM
throughput.
2. THP reduces the accuracy of the hot/cold pages placement. Some cold
normal pages in THP are placed in DRAM and some warm pages are placed
in PMEM.
3. THP in PMEM fails to be promoted to DRAM because THP cannot be
allocated in DRAM. This causes bad hot/cold pages placement too.
Per my understanding, you think the real problem is No. 3 above. Better
to get some statistics to prove that or any other possibility.
I have some statistics from the MySQL test, as shown in the patch.
before applying patch:
mysql prepare:
pgpromote_demoted 908267
pgmigrate_fail_dst_node_fail 428223
pgmigrate_fail_numa_isolate_fail 460480
mysql run:
pgpromote_demoted 2901105
pgmigrate_fail_dst_node_fail 5653776
pgmigrate_fail_numa_isolate_fail 5686052
after applying patch:
mysql prepare:
pgpromote_demoted 839297
pgmigrate_fail_dst_node_fail 36585
pgmigrate_fail_numa_isolate_fail 36585
mysql run:
pgpromote_demoted 913828
pgmigrate_fail_dst_node_fail 235863
pgmigrate_fail_numa_isolate_fail 235870
pgpromote_demoted and pgmigrate_fail_dst_node_fail decrease dramatically.