
Linux Scheduler: CFS Task CPU Selection


I. The select_task_rq_fair() function

CFS task placement ultimately always goes through select_task_rq_fair(). The three CFS placement paths are:

try_to_wake_up //core.c 
    select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags); //wakeup placement path 
 
wake_up_new_task //core.c 
    select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0); //fork placement path 
 
sched_exec //core.c 
    select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0); //exec placement path

Comparing the three, only wakeup placement passes wake_flags, so only it can request a sync wakeup. Whether a sched_domain's flags contain SD_BALANCE_XX can be checked as follows:

/proc/sys/kernel/sched_domain # find ./ -name flags | xargs cat 
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES //MC 
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_ASYM_CPUCAPACITY SD_PREFER_SIBLING //DIE

On an embedded big.LITTLE Arm64 platform like this, neither the MC nor the DIE level has the SD_BALANCE_WAKE flag set.

Function walkthrough:

/* 
 * select_task_rq_fair: Select target runqueue for the waking task in domains 
 * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, 
 * SD_BALANCE_FORK, or SD_BALANCE_EXEC. 
 * 
 * Balances load by selecting the idlest CPU in the idlest group, or under 
 * certain conditions an idle sibling CPU if the domain has SD_WAKE_AFFINE set. 
 * 
 * Returns the target CPU number. 
 * 
 * preempt must be disabled. 
 */ 
/* 
 * Arguments: 
 * p: the task being placed 
 * prev_cpu: the CPU the task last ran on 
 * sd_flag: place the task within domains that have this flag; one of SD_BALANCE_WAKE (wakeup placement), SD_BALANCE_FORK (placement of a newly forked task), SD_BALANCE_EXEC (placement on exec) 
 * wake_flags: may contain WF_SYNC, WF_FORK, WF_MIGRATED, WF_ON_CPU, WF_ANDROID_VENDOR, but only WF_SYNC (synchronous wakeup) gets special handling here; the other flags are ignored by this function. 
 * 
 * Return value: the chosen target CPU 
 *  
 * Execution context: preemption must be disabled 
 */ 
static int select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags) 
{ 
    struct sched_domain *tmp, *sd = NULL; 
    int cpu = smp_processor_id(); 
    int new_cpu = prev_cpu; //the target CPU defaults to prev_cpu 
    int want_affine = 0; 
    int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING); 
    int target_cpu = -1; 
 
    //If this hook is registered and this is not fork placement, update the task's load first 
    if (trace_android_rvh_select_task_rq_fair_enabled() && !(sd_flag & SD_BALANCE_FORK))  
        sync_entity_load_avg(&p->se); 
    //A vendor hook may override the placement path; if it picked a CPU, return it right away (registered handler: probe_android_rvh_select_task_rq_fair) 
    trace_android_rvh_select_task_rq_fair(p, prev_cpu, sd_flag, wake_flags, &target_cpu); 
    if (target_cpu >= 0) 
        return target_cpu; 
 
    //Only wakeup placement can take the EAS path 
    if (sd_flag & SD_BALANCE_WAKE) { 
        //Record whether current and the wakee p have a fairly fixed wakeup relationship 
        record_wakee(p); 
 
        //The global switch /proc/sys/kernel/sched_energy_aware controls whether EAS is enabled 
        if (sched_energy_enabled()) { 
            //Try EAS placement; if EAS picked a CPU, return it directly. Internally it checks rd->overutilized and bails out immediately if the system is overutilized 
            new_cpu = find_energy_efficient_cpu(p, prev_cpu, sync); 
            if (new_cpu >= 0) 
                return new_cpu; 
            //If EAS did not pick a CPU, fall back to the initial state and continue with the conventional placement path 
            new_cpu = prev_cpu; 
        } 
 
        //If p and current have a fairly tight wakeup relationship and p is allowed to run on the current CPU, treat it as want_affine 
        want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr); 
    } 
 
    rcu_read_lock(); 
    for_each_domain(cpu, tmp) { //MC-->DIE 
        /* If both 'cpu' and 'prev_cpu' are part of this domain, cpu is a valid SD_WAKE_AFFINE target.*/ 
        /* SD_WAKE_AFFINE is set on both MC and DIE, so that part always holds; in practice, once want_affine is true this branch is entered (at the DIE level at the latest, since its span covers all CPUs) */ 
        if (want_affine && (tmp->flags & SD_WAKE_AFFINE) && cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) { 
            //If the current CPU is the task's prev_cpu there is no need to call wake_affine(): wakeup placement then goes straight to picking an idle sibling, non-wakeup placement just picks prev_cpu 
            if (cpu != prev_cpu) 
                new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync); //if neither the fast nor the slow path is taken later, this value is returned 
 
            sd = NULL; /* Prefer wake_affine over balance flags */ 
            break; 
        } 
 
        /* 
         * Note tmp->flags never contains SD_BALANCE_WAKE, since neither MC nor DIE sets it. sd is only assigned for SD_BALANCE_FORK / SD_BALANCE_EXEC placement. 
         * For want_affine == 0, when this loop exits sd is the current CPU's DIE-level sd. The purpose here is to find the highest-level sd that supports sd_flag. 
         */ 
        if (tmp->flags & sd_flag) 
            sd = tmp; 
        else if (!want_affine) 
            break; 
    } 
 
    if (unlikely(sd)) { //sleep-wakeup (SD_BALANCE_WAKE) placement can never take the slow path, because sd is always NULL for it 
        /* Slow path */ 
        new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag); 
    } else if (sd_flag & SD_BALANCE_WAKE) { //placement for a woken task 
        /* Fast path */ 
        new_cpu = select_idle_sibling(p, prev_cpu, new_cpu); 
 
        if (want_affine) 
            current->recent_used_cpu = cpu; 
    } 
    rcu_read_unlock(); 
 
    return new_cpu; 
}

1. Main logic

(1) A vendor hook may decide the placement instead of the stock flow; if a hook is registered and it picks a CPU, the stock flow is skipped. Otherwise the stock flow runs.

(2) Decide whether to try EAS placement; if EAS picks a CPU, return immediately, otherwise continue with the flow below. The conditions for EAS placement are:
a. It must be wakeup placement; fork and exec placement never use EAS.
b. The global EAS switch /proc/sys/kernel/sched_energy_aware must be enabled.
c. The system must not be overutilized; if it is, EAS placement bails out immediately.

(3) A CPU picked by wake_affine only serves as a candidate; in the wakeup case, the fast path still runs afterwards.

(4) The slow path is only taken by fork and exec placement, never by wakeup placement. Moreover, the sd passed to the slow path can only be the current CPU's DIE-level sd.

(5) The fast path is only taken by wakeup placement, never by fork or exec placement. Although the function is named select_idle_sibling, when no idle CPU is found in the same cluster it still falls back to searching across all clusters. (A condensed sketch of the whole decision order follows below.)
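
To tie the points above together, here is a condensed sketch of the decision order. It is a simplification of the function listed above, not verbatim kernel code:

/* 
 * select_task_rq_fair(), condensed: 
 * 
 *   if (a vendor hook picked a CPU)                        return that CPU; 
 *   if (wakeup && EAS enabled && !overutilized 
 *       && find_energy_efficient_cpu() found a CPU)        return that CPU; 
 *   want_affine = wakeup && !wake_wide(p) && p may run on the current CPU; 
 *   if (want_affine && cpu != prev_cpu)                    new_cpu = wake_affine(...);   // candidate only 
 *   if (fork || exec placement)                            return find_idlest_cpu(...);  // slow path 
 *   if (wakeup placement)                                  return select_idle_sibling(p, prev_cpu, new_cpu); // fast path 
 */ 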

II. The wake affine property

wake affine only applies to wakeup placement. It mainly checks whether the current task and the task being woken have a fairly fixed wakeup relationship, in order to help pick a candidate CPU.

1. Three task_struct members record a thread's wakeup information

struct task_struct { 
    ... 
    unsigned int        wakee_flips; //number of times this thread woke a different wakee; the larger the value, the more it tends to wake different threads, and the more frequently 
    unsigned long        wakee_flip_decay_ts; //timestamp of the last wakee_flips decay; the counter decays to 50% of its value every second 
    struct task_struct    *last_wakee; //the thread this thread woke up last time 
    ... 
}

2. record_wakee()

/* 
 * If current has a fairly fixed wakeup relationship with p, current->wakee_flips stays very small. 
 * If current frequently wakes many different threads, the value grows large. 
 */ 
static void record_wakee(struct task_struct *p) 
{ 
    /* 
     * Only decay a single time; tasks that have less then 1 wakeup per 
     * jiffy will not have built up many flips. 
     */ 
    //after more than one second, decay to half 
    if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) { 
        current->wakee_flips >>= 1; 
        current->wakee_flip_decay_ts = jiffies; 
    } 
 
    //in the current task's context: if the task woken last time is not p, increment current's wakee_flips 
    if (current->last_wakee != p) { 
        current->last_wakee = p; 
        current->wakee_flips++; 
    } 
}
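
As a quick illustration of how the counter behaves (hypothetical numbers):

/* 
 * - A thread that always wakes the same partner: last_wakee never changes, so 
 *   wakee_flips stays near 0. 
 * - A dispatcher that wakes ~200 different workers per second: wakee_flips grows 
 *   by ~200 per second while the decay halves it once per second, so it settles 
 *   around 400 (x = x/2 + 200). 
 */ 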

In a waker-wakes-wakee scenario there are two placement strategies: one is consolidation, i.e. keep the waker and wakee as close as possible to improve cache hit rates; the other is spreading, i.e. distribute the load as evenly as possible across CPUs. Different wakeup models call for different placement policies. Two simple wakeup models:

(1) In the 1:N model, one server keeps waking many different clients.
(2) In the 1:1 model, thread A and thread B keep waking each other.

In the 1:N model, if N is large, keeping the waker and wakee close leads to a severely unbalanced load: the sd containing the waker ends up carrying too many tasks, which hurts performance.
In the 1:1 model, keeping the waker and wakee close has no such problem and can even improve performance.
In real programs the wakeup relationships are rarely that simple: a wakee may itself be a waker in another relationship, so the interaction can be M:N. Consider this scenario: a waker pulls its wakee close, but that wakee has a large wakee_flips of its own; since the wakee also acts as a waker, even more threads get pulled toward the waker's sd, further aggravating the competition for CPU resources. Therefore neither the waker's nor the wakee's wakee_flips should be too large; when they are, wake affine should be disabled. The kernel uses wake_wide() to decide whether to allow wake affine.

3. wake_wide()

/* 
 * Detect M:N waker/wakee relationships via switching-frequency heuristics. 
 * 
 * A waker of many wakes a different task than the one it last woke at a 
 * frequency roughly N times higher than it wakes any single task. To decide 
 * whether we should let the load spread or consolidate onto shared cache, we 
 * look for a minimum 'flip' frequency of llc_size in one partner, and a 
 * frequency a factor of llc_size higher in the other. 
 * 
 * With both conditions met, we can be relatively sure the relationship is 
 * non-monogamous, with partner count exceeding socket size. Whether waker/wakee 
 * are client/server, worker/dispatcher, an interrupt source or whatever is 
 * irrelevant; the spread criterion is that the apparent partner count exceeds 
 * the socket size. 
 * 
 * Returns 0 when the wakeup relationship is fairly fixed (wake affine is allowed), otherwise returns 1. 
 */ 
static int wake_wide(struct task_struct *p) 
{ 
    unsigned int master = current->wakee_flips; 
    unsigned int slave = p->wakee_flips; 
    int factor = __this_cpu_read(sd_llc_size); //number of CPUs in this CPU's cluster 
 
    if (master < slave) 
        swap(master, slave); 
    //note this is an OR 
    if (slave < factor || master < slave * factor) 
        return 0; 
    return 1; 
}

When this function returns 0 and the current CPU is in task p's affinity mask, the placement is treated as want_affine. The OR here is debatable: it is enough for either master's or slave's wakee_flips to be small for wake affine to be allowed, which makes it too easy for tasks to pile up in one LLC domain. In a 1:N model (for example during a phone screen rotation, where one thread wakes a very large number of threads to handle the configChange message) the master's wakee_flips is huge while the slave's is tiny; still doing wake affine in that case is questionable. A worked example follows below.
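
A worked example of the heuristic, with hypothetical wakee_flips values, assuming sd_llc_size = 4 (4 CPUs per cluster):

/* 
 * factor = 4 (sd_llc_size) 
 * 
 *   1:1 pair:    master = 3,   slave = 1   -> slave < factor                -> return 0 (wake affine) 
 *   1:N server:  master = 64,  slave = 2   -> slave < factor                -> return 0 (the questionable case above) 
 *   M:N mesh:    master = 100, slave = 20  -> slave >= factor and 
 *                                            master >= slave * factor (80)  -> return 1 (spread) 
 */ 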

4. wake_affine()

/* 
 * Arguments from select_task_rq_fair: sd is the current CPU's MC- or DIE-level sd; p is the task being placed; this_cpu is the CPU we are running on;  
 * prev_cpu is the CPU the task last ran on; sync indicates whether the WF_SYNC (synchronous wakeup) flag was passed. 
 */ 
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int this_cpu, int prev_cpu, int sync) 
{ 
    int target = nr_cpumask_bits; 
 
    if (sched_feat(WA_IDLE)) 
        target = wake_affine_idle(this_cpu, prev_cpu, sync); 
 
    if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits) //if no CPU was picked above 
        target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync); 
 
    schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts); //bump the statistics counter 
    if (target == nr_cpumask_bits) 
        return prev_cpu; //if still no CPU was picked, return prev_cpu; otherwise return the chosen one 
 
    schedstat_inc(sd->ttwu_move_affine); 
    schedstat_inc(p->se.statistics.nr_wakeups_affine); 
    return target; 
} 
 
 
/* 
 * The purpose of wake_affine() is to quickly determine on which CPU we can run 
 * soonest. For the purpose of speed we only consider the waking and previous CPU. 
 * 
 * wake_affine_idle() - only considers 'now', it checks if the waking CPU is 
 *            cache-affine and is (or will be) idle. 
 * 
 * wake_affine_weight() - considers the weight to reflect the average 
 *              scheduling latency of the CPUs. This seems to work 
 *              for the overloaded case. 
 * 
 * wake_affine arguments: this_cpu is the CPU we are running on; prev_cpu is the CPU the task last ran on; 
 * sync indicates whether the WF_SYNC synchronous-wakeup flag was passed. 
 */ 
static int wake_affine_idle(int this_cpu, int prev_cpu, int sync) 
{ 
    /* 
     * If this_cpu is idle, it implies the wakeup is from interrupt 
     * context. Only allow the move if cache is shared. Otherwise an 
     * interrupt intensive workload could force all tasks onto one 
     * node depending on the IO topology or IRQ affinity settings. 
     * 
     * If the prev_cpu is idle and cache affine then avoid a migration. 
     * There is no guarantee that the cache hot data from an interrupt 
     * is more important than cache hot data on the prev_cpu and from 
     * a cpufreq perspective, it's better to have higher utilisation 
     * on one CPU. 
     */ 
    //If the CPU we are running on is idle (i.e. the wakeup comes from interrupt context) and it shares cache with the task's prev_cpu (same cluster) 
    if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu)) 
        //prefer prev_cpu if it is also idle, otherwise the currently idle this_cpu 
        return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu; 
 
    /* 
     * For a sync wakeup with only one task (the waker itself) running on this CPU, return this_cpu, 
     * because with a sync wakeup the waker is about to go to sleep. 
     */ 
    if (sync && cpu_rq(this_cpu)->nr_running == 1) 
        return this_cpu; 
 
    return nr_cpumask_bits; //returned when no CPU was picked 
} 
 
 
/* 
 * wake_affine arguments: sd is the running CPU's MC- or DIE-level sd; p is the task being placed; this_cpu is the CPU we are running on; prev_cpu is the CPU the task last ran on;  
 * sync indicates whether the WF_SYNC synchronous-wakeup flag was passed. 
 */ 
 */ 
static int wake_affine_weight(struct sched_domain *sd, struct task_struct *p, int this_cpu, int prev_cpu, int sync) 
{ 
    s64 this_eff_load, prev_eff_load; 
    unsigned long task_load; 
 
    //load of the CFS tasks on the CPU we are running on 
    this_eff_load = cpu_load(cpu_rq(this_cpu)); //return cfs_rq->avg.load_avg; 
 
    if (sync) { 
        unsigned long current_load = task_h_load(current); //roughly current->se.avg.load_avg 
        //current's load larger than the total CFS load on this CPU? presumably defensive error handling 
        if (current_load > this_eff_load) 
            return this_cpu; 
 
        this_eff_load -= current_load; 
    } 
 
    task_load = task_h_load(p); //roughly p->se.avg.load_avg 
 
    this_eff_load += task_load; //total CFS load on this CPU if p migrated here (for sync, current's load was already removed above) 
    if (sched_feat(WA_BIAS)) 
        this_eff_load *= 100; //multiply by 100 
 
    this_eff_load *= capacity_of(prev_cpu); //then multiply by the CPU capacity available to CFS tasks (after thermal/irq/rt/dl pressure) -- note it is prev_cpu's capacity! 
 
    prev_eff_load = cpu_load(cpu_rq(prev_cpu)); //prev_cpu::rq.cfs_rq->avg.load_avg 
    prev_eff_load -= task_load; //subtract p's load: prev_cpu's load still accounts for p, and we evaluate prev_cpu as if p had left it 
    if (sched_feat(WA_BIAS)) 
        prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2; //MC and DIE both use 100 + (117-100)/2 = 108 
    prev_eff_load *= capacity_of(this_cpu); //and this one is multiplied by this_cpu's CFS-available capacity! 
 
    /* 
     * If sync, adjust the weight of prev_eff_load such that if 
     * prev_eff == this_eff that select_idle_sibling() will consider 
     * stacking the wakee on top of the waker if no other CPU is 
     * idle. 
     * (The "+ 1" below is that adjustment: it breaks exact ties after the scaling above, so that 
     * this_eff_load < prev_eff_load holds when both sides are equal.) 
     */ 
    if (sync) 
        prev_eff_load += 1; 
 
    /* Rearranged: return this_cpu when (this_load + p_load)/this_cap * 100 < (prev_load - p_load)/prev_cap * 108 (imbalance_pct = 117; integer division gives 108, i.e. an ~8% bias in favour of pulling the task to this_cpu) */ 
    return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits; 
}
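
A quick numeric example (hypothetical values, non-sync case) to make the comparison concrete:

/* 
 * this_cpu:  cpu_load = 300, capacity_of = 1024  (a lightly loaded big core) 
 * prev_cpu:  cpu_load = 500, capacity_of = 512   (a busier little core) 
 * task_h_load(p) = 100, imbalance_pct = 117 
 * 
 * this_eff_load = (300 + 100) * 100 * 512  = 20,480,000 
 * prev_eff_load = (500 - 100) * 108 * 1024 = 44,236,800 
 * 
 * this_eff_load < prev_eff_load, so wake_affine_weight() returns this_cpu: 
 * the task is pulled toward the waker's less loaded, higher-capacity CPU. 
 */ 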

III. The slow path

Only fork and exec placement take the slow path. The slow-path function is find_idlest_cpu(), and the sd passed to it can only be the DIE-level one.

/* 
 * Only fork or exec placement can reach this function. 
 * 
 * Arguments: sd is the current CPU's DIE-level sd (it can only be DIE level); p is the task being placed; cpu is the current CPU; prev_cpu is the CPU p last ran on; sd_flag indicates the placement type. 
 * 
 * Purpose: find the idlest CPU within the idlest sg; if none is found, return the current CPU. 
 */ 
static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p, int cpu, int prev_cpu, int sd_flag) 
{ 
    int new_cpu = cpu; 
 
    //none of the CPUs in this sd are in task p's affinity mask 
    if (!cpumask_intersects(sched_domain_span(sd), p->cpus_ptr)) 
        return prev_cpu; 
 
    /* 
     * We need task's util for cpu_util_without, sync it up to prev_cpu's last_update_time. 
     */ 
    //only for exec placement is the task's load updated here 
    if (!(sd_flag & SD_BALANCE_FORK)) 
        sync_entity_load_avg(&p->se); 
 
    while (sd) { 
        struct sched_group *group; 
        struct sched_domain *tmp; 
        int weight; 
 
        if (!(sd->flags & sd_flag)) { //never true here, since both MC and DIE support fork/exec placement 
            sd = sd->child; 
            continue; 
        } 
 
        //if NULL is returned, we effectively end up with the current CPU 
        group = find_idlest_group(sd, p, cpu); 
        if (!group) { 
            sd = sd->child; 
            continue; 
        } 
 
        //having found the idlest sg in sd, pick within it the idle CPU with the smallest exit latency (most recently idled on a tie); if there is no idle CPU, pick the least loaded one 
        new_cpu = find_idlest_group_cpu(group, p, cpu); 
        if (new_cpu == cpu) { 
            /* Now try balancing at a lower domain level of 'cpu': */ 
            sd = sd->child; 
            continue; 
        } 
 
        /* Now try balancing at a lower domain level of 'new_cpu': */ 
        cpu = new_cpu; 
        weight = sd->span_weight; 
        sd = NULL; 
        for_each_domain(cpu, tmp) { 
            if (weight <= tmp->span_weight) //stop once this level is at least as wide as the one just searched; a narrower level (e.g. new_cpu's MC domain) becomes the next sd 
                break; 
            if (tmp->flags & sd_flag) 
                sd = tmp; 
        } 
    } 
 
    return new_cpu; 
}

1. find_idlest_group()

Find the idlest sg within the sd of the current CPU.

/* 
 * find_idlest_group() finds and returns the least busy CPU group within the domain. 
 * 
 * Assumes p is allowed on at least one CPU in sd. 
 * 
 * Arguments from find_idlest_cpu: sd is the sd currently being searched (the current CPU's DIE-level sd on the first iteration); p is the task being placed; this_cpu is the current CPU. 
 */ 
static struct sched_group *find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) 
{ 
    struct sched_group *idlest = NULL, *local = NULL, *group = sd->groups; 
    struct sg_lb_stats local_sgs, tmp_sgs; 
    struct sg_lb_stats *sgs; 
    unsigned long imbalance; 
    struct sg_lb_stats idlest_sgs = { 
            .avg_load = UINT_MAX, 
            .group_type = group_overloaded, //initialised to the busiest type 
    }; 
 
    //1024 * (sd->imbalance_pct - 100) / 100, i.e. 1024 * 17 / 100 = 174 for imbalance_pct = 117 
    imbalance = scale_load_down(NICE_0_LOAD) * (sd->imbalance_pct-100) / 100; 
 
    //iterate over every sched group (sg) of this sd 
    do { 
        int local_group; 
 
        /* Skip over this group if it has no CPUs allowed */ 
        //skip groups that have no CPU allowed by p's affinity mask 
        if (!cpumask_intersects(sched_group_span(group), p->cpus_ptr)) 
            continue; 
 
        //is the CPU we are running on part of this group? 
        local_group = cpumask_test_cpu(this_cpu, sched_group_span(group)); 
        if (local_group) { 
            sgs = &local_sgs; 
            local = group; 
        } else { 
            sgs = &tmp_sgs; 
        } 
 
        update_sg_wakeup_stats(sd, group, sgs, p); 
 
        //compare group against the current idlest to see whether group is idler 
        if (!local_group && update_pick_idlest(idlest, &idlest_sgs, group, sgs)) { 
            idlest = group; 
            idlest_sgs = *sgs; 
        } 
 
    } while (group = group->next, group != sd->groups); 
 
 
    /* There is no idlest group to push tasks to */ 
    //no idlest sg was found: pick the current CPU 
    if (!idlest) 
        return NULL; 
 
    /* The local group has been skipped because of CPU affinity */ 
    //if this sd has no local group, return idlest directly; otherwise it still has to be compared against the local group. In our case the local group always exists. 
    if (!local) 
        return idlest; 
 
    /* 
     * If the local group is idler than the selected idlest group 
     * don't try and push the task. 
     */ 
    //the local sg is idler: pick the current CPU 
    if (local_sgs.group_type < idlest_sgs.group_type) 
        return NULL; 
 
    /* 
     * If the local group is busier than the selected idlest group 
     * try and push the task. 
     */ 
    if (local_sgs.group_type > idlest_sgs.group_type) 
        return idlest; 
 
    //from here on, the local group and the idlest group are equally idle (same group_type): 
    switch (local_sgs.group_type) { 
    case group_overloaded: 
    case group_fully_busy: 
        /* 
         * When comparing groups across NUMA domains, it's possible for 
         * the local domain to be very lightly loaded relative to the 
         * remote domains but "imbalance" skews the comparison making 
         * remote CPUs look much more favourable. When considering 
         * cross-domain, add imbalance to the load on the remote node 
         * and consider staying local. 
         */ 
        //big.LITTLE systems do not set SD_NUMA, so this never runs 
        if ((sd->flags & SD_NUMA) && ((idlest_sgs.avg_load + imbalance) >= local_sgs.avg_load)) 
            return NULL; 
 
        /* 
         * If the local group is less loaded than the selected 
         * idlest group don't try and push any tasks. 
         */ 
        //the idlest sg's avg_load exceeds the local sg's by at least the imbalance threshold (~174): stay on the current CPU 
        if (idlest_sgs.avg_load >= (local_sgs.avg_load + imbalance)) 
            return NULL; 
 
        //local_avg_load <= 117% of idlest_avg_load (i.e. within sd->imbalance_pct): stay on the current CPU 
        if (100 * local_sgs.avg_load <= sd->imbalance_pct * idlest_sgs.avg_load) 
            return NULL; 
        break; 
 
    case group_imbalanced: 
    case group_asym_packing: 
        /* Those type are not used in the slow wakeup path */ 
        return NULL; //pick the current CPU 
 
    case group_misfit_task: 
        /* Select group with the highest max capacity */ 
        if (local->sgc->max_capacity >= idlest->sgc->max_capacity) 
            return NULL; //pick the current CPU 
        break; 
 
    case group_has_spare: 
        if (sd->flags & SD_NUMA) { 
#ifdef CONFIG_NUMA_BALANCING //not enabled here, so this block does not run 
            int idlest_cpu; 
            /* 
             * If there is spare capacity at NUMA, try to select 
             * the preferred node 
             */ 
            if (cpu_to_node(this_cpu) == p->numa_preferred_nid) 
                return NULL; 
 
            idlest_cpu = cpumask_first(sched_group_span(idlest)); 
            if (cpu_to_node(idlest_cpu) == p->numa_preferred_nid) 
                return idlest; 
#endif 
            /* 
             * Otherwise, keep the task on this node to stay close 
             * its wakeup source and improve locality. If there is 
             * a real need of migration, periodic load balance will 
             * take care of it. 
             */ 
            if (local_sgs.idle_cpus) 
                return NULL;  //if the local sg has any idle CPU, pick the current CPU 
        } 
 
        /* 
         * Select group with highest number of idle CPUs. We could also 
         * compare the utilization which is more stable but it can end 
         * up that the group has less spare capacity but finally more 
         * idle CPUs which means more opportunity to run task. 
         */ 
        //if the local sg has at least as many idle CPUs as the idlest sg, pick the current CPU 
        if (local_sgs.idle_cpus >= idlest_sgs.idle_cpus) 
            return NULL; 
        break; 
    } 
 
    return idlest; 
}

update_sg_wakeup_stats fills in the sgs statistics:

/* 
 * update_sg_wakeup_stats - Update sched_group's statistics for wakeup. 
 * @sd: The sched_domain level to look for idlest group. 
 * @group: sched_group whose statistics are to be updated. 
 * @sgs: variable to hold the statistics for this group. 
 * @p: The task for which we look for the idlest group/CPU. 
 * 
 * Arguments from find_idlest_group: sd is the sd currently being searched (DIE level on the first iteration); group is one sg within sd; sgs is the (uninitialised) stats structure to fill; p is the task being placed 
 * Purpose: fill in sgs  
 */ 
static inline void update_sg_wakeup_stats(struct sched_domain *sd, 
                      struct sched_group *group, struct sg_lb_stats *sgs, struct task_struct *p) 
{ 
    int i, nr_running; 
 
    memset(sgs, 0, sizeof(*sgs)); 
 
    //at the MC level an sg contains a single CPU; at the DIE level it spans a whole cluster 
    for_each_cpu(i, sched_group_span(group)) { 
        struct rq *rq = cpu_rq(i); 
        unsigned int local; 
 
        sgs->group_load += cpu_load_without(rq, p); 
        sgs->group_util += cpu_util_without(i, p); 
        sgs->group_runnable += cpu_runnable_without(rq, p); 
        local = task_running_on_cpu(i, p); //is p queued on this CPU's rq? in this scenario it should return 0 
        sgs->sum_h_nr_running += rq->cfs.h_nr_running - local; 
 
        nr_running = rq->nr_running - local; 
        sgs->sum_nr_running += nr_running; 
 
        /* 
         * No need to call idle_cpu_without() if nr_running is not 0 
         */ 
        if (!nr_running && idle_cpu_without(i, p)) 
            sgs->idle_cpus++; 
 
    } 
 
    /* Check if task fits in the group */ 
    //SD_ASYM_CPUCAPACITY is only set at the DIE level, so this check only takes effect when the DIE-level sd is being searched 
    if (sd->flags & SD_ASYM_CPUCAPACITY && !task_fits_capacity(p, group->sgc->max_capacity)) { 
        sgs->group_misfit_task_load = 1; 
    } 
 
    sgs->group_capacity = group->sgc->capacity; 
    sgs->group_weight = group->group_weight; 
    sgs->group_type = group_classify(sd->imbalance_pct, group, sgs); 
 
    /* 
     * Computing avg_load makes sense only when group is fully busy or overloaded 
     */ 
    if (sgs->group_type == group_fully_busy || sgs->group_type == group_overloaded) 
        sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) / sgs->group_capacity; 
} 
 
 
//determine the group_type from imbalance_pct and sgs 
static inline enum group_type group_classify(unsigned int imbalance_pct, struct sched_group *group, struct sg_lb_stats *sgs) 
{ 
    /* 
     * group_is_overloaded() returns false while sgs->sum_nr_running <= sgs->group_weight, i.e. a group is never considered overloaded as long as it has at least as many CPUs as tasks. 
     * It returns true when group_util/group_capacity > imbalance_pct/100 = 117% or group_runnable/group_capacity > imbalance_pct/100 = 117%. 
     */ 
    if (group_is_overloaded(imbalance_pct, sgs)) 
        return group_overloaded; 
 
    if (sg_imbalanced(group)) //return group->sgc->imbalance; 
        return group_imbalanced; 
 
    if (sgs->group_asym_packing) 
        return group_asym_packing; 
 
    if (sgs->group_misfit_task_load) 
        return group_misfit_task; 
 
    if (!group_has_capacity(imbalance_pct, sgs)) 
        return group_fully_busy; 
 
    return group_has_spare; 
}
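
The comparisons in group_classify() above and in update_pick_idlest() below rely on the ordering of the group_type enum, from idlest to busiest. For reference, in v5.10-era fair.c the enum looks roughly like this:

enum group_type { 
    group_has_spare = 0,   /* the group has spare capacity for more tasks */ 
    group_fully_busy,      /* fully used, but tasks don't compete for CPU cycles */ 
    group_misfit_task,     /* a task doesn't fit the CPUs' capacity (asym capacity only) */ 
    group_asym_packing,    /* asym packing candidate */ 
    group_imbalanced,      /* affinity constraints prevented proper balancing */ 
    group_overloaded       /* more runnable tasks than the group can serve */ 
}; 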

The subsequent comparison of group against the current idlest, to see whether group is idler:

static bool update_pick_idlest(struct sched_group *idlest, struct sg_lb_stats *idlest_sgs, 
                   struct sched_group *group, struct sg_lb_stats *sgs) 
{ 
    if (sgs->group_type < idlest_sgs->group_type) 
        return true; 
 
    if (sgs->group_type > idlest_sgs->group_type) 
        return false; 
 
    /* 
     * The candidate and the current idlest group are the same type of 
     * group. Let check which one is the idlest according to the type. 
     */ 
    //same group_type: compare which one is idler 
    switch (sgs->group_type) { 
    case group_overloaded: 
    case group_fully_busy: //the sg with the lower avg_load is idler 
        /* Select the group with lowest avg_load. */ 
        if (idlest_sgs->avg_load <= sgs->avg_load) 
            return false; 
        break; 
 
    case group_imbalanced: 
    case group_asym_packing: 
        /* Those types are not used in the slow wakeup path */ 
        return false; 
 
    case group_misfit_task: //the sg with the higher max capacity is idler 
        /* Select group with the highest max capacity */ 
        if (idlest->sgc->max_capacity >= group->sgc->max_capacity) 
            return false; 
        break; 
 
    case group_has_spare: //the sg with more idle CPUs is idler 
        /* Select group with most idle CPUs */ 
        if (idlest_sgs->idle_cpus > sgs->idle_cpus) 
            return false; 
 
        /* Select group with lowest group_util */ 
        //with an equal number of idle CPUs, the one with the lower group_util is idler 
        if (idlest_sgs->idle_cpus == sgs->idle_cpus && idlest_sgs->group_util <= sgs->group_util) 
            return false; 
 
        break; 
    } 
 
    return true; 
}

2. find_idlest_group_cpu()

/* 
 * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group. 
 * 
 * Arguments from find_idlest_cpu: group is the idlest sg found by find_idlest_group(); p is the task being placed; this_cpu is the CPU we are running on 
 */ 
static int find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu) 
{ 
    unsigned long load, min_load = ULONG_MAX; 
    unsigned int min_exit_latency = UINT_MAX; 
    u64 latest_idle_timestamp = 0; 
    int least_loaded_cpu = this_cpu; 
    int shallowest_idle_cpu = -1; 
    int i; 
 
    /* Check if we have any choice: */ 
    //an sg from an MC-level sd contains a single CPU, so we return right here; an sg from the DIE-level sd spans a whole cluster, so we keep going 
    if (group->group_weight == 1) 
        return cpumask_first(sched_group_span(group)); 
 
    /* Traverse only the allowed CPUs */ 
    //for a DIE-level sg, iterate over the CPUs of this cluster that p's affinity allows 
    for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) { 
        if (sched_idle_cpu(i)) //if this CPU's rq only has SCHED_IDLE tasks, pick it 
            return i; 
 
        //is this an idle CPU: rq->curr == rq->idle && rq->nr_running == 0 && rq->ttwu_pending == 0 
        if (available_idle_cpu(i)) { 
            struct rq *rq = cpu_rq(i); 
            struct cpuidle_state *idle = idle_get_state(rq); //return rq->idle_state; 
            if (idle && idle->exit_latency < min_exit_latency) { 
                /* 
                 * We give priority to a CPU whose idle state has the smallest exit latency irrespective 
                 * of any idle timestamp. 
                 */ 
                min_exit_latency = idle->exit_latency; //record the smallest exit latency 
                latest_idle_timestamp = rq->idle_stamp; 
                shallowest_idle_cpu = i; //record the CPU with the smallest exit latency 
            } else if ((!idle || idle->exit_latency == min_exit_latency) && rq->idle_stamp > latest_idle_timestamp) { //equal (or no) exit latency, but this CPU went idle more recently 
                /* 
                 * If equal or no active idle state, then the most recently idled CPU might have 
                 * a warmer cache. 
                 */ 
                latest_idle_timestamp = rq->idle_stamp; 
                shallowest_idle_cpu = i; 
            } 
        } else if (shallowest_idle_cpu == -1) { //a non-idle CPU, and no idle CPU has been found so far 
            load = cpu_load(cpu_rq(i)); 
            if (load < min_load) { 
                min_load = load; //record the least loaded CPU 
                least_loaded_cpu = i; 
            } 
        } 
    } 
 
    //if there is an idle CPU, pick it; otherwise pick the least loaded CPU 
    return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu; 
}

Similar to how load_balance() looks for the busiest CPU, the slow path here looks for the idlest one: first find the idlest sg within the sd, then find the idlest CPU within that sg. "Idlest" means that within the chosen cluster an idle CPU is preferred if one exists; if there is none, the CPU with the smallest load is picked. A condensed sketch of the slow path follows.
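
Roughly, the slow path boils down to two nested searches (a simplification of the functions above, not verbatim kernel code):

/* 
 * find_idlest_cpu(sd = DIE-level sd, ...): 
 *   group   = find_idlest_group(sd, ...);        // pick the idlest sg (cluster) in the DIE domain 
 *   new_cpu = find_idlest_group_cpu(group, ...); // within that cluster: a SCHED_IDLE CPU, else the idle 
 *                                                // CPU with the smallest exit latency (most recently 
 *                                                // idled on a tie), else the least loaded CPU 
 *   then descend to new_cpu's MC-level domain and repeat once at that lower level 
 */ 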

IV. The fast path

Only wakeup placement takes the fast path. Although the function name says it selects a sibling CPU, when no idle CPU is found it will also search other clusters.

/* 
 * Try and locate an idle core/thread in the LLC cache domain. 
 * 
 * Arguments from select_task_rq_fair: p is the task being placed; prev is the CPU the task last ran on; target is the candidate target CPU, either prev_cpu or the CPU picked by wake_affine 
 */ 
static int select_idle_sibling(struct task_struct *p, int prev, int target) 
{ 
    struct sched_domain *sd; 
    unsigned long task_util; 
    int i, recent_used_cpu; 
 
    /* 
     * On asymmetric system, update task utilization because we will check 
     * that the task fits with cpu's capacity. 
     */ 
    if (static_branch_unlikely(&sched_asym_cpucapacity)) { 
        sync_entity_load_avg(&p->se); 
        task_util = uclamp_task_util(p); //clamp util_est between the uclamp MIN and MAX 
    } 
 
    /* 
     * If the target passed in is an idle CPU whose capacity can hold the task's (uclamped) util, pick it directly. 
     * In the wake-affine case target is the CPU found by wake_affine(); in all other cases 
     * target is effectively prev_cpu. 
     */ 
    if ((available_idle_cpu(target) || sched_idle_cpu(target)) && asym_fits_capacity(task_util, target)) 
        return target; 
 
    /* 
     * If the previous CPU is cache affine and idle, don't be stupid: 
     * 
     * i.e. if target was picked by wake_affine, prev_cpu shares a cluster (LLC) with it, prev_cpu is idle, and prev_cpu's capacity can hold p's uclamped util, pick prev_cpu 
     */ 
    if (prev != target && cpus_share_cache(prev, target) && (available_idle_cpu(prev) || sched_idle_cpu(prev)) && asym_fits_capacity(task_util, prev)) 
        return prev; 
 
    /* 
     * Allow a per-cpu kthread to stack with the wakee if the kworker thread and the task's previous CPUs are the same. 
     * The assumption is that the wakee queued work for the per-cpu kthread that is now complete and the wakeup is 
     * essentially a sync wakeup. An obvious example of this pattern is IO completions. 
     * 
     * In other words: if current is a per-cpu kernel thread, the task's prev_cpu is this very CPU, and at most one task is running on this CPU, pick prev_cpu. 
     */ 
    if (is_per_cpu_kthread(current) && prev == smp_processor_id() && this_rq()->nr_running <= 1) { 
        return prev; 
    } 
 
    /* Check a recently used CPU as a potential idle candidate: */ 
    /* 
     * Where recent_used_cpu gets updated: 
     * select_task_rq_fair: sets current->recent_used_cpu to the CPU we are running on 
     * select_idle_sibling: sets the candidate task's p->recent_used_cpu to prev_cpu 
     * 
     * If p->recent_used_cpu is neither prev_cpu nor target, but shares a cluster with target, is idle, is allowed by p's affinity, 
     * and its capacity can hold the task, then pick p->recent_used_cpu. 
     */ 
    recent_used_cpu = p->recent_used_cpu; 
    if (recent_used_cpu != prev && recent_used_cpu != target && cpus_share_cache(recent_used_cpu, target) && 
        (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) && cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) && 
        asym_fits_capacity(task_util, recent_used_cpu)) { 
        /* 
         * Replace recent_used_cpu with prev as it is a potential candidate for the next wake: 
         */ 
        p->recent_used_cpu = prev; 
        return recent_used_cpu; //place p back on a CPU it used recently 
    } 
 
    /* 
     * For asymmetric CPU capacity systems, our domain of interest is sd_asym_cpucapacity rather than sd_llc. 
     */ 
    if (static_branch_unlikely(&sched_asym_cpucapacity)) { 
        sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target)); //target's DIE-level sd 
        /* 
         * On an asymmetric CPU capacity system where an exclusive cpuset defines a symmetric island (i.e. one unique 
         * capacity_orig value through the cpuset), the key will be set but the CPUs within that cpuset will not have 
         * a domain with SD_ASYM_CPUCAPACITY. These should follow the usual symmetric capacity path. 
         */ 
        if (sd) { 
            //select_idle_capacity() may return -1; the (unsigned) cast below makes -1 compare larger than nr_cpumask_bits, so we fall back to target in that case 
            i = select_idle_capacity(p, sd, target); 
            return ((unsigned)i < nr_cpumask_bits) ? i : target; //we return here, so the code below is not reached on asymmetric systems 
        } 
    } 
 
    /* From here on is the path taken on non-big.LITTLE (symmetric capacity) systems */ 
 
    sd = rcu_dereference(per_cpu(sd_llc, target)); //target's MC-level sd 
    if (!sd) 
        return target; 
 
    i = select_idle_core(p, sd, target); //a stub returning -1 when CONFIG_SCHED_SMT is not set 
    if ((unsigned)i < nr_cpumask_bits) 
        return i; 
 
    i = select_idle_cpu(p, sd, target); //scans the LLC domain for an idle CPU (not SMT-dependent) 
    if ((unsigned)i < nr_cpumask_bits) 
        return i; 
 
    i = select_idle_smt(p, sd, target); //a stub returning -1 when CONFIG_SCHED_SMT is not set 
    if ((unsigned)i < nr_cpumask_bits) 
        return i; 
 
    return target; 
} 
 
 
/* 
 * Arguments from select_idle_sibling: p is the task being placed; sd is target's DIE-level sd; target is the candidate target CPU 
 * 
 * Purpose: starting from target, look for an idle CPU that can hold task p; if no idle CPU can hold it, 
 * return the highest-capacity idle CPU; if there is no idle CPU at all, return -1. 
 */ 
 */ 
static int select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target) 
{ 
    unsigned long task_util, best_cap = 0; 
    int cpu, best_cpu = -1; 
    struct cpumask *cpus; 
 
    cpus = this_cpu_cpumask_var_ptr(select_idle_mask); 
    //initialise the scratch cpumask before using it: CPUs of the sd that p's affinity allows 
    cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); 
 
    task_util = uclamp_task_util(p); 
 
    //iterate starting from target, so CPUs near the candidate target are considered first 
    for_each_cpu_wrap(cpu, cpus, target) { 
        unsigned long cpu_cap = capacity_of(cpu); 
 
        if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu)) //skip this CPU if it is not idle 
            continue; 
        if (fits_capacity(task_util, cpu_cap)) //return an idle CPU that can hold the task 
            return cpu; 
 
        if (cpu_cap > best_cap) { //if none can hold it, remember the highest-capacity idle CPU 
            best_cap = cpu_cap; 
            best_cpu = cpu; 
        } 
    } 
 
    return best_cpu; 
}
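
For reference, fits_capacity() used above is a simple ~80% headroom check; in v5.10-era fair.c it is defined roughly as below (asym_fits_capacity() wraps it for asymmetric-capacity systems):

/* a task's util "fits" a CPU if it uses at most ~80% of that CPU's available capacity */ 
#define fits_capacity(cap, max)    ((cap) * 1280 < (max) * 1024) 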

1. This function needs p's util, so the task's load is synced first.

2. The fast path checks, in order:
(1) If the target passed in is an idle CPU whose capacity can hold the task's uclamped util, pick that target_cpu directly.
(2) If the target was picked by wake_affine, prev_cpu shares a cluster (cache) with it, prev_cpu is idle, and prev_cpu can hold the task's uclamped util, pick prev_cpu.
(3) If current is a per-cpu kernel thread, the task's prev_cpu is the current CPU, and at most one task is running on this CPU, pick prev_cpu. A typical example is IO completion.
(4) If p->recent_used_cpu is neither prev_cpu nor target_cpu, but shares a cluster with target_cpu, is idle, is allowed by p's affinity, and can hold the task, pick p->recent_used_cpu.
(5) Starting from target_cpu, look for an idle CPU that can hold task p; if no idle CPU can hold it, pick the highest-capacity idle CPU; if there is no idle CPU at all, select_idle_capacity() returns -1 and the caller falls back to target.

3. Note: when select_idle_capacity() returns -1, the caller's (unsigned)i < nr_cpumask_bits check makes it fall back to target, so the -1 case is covered.

V. Summary

1. There are three task placement paths, with sd_flag set to SD_BALANCE_WAKE (wakeup), SD_BALANCE_FORK (newly forked task) or SD_BALANCE_EXEC (exec). EAS placement and wake-affine placement apply only to the wakeup case: once EAS picks a CPU the function returns immediately, whereas a CPU picked by wake-affine still goes through the fast path, which tries to find an idle CPU (or prev_cpu) within the current CPU's cluster and, failing that, ends up searching across all clusters. Fork and exec placement take the slow path, which picks the idlest CPU in the system as the target.

Reference: https://blog.csdn.net/feelabclihu/article/details/122007603?spm=1001.2014.3001.5501


Source: https://www.cnblogs.com/hellokitty2/p/15750931.html