
Linux Scheduler: RT Load Balancing

July 19, 2022 · grandyang

Based on Linux-4.19.153

I. Relevant structure members

1. struct root_domain

The real-time scheduler needs several global, system-wide resources to make its scheduling decisions, and as the number of CPUs grows, contention on the locks protecting those resources becomes a scalability bottleneck. Root domains were introduced to reduce this contention and improve scalability.
cpusets provide a mechanism for partitioning CPUs into subsets used by a process or a group of processes. Several cpusets may overlap. A cpuset is called "exclusive" if no other cpuset contains overlapping CPUs. Each exclusive cpuset defines an isolated domain (also called a root domain) of CPUs partitioned off from other cpusets or CPUs. The information relevant to each root domain is kept in the struct root_domain object:

//kernel/sched/sched.h 
/* 
 * We add the notion of a root-domain which will be used to define per-domain 
 * variables. Each exclusive cpuset essentially defines an island domain by 
 * fully partitioning the member CPUs from any other cpuset. Whenever a new 
 * exclusive cpuset is created, we also create and attach a new root-domain 
 * object. 
 */ 
 
struct root_domain { 
    // Reference count of the root domain: incremented when a runqueue takes a reference to this rd, decremented when it drops it 
    atomic_t        refcount; 
    // Number of CPUs that are RT overloaded 
    atomic_t        rto_count; 
    struct rcu_head        rcu; 
    // Mask of the CPUs belonging to this root domain 
    cpumask_var_t        span; 
    cpumask_var_t        online; 
 
    /* 
     * Indicate pullable load on at least one CPU, e.g: 
     * - More than one runnable task 
     * - Running task is misfit 
     */ 
    // Set when any CPU in this rd has more than one runnable task 
    int            overload; 
 
    /* Indicate one or more cpus over-utilized (tipping point) */ 
    int            overutilized; 
 
    /* 
     * The bit corresponding to a CPU gets set here if such CPU has more 
     * than one runnable -deadline task (as it is below for RT tasks). 
     */ 
    cpumask_var_t        dlo_mask; 
    atomic_t        dlo_count; 
    struct dl_bw        dl_bw; 
    struct cpudl        cpudl; 
 
    /* 
     * Indicate whether a root_domain's dl_bw has been checked or 
     * updated. It's monotonously increasing value. 
     * 
     * Also, some corner cases, like 'wrap around' is dangerous, but given 
     * that u64 is 'big enough'. So that shouldn't be a concern. 
     */ 
    u64 visit_gen; 
 
#ifdef HAVE_RT_PUSH_IPI 
    /* 
     * For IPI pull requests, loop across the rto_mask. 
     */ 
    struct irq_work        rto_push_work; 
    raw_spinlock_t        rto_lock; 
    /* These are only updated and read within rto_lock */ 
    int            rto_loop; 
    int            rto_cpu; 
    /* These atomics are updated outside of a lock */ 
    atomic_t        rto_loop_next; 
    atomic_t        rto_loop_start; 
#endif 
    /* 
     * The "RT overload" flag: it gets set if a CPU has more than 
     * one runnable RT task. 
     */ 
    // A CPU's bit is set here when it has more than one runnable RT task 
    cpumask_var_t        rto_mask; 
    // CPU priority management structure for the CPUs in this rd 
    struct cpupri        cpupri; 
 
    unsigned long        max_cpu_capacity; 
 
    /* 
     * NULL-terminated list of performance domains intersecting with the 
     * CPUs of the rd. Protected by RCU. 
     */ 
    struct perf_domain __rcu *pd; 
};

These root domains are used to narrow the scope of what used to be global variables down to per-domain variables. Whenever an exclusive cpuset is created, a new root domain object is created as well, with its information taken from the member CPUs. By default, a single top-level root domain is created with all CPUs as members. All real-time scheduling decisions are made only within the scope of one root domain.

2. struct task_struct

struct task_struct { 
    ... 
    struct sched_rt_entity        rt; 
    #ifdef CONFIG_SMP 
        /* An eligible RT task is linked into rq->rt.pushable_tasks through this member, marking it as pushable */ 
        struct plist_node        pushable_tasks; 
        struct rb_node            pushable_dl_tasks; 
    #endif 
    ... 
};

II. CPU priority management

1. CPU Priority Management tracks the priority of every CPU in the system so that task-migration decisions can be made more efficiently. There are 102 CPU priority levels; the mapping between cpupri and prio is as follows:

//kernel/sched/cpupri.h 
cpupri                    prio 
---------------------------- 
CPUPRI_INVALID (-1)        -1 
CPUPRI_IDLE(0)            MAX_PRIO(140) 
CPUPRI_NORMAL(1)        MAX_RT_PRIO ~ MAX_PRIO-1 (100~139) 
2~101                    99~0

Note that a CPU running the idle task has cpupri = 0, and a CPU running a CFS task has cpupri = 1.

static int convert_prio(int prio) 
{ 
    int cpupri; 
 
    if (prio == CPUPRI_INVALID) /* -1 */ 
        cpupri = CPUPRI_INVALID; /* -1 */ 
    else if (prio == MAX_PRIO) /* 140 */ 
        cpupri = CPUPRI_IDLE; /* 0 */ 
    else if (prio >= MAX_RT_PRIO) /* 100 */ 
        cpupri = CPUPRI_NORMAL; /* 1 */ 
    else 
        cpupri = MAX_RT_PRIO - prio + 1; /* 100 - prio + 1 */ 
 
    return cpupri; 
}

For example, passing prio = 99 returns 2, passing prio = 0 returns 101, and passing prio = 100 returns 1 (CPUPRI_NORMAL).
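
To double-check the mapping, here is a small userspace model of convert_prio() (a standalone sketch that copies the kernel's constant values; it is not kernel code):

#include <stdio.h>

#define CPUPRI_INVALID (-1)
#define CPUPRI_IDLE     0
#define CPUPRI_NORMAL   1
#define MAX_RT_PRIO   100
#define MAX_PRIO      140

/* Userspace copy of the convert_prio() logic shown above, for illustration. */
static int convert_prio(int prio)
{
    if (prio == CPUPRI_INVALID)
        return CPUPRI_INVALID;
    if (prio == MAX_PRIO)
        return CPUPRI_IDLE;
    if (prio >= MAX_RT_PRIO)
        return CPUPRI_NORMAL;
    return MAX_RT_PRIO - prio + 1;   /* RT prio 99..0 -> cpupri 2..101 */
}

int main(void)
{
    int samples[] = { -1, 140, 120, 100, 99, 50, 0 };

    for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        printf("prio %4d -> cpupri %4d\n", samples[i], convert_prio(samples[i]));
    return 0;
}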

A larger cpupri value means a higher priority (because of the subtraction above). A CPU in the CPUPRI_INVALID state is not eligible to take part in task routing. cpupri belongs to the root domain: each exclusive cpuset comes with a root domain that contains its own cpupri data. The system maintains these CPU states as a bitmap spanning two dimensions:
(1) the priority of each CPU, mapped from the priority of the task it is running;
(2) for each priority, which CPUs are currently at that priority.


2. Related data structures

//kernel/sched/cpupri.h 
struct cpupri_vec { 
    // Number of CPUs at this priority 
    atomic_t        count; 
    // Bitmask of the CPUs at this priority 
    cpumask_var_t        mask; 
}; 
 
//Captures the two dimensions 
struct cpupri { 
    // Holds, for each priority level, information about all CPUs of the cpuset at that priority 
    struct cpupri_vec    pri_to_cpu[CPUPRI_NR_PRIORITIES]; 
    // Per-CPU array recording each CPU's current cpupri value, so it can be looked up and updated cheaply 
    int            *cpu_to_pri; 
};

Looking up and setting CPU priorities through cpupri_find()/cpupri_find_fitness() and cpupri_set() is the key to how RT load balancing quickly finds a CPU to migrate a task to.
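
cpupri_find() itself is not quoted in this article. Roughly, it scans the priority vectors from the lowest cpupri upward and stops at the first level that is below the task's own priority and overlaps the task's allowed CPUs. The sketch below models that idea in userspace with plain 64-bit masks (struct vec, find_lowest_cpus() and the sample values are simplified stand-ins, not the kernel API):

#include <stdint.h>
#include <stdio.h>

#define CPUPRI_NR_PRIORITIES 102

/* Simplified stand-in for struct cpupri_vec: a counter plus a CPU mask. */
struct vec { int count; uint64_t mask; };

static struct vec pri_to_cpu[CPUPRI_NR_PRIORITIES];

/*
 * Model of the cpupri_find() idea: walk the priority levels from the lowest
 * (idle CPUs) upward and return a mask of candidate CPUs at the first level
 * that is lower than the task's own cpupri and intersects its affinity mask.
 * Returns 0 if no suitable level exists.
 */
static uint64_t find_lowest_cpus(int task_cpupri, uint64_t task_allowed)
{
    for (int level = 0; level < task_cpupri; level++) {
        uint64_t candidates;

        if (!pri_to_cpu[level].count)
            continue;                        /* no CPU at this level */
        candidates = pri_to_cpu[level].mask & task_allowed;
        if (candidates)
            return candidates;               /* lowest-priority CPUs found */
    }
    return 0;
}

int main(void)
{
    pri_to_cpu[1].count = 2; pri_to_cpu[1].mask = 0x6;  /* CPU1, CPU2 run CFS */
    pri_to_cpu[4].count = 1; pri_to_cpu[4].mask = 0x8;  /* CPU3 runs RT prio 97 */

    /* An RT task with prio 90 (cpupri 11), allowed on CPUs 0-3. */
    printf("candidate mask: 0x%llx\n",
           (unsigned long long)find_lowest_cpus(11, 0xf));
    return 0;
}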

3. The cpupri_set() function

(1) Function analysis

/** 
 * cpupri_set - update the CPU priority setting 
 * @cp: The cpupri context 
 * @cpu: The target CPU 
 * @newpri: The priority (INVALID,NORMAL,RT1-RT99,HIGHER) to assign to this CPU 
 * 
 * Note: Assumes cpu_rq(cpu)->lock is locked 
 * 
 * Returns: (void) 
 */ 
void cpupri_set(struct cpupri *cp, int cpu, int newpri) 
{ 
    // Current cpupri of this CPU 
    int *currpri = &cp->cpu_to_pri[cpu]; 
    int oldpri = *currpri; 
    int do_mb = 0; 
 
    // Convert p->prio to a cpupri value 
    newpri = convert_prio(newpri); 
 
    BUG_ON(newpri >= CPUPRI_NR_PRIORITIES); 
 
    if (newpri == oldpri) 
        return; 
 
    // The cpupri changed: record this CPU under the new priority vector first ... 
    if (likely(newpri != CPUPRI_INVALID)) { 
        struct cpupri_vec *vec = &cp->pri_to_cpu[newpri]; 
 
        cpumask_set_cpu(cpu, vec->mask); 
 
        smp_mb__before_atomic(); 
        atomic_inc(&(vec)->count); 
        do_mb = 1; 
    } 
    // ... then remove it from the old priority vector 
    if (likely(oldpri != CPUPRI_INVALID)) { 
        struct cpupri_vec *vec  = &cp->pri_to_cpu[oldpri]; 
 
        if (do_mb) 
            smp_mb__after_atomic(); 
         
        atomic_dec(&(vec)->count); 
        smp_mb__after_atomic(); 
        cpumask_clear_cpu(cpu, vec->mask); 
    } 
 
    *currpri = newpri; 
}

For example, suppose the task running on CPU2 switches from a CFS task with prio = 120 to an RT task with prio = 97. cpupri_set() first reads the current cpupri from cp->cpu_to_pri[2]; since CPU2 was previously running a CFS task, the value read is 1. The new cpupri is then computed as 100 - 97 + 1 = 4, so CPU2's bit is set in the vec->mask of cp->pri_to_cpu[4] and that vec->count is incremented, meaning one more CPU (CPU2) is now at cpupri = 4. CPU2 is then cleared from the vec->mask of cp->pri_to_cpu[1] and that vec->count is decremented, meaning one fewer CPU is at cpupri = 1.

(2) Call paths of cpupri_set():

                rt_sched_class.rq_offline callback 
                    rq_offline_rt //called with (cpupri, rq->cpu, CPUPRI_INVALID) 
                rt_sched_class.rq_online callback 
                    rq_online_rt 
enqueue_rt_entity 
dequeue_rt_entity 
    dequeue_rt_stack 
        __dequeue_rt_entity 
            dec_rt_tasks 
                dec_rt_prio 
                    dec_rt_prio_smp 
    enqueue_rt_entity 
    dequeue_rt_entity     
        __enqueue_rt_entity             
            inc_rt_tasks         
                inc_rt_prio     
                    inc_rt_prio_smp 
                        cpupri_set

As these paths show, cpupri_set() is called mainly when RT tasks are enqueued or dequeued, i.e. when the task running on a CPU changes. Since every CPU running a CFS task maps to cpupri = 1, only task switches that involve an RT task actually change the value; all of the callers live in rt.c.

III. PUSH task migration

1. The basic idea of pushing tasks

Based on cpupri, find the set of CPUs with the lowest CPU priority as candidates, pick one of them as the target CPU, and then push the highest-priority pushable RT task queued on this rq over to it. This repeats in a loop until no pushable task is left.

The source CPU is the CPU owning the rq passed to push_rt_task(struct rq *rq); RT tasks are pushed away from that CPU's runqueue.

2. When tasks get pushed

push_rt_tasks() is invoked at the following points:

(1) An rt_mutex priority change, a scheduling-class change through __sched_setscheduler(), or a task switch in __schedule()

rt_mutex_setprio //core.c 
__sched_setscheduler //core.c 
    check_class_changed //core.c called when the scheduling class changes; invokes the old class's switched_from first, then the new class's switched_to 
        rt_sched_class.switched_to //rt.c callback 
            switched_to_rt //rt.c queues the push only if p is queued on the rq but not running, is allowed on more than one CPU, and rq->rt is overloaded 
__schedule //core.c 
    pick_next_task //core.c selects the next task 
        rt_sched_class.pick_next_task //rt.c callback 
            pick_next_task_rt //rt.c called unconditionally 
                rt_queue_push_tasks //rt.c only if the rq has pushable tasks, i.e. rq->rt.pushable_tasks is not empty 
                    queue_balance_callback(rq, &per_cpu(rt_push_head, rq->cpu), push_rt_tasks); //rt.c inserted at the head of the rq->balance_callback list

When the queued callbacks run:

__sched_setscheduler 
rt_mutex_setprio 
schedule_tail //core.c called on a new task's first return from a context switch (from ret_from_fork) 
__schedule //core.c called at the end of the task-switch path 
    balance_callback //core.c 
        __balance_callback //core.c invokes every callback on the rq->balance_callback list in turn, with rq->lock held and interrupts disabled

(2) When a CPU goes to pull RT tasks, it tells the other CPUs to push some tasks to it instead

pull_rt_task(rq) //rt.c takes this path only when RT_PUSH_IPI is enabled: pulling is turned into pushing. rq is the current runqueue; other CPUs are asked to push tasks onto it 
    tell_cpu_to_push //rt.c queues the irq_work only if there is an RT-overloaded CPU 
        irq_work_queue_on(&rq->rd->rto_push_work, cpu); 
            rto_push_irq_work_func //if there are pushable tasks, pushes them with rq->lock (a spinlock) held 
                push_rt_tasks(rq) 
                irq_work_queue_on(&rd->rto_push_work, cpu); //re-queues itself, each time on the next RT-overloaded CPU, so the chain keeps running until no RT-overloaded CPU is left 
 
init_rootdomain 
    init_irq_work(&rd->rto_push_work, rto_push_irq_work_func);

(3) If a woken RT task is deemed unable to get scheduled promptly, it is pushed away from the rq it was woken on

ttwu_do_wakeup //core.c 
wake_up_new_task //core.c 
    rt_sched_class.task_woken //scheduling-class callback 
        task_woken_rt //rt.c 
            push_rt_tasks(rq)

The conditions under which task_woken_rt() calls push_rt_tasks() are fairly strict, as shown below: the woken task p is not the task currently running on the rq, the rq's current task has not been flagged for resched (no reschedule is imminent), p is allowed to run on other CPUs, the rq's current task is a DL or RT task, and that current task either can run only on this CPU or has a priority no lower than p's. Only when all of this holds is push_rt_tasks() called to push the woken RT task away.

static void task_woken_rt(struct rq *rq, struct task_struct *p) 
{ 
    if (!task_running(rq, p) && 
        !test_tsk_need_resched(rq->curr) && 
        p->nr_cpus_allowed > 1 && 
        (dl_task(rq->curr) || rt_task(rq->curr)) && 
        (rq->curr->nr_cpus_allowed < 2 || rq->curr->prio <= p->prio)) 
        push_rt_tasks(rq); 
}

3. When pushing stops

See push_rt_tasks(rq): tasks keep being pushed off the rq until nothing pushable remains.

4. Implementation of the push logic: push_rt_tasks()

(1) push_rt_tasks()

//rt.c Purpose: push some tasks from the given rq to other rqs 
static void push_rt_tasks(struct rq *rq) 
{ 
    /* push_rt_task will return true if it moved an RT */ 
    while (push_rt_task(rq)) //keeps going as long as a task was pushed 
        ; 
}

(2) push_rt_task()

/* 
 * If the current CPU has more than one RT task, see if the non 
 * running task can migrate over to a CPU that is running a task 
 * of lesser priority. 
 */ 
//Returns 1 if a task was pushed out, 0 otherwise 
static int push_rt_task(struct rq *rq) 
{ 
    struct task_struct *next_task; 
    struct rq *lowest_rq; 
    int ret = 0; 
 
    /* Set in update_rt_migration(): 1 when the rq has more than one RT task and at least one of them is migratable */ 
    if (!rq->rt.overloaded) 
        return 0; 
 
    //Take a pushable task from the head of rq->rt.pushable_tasks; the first one picked is the highest-priority pushable RT task 
    next_task = pick_next_pushable_task(rq); 
    if (!next_task) 
        return 0; 
 
retry: 
    //The picked task must be runnable but not currently running 
    if (unlikely(next_task == rq->curr)) { 
        WARN_ON(1); 
        return 0; 
    } 
 
    /* 
     * It's possible that the next_task slipped in of 
     * higher priority than current. If that's the case 
     * just reschedule current. 
     */ 
    if (unlikely(next_task->prio < rq->curr->prio)) { 
        resched_curr(rq); 
        return 0; 
    } 
 
    /* We might release rq lock */ 
    get_task_struct(next_task); 
 
    /* find_lock_lowest_rq locks the rq if found */ 
    //Use cpupri to find the CPU with the lowest CPU priority as the destination for the push 
    lowest_rq = find_lock_lowest_rq(next_task, rq); 
    //1. lowest_rq was not found 
    if (!lowest_rq) { 
        struct task_struct *task; 
        /* 
         * find_lock_lowest_rq releases rq->lock 
         * so it is possible that next_task has migrated. 
         * 
         * We need to make sure that the task is still on the same 
         * run-queue and is also still the next task eligible for pushing. 
         */ 
        task = pick_next_pushable_task(rq); 
        //(1) The re-picked pushable task is still the original one 
        if (task == next_task) { 
            /* 
             * The task hasn't migrated, and is still the next 
             * eligible task, but we failed to find a run-queue 
             * to push it to.  Do not retry in this case, since 
             * other CPUs will pull from us when ready. 
             */ 
            goto out; 
        } 
 
        //(2) No pushable task is left on the rq 
        if (!task) 
            /* No more tasks, just exit */ 
            goto out; 
 
        /* Something has shifted, try again. */ 
        //A different task was picked this time: retry with it 
        put_task_struct(next_task); 
        next_task = task; 
        goto retry; 
    } 
 
    //2. lowest_rq was found: take the task off rq and put it on lowest_rq 
    deactivate_task(rq, next_task, 0); 
    set_task_cpu(next_task, lowest_rq->cpu); 
    activate_task(lowest_rq, next_task, 0); 
    ret = 1; 
 
    //Trigger a reschedule on the target lowest_rq 
    resched_curr(lowest_rq); 
 
    //CONFIG_LOCKDEP related; with it disabled this simply releases lowest_rq->lock 
    double_unlock_balance(rq, lowest_rq); 
 
out: 
    put_task_struct(next_task); 
 
    return ret; 
}

(3) pick_next_pushable_task()

Picks the highest-priority RT task that is queued (but not running) on the rq; the highest-priority pushable RT task is pushed first.

static struct task_struct *pick_next_pushable_task(struct rq *rq) //rt.c 
{ 
    struct task_struct *p; 
 
    //A non-empty rq->rt.pushable_tasks list means there are pushable tasks 
    if (!has_pushable_tasks(rq)) 
        return NULL; 
 
    //The first entry is the highest-priority RT task on the list, i.e. the highest-priority pushable task queued on the rq 
    p = plist_first_entry(&rq->rt.pushable_tasks, struct task_struct, pushable_tasks); 
 
    BUG_ON(rq->cpu != task_cpu(p)); /* p must belong to this CPU's rq */ 
    BUG_ON(task_current(rq, p)); /* rq->curr == p; p must not be the currently running task */ 
    BUG_ON(p->nr_cpus_allowed <= 1); /* p must be allowed on more than one CPU, otherwise it cannot be pushed */ 
 
    BUG_ON(!task_on_rq_queued(p)); /* p->on_rq == TASK_ON_RQ_QUEUED; p must be queued on the rq */ 
    BUG_ON(!rt_task(p)); /* prio < MAX_RT_PRIO (100); p must be an RT task */ 
 
    return p; 
}

(4) find_lock_lowest_rq()

Uses cpupri to find the CPU with the lowest CPU priority as the target CPU to push the task to.

/* Will lock the rq it finds */ 
static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq *rq) 
{ 
    struct rq *lowest_rq = NULL; 
    int tries; 
    int cpu; 
 
    //At most RT_MAX_TRIES (3) attempts 
    for (tries = 0; tries < RT_MAX_TRIES; tries++) { 
        //Pick the CPU with the lowest CPU priority (lower than the task's own CPU, otherwise -1 is returned) 
        cpu = find_lowest_rq(task); 
        //If nothing was found, cpu == -1; the second condition (cpu == rq->cpu) is not normally expected to hold 
        if ((cpu == -1) || (cpu == rq->cpu)) 
            break; 
 
        //rq of the target CPU the task is to be pushed to 
        lowest_rq = cpu_rq(cpu); 
 
        //This can be true: lowest_rq->lock is not held yet, so a higher-priority task may have been queued there in the meantime 
        if (lowest_rq->rt.highest_prio.curr <= task->prio) { 
            /* 
             * Target rq has tasks of equal or higher priority, 
             * retrying does not release any lock and is unlikely 
             * to yield a different result. 
             */ 
            lowest_rq = NULL; 
            break; 
        } 
 
        /* if the prio of this runqueue changed, try again */ 
        //The checks below verify whether things changed while the locks were dropped; if so, lowest_rq is set back to NULL 
        if (double_lock_balance(rq, lowest_rq)) { 
            /* 
             * We had to unlock the run queue. In the mean time, task could have 
             * migrated already or had its affinity changed. 
             * Also make sure that it wasn't scheduled on its rq. 
             */ 
            if (unlikely(task_rq(task) != rq || 
                     !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_allowed) || 
                     task_running(rq, task) || 
                     !rt_task(task) || 
                     !task_on_rq_queued(task))) { 
 
                double_unlock_balance(rq, lowest_rq); 
                lowest_rq = NULL; 
                break; 
            } 
        } 
 
        /* If this rq is still suitable use it. */ 
        //Most likely still true; if so, stop retrying and return the lowest_rq that was found 
        if (lowest_rq->rt.highest_prio.curr > task->prio) 
            break; 
 
        /* try again */ 
        double_unlock_balance(rq, lowest_rq); 
        lowest_rq = NULL; 
    } 
 
    return lowest_rq; 
}
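
The re-checks after double_lock_balance() are needed because taking two rq locks requires a consistent order: if the trylock on the target rq fails, the local rq->lock may be dropped and both locks re-acquired in address order, so the local rq can change underneath us. A minimal userspace model of this locking pattern, with pthread mutexes standing in for rq->lock (a sketch of the idea, not the kernel's implementation):

#include <pthread.h>
#include <stdio.h>

struct rq { pthread_mutex_t lock; };

/*
 * Model of the double_lock_balance() idea: we already hold this_rq->lock and
 * also want busiest->lock.  If the trylock fails and the locks would be taken
 * in the wrong (non-address) order, drop our own lock and retake both in
 * address order to avoid an AB-BA deadlock.  Returns 1 if this_rq->lock was
 * dropped on the way, so the caller must re-validate its state.
 */
static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
{
    int dropped = 0;

    if (pthread_mutex_trylock(&busiest->lock) != 0) {
        if (busiest < this_rq) {
            /* Wrong order: release our lock and retake both in address order. */
            pthread_mutex_unlock(&this_rq->lock);
            pthread_mutex_lock(&busiest->lock);
            pthread_mutex_lock(&this_rq->lock);
            dropped = 1;
        } else {
            /* Already in address order: blocking here cannot deadlock. */
            pthread_mutex_lock(&busiest->lock);
        }
    }
    return dropped;
}

int main(void)
{
    struct rq a = { PTHREAD_MUTEX_INITIALIZER };
    struct rq b = { PTHREAD_MUTEX_INITIALIZER };

    pthread_mutex_lock(&a.lock);            /* we "own" rq a, like rq->lock */
    printf("dropped this_rq lock: %d\n", double_lock_balance(&a, &b));
    pthread_mutex_unlock(&b.lock);
    pthread_mutex_unlock(&a.lock);
    return 0;
}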

5. When pushable tasks are added to the rq->rt.pushable_tasks list

In enqueue_task_rt(), p is linked into rq->rt.pushable_tasks (through its p->pushable_tasks node) only when it is not the currently running task and is allowed to run on more than one CPU; the smaller p->prio (i.e. the higher the priority), the closer to the head of the list it is placed. A small model of this ordering is sketched at the end of this subsection.

static void enqueue_pushable_task(struct rq *rq, struct task_struct *p) 
{ 
    plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks); 
    plist_node_init(&p->pushable_tasks, p->prio); 
    //The smaller p->prio is, the closer to the head the node is inserted 
    plist_add(&p->pushable_tasks, &rq->rt.pushable_tasks); 
 
    /* Update the highest prio pushable task */ 
    if (p->prio < rq->rt.highest_prio.next) 
        rq->rt.highest_prio.next = p->prio; 
} 
 
static void dequeue_pushable_task(struct rq *rq, struct task_struct *p) 
{ 
    plist_del(&p->pushable_tasks, &rq->rt.pushable_tasks); 
 
    /* Update the new highest prio pushable task */ 
    if (has_pushable_tasks(rq)) { 
        p = plist_first_entry(&rq->rt.pushable_tasks, struct task_struct, pushable_tasks); 
        rq->rt.highest_prio.next = p->prio; 
    } else 
        rq->rt.highest_prio.next = MAX_RT_PRIO; 
}

Call paths:

rt_sched_class.enqueue_task 
    enqueue_task_rt //rt.c at the very end of the function, and only when p is not the currently running task and is allowed to run on more than one CPU 
        enqueue_pushable_task(rq, p); 
 
rt_sched_class.dequeue_task 
    dequeue_task_rt //called unconditionally 
        dequeue_pushable_task(rq, p);
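
rq->rt.pushable_tasks is a priority-sorted plist keyed on p->prio, which is why pick_next_pushable_task() can simply take the first entry. The userspace sketch below imitates that ordering with an ordinary sorted linked list (struct task and pushable_add() are simplified stand-ins, not the kernel's plist API):

#include <stdio.h>

/* Stand-in for a task with a pushable_tasks node keyed on prio. */
struct task { int prio; const char *name; struct task *next; };

/* Insert keeping the list sorted by prio, smallest value (highest priority) first. */
static void pushable_add(struct task **head, struct task *p)
{
    while (*head && (*head)->prio <= p->prio)
        head = &(*head)->next;
    p->next = *head;
    *head = p;
}

int main(void)
{
    struct task a = { 97, "rt97", NULL }, b = { 50, "rt50", NULL }, c = { 80, "rt80", NULL };
    struct task *pushable = NULL;

    pushable_add(&pushable, &a);
    pushable_add(&pushable, &b);
    pushable_add(&pushable, &c);

    /* Like pick_next_pushable_task(): the head is the highest-priority task. */
    for (struct task *p = pushable; p; p = p->next)
        printf("%s (prio %d)\n", p->name, p->prio);
    return 0;
}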

6. Setting the rt_rq->overloaded flag

Updated when RT tasks are enqueued or dequeued, based on whether the rt_rq has migratable RT tasks.

static void update_rt_migration(struct rt_rq *rt_rq) 
{ 
    if (rt_rq->rt_nr_migratory && rt_rq->rt_nr_total > 1) { 
        if (!rt_rq->overloaded) { 
            //rd->rto_count++ and set this CPU's bit in rd->rto_mask 
            rt_set_overload(rq_of_rt_rq(rt_rq)); 
            rt_rq->overloaded = 1; 
        } 
    } else if (rt_rq->overloaded) { 
        //rd->rto_count-- and clear this CPU's bit in rd->rto_mask 
        rt_clear_overload(rq_of_rt_rq(rt_rq)); 
        rt_rq->overloaded = 0; 
    } 
}

Call paths:

__enqueue_rt_entity 
    inc_rt_tasks 
        inc_rt_migration //called unconditionally; rt_rq->rt_nr_total++ always, rt_rq->rt_nr_migratory++ only if p is allowed on more than one CPU 
__dequeue_rt_entity 
    dec_rt_tasks 
        dec_rt_migration //called unconditionally; rt_rq->rt_nr_total-- always, rt_rq->rt_nr_migratory-- only if p is allowed on more than one CPU 
            update_rt_migration
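
rt_set_overload()/rt_clear_overload() are not quoted here; roughly, they set or clear this CPU's bit in rd->rto_mask and adjust rd->rto_count, with a write barrier between the mask update and the count increment so that a reader who sees rto_count != 0 also sees the mask bit (the smp_rmb() in pull_rt_task() pairs with it). A simplified userspace model of that publish/consume pairing, using C11 atomics instead of the kernel primitives, might look like this:

#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned long rto_mask;   /* stands in for rd->rto_mask */
static atomic_int rto_count;             /* stands in for rd->rto_count */

/* Writer side: publish the mask bit before bumping the counter. */
static void rt_set_overload(int cpu)
{
    atomic_fetch_or_explicit(&rto_mask, 1UL << cpu, memory_order_relaxed);
    /* Release: the mask update is visible before the count increment. */
    atomic_fetch_add_explicit(&rto_count, 1, memory_order_release);
}

static void rt_clear_overload(int cpu)
{
    atomic_fetch_sub_explicit(&rto_count, 1, memory_order_relaxed);
    atomic_fetch_and_explicit(&rto_mask, ~(1UL << cpu), memory_order_relaxed);
}

/* Reader side, as in pull_rt_task(): check the count first, then read the mask. */
static unsigned long snapshot_overloaded(void)
{
    if (!atomic_load_explicit(&rto_count, memory_order_acquire))
        return 0;
    return atomic_load_explicit(&rto_mask, memory_order_relaxed);
}

int main(void)
{
    rt_set_overload(2);
    printf("overloaded mask: 0x%lx\n", snapshot_overloaded());
    rt_clear_overload(2);
    printf("overloaded mask after clear: 0x%lx\n", snapshot_overloaded());
    return 0;
}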

IV. PULL task migration

1. The basic idea of pulling tasks

When the next RT task is being picked, if the highest-priority RT task on the rq has a lower priority than prev, the scheduler decides it should pull RT tasks over. Two cases follow:

(1) RT_PUSH_IPI disabled

RT tasks are pulled from every CPU whose highest-priority runnable RT task outranks this rq's; this is done for each such CPU, and a reschedule is then triggered on this CPU.

(2) RT_PUSH_IPI enabled

An irq_work is queued on each RT-overloaded CPU in turn, making that CPU run the push-task logic; pulling is thereby replaced by pushing.

2. When tasks get pulled

rt_mutex_setprio //core.c 
__sched_setscheduler //core.c 
    check_class_changed //core.c 
        rt_sched_class.switched_from 
            switched_from_rt //queues the pull if p was a queued RT task and the rq has no RT task left to run 
    check_class_changed 
        rt_sched_class.prio_changed 
            prio_changed_rt //queues the pull if p is the currently running task and its priority was lowered 
                rt_queue_pull_task 
                    queue_balance_callback(rq, &per_cpu(rt_pull_head, rq->cpu), pull_rt_task); 
 
rt_sched_class.pick_next_task 
    pick_next_task_rt //pulls only when need_pull_rt_task() judges a pull is needed 
        pull_rt_task

Timing one: when the next RT task is about to be picked. need_pull_rt_task() decides whether a pull is needed: if the highest priority among the RT tasks queued on this rq is lower than the priority of the prev task, tasks should be pulled onto this rq.

static inline bool need_pull_rt_task(struct rq *rq, struct task_struct *prev) 
{ 
    /* Try to pull RT tasks here if we lower this rq's prio */ 
    return rq->rt.highest_prio.curr > prev->prio; 
}

Timing two: through queue_balance_callback, the same mechanism as for push_rt_tasks.

3. Implementation of the pull logic: pull_rt_task()

3.1 First, the case where the RT_PUSH_IPI sched feature is disabled

(1) The pull_rt_task() function

static void pull_rt_task(struct rq *this_rq) 
{ 
    int this_cpu = this_rq->cpu, cpu; 
    bool resched = false; 
    struct task_struct *p; 
    struct rq *src_rq; 
    //Returns rq->rd->rto_count: incremented once for every CPU that has migratable RT tasks 
    int rt_overload_count = rt_overloaded(this_rq); 
 
    if (likely(!rt_overload_count)) 
        return; 
 
    /* 
     * Match the barrier from rt_set_overloaded; this guarantees that if we 
     * see overloaded we must also see the rto_mask bit. 
     */ 
    smp_rmb(); 
 
    /* If we are the only overloaded CPU do nothing */ 
    //If this CPU is the only RT-overloaded one, there is no point pulling RT tasks here 
    if (rt_overload_count == 1 && cpumask_test_cpu(this_rq->cpu, this_rq->rd->rto_mask)) 
        return; 
 
#ifdef HAVE_RT_PUSH_IPI 
    //With this feature enabled, other CPUs are told to push tasks to this rq and no pulling is done here 
    if (sched_feat(RT_PUSH_IPI)) { 
        tell_cpu_to_push(this_rq); 
        return; 
    } 
#endif 
 
    //Do this for every RT-overloaded CPU 
    for_each_cpu(cpu, this_rq->rd->rto_mask) { 
        //Skip this CPU; we obviously cannot pull tasks from this CPU onto itself 
        if (this_cpu == cpu) 
            continue; 
 
        src_rq = cpu_rq(cpu); 
 
        /* 
         * Don't bother taking the src_rq->lock if the next highest 
         * task is known to be lower-priority than our current task. 
         * This may look racy, but if this value is about to go 
         * logically higher, the src_rq will push this task away. 
         * And if its going logically lower, we do not care 
         */ 
        //Only consider a src_rq whose next-highest RT task outranks this rq's current highest (highest_prio.next is updated at enqueue time) 
        if (src_rq->rt.highest_prio.next >= this_rq->rt.highest_prio.curr) 
            continue; 
 
        /* 
         * We can potentially drop this_rq's lock in 
         * double_lock_balance, and another CPU could alter this_rq 
         */ 
        double_lock_balance(this_rq, src_rq); 
 
        /* 
         * We can pull only a task, which is pushable on its rq, and no others. 
         */ 
        //Pick the highest-priority runnable (pushable) RT task on src_rq 
        p = pick_highest_pushable_task(src_rq, this_cpu); 
 
        /* 
         * Do we have an RT task that preempts the to-be-scheduled task? 
         */ 
        if (p && (p->prio < this_rq->rt.highest_prio.curr)) { 
            WARN_ON(p == src_rq->curr); //the picked RT task must not be the one currently running on src_rq 
            WARN_ON(!task_on_rq_queued(p)); //the picked RT task must be queued (runnable) 
 
            /* 
             * There's a chance that p is higher in priority than what's currently 
             * running on its CPU. This is just that p is wakeing up and hasn't had 
             * a chance to schedule. We only pull p if it is lower in priority than 
             * the current task on the run queue. 
             */ 
            /* If the p picked from src_rq has a higher priority than the task 
             * currently running on src_rq, skip it: it will preempt that 
             * lower-priority task and get to run there soon anyway. 
             */ 
            if (p->prio < src_rq->curr->prio) 
                goto skip; 
 
            resched = true; 
 
            //Detach from the source src_rq and put onto this_rq 
            deactivate_task(src_rq, p, 0); 
            set_task_cpu(p, this_cpu); 
            activate_task(this_rq, p, 0); 
            /* 
             * We continue with the search, just in 
             * case there's an even higher prio task 
             * in another runqueue. (low likelihood 
             * but possible) 
             */ 
        } 
skip: 
        double_unlock_balance(this_rq, src_rq); 
    } 
 
    //If any task was pulled over, trigger a reschedule 
    if (resched) 
        resched_curr(this_rq); 
}

(2) The pick_highest_pushable_task() function

/* 
 * Return the highest pushable rq's task, which is suitable to be executed 
 * on the CPU, NULL otherwise 
 */ 
//Parameters: rq: the source rq, cpu: the destination CPU 
static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu) 
{ 
    struct plist_head *head = &rq->rt.pushable_tasks; 
    struct task_struct *p; 
 
    //An empty rq->rt.pushable_tasks list means the rq has no pushable tasks 
    if (!has_pushable_tasks(rq)) 
        return NULL; 
 
    /* 
     * Walk the pushable tasks on the source rq from highest to lowest priority 
     * and return the first one that is not running and whose affinity allows 
     * it to run on the destination CPU. 
     */ 
    plist_for_each_entry(p, head, pushable_tasks) { 
        if (pick_rt_task(rq, p, cpu)) 
            return p; 
    } 
 
    return NULL; 
} 
 
static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu) 
{ 
    if (!task_running(rq, p) && cpumask_test_cpu(cpu, &p->cpus_allowed)) 
        return 1; 
 
    return 0; 
}

3.2 The case where the RT_PUSH_IPI sched feature is enabled

The pull_rt_task() logic is delegated to tell_cpu_to_push(this_rq): other CPUs are asked to push tasks onto this_rq instead of this CPU pulling them, which reduces the lock contention that pulling would cause.

(1) The tell_cpu_to_push() function:

static void tell_cpu_to_push(struct rq *rq) 
{ 
    int cpu = -1; 
 
    /* Keep the loop going if the IPI is currently active */ 
    //The only place this value is incremented; nothing ever decreases it 
    atomic_inc(&rq->rd->rto_loop_next); 
 
    /* Only one CPU can initiate a loop at a time */ 
    if (!rto_start_trylock(&rq->rd->rto_loop_start)) 
        return; 
 
    raw_spin_lock(&rq->rd->rto_lock); 
 
    /* 
     * The rto_cpu is updated under the lock, if it has a valid CPU 
     * then the IPI is still running and will continue due to the 
     * update to loop_next, and nothing needs to be done here. 
     * Otherwise it is finishing up and an ipi needs to be sent. 
     */ 
    //Initialized to -1; only rto_next_cpu() sets it, to a CPU id or back to -1 
    if (rq->rd->rto_cpu < 0) 
        //Returns an RT-overloaded CPU 
        cpu = rto_next_cpu(rq->rd); 
 
    raw_spin_unlock(&rq->rd->rto_lock); 
 
    //Resets rd->rto_loop_start to 0 
    rto_start_unlock(&rq->rd->rto_loop_start); 
 
    if (cpu >= 0) { 
        /* Make sure the rd does not get freed while pushing: rd->refcount++ */ 
        sched_get_rd(rq->rd); 
        //Queue an irq_work on the CPU given by 'cpu' 
        irq_work_queue_on(&rq->rd->rto_push_work, cpu); /* handler: rto_push_irq_work_func */ 
    } 
}
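
"Only one CPU can initiate a loop at a time" is enforced by rto_start_trylock()/rto_start_unlock() on rd->rto_loop_start, which behave like an atomic try-lock. A minimal userspace sketch of that pattern with C11 atomics (an illustration of the idea, not the kernel helpers themselves):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int rto_loop_start;   /* stands in for rd->rto_loop_start */

/* Try to become the one initiator: succeed only if the flag was 0. */
static bool rto_start_trylock(atomic_int *v)
{
    int expected = 0;

    return atomic_compare_exchange_strong_explicit(v, &expected, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

static void rto_start_unlock(atomic_int *v)
{
    atomic_store_explicit(v, 0, memory_order_release);
}

int main(void)
{
    /* The first caller wins and starts the IPI chain... */
    printf("first  try: %d\n", rto_start_trylock(&rto_loop_start));
    /* ...a concurrent second caller just bumps rto_loop_next and backs off. */
    printf("second try: %d\n", rto_start_trylock(&rto_loop_start));
    rto_start_unlock(&rto_loop_start);
    return 0;
}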

(2) The rto_next_cpu() function:

static int rto_next_cpu(struct root_domain *rd) 
{ 
    int next; 
    int cpu; 
 
    /* 
     * When starting the IPI RT pushing, the rto_cpu is set to -1, 
     * rt_next_cpu() will simply return the first CPU found in 
     * the rto_mask. 
     * 
     * If rto_next_cpu() is called with rto_cpu is a valid CPU, it 
     * will return the next CPU found in the rto_mask. 
     * 
     * If there are no more CPUs left in the rto_mask, then a check is made 
     * against rto_loop and rto_loop_next. rto_loop is only updated with 
     * the rto_lock held, but any CPU may increment the rto_loop_next 
     * without any locking. 
     */ 
    for (;;) { 
 
        /* When rto_cpu is -1 this acts like cpumask_first() */ 
        cpu = cpumask_next(rd->rto_cpu, rd->rto_mask); 
 
        rd->rto_cpu = cpu; 
 
        //In the normal case we return the next RT-overloaded CPU here 
        if (cpu < nr_cpu_ids) 
            return cpu; 
 
        //No CPUs left in rto_mask: reset rto_cpu to -1 
        rd->rto_cpu = -1; 
 
        /* 
         * ACQUIRE ensures we see the @rto_mask changes 
         * made prior to the @next value observed. 
         * 
         * Matches WMB in rt_set_overload(). 
         */ 
        next = atomic_read_acquire(&rd->rto_loop_next); 
 
        if (rd->rto_loop == next) 
            break; 
 
        rd->rto_loop = next; 
    } 
 
    return -1; 
}
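
The rto_loop/rto_loop_next pair is what keeps the IPI chain sweeping while new overload events arrive: every tell_cpu_to_push() bumps rto_loop_next, and the chain stops only when a full pass over rto_mask finds rto_loop already equal to rto_loop_next. The standalone model below imitates that termination logic with a 64-bit mask and a hand-rolled find-next-bit (not the kernel's cpumask API):

#include <stdint.h>
#include <stdio.h>

static uint64_t rto_mask = 0x2c;   /* CPUs 2, 3 and 5 are RT overloaded */
static int rto_cpu = -1;
static int rto_loop;
static int rto_loop_next;          /* tell_cpu_to_push() would bump this */

/* Next bit set in rto_mask after rto_cpu, or -1 once a pass finds no new work. */
static int rto_next_cpu(void)
{
    for (;;) {
        for (int cpu = rto_cpu + 1; cpu < 64; cpu++) {
            if (rto_mask & (1ULL << cpu)) {
                rto_cpu = cpu;
                return cpu;             /* hand the irq_work to this CPU next */
            }
        }
        rto_cpu = -1;                   /* wrapped around: this pass is done */
        if (rto_loop == rto_loop_next)
            break;                      /* no new push requests: stop the chain */
        rto_loop = rto_loop_next;       /* new requests arrived: sweep again */
    }
    return -1;
}

int main(void)
{
    for (int cpu = rto_next_cpu(); cpu >= 0; cpu = rto_next_cpu())
        printf("queue rto_push_work on CPU %d\n", cpu);
    printf("IPI chain finished\n");
    return 0;
}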

(3) The rto_push_irq_work_func() function.

Note the comment: it is called from hardirq context.

/* Called from hardirq context */ 
void rto_push_irq_work_func(struct irq_work *work) 
{ 
    struct root_domain *rd = container_of(work, struct root_domain, rto_push_work); 
    struct rq *rq; 
    int cpu; 
 
    rq = this_rq(); 
 
    /* 
     * We do not need to grab the lock to check for has_pushable_tasks. 
     * When it gets updated, a check is made if a push is possible. 
     */ 
    if (has_pushable_tasks(rq)) { 
        raw_spin_lock(&rq->lock); 
        //Run the push-task flow 
        push_rt_tasks(rq); 
        raw_spin_unlock(&rq->lock); 
    } 
 
    raw_spin_lock(&rd->rto_lock); 
 
    /* Pass the IPI to the next rt overloaded queue */ 
    //Get the next RT-overloaded CPU 
    cpu = rto_next_cpu(rd); 
 
    raw_spin_unlock(&rd->rto_lock); 
 
    if (cpu < 0) { 
        sched_put_rd(rd); 
        return; 
    } 
 
    /* Try the next RT overloaded CPU */ 
    /* 
     * The irq_work re-queues itself, but on the next RT-overloaded CPU, and 
     * keeps hopping from CPU to CPU until every RT-overloaded CPU has done 
     * its push-task pass. 
     */ 
    irq_work_queue_on(&rd->rto_push_work, cpu); //handler: rto_push_irq_work_func 
}

Reference: https://www.cnblogs.com/hellokitty2/p/15974333.html