Scheduler And Task Lifecycle

Imported from _research/manual-study-linux/scheduler-process.md.

Status: true-full verified for core scheduler/task lifecycle rows.

This volume explains how Linux creates a task, publishes it as runnable, chooses the next task, performs the context switch, and delegates policy to the fair, real-time, and deadline scheduling classes.

Source Surface

Core verified files:

  • kernel/sched/core.c: scheduler core, wakeups, task selection, context switching, and public schedule() path.
  • kernel/sched/sched.h: scheduler-private runqueue structures and class callback contracts.
  • kernel/sched/fair.c: fair/EEVDF scheduling class.
  • kernel/sched/rt.c: real-time FIFO/RR scheduling class.
  • kernel/sched/deadline.c: SCHED_DEADLINE EDF/CBS scheduling class.
  • kernel/fork.c: task/process construction and publication into the scheduler-visible process model.

Supporting reviewed documentation:

  • Documentation/scheduler/sched-design-CFS.rst
  • Documentation/scheduler/sched-eevdf.rst
  • Documentation/scheduler/sched-rt-group.rst
  • Documentation/scheduler/sched-deadline.rst

Entry Points

Task creation enters through kernel_clone() and syscall wrappers in kernel/fork.c, then converges on copy_process(). The scheduler-specific fork hooks are sched_fork(), sched_cgroup_fork(), and sched_post_fork() in kernel/sched/core.c.

The first runnable publication path is wake_up_new_task(). Regular scheduling enters through schedule(), which loops through __schedule_loop() and __schedule(). Policy-specific work enters through struct sched_class callbacks: fair, RT, and deadline each register a class table.

Core Data Structures

struct rq is the per-CPU runqueue. It holds hot CPU-local scheduling state: runnable counts, wakeup flags, CPU capacity, current and donor task pointers, idle task, switch count, lock state, and embedded class queues for CFS, RT, and deadline. The source note anchors this at kernel/sched/sched.h:1128-1189.

struct sched_class is the scheduler policy vtable. It defines callbacks such as enqueue_task, dequeue_task, wakeup_preempt, balance, pick_task, put_prev_task, set_next_task, tick handling, fork/death hooks, migration, and class-change hooks. The callback comments include lock requirements at kernel/sched/sched.h:2585-2653.
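
The vtable idea can be illustrated with a small user-space model. This is a sketch only: ToyRq, ToyClass, CLASSES, and pick_next are hypothetical names, the callbacks are reduced to two, and the deadline, RT, fair ordering is an assumption that omits the stop and idle classes.

```rust
// Toy model: a scheduling class is a table of function pointers, and the
// core walks the tables from highest to lowest priority.
// Every name here is illustrative, not a kernel definition.

/// Minimal stand-in for the per-CPU runqueue state shared by all classes.
#[derive(Default)]
pub struct ToyRq {
    pub deadline: Vec<u32>, // task ids queued in the deadline class
    pub rt: Vec<u32>,       // task ids queued in the RT class
    pub fair: Vec<u32>,     // task ids queued in the fair class
}

/// Policy vtable: the core only ever calls through these pointers.
pub struct ToyClass {
    pub name: &'static str,
    pub enqueue_task: fn(&mut ToyRq, u32),
    pub pick_task: fn(&ToyRq) -> Option<u32>,
}

fn dl_enqueue(rq: &mut ToyRq, id: u32) { rq.deadline.push(id); }
fn dl_pick(rq: &ToyRq) -> Option<u32> { rq.deadline.first().copied() }
fn rt_enqueue(rq: &mut ToyRq, id: u32) { rq.rt.push(id); }
fn rt_pick(rq: &ToyRq) -> Option<u32> { rq.rt.first().copied() }
fn fair_enqueue(rq: &mut ToyRq, id: u32) { rq.fair.push(id); }
fn fair_pick(rq: &ToyRq) -> Option<u32> { rq.fair.first().copied() }

/// Class tables in assumed priority order: deadline, then RT, then fair.
pub static CLASSES: [ToyClass; 3] = [
    ToyClass { name: "deadline", enqueue_task: dl_enqueue, pick_task: dl_pick },
    ToyClass { name: "rt", enqueue_task: rt_enqueue, pick_task: rt_pick },
    ToyClass { name: "fair", enqueue_task: fair_enqueue, pick_task: fair_pick },
];

/// The first class that has a runnable task wins.
pub fn pick_next(rq: &ToyRq) -> Option<(&'static str, u32)> {
    for class in CLASSES.iter() {
        if let Some(id) = (class.pick_task)(rq) {
            return Some((class.name, id));
        }
    }
    None
}
```

The core function never inspects a class queue directly; it only calls through the table, which is the property the real callback contract is meant to preserve.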

kernel/fork.c owns the staged construction of task_struct and process resources before a task becomes visible or runnable. The task is not merely a scheduler object: it is tied to credentials, namespaces, memory context, files, signal state, seccomp, cgroups, pid allocation, and scheduler placement.

Control Flow

Task creation is a staged constructor:

  1. copy_process() validates clone flags, namespace/thread constraints, pidfd rules, and privilege requirements.
  2. It duplicates the task structure and stack, copies credentials, checks limits, and initializes task-local state.
  3. It copies or shares memory, fs, file, signal, seccomp, and subsystem state according to clone flags.
  4. It allocates a PID and performs cgroup and scheduler placement before visibility.
  5. It enters the no-failure publication phase: attach pids, insert process tree links, update counters, run post-fork hooks, and return the task.
  6. kernel_clone() wakes the task and handles vfork/ptrace details.

Scheduler initialization happens inside that construction flow. sched_fork() marks the child TASK_NEW, resets inherited policy where required, selects a class, and initializes scheduler state. sched_cgroup_fork() attaches the task to a scheduler task group and initial CPU. wake_up_new_task() changes the task to TASK_RUNNING, selects a CPU, locks the runqueue, activates the task, traces the wakeup, and asks the class whether the new task should preempt the current task.

Regular scheduling is:

  1. schedule() calls __schedule_loop(), which disables preemption around each __schedule() call and repeats while a reschedule is still pending.
  2. __schedule() locks the current CPU runqueue and updates clocks.
  3. If the previous task is blocking, it is deactivated.
  4. pick_next_task() delegates to the scheduling class machinery.
  5. The selected task is published through rq->curr.
  6. If prev != next, context_switch() switches memory context and CPU register/stack state.
  7. finish_task_switch() completes accounting and drops dead-task references.
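
The steps above can be sketched as a single-CPU toy model. It is a simplification under stated assumptions: ToyCpu, TaskState, and schedule_once are hypothetical names, task selection is plain round-robin instead of class delegation, and locking, clocks, and the architecture switch are left out.

```rust
use std::collections::VecDeque;

/// What the previous task is doing when it enters the scheduler.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState {
    Runnable, // preempted or yielding: stays eligible to run
    Blocking, // going to sleep: must be deactivated
}

/// Toy per-CPU state. `curr` stands in for rq->curr.
pub struct ToyCpu {
    pub curr: u32,
    pub curr_state: TaskState,
    pub runnable: VecDeque<u32>, // queued task ids, excluding `curr`
    pub idle: u32,               // id of the idle task
    pub switches: u64,           // counts real context switches
}

impl ToyCpu {
    /// One pass of the scheduling core. Returns (prev, next).
    pub fn schedule_once(&mut self) -> (u32, u32) {
        let prev = self.curr;
        // A blocking prev is deactivated, so it is simply not re-queued.
        // A still-runnable prev goes to the back of the queue.
        if self.curr_state == TaskState::Runnable && prev != self.idle {
            self.runnable.push_back(prev);
        }
        // Pick the next task, falling back to the idle task.
        let next = self.runnable.pop_front().unwrap_or(self.idle);
        // Publish the choice.
        self.curr = next;
        self.curr_state = TaskState::Runnable;
        // Only a real change of task counts as a context switch.
        if prev != next {
            self.switches += 1;
        }
        (prev, next)
    }
}
```

Calling schedule_once() with a blocking current task and an empty queue selects the idle task, and a second call from idle with nothing queued returns the same task twice without counting a switch.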

Concurrency And Lifetime

The scheduler is built around per-CPU runqueue ownership. Task operations take the task's pi_lock and then the lock of the runqueue the task is currently on. Because a task can migrate while those locks are being acquired, the locking path re-checks the task's runqueue after taking the lock and retries if it changed.
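
The lock-then-verify pattern can be sketched with standard library mutexes. This is a user-space approximation, not the kernel locking code: ToyTask, ToyRunqueue, and this task_rq_lock are hypothetical, std::sync::Mutex stands in for raw spinlocks, and interrupt state is not modeled.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, MutexGuard};

/// Toy runqueue payload protected by the runqueue lock.
pub struct ToyRunqueue {
    pub nr_running: u32,
}

/// Toy task: a per-task lock plus the CPU it currently belongs to.
pub struct ToyTask {
    pub pi_lock: Mutex<()>,
    pub cpu: AtomicUsize, // may be changed by a concurrent migration
}

/// Take the task lock, then the lock of the runqueue the task is on.
/// If the task moved between reading its CPU and locking that runqueue,
/// drop both locks and try again.
pub fn task_rq_lock<'a>(
    task: &'a ToyTask,
    rqs: &'a [Mutex<ToyRunqueue>],
) -> (MutexGuard<'a, ()>, usize, MutexGuard<'a, ToyRunqueue>) {
    loop {
        let pi = task.pi_lock.lock().unwrap();
        let cpu = task.cpu.load(Ordering::Acquire);
        let rq = rqs[cpu].lock().unwrap();
        // Verify under the lock that the task still belongs to this runqueue.
        if task.cpu.load(Ordering::Acquire) == cpu {
            return (pi, cpu, rq);
        }
        // The task migrated while we were acquiring locks: release and retry.
        drop(rq);
        drop(pi);
    }
}
```

Returning both guards keeps the locks held for as long as the caller needs them, so the "which runqueue owns this task" answer cannot go stale while the guards are alive.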

__schedule() requires preemption disabled. Runqueue lock state and the memory barriers around the rq->curr update are part of the user/kernel memory-ordering contract that the membarrier obligations in context_switch() rely on. Dead tasks keep enough lifetime to finish the final context switch away from them; their final references are dropped only after switch cleanup.

Class callbacks are not free-form functions. Their comments in struct sched_class document which runqueue locks must be held for enqueue, dequeue, wakeup, balance, pick, migration, and state transitions.

Resource And Failure Model

Process creation is failure-heavy until publication. copy_process() has many subsystem setup steps and unwind labels so partially initialized tasks do not leak resources. After the no-failure point, the task is visible and cleanup must use normal task lifecycle rules rather than construction unwind.
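
The unwind discipline maps naturally onto Rust destructors. The following is a minimal sketch, assuming hypothetical names (Stage, PublishedTask, toy_copy_process) and only three resources: a failure before publication releases what was acquired in reverse order, while a published task owns its resources until normal teardown.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Shared event log so the acquire/release order can be observed.
pub type Log = Rc<RefCell<Vec<String>>>;

/// A resource that records its own release, standing in for an unwind label.
pub struct Stage {
    name: &'static str,
    log: Log,
}

impl Stage {
    pub fn acquire(name: &'static str, log: &Log) -> Stage {
        log.borrow_mut().push(format!("acquire {name}"));
        Stage { name, log: Rc::clone(log) }
    }
}

impl Drop for Stage {
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("release {}", self.name));
    }
}

/// After publication the task owns its resources until normal teardown.
pub struct PublishedTask {
    _stages: Vec<Stage>,
}

/// Staged constructor. `fail_at_pid` simulates a failure before publication.
pub fn toy_copy_process(log: &Log, fail_at_pid: bool) -> Result<PublishedTask, &'static str> {
    let creds = Stage::acquire("creds", log);
    let mm = Stage::acquire("mm", log);
    if fail_at_pid {
        // Early return: locals drop in reverse order, mm first, then creds.
        return Err("pid allocation failed");
    }
    let pid = Stage::acquire("pid", log);
    // No-failure point: ownership moves into the published task.
    Ok(PublishedTask { _stages: vec![creds, mm, pid] })
}
```

On the failure path the log reads acquire creds, acquire mm, release mm, release creds, which mirrors the reverse-order cleanup that unwind labels provide in C.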

Scheduling classes manage different resource models:

  • Fair scheduling accounts service with virtual runtime, lag, slice, and virtual deadline.
  • RT scheduling accounts fixed-priority runnable queues and runtime throttling so RT work cannot starve lower-priority classes indefinitely.
  • Deadline scheduling accounts reservation bandwidth, runtime, period, absolute deadline, throttling, and replenishment.

Extension Points

The scheduler’s main extension point is struct sched_class. Fair, RT, and deadline all provide class tables through DEFINE_SCHED_CLASS(...). The core does not know each policy’s internal data structure; it calls the class methods under the shared runqueue contract.

Process creation also exposes subsystem hooks: credentials, memory management, files, signal handling, seccomp, cgroups, tracing, and scheduler setup all participate before publication.

C Implementation Walkthrough

Fork To Runnable

kernel/fork.c starts by validating clone constraints at kernel/fork.c:1989-2089. It duplicates the task and core resources through kernel/fork.c:2100-2259, then performs subsystem copy/allocation and PID allocation at kernel/fork.c:2262-2305. Cgroup permission and scheduler-cgroup placement happen before external visibility at kernel/fork.c:2386-2407. The visible publication and no-failure phase run through kernel/fork.c:2420-2558, while failures before that point unwind through kernel/fork.c:2560-2618.

Scheduler-specific setup happens in kernel/sched/core.c. sched_fork() marks the child TASK_NEW, resets policy, selects the class, and initializes state at kernel/sched/core.c:4803-4871. sched_cgroup_fork() attaches task group and initial CPU at kernel/sched/core.c:4874-4901. wake_up_new_task() publishes the first runnable state at kernel/sched/core.c:4934-4965.

Schedule To Context Switch

The public schedule() wrapper is at kernel/sched/core.c:7312-7325. __schedule() handles the core operation at kernel/sched/core.c:7055-7236: lock the runqueue, handle blocking/deactivation, pick the next task, publish rq->curr, trace the switch, and enter context_switch() when needed.

context_switch() performs memory-map switching, membarrier/rseq obligations, lock handoff, and the architecture switch_to() boundary at kernel/sched/core.c:5441-5505. Switch pairing and cleanup are handled by prepare_task_switch(), finish_task_switch(), and schedule_tail() at kernel/sched/core.c:5278-5439.

Fair Class

Fair scheduling uses EEVDF. Weighted average virtual runtime is computed at kernel/sched/fair.c:768-814. Lag computation and clamping appear at kernel/sched/fair.c:818-875; eligibility avoids lossy division at kernel/sched/fair.c:877-915. The EEVDF selection comment and augmented tree strategy are at kernel/sched/fair.c:1117-1135. Runtime accounting and virtual deadline refresh happen in update_curr() at kernel/sched/fair.c:1982-2007. The fair callback table is registered at kernel/sched/fair.c:15352-15400.
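
The division-free eligibility test can be shown with a small model. This is a sketch under assumptions: Entity, ToyCfsRq, and the method names are hypothetical, vruntime values are kept relative to min_vruntime as the note describes, and a linear scan replaces the augmented tree. An entity is eligible when its vruntime does not exceed the weighted average V, and comparing against sum(w_i) * (v - v0) avoids computing V by division.

```rust
/// Toy scheduling entity.
pub struct Entity {
    pub vruntime: i64,
    pub weight: i64,
    pub deadline: i64, // virtual deadline
}

/// Toy fair runqueue: a flat list instead of an augmented tree.
pub struct ToyCfsRq {
    pub min_vruntime: i64, // reference point v0
    pub entities: Vec<Entity>,
}

impl ToyCfsRq {
    /// Returns (sum of w_i * (v_i - v0), sum of w_i).
    fn weighted_sums(&self) -> (i64, i64) {
        self.entities.iter().fold((0, 0), |(avg, load), e| {
            (avg + e.weight * (e.vruntime - self.min_vruntime), load + e.weight)
        })
    }

    /// Eligible means lag >= 0, i.e. vruntime <= V where
    /// V = v0 + avg / load. Multiplying through by load avoids the division.
    pub fn is_eligible(&self, vruntime: i64) -> bool {
        let (avg, load) = self.weighted_sums();
        avg >= (vruntime - self.min_vruntime) * load
    }

    /// Among eligible entities, pick the earliest virtual deadline.
    pub fn pick_eevdf(&self) -> Option<&Entity> {
        let (avg, load) = self.weighted_sums();
        self.entities
            .iter()
            .filter(|e| avg >= (e.vruntime - self.min_vruntime) * load)
            .min_by_key(|e| e.deadline)
    }
}
```

With v0 = 90 and entities (v=100, w=1), (v=110, w=1), (v=90, w=2), the weighted average is 97.5. A truncating division would report 97, so the multiplied form is what keeps the boundary exact: a vruntime of 97 is eligible and 98 is not.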

Real-Time Class

RT scheduling initializes priority queues and bitmap state at kernel/sched/rt.c:68-95. Runtime throttling is timer/bandwidth-backed through kernel/sched/rt.c:125-134, checked at kernel/sched/rt.c:863-904, and charged by update_curr_rt() at kernel/sched/rt.c:970-990. Selection uses the first active priority bit and FIFO list head at kernel/sched/rt.c:1682-1698. The RT callback table is at kernel/sched/rt.c:2601-2637.
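
The bitmap-plus-lists selection can be sketched as follows. ToyRtQueue and its methods are hypothetical, a u128 stands in for the kernel bitmap, and the sketch assumes that a lower index means higher priority, so the first set bit identifies the queue to serve.

```rust
use std::collections::VecDeque;

/// Number of RT priority levels in this toy model.
pub const MAX_RT_PRIO: usize = 100;

/// Priority array: one FIFO list per level plus a bitmap of non-empty levels.
pub struct ToyRtQueue {
    bitmap: u128, // bit p is set exactly when queues[p] is non-empty
    queues: Vec<VecDeque<u32>>,
}

impl ToyRtQueue {
    pub fn new() -> Self {
        ToyRtQueue {
            bitmap: 0,
            queues: (0..MAX_RT_PRIO).map(|_| VecDeque::new()).collect(),
        }
    }

    /// Append to the tail of the level's list and mark the level active.
    pub fn enqueue(&mut self, prio: usize, task: u32) {
        assert!(prio < MAX_RT_PRIO);
        self.queues[prio].push_back(task);
        self.bitmap |= 1u128 << prio;
    }

    /// First active priority bit, then the head of that FIFO list.
    pub fn pick(&self) -> Option<u32> {
        if self.bitmap == 0 {
            return None;
        }
        let prio = self.bitmap.trailing_zeros() as usize;
        self.queues[prio].front().copied()
    }

    /// Remove a task and clear the bit if its level became empty.
    pub fn dequeue(&mut self, prio: usize, task: u32) -> bool {
        let q = &mut self.queues[prio];
        match q.iter().position(|&t| t == task) {
            Some(pos) => {
                q.remove(pos);
                if q.is_empty() {
                    self.bitmap &= !(1u128 << prio);
                }
                true
            }
            None => false,
        }
    }
}
```

Keeping the bitmap update inside enqueue and dequeue is the point of the abstraction: callers cannot leave the bitmap and the lists out of sync.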

Deadline Class

Deadline scheduling initializes its RB-tree runqueue and bandwidth counters at kernel/sched/deadline.c:519-532. CBS setup/replenishment is at kernel/sched/deadline.c:724-799, runtime accounting starts around kernel/sched/deadline.c:1345-1368, and current execution is charged at kernel/sched/deadline.c:2124-2147. Enqueue/dequeue and replenishment logic run at kernel/sched/deadline.c:2356-2488. Selection uses the cached leftmost deadline entity at kernel/sched/deadline.c:2769-2844, and the deadline class table is at kernel/sched/deadline.c:3644-3676.
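
The charge, throttle, and replenish cycle can be sketched in a few lines. ToyDlEntity is a hypothetical type, the relative deadline is assumed equal to the period, and timers, bandwidth admission, and the RB-tree are omitted.

```rust
/// Toy constant-bandwidth-server entity. Times are in arbitrary units.
pub struct ToyDlEntity {
    pub dl_runtime: i64, // budget granted per period
    pub dl_period: u64,  // replenishment period
    pub runtime: i64,    // remaining budget; may go negative on overrun
    pub deadline: u64,   // current absolute deadline
    pub throttled: bool,
}

impl ToyDlEntity {
    /// Start a fresh reservation at time `now` with a full budget.
    pub fn new(dl_runtime: i64, dl_period: u64, now: u64) -> Self {
        assert!(dl_runtime > 0 && dl_period > 0);
        ToyDlEntity {
            dl_runtime,
            dl_period,
            runtime: dl_runtime,
            deadline: now + dl_period, // assumption: deadline == period
            throttled: false,
        }
    }

    /// Charge executed time; an exhausted budget throttles the entity.
    pub fn charge(&mut self, delta: i64) {
        self.runtime -= delta;
        if self.runtime <= 0 {
            self.throttled = true;
        }
    }

    /// Replenish: push the deadline out one period per budget refill until
    /// the budget is positive again, so overruns are paid back.
    pub fn replenish(&mut self) {
        while self.runtime <= 0 {
            self.deadline += self.dl_period;
            self.runtime += self.dl_runtime;
        }
        self.throttled = false;
    }
}
```

For a reservation of 10 units every 100, charging 4 and then 19 leaves the budget at -13. Replenishment then needs two periods to bring it back to 7, which moves the absolute deadline from 100 to 300.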

Rust Translation

Model task creation as a typestate pipeline:

  • validated clone request;
  • allocated unpublished task;
  • resources copied or shared;
  • pid and scheduler/cgroup placement reserved;
  • visible but not runnable;
  • runnable or fully unwound.
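
The stages listed above can be sketched with one type per state, so that skipping a stage fails to compile. All type and method names are hypothetical. The CLONE_* constants use the Linux UAPI values, and the validation rule (CLONE_THREAD needs CLONE_SIGHAND, which needs CLONE_VM) is only one of the many checks that copy_process() performs.

```rust
pub const CLONE_VM: u64 = 0x0000_0100;
pub const CLONE_SIGHAND: u64 = 0x0000_0800;
pub const CLONE_THREAD: u64 = 0x0001_0000;

#[derive(Debug, PartialEq, Eq)]
pub enum ForkError {
    BadFlags,
    NoPid,
}

// One type per lifecycle stage. Each transition consumes the previous stage.
pub struct Validated { flags: u64 }
pub struct Allocated { flags: u64 }
pub struct Resourced { shares_vm: bool }
pub struct Placed { pid: u32, shares_vm: bool }
pub struct Visible { pid: u32, shares_vm: bool }
pub struct Runnable { pub pid: u32, pub shares_vm: bool }

/// Stage 1: reject inconsistent flag combinations.
pub fn validate(flags: u64) -> Result<Validated, ForkError> {
    let thread_ok = (flags & CLONE_THREAD) == 0 || (flags & CLONE_SIGHAND) != 0;
    let sighand_ok = (flags & CLONE_SIGHAND) == 0 || (flags & CLONE_VM) != 0;
    if thread_ok && sighand_ok {
        Ok(Validated { flags })
    } else {
        Err(ForkError::BadFlags)
    }
}

impl Validated {
    /// Stage 2: allocate the unpublished task.
    pub fn allocate(self) -> Allocated { Allocated { flags: self.flags } }
}

impl Allocated {
    /// Stage 3: copy or share resources according to the flags.
    pub fn copy_resources(self) -> Resourced {
        Resourced { shares_vm: (self.flags & CLONE_VM) != 0 }
    }
}

impl Resourced {
    /// Stage 4: reserve a pid and placement. Still allowed to fail; on error
    /// `self` is dropped, which is the "fully unwound" outcome.
    pub fn place(self, free_pid: Option<u32>) -> Result<Placed, ForkError> {
        let pid = free_pid.ok_or(ForkError::NoPid)?;
        Ok(Placed { pid, shares_vm: self.shares_vm })
    }
}

impl Placed {
    /// Stage 5: publication. Infallible by construction.
    pub fn publish(self) -> Visible { Visible { pid: self.pid, shares_vm: self.shares_vm } }
}

impl Visible {
    /// Stage 6: first wakeup.
    pub fn wake(self) -> Runnable { Runnable { pid: self.pid, shares_vm: self.shares_vm } }
}
```

Because publish() and wake() return plain values rather than Result, the no-failure phase is visible in the signatures themselves.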

Represent runqueue access with guard types that encode lock and interrupt state. If scheduling classes stay dynamic, translate struct sched_class into sealed traits or static vtables whose method signatures require the proper guards. Keep the architecture context switch as a narrow unsafe boundary surrounded by safe accounting, memory-map, trace, and lifetime code.
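
A guard-typed, sealed class trait might look like the sketch below. Runqueue, RqLocked, SchedClass, and FifoClass are hypothetical names, std::sync::Mutex stands in for the runqueue spinlock, and the guard only proves that the lock is held; interrupt state would need a second token and is not modeled here.

```rust
use std::sync::{Mutex, MutexGuard};

/// State that may only be touched with the runqueue lock held.
pub struct RqState {
    nr_running: u32,
    queue: Vec<u32>,
}

pub struct Runqueue {
    inner: Mutex<RqState>,
}

/// Proof token: holding an RqLocked means the runqueue lock is held.
pub struct RqLocked<'a> {
    state: MutexGuard<'a, RqState>,
}

impl Runqueue {
    pub fn new() -> Self {
        Runqueue { inner: Mutex::new(RqState { nr_running: 0, queue: Vec::new() }) }
    }

    /// The only way to obtain the proof token is to take the lock.
    pub fn lock(&self) -> RqLocked<'_> {
        RqLocked { state: self.inner.lock().unwrap() }
    }
}

impl RqLocked<'_> {
    pub fn nr_running(&self) -> u32 {
        self.state.nr_running
    }
}

mod sealed {
    /// Private supertrait: code outside this module cannot name it in a
    /// crate setting, so it cannot add new scheduling classes.
    pub trait Sealed {}
}

/// Class callbacks demand the guard, so "lock must be held" is a type error
/// to violate rather than a comment to remember.
pub trait SchedClass: sealed::Sealed {
    fn enqueue_task(&self, rq: &mut RqLocked<'_>, task: u32);
    fn pick_task(&self, rq: &mut RqLocked<'_>) -> Option<u32>;
}

/// A trivial FIFO policy used only to exercise the trait.
pub struct FifoClass;

impl sealed::Sealed for FifoClass {}

impl SchedClass for FifoClass {
    fn enqueue_task(&self, rq: &mut RqLocked<'_>, task: u32) {
        rq.state.queue.push(task);
        rq.state.nr_running += 1;
    }

    fn pick_task(&self, rq: &mut RqLocked<'_>) -> Option<u32> {
        rq.state.queue.first().copied()
    }
}
```

Callers write `let mut guard = rq.lock();` and pass `&mut guard` to each callback; the lock is released when the guard goes out of scope.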

Fair scheduling needs a deliberate intrusive or arena-backed tree for scheduling entities. RT needs a fixed priority-array abstraction that keeps bitmap and lists consistent. Deadline needs validated reservation types so invalid runtime/deadline/period triples cannot enter the scheduler after admission.
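
A validated reservation type is the simplest of these to sketch. Reservation and AdmissionError are hypothetical names, and the rule encoded here (0 < runtime <= deadline <= period) is a reduced version of the parameter checks; minimum-runtime limits and global bandwidth admission are left out.

```rust
/// A reservation that can only exist if its parameters are consistent.
/// Fields are private, so `new` is the only constructor.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Reservation {
    runtime_ns: u64,
    deadline_ns: u64,
    period_ns: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AdmissionError {
    ZeroRuntime,
    RuntimeExceedsDeadline,
    DeadlineExceedsPeriod,
}

impl Reservation {
    /// Enforce 0 < runtime <= deadline <= period.
    pub fn new(runtime_ns: u64, deadline_ns: u64, period_ns: u64) -> Result<Self, AdmissionError> {
        if runtime_ns == 0 {
            return Err(AdmissionError::ZeroRuntime);
        }
        if runtime_ns > deadline_ns {
            return Err(AdmissionError::RuntimeExceedsDeadline);
        }
        if deadline_ns > period_ns {
            return Err(AdmissionError::DeadlineExceedsPeriod);
        }
        Ok(Reservation { runtime_ns, deadline_ns, period_ns })
    }

    /// Utilization in parts per million, rounded down. The period is known
    /// to be non-zero because validation already passed.
    pub fn utilization_ppm(&self) -> u64 {
        (self.runtime_ns as u128 * 1_000_000 / self.period_ns as u128) as u64
    }
}
```

Scheduler code that accepts a Reservation rather than three raw integers never has to re-check the triple, because an invalid one cannot be constructed.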

AI-Native Translation

Agent runtimes should copy the lifecycle shape:

  • Jobs are validated and initialized before publication.
  • Runnable state is explicit and observable.
  • Scheduling classes have budgets, priorities, and admission checks.
  • Context switches are telemetry events.
  • High-priority jobs are throttled so they cannot starve lower-priority work indefinitely.
  • Deadline jobs have reservation validation and replenishment.

The scheduler’s class split is a useful model for AI workloads: fair background jobs, privileged latency-sensitive jobs, and deadline-bound jobs should have different data structures and invariants but share a common runqueue contract.

Evidence Table

Source | Evidence
kernel/sched/sched.h | struct rq and runqueue lock-ordering at lines 1128-1189; struct rq_flags at lines 1849-1858; struct sched_class callback contract at lines 2585-2653.
kernel/sched/core.c | runqueue storage at lines 131-132; task/runqueue locking retry at lines 732-775; fork setup at lines 4803-4901; first runnable wakeup at lines 4934-4965; switch prep/finish at lines 5278-5439; context switch at lines 5441-5505; __schedule() at lines 7055-7236; schedule() at lines 7312-7325.
kernel/sched/fair.c | average vruntime at lines 768-814; lag and eligibility at lines 818-915; EEVDF selection comments at lines 1117-1135; accounting at lines 1982-2007; fair class table at lines 15352-15400.
kernel/sched/rt.c | RT runqueue init at lines 68-95; bandwidth init at lines 125-134; throttling at lines 863-904; runtime charging at lines 970-990; priority selection at lines 1682-1698; RT class table at lines 2601-2637.
kernel/sched/deadline.c | deadline design comment at lines 3-11; bandwidth helpers at lines 213-289; deadline runqueue init at lines 519-532; CBS replenishment at lines 724-799; runtime accounting at lines 1345-1368 and 2124-2147; enqueue/dequeue at lines 2356-2488; selection at lines 2769-2844; class table at lines 3644-3676; parameter validation at lines 3879-3930.
kernel/fork.c | task cache setup at lines 854-897; task duplication at lines 914-1018; memory copying at lines 1522-1598; fs/files copying at lines 1616-1665; signal setup at lines 1667-1770; clone validation at lines 1989-2089; task/resource setup at lines 2100-2305; pre-visibility cgroup/scheduler placement at lines 2386-2407; publication at lines 2420-2558; unwind at lines 2560-2618; public clone/thread entrypoints at lines 2671-3048.

Source Notes

  • file-notes/linux__kernel__sched__sched.h.md
  • file-notes/linux__kernel__sched__core.c.md
  • file-notes/linux__kernel__sched__fair.c.md
  • file-notes/linux__kernel__sched__rt.c.md
  • file-notes/linux__kernel__sched__deadline.c.md
  • file-notes/linux__kernel__fork.c.md
  • file-notes/linux__Documentation__scheduler__sched-design-CFS.rst.md
  • file-notes/linux__Documentation__scheduler__sched-eevdf.rst.md
  • file-notes/linux__Documentation__scheduler__sched-rt-group.rst.md
  • file-notes/linux__Documentation__scheduler__sched-deadline.rst.md