linux/kernel/sched/core.c

Imported from _research/manual-study-linux/file-notes/linux__kernel__sched__core.c.md.

File Notes: kernel/sched/core.c

Status: reviewed.

Purpose

Implements the scheduler core: per-CPU runqueue storage, task/runqueue locking, fork-time scheduler initialization, wakeup, task selection, context switching, and the public schedule() entrypoint.

Key Types And Functions

  • runqueues: per-CPU struct rq storage.
  • ___task_rq_lock() / _task_rq_lock(): task-to-runqueue locking protocol.
  • sched_fork(), sched_cgroup_fork(), sched_post_fork(): the scheduler phases of task creation.
  • wake_up_new_task(): first publication to runnable state.
  • pick_next_task(): core pick path, with core-scheduling support when enabled.
  • __schedule() and schedule(): main scheduling loop.
  • context_switch() and finish_task_switch(): memory/register switch and post-switch cleanup.

Data Flow

Each CPU owns a runqueue, declared as DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues). Task operations take the task’s pi_lock, look up its current runqueue, acquire that runqueue's lock, and then verify that the task did not migrate while the lock was being acquired. If it did, the code drops the locks it took and retries.
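The lock-then-revalidate protocol above can be sketched in userspace Rust. This is a minimal model, not the kernel's code: Task, RunQueue, TaskRqGuard, and enqueue are hypothetical names, std::sync::Mutex stands in for the kernel's raw spinlocks, and the migration check is reduced to re-reading a single cpu field.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, MutexGuard};

/// Hypothetical stand-in for `struct rq`: only a list of queued task ids.
pub struct RunQueue {
    pub queued: Mutex<Vec<u32>>,
}

/// Hypothetical stand-in for `task_struct`. `cpu` names the runqueue the task
/// belongs to and may change while a locker is waiting.
pub struct Task {
    pub id: u32,
    pub cpu: AtomicUsize,
    pub pi_lock: Mutex<()>,
}

/// Holds both locks. Fields drop in declaration order, so the runqueue lock
/// is released before the task lock (reverse of acquisition).
pub struct TaskRqGuard<'a> {
    pub cpu: usize,
    pub rq: MutexGuard<'a, Vec<u32>>,
    _pi: MutexGuard<'a, ()>,
}

pub fn task_rq_lock<'a>(task: &'a Task, rqs: &'a [RunQueue]) -> TaskRqGuard<'a> {
    loop {
        let pi = task.pi_lock.lock().unwrap();
        let cpu = task.cpu.load(Ordering::Acquire);
        let rq = rqs[cpu].queued.lock().unwrap();
        // Revalidate: the task may have moved while we waited for the rq lock.
        if task.cpu.load(Ordering::Acquire) == cpu {
            return TaskRqGuard { cpu, rq, _pi: pi };
        }
        // Wrong runqueue: release both locks and try again.
        drop(rq);
        drop(pi);
        std::hint::spin_loop();
    }
}

/// Example operation performed under both locks; returns the CPU used.
pub fn enqueue(task: &Task, rqs: &[RunQueue]) -> usize {
    let mut guard = task_rq_lock(task, rqs);
    guard.rq.push(task.id);
    guard.cpu
}
```

Returning one guard that owns both locks keeps the release order fixed by the type rather than by caller discipline.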

Fork setup marks the task TASK_NEW, resets inherited scheduling policy where needed, chooses a scheduler class, initializes runtime accounting, and keeps the child off CPU. Cgroup fork setup assigns the scheduler task group and CPU, then calls class-specific task_fork. wake_up_new_task() changes the state to TASK_RUNNING, selects a CPU, locks the runqueue, activates the task, traces the wakeup, and invokes class preemption logic.

schedule() calls __schedule_loop(), which disables preemption and invokes __schedule() until rescheduling is no longer needed. __schedule() locks the runqueue, handles blocking/deactivation, calls pick_next_task(), publishes rq->curr, traces the switch, and enters context_switch() when prev != next.
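The loop shape described above can be illustrated with a single-CPU toy model. Everything here is an assumption made for illustration: Rq, IDLE, and schedule_once are hypothetical, the pick policy is plain FIFO rather than a scheduler class, and locking and preemption control are omitted.

```rust
use std::collections::VecDeque;

/// Hypothetical id of the idle task, picked when nothing else is runnable.
pub const IDLE: u32 = 0;

/// Hypothetical single-CPU runqueue model.
pub struct Rq {
    pub curr: u32,                 // id of the task currently on CPU
    pub runnable: VecDeque<u32>,   // queued tasks, FIFO as a placeholder policy
    pub need_resched: bool,        // set when another pass is required
    pub switches: Vec<(u32, u32)>, // trace of (prev, next) switches
}

impl Rq {
    /// One pass: handle blocking, pick the next task, publish `curr`,
    /// and record a switch only when prev != next.
    pub fn schedule_once(&mut self, prev_blocks: bool) {
        let prev = self.curr;
        self.need_resched = false;
        if !prev_blocks && prev != IDLE {
            // Still runnable: stays a candidate, at the tail of the queue.
            self.runnable.push_back(prev);
        }
        let next = self.runnable.pop_front().unwrap_or(IDLE);
        if prev != next {
            self.curr = next;
            self.switches.push((prev, next));
        }
    }

    /// Repeat until no further rescheduling has been requested.
    pub fn schedule(&mut self, prev_blocks: bool) {
        let mut blocks = prev_blocks;
        loop {
            self.schedule_once(blocks);
            blocks = false;
            if !self.need_resched {
                break;
            }
        }
    }
}
```

In this model nothing sets need_resched during a pass, so the loop body runs once per call; the loop is kept only to mirror the described structure.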

context_switch() runs pre-switch hooks, switches the active memory map, handles membarrier/rseq obligations, prepares lock transfer, and calls switch_to() for the architecture register/stack switch. The returning task then completes cleanup through finish_task_switch().

Invariants And Safety Contracts

  • TASK_NEW prevents a newly forked task from being run or externally woken before scheduler initialization finishes.
  • Task/runqueue locking retries if a task migrates during lock acquisition.
  • __schedule() must be called with preemption disabled.
  • The runqueue lock and the memory barriers around the rq->curr update are part of the user/kernel memory-ordering contract (the membarrier obligations handled in context_switch() depend on them), not just scheduler-local details.
  • Dead tasks drop their final task reference only after the last context switch away from them.
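The last invariant can be modelled with reference counting. This is a rough sketch under assumptions: Task, Cpu, and the prev slot are hypothetical, Arc stands in for the kernel's task reference count, and the architecture switch is only a comment.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for `task_struct`.
pub struct Task {
    pub pid: u32,
    pub dead: bool,
}

/// Hypothetical per-CPU state: the running task plus a slot that carries the
/// previous task across the switch.
pub struct Cpu {
    pub curr: Arc<Task>,
    prev: Option<Arc<Task>>,
}

impl Cpu {
    pub fn new(first: Arc<Task>) -> Self {
        Cpu { curr: first, prev: None }
    }

    /// Switch to `next`. A live previous task is handed back to the caller
    /// (for example to requeue it); a dead one is released.
    pub fn context_switch(&mut self, next: Arc<Task>) -> Option<Arc<Task>> {
        let prev = std::mem::replace(&mut self.curr, next);
        self.prev = Some(prev);
        // The architecture-specific register/stack switch would happen here.
        self.finish_task_switch()
    }

    /// Runs after the switch, so the previous task is no longer on CPU when
    /// its reference is dropped.
    fn finish_task_switch(&mut self) -> Option<Arc<Task>> {
        let prev = self.prev.take()?;
        if prev.dead {
            drop(prev); // last reference goes away only now
            None
        } else {
            Some(prev)
        }
    }
}
```

Parking the previous task in a slot until after the switch is what guarantees that nothing frees it while it is still the running context.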

Rust Translation Guidance

Build task creation as a phase-typed pipeline: allocated task, scheduler initialized task, cgroup-attached task, published task, runnable task. Use RAII guards for task and runqueue locks, with retry semantics for migration. Keep the architecture context switch as a very small unsafe boundary; surround it with safe accounting, memory-map, and lifecycle code.
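One possible shape for the phase-typed pipeline is the typestate pattern. The sketch below is an assumption about how it could look, not an existing API: the marker types, the Task fields, and the publish step are hypothetical, and the method names only echo the C functions they loosely correspond to.

```rust
use std::marker::PhantomData;

// Hypothetical phase markers, one per stage named in the guidance above.
pub struct Allocated;
pub struct SchedInitialized;
pub struct CgroupAttached;
pub struct Published;
pub struct Runnable;

/// Hypothetical task record whose type parameter records its phase.
pub struct Task<Phase> {
    pub pid: u32,
    pub cpu: Option<usize>,
    pub group: Option<String>,
    _phase: PhantomData<Phase>,
}

impl<P> Task<P> {
    // Private: phases can only change through the methods below.
    fn into_phase<Q>(self) -> Task<Q> {
        Task { pid: self.pid, cpu: self.cpu, group: self.group, _phase: PhantomData }
    }
}

impl Task<Allocated> {
    pub fn new(pid: u32) -> Self {
        Task { pid, cpu: None, group: None, _phase: PhantomData }
    }
    /// Scheduler state initialized; the task is still off CPU.
    pub fn sched_fork(self) -> Task<SchedInitialized> {
        self.into_phase()
    }
}

impl Task<SchedInitialized> {
    /// Attach the task group and the initial CPU.
    pub fn sched_cgroup_fork(mut self, group: &str, cpu: usize) -> Task<CgroupAttached> {
        self.group = Some(group.to_string());
        self.cpu = Some(cpu);
        self.into_phase()
    }
}

impl Task<CgroupAttached> {
    /// Placeholder for the point where the task becomes visible to others.
    pub fn publish(self) -> Task<Published> {
        self.into_phase()
    }
}

impl Task<Published> {
    /// First enqueue: only a published task can reach a runqueue.
    pub fn wake_up_new_task(self, rq: &mut Vec<u32>) -> Task<Runnable> {
        rq.push(self.pid);
        self.into_phase()
    }
}
```

Because each method consumes self and exists only on one phase, skipping or reordering a step (for example waking a task before its cgroup is attached) fails to compile instead of failing at run time.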

AI-Native Systems Guidance

Agent job schedulers should copy the lifecycle shape, not the code: validate and initialize jobs before publication, use explicit runnable-state transitions, emit tracepoints around wakeup/switch, and make the context switch between jobs a policy-observable event.
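A small sketch of that lifecycle shape for a job scheduler follows. All names (JobState, TraceEvent, JobScheduler) are hypothetical, and two simplifications are assumed: a job that is switched away from is treated as finished, and the pick policy is "first runnable job".

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobState {
    New,
    Runnable,
    Running,
    Done,
}

/// Events a policy layer could observe.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TraceEvent {
    Wakeup { job: u32 },
    Switch { prev: Option<u32>, next: Option<u32> },
}

pub struct Job {
    pub id: u32,
    pub state: JobState,
}

pub struct JobScheduler {
    pub jobs: Vec<Job>,
    pub current: Option<u32>,
    pub trace: Vec<TraceEvent>,
}

impl JobScheduler {
    pub fn new() -> Self {
        JobScheduler { jobs: Vec::new(), current: None, trace: Vec::new() }
    }

    /// Register a job; it is not schedulable until `wake` publishes it.
    pub fn submit(&mut self, id: u32) {
        self.jobs.push(Job { id, state: JobState::New });
    }

    /// Explicit New -> Runnable transition, traced as a wakeup.
    pub fn wake(&mut self, id: u32) -> Result<(), String> {
        let job = self
            .jobs
            .iter_mut()
            .find(|j| j.id == id)
            .ok_or_else(|| format!("unknown job {}", id))?;
        if job.state != JobState::New {
            return Err(format!("job {} is not in state New", id));
        }
        job.state = JobState::Runnable;
        self.trace.push(TraceEvent::Wakeup { job: id });
        Ok(())
    }

    /// Finish the current job, pick the first runnable one, and trace the
    /// switch whenever the running job actually changes.
    pub fn switch(&mut self) -> Option<u32> {
        let prev = self.current;
        if let Some(p) = prev {
            if let Some(j) = self.jobs.iter_mut().find(|j| j.id == p) {
                j.state = JobState::Done;
            }
        }
        let next = self
            .jobs
            .iter_mut()
            .find(|j| j.state == JobState::Runnable)
            .map(|j| {
                j.state = JobState::Running;
                j.id
            });
        self.current = next;
        if prev != next {
            self.trace.push(TraceEvent::Switch { prev, next });
        }
        next
    }
}
```

Keeping the trace as plain data makes every wakeup and switch inspectable by whatever policy or audit layer sits above the scheduler.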

Evidence

  • Per-CPU runqueues are defined at kernel/sched/core.c:131-132.
  • Task/runqueue locking retries on task migration at kernel/sched/core.c:732-749 and explains acquire/release ordering at kernel/sched/core.c:759-775.
  • ttwu_runnable() serializes against schedule() and either restores a queued task to running state or falls back to full wakeup at kernel/sched/core.c:3857-3888.
  • sched_fork() marks the child TASK_NEW, resets inherited policy, selects the class, and initializes scheduling state at kernel/sched/core.c:4803-4871.
  • sched_cgroup_fork() attaches task group and initial CPU at kernel/sched/core.c:4874-4901.
  • wake_up_new_task() publishes the first runnable state and enqueues the task at kernel/sched/core.c:4934-4965.
  • prepare_task_switch(), finish_task_switch(), and schedule_tail() define switch pairing and first-run behavior at kernel/sched/core.c:5278-5439.
  • context_switch() handles memory-map and register/stack switching at kernel/sched/core.c:5441-5505.
  • pick_next_task() delegates to __pick_next_task() unless core scheduling is enabled at kernel/sched/core.c:6210-6265 and kernel/sched/core.c:6664-6669.
  • __schedule() locks the runqueue, chooses next, publishes rq->curr, and calls context_switch() at kernel/sched/core.c:7055-7236; schedule() wraps it at kernel/sched/core.c:7312-7325.