Documentation/driver-api/hw-recoverable-errors.rst
Source file repositories/reference/linux-study-clean/Documentation/driver-api/hw-recoverable-errors.rst
File Facts
- System: Linux kernel
- Corpus path: Documentation/driver-api/hw-recoverable-errors.rst
- Extension: .rst
- Size: 2106 bytes
- Lines: 61
- Domain: Support Tooling And Documentation
- Bucket: Documentation
- Inferred role: Support Tooling And Documentation: documentation
- Status: atlas-only
Why This File Exists
Repository support layer: documentation, build tooling, samples, user-space helper tools, generated initramfs support, licenses, and validation utilities.
- Defines or uses C structs; map object ownership, embedded links, reference counts, and lock ownership.
Dependency Surface
- No C-style include directives detected by the generator.
Detected Declarations
function example
Annotated Snippet
.. SPDX-License-Identifier: GPL-2.0
=================================================
Recoverable Hardware Error Tracking in vmcoreinfo
=================================================
Overview
--------
This feature provides a generic infrastructure within the Linux kernel to track
and log recoverable hardware errors. These are recoverable hardware errors
visible to the kernel that might not cause an immediate panic but may affect
system health, mainly because they cause new code paths to be executed in the
kernel.
By recording counts and timestamps of recoverable errors into the vmcoreinfo
crash dump notes, this infrastructure aids post-mortem crash analysis tools in
correlating hardware events with kernel failures. This enables faster triage
and better understanding of root causes, especially in large-scale cloud
environments where hardware issues are common.
Benefits
--------
- Facilitates correlation of recoverable hardware errors with kernel panics or
unusual code paths that lead to system crashes.
- Gives operators and cloud providers quick insight, improving reliability
and reducing troubleshooting time.
- Complements existing full hardware diagnostics without replacing them.
Data Exposure and Consumption
-----------------------------
- The tracked error data consists of per-error-type counts and timestamps of
last occurrence.
- This data is stored in the `hwerror_data` array, categorized by error source
types like CPU, memory, PCI, CXL, and others.
- It is exposed via vmcoreinfo crash dump notes and can be read using tools
like `crash`, `drgn`, or other kernel crash analysis utilities.
- Crash dumps are the only way to read this data.
- Errors are grouped by area: CPU, memory, PCI, CXL, and others.
Typical usage example (in drgn REPL):

.. code-block:: python

    >>> prog['hwerror_data']
    (struct hwerror_info[HWERR_RECOV_MAX]){
        {
            .count = (int)844,
            .timestamp = (time64_t)1752852018,
        },
        ...
    }
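The raw counts and epoch timestamps in the output above are easier to triage once converted to readable times. The following is a minimal sketch, not part of the kernel documentation: `SOURCE_LABELS` and `summarize` are hypothetical names, and the label order (CPU first) is an assumption based on the order in which this document lists the error sources. The real enum ordering should be checked against the kernel headers.

```python
from datetime import datetime, timezone

# Assumed labels, in the order this document lists the error sources.
# The actual enum names and ordering may differ; verify against the kernel.
SOURCE_LABELS = ["cpu", "memory", "pci", "cxl", "others"]


def summarize(entries):
    """Return (label, count, last_seen_utc) for every source with count > 0.

    `entries` is a sequence of (count, timestamp) pairs, one per error source,
    where `timestamp` is seconds since the Unix epoch.
    """
    rows = []
    for label, (count, timestamp) in zip(SOURCE_LABELS, entries):
        if count == 0:
            continue  # no recoverable error recorded for this source
        last_seen = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        rows.append((label, count, last_seen.isoformat()))
    return rows


# First pair copied by hand from the drgn output above; the remaining slots
# are zero-filled purely for illustration.
sample = [(844, 1752852018), (0, 0), (0, 0), (0, 0), (0, 0)]
for row in summarize(sample):
    print(row)
```

In a live drgn session, the input pairs could presumably be built with something like `[(e.count.value_(), e.timestamp.value_()) for e in prog['hwerror_data']]`, though that has not been verified against a real dump here.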
Enabling
--------
- This feature is enabled when CONFIG_VMCORE_INFO is set.
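Whether a given kernel was built with this option can be checked from its build configuration. The sketch below is an illustration only: `config_enabled` is a hypothetical helper, and the sample text is made up. It assumes the usual `.config` format, where an enabled boolean option appears as `CONFIG_FOO=y` and a disabled one as `# CONFIG_FOO is not set`.

```python
def config_enabled(config_text, option):
    """Return True if `option` is set to y in kernel .config-style text."""
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments (including "# CONFIG_FOO is not set") and blank lines.
        if line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name == option:
            return value == "y"
    return False


# Made-up configuration fragment, for illustration only.
sample_config = """\
CONFIG_VMCORE_INFO=y
# CONFIG_EXAMPLE_DISABLED is not set
"""

print(config_enabled(sample_config, "CONFIG_VMCORE_INFO"))
print(config_enabled(sample_config, "CONFIG_EXAMPLE_DISABLED"))
```

On many distributions the configuration text is available as `/boot/config-<release>` or, when the kernel exposes it, as `/proc/config.gz`. The location varies, so treat both paths as possibilities rather than guarantees.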
Annotation
- Detected declarations: `function example`.
- Atlas domain: Support Tooling And Documentation / Documentation.
- Implementation status: atlas-only.
Implementation Notes
- This generated page is the file-by-file coverage layer; curated subsystem chapters should link here when they synthesize a multi-file control flow.
- Core OS pages should be promoted from atlas-only to deep-reviewed when they explain data structures, invariants, locking, lifecycle, and C implementation snippets.
- Driver-family pages are intentionally pattern-oriented unless they are part of the selected PCIe/NVMe representative device path.