CIQ

Vendor Kernels, Bugs and Stability

Ronnie Sahlberg, Jonathan Maple, Jeremy Allison
May 14, 2024

Introduction

Linux vendor kernels are currently created by taking a frozen snapshot of a specific Linux release, identified by a git reference or git tag, and then back-porting selected fixes as the upstream git tree changes. Changes are selected mainly to fix specific bugs; to a much lesser extent, new features may be added. This model was invented twenty-five years ago, when out-of-tree device drivers were much more common because many device vendors had not yet understood how important Linux support was going to be for their hardware.

The theory is that by carefully selecting changes to be back-ported, usually associated with security problems, the resulting kernel will be more stable and secure.

This paper analyzes this theory by examining the change rate and bug count of a selected vendor kernel, Red Hat Enterprise Linux (RHEL) 8.8 with kernel version 4.18.0-477.27.1, and comparing it to upstream kernels published by kernel.org. Kernel version 4.18.0-477.27.1 is also the version that Rocky Linux 8 is based upon. In particular, we analyzed the kernel-4.18.0-477.27.1.el8_8.src.rpm source code RPM.

Declining rate of change for back-ported commits in RHEL 8.8

Analyzing the number of back-ports into RHEL 8.8 we find that there are 111750 individual commits listed in the change-log. Analyzing this further we can see when these commits were back-ported over time.
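The kind of counting described above can be sketched as follows. This is a minimal illustration, not the tooling used for this paper: it assumes the change-log follows the usual RPM layout, with header lines beginning "* Day Mon DD YYYY" and one "- " line per change. The function name and the sample entries are hypothetical.

```python
import re
from collections import Counter

MONTHS = "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split()

# Assumed header layout: "* Tue Aug 01 2023 Name <email> [version]"
HEADER_RE = re.compile(r"^\* \w{3} (\w{3}) +(\d{1,2}) (\d{4}) ")


def commits_per_month(changelog_text):
    """Count '- ' change lines under each header, grouped by YYYY-MM."""
    counts = Counter()
    current_month = None
    for line in changelog_text.splitlines():
        header = HEADER_RE.match(line)
        if header and header.group(1) in MONTHS:
            month = MONTHS.index(header.group(1)) + 1
            current_month = f"{header.group(3)}-{month:02d}"
        elif line.startswith("- ") and current_month is not None:
            counts[current_month] += 1
    return counts


# Hypothetical sample data, not taken from the real RHEL change-log.
sample = """\
* Tue Aug 01 2023 Example Maintainer <maint@example.com> [x.y.z-example.2]
- net: fix hypothetical issue one (Example Author) [0000001]
- fs: fix hypothetical issue two (Example Author) [0000002]
* Mon Nov 14 2022 Example Maintainer <maint@example.com> [x.y.z-example.1]
- mm: fix hypothetical issue three (Example Author) [0000003]
"""

monthly = commits_per_month(sample)
for month in sorted(monthly):
    print(month, monthly[month])
```

Grouping by month is only one possible choice; grouping by minor release would serve equally well for a chart like Figure 1.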

Figure 1. Number of commits to RHEL 8.8 over time. (Chart demonstrating the rise and decline of commits to RHEL 8.8 between August 2018 and June 2023.)

This only looks at the number of back-ported commits, not their content or sizes, but it gives a rough understanding of the process. Initially in the life-cycle of major release RHEL 8, there is steady back-porting activity. This means that the kernel version number of 4.18 does not accurately describe the actual shipped RHEL 8 kernel as there are many changes, features and sometimes entire new sub-systems that have been back-ported from later kernel releases. Then around November 2021 something changes and the rate of back-porting decreases. Not by a great amount but it is noticeable. Again around November 2022 the rate changes a second time and this time the drop is dramatic.

These two dates where the rate of back-porting changes correspond to the release of minor versions RHEL 8.5 and RHEL 8.7.

At RHEL 8.5, when we are halfway through a major release, maintainers will often slow down the rate of back-porting and be more conservative in what they back-port instead of using the philosophy of “back-port everything”. Then at RHEL 8.7, when we are ~75% through the development life-cycle of a major release the back-porting process changes dramatically and focuses on a low rate of change, expecting that this will lead to greater stability and security.

Unfixed bugs in RHEL 8.8 with available upstream fixes

When bugs are fixed in the upstream public kernel.org git tree, the commit message of the fix explicitly states which earlier git commit introduced the bug, by convention in a “Fixes:” tag. This gives us both the commit that introduced a bug and the commit that fixed it. Using this data we can identify commits that introduced a bug and were back-ported to a RHEL kernel without the corresponding fix being applied afterwards.

That is, this is a count of all the known bugs from an upstream kernel that were introduced into RHEL 8 but never fixed there.
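The matching step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' actual tooling: it assumes we already have a mapping of upstream commit ids to commit messages and a set of upstream commit ids known to be present in the vendor kernel. The function name and all commit ids and messages in the sample are hypothetical.

```python
import re

# Upstream convention: a fix names the commit it repairs in a trailer such as
#   Fixes: 1234567890ab ("subsystem: subject of the buggy commit")
FIXES_RE = re.compile(r"^Fixes:\s*([0-9a-f]{8,40})", re.MULTILINE | re.IGNORECASE)


def missing_fixes(upstream_commits, backported):
    """Return (fix_id, buggy_id) pairs where the buggy commit is present
    in the vendor kernel but its upstream fix is not.

    upstream_commits: dict mapping full commit id -> commit message
    backported: set of full upstream commit ids present in the vendor kernel
    """
    missing = []
    for fix_id, message in upstream_commits.items():
        if fix_id in backported:
            continue  # the fix itself was back-ported
        for short_id in FIXES_RE.findall(message):
            # Fixes: tags use abbreviated ids, so match by prefix.
            for buggy_id in sorted(backported):
                if buggy_id.startswith(short_id.lower()):
                    missing.append((fix_id, buggy_id))
    return missing


# Hypothetical commits: two bugs back-ported, only the second one fixed.
bug1, fix1 = "a" * 40, "b" * 40
bug2, fix2 = "c" * 40, "d" * 40
upstream = {
    bug1: "net: add example feature",
    fix1: 'net: fix leak in example feature\n\nFixes: aaaaaaaaaaaa ("net: add example feature")',
    bug2: "fs: add example option",
    fix2: 'fs: fix example option parsing\n\nFixes: cccccccccccc ("fs: add example option")',
}
backported = {bug1, bug2, fix2}

result = missing_fixes(upstream, backported)
print(result)
```

A sketch like this only sees fixes that were back-ported under their own upstream commit id; a fix folded into a different commit would be reported as missing.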

For the most recent RHEL 8 kernels, at the time of writing, these counts are:

  • RHEL 8.6: 5034
  • RHEL 8.7: 4767
  • RHEL 8.8: 4594

In RHEL 8.8 we have a total of 4594 known bugs with fixes that exist upstream, but for which known fixes have not been back-ported to RHEL 8.8.

The situation is worse for RHEL 8.6 and RHEL 8.7 as they cut off back-porting earlier than RHEL 8.8 but of course that did not prevent new bugs from being discovered and fixed upstream.

A note on the transparency of the data

The technique we use to discover candidates for missing bug-fixes has flaws due to Red Hat not publishing the complete and corresponding source code in the preferred form of the work for making modifications to it. Instead of publishing the git trees which would show all the individual changes done in RHEL 8.8, Red Hat only publishes “squashed” versions of the source code changes, hiding the individual commits and change history.

This can lead to both false positives as well as false negatives.

As there are so many introduced bugs it is not feasible to investigate all of them. We manually checked a random subset of candidate commits flagged by this technique. These checks suggest that about 10% of the entries are false positives (meaning the fix was applied, but back-ported as part of another commit instead of being applied in the same form as the upstream fix) while 90% are genuinely missing bug-fixes in the released kernel. In addition, some of these bugs may be in code paths that are disabled via kernel config file settings. No analysis has been done on which bugs may be enabled or disabled for a specific vendor kernel config.
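As a rough illustration of what the spot check implies, the candidate counts can be scaled by the observed false-positive rate. This is only a sketch: it assumes the "about 10%" figure from the manually checked subset applies uniformly to every release, which the spot check does not establish, and the function name is hypothetical.

```python
# Candidate counts as reported earlier in this paper.
candidates = {"RHEL 8.6": 5034, "RHEL 8.7": 4767, "RHEL 8.8": 4594}

# "About 10%" false positives, from a manually checked random subset.
# Assumption: the same rate holds for every release.
FALSE_POSITIVE_RATE = 0.10


def estimated_genuine(count, fp_rate=FALSE_POSITIVE_RATE):
    """Scale a candidate count down by the assumed false-positive rate."""
    return round(count * (1 - fp_rate))


estimates = {release: estimated_genuine(n) for release, n in candidates.items()}
for release, n in estimates.items():
    print(release, n)
```

The result is an estimate only, and it does not account for false negatives or for bugs in code paths disabled by the kernel config.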

Because of the sensitive nature of this data, we are unfortunately unable to provide it to the general public, but we have already made this list available separately to Red Hat and to other companies who ship vendor Linux kernels based on this source code (SuSE, Oracle, Alma).

Disclaimer: Not all bugs are serious or important. No analysis of their severity has been undertaken, but it is probable that at least some of them are serious and important. A list of potentially serious problems is enumerated below in Appendix 1. However, it is appropriate to quote experienced Linux kernel developer and Linux Weekly News editor-in-chief Jonathan Corbet here:

“In the kernel, just about any bug, if you're clever enough, can be exploitable to compromise the system.”

How long do bugs remain unfixed?

The assumption is that bugs that are introduced upstream are found and fixed relatively quickly in the upstream git source tree.

That is not the case. Bugs that are obvious, or that are easy to trigger in common use scenarios, are possibly all fixed quickly, but many of the bugs that are found and fixed upstream were not introduced in a recent git commit.

This has an impact on the quality of the RHEL kernels. As these kernels mature and back-porting slows, it is assumed that most bugs have already been patched in these kernels. As we saw in the numbers above, and as we will see in the graphs below, that is far from the truth.

Slowing down the back-porting process to stabilize the kernel simultaneously leads to a growing number of known bugs that remain unfixed.

Upstream kernel version    Cumulative upstream bug fixes missing in RHEL 8.8 kernel
v5.0                       0
v5.1                       7
v5.2                       155
v5.3                       229
v5.4                       380
v5.5                       459
v5.6                       535
v5.7                       615
v5.8                       687
v5.9                       763
v5.10                      842
v5.11                      917
v5.12                      994
v5.13                      1075
v5.14                      1149
v5.15                      1299
v5.16                      1384
v5.17                      1531
v5.18                      1680
v5.19                      1832
v6.0                       2058
v6.1                       2282
v6.2                       2579
v6.3                       2945
v6.4                       3239
v6.5                       3534
v6.6                       3753
v6.7                       3967
v6.8-rc4                   4179

Figure 2. Cumulative number of fixes per upstream kernel version, starting at v5.0.

For example, between v5.2 and v5.3 there were 74 (229 - 155) bug and security fixes made upstream that are missing from RHEL 8.8.
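The same subtraction can be applied to every row of the cumulative table to obtain the number of missing fixes attributable to each upstream release. The sketch below uses the figures from the table; the function and variable names are illustrative only.

```python
# Cumulative missing fixes per upstream version, copied from the table above.
cumulative = [
    ("v5.0", 0), ("v5.1", 7), ("v5.2", 155), ("v5.3", 229), ("v5.4", 380),
    ("v5.5", 459), ("v5.6", 535), ("v5.7", 615), ("v5.8", 687), ("v5.9", 763),
    ("v5.10", 842), ("v5.11", 917), ("v5.12", 994), ("v5.13", 1075),
    ("v5.14", 1149), ("v5.15", 1299), ("v5.16", 1384), ("v5.17", 1531),
    ("v5.18", 1680), ("v5.19", 1832), ("v6.0", 2058), ("v6.1", 2282),
    ("v6.2", 2579), ("v6.3", 2945), ("v6.4", 3239), ("v6.5", 3534),
    ("v6.6", 3753), ("v6.7", 3967), ("v6.8-rc4", 4179),
]


def per_release(rows):
    """Turn a cumulative series into per-release increases."""
    return {
        version: count - previous
        for (_, previous), (version, count) in zip(rows, rows[1:])
    }


deltas = per_release(cumulative)
for version, delta in deltas.items():
    print(f"{version:10s} {delta}")
```

Each value is the number of fixes that first appeared in that upstream release and are missing from RHEL 8.8; v6.8-rc4 is a release candidate, so its value covers only part of a release cycle.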

Figure 3. Rate of newly discovered bugs in upstream kernels not fixed in RHEL 8. (Chart showing the increasing rates of newly-discovered bugs in upstream kernels that aren't fixed in RHEL 8.)

In this graph we can clearly see that not only does the number of known bugs in RHEL 8.8 grow over time, but the rate of growth also increases.

In Figure 3, the small uptick in the rate of new bugs around upstream Linux version 5.15 in November 2021 and the much greater uptick that occurs around upstream Linux version 6.0 in November 2022 correlate directly with the changes in the rate of back-ports to RHEL 8 that we see in Figure 1.

As back-porting into RHEL slows down towards the tail of the development lifetime, the number of new bugs that are discovered but not fixed in RHEL 8 accelerates.

This means that over time, the security of the RHEL kernels gets worse and worse: more potentially exploitable issues are discovered in the upstream code, but fewer and fewer of the fixes for these known bugs are back-ported into RHEL kernels.

After reaching RHEL 8.7, the theory is that the kernel has been stabilized, with a corresponding improvement in security. However we still have an influx of newly discovered bugs in the upstream kernel affecting RHEL 8.7 that are not addressed. Each minor version of upstream is released on an approximately quarterly basis and we can see that the influx of new bugs that are unaddressed in RHEL is growing. The number of known issues in these kernels increases by approximately 250 new bugs per quarter or more.

Upstream kernels and incorrectly-evaluated CVEs

There have historically been many issues with the handling of kernel Common Vulnerabilities and Exposures (CVEs). Many important bug-fixes were not assigned a CVE in the first place making the CVE tracking less than optimal. Recent legal changes being proposed might make Open Source projects like the Linux kernel responsible for assigning and monitoring security issues tracked by CVEs.

Because of this, and the recent plethora of incorrectly evaluated CVE reports, kernel.org has now become the sole CVE numbering authority (CNA) for the Linux kernel. This means many of the bug-fixes that go into the kernel.org stable Linux git branch, and that would never have been issued a CVE in the past, will now be assigned CVEs in batches going forward. As many fixes are regularly added to the kernel.org stable Linux git branch, a large number of new CVEs will be created by kernel.org, and these will include more direct references to the specific security vulnerabilities.

In March 2024 there were 270 new CVEs created for the stable Linux kernel. So far in April 2024 there are 342 new CVEs.

Note that when these CVEs are announced, they have already been fixed in the stable Linux kernel git branch. The kernel CNA does not score the CVEs with a severity rating, and the mitigation advice released for all of these CVEs is always the same:

“The Linux kernel CVE team recommends that you update to the latest stable kernel version for this, and many other bugfixes. Individual changes are never tested alone, but rather are part of a larger kernel release. Cherry-picking individual commits is not recommended or supported by the Linux kernel community at all.”

Due to the number of CVEs that are now issued for the kernel, the CVE process has changed. There is no longer an embargo process for most CVEs; instead, the fixes are introduced in the stable branches first, and then CVEs are created for those fixes. For organizations this means there will likely be significantly more CVEs to evaluate and a lot less time to decide what action to take, due to the lack of an embargo period. This is a lot of CVEs to analyze and track, and a lot of decisions to make about whether they matter for consumers of vendor kernels.

Our data shows that there is a large overlap between these new CVEs and the ongoing growth of unresolved bugs in the RHEL kernels.

Rate of change, bug-fixes in a stable branch in upstream

Checking the rate of bug fixes going in upstream we are now at about 3000 fixes per minor release/quarter.

Figure 4. Number of bug fixes in upstream per minor release. (Chart showing the number of bug fixes available upstream per minor release.)

While the majority of fixes going into upstream are for recently introduced bugs, we can compare the total number of fixes in each minor release of the upstream 6.x kernels with the missing bug fixes in RHEL 8.

In upstream we have over 2500 fixes per minor release in total. In the previous graphs we saw that in each minor release of upstream 6.x kernels, about 250 of these bugs affected RHEL 8.8.

Even though RHEL 8.8 is “stable” and ceased active development in late 2022, about 10% of all newly discovered bugs still affect RHEL 8.8.

Think of it this way: Next month kernel developers will find and fix about 1000 bugs upstream. About 100 of these bugs will be present in RHEL 8.8 and most of them will not be fixed.
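The arithmetic behind these round numbers can be sketched as follows. The inputs are the approximate figures quoted in this section; the function name is hypothetical and the result is an order-of-magnitude estimate, not a measurement.

```python
# Approximate figures quoted in the text above.
FIXES_PER_MINOR_RELEASE = 2500   # "over 2500 fixes per minor release"
AFFECTING_RHEL_8_8 = 250         # "about 250 of these bugs affected RHEL 8.8"
FIXES_PER_MONTH = 1000           # "about 1000 bugs" found and fixed per month


def affected_share(affecting, total):
    """Fraction of upstream fixes that address bugs present in the vendor kernel."""
    return affecting / total


share = affected_share(AFFECTING_RHEL_8_8, FIXES_PER_MINOR_RELEASE)
expected_per_month = round(FIXES_PER_MONTH * share)

print(f"share of upstream fixes relevant to RHEL 8.8: {share:.0%}")
print(f"expected relevant fixes next month: about {expected_per_month}")
```

Because the inputs are rounded, the output should be read as "roughly one in ten", not as a precise rate.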

Conclusion: The vendor model is broken and cannot be fixed

All upstream bug fixes back-ported into stable branches going forward will be classified as CVEs. In our technical opinion this creates a strong incentive for customers who are concerned with CVEs, and with ensuring that their systems are secure, to subscribe to and use a stable kernel instead of a vendor kernel. We believe that the only realistic way for a customer to know they run a kernel that is as secure as possible is to switch to a stable kernel branch.

  • The vendor kernel model is broken. It cannot be fixed.
  • A vendor kernel is an insecure kernel. A late-cycle stabilized vendor kernel is doubly so.
  • There are just too many known open bugs. It is not feasible to analyze or classify them all.
  • An upstream stable kernel provides much greater protection from security vulnerabilities and general bugs in the kernel code.

Postscript

This whitepaper is not meant as a criticism of the engineers working at any Linux vendors who are dedicated to producing high quality work in their products on behalf of their customers. This problem is extremely difficult to solve. We know this is an open secret amongst many in the industry and would like to put concrete numbers describing the problem to encourage discussion. Our hope is for Linux vendors and the community as a whole to rally behind the kernel.org stable kernels as the best long term supported solution. As engineers, we would prefer this to allow us to spend more time fixing customer specific bugs and submitting feature improvements upstream, rather than the endless grind of backporting upstream changes into vendor kernels, a practice which can introduce more bugs than it fixes.

Appendix 1. List of missing fixes in RHEL 8 by keyword search

“uaf|use.after.free”: 117
“crash|panic|oops”: 289
“deadlock”: 97
“corruption”: 58
“out.of.bounds”: 53
“double.free”: 27
“null.pointer|null.ptr”: 156
“null.deref”: 22

A random subset of these bugs was analyzed and the fixes were found to be missing in RHEL 8. Some of the missing fixes we examined are explicitly disclosed as being exploitable from user-space.
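A keyword search of the kind listed in this appendix can be sketched as follows. The patterns are the ones shown above, treated as regular expressions in which "." matches any single character. Two things are assumptions made for this illustration: that the search is case-insensitive, and that it runs over commit subject lines. The sample subjects are hypothetical.

```python
import re

# Patterns as listed in Appendix 1.
PATTERNS = [
    "uaf|use.after.free",
    "crash|panic|oops",
    "deadlock",
    "corruption",
    "out.of.bounds",
    "double.free",
    "null.pointer|null.ptr",
    "null.deref",
]


def keyword_counts(subjects, patterns=PATTERNS):
    """Count how many subjects match each pattern (case-insensitive, assumed)."""
    counts = {}
    for pattern in patterns:
        regex = re.compile(pattern, re.IGNORECASE)
        counts[pattern] = sum(1 for subject in subjects if regex.search(subject))
    return counts


# Hypothetical subjects of missing fixes, for illustration only.
subjects = [
    "net: fix use-after-free in example_rx()",
    "fs: avoid NULL pointer dereference on mount",
    "mm: fix deadlock in example path",
    "drm: prevent kernel panic on resume",
    "docs: fix typo",
]

counts = keyword_counts(subjects)
for pattern, n in counts.items():
    print(f"{pattern}: {n}")
```

A single commit can match more than one pattern, so counts produced this way should not be summed to obtain a number of distinct bugs.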