CentOS -> RedHat Migration
Project Summary:
OS Replacement for an existing HPC cluster and its surrounding ecosystem. Harvard Medical School - Summer 2025
Primary driver:
RedHat changed the placement of CentOS in the Linux ecosystem to a rolling release, moving it ahead of the stable RHEL release rather than behind it. This made CentOS unfit for production workloads in our opinion.
Decision Factors:
- Stability
- Cost
- Application Support
- Team Familiarity
Stability was a clear factor given the recategorization of CentOS. Not only did we need a stable release, we needed one that was unlikely to meet the same fate. Early front runners were Debian, Ubuntu, SUSE and RedHat EL. There was a lot of support internally for a more community-driven distro that would be less subject to the whims of corporate priority shifts. While this was strongly considered, ultimately cost became the deciding factor.
Cost is always a consideration in any software project, and operating systems are no different. While there are plenty of cost-free options in the Linux world, in most cases you get what you pay for, which typically means a lack of support. The technical teams were fine with an unsupported option; we had never had vendor support for the OS before, so it wasn't felt to be a need. However, when we did get quotes from vendors just to make sure we had considered all options, RedHat came back with such an obscenely low cost that it became impossible not to take the option seriously.
Application support was a strong factor in the decision process. The existing cluster was running CentOS, a RedHat-compatible distro, and the desire to avoid retooling everything on top of all the other challenges made the argument for non-RedHat-compatible distros a hard one. CentOS had near complete package parity with RHEL, making the conversion as straightforward as possible. Moving to an incompatible distro would have required hundreds of engineer hours to retool existing configs with new package names and software trees. RedHat also had an existing deployment system in Satellite, which is based on Foreman, the tool the team was already using for deployments. RedHat was a clear winner here as well.
Team familiarity was also a consideration. The team was used to RedHat (RPM-based) systems: the ecosystem, the tools, and the general system layout. Moving to a new distro meant relearning a significant portion of the stack, which would add to the learning curve and slow down the migration significantly. The team's comfort with some of the surrounding infrastructure was also thought to be a potential time saver as the project progressed.
Based on these factors it was decided that we would move forward with RedHat Enterprise Linux as the OS of choice for the migration.
Major changes
Along with the change in base OS, this upgrade also represented a jump from CentOS 7 to RHEL 9, a major jump in OS release base that would introduce major breaking changes. Core libraries would change, making software built against the old libs unable to run in the new environment. Updates to things like glibc, a core standard library, were sure to require a significant number of software packages to be rebuilt to work on the new OS. While this was an opportunity to clean out old software that probably was no longer needed, it represented a significant amount of work.
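As a rough illustration of the triage this involves, the sketch below (a hypothetical helper, not our actual migration tooling) walks a software tree on an already-upgraded node and flags binaries whose shared libraries no longer resolve, which makes them rebuild candidates. The software tree path is an assumption.

```python
#!/usr/bin/env python3
"""Sketch: flag binaries whose shared libraries are missing on the new OS.

Hypothetical helper, not the actual migration tooling. Run on an upgraded
(RHEL 9) node against the shared software tree; any binary whose
dependencies show up as "not found" in ldd output needs a rebuild.
"""
import subprocess
from pathlib import Path

SOFT_ROOT = Path("/n/app/software")   # assumed location of the shared software tree

def missing_libs(binary: Path) -> list[str]:
    """Return shared libraries that ldd reports as 'not found' for this binary."""
    proc = subprocess.run(["ldd", str(binary)], capture_output=True, text=True)
    if proc.returncode != 0:
        return []                      # static binary, script, or not ELF at all
    return [line.split()[0] for line in proc.stdout.splitlines() if "not found" in line]

def main() -> None:
    for path in sorted(SOFT_ROOT.rglob("*")):
        # only look at regular files with an execute bit set
        if not (path.is_file() and path.stat().st_mode & 0o111):
            continue
        missing = missing_libs(path)
        if missing:
            print(f"{path}: missing {', '.join(missing)}")

if __name__ == "__main__":
    main()
```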
In addition to the core OS release came the replacement of the deployment system, from Foreman to RH Satellite. While Satellite is based on Foreman and shares a lot of the same functionality, the modifications RedHat has added made the migration a much steeper learning curve than expected. The methods of software distribution and the requirement for systems to be registered added significant and unexpected complexity to the conversion. Satellite would also much prefer to hand off to RedHat's own Ansible tooling for system configuration, but we were a Puppet shop. Satellite does support Puppet, but the integration is not as clean and required some time to figure out. All of these hurdles had to be managed before we could even begin the OS migration efforts.
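One upside of the registration requirement is that Satellite knows about every host, which makes it easy to track conversion progress. The sketch below is a hypothetical example, not our actual tooling: it assumes Satellite exposes the standard Foreman v2 host listing (GET /api/v2/hosts) and that each host record carries an operatingsystem_name field; the hostname, credentials, and CA path are placeholders.

```python
#!/usr/bin/env python3
"""Sketch: track migration progress via the Satellite / Foreman host API."""
from collections import Counter

import requests

SATELLITE = "https://satellite.example.edu"        # placeholder hostname
AUTH = ("api-user", "changeme")                    # placeholder credentials
CA_BUNDLE = "/etc/pki/tls/certs/satellite-ca.pem"  # assumed CA path

def list_hosts() -> list[dict]:
    """Page through every registered host known to Satellite."""
    hosts, page = [], 1
    while True:
        resp = requests.get(
            f"{SATELLITE}/api/v2/hosts",
            params={"page": page, "per_page": 100},
            auth=AUTH,
            verify=CA_BUNDLE,
        )
        resp.raise_for_status()
        results = resp.json()["results"]
        if not results:
            return hosts
        hosts.extend(results)
        page += 1

def main() -> None:
    # Count hosts by reported operating system to see how far along we are.
    by_os = Counter(h.get("operatingsystem_name", "unknown") for h in list_hosts())
    for os_name, count in by_os.most_common():
        print(f"{count:5d}  {os_name}")

if __name__ == "__main__":
    main()
```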
Most of the software upgrade work was managed by the Research Computing Consulting group, a user-facing team of researchers who handle the day to day user operations of the cluster as well as most of the non-system software deployed. Without their help the upgrade wouldn't have been possible.
There was of course a long list of more minor changes that had to be managed, some of which we had wanted to implement for some time. This major upgrade proved an ample opportunity to clear some technical debt we had been sitting on.
Basic rollout plan:
It was decided that we would tackle the most impactful changes first. We started with the HPC compute and login nodes, as those would be the changes most felt by the user community. We did have a dev cluster and were able to test changes there, which gave us the opportunity to test and perfect the distribution of the new OS and the rebuilding of nodes. Once that testing was done, an outage was scheduled and the plan put in motion. We took 3 days to redeploy nearly 500 nodes, completely rebuilding them from the disk up. This process was managed by the team and went as well as any such thing can. There was of course a small subset of nodes that did not rebuild cleanly and required more hand holding, but over 98% of nodes were back online by the end of the outage window.
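Verifying a wave of that size is mostly a matter of checking which nodes actually came back on the new release. The sketch below is a hypothetical helper along those lines, not our actual tooling: it assumes passwordless SSH to the nodes and a plain-text node list, and the expected version string and list location are placeholders.

```python
#!/usr/bin/env python3
"""Sketch: verify which nodes came back on the new OS after a rebuild wave."""
import subprocess
from pathlib import Path

NODE_LIST = Path("nodes.txt")       # one hostname per line (placeholder)
EXPECTED = 'VERSION_ID="9'          # what /etc/os-release should now report

def os_release(node: str) -> str:
    """Fetch /etc/os-release from a node, or '' if it is unreachable."""
    proc = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
         node, "cat /etc/os-release"],
        capture_output=True, text=True,
    )
    return proc.stdout if proc.returncode == 0 else ""

def main() -> None:
    pending = []
    for node in NODE_LIST.read_text().split():
        if EXPECTED not in os_release(node):
            pending.append(node)    # unreachable or still on the old OS
    print(f"{len(pending)} node(s) not yet confirmed on RHEL 9:")
    for node in pending:
        print(f"  {node}")

if __name__ == "__main__":
    main()
```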
In addition to the compute and login migration was the migration of an existing web hosting stack that was tied to the HPC cluster. This was a historical architecture running custom code that required some hand holding to migrate. Not only were these standalone servers, many of them had code bases that were no longer maintained and required legacy configs that were difficult to support. The migration effort gave us the opportunity to find sites that were no longer needed and eliminate them. We also had the opportunity to streamline the management of these systems and start thinking about ways to do it better.
After the compute and login nodes, the transfer nodes were updated along with other user facing systems like cron servers. From there began the slog of updating all the systems required to support a modern HPC environment. These are a mix of management systems and other nodes supporting tools used by the cluster or its admin group. There are a lot of these systems, most of them unique, and they require more hand holding to upgrade.
A last minute mini project:
During the migration of the supporting systems, the team brought up that we also had a project to update our monitoring and alerting system. We had been using a system called Sensu for some time, but they had changed their licensing model to one with a cost we couldn't justify. As a result we were planning to move to Prometheus, a more modern and fully open source solution. The team felt it was a waste of effort to move all these systems to RedHat and then have a whole other project to migrate them to the new monitoring platform, so it was decided to deploy Prometheus and set it up to monitor the rest of the systems as they got upgraded. This saved the team a bunch of time and aggravation later down the road. A migration was planned to move all the compute and login nodes to the new system in bulk at a later date, since they did not require as much hands on work.
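One low-friction way to fold hosts into Prometheus as they get upgraded is file-based service discovery, where a scrape job watches a directory of target files. The sketch below is a hypothetical example of that pattern, not our actual setup: it assumes a job configured with file_sd_configs pointing at /etc/prometheus/targets/ and node_exporter on its default port (9100); the host list path and labels are placeholders.

```python
#!/usr/bin/env python3
"""Sketch: feed newly upgraded hosts to Prometheus via file-based service discovery."""
import json
from pathlib import Path

UPGRADED_HOSTS = Path("upgraded_hosts.txt")                      # placeholder input
TARGETS_FILE = Path("/etc/prometheus/targets/rhel9-nodes.json")  # assumed file_sd path

def main() -> None:
    hosts = sorted(set(UPGRADED_HOSTS.read_text().split()))
    # One target group; Prometheus re-reads file_sd files on its own.
    target_groups = [{
        "targets": [f"{host}:9100" for host in hosts],
        "labels": {"env": "hpc", "os": "rhel9"},                 # placeholder labels
    }]
    TARGETS_FILE.write_text(json.dumps(target_groups, indent=2) + "\n")
    print(f"wrote {len(hosts)} targets to {TARGETS_FILE}")

if __name__ == "__main__":
    main()
```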
Migration results:
While the migration of support systems is still under way, the overall migration was a success. The cluster upgrade went smoothly, applications were migrated, and users were able to continue their research with minimal downtime and minimal disruption. The major problems involved in a migration of this size were managed effectively, even if there is still some work to be done. As of this time there have been no reports of applications or research projects that could not be migrated to the new OS, and no workflows that could not continue.
Migrations of this size require a massive effort across several teams. They require coordination and cooperation at a fairly deep level. The teams at Harvard are amazing, work incredibly well together, and have managed to create an extremely low ego environment where everyone can contribute and everyone is focused on the mission.
Follow up work
Migrations like this have a tendency to point out weak spots in your design: things you could do better or differently, or maybe shouldn't be doing at all. We were able to clean up some of our web hosting, but there is a lot of work left to do. Some of the things recommended for follow up:
- Shore up Satellite with redundant hosts to make it more stable and available
- Look at making Prometheus a robust enough system to be consumed by other teams
- Redeploy Puppet servers as containers in a scaling group to better manage the resource.