Solution Deployment Workflow
The following figure shows the high-level workflow of the installation process:
FIGURE 7. End to End Solution Deployment Workflow
The following steps provide an overview of the tasks required to deploy the HPE ProLiant NGS-optimized solution for Red Hat OpenShift Container Platform 4.16:
The way you interact with the installation program differs depending on your installation type.
- For clusters with installer-provisioned infrastructure, you delegate the infrastructure bootstrapping and provisioning to the installation program instead of doing it yourself. The installation program creates all of the networking, machines, and operating systems that are required to support the cluster.
- If you provision and manage the infrastructure for your cluster, you must provide all of the cluster infrastructure and resources, including the bootstrap machine, networking, load balancing, storage, and individual cluster machines.
- Set up iPXE, TFTP, and DHCP for RHCOS
In this step, the iPXE and TFTP servers are set up to boot RHCOS on the machines. The PXE boot process is the initial stage of the deployment, and configuring DHCP is an integral part of it. This configuration can be done with sudo access; a minimal iPXE boot script sketch follows this step.
For more information on configuring the iPXE setup, see the Deploy iPXE guide.
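As an illustration only, the following is a minimal iPXE boot script sketch for RHCOS, written to the TFTP root with a heredoc. The address 192.0.2.10, the artifact file names, and the paths are placeholders and must be adapted to the environment and to the RHCOS 4.16 artifacts actually mirrored:
$ sudo tee /var/lib/tftpboot/rhcos.ipxe <<'EOF'
#!ipxe
# 192.0.2.10 is a placeholder web server hosting the RHCOS artifacts and ignition files
kernel http://192.0.2.10/rhcos/rhcos-live-kernel-x86_64 initrd=main coreos.live.rootfs_url=http://192.0.2.10/rhcos/rhcos-live-rootfs.x86_64.img coreos.inst.install_dev=/dev/sda coreos.inst.ignition_url=http://192.0.2.10/ignition/bootstrap.ign
initrd --name main http://192.0.2.10/rhcos/rhcos-live-initramfs.x86_64.img
boot
EOF
The ignition URL changes per node role (bootstrap.ign, master.ign, or worker.ign), as described in the later steps.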
- Configure a load balancer for RHOCP 4 nodes
A load balancer is mandatory in a multi-node RHOCP cluster deployment. For this solution, Hewlett Packard Enterprise uses HAProxy to balance the required cluster traffic. This configuration can be done with sudo access; an example HAProxy configuration sketch follows this step. For a commercial load balancer such as F5 BIG-IP, or any other load balancer supported by RHOCP 4, see the manufacturer's website.
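As a hedged sketch only, the following haproxy.cfg fragment shows the TCP pass-through pattern for the Kubernetes API port 6443; the machine config server port 22623 and the ingress ports 80 and 443 follow the same pattern. The host names and 192.0.2.x addresses are placeholders:
$ sudo tee -a /etc/haproxy/haproxy.cfg <<'EOF'
frontend openshift-api
    bind *:6443
    mode tcp
    default_backend openshift-api-be
backend openshift-api-be
    mode tcp
    balance roundrobin
    server bootstrap 192.0.2.21:6443 check   # remove after the bootstrap process completes
    server master0 192.0.2.22:6443 check
    server master1 192.0.2.23:6443 check
    server master2 192.0.2.24:6443 check
EOF
$ sudo systemctl enable --now haproxy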
- Configure BindDNS
In the User-Provisioned Infrastructure (UPI), DNS records are required for each machine. These records resolve the hostnames of all other machines in the RHOCP cluster. A Linux-based DNS solution such as BIND can be configured with sudo access, which allows non-root users to execute root-level commands. A sketch of the required DNS records follows this step.
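A minimal sketch of the forward records needed by a UPI cluster follows, assuming a BIND zone file for a placeholder cluster named ocp in the domain example.local; the file path, names, and addresses are illustrative, and the matching PTR (reverse) records described in the OpenShift documentation are also required:
$ sudo tee -a /var/named/example.local.zone <<'EOF'
api.ocp        IN A 192.0.2.30   ; load balancer virtual IP
api-int.ocp    IN A 192.0.2.30   ; internal API, usually the same load balancer
*.apps.ocp     IN A 192.0.2.30   ; ingress wildcard
bootstrap.ocp  IN A 192.0.2.21
master0.ocp    IN A 192.0.2.22
master1.ocp    IN A 192.0.2.23
master2.ocp    IN A 192.0.2.24
EOF
$ sudo systemctl reload named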
- Configure firewall ports
In the User-Provisioned Infrastructure (UPI), network connectivity between machines allows cluster components to communicate within the RHOCP cluster, so the required ports must be open between the RHOCP cluster nodes. A Linux-based firewall can be configured with sudo access; example firewall commands follow this step. For third-party firewall solutions, see the manufacturer's website.
For more information, see the Installing a user-provisioned bare metal cluster with network customizations and Networking requirements for user-provisioned infrastructure sections in the OpenShift Container Platform 4.16 documentation.
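As an example for a Linux firewalld configuration, the following commands open a representative subset of the required ports on a cluster node; the complete port list (including the host-level services, VXLAN/Geneve, and node port ranges) is given in the Networking requirements section referenced above:
$ sudo firewall-cmd --permanent --add-port=6443/tcp --add-port=22623/tcp   # API and machine config server
$ sudo firewall-cmd --permanent --add-port=2379-2380/tcp                   # etcd (control plane nodes)
$ sudo firewall-cmd --permanent --add-port=10250-10259/tcp                 # kubelet and Kubernetes services
$ sudo firewall-cmd --permanent --add-port=80/tcp --add-port=443/tcp       # ingress traffic
$ sudo firewall-cmd --reload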
- Start RHOCP 4 User-Provisioned Infrastructure setup
The User-Provisioned Infrastructure (UPI) setup begins with installing a bastion host. This setup uses a RHEL 9.4 virtual machine as the bastion host, which is used to deploy and manage the RHOCP 4 clusters. The setup and configuration in this step can be completed with sudo access.
For more information, see the Generating a key pair for cluster node SSH access section in the OpenShift Container Platform 4.16 documentation.
- Download RHOCP 4 software version and images
To download the RHOCP 4 image, see the RHCOS image mirror page. Obtain the access token for your cluster and install it on the bastion host, which is used to deploy and manage the RHOCP 4 clusters. The setup and configuration in this step can be completed with sudo access; example download commands follow this step.
For more information, see the Obtaining the installation program section in the OpenShift Container Platform 4.16 documentation.
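For reference, the installer and CLI client for a 4.16 release can be pulled from the public Red Hat client mirror; the exact z-stream directory may vary, so treat the following commands as a sketch:
$ curl -LO https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable-4.16/openshift-install-linux.tar.gz
$ curl -LO https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable-4.16/openshift-client-linux.tar.gz
$ sudo tar -xzf openshift-install-linux.tar.gz -C /usr/local/bin openshift-install
$ sudo tar -xzf openshift-client-linux.tar.gz -C /usr/local/bin oc kubectl
$ openshift-install version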
- Create ignition config files
This step begins with creating the install-config.yaml file in a new folder. Use the Red Hat OpenShift installer tool to convert the YAML file into the ignition config files required to install RHOCP 4. No system modification is made on the bastion host or the provisioning server during this process. This setup can be completed with sudo access; an example command sequence follows this step.
For more information, see the Manually creating the installation configuration file section in the OpenShift Container Platform 4.16 documentation.
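A minimal command sequence for this step follows, assuming a working directory of ~/ocp-install on the bastion host; keep a copy of install-config.yaml elsewhere, because the installer consumes it:
$ mkdir -p ~/ocp-install
$ cp install-config.yaml ~/ocp-install/
$ openshift-install create manifests --dir ~/ocp-install
$ openshift-install create ignition-configs --dir ~/ocp-install
$ ls ~/ocp-install    # bootstrap.ign, master.ign, worker.ign, and the auth/ directory are generated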
- Upload ignition config files to the web
In this step, the ignition config files are uploaded to an internal web server that allows anonymous access from the iPXE boot process. Update the iPXE default file to point to the web location of the ignition files. The actions in this step can be performed with sudo access; a sketch of this step follows.
For more information, see the Installing RHCOS by using PXE or iPXE booting section in the OpenShift Container Platform 4.16 documentation.
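A sketch of this step on the NGINX web server used by this solution follows; the ignition subdirectory and the 192.0.2.10 address are placeholders:
$ sudo mkdir -p /usr/share/nginx/html/ignition
$ sudo cp ~/ocp-install/*.ign /usr/share/nginx/html/ignition/
$ sudo chmod 644 /usr/share/nginx/html/ignition/*.ign
$ sudo restorecon -Rv /usr/share/nginx/html/ignition/
$ curl -s -o /dev/null -w '%{http_code}\n' http://192.0.2.10/ignition/bootstrap.ign   # expect 200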
NOTE
KVM is an open-source virtualization technology that converts your Linux machine into a type-1 bare-metal hypervisor and allows you to run multiple Virtual Machines (VMs) or guest VMs on Red Hat Linux.
For more information, see the Getting started with virtualization section in the Red Hat Enterprise Linux 8 documentation.
- Deploy bootstrap node
The bootstrap node is a temporary node that is used to bring up the RHOCP cluster. After the cluster is up, this machine can be decommissioned and the hardware can be reused. The iPXE boot process must pass the bootstrap ignition information (bootstrap.ign) as part of the iPXE boot parameters to install RHCOS on this node.
- Deploy master node
The master nodes boot the RHCOS iPXE image after the bootstrap node. The iPXE boot process must pass the master.ign file as part of the iPXE boot parameters to install RHCOS on these nodes. The root user is not active by default in RHCOS; because root login is not available, log in as the core user.
- Create the cluster
The four nodes (one bootstrap and three master nodes) boot up and present the RHCOS login prompt. To complete the bootstrap process, log in as a sudo user on the bastion host or provisioning server and use the Red Hat OpenShift installer tool.
For more information, see the Waiting for the bootstrap process to complete section in the OpenShift Container Platform 4.16 documentation.
- Log in to the cluster
After the bootstrap process has completed successfully, log in to the cluster. The kubeconfig file is present in the auth directory where the ignition files were created on the bastion host. Export the cluster kubeconfig file and log in to your cluster as the default system user. The kubeconfig file contains information about the cluster that is used by the CLI to connect a client to the correct cluster and API server. This file is specific to a cluster and is created during the RHOCP installation. After logging in, approve the pending Certificate Signing Requests (CSRs) for the nodes; example commands follow this step.
For more information, see the Approving the certificate signing requests for your machines section in the OpenShift Container Platform 4.16 documentation.
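For illustration, the following commands list the CSRs and approve any that are still pending; the kubeconfig path assumes the ignition directory created earlier on the bastion host:
$ export KUBECONFIG=~/ocp-install/auth/kubeconfig
$ oc get csr
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve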
- Configure operators
After the control plane initializes, immediately configure any operators that are not yet available, such as image-registry, to ensure their availability. An example registry patch follows this step.
For more information, see the Image registry storage configuration section in the OpenShift Container Platform 4.16 documentation. To complete this step, you can also log in as a sudo user on the bastion host or provisioning server.
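As a hedged example, the following patch moves the Image Registry Operator out of the Removed state using non-persistent emptyDir storage, which is enough to unblock the installation on bare metal; the persistent CephFS-backed configuration shown later in this document is preferred for production use:
$ oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"managementState":"Managed","storage":{"emptyDir":{}}}}'
$ oc get clusteroperator image-registry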
- Add Worker nodes
In RHOCP, you can add RHEL worker nodes to a user-provisioned infrastructure cluster or an installer-provisioned infrastructure cluster on the x86_64 architecture. For more information, see the Adding RHEL compute machines to an OpenShift Container Platform cluster section in the OpenShift Container Platform 4.16 documentation.
Preparing the execution environment for RHOCP worker node
Prerequisites:
- Red Hat Enterprise Linux 9.4 must be installed and registered on the host machine
- Network bonding (BOND) must be configured
Setting up RHEL 9.4 installer machine
This section assumes the following considerations for our deployment environment:
- A server running Red Hat Enterprise Linux (RHEL) 9.4 exists within the deployment environment and is accessible to the installation user to be used as an installer machine. This server must have internet connectivity.
- A virtual machine acts as the installer machine, and the same host is used as the Ansible Engine host. In this solution, one of the worker machines is used as the installer machine to execute the Ansible playbooks.
Prerequisites to execute the Ansible playbooks:
RHEL 9.4 installer machine must have the following configurations:
- The installer machine must have at least 500 GB disk space (especially in the "/" partition), 4 CPU cores and 16 GB RAM.
- RHEL 9.4 installer machine must be subscribed with valid Red Hat credentials. To register the installer machine for the Red Hat subscription, run the following command:
$ sudo subscription-manager register --username=<username> --password=<password> --auto-attach
- Sync time with NTP server.
- SSH key pair must be available on the installer machine. If the SSH key is not available, generate a new SSH key pair with the following command:
$ ssh-keygen
To set up the installer machine:
- Create and activate a Python3 virtual environment for deploying this solution with the following commands:
$ python3 -m venv <virtual_environment_name>
$ source <virtual_environment_name>/bin/activate
- Download the OpenShift repositories using the following commands in the Ansible Engine:
$ mkdir /opt
$ cd /opt
$ yum install -y git
$ git clone https://github.com/HewlettPackard/hpe-solutions-openshift.git
- Set up the installer machine to configure NGINX, development tools, and the other Python packages required for the LTI installation. Navigate to the $BASE_DIR directory and run the following commands:
$ cd $BASE_DIR
$ sh setup.sh
Note
$BASE_DIR refers to /opt/hpe-solutions-openshift/DL-LTI-Openshift/
The setup.sh script creates the NGINX service, so the user must download the RHEL 9.4 DVD ISO and copy it to /usr/share/nginx/html/, as shown in the example below.
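For example, assuming the ISO was downloaded to the current directory, the copy and a quick check against the local NGINX service could look like the following; the file name must match the OS_image_name value used later in input.yaml:
$ sudo cp rhel-9.4-x86_64-dvd.iso /usr/share/nginx/html/
$ sudo restorecon -v /usr/share/nginx/html/rhel-9.4-x86_64-dvd.iso
$ curl -sI http://localhost/rhel-9.4-x86_64-dvd.iso | head -1   # expect HTTP/1.1 200 OK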
Minimum Storage requirement for management servers
Management Servers | Host OS disk | Storage Pool disk |
---|---|---|
Server 1 | 2 x 1.6 TB | 2 x 1.6 TB |
Server 2 | 2 x 1.6 TB | 2 x 1.6 TB |
Server 3 | 2 x 1.6 TB | 2 x 1.6 TB |
Host OS disk: RAID 1 for redundancy
- Creating and deleting logical drives
Create and delete logical drives on the head nodes using the following steps.
Input file update:
Update the input.yaml file in the $BASE_DIR/create_delete_logicaldrives directory before executing the logical drive script.
Update all of the details in the input.yaml file, which include the following:
ILOServers:
  - ILOIP: 172.28.*.*
    ILOuser: admin
    ILOPassword: Password
    controller: 12
    RAID: Raid1
    PhysicalDrives: 1I:1:1,1I:1:2
  - ILOIP: 172.28.*.*
    ILOuser: admin
    ILOPassword: Password
    controller: 1
    RAID: Raid1
    PhysicalDrives: 1I:3:1,1I:3:2
  - ILOIP: 172.28.*.*
    ILOuser: admin
    ILOPassword: Password
    controller: 11
    RAID: Raid1
    PhysicalDrives: 1I:3:1,1I:3:2
NOTE
1. To find the controller ID, log in to the respective iLO and navigate to System Information -> Storage. Under Location, the slot number is the controller ID.
# Example - Slot = 12
2. To find the PhysicalDrives, log in to the respective iLO and navigate to System Information -> Storage. Under Unconfigured Drives, the Location field gives each physical drive in the Port:Box:Bay form.
# Slot: 12:Port=1I:Box=1:Bay=1
# Example - 1I:1:1 ('Port:Box:Bay')
# Slot: 12:Port=1I:Box=1:Bay=2
# Example - 1I:1:2 ('Port:Box:Bay')
Playbook execution:
To delete any existing logical drives on the servers and create new logical drives named 'RHEL Boot Volume' on the respective iLO servers, run the site.yml playbook in the create_delete_logicaldrives directory with the following command:
$ ansible-playbook site.yml --ask-vault-pass
The input variables can be provided using either of the following methods.
Method 1. input.py: automated input collection
Go to the /opt/hpe-solutions-openshift/DL-LTI-Openshift/ directory and run the following command:
python input.py
The script prompts for values that are not obtained from the SCID JSON files.
A sample input collection through input.py is as follows:
Enter server serial number for the first head node server ( Example: 2M2210020X )
2M205107TH
Enter ILO address for the first head node server ( Example: 192.28.201.5 )
172.28.201.13
Enter ILO username for the first head node server ( Example: admin )
admin
Enter ILO password for the first head node server ( Example: Password )
Password
Enter Host FQDN for the first head node server ( Example: kvm1.xyz.local )
headnode1.isv.local
...
After input.py is executed, it generates the input.yaml and hosts files in the same location.
Method 2. input.yaml: manually editing the input file
Go to the $BASE_DIR (/opt/hpe-solutions-openshift/DL-LTI-Openshift/) directory, which contains the input.yaml and hosts files.
- A preconfigured Ansible vault file (input.yaml) is provided as part of this solution; it contains the sensitive information required for the host and virtual machine deployment.
cd $BASE_DIR
Run the following command on the installer VM to edit the vault to match the installation environment.
ansible-vault edit input.yaml
NOTE
The default password for the Ansible vault file is changeme
A sample input_sample.yml file can be found in $BASE_DIR along with a description of each input variable.
A sample input.yaml file with a few parameters filled in is as follows:
- Server_serial_number: 2M20510XXX
  ILO_Address: 172.28.*.*
  ILO_Username: admin
  ILO_Password: *****
  Hostname: headnode3.XX.XX #ex. headnode3.isv.local
  Host_Username: root
  Host_Password: ******
  HWADDR1: XX:XX:XX:XX:XX:XX #mac address for server physical interface1
  HWADDR2: XX:XX:XX:XX:XX:XX #mac address for server physical interface2
  Host_OS_disk: sda
  Host_VlanID: 230
  Host_IP: 172.28.*.*
  Host_Netmask: 255.*.*.*
  Host_Prefix: XX #ex. 8,24,32
  Host_Gateway: 172.28.*.*
  Host_DNS: 172.28.*.*
  Host_Domain: XX.XX #ex. isv.local
corporate_proxy: 172.28.*.* #provide corporate proxy, ex. proxy.houston.hpecorp.net
corporate_proxy_port: XX #corporate proxy port no, ex. 8080
config:
  HTTP_server_base_url: http://172.28.*.*/ #Installer IP address
  HTTP_file_path: /usr/share/nginx/html/
  OS_type: rhel9
  OS_image_name: rhel-9.4-x86_64-dvd.iso # ISO image should be present in /usr/share/nginx/html/
  base_kickstart_filepath: /opt/hpe-solutions-openshift/DL-LTI-Openshift/playbooks/roles/rhel9_os_deployment/tasks/ks_rhel9.cfg
A sample hosts file is as follows:
[kvm_nodes]
172.28.*.*
172.28.*.*
172.28.*.*
[ansible_host]
172.28.*.*
[rhel9_installerVM]
172.28.*.*
[binddns_masters]
172.28.*.*
[binddns_slaves]
172.28.*.*
172.28.*.*
[masters_info]
master1 ip=172.28.*.* hostname=headnode1
[slaves_info]
slave1 ip=172.28.*.* hostname=headnode2
slave2 ip=172.28.*.* hostname=headnode3
Deploying RHOCP cluster using Ansible playbooks
The Lite Touch Installation (LTI) package includes Ansible playbooks with scripts to deploy the RHOCP cluster. You can use one of the following two methods to deploy the RHOCP cluster:
- Run a consolidated playbook: This method includes a single playbook for deploying the entire solution. This site.yml playbook contains a script that performs all the tasks starting from the OS deployment until the RHOCP cluster is successfully installed and running. To run LTI using a consolidated playbook:
$ ansible-playbook -i hosts site.yml --ask-vault-pass
NOTE
The default password for the Ansible vault file is changeme
- Run individual playbooks: This method includes multiple playbooks with scripts that enable you to deploy specific parts of the solution depending on your requirements. The playbooks in this method must be executed in a specific sequence to deploy the solution. The following table includes the purpose of each playbook required for the deployment:
TABLE 8. RHOCP cluster deployment using Ansible playbooks
Playbook | Description |
---|---|
rhel9_os_deployment.yml | This playbook contains the script to deploy RHEL 9.4 OS on BareMetal servers. |
copy_ssh_headnode.yml | This playbook contains the script to copy the SSH public key from the installer machine to the head nodes. |
prepare_rhel_hosts.yml | This playbook contains the script to prepare nodes for the RHOCP head nodes. |
ntp.yml | This playbook contains the script to create NTP setup to enable time synchronization on head nodes. |
binddns.yml | This playbook contains the script to deploy Bind DNS on the three head nodes in an active-passive cluster configuration. |
haproxy.yml | This playbook contains the script to deploy HAProxy on the head nodes in an active-active cluster configuration. |
squid_proxy.yml | This playbook contains the script to deploy the Squid proxy on the head nodes to provide web access. |
storage_pool.yml | This playbook contains the script to create the storage pools on the head nodes. |
rhel9_installerVM.yml | This playbook contains the script to create a RHEL 9 installer machine, which will also be used as an installer at a later stage. |
copy_ssh_installerVM.yml | This playbook contains the script to copy the SSH public key to the RHEL 9 installer machine. |
prepare_rhel9_installer.yml | This playbook contains the script to prepare the RHEL 9 installer. |
copy_scripts.yml | This playbook contains the script to copy the Ansible code to the RHEL 9 installer and the head nodes. |
download_ocp_packages.yml | This playbook contains the script to download the required RHOCP packages. |
generate_manifest.yml | This playbook contains the script to generate the manifest files. |
copy_ocp_tool.yml | This playbook contains the script to copy the RHOCP tools from the current installer to the head nodes and RHEL 9 installer. |
deploy_ipxe_ocp.yml | This playbook contains the script to deploy the iPXE server on the head nodes. |
ocp_vm.yml | This playbook contains the script to create bootstrap and master nodes. |
To run individual playbooks:
- Do one of the following:
- Edit the site.yml file and comment out all the playbooks that you do not want to execute.
For example, comment out the following lines in the site.yml file to deploy only the RHEL 9.4 OS:
- import_playbook: playbooks/rhel9_os_deployment.yml
# - import_playbook: playbooks/copy_ssh_headnode.yml
# - import_playbook: playbooks/prepare_rhel_hosts.yml
# - import_playbook: playbooks/ntp.yml
# - import_playbook: playbooks/binddns.yml
# - import_playbook: playbooks/haproxy.yml
# - import_playbook: playbooks/squid_proxy.yml
# - import_playbook: playbooks/storage_pool.yml
# - import_playbook: playbooks/rhel9_installerVM.yml
# - import_playbook: playbooks/copy_ssh_installerVM.yml
# - import_playbook: playbooks/prepare_rhel9_installer.yml
# - import_playbook: playbooks/download_ocp_packages.yml
# - import_playbook: playbooks/generate_manifest.yml
# - import_playbook: playbooks/copy_ocp_tool.yml
# - import_playbook: playbooks/deploy_ipxe_ocp.yml
# - import_playbook: playbooks/ocp_vm.yml
- Run the individual YAML files using the following command:
$ ansible-playbook -i hosts playbooks/<yaml_filename>.yml --ask-vault-pass
For example, run the following YAML file to deploy RHEL 9.4 OS:
$ ansible-playbook -i hosts playbooks/rhel9_os_deployment.yml --ask-vault-pass
For more information on executing individual playbooks, see the following sections.
Deploying RHEL 9 OS on baremetal servers
This section describes how to run the playbook that contains the script for deploying RHEL 9.4 OS on BareMetal servers. To deploy RHEL 9.4 OS on the head nodes:
- Navigate to the $BASE_DIR(/opt/hpe-solutions-openshift/DL-LTI-Openshift/) directory on the installer.
- Run the following playbook:
$ ansible-playbook -i hosts playbooks/rhel9_os_deployment.yml --ask-vault-pass
Copying SSH key to head nodes
Once the OS is installed on the head nodes, copy the SSH public key from the installer machine to the head nodes using the copy_ssh_headnode.yml playbook.
To copy the SSH key to the head node run the following playbook:
$ ansible-playbook -i hosts playbooks/copy_ssh_headnode.yml --ask-vault-pass
Setting up head nodes
This section describes how to run the playbook that contains the script to prepare nodes for the RHOCP head nodes.
To register the head nodes to Red Hat subscription and download and install KVM Virtualization packages run the following playbook:
$ ansible-playbook -i hosts playbooks/prepare_rhel_hosts.yml --ask-vault-pass
Setting up NTP server on head nodes
This section describes how to run the playbook that contains the script to set up NTP server and enable time synchronization on all head nodes.
To set up NTP server on head nodes run the following playbook:
$ ansible-playbook -i hosts playbooks/ntp.yml --ask-vault-pass
Deploying Bind DNS on head nodes
This section describes how to deploy Bind DNS service on all three head nodes for active-passive cluster configuration.
To deploy Bind DNS service on head nodes run the following playbook:
$ ansible-playbook -i hosts playbooks/binddns.yml --ask-vault-pass
Deploying HAProxy on head nodes
RHOCP 4.16 uses an external load balancer to communicate from outside the cluster with services running inside the cluster. This section describes how to deploy HAProxy on all three head nodes for an active-active cluster configuration.
To deploy HAProxy server configuration on head nodes run the following playbook:
$ ansible-playbook -i hosts playbooks/haproxy.yml --ask-vault-pass
Deploying Squid proxy on head nodes
Squid is a proxy server that caches content to reduce bandwidth and load web pages more quickly. This section describes how to set up Squid as a proxy for the HTTP, HTTPS, and FTP protocols, including authentication and access restriction. The squid_proxy.yml playbook contains the script to deploy the Squid proxy on the head nodes to provide web access.
To deploy Squid proxy server on head nodes run the following playbook:
$ ansible-playbook -i hosts playbooks/squid_proxy.yml --ask-vault-pass
Creating storage pools on head nodes
This section describes how to use the storage_pool.yml playbook that contains the script to create the storage pools on the head nodes.
To create the storage pools run the following playbook:
$ ansible-playbook -i hosts playbooks/storage_pool.yml --ask-vault-pass
Creating RHEL 9 installer machine
This section describes how to create a RHEL 9 installer machine using the rhel9_installerVM.yml playbook. This installer machine is also used as an installer for deploying the RHOCP cluster and adding RHEL 9.4 worker nodes.
To create a RHEL 9 installer machine run the following playbook:
$ ansible-playbook -i hosts playbooks/rhel9_installerVM.yml --ask-vault-pass
Copying SSH key to RHEL 9 installer machine
This section describes how to copy the SSH public key to the RHEL 9 installer machine using the copy_ssh_installerVM.yml playbook.
To copy the SSH public key to the RHEL 9 installer machine run the following playbook:
$ ansible-playbook -i hosts playbooks/copy_ssh_installerVM.yml --ask-vault-pass
Setting up RHEL 9 installer
This section describes how to set up the RHEL 9 installer using the prepare_rhel9_installer.yml playbook.
To set up the RHEL 9 installer run the following playbook:
$ ansible-playbook -i hosts playbooks/prepare_rhel9_installer.yml --ask-vault-pass
Downloading RHOCP packages
This section provides details about downloading the required RHOCP 4.16 packages using a playbook.
To download RHOCP 4.16 packages:
Download the required packages on the installer VM with the following playbook:
$ ansible-playbook -i hosts playbooks/download_ocp_packages.yml --ask-vault-pass
Generating Kubernetes manifest files
The manifests and ignition files define the master node and worker node configuration and are key components of the RHOCP 4.16 installation. This section describes how to use the generate_manifest.yml playbook that contains the script to generate the manifest files.
To generate Kubernetes manifest files run the following playbook:
$ ansible-playbook -i hosts playbooks/generate_manifest.yml --ask-vault-pass
Copying RHOCP tools
This section describes how to copy the RHOCP tools from the current installer to the head nodes and the RHEL 9 installer using the copy_ocp_tool.yml playbook.
To copy the RHOCP tools to the head nodes and RHEL 9 installer run the following playbook:
$ ansible-playbook -i hosts playbooks/copy_ocp_tool.yml --ask-vault-pass
Deploying iPXE server on head nodes
This section describes how to deploy the iPXE server on the head nodes using the deploy_ipxe_ocp.yml playbook.
To deploy the iPXE server run the following playbook:
$ ansible-playbook -i hosts playbooks/deploy_ipxe_ocp.yml --ask-vault-pass
Creating bootstrap and master nodes
This section describes how to create bootstrap and master nodes using the scripts in the ocp_vm.yml playbook.
To create bootstrap and master VMs on Kernel-based Virtual Machine (KVM):
Run the following playbook:
$ ansible-playbook -i hosts playbooks/ocp_vm.yml --ask-vault-pass
Deploying RHOCP cluster
Once the playbooks are executed successfully and the bootstrap and master nodes are deployed with RHCOS, deploy the RHOCP cluster.
To deploy the RHOCP cluster:
- Log in to the installer VM.
This installer VM was created as a KVM VM on one of the head nodes using the rhel9_installerVM.yml playbook. For more information, see the Creating RHEL 9 installer machine section.
- Add the kubeconfig path in the environment variables using the following command:
$ export KUBECONFIG=/opt/hpe-solutions-openshift/DL-LTI-Openshift/playbooks/roles/generate_ignition_files/ignitions/auth/kubeconfig
- Run the following command:
$ openshift-install wait-for bootstrap-complete --dir=/opt/hpe-solutions-openshift/DL-LTI-Openshift/playbooks/roles/generate_ignition_files/ignitions --log-level debug
- Complete the RHOCP 4.16 cluster installation with the following command:
$ openshift-install wait-for install-complete --dir=/opt/hpe-solutions-openshift/DL-LTI-Openshift/playbooks/roles/generate_ignition_files/ignitions --log-level=debug
The following output is displayed:
DEBUG OpenShift Installer v4.16
DEBUG Built from commit 6ed04f65b0f6a1e11f10afe658465ba8195ac459
INFO Waiting up to 30m0s for the cluster at https://api.rrocp.pxelocal.local:6443 to initialize...
DEBUG Still waiting for the cluster to initialize: Working towards 4.16: 99% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.16: 99% complete, waiting on authentication, console,image-registry
DEBUG Still waiting for the cluster to initialize: Working towards 4.16: 99% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.16: 100% complete, waiting on image-registry
DEBUG Still waiting for the cluster to initialize: Cluster operator image-registry is still updating
DEBUG Still waiting for the cluster to initialize: Cluster operator image-registry is still updating
DEBUG Cluster is initialized
INFO Waiting up to 10m0s for the openshift-console route to be created...
DEBUG Route found in openshift-console namespace: console
DEBUG Route found in openshift-console namespace: downloads
DEBUG OpenShift console route is created
INFO Install complete!
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.ocp.ngs.local
INFO Login to the console with user: kubeadmin, password: a6hKv-okLUA-Q9p3q-UXLc3
The RHOCP cluster is successfully installed.
- After the installation is complete, check the status of the created cluster:
$ oc get nodes
Running Red Hat OpenShift Container Platform Console
Prerequisites:
The RHOCP cluster installation must be complete.
NOTE
The installer machine provides the Red Hat OpenShift Container Platform Console link and login details when the RHOCP cluster installation is complete.
To access the Red Hat OpenShift Container Platform Console:
- Open a web browser and enter the following link:
https://console-openshift-console.apps.<customer.defined.domain>
Sample one for reference: https://console-openshift-console.apps.ocp.ngs.local
Log in to the Red Hat OpenShift Container Platform Console with the following credentials:
- Username: kubeadmin
- Password: <password>
NOTE
If the password is lost or forgotten, look for the kubeadmin-password file located in the /opt/hpe-solutions-openshift/DL-LTI-Openshift/playbooks/roles/generate_ignition_files/ignitions/auth/ directory on the installer machine; an example follows.
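For reference, a command-line login from the installer machine could look like the following; the API endpoint depends on the customer-defined domain, and the password placeholder must be replaced with the value read from the kubeadmin-password file:
$ cat /opt/hpe-solutions-openshift/DL-LTI-Openshift/playbooks/roles/generate_ignition_files/ignitions/auth/kubeadmin-password
$ oc login -u kubeadmin -p '<password>' https://api.<customer.defined.domain>:6443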
The following figure shows the Red Hat OpenShift Container Platform Console after successful deployment:
FIGURE 8. Red Hat OpenShift Container Platform Console login screen
Adding RHEL 9.4 worker nodes to RHOCP cluster using Ansible playbooks
NOTE
RHEL worker nodes are supported on a best-effort basis and require their own update and lifecycle management; therefore, they are not actively recommended.
The Lite Touch Installation (LTI) package includes Ansible playbooks with scripts to add the RHEL 9.4 worker nodes to the RHOCP cluster. You can use one of the following two methods to add the RHEL 9.4 worker nodes:
- Run a consolidated playbook: This method includes a single playbook, site.yml, that contains a script to perform all the tasks for adding the RHEL 9.4 worker nodes to the existing RHOCP cluster. To run LTI using a consolidated playbook:
$ ansible-playbook -i inventory/hosts site.yml --ask-vault-pass
NOTE
The default password for the Ansible vault file is changeme
- Run individual playbooks: This method includes multiple playbooks with scripts that enable you to deploy specific tasks for adding the RHEL 9.4 worker nodes to the existing RHOCP cluster. The playbooks in this method must be executed in a specific sequence to add the worker nodes.
The following table includes the purpose of each playbook required for the deployment:
TABLE 9. Add RHEL 9.4 nodes using Ansible playbooks
Playbook | Description |
---|---|
rhel9_os_deployment.yml | This playbook contains the scripts to deploy RHEL 9.4 OS on worker nodes. |
copy_ssh.yml | This playbook contains the script to copy the SSH public key to the RHEL 9.4 worker nodes. |
prepare_worker_nodes.yml | This playbook contains the script to prepare nodes for the RHEL 9.4 worker nodes. |
ntp.yml | This playbook contains the script to create NTP setup to enable time synchronization on the worker nodes. |
openshift-ansible/playbooks/scaleup.yml | This playbook contains the script to add worker nodes to the RHOCP cluster. This playbook queries the master, generates and distributes new certificates for the new hosts, and then runs the configuration playbooks on the new hosts. |
To run individual playbooks do one of the following:
- Edit the site.yml file and comment out all the playbooks except the ones that you want to execute.
For example, comment out the following lines in the site.yml file to deploy RHEL 9.4 OS on the worker nodes:
- import_playbook: playbooks/rhel9_os_deployment.yml
# - import_playbook: playbooks/copy_ssh.yml
# - import_playbook: playbooks/prepare_worker_nodes.yml
# - import_playbook: playbooks/ntp.yml
# - import_playbook: openshift-ansible/playbooks/scaleup.yml
OR
Run the individual YAML files using the following command:
$ ansible-playbook -i inventory/hosts playbooks/<yaml_filename>.yml --ask-vault-pass
For example, run the following YAML file to deploy RHEL 9.4 OS on the worker nodes:
$ ansible-playbook -i inventory/hosts playbooks/rhel9_os_deployment.yml --ask-vault-pass
For more information on executing individual playbooks, see the following sections.
Adding RHEL 9.4 worker nodes
This section describes how to add RHEL 9.4 worker nodes to an existing RHOCP cluster.
To add RHEL 9.4 worker nodes to the RHOCP cluster:
- Log in to the installer VM.
This installer VM was created as a KVM VM on one of the head nodes using the rhel9_installerVM.yml playbook. For more information, see the Creating RHEL 9 installer machine section.
- Navigate to the directory $BASE_DIR/worker_nodes/
cd $BASE_DIR/worker_nodes/
NOTE
$BASE_DIR refers to /opt/hpe-solutions-openshift/DL-LTI-Openshift/
Run the following command on the RHEL 9 installer VM to edit the vault input file.
ansible-vault edit input.yaml
The installation user should review the hosts file (located on the installer VM at $BASE_DIR/inventory/hosts):
vi inventory/hosts
- Copy the RHEL 9.4 DVD ISO to /usr/share/nginx/html/.
- Navigate to the $BASE_DIR/worker_nodes/ directory and run the following command:
$ sh setup.sh
- Add the worker nodes to the cluster using one of the following methods:
- Run the following sequence of playbooks:
ansible-playbook -i inventory/hosts playbooks/rhel9_os_deployment.yml --ask-vault-pass
ansible-playbook -i inventory/hosts playbooks/copy_ssh.yml --ask-vault-pass
ansible-playbook -i inventory/hosts playbooks/prepare_worker_nodes.yml --ask-vault-pass
ansible-playbook -i inventory/hosts playbooks/ntp.yml --ask-vault-pass
ansible-playbook -i inventory/hosts openshift-ansible/playbooks/scaleup.yml --ask-vault-pass
OR
- If you want to deploy the entire solution to add the worker nodes to the cluster, execute the following playbook:
$ ansible-playbook -i inventory/hosts site.yml --ask-vault-pass
- Once all the playbooks are executed successfully, check the status of the node using the following command:
$ oc get nodes
The following output is displayed:
NAME STATUS ROLES AGE VERSION
master0.ocp.ngs.local Ready master,worker 3d v1.29.4+8ca71f7
master1.ocp.ngs.local Ready master,worker 3d v1.29.4+8ca71f7
master2.ocp.ngs.local Ready master,worker 3d v1.29.4+8ca71f7
worker1.ocp.ngs.local Ready worker 1d v1.29.4+8ca71f7
worker2.ocp.ngs.local Ready worker 1d v1.29.4+8ca71f7
worker3.ocp.ngs.local Ready worker 1d v1.29.4+8ca71f7
- Once the worker nodes are added to the cluster, set the mastersSchedulable parameter to false to ensure that the master nodes are not used to schedule pods.
- Edit the schedulers.config.openshift.io resource.
$ oc edit schedulers.config.openshift.io cluster
Configure the mastersSchedulable field.
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  creationTimestamp: "2024-01-04T09:20:06Z"
  generation: 2
  name: cluster
  resourceVersion: "5939203"
  uid: a636d30a-d377-11e9-88d4-0a60097bee62
spec:
  mastersSchedulable: false
  policy:
    name: ""
status: {}
NOTE
Set mastersSchedulable to true to make the control plane nodes schedulable, or to false to make them unschedulable.
- Save the file to apply the changes.
$ oc get nodes
The following output is displayed:
NAME STATUS ROLES AGE VERSION
master0.ocp.ngs.local Ready master 3d v1.29.4+8ca71f7
master1.ocp.ngs.local Ready master 3d v1.29.4+8ca71f7
master2.ocp.ngs.local Ready master 3d v1.29.4+8ca71f7
worker1.ocp.ngs.local Ready worker 1d v1.29.4+8ca71f7
worker2.ocp.ngs.local Ready worker 1d v1.29.4+8ca71f7
worker3.ocp.ngs.local Ready worker 1d v1.29.4+8ca71f7
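As an alternative to editing the Scheduler resource interactively, the same change can be applied with a patch; this is an equivalent sketch, not a step from the original procedure:
$ oc patch schedulers.config.openshift.io cluster --type merge --patch '{"spec":{"mastersSchedulable":false}}'
$ oc get schedulers.config.openshift.io cluster -o jsonpath='{.spec.mastersSchedulable}{"\n"}'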
NOTE
To add more worker nodes, update the worker details in HAProxy and Bind DNS on the head nodes, and then add the RHEL 9.4 worker nodes to the RHOCP cluster.
Adding RHCOS worker nodes to RHOCP cluster using Ansible playbooks
The Lite Touch Installation (LTI) package includes Ansible playbooks with scripts to add the RHCOS worker nodes to the RHOCP cluster. You can use one of the following two methods to add the RHCOS worker nodes:
- Run a consolidated playbook: This method includes a single playbook, site.yml, that contains a script to perform all the tasks for adding the RHCOS worker nodes to the existing RHOCP cluster. To run LTI using a consolidated playbook:
$ ansible-playbook -i hosts site.yml --ask-vault-pass
NOTE
The default password for the Ansible vault file is changeme
- Run individual playbooks: This method includes multiple playbooks with scripts that enable you to deploy specific tasks for adding the RHCOS worker nodes to the existing RHOCP cluster. The playbooks in this method must be executed in a specific sequence to add the worker nodes.
The following table includes the purpose of each playbook required for the deployment:
TABLE 10. RHCOS worker node deployment using Ansible playbooks
Playbook | Description |
---|---|
rhel9_os_deployment.yml | This playbook contains the script to deploy RHEL 9.4 OS on bare metal servers. |
copy_ssh_workernode.yml | This playbook contains the script to copy the SSH public key from the installer machine to the KVM worker nodes. |
prepare_rhel_hosts.yml | This playbook contains the script to prepare the KVM worker nodes with the required packages and subscription. |
ntp.yml | This playbook contains the script to set up NTP on the KVM worker nodes to enable time synchronization. |
binddns.yml | This playbook contains the script to deploy Bind DNS on the three head nodes in an active-passive configuration. |
haproxy.yml | This playbook contains the script to deploy HAProxy on the head nodes in an active-active configuration. |
storage_pool.yml | This playbook contains the script to create the storage pools on the KVM worker nodes. |
deploy_ipxe_ocp.yml | This playbook contains the script to deploy the iPXE code on the RHEL 9 installer machine. |
ocp_rhcosworkervm.yml | This playbook contains the script to add KVM-based CoreOS worker nodes to the existing OpenShift cluster. |
To run individual playbooks do one of the following:
- Edit the site.yml file and comment out all the playbooks except the ones that you want to execute.
For example, comment out the following lines in the site.yml file to deploy RHEL 9.4 OS on the worker nodes:
- import_playbook: playbooks/rhel9_os_deployment.yml
# - import_playbook: playbooks/copy_ssh_workernode.yml
# - import_playbook: playbooks/prepare_rhel_hosts.yml
# - import_playbook: playbooks/ntp.yml
# - import_playbook: playbooks/binddns.yml
# - import_playbook: playbooks/haproxy.yml
# - import_playbook: playbooks/storage_pool.yml
# - import_playbook: playbooks/deploy_ipxe_ocp.yml
# - import_playbook: playbooks/ocp_rhcosworkervm.yml
OR
Run the individual YAML files using the following command:
$ ansible-playbook -i hosts playbooks/<yaml_filename>.yml --ask-vault-pass
For example, run the following YAML file to deploy RHEL 9.4 OS on the worker nodes:
$ ansible-playbook -i hosts playbooks/rhel9_os_deployment.yml --ask-vault-pass
For more information on executing individual playbooks, see the following sections.
Adding RHCOS worker nodes
This section covers the steps to enable the KVM hypervisor on the worker nodes and add RHCOS worker VM nodes to an existing Red Hat OpenShift Container Platform cluster.
- Log in to the installer VM.
This installer VM was created as a KVM VM on one of the head nodes using the rhel9_installerVM.yml playbook. For more information, see the Creating RHEL 9 installer machine section.
- Navigate to the $BASE_DIR (/opt/hpe-solutions-openshift/DL-LTI-Openshift/) directory, copy the input.yaml and hosts files to $BASE_DIR/coreos_kvmworker_nodes/, and then update the OCP worker details in the input file and the kvm_workernodes group as shown in the sample hosts file.
ansible-vault edit input.yaml
vi hosts
[kvm_workernodes]
KVMworker1 IP
KVMworker2 IP
KVMworker3 IP
NOTE
The Ansible vault password is changeme
- Copy RHEL 9.4 DVD ISO to the /usr/share/nginx/html/ directory.
- Navigate to the /opt/hpe-solutions-openshift/DL-LTI-Openshift/coreos_kvmworker_nodes/ directory and add the worker nodes to the cluster using one of the following methods:
- Run the following sequence of playbooks:
ansible-playbook -i hosts playbooks/rhel9_os_deployment.yml --ask-vault-pass
ansible-playbook -i hosts playbooks/copy_ssh_workernode.yml --ask-vault-pass
ansible-playbook -i hosts playbooks/prepare_rhel_hosts.yml --ask-vault-pass
ansible-playbook -i hosts playbooks/ntp.yml --ask-vault-pass
ansible-playbook -i hosts playbooks/binddns.yml --ask-vault-pass
ansible-playbook -i hosts playbooks/haproxy.yml --ask-vault-pass
ansible-playbook -i hosts playbooks/storage_pool.yml --ask-vault-pass
ansible-playbook -i hosts playbooks/deploy_ipxe_ocp.yml --ask-vault-pass
ansible-playbook -i hosts playbooks/ocp_rhcosworkervm.yml --ask-vault-pass
OR
- If you want to deploy the entire solution to add the RHCOS worker nodes to the cluster, execute the following playbook:
$ ansible-playbook -i hosts site.yml --ask-vault-pass
- After all the playbooks are executed successfully, check the node status as described below.
Approving server certificates (CSR) for newly added nodes
The administrator needs to approve the CSR requests generated by each kubelet.
You can approve all pending CSR requests using the following command:
$ oc get csr -o json | jq -r '.items[] | select(.status == {} ) | .metadata.name' | xargs oc adm certificate approve
- Then, verify the node status using the following command:
$ oc get nodes
- Execute the following command to set the mastersSchedulable parameter to false so that the master nodes are not used to schedule pods.
$ oc edit scheduler
Adding BareMetal CoreOS worker nodes to RHOCP cluster using Ansible playbooks
The Lite Touch Installation (LTI) package includes Ansible playbooks with scripts to add the bare metal CoreOS worker nodes to the RHOCP cluster. You can use one of the following two methods to add the CoreOS worker nodes:
- Run a consolidated playbook: This method includes a single playbook, site.yml, that contains a script to perform all the tasks for adding the CoreOS worker nodes to the existing RHOCP cluster. To run LTI using a consolidated playbook:
$ ansible-playbook -i hosts site.yml --ask-vault-pass
NOTE
The default password for the Ansible vault file is changeme
- Run individual playbooks: This method includes multiple playbooks with scripts that enable you to deploy specific tasks for adding the CoreOS worker nodes to the existing RHOCP cluster. The playbooks in this method must be executed in a specific sequence to add the worker nodes.
The following table includes the purpose of each playbook required for the deployment:
TABLE 11. Bare metal CoreOS worker node deployment using Ansible playbooks
Playbook | Description |
---|---|
binddns.yml | This playbook contains the script to deploy Bind DNS on the three worker nodes in an active-passive configuration. |
haproxy.yml | This playbook contains the script to deploy HAProxy on the worker nodes in an active configuration. |
deploy_ipxe_ocp.yml | This playbook contains the script to deploy the iPXE code on the worker machine. |
To run individual playbooks do one of the following:
- Edit the site.yml file and comment out all the playbooks except the ones that you want to execute.
For example, comment out the following lines in the site.yml file to deploy Bind DNS for the worker nodes:
- import_playbook: playbooks/binddns.yml
# - import_playbook: playbooks/haproxy.yml
# - import_playbook: playbooks/deploy_ipxe_ocp.yml
OR
Run the individual YAML files using the following command:
$ ansible-playbook -i hosts playbooks/<yaml_filename>.yml --ask-vault-pass
For example, run the following YAML file to deploy Bind DNS for the worker nodes:
$ ansible-playbook -i hosts playbooks/binddns.yml --ask-vault-pass
For more information on executing individual playbooks, see the following sections.
Adding CoreOS worker nodes
This section covers the steps to add RHCOS worker nodes to an existing Red Hat OpenShift Container Platform cluster.
- Log in to the installer VM.
This installer VM was created as a KVM VM on one of the head nodes using the rhel9_installerVM.yml playbook. For more information, see the Creating RHEL 9 installer machine section.
- Navigate to the $BASE_DIR (/opt/hpe-solutions-openshift/DL-LTI-Openshift/) directory, copy the input.yaml and hosts files to $BASE_DIR/coreos_BareMetalworker_nodes/, and then update the OCP worker details in the input file.
ansible-vault edit input.yaml
------------------------------------------------------------------------------------------------------------
ocp_workers:
  - name: worker1
    ip: 172.28.xx.xxx
    fqdn: xxx.ocp.isv.local #ex. mworker1.ocp.isv.local
    mac_address: XX:XX:XX:XX:XX:XX #For a bare metal CoreOS worker, update the MAC address of the server NIC
  - name: worker2
    ip: 172.28.xx.xxx
    fqdn: xxx.ocp.isv.local #ex. mworker2.ocp.isv.local
    mac_address: XX:XX:XX:XX:XX:XX #For a bare metal CoreOS worker, update the MAC address of the server NIC
  - name: worker3
    ip: 172.28.xx.xxx
    fqdn: xxx.ocp.isv.local #ex. mworker3.ocp.isv.local
    mac_address: XX:XX:XX:XX:XX:XX #For a bare metal CoreOS worker, update the MAC address of the server NIC
------------------------------------------------------------------------------------------------------------
NOTE
Import the hosts file from $BASE_DIR. The Ansible vault password is changeme.
- Navigate to the /opt/hpe-solutions-openshift/DL-LTI-Openshift/coreos_BareMetalworker_nodes/ directory and add the worker nodes to the cluster using one of the following methods:
- Run the following sequence of playbooks:
$ ansible-playbook -i hosts playbooks/binddns.yml --ask-vault-pass
$ ansible-playbook -i hosts playbooks/haproxy.yml --ask-vault-pass
$ ansible-playbook -i hosts playbooks/deploy_ipxe_ocp.yml --ask-vault-pass
OR
- If you want to deploy the entire solution to add the RH CoreOS worker nodes to the cluster, execute the following playbook:
$ ansible-playbook -i hosts site.yml --ask-vault-pass
- After all the playbooks are executed successfully, check the node status as described below.
Approving server certificates (CSR) for newly added nodes
The administrator needs to approve the CSR requests generated by each kubelet.
You can approve all pending CSR requests using the following command:
$ oc get csr -o json | jq -r '.items[] | select(.status == {} ) | .metadata.name' | xargs oc adm certificate approve
- Then, verify the node status using the following command:
$ oc get nodes
- Execute the following command to set the mastersSchedulable parameter to false so that the master nodes are not used to schedule pods.
$ oc edit scheduler
Adding COREOS GPU worker nodes to RHOCP cluster
NOTE
To add a worker node to the RHOCP cluster, follow the process documented in the section "Adding BareMetal CoreOS worker nodes to RHOCP cluster using Ansible playbooks".
NVIDIA GPU Operator on the OpenShift cluster
NVIDIA supports the use of graphics processing unit (GPU) resources on OpenShift Container Platform.
The NVIDIA GPU Operator leverages the Operator framework within OpenShift Container Platform to manage the full lifecycle of NVIDIA software components required to run GPU-accelerated workloads.
The prerequisites for running containers and VMs with GPUs differ, with the primary difference being the drivers required. For example, the datacenter driver is needed for containers, the vfio-pci driver is needed for GPU passthrough, and the NVIDIA vGPU Manager is needed for creating vGPU devices.
Prerequisites:
- A working OpenShift 4.16 cluster with a GPU-enabled worker node.
- Access to the OpenShift cluster as a cluster-admin to perform the required steps.
- The OpenShift CLI (oc) is installed.
- The OpenShift Virtualization Operator is installed.
- The Node Feature Discovery (NFD) Operator must be installed.
Installing the Node Feature Discovery (NFD) Operator
The Node Feature Discovery (NFD) Operator is a prerequisite for the NVIDIA GPU Operator. Install the NFD Operator using the Red Hat OperatorHub catalog in the OpenShift Container Platform web console.
Procedure
- In the OpenShift Container Platform web console, click Operators → OperatorHub.
- Choose Node Feature Discovery from the list of available Operators, and then click Install.
- On the Install Operator page, select A specific namespace on the cluster, and then click Install. You do not need to create a namespace because it is created for you.
Verification
To verify that the NFD Operator installed successfully:
- Navigate to the Operators → Installed Operators page.
- Ensure that Node Feature Discovery is listed in the openshift-nfd project with a Status of InstallSucceeded.
NOTE
During installation an Operator might display a Failed status. If the installation later succeeds with an InstallSucceeded message, you can ignore the Failed message.
Create NodeFeatureDiscovery CR instance
After the Node Feature Discovery Operator is installed, create an instance of Node Feature Discovery using the NodeFeatureDiscovery tab.
- Click Operators > Installed Operators from the side menu.
- Find the Node Feature Discovery entry.
- Click NodeFeatureDiscovery under the Provided APIs field.
- Click Create NodeFeatureDiscovery.
- In the subsequent screen click Create. This starts the Node Feature Discovery Operator that proceeds to label the nodes in the cluster that have GPUs.
NOTE
The values pre-populated by the OperatorHub are valid for the GPU Operator.
Verify that the Node Feature Discovery Operator is functioning correctly
The Node Feature Discovery Operator uses vendor PCI IDs to identify hardware in a node. NVIDIA uses the PCI ID 10de. Use the OpenShift Container Platform web console or the CLI to verify that the Node Feature Discovery Operator is functioning correctly.
- In the OpenShift Container Platform web console, click Compute > Nodes from the side menu.
- Select a worker node that you know contains a GPU.
- Click the Details tab.
- Under Node labels verify that the following label is present: feature.node.kubernetes.io/pci-10de.present=true
NOTE
0x10de is the PCI vendor ID that is assigned to NVIDIA.
- Verify the GPU device (pci-10de) is discovered on the GPU node:
$ oc describe node | egrep 'Roles|pci' | grep -v master
$ oc get nodes -l feature.node.kubernetes.io/pci-10de.present
Enabling the IOMMU driver on hosts
To enable the IOMMU (Input-Output Memory Management Unit) driver in the kernel, create the MachineConfig object and add the kernel arguments.
NOTE
Enabling IOMMU is required to use GPUs with OpenShift Virtualization.
- Prerequisites:
- Administrative privilege to a working OpenShift Container Platform cluster.
- Intel or AMD CPU hardware.
- Intel Virtualization Technology for Directed I/O extensions or AMD IOMMU in the BIOS (Basic Input/Output System) is enabled.
- Procedure:
- Create a MachineConfig object that identifies the kernel argument. The following example shows a kernel argument for an Intel CPU:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 100-worker-iommu
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - intel_iommu=on
- Create the new MachineConfig object:
$ oc create -f 100-worker-kernel-arg-iommu.yaml
- Verify that the new MachineConfig object was added:
$ oc get machineconfig
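After the MachineConfig rolls out, the kernel argument can also be verified on a GPU node; the node name below is a placeholder, and the worker MachineConfigPool should report UPDATED before checking:
$ oc get mcp worker
$ oc debug node/<gpu-node-name> -- chroot /host sh -c 'grep -o intel_iommu=on /proc/cmdline'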
Labeling worker nodes
The GPU Operator uses the value of this label to determine which operands to deploy. Assign one of the following values to the label: container, vm-passthrough, or vm-vgpu. Use the following commands to add a label to each worker node:
$ oc label node <node1-name> --overwrite nvidia.com/gpu.workload.config=vm-vgpu
$ oc label node <node2-name> --overwrite nvidia.com/gpu.workload.config=container
$ oc label node <node3-name> --overwrite nvidia.com/gpu.workload.config=vm-passthrough
Verify the GPU device details before installation.
SSH to the node and list the NVIDIA GPU devices with a command like the following example:
$ lspci -nnk -d 10de:
Installing the NVIDIA GPU Operator by using the web console
- In the OpenShift Container Platform web console from the side menu, navigate to Operators > OperatorHub and select All Projects.
- In Operators > OperatorHub, search for the NVIDIA GPU Operator. For additional information see the Red Hat OpenShift Container Platform documentation.
Note
The suggested namespace is nvidia-gpu-operator. You can choose any existing namespace or create a new one. If you install in a namespace other than nvidia-gpu-operator, the GPU Operator does not automatically enable namespace monitoring, and metrics and alerts are not collected by Prometheus.
If only trusted operators are installed in this namespace, you can manually enable namespace monitoring with this command:
$ oc label ns/$NAMESPACE_NAME openshift.io/cluster-monitoring=true
- Select the NVIDIA GPU Operator and click Install. In the subsequent screen, click Install.
FIGURE 9. NVIDIA GPU Operator deployment
Set up the OpenShift internal image registry to upload the vGPU Manager image
To start the image registry, change the Image Registry Operator configuration's managementState from Removed to Managed:
$ oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"managementState":"Managed"}}'
Image registry storage configuration:
Configuring the Image Registry Operator to use CephFS storage with Red Hat OpenShift Data Foundation
Create a PVC to use the cephfs storage class. For example:
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: registry-storage-pvc
  namespace: openshift-image-registry
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: ocs-storagecluster-cephfs
EOF
Configure the image registry to use the CephFS file system storage by entering the following command:
$ oc patch config.image/cluster -p '{"spec":{"managementState":"Managed","replicas":2,"storage":{"managementState":"Unmanaged","pvc":{"claim":"registry-storage-pvc"}}}}' --type=merge
Set Default Route to True:
$ oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
Log in to the image registry:
$ HOST=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}')
$ podman login -u kubeadmin -p $(oc whoami -t) --tls-verify=false $HOST
Create the secret to access the vGPU manager image
Create a secret object for storing your registry credentials (the mechanism used to authenticate your access to the private container registry).
Navigate to Home > Projects and ensure the nvidia-gpu-operator is selected.
In the OpenShift Container Platform web console, click Secrets from the Workloads drop down.
Click the Create Drop down.
Select Image Pull Secret.
Enter the following into each field:
a. Secret name: private-registry-secret
b. Authentication type: Image registry credentials
c. Registry server address: < image-registry.openshift-image-registry.svc:5000 >
d. Username: kubeadmin
e. Password: < kubeadmin-password >
f. Email: < YOUR-EMAIL >
Click Create.
A pull secret is created.
FIGURE 10. Pull secret creation
Building the vGPU Manager image
Note
Building a vGPU Manager image is only required for NVIDIA vGPU. If you plan to use GPU Passthrough only, skip this section.
Use the following steps to build the vGPU Manager container and push it to a private registry.
Download the vGPU Software from the NVIDIA Licensing Portal.
- Log in to the NVIDIA Licensing Portal and navigate to the Software Downloads section.
- The NVIDIA vGPU software is located on the Driver downloads tab of the Software Downloads page.
- Click the Download link for the Linux KVM complete vGPU package. Confirm that the Product Version column shows the vGPU version to install. Unzip the bundle to obtain the NVIDIA vGPU Manager for Linux (the NVIDIA-Linux-x86_64-{version}-vgpu-kvm.run file).
Use the following steps to clone the driver container repository and build the driver image.
- Open a terminal and clone the driver container image repository:
$ git clone https://gitlab.com/nvidia/container-images/driver
$ cd driver
- Change to the vgpu-manager directory for your OS:
$ cd vgpu-manager/rhel9
- Copy the NVIDIA vGPU Manager from your extracted zip file:
$ cp NVIDIA-Linux-x86_64-550.90.05-vgpu-kvm.run .
Set the following environment variables:
- PRIVATE_REGISTRY - Name of the private registry used to store the driver image.
- VERSION - The NVIDIA vGPU Manager version downloaded from the NVIDIA Software Portal.
- OS_TAG - This must match the guest OS version. For Red Hat OpenShift, specify rhcos4.x, where x is the supported minor OCP version.
- CUDA_VERSION - CUDA base image version to build the driver image with.
$ export PRIVATE_REGISTRY=image-registry.openshift-image-registry.svc:5000/openshift VERSION=550.90.05 OS_TAG=rhcos4.16 CUDA_VERSION=12.4
Note
The recommended registry to use is the Integrated OpenShift Container Platform registry.
- Build the NVIDIA vGPU Manager image:
$ docker build \
--build-arg DRIVER_VERSION=${VERSION} \
--build-arg CUDA_VERSION=${CUDA_VERSION} \
-t ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG} .
- Push the NVIDIA vGPU Manager image to your private registry:
$ docker push ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG}
$ docker push image-registry.openshift-image-registry.svc:5000/openshift/vgpu-manager:550.90.05-rhcos4.16
Create the cluster policy for the NVIDIA GPU Operator:
The ClusterPolicy configures the GPU stack, including the image names and repository, pod restrictions, credentials, and so on.
TABLE 12. ClusterPolicy configuration for GPU-accelerated containers, GPU-accelerated VMs with GPU passthrough, and GPU-accelerated VMs with vGPU
Create the ClusterPolicy:
- In the OpenShift Container Platform web console, from the side menu, select Operators -> Installed Operators, and click NVIDIA GPU Operator.
- Select the ClusterPolicy tab, then click Create ClusterPolicy. The platform assigns the default name gpu-cluster-policy.
FIGURE 11. ClusterPolicy creations
- Modify the clusterpolicy.json file as described in Table 12:
Note
The vgpuManager options are only required if you want to use the NVIDIA vGPU.
Save the changes.
FIGURE 12. Created ClusterPolicy for NVIDIA GPUs
The vGPU Device Manager, deployed by the GPU Operator, automatically creates vGPU devices which can be assigned to KubeVirt VMs. Without additional configuration, the GPU Operator creates a default set of devices on all GPUs.
Verify the successful installation of the NVIDIA GPU Operator:
Run the following command to view these new pods and daemonsets:
$ oc get pods,daemonset -n nvidia-gpu-operator
Running a sample GPU application
Run a simple CUDA VectorAdd sample, which adds two vectors together, to ensure the GPUs have bootstrapped correctly. Run the following:
$ cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
Check the logs of the container:
$ oc logs cuda-vectoradd
Getting information about the GPU
The nvidia-smi command shows memory usage, GPU utilization, and the temperature of the GPU. Test GPU access by running nvidia-smi within a pod. To view GPU utilization, run nvidia-smi from a pod in the GPU Operator daemonset.
- Change to the nvidia-gpu-operator project:
$ oc project nvidia-gpu-operator
- Run the following command to view these new pods:
$ oc get pod -owide -lopenshift.driver-toolkit=true
$ oc exec -it nvidia-driver-daemonset-<4xx.xxxxxxxxxxxx> -- nvidia-smi
Two tables are generated. The first table shows information about all available GPUs (the example shows two GPUs). The second table provides details on the processes using the GPUs.
Verify the GPU devices on the worker nodes after installation:
SSH to the worker node configured for the vm-vgpu workload and list the NVIDIA vGPU devices created, with a command like the following example:
$ lspci -nnk -d 10de:
SSH to the worker node configured for the vm-passthrough workload and list the NVIDIA GPU devices with a command like the following example:
$ lspci -nnk -d 10de:
SSH to the worker node configured for the container workload and list the NVIDIA GPU devices with a command like the following example:
$ lspci -nnk -d 10de:
Add GPU Resources to the HyperConverged Custom Resource:
Update the HyperConverged custom resource so that all GPU and vGPU devices in your cluster are permitted and can be assigned to virtual machines.
Determine the resource names for the GPU devices:
$ oc get node <gpu-node > -o json | jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/"))) | with_entries(select(.value != "0"))'
Determine the PCI device IDs for the GPUs.
SSH to the node and list the NVIDIA GPU devices with a command like the following example:
$ lspci -nnk -d 10de:
Modify the HyperConverged custom resource as in the following example:
$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
permittedHostDevices:
  mediatedDevices:
  - externalResourceProvider: true
    mdevNameSelector: NVIDIA_L40S-24Q
    resourceName: nvidia.com/NVIDIA_L40S-24Q
  pciHostDevices:
  - externalResourceProvider: true
    pciDeviceSelector: 10DE:26b9
    resourceName: nvidia.com/AD102GL_L40S
Creating a virtual machine with GPU
Assign GPU devices, either passthrough or vGPU, to virtual machines.
- Prerequisites
The GPU devices are configured in the HyperConverged custom resource (CR).
- Procedure
Assign the GPU devices to a virtual machine (VM) by editing the spec.domain.devices.gpus field of the VirtualMachine manifest:
apiVersion: kubevirt.io/v1
kind: VirtualMachine
...
spec:
  domain:
    devices:
      gpus:
      - deviceName: nvidia.com/NVIDIA_L40S-24Q
        name: gpu1
...
deviceName is the resource name associated with the GPU; name is the name used to identify the device on the VM.
VM with vm-vGPU device attached:
FIGURE 13. Virtual Machine with vGPU
VM with vm-passthrough device attached:
FIGURE 14. Virtual Machine with vm-passthrough
Deploying sample application on RHOCP 4.16 using Ephemeral storage
NGINX is open-source software for web serving, reverse proxying, caching, load balancing, media streaming, and so on. This section describes how to deploy a sample NGINX application on RHOCP 4.16 using ephemeral storage.
Prerequisites:
- RHOCP must be up and running.
To deploy a sample application on RHOCP 4.16 using Ephemeral storage:
- Create a new project with namespace as "my-nginx-example".
$ oc new-project my-nginx-example
- Deploy a new application.
$ oc new-app httpd-example --name=my-nginx-example --param=NAME=my-nginx-example
- Validate the created service and route for the application.
$ oc status
- Retrieve the service details of the application.
$ oc get svc my-nginx-example
- Retrieve the route created for the application.
$ oc get route my-nginx-example
- Use the route in your browser to access the application.
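As a quick check before opening the browser, the route host can also be retrieved and probed from the command line; this is an optional sketch using the route created above:
$ ROUTE=$(oc get route my-nginx-example -n my-nginx-example -o jsonpath='{.spec.host}')
$ curl -s -o /dev/null -w '%{http_code}\n' http://$ROUTE   # expect 200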
FIGURE 15. NGINX application Web Console login screen