vSphere with Tanzu Deployment Error

While I was enabling Workload Management (Tanzu) on vSphere 7.0 U1 with NSX-T, I hit an error.

Error configuring cluster NIC on master VM. This operation is part of API server configuration and will be retried.

This isn’t a particularly complicated fix, but I wanted to document the steps I took to find the issue.

TL;DR: Enabling Workload Management on vSphere 7.0.1 with NSX-T deploys a medium Load Balancer. Ensure you’ve got enough resources available to support that on your NSX-T Edge.

The first place I looked was the /var/log/vmware/wcp/wcpsvc.log file on the vCenter. This showed the errors below:

2020-11-05T15:47:43.271Z error wcp [opID=5fa2ffc6-domain-c7] Unexpected object: &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:Failure,Message:an error on the server ("unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body)") has prevented the request from succeeding,Reason:InternalError,Details:&StatusDetails{Name:,Group:,Kind:,Causes:[]StatusCause{StatusCause{Type:UnexpectedServerResponse,Message:unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body),Field:,},StatusCause{Type:ClientWatchDecoding,Message:unable to decode an event from the watch stream: net/http: request canceled (Client.Timeout exceeded while reading body),Field:,},},RetryAfterSeconds:0,UID:,},Code:500,}
2020-11-05T15:47:43.271Z error wcp [opID=5fa2ffc6-domain-c7] Error watching NSX CRD resources.
2020-11-05T15:47:43.271Z error wcp [opID=5fa2ffc6-domain-c7] Error creating NSX resources. Err: Kubernetes API call failed. Details Error watching NSX CRD resources.
2020-11-05T15:47:43.271Z error wcp [opID=5fa2ffc6-domain-c7] Failed to create cluster network interface for MasterNode: VirtualMachine:vm-3921. Err: Kubernetes API call failed. Details Error watching NSX CRD resources.
2020-11-05T15:47:43.271Z error wcp [opID=5fa2ffc6-domain-c7] Error configuring cluster NIC on master VM vm-3921: Kubernetes API call failed. Details Error watching NSX CRD resources.
2020-11-05T15:47:43.271Z error wcp [opID=5fa2ffc6-domain-c7] Error configuring API server on cluster domain-c7 Error configuring cluster NIC on master VM. This operation is part of API server configuration and will be retried.
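If you want to watch this log live while the configuration retries, something like the following on the vCenter appliance does the job (the grep filter is just my suggestion):

tail -f /var/log/vmware/wcp/wcpsvc.log | grep -i error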

This seemed to point to some sort of routing issue. It couldn’t be between vCenter, NSX Manager & the TKG Supervisor nodes, as they were all on the same subnet. I did wonder if it could be the Ingress/Egress subnets that I configured as part of enabling Workload Management; however, in my lab they had recently been hosting TKGI workloads, so I knew the routing should be good. Just to make sure, I checked the NSX API logs on the NSX Manager. While these didn’t show any entries with the same timestamp as above, they did show that some communication had occurred. At this point, though, the communication was all between the components on the management network, nothing on the ingress/egress networks yet.
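For reference, on an NSX-T 3.x Manager the API log lives at /var/log/proton/nsxapi.log (that path is my assumption for this version; check the docs for your release), so tailing it while retrying the enablement is one way to see the incoming calls:

tail -f /var/log/proton/nsxapi.log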

This led me to check the logs on the Supervisor nodes themselves. The first step is to get the Supervisor node IP and root password by running the /usr/lib/vmware-wcp/decryptK8Pwd.py script on the vCenter, then SSH to that node as root.
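In practice that looks something like the below, run from the vCenter appliance shell (the Supervisor node IP and root password both come from the script’s output):

/usr/lib/vmware-wcp/decryptK8Pwd.py
ssh root@<supervisor-node-ip>

Once logged in, I ran: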

kubectl get pod --all-namespaces

This showed straight away that the nsx-ncp pod was in CrashLoopBackOff, so I viewed the logs on that pod.

kubectl logs nsx-ncp-c9d6544d6-mvzf6 --namespace vmware-system-nsx

The results?

2020-11-05T20:29:26.997Z 42135b555eec05e200b91e6625b4d8a3 NSX 9 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="CRITICAL"] nsx_ujo.ncp.main Failed to initialize container orchestrator adaptor: Tier1 ID domain-c7:1cbff893-84c5-46e2-811b-f476e6b14b36 is in ERROR state: Found errors in the request. Please refer to the related errors for details.: [Routing] Insufficient resources to do auto allocation in edge cluster EdgeCluster/190ed531-d56f-4dce-b183-248e9bec8e8c of pool type LB_ALLOCATION_POOL because of below reason(s).: [Routing] Node(s) [TransportNode/93dfbcc1-1236-4ecc-a365-2f7c6c3a605e] don't have enough capacity to allocate.

So there weren’t enough resources on the NSX Edge node to deploy a load balancer. I know the Tanzu requirement is a Large Edge node, but if it was only deploying a small load balancer, my medium Edge should have had enough resources. So I checked the load balancer usage by hitting this API on NSX:

GET https://[NSX-IP]/policy/api/v1/infra/lb-node-usage-summary
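From the command line that’s a simple authenticated GET, for example with curl (the admin account and the -k flag to skip certificate verification are assumptions for my lab):

curl -k -u admin https://[NSX-IP]/policy/api/v1/infra/lb-node-usage-summary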

That returned this:

{
    "results": [
        {
            "current_load_balancer_credits": 3,
            "load_balancer_credit_capacity": 10,
            "current_pool_member_count": 24,
            "pool_member_capacity": 2000,
            "usage_percentage": 30.0,
            "severity": "GREEN",
            "node_counts": [
                {
                    "severity": "GREEN",
                    "node_count": 1
                },
                {
                    "severity": "ORANGE",
                    "node_count": 0
                },
                {
                    "severity": "RED",
                    "node_count": 0
                }
            ],
            "enforcement_point_path": "/infra/sites/default/enforcement-points/default"
        }
    ],
    "intent_path": "/infra/lb-node-usage-summary"
}

So the credit capacity is 10, which is in line with a medium Edge node, but I was already using 3. It turns out there were some remnants of TKGI load balancers still hanging around in NSX-T, and the error also seemed to indicate that the Supervisor cluster wanted to deploy a medium LB, which would need more credits than I had free. Next step: uplift the Edge to a Large and see what happens. Sure enough, the Supervisor cluster deployed successfully, and if I hit the node usage API again, I get:

{
    "results": [
        {
            "current_load_balancer_credits": 13,
            "load_balancer_credit_capacity": 40,
            "current_pool_member_count": 32,
            "pool_member_capacity": 7500,
            "usage_percentage": 32.5,
            "severity": "GREEN",
            "node_counts": [
                {
                    "severity": "GREEN",
                    "node_count": 1
                },
                {
                    "severity": "ORANGE",
                    "node_count": 0
                },
                {
                    "severity": "RED",
                    "node_count": 0
                }
            ],
            "enforcement_point_path": "/infra/sites/default/enforcement-points/default"
        }
    ],
    "intent_path": "/infra/lb-node-usage-summary"
}

Credits in use now comes to 13: the 3 existing TKGI load balancers plus 10 for the Supervisor cluster’s new medium load balancer, which confirms the medium load balancer theory. The medium LB shows up in NSX Manager as well.

For those who are not familiar with load balancer credits: each load balancer deployed consumes a number of credits depending on its size, and each Edge node size has a maximum credit capacity. You can find the exact numbers in the VMware config max tool.
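If you save the usage-summary response to a file, a quick way to eyeball the remaining headroom is a jq one-liner like this (jq and the file name are my own additions; in my case the Supervisor cluster’s medium load balancer needed 10 of those credits):

jq '.results[0] | .load_balancer_credit_capacity - .current_load_balancer_credits' lb-usage.json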