Machine Learning at Scale with OCI and Kubeflow
Sanjay Basu | Head of Technology Strategy, Oracle Cloud Engineering
Seshadri Dehalisan
Official Disclaimer:
The views and opinions expressed in this blog are those of the authors
and do not necessarily reflect the official policy or position of Oracle
Corporation.
Setting the context
Enterprises increasingly rely on machine learning (ML) to further their organizational goals. While machine learning can provide competitive advantage and intelligence, enterprises need a framework to harvest the benefits. This multi-part blog series discusses the challenges of machine learning at scale and how you can use the combined power of Oracle Cloud Infrastructure (OCI) offerings and the open source Kubeflow platform to achieve your ML outcomes.
Challenges
Machine learning at scale introduces multiple challenges, as outlined in the diagram below.
OCI & Kubeflow to the rescue
Oracle Cloud Infrastructure (OCI) offers multiple services to support enterprise ML needs: data science services; a compute service with multiple shapes, including high-performance GPU, bare metal, HPC, and general-purpose compute shapes; and a managed Kubernetes service, Oracle Container Engine for Kubernetes (OKE). OCI also provides the underlying foundational components for networking, storage, and security.
Kubeflow is an open source project that contains a curated set of compatible tools and frameworks specific to ML. Kubeflow runs on Kubernetes. Deploying Kubeflow on OKE enables machine learning workflows that are composable, scalable, secure, and portable.
Implementing Kubeflow on OKE
OCI lets you create OKE clusters in several ways: the Console, Terraform, Oracle Resource Manager, or the OCI SDKs. This blog does not go through the steps to create an OKE cluster; readers can follow the link referenced here. OKE clusters are fully managed: Oracle manages the control plane, and the customer has the flexibility to choose different shapes for the worker nodes. The workers can be further grouped into distinct pools, called node pools, that serve different purposes.
Training for machine learning is resource-intensive and slow, while model serving is typically lightweight and has stringent performance SLAs. You can therefore consider separate node pools for training and for serving. Note that all nodes within a node pool use the same OCI shape.
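One way to steer training and serving workloads onto separate node pools is a Kubernetes nodeSelector. The following is a minimal sketch, not the authors' setup: the `workload-type` node label, the pod name, and the image are hypothetical, and it assumes you have labeled the nodes of each pool yourself.

```python
import json


def pod_for_pool(name: str, image: str, pool_label: str) -> dict:
    """Build a Pod manifest that only schedules onto nodes of one pool.

    Assumes each node carries a hypothetical label
    `workload-type=<pool_label>` (for example "training" or "serving").
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # nodeSelector restricts scheduling to nodes whose labels match.
            "nodeSelector": {"workload-type": pool_label},
            "containers": [{"name": name, "image": image}],
        },
    }


def to_json(manifest: dict) -> str:
    """Serialize the manifest; kubectl accepts JSON as well as YAML."""
    return json.dumps(manifest, indent=2)
```

The output of `to_json(pod_for_pool("train-job", "example.com/trainer:latest", "training"))` could be piped to `kubectl apply -f -`. Taints and tolerations are a stricter alternative when a pool, such as a GPU pool, must be reserved for one kind of workload.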
Create an OKE Kubernetes cluster. For illustration, we created a 3-node cluster on the VM.Standard.E3.Flex shape:
$ kubectl get nodes -o wide
NAME          STATUS   ROLES   AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                  KERNEL-VERSION                    CONTAINER-RUNTIME
10.0.24.218   Ready    node    2d15h   v1.20.8   10.0.24.218                 Oracle Linux Server 7.9   5.4.17-2102.203.6.el7uek.x86_64   cri-o://1.20.2
10.0.39.106   Ready    node    2d15h   v1.20.8   10.0.39.106                 Oracle Linux Server 7.9   5.4.17-2102.203.6.el7uek.x86_64   cri-o://1.20.2
10.0.40.137   Ready    node    2d15h   v1.20.8   10.0.40.137                 Oracle Linux Server 7.9   5.4.17-2102.203.6.el7uek.x86_64   cri-o://1.20.2
Kustomize
Kubeflow uses Kustomize (a Kubernetes-native application configuration management tool) to install its components. Kubeflow can be installed in two ways: a single-command installation of all components, or a multi-command installation of individual components. This blog illustrates the single-command installation. As of this writing, Kubeflow is not compatible with the latest Kustomize 4.x releases, so Kustomize 3.2 should be used. It can be downloaded from here.
./kustomize version
Version: {KustomizeVersion:3.2.0 GitCommit:a3103f1e62ddb5b696daa3fd359bb6f2e8333b49 BuildDate:2019-09-18T16:26:36Z GoOs:darwin GoArch:amd64}
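If the installation is scripted, the version constraint can be checked before anything is applied. The sketch below assumes the output format shown above (a `KustomizeVersion:` field); other Kustomize releases may print a different format, in which case the parser raises an error rather than guessing. The function names are illustrative.

```python
import re


def kustomize_version(output: str) -> tuple:
    """Extract (major, minor, patch) from `kustomize version` output.

    Assumes the output contains a field like `KustomizeVersion:3.2.0`.
    """
    match = re.search(r"KustomizeVersion:v?(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("no KustomizeVersion field found in output")
    return tuple(int(part) for part in match.groups())


def is_supported(output: str) -> bool:
    """True when the reported version is 3.2.x, the release this post uses."""
    major, minor, _patch = kustomize_version(output)
    return (major, minor) == (3, 2)
```

In practice the `output` string would come from running `./kustomize version`, for example through `subprocess.run(..., capture_output=True, text=True).stdout`.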
Make sure to add Kustomize to your PATH, or install Kustomize in a common directory such as /usr/local/bin.
Get the Kubeflow Repo
git clone https://github.com/kubeflow/manifests.git
cd manifests
Pre-deploy Customizations
The default login credentials for Kubeflow out of the box are the username user@example.com and the password 12341234.
Let us change the password to something more secure before deployment. The password change can be done as follows. This assumes you have Python 3 installed in your client environment. You enter the desired password at the prompt, and the command returns a bcrypt hash of it.
pip3 install passlib
pip3 install bcrypt
python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'
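Before pasting the value anywhere, it can help to confirm that what was copied is a complete bcrypt hash. The following sketch uses only the standard library and checks the standard bcrypt string layout (identifier, two-digit cost, then 53 characters of salt and digest). It validates the format only and does not verify the password; the function name is illustrative.

```python
import re

# bcrypt layout: $<ident>$<2-digit cost>$<22-char salt><31-char digest>
BCRYPT_RE = re.compile(r"^\$(2[abxy]?)\$(\d{2})\$[./A-Za-z0-9]{53}$")


def describe_bcrypt(hash_value: str) -> dict:
    """Return the identifier and cost of a bcrypt hash string.

    Raises ValueError if the string is truncated or otherwise malformed.
    """
    match = BCRYPT_RE.match(hash_value.strip())
    if match is None:
        raise ValueError("not a well-formed bcrypt hash")
    return {"ident": match.group(1), "rounds": int(match.group(2))}
```

For a hash produced by the command above, the result should report the `2y` identifier and 12 rounds, matching the `ident` and `rounds` arguments passed to passlib.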
Take the hash value and use it to replace the existing hash in config-map.yaml in the manifests/common/dex/base directory. We will cover changing the default username and pointing authentication to an external SSO provider in subsequent blog posts.
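The replacement can also be scripted. This is a sketch under a stated assumption: that config-map.yaml contains exactly one line of the form `hash: <value>` for the default user. Check the file in your checkout before relying on it. The function works on the file's text, so reading and writing the file is left to the caller.

```python
import re

# Matches a single `hash: <value>` line, keeping its indentation.
HASH_LINE = re.compile(r"^([ \t]*hash:[ \t]*)\S+[ \t]*$", re.MULTILINE)


def replace_hash(config_text: str, new_hash: str) -> str:
    """Return config_text with the value of its one `hash:` entry replaced.

    Assumes exactly one such entry exists; anything else raises ValueError
    so that an unexpected file layout is not silently modified.
    """
    # A function replacement avoids any special handling of the hash text.
    updated, count = HASH_LINE.subn(lambda m: m.group(1) + new_hash, config_text)
    if count != 1:
        raise ValueError(f"expected exactly one hash entry, found {count}")
    return updated
```

A typical use would read the file with `pathlib.Path(...).read_text()`, call `replace_hash`, and write the result back with `write_text()`.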
Deploy Kubeflow
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 30; done
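The loop above retries because a first pass commonly fails while custom resource definitions are still being registered, and a later pass then succeeds. The shell loop retries forever; a bounded variant can be sketched as follows. The attempt limit of 20 is an arbitrary assumption, and the helper names are illustrative.

```python
import subprocess
import time


def apply_manifests(directory: str = "example") -> bool:
    """Run `kustomize build <directory> | kubectl apply -f -` once."""
    build = subprocess.run(["kustomize", "build", directory], capture_output=True)
    if build.returncode != 0:
        return False
    # Feed the rendered manifests to kubectl on standard input.
    apply = subprocess.run(["kubectl", "apply", "-f", "-"], input=build.stdout)
    return apply.returncode == 0


def retry(action, attempts: int = 20, delay: float = 30.0, sleep=time.sleep) -> bool:
    """Call action() until it returns True or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if action():
            return True
        if attempt < attempts:
            print("Retrying to apply resources")
            sleep(delay)
    return False
```

Calling `retry(apply_manifests)` from the manifests directory mirrors the shell loop, but gives up after the configured number of attempts so that a genuine error does not loop indefinitely.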
Kubeflow Components
Once Kubeflow installs successfully, it will have created multiple components and namespaces. Details of the key components are given below.
Namespace | Purpose |
---|---|
Kubeflow | Primary namespace for Kubeflow components |
KFServing | Components for Serverless Kubernetes Inferencing |
cert-manager | Kubeflow follows a zero-trust model and uses mutual TLS. Namespace for managing mutual TLS and admission webhooks |
Istio | Components that secure traffic, enforce network authorization and routing policies |
Dex | Components for OpenID Connect Identity |
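A quick way to see whether the components in these namespaces have come up is to summarize pod phases per namespace. The sketch below assumes its input is the JSON produced by `kubectl get pods --all-namespaces -o json`; the function names are illustrative. Note that a pod in the Running phase is not necessarily ready, so this is a coarse check only.

```python
import json
from collections import Counter


def unhealthy_by_namespace(pod_list: dict) -> Counter:
    """Count pods per namespace whose phase is not Running or Succeeded."""
    counts = Counter()
    for pod in pod_list.get("items", []):
        phase = pod.get("status", {}).get("phase", "Unknown")
        if phase not in ("Running", "Succeeded"):
            counts[pod["metadata"]["namespace"]] += 1
    return counts


def unhealthy_from_text(json_text: str) -> Counter:
    """Convenience wrapper for raw kubectl JSON output."""
    return unhealthy_by_namespace(json.loads(json_text))
```

An empty result suggests every pod has at least started; any namespace that appears in the result is worth inspecting with `kubectl get pods -n <namespace>`.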
At this point, the istio-ingressgateway service is exposed as a NodePort service. This can be verified as follows:
kubectl describe svc istio-ingressgateway -n istio-system
Name: istio-ingressgateway
Namespace: istio-system
Labels: app=istio-ingressgateway
install.operator.istio.io/owning-resource=unknown
istio=ingressgateway
istio.io/rev=default
operator.istio.io/component=IngressGateways
release=istio
Annotations:
Selector: app=istio-ingressgateway,istio=ingressgateway
Type: NodePort
IP: 10.233.254.172
Port: status-port 15021/TCP
TargetPort: 15021/TCP
NodePort: status-port 32723/TCP
Endpoints: 10.234.0.7:15021
Port: http2 80/TCP
TargetPort: 8080/TCP
NodePort: http2 31323/TCP
Endpoints: 10.234.0.7:8080
Port: https 443/TCP
TargetPort: 8443/TCP
NodePort: https 31547/TCP
Endpoints: 10.234.0.7:8443
Port: tcp 31400/TCP
TargetPort: 31400/TCP
NodePort: tcp 32426/TCP
Endpoints: 10.234.0.7:31400
Port: tls 15443/TCP
TargetPort: 15443/TCP
NodePort: tls 30051/TCP
Endpoints: 10.234.0.7:15443
Session Affinity: None
External Traffic Policy: Cluster
Events:
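The same information can be read programmatically. The following sketch assumes its input is the JSON from `kubectl get svc istio-ingressgateway -n istio-system -o json` and looks up the node port assigned to a named service port; the function name is illustrative.

```python
import json


def node_port(service_json: str, port_name: str) -> int:
    """Return the nodePort assigned to the named port of a NodePort service."""
    service = json.loads(service_json)
    spec = service["spec"]
    if spec.get("type") != "NodePort":
        raise ValueError(f"service type is {spec.get('type')}, not NodePort")
    for port in spec.get("ports", []):
        if port.get("name") == port_name:
            return port["nodePort"]
    raise KeyError(f"no port named {port_name!r}")
```

For the service described above, looking up `http2` would return the node port that maps to service port 80.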
Because the service is exposed only as a NodePort, a simple port forward is enough to test Kubeflow from your workstation. This can be done as shown below:
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
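Before opening the browser, you can confirm from a second terminal that the forward is listening. This is a minimal sketch using only the standard library; it checks that a TCP connection can be opened and says nothing about whether Kubeflow itself is healthy. The function name is illustrative.

```python
import socket


def is_listening(host: str = "localhost", port: int = 8080, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False
```

With the port forward running, `is_listening()` should return True; if it returns False, check that the `kubectl port-forward` command is still active.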
Now you can point your browser to localhost:8080, and it will open the Kubeflow UI as shown below.
Next, enter user@example.com and the password you created earlier. The UI will look as shown below:
Conclusion
The objective of this blog series is to enable MLOps teams to overcome ML pipeline implementation issues by automating model deployment into core software applications and/or standing up an as-a-service, API-based software delivery component.
So far, we have identified why Kubeflow is needed and how OCI and Kubeflow complement each other. We have also walked through the basic process of getting Kubeflow running. To keep this post to a manageable size, we will cover the key cornerstones of Kubeflow industrialization in subsequent posts. The next posts in the series are shown below.