Scaling Large Data Science Environments With Spark and Kubernetes

As technology evolves, IT infrastructure teams often struggle to keep pace with their company's scaling needs. Our team at Vertical Trail has helped many clients with this challenge, and one of our preferred approaches is running Spark on Kubernetes.

Spark and Kubernetes each have unique capabilities and benefits, but together they bring out the best in each other. Recent work with one of our clients proved this: we combined their data science and large-scale data engineering platforms into a single unified orchestration platform, reducing infrastructure management and administration while enabling faster tool integration to support user needs.

Value of Spark and Kubernetes Individually

Kubernetes

Kubernetes helps organizations automate and templatize their infrastructure for better scalability and management. Containers orchestrated by Kubernetes can access scalable storage and process data at scale, making the platform a strong candidate for data science and engineering workloads. Because Kubernetes automates and simplifies daily container workflows, it improves infrastructure utilization, lowers operational costs, and reduces the level of expertise a team needs.

Another benefit comes from Docker, a container runtime frequently used with Kubernetes: a Docker image serves as a template from which Kubernetes can create and manage many identical containers. This enables more efficient use of resources and improves operational efficiency, giving a team more time to work on other projects.
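As a minimal sketch of that image-as-template pattern, the snippet below uses the official Kubernetes Python client to keep three replicas of a container running from one image; the image name, labels, and namespace are illustrative assumptions, not details from the client work described here.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (assumes kubectl access is configured).
config.load_kube_config()

# One Docker image is the template; Kubernetes keeps three copies of it running.
deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="worker"),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "worker"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "worker"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="worker",
                        image="example.com/worker:latest",  # hypothetical image
                    )
                ]
            ),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```

If the image crashes or a node disappears, Kubernetes replaces the lost containers automatically, which is exactly the kind of daily workflow it takes off a team's plate.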

Spark

Spark provides in-memory computing capabilities that deliver speed, broad application support, and ease of use. For large-scale data processing, Spark can be up to 100x faster than older disk-based Big Data engines such as Hadoop MapReduce, thanks to in-memory computing and other optimizations. Combined with Kubernetes' compute power, this scalability improves further, reducing processing time and making it possible to scale dynamically without affecting processing rates.
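As a small illustration of that in-memory model, the PySpark sketch below caches a DataFrame so repeated queries reuse data held in executor memory instead of re-reading files; the input path and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Hypothetical dataset; any columnar source behaves the same way.
events = spark.read.parquet("/data/events")

# Pin the DataFrame in executor memory so later actions reuse it
# instead of re-reading and re-parsing the files on disk.
events.cache()

events.groupBy("user_id").count().show()          # first action reads and caches
events.filter(events.status == "error").count()   # second action hits the cache
```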

Increased Value of Spark and Kubernetes Together

Separating compute and storage scalability creates an opportunity to use distributed technologies such as Spark within an infrastructure orchestration platform. Traditionally, running this open source tool means either paying a service provider to manage it or standing up a dedicated cluster, an option that is less appealing because it demands skills the everyday engineer doesn't have. In other words, both options require some kind of outside or specialized assistance.

What Spark needs, then, is a cluster manager, and with this Spark-Kubernetes solution there is no need for Hadoop (and therefore no Hadoop-based administrator), because a Spark image can be built and run directly on Kubernetes. In other words, it is possible to take a container and scale it out on a unified infrastructure. There are also various add-ons for Spark and Kubernetes that make expansion and management easier. It then becomes possible to provision, monitor, and scale data workloads, which means scaling Spark is now a black box for the team: distribution is handled and optimized by the platform.
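A minimal sketch of that setup, assuming a reachable Kubernetes API server and a prebuilt Spark container image (the server URL, image name, namespace, and executor count below are placeholders, and a real client-mode deployment also needs driver networking and service-account settings):

```python
from pyspark.sql import SparkSession

# Kubernetes acts as Spark's cluster manager: executors run as pods
# launched from the container image, with no Hadoop/YARN in the picture.
spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.example.com:6443")  # placeholder API server
    .appName("spark-on-k8s")
    .config("spark.kubernetes.container.image", "example.com/spark:latest")  # placeholder image
    .config("spark.kubernetes.namespace", "data-eng")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

# Quick sanity check that the executors are doing distributed work.
spark.range(1_000_000).selectExpr("sum(id)").show()
```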

Running Spark on Kubernetes this way still requires support to operate data at scale. Given the simplicity of the solution, however, traditional infrastructure teams can provide that support, narrowing the knowledge gap between administrators. This lets a business increase operational efficiency while reducing support time and effort, because no dedicated specialist is needed to maintain the platform.

Bringing Spark onto Kubernetes also creates one unified platform that reuses existing architecture to increase efficiency: there is less infrastructure to manage, a single interface for managing workloads, better utilization of spare cycles, and less need to understand complex Big Data infrastructure.
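One way to capture those spare cycles is Spark's dynamic allocation, sketched below for Spark 3.x on Kubernetes, where shuffle tracking stands in for the external shuffle service; the endpoint, image, and executor bounds are assumptions.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.example.com:6443")  # placeholder API server
    .appName("elastic-spark")
    .config("spark.kubernetes.container.image", "example.com/spark:latest")  # placeholder image
    # Grow and shrink the executor pod count with the workload so idle
    # capacity goes back to the cluster for other tenants to use.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    .getOrCreate()
)
```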

Our Experience

We recently implemented Spark on Kubernetes to replace a Python- and R-based data science platform and a Big Data cluster for one of our clients. The move improved time to market for new data science use cases by 35% and reduced infrastructure management costs by 40%. Along the way, Spark data pipelines were optimized, cutting the time to process three years of data by 85%. It's evident that Spark on Kubernetes increased efficiency and reduced costs for this organization by reusing some of its existing architecture and creating a unified platform that its existing team could manage.
