Spark Job Tuning Cheat Sheet

Partitions, caching, and shuffles

Last Updated: November 21, 2025

Focus Areas

- Match `spark.sql.shuffle.partitions` to cluster size (see the sketch below)
- Cache reused DataFrames
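
A minimal sketch of both focus points, assuming a SparkSession named `spark` and a hypothetical `events` DataFrame; the 2-3 tasks-per-core ratio is a common starting point, not a fixed rule:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

# Rule of thumb (a starting point, not a guarantee): 2-3 tasks per core
# keeps executors busy without creating thousands of tiny tasks.
cores = spark.sparkContext.defaultParallelism
spark.conf.set("spark.sql.shuffle.partitions", str(cores * 3))

# Hypothetical DataFrame; anything reused across actions benefits from caching.
events = spark.range(1_000_000)
events.persist(StorageLevel.MEMORY_ONLY)

events.count()  # first action computes and caches
events.count()  # second action reads from memory
```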

Commands & Queries

| Command | Purpose |
| --- | --- |
| `spark-submit --conf spark.sql.shuffle.partitions=200 job.py` | Set the shuffle partition count at submit time |
| `df.repartition(100)` | Redistribute `df` into 100 partitions (triggers a full shuffle) |
| `df.persist(StorageLevel.MEMORY_ONLY)` | Cache `df` in executor memory for reuse across actions |
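
Putting the table's commands together: a sketch of repartitioning before a wide aggregation and caching the reused result, assuming a hypothetical DataFrame with an evenly distributed `key` column:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input with an evenly distributed integer key.
df = spark.range(1_000_000).withColumn("key", F.col("id") % 100)

# One up-front shuffle so downstream work sees evenly sized partitions.
balanced = df.repartition(100, "key")

# The aggregate is used by two actions, so cache it before the first one.
agg = balanced.groupBy("key").agg(F.count("*").alias("n"))
agg.persist(StorageLevel.MEMORY_ONLY)

agg.show(5)         # materializes the cache
print(agg.count())  # served from memory, no recomputation
```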

Summary

Right-size partitions, cache reused DataFrames, and minimize shuffles to speed up Spark jobs.

💡 Pro Tip: Use `persist()` for DataFrames that are reused across actions, and avoid wide dependencies (operations that force a shuffle) when a narrow alternative such as a broadcast join exists; see the sketch below.
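
One way to avoid a wide dependency is to broadcast the small side of a join so no shuffle is needed; a sketch assuming hypothetical `facts` and `dims` tables where `dims` fits in executor memory:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical fact table and small dimension lookup.
facts = spark.range(1_000_000).withColumn("dim_id", F.col("id") % 10)
dims = spark.createDataFrame(
    [(i, f"name_{i}") for i in range(10)], ["dim_id", "name"]
)

# broadcast() ships the small table to every executor, so the join is
# map-side (narrow) instead of a shuffle (wide dependency).
joined = facts.join(F.broadcast(dims), "dim_id")
joined.explain()  # plan should show BroadcastHashJoin rather than SortMergeJoin
```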