Spark Job Tuning Cheat Sheet

Partitions, caching, and shuffles

Last Updated: November 21, 2025

Focus Areas

- Match `spark.sql.shuffle.partitions` to cluster size (see the sketch below)
- Cache reused DataFrames
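
A minimal sketch of both focus points, assuming a SparkSession named `spark` and a hypothetical `events` DataFrame; the 2-3 tasks-per-core ratio is a common starting point, not a fixed rule:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

# Rule of thumb (a starting point, not a guarantee): 2-3 tasks per core
# keeps executors busy without creating thousands of tiny tasks.
cores = spark.sparkContext.defaultParallelism
spark.conf.set("spark.sql.shuffle.partitions", str(cores * 3))

# Hypothetical DataFrame; anything reused across actions benefits from caching.
events = spark.range(1_000_000)
events.persist(StorageLevel.MEMORY_ONLY)

events.count()  # first action computes and caches
events.count()  # second action reads from memory
```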

Commands & Queries

| Command | Purpose |
| --- | --- |
| `spark-submit --conf spark.sql.shuffle.partitions=200 job.py` | Set the shuffle partition count at submit time |
| `df.repartition(100)` | Redistribute `df` into 100 partitions (triggers a full shuffle) |
| `df.persist(StorageLevel.MEMORY_ONLY)` | Cache `df` in executor memory for reuse across actions |
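
Putting the table's commands together: a sketch of repartitioning before a wide aggregation and caching the reused result, assuming a hypothetical DataFrame with an evenly distributed `key` column:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input with an evenly distributed integer key.
df = spark.range(1_000_000).withColumn("key", F.col("id") % 100)

# One up-front shuffle so downstream work sees evenly sized partitions.
balanced = df.repartition(100, "key")

# The aggregate is used by two actions, so cache it before the first one.
agg = balanced.groupBy("key").agg(F.count("*").alias("n"))
agg.persist(StorageLevel.MEMORY_ONLY)

agg.show(5)         # materializes the cache
print(agg.count())  # served from memory, no recomputation
```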

Summary

Right-size partitions, cache reused DataFrames, and minimize shuffles to speed up Spark jobs.

💡 Pro Tip: Use `persist()` for DataFrames that are reused across actions, and avoid wide dependencies (operations that force a shuffle) when a narrow alternative such as a broadcast join exists; see the sketch below.
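
One way to avoid a wide dependency is to broadcast the small side of a join so no shuffle is needed; a sketch assuming hypothetical `facts` and `dims` tables where `dims` fits in executor memory:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical fact table and small dimension lookup.
facts = spark.range(1_000_000).withColumn("dim_id", F.col("id") % 10)
dims = spark.createDataFrame(
    [(i, f"name_{i}") for i in range(10)], ["dim_id", "name"]
)

# broadcast() ships the small table to every executor, so the join is
# map-side (narrow) instead of a shuffle (wide dependency).
joined = facts.join(F.broadcast(dims), "dim_id")
joined.explain()  # plan should show BroadcastHashJoin rather than SortMergeJoin
```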