Last Updated: November 21, 2025
Apache Spark
Big data processing framework
Core Concepts
| Item | Description |
|---|---|
| RDD | Resilient Distributed Dataset |
| DataFrame | Structured data with named columns |
| Transformation | Lazy operation (map, filter, join) |
| Action | Triggers computation (count, collect) |
Common Operations
| Item | Description |
|---|---|
| select() | Select specific columns |
| filter() | Filter rows by condition |
| groupBy() | Group data for aggregation |
| join() | Join two DataFrames |
SQL Operations
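from pyspark.sql import SparkSession

# Setup (illustrative sample data; replace with your own DataFrame)
spark = SparkSession.builder.appName("sql-example").getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 17)], ["name", "age"])

# Register DataFrame as temp view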
df.createOrReplaceTempView("people")
# Run SQL query
result = spark.sql("""
SELECT name, age
FROM people
WHERE age > 18
ORDER BY age DESC
""")
result.show()
💡 Pro Tips
Quick Reference
Spark keeps intermediate data in memory where possible, which avoids repeated disk reads and speeds up iterative and interactive analytics.