Difference between external and internal (managed) Hive tables
partitioning and bucketing
SQL questions like how to find duplicate records etc
Spark performance improvement
cluster vs client mode in Spark
Spark optimization techniques
Hive optimization techniques
repartition vs coalesce
RDD vs Dataframe
File formats
Serialization
Catalyst optimizer and why
Hadoop architecture
how do we handle duplicate data, and the query for it
Unit testing
Job scheduling
Best file format (at the time) was ORC, as it was compressed and not a cleartext/flat file. Parquet is also a good option and is now considered the standard (this was before Delta).
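
For the duplicate-record questions in the list above, here is a minimal sketch of one common answer, assuming duplicates are rows that share the same name and email. The `customers` table, its columns, and the sample rows are made up for illustration, and SQLite (via Python's built-in `sqlite3`) stands in for Hive or Spark SQL so the example runs anywhere.

```python
import sqlite3

# In-memory database with a hypothetical "customers" table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Ann", "ann@example.com"),
        (2, "Bob", "bob@example.com"),
        (3, "Ann", "ann@example.com"),
        (4, "Cid", "cid@example.com"),
        (5, "Bob", "bob@example.com"),
        (6, "Ann", "ann@example.com"),
    ],
)

# Find duplicates: group on the columns that define "the same record"
# and keep only the groups that occur more than once.
duplicates = conn.execute(
    """
    SELECT name, email, COUNT(*) AS copies
    FROM customers
    GROUP BY name, email
    HAVING COUNT(*) > 1
    ORDER BY name
    """
).fetchall()

# Handle duplicates: keep one row per group, here the one with the lowest id.
deduplicated = conn.execute(
    """
    SELECT MIN(id) AS id, name, email
    FROM customers
    GROUP BY name, email
    ORDER BY MIN(id)
    """
).fetchall()

print(duplicates)
print(deduplicated)
```

In Hive or Spark SQL the same deduplication is often written with `ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY id)` and a filter on row number 1, which helps when the full row has to be kept rather than just the grouped columns.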