Meta interview question

How to design a multimodal LLM? Coding: DFS on tree.