The platform has thousands of aerospace suppliers with structured attributes (capacities,
certifications) and unstructured documents attached to them (HTML pages, PDFs). All of this
information has some commonalities, but a lot of what makes each of these companies
successful doesn’t necessarily fit a common schema.
A buyer for an aerospace company should be able to communicate a need in plain language
and receive a list of suppliers that match its requirements and the context surrounding the
request.
Note: The current system uses Elasticsearch full-text search, and you can test it out here:
https://axya.co/suppliers_directory?page=0
Challenge:
1. Index structured and unstructured data into a unified semantic search solution to answer
capability queries (e.g., "CNC machining for titanium aerospace parts").
2. Make sure that the deterministic parts of the query are treated as such (e.g., a specific
required certification or the geolocation of the suppliers).
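As an illustration of item 2, the sketch below separates the deterministic part of a query (here, certifications only) from the free text that would go to semantic search. It is a minimal sketch under stated assumptions: the certification vocabulary, the `ParsedQuery` type, and the supplier dictionary shape are all hypothetical, and geolocation would need a similar but separate extraction step.

```python
import re
from dataclasses import dataclass, field

# Hypothetical vocabulary; a real list would come from the platform's structured data.
KNOWN_CERTIFICATIONS = ("AS9100", "ISO 9001", "NADCAP")


@dataclass
class ParsedQuery:
    semantic_text: str                                  # goes to semantic search
    certifications: list = field(default_factory=list)  # applied as exact filters


def _cert_pattern(cert: str) -> re.Pattern:
    # Tolerate missing or extra whitespace inside a name, e.g. "ISO9001".
    parts = [re.escape(part) for part in cert.split()]
    return re.compile(r"\b" + r"\s*".join(parts) + r"\b", re.IGNORECASE)


def parse_query(query: str) -> ParsedQuery:
    """Pull known certifications out of the query and keep the rest as free text."""
    certifications = []
    remaining = query
    for cert in KNOWN_CERTIFICATIONS:
        pattern = _cert_pattern(cert)
        if pattern.search(remaining):
            certifications.append(cert)
            remaining = pattern.sub(" ", remaining)
    return ParsedQuery(" ".join(remaining.split()), certifications)


def filter_suppliers(suppliers: list, parsed: ParsedQuery) -> list:
    """Keep only suppliers holding every required certification (exact match)."""
    required = set(parsed.certifications)
    return [s for s in suppliers if required <= set(s["certifications"])]
```

Only the suppliers that pass `filter_suppliers` would then be ranked semantically against `semantic_text`, so a required certification can never be traded off against embedding similarity.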
Key Requirements:
● Data Ingestion & Preprocessing: Describe ETL for structured tables and document
parsing (PDF, HTML), metadata extraction, and cleaning.
● Embedding & Vector Store: Choose embedding models (e.g., OpenAI embeddings,
Sentence Transformers) and vector database architecture.
● “RAG” Pipeline: Illustrate how a retrieval layer and LLM can be combined to answer
free‐text queries with structured output (e.g., top-N supplier list with relevancy scores).
● Cloud Deployment: Architect an AWS-based solution for indexing, query API, and
autoscaling.
● MLOps & Monitoring: Propose a CI/CD process for retraining embeddings (if needed),
refreshing indexes, and tracking query performance and drift.
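For the Data Ingestion & Preprocessing requirement, the following sketch shows one possible shape for the HTML side: strip markup to plain text, then split it into overlapping word windows ready for embedding. It uses only the Python standard library; the function names, chunk size, and overlap are illustrative assumptions, and PDF parsing would need a dedicated library that is not shown here.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts)


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list:
    """Split text into word windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap is there so that a capability described across a chunk boundary still appears whole in at least one chunk; the right sizes depend on the embedding model chosen.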
Note 1: Whenever possible, we strongly prefer reusing existing technologies over adding new
ones.
Note 2: All of the information collected and used for indexing is public information from
suppliers.
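In the spirit of Note 1, one option worth considering is to keep Elasticsearch as the vector store as well. The sketch below builds a request body for its approximate kNN search with exact-match filters applied to the candidates. It assumes a cluster version that supports the top-level `knn` search option (8.x) and an index with a `dense_vector` field; the field names `embedding`, `certifications`, `supplier_id`, and `name` are hypothetical.

```python
def build_knn_search_body(query_vector, required_certifications,
                          top_n=10, num_candidates=100):
    """Build an Elasticsearch search body: kNN ranking plus exact-match filters."""
    if num_candidates < top_n:
        raise ValueError("num_candidates must be at least top_n")

    knn = {
        "field": "embedding",              # hypothetical dense_vector field
        "query_vector": list(query_vector),
        "k": top_n,
        "num_candidates": num_candidates,
    }

    # Deterministic requirements become filters, so they are never "fuzzy".
    filters = [{"term": {"certifications": cert}}
               for cert in required_certifications]
    if filters:
        knn["filter"] = {"bool": {"filter": filters}}

    return {
        "knn": knn,
        "size": top_n,
        "_source": ["supplier_id", "name"],  # hypothetical stored fields
    }
```

The resulting dictionary could be sent to the `_search` endpoint of the supplier index; the hits and their scores would then form the top-N supplier list that the LLM layer formats or explains.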
Deliverables
1. Slide Deck: 12–15 slides covering both projects end-to-end.
2. Architecture Diagrams: Detailed AWS diagrams for each system’s components, data
flows, and failover strategies.
3. Code Snippets / Pseudocode: Examples of key modules (e.g., data ingestion, model
inference, CI pipeline definitions).
4. Security & Compliance Notes: Brief discussion on data privacy and access controls
(when necessary).
5. (bonus) Optional Prototype: If time permits, a minimal proof‐of‐concept (e.g., Jupyter
notebook or small Lambda function).