[1.3] Knowledge base ingestion pipeline (GCS → chunking + embedding → Vertex AI Vector Search) #64

@jasminetay-moe

Description

Epic: E1 — TW RAG + Model Service
Priority: P0
Role: Engineer

User Story

As an Engineer, I want to build an end-to-end ingestion pipeline that reads guidance materials from GCS, chunks and embeds them, and indexes them in Vertex AI Vector Search, so that the RAG layer can retrieve relevant passages based on student/teacher context.

Context

Combines stories 1.7 (GCS ingestion connection) and 1.8 (chunking + embedding pipeline). GCS is the source of truth for guidance materials. Documents are read from the bucket, chunked with configurable size/overlap, embedded via Vertex AI, and indexed in Vector Search through Vertex RAG Engine, using the @google-cloud/aiplatform client library. Chunk size and embedding model choices affect retrieval quality, so tune both during implementation.
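
To make "configurable size/overlap" concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only: `chunkText`, `ChunkOptions`, and the character-based sizing are assumptions rather than anything the issue specifies, and a managed service such as Vertex RAG Engine may chunk by tokens and expose its own settings instead.

```typescript
// Hypothetical helper for illustration; names and units are assumptions.
interface ChunkOptions {
  chunkSize: number;    // maximum characters per chunk
  chunkOverlap: number; // characters shared by consecutive chunks
}

function chunkText(text: string, opts: ChunkOptions): string[] {
  const { chunkSize, chunkOverlap } = opts;
  if (Math.floor(chunkSize) !== chunkSize || chunkSize <= 0) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  if (
    Math.floor(chunkOverlap) !== chunkOverlap ||
    chunkOverlap < 0 ||
    chunkOverlap >= chunkSize
  ) {
    throw new RangeError("chunkOverlap must be an integer in [0, chunkSize)");
  }

  // Each new chunk starts `step` characters after the previous one,
  // so the last `chunkOverlap` characters of a chunk are repeated.
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the text, to avoid
    // emitting a trailing chunk that contains only overlap.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

A larger overlap keeps more context across chunk boundaries at the cost of more chunks to embed and index, which is one of the trade-offs to tune during implementation.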

Acceptance Criteria

  • Different document types (PDF, Word, PPTX) successfully read from the GCS bucket
  • Files imported into Vertex RAG Engine corpus
  • Documents chunked with configurable chunk size and overlap
  • Chunks embedded using Vertex AI embeddings model and indexed in Vertex AI Vector Search
  • End-to-end pipeline tested: GCS document → Vector Search index
  • Retrieval quality tested with sample queries representative of student/teacher contexts
  • Connection and processing errors handled gracefully with logging
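
The file-type and error-handling criteria above could be approached along these lines. This is a sketch under stated assumptions, not the intended implementation: `detectDocType`, `ingestAll`, `IngestReport`, and the extension map are hypothetical names, Word support is assumed to mean `.docx`, and the actual GCS read and Vertex import call is deliberately left as an injected `processOne` function rather than guessed at.

```typescript
// Hypothetical types and names for illustration; not part of any Google Cloud SDK.
type DocType = "pdf" | "word" | "pptx";

// Assumption: "Word" means .docx files.
const EXTENSION_TO_TYPE: Record<string, DocType> = {
  ".pdf": "pdf",
  ".docx": "word",
  ".pptx": "pptx",
};

// Maps a GCS object name to a supported document type, or undefined.
function detectDocType(objectName: string): DocType | undefined {
  const dot = objectName.lastIndexOf(".");
  if (dot < 0) return undefined;
  return EXTENSION_TO_TYPE[objectName.slice(dot).toLowerCase()];
}

interface IngestReport {
  succeeded: string[];
  skipped: string[];
  failed: { objectName: string; reason: string }[];
}

type Logger = (level: "info" | "warn" | "error", message: string) => void;

// Processes documents one at a time. A failure on one document is logged
// and recorded, and the run continues with the next document.
async function ingestAll(
  objectNames: string[],
  processOne: (objectName: string, docType: DocType) => Promise<void>,
  log: Logger = (level, message) => console.log(`[${level}] ${message}`),
): Promise<IngestReport> {
  const report: IngestReport = { succeeded: [], skipped: [], failed: [] };
  for (const objectName of objectNames) {
    const docType = detectDocType(objectName);
    if (docType === undefined) {
      log("warn", `Skipping unsupported file type: ${objectName}`);
      report.skipped.push(objectName);
      continue;
    }
    try {
      await processOne(objectName, docType);
      report.succeeded.push(objectName);
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      log("error", `Failed to ingest ${objectName}: ${reason}`);
      report.failed.push({ objectName, reason });
    }
  }
  log(
    "info",
    `Ingested ${report.succeeded.length}, skipped ${report.skipped.length}, failed ${report.failed.length}`,
  );
  return report;
}
```

Returning a report instead of throwing on the first error lets an end-to-end test assert on exactly which documents reached the index and which did not.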

Dependencies

Combines #12 (1.7) and #13 (1.8)


📄 PRD: Part 1 — Glow CI PRD

Metadata

Labels

engineer (Engineer story), p0 (Must-have), part-1 (Epic 1: Technical Integration)
