This project is a proof of concept (PoC) that demonstrates how multimodal large language models (LLMs) can be applied to complex research tasks by combining vision, text analysis, and real-time web search capabilities.
The focus of this PoC is not full production scalability, but validating the seamless integration of vision and text, mathematical rendering quality, and user experience under real-world research constraints.
Advanced research often requires more than just text-based interactions. Researchers and developers face specific challenges:
- Visual Context: Questions often relate to charts, diagrams, or physical objects that text cannot fully describe.
- Stale Information: Standard LLMs have knowledge cut-offs; research requires up-to-date web data.
- Formatting Constraints: Scientific output containing mathematical formulas (LaTeX) and code blocks is often poorly rendered in standard chat interfaces.
This PoC explores how to unify GPT-4o's vision capabilities with Web Search tools in a single, polished UI to solve these friction points.
- Multimodal Analysis: Simultaneous processing of high-resolution images and complex text queries using GPT-4o.
- Real-Time Web Research: Autonomous decision-making by the model to perform web searches when current data is needed.
-
Scientific Rendering Engine: A specialized frontend pipeline using
KaTeXandReact Markdownto render complex mathematical equations ($E=mc^2$ ) and tables perfectly. -
Context-Aware Memory: Maintains conversational context across turns using a
previous_response_idmechanism. - Modern UX Architecture: A clean, responsive interface built with Tailwind CSS and Lucide icons, distinguishing clearly between user input and AI analysis.
The solution implements a Microservices-lite architecture pattern, bridging a modern React frontend with the OpenAI Responses API through a secure FastAPI gateway.
- Ingestion: The React Frontend captures multimodal input (text prompts + raw image bytes) and handles client-side state.
- Orchestration: The FastAPI Backend acts as a secure proxy, restructuring the payload and validating requests via Pydantic.
- Reasoning & Retrieval: The request is forwarded to OpenAI
gpt-4o, which dynamically decides whether to use internal knowledge or trigger the Hosted Web Search tool. - Rendering: The structured response is returned to the client, where the Markdown Engine renders mathematical notation (LaTeX) and visual content in real-time.
- Python 3.10+
- FastAPI: High-performance Backend API
- React (Vite + TS): Reactive User Interface
- OpenAI GPT-4o: Multimodal Reasoning Model
- Tailwind CSS: Styling Framework
- KaTeX / Remark: Mathematical & Markdown Rendering
The PoC sends images directly as Base64 Data URIs to the API.
- Why: To simplify the architecture for the PoC and avoid the complexity of setting up a separate object storage service (like S3 or Azure Blob) during the validation phase.
- Trade-off: Increases payload size, which is acceptable for single-user testing but would need optimization in production.
Conversation history is currently held in the Frontend state.
- Why: To ensure immediate UI responsiveness and reduce backend database overhead during the prototyping of the interaction model.
- Future: A persistent database (PostgreSQL/Redis) would be required for long-term history storage.
The chat endpoint waits for the full generation before responding.
- Why: To ensure the integrity of the markdown and math syntax before rendering.
- Trade-off: Users see a loader instead of a stream. Streaming responses would be the next step for UX improvement.
This PoC intentionally does not include:
- User Authentication (OAuth/JWT).
- Persistent database storage for chat history (Session-based only).
- File storage buckets (Images are transient).
- Streaming API responses (Server-Sent Events).
If extended beyond PoC, the following areas would be addressed:
- Streaming Support: Implementing SSE for typewriter-effect responses.
- Storage Layer: Integrating AWS S3 or Azure Blob for image handling.
- Vector Memory: Adding a vector database (Pinecone/Chroma) for long-term research memory.
- Containerization: Dockerizing frontend and backend for orchestration (K8s).
Follow these steps to set up and run the project locally on your machine.
Navigate to the backend directory (root where main.py exists):
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install fastapi uvicorn python-dotenv openai pydanticCreate a .env file in the root directory and add your OpenAI API Key:
OPENAI_API_KEY="sk-your-openai-api-key"Start the Server:
uvicorn main:app --reloadServer runs at: http://localhost:8000
Navigate to the frontend directory:
cd frontend
npm installMake sure specific rendering libraries are installed:
npm install react-markdown remark-math remark-gfm rehype-katex lucide-reactStart the Application:
npm run devApp runs at: http://localhost:5173
Once the application is running:
- Visual Analysis: Click the Paperclip icon to upload a diagram, chart, or photo. Ask a question like "Analyze the trend in this chart."
- Deep Research: Ask a question requiring current data, e.g., "What were the latest stock market trends for AI companies yesterday?". The system will auto-trigger web search.
- Math & Code: Try asking for a formula: "Explain the Black-Scholes equation." Observe the LaTeX rendering.
Umit Sener