Azure Enterprise RAG Assistant (OCR-Enabled) — PoC

This project is a forward-deployed style proof of concept (PoC) that demonstrates how large language models can be applied to enterprise PDF documents using a Retrieval-Augmented Generation (RAG) approach.

The focus of this PoC is not full production scalability, but validating retrieval quality, explainability, and user experience under real-world enterprise constraints such as scanned documents, complex layouts, and hallucination risk.

🎯 Problem Statement

Enterprise knowledge is often stored in PDF documents that are difficult to work with using LLMs:

Scanned Content: Documents may be partially or fully scanned.
Complex Layouts: Content often includes tables, long sections, and academic formatting.
Trust Issues: Ungrounded LLM responses can introduce hallucination risk and reduce trust.

This PoC explores how to safely enable “ask your data” scenarios while preserving traceability, context integrity, and controlled model behaviour.

🚀 What This PoC Demonstrates

Hybrid PDF Processing: A fast text extraction path (PyMuPDF) with an automatic OCR fallback (Azure AI Document Intelligence) for scanned pages.
Context-Preserving Chunking: Avoids breaking tables and structured content across chunks.
Semantic Retrieval: Uses embedding-based vector search to match queries to chunks by meaning rather than by exact keywords.
Grounded Generation: Answers are generated using only the retrieved context.
Transparent Citations: Page-level citation visibility for user verification.
Smart Conversation Routing: Distinguishes between conversational follow-ups (chat) and document search requests (RAG).
Security Persona: Controlled prompt persona to handle sensitive academic topics without false refusals.
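
The context-preserving chunking idea above can be illustrated with a minimal sketch. It assumes an upstream step has already split the page text into blocks, where each block is either a paragraph or a whole table. The function name `chunk_blocks` and the parameters `max_chars` and `overlap_blocks` are hypothetical and are not taken from azure_rag.py.

```python
from typing import List


def chunk_blocks(
    blocks: List[str], max_chars: int = 2000, overlap_blocks: int = 1
) -> List[str]:
    """Pack whole blocks into overlapping chunks without splitting a block.

    `max_chars` is a soft limit: an oversized block, or a block added on top
    of carried-over overlap, may push a chunk past it rather than be cut.
    """
    chunks: List[str] = []
    current: List[str] = []
    fresh = 0  # blocks in `current` that are new, not carried-over overlap

    for block in blocks:
        size = sum(len(b) for b in current) + len(block)
        if fresh > 0 and size > max_chars:
            # Close the current chunk and carry its tail into the next one.
            chunks.append("\n\n".join(current))
            current = current[-overlap_blocks:] if overlap_blocks > 0 else []
            fresh = 0
        current.append(block)
        fresh += 1

    if fresh > 0:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    table = "| a | b |\n| 1 | 2 |"
    for chunk in chunk_blocks(["intro text", table, "closing text"], max_chars=25):
        print(repr(chunk))
```

Because a table is treated as a single block, it always lands in a chunk intact; the trade-off is that chunk sizes are less uniform than with fixed-width splitting.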

🏗 High-Level Architecture

Upload: The user uploads a PDF through the Streamlit interface.
Processing:
- Digital text is extracted using PyMuPDF.
- If insufficient text is found (scanned page), OCR is applied using Azure AI Document Intelligence.
Indexing: Extracted text is split into large overlapping chunks, embedded using text-embedding-ada-002, and stored in Azure AI Search.
Routing: User queries are classified as either conversational follow-ups or document search requests.
Retrieval & Generation: Relevant chunks are retrieved and passed to GPT-4o for answer generation.
Display: Answers are displayed along with expandable references showing source text and page numbers.
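
The retrieval and generation step can be sketched as a prompt builder that tags each retrieved chunk with its page number, so the model can cite pages and is told to stay within the supplied context. This is an illustration only: the `RetrievedChunk` type, the `build_grounded_prompt` function, and the prompt wording are assumptions, not the prompt used in the project.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RetrievedChunk:
    page: int  # 1-based page number the chunk was extracted from
    text: str


def build_grounded_prompt(question: str, chunks: List[RetrievedChunk]) -> str:
    """Assemble a prompt that restricts the answer to the retrieved context."""
    context = "\n\n".join(f"[page {c.page}] {c.text}" for c in chunks)
    return (
        "Answer the question using only the context below. "
        "Cite page numbers like [page N]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    demo = [RetrievedChunk(page=5, text="Risk factors include currency exposure.")]
    print(build_grounded_prompt("What are the risk factors?", demo))
```

The resulting string would be sent to the chat model as the user or system message; the call to Azure OpenAI itself is left out of this sketch.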

🛠 Technology Stack

Python 3.10+
Streamlit: User Interface
Azure OpenAI: GPT-4o (Reasoning) & text-embedding-ada-002 (Embeddings)
Azure AI Search: Vector Store
Azure AI Document Intelligence: OCR Fallback
PyMuPDF: Fast PDF Parsing
LangChain: Orchestration & Chains

⚖️ Design Decisions and Trade-Offs

Fixed Index Strategy (V1)

The PoC uses a single, fixed Azure AI Search index. New documents are appended to the same index rather than creating a new index per upload.

Why: To avoid routing ambiguity during conversational retrieval and keep the PoC focused on retrieval quality/UX rather than index orchestration.
Future: Index versioning (V2, V3) or tenant isolation would be introduced for production.

Hybrid Parsing Strategy

OCR is used only when necessary to control cost and latency.

Why: A simple text-length threshold determines when OCR is triggered: a page that yields too little digital text is treated as scanned. This reflects a common enterprise trade-off between accuracy and performance.
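
A minimal sketch of the threshold check follows. It assumes the digital text for each page has already been extracted (in the app this would come from PyMuPDF, for example via `page.get_text()`), and that OCR is reachable through a callable that wraps Azure AI Document Intelligence. The threshold value `MIN_CHARS`, the function names, and the `ocr_page` callable are hypothetical; the project's actual threshold is not stated here.

```python
from typing import Callable, Iterable, List, Tuple

MIN_CHARS = 50  # assumed threshold, not the project's actual value


def needs_ocr(digital_text: str, min_chars: int = MIN_CHARS) -> bool:
    """Treat a page as scanned when it yields too little digital text."""
    return len(digital_text.strip()) < min_chars


def extract_pages(
    digital_texts: Iterable[str],
    ocr_page: Callable[[int], str],
    min_chars: int = MIN_CHARS,
) -> List[Tuple[int, str, str]]:
    """Return (page_number, text, source) per page, calling OCR only if needed."""
    results: List[Tuple[int, str, str]] = []
    for page_number, text in enumerate(digital_texts, start=1):
        if needs_ocr(text, min_chars):
            # Only sparse pages pay the OCR cost and latency.
            results.append((page_number, ocr_page(page_number), "ocr"))
        else:
            results.append((page_number, text.strip(), "digital"))
    return results


if __name__ == "__main__":
    pages = ["x" * 200, "  \n "]  # second page stands in for a scanned page
    print(extract_pages(pages, ocr_page=lambda n: f"ocr text for page {n}"))
```

Recording the source ("digital" or "ocr") per page makes it easy to see afterwards how often the fallback was actually used.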

Conversation Routing

A lightweight router distinguishes between conversational follow-ups and new search queries.

Why: This reduces unnecessary retrieval calls (cost saving) and improves response relevance (better UX).
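
One way such a router could work is a small keyword heuristic, shown below as a sketch. Whether the project routes with rules like these or with an LLM classifier is not stated, so the patterns, the `route` function, and the `has_history` flag are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; a real router would need a broader, tested set.
GREETING = re.compile(r"^(hi|hello|hey|thanks|thank you)\b")
FOLLOW_UP = re.compile(r"\b(shorter|simpler|rephrase|say that again|translate that)\b")


def route(query: str, has_history: bool) -> str:
    """Return "chat" for conversational turns and "rag" for document searches."""
    text = query.strip().lower()
    if GREETING.search(text):
        return "chat"
    # A request to rework the previous answer only makes sense with history.
    if has_history and FOLLOW_UP.search(text):
        return "chat"
    return "rag"


if __name__ == "__main__":
    print(route("Thanks!", has_history=False))
    print(route("Can you make that shorter?", has_history=True))
    print(route("What are the risk factors on page 5?", has_history=True))
```

Defaulting to "rag" when nothing matches keeps answers grounded at the price of an occasional unnecessary retrieval call.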

Explainability First

The PoC prioritizes transparency.

Why: Source documents and page numbers are always visible via "Reference Tabs". The system favours explainability over aggressive automation to build user trust.
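
The reference entries behind such a display can be sketched as a small transformation from retrieved chunks to labelled snippets. The `build_references` function, the label format, and the snippet length are hypothetical; in a Streamlit app each entry could then be rendered inside an expandable element such as `st.expander`.

```python
from typing import Dict, List, Tuple


def build_references(
    chunks: List[Tuple[int, str]], snippet_chars: int = 200
) -> List[Dict[str, str]]:
    """Turn (page, text) pairs into labelled, length-limited reference entries."""
    references: List[Dict[str, str]] = []
    for index, (page, text) in enumerate(chunks, start=1):
        if len(text) > snippet_chars:
            snippet = text[:snippet_chars].rstrip() + "..."
        else:
            snippet = text
        references.append(
            {"label": f"Reference {index} (page {page})", "snippet": snippet}
        )
    return references


if __name__ == "__main__":
    demo = [(5, "Risk factors include currency exposure."), (12, "x" * 300)]
    for ref in build_references(demo):
        print(ref["label"], "->", ref["snippet"][:40])
```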

⚠️ PoC Scope and Known Limitations

This PoC intentionally does not include:

Multi-tenant isolation
Concurrent user handling
Index lifecycle automation
Full observability and telemetry
Authentication and authorization
Persistent chat history (chat history is session-based only)

🔮 Production-Oriented Next Steps

If extended beyond PoC, the following areas would be addressed:

Microservices decomposition for ingestion, retrieval, and chat.
Tenant-aware index strategies.
Structured logging and monitoring.
Evaluation frameworks (RAGAS or TruLens).
Persistent conversation storage (Azure Cosmos DB or PostgreSQL).
Fine-grained access control (RBAC).

🚀 Quick Start & How to Use

Follow these steps to set up and run the project locally on your machine.

1. Clone the Repository

Open your terminal and clone the repo:

git clone https://github.com/UmitFSD/ask-your-data.git
cd ask-your-data

2. Install Dependencies

Make sure you have Python 3.10+ installed. Then install the required libraries:

pip install -r requirements.txt

3. Configure Azure Credentials

Since this project uses Azure services, you need to provide your API keys.

Create a folder named .streamlit in the root directory.
Inside that folder, create a file named secrets.toml.
Paste your Azure keys into the file like this:

# .streamlit/secrets.toml

AZURE_OPENAI_KEY = "your-openai-key-here"
AZURE_OPENAI_ENDPOINT = "https://your-resource.openai.azure.com/"

AZURE_SEARCH_KEY = "your-search-key-here"
AZURE_SEARCH_ENDPOINT = "https://your-search-service.search.windows.net"

AZURE_DOCUMENT_INTELLIGENCE_KEY = "your-doc-intel-key-here"
AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT = "https://your-form-recognizer.cognitiveservices.azure.com/"
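
Before starting the app, it can help to check that every key is present and non-empty. The sketch below works on any mapping of key names to values; the `missing_secrets` helper is hypothetical and not part of azure_rag.py. Passing Streamlit's `st.secrets` object to it is assumed to work because that object behaves like a read-only mapping.

```python
from typing import Any, List, Mapping

REQUIRED_KEYS = (
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_SEARCH_KEY",
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_DOCUMENT_INTELLIGENCE_KEY",
    "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT",
)


def missing_secrets(secrets: Mapping[str, Any]) -> List[str]:
    """Return the required keys that are absent or blank."""
    return [
        key
        for key in REQUIRED_KEYS
        if key not in secrets or not str(secrets[key]).strip()
    ]


if __name__ == "__main__":
    example = {"AZURE_OPENAI_KEY": "your-openai-key-here", "AZURE_SEARCH_KEY": " "}
    print("Missing:", missing_secrets(example))
```

Failing early with a list of missing names is usually easier to debug than an authentication error raised later by one of the Azure clients.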

4. Run the Application

Start the Streamlit app:

streamlit run azure_rag.py

The application will open automatically in your browser at http://localhost:8501.

📖 Usage Guide

Once the application is running, follow these steps to interact with your data:

1. Ingest Data

Open the Data Ingestion sidebar on the left.
Drag and drop your PDF document (e.g., a financial report or an academic paper).
Click the "Analyze & Index Document" button.
Wait for the "✅ Indexing Complete!" message.

2. Chat with your Document

You can now ask questions in the chat input box. The system handles context-aware follow-ups and multilingual queries.

🧩 Example Questions to Try

Summarization: "Summarize the executive summary and key findings."
Specific Data: "What are the risk factors mentioned on page 5?"
Table Analysis: "Interpret the numbers in Table 3."
Multilingual: "Bu dökümandaki ana riskler nelerdir?" (Turkish for "What are the main risks in this document?")

👨‍💻 Author

Ümit Şener
Senior Cloud & AI Specialist
