UmitFSD/ask-your-data

Azure Enterprise RAG Assistant (OCR-Enabled) - PoC

This project is a forward-deployed style proof of concept (PoC) that demonstrates how large language models can be applied to enterprise PDF documents using a Retrieval-Augmented Generation (RAG) approach.

The focus of this PoC is not full production scalability, but validating retrieval quality, explainability, and user experience under real-world enterprise constraints such as scanned documents, complex layouts, and hallucination risk.


📸 Interface Preview

[Image: dashboard]

🎯 Problem Statement

Enterprise knowledge is often stored in PDF documents that are difficult to work with using LLMs:

  • Scanned Content: Documents may be partially or fully scanned.
  • Complex Layouts: Content often includes tables, long sections, and academic formatting.
  • Trust Issues: Ungrounded LLM responses can introduce hallucination risk and reduce trust.

This PoC explores how to safely enable "ask your data" scenarios while preserving traceability, context integrity, and controlled model behaviour.


🚀 What This PoC Demonstrates

  • Hybrid PDF Processing: A fast text extraction path (PyMuPDF) with an automatic OCR fallback (Azure AI Document Intelligence) for scanned pages.
  • Context-Preserving Chunking: Avoids breaking tables and structured content across chunks.
  • Semantic Retrieval: Uses vector search over embeddings, so queries match chunks by meaning rather than by exact keywords.
  • Grounded Generation: Answers are generated using only the retrieved context.
  • Transparent Citations: Page-level citation visibility for user verification.
  • Smart Conversation Routing: Distinguishes between conversational follow-ups (chat) and document search requests (RAG).
  • Security Persona: Controlled prompt persona to handle sensitive academic topics without false refusals.
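
To make the semantic retrieval idea concrete, here is a toy sketch of ranking chunks by cosine similarity. It is an illustration only: in this PoC the ranking is done by Azure AI Search over text-embedding-ada-002 vectors, and the function names and three-dimensional vectors below are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=2):
    """Return the k chunk texts whose vectors are most similar to the query.

    `chunks` is a list of (text, vector) pairs (hypothetical shape).
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy vectors; real embeddings have far more dimensions.
toy_chunks = [
    ("Revenue grew in Q3.", [0.9, 0.1, 0.0]),
    ("The office has a new cafeteria.", [0.0, 0.2, 0.9]),
    ("Profit margins improved.", [0.8, 0.3, 0.1]),
]
best = top_k([1.0, 0.2, 0.0], toy_chunks, k=2)
```

The two finance-related chunks rank above the cafeteria chunk even though the query vector shares no literal keywords with them, which is the behaviour the bullet describes.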

๐Ÿ— High-Level Architecture

  1. Upload: The user uploads a PDF through the Streamlit interface.
  2. Processing:
    • Digital text is extracted using PyMuPDF.
    • If insufficient text is found (scanned page), OCR is applied using Azure AI Document Intelligence.
  3. Indexing: Extracted text is split into large overlapping chunks, embedded using text-embedding-ada-002, and stored in Azure AI Search.
  4. Routing: User queries are classified as either conversational follow-ups or document search requests.
  5. Retrieval & Generation: Relevant chunks are retrieved and passed to GPT-4o for answer generation.
  6. Display: Answers are displayed along with expandable references showing source text and page numbers.
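
The indexing step splits text into overlapping chunks. A minimal sketch of that idea is shown below, assuming simple character windows. The `chunk_size` and `overlap` values are hypothetical (the README does not state the real ones), the project may well use a LangChain splitter instead, and this sketch does not attempt the table-preserving behaviour described earlier.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character windows that overlap.

    Each chunk repeats the last `overlap` characters of the previous one,
    so a sentence cut at a boundary still appears whole in one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Small numbers make the overlap easy to see.
pieces = chunk_text("abcdefghij", chunk_size=4, overlap=2)
```

A larger overlap reduces the chance of splitting a relevant passage but increases the number of chunks to embed and store.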

🛠 Technology Stack

  • Python 3.10+
  • Streamlit: User Interface
  • Azure OpenAI: GPT-4o (Reasoning) & text-embedding-ada-002 (Embeddings)
  • Azure AI Search: Vector Store
  • Azure AI Document Intelligence: OCR Fallback
  • PyMuPDF: Fast PDF Parsing
  • LangChain: Orchestration & Chains

⚖️ Design Decisions and Trade-Offs

Fixed Index Strategy (V1)

The PoC uses a single, fixed Azure AI Search index. New documents are appended to the same index rather than creating a new index per upload.

  • Why: To avoid routing ambiguity during conversational retrieval and keep the PoC focused on retrieval quality/UX rather than index orchestration.
  • Future: Index versioning (V2, V3) or tenant isolation would be introduced for production.

Hybrid Parsing Strategy

OCR is used only when necessary to control cost and latency.

  • Why: A simple text-length threshold determines when OCR is triggered. This reflects a common enterprise trade-off between accuracy and performance.
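
The threshold decision can be sketched as follows. This is not the project's code: the threshold value and function names are assumptions, and the two extractors are passed in as plain callables. In the app, the fast path would be PyMuPDF (for example `page.get_text()`) and the fallback would call Azure AI Document Intelligence.

```python
from typing import Callable

# Hypothetical threshold; the README does not state the real value.
MIN_CHARS_PER_PAGE = 50


def needs_ocr(extracted_text: str, min_chars: int = MIN_CHARS_PER_PAGE) -> bool:
    """Treat a page with very little extractable text as scanned."""
    return len(extracted_text.strip()) < min_chars


def extract_page(
    fast_extract: Callable[[], str],
    ocr_extract: Callable[[], str],
    min_chars: int = MIN_CHARS_PER_PAGE,
) -> tuple[str, str]:
    """Try the cheap path first and call OCR only if the result is too short.

    Returns (text, source), where source is "digital" or "ocr".
    """
    text = fast_extract()
    if needs_ocr(text, min_chars):
        return ocr_extract(), "ocr"
    return text, "digital"


# Stand-in extractors: a text-rich page and a page that yields only whitespace.
digital_page = extract_page(lambda: "A" * 200, lambda: "unused")
scanned_page = extract_page(lambda: "  \n", lambda: "text recovered by OCR")
```

Because the OCR callable runs only when the fast path falls short, digitally born pages never incur the OCR call.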

Conversation Routing

A lightweight router distinguishes between conversational follow-ups and new search queries.

  • Why: This reduces unnecessary retrieval calls (cost saving) and improves response relevance (better UX).
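
As a rough illustration of the routing idea, the sketch below uses keyword patterns to decide between the chat and RAG paths. The patterns and the `route` function are hypothetical stand-ins; the README does not say how the router is implemented, and an LLM-based classifier is an equally plausible design.

```python
import re

# Hypothetical phrase lists, chosen only for illustration.
GREETING = re.compile(r"^(hi|hello|hey|thanks|thank you|ok|okay)\b")
FOLLOW_UP = re.compile(
    r"\b(explain that again|say that again|rephrase|make it shorter|in simpler terms)\b"
)


def route(message: str, has_history: bool) -> str:
    """Return "chat" for conversational turns and "rag" for document searches."""
    text = message.strip().lower()
    if GREETING.search(text):
        return "chat"
    # A follow-up about a previous answer only makes sense if one exists.
    if has_history and FOLLOW_UP.search(text):
        return "chat"
    return "rag"


decisions = [
    route("Thanks, that helps!", has_history=False),
    route("Can you rephrase that in simpler terms?", has_history=True),
    route("What are the risk factors on page 5?", has_history=True),
]
```

Defaulting to "rag" when nothing matches keeps the failure mode safe: an unnecessary retrieval costs a little, while a skipped retrieval risks an ungrounded answer.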

Explainability First

The PoC prioritizes transparency.

  • Why: Source documents and page numbers are always visible via "Reference Tabs". The system favours explainability over aggressive automation to build user trust.
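
One way to keep answers traceable is to tag every retrieved chunk with its page number before it reaches the model. The sketch below assumes each chunk is a dict with `page` and `text` keys; that schema, the prompt wording, and the function names are illustrative assumptions rather than the project's actual prompt.

```python
def build_grounded_prompt(question, chunks):
    """Assemble a prompt that restricts the model to page-tagged context.

    `chunks` is a list of dicts with "page" and "text" keys (hypothetical schema).
    """
    context = "\n".join(f"[page {c['page']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the context below. "
        "Cite the page number for each claim. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def cited_pages(chunks):
    """Sorted, de-duplicated page numbers to display next to the answer."""
    return sorted({c["page"] for c in chunks})


retrieved = [
    {"page": 5, "text": "Currency fluctuations are a key risk."},
    {"page": 2, "text": "Revenue grew year over year."},
    {"page": 5, "text": "Supply chain delays may affect delivery."},
]
prompt = build_grounded_prompt("What are the main risks?", retrieved)
pages = cited_pages(retrieved)
```

The same `retrieved` list can feed both the prompt and the reference display, so what the user sees as a source is exactly what the model was given.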

⚠️ PoC Scope and Known Limitations

This PoC intentionally does not include:

  • Multi-tenant isolation
  • Concurrent user handling
  • Index lifecycle automation
  • Full observability and telemetry
  • Authentication and authorization
  • Persistent chat history (history is kept per session only)

🔮 Production-Oriented Next Steps

If extended beyond PoC, the following areas would be addressed:

  • Microservices decomposition for ingestion, retrieval, and chat.
  • Tenant-aware index strategies.
  • Structured logging and monitoring.
  • Evaluation frameworks (RAGAS or TruLens).
  • Persistent conversation storage (CosmosDB/PostgreSQL).
  • Fine-grained access control (RBAC).

🚀 Quick Start & How to Use

Follow these steps to set up and run the project locally on your machine.

1. Clone the Repository

Open your terminal and clone the repo:

git clone https://github.com/UmitFSD/ask-your-data.git
cd ask-your-data

2. Install Dependencies

Make sure you have Python 3.10+ installed. Then install the required libraries:

pip install -r requirements.txt

3. Configure Azure Credentials

Since this project uses Azure services, you need to provide your API keys.

  1. Create a folder named .streamlit in the root directory.
  2. Inside that folder, create a file named secrets.toml.
  3. Paste your Azure keys into the file like this:
# .streamlit/secrets.toml

AZURE_OPENAI_KEY = "your-openai-key-here"
AZURE_OPENAI_ENDPOINT = "https://your-resource.openai.azure.com/"

AZURE_SEARCH_KEY = "your-search-key-here"
AZURE_SEARCH_ENDPOINT = "https://your-search-service.search.windows.net"

AZURE_DOCUMENT_INTELLIGENCE_KEY = "your-doc-intel-key-here"
AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT = "https://your-form-recognizer.cognitiveservices.azure.com/"
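
A missing or blank key usually surfaces later as an opaque authentication error, so it can help to check the configuration at startup. The helper below is a hypothetical sketch, not part of the repository. It accepts any mapping, so in the app Streamlit's `st.secrets` could presumably be passed in place of the plain dictionary used here.

```python
# Key names taken from the secrets.toml example above.
REQUIRED_SECRETS = [
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_SEARCH_KEY",
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_DOCUMENT_INTELLIGENCE_KEY",
    "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT",
]


def missing_secrets(secrets):
    """Return the required key names that are absent or blank."""
    missing = []
    for name in REQUIRED_SECRETS:
        value = str(secrets[name]).strip() if name in secrets else ""
        if not value:
            missing.append(name)
    return missing


# Example configuration with one key absent and one left blank.
example = {
    "AZURE_OPENAI_KEY": "abc123",
    "AZURE_OPENAI_ENDPOINT": "https://example.openai.azure.com/",
    "AZURE_SEARCH_ENDPOINT": "https://example.search.windows.net",
    "AZURE_DOCUMENT_INTELLIGENCE_KEY": "",
    "AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT": "https://example.cognitiveservices.azure.com/",
}
problems = missing_secrets(example)
```

If `problems` is not empty, the app could show the list and stop before any Azure client is created.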

4. Run the Application

Start the Streamlit app:

streamlit run azure_rag.py

The application will open automatically in your browser at http://localhost:8501.


📖 Usage Guide

Once the application is running, follow these steps to interact with your data:

1. Ingest Data

  1. Open the Data Ingestion sidebar on the left.
  2. Drag and drop your PDF document (e.g., a financial report, academic paper).
  3. Click the "Analyze & Index Document" button.
  4. Wait for the "✅ Indexing Complete!" message.

2. Chat with your Document

You can now ask questions in the chat input box. The system handles context-aware follow-ups and multilingual queries.

🧩 Example Questions to Try

  • Summarization: "Summarize the executive summary and key findings."
  • Specific Data: "What are the risk factors mentioned on page 5?"
  • Table Analysis: "Interpret the numbers in Table 3."
  • Multilingual: "Bu dökümandaki ana riskler nelerdir?" (Turkish for "What are the main risks in this document?")

👨‍💻 Author

Ümit Şener
Senior Cloud & AI Specialist

About

Advanced RAG Assistant featuring Query Rewriting, Smart Routing, and Hybrid Parsing (OCR) using Azure OpenAI.
