A lightweight, production-grade NLP system that extracts contextual metadata, keywords, topics, named entities, sentiment, and summaries from PDF, DOCX, and TXT documents.
Built with FastAPI, transformers, BERTopic, and KeyBERT, with a Streamlit UI as the front-end.
- 📁 Supports PDF, DOCX, and TXT files
- 🧠 Generates:
  - Summarized text
  - Sentiment analysis
  - Named entities grouped by type
  - Keywords (TF-IDF + KeyBERT)
  - Document structure insights (headers, paragraphs, sentence stats)
  - Topic modeling with BERTopic
- 💡 NLP stack includes HuggingFace Transformers, spaCy, KeyBERT, and more
- 🌐 Interactive Streamlit front-end
- ✅ JSON download support
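
To make the "document structure insights" item concrete, here is a minimal standard-library sketch of how header, paragraph, and sentence statistics could be derived from extracted text. It is not the project's actual implementation: the function name `structure_stats`, the header heuristic (a short single-line block without closing punctuation), and the naive sentence splitter are all assumptions made for illustration.

```python
import re
from statistics import mean


def structure_stats(text: str) -> dict:
    """Return rough structural statistics for plain text (illustrative only)."""
    # Paragraph blocks are separated by one or more blank lines.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]

    # Assumed heuristic: a header is a single-line block of at most 8 words
    # that does not end in sentence punctuation.
    headers = [
        b for b in blocks
        if "\n" not in b
        and len(b.split()) <= 8
        and not b.endswith((".", "!", "?", ":"))
    ]
    body = [b for b in blocks if b not in headers]

    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [
        s for b in body
        for s in re.split(r"(?<=[.!?])\s+", b)
        if s.strip()
    ]
    lengths = [len(s.split()) for s in sentences]

    return {
        "headers": headers,
        "paragraph_count": len(body),
        "sentence_count": len(sentences),
        "avg_sentence_words": round(mean(lengths), 2) if lengths else 0.0,
    }
```

The returned dictionary is JSON-serializable, which fits the JSON download feature listed above. A real pipeline would more likely rely on spaCy's sentence segmentation than on a regular expression.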
- Backend: FastAPI
- Frontend: Streamlit
- NLP Libraries: transformers, spacy, keybert, bertopic, pytesseract
- PDF/Text Processing: pdfminer.six, docx2txt, python-docx, pdfplumber
git clone https://github.com/jayjain4554/Automated-Meta-Data-Generation.git
cd Automated-Meta-Data-Generation

python -m venv venv
source venv/bin/activate # Linux/macOS
venv\Scripts\activate # Windows

pip install -r requirements.txt
python -m spacy download en_core_web_sm

uvicorn app.main:app --reload
Runs on: http://localhost:8000

streamlit run streamlit_app.py
Runs on: http://localhost:8501

| File Type | Supported |
|---|---|
| .pdf | ✅ |
| .docx | ✅ |
| .txt | ✅ |
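
As a rough illustration of how an upload could be checked against the supported types, the sketch below validates a filename by its extension. The names `SUPPORTED_EXTENSIONS` and `detect_file_type` are hypothetical and not taken from the repository; the actual backend may validate uploads differently (for example by MIME type).

```python
from pathlib import Path

# Extensions listed as supported in this README.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def detect_file_type(filename: str) -> str:
    """Return the normalized file type, or raise ValueError if unsupported."""
    # Compare case-insensitively so "Report.PDF" is accepted.
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}")
    return ext.lstrip(".")
```

Checking the extension up front lets the application reject unsupported files before any text extraction or model inference runs.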
- Push this repo to GitHub
- Go to https://streamlit.io/cloud
- Click “New app” and select the repo
- Set the main file to `streamlit_app.py`
- Set Python version (e.g., 3.10)
- Add `requirements.txt` in app setup
Make sure your repo contains only essential files; exclude venv/ and large model files.
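
One way to check a repository before deploying is a small script like the one below. It is only a sketch: `find_bloat` is a hypothetical helper, and the 50 MB default threshold is an arbitrary assumption, not a documented Streamlit limit.

```python
from pathlib import Path


def find_bloat(repo_root: str, max_bytes: int = 50 * 1024 * 1024) -> list[str]:
    """List venv/ directories and files larger than max_bytes under repo_root."""
    root = Path(repo_root)
    flagged = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if ".git" in rel.parts:
            continue  # ignore Git's internal objects
        if path.is_dir():
            if path.name == "venv":
                flagged.append(f"{rel.as_posix()}/ (virtual environment)")
            continue
        if "venv" in rel.parts:
            continue  # already reported through its directory
        size = path.stat().st_size
        if size > max_bytes:
            flagged.append(f"{rel.as_posix()} ({size} bytes)")
    return flagged


if __name__ == "__main__":
    # Run from the repository root and review anything that is printed.
    for item in find_bloat("."):
        print(item)
```

Anything the script reports is a candidate for `.gitignore` or for removal before pushing.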
Jay Jain, BTech Chemical Engineering, IIT Roorkee
