Dataset, models, and baselines for extracting diagnoses, medication allergies, and usual medications from Portuguese ER admission notes.
This repository accompanies the paper "NER Models for Portuguese Emergency Room Notes: Extracting Diagnoses, Medication Allergies, and Usual Medications" and contains everything needed to reproduce the experiments: the annotated dataset, fine-tuning and evaluation scripts for encoder models, generative LLM baselines, prompts, and pre-computed results.
Clinical use disclaimer: The clinical notes in this dataset are fictional or synthetic and are intended for research and benchmarking only. The models and scripts are not validated for direct clinical use. Deployment in any clinical workflow requires institutional approval, data-governance review, and prospective clinical validation.
Emergency Room (ER) handovers require rapid identification of a patient's principal diagnosis, usual medications, and medication allergies. This project develops and evaluates specialised NER models for these three entities in Portuguese, a language underrepresented in clinical NLP resources.
Key contributions:
- A synthetic dataset of 300 Portuguese ER admission notes (275 LLM-generated with Llama 3.3 + 15 physician-validated), covering eight medical specialties.
- Two-layer annotation: entity spans for the three target classes, plus mappings to standard terminologies (ICD-10 for diagnoses, ATC for medication allergies, SNOMED CT for usual medications).
- Fine-tuned NER models: BioBERT-PT and MediAlbertina, benchmarked against few-shot Gemini and Gemma baselines.
| Model | Exact match | IoU ≥ 0.50 |
|---|---|---|
| BioBERT-PT | 0.75 | 0.82 |
| MediAlbertina | 0.70 | 0.80 |
| Gemma 4 (open-weight) | 0.48 | 0.63 |
| Gemini 2.5 Flash Lite (closed-weight) | 0.35 | 0.44 |
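Encoder models substantially outperform the generative baselines: BioBERT-PT reaches 0.75 on exact match, against 0.48 for Gemma 4, the stronger of the two generative baselines. Principal diagnosis is the most challenging class, while usual medication extraction achieves the strongest performance across all models.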
The repository is organised as follows:

    ER_NER/
    ├── dataset/
    │   ├── train.json  # 257 documents: NER training split (spans + polarity)
    │   ├── val.json  # 28 documents: NER validation split (spans + polarity)
    │   ├── test-real.json  # 15 documents: physician-validated evaluation split
    │   ├── dataset_with_codes.json  # All 300 documents with terminology codes (ICD-10, ATC, SNOMED CT)
    │   └── ER_NER_Dataset_Characterization.ipynb  # Dataset statistics and analysis
    │
    ├── encoders_training_testing/
    │   ├── train.py  # Fine-tuning script (HuggingFace Transformers)
    │   ├── train-run.sh  # Training launcher with hyperparameter config
    │   ├── test-per-class.py  # Evaluation script (exact match + IoU@0.5)
    │   ├── test_sub.sh  # Evaluation launcher
    │   └── additional_details.md  # Full training and evaluation details
    │
    ├── generative_testing/
    │   ├── ER_NER_baseline__cleaned.ipynb  # Gemini / Gemma few-shot NER extraction
    │   └── ER_NER_evaluation.ipynb  # Evaluation of generative model outputs
    │
    ├── prompts/
    │   ├── prompt_synthetic_data_gen.md  # Prompt template used to generate clinical notes
    │   └── prompt_generative_NER_extraction.md  # Prompt template used for generative NER baselines
    │
    ├── results/
    │   ├── biobertpt.json  # BioBERT-PT predictions on test set
    │   ├── medialbertina.json  # MediAlbertina predictions on test set
    │   ├── gemini-2.5-flash-lite/  # Per-document Gemini predictions (JSON + HTML)
    │   └── gemma-4-31b-it/  # Per-document Gemma predictions (JSON + HTML)
    │
    ├── images/
    │   ├── annotation_scheme.png
    │   ├── dataset_split_count.png
    │   ├── annotation_example_diagnosis.png
    │   └── usualmedication+medicationallergies_example.png
    │
    └── README.md

The dataset is released in two complementary versions, suited to different use cases:
| File | Documents | Content | Use case |
|---|---|---|---|
| `train.json` / `val.json` / `test-real.json` | 257 / 28 / 15 | Entity spans and polarity only | NER model training and evaluation |
| `dataset_with_codes.json` | 300 | Entity spans + ICD-10, ATC, and SNOMED CT codes | Normalisation, interoperability, downstream tasks |

| Split | File | Documents | Annotated spans |
|---|---|---|---|
| Train | `train.json` | 257 | 1,493 |
| Validation | `val.json` | 28 | 166 |
| Test | `test-real.json` | 15 | 86 |

The train/validation split uses iterative multi-label stratification to ensure proportional class representation. The test set is composed exclusively of the 15 physician-validated notes, providing a close-to-real-world evaluation benchmark.
| Entity | JSON label | Terminology | Field in dataset_with_codes.json |
|---|---|---|---|
| Principal diagnosis | `Diagnóstico` | ICD-10 | `ICD10` |
| Usual medication | `Medicação Habitual` | SNOMED CT | `SNOMEDCT` |
| Medication allergy | `Alergias medicamentosas` | ATC | `ATC` |

Medication allergies also carry a Polaridade field (Positiva / Negativa) in both dataset versions.
NER splits (`train.json`, `val.json`, `test-real.json`), with spans and polarity only:

    {
      "doc_id": 1,
      "text": "...",
      "annotations": [
        {
          "begin": 924,
          "end": 942,
          "label": "Medicação Habitual"
        },
        {
          "begin": 310,
          "end": 319,
          "label": "Alergias medicamentosas",
          "Polaridade": "Positiva"
        }
      ]
    }

Full dataset with codes (`dataset_with_codes.json`), with spans and terminology codes:

    {
      "doc_id": 1,
      "text": "...",
      "annotations": [
        {
          "begin": 924,
          "end": 942,
          "label": "Medicação Habitual",
          "SNOMEDCT": "376701008",
          "ATC": "",
          "ICD10": ""
        },
        {
          "begin": 310,
          "end": 319,
          "label": "Alergias medicamentosas",
          "Polaridade": "Positiva",
          "SNOMEDCT": "",
          "ATC": "J01CA04",
          "ICD10": ""
        }
      ]
    }

The `begin` and `end` fields are character-level, exclusive-end offsets into `text`. Recover the span text with:

    span = document["text"][annotation["begin"]:annotation["end"]]

Each annotation only populates the terminology field relevant to its entity type; the other code fields are empty strings.

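As a minimal sketch of how the two schemas above might be consumed, the snippet below recovers a span's text and looks up the terminology code that belongs to its entity type. The `CODE_FIELD` mapping mirrors the entity table in this README; the helper names and the toy document are illustrative assumptions and are not part of the repository.

```python
# Label -> terminology field, following the entity table in this README.
CODE_FIELD = {
    "Diagnóstico": "ICD10",
    "Medicação Habitual": "SNOMEDCT",
    "Alergias medicamentosas": "ATC",
}


def span_text(document, annotation):
    """Return the annotated text using exclusive-end character offsets."""
    begin, end = annotation["begin"], annotation["end"]
    if not 0 <= begin < end <= len(document["text"]):
        raise ValueError(f"offsets out of range: {begin}-{end}")
    return document["text"][begin:end]


def terminology_code(annotation):
    """Return (field, code) for the annotation, or None if no code is set."""
    field = CODE_FIELD[annotation["label"]]
    code = annotation.get(field, "")  # NER splits carry no code fields
    return (field, code) if code else None


# Hypothetical toy document (not taken from the dataset).
sample_doc = {
    "doc_id": 0,
    "text": "Alergia a amoxicilina. Medicado com metformina 500 mg.",
    "annotations": [
        {
            "begin": 10,
            "end": 21,
            "label": "Alergias medicamentosas",
            "Polaridade": "Positiva",
            "SNOMEDCT": "",
            "ATC": "J01CA04",
            "ICD10": "",
        },
    ],
}

for ann in sample_doc["annotations"]:
    print(span_text(sample_doc, ann), terminology_code(ann))
```

Because `terminology_code` uses `.get`, the same helper also works on the NER splits, where it simply returns `None`.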
- Medication allergies: the annotated markable is the allergenic agent itself (e.g. `"penicillin"` in "reports a known allergy to penicillin"). For negated contexts (e.g. "has no drug allergies"), the markable is the negated phrase (e.g. `"drug allergies"`) with `Polaridade = Negativa`.
- Principal diagnosis: the most specific ICD-10 code is assigned where possible. Overly generic descriptions (e.g. "cardiovascular disease") are not coded.
- Usual medication: spans include the medication name plus dosage and administration instructions when present.
The task is formulated as BIO sequence labelling over four effective classes:
| BIO class | Description |
|---|---|
| `Alergias medicamentosas__Positiva` | Positive medication allergy |
| `Alergias medicamentosas__Negativa` | Explicit absence of medication allergy |
| `Medicação Habitual` | Usual/chronic medication |
| `Diagnóstico` | Principal diagnosis |

This yields 9 output labels: O plus B- and I- for each of the four classes.
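The label inventory and the conversion from character spans to BIO tags can be sketched as follows. This is an illustration only: it assumes token character offsets are already available (for example from a tokenizer's offset mapping), and the function names are hypothetical rather than taken from `train.py`.

```python
CLASSES = [
    "Alergias medicamentosas__Positiva",
    "Alergias medicamentosas__Negativa",
    "Medicação Habitual",
    "Diagnóstico",
]
# "O" plus a B- and an I- tag per class: 1 + 2 * 4 = 9 labels.
LABELS = ["O"] + [f"{prefix}-{cls}" for cls in CLASSES for prefix in ("B", "I")]


def effective_class(annotation):
    """Fold the polarity into the class name when the annotation has one."""
    polarity = annotation.get("Polaridade")
    label = annotation["label"]
    return f"{label}__{polarity}" if polarity else label


def bio_tags(token_offsets, annotations):
    """Tag each token, given its (start, end) character offsets."""
    tags = ["O"] * len(token_offsets)
    for ann in annotations:
        cls = effective_class(ann)
        inside = False
        for i, (start, end) in enumerate(token_offsets):
            # A token belongs to the span if the two ranges overlap.
            if start < ann["end"] and end > ann["begin"]:
                tags[i] = ("I-" if inside else "B-") + cls
                inside = True
    return tags


# Toy example: "sem alergias medicamentosas", split on whitespace.
offsets = [(0, 3), (4, 12), (13, 27)]
negated = {"begin": 4, "end": 27, "label": "Alergias medicamentosas",
           "Polaridade": "Negativa"}
print(bio_tags(offsets, [negated]))
```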
Two base models are supported:
| Model | HuggingFace identifier |
|---|---|
| MediAlbertina | portugueseNLP/medialbertina_pt-pt_900m |
| BioBERT-PT | pucpr/biobertpt-all |
Configure paths and hyperparameters in `encoders_training_testing/train-run.sh`, then run:

    cd encoders_training_testing
    bash train-run.sh

Handling long documents: notes are split into overlapping 512-token windows with a stride of 128. At inference, logit scores from overlapping windows are averaged per token before argmax decoding.
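The windowing and averaging step could look roughly like the sketch below. It is not the code in `train.py`: the function names are hypothetical, `stride` is assumed to mean the number of tokens shared by consecutive windows (the Hugging Face tokenizer convention), and special tokens are ignored for brevity.

```python
def window_starts(n_tokens, max_len=512, stride=128):
    """Start indices of overlapping windows; `stride` is the assumed overlap."""
    step = max_len - stride
    if step <= 0:
        raise ValueError("stride must be smaller than max_len")
    starts = [0]
    while starts[-1] + max_len < n_tokens:
        starts.append(starts[-1] + step)
    return starts


def average_logits(window_logits, starts, n_tokens):
    """Average logits over every window that covers a token.

    window_logits[w][j] is the logit vector for token starts[w] + j.
    """
    n_labels = len(window_logits[0][0])
    sums = [[0.0] * n_labels for _ in range(n_tokens)]
    counts = [0] * n_tokens
    for start, logits in zip(starts, window_logits):
        for j, vec in enumerate(logits):
            tok = start + j
            counts[tok] += 1
            for k, value in enumerate(vec):
                sums[tok][k] += value
    return [[s / counts[t] for s in sums[t]] for t in range(n_tokens)]


def argmax_decode(avg_logits):
    """Pick the highest-scoring label index for each token."""
    return [max(range(len(vec)), key=vec.__getitem__) for vec in avg_logits]


# Tiny example: 3 tokens, windows of 2 tokens overlapping by 1, 2 labels.
starts = window_starts(3, max_len=2, stride=1)
logits = [[[1.0, 0.0], [0.0, 2.0]], [[4.0, 0.0], [0.0, 1.0]]]
averaged = average_logits(logits, starts, 3)
print(starts, argmax_decode(averaged))
```

Averaging before the argmax lets the middle token, which both windows see, be decided by the combined evidence rather than by whichever window happens to be read last.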
Handling class imbalance: inverse-frequency class weights are applied to the cross-entropy loss (O-class weight capped at 0.25× mean non-O weight; I-X weights forced equal to corresponding B-X weights). Documents containing at least one Alergias medicamentosas__Negativa span are oversampled ×4 during training.
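One way to read the weighting and oversampling rules above is sketched below. Several details are assumptions rather than facts about `train.py`: the inverse-frequency formula (`total / count`), the order of operations (tie I-X to B-X first, then cap O), and the interpretation of ×4 as four copies in total. The function names are hypothetical.

```python
from collections import Counter


def class_weights(tag_sequences, labels, o_cap_ratio=0.25):
    """Inverse-frequency weights with tied I-/B- weights and a capped O weight."""
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    total = sum(counts.values())
    # Assumed formula: total / count; unseen labels get weight 0.0.
    weights = {lab: (total / counts[lab] if counts[lab] else 0.0) for lab in labels}
    # Force each I-X weight to equal the corresponding B-X weight.
    for lab in labels:
        if lab.startswith("I-"):
            weights[lab] = weights["B-" + lab[2:]]
    # Cap the O weight at o_cap_ratio times the mean non-O weight.
    non_o = [w for lab, w in weights.items() if lab != "O"]
    weights["O"] = min(weights["O"], o_cap_ratio * sum(non_o) / len(non_o))
    return weights


def oversample(documents, factor=4):
    """Repeat documents that contain a negated medication allergy span."""
    result = []
    for doc in documents:
        has_negated = any(
            ann["label"] == "Alergias medicamentosas"
            and ann.get("Polaridade") == "Negativa"
            for ann in doc["annotations"]
        )
        result.extend([doc] * (factor if has_negated else 1))
    return result


# Toy tag sequence with a single class X.
toy_tags = [["O", "O", "B-X", "I-X", "I-X", "I-X", "I-X", "I-X"]]
print(class_weights(toy_tags, ["O", "B-X", "I-X"]))
```

The resulting dictionary could then be turned into a weight tensor, ordered by label index, for a weighted cross-entropy loss.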
The best checkpoint (by validation F1) is saved to {OUTPUT_DIR}/best/. See encoders_training_testing/additional_details.md for the complete training specification.
To evaluate a fine-tuned checkpoint on the test set, run:

    cd encoders_training_testing
    python test-per-class.py \
      --model_dir {OUTPUT_DIR}/best \
      --test_json ../dataset/test-real.json \
      --max_len 512 \
      --stride 128 \
      --score_mode joint \
      --pred_json predictions.json

Two span matching criteria are reported:
- Exact match: predicted and gold spans must have identical character-level start and end indices.
- Relaxed match (IoU ≥ 0.50): the character-level overlap between predicted and gold spans must meet a minimum intersection-over-union threshold of 50%.
Evaluation runs in joint extraction + polarity mode (--score_mode joint): missed and spurious spans are penalised in addition to polarity errors.
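The two matching criteria and the joint scoring can be illustrated with the sketch below. It is not the logic of `test-per-class.py`: the greedy one-to-one matching and the function names are assumptions made for the example. Spans are `(begin, end, class)` tuples in which the class already includes the polarity, so a polarity error counts as both a missed and a spurious span.

```python
def iou(a, b):
    """Character-level intersection-over-union of two (begin, end) spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def score(gold, pred, threshold=None):
    """Return (precision, recall, F1); threshold=None means exact match."""
    unmatched = list(gold)
    true_pos = 0
    for p in pred:
        for g in unmatched:
            if p[2] != g[2]:  # class or polarity mismatch
                continue
            if threshold is None:
                hit = (p[0], p[1]) == (g[0], g[1])
            else:
                hit = iou(p[:2], g[:2]) >= threshold
            if hit:
                unmatched.remove(g)  # each gold span matches at most once
                true_pos += 1
                break
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [(0, 10, "Diagnóstico"), (20, 30, "Alergias medicamentosas__Negativa")]
pred = [(0, 8, "Diagnóstico"), (20, 30, "Alergias medicamentosas__Positiva")]
print(score(gold, pred))                 # exact match
print(score(gold, pred, threshold=0.5))  # relaxed match
```

In this toy case the truncated diagnosis span fails exact match but passes the relaxed criterion, while the allergy span with the wrong polarity is rejected under both.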
Few-shot generative NER experiments are in generative_testing/ER_NER_baseline__cleaned.ipynb. Both baselines use LangExtract to obtain structured outputs.
For each test document, the most semantically similar training document (by cosine similarity of text embeddings) is retrieved and used as a dynamic few-shot example.
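The retrieval step amounts to a nearest-neighbour lookup, sketched below in plain Python. The embedding model is not specified here, so the example assumes the document embeddings have already been computed and uses toy vectors; `most_similar` is a hypothetical helper, not a function from the notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(test_vec, train_vecs):
    """Index of the training embedding closest to the test embedding."""
    return max(range(len(train_vecs)),
               key=lambda i: cosine(test_vec, train_vecs[i]))


# Toy 2-dimensional embeddings (placeholders for real text embeddings).
train_embeddings = [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]]
test_embedding = [1.0, 0.0]
print(most_similar(test_embedding, train_embeddings))
```

The training document at the returned index, together with its gold annotations, would then serve as the few-shot example for that test note.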
| Model | Type | Identifier |
|---|---|---|
| Gemini 2.5 Flash Lite | Closed-weight (API) | gemini-2.5-flash-lite |
| Gemma 4 | Open-weight (local) | gemma-4-31B-it |
Privacy note: closed-weight API models are unsuitable for real ER settings due to data governance constraints. Open-weight models can be deployed locally within hospital infrastructure.
Pre-computed per-document predictions are available in results/gemini-2.5-flash-lite/ and results/gemma-4-31b-it/.
| File | Purpose |
|---|---|
| `prompts/prompt_synthetic_data_gen.md` | Template used with Llama 3.3 to generate synthetic clinical notes. Parameterised by `{medical specialty}` and `{allergy}` (presence/absence), with a physician-validated note as `{example}`. |
| `prompts/prompt_generative_NER_extraction.md` | Extraction prompt for the generative baselines, defining the three entity classes and extraction rules. |

- Five physicians from four specialties each wrote one fictional ER admission note.
- Fifteen variations were generated from these examples using the synthetic data generation prompt, then reviewed and validated by the same physicians.
- 275 additional notes were generated using Llama 3.3, with the 15 validated notes as few-shot examples. Medical specialty and allergy presence/absence were varied systematically across generations.
- The resulting 275 synthetic notes were combined with the 15 physician-validated notes for annotation.
Quality evaluation: two independent physicians assessed 60 synthetic notes each using a six-question Likert protocol. The notes scored positively on medication clarity (Q2) and allergy identification (Q5), with moderate scores on diagnosis specificity (Q4). See the paper for the full evaluation results and inter-annotator agreement (Krippendorff's α).
Annotation was performed by a PhD student in Linguistics with a pharmaceutical background, using a layered approach (markables → allergy codes → diagnosis codes → medication codes).
If you use this dataset, models, or code in your work, please cite:

    @inproceedings{ernermodels2026,
      title = {NER Models for Portuguese Emergency Room Notes: Extracting Diagnoses, Medication Allergies, and Usual Medications},
      author = {Anonymous},
      booktitle = {Anonymous Submission},
      year = {2026}
    }

This entry will be updated with the full citation upon publication.

- Dataset (`dataset/`): CC BY-NC 4.0, free for research; commercial use requires permission.
- Code (`encoders_training_testing/`, `generative_testing/`, `prompts/`): Apache 2.0.
