merging the working changes#37
Open
anup28kmr wants to merge 65 commits into
Open
Conversation
# Conflicts: # docker-compose.yaml # fraud_detection/src/app.py # orchestrator/src/app.py # suggestions/Dockerfile # suggestions/requirements.txt # suggestions/src/app.py # transaction_verification/Dockerfile # transaction_verification/requirements.txt # transaction_verification/src/app.py # utils/pb/fraud_detection/fraud_detection.proto # utils/pb/fraud_detection/fraud_detection_pb2.py # utils/pb/fraud_detection/fraud_detection_pb2.pyi # utils/pb/suggestions/suggestions.proto # utils/pb/suggestions/suggestions_pb2.py # utils/pb/transaction_verification/transaction_verification.proto # utils/pb/transaction_verification/transaction_verification_pb2.py # utils/pb/transaction_verification/transaction_verification_pb2.pyi # utils/pb/transaction_verification/transaction_verification_pb2_grpc.py
Merging final changes for check-point-1
Implementation-based checks pass via scripts/checkpoint2-checks.ps1, including checkout scenarios and leader failover. Remaining checkpoint 2 work is documentation and diagrams.
# Conflicts: # fraud_detection/requirements.txt # fraud_detection/src/app.py # orchestrator/requirements.txt # orchestrator/src/app.py # suggestions/requirements.txt # suggestions/src/app.py # transaction_verification/requirements.txt # transaction_verification/src/app.py # utils/pb/fraud_detection/fraud_detection.proto # utils/pb/fraud_detection/fraud_detection_pb2.py # utils/pb/fraud_detection/fraud_detection_pb2.pyi # utils/pb/fraud_detection/fraud_detection_pb2_grpc.py # utils/pb/suggestions/suggestions.proto # utils/pb/suggestions/suggestions_pb2.py # utils/pb/suggestions/suggestions_pb2_grpc.py # utils/pb/transaction_verification/transaction_verification.proto # utils/pb/transaction_verification/transaction_verification_pb2.py # utils/pb/transaction_verification/transaction_verification_pb2_grpc.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Closes the three code/documentation gaps identified in §15.4 of the commit-1ed359b rubric audit, plus documentation polish: - Persist committed kv_store to disk so restarted replicas come back with post-commit stock instead of SEED_STOCK. Uses write-then-rename with a per-thread temp file so concurrent ReplicateWrite handlers do not race on os.replace. - Switch backup replication to persistent cached gRPC channels so concurrent writes fan out over a single HTTP/2 connection per backup instead of opening new channels per call. - Strengthen the concurrent-writes test with hard pass/fail assertions and cross-replica convergence checks. - Document the per-key locking strategy in README and docs/consistency-redesign.md §7. - Add an honesty note to docs/commitment-protocol.md §4.1 making clear that Dequeue is a destructive popleft() with no ack/requeue. - Record Phase 14 in Charlie-Lima-Alfa.md §14, tick §12 checklist items, and mark §15.4 gaps 1-3 as closed. - Also includes a deferred Golf-Papa-Tango.md audit-update section that had been written earlier but not yet committed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds the 6f00e05 rubric audit (CLA §16, GPT audit in GPT doc) and closes the three documentation-consistency gaps it surfaced: README step 2 now points to checkpoint3-checks.ps1, commitment-protocol.md §6 summary drops the stale queue-redelivery claim, and the checkpoint3-checks.ps1 stale-seed comment is rewritten to reflect the Phase 14 kv_store.json persistence. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Archives the 2c42158 peer-review findings in Charlie-Lima-Alfa.md §17 and closes all three in the same commit: 1. docs/commitment-protocol.md §4.2 rewritten so neither the body ("re-dequeue the order") nor the closing paragraph ("queue's redelivery semantics") contradicts the §4.1 honesty note. The replacement leader's redelivery trigger is correctly identified as original-leader-restart or user-resubmit. 2. README.md CP2 "Known limitations" list now explicitly scopes itself to the CP2 snapshot and points to the CP3 kv_store.json persistence added later in the same document. 3. scripts/checkpoint3-checks.ps1 now has a Test-ConcurrentWritesBonus helper invoked in the -SkipBonus guarded section, reported as bonus:concurrent-writes. Closes the §16.4.6 / §16.6 item 7 gap as well. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Archives the daf4812 peer-review findings in Charlie-Lima-Alfa.md §18 and closes both in the same commit: - Gap 18.2 (Medium): harden find_primary() in books_database/tests/test_concurrent_writes.py against the leader-stabilization race that can appear right after a failover/restore cycle. The new gate requires (a) majority WhoIsPrimary agreement across the 3 replicas, (b) a live primary-only Read RPC against the named leader succeeds (exercises the is_leader-False reject branch), and (c) the same leader_id holds for 3 consecutive checks spaced ~1s apart, with a 30s overall deadline. - Gap 18.3 (Low): README step 2 expected-result sentence now says "all 19 checks pass" (matches the post-daf4812 script) and names both bonuses (participant-failure recovery + concurrent-writes). Also includes a small Golf-Papa-Tango.md clarification that the `checkpoint-3` tag is created after team lead peer review + merge to master, plus the full daf4812 audit text used as input to §18. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The team lead's Windows host has Hyper-V dynamic-port exclusions covering 50000-50159, which blocks every gRPC service host port the project was using (50051-50061). Frontend (8080) and orchestrator (8081) sit outside the blocked range and are unaffected. Shifts every gRPC host port by +200 into the free 50160-55195 band (50051 -> 50251, ..., 50061 -> 50261) so the stack runs on his laptop. Container-internal ports are unchanged so PEERS env vars and all service-to-service Docker DNS calls keep working. Touches the matching host-side Python tests + scripts/_cp3_db_probe.py that hard-code 127.0.0.1:5005x/50060/50061. Container-side service code is untouched. README is untouched because the only host endpoints it advertises (8080 frontend, 8081 orchestrator) did not move. Also adds Checkpoint-3-Test-Plan.md, a one-laptop walk-through that maps every CP3 rubric line (R1-R6, B1-B3) to a concrete pass/fail check, leans on scripts/checkpoint3-checks.ps1 for the bulk of the verification, and lists the three known fragile spots (post-failover races, kv_store.json restart persistence, no queue redelivery) that the audit history flagged. Adds .claude/ and books_database/state/ to .gitignore so per-machine Claude Code session state and runtime DB state never leak into commits and pollute fresh checkouts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fold the per-area design notes (commitment-protocol.md, consistency-redesign.md, vector-clock-redesign.md) and the internal CP3 test plan into the root README.md so a reviewer reads one document instead of navigating scattered files. Main content is rubric-shaped (R1-R6 + B1-B3 with a "where it lives / how to verify" table) and the graded bonus write-ups (B1 concurrent writes, B2 participant-failure recovery) live in the main flow; the longer R1/R2 design rationale and the B3 coordinator-failure analysis sit in an appendix. Untrack local-only development notes (Charlie-Lima-Alfa.md, Golf-Papa-Tango.md) and stale folder READMEs (docs/README.md, utils/README.md). Add a .gitignore rule (*.md, !/README.md) so future local notes stay out of the remote. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The earlier consolidation kept design rationale in a separate Appendix.md, but Appendix.md is local-only via the *.md gitignore rule, so the bonus B3 coordinator-failure analysis (graded on the written analysis itself) and the R1/R2 design documentation would not have been visible to graders on the remote. Inline them into README.md so the rubric table's anchor links resolve in-document. Appendix.md keeps the truly skippable material (repository layout, CP2 carry-over, log-grep recipes, known limitations). Also fix the R3 cell that linked to the now-skippable Appendix A.4 section, replacing the broken anchor with a concrete pointer. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…_dev # Conflicts: # README.md
Captures the verbatim requirement set from Guide13 (E2E testing), Guide14 (monitoring/observability), and Guide15 (CP4 brief), maps each requirement to its current repo state at f33f8da, and lays out the per-requirement implementation sketch plus the design choices (OTEL + OTLP HTTP, Locust, pytest for the four E2E scenarios, no frontend instrumentation, no CP3 refactors). Whitelists the three CP4 deliverable docs in .gitignore so they survive the existing 'no .md except README' rule.
Adds the observability compose service (grafana/otel-lgtm), a shared
utils/telemetry.py helper, OTLP-HTTP exporter deps in the four
instrumented services, and per-service instruments that cover the
Guide14 mandatory inventory (>=2 examples each of Span / Counter /
UpDownCounter / Histogram / Asynchronous Gauge):
orchestrator checkout_requests_total, checkout_latency_seconds,
in_flight_checkouts, HTTP span per request
order_executor twopc_total, twopc_latency_seconds,
inflight_2pc_attempts, executor_is_primary
(async gauge), run_2pc span
books_database db_writes_total (Write + 2PC commit paths),
db_write_latency_seconds, db_kv_store_keys
(async gauge), db_pending_orders (async gauge),
db_write span
payment_service payment_total, payment_latency_seconds, prepare
and commit spans
Also shifts gRPC host ports from 502xx to 512xx in compose and the
seven test files that connect to them. The CP3 verifier on this
Windows machine fails to bind 502xx because the OS reserves the
50167-50266 dynamic-port range; the shift moves them outside that
range. Internal (container-side) ports are unchanged so service-to-
service routing and the existing log lines are untouched.
Adds docs/grafana/ with the CP4 dashboard JSON (checkpoint-4.json) and a Grafana file-provider manifest. Compose mounts the manifest as a named file inside /otel-lgtm/grafana/conf/provisioning/dashboards/ so the built-in OTEL-LGTM dashboards keep working, and the dashboards directory itself is mounted at a separate path that the manifest points at. Dashboard panels cover: /checkout request rate and p50/p95/p99 latency, in-flight checkouts gauge, executor primary indicator, 2PC outcomes and latency, DB writes by replica and path, pending 2PC reservations, kv_store key count, payment outcomes/latency, plus a Tempo trace search scoped to service.name=orchestrator.
Adds tests/e2e/ with one pytest file per Guide13-required scenario
plus a shared _common.py for HTTP/concurrency helpers:
test_01_single_clean_order - single non-fraudulent order
test_02_multiple_non_conflicting - 4 concurrent orders, distinct titles
test_03_mixed_fraud - 3 clean + 3 fraud, interleaved
test_04_conflicting_orders - 8 concurrent orders, single book
seeded at stock 3; asserts exactly
3 commit by parsing the executors'
2pc_decision log lines
scripts/checkpoint4-checks.ps1 is the single-command runner. It wipes
the books_database state bind-mount before bring-up so the conflict
test sees the fresh seed every time (docker compose down -v does not
clean bind mounts).
load_test/run_load.py is a stdlib-only open-loop scheduler with constant / step / spike modes. It posts the standard clean-order payload to /checkout against the running stack, records per-request latencies, writes a CSV under load_test/results/, and prints a per-stage summary. scripts/_cp4_prom_snapshot.py queries the OTEL-LGTM container's Prometheus over the Grafana proxy and dumps the metrics that checkpoint-4-evaluation.md cites (2pc_total by outcome, db_writes_total by path, p50/p95 checkout and 2PC histograms, the in-flight UpDownCounters, kv_store and pending_orders async gauges). The CSV snapshots are the raw data behind the TA-question answers.
docs/checkpoint-4-architecture.md - Mermaid diagram of all 14 services with host/container ports and protocols, plus textual port table. docs/checkpoint-4-evaluation.md - the four TA pre-flagged questions answered with measured numbers and Prometheus snapshots. HTTP ceiling ~80 RPS, 2PC ceiling ~2.5 sessions/s, bottleneck = single-leader serialized consume_loop in order_executor. docs/checkpoint-4-summary.md - one-page operator/team-lead overview of what changed since CP3, one-command runners for each new feature, and the known limitations (incl. the pre-existing CP3 verifier 18/19 that the team lead should be aware of at demo time). README.md gets a single one-paragraph CP4 callout at the top pointing at the new docs; the CP3 content is untouched.
Why: the original plan was committed before any CP4 work and described forward-looking intent. Per the final-review workflow, the plan now also needs to record the current-state map for each requirement and the single outstanding fix before its implementation commit lands. The audit confirms every BASE rubric item is MET except the Guide14 "Set environment variable: OTEL_METRIC_EXPORT_INTERVAL=1000" rule, which the current compose silently ignores: utils/telemetry.py only reads a non-standard OTEL_METRIC_EXPORT_INTERVAL_MS with default 10000 ms. The next commit closes that gap by honouring the spec-named variable. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Guide14 instructs students to "Set environment variable: OTEL_METRIC_EXPORT_INTERVAL=1000" so meters export every second. The previous telemetry helper only read a non-standard OTEL_METRIC_EXPORT_INTERVAL_MS and ignored the spec-named variable, so operators following Guide14 verbatim got 10 s metric cadence with no warning. Compose did not set the variable on any instrumented service. Why this matters: the demo runs are short (30-60 s) and the dashboards auto-refresh on a 10 s tick. Dropping the export cadence to 1 s makes panel deltas visible during the 'submit checkout' beat of the demo rather than two ticks later. telemetry.py now reads OTEL_METRIC_EXPORT_INTERVAL first (spec name, milliseconds), falls back to OTEL_METRIC_EXPORT_INTERVAL_MS for any in-flight scripts already using the legacy name, then defaults to 10000 ms. docker-compose.yaml sets the spec-named variable to 1000 on the four instrumented services (orchestrator, three executors, three DB replicas, payment_service). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The CP3 verifier's bonus:concurrent-writes test (Phase B) issues raw Write RPCs that set every seeded title to arbitrary values, including 'Designing Data-Intensive Applications' = 205. When pytest tests/e2e ran after the CP3 verifier in the same docker compose lifecycle, test 04 fired 8 concurrent orders against DDIA, all 8 committed (because real stock was 205, not the expected seed of 3), and the assertion `commits <= INITIAL_STOCK` mis-fired with the misleading message "the per-title lock failed". The per-title lock is not regressed. The test's INITIAL_STOCK assumption was just brittle to upstream state contamination. The fix reads live stock via scripts/_cp3_db_probe.py at the start of the test (helper read_stock_quorum in _common.py) and expects exactly min(pre_stock, N_ORDERS) commits and the remainder as aborts. The safety invariant "commits never exceed available stock" is still asserted explicitly. This is the property the per-title lock actually guarantees, regardless of starting inventory. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Temporary file consumed during the PR review; the team lead removes it before merging to master. Captures the readiness verdict, what changed since CP3 (f33f8da), a per-requirement coverage table with reproducible verify steps, the recommended reading order, end-to-end verify commands, weak spots worth scrutinising, known limitations, and open questions to resolve before merge. .gitignore is widened to allow this single review .md and the operator demo script's local-only/ folder. The team-lead .md whitelist is intentional and removable when the file is deleted; the local-only/ exclusion stays. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Audit of tracked files at 2f0f335 to identify stale artefacts that the CP4 evaluation does not need. Baseline verified before discovery: CP4 verifier 4/4 PASS, CP3 verifier 18/19 (the single FAIL pre-dates CP3 submission and is documented). Findings: zero Category 1 (obviously abandoned) files; 18 Category 2 / Other files flagged for review. None deleted by this commit -- the plan's "Files to delete" list is intentionally empty so Phase 4 cannot run unattended. The reviewer marks which rows to delete; the auditor then batches with git rm, re-runs the verifiers, rolls back if any previously-passing check now fails, and pushes a single chore: remove stale files commit. .gitignore widened to allow this single planning .md, consistent with how the other CP4 deliverable docs are exempted. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Audit of 2f0f335 identified 18 tracked files that neither the CP3 nor CP4 verifier reaches. Both verifiers re-run after deletion match the pre-deletion baseline exactly (CP4 4/4 PASS; CP3 18/19 with the same pre-existing bonus:participant-failure-recovery FAIL that dates back to commit 2b12c97 before the CP3 submission). Deleted files (one-sentence rationale each, from docs/checkpoint-4-cleanup-plan.md): orchestrator/tests/test_cp3_execution_only.py -- Phase 8 standalone test that recreates orchestrator with docker-compose.cp3-only.yaml, deleted at 2b12c97 before CP3 submission; not invoked by any verifier. order_executor/tests/test_2pc_end_to_end.py -- Phase 5 standalone 2PC happy-path smoke; superseded by the CP3 verifier's 2pc:valid-commit + convergence checks. payment_service/tests/smoke_test.py -- Phase 4 standalone smoke for individual participant RPCs; superseded by the CP3 verifier's full 2PC checks. test_checkout_empty_items.json -- "Prepared payload" mentioned only in the README poke list; not invoked by any verifier. README sentence updated in this commit to drop the two removed filenames. test_checkout_terms_false.json -- Same as above, terms-not-accepted variant. utils/api/bookstore.yaml -- Course-provided OpenAPI example template; services use the gRPC .proto files in utils/pb instead, not these YAMLs. utils/api/fintech.yaml -- Same as above. utils/api/ridehailing.yaml -- Same as above. docs/diagrams/architecture-diagram.jpg -- CP2-era artefact; the CP4 architecture diagram is docs/checkpoint-4-architecture.md (Mermaid + port table). docs/diagrams/system-flow-diagram.jpg -- CP2-era artefact; not referenced by any current doc. docs/diagrams/leader-election.svg -- CP2-era figure; not referenced by any current doc. docs/diagrams/vector-clocks.svg -- CP2-era figure; not referenced by any current doc. .idea/.gitignore .idea/ds-practice-2026.iml .idea/inspectionProfiles/profiles_settings.xml .idea/misc.xml .idea/modules.xml .idea/vcs.xml -- IntelliJ IDEA project files committed by accident; the repo's .gitignore already lists .idea, so they will not reappear. The CP3 verifier still passes 18/19. The remaining FAIL on bonus:participant-failure-recovery is unchanged from baseline and pre-dates the CP3 submission tip f33f8da; it is out of scope per the audit brief's "don't touch CP3 work" rule. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The rubric (Guide15) grades documentation under "Project organization,
documentation, collaboration" (0.25 pt), and the audit identified
which .md files actually feed that grade vs which were working memos
for the dev that nobody downstream reads.
Kept tracked:
README.md -- everyone's entry point
docs/checkpoint-4-architecture.md -- "Architecture diagram" 0.5 pt
docs/checkpoint-4-evaluation.md -- TA pre-flagged Q answers
docs/checkpoint-4-review.md -- team-lead PR review
docs/checkpoint-4-cleanup-plan.md -- current cleanup PR review
Moved to local-only/ (gitignored):
docs/checkpoint-4-plan.md -- pre-implementation working memo;
the git log + the diff vs CP3
carry the same information for
the team lead.
docs/checkpoint-4-summary.md -- once labelled "Summary for the
team lead", but superseded by
checkpoint-4-review.md which is
the actual team-lead facing
document the audit produced.
Reference updates so no committed .md links to a missing file:
README.md top-of-file note -- drop plan + summary links;
keep architecture + evaluation.
docs/checkpoint-4-architecture.md -- drop the trailing summary link.
docs/checkpoint-4-evaluation.md -- rewrite the §4 closing sentence
that pointed to summary's
"next steps" section.
docs/checkpoint-4-review.md -- drop plan + summary from the
BASE-5 row and from the
Where-to-start reading order.
The cleanup-plan.md's historical references to plan.md / summary.md
are left as-is: they are past-tense audit observations describing
state at 2f0f335 and are part of the audit narrative that the team
lead removes before merging this PR.
CP4 verifier re-run after the moves: 4/4 PASS,
checkpoint4-checks PASSED.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Until now the team lead had two .md files to read: the original CP4
implementation review and the post-implementation cleanup plan. Both
were authored at different points in the branch's life and the
cleanup plan's planning sections (baseline check, required-file
lists, candidate categorisation) are workflow detail the team lead
doesn't need to verify.
This commit folds the actionable content of the cleanup plan into
review.md as a new section 9 "Stale-file cleanup (post-implementation
audit)":
9.1 -- the 18 deletions with one-sentence rationale each, plus the
regression-result statement (CP4 4/4 PASS, CP3 18/19, same
pre-existing FAIL).
9.2 -- the two audit oddities flagged for the team lead's radar
(test_2pc_crash_recovery.py's same-root-cause dependency on
the deleted compose override; the untracked smoke.csv that
shows up in git status from a previous session).
The reviewer's checklist is renumbered to section 10 and gains one
new item -- "section 9 cleanup deletions reviewed". TL;DR is
extended with a third bullet pointing at section 9 so the team lead
sees the cleanup landed without scrolling. Git-log block in
section 2 is brought up to current HEAD so the four housekeeping
commits at the top are visible.
docs/checkpoint-4-cleanup-plan.md is git-removed; the .gitignore
whitelist entry that exempted it from the global *.md exclusion is
also dropped. After this commit the team lead reads exactly one
review file (docs/checkpoint-4-review.md) for the entire branch.
CP4 verifier re-run after the merge: 4/4 PASS,
checkpoint4-checks PASSED.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The README still carried the full CP3 deliverable narrative -- the
"How CP3 requirements are met" rubric table (R1-R6, B1-B3), the
embedded consistency- and commitment-protocol diagrams, the two
bonus write-ups, and the entire §A.1 / §A.2 design-rationale
appendix. That content was for CP3's TAs and is no longer the doc
the page should be advertising.
Rewrite README as a CP4-focused page:
- title is now "Checkpoint 4" only,
- one paragraph describes the system at a glance,
- Quick demo points at scripts/checkpoint4-checks.ps1 (not the
CP3 verifier; that file still exists for the team-lead
regression check but is not the public entry point any more),
- a repository-layout table replaces the rubric mapping as the
"where do I look for X" reference,
- two pointers to the CP4 docs that the rubric grades
(architecture diagram + TA-question evaluation).
CP3 functionality is unchanged: the thirteen services, the verifier
script, the CP3-era tests that the verifier invokes, and the
per-title-lock + B2 recovery code all remain exactly as they were.
Only the README narrative changed.
Two inbound references from CP4 docs that pointed into the deleted
README sections are fixed in the same commit:
- docs/checkpoint-4-architecture.md no longer trails with a
"see README.md for the CP3 design rationale" sentence; that
rationale lives in git history now.
- docs/checkpoint-4-evaluation.md §3 used to defer to
"CP3 §A.1.5 in README.md" when explaining why the audit didn't
run a 1-DB-replica experiment; the relevant one-sentence
rationale (synchronous primary-backup blocks on every live
backup) is inlined so the section stands on its own.
Net diff: 3 files changed, 56 insertions, 209 deletions.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Updated readme
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.