website/templates/benchmarks.html
11 additions & 7 deletions
@@ -31,11 +31,15 @@ <h4>Analysis</h4>
 <div class="docs-content">
 <h1>Benchmark Results</h1>
- <p>7 algorithms compared across 7 datasets (3 SNAP + 4 standard benchmarks). Node classification accuracy using Nearest Centroid classifier on 80/20 train-test split.</p>
+ <p>7 algorithms compared across 7 datasets (3 SNAP downloads + 4 scale-matched synthetic). Node classification accuracy using a Nearest Centroid classifier (zero hyperparameters) on an 80/20 train-test split. Absolute accuracy is deliberately low: we use the simplest possible classifier to isolate embedding quality, not model performance.</p>
+
+ <div class="callout callout-info">
+ <strong>Dataset note:</strong> ego-Facebook, roadNet-CA, and soc-LiveJournal1 are downloaded from SNAP. PPI-large, Flickr, ogbn-arxiv, and Yelp are <strong>scale-matched synthetic graphs</strong> (generated via SBM/Erdős–Rényi models to reproduce node count, edge count, and community structure). They are not the original datasets. See <a href="#methodology">Methodology</a> for details.
+ </div>

 <div class="bench-section" id="viz-accuracy">
 <h2 id="summary">Summary Table</h2>
- <p>Best accuracy per dataset. <span class="best-score">Green</span> = best on that dataset. "—" = excluded (too large or not applicable). Cleora results use <code>whiten=True</code> with auto-tuned iterations for optimal quality.</p>
+ <p>Best accuracy per dataset. <span class="best-score">Green</span> = best on that dataset. "—" = excluded (too large or not applicable). * = uses <code>whiten=True</code> post-processing.</p>
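The evaluation protocol described above (Nearest Centroid with zero hyperparameters, 80/20 train-test split) can be sketched with scikit-learn. The embeddings and labels below are random stand-ins, not the benchmark data, and the class structure is synthetic:

```python
# Sketch of the benchmark's evaluation protocol: Nearest Centroid
# (zero hyperparameters) on a stratified 80/20 train-test split.
# X and y are random stand-ins for real node embeddings and labels.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(0)
n, dim, n_classes = 1000, 64, 5
y = rng.integers(0, n_classes, size=n)
centers = rng.normal(size=(n_classes, dim))   # one centroid per class
X = centers[y] + rng.normal(scale=1.0, size=(n, dim))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

pred = NearestCentroid().fit(X_tr, y_tr).predict(X_te)
print("accuracy:", accuracy_score(y_te, pred))
print("macro F1:", f1_score(y_te, pred, average="macro"))
```

Because the classifier has no hyperparameters, differences in these scores reflect embedding quality rather than tuning effort, which is the point of the protocol.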
 <li><strong>Embedding dimension:</strong> 1024 for all algorithms</li>
 <li><strong>Metrics:</strong> Accuracy, Macro F1, wall-clock time, peak memory delta</li>
 <li><strong>Hardware:</strong> Single CPU core, no GPU</li>
- <li><strong>Datasets:</strong> Benchmarks use synthetically generated graphs (Erdős–Rényi, Barabási–Albert, SBM) with node and edge counts matching named real-world datasets. PPI-large, Flickr, ogbn-arxiv, and Yelp are simulated at matching scale; ego-Facebook, roadNet-CA, and soc-LiveJournal1 are downloaded from SNAP. Synthetic graphs are not the original datasets — they reproduce scale and community structure, not content.</li>
+ <li><strong>Datasets:</strong> ego-Facebook, roadNet-CA, and soc-LiveJournal1 are downloaded from SNAP. PPI-large, Flickr, ogbn-arxiv, and Yelp are <strong>scale-matched synthetic graphs</strong> (generated via Erdős–Rényi, Barabási–Albert, and SBM models to reproduce node count, edge count, and community structure). Synthetic graphs are not the original datasets: they reproduce scale and structure, not content.</li>
 <li><strong>Walk-based params:</strong> num_walks=10, walk_length=20 (Facebook); excluded for larger graphs</li>
 <li><strong>Excluded algorithms:</strong> GraRep and HOPE (require dense n×n matrices, infeasible for 4k+ nodes)</li>
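The scale-matched generation described in the Datasets bullet can be sketched with networkx's stochastic block model generator. The community sizes and edge probabilities below are illustrative only, not the parameters used for the actual benchmark graphs:

```python
# Sketch of a scale-matched synthetic graph via a stochastic block
# model (SBM). Community sizes and edge probabilities here are
# illustrative, not the parameters of the actual benchmark datasets.
import networkx as nx

sizes = [300, 300, 400]        # three planted communities
p_in, p_out = 0.05, 0.005      # intra- vs inter-community edge probability
probs = [[p_in if i == j else p_out for j in range(3)] for i in range(3)]

G = nx.stochastic_block_model(sizes, probs, seed=42)
print(G.number_of_nodes(), G.number_of_edges())

# Ground-truth community labels come from the planted partition,
# which is what the node-classification task then tries to recover.
labels = [G.nodes[v]["block"] for v in G]
```

To match a named dataset, `sizes` and the probabilities would be chosen so the generated node count, edge count, and community structure line up with the published statistics.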
- <strong>Note on negative sampling:</strong> DeepWalk, Node2Vec, NetMF, and GraRep all require negative sampling to approximate random walks. This introduces noise, stochastic variation, and reproducibility issues. Cleora eliminates negative sampling entirely — it computes all walks exactly via a single sparse matrix multiplication.
+ <strong>Cleora vs walk-based methods:</strong> DeepWalk and Node2Vec sample random walks and train a skip-gram model (which uses negative sampling to approximate the softmax). NetMF factorizes the same co-occurrence matrix directly but still requires a negative sampling parameter. Cleora eliminates both walk sampling and skip-gram training entirely: it computes the full walk distribution via matrix powers.
- <h3>Single Matrix Multiplication = All Random Walks</h3>
- <p>One sparse matrix multiplication captures <em>every possible random walk</em> of a given length. No sampling, no noise, no stochastic approximation. This is what makes Cleora deterministic and orders of magnitude faster.</p>
+ <h3>Matrix Powers = All Walk Distributions</h3>
+ <p>Each iteration multiplies the embedding matrix by the sparse transition matrix: M<sup>k</sup> captures <em>the full distribution of all walks of length k</em>. No sampling, no noise, no stochastic approximation. This is what makes Cleora deterministic and orders of magnitude faster.</p>
 </div>
 </div>
 <div class="how-connector scroll-reveal"></div>
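The contrast drawn above (exact walk distributions from matrix powers versus noisy sampled walks) can be illustrated on a toy graph. This sketch is independent of Cleora's API; the graph and walk parameters are illustrative:

```python
# Toy illustration: the k-th power of the row-stochastic transition
# matrix gives the EXACT endpoint distribution of all length-k walks,
# while Monte Carlo walk sampling only approximates it with noise.
import numpy as np

rng = np.random.default_rng(0)

# Adjacency matrix of a small undirected graph (4-cycle plus a chord).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)
T = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix

k, start = 3, 0
# Exact endpoint distribution over ALL length-k walks from `start`.
exact = np.linalg.matrix_power(T, k)[start]

# Monte Carlo estimate from 2000 sampled walks of the same length.
counts = np.zeros(4)
for _ in range(2000):
    node = start
    for _ in range(k):
        node = rng.choice(4, p=T[node])
    counts[node] += 1
estimate = counts / counts.sum()

print("exact:  ", exact)
print("sampled:", estimate)   # close to `exact`, but noisy
```

On a real graph the same exactness is obtained with sparse matrix products, so the full M<sup>k</sup> is never materialized densely.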
@@ -87,8 +87,8 @@ <h2>What Makes Cleora Different</h2>
- <p>Unlike DeepWalk, Node2Vec, and LINE, Cleora doesn't approximate random walks with negative sampling. It computes <strong>all walks exactly</strong> via matrix multiplication. Less noise, higher accuracy, perfect reproducibility.</p>
+ <h3>No Sampling, No Training</h3>
+ <p>Unlike DeepWalk, Node2Vec, and LINE, Cleora eliminates both random walk sampling AND skip-gram training entirely. It captures <strong>all walk distributions exactly</strong> via matrix powers. No noise, perfect reproducibility.</p>
- <p>Same input always produces the same output. No random seeds, no stochastic variation, no "run it 5 times and average" workflows. Critical for reproducible research and production ML pipelines.</p>
+ <p>Same input always produces the same output. Deterministic by default: no stochastic variation, no "run it 5 times and average" workflows. Critical for reproducible research and production ML pipelines.</p>
- <p>The entire library is ~5 MB. Compare: PyTorch Geometric is 500 MB+, DGL is 400 MB+. Cleora ships as a single compiled Rust extension. No CUDA, no cuDNN, no GPU driver headaches.</p>
+ <h3>5 MB, No Heavy Dependencies</h3>
+ <p>The entire library is ~5 MB, with only numpy and scipy as dependencies. Compare: PyTorch Geometric is 500 MB+, DGL is 400 MB+. Cleora ships as a single compiled Rust extension. No CUDA, no cuDNN, no GPU driver headaches.</p>
- <p>197x faster than DeepWalk. No sampling of positive/negative examples. Purely structure-based — iterative weighted averaging of neighbor embeddings + L2 normalization.</p>
+ <p>240x faster than GraphSAGE, 197x faster than DeepWalk (as measured by Zomato). No walk sampling, no skip-gram training. Purely structure-based: iterative weighted averaging of neighbor embeddings + L2 normalization.</p>
 </div>
 </div>
 <div class="flow-arrow scroll-reveal">↓</div>
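The update described above (weighted averaging of neighbor embeddings followed by L2 normalization) can be sketched as follows. This is a simplified stand-in with an assumed random initialization, not Cleora's actual implementation:

```python
# Simplified sketch of an iterative neighbor-averaging embedding
# (an assumed form, NOT Cleora's actual implementation): multiply by
# the row-normalized adjacency, then L2-normalize each embedding row.
import numpy as np
import scipy.sparse as sp

def embed(adj, dim=8, iters=3):
    n = adj.shape[0]
    deg = np.asarray(adj.sum(axis=1)).ravel()
    T = sp.diags(1.0 / np.maximum(deg, 1.0)) @ adj   # row-stochastic
    # Deterministic init: a fixed seed stands in for hashed node features.
    E = np.random.default_rng(0).standard_normal((n, dim))
    for _ in range(iters):
        E = T @ E                                      # average neighbors
        E /= np.linalg.norm(E, axis=1, keepdims=True)  # L2-normalize rows
    return E

# Tiny triangle graph; two runs give bit-identical results.
A = sp.csr_matrix(np.array([[0, 1, 1],
                            [1, 0, 1],
                            [1, 1, 0]], dtype=float))
E1, E2 = embed(A), embed(A)
print(np.allclose(E1, E2))   # deterministic: same input, same output
```

The per-iteration cost is one sparse-dense product, which is what makes the approach fast on a single CPU core.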
@@ -235,10 +235,10 @@ <h2>Trusted in Production Worldwide</h2>
- "Cleora-powered solutions achieved top placements in KDD Cup 2021, WSDM WebTour 2021, and SIGIR eCom 2020 — beating deep learning approaches on travel, e-commerce, and web recommendation benchmarks."
+ Cleora-powered solutions achieved top placements in KDD Cup 2021, WSDM WebTour 2021, and SIGIR eCom 2020, beating deep learning approaches on travel, e-commerce, and web recommendation benchmarks.
- <p>One sparse matrix multiplication captures <em>every possible random walk</em> of a given length. No sampling, no noise — this is the mathematical breakthrough that makes Cleora deterministic and fast.</p>
+ <h3>Matrix Power = All Walk Distributions</h3>
+ <p>Each iteration applies one sparse matrix power: M<sup>k</sup> captures <em>the full distribution of all walks of length k</em>. No sampling, no noise; this is what makes Cleora deterministic and fast.</p>