website/templates/benchmarks.html
11 additions & 7 deletions
@@ -31,11 +31,15 @@ <h4>Analysis</h4>
 <div class="docs-content">
 <h1>Benchmark Results</h1>
- <p>7 algorithms compared across 7 datasets (3 SNAP + 4 standard benchmarks). Node classification accuracy using Nearest Centroid classifier on 80/20 train-test split.</p>
+ <p>7 algorithms compared across 7 datasets (3 SNAP downloads + 4 scale-matched synthetic). Node classification accuracy using a Nearest Centroid classifier (zero hyperparameters) on an 80/20 train-test split. Absolute accuracy is deliberately low: we use the simplest possible classifier to isolate embedding quality, not model performance.</p>
+
+ <div class="callout callout-info">
+ <strong>Dataset note:</strong> ego-Facebook, roadNet-CA, and soc-LiveJournal1 are downloaded from SNAP. PPI-large, Flickr, ogbn-arxiv, and Yelp are <strong>scale-matched synthetic graphs</strong> (generated via SBM/Erdős–Rényi models to reproduce node count, edge count, and community structure). They are not the original datasets. See <a href="#methodology">Methodology</a> for details.
+ </div>

 <div class="bench-section" id="viz-accuracy">
 <h2 id="summary">Summary Table</h2>
- <p>Best accuracy per dataset. <span class="best-score">Green</span> = best on that dataset. "—" = excluded (too large or not applicable). Cleora results use <code>whiten=True</code> with auto-tuned iterations for optimal quality.</p>
+ <p>Best accuracy per dataset. <span class="best-score">Green</span> = best on that dataset. "—" = excluded (too large or not applicable). * = uses <code>whiten=True</code> post-processing.</p>
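The evaluation protocol described above (Nearest Centroid with zero hyperparameters, 80/20 train-test split) can be sketched with scikit-learn. The embeddings and labels below are random stand-ins, not the benchmark data, and the class structure is synthetic:

```python
# Sketch of the benchmark's evaluation protocol: Nearest Centroid
# (zero hyperparameters) on a stratified 80/20 train-test split.
# X and y are random stand-ins for real node embeddings and labels.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(0)
n, dim, n_classes = 1000, 64, 5
y = rng.integers(0, n_classes, size=n)
centers = rng.normal(size=(n_classes, dim))   # one centroid per class
X = centers[y] + rng.normal(scale=1.0, size=(n, dim))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

pred = NearestCentroid().fit(X_tr, y_tr).predict(X_te)
print("accuracy:", accuracy_score(y_te, pred))
print("macro F1:", f1_score(y_te, pred, average="macro"))
```

Because the classifier has no hyperparameters, differences in these scores reflect embedding quality rather than tuning effort, which is the point of the protocol.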
 <li><strong>Embedding dimension:</strong> 1024 for all algorithms</li>
 <li><strong>Metrics:</strong> Accuracy, Macro F1, wall-clock time, peak memory delta</li>
 <li><strong>Hardware:</strong> Single CPU core, no GPU</li>
- <li><strong>Datasets:</strong> Benchmarks use synthetically generated graphs (Erdős–Rényi, Barabási–Albert, SBM) with node and edge counts matching named real-world datasets. PPI-large, Flickr, ogbn-arxiv, and Yelp are simulated at matching scale; ego-Facebook, roadNet-CA, and soc-LiveJournal1 are downloaded from SNAP. Synthetic graphs are not the original datasets — they reproduce scale and community structure, not content.</li>
+ <li><strong>Datasets:</strong> ego-Facebook, roadNet-CA, and soc-LiveJournal1 are downloaded from SNAP. PPI-large, Flickr, ogbn-arxiv, and Yelp are <strong>scale-matched synthetic graphs</strong> (generated via Erdős–Rényi, Barabási–Albert, and SBM models to reproduce node count, edge count, and community structure). Synthetic graphs are not the original datasets: they reproduce scale and structure, not content.</li>
 <li><strong>Walk-based params:</strong> num_walks=10, walk_length=20 (Facebook); excluded for larger graphs</li>
 <li><strong>Excluded algorithms:</strong> GraRep and HOPE (require dense n×n matrices, infeasible for 4k+ nodes)</li>
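The scale-matched generation described in the Datasets bullet can be sketched with networkx's stochastic block model generator. The community sizes and edge probabilities below are illustrative only, not the parameters used for the actual benchmark graphs:

```python
# Sketch of a scale-matched synthetic graph via a stochastic block
# model (SBM). Community sizes and edge probabilities here are
# illustrative, not the parameters of the actual benchmark datasets.
import networkx as nx

sizes = [300, 300, 400]        # three planted communities
p_in, p_out = 0.05, 0.005      # intra- vs inter-community edge probability
probs = [[p_in if i == j else p_out for j in range(3)] for i in range(3)]

G = nx.stochastic_block_model(sizes, probs, seed=42)
print(G.number_of_nodes(), G.number_of_edges())

# Ground-truth community labels come from the planted partition,
# which is what the node-classification task then tries to recover.
labels = [G.nodes[v]["block"] for v in G]
```

To match a named dataset, `sizes` and the probabilities would be chosen so the generated node count, edge count, and community structure line up with the published statistics.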
- <strong>Note on negative sampling:</strong> DeepWalk, Node2Vec, NetMF, and GraRep all require negative sampling to approximate random walks. This introduces noise, stochastic variation, and reproducibility issues. Cleora eliminates negative sampling entirely — it computes all walks exactly via a single sparse matrix multiplication.
+ <strong>Cleora vs walk-based methods:</strong> DeepWalk and Node2Vec sample random walks and train a skip-gram model (which uses negative sampling to approximate the softmax). NetMF factorizes the same co-occurrence matrix directly but still requires a negative sampling parameter. Cleora eliminates both walk sampling and skip-gram training entirely: it computes the full walk distribution via matrix powers.
- <h3>Single Matrix Multiplication = All Random Walks</h3>
- <p>One sparse matrix multiplication captures <em>every possible random walk</em> of a given length. No sampling, no noise, no stochastic approximation. This is what makes Cleora deterministic and orders of magnitude faster.</p>
+ <h3>Matrix Powers = All Walk Distributions</h3>
+ <p>Each iteration multiplies the embedding matrix by the sparse transition matrix: M<sup>k</sup> captures <em>the full distribution of all walks of length k</em>. No sampling, no noise, no stochastic approximation. This is what makes Cleora deterministic and orders of magnitude faster.</p>
 </div>
 </div>
 <div class="how-connector scroll-reveal"></div>
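The contrast drawn above (exact walk distributions from matrix powers versus noisy sampled walks) can be illustrated on a toy graph. This sketch is independent of Cleora's API; the graph and walk parameters are illustrative:

```python
# Toy illustration: the k-th power of the row-stochastic transition
# matrix gives the EXACT endpoint distribution of all length-k walks,
# while Monte Carlo walk sampling only approximates it with noise.
import numpy as np

rng = np.random.default_rng(0)

# Adjacency matrix of a small undirected graph (4-cycle plus a chord).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)
T = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix

k, start = 3, 0
# Exact endpoint distribution over ALL length-k walks from `start`.
exact = np.linalg.matrix_power(T, k)[start]

# Monte Carlo estimate from 2000 sampled walks of the same length.
counts = np.zeros(4)
for _ in range(2000):
    node = start
    for _ in range(k):
        node = rng.choice(4, p=T[node])
    counts[node] += 1
estimate = counts / counts.sum()

print("exact:  ", exact)
print("sampled:", estimate)   # close to `exact`, but noisy
```

On a real graph the same exactness is obtained with sparse matrix products, so the full M<sup>k</sup> is never materialized densely.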
@@ -87,8 +87,8 @@ <h2>What Makes Cleora Different</h2>
- <p>Unlike DeepWalk, Node2Vec, and LINE, Cleora doesn't approximate random walks with negative sampling. It computes <strong>all walks exactly</strong> via matrix multiplication. Less noise, higher accuracy, perfect reproducibility.</p>
+ <h3>No Sampling, No Training</h3>
+ <p>Unlike DeepWalk, Node2Vec, and LINE, Cleora eliminates both random walk sampling AND skip-gram training entirely. It captures <strong>all walk distributions exactly</strong> via matrix powers. No noise, perfect reproducibility.</p>
- <p>Same input always produces the same output. No random seeds, no stochastic variation, no "run it 5 times and average" workflows. Critical for reproducible research and production ML pipelines.</p>
+ <p>Same input always produces the same output. Deterministic by default: no stochastic variation, no "run it 5 times and average" workflows. Critical for reproducible research and production ML pipelines.</p>
- <p>The entire library is ~5 MB. Compare: PyTorch Geometric is 500 MB+, DGL is 400 MB+. Cleora ships as a single compiled Rust extension. No CUDA, no cuDNN, no GPU driver headaches.</p>
+ <h3>5 MB, No Heavy Dependencies</h3>
+ <p>The entire library is ~5 MB, with only numpy and scipy as dependencies. Compare: PyTorch Geometric is 500 MB+, DGL is 400 MB+. Cleora ships as a single compiled Rust extension. No CUDA, no cuDNN, no GPU driver headaches.</p>
- <p>197x faster than DeepWalk. No sampling of positive/negative examples. Purely structure-based — iterative weighted averaging of neighbor embeddings + L2 normalization.</p>
+ <p>240x faster than GraphSAGE, 197x faster than DeepWalk (as measured by Zomato). No walk sampling, no skip-gram training. Purely structure-based: iterative weighted averaging of neighbor embeddings + L2 normalization.</p>
 </div>
 </div>
 <div class="flow-arrow scroll-reveal">↓</div>
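The update described above (weighted averaging of neighbor embeddings followed by L2 normalization) can be sketched as follows. This is a simplified stand-in with an assumed random initialization, not Cleora's actual implementation:

```python
# Simplified sketch of an iterative neighbor-averaging embedding
# (an assumed form, NOT Cleora's actual implementation): multiply by
# the row-normalized adjacency, then L2-normalize each embedding row.
import numpy as np
import scipy.sparse as sp

def embed(adj, dim=8, iters=3):
    n = adj.shape[0]
    deg = np.asarray(adj.sum(axis=1)).ravel()
    T = sp.diags(1.0 / np.maximum(deg, 1.0)) @ adj   # row-stochastic
    # Deterministic init: a fixed seed stands in for hashed node features.
    E = np.random.default_rng(0).standard_normal((n, dim))
    for _ in range(iters):
        E = T @ E                                      # average neighbors
        E /= np.linalg.norm(E, axis=1, keepdims=True)  # L2-normalize rows
    return E

# Tiny triangle graph; two runs give bit-identical results.
A = sp.csr_matrix(np.array([[0, 1, 1],
                            [1, 0, 1],
                            [1, 1, 0]], dtype=float))
E1, E2 = embed(A), embed(A)
print(np.allclose(E1, E2))   # deterministic: same input, same output
```

The per-iteration cost is one sparse-dense product, which is what makes the approach fast on a single CPU core.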
@@ -235,10 +235,10 @@ <h2>Trusted in Production Worldwide</h2>
- "Cleora-powered solutions achieved top placements in KDD Cup 2021, WSDM WebTour 2021, and SIGIR eCom 2020 — beating deep learning approaches on travel, e-commerce, and web recommendation benchmarks."
+ Cleora-powered solutions achieved top placements in KDD Cup 2021, WSDM WebTour 2021, and SIGIR eCom 2020, beating deep learning approaches on travel, e-commerce, and web recommendation benchmarks.
- <p>One sparse matrix multiplication captures <em>every possible random walk</em> of a given length. No sampling, no noise — this is the mathematical breakthrough that makes Cleora deterministic and fast.</p>
+ <h3>Matrix Power = All Walk Distributions</h3>
+ <p>Each iteration applies one sparse matrix power: M<sup>k</sup> captures <em>the full distribution of all walks of length k</em>. No sampling, no noise; this is what makes Cleora deterministic and fast.</p>