Commit f1cb240

Merge pull request #79 from BaseModelAI/interwhiten-python-defaults-256-40
Interwhiten python defaults 256 40
2 parents 9bfb69f + 47fc3d3 commit f1cb240

20 files changed: 200 additions & 109 deletions

Cargo.lock

Lines changed: 1 addition & 1 deletion
Generated file; diff not rendered.

Cargo.toml

Lines changed: 1 addition & 1 deletion
````diff
@@ -1,6 +1,6 @@
 [package]
 name = "pycleora"
-version = "3.2.0"
+version = "3.2.1"
 edition = "2018"
 license-file = "LICENSE"
 readme = "README.md"
````

README.md

Lines changed: 12 additions & 9 deletions
````diff
@@ -78,12 +78,14 @@ for r in similar:
     print(f"{r['entity_id']}: {r['similarity']:.4f}")
 ```
 
+`embed()` defaults to `feature_dim=256`, `num_iterations=40`, and whitening after every propagation step.
+
 ### Step-by-Step Example
 
 The high-level `embed()` function wraps the Markov propagation loop for convenience. Here's the full manual version, which gives you complete control over the process:
 
 ```python
-from pycleora import SparseMatrix
+from pycleora import SparseMatrix, whiten_embeddings
 import numpy as np
 import pandas as pd
 import random
@@ -111,6 +113,7 @@ NUM_ITERATIONS = 40
 for i in range(NUM_ITERATIONS):
     embeddings = mat.left_markov_propagate(embeddings)
     embeddings /= np.linalg.norm(embeddings, ord=2, axis=-1, keepdims=True)
+    embeddings = whiten_embeddings(embeddings)
 
 for entity, embedding in zip(mat.entity_ids, embeddings):
     print(entity, embedding)
````
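
For readability, here is the updated README loop assembled end-to-end from the two hunks above. This is a sketch: the `from_files` input path and column spec are not part of this hunk's context and are borrowed from examples/cleora_loop.py.

```python
# Sketch of the full default pipeline after this change, assembled from the
# hunks above. The from_files() path and column spec are assumed (borrowed
# from examples/cleora_loop.py); they are not shown in this hunk.
import numpy as np
from pycleora import SparseMatrix, whiten_embeddings

mat = SparseMatrix.from_files(["graph.tsv"], "complex::reflexive::name")
embeddings = mat.initialize_deterministically(feature_dim=256, seed=0)

NUM_ITERATIONS = 40  # new default iteration count
for i in range(NUM_ITERATIONS):
    embeddings = mat.left_markov_propagate(embeddings)
    embeddings /= np.linalg.norm(embeddings, ord=2, axis=-1, keepdims=True)
    embeddings = whiten_embeddings(embeddings)  # the step this PR adds

for entity, embedding in zip(mat.entity_ids, embeddings):
    print(entity, embedding)
```

With the new defaults, whitening runs inside the loop rather than as a one-off post-processing step.
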
````diff
@@ -155,7 +158,7 @@ Embeddings are stable across runs and support inductive learning: new nodes can
 
 | Algorithm | Type | Description |
 |-----------|------|-------------|
-| **Cleora** | Spectral / Random Walk | Iterative Markov propagation with L2 normalization — all random walks in one matrix multiplication |
+| **Cleora** | Spectral / Random Walk | Iterative Markov propagation with per-iteration whitening — all random walks in one matrix multiplication |
 | **ProNE** | Spectral | Fast spectral propagation with Chebyshev polynomial approximation |
 | **RandNE** | Random Projection | Gaussian random projection for very fast, approximate embeddings |
 | **NetMF** | Matrix Factorization | Network Matrix Factorization — factorizes the DeepWalk matrix explicitly |
@@ -177,13 +180,13 @@ pycleora embed --input graph.tsv --output out.npz --algorithm node2vec
 
 Beyond the standard algorithms, Cleora supports several advanced embedding strategies:
 
-- **Multiscale embeddings** — concatenates embeddings from different iteration depths (e.g. scales `[1, 2, 4, 8]`) to capture both local and global graph structure simultaneously
+- **Multiscale embeddings** — concatenates embeddings from different iteration depths (e.g. scales `[10, 20, 30, 40]`) to capture both local and global graph structure simultaneously
 - **Attention-weighted propagation** — uses softmax-normalized dot-product attention during propagation, dynamically weighting neighbor contributions
 - **Supervised refinement** — fine-tunes unsupervised embeddings using positive/negative entity pairs with a triplet margin loss
 - **Directed graph embeddings** — handles asymmetric relationships where edge direction matters
 - **Weighted graph embeddings** — incorporates edge weights into the propagation step
 - **Node feature integration** — initializes embeddings with external features (text, image, numeric) before propagation
-- **PCA whitening** — built-in ZCA whitening to decorrelate embedding dimensions and improve downstream task performance
+- **PCA whitening** — built-in whitening after every iteration by default to decorrelate embedding dimensions and improve downstream task performance
 
 ---
 
````
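
The removed bullet above named the technique explicitly: ZCA whitening. For orientation, a minimal numpy sketch of what ZCA whitening computes; the library's `whiten_embeddings` is the actual implementation and may differ in details such as centering and regularization.

```python
import numpy as np

def zca_whiten(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Minimal ZCA whitening sketch (not the library implementation)."""
    x = x - x.mean(axis=0, keepdims=True)      # center each dimension
    cov = (x.T @ x) / max(x.shape[0] - 1, 1)   # feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # symmetric eigendecomposition
    # Rotate into the eigenbasis, rescale each direction to unit variance,
    # then rotate back so the result stays close to the original axes.
    w = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return x @ w
```

Per the README bullet, the goal of this step is to decorrelate embedding dimensions and improve downstream task performance.
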

````diff
@@ -312,7 +315,7 @@ See [cleora.ai/use-cases](https://cleora.ai/use-cases) for detailed walkthroughs
 2. **Hypergraph Construction** — Builds a heterogeneous hypergraph where a single edge can connect multiple entities of different types.
 3. **Sparse Markov Matrix** — Constructs a sparse transition matrix (99%+ sparse). Rows normalized so each row sums to 1.
 4. **Single Matrix Multiplication = All Walks** — One sparse matrix multiplication captures *every possible random walk* of a given length. No sampling, no noise.
-5. **L2-Normalized Propagation** — Each iteration replaces every node's embedding with the L2-normalized average of its neighbors. 3-4 iterations for co-occurrence similarity, 7+ for contextual similarity.
+5. **L2-Normalized + Whitened Propagation** — Each iteration replaces every node's embedding with the L2-normalized average of its neighbors and then whitens the embedding space. The default configuration runs 40 iterations at 256 dimensions.
 6. **Embeddings Ready** — Dense, deterministic embedding vectors for every entity. Same input always yields same output.
 
 ---
````

````diff
@@ -343,15 +346,15 @@ A: No, this is a methodologically wrong approach, stemming from outdated matrix
 
 **Q: What embedding dimensionality to use?**
 
-A: The more, the better, but we typically work from _1024_ to _4096_. Memory is cheap and machines are powerful, so don't skimp on embedding size.
+A: The default is **256**. For larger production systems we often work from _1024_ to _4096_, but `256` is the baseline shipped by the library.
 
 **Q: How many iterations of Markov propagation should I use?**
 
-A: Depends on what you want to achieve. Low iterations (3) tend to approximate the co-occurrence matrix, while high iterations (7+) tend to give contextual similarity (think skip-gram but much more accurate and faster).
+A: The default is **40** whitening-enhanced propagation steps. If you want more local, co-occurrence-style behavior you can dial that down manually; higher values bias more toward contextual similarity.
 
 **Q: How do I incorporate external information, e.g. entity metadata, images, texts into the embeddings?**
 
-A: Just initialize the embedding matrix with your own vectors coming from a VIT, sentence-transformers, or a random projection of your numeric features. In that scenario low numbers of Markov iterations (1 to 3) tend to work best.
+A: Just initialize the embedding matrix with your own vectors coming from a VIT, sentence-transformers, or a random projection of your numeric features. In that scenario fewer Markov iterations than the default `40` often work best.
 
 **Q: My embeddings don't fit in memory, what do I do?**
 
@@ -367,7 +370,7 @@ A: Cleora works best for relatively sparse hypergraphs. If all your hyperedges c
 
 **Q: How can Cleora be so fast and accurate at the same time?**
 
-A: Not using negative sampling is a great boon. By constructing the (sparse) Markov transition matrix, Cleora explicitly performs all possible random walks in a hypergraph in one big step (a single matrix multiplication). That's what we call a single _iteration_. We perform 3+ such iterations. Thanks to a highly efficient implementation in Rust, with special care for concurrency, memory layout and cache coherence, it is blazingly fast. Negative sampling or randomly selecting random walks tend to introduce a lot of noise - Cleora is free of those burdens.
+A: Not using negative sampling is a great boon. By constructing the (sparse) Markov transition matrix, Cleora explicitly performs all possible random walks in a hypergraph in one big step (a single matrix multiplication). That's what we call a single _iteration_. The default configuration performs 40 such iterations with whitening after every step. Negative sampling or randomly selecting random walks tend to introduce a lot of noise - Cleora is free of those burdens.
 
 ---
 
````
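
The updated FAQ answer on external information implies deviating from the new 40-iteration default. A short sketch of that scenario, using only calls that appear elsewhere in this diff; the input file, the column spec, and the random stand-in for real feature vectors are assumptions for illustration.

```python
# Sketch of the FAQ's external-feature scenario: start from your own vectors
# and run far fewer iterations than the new default of 40. The input file,
# column spec, and the random stand-in for real features are assumptions.
import numpy as np
from pycleora import SparseMatrix, whiten_embeddings

graph = SparseMatrix.from_files(["graph.tsv"], "complex::reflexive::name")

# Stand-in for real external features (e.g. ViT or sentence-transformer
# vectors), one row per entity, aligned with graph.entity_ids.
embeddings = np.random.randn(len(graph.entity_ids), 256).astype(np.float32)

for i in range(3):  # fewer than the default 40, per the FAQ advice
    embeddings = graph.left_markov_propagate(embeddings)
    embeddings /= np.linalg.norm(embeddings, ord=2, axis=-1, keepdims=True)
    embeddings = whiten_embeddings(embeddings)
```
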

examples/cleora_loop.py

Lines changed: 5 additions & 4 deletions
````diff
@@ -1,22 +1,23 @@
 import time
 
 import numpy as np
-from pycleora import SparseMatrix
+from pycleora import SparseMatrix, whiten_embeddings
 
 start_time = time.time()
 
 # graph = SparseMatrix.from_files(["zaba30_large_5m.tsv"], "basket complex::product", hyperedge_trim_n=16)
 graph = SparseMatrix.from_files(["perf_inputs/0.tsv", "perf_inputs/1.tsv", "perf_inputs/2.tsv", "perf_inputs/3.tsv", "perf_inputs/4.tsv", "perf_inputs/5.tsv", "perf_inputs/6.tsv", "perf_inputs/7.tsv"], "complex::reflexive::name")
 
 print("Entities n", len(graph.entity_ids))
-# embeddings = np.random.randn(len(graph.entity_ids), 128).astype(np.float32)
-embeddings = graph.initialize_deterministically(feature_dim=128, seed=0)
+# embeddings = np.random.randn(len(graph.entity_ids), 256).astype(np.float32)
+embeddings = graph.initialize_deterministically(feature_dim=256, seed=0)
 
-for i in range(3):
+for i in range(40):
     embeddings = graph.left_markov_propagate(embeddings)
     # embeddings = graph.symmetric_markov_propagate(embeddings)
 
     embeddings /= np.linalg.norm(embeddings, ord=2, axis=-1, keepdims=True)
+    embeddings = whiten_embeddings(embeddings)
     print(f"Iter {i} finished")
 
 print(graph.entity_ids[:10])
````

examples/from_iterator.py

Lines changed: 4 additions & 3 deletions
````diff
@@ -1,7 +1,7 @@
 import time
 
 import numpy as np
-from pycleora import SparseMatrix
+from pycleora import SparseMatrix, whiten_embeddings
 
 start_time = time.time()
 
@@ -25,9 +25,10 @@ def edges_iterator():
 
 embeddings = np.random.randn(len(graph.entity_ids), 256).astype(np.float32)
 
-for i in range(3):
+for i in range(40):
     embeddings = graph.left_markov_propagate(embeddings)
     embeddings /= np.linalg.norm(embeddings, ord=2, axis=-1, keepdims=True)
+    embeddings = whiten_embeddings(embeddings)
     print(f"Iter {i} finished")
 
-print(f"Took {time.time() - start_time} seconds ")
\ No newline at end of file
+print(f"Took {time.time() - start_time} seconds ")
````

examples/graph_pickle.py

Lines changed: 5 additions & 3 deletions
````diff
@@ -1,7 +1,7 @@
 import time
 
 import numpy as np
-from pycleora import SparseMatrix
+from pycleora import SparseMatrix, whiten_embeddings
 
 import pickle
 
@@ -21,7 +21,9 @@
 print(graph.entity_ids[:10])
 print(graph_reread.entity_ids[:10])
 
-embeddings = graph_reread.initialize_deterministically(feature_dim=128, seed=0)
+embeddings = graph_reread.initialize_deterministically(feature_dim=256, seed=0)
 embeddings = graph_reread.left_markov_propagate(embeddings)
+embeddings /= np.linalg.norm(embeddings, ord=2, axis=-1, keepdims=True)
+embeddings = whiten_embeddings(embeddings)
 
-print(embeddings)
\ No newline at end of file
+print(embeddings)
````
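
For context, examples/graph_pickle.py exercises pickling a `SparseMatrix` and re-running propagation on the reloaded graph. A minimal sketch of the round-trip this example implies; the graph file, column spec, and the plain `pickle.dump`/`pickle.load` pattern are assumptions.

```python
# Sketch of the pickle round-trip that examples/graph_pickle.py exercises.
# The graph file, column spec, and dump/load pattern are assumptions.
import pickle

import numpy as np
from pycleora import SparseMatrix, whiten_embeddings

graph = SparseMatrix.from_files(["graph.tsv"], "complex::reflexive::name")

with open("graph.pkl", "wb") as f:
    pickle.dump(graph, f)          # persist the parsed hypergraph
with open("graph.pkl", "rb") as f:
    graph_reread = pickle.load(f)  # reload without re-parsing the TSVs

# One propagation step on the reloaded graph, as in the updated example.
embeddings = graph_reread.initialize_deterministically(feature_dim=256, seed=0)
embeddings = graph_reread.left_markov_propagate(embeddings)
embeddings /= np.linalg.norm(embeddings, ord=2, axis=-1, keepdims=True)
embeddings = whiten_embeddings(embeddings)
print(embeddings)
```
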
