for r in similar:
    print(f"{r['entity_id']}: {r['similarity']:.4f}")
```
`embed()` defaults to `feature_dim=256` and `num_iterations=40`, and applies whitening after every propagation step.
### Step-by-Step Example
The high-level `embed()` function wraps the Markov propagation loop for convenience. Here's the full manual version, which gives you complete control over the process:
```python
from pycleora import SparseMatrix, whiten_embeddings
```

Beyond the standard algorithms, Cleora supports several advanced embedding strategies:
- **Multiscale embeddings** — concatenates embeddings from different iteration depths (e.g. scales `[10, 20, 30, 40]`) to capture both local and global graph structure simultaneously
- **Supervised refinement** — fine-tunes unsupervised embeddings using positive/negative entity pairs with a triplet margin loss
- **Directed graph embeddings** — handles asymmetric relationships where edge direction matters
- **Weighted graph embeddings** — incorporates edge weights into the propagation step
- **Node feature integration** — initializes embeddings with external features (text, image, numeric) before propagation
- **PCA whitening** — built-in whitening after every iteration by default to decorrelate embedding dimensions and improve downstream task performance
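The whitening idea is easy to illustrate outside the library. Below is a minimal numpy sketch of ZCA-style whitening over a plain 2-D embedding array; it is not pycleora's implementation, just the decorrelation effect that its built-in whitening step is meant to achieve:

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """Illustrative ZCA whitening: decorrelate embedding dimensions.

    A numpy sketch of the idea, not pycleora's actual implementation.
    """
    X = X - X.mean(axis=0)                      # center each dimension
    cov = X.T @ X / (X.shape[0] - 1)            # dimension covariance
    vals, vecs = np.linalg.eigh(cov)            # symmetric eigendecomposition
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T  # ZCA transform
    return X @ W

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 8)) @ rng.normal(size=(8, 8))  # correlated dims
white = zca_whiten(emb)
cov = np.cov(white, rowvar=False)
# covariance of the whitened embeddings is close to the identity matrix
```

After whitening, no dimension dominates and no pair of dimensions carries redundant information, which is the property the downstream-task improvement relies on.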
---
See [cleora.ai/use-cases](https://cleora.ai/use-cases) for detailed walkthroughs.
2. **Hypergraph Construction** — Builds a heterogeneous hypergraph where a single edge can connect multiple entities of different types.
3. **Sparse Markov Matrix** — Constructs a sparse transition matrix (99%+ sparse). Rows are normalized so each row sums to 1.
4. **Single Matrix Multiplication = All Walks** — One sparse matrix multiplication captures *every possible random walk* of a given length. No sampling, no noise.
5. **L2-Normalized + Whitened Propagation** — Each iteration replaces every node's embedding with the L2-normalized average of its neighbors and then whitens the embedding space. The default configuration runs 40 iterations at 256 dimensions.
6. **Embeddings Ready** — Dense, deterministic embedding vectors for every entity. Same input always yields same output.
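The propagation core of steps 3–5 (minus whitening) can be sketched in a few lines of numpy. The toy graph and array names here are illustrative, not pycleora's actual API:

```python
import numpy as np

# Toy adjacency for a 4-node undirected graph (illustrative only)
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Step 3: row-normalize into a Markov transition matrix (each row sums to 1)
M = A / A.sum(axis=1, keepdims=True)

# Deterministic init: same input always yields the same embeddings
rng = np.random.default_rng(42)
E = rng.standard_normal((4, 8))

# Step 5: each iteration averages neighbor embeddings (one matmul captures
# all one-step walks at once), then L2-normalizes every row
for _ in range(3):
    E = M @ E
    E /= np.linalg.norm(E, axis=1, keepdims=True)
```

In the real library the matrix is sparse and the loop also whitens after every step; this dense sketch only shows the shape of the computation.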
---
**Q: What embedding dimensionality should I use?**
A: The default is **256**. For larger production systems we often work from _1024_ to _4096_, but `256` is the baseline shipped by the library.
**Q: How many iterations of Markov propagation should I use?**
A: The default is **40** whitening-enhanced propagation steps. If you want more local, co-occurrence-style behavior you can dial that down manually; higher values bias more toward contextual similarity.
**Q: How do I incorporate external information, e.g. entity metadata, images, or text, into the embeddings?**
A: Just initialize the embedding matrix with your own vectors coming from a ViT, sentence-transformers, or a random projection of your numeric features. In that scenario, fewer Markov iterations than the default `40` often work best.
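A numpy sketch of that recipe, with made-up numeric features and a toy chain graph standing in for real metadata and a real hypergraph (none of this is pycleora's API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numeric metadata for 5 entities (3 raw features each)
features = rng.normal(size=(5, 3))

# Random projection of the features up to the embedding width
proj = rng.normal(size=(3, 16)) / np.sqrt(3)
E = features @ proj

# Illustrative row-stochastic transition matrix for a 5-node chain graph
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
M = A / A.sum(axis=1, keepdims=True)

# A couple of Markov iterations mixes graph structure into the
# feature-derived starting point (far fewer than the default 40)
for _ in range(2):
    E = M @ E
    E /= np.linalg.norm(E, axis=1, keepdims=True)
```

With many iterations the feature signal would wash out toward pure graph structure, which is why a low iteration count is the better fit here.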
**Q: My embeddings don't fit in memory, what do I do?**
**Q: How can Cleora be so fast and accurate at the same time?**
A: Not using negative sampling is a great boon. By constructing the (sparse) Markov transition matrix, Cleora explicitly performs all possible random walks in a hypergraph in one big step (a single matrix multiplication). That's what we call a single _iteration_. The default configuration performs 40 such iterations, with whitening after every step. Negative sampling or randomly selecting random walks tends to introduce a lot of noise - Cleora is free of those burdens.
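The "all walks in one matmul" claim is easy to check on a toy transition matrix: raising a row-stochastic matrix to the k-th power accounts for every k-step walk between every pair of nodes at once, and each row remains a probability distribution. A small numpy illustration (not pycleora code):

```python
import numpy as np

# Row-stochastic transition matrix for a small toy graph (illustrative)
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
M = A / A.sum(axis=1, keepdims=True)

# M^k holds the probability of *every* k-step random walk between every
# pair of nodes -- nothing is sampled, so there is no sampling noise
Mk = np.linalg.matrix_power(M, 40)
```

This determinism is also why the same input always yields the same embeddings.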