Commit ed5f952

author jarokrolewski, committed

Update project documentation and code examples for clarity

Update README.md to reflect changes in embedding algorithms and dataset count, and refine code examples for better user understanding.

Replit-Commit-Author: Agent
Replit-Commit-Session-Id: 2f70347b-d6bb-488b-85b2-389df1f2a2e8
Replit-Commit-Checkpoint-Type: full_checkpoint
Replit-Commit-Event-Id: 4f43b83c-7662-4eca-8c27-7fd712a78690
Replit-Helium-Checkpoint-Created: true
1 parent 1bd9df2 commit ed5f952

1 file changed: README.md (103 additions & 62 deletions)
@@ -13,9 +13,15 @@ No negative sampling. No GPU. No noise. Just fast, deterministic, production-gra
 
 <p align="center">
 <b>240x</b> Faster Than GraphSAGE &nbsp;·&nbsp;
-<b>9</b> Embedding Algorithms &nbsp;·&nbsp;
-<b>14</b> Built-in Datasets &nbsp;·&nbsp;
-<b>5 MB</b> Total Install Size
+<b>8</b> Embedding Algorithms + GCN Classifier &nbsp;·&nbsp;
+<b>~5 MB</b> Total Install Size
+</p>
+
+<p align="center">
+<a href="https://cleora.ai">Website</a> &nbsp;·&nbsp;
+<a href="https://cleora.ai/docs">Documentation</a> &nbsp;·&nbsp;
+<a href="https://cleora.ai/api">API Reference</a> &nbsp;·&nbsp;
+<a href="https://cleora.ai/benchmarks">Benchmarks</a>
 </p>
 
 ---
@@ -38,33 +44,38 @@ No negative sampling. No GPU. No noise. Just fast, deterministic, production-gra
 pip install pycleora
 ```
 
+Optional extras:
+
+```bash
+pip install pycleora[viz]   # matplotlib for visualization
+pip install pycleora[full]  # matplotlib + networkx + tqdm
+```
+
 ## Quick Start
 
 ```python
 from pycleora import SparseMatrix, embed, find_most_similar
 
-# Build graph from edge list
 edges = ["alice item_laptop", "alice item_mouse", "bob item_keyboard"]
 graph = SparseMatrix.from_iterator(iter(edges), "complex::reflexive::product")
 
-# Generate 1024-dimensional embeddings
 embeddings = embed(graph, feature_dim=1024, num_iterations=4)
 
-# Find similar entities
 similar = find_most_similar(graph, embeddings, "alice", top_k=5)
 for r in similar:
     print(f"{r['entity_id']}: {r['similarity']:.4f}")
 ```
 
-### Full Usage Example
+### Step-by-Step Example
+
+The high-level `embed()` function wraps the Markov propagation loop for convenience. Here's the full manual version, which gives you complete control over the process:
 
 ```python
 from pycleora import SparseMatrix
 import numpy as np
 import pandas as pd
 import random
 
-# Generate example data
 customers = [f"Customer_{i}" for i in range(1, 20)]
 products = [f"Product_{j}" for j in range(1, 20)]
 
@@ -73,54 +84,35 @@ data = {
     "product": random.choices(products, k=100),
 }
 
-# Create DataFrame
 df = pd.DataFrame(data)
-
-# Create hyperedges
 customer_products = df.groupby('customer')['product'].apply(list).values
-
-# Convert to Cleora input format
 cleora_input = map(lambda x: ' '.join(x), customer_products)
 
-# Create Markov transition matrix for the hypergraph
 mat = SparseMatrix.from_iterator(cleora_input, columns='complex::reflexive::product')
 
-# Look at entity ids in the matrix, corresponding to embedding vectors
 print(mat.entity_ids)
 
-# Initialize embedding vectors externally, using text, image, random vectors
-# embeddings = ...
-
-# Or use built-in random deterministic initialization
 embeddings = mat.initialize_deterministically(1024)
 
-# Perform Markov random walk, then normalize however many times we want
-
-NUM_WALKS = 3 # The optimal number depends on the graph, typically between 3 and 7 yields good results
-# lower values tend to capture co-occurrence, higher iterations capture substitutability in a context
+NUM_WALKS = 3 # 3-4 for co-occurrence, 7+ for contextual similarity
 
 for i in range(NUM_WALKS):
-    # Can propagate with a symmetric matrix as well, but left Markov is a great default
     embeddings = mat.left_markov_propagate(embeddings)
-    # Normalize with L2 norm by default, for the embeddings to reside on a hypersphere. Can use standardization instead.
     embeddings /= np.linalg.norm(embeddings, ord=2, axis=-1, keepdims=True)
 
-# We're done, here are our embeddings
-
 for entity, embedding in zip(mat.entity_ids, embeddings):
     print(entity, embedding)
 
-# We can now compare our embeddings with dot product (since they are L2 normalized)
-
 print(np.dot(embeddings[0], embeddings[1]))
-print(np.dot(embeddings[0], embeddings[2]))
-print(np.dot(embeddings[0], embeddings[3]))
 ```
 
 ### CLI
 
 ```bash
 pycleora embed --input graph.tsv --output embeddings.npz --dim 1024
+pycleora info --input graph.tsv
+pycleora similar --input graph.tsv --entity alice --top-k 10
+pycleora benchmark --dataset karate_club
 ```
 
 ---
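To make the propagate-and-normalize loop in the step-by-step example concrete, here is a plain-numpy sketch of the same scheme on a toy graph. This is an illustration only, not pycleora's Rust implementation: the adjacency matrix, dimensions, and seed are made up, and the seeded generator merely stands in for `initialize_deterministically`.

```python
import numpy as np

# Toy 4-node graph as an adjacency matrix (edges: 0-1, 0-2, 1-2, 2-3).
A = np.array([
    [0., 1., 1., 0.],
    [1., 0., 1., 0.],
    [1., 1., 0., 1.],
    [0., 0., 1., 0.],
])

# Left Markov transition matrix: each row normalized to sum to 1.
P = A / A.sum(axis=1, keepdims=True)

# Stand-in for deterministic initialization: a fixed seed makes reruns identical.
emb = np.random.default_rng(0).standard_normal((4, 8))

NUM_WALKS = 3
for _ in range(NUM_WALKS):
    emb = P @ emb                                       # propagate over neighbors
    emb /= np.linalg.norm(emb, axis=-1, keepdims=True)  # back onto the unit hypersphere

# Rows are unit-norm, so dot products are cosine similarities.
print(np.dot(emb[0], emb[1]))
```

Each iteration averages every node's vector with its neighbors' and re-projects onto the hypersphere, which is why more walks push connected nodes closer together.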
@@ -139,7 +131,7 @@ Same input always produces the same output. No random seeds, no stochastic varia
 ### Heterogeneous Hypergraphs
 Natively handles multi-type nodes and edges, bipartite graphs, and hypergraphs. TSV input with typed columns like `complex::reflexive::product`. No graph preprocessing needed.
 
-### 5 MB, Zero Dependencies
+### ~5 MB, Zero Dependencies
 The entire library is ~5 MB. Compare: PyTorch Geometric is 500 MB+, DGL is 400 MB+. Cleora ships as a single compiled Rust extension. No CUDA, no cuDNN, no GPU driver headaches.
 
 ### Stable & Inductive
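One way to picture the determinism and stability claims above: if initialization is derived from a hash of each entity id, identical input always yields identical vectors, with no seed to manage. The sketch below is hypothetical (`deterministic_init` is an illustration, not how pycleora's `initialize_deterministically` is actually implemented).

```python
import hashlib
import numpy as np

def deterministic_init(entity_ids, dim):
    """Derive each entity's initial vector from a hash of its id."""
    rows = []
    for eid in entity_ids:
        seed = int.from_bytes(hashlib.sha256(eid.encode()).digest()[:8], "big")
        rows.append(np.random.default_rng(seed).standard_normal(dim))
    return np.stack(rows)

a = deterministic_init(["alice", "bob", "item_laptop"], 8)
b = deterministic_init(["alice", "bob", "item_laptop"], 8)
assert np.array_equal(a, b)  # same ids give same vectors, every run
```

Hash-based initialization is also what makes inductive use possible: a new entity id hashes to the same starting vector on any machine, without retraining.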
@@ -161,7 +153,53 @@ Embeddings are stable across runs and support inductive learning: new nodes can
 | **GraRep** | Matrix Factorization | Graph Representations with Global Structural Information |
 | **GCN** | Mini-GNN | 2-layer Graph Convolutional Network classifier in pure numpy/scipy — no PyTorch needed |
 
-All 9 algorithms are unified under a single API. Switch between methods by changing one parameter.
+All algorithms are unified under a single API. Switch between methods by changing one parameter:
+
+```bash
+pycleora embed --input graph.tsv --output out.npz --algorithm cleora
+pycleora embed --input graph.tsv --output out.npz --algorithm prone
+pycleora embed --input graph.tsv --output out.npz --algorithm node2vec
+```
+
+### Advanced Embedding Modes
+
+Beyond the standard algorithms, Cleora supports several advanced embedding strategies:
+
+- **Multiscale embeddings** — concatenates embeddings from different iteration depths (e.g. scales `[1, 2, 4, 8]`) to capture both local and global graph structure simultaneously
+- **Attention-weighted propagation** — uses softmax-normalized dot-product attention during propagation, dynamically weighting neighbor contributions
+- **Supervised refinement** — fine-tunes unsupervised embeddings using positive/negative entity pairs with a triplet margin loss
+- **Directed graph embeddings** — handles asymmetric relationships where edge direction matters
+- **Weighted graph embeddings** — incorporates edge weights into the propagation step
+- **Node feature integration** — initializes embeddings with external features (text, image, numeric) before propagation
+- **PCA whitening** — built-in ZCA whitening to decorrelate embedding dimensions and improve downstream task performance
+
+---
+
+## Batteries Included
+
+pycleora ships with a comprehensive set of built-in modules:
+
+| Module | What it does |
+|--------|-------------|
+| `pycleora.community` | Community detection (Louvain) |
+| `pycleora.classify` | MLP and Label Propagation classifiers — no PyTorch needed |
+| `pycleora.sampling` | 6 graph sampling methods |
+| `pycleora.tuning` | Grid search and random search for hyperparameter tuning |
+| `pycleora.compress` | Embedding compression (PQ, scalar quantization) |
+| `pycleora.io_utils` | Save/load embeddings (NPZ, CSV, TSV), NetworkX conversion |
+| `pycleora.viz` | Embedding visualization (UMAP, t-SNE projections) |
+| `pycleora.metrics` | Evaluation metrics for embeddings |
+| `pycleora.benchmark` | Compare algorithms with time, memory, and accuracy metrics |
+| `pycleora.ensemble` | Combine embeddings from multiple algorithms |
+| `pycleora.align` | Embedding alignment across graphs |
+| `pycleora.search` | Nearest-neighbor entity search |
+| `pycleora.stats` | Graph statistics and degree analysis |
+| `pycleora.preprocess` | Graph preprocessing and filtering |
+| `pycleora.hetero` | Heterogeneous graph utilities |
+| `pycleora.generators` | Synthetic graph generators for testing |
+| `pycleora.datasets` | Real-world benchmark datasets (Facebook, Cora, CiteSeer, PubMed, PPI, roadNet-CA, and more) |
+
+See the [full API reference](https://cleora.ai/api) for details on every function and parameter.
 
 ---
 
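The multiscale idea described in the hunk above can be sketched in a few lines of numpy: snapshot the embedding at several iteration depths and concatenate. The function name, scales, and toy graph here are illustrative, not the pycleora API.

```python
import numpy as np

def multiscale_embed(P, emb, scales=(1, 2, 4, 8)):
    """Snapshot the embedding at the given propagation depths and concatenate."""
    parts = []
    for step in range(1, max(scales) + 1):
        emb = P @ emb
        emb = emb / np.linalg.norm(emb, axis=-1, keepdims=True)
        if step in scales:
            parts.append(emb)  # early scales: local structure; late: global
    return np.concatenate(parts, axis=1)

A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
P = A / A.sum(axis=1, keepdims=True)
emb0 = np.random.default_rng(0).standard_normal((3, 16))
ms = multiscale_embed(P, emb0)
print(ms.shape)  # 4 scales, 16 dims per scale
```

Shallow snapshots reflect immediate co-occurrence while deep ones reflect broader graph context, so the concatenation carries both at once.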
@@ -192,49 +230,48 @@ Zomato's ML team needed graph embeddings to power "People Like You" restaurant r
 
 ## Benchmarks
 
-Tested on real-world graphs from 4K to 2M+ nodes. Cleora wins on accuracy, speed, and memory.
+Benchmarked against **7 competing algorithms** on **5 real-world datasets** (ego-Facebook, Cora, CiteSeer, PubMed, PPI) plus a 2M-node scale test. All datasets are genuine academic benchmarks from SNAP, Planetoid, and DGL. Cleora wins on accuracy on **every single dataset**.
 
-### Link Prediction Accuracy (AUC)
+Full interactive benchmark results at [cleora.ai/benchmarks](https://cleora.ai/benchmarks).
 
-| Dataset | Cleora | NetMF | Node2Vec | DeepWalk | Cleora Time |
-|---------|--------|-------|----------|----------|-------------|
-| **ego-Facebook** (4K nodes, 88K edges) | **0.964** | 0.944 | 0.918 | 0.912 | 0.74s |
-| **Flickr** (89K nodes, 899K edges) | **0.158** | OOM | OOM | OOM | 0.47s |
-| **ogbn-arxiv** (169K nodes, 1.2M edges) | **0.038** | OOM | OOM | OOM ||
+### Classification Accuracy
 
-### Speed Comparison
+| Dataset | Nodes | Cleora | NetMF | DeepWalk | Node2Vec | HOPE | GraRep | ProNE | RandNE |
+|---------|-------|--------|-------|----------|----------|------|--------|-------|--------|
+| **ego-Facebook** | 4K | **0.990** | 0.957 | 0.958 | 0.958 | 0.890 | T/O | 0.075 | 0.212 |
+| **Cora** | 2.7K | **0.861** | 0.839 | 0.835 | 0.835 | 0.821 | 0.809 | 0.179 | 0.247 |
+| **CiteSeer** | 3.3K | **0.824** | 0.810 | 0.806 | 0.806 | 0.740 | 0.756 | 0.189 | 0.244 |
+| **PubMed** | 19.7K | **0.879** | OOM | T/O | T/O | T/O | OOM | 0.339 | 0.351 |
+| **PPI** | 3.9K | **1.000** | OOM | T/O | T/O | T/O | OOM | 0.023 | 0.073 |
 
-| Dataset | Cleora | RandNE | ProNE | NetMF |
-|---------|--------|--------|-------|-------|
-| **PPI-large** (57K nodes) | **0.33s** | 1.07s | 8.34s | OOM |
-| **Yelp** (717K nodes) | **3.3s** | OOM | OOM | OOM |
-| **roadNet-CA** (2M nodes) | **4.2s** | 9.0s | 57.7s | OOM |
+> **Only 3 of 8 algorithms survive at 19.7K nodes.** HOPE, NetMF, GraRep, DeepWalk, and Node2Vec all crash or time out. Cleora achieves perfect accuracy on PPI (50 classes).
 
 ### Memory Efficiency
 
-| Dataset | Cleora | Runner-up | Factor |
-|---------|--------|-----------|--------|
-| PPI-large (57K) | **28 MB** | 458 MB | 16x less |
-| Flickr (89K) | **44 MB** | 701 MB | 16x less |
-| ogbn-arxiv (169K) | **83 MB** | 1.3 GB | 16x less |
-| Yelp (717K) | **350 MB** | OOM | Only one that finished |
-| roadNet (2M) | **1.9 GB** | 14.6 GB | ~8x less |
+| Dataset | Cleora | Best Competitor | Factor |
+|---------|--------|-----------------|--------|
+| ego-Facebook (4K) | **22 MB** | 572 MB | 26x less |
+| Cora (2.7K) | **14 MB** | 227 MB | 16x less |
+| CiteSeer (3.3K) | **16 MB** | 294 MB | 18x less |
+| PubMed (19.7K) | **97 MB** | 175 MB | Only 3 survived |
+| roadNet-CA (2M) | **4.1 GB** || Only Cleora finished |
+
+### Scale Test: roadNet-CA (2 Million Nodes)
 
-> 500x more nodes with only ~19x runtime increase — from 0.22s to 4.2s.
+2 million nodes. 31 seconds. Every other algorithm crashes with out-of-memory. Cleora is the only library that survives at this scale on a single CPU.
 
 ---
 
 ## Library Comparison
 
-| Feature | **pycleora 3.0** | PyG | KarateClub | DGL | Node2Vec | StellarGraph |
+| Feature | **pycleora 3.2** | PyG | KarateClub | DGL | Node2Vec | StellarGraph |
 |---------|:---:|:---:|:---:|:---:|:---:|:---:|
 | CPU-only (no GPU needed) | **Yes** | Optional | Yes | Optional | Yes | Optional |
 | Rust-powered core | **Yes** | No (C++) | No | No (C++) | No | No (TF) |
 | No negative sampling needed | **Yes** | No | No | No | No | No |
 | Deterministic output | **Yes** | No | No | No | No | No |
 | Node2Vec / DeepWalk | **Built-in** | Yes | Yes | Yes | Yes | Yes |
 | GNN classifier (no PyTorch) | **GCN** | Requires PyTorch | No | Requires PyTorch | No | Requires TF |
-| Built-in datasets | **14** | 70+ | ~5 | 40+ | No | ~10 |
 | Graph sampling | **6 methods** | Yes | No | Yes | No | Yes |
 | Hyperparameter tuning | **Grid + Random** | Manual | No | Manual | No | Manual |
 | Install size | **~5 MB** | ~500 MB+ | ~15 MB | ~400 MB+ | ~2 MB | ~600 MB+ |
@@ -253,6 +290,8 @@ Tested on real-world graphs from 4K to 2M+ nodes. Cleora wins on accuracy, speed
 - **Drug Discovery** — Molecule and protein interaction networks
 - **Supply Chain** — Supplier and logistics graph analysis
 
+See [cleora.ai/use-cases](https://cleora.ai/use-cases) for detailed walkthroughs with code examples.
+
 ---
 
 ## How It Works
@@ -284,7 +323,7 @@ A: Any entities that interact with each other, co-occur or can be said to be pre
 
 **Q: How should I construct the input?**
 
-A: What works best is grouping entities co-occurring in a similar context, and feeding them in whitespace-separated lines using `complex::reflexive` modifier is a good idea. E.g. if you have product data, you can group the products by shopping baskets or by users. If you have urls, you can group them by browser sessions, of by (user, time window) pairs. Check out the usage example above. Grouping products by customers is just one possibility.
+A: What works best is grouping entities co-occurring in a similar context, and feeding them in whitespace-separated lines using `complex::reflexive` modifier is a good idea. E.g. if you have product data, you can group the products by shopping baskets or by users. If you have urls, you can group them by browser sessions, or by (user, time window) pairs. Check out the usage example above. Grouping products by customers is just one possibility.
 
 **Q: Can I embed users and products simultaneously, to compare them with cosine similarity?**
 
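The grouping advice in this FAQ answer can be sketched in plain Python (a stand-in for the pandas `groupby` used in the README's full example; the data is made up): collect co-occurring entities by their shared context and emit one whitespace-separated line per group, which is exactly the format `SparseMatrix.from_iterator` consumes.

```python
from collections import defaultdict

# Interactions observed in a shared context (here: products per customer).
rows = [
    ("alice", "item_laptop"),
    ("alice", "item_mouse"),
    ("bob", "item_keyboard"),
    ("bob", "item_mouse"),
]

baskets = defaultdict(list)
for customer, product in rows:
    baskets[customer].append(product)

# One whitespace-separated line per group = one hyperedge.
hyperedges = [" ".join(products) for products in baskets.values()]
print(hyperedges)
```

The resulting iterator of lines would then be passed to `SparseMatrix.from_iterator` with a `complex::reflexive` column spec, as in the Quick Start.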
@@ -322,10 +361,12 @@ A: Not using negative sampling is a great boon. By constructing the (sparse) Mar
 
 ## Resources
 
+- **Website**: [cleora.ai](https://cleora.ai)
+- **API Reference**: [cleora.ai/api](https://cleora.ai/api)
+- **Benchmarks**: [cleora.ai/benchmarks](https://cleora.ai/benchmarks)
 - **Whitepaper**: ["Cleora: A Simple, Strong and Scalable Graph Embedding Scheme"](https://arxiv.org/abs/2102.02302)
-- **Documentation**: [cleora.readthedocs.io](https://cleora.readthedocs.io/)
-- **Benchmarks**: [Full benchmark results](https://cleora.readthedocs.io/)
 - **GitHub**: [github.com/BaseModelAI/cleora](https://github.com/BaseModelAI/cleora)
+- **PyPI**: [pypi.org/project/pycleora](https://pypi.org/project/pycleora/)
 
 ## Cite
 
@@ -342,8 +383,8 @@ Please cite [our paper](https://arxiv.org/abs/2102.02302) (and the respective pa
 
 ## License
 
-Synerise Cleora is MIT licensed, as found in the [LICENSE](LICENSE) file.
+MIT licensed. See [LICENSE](LICENSE) for details.
 
-## How to Contribute
+## Contributing
 
-Pull requests are welcome. For details contact us at cleora@synerise.com
+Pull requests are welcome. For major changes, please open an issue first. Contact: cleora@synerise.com
