ralu2004/Local-File-Search-System

Local File Search System

A local file search engine that indexes files on your machine and enables fast full-text and metadata search, with a CLI, an HTTP API, and a React-based web UI.


Build

cd application
mvn package

The built jar will be at application/target/application-1.0-SNAPSHOT.jar.


Usage

Index a directory

java -jar application/target/application-1.0-SNAPSHOT.jar index <directory>

Options:

| Option | Default | Description |
| --- | --- | --- |
| `--db <path>` | `.searchengine/index.db` | Custom database path |
| `-i, --ignore <pattern>` | | Glob pattern to ignore (repeatable) |
| `--max-file-size <MB>` | 10 | Skip files larger than this |
| `--preview-lines <n>` | 3 | Number of preview lines to store |
| `--batch-size <n>` | 250 | Number of files per DB batch write |

Examples:

# Basic index
java -jar ... index C:\Users\user\Documents

# With ignore rules
java -jar ... index C:\Users\user\Documents -i "*.log" -i "backup"

# Tune DB batch writes
java -jar ... index C:\Users\user\Documents --batch-size 500

# Custom database path
java -jar ... --db C:\myindex\index.db index C:\Users\user\Documents

Search

java -jar application/target/application-1.0-SNAPSHOT.jar search "<query>"

Options:

| Option | Default | Description |
| --- | --- | --- |
| `--db <path>` | `.searchengine/index.db` | Custom database path |
| `--limit <n>` | 50 | Maximum number of results |

Query syntax:

| Query | Meaning |
| --- | --- |
| `getting started` | Full-text search |
| `README.md` | Search by filename |
| `content:hello` | Restrict full-text match to file contents |
| `path:src/main` | Filter by path substring (cross-platform) |
| `ext:java` | Filter by extension |
| `modified:2025-01-01` | Files modified after date |
| `size:1048576` | Files larger than size in bytes |
| `size:10kb`, `size:5mb`, `size:1gb` | Size filter with units (case-insensitive) |
| `color:red` | Filter images by dominant color (red, blue, green, yellow, ...) |
| `sort:date`, `sort:alpha`, `sort:balanced`, `sort:behavior` | Choose ranking strategy |
| `config ext:json` | Combined full-text and metadata |

Qualifiers can appear in any order and are combined with AND, including repeated qualifiers (for example, two content: filters). In the CLI, sorting is part of the query (sort:<mode>); there is no separate --sort flag.

Examples:

java -jar ... search "getting started"
java -jar ... search "ext:java"
java -jar ... search "README.md"
java -jar ... search "size:10mb"
java -jar ... search "color:red"
java -jar ... search "config ext:json" --limit 10
java -jar ... search "auth path:src/main sort:date"
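
The size: qualifier accepts either a raw byte count or a value with a unit suffix. The sketch below shows one way such values could be converted to bytes. It is not the project's parser: the class and method names are hypothetical, and it assumes binary multiples (1 kb = 1024 bytes), which the README does not state.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not the project's actual QueryParser code.
final class SizeQualifierSketch {
    private SizeQualifierSketch() {}

    // Converts "1048576", "10kb", "5MB" or "1gb" into a byte count.
    // Assumption: units are binary multiples (1 kb = 1024 bytes).
    static long parseSizeBytes(String raw) {
        String value = raw.trim().toLowerCase(Locale.ROOT);
        long multiplier = 1L;
        if (value.endsWith("kb")) {
            multiplier = 1024L;
        } else if (value.endsWith("mb")) {
            multiplier = 1024L * 1024L;
        } else if (value.endsWith("gb")) {
            multiplier = 1024L * 1024L * 1024L;
        }
        // Drop the two-letter unit suffix when one was recognised.
        String digits = multiplier == 1L ? value : value.substring(0, value.length() - 2);
        // multiplyExact fails loudly on overflow instead of wrapping silently.
        return Math.multiplyExact(Long.parseLong(digits.trim()), multiplier);
    }

    public static void main(String[] args) {
        System.out.println(parseSizeBytes("10kb")); // 10240
        System.out.println(parseSizeBytes("5MB"));  // 5242880
    }
}
```

Lower-casing the input before checking the suffix is what makes the units case-insensitive, matching the behaviour described in the query syntax table.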

Web UI

The project includes a React frontend that talks to an HTTP API server.

Start the API server:

java -jar application/target/application-1.0-SNAPSHOT.jar server

The server listens on http://localhost:7070 by default. To use a different port, pass it as the next argument: ... server 8080.

Run the frontend:

cd frontend
npm install
npm run dev

The dev server prints the URL it's listening on. The frontend sends requests to http://localhost:7070/api/*.

The UI exposes both indexing and search workflows: configure a root directory and ignore rules, run indexing with live progress, then search with sort-mode selection (default, balanced, date, alphabetical, personalized). Results show file metadata, content previews with query-term highlighting, and a "Mark as opened" action that feeds personalized ranking. Image results display an inline thumbnail instead of a text preview. When the personalized sort is active, results display ranking insights describing why each result scored where it did.

Context-aware widgets appear below the search bar when results match certain patterns:

  • View as Gallery: activates when the majority of results are images and displays them in a grid via a secure localfile:// Electron protocol.
  • Analyze Logs: activates when results are predominantly log files.
  • Copy Folder Path: activates when all results share a directory.
  • Export File List: always available; saves result paths to a text file.


Default ignore rules

The crawler always ignores common system/build directories (for example node_modules, target, build, dist, .git, .idea, AppData, Program Files, Windows) and also ignores hidden files/directories and non-text files. You can add more rules with -i/--ignore.
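
As an illustration of how user-supplied -i/--ignore globs could be applied, the sketch below uses the JDK's built-in glob matcher. The class name is hypothetical, and matching each pattern against every path segment (so that "backup" excludes a whole directory) is an assumption about the crawler's behaviour, not something the README confirms.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of user-defined ignore rules; not the project's crawler.
final class IgnoreRulesSketch {
    private final List<PathMatcher> matchers = new ArrayList<>();

    IgnoreRulesSketch(List<String> globPatterns) {
        for (String pattern : globPatterns) {
            // "glob:" selects the JDK's glob syntax for the matcher.
            matchers.add(FileSystems.getDefault().getPathMatcher("glob:" + pattern));
        }
    }

    // Assumption: a path is ignored when any of its segments matches any pattern.
    boolean isIgnored(Path path) {
        for (Path segment : path) { // iterates directory and file name elements
            for (PathMatcher matcher : matchers) {
                if (matcher.matches(segment)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With the patterns "*.log" and "backup", this sketch would skip both logs/app.log and anything under a backup directory, while leaving src/Main.java alone.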


Incremental indexing verification

To verify that only changed files are re-indexed, follow these steps:

  1. Run an initial index on a test directory.
  2. Modify one indexed text file, add one new file, and delete one existing indexed file.
  3. Run indexing again on the same directory.
  4. Check the report:
    • Skipped should include unchanged files.
    • Indexed should reflect only changed/new files.
    • Deleted should include files removed from disk.
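
The three report categories above can be pictured as a comparison between what the index already knows and what is currently on disk. The sketch below is a simplified model with hypothetical names. It assumes change detection by last-modified timestamp; the README does not say which signal the indexer actually uses.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the Skipped / Indexed / Deleted report; not the project's code.
final class IncrementalDiffSketch {
    final Set<String> skipped = new TreeSet<>();
    final Set<String> indexed = new TreeSet<>();
    final Set<String> deleted = new TreeSet<>();

    // Both maps go from file path to last-modified timestamp (assumed change signal).
    static IncrementalDiffSketch compare(Map<String, Long> previouslyIndexed,
                                         Map<String, Long> onDisk) {
        IncrementalDiffSketch report = new IncrementalDiffSketch();
        for (Map.Entry<String, Long> file : onDisk.entrySet()) {
            Long storedTimestamp = previouslyIndexed.get(file.getKey());
            if (storedTimestamp != null && storedTimestamp.equals(file.getValue())) {
                report.skipped.add(file.getKey());   // unchanged since the last run
            } else {
                report.indexed.add(file.getKey());   // new or modified
            }
        }
        for (String path : previouslyIndexed.keySet()) {
            if (!onDisk.containsKey(path)) {
                report.deleted.add(path);            // removed from disk
            }
        }
        return report;
    }
}
```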

Testing

Run all automated tests:

cd application
mvn test

Current suite covers:

  • Query parsing: full-text, filename, metadata, mixed input, and size unit parsing (bytes, kb, mb, gb)
  • Search behavior: recursive traversal, single-word and multi-word full-text search
  • Metadata filters: ext, modified, size (including unit forms), path, content, color
  • Query decorator pipeline: sanitization (stripping, whitespace normalisation), synonym expansion (known shorthands, filter transparency, case insensitivity), FTS5 logic (wildcard placement, operator passthrough, phrase preservation), and full chain composition
  • Runtime indexing options: ignoreRules, maxFileSizeMb, previewLines, batchSize
  • Indexing lifecycle: background progress snapshot and final report
  • Parallel indexing: Producer-Consumer correctness — all files indexed, no duplicates, writer thread isolation
  • Image indexing: dominant color extraction, color: filter integration
  • Resilience: database failure propagation, unreadable files, and symlink-loop environments (platform-dependent skip)
  • Incremental indexing: unchanged-file skip, modified-file update, and deleted-file cleanup
  • Ranking strategies: resolver mapping, swappable strategy selection, behavior score formula (frequency, recency, position lift), ranking insight formatting (relative time, lift threshold)
  • Search activity: history recording, suggestion prefix matching, recent-query ordering
  • Widget activation: each rule tested in isolation (threshold boundaries, extension grouping, null handling); factory-level tests for ordering, custom rule injection, and empty/null guards

Typical output should report all tests passing, with one optional skipped test on platforms that cannot create symlinks.


Personalized ranking

The default ranking favors content relevance and path features. The personalized ranking strategy (sort:behavior, or "Personalized" in the UI) reorders results based on the user's interaction history with similar queries. It uses three signals:

  • Frequency: how often the file has been opened for the same normalized query
  • Recency: how recently it was opened (exponential decay with a 7-day half-life)
  • Position lift: whether the user typically had to "dig" past higher-ranked results to reach this file (an opened file consistently found at position 8 ranks higher than one always found at position 1, holding other factors equal)

Results fall into two buckets: files with any open history come first, ordered by behavior score; files without history follow, ordered by full-text relevance. When the personalized sort is active, the UI shows insights under each result explaining the ranking ("you've opened this 5 times for similar searches", "last opened 2 hours ago", "you often find this past higher-ranked results").

The UI also surfaces query suggestions based on prefix matches against search history, and recent unique queries — both fed by the same activity tracking that drives personalized ranking.
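
To make the three signals concrete, here is a minimal sketch of how they could combine into one score. Only the 7-day half-life comes from this README. The weights, the frequency cap, the position-lift scaling, and all names are assumptions chosen for illustration and are not the values used by BehaviorScoreFormula.

```java
// Hypothetical sketch; only HALF_LIFE_DAYS is taken from the README.
final class BehaviorScoreSketch {
    static final double HALF_LIFE_DAYS = 7.0; // stated in the README
    static final int FREQUENCY_CAP = 10;      // assumption
    static final double POSITION_SPAN = 9.0;  // assumption: lift saturates at position 10
    static final double W_FREQUENCY = 0.5;    // assumption
    static final double W_RECENCY = 0.3;      // assumption
    static final double W_POSITION = 0.2;     // assumption

    private BehaviorScoreSketch() {}

    // Open count scaled to [0, 1] and capped so heavy use cannot dominate.
    static double frequency(int openCount) {
        return Math.min(Math.max(openCount, 0), FREQUENCY_CAP) / (double) FREQUENCY_CAP;
    }

    // Exponential decay: the value halves every HALF_LIFE_DAYS.
    static double recency(double daysSinceLastOpen) {
        return Math.pow(0.5, Math.max(daysSinceLastOpen, 0.0) / HALF_LIFE_DAYS);
    }

    // 0 for files found at position 1, rising to 1 for files found deep in the list.
    static double positionLift(double averagePosition) {
        return Math.min(1.0, Math.max(0.0, (averagePosition - 1.0) / POSITION_SPAN));
    }

    static double score(int openCount, double daysSinceLastOpen, double averagePosition) {
        return W_FREQUENCY * frequency(openCount)
                + W_RECENCY * recency(daysSinceLastOpen)
                + W_POSITION * positionLift(averagePosition);
    }
}
```

Under these assumed weights, a file opened five times yesterday and usually found at position 8 scores higher than one with the same history found at position 1, which mirrors the position-lift example above.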


Relevant features

Multimodal search

The Extractor dispatches to a registered list of FileProcessingStrategy implementations. TextFileStrategy handles text files, and ImageFileStrategy extracts the dominant color from images using HSB hue bucketing. The Extractor(List<FileProcessingStrategy>) constructor lets future file types (PDF, audio) be added without modifying Extractor itself.
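
Hue bucketing can be sketched as follows: compute each pixel's hue, map it to a named range, and report the most frequent bucket. This is an illustration only. The bucket boundaries, the four-color set, and the class name are assumptions, and the hue is computed by hand to keep the example dependency-free; ImageFileStrategy may work differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of dominant-color extraction via hue bucketing.
final class DominantColorSketch {
    private DominantColorSketch() {}

    // Standard RGB-to-hue conversion, returning degrees in [0, 360).
    static float hueDegrees(int r, int g, int b) {
        int max = Math.max(r, Math.max(g, b));
        int min = Math.min(r, Math.min(g, b));
        if (max == min) {
            return 0f; // achromatic pixel; a real implementation would likely skip these
        }
        float delta = max - min;
        float hue;
        if (max == r) {
            hue = ((g - b) / delta) % 6f;
        } else if (max == g) {
            hue = (b - r) / delta + 2f;
        } else {
            hue = (r - g) / delta + 4f;
        }
        hue *= 60f;
        return hue < 0f ? hue + 360f : hue;
    }

    // Assumed bucket boundaries; deliberately coarse.
    static String hueBucket(int rgb) {
        float degrees = hueDegrees((rgb >> 16) & 0xFF, (rgb >> 8) & 0xFF, rgb & 0xFF);
        if (degrees < 30f || degrees >= 300f) return "red";
        if (degrees < 90f) return "yellow";
        if (degrees < 180f) return "green";
        return "blue";
    }

    // Returns the most frequent bucket, or null when there are no pixels.
    static String dominantColor(int[] pixels) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int pixel : pixels) {
            counts.merge(hueBucket(pixel), 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```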

Query pre-processor pipeline

After QueryParser produces a Query object, a QueryDecorator chain transforms it before execution: SanitizationDecorator strips FTS5-breaking characters, SynonymDecorator expands shorthand terms using a configurable synonyms.properties file, and LogicDecorator appends FTS5 prefix wildcards to the last eligible token. Decorators are interchangeable wrappers — each is independently testable and the chain is composable without modifying the core search logic.
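
The three stages can be sketched as a chain of small transformations. This example works on plain strings rather than the project's Query object, and every detail is an assumption made for illustration: the set of stripped characters, the "term OR synonym" expansion format, and the rule that only the final token is considered for the wildcard.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical string-based sketch of the sanitize -> synonyms -> wildcard chain.
final class QueryPipelineSketch {
    private static final List<String> OPERATORS = Arrays.asList("AND", "OR", "NOT");

    private QueryPipelineSketch() {}

    // Strips a few characters FTS5 treats as syntax (assumed set) and normalises whitespace.
    static String sanitize(String query) {
        return query.replaceAll("[()^{}\\[\\]]", " ").trim().replaceAll("\\s+", " ");
    }

    // Expands known shorthands, e.g. "js" becomes "js OR javascript" (assumed format).
    static String expandSynonyms(String query, Map<String, String> synonyms) {
        StringBuilder out = new StringBuilder();
        for (String token : query.split(" ")) {
            String expansion = synonyms.get(token.toLowerCase(Locale.ROOT));
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(expansion == null ? token : token + " OR " + expansion);
        }
        return out.toString();
    }

    // Appends a prefix wildcard to the final token unless it is an operator,
    // already has a wildcard, or belongs to a quoted phrase.
    static String addPrefixWildcard(String query) {
        String last = query.substring(query.lastIndexOf(' ') + 1);
        boolean eligible = !last.isEmpty() && !OPERATORS.contains(last)
                && !last.endsWith("*") && !last.contains("\"");
        return eligible ? query + "*" : query;
    }

    static String run(String query, Map<String, String> synonyms) {
        return addPrefixWildcard(expandSynonyms(sanitize(query), synonyms));
    }
}
```

Because each stage is a pure function of its input, each can be tested on its own and the order of composition stays visible in run.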

Widget factory

WidgetActivator holds an ordered registry of WidgetActivationRule strategies. Each rule encapsulates one activation condition and returns an Optional<Widget>. The factory iterates the registry, collects non-empty results, and returns the widget strip in registration order. Adding a new widget type requires implementing one interface and registering one instance — no existing code changes.
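
A minimal sketch of this registry pattern follows. Widgets are represented as plain strings, the interface and class names are hypothetical, and both the "more than half" image threshold and the empty-result guard are assumptions rather than the project's actual rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// One activation condition; returns a widget label when the rule fires.
interface WidgetRuleSketch {
    Optional<String> evaluate(List<String> resultPaths);
}

// Hypothetical ordered registry of rules; not the project's WidgetActivator.
final class WidgetActivatorSketch {
    private final List<WidgetRuleSketch> rules = new ArrayList<>();

    WidgetActivatorSketch register(WidgetRuleSketch rule) {
        rules.add(rule);
        return this;
    }

    // Collects non-empty rule results in registration order.
    List<String> activate(List<String> resultPaths) {
        List<String> widgets = new ArrayList<>();
        if (resultPaths == null || resultPaths.isEmpty()) {
            return widgets; // assumption: no widgets without results
        }
        for (WidgetRuleSketch rule : rules) {
            rule.evaluate(resultPaths).ifPresent(widgets::add);
        }
        return widgets;
    }

    static WidgetActivatorSketch withDefaultRules() {
        return new WidgetActivatorSketch()
                .register(paths -> {
                    long images = paths.stream().filter(WidgetActivatorSketch::isImage).count();
                    // Assumed threshold: strictly more than half of the results are images.
                    return images * 2 > paths.size()
                            ? Optional.of("View as Gallery")
                            : Optional.<String>empty();
                })
                .register(paths -> Optional.of("Export File List"));
    }

    private static boolean isImage(String path) {
        String lower = path.toLowerCase(Locale.ROOT);
        return lower.endsWith(".png") || lower.endsWith(".jpg") || lower.endsWith(".jpeg");
    }
}
```

Adding another widget in this sketch means calling register with one more rule; activate itself does not change.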

Producer-Consumer indexing

Indexer runs a bounded BlockingQueue between a reader thread pool (one task per file) and a single IndexWriter consumer thread. The queue capacity is capped at min(poolSize * 4, 256) to bound heap pressure. A 30-minute timeout on reader pool shutdown prevents indefinite hangs on stuck extractors.
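
The shape of this pipeline can be sketched with the standard java.util.concurrent building blocks. The queue capacity formula and the 30-minute timeout come from the description above. The end-of-stream marker (an empty Optional), the string payloads, and the class name are assumptions for illustration, since the README does not describe how the writer is told to stop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the reader-pool / single-writer layout; not the project's Indexer.
final class ProducerConsumerSketch {
    private ProducerConsumerSketch() {}

    static List<String> indexAll(List<String> files, int poolSize) {
        int capacity = Math.min(poolSize * 4, 256); // bound from the README
        BlockingQueue<Optional<String>> queue = new ArrayBlockingQueue<>(capacity);
        List<String> written = new ArrayList<>();   // touched only by the writer thread

        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    Optional<String> item = queue.take();
                    if (!item.isPresent()) {
                        return;                      // assumed end-of-stream marker
                    }
                    written.add(item.get());         // stands in for a batched DB write
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "index-writer");
        writer.start();

        ExecutorService readers = Executors.newFixedThreadPool(poolSize);
        for (String file : files) {
            readers.execute(() -> {                  // one task per file
                try {
                    queue.put(Optional.of("extracted:" + file)); // blocks when the queue is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        readers.shutdown();
        try {
            if (!readers.awaitTermination(30, TimeUnit.MINUTES)) {
                readers.shutdownNow();               // give up on stuck extractors
            }
            queue.put(Optional.empty());             // tell the writer to finish
            writer.join();                           // makes the writer's results visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("indexing interrupted", e);
        }
        return written;
    }
}
```

The bounded queue is what provides back-pressure: when extraction outpaces writing, reader tasks block in put instead of piling extracted content onto the heap.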

Known limitations

  • The indexer targets text-like and image files; other binary formats (PDF, audio, video) are skipped during content extraction.
  • CLI sorting is expressed inside the query (sort:...), not via a separate --sort option.
  • Symlink-related behavior may vary by OS permissions (the test suite already marks this as optional on unsupported environments).
  • Synonym expansion is applied at query time. Matched synonyms are not currently highlighted in result previews — only the original typed terms are highlighted.

Design notes

The following sections document the architectural choices behind the ranking system, the persistence layer, and the frontend, including the trade-offs that were considered and deliberately accepted.

Request flow

The sequence diagram below traces a personalized-search request from the UI to SQLite and back, showing where each layer narrows its dependencies and where the behavior-score UDF runs. The component diagram shows the current (post-decomposition) shape of the persistence layer.

Personalized search - request sequence

sequenceDiagram
    autonumber
    participant UI as Frontend (SearchPanel)
    participant API as ApiServer
    participant SS as SearchService
    participant SE as SearchEngine
    participant Acc as DatabaseAccessor
    participant Sess as DatabaseSession<br/>(SqliteDatabaseSession)
    participant FC as FileContext
    participant Repo as SqliteFileRepository
    participant DB as SQLite + FTS5<br/>+ behavior_score UDF
    participant Obs as SearchActivityObserver

    UI->>API: GET /api/search?q=auth+sort:behavior
    API->>SS: search(dbPath, query, limit)

    SS->>Acc: openFileSearch(dbPath)
    Acc->>Sess: open(dbPath)
    Sess-->>Acc: DatabaseSession
    Note right of Acc: returns CloseableFileSearch<br/>(narrow view)
    Acc-->>SS: CloseableFileSearch

    SS->>SE: new SearchEngine(searchRepo, parser, limit)
    SS->>SE: search(input)
    SE->>SE: parse query, resolve strategy
    SE->>FC: search(query, limit, BehaviorRankingStrategy, normalizedQuery)

    FC->>Repo: search(...)
    Repo->>DB: SELECT ... ORDER BY CASE ... behavior_score(...) DESC
    Note right of DB: UDF computes behavior score<br/>per row using Java formula
    DB-->>Repo: rows
    Repo-->>FC: List<RankedSearchResult> (with insights)
    FC-->>SE: results
    SE-->>SS: results

    SS->>Obs: onSearchExecuted(...)
    Obs->>Acc: openSearchActivity(dbPath)
    Acc-->>Obs: CloseableSearchActivity
    Obs->>Sess: recordSearch(...)
    Note right of Obs: Records query + duration<br/>for future personalization

    SS-->>API: List<RankedSearchResult>
    API-->>UI: JSON (results + insights)
    UI->>UI: render results, show insights<br/>under top files

The narrowing happens at openFileSearch: services receive a CloseableFileSearch (a narrow view), not a session or a Database. The compiler enforces that only file-search methods can be called from SearchService at this point in the flow. The same pattern repeats for activity recording via CloseableSearchActivity.

Persistence layer - component structure

graph TB
    subgraph svc["Service layer (consumers)"]
        SS[SearchService]
        IS[IndexService]
        HS[HistoryService]
        Obs[SearchActivityObserver]
    end

    Acc[DatabaseAccessor]

    subgraph ifaces["Closeable view interfaces<br/>(app.repository)"]
        CFS[CloseableFileSearch]
        CSA[CloseableSearchActivity]
        CIR[CloseableIndexRuns]
        CIS[CloseableIndexSession]
    end

    DS[DatabaseSession<br/>umbrella interface]

    subgraph impl["SQLite implementation (app.db.sqlite)"]
        SDS[SqliteDatabaseSession]
        FC[FileContext]
        IRC[IndexRunContext]
        AC[ActivityContext]
        SCP[SqliteConnectionProvider]
    end

    SS -->|via| CFS
    SS -->|via| CSA
    IS -->|via| CIS
    HS -->|via| CIR
    Obs -->|via| CSA

    CFS -.implemented by.-> SDS
    CSA -.implemented by.-> SDS
    CIR -.implemented by.-> SDS
    CIS -.implemented by.-> SDS

    DS -.aggregates.-> CFS
    DS -.aggregates.-> CSA
    DS -.aggregates.-> CIR
    DS -.aggregates.-> CIS

    SDS -->|delegates to| FC
    SDS -->|delegates to| IRC
    SDS -->|delegates to| AC

    FC -->|uses| SCP
    IRC -->|uses| SCP
    AC -->|uses| SCP

    Acc -->|opens| DS
    SS -.uses.-> Acc
    IS -.uses.-> Acc
    HS -.uses.-> Acc
    Obs -.uses.-> Acc

Behavior score: SQLite UDF instead of inline SQL

The personalized ranking formula combines three signals into a single weighted score. Two implementation paths were considered:

Option A — formula inline in the strategy's ORDER BY clause. Fits the existing RankingStrategy contract (each strategy returns a SQL fragment). Simple to wire in, but the formula lives as a string concatenation in Java code and cannot be unit-tested without a live SQLite connection.

Option B — formula in a Java class, exposed to SQL through a SQLite user-defined function. Requires registering the UDF on every JDBC connection (introducing a connection-level coupling), but keeps the formula in a unit-testable Java class while preserving the RankingStrategy contract.

The current implementation uses Option B. The formula lives in BehaviorScoreFormula, with BehaviorScoreFormulaTest covering each component (frequency cap, recency half-life, position lift threshold) as pure-function tests. The thin SQLite adapter (SqliteBehaviorScoreFunction) extends org.sqlite.Function and delegates to the formula. The strategy class (BehaviorRankingStrategy) returns an ORDER BY clause referencing the UDF by its registered name.

Explainability: per-strategy opt-in, not per-result tagging

When the user picks personalized sort, results show insights describing why they ranked. The insight text is generated from the same raw signals that feed the formula (open count, last-open timestamp, average position) and lives in BehaviorRankingInsights. Backend production of insights is gated by RankingStrategy.producesInsights(), a default method that returns false and is overridden to true only on BehaviorRankingStrategy. Other strategies' result rows carry an empty insight list.
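
The opt-in can be sketched with a Java default method. The interface and class names below carry a Sketch suffix to mark them as hypothetical, and the SQL fragments and column names are invented for illustration; only the producesInsights() idea and its default of false come from the description above.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the RankingStrategy contract.
interface RankingStrategySketch {
    String orderByClause();

    // Strategies are silent by default and must opt in to insights.
    default boolean producesInsights() {
        return false;
    }
}

final class DateRankingSketch implements RankingStrategySketch {
    @Override
    public String orderByClause() {
        return "modified_at DESC"; // hypothetical column name
    }
}

final class BehaviorRankingSketch implements RankingStrategySketch {
    @Override
    public String orderByClause() {
        // Hypothetical UDF arguments; the real signature is not shown in the README.
        return "behavior_score(open_count, last_opened_at, avg_position) DESC";
    }

    @Override
    public boolean producesInsights() {
        return true;
    }
}

final class InsightGateSketch {
    private InsightGateSketch() {}

    // Builds insight text only when the active strategy has opted in.
    static List<String> insightsFor(RankingStrategySketch strategy, int openCount) {
        if (!strategy.producesInsights()) {
            return Collections.emptyList();
        }
        return Collections.singletonList(
                "you've opened this " + openCount + " times for similar searches");
    }
}
```

Keeping the gate on the strategy means a new strategy produces no insights unless it overrides the default, so existing strategies need no changes.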

Persistence layer: dependency inversion before decomposition

The persistence layer underwent two refactors during Iteration 2:

1. Narrow consumer dependencies. Services that previously held references to a god Database class were updated to depend on narrow repository interfaces (FileSearchRepository, SearchActivityRepository, IndexRunRepository, FileWriteRepository, FileMetadataRepository). Each interface has a Closeable* companion (e.g., CloseableFileSearch) extending AutoCloseable, so services can use try-with-resources while still depending on a narrow type. The DatabaseAccessor is the only place that constructs persistence handles; services never see the umbrella type.

2. Decompose the implementation. With consumers decoupled, the monolithic Database class was split into three per-domain context classes:

  • FileContext — file records (search, write, metadata)
  • IndexRunContext — indexing run lifecycle and history
  • ActivityContext — search execution and result-open activity

SqliteDatabaseSession composes the three contexts and implements the umbrella DatabaseSession interface (which aggregates the closeable views). SqliteDatabaseProvider returns DatabaseSession instances and now owns schema initialization (extracted from the previous Database constructor). The Database class no longer exists.
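
The narrow Closeable* views described above can be illustrated with a small sketch. All names carry a Sketch suffix to mark them as hypothetical. An in-memory list stands in for SQLite, and the accessor takes a session object instead of a database path, which is a simplification of the real DatabaseAccessor.

```java
import java.util.ArrayList;
import java.util.List;

// Narrow read-only contract that a search service would depend on.
interface FileSearchRepositorySketch {
    List<String> search(String term, int limit);
}

// Closeable companion: same narrow contract, usable in try-with-resources.
interface CloseableFileSearchSketch extends FileSearchRepositorySketch, AutoCloseable {
    @Override
    void close(); // redeclared without a checked exception
}

// Stand-in for the umbrella session, which can do more than search.
final class InMemorySessionSketch implements CloseableFileSearchSketch {
    private final List<String> paths = new ArrayList<>();
    private boolean closed;

    // Write access exists on the session but is invisible through the narrow view.
    void write(String path) {
        paths.add(path);
    }

    @Override
    public List<String> search(String term, int limit) {
        List<String> hits = new ArrayList<>();
        for (String path : paths) {
            if (hits.size() < limit && path.contains(term)) {
                hits.add(path);
            }
        }
        return hits;
    }

    @Override
    public void close() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

// The only place that hands out persistence handles, always as a narrow type.
final class DatabaseAccessorSketch {
    private DatabaseAccessorSketch() {}

    static CloseableFileSearchSketch openFileSearch(InMemorySessionSketch session) {
        return session;
    }
}
```

A caller holding a CloseableFileSearchSketch cannot compile a call to write, which is the compiler-enforced narrowing the design relies on, while try-with-resources still closes the underlying session.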

Frontend structure

App.tsx is the top-level component owning section state and layout. Indexing concerns (config form, status badge, history table) live in IndexPanel; search concerns (search bar, sort selector, results, insights) live in SearchPanel. Leaf components (SuggestionBox, ResultCard, StatusBadge) are presentational. All HTTP calls are centralized in api/client.ts. Shared types live in types.ts. Pure utilities (formatFileSize, getFolderPath, highlightText) are extracted to the utils/* modules.


Architecture

See ARCHITECTURE.md for the C4 model of the system design (delivered as part of an earlier iteration). The "Design notes" section above documents iteration-2 additions and refactors not covered by the original architecture document.

Development

Pre-commit hook

To enable the pre-commit hook (runs checkstyle before each commit):

cp hooks/pre-commit .git/hooks/pre-commit && chmod +x .git/hooks/pre-commit

About

A local search engine that indexes documents, media, and binaries across your device. By leveraging filenames, content inspection, and metadata, it provides a "search-as-you-type" experience for retrieving local data.
