
How Nomic Uses Lance and LanceDB for Atlas and AEC Document Intelligence

Nomic builds two related products: Atlas, a large-scale embedding visualization system that consumes tabular datasets and can either take precomputed embeddings or generate embeddings for any text column, and the newer AEC product, which ingests documents such as PDFs and runs agentic workflows over them. Across both products, the team relies on the Lance file format, with LanceDB powering the AEC product.

Figure 1: Nomic Atlas for data exploration

Atlas: making very large embedding datasets explorable

Background and migration

Atlas started before Lance existed, with early versions leaning on the Apache Arrow ecosystem to store derived artifacts on disk as Arrow IPC and in formats such as Parquet and Feather. Some data also lived in PostgreSQL. As Atlas grew to handle tens of millions of embeddings, this approach became hard to scale, prompting the team to shift the storage of derived data and embeddings into Lance tables.

Query performance and index strategy

This change improved query performance and simplified day-to-day operations. For many Atlas use cases, queries run fast enough by scanning Lance tables directly, eliminating the need to prebuild an HNSW index and keep it in memory to retrieve relevant neighbors. The team now builds an index only when a specific latency target requires it, avoiding the memory pressure and operational complexity they experienced when maintaining large in-memory HNSW graphs.

In Atlas we scale to tens of millions of embeddings… for a lot of our use cases we can just do the plain Lance query without even building an index… generally we don’t—we just do the scan, and Lance is fast enough for most of our use cases. —Aaron Miller
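
To make the scan-first approach concrete, here is a minimal sketch of what a direct nearest-neighbor query looks like with the pylance API; the dataset path, column names, and query dimensionality below are illustrative, not Nomic's actual schema.

```python
import lance
import numpy as np

# Open the dataset (path and column names are illustrative).
ds = lance.dataset("s3://bucket/atlas/embeddings.lance")

query = np.random.rand(768).astype("float32")  # a query embedding

# With no vector index present, Lance falls back to a brute-force scan,
# which is often fast enough even at tens of millions of rows.
neighbors = ds.to_table(
    nearest={"column": "embedding", "q": query, "k": 20},
    columns=["id", "text"],
)
print(neighbors.to_pandas())

# Only if a latency target demands it would an ANN index be created, e.g.:
# ds.create_index("embedding", index_type="IVF_PQ",
#                 num_partitions=256, num_sub_vectors=16)
```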

Versioning and compaction

Version history is crucial to Atlas, as users need to open datasets as they existed at previous points in time. The team leverages Lance’s ability to keep old versions so Atlas can always show earlier states of a map. Since these versions must remain available, the team writes data in large batches to reduce fragmentation and treats compaction carefully—often skipping it entirely because it would remove older versions that users still rely on.
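
As a small sketch of how this looks with pylance's version API (the path and version number here are illustrative):

```python
import lance

ds = lance.dataset("data/map.lance")

# Every write produces a new version; older ones stay readable
# as long as they have not been cleaned up.
for v in ds.versions():
    print(v["version"], v["timestamp"])

# Open the dataset exactly as it existed at an earlier version,
# which is how an older state of a map can always be shown.
ds_v3 = lance.dataset("data/map.lance", version=3)
print(ds_v3.count_rows())
```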

Storage layout and agents

The team also saw practical gains in storage layout, with many derived artifacts moving from a shared file system to Amazon S3. This lowered costs and made scaling storage easier while maintaining the same workflows. Atlas’s chat and agent features now read tool-call results directly from Lance tables, eliminating the need for custom caching layers.
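
As a rough illustration of this layout (the bucket, table, and column names below are made up), reading such an artifact is just a matter of pointing Lance at an s3:// URI, with no shared file system in between:

```python
import lance

# Derived artifacts live directly on object storage.
ds = lance.dataset("s3://example-derived/atlas/tool_results.lance")

# An agent can pull back prior tool-call results with a plain filtered scan.
recent = ds.to_table(
    filter="map_id = 'abc123'",
    columns=["tool", "result"],
)
```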

Ecosystem fit

Atlas continues to benefit from the broader Arrow-first stack, particularly since the team already uses DuckDB, which allows analytical filtering and column selection to integrate cleanly when preparing maps and performing quality checks.
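
A minimal sketch of that Arrow-first handoff, assuming an illustrative dataset and columns: Lance materializes Arrow data, and DuckDB can query an Arrow table referenced by name in SQL.

```python
import duckdb
import lance

ds = lance.dataset("data/map.lance")  # illustrative path

# Lance hands back Arrow data; DuckDB picks up the in-scope Arrow table
# by variable name, so analytical checks are plain SQL.
tbl = ds.to_table(columns=["id", "topic", "text"])
result = duckdb.sql(
    "SELECT topic, count(*) AS n FROM tbl GROUP BY topic ORDER BY n DESC"
).df()
print(result.head())
```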

AEC product: running multi-stage document workflows with LanceDB

The AEC product accepts documents such as PDFs and processes them through several stages, calling out to different models and providers to perform the work before surfacing answers through retrieval and agentic workflows.

Why Node.js + LanceDB

This product uses LanceDB from Node.js, chosen because it’s easier to use from Node while Atlas continues to access Lance from Python. The retrieval patterns in AEC are usually scoped to a subset of the data, with users wanting results restricted to a particular project, site, or other slice. LanceDB makes this straightforward because vector search can be combined with SQL-style filters in the same query. While the primary retrieval method is currently text embeddings, the team plans to add full-text search for content without embeddings and for situations where keyword matching performs better, and to explore hybrid approaches.

A big part of why we’re using LanceDB [in the AEC product] is because we use Node.js there… we find LanceDB easier to use from Node. And since we’re querying subsets of the dataset, the ability to do an embedding query with SQL filters is really helpful. - Aaron Miller
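
The team does this from the Node SDK; the same pattern of combining a vector query with a SQL-style filter looks like the sketch below in the LanceDB Python client, shown here for consistency with the other examples. The URI, table, and column names are illustrative.

```python
import lancedb
import numpy as np

db = lancedb.connect("s3://example-bucket/aec-db")  # illustrative URI
chunks = db.open_table("document_chunks")           # illustrative table

query_vec = np.random.rand(768).astype("float32")   # embedding of the user question

# Vector search and a SQL-style filter in one query: results are
# restricted to a single project before anything is returned.
hits = (
    chunks.search(query_vec)
    .where("project_id = 'project-42'")
    .limit(10)
    .to_list()
)
```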

Concurrency and checkpoints

Concurrency was a challenge in earlier designs. The team solved it through a multi-step approach:

Step 1: Isolate writes by stage

The team writes each processing stage to its own Lance table. This lets many writers run in parallel without contending on a single table. Reads can later combine the stage tables as needed.

Step 2: Store checkpoints with data

For long-running syncs from enterprise sources, the application must know where it left off. The checkpoint for that process is stored inside Lance so it cannot drift away from the data.

Step 3: Two-phase commit for consistency

To keep things consistent, the team performs two commits during ingestion. One commit writes the metadata that includes the checkpoint. A second commit writes the data. When the system restarts, it verifies that both commits happened together before continuing.
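
Here is a minimal sketch of how these three steps could fit together, written with the LanceDB Python client for brevity (the team's implementation is in Node, and the table names, schema, and verification logic below are assumptions, not their code):

```python
import lancedb

db = lancedb.connect("s3://example-bucket/aec-db")  # illustrative URI

def ingest_batch(stage: str, batch: list[dict], cursor: str, batch_id: int) -> None:
    # Commit 1: the checkpoint lives in a Lance table next to the data,
    # so it cannot drift away from what was actually written.
    db.open_table("sync_checkpoints").add(
        [{"stage": stage, "cursor": cursor, "batch_id": batch_id}]
    )
    # Commit 2: each stage writes to its own table, so parallel writers
    # never contend on a single table.
    db.open_table(f"stage_{stage}").add(
        [{**row, "batch_id": batch_id} for row in batch]
    )

def last_complete_batch(stage: str) -> int:
    """On restart, return the last batch whose checkpoint AND data both landed."""
    checkpoints = db.open_table("sync_checkpoints").to_pandas()
    data = db.open_table(f"stage_{stage}").to_pandas()
    done = set(checkpoints[checkpoints.stage == stage].batch_id) & set(data.batch_id)
    return max(done) if done else -1
```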

For building datasets and tables quickly, see Datasets quickstart.

Fragmentation and compaction

The team has also dealt with table fragmentation from many small updates, compacting tables on a schedule or manually in some contexts. They would like programmatic triggers that start compaction when read performance degrades, plus a mode of compaction that keeps specific older versions for Atlas while cleaning up the rest.
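
In pylance today this maps to maintenance calls like the ones below; the path and thresholds are illustrative, and the programmatic triggers the team would like would wrap calls of this kind.

```python
from datetime import timedelta
import lance

ds = lance.dataset("data/stage_ocr.lance")  # illustrative path

# Scheduled or manual compaction: merge the many small fragments
# produced by frequent small writes into larger ones.
ds.optimize.compact_files(target_rows_per_fragment=1024 * 1024)

# Optional cleanup of old versions; skipped for Atlas datasets where
# users still need to open earlier versions of a map.
ds.cleanup_old_versions(older_than=timedelta(days=14))
```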

Why Lance and LanceDB fit these jobs

Atlas needed storage that scales to tens of millions of embeddings and supports efficient scans, with the option to add an index only when a workload demands lower latency. It also needed reliable access to earlier versions of datasets so users could reproduce what they saw before. Lance met those needs and integrated seamlessly with the Arrow and DuckDB tools the team already used.

The AEC product needed a developer-friendly way to run retrieval in a Node environment, scope searches to a subset of the data, and support multi-stage pipelines without writer contention. LanceDB provided vector search with filters, while Lance tables gave the team a simple way to isolate writes by stage and keep ingestion checkpoints consistent with data.

From a developer-experience standpoint, Lance (the low-level API) and LanceDB (the database wrapper) feel like the right way to work with large-scale embeddings. —Andriy Mulyar, Nomic CEO

Results and next steps

Atlas now serves many queries by scanning Lance tables directly, building an index only when latency requires it. The team removed the burden of keeping large HNSW graphs in memory for most use cases, while users can view older versions of datasets as needed (see Lance overview). Derived artifacts moved from a shared file system to S3, simplifying scaling and reducing costs.

In the AEC product, LanceDB enables vector search with SQL-style filtering so users can search within a project or other slice. Writing each pipeline stage to its own table improved throughput and stability, while checkpoints stored inside Lance and committed in step with data ensure a crash cannot desynchronize them. The team plans to add full-text search and evaluate hybrid retrieval as needs grow.

Summary

This is the story in the team’s own words: use Lance for derived data and embeddings across both products, use LanceDB from Node where developer ergonomics matter, favor scans until an index is truly needed, keep versions available for Atlas, split writes by stage to avoid contention, and store checkpoints with the data so the system can always pick up exactly where it left off.

David Myriel

Writer.