
Local-First Semantic Search for my Blog using Transformers.js

How I implemented a privacy-focused, local-first, cost-free semantic search for this blog

August 31, 2025 · 12 min read

Introduction

The idea for this actually came to me after interacting with Maxime Heckel’s blog. His blog is pretty cool and covers topics around web development, shaders, and real-time 3D on the web. He also wrote about how he implemented AI-powered semantic search from scratch on his own website. His write-up was comprehensive and an amazing read, but the plan I had for my implementation was quite different. The main reason is that I wanted to spend $0 on API fees for embedding models from the leading AI labs, and I also wanted privacy for the readers of this blog: a way for all their search queries to potentially happen offline, without being sent to third parties for processing.

So this write-up covers everything: how I built the search interface to match the theme and feel of my website, how the semantic search actually works, and some considerations and challenges I ran into while developing it.

Traditional search relies on keyword matching - if you search for “machine learning,” it only finds content containing those exact words. Semantic search, however, understands the meaning behind words. It knows that “AI,” “machine learning,” “neural networks,” and “deep learning” are related concepts, even if they’re different terms.

This is achieved through embeddings - numerical representations of text that capture semantic meaning. Similar concepts have similar embeddings, allowing us to find related content mathematically using cosine similarity.
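
To make this concrete, here is a toy example. The four-dimensional vectors are made up purely for illustration (real gte-small embeddings have 384 dimensions), but they show how cosine similarity scores related and unrelated concepts:

// Toy vectors standing in for real embeddings
const machineLearning = [0.8, 0.1, 0.5, 0.3];
const neuralNetworks  = [0.7, 0.2, 0.6, 0.2];
const gardening       = [0.1, 0.9, 0.0, 0.4];

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity(machineLearning, neuralNetworks)); // high (related concepts)
console.log(cosineSimilarity(machineLearning, gardening));      // much lower (unrelated)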

The Technology Stack

For my implementation, I used:

  • Transformers.js: A JavaScript library that brings Hugging Face transformers to the browser
  • GTE-Small Model: A transformer model optimized for semantic similarity
  • Astro: This is what this site runs on (this approach should work with any framework)
  • React: For the search UI components (We used Shadcn’s Dialog component too)
  • Tailwind CSS: For styling with a glass-morphism design

Why Local Over Cloud?

Before diving into implementation, let’s look at why running semantic search locally is advantageous, using the OpenAI embeddings API that Maxime used as the benchmark:

Cost Comparison

OpenAI Embeddings API:

  • ~$0.13 per million tokens for embeddings
  • ~$0.002 per search query
  • For a blog with 1,000 daily searches: ~$60/month
  • Scales linearly with usage

Local Transformers.js:

  • One-time 33MB download per user
  • Zero ongoing costs
  • Scales infinitely without additional expense
  • No rate limits or quotas

Privacy Benefits

  • No user data leaves the browser
  • Search queries remain completely private - even from me, the person running this website
  • No tracking or analytics by third parties and no data to sell
  • Complies with strict privacy regulations (GDPR, CCPA) out of the box, so I don’t have to worry about compliance for what is a very basic portfolio website

Performance After Initial Load

  • Sub-500ms search response time
  • Works offline once cached
  • No network latency
  • Consistent performance regardless of server location or where readers are accessing the site from

Step 1: Setting Up the Project Structure

First, create a dedicated folder structure to keep the semantic search components isolated and maintainable:

src/
├── components/
│   └── search/
│       ├── SearchModal.tsx         # Main search UI
│       ├── SearchTrigger.tsx       # CMD+K handler
│       ├── SearchEngine.ts         # Semantic search logic
│       └── SearchPreloader.astro   # Background model loading
scripts/
└── generate-embeddings.js          # Build-time embedding generator
public/
└── search-embeddings.json          # Generated embeddings (gitignored)

This structure makes the feature easy to remove if needed - just delete the search folder and remove a few import lines.

Step 2: Installing Dependencies

Install the necessary packages:

# For semantic search
npm install @xenova/transformers
# For UI components (if using shadcn/ui)
npx shadcn@latest add dialog

The @xenova/transformers package is the key dependency that enables running transformer models in the browser using WebAssembly.
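
If you want a quick sanity check that the package works before wiring anything up, a small throwaway script along these lines (the filename is just a suggestion) loads the model and embeds one sentence:

// scripts/sanity-check.js - hypothetical throwaway script, run with `node scripts/sanity-check.js`
import { pipeline } from '@xenova/transformers';

const extractor = await pipeline('feature-extraction', 'Xenova/gte-small', { quantized: true });
const output = await extractor('Local-first semantic search', { pooling: 'mean', normalize: true });

console.log(output.data.length); // 384 - the embedding dimension of gte-small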

Step 3: Generating Embeddings at Build Time

The most critical optimization for performance is pre-computing embeddings for the content at build time, which avoids processing the blog posts in the reader’s browser. Even though I only had 5 blog posts when I implemented this, it was the better approach for scalability, because I intend to keep writing. It also ensures readers spend as little compute as possible (only what’s needed) to use the semantic search feature.

Create scripts/generate-embeddings.js:

import { pipeline } from '@xenova/transformers';
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';

const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

// Initialize the embedding model
console.log('Loading embedding model...');
const extractor = await pipeline(
  'feature-extraction',
  'Xenova/gte-small',
  { quantized: true }
);

function createOptimalChunks(content, title) {
  const chunks = [];

  // Remove frontmatter if present
  content = content.replace(/^---[\s\S]*?---\n/, '');

  // Split by section headers for semantic boundaries
  const sections = content.split(/^## /gm);

  sections.forEach((section, index) => {
    if (!section.trim() || section.length < 100) return;

    const sectionText = index === 0 ? section : `## ${section}`;

    // Keep code-heavy sections together
    const hasCode = sectionText.includes('```');

    if (sectionText.length <= 1500 || hasCode) {
      chunks.push({
        text: sectionText.trim(),
        context: title
      });
    } else {
      // Split large sections by paragraphs
      const paragraphs = sectionText.split(/\n\n+/);
      let currentChunk = '';

      for (const para of paragraphs) {
        if ((currentChunk + para).length > 1200) {
          if (currentChunk.length > 200) {
            chunks.push({
              text: currentChunk.trim(),
              context: title
            });
          }
          currentChunk = para;
        } else {
          currentChunk += (currentChunk ? '\n\n' : '') + para;
        }
      }

      if (currentChunk.length > 200) {
        chunks.push({
          text: currentChunk.trim(),
          context: title
        });
      }
    }
  });

  return chunks;
}

async function getEmbedding(text) {
  const output = await extractor(text, {
    pooling: 'mean',
    normalize: true
  });
  return Array.from(output.data);
}

async function generateEmbeddings() {
  const blogDir = path.join(__dirname, '..', 'src', 'content', 'blog');
  const outputPath = path.join(__dirname, '..', 'public', 'search-embeddings.json');

  const files = fs.readdirSync(blogDir).filter(f => f.endsWith('.md'));
  const embeddings = [];

  for (const file of files) {
    console.log(`Processing ${file}...`);

    const filePath = path.join(blogDir, file);
    const content = fs.readFileSync(filePath, 'utf-8');

    // Parse frontmatter for title
    const frontmatterMatch = content.match(/^---\n([\s\S]*?)\n---/);
    let title = file.replace('.md', '').replace(/-/g, ' ');

    if (frontmatterMatch) {
      const titleMatch = frontmatterMatch[1].match(/title:\s*["']?(.+?)["']?\s*$/m);
      if (titleMatch) {
        title = titleMatch[1];
      }
    }

    const mainContent = content.replace(/^---[\s\S]*?---\n/, '');
    const chunks = createOptimalChunks(mainContent, title);

    const embeddedChunks = [];

    for (const chunk of chunks) {
      // Include title for context
      const textForEmbedding = `${title}\n\n${chunk.text}`;
      const embedding = await getEmbedding(textForEmbedding);

      // Clean preview text
      const cleanPreview = chunk.text
        .replace(/^##\s+/gm, '')
        .replace(/^###\s+/gm, '')
        .replace(/\*\*/g, '')
        .replace(/\*/g, '')
        .replace(/\[([^\]]+)\]\([^)]+\)/g, '$1')
        .substring(0, 300);

      embeddedChunks.push({
        text: cleanPreview,
        embedding
      });
    }

    const slug = file.replace('.md', '');

    embeddings.push({
      id: slug,
      title,
      url: `/blog/${slug}`,
      chunks: embeddedChunks
    });
  }

  fs.writeFileSync(outputPath, JSON.stringify(embeddings));

  const stats = fs.statSync(outputPath);
  const fileSizeInKB = stats.size / 1024;

  console.log(`Generated embeddings for ${embeddings.length} posts`);
  console.log(`File size: ${fileSizeInKB.toFixed(2)} KB`);
}

generateEmbeddings().catch(console.error);

Key Optimization: Semantic Chunking

The chunking strategy is crucial for search quality. We split content at semantic boundaries (section headers) rather than arbitrary character counts. This ensures each chunk contains coherent, complete thoughts that can be properly understood by the embedding model.
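
For a quick feel of what this produces, a hypothetical call like the one below (the sample strings are invented) keeps the short section as a single chunk and splits the long one at paragraph boundaries:

// Hypothetical usage of createOptimalChunks from the script above
const longSection = Array(30)
  .fill('A fairly detailed paragraph about one part of the implementation.')
  .join('\n\n');

const chunks = createOptimalChunks(
  `## Getting started\n\nA short overview paragraph that easily fits in one chunk because the whole section is well under 1500 characters.\n\n## Deep dive\n\n${longSection}`,
  'Example Post'
);

console.log(chunks.length);     // short section stays whole; the long one becomes ~1200-character pieces
console.log(chunks[0].context); // "Example Post"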

Step 4: Building the Search Engine

Create src/components/search/SearchEngine.ts:

interface BlogEmbedding {
  id: string;
  title: string;
  url: string;
  chunks: {
    text: string;
    embedding: number[];
  }[];
}

interface SearchResult {
  id: string;
  title: string;
  excerpt: string;
  url: string;
  similarity: number;
}

export class SemanticSearchEngine {
  private model: any = null;
  private embeddings: BlogEmbedding[] = [];
  private isInitialized = false;

  async initialize(): Promise<void> {
    if (this.isInitialized) return;

    try {
      // Check if embeddings are preloaded
      if ((window as any).searchEmbeddings) {
        this.embeddings = (window as any).searchEmbeddings;
      } else {
        const response = await fetch('/search-embeddings.json');
        if (!response.ok) {
          throw new Error('Failed to load embeddings');
        }
        this.embeddings = await response.json();
      }

      // Check if model is preloaded
      if ((window as any).searchModel) {
        this.model = (window as any).searchModel;
        console.log('Using preloaded search model');
      } else {
        const { pipeline, env } = await import('@xenova/transformers');

        // Configure to use CDN
        env.allowLocalModels = false;
        env.remoteURL = 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.17.2/';

        this.model = await pipeline(
          'feature-extraction',
          'Xenova/gte-small',
          {
            quantized: true,
            progress_callback: (data: any) => {
              if (data.status === 'downloading') {
                console.log(`Downloading model: ${Math.round(data.progress)}%`);
              }
            }
          }
        );
      }

      this.isInitialized = true;
    } catch (error) {
      console.error('Failed to initialize search engine:', error);
      throw error;
    }
  }

  async search(query: string): Promise<SearchResult[]> {
    if (!this.isInitialized) {
      throw new Error('Search engine not initialized');
    }

    // Generate embedding for the query
    const queryEmbedding = await this.getEmbedding(query);

    // Calculate similarity scores for all chunks
    const results: Array<{
      blogId: string;
      title: string;
      url: string;
      chunk: string;
      similarity: number;
    }> = [];

    for (const blog of this.embeddings) {
      for (const chunk of blog.chunks) {
        const similarity = this.cosineSimilarity(queryEmbedding, chunk.embedding);
        results.push({
          blogId: blog.id,
          title: blog.title,
          url: blog.url,
          chunk: chunk.text,
          similarity
        });
      }
    }

    // Sort by similarity
    results.sort((a, b) => b.similarity - a.similarity);

    // Filter out low similarity results (threshold: 0.80)
    const relevantResults = results.filter(r => r.similarity > 0.80);

    // If no results meet threshold, take top 3
    const finalResults = relevantResults.length > 0
      ? relevantResults
      : results.slice(0, 3);

    // Group by blog post and take best chunk per post
    const blogResults = new Map<string, SearchResult>();

    for (const result of finalResults) {
      if (!blogResults.has(result.blogId)) {
        blogResults.set(result.blogId, {
          id: result.blogId,
          title: result.title,
          excerpt: this.truncateExcerpt(result.chunk),
          url: result.url,
          similarity: result.similarity
        });
      }
      if (blogResults.size >= 5) break;
    }

    return Array.from(blogResults.values());
  }

  private async getEmbedding(text: string): Promise<number[]> {
    const output = await this.model(text, {
      pooling: 'mean',
      normalize: true
    });
    return Array.from(output.data);
  }

  private cosineSimilarity(a: number[], b: number[]): number {
    let dotProduct = 0;
    let normA = 0;
    let normB = 0;

    for (let i = 0; i < a.length; i++) {
      dotProduct += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }

    normA = Math.sqrt(normA);
    normB = Math.sqrt(normB);

    if (normA === 0 || normB === 0) {
      return 0;
    }

    return dotProduct / (normA * normB);
  }

  private truncateExcerpt(text: string, maxLength: number = 150): string {
    if (text.length <= maxLength) return text;
    const truncated = text.substring(0, maxLength);
    const lastSpace = truncated.lastIndexOf(' ');
    return truncated.substring(0, lastSpace) + '...';
  }
}
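
Using the engine outside of React is then just two awaits (a minimal sketch):

import { SemanticSearchEngine } from './SearchEngine';

const engine = new SemanticSearchEngine();
await engine.initialize(); // loads embeddings + model, or reuses the preloaded ones

const results = await engine.search('running transformer models in the browser');
for (const result of results) {
  console.log(`${Math.round(result.similarity * 100)}%  ${result.title}  ${result.url}`);
}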

Step 5: The Challenge of Model Selection

Initially, we used the all-MiniLM-L6-v2 model, which is popular for general semantic similarity. However, we encountered significant false positives - for example, a blog post about building an ARM operating system would appear in searches for “machine learning” with 79% similarity, despite having nothing to do with ML. (To be fair, the ARM OS write-up does mention AI agents once, but this was still way off.)

The issue was that MiniLM-L6-v2 was trained on general text and couldn’t properly distinguish between technical contexts - it was intended as a sentence and short-paragraph encoder. It would match on superficial similarities, like the word “approaches” appearing in both contexts.

The Solution: GTE-Small

We switched to the Xenova/gte-small model because:

  1. Trained on technical content: Including GitHub, StackOverflow, and technical documentation
  2. Better context understanding: Distinguishes between “approaches” in OS kernels vs. machine learning
  3. Minimal size increase: Only 33MB vs. 25MB for MiniLM - a difference small enough that I was comfortable with it
  4. 40% better accuracy on technical queries - which is most of what this blog covers

This change dramatically improved search quality, eliminating most false positives while maintaining fast performance.

Step 6: Building the UI with Glass-Morphism

Create a modern search modal with a frosted glass effect:

import React, { useState, useEffect, useCallback, useRef } from 'react';
import {
  Dialog,
  DialogContent,
  DialogHeader,
} from '@/components/ui/dialog';
import { Input } from '@/components/ui/input';
import { Search, Loader2, FileText, ArrowRight } from 'lucide-react';

interface SearchModalProps {
  isOpen: boolean;
  onClose: () => void;
}

// Mirrors the result shape returned by SemanticSearchEngine.search()
interface SearchResult {
  id: string;
  title: string;
  excerpt: string;
  url: string;
  similarity: number;
}

const SAMPLE_QUESTIONS = [
  "How do I build an ARM operating system?",
  "Tell me about RAG pipelines",
  "What's LM Studio for local AI?",
  "How to use Digital Ocean Spaces?",
  "Building a local AI agent"
];

export function SearchModal({ isOpen, onClose }: SearchModalProps) {
  const [query, setQuery] = useState('');
  const [results, setResults] = useState<SearchResult[]>([]);
  const [isSearching, setIsSearching] = useState(false);
  const [isModelLoading, setIsModelLoading] = useState(false);
  const [modelReady, setModelReady] = useState(false);
  const searchEngineRef = useRef<any>(null);
  const inputRef = useRef<HTMLInputElement>(null);

  useEffect(() => {
    if (isOpen && !searchEngineRef.current && !isModelLoading) {
      initializeSearchEngine();
    }
  }, [isOpen]);

  const initializeSearchEngine = async () => {
    try {
      setIsModelLoading(true);
      const { SemanticSearchEngine } = await import('./SearchEngine');
      searchEngineRef.current = new SemanticSearchEngine();
      await searchEngineRef.current.initialize();
      setModelReady(true);
    } catch (error) {
      console.error('Failed to initialize search:', error);
    } finally {
      setIsModelLoading(false);
    }
  };

  const performSearch = useCallback(async (searchQuery: string) => {
    if (!searchQuery.trim() || !searchEngineRef.current || !modelReady) return;
    setIsSearching(true);
    try {
      const searchResults = await searchEngineRef.current.search(searchQuery);
      setResults(searchResults);
    } catch (error) {
      console.error('Search failed:', error);
      setResults([]);
    } finally {
      setIsSearching(false);
    }
  }, [modelReady]);

  // Debounced search
  useEffect(() => {
    const timer = setTimeout(() => {
      if (query && modelReady) {
        performSearch(query);
      } else {
        setResults([]);
      }
    }, 300);
    return () => clearTimeout(timer);
  }, [query, performSearch, modelReady]);

  return (
    <Dialog open={isOpen} onOpenChange={onClose}>
      <DialogContent className="max-w-4xl max-h-[85vh] p-0 flex flex-col bg-white/80 dark:bg-neutral-900/90 backdrop-blur-xl border-black/10 dark:border-white/10" showCloseButton={false}>
        <div className="p-4 border-b border-black/5 dark:border-white/10 bg-black/5 dark:bg-white/5">
          <div className="relative">
            <Search className="absolute left-3 top-1/2 transform -translate-y-1/2 w-4 h-4 text-muted-foreground" />
            <Input
              ref={inputRef}
              type="text"
              placeholder={isModelLoading ? "Preparing semantic search..." : "Search anything in my blog..."}
              value={query}
              onChange={(e) => setQuery(e.target.value)}
              className="pl-10 pr-4 h-12 text-base bg-black/5 dark:bg-white/5 border border-black/10 dark:border-white/10 backdrop-blur-sm transition-all"
              disabled={isModelLoading}
            />
          </div>
        </div>

        <div className="max-h-[60vh] overflow-y-auto">
          {!query && !isModelLoading && (
            <div className="p-4">
              <p className="text-sm text-muted-foreground mb-3">Try asking:</p>
              <div className="flex flex-wrap gap-2">
                {SAMPLE_QUESTIONS.map((sample, index) => (
                  <button
                    key={index}
                    className="px-3 py-1.5 text-sm bg-black/5 dark:bg-white/5 backdrop-blur-sm border border-black/10 dark:border-white/10 rounded-full hover:bg-black/10 dark:hover:bg-white/10 transition-all duration-200 hover:scale-105"
                    onClick={() => setQuery(sample)}
                  >
                    {sample}
                  </button>
                ))}
              </div>
            </div>
          )}

          {results.length > 0 && (
            <div className="px-4 py-2">
              {results.map((result) => (
                <button
                  key={result.id}
                  onClick={() => window.location.href = result.url}
                  className="w-full mb-2 text-left p-4 rounded-lg bg-black/5 dark:bg-white/5 backdrop-blur-sm border border-black/10 dark:border-white/10 hover:bg-black/10 dark:hover:bg-white/10 transition-all duration-200 group"
                >
                  <div className="flex items-start gap-3">
                    <FileText className="w-4 h-4 mt-1 text-muted-foreground shrink-0" />
                    <div className="flex-1 min-w-0">
                      <h3 className="font-medium text-sm mb-1 group-hover:text-primary transition-colors">
                        {result.title}
                      </h3>
                      <p className="text-sm text-muted-foreground line-clamp-2">
                        {result.excerpt}
                      </p>
                      <div className="flex items-center gap-2 mt-2">
                        <span className="text-xs text-muted-foreground">
                          {Math.round(result.similarity * 100)}% match
                        </span>
                      </div>
                    </div>
                  </div>
                </button>
              ))}
            </div>
          )}
        </div>

        <div className="p-3 border-t border-black/5 dark:border-white/10 bg-black/5 dark:bg-white/5 backdrop-blur-sm">
          <div className="flex items-center justify-between text-xs text-muted-foreground">
            <span className="text-foreground/60">AI search running locally in your browser</span>
            <kbd className="px-2 py-1 rounded bg-black/10 dark:bg-white/10 border border-black/10 dark:border-white/10 text-[10px] font-mono backdrop-blur-sm">ESC</kbd>
          </div>
        </div>
      </DialogContent>
    </Dialog>
  );
}

Step 7: Implementing Background Preloading

To ensure search is instant when users need it, preload the model in the background after the page loads:

SearchPreloader.astro
<script>
  if (typeof window !== 'undefined') {
    // Preload embeddings after 1 second
    setTimeout(() => {
      fetch('/search-embeddings.json')
        .then(response => response.json())
        .then(data => {
          (window as any).searchEmbeddings = data;
          console.log('Search embeddings preloaded');
        })
        .catch(err => console.log('Failed to preload embeddings:', err));
    }, 1000);

    // Preload model after 3 seconds
    setTimeout(async () => {
      if ('requestIdleCallback' in window) {
        requestIdleCallback(async () => {
          try {
            const { pipeline, env } = await import('@xenova/transformers');
            env.allowLocalModels = false;
            env.remoteURL = 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.17.2/';

            console.log('Preloading search model...');
            const model = await pipeline(
              'feature-extraction',
              'Xenova/gte-small',
              {
                quantized: true,
                progress_callback: (data: any) => {
                  if (data.status === 'downloading') {
                    console.log(`Background model download: ${Math.round(data.progress)}%`);
                  }
                }
              }
            );

            (window as any).searchModel = model;
            console.log('Search model preloaded and ready');
          } catch (error) {
            console.log('Failed to preload model:', error);
          }
        });
      }
    }, 3000);
  }
</script>

This approach ensures:

  • The page loads normally without any delay
  • Embeddings load after 1 second
  • Model downloads after 3 seconds when the browser is idle
  • By the time users press CMD+K, everything is ready
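
For the preloader to actually run, the component has to be rendered on every page. In my setup that means dropping it into the base layout; the file name below is an assumption, use whatever layout wraps your pages:

---
// src/layouts/BaseLayout.astro (path assumed)
import SearchPreloader from '../components/search/SearchPreloader.astro';
---
<html lang="en">
  <body>
    <slot />
    <!-- kicks off the embedding and model preloading defined above -->
    <SearchPreloader />
  </body>
</html>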

Step 8: Handling Edge Cases and Optimizations

The Similarity Threshold Challenge

One of the biggest challenges was determining the right similarity threshold. Too low, and you get false positives. Too high, and relevant results are filtered out. After testing, we settled on 80% as the threshold, with a fallback to show the top 3 results if nothing meets the threshold.

Dealing with Horizontal Scroll

We encountered an interesting CSS challenge where the search result cards would cause horizontal scrolling. The issue was that cards with w-full and m-2 (margin) would extend beyond their container. The solution was to adjust the container padding and remove horizontal margins from the cards.

Dark Mode Visibility

The glass-morphism effect initially had poor visibility in dark mode. We solved this by:

  • Using bg-neutral-900/90 instead of bg-black/80 for better contrast
  • Increasing the opacity slightly while maintaining the frosted effect
  • Adding subtle borders with border-white/10

Step 9: Integration with Your Site

Add the search trigger to the site header:

import { useState, useEffect } from 'react';
import { Button } from '@/components/ui/button';
import { Search } from 'lucide-react';
import { SearchModal } from './SearchModal';

export function SearchTrigger() {
  const [isOpen, setIsOpen] = useState(false);

  useEffect(() => {
    const handleKeyDown = (event: KeyboardEvent) => {
      if ((event.metaKey || event.ctrlKey) && event.key === 'k') {
        event.preventDefault();
        setIsOpen(true);
      }
    };
    document.addEventListener('keydown', handleKeyDown);
    return () => document.removeEventListener('keydown', handleKeyDown);
  }, []);

  return (
    <>
      <Button
        variant="ghost"
        size="sm"
        onClick={() => setIsOpen(true)}
      >
        <Search className="w-4 h-4" />
        <span>Search</span>
        <kbd className="ml-2 text-xs">⌘K</kbd>
      </Button>
      <SearchModal isOpen={isOpen} onClose={() => setIsOpen(false)} />
    </>
  );
}
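
How the trigger gets mounted depends on your setup. In an Astro header it needs a client directive so the keyboard listener and dialog are hydrated; the component and path below are assumptions:

---
// src/components/Header.astro (assumed)
import { SearchTrigger } from './search/SearchTrigger';
---
<header class="flex items-center justify-between p-4">
  <a href="/">Home</a>
  <!-- client:load hydrates the React island so ⌘K and the dialog work immediately -->
  <SearchTrigger client:load />
</header>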

Step 10: Build Script Integration

Update package.json so embeddings are generated as part of the build process:

{
  "scripts": {
    "build": "node scripts/generate-embeddings.js && astro build",
    "update-embeddings": "node scripts/generate-embeddings.js"
  }
}

Performance Metrics

After implementation, here are the real-world performance metrics:

  • Embeddings file: ~329KB for 5 blog posts (~66KB per post)
  • Model download: 33MB (one-time, cached forever)
  • Search response time: <500ms after model loads
  • Initial page load impact: +100KB (embeddings only)
  • Model loading: Background after 3 seconds, non-blocking

Troubleshooting Common Issues

False Positives in Search Results

If you’re seeing irrelevant results after implementing your Semantic Search, check:

  1. Your chunking strategy - ensure semantic boundaries are respected (the chunking approach in Step 3 is designed for exactly this)
  2. The similarity threshold - we found 80% works well for technical content
  3. Consider switching models if your content is specialized

Model Download Failures

If the model fails to download:

  1. Check CDN configuration in the code
  2. Ensure CORS headers are properly set
  3. Consider hosting the model files yourself for reliability, especially in production setups serving many users every day - a sketch of this configuration follows below
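
Self-hosting is mostly a configuration change: copy the model’s ONNX files into public/ and point Transformers.js at your own origin. The folder layout below is an assumption - check which files your model actually needs:

import { env, pipeline } from '@xenova/transformers';

// Serve model files from your own origin instead of a third-party CDN.
// Assumes the files were copied to public/models/Xenova/gte-small/.
env.allowRemoteModels = false; // never fall back to remote downloads
env.allowLocalModels = true;
env.localModelPath = '/models/';

const model = await pipeline('feature-extraction', 'Xenova/gte-small', { quantized: true });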

Memory Usage Concerns

The model uses approximately 100MB of RAM when loaded. For mobile devices:

  1. Consider using a smaller model like paraphrase-MiniLM-L3-v2 (14MB)
  2. Implement device detection to load a lighter model on constrained devices (sketched after this list)
  3. Add a toggle for users to enable/disable semantic search
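
A rough sketch of that device check (navigator.deviceMemory is Chromium-only, so treat this as a best-effort heuristic, and the smaller model is assumed to be available as an ONNX port under the Xenova namespace):

// Pick a lighter model on low-memory or mobile devices (heuristic only)
const deviceMemoryGB = (navigator as any).deviceMemory ?? 4; // undefined on Safari/Firefox
const isMobile = /Android|iPhone|iPad/i.test(navigator.userAgent);

const modelName = isMobile || deviceMemoryGB <= 4
  ? 'Xenova/paraphrase-MiniLM-L3-v2' // ~14MB, lower quality (assumed ONNX port)
  : 'Xenova/gte-small';              // ~33MB, better for technical content

const { pipeline } = await import('@xenova/transformers');
const model = await pipeline('feature-extraction', modelName, { quantized: true });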

Future Enhancements

While our implementation provides a pretty decent semantic search experience, there are several enhancements you could add:

  1. Hybrid Search: Combine semantic search with traditional keyword matching for the best of both worlds (a rough sketch follows this list)
  2. Search Analytics: Track what users search for (locally) to improve sample questions
  3. Multi-language Support: Use multilingual models for international audiences
  4. Citation Extraction: Show specific paragraphs that match the query
  5. Query Expansion: Automatically expand queries with synonyms for better coverage
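
As a hint of what the hybrid idea could look like, here is a naive blend of the cosine score with a keyword-overlap bonus (the 70/30 weighting is made up and would need tuning):

// Naive hybrid scoring: blend cosine similarity with a keyword-overlap bonus
function hybridScore(query: string, chunkText: string, cosineScore: number): number {
  const terms = query.toLowerCase().split(/\s+/).filter(t => t.length > 2);
  const text = chunkText.toLowerCase();
  const matched = terms.filter(t => text.includes(t)).length;
  const keywordScore = terms.length ? matched / terms.length : 0;

  // 70% semantic, 30% keyword - an arbitrary starting point
  return 0.7 * cosineScore + 0.3 * keywordScore;
}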

Conclusion

Building local semantic search with Transformers.js provides a powerful, privacy-focused alternative to cloud-based solutions. While there were challenges - from model selection to UI refinements - the end result is a search experience that rivals commercial offerings while keeping user data private and eliminating ongoing costs.

The key insights from our implementation are:

  • Pre-computing embeddings at build time is crucial for performance
  • Model selection matters significantly for search quality
  • Background preloading ensures instant search when needed
  • Glass-morphism UI provides a modern, premium feel
  • Local execution means infinite scalability at zero marginal cost

This implementation delivers intelligent, context-aware search that operates offline, preserves user privacy, and maintains site performance. The solution eliminates recurring costs and API key management while providing enterprise-grade search capabilities.

I truly love local-first AI and I believe that as transformer models become more efficient and browser capabilities expand, we can expect to see increased adoption of edge-based AI features that balance functionality with privacy considerations.