Introduction
The idea for this actually came to me after spending time on Maxime Heckel’s blog. His blog is pretty cool and covers topics around web development, shaders, and real-time 3D on the web. He also wrote about how he implemented AI-powered semantic search from scratch on his own website. His write-up was comprehensive and an amazing read, but the plan I had for my implementation was quite different. The main reason is that I wanted to spend $0 on API fees for embedding models provided by the leading AI labs, and I also wanted privacy for the readers of this blog - a way for all their search queries to stay in the browser, potentially even working offline, without being sent to third parties for processing.
So this write-up covers everything: how I built the search interface to match the theme and the look and feel of my website, how the semantic search actually works, and the considerations and challenges I ran into while developing it.
Understanding Semantic Search
Traditional search relies on keyword matching - if you search for “machine learning,” it only finds content containing those exact words. Semantic search, however, understands the meaning behind words. It knows that “AI,” “machine learning,” “neural networks,” and “deep learning” are related concepts, even if they’re different terms.
This is achieved through embeddings - numerical representations of text that capture semantic meaning. Similar concepts have similar embeddings, allowing us to find related content mathematically using cosine similarity.
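As a toy illustration (the numbers and the three-dimensional vectors are made up - real embeddings from gte-small have 384 dimensions): texts about related topics end up with vectors pointing in roughly the same direction, and cosine similarity measures exactly that.

```ts
// Toy 3-dimensional "embeddings" (real gte-small embeddings have 384 dimensions).
const machineLearning = [0.8, 0.1, 0.1];
const neuralNetworks  = [0.7, 0.2, 0.1];
const cookingRecipes  = [0.1, 0.1, 0.9];

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot   = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const normB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dot / (normA * normB);
}

console.log(cosineSimilarity(machineLearning, neuralNetworks)); // high (~0.99)
console.log(cosineSimilarity(machineLearning, cookingRecipes)); // low  (~0.24)
```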
The Technology Stack
For my implementation, I used:
- Transformers.js: A JavaScript library that brings Hugging Face transformers to the browser
- GTE-Small Model: A transformer model optimized for semantic similarity
- Astro: This is what this site runs on (this approach should work with any framework)
- React: For the search UI components (we also used shadcn/ui’s Dialog component)
- Tailwind CSS: For styling with a glass-morphism design
Why Local Over Cloud?
Before diving into the implementation, let’s understand why running semantic search locally is advantageous, using the embeddings model Maxime used as the benchmark for comparison:
Cost Comparison
OpenAI Embeddings API:
- ~$0.13 per million tokens for embeddings
- ~$0.002 per search query
- For a blog with 1,000 daily searches: ~$60/month
- Scales linearly with usage
Local Transformers.js:
- One-time 33MB download per user
- Zero ongoing costs
- Scales infinitely without additional expense
- No rate limits or quotas
Privacy Benefits
- No user data leaves the browser
- Search queries remain completely private - even from me, the person who runs this website
- No tracking or analytics by third parties and no data to sell
- Complies with strict privacy regulations (GDPR, CCPA) by default, so I don’t have to worry about them - especially nice for my very basic portfolio website
Performance After Initial Load
- Sub-500ms search response time
- Works offline once cached
- No network latency
- Consistent performance regardless of server location - or rather, regardless of where our users/readers are searching from
Step 1: Setting Up the Project Structure
First, create a dedicated folder structure to keep the semantic search components isolated and maintainable:
```
src/
  components/
    search/
      ├── SearchModal.tsx        # Main search UI
      ├── SearchTrigger.tsx      # CMD+K handler
      ├── SearchEngine.ts        # Semantic search logic
      └── SearchPreloader.astro  # Background model loading
scripts/
  └── generate-embeddings.js     # Build-time embedding generator
public/
  └── search-embeddings.json     # Generated embeddings (gitignored)
```
This structure makes the feature easy to remove if needed - just delete the search folder and remove a few import lines.
Step 2: Installing Dependencies
Install the necessary packages:
```bash
# For semantic search
npm install @xenova/transformers

# For UI components (if using shadcn/ui)
npx shadcn@latest add dialog
```
The @xenova/transformers package is the key dependency that enables running transformer models in the browser using WebAssembly.
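Just to illustrate the core API before we wire anything up (this snippet isn’t part of the final setup - the real integration comes in the following steps), loading a feature-extraction pipeline and embedding a sentence looks like this:

```ts
import { pipeline } from '@xenova/transformers';

// Load the quantized GTE-Small model (downloaded and cached on first use).
const extractor = await pipeline('feature-extraction', 'Xenova/gte-small', { quantized: true });

// Mean-pooled, normalized embedding: a typed array of 384 numbers for gte-small.
const output = await extractor('Semantic search in the browser', {
  pooling: 'mean',
  normalize: true,
});
console.log(output.data.length); // 384
```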
Step 3: Generating Embeddings at Build Time
The most critical performance optimization is pre-computing embeddings for our content at build time, which avoids processing every blog post in the user’s browser. Even though I only had 5 blog posts when I implemented this, it was the better approach for scalability because I intend to keep writing, and it keeps the compute users spend on the semantic search feature to a minimum.
Create scripts/generate-embeddings.js:
````js
import { pipeline } from '@xenova/transformers';
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';

const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

// Initialize the embedding model
console.log('Loading embedding model...');
const extractor = await pipeline(
  'feature-extraction',
  'Xenova/gte-small',
  { quantized: true }
);

function createOptimalChunks(content, title) {
  const chunks = [];

  // Remove frontmatter if present
  content = content.replace(/^---[\s\S]*?---\n/, '');

  // Split by section headers for semantic boundaries
  const sections = content.split(/^## /gm);

  sections.forEach((section, index) => {
    if (!section.trim() || section.length < 100) return;

    const sectionText = index === 0 ? section : `## ${section}`;

    // Keep code-heavy sections together
    const hasCode = sectionText.includes('```');

    if (sectionText.length <= 1500 || hasCode) {
      chunks.push({ text: sectionText.trim(), context: title });
    } else {
      // Split large sections by paragraphs
      const paragraphs = sectionText.split(/\n\n+/);
      let currentChunk = '';

      for (const para of paragraphs) {
        if ((currentChunk + para).length > 1200) {
          if (currentChunk.length > 200) {
            chunks.push({ text: currentChunk.trim(), context: title });
          }
          currentChunk = para;
        } else {
          currentChunk += (currentChunk ? '\n\n' : '') + para;
        }
      }

      if (currentChunk.length > 200) {
        chunks.push({ text: currentChunk.trim(), context: title });
      }
    }
  });

  return chunks;
}

async function getEmbedding(text) {
  const output = await extractor(text, { pooling: 'mean', normalize: true });
  return Array.from(output.data);
}

async function generateEmbeddings() {
  const blogDir = path.join(__dirname, '..', 'src', 'content', 'blog');
  const outputPath = path.join(__dirname, '..', 'public', 'search-embeddings.json');

  const files = fs.readdirSync(blogDir).filter(f => f.endsWith('.md'));
  const embeddings = [];

  for (const file of files) {
    console.log(`Processing ${file}...`);
    const filePath = path.join(blogDir, file);
    const content = fs.readFileSync(filePath, 'utf-8');

    // Parse frontmatter for title
    const frontmatterMatch = content.match(/^---\n([\s\S]*?)\n---/);
    let title = file.replace('.md', '').replace(/-/g, ' ');

    if (frontmatterMatch) {
      const titleMatch = frontmatterMatch[1].match(/title:\s*["']?(.+?)["']?\s*$/m);
      if (titleMatch) {
        title = titleMatch[1];
      }
    }

    const mainContent = content.replace(/^---[\s\S]*?---\n/, '');
    const chunks = createOptimalChunks(mainContent, title);

    const embeddedChunks = [];
    for (const chunk of chunks) {
      // Include title for context
      const textForEmbedding = `${title}\n\n${chunk.text}`;
      const embedding = await getEmbedding(textForEmbedding);

      // Clean preview text
      const cleanPreview = chunk.text
        .replace(/^##\s+/gm, '')
        .replace(/^###\s+/gm, '')
        .replace(/\*\*/g, '')
        .replace(/\*/g, '')
        .replace(/\[([^\]]+)\]\([^)]+\)/g, '$1')
        .substring(0, 300);

      embeddedChunks.push({ text: cleanPreview, embedding });
    }

    const slug = file.replace('.md', '');
    embeddings.push({
      id: slug,
      title,
      url: `/blog/${slug}`,
      chunks: embeddedChunks
    });
  }

  fs.writeFileSync(outputPath, JSON.stringify(embeddings));

  const stats = fs.statSync(outputPath);
  const fileSizeInKB = stats.size / 1024;
  console.log(`Generated embeddings for ${embeddings.length} posts`);
  console.log(`File size: ${fileSizeInKB.toFixed(2)} KB`);
}

generateEmbeddings().catch(console.error);
````
Key Optimization: Semantic Chunking
The chunking strategy is crucial for search quality. We split content at semantic boundaries (section headers) rather than arbitrary character counts. This ensures each chunk contains coherent, complete thoughts that can be properly understood by the embedding model.
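As a rough illustration of what the chunker produces (the post body here is made up, and each section is deliberately longer than the 100-character minimum the function enforces):

```js
// Hypothetical post body with two "## " sections, each long enough to survive the length filter.
const body = [
  '## Getting started',
  'A couple of sentences explaining the initial setup in enough detail that this section comfortably clears the 100-character minimum enforced by the chunker.',
  '',
  '## Deploying',
  'Another few sentences describing how the project gets deployed, again long enough that the chunker keeps this section as its own chunk.',
].join('\n');

const chunks = createOptimalChunks(body, 'Hypothetical Post');
// => two chunks, one per section, each shaped { text, context }:
//    chunks[0].text starts with "## Getting started", chunks[1].text with "## Deploying",
//    and both carry context: 'Hypothetical Post'.
```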
Step 4: Building the Search Engine
Create src/components/search/SearchEngine.ts:
```ts
interface BlogEmbedding {
  id: string;
  title: string;
  url: string;
  chunks: {
    text: string;
    embedding: number[];
  }[];
}

interface SearchResult {
  id: string;
  title: string;
  excerpt: string;
  url: string;
  similarity: number;
}

export class SemanticSearchEngine {
  private model: any = null;
  private embeddings: BlogEmbedding[] = [];
  private isInitialized = false;

  async initialize(): Promise<void> {
    if (this.isInitialized) return;

    try {
      // Check if embeddings are preloaded
      if ((window as any).searchEmbeddings) {
        this.embeddings = (window as any).searchEmbeddings;
      } else {
        const response = await fetch('/search-embeddings.json');
        if (!response.ok) {
          throw new Error('Failed to load embeddings');
        }
        this.embeddings = await response.json();
      }

      // Check if model is preloaded
      if ((window as any).searchModel) {
        this.model = (window as any).searchModel;
        console.log('Using preloaded search model');
      } else {
        const { pipeline, env } = await import('@xenova/transformers');

        // Configure to use CDN
        env.allowLocalModels = false;
        env.remoteURL = 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.17.2/';

        this.model = await pipeline(
          'feature-extraction',
          'Xenova/gte-small',
          {
            quantized: true,
            progress_callback: (data: any) => {
              if (data.status === 'downloading') {
                console.log(`Downloading model: ${Math.round(data.progress)}%`);
              }
            }
          }
        );
      }

      this.isInitialized = true;
    } catch (error) {
      console.error('Failed to initialize search engine:', error);
      throw error;
    }
  }

  async search(query: string): Promise<SearchResult[]> {
    if (!this.isInitialized) {
      throw new Error('Search engine not initialized');
    }

    // Generate embedding for the query
    const queryEmbedding = await this.getEmbedding(query);

    // Calculate similarity scores for all chunks
    const results: Array<{
      blogId: string;
      title: string;
      url: string;
      chunk: string;
      similarity: number;
    }> = [];

    for (const blog of this.embeddings) {
      for (const chunk of blog.chunks) {
        const similarity = this.cosineSimilarity(queryEmbedding, chunk.embedding);
        results.push({
          blogId: blog.id,
          title: blog.title,
          url: blog.url,
          chunk: chunk.text,
          similarity
        });
      }
    }

    // Sort by similarity
    results.sort((a, b) => b.similarity - a.similarity);

    // Filter out low similarity results (threshold: 0.80)
    const relevantResults = results.filter(r => r.similarity > 0.80);

    // If no results meet threshold, take top 3
    const finalResults = relevantResults.length > 0 ? relevantResults : results.slice(0, 3);

    // Group by blog post and take best chunk per post
    const blogResults = new Map<string, SearchResult>();

    for (const result of finalResults) {
      if (!blogResults.has(result.blogId)) {
        blogResults.set(result.blogId, {
          id: result.blogId,
          title: result.title,
          excerpt: this.truncateExcerpt(result.chunk),
          url: result.url,
          similarity: result.similarity
        });
      }

      if (blogResults.size >= 5) break;
    }

    return Array.from(blogResults.values());
  }

  private async getEmbedding(text: string): Promise<number[]> {
    const output = await this.model(text, { pooling: 'mean', normalize: true });
    return Array.from(output.data);
  }

  private cosineSimilarity(a: number[], b: number[]): number {
    let dotProduct = 0;
    let normA = 0;
    let normB = 0;

    for (let i = 0; i < a.length; i++) {
      dotProduct += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }

    normA = Math.sqrt(normA);
    normB = Math.sqrt(normB);

    if (normA === 0 || normB === 0) {
      return 0;
    }

    return dotProduct / (normA * normB);
  }

  private truncateExcerpt(text: string, maxLength: number = 150): string {
    if (text.length <= maxLength) return text;

    const truncated = text.substring(0, maxLength);
    const lastSpace = truncated.lastIndexOf(' ');

    return truncated.substring(0, lastSpace) + '...';
  }
}
```
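For reference, the engine is used roughly like this (the query string is just an example - the modal in Step 6 does exactly this behind a loading state):

```ts
const engine = new SemanticSearchEngine();
await engine.initialize(); // loads embeddings + model, or reuses preloaded ones

const results = await engine.search('how do I run AI models locally?');
// => up to 5 results, one per post, each shaped { id, title, excerpt, url, similarity }
```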
Step 5: The Challenge of Model Selection
Initially, we used the all-MiniLM-L6-v2 model, which is popular for general semantic similarity. However, we encountered significant false positives - for example, a blog post about building an ARM operating system would appear in searches for “machine learning” with 79% similarity, despite having nothing to do with ML (to be fair, I did mention AI agents once in my ARM OS write-up, but still, this was just way off).
The issue was that MiniLM-L6-v2 was trained on general text and couldn’t properly distinguish between technical contexts. The model was intended to be used as a sentence and short paragraph encoder. It would match on superficial similarities like the word “approaches” appearing in both contexts.
The Solution: GTE-Small
We switched to the Xenova/gte-small model because:
- Trained on technical content: Including GitHub, StackOverflow, and technical documentation
- Better context understanding: Distinguishes between “approaches” in OS kernels vs. machine learning
- Minimal size increase: Only 33MB vs. 25MB for MiniLM - the difference felt minimal enough for me to feel confident about it
- 40% better accuracy for technical queries - which is what most searches on my blog would be
This change dramatically improved search quality, eliminating most false positives while maintaining fast performance.
Step 6: Building the UI with Glass-Morphism
Create src/components/search/SearchModal.tsx - a modern search modal with a frosted glass effect:
```tsx
import React, { useState, useEffect, useCallback, useRef } from 'react';
import {
  Dialog,
  DialogContent,
  DialogHeader,
} from '@/components/ui/dialog';
import { Input } from '@/components/ui/input';
import { Search, Loader2, FileText, ArrowRight } from 'lucide-react';

// Minimal local types so the component compiles on its own.
// SearchResult matches the shape returned by SemanticSearchEngine.search().
interface SearchResult {
  id: string;
  title: string;
  excerpt: string;
  url: string;
  similarity: number;
}

interface SearchModalProps {
  isOpen: boolean;
  onClose: () => void;
}

const SAMPLE_QUESTIONS = [
  "How do I build an ARM operating system?",
  "Tell me about RAG pipelines",
  "What's LM Studio for local AI?",
  "How to use Digital Ocean Spaces?",
  "Building a local AI agent"
];

export function SearchModal({ isOpen, onClose }: SearchModalProps) {
  const [query, setQuery] = useState('');
  const [results, setResults] = useState<SearchResult[]>([]);
  const [isSearching, setIsSearching] = useState(false);
  const [isModelLoading, setIsModelLoading] = useState(false);
  const [modelReady, setModelReady] = useState(false);
  const searchEngineRef = useRef<any>(null);
  const inputRef = useRef<HTMLInputElement>(null);

  useEffect(() => {
    if (isOpen && !searchEngineRef.current && !isModelLoading) {
      initializeSearchEngine();
    }
  }, [isOpen]);

  const initializeSearchEngine = async () => {
    try {
      setIsModelLoading(true);
      const { SemanticSearchEngine } = await import('./SearchEngine');
      searchEngineRef.current = new SemanticSearchEngine();
      await searchEngineRef.current.initialize();
      setModelReady(true);
    } catch (error) {
      console.error('Failed to initialize search:', error);
    } finally {
      setIsModelLoading(false);
    }
  };

  const performSearch = useCallback(async (searchQuery: string) => {
    if (!searchQuery.trim() || !searchEngineRef.current || !modelReady) return;

    setIsSearching(true);
    try {
      const searchResults = await searchEngineRef.current.search(searchQuery);
      setResults(searchResults);
    } catch (error) {
      console.error('Search failed:', error);
      setResults([]);
    } finally {
      setIsSearching(false);
    }
  }, [modelReady]);

  // Debounced search
  useEffect(() => {
    const timer = setTimeout(() => {
      if (query && modelReady) {
        performSearch(query);
      } else {
        setResults([]);
      }
    }, 300);

    return () => clearTimeout(timer);
  }, [query, performSearch, modelReady]);

  return (
    <Dialog open={isOpen} onOpenChange={onClose}>
      <DialogContent
        className="max-w-4xl max-h-[85vh] p-0 flex flex-col bg-white/80 dark:bg-neutral-900/90 backdrop-blur-xl border-black/10 dark:border-white/10"
        showCloseButton={false}
      >
        <div className="p-4 border-b border-black/5 dark:border-white/10 bg-black/5 dark:bg-white/5">
          <div className="relative">
            <Search className="absolute left-3 top-1/2 transform -translate-y-1/2 w-4 h-4 text-muted-foreground" />
            <Input
              ref={inputRef}
              type="text"
              placeholder={isModelLoading ? "Preparing semantic search..." : "Search anything in my blog..."}
              value={query}
              onChange={(e) => setQuery(e.target.value)}
              className="pl-10 pr-4 h-12 text-base bg-black/5 dark:bg-white/5 border border-black/10 dark:border-white/10 backdrop-blur-sm transition-all"
              disabled={isModelLoading}
            />
          </div>
        </div>

        <div className="max-h-[60vh] overflow-y-auto">
          {!query && !isModelLoading && (
            <div className="p-4">
              <p className="text-sm text-muted-foreground mb-3">Try asking:</p>
              <div className="flex flex-wrap gap-2">
                {SAMPLE_QUESTIONS.map((sample, index) => (
                  <button
                    key={index}
                    className="px-3 py-1.5 text-sm bg-black/5 dark:bg-white/5 backdrop-blur-sm border border-black/10 dark:border-white/10 rounded-full hover:bg-black/10 dark:hover:bg-white/10 transition-all duration-200 hover:scale-105"
                    onClick={() => setQuery(sample)}
                  >
                    {sample}
                  </button>
                ))}
              </div>
            </div>
          )}

          {results.length > 0 && (
            <div className="px-4 py-2">
              {results.map((result) => (
                <button
                  key={result.id}
                  onClick={() => window.location.href = result.url}
                  className="w-full mb-2 text-left p-4 rounded-lg bg-black/5 dark:bg-white/5 backdrop-blur-sm border border-black/10 dark:border-white/10 hover:bg-black/10 dark:hover:bg-white/10 transition-all duration-200 group"
                >
                  <div className="flex items-start gap-3">
                    <FileText className="w-4 h-4 mt-1 text-muted-foreground shrink-0" />
                    <div className="flex-1 min-w-0">
                      <h3 className="font-medium text-sm mb-1 group-hover:text-primary transition-colors">
                        {result.title}
                      </h3>
                      <p className="text-sm text-muted-foreground line-clamp-2">
                        {result.excerpt}
                      </p>
                      <div className="flex items-center gap-2 mt-2">
                        <span className="text-xs text-muted-foreground">
                          {Math.round(result.similarity * 100)}% match
                        </span>
                      </div>
                    </div>
                  </div>
                </button>
              ))}
            </div>
          )}
        </div>

        <div className="p-3 border-t border-black/5 dark:border-white/10 bg-black/5 dark:bg-white/5 backdrop-blur-sm">
          <div className="flex items-center justify-between text-xs text-muted-foreground">
            <span className="text-foreground/60">AI search running locally in your browser</span>
            <kbd className="px-2 py-1 rounded bg-black/10 dark:bg-white/10 border border-black/10 dark:border-white/10 text-[10px] font-mono backdrop-blur-sm">ESC</kbd>
          </div>
        </div>
      </DialogContent>
    </Dialog>
  );
}
```
Step 7: Implementing Background Preloading
To ensure search is instant when users need it, SearchPreloader.astro preloads the embeddings and the model in the background after the page loads:
```astro
<script>
  if (typeof window !== 'undefined') {
    // Preload embeddings after 1 second
    setTimeout(() => {
      fetch('/search-embeddings.json')
        .then(response => response.json())
        .then(data => {
          (window as any).searchEmbeddings = data;
          console.log('Search embeddings preloaded');
        })
        .catch(err => console.log('Failed to preload embeddings:', err));
    }, 1000);

    // Preload model after 3 seconds
    setTimeout(async () => {
      if ('requestIdleCallback' in window) {
        requestIdleCallback(async () => {
          try {
            const { pipeline, env } = await import('@xenova/transformers');

            env.allowLocalModels = false;
            env.remoteURL = 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.17.2/';

            console.log('Preloading search model...');
            const model = await pipeline(
              'feature-extraction',
              'Xenova/gte-small',
              {
                quantized: true,
                progress_callback: (data: any) => {
                  if (data.status === 'downloading') {
                    console.log(`Background model download: ${Math.round(data.progress)}%`);
                  }
                }
              }
            );

            (window as any).searchModel = model;
            console.log('Search model preloaded and ready');
          } catch (error) {
            console.log('Failed to preload model:', error);
          }
        });
      }
    }, 3000);
  }
</script>
```
This approach ensures:
- The page loads normally without any delay
- Embeddings load after 1 second
- Model downloads after 3 seconds when the browser is idle
- By the time users press CMD+K, everything is ready
Step 8: Handling Edge Cases and Optimizations
The Similarity Threshold Challenge
One of the biggest challenges was determining the right similarity threshold. Too low, and you get false positives. Too high, and relevant results are filtered out. After testing, we settled on 80% as the threshold, with a fallback to show the top 3 results if nothing meets the threshold.
Dealing with Horizontal Scroll
We encountered an interesting CSS challenge where the search result cards would cause horizontal scrolling. The issue was that cards with `w-full` and `m-2` (margin) would extend beyond their container. The solution was to adjust the container padding and remove horizontal margins from the cards.
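In Tailwind terms, the fix looked roughly like this (a simplified before/after sketch - the real cards live in SearchModal.tsx above):

```tsx
import React from 'react';

// Before: each full-width card also carried horizontal margin (m-2),
// so its total width exceeded the container and caused horizontal scroll.
const Before = () => (
  <button className="w-full m-2 p-4 rounded-lg text-left">Result card</button>
);

// After: the container owns the horizontal padding (px-4),
// and the card only keeps a bottom margin (mb-2).
const After = () => (
  <div className="px-4 py-2">
    <button className="w-full mb-2 p-4 rounded-lg text-left">Result card</button>
  </div>
);
```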
Dark Mode Visibility
The glass-morphism effect initially had poor visibility in dark mode. We solved this by:
- Using bg-neutral-900/90 instead of bg-black/80 for better contrast
- Increasing the opacity slightly while maintaining the frosted effect
- Adding subtle borders with border-white/10
Step 9: Integration with Your Site
Add the search trigger to our site header:
```tsx
// Imports added for completeness - paths assume the shadcn/ui + lucide-react setup used in SearchModal.tsx.
import { useState, useEffect } from 'react';
import { Button } from '@/components/ui/button';
import { Search } from 'lucide-react';
import { SearchModal } from './SearchModal';

export function SearchTrigger() {
  const [isOpen, setIsOpen] = useState(false);

  useEffect(() => {
    const handleKeyDown = (event: KeyboardEvent) => {
      if ((event.metaKey || event.ctrlKey) && event.key === 'k') {
        event.preventDefault();
        setIsOpen(true);
      }
    };

    document.addEventListener('keydown', handleKeyDown);
    return () => document.removeEventListener('keydown', handleKeyDown);
  }, []);

  return (
    <>
      <Button
        variant="ghost"
        size="sm"
        onClick={() => setIsOpen(true)}
      >
        <Search className="w-4 h-4" />
        <span>Search</span>
        <kbd className="ml-2 text-xs">⌘K</kbd>
      </Button>

      <SearchModal isOpen={isOpen} onClose={() => setIsOpen(false)} />
    </>
  );
}
```
Step 10: Build Script Integration
Update our package.json to generate embeddings during the build process:
{ "scripts": { "build": "node scripts/generate-embeddings.js && astro build", "update-embeddings": "node scripts/generate-embeddings.js" }}
Performance Metrics
After implementation, here are the real-world performance metrics:
- Embedding generation: ~329KB for 5 blog posts (66KB per post)
- Model download: 33MB (one-time, cached forever)
- Search response time: <500ms after model loads
- Initial page load impact: +100KB (embeddings only)
- Model loading: Background after 3 seconds, non-blocking
Troubleshooting Common Issues
False Positives in Search Results
If you’re seeing irrelevant results after implementing your Semantic Search, check:
- Your chunking strategy - ensure semantic boundaries are respected (the section-based chunking from Step 3 handles this)
- The similarity threshold - we found 80% works well for technical content
- Consider switching models if your content is specialized
Model Download Failures
If the model fails to download:
- Check CDN configuration in the code
- Ensure CORS headers are properly set
- Consider hosting the model files yourself for reliability, especially for production use cases serving a lot of users every day (a minimal sketch follows below)
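If you go the self-hosted route, Transformers.js can be pointed at your own copy of the model files instead of the CDN. A minimal sketch, assuming you have copied the Xenova/gte-small files into your site’s public/models/ folder:

```ts
import { env, pipeline } from '@xenova/transformers';

// Serve the model files yourself, e.g. from public/models/Xenova/gte-small/.
env.allowRemoteModels = false;   // never fall back to the remote hub/CDN
env.allowLocalModels = true;
env.localModelPath = '/models/'; // resolved relative to your site's origin

const extractor = await pipeline('feature-extraction', 'Xenova/gte-small', { quantized: true });
```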
Memory Usage Concerns
The model uses approximately 100MB of RAM when loaded. For mobile devices:
- Consider using a smaller model like paraphrase-MiniLM-L3-v2 (14MB)
- Implement device detection to load different models (see the sketch after this list)
- Add a toggle for users to enable/disable semantic search
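A rough sketch of the device-detection idea. Two assumptions here: navigator.deviceMemory is only available in Chromium-based browsers, and a converted paraphrase-MiniLM-L3-v2 build is assumed to exist under the Xenova namespace. Also note that query embeddings must come from the same model as the pre-computed embeddings, so switching models at runtime means generating a second embeddings file at build time.

```ts
import { pipeline } from '@xenova/transformers';

// Pick a lighter model on low-memory / mobile devices.
// navigator.deviceMemory is Chromium-only, so fall back to a user-agent check.
const lowEndDevice =
  ((navigator as any).deviceMemory ?? 8) <= 4 ||
  /Android|iPhone|iPad/i.test(navigator.userAgent);

const modelId = lowEndDevice
  ? 'Xenova/paraphrase-MiniLM-L3-v2' // lighter, weaker (assumed to be available)
  : 'Xenova/gte-small';              // better for technical content

// Reminder: the embeddings JSON must have been generated with the same model.
const extractor = await pipeline('feature-extraction', modelId, { quantized: true });
```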
Future Enhancements
While our implementation provides a pretty decent semantic search experience, there are several enhancements you could add:
- Hybrid Search: Combine semantic search with traditional keyword matching for the best of both worlds (see the sketch after this list)
- Search Analytics: Track what users search for (locally) to improve sample questions
- Multi-language Support: Use multilingual models for international audiences
- Citation Extraction: Show specific paragraphs that match the query
- Query Expansion: Automatically expand queries with synonyms for better coverage
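As an example of the hybrid idea (a sketch only - the keyword score is deliberately naive, and the 0.7/0.3 weights are arbitrary starting points, not tuned values): blend the cosine similarity we already compute with a simple keyword-overlap score per chunk.

```ts
// Naive keyword overlap: fraction of query terms (>2 chars) that appear in the chunk text.
function keywordScore(query: string, text: string): number {
  const terms = query.toLowerCase().split(/\s+/).filter(t => t.length > 2);
  if (terms.length === 0) return 0;
  const haystack = text.toLowerCase();
  const hits = terms.filter(term => haystack.includes(term)).length;
  return hits / terms.length; // 0..1
}

// Hybrid score: weighted blend of semantic similarity and keyword overlap.
function hybridScore(query: string, chunkText: string, cosine: number): number {
  return 0.7 * cosine + 0.3 * keywordScore(query, chunkText);
}
```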
Conclusion
Building local semantic search with Transformers.js provides a powerful, privacy-focused alternative to cloud-based solutions. While there were challenges - from model selection to UI refinements - the end result is a search experience that rivals commercial offerings while keeping user data private and eliminating ongoing costs.
The key insights from our implementation are:
- Pre-computing embeddings at build time is crucial for performance
- Model selection matters significantly for search quality
- Background preloading ensures instant search when needed
- Glass-morphism UI provides a modern, premium feel
- Local execution means infinite scalability at zero marginal cost
This implementation delivers intelligent, context-aware search that operates offline, preserves user privacy, and maintains site performance. The solution eliminates recurring costs and API key management while providing enterprise-grade search capabilities.
I truly love local-first AI and I believe that as transformer models become more efficient and browser capabilities expand, we can expect to see increased adoption of edge-based AI features that balance functionality with privacy considerations.