Morph Compact

Blazing Fast Compaction

33,000 tok/s on a custom inference engine. Shrink context by 50-70% while keeping every surviving sentence verbatim. Your agents run for hours, not minutes.
Context compaction visualization — large input compressed to compact output

Try it out

Paste context. See it shrink.


Compaction that actually improves performance

Compaction, not summarization

Summarization rewrites your context; in Factory's eval, summarized context scored only 3.4-3.7 out of 5 on accuracy. Compact instead deletes filler and keeps every surviving sentence word-for-word.
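That guarantee is checkable: because compaction only deletes, every sentence in the output must appear verbatim in the input. A minimal sketch of that check, with hypothetical example texts (not real Morph output) and a naive period-based sentence split:

```python
def is_pure_compaction(original: str, compacted: str) -> bool:
    """True if every sentence in `compacted` appears verbatim in `original`.

    Naive period-based split; real text needs a proper sentence segmenter.
    """
    def sentences(text: str) -> set[str]:
        return {s.strip() for s in text.split(".") if s.strip()}

    return sentences(compacted) <= sentences(original)

context = "The build failed. Retrying now. The error was a missing import. Done."
compacted = "The build failed. The error was a missing import."
summary = "A failed build was fixed by adding an import."

print(is_pure_compaction(context, compacted))  # True: only deletions
print(is_pure_compaction(context, summary))    # False: text was rewritten
```

A summarizer fails this check by construction, which is exactly why rewritten context can drift from the source while compacted context cannot.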

Under 3 seconds at 33,000 tok/s

A custom inference engine makes compaction fast enough to run inline before every LLM call, not just at the 95%-capacity cliff.

Token cost reduction — descending cost bars
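Running compaction inline on every turn looks roughly like the sketch below. `compact` and `call_llm` are hypothetical stand-ins (not the Morph SDK); in production `compact` would call the compaction endpoint and `call_llm` any chat-completion API:

```python
# Sketch of inline compaction before every model call.
# `compact` and `call_llm` are hypothetical placeholders, not real APIs.

def compact(text: str) -> str:
    """Placeholder for the compaction endpoint; identity so the sketch runs."""
    return text

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion call."""
    return f"(reply to {len(prompt)} chars of context)"

def run_turn(history: list[str], user_msg: str) -> str:
    """Append the user message, compact the full context, then call the model."""
    history.append(user_msg)
    prompt = compact("\n".join(history))  # every turn, not only near the limit
    reply = call_llm(prompt)
    history.append(reply)
    return reply

history: list[str] = []
print(run_turn(history, "Summarize the failing test output."))
```

The design point is where the call sits: compaction cheap enough for the hot path replaces the usual "compact only when the window is nearly full" fallback.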

Works with web search

Agents running web searches pull back 10k+ tokens per page. Compact shrinks search results to the signal in under 300ms, so downstream models stay fast and don't lose the thread.

Web search context compressed for agent consumption
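The same idea applied to fetched pages: shrink each result before it enters the agent's context. Here `compact` is a crude heuristic stand-in (keep only lines longer than three words), not the real service, and token counts use a rough 4-chars-per-token estimate:

```python
# Sketch: shrinking a fetched page before appending it to agent context.
# `compact` is a toy heuristic, not the Morph service.

def rough_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def compact(text: str) -> str:
    """Toy filter: drop short nav/boilerplate lines, keep content lines."""
    return "\n".join(ln for ln in text.splitlines() if len(ln.split()) > 3)

page = "\n".join([
    "Home About Login",
    "Search",
    "The fix is to pin urllib3 below version 2.",
    "Copyright 2024",
    "Share this",
    "This resolves the TLS handshake error seen in CI.",
])
shrunk = compact(page)
print(rough_tokens(page), "->", rough_tokens(shrunk))
```

Only the two content sentences survive; the nav and footer lines never reach the downstream model.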

Enable 24+ hour agent sessions




1.5 min vs 2.5s compaction



Stop restarting sessions. Start shipping.

Get Started

Free tier available. Pay only for what you use.