# Quickawards Performance Optimization Plan
## Overview
This document outlines the comprehensive optimization plan for Quickawards, focusing primarily on resolving critical performance issues in QSO statistics queries.
## Critical Performance Issue
### Current Problem
The `getQSOStats()` function loads ALL user QSOs into memory before calculating statistics:
- **Location**: `src/backend/services/lotw.service.js:496-517`
- **Impact**: Users with 200k QSOs experience 5-10 second page loads
- **Memory Usage**: 100MB+ per request
- **Concurrent Users**: Limited to 2-3 due to memory pressure
### Root Cause
```javascript
// Current implementation (PROBLEMATIC)
export async function getQSOStats(userId) {
  const allQSOs = await db.select().from(qsos).where(eq(qsos.userId, userId));
  // Loads 200k+ records into memory
  // ... processes with .filter() and .forEach()
}
```
### Target Performance
- **Query Time**: <100ms for 200k QSO users (currently 5-10 seconds)
- **Memory Usage**: <1MB per request (currently 100MB+)
- **Concurrent Users**: Support 50+ concurrent users
## Optimization Plan
### Phase 1: Emergency Performance Fix (Week 1)
#### 1.1 SQL Query Optimization
**File**: `src/backend/services/lotw.service.js`
Replace the memory-intensive `getQSOStats()` function with SQL-based aggregates:
```javascript
// Optimized implementation
export async function getQSOStats(userId) {
  const [basicStats, uniqueStats] = await Promise.all([
    // Basic statistics
    db.select({
      total: sql`COUNT(*)`,
      confirmed: sql`SUM(CASE WHEN lotw_qsl_rstatus = 'Y' OR dcl_qsl_rstatus = 'Y' THEN 1 ELSE 0 END)`
    }).from(qsos).where(eq(qsos.userId, userId)),
    // Unique counts
    db.select({
      uniqueEntities: sql`COUNT(DISTINCT entity)`,
      uniqueBands: sql`COUNT(DISTINCT band)`,
      uniqueModes: sql`COUNT(DISTINCT mode)`
    }).from(qsos).where(eq(qsos.userId, userId))
  ]);
  return {
    total: basicStats[0].total,
    confirmed: basicStats[0].confirmed,
    uniqueEntities: uniqueStats[0].uniqueEntities,
    uniqueBands: uniqueStats[0].uniqueBands,
    uniqueModes: uniqueStats[0].uniqueModes,
  };
}
```
**Benefits**:
- Query executes entirely in SQLite
- Only returns 5 integers instead of 200k+ objects
- Reduces memory from 100MB+ to <1MB
- Expected query time: 50-100ms for 200k QSOs
#### 1.2 Critical Database Indexes
**File**: `src/backend/migrations/add-performance-indexes.js` (extend existing file)
Add essential indexes for QSO statistics queries:
```javascript
// Index for primary user queries
await db.run(sql`CREATE INDEX IF NOT EXISTS idx_qsos_user_primary ON qsos(user_id)`);
// Index for confirmation status queries
await db.run(sql`CREATE INDEX IF NOT EXISTS idx_qsos_user_confirmed ON qsos(user_id, lotw_qsl_rstatus, dcl_qsl_rstatus)`);
// Index for unique counts (entity, band, mode)
await db.run(sql`CREATE INDEX IF NOT EXISTS idx_qsos_user_unique_counts ON qsos(user_id, entity, band, mode)`);
```
**Benefits**:
- Speeds up WHERE clause filtering by 10-100x
- Optimizes COUNT(DISTINCT) operations
- Critical for sub-100ms query times
#### 1.3 Testing & Validation
**Test Cases**:
1. Small dataset (1k QSOs): Query time <10ms
2. Medium dataset (50k QSOs): Query time <50ms
3. Large dataset (200k QSOs): Query time <100ms
**Validation Steps**:
1. Run test queries with logging enabled
2. Compare memory usage before/after
3. Verify frontend receives identical API response format
4. Load test with 50 concurrent users
**Success Criteria**:
- Query time <100ms for 200k QSOs
- Memory usage <1MB per request
- API response format unchanged
- No errors in production for 1 week
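The benchmark below is a minimal sketch of how the validation steps could be scripted. It assumes the optimized `getQSOStats()` from 1.1 and local test users with roughly 1k/50k/200k QSOs; the script path and user IDs are placeholders, not existing code.
```javascript
// scripts/bench-qso-stats.js (hypothetical path)
// Measures getQSOStats() latency per test user against the Phase 1 budgets.
import { getQSOStats } from '../src/backend/services/lotw.service.js';

const testUsers = [
  { id: 1, label: '1k QSOs', budgetMs: 10 },   // placeholder user IDs
  { id: 2, label: '50k QSOs', budgetMs: 50 },
  { id: 3, label: '200k QSOs', budgetMs: 100 },
];

for (const user of testUsers) {
  const start = performance.now();
  const stats = await getQSOStats(user.id);
  const duration = performance.now() - start;
  const verdict = duration <= user.budgetMs ? 'PASS' : 'FAIL';
  console.log(`${user.label}: ${duration.toFixed(2)}ms (budget ${user.budgetMs}ms) ${verdict}`, stats);
}
```
Running this before and after the index migration makes the before/after delta visible and doubles as a regression check for later phases.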
### Phase 2: Stability & Monitoring (Week 2)
#### 2.1 Basic Caching Layer
**File**: `src/backend/services/lotw.service.js`
Add 5-minute TTL cache for QSO statistics:
```javascript
const statsCache = new Map();

export async function getQSOStats(userId) {
  const cacheKey = `stats_${userId}`;
  const cached = statsCache.get(cacheKey);
  if (cached && Date.now() - cached.timestamp < 300000) { // 5 minutes
    return cached.data;
  }
  // Run optimized SQL query (from Phase 1.1)
  const stats = await calculateStatsWithSQL(userId);
  statsCache.set(cacheKey, {
    data: stats,
    timestamp: Date.now()
  });
  return stats;
}

// Invalidate cache after QSO syncs
export async function invalidateStatsCache(userId) {
  statsCache.delete(`stats_${userId}`);
}
```
**Benefits**:
- Cache hit: <1ms response time
- Reduces database load by 80-90%
- Automatic cache invalidation after syncs
#### 2.2 Performance Monitoring
**File**: `src/backend/utils/logger.js` (extend existing)
Add query performance tracking:
```javascript
export async function trackQueryPerformance(queryName, fn) {
  const start = performance.now();
  const result = await fn();
  const duration = performance.now() - start;
  logger.debug('Query Performance', {
    query: queryName,
    duration: `${duration.toFixed(2)}ms`,
    threshold: duration > 100 ? 'SLOW' : 'OK'
  });
  if (duration > 500) {
    logger.warn('Slow query detected', { query: queryName, duration: `${duration.toFixed(2)}ms` });
  }
  return result;
}

// Usage in getQSOStats:
const stats = await trackQueryPerformance('getQSOStats', () =>
  calculateStatsWithSQL(userId)
);
```
**Benefits**:
- Detect performance regressions early
- Identify slow queries in production
- Data-driven optimization decisions
#### 2.3 Cache Invalidation Hooks
**Files**: `src/backend/services/lotw.service.js`, `src/backend/services/dcl.service.js`
Invalidate cache after QSO imports:
```javascript
// lotw.service.js - after syncQSOs()
export async function syncQSOs(userId, lotwUsername, lotwPassword, sinceDate, jobId) {
  // ... existing sync logic ...
  await invalidateStatsCache(userId);
}

// dcl.service.js - after syncQSOs()
export async function syncQSOs(userId, dclApiKey, sinceDate, jobId) {
  // ... existing sync logic ...
  await invalidateStatsCache(userId);
}
```
#### 2.4 Monitoring Dashboard
**File**: Create `src/backend/routes/health.js` (or extend existing health endpoint)
Add performance metrics to health check:
```javascript
app.get('/api/health', async (req) => {
  return {
    status: 'healthy',
    uptime: process.uptime(),
    database: await checkDatabaseHealth(),
    performance: {
      avgQueryTime: getAverageQueryTime(),
      cacheHitRate: getCacheHitRate(),
      slowQueriesCount: getSlowQueriesCount()
    }
  };
});
```
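The handler above calls `getAverageQueryTime()`, `getCacheHitRate()`, and `getSlowQueriesCount()`, which do not exist yet. A minimal in-memory sketch of those helpers follows; the module path and function names are assumptions, to be fed from `trackQueryPerformance()` and the cache layer:
```javascript
// src/backend/utils/metrics.js (proposed, not yet in the codebase)
// Simple in-process counters; reset on restart, which is fine for a health endpoint.
const queryDurations = [];
let slowQueries = 0;
let cacheHits = 0;
let cacheMisses = 0;

export function recordQuery(durationMs) {
  queryDurations.push(durationMs);
  if (durationMs > 500) slowQueries++;
  if (queryDurations.length > 1000) queryDurations.shift(); // keep a rolling window
}

export function recordCacheHit() { cacheHits++; }
export function recordCacheMiss() { cacheMisses++; }

export function getAverageQueryTime() {
  if (queryDurations.length === 0) return 0;
  return queryDurations.reduce((sum, d) => sum + d, 0) / queryDurations.length;
}

export function getCacheHitRate() {
  const total = cacheHits + cacheMisses;
  return total > 0 ? (cacheHits / total) * 100 : 0;
}

export function getSlowQueriesCount() {
  return slowQueries;
}
```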
### Phase 3: Scalability Enhancements (Month 1)
#### 3.1 SQLite Configuration Optimization
**File**: `src/backend/db/index.js`
Optimize SQLite for read-heavy workloads:
```javascript
const db = new Database('data/award.db');
// Enable WAL mode for better concurrency
db.pragma('journal_mode = WAL');
// Increase cache size (default is about 2 MB; negative values are in KiB, so -100000 ≈ 100 MB)
db.pragma('cache_size = -100000');
// Optimize for SELECT queries
db.pragma('synchronous = NORMAL'); // Balance between safety and speed
db.pragma('temp_store = MEMORY'); // Keep temporary tables in RAM
db.pragma('mmap_size = 30000000000'); // Memory-map database (30GB limit)
```
**Benefits**:
- WAL mode allows concurrent reads
- Larger cache reduces disk I/O
- Memory-mapped I/O for faster access
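To confirm the pragmas actually took effect, they can be read back after opening the database. A short verification sketch, assuming the same better-sqlite3-style `db.pragma()` API used in the snippet above:
```javascript
// Read the applied pragmas back; { simple: true } returns the bare value
console.log('journal_mode:', db.pragma('journal_mode', { simple: true })); // expect 'wal'
console.log('cache_size:', db.pragma('cache_size', { simple: true }));     // expect -100000
console.log('synchronous:', db.pragma('synchronous', { simple: true }));   // expect 1 (NORMAL)
```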
#### 3.2 Materialized Views for Large Datasets
**File**: Create `src/backend/migrations/create-materialized-views.js`
For users with >50k QSOs, create pre-computed statistics:
```javascript
// Create table for pre-computed stats
await db.run(sql`
  CREATE TABLE IF NOT EXISTS qso_stats_cache (
    user_id INTEGER PRIMARY KEY,
    total INTEGER,
    confirmed INTEGER,
    unique_entities INTEGER,
    unique_bands INTEGER,
    unique_modes INTEGER,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
  )
`);

// SQLite triggers fire on a single event, so one trigger per event is needed.
// The AFTER INSERT trigger is shown here; the UPDATE trigger is identical, and
// the DELETE trigger references OLD.user_id instead of NEW.user_id.
await db.run(sql`
  CREATE TRIGGER IF NOT EXISTS update_qso_stats_insert
  AFTER INSERT ON qsos
  BEGIN
    INSERT OR REPLACE INTO qso_stats_cache (user_id, total, confirmed, unique_entities, unique_bands, unique_modes, updated_at)
    SELECT
      user_id,
      COUNT(*) as total,
      SUM(CASE WHEN lotw_qsl_rstatus = 'Y' OR dcl_qsl_rstatus = 'Y' THEN 1 ELSE 0 END) as confirmed,
      COUNT(DISTINCT entity) as unique_entities,
      COUNT(DISTINCT band) as unique_bands,
      COUNT(DISTINCT mode) as unique_modes,
      CURRENT_TIMESTAMP as updated_at
    FROM qsos
    WHERE user_id = NEW.user_id
    GROUP BY user_id;
  END;
`);
```
**Benefits**:
- Stats updated automatically in real-time
- Query time: <5ms for any dataset size
- No cache invalidation needed
**Usage in getQSOStats()**:
```javascript
export async function getQSOStats(userId) {
  // First check if user has pre-computed stats
  const cachedStats = await db.select().from(qsoStatsCache).where(eq(qsoStatsCache.userId, userId));
  if (cachedStats.length > 0) {
    return {
      total: cachedStats[0].total,
      confirmed: cachedStats[0].confirmed,
      uniqueEntities: cachedStats[0].uniqueEntities,
      uniqueBands: cachedStats[0].uniqueBands,
      uniqueModes: cachedStats[0].uniqueModes,
    };
  }
  // Fall back to regular query for small users
  return calculateStatsWithSQL(userId);
}
```
#### 3.3 Connection Pooling
**File**: `src/backend/db/index.js`
Implement connection pooling for better concurrency:
```javascript
// Illustrative only: bun:sqlite does not ship a built-in pool, so the
// 'bun-sqlite3' Pool below is a placeholder for whichever pooling wrapper is chosen.
import { Pool } from 'bun-sqlite3';

const pool = new Pool({
  filename: 'data/award.db',
  max: 10,        // Max connections
  timeout: 30000, // 30 second timeout
});

export async function getDb() {
  return pool.getConnection();
}
```
**Note**: SQLite has limited write concurrency, but read connections can be pooled.
#### 3.4 Advanced Caching Strategy
**File**: `src/backend/services/cache.service.js`
Implement a Redis-style in-process cache: a TTL'd `Map` with hit/miss statistics:
```javascript
class CacheService {
  constructor() {
    this.cache = new Map();
    this.stats = { hits: 0, misses: 0 };
  }

  async get(key) {
    const value = this.cache.get(key);
    if (value) {
      this.stats.hits++;
      return value.data;
    }
    this.stats.misses++;
    return null;
  }

  async set(key, data, ttl = 300000) {
    this.cache.set(key, {
      data,
      timestamp: Date.now(),
      ttl
    });
    // Auto-expire after TTL
    setTimeout(() => this.delete(key), ttl);
  }

  async delete(key) {
    this.cache.delete(key);
  }

  getStats() {
    const total = this.stats.hits + this.stats.misses;
    return {
      hitRate: total > 0 ? (this.stats.hits / total * 100).toFixed(2) + '%' : '0%',
      hits: this.stats.hits,
      misses: this.stats.misses,
      size: this.cache.size
    };
  }
}

export const cacheService = new CacheService();
```
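A possible wiring of `cacheService` into `getQSOStats()`, replacing the ad-hoc Map from Phase 2.1; this integration is a sketch, not existing code:
```javascript
import { cacheService } from './cache.service.js';

export async function getQSOStats(userId) {
  const cacheKey = `stats_${userId}`;
  const cached = await cacheService.get(cacheKey);
  if (cached) return cached;

  // Cache miss: run the optimized SQL aggregate from Phase 1.1
  const stats = await calculateStatsWithSQL(userId);
  await cacheService.set(cacheKey, stats, 300000); // 5-minute TTL
  return stats;
}
```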
## Implementation Checklist
### Phase 1: Emergency Performance Fix
- [ ] Replace `getQSOStats()` with SQL aggregates
- [ ] Add database indexes
- [ ] Run migration
- [ ] Test with 1k, 50k, 200k QSO datasets
- [ ] Verify API response format unchanged
- [ ] Deploy to production
- [ ] Monitor for 1 week
### Phase 2: Stability & Monitoring
- [ ] Implement 5-minute TTL cache
- [ ] Add performance monitoring
- [ ] Create cache invalidation hooks
- [ ] Add performance metrics to health endpoint
- [ ] Deploy to production
- [ ] Monitor cache hit rate (target >80%)
### Phase 3: Scalability Enhancements
- [ ] Optimize SQLite configuration (WAL mode, cache size)
- [ ] Create materialized views for large datasets
- [ ] Implement connection pooling
- [ ] Deploy advanced caching strategy
- [ ] Load test with 100+ concurrent users
## Additional Issues Identified (Future Work)
### High Priority
1. **Unencrypted LoTW Password Storage**
- **Location**: `src/backend/services/auth.service.js:124`
- **Issue**: LoTW password stored in plaintext in database
- **Fix**: Encrypt with AES-256 before storing (see the sketch after this list)
- **Effort**: 4 hours
2. **Weak JWT Secret Security**
- **Location**: `src/backend/config.js:27`
- **Issue**: Default JWT secret in production
- **Fix**: Use environment variable with strong secret
- **Effort**: 1 hour
3. **ADIF Parser Logic Error**
- **Location**: `src/backend/utils/adif-parser.js:17-18`
- **Issue**: Potential data corruption from incorrect parsing
- **Fix**: Use case-insensitive regex for `<EOR>` tags
- **Effort**: 2 hours
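For issue 1, a minimal sketch of AES-256-GCM encryption with Node's built-in `crypto` module. The `LOTW_ENC_KEY` environment variable (a 64-character hex string) and the helper names are assumptions, not existing configuration:
```javascript
import crypto from 'node:crypto';

// LOTW_ENC_KEY is a hypothetical env var holding 32 bytes as hex.
const key = Buffer.from(process.env.LOTW_ENC_KEY, 'hex');

export function encryptSecret(plaintext) {
  const iv = crypto.randomBytes(12); // 96-bit IV, recommended for GCM
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Store IV + auth tag + ciphertext as a single base64 string
  return Buffer.concat([iv, tag, ciphertext]).toString('base64');
}

export function decryptSecret(encoded) {
  const buf = Buffer.from(encoded, 'base64');
  const iv = buf.subarray(0, 12);
  const tag = buf.subarray(12, 28);
  const ciphertext = buf.subarray(28);
  const decipher = crypto.createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString('utf8');
}
```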
### Medium Priority
4. **Missing Database Transactions**
- **Location**: Sync operations in `lotw.service.js`, `dcl.service.js`
- **Issue**: No transaction support for multi-record operations
- **Fix**: Wrap syncs in transactions (a sketch follows this list)
- **Effort**: 6 hours
5. **Memory Leak Potential in Job Queue**
- **Location**: `src/backend/services/job-queue.service.js`
- **Issue**: Jobs never removed from memory
- **Fix**: Implement cleanup mechanism
- **Effort**: 4 hours
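For issue 4, the import loop could run inside a single transaction so a failed sync never leaves partial data. A sketch assuming Drizzle's `db.transaction()` with a synchronous SQLite driver (better-sqlite3 or bun:sqlite); `parsedQSOs` and the column mapping are placeholders for the existing sync logic:
```javascript
// Inside syncQSOs(), after the ADIF report has been parsed into parsedQSOs
db.transaction((tx) => {
  for (const qso of parsedQSOs) {
    // Placeholder insert; the real upsert/column mapping lives in the current sync code
    tx.insert(qsos).values({ ...qso, userId }).run();
  }
  // Any throw inside the callback rolls the whole batch back
});
```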
### Low Priority
6. **Database Path Exposure**
- **Location**: Error messages reveal database path
- **Issue**: Predictable database location
- **Fix**: Sanitize error messages
- **Effort**: 2 hours
## Monitoring & Metrics
### Key Performance Indicators (KPIs)
1. **QSO Statistics Query Time**
- Target: <100ms for 200k QSOs
- Current: 5-10 seconds
- Tool: Application performance monitoring
2. **Memory Usage per Request**
- Target: <1MB per request
- Current: 100MB+
- Tool: Node.js memory profiler
3. **Concurrent Users**
- Target: 50+ concurrent users
- Current: 2-3 users
- Tool: Load testing with Apache Bench
4. **Cache Hit Rate**
- Target: >80% after Phase 2
- Current: 0% (no cache)
- Tool: Custom metrics in cache service
5. **Database Response Time**
- Target: <50ms for all queries
- Current: Variable (some queries slow)
- Tool: SQLite query logging
### Alerting Thresholds
- **Critical**: Query time >500ms
- **Warning**: Query time >200ms
- **Info**: Cache hit rate <70%
## Rollback Plan
If issues arise after deployment:
1. **Phase 1 Rollback** (if SQL query fails):
- Revert `getQSOStats()` to original implementation
- Keep database indexes (they help performance)
- Estimated rollback time: 5 minutes
2. **Phase 2 Rollback** (if cache causes issues):
- Disable cache by bypassing cache checks
- Keep monitoring (helps diagnose issues)
- Estimated rollback time: 2 minutes
3. **Phase 3 Rollback** (if SQLite config causes issues):
- Revert SQLite configuration changes
- Drop materialized views if needed
- Estimated rollback time: 10 minutes
## Success Criteria
### Phase 1 Success
- Query time <100ms for 200k QSOs
- Memory usage <1MB per request
- Zero bugs in production for 1 week
- User feedback: "Page loads instantly now"
### Phase 2 Success
- Cache hit rate >80%
- Database load reduced by 80%
- Zero cache-related bugs for 1 week
### Phase 3 Success
- Support 50+ concurrent users
- Query time <5ms for materialized views
- Zero performance complaints for 1 month
## Timeline
- **Week 1**: Phase 1 - Emergency Performance Fix
- **Week 2**: Phase 2 - Stability & Monitoring
- **Month 1**: Phase 3 - Scalability Enhancements
- **Month 2-3**: Address additional high-priority security issues
- **Ongoing**: Monitor, iterate, optimize
## Resources
### Documentation
- SQLite Performance: https://www.sqlite.org/optoverview.html
- Drizzle ORM: https://orm.drizzle.team/
- Bun Runtime: https://bun.sh/docs
### Tools
- Query Performance: SQLite EXPLAIN QUERY PLAN (see the sketch after this list)
- Load Testing: Apache Bench (`ab -n 1000 -c 50 http://localhost:3001/api/qsos/stats`)
- Memory Profiling: Node.js `--inspect` flag with Chrome DevTools
- Database Analysis: `sqlite3 data/award.db "PRAGMA index_info(idx_qsos_user_primary);"`
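To check that the statistics query actually uses the new indexes, EXPLAIN QUERY PLAN can be run from a small script. A sketch assuming a better-sqlite3 connection; the query text mirrors the Phase 1.1 aggregate:
```javascript
import Database from 'better-sqlite3';

const db = new Database('data/award.db', { readonly: true });

// Expect the plan's `detail` column to mention an index (e.g. idx_qsos_user_confirmed)
// rather than a full table scan of qsos.
const plan = db
  .prepare(`EXPLAIN QUERY PLAN
            SELECT COUNT(*),
                   SUM(CASE WHEN lotw_qsl_rstatus = 'Y' OR dcl_qsl_rstatus = 'Y' THEN 1 ELSE 0 END)
            FROM qsos WHERE user_id = ?`)
  .all(1); // placeholder user id

console.log(plan);
```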
---
**Last Updated**: 2025-01-21
**Author**: Quickawards Optimization Team
**Status**: Planning Phase - Ready to Start Phase 1 Implementation