Orlume Architecture
Comprehensive technical reference for researchers, developers, and academics interested in browser-based neural image processing.
Abstract
Orlume is a browser-native image processing system that combines deep learning-based scene understanding with real-time physically-based rendering. The system performs monocular depth estimation using Vision Transformer architectures, generates surface normals through gradient-based reconstruction, and applies deferred shading with GGX specular reflectance and horizon-based ambient occlusion (HBAO).
Key Innovation: First fully browser-based implementation of neural 3D relighting that combines monocular depth estimation with physically-based rendering for interactive photo manipulation.
System Architecture
The Orlume processing pipeline implements a multi-stage architecture optimized for GPU parallelism:
┌─────────────────────────────────────────────────────────────────────┐
│ INPUT PROCESSING │
│ ┌─────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Image │───▶│ sRGB→Linear │───▶│ Normalize │ │
│ │ Decode │ │ Conversion │ │ [0,1] │ │
│ └─────────┘ └──────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ NEURAL INFERENCE (Parallel) │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ Depth Anything │ │ SegFormer │ │ MediaPipe FM │ │
│ │ V2 ViT │ │ B0-512 │ │ 468 Points │ │
│ │ (Depth Map) │ │ (Materials) │ │ (Face Mesh) │ │
│ └───────┬────────┘ └───────┬────────┘ └───────┬────────┘ │
└──────────┼───────────────────┼───────────────────┼──────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────┐
│ GEOMETRY RECONSTRUCTION │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ Scharr Kernel │ │ Material Map │ │ Face Normals │ │
│ │ ∇D → Normal │ │ RGBA Encode │ │ Triangulation │ │
│ └───────┬────────┘ └───────┬────────┘ └───────┬────────┘ │
└──────────┼───────────────────┼───────────────────┼──────────────────┘
│ │ │
└───────────────────┴───────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ DEFERRED RENDERING (WebGL2/WebGPU) │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ G-Buffer: Albedo | Normals | Depth | Materials | Position │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ HBAO │ │ GGX BRDF │ │ Shadow │ │
│ │ 8-dir │ │ Specular │ │ Raymarch│ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ └───────────────┼───────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Final Composite: Diffuse + Specular + AO + Shadows │ │
│ │ Tone Mapping: ACES Filmic | Exposure Compensation │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
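The sRGB→linear step in the input stage follows the standard IEC 61966-2-1 piecewise transfer function; a minimal sketch (helper names are illustrative, not Orlume's API):

```javascript
// sRGB <-> linear conversion per the IEC 61966-2-1 piecewise curve.
// Input and output channel values are in [0, 1].
function srgbToLinear(c) {
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

function linearToSrgb(c) {
  return c <= 0.0031308 ? c * 12.92 : 1.055 * Math.pow(c, 1.0 / 2.4) - 0.055;
}
```

Linearization must happen before any lighting math; shading in gamma-encoded sRGB produces visibly wrong highlight falloff.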
Technical Capabilities
| Subsystem | Technology | Performance |
|---|---|---|
| Depth Estimation | Depth Anything V2 (ViT-Small) | ~150ms @ 1080p |
| Semantic Segmentation | SegFormer B0 (ADE20K 150-class) | ~200ms @ 512×512 |
| Face Mesh | MediaPipe (468 landmarks) | ~16ms per frame |
| PBR Shading | GGX + HBAO + Soft Shadows | 60fps @ 4K |
| Neural Upscaling | Real-ESRGAN / ESRGAN-thick | ~2s per 2× upscale |
ML Monocular Depth Estimation
Orlume employs Depth Anything V2, a state-of-the-art monocular depth estimation model based on the Vision Transformer (ViT) architecture. The model processes single RGB images to produce dense, relative depth maps that serve as the foundation for 3D scene reconstruction.
Model Specification
| Property | Value |
|---|---|
| Model ID | Xenova/depth-anything-small-hf |
| Architecture | Vision Transformer (ViT) Encoder + CNN Decoder |
| Input Resolution | Any (internally resized to 518×518) |
| Output | Single-channel depth map, normalized [0, 1] |
| Inference Backend | ONNX Runtime (WebGPU → WASM fallback) |
Depth Processing Pipeline
// Depth estimation with edge-preserving smoothing (Transformers.js:
// create the pipeline once, then run inference on the image)
const depthEstimator = await pipeline('depth-estimation', 'Xenova/depth-anything-small-hf');
const { predicted_depth } = await depthEstimator(image);
const depthMap = normalizeMinMax(predicted_depth);
const smoothedDepth = bilateralFilter(depthMap, {
    spatialSigma: 9,
    rangeSigma: 0.1   // Edge-preserving parameter
});
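A minimal 1-D sketch of the edge-preserving bilateral filter assumed above: Gaussian weighting in both the spatial domain and the range (depth-difference) domain, so smoothing stops at depth discontinuities:

```javascript
// 1-D bilateral filter sketch. `spatialSigma` controls neighborhood size,
// `rangeSigma` controls how large a depth difference still counts as
// "the same surface" (small values preserve edges).
function bilateralFilter1D(signal, spatialSigma, rangeSigma) {
  const radius = Math.ceil(2 * spatialSigma);
  return signal.map((center, i) => {
    let sum = 0, weight = 0;
    for (let j = -radius; j <= radius; j++) {
      const k = i + j;
      if (k < 0 || k >= signal.length) continue;
      const ws = Math.exp(-(j * j) / (2 * spatialSigma * spatialSigma));
      const dr = signal[k] - center;
      const wr = Math.exp(-(dr * dr) / (2 * rangeSigma * rangeSigma));
      sum += signal[k] * ws * wr;
      weight += ws * wr;
    }
    return sum / weight;
  });
}
```

With rangeSigma = 0.1, a depth step of 1.0 contributes a weight of roughly e⁻⁵⁰, which is why the depth edge survives the smoothing.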
ML Semantic Segmentation
Material-aware rendering is achieved through SegFormer B0, a hierarchical Transformer encoder with lightweight MLP decoder. The model classifies each pixel into one of 150 semantic categories from the ADE20K dataset, which are then mapped to physically-based material properties.
Material Property Mapping
| Semantic Class | Roughness | Metallic | Subsurface | Emissive |
|---|---|---|---|---|
| Person/Skin | 0.60 | 0.00 | 0.35 | 0.00 |
| Metal/Car/Building | 0.30 | 0.95 | 0.00 | 0.00 |
| Glass/Window | 0.02 | 0.00 | 0.00 | 0.00 |
| Vegetation | 0.85 | 0.00 | 0.10 | 0.00 |
| Sky | 1.00 | 0.00 | 0.00 | 1.00 |
| Lamp/Light | 0.50 | 0.00 | 0.00 | 0.80 |
RGBA Material Encoding
R = Roughness × 255
G = Metallic × 255
B = Subsurface Scattering × 255
A = Emissive × 255
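The packing above can be sketched as an 8-bit quantization round trip (function names are illustrative, not Orlume's API):

```javascript
// Pack the four material properties into one RGBA texel (8 bits/channel),
// and unpack on read. Quantization error is at most 1/255 per property.
function encodeMaterial({ roughness, metallic, subsurface, emissive }) {
  const q = (v) => Math.round(Math.min(Math.max(v, 0), 1) * 255);
  return [q(roughness), q(metallic), q(subsurface), q(emissive)];
}

function decodeMaterial([r, g, b, a]) {
  return { roughness: r / 255, metallic: g / 255, subsurface: b / 255, emissive: a / 255 };
}
```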
ML Face Mesh Detection
For portrait images, MediaPipe Face Mesh provides 468 3D facial landmarks that are triangulated into a dense mesh. This enables accurate facial geometry reconstruction for realistic skin rendering with subsurface scattering.
Mesh Generation
- 468 vertices — Sparse 3D landmark positions
- ~900 triangles — Dense tessellation via Delaunay triangulation
- Smooth normals — Area-weighted vertex normal averaging
- Depth interpolation — Barycentric coordinates for dense depth map
// Per-vertex normal via area-weighted averaging (conceptual pseudocode;
// v0, v1, v2 are the corners of each triangle adjacent to the vertex)
vec3 computeSmoothNormal(int vertexIdx) {
    vec3 normal = vec3(0.0);
    for (int t = 0; t < adjacentTriangles; t++) {
        vec3 faceNormal = cross(v1 - v0, v2 - v0);   // un-normalized face normal
        float area = length(faceNormal) * 0.5;       // triangle area from the cross product
        normal += normalize(faceNormal) * area;      // larger triangles contribute more
    }
    return normalize(normal);
}
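The depth-interpolation bullet above can be sketched with standard barycentric coordinates; the helper names are assumptions for illustration:

```javascript
// Barycentric weights of point p inside triangle (a, b, c) in 2-D,
// via the standard dot-product solve of the two edge equations.
function barycentric(p, a, b, c) {
  const v0 = [b[0] - a[0], b[1] - a[1]];
  const v1 = [c[0] - a[0], c[1] - a[1]];
  const v2 = [p[0] - a[0], p[1] - a[1]];
  const d00 = v0[0] * v0[0] + v0[1] * v0[1];
  const d01 = v0[0] * v1[0] + v0[1] * v1[1];
  const d11 = v1[0] * v1[0] + v1[1] * v1[1];
  const d20 = v2[0] * v0[0] + v2[1] * v0[1];
  const d21 = v2[0] * v1[0] + v2[1] * v1[1];
  const denom = d00 * d11 - d01 * d01;
  const v = (d11 * d20 - d01 * d21) / denom;
  const w = (d00 * d21 - d01 * d20) / denom;
  return [1 - v - w, v, w];
}

// Densify the sparse landmark depths across one triangle.
function interpolateDepth(p, tri, depths) {
  const [u, v, w] = barycentric(p, tri[0], tri[1], tri[2]);
  return u * depths[0] + v * depths[1] + w * depths[2];
}
```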
ML Neural Image Upscaling
Super-resolution is powered by Real-ESRGAN with optional face enhancement via GFPGAN. The RRDB (Residual-in-Residual Dense Block) architecture reconstructs high-frequency details that are lost in traditional bicubic upscaling.
| Scale Factor | Architecture | Use Case |
|---|---|---|
| 2× | Real-ESRGAN x2 | General photo enhancement |
| 4× | Real-ESRGAN x4 + GFPGAN | Portrait restoration |
SHADER Surface Normal Estimation
Surface normals are computed from the depth map using the Scharr operator, which provides better rotational symmetry than the traditional Sobel kernel. A 9-tap Gaussian filter is then applied to suppress gradient noise on smooth surfaces.
Scharr Gradient Kernels
Gx (Horizontal): Gy (Vertical):
┌────┬────┬────┐ ┌────┬─────┬────┐
│ -3 │ 0 │ +3 │ │ -3 │ -10 │ -3 │
├────┼────┼────┤ ├────┼─────┼────┤
│-10 │ 0 │+10 │ │ 0 │ 0 │ 0 │
├────┼────┼────┤ ├────┼─────┼────┤
│ -3 │ 0 │ +3 │ │ +3 │ +10 │ +3 │
└────┴────┴────┘ └────┴─────┴────┘
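Applying the kernels above to a 3×3 depth patch is a plain convolution; a CPU-side sketch (not the shader path):

```javascript
// Scharr kernels in row-major order, matching the diagrams above.
const SCHARR_X = [-3, 0, 3, -10, 0, 10, -3, 0, 3];
const SCHARR_Y = [-3, -10, -3, 0, 0, 0, 3, 10, 3];

// Raw (unnormalized) kernel responses for one 3x3 depth patch.
// A ramp rising by 1 across the patch yields gx = 16 (the one-sided
// weight sum 3 + 10 + 3 spans a two-pixel baseline).
function scharrGradient(patch3x3) {
  let gx = 0, gy = 0;
  for (let i = 0; i < 9; i++) {
    gx += SCHARR_X[i] * patch3x3[i];
    gy += SCHARR_Y[i] * patch3x3[i];
  }
  return [gx, gy];
}
```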
Normal Reconstruction
// Fragment shader: Depth → Normal
// (forward-difference form shown for brevity; the production path
// substitutes the Scharr taps above for dx/dy)
vec3 computeNormal(vec2 uv, sampler2D depthTex) {
    float d = texture(depthTex, uv).r;
    float dx = texture(depthTex, uv + vec2(1.0, 0.0) / resolution).r - d;
    float dy = texture(depthTex, uv + vec2(0.0, 1.0) / resolution).r - d;
    vec3 normal = normalize(vec3(-dx * normalStrength,
                                 -dy * normalStrength,
                                 1.0));
    return normal * 0.5 + 0.5; // Encode to [0,1] for storage
}
Gaussian Smoothing (9-tap)
Weights: [1,2,1; 2,4,2; 1,2,1], normalized by the kernel sum (16)
SHADER Physically-Based Rendering
Orlume implements a full Cook-Torrance BRDF with GGX microfacet distribution, Fresnel-Schlick approximation, and Smith geometry term. This provides physically-accurate light interaction that responds correctly to material properties.
BRDF Components
GGX Normal Distribution Function (D)
D(m) = α² / (π · ((n·m)²(α² − 1) + 1)²)
where α = roughness², m = half-vector, n = surface normal
Fresnel-Schlick Approximation (F)
F(v, h) = F₀ + (1 − F₀)(1 − v·h)⁵
where F₀ = 0.04 for dielectrics, ~0.7-1.0 for metals
Smith Geometry Function (G)
G₁(x) = 2(n·x) / (n·x + √(α² + (1-α²)(n·x)²))
Final BRDF Integration
vec3 cookTorranceBRDF(vec3 N, vec3 V, vec3 L, vec3 albedo,
                      float roughness, float metallic) {
    vec3 H = normalize(V + L);
    float NdotL = max(dot(N, L), 0.0);
    float NdotV = max(dot(N, V), 0.0);
    float NdotH = max(dot(N, H), 0.0);
    float VdotH = max(dot(V, H), 0.0);

    // GGX Distribution
    float alpha = roughness * roughness;
    float alpha2 = alpha * alpha;
    float denom = NdotH * NdotH * (alpha2 - 1.0) + 1.0;
    float D = alpha2 / (PI * denom * denom);

    // Fresnel
    vec3 F0 = mix(vec3(0.04), albedo, metallic);
    vec3 F = F0 + (1.0 - F0) * pow(1.0 - VdotH, 5.0);

    // Geometry (Smith-GGX)
    float k = alpha / 2.0;
    float G1L = NdotL / (NdotL * (1.0 - k) + k);
    float G1V = NdotV / (NdotV * (1.0 - k) + k);
    float G = G1L * G1V;

    // Specular term
    vec3 specular = (D * F * G) / (4.0 * NdotL * NdotV + 0.001);

    // Diffuse (Lambert)
    vec3 kD = (1.0 - F) * (1.0 - metallic);
    vec3 diffuse = kD * albedo / PI;

    return (diffuse + specular) * NdotL;
}
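For sanity-checking on the CPU, the three terms can be ported directly from the shader (scalar Fresnel shown for brevity; this mirrors, but is not, the production code):

```javascript
// Numeric ports of the D, F, and G terms above.
const PI = Math.PI;

// GGX distribution; alpha = roughness^2, so alpha^2 = roughness^4.
function ggxD(NdotH, roughness) {
  const a2 = Math.pow(roughness, 4);
  const denom = NdotH * NdotH * (a2 - 1) + 1;
  return a2 / (PI * denom * denom);
}

// Fresnel-Schlick with scalar F0 (e.g. 0.04 for dielectrics).
function fresnelSchlick(VdotH, F0) {
  return F0 + (1 - F0) * Math.pow(1 - VdotH, 5);
}

// Smith G1 with k = alpha / 2, matching the shader.
function smithG1(NdotX, roughness) {
  const k = (roughness * roughness) / 2;
  return NdotX / (NdotX * (1 - k) + k);
}
```

Useful limits: F goes to F₀ at normal incidence and to 1 at grazing angles; a fully rough surface (roughness = 1) gives the constant distribution D = 1/π.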
SHADER Horizon-Based Ambient Occlusion
HBAO (Horizon-Based Ambient Occlusion) provides realistic contact shadows and ambient darkening in crevices. The algorithm ray-marches in multiple directions to find the horizon angle at each pixel.
Algorithm Parameters
| Parameter | Value | Description |
|---|---|---|
| Directions | 8 | Cardinal + diagonal directions |
| Steps per Direction | 8 | Ray marching samples |
| Radius | 8px | Sample distance |
| Bias | 0.025 | Self-occlusion prevention |
float computeHBAO(vec2 uv, float centerDepth) {
    float occlusion = 0.0;
    for (int d = 0; d < 8; d++) {
        vec2 dir = directions[d];                // 8 precomputed unit directions
        float maxHorizon = -1.0;
        for (int s = 1; s <= 8; s++) {
            // `radius` is the per-step offset in UV units
            vec2 sampleUV = uv + dir * float(s) * radius;
            float sampleDepth = texture(depthTex, sampleUV).r;
            float heightDiff = sampleDepth - centerDepth;
            // Tangent of the elevation angle toward the sample
            float horizonAngle = heightDiff / (float(s) * radius);
            maxHorizon = max(maxHorizon, horizonAngle);
        }
        occlusion += clamp(maxHorizon, 0.0, 1.0);
    }
    return 1.0 - (occlusion / 8.0) * intensity;
}
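The inner loop reduces to a 1-D horizon scan per direction; a CPU sketch with the step size expressed in array-index units (an assumption for illustration):

```javascript
// Scan outward from centerIdx along one direction, tracking the highest
// elevation-to-distance ratio (the "horizon"), clamped to [0, 1].
function horizonOcclusion(depthRow, centerIdx, step, numSteps) {
  const centerDepth = depthRow[centerIdx];
  let maxHorizon = -1;
  for (let s = 1; s <= numSteps; s++) {
    const idx = centerIdx + s * step;
    if (idx < 0 || idx >= depthRow.length) break;
    const heightDiff = depthRow[idx] - centerDepth;
    maxHorizon = Math.max(maxHorizon, heightDiff / (s * Math.abs(step)));
  }
  return Math.min(Math.max(maxHorizon, 0), 1);
}
```

A flat row contributes no occlusion; a nearby ridge raises the horizon in proportion to its height over its distance.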
SHADER Soft Shadow Computation
Shadows are computed via screen-space ray marching from each fragment toward the light source. A novel anti-banding technique combines per-pixel dithering with Gaussian depth sampling.
Anti-Banding Techniques
- Pseudo-random dithering — Per-pixel offset using hash function
- 9-tap Gaussian blur — Smooth depth sampling at 3px radius
- Gradient accumulation — Soft blocking instead of hard thresholds
- 48 ray steps — Quadratic distribution (denser near fragment)
float hash(vec2 p) {
    return fract(sin(dot(p, vec2(127.1, 311.7))) * 43758.5453);
}

float calculateSoftShadow(vec2 uv, vec2 lightPos) {
    float dither = hash(uv * resolution) * 0.5;   // Per-pixel offset breaks banding
    float centerDepth = texture(depthTex, uv).r;  // Hoisted out of the loop
    vec2 rayDir = normalize(lightPos - uv);
    float shadow = 0.0;
    for (int i = 0; i < 48; i++) {
        float t = (float(i) + dither) / 48.0;
        t = t * t;                                    // Quadratic distribution
        vec2 samplePos = mix(uv, lightPos, t);
        float depth = sampleDepthSmooth(samplePos);   // 9-tap Gaussian depth sample
        float heightDiff = depth - centerDepth;
        shadow += smoothstep(0.0, 0.05, heightDiff);  // Soft blocking, no hard threshold
    }
    return 1.0 - (shadow / 48.0);
}
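The quadratic distribution from the loop above can be checked in isolation: squaring the normalized parameter t concentrates samples near the fragment, where contact shadows need the most resolution:

```javascript
// Positions of the ray samples along [fragment, light] as fractions of
// the ray length, after the t -> t*t remap used in the shadow loop.
function rayStepPositions(numSteps, dither) {
  const positions = [];
  for (let i = 0; i < numSteps; i++) {
    const t = (i + dither) / numSteps;
    positions.push(t * t);
  }
  return positions;
}
```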
SHADER Volumetric God Rays
Atmospheric light scattering is simulated through radial blur from the light source position. The effect includes chromatic aberration, bloom, and depth-aware masking for realistic results.
Effect Parameters
| Parameter | Range | Description |
|---|---|---|
| Intensity | 0.0 - 2.0 | Ray brightness |
| Decay | 0.90 - 1.0 | Falloff per sample |
| Samples | 32 - 128 | Ray marching iterations |
| Chromatic | 0.0 - 0.1 | RGB channel separation |
| Scatter | 0.0 - 1.0 | Atmospheric scattering |
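The Decay parameter from the table amounts to a geometric accumulation along the ray; a 1-D sketch (function name is illustrative):

```javascript
// Accumulate brightness samples along a ray toward the light, with each
// successive sample weighted by decay^i (exponential falloff), then
// scaled by the overall intensity.
function accumulateGodRays(brightnessAlongRay, decay, intensity) {
  let weight = 1;
  let sum = 0;
  for (const b of brightnessAlongRay) {
    sum += b * weight;
    weight *= decay;
  }
  return sum * intensity;
}
```

With decay near 1.0 the rays extend far from the light; lower values confine the glow to its immediate neighborhood.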
GPU WebGPU Backend
The primary rendering backend leverages WebGPU for modern GPU acceleration with WGSL shaders. This provides significant performance improvements over WebGL2, especially for compute-heavy operations.
| Feature | WebGPU | WebGL2 |
|---|---|---|
| Shader Language | WGSL | GLSL ES 3.00 |
| Compute Shaders | ✓ | ✗ |
| Bind Groups | ✓ | ✗ |
| Multi-threaded | ✓ | Limited |
GPU WebGL2 Fallback
For browsers without WebGPU support, a full WebGL2 backend provides identical visual results with GLSL ES 3.00 shaders. Automatic detection ensures the best available backend is selected.
Required Extensions
- EXT_color_buffer_float — Floating-point render targets
- OES_texture_float_linear — Linear filtering for float textures
- EXT_float_blend — Blending with float framebuffers
GPU GLSL Shader Architecture
The shader pipeline processes images through multiple passes:
| Pass | Shader | Output |
|---|---|---|
| 1 | Develop (Exposure, WB) | Linear RGB |
| 2 | HSL Color Mixer | Color-adjusted RGB |
| 3 | Normal Generation | Normal map (RGB) |
| 4 | HBAO | AO mask (R) |
| 5 | PBR Lighting | Lit RGB |
| 6 | Shadow Pass | Shadow mask (R) |
| 7 | Composite + Tone Map | Final sRGB |
Color Grading Pipeline
Color processing follows a strict order to maintain predictable results:
- Exposure — EV stops (-5 to +5)
- White Balance — Temperature (2000K-12000K) + Tint
- Contrast — S-curve with midpoint preservation
- Highlights/Shadows — Luminance-selective adjustment
- Whites/Blacks — Endpoint clipping control
- HSL — Per-channel hue/saturation/luminance
- Vibrance — Saturation-aware saturation boost
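The first and third stages can be sketched on a single linear value; the pivot-based contrast curve here is an assumed shape, not necessarily Orlume's exact S-curve:

```javascript
// Exposure in EV stops: each stop doubles or halves linear light.
function applyExposure(linear, ev) {
  return linear * Math.pow(2, ev);
}

// Contrast as a power curve pivoting around middle gray (0.18), so the
// midpoint is preserved while values above/below it spread apart.
function applyContrast(linear, amount) {
  const mid = 0.18;
  return mid * Math.pow(linear / mid, amount);
}
```

Applying exposure before contrast matters: the pivot only lands on perceptual middle gray if exposure has already been normalized.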
HSL Color Mixer
The 8-channel HSL mixer provides independent control over specific color ranges, implemented in a single-pass GPU shader:
| Channel | Hue Range | Center Hue |
|---|---|---|
| Red | 330° - 30° | 0° |
| Orange | 15° - 45° | 30° |
| Yellow | 45° - 75° | 60° |
| Green | 75° - 165° | 120° |
| Aqua | 165° - 195° | 180° |
| Blue | 195° - 255° | 225° |
| Purple | 255° - 285° | 270° |
| Magenta | 285° - 330° | 310° |
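One plausible way to weight a pixel's hue toward a channel is a triangular falloff over the table's hue range, with wraparound at 360°; the falloff shape is an assumption for illustration:

```javascript
// Shortest distance between two hues on the 360-degree hue circle.
function hueDelta(h, center) {
  const d = ((h - center) % 360 + 540) % 360 - 180;
  return Math.abs(d);
}

// Weight of the Red mixer channel for a given hue: 1 at the 0-degree
// center, falling linearly to 0 at the 330/30-degree range edges.
function redChannelWeight(hue) {
  const center = 0, halfWidth = 30;
  return Math.max(0, 1 - hueDelta(hue, center) / halfWidth);
}
```

The overlapping ranges in the table (e.g. Red ends at 30° while Orange starts at 15°) mean adjacent channel weights cross-fade, avoiding hard seams in graded gradients.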
Tone Mapping
Final output uses ACES Filmic tone mapping for cinematic highlight roll-off:
vec3 ACESFilm(vec3 x) {
    float a = 2.51;
    float b = 0.03;
    float c = 2.43;
    float d = 0.59;
    float e = 0.14;
    return clamp((x * (a * x + b)) / (x * (c * x + d) + e), 0.0, 1.0);
}
Academic References
- Yang, L. et al. "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data." CVPR 2024.
- Xie, E. et al. "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers." NeurIPS 2021.
- Wang, X. et al. "Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data." ICCV 2021 Workshop.
- Bavoil, L., Sainz, M. "Image-space horizon-based ambient occlusion." SIGGRAPH 2008.
- Walter, B. et al. "Microfacet Models for Refraction through Rough Surfaces." EGSR 2007.
- Karis, B. "Real Shading in Unreal Engine 4." SIGGRAPH 2013 Course.
Technical Dependencies
| Library | Version | Purpose |
|---|---|---|
| Transformers.js | 2.17+ | ML model inference (ONNX Runtime) |
| Three.js | 0.160+ | 3D mesh rendering |
| MediaPipe | 0.10+ | Face mesh detection |
| ONNX Runtime Web | 1.17+ | Neural network execution |
Browser Compatibility
| Browser | WebGPU | WebGL2 |
|---|---|---|
| Chrome | 113+ | 90+ |
| Firefox | 120+ | 90+ |
| Safari | 17+ | 15+ |
| Edge | 113+ | 90+ |