FPNet, Deep Attention Network for Automated Floor Plan Analysis

Introduction

We develop a multi-task deep attention network to recognize room-boundary and room-type elements in CAD floor plans.

[Figure: FPNet Summary]

Related Work

Method

[Figure: FPNet Architecture Summary]

Problem Statement

FPNet addresses automated floor plan analysis: taking architectural floor plan images and automatically identifying room types (bedroom, kitchen, bathroom, etc.) and structural elements (doors, windows, walls, fixtures).

Overall Architecture: U-Net Structure

The network follows a U-Net architecture, which has a characteristic U-shape consisting of:

  • Left side (Encoder): Progressively reduces image size while increasing feature depth
  • Right side (Decoder): Progressively increases image size back to original resolution
  • Skip connections: Direct pathways connecting encoder and decoder at corresponding levels

[Figure: FPNet Architecture]

Encoder: Feature Extraction and Downsampling

The encoder transforms a single floor plan image into multiple feature representations at different levels of abstraction and spatial resolution. Think of it like creating a pyramid of increasingly “smart” but smaller representations of the original image.

How It Works

Progressive Downsampling with Max-Pooling

Max-pooling is a downsampling operation that reduces image size by taking the maximum value in small regions.

Original 4×4 region:        After 2×2 max-pooling:
[1  3  2  8]                    [3  8]
[2  3  1  4]          →         [5  9]
[5  1  7  9]
[0  2  6  1]

The 2×2 max-pooling operation divides the input into non-overlapping $2 \times 2$ blocks and takes the maximum value from each block, producing an output with half the width and half the height. For example:

  • Level 1: Input image $512 \times 512$ → after max-pool → $256 \times 256$
  • Level 2: $256 \times 256$ → after max-pool → $128 \times 128$
  • Level 3: $128 \times 128$ → after max-pool → $64 \times 64$
  • Level 4: $64 \times 64$ → after max-pool → $32 \times 32$
  • Level 5: $32 \times 32$ → after max-pool → $16 \times 16$

Each level captures information at a different scale of detail.
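
To make this concrete, here is a minimal PyTorch sketch (not code from the paper) that reproduces the 2×2 max-pooling example above and shows the spatial size being halved:

import torch
import torch.nn as nn

# The 4x4 example region from above, shaped (batch=1, channels=1, H=4, W=4).
x = torch.tensor([[1, 3, 2, 8],
                  [2, 3, 1, 4],
                  [5, 1, 7, 9],
                  [0, 2, 6, 1]], dtype=torch.float32).view(1, 1, 4, 4)

pool = nn.MaxPool2d(kernel_size=2)   # non-overlapping 2x2 blocks (stride defaults to 2)
print(pool(x).view(2, 2))            # tensor([[3., 8.], [5., 9.]])

# The same operation halves any feature map, e.g. 512x512 -> 256x256.
feat = torch.randn(1, 64, 512, 512)
print(pool(feat).shape)              # torch.Size([1, 64, 256, 256])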

Feature Channel Doubling

In this paper, spatial size refers to the width and height dimensions, while channels refer to the depth dimension, like having multiple layers of information for each pixel position. For example:

  • Input: $512 \times 512 \times 3$ ($3$ channels - RGB)
  • Level 1: $256 \times 256 \times 64$ ($64$ channels of basic features like edges)
  • Level 2: $128 \times 128 \times 128$ ($128$ channels of more complex patterns)
  • Level 3: $64 \times 64 \times 256$ ($256$ channels of high-level concepts)
  • Level 4: $32 \times 32 \times 512$ ($512$ channels of very abstract features)

Think of channels as different “filters” or “feature detectors.” If you have 256 channels at a particular layer, you have 256 different types of feature maps, each highlighting different aspects:

  • Channel $1$: Might highlight vertical edges
  • Channel $2$: Might highlight horizontal edges
  • Channel $3$: Might highlight corners
  • Channel $100$: Might highlight door-like patterns
  • Channel $256$: Might highlight bathroom fixtures

Why Double Channels?

  • As spatial size decreases, we lose spatial detail
  • We compensate by increasing the number of different types of features we track
  • More channels = more capacity to represent complex patterns
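
A minimal sketch of this halve-the-spatial-size, double-the-channels pattern, using plain convolutions as stand-ins for the real blocks:

import torch
import torch.nn as nn

# Illustrative layers only (not the paper's actual blocks): each encoder level
# doubles the channel count with a convolution and halves the spatial size with
# max-pooling.
x = torch.randn(1, 3, 512, 512)          # RGB floor plan image
channels = [3, 64, 128, 256, 512]

layers = []
for c_in, c_out in zip(channels[:-1], channels[1:]):
    layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
               nn.ReLU(inplace=True),
               nn.MaxPool2d(2)]

encoder = nn.Sequential(*layers)
print(encoder(x).shape)                   # torch.Size([1, 512, 32, 32])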

Multi-Level Processing Implementation

At each encoder level, the following sequence occurs:

Input → ResNeXt Block → Spatial Attention → Max-Pooling → Next Level
   ↓
Skip Connection to Decoder

  • $5$ levels total in the encoder
  • Each level processes features at a different scale
  • Skip connections are created at each level to preserve information for the decoder
  • Spatial attention is applied between ResNeXt blocks at each level

Example of Multi-Scale Representations:

  • Level 1 ($256 \times 256$): Detects basic lines, corners, small details
  • Level 2 ($128 \times 128$): Recognizes door frames, window outlines
  • Level 3 ($64 \times 64$): Identifies room boundaries, wall segments
  • Level 4 ($32 \times 32$): Understands room layouts, spatial relationships
  • Level 5 ($16 \times 16$): Captures overall floor plan structure, building layout
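
The per-level sequence and the skip connections can be sketched as follows (hypothetical class name, with simple stand-ins for the ResNeXt block and spatial attention, which are sketched in their own sections below):

import torch
import torch.nn as nn

class EncoderLevel(nn.Module):
    # One encoder level: block -> attention -> pooling, with the pre-pooling
    # features kept as the skip connection for the decoder.
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Conv2d(c_in, c_out, 3, padding=1)  # stand-in for the ResNeXt block
        self.attention = nn.Identity()                     # stand-in for spatial attention
        self.pool = nn.MaxPool2d(2)
    def forward(self, x):
        skip = self.attention(self.block(x))
        return self.pool(skip), skip

levels = nn.ModuleList([EncoderLevel(c_in, c_out)
                        for c_in, c_out in [(3, 64), (64, 128), (128, 256), (256, 512)]])
x, skips = torch.randn(1, 3, 512, 512), []
for level in levels:                 # only four of the five levels, for brevity
    x, skip = level(x)
    skips.append(skip)               # saved for the corresponding decoder level
print(x.shape)                       # torch.Size([1, 512, 32, 32])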

ResNet and ResNeXt Blocks

ResNet Background

In deep neural networks, when you train the network, information flows backward to update weights. In very deep networks (many layers), this information gets weaker and weaker as it travels backward, like a whisper in a long telephone chain. Eventually, the early layers barely get updated, so they don’t learn properly.

ResNet solution - Residual Connections:

Traditional Block:          ResNet Block:
Input → Layer1 → Layer2     Input → Layer1 → Layer2 → Add
                                ↓              ↑
                                └──────────────┘
                                (Skip Connection)

How Skip Connections Work:

  • The input is added directly to the output of the processing layers
  • If layers learn nothing useful, they can output zeros, and the skip connection passes the input unchanged
  • This creates “highways” for gradients to flow backward during training
  • Even in very deep networks, early layers can still receive strong training signals

Why This Prevents Vanishing Gradients:

  • Gradients can flow directly through skip connections
  • They don’t have to pass through every layer sequentially
  • Early layers get direct feedback about their contribution to the final result
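
A minimal, generic residual block sketch (not FPNet-specific) showing the add-the-input-back idea:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # The input is added back to the output of the processing layers, giving
    # gradients a direct path backwards through the skip connection.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
    def forward(self, x):
        return torch.relu(self.body(x) + x)   # skip connection: add input to output

x = torch.randn(1, 64, 128, 128)
print(ResidualBlock(64)(x).shape)              # torch.Size([1, 64, 128, 128])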

ResNeXt

ResNeXt blocks are advanced building blocks for deep networks that improve upon traditional ResNet blocks. What a traditional ResNet block does:

Input → Conv → BN → ReLU → Conv → BN → Add with Input

What ResNeXt does - Multiple Parallel Pathways:

Input Feature Map (64×64×256)
          ↓
    Split into 32 groups of 8 channels each
          ↓
Path 1: Process group 1 (64×64×8) → Conv layers → Output 1 (64×64×8)
Path 2: Process group 2 (64×64×8) → Conv layers → Output 2 (64×64×8)
Path 3: Process group 3 (64×64×8) → Conv layers → Output 3 (64×64×8)
...
Path 32: Process group 32 (64×64×8) → Conv layers → Output 32 (64×64×8)
          ↓
    Concatenate all outputs
          ↓
Combined Output (64×64×256) ← This is 8×32 = 256 channels
          ↓
    Add original input (residual connection)
          ↓
Final Output (64×64×256) → Goes to Spatial Attention Module

Here, the cardinality is the number of parallel paths (e.g. $32$ parallel paths). Each path applies the same type of operations (convolutions) but with different learned parameters:

  • Path $1$ might learn to detect horizontal lines
  • Path $2$ might learn to detect vertical lines
  • Path $3$ might learn to detect corners
  • Path $32$ might learn to detect curved elements

The aggregated output becomes the input to the spatial attention module, then goes through max-pooling to the next encoder level; it is also sent via a skip connection to the corresponding decoder level.
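
In practice the 32 parallel paths are usually implemented as a single grouped convolution rather than 32 literal branches. The sketch below (hypothetical class name, and without the 1×1 bottleneck convolutions a full ResNeXt block normally adds) assumes that grouped-convolution form:

import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    # Cardinality 32: the grouped convolution (groups=32) processes 32
    # independent 8-channel slices, exactly the parallel paths in the diagram.
    def __init__(self, channels=256, cardinality=32):
        super().__init__()
        self.paths = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                      groups=cardinality),                # 32 groups of 8 channels each
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1), # mix the groups back together
            nn.BatchNorm2d(channels),
        )
    def forward(self, x):
        return torch.relu(self.paths(x) + x)   # residual connection

x = torch.randn(1, 256, 64, 64)
print(ResNeXtBlock()(x).shape)                  # torch.Size([1, 256, 64, 64])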

Spatial Attention in Encoder

Applied between ResNeXt blocks, spatial attention helps the network focus on the most informative spatial regions. Stated simply, it creates a “spotlight map” that tells the network where to focus in the image.

How It Works

1. Create Two Summary Maps

Given a $64 \times 64 \times 256$ feature map (which comes from the ResNeXt block) we generate:

  • Average summary ($64 \times 64 \times 1$): For each pixel location we average all the $256$ channels. This encodes the overall activity at each location.
  • Max summary ($64 \times 64 \times 1$): For each pixel location we take the highest value across all $256$ channels. This encodes the peak activity at each location.

You can think of it like this:

  • You have $256$ different detectors looking at each pixel
  • Average summary: “How active are detectors on average at this pixel?”
  • Max summary: “What’s the strongest detection at this pixel?”

2. Combine the Summaries

Average Summary (64×64×1) ─┐
                           ├─ Stack together → Combined (64×64×2)
Max Summary (64×64×1) ─────┘

Now you have a 2-layer summary showing both average and peak activity at each location.

3. Learn What’s Important

Combined Summary (64×64×2) → 4×4 Convolution → Attention Map (64×64×1)
                                    ↓
                              Apply Sigmoid (makes values 0-1)

Where the convolution learns:

  • “When I see this pattern of average + max activity, this location is important”
  • “When I see that pattern of average + max activity, this location is not important”

The result is an attention map where:

  • Bright spots (close to 1) = “Pay attention here!”
  • Dark spots (close to 0) = “Ignore this area”

4. Apply the “Spotlight”

Original Features (64×64×256) × Attention Map (64×64×1) = Focused Features (64×64×256)

For example:

At pixel location (10, 15):
- Original features: [value1, value2, value3, ..., value256]
- Attention weight: 0.9 (high attention)
- Result: [0.9×value1, 0.9×value2, 0.9×value3, ..., 0.9×value256]

At pixel location (30, 45):
- Original features: [value1, value2, value3, ..., value256]
- Attention weight: 0.1 (low attention)
- Result: [0.1×value1, 0.1×value2, 0.1×value3, ..., 0.1×value256]
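
Putting the four steps together, here is a minimal sketch of the spatial attention module (hypothetical class name; the 4×4 kernel with “same” padding is an assumption based on the description above):

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    # Average- and max-pool across channels, learn a single-channel attention
    # map, and use it to rescale every channel at every pixel.
    def __init__(self, kernel_size=4):
        super().__init__()
        # 2 input channels (average summary + max summary) -> 1 attention map
        self.conv = nn.Conv2d(2, 1, kernel_size=kernel_size, padding='same')
    def forward(self, x):                        # x: (B, C, H, W), e.g. (1, 256, 64, 64)
        avg = x.mean(dim=1, keepdim=True)        # (B, 1, H, W) average over channels
        mx, _ = x.max(dim=1, keepdim=True)       # (B, 1, H, W) max over channels
        summary = torch.cat([avg, mx], dim=1)    # (B, 2, H, W)
        attn = torch.sigmoid(self.conv(summary)) # (B, 1, H, W), values between 0 and 1
        return x * attn                          # broadcast the "spotlight" over channels

x = torch.randn(1, 256, 64, 64)
print(SpatialAttention()(x).shape)               # torch.Size([1, 256, 64, 64])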

Decoder: Upsampling and Reconstruction

The decoder reconstructs full-resolution segmentation maps from the compressed encoder features.

How Upsampling Works

  1. Bilinear interpolation: Gradually increases spatial resolution at each decoder level
  2. Progressive reconstruction: Each level adds more spatial detail
  3. Feature combination: Combines upsampled features with skip connection features from encoder

Final Output Layer

  • 1×1 convolution: Maps feature channels to class predictions
  • Two output heads: Separate segmentation maps for rooms and icons
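
A minimal sketch of one decoder step plus the two output heads (hypothetical class name and assumed class counts; fusion is shown as plain concatenation here, with the channel-attention variant sketched under Channel Attention below):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderLevel(nn.Module):
    # One decoder step: bilinear upsampling, then fusion with the matching
    # encoder skip feature (plain concatenation + convolution in this sketch).
    def __init__(self, c_high, c_skip, c_out):
        super().__init__()
        self.fuse = nn.Conv2d(c_high + c_skip, c_out, kernel_size=3, padding=1)
    def forward(self, high, skip):
        high = F.interpolate(high, size=skip.shape[-2:], mode='bilinear',
                             align_corners=False)         # double the spatial size
        return torch.relu(self.fuse(torch.cat([high, skip], dim=1)))

dec = DecoderLevel(c_high=512, c_skip=256, c_out=256)
high, skip = torch.randn(1, 512, 32, 32), torch.randn(1, 256, 64, 64)
print(dec(high, skip).shape)                   # torch.Size([1, 256, 64, 64])

# Final layer: two 1x1-convolution heads, one per task (class counts assumed).
room_head = nn.Conv2d(64, 12, kernel_size=1)   # rooms + walls + background
icon_head = nn.Conv2d(64, 11, kernel_size=1)   # doors, windows, fixtures, empty
feat = torch.randn(1, 64, 512, 512)            # full-resolution decoder features
print(room_head(feat).shape, icon_head(feat).shape)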

Feature Types and Their Importance

Low-Level Features (Early Encoder Layers)

  • What they are: Basic visual elements like edges, corners, textures
  • Characteristics: High spatial resolution, low semantic meaning
  • Examples: Wall edges, door frames, line intersections

High-Level Features (Deep Encoder/Decoder Layers)

  • What they are: Complex semantic concepts learned through deep processing
  • Characteristics: Low spatial resolution, high semantic meaning
  • Examples: “This region looks like a kitchen layout,” “This pattern suggests a bathroom”

Why Both Are Needed

  • High-level features: Provide semantic understanding and context
  • Low-level features: Provide precise spatial localization and boundary details
  • The challenge: High-level features know WHAT things are, low-level features know WHERE they are precisely

Channel Attention: Smart Feature Combination

The Problem with Simple Concatenation

When combining encoder features with decoder features through skip connections, not all information is equally useful. Low-level features contain much background noise and irrelevant details.

Channel Attention Solution

How It Works:

  1. Global context extraction: Uses Global Average Pooling to capture semantic information from high-level features
  2. Attention vector creation: Applies 1×1 convolution + batch normalization + softmax to create channel weights
  3. Feature weighting: Multiplies low-level features by attention weights to emphasize relevant channels
  4. Feature fusion: Adds the weighted low-level features to high-level features
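
A minimal sketch of these four steps (hypothetical class name; it assumes the high- and low-level features already have the same channel count, which a 1×1 convolution would otherwise arrange, and that the high-level map is upsampled before the addition):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionFusion(nn.Module):
    # Global average pooling on the high-level features produces per-channel
    # weights that re-weight the low-level skip features before the addition.
    def __init__(self, channels):
        super().__init__()
        self.to_weights = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                 # global average pooling -> (B, C, 1, 1)
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.Softmax(dim=1),                       # one weight per channel
        )
    def forward(self, high, low):                    # high: (B, C, h, w), low: (B, C, H, W)
        weights = self.to_weights(high)              # (B, C, 1, 1)
        high = F.interpolate(high, size=low.shape[-2:], mode='bilinear',
                             align_corners=False)    # bring high-level features to H x W
        return high + low * weights                  # emphasise useful low-level channels

fusion = ChannelAttentionFusion(256).eval()          # eval(): BatchNorm sees a 1x1, batch-1 map
high = torch.randn(1, 256, 32, 32)
low = torch.randn(1, 256, 64, 64)
print(fusion(high, low).shape)                       # torch.Size([1, 256, 64, 64])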

Effect:

  • Suppresses irrelevant background information in low-level features
  • Emphasizes channels that contain architecturally relevant information
  • Creates better fusion between semantic understanding and spatial precision

Training Objective: Multi-Task Learning

Two Parallel Tasks

  1. Room Segmentation: Classify pixels as background, different room types, or walls
  2. Icon Segmentation: Classify pixels as different architectural elements (doors, windows, fixtures)

Loss Function

  • Base loss: Cross-entropy loss for each segmentation task
  • Multi-task weighting: Uses homoscedastic uncertainty to automatically balance the two losses
  • Uncertainty learning: The network learns optimal weights between tasks rather than manual tuning

Mathematical Formulation

$$ L_S = - \sum_{i = 1}^C y_i \log(p_i) $$

Where:

  • $y_i$ is the true label for class $i$ (from the ground truth)
  • $p_i$ is the predicted probability for class $i$
  • $C$ is the number of classes
  • Applied separately to both room and icon segmentation tasks
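
The exact weighting scheme is not spelled out above; the sketch below assumes the standard homoscedastic-uncertainty formulation of Kendall et al., in which each task gets a learnable log-variance $s_i$ and the combined loss is $\sum_i e^{-s_i} L_i + s_i$ (class counts are illustrative):

import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    # Learns one log-variance per task instead of hand-tuning the task weights.
    def __init__(self, num_tasks=2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))
    def forward(self, losses):
        total = 0.0
        for loss, s in zip(losses, self.log_vars):
            total = total + torch.exp(-s) * loss + s
        return total

ce = nn.CrossEntropyLoss()
room_logits = torch.randn(1, 12, 512, 512)            # 12 room classes: assumption
icon_logits = torch.randn(1, 11, 512, 512)            # 11 icon classes: assumption
room_gt = torch.randint(0, 12, (1, 512, 512))
icon_gt = torch.randint(0, 11, (1, 512, 512))

criterion = UncertaintyWeightedLoss(num_tasks=2)
loss = criterion([ce(room_logits, room_gt), ce(icon_logits, icon_gt)])
print(loss.item())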

Model Output

Two Segmentation Maps

  1. Room segmentation map: Each pixel classified as:
    • Background, walls, or specific room types (kitchen, bedroom, bathroom, etc.)
  2. Icon segmentation map: Each pixel classified as:
    • Empty space or specific architectural elements (doors, windows, toilet, sink, etc.)

Output Resolution

  • Same resolution as input image
  • Pixel-level predictions for both tasks simultaneously
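
As an illustration, per-pixel label maps are obtained by taking the argmax over each head’s class channel (shapes and class counts below are assumptions):

import torch

room_logits = torch.randn(1, 12, 512, 512)   # room head output
icon_logits = torch.randn(1, 11, 512, 512)   # icon head output

room_map = room_logits.argmax(dim=1)         # (1, 512, 512): one room/wall/background label per pixel
icon_map = icon_logits.argmax(dim=1)         # (1, 512, 512): one icon/empty label per pixel
print(room_map.shape, icon_map.shape)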