- Introduction
- Class Diagram
- Camera Capturing
- Camera Preview Rendering
- Applying Virtual Background
- Demo
This project leverages Core ML body segmentation to replace the background in real time on iOS devices. Using a deep learning model, it accurately detects and segments the human figure, allowing users to apply custom virtual backgrounds. Optimized for performance, it ensures smooth processing on mobile devices.
classDiagram
class CameraController {
+SessionSetupResult
-devicePosition: AVCaptureDevice.Position
-captureSession: AVCaptureSession
-sessionQueue: DispatchQueue
-dataOutputQueue: DispatchQueue
-cameraVideoDataOutput: AVCaptureVideoDataOutput
-videoTrackSourceFormatDescription: CMFormatDescription?
-cameraDeviceInput: AVCaptureDeviceInput?
-cameraProcessor: CameraProcessor?
-setupResult: SessionSetupResult
-isSessionRunning: Bool
-isRenderingEnabled: Bool
-lastFpsTimestamp: TimeInterval
-frameCount: Int
+fpsDelegate: FpsDelegate?
+cameraPreview: CameraPreview?
+init(cameraProcessor: CameraProcessor?)
+configure()
+startRunning() SessionSetupResult
+stopRunning()
+restartSession()
+enableRendering()
+disableRendering()
+applyBackgroundImage(image: CGImage)
}
class PHPickerViewControllerDelegate {
<<protocol>>
+picker(_ picker: PHPickerViewController, didFinishPicking results: [PHPickerResult])
}
class FpsDelegate {
<<protocol>>
+didUpdateFps(fps: Double)
}
class CameraPreview {
<<class>>
+pixelBuffer: CVPixelBuffer?
}
class CameraProcessor {
<<protocol>>
+process(framePixelBuffer: CVPixelBuffer) CVPixelBuffer?
+applyBackgroundImage(image: CGImage)
}
class CameraVirtualBackgroundProcessor {
-MixParams
-pixelData: [UInt8]
-backgroundTexture: MTLTexture?
-outputTexture: MTLTexture?
-segmentationMaskBuffer: MTLBuffer?
-segmentationWidth: Int
-segmentationHeight: Int
-inputCameraTexture: MTLTexture?
-textureCache: CVMetalTextureCache?
-commandQueue: MTLCommandQueue
-computePipelineState: MTLComputePipelineState
-device: MTLDevice
-model: DeepLabV3?
-bytesPerPixel: Int
-videoSize: CGSize
-textureLoader: MTKTextureLoader
-samplerState: MTLSamplerState?
+init?()
+process(framePixelBuffer: CVPixelBuffer) CVPixelBuffer?
+applyBackgroundImage(image: CGImage)
+loadTexture(image: CGImage) MTLTexture?
-getDeepLabV3Model() DeepLabV3?
-render(pixelBuffer: CVPixelBuffer?) MTLTexture?
-makeTextureFromCVPixelBuffer(pixelBuffer: CVPixelBuffer) MTLTexture?
}
class CameraViewController {
-cameraController: CameraController!
-selection: [String: PHPickerResult]
-selectedAssetIdentifiers: [String]
-selectedAssetIdentifierIterator: IndexingIterator<[String]>?
-currentAssetIdentifier: String?
-cameraPreview: CameraPreview
-imagePickerButton: UIButton
-fpsLabel: UILabel
-cameraUnavailableLabel: UILabel
+viewDidLoad()
+viewWillAppear(animated: Bool)
+viewWillDisappear(animated: Bool)
+initialize()
+configureView()
+requestCameraPermission(completion: @escaping (Bool) -> Void)
+showCameraDisabledAlert()
+presentPicker(filter: PHPickerFilter?)
+didEnterBackground(notification: NSNotification)
+willEnterForeground(notification: NSNotification)
+sessionWasInterrupted(notification: NSNotification)
+sessionInterruptionEnded(notification: NSNotification)
+sessionRuntimeError(notification: NSNotification)
+presentPickerForImages(_ sender: Any)
+addObservers()
+removeObservers()
+didUpdateFps(fps: Double)
+displayNext()
+handleCompletion(assetIdentifier: String, object: Any?, error: Error?)
}
CameraController ..|> FpsDelegate
CameraController *-- CameraPreview
CameraController *-- CameraProcessor
CameraVirtualBackgroundProcessor ..|> CameraProcessor
CameraViewController ..|> PHPickerViewControllerDelegate
CameraViewController ..|> FpsDelegate
CameraViewController *-- CameraController
CameraViewController *-- CameraPreview
Camera capturing on iOS uses the AVFoundation framework to manage and configure the camera hardware, capture video and audio data, and process the captured data. The CameraController class is responsible for setting up the capture session (AVCaptureSession), the core component that manages the flow of data from the input devices (camera and microphone) to the outputs (video and audio data).
In this implementation, the controller focuses solely on video capture. It starts by setting up the capture session on a background queue (sessionQueue), configures the session preset, adds the camera input, and sets up the video data output. The startRunning and stopRunning methods start and stop the capture session on the sessionQueue.
Each time a new video frame is captured, the captureOutput method is triggered, processing the frame and updating the frames per second (FPS) in real time.
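The sketch below illustrates this capture path: the session is configured on a background queue, a BGRA video data output delivers frames on a dedicated queue, and `captureOutput` fires once per frame. The queue labels and camera selection are illustrative assumptions, not the project's exact implementation.

```swift
import AVFoundation

// A minimal sketch of the capture setup described above. The names mirror
// the class diagram (sessionQueue, dataOutputQueue); everything else is
// illustrative rather than the project's exact code.
final class MinimalCameraCapture: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    private let captureSession = AVCaptureSession()
    private let sessionQueue = DispatchQueue(label: "session.queue")
    private let dataOutputQueue = DispatchQueue(label: "data.output.queue")
    private let videoDataOutput = AVCaptureVideoDataOutput()

    func configure() {
        sessionQueue.async {
            self.captureSession.beginConfiguration()
            self.captureSession.sessionPreset = .high

            // Add the camera input (front wide-angle camera as an example).
            if let camera = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .front),
               let input = try? AVCaptureDeviceInput(device: camera),
               self.captureSession.canAddInput(input) {
                self.captureSession.addInput(input)
            }

            // Add the video data output and receive BGRA frames on a dedicated queue.
            self.videoDataOutput.videoSettings =
                [kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_32BGRA]
            self.videoDataOutput.setSampleBufferDelegate(self, queue: self.dataOutputQueue)
            if self.captureSession.canAddOutput(self.videoDataOutput) {
                self.captureSession.addOutput(self.videoDataOutput)
            }

            self.captureSession.commitConfiguration()
            self.captureSession.startRunning()
        }
    }

    // Called once per captured frame; this is where processing and FPS tracking happen.
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        // Hand `pixelBuffer` to the processor / preview here.
        _ = pixelBuffer
    }
}
```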
The CameraPreview class is a custom MetalKit view that renders video frames captured by the camera. It manages the Metal device, command queue, render pipeline state, and texture cache. The class handles the creation of Metal textures from pixel buffers, sets up the necessary transformations for rendering, and encodes the rendering commands. The draw method is responsible for rendering the video frames on the screen, and the setupTransform method updates the transformation settings based on the texture dimensions, view bounds, mirroring, and rotation.
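The core of that rendering path is converting each CVPixelBuffer into an MTLTexture through a CVMetalTextureCache. A minimal sketch, assuming BGRA camera frames:

```swift
import Metal
import CoreVideo

// Turn a BGRA camera pixel buffer into a Metal texture via the texture cache.
func makeTexture(from pixelBuffer: CVPixelBuffer,
                 textureCache: CVMetalTextureCache) -> MTLTexture? {
    let width = CVPixelBufferGetWidth(pixelBuffer)
    let height = CVPixelBufferGetHeight(pixelBuffer)

    var cvTexture: CVMetalTexture?
    let status = CVMetalTextureCacheCreateTextureFromImage(
        kCFAllocatorDefault,
        textureCache,
        pixelBuffer,
        nil,
        .bgra8Unorm,
        width,
        height,
        0,              // plane index (BGRA is single-plane)
        &cvTexture)

    guard status == kCVReturnSuccess, let cvTexture = cvTexture else { return nil }
    return CVMetalTextureGetTexture(cvTexture)
}
```

The cache itself is created once with `CVMetalTextureCacheCreate(kCFAllocatorDefault, nil, device, nil, &textureCache)` and reused for every frame.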
The PassThrough.metal file contains the vertex and fragment shaders used by the CameraPreview class to render video frames. The vertex shader processes the vertex positions and texture coordinates, while the fragment shader samples the texture to determine the color of each pixel. The CameraPreview class sets up the render pipeline state with these shaders, prepares the vertex and texture coordinate buffers, and issues draw commands to render the video frames using Metal. This setup allows for efficient and high-performance rendering of video frames captured by the camera.
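The Swift side of that setup can be sketched as follows. The shader function names `vertexPassThrough` and `fragmentPassThrough` are placeholders for whatever PassThrough.metal actually exports, and the buffer/texture indices are assumptions that must match the shader signatures.

```swift
import MetalKit

// Build a render pipeline state around the pass-through vertex/fragment shaders.
func makePipelineState(device: MTLDevice, view: MTKView) -> MTLRenderPipelineState? {
    guard let library = device.makeDefaultLibrary() else { return nil }
    let descriptor = MTLRenderPipelineDescriptor()
    descriptor.vertexFunction = library.makeFunction(name: "vertexPassThrough")
    descriptor.fragmentFunction = library.makeFunction(name: "fragmentPassThrough")
    descriptor.colorAttachments[0].pixelFormat = view.colorPixelFormat
    return try? device.makeRenderPipelineState(descriptor: descriptor)
}

// Encode one frame: bind the pipeline, the vertex and texture-coordinate
// buffers, and the camera texture, then draw a full-screen triangle strip.
func draw(texture: MTLTexture,
          in view: MTKView,
          pipelineState: MTLRenderPipelineState,
          commandQueue: MTLCommandQueue,
          vertexBuffer: MTLBuffer,
          texCoordBuffer: MTLBuffer) {
    guard let drawable = view.currentDrawable,
          let passDescriptor = view.currentRenderPassDescriptor,
          let commandBuffer = commandQueue.makeCommandBuffer(),
          let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: passDescriptor) else { return }

    encoder.setRenderPipelineState(pipelineState)
    encoder.setVertexBuffer(vertexBuffer, offset: 0, index: 0)
    encoder.setVertexBuffer(texCoordBuffer, offset: 0, index: 1)
    encoder.setFragmentTexture(texture, index: 0)
    encoder.drawPrimitives(type: .triangleStrip, vertexStart: 0, vertexCount: 4)
    encoder.endEncoding()

    commandBuffer.present(drawable)
    commandBuffer.commit()
}
```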
The CameraVirtualBackgroundProcessor class is responsible for processing video frames to apply a virtual background using CoreML and Metal. It implements the CameraProcessor protocol, which defines methods for processing video frames and applying a background image. The implementation uses the DeepLabV3 model for semantic image segmentation to separate the foreground (a person) from the background in video frames. The model is loaded using CoreML, and the segmentation mask generated by the model is used to blend the input frame with a virtual background image. This process involves resizing the input frame, running the segmentation model, creating a Metal buffer for the segmentation mask, and using a compute shader to render the final output. This class leverages the GPU for efficient real-time video processing, making it suitable for applications such as virtual backgrounds in video conferencing or live streaming.
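As the class diagram shows, the boundary between the controller and the processor is a small protocol; the controller only needs these two calls:

```swift
import CoreGraphics
import CoreVideo

// CameraProcessor as listed in the class diagram: a conforming type takes a
// camera frame and returns a processed frame (or nil), and can be handed a
// background image to composite with.
protocol CameraProcessor {
    func process(framePixelBuffer: CVPixelBuffer) -> CVPixelBuffer?
    func applyBackgroundImage(image: CGImage)
}
```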
Here’s how it works:
- Initialization: The initializer sets up a Metal-based image processing pipeline. It creates a Metal device and command queue, loads the `mixer` compute shader from the `Mixer.metal` file, and builds a compute pipeline. A texture cache is initialized to convert pixel buffers into Metal textures, and an output texture is created for GPU processing. The `pixelData` array is also allocated as a buffer for raw pixel values; it is used later to copy the `outputTexture` data into the output pixel buffer.
- Resizing the Frame: The `process` method first checks whether `backgroundTexture` is available. If it is not, the method simply returns `framePixelBuffer`, the frame captured by the camera. Otherwise, it resizes the input pixel buffer to the 513x513 dimensions expected by the model using the `resizePixelBuffer` utility, which scales a `CVPixelBuffer` to the specified dimensions without preserving the aspect ratio. The utility extracts a region from the source buffer, scales it with the Accelerate framework's vImage API for efficient resizing, handles pixel format consistency, and creates a new `CVPixelBuffer` with the resized image data. The resized frame is then fed to the model to generate the segmentation mask (see the resizing sketch after this list).
- Loading the Model: The `DeepLabV3` model is loaded by the `getDeepLabV3Model` method, which lazily initializes it with a configuration on first use. The model labels image regions with 21 semantic classes, including categories such as bicycle, bus, car, cat, and person, with the background treated as one of the classes. The model input is a 513x513 color image, and the output is a segmentation mask in which each pixel is assigned the class label of the object or background category it belongs to. The output mask is typically a tensor of shape (height, width, num_classes), where height and width match the input dimensions (513x513) and num_classes is the number of categories the model was trained on (21 for DeepLabV3). The semantic class index for the person category is 15.
- Generating the Segmentation Mask: The model's `prediction` method is called with the resized input frame to generate the segmentation mask, which is then copied into a Metal buffer for use in the compute shader (see the prediction sketch after this list).
- Rendering the Frame: The `render` method uses a compute shader to blend the input video texture with the virtual background texture based on the segmentation mask. It sets up the necessary textures, buffers, and sampler state, dispatches the compute kernel, and commits the command buffer for execution. The compute shader in `Mixer.metal` performs the blending by checking the segmentation mask and combining the input and background textures accordingly; the result is written to `outputTexture`. This enables real-time virtual background processing for video frames (see the dispatch sketch after this list).
- Displaying the Frame: The final step of the `process` method copies the output texture data into the pixel buffer and returns it for display by the `CameraPreview` class, as explained in the Camera Preview Rendering section (see the copy-back sketch after this list).
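Resizing sketch: a minimal version of the resize step, assuming BGRA frames and plain vImage scaling without preserving the aspect ratio. The project's `resizePixelBuffer` helper may differ in details such as region extraction and error handling.

```swift
import Accelerate
import CoreVideo

// Scale a BGRA pixel buffer to the given dimensions (e.g. 513x513 for the
// model input) using vImage. Aspect ratio is intentionally not preserved.
func resizePixelBuffer(_ src: CVPixelBuffer, width: Int, height: Int) -> CVPixelBuffer? {
    CVPixelBufferLockBaseAddress(src, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(src, .readOnly) }
    guard let srcBase = CVPixelBufferGetBaseAddress(src) else { return nil }

    var srcBuffer = vImage_Buffer(data: srcBase,
                                  height: vImagePixelCount(CVPixelBufferGetHeight(src)),
                                  width: vImagePixelCount(CVPixelBufferGetWidth(src)),
                                  rowBytes: CVPixelBufferGetBytesPerRow(src))

    var dst: CVPixelBuffer?
    CVPixelBufferCreate(kCFAllocatorDefault, width, height,
                        CVPixelBufferGetPixelFormatType(src), nil, &dst)
    guard let dstPixelBuffer = dst else { return nil }

    CVPixelBufferLockBaseAddress(dstPixelBuffer, [])
    defer { CVPixelBufferUnlockBaseAddress(dstPixelBuffer, []) }
    guard let dstBase = CVPixelBufferGetBaseAddress(dstPixelBuffer) else { return nil }

    var dstBuffer = vImage_Buffer(data: dstBase,
                                  height: vImagePixelCount(height),
                                  width: vImagePixelCount(width),
                                  rowBytes: CVPixelBufferGetBytesPerRow(dstPixelBuffer))

    // BGRA and ARGB share the same 4-channel layout as far as scaling is concerned.
    let error = vImageScale_ARGB8888(&srcBuffer, &dstBuffer, nil, vImage_Flags(kvImageNoFlags))
    return error == kvImageNoError ? dstPixelBuffer : nil
}
```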
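Prediction sketch: run the model on the resized frame and pack the per-pixel labels into a Metal buffer for the compute shader. This assumes Apple's published DeepLabV3 Core ML model, whose generated Swift class exposes `prediction(image:)` and a `semanticPredictions` multi-array; adapt the names if your model's interface differs. Reducing the labels to a binary person mask (class index 15) is one possible encoding of the mask buffer, not necessarily the project's.

```swift
import CoreML
import CoreVideo
import Metal

// Run DeepLabV3 on a 513x513 frame and build a per-pixel person mask buffer.
func makeSegmentationMaskBuffer(model: DeepLabV3,
                                resizedFrame: CVPixelBuffer,
                                device: MTLDevice) -> MTLBuffer? {
    guard let output = try? model.prediction(image: resizedFrame) else { return nil }
    let labels = output.semanticPredictions   // 513x513 multi-array of class indices
    let count = labels.count                  // 513 * 513

    // Class index 15 corresponds to the "person" category in the 21-class label set.
    var personMask = [UInt8](repeating: 0, count: count)
    for i in 0..<count {
        personMask[i] = labels[i].int32Value == 15 ? 1 : 0
    }

    return device.makeBuffer(bytes: personMask,
                             length: count * MemoryLayout<UInt8>.stride,
                             options: .storageModeShared)
}
```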
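Dispatch sketch: how the blend pass might be encoded. The texture and buffer indices are assumptions that must line up with the kernel signature in Mixer.metal, and `waitUntilCompleted` is only there to keep the sketch synchronous.

```swift
import Metal

// Blend the camera texture with the background texture into outputTexture,
// guided by the segmentation mask buffer, via a compute kernel.
func encodeBlend(commandQueue: MTLCommandQueue,
                 pipeline: MTLComputePipelineState,
                 cameraTexture: MTLTexture,
                 backgroundTexture: MTLTexture,
                 outputTexture: MTLTexture,
                 maskBuffer: MTLBuffer) {
    guard let commandBuffer = commandQueue.makeCommandBuffer(),
          let encoder = commandBuffer.makeComputeCommandEncoder() else { return }

    encoder.setComputePipelineState(pipeline)
    encoder.setTexture(cameraTexture, index: 0)
    encoder.setTexture(backgroundTexture, index: 1)
    encoder.setTexture(outputTexture, index: 2)
    encoder.setBuffer(maskBuffer, offset: 0, index: 0)

    // One thread per output pixel, rounded up to whole threadgroups.
    let w = pipeline.threadExecutionWidth
    let h = pipeline.maxTotalThreadsPerThreadgroup / w
    let threadsPerGroup = MTLSize(width: w, height: h, depth: 1)
    let groups = MTLSize(width: (outputTexture.width + w - 1) / w,
                         height: (outputTexture.height + h - 1) / h,
                         depth: 1)
    encoder.dispatchThreadgroups(groups, threadsPerThreadgroup: threadsPerGroup)
    encoder.endEncoding()

    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
}
```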
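Copy-back sketch: read the blended texture into a BGRA pixel buffer for display, assuming matching dimensions and a CPU-accessible (shared) texture.

```swift
import Metal
import CoreVideo

// Copy the GPU output texture into a pixel buffer that CameraPreview can render.
func copy(texture: MTLTexture, into pixelBuffer: CVPixelBuffer) {
    CVPixelBufferLockBaseAddress(pixelBuffer, [])
    defer { CVPixelBufferUnlockBaseAddress(pixelBuffer, []) }

    guard let base = CVPixelBufferGetBaseAddress(pixelBuffer) else { return }
    let bytesPerRow = CVPixelBufferGetBytesPerRow(pixelBuffer)
    let region = MTLRegionMake2D(0, 0, texture.width, texture.height)

    // Reads the texture contents into CPU-accessible pixel buffer memory.
    texture.getBytes(base,
                     bytesPerRow: bytesPerRow,
                     from: region,
                     mipmapLevel: 0)
}
```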
