hft-latency-lab
Explore high-frequency trading latency with FPGA and ARM comparisons.
Pitch

This repository provides a comprehensive lab environment for analyzing high-frequency trading latency through a minimal, fully instrumented datapath on a Pynq-Z2. By pitting a tiny quantized MLP on the FPGA fabric against hard-coded reflex rules on the ARM CPU, it enables precise cycle counting and a clear picture of where latency actually accrues.

Description

Overview

The hft-latency-lab repository presents a minimal, fully instrumented high-frequency trading (HFT) datapath utilizing the Pynq-Z2 FPGA. This project conducts a performance comparison between a tiny quantized Multi-Layer Perceptron (MLP) running on FPGA and hard-coded reflex rules executed on an ARM CPU, focusing notably on cycle-accurate latency breakdowns.

Project Goals

This initiative aims to establish a measurable HFT datapath that interacts with real hardware rather than relying solely on Python backtesting. The setup includes:

  • Two Concurrent Processing Lanes:
    • Reflex Lane (CPU): Implements traditional, hard-coded rules on a standard CPU, which can also be deployed on the Pynq ARM.
    • Inference Lane (FPGA): Hosts a compact quantized MLP on the Pynq-Z2 fabric.
  • Performance Benchmarking: The current focus is on a System-on-Chip (SoC) benchmark that measures the latency incurred throughout the processing steps, comparing the ARM reflex to the FPGA MLP lane on the same chip.
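As a concrete illustration of the reflex lane, a hard-coded rule such as a cancel-storm check is only a few lines of Python. The class name, threshold, and window below are hypothetical placeholders for illustration, not the repository's actual logic:

```python
from collections import deque

class CancelStormGuard:
    """Hypothetical reflex rule: flag a 'cancel storm' when cancel
    messages arrive faster than a threshold within a sliding window.
    Names and thresholds are illustrative, not taken from the repo."""

    def __init__(self, max_cancels: int = 50, window_s: float = 0.001):
        self.max_cancels = max_cancels
        self.window_s = window_s
        self._cancels = deque()  # timestamps of recent cancel events

    def on_cancel(self, ts: float) -> bool:
        """Record a cancel; return True once the storm threshold is breached."""
        self._cancels.append(ts)
        # Evict events that have fallen out of the sliding window.
        while self._cancels and ts - self._cancels[0] > self.window_s:
            self._cancels.popleft()
        return len(self._cancels) > self.max_cancels
```

Rules like this are cheap to evaluate on the ARM core, which is what makes the CPU lane a meaningful baseline against the FPGA inference path.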

System Architecture

1. Original Setup (Host ↔ FPGA)

  • Data flows from the host through an X710 NIC, where it is converted into a compact LOB format and sent to the Pynq-Z2 over UDP.
  • The main components on the Pynq include:
    • Parser
    • Feature Pipeline
    • MLP Inference Stream
  • Scores are returned via UDP, merging insights from the reflex lane and FPGA for order simulations.
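Converting market data into a compact LOB message typically amounts to packing a fixed-layout binary struct before the UDP send. The field layout below is an assumed example for illustration, not the repository's actual wire format:

```python
import struct

# Hypothetical compact LOB snapshot layout (NOT the repo's real format):
# little-endian, fixed-point integer prices, one level per side.
#   uint64 seq | uint32 bid_px | uint32 bid_qty | uint32 ask_px | uint32 ask_qty
LOB_FMT = "<QIIII"
LOB_SIZE = struct.calcsize(LOB_FMT)  # 24 bytes

def pack_lob(seq, bid_px, bid_qty, ask_px, ask_qty):
    """Serialize one LOB snapshot into the fixed 24-byte layout."""
    return struct.pack(LOB_FMT, seq, bid_px, bid_qty, ask_px, ask_qty)

def unpack_lob(payload):
    """Inverse of pack_lob; returns the field tuple."""
    return struct.unpack(LOB_FMT, payload)

# Sending to the Pynq would then be a plain UDP sendto, e.g.:
#   sock.sendto(pack_lob(seq, bp, bq, ap, aq), (pynq_ip, port))
```

A fixed-size, integer-only layout like this keeps the FPGA-side parser trivial: no variable-length fields, no floating point in the datapath.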

2. Current SoC Benchmark (Inside Pynq)

In this setup, the Ethernet connection is bypassed, and all components are integrated within the Pynq:

  • ARM Reflex Lane: Python and C logic managing simple queuing and cancel-storm rules.
  • Neuro Lane (FPGA): Includes modules for traffic generation, feature extraction, MLP inference, and latency timing.

Different design overlays are used to break down where time is spent across the various computational operations: math, AXI transfers, DMA, and other control logic.
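The cycle counts these overlays report convert to wall-clock time via the 125 MHz fabric clock. A minimal sketch of that accounting, where the per-stage cycle counts are illustrative placeholders rather than measured values:

```python
CLOCK_HZ = 125_000_000  # Pynq-Z2 fabric clock used throughout the write-up

def cycles_to_us(cycles: int, clock_hz: int = CLOCK_HZ) -> float:
    """Convert a raw cycle count into microseconds at the given clock."""
    return cycles / clock_hz * 1e6

# Illustrative per-stage cycle counts (placeholders, NOT measured results)
stages = {"parser": 2_000, "features": 8_000, "mlp": 64, "dma+axi": 130_000}
total = sum(stages.values())
for name, cyc in stages.items():
    share = 100 * cyc / total
    print(f"{name:10s} {cyc:>8d} cyc  {cycles_to_us(cyc):8.2f} us  {share:5.1f}%")
```

Even with made-up numbers, this kind of table makes the headline finding obvious: the MLP stage is a rounding error next to the data-movement shell around it.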

Key Results

Performance Metrics

Performance is evaluated at 125 MHz on the Pynq-Z2, with cycle measurements as follows:

  • MLP Computational Latency: Approximately 64 cycles (~0.5 µs).
  • Overall Fabric Latency: Ranges from ~140k cycles (~1.0-1.3 ms) without DMA to ~3.4-3.6 ms with DMA included.
  • ARM Reflex Lane Performance: Approximately 16-20 µs, meaning the CPU path is roughly 100x faster end to end than the FPGA lane.

CDF Analysis

Cycles are further analyzed through cumulative distribution functions (CDFs) captured during the benchmarking, detailing the p50 and p99 latency metrics for both ARM and FPGA processing paths.
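Given raw per-run cycle counts, the p50 and p99 points of such a CDF can be extracted with a simple nearest-rank percentile. A stdlib-only sketch, where the `percentile` helper and the sample data are illustrative stand-ins for the repo's actual post-processing scripts:

```python
def percentile(samples, q):
    """Nearest-rank percentile (q in [0, 100]) over a list of latencies:
    the smallest value with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    xs = sorted(samples)
    k = max(0, min(len(xs) - 1, -(-q * len(xs) // 100) - 1))  # ceil-division rank
    return xs[k]

# Example: cycle counts for 1000 hypothetical runs
lat = list(range(1, 1001))
p50, p99 = percentile(lat, 50), percentile(lat, 99)
```

Reporting p99 alongside p50 matters here because tail latency, not the median, is what decides whether an HFT path is usable.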

Conclusion

This repository serves not only as a benchmarking tool but also as a study of the latencies involved in HFT systems on the Pynq-Z2. The central finding is that while a 64-cycle MLP core processes data quickly, it is constrained by the 100k+ cycle shell around it, underscoring that the surrounding architecture, not raw compute, determines end-to-end latency.

For those interested in experimenting or analyzing the findings, detailed scripts and guidelines are provided within the repository.
