Release v1.3.0 · kubernetes-sigs/gateway-api-inference-extension

Noteworthy

LoRA Syncer

This release and future releases will not include the LoRA syncer image, because we are deprecating that feature. Similar functionality will still be available through the file system resolver. For model servers that do not yet support this form of LoRA management, but do support the discrete LoRA management endpoints that the lora-syncer uses, the old images will be kept indefinitely and can still be used.

In the next release, the LoRA syncer code will be removed from the codebase.

Flow Control

Flow Control continues to evolve with the addition of scale from/to zero support. Requests can now be sent to an EPP that has no model serving endpoints behind it, and the EPP emits metrics that an autoscaler can use to scale the pool back up.

In upcoming releases, we will continue working toward enabling this feature by default.
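To illustrate the scale-from-zero flow described above, here is a minimal Go sketch under stated assumptions. It is not the EPP's Flow Control API. The `pendingQueue` type and the `desiredReplicas` rule are hypothetical. Requests are parked rather than rejected while the pool has zero ready endpoints, the queue depth stands in for the metric an external autoscaler would read, and the queue is drained once an endpoint becomes ready.

```go
package main

import (
	"fmt"
	"sync"
)

// pendingQueue is a hypothetical stand-in for a flow-control queue that holds
// requests while the pool has no ready model-serving endpoints.
type pendingQueue struct {
	mu       sync.Mutex
	requests []string // IDs of requests waiting for an endpoint
}

// Enqueue parks a request instead of rejecting it when no endpoints exist.
func (q *pendingQueue) Enqueue(id string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.requests = append(q.requests, id)
}

// Len reports the queue depth. In a real deployment a value like this would be
// exported as a metric for an autoscaler to read (assumption for this sketch).
func (q *pendingQueue) Len() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.requests)
}

// DrainTo dispatches every queued request to an endpoint that has become ready
// and empties the queue.
func (q *pendingQueue) DrainTo(endpoint string) []string {
	q.mu.Lock()
	defer q.mu.Unlock()
	dispatched := make([]string, 0, len(q.requests))
	for _, id := range q.requests {
		dispatched = append(dispatched, id+"->"+endpoint)
	}
	q.requests = nil
	return dispatched
}

// desiredReplicas is a hypothetical autoscaler rule: scale from zero to one
// replica as soon as any request is waiting, otherwise keep the current count.
func desiredReplicas(ready, queued int) int {
	if ready == 0 && queued > 0 {
		return 1
	}
	return ready
}

func main() {
	q := &pendingQueue{}
	q.Enqueue("req-1")
	q.Enqueue("req-2")
	fmt.Println(desiredReplicas(0, q.Len())) // 1
	fmt.Println(q.DrainTo("pod-a"))          // [req-1->pod-a req-2->pod-a]
	fmt.Println(q.Len())                     // 0
}
```

The key design point is that queueing, not rejection, is what makes scale-from-zero possible: the pending requests both survive the cold start and provide the signal that triggers it.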

Standalone EPP

This functionality allows the EPP to be deployed as a proxy contained entirely within a single pod, with the EPP running as a sidecar container alongside the Envoy proxy. The feature was developed for batch inference scenarios and is currently considered experimental.

Fix(es)

We improved the behavior of the approximate prefix cache scorer when it is used with the llm-d prefill/decode (P/D) disaggregated setup.
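For context on what an approximate prefix cache scorer does, the following Go sketch shows the general technique under stated assumptions. It is not the plugin's implementation and does not model the P/D-specific change in this release. The prompt is split into fixed-size blocks whose hashes are chained, so each hash identifies the whole prefix up to that block. A hypothetical `prefixIndex` records which endpoints previously received which block hashes. An endpoint's score is the fraction of leading blocks it has seen. `blockSize`, `prefixIndex`, and the scoring formula are all illustrative assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// blockSize is a hypothetical number of prompt characters per block.
const blockSize = 4

// blockHashes splits a prompt into full fixed-size blocks and chains the hashes,
// so each hash depends on every block before it. Trailing partial blocks are ignored.
func blockHashes(prompt string) []uint64 {
	var hashes []uint64
	var prev uint64
	for start := 0; start+blockSize <= len(prompt); start += blockSize {
		h := fnv.New64a()
		fmt.Fprintf(h, "%d|%s", prev, prompt[start:start+blockSize])
		prev = h.Sum64()
		hashes = append(hashes, prev)
	}
	return hashes
}

// prefixIndex maps a block hash to the set of endpoints that were sent it.
// It reflects routing history only, which is why the score is approximate.
type prefixIndex struct {
	seen map[uint64]map[string]bool
}

func newPrefixIndex() *prefixIndex {
	return &prefixIndex{seen: map[uint64]map[string]bool{}}
}

// Record notes that a request with these block hashes was routed to endpoint.
func (ix *prefixIndex) Record(endpoint string, hashes []uint64) {
	for _, h := range hashes {
		if ix.seen[h] == nil {
			ix.seen[h] = map[string]bool{}
		}
		ix.seen[h][endpoint] = true
	}
}

// Score returns the fraction of leading blocks that endpoint has already seen.
func (ix *prefixIndex) Score(endpoint string, hashes []uint64) float64 {
	if len(hashes) == 0 {
		return 0
	}
	matched := 0
	for _, h := range hashes {
		if !ix.seen[h][endpoint] { // indexing a nil inner map yields false
			break
		}
		matched++
	}
	return float64(matched) / float64(len(hashes))
}

func main() {
	ix := newPrefixIndex()
	// An earlier request with prompt "abcdefgh" was routed to pod-a.
	ix.Record("pod-a", blockHashes("abcdefgh"))
	next := blockHashes("abcdefghijklmnop")
	fmt.Println(ix.Score("pod-a", next)) // 0.5
	fmt.Println(ix.Score("pod-b", next)) // 0
}
```

The index is "approximate" because it tracks what the router sent, not what the model server actually still holds in its KV cache, which may have evicted some blocks.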

What's Changed

Added crd validation ci workflow. by @bexxmodd in #1879
chore: bump sim version by @nirrozenbaum in #1890
feat(conformance): add conformance test for verifying x-gateway-destination-endpoint-served by @zetxqx in #1862
Add deprecation notice on metrics port in runner and datastore by @elevran in #1886
refactor: Flatten Flow Control inter-flow policy plugin directory structure by @LukeAVanDrie in #1841
Execute prepare data plugins in topological order of data dependencies by @rahulgurnani in #1878
chore(deps): bump go.uber.org/zap from 1.27.0 to 1.27.1 by @dependabot[bot] in #1896
chore(deps): bump google.golang.org/grpc from 1.76.0 to 1.77.0 by @dependabot[bot] in #1897
chore(deps): bump github.com/prometheus/common from 0.67.2 to 0.67.4 by @dependabot[bot] in #1895
enhance bbr helm chart to generalize cmd-line args by @nirrozenbaum in #1900
feat: Add totalRunningRequests metric for latency predictor by @BenjaminBraunDev in #1899
chore(deps): bump sigs.k8s.io/structured-merge-diff/v6 from 6.3.0 to 6.3.1 by @dependabot[bot] in #1898
SLO Aware Routing Sidecar + Plugin EPP Integration and Helm Deployment by @BenjaminBraunDev in #1839
Use the correct vllm metric gpu_cache_usage_perc --> kv_cache_usage_perc by @ezrasilvera in #1905
fix: fixed helm chart by @capri-xiyue in #1907
docs: add Kgateway BBR documentation by @howardjohn in #1908
Implement EPP Plugins by datalayer objects by @elevran in #1901
feat: Implement Model Rewrite and Traffic Splitting Logic by @zetxqx in #1820
docs: Updated quickstart to use stable Istio release 1.28.0 by @atharva-310 in #1902
fix(release): correctly update lora-syncer and epp image tags across RC and final releases by @googs1025 in #1916
fix: sort InferenceModelRewrite lists by (Namespace, Name) in tests by @googs1025 in #1917
Define and register plugin factories for datalayer by @elevran in #1911
fix: Properly install the InferenceModelRewrite CRD using kustomize by @shmuelk in #1934
Move AllPodsPredicate to datastore package by @elevran in #1939
Add automatic TLS certificate reloading for EPP by @pierDipi in #1765
feat(modelRewrite): Add metrics for InferenceModelRewrite decisions by @zetxqx in #1938
fix: CI golangci-lint errors by @shmuelk in #1948
Update inference perf chart to match upstream chart + Add Prefix Cache Github Actions by @rlakhtakia in #1949
Standardize plugins.TypedName field name from 'tn' to 'typedName' by @rohithnarasimha in #1918
Update inference perf chart to use new hf token structure. by @rlakhtakia in #1955
fix infinite loop in profile picker and switch predictor based routing to on by default with a header to disable by @BenjaminBraunDev in #1929
fix config load error when picker is set before the scorer w/o weight. by @zetxqx in #1958
add kaushikmitr as approver of slo aware routing plugin by @kaushikmitr in #1956
refactor: [Scale from Zero] Introduce PodLocator by @LukeAVanDrie in #1950
feat: add config validation in predicted-latency-scorer plugin by @googs1025 in #1904
Run tests with two data layer implementations by @irar2 in #1930
Rename PodInfo struct to EndpointMetadata to better reflect its purpose by @shmuelk in #1866
feat(metrics): add scheduler attempt counter by @googs1025 in #1931
chore: update released quickstart to v1.2.1 by @nirrozenbaum in #1941
generalize latest release quickstart by @nirrozenbaum in #1966
chore(deps): bump github.com/onsi/ginkgo/v2 from 2.27.2 to 2.27.3 by @dependabot[bot] in #1971
chore(deps): bump golang.org/x/sync from 0.18.0 to 0.19.0 by @dependabot[bot] in #1972
chore(deps): bump go.opentelemetry.io/otel/sdk from 1.38.0 to 1.39.0 by @dependabot[bot] in #1975
refactor: Standardize config loading and system default injection by @LukeAVanDrie in #1953
chore(deps): bump github.com/onsi/gomega from 1.38.2 to 1.38.3 by @dependabot[bot] in #1974
chore(deps): bump go.opentelemetry.io/otel/exporters/stdout/stdouttrace from 1.38.0 to 1.39.0 by @dependabot[bot] in #1973
feat: Enable Scale-from-Zero with Flow Control enabled by @LukeAVanDrie in #1952
feature: (helm) support custom volumes and volumeMounts for epp by @delavet in #1945
Use spf13/pflag instead of Go's standard flag package by @elevran in #1979
Extend textual configuration support with the Datalayer's configuration by @shmuelk in #1914
test/integration: introduce robust harness and migrate BBR suite by @LukeAVanDrie in #1959
test/bbr: fix startup race condition and IPv6 address formatting by @LukeAVanDrie in #1987
[chore]Bump vLLM Image Tags by @Frapschen in #1733
Add Prefill Heavy E2E Test to Github Actions by @rlakhtakia in #1894
Add decode heavy benchmark e2e test to github actions. by @rlakhtakia in #1893
BBR multi lora guide by @davidbreitgand in #1940
[feat] Add running requests scorer and tests by @BenjaminBraunDev in #1957
Implement PrepareDataPlugin for prefix cache match plugin by @rahulgurnani in #1942
Define and implement command line parsing with Options struct by @elevran in #1984
fix(inferenceModelRewrites): conditionally skip watching InferenceModelRewrite and InferenceObjective by @zetxqx in #1967
Add e2e test for multiport InferencePool enhancement by @RyanRosario in #1885
chore(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc from 1.38.0 to 1.39.0 by @dependabot[bot] in #1997
flowcontrol: refactor registry config to support dynamic priority provisioning by @LukeAVanDrie in #2001
test(e2e): use kustomize to install all the crds by @zetxqx in #1990
chore(deps): bump github.com/prometheus/prometheus from 0.307.3 to 0.308.0 by @dependabot[bot] in #1999
chore(deps): bump the kubernetes group with 6 updates by @dependabot[bot] in #1996
chore(deps): bump github.com/spf13/pflag from 1.0.7 to 1.0.10 by @dependabot[bot] in #2000
remove duplicate lora adapter scorer entry in docs by @strangiato in #2009
doc: add doc for InferenceModelRewrites. by @zetxqx in #1978
Update benchmarking to use correct secret by @rlakhtakia in #2004
flowcontrol: Support dynamic priority provisioning by @LukeAVanDrie in #2006
pkg/epp: use labels.Equals for label comparison by @ErikJiang in #2015
fix: fix header parsing to prevent trace ID loss by @LukeAVanDrie in #2024
fix: decouple streaming usage parsing from [DONE] signal to handle network fragmentation by @LukeAVanDrie in #2026
refactor: Flatten Flow Control intra-flow policy plugin directory structure by @LukeAVanDrie in #1840
fix: harden header sanitization and handling logic by @LukeAVanDrie in #2025
Refactor: Prepare EPP SaturationDetection as an Extension Point by @LukeAVanDrie in #1976
test: fix flaky garbage collection test by @LukeAVanDrie in #2014
feat(flowcontrol): add pool and model labels to metrics by @LukeAVanDrie in #2010
add preparedata plugin to latency based scorer to consume prefix states by @kaushikmitr in #2005
chore(deps): bump github.com/prometheus/prometheus from 0.308.0 to 0.308.1 by @dependabot[bot] in #2036
cleanup: Migrate raw map sets to k8s.io/apimachinery/pkg/util/sets by @LukeAVanDrie in #2030
test: expose fake data store for downstream tests by @MregXN in #2027
chore(deps): bump google.golang.org/protobuf from 1.36.10 to 1.36.11 by @dependabot[bot] in #2037
chore(deps): bump google.golang.org/grpc from 1.77.0 to 1.78.0 by @dependabot[bot] in #2043
bbr configmap reconciler and bbr datastore by @nirrozenbaum in #2045
refactor: refactor monitoring session by @capri-xiyue in #1906
fix: correctly handle zero fresh pods in pool metrics by @googs1025 in #2049
Set up data layer based on configuration by @elevran in #2046
track base models in bbr by @nirrozenbaum in #2050
bbr helm chart rbac enhancements for multi pool management by @nirrozenbaum in #2047
setup configmap reconciler with controller manager by @nirrozenbaum in #2051
create httproute via helm chart by @nirrozenbaum in #2054
Double check the flow before marking it idle by @shmuelk in #2041
cleanup: remove min helper function by @ErikJiang in #2052
feat: add DeadlinePriority plugin in intra-flow dispatch policy by @googs1025 in #1960
test/integration: introduce robust harness and migrate EPP suite by @LukeAVanDrie in #2022
chore(deps): bump github.com/prometheus/common from 0.67.4 to 0.67.5 by @dependabot[bot] in #2059
fix prometheus auth by @sallyom in #2061
Limit response body size by @adelsam in #2058
fix: ensure ResponseComplete hook always executes by @LukeAVanDrie in #2064
chore(comment): correct OpenAI chat completions endpoint path by @googs1025 in #2065
fix bbr image build. by @zetxqx in #2066
remove setup log when it is not needed to pass it as arg by @nirrozenbaum in #2069
enable configmap controller in bbr by @nirrozenbaum in #2067
enable bbr integration tests in makefile by @nirrozenbaum in #2071
added optional base model flag to inferencepool helm chart by @nirrozenbaum in #2073
typo in bbr helm chart rbac by @nirrozenbaum in #2074
typo fix by @nirrozenbaum in #2075
Conformance report for NGINX Gateway Fabric by @sjberman in #2023
"non streaming mode" configuration to the SLO-aware router, by @kaushikmitr in #2048
add server side filtering based on namespace if the ns env var is set by @nirrozenbaum in #2077
[release-1.3] prefill aware prefix plugin by @k8s-infra-cherrypick-robot in #2106
[release-1.3] Fixed targetPorts copy error by @k8s-infra-cherrypick-robot in #2107
[release-1.3] changed httproute creation to be behind a flag. by @k8s-infra-cherrypick-robot in #2129
[release-1.3] rename of experimental http route creation section in helm by @k8s-infra-cherrypick-robot in #2130
[release-1.3] fix: [Flow Control]: Optionally disable endpoint subset filtering while dispatching requests by @k8s-infra-cherrypick-robot in #2155
[release-1.3] Increase default FlowGCTimeout to 1h to prevent premature GC by @k8s-infra-cherrypick-robot in #2154
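Several entries above (#1878, #1942, #2005) concern prepare-data plugins, which run in the topological order of their data dependencies so that a plugin producing data (for example, prefix match state) runs before a plugin consuming it. The following Go sketch illustrates that ordering with Kahn's algorithm. The `plugin` struct, its `produces`/`consumes` fields, and the plugin names are hypothetical and do not reflect the project's actual plugin interfaces.

```go
package main

import (
	"fmt"
	"sort"
)

// plugin is a hypothetical description of a prepare-data plugin: the data keys
// it produces and the keys it needs from other plugins.
type plugin struct {
	name     string
	produces []string
	consumes []string
}

// topoOrder returns plugin names so that every producer runs before its
// consumers (Kahn's algorithm). Ties are broken alphabetically to keep the
// order deterministic. Keys with no producer are ignored. An error is
// returned if the dependencies form a cycle.
func topoOrder(plugins []plugin) ([]string, error) {
	producer := map[string]string{} // data key -> producing plugin
	indegree := map[string]int{}    // plugin -> number of unmet dependencies
	for _, p := range plugins {
		indegree[p.name] = 0
		for _, k := range p.produces {
			producer[k] = p.name
		}
	}
	dependents := map[string][]string{} // producer -> plugins waiting on it
	for _, p := range plugins {
		for _, k := range p.consumes {
			if src, ok := producer[k]; ok && src != p.name {
				dependents[src] = append(dependents[src], p.name)
				indegree[p.name]++
			}
		}
	}
	var ready []string
	for name, d := range indegree {
		if d == 0 {
			ready = append(ready, name)
		}
	}
	var order []string
	for len(ready) > 0 {
		sort.Strings(ready)
		next := ready[0]
		ready = ready[1:]
		order = append(order, next)
		for _, dep := range dependents[next] {
			indegree[dep]--
			if indegree[dep] == 0 {
				ready = append(ready, dep)
			}
		}
	}
	if len(order) != len(indegree) {
		return nil, fmt.Errorf("cycle detected among prepare-data plugins")
	}
	return order, nil
}

func main() {
	plugins := []plugin{
		{name: "latency-predictor", consumes: []string{"prefix-match"}},
		{name: "prefix-cache", produces: []string{"prefix-match"}},
		{name: "metrics", produces: []string{"queue-depth"}},
	}
	order, err := topoOrder(plugins)
	fmt.Println(order, err) // [metrics prefix-cache latency-predictor] <nil>
}
```

Detecting cycles up front matters here: a configuration in which two plugins each wait on the other's output has no valid execution order and should be rejected at load time rather than deadlocking per request.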

New Contributors

@ezrasilvera made their first contribution in #1905
@atharva-310 made their first contribution in #1902
@rohithnarasimha made their first contribution in #1918
@RyanRosario made their first contribution in #1885
@strangiato made their first contribution in #2009
@MregXN made their first contribution in #2027
@adelsam made their first contribution in #2058
@sjberman made their first contribution in #2023
@k8s-infra-cherrypick-robot made their first contribution in #2106

Full Changelog: v1.2.1...v1.3.0
