Noteworthy
LoRA Syncer
This release and future releases will not ship a lora-syncer image, because we are deprecating that feature. Similar functionality remains available through the file system resolver. For model servers that do not yet support this form of LoRA management but do support the discrete LoRA management endpoints that the lora-syncer uses, the old images will be kept indefinitely and can still be used.
In the next release, the lora syncer code will be removed from the codebase.
Flow Control
Flow Control continues to evolve with the addition of Scale from/to Zero support. This allows requests to be sent to an EPP with no model serving endpoints behind it, while the EPP emits metrics that the autoscaler can use to scale up the pool.
In upcoming releases we will continue working toward enabling this feature by default.
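The scale-from-zero behavior described above can be pictured with a small Go sketch. It is only an illustration under stated assumptions: `HoldQueue`, `Submit`, `QueueDepth`, and `SetEndpoints` are hypothetical names invented here, not the EPP's actual types or API, and the real Flow Control layer is considerably more involved. The sketch shows requests being parked while no endpoints exist, a queue-depth value that an autoscaler could read as a scale-up signal, and the parked requests being released once endpoints appear.

```go
package main

import (
	"fmt"
	"sync"
)

// HoldQueue is a hypothetical stand-in for a flow control queue.
// It is NOT the project's actual type; all names here are assumptions.
type HoldQueue struct {
	mu        sync.Mutex
	pending   []string // request IDs waiting for an endpoint
	endpoints []string // currently available model server endpoints
}

// Submit returns an endpoint right away when one exists. With zero
// endpoints it parks the request and reports that it was queued.
func (q *HoldQueue) Submit(reqID string) (endpoint string, queued bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.endpoints) == 0 {
		q.pending = append(q.pending, reqID)
		return "", true
	}
	return q.endpoints[0], false
}

// QueueDepth is the kind of value an autoscaler could consume as a
// signal that the pool should scale up from zero.
func (q *HoldQueue) QueueDepth() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.pending)
}

// SetEndpoints records the new endpoint set. If it is non-empty, the
// parked requests are assigned to endpoints round-robin and released.
func (q *HoldQueue) SetEndpoints(eps []string) map[string]string {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.endpoints = eps
	dispatched := make(map[string]string)
	if len(eps) == 0 {
		return dispatched
	}
	for i, id := range q.pending {
		dispatched[id] = eps[i%len(eps)]
	}
	q.pending = nil
	return dispatched
}

func main() {
	q := &HoldQueue{}
	q.Submit("req-1")
	q.Submit("req-2")
	fmt.Println("queue depth with zero endpoints:", q.QueueDepth())

	out := q.SetEndpoints([]string{"pod-a"})
	fmt.Println("dispatched after scale-up:", len(out), "remaining:", q.QueueDepth())
}
```

In this sketch the queue depth is the only scaling signal; which metrics the EPP actually emits, and how an autoscaler consumes them, is not specified in these notes.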
Standalone EPP
This functionality allows the EPP to be deployed as a proxy, fully contained within a single pod. This is achieved by running the EPP as a sidecar container alongside the Envoy proxy. This feature was developed for batch inference scenarios and is currently considered experimental.
Fix(es)
- We improved the functionality of the approximate prefix cache scorer when working with the llm-d P/D setup
What's Changed
- Added crd validation ci workflow. by @bexxmodd in #1879
- chore: bump sim version by @nirrozenbaum in #1890
- feat(conformance): add conformance test for verifying x-gateway-destination-endpoint-served by @zetxqx in #1862
- Add deprecation notice on metrics port in runner and datastore by @elevran in #1886
- refactor: Flatten Flow Control inter-flow policy plugin directory structure by @LukeAVanDrie in #1841
- Execute prepare data plugins in topological order of data dependencies by @rahulgurnani in #1878
- chore(deps): bump go.uber.org/zap from 1.27.0 to 1.27.1 by @dependabot[bot] in #1896
- chore(deps): bump google.golang.org/grpc from 1.76.0 to 1.77.0 by @dependabot[bot] in #1897
- chore(deps): bump github.com/prometheus/common from 0.67.2 to 0.67.4 by @dependabot[bot] in #1895
- enhance bbr helm chart to generalize cmd-line args by @nirrozenbaum in #1900
- feat: Add totalRunningRequests metric for latency predictor by @BenjaminBraunDev in #1899
- chore(deps): bump sigs.k8s.io/structured-merge-diff/v6 from 6.3.0 to 6.3.1 by @dependabot[bot] in #1898
- SLO Aware Routing Sidecar + Plugin EPP Integration and Helm Deployment by @BenjaminBraunDev in #1839
- Use the correct vllm metric gpu_cache_usage_perc --> kv_cache_usage_perc by @ezrasilvera in #1905
- fix: fixed helm chart by @capri-xiyue in #1907
- docs: add Kgateway BBR documentation by @howardjohn in #1908
- Implement EPP Plugins by datalayer objects by @elevran in #1901
- feat: Implement Model Rewrite and Traffic Splitting Logic by @zetxqx in #1820
- docs: Updated quickstart to use stable Istio release 1.28.0 by @atharva-310 in #1902
- fix(release): correctly update lora-syncer and epp image tags across RC and final releases by @googs1025 in #1916
- fix: sort InferenceModelRewrite lists by (Namespace, Name) in tests by @googs1025 in #1917
- Define and register plugin factories for datalayer by @elevran in #1911
- fix: Properly install the InferenceModelRewrite CRD using kustomize by @shmuelk in #1934
- Move AllPodsPredicate to datastore package by @elevran in #1939
- Add automatic TLS certificate reloading for EPP by @pierDipi in #1765
- feat(modelRewrite): Add metrics for InferenceModelRewrite decisions by @zetxqx in #1938
- fix: CI golangci-lint errors by @shmuelk in #1948
- Update inference perf chart to match upstream chart + Add Prefix Cache Github Actions by @rlakhtakia in #1949
- Standardize plugins.TypedName field name from 'tn' to 'typedName' by @rohithnarasimha in #1918
- Update inference perf chart to use new hf token structure. by @rlakhtakia in #1955
- fix infinite loop in profile picker and switch predictor based routing to on by default with a header to disable by @BenjaminBraunDev in #1929
- fix config load error when picker is set before the scorer w/o weight. by @zetxqx in #1958
- add kaushikmitr as approver of slo aware routing plugin by @kaushikmitr in #1956
- refactor: [Scale from Zero] Introduce PodLocator by @LukeAVanDrie in #1950
- feat: add config validation in predicted-latency-scorer plugin by @googs1025 in #1904
- Run tests with two data layer implementations by @irar2 in #1930
- Rename PodInfo struct to EndpointMetadata to better reflect its purpose by @shmuelk in #1866
- feat(metrics): add scheduler attempt counter by @googs1025 in #1931
- chore: update released quickstart to v1.2.1 by @nirrozenbaum in #1941
- generalize latest release quickstart by @nirrozenbaum in #1966
- chore(deps): bump github.com/onsi/ginkgo/v2 from 2.27.2 to 2.27.3 by @dependabot[bot] in #1971
- chore(deps): bump golang.org/x/sync from 0.18.0 to 0.19.0 by @dependabot[bot] in #1972
- chore(deps): bump go.opentelemetry.io/otel/sdk from 1.38.0 to 1.39.0 by @dependabot[bot] in #1975
- refactor: Standardize config loading and system default injection by @LukeAVanDrie in #1953
- chore(deps): bump github.com/onsi/gomega from 1.38.2 to 1.38.3 by @dependabot[bot] in #1974
- chore(deps): bump go.opentelemetry.io/otel/exporters/stdout/stdouttrace from 1.38.0 to 1.39.0 by @dependabot[bot] in #1973
- feat: Enable Scale-from-Zero with Flow Control enabled by @LukeAVanDrie in #1952
- feature: (helm) support custom volumes and volumeMounts for epp by @delavet in #1945
- Use spf13/pflag instead of Go's standard flag package by @elevran in #1979
- Extend textual configuration support with the Datalayer's configuration by @shmuelk in #1914
- test/integration: introduce robust harness and migrate BBR suite by @LukeAVanDrie in #1959
- test/bbr: fix startup race condition and IPv6 address formatting by @LukeAVanDrie in #1987
- [chore]Bump vLLM Image Tags by @Frapschen in #1733
- Add Prefill Heavy E2E Test to Github Actions by @rlakhtakia in #1894
- Add decode heavy benchmark e2e test to github actions. by @rlakhtakia in #1893
- BBR multi lora guide by @davidbreitgand in #1940
- [feat] Add running requests scorer and tests by @BenjaminBraunDev in #1957
- Implement PrepareDataPlugin for prefix cache match plugin by @rahulgurnani in #1942
- Define and implement command line parsing with Options struct by @elevran in #1984
- fix(inferenceModelRewrites): conditionally skip watching InferenceModelRewrite and InferenceObjective by @zetxqx in #1967
- Add e2e test for multiport InferencePool enhancement by @RyanRosario in #1885
- chore(deps): bump go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc from 1.38.0 to 1.39.0 by @dependabot[bot] in #1997
- flowcontrol: refactor registry config to support dynamic priority provisioning by @LukeAVanDrie in #2001
- test(e2e): use kustomize to install all the crds by @zetxqx in #1990
- chore(deps): bump github.com/prometheus/prometheus from 0.307.3 to 0.308.0 by @dependabot[bot] in #1999
- chore(deps): bump the kubernetes group with 6 updates by @dependabot[bot] in #1996
- chore(deps): bump github.com/spf13/pflag from 1.0.7 to 1.0.10 by @dependabot[bot] in #2000
- remove duplicate lora adapter scorer entry in docs by @strangiato in #2009
- doc: add doc for inferencemodelrewrites. by @zetxqx in #1978
- Update benchmarking to use correct secret by @rlakhtakia in #2004
- flowcontrol: Support dynamic priority provisioning by @LukeAVanDrie in #2006
- pkg/epp: use labels.Equals for label comparison by @ErikJiang in #2015
- fix: fix header parsing to prevent trace ID loss by @LukeAVanDrie in #2024
- fix: decouple streaming usage parsing from [DONE] signal to handle network fragmentation by @LukeAVanDrie in #2026
- refactor: Flatten Flow Control intra-flow policy plugin directory structure by @LukeAVanDrie in #1840
- fix: harden header sanitization and handling logic by @LukeAVanDrie in #2025
- Refactor: Prepare EPP SaturationDetection as an Extension Point by @LukeAVanDrie in #1976
- test: fix flaky garbage collection test by @LukeAVanDrie in #2014
- feat(flowcontrol): add pool and model labels to metrics by @LukeAVanDrie in #2010
- add preparedata plugin to latency based scorer to consume prefix states by @kaushikmitr in #2005
- chore(deps): bump github.com/prometheus/prometheus from 0.308.0 to 0.308.1 by @dependabot[bot] in #2036
- cleanup: Migrate raw map sets to k8s.io/apimachinery/pkg/util/sets by @LukeAVanDrie in #2030
- test: expose fake data store for downstream tests by @MregXN in #2027
- chore(deps): bump google.golang.org/protobuf from 1.36.10 to 1.36.11 by @dependabot[bot] in #2037
- chore(deps): bump google.golang.org/grpc from 1.77.0 to 1.78.0 by @dependabot[bot] in #2043
- bbr configmap reconciler and bbr datastore by @nirrozenbaum in #2045
- refactor: refactor monitoring session by @capri-xiyue in #1906
- fix: correctly handle zero fresh pods in pool metrics by @googs1025 in #2049
- Set up data layer based on configuration by @elevran in #2046
- track base models in bbr by @nirrozenbaum in #2050
- bbr helm chart rbac enhancements for multi pool management by @nirrozenbaum in #2047
- setup configmap reconciler with controller manager by @nirrozenbaum in #2051
- create httproute via helm chart by @nirrozenbaum in #2054
- Double check the flow before marking it idle by @shmuelk in #2041
- cleanup: remove min helper function by @ErikJiang in #2052
- feat: add DeadlinePriority plugin in intra-flow dispatch policy by @googs1025 in #1960
- test/integration: introduce robust harness and migrate EPP suite by @LukeAVanDrie in #2022
- chore(deps): bump github.com/prometheus/common from 0.67.4 to 0.67.5 by @dependabot[bot] in #2059
- fix prometheus auth by @sallyom in #2061
- Limit response body size by @adelsam in #2058
- fix: ensure ResponseComplete hook always executes by @LukeAVanDrie in #2064
- chore(comment): correct OpenAI chat completions endpoint path by @googs1025 in #2065
- fix bbr image build. by @zetxqx in #2066
- remove setup log when it is not needed to pass it as arg by @nirrozenbaum in #2069
- enable configmap controller in bbr by @nirrozenbaum in #2067
- enable bbr integration tests in makefile by @nirrozenbaum in #2071
- added optional base model flag to inferencepool helm chart by @nirrozenbaum in #2073
- typo in bbr helm chart rbac by @nirrozenbaum in #2074
- typo fix by @nirrozenbaum in #2075
- Conformance report for NGINX Gateway Fabric by @sjberman in #2023
- "non streaming mode" configuration to the SLO-aware router, by @kaushikmitr in #2048
- add server side filtering based on namespace if the ns env var is set by @nirrozenbaum in #2077
- [release-1.3] prefill aware prefix plugin by @k8s-infra-cherrypick-robot in #2106
- [release-1.3] Fixed targetPorts copy error by @k8s-infra-cherrypick-robot in #2107
- [release-1.3] changed httproute creation to be behind a flag. by @k8s-infra-cherrypick-robot in #2129
- [release-1.3] rename of experimental http route creation section in helm by @k8s-infra-cherrypick-robot in #2130
- [release-1.3] fix: [Flow Control]: Optionally disable endpoint subset filtering while dispatching requests by @k8s-infra-cherrypick-robot in #2155
- [release-1.3] Increase default FlowGCTimeout to 1h to prevent premature GC by @k8s-infra-cherrypick-robot in #2154
New Contributors
- @ezrasilvera made their first contribution in #1905
- @atharva-310 made their first contribution in #1902
- @rohithnarasimha made their first contribution in #1918
- @RyanRosario made their first contribution in #1885
- @strangiato made their first contribution in #2009
- @MregXN made their first contribution in #2027
- @adelsam made their first contribution in #2058
- @sjberman made their first contribution in #2023
- @k8s-infra-cherrypick-robot made their first contribution in #2106
Full Changelog: v1.2.1...v1.3.0