1 change: 1 addition & 0 deletions charts/spiderpool/README.md
@@ -135,6 +135,7 @@ helm install spiderpool spiderpool/spiderpool --wait --namespace kube-system \
| `ipam.enableKubevirtStaticIP` | the feature to keep kubevirt vm pod static IP | `true` |
| `ipam.enableIPConflictDetection` | enable IP conflict detection | `false` |
| `ipam.enableGatewayDetection` | enable gateway detection | `false` |
| `ipam.enableCleanOutdatedEndpoint` | enable cleaning up outdated SpiderEndpoints | `false` |
| `ipam.spiderSubnet.enable` | SpiderSubnet feature. | `true` |
| `ipam.spiderSubnet.autoPool.enable` | SpiderSubnet Auto IPPool feature. | `true` |
| `ipam.spiderSubnet.autoPool.defaultRedundantIPNumber` | the default redundant IP number of SpiderSubnet feature auto-created IPPools | `1` |
1 change: 1 addition & 0 deletions charts/spiderpool/templates/configmap.yaml
@@ -20,6 +20,7 @@ data:
enableIPv6: {{ .Values.ipam.enableIPv6 }}
enableStatefulSet: {{ .Values.ipam.enableStatefulSet }}
enableKubevirtStaticIP: {{ .Values.ipam.enableKubevirtStaticIP }}
enableCleanOutdatedEndpoint: {{ .Values.ipam.enableCleanOutdatedEndpoint }}
enableSpiderSubnet: {{ .Values.ipam.spiderSubnet.enable }}
enableAutoPoolForApplication: {{ .Values.ipam.spiderSubnet.autoPool.enable }}
enableIPConflictDetection: {{ .Values.ipam.enableIPConflictDetection }}
3 changes: 3 additions & 0 deletions charts/spiderpool/values.yaml
@@ -59,6 +59,9 @@ ipam:
## @param ipam.enableGatewayDetection enable gateway detection
enableGatewayDetection: false

## @param ipam.enableCleanOutdatedEndpoint enable cleaning up outdated SpiderEndpoints
enableCleanOutdatedEndpoint: false

spiderSubnet:
## @param ipam.spiderSubnet.enable SpiderSubnet feature.
enable: true
2 changes: 2 additions & 0 deletions cmd/spiderpool-controller/cmd/daemon.go
@@ -397,6 +397,8 @@ func initGCManager(ctx context.Context) {
gcIPConfig.EnableStatefulSet = controllerContext.Cfg.EnableStatefulSet
// EnableKubevirtStaticIP is determined by the ConfigMap.
gcIPConfig.EnableKubevirtStaticIP = controllerContext.Cfg.EnableKubevirtStaticIP
// EnableCleanOutdatedEndpoint is determined by the ConfigMap.
gcIPConfig.EnableCleanOutdatedEndpoint = controllerContext.Cfg.EnableCleanOutdatedEndpoint
gcIPConfig.LeaderRetryElectGap = time.Duration(controllerContext.Cfg.LeaseRetryGap) * time.Second
gcManager, err := gcmanager.NewGCManager(
controllerContext.ClientSet,
15 changes: 15 additions & 0 deletions docs/concepts/ipam-des-zh_CN.md
@@ -188,6 +188,21 @@ NOTE:

- For unexpected situations such as a node reboot that causes a Pod's sandbox container to restart, the Pod stays in the `Running` phase but the `podIPs` field in its status is cleared. Spiderpool used to reclaim the IP addresses of such Pods, which could lead to duplicate IP allocation. By default, Spiderpool no longer reclaims IPs for Pods in this state. The behavior is controlled by the [spiderpool-controller ENV](./../reference/spiderpool-controller.md#env) `SPIDERPOOL_GC_ENABLE_STATELESS_RUNNING_POD_ON_EMPTY_POD_STATUS_IPS` (default `false`).

### Reclaiming Zombie SpiderEndpoints

Spiderpool periodically scans SpiderEndpoints. If the IP allocation record for a Pod no longer exists in the IP pool but its SpiderEndpoint object still exists, Spiderpool reclaims that SpiderEndpoint. The feature can be switched on or off through the `spiderpool-conf` ConfigMap and defaults to `false`:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spiderpool-conf
  namespace: spiderpool
data:
  conf.yml: |
    ...
    enableCleanOutdatedEndpoint: true
    ...
```
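
For illustration, the scan cross-checks each SpiderEndpoint against the IPPools it references under `status.current.ips`; once a referenced pool no longer records an allocation for the endpoint, the object and its finalizer are released. An abbreviated, hypothetical SpiderEndpoint might look like this:

```yaml
# Abbreviated, illustrative SpiderEndpoint; names and addresses are made up.
apiVersion: spiderpool.spidernet.io/v2beta1
kind: SpiderEndpoint
metadata:
  name: demo-pod
  namespace: default
status:
  current:
    ips:
      - ipv4: 172.18.40.40/24
        ipv4Pool: default-v4-ippool
```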

### IP Conflict Detection and Gateway Reachability Detection

For underlay networks, IP conflicts are unacceptable and can cause serious problems. Spiderpool supports IP conflict detection and gateway reachability detection. These checks were previously implemented by the coordinator plugin, which could introduce potential communication issues; they are now performed by IPAM.
17 changes: 17 additions & 0 deletions docs/concepts/ipam-des.md
@@ -212,6 +212,23 @@ The above complete IP recovery algorithm can ensure the correct recovery of IP a

- For **stateless** Pods in the `Running` phase, Spiderpool will not release their IP addresses when the Pod's `status.podIPs` is empty. This behavior can be controlled through the spiderpool-controller environment variable `SPIDERPOOL_GC_ENABLE_STATELESS_RUNNING_POD_ON_EMPTY_POD_STATUS_IPS` (see the sketch below).
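
A minimal sketch of setting that variable on the spiderpool-controller Deployment (the exact container name and namespace depend on how Spiderpool was installed):

```yaml
# Illustrative only: add the variable to the spiderpool-controller container's env list.
env:
  - name: SPIDERPOOL_GC_ENABLE_STATELESS_RUNNING_POD_ON_EMPTY_POD_STATUS_IPS
    value: "true"
```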

### Clean Outdated SpiderEndpoint

Spiderpool periodically scans SpiderEndpoints. If it finds that the IP allocation record for a Pod in the IP pool no longer exists, but the corresponding SpiderEndpoint object still exists, Spiderpool will reclaim that SpiderEndpoint. This feature can be enabled or disabled via the `spiderpool-conf` ConfigMap and is disabled (`false`) by default:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: spiderpool-conf
  namespace: spiderpool
data:
  conf.yml: |
    ...
    enableCleanOutdatedEndpoint: true
    ...
```
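
When Spiderpool is installed from the Helm chart, the same switch is exposed as the chart value `ipam.enableCleanOutdatedEndpoint` (see `charts/spiderpool/values.yaml`), which the chart renders into the ConfigMap above. A minimal values override might look like this:

```yaml
# Helm values sketch: enable outdated-endpoint cleanup via the chart value.
ipam:
  enableCleanOutdatedEndpoint: true
```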

### IP Conflict Detection and Gateway Reachability Detection

For Underlay networks, IP conflicts are unacceptable as they can cause serious issues. Spiderpool supports IP conflict detection and gateway reachability detection; these checks were previously implemented by the coordinator plugin, which could introduce potential communication problems, and are now handled by IPAM.
19 changes: 11 additions & 8 deletions pkg/gcmanager/gc_manager.go
@@ -8,19 +8,20 @@ import (
"fmt"
"time"

"go.uber.org/zap"
"k8s.io/client-go/informers"
"k8s.io/client-go/kubernetes"

"github.com/spidernet-io/spiderpool/pkg/election"
"github.com/spidernet-io/spiderpool/pkg/ippoolmanager"
"github.com/spidernet-io/spiderpool/pkg/kubevirtmanager"
"github.com/spidernet-io/spiderpool/pkg/limiter"
"github.com/spidernet-io/spiderpool/pkg/lock"
"github.com/spidernet-io/spiderpool/pkg/logutils"
"github.com/spidernet-io/spiderpool/pkg/nodemanager"
"github.com/spidernet-io/spiderpool/pkg/podmanager"
"github.com/spidernet-io/spiderpool/pkg/statefulsetmanager"
"github.com/spidernet-io/spiderpool/pkg/workloadendpointmanager"

"go.uber.org/zap"
"k8s.io/client-go/informers"
"k8s.io/client-go/kubernetes"
)

type GarbageCollectionConfig struct {
@@ -30,6 +31,7 @@ type GarbageCollectionConfig struct {
EnableGCStatelessRunningPodOnEmptyPodStatusIPs bool
EnableStatefulSet bool
EnableKubevirtStaticIP bool
EnableCleanOutdatedEndpoint bool

ReleaseIPWorkerNum int
GCIPChannelBuffer int
@@ -77,6 +79,7 @@ type SpiderGC struct {

informerFactory informers.SharedInformerFactory
gcLimiter limiter.Limiter
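// Locker guards state shared by the concurrent scan goroutines, such as the suspicious endpoint map built in executeScanAll.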
Locker lock.Mutex
}

func NewGCManager(clientSet *kubernetes.Clientset, config *GarbageCollectionConfig,
@@ -114,10 +117,9 @@ func NewGCManager(clientSet *kubernetes.Clientset, config *GarbageCollectionConf
logger = logutils.Logger.Named("IP-GarbageCollection")

spiderGC := &SpiderGC{
k8ClientSet: clientSet,
PodDB: NewPodDBer(config.MaxPodEntryDatabaseCap),
gcConfig: config,
gcSignal: make(chan struct{}, 1),
gcIPPoolIPSignal: make(chan *PodEntry, config.GCIPChannelBuffer),

@@ -130,6 +132,7 @@ func NewGCManager(clientSet *kubernetes.Clientset, config *GarbageCollectionConf

leader: spiderControllerLeader,
gcLimiter: limiter.NewLimiter(limiter.LimiterConfig{}),
Locker: lock.Mutex{},
}

return spiderGC, nil
98 changes: 96 additions & 2 deletions pkg/gcmanager/scanAll_IPPool.go
@@ -13,14 +13,15 @@ import (
apierrors "k8s.io/apimachinery/pkg/api/errors"
"k8s.io/client-go/tools/cache"

corev1 "k8s.io/api/core/v1"

"github.com/spidernet-io/spiderpool/pkg/constant"
spiderpoolv2beta1 "github.com/spidernet-io/spiderpool/pkg/k8s/apis/spiderpool.spidernet.io/v2beta1"
"github.com/spidernet-io/spiderpool/pkg/logutils"
"github.com/spidernet-io/spiderpool/pkg/nodemanager"
"github.com/spidernet-io/spiderpool/pkg/podmanager"
"github.com/spidernet-io/spiderpool/pkg/types"
"github.com/spidernet-io/spiderpool/pkg/utils/convert"
corev1 "k8s.io/api/core/v1"
)

// monitorGCSignal will monitor signal from CLI, DefaultGCInterval
@@ -76,6 +77,18 @@ func (s *SpiderGC) monitorGCSignal(ctx context.Context) {

// executeScanAll scans all Pods and the whole IPPoolList
func (s *SpiderGC) executeScanAll(ctx context.Context) {
// Snapshot all SpiderEndpoints before scanning the pools. Endpoints that are still
// referenced by a pool's allocation records are removed from this map during the scan;
// whatever remains afterwards is suspected to be a leaked (outdated) endpoint.
suspiciousEndpointMap := make(map[string]*spiderpoolv2beta1.WorkloadEndpointStatus)
epList, err := s.wepMgr.ListEndpoints(ctx, constant.UseCache)
if err != nil {
logger.Sugar().Errorf("failed to list all SpiderEndpoints: %v, skipping outdated endpoint cleanup in this scan", err)
} else {
for i := range epList.Items {
key := fmt.Sprintf("%s/%s", epList.Items[i].Namespace, epList.Items[i].Name)
suspiciousEndpointMap[key] = &epList.Items[i].Status
}
}

poolList, err := s.ippoolMgr.ListIPPools(ctx, constant.UseCache)
if err != nil {
if apierrors.IsNotFound(err) {
@@ -127,6 +140,16 @@ func (s *SpiderGC) executeScanAll(ctx context.Context) {
flagStaticIPPod := false
shouldGcstatelessTerminatingPod := false
endpoint, endpointErr := s.wepMgr.GetEndpointByName(ctx, podNS, podName, constant.UseCache)
if endpointErr == nil && endpoint != nil {
// The allocation record in the IP pool still points at an existing endpoint, so this
// endpoint is healthy: drop it from suspiciousEndpointMap. Whatever remains in the map
// after the scan has no backing allocation and is removed by cleanOutdatedEndpoint.
s.Locker.Lock()
delete(suspiciousEndpointMap, poolIPAllocation.NamespacedName)
s.Locker.Unlock()
}

podYaml, podErr := s.podMgr.GetPodByName(ctx, podNS, podName, constant.UseCache)

// handle the case where no Pod with the same name exists
@@ -399,11 +422,82 @@ func (s *SpiderGC) executeScanAll(ctx context.Context) {
fnScanAll(v6poolList)
}()
}

wg.Wait()

// Clean up outdated endpoints that are no longer referenced by any IPPool allocation record.
if s.gcConfig.EnableCleanOutdatedEndpoint {
if len(suspiciousEndpointMap) > 0 {
logger.Sugar().Infof("Endpoint cleanup: checking %d suspicious SpiderEndpoints", len(suspiciousEndpointMap))
s.cleanOutdatedEndpoint(ctx, suspiciousEndpointMap)
} else {
logger.Sugar().Debugf("Endpoint cleanup: no suspicious SpiderEndpoints found, nothing to clean")
}
} else {
logger.Sugar().Debugf("Endpoint cleanup: disabled by configuration (EnableCleanOutdatedEndpoint=false)")
}

logger.Sugar().Debugf("IP GC scan all finished")
}

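// cleanOutdatedEndpoint releases SpiderEndpoints that are no longer referenced by any
// allocation record in the IPPools they point to, which indicates the endpoint was leaked.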
func (s *SpiderGC) cleanOutdatedEndpoint(ctx context.Context, suspiciousEndpointMap map[string]*spiderpoolv2beta1.WorkloadEndpointStatus) {
logCtx := logutils.IntoContext(ctx, logger)

for nsNameKey, status := range suspiciousEndpointMap {
namespace, name, err := cache.SplitMetaNamespaceKey(nsNameKey)
if err != nil {
logger.Sugar().Errorf("cleanOutdatedEndpoint: failed to split namespaced key %q: %v", nsNameKey, err)
continue
}
for _, ipAllocationDetail := range status.Current.IPs {
ipv4Pool := ipAllocationDetail.IPv4Pool
ipv6Pool := ipAllocationDetail.IPv6Pool
if ipv4Pool != nil {
err := s.checkEndpointExistInIPPool(logCtx, namespace, name, *ipv4Pool, logger, status)
if err != nil {
logger.Sugar().Errorf("cleanOutdateEndpoint: failed to clean outdated endpoint %s/%s: %v", namespace, name, err)
}
}

if ipv6Pool != nil {
err := s.checkEndpointExistInIPPool(logCtx, namespace, name, *ipv6Pool, logger, status)
if err != nil {
logger.Sugar().Errorf("cleanOutdateEndpoint: failed to clean outdated endpoint %s/%s: %v", namespace, name, err)
}
}
}
}
logger.Sugar().Debugf("Finished cleaning outdated endpoints")
}

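// checkEndpointExistInIPPool releases the endpoint (including its finalizer) when the given
// IPPool no longer records an allocation for it; endpoints that still own an allocation are kept.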
func (s *SpiderGC) checkEndpointExistInIPPool(ctx context.Context, epNamespace, epName, poolName string, logger *zap.Logger, status *spiderpoolv2beta1.WorkloadEndpointStatus) error {
pool, err := s.ippoolMgr.GetIPPoolByName(ctx, poolName, constant.IgnoreCache)
if err != nil {
return err
}

poolAllocatedIPs, err := convert.UnmarshalIPPoolAllocatedIPs(pool.Status.AllocatedIPs)
if err != nil {
return err
}

endpointKey := epNamespace + "/" + epName
for _, poolIPAllocation := range poolAllocatedIPs {
if poolIPAllocation.NamespacedName == endpointKey {
logger.Sugar().Debugf("Endpoint validation: endpoint %s has active IP allocation in pool %s, skipping cleanup", endpointKey, poolName)
return nil
}
}

logger.Sugar().Debugf("Endpoint cleanup: endpoint %s has no IP allocation in pool %s, proceeding with cleanup", endpointKey, poolName)
err = s.wepMgr.ReleaseEndpointAndFinalizer(ctx, epNamespace, epName, constant.IgnoreCache)
if err != nil {
logger.Sugar().Errorf("Endpoint cleanup: failed to remove endpoint %s: %v", endpointKey, err)
} else {
logger.Sugar().Infof("Endpoint cleanup: successfully removed outdated endpoint %s", endpointKey)
}
return nil
}

// isValidStatefulsetOrKubevirt checks whether the Pod is a valid static Pod (owned by a StatefulSet or a KubeVirt VM) and returns true if so.
func (s *SpiderGC) isValidStatefulsetOrKubevirt(ctx context.Context, logger *zap.Logger, podNS, podName, poolIP, ownerControllerType string) (bool, error) {
if s.gcConfig.EnableStatefulSet && ownerControllerType == constant.KindStatefulSet {
55 changes: 27 additions & 28 deletions pkg/gcmanager/tracePod_worker.go
@@ -49,37 +49,36 @@ func (s *SpiderGC) handlePodEntryForTracingTimeOut(podEntry *PodEntry) {
if podEntry.TracingStopTime.IsZero() {
logger.Sugar().Warnf("unknown podEntry: %+v", podEntry)
return
}

if time.Now().UTC().After(podEntry.TracingStopTime) {
// If the statefulset application quickly experiences scaling down and up,
// check whether `Status.PodIPs` is empty to determine whether the Pod in the current K8S has completed the normal IP release to avoid releasing the wrong IP.
ctx := context.TODO()
currentPodYaml, err := s.podMgr.GetPodByName(ctx, podEntry.Namespace, podEntry.PodName, constant.UseCache)
if err != nil {
tracingReason := fmt.Sprintf("the graceful deletion period of pod '%s/%s' is over, get the current pod status in Kubernetes", podEntry.Namespace, podEntry.PodName)
if apierrors.IsNotFound(err) {
logger.With(zap.Any("podEntry tracing-reason", tracingReason)).
Sugar().Debugf("pod '%s/%s' not found", podEntry.Namespace, podEntry.PodName)
} else {
logger.With(zap.Any("podEntry tracing-reason", tracingReason)).
Sugar().Errorf("failed to get pod '%s/%s', error: %v", podEntry.Namespace, podEntry.PodName, err)
// the pod will be handled next time.
return
}
} else {
if len(currentPodYaml.Status.PodIPs) == 0 {
logger.Sugar().Infof("The IP address of the Pod %v that has exceeded the grace period has been released through cmdDel, to ensure the IP is released, trigger this pod entry to gc again.", podEntry.PodName)
s.PodDB.DeletePodEntry(podEntry.Namespace, podEntry.PodName)
}
}

logger.With(zap.Any("podEntry tracing-reason", podEntry.PodTracingReason)).
Sugar().Infof("the graceful deletion period of pod '%s/%s' is over, try to release the IP address.", podEntry.Namespace, podEntry.PodName)
} else {
// not time out
return
}

select {
@@ -100,7 +99,7 @@ func (s *SpiderGC) releaseIPPoolIPExecutor(ctx context.Context, workerIndex int)
case podCache := <-s.gcIPPoolIPSignal:
err := func() error {
endpoint, err := s.wepMgr.GetEndpointByName(ctx, podCache.Namespace, podCache.PodName, constant.UseCache)
if err != nil {
if apierrors.IsNotFound(err) {
log.Sugar().Infof("SpiderEndpoint '%s/%s' not found, maybe already cleaned by cmdDel or ScanAll",
podCache.Namespace, podCache.PodName)