Use px/agent_status_diagnostics script within px cli to detect missing kernel headers (#2065)

Summary: Use `px/agent_status_diagnostics` script within px cli to
detect missing kernel headers

This PR leverages the script added in #2064 to detect missing kernel
headers during cli deploys and `px collect-logs` commands. This solves
2/3 of the use cases I was hoping to identify for #2051 (the last being
helm installs).

A recent example of this problem is #1986, where a Go TLS tracing
bug went undiagnosed for months (August to December). Amazon Linux
2023's headers differ enough that Go TLS tracing breaks when Pixie's
pre-packaged headers are used. The tooling in this PR would have
provided a few opportunities to catch this.

Relevant Issues: #2051

Type of change: /kind feature

Test Plan: Verified the following scenarios
<details><summary>Test cases</summary>

- [x] `px collect-logs` works against a cloud that doesn't have a
`px/agent_status_diagnostics` script
```
$ bazel run -c opt  --stamp src/pixie_cli:px -- collect-logs

WARN[0006] healthcheck script detected the following warnings:  error="Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes."
Logs written to pixie_logs_20241223165214.zip

# zip file contains px/agent_status output
$ cat px_agent_diagnostics.txt
{"_tableName_":"output","agent_id":"07fb4d26-3b53-4ba7-9bb7-f2cb10a1e63d","asid":79,"hostname":"gke-dev-ddelnano1-default-pool-b099382d-30mu","ip_address":"","agent_state":"AGENT_STATE_HEALTHY","create_time":"2024-12-18T12:43:44.41952403Z","last_heartbeat_ns":4303060450,"kernel_headers_installed":true}
```
- [x] `px collect-logs` works against a cloud that does have a
`px/agent_status_diagnostics` script
```
$ bazel run  src/pixie_cli:px -- collect-logs
INFO: Analyzed target //src/pixie_cli:px (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //src/pixie_cli:px up-to-date:
  bazel-bin/src/pixie_cli/px_/px
INFO: Elapsed time: 4.240s, Critical Path: 3.89s
INFO: 3 processes: 1 internal, 2 linux-sandbox.
INFO: Build completed successfully, 3 total actions
INFO: Running command line: bazel-bin/src/pixie_cli/px_/px collect-logs
Pixie CLI
*******************************
* ENV VARS
*        PX_CLOUD_ADDR=testing.getcosmic.ai:443
*******************************
Logs written to pixie_logs_20241218164734.zip

$ cat px_agent_diagnostics.txt
{"_tableName_":"output","headers_installed_percent":1}
```
- [x] `px collect-logs` identifies missing kernel headers when the
`px/agent_status_diagnostics` script is present
```
$ bazel run -c opt  --stamp src/pixie_cli:px -- --bundle https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json collect-logs
[ ... ]
WARN[0012] healthcheck script detected the following warnings:  error="Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes."

$ cat px_agent_diagnostics.txt
{"_tableName_":"output","headers_installed_percent":0.5}
```

- [x] Artificially forcing context deadline (timeout) results in an
error
```
$ git diff
diff --git a/src/pixie_cli/pkg/vizier/script.go b/src/pixie_cli/pkg/vizier/script.go
index 7d3b7e008..c957b8943 100644
--- a/src/pixie_cli/pkg/vizier/script.go
+++ b/src/pixie_cli/pkg/vizier/script.go
@@ -317,7 +317,7 @@ func RunSimpleHealthCheckScript(br *script.BundleManager, cloudAddr string, clus
                execScript = br.MustGetScript(script.AgentStatusScript)
        }

-       ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+       ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)

$ bazel run  src/pixie_cli:px -- collect-logs

WARN[0012]src/pixie_cli/pkg/vizier/logs.go:135 px.dev/pixie/src/pixie_cli/pkg/vizier.(*LogCollector).CollectPixieLogs() failed to run health check script             error="context deadline exceeded"
Logs written to pixie_logs_20241218165033.zip
```
- [x] `px collect-logs` prompts auth flow when credentials don't match
current cloud
```
$ PX_CLOUD_ADDR=new-cloud bazel run  src/pixie_cli:px -- collect-logs
*******************************
* ENV VARS
*        PX_CLOUD_ADDR=new-cloud
*******************************
Failed to authenticate. Please retry `px auth login`.
```

- [x] `px deploy` on pre v0.14.14 (older) vizier with existing bundle
warns that kernel headers should be installed
```
# Additional flags provided to speed up vizier bootstrapping
$ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key='<deploy key>' --deploy_olm=false --olm_namespace=olm --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json
```

- [x] `px deploy` on pre v0.14.14 (older) vizier with latest bundle
warns that kernel headers should be installed
```
# Additional flags provided to speed up vizier bootstrapping
$ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key='<deploy key>' --deploy_olm=false --olm_namespace=olm --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json

[ ... ]
Waiting for Pixie to pass healthcheck
 ✔    Wait for PEMs/Kelvin
 ✔    Wait for PEMs/Kelvin
 ✕    Wait for healthcheck  ERR: Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes.
Pixie healthcheck detected the following warnings: error=Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes.

[ ... ]
```

- [x] `px deploy` on v0.14.14 vizier with latest bundle warns
appropriately when kernel headers are missing
```
$ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key=<deploy key> --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json -v 0.14.14-pre-r1.0

[ ... ]
Waiting for Pixie to pass healthcheck
 ✔    Wait for PEMs/Kelvin
 ✕    Wait for healthcheck  ERR: Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes.
Pixie healthcheck detected the following warnings: error=Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes.
```

</details>

Changelog Message: Enhanced the `px` cli's `deploy` and `collect-logs`
commands to surface when kernel headers aren't installed. This is a
common source of bugs that can only be addressed by installing your
distro's kernel headers.

Signed-off-by: Dom Del Nano <[email protected]>
ddelnano authored Jan 6, 2025
1 parent a1a1d0e commit 3c9c4bd
Showing 8 changed files with 329 additions and 146 deletions.
4 changes: 2 additions & 2 deletions src/pixie_cli/pkg/cmd/collect_logs.go
@@ -27,7 +27,7 @@ import (
"github.com/spf13/viper"

"px.dev/pixie/src/pixie_cli/pkg/utils"
-"px.dev/pixie/src/utils/shared/k8s"
+"px.dev/pixie/src/pixie_cli/pkg/vizier"
)

func init() {
@@ -42,7 +42,7 @@ var CollectLogsCmd = &cobra.Command{
viper.BindPFlag("namespace", cmd.Flags().Lookup("namespace"))
},
Run: func(cmd *cobra.Command, args []string) {
-c := k8s.NewLogCollector()
+c := vizier.NewLogCollector(mustCreateBundleReader(), viper.GetString("cloud_addr"))
fName := fmt.Sprintf("pixie_logs_%s.zip", time.Now().Format("20060102150405"))
err := c.CollectPixieLogs(fName)
if err != nil {
83 changes: 19 additions & 64 deletions src/pixie_cli/pkg/cmd/deploy.go
@@ -22,7 +22,6 @@ import (
"context"
"errors"
"fmt"
-"io"
"os"
"strings"
"time"
@@ -72,6 +71,7 @@ var BlockListedLabels = []string{
}

func init() {
+DeployCmd.Flags().StringP("bundle", "b", "", "Path/URL to bundle file")
DeployCmd.Flags().StringP("extract_yaml", "e", "", "Directory to extract the Pixie yamls to")
DeployCmd.Flags().StringP("vizier_version", "v", "", "Pixie version to deploy")
DeployCmd.Flags().BoolP("check", "c", true, "Check whether the cluster can run Pixie")
@@ -106,6 +106,7 @@ var DeployCmd = &cobra.Command{
Use: "deploy",
Short: "Deploys Pixie on the current K8s cluster",
PreRun: func(cmd *cobra.Command, args []string) {
+viper.BindPFlag("bundle", cmd.Flags().Lookup("bundle"))
viper.BindPFlag("extract_yaml", cmd.Flags().Lookup("extract_yaml"))
viper.BindPFlag("vizier_version", cmd.Flags().Lookup("vizier_version"))
viper.BindPFlag("check", cmd.Flags().Lookup("check"))
@@ -604,61 +605,6 @@ func deploy(cloudConn *grpc.ClientConn, clientset *kubernetes.Clientset, vzClien
return clusterID
}

-func runSimpleHealthCheckScript(cloudAddr string, clusterID uuid.UUID) error {
-v, err := vizier.ConnectionToVizierByID(cloudAddr, clusterID)
-br := mustCreateBundleReader()
-if err != nil {
-return err
-}
-execScript := br.MustGetScript(script.AgentStatusScript)
-
-ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
-defer cancel()
-
-resp, err := v.ExecuteScriptStream(ctx, execScript, nil)
-if err != nil {
-return err
-}
-
-// TODO(zasgar): Make this use the Null output. We can't right now
-// because of fatal message on vizier failure.
-errCh := make(chan error)
-// Eat all responses.
-go func() {
-for {
-select {
-case <-ctx.Done():
-if ctx.Err() != nil {
-errCh <- ctx.Err()
-return
-}
-errCh <- nil
-return
-case msg := <-resp:
-if msg == nil {
-errCh <- nil
-return
-}
-if msg.Err != nil {
-if msg.Err == io.EOF {
-errCh <- nil
-return
-}
-errCh <- msg.Err
-return
-}
-if msg.Resp.Status != nil && msg.Resp.Status.Code != 0 {
-errCh <- errors.New(msg.Resp.Status.Message)
-}
-// Eat messages.
-}
-}
-}()
-
-err = <-errCh
-return err
-}
-
func waitForHealthCheckTaskGenerator(cloudAddr string, clusterID uuid.UUID) func() error {
return func() error {
timeout := time.NewTimer(5 * time.Minute)
@@ -668,10 +614,15 @@ func waitForHealthCheckTaskGenerator(cloudAddr string, clusterID uuid.UUID) func
case <-timeout.C:
return errors.New("timeout waiting for healthcheck (it is possible that Pixie stabilized after the healthcheck timeout. To check if Pixie successfully deployed, run `px debug pods`)")
default:
-err := runSimpleHealthCheckScript(cloudAddr, clusterID)
+_, err := vizier.RunSimpleHealthCheckScript(mustCreateBundleReader(), cloudAddr, clusterID)
if err == nil {
return nil
}
+// The health check warning error indicates the cluster successfully deployed, but there are some warnings.
+// Return the error to end the polling and show the warnings.
+if _, ok := err.(*vizier.HealthCheckWarning); ok {
+return err
+}
time.Sleep(5 * time.Second)
}
}
@@ -691,13 +642,17 @@ func waitForHealthCheck(cloudAddr string, clusterID uuid.UUID, clientset *kubern
hc := utils.NewSerialTaskRunner(healthCheckJobs)
err := hc.RunAndMonitor()
if err != nil {
-_ = pxanalytics.Client().Enqueue(&analytics.Track{
-UserId: pxconfig.Cfg().UniqueClientID,
-Event: "Deploy Healthcheck Failed",
-Properties: analytics.NewProperties().
-Set("err", err.Error()),
-})
-utils.WithError(err).Fatal("Failed Pixie healthcheck")
+if _, ok := err.(*vizier.HealthCheckWarning); ok {
+utils.WithError(err).Error("Pixie healthcheck detected the following warnings:")
+} else {
+_ = pxanalytics.Client().Enqueue(&analytics.Track{
+UserId: pxconfig.Cfg().UniqueClientID,
+Event: "Deploy Healthcheck Failed",
+Properties: analytics.NewProperties().
+Set("err", err.Error()),
+})
+utils.WithError(err).Fatal("Failed Pixie healthcheck")
+}
}
_ = pxanalytics.Client().Enqueue(&analytics.Track{
UserId: pxconfig.Cfg().UniqueClientID,
5 changes: 2 additions & 3 deletions src/pixie_cli/pkg/cmd/root.go
@@ -203,7 +203,6 @@ var RootCmd = &cobra.Command{

// Name a variable to store a slice of commands that don't require cloudAddr
var cmdsCloudAddrNotReqd = []*cobra.Command{
-CollectLogsCmd,
VersionCmd,
}

@@ -245,7 +244,7 @@ func checkAuthForCmd(c *cobra.Command) {
os.Exit(1)
}
switch c {
-case DeployCmd, UpdateCmd, GetCmd, DeployKeyCmd, APIKeyCmd:
+case CollectLogsCmd, DeployCmd, UpdateCmd, GetCmd, DeployKeyCmd, APIKeyCmd:
utils.Errorf("These commands are unsupported in Direct Vizier mode.")
os.Exit(1)
default:
@@ -254,7 +253,7 @@ func checkAuthForCmd(c *cobra.Command) {
}

switch c {
-case DeployCmd, UpdateCmd, RunCmd, LiveCmd, GetCmd, ScriptCmd, DeployKeyCmd, APIKeyCmd:
+case CollectLogsCmd, DeployCmd, UpdateCmd, RunCmd, LiveCmd, GetCmd, ScriptCmd, DeployKeyCmd, APIKeyCmd:
authenticated := auth.IsAuthenticated(viper.GetString("cloud_addr"))
if !authenticated {
utils.Errorf("Failed to authenticate. Please retry `px auth login`.")
1 change: 1 addition & 0 deletions src/pixie_cli/pkg/vizier/BUILD.bazel
@@ -25,6 +25,7 @@ go_library(
"data_formatter.go",
"errors.go",
"lister.go",
+"logs.go",
"script.go",
"stream_adapter.go",
"utils.go",
144 changes: 144 additions & 0 deletions src/pixie_cli/pkg/vizier/logs.go
@@ -0,0 +1,144 @@
/*
* Copyright 2018- The Pixie Authors.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* SPDX-License-Identifier: Apache-2.0
*/

package vizier

import (
"archive/zip"
"context"
"errors"
"os"
"strings"

log "github.com/sirupsen/logrus"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"

"px.dev/pixie/src/utils/script"
"px.dev/pixie/src/utils/shared/k8s"
)

// LogCollector collects logs for Pixie and cluster setup information.
type LogCollector struct {
k8sConfig *rest.Config
k8sClientSet *kubernetes.Clientset
cloudAddr string
br *script.BundleManager
k8s.LogCollector
}

// NewLogCollector creates a new log collector.
func NewLogCollector(br *script.BundleManager, cloudAddr string) *LogCollector {
cfg := k8s.GetConfig()
cs := k8s.GetClientset(cfg)
return &LogCollector{
cfg,
cs,
cloudAddr,
br,
*k8s.NewLogCollector(),
}
}

// CollectPixieLogs collects logs for all Pixie pods and writes them to the zip file fName.
func (c *LogCollector) CollectPixieLogs(fName string) error {
if !strings.HasSuffix(fName, ".zip") {
return errors.New("fname must have .zip suffix")
}
f, err := os.Create(fName)
if err != nil {
return err
}
defer f.Close()

zf := zip.NewWriter(f)
defer zf.Close()

vls := k8s.VizierLabelSelector()
vizierLabelSelector := metav1.FormatLabelSelector(&vls)

// We check across all namespaces for the matching pixie pods.
vizierPodList, err := c.k8sClientSet.CoreV1().Pods("").List(context.Background(), metav1.ListOptions{LabelSelector: vizierLabelSelector})
if err != nil {
return err
}

// We also need to get the operator logs.
// As the LabelSelectors are ANDed, we need to make a new query and merge
// the results.
ols := k8s.OperatorLabelSelector()
operatorLabelSelector := metav1.FormatLabelSelector(&ols)

operatorPodList, err := c.k8sClientSet.CoreV1().Pods("").List(context.Background(), metav1.ListOptions{LabelSelector: operatorLabelSelector})
if err != nil {
return err
}

// Merge the two pod lists
pods := append(vizierPodList.Items, operatorPodList.Items...)

for _, pod := range pods {
for _, containerStatus := range pod.Status.ContainerStatuses {
// Ignore prev logs, they might not exist.
_ = c.LogPodInfoToZipFile(zf, pod, containerStatus.Name, true)

err := c.LogPodInfoToZipFile(zf, pod, containerStatus.Name, false)
if err != nil {
log.WithError(err).Warnf("Failed to log pod: %s", pod.Name)
}
}
err = c.WritePodDescription(zf, pod)
if err != nil {
log.WithError(err).Warnf("failed to write pod description")
}
}

err = c.LogKubeCmd(zf, "nodes.log", "describe", "node")
if err != nil {
log.WithError(err).Warn("failed to log node info")
}

err = c.LogKubeCmd(zf, "services.log", "describe", "services", "--all-namespaces", "-l", vizierLabelSelector)
if err != nil {
log.WithError(err).Warnf("failed to log services")
}

// Describe vizier and write it to vizier.log
err = c.LogKubeCmd(zf, "vizier.log", "describe", "vizier", "--all-namespaces")
if err != nil {
log.WithError(err).Warnf("failed to log vizier crd")
}

clusterID, err := GetCurrentVizier(c.cloudAddr)
if err != nil {
log.WithError(err).Warnf("failed to get cluster ID")
}
outputCh, err := RunSimpleHealthCheckScript(c.br, c.cloudAddr, clusterID)

if err != nil {
entry := log.WithError(err)
if _, ok := err.(*HealthCheckWarning); ok {
entry.Warn("healthcheck script detected the following warnings:")
} else {
entry.Warn("failed to run healthcheck script")
}
}

return c.LogOutputToZipFile(zf, "px_agent_diagnostics.txt", <-outputCh)
}
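
`CollectPixieLogs` ends by writing the health check output into the archive through `LogOutputToZipFile`, whose implementation is not part of this diff. As a rough sketch of that pattern using only the standard `archive/zip` package, the example below adds a text entry to an in-memory zip and reads it back. The helper names `writeTextEntry`, `readTextEntry`, and `buildArchive` are hypothetical and chosen for illustration only.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// writeTextEntry adds a file called name with the given contents to zw.
func writeTextEntry(zw *zip.Writer, name, contents string) error {
	w, err := zw.Create(name)
	if err != nil {
		return err
	}
	_, err = io.WriteString(w, contents)
	return err
}

// buildArchive returns the bytes of a zip archive holding a single entry.
func buildArchive(name, contents string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	if err := writeTextEntry(zw, name, contents); err != nil {
		return nil, err
	}
	// Close writes the central directory, so its error is checked explicitly.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readTextEntry returns the contents of the entry called name.
func readTextEntry(archive []byte, name string) (string, error) {
	zr, err := zip.NewReader(bytes.NewReader(archive), int64(len(archive)))
	if err != nil {
		return "", err
	}
	for _, f := range zr.File {
		if f.Name != name {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return "", err
		}
		defer rc.Close()
		b, err := io.ReadAll(rc)
		return string(b), err
	}
	return "", fmt.Errorf("entry %q not found", name)
}

func main() {
	row := `{"_tableName_":"output","headers_installed_percent":0.5}`
	archive, err := buildArchive("px_agent_diagnostics.txt", row)
	if err != nil {
		panic(err)
	}
	got, err := readTextEntry(archive, "px_agent_diagnostics.txt")
	if err != nil {
		panic(err)
	}
	fmt.Println(got)
}
```
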