[SPARK-56074][INFRA] Improve AGENTS.md with inline build/test commands, PR workflow, and dev notes#54899

Open
cloud-fan wants to merge 7 commits into apache:master from cloud-fan:ai

Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

Rewrite AGENTS.md to follow industry best practices for AI coding agent instructions:

  • Add concise project overview
  • Git pre-flight checks (uncommitted changes, branch state) before making edits
  • Inline actionable SBT build/test commands instead of links to docs
  • PySpark test setup with venv and Python version check
  • Development notes for SQLQueryTestSuite golden file tests and Spark Connect proto definitions
  • PR workflow guidelines (title format, description template, fork/upstream workflow)
  • Add CLAUDE.md as a symlink to AGENTS.md so Claude Code also picks up the instructions

Why are the changes needed?

The existing AGENTS.md only contained links to build docs, which AI coding agents cannot follow. The rewrite provides inline commands and actionable guidance that agents can directly use. All commands have been verified to work.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

All build and test commands were manually verified:

  • build/sbt sql/compile, build/sbt sql/Test/compile
  • build/sbt "sql/testOnly *MySuite", build/sbt "sql/testOnly *MySuite -- -z \"test name\""
  • PySpark venv setup and python/run-tests with both test suite and single test case
  • Confirmed OBJC_DISABLE_INITIALIZE_FORK_SAFETY is no longer needed on macOS

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.6

…w, and dev notes

Rewrite AGENTS.md to follow industry best practices: inline actionable commands
instead of links, add project overview, git pre-flight checks, PySpark venv setup,
development notes for SQLQueryTestSuite and Spark Connect protos, and PR workflow
guidelines. Add CLAUDE.md as a symlink so Claude Code also picks up the instructions.

Generated-by: Claude Opus 4.6

Co-authored-by: Isaac
@cloud-fan
Contributor Author

> project sql
> testOnly *MySuite

### PySpark Tests
Contributor Author

also cc @gaogaotiantian @Yicong-Huang for the pyspark test part.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Please use JIRA IDs for traceability, @cloud-fan, because this is a very hot area these days.

AGENTS.md Outdated

build/sbt -Phive package

Set up and activate a virtual environment (Python 3.10+):
Contributor

For many Python developers, we have more than one venv in the working directory, with names that contain the Python version. (We probably have uv developers too, but let's deal with that when they speak up.) Checking .venv here is too idealistic - I don't have a .venv directory on any of my dev machines; I have something like venv3.12.

Do we assume the agent works from a new terminal or in the current user terminal session? They are probably smart enough to find the venv directory in the root directory and activate it. If not, we should help them.

Contributor Author

@cloud-fan cloud-fan Mar 19, 2026

The agent is supposed to work under the spark directory, and this .venv is inside the spark directory. Do we need multiple virtual envs for the spark project?

BTW, I picked .venv as it's kind of an industry standard to put the venv profile under the project dir, and it seems people have started using it already: #53836

Contributor

It's common to use .venv as the virtualenv dir, but it's far from being an industry standard. Even the CPython docs mention two:

Contained in a directory, conventionally named .venv or venv in the project directory, or under a container directory for lots of virtual environments, such as ~/.virtualenvs.

Unfortunately we do need multiple virtual envs to develop Spark, at least PySpark. We support all official Python versions, and it's not uncommon to have issues that only reproduce on a specific Python version. Also, python3 often does not point to the Python version we really want to use.

Currently I have two virtual env directories under spark, and I will probably have more in the future.

Actually, if I remember correctly, our fragile type annotations only pass on a specific Python version (3.12 for now). Oh, and I think it reports an error if you just pip install -r dev/requirements.txt because it has a slightly different version for numpy or its stub. Then I realized it might be helpful to add the lint/format/type checks to this file too, because we don't follow industry standards there either.

That being said, I don't have a perfect solution for Python environment management for agents - I don't have enough knowledge about what would work better. Python environment management is not the best part of Python, but we still need to deal with it.

Contributor

In my experience, I just let my agent create a venv for me if not pointed to one. However, I often do not create it under the spark root as it is not picked up by .gitignore (yet?).
Given that there is no standard, maybe just let the agent choose a location to prepare a venv if not already given?

AGENTS.md Outdated

build/sbt -Phive package

Set up and activate a virtual environment (Python 3.10+):
Member

Well, this means we need to update this AGENTS.md file every time we deprecate a Python version. Can we make this as robust as possible, so that we don't need to edit this file?

Contributor Author

@cloud-fan cloud-fan Mar 19, 2026

I removed it; python/run-tests already verifies Python versions.

AGENTS.md Outdated

### PR Title

Format: `[SPARK-xxxx][COMPONENT] Title` or `[MINOR] Title` for trivial changes.
Member

If possible, let's avoid mentioning MINOR here.

AGENTS.md Outdated
Contributors push their feature branch to their personal fork and open PRs against `master` on the upstream Apache Spark repo. Run `git remote -v` to identify which remote is the fork and which is upstream (`apache/spark`). If the remotes are unclear, ask the user to set them up following the standard convention (`origin` for the fork, `upstream` for `apache/spark`).

- **PySpark Testing**: [python/docs/source/development/testing.rst](python/docs/source/development/testing.rst)
Use `gh pr create` to open PRs. If `gh` is not installed, generate the GitHub PR URL for the user and recommend installing the GitHub CLI.
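A minimal sketch of the remote check the hunk above describes (the function name is illustrative; the remote names follow the stated convention of `origin` for the fork and `upstream` for `apache/spark`):

```shell
# Hypothetical helper: find which remote points at apache/spark,
# so pushes can be directed at any remote other than that one.
detect_upstream() {
  git remote -v | awk '/apache\/spark/ && /\(push\)/ {print $1; exit}'
}
```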
Contributor

This would be pretty useless if the agent does not have access to JIRA.

Contributor Author

It does not just save clicking the create-PR button; it also generates the PR title and description.

Contributor

I gave my agent access to JIRA; all you need is a Personal Access Token. Maybe we can add it?

AGENTS.md Outdated

### PR Title

Format: `[SPARK-xxxx][COMPONENT] Title` or `[MINOR] Title` for trivial changes.
Member

Shall we add an instruction to ask the user to get the SPARK-XXX ID?

Contributor

Second this. My agents sometimes hallucinate a SPARK ID for me. It would be good to tell them a source to get this ID.

… drop Python version check

Co-authored-by: Isaac
@cloud-fan
Contributor Author

Addressed review comments: removed [MINOR] from the PR title format. Also dropped the hardcoded Python version check from the venv setup - python/run-tests already enforces the minimum version.

@dongjoon-hyun
Member

dongjoon-hyun commented Mar 19, 2026

BTW, I sent an email to the Spark mailing list to request feedback. Thank you, Wenchen. This becomes more and more important.

@cloud-fan cloud-fan changed the title [MINOR] Improve AGENTS.md with inline build/test commands, PR workflow, and dev notes [SPARK-56074][INFRA] Improve AGENTS.md with inline build/test commands, PR workflow, and dev notes Mar 19, 2026
AGENTS.md Outdated
Comment on lines +69 to +78
## Pull Request Workflow

### PR Title

Format: `[SPARK-xxxx][COMPONENT] Title` where `SPARK-xxxx` is the JIRA ticket ID. Ask the user to create a new ticket or provide an existing one if not given.
For follow-up fixes: `[SPARK-xxxx][COMPONENT][FOLLOWUP] Title`.

### PR Description

Follow the template in `.github/PULL_REQUEST_TEMPLATE`.
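The title format quoted above can be approximated mechanically; a hedged sketch (the regex and the sample title are illustrative, not part of the PR):

```shell
# Rough check that a PR title matches [SPARK-xxxx][COMPONENT] Title,
# where a [FOLLOWUP] tag is also accepted. The pattern is an approximation.
check_title() {
  echo "$1" | grep -Eq '^\[SPARK-[0-9]+\](\[[A-Z0-9]+\])+ '
}
check_title '[SPARK-56074][INFRA] Improve AGENTS.md' && echo ok
```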
Contributor

How about telling it not to raise PRs without a human review beforehand?

Contributor

Or more accurately: I'd rather leave this part out, I still want a human to make the PRs.

Contributor Author

https://github.com/apache/spark/pull/54899/changes#diff-a54ff182c7e8acf56acfd6e4b9c3ff41e2c41a31c9b211b2deb9df75d9a478f9R84

I asked the LLM to get human approval before pushing commits and creating PRs; does that address your concern?

Contributor

That's probably fine, although I'd still rather have humans make the PR (it isn't a big time saver to have the AI agent do it, and in making the PR, hopefully they'd do the review).

Contributor

To be clear: I think we can go with this and revisit if it becomes a problem

Contributor

@HeartSaVioR HeartSaVioR Mar 19, 2026

Yeah, I actually want to push people to review the code as the initial reviewer and to write the PR title and description themselves. Someone shouldn't claim the output of an LLM as their own if they do not understand what the code is, and I'm kind of worried that making the process easier might eventually end up with people blindly accepting whatever the LLM suggests.

But we have trust within the community, and people who want to delegate the PR effort can just do the same with one more prompt (there's no way to prevent it), so it's probably not a big deal.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you for addressing my comments, @cloud-fan .

AGENTS.md Outdated
# Apache Spark

This file provides context and guidelines for AI coding assistants working with the Apache Spark codebase.
Apache Spark is a multi-language engine for large-scale data processing and analytics, primarily written in Scala and Java. It supports both batch and streaming workloads, with Spark Connect as an optional server-client protocol.
Member

we don't need this for agentic tools

Contributor

Yeah, I'm not actually very sure how this would be useful for an agent. Do we see its usefulness?

Contributor Author

Actually, the entire overview section could be removed, as LLMs should be very familiar with Spark. But searching the web, it's very common to put an overview section with some general info.


AGENTS.md Outdated

## Before Making Changes

Before the first edit in a session, check the git state:
Member

just use git worktree?

Contributor

I also think we should use worktree.

Contributor

I think it is up to developers to add a new worktree?

Contributor Author

git worktree is a bit heavier (a whole copy of all files) and I don't use it very often in my development. Maybe we can improve the "If there are uncommitted changes" part and ask users whether they want to create a new worktree or not.

Member

When developers use agentic tools heavily, they will see how worktree helps - you know, you don't need to work on a single PR at a time with AI.

Contributor

worktree can help with multitasking and sub-agents, which is kind of hard with branches. I think at some point the slightly heavier overhead is justified. I adapted to worktree pretty quickly.

AGENTS.md Outdated

## Development Notes

SQL golden file tests are managed by `SQLQueryTestSuite` and its variants. Read the class documentation before running or updating these tests.
Member

For rules and notes on specific areas, can we use AGENTS.md only as a menu, not for details?

Contributor Author

It's already menu-like; this note asks the LLM to read the details in the classdoc.

@yaooqinn
Member

  • For best practices, the AGENTS.md file should be capped at 100 lines at most. The current content is too specific.
  • Consider making the free-form text more structured

AGENTS.md Outdated
Comment on lines +71 to +78
### PR Title

Format: `[SPARK-xxxx][COMPONENT] Title` where `SPARK-xxxx` is the JIRA ticket ID. Ask the user to create a new ticket or provide an existing one if not given.
For follow-up fixes: `[SPARK-xxxx][COMPONENT][FOLLOWUP] Title`.

### PR Description

Follow the template in `.github/PULL_REQUEST_TEMPLATE`.
Member

PR title and description conventions are very easy for AI tools to learn; this actually consumes too much room in AGENTS.md.

AGENTS.md Outdated

Before the first edit in a session, check the git state:
1. If there are uncommitted changes, ask the user whether to continue editing or stash/commit first.
2. If the branch is `master`, or has new commits compared to `master` (check with `git log master..HEAD`), create a new branch from `master` before editing. Inform the user of the new branch name.
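Step 2 of the quoted pre-flight check could look like this as shell (a minimal sketch; the function name and branch-name scheme are assumptions, not from the PR):

```shell
# If HEAD is master, branch off before making any edits so that
# changes never land on master directly. Name scheme is illustrative.
ensure_feature_branch() {
  if [ "$(git rev-parse --abbrev-ref HEAD)" = "master" ]; then
    git checkout -b "agent-work-$(date +%s)"
  fi
}
```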
Contributor

I don't know if we should add a pull from upstream master, just so that the new branch is branched off a newer master. I think my local agent always forgets to pull from upstream, and instead always pulls from origin/master, which is my fork, and claims it is already the latest.


AGENTS.md Outdated
Set up and activate a virtual environment (Python 3.10+):

if [ ! -d .venv ]; then
python3 -c "import sys; assert sys.version_info >= (3, 10)" || { echo "Python 3.10+ required. Ask the user to install a compatible version."; exit 1; }
Contributor

I know we support 3.10+, but is it a good idea to use a more recent Python for development? Especially since our CI environments use 3.12, using a version matching CI might help prevent some issues.


AGENTS.md Outdated

### PR Title

Format: `[SPARK-xxxx][COMPONENT] Title` or `[MINOR] Title` for trivial changes.
Contributor

Shall we add the supported components explicitly? e.g., [PYTHON], [SS], [TEST], [DOCS] - otherwise, where can it find this information?


AGENTS.md Outdated
## Build and Test

Prefer building in sbt over maven:
Prefer SBT over Maven for day-to-day development (faster incremental compilation). Replace `sql` below with the target module (e.g., `core`, `catalyst`, `connect`).
Member

My Cursor somehow tries a few random names for a module like 'sql' before it finds it. I wonder if you saw that too, and whether this also helped with it?

Last time, @HyukjinKwon mentioned not duplicating the docs and just pointing to them; that's an alternative too, to enhance the docs instead, but I'm not sure.

Contributor

For the majority of other docs, I think it's better to leave a pointer here rather than copy-paste content. But build and test instructions are pretty stable (less likely to change very often), yet they will be very frequently used by the agent. So I would actually prefer to just have the instructions copy-pasted into this doc.

AGENTS.md Outdated
- SBT build instructions: See the ["Building with SBT"](docs/building-spark.md#building-with-sbt) section
- SBT testing: See the ["Testing with SBT"](docs/building-spark.md#testing-with-sbt) section
- Running individual tests: See the ["Running Individual Tests"](docs/building-spark.md#running-individual-tests) section
Contributors push their feature branch to their personal fork and open PRs against `master` on the upstream Apache Spark repo. Run `git remote -v` to identify which remote is the fork and which is upstream (`apache/spark`). If the remotes are unclear, ask the user to set them up following the standard convention (`origin` for the fork, `upstream` for `apache/spark`).
Member

Nice. So one can say 'make a PR that does this'?

Contributor Author

yes

@jzhuge
Member

jzhuge commented Mar 19, 2026

Thanks for adding these. The context is much more useful to agents than links to docs.

However, I feel it is a bit over-engineered. Shorter is better for agent instructions, reducing noise and cost. The most valuable content is the non-obvious or unique stuff:

  • PySpark requires a prior build step
  • Golden file tests → read SQLQueryTestSuite docs
  • Proto changes → read the README first
  • PR title format
  • ...

@vaquarkhan vaquarkhan left a comment

Two suggestions that have become standard practice for AGENTS.md in large monorepos:

  1. Add an "Architecture Boundaries" section with explicit negative constraints.

AI agents frequently hallucinate legacy patterns or bypass isolation layers. The most effective guardrail is a short "Never Do" list. For Spark, the critical ones are:

Architecture Boundaries

  • Spark Connect: Uses a decoupled client-server architecture communicating via gRPC. Never share JVM memory or state between the client (sql/connect/client/) and the server (sql/connect/server/).
  • Dependencies: Never introduce cyclic dependencies. common/utils is foundational; higher-level modules like sql/core must not pull in dependencies that violate the module hierarchy.
  • Generated Files: Never hand-edit SQL golden files (sql/core/src/test/resources/sql-tests/results/) or generated Protobuf files. Regenerate golden files with SPARK_GENERATE_GOLDEN_FILES=1 and protos with dev/connect-gen-protos.sh.
  2. Establish the federated AGENTS.md pattern.

The AGENTS.md spec now explicitly supports subdirectory placement with nearest-wins cascading. To keep this root file lightweight (~100 lines) and establish that module-specific rules belong in nested files, add a note at the top:

Note: For component-specific instructions, look for nested AGENTS.md files in subdirectories (e.g., sql/AGENTS.md, python/AGENTS.md). Agents read the nearest file in the directory tree, so the closest one takes precedence.
This aligns with @yaooqinn's comment about keeping this file as a menu rather than details; sub-projects like PySpark or Catalyst can define their own hyper-specific guardrails locally without bloating the root file.

@cloud-fan
Contributor Author

@vaquarkhan I think eventually AGENTS.md should contain an architecture overview, but it will need more time to get consensus on it (Spark is a large project with many modules and many architectures at different granularities).

I want to only include dev-ops-related stuff in the first version, which already needs a lot of time to reach agreement (see the discussions we are having now).

I'll keep this PR open for a few days to collect more feedback.

AGENTS.md Outdated

### PR Title

Format: `[SPARK-xxxx][COMPONENT] Title` where `SPARK-xxxx` is the JIRA ticket ID. Ask the user to create a new ticket or provide an existing one if not given.
Contributor

Shall we specify which are the allowed values for COMPONENT?

Contributor Author

@cloud-fan cloud-fan Mar 19, 2026

I don't think we have a clear rule, and the LLM should be able to figure it out. For example, we use [SQL] a lot, but the actual module names are catalyst, sql/core, hive, etc.

…ceholders, simplify PR section

Co-authored-by: Isaac

build/sbt -Phive package

Activate the virtual environment specified by the user, or default to `~/.virtualenvs/pyspark`:
Contributor Author

@gaogaotiantian how about this? we give more control to the users.

Contributor Author

@cloud-fan cloud-fan Mar 19, 2026

users can say "run test ... with venv at <path>".

Contributor

I think the agent should try to locate a virtual env at the root dir of the repo, instead of a global user venv. If no venv is found, the agent can create venv or .venv at the root dir of the repo.
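The lookup described here could be sketched as follows (the function name and the venv* glob are assumptions, the glob covering names like venv3.12 mentioned earlier in the thread):

```shell
# Prefer an existing venv under the repo root; fall back to creating .venv.
activate_repo_venv() {
  root=${1:-.}
  for d in "$root"/.venv "$root"/venv*; do
    if [ -f "$d/bin/activate" ]; then
      . "$d/bin/activate"
      return 0
    fi
  done
  python3 -m venv "$root/.venv" && . "$root/.venv/bin/activate"
}
```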

AGENTS.md Outdated
This file provides context and guidelines for AI coding assistants working with the Apache Spark codebase.
## Before Making Changes

Before the first edit in a session, ensure a clean working environment:
Contributor Author

This is a common issue I hit when working with LLMs. Sometimes I forget to clean up the working env before asking the LLM to make changes; it just goes ahead and messes up my working env.

Member

Keep it simple and mandatory, without any conditions; otherwise, the agent gets tripped up by conditional guidelines:

  • DON'T include any unrelated changes

AGENTS.md Outdated
- SBT build instructions: See the ["Building with SBT"](docs/building-spark.md#building-with-sbt) section
- SBT testing: See the ["Testing with SBT"](docs/building-spark.md#testing-with-sbt) section
- Running individual tests: See the ["Running Individual Tests"](docs/building-spark.md#running-individual-tests) section
Contributors push their feature branch to their personal fork and open PRs against `master` on the upstream Apache Spark repo.
Contributor Author

This is another common mistake. The LLM has pushed my branch to upstream apache many times...

Member

For agent, we'd keep it simple, for example

  • MUST push to my fork

@@ -0,0 +1 @@
AGENTS.md No newline at end of file
Member

Do we need this? AGENTS.md is now like an open standard for project harnesses.

Contributor Author

seems not well supported: anthropics/claude-code#34235

@vaquarkhan

> @vaquarkhan I think eventually AGENTS.md should contain an architecture overview, but it will need more time to get consensus on it (Spark is a large project with many modules and many architectures at different granularities).
>
> I want to only include dev-ops-related stuff in the first version, which already needs a lot of time to reach agreement (see the discussions we are having now).
>
> I'll keep this PR open for a few days to collect more feedback.

@cloud-fan That makes complete sense. Getting community consensus on global architecture boundaries for a project as massive as Spark would definitely stall this PR. Starting strictly with the DevOps, build, and test instructions is a great v1.

As a lightweight compromise for this PR, perhaps we just include the Federated Pattern note at the top?

Note: For component-specific instructions, look for nested AGENTS.md files in subdirectories (e.g., sql/AGENTS.md, python/AGENTS.md). Agents read the nearest file in the directory tree, so the closest one takes precedence.

If we establish that routing rule now, it unlocks the ability for individual component maintainers to add their own hyper-specific architecture boundaries to their respective subdirectories later, without ever needing to bloat this root file or block on global consensus.

Thanks for driving this forward!

@cloud-fan
Contributor Author

> For component-specific instructions, look for nested AGENTS.md files in subdirectories

let's do it when at least one nested AGENTS.md exists.
