Skip to content

Conversation

IanMatthewHuff
Copy link
Member

@IanMatthewHuff IanMatthewHuff commented Oct 17, 2025

There was an issue that we ran into when looking at the git repo data that I added. In particular there were a lot of files that were listed as having diffs, but inside the diffJSON package the actual diff string was "".

I investigated and the issue was pretty clear. The files changed were calculated looking at the upstreamCommit ref that we calculate, but the actual individual diff string calculations in getChangeDiff were looking at a diff against HEAD, not against that computed ref. When testing this would work with staged, un-staged, or new files. But it fails if you have local commits that are not pushed upstream. In this case it correctly registers the files as changed, but doesn't pick up the diff from getDiffChanges.

Also per request from science team I added a mapping to convert the file operation (MODIFIED, DELETED, ect..) from the enum value to the readable string.

Scenarios that I tested manually with this (first scenario here is the easiest way to see the repo):

  • Edit a file test.ts, stage the change, commit change, then run the agent (Previous to this change you would see in the 'begin' git diff an entry for test.ts, but the diff string would be empty, now it's populated with the committed change)
  • Edit a file test.ts, stage and commit change, make a different edit to the file locally and don't stage, then run the agent (Should see both the committed change and the local change in the diff string for test.ts)
  • Test python files. The science team was seeing almost all cs / ts files in the data and not python. It think this is just an artifact of heavier usage of CS / TS for internals and we just have an initial bit of data. The file type shouldn't matter. But I did check this to make sure. py files showed up fine in the testing.
  • Create a new branch on the repo and don't push it upstream. Make a local change and check that it shows in the diff. This was previous issue we had with calculating the upstream so I checked this scenario again.
  • Create a new branch on the repo and don't push it upstream. Make a change, commit it, then check to see if it shows in the diff (basically the first scenario on a new branch that's not been pushed upstream)
  • Check the new string status mapping. Checked that a change with modified and deleted files showed the correct status strings for each.

@IanMatthewHuff IanMatthewHuff marked this pull request as ready for review October 17, 2025 21:58
@Copilot Copilot AI review requested due to automatic review settings October 17, 2025 21:58
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a critical issue with git repository telemetry collection where diff strings were empty despite files being correctly identified as changed. The root cause was a mismatch between the reference used to identify changed files (upstream commit) and the reference used to generate diffs (HEAD).

Key changes:

  • Introduces a new method getWorkingTreeDiffsFromRef() to generate diffs against a specific reference
  • Adds status string mapping to convert git status enums to readable strings for telemetry
  • Updates all telemetry collection to use the new diff method with the correct upstream reference

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/platform/git/common/gitDiffService.ts Adds interface method for diff generation against specific ref
src/platform/git/common/nullGitDiffService.ts Implements null service method for new diff interface
src/extension/prompt/vscode-node/gitDiffService.ts Implements core diff generation logic against specified ref
src/extension/prompt/node/repoInfoTelemetry.ts Updates telemetry to use new diff method and adds status string mapping
src/extension/prompt/node/test/repoInfoTelemetry.spec.ts Updates all test mocks to use new diff method and validates string status output

sbatten
sbatten previously approved these changes Oct 21, 2025
let diff: string;
if (change.status === 7 /* UNTRACKED */) {
// For untracked files, generate a patch showing all content as additions
diff = await this._getUntrackedChangePatch(repository, change.uri);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were you able to test with changes added to the index? That seems to have special handling in the other function, so I just want to confirm it works.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sbatten This was a good call to check. I had checked with committed and with unstaged in particular but not with index changes on the new function. There was a tiny bit of a rabbit hole here when I investigated, so I'll summarize.

I checked some staged scenarios with modified files and while I got the correct diff to recreate the file each time, I was a bit surprised that I didn't see INDEX_MODIFIED status in scenarios where I expected it. But in the git extension I saw that if we are using diffWith on a ref we'll generally just be getting a MODIFIED, since diffWith is not using --cached, so it's purposefully ignoring staged vs unstaged in the local repository versus the target ref. It's just going to return the diff from the target ref on modified files for everything between the current state and that ref and won't return an INDEX_MODIFIED.

To test I did the following rotation (just using shorthand here to summarize), which all looked good to me it always had MODIFIED and the full diff of all current changes.



fileA => changeA (unstaged) => diff MODIFIED (changeA) // So the diff showed MODIFIED status with changeA in the file diff

fileA => changeA (staged) => diff MODIFIED (changeA)

fileA => changeA (committed) => diff MODIFIED (changeA)

fileA => changeA (committed) + changeB (unstaged) => diff MODIFIED (changeA + changeB)
fileA => changeA (committed) + changeB (staged) => diff MODIFIED (changeA + changeB)

fileA => changeA (committed) + changeB (committed) => diff MODIFIED (changeA + changeB)

fileA => changeA (committed) + changeB (committed) + changeC (unstaged) => diff MODIFIED (changeA + changeB + changeC)
 

However your comment here did lead me to check and find a bug in one scenario. One of the places that we do currently see an INDEX_* status is when a new file is added, then it shows up as INDEX_ADDED. In this case it returns the new file content via the getWorkingTreeDiffsFromRef, which is correct. But in the case where we have a new file and that new file is in the repository and not ignored, but is untracked we were missing that file. That file won't show up at all in diffWith against the ref. Instead it will be in either untrackedChanges or workingTreeChanges with status UNTRACKED (which it ends up in can change based on config). So after the code where we get our set of file changes from diffWith I add in any changes (by file name) that are not already picked up by diffWith, this pulls in those untracked files. So tested this out with github and ADO repos.
 
When a new file is untracked is shows up with the file contents and status: UNTRACKED. Then when it's added with git add it shows up with the file contents as INDEX_ADDED. In both cases it has the correct file contents. If you modify the file after adding it without pushing the add to upstream, you get INDEX_ADDED still with the modified file contents.

@vs-code-engineering vs-code-engineering bot added this to the October 2025 milestone Oct 21, 2025
const changeMap = new Map<string, Change>();

// Prority to the diffWith changes, then working tree changes, then untracked changes.
for (const change of diffChanges) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This area has the change to pick up the missing UNTRACKED files. According to the git extension it looks like they could show up in either workingTreeChanges or untrackedChanges depending on config. Anything that shows up already via diffWith gets preferred as that's the same diffWith command we'll run to get the actual diff string.

assert.ok(diffs[0].renameUri);
});

test('should include untracked files from both workingTreeChanges and untrackedChanges', async () => {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New test to check that we pick up the untracked files.

@sbatten sbatten assigned sbatten and unassigned joshspicer Oct 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants