Skip to content

Fix FSDirectory fsync race on output close#1292

Draft
vwilson wants to merge 2 commits intoapache:masterfrom
vwilson:894
Draft

Fix FSDirectory fsync race on output close#1292
vwilson wants to merge 2 commits intoapache:masterfrom
vwilson:894

Conversation

@vwilson
Copy link
Copy Markdown
Contributor

@vwilson vwilson commented May 3, 2026

Fix FSDirectory stale-file race and re-enable crash regression test

Fixes #894

Description

  • Flush FSIndexOutput’s managed FileStream buffer before marking the file stale for Sync().
  • Add a regression test that observes the file length at stale-mark time to cover the race.
  • Re-enable TestIndexWriterOnJRECrash.TestNRTThreads_Mem and make CheckIndexes() dispose the wrapper safely without leaking the directory.

Testing

  • Ran Lucene.Net.Store.TestDirectory in src/Lucene.Net.Tests._J-S successfully.
  • Ran TestIndexWriterOnJRECrash.TestNRTThreads_Mem in src/Lucene.Net.Tests._I-J successfully, including the forked crash path.

This code generated by GPT 5.5 Pro

Copy link
Copy Markdown
Contributor

@paulirwin paulirwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR! It's awesome to see this issue fixed. Some minor things noted below.

if (isOpen)
Exception priorE = null; // LUCENENET: No need to cast to IOException

try
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The three levels of try here are likely overkill, we should be able to collapse them to two (and add back your helpful comments):

if (disposing && isOpen)
{
    Exception priorE = null;
    UninterruptableMonitor.Enter(parent.m_syncLock);
    try
    {
        try 
        {
            file.Flush(flushToDisk: false); 
        }
        catch (Exception ioe) when (ioe.IsIOException())
        {
            priorE = ioe;
        }

        parent.OnIndexOutputClosed(this);
    }
    finally
    {
        UninterruptableMonitor.Exit(parent.m_syncLock);
        isOpen = false;
        IOUtils.DisposeWhileHandlingException(priorE, file);
    }
}

Aside: it's arguable whether or not setting isOpen to false should be inside the lock or not. On the one hand, it is volatile and not one of the fields that is considered for the lock elsewhere. But, if we moved setting it inside the lock (up one line in my code above) and did a double-check inside the first try, we could possibly avoid some unnecessary extra work. Not sure if it's worth fixing, but just noting for consideration.

try
{
Exception priorE = null; // LUCENENET: No need to cast to IOException
// LUCENENET: FileStream has a managed buffer. If we add this file
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should call out in this comment why Java doesn't have this problem. As far as I can tell, it seems to be because RandomAccessFile.write() does not buffer like FileStream does, so there is no flush race.

protected override void Dispose(bool disposing)
{
if (disposing)
if (disposing && isOpen)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the read of a boolean is atomic, there is still a chance that 2 threads could make in here. Was that intentional?

Note that in the FSDirectory subclasses, we do this in the dispose block to ensure dispose only happens by the first thread and all others are ignored.

if (0 != Interlocked.CompareExchange(ref this.disposed, 1, 0)) return;

Then for the IsOpen checks, we do the check atomically.

        // LUCENENET specific - ensure we can't be abused after dispose
        private bool IsOpen => Interlocked.CompareExchange(ref this.disposed, 0, 0) == 0 ? true : false;

        // LUCENENET specific - ensure we can't be abused after dispose
        private void EnsureOpen()
        {
            if (!IsOpen)
            {
                throw AlreadyClosedException.Create(this.GetType().FullName, "this IndexOutput is disposed.");
            }
        }

Not sure this is contributing to the problem at hand, but it is usually better not to make assumptions that running the Dispose() logic more than once is supposed to be safe to do.

GetTargetPlatform() returned "x64" on any 64-bit process, which on Apple
Silicon (and other ARM64 hosts) caused the forked vstest to abort with
"Could not find 'dotnet' host for the 'X64' architecture". The fork
never connected back to the parent's TCP listener, so the parent blocked
forever in WaitForProcessId until the assembly Timeout fired.

Detect the architecture via RuntimeInformation.ProcessArchitecture so
the fork runs in the same architecture as the parent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@paulirwin
Copy link
Copy Markdown
Contributor

@vwilson I pushed an update to your PR. The now-running TestNRTThreads_Mem test was failing on arm64, which was never noticed before since it was disabled with AwaitsFix. The architecture detection was hard-coded to only support x64 and x86, so this adds ARM architecture support. The tests pass locally for me on macOS arm64 now.

@NightOwl888
Copy link
Copy Markdown
Contributor

NightOwl888 commented May 4, 2026

I am running the complete barrage of tests on Azure DevOps (30 complete runs). That will shake out any issues if there are any. It will take a few hours to complete all of those runs, though.

So far, there is one test failure on this test:

https://dev.azure.com/lucene-net-temp4/Main/_build/results?buildId=353&view=ms.vss-test-web.build-test-results-tab

But, it happened on net10.0, and we are always getting random failures on net10.0. So, I cannot say for sure this broke this particular test in any way that net10.0 isn't already broken.

I also added [Repeat(50)] attribute locally and ran it twice (10 minutes each time), and it was green both times. This is definitely an improvement.

@paulirwin
Copy link
Copy Markdown
Contributor

@NightOwl888 Oof, a check index failed error. That's potentially bad. I haven't seen that on net10 yet.

Are you able to get any more details on that one? The message leaves a bit to be desired.

@NightOwl888
Copy link
Copy Markdown
Contributor

After all of these runs are done, I will open an issue and include some more details. net10.0 is like a shooting gallery. There are failures that randomly occur in multiple tests.

But this specific test that is being fixed in this issue is specifically running CheckIndex to check the result and intentionally trying to stress the scenario to force it to write a broken index if something is not right.

So far, most of the failures are on x86, on net9.0 and net10.0. But there was one on net48 on x64 that I am downloading the logs for - the runner indicates a crash in Lucene.Net.Tests._I-J.

@NightOwl888
Copy link
Copy Markdown
Contributor

Well, the results are in. Unfortunately, this didn't seem to be the entire cause. However, it may indeed have been one of the contributing factors.

Results

You can see the runs with the title Fix FSDirectory fsync race on output close here. I did 6 runs on 5 different projects.

https://dev.azure.com/shad0962/Experiments/_build?definitionId=2&_a=summary
https://dev.azure.com/LuceneNET-Temp/Lucene.NET/_build?definitionId=1&_a=summary
https://dev.azure.com/lucene-net-temp2/Lucene.NET/_build?definitionId=1&_a=summary
https://dev.azure.com/lucene-net-temp3/Lucene.NET/_build?definitionId=1&_a=summary
https://dev.azure.com/lucene-net-temp4/Main/_build?definitionId=1&_a=summary

There are some irrelevant failures of TestRandomStrings (fixed in #122) and anything that failed on net10.0 is suspect because we have been having concurrency problems since the target was added (see #1295). That doesn't leave so many failures, but unfortunately this didn't eliminate them all.


There seems to be some rot with this test that didn't age too well, also. I am seeing freezes when trying to run the tests on Windows x64 and selecting x86 as the target platform. I stepped through and the correct value gets built into the dotnet test command, but the problem is with the SDK itself.

Since .NET 8, we can no longer assume that when calling dotnet on 64-bit Windows that an x86 call will work if only the 64-bit SDK is installed. Now, the 32-bit SDK must be installed separately for the test to function when running the parent test as x86. While I think @paulirwin made the right call by choosing the current runtime platform to pass on to the forked test, this detail means it is not guaranteed to run on that platform.

My thought is to attempt to detect whether the SDK is installed and either make the test inconclusive or do a hard failure to indicate that the SDK must be installed. But maybe there is another way to deal with this. Good test design usually means tests should be ready to run just by cloning the repo. However, now that this SDK behavior has changed, we have inadvertently inherited a dependency on an SDK that wasn't required before.

One other thing we should address is to do a hard failure if the forked process has a non-zero exit code and report the STDERR message in the test. That would prevent a "forever wait" state that we are seeing when the platform is specified wrong as well as give us better info to investigate the failure.

This test has a lot of moving parts, and it is hard to get it to a point where it is consistently stable to run.

@paulirwin
Copy link
Copy Markdown
Contributor

Agreed about needing to hard-fail on non-zero exit code. I didn't put those pieces together but the test was hanging before I added the arm64 fix.

Also, can we leave off the specified architecture if it's not x86? I think that's the only case we have to worry about for the test runs.

(Note that technically .NET supports 32-bit ARM as well, but we don't currently run tests on it. Not sure how practical that is. It basically only exists for Raspberry-Pi-like device support.)

It might be worth seeing if there's some inspiration we could gain from how BenchmarkDotNet does this, too.

@paulirwin
Copy link
Copy Markdown
Contributor

I'm going to convert this to draft until we're sure we're stable here.

@paulirwin paulirwin marked this pull request as draft May 5, 2026 14:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Investigate Failing Test: Lucene.Net.Index.TestIndexWriterOnJRECrash::TestNRTThreads_Mem()

3 participants