Fix FSDirectory fsync race on output close#1292
Fix FSDirectory fsync race on output close#1292vwilson wants to merge 2 commits intoapache:masterfrom
Conversation
paulirwin
left a comment
There was a problem hiding this comment.
Thanks for this PR! It's awesome to see this issue fixed. Some minor things noted below.
| if (isOpen) | ||
| Exception priorE = null; // LUCENENET: No need to cast to IOException | ||
|
|
||
| try |
There was a problem hiding this comment.
The three levels of try here are likely overkill, we should be able to collapse them to two (and add back your helpful comments):
if (disposing && isOpen)
{
Exception priorE = null;
UninterruptableMonitor.Enter(parent.m_syncLock);
try
{
try
{
file.Flush(flushToDisk: false);
}
catch (Exception ioe) when (ioe.IsIOException())
{
priorE = ioe;
}
parent.OnIndexOutputClosed(this);
}
finally
{
UninterruptableMonitor.Exit(parent.m_syncLock);
isOpen = false;
IOUtils.DisposeWhileHandlingException(priorE, file);
}
}Aside: it's arguable whether or not setting isOpen to false should be inside the lock or not. On the one hand, it is volatile and not one of the fields that is considered for the lock elsewhere. But, if we moved setting it inside the lock (up one line in my code above) and did a double-check inside the first try, we could possibly avoid some unnecessary extra work. Not sure if it's worth fixing, but just noting for consideration.
| try | ||
| { | ||
| Exception priorE = null; // LUCENENET: No need to cast to IOException | ||
| // LUCENENET: FileStream has a managed buffer. If we add this file |
There was a problem hiding this comment.
We should call out in this comment why Java doesn't have this problem. As far as I can tell, it seems to be because RandomAccessFile.write() does not buffer like FileStream does, so there is no flush race.
| protected override void Dispose(bool disposing) | ||
| { | ||
| if (disposing) | ||
| if (disposing && isOpen) |
There was a problem hiding this comment.
While the read of a boolean is atomic, there is still a chance that 2 threads could make in here. Was that intentional?
Note that in the FSDirectory subclasses, we do this in the dispose block to ensure dispose only happens by the first thread and all others are ignored.
if (0 != Interlocked.CompareExchange(ref this.disposed, 1, 0)) return;Then for the IsOpen checks, we do the check atomically.
// LUCENENET specific - ensure we can't be abused after dispose
private bool IsOpen => Interlocked.CompareExchange(ref this.disposed, 0, 0) == 0 ? true : false;
// LUCENENET specific - ensure we can't be abused after dispose
private void EnsureOpen()
{
if (!IsOpen)
{
throw AlreadyClosedException.Create(this.GetType().FullName, "this IndexOutput is disposed.");
}
}Not sure this is contributing to the problem at hand, but it is usually better not to make assumptions that running the Dispose() logic more than once is supposed to be safe to do.
GetTargetPlatform() returned "x64" on any 64-bit process, which on Apple Silicon (and other ARM64 hosts) caused the forked vstest to abort with "Could not find 'dotnet' host for the 'X64' architecture". The fork never connected back to the parent's TCP listener, so the parent blocked forever in WaitForProcessId until the assembly Timeout fired. Detect the architecture via RuntimeInformation.ProcessArchitecture so the fork runs in the same architecture as the parent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@vwilson I pushed an update to your PR. The now-running TestNRTThreads_Mem test was failing on arm64, which was never noticed before since it was disabled with AwaitsFix. The architecture detection was hard-coded to only support x64 and x86, so this adds ARM architecture support. The tests pass locally for me on macOS arm64 now. |
|
I am running the complete barrage of tests on Azure DevOps (30 complete runs). That will shake out any issues if there are any. It will take a few hours to complete all of those runs, though. So far, there is one test failure on this test: But, it happened on I also added |
|
@NightOwl888 Oof, a check index failed error. That's potentially bad. I haven't seen that on net10 yet. Are you able to get any more details on that one? The message leaves a bit to be desired. |
|
After all of these runs are done, I will open an issue and include some more details. But this specific test that is being fixed in this issue is specifically running CheckIndex to check the result and intentionally trying to stress the scenario to force it to write a broken index if something is not right. So far, most of the failures are on x86, on |
|
Well, the results are in. Unfortunately, this didn't seem to be the entire cause. However, it may indeed have been one of the contributing factors. ResultsYou can see the runs with the title https://dev.azure.com/shad0962/Experiments/_build?definitionId=2&_a=summary There are some irrelevant failures of There seems to be some rot with this test that didn't age too well, also. I am seeing freezes when trying to run the tests on Windows x64 and selecting x86 as the target platform. I stepped through and the correct value gets built into the Since .NET 8, we can no longer assume that when calling My thought is to attempt to detect whether the SDK is installed and either make the test inconclusive or do a hard failure to indicate that the SDK must be installed. But maybe there is another way to deal with this. Good test design usually means tests should be ready to run just by cloning the repo. However, now that this SDK behavior has changed, we have inadvertently inherited a dependency on an SDK that wasn't required before. One other thing we should address is to do a hard failure if the forked process has a non-zero exit code and report the STDERR message in the test. That would prevent a "forever wait" state that we are seeing when the platform is specified wrong as well as give us better info to investigate the failure. This test has a lot of moving parts, and it is hard to get it to a point where it is consistently stable to run. |
|
Agreed about needing to hard-fail on non-zero exit code. I didn't put those pieces together but the test was hanging before I added the arm64 fix. Also, can we leave off the specified architecture if it's not x86? I think that's the only case we have to worry about for the test runs. (Note that technically .NET supports 32-bit ARM as well, but we don't currently run tests on it. Not sure how practical that is. It basically only exists for Raspberry-Pi-like device support.) It might be worth seeing if there's some inspiration we could gain from how BenchmarkDotNet does this, too. |
|
I'm going to convert this to draft until we're sure we're stable here. |
Fix FSDirectory stale-file race and re-enable crash regression test
Fixes #894
Description
FSIndexOutput’s managedFileStreambuffer before marking the file stale forSync().TestIndexWriterOnJRECrash.TestNRTThreads_Memand makeCheckIndexes()dispose the wrapper safely without leaking the directory.Testing
Lucene.Net.Store.TestDirectoryinsrc/Lucene.Net.Tests._J-Ssuccessfully.TestIndexWriterOnJRECrash.TestNRTThreads_Meminsrc/Lucene.Net.Tests._I-Jsuccessfully, including the forked crash path.This code generated by GPT 5.5 Pro