RSDK-12570: Allow caller to lock Operation Manager #5436
Conversation
I'll leave it to the team whether this is a better solution, but you could convert `All` to return an iterator:

```go
// All returns an [iter.Seq2] of OperationID/*Operation key/value pairs of all
// running operations known to the Manager. A lock is held on the manager for
// the duration of the iteration, so callers must not retain copies of the
// Operation pointers as there is no guarantee they will remain valid after
// iteration has ended.
func (m *Manager) All() iter.Seq2[string, *Operation] {
	return func(yield func(string, *Operation) bool) {
		m.lock.Lock()
		defer m.lock.Unlock()
		for i, o := range m.ops {
			if !yield(i, o) {
				break
			}
		}
	}
}
```

Callers can then mostly use this normally with a `for ... range` loop.
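For what it's worth, a minimal sketch of how a caller might consume that iterator (assuming a `*Manager` value `m`; the `"restart"` label and the `matching` slice are made up for illustration):

```go
// Hypothetical caller of the iterator-based All sketched above (range over
// an iter.Seq2 requires Go 1.23+). The manager lock is held for the whole
// loop, so only plain data (the string IDs) is copied out; no *Operation
// pointers are retained past the iteration.
var matching []string
for id, op := range m.All() {
	if op.HasLabel("restart") { // "restart" is an arbitrary example label
		matching = append(matching, id)
	}
}
```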
Thanks! I think the general solution is good, but I do like Josh's suggestion more since we guarantee that the lock holds through the iteration.
Thanks! Never used `iter.Seq2` before.
That feels dangerous, no? These mutexes are not reentrant. Without a strong, understood convention, it feels easy to write a deadlock.
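To make that concern concrete, here's a hypothetical way a caller could deadlock with the iterator-based `All`, assuming (as the field comment in the diff below suggests) that `CancelOtherWithLabel` also takes the manager's lock:

```go
// Hypothetical deadlock: All holds m.lock for the entire iteration, and the
// loop body calls a method assumed to acquire m.lock again. sync.Mutex is
// not reentrant, so the goroutine blocks on itself.
for _, op := range m.All() { // m.lock held here
	op.CancelOtherWithLabel("restart") // assumed to also lock m.lock -> deadlock
}
```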
```go
// If providing a custom context.CancelFunc, do not lock Manager's lock, as it will
// conflict with other batch-cancel methods: CancelOtherWithLabel, Manager.CancelAll
cancel context.CancelFunc
labels []string
```
@benjirewis and I looked at this yesterday and I think what's actually happening is that multiple goroutines are accessing labels concurrently. Specifically:
What I think happened in the race: a slice header is made up of an integer length field and a pointer to the underlying growable array of data. An append (specifically when going from the nil empty slice to a non-empty slice of length 1) bumps the length and allocates a new array. But a racing reader can observe the updated, non-zero length while still seeing the (still nil) pointer to the backing array, and dereference it.
Now, I don't understand what this whole label business is. But if we need labels, we should lock individual operations rather than the entire manager.
That also sounds right to me; here's a toy program that can panic for, I think, the reason we're talking about:

```go
package main

import (
	"sync"
)

type Operation struct {
	labels []string
}

// HasLabel returns true if this operation has a specific label.
func (o *Operation) HasLabel(label string) bool {
	for _, l := range o.labels {
		if l == label {
			return true
		}
	}
	return false
}

func main() {
	for {
		op := &Operation{}
		var wg sync.WaitGroup
		wg.Add(2)
		go func() {
			defer wg.Done()
			op.HasLabel("benny")
		}()
		go func() {
			defer wg.Done()
			op.labels = append(op.labels, "benny")
		}()
		wg.Wait()
		println("Finished run")
	}
}
```

> But if we need labels, we should lock individual operations rather than the entire manager.
I'm a little confused what you mean here, though, @dgottlieb. You mean using a (potentially separate) mutex around writes/reads of `labels`?
> Now, I don't understand what this whole label business is. But if we need labels, we should lock individual operations rather than the entire manager.
From what I can tell, a label is an arbitrary name used to refer to an operation, and it's lazily set by `CancelOtherWithLabel` (which cancels any other operations with the same label before assigning the label to the calling `Operation`; also see the tests from the original commit [link]). I agree that locking individual operations is ideal, but the current design is a little messy, with single `Operation`s being able to read and mutate the `Manager` that manages all `Operation`s. I'm happy to spend more time looking into this if we feel it will be beneficial.
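A rough, hypothetical sketch of that described behavior (not the actual rdk code, and ignoring locking):

```go
// Rough sketch only: cancel every other operation that carries the label,
// then attach the label to the calling operation. The unsynchronized append
// at the end is the kind of write involved in the race discussed above.
func (o *Operation) CancelOtherWithLabel(label string) {
	for _, op := range o.myManager.All() {
		if op == o {
			continue
		}
		if op.HasLabel(label) {
			op.Cancel()
		}
	}
	o.labels = append(o.labels, label)
}
```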
> That feels dangerous, no? These mutexes are not reentrant. Without a strong, understood convention, it feels easy to write a deadlock.
FWIW I tried migrating all the use cases of the existing All() to the proposed iter.Seq2 version and don't see any deadlocks.
> You mean using a (potentially separate) mutex around writes/reads of `labels`?
Discussed offline: yes, that is what is meant.
(Cont. from offline discussion): Adding a new lock around `labels` would look something like the snippet below. The issue described in the first comment still remains, though. Shall we keep these in-progress changes, and I'll make a new PR to add the labels lock?

```go
type Operation struct {
	...
	labelsLock sync.Mutex
	labels     []string
}

func (o *Operation) HasLabel(label string) bool {
	o.labelsLock.Lock()
	defer o.labelsLock.Unlock()
	for _, l := range o.labels {
		...
	}
}

func (o *Operation) CancelOtherWithLabel(label string) {
	...
	o.labelsLock.Lock()
	defer o.labelsLock.Unlock()
	o.labels = append(o.labels, label)
}
```
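For reference, here's a minimal, self-contained variant of the earlier toy program with that per-operation lock applied (the `addLabel` helper is hypothetical, standing in for the unsynchronized append inside `CancelOtherWithLabel`); running it under `go run -race` reports no data race:

```go
package main

import "sync"

type Operation struct {
	labelsLock sync.Mutex
	labels     []string
}

// HasLabel returns true if this operation has a specific label.
func (o *Operation) HasLabel(label string) bool {
	o.labelsLock.Lock()
	defer o.labelsLock.Unlock()
	for _, l := range o.labels {
		if l == label {
			return true
		}
	}
	return false
}

// addLabel is a hypothetical helper standing in for the append that
// CancelOtherWithLabel performs; it synchronizes on the same lock as HasLabel.
func (o *Operation) addLabel(label string) {
	o.labelsLock.Lock()
	defer o.labelsLock.Unlock()
	o.labels = append(o.labels, label)
}

func main() {
	// Bounded variant of the earlier toy program. Run with `go run -race`;
	// with the per-operation lock in place the race detector stays quiet.
	for i := 0; i < 1000; i++ {
		op := &Operation{}
		var wg sync.WaitGroup
		wg.Add(2)
		go func() {
			defer wg.Done()
			op.HasLabel("benny")
		}()
		go func() {
			defer wg.Done()
			op.addLabel("benny")
		}()
		wg.Wait()
	}
	println("Finished all runs")
}
```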
Can this be closed now that #5444 was merged?
I'll change it to Draft for now. I made a new ticket to track possible correctness issues (possibly from future misuse) with the current OpMgr: https://viam.atlassian.net/browse/RSDK-12570; the merged PR is just a quickfix for the panic.
The Operation Manager is currently often being used in this manner, both internally and externally:

rdk/operation/opid.go, lines 55 to 68 in deee8a2

In the above snippet:

- `o.myManager.All()` acquires the Operation Manager's lock and returns a `[]*Operation` of all operations known to the Operation Manager.
- The caller then calls `op.Cancel()` for each returned operation.
- There is no guarantee that the pointers returned still point to valid memory.
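A rough, hypothetical approximation of the pattern being described (not the exact opid.go code linked above):

```go
// Hypothetical caller approximating the referenced snippet: take a snapshot
// of every operation and cancel each one. Once All() returns, the manager's
// lock has been released, so nothing keeps these pointers valid while the
// loop runs.
ops := o.myManager.All()
for _, op := range ops {
	op.Cancel()
}
```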
Users have reported seeing `panic: runtime error: invalid memory address ...` from the Operation Manager when running highly concurrent code.

With this PR, I've added new public `Lock()`, `Unlock()`, and `AllWithoutLock()` methods to the Operation Manager. `AllWithoutLock` functions similarly to `All()`, but leaves it to the caller to lock/unlock the Operation Manager if they want to process the returned operations.

This does appear to resolve the reported panic, but I realize this design of asking the caller to manage the locking may not be ideal, so let me know if you have any better suggestions.
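A hedged sketch of how a caller might use the new methods (the label check in the loop body is illustrative, not taken from the PR):

```go
// Hypothetical caller using the new Lock/Unlock/AllWithoutLock methods:
// hold the manager's lock for the whole walk so the returned pointers stay
// valid while we inspect and cancel operations.
o.myManager.Lock()
defer o.myManager.Unlock()
for _, op := range o.myManager.AllWithoutLock() {
	if op != o && op.HasLabel("some-label") {
		// Cancel funcs are expected not to take the manager's lock
		// (see the field comment in the diff above).
		op.Cancel()
	}
}
```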
Lock(),Unlock(), andAllWithoutLock()methods to the Operation Manager.AllWithoutLockthat functions similarly toAll(), but leaves it to the caller to lock/unlock the Operation Manager if they want to process the returned operations.This does appear to resolve the reported panic, but I realize this design of asking the caller to manage the locking may not be ideal, so let me know if you have any better suggestions.