Actor Framework Discussions


MGI Panel Actor hanging?


So I've tracked down a very annoying bug- I have an actor that launches a test actor, and I was getting hangups every once in a while. Very intermittent, and very hard to reproduce consistently- I'd have to launch a test, stop the test, then relaunch it a bunch of times, then catch the hang. Edit note: I'm using the MGI Panel Manager toolkit to handle displaying my actors in subpanels.

 

The basic info here is that the test actor stays alive but in an inactive state, just displaying values, allowing the user to save the data, that kind of thing. It doesn't fully close until the next one starts up. This means I have two test actors open at the same time.

 

Most of the time this works great, but every now and then the new one starts up and nearly instantly hangs, while the old one doesn't close. Through a few hours of detective work with the DETT (thank HEAVENS for the DETT) I have found that the hangup is happening in Panel.lvlib: Subpanel.lvclass: Show core.vi (part of the MGI Panel Manager toolkit; technically not part of the Actor framework part, just the Panel part).

 

The hangup happens when the code tries to grab the semaphore to allow it to start messing with the subpanel. Somehow it never acquires the semaphore. The delay is defaulted to -1, thus it waits forever.
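As a rough text-language analogy (LabVIEW code is graphical, so this is Python's `threading.Semaphore`, not the toolkit's code), the sketch below shows why a finite timeout is easier to debug than an infinite wait. The helper name `try_acquire` and the timeout values are assumptions made for illustration.

```python
import threading

# A semaphore with a single slot, standing in for the subpanel lock.
lock = threading.Semaphore(1)

# The first caller takes the only slot and, in this sketch, never gives it back.
assert lock.acquire(timeout=0.1)


def try_acquire(sem, timeout_s):
    """Return True if a slot was obtained within timeout_s seconds."""
    return sem.acquire(timeout=timeout_s)


# With a finite timeout the second caller gets a failure it can log,
# instead of blocking forever the way an infinite (-1) wait would.
got_it = try_acquire(lock, 0.2)
print("second caller acquired:", got_it)
```

A timeout does not fix the underlying race, but it turns a silent hang into an error that can be traced.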

 

This appears to be a race condition since it's so intermittent, but I am having a bear of a time tracking it down. Has anyone seen this before?

 

Edit note: some text was showing up as smiley faces, and I added a comment to indicate I was using the MGI toolkit




Heads up! NI is moving LabVIEW to a mandatory SaaS subscription policy, along with a big price increase. Make your voice heard.
Message 1 of 17
(4,692 Views)

Whoops, I was mistaken on one thing- while Show is hanging at its semaphore, it's actually not the first thing to hang. Hide Core (on the old test, the one to be closed) actually tries to get the lock first, but it hangs there. Shortly after, the new one tries to get the same lock, where it also hangs.

 

I added some messages to Acquire and Release semaphore, and it appears that Hide Core does an Acquire on a named semaphore, then releases an unnamed semaphore, but I can't figure out for the life of me how that's happening.

 

I should note that though these are different actors, they're both interfacing with the same subpanel, and therefore the same semaphore. In the MGI panel actor the subpanel load/unload protection is done with a named semaphore. Even though the two actors should have nothing to do with each other, they do have access to the same semaphore since the semaphore name is based on the subpanel's refnum (converted to an Int then a String).
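To illustrate why two unrelated actors end up sharing one lock, here is a minimal Python sketch of a name-keyed semaphore registry. The registry, the function names, and the hexadecimal formatting of the refnum are assumptions for illustration; the toolkit's actual naming code is not shown in this thread.

```python
import threading

_registry = {}
_registry_guard = threading.Lock()


def obtain_named(name):
    """Return the semaphore registered under name, creating it on first use."""
    with _registry_guard:
        return _registry.setdefault(name, threading.Semaphore(1))


def name_for_subpanel(refnum):
    """Derive a semaphore name from a subpanel refnum (hex format is assumed)."""
    return format(refnum, "X")


subpanel_refnum = 0x6E200F7  # example value only

# Two actors that know nothing about each other, but target the same subpanel.
old_actor_sem = obtain_named(name_for_subpanel(subpanel_refnum))
new_actor_sem = obtain_named(name_for_subpanel(subpanel_refnum))

print(old_actor_sem is new_actor_sem)
```

Because the name is derived only from the subpanel, anything that touches that subpanel contends for the same lock.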




Message 2 of 17
(4,679 Views)

Well... one more post before I call it a week. I have currently found one place a glitch seems to be occurring, and it's in Panel.lvlib:Subpanel.lvclass:Hide core.vi.

 

This VI is part of the MGI panel framework. I've modified it a bit but unfortunately can't post a snippet as it requires the correct dependencies, but I think a picture should suffice.

 

semaphoreglitch.png

 

Hopefully that shows up. The only difference between the original MGI VI and this one is that I've added "Get Semaphore Status.vi" and a DETT logger between each semaphore operation (the DETT logger basically just adds a little bit of text and flow control to the primitive). I use the Get Status to get the name of the semaphore it's using.

 

In theory, I should see "Hide request lock [name]", then "Hide got lock [name]", then "Hide request release [name]", then finally "Hide got release [name]". This way I inspect the semaphore reference before and after each request to lock or unlock it.
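The tracing pattern described above can be sketched in Python as a thin wrapper that logs before and after each semaphore operation. The class name `TracedSemaphore` and the list-based log are hypothetical stand-ins for the DETT logger; only the four message formats come from the post.

```python
import threading


class TracedSemaphore:
    """Wraps a semaphore and records a trace line around every operation."""

    def __init__(self, name, log):
        self.name = name
        self._sem = threading.Semaphore(1)
        self._log = log

    def acquire(self, who):
        self._log.append(f"{who} request lock {self.name}")
        self._sem.acquire()
        self._log.append(f"{who} got lock {self.name}")

    def release(self, who):
        self._log.append(f"{who} request release {self.name}")
        self._sem.release()
        self._log.append(f"{who} got release {self.name}")


log = []
sem = TracedSemaphore("6E200F7", log)
sem.acquire("Hide")
sem.release("Hide")
print("\n".join(log))
```

A "request" line with no matching "got" line points directly at the operation that blocked.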

 

I started and stopped the test about 8 times over the course of a minute, and when I encountered the bug I found this in my DETT log:

 

semaphoreglitch2.png

 

When requesting a lock and getting the lock, you can see that it's returning 6E200F7, which is apparently the name of the semaphore. However, when the same VI requests *releasing* the *same reference wire*, it returns an unnamed reference. For some reason, the semaphore name is getting lost, and the VI thinks it's releasing something that it isn't.

 

Yes, I checked, and the wires are connected. Again, this is an intermittent bug, not one that happens every time.

 

Also, though this VI is reentrant, you can see on the right in the DETT log that it's the same clone firing those traces back to back, as would be expected for a lightweight VI like this one.

 

I'm totally stumped here- hopefully someone else can chime in. The whole point of a semaphore is to avoid a deadlock, and here we are with one anyway! I think this may be some complicated interaction of dynamically called VI's, but I can't for the life of me figure it out. How on earth can a single refnum wire have a name at one point, then lose its name a mere millisecond later?




Message 3 of 17
(4,669 Views)
"How on earth can a single refnum wire have a name at one point, then lose its name a mere millisecond later?"

If there is an error on the "error in" terminal, the "Get Semaphore Status" won't return a name. Try not running the error wire through that function, and make sure your DETT-logging VI logs errors.
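The behavior described here follows the common LabVIEW convention that a VI receiving an error on "error in" skips its work and returns default outputs. A minimal Python sketch of that convention follows; the `Error` class and the function below are hypothetical stand-ins, not the real VI.

```python
from dataclasses import dataclass


@dataclass
class Error:
    status: bool = False  # True means an upstream error is present
    code: int = 0


def get_semaphore_status(name, error_in):
    """Return (name, error_out), skipping the lookup if an error came in."""
    if error_in.status:
        # Default output (empty name); the incoming error is passed through.
        return "", error_in
    return name, error_in


# No upstream error: the name is reported.
print(get_semaphore_status("6E200F7", Error()))

# Upstream error (code chosen for illustration): the name comes back empty,
# which in a trace log looks exactly like an unnamed semaphore.
name_seen, err = get_semaphore_status("6E200F7", Error(True, 1))
print(repr(name_seen), err)
```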

Message 4 of 17
(4,655 Views)

Wow, in retrospect it's always so obvious. I was able to catch error 1111 "Release Semaphore called on a semaphore that was not currently acquired". Now I have to figure out what else released my semaphore, but at least now I have some more info. Thanks for the help.




Message 5 of 17
(4,635 Views)

Looks like I've found the bug! It seems to be a race condition in the Panel Manager toolkit. While it can happen with just the standard subpanel VI's, it's more likely to happen when using the Actor Framework along with the Panel Manager. The bug is a race condition involved with closing a subpanel (Close.vi) at the same time as hiding a subpanel (Hide.vi).

 

In my code, I have one actor running in a subpanel. When a user clicks a button, a new actor is launched; if that actor's Pre-Launch Init completes without error, the main program sends a Stop to the previous actor and inserts the new one. Thus, the old actor will call Close.vi as part of its cleanup routine. Meanwhile, the new actor will insert itself into the subpanel, and will call Hide on whatever panel was in there before. Thus, Close.vi and Hide.vi are called nearly simultaneously, and they both interact with the same named Semaphore reference involving the Subpanel. When this happens, there is a race condition and the semaphore *reference* may be released before the semaphore *itself* is returned to the "pool", thus "eating" the semaphore lock. In this way, the new actor will insert itself OK, and the old actor will be removed OK, but the semaphore belonging to the subpanel itself will have no more items available.

 

I have been able to reproduce the effects of this in a standalone VI that doesn't use the Panel Manager framework. Here is a VI Snippet:

 

semaphore deadlock.png

 

Run the VI and notice that the "# available" indicator never changes from 0. If a new process were to access this Semaphore (as is done when a new actor is inserted into the subpanel) then it will not be able to get the lock, and will permanently hang.
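The interleaving described above can be replayed deterministically in a small Python model. This is a sketch of the behavior as reported in this thread, not LabVIEW's implementation; every class and variable name is hypothetical, and the extra `monitor_ref` stands in for whatever keeps the underlying semaphore alive.

```python
class InvalidReference(Exception):
    """Raised when a released (stale) reference is used."""


class Semaphore:
    def __init__(self, size=1):
        self.available = size


class SemaphoreRef:
    """A handle to a semaphore; the handle can be invalidated independently."""

    def __init__(self, semaphore):
        self._semaphore = semaphore
        self._valid = True

    def _check(self):
        if not self._valid:
            raise InvalidReference("reference was already released")

    def try_acquire(self):
        self._check()
        if self._semaphore.available == 0:
            return False
        self._semaphore.available -= 1
        return True

    def release(self):
        self._check()
        self._semaphore.available += 1

    def release_reference(self):
        self._valid = False


subpanel_sem = Semaphore(size=1)
shared_ref = SemaphoreRef(subpanel_sem)   # used by both Hide and Close
monitor_ref = SemaphoreRef(subpanel_sem)  # keeps the semaphore itself alive

assert shared_ref.try_acquire()   # Hide takes the lock
shared_ref.release_reference()    # Close throws the shared reference away
try:
    shared_ref.release()          # Hide tries to give the lock back
    hide_released = True
except InvalidReference:
    hide_released = False         # the lock has been "eaten"

# A newly inserted actor obtains its own reference to the same semaphore.
new_actor_ref = SemaphoreRef(subpanel_sem)
new_actor_got_lock = new_actor_ref.try_acquire()

print(hide_released, subpanel_sem.available, new_actor_got_lock)
```

With a blocking acquire and an infinite timeout, the last step is where the new actor would hang.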

 

I am sending this info along to MGI as well. There should be a way to work around this, but it's late and I'm heading out for the day. I'll reply back to this post with a fix if I can figure one out tomorrow.

 

Edit: called it a "framework", changed that to "toolkit"




Message 6 of 17
(4,625 Views)

Your major bug, to my eyes, is you have one call to Obtain Semaphore and two calls to Release Semaphore. Stop that. Give each process its own call to Obtain Semaphore.

 

But you could go further. Why allow the two actor processes to ever own the semaphore? You could have the caller actor generate the semaphore once and then supply it to the subpanel actors when they launch. The subpanel actors would never call either obtain or release. They would just use the semaphore. The caller actor would destroy the semaphore as part of its own shutdown.
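A minimal Python sketch of that ownership model, assuming hypothetical `CallerActor` and `SubpanelActor` classes: the caller creates the lock once and hands it to each child, and the children only use it.

```python
import threading


class SubpanelActor:
    def __init__(self, name, panel_lock):
        self.name = name
        # Borrowed from the caller; never created or destroyed here.
        self._panel_lock = panel_lock

    def show(self, log):
        # Acquire and release are paired by the with-block.
        with self._panel_lock:
            log.append(f"{self.name} shown")


class CallerActor:
    def __init__(self):
        # Created once; its lifetime is tied to the caller, not the children.
        self.panel_lock = threading.Semaphore(1)

    def launch(self, name):
        return SubpanelActor(name, self.panel_lock)


caller = CallerActor()
log = []
actors = [caller.launch("old test"), caller.launch("new test")]
threads = [threading.Thread(target=a.show, args=(log,)) for a in actors]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(log))
```

Since no child can invalidate the lock, a child shutting down cannot strand another child mid-operation.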

Message 7 of 17
(4,609 Views)

This is all code internal to the MGI Panel Manager toolkit (http://www.mooregoodideas.com/panel-actors/). I didn't create any of it, I just found the bug. I created this example VI to show the race condition, as it's non-apparent in the regular code.

 

I'm currently trying to think of a way to fix the issue without redoing that entire toolkit. I just figured out the race condition last night so I'm still percolating on things.

 

The way the toolkit currently works is that a "Panel" object is generated that's associated with the Actor that's using it, and a second Actor is able to get a reference to that object when it wants to insert itself into the Subpanel frame. This breaks the "Messages only" paradigm that the AF relies on, thus generating this bug. It *should* use messages to insert and remove itself from a panel, but currently it uses an FGV that's unique to the subpanel itself, and relies on a semaphore to ensure individual processes access it atomically. Obviously this ended up with a race condition :)

 

(Aside- I'm not quite sure what you mean about one Obtain and two Releases- there is one Obtain Semaphore Reference, two Acquire Semaphores, two Release Semaphores, and one Release Semaphore Reference. OK, technically there's a second Obtain Semaphore, but it's in Process 2, which just monitors the reference and makes sure it doesn't go idle; it's not doing anything other than leaking the reference when the example VI closes, which was a simple oversight on my part.)

 

Not defending the code or anything, in fact it's definitely in a sticky spot here that's going to require rethinking some of the decisions made when the framework was first conceived. Part of the issue is that the Panel framework is separate from the Actor framework, so I assume they're trying to keep all of their Panel functions independent of their Panel Actor functions. I don't want to speak for MGI though.

 

Anywho, I've only been trying to consider a fix since last night; beforehand I was trying to sift through the various function calls to see which functions were being called. It turns out the race condition was being triggered at a time I didn't expect, and it wasn't apparent until I tried to launch another actor. Hopefully I can get something rolling today.




Message 8 of 17
(4,605 Views)
Solution
Accepted by topic author BertMcMahan

New update. I've found a workaround that SHOULD work. I'm not familiar with the whole of the toolkit, so hopefully this doesn't break anything else. I'm past a deadline already so I can't afford to refactor the entire thing, but this should get me going for now. (Note: this can be a little confusing so I'm distinguishing between Semaphore references and Semaphores; I'm calling the latter just "Locks").

 

The issue is that two different processes (in this case, actors) have access to the same Semaphore reference (note: not just the same Semaphore, but the same Semaphore reference). If the Close process begins, then the Hide process tries to grab the lock, then Close will throw away the reference just after Hide acquires it, and Hide therefore won't release the lock since its reference is invalid.

 

Since AFAIK Hide doesn't actually need to execute once Close has been called, I need a way to make sure Hide never gets the lock. Therefore, after Close acquires the Lock, it will generate a *new* Semaphore reference internal to itself. It will then destroy the old Semaphore reference without ever releasing that Lock. Now, Hide will return an error since it never got the Lock it wanted. Since Close is the last thing the panel will do, we don't need to run Hide, so the error is OK and can be ignored.

 

Next, Close does what it needs to do, then releases a Lock into its NEW Semaphore reference, after which it releases the new Semaphore reference.

 

This means that once Close starts its process, no other VI's can access that Semaphore reference.

semaphore deadlock simple workaround.png

 

I think this is non-ideal, but I also think it'll work to get my project moving along, and I don't want to refactor a dozen VI's in the toolkit :)




Message 9 of 17
(4,592 Views)

Got it. I understand the problem better now.

 

I think you could just have your root actor do an Obtain Semaphore on that name. The problem you're encountering is the semaphore going stale, right? If you do a root-level obtain, the semaphore will never go stale while the rest of the framework is running. Just have your root actor obtain the semaphore at start of its Actor Core and release it at the end of its actor core. That would be my attempt at a workaround.
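The idea can be sketched in Python with a reference-counted registry. This assumes a named semaphore is destroyed once its last reference is released; that assumption, and all of the names below, are illustrative rather than taken from the toolkit.

```python
import threading


class NamedSemaphoreRegistry:
    """Name-keyed semaphores that disappear when their last reference goes."""

    def __init__(self):
        self._entries = {}  # name -> [semaphore, reference count]

    def obtain(self, name):
        entry = self._entries.setdefault(name, [threading.Semaphore(1), 0])
        entry[1] += 1
        return entry[0]

    def release_reference(self, name):
        entry = self._entries[name]
        entry[1] -= 1
        if entry[1] == 0:
            del self._entries[name]  # last reference gone: semaphore destroyed

    def exists(self, name):
        return name in self._entries


registry = NamedSemaphoreRegistry()
name = "6E200F7"

# Without a root-level obtain, the semaphore vanishes when the child lets go.
registry.obtain(name)
registry.release_reference(name)
gone_without_root = not registry.exists(name)

# With a root-level obtain held for the root actor's lifetime, it survives.
root_sem = registry.obtain(name)    # start of the root's Actor Core
registry.obtain(name)               # a child actor obtains...
registry.release_reference(name)    # ...and later releases its reference
alive_with_root = registry.exists(name)
same_object = registry.obtain(name) is root_sem
registry.release_reference(name)    # balance the check above
registry.release_reference(name)    # end of the root's Actor Core

print(gone_without_root, alive_with_root, same_object)
```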

Message 10 of 17
(4,586 Views)