
NI Linux Real-Time Discussions


Target Management with Ansible

We have an ever-growing fleet of cRIO-9064s in production, and we're struggling with DevOps staff making one-off config changes to cRIOs and neglecting to make sure those changes propagate to all systems. So the configuration of our systems drifts, inconsistent behaviors creep in, and hair pulling ensues. The O'Reilly book "Infrastructure as Code" refers to these rogue hosts as "snowflake servers" (because each one is unique).

I should note that our cRIO fleet is entirely remote from our office.

One solution is to build an image using the RAD tool and routinely redeploy it to all our systems. The problem is that the RAD tool seems to require systems to be on the same LAN as the RAD PC. Another problem is that it doesn't appear to be scriptable, so deployment would need to be done manually each time. The biggest challenge is that for every slight configuration change we need to:

1) use RAD to deploy the most recent image to a development cRIO

2) make the configuration change on that cRIO

3) test the change

4) create an updated image from the dev cRIO

5) ensure the production system list in RAD tool is up to date

6) deploy the image to all production systems

This is time consuming and laborious. What would be FANTASTIC is if we could use Ansible or something like it. The process would then become

1) Make the configuration change via the ansible playbook

2) Run the Ansible playbook targeting a single dev cRIO

3) Test

4) Run the Ansible playbook targeting all production cRIOs.

Automatically, the Ansible playbook would, in a fully repeatable/idempotent fashion:

- Pull an up-to-date list of production systems from our server where things like that are centrally managed.

- Factory reset the cRIOs using nisystemformat

- Install the software  as defined in the playbook (normally done through MAX or RAD)

- Manage ni-rt.ini settings as specified in the playbook

- Install + configure any packages defined in the playbook using opkg

- Deploy the RTEXE defined in the playbook.

- Configure crontab

- Deploy ssh keys to /home/admin/.ssh/authorized_keys

- just about anything else that could be done via ssh

Furthermore, the Ansible playbook could be run as a cron job, meaning that, weekly or daily, ANY deviation of any system from the proper configuration would be nullified.
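To make that concrete, here's roughly what a few of those steps might look like as a playbook. This is an untested sketch: the module names are standard Ansible, but the NI-specific paths, INI section/option names, and package names are assumptions on my part, not verified against a real cRIO.

```yaml
# crio.yml -- illustrative sketch only; paths and names are assumed
- hosts: crios
  remote_user: admin
  tasks:
    - name: Install a package via opkg
      opkg: name=screen state=present

    - name: Manage a ni-rt.ini setting (path/section/option assumed)
      ini_file:
        dest: /etc/natinst/share/ni-rt.ini
        section: LVRT
        option: RTTarget.LaunchAppAtBoot
        value: "True"

    - name: Deploy the RTEXE (destination path assumed)
      copy:
        src: build/startup.rtexe
        dest: /home/lvuser/natinst/bin/startup.rtexe

    - name: Deploy ssh keys
      authorized_key:
        user: admin
        key: "{{ lookup('file', 'keys/ops.pub') }}"

    - name: Configure crontab
      cron:
        name: nightly log rotation
        special_time: daily
        job: /usr/sbin/logrotate /etc/logrotate.conf
```

Steps 2 and 4 would then collapse to `ansible-playbook -i dev crio.yml` versus `ansible-playbook -i production crio.yml`, and running the latter from cron would give the scheduled drift correction for free.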

What I've described here with Ansible is a practice the O'Reilly book I mentioned recommends for AWS EC2 instances, and I don't see why it shouldn't be used on Linux-based NI targets. I suspect it isn't possible today, however, mainly because software installation needs to be done through MAX or RAD.

Am I correct? Is there any development effort at NI to make something like what I want possible?



Message 1 of 6

Have you gotten ahold of the folks testing out salt (https://decibel.ni.com/content/thread/47213)?

What I can say is that we've internally looked at both Ansible and Salt for device management, for precisely the reasons you've pointed out; we're aware of the deficiencies and pain of the current situation.

Message 2 of 6

Happy to know it's on the radar and I'll dive into that salt thread.

Also, I wrote the message below last night, before your reply, but forgot to send it:

----

I’m far from being a seasoned LabVIEW developer, or an Ansible expert (my focus is more on integration)… but if the LabVIEW source is provided, how hard would it be to augment RAD by writing a web service interface for it? If we had a RAD server with a good API, then I think an Ansible module could be written to control it.

Where I’m going with this is that it could work nicely to create a base image of a fresh cRIO OS state that includes the required NI software components. The base image could be installed on field cRIOs by a RAD server controlled by Ansible as part of a playbook. All other target configuration could be performed via Ansible more directly.

In our case the base image would rarely if ever change, so most often config changes could be done in the playbook alone.

Would anyone else out there have use for such an arrangement, or see value to this approach?

Message 3 of 6

Davegravy,

Our business situation is much like yours--we have hundreds of sbRIOs deployed around the world--and we've run into many of the same problems you've described. It costs a lot of money to send someone to a site and manually connect to a device to do repairs.

if the LabVIEW source is provided, how hard would it be to augment RAD by writing a web service interface for it?

The source code for the RAD tool is available and we considered this solution for the VxWorks devices we have in the field, but ended up rejecting it for several reasons:

1. The RAD tool is designed to be run from a host PC, not a target. IIRC, it starts by formatting the hard drive and rebooting the target into safe mode. At that point you have a pristine target... how is it going to know to make web service calls to your RAD server? You'd need to install some sort of custom bootstrapping code on the freshly formatted device to contact the RAD server, but you're using the RAD server to install your code. It's a circular problem. (This might be easier to accomplish on the Linux targets... I have no idea.)

2. It creates a large exposure window.  Since it formats the disk first, if something goes wrong (power interruption, corrupted image transfer, manual reset, etc.) during the process we're left with a bricked board and have to roll a truck to the site.

3. We found the RAD tool wasn't very reliable.  While we occasionally use the RAD tool during development, it's not uncommon for image writes to fail for one reason or another.  We need better reliability than that.

(There are other reasons as well, but it doesn't sound like they apply to you.)

We ended up creating our own update mechanism and deploying it as part of the target's code.  With the move to linux rt we're looking to take advantage of the ready-made tools that are available.

Edit: Doh! I just realized you were probably talking about Ansible sending requests to your RAD server, not your field devices sending requests! Silly me. If that's the case, and since you're already using the RAD tool to deploy production images, it might work just fine for you.

Message 4 of 6

1. The RAD tool is designed to be run from a host PC, not a target. IIRC, it starts by formatting the hard drive and rebooting the target into safe mode. At that point you have a pristine target... how is it going to know to make web service calls to your RAD server?

Just to clarify, I'd imagined the RAD server being a Windows PC with a LabVIEW environment installed. A separate Linux host with Ansible installed would send a request to the Windows PC: "I want to install base_image_ID=33 to 74.198.222.44". The RAD server would then do its thing without requiring any GUI input.

I assume the RAD tool could poll the target, detect that the format and reboot into safe mode had occurred successfully, and then deliver the payload (image).

Also, formatting and rebooting into safe mode on Linux targets can be performed via SSH, so that component could in theory be stripped out of the RAD tool's scope entirely.

2. It creates a large exposure window.  Since it formats the disk first, if something goes wrong (power interruption, corrupted image transfer, manual reset, etc.) during the process we're left with a bricked board and have to roll a truck to the site.

Can't disagree here. There are ways to mitigate some of these, however... All our systems, for example, have (large) UPS backups, and you could do a checksum compare on the transferred image.
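As a sketch of that last idea (filenames invented, and run here against a stand-in file rather than a real image), the transfer check could be as simple as shipping a digest alongside the image and verifying it on the far end before touching the disk:

```shell
# Generate a digest next to the image on the server side, then verify it after
# transfer. Here everything runs locally on a dummy file; in practice the
# middle step would be an scp/sftp copy to the cRIO.
set -e
head -c 1048576 /dev/urandom > base_image_33.img    # stand-in for the real image
sha256sum base_image_33.img > base_image_33.sha256  # digest created on the server
# ... transfer base_image_33.img + base_image_33.sha256 to the target ...
sha256sum -c base_image_33.sha256                   # non-zero exit on a corrupt copy
```

A retry loop around the final step would then cover transient transfer failures as well.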

3. We found the RAD tool wasn't very reliable.  While we occasionally use the RAD tool during development, it's not uncommon for image writes to fail for one reason or another.  We need better reliability than that.

(There are other reasons as well, but it doesn't sound like they apply to you.)

That's concerning. Were any failed image writes catastrophic or would just "trying again" resolve the issue?

Message 5 of 6

Just to clarify, I'd imagined the RAD server being a Windows PC with a LabVIEW environment installed. A separate Linux host with Ansible installed would send a request to the Windows PC: "I want to install base_image_ID=33 to 74.198.222.44". The RAD server would then do its thing without requiring any GUI input.

That could work...  (We need connections to be initiated by the target, so it's not an option for us.)

I assume the RAD tool could poll the target, detect that the format and reboot into safe mode had occurred successfully, and then deliver the payload (image).

Honestly, I don't remember exactly how the tool does it out of the box. I usually use FTP to manage the files rather than the RAD tool. It was too unreliable.

I did poke around in the RAD source code again just now. My initial impression is that it's designed from the ground up as a desktop app with a user interface. It's built as a tool and an example, not a product intended to be extended. It's not going to be as simple as putting some web service calls around it. It does use the System Configuration API under the hood, so it's certainly a useful example for putting together your own server.

...you could do a checksum compare on the transferred image.

I think the RAD tool unpacks the images on the PC and transfers the files, rather than transferring everything over and unpacking it on the target. You'd have to checksum every file on the target. (See Get System Image.vi and Set System Image.vi in the System Configuration API.)
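Per-file checksumming isn't unreasonable, though; a manifest approach along these lines (stand-in directory and file names, untested on a real target) would cover it:

```shell
# Build a manifest of every file under a deployed tree, then re-verify it.
# 'deploy' stands in for wherever the image contents land on the target.
set -e
mkdir -p deploy/app
echo "rtexe bits" > deploy/app/startup.rtexe        # dummy deployed file
find deploy -type f -exec sha256sum {} + | sort > manifest.sha256
sha256sum -c manifest.sha256                        # flags any altered file
```

The manifest would be generated on the PC from the unpacked image and shipped over with the files, so the target-side check needs nothing beyond busybox/coreutils.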

Were any failed image writes catastrophic or would just "trying again" resolve the issue?

Attempting to push an image again (maybe multiple times) usually resolved the problem.  I seem to recall the images we'd pull from a system sometimes just didn't ever write correctly, so we'd have to redo that on occasion.  I don't remember all the details... I do remember the frustration.

Message 6 of 6