CGM

Posts containing CGM

Me and My Shadow

By John S.

So we had a new guy start in our group last week.

Normally, this would be cause for great joy and celebration (actually, it still was) but there were a few flies in the ointment related to the fact that he started now. Brendan (his boss) was scheduled to leave for Germany early this week to speak at our European Forum, and of course had preparation work to do. We’re going into a heavy crunch time in the team room. (Did I mention that we’re working in a team room?) We had a holiday weekend coming up. And so on, and so on…. In summary, none of us had a lot of bandwidth to devote to training.

Back when I was running an ACIS team, we would have sent Julian (the new guy) to the training class we give our customers to get some general familiarity with the product. Then we would have spent a day or two showing him how to run our source control and build tools and generally get him set up on his machine. Then we probably would have given him an easy bug with lots of education potential and let him debug through it while figuring out the code, interspersed with a lot of answering of questions and discussions of the theory behind how ACIS works.

But things are a bit different now. We’re working in the team room on productizing CGM, so we don’t have a lot of educational bugs to give him. In the team room we’re doing a lot of pair programming and project ownership is more diffuse. And most of all, as mentioned above, we just didn’t have a lot of time to devote to training due to the various crunches.

So we decided to wing it. One of the big advantages of working in the team room (we’re doing what is essentially a flavor of Scrum with two week sprints) is the training potential. Since there’s discussion about all the stories we’re working on, everyone on the team gets at least a little knowledge about what’s going on. And I’m firmly convinced that one of the major plusses of pair programming is educational – knowledge gets spread around much more efficiently than through lecturing (or not talking at all :). So we said, “we’ll just throw Julian into the team room and let him ‘pair’ on whatever will be most educational at any particular time” (the quotes around pair are meant to indicate that we didn’t expect him to be driving much while pairing).

That was the plan: to do a lot of pairing. But it wasn’t quite what happened. What happened was that Julian spent most of his time following either Brendan or me around watching what we were doing. If we were programming on something, great – he paired with us. But if I had to walk down the hall to coordinate with someone about what we were working on, he’d follow me and watch. And about halfway through the week I realized that what was really going on was that we had Julian shadowing us.

The reason I’m familiar with shadowing is that I eat out a LOT. Have you ever sat down at a restaurant and had two servers walk up to your table at which point server #1 says “Hi, I’m Allison and I’ll be your server today. This is Bob – he’s in training”? Bob is a new hire and he’s shadowing Allison. It’s basically an unstructured training regime which emphasizes learning by immersion in the actual work environment. Most restaurants start a new hire out by having him shadow someone else for a week or so, after which they switch roles and the trainer shadows the trainee for another week or so before the trainee finally goes solo.

After the initial light bulb went off in my head, a second one went off right away. The restaurant industry probably trains millions of new hires every year. It is reasonable to assume that they are very good (read: efficient) at training people. So picking up one of their standard training techniques and applying it on our new hire was probably a really good idea, albeit one that we didn’t realize we were having. And it seems to be working out; I feel like Julian’s got a much better feel for what we’re doing and the issues that we’re facing than if we had given him an easy project. And it didn’t suck away huge amounts of our time. And he was even able to help us out on our crunch work this week. I’m planning to recommend that we use it (consciously, this time) as a standard technique when training.

Has anyone else used this approach to training in software development?

 


In my previous blog post I compared the different multi-processing technologies used in ACIS and CGM. The primary difference is that ACIS is based on a shared memory model, utilizing multiple threads in a single process, while CGM is based on a distributed memory model, utilizing multiple processes with inter-process communications. This article will focus on ACIS, leaving CGM for next time.

ACIS is thread-safe. This is a major milestone in our quest to embrace the multi-core trend, which is on the verge of producing a true 16 core processor. Thirty-two and 64 CPU systems will soon be within reasonable reach. This capacity will allow our customers to pursue significant performance enhancements in their end-user applications with multi-processing.

Thread-safety is a prerequisite to concurrency, which provides true parallelism when paired with multi-processor/multi-core architectures. Here threads work concurrently to complete tasks in less time. The performance increase is usually described as a scaling factor with respect to the number of available cores. The goal of course is to achieve ideal scaling, representing an ideal use of available processing power.

Adding concurrency to your application deserves careful consideration, as it requires compute-intensive workflows that meet certain criteria. It is important that the inputs can be broken up into meaningful individual tasks, that the tasks are well balanced with roughly equal complexity, and that they have no side effects when computed out of order. The workflows should also be stable and well understood before they are considered.

Global operations on multiple models (aka assembly modeling) provide good opportunities for parallelism because the models are typically independent of each other. This represents a single instruction – multiple data (SIMD) workflow, where operations have no effect on each other because of data independence. Loading and faceting of large assemblies are good examples, as are computing cross sections and collision detection.
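To make the data-independence point concrete, here is a minimal sketch in standard C++. It does not use the ACIS thread manager; `Model`, `process_model`, and `process_assembly` are hypothetical names invented for illustration, and the "operation" is just a sum over each model's private data.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for one model in an assembly (not an ACIS type).
struct Model {
    std::vector<double> values;
};

// Hypothetical per-model operation. It reads only its own model,
// so tasks cannot interfere with each other and no locking is needed.
double process_model(const Model& m) {
    return std::accumulate(m.values.begin(), m.values.end(), 0.0);
}

// Launch one asynchronous task per model, then gather results in input order.
std::vector<double> process_assembly(const std::vector<Model>& assembly) {
    std::vector<std::future<double>> pending;
    pending.reserve(assembly.size());
    for (const Model& m : assembly) {
        pending.push_back(std::async(std::launch::async,
                                     [&m] { return process_model(m); }));
    }
    std::vector<double> results;
    results.reserve(assembly.size());
    for (std::future<double>& f : pending) {
        results.push_back(f.get());   // blocks until that task has finished
    }
    return results;
}
```

One task per model keeps the sketch short. For an assembly with hundreds of models, a fixed pool of worker threads pulling tasks from a queue would likely be a better fit.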

Watch this video of cross-sectioning an assembly. Here we slice an assembly multiple times, computing the intersections for each independent model concurrently. Computing 100 slices using 7 worker threads takes 8 seconds, and 32.6 seconds using normal serial execution. This represents a scaling factor of roughly 4, whereas ideal scaling would have required the operation to complete in about 4.7 seconds. Nonetheless, such performance improvements have a positive impact on the end-user experience.
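The arithmetic behind those figures can be written out as a small sketch. The function names are my own. With the timings quoted above, 32.6 / 8 gives a scaling factor of about 4.1, 32.6 / 7 gives an ideal time of about 4.66 seconds, and the run reaches about 58% of ideal scaling.

```cpp
#include <cassert>

// Speedup: how many times faster the parallel run is than the serial run.
double scaling_factor(double serial_seconds, double parallel_seconds) {
    assert(parallel_seconds > 0.0);
    return serial_seconds / parallel_seconds;
}

// Run time under ideal (linear) scaling across worker_count workers.
double ideal_seconds(double serial_seconds, int worker_count) {
    assert(worker_count > 0);
    return serial_seconds / worker_count;
}

// Fraction of ideal scaling actually achieved (1.0 would be ideal).
double parallel_efficiency(double serial_seconds, double parallel_seconds,
                           int worker_count) {
    assert(worker_count > 0);
    return scaling_factor(serial_seconds, parallel_seconds) / worker_count;
}
```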

Compute intensive operations on single models may also be candidates for parallelism. Here the parallel operations are performed on interconnected elements of a model. The challenge is decoupling the interdependencies to avoid unwanted interactions, which can lead to incorrect results, non-deterministic behavior, and even severe errors. The decoupling is accomplished by copying the required portions of the model into the context of the thread, thereby making the data independent. Performance gains are possible when the extra overhead of the copy operation is minimal with respect to the task computation time.
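The copy-then-compute idea can be sketched in standard C++ as follows. This is an illustration under stated assumptions, not ACIS code: `Face`, `compute_on_private_copy`, and `compute_all` are hypothetical, and a plain by-value copy stands in for a deep copy into a thread's context.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a portion of a model (not an ACIS type).
struct Face {
    std::vector<double> samples;
};

// Takes the face by value, so the worker operates on its own deep copy.
double compute_on_private_copy(Face face) {
    double sum = 0.0;
    for (double& s : face.samples) {
        s *= 2.0;      // mutating the copy cannot affect the source model
        sum += s;
    }
    return sum;
}

// One worker per face; each worker copies its face and writes only its own slot.
std::vector<double> compute_all(const std::vector<Face>& faces) {
    std::vector<double> results(faces.size());
    std::vector<std::thread> workers;
    workers.reserve(faces.size());
    for (std::size_t i = 0; i < faces.size(); ++i) {
        workers.emplace_back([&faces, &results, i] {
            results[i] = compute_on_private_copy(faces[i]);
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return results;
}
```

The trade-off described above shows up directly: the copy made for each call is pure overhead, so it only pays off when the computation on the copy dominates the cost of making it.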

Faceting a single body face-by-face in ACIS, for example, can be accomplished more quickly using multiple threads. We must first deep copy each face into the context of the worker thread (i.e. its history stream), an efficient operation in ACIS. The copied face is then completely independent and can be freely accessed without affecting other threads. As a side effect, the independent faces are unaware of their surrounding faces since these were not copied. Generating water-tight facets without this adjacency information can be tricky, possibly making this operation a difficult one to parallelize to the extent that the results are fully equivalent to the serial operation.

The following table shows the benefits of face-by-face faceting with multiple threads. The airplane assembly is roughly 40 MB in size, made up of 149 models. The Skywalker assembly is 240 MB in size, with 633 models.

 

In contrast to the faceting example, computing intersections of face pairs does not rely on adjacency information, making it an excellent candidate for single body parallelism. The faces and their supporting geometry are copied, making them independent. The intersections can then be computed in parallel. Before the results can be combined, they must first be merged into the context of the main thread. Merging, which is much faster than copying, is possible because the results are newly created and without interdependencies. (Computing intersections in parallel was the very first operation to use the multi-processing infrastructure in CGM.)

In our multi-threaded entity-point-distance API we compute (among other things) the entity on which the nearest point lies. This entity, a face, edge or vertex, is on the copy, not the source body. We therefore had to find a way to map it to the original, another tricky aspect of combining results. Fortunately the topological order is preserved in the copy process, which allows the use of the indices in the lists retrieved with the topological query functions (e.g. api_get_edges).
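The index-based mapping might look roughly like this. It is a sketch only: `Entity` and `map_to_original` are hypothetical, and plain vectors of pointers stand in for the entity lists that the topological query functions return. It relies on the same assumption stated above, that the copy preserves topological order.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a face, edge, or vertex (not an ACIS type).
struct Entity {
    int tag;
};

// Returns the entity in `originals` at the position that `found` occupies in
// `copies`. Returns nullptr if `found` is not in `copies` or the lists differ
// in length (in which case the order-preservation assumption does not hold).
const Entity* map_to_original(const std::vector<const Entity*>& originals,
                              const std::vector<const Entity*>& copies,
                              const Entity* found) {
    if (originals.size() != copies.size()) {
        return nullptr;
    }
    auto it = std::find(copies.begin(), copies.end(), found);
    if (it == copies.end()) {
        return nullptr;
    }
    const std::size_t index = static_cast<std::size_t>(it - copies.begin());
    return originals[index];
}
```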

My intent was not to make the use of threads seem overly complex and riddled with obstacles, but instead to share some of our experiences, good and bad. The thread-safe ACIS modeler provides the infrastructure and tools needed to add multi-processing to your applications. It’s up to you to take advantage of it. The task may not always be trivial, and you may encounter both good and bad experiences of your own, but the results are usually well worth the efforts.

What are some of your experiences with adding multi-processing to your applications?

Learn More about ACIS in our Webinars

 


Topic: CGM Capabilities

Join Spatial developers for a live chat session in which you can ask technical questions about Convergence Geometric Modeler (CGM), Spatial’s new 3D modeler component. CGM developers Brendan Doerstling, Engineering Manager, and Eric Zenk, Sr. Software Engineer, will be ready to answer your questions.

March 30
8am MDT; 10am EDT, 14:00 UTC
Duration: 45 min.

Sign up for an email reminder below.


I have spent many years of my career here at Spatial developing Thread-Safe ACIS and now I’ve been given the opportunity to additionally work on the multi-processing infrastructure in CGM. The two modelers use very different multiprocessing technologies, and it has been interesting comparing and contrasting them. The main goal is the same in both cases, to provide a means for our customers to leverage multiple processors to improve performance in their end-user applications.

ACIS is thread-safe, meaning that multiple threads can be active in the same process concurrently. Additionally, all threads share the same virtual address space and hence have direct access to all data. This is a shared memory model. CGM uses multiple processes for concurrency with inter-process communication as a means to share data. This is a distributed memory model. Both have strengths and weaknesses.

Multithreading and a shared memory model not only benefit from direct data access but also from thread-management routines that control thread interactions. By that I’m referring to synchronization primitives that have little overhead because all threads share the same process. (Synchronizing multiple threads is much faster than synchronizing multiple processes.) The main drawback to sharing memory is the possibility of data-races (aka race conditions). This is when threads influence each other, usually negatively, by modifying shared data in unexpected ways.
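A minimal sketch of the problem and one common remedy, using standard C++ rather than ACIS's own primitives: several threads increment one shared counter, and a mutex serializes the update. The function name `count_with_mutex` is made up for illustration.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Each of thread_count threads adds 1 to a shared counter
// increments_per_thread times.
long count_with_mutex(int thread_count, int increments_per_thread) {
    long counter = 0;        // shared by all worker threads
    std::mutex guard;
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&counter, &guard, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                // Without this lock, ++counter from several threads is a data
                // race and the final total may come up short.
                std::lock_guard<std::mutex> lock(guard);
                ++counter;
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return counter;
}
```

With the lock in place, `count_with_mutex(4, 10000)` always returns 40000. Locking on every increment is deliberately naive here: it is correct but slow, which is one reason fine-grained sharing is usually avoided in favor of giving each thread its own data.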

The main benefit of multiple processes and the distributed memory model is that the product does not need to be thread-safe. After all, these are separate processes working independently in their own address space. Inter-process communication is the main drawback. The tasks must be distributed to available processes and the results must be gathered together. This often requires sending and receiving large amounts of data, which is pure overhead.

It has taken many man-years of effort to make the ACIS modeler thread-safe. Additionally, we have the on-going burden of keeping it that way. Our entire development team is affected, in that they now follow a strict set of rules to ensure correctness. The distributed processing approach is less invasive, in that it can be managed by a relatively small team. Both techniques are relevant and can provide significant benefits when used correctly.

As component software, both ACIS and CGM must supply good interfaces to their multiprocessing technologies. The ACIS infrastructure includes thread-local storage primitives, mutual exclusion logic, and a thread manager that aids in thread creation and task scheduling. The CGM interface, which is still under construction, contains task management with communication logic that optimizes inter-process communications. The code required to utilize either system is not that dissimilar.
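The ACIS thread-local storage primitives themselves are not shown in this post, so as a rough analogy here is what thread-local storage looks like with the standard C++ `thread_local` keyword. The names `scratch_counter`, `bump_locally`, and `run_workers` are hypothetical.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Each thread gets its own independent instance of this variable.
thread_local int scratch_counter = 0;

// Increments the calling thread's private counter and returns its new value.
int bump_locally(int times) {
    for (int i = 0; i < times; ++i) {
        ++scratch_counter;   // no lock needed: no other thread sees this instance
    }
    return scratch_counter;
}

// Runs bump_locally on several threads and records each thread's final count.
std::vector<int> run_workers(int thread_count, int times) {
    std::vector<int> finals(static_cast<std::size_t>(thread_count), 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&finals, t, times] {
            finals[static_cast<std::size_t>(t)] = bump_locally(times);
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return finals;
}
```

Every worker ends with a count equal to `times` rather than `thread_count * times`, because the threads never touch each other's instance of `scratch_counter`.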

Both systems are most effective when used with high-level compute-intensive operations. This is known as coarse-grain parallelism. In contrast, fine-grain parallelism adds concurrency to highly exercised functions at lower-levels, which typically does not provide as much benefit. As such, both systems are intended to be used in very specific and time consuming workflows such as faceting all bodies in a large assembly. In this instance, the multiprocessing overhead is negligible in comparison to the computation time.

The multiprocessing capabilities of our modelers are intended to provide an effective means of improving performance by utilizing multiple processors. We will continue to invest in internal uses, for example, the multi-threaded entity-point-distance API in ACIS and the multi-process face-face intersections in the CGM Boolean operator. Additionally, we anticipate more and more of our customers will leverage this technology as the rewards are well worth the effort.

I plan on getting a bit more technical in my next post. Race conditions, for example, deserve a more thorough explanation (and are a favorite topic of mine). I would also like to discuss a few multiprocessing tools, such as Intel Parallel Studio, that are very useful.


By Kevin Tatterson

In my previous post I explored the notion that ++d is faster than d++. 

Ponderings

Now for an educated guess on what would happen if we got rid of the __asm nop and allowed the optimizer to inline. At the very least, the instructions in dark red in the previous example (lea, call myint::operator++, nop, and ret) would go away, leaving us with 8 clock cycles for pre-increment and 11 clock cycles for post-increment, which would make pre-increment 27% faster!
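The myint class from the previous post is not reproduced here, so the following is an assumed minimal reconstruction, not the original code. It shows where the extra work in post-increment comes from: the postfix form must copy the old value before incrementing, while the prefix form modifies the object in place.

```cpp
// Assumed minimal myint-style class; the member names are illustrative.
class myint {
public:
    explicit myint(int v = 0) : value_(v) {}

    // Copy constructor: exercised by post-increment, not by pre-increment.
    myint(const myint& other) : value_(other.value_) {}

    // Pre-increment: modify in place and return a reference. No copy is made.
    myint& operator++() {
        ++value_;
        return *this;
    }

    // Post-increment: copy the old value, increment, return the copy by value.
    myint operator++(int) {
        myint old(*this);
        ++value_;
        return old;
    }

    int get() const { return value_; }

private:
    int value_;
};
```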

Back to reality for a moment. In actuality, the myint example gives the best case figures for two reasons: both the myint copy constructor and the pre/post-increment operators are dead simple – one clock cycle each – and inlining works because the implementations are short. So what happens if these implementations get even just a little more complex?

Cost (clocks) of copy ctor & operator++   Pre-incr total clocks   Post-incr total clocks   % faster
2 (best case, our example)                8                       11                       27%
10                                        16                      19                       16%
20                                        26                      29                       10%
40                                        46                      49                       6%
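The rows of the table above appear consistent with a simple model: the pre-increment total is the cost plus 6 clocks, the post-increment total is the cost plus 9 clocks, and "% faster" is the share of the post-increment time that is saved. The fixed 6 and 9 are inferred from the rows rather than stated in the post, so treat this sketch as an assumption.

```cpp
// Fixed clock counts inferred from the table rows (an assumption).
constexpr int kPreFixedClocks = 6;
constexpr int kPostFixedClocks = 9;

// Total clocks for pre-increment given the combined copy ctor/operator++ cost.
int pre_increment_total(int cost) {
    return cost + kPreFixedClocks;
}

// Total clocks for post-increment given the same cost.
int post_increment_total(int cost) {
    return cost + kPostFixedClocks;
}

// Percentage of the post-increment time saved by using pre-increment.
double percent_faster(int cost) {
    const double pre = pre_increment_total(cost);
    const double post = post_increment_total(cost);
    return 100.0 * (post - pre) / post;
}
```

Because the difference stays fixed at 3 clocks while the total grows, the percentage shrinks steadily as the cost rises, which is the trend the table shows.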


Now consider that seemingly innocuous calls can explode the number of clock cycles into the hundreds and thousands. Calls like sprintf, malloc, new, and itoa will blow this example out of the water and reduce the benefit to nil.

Conclusion

I have mixed feelings on whether to recommend pre-increment over post-increment:

  • Your copy ctor and pre/post-incr implementation have to be dead simple to measure a win.
  • It wouldn’t surprise me if compiler optimizers are able to determine when post-increment can be replaced with pre-increment, when your program’s semantics allow.
  • It doesn’t change the semantics of your program much, but other developers might wonder why you favor pre-increment.
  • In the grand scheme of things, few real world algorithms’ performance will be measurably affected by favoring pre-increment.

Here at Spatial, I’d like to think that we take a pragmatic approach to our software’s performance. In CGM, ACIS, 3D InterOp, and IOP-CGM, we rarely concern ourselves with this level of minutiae. I’d like to describe our approach to performance as tactical and pareto-ized. As I said, in rare instances we do concern ourselves with minutiae, but only when our profilers tell us to.

In the end, I’m okay if you use pre-increment – but for myself, I’ll aspire to loftier programmatic governances. What are these governances? That’s another blog – one that is sure to stir things a bit.
