## Double, Double, Trouble

By Eric

While procrastinating (avoiding writing this blog entry for as long as possible), I debugged an interesting problem, which gives me something to talk about here.  What follows might be simple or obvious, but I find that considering tiny details very carefully is a good way to improve the quality of the code I write.  Consider the following two code snippets:

```
sphere* make_sphere( double radius, double x, double y, double z);

void do_something( /* ... */ )
{
    // ...
    sphere* my_sphere = make_sphere(10,0,0,1);
}
```

and

```
class position
{
    // ...
public:
    position( double x, double y, double z);
    //...
};

sphere* make_sphere( position const& p, double radius);

void do_something( /* ... */ )
{
    // ...
    position center( 0,0,1);
    sphere* my_sphere = make_sphere( center, 10);
}
```

The second version requires a class structure defining your objects (which means more code), but in exchange strong type checking can help you: the compiler rejects a call that passes the radius where the position belongs.  One annoyance with the second version is that you may have to write code converting between various geometric types.  Having well-thought-out basic types for mathematics is one area where CGM does particularly well.

The actual bug I looked at was closely related (class names changed to protect the guilty).
```
class nifty_curve_calculator
{
    // ...
public:
    nifty_curve_calculator( double convergence_epsilon, double fitol_desired, ...);
    //..
};
```
In nifty_curve_calculator, exact points on a curve are calculated to within convergence_epsilon.  The nifty_curve_calculator then combines a bunch of exact points on the curve into a B-spline fit for the curve.  The fitol is the requested distance of the B-spline from the exact curve being calculated.  The two tolerances mean completely different things, but both are doubles, so the compiler will happily accept code that switches them.  In the case I looked at today, the two parameters were swapped, which resulted in code that worked most of the time but caused a hang with more poorly behaved geometry.  We should expect convergence_epsilon to be a lot smaller (10^3 times smaller or more) than fitol_desired.

There is a whole constellation of bugs like this that can be avoided by making a careful object model.  A simple way to improve type checkability is to avoid argument lists where convertible types are right next to each other.  Avoiding void* arguments like the plague also fits into this line of design improvement.  An additional help is to require in a constructor only those arguments that are absolutely mandatory, and to use get/set methods to control the other parameters.
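One way to turn the swapped-tolerance bug into a compile error is to give each tolerance its own small type. This is only a minimal sketch: the wrapper types `convergence_epsilon` and `fit_tolerance`, the accessors, and `make_example_calculator` are hypothetical names made up for illustration, and the class is a stripped-down stand-in rather than the real calculator.

```cpp
#include <cassert>

// Hypothetical wrapper types: each holds a double, but they are not
// convertible to each other, nor implicitly constructible from a double.
struct convergence_epsilon {
    explicit convergence_epsilon(double v) : value(v) {}
    double value;
};

struct fit_tolerance {
    explicit fit_tolerance(double v) : value(v) {}
    double value;
};

// Stripped-down stand-in for the calculator discussed above.
class nifty_curve_calculator {
public:
    nifty_curve_calculator(convergence_epsilon eps, fit_tolerance fitol)
        : eps_(eps.value), fitol_(fitol.value) {}

    double epsilon() const { return eps_; }
    double fitol() const { return fitol_; }

private:
    double eps_;
    double fitol_;
};

// Correct call: the role of each number is spelled out at the call site.
inline nifty_curve_calculator make_example_calculator() {
    return nifty_curve_calculator(convergence_epsilon(1e-9), fit_tolerance(1e-6));
}

// Swapping the arguments no longer compiles:
//   nifty_curve_calculator(fit_tolerance(1e-6), convergence_epsilon(1e-9));
```

The cost is a little extra typing at every call site, which is arguably a feature: the reader can see which tolerance is which without looking up the declaration.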

One area where I run into problems with this is writing code (e.g., for MESH_MANAGERS) where large objects are stored using arrays of indices into other arrays.  If everything has type int (or size_t if that is how you roll), then compiler type checking doesn't help much.  Pointers are slightly better for this, but then you get into ownership issues.  I really wish you could do typedefs that aren't convertible to each other but have the same operations as integers.
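Something close to that wish can be approximated with a small class template parameterized on an otherwise unused tag type. The sketch below is an assumption-laden illustration, not code from any mesh manager: `typed_index`, the tag structs, `point`, and `vertex_at` are all hypothetical names, and only a few integer operations are forwarded.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tagged index. The Tag type is never instantiated; it only
// makes typed_index<vertex_tag> and typed_index<face_tag> unrelated types.
template <typename Tag>
class typed_index {
public:
    explicit typed_index(std::size_t v) : value_(v) {}
    std::size_t get() const { return value_; }

    // Forward only the integer operations that are actually needed.
    typed_index& operator++() { ++value_; return *this; }
    friend bool operator==(typed_index a, typed_index b) { return a.value_ == b.value_; }
    friend bool operator<(typed_index a, typed_index b) { return a.value_ < b.value_; }

private:
    std::size_t value_;
};

struct vertex_tag {};
struct face_tag {};
using vertex_index = typed_index<vertex_tag>;
using face_index = typed_index<face_tag>;

struct point { double x, y, z; };

// Only a vertex_index is accepted here; passing a face_index or a raw
// size_t is a compile error.
inline const point& vertex_at(const std::vector<point>& vertices, vertex_index i) {
    return vertices.at(i.get());
}
```

The wrapper compiles down to a plain integer, so the extra checking should cost nothing at run time; the price is having to forward each operation you want by hand.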

Do you have any suggestions or comments for improving type checking in geometric code?

## Subtleties of B-rep Translation (Part 3); Why Healing Matters

By Gregg

I’ve written my last two blogs about different pitfalls and insight needed in order to properly translate CAD data. I’ve discussed how “sharing” of geometry inside the data structure is a hidden but much used form of design intent and discussed how geometry forms are inherently linked to high-level algorithms inside the modeler itself. But I haven’t discussed the healing operations that the Spatial translators perform in order to properly translate the different CAD formats. If you use our translators you know they exist, and people commonly ask about their purpose and efficacy.

To understand InterOp healing we have to start by borrowing a concept from any undergraduate Data Structure and Algorithms class. Generally, one views a software system as two distinct but highly inter-related concepts: a data structure and an acting set of algorithms or operators. In our case the data structure is a classic Boundary Representation structure (B-rep) which geometrically and topologically models wire, sheet and solid data. An operator is an action on that data, for example, an algorithm to determine if a point is inside the solid or not.  But the system’s operators are more than just a set of actions. Implicitly, the operators define a set of rules that the structure must obey. Not all the rules are enforced in the structure itself; actually, many can’t be. But they exist and it’s healing in InterOp that properly conditions the B-rep data to adhere to these rules upon translation.

As always a couple of examples best describe the point. I picked three ACIS rules that are, hopefully, easily understandable.

All 3d edge geometry must be projectable to the surface. Anybody can define a spline based EDGE curve and a surface and write it to SAT. Basically, jot down a bunch of control points, knot vectors, what have you, and put it in a file that obeys SAT format. But in order for it to work properly, geometric rules for edge geometries exist. Specifically, the edge geometry must be projectable to the surface. In short, you can’t have this:

There are many reasons in ACIS for this, but primarily if it’s not projectable then point-perp operations are not well-behaved. If they’re not well behaved finding the correct tolerance (distance between the curve and the surface) is problematic. If one cannot define correct tolerances then water-tightness is not achieved and simple operators, like querying if a point is inside the body, fail.

Edge and Face geometry cannot be self-intersecting. A great many solid modeling algorithms work by firing rays and analyzing intersections with different edge and face geometries.  In order for any conclusion to be drawn, the results of the intersection must be quantifiable. The problem with self-intersecting geometries is just that: how do you quantify the results in Figure 3? The key observation here: imagine you are walking along the curve in Figure 3, starting from the left side. At the start, the material is on the right side, but after the self-intersection the material changes to the left side. You cross the self-intersection again and the material switches to the right again. This causes endless grief in understanding the results of an intersection.

Tolerances of Vertices cannot entirely consume neighboring edges. For a B-rep model to be considered water-tight, tolerances of faces and edges must be understood. Today many kernels have global tolerances plus optional tolerances applied to edge curves and vertices. These tolerances vary depending on neighboring conditions, usually obeying some upper bound. You can think of these tolerances as the “caulking” that keeps the model water-tight. Depending on the quality of the geometry or the tolerances of the originating modeling system you might need more “caulking” or less; respectively, larger tolerances on edges or vertices, or smaller tolerances.  However in order to realize a robust Boolean engine, again, rules apply. Consider this:

Above we have Edge Curve 2 encapsulated completely inside the gray tolerant vertex. Again, I can easily write this configuration to SAT format; however, Booleans cannot process it. It yields horrific ambiguity when building the intersection graphs in the internal stages of Booleans.
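To make the rule concrete, here is a rough sketch of what a check for a "consumed" edge could look like. It is not ACIS or InterOp code: the names `vec3`, `dist3`, and `vertex_consumes_edge` are hypothetical, the edge is approximated by a handful of sample points, and the vertex tolerance is treated as the radius of a sphere around the vertex.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct vec3 { double x, y, z; };

// Euclidean distance between two points.
inline double dist3(const vec3& a, const vec3& b) {
    const double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Returns true when every sampled point of the edge lies inside the
// vertex's tolerance sphere, i.e. the vertex "consumes" the whole edge.
inline bool vertex_consumes_edge(const vec3& vertex, double vertex_tolerance,
                                 const std::vector<vec3>& edge_samples) {
    if (edge_samples.empty()) return false;  // nothing to test
    for (const vec3& p : edge_samples) {
        if (dist3(vertex, p) > vertex_tolerance) return false;
    }
    return true;
}
```

A real healer has to do far more than detect the condition (it must also decide how to repair it), but even this toy version shows that the rule lives in the operators, not in the file format.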

This is a list of just three rules; it’s far from comprehensive. But the main point is this: we know that not everything that ends up in an IGES file comes from a mathematically rigorous surfacing or solid modeling engine. Perhaps people are translating their home-grown data into a system like ACIS so they can perform operations that they could not in their originating system.  But in order to perform these operations, the data must conform to the rules of the system. To simply marshal the data and obey a file format, but disregard the rules, is doing just half the job.

That’s why healing matters.

## Application-Driven Development in Action

By Guest

Today, I’m going to discuss how we’ve been using Application Driven Development (AppDD) principles in our work on CGM.

You might have noticed this really cool picture in Stefanie's last post. It’s really cool because it was the result of our work on the CGM componentization team over the past couple of months – putting dimensional metrology support into our demo application C3D Toolkit.  And the other cool part is that we used AppDD to drive the project.

First, some background.  A few years back we started getting requests for the ability to work with large sets of points (point clouds) from our dimensional metrology customers.  So for the past few years we have been incrementally expanding our support for point clouds in our product line.

A primary workflow in dimensional metrology is to measure a cloud of points on the surface of a manufactured part and compare those measurements to a CAD model representing the desired geometry.

To do this, the following high-level operations are required:

Operation #1

Operation #2

Operation #3

Over the course of several projects, we had incorporated API functions into ACIS to provide these services.  During these projects we worked closely with target customers who were incorporating these functions into their applications, so we were intentionally very focused on the individual functions.  The last project of this sequence was to port this point cloud functionality into CGM – I’ll call this version “Phase 1”.

The opportunity to do AppDD came up a couple of months ago when Ray, our PM, told us that he needed a C3D Toolkit demonstration script for the CONTROL show.  The idea was to show how CGM’s point cloud functionality could support dimensional metrology workflows.

C3D Toolkit is our recently introduced JavaScript-based demo application that we’re using to showcase CGM.  I’ll talk more about C3D Toolkit in another post; the important thing to know is that one of its goals is to give us a platform to perform AppDD.  The extensions that we write in C3D Toolkit are very high-level; the intent is that the functionality that might appear behind a GUI button in a customer application should correspond to a single C3D Toolkit extension.

Our first step in the AppDD process was to get a 'storyboard' from Ray of the entire workflow(s) that he wanted to show.  This turned out to be a 1 page e-mail describing the steps of the demo.

Our next step was to generate proposed demo scripts.  One of our team went off and wrote a straw man of what he thought the actual JavaScript code for the demo should be.  Then, in the spirit of teamwork, we all gathered around a pairing station and criticized his straw man :)  In particular, we had to decide exactly what the return value was for each call to a JS extension, and what the screen would be showing at each stage.  And this is where we saw the true value of AppDD take hold – when we started thinking about highlighting a set of garbage points that we’d selected for removal.

When we originally estimated the project, we thought it would probably take a single two-week iteration, because the hard algorithmic work had already been done in Phase 1.  What we hadn’t realized is that we would need a new “part management” layer in the application to manage e.g. the bookkeeping associated with assigning different colors to different parts of the cloud in a point cloud document.  Our focus on storyboarding the entire demo first caused us to notice this mistake up front, while we had time to adjust our estimates.  It also allowed us to rough out the design up front, rather than generating a lot of code and architecture churn as we incrementally discovered unanticipated workflow requirements.

Only after we understood the demo scripts (which also functioned as acceptance tests) did we actually start coding the functionality behind them.  A week before our deadline we were able to demo the bulk of the scripts to Ray, who gave us some course corrections that we incorporated.  In the meantime, Ray had also been busy.  Since we were trying to mimic a real customer process as closely as possible, we wanted to work with real scan data of a real part manufactured from a real CAD model.  So Ray decided on a CAD model to use in the demo, and sent it out (after introducing a distortion for us to detect) to a service provider who manufactured it with a 3D printer and then scanned the resulting part.  Stef’s really cool picture is the result of running our demo on this data, with color coding based on whether the measured points are within tolerance of the model.

We’re told that the demo went off very well at the show; people especially liked the ability to hold the actual part in their hands and compare it to the scan results on the screen.  An important takeaway from the project is that the core functionality that we had already written (registration and point-body distance) worked “out of the box” – it was the support infrastructure for managing point clouds in an application document that we spent most of our time on.  This validated for us the idea that AppDD helps to ensure a complete interface for customers.  Now that we’ve identified (and have working code for) a new layer of functionality that many of our customers will need, we have an opportunity to move it into our point cloud product.

The full demo on this part can be seen in Gregg’s webinar on C3D Toolkit and CGM – it starts at the 30 minute mark and runs about 10 minutes (but you should feel free to watch the whole thing :).  One thing to look for: the scan of our part actually did have some garbage points that he removes as a first step.  If we hadn’t already thought of the cleanup requirement when designing the demo, we would have discovered it here because we used a real scan (rather than manufacturing test data).


## Getting the Point about Points

By Stefanie

Two weeks ago, Spatial hosted a booth at the CONTROL Exhibition in Stuttgart, Germany.  I hate to follow John's recent post with another one about a trade show, but this one is worth discussing - let's just call it "Interesting Shows - part 2."

For anybody not familiar with it, CONTROL is a huge show aimed at the dimensional metrology market.  Whenever I go to trade shows, I am amazed at the scale of the market (4 huge buildings for this one) and the specificity of the vendors.

The range of devices was quite interesting.  There were many varieties of bridge CMMs, but there was also a wide range of hand held measurement machines.  One was a small metal ball with mirrors inside.  You put the ball on the part you wish to measure, and a nearby camera shoots a laser at the ball, which reflects it back.  A similar idea was a wand that looked like the ones used for frisking at airport security.  You poke the point to measure, and again a camera measures specific points on the wand which allow it to infer the location of the point you poked.  After wandering the halls for a few days, a simple understanding of all of it gelled in my mind.

All that these devices do is measure points in space

Of course they do that with tremendous variety, which is how they differentiate themselves from each other. Differentiation can be on the accuracy of measurement, point gathering speed, physical access (e.g. you can't put the wing of an airplane in a bridge machine, so you use a hand held device), and much more.  But the one thing they have in common is that they're still all trying to do one basic thing - give you three very, very accurate coordinates, many, many times over.

As a small indicator of just how hard this actually is, I saw a few vendors selling only the granite slabs that go into the CMMs.  Imagine - there are entire companies whose only business is to make sure that they give you something very flat on which to put your measurement machine.  Now that's accurate.

I realize that to anybody working in this market, this is a simple and obvious concept, but sometimes working on software components, you get so focused on what a specific customer's application is doing that you only see the trees and not the forest -- or maybe the points and not the cloud :-).

Which brings me to the software side of things. The hardware is a major investment and differentiator in the CMM market, but good software is essential to run it.  A good CMM program will help the programmer and/or machine operator easily determine which points to measure, tell the machine how to measure them in the most efficient way, and analyze the gathered points and report the results back to the user.

Obviously, Spatial is very involved in this part of the measurement market, particularly as more and more systems are moving to measuring and comparing to 3D parts rather than 2D drawings.  One thing in particular struck me throughout the show - almost every discussion I had turned to the subject of PMI (or GD&T) at some point.  There was a time not so long ago when using PMI in CMM applications was a new idea.  When we first added PMI to our 3D InterOp product line, we had many customers excited about it, but mostly in principle. Very few were actually doing anything with it.  Today the discussion is totally different.  We're seeing applications do everything from drive automatic test plan creation to automatic post-process comparison between the gathered points and the tolerances originally specified by the designer.

Getting out to see the physical products in person is a tremendous help to anybody working in software.  For me, I finally internalized both the simplicity and the complexity of dimensional metrology and how we fit into it.

Anybody out there have suggestions for another good educational experience in your market?


## What's Up With the GPU?

By jeff

In my previous post I discussed the various multiprocessing technologies at my disposal. These are: OpenMP, MPI, PPL (Microsoft Concurrency Runtime), the ACIS thread manager, and the CGM multiprocessing infrastructure. As it turns out, I overlooked a completely relevant technology, the GPU. I would like to correct this oversight and pass along my experiences in adapting the primality algorithm used in the previous discussion to run on a GPU.  (Look here for a brief introduction to GPU computing)

I’m working remotely at the moment and decided to avoid the potential headaches associated with accessing a video card through a remote desktop connection by simply using my laptop computer for this exercise. It does after all have a decent video card, an NVIDIA Quadro 1000M with 96 cores and 2GB of RAM. In analyzing compatible development tools for my video card, I chose CUDA, which is a parallel computing platform and programming model invented by NVIDIA. (Look here for a brief introduction to CUDA)

I downloaded and installed the latest CUDA Toolkit (version 4.1), associated video driver, and example codes from NVIDIA. Then I went through the documentation and a few sample programs to get a feel for the task at hand. I quickly identified several aspects of GPU programming that were different from what I was used  to, namely how jobs are presented to the GPU, how they are broken down into tasks to fit the available hardware, and how the tasks are computed.

Jobs are computed on the GPU using what is called a kernel launch. A job is typically made up of many tasks. In our example, the job is to “find all prime numbers between one and one hundred million”. This can be broken down into one hundred million tasks in which each calculates the primality of one single number. The kernel to launch is essentially a function with arguments that gets called for each task. It determines its index through variables made available by the CUDA architecture (discussed later), then performs the operation on the input data taken from the index location in the input array, and stores the results again using the index into the output array.

This is a classic Single-Instruction-Multiple-Data (SIMD) processing architecture, where each processor is mapped to a unique index and executes the exact same set of instructions on the corresponding input data - on all processors - at the same time. In our example, the input data is an array containing all the numbers between one and one hundred million. We hand this array to the GPU as an argument to the kernel function and it calculates the primality of each element in the array, overwriting the corresponding array data with the results of the computations, in this case either true or false.

Calculating the unique index in each task is not as straightforward as you might think. In CUDA, a job is broken down into a multidimensional grid of blocks, where the number of tasks is equal to the number of blocks in the grid times the number of threads in a block. These values, the number of blocks and the number of threads per block, are specified when the kernel function is launched. Corresponding variables are available in the kernel function to compute the current index based on these values.

Calculating an index might look something like this:

```
int index = blockIdx.x * blockDim.x + threadIdx.x;
```

Where blockIdx is the current block index within the grid, blockDim is the number of threads in the block, and threadIdx is the current thread index within the block. To make things a bit more flexible (complex), the grid size and block size can be specified in multiple dimensions. This is necessary to overcome limitations that would otherwise severely restrict the number of tasks that can be performed in any single kernel launch.

For practical purposes there is an upper limit to the number of threads available at any given time. In CUDA this translates into the maximum number of threads that can be used in any block, which on my system is 1024. Because of this, we would need roughly one-hundred-thousand blocks to accommodate the target number of inputs in our example - in one launch. Unfortunately, there is also an upper limit to the number of blocks, which on my system is 65535. To overcome this limitation, CUDA provides a multi-dimensional grid of blocks. My system supports a grid of 65535 x 65535 x 65535 blocks, each with 1024 threads. That’s a large number of indices available for any one operation.
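The arithmetic behind those limits is easy to check on the host. The following is a small plain C++ sketch (no CUDA required) that uses only the two limits quoted above; the constant and function names are my own, and the 2D layout is just one possible way to spread the blocks across a grid.

```cpp
#include <cassert>
#include <cstdint>

// Limits quoted above for the author's device.
constexpr std::uint64_t kMaxThreadsPerBlock = 1024;
constexpr std::uint64_t kMaxBlocksPerGridDim = 65535;

// Number of blocks needed to give every task its own thread, rounding up.
constexpr std::uint64_t blocks_needed(std::uint64_t tasks,
                                      std::uint64_t threads_per_block) {
    return (tasks + threads_per_block - 1) / threads_per_block;
}

// Number of rows needed in a 2D grid whose rows hold at most
// kMaxBlocksPerGridDim blocks each.
constexpr std::uint64_t grid_rows_needed(std::uint64_t blocks) {
    return (blocks + kMaxBlocksPerGridDim - 1) / kMaxBlocksPerGridDim;
}
```

For one hundred million tasks, `blocks_needed` gives 97,657 blocks, which is more than the 65,535 a one-dimensional grid allows (a 1D grid tops out at 65,535 x 1,024 = 67,107,840 threads), while a second grid dimension of just 2 rows is already enough.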

Next come the input and output arrays, which in our example are one and the same. Allocating the array of input values is simple in native code (referred to as the host in CUDA speak), and as it turns out is simple with CUDA as well. It’s accomplished with cudaMalloc and cudaFree, which allocate and free memory on the device, respectively. The typical approach is to allocate identical arrays on both the host and device and use cudaMemcpy to transfer the contents back and forth.

Here is where my respect for resources causes me to deviate from my original plan. Since even numbers are never prime (except 2), we can cut the array size in half and only consider odd numbers. This changes things quite a bit, however, from the size of the input/output array to the adjustment of the index value to the number of operations that are actually performed. Nonetheless, I am willing to extend this courtesy to the GPU for optimization’s sake, even though it’s cheating a bit.

So now I have a program that allocates and initializes an array on both the host and device, copies it, then launches a kernel with appropriate grid and block dimensions, copies the array data back to the host, validates the results, and finally frees memory. The kernel function calculates the corresponding index, loads the input value, tests for primality, and writes the result back to the array.

When testing the code I naturally began with smaller ranges of inputs, to make sure everything was working as expected. I got a big surprise when I finally ran with the target range of one hundred million. The program failed! As it turns out, the Windows operating system terminates operations on the video device that take longer than a few seconds. This is known behavior called Timeout Detection and Recovery (TDR). I could have fiddled with registry settings to disable TDR but instead decided to simply restructure the program to have multiple kernel launches.

From experimentation I found it safe to process one million inputs at a time. So all I had to do was to launch the kernel from within a loop, passing the current iteration to the kernel function, and adjusting the index accordingly. That’s it. I now have a complete and working program.
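The index bookkeeping described above (odd numbers only, processed in fixed-size batches) can be sketched in plain C++ on the host. To be clear, this is not the program from this post: it is a serial stand-in in which the inner loop plays the role of one kernel launch, and the names `is_prime`, `value_for_slot`, `task`, and `odd_primes_up_to` are hypothetical. The trial-division test is also just an assumption about what a simple primality check might look like.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Trial-division primality test; stands in for the per-task work.
inline bool is_prime(std::uint64_t n) {
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;
    for (std::uint64_t d = 3; d * d <= n; d += 2) {
        if (n % d == 0) return false;
    }
    return true;
}

// Slot i of the array represents the odd number 2*i + 1 (1, 3, 5, ...).
inline std::uint64_t value_for_slot(std::uint64_t slot) { return 2 * slot + 1; }

// Body of one task: what a single GPU thread would do for one slot.
// The batch number shifts the local index to the right place in the array.
inline void task(std::vector<char>& results, std::uint64_t batch,
                 std::uint64_t batch_size, std::uint64_t local_index) {
    const std::uint64_t slot = batch * batch_size + local_index;
    if (slot >= results.size()) return;  // guard the ragged last batch
    results[slot] = is_prime(value_for_slot(slot)) ? 1 : 0;
}

// Host-side stand-in for launching the kernel from within a loop.
// batch_size must be greater than zero.
inline std::vector<char> odd_primes_up_to(std::uint64_t limit,
                                          std::uint64_t batch_size) {
    std::vector<char> results((limit + 1) / 2, 0);  // one slot per odd number <= limit
    const std::uint64_t batches = (results.size() + batch_size - 1) / batch_size;
    for (std::uint64_t batch = 0; batch < batches; ++batch) {
        for (std::uint64_t i = 0; i < batch_size; ++i) {  // parallel on a GPU
            task(results, batch, batch_size, i);
        }
    }
    return results;
}
```

Note that with this layout the number 2 never appears in the array, so it has to be counted separately when validating the results.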


The performance was the next big surprise. The serial run on my laptop took 109 seconds, using OpenMP and four processors it was reduced to 27 seconds, and in comparison the GPU run took 53 seconds. This was unexpected at first, given that we have 96 processors available, until I realized that the tasks are very unbalanced. Determining primality is quick in some cases and may take a long time in others, especially when the number is large and prime.

The drawback with tasks of varying complexity for the GPU is that each operation, with whatever chunk of inputs is ultimately scheduled, will take as long as the most complex task. My video card has 96 processors, which makes me suspect that the inputs are processed in chunks of 96. (CUDA actually schedules threads in groups of 32, called warps, but the effect is the same: a group is only finished when its slowest thread is.) If calculating primality is mostly simple, then most processors will sit idle while a few are working on complex cases. In contrast, independent processors are seldom idle because they can simply move on to the next task.
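A toy cost model makes the difference easier to see. The sketch below is purely illustrative and says nothing about how the hardware really schedules work: it assumes each task has a known cost in abstract time units, that a lockstep group takes as long as its most expensive task, and that independent workers greedily pick up the next task as soon as they are free. Both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Lockstep execution: tasks are taken in groups of `width` (> 0), and each
// group takes as long as its most expensive task.
inline std::uint64_t lockstep_time(const std::vector<std::uint64_t>& costs,
                                   std::size_t width) {
    std::uint64_t total = 0;
    for (std::size_t start = 0; start < costs.size(); start += width) {
        std::uint64_t group_max = 0;
        for (std::size_t i = start; i < costs.size() && i < start + width; ++i) {
            group_max = std::max(group_max, costs[i]);
        }
        total += group_max;
    }
    return total;
}

// Independent workers (> 0): each task, in order, goes to whichever worker
// becomes free first; the job ends when the busiest worker finishes.
inline std::uint64_t independent_time(const std::vector<std::uint64_t>& costs,
                                      std::size_t workers) {
    std::vector<std::uint64_t> busy_until(workers, 0);
    for (std::uint64_t c : costs) {
        *std::min_element(busy_until.begin(), busy_until.end()) += c;
    }
    return *std::max_element(busy_until.begin(), busy_until.end());
}
```

With costs `{100, 1, 1, 1, 100, 1, 1, 1}` and four lanes or workers, the lockstep model needs 200 time units because each group of four waits for its one expensive task, while the independent model needs 101 because the cheap tasks are absorbed by workers that would otherwise be idle.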

As an experiment, I wanted to test the capabilities of my GPU with a more balanced task complexity. So I picked a very large prime number to use for every task, and modified the job to compute the same task one million times. The serial operation took 22.2 seconds, the OpenMP version took 5.6 seconds, and the GPU version took 1.5 seconds. Now that is the result I was hoping to see.

The GPU certainly has a place in the multiprocessing arena, but I think it can be very challenging to find applicable operations. For many years now we have been analyzing the performance of ACIS, and to date have never found a situation that would be best serviced by the GPU. I do think however, that many applications exist that can benefit greatly by utilizing the GPU. I’m sure it’s the right tool for some problems.

I would like to hear if anyone has found a good use for the GPU . . .?
