Today, I’m going to discuss how we’ve been using Application Driven Development (AppDD) principles in our work on CGM.
You might have noticed this really cool picture in Stefanie's last post. It’s really cool because it was the result of our work on the CGM componentization team over the past couple of months – putting dimensional metrology support into our demo application C3D Toolkit. And the other cool part is that we used AppDD to drive the project.
First, some background. A few years back we started getting requests for the ability to work with large sets of points (point clouds) from our dimensional metrology customers. So for the past few years we have been incrementally expanding our support for point clouds in our product line.
A primary workflow in dimensional metrology is to measure a cloud of points on the surface of a manufactured part and compare those measurements to a CAD model representing the desired geometry.
To do this, the following high-level operations are required:
Over the course of several projects, we had incorporated API functions into ACIS to provide these services. During these projects we worked closely with target customers who were incorporating these functions into their applications, so we were intentionally very focused on the individual functions. The last project of this sequence was to port this point cloud functionality into CGM – I’ll call this version “Phase 1”.
The opportunity to do AppDD came up a couple of months ago when Ray, our PM, told us that he needed a C3D Toolkit demonstration script for the CONTROL show. The idea was to show how CGM’s point cloud functionality could support dimensional metrology workflows.
Our first step in the AppDD process was to get a 'storyboard' from Ray of the entire workflow (or workflows) that he wanted to show. This turned out to be a one-page e-mail describing the steps of the demo.
When we originally estimated the project, we thought it would probably take a single two-week iteration, because the hard algorithmic work had already been done in Phase 1. What we hadn’t realized is that we would need a new “part management” layer in the application to manage e.g. the bookkeeping associated with assigning different colors to different parts of the cloud in a point cloud document. Our focus on storyboarding the entire demo first caused us to notice this mistake up front, while we had time to adjust our estimates. It also allowed us to rough out the design up front, rather than generating a lot of code and architecture churn as we incrementally discovered unanticipated workflow requirements.
Only after we understood the demo scripts (which also functioned as acceptance tests) did we actually start coding the functionality behind them. A week before our deadline we were able to demo the bulk of the scripts to Ray, who gave us some course corrections that we incorporated. In the meantime, Ray had also been busy. Since we were trying to mimic a real customer process as closely as possible, we wanted to work with real scan data of a real part manufactured from a real CAD model. So Ray decided on a CAD model to use in the demo, and sent it out (after introducing a distortion for us to detect) to a service provider who manufactured it with a 3D printer and then scanned the resulting part. Stef’s really cool picture is the result of running our demo on this data, with color coding based on whether the measured points are within tolerance of the model.
We’re told that the demo went off very well at the show; people especially liked the ability to hold the actual part in their hands and compare it to the scan results on the screen. An important takeaway from the project is that the core functionality that we had already written (registration and point-body distance) worked “out of the box” – it was the support infrastructure for managing point clouds in an application document that we spent most of our time on. This validated for us the idea that AppDD helps to ensure a complete interface for customers. Now that we’ve identified (and have working code for) a new layer of functionality that many of our customers will need, we have an opportunity to move it into our point cloud product.
The full demo on this part can be seen in Gregg’s webinar on C3D Toolkit and CGM – it starts at the 30 minute mark and runs about 10 minutes (but you should feel free to watch the whole thing :). One thing to look for: the scan of our part actually did have some garbage points that he removes as a first step. If we hadn’t already thought of the cleanup requirement when designing the demo, we would have discovered it here because we used a real scan (rather than manufacturing test data).
Two weeks ago, Spatial hosted a booth at the CONTROL Exhibition in Stuttgart, Germany. I hate to follow John's recent post with another one about a trade show, but this one is worth discussing - let's just call it "Interesting Shows - part 2."
For anybody not familiar with it, CONTROL is a huge show aimed at the dimensional metrology market. Whenever I go to trade shows, I am amazed at the scale of the market (4 huge buildings for this one) and the specificity of the vendors.
The range of devices was quite interesting. There were many varieties of bridge CMMs, but there was also a wide range of hand held measurement machines. One was a small metal ball with mirrors inside. You put the ball on the part you wish to measure, and a nearby camera shoots a laser at the ball, which reflects it back. A similar idea was a wand that looked like the ones used for frisking at airport security. You poke the point to measure, and again a camera measures specific points on the wand which allow it to infer the location of the point you poked. After wandering the halls for a few days, a simple understanding of all of it gelled in my mind.
All that these devices do is measure points in space
Of course they do that with tremendous variety, which is how they differentiate themselves from each other. Differentiation can be on the accuracy of measurement, point gathering speed, physical access (e.g. you can't put the wing of an airplane in a bridge machine, so you use a hand held device), and much more. But the one thing they have in common is that they're still all trying to do one basic thing - give you three very, very accurate coordinates, many, many times over.
As a small indicator of just how hard this actually is, I saw a few vendors selling only the granite slabs that go into the CMMs. Imagine - there are entire companies whose only business is to make sure that they give you something very flat on which to put your measurement machine. Now that's accurate.
I realize that to anybody working in this market, this is a simple and obvious concept, but sometimes working on software components, you get so focused on what a specific customer's application is doing that you only see the trees and not the forest -- or maybe the points and not the cloud :-).
Which brings me to the software side of things. The hardware is a major investment and differentiator in the CMM market, but good software is essential to run it. A good CMM program helps the programmer and/or machine operator easily determine which points to measure, tells the machine how to measure them in the most efficient way, and analyzes the gathered points and reports the results back to the user.
Obviously, Spatial is very involved in this part of the measurement market, particularly as more and more systems are moving to measuring and comparing to 3D parts rather than 2D drawings. One thing in particular struck me throughout the show - almost every discussion I had turned to the subject of PMI (or GD&T) at some point. There was a time not so long ago when using PMI in CMM applications was a new idea. When we first added PMI to our 3D InterOp product line, we had many customers excited about it, but mostly in principle. Very few were actually doing anything with it. Today the discussion is totally different. We're seeing applications do everything from drive automatic test plan creation to automatic post-process comparison between the gathered points and the tolerances originally specified by the designer.
Getting out to see the physical products in person is a tremendous help to anybody working in software. For me, I finally internalized both the simplicity and the complexity of dimensional metrology and how we fit into it.
Anybody out there have suggestions for another good educational experience in your market?
In my previous post I discussed the various multiprocessing technologies at my disposal. These are: OpenMP, MPI, PPL (Microsoft Concurrency Runtime), the ACIS thread manager, and the CGM multiprocessing infrastructure. As it turns out, I overlooked a completely relevant technology, the GPU. I would like to correct this oversight and pass along my experiences in adapting the primality algorithm used in the previous discussion to run on a GPU. (Look here for a brief introduction to GPU computing)
I’m working remotely at the moment and decided to avoid the potential headaches associated with accessing a video card through a remote desktop connection by simply using my laptop computer for this exercise. It does after all have a decent video card, an NVIDIA Quadro 1000M with 96 cores and 2GB of RAM. In analyzing compatible development tools for my video card, I chose CUDA, which is a parallel computing platform and programming model invented by NVIDIA. (Look here for a brief introduction to CUDA)
I downloaded and installed the latest CUDA Toolkit (version 4.1), associated video driver, and example codes from NVIDIA. Then I went through the documentation and a few sample programs to get a feel for the task at hand. I quickly identified several aspects of GPU programming that were different from what I was used to, namely how jobs are presented to the GPU, how they are broken down into tasks to fit the available hardware, and how the tasks are computed.
Jobs are computed on the GPU using what is called a kernel launch. A job is typically made up of many tasks. In our example, the job is to “find all prime numbers between one and one hundred million”. This can be broken down into one hundred million tasks in which each calculates the primality of one single number. The kernel to launch is essentially a function with arguments that gets called for each task. It determines its index through variables made available by the CUDA architecture (discussed later), then performs the operation on the input data taken from the index location in the input array, and stores the results again using the index into the output array.
This is a classic Single-Instruction-Multiple-Data (SIMD) processing architecture, where each processor is mapped to a unique index and executes the exact same set of instructions on the corresponding input data - on all processors - at the same time. In our example, the input data is an array containing all the numbers between one and one hundred million. We hand this array to the GPU as an argument to the kernel function and it calculates the primality of each element in the array, overwriting the corresponding array data with the results of the computations, in this case either true or false.
Calculating the unique index in each task is not as straightforward as you might think. In CUDA, a job is broken down into a multidimensional grid of blocks, where the number of tasks is equal to the number of blocks in the grid times the number of threads in a block. These values, the number of blocks and the number of threads per block, are specified when the kernel function is launched. Corresponding variables are available in the kernel function to compute the current index based on these values.
Calculating an index might look something like this:
index = blockIdx.x * blockDim.x + threadIdx.x;
Where blockIdx is the current block index within the grid, blockDim is the number of threads in the block, and threadIdx is the current thread index within the block. To make things a bit more flexible (complex), the grid size and block size can be specified in multiple dimensions. This is necessary to overcome limitations that would otherwise severely restrict the number of tasks that can be performed in any single kernel launch.
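To make the index formula concrete, here is a small host-side Python sketch (not CUDA) that enumerates the indices a one-dimensional launch would hand out. The function names `global_index` and `launch_1d` are made up for illustration; on a real GPU the blocks and threads run concurrently rather than in a loop.

```python
def global_index(block_idx, block_dim, thread_idx):
    """Flat task index for a one-dimensional launch."""
    return block_idx * block_dim + thread_idx


def launch_1d(num_blocks, block_dim):
    """List the indices produced by every (block, thread) pair, in block order."""
    return [
        global_index(block, block_dim, thread)
        for block in range(num_blocks)
        for thread in range(block_dim)
    ]


# Three blocks of four threads cover indices 0..11 exactly once.
indices = launch_1d(num_blocks=3, block_dim=4)
print(indices)
```

Each (block, thread) pair maps to a distinct index, which is what lets every task read and write its own array slot without coordination.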
For practical purposes there is an upper limit to the number of threads available at any given time. In CUDA this translates into the maximum number of threads that can be used in any block, which on my system is 1024. Because of this, we would need roughly one-hundred-thousand blocks to accommodate the target number of inputs in our example - in one launch. Unfortunately, there is also an upper limit to the number of blocks, which on my system is 65535. To overcome this limitation, CUDA provides a multi-dimensional grid of blocks. My system supports a grid of 65535 x 65535 x 65535 blocks, each with 1024 threads. That’s a large number of indices available for any one operation.
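The arithmetic behind these limits can be worked through in a few lines. The sketch below is plain Python using the two limits quoted above; `grid_2d` and `flat_block_index` are hypothetical helpers, and the row-major flattening shown is one common convention rather than the only possible one.

```python
MAX_THREADS_PER_BLOCK = 1024     # per-block limit quoted in the text
MAX_BLOCKS_PER_GRID_DIM = 65535  # per-dimension grid limit quoted in the text


def blocks_needed(num_tasks, threads_per_block=MAX_THREADS_PER_BLOCK):
    """Ceiling division: blocks required so every task gets a thread."""
    return (num_tasks + threads_per_block - 1) // threads_per_block


def grid_2d(num_blocks, max_dim=MAX_BLOCKS_PER_GRID_DIM):
    """A (grid_x, grid_y) shape within the per-dimension limit holding num_blocks."""
    grid_y = (num_blocks + max_dim - 1) // max_dim
    grid_x = (num_blocks + grid_y - 1) // grid_y
    return grid_x, grid_y


def flat_block_index(block_x, block_y, grid_x):
    """Row-major flattening of a 2D block coordinate."""
    return block_y * grid_x + block_x


blocks = blocks_needed(100_000_000)
print(blocks)           # more than 65535, so one grid dimension is not enough
print(grid_2d(blocks))  # a two-dimensional grid that fits within the limits
```

Because the rounded-up grid can contain slightly more threads than tasks, a kernel using such a grid would need to check its computed index against the array length before touching memory.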
Next come the input and output arrays, which in our example are one and the same. Allocating the array of input values is simple in native code (referred to as the host in CUDA speak), and as it turns out is simple with CUDA as well. It’s accomplished with cudaMalloc and cudaFree, which allocate and free memory on the device, respectively. The typical approach is to allocate identical arrays on both the host and device and use cudaMemcpy to transfer the contents back and forth.
Here is where my respect for resources causes me to deviate from my original plan. Since even numbers are never prime (except 2), we can cut the array size in half and only consider odd numbers. This changes things quite a bit, however, from the size of the input/output array, to the adjustment of the index value, to the number of operations that are actually performed. Nonetheless, I am willing to extend this courtesy to the GPU for optimization’s sake, even though it’s cheating a bit.
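One way the odd-only layout could work is to let array slot i stand for the number 2i + 1. This mapping is an assumption for illustration (the post does not spell out the exact adjustment used), and the helper names are hypothetical.

```python
def slot_to_number(slot):
    """Array slot i holds the odd number 2*i + 1 (assumed mapping)."""
    return 2 * slot + 1


def number_to_slot(n):
    """Inverse mapping; only defined for odd n."""
    if n % 2 == 0:
        raise ValueError("even numbers have no slot in the odd-only array")
    return n // 2


def odd_slots(limit):
    """How many odd numbers lie in 1..limit, i.e. the halved array size."""
    return (limit + 1) // 2


print(odd_slots(100_000_000))  # half of the original one hundred million entries
print(slot_to_number(4))       # the fifth slot stands for 9
```

With this layout the prime 2 has no slot and would have to be accounted for separately.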
So now I have a program that allocates and initializes an array on both the host and device, copies it, then launches a kernel with appropriate grid and block dimensions, copies the array data back to the host, validates the results, and finally frees memory. The kernel function calculates the corresponding index, loads the input value, tests for primality, and writes the result back to the array.
When testing the code I naturally began with smaller ranges of inputs, to make sure everything was working as expected. I got a big surprise when I finally ran with the target range of one hundred million. The program failed! As it turns out, the Windows operating system terminates operations on the video device that take longer than a few seconds. This is known behavior called Timeout Detection and Recovery (TDR). I could have fiddled with registry settings to disable TDR but instead decided to simply restructure the program to have multiple kernel launches.
From experimentation I found it safe to process one million inputs at a time. So all I had to do was to launch the kernel from within a loop, passing the current iteration to the kernel function, and adjusting the index accordingly. That’s it. I now have a complete and working program.
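The structure of the chunked loop can be sketched on the host in Python. This is not the original program: the trial-division `is_prime` is a stand-in for the primality test from the previous post, the nested loop stands in for a parallel kernel launch, and all names are hypothetical.

```python
def is_prime(n):
    """Trial division; a stand-in for the post's primality test."""
    if n < 2:
        return False
    if n < 4:
        return True
    if n % 2 == 0:
        return False
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def kernel(data, local_index, launch, chunk_size):
    """Work done by one task: shift the index by the launch number, then test."""
    index = launch * chunk_size + local_index
    if index < len(data):  # the last launch may run past the end of the array
        data[index] = is_prime(data[index])


def run_in_chunks(data, chunk_size):
    """Issue one simulated kernel launch per chunk; return the launch count."""
    launches = (len(data) + chunk_size - 1) // chunk_size
    for launch in range(launches):
        for local_index in range(chunk_size):  # parallel on a real GPU
            kernel(data, local_index, launch, chunk_size)
    return launches


odds = [2 * i + 1 for i in range(50)]  # the odd numbers 1, 3, ..., 99
launch_count = run_in_chunks(odds, chunk_size=16)
print(launch_count, sum(odds))  # launches used, and how many odd primes were found
```

Keeping each launch small is what keeps any single call under the watchdog timeout; the results are the same as one large launch because every index is still visited exactly once.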
The performance was the next big surprise. The serial run on my laptop took 109 seconds, using OpenMP and four processors it was reduced to 27 seconds, and in comparison the GPU run took 53 seconds. This was unexpected at first, given that we have 96 processors available, until I realized that the tasks are very unbalanced. Determining primality is quick in some cases and may take a long time in others, especially when the number is large and prime.
The drawback with tasks of varying complexity for the GPU is that each operation, with whatever chunk of inputs is ultimately scheduled, will take as long as the most complex task. My video card has 96 processors, which makes me suspect that the inputs are processed in chunks of 96. If calculating primality is mostly simple, then most processors will sit idle while a few are working on complex cases. In contrast, independent processors are seldom idle because they can simply move on to the next task.
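A toy cost model illustrates the effect. In the Python sketch below, tasks are grouped into lockstep chunks of a given width, and each chunk occupies all of its lanes for as long as its slowest task runs. The width of 96 mirrors the suspicion above and is only a parameter of the model (NVIDIA hardware actually schedules threads in groups of 32, called warps); the step counts come from an assumed trial-division test, not from measurements.

```python
def trial_divisions(n):
    """Number of trial divisions an odd-only test performs on odd n."""
    steps = 0
    d = 3
    while d * d <= n:
        steps += 1
        if n % d == 0:
            break
        d += 2
    return steps


def lockstep_utilization(work, width):
    """Useful work divided by lane-time reserved when chunks run in lockstep."""
    if not work:
        raise ValueError("work must not be empty")
    useful = sum(work)
    reserved = sum(
        width * max(work[i:i + width]) for i in range(0, len(work), width)
    )
    return useful / reserved


balanced = [5, 5, 5, 5]
unbalanced = [1, 1, 1, 9]
print(lockstep_utilization(balanced, 4))    # every lane is busy the whole time
print(lockstep_utilization(unbalanced, 4))  # three lanes wait on the slow task

work = [trial_divisions(n) for n in range(100_001, 110_001, 2)]
print(lockstep_utilization(work, 96))
```

In this model, a chunk that contains even one large prime holds every other lane until that task finishes, which is consistent with the disappointing timing on the unbalanced job and the much better timing on the balanced one.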
As an experiment, I wanted to test the capabilities of my GPU with a more balanced task complexity. So I picked a very large prime number to use for every task, and modified the job to compute the same task one million times. The serial operation took 22.2 seconds, the OpenMP version took 5.6, and the GPU version took 1.5 seconds. Now that is the result I was hoping to see.
The GPU certainly has a place in the multiprocessing arena, but I think it can be very challenging to find applicable operations. For many years now we have been analyzing the performance of ACIS, and to date have never found a situation that would be best serviced by the GPU. I do think however, that many applications exist that can benefit greatly by utilizing the GPU. I’m sure it’s the right tool for some problems.
I would like to hear if anyone has found a good use for the GPU . . .?
Gregg and Stefanie have described some management perspective on Agile programming. As a participant in several of the team rooms they mentioned, I would like to make a few comments.
- Active involvement with customers: the more the developers know about what end users will want, the better.
- Emphasis on refactoring: rearranging code to avoid duplication while adding capabilities is critical for any developer.
- Retrospective: with any important activity it is important to take some time to think about what you are doing right and what could be done better.
When done correctly, pair programming is exhausting. One especially rewarding session (where Mayank and I coded up how quad trees are intersected with face boundaries) produced very good code, but left us hoarse every day for a week! What made that session work is that we both challenged each other’s intuitions freely. The end result was reasonably well tested and reliable. We had enough clash to write code better than either of us would have alone.
Unfortunately, a pace like that cannot be sustained for long. It is much easier to develop a hierarchy, where someone “knows the most” about a particular area of the code, and the other partner either watches for typos, or is supervised by the more knowledgeable person. Even this mode of pairing is tiring.
Anything that makes a team’s efforts better than the sum of individual efforts (if you just divide the work by N and give everyone something to do) is good. But pairing requires continuous effort, and won’t improve the code without everyone’s sustained efforts. There is a lot of middle ground between pairing and not. Code reviews and having people frequently bounce ideas off each other get a lot of the benefit with less stress.
If tools are set up correctly, most of the errors are semantic (i.e., the code looks good and compiles, but it’s not doing what it is supposed to). Unit testing only helps when you know the right tests to apply. It can’t catch poor scaling (e.g., using an O(n²) algorithm where an O(n) or O(n log n) one would work). I have become a big fan of writing out the contract for a function before I write the code, then placing assertions to specify pre- and postconditions.
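As a small illustration of the contract-first habit, here is a Python sketch of a merge routine whose docstring states the contract and whose assertions enforce it. The function is a made-up example, not code from any of the projects discussed.

```python
def is_sorted(values):
    """True when values are in non-decreasing order."""
    return all(values[i] <= values[i + 1] for i in range(len(values) - 1))


def merge_sorted(a, b):
    """Merge two sorted lists.

    Contract:
      pre:  a and b are each sorted in non-decreasing order
      post: result is sorted and contains every element of a and b
      cost: O(len(a) + len(b)) for the merge itself
    """
    assert is_sorted(a), "precondition violated: a is not sorted"
    assert is_sorted(b), "precondition violated: b is not sorted"

    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    merged.extend(a[i:])
    merged.extend(b[j:])

    assert len(merged) == len(a) + len(b), "postcondition violated: lost elements"
    assert is_sorted(merged), "postcondition violated: result not sorted"
    return merged


print(merge_sorted([1, 4, 9], [2, 3, 10]))
```

A caller that passes unsorted input trips the precondition immediately, at the point of the mistake, instead of producing a subtly wrong result that surfaces much later.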
I think the big takeaway should be: if you are testing your code correctly, mistakes should be obvious. When you do something wrong,
- Your code should fail to compile
- Half a dozen or more unit tests should fail
- Assertions should be going off all over the place, etc.
Long undiscovered bugs cost more than those found early in testing. Good programming demands a high level of focus on details: the more time you have to forget the code you wrote, the harder it is to fix.
- Thin Vertical Slices/Design as you go.
Positives: Thin vertical slices make sense because there is business value in quickly getting small but usable pieces of functionality to customers. If they like it, you follow up and develop it further until it meets their needs fully. If no one buys it, the project stops and you haven’t really lost that much (because you didn’t develop more than you needed).
Negatives: The notion that software can be redesigned on the fly is only partially true. The more people are using something the more risk there is in changing it. No amount of testing eliminates regression risk. If customers aren’t on board with iterative development, having a new drop every few weeks could cost you some credibility (why didn’t they do it right the first time?). Finally, it takes a lot of skill and good judgment to balance the goal of refactoring to get better code quality with regression risks.
What do you think? I was reading a recent survey on Agile and the results seemed largely positive. Does this fit with your experience?
'Form' in mathematics manifests itself in all manners of perspective and discussion. From your earliest mathematics courses, professors drilled home the discipline: “return all answers in simplest form”. My youthful efforts to dismiss the need yielded discussions such as: “OK, please graph this equation: y = (x² - 1)/(x + 1).” In a quick second I would naively suggest that at x = -1 the equation is undefined and then I would start plotting points. But alas, this is why form is important. y = (x² - 1)/(x + 1) gets factored to y = (x - 1)(x + 1)/(x + 1), which is simplified to y = x - 1, whenever x isn’t -1. Wait, that’s a line. My eighth grade algebra teacher, Mr. Sower, was right: simplest form is important.
As you advance in your course work, you start to define forms of equations by their mathematical representation and to understand the advantages and disadvantages of each. Farin, in his book, Practical Linear Algebra, does a nice job outlining the three main forms of an equation of a line and the advantages of each in computer graphics:
- Explicit Form: y = mx + b This is the form in every basic algebra book. It’s very conceptual; the coefficients have clear geometric meaning. In computer graphics it’s the preferred form for Bresenham’s line drawing algorithm and scan line fill algorithms.
- Implicit Form: a · (x - p) = 0 (Given a point p and a vector a that is perpendicular to the line.) The implicit form is very useful for determining if an arbitrary point is on a line.
- Parametric Form: l(t) = p + tv, where p is a point on the line and v is a vector along it. The scalar value t is a parameter. In this form, we can easily calculate points along the line by use of the parameter, t.
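The three forms can be compared side by side on one concrete line. The Python sketch below assumes the usual textbook equations: y = mx + b for the explicit form, a · (x - p) = 0 for the implicit form, and l(t) = p + tv for the parametric form, where v (a direction vector along the line) is a name introduced here for illustration.

```python
def explicit_y(m, b, x):
    """Explicit form: y = m*x + b."""
    return m * x + b


def implicit_residual(a, p, x):
    """Implicit form: a . (x - p); zero exactly when x lies on the line."""
    return a[0] * (x[0] - p[0]) + a[1] * (x[1] - p[1])


def parametric_point(p, v, t):
    """Parametric form: the point p + t*v."""
    return (p[0] + t * v[0], p[1] + t * v[1])


# The same line, y = 2x + 1, written three ways.
m, b = 2, 1
p = (0, 1)    # a point on the line
v = (1, 2)    # direction along the line
a = (-2, 1)   # perpendicular to v

point = parametric_point(p, v, 3)
print(point)                            # a point generated from the parameter
print(explicit_y(m, b, point[0]))       # matches the y coordinate above
print(implicit_residual(a, p, point))   # zero: the point is on the line
print(implicit_residual(a, p, (3, 8)))  # nonzero: this point is off the line
```

The example shows the division of labor described in the list: the parametric form generates points, and the implicit form tests membership with a single dot product.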
I’m not certain when I internalized that inherent in mathematics is the art, strategy and beauty of 'form'. (I’m a slow learner; it wasn’t Mr. Sower’s fault.) But as my career developed into the commercial implementation of B-rep modeling kernels and their translation technologies, 'form' again became a principal view.
So, for the purpose of this discussion we define 'form' of geometric curves and surfaces in three ways: analytic, B-spline and procedural representations. All three of the major solid modeling kernels, ACIS, CGM, and Parasolid, maintain their geometry in one of these three forms, or sometimes as dual representations.
- Analytic: geometry which can be represented explicitly by an equation (algebraic formula), for example: planes, spheres, and cylinders. These equations are very light weight and they intrinsically hold characteristics of the surface, for example the centroid of a sphere.
- B-spline: geometry represented by smooth polynomial functions (in parametric form) that are piece-wise defined. Generally used to represent free-form surfaces. Advantages of B-splines are their ability to represent many types of geometry and bounding boxes are easy to calculate.
- Procedural: geometry represented as an implicit equation or algorithm. For example, the IGES specification has tabulated cylinders and offset surfaces as defined procedural surfaces. The advantages are precision and the knowledge of progenitor data to understand design intent.
From this perspective, each of the major kernels has thought long and hard about the best form for each geometry type. In some cases it’s self-evident and easy. If you are modeling a sphere, the analytic form is clearly best. It’s a lighter weight representation and the full extents of the surface are defined. Even more, imagine doing a calculation requiring the distance between a point and a sphere. In this “special” case, you simply compute the distance between the point and the centroid of the sphere, subtract the radius and you’re done. If the sphere were in the form of a B-spline, it would be much more computationally expensive. Despite this, some translation solutions still don’t get this preferred form right. Now imagine you’re a CMM application and you purchased the solution that translates spheres to B-splines. Your app is horribly slow.
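The analytic shortcut for a sphere fits in one line. The Python sketch below is a generic illustration, not any kernel's API; taking the absolute value, so that points inside the sphere also get a non-negative distance, is a choice made here.

```python
import math


def distance_to_sphere(point, center, radius):
    """Distance from a point to the surface of a sphere (analytic form)."""
    return abs(math.dist(point, center) - radius)


print(distance_to_sphere((3.0, 4.0, 0.0), (0.0, 0.0, 0.0), 2.0))  # outside the sphere
print(distance_to_sphere((0.0, 0.0, 1.0), (0.0, 0.0, 0.0), 2.0))  # inside the sphere
```

A B-spline version of the same query would typically need an iterative closest-point search over the surface for every measured point, which is where the slowdown for a point-heavy CMM workload comes from.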
Although spheres are a trivial example, more complex geometries become intriguing. In what form should you prefer a helical surface? Or an offset surface? ACIS has preferred multiple versions of helical surfaces over the years. Early on, the preferred version was a procedural sweep surface with either a procedural rail curve or a B-spline rail curve. (The rail curve is what defines the helical nature of the surface.) If the surface was translated in from Parasolid, it came in as a generic B-spline surface. But the need to understand different characteristics of the helical surface soon became apparent. For example, the hidden line removal algorithm and intersectors all needed to understand the pitch and handedness to efficiently work with the geometry. To that end, ACIS moved to a procedural surface definition with an analytical representation of the helical rail curve.
The offset surface is an excellent example where the CGM developers and the ACIS developers came to different conclusions. In ACIS the offset surface is procedural: evaluate the progenitor and shoot up the surface normal the offset distance. ACIS chose this representation for precision and compactness of data. In addition, in ACIS, if you offset an offset surface, the progenitor of the original offset becomes the progenitor for the second or third or fourth offset, and more geometry sharing is possible. But all of this comes at a cost. Procedural surfaces, although exact, may have a performance penalty and may introduce unwanted discontinuities. The CGM developers decided the best strategy here was to create B-splines for offsets.
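The procedural idea, "evaluate the progenitor and move along the normal", can be sketched in a few lines of Python. The `Sphere` and `OffsetSurface` classes below are hypothetical illustrations and not the ACIS or CGM data structures. Sharing the root progenitor by summing distances is a simplification that assumes the offset stays clear of singularities.

```python
import math


class Sphere:
    """Analytic progenitor, parameterized by longitude u and latitude v."""

    def __init__(self, center, radius):
        self.center = center
        self.radius = radius

    def normal(self, u, v):
        """Outward unit normal at (u, v)."""
        return (math.cos(v) * math.cos(u), math.cos(v) * math.sin(u), math.sin(v))

    def evaluate(self, u, v):
        """Point on the sphere at (u, v)."""
        n = self.normal(u, v)
        return tuple(c + self.radius * k for c, k in zip(self.center, n))


class OffsetSurface:
    """Procedural offset: stores only a progenitor and a distance."""

    def __init__(self, base, distance):
        if isinstance(base, OffsetSurface):
            # Offsetting an offset: reuse the root progenitor, add the distances.
            self.progenitor = base.progenitor
            self.distance = base.distance + distance
        else:
            self.progenitor = base
            self.distance = distance

    def evaluate(self, u, v):
        """Evaluate the progenitor, then step along its normal."""
        p = self.progenitor.evaluate(u, v)
        n = self.progenitor.normal(u, v)
        return tuple(c + self.distance * k for c, k in zip(p, n))


sphere = Sphere((0.0, 0.0, 0.0), 2.0)
first = OffsetSurface(sphere, 0.5)
second = OffsetSurface(first, 0.25)
print(second.progenitor is sphere)          # the chain shares one progenitor
print(math.hypot(*second.evaluate(0.3, 0.4)))  # distance of the offset point from the center
```

Note how little data the offset carries: no control points, just a reference and a number. The cost is that every evaluation of the offset requires an evaluation of the progenitor and its normal, which is the performance trade-off mentioned above.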
So what does this all have to do with translation? The key point here is that you need to understand what the preferred forms are for each of the modeling kernels. In each of these systems you can easily slip in geometry in non-optimal forms, causing immense grief when doing downstream modeling operations. I spoke earlier about the translator solution that goofed up even a simple conversion of spheres. And the CMM application that purchased that translation solution? In short, don’t let that be you.
 For this discussion I’m going to leave off polyhedral form.