In my previous post I discussed the various multiprocessing technologies at my disposal. These are: OpenMP, MPI, PPL (Microsoft Concurrency Runtime), the ACIS thread manager, and the CGM multiprocessing infrastructure. As it turns out, I overlooked a completely relevant technology, the GPU. I would like to correct this oversight and pass along my experiences in adapting the primality algorithm used in the previous discussion to run on a GPU. (Look here for a brief introduction to GPU computing)
I’m working remotely at the moment and decided to avoid the potential headaches associated with accessing a video card through a remote desktop connection by simply using my laptop computer for this exercise. It does after all have a decent video card, an NVIDIA Quadro 1000M with 96 cores and 2GB of RAM. In analyzing compatible development tools for my video card, I chose CUDA, which is a parallel computing platform and programming model invented by NVIDIA. (Look here for a brief introduction to CUDA)
I downloaded and installed the latest CUDA Toolkit (version 4.1), associated video driver, and example codes from NVIDIA. Then I went through the documentation and a few sample programs to get a feel for the task at hand. I quickly identified several aspects of GPU programming that were different from what I was used to, namely how jobs are presented to the GPU, how they are broken down into tasks to fit the available hardware, and how the tasks are computed.
Jobs are computed on the GPU using what is called a kernel launch. A job is typically made up of many tasks. In our example, the job is to “find all prime numbers between one and one hundred million”. This can be broken down into one hundred million tasks in which each calculates the primality of one single number. The kernel to launch is essentially a function with arguments that gets called for each task. It determines its index through variables made available by the CUDA architecture (discussed later), then performs the operation on the input data taken from the index location in the input array, and stores the results again using the index into the output array.
This is a classic Single-Instruction-Multiple-Data (SIMD) processing architecture, where each processor is mapped to a unique index and executes the exact same set of instructions on the corresponding input data - on all processors - at the same time. In our example, the input data is an array containing all the numbers between one and one hundred million. We hand this array to the GPU as an argument to the kernel function and it calculates the primality of each element in the array, overwriting the corresponding array data with the results of the computations, in this case either true or false.
Calculating the unique index in each task is not as straightforward as you might think. In CUDA, a job is broken down into a multidimensional grid of blocks, where the number of tasks is equal to the number of blocks in the grid times the number of threads in a block. These values, the number of blocks and the number of threads per block, are specified when the kernel function is launched. Corresponding variables are available in the kernel function to compute the current index based on these values.
Calculating an index might look something like this:
index = blockIdx * blockDim + threadIdx;
Where blockIdx is the current block index within the grid, blockDim is the number of threads in the block, and threadIdx is the current thread index within the block. To make things a bit more flexible (complex), the grid size and block size can be specified in multiple dimensions. This is necessary to overcome limitations that would otherwise severely restrict the number of tasks that can be performed in any single kernel launch.
For practical purposes there is an upper limit to the number of threads available at any given time. In CUDA this translates into the maximum number of threads that can be used in any block, which on my system is 1024. Because of this, we would need roughly one-hundred-thousand blocks to accommodate the target number of inputs in our example - in one launch. Unfortunately, there is also an upper limit to the number of blocks, which on my system is 65535. To overcome this limitation, CUDA provides a multi-dimensional grid of blocks. My system supports a grid of 65535 x 65535 x 65535 blocks, each with 1024 threads. That’s a large number of indices available for any one operation.
Next come the input and output arrays, which in our example is one and the same. Allocating the array of input values is simple in native code (referred to as the host in CUDA speak), and as it turns out is simple with CUDA as well. It’s accomplished with cudaMalloc and cudaFree, which allocates and frees memory on the device respectively. The typical approach is to allocate identical arrays on both the host and device and use cudaMemcpy to transfer the contents back and forth.
Here is where my respect for resources causes me to deviate from my original plan. Since even numbers are never prime (except 2) we can cut the array size in half and only consider odd numbers. This changes things quite a bit however, from the size of the input/output array to the adjustment of the index value, to the number of operations that are actually performed. Nonetheless, I am willing to extend this courtesy to the GPU for optimizations sake, even though it’s cheating a bit.
So now I have a program that allocates and initializes an array on both the host and device, copies it, then launches a kernel with appropriate grid and block dimensions, copies the array data back to the host, validates the results, and finally frees memory. The kernel function calculates the corresponding index, loads the input value, tests for primality, and writes the result back to the array.
When testing the code I naturally began with smaller ranges of inputs, to make sure everything was working as expected. I got a big surprise when I finally ran with the target range of one hundred million. The program failed! As it turns out, the Windows operating system terminates operations on the video device that take longer than a few seconds. This is known behavior called Timeout Detection and Recovery (TDR). I could have fiddled with registry settings to disable TDR but instead decided to simply restructure the program to have multiple kernel launches.
From experimentation I found it safe to process one million inputs at a time. So all I had to do was to launch the kernel from within a loop, passing the current iteration to the kernel function, and adjusting the index accordingly. That’s it. I now have a complete and working program.
View Main Function
View Kernel Function
The performance was the next big surprise. The serial run on my laptop took 109 seconds, using OpenMP and four processors it was reduced to 27 seconds, and in comparison the GPU run took 53 seconds. This was unexpected at first, given that we have 96 processors available, until I realized that the tasks are very unbalanced. Determining primality is quick in some cases and may take a long time in others, especially when the number is large and prime.
The drawback with tasks of varying complexity for the GPU is that each operation, with whatever chunk of inputs is ultimately scheduled, will take as long as the most complex task. My video card has 96 processors, which makes me suspect that the inputs are processed in chunks of 96. If calculating primality is mostly simple, then most processors will sit idle while a few are working on complex cases. In contrast, independent processors are seldom idle because they can simply move on to the next task.
As an experiment, l wanted to test the capabilities of my GPU with a more balanced task complexity. So I picked a very large prime number to use for every task, and modified the job to compute the same task one million times. The serial operation took 22.2 seconds, the OpenMP version took 5.6, and the GPU version took 1.5 seconds. Now that is the result I was hoping to see.
The GPU certainly has a place in the multiprocessing arena, but I think it can be very challenging to find applicable operations. For many years now we have been analyzing the performance of ACIS, and to date have never found a situation that would be best serviced by the GPU. I do think however, that many applications exist that can benefit greatly by utilizing the GPU. I’m sure it’s the right tool for some problems.
I would like to hear if anyone has found a good use for the GPU . . .?
Gregg and Stefanie have described some management perspective on Agile programming. As a participant in several of the team rooms they mentioned, I would like to make a few comments.
- Active involvement with customers: the more the developers know about what end users will want, the better.
- Emphasis on refactoring: rearranging code to avoid duplication while adding capabilities is critical for any developer.
- Retrospective: with any important activity it is important to take some time to think about what you are doing right and what could be done better.
When done correctly, pair programming is exhausting. One especially rewarding session (where Mayank and I coded up how quad trees are intersected with face boundaries) produced very good code, but left us hoarse every day for a week! What made that session work is that we both challenged each other’s intuitions freely. The end result was reasonably well tested and reliable. We had enough clash to write code better than either of us would have alone.
Unfortunately, a pace like that cannot be sustained for long. It is much easier to develop a hierarchy, where someone “knows the most” about a particular area of the code, and the other partner either watches for typos, or is supervised by the more knowledgeable person. Even this mode of pairing is tiring.
Anything that makes a team’s efforts better than the sum of individual efforts (if you just divide the work by N and give everyone something to do) is good. But pairing requires continuous effort, and won’t improve the code without everyone’s sustained efforts. There is a lot of middle ground between pairing and not. Code reviews and having people frequently bounce ideas off each other gets a lot of the benefit with less stress.
If tools are set up correctly, most of the errors are semantic (i.e., the code looks good and compiles, but it’s not doing what it is supposed to). Unit testing only helps when you know the right tests to apply. It can’t catch poor scaling (e.g, using an n-squared algorithm where an O(n) or O(nln(n)) would work. ) I have become a big fan of writing out the contract for a function before I write the code, then placing assertions to specify pre and post conditions.
I think the big take away should be: if you are testing your code correctly, mistakes should be obvious. When you do something wrong,
- Your code should fail to compile
- Half a dozen or more unit tests should fail
- Assertions should be going off all over the place, etc.
Long undiscovered bugs cost more than those found early in testing. Good programming demands a high level of focus on details: the more time you have to forget the code you wrote, the harder it is to fix.
- Thin Vertical Slices/Design as you go.
Positives: Thin vertical slices make sense because there is business value in quickly getting small but usable pieces functionality to customers. If they like it, you follow up and develop it further until it meets their needs fully. If no one buys it, the project stops and you haven’t really lost that much (because you didn’t develop more than you needed).
Negatives: The notion that software can be redesigned on the fly is only partially true. The more people are using something the more risk there is in changing it. No amount of testing eliminates regression risk. If customers aren’t on board with iterative development, having a new drop every few weeks could cost you some credibility (why didn’t they do it right the first time?). Finally, it takes a lot of skill and good judgment to balance the goal of refactoring to get better code quality with regression risks.
What do you think? I was reading a recent survey on Agile and the results seemed largely positive. Does this fit with your experience?
'Form' in mathematics manifests itself in all manners of perspective and discussion. From your earliest mathematics courses, professors drilled home the discipline; “return all answers in simplest form”. My youthful efforts to dismiss the need yielded discussions as such; “OK, please graph this equation: " . In a quick second I would naively suggest that at x = -1 the equation is undefined and then I would start plotting points. But alas, this is why form is important. gets factored to which is simplified to to x - 1, whenever x isn’t -1. Wait, that’s a line. My eighth grade algebra teacher, Mr. Sower, was right, simplest form is important.
As you advanced in your course work, you start to define forms of equations by their mathematical representation and to understand advantages and disadvantages of each. Farin, in his book, Practical Linear Algebra, does a nice job outlining the three main forms of an equation of a line and advantages of each in computer graphics:
- Explicit Form: y = mx + b This is the form in every basic algebra book. It’s very conceptual; the coefficients have clear geometric meaning. In computer graphics it’s the preferred form for Bresenham’s line drawing algorithm and scan line fill algorithms.
- Implicit Form: (Given a point p and a vector a that is perpendicular to the line.) The implicit form is very useful for determining if an arbitrary point is on a line.
- Parametric Form: . The scalar value t is a parameter. In this form, we can easily calculate points along the line by use of the parameter, t.
I’m not certain when I internalized that inherent in mathematics is the art, strategy and beauty of 'form'. (I’m a slow learner, it wasn’t Mr. Sower’s fault.) But as my career developed into the commercial implementation of B-rep modeling kernels their translation technologies, 'form' again, became a principal view.
So, for the purpose of this discussion we define 'form' of geometric curves and surfaces in three ways: analytic, B-spline and procedural representations. All three of the major solid modeling kernels, ACIS, CGM, and Parasolid, maintain their geometry in either of these three forms, or sometimes as dual representations. 
- Analytic: geometry which can be represented explicitly by an equation (algebraic formula), for example: planes, spheres, and cylinders. These equations are very light weight and they intrinsically hold characteristics of the surface, for example the centroid of a sphere.
- B-spline: geometry represented by smooth polynomial functions (in parametric form) that are piece-wise defined. Generally used to represent free-form surfaces. Advantages of B-splines are their ability to represent many types of geometry and bounding boxes are easy to calculate.
- Procedural: geometry represented as an implicit equation or algorithm. For example, the IGES specification has tabulated cylinders and offset surfaces as defined procedural surfaces. The advantages are precision and the knowledge of progenitor data to understand design intent.
From this perspective, each of the major kernels has thought long and hard about the best form for each geometry type. In some cases it’s self-evident and easy. If you are modeling a sphere, the analytic form is clearly best. It’s a lighter weight representation and the full extents of the surface are defined. Even more, imagine doing a calculation requiring the distance between a point and a sphere. In this “special” case, you simply compute the distance between the point and the centroid of the sphere, subtract the radius and you’re done. If the sphere was in the form of a b-spline it’s much more computationally expensive. Despite this translation solutions still don’t get this preferred form right. Now imagine you’re a CMM application and you purchased the solution that translates spheres to B-splines? Your app is horribly slow.
Although spheres are a trivial example, more complex geometries become intriguing. In what form should you prefer a helical surface? Or an offset surface? ACIS has preferred multiple versions of helical surfaces over the years. Early on the preferred version was a procedural sweep surface with either a procedural rail curve or a b-spline rail curve. (The rail curve is what defines the helical nature of the surface). If the surface was translated in from Parasolid it came in as a generic b-spline surface. But the need to understand different characteristics of the helical surface soon became apparent. For example, the hidden line removal algorithm and intersectors all needed to understand the pitch and handedness to efficiently work with the geometry. To that end, ACIS moved to a procedural surface definition with an analytical representation of the helical rail curve.
The offset surface is an excellent example where the CGM developers and the ACIS developers came to different conclusions. In ACIS the offset surface is procedural; evaluate the progenitor and shoot up the surface normal the offset distance. ACIS choose this representation for preciseness and compactness of data. In addition, in ACIS, if you offset an offset surface the progenitor of the original offset becomes the progenitor for the second or third or fourth offset and more geometry sharing is possible. But all of this comes at a cost. Procedural surfaces, although exact, may have a performance penalty and may introduce unwanted discontinuities. The CGM developers decided the best strategy here was to create b-splines for offsets.
So what does this all have to do with translation? The key point here is; you need to understand what the preferred forms are for each of the modeling kernels. In each of these systems you can easily slip in geometry in non-optimal forms causing immense grief when doing downstream modeling operations. I spoke earlier about the translator solution that goofed up even a simple conversion of spheres. And the CMM application that purchased that translation solution? In short, don’t let that be you.
 For this discussion I’m going to leave off polyhedral form.
For this post, I thought I would talk about SPAR 2012, which I attended in Houston last week.
For those of you where are not familiar with it, SPAR is a conference/trade show for the medium and long-range scanning industries. From a geometry perspective, this means dealing with point clouds. Lots of point clouds. Lots of really big point clouds. For example, at one of the talks I attended, the speaker was discussing dealing with thousands of terabytes of data. Another speaker discussed the fact that developing the systems just to manage and archive the huge amount of data being produced is going to be a major challenge, let alone the need for processing it all.
As an example of this, the very first thing I noticed when I walked into the exhibit hall was the huge number of companies selling mobile terrestrial scanners. These are laser scanning units that you strap onto the back of a van, or an SUV, or an ultra-light airplane, or a UAV - there was even a lidar-equipped Cooper Mini on display. You then drive (or fly) along the path you want to scan, acquiring huge amounts of data. The data is then processed to tell you about e.g. potholes or highway signs or lane lines on roads (the scanners are often coupled with photographic systems) or vegetation incursions on power lines (typically from aerial scans). When I attended two years ago, this was a fairly specialized industry; there were only a few vans on display, the companies that made the hardware tended to be the ones doing the scans, and they also wrote the software to display and interpret the data.
This year, it seemed like this sector had commoditized: there were at least eight vehicles on display, the starting price of scanning units had come down to about $100K, and it seemed that there were vendors everywhere you looked on the display floor (and yes, it does sound odd to me that I’m calling a $100K price point “commoditized”). Another thing that I was looking for, and think I saw, was a bifurcation into hardware and software vendors. I asked several of the hardware vendors about analysis; this year they uniformly told me that they spat their data out into one of several formats that could be read by standard software packages. I view this specialization as a sign of growing maturity in the scanning industry; it shows that it is moving past the pioneer days when a hardware manufacturer has to do it all.
On the software side, I saw a LOT of fitting pipes to point clouds. This is because a large part of the medium range scanning market (at least as represented at SPAR) is capturing “as built” models of oil platforms and chemical plants, especially the piping. The workflow is to spend a few days scanning the facility, and then send the data to a contractor who spends a few months building a CAD model of the piping, from which renovation work on the facility can be planned. One of the sub-themes that ran through many of the talks at the conference was “be careful of your data – even though the scanner says it’s accurate to 1mm, you’re probably only getting ½ inch of accuracy”. This was driven home to us at Spatial a few years ago when we bought a low-end scanner to play around with and we discovered that a sharp black/white transition caused a “tear” in the surface mesh spit out from the scanner (due to differential systematic errors between white and black). A practical example of this was discussed in a talk by one of the service providers; he gave a case study of a company that tried to refit a plant using the workflow described above. Early on they discovered that the purported “as built” model (obtained by humans building models from scanned data) weren’t accurate enough to do the work – new piping that should fit correctly from the model wouldn’t fit in reality (for example all the small-diameter pipes had been left out of the model completely). This is because a real-world plant isn’t a bunch of clean vertical and horizontal cylinders; pipes sag, they’re stressed (bent) to make them fit pieces of equipment and so on. The company went back and had the job re-done, this time keeping a close tie between the original scans and the model at all stages. I really appreciated the attention to detail shown in this and other talks; in my opinion it’s just good engineering to understand and control for the systematic errors that are introduced at every stage of the process.
Two more quick observations:
- Several people mentioned the MS Kinect scanner (for the gaming console) as disruptive technology. My gut is telling me that there’s a good chance that this will truly commoditize the scanning world, and that photogrammetry might take over from laser scanning.
- I didn’t expect my former life as a particle physicist to be relevant at a scanning conference. Imagine my surprise when I saw not one but TWO pictures of particle accelerators show up in talks (and one of them a plenary session!)
Next year’s SPAR conference is in Colorado Springs – I hope to see you there!
Posted: April 27th, 2012 |
John's recent post on documentation and behavior driven development reminded me of an interesting experience I had last fall in developing training documentation. Our annual 3D Insiders' Summit (early bird registration is now open, by the way. We hope to see you there!) always gives the sales team a rare opportunity to come together from around the world in one geographic location with a large chunk of the development team. We decided to take the time to have some introductory CGM training for the TAMs (Technical Account Managers), and through the process of elimination, I somehow landed the task of organizing it.
Unfortunately, we were challenged by a number of issues. We only had a day and a half. Most of the developers and TAMS were busy in the months prior preparing presentations and demos for the Summit, including me. Amongst our team, we had varying levels of hands-on experience with CGM, and I had the least experience of all. Given these constraints, how could I ensure that we would make the most of our short time with development?
The first thing I did, of course, was to procrastinate for a few months. What's that saying, "I work best under pressure?" If that's true, there was going to be some good stuff coming for sure. With three weeks left, it hit me . . . people have extended their trips by two days to come to this training, which I haven't even started preparing. Panic! What could we do with the least amount of effort possible? I worked with development to gather any and all presentations we had lying around and threw them together into one messy powerpoint - something like 60 slides, I think. Uggh, nobody is going to have time to fix this, I don't know how to do it, and if we don't, it will be soooooo boring to sit through.
Hmm, let's avoid that topic for now. Maybe some hands-on exercises would help. I agreed to create a sequence of exercises demonstrating a (very, very) simple CAM mold and die workflow. Brilliant idea, Stef. I've never programmed with CGM before, and my ACIS is even a bit rusty. Oh well, dive in . . .
Early on, I had a pleasant surprise. The team working on componentizing CGM had spent a lot of time thinking about things they'd like to do differently from Acis, and one of those was a strong documentation structure right from the beginning. The structure is oriented towards hands-on cases, FAQs and tutorials (documentation driven development as John mentioned), with less emphasis on theory and technical articles. Their work had paid off. I was expecting to need a lot of help, given my novice state, but I was able to develop the whole workflow with only their documentation. I made some mistakes along the way, but I was able to sort them out on my own without insider help.
One problem though, was that despite the smooth development process, it was still enough work that it wouldn't fit into a 2 day training and leave us time to talk with development. Then somebody had the brilliant idea that we should assign the exercises as homework. I decided to turn my whole experience into the homework, mistakes and all. It took me a few hours to create a sequence of 15 assignments, with helpful documentation links, screenshots and hints, but no explanations from me.
Fig. 2 Above: We’re getting ready to create a mold for this swept body. We’ll use a draft to taper the sides of the part for extraction from a mold. Before drafting, we first need to pick faces for the draft.
- Pick the ribbon faces as shown in the picture below. (Hint: the little man is looking in the – X direction from 10, 0, 1 and in the +Z direction from 0,0, -1)
The idea worked pretty well. Most people did the homework. Some flew through it in a day, and some ran into difficulties and weren't able to finish. But everyone came into the training with a lot of questions and basic knowledge. During the training time, we skimmed through the messy presentation, spending most of the time asking development about the finer points and harder technical problems. The training seemed truly customized for the audience because in a sense, we created it as we went. John, what would this be called? CDT (customer driven training), PDD (Panic Driven Development), LOOE (Lucky, One-Off Experience)?
I'd be curious to know about your most valuable training experience.
Posted: April 17th, 2012 |