1D Clustering with KDE (Matt Overby, May 12, 2017)<br /><br /><a href="https://en.wikipedia.org/wiki/Kernel_density_estimation">Kernel Density Estimation</a> (KDE) is a useful technique for clustering one-dimensional data. For example, I recently implemented an interface for clustered <a href="https://en.wikipedia.org/wiki/Parallel_coordinates">parallel coordinates</a>, in which I needed to cluster about 600k variables at the click of a button:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-RNrbinmSPF0/WRXTejvqdxI/AAAAAAAABTk/ApVeacGL7CEbM9VdWPDA7VWiAdybvVS_wCLcB/s1600/clusterpcoord.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="88" src="https://2.bp.blogspot.com/-RNrbinmSPF0/WRXTejvqdxI/AAAAAAAABTk/ApVeacGL7CEbM9VdWPDA7VWiAdybvVS_wCLcB/s400/clusterpcoord.png" width="400" /></a></div><br />Of course, KDE requires a lot of expensive operations. Making the above happen in a few seconds required a few efficiency tricks, at the cost of perfectly accurate clusters. So, I put together this mini-tutorial to explain my approach.<br /><br /><h4>What is Kernel Density Estimation?</h4>KDE estimates the density of a distribution of data. That is, it is useful for finding where a lot of data is grouped together and where it isn't. Naturally, it can be used for 1D clustering by creating clusters around the points of highest density (local maxima), separated at the points of lowest density (local minima). 
There are many great online tutorials about KDE, and I recommend familiarizing yourself before moving on.<br /><br />1D clustering with KDE can be done in roughly 4 steps:<br /><ol><li>Normalize data (0 to 1) and sort</li><li>Compute densities</li><li>Find local maxima</li><li>Find minima and cluster</li></ol>For all of my examples I'll be using Matlab, with the final script linked at the bottom. Because it's Matlab it will be slow. However, the methods translate well to other languages.<br /><br /><h4><u>1) Normalize data and sort</u></h4>With standard KDE you don't need to sort, because density is calculated from every other point in your dataset. Since the following section makes use of a limited neighborhood, sorting is necessary.<br /><br /><h4><u>2) Compute densities</u></h4>This is the only step that requires a little bit of work and some knowledge of KDE. It's <i>also</i> where we can start taking shortcuts for faster clustering.<br /><br />The general idea is that, for a given point in the data, we measure the distance from that point to each of its neighbors. Neighbors that are very close add a lot to the density, and neighbors that are far away add much less. How much they add depends on the choice of a <a href="https://en.wikipedia.org/wiki/Kernel_(statistics)#In_non-parametric_statistics">smoothing kernel</a> with bandwidth <i>h</i>. A good choice for <i>h</i> is Silverman's rule of thumb: <i>h</i> = std(<i>x</i>)*(4/(3<i>n</i>))^(1/5). 
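Step 1 and the bandwidth choice translate directly to other languages. Here is a minimal Python sketch (my own translation, not the linked Matlab script; the function names are mine):

```python
import math

def normalize_and_sort(x):
    """Step 1: scale the data to [0, 1] and sort ascending."""
    lo, hi = min(x), max(x)
    span = hi - lo if hi > lo else 1.0  # avoid divide-by-zero for constant data
    return sorted((v - lo) / span for v in x)

def silverman_bandwidth(x):
    """Silverman's rule of thumb: h = std(x) * (4/(3n))^(1/5)."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)  # population std
    return std * (4.0 / (3.0 * n)) ** 0.2
```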
For example, the density at point <i>i</i>, with data <i>x</i>, number of elements <i>n</i>, and the often-used Gaussian/normal kernel, would be:<br /><span style="font-family: "Courier New",Courier,monospace;"><br /></span><span style="font-family: "Courier New",Courier,monospace;"> total = 0; % avoids shadowing Matlab's built-in sum<br /> for j = 1:n<br /> v = ( x(i) - x(j) )/h;</span><br /><span style="font-family: "Courier New",Courier,monospace;"> total = total + exp(-0.5*v*v) / sqrt(2*pi);<br /> end<br /> density = total/(n*h);</span><br /><br />For full KDE, that's an unwieldy <i>n</i>^2 kernel evaluations. To reduce the workload, we can make two simplifications:<br /><ol><li>Compute density in bins instead of at every point</li><li>Check only nearby neighbors, which have the greatest impact on density</li></ol>For a small dataset this is a little unnecessary. But consider 600k points: full KDE is 360 billion kernel evaluations, while evaluating density over 10 bins with 100 neighbors each is only about 1000. The choice of the number of bins and neighbors depends on the application. Fewer bins give you fewer clusters. Fewer neighbors will dramatically improve run time, but may give you noisier results, especially when an ignored neighbor would have added a not-so-insignificant amount to the density. Sometimes it is sufficient to add a round of smoothing, where the density of a bin is recomputed as the average of itself and its left and right bins. 
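The same density query in Python, as a self-contained sketch (again my own translation, not the linked script). With n points, each query is O(n), hence the n^2 total for full KDE:

```python
import math

def kde_density(x, i, h):
    """Full-KDE density at x[i]: every point in x contributes through a
    Gaussian kernel with bandwidth h."""
    n = len(x)
    total = 0.0
    for j in range(n):
        v = (x[i] - x[j]) / h
        total += math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)
    return total / (n * h)
```

Density should come out higher inside a clump of points than at an isolated point, which is exactly what the clustering relies on.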
The following code snippet applies these simplifications:<br /><br /><span style="font-family: "Courier New",Courier,monospace;">h = std(x) * (4/(3*n))^(1/5);</span><br /><span style="font-family: "Courier New",Courier,monospace;">nn = 10; % number of neighbors to check on either side<br />nb = 10; % number of bins<br />bh = 1/nb; % bin width<br />bins = zeros(nb,1); % nb bins for computing density<br />for i = 1:n</span><br /><span style="font-family: "Courier New",Courier,monospace;"> bin = min(nb, max(1, ceil(x(i)/bh))); % bin index, clamped so x=0 stays in range</span><br /><span style="font-family: "Courier New",Courier,monospace;"> total = 0;<br /> for j = max(1,i-nn):min(n,i+nn)<br /> v = (x(i)-x(j))/h;</span><br /><span style="font-family: "Courier New",Courier,monospace;"> total = total + exp( -0.5*v*v ) / sqrt(2*pi);<br /> end<br /> dens = total/(2*nn*h);<br /> bins(bin) = bins(bin) + dens;<br />end<br />for i = 2:nb-1 % Smoothing<br /> bins(i) = (bins(i-1)+bins(i)+bins(i+1))/3;<br />end</span><br /><br />Of course, the bins aren't exactly centered at the points of highest density. For a lot of applications this is perfectly fine. Plotting the resulting density and data gives us:<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-_cfePGNCGaM/WRXxtDjg__I/AAAAAAAABUU/wIrrOAdIwO8hEK6UBLwKVPUxv0Ff8iSIwCLcB/s1600/plotbins.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="295" src="https://4.bp.blogspot.com/-_cfePGNCGaM/WRXxtDjg__I/AAAAAAAABUU/wIrrOAdIwO8hEK6UBLwKVPUxv0Ff8iSIwCLcB/s400/plotbins.png" width="400" /></a></div><h4><u>3) Find local maxima</u></h4>Local maxima are the peaks of the curves in the above plot. They can easily be discovered by checking the bins on either side. If a bin is larger than both of its neighbors, it's a local maximum! 
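Putting steps 2 through 4 together, here is the whole binned pipeline as a compact Python sketch. This is my own translation of the approach, not the linked Matlab script; the clamped bin index and placing the split at the center of the minimum bin are my choices:

```python
import math

def cluster_1d(x, nn=10, nb=10):
    """x: sorted data normalized to [0, 1]. Returns a cluster id per point."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    h = std * (4.0 / (3.0 * n)) ** 0.2     # Silverman's rule of thumb
    bh = 1.0 / nb                           # bin width
    bins = [0.0] * nb
    for i in range(n):
        b = min(nb - 1, int(x[i] / bh))     # clamped so x == 1 stays in range
        total = 0.0
        for j in range(max(0, i - nn), min(n, i + nn + 1)):
            v = (x[i] - x[j]) / h
            total += math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)
        bins[b] += total / (2 * nn * h)
    # One round of smoothing: average each interior bin with its neighbors.
    smoothed = bins[:]
    for i in range(1, nb - 1):
        smoothed[i] = (bins[i - 1] + bins[i] + bins[i + 1]) / 3.0
    bins = smoothed
    # Step 3: local maxima (larger than both neighbors; ends check one side).
    maxima = [i for i in range(nb)
              if (i == 0 or bins[i] >= bins[i - 1])
              and (i == nb - 1 or bins[i] > bins[i + 1])]
    # Step 4: the lowest bin between consecutive maxima splits the clusters.
    bounds = []
    for a, c in zip(maxima, maxima[1:]):
        m = min(range(a + 1, c), key=lambda k: bins[k])
        bounds.append((m + 0.5) * bh)       # split at the minimum bin's center
    return [sum(1 for t in bounds if v >= t) for v in x]
```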
If your curve is jagged with too many maxima, you'll need to increase the number of neighbors sampled when computing density. Play with the tunable variables until you have something that looks nice:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-5aowe_uPzEM/WRXzdigi7zI/AAAAAAAABUg/PZCRn66oBiUv2iT_40oOmsgaj61dJ6a5gCLcB/s1600/maxima.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="295" src="https://1.bp.blogspot.com/-5aowe_uPzEM/WRXzdigi7zI/AAAAAAAABUg/PZCRn66oBiUv2iT_40oOmsgaj61dJ6a5gCLcB/s400/maxima.png" width="400" /></a></div><h4><u>4) Find local minima</u></h4>The minima are simply the bins of lowest density between two maxima:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-J9RS2hCHMPI/WRX0GflQWwI/AAAAAAAABUo/-pMNZWuToLE7YNncA6bqw3oVZs0d8KFWwCEw/s1600/minima.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="295" src="https://2.bp.blogspot.com/-J9RS2hCHMPI/WRX0GflQWwI/AAAAAAAABUo/-pMNZWuToLE7YNncA6bqw3oVZs0d8KFWwCEw/s400/minima.png" width="400" /></a></div>Local minima define where the clusters split. In the above example, we can see the data is split into three clusters.<br /><br />And that's it. The script I used to create the plots above <a href="https://drive.google.com/open?id=0B3bQMrw3TWG6Z0dkWjZ4WWp2bzQ">can be found right here</a>.<br /><br />Optimization-Based Material Point Method (Matt Overby, January 6, 2017)<br /><br />The Material Point Method (MPM) is a powerful and interesting technique for physics simulation. 
MPM is great for continuum substances like fluids and soft bodies, as well as for dealing with complex problems like fracture and self-collisions. It has received a lot of recent attention from Disney, most notably the <a href="https://www.disneyanimation.com/technology/publications/55">snow simulation</a> and <a href="https://www.disneyanimation.com/technology/publications/69">phase change/heat transfer</a> papers that appeared at SIGGRAPH.<br /><br />Recently I coded up optimization-based implicit MPM to test out some research ideas. My ideas didn't work out, but I put the <a href="https://github.com/mattoverby/mpm-optimization">code up on github</a> instead of letting it collect dust on my hard drive. There are many graphics-oriented MPM implementations online, but very few of them use implicit integration, and even fewer (any?) use optimization.<br /><br />I'm going to skip a lot of the mathematical details in this post. I feel they are better covered by the referenced papers. Instead, I'll give just enough to understand the code.<br /><br /><h2>Material Point Method</h2>The basic idea is that particles move around the domain and carry mass. When it's time to do time integration, the particle mass/momentum is mapped to a grid. New velocities are computed on the grid, then mapped back to the particles, which are then advected. The grid essentially acts as a scratch pad for simulation that is cleared and reset each time step. For a (better) background on MPM, read the introduction chapter from the <a href="http://web.cs.ucla.edu/~cffjiang/research/mpmcourse/mpmcourse.pdf">SIGGRAPH course notes</a>.<br /><br /><h3>Time integration</h3>The best starting point for computing new velocities is explicit time integration. Though explicit integration is fast to compute and easy to understand, it has problems with stability and overshooting. If we take too large a time step, the particle goes past the physically correct trajectory. 
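To make the stability problem concrete, here is a toy 1D spring example of my own (not from the MPM code): explicit Euler adds energy every step and blows up for large time steps, while backward (implicit) Euler stays stable. As a preview of the optimization view discussed next, the implicit step can equivalently be computed by minimizing an energy E(x) = m/(2*dt^2)*(x - xhat)^2 + U(x) with xhat = x0 + dt*v0, here for U(x) = 0.5*k*x^2:

```python
def explicit_euler_spring(x, v, k, m, dt, steps):
    """Explicit Euler for x'' = -(k/m) x. Each step scales the amplitude by
    sqrt(1 + (w*dt)^2) > 1, so the system gains energy and blows up."""
    for _ in range(steps):
        a = -(k / m) * x
        x, v = x + dt * v, v + dt * a
    return x, v

def implicit_euler_spring(x, v, k, m, dt, steps):
    """Backward Euler: v1 = v - dt*(k/m)*x1 with x1 = x + dt*v1, solved in
    closed form for the linear spring. Unconditionally stable."""
    for _ in range(steps):
        v = (v - dt * (k / m) * x) / (1.0 + dt * dt * k / m)
        x = x + dt * v
    return x, v

def implicit_step_via_minimization(x0, v0, k, m, dt, iters=100):
    """One implicit Euler step obtained by minimizing
    E(x) = m/(2*dt^2)*(x - xhat)^2 + 0.5*k*x^2. Gradient descent with step
    size 1/E'' (i.e., Newton's method for this quadratic energy)."""
    xhat = x0 + dt * v0          # inertial prediction
    x = xhat                     # warm start
    lr = 1.0 / (m / dt**2 + k)   # inverse Hessian of E
    for _ in range(iters):
        grad = (m / dt**2) * (x - xhat) + k * x  # inertia term + elastic force
        x -= lr * grad
    v = (x - x0) / dt            # recover the end-of-step velocity
    return x, v
```

The minimized step and the closed-form backward Euler step agree, which is the whole point: minimizing total energy *is* the implicit update.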
Intermediate techniques like the midpoint method can alleviate such problems, but the silver bullet is implicit integration.<br /><br />With implicit integration we figure out what the velocity needs to be to make the particles end up where they should. The hard part is that the unknown end-of-step velocity now appears on both sides of the update equation, which requires solving a system of nonlinear equations. This is computationally expensive, especially for large and complex systems.<br /><br />Fortunately, many important forces (like elasticity) are conservative. Mathematically, that means we can represent the forces as the negative gradient of potential energy. This lets us do implicit integration through <i>optimization</i>. It makes the problem a bit easier to formulate and can have better performance than many other implicit solvers.<br /><br />By representing our forces as energy potentials, we can do implicit time integration by minimizing total system energy. The concept itself is not new, but was recently applied to Finite Elements/MPM in an <a href="https://www.math.ucla.edu/%7Ejteran/papers/GSSJT15.pdf">excellent paper by Gast et al.</a> Section 6 of the paper covers the algorithm for optimization-based MPM. One thing to keep in mind: in MPM the velocity calculations happen on grid nodes, not grid boundaries.<br /><br /><h3>PIC/FLIP versus APIC</h3>Mapping to/from the grid is an important part of the material point method. MPM was originally an extension of the Fluid Implicit Particle (FLIP) method, which in turn was an extension of the Particle in Cell (PIC) method. PIC and FLIP are the standard in fluid simulation, and were used in the above Gast paper. Mike Seymour wrote a <a href="https://www.fxguide.com/featured/the-science-of-fluid-sims/">good article on fxguide</a> that covers what PIC/FLIP is and why it is useful. However, they suffer from a few well-known limitations. 
Namely, PIC causes things to "clump" together, and FLIP can be unstable.<br /><br />A technique called the <a href="https://disney-animation.s3.amazonaws.com/uploads/production/publication_asset/104/asset/apic-aselle-final.pdf">Affine Particle In Cell (APIC)</a> method was recently introduced to combat these limitations. It's more stable and does a better job of conserving momentum. In short, it's worth using. To do so, you only need to make a few subtle changes to the MPM algorithm. The SIGGRAPH course notes explain what to change, but the Gast paper does not.<br /><br /><h2>Implementation</h2>Everything you need to know to code up MPM can be found in the SIGGRAPH course notes. If you've ever done fluid simulation with PIC/FLIP, it should be straightforward. If not, here are some things I found useful: <br /><br />1. It helps with stability to have a cell size large enough to hold 8-16 particles.<br /><br />2. Keep a fully-allocated grid as the domain boundary, but do minimization on a separate "active grid" list made up of pointers to grid nodes.<br /><br />3. Write a "make grid loop" function for particle/grid mapping: given a particle, return the indices of grid nodes within the kernel radius, as well as the kernel smoothing values/derivatives. This was probably the trickiest part of the whole method. Much of this can be preallocated if you have a lot of memory, but I found (in 3D) allocating/deallocating on the spot was necessary.<br /><br />Another thing to consider is which algorithm to use for minimization. I'm too lazy to deal with Hessians, so I generally stick to L-BFGS and Conjugate Gradient. Optimization is widely used across disciplines, so there are some open implementations you may want to consider. I've found that <a href="https://github.com/PatWie/CppNumericalSolvers">Pat Wie's CppOptimizationLibrary</a> works well as a go-to for most of my test projects, because I like the interface and the Eigen3 compatibility. 
It also includes a lot of nice things like finite difference gradient/Hessian, line search algorithms, etc. However, there are some issues with its L-BFGS and CG. For some of my applications, they failed to converge (although that may have been fixed in recent revisions, I haven't checked). Instead I use an L-BFGS implementation by <a href="https://people.cs.clemson.edu/~ioannis/">Ioannis Karamouzas</a> that I modified for Pat Wie's library (included in my MPM github repo).<br /><br /><h2>Rendering</h2>Since the code was just to test out some ideas, I didn't put any effort into rendering. Instead, I just visualized the particle locations using <a href="https://github.com/zdevito/vdb">VDB</a>. If you haven't used VDB, it's a pretty cool tool for drawing primitives in fresh simulation code. All you have to do is run the viewer, include the header, then call a few functions (like "vdb_point(x,y,z)") to draw stuff.