Rust is a relatively new programming language and many problems being tackled in Rust have existing solutions in C/C++. As a Rust developer in need of some non-trivial functionality, you must often choose between a Rust wrapper of an existing C/C++ library and a pure Rust alternative. In this post, I compare the runtime performance of simple operations from geo, a crate that provides algorithms for two-dimensional geometric operations, with the most prominent alternative, geos, a wrapper of the C++ libgeos library. I had previously compared the features offered by these crates in another post on this blog.
The primary question I seek to answer: Is there a marked performance benefit in using the pure-Rust geo crate over geos because of its use of Rust? Or conversely, is there a marked performance benefit in using the geos crate because of its use of C++? This is an instance of C++ vs Rust performance comparison, but especially interesting for applications written in Rust. The conventional wisdom is that Rust is competitive with (bare, not wrapped in Rust) C, and though it is possible to point out specific reasons programs written in one language might be faster, modern compilers and hardware are too complex for these reasons to apply broadly.
To provide an answer, I compare the performance of three algorithmically simple geospatial computations from the two crates - minimum bounding rectangle, area and centroid. I choose these simple computations for two reasons. First, it makes it more likely that any performance difference is due to the use of Rust or C++. In a more complex computation, the largest performance optimization opportunities are most likely algorithmic. Second, it makes it easier to analyze the root causes of the observed performance differences. As you will see below, the root cause analysis is at the core of the insights to be gained from this exercise.
What this performance comparison is not
Application performance evaluation is hard - you risk measuring the wrong thing, in the wrong way, and making the wrong inferences from observations.
What: An application’s performance depends heavily on operational context – the application may perform differently for different inputs, in executing different operations, and when used in the context of other larger software. Thus, you should benchmark an application in an environment that is faithful to the target context. My performance evaluation below is not a broadly applicable comparison of Rust and C++, or even geo and geos. Instead, it is an exercise in understanding the root causes of observed differences in performance of simple algorithms in isolation. I use realistic inputs so that I observe performance differences on meaningful operations, but side-step the issue of scale and software context. A more contextual evaluation would use real-world inputs at scale and a real-world problem with a sequence of operations. A good example is this study comparing several modern spatial libraries.
How: Application performance is also very sensitive to the execution context - hardware characteristics (CPU architecture, memory cache sizes), software characteristics (kernel scheduler, memory allocator), and other concurrent processes on the system greatly affect performance. Conventional wisdom in application performance evaluation is to control as many of these variables as possible. I take a different approach with fewer controls that works well for comparing the performance of two alternatives, but isn’t valid for reporting absolute performance of an application. I describe this methodology in its own section below.
Why: In this exercise, comparing the performance of the two crates is only the starting point. I arrive at my key insights by digging into the root causes of the observed difference. A performance evaluation exercise is incomplete, and of questionable validity, without a detailed understanding of the performance bottlenecks that explain the observations.
Methodology
I ran my benchmarks on a Google Cloud Platform virtual machine (gasp!). A cloud virtual machine is a bad environment for performance evaluation - it allows little visibility and no control over the hardware, virtualization setup and resource sharing. But, the best camera is the one that’s with you, so I set out to develop an experimental design to counter this inconvenience.
The trick is to realize that my experiment only needs to compare the relative performance of two alternatives. Since I cannot control the environmental factors, I ensure that the factors affect both crates similarly and measure the ratio of the performance between the two.
- Measure: The primary quantity I measure is the elapsed real time for each operation, measured using a monotonic wall-clock. I repeat the operation a large number of times so that the measured runtime is at least a few milliseconds, because robust nanosecond-precision measurement is hard.
- Repeat:
  - I measure each operation over 100 times to counter the inevitable random measurement noise.
  - I interleave the measurements for the two crates – in each iteration, I measure each crate once in order. This way, any drift in measurement due to changes in the execution environment equally affects both crates [1].
  - I pair the measurements for the two crates in each iteration, and use their ratio in computing the primary metric.
- Report: Runtime comparison, especially in the millisecond range, is unintuitive. I invert the measured time to compute the rate at which operations are performed - queries per second (QPS). For each iteration, I compute the ratio of the QPS observed for the two crates and report the mean and variance of this ratio across all iterations as my primary metric.
This way, instead of seeking to control the execution environment, I account for random noise (via repeated measurements) and drift (via interleaved measurements) and work with ratios of measured quantities [2]. I report both the mean and variance of the observed QPS ratio to provide confidence in the reported metric. In my opinion, the results from this experimental setup are more trustworthy than single-shot or best-of-N-runs comparisons, even though I have no control over the execution environment.
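The measurement loop described above can be sketched in a few lines of Rust. This is a minimal illustration under my own assumptions, not the harness used for this post: the function names (time_batch, qps, paired_qps_ratios, mean_and_std) are hypothetical, and the sketch omits warm-up handling and outlier removal.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `op` `reps` times and returns the elapsed wall-clock time.
/// `Instant` is a monotonic clock; `black_box` keeps the compiler from
/// discarding the result of each call.
fn time_batch<R, F: FnMut() -> R>(reps: u32, mut op: F) -> Duration {
    let start = Instant::now();
    for _ in 0..reps {
        black_box(op());
    }
    start.elapsed()
}

/// Operations per second for a batch of `reps` operations.
fn qps(reps: u32, elapsed: Duration) -> f64 {
    f64::from(reps) / elapsed.as_secs_f64()
}

/// Interleaves the two candidates: in every iteration each one is timed
/// once, in order, and the paired QPS ratio (a / b) is recorded.
fn paired_qps_ratios<RA, RB, A, B>(iterations: usize, reps: u32, mut a: A, mut b: B) -> Vec<f64>
where
    A: FnMut() -> RA,
    B: FnMut() -> RB,
{
    (0..iterations)
        .map(|_| {
            let qps_a = qps(reps, time_batch(reps, &mut a));
            let qps_b = qps(reps, time_batch(reps, &mut b));
            qps_a / qps_b
        })
        .collect()
}

/// Mean and sample standard deviation of the observed ratios.
/// Assumes at least two observations.
fn mean_and_std(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var.sqrt())
}
```

A caller would pass one closure per crate to paired_qps_ratios and feed the resulting ratios to mean_and_std. Because each ratio comes from two measurements taken back to back, slow drift in the environment largely cancels out within a pair.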
A final methodological note on what is being measured - even though I evaluate simple operations in isolation, I use realistic inputs so that the benchmark is indicative of how the operations would perform in a real-world application. I use as input a set of polygons that represent the boundaries of all the administrative districts of India [3]. These computations could reasonably arise as a small step in a larger geo-statistical study.
geo performed better than geos for all operations I benchmarked. The QPS ratio ranged from about 1.4 to about 48 across the operations. This difference in QPS stems from one of two sources:
- geo’s implementation provides better optimization opportunities to the compiler. The Rust compiler avoids memory allocations more aggressively and generates more efficient machine code than the C++ compiler. But, as with many compiler optimization benchmarks, your mileage may vary.
- geo’s algorithms are simpler. geo avoids costly computations that geos incurs.
Ultimately, algorithmic differences lead to larger performance gains, and amplify the performance improvements from compiler optimizations. Thus, my analysis supports the guidance that it is better to concentrate on algorithmic and data-structure driven performance improvements in your programs over manual micro-tuning – a program that avoids a computation entirely always performs better than one that computes it efficiently.
Minimum Bounding Rectangle
The first operation I compare is also the simplest – the computation of the minimum bounding rectangle of a set of polygons. I found that geo and geos perform similarly. On average, geo computes 1.4 times as many queries per second as geos (mean from 150 observations with standard deviation of 0.2).
A visual representation of all 150 observations provides further insight into the observed difference:
Observe how the QPS ratio stays mostly above 1.0 but it varies to a much larger degree after the first 60 observations. The increased variance in latter observations is surprising. It is useful to graph the raw observations to understand this increase in variance:
This graph reveals that not only does the runtime vary more after the first 60 observations, it drops significantly for both crates. This is an example of observational drift I mentioned earlier – an uncontrolled environmental factor affected performance of both geo and geos significantly during this benchmark run, but the QPS ratio remained unaffected. The variance in the ratio jumps for the latter observations because the QPS for geo varies to a larger degree around its mean than geos for these observations.
Root-cause analysis
To understand the reasons behind the observed performance difference, I profiled both implementations using Linux perf [4] to estimate the CPU cycles spent in various functions.
The following (vastly) simplified perf report for geo’s computation shows what you would expect:
- 83.42% geo_types::private_utils::get_min_max (inlined)
- 16.58% <core::iter::adapters::flatten::FlatMap<I,U,F> as core::iter::traits::iterator::Iterator>::next (inlined)
Most of the CPU cycles are spent inside geo_types::private_utils::get_min_max, computing the minimum (and maximum) values of the x- and y-coordinates of all the vertices in the geometry. The remaining CPU cycles are spent iterating over the vertices.
Contrast this simple report with that for geos:
- 89.69% <geos::geometry::Geometry as geos::geometry::Geom>::envelope - 84.02% geos::geom::GeometryFactory::toGeometry - 21.57% geos::geom::GeometryFactory::createLinearRing - 17.81% geos::geom::GeometryFactory::createPolygon - 13.58% geos::geom::DefaultCoordinateSequenceFactory::create - 9.42% std::unique_ptr<geos::geom::Geometry, std::default_delete<geos::geom::Geometry> >::unique_ptr<geos::geom::Polygon, std::default_delete<geos::geom::Polygon>, void> 3.7% geos::geom::Envelope::getMinY - 10.30% core::ptr::drop_in_place<geos::geometry::Geometry> (inlined) - 9.4% geos::geom::Polygon::~Polygon
Like all perf report snippets in this post, this is a simplified view of a report that shows a tree of call stacks. Stack frames deeper in the call graph are indented further to the right. Each stack frame is labeled with the fraction of total CPU cycles spent inside the subtree rooted at that node. For example, the report above states that 89.69% of the total CPU cycles are spent inside <geos::geometry::Geometry as geos::geometry::Geom>::envelope or its children, 84.02% of total CPU cycles are spent in geos::geom::GeometryFactory::toGeometry and its children when called by <geos::geometry::Geometry as geos::geometry::Geom>::envelope, and so on.
The report indicates that very little time is spent in a computation similar to geo_types::private_utils::get_min_max from above (3.7% in geos::geom::Envelope::getMinY). Most of the time is instead spent in creating new geometry objects via geos::geom::GeometryFactory::toGeometry, and the rest is spent destroying those objects.
Another look at the input geometry in my experimental setup helps shed light on why memory allocation is the bottleneck in the geos computation. The input geometry consists of a collection of polygons. The minimum bounding rectangle of the entire geometry is the minimum bounding rectangle of the individual polygons’ minimum bounding rectangles. geos allocates polygons to represent the minimum bounding rectangle of each polygon in the input, uses those to compute the overall minimum bounding rectangle, and then destroys the intermediate polygons. The allocation and deallocation of these temporary polygons easily outweigh the cost of computing the minimum and maximum of a few coordinates. What is surprising is that geo avoids a similar cost. A look at the disassembled machine code for geo’s implementation indicates how:
    <geo_types::multi_polygon::MultiPolygon<T> as geo::algorithm::bounding_rect::BoundingRect<T>>::bounding_rect()
    [ ... SNIP ... ]
    _ZN9geo_types13private_utils17get_bounding_rect17h979dc24cd95606a7E():
    return Some(Rect::new(
            movsd  %xmm2,0x8(%rax)
            movsd  %xmm1,0x10(%rax)
            movsd  %xmm4,0x18(%rax)
            movsd  %xmm3,0x20(%rax)
            mov    $0x1,%r8d
    _ZN120_$LT$geo_types..multi_polygon..MultiPolygon$LT$T$GT$$u20$as$u20$geo..algorithm..bounding_rect..BoundingRect$LT$T$GT$$GT$13bounding_rect17hcabacac00bf1cedeE():
    148:    mov    %r8,(%rax)
    [ ... SNIP ... ]
This is a snippet of the relevant function in geo. The source code annotation shows the anticipated allocation of a new rectangle to represent the minimum bounding rectangle for each input polygon. But the corresponding assembly consists only of movsd, an x86 instruction to move double-precision floating-point values, and no calls to memory allocation subroutines. The compiler is able to entirely optimize away memory allocation, instead recycling freed memory to store the individual minimum bounding rectangles! My guess here is that Rust’s explicit memory lifetimes make such optimizations easier for the compiler.
In the end, even though (or, partly because) the evaluated computations are very simple, this is not an apples-to-apples comparison - the performance bottleneck for geo is computation of minimum and maximum coordinates of the input polygons, while for geos it is memory allocations.
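The idea that the bounding rectangle of a collection is the bounding rectangle of its members' bounding rectangles can be sketched without any heap-allocated intermediates. The Coord and Rect types and the function names below are hypothetical stand-ins, not geo's actual types or implementation. The point is only that each per-polygon rectangle can be a small Copy value that lives in registers or on the stack.

```rust
/// A planar coordinate (hypothetical stand-in for a geometry library's type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Coord {
    x: f64,
    y: f64,
}

/// An axis-aligned rectangle described by its minimum and maximum corners.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    min: Coord,
    max: Coord,
}

impl Rect {
    /// Degenerate rectangle containing a single point.
    fn of_point(c: Coord) -> Rect {
        Rect { min: c, max: c }
    }

    /// Smallest rectangle containing `self` and the point `c`.
    fn expand(self, c: Coord) -> Rect {
        Rect {
            min: Coord { x: self.min.x.min(c.x), y: self.min.y.min(c.y) },
            max: Coord { x: self.max.x.max(c.x), y: self.max.y.max(c.y) },
        }
    }

    /// Smallest rectangle containing both `self` and `other`.
    fn merge(self, other: Rect) -> Rect {
        self.expand(other.min).expand(other.max)
    }
}

/// Bounding rectangle of one ring, or `None` for an empty ring.
fn ring_bounding_rect(ring: &[Coord]) -> Option<Rect> {
    let (first, rest) = ring.split_first()?;
    Some(rest.iter().fold(Rect::of_point(*first), |r, c| r.expand(*c)))
}

/// Bounding rectangle of a collection of polygons, each given by its
/// exterior ring. The per-polygon rectangles are plain values, so no
/// temporary geometry objects are allocated or destroyed.
fn multi_bounding_rect(polygons: &[Vec<Coord>]) -> Option<Rect> {
    polygons
        .iter()
        .filter_map(|p| ring_bounding_rect(p))
        .reduce(Rect::merge)
}
```

Whether a compiler keeps such intermediates out of memory entirely depends on inlining decisions, so this sketch shows the shape of the computation rather than a guaranteed outcome.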
Area
The second operation I compare is the computation of (2-dimensional, planar) area. I found that geo outperforms geos by a larger margin than for minimum bounding rectangle. On average, geo computes 8.1 times as many area queries per second as geos (mean from 98 observations after outlier removal, with a standard deviation of 0.1).
Root-cause analysis
Here is a simplified perf report for a profiled geo area computation.
- geo::algorithm::area::twice_signed_ring_area (inlined) - 53.37% <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next (inlined) - 27.31% geo_types::line::Line<T>::determinant (inlined) - 19.31% <geo_types::line::Line<T> as geo::algorithm::map_coords::MapCoords<T,NT>>::map_coords (inlined)
This report shows that about half the CPU cycles are spent in iterators (over polygons and coordinates of polygons) and the remainder is spent in the individual polygons’ area computation.
A report for profiled geos computation shows similar CPU cycle distribution:
- 98.45% geos::algorithm::Area::ofRingSigned - 24.41% geos::geom::CoordinateArraySequence::getAt 13.24% std::vector<geos::geom::Coordinate, std::allocator<geos::geom::Coordinate> >::operator[]
A quarter of the CPU cycles are spent in iteration, and the rest are spent in the body of geos::algorithm::Area::ofRingSigned computing the area of individual polygons.
Unlike minimum bounding rectangle, the two implementations for area have similar performance profiles and the same performance bottleneck: area computation for the individual polygons. Both geo and geos use the shoelace formula for computing the area of individual polygons [5]. This computation in geo is about 12 times faster than that in geos (the overall geo area computation is 8 times faster; individual polygon area computation accounts for 50% of the time in geo, but 75% in geos). The following snippets of disassembled machine code of the relevant functions reveal the reason behind this difference:
    geo::algorithm::area::get_linestring_area() Event: cpu-clock
    Percent
    [ … SNIP … ]
            mulpd    %xmm2,%xmm4
    _ZN45_$LT$f64$u20$as$u20$core..ops..arith..Sub$GT$3sub17h7b8b1be7ea3e7a7aE():
      1.86  movapd   %xmm4,%xmm2
            unpckhpd %xmm4,%xmm2
            subsd    %xmm2,%xmm4
    [ … SNIP … ]
    geos::algorithm::Area::ofRingSigned() Event: cpu-clock
    Percent
    [ ... SNIP ... ]
     10.18  movsd  %xmm0,-0x80(%rbp)
            sum += p1.x * (p0.y - p2.y);
      2.63  movsd  -0x60(%rbp),%xmm1
            movsd  -0x38(%rbp),%xmm0
            movsd  -0x78(%rbp),%xmm2
            subsd  %xmm2,%xmm0
      2.52  mulsd  %xmm1,%xmm0
      6.46  movsd  -0x8(%rbp),%xmm1
            addsd  %xmm1,%xmm0
     11.71  movsd  %xmm0,-0x8(%rbp)
    [ ... SNIP ... ]
geos spends a large fraction of CPU cycles moving scalar double-precision floating-point values between registers and memory via the movsd instruction. geo instead uses the movapd instruction to move two double-precision floating-point values at a time for far fewer CPU cycles. The Rust compiler vectorizes the area computation and delivers a significant speed up!
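For reference, the shoelace formula that both listings implement can be written as a short iterator chain. This is a minimal sketch and not the source of either library: the Coord type and the function name signed_ring_area are hypothetical, and the sketch assumes a closed ring whose last point repeats the first.

```rust
/// A planar coordinate (hypothetical stand-in for a geometry library's type).
#[derive(Clone, Copy)]
struct Coord {
    x: f64,
    y: f64,
}

/// Signed area of a closed ring via the shoelace formula.
/// The result is positive for counter-clockwise rings and negative for
/// clockwise ones. Rings with fewer than two points have zero area.
fn signed_ring_area(ring: &[Coord]) -> f64 {
    let twice_area: f64 = ring
        .windows(2)
        // Determinant of each consecutive pair of points.
        .map(|w| w[0].x * w[1].y - w[1].x * w[0].y)
        .sum();
    twice_area / 2.0
}
```

Each term needs two multiplications and one subtraction on adjacent pairs of doubles, which is the kind of work that packed instructions such as mulpd can do two values at a time.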
geo achieves another small speed up via faster iteration. Iteration in geo is 4 times faster than in geos. I have not analyzed the root cause of this speed up, but the machine code generated for iteration by the Rust compiler was significantly simpler than that for C++, with more aggressive inlining and consistent performance gain, for all the operations that I benchmarked.
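The per-component figures in this section (about 12 times for the per-polygon area computation and 4 times for iteration) follow from the overall speed up and the share of runtime each component takes in the two profiles. If geos takes time T overall, geo takes T/8. A component that accounts for a share s_geos of geos's time and s_geo of geo's time is therefore 8 × s_geos / s_geo times faster in geo. The helper below is a hypothetical illustration of that arithmetic, using the rounded shares quoted in the text.

```rust
/// Speed up of a single component, given the overall speed up and the
/// share of total runtime the component takes in the faster and in the
/// slower implementation.
fn component_speedup(overall_speedup: f64, share_in_fast: f64, share_in_slow: f64) -> f64 {
    overall_speedup * share_in_slow / share_in_fast
}
```

With an overall speed up of 8, component_speedup(8.0, 0.5, 0.75) gives 12 for the area computation and component_speedup(8.0, 0.5, 0.25) gives 4 for iteration.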
Centroid
Among the operations I benchmarked, I found the largest performance difference in computation of the centroid. On average, geo computes 47.6 times as many centroid queries per second as geos (mean from 100 observations after outlier removal, with a standard deviation of 1.39).
Root-cause analysis
Centroid computation is a more complex operation than the two I described above. The key sources of speed up in geo are the same – more aggressive function inlining and vectorization of hot computations – but their effect is amplified by algorithmic differences in the two implementations. geos’ algorithm iterates over polygons (and their coordinates) many more times than geo, and requires more numerical computations than geo.
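To make the cost of extra passes concrete, here is the textbook area-weighted centroid of a single ring, computed in one pass over consecutive point pairs. I am not claiming this is the code of either library. The Coord type and the function name ring_centroid are hypothetical, and the sketch assumes a closed ring whose last point repeats the first.

```rust
/// A planar coordinate (hypothetical stand-in for a geometry library's type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Coord {
    x: f64,
    y: f64,
}

/// Area-weighted centroid of a closed ring, accumulated in a single pass.
/// Returns `None` when the ring has zero area (including empty rings).
fn ring_centroid(ring: &[Coord]) -> Option<Coord> {
    let (mut twice_area, mut sum_x, mut sum_y) = (0.0, 0.0, 0.0);
    for w in ring.windows(2) {
        let (a, b) = (w[0], w[1]);
        // Shoelace term for this edge, reused for area and both moments.
        let cross = a.x * b.y - b.x * a.y;
        twice_area += cross;
        sum_x += (a.x + b.x) * cross;
        sum_y += (a.y + b.y) * cross;
    }
    if twice_area == 0.0 {
        return None;
    }
    // C = sum / (6 * area) = sum / (3 * twice_area)
    Some(Coord {
        x: sum_x / (3.0 * twice_area),
        y: sum_y / (3.0 * twice_area),
    })
}
```

Every edge is visited once and contributes one shared determinant, so an implementation that walks the coordinates several times, or computes distances and per-triangle centroids along the way, does strictly more arithmetic per vertex.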
The following simplified perf report for geo indicates a more complex algorithm than any of the reports above:
- geo::algorithm::centroid::CentroidOperation<T>::add_ring - 54.68% (inlined) <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold - 31.77% (inlined) core::iter::adapters::map::map_fold::_$u7b$$u7b$closure$u7d$$u7d$::h852ff4143359a141 - 22.91% (inlined) geo::algorithm::centroid::CentroidOperation$LT$T$GT$::add_ring::_$u7b$$u7b$closure$u7d$$u7d$::hba507e9435c4b508 - 8.85% (inlined) <geo_types::coordinate::Coordinate<T> as core::ops::arith::Mul<T>>::mul - 8.85% (inlined) <geo_types::coordinate::Coordinate<T> as core::ops::arith::Add>::add 8.85% (inlined) geo_types::line_string::LineString$LT$T$GT$::lines::_$u7b$$u7b$closure$u7d$$u7d$::ha94da3e3994f7f24 22.91% (inlined) <core::slice::iter::Windows<T> as core::iter::traits::iterator::Iterator>::next - 45.31% geo::algorithm::area::get_linestring_area - 31.77% (inlined) <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next - 18.23% (inlined) core::option::Option<T>::map - 8.85% (inlined) <geo_types::line::Line<T> as geo::algorithm::map_coords::MapCoords<T,NT>>::map_coords (inlined) <f64 as core::ops::arith::Sub>::sub
Note how all significant operations within add_ring are inlined by the compiler. The following snippet from the disassembled machine code for add_ring shows that these inlined computations are also successfully vectorized (as before, the costliest instructions use packed double-precision floating-point operands):
geo::algorithm::centroid::CentroidOperation<T>::add_ring() Event: cpu-clock Percent [ ...SNIP... ] _ZN9geo_types11line_string19LineString$LT$T$GT$5lines28_$u7b$$u7b$closure$u7d$$u7d$17ha94da3e3994f7f24E(): /// assert!(; /// ``` pub fn lines(&'_ self) -> impl ExactSizeIterator + Iterator<Item = Line<T>> + '_ {|w| { // slice::windows(N) is guaranteed to yield a slice with exactly N elements unsafe { Line::new(*w.get_unchecked(0), *w.get_unchecked(1)) } 1b0: movapd %xmm3,%xmm4 16.67 movapd %xmm2,%xmm5 movupd (%rcx),%xmm3 _ZN45_$LT$f64$u20$as$u20$core..ops..arith..Sub$GT$3sub17h7b8b1be7ea3e7a7aE(): subpd %xmm1,%xmm4 movapd %xmm3,%xmm6 8.33 subpd %xmm1,%xmm6 _ZN45_$LT$f64$u20$as$u20$core..ops..arith..Mul$GT$3mul17h6535908bdf049e14E(): movapd %xmm6,%xmm2 shufpd $0x1,%xmm6,%xmm2 mulpd %xmm4,%xmm2 _ZN45_$LT$f64$u20$as$u20$core..ops..arith..Add$GT$3add17hc97f37a33d9f3bdcE(): 16.67 addpd %xmm4,%xmm6 _ZN45_$LT$f64$u20$as$u20$core..ops..arith..Mul$GT$3mul17h6535908bdf049e14E(): movapd %xmm2,%xmm4 unpckhpd %xmm2,%xmm4 subsd %xmm4,%xmm2 16.67 unpcklpd %xmm2,%xmm2 mulpd %xmm6,%xmm2 _ZN45_$LT$f64$u20$as$u20$core..ops..arith..Add$GT$3add17hc97f37a33d9f3bdcE(): addpd %xmm5,%xmm2 _ZN94_$LT$core..slice..iter..Windows$LT$T$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$4next17h9d7ae68615430372E(): 8.33 cmp $0x2,%rax ↑ jb 96 33.33 add $0x10,%rcx add $0xffffffffffffffff,%rax [ ...SNIP... ]
Contrast the perf report for geo with the following report for geos. The call graph is deeper (less inlining) and there are multiple instances of iteration and costly computations in geos::geom::Coordinate::distance, geos::algorithm::Centroid::centroid3 and geos::algorithm::Centroid::area2:
- <geos::geometry::Geometry as geos::geometry::Geom>::get_centroid - 99.79% geos::geom::Geometry::getCentroid - 99.60% geos::geom::Geometry::getCentroid - 99.30% geos::algorithm::Centroid::getCentroid - 99.10% geos::algorithm::Centroid::Centroid - 99.00% geos::algorithm::Centroid::add - 98.81% geos::algorithm::Centroid::add - 98.60% geos::algorithm::Centroid::add - 98.41% geos::algorithm::Centroid::addShell - 47.47% geos::algorithm::Centroid::addLineSegments - 17.73% geos::geom::CoordinateSequence::operator[] - 13.27% geos::geom::Coordinate::distance - 17.73% geos::algorithm::Centroid::addTriangle 5.74% geos::algorithm::Centroid::centroid3 2.87% geos::algorithm::Centroid::area2 - 15.75% geos::algorithm::Orientation::isCCW - 5.94% geos::geom::CoordinateSequence::getY - 8.22% std::unique_ptr<geos::geom::Coordinate, std::default_delete<geos::geom::Coordinate> >::operator* - 6.14% geos::geom::CoordinateSequence::operator[]
A look at the machine code for the particularly costly geos::geom::Coordinate::distance function shows that the compiler is unable to vectorize the computations. The costliest instructions are the non-vectorized double-precision floating-point instructions (movsd, addsd, …):
    geos::geom::Coordinate::distance() Event: cpu-clock
    Percent
    [ ...SNIP... ]
      9.64  movsd  %xmm0,-0x8(%rbp)
            double dy = y - p.y;
      3.61  mov    -0x18(%rbp),%rax
            movsd  0x8(%rax),%xmm0
            mov    -0x20(%rbp),%rax
      4.82  movsd  0x8(%rax),%xmm1
            subsd  %xmm1,%xmm0
      1.20  movsd  %xmm0,-0x10(%rbp)
            return std::sqrt(dx * dx + dy * dy);
      3.61  movsd  -0x8(%rbp),%xmm0
      4.82  movapd %xmm0,%xmm1
            mulsd  -0x8(%rbp),%xmm1
     19.28  movsd  -0x10(%rbp),%xmm0
            mulsd  -0x10(%rbp),%xmm0
     10.84  addsd  %xmm1,%xmm0
     10.84  → callq sqrt@plt
    [ ...SNIP... ]
The three operations I benchmarked have increasingly complex algorithms, and an increasing gap in performance between geo and geos. Analysis shows that the root cause of the performance gap is compiler optimization - function inlining, vectorization and skipped memory allocations - and the more complex algorithms amplify the effects of these optimizations. The answer to my question of Rust vs C++ performance might come down to the ability of each language, and its ecosystem, to nudge developers into writing programs that are more efficient and easier to optimize for the compiler. Rust’s explicit memory lifetimes and traits-based polymorphism (instead of OO-style classes) are foundational innovations that might explain the edge in performance that geo has over geos.
Reproducing these results
The harness and Jupyter notebooks used in this analysis are available on GitHub. The geo crate used was version 0.18.0. The geos crate used was version 8.0.3 with libgeos version 3.10.1.
geo, geos and libgeos were all compiled from source. geo and geos were compiled with rustc 1.56.1 using cargo’s release profile. libgeos was compiled with gcc 8.3.0 using cmake’s RelWithDebInfo build type.
The benchmarks were run on a Google Cloud Platform e2-medium VM instance, often repeatedly over the course of many days. I expect that a variety of physical nodes were used for the benchmark runs and that there were uncontrolled effects from throttling and resource sharing. The operating system was a GCP-optimized variant of Debian 4.19 Linux.
A note on compiler optimization settings
Compiler optimization settings have a large impact on performance. I compiled all Rust code for this post with the highest optimization level available in rustc (opt-level 3, written as -O3 in this post), but compiled the C++ code for libgeos with gcc’s optimization flag set to -O2 instead of the highest available -O3. This difference is the result of the default compiler flags used by the relevant tooling. cargo’s release profile enables the highest level of optimization, but cmake’s RelWithDebInfo setting only enables -O2.
I could have set CMAKE_BUILD_TYPE to Release, which enables the -O3 level of optimization, but it also disables symbol table generation needed for debugging and profiling tools (in particular for my use of perf). I could enable debugging symbols explicitly with Release (by setting gcc’s -g flag), but gcc’s documentation notes that call graphs may not be accurate with this level of optimization. Also, gcc’s documentation does not recommend using the -O3 optimization level by default as it can sometimes have the opposite effect of slowing programs down.
For completeness, I compared the operations once again with gcc’s optimization level set to -O3. The following table shows the average (and standard deviation) of the ratio of observed QPS for geo compared to geos for each of the three operations when libgeos is compiled with -O2 and with -O3.
| Operation | geo -O3 / geos -O2 | geo -O3 / geos -O3 |
| --- | --- | --- |
| Minimum Bounding Rectangle | 1.45 (std 0.21) | 0.09 (std 0.004) |
| Area | 8.13 (std 0.10) | 2.27 (std 0.02) |
| Centroid | 47.61 (std 1.39) | 11.70 (std 0.27) |
-O3 uniformly improves the performance of geos. In the case of minimum bounding rectangle, geos even performs better than geo. But profiling with perf shows that my root cause analyses still hold - the effect of function inlining is limited and memory allocations are not entirely removed, although gcc finds more opportunities to vectorize loops. Thus, the analysis in this post is still (mostly) applicable with the increased optimization level, though it is harder to follow due to missing symbol information.
[1] A stark example of measurement drift is CPU throttling – I found the measured runtime frequently jumps by ~30% after the benchmark has been running for about a minute. I also found that this drift does not affect runtime ratios (and hence the reported metrics) significantly. In my analyses, I remove the observations from the first few iterations to (mostly) avoid this jump.
[2] My methodology does not account for the possibility of systematic error. In the study I linked earlier, the authors found a large performance difference stemming from differences in cache behavior of the list data type in Java and C. This difference was dependent on JVM configuration and CPU cache sizes. Such errors aren’t easy to anticipate (and hence correct) even in methodologies that seek to control the execution environment.
[3] I chose to use districts of India as the inputs primarily because the TIGER/Line Shapefiles by no means have a monopoly on experimental design.
[4] Recorded with some variation of the command perf record -F 300 -g --call-graph dwarf and reported with some variation of perf report -g.
[5] In an earlier post on this blog, I discuss some numerical stability concerns in computing area of a polygon using the shoelace formula.