Bug Hunting (in Fedora's KiCad)
I got a little lost, debugging KiCad!
This is a bit of a departure from the usual content here. I spent more time than I should have piecing all of this together, so i’m collecting all of the knowledge that I needed to solve it.
I’ve also tried to present it intuitively, so hopefully this helps someone equally lost!
Time Capsule
If you’re reading this several years down the line, there are probably better ways of reporting bugs. As of this post (May 2018), Fedora 28 is the latest.
Getting Some Work Done
On May 8th (after updating to Fedora 28), I loaded a copy of a simple board (the one pictured is the V-USB based controller for a volume controller project) board into Fedora 28’s KiCad.
I completely expected the schematic to blow up and was ready to fix it. The board survived intact, but DRC didn’t work and mysteriously crashed KiCad every time.
Design Rules Check (DRC)
When a board designer finishes a design, they typically run what’s called a Design Rules Check (DRC). This ensures that a Printed Circuit Board (PCB) is within the manufacturing tolerances for a particular PCB manufacturer/fabrication house.
It’s quite a critical function, as PCBs can be densely packed with components and traces; falling back to manual verification isn’t a thing (unless said designer is trying to bend spoons) – it’s difficult, tedious and error-prone enough designing a board.
A normal DRC run looks like this:
(N.B. The errors are intentional abuse of plated through-hole in surface mount pad)
On the other hand, Fedora 28’s (KiCad) DRC run just closed everything suddenly. No messages, no popups. Oops.
Where some intuition about the system helps
On Linux, running GUI programs from the terminal to figure out what’s wrong is always a good bet:
$ kicad
# <--- Ed note: opened the circuit board file, and ran DRC --->
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp,
_Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp,
_Alloc>::size_type) [with _Tp = D_PAD*; _Alloc = std::allocator<D_PAD*>;
std::vector<_Tp, _Alloc>::reference = D_PAD*&; std::vector<_Tp,
_Alloc>::size_type = long unsigned int]: Assertion '__builtin_expect(__n <
this->size(), true)' failed.
Aborted (core dumped)
With very little understanding of the program’s construction, that is a confusing error message. The last part is somewhat helpful, since it says Aborted (core dumped)
.
Seeing that Assertion failed
along with a std::vector
is kind of useful too; a possible translation is: “This is potentially unintended/unsafe behavior on an std::vector. Stop.”
Not knowing anything else, this would be a somewhat decent message to hand to a developer.
Down The Core Rabbit Hole
core dumps save the state of the program when it crashes. With the right incantations (installing debugging versions of our programs), the path which lead to the crash becomes clearer.
But first, we have to be able to catch cores!
(This is an annotated version of this ask.fedora)
On some Linux distributions, core files are limited by default to preserve space on disk. Check the ulimit -c
setting to see what the limit is:
# This is usually the default setting, but check just in case
$ ulimit -c
unlimited
# Set it to unlimited if it doesn't report unlimited above
$ ulimit -c unlimited
And another command will tell us where core files land:
$ sysctl kernel.core_pattern
sysctl kernel.core_pattern = |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e
Normally, core files just end up as files in an obscure part of the filesystem. However, systemd-coredump
is catching all the cores for us, and shuffling it all away behind a friendly utility called coredumpctl
. Calling it with the list kicad
arguments, there appears to be a core file for KiCad.
$ coredumpctl list kicad
TIME PID UID GID SIG COREFILE EXE
[...] 17474 1000 1000 6 present /usr/bin/KiCad
Walking the core
Running gdb against the core is a first step. gdb can look pretty complex, but we just need a look at the backtrace.
$ coredumpctl gdb kicad
[ .. a ton of information scrolls by .. ]
Missing separate debuginfos, use: dnf debuginfo-install kicad-4.0.7-3.fc28.x86_64
(gdb) backtrace
#0 0x00007fad7b6cdf2b in raise () from /lib64/libc.so.6
#1 0x00007fad7b6b8561 in abort () from /lib64/libc.so.6
#2 0x00007fad574135e8 in std::__replacement_assert(char const*, int, char const*, char const*) () from /usr/bin/_pcbnew.kiface
#3 0x00007fad57623487 in DRC::testPad2Pad() () from /usr/bin/_pcbnew.kiface
#4 0x00007fad57623a5e in DRC::RunTests(wxTextCtrl*) () from /usr/bin/_pcbnew.kiface
[.. removed for brevity ..]
The blob of text is a bit neater this time, but it would be nice if we could see more relevant text. The prompt for Missing separate debuginfos, use: dnf debuginfo-install kicad-4.0.7-3.fc28.x86_64
is pretty useful here!
(gdb) quit
$ sudo dnf -y debuginfo-install kicad-4.0.7-3.fc28.x86_64
[.. installation messages ..]
$ coredumpctl gdb kicad
(gdb) backtrace
#0 0x00007fad7b6cdf2b in raise () from /lib64/libc.so.6
#1 0x00007fad7b6b8561 in abort () from /lib64/libc.so.6
#2 0x00007fad574135e8 in std::__replacement_assert (__file=__file@entry=0x7fad57adcee0 "/usr/include/c++/8/bits/stl_vector.h", __line=__line@entry=932,
__function=__function@entry=0x7fad57af6c60 <_ZZNSt6vectorIP5D_PADSaIS1_EEixEmE19__PRETTY_FUNCTION__> "std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = D_PAD*; _Alloc = std::allocator<D_PAD*>; std::vector<_Tp, _Alloc>::reference ="..., __condition=__condition@entry=0x7fad57adceb0 "__builtin_expect(__n < this->size(), true)")
at /usr/include/c++/8/x86_64-redhat-linux/bits/c++config.h:2389
#3 0x00007fad57623487 in std::vector<D_PAD*, std::allocator<D_PAD*> >::operator[] (__n=<optimized out>, this=0x7fff55b32170) at /usr/src/debug/kicad-4.0.7-3.fc28.x86_64/pcbnew/drc.cpp:456
This looks like the message we got when the application exited on the command-line, but #3 gives us a hint on where.
Now we have a better idea of where the error occurs, based on this line: /usr/src/debug/kicad-4.0.7-3.fc28.x86_64/pcbnew/drc.cpp:456
. This gets us close, but at this point, its better to get familiar with the code, or to connect with a developer.
To be even more helpful, we can think about what might have changed.
What we know so far
Fedora 27’s KiCad worked
DRC is an important function and it would be unlikely that it shipped broken (can validate by building KiCad standalone)
An assertation indicates unusual internal behavior, and seems to indicate bounding/quantitative errors?
The type it failed on was an std::vector
What might’ve changed when the upgrade to Fedora 28 took place?
Some intuition about the distribution
For most of Fedora’s recent releases, they’ve included a very helpful changeset/log. Searching the wiki should bring up any interesting changes (in this case: were there any changes that affected vectors?).
As of Fedora 28, there were indeed a set of new hardening flags.
Following the tracking bug link over to another wiki page, we land here.
Reading down the bullet points, -D_GLIBCXX_ASSERTIONS
pops out. Our crash lands on a vector! And the assert complains about a quantity comparison (was it less than?) Does it fit our crash? Maybe.
This is a good guess, but it is still a swing in the dark without verification.
Heading straight to the source
KiCad’s source code is online and open source. This means we can look up the line which caused the crash.
Remember that the version of KiCad where this occurs is 4.0.7-3 – the KiCad source is mirrored here. If we click around a bit into the releases, there’s a release tag named 4.0.7; clicking the string of letters lets us dig into the repository at that point in time: 9f35e38.
This is the latest commit though, which isn’t relevant to our bug. We need to see the drc.cpp
file – so clicking browse files gets us to the tree. We can then head to pcbnew/drc.cpp and scroll down to line 456
Here’s the version that caused the crash (with -D_GLIBCXX_ASSERTIONS set)
D_PAD** listEnd = &sortedPads[ sortedPads.size() ];
Bug verification
Verifying this (aka, setting up an entire build system) is a bit much for this post. (I did it to increase my confidence that this flag was the root cause.)
Building one to test anyway? Don’t make an entire containerized build system from scratch like I did. Look for a tool that makes patching/recompiling repeatable as you’ll most likely have to test any changes that a developer asks for.
For Fedora, pull this thread, and this thread as well.
How and where would I report this?
It’s ideal that everyone works together to fix a bug, but time is a precious commodity. This is especially true in open source work where unpaid labor is common.
This particular bug started off being filed in Fedora’s Bugzilla, but was eventually fixed on KiCad’s launchpad instead.
Fedora’s Bugzilla is located here. Here’s where this bug ended up. There were extra details included to aid in building an up-to-date version.
KiCad’s bug tracker is on a different service called launchpad. Searching was/is difficult, but eventually in this case, kismet; another fellow Fedora 28 user was experiencing the same crash.
The fix landed in KiCad’s repositories, which were then built for Fedora under KiCad’s nightly releases.
A workaround for Fedora users
If you’re reading this in the future (and stuck with this particular crash), consider switching over to the KiCad nightlies which has an upstream fix for this bug.
The upstream fix
Here is the fixed, upstream version:
D_PAD** listEnd = &sortedPads[0] + sortedPads.size();
This change avoids the assertation failure since the range is aggressively checked inside of the subscript &sortedPads[ sortedPads.size() ];
. Instead, the starting address of sortedPads is added together with the size of sortedPads to get a similar result.
Conclusion
Hopefully this shows that much of the process is intuition about the system at hand! Yes there are bugs that bend the mind and require specific areas of expertise; but often, only a little bit of intuition is required.
And don’t be afraid to ask for help if you are truly lost.