#NoProjects: projects are not needed for software development. Prioritize minimal marketable features (MMF) based on multi-dimensional analysis (value, risk, and cost of delay) instead.
I was tempted to share why I think that projects in software development kill value rather than bring value, but before posting I did my homework and found that somebody had written about it before.
Prioritization should be about selecting the minimal marketable features (MMF) that bring the biggest value, minimize the cost of delay, and/or reduce the most risk (what I call multi-dimensional analysis). It should be done continuously, just as software should be shipped continuously. Accepting a big effort as more profitable than focusing on MMF delivery is a mistake: the devil is *always* in the details.
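The multi-dimensional analysis above can be sketched mechanically. Below is a toy example (the backlog, the scores, and the simple additive scoring formula are all invented for illustration) that ranks candidate MMFs by value, risk reduction and cost of delay relative to effort:

```shell
# Hypothetical backlog: feature,value,risk_reduction,cost_of_delay,effort
# (scores on a 1-10 scale; every number here is made up for this example)
cat > backlog.csv <<'EOF'
checkout-flow,8,3,9,5
sso-login,6,7,4,3
audit-log,4,9,2,2
dark-mode,3,1,1,1
EOF

# Rank by a simple score: (value + risk_reduction + cost_of_delay) / effort
awk -F, '{ printf "%.2f %s\n", ($2+$3+$4)/$5, $1 }' backlog.csv | sort -rn
```

A real analysis would weight the dimensions and revisit the scores continuously, but even a crude ranking like this beats committing to one big project up front.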
Monday, June 13, 2016
Saturday, June 04, 2016
AH00051: child pid ? exit signal Segmentation fault (11), possible coredump in ?
I ran into the following segmentation fault in Apache, accompanied by high CPU usage on the server:
[Fri Jun 03 06:00:01.270303 2016] [core:notice] [pid 29628:tid 140412199950208] AH00051: child pid 2343 exit signal Segmentation fault (11), possible coredump in /tmp/apache-coredumps

After debugging the problem I concluded that mod_proxy was the culprit, apparently because of a recursion bug. I decided not to report it because it was hard to replicate (it happened only late at night) and there was a workaround I could use. The error happened on the first request for a resource that the proxied worker could not serve: Apache tried to use a specific document for the error, but a configuration issue made it pull that document from the worker as well. Since the document did not exist on the remote server either, mod_proxy entered an infinite loop, resulting in an Apache crash. Clearly, resolving the configuration issue so that the document is pulled from the local Apache hides this potential mod_proxy bug.
[Fri Jun 03 05:56:40.831871 2016] [proxy:error] [pid 2343:tid 140411834115840] (502)Unknown error 502: [client 192.168.0.180:58819] AH01084: pass request body failed to 172.16.2.2:8443 (sample.com)
[Fri Jun 03 05:56:40.831946 2016] [proxy:error] [pid 2343:tid 140411834115840] [client 192.168.0.180:58819] AH00898: Error during SSL Handshake with remote server returned by /sdk
[Fri Jun 03 05:56:40.831953 2016] [proxy_http:error] [pid 2343:tid 140411834115840] [client 192.168.0.180:58819] AH01097: pass request body failed to 172.16.2.2:8443 (sample.com) from 192.168.0.180 ()
[Fri Jun 03 05:56:40.844138 2016] [proxy:error] [pid 2343:tid 140411834115840] (502)Unknown error 502: [client 192.168.0.180:58819] AH01084: pass request body failed to 172.16.2.2:8443 (sample.com)
[Fri Jun 03 05:56:40.844177 2016] [proxy:error] [pid 2343:tid 140411834115840] [client 192.168.0.180:58819] AH00898: Error during SSL Handshake with remote server returned by /html/error/503.html
[Fri Jun 03 05:56:40.844185 2016] [proxy_http:error] [pid 2343:tid 140411834115840] [client 192.168.0.180:58819] AH01097: pass request body failed to 172.16.2.2:8443 (sample.com) from 192.168.0.180 ()
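The workaround mentioned above is to make sure the error document is served by the local Apache rather than pulled through the proxy. A sketch of such a configuration follows, based on the paths and backend address visible in the logs; treat the exact directives and layout as assumptions about this setup, not the actual configuration that was used:

```apache
# Serve the error page from the local Apache document root.
ErrorDocument 503 /html/error/503.html

# Exclude the error directory from proxying (the `!` target) so a request
# for the error page never goes back to the unreachable backend.
# Exclusions must appear before the general ProxyPass rule.
ProxyPass /html/error/ !
ProxyPass / https://172.16.2.2:8443/
ProxyPassReverse / https://172.16.2.2:8443/
```

With the exclusion in place, a backend failure renders the local 503 page instead of recursing into another proxied request.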
Why do I get a Linux SIGSEGV / segfault / Segmentation fault ?
To answer this question you will need to get a coredump and use the GNU debugger (gdb).
The application generating the segmentation fault must be configured to produce a coredump or should be manually run with gdb in case a segmentation fault can be manually replicated.
Let us pick Apache to go through the debugging steps with a real world example. In order to produce a coredump apache2 needs to be configured:
$ vi /etc/apache2/apache2.conf
...
CoreDumpDirectory /tmp/apache-coredumps
...

The OS must also not impose limits on the size of the core file, since we don't know in advance how big it will be:
$ sudo bash
# ulimit -c unlimited

In addition, the core dump directory must exist and must be owned by the Apache user (www-data in this case):
$ sudo mkdir /tmp/apache-coredumps
$ sudo chown www-data:www-data /tmp/apache-coredumps

Make sure to stop and start Apache separately, from the shell where the ulimit was raised, so the processes inherit the new limit:
$ sudo apachectl stop
$ sudo apachectl start

Look into the running processes:
$ ps -ef | grep apache2

Confirm that the processes are running with "Max core file size" unlimited:
$ cat /proc/${pid}/limits
...
Max core file size        unlimited            unlimited            bytes
...

Here is a quick way to do it with a one-liner. It lists the max core file size for the parent apache2 process and all its children:
$ ps -ef | grep 'sbin/apache2' | grep -v grep | awk '{print $2}' | while read -r pid; do cat /proc/$pid/limits | grep core; done

To test that your configuration works, just force a coredump. This can be achieved by sending a SIGABRT signal to the process:
$ kill -SIGABRT ${pid}

Analyze the coredump file with gdb. In this case it confirms that the SIGABRT signal was used:
$ gdb core /tmp/apache-coredumps/core
...
Core was generated by `/usr/sbin/apache2 -k start'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f1bc74073bd in ?? ()
...

Leave Apache running, and when it fails with a segmentation fault you can confirm the reason:
$ gdb core /tmp/apache-coredumps/core
...
Core was generated by `/usr/sbin/apache2 -k start'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fc4d7ef6a11 in ?? ()
...

Then dig deeper to find out what the problem really is. Note that we now pass the apache2 binary to gdb to get richer information from the coredump:
$ gdb /usr/sbin/apache2 /tmp/apache-coredumps/core
...
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fc4d7ef6a11 in apr_brigade_cleanup (data=0x7fc4d8760e68) at /build/buildd/apr-util-1.5.3/buckets/apr_brigade.c:44
44      /build/buildd/apr-util-1.5.3/buckets/apr_brigade.c: No such file or directory.

Using the bt command from the gdb prompt gives us more:
Core was generated by `/usr/sbin/apache2 -k start'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fc4d7ef6a11 in apr_brigade_cleanup (data=0x7fc4d8760e68) at /build/buildd/apr-util-1.5.3/buckets/apr_brigade.c:44
44      /build/buildd/apr-util-1.5.3/buckets/apr_brigade.c: No such file or directory.
(gdb) bt
#0  0x00007fc4d7ef6a11 in apr_brigade_cleanup (data=0x7fc4d8760e68) at /build/buildd/apr-util-1.5.3/buckets/apr_brigade.c:44
#1  0x00007fc4d7cd79ce in run_cleanups (cref=<optimized out>) at /build/buildd/apr-1.5.0/memory/unix/apr_pools.c:2352
#2  apr_pool_destroy (pool=0x7fc4d875f028) at /build/buildd/apr-1.5.0/memory/unix/apr_pools.c:814
#3  0x00007fc4d357de00 in ssl_io_filter_output (f=0x7fc4d876a8e0, bb=0x7fc4d8767ab8) at ssl_engine_io.c:1659
#4  0x00007fc4d357abaa in ssl_io_filter_coalesce (f=0x7fc4d876a8b8, bb=0x7fc4d8767ab8) at ssl_engine_io.c:1558
#5  0x00007fc4d87f1d2d in ap_process_request_after_handler (r=r@entry=0x7fc4d875f0a0) at http_request.c:256
#6  0x00007fc4d87f262a in ap_process_async_request (r=r@entry=0x7fc4d875f0a0) at http_request.c:353
#7  0x00007fc4d87ef500 in ap_process_http_async_connection (c=0x7fc4d876a330) at http_core.c:143
#8  ap_process_http_connection (c=0x7fc4d876a330) at http_core.c:228
#9  0x00007fc4d87e6220 in ap_run_process_connection (c=0x7fc4d876a330) at connection.c:41
#10 0x00007fc4d47fe81b in process_socket (my_thread_num=19, my_child_num=<optimized out>, cs=0x7fc4d876a2b8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:970
#11 worker_thread (thd=<optimized out>, dummy=<optimized out>) at event.c:1815
#12 0x00007fc4d7aa7184 in start_thread (arg=0x7fc4c3fef700) at pthread_create.c:312
#13 0x00007fc4d77d437d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

I was misled by the messages above until I started changing Apache configurations and re-examining the generated coredumps. I realized Apache would fail at any line of any source file, complaining about "No such file or directory", a clear consequence of its inability to access resources that did exist:
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/apache2 -k start'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fb44328d992 in ap_save_brigade (f=f@entry=0x7fb4432e5148, saveto=saveto@entry=0x7fb4432e59c8, b=b@entry=0x7fb435ffa568, p=0x7fb443229028) at util_filter.c:648
648     util_filter.c: No such file or directory.

By disabling modules one at a time I was able to get to the bottom of it: an apparent bug in mod_proxy, triggered when a misconfiguration causes recursive pulling of unavailable resources from the proxied target. Once the investigation is over, comment out or remove the coredump configuration from Apache:
$ sudo vi /etc/apache2/apache2.conf
...
# CoreDumpDirectory /tmp/apache-coredumps
...
$ sudo apachectl graceful
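As a side note, stopping and starting Apache is needed because the ulimit applies only to processes started from that shell. Assuming util-linux's prlimit is available, the core file size limit of an already running process can be inspected, and raised in place by root, without a restart (${pid} is the target process id):

```shell
# Show the current soft/hard core file size limits of a running process.
prlimit --pid ${pid} --core

# Raise them in place (requires root); new coredumps honor the new limit.
sudo prlimit --pid ${pid} --core=unlimited
```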
Increasing swap size in Ubuntu Linux
Here is how to depart with parted (no pun intended) from a swap partition and move to a swap file in Ubuntu Linux. When you need more swap, it is far easier to increase the size of a swap file than to resize a partition. Starting with the 2.6 Linux kernel, swap files are just as fast as swap partitions.
Remove the current swap partition:
$ sudo parted
...
(parted) print
...
Number  Start   End     Size    Type      File system     Flags
 1      1049kB  40.8GB  40.8GB  primary   ext4            boot
 2      40.8GB  42.9GB  2145MB  extended
 5      40.8GB  42.9GB  2145MB  logical   linux-swap(v1)
...
(parted) rm 2
(parted) print
...
Number  Start   End     Size    Type      File system     Flags
 1      1049kB  40.8GB  40.8GB  primary   ext4            boot
...

Delete the corresponding entry from /etc/fstab:
# swap was on /dev/sda5 during installation
# UUID=544f0d91-d3db-4301-8d1b-f6bfb2fdee5b none            swap    sw              0       0

Disable all swapping devices:
$ sudo swapoff -a

Create a swap file with the correct permissions; for example, the below creates a 4GB one:
$ sudo fallocate -l 4G /swapfile
$ sudo chmod 600 /swapfile

Set up the Linux swap area:
$ sudo mkswap /swapfile

Enable the swap device:
$ sudo swapon /swapfile

Confirm that it was created:
$ sudo swapon -s
...
Filename    Type    Size     Used  Priority
/swapfile   file    4194300  0     -1
...

Add the entry to /etc/fstab:
/swapfile none swap sw 0 0

Restart:
$ sudo reboot now

Confirm that the swap is active:
$ free
             total       used       free     shared    buffers     cached
Mem:       4048220    1651032    2397188        536      66648     336164
-/+ buffers/cache:    1248220    2800000
Swap:      4194300          0    4194300
$ sudo swapon -s
...
Filename    Type    Size     Used  Priority
/swapfile   file    4194300  0     -1
...