[HN Gopher] My First Kernel Module: A Debugging Nightmare
       ___________________________________________________________________
        
       My First Kernel Module: A Debugging Nightmare
        
       Author : ksml
       Score  : 62 points
       Date   : 2020-11-19 19:30 UTC (3 hours ago)
        
 (HTM) web link (reberhardt.com)
 (TXT) w3m dump (reberhardt.com)
        
       | Taniwha wrote:
       | So a story: I've been a kernel hack since Unix V6, made a living
       | doing it one way or another for over half my life ... learning to
       | think about concurrency, time, interrupts, race conditions etc is
       | hard, very hard - I got pretty good at it ... but then my career
       | took a diversion, I designed chips for a decade or so, everything
       | is concurrency, at the lowest levels .... after a while I came
       | back to doing kernel stuff and found that with this new
       | background all that hard stuff was trivial and obvious.
       | 
       | Mostly you just have to steep your brain in it for long enough.
        
         | ksml wrote:
         | Concurrency is still hard for me, but I do find it getting much
         | easier over the years :) thanks for the story!
        
       | sweettea wrote:
       | You probably already did this, but for the audience: one of the
       | best ways to make sure you're using a function reasonably is to
       | use elixir.bootlin.com to look at other uses and make sure you're
       | using the function similarly. For instance, check out
       | https://elixir.bootlin.com/linux/latest/A/ident/for_each_pro... .
        
         | ksml wrote:
         | Elixir was extremely helpful to me! It didn't always help me
         | understand _why_ code was written the way it was (hence my
         | incorrect use of rcu_read_lock), but it was very helpful to see
         | some examples.
        
       | ksml wrote:
       | Hi HN, this was my first attempt at writing any sort of kernel
       | code. I would love to hear your thoughts on this experience and
       | on the fixes I applied, especially from anyone with more Linux
       | experience than me :)
        
         | ylyn wrote:
         | Seems like someone did try to get those functions exported, but
         | the maintainer rejected it, saying that no driver should be
         | poking so deep into fd internals. Makes sense. Your use case is
         | kind of niche.
         | 
         | https://lore.kernel.org/lkml/20180730163256.GC27761@infradea...
         | 
         | By the way, C Playground is really helpful for teaching an OS
         | course!
        
           | ksml wrote:
           | That is really interesting and good to know -- thanks for
           | that!
           | 
           | I hope C Playground is helpful, and I'm building it with
           | teaching in mind. If you teach anywhere and could find it
           | useful, let me know!
        
         | ylyn wrote:
         | Here's a hack you could use to get around the functions not
         | being exported:
         | https://github.com/anbox/anbox-modules/blob/master/binder/de...
        
           | ksml wrote:
           | Oh, that's clever! I might try that. I really don't feel
           | comfortable building my own kernel.
        
         | warybeary wrote:
         | Have you looked into using eBPF instead of writing a kernel
         | module?
         | 
         | http://ebpf.io for some more insights.
         | 
         | At the very least, it'll provide some useful tooling for you to
         | debug problems in kernel-space.
        
           | ksml wrote:
           | I hadn't considered this! Can eBPF be used to access
           | arbitrary kernel data structures, though?
        
             | warybeary wrote:
             | Yes (to a degree) :)
             | 
             | Check out https://github.com/iovisor/bpftrace and the
             | example tools/ for a taste. You'll likely want to play with
             | kprobes/kretprobes.
        
               | ksml wrote:
               | This is really interesting; I hadn't realized it was so
               | capable/general. I'll look into this. Thanks for the
               | references!
        
       | nosefrog wrote:
       | Great story! I've had a lot of debugging nightmares, but
       | thankfully never anything as bad as that.
       | 
       | One thing that looks fishy is this branch:
       | 
       |     if (container_tasks_len == max_container_tasks) {
       |         printk("cplayground: ERROR: container_tasks list hit capacity! We "
       |                "may be missing processes from the procfile output.\n");
       |         break;
       |     }
       | 
       | Since you said printk can block, why isn't calling it in the rcu
       | critical section a bug? Is it because you immediately break
       | afterwards and don't try to reference the next task?
        
         | ksml wrote:
         | That's a good point. I'm hoping that this never gets hit, and
         | if that line ever appears in the logs, then things are already
         | broken. However, it's probably better to improve the failure
         | mode where possible :) [edit] and yes, since we break and don't
         | follow the `next` pointer in the linked list, that also
         | shouldn't cause any problems.
        
       | devit wrote:
       | You can do most or all of that by reading /proc/<pid>/fdinfo/<fd>
       | and /proc/<pid>/fd/<fd> or by making system calls on the affected
       | fds (which you can do e.g. by injecting code with LD_PRELOAD or
       | ptrace or with nsenter with fd namespace or equivalent C code).
       | 
       | Even if you write a kernel driver, iterating over all tasks in
       | the system is a terrible design (there may be millions), not to
       | mention "determining if a task belongs to a C playground program"
       | in the kernel (obviously the kernel should have no knowledge
       | about such specifics).
       | 
       | Of course, if a developer cannot even produce a reasonable
       | overall design, it's not surprising that they aren't capable of
       | writing correct code.
        
         | nosefrog wrote:
         | "Be kind. Don't be snarky. Have curious conversation; don't
         | cross-examine. Please don't fulminate. Please don't sneer,
         | including at the rest of the community."
         | 
         | https://news.ycombinator.com/newsguidelines.html
        
         | ksml wrote:
         | I actually cannot get enough information from doing that.
         | Crucially, I need to be able to recognize whether two file
         | descriptors point to the same open `struct file`. (To be clear,
         | this isn't the same as whether they're pointing to the same
         | file path. I need to know when the two file descriptors are
         | sharing the same cursor.) There is no way to do this using
         | existing APIs, because there is nothing identifying a `struct
         | file` besides the memory address of the struct. (The "open file
         | IDs" I mention are hashes of the `struct file` address.)
         | 
         | I did spend a lot of time trying to avoid writing a kernel
         | module, and this was the only way I could find to do it :)
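For readers following along, here is a minimal userspace sketch of what the /proc/<pid>/fdinfo/<fd> route suggested earlier in this subthread exposes, and why it falls short for the use case described here. It assumes a Linux system with /proc mounted; the function names `fdinfo_pos` and `demo_fdinfo_ambiguity` are made up for this illustration and do not come from the article.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Parse the "pos:" field of /proc/<pid>/fdinfo/<fd>.
 * Returns the cursor position, or -1 if the file cannot be read. */
long long fdinfo_pos(pid_t pid, int fd)
{
    char path[64], line[256];
    long long pos = -1;

    assert(fd >= 0); /* a negative fd is a caller bug */
    snprintf(path, sizeof path, "/proc/%ld/fdinfo/%d", (long)pid, fd);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        /* The line looks like "pos:<TAB>1234"; a space in the
         * format string matches any run of whitespace. */
        if (sscanf(line, "pos: %lld", &pos) == 1)
            break;
    }
    fclose(f);
    return pos;
}

/* Hypothetical demo: a dup()ed descriptor and an unrelated open() of the
 * same file both report the same "pos" once they sit at the same offset.
 * Returns 1 if both report offset 7, 0 if not, -1 on setup failure. */
int demo_fdinfo_ambiguity(void)
{
    int fd = open("/proc/self/exe", O_RDONLY);
    int copy = (fd >= 0) ? dup(fd) : -1;          /* shares fd's cursor */
    int other = open("/proc/self/exe", O_RDONLY); /* has its own cursor */
    int result = -1;

    if (fd >= 0 && copy >= 0 && other >= 0) {
        lseek(fd, 7, SEEK_SET);    /* also moves copy's cursor */
        lseek(other, 7, SEEK_SET); /* moved independently */
        result = fdinfo_pos(getpid(), copy) == 7 &&
                 fdinfo_pos(getpid(), other) == 7;
    }
    if (other >= 0) close(other);
    if (copy >= 0) close(copy);
    if (fd >= 0) close(fd);
    return result;
}
```

Both descriptors report the same position here even though only one of them shares a cursor with `fd`, which is the limitation described in the comment above: matching fdinfo output is consistent with a shared open file but does not prove it.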
        
           | devit wrote:
           | You can use the kcmp system call with KCMP_FILE argument to
           | find out if two fds point to the same `struct file` (of
           | course you must use this as the custom comparison function of
           | a sort algorithm so you don't end up with quadratic run
           | time).
           | 
           | Linux has a project called CRIU that can save and restore
           | processes to disk without needing additional kernel modules,
           | so pretty much all state is already gettable and settable
           | from user space.
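The kcmp approach described above can be sketched roughly as follows. This is an illustration under assumptions, not code from the article: it assumes a kernel built with kcmp support, calls the system call through syscall(2), and uses the made-up helper names `same_open_file` and `demo_across_fork`. Because kcmp takes two PIDs, the same call can compare descriptors that live in different processes, subject to the usual ptrace-style read permission checks on both of them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/kcmp.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if fd1 in pid1 and fd2 in pid2 refer to the same open file
 * (and so share a cursor), 0 if they do not, -1 on error with errno set. */
int same_open_file(pid_t pid1, int fd1, pid_t pid2, int fd2)
{
    assert(fd1 >= 0 && fd2 >= 0); /* negative fds are a caller bug */
    long r = syscall(SYS_kcmp, pid1, pid2, KCMP_FILE,
                     (unsigned long)fd1, (unsigned long)fd2);
    if (r < 0)
        return -1;
    return r == 0; /* 0 means equal; 1, 2 and 3 all mean "not equal" */
}

/* Hypothetical demo: a forked child inherits the parent's descriptors.
 * Returns 1 if kcmp reports the inherited fd as shared and a separately
 * opened fd as distinct, 0 if not, -1 on error (errno preserved). */
int demo_across_fork(void)
{
    int fd = open("/dev/null", O_RDONLY);
    int other = open("/dev/null", O_RDONLY); /* a second, separate open */
    int gate[2];

    if (fd < 0 || other < 0 || pipe(gate) < 0)
        return -1;

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        char c;
        close(gate[1]);
        /* Block until the parent closes its write end, then exit. */
        if (read(gate[0], &c, 1) < 0)
            _exit(1);
        _exit(0);
    }

    close(gate[0]);
    int shared = same_open_file(getpid(), fd, child, fd);
    int distinct = same_open_file(getpid(), fd, child, other);
    int saved_errno = errno;

    close(gate[1]); /* releases the child */
    waitpid(child, NULL, 0);
    close(fd);
    close(other);

    errno = saved_errno;
    if (shared < 0 || distinct < 0)
        return -1;
    return shared == 1 && distinct == 0;
}
```

For unequal descriptors, kcmp returns 1 or 2 to give a consistent ordering (or 3 when no ordering is available), which is what makes it usable as the comparison function of a sort, as suggested above. In sandboxes that filter the system call, the helper returns -1 with errno set to EPERM or ENOSYS.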
        
             | ksml wrote:
             | I can't do that across processes, though, can I? (to see
             | whether two processes have file descriptors pointing to the
             | same open file)
             | 
             | I hadn't heard of CRIU. I'll check that out. (edit: CRIU
             | looks super useful. I think the speed/overhead of
             | snapshotting will decide whether I can use it for this
             | project, but I can imagine it being handy in the future
             | regardless. Thanks for the link.)
        
               | dilyevsky wrote:
               | I recommend checking out podman (or docker) - they have
               | built-in criu support. Otherwise you'll need some other
               | namespacing mechanism to avoid colliding pids.
        
       | lallysingh wrote:
       | eBPF is honestly the first thing to try _before_ writing a
       | module.
       | 
       | I'm glad to see you used a VM. That's the first step in the right
       | direction. Others have mentioned that you should've used
       | printk(), which is true.
       | 
       | I'll mention that you can also run the kernel in a debugger:
       | https://www.kernel.org/doc/html/latest/dev-tools/gdb-kernel-...
        
         | ksml wrote:
         | I hadn't considered eBPF because I needed some pretty obscure
         | information from the kernel internals (i.e. the addresses of
         | the `struct file`s) and I didn't realize eBPF was as capable as
         | it is. Another commenter suggested trying it, though, so I'm
         | checking it out now!
         | 
         | I did use printk for debugging, but I (incorrectly) assumed it
         | could block. Another commenter pointed out that this is not the
         | case. TIL!
         | 
         | The gdb link looks very helpful and I'll try that next time.
         | Thanks for linking that.
        
       | cesarb wrote:
       | > However, printk can block (while allocating memory)
       | 
       | No, printk() is magic. It can be called even in NMI context,
       | which is a worse place. Quoting https://lwn.net/Articles/800946/,
       | "[...] kernel code must be able to call printk() from any
       | context. Calls from atomic context prevent it from blocking;
       | calls from non-maskable interrupts (NMIs) can even rule out the
       | use of spinlocks. [...]"
        
         | ksml wrote:
         | This is really good to know. I had assumed it could block when
         | allocating memory for the formatted string buffer, but the
         | rationale explained in that article makes a lot of sense. Being
         | able to use printk simplifies things a lot.
        
           | kanox wrote:
           | Also: allocating memory with GFP_ATOMIC doesn't sleep.
        
       | lhoursquentin wrote:
       | Great post, also love what you are trying to do with C
       | playground, this is awesome!
       | 
       | I've recently been trying to build something similar, visualizing
       | forks/execve/read/write, but using the strace output of a binary,
       | which is much less challenging.
        
         | ksml wrote:
         | Thank you! It's open source, and I'd love to hear if you have
         | any suggestions for it. Would also love to see what you're
         | building!
        
       | secondcoming wrote:
       | Great article! Reminds me of when I was working on a bug in a
       | phone kernel and adding its equivalent of printk() made the bug
       | disappear! Lauterbach time!
        
       ___________________________________________________________________
       (page generated 2020-11-19 23:00 UTC)