[HN Gopher] Object Detection from 9 FPS to 650 FPS
       ___________________________________________________________________
        
       Object Detection from 9 FPS to 650 FPS
        
       Author : briggers
       Score  : 106 points
       Date   : 2020-10-10 13:24 UTC (9 hours ago)
        
 (HTM) web link (paulbridger.com)
 (TXT) w3m dump (paulbridger.com)
        
       | lostdog wrote:
       | This is such a great post. It really shows how much room for
       | improvement there is in all released deep learning code. Almost
       | none of the open source work is really production ready for fast
       | inference, and tuning the systems requires a good working
       | knowledge of the GPU.
       | 
       | The article does skip the most important step for getting great
       | inference speeds: Drop Python and move fully into C++.
        
         | gameswithgo wrote:
          | C++ or .NET or Rust or Go, whatever. Almost anything can get
          | the performance you want except Python.
          | 
          | Too bad such great ecosystems evolved around a language that
          | can't fully utilize the amazing hardware we have today.
        
         | blihp wrote:
         | I'd alter your conclusion that open source work isn't
         | production ready. As long as it works as described, it is
         | production ready for at least some subset of use cases. There's
         | just a lot of low hanging fruit re: performance improvement.
         | 
          | It's entirely valid to trade performance for a more
          | straightforward design or reduced development time, and just
          | throw hardware at the problem as needed.... companies do it
          | all the time.
        
         | whimsicalism wrote:
          | The main insight I take from this article is that you
          | shouldn't use PyTorch Hub as a baseline for inference speed.
          | 
          | I know a number of Python frameworks (e.g. Detectron) that are
          | fast.
          | 
          | I'd like to see evidence that the performance bottleneck is
          | Python, especially when asynchronous dispatch exists.
        
         | threatripper wrote:
         | > Drop Python and move fully into C++.
         | 
         | Do you have any experience with that?
        
           | lostdog wrote:
           | Yes (though the details are private).
           | 
           | All the deep learning libraries are Python wrappers around
           | C/C++ (which then call into CUDA). If you call the C++ layers
           | directly, you have control over the memory operations applied
           | to your data. The biggest wins come from reducing the number
           | of copies, reducing the number of transfers between CPU and
           | GPU memory, and speeding up operations by moving them from
           | the CPU to the GPU (or vice versa).
           | 
           | This is basically what the article does, but if you want to
           | squeeze out all the performance, the Python layer is still an
           | abstraction that gets in the way of directly choosing what
           | happens to the memory.
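            | 
            | To make that concrete while staying in Python, here is a
            | minimal PyTorch sketch of the "fewer copies, fewer
            | transfers, keep the math on the GPU" idea (the model file,
            | shapes and names are made up):
            | 
            |   import torch
            | 
            |   # Hypothetical TorchScript detector, loaded once onto GPU.
            |   model = torch.jit.load("detector.pt").cuda().eval()
            | 
            |   # Reusable pinned (page-locked) staging buffer: each frame
            |   # becomes a single async host-to-device copy.
            |   staging = torch.empty((1, 3, 720, 1280), dtype=torch.uint8,
            |                         pin_memory=True)
            | 
            |   @torch.no_grad()
            |   def infer(frame_u8):
            |       # frame_u8: CPU uint8 tensor of shape (1, 3, 720, 1280)
            |       staging.copy_(frame_u8)
            |       gpu = staging.to("cuda", non_blocking=True)
            |       # Normalise on the GPU instead of in numpy on the CPU.
            |       gpu = gpu.float().div_(255.0)
            |       return model(gpu)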
        
             | dheera wrote:
              | There are lots of cases where people use e.g. ROS on robots
              | and Python to do inference, which basically converts ROS
              | binary image message data into a Python list of bytes
              | (ugh), then converts that into a numpy array (ugh), and
              | then feeds that into TensorFlow to do inference. This
              | pipeline is extremely sub-optimal, but it's what most
              | people probably do.
             | 
              | All because nobody has really provided off-the-shelf,
              | usable deployment libraries. That Bazel stuff if you want
              | to use the C++ API? Big nope. Way too cumbersome. You're
              | trying to move from Python to C++ and they want you to
              | install ... Java? WTF?
              | 
              | Also, some of the best neural net research out there has
              | you run "./run_inference.sh" or some other abomination of
              | a Jupyter notebook instead of an installable, deployable
              | library. To be fair, good neural net engineers aren't
              | expected to be good software engineers, but I'm just
              | pointing out that there's a big gap between good neural
              | nets and deployable neural nets.
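              | 
              | For what it's worth, the list-of-bytes step is avoidable
              | even in Python. A rough sketch (assuming a 'bgr8'
              | sensor_msgs/Image and a rospy-style callback) that wraps
              | the message buffer directly:
              | 
              |   import numpy as np
              | 
              |   def image_callback(msg):
              |       # msg: sensor_msgs/Image, 'bgr8' encoding assumed.
              |       # np.frombuffer wraps the underlying buffer without
              |       # copying, instead of building a Python list of ints.
              |       frame = np.frombuffer(msg.data, dtype=np.uint8)
              |       frame = frame.reshape(msg.height, msg.width, 3)
              |       # ...hand `frame` to the inference pipeline here...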
        
             | threatripper wrote:
              | I could see this working for the evaluation, which
              | basically just glues OpenCV video reading to TensorFlow to
              | extract a handful of parameters per frame. The rest could
              | stay in Python.
              | 
              | Do you have experience with how single-frame processing
              | compares between Python and C++? I see that batched
              | processing in Python gives me a huge speed boost, which
              | hints at inefficiencies somewhere, but I don't know
              | whether those are related to Python, TensorFlow or CUDA
              | itself. (Or just bad resource management that requires
              | re-initialization of some costly things between
              | evaluations.)
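              | 
              | For reference, the batched path I'm comparing against is
              | roughly the following shape (sketched in PyTorch here; the
              | model and frame sizes are only stand-ins):
              | 
              |   import torch
              |   import torchvision
              | 
              |   # Stand-in model and frames; the real pipeline differs.
              |   model = torchvision.models.resnet18(pretrained=True)
              |   model = model.cuda().eval()
              |   frames = [torch.rand(3, 224, 224) for _ in range(32)]
              | 
              |   # One host-to-device transfer and one forward pass
              |   # cover all 32 frames.
              |   batch = torch.stack(frames).cuda()
              |   with torch.no_grad():
              |       out = model(batch)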
        
               | whimsicalism wrote:
                | The fact that batching is faster does not inherently
                | imply some sort of inefficiency; rather, it indicates
                | that sequential memory access is faster than random
                | access.
                | 
                | I am curious what the basis is for the idea that Python
                | is the performance bottleneck for inference.
        
               | dheera wrote:
                | It depends. For example, if you are moving data from
                | memory into a Python data structure and then sending it
                | to the GPU, you will have a huge performance bottleneck
                | in loading the data into Python.
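                | 
                | As a toy illustration of the difference (the shapes are
                | arbitrary):
                | 
                |   import numpy as np
                |   import torch
                | 
                |   frame = np.zeros((720, 1280, 3), dtype=np.uint8)
                | 
                |   # Slow path: one Python int object per byte.
                |   as_list = frame.flatten().tolist()
                |   t_slow = torch.tensor(as_list, dtype=torch.uint8)
                |   t_slow = t_slow.reshape(frame.shape).cuda()
                | 
                |   # Fast path: wrap the numpy buffer (no CPU-side
                |   # copy), then do a single host-to-device transfer.
                |   t_fast = torch.from_numpy(frame).cuda()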
        
               | g_airborne wrote:
                | It's not that Python is by definition much slower than
                | C++; rather, doing inference in C++ makes it much easier
                | to control exactly when memory is initialised, copied
                | and moved between CPU and GPU. Especially for
                | frame-by-frame models like object detection, this can
                | make a big difference. Also, the GIL can be a real
                | problem if you are trying to scale inference across
                | multiple incoming video streams, for example.
        
       | threatripper wrote:
        | How would one accelerate object tracking on a video stream where
        | each frame depends on the result of the previous one? Batching
        | and multi-threading don't work here.
        | 
        | Are there CNN libraries that have much less overhead for small
        | batch sizes? TensorFlow (GPU accelerated) seems to drop from
        | 10000 fps on large batches to 200 fps for single frames with a
        | small CNN.
        
         | lostdog wrote:
         | It depends on the algorithm you're using, but here are some
         | places to start:
         | 
         | 1. How many times is the data being copied, or moved between
         | devices?
         | 
          | 2. Are you recomputing data from previous frames that you could
          | just be saving? For example, some tracking algorithms apply the
          | same CNN tower to the last 3-5 images, and you could just save
          | the results from the last frame instead of recomputing; see the
          | sketch at the end of this comment. (Of course, you also want to
          | follow hint #1 and keep these results on the GPU.)
         | 
         | 3. Change the algorithm or network you're using.
         | 
         | Really you should read the original article carefully. The
         | article is showing you the steps for profiling what part of the
         | runtime is slow. Typically, once you profile a little you'll be
         | surprised to find that time is being wasted somewhere
         | unexpected.
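          | 
          | A rough sketch of what hint #2 can look like in PyTorch (the
          | backbone and the history length are placeholders):
          | 
          |   import torch
          | 
          |   class FeatureCache:
          |       """Keep the last few frames' features on the GPU."""
          | 
          |       def __init__(self, backbone, history=3):
          |           self.backbone = backbone  # a CNN already on the GPU
          |           self.history = history
          |           self.cached = []          # GPU tensors, newest last
          | 
          |       @torch.no_grad()
          |       def features_for(self, frame_gpu):
          |           # Compute features only for the new frame.
          |           feats = self.backbone(frame_gpu)
          |           self.cached.append(feats)
          |           self.cached = self.cached[-self.history:]
          |           return self.cached  # stays on the GPU throughout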
        
       | spockz wrote:
        | I think this is a great explanation. Are these kinds of manual
        | optimisations still needed when using the higher-level
        | frameworks? Or, at the least, those frameworks should make it
        | clear in the types when a pipeline moves from CPU to GPU and
        | vice versa.
        
       | t-vi wrote:
       | > The solution to Python's GIL bottleneck is not some trick, it
       | is to stop using Python for data-path code.
       | 
        | At least for the PyTorch bits of it, using the PyTorch JIT works
        | well. When you run PyTorch code through Python, the intermediate
        | results are created as Python objects (with the GIL and all),
        | whereas when you run it as TorchScript, the intermediates exist
        | only as C++ PyTorch tensors, all without the GIL. We have a small
        | comment about this in our PyTorch book, in the section on what
        | improvements to expect from the PyTorch JIT, and it seems rather
        | relevant in practice.
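        | 
        | A tiny example of what scripting looks like (the module itself
        | is a made-up stand-in for real data-path code):
        | 
        |   import torch
        | 
        |   class Postprocess(torch.nn.Module):
        |       # Toy data-path step: once scripted, the comparison and
        |       # the intermediate tensors are handled in C++ without
        |       # touching the GIL.
        |       def forward(self, scores: torch.Tensor,
        |                   threshold: float) -> torch.Tensor:
        |           keep = scores > threshold
        |           return torch.masked_select(scores, keep)
        | 
        |   scripted = torch.jit.script(Postprocess())
        |   print(scripted(torch.rand(1000), 0.5))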
        
         | g_airborne wrote:
          | The JIT is hands down the best feature of PyTorch, especially
          | compared to the somewhat neglected suite of native inference
          | tools for TensorFlow. Just recently I was trying to get a
          | TensorFlow 2 model to work nicely in C++. Basically, the
          | external API for TensorFlow is the C API, but it does not have
          | proper support for `SavedModel` yet. Linking to the C++
          | library is a pain, and neither of them can do eager execution
          | at all if you have a model trained in Python code :(
         | 
         | PyTorch will happily let you export your model, even with
         | Python code in it, and run it in C++ :)
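          | 
          | The export side is only a few lines (a sketch assuming a
          | recent torchvision detection model; the file name is
          | arbitrary):
          | 
          |   import torch
          |   import torchvision
          | 
          |   # Script the model and save a self-contained archive.
          |   model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
          |       pretrained=True).eval()
          |   scripted = torch.jit.script(model)
          |   scripted.save("detector.pt")
          | 
          |   # The C++ side then only needs libtorch, e.g.:
          |   #   auto module = torch::jit::load("detector.pt");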
        
       | NikolaeVarius wrote:
        | I've been trying to coax better performance out of a Jetson Nano
        | camera, currently using Python's OpenCV lib with some threading,
        | and can only manage at best about 29 fps.
        | 
        | I would love an alternative that is reasonably simple to
        | implement. I dislike having to handle raw bits.
        
         | ilaksh wrote:
         | I wonder if the tool used in the article can be applied?
         | 
         | Seems like Xavier NX is more realistic for my needs right now
         | personally though. Of course it's much more expensive etc.
        
       | egberts1 wrote:
       | Try this one.
       | 
       | https://github.com/streamlit/demo-self-driving
       | 
        | It uses Streamlit:
       | 
       | https://github.com/streamlit/streamlit
        
         | minimaxir wrote:
         | Streamlit is a UI framework, not a ML pipeline/performance
         | framework.
        
       | O5vYtytb wrote:
       | > The solution to Python's GIL bottleneck is not some trick, it
       | is to stop using Python for data-path code.
       | 
       | What about using pytorch multiprocessing[1]?
       | 
       | [1] https://pytorch.org/docs/stable/notes/multiprocessing.html
        
         | amelius wrote:
          | I can't attest to the usefulness of PyTorch's multiprocessing
          | module, but using Python's multiprocessing module feels like
          | low-level programming (serializing, packing and unpacking data
          | structures, etc., where you'd hope the environment would
          | handle it for you).
        
           | modeless wrote:
            | I found Python multiprocessing to work well for
            | parallelizing deep learning data loading and preprocessing,
            | because all I needed to communicate was a couple of tensors,
            | which are easy to allocate in shared memory. I didn't need
            | complex data structures or synchronization.
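            | 
            | Roughly like this (a minimal torch.multiprocessing sketch;
            | the shapes and the random "decode" step are placeholders):
            | 
            |   import torch
            |   import torch.multiprocessing as mp
            | 
            |   def loader(buf, done):
            |       # Child process: write the batch into the shared
            |       # tensor in place; only a tiny "done" signal crosses
            |       # the process boundary, never the pixels themselves.
            |       batch = torch.rand(32, 3, 224, 224)  # fake decode
            |       buf.copy_(batch)
            |       done.put(True)
            | 
            |   if __name__ == "__main__":
            |       buf = torch.empty(32, 3, 224, 224).share_memory_()
            |       done = mp.Queue()
            |       p = mp.Process(target=loader, args=(buf, done))
            |       p.start()
            |       done.get()          # wait for the filled buffer
            |       batch = buf.cuda()  # one copy to the GPU, no pickling
            |       p.join()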
        
         | threatripper wrote:
          | Processing separate video streams works well with separate
          | processes. There is some cost related to starting the other
          | processes, and sometimes libraries may stumble (e.g. several
          | instances of ML libraries allocating all the GPU memory), but
          | once it's running, it's literally two separate processes that
          | can do their work independently.
         | 
         | Multiprocessing could be a pain if you need to pass frames of a
         | single video stream. Traditionally you'd need to
         | pickle/unpickle them to pass them between processes.
        
       ___________________________________________________________________
       (page generated 2020-10-10 23:00 UTC)