[HN Gopher] Object Detection from 9 FPS to 650 FPS
___________________________________________________________________

Object Detection from 9 FPS to 650 FPS

Author : briggers
Score  : 106 points
Date   : 2020-10-10 13:24 UTC (9 hours ago)

(HTM) web link (paulbridger.com)
(TXT) w3m dump (paulbridger.com)

| lostdog wrote:
| This is such a great post. It really shows how much room for improvement there is in all released deep learning code. Almost none of the open source work is really production-ready for fast inference, and tuning these systems requires a good working knowledge of the GPU.
|
| The article does skip the most important step for getting great inference speeds: drop Python and move fully into C++.

| gameswithgo wrote:
| C++ or .NET or Rust or Go, whatever. Almost anything can get the performance you want except Python.
|
| Too bad such great ecosystems evolved around a language that can't fully utilize the amazing hardware we have today.

| blihp wrote:
| I'd alter your conclusion that open source work isn't production ready. As long as it works as described, it is production ready for at least some subset of use cases. There's just a lot of low-hanging fruit re: performance improvement.
|
| It's entirely valid to trade performance for a more straightforward design or less development time, and just throw hardware at the problem as needed... companies do it all of the time.

| whimsicalism wrote:
| The main insight I see in this article is that you shouldn't use PyTorch Hub as a baseline for inference speed.
|
| I know a number of Python frameworks (e.g. Detectron) that are fast.
|
| I'd like to see the evidence that the performance bottleneck is Python, especially when asynchronous dispatch exists.

| threatripper wrote:
| > Drop Python and move fully into C++.
|
| Do you have any experience with that?

| lostdog wrote:
| Yes (though the details are private).
|
| All the deep learning libraries are Python wrappers around C/C++ (which then call into CUDA). If you call the C++ layers directly, you have control over the memory operations applied to your data. The biggest wins come from reducing the number of copies, reducing the number of transfers between CPU and GPU memory, and speeding up operations by moving them from the CPU to the GPU (or vice versa).
|
| This is basically what the article does, but if you want to squeeze out all the performance, the Python layer is still an abstraction that gets in the way of directly choosing what happens to the memory.

| dheera wrote:
| There are lots of cases where people use e.g. ROS on robots and Python to do inference, which basically converts the binary data of a ROS image message into a Python list of bytes (ugh), then converts that into a numpy array (ugh), and then feeds that into TensorFlow to do inference. This pipeline is extremely sub-optimal, but it's probably what most people do.
|
| All because nobody has really provided off-the-shelf, usable deployment libraries. That Bazel stuff if you want to use the C++ API? Big nope. Way too cumbersome. You're trying to move from Python to C++ and they want you to install ... Java? WTF?
|
| Also, some of the best neural net research out there has you run "./run_inference.sh" or some other abomination of a Jupyter notebook instead of an installable, deployable library.
|
| In fairness, good neural net engineers aren't expected to be good software engineers, but I'm just pointing out that there's a big gap between good neural nets and deployable neural nets.
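To make the copy-reduction point above concrete, here is a minimal sketch (PyTorch is used for brevity even though the comment mentions TensorFlow; msg_data, height and width are illustrative stand-ins for the fields of an incoming image message, not a specific ROS API) of handing a raw frame buffer to the GPU without the Python-list detour:

    import numpy as np
    import torch

    def frame_to_gpu(msg_data: bytes, height: int, width: int,
                     device: str = "cuda") -> torch.Tensor:
        # Wrap the raw buffer directly; np.frombuffer does not copy,
        # unlike building a Python list of bytes and converting that.
        frame = np.frombuffer(msg_data, dtype=np.uint8).reshape(height, width, 3)

        # One host-side copy into a writable array, then one host-to-device
        # transfer; everything after this line happens on the GPU.
        tensor = torch.from_numpy(np.array(frame)).to(device, non_blocking=True)

        # Layout conversion and normalization on the GPU, not the CPU.
        return tensor.permute(2, 0, 1).float().div_(255.0)  # CHW, float32 in [0, 1]

The same principle applies downstream: resizing, normalization and post-processing can stay on the GPU, with results copied back to the CPU at most once per frame.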
| threatripper wrote:
| I could see this working for the evaluation step, which basically just glues OpenCV video reading to TensorFlow to extract a handful of parameters per frame. The rest could stay in Python.
|
| Do you have experience with how single-frame processing compares between Python and C++? I see that batched processing in Python gives me a huge speed boost, which hints at inefficiencies at some point, but I don't know whether those are related to Python, TensorFlow or CUDA itself. (Or just bad resource management that requires re-initialization of some costly things in between evaluations.)

| whimsicalism wrote:
| The fact that batching is faster does not inherently imply some sort of inefficiency; rather, it reflects the fact that sequential memory access is faster than random access.
|
| I am curious what the basis is for the idea that Python is the performance bottleneck for inference.

| dheera wrote:
| It depends. E.g. if you are moving data from memory into a Python data structure and then sending it to the GPU, you will have a huge performance bottleneck in loading the data into Python.

| g_airborne wrote:
| It's not that Python is by definition much slower than C++; rather, doing inference in C++ makes it much easier to control exactly when memory is initialised, copied and moved between CPU and GPU. Especially for frame-by-frame models like object detection this can make a big difference. Also, the GIL can be a real problem if you are trying to scale inference across multiple incoming video streams, for example.

| threatripper wrote:
| How would one accelerate object tracking on a video stream where each frame depends on the result of the previous one? Batching and multi-threading don't work here.
|
| Are there CNN libraries that have much less overhead for small batch sizes? TensorFlow (GPU accelerated) seems to go down from 10000 fps on large batches to 200 fps for single frames for a small CNN.

| lostdog wrote:
| It depends on the algorithm you're using, but here are some places to start:
|
| 1. How many times is the data being copied, or moved between devices?
|
| 2. Are you recomputing data from previous frames that you could just be saving? For example, some tracking algorithms apply the same CNN tower to the last 3-5 images, and you could just save the results from the last frame instead of recomputing them. (Of course, you also want to follow hint #1 and keep these results on the GPU.)
|
| 3. Change the algorithm or network you're using.
|
| Really, you should read the original article carefully. The article shows you the steps for profiling which part of the runtime is slow. Typically, once you profile a little, you'll be surprised to find that time is being wasted somewhere unexpected.

| spockz wrote:
| I think this is a great explanation. Are these kinds of manual optimisations still needed when using the higher-level frameworks? At the least, those frameworks should make it clear in the types when a pipeline moves from CPU to GPU and vice versa.

| t-vi wrote:
| > The solution to Python's GIL bottleneck is not some trick, it is to stop using Python for data-path code.
|
| At least for the PyTorch bits of it, using the PyTorch JIT works well. When you run PyTorch code through Python, the intermediate results are created as Python objects (with the GIL and all), while when you run it as TorchScript, the intermediates exist only as C++ PyTorch tensors, without the GIL. We have a small comment about this in our PyTorch book, in the section on what improvements to expect from the PyTorch JIT, and it seems rather relevant in practice.
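A minimal sketch of the TorchScript route described above, for readers who haven't tried it (the tiny module and input shape are invented for illustration; torch.jit.script, .save() and the libtorch loader torch::jit::load are the relevant pieces):

    import torch
    import torch.nn as nn

    class TinyHead(nn.Module):
        # Stand-in module; any regular nn.Module works the same way.
        def __init__(self) -> None:
            super().__init__()
            self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
            self.score = nn.Conv2d(16, 1, kernel_size=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.sigmoid(self.score(torch.relu(self.conv(x))))

    model = TinyHead().eval()

    # Compile to TorchScript: the forward pass now runs in PyTorch's C++
    # interpreter, so intermediate results are C++ tensors rather than
    # Python objects.
    scripted = torch.jit.script(model)

    # The saved module can be loaded from C++ (libtorch) with
    # torch::jit::load("tiny_head.pt"), with no Python at inference time.
    scripted.save("tiny_head.pt")

    with torch.no_grad():
        out = scripted(torch.rand(1, 3, 224, 224))

torch.jit.trace is the alternative entry point when the model has no data-dependent control flow.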
| g_airborne wrote:
| The JIT is hands down the best feature of PyTorch, especially compared to the somewhat neglected suite of native inference tools for TensorFlow. Just recently I was trying to get a TensorFlow 2 model to work nicely in C++. Basically, the external API for TensorFlow is the C API, but it does not have proper support for `SavedModel` yet. Linking to the C++ library is a pain, and neither of them can do eager execution at all if you have a model trained in Python code :(
|
| PyTorch will happily let you export your model, even with Python code in it, and run it in C++ :)

| NikolaeVarius wrote:
| I've been trying to coax better performance out of a Jetson Nano camera, currently using Python's OpenCV lib with some threading, and can only manage about 29 fps at best.
|
| I would love an alternative that is reasonably simple to implement. I dislike having to handle raw bits.

| ilaksh wrote:
| I wonder if the tool used in the article can be applied?
|
| Seems like the Xavier NX is more realistic for my needs right now personally, though. Of course it's much more expensive, etc.

| egberts1 wrote:
| Try this one.
|
| https://github.com/streamlit/demo-self-driving
|
| It uses Streamlit:
|
| https://github.com/streamlit/streamlit

| minimaxir wrote:
| Streamlit is a UI framework, not an ML pipeline/performance framework.

| O5vYtytb wrote:
| > The solution to Python's GIL bottleneck is not some trick, it is to stop using Python for data-path code.
|
| What about using PyTorch multiprocessing [1]?
|
| [1] https://pytorch.org/docs/stable/notes/multiprocessing.html

| amelius wrote:
| I can't attest to the usefulness of PyTorch's multiprocessing module, but using Python's multiprocessing module feels like low-level programming (serializing, packing and unpacking data structures, etc., where you'd hope the environment would handle it for you).

| modeless wrote:
| I found Python multiprocessing to work well for parallelizing deep learning data loading and preprocessing, because all I needed to communicate was a couple of tensors, which are easy to allocate in shared memory. I didn't need complex data structures or synchronization.

| threatripper wrote:
| Processing separate video streams works well with separate processes. There is some cost related to starting the other processes, and sometimes libraries may stumble (e.g. several instances of ML libraries each allocating all the GPU memory), but once it's running it's literally two separate processes that can do their work independently.
|
| Multiprocessing can be a pain if you need to pass frames of a single video stream between processes. Traditionally you'd need to pickle/unpickle them to pass them between processes.
___________________________________________________________________
(page generated 2020-10-10 23:00 UTC)