[HN Gopher] Object Detection at 1840 FPS with TorchScript, Tenso...
___________________________________________________________________
Object Detection at 1840 FPS with TorchScript, TensorRT and DeepStream

Author : briggers
Score  : 135 points
Date   : 2020-10-18 11:54 UTC (11 hours ago)

(HTM) web link (paulbridger.com)
(TXT) w3m dump (paulbridger.com)

| stabbles wrote:
| > There is evidence (measured using gil_load) that we were throttled by a fundamental Python limitation with multiple threads fighting over the Global Interpreter Lock (GIL).
|
| Can anyone comment on how often this is a problem and if this problem is truly fundamental to Python? Could it be solved in a Python 3.x release?

| the-dude wrote:
| The subject has been debated to death. Google it.

| shepardrtc wrote:
| Yes, this is fundamental to Python. Python's "threads" (the Threading module) are real OS threads, but the GIL lets only one of them execute Python bytecode at a time, so in practice they behave more like fibers in some other languages. If you're not waiting on I/O, the threads will fight over the GIL and performance will suffer. This is inherent to CPython and is unlikely to change.
|
| But there are a few more things that can be said about this. Python "threads" are really just a mental construct for designing programs. The selling point is that you can share variables and data between "threads" without having to worry about data corruption or anything like that. It just works. But even with that advantage, you're relying on Python to switch between "threads" on its own, and that can easily slow things down. If you're willing to drop that mental construct for better performance, while still using a single process and shared variables, the asyncio module lets you control exactly when the main Python process moves between points of code flow.
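shepardrtc's GIL point is easy to see empirically. A minimal sketch (timings are machine-dependent and only illustrative; `concurrent.futures` is used here for brevity rather than the raw Threading module):

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def busy(n):
    # Pure-Python CPU-bound work: holds the GIL for the whole loop.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    work = [2_000_000] * 4

    t0 = time.perf_counter()
    serial = [busy(n) for n in work]
    t_serial = time.perf_counter() - t0

    # OS threads, but only one executes Python bytecode at a time.
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as ex:
        threaded = list(ex.map(busy, work))
    t_threads = time.perf_counter() - t0

    # Separate interpreters, one GIL each: real parallelism.
    t0 = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as ex:
        procs = list(ex.map(busy, work))
    t_procs = time.perf_counter() - t0

    assert serial == threaded == procs
    print(f"serial  {t_serial:.2f}s")
    print(f"threads {t_threads:.2f}s (no speedup: GIL contention)")
    print(f"procs   {t_procs:.2f}s (scales with cores)")
```

On a multi-core machine the thread pool typically runs no faster than the serial loop (often slower, from lock contention), while the process pool scales with core count.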
| However, if you really want traditional parallelism across multiple processes, use the Multiprocessing module. It actually launches multiple Python processes and links them together. Its API mirrors Threading's, so there isn't much code change for that part. But because it's no longer a single process - and no longer bound by one GIL - you can't share data between the processes as easily. With Multiprocessing you'll need to create slightly more complex data structures (like a multiprocessing Manager namespace) to share that data. It's not that hard, but it requires a bit of planning ahead of time.

| Grimm1 wrote:
| Good work getting TensorRT running. We had a real pain in the butt with it recently and just opted to go with ONNXRuntime, its graph optimizer and its TensorRT backend. It may not be as fast as straight TensorRT, from comparisons I've seen, but it got us to competitive inference throughput and latency, so we're happy with it.

| indeyets wrote:
| Name clash again... I thought about https://deepstream.io/

| cm2187 wrote:
| Out of curiosity, what are the possible use cases for object detection at >100 fps? I assume it would have to be objects that move very fast, i.e. nothing ordinary that I can think of.
|
| [edit] Actually, stupid question. I assume it's more about throughput than fps, i.e. being able to process lots of streams on the same machine, for instance for mass analysis of CCTV streams.

| guywhocodes wrote:
| For me it's only exciting because it lowers the barrier: you can do much more with a far smaller system than a pair of 2080 Tis.

| briggers wrote:
| Yeah. A 2080 Ti doesn't fit in your pocket or in your AR glasses, but the same techniques and tools scale down.

| aaronblohowiak wrote:
| Food processing, recycling separation? I can imagine lots of small parts moving fast.

| unnouinceput wrote:
| Tomato / potato sorting while they are on a conveyor.
| Bigger fps, more objects to dump on said conveyor.

| dragonelite wrote:
| Smart missiles and weapons, I guess.

| [deleted]

| darepublic wrote:
| Roulette spin predictor.

| ineedasername wrote:
| I guess, but that's generally illegal, and if it became common & easy then casinos would simply require bets to be placed before the ball is dropped.

| umvi wrote:
| Reminds me of:
|
| http://www1.cs.columbia.edu/graphics/courses/mobwear/resourc...

| stabbles wrote:
| Another aspect is that you might get better accuracy with larger input dimensions, but the number of pixels scales quadratically with width/height.

| zanethomas wrote:
| Shooting down drone swarms.

| dekhn wrote:
| This would be useful for sorting cells at high speed. https://www.sinobiological.com/category/fcm-facs-facs

| briggers wrote:
| Very practical question :) Exactly as you say, multi-stream throughput. Also for faster-than-realtime offline processing of video. Check the caveats section at the end of the post - DeepStream is probably not well suited to high-throughput single-stream inference.

| bufferoverflow wrote:
| Self-driving. Ideally you want something around 1000fps and low latency, so it has time to react.
|
| I'm sure military and sports applications are obvious too.

| gugagore wrote:
| I don't think you'll find a 1000 fps camera on a "standard" AV platform. And if you did, I imagine it would be too noisy to be useful without a ton of illumination.

| Aeolos wrote:
| Smartphones have offered 960fps video capture for a while now.

| raverbashing wrote:
| Doubt.
|
| Human reaction times are much slower than that. In fact, for some things it can take a whole second: https://www.visualexpert.com/Resources/reactiontime.html
|
| Maybe racing sports have shorter reaction times, but I'd be frankly surprised if it were anything < 100ms.
|
| 10fps for your average drive should be more than enough.

| cm2187 wrote:
| Answering my own question.
| Possibly industrial applications, like detecting objects on a fast conveyor belt - a recycling facility, for instance.

| mhh__ wrote:
| 100ms is generally considered to be the lower bound for a valid start, i.e. 99ms would be considered a jump start in F1, IIRC.

| ineedasername wrote:
| No, that's only if you want a human to be able to make the reaction. If the application is self-driving, you'd prefer the car to react faster than a human. For a military application like projectile detection, to avoid or destroy the object, you'd want something even faster.

| gcanyon wrote:
| But machines aren't (yet) as capable as humans at driving-situation recognition and driving decision-making. One way they can compensate for those shortcomings is to be superior in other ways: 100% vigilance and super-fast reaction times/decision-making.

| magicalhippo wrote:
| While I'm not into object detection such as this, I can easily imagine this being part of a system where you want the rest of the system to have time to act on the information.
|
| As such, the point isn't that you can detect objects at >N fps, but rather that the object detection shouldn't take more than X% of the time per cycle, so that the overall cycle can run at a given rate.

| joshvm wrote:
| If your pipeline depends on running inference on a single frame at a time, for example in some kind of control loop, then you need to be a bit careful about how you measure speed; you have to use the effective time per batch (i.e. batch size 1), not the amortised frames per second using as big a batch as will fit. You can still interleave processing, though.

| izhak wrote:
| Hypersonic missiles.

| pantelisk wrote:
| A cloud service that needs to serve multiple requests or process many video streams in parallel. (Faster performance = less hardware required, bigger scale, and potentially improved end-user experience - apart from their data being on the cloud, of course.)
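joshvm's distinction between per-frame latency and amortised throughput can be sketched with a dummy model. (The `infer` function below is a hypothetical stand-in: a fixed per-call overhead plus a per-item cost, with timing constants chosen only for illustration.)

```python
import time

def infer(batch):
    # Hypothetical stand-in for a real model call: fixed launch
    # overhead plus cost that grows with batch size.
    time.sleep(0.001 + 0.0005 * len(batch))
    return [x * 2 for x in batch]

def per_frame_latency(frames):
    # Batch size 1: the time a control loop actually waits per frame.
    times = []
    for f in frames:
        t0 = time.perf_counter()
        infer([f])
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)

def amortized_frame_time(frames, batch_size):
    # Large batches: better FPS on paper, but every frame in the
    # batch waits for the whole batch to finish.
    t0 = time.perf_counter()
    for i in range(0, len(frames), batch_size):
        infer(frames[i:i + batch_size])
    return (time.perf_counter() - t0) / len(frames)

frames = list(range(32))
lat = per_frame_latency(frames)
amort = amortized_frame_time(frames, batch_size=32)
assert amort < lat  # throughput looks better, per-frame latency is worse
print(f"batch=1  latency/frame:   {lat * 1000:.2f} ms")
print(f"batch=32 amortized/frame: {amort * 1000:.2f} ms")
```

The amortised figure is the one headline FPS numbers usually report; a control loop should budget with the batch-size-1 figure instead.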
| On-device (e.g. mobile phone) processing with battery usage that respects the user. Inclusion of older hardware/models as well.
|
| Of course, the above aren't cases where the stream itself is 100+fps, but broader general benefits. For a 100+fps stream... well, there are many things that go fast. Imagine you wanted a robot that tracks or catches a fly before it takes off. Flies have a reaction time of 5ms (200fps); that's why they're hard for us to catch! Now expand and apply the same concept to other things that are fast or happen very quickly...

| moron4hire wrote:
| Any word on latency? I didn't see anything in the article. I guess, since this is a synthetic test just pumping a single image file through repeatedly instead of an actual video stream, it wouldn't realistically be measurable. But if latency is particularly low, this would be a boon for AR systems.

| liuliu wrote:
| Latency is fundamentally limited by the model processing a single frame - all in all, probably somewhere around 10 to 15ms depending on your input size (assuming VGA-type input). This is a great article on system engineering for the vision pipeline, but to solve the latency issue you need either a beefier (or more specialized) processor or a better-tuned algorithm.

| briggers wrote:
| BTW, this is pumping the same video file through the network - not just a single image file. I don't measure latency, but this is not a deep pipeline so it's easy to calculate.

| moron4hire wrote:
| Ok, I guess I misread that part.

| janimo wrote:
| How portable are these techniques to other architectures? Could >100 FPS be realistically achieved today using only CPUs or mobile phones?

| nl wrote:
| > Could >100 FPS be realistically achieved today using only CPUs or mobile phones?
|
| Not yet.
|
| Google's MediaPipe object detector (which is one of the most optimised mobile solutions around) can do "26fps on an Adreno 650 mobile GPU" [1].
| The Adreno 650 is the GPU in the Snapdragon 865, i.e. the current high-end SoC used by most non-Apple phones. This gives roughly the same performance as an iPhone 11.
|
| [1] https://google.github.io/mediapipe/solutions/objectron.html
|
| [2] https://www.tweaktown.com/news/69097/qualcomm-adreno-650-gpu...
|
| [3] https://www.tomsguide.com/news/snapdragon-865-benchmarks

| motoboi wrote:
| Converting to ONNX gives you an advantage on Intel CPUs too, if you then convert from ONNX to OpenVINO.

| briggers wrote:
| Mobile phones, definitely, since these days most of them have pretty powerful GPUs.

| mozak1111 wrote:
| I see this and I immediately think of "trash sorting" at ultra-high speed. If one could combine this with a bunch of accurate (laser-precision) air guns to shoot and move individual pieces of trash, you could sort through a truckload of trash in a matter of seconds - perhaps in the air while it is being dumped! Compare this approach with how we are currently doing it [0]. Somebody should get Elon Musk on this project right away!
|
| [0] https://www.youtube.com/watch?v=QbKA9uNgzYQ

| [deleted]

| MauranKilom wrote:
| This is a thing already. In my understanding, it's a staple in several kinds of recycling processes. A random sample of related links (there's a seemingly infinite number of these, though):
|
| https://youtu.be/mLya2NuY4Yk
|
| https://youtu.be/GJeOfHxMWQo?t=87
|
| https://youtu.be/bWUuBz2hWc0?t=83

| tyfon wrote:
| Trash sorting is probably a better fit than self-driving cars. I only see talk of speed on this page and nothing about accuracy.
|
| Musk needs something like 99.9999% accuracy at near-zero latency over several hours of operation. I think Tesla is currently at maybe 99.995%, judging from driving my car. The last 0.005% results in phantom braking etc. It's actually a very hard nut to crack, and I don't expect them to achieve full self-driving in all conditions for another 10-15 years, maybe. The edge cases are just too many.
| I like the trash idea, though (or a QA robot at a factory, etc).

| ebalit wrote:
| Not related to Elon, but there is a company called Pellenc ST in the south of France that works on exactly this kind of problem. You can see a video of one of their machines here [0].
|
| I work at an AI consultancy [1] that helps them use deep neural nets in these high-throughput, low-latency conditions. It's an interesting challenge, and the performance that can be squeezed from modern hardware is indeed impressive.
|
| [0] https://youtu.be/XLciSGE82DY?t=280
|
| [1] https://neovision.fr

| Cerium wrote:
| Sorting by optical recognition, with air guns to separate a falling curtain of product into two output streams, is already a product. The development of these machines is the reason that 10 or 15 years ago you stopped seeing bad beans in bulk bean bags. I am involved in the tea industry, where they are used to sort tea by grade - stems, bad leaves, broken leaves, full leaves.
|
| Here is a diagram: https://www.satake-usa.com/what-is-optical-sorting.html

| handol wrote:
| Reminds me of the library modernization drive in 'Rainbows End'. The book digitizer is basically a wood chipper with lights and high-speed cameras in the debris chute.

| gcanyon wrote:
| A weird question, but since there's another article on HN right now about programming language energy efficiency (https://news.ycombinator.com/item?id=24816733): any idea whether going from 9fps to 1840fps consumes the same power, 200x the power, or somewhere in between?

| briggers wrote:
| Great question; now I wish I'd recorded power consumption for all these experiments. Judging from cumulative hours of watching the output of nvidia-smi, I've definitely seen a roughly linear relationship between utilization and power draw (with a non-zero floor of 30-40W).
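For anyone wanting to record this next time: `nvidia-smi` can emit power draw and utilization as CSV, which is easy to poll and parse. A minimal sketch (the query flags are standard `nvidia-smi` options; the sample line format assumes the `csv,noheader,nounits` output mode, and the script exits gracefully where `nvidia-smi` isn't installed):

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=power.draw,utilization.gpu",
    "--format=csv,noheader,nounits",
]

def parse_sample(line):
    # One CSV line per GPU, e.g. "187.34, 96" (watts, percent).
    power, util = (field.strip() for field in line.split(","))
    return float(power), float(util)

def sample_gpus():
    # One reading per installed GPU.
    out = subprocess.run(QUERY, capture_output=True, text=True,
                         check=True).stdout
    return [parse_sample(line) for line in out.strip().splitlines()]

if __name__ == "__main__":
    try:
        for i, (watts, util) in enumerate(sample_gpus()):
            print(f"GPU {i}: {watts:.1f} W at {util:.0f}% utilization")
    except FileNotFoundError:
        print("nvidia-smi not found on this machine")
```

Run it in a loop (or use `nvidia-smi`'s own `-l <seconds>` flag) alongside a benchmark to get joules per experiment rather than just FPS.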
| eggy wrote:
| I see Rust is almost equal to C, if not better, in the graph. However, I think equally skilled programmers in either language would show the Rust programmer spending more 'energy' programming and iterating than the C programmer - though you could then argue that the C program will use more 'energy' downstream if bugs slip in. In any case, it's an eye-opening metric on something that I, and I'm sure many others, take for granted. Cool.

| eggy wrote:
| EDIT: I think Zig would come out pretty well here too:
|
| https://twitter.com/andy_kelley/status/1317586767260774400

___________________________________________________________________
(page generated 2020-10-18 23:01 UTC)