[HN Gopher] Object Detection at 1840 FPS with TorchScript, TensorRT and DeepStream
       ___________________________________________________________________
        
       Object Detection at 1840 FPS with TorchScript, TensorRT and
       DeepStream
        
       Author : briggers
       Score  : 135 points
       Date   : 2020-10-18 11:54 UTC (11 hours ago)
        
 (HTM) web link (paulbridger.com)
 (TXT) w3m dump (paulbridger.com)
        
       | stabbles wrote:
       | > There is evidence (measured using gil_load) that we were
       | throttled by a fundamental Python limitation with multiple
       | threads fighting over the Global Interpreter Lock (GIL).
       | 
       | Can anyone comment on how often this is a problem and if this
       | problem is truly fundamental to Python? Could it be solved in a
       | Python 3.x release?
        
         | the-dude wrote:
         | The subject has been debated to death. Google it.
        
         | shepardrtc wrote:
          | Yes, this is a fundamental part of Python. Python threads are
          | real OS threads, but the Global Interpreter Lock (GIL) means
          | only one of them can execute Python bytecode at a time, so in
          | practice they behave more like cooperative fibers than true
          | parallel threads. If you're not waiting on I/O, the threads
          | will fight over the GIL and performance will suffer. This is
          | inherent to CPython and is unlikely to change.
         | 
          | But there are a few more things that can be said about this.
          | Python "threads" are really just a mental construct for
          | designing programs. The selling point is that you can share
          | variables and data between "threads" without worrying much
          | about locks or data corruption, because the GIL serializes
          | access. But even with that advantage, you're relying on
          | Python to switch between "threads" on its own, and that can
          | easily slow things down. If you're willing to drop the mental
          | construct and go for better performance, while still using a
          | single process and sharing variables, the asyncio module
          | lets you control exactly when the main Python process moves
          | between points in the code flow.
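The asyncio behavior the parent describes can be sketched as follows (a minimal illustration; the names and counts are made up for the example). The key point is that task switches happen only at `await` points, not at arbitrary moments:

```python
import asyncio

# Two tasks mutating shared state. Because the event loop only switches
# at "await" points, the read-modify-write below cannot be interrupted
# mid-update, unlike with preemptively scheduled threads.

shared = {"count": 0}

async def worker(n):
    for _ in range(n):
        shared["count"] += 1    # safe: no await between read and write
        await asyncio.sleep(0)  # explicit yield point to the event loop

async def main():
    await asyncio.gather(worker(1000), worker(1000))
    return shared["count"]

result = asyncio.run(main())
print(result)  # 2000: no lost updates, since switches only occur at awaits
```

The trade-off is that a CPU-bound stretch of code with no `await` in it will block every other task until it finishes.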
         | 
          | However, if you really want traditional multiple
          | processes/threads, just use the Multiprocessing module. It
          | actually launches multiple Python processes and links them
          | together. Its API is similar to Threading's, so there isn't
          | much code change for that part. But because it's no longer a
          | single process - and no longer bound by the GIL - you can't
          | share data between the processes as easily. With
          | Multiprocessing, you'll need slightly more complex data
          | structures (like a multiprocessing manager namespace) to
          | share that data. It's not that hard, but it requires a bit of
          | planning ahead of time.
        
       | Grimm1 wrote:
        | Good work getting TensorRT running. We had a real pain in the
        | butt recently when working with it and just opted to go with
        | ONNXRuntime, its graph optimizer and its TensorRT backend. It
        | may not be as fast as straight TensorRT, from comparisons I've
        | seen, but it got us to competitive inference and latency, so
        | we're happy with it.
        
       | indeyets wrote:
       | Name clash again... I thought about https://deepstream.io/
        
       | cm2187 wrote:
       | Out of curiosity, what are the possible use cases for object
       | detection at >100 fps? I assume it would have to be objects that
       | move very fast, i.e. nothing ordinary that I can think of.
       | 
       | [edit] actually stupid question. I assume it's more about
       | throughput than fps, i.e. be able to process lots of streams on
       | the same machine, for instance for doing mass analysis of CCTV
       | streams.
        
         | guywhocodes wrote:
         | For me it's only exciting because it lowers the barrier of how
         | much I can do with a much smaller system than double 2080ti's.
        
           | briggers wrote:
            | Yeah. A 2080Ti doesn't fit in your pocket or in your AR
            | glasses, but the same techniques and tools scale down.
        
         | aaronblohowiak wrote:
         | Food processing, recycling separation? I can imagine lots of
         | small parts moving fast
        
         | unnouinceput wrote:
          | Tomato/potato sorting while they are on a conveyor. Higher
          | fps means more objects can be dumped on said conveyor.
        
         | dragonelite wrote:
         | Smart missiles and weapons i guess
        
           | [deleted]
        
         | darepublic wrote:
         | Roulette spin predictor
        
           | ineedasername wrote:
           | I guess, but that's generally illegal, and if it became
           | common & easy then casinos would simply require bets to be
           | placed before the ball is dropped.
        
           | umvi wrote:
           | Reminds me of:
           | 
           | http://www1.cs.columbia.edu/graphics/courses/mobwear/resourc.
           | ..
        
         | stabbles wrote:
         | Another aspect is you might get better accuracy with larger
         | input dimensions, but the number of pixels scales quadratically
         | with width/height.
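The quadratic scaling mentioned above, in numbers (VGA dimensions chosen just for illustration): doubling both input dimensions quadruples the pixel count the network must process.

```python
# Doubling width and height quadruples the number of input pixels.
w, h = 640, 480
pixels = w * h
pixels_2x = (2 * w) * (2 * h)
print(pixels, pixels_2x, pixels_2x // pixels)  # 307200 1228800 4
```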
        
         | zanethomas wrote:
          | Shooting down drone swarms.
        
         | dekhn wrote:
         | this would be useful for sorting cells at high speed.
         | https://www.sinobiological.com/category/fcm-facs-facs
        
         | briggers wrote:
         | Very practical question :) Exactly as you say, multi-stream
         | throughput. Also for faster than realtime offline processing of
         | video. Check the caveats section at the end of the post -
         | DeepStream is probably not well suited to high throughput
         | single-stream inference.
        
         | bufferoverflow wrote:
         | Self-driving. Ideally you want something around 1000fps and low
         | latency, so it has time to react.
         | 
         | I'm sure military and sports applications are obvious too.
        
           | gugagore wrote:
           | I don't think you'll find a 1000 fps camera on a "standard"
           | AV platform. And if you did, I imagine it would be too noisy
           | to be useful without a ton of illumination.
        
             | Aeolos wrote:
              | Smartphones have offered 960fps video capture for a
              | while now.
        
           | raverbashing wrote:
           | Doubt
           | 
            | Human reaction times are much slower than that. In fact for
           | some things it can take a whole second
           | https://www.visualexpert.com/Resources/reactiontime.html
           | 
           | Maybe racing sports have shorter reaction times, but I'd be
           | frankly surprised if it was something < 100ms
           | 
           | 10fps for your average drive should be more than enough
        
             | cm2187 wrote:
             | Answering my own question. Possibly industrial applications
             | like detecting objects on a fast conveyor belt. A recycling
             | facility for instance.
        
             | mhh__ wrote:
             | 100ms is generally considered to be the lower bound for a
             | valid start, i.e. 99 would be considered a jump start in F1
             | iirc
        
             | ineedasername wrote:
             | No, that's only if you want the human to be able to make
             | the reaction. If the application was self-driving, you'd
             | prefer the car to react faster than a human. For a military
             | application like projectile detection to avoid or destroy
             | the object, you'd want something even faster.
        
             | gcanyon wrote:
              | But machines aren't (yet) as capable as humans at
              | recognizing driving situations and making driving
              | decisions. One way they can compensate for those
              | shortcomings is to be superior in other ways: 100%
              | vigilance and super-fast reaction times and
              | decision-making.
        
         | magicalhippo wrote:
         | While I'm not into object detection such as this, I can easily
         | imagine this being part of a system where you want the rest of
         | the system to have time to act on the information.
         | 
         | As such the point isn't that you can detect objects >N fps, but
         | rather that the object detection shouldn't take more than X% of
         | the time per cycle so that the overall cycle time can run at a
         | given rate.
        
           | joshvm wrote:
           | If your pipeline depends on running inference on a single
           | frame at a time, for example some kind of control loop, then
           | you need to be a bit careful about how you measure speed; you
           | have to use the effective time per batch (ie batch size 1),
           | not the amortised frames per second using as big a batch as
           | will fit. You can still interleave processing though.
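The measurement distinction above can be sketched with a toy timing harness. `run_inference` here is a stand-in for a real model call, with an assumed fixed per-call overhead plus a per-frame cost; the numbers are illustrative, not from the article:

```python
import time

def run_inference(batch):
    # Pretend model: 5ms fixed launch overhead + 1ms per frame.
    time.sleep(0.005 + 0.001 * len(batch))

def time_per_frame(batch_size, iters=5):
    start = time.perf_counter()
    for _ in range(iters):
        run_inference([0] * batch_size)
    return (time.perf_counter() - start) / (iters * batch_size)

latency = time_per_frame(1)     # what a control loop actually pays per frame
amortized = time_per_frame(32)  # what a throughput benchmark reports
print(latency > amortized)      # True: batching hides the fixed overhead
```

A headline fps number derived from `amortized` says little about how quickly a single frame gets through the system.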
        
         | izhak wrote:
         | Hypersonic missiles
        
         | pantelisk wrote:
          | Cloud services that need to serve multiple requests or
          | process many video streams in parallel (faster performance =
          | less hardware required, bigger scale, and potentially a
          | better end-user experience - aside from their data being on
          | the cloud, of course).
          | 
          | On-device (e.g. mobile phone) processing with battery usage
          | that respects the user. Inclusion of older hardware/models
          | as well.
          | 
          | Of course, the above aren't cases where the stream itself is
          | 100+fps, but broader general benefits. For a 100+ fps
          | stream... well, there are many things that go fast. Imagine
          | you wanted a robot that tracks or catches a fly before it
          | takes off. Flies have a reaction time of 5ms (200fps);
          | that's why they're hard for us to catch! Now expand and
          | apply the same concept to other things that are fast, or
          | happen very quickly...
        
       | moron4hire wrote:
       | Any word on latency? I didn't see anything in the article. I
       | guess, since this is a synthetic test just pumping a single image
       | file through repeatedly instead of an actual video stream, then
       | it wouldn't realistically be measurable. But if latency is
       | particularly low, this would be a boon for AR systems.
        
         | liuliu wrote:
          | Latency is fundamentally limited by the model processing a
          | single frame; all in all, probably somewhere around 10 to
          | 15ms depending on your input size (assuming VGA-type input).
          | This is a great article about system engineering for the
          | vision pipeline, but to solve the latency issue you need
          | either a beefier (or more specialized) processor or a
          | better-tuned algorithm.
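A back-of-envelope calculation from that estimate: a 10-15ms per-frame latency caps single-stream throughput well below the headline multi-stream number, no matter how many streams are batched together.

```python
# fps ceiling for one stream = 1000ms / per-frame latency in ms
for latency_ms in (10, 15):
    fps_ceiling = 1000 / latency_ms
    print(f"{latency_ms}ms/frame -> at most {fps_ceiling:.0f} fps per stream")
# 10ms/frame -> at most 100 fps per stream
# 15ms/frame -> at most 67 fps per stream
```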
        
         | briggers wrote:
         | BTW, this is pumping the same video file through the network -
         | not just a single file. I don't measure latency, but this is
         | not a deep pipeline so it's easy to calculate.
        
           | moron4hire wrote:
           | Ok, I guess I misread that part.
        
       | janimo wrote:
       | How portable are these techniques to other architectures? Could
       | >100 FPS be realistically achieved today using only CPUs or
       | mobile phones?
        
         | nl wrote:
         | > Could >100 FPS be realistically achieved today using only
         | CPUs or mobile phones?
         | 
         | Not yet.
         | 
         | Google's MediaPipe object detector (which is one of the most
         | optimised mobile solutions around) can do "26fps on an Adreno
         | 650 mobile GPU"[1].
         | 
         | The Adreno 650 is the GPU in the Snapdragon 865, ie the current
         | high end SOC used by most non-Apple phones. This gives roughly
         | the same performance as an iPhone 11.
         | 
         | [1] https://google.github.io/mediapipe/solutions/objectron.html
         | 
         | [2] https://www.tweaktown.com/news/69097/qualcomm-
         | adreno-650-gpu...
         | 
         | [3] https://www.tomsguide.com/news/snapdragon-865-benchmarks
        
         | motoboi wrote:
          | Converting to ONNX gives you an advantage on Intel CPUs too,
          | if you then convert from ONNX to OpenVINO.
        
         | briggers wrote:
          | Mobile phones, definitely, since these days most of them
          | have pretty powerful GPUs.
        
       | mozak1111 wrote:
        | I see this and I immediately think of "trash sorting" at ultra
        | high speed. If one could combine this with a bunch of accurate
        | (laser-precision) air guns to shoot and move individual pieces
        | of trash, you could sort through a truckload of trash in a
        | matter of seconds, perhaps in the air while it is being
        | dumped! Compare this approach with how we are currently doing
        | it [0]. Somebody should get Elon Musk on this project right
        | away!
       | 
       | [0] - https://www.youtube.com/watch?v=QbKA9uNgzYQ
        
         | [deleted]
        
         | MauranKilom wrote:
         | This is a thing already. In my understanding, it's a staple in
         | several kinds of recycling processes. Random sample of related
         | links (there's a seemingly infinite amount of these though):
         | 
         | https://youtu.be/mLya2NuY4Yk
         | 
         | https://youtu.be/GJeOfHxMWQo?t=87
         | 
         | https://youtu.be/bWUuBz2hWc0?t=83
        
         | tyfon wrote:
          | Trash sorting is probably a better fit than self-driving
          | cars. I only see talk of speed on this page and nothing
          | about accuracy.
         | 
          | Musk needs something like 99.9999% accuracy at near-zero
          | latency over several hours of operation. I think Tesla is
          | currently at maybe 99.995%, judging from driving my car. The
          | last 0.005% results in phantom braking etc. It's actually a
          | very hard nut to crack, and I don't expect them to achieve
          | full self-driving in all conditions for another 10-15 years.
          | The edge cases are just too many.
          | 
          | I like the trash idea though (or a QA robot at a factory
          | etc).
        
         | ebalit wrote:
          | Not related to Elon, but there is a company called Pellenc
          | ST in the south of France that works on exactly this kind of
          | problem. You can see a video of one of their machines here
          | [0].
          | 
          | I work at an AI consultancy [1] that helps them use deep
          | neural nets under these high-throughput, low-latency
          | conditions. It's an interesting challenge, and the
          | performance that can be squeezed from modern hardware is
          | indeed impressive.
         | 
         | 0: https://youtu.be/XLciSGE82DY?t=280
         | 
         | 1: https://neovision.fr
        
         | Cerium wrote:
          | Sorting by optical recognition, with air guns to separate a
          | falling curtain of product into two output streams, is
          | already a product. The development of these machines is the
          | reason that 10 or 15 years ago you stopped seeing bad beans
          | in bulk bean bags. I am involved in the tea industry, where
          | they are used to sort tea by grade - stems, bad leaves,
          | broken leaves, full leaves.
         | 
         | Here is a diagram: https://www.satake-usa.com/what-is-optical-
         | sorting.html
        
           | handol wrote:
           | Reminds me of the library modernization drive in 'Rainbows
           | End'. The book digitizer is basically a wood chipper with
           | lights and high speed cameras in the debris chute.
        
       | gcanyon wrote:
       | A weird question, but since there's another article on HN right
       | now about programming language energy efficiency
       | https://news.ycombinator.com/item?id=24816733 any idea whether
       | going from 9fps to 1840fps consumes the same power, 200x the
       | power, or somewhere in between?
        
         | briggers wrote:
         | Great question, now I wish I'd recorded power consumption for
         | all these experiments. Judging from cumulative hours of
         | watching the output of nvidia-smi I've definitely seen a
         | linearish relationship between utilization and power draw (with
         | a non-zero floor of 30-40W).
        
         | eggy wrote:
          | I see Rust is almost equal to C, if not better, in the
          | graph. However, I think equally skilled programmers in
          | either language would show the Rust programmer spending more
          | 'energy' programming and iterating than the C programmer -
          | though one could argue that the C program will use more
          | 'energy' downstream if bugs slip in. In any case, it's an
          | eye-opening metric on something that I, and I am sure many,
          | take for granted. Cool.
        
           | eggy wrote:
           | EDIT: I think Zig would come up pretty good here too:
           | 
           | https://twitter.com/andy_kelley/status/1317586767260774400
        
       ___________________________________________________________________
       (page generated 2020-10-18 23:01 UTC)