[HN Gopher] Downsampling Time Series for Visual Representation [pdf]
       ___________________________________________________________________
        
       Downsampling Time Series for Visual Representation [pdf]
        
       Author : todsacerdoti
       Score  : 53 points
       Date   : 2021-03-09 18:32 UTC (4 hours ago)
        
 (HTM) web link (skemman.is)
 (TXT) w3m dump (skemman.is)
        
       | elsherbini wrote:
        | This looks cool; I especially like the "intuitive" longest-line
        | algorithm.
       | 
       | ~6 years ago I was trying to build a dashboard that displayed
       | real-time measurements from a beehive. The sensors would take
       | temperature, weight, humidity, etc. Back then I used simplify.js
       | [0] which uses two simplification algorithms in two passes. The
        | more compute-intensive one is the Ramer-Douglas-Peucker algorithm
       | [1]. One issue I had with streaming data is that new data points
       | could change the past line, at least with my naive
       | implementation. I'd love to see a real-time / streaming time
       | series simplification algorithm where the past points don't
       | appear to change wildly despite continuing to downsample.
       | 
       | [0] https://github.com/mourner/simplify-js
       | 
       | [1]
       | https://en.wikipedia.org/wiki/Ramer%E2%80%93Douglas%E2%80%93...
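        | 
        | A minimal recursive sketch of RDP [1] in Python, just to
        | illustrate; the epsilon threshold (max allowed perpendicular
        | distance from the chord) and the names are mine:
        | 
        |   import math
        | 
        |   def rdp(points, epsilon):
        |       # Keep the endpoints; find the interior point farthest
        |       # from the chord between them.
        |       if len(points) < 3:
        |           return points
        |       (x1, y1), (x2, y2) = points[0], points[-1]
        |       dx, dy = x2 - x1, y2 - y1
        |       norm = math.hypot(dx, dy) or 1e-12
        |       dmax, imax = 0.0, 0
        |       for i, (x, y) in enumerate(points[1:-1], 1):
        |           d = abs(dy * (x - x1) - dx * (y - y1)) / norm
        |           if d > dmax:
        |               dmax, imax = d, i
        |       # Within tolerance: the chord suffices. Otherwise split
        |       # at the farthest point and recurse on both halves.
        |       if dmax <= epsilon:
        |           return [points[0], points[-1]]
        |       left = rdp(points[:imax + 1], epsilon)
        |       right = rdp(points[imax:], epsilon)
        |       return left[:-1] + right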
        
         | tda wrote:
          | Some (often desirable) properties of RDP for time series are
          | that it preserves detail at extreme events and does not shift
          | peaks. Very useful for e.g. seismic data. I have compressed
          | tons of data with 1% subsampling with hardly any information
          | loss.
        
       | the8472 wrote:
       | Why reduce it to a single line when you can paint hulls around
        | your curve? For coarser intervals you can also use candlestick
        | charts.
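        | 
        | A sketch of the hull idea, assuming a per-pixel-column min/max
        | envelope is what's meant; function and parameter names are
        | hypothetical:
        | 
        |   import numpy as np
        | 
        |   def minmax_hull(t, y, n_bins):
        |       # One bin per pixel column: record the min and max the
        |       # curve reaches in each bin, then draw the band (e.g.
        |       # matplotlib fill_between) instead of a single line.
        |       edges = np.linspace(t[0], t[-1], n_bins + 1)
        |       idx = np.searchsorted(edges, t, side="right") - 1
        |       idx = np.clip(idx, 0, n_bins - 1)
        |       lower = np.full(n_bins, np.inf)
        |       upper = np.full(n_bins, -np.inf)
        |       np.minimum.at(lower, idx, y)
        |       np.maximum.at(upper, idx, y)
        |       return edges[:-1], lower, upper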
        
       | mattiasfestin wrote:
        | This is a form of compression designed to preserve the chart's
        | visual appearance, if I understand it right. There is no
        | analysis of how the downsampling biases the data.
       | 
       | My instinct is the algorithm looks too specialized for this one
       | task (which is good sometimes, if used for the right task).
       | 
        | As a thought, could you not use an FFT to move to the frequency
        | domain, remove the high-frequency data, and then transform back
        | to the time domain?
       | 
       | Fourier transforms are used all over the place, and FFT libs are
       | usually well optimized and can even be hardware accelerated.
       | 
        | Without checking it, the cut-off frequency would be determined
        | by the Nyquist-Shannon sampling theorem, and should depend on
        | the width of the graph in pixels.
       | 
        | So if the graph is resized, recompute the downsampled time
        | series from the Nyquist-Shannon sampling limit of the new width.
       | 
        | A sliding DFT can also be used for real-time streaming data.
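        | 
        | A rough numpy sketch of that FFT route; the pixel-width cut-off
        | follows the Nyquist reasoning above. Note it synthesizes new
        | values rather than selecting original samples:
        | 
        |   import numpy as np
        | 
        |   def fft_downsample(y, width_px):
        |       # Keep only the lowest width_px/2 frequency bins
        |       # (width_px output samples can carry at most width_px/2
        |       # cycles), inverse-transform at the smaller length, and
        |       # rescale the amplitude.
        |       spectrum = np.fft.rfft(y)[: width_px // 2 + 1]
        |       scale = width_px / len(y)
        |       return np.fft.irfft(spectrum, n=width_px) * scale
        | 
        | On resize, simply call it again with the new pixel width.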
        
         | skykooler wrote:
         | Specifically, this is trying to downsample a chart while only
         | using points from the original time series. Something like an
         | FFT would generate a new series of points which would most
         | likely not be in the original dataset, though they might
         | visually represent them well.
        
         | FredFS456 wrote:
         | Why not just use a low-pass linear-phase FIR filter instead of
          | FFT/iFFT? Same result, but achieved in the time domain and
          | somewhat more performant.
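          | 
          | A sketch with scipy, assuming integer decimation; a symmetric
          | FIR is linear-phase, and filtfilt makes it zero-phase so
          | peaks stay put:
          | 
          |   from scipy import signal
          | 
          |   def fir_downsample(y, factor):
          |       # Low-pass just under the new Nyquist (cutoff is in
          |       # units of the original Nyquist), then keep every
          |       # factor-th sample.
          |       taps = signal.firwin(101, 0.9 / factor)
          |       filtered = signal.filtfilt(taps, [1.0], y)
          |       return filtered[::factor]
          | 
          | scipy.signal.decimate(y, factor, ftype="fir") packages up
          | much the same thing.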
        
       | MasterScrat wrote:
       | Interesting to see this pop up here! I was looking for a solution
       | to plot terabytes of time series during my MSc Thesis at CERN and
       | ended up adding support for this to a Grafana fork. Fun times.
       | 
       | Details here:
       | https://masterscrat.github.io/documents/Florian_Laurent_CERN...
        
       | jarenmf wrote:
        | A demo of the "Largest-Triangle-Three-Buckets" algorithm:
       | https://www.base.is/flot/
       | 
        | And the plugin used in the demo:
        | https://github.com/sveinn-steinarsson/flot-downsample
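        | 
        | For anyone curious, a compact Python sketch of LTTB as described
        | in the plugin's documentation; edge cases such as empty buckets
        | are ignored:
        | 
        |   import numpy as np
        | 
        |   def lttb(t, y, n_out):
        |       # Keep both endpoints; split the interior into n_out-2
        |       # buckets; from each bucket keep the point forming the
        |       # largest triangle with the previously kept point and
        |       # the next bucket's average.
        |       t, y = np.asarray(t, float), np.asarray(y, float)
        |       n = len(t)
        |       if n_out >= n or n_out < 3:
        |           return t, y
        |       edges = np.linspace(1, n - 1, n_out - 1).astype(int)
        |       keep = [0]
        |       for i in range(n_out - 2):
        |           lo, hi = edges[i], edges[i + 1]
        |           nhi = edges[i + 2] if i + 2 < len(edges) else n
        |           ax, ay = t[keep[-1]], y[keep[-1]]
        |           cx, cy = t[hi:nhi].mean(), y[hi:nhi].mean()
        |           # Twice the area of triangle (prev kept, candidate,
        |           # next-bucket average) for every candidate at once.
        |           area = np.abs((t[lo:hi] - ax) * (cy - ay)
        |                         - (cx - ax) * (y[lo:hi] - ay))
        |           keep.append(lo + int(area.argmax()))
        |       keep.append(n - 1)
        |       return t[keep], y[keep]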
        
       | dcl wrote:
       | Neat. I've noticed that a few stock market apps/sites definitely
       | use different methods to do this, and with dramatically different
       | results.
        
       | dabreegster wrote:
       | Does anybody know an approach for downsampling time series when
       | you need to compare two downsampled series? Let's say:
       | 
       | 1) You measure millions of (time, count) events, covering 24hrs.
       | 
       | 2) You downsample this somehow to save space.
       | 
       | 3) You start measuring (time, count) events again, covering 7
       | hours. You expect the data to match up with the previous
       | measurements, and you'd like to plot differences.
       | 
       | Is there a way to trim the downsampled 24hr set so that it
       | matches the downsampled 7hr set?
       | 
       | Motivation: https://github.com/a-b-street/abstreet/issues/85
        
         | 1996 wrote:
         | Do you have as many events in both samples?
         | 
          | If not, even if you measure the same thing, the variance may
          | differ. You can start applying the law of large numbers since
          | you mentioned "millions" and just consider a normal
          | distribution, but for rare events you need something different
          | (Poisson).
         | 
          | Are the series static? If not (ex: temperature) you will have
          | to model on time. As the link mentions people crossing an
          | intersection, I assume more people will cross at 9am than at
          | 2am (7 hours earlier). Likewise if one series is a weekday and
          | the other a weekend, you may have surprises!
         | 
         | You have to consider other things (ex: day of the week).
         | 
          | I would recommend avoiding the problem altogether: fit
          | something (KDE, ARIMA...) to your time series, and plot the
          | fit with the 95% CI from X=0 to X=24.
          | 
          | Do the same for the other series, from X=0 to X=7, and see if
          | they are at least within the same CI.
          | 
          | Then compute the correlation of the 2 fits (just to have a
          | reference number) and use that as a metric.
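          | 
          | A sketch of the KDE route with scipy; the event arrays are
          | hypothetical (hours since midnight), and the CI part is left
          | out for brevity:
          | 
          |   import numpy as np
          |   from scipy import stats
          | 
          |   def compare_rates(events_24h, events_7h):
          |       # Fit a KDE to each set of event times, evaluate both
          |       # on the overlapping 0-7h window, scale by event count
          |       # so the rates are comparable, and use the correlation
          |       # of the two fits as the reference metric.
          |       grid = np.linspace(0, 7, 200)
          |       a = stats.gaussian_kde(events_24h)(grid) * len(events_24h)
          |       b = stats.gaussian_kde(events_7h)(grid) * len(events_7h)
          |       r, _ = stats.pearsonr(a, b)
          |       return grid, a, b, r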
        
           | dabreegster wrote:
           | To be more specific, I'm measuring throughput along road
           | segments in a deterministic traffic simulation. Differences
           | would occur when the user modifies the road network. The
           | modifications often have no effect on most roads, but some
           | wind up with more or less traffic than usual at a certain
           | time.
           | 
           | I'm interested in showing differences like "during morning
           | rush hour, there was more traffic here, but it was about the
           | same the rest of the day", so I'm not sure a single
           | correlation as a metric would be best. Ideally I'd plot two
           | line plots and call out differences over time.
        
             | 1996 wrote:
              | There is going to be complex behavior (adapting to traffic
              | by routing around it) that a simulation will predict
              | poorly.
              | 
              | Also, you will have correlation between adjacent segments
              | of the network, and the idea of extracting "rush hours"
              | brings its own set of problems.
             | 
             | Consult with a statistician familiar with geographical
             | models. I'm sorry I can't help much more than that. It's
             | very different from finance.
        
               | dabreegster wrote:
               | Surprises from emergent behavior are some of the most
               | satisfying things about working on this.
               | 
               | I'm not trying to extract rush hour patterns or anything
               | quantitatively, just show two timeseries plots to the
               | user and have them figure out what's going on.
               | 
               | Thanks for the ideas! Lots of new things to research.
               | Stats isn't my background at all.
        
         | TrackerFF wrote:
         | Do you have some criteria for the error?
        
           | dabreegster wrote:
           | Not particularly. The data comes from a deterministic traffic
           | simulation, and I just want to show people roughly how their
           | changes to a road network affect throughput along road
           | segments. I'm interested in showing differences like "during
           | morning rush hour, there was more traffic here, but it was
           | about the same the rest of the day". "About the same" is
           | fuzzy.
        
         | contravariant wrote:
          | You're basically calculating a histogram anyway, so the most
          | sensible thing to do would be to calculate both histograms
          | with the same buckets (simplest is just the sum over every 5
          | minutes or so).
          | 
          | At any rate you want to do the same thing to both datasets,
          | and if it's counts of some random event then the only thing
          | you can really compare is the sum total over the same period.
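          | 
          | A numpy sketch of the shared-bucket histogram; the names and
          | the 5-minute default are illustrative:
          | 
          |   import numpy as np
          | 
          |   def bucket_counts(times, counts, bucket_s=300, t0=0.0,
          |                     t1=86400.0):
          |       # Sum (time, count) events into fixed-width buckets
          |       # anchored at t0; two datasets bucketed with the same
          |       # t0 and bucket_s compare bin by bin.
          |       n = int(np.ceil((t1 - t0) / bucket_s))
          |       out = np.zeros(n)
          |       idx = ((np.asarray(times) - t0) // bucket_s).astype(int)
          |       np.add.at(out, np.clip(idx, 0, n - 1), counts)
          |       return out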
        
           | dabreegster wrote:
            | This is basically the approach I'm taking now, with a
            | granularity of 1 hour. The problem is that when you're in
           | the middle of an hour, the bucket for the live sim only
           | accounts for half of the hour, but the stored bucket from the
           | full day's data represents the entire hour. I could assume
           | the bucket accumulates its count linearly through the hour,
           | and that'd be much better than my current approach, but it
           | still feels a bit weak.
           | 
           | 5 minute buckets would make this comparison problem easier,
           | but it's too much data to store.
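            | 
            | The linear-accumulation assumption is nearly a one-liner
            | (names made up):
            | 
            |   def live_vs_expected(live_count, stored_hour_count,
            |                        minutes_into_hour):
            |       # Prorate the stored full-hour bucket by the elapsed
            |       # fraction before comparing it against the live,
            |       # partially-filled bucket.
            |       expected = stored_hour_count * minutes_into_hour / 60.0
            |       return live_count - expected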
        
             | dabreegster wrote:
              | https://github.com/a-b-street/abstreet/issues/85#issuecommen...
             | 
             | Here's a video of the problem with the current approach,
             | which is storing a count per hour. No sliding windows;
             | 7:59am and 8:01am are different buckets. Green is "less
             | than usual", red is "more than usual." Every time the hour
             | resets, lots of green appears, and slowly fades.
        
         | benlivengood wrote:
          | I've seen this called window alignment. You can pick a default
          | and stick to it everywhere, e.g. truncate to the nearest
          | minute or Nth minute; choosing a divisor of an hour, for
          | example, makes it easy to align different sources at any
          | starting point.
         | 
         | Resampling is pretty effective; linear interpolation is usually
         | good enough because the data is already downsampled, presumably
         | with an averaging or aggregating step that can be interpolated
         | over. The mean-value theorem is your friend if you're looking
         | at derivatives of time series data.
         | 
         | The type of metric also matters; gauges can be simply
         | interpolated. Counters make more sense to re-bucket into the
         | chosen sampling period and offset, and for your specific
         | question what you can do is switch from downsampled data to
         | "realtime" data exactly at the bucket boundary so that you
         | avoid the half-full bucket problem.
         | 
         | A good way to treat counter timeseries is as bundles of updates
         | (deltas) that arrive at a certain cadence. That makes the
         | transition from different cadences smooth by resampling to
         | avoid overlapping updates, e.g. if you have a downsampled
         | bucket counting events from 1:00 to 1:10 then resample other
         | timeseries so that buckets begin exactly at 1:10 or end exactly
         | at 1:00 instead of trying to let them overlap.
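          | 
          | A sketch of the truncation form of window alignment; with
          | bucket_s a divisor of an hour, buckets from different sources
          | line up regardless of when collection started:
          | 
          |   def bucket_start(ts_epoch_s, bucket_s=600):
          |       # Truncate a timestamp down to its bucket boundary.
          |       return ts_epoch_s - (ts_epoch_s % bucket_s)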
        
           | dabreegster wrote:
           | I haven't had nearly enough caffeine today to process this
           | properly, but thank you -- it partly makes sense, and it's
           | plenty to go off of later!
        
       ___________________________________________________________________
       (page generated 2021-03-09 23:00 UTC)