[HN Gopher] Downsampling Time Series for Visual Representation [pdf]
___________________________________________________________________
 
 Downsampling Time Series for Visual Representation [pdf]
 
 Author : todsacerdoti
 Score  : 53 points
 Date   : 2021-03-09 18:32 UTC (4 hours ago)
 
 (HTM) web link (skemman.is)
 (TXT) w3m dump (skemman.is)
 
 | elsherbini wrote:
 | This looks cool; I especially like the "intuitive" longest-line
 | algorithm.
 |
 | ~6 years ago I was trying to build a dashboard that displayed
 | real-time measurements from a beehive. The sensors would take
 | temperature, weight, humidity, etc. Back then I used simplify.js
 | [0], which uses two simplification algorithms in two passes; the
 | more compute-intensive one is the Ramer-Douglas-Peucker algorithm
 | [1]. One issue I had with streaming data is that new data points
 | could change the already-drawn line, at least with my naive
 | implementation. I'd love to see a real-time / streaming time-
 | series simplification algorithm where past points don't appear
 | to change wildly as downsampling continues.
 |
 | [0] https://github.com/mourner/simplify-js
 |
 | [1] https://en.wikipedia.org/wiki/Ramer%E2%80%93Douglas%E2%80%93...
 
   | tda wrote:
   | Some (often desirable) properties of RDP for time series are
   | that it preserves detail at extreme events and does not shift
   | peaks. Very useful for e.g. seismic data. I have compressed
   | tons of data down to 1% of the samples with hardly any
   | information loss.
 
 | the8472 wrote:
 | Why reduce it to a single line when you can paint hulls around
 | your curve? For coarser intervals you can also use candlestick
 | charts.
 
 | mattiasfestin wrote:
 | This is a form of compression designed to preserve the chart's
 | visual appearance, if I understand it right. There is no analysis
 | of how the downsampling biases the data.
 |
 | My instinct is that the algorithm looks too specialized for this
 | one task (which is sometimes good, if used for the right task).
 |
 | As a thought, could you not apply an FFT to move to the frequency
 | domain, remove the high-frequency components, and then transform
 | back to the time domain?
 |
 | Fourier transforms are used all over the place, and FFT libraries
 | are usually well optimized and can even be hardware accelerated.
 |
 | Without checking it, the cutoff frequency would be determined by
 | the Nyquist-Shannon sampling theorem, and should depend on the
 | width of the graph in pixels.
 |
 | So if the graph is resized, recompute the downsampled time series
 | from the Nyquist-Shannon sampling limit of the new width.
 |
 | A sliding DFT could also be used to handle streaming data in real
 | time.
 
   | skykooler wrote:
   | Specifically, this is trying to downsample a chart while only
   | using points from the original time series. Something like an
   | FFT would generate a new series of points which would most
   | likely not be in the original dataset, though they might
   | visually represent it well.
 
   | FredFS456 wrote:
   | Why not just use a low-pass linear-phase FIR filter instead of
   | FFT/iFFT? Same result, but achieved in the time domain and
   | somewhat more performant.
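 A minimal sketch of the FFT low-pass idea discussed above, assuming
 NumPy; the pixel-derived cutoff follows the Nyquist reasoning in the
 comment, while the decimation step and the function name are
 illustrative assumptions. Note skykooler's caveat applies: the output
 points are newly synthesized values, not points from the original
 series.

```python
import numpy as np

def fft_downsample(values, width_px):
    # A sketch: low-pass in the frequency domain, then decimate.
    n = len(values)
    spectrum = np.fft.rfft(values)
    # width_px pixels can represent at most ~width_px/2 cycles
    # (Nyquist), so zero out everything above that.
    cutoff = min(width_px // 2, len(spectrum))
    spectrum[cutoff:] = 0
    smoothed = np.fft.irfft(spectrum, n)
    # Keep roughly one point per pixel column.
    step = max(1, n // width_px)
    return smoothed[::step]

# On resize, recompute from the new width, as suggested above.
series = np.cumsum(np.random.randn(100_000))
points = fft_downsample(series, width_px=800)
```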
 | MasterScrat wrote:
 | Interesting to see this pop up here! I was looking for a way to
 | plot terabytes of time series during my MSc thesis at CERN, and
 | ended up adding support for this to a Grafana fork. Fun times.
 |
 | Details here:
 | https://masterscrat.github.io/documents/Florian_Laurent_CERN...
 
 | jarenmf wrote:
 | A demo of the "Largest-Triangle-Three-Buckets" algorithm:
 | https://www.base.is/flot/
 |
 | And the plugin used in the demo:
 | https://github.com/sveinn-steinarsson/flot-downsample
 
 | dcl wrote:
 | Neat. I've noticed that a few stock market apps/sites definitely
 | use different methods to do this, with dramatically different
 | results.
 
 | dabreegster wrote:
 | Does anybody know an approach for downsampling time series when
 | you need to compare two downsampled series? Let's say:
 |
 | 1) You measure millions of (time, count) events, covering 24
 | hours.
 |
 | 2) You downsample this somehow to save space.
 |
 | 3) You start measuring (time, count) events again, covering 7
 | hours. You expect the data to match up with the previous
 | measurements, and you'd like to plot differences.
 |
 | Is there a way to trim the downsampled 24-hour set so that it
 | matches the downsampled 7-hour set?
 |
 | Motivation: https://github.com/a-b-street/abstreet/issues/85
 
   | 1996 wrote:
   | Do you have as many events in both samples?
   |
   | If not, even if you measure the same thing, the variance may
   | differ. You can start applying the law of large numbers since
   | you mentioned "millions" and just assume a normal
   | distribution, but for rare events you need something different
   | (Poisson).
   |
   | Are the series static? If not (e.g. temperature), you will
   | have to model time. As the link mentions people crossing an
   | intersection, I assume more people cross at 9am than at 2am (7
   | hours apart). Likewise, if one series is a weekday and the
   | other a weekend, you may have surprises!
   |
   | You also have to consider other things (e.g. day of the week).
   |
   | I would recommend avoiding the problem altogether: fit
   | something (KDE, ARIMA...) to your time series, and plot the
   | fit with the 95% CI from X=0 to X=24.
   |
   | Do the same for the other series, from X=0 to X=7, and see if
   | they are at least within the same CI.
   |
   | Then compute the correlation of the two fits (just to have a
   | reference number) and use that as a metric.
 
     | dabreegster wrote:
     | To be more specific, I'm measuring throughput along road
     | segments in a deterministic traffic simulation. Differences
     | occur when the user modifies the road network. The
     | modifications often have no effect on most roads, but some
     | wind up with more or less traffic than usual at certain
     | times.
     |
     | I'm interested in showing differences like "during morning
     | rush hour, there was more traffic here, but it was about the
     | same the rest of the day", so I'm not sure a single
     | correlation metric would be best. Ideally I'd plot two line
     | plots and call out differences over time.
 
       | 1996 wrote:
       | There is going to be complex behavior (adapting to traffic
       | by routing around it) that a simulation will predict
       | poorly.
       |
       | Also, you will have correlation between adjacent segments
       | of the network, and the idea of extracting "rush hours"
       | brings its own set of problems.
       |
       | Consult a statistician familiar with geographical models.
       | I'm sorry I can't help much more than that; it's very
       | different from finance.
 
         | dabreegster wrote:
         | Surprises from emergent behavior are some of the most
         | satisfying things about working on this.
         |
         | I'm not trying to extract rush-hour patterns or anything
         | quantitative, just show two time-series plots to the
         | user and have them figure out what's going on.
         |
         | Thanks for the ideas! Lots of new things to research.
         | Stats isn't my background at all.
 
   | TrackerFF wrote:
   | Do you have some criteria for the error?
 
     | dabreegster wrote:
     | Not particularly. The data comes from a deterministic
     | traffic simulation, and I just want to show people roughly
     | how their changes to a road network affect throughput along
     | road segments. I'm interested in showing differences like
     | "during morning rush hour, there was more traffic here, but
     | it was about the same the rest of the day". "About the same"
     | is fuzzy.
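 One way to make "about the same" concrete, sketched here rather than
 taken from the thread: aggregate both runs into identical fixed
 buckets, trim the full-day series to the buckets the shorter run has
 completely covered, and flag differences larger than simple count
 noise. The hourly bucket width, the 2*sqrt(n) threshold, and the
 synthetic data are illustrative assumptions.

```python
import numpy as np

def bucket_counts(times_s, counts, bucket_s=3600, horizon_s=24 * 3600):
    # Sum (time, count) events into fixed, shared buckets so two
    # series downsampled this way stay directly comparable.
    edges = np.arange(0, horizon_s + bucket_s, bucket_s)
    sums, _ = np.histogram(times_s, bins=edges, weights=counts)
    return sums

rng = np.random.default_rng(0)
base_t = rng.uniform(0, 24 * 3600, 1_000_000)  # full-day baseline
live_t = rng.uniform(0, 7 * 3600, 300_000)     # 7 hours so far

base = bucket_counts(base_t, np.ones_like(base_t))
live = bucket_counts(live_t, np.ones_like(live_t))

# Trim the 24h series to the live run's complete buckets only.
n_full = 7
diff = live[:n_full] - base[:n_full]
# Poisson-style noise band: differences within ~2*sqrt(count)
# read as "about the same" (an illustrative threshold).
flagged = np.abs(diff) > 2 * np.sqrt(np.maximum(base[:n_full], 1))
print(diff, flagged)
```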
   | contravariant wrote:
   | You're basically calculating a histogram anyway, so the most
   | sensible thing to do would be to calculate both histograms
   | with the same buckets (simplest is just the sum every 5
   | minutes or so).
   |
   | At any rate, you want to do the same thing to both datasets,
   | and if it's counts of some random event then the only thing
   | you can really compare is the sum total over the same period.
 
     | dabreegster wrote:
     | This is basically the approach I'm using now, with the
     | granularity being 1 hour. The problem is that when you're in
     | the middle of an hour, the bucket for the live sim only
     | accounts for half of the hour, but the stored bucket from
     | the full day's data represents the entire hour. I could
     | assume the bucket accumulates its count linearly through the
     | hour, and that'd be much better than my current approach,
     | but it still feels a bit weak.
     |
     | 5-minute buckets would make this comparison problem easier,
     | but it's too much data to store.
 
     | dabreegster wrote:
     | https://github.com/a-b-street/abstreet/issues/85#issuecommen...
     |
     | Here's a video of the problem with the current approach,
     | which stores a count per hour. No sliding windows; 7:59am
     | and 8:01am are in different buckets. Green is "less than
     | usual", red is "more than usual". Every time the hour
     | resets, lots of green appears and slowly fades.
 
   | benlivengood wrote:
   | I've seen this called window alignment. You can pick a default
   | and stick to it everywhere, e.g. truncate to the nearest
   | minute or Nth minute, but this only works for divisors of an
   | hour, for example, so that it's easy to align different
   | sources at any starting point.
   |
   | Resampling is pretty effective; linear interpolation is
   | usually good enough because the data is already downsampled,
   | presumably with an averaging or aggregating step that can be
   | interpolated over. The mean value theorem is your friend if
   | you're looking at derivatives of time-series data.
   |
   | The type of metric also matters; gauges can simply be
   | interpolated. Counters make more sense to re-bucket into the
   | chosen sampling period and offset, and for your specific
   | question, you can switch from downsampled data to "realtime"
   | data exactly at the bucket boundary so that you avoid the
   | half-full-bucket problem.
   |
   | A good way to treat counter time series is as bundles of
   | updates (deltas) that arrive at a certain cadence. That makes
   | the transition between different cadences smooth: resample to
   | avoid overlapping updates, e.g. if you have a downsampled
   | bucket counting events from 1:00 to 1:10, then resample the
   | other time series so that its buckets begin exactly at 1:10 or
   | end exactly at 1:00, instead of letting them overlap.
 
     | dabreegster wrote:
     | I haven't had nearly enough caffeine today to process this
     | properly, but thank you -- it partly makes sense, and it's
     | plenty to go off of later!
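 A small sketch of the counter re-bucketing and boundary switching
 described above, assuming the data arrives as (time, delta) updates;
 the function names, the 10-minute cadence, and the sample data are
 illustrative assumptions.

```python
def rebucket(updates, start_s, bucket_s):
    # Re-bucket counter updates (time_s, delta) into buckets that
    # begin exactly at start_s, so two sources share boundaries.
    buckets = {}
    for t, d in updates:
        edge = start_s + ((t - start_s) // bucket_s) * bucket_s
        buckets[edge] = buckets.get(edge, 0) + d
    return sorted(buckets.items())

def stitch(stored, realtime, boundary_s):
    # Stored downsampled buckets strictly before the boundary,
    # realtime data from the boundary on: no half-full stored
    # bucket ever enters the comparison.
    head = [(t, v) for t, v in stored if t < boundary_s]
    tail = [(t, v) for t, v in realtime if t >= boundary_s]
    return head + tail

# 10-minute buckets aligned to t=0 (e.g. the top of the hour).
updates = [(30, 2), (610, 5), (1230, 1), (3599, 4)]
print(rebucket(updates, start_s=0, bucket_s=600))
# -> [(0, 2), (600, 5), (1200, 1), (3000, 4)]
```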