[HN Gopher] I booted Linux 293k times in 21 hours
       ___________________________________________________________________
        
       I booted Linux 293k times in 21 hours
        
       Author : jandeboevrie
       Score  : 560 points
       Date   : 2023-06-14 13:54 UTC (9 hours ago)
        
 (HTM) web link (rwmj.wordpress.com)
 (TXT) w3m dump (rwmj.wordpress.com)
        
       | ineedasername wrote:
       | If there is a platonic ideal of 'uptime' then this has got to be
       | its opposite.
        
       | w-m wrote:
       | 292,612 is not an interesting number, it's not contained in any
       | known integer sequence. The search in OEIS only brings up
       | sequence A292612
       | (https://oeis.org/search?q=292612&fmt=data&sort=number).
        
         | akira2501 wrote:
         | 2 * 2 * 191 * 383
         | 
         | Which is mildly interesting.
        
           | w-m wrote:
           | Neat indeed.
           | 
           | 3 * 2 ^ {0, 0, 6, 7} - 1
           | 
           | And all of them are palindromes.
        
             | jwilk wrote:
             | For people confused by the above notation:
             | 
             | 2 = 3 x 2^0 - 1
             | 
             | 191 = 3 x 2^6 - 1
             | 
             | 383 = 3 x 2^7 - 1
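
The arithmetic in this subthread is easy to check mechanically. A minimal sketch, assuming nothing beyond the numbers quoted above (the helper names `factorize` and `is_palindrome` are made up for illustration):

```python
def factorize(n: int) -> list[int]:
    """Return the prime factors of n, with multiplicity, by trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors


def is_palindrome(n: int) -> bool:
    """True if the decimal digits of n read the same in both directions."""
    s = str(n)
    return s == s[::-1]


# The boot count from the article title, as quoted in the thread.
factors = factorize(292_612)

# The "3 * 2 ^ {0, 0, 6, 7} - 1" notation, expanded one exponent at a time.
from_formula = [3 * 2**k - 1 for k in (0, 0, 6, 7)]

print(factors)
print(from_formula)
print(all(is_palindrome(f) for f in factors))
```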
        
         | high_pathetic wrote:
         | > it's not contained in any known integer sequence
         | 
         | I think this makes it interesting!
        
           | w-m wrote:
           | Ah yes, the good old
           | https://en.wikipedia.org/wiki/Interesting_number_paradox
        
           | Dylan16807 wrote:
           | Then your standard is too low.
           | 
           | And I mean that objectively. That standard would not allow an
           | uninteresting number.
        
       | adverbly wrote:
       | Make sure you add it to the integration test suite so it doesn't
       | get re-introduced later ;)
        
       | vintagedave wrote:
       | > I found the culprit, a regression in the printk time feature:
       | https://lkml.org/lkml/2023/6/13/733
       | 
       | The issue hasn't been fixed yet, but if it affects you the
       | proximate cause is known and can be reverted locally.
        
       | efitz wrote:
       | I told him not to turn on Windows Update.
        
       | Laremere wrote:
       | Here they mention that each bisect ran a large number of times to
       | try and catch the rare failure. Reminds me of a previous
       | experience:
       | 
       | We had a large integration test suite. It made calls to an
       | external service, and took ~45 minutes to fully run. Since it
       | needed an exclusive lock on an external account, it could only
       | run a few tests at a time. We started getting random failures, so
       | we were in a tough spot: bisecting didn't work because the
       | failure wasn't consistent, and you couldn't run a single version
       | of a test enough times to verify that a given version definitely
       | did or didn't have the failure in any practical way. I ended up
       | triggering a spread of runs over night, and then used Bayesian
       | statistics to home in on where the failure was introduced. I felt
       | mighty proud about figuring that out.
       | 
       | Unfortunately, it turns out the tests were more likely to pass at
       | night when the systems were under less strain, so my prior for
       | the failure rate was off and all the math afterwards pointed to
       | the wrong range of commits.
       | 
       | Ultimately, the breakage got worse and I just read through a
       | large number of changes trying to find a likely culprit. After
       | finally finding the change, I went to fix it only to see that the
       | breakage had been fixed by a different team an hour or so before.
       | It turned out to be one of our dependencies turning on a feature
       | by slowly increasing the probability it was used. So when the
       | feature was on it broke our tests.
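
The comment does not say how the Bayesian analysis was set up, so the following is only one possible sketch. It assumes a single commit introduced the problem, that runs before that commit always pass, and that runs at or after it fail independently with a known rate `fail_rate`. The function name and the model are assumptions, not the commenter's method.

```python
def posterior_first_bad(n_commits, observations, fail_rate):
    """Posterior probability that each commit is the first bad one.

    observations is a list of (commit_index, passed) pairs. The prior over
    the first bad commit is uniform. fail_rate is the assumed probability
    that a run fails once the bad commit is included.
    """
    weights = []
    for k in range(n_commits):
        likelihood = 1.0
        for commit, passed in observations:
            if commit < k:
                # Before the bad commit the model says runs never fail.
                likelihood *= 1.0 if passed else 0.0
            else:
                likelihood *= (1.0 - fail_rate) if passed else fail_rate
        weights.append(likelihood)
    total = sum(weights)
    if total == 0.0:
        raise ValueError("observations are impossible under this model")
    return [w / total for w in weights]


# Hypothetical overnight spread: five runs each at commits 1, 4 and 6,
# with a single failure seen at commit 6.
runs = [(1, True)] * 5 + [(4, True)] * 5 + [(6, False)] + [(6, True)] * 4
for commit, prob in enumerate(posterior_first_bad(8, runs, fail_rate=0.2)):
    print(commit, round(prob, 3))
```

Everything downstream depends on `fail_rate`. If the real failure rate at night is lower than the assumed one, passing runs count as stronger evidence of a good commit than they should, which is the trap the comment describes.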
        
         | ambicapter wrote:
         | > Ultimately, it turned out to be one of our dependencies
         | turning on a feature by slowly increasing the probability it
         | was used.
         | 
         | Wow. I feel like this dependency should be named and shamed.
        
           | thehappypm wrote:
           | Isn't this how multi-armed bandits work?
        
             | rootsudo wrote:
             | algo 101, but I can see how it can be nifty for
             | $internalapp.
        
           | Laremere wrote:
           | Big company internal dependency. So nothing for the public to
           | care about.
        
             | vamega wrote:
             | What company? I've seen this being done (and my team does
             | it a lot at Amazon) but curious to know if others are doing
             | it at build time too.
             | 
             | If done in a company with a monorepo I'd be especially
             | interested in hearing more.
        
               | aeyes wrote:
               | > If done in a company with a monorepo I'd be especially
               | interested in hearing more
               | 
               | Are there any big companies left which haven't adopted a
               | monorepo?
        
               | PartiallyTyped wrote:
               | AWS. We probably have the worst build systems :(
        
         | n49o7 wrote:
         | Probabilistic feature flags! Love it.
        
           | thehappypm wrote:
           | Multi-armed bandits utilize this
        
           | Thorrez wrote:
           | Always base the probability on something stable, such as hash
           | of the username.
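
A minimal sketch of the idea, assuming SHA-256 as the hash and a per-flag salt so that different flags select different users. The names `in_rollout`, `stable_id` and `flag_name` are hypothetical; `stable_id` stands for whatever identifier never changes for a given user.

```python
import hashlib


def in_rollout(stable_id: str, flag_name: str, percent: float) -> bool:
    """Decide deterministically whether stable_id falls inside a rollout.

    The same (stable_id, flag_name) pair always maps to the same bucket in
    [0, 1), so raising percent only ever adds users and never removes one.
    """
    digest = hashlib.sha256(f"{flag_name}:{stable_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < percent / 100.0


# The answer for one user is stable across calls and across processes.
print(in_rollout("user-1234", "new_checkout", 10.0))
print(in_rollout("user-1234", "new_checkout", 10.0))
```

Because the decision is a pure function of its inputs, a failing test can be re-run with the same identifier and percentage and will see the same flag state.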
        
             | IshKebab wrote:
             | Bug report: changing my username breaks $product.
             | 
             | Yeah no thanks. It's probably better than completely random
             | but software should be predictable and unsurprising.
        
               | btilly wrote:
               | I've used the hash of username+string trick before for a
               | flag. I used it to replace a home-grown heavyweight A/B
               | testing framework which had turned into a performance
               | bottleneck.
               | 
               | It worked quite well.
        
               | burnished wrote:
               | The important part is the stability - if your usernames
               | can change then they aren't stable so you don't select
               | it.
               | 
               | I think it is a good reminder that most things you think
               | of as being unchanging that are also directly related to
               | a person.. aren't unchanging. Or at least any conceivable
               | attribute probably has some compelling reason why
               | someone will need to change it.
        
               | dietr1ch wrote:
               | That's why you have internal user ids instead of using
               | data directly provided by users.
               | 
               | Will it cost an extra lookup? It's cheap, and if you
               | really need to, you could embed the lookup in some
               | encrypted cookie so you can verify you approved some
               | name->id mapping recently without doing a lookup.
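
One way to sketch the cookie idea with the standard library only. Note the substitution: this signs the mapping with an HMAC rather than encrypting it, so the contents are readable by the client but cannot be altered without detection. `SECRET_KEY`, `issue_token` and `verify_token` are hypothetical names, and the token layout is an assumption.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical server-side key


def issue_token(username, user_id, ttl_seconds=3600, now=None):
    """Return a signed token recording an approved name -> id mapping."""
    now = time.time() if now is None else now
    claims = {"name": username, "id": user_id, "exp": now + ttl_seconds}
    payload = json.dumps(claims, sort_keys=True).encode("utf-8")
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode("ascii")
        + "."
        + base64.urlsafe_b64encode(signature).decode("ascii")
    )


def verify_token(token, now=None):
    """Return the user id if the token is authentic and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:
        return None  # wrong number of parts or invalid base64
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None
    claims = json.loads(payload)
    if claims["exp"] < now:
        return None
    return claims["id"]
```

A server would fall back to the real lookup whenever `verify_token` returns `None`, so an expired or tampered cookie costs one extra query and nothing more.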
        
               | robocat wrote:
               | > changing my username breaks $product.
               | 
               | https://m.youtube.com/watch?v=r-TLSBdHe1A&t=14m10s
               | 
               | Discusses a performance regression caused by a longer
               | username: the username is in an environment variable,
               | which changes the memory layout of the process.
        
         | painted-now wrote:
         | Man, this story sounds like you could be on my team :-) Pretty
         | much experienced the same stuff working at BigCo!
         | 
         | In the end, I think the real problem is that you can't test all
         | combinations of experiments. I don't trust "all off" or "all
         | on" testing. In my book, you should indeed sample from the true
         | distribution of experiments that real users see. Yes, you get
         | flaky tests, but you also actually test what matters most, i.e.
         | what users will - statistically - see.
        
           | joosters wrote:
           | This sounds like a situation that would benefit from using an
           | approach like all-pairs testing -
           | https://en.wikipedia.org/wiki/All-pairs_testing
           | 
           | Basically, if you have N different features (let's assume
           | they are all on/off switches, but it works for multi-values
           | too), in theory you'd need to run 2^N tests to cover them
           | all, which would become completely impractical. But, you can
           | generate a far, far smaller set of test setups that guarantee
           | that every pair of features gets tested together. Run those
           | tests and you'll probably encounter most feature-interaction
           | bugs in a much quicker time.
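
To make the size difference concrete, here is a rough greedy sketch for on/off flags. It is not the algorithm of any particular pairwise tool, and it enumerates all 2^N candidate configurations, so it only illustrates the idea for small N. The function name `all_pairs_suite` is made up.

```python
from itertools import combinations, product


def all_pairs_suite(n_flags: int) -> list[tuple[bool, ...]]:
    """Greedily choose configurations until every pair of flags has been
    exercised in all four on/off combinations."""
    uncovered = {
        (i, j, a, b)
        for i, j in combinations(range(n_flags), 2)
        for a, b in product((False, True), repeat=2)
    }
    candidates = list(product((False, True), repeat=n_flags))
    suite = []
    while uncovered:
        def newly_covered(cfg):
            return {
                (i, j, a, b)
                for (i, j, a, b) in uncovered
                if cfg[i] == a and cfg[j] == b
            }

        # Pick the configuration that covers the most still-missing pairs.
        best = max(candidates, key=lambda cfg: len(newly_covered(cfg)))
        uncovered -= newly_covered(best)
        suite.append(best)
    return suite


suite = all_pairs_suite(6)
print(len(suite), "configurations instead of", 2**6)
for cfg in suite:
    print("".join("1" if on else "0" for on in cfg))
```

As the reply below points out, this covers pairs only. A bug that needs three or more specific flags set together can still slip through.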
        
             | cscheid wrote:
             | All-pairs is for _pairs of features_. For subsets you're in
             | much deeper trouble because of the exponential dependence
             | on N. For a fixed polynomial dependence, you can get clever
             | and let tail bounds eventually work for you, but for
             | exponentially growing hypothesis sets, that won't work.
        
         | yojo wrote:
         | Yikes!
         | 
         | FWIW, I think best practice here is to hardcode all feature
         | flags to off in the integration test suite, unless explicitly
         | overwritten in a test. Otherwise you risk exactly these sorts
         | of heisenbugs.
         | 
         | At a BigCo that's probably going to require coordinating with
         | an internal tools team, but worth getting it on their backlog.
         | All tests should be as deterministic as possible, and this goes
         | double for integration tests that can flake for reasons outside
         | of the code.
        
           | btilly wrote:
           | No, the best practice is that on each test run, every feature
           | flag used implicitly or explicitly needs to be captured AND
           | it must be possible to re-run the test with the same set of
           | feature flags.
           | 
           | That way when you get a failure, you can reproduce it. And
           | then one of the easy things to do is test which features may
           | have contributed to it.
        
           | nosefrog wrote:
           | But then you won't catch the bug before it hits production :)
        
             | dmoy wrote:
             | Also you end up with some strange long term test behavior.
             | Because people will often leave feature flags in place long
             | after full release ( _years_ sometimes), you end up with a
             | default-off-in-tests only testing behavior with everything
             | newer than N years since the last feature flag cleanup
             | disabled.
             | 
             | Yes it's kinda fractal of bad practices that have to align
             | for this problem to occur, but that's the nature of tech
             | debt.
        
               | linuxdude314 wrote:
               | You are both misunderstanding the post.
               | 
               | He's not saying to alter any of the feature flags used
               | for the test, but simply to record which were used during
               | the test.
               | 
               | Simply logging doesn't introduce any of the issues you
               | are describing.
        
         | ASinclair wrote:
         | This is my daily life at BigCo. These bugs are the worst.
        
       | anotherhue wrote:
       | Excellent 'obsessed detective' story
        
         | hinkley wrote:
         | I used to think I was amazing at performance tuning and
         | debugging but after working with a few hundred different people
         | it turns out I'm just really fucking stubborn. I am not going
         | to shrug at this bug again. You are going down. I do have a
         | better way of processing concurrency information in my head,
         | but the rest is just elbow grease.
         | 
         | I had a friend in college who was dumb as a post but could
         | study like nobody's business. Some of us skated through, some
         | of us earned our degree, but he really _earned_ his. We became
         | friends over computer games and for a long time I wondered if
         | games and fiction were the only things we had in common. Turns
         | out there's maybe more to that story than I thought at the
         | time.
        
           | allenrb wrote:
           | I think you're absolutely right. Some of the things I've been
           | most proud of have been products of stubbornly refusing to
           | give up. On the other hand, some vast oceans of wasted time
           | have been another result. It's tricky to know _when_ to be
           | tenacious!
        
             | hinkley wrote:
             | In my defense, I am a strong proponent of refactoring to
             | make all problems shallow. So there are classes of bug that
             | I will see before anyone else because I move the related
             | bits around and it becomes obvious that there are missing
             | modes in the decision tree.
             | 
             | I tend to believe that discipline and tenacity are separate
             | traits. Often appearing in the same people, but different
             | skills with different exercises.
        
               | allenrb wrote:
               | Bingo, that is very well put. Discipline is where I'll
               | tend to fall short. :-)
        
       | 7ewis wrote:
       | Reminds me how cosmic rays were noted to have caused computer
       | glitches. [0]
       | 
       | Impressive that they managed to discover this bug.
       | 
       | [0] - https://www.bbc.com/future/article/20221011-how-space-
       | weathe...
        
         | Musky wrote:
         | In the speed running community there is a pretty famous clip
         | [0], where a glitch caused a Super Mario speed runner to
         | suddenly teleport to the platform above him, saving him some
         | valuable time.
         | 
         | Of course people tried to find ways to reproduce the bug
         | reliably, as saving even milliseconds can mean everything in a
         | speed run. They went as far as replicating the state of the
         | game from the original occurrence 1:1, but AFAIK no one has
         | been able to reproduce the glitch without messing with the
         | game's memory.
         | 
         | For that reason it is speculated that a cosmic ray caused a
         | bit-flip in the byte that stores the player's y coordinate,
         | shooting him up into the air and onto the next platform.
         | 
         | [0] - https://youtu.be/o3Cx2wmFyQQ?t=16
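
To see how one flipped bit can move a coordinate a long way, here is a small sketch. It assumes the coordinate is stored as a 32-bit IEEE 754 float, and the starting height is an arbitrary illustrative value; neither detail comes from the comment above.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) in the 32-bit float encoding of value."""
    (word,) = struct.unpack(">I", struct.pack(">f", value))
    word ^= 1 << bit
    (flipped,) = struct.unpack(">f", struct.pack(">I", word))
    return flipped


height = -4207.0  # illustrative y coordinate, not taken from the game
print(height, "->", flip_bit(height, 24))
```

Bit 24 belongs to the exponent field, so flipping it rescales the value by a power of two instead of nudging it. Here -4207.0 becomes -1051.75, a large jump upward from a single bit.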
        
       | DerekBickerton wrote:
       | Before clicking I thought someone kept note of how many times
       | Linux booted in regard to their computing habits, and not testing
       | software. I know for me I boot roughly 3 times a day into
       | different machines, do my work, shut down, then rinse & repeat.
       | 
       | Then you have those types who put their machine into
       | hibernate/sleep with 100+ Chrome tabs open and never do a full
       | boot ritual. Boggles my mind that people do that.
        
         | Tubru3dhb22 wrote:
         | > Boggles my mind that people do that.
         | 
         | Why? I only restart my (Linux) laptop every 3-4 months when I
         | update software.
         | 
         | I can't think of any downside that I've experienced from this
         | practice. I do a lot of work with data loaded in a REPL, so
         | it's certainly saved me time having everything restored to as I
         | left it.
        
         | bbarn wrote:
         | I had a developer that I inherited from a previous manager some
         | years ago. Made tons of excuses about his machine, the
         | complexity of the problem, etc. I offered to check his machine
         | out and he refused because it had "private stuff" on it. He had
         | the same machine as the rest of the team, so since he hadn't
         | made a commit in two weeks on a relatively simple problem,
         | refused help from anyone, etc., we ultimately let him go.
         | 
         | When we looked at his PC to see if there was anything useful
         | from the project, his browser had around a thousand tabs open.
         | Probably 80% of them were duplicates of other tabs, linking to
         | the same couple stack overflow and C# sites for really basic
         | stuff. The other 20% were... definitely "private stuff".
        
           | hinkley wrote:
           | I'm at the other extreme of "private stuff". Nothing work
           | related should live on my work machine. It should all be
           | pushed to git or dumped in the wiki (personal pages if
           | nothing else).
           | 
           | On one of my largest projects the IT dept made bulk orders
           | for hardware and doled them out to new hires. 18 months into
           | our new project someone's hard drive died.
           | 
           | Everyone acted like his dog died. I said no problem let's go
           | through the onboarding docs. The longest step by far was that
           | the company mandated Whole Disk Encryption but IT hadn't put
           | it in their old inventory yet. So that was 2/3 of setup time.
           | We found some issues with the docs and fixed them.
           | 
           | Every two to four weeks that summer, someone else's drive
           | would go. You see, we got all of these machines from the same
           | production run. So the hard drives came from the same
           | production run, which was apparently faulty. The process got
           | a little faster as we went. By the end of the summer it was
           | my turn, and people still looked at me like I needed
           | condolences. I got a faster machine for a few hours worth of
           | work. I'm not sad. All my stuff was in the network already. I
           | lost a couple hours of work, tops.
        
             | opello wrote:
             | This is the best way to reduce bus factor and not fall
             | behind documenting key details!
        
             | teachrdan wrote:
             | > Nothing work related should live on my work machine.
             | 
             | I thought this was a typo at first. Love this as an
             | engineering koan.
        
               | noSyncCloud wrote:
               | And the corollary - nothing personal should be on your
               | work machine, either
        
               | canucker2016 wrote:
               | "Nothing work related should live ONLY on my work
               | machine." is the intent.
        
           | sureglymop wrote:
           | He was let go after two weeks? No confrontation, nothing?
           | 
           | Sounds very American. In European working culture if you
           | don't show up for two weeks people will be worried that
           | something happened to you and try to work it out with you.
           | This type of all or nothing reaction is a bit sporadic imo.
        
             | mikestew wrote:
             | _Sounds very american._
             | 
             | Yeah, it's not like that part of the story was condensed
             | and might have left out a bunch of details that weren't
             | important to the story. So let's give OP a hard time and
             | make judgements about a situation for which we have not
             | even the slightest bit of context.
        
               | sureglymop wrote:
               | Oh absolutely, you're right. I am saying that despite
               | whatever may have happened, two weeks is very short. I
               | feel like it would be at least a month here regardless.
        
             | RandallBrown wrote:
             | He was let go after two weeks of not doing any work,
             | despite the manager offering to help him.
        
           | JohnFen wrote:
           | > he refused because it had "private stuff" on it.
           | 
           | There's a huge red flag. "Private stuff" (embarrassing or
           | otherwise) shouldn't be on company machines in the first
           | place.
        
             | dijit wrote:
             | I agree completely.
             | 
             | However if anyone touches my computer: don't you dare
             | f*%king touch my private key.
             | 
             | (ditto for my browser's session database, Google Cloud
             | credentials directory, etc.)
             | 
             | I'm paranoid about it, but not enough to buy a yubikey,
             | apparently.
        
               | lostlogin wrote:
               | > However if anyone touches my computer: don't you dare
               | f*%king touch my private key.
               | 
               | Touch the computer, sure, but please don't touch the
               | screen with your filthy grease fingers.
        
               | mdpye wrote:
               | My work laptop has a touchscreen. I've never used it, but
               | other people use it by accident fairly often. Usually
               | only once each though, the look of shock is sometimes
               | even worth the fingerprint :D
        
               | JohnFen wrote:
               | I'm unusually strict about maintaining a separation
               | between work and personal (for instance, I would never
               | allow my personal smartphone to connect to my employer's
               | WiFi), so I wouldn't use personal keys on a work machine
               | at all.
               | 
               | But if those keys (or passwords, etc.) are generated for
               | work purposes, I consider them to be as much company
               | property as the machine itself, so I'm no more protective
               | of them than I am of any other sensitive company data.
        
               | dijit wrote:
               | Interesting thought.
               | 
               | How do you feel about giving your colleague your
               | password?
               | 
               | My personal opinion is that I can hold someone legally
               | culpable if _their account_ does something like leak
               | financial information; you have a professional
               | responsibility to secure your account from absolutely
               | everyone.
               | 
               | Administrators acting on your account must of course be
               | heavily logged and audited, which is the case.
        
               | JohnFen wrote:
               | > How do you feel about giving your colleague your
               | password?
               | 
               | I usually don't, mostly just out of good security habits,
               | but also because most employers specifically prohibit
               | doing that.
               | 
               | Almost always, your colleague can be given his own access
               | to whatever the password is for anyway. If that's not
               | possible, then I'll share the password and change it
               | immediately after my colleague doesn't need access
               | anymore.
               | 
               | > you have a professional responsibility to secure your
               | account from absolutely everyone.
               | 
               | I agree -- that's part of treating credentials the same
               | way as all other sensitive company data. But it's still
               | my employer's data, not mine.
               | 
               | If I quit the company or if my supervisor wants to see
               | the contents of my machine, I'm fine with that. The
               | machine and everything on it belongs to the company
               | anyway.
        
               | chucksmash wrote:
               | > If I quit the company or if my supervisor wants to see
               | the contents of my machine, I'm fine with that. The
               | machine and everything on it belongs to the company
               | anyway.
               | 
               | I'm fine with that, but I still will not share my
               | passwords. I'd be happy to reset the passwords for them
               | if they can't access the data by other means, but as
               | another commenter pointed out, the fact that anything
               | needs to be recovered from my^H^H _not my_ laptop
               | indicates mistakes were made.
        
               | StillBored wrote:
               | Isn't this largely the point of company directory
               | services? The machines/routers/applications/etc are all
               | doing their authentication against the directory service,
               | and permissions are granted and revoked there. It's a
               | large part of running a company with more than a couple
               | employees because when someone leaves you don't need to
               | run around changing passwords and wondering if they still
               | have access to the AWS account to spin stuff up, or punch
               | through the VPN. The account in the directory service is
               | just deactivated and with it all access.
               | 
               | By default this should be what is happening on all but
               | the most ephemeral of machines/testing platforms/etc. And
               | even then if it's a formal testing system it should
               | probably be integrated too.
               | 
               | Directory service integration BTW is the one feature that
               | clearly delineates enterprise products from the rest.
        
               | dijit wrote:
               | Ok, but your private key, session tokens and CLI access
               | tokens (kube configs, gcloud, etc.) _are_ your password in
               | those situations.
               | 
               | They tie to your identity, thus you must not treat them
               | the same as company secrets, they are professional
               | _personal_ secrets which should not be disclosed or
               | allowed to fall into anyone else's hands (lest they be
               | revoked and cycled).
               | 
               | It's not just good security posture; it could affect your
               | career quite badly or lead to legal issues.
        
               | JohnFen wrote:
               | I agree. I don't think I've said anything counter to that
               | (or perhaps I wasn't being clear?)
               | 
               | > thus you must not treat them the same as company
               | secrets, they are professional personal secrets
               | 
               | They are company secrets that are tied to my identity.
               | The company owns those secrets, not me. Just like my
               | keycard to get into the building.
        
               | dijit wrote:
               | > I agree. I don't think I've said anything counter to
               | that (or perhaps I wasn't being clear?)
               | 
               | I think given the context of the thread (don't touch my
               | secrets), saying that you don't have anything you would
               | consider confidential towards your employer or colleagues
               | is a direct contradiction to what I stated.
               | 
               | That's why I'm "arguing" because my employer/colleagues
               | should not have access to my private key, ever.
        
               | JohnFen wrote:
               | Ah, OK. Then we do disagree to an extent.
               | 
               | There are several very legitimate times when my employer
               | needs to have access to my keys. If I'm leaving the
               | company, for an obvious instance.
               | 
               | But my core point is that such keys/passwords aren't
               | really mine, they're the company's and in the end, the
               | company gets to decide what I'm to do with them.
               | 
               | I think the building access keycard is a perfect analogy.
               | I'd never let anyone borrow mine of my own volition, but
               | if the company wants to retrieve it from me, that's their
               | prerogative. It's theirs, after all.
        
               | brazzledazzle wrote:
               | If an employer needs someone's particular keys something
               | probably went wrong or there are bad processes in place.
               | But that aside I think the default course of action
               | should be to aggressively guard your secrets and tokens
               | since they represent you. Not as personal or private
               | property but to keep someone (be it a fellow employee or
               | a 3rd party attacker) from impersonating you without
               | authorization.
               | 
               | There are exceptions but the circumstances where an
               | employer would need to retrieve my keys without my
               | assistance are extremely rare and in those instances it's
               | unlikely I'd still be an employee anyway.
        
               | dijit wrote:
               | We disagree.
               | 
               | The handing of the keycard is necessary to ensure it's
               | destroyed and can't be used as a "proof" you work
               | somewhere (most access cards these days have your name,
               | face and the company logo printed on the front).
               | 
               | The keycard will be removed from the access list to the
               | building even when it's destroyed, they're not considered
               | reusable by most companies.
               | 
               | Your private key is not reusable, it should be destroyed
               | and revoked from all systems when you leave a company.
        
               | lmm wrote:
               | We could destroy the keycard with both parties present,
               | that seems safest. I don't mind turning in a private key
               | permanently and getting a receipt at the time, but it
               | needs to be very clear that it's no longer my
               | responsibility.
        
               | JohnFen wrote:
               | > but to keep someone (be it a fellow employee or a 3rd
               | party attacker) from impersonating you without
               | authorization.
               | 
               | Aside from a third party attacker (which is well-covered
               | by my normal practices), that's a threat model that I'm
               | personally not worried about at all, really. In part
               | because I've never seen or heard of that happening and in
               | part because if it did, I am confident that there are
               | enough records to be able to prove it.
        
         | ryanjshaw wrote:
         | I used to shut down regularly, then the power situation here in
         | South Africa got so bad that we'd regularly have about 3 hours
         | of power between interruptions.
         | 
         | Restoring all my work every couple of hours was becoming a
         | pain, so I decided to re-enable hibernation support on Windows
         | for the first time in 10 years... And surprisingly it works
         | absolutely flawlessly.
         | 
         | Even on my 12yr old hardware, even if I'm running a few virtual
         | machines. I honestly haven't seen any reason to reboot other
         | than updates.
        
           | lelanthran wrote:
           | > I used to shutdown regularly, then the power situation here
           | in South Africa got so bad that we'd regularly have about 3
           | hours of power between interruptions.
           | 
           | I'm in SA too, and I used to have 100s of days uptime (one
           | even over a year and a half) ... until the regular blackouts.
           | 
           | Had to stop using a desktop, I've resigned myself to using a
           | laptop, purely so that I don't have to boot the thing all the
           | time and lose my context.
        
           | pessimizer wrote:
           | This thread is like reading that someone is shocked that
           | other people don't burn their beds every morning after they
           | wake up.
        
         | rmbyrro wrote:
         | I get anxious just to think that restoring from
         | sleep/hibernation may fail and I lose all my workspace state...
         | 
         | If there was no boot failure, nor the need to reboot after some
         | upgrade, I'd never, ever reboot my system.
        
         | eertami wrote:
         | Sleep uses almost 0 power and works flawlessly. I'm never going
         | to waste my time, however short, waiting for a machine to boot.
        
         | vbezhenar wrote:
         | I think that there are two types of people. One set of people
         | (I guess, relatively small) don't trust software and prefer to
         | reboot OS and even periodically reinstall it to keep it
         | "uncluttered". Another set of people prefer to run and repair
         | it forever.
         | 
         | I'm from the first set of people and the only reason I stopped
         | shutting down my macbook is because I'm now keeping its lid
         | closed (connected to display) and there's no way to turn it on
         | without opening a lid which is very inconvenient. I still
         | reboot it every few days, just in case.
        
           | ComputerGuru wrote:
           | I'm in the second group (avoid reboots like the plague) but
           | for the reason you attribute to the first: I never trust that
           | my Windows machine - currently working - will reboot
           | successfully and into the same working condition between OS
           | update regressions, driver issues, etc.
        
         | coldtea wrote:
         | > _Then you have those types who put their machine into
         | hibernate/sleep with 100+ Chrome tabs open and never do a full
         | boot ritual. Boggles my mind that people do that._
         | 
         | If the OS and hardware drivers properly support sleep, you
         | almost never need to do otherwise (except to install a new
         | kernel driver or similar).
         | 
         | In macOS for example you haven't needed to reboot in regular
         | use for 10+ years.
         | 
         | The "100+ Chrome tabs" or whatever mean nothing. They're paged
         | out when not directly viewed anyway, and if you close just
         | Chrome (not reboot the OS) the memory will be freed in any
         | case...
        
           | [deleted]
        
           | moron4hire wrote:
           | > If the OS and hardware drivers properly support sleep...
           | 
           | That's like the biggest of big IFs.
        
             | tom_ wrote:
             | I've found sleep very reliable on macOS, and both sleep and
             | hibernate reliable on Windows.
             | 
             | I once had my work PC unhibernate and not pop up the login
             | box. The computer appeared to be running normally
             | otherwise; I just couldn't log in, and I had to tap the
             | power button to shut it down. This stuck in my mind due to
             | its rarity.
             | 
             | Can't remember ever having a serious issue on macOS. A
             | couple of my programs sometimes don't survive the
             | sleep/wake cycle, but it's intermittent, and I'm always in
             | the middle of something else when it happens. I've never
             | lost any meaningful work.
        
               | andrekandre wrote:
               | > Can't remember ever having a serious issue on macOS.
               | 
               | macos is fine for the most part, but there are some edge
               | cases, such as some sketchy corporate required "security
               | software" that eats up kernel memory or cpu for some
               | unknown reason, a reboot can fix performance issues there
               | 
               | also if you are a dev and apps (like xcode, android
               | studio etc) fill your drive with cache files* or have
               | weird background daemons that eat up cpu, at the least a
               | logout/login (or a reboot) can fix some of those weird
               | things
               | 
               | you could manually delete them without a reboot but ymmv
        
         | tasuki wrote:
         | > Boggles my mind that people do that.
         | 
         | Why?
         | 
         | It boggles my mind that you'd reboot needlessly. My uptime is
         | usually in the hundreds of days.
         | 
         | Sleep is good: I just close the lid. Next time I open the lid
         | it immediately picks up where I left off. _Why_ on earth would
         | you want any other behaviour?
        
           | 2b3a51 wrote:
           | Full drive encryption on Linux.
           | 
           | I close down my laptop when I'm moving around or when I leave
           | it somewhere while I'm in another part of the building.
        
           | tom_ wrote:
           | I reboot most weeks, just to make sure the right stuff
           | happens when I do. (I try to do it in the middle of the day,
           | so there's time to sort out any matters arising.)
           | 
           | A couple of times I've discovered I've forgotten to set stuff
           | to auto-run on login, or things turn out to have lost their
           | settings, or stuff doesn't work for whatever reason - I'd
           | much rather discover this at a time of my own choosing!
        
           | rolandog wrote:
           | Security-wise: encryption at rest? In high security scenarios
           | you may be required to shutdown so you're forcing "attackers"
           | to go through several layers: motherboard password, disk
           | password, encryption password, OS user password + 2FA, etc.
        
           | JohnFen wrote:
           | On my personal machines? I don't shut them down or reboot
           | very often.
           | 
           | At work, however, I have to use Windows. In that case, I shut
           | it down at the end of every workday, in part because that
           | prevents weird issues Windows tends to develop when running
           | too long.
           | 
           | Mostly, though, it's because of those damned forced updates.
           | Since I can't trust Windows to not reboot itself at any
           | random point in time, having the habit of shutting down at
           | the end of the day at least ensures that I won't accidentally
           | lose my state overnight or over the weekend.
        
             | tom_ wrote:
             | How to stop Windows installing updates behind your back:
             | https://news.ycombinator.com/item?id=18157968
             | 
             | If you don't/won't/can't use the group policy editor, I got
             | a lot of mileage out of hibernating the PC and powering it
             | off at the mains. You can't leave it running something
             | overnight, but you can at least quickly get back to exactly
             | where you left things the previous day.
             | 
             | (Powering it off at the mains ensures that even if you have
             | a device connected that could wake the PC up - thus putting
             | your computer in a state where Windows Update can reboot it
             | - it can't. You can turn this feature off on a per-device
             | basis with powercfg, but then one day you'll plug something
             | new in and leave it plugged in and it'll wake the PC up
             | while you're away and Windows Update will do its thing.)
        
           | jameson71 wrote:
           | Security patching?
        
             | pessimizer wrote:
             | What do you need to reboot to patch other than the kernel?
             | I just restart things.
        
             | cannonpalms wrote:
             | Can all be done online, no?
        
           | mcculley wrote:
           | A long time ago, I had desktops with huge uptimes. The world
           | has changed. I will no longer go that long without a security
           | update. Too much is now passing through my machine.
        
         | sieabahlpark wrote:
         | I just have it running 24/7 and never restart for weeks. I
         | don't even have the 100 tab problem, I just like having the
         | immediate availability without waiting for startup.
        
           | 5e92cb50239222b wrote:
           | Unless you're on solar, does wasting electricity not bother
           | you? I used to seed a lot of stuff for years (with typical
           | uptime measured in months), but the CO2 impact, however tiny
             | it is in the grand scheme of things, does not seem to be
             | worth it anymore.
        
             | sieabahlpark wrote:
             | [dead]
        
             | pessimizer wrote:
             | If you're shutdown or hibernating, is the power draw
             | anything compared to a lightbulb?
        
             | Hikikomori wrote:
             | My desktop uses 2w in sleep mode. Likely less if i disable
             | the motherboard RGB.
        
         | aeyes wrote:
         | > Boggles my mind that people do that.
         | 
         | :( I only reboot when my machine freezes or when updates
         | require a reboot. I did a lot of on-call in my life and I saved
         | tons of time by leaving everything open exactly as I left it
         | during the day.
         | 
         |     ~> w
         |     11:19  up 18 days, 17:03, 9 users, load averages: 3.87 2.96 2.39
        
           | ComputerGuru wrote:
           | You haven't properly kept a machine alive until the clock
           | rolls over.
           | 
           | I logged into a firewalled Windows VM on EC2 that's been
           | running an internal micro service that was acting up and it
           | caught my eye that task manager showed an uptime of 6 days
           | making my mind immediately think it might be a bug caused by
           | the recent reboot or perhaps the update that triggered it.
           | 
           | It turns out no reboot had taken place and in fact, the
           | uptime counter had merely rolled over - and not for the first
           | time! Bug was unrelated to the machine and it's still (afaik)
           | ticking merrily away.
           | 
           | (Our `uptime` tool for Windows [0] reported the actual time
           | the machine was up correctly.)
           | 
           | [0]: https://neosmart.net/uptime/
        
             | exikyut wrote:
             | Okay, what was the actual uptime? :) (:E)
        
         | andrewaylett wrote:
         | Conversely, it boggles _my_ mind that people think 100+ tabs is
         | a lot. I've got >500 open in Firefox at the moment, they won't
         | go away just because I reboot or upgrade. I'll probably not
         | look at most of them again, but they're not doing any harm just
         | sitting there waiting to be cleaned up.
        
           | db48x wrote:
           | That's because in Firefox an open tab that you haven't
           | recently viewed uses no memory.
        
         | drbawb wrote:
         | >Then you have those types who put their machine into
         | hibernate/sleep with 100+ Chrome tabs open and never do a full
         | boot ritual.
         | 
         | I would never suspend to RAM or disk, far too error-prone in my
         | experience. (Plus serializing out 128GiB of RAM is not great.)
         | I just leave my machine running "all the time." My most
         | recently retired disks (WD Black 6TB) have 309 power cycles
         | with ~57,382 power-on hours. Seems like that works out to
         | rebooting a little less than once per week. That tracks: I
         | usually do kernel updates on the weekend, just in case the
         | system doesn't want to reboot unattended.
        
         | trashburger wrote:
         | > Then you have those types who put their machine into
         | hibernate with 100+ Chrome tabs open and never do a full boot
         | ritual. Boggles my mind that people do that.
         | 
         | Hey, I'm that guy (although I put it to sleep instead)! It
         | honestly works really well and is in stark contrast to how
         | Linux and sleep mode interacted just ~10 years ago. It's
         | amazing for keeping your workspace intact.
         | 
         | (FWIW, I also don't reboot or shutdown my desktop where it acts
         | as a mainframe for my "dumb" laptop.)
        
         | bregma wrote:
         | > Boggles my mind that people do that.
         | 
         |     $ uptime
         |     15:39:13 up 359 days,  2:02, 16 users,  load average: 0.09, 0.08, 0.15
         | 
         | 16 users is 16 tmux sessions, all me doing different tasks.
        
           | exikyut wrote:
           | _[Cries in outdated kernel]_
           | 
           | One of the fascinating curiosities you're missing out on is
           | Pressure Stall Information
           | (https://docs.kernel.org/accounting/psi.html). Here's what
           | the PSI gauges look like in htop when kernel support is
           | available:
           | 
           |     PSI some CPU:     0.37%  0.78%  1.50%
           |     PSI some IO:      0.38%  0.33%  0.25%
           |     PSI full IO:      0.38%  0.31%  0.23%
           |     PSI some memory:  0.02%  0.04%  0.00%
           |     PSI full memory:  0.02%  0.04%  0.00%
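
For reading the same figures without htop: the kernel documentation linked above describes the files /proc/pressure/cpu, /proc/pressure/io and /proc/pressure/memory, each holding lines of the form `some avg10=... avg60=... avg300=... total=...`. The three percentages per row in the htop output presumably correspond to avg10, avg60 and avg300. A minimal sketch of a parser, where the function name `parse_psi` and the sample values are illustrative assumptions rather than part of any existing tool:

```python
def parse_psi(text):
    """Parse the contents of a /proc/pressure/* file into a nested dict."""
    result = {}
    for line in text.strip().splitlines():
        # Each line looks like: "some avg10=0.38 avg60=0.33 avg300=0.25 total=123456"
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # "total" is a microsecond counter; the avg* fields are percentages.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


# Made-up sample text in the documented format (not real measurements).
SAMPLE = (
    "some avg10=0.38 avg60=0.33 avg300=0.25 total=123456\n"
    "full avg10=0.38 avg60=0.31 avg300=0.23 total=120000\n"
)

if __name__ == "__main__":
    for kind, values in parse_psi(SAMPLE).items():
        print(kind, values["avg10"], values["avg60"], values["avg300"])
```

On a kernel with PSI enabled, passing the contents of /proc/pressure/io to `parse_psi` should give the same structure.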
        
       | jchw wrote:
       | I have found that my MicroPC fails on some newer kernels: when
       | GDM starts up, the machine locks up and the LCD goes wonky. I'm
       | not particularly looking forward to the bisect, but at least it
       | won't take 292,612 reboots.
        
         | StillBored wrote:
         | In some ways an early boot kernel-only failure is easier. Late
         | boot failures like that, could just as well have been something
         | changing in wayland/X/gdm/mesa/dbus/whatever at the same time.
         | And then if it turns out everything but the kernel is constant,
         | it's easy to take a wild guess and look for something in say the
         | DRM/GPU driver in use vs the entire kernel. Although last time
         | I did that turns out it wasn't even in the GPU specific code
         | but a refactoring in the generic display mgmt code. Still ended
         | up doing a bisect across like 5 kernel revisions after
         | everything else failed. Which points to the fact that if linux
         | had a less monolithic tree it would be possible to a/b test
         | just the kernel modules and then bisect their individual trees,
         | rather than adjusting each bisect point to the closest related
         | commit if you're sure it's a driver specific problem. There is a
         | very good chance that if say a particular monitor config + GPU
         | stops working on my x86, the problem is likely in /drivers/gpu
         | rather than all the commits in arch/riscv that are also mixed
         | into the bisect. Ideally the core kernel, arch specific code,
         | and driver subsystems would all be independent trees with
         | fixed/versioned ABIs of their own. That way one could upgrade
         | the GPU driver to fix a bug without having to pull forward
         | btrfs/whatever and risk breaking it.
        
           | jchw wrote:
           | Since I'm in NixOS, I can at least emphatically confirm it is
           | JUST the kernel.
           | 
           | Though, given the way the LCD panel wonks out, I'm actually
           | concerned it's power management related. It looks like what
           | happens to an LCD panel when the voltage goes too low. (Or at
           | least, I think that's what that effect is, based on what I've
           | seen with other weird devices with low battery.) Since
           | MicroPC is x86, though, I doubt the kernel is driving any of
           | the voltages too directly, so who knows.
        
       | rjmunro wrote:
       | I wonder if bisect is the optimal algorithm for this kind of
       | case. Checking for the error still existing still takes an
       | average of ≈500 iterations before a fail, checking for the
       | error not existing takes 10,000 iterations, 20 times longer, so
       | maybe biasing the bisect to only skip 1/20th of the remaining
       | commits, rather than half of them would be more efficient.
        
         | pacaro wrote:
         | Biasing a binary search would only be beneficial if you know
         | something about the distribution of the search space
        
           | bgirard wrote:
           | If the factor in one direction is large enough then a linear
           | search becomes more efficient. Say you have 20 commits
           | remaining and the factor is 1,000x more costly to make it
           | easier to picture. You're better off doing a linear search
           | which guarantees you'll spend less than 2,000x searching the
           | space.
           | 
           | That suggests that for a larger search space with a large
           | enough difference, the optimal bisection point is probably
           | not always the midpoint even if you know nothing about the
           | distribution.
           | 
           | Perhaps someone can find the exact formula for selecting the
           | next revision to search?
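
One way to explore that question numerically is a small dynamic program. The sketch below is not a closed-form answer and not something from the article. It assumes a uniform prior over which commit introduced the bug, treats a full clean run as proof that a commit is good, and uses the figures quoted upthread (about 10,000 boots to accept a good commit, about 500 to see a bad one hang) purely as example costs. The name `plan` is hypothetical.

```python
from functools import lru_cache


def plan(n, cost_good, cost_bad):
    """Return (expected_cost, first_split) for n candidate first-bad commits.

    Candidates are numbered 1..n; commit 0 is known good, commit n is known bad.
    Testing commit i costs cost_bad if it turns out bad, cost_good if good.
    """

    @lru_cache(maxsize=None)
    def best(m):
        if m <= 1:
            return 0.0, None  # only one candidate left, nothing to test
        options = []
        for i in range(1, m):
            p_bad = i / m  # uniform prior: first bad commit is among 1..i
            cost = (p_bad * (cost_bad + best(i)[0])
                    + (1 - p_bad) * (cost_good + best(m - i)[0]))
            options.append((cost, i))
        return min(options)

    return best(n)


if __name__ == "__main__":
    cost, split = plan(20, cost_good=10_000, cost_bad=500)
    print(f"test commit {split} of 20 first, expected cost {cost:.0f} boots")
```

With equal costs this reduces to ordinary bisection (for four candidates it picks the midpoint). With a good verdict 20 times more expensive than a bad one, the preferred test point moves toward the known-bad end, which matches the intuition in the comments above.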
        
             | jwilk wrote:
             | > You're better off doing a linear search which guarantees
             | you'll spend less than 2,000x searching the space.
             | 
             |  _Almost_. If only the last commit is slow, binary search
             | is still faster.
        
               | bgirard wrote:
               | > better off
               | 
               | Better off as in expected/average case. Good point, but
               | only marginally better in the worst case.
        
           | electroly wrote:
           | There's an additional stopping problem here that isn't
           | present in a normal binary search. Binary search assumes you
           | can do a test and know for sure whether you've found the
           | target item, a lower item, or a higher item. If the test
           | itself is stochastic and you don't know how long you have to
           | run it to get the hang, I'd think you'd get results faster by
           | running commits randomly and excluding them from
           | consideration when they hang. Effectively, you're running all
           | the commits at the same time instead of working on one commit
           | and not moving on until you've made a decision on it. Then at
           | any time you will have a list of commits that have hanged and
           | a list of commits that have not hanged yet, and you can keep
           | the entire experiment running arbitrarily long to catch the
           | long-tail effects rather than having to choose when to stop
           | testing a single non-hanging commit and move onto the next
           | one.
        
             | pacaro wrote:
             | I can see some interesting approaches here. Given n
             | threads/workers you could divide the search space into n
             | sample points (for simplicity let's divide it evenly) and
             | run the repeated test on each point. When a point hangs,
             | that establishes a new upper limit, all higher search
             | points are eliminated, the workers reassigned in the
             | remaining search space.
             | 
             | Given the uncertainty I can see how this might be more
             | efficient, especially if the variance of the heisenbug is
             | high.
        
           | mortehu wrote:
           | Each boot updates your empirical distribution. As a trivial
           | example, if you have booted a version 9999 times with no
           | hanging, a later version will likely give you more
           | information per boot.
        
         | coldtea wrote:
         | Still, why would they need to reboot 292,612 times?
         | 
         | Is that supposed to be the log of the commit messages space?
        
           | remram wrote:
           | If they boot it 10,000 times for revisions that don't fail,
           | and ~1,000 times for revisions that do fail, you can reach
           | this number with log2(revisions) about 30.
        
           | x86x87 wrote:
           | read the article. they booted so many times to show that it
           | was not reproducing. it's overkill but you don't need to boot
           | 200k times
        
             | rwmj wrote:
             | I didn't mention it in the blog, but Paolo Bonzini was
             | helping me and suggested I run the bootbootboot test for 24
             | hours, to make sure the bug wasn't latent in the older
             | kernel. I got bored after 21 hours, which happened to be
             | 292,612 boots.
             | 
             | Maybe it would have failed on the 292,613th boot ...
        
               | quickthrower2 wrote:
               | I think your p value is pretty good here
        
               | opello wrote:
               | I've been on a similar quest for hard to reproduce,
               | timing/hardware/... bugs, and if you're facing any kind
               | of skepticism (your own or otherwise) it can be very
               | comforting to have a 10x or even 100x no failure occurred
               | confidence.
               | 
               | It's particularly comforting when the reason for the
               | failure/fix/change in behavior isn't completely
               | understood.
        
               | bsilvereagle wrote:
               | If the bug occurs reasonably often, say usually once
               | every 10 minutes, you can model an exponential
               | distribution of the intervals between the bug triggering
               | and then use the distribution to "prove" the bug is fixed
               | in cases where the root cause isn't clear:
               | https://frdmtoplay.com/statistically-squashing-bugs/
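
The same reasoning can be put into numbers. The sketch below is not taken from the linked post. It assumes every boot of a buggy kernel hangs independently with a fixed probability, and the 1-in-1,000 figure is only an illustrative guess based on the iteration counts mentioned in this thread. All function names are hypothetical.

```python
import math


def prob_all_clean(n_boots, p_fail):
    """Chance of n_boots clean boots in a row if each boot hangs with probability p_fail."""
    return (1.0 - p_fail) ** n_boots


def boots_needed(p_fail, alpha):
    """Smallest n for which prob_all_clean(n, p_fail) drops to alpha or below."""
    return math.ceil(math.log(alpha) / math.log(1.0 - p_fail))


def prob_quiet(duration, mean_interval):
    """Exponential model: chance of seeing no failure for `duration`,
    given the mean time between failures (same units)."""
    return math.exp(-duration / mean_interval)


if __name__ == "__main__":
    p = 1 / 1000  # assumed per-boot hang probability, not a measured value
    print("boots for 95% confidence:", boots_needed(p, 0.05))
    print("P(292,612 clean boots | bug present):", prob_all_clean(292_612, p))
    print("P(30 quiet minutes | one failure per 10 min):", prob_quiet(30, 10))
```

Under that assumed rate, a few thousand clean boots already make "the bug is still there" unlikely, and 292,612 clean boots make it vanishingly so.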
        
         | ajb wrote:
         | There is actually a bayesian version which I wrote:
         | https://github.com/ealdwulf/bbchop
         | 
         | Basically it calculates the commit to test at each step which
         | gains the most information, under some trivial assumptions. The
         | calculation is O(N) in the number of commits if you have a
         | linear history, but it requires prefix-sum which is not O(N) on
         | a DAG so it could be expensive if your history is complex.
         | 
         | Never got round to integrating it into git though.
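
To give a flavour of the Bayesian approach, here is a minimal sketch of the belief update only. It is not bbchop's actual algorithm and leaves out the step that picks the most informative commit to test next. It assumes a linear history, a known per-run hang probability `p_fail` on bad commits, and that good commits never hang. The function name `update` is hypothetical.

```python
def update(posterior, commit, failed, p_fail):
    """Bayes update of P(first bad commit == j) after one test run of `commit`.

    posterior[j] is the current belief that commit j introduced the bug,
    so commit k is bad under hypothesis j exactly when j <= k.
    """
    weights = []
    for j, prior in enumerate(posterior):
        commit_is_bad = j <= commit
        if failed:
            # A hang can only happen on a bad commit.
            likelihood = p_fail if commit_is_bad else 0.0
        else:
            # A clean run is certain on a good commit, merely likely on a bad one.
            likelihood = (1.0 - p_fail) if commit_is_bad else 1.0
        weights.append(prior * likelihood)
    total = sum(weights)
    if total == 0.0:
        raise ValueError("observation is impossible under the current beliefs")
    return [w / total for w in weights]


if __name__ == "__main__":
    belief = [0.25, 0.25, 0.25, 0.25]          # uniform prior over 4 commits
    belief = update(belief, 1, False, 0.5)     # commit 1 booted cleanly once
    belief = update(belief, 2, True, 0.5)      # commit 2 hung
    print(belief)
```

A clean run only shifts probability away from the commits at or before the tested one, while a single hang rules out every later commit outright. That asymmetry is why repeated clean boots keep adding information.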
        
           | muxator wrote:
           | Hidden gem! Thanks!
        
           | defen wrote:
           | That's a cool idea. Would also be interesting to consider the
           | size of the commit - a single 100-line change is probably
           | more likely to introduce a bug than 10 10-line changes.
        
             | phist_mcgee wrote:
             | You haven't met the developers at my last company.
        
           | [deleted]
        
       | dumbaccount123 wrote:
       | [flagged]
        
       | TechBro8615 wrote:
       | This reminded me of another story [0] (discussed on HN [1]) about
       | debugging hanging U-Boot when booting from 1.8 volt SD cards, but
       | not from 3.0 volt SD cards, where the solution involved a kernel
       | patch that actually _introduced_ a delay during boot, by
       | "hardcoding a delay in the regulator setup code
       | (set_machine_constraints)." (In fact it sounded so similar that I
       | actually checked if that patch caused the bug in the OP, but they
       | seem unrelated.)
       | 
       | The story is a wild one, and begins with what looks like a patch
       | with a hacky workaround:
       | 
       | > The patch works around the U-Boot bug by setting the signal
       | voltage back to 3.0V at an opportune moment in the Linux kernel
       | upon reboot, before control is relinquished back to U-Boot.
       | 
       | But wait... it was "the weirdest placebo ever!" Turns out the
       | only reason this worked was because:
       | 
       | > all this setting did was to write a warning to the kernel
       | log... the regulator was being turned off and on again by
       | regulator code, and that writing that line took long enough to be
       | a proper delay to have the regulator reach its target voltage.
       | 
       | The full story is well worth a read.
       | 
       | [0]
       | https://kohlschuetter.github.io/blog/posts/2022/10/28/linux-...
       | 
       | [1] https://news.ycombinator.com/item?id=33370882
        
       | headline wrote:
       | Very interesting, I wonder the _why_
        
         | mgsouth wrote:
         | Disclaimer: not a kernel dev, opinion based upon very cursory
         | inspection.
         | 
         | The patch references the "scheduler clock," which is a high-
         | speed, high-resolution monotonic clock used to schedule future
         | events. For example, a network card driver might need to reset
         | a chip, wait 2 milliseconds, and then do another initialization
         | step. It can use the scheduler to cause the second step to be
         | executed 2 milliseconds in the future; the "scheduler clock" is
         | the alarm clock for this purpose.
         | 
         | Measuring the "current time" is pretty complicated when you're
         | dealing with multiple-core variable-frequency processors, need
         | a precise measurement, and can't afford to slow things down.
         | The "scheduler clock" code fuses together time sources and
         | elapsed-time indicators to provide an estimated current time
         | which has certain guarantees (such as code running on a
         | particular core will never see time go backwards, it will be
         | accurate
         | within particular limits, and it won't need global locks). The
         | sources and elapsed-time indicators it has available varies by
         | computer architecture, vendor, and chip family; therefore the
         | exact behavior on an Intel core 5 will differ from that of an
         | Arm M7.
         | 
         | The patch in question changes the behavior of local_clock();
         | this is the function used by code which wants to know what the
         | current time is on its particular core. The patch tries to make
         | local_clock() return a sane value if the scheduler clock hasn't
         | been fully initialized but is at least running.
         | 
         | As you can imagine, there are a lot of things that can go wrong
         | with that. I _think_ the problem is that
         | sched_clock_init_late() is marking the clock as "running"
         | before it should. I could very well be wrong. Regardless, it's
         | pretty clear that there's some kind of architecture-dependent
         | clock initialization race condition that once in a while gets
         | triggered.
        
           | cryptonector wrote:
           | Great thinking. I'll also note that `sched_clock_register()`
           | uses `pr_debug()`, which can be an alias of `printk()`,
           | though I don't think that's it.
        
       | rwmj wrote:
       | If anyone would like to try reproducing the bug, I have a fairly
       | solid reproducer here:
       | 
       | https://lore.kernel.org/lkml/20230614173430.GB10301@redhat.c...
       | 
       | You will need a vmlinux or vmlinuz file from Linux 6.4 RC.
       | 
       | If these are the last two lines of output then congratulations
       | you reproduced the bug:
       | 
       |     [    0.074993] Freeing SMP alternatives memory: 48K
       |     *** ERROR OR HANG ***
       | 
       | You could also try reverting f31dcb152a3 and rerunning the test
       | to see if you get through 10,000 iterations.
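
For illustration, the general shape of such a boot loop could look like the sketch below. This is not the script from the linked message. It assumes a healthy boot exits on its own within the timeout, so a run that outlives the timeout counts as a hang. The command shown is a stand-in that must be replaced by the real QEMU invocation from the reproducer, and `first_hang` is a hypothetical name.

```python
import subprocess
import sys


def first_hang(cmd, iterations, timeout_s):
    """Run cmd up to `iterations` times.

    Returns the 1-based number of the first run that exceeded timeout_s,
    or None if every run finished in time.
    """
    for i in range(1, iterations + 1):
        try:
            # subprocess.run kills the child and raises if the timeout expires.
            subprocess.run(cmd, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return i
    return None


if __name__ == "__main__":
    # Placeholder command; substitute the QEMU command line from the reproducer.
    demo = [sys.executable, "-c", "pass"]
    result = first_hang(demo, iterations=3, timeout_s=10)
    print("no hang" if result is None else f"hang on iteration {result}")
```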
        
         | Twirrim wrote:
         | I've been having flashbacks to troubleshooting some
         | particularly thorny unreliable boot stuff several years ago. In
         | the end tracked that one down to the fact that device order was
         | changing somewhat randomly between commits (deterministically,
         | though, so the same kernel from the same commit would always
         | have devices return in the same order), and part of the early
         | boot process was unwittingly dependent on particular network
         | device ordering due to an annoying bug. The kernel has never
         | made any guarantees about device ordering, so the kernel was
         | behaving just fine.
         | 
         | That one was.. fun. First time I've ever managed to identify
         | dozens of commits widely dispersed within a large range, all
         | seeming to be the "cause" of the bug, while clearly having
         | nothing to do with anything related to it, and having commits
         | all around them be good :)
        
         | chenxiaolong wrote:
         | I gave that reproducer a try and it failed after 1968
         | iterations.
         | 
         | * CPU: Intel(R) Core(TM) i9-9900KS
         | 
         | * qemu: qemu-kvm-7.2.1-2.fc38.x86_64
         | 
         | * host kernel: 6.3.6-200.fc38.x86_64
         | 
         | * guest kernel: 6.4.0-0.rc6.48.fc39.x86_64 (grabbed latest from
         | mirrors.kernel.org/fedora since fedoraproject.org DNS is down
         | and I can't access koji)
         | 
         | Log:
         | 
         |     <...>
         |     1966... 1967... 1968...
         |     [    0.075343] LSM: initializing lsm=lockdown,capability,yama,bpf,landlock,integrity
         |     [    0.075514] Yama: becoming mindful.
         |     [    0.075514] LSM support for eBPF active
         |     [    0.075514] landlock: Up and running.
         |     [    0.075514] Mount-cache hash table entries: 4096 (order: 3, 32768 bytes, linear)
         |     [    0.075514] Mountpoint-cache hash table entries: 4096 (order: 3, 32768 bytes, linear)
         |     [    0.075514] x86/cpu: User Mode Instruction Prevention (UMIP) activated
         |     [    0.075514] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
         |     [    0.075514] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
         |     [    0.075514] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
         |     [    0.075514] Spectre V2 : Mitigation: Enhanced / Automatic IBRS
         |     [    0.075514] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
         |     [    0.075514] Spectre V2 : Spectre v2 / PBRSB-eIBRS: Retire a single CALL on VMEXIT
         |     [    0.075514] RETBleed: Mitigation: Enhanced IBRS
         |     [    0.075514] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
         |     [    0.075514] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl
         |     [    0.075514] TAA: Mitigation: TSX disabled
         |     [    0.075514] MMIO Stale Data: Vulnerable: Clear CPU buffers attempted, no microcode
         |     [    0.075514] SRBDS: Unknown: Dependent on hypervisor status
         |     [    0.075514] Freeing SMP alternatives memory: 48K
         |     *** ERROR OR HANG ***
         | 
         | I'll try reverting f31dcb152a3 and testing again later. Happy
         | to test anything else if needed.
        
           | rwmj wrote:
           | Yup, that's the bug. If it goes away after reverting the
           | commit, that would be interesting too. I don't have any other
           | suggestions.
        
             | chenxiaolong wrote:
             | I tested with 6.4.0-0.rc6.48.fc39.x86_64 + f31dcb152a3
             | revert and all 10000 iterations succeeded (same hardware
             | and environment as my previous post).
             | 
             | To guarantee that there's absolutely no other difference
             | between the two tests, I took the source RPM, added the
             | commit f31dcb152a3 diff + `%patch -P 2 -R`, and built the
             | kernel RPM with mock.
        
         | swordbeta wrote:
         | I wasn't able to reproduce this with 10k iterations on arch,
         | I'm probably doing something wrong. Does the host kernel
         | matter?
         | 
         | Host kernel: 6.1.33
         | 
         | Guest kernel: 6.4-rc6
         | 
         | Guest config: http://oirase.annexia.org/tmp/config-bz2213346
         | 
         | QEMU: 8.0.2
         | 
         | Hardware: AMD Ryzen 7 3700X CPU @ 4.2GHz
        
           | rwmj wrote:
           | > Does the host kernel matter?
           | 
           | Honestly I don't know! We've seen it appear with host kernel
           | 6.2.15
           | (https://bugzilla.redhat.com/show_bug.cgi?id=2213346#c5) but
           | I'm not aware of anyone either reproducing or not reproducing
           | it with earlier host kernels. All your other config looks
           | right.
        
             | garaetjjte wrote:
             | vmlinuz-6.4.0-0.rc6.48.fc39.x86_64 failed on my 6.0.0 host
             | after 249 iterations.
        
               | rwmj wrote:
               | We had another report that it happens on a RHEL _8_ host,
               | which has a very much older (franken) kernel.
        
       | allanrbo wrote:
       | Running binary search on something that's flaky is a pain. "Noisy
       | binary search" or "robust binary search" can help here:
       | https://github.com/adamcrume/robust-binary-search
        
         | hoten wrote:
         | That README is light on details. How is this different from
         | selecting some N (and hoping it is high enough) and repeating
         | your test case that many times? You just don't have to select a
         | value for N using this tool?
         | 
         | EDIT: I missed the link to the white paper.
        
           | IshKebab wrote:
           | The paper lists the algorithm (which is relatively simple)
           | but basically it is much more efficient than repeating test
           | cases.
           | 
           | You can see that that must be possible fairly easily.
           | Consider two algorithms:
           | 
           | 1. Classic binary search - test each element once and 100%
           | trust the result.
           | 
           | 2. Overkill - test each element 100 times because you don't
           | trust the result one bit.
           | 
           | The former will clearly give you the wrong result most of the
           | time, and the latter is extremely inefficient. There's
           | clearly a solution in between that's more efficient without
           | sacrificing accuracy.
           | 
           | Skimming the algorithm, it looks like they maintain Bayesian
           | probabilities for each element being "the one", then test the
           | element at the 50% probability point each iteration and update
           | the probabilities accordingly. Basically a Bayesian version
           | of the traditional algorithm.
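           | 
           | A minimal Python sketch of that idea, under an assumed noise
           | model: good commits always pass, and a bad commit fails only
           | with some probability. This is not the robust-binary-search
           | library's API or its exact algorithm; noisy_bisect,
           | make_flaky_test and all parameters are hypothetical names
           | used for illustration.

```python
import random


def noisy_bisect(n, run_test, fail_prob, confidence=0.99, max_steps=10_000):
    """Locate the first bad commit among n commits (0 = oldest) with a flaky test.

    Assumed model: a good commit always passes; a bad commit fails with
    probability fail_prob and otherwise passes by luck. run_test(i) returns
    True when commit i passes. Commit n - 1 is assumed to be bad.
    """
    if n == 1:
        return 0
    # belief[b] is the current probability that b is the first bad commit.
    belief = [1.0 / n] * n
    for _ in range(max_steps):
        best = max(range(n), key=belief.__getitem__)
        if belief[best] >= confidence:
            return best
        # Probe the commit whose cumulative probability is closest to 50%.
        # Commit n - 1 is skipped: it is bad under every hypothesis, so
        # testing it cannot change the relative beliefs.
        cumulative, probe, probe_dist = 0.0, 0, 1.0
        for i in range(n - 1):
            cumulative += belief[i]
            if abs(cumulative - 0.5) < probe_dist:
                probe, probe_dist = i, abs(cumulative - 0.5)
        passed = run_test(probe)
        # Bayes update: weight each hypothesis by the likelihood of the result.
        for b in range(n):
            if b <= probe:
                # Under this hypothesis the probed commit is bad.
                belief[b] *= (1.0 - fail_prob) if passed else fail_prob
            elif not passed:
                # A failure proves the first bad commit is at or before probe.
                belief[b] = 0.0
        total = sum(belief)
        if total == 0.0:
            raise ValueError("observations contradict the assumed model")
        belief = [x / total for x in belief]
    # Step budget exhausted: return the most likely candidate so far.
    return max(range(n), key=belief.__getitem__)


def make_flaky_test(first_bad, fail_prob, seed):
    """Simulated test for demonstration: bad commits fail only sometimes."""
    rng = random.Random(seed)
    return lambda i: not (i >= first_bad and rng.random() < fail_prob)
```

           | Unlike classic bisection, a single lucky pass here only
           | shifts probability mass; it never rules a half out for good.
           | Only an observed failure is treated as definitive.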
        
             | allanrbo wrote:
             | Good explanation! And in the case of "I booted Linux 293k
             | times in 21 hours" it wasn't just 100 times, it was 10,000
             | :-)
        
           | allanrbo wrote:
           | You do still have to select an N, but it's not as critical
           | that the N gives 100% guarantee of the flaky failure (which
           | can be really difficult or even impossible to achieve).
           | Unlike regular binary search, robust binary search doesn't
           | permanently give up on the left or right half based on just a
           | single result.
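           | 
           | To put rough numbers on choosing N for plain bisection: if
           | each boot of a bad kernel hangs independently with
           | probability p, then N clean boots in a row happen with
           | probability (1 - p)^N. A small Python sketch, where the
           | 1-in-1000 hang rate is purely an illustrative assumption
           | (not a measured figure) and the function names are
           | hypothetical:

```python
import math


def miss_probability(fail_prob, boots):
    """Chance that a bad commit survives `boots` runs without one failure."""
    return (1.0 - fail_prob) ** boots


def boots_needed(fail_prob, miss_prob):
    """Smallest N such that (1 - fail_prob)**N <= miss_prob."""
    return math.ceil(math.log(miss_prob) / math.log1p(-fail_prob))


if __name__ == "__main__":
    p = 0.001  # assumed per-boot hang rate, for illustration only
    print(boots_needed(p, 0.01))        # boots for a 1% chance of a false "good"
    print(miss_probability(p, 10_000))  # false-"good" chance with 10,000 boots
```

           | With a rate that low, even thousands of boots per commit
           | leave a small chance of mislabelling a bad commit as good,
           | which is the case where not trusting any single result pays
           | off.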
        
       | NelsonMinar wrote:
       | What a fantastic bug report writeup this is. Both the linked post
       | and the backing LKML and QEMU bug report.
        
       | sp332 wrote:
       | To save anyone clicking through the email thread: there is no
       | resolution in there so far.
        
         | loeg wrote:
         | Bisect points at this commit, even if the cause isn't known
         | yet:
         | https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
        
       | parentheses wrote:
       | It makes sense to n-sect (rather than bisect) as long as the tests
       | can be run in parallel. For example, if you're searching 1000
       | commits, a 10-sect will get you there with about 30 tests, but
       | only 3 iterations. OTOH, a 2-sect will take more than 3x the
       | time, but require only 10 tests (one per iteration).
       | 
       | There's ofc also the Bayesian approach mentioned in other
       | answers.
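       | 
       | The n-sect arithmetic above can be checked with a short Python
       | sketch. It assumes a reliable (non-flaky) test and that all
       | probes of one round run in parallel; k_sect_cost is a
       | hypothetical helper name. Each round probes the k - 1 interior
       | split points, so a 10-sect over 1000 commits comes out at 27
       | tests in 3 rounds, and a 2-sect at 10 tests in 10 rounds.

```python
def k_sect_cost(n_commits, k):
    """Return (rounds, total_tests) for a k-way search over n_commits candidates.

    Each round splits the remaining range into up to k parts by probing the
    interior split points in parallel, then keeps the one part that contains
    the first bad commit.
    """
    rounds = 0
    tests = 0
    remaining = n_commits
    while remaining > 1:
        parts = min(k, remaining)            # cannot split finer than one commit
        tests += parts - 1                   # one probe per interior split point
        remaining = -(-remaining // parts)   # ceiling division: largest part
        rounds += 1
    return rounds, tests


if __name__ == "__main__":
    for k in (2, 4, 10):
        print(k, k_sect_cost(1000, k))
```

       | The trade-off is wall-clock time (rounds) against machine time
       | (total tests): a larger k buys fewer rounds at the price of
       | more builds and boots overall.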
        
         | eichin wrote:
         | Yeah, I did a 4-way search like this on gcc back in the Cygnus
         | days - way before git, and the build step involved "me setting
         | up 4 checkouts to build at once and coming back in a few hours"
         | so it was more about giving the human more to dig into at
         | comparison time than actual computer time and usage. (It always
         | amazes me that people _have_ bright-line tests that make the
         | fully automated version useful, but I've also seen "git bisect
         | exists" used as encouragement to break up changes into more
         | sensible components...)
        
       | eknkc wrote:
       | No disrespect to Peter Zijlstra, I'm sure he has been a lot more
       | impactful on the open source community than I will ever be but
       | his immediate reply caught my attention:
       | 
       | >> [Being tracked in this bug which contains much more detail:
       | >> https://gitlab.com/qemu-project/qemu/-/issues/1696 ]
       | 
       | > Can I please just get the detail in mail instead of having to
       | go look at random websites?
       | 
       | Maybe it's me but if I did boot Linux 292,612 times to find
       | a bug, you might as well click a link to a repository of a major
       | open source project on a major git hosting service.
       | 
       | Is it really that weird to ask people online to check a website?
       | Maybe I don't know the etiquette of these mailing lists, so this
       | is a genuine question. I guess it is better to keep all
       | conversation in a single place; would that be the intention?
        
         | dezgeg wrote:
         | Many kernel people really are stuck in their ways like that.
         | They don't want to leave their Mutt (e-mail client) at any
         | cost. I recall some are, even to this day, working from a text
         | console (i.e. no X11 or Wayland).
        
           | donalhunt wrote:
           | Don't blame them. I'm fed up of browsers using gigs of ram to
           | display kb of data. :(
        
         | CommitSyn wrote:
         | I am only guessing here, but I assume it's so the content of
         | the mailing list archive remains. If a linked website goes down
         | or changes at any time in the future, then that archive is no
         | longer fulfilling its purpose of archiving important
         | information.
        
           | zxexz wrote:
           | I'm pretty much 100% sure that's the reason, and a good one
           | at that. Mailing lists are the lifeblood of a lot of big open
           | source projects.
        
           | cjsawyer wrote:
           | This is the same logic in avoiding link-only answers on Stack
           | Overflow. They're both good rules.
        
           | sidfthec wrote:
           | The irony being that he presumably wants more information on
           | the mailing list to keep a good archive, while not giving
           | enough information for people to understand that and follow
           | the advice later.
        
           | kevincox wrote:
           | If that was the reason it would have been best to state that
           | in the request.
           | 
           | > Can I please just get the detail in mail so that it is
           | archived with the list?
           | 
           | Of course you can't expect every email written to be perfect,
           | it is generally treated as an informal medium in these
           | settings. But stating the reason helps people understand your
           | motives and serve them better.
        
             | enedil wrote:
             | I think that hardcore kernel devs already know the reasons,
             | and there is no point in raising it again. For you it might
             | seem like a random requirement, but it's because of lack of
             | familiarity.
        
               | Szpadel wrote:
               | I think in that case an explanation is needed even more. If
               | you are a hardcore dev, then no one needs to remind you
               | about such a rule; on the other hand, if you are not so
               | familiar with those rules yet, an explanation would be very
               | helpful.
        
         | actionfromafar wrote:
         | Maybe it's so the mail threads keep the full records.
        
         | aidenn0 wrote:
         | My suspicion is that it's not about reading the bug info once,
         | but having the information in the mailing-list, which is the
         | archive of record for kernel bugs.
        
         | dale_glass wrote:
         | It's LKML. The volume of that list is insane, and technical
         | discussion is very much the point, so they'd expect you to
         | explain the problem right there, where people can quote parts
         | of it, and comment on each part separately.
        
           | nroets wrote:
           | Many of the participants may also be reading it in a terminal
           | emulator with no web browser nearby.
        
             | _zoltan_ wrote:
             | maybe those people should rethink how to do stuff in 2023.
        
               | mulmen wrote:
               | You're welcome to go tell the Linux kernel devs what they
               | are doing wrong. Fuck around and find out as the kids
               | say. Or start the Zolnux project and see how far that
               | goes chasing shiny objects.
        
               | owenmarshall wrote:
               | Their software, their workflow. "Bend to it or pick
               | something else" seems entirely fine to me.
        
               | _zoltan_ wrote:
               | this is not really true for open source, I think. since
               | it's collaborative I think it's fair to expect people to
               | be able to open a GitHub link
        
               | snapcaster wrote:
               | you're wrong. instead you should adopt the standards of
               | the group you're attempting to join. Getting "tourist who
               | complains about customs of country they visit" vibes from
               | this comment
        
               | owenmarshall wrote:
               | I run OpenBSD on most of my systems. The OpenBSD
               | development team collaborates using cvs instead of git
               | because it fits their workflow well. If I wanted to
               | collaborate with them, I'd use cvs too - and if I wanted
               | to move them to git I'd do it _after_ becoming a core
               | contributor, not before. If I'm going to send bug
               | reports & patches here and there, I'm going to do it in a
               | way that makes it easy for Theo and team to review.
               | 
               | This is very much a Chesterton's fence topic, I think.
               | Linux developers have settled on a workflow that works
               | for them, and if you want to get time from the people who
               | are doing the bulk of the work it's fair to expect _you_
               | to work within their requests.
        
               | mulmen wrote:
               | It's a gitlab link, not github. And it isn't reasonable
               | in this context. GitHub hosts a lot of open source
               | projects but it is not the only place where open source
               | happens. That's kinda the point of open source, and
               | especially of git.
               | 
               | Git itself is a satellite project of the Linux kernel. It
               | can work without the web at all. That someone EEE'd it so
               | hard that even Microsoft couldn't resist is no reason to
               | expect the kernel devs to change their workflow.
        
             | rblatz wrote:
             | Are they on a PDP-11 or a dumb terminal?
        
               | treeman79 wrote:
               | https://en.m.wikipedia.org/wiki/Lynx_(web_browser)
               | 
               | Used this daily for many years. Was great when connecting
               | to the internet was only practical via a shell.
        
               | Dylan16807 wrote:
               | Did you try it on this site?
               | 
               | All of the comments/updates on the bug report are loaded
               | by javascript and don't work for me in lynx or elinks.
        
               | aabbcc1241 wrote:
               | Do you mean hacker news as "this site"? HN seems to be
               | server side rendered, so it should display well without
               | Javascript.
        
               | jwilk wrote:
               | I think they meant
               | <https://gitlab.com/qemu-project/qemu/-/issues/1696>.
        
               | inetknght wrote:
               | > _Are they on...?_
               | 
               | I've met people who seriously do use dumb terminals and
               | other people who have seriously discussed using a PDP-11.
               | 
               | So, while your question might sound sarcastic, the answer
               | is definitely yes.
               | 
               | Nerds gonna nerd. Nothing wrong with that.
               | 
               | I personally don't like going to gitlab or github because
               | I don't like the businesses behind them. That's another
               | point irrespective of whether I'm browsing in a terminal
               | or ancient device.
        
         | rwmj wrote:
         | I was a bit short in the original description, but luckily
         | we've since reached an understanding on how to try to reproduce
         | this bug.
         | 
         | Unfortunately he's not been able to reproduce it, even though I
         | can reproduce it on several machines here (and it's been
         | independently reproduced by other people at Red Hat). We do
         | know that it happens much less frequently on Intel hardware
         | than AMD hardware (likely just because of subtle timing
         | differences), and he's of course working at Intel.
        
         | mulmen wrote:
         | Asking to click a link in an email is unreasonable in this
         | context. The email list is the official channel and project
         | participants are expected to use it. They are not expected to
         | have a web browser. The popularity of the linked site is
         | irrelevant. Part of filing good bug reports is understanding a
         | project's communication style. A link to supplementary
         | information is fine. But like a Stack Overflow answer the email
         | should stand on its own.
        
         | sigzero wrote:
         | Yes, he should have just gone and looked there. Github is not a
         | "random website".
        
           | mulmen wrote:
           | The link is to gitlab, not github. But any website is
           | inappropriate in this context because it's not permanent. The
           | email list is, at least as far as the project is concerned.
        
       | gfiorav wrote:
       | I once had to bisect a Rails app between major versions and
       | dependencies. Every bisect would require me to build the app, fix
       | the dependency issues, and so on.
       | 
       | And I thought I had it bad!
        
       | hoten wrote:
       | > For unclear reasons the bisect only got me down to a merge
       | commit, I then had to manually test each commit within that which
       | took about another day.
       | 
       | Having hit this before myself... does anyone know how to finagle
       | git bisect to be useful for non-linear history?
        
       | voytec wrote:
       | Why was the title editorialized, a few hours after posting,
       | with "21 hours" (not important, clickbait-ish)? It was not
       | breaking any of the guidelines[1] to my understanding.
       | 
       | [1] https://news.ycombinator.com/newsguidelines.html
        
       ___________________________________________________________________
       (page generated 2023-06-14 23:00 UTC)