[HN Gopher] Synthea: Open-source synthetic patient generation
       ___________________________________________________________________
        
       Synthea: Open-source synthetic patient generation
        
       Author : johncole
       Score  : 58 points
       Date   : 2023-05-19 14:30 UTC (8 hours ago)
        
 (HTM) web link (synthetichealth.github.io)
 (TXT) w3m dump (synthetichealth.github.io)
        
       | synaesthesisx wrote:
       | I've worked at the intersection of AI & healthcare for years and
       | this has been an excellent tool I've leveraged in the past;
       | synthetic data is particularly helpful in the context of
       | healthcare!
        
       | ThaDood wrote:
       | I actually had this idea when I worked for a local HIE. I just
       | lacked the technical competency to make it real. I think this
       | would be incredibly useful for the adoption of FHIR and also
       | learning more about HL7. For security-minded folks this
       | information could be a good tool for tuning DLP and other tools
       | without using real patient data.
        
         | ska wrote:
         | I've done this sort of thing before with home-rolled tools, it
         | easily becomes a time sink. Having a centralized shared effort
         | seems like it could be really valuable.
         | 
         | One thing that is tricky is that you often need signals and/or
         | image data as well.
        
       | reshmakh wrote:
       | Synthea is great! We use it a ton at Medplum - and the sample
       | data that conforms to USCDI is especially useful; we recommend it
       | for those who are getting started.
       | https://www.medplum.com/docs/tutorials/importing-sample-data
        
       | techwizrd wrote:
       | I never expected to see MITRE on the front page of HN! We're
       | actively adding more synthetic data sources to Synthea all the
       | time.
        
         | shawnz wrote:
         | Doesn't MITRE maintain the CVE database?
        
           | orbz wrote:
           | Yes, MITRE is a non-profit organization that works with a
           | number of US government agencies to cover a pretty large
           | swath of areas: https://www.mitre.org/focus-areas
        
         | orbz wrote:
         | Love Synthea, it's an amazing project and you should be very
         | proud of it. My only gripe is how clean the data is compared to
         | what many other EHR providers actually generate, but that's
         | more on them than you guys.
        
           | ska wrote:
           | It would be perhaps interesting to add a layer capable of
           | injecting reasonable noise on top of these clean records.
        
             | techwizrd wrote:
             | I was thinking about this earlier this month, actually! I
             | generate a lot of synthetic flight data, and we have to
             | reproduce the noisiness of real data as well.
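             | 
             | A minimal sketch of such a noise layer, in Python. The flat
             | dict record, the field names, and the two noise types
             | (dropped fields, transposed characters) are illustrative
             | assumptions; this is not Synthea's output format or API.
             | 

```python
import random

def add_noise(record, rng, p_missing=0.1, p_typo=0.05):
    """Return a noisy copy of a flat record (dict of field name -> string).

    p_missing and p_typo are made-up rates, not measured from real EHR data.
    """
    noisy = {}
    for field, value in record.items():
        if rng.random() < p_missing:
            continue  # simulate a field that was never filled in
        if len(value) > 1 and rng.random() < p_typo:
            i = rng.randrange(len(value) - 1)
            # Transpose two adjacent characters, a common keying error.
            value = value[:i] + value[i + 1] + value[i] + value[i + 2:]
        noisy[field] = value
    return noisy

# Hypothetical clean record used only for illustration.
clean = {
    "family_name": "Example",
    "given_name": "Pat",
    "birth_date": "1980-01-31",
    "city": "Springfield",
}
rng = random.Random(0)
noisy_records = [add_noise(clean, rng) for _ in range(100)]
```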
        
       | MilStdJunkie wrote:
       | Does anyone know if there is an equivalent for generating
       | "random" viable products[1] in a PDM/ERP system?
       | 
       | I'm demoing some systems in this field for outside interests, but
       | I can't use any "real" data due to ITAR and data restrictions
       | like TC, NC, etc. Wait, what about the ERP? The ERP I'm
       | developing against has "sample" data that's basically useless.
       | Not much better than _lorem ipsum_ pasted across ten thousand
       | cells. Actually, it's worse than that, because . . ah hell, this
       | is HN, I won't waste your time. People here know what the ERP
       | ecosystem is like. I also don't want to build out from a bespoke,
       | brittle ERP - that's how we got into this mess in the first
       | place.
       | 
       | [1] Like a multi-level BOM that makes sense, or a Service BOM /
       | Logistics Database that's meaningful. Anything for making pseudo-
       | random PLs that follow MIL-STD-100, which is still considered
       | frickin' Holy Ground by these people.
        
         | ted_dunning wrote:
         | Building synthetic BOMs can be fairly straightforward if you
         | can define the level of coherency you want to see. The only big
         | trick I have found in building structured data like this is to
         | first build dictionaries of randomized data with very little
         | coherence and then build larger structures that include
         | elements of the dictionaries.
         | 
         | As an example, you might want to have a model of users
         | interacting with a web site, ordering products and shipping
         | them to their homes. This can start with building a dictionary
         | of user records and orderable item descriptions. The user
         | records would have an address and some "interest" variables
         | that define what the user is likely to order. The item
         | descriptions can have a lot or a little information but would
         | centrally contain a part number and some information that
         | allows the part to be selected efficiently (a numerical vector
         | may be enough). If you want to be crazy, you can use generative
         | models to generate descriptions from random semantic starting
         | points or use lower level tables to piece together these
         | things.
         | 
         | At this point, you can pretty easily build a user model and run
         | it for each user to generate coherent transactional histories.
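         | 
         | A minimal sketch of the steps above, in Python. All names
         | here (build_items, build_users, order_history) are
         | hypothetical, and the exponential weighting of the affinity
         | score is one arbitrary choice of user model; none of this is
         | code from log-synth.
         | 

```python
import math
import random

def build_items(rng, n_items, dim=3):
    """Dictionary of orderable items: a part number plus a numerical vector."""
    return [
        {"part_number": f"P-{i:05d}",
         "features": [rng.gauss(0, 1) for _ in range(dim)]}
        for i in range(n_items)
    ]

def build_users(rng, n_users, dim=3):
    """Dictionary of users: an address plus 'interest' variables."""
    return [
        {"user_id": i,
         "address": f"{rng.randint(1, 9999)} Example St",
         "interests": [rng.gauss(0, 1) for _ in range(dim)]}
        for i in range(n_users)
    ]

def affinity(user, item):
    """Dot product of the user's interests and the item's vector."""
    return sum(u * f for u, f in zip(user["interests"], item["features"]))

def order_history(rng, user, items, n_orders):
    """Run the user model: items with higher affinity are ordered more often."""
    # exp() keeps every weight positive, as random.choices requires.
    weights = [math.exp(affinity(user, item)) for item in items]
    picks = rng.choices(items, weights=weights, k=n_orders)
    return [
        {"user_id": user["user_id"],
         "ship_to": user["address"],
         "part_number": item["part_number"]}
        for item in picks
    ]

rng = random.Random(42)
items = build_items(rng, 50)
users = build_users(rng, 5)
histories = {u["user_id"]: order_history(rng, u, items, 10) for u in users}
```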
         | 
         | Several of these ideas are present in a project I worked on
         | called log-synth [1]. For instance, the VIN generator has
         | tables of factories and such for BMW and Ford so it generates
         | kind of coherent VINs that can be traced back with factory
         | location, engine and body type. If you look hard these are
         | nonsense, but if you squint the right way they look fine.
         | 
         | The commuter generator and the DNS query generator are examples
         | of higher-level transaction generators. For the commuter,
         | there is a model of a user with a home location and a work
         | location. These commuters go to work some days and run errands
         | other days and there is a simple model to pick an activity.
         | Digging in, each activity breaks down into journeys along
         | entirely incoherent road structures but details like a physical
         | model of the engine and car velocity are maintained so you can
         | get realistic diagnostics from the vehicles from somewhat
         | realistic life histories. The DNS query generator is similar
         | but with less physics.
         | 
         | One nice statistical concept in all of this is the concept of a
         | statistical distribution over a notionally infinite set. Some
         | things in the set will be much more commonly seen than others
         | and thus we are likely to see those sooner. The generator of
         | these things can maintain an estimate of the frequency of all
         | previously seen things and a probability of seeing something
         | new (see the Chinese Restaurant process [2]). You only need to
         | generate the specifics of a thing in this infinite set when you
         | first see it which gives you pretty realistic texture to the
         | fictional transactional world.
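         | 
         | A minimal sketch of such a generator, in Python, using the
         | standard one-parameter Chinese Restaurant process. The class
         | name, the alpha value, and the make_item callback are
         | assumptions for illustration, not log-synth's implementation.
         | 

```python
import random

class ChineseRestaurantSampler:
    """Draws from a notionally infinite set, inventing members lazily.

    alpha (> 0) controls how often something new appears.
    make_item(n) is a hypothetical callback that makes up the n-th new thing.
    """

    def __init__(self, alpha, make_item, rng):
        self.alpha = alpha
        self.make_item = make_item
        self.rng = rng
        self.items = []   # specifics, generated only on first sight
        self.counts = []  # how many times each item has been drawn
        self.total = 0

    def sample(self):
        # New item with probability alpha / (total + alpha); otherwise an
        # existing item, chosen in proportion to how often it was seen.
        if self.rng.random() < self.alpha / (self.total + self.alpha):
            index = len(self.items)
            self.items.append(self.make_item(index))
            self.counts.append(0)
        else:
            index = self.rng.choices(
                range(len(self.items)), weights=self.counts
            )[0]
        self.counts[index] += 1
        self.total += 1
        return self.items[index]

rng = random.Random(7)
parts = ChineseRestaurantSampler(
    alpha=2.0, make_item=lambda n: f"PART-{n:04d}", rng=rng
)
draws = [parts.sample() for _ in range(1000)]
```

         | A few items end up dominating the draws while new ones keep
         | trickling in at a decreasing rate.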
         | 
         | Relative to your problem of multi-level BOMs, you could say
         | that a BOM is a list of items. Pick the desired length from a
         | suitable distribution. Then pick each item from a Chinese
         | Restaurant process. As you generate new items, decide if the
         | item is composite and if so, generate a BOM for it recursively.
         | Constraints like forcing a composite item to not recursively
         | contain anything of the same type can be enforced using a
         | rejection method (sometimes).
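         | 
         | A minimal sketch of that recursion, in Python. It uses a
         | simplified stand-in for the constraint: a part may not
         | contain itself, directly or indirectly, and a rejected draw
         | is simply redrawn. BomGenerator, the PN- part numbers, and
         | all parameter values are hypothetical.
         | 

```python
import random

class BomGenerator:
    """Recursive BOM generator backed by a Chinese Restaurant process."""

    def __init__(self, rng, alpha=3.0, p_composite=0.4, max_depth=3,
                 max_tries=20):
        self.rng = rng
        self.alpha = alpha              # higher means more brand-new parts
        self.p_composite = p_composite  # chance a new part has its own BOM
        self.max_depth = max_depth      # parts created here are always leaves
        self.max_tries = max_tries      # rejection attempts per BOM slot
        self.parts = []                 # catalog; list index identifies a part
        self.counts = []                # times each part has been drawn

    def _draw(self):
        """CRP draw: an existing index, or len(self.parts) for a new part."""
        total = sum(self.counts)
        if self.rng.random() < self.alpha / (total + self.alpha):
            return len(self.parts)
        return self.rng.choices(range(len(self.parts)), weights=self.counts)[0]

    def _create(self, depth, ancestors):
        index = len(self.parts)
        part = {"id": f"PN-{index:05d}", "children": []}
        self.parts.append(part)
        self.counts.append(1)  # count the draw that created the part
        if depth < self.max_depth and self.rng.random() < self.p_composite:
            part["children"] = self.make_bom(depth + 1, ancestors | {index})
        return index

    def make_bom(self, depth=0, ancestors=frozenset()):
        """Return a list of part indices; the length is drawn from 2..5."""
        bom = []
        for _ in range(self.rng.randint(2, 5)):
            for _ in range(self.max_tries):
                index = self._draw()
                if index == len(self.parts):
                    bom.append(self._create(depth, ancestors))
                    break
                if index not in ancestors:
                    self.counts[index] += 1
                    bom.append(index)
                    break
                # Otherwise the draw would create a cycle: reject and redraw.
        return bom

gen = BomGenerator(random.Random(1))
top_level = gen.make_bom()
```

         | If every attempt for a slot is rejected, the slot is dropped,
         | which is one reason rejection only works "sometimes".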
         | 
         | If this seems at all interesting, ping me by filing an issue on
         | the log-synth github repository.
         | 
         | [1] https://github.com/tdunning/log-synth
         | [2] https://en.wikipedia.org/wiki/Chinese_restaurant_process
        
       | erwinh wrote:
       | Played around with this in my soon-to-be previous health-tech job
       | and it's great.
       | 
       | Actually the entire hl7-fhir ( https://www.hl7.org/fhir/ )
       | standard seems to me quite solid. It would be wonderful if a new
       | cohort of start-ups would leverage it to drastically improve the
       | digital UX of healthcare generally.
        
         | jjordan wrote:
         | That would be great except that the 8,000 lb gorillas of the
         | medical data industry, at least as of a year or two ago, did
         | next to nothing to really make their EHRs FHIR-compatible.
         | Even some of the very basics on their demo environments
         | were fundamentally broken.
        
           | erwinh wrote:
           | Yeah, so many cards are stacked against start-ups that
           | could potentially bring some quality to the industry :/
           | Curious to see, though, that Google Cloud / AWS etc. are
           | building FHIR store APIs.
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2023-05-19 23:01 UTC)