[HN Gopher] Synthea: Open-source synthetic patient generation
___________________________________________________________________
Synthea: Open-source synthetic patient generation

Author : johncole
Score  : 58 points
Date   : 2023-05-19 14:30 UTC (8 hours ago)

(HTM) web link (synthetichealth.github.io)
(TXT) w3m dump (synthetichealth.github.io)

| synaesthesisx wrote:
| I've worked at the intersection of AI & healthcare for years and
| this has been an excellent tool I've leveraged in the past;
| synthetic data is particularly helpful in the context of
| healthcare!
| ThaDood wrote:
| I actually had this idea when I worked for a local HIE. I just
| lacked the technical competency to make it real. I think this
| would be incredibly useful for the adoption of FHIR and also
| learning more about HL7. For security-minded folks this
| information could be a good tool for tuning DLP and other tools
| without using real patient data.
| ska wrote:
| I've done this sort of thing before with home-rolled tools, it
| easily becomes a time sink. Having a centralized shared effort
| seems like it could be really valuable.
|
| One thing that is tricky is that you often need signals and/or
| image data as well.
| reshmakh wrote:
| Synthea is great! We use it a ton at Medplum - and the sample
| data that conforms to USCDI is especially useful; we recommend it
| for those who are getting started.
| https://www.medplum.com/docs/tutorials/importing-sample-data
| techwizrd wrote:
| I never expected to see MITRE on the front page of HN! We're
| actively adding more synthetic data sources to Synthea all the
| time.
| shawnz wrote:
| Doesn't MITRE maintain the CVE database?
| orbz wrote:
| Yes, MITRE is a non-profit organization that works with a
| number of US government agencies to cover a pretty large
| swath of areas: https://www.mitre.org/focus-areas
| orbz wrote:
| Love Synthea, it's an amazing project and you should be very
| proud of it.
| My only gripe is how clean the data is compared to
| what many other EHR providers actually generate, but that's
| more on them than you guys.
| ska wrote:
| It would perhaps be interesting to add a layer capable of
| injecting reasonable noise on top of these clean records.
| techwizrd wrote:
| I was thinking about this earlier this month, actually! I
| generate a lot of synthetic flight data, and we have to
| reproduce the noisiness of real data as well.
| MilStdJunkie wrote:
| Does anyone know if there is an equivalent for generating
| "random" viable products[1] in a PDM/ERP system?
|
| I'm demoing some systems in this field for outside interests, but
| I can't use any "real" data due to ITAR and data restrictions
| like TC, NC, etc. Wait, what about the ERP? The ERP I'm
| developing against has "sample" data that's basically useless.
| Not much better than _lorem ipsum_ pasted across ten thousand
| cells. Actually, it's worse than that, because . . ah hell, this
| is HN, I won't waste your time. People here know what the ERP
| ecosystem is like. I also don't want to build out from a bespoke,
| brittle ERP - that's how we got into this mess in the first
| place.
|
| [1] Like a multi-level BOM that makes sense, or a Service BOM /
| Logistics Database that's meaningful. Anything for making
| pseudo-random PLs that follow MIL-STD-100, which is still
| considered frickin' Holy Ground by these people.
| ted_dunning wrote:
| Building synthetic BOMs can be fairly straightforward if you
| can define the level of coherency you want to see. The only big
| trick in building structured data like this that I have built
| is to first build dictionaries of randomized data with very
| little coherence and then build larger structures that include
| elements of the dictionaries.
|
| As an example, you might want to have a model of users
| interacting with a web site, ordering products and shipping
| them to their homes.
| This can start with building a dictionary
| of user records and orderable item descriptions. The user
| records would have an address and some "interest" variables
| that define what the user is likely to order. The item
| descriptions can have lots or a little information but would
| centrally contain a part number and some information that
| allows the part to be selected efficiently (a numerical vector
| may be enough). If you want to be crazy, you can use generative
| models to generate descriptions from random semantic starting
| points or use lower level tables to piece together these
| things.
|
| At this point, you can pretty easily build a user model and run
| it for each user to generate coherent transactional histories.
|
| Several of these ideas are present in a project I worked on
| called log-synth [1]. For instance, the VIN generator has
| tables of factories and such for BMW and Ford so it generates
| kind of coherent VINs that can be traced back with factory
| location, engine and body type. If you look hard these are
| nonsense, but if you squint the right way they look fine.
|
| The commuter generator and the DNS query generator are examples
| of higher-level transaction generators. For the commuter,
| there is a model of a user with a home location and a work
| location. These commuters go to work some days and run errands
| other days and there is a simple model to pick an activity.
| Digging in, each activity breaks down into journeys along
| entirely incoherent road structures but details like a physical
| model of the engine and car velocity are maintained so you can
| get realistic diagnostics from the vehicles from somewhat
| realistic life histories. The DNS query generator is similar
| but with less physics.
|
| One nice statistical concept in all of this is the concept of a
| statistical distribution over a notionally infinite set.
| Some things in the set will be much more commonly seen than
| others and thus we are likely to see those sooner. The generator
| of these things can maintain an estimate of the frequency of all
| previously seen things and a probability of seeing something
| new (see the Chinese Restaurant process [2]). You only need to
| generate the specifics of a thing in this infinite set when you
| first see it, which gives you pretty realistic texture to the
| fictional transactional world.
|
| Relative to your problem of multi-level BOMs, you could say
| that a BOM is a list of items. Pick the desired length from a
| suitable distribution. Then pick each item from a Chinese
| Restaurant process. As you generate new items, decide if the
| item is composite and if so, generate a BOM for it recursively.
| Constraints like forcing a composite item to not recursively
| contain anything of the same type can be enforced using a
| rejection method (sometimes).
|
| If this seems at all interesting, ping me by filing an issue on
| the log-synth github repository.
|
| [1] https://github.com/tdunning/log-synth [2]
| https://en.wikipedia.org/wiki/Chinese_restaurant_process
| erwinh wrote:
| Played around with this in my soon-to-be previous health-tech job
| and it's great.
|
| Actually the entire hl7-fhir ( https://www.hl7.org/fhir/ )
| standard seems to me quite solid. It would be wonderful if a new
| cohort of start-ups would leverage it to drastically improve the
| digital UX of healthcare generally.
| jjordan wrote:
| That would be great except that the 8,000 lb gorillas of the
| medical data industry, at least as of a year or two ago, did
| next to nothing to really make their EHRs FHIR-compatible.
| Even some of the very basics on their demo environments
| were fundamentally broken.
| erwinh wrote:
| Yeah so many cards stacked against potential start-ups who
| could potentially bring some quality to the industry :/
| Curious to see though that Google cloud / AWS etc are
| building fhir store APIs.
| [deleted]
___________________________________________________________________
(page generated 2023-05-19 23:01 UTC)
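
Editor's sketch: ted_dunning's recipe for multi-level BOMs (pick a list length, draw each item from a Chinese Restaurant process, and recurse when a new item turns out to be composite) could look roughly like the Python below. This is not code from log-synth. The class names `Item` and `BomGenerator`, the parameters `alpha`, `p_composite` and `max_depth`, the `P-00000` part-number format, and the 2 to 6 item length range are all assumptions made up for illustration. Instead of the rejection method mentioned in the comment, this sketch registers a part only after its children exist, which is one simple way to keep a part from containing itself.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Item:
    part_number: str
    children: list = field(default_factory=list)  # empty list means a leaf part


class BomGenerator:
    """Hypothetical generator; every name and default here is an assumption."""

    def __init__(self, alpha=2.0, p_composite=0.3, max_depth=3, seed=0):
        self.alpha = alpha              # larger alpha => new parts appear more often
        self.p_composite = p_composite  # chance that a new part is itself an assembly
        self.max_depth = max_depth      # new parts created at this depth are always leaves
        self.rng = random.Random(seed)
        self.items = []                 # every finished part seen so far
        self.counts = []                # how many times each part has been picked
        self.next_id = 0

    def pick_item(self, depth):
        """One Chinese Restaurant process draw: reuse an old part or create one."""
        total = sum(self.counts)
        # Part i is reused with probability counts[i] / (total + alpha);
        # a new part is created with probability alpha / (total + alpha).
        r = self.rng.random() * (total + self.alpha)
        for i, count in enumerate(self.counts):
            if r < count:
                self.counts[i] += 1
                return self.items[i]
            r -= count
        return self.new_item(depth)

    def new_item(self, depth):
        """Generate the specifics of a part only when it is first seen."""
        item = Item(part_number=f"P-{self.next_id:05d}")
        self.next_id += 1
        if depth < self.max_depth and self.rng.random() < self.p_composite:
            item.children = self.make_bom(depth + 1)
        # Register only after the children exist, so a part cannot contain itself.
        self.items.append(item)
        self.counts.append(1)
        return item

    def make_bom(self, depth=0):
        """A BOM is a list of parts whose length comes from a simple distribution."""
        length = self.rng.randint(2, 6)
        return [self.pick_item(depth) for _ in range(length)]


def show(bom, indent=0):
    """Print a BOM as an indented tree."""
    for item in bom:
        print(" " * indent + item.part_number)
        show(item.children, indent + 2)


if __name__ == "__main__":
    show(BomGenerator(seed=1).make_bom())
```

Because popular parts are reused in proportion to how often they have already been picked, a few part numbers end up showing up in many assemblies while most appear only once or twice, which is the "realistic texture" the comment describes. Note that `max_depth` only limits where new assemblies are created; a reused assembly can still bring its own subtree with it.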