[HN Gopher] Full text search on 400M US court cases
___________________________________________________________________
Full text search on 400M US court cases

Author : richardbarosky
Score  : 256 points
Date   : 2020-11-19 15:52 UTC (7 hours ago)

(HTM) web link (www.judyrecords.com)
(TXT) w3m dump (www.judyrecords.com)

| jaequery wrote:
| What a clean interface. We need more websites to look like this.
| richardbarosky wrote:
| Thank you
| bflesch wrote:
| The search is very quick. Does anybody know what their tech stack
| looks like?
| richardbarosky wrote:
| https://www.reddit.com/r/programming/comments/jg4rkv/how_a_s...
| kordlessagain wrote:
| From Reddit thread:
|
| > MySQL 8 is used for DB. The search server uses elasticsearch
| 7.8.
| jillesvangurp wrote:
| Sounds like that would be an easy use case for elasticsearch
| indeed. I've seen it handle much bigger data sets. Solr would
| work as well. There are probably a few other options on the
| market but elasticsearch would probably do pretty well on
| this even without a lot of tuning.
|
| For reference, I once threw the entirety of OpenStreetMap
| at it before it even hit 1.0 to implement a simple reverse
| geocoding thing. Basically a couple hundred million street
| segments, some polygons, etc. At the time the geospatial
| support wasn't great and very new and very CPU intensive. I
| got away with indexing all of that and running it on a single
| node cluster with a Xeon and 32G of RAM and spinning disk
| (RAID 1, no SSD). It worked great. Very responsive. Indexing
| only took about 50 minutes or so. Most of that was my parsing
| logic. That's not comparable of course, I'd expect this to be
| faster on the same hardware with a current version of
| Elasticsearch. They've made a lot of leaps with improving
| performance, memory usage, CPU usage, disk usage, robustness,
| etc. in the 7 major versions since then.
| godmode2019 wrote:
| I feel this is bad news. Some things should be forgotten.
| In my country your record gets soft wiped after 8 years. With this the
| employer could just look up your name.
| richardbarosky wrote:
| The general idea has various problems. For example, would
| newspapers or accounts of things/people, in various media, of
| objectively public information be required to be retroactively
| removed from any mention? Does it make sense to force and
| dictate what entities/individuals can do with basic information
| at the discretion of anyone who doesn't like it? Just a few
| thoughts. The records exist in the database because they are
| public information. If a record is removed from public view,
| that's done when requested because it's the right thing to do,
| although there is no strict legal obligation to do so.
| colejohnson66 wrote:
| How does Europe's "right to be forgotten" handle it?
| richardbarosky wrote:
| Not sure. Maybe if someone doesn't like the Google search
| results, they make a complaint, and Google has to do what
| they want.
|
| There are many other public records databases that have
| similar data, including the federal judiciary and many
| state courts across the country.
|
| Some are listed on the info page:
| https://www.judyrecords.com/info
| r3trohack3r wrote:
| Interested in how large this dataset is?
|
| Is it in a format that could be backed up by a community to
| protect? Seems like something folks in /r/datahoarder would be
| interested in backing up.
| richardbarosky wrote:
| 15KB is maybe the average case size, including HTTP request
| data.
|
| That's 1024 * 15 * 439,000,000 = 6.7TB roughly.
|
| The cases are all compressed, so I'm not using 6.7TB
| non-compressed for cases. But there are other request and
| non-request related records needed too. Just my backups currently.
| nerdponx wrote:
| One Reddit user estimated the monthly cost of this site at over
| $2000 USD. How are you funding that?
|
| https://www.reddit.com/r/programming/comments/jg4rkv/comment...
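
As a sanity check on the storage estimate quoted earlier in the thread, the multiplication can be reproduced in a few lines of Python. This is only a sketch: the two input figures are the ones given in the comment, while the constant and function names are made up here for illustration. It also shows that "6.7TB" holds in decimal terabytes, and that the same byte count is somewhat smaller when expressed in binary tebibytes.

```python
# Inputs taken from the comment in the thread; the names are illustrative.
AVG_CASE_BYTES = 15 * 1024   # "15KB is maybe the average case size"
CASE_COUNT = 439_000_000     # case count used in the thread's multiplication

total_bytes = AVG_CASE_BYTES * CASE_COUNT


def to_decimal_tb(n_bytes: int) -> float:
    """Convert bytes to decimal terabytes (1 TB = 10**12 bytes)."""
    return n_bytes / 10**12


def to_binary_tib(n_bytes: int) -> float:
    """Convert bytes to binary tebibytes (1 TiB = 2**40 bytes)."""
    return n_bytes / 2**40


print(f"{total_bytes:,} bytes")
print(f"{to_decimal_tb(total_bytes):.2f} TB (decimal)")
print(f"{to_binary_tib(total_bytes):.2f} TiB (binary)")
```

The estimate is for uncompressed data; the thread notes that the stored cases are compressed, so actual disk usage for cases would be lower than this figure.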
| richardbarosky wrote:
| I've downgraded from that. I talked about that in that post. It
| was most definitely a knee-jerk reaction to getting slashdotted
| on a popular subreddit and not wanting that to happen again.
| However, still on some very good hardware and handling current
| workload pretty well right now. That estimate was high.
| ethbr0 wrote:
| Bullet points on what you downgraded to cut costs? Curious
| technical minds want to know.
| richardbarosky wrote:
| Sure, I'll post after the dust settles. Server getting
| smashed but still handling searches pretty dang well.
|
| Some sites crash from the page views, and here I have to
| handle everyone searching 400 million documents too.
| vlmutolo wrote:
| Odds are this won't help you, but just in case you
| haven't seen it.
|
| https://blog.burntsushi.net/transducers/
| [deleted]
| whoaWtf wrote:
| Wow, the OP's reddit history in /r/PurplePillDebate is fucking
| cringe.
|
| https://www.reddit.com/user/aoeusnth48/
| josefresco wrote:
| Wow, I _really_ wish I hadn't jumped down that rabbit hole.
| tomorrowfuture wrote:
| trellis.law does something similar
|
| their searches are indexed and have rulings and documents as
| well.
|
| does this differ from that service?
| [deleted]
| nkw wrote:
| You might consider giving credit to the sources of data used to
| make this.
| wtvanhest wrote:
| It is all public records. The source of the original data is
| the court system. If a 3rd party physically scraped it from
| the court system, others should be able to digitally scrape it.
| richardbarosky wrote:
| All the data is from government databases directly, aside from
| CourtListener, which was recently integrated. It would be good
| to specifically mention CourtListener's contribution.
| aschatten wrote:
| How did you get all that data from government databases
| directly? Do they provide some sort of API for bulk export?
| visarga wrote:
| This dataset would be yummy for GPT3
| richardbarosky wrote:
| Interesting, I'll check it out. Thanks for the link.
| ElijahLynn wrote:
| Interesting (obviously these aren't all the same person):
|
| Page 1 of 1,763 total cases for: donald j. trump
| Page 1 of 2,299 total cases for: donald trump
| txmachinery wrote:
| It's 80 cases when searching: "donald j trump"~4
|
| This is a proximity search, to ensure it's actually turning up
| one of the various permutations of the name (as different court
| protocols may refer by surname first), rather than documents
| that just happen to contain each of the terms somewhere.
|
| For fairness, "hillary rodham clinton"~4 turns up 193 cases.
|
| Relevant doc: https://www.judyrecords.com/info (down the page,
| under "proximity search")
| [deleted]
| m3kw9 wrote:
| What stack did you use for this?
| richardbarosky wrote:
| See other comment link to reddit.
| _gtly wrote:
| https://www.courtlistener.com has more useful features and is
| part of the Free Law Project.
| richardbarosky wrote:
| I've noted CourtListener on the info page:
| https://www.judyrecords.com/info
|
| "PACER notwithstanding, CourtListener is the most powerful case
| law research tool available online -- and in many ways is much
| more powerful."
|
| This is based on CourtListener's 4 million+ written court
| opinions, which judyrecords has recently integrated. But you're
| right, CourtListener has more case law research features.
| tbrock wrote:
| This is awesome. Where did the data come from?
| richardbarosky wrote:
| Thanks. All the data is collected from various government
| databases.
| mrits wrote:
| It's pretty hilarious and somewhat frightening I found my dad's
| arrest 25 years ago for a speeding ticket he had "forgotten" to
| pay. I remember being 11 years old and having to wait 8 hours for
| my parents to come back from picking up a pizza. Data
| availability is crazy.
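
The `"donald j trump"~4` proximity query mentioned earlier in the thread can be illustrated with a small Python sketch. It is a deliberate simplification and not how Elasticsearch or Lucene score sloppy phrases internally: here a document matches if every query term occurs and some set of occurrences fits inside a window of `len(terms) - 1 + slop` positions, in any order. The function name and the sample strings are hypothetical.

```python
from itertools import product


def within_proximity(text: str, phrase: str, slop: int) -> bool:
    """Simplified proximity check (an approximation, not Lucene's algorithm).

    Returns True if each term of `phrase` occurs in `text` and one occurrence
    of each can be chosen so that they all fall within a window of
    len(terms) - 1 + slop word positions, regardless of order.
    """
    words = [w.strip('.,;:"').lower() for w in text.split()]
    terms = phrase.lower().split()
    # For each query term, collect the word positions where it occurs.
    position_lists = [
        [i for i, w in enumerate(words) if w == term] for term in terms
    ]
    if any(not positions for positions in position_lists):
        return False  # at least one term is missing entirely
    max_span = len(terms) - 1 + slop
    # Brute force over one position per term; fine for short texts.
    return any(
        max(combo) - min(combo) <= max_span
        for combo in product(*position_lists)
    )


# Hypothetical snippets: surname-first ordering still matches,
# while terms scattered across a long passage do not.
close = "Trump, Donald J. v. Acme Holdings"
scattered = ("Donald filed a motion. Many pages later the judge J noted "
             "that a trump card was played")
print(within_proximity(close, "donald j trump", 4))
print(within_proximity(scattered, "donald j trump", 4))
```

This mirrors the point made in the comment: the query tolerates permutations of the name but rejects documents that merely contain each term somewhere.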
| hbt wrote:
| the frightening part is although your father's record is
| available to the public, police officers who are caught lying
| while testifying get to seal the record.
|
| in many jurisdictions, sealing a record is the equivalent of
| destroying it.
|
| your crimes will haunt you forever because the system never
| forgets, meanwhile they simply go back to business like it
| never happened
|
| ref
| https://www.google.com/amp/s/www.nytimes.com/2018/03/18/nyre...
| handol wrote:
| Found an assault charge on my Mom from '92.
| richardbarosky wrote:
| Do you think the data should be removed from the government
| portals? Those are interesting points. What do you think is
| the right balance to strike?
|
| I can see why it might be surprising to find some results
| when searching. The same data has already been available in
| many other databases that have existed long before this one
| and in those described on the info page as well.
| handol wrote:
| It's on the internet forever now. If there's a balance to
| strike it would have had to have been done in 2007 when the
| court digitized their records and put them online.
|
| A search for "minor consuming" reveals a few hundred
| thousand cases against children. I'm a little surprised to
| see that.
| nautical wrote:
| Results like "MEETING ID" and "PASSWORD" for zoom meetings show
| up way more than any other video conferencing tool for 2020
| cases.
| nautical wrote:
| Many Zoom meetings are recurring and this might not be safe
| programbreeding wrote:
| Looked at one record as an example and sure enough, the same
| meeting ID and password is found in 709 different cases in
| Cleveland, OH.
| vmception wrote:
| Fascinating! Was surprised to see random infractions
|
| Does this have the lower trial court records too?
| richardbarosky wrote:
| Yes, it has records from different trial courts.
| vmception wrote:
| Is there a list of jurisdictions and courts that it has?
| kontxt wrote:
| Super cool--and very fast! Anyone looking to collaborate on these
| can easily add Kontxt (https://www.kontxt.io) right on to them
| and have localized discussions directly on page-parts.
| richardbarosky wrote:
| Thanks. I saw your post on reddit a while back. Was going to
| ask about your tech stack.
| justinzollars wrote:
| This is great. It's like Google before it became evil.
| richardbarosky wrote:
| Thank you
| chris_f wrote:
| Congratulations on the launch. I have worked in open source and
| public record research for the last 15 years, and your coverage
| is extremely impressive.
|
| Do you have any long term plan for the site? I can see this going
| in a lot of different directions depending on your goals.
| richardbarosky wrote:
| Thanks, as far as I know it's the largest database of court
| cases on the Internet. If there's enough traffic I'll support
| the site with ads. Don't have any other specific plans
| currently.
| HoverSausage wrote:
| I just managed to find the home address of a YouTuber I'm a fan
| of in 15 seconds. Creepy site. Glad I'm not in the US
| richardbarosky wrote:
| Interesting point. If you know the state where someone lives,
| you can look up the same info on the government website.
| Additionally, many, many other databases have the same public
| data but they ask for a payment to search.
| draw_down wrote:
| Wow, this is really good. I found a court case from my childhood
| that I've always half wanted to, and half wanted not to, read
| through (painful, but potentially enlightening). Now I don't have
| to go to the courthouse to read it. Choices, choices.
|
| (False alarm, it's just the case record and not the transcript)
| 120bits wrote:
| From the reddit link [1]
|
| Sorry, I'm just curious.
|
| It says MySQL 8 and Elasticsearch 7.8. I don't have much
| experience in elasticsearch, I wanted to know how does
| elasticsearch make it faster? Is it like an extension that makes
| it faster?
| Or does Elasticsearch have its own data store that consumes
| data from the database and magically makes it faster?
|
| Thanks.
|
| [1] https://www.reddit.com/r/programming/comments/jg4rkv/how_a_s...
| richardbarosky wrote:
| Elasticsearch is a search platform. A "database" but meant for
| search stuff. It's not part of MySQL.
| rpedela wrote:
| Elasticsearch, Lucene under the hood, implements an inverted
| index which is an extremely fast data structure for text
| search. ES has clustering as a primary feature too and many
| search features that can significantly improve relevance that
| you won't find in MySQL and most other databases.
| devy wrote:
| Have you tried Toshi[1] or MeiliSearch[2]? I wonder how it
| would compare in terms of operational costs (monthly cloud
| hosting bill) at the current data size.
|
| [1]: https://news.ycombinator.com/item?id=18895655
|
| [2]: https://news.ycombinator.com/item?id=22685831
| jabo wrote:
| Do you have the structured dataset somewhere? I'd love to
| index it in Typesense [1] and see how it does.
|
| I recently tried a 32M songs dataset [2] and it works
| great, so I'm on the lookout for larger datasets to
| benchmark with.
|
| [1] https://news.ycombinator.com/item?id=22181437
|
| [2] https://songs-search.typesense.org/
| arminiusreturns wrote:
| How would you say it stands up to Splunk these days?
| lolive wrote:
| Plus it does not accept joins. So you basically have to
| denormalize all your data before injecting into Elastic. It
| helps speed things up, but it is a headache to manage on a
| day to day basis.
| nerdponx wrote:
| https://news.ycombinator.com/item?id=25152925
| pachico wrote:
| Jeez, 36 pages with resume for "Napster"!
| onetimemanytime wrote:
| I understand the open court argument, we need to see what goes on
| so nothing funny happens there. But unless we're talking about a
| major crime, what good does it do to list and index on Google
| everything from 30 years ago?
|
| I am no fan of this at all.
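
To make the inverted index mentioned earlier in the thread more concrete, here is a minimal Python sketch of the general idea: instead of scanning every document for a word, the index maps each term to the set of documents containing it, so a query becomes a few dictionary lookups plus a set intersection. This is a toy illustration under simplifying assumptions (naive whitespace tokenization, no ranking, no compression), not a description of Lucene's actual implementation. The sample documents and all function names are hypothetical.

```python
from collections import defaultdict

PUNCTUATION = '.,;:!?"()'


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace, dropping surrounding punctuation."""
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [token for token in tokens if token]


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of document ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of documents containing every term in the query (AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical miniature "case" collection.
docs = {
    1: "State v. Smith, speeding ticket",
    2: "Smith v. Jones, contract dispute",
    3: "Speeding ticket dismissed",
}
index = build_index(docs)
print(search(index, "speeding ticket"))
print(search(index, "smith speeding"))
```

The lookup cost depends on the number of query terms and the size of their posting sets rather than on the total number of documents, which is the property that makes this structure attractive for full text search.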
| richardbarosky wrote:
| Only 3 pages are indexed on Google. Actually, most of the other
| legal databases (listed on info page) have their cases indexed
| on Google. However, judyrecords cases aren't indexed on Google.
| I understand your general sentiment.
| WindyLakeReturn wrote:
| If our society decides it is necessary to act with the full
| weight of the law behind it, then it would seem better to have
| the information available for the public to verify than not.
| I'm not saying it is all great, but that it is far better to
| have information available so that things like average sentence
| length for a given crime based on demographic and psychographic
| information can be queried by all. If a city that is 50/50
| male/female and 20/80 black/non-black finds their speeding
| tickets are 70/30 male/female and 35/65 black/non-black, then
| it may be worth investigating to see if police are being fair
| about who they give warnings to, who gets reduced tickets, and
| who gets neither.
|
| As for major privacy concerns, it is generally the more major
| crimes that have the larger issue with the victim being known.
| Knowing that someone was the victim of mischief vandalism is
| far less a privacy invasion than knowing they were the victim
| of sexual assault of a child (and even hiding the victim's
| identity often doesn't do more than hide the name from a
| passive search).
|
| Then there are the benefits that other posters have raised,
| such as being useful for knowing past decisions used even in
| minor trials.
| richardbarosky wrote:
| Good points.
|
| If you look at the info page there is a specific example
| about how to look up codes of cases that had the same charge.
|
| Being able to see how other offenders are sentenced is useful
| to make sure people are being treated fairly. Lawyers use
| this kind of data up to the point of producing analytics from
| data like that to understand outcomes.
| Major legal data companies have a large segment of business
| doing analytics for lawyers handling high and lower level cases.
|
| Here are a few related links: https://cluesearch.org/
| https://measuresforjustice.org/
| ghaff wrote:
| The general privacy issue that most jurisdictions have
| decided they just don't care that much about is that easy,
| indexed, free access to public records is different from the
| case where that same information is in a dusty file cabinet
| somewhere. There are a lot of things that people are, in
| principle, OK with being a matter of public record but are
| maybe less OK with their neighbor being able to casually
| discover it through Google.
| distances wrote:
| Totally agree. I'd be all for open court records, requested
| in person, received in paper form against a small
| processing fee.
|
| I do have a different cultural background so it's probably
| natural this feels horrible. Everything about this site
| would be so illegal in my home country it's almost
| hilarious in comparison. I'm used to (and fully approve of)
| a law that you can't keep a list of names in a notebook
| without a proper reason and everyone's consent, that would
| already be an illegal register.
| ccostes wrote:
| So a Christmas card list would be illegal? That
| seems...excessive.
| bidnessmodell wrote:
| Worse, there's no obvious business model or disclosed funding
| source or institutional affiliation here.
|
| That leaves me with the distinct impression that they're
| monetizing data about visitors and searches in some horrible
| way. (Data targeting for mugshot shakedown operations?)
|
| I'm not going near this.
| richardbarosky wrote:
| Maybe ads at some point.
| Finnucane wrote:
| Sometimes even seemingly trivial cases can be caselaw precedent
| that people should be able to see and access without paying
| (they are public records).
| richardbarosky wrote:
| Good point.
| PACER, in fact, has been called out by major news
| publications for literally being a scam in the way they charge
| for access to public records.
|
| https://www.politico.com/magazine/story/2019/03/20/pacer-cou...
| caseyscottmckay wrote:
| This is courtlistener.com data, correct?
| richardbarosky wrote:
| From other comment: CourtListener has about 4 million opinions,
| which are included. On top of that, 435 million additional
| cases from throughout the US.
| FpUser wrote:
| I am not fond of exposing this kind of info. Don't we all have
| enough prying eyes?
| richardbarosky wrote:
| I've mentioned other legal databases on the info page. It's
| public information. judyrecords is the largest free database of
| court cases, but there are many other free/not free ones as
| well.
| FpUser wrote:
| I did not mean this one in particular. Just my opinion about
| the subject in general.
| MeinBlutIstBlau wrote:
| In my state you can get some kind of understanding of what's
| going on, but it's so legalese vague that half the time you
| only know if someone got a speeding ticket, underage, or
| divorced.
| vmception wrote:
| lol so many records that should have been destroyed and not
| indexable!
|
| so do I get a court order for each county, the website, the
| resyndicating source that the website uses or what?
|
| I looked at the reddit page and other people noticed the same
| thing, the author just said send me the link! Hahaha one by one
| removal maybe!
|
| Shut it down, enjoy it while it lasts
| richardbarosky wrote:
| I don't think you understand what you're talking about. There
| are many databases that are made up of public records. Many
| aren't free, some are.
| vmception wrote:
| That may be the reality but if the court or due process
| ordered something expunged from a record it should be updated
| in all records and the details not present.
|
| Should just do a search for expunged or similar terms and
| remove those entries.
| chrisseaton wrote:
| > lol so many records that should have been destroyed and not
| indexable!
|
| You want secret courts?
| vmception wrote:
| they weren't secret and were available for public perusal and
| judgement until the designated time
|
| secret courts have cases that are secret from the beginning
| jschwartzi wrote:
| Do you want things that children do to follow them for the
| rest of their lives?
| matz1 wrote:
| The following itself is not the issue right?
| tshaddox wrote:
| Well, no. Are there names of minors in this database? I
| thought the US had a mechanism to prevent that, or at least
| to petition to have records of minors removed or
| anonymized.
| vmception wrote:
| "The US" has 39,044 distinct local governments and
| municipalities and they all do their procedural nuances
| differently and to varying efficacy and different points
| in time! :D
| lazyasciiart wrote:
| Yes. The mechanisms are shit. Many of these cases are
| juvenile cases with a note saying the case is sealed,
| along with full details of the charge, name, and outcome.
|
| Edit: wow, plus family court stuff like a four year
| custody dispute, kids being adopted, etc
| chrisseaton wrote:
| I don't know what culture you come from, but in the US and
| UK and similarly influenced cultures justice being seen to
| be done and recorded is a pretty important principle and
| mechanism against overreach of the state.
| lazyasciiart wrote:
| There are significant limits to that, such as juvenile
| courts.
| dadrock wrote:
| I agree. But it's worth noting that the UK has recently
| enacted the Right to be Forgotten Law, which plays into
| this discussion.
| ghaff wrote:
| Of course, once data is replicated and distributed
| around, it's very hard to put the genie back into the
| bottle.
| ikeboy wrote:
| I don't see a breakdown by source. What does this have that
| courtlistener doesn't, for example?
| richardbarosky wrote:
| CourtListener has about 4 million opinions, which are included.
| On top of that, 435 million additional cases from throughout
| the US.
| ikeboy wrote:
| Where are they getting public domain opinions that CL doesn't
| have? Are these states or counties that CL doesn't scrape? It
| would be nice to have a breakdown by jurisdiction.
|
| Also, by "case" do you mean "opinions"?
|
| Full disclosure, I've written and contributed to several
| scrapers for CL, and if there's a large source they're
| missing I'd like to know.
|
| Note that the CL opinion number you're quoting doesn't
| include orders from Federal courts that are in the RECAP
| collection, which accounts for several million additional
| opinions.
| kordlessagain wrote:
| It's throwing a 500 for some regex I fed it.
| richardbarosky wrote:
| I recently added advanced query support. Looks like I need to
| clean up some validation. Thanks.
| pashabitz wrote:
| I wasn't trying to be an asshole, just honestly searched for
| "javascript". Was disappointed :)
| richardbarosky wrote:
| Looking at the results, those all appear to be from
| CourtListener's bulk data.
| bpeebles wrote:
| I'm not sure why I didn't expect them to be in this database, but
| this also has like traffic tickets and similar.
| LunaSea wrote:
| Hmmm: https://www.judyrecords.com/record/1ikmhvbrhfa3a
| csunbird wrote:
| 333 N Warcraft Lane Undercity, Washington 99999
|
| Looks like a place I would like to live in.
| richardbarosky wrote:
| Ah, good one. There are many nuggets in there. "holy shit",
| fart, etc.
| LunaSea wrote:
| Any idea how those came into the system?
|
| The one I quoted seems to be some kind of test case?
| richardbarosky wrote:
| Most likely, just like the asdf occurrences.
| ergwwrt wrote:
| Didn't Aaron Swartz try to do this but couldn't because it costs
| $0.10 per page?
| richardbarosky wrote:
| From what I understand, he had some kind of academic library
| access for PACER and used that to bypass what others would be
| charged for. There are lawsuits against PACER charging fees for
| what's public information generated by taxpayer money. He ended
| up being charged with various crimes related to maybe computer
| fraud and eventually committed suicide. A very sad story.
___________________________________________________________________
(page generated 2020-11-19 23:00 UTC)