[HN Gopher] Full text search on 400M US court cases
       ___________________________________________________________________
        
       Full text search on 400M US court cases
        
       Author : richardbarosky
       Score  : 256 points
       Date   : 2020-11-19 15:52 UTC (7 hours ago)
        
 (HTM) web link (www.judyrecords.com)
 (TXT) w3m dump (www.judyrecords.com)
        
       | jaequery wrote:
       | What a clean interface. We need more websites to look like this.
        
         | richardbarosky wrote:
         | Thank you
        
       | bflesch wrote:
       | The search is very quick. Does anybody know what their tech stack
       | looks like?
        
         | richardbarosky wrote:
         | https://www.reddit.com/r/programming/comments/jg4rkv/how_a_s...
        
         | kordlessagain wrote:
         | From Reddit thread:
         | 
         | > MySQL 8 is used for DB. The search server uses elasticsearch
         | 7.8.
        
           | jillesvangurp wrote:
           | Sounds like that would be an easy use case for elasticsearch
           | indeed. I've seen it handle much bigger data sets. Solr would
           | work as well. There are probably a few other options on the
           | market but elasticsearch would probably do pretty well on
           | this even without a lot of tuning.
           | 
           | For reference, I once threw the entirety of OpenStreetMap
           | at it before it even hit 1.0 to implement a simple reverse
           | geocoding thing. Basically a couple hundred million street
           | segments, some polygons, etc. At the time the geospatial
           | support wasn't great and very new and very CPU intensive. I
           | got away with indexing all of that and running it on a single
           | node cluster with a xeon and 32G of RAM and spinning disk
           | (RAID 1, no SSD). It worked great. Very responsive. Indexing
           | only took about 50 minutes or so. Most of that was my parsing
           | logic. That's not comparable of course, I'd expect this to be
           | faster on the same hardware with a current version of
           | Elasticsearch. They've made a lot of leaps with improving
           | performance, memory usage, cpu usage, disk usage, robustness,
           | etc. in the 7 major versions since then.
        
       | godmode2019 wrote:
       | I feel this is bad news. Some things should be forgotten. In my
       | country your record gets soft wiped after 8 years. With this the
       | employer could just look up your name.
        
         | richardbarosky wrote:
         | The general idea has various problems. For example, would
         | newspapers or accounts of things/people, in various media, of
         | objectively public information be required to be retroactively
         | removed from any mention? Does it make sense to force and
         | dictate what entities/individuals can do with basic information
         | at the discretion of anyone who doesn't like it? Just a few
         | thoughts. The records exist in the database because they are
         | public information. If a record is removed from public view,
         | that's done when requested because it's the right thing to do,
         | although there is no strict legal obligation to do so.
        
           | colejohnson66 wrote:
           | How does Europe's "right to be forgotten" handle it?
        
             | richardbarosky wrote:
             | Not sure. Maybe if someone doesn't like the Google search
             | results, they make a complaint, and Google has to do what
             | they want.
             | 
             | There are many other public records databases that have
             | similar data, including the federal judiciary and many
             | state courts across the country.
             | 
             | Some are listed on the info page:
             | https://www.judyrecords.com/info
        
       | r3trohack3r wrote:
       | Interested in how large this dataset is?
       | 
       | Is it in a format that could be backed up by a community to
       | protect? Seems like something folks in /r/datahoarder would be
       | interested in backing up.
        
         | richardbarosky wrote:
         | 15KB is maybe the average case size, including HTTP request
         | data.
         | 
         | That's 1024 * 15 * 439,000,000 = 6.7TB roughly.
         | 
         | The cases are all compressed, so I'm not using 6.7TB non-
         | compressed for cases. But there are other request and non-
         | request related records needed too. Just my backups currently.
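A quick check of the arithmetic above, as a minimal sketch. The 15KB
average and the 439,000,000 case count are the rough figures from the
comment; the function name is made up for illustration. The result is the
uncompressed estimate only, not the actual disk usage of the site.

```python
# Figures taken from the comment above; both are rough estimates.
AVG_CASE_KB = 15            # average case size, including HTTP request data
CASE_COUNT = 439_000_000    # approximate number of cases


def uncompressed_bytes(avg_kb=AVG_CASE_KB, cases=CASE_COUNT):
    """Total uncompressed size in bytes, treating 1 KB as 1024 bytes."""
    return 1024 * avg_kb * cases


total = uncompressed_bytes()
print(f"{total / 1e12:.2f} TB (decimal, 10**12 bytes)")
print(f"{total / 2**40:.2f} TiB (binary, 2**40 bytes)")
```

The "6.7TB roughly" figure matches the decimal reading; in binary units
the same byte count comes out a bit lower.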
        
       | nerdponx wrote:
       | One Reddit user estimated the monthly cost of this site at over
       | $2000 USD. How are you funding that?
       | 
       | https://www.reddit.com/r/programming/comments/jg4rkv/comment...
        
         | richardbarosky wrote:
         | I've downgraded from that. I talked about that in that post. It
         | was most definitely a knee-jerk reaction to getting slashdotted
         | on a popular subreddit and not wanting that to happen again.
         | However, still on some very good hardware and handling current
         | workload pretty well right now. That estimate was high.
        
           | ethbr0 wrote:
           | Bullet points on what you downgraded to cut costs? Curious
           | technical minds want to know.
        
             | richardbarosky wrote:
             | Sure, I'll post after the dust settles. Server getting
             | smashed but still handling searches pretty dang well.
             | 
             | Some sites crash from the page views, and here I have to
             | handle everyone searching 400 million documents too.
        
               | vlmutolo wrote:
               | Odds are this won't help you, but just in case you
               | haven't seen it.
               | 
               | https://blog.burntsushi.net/transducers/
        
       | [deleted]
        
       | whoaWtf wrote:
       | Wow, the OP's reddit history in /r/PurplePillDebate is fucking
       | cringe.
       | 
       | https://www.reddit.com/user/aoeusnth48/
        
         | josefresco wrote:
         | Wow, I _really_ wish I hadn't jumped down that rabbit hole.
        
       | tomorrowfuture wrote:
       | trellis.law does something similar
       | 
       | their searches are indexed and have rulings and documents as
       | well.
       | 
       | does this differ from that service?
        
         | [deleted]
        
       | nkw wrote:
       | You might consider giving credit to the sources of data used to
       | make this.
        
         | wtvanhest wrote:
         | It is all public records. The source of the original data is
         | the court system. If a 3rd party physically scraped it from
         | the court system, others should be able to digitally scrape it.
        
         | richardbarosky wrote:
         | All the data is from government databases directly, aside from
         | CourtListener, which was recently integrated. It would be good
         | to specifically mention CourtListener's contribution.
        
           | aschatten wrote:
           | How did you get all that data from government databases
           | directly? Do they provide some sort of API for bulk export?
        
       | visarga wrote:
       | This dataset would be yummy for GPT3
        
         | richardbarosky wrote:
         | Interesting, I'll check it out. Thanks for the link.
        
       | ElijahLynn wrote:
       | Interesting (obviously these aren't all the same person):
       | 
       | Page 1 of 1,763 total cases for: donald j. trump
       | 
       | Page 1 of 2,299 total cases for: donald trump
        
         | txmachinery wrote:
         | It's 80 cases when searching: "donald j trump"~4
         | 
         | This is a proximity search, to ensure it's actually turning up
         | one of the various permutations of the name (as different court
         | protocols may refer by surname first), rather than documents
         | that just happen to contain each of the terms somewhere.
         | 
         | For fairness, "hillary rodham clinton"~4 turns up 193 cases.
         | 
         | Relevant doc: https://www.judyrecords.com/info (down the page,
         | under "proximity search")
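For anyone curious what a "phrase"~N search looks like on the
Elasticsearch side, here is a minimal sketch. It only builds request
bodies; the field name "text" and the helper names are assumptions, and
nothing in the thread confirms how judyrecords actually passes queries
through to the search server.

```python
import json


def proximity_query(phrase, slop, field="text"):
    """Build a query_string body using Lucene's "phrase"~N syntax.

    `field` is a hypothetical field name, not the site's real mapping.
    """
    return {
        "query": {
            "query_string": {
                "query": f'"{phrase}"~{slop}',
                "default_field": field,
            }
        }
    }


def match_phrase_query(phrase, slop, field="text"):
    """Roughly equivalent body using match_phrase with an explicit slop."""
    return {"query": {"match_phrase": {field: {"query": phrase, "slop": slop}}}}


print(json.dumps(proximity_query("donald j trump", 4), indent=2))
print(json.dumps(match_phrase_query("hillary rodham clinton", 4), indent=2))
```

The slop value is the number of position moves allowed between the
terms, which is why reordered names (surname first) can still match.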
        
         | [deleted]
        
       | m3kw9 wrote:
       | What stack did you use for this?
        
         | richardbarosky wrote:
         | See other comment link to reddit.
        
       | _gtly wrote:
       | https://www.courtlistener.com has more useful features and is
       | part of the Free Law Project.
        
         | richardbarosky wrote:
         | I've noted CourtListener on the info page:
         | https://www.judyrecords.com/info
         | 
         | "PACER notwithstanding, CourtListener is the most powerful case
         | law research tool available online -- and in many ways is much
         | more powerful."
         | 
         | This is based on CourtListener's 4 million+ written court
         | opinions, which judyrecords has recently integrated. But you're
         | right, CourtListener has more case law research features.
        
       | tbrock wrote:
       | This is awesome. Where did the data come from?
        
         | richardbarosky wrote:
         | Thanks. All the data is collected from various government
         | databases.
        
       | mrits wrote:
       | It's pretty hilarious and somewhat frightening I found my dad's
       | arrest 25 years ago for a speeding ticket he had "forgotten" to
       | pay. I remember being 11 years old and having to wait 8 hours for
       | my parents to come back from picking up a pizza. Data
       | availability is crazy.
        
         | hbt wrote:
         | the frightening part is although your father's record is
         | available to the public, police officers who are caught lying
         | while testifying get to seal the record.
         | 
         | in many jurisdictions, sealing a record is the equivalent of
         | destroying it.
         | 
         | your crimes will haunt you forever because the system never
         | forgets, meanwhile they simply go back to business like it
         | never happened
         | 
         | ref
         | https://www.google.com/amp/s/www.nytimes.com/2018/03/18/nyre...
        
         | handol wrote:
         | Found an assault charge on my Mom from '92.
        
           | richardbarosky wrote:
           | Do you think the data should be removed from the government
           | portals? Those are interesting points. What do you think is
           | the right balance to strike?
           | 
           | I can see why it might be surprising to find some results
           | when searching. The same data has already been available in
           | many other databases that have existed long before this one
           | and in those described on the info page as well.
        
             | handol wrote:
             | It's on the internet forever now. If there's a balance to
             | strike it would have had to have been done in 2007 when the
             | court digitized their records and put them online.
             | 
             | A search for "minor consuming" reveals a few hundred
             | thousand cases against children. I'm a little surprised to
             | see that.
        
       | nautical wrote:
       | Results like "MEETING ID" and "PASSWORD" for zoom meetings show
       | up way more than any other video conferencing tool for 2020
       | cases.
        
         | nautical wrote:
         | Many Zoom meetings are recurring and this might not be safe
        
           | programbreeding wrote:
           | Looked at one record as an example and sure enough, the same
           | meeting ID and password is found in 709 different cases in
           | Cleveland, OH.
        
       | vmception wrote:
       | Fascinating! Was surprised to see random infractions
       | 
       | Does this have the lower trial court records too?
        
         | richardbarosky wrote:
         | Yes, it has records from different trial courts.
        
           | vmception wrote:
           | Is there a list of jurisdictions and courts that it has?
        
       | kontxt wrote:
       | Super cool--and very fast! Anyone looking to collaborate on these
       | can easily add Kontxt (https://www.kontxt.io) right on to them
       | and have localized discussions directly on page-parts.
        
         | richardbarosky wrote:
         | Thanks. I saw your post on reddit a while back. Was going to
         | ask about your tech stack.
        
       | justinzollars wrote:
       | This is great. It's like Google before it became evil.
        
         | richardbarosky wrote:
         | Thank you
        
       | chris_f wrote:
       | Congratulations on the launch. I have worked in open source and
       | public record research for the last 15 years, and your coverage
       | is extremely impressive.
       | 
       | Do you have any long term plan for the site? I can see this going
       | in a lot of different directions depending on your goals.
        
         | richardbarosky wrote:
         | Thanks, as far as I know it's the largest database of court
         | cases on the Internet. If there's enough traffic I'll support
         | the site with ads. Don't have any other specific plans
         | currently.
        
       | HoverSausage wrote:
       | I just managed to find the home address of a YouTuber I'm a fan
       | of in 15 seconds. Creepy site. Glad I'm not in the US
        
         | richardbarosky wrote:
         | Interesting point. If you know the state where someone lives,
         | you can look up the same info on the government website.
         | Additionally, many, many other databases have the same public data
         | but they ask for a payment to search.
        
       | draw_down wrote:
       | Wow, this is really good. I found a court case from my childhood
       | that I've always half wanted to, and half wanted not to, read
       | through (painful, but potentially enlightening). Now I don't have
       | to go to the courthouse to read it. Choices, choices.
       | 
       | (False alarm, it's just the case record and not the transcript)
        
       | 120bits wrote:
       | From the reddit link[1]
       | 
       | Sorry, I'm just curious.
       | 
       | It says MySQL 8 and Elasticsearch 7.8. I don't have much
       | experience with Elasticsearch, so I wanted to know how
       | Elasticsearch makes it faster. Is it like an extension that makes
       | it faster? Or does Elasticsearch have its own data store that
       | consumes data from the database and magically makes it faster?
       | 
       | Thanks.
       | 
       | [1]https://www.reddit.com/r/programming/comments/jg4rkv/how_a_s...
        
         | richardbarosky wrote:
         | Elasticsearch is a search platform. A "database" but meant for
         | search stuff. It's not part of MySQL.
        
         | rpedela wrote:
         | Elasticsearch, Lucene under the hood, implements an inverted
         | index which is an extremely fast data structure for text
         | search. ES has clustering as a primary feature too and many
         | search features that can significantly improve relevance that
         | you won't find in MySQL and most other databases.
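To make the inverted index idea concrete, here is a toy sketch. It is
not how Lucene is implemented (no scoring, no compression, no positions);
the sample documents and function names are invented for illustration.
The point is only that each term maps to the set of documents containing
it, so a search touches the postings for the query terms instead of
scanning every document.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Invented sample data, standing in for case records.
docs = {
    1: "State v. Smith speeding ticket",
    2: "Smith v. Jones custody dispute",
    3: "Speeding ticket dismissed",
}
index = build_index(docs)
print(search(index, "speeding ticket"))
print(search(index, "smith ticket"))
```

A relational database without a full-text index would typically answer
the same question with a LIKE scan over every row, which is where most
of the speed difference comes from.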
        
           | devy wrote:
           | Have you tried Toshi[1] or MeiliSearch[2]? I wonder how they
           | would compare in terms of operational costs (monthly cloud
           | hosting bill) at the current data size.
           | 
           | [1]: https://news.ycombinator.com/item?id=18895655
           | 
           | [2]: https://news.ycombinator.com/item?id=22685831
        
             | jabo wrote:
             | Do you have the structured dataset somewhere? I'd love to
             | index it in Typesense [1] and see how it does.
             | 
             | I recently tried a 32M songs dataset [2] and it works
             | great, so I'm on the lookout for larger datasets to
             | benchmark with.
             | 
             | [1] https://news.ycombinator.com/item?id=22181437
             | 
             | [2] https://songs-search.typesense.org/
        
           | arminiusreturns wrote:
           | How would you say it stands up to splunk these days?
        
           | lolive wrote:
           | Plus it does not accept joins. So you basically have to
           | denormalize all your data before injecting it into Elastic.
           | That helps speed things up, but it is a headache to manage on
           | a day-to-day basis.
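A minimal sketch of what that denormalization step can look like. The
two "tables" and their column names are hypothetical, not the site's
schema. Rows that would be joined at query time in MySQL are folded into
one self-contained document per case before indexing.

```python
from collections import defaultdict

# Hypothetical relational rows: one table of cases, one of parties.
cases = [
    {"case_id": 1, "court": "Example County Court", "title": "State v. Doe"},
    {"case_id": 2, "court": "Example County Court", "title": "Roe v. Poe"},
]
parties = [
    {"case_id": 1, "name": "John Doe", "role": "defendant"},
    {"case_id": 2, "name": "Jane Roe", "role": "plaintiff"},
    {"case_id": 2, "name": "Pat Poe", "role": "defendant"},
]


def denormalize(cases, parties):
    """Embed each case's parties in the case document (no join needed later)."""
    by_case = defaultdict(list)
    for party in parties:
        by_case[party["case_id"]].append(
            {"name": party["name"], "role": party["role"]}
        )
    return [{**case, "parties": by_case.get(case["case_id"], [])} for case in cases]


documents = denormalize(cases, parties)
for doc in documents:
    print(doc)
```

The day-to-day headache mentioned above shows up here: if a party's name
changes, every document embedding that party has to be rebuilt and
reindexed.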
        
         | nerdponx wrote:
         | https://news.ycombinator.com/item?id=25152925
        
       | pachico wrote:
       | Jeez, 36 pages with resume for "Napster"!
        
       | onetimemanytime wrote:
       | I understand the open court argument, we need to see what goes on
       | so nothing funny happens there. But unless we're talking about a
       | major crime, what good does it do to list and index on Google
       | everything from 30 years ago?
       | 
       | I am no fan of this at all.
        
         | richardbarosky wrote:
         | Only 3 pages are indexed on Google. Actually, most of the other
         | legal databases (listed on info page) have their cases indexed
         | on Google. However, judyrecords cases aren't indexed on Google.
         | I understand your general sentiment.
        
         | WindyLakeReturn wrote:
         | If our society decides it is necessary to act with the full
         | weight of the law behind it, then it would seem better to have
         | the information available for the public to verify than not.
         | I'm not saying it is all great, but that it is far better to
         | have information available so that things like average sentence
         | length for a given crime based on demographic and psychographic
         | information can be queried by all. If a city that is 50/50
         | male/female and 20/80 black/non-black finds their speeding
         | tickets are 70/30 male/female and 35/65 black/non-black, then
         | it may be worth investigating to see if police are being fair in
         | who they give warnings to, who gets reduced tickets, and who
         | gets neither.
         | 
         | As for major privacy concerns, it is generally the more major
         | crimes that have the larger issue with the victim being known.
         | Knowing that someone was the victim of mischief vandalism is
         | far less a privacy invasion than knowing they were the victim
         | of sexual assault of a child (and even hiding the victim's
         | identity often doesn't do more than hide the name from a
         | passive search).
         | 
         | Then there are the benefits that other posters have raised,
         | such as being useful for knowing past decisions used even in
         | minor trials.
        
           | richardbarosky wrote:
           | Good points.
           | 
           | If you look at the info page there is a specific example
           | about how to look up codes of cases that had the same charge.
           | 
           | Being able to see how other offenders are sentenced is useful
           | to make sure people are being treated fairly. Lawyers use
           | this kind of data up to the point of producing analytics from
           | data like that to understand outcomes. Major legal data
           | companies have a large segment of business doing analytics
           | for lawyers handling high and lower level cases.
           | 
           | Here are a few related links: https://cluesearch.org/
           | https://measuresforjustice.org/
        
           | ghaff wrote:
           | The general privacy issue that most jurisdictions have
           | decided they just don't care that much about is that easy,
           | indexed, free access to public records is different from the
           | case where that same information is in a dusty file cabinet
           | somewhere. There are a lot of things that people are, in
           | principle, OK with being a matter of public record but are
           | maybe less OK with their neighbor being able to casually
           | discover it through Google.
        
             | distances wrote:
             | Totally agree. I'd be all for open court records, requested
             | in person, received in paper form against a small
             | processing fee.
             | 
             | I do have a different cultural background so it's probably
             | natural this feels horrible. Everything about this site
             | would be so illegal in my home country it's almost
             | hilarious in comparison. I'm used to (and fully approve of)
             | a law that you can't keep a list of names in a notebook
             | without a proper reason and everyone's consent, that would
             | already be an illegal register.
        
               | ccostes wrote:
               | So a Christmas card list would be illegal? That
               | seems...excessive.
        
         | bidnessmodell wrote:
         | Worse, there's no obvious business model or disclosed funding
         | source or institutional affiliation here.
         | 
         | That leaves me with the distinct impression that they're
         | monetizing data about visitors and searches in some horrible
         | way. (Data targeting for mugshot shakedown operations?)
         | 
         | I'm not going near this.
        
           | richardbarosky wrote:
           | Maybe ads at some point.
        
         | Finnucane wrote:
         | Sometimes even seemingly trivial cases can be caselaw precedent
         | that people should be able to see and access without paying
         | (they are public records).
        
           | richardbarosky wrote:
           | Good point. PACER, in fact, has been called out by major news
           | publications for literally being a scam in the way they charge
           | for access to public records.
           | 
           | https://www.politico.com/magazine/story/2019/03/20/pacer-cou...
        
       | caseyscottmckay wrote:
       | This is courtlistener.com data correct?
        
         | richardbarosky wrote:
         | From other comment: CourtListener has about 4 million opinions,
         | which are included. On top of that, 435 million additional
         | cases from throughout the US.
        
       | FpUser wrote:
       | I am not fond of exposing this kind of info. Don't we all have
       | enough prying eyes?
        
         | richardbarosky wrote:
         | I've mentioned other legal databases on the info page. It's
         | public information. judyrecords is the largest free database of
         | court cases, but there are many other free/not free ones as
         | well.
        
           | FpUser wrote:
           | I did not mean this one in particular. Just my opinion about
           | the subject in general.
        
             | MeinBlutIstBlau wrote:
             | In my state you can get some kind of understanding of what's
             | going on, but it's so legalese vague that half the time you
             | only know if someone got a speeding ticket, underage, or
             | divorced.
        
       | vmception wrote:
       | lol so many records that should have been destroyed and not
       | indexable!
       | 
       | so do I get a court order for each county, the website, the
       | resyndicating source that the website uses or what?
       | 
       | I looked at the reddit page and other people noticed the same
       | thing, the author just said send me the link! Hahaha one by one
       | removal maybe!
       | 
       | Shut it down, enjoy it while it lasts
        
         | richardbarosky wrote:
         | I don't think you understand what you're talking about. There
         | are many databases that are made up of public records. Many
         | aren't free, some are.
        
           | vmception wrote:
           | That may be the reality but if the court or due process
           | ordered something expunged from a record it should be updated
           | in all records and the details not present.
           | 
           | Should just do a search for expunged or similar terms and
           | remove those entries.
        
         | chrisseaton wrote:
         | > lol so many records that should have been destroyed and not
         | indexable!
         | 
         | You want secret courts?
        
           | vmception wrote:
           | they weren't secret and were available for public perusal and
           | judgement until the designated time
           | 
           | secret courts have cases that are secret from the beginning
        
           | jschwartzi wrote:
           | Do you want things that children do to follow them for the
           | rest of their lives?
        
             | matz1 wrote:
             | The following itself is not the issue right?
        
             | tshaddox wrote:
             | Well, no. Are there names of minors in this database? I
             | thought the US had a mechanism to prevent that, or at least
             | to petition to have records of minors removed or
             | anonymized.
        
               | vmception wrote:
               | "The US" has 39,044 distinct local governments and
               | municipalities and they all do their procedural nuances
               | differently and to varying efficacy and different points
               | in time! :D
        
               | lazyasciiart wrote:
               | Yes. The mechanisms are shit. Many of these cases are
               | juvenile cases with a note saying the case is sealed,
               | along with full details of the charge, name, and outcome.
               | 
               | Edit: wow, plus family court stuff like a four year
               | custody dispute, kids being adopted, etc
        
             | chrisseaton wrote:
             | I don't know what culture you come from, but in the US and
             | UK and similarly influenced cultures justice being seen to
             | be done and recorded is a pretty important principle and
             | mechanism against overreach of the state.
        
               | lazyasciiart wrote:
               | There are significant limits to that, such as juvenile
               | courts.
        
               | dadrock wrote:
               | I agree. But it's worth noting that the UK has recently
               | enacted the Right to be Forgotten Law, which plays into
               | this discussion.
        
               | ghaff wrote:
               | Of course, once data is replicated and distributed
               | around, it's very hard to put the genie back into the
               | bottle.
        
       | ikeboy wrote:
       | I don't see a breakdown by source. What does this have that
       | courtlistener doesn't, for example?
        
         | richardbarosky wrote:
         | CourtListener has about 4 million opinions, which are included.
         | On top of that, 435 million additional cases from throughout
         | the US.
        
           | ikeboy wrote:
           | Where are they getting public domain opinions that CL doesn't
           | have? Are these states or counties that CL doesn't scrape? It
           | would be nice to have a breakdown by jurisdiction.
           | 
           | Also, by "case" do you mean "opinions"?
           | 
           | Full disclosure, I've written and contributed to several
           | scrapers for CL, and if there's a large source they're
           | missing I'd like to know.
           | 
           | Note that the CL opinion number you're quoting doesn't
           | include orders from Federal courts that are in the RECAP
           | collection, which accounts for several million additional
           | opinions.
        
       | kordlessagain wrote:
       | It's throwing a 500 for some regex I fed it.
        
         | richardbarosky wrote:
         | I recently added advanced query support. Looks like I need to
         | clean up some validation. Thanks.
        
       | pashabitz wrote:
       | I wasn't trying to be an asshole, just honestly searched for
       | "javascript". Was disappointed :)
        
         | richardbarosky wrote:
         | Looking at the results, those all appear to be from
         | CourtListener's bulk data.
        
       | bpeebles wrote:
       | I'm not sure why I didn't expect them to be in this database, but
       | this also has like traffic tickets and similar.
        
       | LunaSea wrote:
       | Hmmm: https://www.judyrecords.com/record/1ikmhvbrhfa3a
        
         | csunbird wrote:
         | 333 N Warcraft Lane Undercity, Washington 99999
         | 
         | Looks like a place I would like to live in.
        
           | richardbarosky wrote:
           | Ah, good one. There are many nuggets in there. "holy shit",
           | fart, etc.
        
             | LunaSea wrote:
             | Any idea how those came into the system?
             | 
             | The one I quoted seems to be some kind of test case?
        
               | richardbarosky wrote:
               | Most likely, just like the asdf occurrences.
        
       | ergwwrt wrote:
       | Didn't Aaron Swartz try to do this but couldn't because it costs
       | $0.10 per page?
        
         | richardbarosky wrote:
         | From what I understand, he had some kind of academic library
         | access for PACER and used that to bypass what others would be
         | charged for. There are lawsuits against PACER charging fees for
         | what's public information generated by taxpayer money. He ended
         | up being charged with various crimes related to maybe computer
         | fraud and eventually committed suicide. A very sad story.
        
       ___________________________________________________________________
       (page generated 2020-11-19 23:00 UTC)