[HN Gopher] ArXiv now offers papers in HTML format
       ___________________________________________________________________
        
       ArXiv now offers papers in HTML format
        
       Author : programd
       Score  : 454 points
       Date   : 2023-12-21 18:34 UTC (4 hours ago)
        
 (HTM) web link (blog.arxiv.org)
 (TXT) w3m dump (blog.arxiv.org)
        
       | shrimpx wrote:
       | Since the article doesn't link to any example HTML article,
       | here's a random link:
       | 
       | https://browse.arxiv.org/html/2312.12451v1
       | 
       | It's cool that it has a dark mode. Didn't see a toggle but
       | renders in the system mode.
       | 
       | Overall will make arXiv a lot more accessible on mobile.
        
         | burkaman wrote:
         | And here's the PDF of the same paper for comparison:
         | https://arxiv.org/pdf/2312.12451.pdf
        
           | FredPret wrote:
           | The contrast is massive. I'm much more likely to read the
           | html version; that PDF is deeply off-putting in some hard to
           | define way. Maybe it's the two columns, or the font, or the
           | fact that the format doesn't adjust to fit different screen
           | sizes.
        
             | ForkMeOnTinder wrote:
             | Definitely the two columns for me. It's super annoying
             | skimming a paper and having to scroll down and back up
             | again in a zig-zag pattern.
        
               | mmis1000 wrote:
               | I think the consuming device matters. An iPad or computer
               | has a much wider screen. A one-column layout is too wide
               | there for the average person to scan text lines
               | quickly.
               | 
               | On a phone, though, it looks perfectly fine. A two-column
               | layout looks terrible on a smartphone; the text is too
               | tiny to read comfortably.
               | 
               | It would probably be even better if you could flip it left
               | and right like an ebook instead of scrolling, to locate
               | the content faster. But the current design is good enough
               | IMO. (Compare to reading a PDF on a cellphone.)
        
               | kjkjadksj wrote:
               | Just zoom the smartphone into one column. Problem solved.
        
               | mmis1000 wrote:
               | And then you will have to scroll both top to bottom and
               | left to right, an even worse experience.
        
             | tobias2014 wrote:
             | This is very interesting, because for me it's just the
             | opposite. In particular the two column layout is just more
             | readable and approachable for me. The PDF version also
             | allows for a presentation just as the authors intended. I
             | guess it's good that they offer both now.
        
               | kjkjadksj wrote:
               | The authors don't format the pdf, the editor does.
               | Authors probably sent a double spaced word document with
               | figures and tables on another file.
        
               | tonyg wrote:
               | In computer science, the usual case is that the author
               | fully formats the paper.
        
               | z2h-a6n wrote:
               | Not on arXiv (unless I'm much mistaken), which is a
               | preprint server, not a conventional journal.
               | 
               | arXiv accepts various flavors of TeX, or PDFs not
               | produced by TeX [0], and automatically produces PDFs and
               | HTML where possible (e.g. if TeX is submitted). In the
               | case of the example paper under discussion, the authors
               | submitted TeX with PDF figures [1], and the PDF version
               | of the paper was produced by arXiv. The formatting was
               | mainly set by using REVTeX, which is a set of macros for
               | LaTeX intended for American Physical Society journals.
               | 
               | [0]
               | https://info.arxiv.org/help/submit/index.html#formats-for-te...
               | [1] https://arxiv.org/format/2312.12451
        
               | smartmic wrote:
               | FWIW, I recently learned that it is also possible to
               | produce nice PDF papers with GNU roff (groff), have a
               | look at this example:
               | https://github.com/SudarsonNantha/LinuxConfigs/blob/master/....
        
               | frocmlol wrote:
               | You are very confidently wrong.
               | 
               | In the arxiv you use latex and do everything yourself.
               | There is no editor.
        
               | cozzyd wrote:
               | You typically send a .tar.gz of tex files (and figures,
               | .bbl, etc.) to the journal. And then you typically upload
               | something very similar to the arxiv (I have an arxivify
               | Makefile target for my papers that handles some arxiv
               | idiosyncrasies like requiring all figures to be in the
               | same folder as the .tex file, and it also clears all the
               | comments; sometimes you can find amusing things in source
               | file comments for some papers).
               | 
               | Some fields may use Word files, but in most of physics
               | you would get laughed at...
               | 
               | It is true that most journals will typically reformat
               | your .tex in a different way than is displayed on the
               | arXiv.
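
The comment-clearing step mentioned above could be sketched roughly as follows. This is not the commenter's Makefile target; `strip_tex_comments` is a hypothetical helper. It keeps the `%` itself (so TeX's end-of-line handling is unchanged) and drops only the text after it. It assumes `\%` is the only escaped form that matters and does not handle `\\%` or verbatim environments.

```python
import re

# An unescaped "%" starts a TeX comment; "\%" is a literal percent sign.
# Assumption: edge cases such as "\\%" and verbatim blocks are ignored here.
_COMMENT = re.compile(r"(?<!\\)%.*")


def strip_tex_comments(source):
    """Remove comment text from each line, keeping the '%' marker itself."""
    return "\n".join(_COMMENT.sub("%", line) for line in source.split("\n"))


if __name__ == "__main__":
    sample = "Result holds. % TODO: does it really?\nAccuracy is 50\\% here."
    print(strip_tex_comments(sample))
```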
        
               | eigenket wrote:
               | You are completely wrong. ArXiv doesn't work like that.
        
               | JumpCrisscross wrote:
               | Do you work extensively with LaTeX?
               | 
               | Two columns is good, albeit annoying on mobile. But the
               | font. The typeface kills me, and almost every LaTeX-
               | generated document sports it.
        
               | saurik wrote:
               | Hilariously, I would probably tolerate the HTML version a
               | lot better if it had the font from the PDF (and FWIW, the
               | answer for me is "no: I don't work with LaTeX at all... I
               | just read a lot of papers").
        
               | cozzyd wrote:
               | Hating on Computer Modern (ok, probably now Latin Modern)
               | is something close to blasphemy.
        
             | kjkjadksj wrote:
             | If you read a lot of papers in your line of work you will
             | quickly appreciate the two columns and justification.
        
               | FredPret wrote:
               | Admittedly, I don't read research papers. But with HTML,
               | surely the choice between one or two columns is a
               | checkbox away.
        
               | IlliOnato wrote:
               | Which checkbox?
               | 
               | I cannot find anything relevant in any of the 3 browsers
               | I use (Vivaldi, Firefox, Chrome). Would really
               | appreciate this option.
               | 
               | A quick search gave some apparently unmaintained browser
               | extensions, and that's it.
        
               | FredPret wrote:
               | No, I'm saying there _should_ be a checkbox. That way,
               | you can switch between two columns formatted like LaTeX
               | and that font they always use, and one column with
               | Helvetica / Arial.
        
               | jabroni_salad wrote:
               | Only problem is jagoffs like me who need the text to be
               | bigger. On PDFs you now get to experience a horizontal
               | scrollbar. HTML has text reflow and I can set the line
               | length by resizing the window. I'm willing to make a lot
               | of sacrifices for that experience.
        
             | z2h-a6n wrote:
             | For what it's worth, two column layouts are very common in
             | the physical sciences, or at least in physics which I'm
             | more familiar with. I have a feeling that the reason is at
             | least partly to save page space when using displayed math
             | (e.g. equations that are formatted in a break between
             | blocks of text), which use the full text width (i.e. the
             | width of one column) to display what may be much less than
             | half a page wide.
        
               | FredPret wrote:
               | It makes sense - for paper. But pixels are infinite -
               | HTML is far better for screen display, which is how
               | people read things nowadays.
               | 
               | The extra column next to the one I'm reading introduces a
               | lot of visual noise, and the content is hard enough as it
               | is. I'm sure physicists have all gotten used to it, but
               | it certainly trips me up.
        
               | nyssos wrote:
               | > The extra column next to the one I'm reading introduces
               | a lot of visual noise
               | 
               | Papers are generally not read start to finish in one go:
               | there's lots of rereading and jumping back and forth
               | between key parts, and anything that moves them further
               | apart makes this harder.
        
               | FredPret wrote:
               | Ah, that makes more sense. I imagined scientists just
               | reading the whole thing start-to-finish.
               | 
               | I still think a flexible layout is best. If you like
               | multi-columns and have a wide screen, why not display 12
               | columns next to each other?
               | 
               | With PDF this is not possible. With HTML the content can
               | in principle be sliced and diced how you like it.
        
         | shusaku wrote:
         | Seems like the references aren't working very well.
         | 
         | I really want journals to have two way links in a paper. I get
         | google scholar alerts about certain papers being cited, and I
         | want to skip to "why did they cite this? Did they use it,
         | improve it, or just mention it?"
        
           | r3trohack3r wrote:
           | I'd never considered setting up citation alerts like this.
           | 
           | Thank you for the idea!
        
           | shrimpx wrote:
           | Looks like clicking a reference adds the hash to the URL but
           | doesn't scroll to the reference. If you load the hash URL
           | directly in the browser you get a 404 page...
        
             | burkaman wrote:
             | https://browse.arxiv.org/html/2312.12451v1#bib.bib1 works,
             | but https://browse.arxiv.org/html/2312.12451v1/#bib.bib1
             | doesn't.
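
One way to see why the two links above behave differently: the fragment (`#bib.bib1`) is handled by the browser and never sent to the server, so the only difference the server sees is the path, with or without a trailing slash. The sketch below just splits both URLs from the comment. That the server treats the two paths as different resources is an assumption consistent with the reported 404, not something confirmed here.

```python
from urllib.parse import urlsplit

working = "https://browse.arxiv.org/html/2312.12451v1#bib.bib1"
broken = "https://browse.arxiv.org/html/2312.12451v1/#bib.bib1"

for url in (working, broken):
    parts = urlsplit(url)
    # Only the path goes to the server; the fragment stays in the browser.
    print(f"path={parts.path!r} fragment={parts.fragment!r}")
```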
        
               | IlliOnato wrote:
               | Yeah, it seems like a bug in HTML generator...
        
         | winwang wrote:
         | Probably more accessible in general. (PDF) Papers are
         | psychologically scary.
        
           | mmis1000 wrote:
           | PDF is by design an image format that can also embed text. It
           | just doesn't have the primitives to properly retain the
           | article structure.
        
             | PaulHoule wrote:
             | Nah, it's a super-complex system that creates a graph of
             | components, can draw vectors like PostScript, can embed 3-d
             | models, etc. The spec is here
             | 
             | https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandard...
             | 
             | if you look at sections 14.6 through 14.10 you will find
             | quite baroque facilities for representing the structure of
             | documents in great detail, making documents with
             | accessibility data, making documents that can reflow with
             | HTML, etc. Not to mention the 14.11 stuff which addresses
             | problems with high end printing (say you want to make litho
             | plates for a book.)
             | 
             | For that matter sections 14.4 and 14.5 describe facilities
             | that can be used to add additional private data to PDF
             | files for particular applications. For instance Adobe
             | Illustrator's files are PDF files with some extra private
             | data, and https://en.wikipedia.org/wiki/GeoPDF
             | 
             | I like to complain that PDF has no facility to draw a
             | circle but instead makes you approximate a circle with
             | (accursed) Bezier curves but other than that the main
             | complaint people make about PDF is that it is too
             | complicated not that it is lacking this feature or that
             | feature.
             | 
             | Contrast that to a highly opinionated document format like
             | DjVu
             | 
             | https://en.wikipedia.org/wiki/DjVu
             | 
             | which came out around the same time as PDF and is
             | specialized for the problem of scanned documents and works
             | by decomposing the document into three layers, one of which
             | is a bilevel layer intended to represent text. All three
             | layers have specialized coding schemes, the text layer in
             | particular tries to identify that every copy of (say) the
             | letter "e" or the character "Han " is the same and reuses
             | the same bitmap for them.
        
               | anonimo37 wrote:
               | You would normally use a library to create the PDF so you
               | don't need to deal with the complexity of the format. A
               | library would likely provide a function for drawing
               | circles that translates the circle into Bezier curves.
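
A minimal sketch of the kind of helper such a library might provide, assuming the common four-segment construction with the constant 4/3 * (sqrt(2) - 1) for control-point offsets. The function names are hypothetical and not taken from any particular PDF library. The second function samples the curves to check how far they stray from a true circle.

```python
import math

# Control-point offset for a quarter circle of radius 1 (about 0.5523).
KAPPA = 4.0 * (math.sqrt(2.0) - 1.0) / 3.0


def circle_to_beziers(cx, cy, r):
    """Return four cubic Bezier segments (p0, p1, p2, p3) approximating a circle."""
    k = KAPPA * r
    return [
        ((cx + r, cy), (cx + r, cy + k), (cx + k, cy + r), (cx, cy + r)),
        ((cx, cy + r), (cx - k, cy + r), (cx - r, cy + k), (cx - r, cy)),
        ((cx - r, cy), (cx - r, cy - k), (cx - k, cy - r), (cx, cy - r)),
        ((cx, cy - r), (cx + k, cy - r), (cx + r, cy - k), (cx + r, cy)),
    ]


def bezier_point(segment, t):
    """Evaluate one cubic Bezier segment at parameter t in [0, 1]."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = segment
    u = 1.0 - t
    b0, b1, b2, b3 = u**3, 3 * u * u * t, 3 * u * t * t, t**3
    return (b0 * x0 + b1 * x1 + b2 * x2 + b3 * x3,
            b0 * y0 + b1 * y1 + b2 * y2 + b3 * y3)


def max_radial_error(cx, cy, r, samples=200):
    """Largest distance between the sampled curves and the true circle."""
    worst = 0.0
    for segment in circle_to_beziers(cx, cy, r):
        for i in range(samples + 1):
            x, y = bezier_point(segment, i / samples)
            worst = max(worst, abs(math.hypot(x - cx, y - cy) - r))
    return worst


if __name__ == "__main__":
    print(max_radial_error(0.0, 0.0, 1.0))
```

For a unit circle the sampled deviation stays well under 0.1% of the radius, which is why the approximation is usually invisible in print.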
        
         | tarboreus wrote:
         | One of the reasons is to make the papers more accessible to
         | people with disabilities, especially the blind. I participated
         | in a conference they hosted on this a few months ago, I
         | recommend taking a look at the recordings if you're interested
         | in thinking on this.
         | 
         | https://accessibility2023.arxiv.org/
        
           | miki123211 wrote:
           | Blind person here, can confirm this. Reading PDFs with a
           | screen reader is bad, reading PDFs that come from LaTeX is
           | worse, reading LaTeX math is pretty much impossible. All the
           | semantic info you need is just thrown away.
           | 
           | You _can_ make decently accessible PDFs but it's lots of
           | work, you need Acrobat on the producer's side and might also
           | need it on the consumer's side. Free tools don't even come
           | close. There's also the fact that the process of making
           | accessible PDFs in Acrobat isn't itself accessible.
           | 
           | With that said, the way screen readers treat HTML math
           | certainly isn't perfect, it's geared more towards school
           | children than anything above calculus. I'm probably going to
           | stay with my LaTeX source files for now. At least ArXiv
           | offers those, not many sites do. To be fair, that approach
           | also has its own set of problems (particularly when people
           | use some extra fancy formatting in their math equations,
           | making the markup hard to read), but I find this to be the
           | best approach for me so far, at least on AI/ML papers.
        
             | saurik wrote:
             | Huh. It would seem like, of all the things which should
             | make it easy to generate the correct accessibility
             | information, the pipeline of compiling a paper from source
             | code in LaTeX should nail it... maybe we should all pitch
             | in to some pool to pay someone to put in the required
             | effort to connect all the dots?
        
               | semi-extrinsic wrote:
               | Kind of tangential, but it's also kind of surprising how
               | difficult it is in LaTeX to make a plot of an equation.
               | 
               | Say I have Equation \ref{eq}. Why can't I just say "plot
               | \ref{eq} for x from -6 to 11" and get my graph?
               | 
               | And yes, I know about pgfplots, PSTricks, TikZ etc. But
               | in all those cases, I need to define the same equation
               | twice, in different syntax to boot. It's kind of
               | unsatisfying.
        
             | ldenoue wrote:
             | I wrote an app called PDF Reflow that reflows the original
             | PDF using image processing to cut out words into tiles so
             | you see the reflowed version of the text in their original
             | look.
             | 
             | https://www.appblit.com/pdfreflow
        
             | jakderrida wrote:
             | Hold on... Are you telling me that all these complex
             | sentences are being typed out based on your voice alone?
             | That's insane.
        
               | ehPReth wrote:
               | ? blind people can use keyboards
        
               | kzrdude wrote:
               | Hm tangential question but shouldn't touch typing be well
               | accessible for many blind computer users?
        
               | topato wrote:
               | I'd say it would be simple to talk type these using
               | windows 11's redux of voice typing. Pretty damn accurate
               | and easy to modify/variate text/options. I use it all the
               | time to make tech/engineering blog posts, faster and more
               | organic than typing, typically, and it learns your
               | technoacronyms. Combined with voice access, it makes it
               | trivial to fully operate your computer (well, at least,
               | browse the web, email, and media apps) from across the
               | room. For anyone who hasn't tried the updated version,
               | highly suggest hitting windowskey+h and giving it a shot.
        
             | anthk wrote:
             | Emacs with Emacspeak has a math reading module.
        
         | codethief wrote:
         | Ugh. I don't belong to the target audience (people with
         | disabilities) but the typesetting doesn't exactly look pleasant
         | on my machine (Chrome on Linux).
        
         | jll29 wrote:
         | It's a cool feature because it makes the papers more findable,
         | more easily navigable, easier to read online and faster to
         | scroll through. I am also happy for blind people that they can
         | more easily use arXiv with Braille readers now.
         | 
         | (I'm still a fan of printing the PDFs, because I annotate on
         | paper and refer to page numbers, but the HTML feature is in
         | addition to PDF download, not a replacement.)
         | 
         | One thing that still sucks (not ArXiv related though) is
         | reading mathematical formulae on the Kindle - wonder if someone
         | with rendering expertise could have a look into the MOBI
         | format.
        
       | alephnerd wrote:
       | This is a great UX addition. Why did it take them so long?
        
         | gwern wrote:
         | The conversion is still very error-prone. It can't convert a
         | lot of packages, and the last paper I read, StarVector, half
         | the HTML version is just missing. (I think it hit an error at a
         | figure of some sort.) I reported an error, but I've been
         | reporting errors against the ar5iv and abstracts for years now
         | and the long tail of problems just seems like an incredible
         | slog.
        
           | KRAKRISMOTT wrote:
           | Where are the computer vision people? This is the perfect
           | type of problem for multi modal LLMs
        
             | IlliOnato wrote:
             | Except that the errors made by an LLM might be harder to
             | spot than converter errors that typically are very blatant,
             | and don't usually alter text (perhaps just drop parts of
             | it).
             | 
             | Also, a bug in a converter is conceptually much easier to
             | fix than to re-train your LLM.
             | 
             | I am not sure that AI in its current state is useful when
             | "high fidelity" is required.
        
           | dginev wrote:
           | Can confirm. From an ar5iv standpoint, 2.56% articles
           | currently fail to convert entirely, and 22.9% have known
           | errors to the converter. That leaves 74.5% of nominally
           | usable articles. This success rate is noticeably _lower_ for
           | the newest batches of arXiv submissions, as the converter
           | hasn't caught up with the most recent package innovations.
           | 
           | We have a plan in place to meaningfully fall back for unknown
           | packages, but that will take at least another year to put in
           | place, and likely another couple of years to stabilize.
           | 
           | Meanwhile, there is some hope that with arXiv launching the
           | HTML Beta we will get more contributions for package support
           | (LaTeXML is an open source project, with public domain
           | licensing, everybody benefits).
           | 
           | But again the original point is spot on. Coverage will be
           | hit-or-miss for a while longer yet, for an arbitrary arXiv
           | submission. The good news is that authors _could_ work
           | towards better support for their articles, if they wanted to.
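
The 74.5% figure above follows directly from the two quoted rates. A quick check of the arithmetic, using only the numbers given in the comment:

```python
failed = 2.56        # percent of articles that fail to convert entirely
with_errors = 22.9   # percent that convert with known errors

# Whatever is left is the share of nominally usable articles.
usable = 100.0 - failed - with_errors
print(f"{usable:.2f}%")
```

That gives 74.54%, which the comment rounds to 74.5%.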
        
         | eviks wrote:
         | Because this is a rather conservative field with little
         | dependency on the general public, so without much interest in
         | helping disseminate the knowledge broadly & accessibly
         | (relative to other priorities, not absolute)
        
         | Strilanc wrote:
         | How would you do it quickly?
         | 
         | For example, HTML isn't divided into numbered pages while
         | PDFs are. A lot of latex interacts with page boundaries.
         | Figures tend towards the tops of pages. And there's \clearpage.
         | And the reference list might say which page each citation
         | appeared on. All that stuff needs someone to decide how to
         | handle it and then to implement that handling. Like... what
         | value does \pageheight return? Sometimes I resize things to fit
         | the page height, and if it was doubled then I should have
         | resized to fit the width instead.
        
         | lynndotpy wrote:
         | Almost universally, we prepare conference papers as LaTeX files
         | made to export to PDFs which fit within the conference's
         | template.
         | 
         | It's nontrivial to export this to HTML in all cases, and even
         | then, nobody is asking for HTML from us even though we all want
         | it. I'm guessing Arxiv is using some kind of converter which
         | _usually_ but not _always_ works.
         | 
         | That said, this is a long time coming and PDF as the standard
         | should've died a decade ago. I wish I had this when I was in my
         | PhD program.
        
         | alright2565 wrote:
         | Latex is a very complicated programming language for creating
         | documents. It is not easy to create a new backend for it.
         | 
         | As a glimpse into the very tip of the iceberg, this diagram
         | https://tex.stackexchange.com/a/158740/ is generated with 100%
         | LaTeX code.
        
       | binarymax wrote:
       | Nice! Now I don't need to manually replace arxiv with ar5iv.
       | Congrats to the team.
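
The manual step described above (swapping "arxiv" for "ar5iv" in the address) can be sketched as a small URL rewrite. `to_ar5iv` is a hypothetical helper, and the target host `ar5iv.org` is an assumption taken from the substitution the comment describes rather than from any documented API.

```python
from urllib.parse import urlsplit, urlunsplit

ARXIV_HOSTS = ("arxiv.org", "www.arxiv.org")


def to_ar5iv(url):
    """Rewrite an arXiv URL to the same path on ar5iv (host name assumed)."""
    parts = urlsplit(url)
    if parts.netloc not in ARXIV_HOSTS:
        raise ValueError(f"not an arXiv URL: {url}")
    # Keep scheme, path, query and fragment; swap only the host.
    return urlunsplit(parts._replace(netloc="ar5iv.org"))


if __name__ == "__main__":
    print(to_ar5iv("https://arxiv.org/abs/2312.12451"))
```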
        
         | imjonse wrote:
         | "Our ultimate goal is to backfill arXiv's entire corpus so that
         | every paper will have an HTML version, but for now this feature
         | is reserved for new papers."
         | 
         | For now it only works for papers submitted this month. But it's
         | great to have this feature, makes it so much easier to read on
         | phones.
        
       | eviks wrote:
       | Finally a modern format you can copy&paste from and read on one
       | of the most popular computing platforms!!!
        
       | pushfoo wrote:
       | Previously discussed:
       | https://news.ycombinator.com/item?id=38713215
        
       | carlosjobim wrote:
       | With the 2024 browser update, this means I can read these
       | articles on my ancient Kindle perfectly fine.
        
       | ChrisArchitect wrote:
       | [dupe] from yesterday
       | 
       | More here: https://news.ycombinator.com/item?id=38713215
        
       | ZeroCool2u wrote:
       | Wow, this is _so_ much better!
        
       | choppaface wrote:
       | Hope they benefit from CDN caching now too.
       | 
       | Edit: aaaand they got Fastly
       | https://news.ycombinator.com/item?id=38723373
        
       | cozzyd wrote:
       | doesn't work great with long author lists...
       | 
       | https://browse.arxiv.org/html/2312.12907v1
        
         | degenerate wrote:
         | The PDF is worse, so there is no simple answer to this:
         | https://arxiv.org/pdf/2312.12907v1.pdf
         | 
         | At least the HTML version pairs each author with their
         | affiliations, instead of the PDF which has all the names on
         | page 1, and all the affiliations on page 2. That's completely
         | unreadable.
        
           | cozzyd wrote:
           | The PDF is better because I'm trained to scroll past the
           | author list. That takes forever on the html version.
        
             | mattigames wrote:
             | You can click the "Introduction" anchor on the left side
             | and it scrolls for you past the author list
        
               | cozzyd wrote:
               | well it skips the abstract too, but yes, you can scroll
               | back up to see it.
        
               | mattigames wrote:
               | Yeah, it's a bit weird that the abstract doesn't have a
               | link on the left
        
               | cozzyd wrote:
               | Probably because \abstract{ } is treated differently than
               | \section{ }, I guess...
        
           | IlliOnato wrote:
           | For me the PDF is much better. It's compact and clean, if I
           | really need to see an affiliation for a particular author,
           | it's really easy to do so in the PDF, not so in the HTML.
           | 
           | It's highly unlikely anybody will read an entire author list
           | this long; typically you would read the first two or three
           | names, or check if some particular name is on the list. So
           | the compactness of the list and being able to quickly get to
           | the article contents is important.
        
       | Al-Khwarizmi wrote:
       | Nice! It would be even better if they offered authors of previous
       | papers the option of converting to HTML, as the latex sources are
       | already in the system.
        
         | fprog wrote:
         | The article states they're going to backfill all, or nearly
         | all, previously submitted papers!
        
       | FredPret wrote:
       | This is brilliant. I don't share academia's love of LaTeX multi-
       | column PDFs.
        
         | tiagod wrote:
         | I like multi-column text on paper (literally), but it's awkward
         | in digital where you can just shape text on the fly to whatever
         | column size you want
        
       | leoncaet wrote:
       | I just hope they don't stop offering the papers in PDF. Even when
       | I'm on a computer, I still prefer to read PDFs.
        
       | sylware wrote:
       | Like the maths noscript/basic (x)html wikipedia generator:
       | 
       | The magic of inline images at a known DPI, of course you can
       | provide images for different DPIs.
       | 
       | Reading maths/science noscript/basic (x)html documents on my 100
       | DPI monitor, on wikipedia. Not yet fully ready on arxiv.
        
       | gms7777 wrote:
       | About time. Biorxiv and medrxiv have been doing this for probably
       | half a decade at this point?
        
       | jez wrote:
       | It would be neat if they offered submitters the chance to upload
       | their own HTML version alongside the PDF version, instead of
       | always relying on an automatic conversion process.
       | 
       | - I can imagine authors feeling frustrated if someone reaches out
       | about a problem in the HTML version of their paper, but they have
       | no way to correct it except by hoping that a change to the PDF
       | fixes a change to the generated HTML. Easier to just fix the
       | formatting problem in the PDF outright.
       | 
       | - It would be neat to allow people to experiment with alternative
       | formatting for their papers. For example, imagine a paper about a
       | programming language that embeds a sandbox you can use to play
       | around with the language under discussion. Or a paper about
       | multivariable calculus and you can interact with a three
       | dimensional plot of some function.
        
         | layer8 wrote:
         | They'd have to define and document a "safe" subset of HTML, and
         | implement a filter/checker for it. Otherwise we'd end up with
         | papers containing ads and tracking and XSS vulnerabilities and
         | whatnot.
        
           | digging wrote:
           | Those are issues with JavaScript, not HTML. Wouldn't
           | filtering out iframes pretty much keep us in the clear?
        
             | layer8 wrote:
             | The parent wanted interactive 3D plots, which means
             | JavaScript embedded in or linked from the HTML. Then
             | there's stuff like JavaScript embedded in SVG.
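
A rough sketch of what a "safe subset" checker could look like, using only Python's standard `html.parser`. The allowlists are illustrative assumptions, not anything arXiv has defined, and a production filter would more likely rely on a vetted sanitizer library. The checker flags disallowed tags (including `script`, `iframe` and `svg`), event-handler attributes such as `onclick`, and `javascript:` links.

```python
from html.parser import HTMLParser

# Hypothetical allowlists; a real policy would be larger and documented.
ALLOWED_TAGS = {"html", "head", "title", "body", "h1", "h2", "h3", "p", "a",
                "em", "strong", "ul", "ol", "li", "table", "tr", "td", "th",
                "br", "span", "div"}
ALLOWED_ATTRS = {"href", "id", "class", "title"}


class SafeSubsetChecker(HTMLParser):
    """Collect violations instead of rewriting the document."""

    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag not in ALLOWED_TAGS:
            self.violations.append(f"disallowed tag <{tag}>")
        for name, value in attrs:
            if name not in ALLOWED_ATTRS:
                self.violations.append(f"disallowed attribute {name} on <{tag}>")
            elif name == "href" and (value or "").strip().lower().startswith("javascript:"):
                self.violations.append(f"javascript: link on <{tag}>")


def check_html(source):
    """Return a list of violations; an empty list means the input passed."""
    checker = SafeSubsetChecker()
    checker.feed(source)
    checker.close()
    return checker.violations


if __name__ == "__main__":
    print(check_html('<p onclick="track()">hi</p><script>alert(1)</script>'))
```

Reporting violations rather than silently stripping them would let a submitter see why an upload was rejected.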
        
         | diffeomorphism wrote:
         | > It would be neat if they offered submitters the chance to
         | upload their own HTML version alongside the PDF version,
         | instead of always relying on an automatic conversion process.
         | 
         | Please don't. Then you will have a mismatch between the source
         | and the "own html" which ruins the point of uploading the
         | source.
        
           | eviks wrote:
           | Pdf isn't the source
        
             | IlliOnato wrote:
             | But the PDF is also generated. LaTeX is the single source
             | of truth.
        
         | kjkjadksj wrote:
         | Most authors probably have no interest in learning html. Also
         | most authors want nothing to do with the work by the time it's
         | submitted. It was probably hell getting the project to that
         | point of publishing, they want to be done with it and move on
         | to the next thing going on in their career asap.
        
           | jez wrote:
           | I think this is an argument in favor of doing automatic PDF
           | -> HTML conversion for the authors that don't want to touch
           | it, but I don't think it's an argument against letting those
           | who are fine with HTML provide their own.
        
         | tiagod wrote:
         | I was under the impression the source authors publish to arxiv
         | was a latex file
        
           | jraph wrote:
           | It is.
        
           | jez wrote:
           | Ah, thanks for clarifying!
           | 
           | I looked up the submission formats, and it looks like if you
           | authored the paper in TeX/LaTeX, they do not accept pre-
           | rendered versions of the document.
           | 
           | https://info.arxiv.org/help/submit/index.html#formats-for-
           | te...
           | 
           | But if you did not author it in TeX/LaTeX (e.g., Word, Google
           | Docs, etc.) it appears you can upload a PDF or HTML yourself.
        
         | IlliOnato wrote:
         | No, it would not. It's critically important that there is only
         | one "logical" article, albeit with different representations.
         | In other words, a single "source of truth".
         | 
         | With "sideloading" of HTML there is no way in general to make
         | sure that the _contents_ of LaTeX (and PDF) on one side and
         | HTML on the other side are the same.
        
         | thomasahle wrote:
         | > It would be neat if they offered submitters the chance to
         | upload their own HTML version alongside the PDF version,
         | instead of always relying on an automatic conversion process.
         | 
         | Can you recommend a system I can use to compile my latex, while
         | also making sure the html is going to look good? I'd like some
         | kinds of css style @media queries to switch between certain
         | parts of the layout, while keeping a single latex file.
        
       | endergen wrote:
       | I was hoping this meant that html native submissions would be
       | possible, so that people made interactive explanations.
        
       | lucidrains wrote:
       | nice! will make reading papers on the phone so much more
       | pleasant!
        
       | odyssey7 wrote:
       | article {
       |   text-justify: Knuth-Plass;
       | }
        
       | matt1 wrote:
       | For anyone interested in staying informed about important new
       | AI/ML papers on arXiv, check out https://www.emergentmind.com, a
       | site I'm building that should help.
       | 
       | Emergent Mind works by checking social media for arXiv paper
       | mentions (HackerNews, Reddit, X, YouTube, and GitHub), then ranks
       | the papers based on how much social media activity there has been
       | and how long since the paper was published (similar to how HN and
       | Reddit work, except using social media activity, not upvotes, for
       | the ranking). Then, for each paper, it summarizes it using GPT-4,
       | links to the social media discussions, paper references, and
       | related papers.
       | 
       | It's a fairly new site and I haven't shared it much yet. Would
       | love any feedback or requests you all have for improving it.
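
A rough sketch of the kind of ranking described above: social activity discounted by time since publication. The formula is the commonly cited HN-style "gravity" decay and is an assumption here, not Emergent Mind's actual algorithm. The names `rank_score` and `rank_papers`, the `gravity` parameter, and the record fields are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def rank_score(mentions, published, now, gravity=1.8):
    """Score a paper by mention count, decayed by its age in hours."""
    age_hours = max((now - published).total_seconds() / 3600.0, 0.0)
    # The +2 offset keeps brand-new papers from getting an unbounded score.
    return mentions / (age_hours + 2.0) ** gravity


def rank_papers(papers, now):
    """Return papers sorted from highest to lowest score."""
    return sorted(
        papers,
        key=lambda p: rank_score(p["mentions"], p["published"], now),
        reverse=True,
    )


now = datetime(2023, 12, 21, 18, 0, tzinfo=timezone.utc)
papers = [
    {"id": "older-paper", "mentions": 100, "published": now - timedelta(hours=48)},
    {"id": "newer-paper", "mentions": 50, "published": now - timedelta(hours=2)},
]
ranked = rank_papers(papers, now)
```

With these made-up numbers the newer paper ranks first despite having half the mentions, because the decay term grows faster than linearly with age.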
        
         | raccoonDivider wrote:
         | That looks great. No real feedback yet, but it's the kind of
         | thing I've always been looking for as a better alternative to
         | Twitter.
        
           | matt1 wrote:
           | Thanks! I've got a lot more planned for it too. If anyone has
           | any feedback that doesn't make sense to share here, or if
           | you're a researcher who is open to some questions about how
           | you currently follow arXiv papers, drop me a note at
           | matt@emergentmind.com.
        
         | CodeCube wrote:
         | Love to see Emergent Mind continuing to innovate!
        
         | sureglymop wrote:
         | Love the clean design of the website! Looks amazing on mobile.
        
         | jakderrida wrote:
         | This is exactly what I was using HN for. But, yeah, it kinda
         | sucked compared to yours. Another thing I was trying to create
         | was some sort of NN model that could use the semanticscholar
         | h-index of authors along with the abstract text and T5 to
         | estimate the one-year out citations. Just for personal use,
         | though. That whole thing fell apart because semanticscholar is
         | kinda crap for associating author links to the same author. I
         | frequently ended up with the wrong professors, which I'd think
         | would be easily fixable for them.
        
           | carlossouza wrote:
           | I did that (used other features). This is how new papers are
           | ranked here:
           | 
           | https://trendingpapers.com
        
       | apstats wrote:
       | I wonder if this could be used to train an LLM to convert PDFs
       | with rich charts into HTML?
        
       | reqo wrote:
       | A lot of AI/ML papers these days have an accompanying interactive
       | page like [0], will we see anything like these now directly in
       | arXiv?
       | 
       | [0] https://voyager.minedojo.org/
        
         | z2h-a6n wrote:
         | I think then arXiv would have to deal with maintaining the tech
         | stack and providing the presumably much higher server capacity
         | to serve the more varied web pages that would result, so it
         | seems like a tall order. arXiv already has an experimental
         | integration with Papers with Code [0], which I guess provides
         | similar results for the reader, though the authors have to
         | figure out their own web hosting.
         | 
         | [0] https://info.arxiv.org/labs/showcase.html#arxiv-links-to-
         | cod...
        
       | ansk wrote:
       | When I open a large pdf on arxiv (100+ MB, not uncommon for ML
       | papers focused on hi-res image generation), there is a
       | significant load time (10+ seconds) before anything is rendered
       | at all other than a loading bar. Does anyone know what the source
       | of this delay is? Is it network-bound or is Chrome just really
       | slow to render large PDFs? Do PDFs have to be fully downloaded to
       | begin rendering? In any case, this delay is my only gripe with
       | arxiv and a progressively rendered HTML doc that instantly loads
       | the document text would be a huge improvement.
        
         | IlliOnato wrote:
         | It may be even that the time is taken to _generate_ a PDF.
         | 
         | The format in which articles are submitted and stored in arXiv
         | is LaTeX. PDF is automatically generated from it.
         | 
         | Probably arXiv does some caching of PDFs so they don't have to
         | be generated anew every time they are requested, but I don't
         | know how this caching works.
        
         | upbeat_general wrote:
         | I have the same issue. From what I can tell it's just network-
         | bound and the Arxiv servers are slow. They theoretically allow
         | for you to set up a caching server but after spending a while
         | trying to get it set up, I haven't been able to get it to work.
         | 
         | https://info.arxiv.org/help/faq/cache.html
        
           | arccy wrote:
           | maybe it'll be faster now with fastly
           | 
           | https://news.ycombinator.com/item?id=38723373
        
       | ww520 wrote:
       | That's great. Now I can read the papers on my phone.
        
       | svag wrote:
       | The tool that's being used for this offering is this one,
       | https://github.com/arXiv/arxiv-readability, just to save a few
       | clicks :)
        
         | IshKebab wrote:
         | Wow I did not know they have the LaTeX for all the papers and
         | compile it themselves! That's pretty crazy. What if they don't
         | have packages you need? What if your paper isn't written with
         | LaTeX?
        
       | WendyTheWillow wrote:
       | I'm so far left wanting for an app that gives me a way to easily
       | track and consume newly published work of a given topic. The
       | existing apps are not great, and maybe this change will make it
       | easier to provide better "reader" views, and possibly even tts (I
       | like to listen+read).
        
       | aragonite wrote:
       | A lot of academic journals (say from Springer) also offer HTML
       | formats for papers published in the past decade or so, which I
       | personally often find more convenient for reading purposes than
       | PDFs. For example, I parse text a lot faster if I use a regex to
       | split each paragraph into sentences and place a linebreak after
       | each sentence, or if I do natural language "syntax highlighting"
       | by assigning a distinctive color to functional words indicating
       | logical structure like 'if/then', 'and', 'or', 'not', 'because',
       | and 'is'. And sometimes it really improves readability to be able
       | to do "semantic highlighting", in the sense of say assigning a
       | different hashed color to each proper name (or each labeled
       | thesis, etc) that occurs in the paper. Such manipulations are
       | basically impossible with PDFs. It makes me wish sci-hub would
       | start archiving HTML versions in addition to PDFs!
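
The text manipulations described above can be sketched in a few lines, assuming plain paragraph text has already been extracted from the HTML. The regexes are deliberately naive (abbreviations such as "e.g." followed by a capital letter will split incorrectly), and the function names, the `<b>` markup, and the HSL color scheme are hypothetical choices for illustration.

```python
import hashlib
import re

# Split after ., !, or ? when followed by whitespace and a capital, quote, or paren.
SENTENCE_END = re.compile(r'(?<=[.!?])\s+(?=[A-Z"(])')
# Function words that signal logical structure, as listed in the comment.
FUNCTION_WORDS = re.compile(r"\b(if|then|and|or|not|because|is)\b", re.IGNORECASE)


def split_sentences(paragraph):
    """Return the paragraph as a list of sentences, one per list item."""
    return SENTENCE_END.split(paragraph.strip())


def mark_function_words(sentence):
    """Wrap each logical function word in <b> tags for highlighting."""
    return FUNCTION_WORDS.sub(lambda m: f"<b>{m.group(0)}</b>", sentence)


def hashed_color(name):
    """Map a proper name to a stable CSS color derived from its hash."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    hue = int.from_bytes(digest[:2], "big") % 360
    return f"hsl({hue}, 70%, 40%)"


text = "If A holds, then B follows. B is false. So A fails."
sentences = split_sentences(text)
marked = [mark_function_words(s) for s in sentences]
```

Because the color comes from a hash of the name, the same name gets the same color on every page without keeping any lookup table.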
        
       | johnsillings wrote:
       | https://www.arxiv-vanity.com/
        
         | jakderrida wrote:
         | And, of course, https://ar5iv.labs.arxiv.org/html
         | 
         | However, ar5iv isn't a la carte like arxiv-vanity. They pretty
         | much do last month's papers every month or so. Something like
         | that.
        
           | dginev wrote:
           | Hi, ar5iv creator here.
           | 
           | You can think of both arxiv-vanity and ar5iv as the "alpha"
           | experiments that led into the official arXiv "beta" HTML
           | announced today.
           | 
           | Once a few rounds of feedback and improvements are
           | integrated, and the full collection of articles acquires HTML
           | in the main arXiv site, ar5iv will be decommissioned.
           | 
           | The plan is to turn all existing ar5iv links into redirects
           | to the official HTML, and free up the resources for
           | maintaining it. I am not sure what the plans are for
           | maintaining arxiv-vanity, but I suspect they may head down a
           | similar path some time later.
        
       | philipashlock wrote:
       | 30 years after HTML was invented to support accessibility and
       | collaboration for research and academia and the same day the
       | White House released their new accessibility guidance which
       | happens to be the first time they've published formal new policy
       | natively as HTML rather than PDF -
       | https://www.whitehouse.gov/omb/management/ofcio/m-24-08-stre...
        
         | murphyslab wrote:
         | I feel surprised by how succinct, easy-to-understand, and
         | sensible the policy (M-23-22) is:
         | 
         | > Default to HTML: HyperText Markup Language (HTML) is the
         | standard for publishing documents designed to be displayed in a
         | web browser. HTML provides numerous advantages (e.g., easier to
         | make accessible, friendlier to assistive technology, more
         | dynamic and responsive, easier to maintain). When developing
         | information for the web, agencies should default to creating
         | and publishing content in an HTML format in lieu of publishing
         | content in other electronic document formats that are designed
         | for printing or preserving and protecting the content and
         | layout of the document (e.g., PDF and DOCX formats). An agency
         | should develop online content in a non-HTML format only if
         | necessitated by a specific user need.
         | 
         | https://www.whitehouse.gov/omb/management/ofcio/delivering-a...
        
       | golol wrote:
       | IMO pdf and HTML optimize for different things. pdf is easy and
       | pretty. HTML is easy and responsive. But making pdf responsive is
       | impossible and making HTML pretty is not easy. I prefer having
       | arxiv for well-polished pretty documents, not responsive ugly
       | documents. Most researchers don't have time to make an HTML
       | responsive and pretty.
        
         | querez wrote:
         | Am researcher, care about responsiveness way more than pretty.
         | I am super glad for the option. Downloading PDFs is super
         | annoying. I'm stoked.
        
       | radicalriddler wrote:
       | FUCK YES (excuse my profanity). I have a tool that converts HTML
       | to Neural Speech and I always wanted to push arXiv papers through
       | it, but couldn't be bothered with a PDF implementation.
        
       | topicseed wrote:
       | What do they use to convert a PDF document to a clean, correct
       | HTML document? It's a difficult space, especially with the
       | variety of layouts you may find in PDF documents...
        
         | blackbear_ wrote:
         | Arxiv encourages users to submit the latex source of their
         | papers rather than the PDF
        
       | vegabook wrote:
       | PDF is objectively much better than HTML at rendering text
       | documents. And it's not even close. This could easily have been
       | done 10, even 15-20 years ago. That it wasn't is not just
       | inertia. Latex and PDF have enormously better text rendering, and
       | the static format locks a state-commit in time that is much
       | easier to go back to and reference/critique. Unlike the
       | intrinsically fluid nature of HTML. For academic work, milestone-
       | like formats, that lock state in time, are useful for those who
       | later build on them. And again, the rendering just doesn't
       | compare and that imparts [sub]conscious quality signals.
        
       | imranq wrote:
       | At this point are academic papers simply peer-reviewed blog
       | posts?
        
       | acjohnson55 wrote:
       | This is great! I browse papers on mobile, and PDF is so bad for
       | that use case.
        
       ___________________________________________________________________
       (page generated 2023-12-21 23:00 UTC)