[HN Gopher] Phind Model beats GPT-4 at coding, with GPT-3.5 spee...
       ___________________________________________________________________
        
       Phind Model beats GPT-4 at coding, with GPT-3.5 speed and 16k
       context
        
       Hi HN,  We're excited to announce that Phind now defaults to our
       own model that matches and exceeds GPT-4's coding abilities while
       running 5x faster. You can now get high quality answers for
       technical questions in 10 seconds instead of 50.  The current 7th-
       generation Phind Model is built on top of our open-source
       CodeLlama-34B fine-tunes that were the first models to beat GPT-4's
       score on HumanEval and are still the best open source coding models
       overall by a wide margin:
       https://huggingface.co/spaces/bigcode/bigcode-models-leaderb....
       This new model has been fine-tuned on an additional 70B+ tokens of
       high quality code and reasoning problems and exhibits a HumanEval
       score of 74.7%. However, we've found that HumanEval is a poor
       indicator of real-world helpfulness. After deploying previous
       iterations of the Phind Model on our service, we've collected
       detailed feedback and noticed that our model matches or exceeds
       GPT-4's helpfulness most of the time on real-world questions. Many
       in our Discord community have begun using Phind exclusively with
       the Phind Model despite also having unlimited access to GPT-4.  One
       of the Phind Model's key advantages is that it's very fast. We've
       been able to achieve a 5x speedup over GPT-4 by running our model
       on H100s using the new TensorRT-LLM library from NVIDIA. We can
       achieve up to 100 tokens per second single-stream while GPT-4 runs
       around 20 tokens per second at best.  Another key advantage of the
       Phind Model is context - it supports up to 16k tokens. We currently
       allow inputs of up to 12k tokens on the website and reserve the
       remaining 4k for web results.  There are still some rough edges
       with the Phind Model and we'll continue improving it constantly.
       One area where it still suffers is consistency -- on certain
       challenging questions where it is capable of getting the right
       answer, the Phind Model might take more generations to get to the
       right answer than GPT-4.  We'd love to hear your feedback.  Cheers,
       The Phind Team
        
       Author : rushingcreek
       Score  : 500 points
       Date   : 2023-10-31 17:40 UTC (5 hours ago)
        
 (HTM) web link (www.phind.com)
 (TXT) w3m dump (www.phind.com)
        
       | pleonasticity wrote:
       | This is great work, but HumanEval is an extremely limited
       | benchmark and I don't think you can seriously claim to beat GPT-4
       | at coding based only on that metric.
        
         | rushingcreek wrote:
         | Thank you. You're right -- which is why we rely on feedback
         | we've received from our own users for that claim. Many of our
         | users who have the choice to use either GPT-4 or the Phind
         | Model on Phind choose the Phind Model.
        
           | pleonasticity wrote:
           | I understand, but big claims require big evidence and so it's
           | still IMHO not rhetorically a strong position. I'm glad
           | people find it more useful!
        
           | Kranar wrote:
            | You likely know this, but keep in mind the selection
            | bias in taking feedback mostly from your own users.
            | I've lost count of the times product designers have
            | claimed that their users prefer some aspect of how
            | their application already works, ignoring the fact
            | that the users who didn't prefer it have left and
            | hence are not available to survey.
        
             | rushingcreek wrote:
             | Of course. We do our best to talk to churned users as well,
             | but we're doing this Show HN to get even more diverse
             | feedback.
        
         | nomel wrote:
         | Fifth sentence:
         | 
         | > However, we've found that HumanEval is a poor indicator of
         | real-world helpfulness.
        
       | v3ss0n wrote:
        | From my test of the open-source model last week, it kept
        | repeating itself and gave broken outputs, using Q4.
        
         | rushingcreek wrote:
         | This V7 model is much better than the V2 model that we
         | previously open-sourced. And Q4 quantization would also likely
         | have a large detrimental impact.
        
           | nannal wrote:
           | Are there plans to open source V7?
        
       | slowhadoken wrote:
        | I love that Phind cites what it scrapes. This should be
        | the obligation of all LLMs. I always suggest people use
        | it over ChatGPT.
        
         | Racing0461 wrote:
          | As a user, I prefer getting the right response over the
          | thing spitting out a link (not saying Phind is bad).
          | Let's focus on getting LLMs right before nerfing them
          | in their baby stages.
        
           | ryanklee wrote:
           | Who said anything about nerfing? Citation is just additive,
           | no?
        
             | joshspankit wrote:
             | In fact, I'd argue that citation makes LLM better. Kind of
             | a "think carefully" indicator. When LLMs are able to verify
             | those citations independently it's going to level up again
             | by skyrocketing the objective truthiness.
        
               | lsaferite wrote:
               | Interestingly, I'd say that _not_ being able to give
               | citations helps protect the LLM from copyright issues.
                | That being said, I'd much prefer it if the LLM
                | could provide citations for every piece of
                | information it was trained on and uses to provide
                | an answer.
        
               | pbhjpbhj wrote:
               | Citations are essential for me as I'm using Phind for
                | work and can't rely on "trust me bro". It needs to
                | conform to my expectations or be confirmed in a couple of
               | the citations that have trustworthy sources (eg are from
               | known domains, well-cited journals, etc.).
        
             | Racing0461 wrote:
              | Nerf is the wrong word; more like regulatory
              | capture. If all LLMs had to quote their sources at
              | this point, along with all the other human-friendly
              | changes we want to make, only the big players would
              | be able to do so effectively, making it hard to
              | enter and compete. Based on the AI executive order
              | released yesterday, the current big players want
              | launching a new LLM product to be more like opening
              | a bank than opening a lemonade stand.
        
             | __jonas wrote:
              | I find it often makes the responses worse when the
              | model is pre-fed these search results. That was the
              | case when I tried GPT-4 with web browsing enabled,
              | and it seems to be the case here too, since even
              | the person from the Phind team in this thread
              | pointed out that turning this feature off improves
              | performance for some tasks:
             | 
             | https://news.ycombinator.com/item?id=38089888
             | 
             | https://news.ycombinator.com/item?id=38090442
        
         | make3 wrote:
          | What they're citing isn't what the LLM "scraped"; it's
          | what the retrieval model fed to the LLM. You're not
          | guaranteed that this is what it actually used to give
          | you the output, and it's definitely not all the text it
          | drew on for the knowledge behind the answer, since that
          | knowledge is spread across millions of training
          | examples in a way that isn't human-understandable.
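The retrieval step being discussed can be sketched in a few lines. This is a toy illustration only: the two-element vectors stand in for real embedding-model outputs, and a production system would use a vector index rather than a linear scan:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, docs, k=2):
    """Rank (embedding, text) pairs by similarity to the query and
    return the top-k texts. These retrieved snippets are what the
    LLM is fed -- and what gets cited -- as opposed to the model's
    parametric knowledge, which is not directly inspectable."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]
```

A query vector of [1, 0] against documents embedded at [1, 0], [0, 1], and [0.9, 0.1] would return the first and third texts.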
        
           | pbhjpbhj wrote:
           | A couple of times I've had the reference not include the
           | detail being mentioned in the foregoing paragraph; the
           | citations are still highly relevant, but it wasn't quite what
           | I expected.
        
       | xydac wrote:
       | This is amazing, kudos to the team
        
       | seidleroni wrote:
        | The results I get are so-so. The rubric I use to evaluate
        | coding LLMs is to ask for a Python script that determines
        | whether the contents of a given directory have changed
        | since the last time the script was run. This should be
        | done recursively and handle files being added, removed,
        | or modified, based on the contents of the files and not
        | the timestamps.
        | 
        | When I asked it in one statement it performed OK, but
        | when I added more specifications in follow-up statements,
        | it kept trying to go down one path even though I told it
        | to do it a different way. A solid start, but it
        | definitely needs some improvements, IMO.
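For reference, a minimal sketch of the kind of script described above (the hashing scheme and function names are my own choices; the state file should live outside the watched directory so it doesn't trigger its own change):

```python
import hashlib
import json
import os

def snapshot(root):
    """Recursively map each file's relative path to a hash of its
    contents -- so adds, removes, and edits all register, while
    timestamp-only changes do not."""
    state = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            with open(path, "rb") as f:
                state[rel] = hashlib.sha256(f.read()).hexdigest()
    return state

def has_changed(root, state_file):
    """Compare the current snapshot to the last recorded one, then
    record the current snapshot for next time. The first run
    reports True since there is nothing to compare against."""
    current = snapshot(root)
    previous = None
    if os.path.exists(state_file):
        with open(state_file) as f:
            previous = json.load(f)
    with open(state_file, "w") as f:
        json.dump(current, f)
    return previous != current
```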
        
         | rushingcreek wrote:
         | Thanks for the feedback. We're working on improving consistency
         | and precise instruction following in followups.
        
         | jiggawatts wrote:
         | This is a problem that human programmers screw up... regularly.
         | 
          | E.g.: the efficient and robust way to monitor file
          | changes on Windows is to read the NTFS change journal.
          | Within a single process's lifetime there are other
          | change-notification APIs as well. Most software does
          | neither and is either very slow or misses changes...
        
       | qorrect wrote:
       | It's so fast ... and accurate.
        
       | tydunn wrote:
       | This is awesome. Are you planning to open-source the V7 model?
        
         | rushingcreek wrote:
         | Thanks! We generally plan to open-source our previous models
         | once they're no longer cutting-edge, so yep :)
        
       | brucethemoose2 wrote:
       | > We can achieve up to 100 tokens per second single-stream while
       | GPT-4 runs around 20 tokens per second at best.
       | 
        | Is that with batching? If so, that's quite impressive.
       | 
       | > certain challenging questions where it is capable of getting
       | the right answer, the Phind Model might take more generations to
       | get to the right answer than GPT-4.
       | 
       | Some of this is sampler tuning. Y'all should look at grammar
       | based sampling (https://github.com/ggerganov/llama.cpp/pull/1773)
       | if you aren't using it already, as well as some of the "dynamic"
       | sampling like mirostat and dynatemp:
       | https://github.com/LostRuins/koboldcpp/pull/464
       | 
       | I think these should work with nvidia's implementation if you
       | just swap the sampling out with the HF version.
       | 
       | BTW, all this is a great advantage of pulling away from OpenAI.
       | You can dig in and implement experimental features that you just
       | can't necessarily do through their API.
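The core trick behind the grammar-based sampling mentioned above is simple to illustrate: at each decoding step, zero out the probability of any token the grammar would not allow. The sketch below is a deliberately simplified illustration (real implementations such as llama.cpp's track a grammar state machine to compute the allowed set per step):

```python
import math
import random

def constrained_sample(logits, allowed, rng=random):
    """Sample a token id from `logits` (a dict of id -> logit),
    but only from the `allowed` set; everything else is masked
    out. Repeated at every step, this guarantees output that
    follows the grammar, e.g. syntactically valid JSON."""
    masked = {i: l for i, l in logits.items() if i in allowed}
    m = max(masked.values())  # subtract max for numerical stability
    weights = {i: math.exp(l - m) for i, l in masked.items()}
    total = sum(weights.values())
    r = rng.random() * total
    for i, w in weights.items():
        r -= w
        if r <= 0:
            return i
    return i  # guard against floating-point underflow
```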
        
         | rushingcreek wrote:
         | We leverage Flash Decoding
         | (https://crfm.stanford.edu/2023/10/12/flashdecoding.html) in
         | TensorRT-LLM to achieve 100 tokens per second on H100s.
        
         | claytonjy wrote:
          | Is that impressive? I was thinking 100 tok/s on an H100
          | is really slow, considering LMDeploy claims 2000+ on an
          | A100 at a large batch size.
        
           | rushingcreek wrote:
           | We get 100 tokens a second with batch size 1. Those 2000+
           | figures are for large batches.
        
             | claytonjy wrote:
             | Ah, that's fair, and faster than any of the LMDeploy stats
             | for batch size 1; nice work!
             | 
             | Using an H100 for inference, especially without batching,
             | sounds awfully expensive. Is cost much of a concern for you
             | right now?
        
               | lyjackal wrote:
                | I don't think they're saying they run at a batch
                | size of 1, just setting expectations for user-
                | facing performance.
        
               | brucethemoose2 wrote:
               | Yeah, and this is basically what I was asking.
               | 
               | 100 tokens/s on the user's end, on a host that _is_
               | batching requests, is very impressive.
        
               | claytonjy wrote:
               | I think they _are_ saying batch size 1, given that
               | rushingcreek is OP.
        
           | brucethemoose2 wrote:
           | Without batching, I was actually thinking that's kind of
           | modest.
           | 
           | ExllamaV2 will get 48 tokens/s on a 4090, which is much
           | slower/cheaper than an H100:
           | 
           | https://github.com/turboderp/exllamav2#performance
           | 
           | I didn't test codellama, but the 3090 TI figures for other
           | sizes are in the ballpark of my generation speed on a 3090.
           | 
           | 100 tokens/s batched throughput (for each individual user) is
           | much harder.
        
       | tinco wrote:
       | Will you be offering the model as an API service? The product my
       | team is working on would benefit from a significantly faster and
       | possibly better performing model than GPT-4. If you're planning
       | on keeping pace with competitive models we'd love to integrate
       | the use of your model into our service.
        
         | rushingcreek wrote:
         | If we get enough demand that's definitely something we'll
         | consider. We're still a small team, however, and we do
         | everything in our power to not get distracted from our main
         | mission.
        
           | tinco wrote:
           | Makes sense, we're also very small (pre-seed) so definitely
           | no cash cow for you guys yet. We probably shouldn't be
           | prematurely optimizing our prompting performance as it's not
           | really a bottleneck, but a 4x improvement just by swapping an
           | API would be too good not to act on.
        
           | mike_hearn wrote:
            | If you offer an API then you can be used with tools
            | like https://aider.chat/, which is the best way to
            | use LLMs for coding. But if you're only available via
            | the web, that's not possible. BTW, this is the main
            | reason I pay for the OpenAI API.
        
           | ilaksh wrote:
           | Please consider releasing an API. Having a faster alternative
           | to GPT-4 would be amazing for so many use cases.
           | 
           | Especially for agents that do function calling.
        
       | redox99 wrote:
       | Will you open source anything newer than the v2 currently on HF?
        
       | theage wrote:
        | Pretty big jump for the Java eval. What is the reason for
        | Java being so notoriously difficult for LLMs? Never mind,
        | I asked Phind[1] and it explained all the complexity...
        | but do you have any tips or tricks for working with that
        | language in your model?
       | 
       | [1] https://www.phind.com/search?cache=u3mnj3iwmjvgqlyf60bnbqo1
        
       | Kim_Bruning wrote:
        | On Firefox with normal protections I get a blank page in
        | reply to a Phind query, for whatever reason. On Chrome,
        | Phind does seem to give some interesting answers (and
        | it's a bit cheaper than GPT to begin with, for sure ;-) )
        
       | BugsJustFindMe wrote:
       | > _You can now get high quality answers for technical questions
       | in 10 seconds instead of 50._
       | 
       | ChatGPT 4 does not take 50 seconds to answer, so I don't
       | understand this comparison.
        
         | rushingcreek wrote:
         | We find that it takes around a minute for a 1024-token answer.
         | Answers to less complex questions will take less time, but
         | Phind will still be 5x faster.
        
         | bethekind wrote:
          | Recently I've used GPT-4 and yes, it does take up to a
          | minute even for easy questions.
          | 
          | I've asked it how to scp a file on Windows 11 and it'll
          | take a minute to tell me all the options possible.
         | 
         | If this takes 1/5th the time for equivalent questions, I'd
         | consider switching
        
           | joshspankit wrote:
           | Not my experience at all. Are you counting the entire answer
           | in your time?
           | 
            | If so, consider adding one of the "just get to the
            | point" prompts. GPT-4's defaults have been geared
            | toward public acceptance through long-windedness,
            | which is IMO entirely unnecessary when using it for
            | functional things like scp'ing a file.
        
             | phillipcarter wrote:
             | Yeah, I would say this is a prompting problem and not a
             | model problem. In a product area we're building out right
             | now with GPT-4, our prompt (more or less) tells it to
             | provide exactly 3 values and it does that and only that.
             | It's quite fast.
             | 
              | Also, it's a use-case thing. It is very likely that
              | for certain coding use cases, Phind will always be
              | faster because it's not designed to be general-
              | purpose.
        
             | londons_explore wrote:
             | The words "briefly" or "without explanation" work well.
             | 
             | By keeping the prompt short, it starts generating output
             | quicker too.
        
             | theWreckluse wrote:
             | LOL, it's not just for "public acceptance". Look up Chain
             | of Thought. Asking it to get to the point typically reduces
             | the accuracy.
        
               | freedomben wrote:
               | > _LOL, it's not just for "public acceptance". Look up
               | Chain of Thought. Asking it to get to the point typically
               | reduces the accuracy._
               | 
                | Just trying to provide helpful feedback: this
                | would have been a great comment, except for the
                | "LOL" at the beginning, which was unnecessary and
                | demeaning.
        
               | bigfudge wrote:
                | You are being snarky, but you're right. I have
                | scripts set up to auto-summarise expansive
                | answers. I wish I could build this into the
                | ChatGPT UI, though.
        
           | furyofantares wrote:
            | This isn't a fair comparison because I have custom
            | instructions that mention being brief but complete,
            | but I did try "how to scp a file on Windows 11":
           | 
           | ChatGPT4: 14 seconds
           | 
           | phind with "pair programmer" checked: 65 seconds
           | 
           | phind default: 16 seconds
        
           | BugsJustFindMe wrote:
           | > _I 've asked it how to scp a file on Windows 11 and it'll
           | take a minute_
           | 
           | https://imgur.com/a/iqxOJUV was 6.5 seconds.
           | 
           | https://imgur.com/a/pQFfWli was 15.
           | 
           | You can tell they're GPT-4 because the logo is purple (the
           | logo is green when using 3.5).
        
         | JoshGlazebrook wrote:
          | ChatGPT-4 is more often than not noticeably slow --
          | enough that I question why I pay for it.
        
           | shmoogy wrote:
            | Sometimes it's insanely quick -- like GPT-3.5 Turbo,
            | or a cached answer or something.
        
       | benxh wrote:
        | I can't wait to see this open-sourced; there are a lot of
        | sampling strategies that help coding.
       | 
       | And I also can't wait to see how much Phind will improve further
       | if the Glaive dataset is added onto it.
       | 
       | Edit: Contrastive search, dynamic temperatures.
        
       | SirMaster wrote:
       | I tried this, but I still have yet to get any LLM to answer me a
       | programming question (that actually works) that I actually want
       | to solve.
       | 
       | Basically:
       | 
       | "How can I send network control commands to an AppleTV in C#"
       | 
        | They always make up some nonexistent library or give an
        | example using some nonexistent API.
        
         | jiggawatts wrote:
          | That's because you're asking it something so obscure
          | that I would at first have assumed it wasn't even
          | possible.
         | 
         | "Make me a billionaire... I'm still poor! Bad AI!"
         | 
         | You need to collaborate with the AI, use it to help with each
         | small step of the problem, with input references provided.
         | 
         | To a degree Phind can do the reference chasing for you, but
         | it's _not magic_.
        
           | SirMaster wrote:
           | It's definitely not impossible at least.
           | 
           | Someone is doing it in python here:
           | 
           | https://pyatv.dev/
           | 
           | GPT-4 actually sent me here:
           | 
           | "Here is an example of a C# library that implements the HAP:
           | CSharp.HomeKit (https://github.com/brutella/hkhomekit). You
           | can use this library as a reference or directly use it in
           | your project."
           | 
            | Which, to no surprise given my experiences with LLMs
            | for programming, does not exist and doesn't seem to
            | have ever existed.
           | 
           | I get that they aren't magic, but I guess I am just bad at
           | trying to use LLMs to help in my programming. Apparently all
           | I do are obscure things or something. Or I am just not good
           | enough at prompting. But I feel like that's also a reflection
           | of the weakness of an LLM in that it needs such perfect and
           | specific prompting to get good answers.
        
             | wizzwizz4 wrote:
             | > _Or I am just not good enough at prompting._
             | 
             | Or you're good enough at using your tools that you can do
             | all the low-hanging fruit. LLMs excel at working around
             | inadequate tooling, but (at least at the moment) they can't
             | help you if you're trying to do something actually tricky
             | and get stuck enough that no rubber duck can save you.
        
         | danenania wrote:
         | I'm working on an open source, terminal-based AI coding tool
         | that is designed specifically for more complex, multi-iteration
         | tasks and features. I think it could likely do a good job on
         | this task.
         | 
         | I'm using it personally every day and while it still needs more
         | work and polish, I'm finding it much better than ChatGPT or any
         | other tools I've tried for bigger and more difficult tasks.
         | 
         | Please let me know if you (or anyone else reading this) would
         | be interested to try a late alpha/early beta version:
         | dane@envkey.com
        
         | richardw wrote:
          | I'd guess the intersection of both technologies has
          | little training content, so it starts dreaming. If you
          | break up the question into "AppleTV API" (or whatever
          | the primary terms are), then use that context for the
          | C# part, it might work better. Isolate the Apple bit so
          | it draws on more specific parts of the training.
        
       | yellow_lead wrote:
       | The speed and quality seem good to me. Will try it on some real
       | scenarios this week.
        
       | eoinboylan wrote:
        | Ran a quick test with a Rust async code snippet that
        | contains an error. Compared with GPT-4, it gives a far
        | clearer solution, with linked sources to learn more!
        | Super impressive!
        
         | rushingcreek wrote:
         | Amazing, that's great to hear.
        
           | passion__desire wrote:
           | Is it possible to output all steps of solutions in a single
           | copyable block? I don't want to copy 4 separate blocks.
        
             | rushingcreek wrote:
             | You can tell it that in a followup. Or, configure an answer
             | profile and tell it to use that style:
             | https://phind.com/profile.
        
       | unshavedyak wrote:
        | Re: "We're excited to announce" -- when did this get
        | deployed? I was on Phind Pro ... a month ago or
        | something, and I'm curious whether I already experienced
        | this or not.
       | 
        | Phind was really good, but still had a difficult time
        | with library versions. Notably, a lot of the search
        | results it saw felt like they polluted it with incorrect
        | assumptions about available methods on specific library
        | versions. The web results felt like they made the LLM
        | worse at some things. In the end I switched back to
        | ChatGPT, though I expect I'll retry Phind at some point;
        | I do tend to ping-pong on each respective release.
       | 
       | Does this version tackle that any better in your eyes?
        
         | rushingcreek wrote:
         | Thanks for the feedback and I'm sorry to see you go. The new
         | version should be better at library versions. If you're in our
         | Discord, I'd be happy to help you one-on-one -- please send me
         | a DM.
        
           | unshavedyak wrote:
            | I'm sure I'll be back soon; the overall experience
            | was good. There are so many competing products that
            | it's difficult to pay for them all at once.
        
       | yellow_lead wrote:
       | Some small feedback/bug: (Mobile, Firefox, using pair programmer
       | mode)
       | 
       | The text box gets hidden after the conversation exceeds the page
       | height
        
         | rushingcreek wrote:
         | Thanks, we'll take a look.
        
       | WhitneyLand wrote:
       | The headline seems a little disingenuous: "beats GPT-4 at coding"
       | 
       | The results are impressive and things have been really
       | progressing quickly, so kudos.
       | 
       | But even by your own description in this post, something like
       | "rivals GPT-4 at coding" seems a more accurate appraisal.
        
       | iillexial wrote:
        | It didn't do well when I asked it a design question: the
        | code and API it used are not correct. GPT-4 did a better
        | job.
       | 
       | https://www.phind.com/search?cache=ay8rx37gq8oy3z7uixftlqkt
       | 
       | https://chat.openai.com/share/a3a91dcc-a91a-4b04-8afd-40bd1a...
        
         | xmprt wrote:
         | The GPT-4 answer is only better in so far as it uses
         | RunTransaction. I don't know why it's trying to loop through
         | the stores and then running the i'th operation on that store
         | when it could have just had the store referenced in the
         | operation instead of passing it as a parameter. And then it's
         | also creating a new client for each transaction which seems
         | wrong (to be fair I'm not familiar with Firestore so maybe this
         | is idiomatic).
        
           | iillexial wrote:
            | It's not idiomatic. I agree that the ChatGPT
            | implementation is not very good, but at least it
            | probably works (not tested) and uses correct APIs. I
            | tried several iterations after that, and it came up
            | with a better design.
        
         | rushingcreek wrote:
         | Thanks for sharing the links, we'll investigate this example.
        
           | passion__desire wrote:
            | I straight away asked it a Stack Overflow question in
            | which input and expected-output samples were given.
            | Phind didn't do well. ChatGPT, though: [kissing
            | hearts emoji]
        
         | naet wrote:
          | Not looking deeply at the technical side of the
          | answers, but the tone of GPT-4's answer is very
          | casual/conversational (it starts with "Alright, listen
          | up." and keeps that tone throughout).
         | 
         | I think you might get a better answer if you rewrote your
         | prompt using full sentences and more formal language.
        
       | fuddle wrote:
       | Your About page is really lacking in detail.
       | https://www.phind.com/about I wouldn't feel comfortable using
       | your service without a lot more detail about the founders and
       | company etc.
        
       | accrual wrote:
       | > it supports up to 16k tokens
       | 
       | > Llama 1 supports up to 2048 (2K) tokens, Llama 2 up to 4096
       | (4K), CodeLlama up to 16384 (16K). [0]
       | 
       | This is wild to me.
       | 
       | The token window is one of the limiting factors for having an AI
       | that can actually remember you and past conversations. Having a
       | large window is key for future AI applications that involve long
       | running conversations (weeks, months, years). The tech is already
       | very impressive, but imagine it as it becomes more like an
       | _actual_ pair programmer and remembers all the various things it
       | 's learned and worked on with you in the past.
       | 
       | [0]
       | https://huggingface.co/docs/transformers/main/model_doc/llam...
        
         | Der_Einzige wrote:
          | Still waiting for the day that medium-term memory
          | (token average pooling, as in sentence transformers)
          | gets used for this. It's staring all of these companies
          | in the face, and apparently no one thinks to implement
          | it.
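For readers unfamiliar with the pooling idea referenced above: sentence-transformers collapses a sequence of per-token vectors into one fixed-size vector by averaging, and a "medium-term memory" along these lines might store one such vector per past conversation chunk and retrieve against them. A minimal sketch, using plain lists instead of tensors:

```python
def mean_pool(token_embeddings):
    """Average a list of equal-length token vectors into a single
    fixed-size vector, as in sentence-transformers' mean pooling.
    However long the original text, the result has constant size,
    which is what makes it attractive as a compact memory."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in token_embeddings) / n
            for d in range(dim)]
```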
        
           | brrrrrm wrote:
           | Out of curiosity, why do you think the answer would be so
           | simple and also completely untested?
        
             | fullstackchris wrote:
             | Another curiosity, what do we estimate (if it's even
             | possible) the context window of a human? Obviously an
             | extremely broad question, and of course it must have some
             | sort of decay factor... but... would be interesting to get
             | a rule of thumb number in terms of token count. I can
             | imagine its massive!
        
               | Filligree wrote:
               | I don't think it's massive. In fact, since it's roughly
               | equivalent to working memory, I suspect it's on the order
               | of 100 tokens at most.
               | 
               | It's just that, unlike these AIs, we're capable of online
               | learning.
        
               | travisjungroth wrote:
               | Human memory, in my limited understanding, doesn't have
               | the bifurcation of weights and context that LLMs do. It's
               | all a bit blurrier than that.
               | 
               | Something interesting that I heard from people trying to
               | memorize things better is that memory "storage space"
               | limits for people are essentially irrelevant. We're
               | limited by our learning and forgetting speeds. There's no
               | evidence of brains getting "full".
               | 
               | Think of it like a giant warehouse of plants, with one
               | employee. He can accept shipments (learning). He can take
               | care of plants (remembering). Too long without care and
               | they die (forgetting). The warehouse is big enough that
               | it is not a limiting factor in how many plants he can
               | keep alive. If it was 10x bigger it wouldn't make a bit
               | of difference.
        
           | heavyarms wrote:
           | I've been thinking along the same lines. The token window IMO
          | should be a conceptual inverted pyramid, where the most
           | recent tokens are retained verbatim but previous iterations
           | are compressed/pooled more and more as the context grows. I'm
           | sure there's some effort/research in this direction. It seems
           | pretty obvious.
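A toy sketch of the inverted-pyramid idea, assuming mean pooling as the compression step (all names here are invented): the newest vectors are kept verbatim, and each older bucket spans twice as many tokens but is pooled down to a single vector, so compression grows with age.

```python
import numpy as np

def pyramid_compress(embeddings, newest=4):
    """Keep the `newest` vectors verbatim; pool progressively larger
    buckets of older vectors down to one summary vector each."""
    parts = [embeddings[-newest:]]          # most recent, uncompressed
    i = len(embeddings) - newest
    size = newest
    while i > 0:
        size *= 2                           # older buckets span more tokens
        chunk = embeddings[max(0, i - size):i]
        parts.append(chunk.mean(axis=0, keepdims=True))
        i -= size
    return np.concatenate(parts[::-1])      # oldest summary first
```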
        
             | matsemann wrote:
             | But some of the earlier tokens are also the most important
             | ones, right? Like the instructions and rules you want it to
             | follow.
        
               | a_wild_dandan wrote:
               | They are. Moreover, the idea that AI companies are
               | missing and/or not implementing this "obvious" tactic is
               | hilarious. Folks, these approaches have profound
               | consequences for training and inference performance.
               | Y'all aren't pointing out some low hanging fruit here,
               | lol
        
         | mycall wrote:
          | Token window size is being virtualized with the likes of MemGPT,
         | so its effect will diminish.
        
         | seydor wrote:
         | 640k is enough for anyone
        
       | xrd wrote:
       | I know it isn't popular, but I wish there was a way to use this
       | inside Emacs. Or, vim. I just don't want to use VS Code anymore.
        
         | bigdict wrote:
         | Pretty sure GitHub Copilot has emacs/vim integration.
        
           | freedomben wrote:
           | It does, although not the most recent features. I use the
           | compatible features in Vim and I really like it. Not enough
           | to switch editors though.
        
         | Jeff_Brown wrote:
         | If only the depth of our feelings for Emacs counted for more in
         | the market.
         | 
         | There's an argument that music and the arts are dumbed down by
         | the fact that, for instance, making an album worth $10 to
         | millions of people pays way better than making an album worth a
         | million dollars to tens of people, since the album is going to
         | get priced at $10 one way or the other. It only just now
         | occurred to me that the same phenomenon applies to tools.
        
         | freedomben wrote:
          | The standardization on VS Code is one of the saddest
          | developments of the last several years IMHO. I think it's
          | great that VS Code exists, but we're headed for a world where
          | you _have_ to use VS Code if you want the best tooling,
          | because the best tooling won't
          | support other options. The same thing happened with Java dev
         | and IntelliJ, and IMHO it has been extremely unhealthy for the
         | ecosystem. I'm immensely glad that Copilot supports vim, but
         | I'm fearful that it soon won't.
        
           | FreezerburnV wrote:
            | The same could have been/could be said about JetBrains products.
           | People are likely always going to use vim/emacs and create
           | tooling around whatever new hotness exists for them. And
           | honestly? VS Code is just a new iteration on how vim/emacs
           | work in a lot of ways: Providing a place to edit text and
           | then a bunch of plugins that do things with that text.
           | 
           | And if you want vim/emacs to keep living, then you should
           | spend time helping! Create your own extensions,
           | maintain/contribute to existing ones, etc. They will only die
           | out when the last person actively contributing to them stops,
           | so keep the chain of people going :)
        
           | papichulo2023 wrote:
            | Didn't VS Code standardise language servers, making it much
            | easier for all the other almost-IDE text editors to
            | integrate? Is it really that sad?
        
             | freedomben wrote:
             | Very fair point. Vim has benefited tremendously from that
             | effort.
        
         | haarts wrote:
         | You and me both brother. LSP integration seems the way forward.
        
         | notpublic wrote:
         | https://github.com/github/copilot.vim
         | 
         | https://github.com/huggingface/llm.nvim
        
         | mg wrote:
         | In Vim, I tried to assign a shortcut to send the selected text
         | to Phind (or any other LLM) and came up with this:
         | 
         | :'<,'>y|call system('firefox <url>?q='.shellescape(@*).' &')
         | 
         | The only problem left is that the text is not urlencoded.
         | 
         | There probably is some elegant way to urlencode it. But I did
         | not come up with one yet.
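Python's standard library covers the missing encoding step. A sketch of just the encoding (the URL is a placeholder):

```python
from urllib.parse import quote

def phind_url(selected_text):
    # percent-encode the selection so spaces, quotes, '&', etc.
    # survive being pasted into a query string
    return "https://www.phind.com/search?q=" + quote(selected_text, safe="")

print(phind_url("how do I urlencode in vim?"))
```

In a Vim built with +python3, `urllib.parse.quote(vim.eval('@*'))` could replace the `shellescape(@*)` call, though that wiring is untested.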
        
           | wizzwizz4 wrote:
           | https://stackoverflow.com/a/76488059 claims to have one,
           | though it's not explained.
        
         | fictorial wrote:
         | https://github.com/CoderCookE/vim-chatgpt
        
         | accoil wrote:
         | Maybe ellama[1] would work? It doesn't support Phind yet, but a
         | provider could be created for the underlying connection package
         | llm[2].
         | 
         | [1]: https://github.com/s-kostyaev/ellama
         | 
         | [2]: https://github.com/ahyatt/llm
        
       | _bramses wrote:
       | Such an AWESOME time to be a programmer!!
        
       | mistercow wrote:
       | So I gave it this prompt:
       | 
       | > I need a typescript function which takes in an object with an
       | id string property and a string message property, and also takes
       | an array of search strings, and returns a mapping of search
       | strings to matching message ids
       | 
       | The response I got was close, but it assumed that each search
       | string would match only one message, so it returned
       | Record<string, string>. I fed this to GPT-3.5 and it answered 10x
       | faster with the correct return type.
       | 
       | This is a slightly tricky example, because it requires the model
       | to infer that multiple message matches are possible. But I think
       | that it's interesting that ChatGPT nailed it despite not using
       | any chain of thought.
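For comparison, the return shape the prompt asks for, each search string mapped to a list of all matching ids (i.e. `Record<string, string[]>` rather than `Record<string, string>`), sketched here in Python for brevity:

```python
def match_messages(messages, searches):
    """messages: list of {"id": ..., "message": ...} dicts.
    Returns each search string mapped to the ids of ALL messages
    whose text contains it (possibly more than one)."""
    return {
        s: [m["id"] for m in messages if s in m["message"]]
        for s in searches
    }

msgs = [{"id": "1", "message": "hello world"},
        {"id": "2", "message": "world peace"}]
```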
        
         | starbugs wrote:
         | > I need a typescript function which takes in an object with an
         | id string property and a string message property, and also
         | takes an array of search strings, and returns a mapping of
         | search strings to matching message ids
         | 
         | Your prompt is wrong. You want a function that takes an array
         | of id/message objects, not an object.
         | 
         | It's quite impressive that GPT is just able to correct for
         | that. As a human, I would first ask what you actually mean,
         | because your prompt appears to be unclear.
        
       | dontreact wrote:
        | I tried this question and GPT4 did way, way better at getting
       | closer to a final answer. Phind was horribly wrong. I can't help
       | but think something seems off with your eval given just how badly
       | Phind did on this.
       | 
       | I want to make an interactive plot in Colab where I can show
       | 
       | X axis is interest rate of a 15 year mortgage. Y axis is the
       | relative advantage of buying a house vs. renting in terms of
       | total net worth at 15 years.
       | 
       | Assume a monthly budget for renting + investing or buying a house
       | of 10k
       | 
       | Plot different lines for a few different market returns.
       | 
       | Make a slider that controls the total size of the loan.
        
         | rushingcreek wrote:
         | Seemed to give plausible results for me:
         | https://www.phind.com/search?cache=lswmiuewv2l33jt337dgrsho
        
           | dontreact wrote:
            |     def calculate_relative_advantage(interest_rate, loan_size, market_return):
            |         # Your calculation logic here
            |         pass
           | 
            | ChatGPT actually implements it.
        
       | epups wrote:
       | In my experience, Phind is not as good as GPT4, but it's by far
       | the second best LLM for programming. I find that tremendously
       | impressive considering they are competing against the whole world
       | for that title right now.
       | 
       | I agree with the assessment about consistency being its major
       | flaw. While with GPT4 I can continue a conversation for quite
        | long, Phind easily loses the required context. Perhaps it has to
       | do with summarization capabilities, or messing with the context
       | window has these types of side effects.
        
         | rushingcreek wrote:
         | Have you tried clicking the model selection dropdown and
         | enabling "Ignore web results"? That can help with keeping
         | context for complicated design tasks.
        
       | buildbot wrote:
        | Well, neither GPT4 nor this Phind model was able to answer my
       | torture test: "Write amaranth code that can be used to control
       | the readout of a frame from a kodak CCD with 4096 columns and
       | 2048 rows."
       | 
        | Which, yes, is missing a lot of detail (you could feed in a
        | datasheet, and I have).
       | 
       | But Phind goes off on using pyserial (?!), and GPT4 assumes
        | amaranth is a hypothetical CCD control library and makes a
        | useless CCD control class using that hypothetical library.
       | 
       | Edit - Phind at least acknowledged that amaranth exists, unlike
       | GPT4 with this prompt: "Write amaranth code that can be used to
       | control the readout of a frame from a kodak CCD using an lattice
       | FPGA with 4096 columns and 2048 rows. Assume the design will be
       | hooked up to a larger litex SoC "
        
         | mensetmanusman wrote:
         | That's torture for humans as well. The key to LLMs is
         | communicating clearly to the information cloud.
        
           | buildbot wrote:
           | Sure, but a good example of how far certain domains have to
            | go still. These datasheets should be in the model's training
           | data, at least one CCD datasheet, and verilog & (migen |
           | nmigen | amaranth) certainly are.
           | 
           | Controlling a CCD is actually pretty easy, I built (very
           | simple, but working) controllers for several CCD chips in
           | undergrad doing research for the ATLAS detector. You just
            | clock rows out basically, N columns times. Reset first. I'd
            | expect a senior undergrad EE student to be able to design a
           | simple core in a few class projects.
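The readout sequence described above (reset first, then clock each row out N columns at a time) can be illustrated in plain Python. This is only a model of the control flow, not Amaranth HDL, and the `pixel` callback is invented:

```python
def readout_frame(rows=2048, cols=4096, pixel=lambda r, c: 0):
    """Model the CCD control sequence: after a reset, each row is
    shifted into the serial register and its columns are clocked out
    one pixel at a time."""
    frame = []
    # a real controller would pulse the reset line here
    for r in range(rows):
        # one vertical-clock pulse moves row r into the serial register;
        # then `cols` horizontal-clock pulses read it out
        frame.append([pixel(r, c) for c in range(cols)])
    return frame
```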
        
       | anon115 wrote:
        | This is why I don't even bother reading the 'evals' or claims
        | that it's better. I test it out myself, and the claims almost
        | never seem to hold up.
        
       | cccybernetic wrote:
       | I'm glad there's growing competition, but GPT4 still outperforms.
       | Here's a recent question I asked:
       | 
       |  _Could you please provide me with a query for my PostgreSQL
       | table? It consists of three columns: 'id,' 'teaser,' and
       | 'full_text,' all of which are of type 'text.' Most 'full_text'
       | entries are articles around 1000 words in length. I'd like to
       | update the 'teaser' column for each row with the first
       | approximately 200 words from the 'full_text' column. How can I
       | achieve this?_
       | 
        | Phind:
        | 
        |     CREATE OR REPLACE FUNCTION get_first_n_words(text, integer)
        |     RETURNS text AS $$
        |     DECLARE
        |         word_count integer := 0;
        |         word_start integer := 1;
        |         word_end integer;
        |     BEGIN
        |         WHILE word_count < $2 AND word_start <= length($1) LOOP
        |             word_end := position(' ' in $1 from word_start) + 1;
        |             IF word_end = 0 THEN
        |                 word_end := length($1) + 1;
        |             END IF;
        |             word_count := word_count + 1;
        |             word_start := word_end;
        |         END LOOP;
        |         RETURN substring($1 from 1 for word_start - 1);
        |     END;
        |     $$ LANGUAGE plpgsql;
        | 
        | GPT4:
        | 
        |     UPDATE your_table_name
        |     SET teaser = (
        |         SELECT STRING_AGG(word, ' ')
        |         FROM (
        |             SELECT SPLIT_PART(full_text, ' ', i) AS word
        |             FROM generate_series(1, 200) AS i
        |         ) AS words
        |         WHERE word <> ''
        |     )
        |     WHERE full_text IS NOT NULL;
        
         | riku_iki wrote:
         | One example is not enough for performance conclusions
        
           | Wytwwww wrote:
           | There is a performance conclusion in the title though.
        
             | riku_iki wrote:
             | That conclusion is based on benchmark with many examples in
             | different tasks.
        
               | Wytwwww wrote:
               | From what I understand it's a single test suite? Of
               | course I don't really mind the clickbait title that much,
               | it's hard to attract attention otherwise.
        
               | spmurrayzzz wrote:
               | AFAIK they haven't released the dataset they fine-tuned
                | on, so we can't be 100% sure there wasn't benchmark
               | contamination. Agree that we definitely need more than
               | N=1 to challenge the performance claims, but I still
                | think it's valid to call it out given how much
                | benchmark-gaming we've seen in this space.
        
               | riku_iki wrote:
                | I think you can bring a contamination claim against
                | every public benchmark result nowadays: models are
                | trained on TBs of data crawled from the internet, and
                | there is no guarantee the benchmark has not leaked in
                | some way.
        
           | cccybernetic wrote:
           | Obviously not. Perfectly reasonable to share anecdotes
           | though.
           | 
           | Also, I ran a few different tests, and every GPT-4 response
           | was superior, but I didn't want to clutter my comment with
           | queries and code.
        
           | amelius wrote:
           | Depends on the claims made.
        
         | rushingcreek wrote:
          | Running with "Ignore Web Context" enabled can improve performance
         | for design tasks like this. I just got a more plausible answer:
         | https://www.phind.com/search?cache=f0fkv5mxscwvagxgkuwnwgtl.
         | Consistency is something we're working on.
        
           | cccybernetic wrote:
           | Thanks for sharing, you're right - that does improve
           | performance!
        
           | ta8645 wrote:
           | How do you enable "Ignore Web Context"? I don't see that
           | option anywhere on the page you linked, am I just being
           | blind?
        
             | rushingcreek wrote:
             | It's in the model dropdown under the search bar.
        
         | nofunsir wrote:
         | I really dislike article teasers and "read more" buttons. Now I
         | know it's intentional clipping of the corresponding articles.
        
       | mtkd wrote:
       | Been using Phind for a bit now and started paying for pro
       | 
        | They're smashing it and can't do enough to help if you report
        | an issue. They've also started a weekly voice call with senior
        | devs to discuss algos and such, like a surgery; only about 10
        | people join at the moment.
       | 
       | Don't think I've ever recommended anything as much as I have
        | these guys in the last couple of months.
        
       | GalaxyNova wrote:
        | Are you planning on open-sourcing the model eventually?
        
       | joaodias wrote:
        | Asked it to write a program that I've written before, to
        | compare with GPT-4. It didn't really get what I was asking
        | for; GPT-4 understood it perfectly and is ready for continued
        | prompting toward completion.
       | 
       | https://www.phind.com/agent?cache=cloeowfla000dl1084ermly3c vs
       | https://chat.openai.com/share/4147da33-3669-4657-88fa-3a9dfc...
       | 
       | Might not be representative of the whole thing, but it went on
       | about random things I didn't ask about, and just basic
        | information I already knew.
        
         | rushingcreek wrote:
         | The Pair Programmer mode currently either uses GPT-4 or GPT-3.5
         | (if you've run out). Please try again in the default search
         | mode to use the Phind Model.
         | 
         | Using the Phind Model in the default search seems to work well:
         | https://www.phind.com/search?cache=ln6dpdtv5auwn4cq1ofg3gs9
        
           | Capricorn2481 wrote:
            | Even though the Phind Model is selected? Is there a technical
           | reason Phind doesn't do pair programming yet?
        
             | rushingcreek wrote:
             | It's because we haven't updated the Phind Model to support
             | function calling yet but we're working on it.
        
               | Capricorn2481 wrote:
               | Can you share what your long term monetization model is?
               | I'm noticing Phind is free to use right now.
        
               | rushingcreek wrote:
               | We have a Pro plan where you can get (virtually)
               | unlimited GPT-4 and soon, an even faster Phind model.
               | https://phind.com/plans
        
         | TheGeminon wrote:
         | The problem is that it's doing a search of your relatively
         | niche problem, and probably getting pretty poor results. The
         | text from the search is then more highly weighted than the base
         | model, but with relative junk, so it performs better without
         | the additional (unhelpful) context.
         | 
         | You see this with Bing search on ChatGPT as well, and I've seen
         | it in my own projects.
        
       | interstice wrote:
       | If it's trained on data (particularly docs) after 2021 that's an
        | automatic win over ChatGPT in some situations!
        
         | lmeyerov wrote:
          | GPT-4's training cutoff is 2023 now
        
       | nikita wrote:
       | https://www.phind.com/search?cache=hnqqc3fo3o3n61blb6bfh69b
       | 
       | It's not generating the wrong answer. It's quoting the wrong
        | answer.
        
       | taylorfinley wrote:
       | The speed is really impressive! I tried it with a moderately
       | challenging task and it failed pretty spectacularly,
       | hallucinating class methods and missing a bunch. It seemed like
       | the UI struggled with my code too, breaking in and out of
       | markdown somewhat randomly. I was impressed enough I may try
       | again with some simpler stuff, but I'm not quite ready to switch
       | away from GPT4.
        
         | rushingcreek wrote:
         | Would you mind sharing the link? I'd also suggest trying to
         | enable "Ignore search results" from the model dropdown for
         | inputs with lots of specific details.
        
       | JanSt wrote:
       | What's the cutoff date?
        
         | rushingcreek wrote:
         | October 2023
        
       | zoogeny wrote:
       | I just spent a few minutes doing a comparison between Phind and
       | GPT-4 for a very high-level question on a distributed job queue.
       | I gave them both the same fairly vague sketch of a kind of system
       | I would like to build. Here are my impressions:
       | 
       | In the positives of Phind:
       | 
       | * Phind was able, even eager, to recommend specific libraries
       | relevant to the implementation. The recommendations matched my
       | own research. GPT-4 takes some coaxing to get it to recommend
       | libraries. Phind also provided sample code using the libraries it
       | recommended.
       | 
       | * Phind provides copious relevant sources including github,
       | stackoverflow and others. This is a major advantage, especially
       | if you use these AI assistants as a jumping off ground for
       | further research.
       | 
       | * Phind provides recommendations for follow on questions that
       | were very good. One suggestion to the Phind team: don't remove
       | the alternate follow on questions once I select one. A couple of
       | times it recommended a few really good follow up questions but as
       | soon as I selected one the others disappear.
       | 
       | In the positives of GPT-4:
       | 
       | * GPT-4 gave better answers. This is my subjective opinion
       | (obviously) but if I was interviewing two candidates for a job
       | position and using my question as the basis for a systems-design
       | interview then GPT-4 was just overall better. In many cases it
       | added context beyond my question, recommending things like
       | logging and metrics for example. It seemed to intuit the
       | "question behind the question" in a much better way than the
       | literal interpretation of Phind. This is probably highly case-
       | dependent, sometimes I just want an answer to my explicit
       | question. But GPT-4 seemed to understand the broader context of
       | the question and replied with that in mind leading to an overall
       | more relevant response.
       | 
       | * GPT-4 handled follow-up questions better. This is similar to
       | the previous point - but GPT-4 gave me the impression of
       | narrowing down the scope of the discussion based on the context
       | of my follow-up question. It seemed to "understand" the direction
       | of the conversation in a way that felt like it was following
       | context.
       | 
       | NOTE: this was not a test on coding capability (e.g. implementing
       | algorithms) but on using these AI coding assistants as sounding
       | boards for high-level design and architecture decisions.
        
         | X6S1x6Okd1st wrote:
         | > * Phind provides copious relevant sources including github,
         | stackoverflow and others. This is a major advantage, especially
         | if you use these AI assistants as a jumping off ground for
         | further research.
         | 
         | Did you find them to be correct?
        
           | pbhjpbhj wrote:
           | I don't use Phind for coding, except occasionally, but I like
           | it best for generalised tech search because each para has a
           | reference and there's a list of references down the side --
           | often the references would really be sufficient for me on
           | their own.
           | 
           | I've had one glaring error, I can't quite remember the
           | details, but it switched the names/characteristics of two
           | different processes (ie was exactly opposite in what it
           | said); it was something to do with instruction caching and
            | TLB, IIRC. I assumed it was a problem with the input
           | corpus not allowing antonyms to be disambiguated.
           | 
           | Anyway, for me it's the best of the LLM tools I have access
            | to and has mostly replaced search engines (Google, Dukgo) for
           | my tech-related work.
           | 
           | I've only used chat.openai.com (free), bing chat,
           | HuggingChat.
        
           | zoogeny wrote:
           | I don't think "correct" is the right word since these were
           | open ended systems design type questions. There are many ways
           | to accomplish the same task.
           | 
           | I also spent about 20 minutes on this which is why I
           | mentioned this is a first impression. I'll leave it to
           | researchers to develop a "relevancy" metric and objectively
           | apply it.
           | 
           | In my experience, the sources were sufficiently relevant
           | based on its responses. They were about as relevant as
           | equivalent Google queries. Some tiny, tiny niggles, like I
           | was explicit I wanted it to recommend approaches in Go and
           | for one reference I recall related to distributed locking
           | mechanisms it provided a reference to an implementation in
           | Java. However, that is completely fine for me since the
           | context was more about the locking on the database side and
           | not really the implementation in a specific language.
        
         | fthd wrote:
         | mind providing some of the prompts you use to question them?
        
           | zoogeny wrote:
           | Here are the conversation logs:
           | 
           | https://chat.openai.com/share/867ff0c4-d4cf-4af9-a785-31a599.
           | ..
           | 
           | https://www.phind.com/search?cache=ej8pn1dfjjwfr1tgc6ybwhlg
           | 
           | NOTE: there are a few more question/answer blocks in the
           | phind conversation since I was testing out the follow up
           | question feature.
        
         | webappguy wrote:
          | Do you have custom instructions? Everyone needs to mention
          | and post their prompts, else it's entirely anecdotal.
        
           | rushingcreek wrote:
           | We support custom instructions at https://phind.com/profile.
        
             | bredren wrote:
             | I'm trying to get it to answer only in executable Python. I
             | used the template with instructions I use for my system
             | prompt on gpt4. And I tried using the additional context
             | field for the same.
             | 
             | It gets to writing the expected code but it still wants to
             | include formatted headings instead of commenting those out
             | so the entire response is executable Python.
             | 
             | As a follow up I provided an example heading with the hash
             | out front. It didn't work.
             | 
              | Any ideas on how to get it to do this? FWIW, GPT-4 often
              | ignores this request too, but only about half the time. When it
             | does it is typically a single block of explanatory text.
             | 
             | For that, I include prose detection and commenting as part
             | of my post processing.
             | 
             | Also, I don't see it easily, but do you have an API for
             | this or is it intended to be run by the user?
        
               | rushingcreek wrote:
               | Getting it to not output additional text is not something
               | that it can do super well at the moment, unfortunately.
               | We'll work on that.
        
               | soulofmischief wrote:
                | My trick for this has been one-shot prompting + regex. I
               | tell the model to produce executable code within triple
               | backticks suffixed by a keyword, like:
               | 
                | ```keyword
                | // code
                | ```
               | 
               | and then I just ignore anything outside of those blocks.
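That extraction step might look like the following (the regex and keyword are illustrative):

```python
import re

def extract_blocks(response, keyword="keyword"):
    """Return only the code inside ```keyword ... ``` fences,
    dropping any explanatory prose the model adds around them."""
    pattern = rf"```{re.escape(keyword)}\s*\n(.*?)```"
    return [m.strip() for m in re.findall(pattern, response, flags=re.DOTALL)]
```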
        
           | zoogeny wrote:
           | I did not have custom instructions for either assistant. You
           | can see the full conversation logs which I posted as a reply
           | to another comment.
        
       | shazar wrote:
       | I gave it two tries, GPT-4 was much better in both cases. Tried
       | with two Leetcode questions. It came back with an empty response
        | for one, and provided worse code (an O(n^2) solution when it
        | can be done in linear time) for the other one.
       | 
       | GPT-4 on the other hand provided a good answer for both
       | questions. Also I guess the UI is buggy w.r.t code formatting, it
        | thinks the following line is code and switches to a code block.
       | 
        | ```
        | You are given an array prices where prices[i] is the price of
        | a given stock on the ith day.
        | ```
       | 
       | The only downside for GPT-4 for me right now, is its slowness.
        
         | rushingcreek wrote:
         | I suggest you try enabling "Ignore search results" from the
         | model dropdown for these types of questions. The web results
         | can be distracting for the model for Leetcode-type questions.
        
           | doctoboggan wrote:
           | I see you've had to suggest this a few times in this thread,
           | and in my experience I would agree with the suggestion. I
           | wonder if you can have a simple gpt model decide
           | automatically when ignoring search results would improve the
           | result and do it automatically.
        
             | rushingcreek wrote:
             | Interesting idea.
        
           | shazar wrote:
           | I tried with that option enabled and now it can't generate
           | code at all. Here's my prompt:
           | 
            | ```
            | You are given an array prices where prices[i] is the
            | price of a given stock on the ith day.
           | 
           | Find the maximum profit you can achieve. You may complete at
           | most two transactions.
           | 
           | Note: You may not engage in multiple transactions
           | simultaneously (i.e., you must sell the stock before you buy
           | again).
           | 
            | Write Python code to solve this:
            | 
            |     def maxProfit(self, prices: List[int]) -> int:
            | ```
           | 
           | Output:
           | 
            | ```
            | It seems like you want to find the maximum profit that
           | can be achieved by buying and selling stocks, with the
           | constraint that you can only make at most two transactions.
           | Is that correct?
           | 
           | Could you please provide some example input and output to
            | help me better understand your requirements?
            | ```
           | 
           | I also tried a more basic prompt, but the output is not what
           | I'd consider good code.
           | 
           | Can you maybe share some examples where we can see how it
           | exceeds GPT-4's capabilities? Thanks!
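For reference, the quoted problem (LeetCode's "Best Time to Buy and Sell Stock III") does have a linear-time solution. One standard dynamic-programming formulation tracks the best cash balance after each of the four buy/sell events:

```python
def max_profit(prices):
    """O(n) time, O(1) space: track the best cash balance after the
    first buy, first sell, second buy, and second sell."""
    buy1 = buy2 = float("-inf")
    sell1 = sell2 = 0
    for p in prices:
        buy1 = max(buy1, -p)           # best balance holding 1st stock
        sell1 = max(sell1, buy1 + p)   # best after selling it
        buy2 = max(buy2, sell1 - p)    # best holding the 2nd stock
        sell2 = max(sell2, buy2 + p)   # best after at most two sales
    return sell2
```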
        
             | rushingcreek wrote:
             | Seemed to work well just now:
             | https://www.phind.com/search?cache=w1jyatqyia1a8r3pfxlxqby6
        
           | mediaman wrote:
           | In my own RAG implementations in the industrial sector, I've
           | found it effective to first have the AI decide whether it
           | needs to search at all. If it doesn't, the answers are much
           | better.
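A minimal sketch of that gating pattern, with `llm` and `search` as hypothetical stand-ins for a real model call and retriever:

```python
def answer(question, llm, search):
    """Ask the model whether retrieval would help; only fetch and
    inject search results when it says yes."""
    verdict = llm("Would search results help answer this question? "
                  "Reply YES or NO.\n\nQuestion: " + question)
    if verdict.strip().upper().startswith("YES"):
        context = search(question)
        return llm("Context:\n" + context + "\n\nQuestion: " + question)
    return llm(question)
```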
        
         | popularonion wrote:
         | GPT-4 has ingested all of Leetcode, you can literally just type
         | "leetcode 100 python" and it will regurgitate a response for
         | you.
         | 
         | Only exception I found is with some of the Leetcode Premium
         | questions, you might have to actually type in the problem
         | statement, but it's still very likely that multiple solutions
         | have been ingested from GitHub and elsewhere.
        
       | halyconWays wrote:
       | Are the weights for the 70B version of the model available?
        
       | EvgeniyZh wrote:
       | What about more realistic benchmarks, like SWE-bench [1]?
       | 
       | [1] https://www.swebench.com/
        
       | claytonjy wrote:
        | How have you liked using TensorRT-LLM? Did you come from
        | FasterTransformer, vLLM, LMDeploy, TGI, or something else?
       | 
       | We started migrating to it the day it came out, very glad to have
       | it, but lots of little annoyances along the way. Biggest one has
       | been loading our model repository; having to hardcode the
       | location of the engine file means we can't use the built-in ways
       | Triton has for downloading from GCS!
        
       | johnfn wrote:
       | > Show HN: Phind Model beats GPT-4 at coding
       | 
       | Does it? I don't see any evidence of this strong claim in your
       | post, and I think it's quite deceptive how the only link is to a
       | benchmark of open source models (which doesn't include GPT-4).
       | I've tried Phind a few times in the past when it made equally
       | strong claims and been somewhat unimpressed. (To be fair,
       | comparing anything to GPT-4 is tough!) I think it would
       | strengthen your position significantly to simply say that you're
       | the best of all open-source models.
       | 
       | To be honest though I've been completely ruined by
       | https://cursor.sh/; copying and pasting results back and forth
       | from a web UI to my IDE is so painfully slow when you do it tens
       | or hundreds of times that I don't think I would be able to go
       | back. I'd be happy to try out a Phind extension that has similar
       | UI/UX if you ever make one.
        
       | DotaFan wrote:
        | I was subscribed to Phind for 3 months at 30 EUR per month,
        | and the constant outages made me unsubscribe this month. I
        | did compare Phind and GPT-4 in the past, whenever Phind came
        | out with these kinds of articles, and after the first
        | question it was obvious Phind was nowhere near.
        
         | rushingcreek wrote:
         | Sorry to hear that you didn't have a great experience. I'd love
         | to chat further, my email is founders(at)phind.com
        
       | thesz wrote:
        | That bot of yours is the second chat bot that claimed it can
        | program or help with programming. And it is the second one
        | that utterly failed to provide me with an implementation of
        | blocked clause decomposition in Haskell. I needed something;
        | even the slowest version would do.
        | 
        | Your bot also tried to bullshit me about the validity of its
        | answer, just like the other one.
        | 
        | The difference? Your bot mentioned a paper on arXiv about the
        | problem. But the paper (and I read it a long time ago, of
        | course) does not provide even pseudocode implementations of
        | most of the algorithms mentioned there.
        | 
        | Color me not impressed.
        | 
        | As usual, bots like yours are not for when you need something
        | new. If I have an idea, I cannot use any AI, including yours,
        | for prototyping work.
        | 
        | That is to be expected, as neural networks are interpolators,
        | not extrapolators, and for them to "extrapolate" one needs to
        | train them over the "extrapolation" area quite well.
        
       | sergiotapia wrote:
       | "I have a table called `onboardings` with the state field. I want
       | to return how many people we have in each state. The Postgres
       | query should return the state, how many people count, and what
       | percentage do those people represent."
       | 
        | Claude-2: correct response, and it rounds the percentage - a
        | nice assumption to make for me:
        | 
        |     SELECT state,
        |            count(*) AS people_count,
        |            round(100.0 * count(*) / (SELECT count(*) FROM onboardings), 2) AS percent
        |     FROM onboardings
        |     GROUP BY state
        |     ORDER BY people_count DESC;
        | 
        | Phind: correct response as well! Really fast too!
        | 
        |     WITH state_counts AS (
        |         SELECT state, COUNT(*) AS count
        |         FROM onboardings
        |         GROUP BY state
        |     ),
        |     total_counts AS (
        |         SELECT COUNT(*) AS total
        |         FROM onboardings
        |     )
        |     SELECT sc.state, sc.count,
        |            (sc.count::decimal / tc.total::decimal) * 100 AS percentage
        |     FROM state_counts sc, total_counts tc
        |     ORDER BY sc.count DESC;
        
       | s09dfhks wrote:
       | The vscode integration seems cool, but why do I have to have an
       | account to use it?
        
       | nextworddev wrote:
       | Wow they really simplified the front end
        
       | efd6821b wrote:
       | Wait, can't you use this to develop chemical weapons? Where's
       | your 20-person government-mandated safety team?
        
       | blago wrote:
       | It failed for me at a much more basic level.
       | 
        | I asked 5 different, and increasingly explicit, variations of
        | the following question: "Can you generate HTML and CSS for a
        | JPG mockup I'm going to give you?"
       | 
       | Each time it answered along the following lines: "Sure, here is
       | how you can create HTML and CSS from a JPG mockup. Follow this
       | process..."
       | 
       | In my experience this never happens with GPT-4.
        
         | pbhjpbhj wrote:
         | I've not seen that anywhere, ChatGPT does image input now? Do
         | you have examples of the output from feeding it a JPEG?
        
       | ShakataGaNai wrote:
       | "Python script to extract a list of all Elastic IP's from all
       | regions, from multiple AWS accounts."
       | 
       | ChatGPT4 gave me a solid answer hitting all the points I wanted.
        | Phind didn't get the account handling correct, didn't address
       | regions, and didn't handle pagination.
       | 
       | "Write a python based script that uses boto3 to query AWS
       | Route53. It should print a list of every record for a given
       | hosted zone ID."
       | 
       | ChatGPT4 did exactly as requested with pagination, and even
       | smartly decided to use "input" so I could give it a zone ID at
        | run time. Phind didn't handle pagination or do ANY error
        | handling. It was also slower than ChatGPT4 to generate this
        | time, and the answer wasn't in a single block of
        | copy/pasteable code.
       | 
       | ChatGPT's solution worked without modification. Copy-Paste. Run.
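For reference, the pagination both prompts needed is a small, mechanical loop. The sketch below mirrors Route 53's IsTruncated/NextRecordName continuation protocol, with a plain callable standing in for the boto3 client; in real code you would more likely reach for `client.get_paginator("list_resource_record_sets")`, and the helper name here is made up for illustration.

```python
def collect_all(fetch_page, **kwargs):
    """Call fetch_page repeatedly, following Route 53-style
    continuation markers, and return every record across all pages."""
    records = []
    while True:
        page = fetch_page(**kwargs)
        records.extend(page["ResourceRecordSets"])
        if not page.get("IsTruncated"):
            return records
        # Route 53 continues from the record where the last page stopped.
        kwargs["StartRecordName"] = page["NextRecordName"]
```

With a real client you would pass `client.list_resource_record_sets` as `fetch_page` and `HostedZoneId="Z..."` as the keyword argument.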
        
         | rushingcreek wrote:
         | Just worked well for me:
         | https://www.phind.com/search?cache=g9y2uizgjwcn378aovb65v92.
         | 
         | We do have issues with consistency sometimes -- please try
         | regenerating if that is the case.
        
           | facu17y wrote:
            | "We do have issues with consistency sometimes" - that's a
            | strange statement. Having issues with consistency means
            | that _sometimes_ the output is wrong. What does it mean to
            | have issues with consistency _sometimes_? You're either
            | consistent or you're not.
        
             | rushingcreek wrote:
             | There's a difference between models that are incompetent
             | and aren't capable of getting the right answers ever and
             | models that are capable of getting the right answer but may
             | not do so every time. The Phind Model is in the latter
             | camp.
             | 
             | Consistency issues can be caused by a wide range of factors
             | from inference hyperparameters to prompting.
        
               | facu17y wrote:
                | I meant that saying "something is inconsistent
                | sometimes" is weird, because inconsistency already
                | implies "sometimes".
        
           | scarface_74 wrote:
           | Your example didn't include pagination.
        
       | charlieyu1 wrote:
        | I tried "Draw a cone with tkz-euclide". The result is not
        | quite right; it outputs code that draws a circle and two
        | vertical lines, and that's it.
       | 
       | Just curious about how good it works with niche languages
        
       | sylware wrote:
       | Is there a noscript/basic (x)html prompt somewhere?
        
       | dancemethis wrote:
       | Phind has been pretty nice to use for some rubber ducking with
       | C#. Only disadvantage is using Discord to wall communication.
        
       | andai wrote:
       | To the folks in this thread comparing the model with GPT-4, are
       | you comparing it with GPT-4 in ChatGPT, or with GPT-4 on Phind?
       | Because it should be the latter for a fair comparison. The Phind
       | response seems to be heavily based on the top search results,
       | which may affect the quality of the response.
       | 
       | (An even more interesting question would be to compare ChatGPT
       | GPT-4 with Phind GPT-4, i.e. GPT-4 with relevant web results in
       | context.)
        
       | lgkk wrote:
       | First off, congrats on building such a cool product. I love that
       | I can just "jump into it" which is great.
       | 
        | Note that I'm not really a power user of these GPT-style
        | tools - here are my questions:
       | 
       | Is it possible to get right to the code without the ELI5 and
       | general information?
       | 
       | Do you guys offer an API? I was browsing on my small iphone so
       | maybe I missed this info.
       | 
       | Could you give an overview for someone like me how something like
       | phind works technically? You mentioned those H100s, but at a very
       | high level without revealing any "secret sauce" how does this GPT
       | work from my input to getting a response?
       | 
       | Good luck!
        
       | drcode wrote:
       | I am a heavy user of GPT4, and Phind was surprisingly able to
       | match GPT4 on several initial programming tasks I gave it. Given
       | the large context window of Phind, it will likely be able to
       | outperform GPT4 for some tasks.
       | 
       | That is quite an accomplishment, I am impressed
        
         | iandanforth wrote:
         | FWIW The default context window of GPT-4 via ChatGPT is about
         | to change to 32k.
        
           | drcode wrote:
           | that would put them significantly ahead again, for my use
           | cases
        
             | rushingcreek wrote:
             | We will eventually increase the Phind Model to 100K tokens
             | -- the RoPE embeddings in Code Llama were designed for
             | this.
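A toy illustration of the RoPE headroom rushingcreek alludes to: the rotary angle for position p and dimension pair i is p * base^(-2i/d), and Code Llama raised the base from the usual 10,000 to 1,000,000, which slows the rotation enough that positions far beyond the original training length stay distinguishable. The code below only computes the angles for comparison; it is not the model's implementation.

```python
import math

def rope_angles(position, dim, base):
    """Rotation angle for each of the dim//2 dimension pairs at a
    given token position: position * base**(-2i/dim)."""
    return [position * base ** (-2 * i / dim) for i in range(dim // 2)]

# Same position, two bases: the larger base makes the slow-rotating
# dimensions turn far less, leaving headroom for long contexts.
std = rope_angles(100_000, 128, 10_000)      # Llama-style base
cl = rope_angles(100_000, 128, 1_000_000)    # Code Llama-style base
```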
        
       | PUSH_AX wrote:
       | I went over to my GPT-4 history and pasted some problems and
       | refactor requests in verbatim, the GPT-4 outputs were of much
       | higher quality.
        
       | generativeai wrote:
       | Great work boys
        
       | quickthrower2 wrote:
       | Could you open source these great models? OK yes you need a
       | competitive advantage. So maybe open source them when you are say
       | 2 models ahead in production?
       | 
       | In any case I am happy there is some competition and that it has
       | come from a more pragmatic scrappy space than one of the multiple
       | billion dollar funded places.
        
         | sounds wrote:
         | Can we have a larger discussion about the tradeoffs that come
         | with open sourcing a model?
         | 
         | When fb released Llama they obviously gained a huge amount of
         | developer goodwill but it also required them to invest a
         | serious amount of their own developer time to engage with the
         | community.
         | 
         | I'm asking the community what it can offer the company? Or is
         | this just self-abnegation by the company that releases the
         | model?
        
         | poser-boy wrote:
         | I don't know what model runs on Phind's site right now, but in
         | August Phind published a fine tune of CodeLlama 34B
         | 
         | https://huggingface.co/Phind/Phind-CodeLlama-34B-v2
        
       | notadev wrote:
       | I use Phind daily, including the VSCode extension, and I love it.
        | Much better than anything ChatGPT is able to come up with,
        | and the code it generates requires little-to-no modification
        | to work properly.
       | Very big fan!
        
       | scarface_74 wrote:
       | I was just discussing using ChatGPT to make working with
       | deploying serverless code easier.
       | 
       | I gave this as an example
       | 
       | "create a CDK typescript app that deploys a lambda + API Gateway
       | where the lambda works with Get request and a dynamodb table. The
       | lambda should have permission to read and write to the Table"
       | 
       | It wrote the code perfectly. I wanted to see if it was trained on
       | the AWS APIs.
        
       | gardenhedge wrote:
       | I had a bug in my flutter app. The Phind model nailed it straight
       | away. GPT-4 gave a working but awful solution.
        
       | grahamgooch wrote:
       | Licensing and privacy details pls?
        
       | deegles wrote:
       | What's the best way to use an LLM with a large codebase that
       | isn't RAG? Ideally we could have the full source in the context
        | or already trained into the model... I was thinking I could
        | set something up to fine-tune a model overnight so that every
        | morning I'd have a fresh one ready. Any ideas?
        
       | alexalx666 wrote:
        | It would be great to have more clarity on the Plans page re
        | why I need GPT-4 in the context of Phind; I'm already paying
        | for GPT Plus, Copilot, and Kagi Search. It would also be
        | great to have a reference point: is an input length of 8000
        | tokens good for a web app, an iOS view, a unix util, a Go
        | server? It seems like the value add is the Phind model, but
        | you advertise GPT-4.
        
       | schmorptron wrote:
       | I've been a pretty heavy user of phind and have been very
       | satisfied! Haven't been using it to write code for me but to ask
       | about features and docs and it's been pretty incredible.
        
       | aliljet wrote:
       | How recently was this LLM seeded with data? In the context of
       | golang, this is easily a generation ahead of GPT-4.
        
       | godelski wrote:
       | What data did you use to train and how do you evaluate your model
       | for overfitting? I ask due to the issues with the HumanEval
       | dataset.
       | 
       | -------------
       | 
       | For those that are unfamiliar with the issues, allow me to
       | elaborate. You can find the dataset in the parent's link or
       | here[0] and you can find the paper here[1].
       | 
       | I'll quote from the paper. First is page 2 right above the github
       | link and second is page 4 section 2.2 (note, this paper has 58
       | authors... 58)
       | 
       | > To accurately benchmark our model, we create a dataset of 164
       | original programming problems with unit tests. These problems
       | assess language comprehension, algorithms, and simple
       | mathematics, with some comparable to simple software interview
       | questions.
       | 
       | > It is important for these tasks to be hand-written, since our
       | models are trained on a large fraction of GitHub, which already
       | contains solutions to problems from a variety of sources. For
       | example, there are more than ten public repositories containing
       | solutions to Codeforces problems, which make up part of the
       | recently proposed APPS dataset
       | 
        | So we take from this that the problems are simple and
        | leetcode-style, and that the authors "verified" the data is
        | not in the training set simply by virtue of having written
        | the code from scratch. If you aren't laughing now, you should
        | be. So let's look and see if there are in fact samples of
        | code exact or near to those in the test set in public GitHub
        | repos prior to May 2020, their cutoff date.
       | 
       | Now let's look at some of the test questions and see if we can
       | find them on github. Github search is total garbage so I'm going
       | to pull results from the last time I looked (search my comment
       | history "godelski human eval") I apologize in advance for
       | formatting.
       | 
       | HumanEval/4:
       | 
        | Prompt:
        | 
        |     from typing import List
        | 
        |     def mean_absolute_deviation(numbers: List[float]) -> float:
        |         """For a given list of input numbers, calculate Mean
        |         Absolute Deviation around the mean of this dataset.
        |         Mean Absolute Deviation is the average absolute
        |         difference between each element and a centerpoint
        |         (mean in this case): MAD = average | x - x_mean |
        |         >>> mean_absolute_deviation([1.0, 2.0, 3.0, 4.0])
        |         1.0
        |         """
        | 
        | canonical_solution:
        | 
        |     mean = sum(numbers) / len(numbers)
        |     return sum(abs(x - mean) for x in numbers) / len(numbers)
        | 
        | Found on GitHub[2], commit date Oct 5, 2019:
        | 
        |     if reduction == "median":
        |         return np.median(scores)
        |     mean = sum(scores) / len(scores)
        |     if reduction == "mean":
        |         return mean
        |     return sum(abs(x - mean) for x in scores) / len(scores)
       | 
       | A solution that is functionally equivalent. Swap numbers and
       | scores and remove the if statement. This constitutes a near
        | collision, and ML models will perform very well on near
       | collisions. If you look at the testing method for the evaluation
       | you will also see that this code will pass the test. Thus our LLM
       | can very easily simply copy paste this code and pass no problem.
       | I'm not saying that's what happened, but that we cannot rule this
       | out. What actually happened is an open question and we're far
       | from ready as a community to call LLMs fuzzy copy machines.
       | 
       | I also have this search query marked which still seems to be
       | working[3] but you'll have to manually check the date.
       | 
        | You can repeat this process for many examples in the
        | HumanEval dataset. Or simply look at the HumanEval questions
        | and answers and ask yourself, "Have I written those exact
        | lines of code?" The answer is probably yes.
       | 
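The manual near-collision hunt above can be partially automated. The sketch below normalizes identifiers away and measures line overlap between a benchmark solution and a candidate training snippet, so that near collisions like numbers/scores line up; the normalization and helper names are illustrative, not the method of the HumanEval paper or of any real deduplication pipeline.

```python
import re

def normalize(code):
    """Crudely canonicalize code: drop blank lines and whitespace,
    and rename every identifier to ID."""
    lines = []
    for line in code.strip().splitlines():
        line = line.strip()
        if line:
            lines.append(re.sub(r"[A-Za-z_]\w*", "ID", line))
    return lines

def overlap(solution, snippet):
    """Fraction of the solution's normalized lines found in the snippet."""
    sol, snip = normalize(solution), set(normalize(snippet))
    return sum(line in snip for line in sol) / len(sol)

# The HumanEval/4 canonical solution vs. the 2019 GitHub code above:
canonical = """
mean = sum(numbers) / len(numbers)
return sum(abs(x - mean) for x in numbers) / len(numbers)
"""
github_2019 = """
mean = sum(scores) / len(scores)
if reduction == "mean":
    return mean
return sum(abs(x - mean) for x in scores) / len(scores)
"""
```

Under this (deliberately loose) metric, every line of the canonical solution already appears in the 2019 snippet, which is exactly the near-collision being described.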
        | But note here that overfitting is perfectly okay in certain
        | circumstances. HumanEval simply measures how good an LLM is
        | at solving short leetcode-style questions. It does not
        | measure an LLM's ability to write code in general, and
        | certainly not to write non-leetcode code. The model may very
        | well be able to do those things, but this benchmark does not
        | measure them. It can still provide utility to people, and
        | these LLMs still learn a lot more than what HumanEval tests.
        | My issue is with the metric and the claims about what the
        | results indicate, rather than the product itself. There is
        | also the danger of chasing
       | than the product itself. There is also the danger of chasing
       | benchmarks like these as you will not be able to disentangle
       | overfitting from desired training outcomes. I am not critiquing
       | OP's network nor the work they did to create this. I'll
       | explicitly state here, well done OP. This took a lot of hard work
       | and you should feel very proud. I hope this question and context
       | does not come off as pejorative nor overly cynical. I think your
       | work is without a doubt, something to be proud of and useful to
       | our community.
       | 
       | This is a warning to all HN readers to help avoid snakeoil (I
        | expect every ML person to already know this): scrutinize your
        | metrics and know exactly what they measure. And I mean
        | precisely: there are no metrics that directly measure abstract
        | things like "image quality", "performance in language", or
        | "code generation performance". For generative models it is
        | exceptionally difficult to determine which model is better,
        | and we are unfortunately at a point where many of our metrics
        | are weak proxies (remember: metrics are proxies for more
        | abstract goals. Metrics are models. All models are wrong, just
        | some are more wrong than others), and you must do far more
        | investigation to come to even a fuzzy answer to this question.
        | Nuance is necessary.
       | 
       | [0] https://huggingface.co/datasets/openai_humaneval
       | 
       | [1]https://arxiv.org/abs/2107.03374
       | 
       | [2] https://github.com/danielwatson6/hate-speech-
       | project/blob/8e...
       | 
       | [3]
       | https://github.com/search?q=abs%28x+-+mean%29+for+language%3...
        
       | standardly wrote:
       | I like that it provides sources, but I have to check them EVERY
       | time because too often it hallucinates bogus solutions or
       | protocols for me. I'm asking network questions, though, not
       | asking for code snippets. I've had it hallucinate powershell
          | modules as well. If you're willing to check its work, then
          | it's useful, maybe.
        
         | rushingcreek wrote:
         | Thanks for the feedback. Do you have any cached links you can
         | share? It'd be massively helpful.
        
       | DrNosferatu wrote:
        | I tried a more niche language, Scilab, and it did considerably
        | *worse* compared to GPT4.
       | 
       | - What are your experiences?
        
       | sysread wrote:
       | Looks nice, but it's quite pricy compared to OpenAI's API pricing
       | or ChatGPT.
        
       | ojosilva wrote:
        | Awesome model. From a quick run-through comparison, it's
        | comparable in results to GPT-4, with web search and references
        | as a plus, and it runs faster. Two small nitpicks:
       | 
        | - Dark mode is hard to read: the answer text font has too
        | much weight and brightness, which makes long paragraphs of
        | non-code text tiring. Light mode is obviously too bright
        | overall, but it's already nighttime where I am, so maybe
        | tomorrow at noon I'll have another opinion. I'd prefer gray
        | (dark, i.e. OpenAI) and sepia (light, i.e. HN) backgrounds
        | when long lines of text are involved.
       | 
       | - Pricing page and ties to GPT-4: what does "500+ best model uses
       | per day (GPT-4)" mean? What's the "GPT-4" part for? I saw I can
       | pick GPT4 as a model on the landing page, but I just don't get
       | the best model/GPT-4 thing. Is Phind announcing it's a competitor
       | but also proxies GPT-4? Sorry, I'm not up-to-date on GPT-4
       | "resellers" and the story behind Phind, it's just weird when it
       | announces it "beats GPT-4" then the pricing is about GPT-4 usage.
        
         | rushingcreek wrote:
         | Thanks for the feedback. We also support GPT-4 as an answering
         | model so users can pick and choose what's best for their use
         | case, but we recommend the Phind Model for the majority of
         | users.
        
       | poser-boy wrote:
       | A few of Phind's models are open/available
       | 
       | https://huggingface.co/Phind/Phind-CodeLlama-34B-v2
        
       ___________________________________________________________________
       (page generated 2023-10-31 23:00 UTC)