[HN Gopher] Copilot regurgitating Quake code, including sweary c... ___________________________________________________________________ Copilot regurgitating Quake code, including sweary comments Author : bencollier49 Score : 1063 points Date : 2021-07-02 11:52 UTC (11 hours ago) (HTM) web link (twitter.com) (TXT) w3m dump (twitter.com) | kklisura wrote: | Phew! Our jobs are safe! | unknownOrigin wrote: | _snickers_ | CookieMon wrote: | though our companies will one day be competing with product | manufacturers in China who get to use it to its fullest | crescentfresh wrote: | Direct url to the "gif" in that twitter post: | https://video.twimg.com/tweet_video/E5R5lsfXoAQDRkE.mp4 | | I could not figure out how to show it larger on the twitter UI. I | don't have a twitter account so that may be the problem. | Thomashuet wrote: | It seems like a very sensible answer from copilot since the | prompt includes "Q_" which makes it obvious that the programmer | is specifically looking for the Quake version of this function. | | To me it doesn't show that copilot will regurgitate existing code | when I don't want it to, just that if I ask it to copy some | famous existing code for me it will oblige. | cjaybo wrote: | Apparently you haven't seen many of the demos that people are | showing off? Because saying that this only occurs when the | author is explicitly asking for copied code is blatantly false. | Thomashuet wrote: | No I haven't. If you think the other demos are more | interesting please link to them. I'm just saying that this | demo is biased and that we can't draw any conclusion from it. | Actually the author has just confessed to optimizing it for | entertainment in a sister comment. That doesn't mean that the | claim is false but it doesn't show that it is true either. | the_mitsuhiko wrote: | I think you misunderstood my comment. The same code gets | generated if you call the function `float fast_rsqrt` or | `float fast_isqrt` for instance.
I intentionally made it | look like `Q_rsqrt` so that people would pick up on it | quicker. | heavyset_go wrote: | Thanks for making the video in the OP. | | Do you have more examples like this that I can share with | those who don't use Twitter, like a repo or blog post? | OskarS wrote: | You're not wrong, but the very fact that it will regurgitate | copyrighted code _at all_ (and especially at this length, word | for word) means that it will be totally unacceptable for many | places. In fact, it is arguably not acceptable to use anywhere | if you care deeply about copyright. | advisedwang wrote: | The claim for AI systems like this is that it has actually | learned something and is generating code from scratch. | Oftentimes the authors will claim regurgitation is simply not | possible, and this example shows that's a lie. | | Many arguments on the benefits, legality and power of AI | systems rely on this claim. | | To turn around now and say it's OK to regurgitate in the right | setting is to move the goalposts. | caconym_ wrote: | > Oftentimes the authors will claim regurgitation is simply | not possible | | Do the Copilot authors claim this? | | I get that you're suggesting that Copilot may benefit from | absolute claims made by the authors of other, similar systems | (or their proponents), but I also don't think it's reasonable | to exclude nuance and the specifics of Copilot from ongoing | discussions on that basis. The Copilot authors have publicly | acknowledged the regurgitation problem, and by their account | are working on solutions to it (e.g. attribution at | suggestion-generation time) that don't involve sweeping it | under the rug. | rckstr wrote: | They did! In the FAQ, which I can't find anymore, they said: | | >GitHub Copilot is a code synthesizer, not a search engine: | the vast majority of the code that it suggests is uniquely | generated and has never been seen before.
We found that | about 0.1% of the time, the suggestion may contain some | snippets that are verbatim from the training set. | caconym_ wrote: | This actually seems like an explicit acknowledgement that | regurgitation _is_ possible, and not remotely a claim | that it is "simply not possible". | | It stands to reason that cases where people are | intentionally trying to produce regurgitation will | strongly overlap with the minority of cases where it | actually happens. So I think we are probably suffering | from some selection bias in discussions on HN and similar | forums--that might be unavoidable, and it certainly | stimulates some interesting discussion, but we should try | to avoid misrepresenting the product as a whole and/or | what its creators have said about it. | sombremesa wrote: | I think only Github's lawyers would interpret what GP | posted the way you did. Looks like weasel wording to make | such an interpretation possible, while making customers | believe that code is more or less synthesized in | realtime. "Snippets" makes one think one or two lines of | code, not entire functions and classes. | caconym_ wrote: | I think that until somebody shows that Copilot is willing | to copy _distinctive_ code fragments verbatim, | unprompted, with a high occurrence rate, I 'm not going | to start accusing Github of building an engine to | cynically exploit the IP rights of open source copyright | holders for profit. I've seen no evidence of that, and in | absence of evidence I prefer to remain neutral and open- | minded. | | How would that work, anyway? Rare, distinctive code forms | seem much more difficult for an ML thing to suggest with | a high-ish confidence level, since there won't be much | training data. The Quake thing makes sense because it's | one of the most famous sections of code in the world, and | probably exists in thousands of places in the public | Github corpus. 
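For readers who haven't seen it, the snippet at issue is Quake III's fast inverse square root. The bit trick can be sketched in Python using `struct` to emulate the 32-bit float reinterpretation; this is an illustrative re-creation, not the GPL'd original C:

```python
import struct

def fast_inverse_sqrt(number: float) -> float:
    """Approximate 1/sqrt(number) via the Quake III bit-level trick."""
    x2 = number * 0.5
    # Reinterpret the 32-bit float's bits as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", number))[0]
    # The famous magic constant gives a cheap initial guess for 1/sqrt.
    i = 0x5F3759DF - (i >> 1)
    # Reinterpret the integer bits back as a float.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson iteration refines the guess (error well under 1%).
    return y * (1.5 - x2 * y * y)
```

The magic constant and the single Newton step are exactly what makes the routine so recognizable: there is essentially one way to write it, which is also why so many verbatim copies exist in the public GitHub corpus.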
| | I'm emphasizing _distinctive_ because a lot of | boilerplate takes up a lot of room, but still doesn 't | make a reasonable argument for copyright infringement | when yours looks like somebody else's. | sombremesa wrote: | It looks like you're responding to the wrong comment. I | don't recall alleging that Github is "building an engine | to cynically exploit the IP rights of open source | copyright holders for profit". | caconym_ wrote: | > I think only Github's lawyers would interpret what GP | posted the way you did. Looks like weasel wording to make | such an interpretation possible, | | So what are you suggesting here, except that Github is | attempting a legal sleight-of-hand to hide real | infringement? | | > while making customers believe that code is more or | less synthesized in realtime. | | What are you suggesting here except that Github is | (essentially) lying to customers, making them believe | something that is substantially untrue? | | When I say "building an engine to cynically exploit the | IP rights of open source copyright holders for profit", I | am talking about a scenario in which they are sweeping | legitimate IP concerns under the rug with bad faith legal | weaselry and misrepresentation of how the product | functions, etc., to chase profit. I do not see how that | is substantially different from the implications of your | comment, especially in the context of this subthread. | | Could you enlighten me as to how your intended meaning | substantially differs from my interpretation? If you | don't mean to accuse Github of malfeasance, we probably | don't have much to discuss. | ohazi wrote: | Nat Friedman explicitly stated that it shouldn't | regurgitate [0]: | | > It shouldn't do that, and we are taking steps to avoid | reciting training data in the output | | He's being woefully naive. To put it bluntly, we don't know | how to build a neural network that isn't capable of | spitting out training data. 
The techniques he pointed to in | other threads are academic experiments, and nobody seems to | have a credible explanation for why we should believe that | they work. | | [0] https://news.ycombinator.com/item?id=27677177 | caconym_ wrote: | "Shouldn't" isn't the same as "doesn't". | | I'm not anything close to an ML expert, and I have no | opinion on whether what they're aiming for is possible, | but this document^[1] (linked in your linked comment) | states explicitly that they are aware of the recitation | issue and are taking steps to mitigate it. So, in the | context of the comment I replied to, I think Github is | very far from claiming that recitation is "simply not | possible". | | ^[1] https://docs.github.com/en/github/copilot/research- | recitatio... | ohazi wrote: | That kind of bullshit phrasing can only get you so far. | | It's like if some corporate PR department told you "we're | aware of the halting problem, and are taking steps to | mitigate it." You would rightly laugh them out of the | room. | | It's not going to work, and the people making these | statements either don't understand how much they don't | understand, or are deluding themselves, or are actively | lying to us. | | An honest answer would be something like "We are aware | that this is a problem, and solving it is an active area | of research for us, and for the machine learning | community at large. While we believe that we will | eventually be able to mitigate the problem to an | acceptable degree, it is not yet known whether this | category of problem can be fully solved." | caconym_ wrote: | You're using some pretty strong language here, but do you | have any more substantive criticisms of the analysis they | present at | https://docs.github.com/en/github/copilot/research- | recitatio... ? They seem to think the incidence of | meaningful (i.e. substantively infringing) recitation is | very low, and that their solution in those cases will be | attribution rather than elimination. 
| | Again, I'm not an ML expert, but that sounds a lot more | reasonable to me than announcing one's intention to solve | the halting problem. | ohazi wrote: | They had some people use the thing for a while, and | concluded "Hey look, it doesn't seem to quote verbatim | very often. Yay!" There is nothing in there that | describes any sort of mitigation. The three sentences | about an attribution search at the very end are | aspirational at best, and are presented as "obvious" even | though it's not at all clear that such a fuzzy search can | be implemented reliably. | | I use the halting problem as an analogy because their | naive attempts to address this problem feel a lot like | naive attempts to get around the halting problem ("just | do a quick search for anything that looks like a loop," | "just have a big list of valid programs," etc.). I can | perform a similar analysis of programs that I run in my | terminal and come to a similar "Hey look, most of them | halt! Yay!" conclusion. I can spin a story about how most | of the ones that don't halt are doing so intentionally | because they're daemons. | | But this approach is inherently flawed. I can use a fuzz | tester to come up with an infinite number of inputs that | cause something as simple as 'ls' to run forever. | | Similarly, I can come up with an infinite number of | adversarial inputs that attempt to make Copilot spit out | training data. Some of them will work. Some of them will | produce something that's close enough to training data to | be a concern, but that their "attribution search" will | fail to catch. That's the "open research question" that | they need to solve. | | We _don 't have_ a general solution to this problem yet, | and we may never have one. They're trying to pass off a | hand-wavey "we can implement some rules and it won't be a | problem most of the time" solution as adequate. I don't | see any reason to believe that it will be adequate. 
Every | attempt I've seen at using logic to try and coax a | machine learning model into not behaving pathologically | around edge cases has fallen flat on its face. | caconym_ wrote: | > The analysis you're citing is just that -- a | statistical analysis. They had some people use the thing | for a while, and concluded "Hey look, it doesn't seem to | quote verbatim very often. Yay!" There is nothing in | there that describes any sort of mitigation. | | > The three sentences about an attribution search at the | very end are aspirational at best, and are presented as | "obvious" even though it's not at all clear that such a | fuzzy search can be implemented reliably. | | I agree with all of this, though I do think that the | attribution strategy they describe sounds a lot easier | than solving the halting problem or entirely eliminating | recitation in their model. Obviously, the proof will be | in the pudding. | | Maybe you and others are reacting to them framing this as | "research", as if they're trying to prove some | fundamental property of their model rather than simply | harden it against legally questionable behavior in a more | practical sense. I think a statistical analysis is fine | for the latter, assuming the sample is large enough. | csande17 wrote: | The biggest issue with that analysis is that their model | is clearly very able to copy code and change the variable | names, copying code and changing variable names is very | clearly still "copying", and the analysis doesn't seem to | include that in its definition of "recitation event". | caconym_ wrote: | I'd fully expect it to copy code and change variable | names in a lot of cases--if it wants to achieve the goal | of filling in boilerplate, how could it do anything else? | That's pretty much the definition of boilerplate: it's | largely the same every time you write it. 
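csande17's renaming point is easy to demonstrate: a naive recitation check built on exact token n-grams survives reformatting but collapses as soon as identifiers change. A hypothetical sketch (this reflects nothing about GitHub's actual detector):

```python
def fingerprints(code: str, n: int = 8) -> set:
    """Whitespace-normalized token n-grams of a code snippet."""
    tokens = code.split()  # crude tokenizer; a real detector would lex the language
    return {tuple(tokens[i:i + n]) for i in range(max(0, len(tokens) - n + 1))}

def overlap_ratio(suggestion: str, known: str, n: int = 8) -> float:
    """Fraction of the suggestion's n-grams found verbatim in known code."""
    sug = fingerprints(suggestion, n)
    return len(sug & fingerprints(known, n)) / len(sug) if sug else 0.0
```

Reflowing whitespace leaves the overlap at 1.0, while renaming two variables in a ten-token snippet drives it to 0 even though the code is clearly still copied, which is why "copy plus rename" is the hard case for any attribution search.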
| | What's less clear to me is that Copilot regularly does | that sort of thing with code distinctive enough that it | could reasonably be said to constitute copyright | infringement. If somebody's actually shown that it does, | I'd love to see that analysis. | the_mitsuhiko wrote: | I was able to trigger it without the Q prompt before. It just | made for a nicer-looking gif that way. | | I got it to produce more GPL code too, that one is just not | entertaining. | throwaway_egbs wrote: | Welp, guess I'll be taking all my code off of GitHub now, lest it | be copied verbatim while ignoring my licenses. | | (I'm no John Carmack, but still.) | throwaway_egbs wrote: | This reply from @AzureDevOps is bizarre: "We understand. However, | the way to report this issues related to Windows 11 is through | our Windows Insider even from another device. Thanks in advance." | | I think I'm gonna give "AI" a few more years. | | https://twitter.com/AzureDevOps/status/1411018079849619458 | bob1029 wrote: | This is pretty clearly just a search engine with more parameters. | | I thought there was something more going on with copilot, but the | fact that it is regurgitating arbitrary code comments tells me | that there is zero semantic analysis going on with the actual | code being pulled in. | josefx wrote: | They openly claim it is an AI. What about the state of AI | currently in use made you think that there was any intelligence | behind it? | thegeomaster wrote: | It is decidedly not "just a search engine with more | parameters." Language models are just prone to repeating | training examples verbatim when they have a strong signal from | the prompt. Arguably, in this case, it is the most correct | continuation. | saynay wrote: | It's more that the model is so large it is capable of | memorizing a lot. This can be seen in other language models | like GPT-3 as well.
| | Comments, I suspect, will be more likely to be memorized since | the model would be trained to make syntactically correct | outputs, and a comment will always be syntactically correct. | That would mean there is nothing to 'punish' bad comments. | username90 wrote: | The model in this case is just a lossy compression of github, | and you search that. | sydthrowaway wrote: | What causes this in a net? I'm guessing the RNN gets in a | catastrophic state.. | salawat wrote: | Neural nets aren't magic. You actually need quite a bit of | complexity and modeling of interrelated problem spaces to get | anything more than a childlike naivete or trauma savant-like | mastery of one particular area with crippling deficiencies | elsewhere. | otabdeveloper4 wrote: | > catastrophic state | | No, overfitting is the normal state for neural nets. | captainmuon wrote: | I would say overfitting - the net doesn't "understand" the code | in any meaningful sense. It just finds fitting examples and | jumbles them a bit. | | Understanding would mean to have an internal representation | related to the intention of the user, the expected behavior, | and say the AST of the code. My pessimistic interpretation of | this and many other recent AI applications is that it is a | "better markov chain". | LeanderK wrote: | a markov chain can have an internal representation related to | the intention of the user. I guess this example just got | copied a lot and is therefore included multiple times in the | training data, forcing the network to memorise it. Neural | networks always memorise things that appear too frequent. | Memorized Artifacts in an otherwise working neural network is | usually seen as a "bug" (since the training allowed the | network to cheat), not as a proof that the network didn't | generalise. | michaelt wrote: | This is the network working as designed. 
| | I mean, if you wrote an autocomplete system for written english | and asked it to complete the sentence "O Romeo, Romeo" what | would you expect to happen? | | You'd expect it to complete to "O Romeo, Romeo, wherefore art | thou Romeo?" - a very famous quote. | | How else could you produce the single right output for that | unique input, other than memorising and regurgitating? | NumberCruncher wrote: | > You'd expect it to complete to "O Romeo, Romeo, wherefore | art thou Romeo?" - a very famous quote. | | What about completing it to "O Romeo, Romeo, brave Mercutio | is dead", based on the context, as advertised? | villgax wrote: | That salesforce engineer is the average developer when prompting | to write an About me section lol | ronenlh wrote: | There are a lot of good points made against copilot. But I'm | optimistic in that it will improve with time. At worst it's an | efficient code copy-pasting tool, but at best it could be the | next level of abstraction. | littlestymaar wrote: | Copilot was only trained on public code right? Because that would | be an absolute PR disaster if some proprietary code was leaked | that way. | ezekg wrote: | I'm curious if you can find certain keywords that may leak | private code... | imnitishng wrote: | Waiting for the access to test this out | nadsat2021 wrote: | I don't understand why this is an issue. And now I have a tool to | do my coding, as well as my social media contributions: | "'Sweary comments' is not grammatically correct English." -> | GPT-2 -> "Sweary comments" is not grammatically | correct English. The "sweary" meaning is used as slang by those | in charge, especially the ones who get a free ride through the | media, for the sake of a quick buck, or the sake of getting a | better job with a promotion. The word is commonly used in a | social context and comes from an actual "sweary" comment. The | statement or phrase, "You're too shy!" is derived from this | phrase: "I wish he were as outgoing as you." 
| | Am I right? | PaulHoule wrote: | This kind of thing devalues the work that programmers do. | | The head example they show is using a sentiment analysis API, | which is about the most useless use of technology there is. | saiojd wrote: | Based on all the negative comments so far, and based on this | website's aptitude at predicting the viability of a product, it | really seems like Copilot is bound to be a success. | DantesKite wrote: | Right? | | Even given all its initial problems, I don't see a world where | people completely avoid using it. | sydthrowaway wrote: | We're gonna Dropbox this thing all the way to the top! | stusmall wrote: | No wireless. Less space than a nomad. Lame. | saiojd wrote: | Yeah. I get why people's initial reaction is to dislike it | tbh. Honestly I doubt the utility will be huge for experts; | most likely it will just alleviate having to remember how a | certain language implements a specific concept. | sktrdie wrote: | I'm going to go against the flow here and say that worrying about | this is similar to worrying about the license we give to snippets | of code we copy-paste from other licensed code. | | The reality is that we never attribute the original source | because we copy-paste it, change it up a bit, and make it our | own. Literally everybody does this. | | I still care about licensing and proper attribution but the | reality is that a snippet of code is not something so easy to | attribute. Should we attribute all kinds of ideas, even the very | small ones? How quickly is an idea copied, altered & reused? Can | we attribute all the thoughts humans have? | nadsat2021 wrote: | "sweary comments" | | When I hear phrases like that, I worry more about human | intelligence. "I'm a little tea pot, short and stout," said | social media. | | I watched Kubrick's "A Clockwork Orange" again this week, after a | certain amount of fearful anticipation. | | When Alex said "eggy weggie", it clicked.
It's like Burgess | time-traveled to 2021 to document our modern infantilization and | antisocialization. He forgot to include the Internet, loss of | humor, and emerging AI, but I guess he was overwhelmed by the | enormity of the "baddiwad". | | Later on, droogs. | Osiris wrote: | I assumed it was trained on source code that was explicitly | licensed with a permissive license. Are they training it using | private unlicensed repos also? | GhostVII wrote: | Sounds like we need another tool called "Auditor" that scans your | code to see if it violates copyright laws. | omgwtfbbq wrote: | The uproar over Copilot is kind of hilarious. Maybe it's SWEs | realizing that they might not be as irreplaceable as they seem, | but it's an awful lot of salty comments. If anything I think | Copilot is a really cool PoC and shows just how close we are | getting to automating large portions of the code writing process, | which we should all welcome, as more cycles can be spent on | architecture and system design. | ethbr0 wrote: | The irony is that we're whinging about a tool that generates code | that will be difficult to understand _in the future_... | | ... and the example is mathematically- and floating-point-spec | obtuse enough that it was incomprehensible at the time _it was | written_. (As evidenced by id's comments) | maliker wrote: | Copilot transitions programmers from writing code to reading | auto-generated code. And the feeling is that reading code is 10x | harder than writing it? Seems like a rich source of problems. | | (However, I'm still definitely going to try this out once I get | off the waitlist.) | rgbrenner wrote: | So this makes it official... this post[0] and the comments on the | announcement[1] concerned about licensing issues were absolutely | correct... and this product has the possibility of getting you | sued if you use it. | | Unfortunately for GitHub, there's no turning back the clock.
| Even if they fix this, everyone that uses it has been put on | notice that it copies code verbatim and enables copyright | infringement. | | Worse, there's no way to know if the segment it's writing for you | is copyrighted... and no way for you to comply with license | requirements. | | Nice proof of concept... but who's going to touch this product | now? It's a legal ticking time bomb. | | 0. https://news.ycombinator.com/item?id=27687450 | | 1. https://news.ycombinator.com/item?id=27676266 | sktrdie wrote: | If they get rid of licensed stuff it should be ok no? I really | want to use this and seems inevitable that we'll need it just | as google translate needs all of the books + sites + comments | it can get a hold of. | ianhorn wrote: | Unlicensed code just means "all rights reserved." You'd need | to limit it to permissively licensed code and make sure you | comply with their requirements. | runeb wrote: | How would they do that? | oauea wrote: | Read the LICENSE file in each repo. | rovr138 wrote: | What guarantees it's intact? | [deleted] | mmastrac wrote: | Well... the whole training set is licensed, so you can't | really get rid of it. I think that the technology they are | using for this is just not ready. | fragmede wrote: | Just retrain the model using properly licensed code? | ("just" is doing a ton of heavy lifting, but let's be real, | that's not impossibly hard) | [deleted] | eCa wrote: | Which licenses would it be ok that the training material is | licensed under, though? If it produces verbatim enough copies | of eg. MIT licensed material, then attribution is required. | Similar with many other open source-friendly licenses. | | On the other hand, if only permissive licenses that also | don't require attribution is used, well, then for a start, | the available corpus is much smaller. 
| eganist wrote: | Adding to this: | | I run product security for a large enterprise, and I've already | gotten the ball rolling on prohibiting copilot for all the | reasons above. | | It's too big a risk. I'd be shocked if GitHub could remedy the | negative impressions minted in the last day or so. Even with | other compensating controls around open source management, this | flies right under the radar with a c130's worth of adverse | consequences. | fragmede wrote: | Do you also block stack overflow and give guidance to never | copy code from that website or elsewhere on the Internet? I'm | legitimately curious - my org internally officially denounces | the copying of stack overflow snippets. Thankfully for my | role it's moot as I mostly work with an internal non-public | language, for better or worse, and I have no idea how well | that's followed elsewhere in the wider company. | samtheprogram wrote: | Anything posted to Stack Overflow has a specific (Creative | Commons IIRC) license associated with it. The same is not | true of GitHub Copilot, and in fact their FAQ doesn't | specify a license at all, probably because they are | technically unable to since it is trained on a wide variety | of code from differing licenses (and code not written by a | human is currently a grey area for copyright). The FAQ | simply says to use it at your own risk. | summerlight wrote: | Google (and most of other big techs I guess?) also | explicitly prohibit employees from use of stack overflow | code snippets. | Noumenon72 wrote: | I tried Googling this and couldn't find it. I also don't | want to believe it because it seems like the world | suddenly turned into an apocalyptic hellscape with no | place for developers like me. Do you have a source? | gunapologist99 wrote: | Apples and oranges: Stack overflow snippets are explicitly | granted under a permissive license, as long as you | attribute. 
| | https://stackoverflow.com/help/licensing | | It appears that the code that copilot is using is created | under a huge variety of licenses, making it risky. | | On the other hand, a small snippet in a function that is | derived from many existing pieces of other code may fall | under fair use, even if it is not under an open source | license of some sort. | rorykoehler wrote: | It just seems bizarre that this wasn't flagged internally | at Microsoft. They have tons of compliance staff. | mustacheemperor wrote: | Maybe we'll even get a sneak peek at Windows 11's source | code. Time to start writing a Win32 API wrapper and see | what the robot comes up with! | snicker7 wrote: | That's because Microsoft doesn't dare use this for | production code (presumably). | | They are 100% okay with letting their competitors get | into legal hot water. | rorykoehler wrote: | It's surely a bit of a liability grey area? | ngcazz wrote: | Could bet they baked in the legal fees and are taking a | calculated risk | comex wrote: | Except that CC-BY-SA is not a permissive license; the SA | part is a form of copyleft. It's just that nobody | enforces it. From the text [1]: | | - "[I]f You Share Adapted Material You produce [..] The | Adapter's License You apply must be a Creative Commons | license with the same License Elements, this version or | later, or a BY-SA Compatible License." | | - "Adapted Material means material [..] that is _derived | from_ or based upon the Licensed Material" (emphasis | added) | | - "Adapter's License means the license You apply to Your | Copyright and Similar Rights in Your contributions to | Adapted Material in accordance with the terms and | conditions of this Public License." | | - "You may not offer or impose any additional or | different terms or conditions on, or apply any Effective | Technological Measures to, Adapted Material that restrict | exercise of the rights granted under the Adapter's | License You apply."
| | A program that includes a code snippet is unquestionably | a derived work in most cases. That means that if you | include a Stack Overflow code snippet in your program, | and fair use does not apply, then you have to license the | _entire program_ under the CC-BY-SA. Alternately, you can | license it under the GPLv3, because the license has a | specific exemption allowing you to relicense under the | GPLv3. | | For open source software under permissive licenses, it | may actually be okay to consider the entire program as | licensed under the CC-BY-SA, since permissive licenses | are typically interpreted as allowing derived works to be | licensed under different licenses; that's how GPL | compatibility works. But you'd have to be careful you | don't distribute the software in a way that applies any | Effective Technological Measures, aka DRM. Such as via | app stores, which often include DRM with no way for the | app author to turn it off. (It may actually be better to | relicense to the GPL, which 'only' prohibits adding | additional terms and conditions, not the mere use of DRM. | But people have claimed that the GPL also forbids app | store distribution because the app store's terms and | conditions count as additional restrictions.) | | For proprietary software where you do typically want to | impose "different terms or conditions", this is a dead | end. | | Note that copying extremely short snippets, or snippets | which are essentially the only way to accomplish a task, | may be considered fair use. But be careful; in Oracle v. | Google, Google's accidental copying of 9 lines of utterly | trivial code [2] was found to be neither fair use nor "de | minimis", and thus infringing. | | Going back to Stack Overflow, these kinds of surprising | results are why Creative Commons itself does not | recommend using its licenses for code. But Stack Overflow | does so anyway. Good thing nobody ever enforces the | license! 
| | See also: | https://opensource.stackexchange.com/questions/6777/can- | i-us... | | [1] https://creativecommons.org/licenses/by- | sa/4.0/legalcode | | [2] https://majadhondt.wordpress.com/2012/05/16/googles-9 | -lines/ | [deleted] | wrs wrote: | Yes. In a past life, after researching the situation, we | had to find and remove all the code copied from Stack | Overflow into our codebase. I can't fathom why SO won't | fix the license. | | What makes it even worse is if you try to do the right | thing by crediting SO (the BY part) you're putting a red | flag in the code that you should have known you have to | share your code (the SA part). | aasasd wrote: | In addition to other licensing gotchas, a ton of SO | snippets are copied wholesale from elsewhere--docs or | blog posts. So it's pretty likely that the poster can't | license them in the first place because they never | checked the source's license requirements. | mediaman wrote: | Who really copies stack overflow snippets verbatim? It's | usually just easier to refer to it for help figuring out | the right structure and then adapt it for your own needs. | Usually it needs customization for your own application | anyway (variables, class instances, etc). | canadev wrote: | Yeah! I've uh, ... never copied a bit of code into my | repo verbatim, right? | | yeah right. I wish. | | (Not saying every dev does this) | TillE wrote: | I've copied plenty of Microsoft sample code verbatim, | because the Win32 API sucks and their samples usually get | the error handling right. | | But, I can't think of a single scenario where I've copied | something from Stack Overflow. I'm searching for the idea | of how to solve a problem, and typically the relevant | code given is either too short to bother copying, or it's | long and absolutely not consistent with how I want to | write it. | Noumenon72 wrote: | "Too short to bother copying"? I copy single words of | text to avoid typing and typos. 
I would never type out | even a single line of code when I could paste and edit. | blooalien wrote: | I don't think I've _ever_ copied code directly from any | of the Stack* sites. I generally read all the answers | (and comments) and then use what I learn to write my own | (hopefully better) code specific to my needs. | corobo wrote: | Yeah my experience has always been "ohhh that solution | makes sense" then I go write it myself | | If nothing else this whole copilot thing is helping ease | some chronic imposter syndrome | bartread wrote: | Ha! Well, I think a lot of people copy code from | StackOverflow verbatim once at least - including me. | | Of course it turned out the code I'd blindly inserted | into my project contained a number of bugs. In one or two | cases, quite serious ones. This, even though it was the | accepted answer. | | It was probably more effort to fix up the code I'd copy | pasta'd than write it from scratch. Since then I've never | copied and pasted from StackOverflow verbatim. | baud147258 wrote: | I think I did a few times, usually for languages that I | wasn't going to spend too much time with (so no benefits | in figuring out how to do it from the answers) and for | specific tasks. | jpswade wrote: | Not only this but a huge amount of publicly available code is | truly terrible and should never really be used other than as a | point of reference or guidance. | Kiro wrote: | No-one cares about this. People have no clue about licenses and | just copy-paste whatever. If someone gets access to their code | and sees all the violations they're screwed anyway. | jerf wrote: | Ask your legal department about that. Sure, engineers don't | care about licensing at all, but we are not the only players | here. | [deleted] | [deleted] | __MatrixMan__ wrote: | Is it still a legal concern if I'm just coding because I want | to solve a problem and I'm not trying to use it to do business? | maclockard wrote: | If you publish the code anywhere, potentially.
You could be | (unknowingly) violating the original license if the code was | copied verbatim from another source. | | How much of a concern this is depends heavily on what the | original source was. | kevin_thibedeau wrote: | Distributing binaries to third parties is enough to trigger | a license violation. For internal corporate tools, it would | be less of an issue as "distribution" hasn't happened. | lolinder wrote: | And the problem with copilot is that you have no way of | knowing. If it changes even a little bit of the code, it's | basically ungoogleable but still potentially in violation. | saurik wrote: | Yes: not all code on GitHub is licensed in a way that lets | you use it _at all_. People focus on GPL as if that were the | tough case; but, in addition to code (like mine) under AGPL | (which you need to not use in a product that exposes similar | functionality to end users) there is code that is merely | published under "shared source" licenses (so you can look, | but not touch) and even literally code that is stolen and | leaked from the internals of companies--including | Microsoft!... this code often gets taken down later, but it | isn't always noticed and either way: it is now part of | Copilot :/--that, if you use this mechanism, could end up in | your codebase. | eximius wrote: | Seems like the liability should also be on _Copilot itself_ , | as a derivative work. | fourseventy wrote: | Ahh yes the infamous "evil floating point bit level hacking" code | tyingq wrote: | They have 4 hand picked examples on their homepage: | https://copilot.github.com/ | | One has the issue with form encoding: | https://news.ycombinator.com/item?id=27697884 | | The python example is using floats for currency, in an expense | tracking context. | | The golang one uses a word ("value") for a field name that's been | a reserved word since SQL-1999. 
It will work in popular open | source SQL databases, but I believe it would bomb in some servers | if not delimited...which it is not. | | The ruby one isn't outright terrible, but shows a very | Americanized way to do street addresses that would probably | become a problem later. | | And these are the hand picked examples. This product seems like | it needs some more thought. Maybe a way to comment, flag, or | otherwise call out bad output? | xyzzy_plugh wrote: | > The golang one uses a word ("value") for a field name that's | been a reserved word since SQL-1999. It will work in popular | open source SQL databases, but I believe it would bomb in some | servers if not delimited...which it is not. | | In their defense they created the table with this column before | invoking the autocomplete, so they sort of reap what they sow | here. | | It could at least auto-quote the column names to remove the | ambiguity, but it's not a compiler, is it. | mempko wrote: | These are great examples. I wrote about how this will propagate | all sorts of bugs. | | But my argument was that it's good enough developers may get | complacent and not review the auto complete closely enough. But | maybe I'm wrong! Maybe it's not that good yet. | shadowgovt wrote: | Now that they have an AI that can be trained to replicate code, | it looks like the next step is training it to replicate good | code. That will be non-trivial, since step one is identifying | good code and they may not have much big data signal to draw | from for that. | | We know you can't use StackOverflow upvotes. However, they | should have enough signal to identify what snippets of code | have been most frequently copy-pasted from one project to | another. | | Question is whether that serves as a good proxy for good code | identification. | slver wrote: | > And these are the hand picked examples. This product seems | like it needs some more thought.
| | Everyone's self-preservation instincts kicking in to attack | Copilot is kinda amusing to watch. | | Copilot is not supposed to produce excellent code. It's not | even supposed to produce final code, period. It produces | suggestions to speed you up, and it's on you to weed out stupid | shit, which is INEVITABLE. | | As a side note, Excel also uses floats for currency, so best | practice and real world have a huge gap in-between as usual. | Supermancho wrote: | > Everyone's self-preservation instincts kicking in to attack | Copilot is kinda amusing to watch | | Nobody is threatened by this, assuredly. As with IDEs giving | us autocomplete, duplication detection, etc this can only be | helpful. There is an infinite amount of code to write for the | foreseeable future, so it would be great if copilot had more | utility. | mkr-hn wrote: | Have you met programmers? Even those who care about quality | are often under a lot of pressure to produce. Things slip | through. Before, it was verbatim copies from Stack Overflow. | Now it'll be using Copilot code as-is. | slver wrote: | So, nothing new, is your point? | mkr-hn wrote: | Then why are you complaining? Unless something is new | that warrants you getting mad about people getting mad at | technology. | saiojd wrote: | Not the parent, but people really like to get riled up on | the same topics, over and over again, which quickly | monopolizes and derails all conversation. Facebook bad, UIs | suck, etc. We can now add to the list, "AI will never | reduce demand for software engineering". | volta83 wrote: | So how do you know if the code that Copilot regurgitates is | almost a 1:1 verbatim copy of some GPL'ed code or not? | | Because if you don't realize this, you might be introducing | GPL'ed code into your proprietary code base, and that might | end up forcing you to distribute all of the other code in | that code base as GPL'ed code as well.
| | Like, I get that Copilot is really cool, and that software | engineers like to use the latest and bestest, but even if the | code produced by Copilot is "functionally" correct, it might | still be a catastrophic error to use it in your code base due | to licenses. | | This issue looks solvable. Train 2 copilots, one using only | BSD-like licensed software, and one using also GPL'ed code, | and let users choose, and/or warn when the snippet has been | "heavily inspired" by GPL'ed code. | | Or maybe just train an adversarial neural network to detect | GPL'ed code, and use it to warn on snippets, or... | the_rectifier wrote: | You have the same issue with MIT because it requires | attribution | slver wrote: | It's very easy: don't use copilot code verbatim, and you | won't have GPL code verbatim. | volta83 wrote: | > It's very easy: don't use copilot | | Fixed that for you. | | Verbatim isn't the problem / solution. If you take a | GPL'ed library and rename all symbols and variables, the | output is still a GPL'ed library. | | Just seeing the output of GPL'ed code spat out by Copilot | and writing different code "inspired" by it can result in | GPL'ed code. That's why "clean room"s exist. | | Copilot is going to make for a very interesting law case | to follow, because probably until somebody sues, and | courts decide, nobody will have a definitive answer of | whether it is safe to use or not. | throw_2021-07 wrote: | Stack Overflow content is licensed under CC-BY-SA. Terms | [1]: | | * Attribution -- You must give appropriate credit, | provide a link to the license, and indicate if changes | were made. You may do so in any reasonable manner, but | not in any way that suggests the licensor endorses you or | your use. | | * ShareAlike -- If you remix, transform, or build upon | the material, you must distribute your contributions | under the same license as the original.
| | In over a decade of software engineering, I've seen many | reuses of Stack Overflow content, occasionally with links | to underlying answers. All Stack Overflow content use | I've seen would clearly fail the legal terms set out by | the license. | | I suspect Copilot usage will similarly fail a stringent | interpretation of underlying licenses, and will similarly | face essentially no enforcement. | | [1] https://creativecommons.org/licenses/by-sa/4.0/ | guhayun wrote: | The solution might be simpler than we think, just tell the | algorithm | didibus wrote: | Doesn't this go beyond license and into copyright? | | The license lets you modify the program, but the copyright | still enforces that you can't copy/paste code from it to | your own project, no? | pydry wrote: | It's true I probably wouldn't have laughed quite as loudly if | there weren't a chorus of smug economists telling us that | tools like this are gonna put me out of a job. | slver wrote: | Business types hate dealing with programmers, that's a | fact. And these claims of "we'll replace programmers" | happen with a certain precise regularity. | | Ruby on Rails was advertised as so simple, startup founders | who can't program were making their entire products in it | in a few days, with zero experience. As if. | j-pb wrote: | If I want random garbage in my codebase that I have to fix | anyways I might as well hire an underpaid intern/junior. | | It's easier to write correct code than to fix buggy code. For | the former you have to understand the problem, for the latter | you have to understand the problem, and a slightly off | interpretation of it. | tyingq wrote: | _"self-preservation"_ | | My suggestion was a way to comment or flag, not to kill the | product. These were particularly notable to me because | someone hand-picked these 4 to be the front page examples of | what a good product it was. | saiojd wrote: | I agree with you.
This is basically similar to autocomplete | on a cellphone keyboard (useful because typing is hard on | a cellphone), but for programming (useful because what we type | tends to involve more memorization than prose). | tyingq wrote: | >As a side note, Excel also uses floats for currency | | It's still problematic, but the defaults and handling there | avoid some issues. So, for example: | | Excel: =1.03-.42 produces 0.61, by default, even if you | expand out the digits very far. | | Python: 1.03-.42 produces 0.6100000000000001, by default. | slver wrote: | Excel rounds doubles to 15 digits for display and | comparison. The exact precision of doubles is something | like 15.6 digits, those remaining 0.6 digits causing some | of those examples floating (heh) around. | okl wrote: | That depends | https://randomascii.wordpress.com/2012/03/08/float- | precision... | slver wrote: | A lot of these edge cases are about theoretical concerns | like "how many digits we need in decimal to represent an | exact IEEE binary float". | | In practice a double is 15.6 digits precise, which Excel | rounds to 15 to eliminate some weirdness. | | In their documentation they do cite their number type as | a 15 digit precision type. Ergo that's the semantic they've | settled on. | ssss11 wrote: | "Maybe a way to comment, flag, or otherwise call out bad | output?" | | A copilot for copilot? :) | TeMPOraL wrote: | - The Go one (averaging) is non-idiomatic, and has a nasty bug | in it: https://news.ycombinator.com/item?id=27698287 | | - The JavaScript one (memoization) is a bad implementation, it | doesn't handle some argument types you'd expect it to handle: | https://news.ycombinator.com/item?id=27698125 | | You can tell a lot about what to expect, if there are so many | bugs in the very examples used to market this product. | gentleman11 wrote: | > The python example is using floats for currency. | | Dumb question, but what is the proper way to handle currency?
Strings for any number of decimal | places? | spamizbad wrote: | For Python, I prefer decimal.Decimal[1]. When you serialize, | you can either convert it to a string (and then have your | deserializer know the field type and automatically encode it | back into a decimal) OR just agree all numeric values can | only be ints or decimals. You can pass | parse_float=decimal.Decimal to json.loads[2] to make this | easier. | | My most obnoxious and spicy programming take is that ints and | decimals should be built-in and floats should require | imports. I understand why though: Decimal encoding isn't | anywhere near as standardized as other numeric types like | integers or floating-point numbers. | | [1] https://docs.python.org/3/library/decimal.html [2] | https://docs.python.org/3/library/json.html | dragonwriter wrote: | > My most obnoxious and spicy programming take is that ints | and decimals should be built-in and floats should require | imports | | I don't care about making inexact numbers require imports, | but the most natural literal formats should produce exact | integers, decimals, and/or rationals. | dangerbird2 wrote: | Either a fixed-point decimal (i.e. an integer with the ones | representing 1/100, 1/1000, etc. of a dollar), or a ratio type | if you need arbitrary precision. | Quekid5 wrote: | > ratio type if you need arbitrary precision. | | This is the better default, so I'd ditch the qualifier, | personally. At the very least when it comes to the | persistent storage of monetary amounts. People often start | out _thinking_ that they won't need arbitrary precision | until that _one little requirement_ trickles into the | backlog... | | Arbitrary precision rationals handle all the arithmetic | you could reasonably want to do with monetary amounts and | it lets you decide where to round _at display time_ (or | when generating a final invoice or whatever), so there's | no information loss.
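A minimal sketch of the rational-arithmetic approach described above, using Python's stdlib `fractions` (the 1/3-dollar price and the half-up display rounding are made up for illustration; they are not from the thread):

```python
from fractions import Fraction
import math

# Three items priced at exactly 1/3 dollar each: no representation
# error, unlike binary floats or fixed-width decimals.
price = Fraction(1, 3)
total = 3 * price                  # exactly Fraction(1, 1)

def display_cents(amount: Fraction) -> int:
    """Round half up to whole cents, only at display time."""
    return math.floor(amount * 100 + Fraction(1, 2))

display_cents(total)               # 100 cents, rounded once at the end
```

Intermediate arithmetic stays exact; the rounding convention is applied a single time, when the amount is shown or invoiced.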
| SmooL wrote: | Yeah, you probably want to use some sort of decimal package | for a configurable amount of precision, and then use strings | when serializing/storing the values | wodenokoto wrote: | A lot of good answers, but they mostly relate to accounting | types of problems (which granted, is what you need to do with | currency data 99% of the time) | | I'd just add that if you are building a price prediction | model, floats are probably what you need. | tyingq wrote: | The example code is the start of an expense tracking | tool... | stickfigure wrote: | Create a Money class, or use one off the shelf. It should | store the currency and the amount. There are a few popular | ways of storing amounts (integer cents, fixed decimal) but it | should not be exposed outside the Money class. | | There's plenty of good advice in this subthread for how to | represent currency inside your Money abstraction, but | whatever you do, keep it hidden. If you pass around numbers | as currency values you will be in for a world of pain as your | application grows. | pizza234 wrote: | This is a complex topic, mainly for two reasons: 1. it works | on two layers (storage and code) 2. there is a context to | take care of. | | [Modern] programming languages have decimal/rational data | types, which (within limits) are exact. Where this is not | possible, and/or it's undesirable for any reason, just use an | int and scale it manually (e.g. 1.05 dollars = int 105). | | However, point 2 is very problematic and important to | consider. How do you account for 3 items that cost 1/3$ each | (e.g. if in a bundle)? What if they're sold separately? This | really depends on the requirements. | | My 20 cents: if you start a project, start storing currency | in an exact form. Once a project grows, correcting the FP | error problem is a big PITA (assuming it's realistically | possible). | himinlomax wrote: | > How do you account for 3 items that cost 1/3$ each (e.g. if | in a bundle)?
| | You never account for fractional discrete items, it makes | no sense. A bundle is one product, and a split bundle is | another. For products sold by weight or volume, it's | usually handled with a unit price, and a fractional | quantity. That way the continuous values can be rounded but | money that is accounted for needs not be. | XorNot wrote: | The problem is also stupid people and companies. | | My last job they wanted me to invoice them hours worked, | which was some number like 7.6. | | This number plays badly when you run it through GST and | other things - you get repeaters. | | So I looked up common practice here, even tried asking | finance who just said "be exact", and eventually settled on | that below 1 cent fractions I would round up to the nearest | cent in my favour for each line item. | | First invoice I hand them, they manually tally up all the | line items and hours, and complain it's over by 55 cents. | | So I change it to give rounded line items but straight | multiplied to the total - and they complain it doesn't | match. | | Finally I just print decimal exact numbers (which are | occasionally huge) and they stop complaining - because | excel is now happy the sums match when they keep second | guessing my invoices. | | All of this of course was irrelevant - I still had to put | hours into their payroll system as well (which they checked | against) and my contract specifically stated what my day | rate was to be in lieu of notice. | | So how should you do currency? Probably in whatever form | that matches how finance are using excel, which does it | wrong. | hobs wrote: | I wish this was untrue, but I have spent years hearing | the words "why dont my reports match?" - no amount of | logic, diagrams, explaining, the next quarter or instance | - "why dont my reports match?" | | BECAUSE EXCEL SUCKS MY DUDE. | zdragnar wrote: | Well, they did say to be exact, and you handed them | approximations, so... 
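The invoice mismatch in the anecdote above is easy to reproduce: rounding each line item and rounding only the exact total can legitimately differ by a cent. A sketch with made-up numbers (7.6 hours at a hypothetical $37.45 rate, 10% GST, half-up rounding; none of these figures are from the thread):

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def to_cents(amount: Decimal) -> Decimal:
    """Round half up to whole cents."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)

# Three identical line items: 7.6 hours at $37.45/h, plus 10% GST.
raw = [Decimal("7.6") * Decimal("37.45") * Decimal("1.10")] * 3

per_line = sum(to_cents(x) for x in raw)   # round each line, then sum
on_total = to_cents(sum(raw))              # sum exactly, round once

print(per_line, on_total)                  # 939.24 vs 939.25
```

Whether the per-line or the on-total figure is "correct" is a policy question, which is exactly what the finance department in the anecdote never decided.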
| mokus wrote: | The "exact" version they wanted was full of | approximations too. They just didn't have enough | numerical literacy to understand how to say how much | approximation they are ok with. | | I guarantee nothing in anyone's time accounting system is | measured to double-precision accuracy. Or at least, I've | never quite figured out the knack myself for stopping | work within a particular 6 picosecond window. | XorNot wrote: | Sure, but at the end of the day someone had to pay me an | integer amount of cents. They wanted a total which was a | normal dollar figure. But when you sum up 7.6 times | whatever a whole lot, you _might_ get a nice round number | or you might get an irrational repeater. | | What's notable is clearly no one had actually thought | this through at a policy level - the answer was "excel | goes brrrr" depending on how they want to add up and | subtotal things. | dralley wrote: | >[Modern] programming languages have decimal/rational data | types | | This caveat is kind of funny, in light of COBOL having | support for decimal / fixed precision data types baked | directly into the language. | | It's not a problem with "non-modern" languages, it's a | problem with C and many of its successors. That's precisely | why many "non-modern" languages have stuck around so long. | | https://medium.com/the-technical-archaeologist/is-cobol- | hold... | | Additionally, mainframes are so strongly optimized for | hardware-accelerated fixed point decimal computing that for | a lot of financial calculations it can be legitimately | difficult to match their performance with standard | commercial hardware. | caleb-allen wrote: | It is quite simple to do the same in Julia | adwn wrote: | > _It's not a problem with "non-modern" languages, it's | a problem with C and many of its successors._ | | Not really.
Any semi-decent modern language allows the | creation of custom types which support the desired | behavior and often some syntactic sugar (like operator | overloading) to make their usage more natural. Take C++, | for example, the archetypal "C successor": It's almost | trivial to define a class which stores a fixed-precision | number and overload the +, -, *, etc. operators to make | it as convenient as a built-in type, and put it in a | library. In my book, this is vastly superior to making | such a type a built-in, because you can never satisfy | everyone's requirements. | pjmlp wrote: | It is also trivial to keep doing C mistakes with a C++ | compiler, hence no matter how many ISO revisions it will | still have, lack of safety due to C copy-paste | compatibility will never be fixed. | adwn wrote: | > _[...] no matter how many ISO revisions it will still | have, lack of safety due to C copy-paste compatibility | will never be fixed._ | | Okay, no idea how that's relevant to "built-in decimal | types" vs "library-defined decimal types", but if it | makes you feel better, you can do the same in Rust or | Python, two languages which are "modern" compared to | COBOL, don't inherit C's flaws, and which enable defining | custom number types/classes/whatever together with | convenient operator overloading. | pjmlp wrote: | Rust I agree, Python not really as the language doesn't | provide any way to keep invariants. | adwn wrote: | > _Python not really as the language doesn't provide any | way to keep invariants_ | | Again, how is that relevant? If there's no way to enforce | an invariant in _custom data types_, then there's also | no way to enforce invariants in _code using built-in data | types_. | pjmlp wrote: | It is surely relevant. | | Rust provides the mechanisms to enforce them, while in | Python, like all dynamic languages, everything is up for | grabs.
| adwn wrote: | What I meant [1] was: In Python, invariants are enforced | by conventions, not by the compiler. If that's not | suitable for a given use case, then Python is _entirely_ | unsuited for that use case, regardless of whether it | provides built-in decimal types or user-defined decimal | types. That's why I said that your objection regarding | invariant enforcement is irrelevant to this discussion. | | [1] (but was too lazy to write out) | [deleted] | kyrra wrote: | To pile on, here's a copy/paste from when this was asked a | few days ago: | | Googler, opinions are my own. Over in payments, we use micros | regularly, as documented here: | https://developers.google.com/standard- | payments/reference/gl... | | GCP on the other hand has standardized on unit + nano. They | use this for money and time. So unit would be 1 second or 1 | dollar, then the nano field allows more precision. You can | see an example here with the unitPrice field: | https://cloud.google.com/billing/v1/how-tos/catalog- | api#gett... | | Copy/paste the GCP doc portion that is relevant here: | | > [UNITS] is the whole units of the amount. For example if | currencyCode is "USD", then 1 unit is one US dollar. | | > [NANOS] is the number of nano (10^-9) units of the amount. | The value must be between -999,999,999 and +999,999,999 | inclusive. If units is positive, nanos must be positive or | zero. If units is zero, nanos can be positive, zero, or | negative. If units is negative, nanos must be negative or | zero. For example $-1.75 is represented as units=-1 and | nanos=-750,000,000. | ronnier wrote: | In its base unit. So cents in USD. Which can be an int64 | | Or if your language has something specific built in, use | that. | umanwizard wrote: | Not necessarily. It depends on the application. | ainar-g wrote: | > Or if your language has something specific built in, use | that. | | Unless your language is PostgreSQL's dialect of SQL, | apparently.
https://wiki.postgresql.org/wiki/Don%27t_Do_Thi | s#Don.27t_use... | pilif wrote: | It has the same issue that the other suggestion of your | parent comment had: it can't deal with fractions of | cents, which is an issue you will most likely run into | before you run into floating point rounding issues. | fredros wrote: | Of course for databases you should use a decimal. | tzs wrote: | > In its base unit. So cents in USD. Which can be an int64. | | Note that if you use cents in the US so that everything is | an integer then as long as you do not have to deal with | amounts that are outside the range [-$180 trillion, $180 | trillion] you can also use double. Double can exactly | represent all integer numbers of cents in that range. | | This may be faster than int64 on some systems, especially | on systems that do not provide int64 either in hardware or | in the language runtime so you'd have to do it yourself. | marcosdumay wrote: | Each country has a law or something similar that states how | people should calculate over prices. | | The usual is to use decimal numbers with fixed precision (the | actual precision varies from one country to another), and I | don't know of any modern exception. But as late as the 90's | there were non-decimal monetary systems around the world, so | if you are saving any historic data, you may need something | more complex. | umanwizard wrote: | Depends what you're doing. In fact it's not _always_ wrong to | use floats for currency. For accounting you should probably | use a fixed-precision decimal type. | jacobsenscott wrote: | If someone asks how to handle money the best answer is | integers or fixed precision decimals. There may be a valid | case for using floats, but if someone asks they shouldn't | be using floats. | | Also I'm hard pressed to come up with a case where floats | would work. Can you give an example?
| | The answer is the same as _any_ time you should use | floats: where you don't care about answers being exact, | either (1) because calculation speed is more important | than exactness, or (2) because your inputs or | computations involve uncertainty anyway, so it doesn't | matter. | | This is more likely to be the case in, say, physics than | it is in finance, but it's not impossible in the latter. | For example, if you are a hedge fund and some model | computes "the true price of this financial instrument is | 214.55", you certainly want to buy if it's being sold for | 200, and certainly don't if it's being sold for 250, but | if it's being sold for 214.54, the correct interpretation | is that _you aren't sure_. | | When people say "you should never use floats for | currency", their error is in thinking that the only | applications for currency are in accounting, billing, and | so on. In those applications, one should indeed use a | decimal type, because we do care about the rounding | behavior exactly matching human customs. | tyingq wrote: | That's fair, though the example code I mentioned is the | start of an expense tracker. | umanwizard wrote: | Fair enough -- in that case, you should definitely use | either a decimal type or an integer. | jacobsenscott wrote: | Good answer. I've only ever worked on accounting style | financial apps, so I didn't think of those types of | cases. | naniwaduni wrote: | You can't use a generic decimal type in that case either! | You need a special-purpose type that rounds exactly | matching the conventions you're following. This is | necessarily use-, culture-, and probably currency- | specific. | bidirectional wrote: | Most things in front office use floats in my experience, | e.g. derivative pricing, discounting, even compound | interest. None of these things are going to be any better | with integers or fixed-precision, but maybe harder to | write and slower.
| stevesimmons wrote: | Yes, the risk management/instrument pricing part in the | "Front Office" uses floats, because the calculations | involve compound interest and discount rates. | | And the downstream parts for trade confirmation ("Middle | Office"), settlement and accounting ("Back Office") use | fixed precision. Because they are fundamentally | accounting, which involves adding things up and cross- | checking totals. | | These two parts have a very clear boundary, with strictly | defined rounding rules when the floating point | risk/trading values get turned into fixed point | accounting values. | lordgilman wrote: | Integer cents or an arbitrary precision decimal type. | shagie wrote: | Having worked on a POS system, the issue of using cents | alone is if you've got something like "11% rebate" and you | need to deal with fractional cents. | | The arbitrary precision decimal type should be the default | answer for currency until it is shown that the requirements | do not now, and at no time in the future will _ever_, require | fractional units of the smallest denomination. | | As an aside, this may be constrained by the systems that | the data is persisted into too... the Buffett Overflow is a | real thing ( https://news.ycombinator.com/item?id=27044044 | ). | _ZeD_ wrote: | python has the `decimal` module in the stdlib | tyingq wrote: | There's no one answer, but decimal counts of the smallest | unit that needs to be measured is common. Like pennies in the | US, or maybe "number of 1/10 pennies" if there's things like | gasoline tax. | bpicolo wrote: | You can use integers instead of decimal if you're using the | smallest unit. | bencollier49 wrote: | Say what you like about COBOL, but it got this stuff right. | bidirectional wrote: | Every front office finance project I have ever worked on has | used floating point, so take the dogma with a grain of salt. | It depends entirely on the context.
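The fractional-cent point from the POS rebate example above can be seen in a couple of lines of Python's stdlib `decimal` (the $24.99 price is made up, and half-up is just one of several rounding conventions a business might mandate):

```python
from decimal import Decimal, ROUND_HALF_UP

price = Decimal("24.99")
rebate = price * Decimal("0.11")   # 2.7489: a fractional-cent amount
# Plain integer cents can't hold the intermediate value; a rounding
# rule has to be chosen before converting back to whole cents.
as_cents = rebate.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(rebate, as_cents)            # 2.7489 -> 2.75
```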
| jhugo wrote: | They probably just accumulate the rounding errors into an | account and write it off periodically without even | realising why it happens. | bidirectional wrote: | No, it's just that we're in the realm of predictions and | modelling, not accounting. If you're constructing a curve | to forecast 50 years of interest rates from a limited set | of instruments, you're already accepting a margin of | error orders of magnitude greater than the inaccuracies | introduced by floating point. | | The models also use transcendental functions which cannot | be accurately calculated with fixed point, rationals, | integers etc. | jhugo wrote: | Makes sense; I wasn't aware of the meaning of "front | office" as a term of art in finance. | BruiseLee wrote: | It's not like decimal or fixed point does not suffer from | rounding errors either. In fact for many calculations, | binary floating point gives more accurate answers. | | In accounting there are specific rules that require a | decimal system, so one must be very careful with the | floating point if it is used. | dirkt wrote: | And they all suffer from rounding error problems? | | I mean, fixed point and a specific type for currency (which | also should include the denomination, while we are at it) | are not rocket science. Spreadsheets get that right, at | least. | bidirectional wrote: | Excel uses IEEE-754 floating point, so I don't get what | you mean with the spreadsheet comment. It has formatting | around this which rounds and adds currency symbols, but | it's floating point you're working with. | | Rounding error doesn't matter on these types of financial | applications. It's the less glamorous accounting work | that has to bother with that. | | They're not rocket science, but they're unnecessary, and | would still be off anyway. Try and calculate compound | interest with your fixed point numbers. | dragonwriter wrote: | > Dumb question, but what is the proper way to handle | currency?
| | In python, for exact applications (not many kinds of | modeling, where floats are probably right), decimal.Decimal | is usually the right answer, but fractions.Fraction is | sometimes more appropriate, and if you are using NumPy or | tools dependent on it, using integers (representing decimals | multiplied by the right power of 10 to get the minimum unit | in the ones position) is probably better. | trevor-e wrote: | Someone already mentioned there's a `decimal` package in | Python that's better suited for currency. Back when I was a | Java developer we used this: https://docs.oracle.com/javase/7 | /docs/api/java/math/BigDecim... | kkirsche wrote: | The Decimal class is one way if you roll your own. py-moneyed | seems to be a well maintained library though I haven't used | it. | | Disclaimer: I only work with currency in hobby projects. | thayne wrote: | An integer of the smallest denomination. For example, cents | for the American dollar. And you probably would want to wrap | it in a custom type to simplify displaying it properly, and | maybe handle different currencies. If your language has a | fixed point type that might also be appropriate, but that's | pretty rare, and wouldn't work for currencies that aren't | decimal (like the old British pound system). | biztos wrote: | Do they still use fractional cents (or whatever) in | finance? | | https://money.howstuffworks.com/personal- | finance/financial-p... | TchoBeer wrote: | What if I'm calculating sales tax? Can't use an integer | anymore. | kyllo wrote: | Yes, you can. There are algorithms for rounding up, | rounding down, rounding to nearest, and banker's | rounding, on the results of integer division. This is a | solved problem. | eulers_secret wrote: | I haven't seen anyone mention this issue for some reason, but | in fetch_tweets.py: | fetch_tweets_from_user(user_name): ...
| tweets = api.user_timeline(screen_name=user, count=200, | include_rts=False) | | 'user' isn't defined, should be user_name, right? Side note, | 'copilot' is a decent name for this (though copilots are | usually very competent, moreso than this right now). You _must_ | check the suggestions carefully. Maybe it'll make folks better | at code review, lol. | rjknight wrote: | > This product seems like it needs some more thought. Maybe a | way to comment, flag, or otherwise call out bad output? | | Wait for your colleagues to use it, fix the bad code in the | pull request, and wait for copilot to learn from the new | training data you just provided! | more_corn wrote: | This is actually a good idea that is missing from nearly | every machine learning product. How do you back propagate | lessons from user interaction into future training of the | model? It can be done, I can't think of a place I've seen it | done though. | adriancr wrote: | It would be viewed as IP theft by most companies to upload | private code to this for use by others | dgb23 wrote: | It would have to be in the same range of what is | suggested, small patches and opt in. | | If snippets are a legal problem, then Copilot is | problematic by default, since it suggests code that may | or may not be sourced from free software. | adriancr wrote: | Even free software snippets have clauses like GPL or | attribution. | | Putting GPL code in a proprietary codebase would cause a | company massive headaches... | | So I agree copilot is problematic by default, liability | to lawsuits for employers and forced open sourcing, | liability to IP lawsuits as well which will end up on | employees' shoulders. | TeMPOraL wrote: | It's tricky, because once you start accepting user | feedback, you need to _moderate_ it, or else someone will | poison your model for fun and profit. | giantg2 wrote: | But what about all the bad training data provided too?
| amelius wrote: | What are the statistics of Copilot based on a validation set? | How often does it get code right? | | I want to see hard statistics, not 4 hand-picked examples. | rubatuga wrote: | Yeah. And like how would you even devise a metric? Like | compile it down to assembly and see if it's similar logic? | amelius wrote: | Well, this is the question which the producers of the tool | should answer. | | You can't just release an ML tool onto the public if you | haven't validated it first. | mnky9800n wrote: | That's what I thought when I first started working in text | generation too. It's highly annoying people pitch their | successful models with hand picked examples. It's literally | the opposite of STATISTICAL learning imo. | foobiekr wrote: | Copilot appears to be "give more efficiency leverage to the | worst kind of coder." | codyb wrote: | Hmm... I mean, these all seem like mistakes I could make and | I don't think I'm the "worst kind of coder". | | The currency one I learned a while back, but it's not like I | intuited using integers by default. | | Value being a reserved keyword, I'm not sure I'd know that | and I do Postgres work as part of my myriad duties at the | startup I work at. Maybe I'd make that mistake in a | migration, maybe I have already. | | In a way, is it much different than what we do now as | engineers? I'm hard pressed to call it much of an engineering | discipline considering most teams I work on barely do design | reviews before they launch into writing code, documentation | and meeting minutes are generally an afterthought, and the | code review process while decent isn't perfect either and | often times relies on arcane knowledge derived over months | and years of wrangling with particular <framework, project, | technology>. | | It's pretty neat, presumably it'll learn as people correct | it, and it'll get better over time. I mean it's not even | version one.
| | I get the concerns, but I think they're a bit overblown, and | this'll be really useful for people who want to learn how to | code. Sure they'll run into some bugs, but, I mean, they were | going to do that anyways. | voakbasda wrote: | Is this any worse? Maybe not. Is it better? Absolutely not. | | This kind of tool will only further entrench the production | of mediocre, bug-ridden code that plagues the world. As | implemented, this will not be a solution; it is an express | lane in the race to the bottom. | pasquinelli wrote: | it _is_ a race to the bottom, and people are trying to | win. any skilled trade is being turned into an unskilled | job. it might suck, the results might suck, but it's | more profitable, and that's what matters. | ticviking wrote: | I'm not really sure that type of tool could really be | anything else. | | How would a model become aware of all of the various edge | cases that depend on which SQL database you use or | differences in language versions over time? | sbr464 wrote: | Can it submit pull requests to itself with if/else boolean | logic/hacks? | gmadsen wrote: | a large data set covering exactly what you just mentioned? | TeMPOraL wrote: | > _I'm not really sure that type of tool could really be | anything else._ | | It can't be, because they've chosen to use a deep learning | approach. That makes it a dead end right from the start. | | > _How would a model become aware of all of the various | edge cases that depend on which SQL database you use or | differences in language versions over time?_ | | A lot of things that we call "edge cases" are only a | problem for humans. They're not "edge cases" from the point | of view of the grammar / semantics of programming languages | and libraries. The way a hypothetical, better Copilot could | work, is by having directly encoded grammars and semantics | metadata corresponding to popular languages and tools.
It | could generate code in a principled and introspectable way, | by having a model of the computation it wants to express | and encoding it in a target language. | | Of course, such a hypothetical Copilot is a harder task - | someone would have to come up with a structure for | explicitly representing understanding of the abstract | computation the user wants to happen, and then translate | user input into that structure. That's a lot of drudgery, | and from my vague understanding of the "classical" AI | space, there might be a bunch of unsolved problems on the | way. | | Real Copilot uses DNNs, because they let you ignore all | that - you just keep shoving code at it, until the | black-box model starts to give you mostly correct answers. The | hard work is done automagically. It makes sense for some | tasks, less for others - and I think code generation is one | of those things where black-box DNNs are a bad idea. | jhgb wrote: | > The way a hypothetical, better Copilot could work, is | by having directly encoded grammars and semantics | metadata corresponding to popular languages and tools. It | could generate code in a principled and introspectable way, | by having a model of the computation it wants to express | and encoding it in a target language. | | But that sounds like too much work, let's just throw a | lot of data into an NN and see what comes out! /s | | > and introspectable | | Which most importantly means "debuggable", I assume. From | what I get there doesn't seem to be any way to ad-hoc fix | an NN's output. | heroHACK17 wrote: | This is my thought as well. I get the "make productive | engineers even more productive" angle, but productive | engineers' bottleneck isn't coding. Sure, coding up a | boilerplate Go web server is tedious, but I have done it so | many times that it takes me two seconds now. | | On the flip side, coding can be the bottleneck for the worst | kind of coder.
When I first started coding, coding was hard | simply because I had very few reps and was just learning | to understand how to code common solutions, data structures, | libraries, etc. Fast forward a few years and, if I were still | struggling to understand these concepts, Copilot would be a | lifeline. | hamandcheese wrote: | I'm gonna have to disagree - coding can and does take | significant amounts of time even when I know exactly what | problem I am solving. | | I admit that at many organizations there are so many other | factors and bottlenecks, but it's not uncommon that I find | myself 8+ hours deep into a coding task that I had expected | would be much shorter. | | On the other hand, usually that's due to refactoring or | otherwise not being satisfied with the quality of my | initial solution, so copilot probably wouldn't help... | captn3m0 wrote: | I find it is reducing my research time by providing a decent | starting solution space. Especially for boring stuff where | you just need to google the signature of some standard | library function. | majormajor wrote: | It takes what should be your method of last resort - | copypaste - and makes it the first thing you try. | | All the steps in between - looking at the docstring for the | function you're calling, googling for more general | information, looking at and _deciding not to use_ | not-applicable or poorly-written SO answers - get pushed aside. | So instead of you having to convince yourself "yes, it's | safe to copy-paste these lines from SO, they actually fit my | problem" you're presented with magic and I think the burden | for rejecting it is going to be higher once it's in your | editor than when you're just reading it on a SO post or | Github snippet. | | Even for a newcomer looking to learn, working on simple stuff | that it has great completions for, it seems like it will | sabotage your long-term growth, since it takes all the _why_ | and the reasoning out of it.
Autocomplete for a function name | isn't that relevant to gaining a deeper understanding. | Knowing _why_ a certain block of code is passed in in a | certain style, or needs to be written at all? Probably that | is. | majormajor wrote: | Thinking about it more: there's a very small subset of | problems that I think this is actually great for. And I do | run into this somewhat often: relatively new libraries or | frameworks that don't really care about thorough | documentation so they only show you a few happy path | snippets and nothing about how to do something more | interesting, so you have to bridge the gap between "this | one line in the doc obviously doesn't work with me, but I'd | like to figure it out without reading all their source code | from scratch..." - getting more example snippets barfed up | onto my screen from other people who've figured it out | before could be a sort of replacement for the library | writers having provided documentation in the first place. | But ... this is a somewhat insane way to work around a | problem of shitty code documentation, and is still | insufficient in a couple ways: | | * some poor bastard is going to have to be the first person | to figure out how to do something, so that copilot itself | can know | | * any non-code nuances around "oh, if you do that, your | memory usage is going to explode" or "oh, by the way, if | you do that, make sure you don't do your own threading" | will still fail to be communicated. | groby_b wrote: | I've called it "The Fully Mechanized Stack Overflow Brigade" | before, and everything that comes to light supports that | assessment. | | On the upside, think of the consultancy fees you can charge | to clean up those messes. | bierjunge wrote: | The Golang example would not even compile, because `sql` is not | imported. | IncRnd wrote: | That's for the best.
We don't want products that pretend to | write code for us, while copying others' code without | attribution and that may not even work. | stevelosh wrote: | The golang one also silently drops rows.Err() on the floor. | | https://golang.org/pkg/database/sql/#Rows | jacurtis wrote: | > The ruby one isn't outright terrible, but shows a very | Americanized way to do street addresses that would probably | become a problem later. | | As someone who has been coding up address storage and | validation for the past week in my current job, that one really | made me laugh. Mostly because it tries to simplify all the | stuff I have been analyzing and mulling over for a week into a | single auto-complete. | | Spoiler: The Github Copilot's solution simply won't work. It | would barely work for Americanized addresses, but even then not | be ideal. Of course trying to internationalize it, this thing | isn't even close. | | I get what Copilot is trying to do. But at the same time I | don't get it. Because from my experience, typing code is the | fastest part of my job. I don't really have a problem typing. I | spend most of my time thinking about the problem, how to solve | it, and considering ramifications of my decisions before ever | putting code in the IDE. So Copilot comes around and it | autocompletes code for me. But I still have to read what it | suggested, making edits to it, and consider if this is solving | the problem appropriately. I'm still doing everything I used to | do, except it saved me from typing out a block of code | initially. I still have to most likely rebuild, edit, or change | the function somewhat. So it just saves me from typing that | first pass. Well that's the easy part of the job. | | I have never had a manager come to me and ask why a project is | taking so long where I could answer "it just takes so long to | type out the code, i wish I had a copilot that could type it | for me". That's why we call it software engineering and not | coding.
Coding is easy. Software engineering is hard. Github | Copilot helps with coding, but doesn't help with Software | Engineering. | reaperducer wrote: | _I spend most of my time thinking about the problem, how to | solve it, and considering ramifications of my decisions | before ever putting code in the IDE. So Copilot comes around | and it autocompletes code for me. But I still have to read | what it suggested, making edits to it, and consider if this | is solving the problem appropriately._ | | So, rather than helping people program better, all it's done | is replace a bunch of the offshore cut-and-paste shops with | "AI." | neutronicus wrote: | A lot of my job is thinking hard about how to do [X], | incidentally needing to remember how to do [trivial thing Y] | and looking it up. | | Like, I did it before, remember that it was trivial, I just | forget the snippet and I have to break focus to look it up - | often by scrolling through my own commit history to try and | find the time I did [trivial thing Y] four months ago. | | I do kind of wish I could automate that. Skipping the actual | typing of the snippet is sort of gravy on top of that. | nitrogen wrote: | _It would be nice if there were a way to automate the | "remembering what that one function is called and what | order the parameters are in" portion of my job._ | | IME the best thing for this is looking at the method | listing in the docs for the classes I'm using. E.g. for | Ruby, it's usually looking at the methods in Enumerable, | Enumerator, Array, or Hash. Or I'll drop a _binding.pry_ | into the function, run it, and then type _ls_ to see what's | in scope. | greyfox wrote: | this sounds super interesting, is there a video or upload | somewhere that i can watch this being performed in real | time?
| nitrogen wrote: | I very briefly show some of the interactivity of Ruby+Pry | here: https://youtu.be/Gy7l_u5G928?t=805 (the overall | code segment starts at | https://www.youtube.com/watch?v=Gy7l_u5G928&t=626s) | | I'd be happy to hear about better demonstrations, and | there's also Pry's website (https://pry.github.io/) where | they link to some screencasts. | shados wrote: | Even in the 90s that was a solved problem in Visual Basic | with autocomplete. That a lot of dev environments "lost" | the ability to do it is mind boggling. With that said, | doesn't Rubymine let you do that with autocomplete with | the prompt giving you all the info you need? (I haven't | done Ruby in a long time). | | Still, having to look up the doc or run the code to | figure out how to type it is orders of magnitude slower | than proper auto complete (be it old school Visual Studio | style, or something like Copilot). | nitrogen wrote: | _orders of magnitude slower than proper auto complete_ | | Having worked extensively with verbose but | autocomplete-able languages like Java, compact dynamic languages like | Ruby, and a variety of others including C, Scala, and | Kotlin, I've come to the conclusion that, for me, | autocomplete is a crutch and I develop deeper | understanding and greater capabilities when I go to the | docs. IDE+Java encourages sprawl, which just further | cements the need for an IDE. Vim+Ruby+FZF+ripgrep+REPL | encourages me to design code that can be navigated | without an IDE, which ultimately results in cleaner | designs. | | If there's _any_ lag whatsoever in the autocomplete, it | breaks my flow state as well. I can maintain flow better | when typing out code than when it just pops into being | after some hundreds of milliseconds delay. Plus, there's | always the chance for serendipity when reading docs. The | docs were written by the language creators for a reason. | Every dev should be visiting them often.
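The parameter-order lookup neutronicus and nitrogen describe can also be scripted against the runtime itself; a small sketch using Python's stdlib `inspect` module (with `textwrap.fill` standing in for whatever function you forgot):

```python
import inspect
import textwrap

# Ask the runtime for a function's signature instead of breaking
# flow to search the docs for parameter names and their order.
sig = inspect.signature(textwrap.fill)
print(sig)                       # e.g. (text, width=70, **kwargs)
print(list(sig.parameters)[:2])  # ['text', 'width']
```

This is roughly the introspection that IDE autocomplete and REPL helpers like Pry's `ls` build on.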
| shados wrote: | That's totally cool but the grandparent was talking about | remembering shit they already knew. Not everyone has a | fantastic memory, and remembering whether the arguments are A | then B or B then A doesn't deepen your understanding of a | language. Most of the time the autocomplete and the | official doc use the exact same source anyway, formatted | the same way, with the same info. | | But if it works for you, more power to you! | lwhi wrote: | >Because from my experience, typing code is the fastest part | of my job. I don't really have a problem typing. I spend most | of my time thinking about the problem, how to solve it, and | considering ramifications of my decisions before ever putting | code in the IDE | | So very true. | | [1] Understanding the problem > [2] thinking about all | possible solutions > [3] working out which solution fits best | > [4] working out which implementations are possible > [5] | working out the most suitable implementation | | ... and finally, [6] implementing via code. | ph0rque wrote: | > I spend most of my time thinking about the problem, how to | solve it... | | A few years ago, I got a small but painful cut on my | fingertip. I thought I would have a hard time on the job as a | dev. To my surprise, I realized I spend 90-95% of my time | thinking, and only 5-10% of the time typing. It turned out to | be almost a non-issue. | shados wrote: | > I don't really have a problem typing. | | I'm absolutely with you and want to upvote that part of the | comment x100. Unfortunately it's often considered a fairly | spicy opinion. | | Entire frameworks (Rails) are built around the idea of typing | as little as possible. Others can't even be mentioned without | the topic of boilerplate/keystroke count causing a flame war | (Redux). | | A lot of engineers equate their value with the amount of | lines they can pump out, so there's definitely a demand for | tools like these. | | There's also some legitimate stuff.
There's a lot of very | silly things I have to google every time because I have a | bad memory. It saves the step of googling. In a way, it was | the same debate around autocomplete at the very beginning, | but pushed to the next level. Autocomplete turned out to be a | very good thing (even though new languages and tools keep | coming out without it). | theshadowknows wrote: | I never commit something that I can easily google (with a | high quality solution) to memory | [deleted] | amluto wrote: | As the owner of a fairly normal American address that is | corrupted by the UPS address validation service, this | is a good time to remind everyone: accept the address that | your customer enters. If you offer a service to try to | improve your customer's address, keep in mind that it's a | value added service, it may be wrong, and you MUST test the | flow in which your customer tells your service to accept the | address as entered. And maybe even collect examples in which | the address change is accepted to make sure it does something | useful. | | Vendors have lost sales to me because they were too | incompetent to allow me to ship things to my actual address. | Oops. | | P.S. for the US, you need to offer at least two lines for the | address part. And you need to accept really weird things that | don't seem to parse at all. I know people with addresses that | have a PO Box number and a PMB number _in the same address_. | Lose one and your mail gets lost. | | P.P.S. If you offer discounted shipping using something like | SurePost, make sure you let your customers pay a bit extra to | use a real carrier. There are addresses that are USPS-only | and there are addresses that work for all carriers except | USPS (and SurePost, etc). Let your customer tell you how to | ship to them. Do not second-guess your customer. | JamesAdir wrote: | Isn't address storage and validation a solved problem? Why is | it so complicated?
| mgsouth wrote: | Ex: | | 412 1/2 E E NE | | 412 1/2 A E | | 1E MAIN | | 1 E MAIN | | FOO & BAR | | 123 ADAM WEST RD | | 123 ADAM WEST | | 123 EAST WEST | ezfe wrote: | You are right that USPS maintains a database of canonical | delivery points. However, it's inevitable this database | might not be correct or up to date. | | If you don't want to validate, then yes addresses are just | a series of text fields. However, mapping them to that | delivery point is where the problems arise. | gameswithgo wrote: | post process with some language aware heuristics maybe | amelius wrote: | Co-pilot fixes the wrong problem. | | It should be a tool capable of one-shot learning. | | I.e., I'm in the middle of a refactoring operation and have to do | lots of repetitive work; the tool should help me by understanding | what I'm trying to do after I give it 1 example. | marcodiego wrote: | Now, consider that Quake is GPL'ed. Any proprietary software using | such code will have to bow to the license terms. | anyonecancode wrote: | I think copilot is solving the wrong problem. A future of | programming where we're higher up the abstraction tree is | absolutely something I want to see. I am taking advantage of that | right now -- I'm a decently good programmer, in the sense that I | can write useful, robust, reliable software, but I'm pretty high | up the stack, working in languages like Java or even higher up | the stack that free me from worrying about the fine details of | memory allocation or the particular architecture of the hardware | my code is running on. | | Copilot is NOT a shift up the abstraction tree. Over the last few | years, though, I've realized that the concept of typing is. Typed | programming is becoming more popular and prominent beyond just | traditional "typed" languages -- see TypeScript in JS land, | Sorbet in Ruby, type hinting in Python, etc. This is where I can | see the future of programming being realized.
An expressive type | system lets you encode valid data and even valid logic so that | the "building blocks" of your program are now bigger and more | abstract and reliable. Declarative "parse don't validate"[1] is | where we're eventually headed, IMO. | | An AI that can help us to both _create_ new, useful types, and | then help us _choose_ the best type, would be super helpful. I | believe that's beyond the current abilities of AI, but can | imagine that in the future. And that would be amazing, as it | would then truly be moving us up the abstraction tree in the same | way that, for instance, garbage collection has done. | | [1] https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t- | va... | shadowgovt wrote: | A taller abstraction tree makes tradeoffs of specialization: | the deeper the abstractions, the more one has to understand | when the abstractions break or when one chooses to use them in | novel ways. | | This is something I'm interested in regarding this approach... | When it works as intended, it's basically shortening the loop | in the dev's brain from idea to code-on-screen _without_ adding | an abstraction layer that someone has to understand in the | future to interpret the code. The result is lower density, so | it might take longer to read... Except what we know about | linguistics suggests there's a balance between density and | redundancy for interpreting information (i.e. the bottleneck | may not be consuming characters, but fitting the consumed data | into a usable mental model). | | I think the jury's out on whether something like this or the | approach of dozens of DSLs and problem-domain-shifting | abstractions will ultimately result in either more robust or | more quickly-written code. | | But on the topic of types, I'm right there with you, and I | think a copilot for a dense type forest (i.e.
something that | sees you writing a {name: string; address: string} struct and | says "Do you want to use MailerInfo here?") would be pretty | snazzy. | krick wrote: | Yeah, but generating tons of stupid verbose code that nobody | will be able to read and understand is more fun. Also, your | superiors will be sure you are a valuable worker if you write | more code. | mkl95 wrote: | Copilot is one of the worst ideas that have made it to production | in recent years. I predict it will be quite successful | considering Microsoft's track record. | dempsey wrote: | I've always wondered this about the realistic photo generators. | How do we know they're generating new faces and not just | regurgitating ingested faces? | antpls wrote: | One has to admit, Copilot raises many questions regarding global | code quality, reviewing processes and copyright. It's a marketing | success. | qayxc wrote: | Honestly, I see this exact issue as the main accomplishment of | Copilot. It shows that the black-box machines are to be | considered harmful and are incompatible with the current | intellectual property and privacy frameworks. | | This issue goes way beyond just code - imagine GPT-like systems | being used in medical diagnosis and results can suddenly depend | on the date of the CT-scan or the name of the patient, because the | black-box simply regurgitates training data... | [deleted] | nightowl_games wrote: | Almost feels like a developer cultural thing to hate on something | like this. If you don't like it, don't use it. If you don't want | your team using it, become senior and then set the rules. | | Kinda seems like maybe there's some level of insecurity at play | here in the criticism. Like an "I coulda came up with that but its | a bad idea" type of hater philosophy. | ChrisMarshallNY wrote: | I've always assumed that we would eventually have a low-code, or | no-code junior dev replacement, and was wondering if this was it. | GH and MS actually have _[Ed.
had?]_ some cred for this kind of | thing. | | Nope. Game over. Play again? | marcosdumay wrote: | Most low-code and no-code platforms go for junior dev | empowerment, and senior dev replacement. This one also seems to | be aimed at empowering juniors, but looks like it missed the | senior replacement by miles. | [deleted] | jgilias wrote: | Copying GPLed code as your own and passing it off under an MIT | license is not too far fetched of a thing for a junior dev to | do. | | Jokes aside, to have a proper junior dev replacement you need | something that is able to learn and grow to eventually become a | senior dev, an architect, or a CTO. That is the most important | value of a junior dev. Not the ability to produce subpar code. | ChrisMarshallNY wrote: | Depends on who you ask. | | I think a lot of modern software development shops, these | days, exist only to make their founder[s] as rich as | possible, as quickly as possible. | | If they are willing to commit their entire future to a | lowest-bid outsourcing shop, then I don't think they are too | concerned about playing the long game. | | Also, the software development industry, as an aggregate, has | established a pervasive culture, based around developers | staying at companies for 18-month stints. I don't think many | companies feel it's to their advantage to incubate people who | will bail out, as soon as they feel they have greener | pastures, elsewhere. | abeppu wrote: | I may be over-reading, but I think this kind of example not only | demonstrates the pragmatic legal issues, but also the fundamental | weaknesses of a solely text-oriented approach to suggesting code. | It doesn't really seem to have a representation of the problem | being solved, or the relationship between things it generates and | such a goal. This is not surprising in a tool which claims to | work at least a little for almost all languages (i.e. which isn't | built around any firm concept of the language's semantics).
| | I'd be much more excited by (and less unnerved by) a tool which | brought program synthesis into our IDEs, with at least a partial | description of intended behavior, especially if searching within | larger program spaces could be improved with ML. E.g. here's an | academic tool from last year which I would love to see | productionized. https://www.youtube.com/watch?v=QF9KtSwtiQQ | computerex wrote: | I think it's pretty clear that program synthesis good enough to | replace programmers requires AGI. | | This solely text based approach is simply "easy" to do, and | that's why we see it. I think it's cool and results are | intriguing but the approach is fundamentally weak and IMO | breakthroughs are needed to truly solve the problem of program | synthesis. | whimsicalism wrote: | > fundamental weaknesses of a solely text-oriented approach to | suggesting code. | | I don't think it is clear that such "fundamental weaknesses" | exist. A text-based approach can get you incredibly far. | abeppu wrote: | I mean, the cases where it tries to assign copyright to | another person in a different year highlight that context | other than the other text in the file is semantically | extremely important, and not considered by this approach. | Merely generating text which looks appropriate to the model | given surrounding text is ... misguided? | | If you think about it, program synthesis is one of the few | problems in which the system can have a perfectly faithful | model of the dynamics of the problem domain. It can run any | candidate it generates. It can examine the program graph. It | can look at what parts of the environment were changed. To | leave all that on the table in favor of blurting out text | that seems to go with other text is like the toddler who | knows that "five" comes after "four", but who cannot yet | point to the pile of four candies. You gotta know the | referents, not just the symbols. No one wants a half-broken | Chinese Room.
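abeppu's point that a synthesizer can actually _run_ its candidates is easy to illustrate; a toy enumerate-and-test sketch in Python (the candidate space here is invented for the example and has nothing to do with Copilot's internals):

```python
import itertools

# Goal behavior given as input/output examples: x -> 2*x + 1.
EXAMPLES = [(1, 3), (2, 5), (10, 21)]

def synthesize(examples, max_const=3):
    """Enumerate candidate programs and keep one whose *executed*
    behavior matches every example: checking referents, not symbols."""
    candidates = [("x + {}", lambda x, c: x + c),
                  ("x * {}", lambda x, c: x * c),
                  ("x * 2 + {}", lambda x, c: x * 2 + c)]
    for (template, run), c in itertools.product(candidates,
                                                range(max_const + 1)):
        if all(run(x, c) == y for x, y in examples):
            return template.format(c)
    return None

print(synthesize(EXAMPLES))  # x * 2 + 1
```

Real synthesis tools search vastly larger spaces and prune with types or ML, but the feedback loop of executing candidates is exactly the part a purely textual model leaves on the table.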
| whimsicalism wrote: | > generating text which looks appropriate to the model | given surrounding text is ... misguided? | | Agreed - it represents a failure to adequately | model/understand the task, but I don't think it is a | "fundamental weakness" of text-based 'Chinese room' | approaches. | | > You gotta know the referents, not just the symbols. No | one wants a half-broken Chinese Room. | | "Knowing the referents" is not at all clearly defined. It's | totally possible that, under the constraint of optimizing | for next-word prediction, the model could develop an | understanding of what the referents are. | | You can't underestimate the level of complex behavior | emerging from a big enough system under optimization. After | all, all the crazy stuff we do - coding, art, etc. is | produced by a system under evolutionary optimization | pressure to make more of itself. | abeppu wrote: | > "Knowing the referents" is not at all clearly defined. | It's totally possible that, under the constraint of | optimizing for next-word prediction, the model could | develop an understanding of what the referents are. | | Well, in this case, it would have been good to understand | that "V. Petkov" is a person unrelated to the project | being written, and that "2015" is a year and not the one | we're currently in. Sometimes the referent will be a | method defined in an external library, which perhaps has | a signature, and constraints about inputs, or properties | which apply to return values. | | > You can't underestimate the level of complex behavior | emerging from a big enough system under optimization. | After all, all the crazy stuff we do - coding, art, etc. | is produced by a system under evolutionary optimization | pressure to make more of itself. | | I think this can verge into a kind of magical thinking. | Yes, humans also look like neural nets, and we might even | be optimizing for something. 
But we learn to program (and | we do our best job programming) by having a goal for | program behavior, and we use interactive access to try to | run something, get an error, set a break point, try | again, etc. I challenge anyone to try to learn to "code" | by never being given any specific tasks, never | interacting with docs about the language, an interpreter, | a compiler, etc, but merely to try to fill in the blank | in paper code snippets. You might learn to fill in some | blanks. I highly doubt you would learn to code. | | This is totally a case where the textual representation | of programs is easier to get and train against, and that | tail is being allowed to wag the dog to frame both the | problem and the product. | | None of this is to say that high-bandwidth DNN approaches | don't have a place here -- but I think we should be | looking at language-specific models where the DNN | receives information about context (including some | partial description of behavior) and outputs of the DNN | are something like the weights in a PCFG that is used in | the program search. | guhayun wrote: | Copilot NEEDS to be trained on licensed code, so that it doesn't | reproduce it | _tom_ wrote: | I'm reminded of the old saying: | | The best person to have on your team is a productive, high- | quality coder. | | The worst is a productive, low-quality coder. | | Copilot looks like it would give us more of the latter. | ezoe wrote: | Similar story. | | He tried to write a quine in Ruby, and ended up conjuring up a | copyright claim comment and a fake licensing term. | https://twitter.com/mametter/status/1410459840309125121 | gok wrote: | Quake and GitHub are both owned by Microsoft now, perhaps we can | assume this is a relicense? | Jyaif wrote: | Wow, Quake _is_ owned by Microsoft. This is mind blowing, and a | little sad. | josefx wrote: | It belongs to id software -> ZeniMax -> Xbox Game Studios -> | Microsoft.
| johndough wrote: | Is it possible that Copilot just put Quake's source code into | the public domain? | | From the Copilot FAQ: Who owns the code | GitHub Copilot helps me write? GitHub Copilot is a | tool, like a compiler or a pen. The suggestions GitHub | Copilot generates, and the code you write with its help, belong | to you, and you are responsible for it. We recommend | that you carefully test, review, and vet the code, as you would | with any code you write yourself. | | Copilot can probably recite most of Quake's source code and, | according to the FAQ, the output of Copilot belongs to the | user. | | I think a point where this argumentation might fail is that | Quake's source code does not belong to Github directly, but | instead both Github and Quake belong to Microsoft. However, I | am not a lawyer, so I might be wrong. | [deleted] | sdevonoes wrote: | Not a problem. Just don't use Copilot :) | hi41 wrote: | I saw the gif on Twitter. Sorry, I am not able to understand what | is going on. Is copilot a character in the Quake game? | dpassens wrote: | Copilot seems to be an AI tool to generate code for you[0]. In | the gif, it's copying code from Quake, which is GPLv2 or later. | If copying GPLed code wasn't bad enough, it then adds an | MIT-like license header. | | [0] https://copilot.github.com/ | yk wrote: | So, what would happen if I train a neural network to recreate | Disney movies? | avaldes wrote: | Isn't it already? | oiu45hunegn wrote: | This reminds me of an issue that came up when I was working with | an intelligence agency, training machine translation. | | If you think about language in general, individual words aren't | very sensitive. The word for bomb in any language is public | knowledge. But when you start getting to jargony phrases, some | might be unique to an organization.
And if you're training your | MT on translated documents surreptitiously intercepted from West | Nordistan's nuclear program, and make your MT model public, the | West Nordistanis might notice - "hey, this accurately translates | our non-public documents that contain rather novel phrases ... I | think someone's been listening to us!" | dredmorbius wrote: | Backstory? | | WTF is "Copilot"? | gregsadetsky wrote: | It's a new product launched by GitHub in association with | OpenAI | | https://news.ycombinator.com/item?id=27676266 | loloquwowndueo wrote: | "Your AI pair programmer" - auto-completes entire functions | while you're coding. https://copilot.github.com/ | dredmorbius wrote: | Thanks. | klohto wrote: | I'm really dumbfounded by the Copilot team's decision to not | exclude GPL licensed code. | | Why was this direction chosen? Is the inclusion of GPL really | worth the risk and a potential Google v. Oracle lawsuit? I'd like | to know the reasoning. | throwaway287391 wrote: | Isn't it entirely possible that they _did_ exclude GPL licensed | code, but somebody somewhere has violated copyright and | copy-pasted that snippet into non-GPL-licensed code that they | trained on? | | They could try to trace every single code snippet they train on | to its "true source" and use the license for that, but that's | not very well-defined, and is a lot harder, and it's never | going to be 100%. | another-dave wrote: | Which raises another question: ideally Copilot wouldn't be | trained on "somebody somewhere", but is that happening? | | To use the old trope -- if the majority of programmers can't | implement Fizzbuzz, but they do have a Github profile, are | they being included too? | | Hopefully there's some quality bar for the training set, i.e. | some subset of "good" code (e.g. release candidate tags from | fairly established OSS tools/frameworks in different | languages) rather than any old code on the internet. | ttt0 wrote: | Nope. They did include GPL code.
| | > Once, GitHub Copilot suggested starting an empty file with | something it had even seen more than a whopping 700,000 | different times during training -- that was the GNU General | Public License. | | https://docs.github.com/en/github/copilot/research- | recitatio... | pessimizer wrote: | Looks like Copilot is smart enough to understand its own | licensing situation. It should continue to suggest this for | any empty file. | goodpoint wrote: | Apache / MIT / BSD all have restrictions e.g. attribution | clause. | | Excluding GPL does not solve the problem. | Anon1096 wrote: | Why would excluding GPL'd code be enough to not violate | licenses? I don't understand why people think MIT or other | licenses are free for alls to take code as they wish. The MIT | license includes an attribution clause. And, as the linked | video shows, Copilot is more than happy to take its code and | put your pet license and copyright notice on instead. Isn't | that equally as infringing as stealing GPL code? The idea of | mining GitHub for training data was doomed from the start | copyright-wise, as there's so much code that's misattributed, | wrongly-licensed, or unlicensed. | NavinF wrote: | Has anyone ever been sued IRL for using MIT/Apache/... code? | Or are we stuck in imaginary land where this is something to | be worried about? | | Btw the GPLv2 death penalty is rather unique and I don't | think anyone will deny that including GPL code in proprietary | code is a hell of a lot worse in every way (liability, | ethically, etc) than including permissively licenced code and | forgetting to attribute it | ghaff wrote: | At some level though, this suggests that the only way to be | safe if you're writing a program (outside of a Copilot | context) is probably simply not to look at GitHub (or maybe | Stack Overflow and other code sources) except for, perhaps, | using properly attributed entire functions. 
If you take a | couple lines of code and tweak it a bit are you now required | to attach copyright attribution? IANAL, but I'm guessing not. | aj3 wrote: | Copilot is a tool. If you take copilot's suggestions | uncritically and push them to Github, - that's on you. | croes wrote: | Yeah, because I always check the code of my programming | partner for license violations. | | That's more Trainee than Copilot. | aj3 wrote: | If you use it as a programming partner it will simply | autofill whatever you're writing line-by-line. You're not | forced to use code completion at a whole-function level | and it's not even the suggested use-case. | GhostVII wrote: | Sure but if you have to audit every suggestion to see if it | violates copyright laws that's not a particularly useful | tool. | aj3 wrote: | Depends. If you find useful code on Github, Stack | Overflow or anywhere else in the internet, you still need | to check whether it is suitable with your licensing or | not. | TeMPOraL wrote: | If you find useful code on Github or StackOverflow, you | can check for the license directly there, or you can try | to find where it was copied from, and look for a license | there. | | Copilot isn't copying, it's regurgitating patterns from | its training dataset. The result may be subject to a | license you don't know about, but modified enough that | you won't find the original source. The result can be a | _blend_ of multiple snippets with varying licenses. And | there 's no way to extract attribution from Copilot - DNN | models can give you an output for your input, they can't | tell you which exact parts of the training dataset were | used to generate that output. | FemmeAndroid wrote: | But Copilot won't accurately tell you if it's directly | copying code, and if so what the license is. If it | provides MIT licensed code that I then need to include, | how do I know that? Do I need to search for each set of | lines of code it provides on GitHub? 
| | When a person gets code from another source on the | internet, they generally know where the code has come | from. | aj3 wrote: | In a real world scenario you wouldn't be mindlessly | pressing Tab right after linebreak and accepting the | first suggestion that comes your way. While entertaining, | nobody gets paid to do that. | | What you get paid is to write your own code. When you | write your own code, generally you think first and then | type. Well, with Copilot you think first and then start | typing a few symbols before seeing automatic suggestions. | If they are right, you accept changes and if they happen | to be similar to any other code out there, you deal with | it exactly the same as if you typed those lines yourself. | user-the-name wrote: | But it is not the same as if you typed it yourself. | | If you happen to type code that is similar to copyright | code, that is generally considered legally OK. | | If you copypaste copyrighted code, that is not legally | OK. | | If you accept that same code from an autocomplete tool, | that can easily be seen as equivalent to the latter case | rather than the former. | user-the-name wrote: | Then name a usage of the tool that is legally sound. I can | not think of one. | aj3 wrote: | Code completion that can suggest the whole line instead | of a single word (e.g. often it guesses function | parameters and various math operations when you haven't | even typed function name yet). | summerlight wrote: | At least that will reduce the chance of license violation as | well as make a good legal argument for any uncovered | violations as "unintentional" incidents. | [deleted] | LeicaLatte wrote: | Curious if Microsoft is training Co-Pilot on my private | repositories. | bencollier49 wrote: | This does make me wonder if this is susceptible to the same form | of trolling as that MS AI got. Commit a load of grossly offensive | material to multiple repos, and wait for Copilot to start | parroting it. 
I think they're going to need some human | moderation. | lawl wrote: | Way better. It's susceptible to copyright trolling. | | Put up repos with snippets for things people might commonly | write. Preferably use javascript so you can easily "prove" it. | Write a crawler that crawls and parses JS files to search for | matching stuff in the AST. Now go full patent troll, eh, i mean | copyright troll. | handrous wrote: | 1) Write a project heavily using Copilot (hell, automate it | and write thousands of them, why not?) | | 2) AGPL all that code. | | 3) Search for large chunks of code very similar to yours, but | written after yours, licensed more liberally than AGPL. | Ideally in libraries used by major companies. | | 4) Point the offenders to your repos and offer a "convenient" | paid dual-license to make the offenders' code legal for | closed-source use, so they don't have to open source their | entire product. | | 5) Profit? | armatav wrote: | 6) Arms race with someone who trained an obfuscation | version that goes through your AGPL code and tweaks it to | not be in violation. | SSLy wrote: | I love living in cyberpunk already. | gruez wrote: | Offensive code is the least of my worries. What about | vulnerable/exploitable code? | tjpnz wrote: | Given that code is easier to write than it is to read this | one is troubling. | | I certainly wouldn't want to be using this with languages | like PHP (or even C for that matter) with all the decades of | problematic code examples out there for the AI to learn from. | macNchz wrote: | This was my first thought when reading about Copilot...it | feels almost certain that someone will try poisoning the | training data. | | Hard to say how straightforward it'd be to get it to produce | consistently vulnerable suggestions that make it into | production code, but I imagine an attacker with some | resources could fork a ton of popular projects and introduce | subtle bugs. 
The sentiment analysis example on the Copilot | landing page jumped out to me...it suggested a web API and | wrote the code to send your text there. Step one towards | exfiltrating secrets! | | Never mind the potential for plain old spam: won't it be fun | when growth hackers have figured out how to game the system | and Copilot is constantly suggesting using their crappy, | expensive APIs for simple things!? Given the state of Google | results these days, this feels like an inevitability. | joe_the_user wrote: | Targeted attacks to elicit output only in a given context | are generally possible with AIs. And here, writing an | implementation of a difficult and vulnerable process seems | easy. Bad implementations of various hard things become | common 'cause people cut and paste the code without looking | closely since they don't understand it anyway. | | //Implement elliptic cryptography below | | //Sanitize input for SQL call below | | Etc. | bencollier49 wrote: | Yep, trivial to implement as an attack. | guhayun wrote: | Just ask it to prioritize safety | littlestymaar wrote: | 1- re-upload all the shell scripts you can find, after having | inserted `rm -rf --no-preserve-root /` every other line | | 2- ... | | 3- profit | raffraffraff wrote: | Perhaps they think that any code that passed a review and got | merged = human moderated | NullPrefix wrote: | Coding with Adolf? | heavyset_go wrote: | Jojo Rabbit except Adolf is in the cloud and not in a kid's | imagination. | mhh__ wrote: | YC submission when? | tyingq wrote: | Just the MP4, since it's hard to read in the smaller size: | https://video.twimg.com/tweet_video/E5R5lsfXoAQDRkE.mp4 | hawski wrote: | Copilot may do more to move open source projects out of GitHub | than the message that Microsoft is the buyer. Now you can host | the code on GitHub to get your license violated, or DMCA-ed in | the long run, when your code will become a part of some big | proprietary project.
At least it makes me think about my choice | for code hosting more than whatever happened before. | woah wrote: | It looks like the author of the linked tweet intended for it to | reproduce the Quake code, by using the exact same function name | and comment. Whatever the merits of CoPilot, in this case the | human intended to write the quake function into their file, and | put the wrong license on it. | king_magic wrote: | Yep, Copilot is insanely poorly thought out. Astonishing they'd | release something as half-baked as this. | [deleted] | toss1 wrote: | License Laundering | | Like Money Laundering for cash equivalents, or Trust-Washing for | disinformation, but for hijacking software IP. | | It might not be the intended use case, but that winds up being | the practical result. | | (on a related note, it would make me want to run GPT-* output | through plagiarism filters, but maybe they already do that before | outputting?) | rasz wrote: | "0.1% of the time" indeed | unknownOrigin wrote: | I'm honestly kinda amazed this is as upvoted here as it is. | Typically anything ML-related is upvoted to the top positions and | any dissent harshly ridiculed. Anyways... it appears those who | thought about this as if it was a glorified code search engine | were close to being right. | qayxc wrote: | I still don't think it's _just_ a glorified code search engine. | | Context-sensitive data retrieval is undoubtedly a part of it, | though, and the question is how big and relevant is that part | and what are the consequences? | | To me the biggest issue is that it's impossible to tell whether | the suggestions are verbatim reproductions of training material | and thus problematic. | | It goes to show that this tool and basically every tool relying | on the same or similar technology must now be assumed to do | this, and thus any code suggestion must be regarded as plagiarism | until proven otherwise.
As a consequence such tools are now | off-limits for commercial or open source development... | coolspot wrote: | Time to write a GPL-licensed Win32 and Win64 -compatible OS with | the help of CoPilot... | fencepost wrote: | So what I'm reading here is that "Tay for code" is maybe going to | need to be rethought and perhaps trained differently? | 0-_-0 wrote: | This is a very famous function [0] and likely appears multiple | times in the training set (Google gives 40 hits for GitHub), | which makes it more likely to be memorized by the network. | | [0]: | https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overv... | 0-_-0 wrote: | It's worth keeping in mind that what a neural network like this | (just like GPT3) is doing is generating the most probable | continuation based on the training dataset. Not the _best_ | continuation (whatever that means), simply the most likely one. | If the training dataset has mostly bad code, the most likely | continuation is likely to be bad as well. I think this is still | valuable, you just have to think before accepting a suggestion | (just like you have to think before writing code from scratch | or copying something from Stack Overflow). | abecedarius wrote: | > the most probable continuation based on the training | dataset | | This is not wrong, but it's easy to misread it as implying | little more than a glorified Markov model. If it's like | https://www.gwern.net/GPT-3 then it's already significantly | cleverer, and so you should expect to sometimes get the kind | of less-blatant derivation that companies aim to avoid using | a cleanroom process or otherwise forbidding engineers from | reading particular sources. | dematz wrote: | I have no idea how this or GPT3 works or how to evaluate | them, but couldn't you argue that it's working as it should? | You tell copilot to write a fast inverse square root, it | gives you the super famous fast inverse square root. It'd be | weird and bad if this _didn 't_ happen. 
| | As far as licenses go, idk. Presumably it could delete | associated comments and change variable names or otherwise | obscure where it's taking code from. Maybe this part is | shady. | tirpen wrote: | Maybe I could build a robot that goes out in the city and | steal cars. | | As far as licenses go, idk. Presumably it could delete the | number plate and repaint the car or otherwise obscure where | it's taking the car from. Maybe this part is shady. | | Maybe. | 0-_-0 wrote: | > couldn't you argue that it's working as it should? | | Let's say that it's doing exactly what it was trained to | do. | bee_rider wrote: | In particular, fast approximate inverse square root is an x86 | instruction, and not a super new one. I'd be surprised if it | wasn't in every major instruction set. | | This is an interesting issue. I suspect training on datasets | from places like Github would be likely to provide lots of | "this is a neat idea I saw in a blog post about how they did | things in the 90's" codes. | LeanderK wrote: | I think the problem might be in the training data. Famous code | examples are probably copied a lot and therefore appear multiple | times in the training data, prompting the neural network to | memorise it completely. | avian wrote: | Famous code examples are also much more likely to be noticed. | For all I know, the thing might be spewing random GPL'd code | from the long tail of GitHub all the time and nobody notices | because it was written by some random guy and not John Carmack. | mkl wrote: | Carmack didn't write this: | https://en.wikipedia.org/wiki/Fast_inverse_square_root | LeanderK wrote: | Well, it's sure speculation on my part what the root cause | is, but i think OpenAI is already trying to ensure the | network generalises. It's just common behaviour for neural | network to memorise frequent samples, so I think my guess is | quite realistic. I don't think OpenAI would not notice large- | scale memorisation in their model. 
But as long as they don't | publish more details it's just guesswork. | | Just keep in mind that it's a statistical tool. You can't | really formally prove that it won't memorise, but I think | with enough work you can get it unlikely enough that it won't | matter. It's their first iteration. | visarga wrote: | Hash 10-grams and make a bloom filter. It will not generate | more than 10 GPLed tokens from a source. | deckard1 wrote: | Also the Pareto principle. 80% of code is shit that you _don | 't_ want to copy. The vast majority of github is awful hacks | and insecure code that should not be touched with a ten foot | pole. | cblconfederate wrote: | Is this function used verbatim in multiple projects? I know | it's famous but how often does one use an approximation of | inverse sqrt instead of the readily available cpu call in the | past 20 years | ocdtrekkie wrote: | Probably an excellent reminder that both Google and Microsoft | decided to use your private emails for a training set to create | Smart Reply behavior that can "write emails for you", and they | swore up and down there's no way that could ever leak private | information. | | We need legislation banning companies from ingesting data into AI | training sets without explicit permission. | nebuke wrote: | Makes me wonder if github are using private repos in their | training data. | ocdtrekkie wrote: | GitHub clearly stated they only used publicly available repos | in this project. However, as many people are rightfully | pointing out, those projects might still be either closed | source or copylefted, and if Copilot regurgitates chunks of | those projects, people who use it may be subject to | infringement lawsuits in the future. | grawprog wrote: | I'm not surprised to be honest. I've played around with AI | dungeon, which also uses GPT-3. It regularly reproduces content | directly from its training material, including even comments | attached to the stories they trained the ai on. 
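visarga's filter idea above can be sketched concretely. This is a toy, single-hash variant (a real Bloom filter would use several hash functions, and `index_tokens`, `looks_recited`, and the sizes are all hypothetical names and parameters invented for illustration): every overlapping 10-token window of the training corpus sets a bit, and generated output is flagged whenever one of its windows hashes to a set bit.

```c
#include <stdint.h>

#define NGRAM 10
#define FILTER_BITS (1u << 20)              /* ~1M bits: toy-sized */

static uint8_t filter[FILTER_BITS / 8];

/* FNV-1a hash over the concatenated tokens of one n-gram */
static uint64_t hash_ngram(const char **tok, int n) {
    uint64_t h = 1469598103934665603ull;
    for (int i = 0; i < n; i++) {
        for (const char *p = tok[i]; *p; p++) {
            h ^= (uint8_t)*p;
            h *= 1099511628211ull;
        }
        h ^= 0xffu;                         /* token separator */
        h *= 1099511628211ull;
    }
    return h;
}

static void set_bit(uint64_t h) {
    filter[(h % FILTER_BITS) / 8] |= (uint8_t)(1u << (h % 8u));
}

static int get_bit(uint64_t h) {
    return (filter[(h % FILTER_BITS) / 8] >> (h % 8u)) & 1u;
}

/* Training pass: index every 10-token window of a licensed source file. */
static void index_tokens(const char **tok, int count) {
    for (int i = 0; i + NGRAM <= count; i++)
        set_bit(hash_ngram(&tok[i], NGRAM));
}

/* Generation pass: nonzero if any 10-token window was seen while indexing. */
static int looks_recited(const char **tok, int count) {
    for (int i = 0; i + NGRAM <= count; i++)
        if (get_bit(hash_ngram(&tok[i], NGRAM)))
            return 1;
    return 0;
}
```

False positives shrink with filter size and extra hash functions, but the scheme only catches verbatim windows: rename the variables or reflow the whitespace before tokenizing and the recitation slips through, which is why exact-match filtering alone doesn't solve the attribution problem.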
| Tarucho wrote: | Is Copilot aimed at programmers or at non-technical hiring | managers? | | I mean, it goes right away with the devaluing narrative of | programming that is going around from the last couple of years. | To the "anyone can code" narrative we are adding "more so, if | they have AI assisted Copilot" | KoftaBob wrote: | Seems like this is less of an "AI that intelligently generates | code based on context given" and more of a "google search | autocomplete for code". | thinkingemote wrote: | From the GPLv2 licensed code: | | https://github.com/id-Software/Quake-III-Arena/blob/master/c... | | copilot repeats it word for word almost, including comments, and | adds an MIT like license up the top | arksingrad wrote: | I guess this confirms John Carmack to be an AI | OskarS wrote: | Apparently Carmack was not the original author, the origin I | believe is SGI somewhere in the deep dark 90s. | Haga wrote: | Was a optimization for a fluid simulation originally.. | Fordec wrote: | I get why some people were saying it made them a better | programmer. Of course it did, it's copy-pasting Carmack code. | Thomashuet wrote: | Actually the indentation of the first comment and the lack of | preprocessor show it's not copied from this code directly but | from Wikipedia (https://en.wikipedia.org/wiki/Fast_inverse_squa | re_root#Overv...) So It could be that the Quake source code is | not part of the training set but the Wikipedia version is. | SamBam wrote: | While I strongly doubt they would use Wikipedia as a training | set, has anyone done a search of GitHub code to see if other | projects have copied-and-pasted that function from Wikipedia | into their more-permissive codebases? | edgyquant wrote: | It's GPT though and the GPT models were trained on data | from Wikipedia | ajayyy wrote: | It is probably based off GPT-3 with a layer on top trained | for programming specifically, like what is done with AI | dungeon. 
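For reference, this is the function the whole thread is about: the fast inverse square root from id Software's GPLv2 Quake III release, which Copilot reproduces near-verbatim, profane comments included. The only change below is `int32_t` in place of the original `long` (which was 32 bits on the compilers of the era) so the bit trick still works on LP64 platforms; the pointer casts remain technically undefined behavior under strict aliasing, exactly as in the original.

```c
#include <stdint.h>

float Q_rsqrt( float number )
{
    int32_t i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( int32_t * ) &y;                    // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

    return y;
}
```

The magic constant produces an initial guess that a single Newton-Raphson step refines to within roughly 0.2% relative error, e.g. `Q_rsqrt(4.0f)` returns approximately 0.499 rather than an exact 0.5.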
| an_opabinia wrote: | Wait until people on the toxic orange site find out what | has happened to AI Dungeon. | SamBam wrote: | I'm out of the loop. | grawprog wrote: | https://gitgud.io/AuroraPurgatio/aurorapurgatio | | https://www.reddit.com/user/non-taken-name | zxzax wrote: | I don't get it, that seems like standard fare for an | R-rated movie? And then it seems like some complained | because they decided to start editing it down to a PG-13 | movie? | grawprog wrote: | Essentially, from my understanding, there was a data leak | they never commented on, they instituted a poorly made | content filter without saying anything. The filter | frequently has false positives and negatives, someone | discovered they trained the game using content the filter | was designed to block, meaning the ai itself would | frequently output filter triggering stuff, more people | found out their private unpublished stories were being | read by third parties after a job ad and the stories were | posted on 4Chan, people recognized stories they wrote | that had triggered the filter that were posted, and then | they started instituting no warning bans. | | I might have missed something, but that's the gist of it. | armatav wrote: | It's pre-trained, partially, on Wikipedia. GPT-2 did this | sort of thing all the time: native to the architecture to | surface examples from the fine-tuning training set by | default. | bootlooped wrote: | Almost 2000 results for one of the comment lines. I'm not | going to read through those or check the licenses, but I | think it's safe to say that block of code exists in many | GitHub code bases, and it's likely many of those have | permissive licenses. Given how famous it is (for a block of | code) it's not unexpected. | | https://github.com/search?q=%22evil+floating+point+bit+leve | l... 
| | A question that popped into my head is: if the machine sees | the same exact block of code hundreds of times, does that | suggest to it that it's more acceptable to regurgitate the | entire thing verbatim? Not that this incident is totally | 100% ok, but if it was doing this with code that existed in | only a single repo that would be much more concerning. | Animats wrote: | _if the machine sees the same exact block of code | hundreds of times, does that suggest to it that it 's | more acceptable to regurgitate the entire thing | verbatim?_ | | From a copyright standpoint, quite possibly. This is | called the "Scenes a faire" doctrine. If there are some | things that have to be there in a roughly standard form | to do a standard job, that applies. | | [1] | https://en.wikipedia.org/wiki/Sc%C3%A8nes_%C3%A0_faire | nojito wrote: | It's up to the end user to accept the suggestions. | user-the-name wrote: | And it is completely impossible for the user to do so. | | So, the tool is worthless if you want to use it legally. | nojito wrote: | Doubtful. | | You can be almost certain it's being widely used or will be | widely used shortly. | | The conversations around copilot are eerily similar to the | conversations around the first autocomplete tools | gnulinux wrote: | It's more like a writer using an autocomplete tool to | write the first chapter to their novel. | caconym_ wrote: | As someone who gets paid to write code (nominally) and | has also written a few novels, I don't agree with this | characterization. From what I've seen of Copilot, it's | more like having a text editor generate your next | sentence or paragraph^[1]. The idea (as I see it) is that | you might use it to generate some prose "boilerplate", | e.g. environmental descriptions, and hack up the results | until you're satisfied. | | It's content generation at a fragmentary level where each | "copied" chunk does not form a substantive whole in the | greater body of the new work. 
Even if you were training | it on other authors' works rather than just your own, as | long as it wasn't copying _distinctive_ sentences | wholesale, I think there 's a strong argument for it | falling under fair use--if it's even detectable. | | On the other hand, if it regurgitated somebody else's | paragraph wholesale, I don't think that would be fair | use. Somewhere in-between is where it gets fuzzy, and | really interesting; it's also where internet commenters | seem to prefer flipping over the board and storming out | convinced they're _right_ to exploring the issues with a | curious and impartial mind. I see way too much unreasoned | outrage and hyperbolic misrepresentation of the Copilot | tool in these threads, and it 's honestly kind of | embarrassing. | | As far as this analogy goes, it's worth noting that the | structure of a computer program doesn't map onto the | structure of a piece of fiction (or any work of prose) in | a straightforward way. Since so much of code _is_ | boilerplate, I would (speculatively, in the copyright law | sense) actually give more leeway to Copilot in terms of | absolute length of copied chunks than I would for a prose | autocompleter. For instance, X program may be licensed | under the GPL, but that doesn 't mean X's copyright | holder(s) can sue somebody else because their program | happened to have an identical expression of some RPC | boilerplate or whatever. It would be like me suing | another author because their work included some of the | same words that mine did. | | ^[1] At least one tool like this (using GPT-3) has been | posted on HN. At this point in time I wouldn't use it, | but I have to admit that it was sort of cool. | user-the-name wrote: | That does not seem like a response to what I just said? | | I said that it is impossible for the user to check that | the code copilot gives is OK, license-wise, and | therefore, they can not be sure that it is legally OK to | include in any project. 
| freshhawk wrote: | And it's up to the end user to evaluate the tool that makes | the suggestions. | flatiron wrote: | As someone who does code reviews the thought the developer | didn't code the code submitted to be merged never would cross | my mind. | croes wrote: | Good luck checking every code line for license violations | duckmysick wrote: | SaaS idea: code linter, but for licenses. | adrianN wrote: | Extend Fossology: https://www.fossology.org/ | SahAssar wrote: | That's one of blackduck's offerings: | https://www.synopsys.com/software-integrity/open-source- | soft... | | At a previous job we had a audit from them, it seemed to | not be too accurate but probably good enough for | companies to cover their asses legally. | dr_kiszonka wrote: | There will be a VSCode extension for that. | TheDong wrote: | It's impossible to automate checking for code license | violations. | | If you and I write the exact same 10 lines of code, we | both have independent and valid copyrights to it. Unlike | patents, independent derivation of the same code _is_ a | defense for copyright. | | If I write 10 lines of code, publish it as GPL (but don't | sign a CLA / am not assigning it to an employer), and | then re-use it in an MIT codebase, I can do that because | I retained copyright, and as the copyright holder I can | offer the code under multiple incompatible licenses. | | There's no way for a machine to detect independent | derivation vs copying, no way for the machine to know who | the original copyright holder was in all cases, and | whether I have permission from them to use it under | another license (i.e. if I email the copyright holder and | they say 'yeah, sure, use it under non-gpl', it suddenly | becomes legal again)... | | It's not a problem computers can solve 100% correctly. | croes wrote: | Same trust issue | atatatat wrote: | It's people for your lawyers to blame, all the way down! 
| | /s | croes wrote: | It's the same problem as with self-driving cars: who gets | sued? The company that provides the service/car or the | programmer/driver? I think the latter. | rmorey wrote: | This exact code is all over github, >1k hits | | https://github.com/search?q=%22i++%3D+%2A+%28+long+%2A+%29+%... | ajklsdhfniuwehf wrote: | that will make a great defense at a copyright court. | | "your honor, i would like to plead not guilty, on the basis | that i just robbed that bank because i saw that everyone was | robbing banks in the next city" | | ...on the other hand, that was the exact defense tried for | the capitol rioters. So i don't know anything anymore. | mattowen_uk wrote: | With apologies to Martin Niemöller[1]: First the | automation came for the farmers, and I did not speak out -- | Because I was not a farmer. Then the automation came | for the factory workers, and I did not speak out -- | Because I was not a factory worker. Then the | automation came for the accountants, and I did not speak out -- | Because I was not an accountant. Then the automation | came for me (a programmer) -- and there was no one left | to speak for me. | | --- | | [1] https://en.wikipedia.org/wiki/First_they_came_... | flatiron wrote: | Honestly I've automated a large chunk of my day job. The trick | is keeping it secret! | mattowen_uk wrote: | Well... wait until all the programmer salaries crash to | minimum wage because Management believe that "CoPilot does | most of the work anyway". | Rooster61 wrote: | Then wait for them to realize how brittle the code is when | nobody is considering the context into which this code is | being foisted. They'll TRIPLE our salaries! :D | timdaub wrote: | I felt the need to write an article about this whole situation | too: | | "Built on Stolen Data": | https://rugpullindex.com/blog#BuiltonStolenData | GenerocUsername wrote: | A closed beta with only a few previews out in the wild has bugs. | Unbelievable.
| | I cannot believe GitHub would do this. | rozularen wrote: | Guess someone had to try it. | mouzogu wrote: | I can't even get intellisense to work correctly half the time. | [deleted] | swiley wrote: | So I guess we can just copy around copyrighted source now? Great! | Now we can share all the proprietary driver and DSP code from | Qualcomm. | supernintendo wrote: | I wonder when someone will try to use the "it came from | Copilot" defense to get away with stealing copyrighted code. | louthy wrote: | This is utterly damning. I have already instructed my team that | Copilot can never be used for our projects. Compromising the | product because of unknowable license demands isn't acceptable in | the professional world of software engineering. | | But if we put the licensing to one side for a moment... | | 1/ Everything I've seen it generate so far is 'imperative hell'. | It is practically a 'boilerplate generator'. That might be useful | for pet projects, smaller code bases, or even unit-test writing. | But large swathes of application code looking like the examples | I've seen so far is hard to manage. | | 2/ The boilerplate is what bothers me the most (as someone who | believes in the declarative approach to software engineering). | The future for programming and programming languages should be an | attempt to step up to a higher level of abstraction; that has | historically been the way we step up to higher levels of | productivity. As applications get larger and code-bases grow | significantly we need abstraction, not more boilerplate. | | 3/ As someone who develops a functional framework for C# [1], I | could see Copilot essentially side-lining my ideas and my | approach to writing code in C#. Not just style, but choice of | types, etc. I wonder if the fallout of Copilot's 'one | true way' of generating code was ever considered? It appears to | force a style that is at odds with many who are looking for more | robust code.
At worst it will homogenise code ("people who wrote | that also wrote this") - stifling innovation and iterative | improvements in the industry. | | 4/ Writing code is easy. Reading and understanding code written | by another developer is hard. Will we spend most of our time as | code-reviewers going forwards? Usually, you can ask the author | what their intentions were, or why they think their approach is | the correct one. Copilot (as far as I can tell) can't justify its | decisions. So, beyond the simple boilerplate generation, will | this destroy the art of programming? I can imagine many juniors | using this as a crutch, and potentially never understanding the | 'why'. | | I'm not against productivity tools per se; it's certainly a neat | trick, and a very impressive feat of engineering in its own | right. I am however dubious that this really adds value to | professional code-bases, and may actively decrease code quality | over time. Then there's the grey area of licensing, which I feel | has been totally brushed to one side. | | [1] https://github.com/louthy/language-ext | ethangk wrote: | This is a little off topic, but your framework looks really | interesting! How come you opted for building a functional | framework in C#, vs using F#? I couldn't see anything in the | README about what was specifically frustrating about F#. I ask | because we're looking at introducing it at my company. | louthy wrote: | I cofounded a company in 2005; the primary product is a | never-ending C# web-application project. As the code-base | grew to many millions of lines of code I started to see the | very real problems of software engineering in the OO | paradigm, and had the _functional programming enlightenment | moment_. | | We started building some services in F#, but still had a | massive amount of C# - and so I wanted the inertia of my team | to be in the direction of writing declarative code.
There | wasn't really anything (outside of LINQ) that did that, so I | set about creating something. | | We don't write F# any more and find functional C# (along with | the brilliant C# tooling) to be very effective for us | (although we also now use PureScript and Haskell). | | I do have a stock wiki post on the repo for this though [1]. | You might not be surprised to hear it isn't the first time | I've been asked this :) | | [1] https://github.com/louthy/language-ext/wiki/%22Why-don't-you... | ethangk wrote: | Ha, it's good to see I'm full of original thoughts. | | That post in the wiki sums it up perfectly, much | appreciated! | bostonsre wrote: | I'm not sure we should throw the baby out with the bath water | here due to the large blurbs it stubs in when it doesn't | have a lot to go on in mostly empty files. It is a preview | release. They are working on proper attribution of suggested | code and explainability [1]. Having a stochastic parrot that | types faster than I do would be useful in a lot of cases. | | Yes, better layers of abstraction could make us more productive | in the future, but we're not there yet. By all means, don't | accept the larger blurbs it proposes, but there is productivity | to be gained in the smaller suggestions. If it correctly | intuits the exact rest of the line that you were thinking of, | it will save time and not make you lose understanding of the | program. | | In some areas complete understanding and complete code | ownership is required but in a lot of places, it's not. If it | produces the work of a moderately skilled developer it would be | sufficient. I don't remember all code I write as time passes. | If it produces work that I would have produced, then I don't | see how that's any different than work that was produced by my | past self.
| | It may feel offensive but a lot of the comments against it | sound like rage against the machine from industrialization's opponents, | and the arguments sound pretty similar to those made in the | past by those who had their jobs automated away. I'm not sure | we're all as unique snowflakes as we like to think we are. | Sure, there will be some code that requires an absolute master | that is outside the capabilities of this tool. But I'd guess | there is a massive amount of code that doesn't need that | mastery. | | [1] https://docs.github.com/en/github/copilot/research-recitatio... | vlovich123 wrote: | I think it depends on how you look at it. | | For small snippets that have likely been already written by | someone else, this probably works great. For those though, | the time savings is probably from at most 5-10 min down to 1 or | less. The challenge is that that's not where my time goes | unless I'm working in an unfamiliar language. | | As someone who writes a lot of code quickly, I'm usually | bottlenecked by reviews. For more complex changes I'm | bottlenecked by understanding the problem and experimenting | with solutions (and then reviews, domain-specific tests | usually, fixing bugs etc). Writing code isn't like waiting | for code to compile since I'm not actually ending up task | switching that frequently. | | This does sound like a fantastic tool when I'm not familiar | with the language although I wonder if it actually generates | useful stuff that integrates well as the code gets larger (eg | can I say "write an async function that allocates a record | handle in the file with ownership that doesn't outlive the | open file's lifetime"). I'm sure though that this is what a | lot of people are overindexing on. For things like that I | expect normal evolution of the product will work well.
For | things like "cool, understand your snippets but also weight | my own codebase higher and understand the specifics of my | codebase", I think there's a lot of groundbreaking research | that would be required. That is what I see as a true | productivity boost - I'd make this 100% required for anyone | joining the codebase. The more mentorship can be offloaded, | the lower the cost is to growing teams. OSS projects can more | easily scale similarly. | 0xbadcafebee wrote: | > programming languages should be an attempt to step up to a | higher level of abstraction | | Adding abstraction buries complexity. If all you do is keep | adding more abstractions, you end up with an overcomplicated, | inefficient mess. Which is part of why application sizes are so | bloated today. People just keep adding layers, as long as they | have room for more of them. Everything gets less efficient and | definitely not better. | | The right way to design better is to iterate on a core design | until it cannot be any simpler. All of the essential complexity | of software systems today comes from 40 year old conventions. | We need a redesign, not more layers. | | One example is version management. Most applications today | _can_ implement versioned functions and keep multiple versions | in an application, and track dependencies between external | applications. Make a simple DAG of the versions and let apps | call the versions they were designed against, or express what | versions are compatible with what, internally. This would make | applications infinitely backwards-compatible. | | The functionality exists right now in GNU Libc. You can | literally do it today. But rather than do that, we stumble | around replacing entire environments of specific versions of | applications and dependencies, because we can't seem to move | the entire industry forward to new ideas. Redesign is hard, | adding layers is easy. | louthy wrote: | > Adding abstraction buries complexity. 
If all you do is keep | adding more abstractions, you end up with an overcomplicated, | inefficient mess. Which is part of why application sizes are | so bloated today. People just keep adding layers, as long as | they have room for more of them. Everything gets less | efficient and definitely not better. | | Presumably you're writing code in binary then? This is a non- | argument, because there's evidence that it's worked. | Computers were first programmed with switches and punch | cards, then tape, then assembly, then low level languages | like C, then memory managed languages etc. | | Abstraction works when side-effects are controlled. | Composition is what we're after, but we must compose the | bigger bits from smaller bits that don't have surprises in. | This works well in functional programming, a good example | would be monadic composition: monads remove the boilerplate | of dealing with asynchrony, value availability, list | iteration, state management, environment management, etc. | Languages that have first-class support for these tend to | have significantly less boilerplate. | | The efficiency argument is also off too. Most software | engineering teams would trade some efficiency for more | reliable and bug free code. At some point (and I would argue | we're way past it) programs become too complex for the human | brain to comprehend, and that's where bugs come from. That's | why we're overdue an abstraction lift. | | Tools like Copilot almost tacitly agree, because they're | trying to provide a way of turning the abstract into the | real, but then all you see is the real, not the abstract. | Continuing the assault on our weak and feeble grey matter. | | I spent the early part of my career obsessing over | performance on crippled architectures (Playstation 3D engine | programmer). If I continued to write applications now like I | did then, nothing would go out the door and my company | wouldn't exist. 
| | Of course there are times when performance matters. But the | vast majority of code needs to be correct first, not the most | optimal it can be for the architecture. | mdellavo wrote: | Generating code has never been a problem for developers :) | | I'd be more interested in a tool that notices patterns and | boilerplate. It could offer a chance for generalization, | abstraction or use of a common pattern from the codebase. This | is of course much harder. | bohemian99 wrote: | My question is would Copilot be useful if you could choose the | codebase it would be drawing from? Almost as an internal | company tool? | freshhawk wrote: | That would actually be potentially useful, it could do a kind | of combination of autocompletion of internal libraries, | automatic templates for common patterns and internal | style/linting type tasks all in one. Certainly augmenting | those other things. | | It would be interesting how much code you would need before | it was useful (and how good does it have to be to be useful? | Does even a small error rate cost so much that it erases | other gains, because so many of the potential errors in usage | of this type of tool are very subtle?) | saurik wrote: | If you find yourself copying code someone else in your | organization wrote rather than abstracting it to a function | in a shared library or building a more declarative framework | to manage the problem, something horrible has happened. | maccard wrote: | Sometimes boilerplate is unavoidable. As an example, how do | you send a GET request with libcurl in C with an | authorization header? I can't tell you offhand, but I can | tell you the file in my codebase that does have it, because | I've duplicated the logic for two separate systems. 
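maccard's duplicated ~10 lines are exactly the case for the tiny wrapper saurik describes. A minimal sketch of that abstraction, with hypothetical names (`request_opts`, `build_auth_header` — none of this is libcurl's API; a real version would hand these values to `curl_easy_setopt`):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: the duplicated "set method, URL, and
   Authorization header" boilerplate hoisted into one place.
   A real implementation would feed these fields to libcurl
   via curl_easy_setopt(); these names are illustrative only. */
struct request_opts {
    const char *method;   /* "GET" or "POST" -- the one line that varies */
    const char *url;
    const char *token;    /* bearer token for the Authorization header */
};

/* Returns a malloc'd header line; caller frees. Changing the header
   format (or adding TLS settings, caching, extra headers) now happens
   here, once, instead of in every copy-pasted call site. */
char *build_auth_header(const struct request_opts *opts)
{
    const char *prefix = "Authorization: Bearer ";
    size_t len = strlen(prefix) + strlen(opts->token) + 1;
    char *header = malloc(len);
    if (header == NULL)
        return NULL;
    snprintf(header, len, "%s%s", prefix, opts->token);
    return header;
}
```

The GET/POST difference becomes data in the struct rather than a second copy of the setup code, which is the "one place to change" property argued for below.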
| saurik wrote: | So you are saying you would rather every project in the | world have at least one--if not, thanks to making it | easier via Copilot, many--copies of this code rather than | one shared library that provides a high-level abstraction | for libcurl?... At least for your own code, how did you | end up with two copies of duplicated logic rather than a | shared library of functionality? | maccard wrote: | > So you are saying you would rather every project in the | world have at least one--if not, thanks to making it | easier via Copilot, many--copies of this code. | | Absolutely not, not at all. I'm suggesting that copying | and pasting happens, particularly in the context of a | single project. | | > At least for your own code, how did you end up with two | copies of duplicated logic rather than a shared library | of functionality? | | At what point is it worth introducing an abstraction | rather than copying? Using my libcurl example, you can | create an abstraction over the ~10 lines of | initialization, but if you need to change it to a POST, | then you're just implementing an abstraction over | libcurl, which is just silly. | saurik wrote: | If you have 10 lines of repeated code with one line | changed to make it GET vs POST, introducing an | abstraction isn't "silly": it is both | ergonomic and advantageous, as if you ever need to add | another line of code to that initialization--which | totally happens, due to various security extensions you | might want to make to what TLS settings you accept, or to | tune performance parameters related to connection | caching, or to add a header to every request (for any | number of reasons from debugging to authentication)--you | can do it in one place instead of umpteen places. And like...
further to the point: I use libcurl | as a _fallback_ on Linux, but if you want to correctly | support the user's settings for proxy servers--which are | sometimes needed for your requests to work at all--my | code is abstracted so I can plug in _entirely different | backends_ to libcurl, such as Apple's CFNetwork. You act | like abstraction is somehow a bad thing or a complex | thing, when it should absolutely take you less time to | wrap duplicated code into a function than to duplicate | it. | tyingq wrote: | That sounds interesting, though it still feels like it would | need work. Like a way to annotate suggestions with comments, | or flag them. Definitive licensing shown for each snippet. A | way to mark deprecated code as deprecated to the training | algorithm, etc. | louthy wrote: | It would certainly alleviate the license concerns. If it was | possible to train it to a level (that produces effective | output), then sure. | | As a thought experiment, I thought "what would happen if we | trained it on our 15 million lines of product code + my | language-ext project". It would almost certainly produce | something that looks like 'us'. | | But: | | * It would also trip over a million or so lines of generated | code | | * And the legacy OO code | | * It will 'see' some of the extreme optimisations I've had to | build into language-ext to make it performant. Something like | the internals of the CHAMP hash-map data-structure [1]. That | code is hideously ugly, but it's done for a good reason. I | wouldn't want to see optimised code parroted out upfront. | Maybe it wouldn't pick up on it, because it hasn't got a | consistent shape like the majority of the code? Who knows. | | Still, I'd be more willing to allow my team to use it if I | could train it myself. | | [1] https://github.com/louthy/language-ext/blob/main/LanguageExt... | carlmr wrote: | > legacy OO code | | Aside from OO vs FP.
A concern with that I'd have is that | it would encourage and enforce idiosyncrasies in large | corporate codebases. | | If you've ever worked for a large corporation on their | legacy code, you know you don't want any of that to be | suggested to colleagues. | | This would enforce bad behaviors and make it even harder | for fresh developers to argue against it. | louthy wrote: | > This would enforce bad behaviors and make it even | harder for fresh developers to argue against it. | | I think this is a significant point. It maintains the | status quo. We change our guidance to devs every other | year or so. New language features become available, old | ones die, etc. But we're not rewriting the entire code-base | every time, we know if we hit old code, we refactor | with the new guidance; but we don't do it for the sake of | it, so there's plenty of code that I wouldn't want in a | training set (even if I wrote it myself!) | [deleted] | goodpoint wrote: | 5/ Boilerplate is easy to write but expensive to maintain in | large quantities. Proper abstraction/templating requires | careful thinking. Copilot encourages the first and discourages | the second. | | 6/ Copilot learns from the past. It can only favor popularity | and familiarity in code patterns over correctness and | innovation. | hallqv wrote: | Neural net: "It's all in the training data, stupid." | reader_mode wrote: | >2/ The boilerplate is what bothers me the most (as someone who | believes in the declarative approach to software engineering). | The future for programming and programming languages should be | an attempt to step up to a higher level of abstraction, that | has been historically the way we step up to higher levels of | productivity. As applications get larger and code-bases grow | significantly we need abstraction, not more boilerplate. | | Just the other day someone on copilot threads was arguing that | this kind of boilerplate optimizes for readability...
It's like | Java Stockholm syndrome and the old myth of easy to approach = | easy to read (how long it took them to introduce var). | | I've always viewed code generators as a symptom of language | limitations (which is why they were so popular in Java land) | that lead to unmaintainable code; this seems like a fancier | version of that, with all the same drawbacks. | 3pt14159 wrote: | I'm all for abstracting. I like Rails, for example. That | said, it gets _truly_ difficult to add or change stuff at the | more abstract layers. For example, adding recursive querying | to an existing ORM is _tough_. And on the rare occasion that | there is a bug in the abstract layer, debugging that from the | normal application code is also tough. | | I understand why some corporations prefer dumb boilerplate | everywhere for some applications. If there is an outage it's | usually easy to fix quickly. Sometimes it's not, if it's an | issue in the boilerplate (say, Feb 29 rolls around and all of | the boilerplate assumed a 28-day month) that means a huge | update all across the system, but that rarely happens in | practice. | reader_mode wrote: | I would say ORM is tough with code gen or with | metaprogramming because it maps two mismatched paradigms | (OOP and relational) and tries to paper over the | differences. | | I do agree on the debugging aspect - especially in dynamic | languages - metaprogramming stack traces can be really hard | to follow. | slumdev wrote: | Tools like xsd or T4 (in the .NET ecosystem) are great time-savers, | but you would never consider directly modifying the | code they generate. You would leave the generated code | untouched (in case it ever needed to be generated again) and | subclass it to make whatever changes you intend. | | I think Copilot is so unfortunate because it's not building | abstractions and expecting you to override parts of them. | It's acting as an army of monkeys banging out Shakespeare on | a typewriter.
And the code it generates is going to require | an army to maintain. | hu3 wrote: | Linq2Db is a great example of T4 code generation that | works. It creates partial classes from the database schema. | Together with C# I have strongly typed database access. | | https://github.com/linq2db/linq2db | reader_mode wrote: | Even there I feel like code generators are just a band-aid | around the fact that metaprogramming facilities suck. If | you would never modify the generated code, why generate it in | the first place? You could argue that stack traces are | easier to follow but TBH generated code is rarely pretty in | that regard as well. | | For example I think F#'s idea of type providers > code | generators. | ethbr0 wrote: | Code generators = out of practice, out of mind | mimixco wrote: | Awesome summary and thanks for trying it for the rest of us! | | Copilot sounded terrible in the press release. The idea that a | computer is going to pick the right code for you (from | comments, no less) is really just completely nuts. The belief | that it could be better than human-picked code is really way | off. | | You bring up a really important point. When you use a tool like | Copilot (or copypasta of any kind), you are introducing the | _additional_ burden of understanding that other person's code | -- which is worse than trying to understand your own code or | write something correct from scratch. | | I think you've hit the nail on the head. Stuff like Copilot | makes programming worse and more difficult, not better and | easier. | res0nat0r wrote: | Isn't the entire point of this to _suggest_ code you may use, | not to just blindly accept as correct without thinking? | petercooper wrote: | While I accept most of the concerns, it's better than your | comment suggests. I see some promise for it as a tool for | reminding you of a technique or inspiring you to a different | approach than you've seen before.
| | For example, I wrote a comment along the lines of "Find the | middle point of two 2D positions stored in x, y vectors" and | it came up with two totally different approaches in Ruby - | one of which I wouldn't have considered. I did some similar | things with SQL, and some people might find huge value in it | suggesting regexes, too, because so many devs forget the | syntax and a reminder may be all it takes to get out of a | jam. | | I'm getting old enough now to see where these sorts of | prompts will be a game changer, especially when dabbling in | languages I'm not very proficient in. For example, I barely | know any Python, so I just created a simple list of numbers, | wrote a "Sort the numbers into reverse order" comment, and it | immediately gave me the right syntax that I'd otherwise have | had to Google, taking much longer. | | Maybe to alleviate the concerns it could be sandboxed into a | search engine or a separate app of its own rather than | sitting constantly in my main editor - I would find that a | fair compromise which would still provide value but require | users to engage in more reflection as to what they're using | (at least to a level that they would with using SO answers, | say). | rob74 wrote: | Yeah, but... I mean, I guess we all agree that copying code | from, let's say StackOverflow without checking if it really | does what you want it to do is a bad thing? Now here we | have a tool that basically automates that (except it's | copying from GitHub, not StackOverflow), and that's | supposed to be a good thing? Even if its AI is smarter, you | would still have to check the code it suggests, and that | can actually be harder than writing it yourself... | ethbr0 wrote: | The big boost, that I think parent is alluding to, is for | rusty (not Rust!) languages in the toolbox, where you may | not have the standard library and syntax loaded into your | working memory. | | As a nudge, it's a great idea. 
As a substitute for | vigilance, it's a terrible idea. | | I suspect that's why they named it Copilot instead of | Autopilot, but it's unfortunately more likely to be used | as the latter, humans being humans. | [deleted] | toss1 wrote: | Right, so it might occasionally be useful as a search tool | for divergent ideas of different approaches to a problem, | and your suggestion to sandbox it in a separate area works | for that. | | But that does not seem to be its advertised or configured | purpose, sitting in your main editor. | mimixco wrote: | This is good stuff. As a search engine, it could very well | be useful. As another poster pointed out, if some context | or explanation were provided along with the source | suggestions, its utility as a reference would really grow. | | I totally agree with you that prompted help is a big deal | and is just going to get bigger. We have developed a language | for fact checking called MSL that works exactly this way in | practice -- suggesting multiple options rather than just | inserting things. | | One of the things that interests me about this thread is | the whole topic of UI vs. AI and how much help really comes | from giving the user options (and a good UI to discover | them) vs how much is "AI" or really intelligence. I think | the intelligence has to belong to the user, but a computer | can certainly sift through a bunch of code to find a search | engine result, and those results could be better than you | get now from Google & Co. | osmarks wrote: | If they're using something like GPT-3 on the backend, | which they probably are, it probably _can't_ provide any | explanations or context (unless the output is memorized | training data, like this); the output can be somewhat | novel code not from any particular source, and while it | might be possible to find relevant information on similar | code, this would be a hard problem too.
| | EDIT: they appear to be interested in making it look for | similar code, see here: | https://docs.github.com/en/github/copilot/research-recitatio... | vmception wrote: | Hm, odd takes here. | | It's really weird for software engineers to judge something | by its current state and not by its potential state. | | To me, it's clearly solvable by Copilot filtering the input | code by that repository's license. It should only be certain | open source licenses, maybe even user-choosable, or code-creators | can optionally sublicense their code to Copilot in a | very permissive way. | | Secondly, a way for the crowd to code review suggestions | would be a start. | gpm wrote: | Practically every open source license requires attribution; | if copilot has a licensing issue, training a model on only | repositories with the same license won't fix it except for | the extremely rare licenses which do not require | attribution. | vmception wrote: | why not? it can just generate an attribution file or | reminder | gpm wrote: | Because it's an opaque neural network on the backend, it | doesn't know if or from whom it copied code. | buu700 wrote: | Could they handle this by generating a collective | attribution file that covers every (permissively | licensed) repository that Copilot learned from? | | Of course this would be massive, so from a practical | consideration the attribution file that Copilot generates | in the local repository would have to just link to the | full file, but I don't think that would be an issue in | and of itself. | gpm wrote: | Maybe? Might depend on the license; I doubt the courts | would be amused. | | Almost certainly a link would not suffice, basically | every license requires that the attribution be directly | included with the modified material. Links can rot, can | be inaccessible if you don't have internet access, can | change out from underneath you, etc. | | (I am not a lawyer, btw) | buu700 wrote: | Makes sense.
Maybe something like git-lfs/git-annex would | be sufficient to address the linking issue, but it seems | like the bigger concern is whether a court would accept | this as valid attribution. In a sense it reminds me of | the LavaBit stunt with the printed key. | joeyh wrote: | I think a judge could be persuaded that a list of every | known human does not constitute a valid attribution of | the actual author, even though their name is on the list. | The purpose of an attribution is to acknowledge the | creator of the work, and such a list fails at that. | buu700 wrote: | Makes sense. That's probably the best interpretation | here. Any other decision would make attribution lists | optional in general for all practical purposes. | mimixco wrote: | I've been in the business a long time and I just don't | believe in generalized AI at all. Writing code requires | general (not artificial) intelligence. All of these "code | helping" tools break down quickly because they may be | searching for and finding relevant code blocks (the | "imperative hell" referred to by another commenter), but | they don't understand the _context_ or the overall behavior | and goals of the program. | | Writing to overall goals and debugging actual behavior are | the real work of programmers. Coming up with syntax or | algorithms is 3rd and 4th on the priority list because, | let's face it, it's not that hard to find a reference for | correct syntax or the overall recipe implied by an | algorithm. Once you understand those, you can write the | correct code for your project. | | I do think Copilot has potential as a search engine and | reference tool -- if it can be presented that way. But the | idea of a computer actually coming up with the right code | in the full context of the program seems like fantasy. | gpm wrote: | If we're coming up with potential uses, I think they got | the direction wrong. | | Don't tell me what to do, tell me what not to do.
"this | line doesn't look like something that belongs in a code | base", "this looks like a line of code that will be | changed before the PR is merged". Etc. | mimixco wrote: | _That_ would be fantastic! Imagine if it could catch | common errors before you make them. So many things in | loops and tests that we mess up all the time. My favorite | is to confuse iterating through an array vs an object in | JS. I 'd love to have Gazoo step in and say, "Don't you | mean, _this_ , David?" | slumdev wrote: | > It's really weird for software engineers to judge | something by its current state and not by its potential | state. | | No, we're not afraid of Copilot replacing us. The thought | is ridiculous, anyway. If it actually worked, we would be | enabled to work in higher abstractions. We'd end up in even | higher demand because the output of a single engineer would | be so great that even small businesses would be able to | afford us. | | Yes, we are afraid of Copilot making the entire industry | worse, the same way that "low-code" and "no-code" solutions | have enabled generations of novices to produce volumes of | garbage that we eventually have to clean up. | vmception wrote: | Sounds like projecting because thats not what I was | referring to | | I'm saying copilot can be better with very simple tweaks | [deleted] | onion2k wrote: | _Stuff like Copilot makes programming worse and more | difficult, not better and easier._ | | Copilot makes programming worse and more difficult if you're | aiming for a specific set of coding values and style that | Copilot doesn't generate (yet?). If Copilot generates the | sort of code that you would write, and it does for _a lot_ of | people, then it 's definitely no worse (or better) than | copying something from SO. | | The author of a declarative, functional C# framework likely | has very different ideas to what code should be than some PHP | developer just trying to do their day-to-day job. 
We | shouldn't abandon tools like Copilot just because they don't | work out at the more rigorous ends of the development | spectrum. | serf wrote: | >If Copilot generates the sort of code that you would | write, and it does for a lot of people, then it's | definitely no worse (or better) than copying something from | SO. | | Disagree. | | Most SO copy-paste must be integrated into your project -- | maybe it expects different inputs, maybe it expects or | works with different variables -- whatever, it must be | partially modified to work with the existing code-base that | you're working with. | | Copilot does the integration tasks for you. When one might | have had to read through the code from SO to understand it | enough to integrate it, the person using Copilot need not | even invest that much understanding. | | Because of these workflow differences, it seems to me as if | Copilot enables an even more low-quality workflow than | offered by copy-pasting from SO and patching together | multiple code-styles and paradigms while hoping for the | best; Copilot does that without even the wisdom that an SO | user might have that 'this is a bad idea.' | buu700 wrote: | I'm not firmly for or against the concept of Copilot, but | it's fascinating to me that it will introduce an entirely | new class of bugs. Rather than specific mistakes in | certain blocks of code and edge case errors in handling | certain inputs, now we're going to have | lazy/overworked/junior developers getting complacent and | committing code they haven't reviewed that isn't even | close to their intent. Like you could have a backend | method that was supposed to run a database query, but | instead it sends the content of an arbitrary variable in | a POST request to a third-party API or invokes a shell to | run `rm -rf /`. | marcosdumay wrote: | To me, the most interesting aspect is the new class of | supply chain security vulnerabilities it will create. 
How | people will act to exploit or protect[1] against those will | be very interesting. | | [1] I don't expect "not using a tool that generates bad | code" to be the top option. | nightpool wrote: | The arguments that the GP makes are not based on a specific | style or value of coding. Instead, they're based on the | simple truth that it is harder to understand code that | somebody else wrote. | | In some cases the benefits of doing so outweigh the costs | (such as using a stack overflow answer that's stood the | test of time for something you don't know how to do), but | with Copilot you don't even get the benefit of upvotes, | human intent, or crowdsourced peer review. | mimixco wrote: | I don't think they work out past trivial applications. Any | non-trivial app requires an understanding of a much larger | part of the codebase than a tool like Copilot is looking at | at any one time. | | Copilot does not understand the code _in toto_ and is | therefore really useless for debugging (70% of all coding) | and probably useless for anything other than very simple | parts of an app. | onion2k wrote: | _Any non-trivial app requires an understanding of a much | larger part of the codebase than a tool like Copilot is | looking at at any one time._ | | I don't think that's important. Copilot, at least as it's | been demo'd so far judging by the examples, is to help | you write small, standalone functions. It shouldn't need | to know about the rest of the application. Just as the | functions that you write yourself shouldn't need to know | about the rest of the application either. | | If your functions need a broad understanding of the | codebase as a whole, how the heck do you write tests that | don't fail the instant anything changes? | mimixco wrote: | The reality of code is that stuff breaks when connected | to other stuff, as it eventually must be for real work to | happen. There's no getting around that.
| | Since that's where the work of programming is, debugging | connected applications (not writing fresh, unencumbered | code, a rare luxury), a tool that offers no help for that | is, well, not much help. | GlennS wrote: | I'm inclined to agree with you, and actually I'm rather | mistrustful of even basic autocomplete ever since a colleague | caught me using it without even looking at the screen! | | But I wonder... | | Is this a difference of programmer culture? | | I think there are people who write successful computer programs | for successful businesses without delving into the details. | Without considering all the things that might go wrong. Without | mapping the code they're writing to concepts. | | Lots of people. | | What would they do with this? | louthy wrote: | > What would they do with this? | | Not get a job working for me ;) | | More seriously, when I think back to when I was first | learning programming - in the heady days of 1985 - I would | often copy listings out of computing magazines, make a | mistake whilst doing it, and then have no idea what was | wrong. The only way was to check character by character. I | didn't have the deeper understanding yet, and so I couldn't | contribute to solving the problem in any real way. | | If they're at that level as a programmer, to the point where | their code is being written for them and they don't really | understand it, then they're going to make some serious | mistakes eventually. | | If you want to step up as a dev, understanding is key. | Programming is hard and gets harder as you step up and bite | off bigger and more complex problems. If you're relying on | the tools to write your code, then your job is one step away | from being automated. That should be enough to light a fire | under your ambition! 
| biztos wrote: | I also typed stuff in from magazines in the 80's, and my | fast but imperfect typing really helped me learn | programming: I often had to stop, go back to the first | page, and actually _read_ the damned thing in order to make | it work. | code4you wrote: | Great points. Really makes me question why so many developers | were excited / worried about programming jobs being automated | away by this technology. I really doubt that many jobs are | going to be displaced by what is at best an improvement to | autocomplete/intellisense and at worst an unreliable, copyright | infringing boilerplate generator. Also agree with point #3 - I | could see Copilot steering devs away from new code patterns | toward whatever was most commonly seen in the existing | codebases it was trained on. Doesn't seem good for innovation | in that sense. | influx wrote: | I get why marketing calls machine learning "AI". I don't get why | engineers would think this is. | | Dumb. | squeaky-clean wrote: | I still consider anything with more than 3 if-statements to be | AI. We just need more sensible expectations about what AI can | do haha. | SEMW wrote: | > I don't get why engineers would think this is. | | This claim that "AI" only means artificial general / | human-equivalent intelligence completely ignores the long history of | how that term has been used, by computer science researchers, | for the last 70-odd years, to include everything from Shannon's | maze-solving algorithms, to Prolog-y systems, to simple | reinforcement learning, and so on. | | https://web.archive.org/web/20070103222615/http://www.engagi... | | It's true that there has been linguistic drift in the direction | of the definition getting narrower (to the point where it's a | joke that some people use 'AI' to mean whatever computers can't | do _yet_). And you can have reasons to prefer your own | very-narrow definition.
But claiming that your own definition is the | only valid one, to the point that anyone using a wider | definition (one that has a long etymological history, and which | remains in widespread use) is "dumb", is... not how language | works. | influx wrote: | It hasn't been AI the entire time. It's borderline fraud, | tbh. | konfusinomicon wrote: | it's the marketing magic bullet. each person shot is entranced | by its promises, and given unlimited ammo to spread its lies. | few possess armor capable of stopping them | yepthatsreality wrote: | Co-pilot is just lowest common denominator solutions with flashy | tabbing. | axiosgunnar wrote: | I hate to be the one that says this but I think it's true: | | "So you are an SWE and you take a break from work to go to | Hacker News to complain that Github's Copilot, which is an | AI-based solution meant to help SWEs, is utter shit and completely | unusable. | | And then you go back to writing AI-based solutions for some other | profession. Which is totally not shit or anything." | | Can anybody put this more elegantly? | MajorBee wrote: | A variation of the Gell-Mann Amnesia effect? | | "Briefly stated, the Gell-Mann Amnesia effect is as follows. | You open the newspaper to an article on some subject you know | well. In Murray's case, physics. In mine, show business. You | read the article and see the journalist has absolutely no | understanding of either the facts or the issues. Often, the | article is so wrong it actually presents the story backward-- | reversing cause and effect. I call these the "wet streets cause | rain" stories. Paper's full of them. In any case, you read with | exasperation or amusement the multiple errors in a story, and | then turn the page to national or international affairs, and | read as if the rest of the newspaper was somehow more accurate | about Palestine than the baloney you just read. You turn the | page, and forget what you know."
| | https://www.goodreads.com/quotes/65213-briefly-stated-the-ge... | edgyquant wrote: | I would do this with Reddit posts. I'd see the top comment | under something I was familiar with and see it was full of | holes or just incorrect, but then I'd go to a post about | something I didn't know all that well and take the top | comment at face value. | joe_the_user wrote: | SWEs create AI-based solutions to X 'cause people pay them. | Entrepreneurs and investors are the ones who actually think | they're the answer to everything. | | Also, Copilot might (or might not) be useless or even interfere | with real work. But it's probably low on the scale of awful | things SWEs have helped create. The AI parole app is a thing | that should haunt the nightmares of whoever created it, for | example. But lots of AI apps may be useless but are probably | also harmless, so doing that might not be the worst thing. | SamBam wrote: | "'I never thought leopards would eat MY face,' sobs woman who | voted for the Leopards Eating People's Faces Party." | Hamuko wrote: | > _And then you go back to writing AI-based solutions for some | other profession._ | | I don't know what you're talking about, I'm a webshit | developer. | [deleted] | skinkestek wrote: | You mean like the insanely annoying AIs that replaced Google | search? The idiotic one that files Javascript books under "Law" | in Amazon, or the insulting one who runs AdSense and thinks my | wife isn't good enough and I am stupid enough to leave her for | some mail order bride? | donkeybeer wrote: | Javascript books under "Law" is hilarious | drdaeman wrote: | I'm in for JavaScript Penal Code. Make that unwarranted | type-coercing operator use punishable by law. | axiosgunnar wrote: | Instead of prison you go to callback hell | shadilay wrote: | Maybe the Google AI is a polygamist and thinks you ought to | have a 2nd wife?
| marcosdumay wrote: | There are probably good ways to apply AI to software | development (has anybody tried to build a linter already?). It | is this product that is very bad. | | The same certainly applies to other tasks. | thinkingemote wrote: | The most common example of this would probably be complaining | about advertising whilst working for a business that depends on | advertising to survive. | | Ultimately it's a kind of Kafkaesque trap that modern living | has us all in, to a greater or lesser extent. | srcreigh wrote: | That's a bit different. Advertising is like a race to the | bottom, where everybody takes part in order to survive. You can do | that while wishing that it could somehow not be that way. | Same with environmental issues. | | The GP comment by contrast is about hypocrisy. I personally | found it funny that I didn't ever read about (or consider) | copyright violations of deep learning until they tried to do | it with code :-) | | Of course programmers would find the problem with AI as soon | as it exploited _them_. | [deleted] | TeMPOraL wrote: | Dunno. I go to HN because it's the one place where I can whine | about AI being total bullshit, for the exact reasons we're | now complaining about wrt Copilot. | rpmisms wrote: | I like tabnine. It's an autocomplete tool and doesn't pretend | to be anything more. | celeritascelery wrote: | From the Copilot FAQ: | | > The technical preview includes filters to block offensive words | | And somehow their filters missed f*k? That doesn't give a lot of | confidence in their ability to filter more nuanced text. Or maybe it | only filters truly terrible offensive words like "master". | spoonjim wrote: | Blocks offensive words, but doesn't block carefully crafted | malware. | minimaxir wrote: | In my testing of Copilot, the content filters only work on | _input_, not output. | | Attempting to generate text from code containing "genocide" | just has Copilot refuse to run.
But you can still coerce | Copilot to return offensive output given certain innocuous | prompts. | aasasd wrote: | Maybe Github just doesn't have many repos to control death | factories and execution squads? | Jackson__ wrote: | Interesting how this continues to be an issue for GPT-3-based | projects. | | A similar thing is happening in AI Dungeon, where certain | words and phrases are banned to the point of suspending a | user's account if used a certain number of times, yet it | will happily output them when they are generated by GPT-3 itself, | and then punish the user if they fail to remove the offending | pieces of text before continuing. | Closi wrote: | Ahh, so it's the most pointless interpretation of the phrase | "filters to block offensive words", where it is stopping the | user from causing offense to the AI rather than the other way | around. | verdverm wrote: | They probably don't want to repeat Microsoft's incident | with Tay, though they seem to have created their own | incident, which dooms the product if it wasn't already. | derefr wrote: | I believe the concept is to stop users from prompting the | AI to generate offensive stuff specifically, and then | publishing the so-generated stream of offensive stuff as | negative PR for GitHub, in the same way the generated | stream of offensive stuff coming from Microsoft's AI was a | big PR disaster. | bambax wrote: | Maybe, but even if so, filtering the output would also | prevent this. | stingraycharles wrote: | I suppose you're referring to the AI Twitter bot that | initially was very lovely and within a day 4chan had | turned it into a nazi. That was both very naive and | hilarious. | | https://spectrum.ieee.org/tech-talk/artificial- | intelligence/... | | The big difference in this case, however, is that this AI | was constantly learning based on user input, which I do | not think is the case for Copilot.
| raffraffraff wrote: | Easily offended AI is exactly what the world needs | GenerocUsername wrote: | We have too many easily offended NPC's as is. | krick wrote: | Lol, how does _that_ make any sense? I mean, all these word | blacklists are always pretty stupid, but at least you can | usually see the motivation behind them. But in this case I'm | not even sure what they tried to achieve; this is absolutely | pointless. | throwaway2037 wrote: | Just to be clear for other readers: Are you being sarcastic | about the last sentence that mentions the term 'master'? I hope | not. | | As I understand, this movement (for lack of a better term!) | started in the United States, which has a long and complicated | history of slavery. In the last few years in my various jobs | (all outside the United States), there has been a concerted | effort to remove any instances of "master" and "slave" and | replace them with terms like "primary" and "secondary". | | For co-workers not familiar with the history of slavery in the | United States, there is always a pause, and then some confusion | about the changes. After explaining the historical context, 99% | of people reply: "Oh, I understand. Thank you to explain." | andrewzah wrote: | The word master has many usages. One specific context | (master/slave) is inappropriate, but that doesn't mean every | other context is unusable now. | | Github changing master->main was the epitome of virtue | signaling. This literally does not affect black people at all, | nor does it do -anything- to help with racial inequality in the | US. It's actually quite patronizing and tone-deaf to think that | instead of all the things -Microsoft- could be doing to help | racial inequality, they're putting in as little effort as | possible. | | Congrats on granting power over words to unreasonable people | who ignore things like context in language and common sense. | mdoms wrote: | I don't work in the USA and I don't intend to.
Your history of | slavery is none of my concern, especially when I'm just | trying to do my work. | | The word 'master' is useful for me, and I don't believe for a | nanosecond that anyone, American or not, is ACTUALLY offended | by it. I believe that some people (mostly affluent white | Americans) are searching for things that they think they | SHOULD be offended by. | slackfan wrote: | And in my historical context, the power fists that your ideology | uses were used by a regime that murdered millions. In the | past 100 years. | blindmute wrote: | > After explaining the historical context, 99% of people | reply: "Oh, I understand. Thank you to explain." | | A similar percentage then think to themselves, privately, | "well that's pretty stupid." | Isinlor wrote: | Why is there no push-back against using the word Slave, which | originates from the word "Slav" due to enslavement of Slavic | people? | | By analogy, you are basically using the word African to mean | "a person in possession of someone else". | | https://www.etymonline.com/word/slave | | @edit The fact that people down vote this highlights that the | whole issue is just virtue signaling. | RicardoLuis0 wrote: | while the word 'master' can indeed be used in the sense of | "master and slave", its use in git is more akin to the use of | 'master' in "master record", and doesn't refer to 'ownership' | in any way | [deleted] | sseagull wrote: | Everyone has a line of how much they are willing to change | their language, though. There will always come a point where | someone will think some change is "silly", even though the | old term may have upset some people. And almost every term | has some sort of baggage associated with it. | | There was a post going around somewhere of a college's | earnest attempt to change some language (like avoiding "give | it a shot" because of the association of "shot" with guns). | Would renaming all the various things we call "triggers" be | ok, so we don't upset victims of gun violence?
| | So the master->main change was the line for some people, not | others. | andrewzah wrote: | As a matter of principle I don't think we should be moving | towards ignoring any and all contexts of words. Granting | this power of word banning to random arbiters is quite | crazy. In this case, master was moreso changed because it | -could- be deemed offensive, not that it -actually- is | offensive by itself. Not one person that I've spoken to | about it has actually cared. | | Words having multiple usages is not really a novel concept. | If we ban words based on them potentially being offensive, | we'll end up with no words at all as people move on to using | different words, and so forth. | | It is not silly to have pushback when someone wants to | grant themselves power over language usage. Dropping usage | of a word should have a strong, tenable argument and larger | community support than 0.00000001% of people caring. | LAC-Tech wrote: | > Everyone has a line of how much they are willing to | change their language, though. | | But that line is constantly moving. People are | forced to adapt, or they are ostracised socially and | economically. | | If prestigious organisations, people and institutions | decide "master/slave" is an immoral thing to say, I have no | choice. Eventually I'll need to fall in line or my | livelihood will be at risk. | username90 wrote: | > For co-workers not familiar with the history of slavery in | the United States, there is always a pause, and then some | confusion about the changes. After explaining the historical | context, 99% of people reply: "Oh, I understand. Thank you to | explain." | | Most people answer like this when they realize you are an | unreasonable person who refuses to listen. Happens all the | time, like "Oh, I understand (you are one of those). Thank | you for explaining!", and they remember that they need to stop | using this word when working with you.
| bestcoder69 wrote: | By rolling your eyes you accept my terms and conditions. | LAC-Tech wrote: | Imagine 'explaining' the historical context to someone from, | say, Brazil. | pydry wrote: | Changing master to main was something Github did when they | were taking heat for their contract with ICE. It was a nice | bit of misdirection that cost them nothing, achieved nothing | and garnered praise in some quarters. | | ICE, of course, runs an _actual_ concentration camp which has | a slightly more troublesome history than the word master. | | Language policing is to racism what recycling is to global | warming - an attempt to shift the focus away from elite | responsibility for systemic issues to "personal | responsibility" and forestall meaningful reform by placing | emphasis on largely non-threatening symbolic gestures. | samatman wrote: | y'know it really seems like both purpose and outcome need | to be closely examined here, if we're going to be | emphasizing _actual_ next to concentration camps. | | what's the paradigm of a concentration camp? if we go | straight for Auschwitz we'll get nowhere, how about the | Boer concentration camps? Origin of the term after all. | | What was the purpose? To _concentrate_ the Boer population | during a total war against them, so they couldn't supply | and hide the belligerents. | | What was the outcome? Tens of thousands of preventable | deaths, mostly from disease. Success in the war, from the | British perspective. | | So, let me turn my spectacles to your example of, may I | quote? | | > an _actual_ concentration camp | | Which appears to be a migrant detention center. To put it | succinctly, migrants who enter the country without filling | out paperwork, and get caught, end up in one of these | places for months-to-years while USG figures out what to do | with them.
| | So a Boer concentration camp is filled by the British | riding into a farmstead or town, kidnapping the women and | children, and driving them out to a field and sticking them | in a tent. A migrant detention center is filled when | someone enters the United States without following the | rules which govern that sort of behavior, and then gets | caught. | | Where is the war? | | Where is the excess death? | | Ah well. I'm out of time and patience to express my | contempt for your abuse of language and disrespect for the | real horrors which you cheapen with this kind of facile | speech. | | Enjoy the 4th of July. | bloomark wrote: | Your vacuous argument about what is an _actual | concentration camp_ is out of place. This wasn't a | discussion about concentration camps, it was about | github's attempted misdirection, and their facetious show | of supporting inclusion, by eliminating the term | "master". | | https://news.ycombinator.com/item?id=26487854 | pydry wrote: | Is this an indirect way of saying that you support ICE? | | Coz if so I'd really rather hear it straight rather than | indirectly via an attempt to police my language. | SahAssar wrote: | I get what you mean, but in a discussion about semantics it | might be unhelpful to dilute the term "concentration camp", | especially if prefixed with "actual" in italics. That is | unless you actually mean that ICE camps serve the same | purpose and are equivalent to nazi concentration camps. | pydry wrote: | The Nazis ran what would more accurately be termed | extermination camps. | | Though what they did certainly bore a strong resemblance | to the Boer war concentration camps/Manzanar, etc., whose | purpose was to "concentrate" people into one place rather | than industrially slaughter them. | pmkiwi wrote: | To be correct, both existed. | | A camp like Ravensbruck was a concentration camp (for | women) while Auschwitz-Birkenau was both a concentration | and extermination camp.
| | https://upload.wikimedia.org/wikipedia/commons/b/be/WW2_H | olo... | SahAssar wrote: | I don't know if I've ever heard anyone use the term | "concentration camp" without qualifiers to refer to | anything else than the nazi concentration camps (or | something equivalent). | | Maybe it's just me, but I think it would have been more | clear if you said internment camp if your intent was to | refer to the broader context and not invoke a comparison | to nazis. | pydry wrote: | Wikipedia redirects concentration camp to: | | https://en.wikipedia.org/wiki/Internment | | Where it also makes the point that the nazi camps were | primarily extermination camps. | | Maybe take it up with them and get back to me if you feel | truly passionate about this issue. | | >Maybe it's just me, but I think it would have been more | clear | | Gosh, it's awfully ironic that this sentence would happen | in a thread about how language policing is used as a | distraction from _important_ issues. | | Is it more important to you how people _use_ the term | concentration camp or the fact that ICE lock up children | in internment /concentration/[ insert favorite word here | ] camps? | SahAssar wrote: | > So, is it more important to you how people use the term | concentration camp or the fact that ICE lock up children | in internment/concentration/[ insert favorite word here ] | camps? | | Well, that escalated quickly. | | I don't think I ever said anything for or against what | ICE is doing, in fact I tried not to because the only | thing I wanted to say was that when using the words | "literally concentration camps" people might read that as | "camps designed to kill people" since that is the way | I've been taught it (in history classes) and heard it (in | general use). | | I don't even live in the US so I have no say in this in a | democratic sense. 
If I did I'd be against the way | migrants are treated and want more humane treatment, but | I don't think that should be relevant to what I said. | pydry wrote: | Your primary worry was that somebody _might_ read that | sentence and believe that the US is gassing immigrants? | | Seems unlikely. | SahAssar wrote: | You seem to think I have some political motive; I don't. | I just saw a comment that from my perspective and | historical education seemed to equate two things that I | regard as different, and said that it might be helpful to | not conflate those. It seems like you did not intend to | conflate them and it is a difference in what you and I | read into the term "actual concentration camp". | | From my perspective this conversation is as if someone | said "working for XCompany is actual slavery" and I said | "Perhaps don't use 'actual slavery' as a term for | something that isn't that?" | junon wrote: | Historians themselves call what ICE is doing a | concentration camp. So your experience is very much | localized. | hdhjebebeb wrote: | It seems like a distinction without a difference; this | article for example uses them interchangeably: | https://www.commondreams.org/views/2019/06/21/brief- | history-... | dragonwriter wrote: | Nazi "concentration camps" were not actual concentration | camps (a thing which long predates the Nazi camps), they | were extermination camps for which "concentration camp" | was a minimizing euphemism. | | US WWII "internment" and "relocation" centers were actual | concentration camps ("relocation center" was itself a | euphemism, but "internment" referred to a formal legal | distinction impacting treaty obligations.) | SahAssar wrote: | Sure, but I don't know if I've ever heard anyone use the | term "concentration camp" without qualifiers to refer to | anything other than the nazi concentration camps (or | something equivalent).
| | If someone says that something is "_literally_ a | concentration camp" I think that most people will think | of ovens and genocide. | | Perhaps it's a regional thing, but that is how I | interpreted it. | [deleted] | sombremesa wrote: | It's not so much a regional as a political thing. Want it | to sound worse? Use concentration camp. Want it to sound | better? Use internment camp (or in some cases, re- | education facility). | michael1999 wrote: | Or "Reserve". | dragonwriter wrote: | Relevant to that, the US WWII internment camps | were...placed on land taken from (with disputedly- | adequate compensation for the use) reservation land. | [deleted] | bambax wrote: | Why is this downvoted... It's simply the truth. | kanzenryu2 wrote: | There were only a handful of mass extermination camps. | There were tens of thousands of concentration camps. | https://encyclopedia.ushmm.org/content/en/article/nazi- | camps.... | okamiueru wrote: | Pretty sure they were being sarcastic. I also don't find your | arguments persuasive in the slightest, and I find myself | being skeptical of these recent moral outcries. I'm skeptical | of their sincerity, and I don't buy it. "Master" has an | etymological background far more diverse than the dichotomy | with "slave". I can wholeheartedly say that I've not once thought | to make that association. It's been a title for centuries. | Master blacksmith, etc. (See | https://en.wikipedia.org/wiki/Master for a list) | | Another example of what seems like a fake moral outcry is | "blackface". And I mean what it is being called now, | not the actual meaning: the racist ridicule by | stereotyping ethnicity. That was "Blackface". Yet, for some | reason, context doesn't matter anymore, and we end up with | removing episodes of Community because someone painted their | face in a cosplay of a dark elf, in exact commentary on | this. | | There is significant systemic racism in the US that affects
almost everything. In order to deal with those things, the | very first thing would be to properly be able to identify | racism. Context matters. Renaming "Master" branches is not | progress. Ostracising a kid for dressing up as Michael | Jackson isn't it. | | Whenever I see outrage over such things I cynically think | that the person is probably white, and probably doing it for | attention. One thing is for sure: it only serves to detract | from the real issues. | rorykoehler wrote: | Check out the recent Marc Rebillet stream with Flying Lotus | and Reggie Watts. They absolutely destroy the bs around the | use of the word master. I think both FL and RW will be | quite representative of how African Americans (and the rest | of the world) feel about this. | okamiueru wrote: | Do you have a timestamp? As enjoyable as it is to | listen to each of them, the stream was mostly music and | almost two hours long. | rorykoehler wrote: | The next couple of minutes from here | https://youtu.be/0J8G9qNT7gQ?t=3984 | greyfox wrote: | Very interesting that this was posted, as I literally JUST watched | an even MORE interesting youtube upload about this very bit of | code just last weekend. | | Here's the very fun video if anyone wants to take a look: | | https://www.youtube.com/watch?v=p8u_k2LIZyo | cblconfederate wrote: | Clearly, swearing is the only right way to write that function | stefan_ wrote: | Even includes the commented-out code. Clearly Copilot has gained | a deep understanding of code and is not simply the slowest way to | make a terrible, opaque search engine ever! | mrfredward wrote: | From the tweet it looks like an awesome search feature. Just | type what you wanted to search for right inline and then it can | drop the result in without you ever changing a window or moving | a hand to the mouse. | | Problem is you don't know whose code you're stealing, which | leads to all sorts of legal, security, and correctness issues. | aj3 wrote: | Does GitHub Copilot write perfect code?
| | No. GitHub Copilot tries to understand your intent and to | generate the best code it can, but the code it suggests may not | always work, or even make sense. While we are working hard to | make GitHub Copilot better, code suggested by GitHub Copilot | should be carefully tested, reviewed, and vetted, like any | other code. As the developer, you are always in charge. | | https://copilot.github.com/ | | EDIT: the text above is a direct quote from the Copilot website | danparsonson wrote: | > ...may not always work, or even make sense... | | Naively, as someone who just heard of this - that sounds | worse than useless. If you can't trust its output and have to | verify every line it produces _and_ that the combination of | those lines does what you wanted, surely it's quicker just | to write the code yourself? | aj3 wrote: | Then write the code yourself. It's not like you're forced | to use this demo. | danparsonson wrote: | Well, you're right. I was somehow expecting there might | be a silver lining I'd missed but perhaps not. | cjaybo wrote: | Not exactly a confidence-inspiring reply from someone who | just identified themselves as representing the project | here! | aj3 wrote: | I don't work for Github (nor MS) and do not represent | Copilot. | vultour wrote: | Just today I needed to quickly load a file into a string in | golang. I haven't done that in a while, so I had to go look | up what package and function to use for that. I'd love a | tool that would immediately suggest a line saying | `ioutil.ReadFile()` after defining the function. I would | never accept a full-function suggestion from Copilot, | similarly to how I never copy and paste code verbatim from | StackOverflow. Using it as hints for what you might want to | use next seems like a nice productivity boost. | edgyquant wrote: | It's quite literally stealing code from repos under a GPL | license and suggesting it to people regardless of license | (if any) they're using.
I do not see how this is legal. | aj3 wrote: | I disagree with this attitude. Many demos such as this one | with Quake code are intentionally looking for (funny) | outliers by bending the rules. But this is not how anyone | would use the system in a real scenario (no one should | select a license by typing "// Copyright\t" and selecting | whatever gets auto-completed), so it doesn't really | demonstrate any new limits besides what you could | reasonably expect anyway (and what's mentioned on | Copilot's landing page). | | Basically, in order to fall victim to this "code theft" | (or any other "footguns" from Twitter threads) you'd need | to be actively working against all the best practices and | common sense. If you actually use it as a productivity tool | (the way it is marketed) you'll remain in full control of | your code. | comodore_ wrote: | Funny, the youtube algo blessed me with an in-depth video (~1y | old) about this quake function yesterday. | gumby wrote: | stack overflow at its automated finest. | | Or should we call it the Tesla of software? | FlyingSnake wrote: | The rate at which these bots implode implies something about the | whole AI/ML zeitgeist. | meling wrote: | What I would love even more than copilot helping me write code is | a copilot to write my tests for the code I write. | danuker wrote: | They could train on solely MIT-licensed code, and dump ALL the | copyright notices of code used for training into a file. Problem | solved. | Uehreka wrote: | Plenty of people probably copy-paste GPL code with the comments | and stick MIT on it. This kind of thing violates the GPL, but | I'm pretty sure (IANAL) that such code is "fruit of the | poisonous tree", and if you then copy it, you too can be held | responsible. Sure, you might not get caught, but it's a rough | situation if you do. | rebolek wrote: | Have you read the MIT license?
It explicitly says: "The above | copyright notice and this permission notice shall be included | in all copies or substantial portions of the Software." | dgellow wrote: | Another fascinating one: an "About me" page generated by copilot | links to a real person's GitHub and Twitter accounts! | | https://twitter.com/kylpeacock/status/1410749018183933952 | bencollier49 wrote: | That's bonkers. And the beauty of it is that now someone could | realistically do a GDPR erasure request on the neural net. I do | hope that they're able to reverse the data out. | qayxc wrote: | Since the information is encoded in model weights, I doubt | that erasure is even possible. Only post-retrieval filtering | would be an option. | | It only goes to show that opaque black-box models have | no place in the industry. The networks leak information left | and right, because it's way too easy to just crawl the web | and throw terabytes of unfiltered data at the training | process. | ohazi wrote: | I think the fact that there's no way to delete the data in | question without throwing away the entire model is a | feature... | | The strategic goal of a GDPR erasure request would be to | force GitHub to nuke this thing from orbit. | bencollier49 wrote: | > Only post-retrieval filtering would be an option. | | And illegal, if the original information remains. | | I assume that there must be a process for altering the | training data set and rerunning the entire thing. | gmueckl wrote: | The problem is that the information is in an opaque | encoding that nobody can reverse engineer today. So it's | impossible to prove that a certain subset of data has | been removed from the model. | | Say you have a model that repeats certain PII when | prompted in a way that I figure out. I show you the | prompt, you retrain the model to give a different, | non-offensive answer. But now I go and alter the prompt and | the same PII reappears. What now?
| computerex wrote: | Yes, but the compute costs required for training are | probably in the range of hundreds of thousands to | potentially millions of USD. Not to mention potentially | months of training time. ___________________________________________________________________ (page generated 2021-07-02 23:00 UTC)