The GitHub Copilot Case

This paper by Gabriele Montanari analyses a recent class action against GitHub Copilot that has given new fuel to the discourse around AI and copyright. Copilot is an AI tool trained on a large body of publicly available source code, including GitHub’s public repositories.

Going from Software Protection to Artificial Intelligence Authorship

Summary:

A recent class action against GitHub Copilot has given new fuel to the discourse around AI and copyright. Copilot is an AI tool trained on a large body of publicly available source code, including GitHub’s public repositories. Since this code is protected by open-source licenses, legal problems arise both from training the AI on licensed material in the input phase and from Copilot’s output of verbatim code snippets stripped of the related Copyright Management Information (CMI).

Regarding input infringement, in the US, text and data mining (TDM) is a matter of fair use. In this regard, Authors Guild v. Google, Inc. and Google LLC v. Oracle America, Inc. give a blueprint for how fair use jurisprudence relates to TDM and protected software. In more detail, software suffers from the “original sin” of being considered a functional product. Arguably, US jurisprudence undervalues the fact that what is copyrighted is not the function but the specific expression of it. In fact, in Oracle, the majority decision gave no weight to the intuitive and understandable organization of the Java API’s declaring code, which is what made it so appreciated by developers. These judgements rest on the view that copyright’s ultimate goal is to expand public knowledge and understanding, and that authors are not its ultimate beneficiaries. Applying analogous reasoning to the Copilot case, it seems safe to assume that a US court would consider TDM in this instance fair use. Even so, other instances of TDM might be evaluated differently.

In the EU, on the other hand, TDM is the object of specific provisions: Articles 3 and 4 of Directive (EU) 2019/790. Article 4 provides an exception to the right of reproduction that anyone can benefit from, but it is restricted to lawfully accessible works. At the same time, rightsholders can expressly reserve the use of their works, exercising their right to opt out of mining activities. Accordingly, publishing code in a public GitHub repository, and consequently licensing the software to GitHub, can also be interpreted as allowing mining of the published code. But the matter is not so simple, since programmers often add code written by third parties to their repositories. Additionally, the very adequacy of these TDM exceptions is still debated. It is feared that a less aggressive implementation of mining would translate into a loss of market opportunities for European countries. Furthermore, there is no definitive answer on whether AI training falls within the scope of these exceptions, even if AI companies will probably assume that it does. This discussion extends to the actual scope of protection of the right of reproduction, since an alternative to the technical and literal reading of that right emerged in the Pelham case.

On the output side, what is relevant for the US is the possible violation of sections 1201–1205 of title 17 U.S.C. as amended by the DMCA. In fact, it can be argued that GitHub/Microsoft is distributing a product that circumvents the license system governing the open-source ecosystem. Again, this becomes a legal question of whether the occasional reproduction of licensed content, stripped of its CMI, counts as fair use. According to HathiTrust, the creation of complete digital copies of protected works can be transformative when they serve “a new and different function from the original work”. By that standard, it can be argued that verbatim reproduction of software is not an “extraction of information”: it does not offer a different function and should not be considered transformative. Moreover, the lack of CMI is not functional to the services that the AI offers. This absence is a choice of the AI’s producers, not an essential part of machine learning. This shortcoming also affects the fourth factor of the fair use test, since accepting what Copilot does as admissible would mean validating all similar tools capable of bypassing software licenses, which will multiply exponentially in the coming years. Still, in order to find a DMCA violation, the substance of the reproduction must be considered. Even a small amount of copying can fall outside the scope of fair use when the copied fragment represents the “heart” of the original’s authorial expression.

Another interesting question on the output side concerns the authorship of AI output. Since the creative power of the human mind is considered indispensable for authorship, the level of human input and supervision over the code produced by Copilot must be evaluated concretely. However, the machine cannot be fully equated to “a tool like a pen”. When it reproduces licensed code without giving its users any kind of warning, the fault can lie only with the machine’s producers.

The European stance on authorship is similar to the American one; programmers must therefore use their own personal capabilities to define the final form of what Copilot outputs. Infringement of the right of reproduction on Copilot’s side must likewise be assessed concretely. The practical or utilitarian function of a work does not impede its copyright protection, except when the expression is dictated solely by a technical function. This EU orientation leads to the conclusion that even small fragments of code can be eligible for copyright protection. Nevertheless, it must be gauged case by case how much of a personal touch the programmer added to the arrangement of the code.

In conclusion, the Copilot case is one of the testing grounds for the policy decisions currently being made, with the warning that, in this instance, the open-source ecosystem could be severely damaged. First, there are still uncertainties about when code is expressive and when it is instead “inherently bound together with uncopyrightable ideas”. Second, there are conflicting views on whether protecting authorship is more important than catching up with other countries. With countries like China making bold decisions such as the Dreamwriter case, the US and EU must determine where they stand.