POV on AI-Generated Code
“Basics don’t change regardless of who or what wrote the code” – Aaditya Uthappa, Co-Founder & COO
Generative AI (GenAI) has redefined the way businesses work today. It fuels innovation, automates tasks, and simplifies the work itself. With over 55% of companies using GenAI, its adoption is rapidly increasing. However, this progress comes with potential risks: data security breaches, privacy violations, and the generation of inaccurate or biased outputs remain key concerns. Recent 2023 studies on the security of AI-generated code surveyed developers and found that over 56% frequently or sometimes encountered security vulnerabilities in code suggestions from AI tools. This is a significant risk, given how widely GenAI is now used for code generation.
Gen AI Coding Assistants: Efficiency & Risk
In today’s fast-paced development environment, AI coding assistants are valuable tools for stretching your development budget further. They offer undeniable advantages: speed, efficiency, and convenience. However, these benefits come with inherent risks, particularly data leakage and the potential for incorporating malicious code. The merits and risks are outlined below:
Merits
-
Faster Development Cycles:
AI assistants automate repetitive tasks and generate code snippets, significantly accelerating the development process. This frees up developers to focus on more complex problems and core functionalities.
-
Convenience for Developers:
Having AI-powered suggestions and code generation capabilities readily available can streamline the workflow for developers, making development a more convenient and efficient experience.
Risks
-
Data Leakage:
A critical concern is the potential for sensitive information to be inadvertently leaked during the AI code generation process. This could include passwords, API keys, or proprietary code snippets.
-
Blind Trust:
Blindly trusting the output generated by AI assistants is risky. Code reviews and developer scepticism are critical to ensure the generated code functions as intended and doesn't contain security flaws.
-
Open-source Code used as Training Data and Not Under a Permissive License:
If the AI tool was trained on open-source code that doesn't carry a permissive license (e.g., code released under the GPL), incorporating that code into your project could lead to licensing violations. Understanding the licensing terms of the training data is crucial.
-
Unintended Incorporation of Malicious Code:
The vastness of training data increases the risk of malicious code inadvertently slipping into the AI model. This could lead to vulnerabilities in the generated code. Choosing AI tools with robust security measures in place for training data selection helps mitigate this risk.
Are Developers Writing Less Secure Code with GenAI Tools?
A recent study cited by Schneier Security suggests that developers who had access to an AI assistant or GenAI tool wrote significantly less secure code than those without access.
Schneier Security – “Participants with access to an AI assistant based on OpenAI’s codex-davinci-002 model wrote significantly less secure code than those without access. Additionally, participants with access to an AI assistant were more likely to believe they wrote secure code than those without access to the AI assistant. Furthermore, we find that participants who trusted the AI less and engaged more with the language and format of their prompts (e.g. re-phrasing, adjusting temperature) provided code with fewer security vulnerabilities.”
How Do We Mitigate AI-Generated Code Risks?
Auto-remediating AI assistants such as GitHub Copilot pose a greater challenge, making it more difficult to identify and address potential vulnerabilities. Here’s how you can mitigate the risks of AI-generated code:
1. Reapplying Basic Security Hygiene
-
Secure Coding Principles:
This applies to both human developers and AI assistants. The code should be written following well-established secure coding guidelines to minimize vulnerabilities like buffer overflows, SQL injection attacks, and cross-site scripting (XSS) attacks. These guidelines typically involve techniques like proper input validation, data sanitization, and secure memory management; the first sketch after this list shows what these patterns look like in practice.
-
Dependency Management:
AI assistants might suggest incorporating third-party libraries (pre-written code modules) to achieve specific functionalities. Treat these suggestions like any other dependency and manage them deliberately:
-
Security Posture:
Don't blindly trust external libraries. Evaluate their security posture. Look for libraries with a good track record of security updates and a responsible development team.
-
Vulnerability Scanning:
Subject the suggested libraries to vulnerability scanning tools, such as Software Composition Analysis (SCA), to identify any known weaknesses before integrating them into your project.
-
Secrets Scanning:
AI-generated code might inadvertently introduce sensitive information like passwords or API keys. Use secrets scanning tools to detect and remove any such hardcoded secrets from the codebase; a simple regex-based scan like the one sketched after this list can catch the most common patterns.
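To make the secure coding point concrete, here is a minimal Python sketch, assuming a simple SQLite-backed user lookup (the table, column, and function names are illustrative, not taken from any real project). It shows two patterns reviewers should expect regardless of who or what wrote the code: input validation and parameterized queries instead of string concatenation.

```python
# Minimal sketch: validate input and use parameterized queries so that
# AI-suggested database code does not open the door to SQL injection.
# Table, column, and function names are illustrative only.
import re
import sqlite3

def get_user_by_email(conn: sqlite3.Connection, email: str):
    # Basic input validation: reject anything that does not look like an email.
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):
        raise ValueError("Invalid email address")
    # Parameterized query: the driver handles escaping, unlike string concatenation.
    cursor = conn.execute("SELECT id, name FROM users WHERE email = ?", (email,))
    return cursor.fetchone()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
    conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", ("Alice", "alice@example.com"))
    print(get_user_by_email(conn, "alice@example.com"))
```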
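For the secrets scanning point, a minimal regex-based sketch follows. The patterns (AWS-style access keys and generic password or API key assignments) are assumptions about what you want to flag; dedicated secrets-scanning tools cover far more patterns and reduce false positives.

```python
# Minimal sketch of a regex-based secrets scan over a codebase.
# The patterns below are illustrative, not exhaustive.
import re
from pathlib import Path

SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "hardcoded_password": re.compile(r"(?i)password\s*=\s*['\"][^'\"]+['\"]"),
    "generic_api_key": re.compile(r"(?i)api[_-]?key\s*=\s*['\"][^'\"]+['\"]"),
}

def scan_file(path: Path):
    """Return (path, line number, pattern name) for every suspicious line."""
    findings = []
    for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((str(path), lineno, name))
    return findings

def scan_tree(root: str):
    findings = []
    for path in Path(root).rglob("*.py"):
        findings.extend(scan_file(path))
    return findings

if __name__ == "__main__":
    for path, lineno, name in scan_tree("."):
        print(f"{path}:{lineno}: possible {name}")
```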
2. AI Coding Usage Guidelines
-
Approved Tool List:
Not all AI coding assistants are created equal. Create a curated list of approved tools that have undergone a rigorous evaluation process. Security features, the quality and origin of training data, and vendor reputation are all crucial factors to consider when selecting these tools.
-
Code Ownership:
Clearly define code ownership for AI-generated code. This helps with accountability and ensures proper integration and maintenance within the project. While the AI assists in generation, human developers ultimately own the final product.
-
Strengthened Code Reviews:
Existing code review practices need to be adapted for AI-generated code. Reviewers should be trained to identify potential issues like bias from the training data, security vulnerabilities, and adherence to coding standards. Consider double-reviewing critical code sections for added security.
-
AI Skepticism Training:
Educate developers to avoid "blind trust" in AI tools. Training programs should instill a healthy skepticism and encourage developers to critically evaluate the outputs generated by AI assistants.
-
IP and Open-Source Awareness:
Equip developers with the knowledge to avoid intellectual property (IP) infringement and open-source license violations. Training can cover topics like proper attribution for borrowed code snippets and how to ensure the AI tool's training data doesn't contain copyrighted material.
-
Data Prompt Exclusion:
Train developers to identify and exclude sensitive data from prompts used for AI code generation. This helps prevent the inadvertent inclusion of confidential information like passwords or API keys in the generated code; a minimal prompt-redaction sketch follows this list.
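As an illustration of prompt exclusion, the sketch below redacts obvious secrets from a prompt before it is sent to an AI assistant. The patterns and the example prompt are placeholders; a real implementation would hook into whatever client your approved tool provides.

```python
# Minimal sketch: strip obvious secrets from a prompt before it leaves the
# developer's machine. Patterns are illustrative only.
import re

REDACTION_PATTERNS = [
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED_AWS_KEY]"),
    (re.compile(r"(?i)(password\s*=\s*)['\"][^'\"]+['\"]"), r"\1'[REDACTED]'"),
    (re.compile(r"(?i)(api[_-]?key\s*=\s*)['\"][^'\"]+['\"]"), r"\1'[REDACTED]'"),
]

def redact_prompt(prompt: str) -> str:
    """Apply each redaction pattern in turn and return the cleaned prompt."""
    for pattern, replacement in REDACTION_PATTERNS:
        prompt = pattern.sub(replacement, prompt)
    return prompt

if __name__ == "__main__":
    raw = 'Fix this config: api_key = "sk-12345secret" and password = "hunter2"'
    print(redact_prompt(raw))
    # Prints the prompt with both values replaced by '[REDACTED]'.
```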
3. Governance Policy Controls
-
AI Tool Selection Process:
Establish a formal process for evaluating and approving AI coding tools. This process should involve security professionals, developers, and project managers working together to assess the tools based on predefined criteria like security features, training data quality, and vendor support.
-
Access Controls:
Limit access to AI code generation tools. Only authorized and trained developers should be able to utilize them. This helps prevent misuse and ensures that those using the tools have the necessary skills to create secure and high-quality code.
-
Version Control and Audit Logs:
Maintain a meticulous record of AI-generated code. Implement a robust version control system to track different versions of the code, who generated them, and the specific purpose for which they were created. Detailed audit logs can be invaluable for troubleshooting issues and ensuring code quality over time; one possible shape for such a record is sketched after this list.
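The sketch below shows one way to append such a record each time AI-generated code is accepted into the repository. The field names and the JSON-lines log format are assumptions; adapt them to your own version control and logging conventions.

```python
# Minimal sketch: append a structured audit record whenever AI-generated code
# is accepted. Field names and the JSON-lines format are assumptions.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path

@dataclass
class AIGenerationRecord:
    developer: str    # who accepted the suggestion
    tool: str         # which approved AI assistant produced it
    file_path: str    # where the code landed
    commit_hash: str  # version-control reference
    purpose: str      # why the code was generated
    timestamp: str = ""

def log_generation(record: AIGenerationRecord, log_file: str = "ai_audit_log.jsonl") -> None:
    record.timestamp = datetime.now(timezone.utc).isoformat()
    with Path(log_file).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(record)) + "\n")

if __name__ == "__main__":
    log_generation(AIGenerationRecord(
        developer="a.developer",
        tool="approved-assistant",
        file_path="src/payments/retry.py",
        commit_hash="abc1234",
        purpose="retry logic for payment gateway timeouts",
    ))
```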
Indemnity Support in AI-Generated Code
The potential of AI-powered coding assistants is undeniable, but concerns regarding intellectual property (IP) infringement remain a critical consideration. Some vendors, including AWS, IBM, Microsoft, and GitHub, offer indemnity support as a safety net. However, to navigate this landscape, it’s essential to understand both indemnity’s limitations and the importance of proactive strategies.
-
Limited Coverage:
Indemnity clauses within vendor terms of service are not comprehensive guarantees. They typically provide coverage for specific infringement scenarios explicitly outlined in the legalese. This means not all potential IP claims might be eligible for protection under the indemnity clause. For instance, the indemnity might only apply if the infringement arises directly from the vendor's AI model itself, excluding any third-party code you integrate with the platform. A meticulous review of the vendor's terms of service is crucial to grasp the exact scope of coverage offered.
-
Higher-Tier Exclusivity:
Indemnity support might be a benefit reserved for paid subscriptions, often restricted to higher tiers. Free plans or basic versions might not include this safety net. Therefore, a cost-benefit analysis is vital before relying solely on indemnity as your primary defence.
-
Vendor-Developed Model Focus:
The indemnity clause might only apply to AI models developed by the vendors themselves. If you choose to utilize third-party models that integrate with the platform, those models might not be covered by the vendor's indemnity clause. This underscores the importance of understanding the origin and training data of any third-party models you incorporate. Furthermore, over-reliance on vendor indemnity creates a false sense of security. You shouldn't be absolved of the responsibility to ensure your code adheres to IP rights. Maintaining a proactive approach through governance controls empowers you to take ownership of your IP compliance.
Self-Hosted AI Models: Balancing Control and Complexity
Self-hosted AI models offer an attractive proposition for organizations concerned about intellectual property (IP) ownership and data privacy. While cloud-based solutions provide convenience, self-hosting allows you to retain complete control over your model and the data it’s trained on. Here’s a detailed breakdown of the merits and demerits to aid your decision:
Merits of Self-Hosting:
-
Model Ownership:
Self-hosting grants you complete ownership of the AI model. This means you control its development, deployment, and how you choose to leverage it. This is particularly valuable for proprietary tasks or applications where keeping the model confidential is crucial.
-
Data Privacy:
By keeping the training data within your environment, you ensure it doesn't leave your organization's control. This can be critical for sensitive data or industries with strict data privacy regulations.
-
Fine-tuning on Private Code:
Self-hosting allows you to fine-tune the AI model on your private code repositories or data in less popular languages that might not be well-supported by cloud-based solutions. This level of customization can significantly improve the model's performance on your specific tasks.
Demerits of Self-Hosting:
-
Continuous Upgrades:
The responsibility for maintaining and upgrading the AI model falls entirely on your organization. This requires a dedicated team with expertise in AI development and infrastructure management. Cloud-based solutions often handle upgrades automatically.
-
Infrastructure Management:
Self-hosting necessitates managing the underlying hardware and software infrastructure required to run the AI model. This includes aspects like computing power, storage, and network resources.
-
Pace of Innovation:
Staying abreast of the latest advancements in AI can be challenging with a self-hosted solution. Cloud platforms constantly innovate and update their AI services, while self-hosting requires manual effort to keep up with the evolving landscape.
-
Benchmarking Challenges:
Comparing your self-hosted model's performance against industry benchmarks can be difficult. Cloud platforms often provide pre-trained models with established performance metrics, making it easier to gauge your model's effectiveness.
Recommendations for Secure and Responsible AI-Assisted Development
1. Maintain Traditional Security Measures
-
Integrate SAST, DAST, and SCA:
Don't abandon your existing security toolset. Static Application Security Testing (SAST), Dynamic Application Security Testing (DAST), and Software Composition Analysis (SCA) tools remain crucial for measuring code quality and identifying vulnerabilities. Integrate these tools into your development pipeline even when using AI assistants; a minimal pipeline gate of this kind is sketched after this list.
-
Long-Term Security Monitoring:
Maintain a long-term perspective on security. Regularly scan your codebase with traditional tools to identify potential security issues that might emerge over time as your project evolves, even after AI-assisted code generation has been implemented.
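One lightweight way to keep these tools in the loop is a small gate script in the CI pipeline that runs whichever scanners you already license and fails the build when any of them report findings. The commands below are placeholders rather than real tool invocations; substitute the SAST, DAST, and SCA tools your organization actually uses.

```python
# Minimal sketch of a CI gate that runs existing scanners and fails the build
# if any of them report findings. Commands are placeholders, not real CLIs.
import subprocess
import sys

SCAN_COMMANDS = {
    "SAST": ["your-sast-tool", "--scan", "src/"],
    "SCA": ["your-sca-tool", "--requirements", "requirements.txt"],
    "DAST": ["your-dast-tool", "--target", "https://staging.example.internal"],
}

def run_scans() -> int:
    """Run every configured scanner; return the number of failed or missing scans."""
    failures = 0
    for name, command in SCAN_COMMANDS.items():
        try:
            result = subprocess.run(command, capture_output=True, text=True)
        except FileNotFoundError:
            print(f"[{name}] scanner not installed; treating as a failure")
            failures += 1
            continue
        if result.returncode != 0:
            print(f"[{name}] reported findings:\n{result.stdout}{result.stderr}")
            failures += 1
        else:
            print(f"[{name}] passed")
    return failures

if __name__ == "__main__":
    sys.exit(1 if run_scans() else 0)
```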
2. Prioritize Secure AI Tools
-
Enterprise-Grade Security:
Not all AI coding assistants are created equal. Prioritize tools that offer robust security features like secure enclaves for training data, access controls, and regular penetration testing. Look for vendors that demonstrate a commitment to enterprise-grade security.
-
Governance Controls:
Choose tools that provide granular governance controls. This might include features like pre-approval processes for AI models, audit logs to track code generation activity, and the ability to restrict access to AI tools based on user roles and permissions.
-
Avoid Generic LLMs:
While large language models (LLMs) like ChatGPT can be fascinating, their focus might not be on secure code generation. For mission-critical projects, prioritize AI coding assistants specifically designed for enterprise development environments, offering the aforementioned security features and governance controls.
3. Browser-based DLP for Leakage Protection
-
Curbing Leaks to Generic LLMs:
If you choose to use browser-based interfaces for AI code generation, consider implementing data loss prevention (DLP) solutions. These tools can help prevent sensitive information like code snippets or API keys from being inadvertently leaked to generic LLMs that might not have the same security protocols as enterprise-grade AI assistants.
4. Indemnity and IP Protection
-
Favor Explicit Indemnity:
When evaluating AI coding assistants, favor services that offer explicit indemnification protection against IP infringement claims. While indemnity has limitations (as discussed previously), it can provide a valuable safety net.
-
Dual-Edged Sword of IP:
AI-assisted development presents a double challenge:
-
Loss of Control:
Ensure the AI tool doesn't inadvertently reveal your proprietary information or code in the generated outputs.
-
Infringing Others' IP:
Be mindful of the training data used by the AI tool and the potential for unintentional infringement of other people's intellectual property.
5. Developer Training and Code Review
-
Hallucination Awareness:
Train developers to be aware of the potential for AI-generated code to contain "hallucinations": seemingly correct code that might be functionally incorrect or contain hidden vulnerabilities. Rigorous code reviews remain essential for catching these issues.
-
Critical Thinking and Scepticism:
Developers should be encouraged to think critically about AI-generated code. Blindly trusting the output is a recipe for disaster. Effective code review practices and a healthy dose of scepticism are essential.
Finding the Balance
AI coding assistants can be powerful tools, but they require a cautious approach. By being aware of the risks and implementing appropriate mitigation strategies, you can harness the benefits of AI-powered development while keeping your business secure.