This is a CDK Python project to host DeepSeek-R1-Distill-Llama-8B on an Amazon SageMaker Real-time Inference Endpoint. In this example, we'll demonstrate how to adapt the SGLang framework to run on SageMaker AI endpoints. SGLang is a serving framework for large language models that provides state-of-the-art performance, including a fast backend runtime for efficient serving with RadixAttention, extensive model support, and an active open-source community. For more information, refer to https://docs.sglang.ai/index.html and https://github.com/sgl-project/sglang.
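As a quick smoke test, the deployed endpoint can be invoked with boto3. This is a minimal sketch, not the project's verified client: the endpoint name is a placeholder, and it assumes the SGLang container forwards the invocation body to its OpenAI-compatible chat-completions API.

```python
import json

import boto3

# Hypothetical endpoint name; replace with the name created by your CDK stack.
ENDPOINT_NAME = "deepseek-r1-distill-llama-8b"

smr = boto3.client("sagemaker-runtime")

# Assumption: the container passes this OpenAI-style payload through to SGLang.
payload = {
    "messages": [{"role": "user", "content": "What is 17 * 24?"}],
    "max_tokens": 512,
    "temperature": 0.6,
}

response = smr.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```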
This is a CDK Python project to deploy DeepSeek-R1-Distill-Llama-8B to a SageMaker real-time endpoint with the scale-to-zero feature. This project demonstrates how you can scale in your SageMaker endpoint to zero instances during idle periods, eliminating the previous requirement of maintaining at least one running instance.
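Scale-to-zero relies on endpoints built from inference components, whose copy count is registered with Application Auto Scaling using a minimum capacity of zero. A minimal sketch (the inference component name is hypothetical, and a real deployment also attaches a scaling policy to drive scale-out):

```python
import boto3

aas = boto3.client("application-autoscaling")

# Hypothetical inference component name created by the CDK stack.
resource_id = "inference-component/deepseek-r1-ic"

# Registering the scalable target with MinCapacity=0 is what allows the
# endpoint's copy count to scale in to zero during idle periods.
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=0,
    MaxCapacity=1,
)
```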
This is a CDK Python project to host deepseek-ai/Janus-Pro-7B on an Amazon SageMaker Real-time Inference Endpoint. Janus-Pro-7B is a unified understanding and generation multimodal large language model (MLLM), which decouples visual encoding for multimodal understanding and generation.
Hosting LG AI EXAONE-Deep models on Amazon SageMaker
EXAONE-Deep 7.8B using SGLang
This is a CDK Python project to host LG AI EXAONE Deep 7.8B on an Amazon SageMaker Real-time Inference Endpoint. EXAONE Deep is a family of models, ranging from 2.4B to 32B parameters, developed and released by LG AI Research, that exhibits superior capabilities in various reasoning tasks including math and coding benchmarks. In this example, we'll demonstrate how to adapt the SGLang framework to run on SageMaker AI endpoints. SGLang is a serving framework for large language models that provides state-of-the-art performance, including a fast backend runtime for efficient serving with RadixAttention, extensive model support, and an active open-source community. For more information, refer to https://docs.sglang.ai/index.html and https://github.com/sgl-project/sglang.
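For reasoning models such as EXAONE Deep, streaming the response is often more practical than waiting for the full chain of thought. A sketch using the SageMaker runtime's response-streaming API; the endpoint name is hypothetical, and it assumes the SGLang container supports streaming and honors an OpenAI-style payload:

```python
import json

import boto3

smr = boto3.client("sagemaker-runtime")

payload = {
    "messages": [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    "max_tokens": 1024,
    "stream": True,  # assumption: the container honors OpenAI-style streaming
}

response = smr.invoke_endpoint_with_response_stream(
    EndpointName="exaone-deep-7-8b",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)

# Each event carries a raw payload chunk; decode and print tokens as they arrive.
for event in response["Body"]:
    chunk = event.get("PayloadPart", {}).get("Bytes")
    if chunk:
        print(chunk.decode("utf-8"), end="", flush=True)
```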
This project is an AWS CDK Python project for deploying the Langfuse application using Amazon Elastic Container Registry (ECR) and Amazon Elastic Container Service (ECS). Langfuse is an open-source LLM engineering platform that helps teams collaboratively debug, analyze, and iterate on their LLM applications.
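For orientation, the shape of such a deployment in CDK Python looks roughly like the following. This is a simplified sketch, not the project's actual stack: it runs the public langfuse/langfuse image behind an ALB on Fargate, and omits the database, cache, and secrets the real deployment provisions.

```python
from aws_cdk import App, Stack
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_ecs as ecs
from aws_cdk import aws_ecs_patterns as ecs_patterns
from constructs import Construct


class LangfuseEcsStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        vpc = ec2.Vpc(self, "LangfuseVpc", max_azs=2)
        cluster = ecs.Cluster(self, "LangfuseCluster", vpc=vpc)

        # Run the Langfuse container behind an Application Load Balancer.
        ecs_patterns.ApplicationLoadBalancedFargateService(
            self, "LangfuseService",
            cluster=cluster,
            desired_count=1,
            task_image_options=ecs_patterns.ApplicationLoadBalancedTaskImageOptions(
                image=ecs.ContainerImage.from_registry("langfuse/langfuse"),
                container_port=3000,
            ),
        )


app = App()
LangfuseEcsStack(app, "LangfuseEcsStack")
app.synth()
```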
This project is a Question Answering application with Large Language Models (LLMs) and Knowledge Bases for Amazon Bedrock. An application using the RAG (Retrieval-Augmented Generation) approach retrieves information most relevant to the user’s request from the enterprise knowledge base or content, bundles it as context along with the user’s request as a prompt, and then sends it to the LLM to get a GenAI response. In this project, Amazon OpenSearch Serverless is used as the vector store for the Knowledge Base for Amazon Bedrock.
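Once the Knowledge Base is in place, the whole retrieve-then-generate loop can be driven with a single Bedrock API call. A minimal sketch, with a placeholder knowledge base ID and model ARN:

```python
import boto3

brt = boto3.client("bedrock-agent-runtime")

# Hypothetical knowledge base ID and model ARN; substitute your own values.
response = brt.retrieve_and_generate(
    input={"text": "What is our refund policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KBID123456",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)
print(response["output"]["text"])
```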
With Amazon Aurora PostgreSQL used for a Knowledge Base for Amazon Bedrock
This project is a Question Answering application with Large Language Models (LLMs) and Amazon Aurora PostgreSQL using pgvector. An application using the RAG (Retrieval-Augmented Generation) approach retrieves information most relevant to the user’s request from the enterprise knowledge base or content, bundles it as context along with the user’s request as a prompt, and then sends it to the LLM to get a GenAI response. In this project, Amazon Aurora PostgreSQL with pgvector is used as the vector store for the Knowledge Base for Amazon Bedrock.
This project is a Question Answering application with Large Language Models (LLMs) and Amazon Kendra. An application using the RAG (Retrieval-Augmented Generation) approach retrieves information most relevant to the user’s request from the enterprise knowledge base or content, bundles it as context along with the user’s request as a prompt, and then sends it to the LLM to get a GenAI response.
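Here the retrieval step can be as simple as calling Amazon Kendra's Retrieve API and packing the returned passages into the prompt. A minimal sketch with a placeholder index ID:

```python
import boto3

kendra = boto3.client("kendra")

# Hypothetical Kendra index ID; replace with the index your stack creates.
result = kendra.retrieve(
    IndexId="YOUR-KENDRA-INDEX-ID",
    QueryText="How do I rotate my access keys?",
)

# Each retrieved passage can be concatenated into the LLM prompt as context.
for item in result["ResultItems"]:
    print(item["DocumentTitle"], "->", item["Content"][:200])
```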
This project is a Question Answering application with Large Language Models (LLMs) and Amazon OpenSearch Service. An application using the RAG (Retrieval-Augmented Generation) approach retrieves information most relevant to the user’s request from the enterprise knowledge base or content, bundles it as context along with the user’s request as a prompt, and then sends it to the LLM to get a GenAI response.
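In this variant, retrieval is a k-NN query against an OpenSearch index of embedded passages. A minimal sketch using opensearch-py, with hypothetical domain, index, and field names (authentication is omitted for brevity):

```python
from opensearchpy import OpenSearch

# Hypothetical domain endpoint; a real client would also configure auth.
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

query_embedding = [0.01] * 1024  # produced by your embedding model

# k-NN search over the vector field returns the most similar passages,
# which are then packed into the LLM prompt as context.
response = client.search(
    index="rag-documents",  # hypothetical index name
    body={
        "size": 3,
        "query": {"knn": {"embedding": {"vector": query_embedding, "k": 3}}},
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_source"]["text"])
```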
With Amazon Bedrock and Amazon Aurora PostgreSQL using pgvector
This project is a Question Answering application with Large Language Models (LLMs) and Amazon Aurora PostgreSQL using pgvector. An application using the RAG (Retrieval-Augmented Generation) approach retrieves information most relevant to the user’s request from the enterprise knowledge base or content, bundles it as context along with the user’s request as a prompt, and then sends it to the LLM to get a GenAI response. In this project, Amazon Aurora PostgreSQL with pgvector is used as the knowledge base.
With LLMs and Amazon Aurora PostgreSQL using pgvector
This project is a Question Answering application with Large Language Models (LLMs) and Amazon Aurora PostgreSQL using pgvector. An application using the RAG (Retrieval-Augmented Generation) approach retrieves information most relevant to the user’s request from the enterprise knowledge base or content, bundles it as context along with the user’s request as a prompt, and then sends it to the LLM to get a GenAI response. In this project, Amazon Aurora PostgreSQL with pgvector is used as the knowledge base.
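With pgvector, retrieval is a plain SQL query ordered by vector distance. A minimal sketch using psycopg2, with hypothetical connection settings and schema (a "documents" table with an "embedding" column of pgvector's vector type):

```python
import os

import psycopg2

# Hypothetical connection string, e.g. exported as DATABASE_URL.
conn = psycopg2.connect(os.environ["DATABASE_URL"])

query_embedding = [0.01] * 1024  # produced by your embedding model
vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"

# "<->" is pgvector's L2 distance operator; the closest rows come first.
with conn.cursor() as cur:
    cur.execute(
        "SELECT content FROM documents ORDER BY embedding <-> %s::vector LIMIT 3",
        (vec_literal,),
    )
    for (content,) in cur.fetchall():
        print(content)
```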
A Question Answering Generative AI application with Large Language Models (LLMs) deployed on Amazon SageMaker, using Amazon MemoryDB for Redis as a vector database.
A Question Answering Generative AI application with Large Language Models (LLMs) deployed on Amazon SageMaker, using Amazon DocumentDB (with MongoDB Compatibility) as a vector database.