Gemma 3 12B

Parameters: 12B
Context Length: 128K
Modality: Multimodal
Architecture: Dense
License: Gemma Terms of Use
Release Date: 12 Mar 2025
Knowledge Cutoff: Aug 2024

Technical Specifications

Attention Structure: Grouped-Query Attention
Hidden Dimension Size: 3072
Number of Layers: 42
Attention Heads: 48
Key-Value Heads: 12
Activation Function: -
Normalization: RMS Normalization
Position Embedding: RoPE

System Requirements

VRAM requirements for different quantization methods and context sizes.

Gemma 3 12B

Gemma 3 12B is a 12-billion-parameter multimodal model developed by Google, designed to process both text and image inputs while generating textual outputs. This model is part of the Gemma family, which is built upon the foundational research and technology employed in the Gemini series of models. The architectural design features a decoder-only transformer with Grouped-Query Attention (GQA), incorporating a distinctive pattern of five local sliding window self-attention layers interleaved with one global self-attention layer. This configuration is engineered to optimize KV-cache memory utilization, thereby enhancing efficiency, particularly for longer sequences. Position embeddings are handled via Rotary Position Embeddings (RoPE), adapted with an increased base frequency for extended context windows.
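As a minimal sketch of the 5:1 local/global attention layout described above and why it reduces KV-cache growth: the code below is illustrative only, not Gemma's actual implementation. The sliding-window size of 1024 tokens is an assumption, and the layer count is taken from the spec table above.

# Illustrative sketch of a 5:1 pattern of local sliding-window layers to
# global attention layers. Names and constants are assumptions, not taken
# from the official Gemma 3 code.

SLIDING_WINDOW = 1024       # assumed local window size, for illustration only
NUM_LAYERS = 42             # from the spec table above
PATTERN = 6                 # every 6th layer attends globally

def layer_attention_types(num_layers: int = NUM_LAYERS) -> list[str]:
    """Return 'local' or 'global' for each layer index."""
    return [
        "global" if (i + 1) % PATTERN == 0 else "local"
        for i in range(num_layers)
    ]

def kv_cache_positions(seq_len: int, num_layers: int = NUM_LAYERS) -> int:
    """Rough count of cached key/value positions summed over all layers.

    Local layers only keep the last SLIDING_WINDOW positions, so the cache
    grows with sequence length only in the few global layers.
    """
    total = 0
    for kind in layer_attention_types(num_layers):
        total += seq_len if kind == "global" else min(seq_len, SLIDING_WINDOW)
    return total

if __name__ == "__main__":
    all_global = NUM_LAYERS * 128_000          # every layer fully global
    mixed = kv_cache_positions(128_000)        # 5:1 local/global pattern
    print(f"all-global cache positions: {all_global:,}")
    print(f"5:1 pattern cache positions: {mixed:,}")

At the full 128K-token context, only the cache of the global layers scales with sequence length, which is the KV-cache saving the paragraph above refers to.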

Optimized for deployment across a range of hardware configurations, Gemma 3 12B can operate efficiently on single-GPU systems, workstations, laptops, and even mobile devices. Its multimodal capability is achieved through the integration of a tailored SigLIP vision encoder, which converts images into a sequence of soft tokens for processing. The model supports an expansive context length of 128,000 tokens, enabling it to process substantial amounts of information, including extensive documents and multiple images, within a single prompt. Furthermore, it offers broad multilingual support, encompassing over 140 languages.
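A hedged usage sketch for running a combined image-and-text prompt: it assumes a recent Hugging Face transformers release that exposes an "image-text-to-text" pipeline and the "google/gemma-3-12b-it" checkpoint; the task name, model id, and output format may differ in your library version.

# Assumed example: text + image prompt through Gemma 3 12B via the
# transformers pipeline API. Model id and task name are assumptions.
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-12b-it",
    device_map="auto",              # spread across available GPU(s)/CPU
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/invoice.png"},  # placeholder URL
            {"type": "text", "text": "What is the total amount on this invoice?"},
        ],
    }
]

output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])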

Typical use cases for Gemma 3 12B include advanced natural language understanding and generation tasks such as question answering, comprehensive summarization, and intricate reasoning. Its multimodal capabilities extend to image interpretation, object identification within visual data, and the extraction of textual information from images, making it suitable for a diverse set of vision-language applications. The model also supports function calling, facilitating the development of natural language interfaces for programmatic interactions.
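Because function calling with Gemma is prompt-driven rather than a dedicated API, a minimal sketch looks like the following. The tool schema, prompt wording, and expected JSON reply format are all assumptions chosen for illustration; the model does not enforce any particular format.

# Illustrative prompt-based function calling. get_weather is a hypothetical
# tool; the JSON calling convention is an assumption, not a Gemma standard.
import json

TOOLS = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {"city": "string"},
    }
]

def build_prompt(user_query: str) -> str:
    return (
        "You may call one of these functions by replying with a JSON object "
        'of the form {"name": ..., "arguments": {...}}:\n'
        f"{json.dumps(TOOLS, indent=2)}\n\n"
        f"User: {user_query}"
    )

def parse_call(model_reply: str) -> dict | None:
    """Try to read a function call out of the model's reply."""
    try:
        call = json.loads(model_reply)
        return call if "name" in call else None
    except json.JSONDecodeError:
        return None   # the model answered in plain text instead

# Usage: send build_prompt(...) to the model, then route the result of
# parse_call(...) to your own implementation of the named function.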

About Gemma 3

Gemma 3 is a family of open, lightweight models from Google. It introduces multimodal image and text processing, supports over 140 languages, and features extended context windows up to 128K tokens. Models are available in multiple parameter sizes for diverse applications.



Evaluation Benchmarks

Rankings apply to local LLMs.

Rank: #43

Benchmark Score Rankings

Agentic Coding (LiveBench Agentic): 0.02 (rank 19), 0.48 (rank 19)
Professional Knowledge (MMLU Pro): 0.61 (rank 19), 0.42 (rank 22)
Graduate-Level QA (GPQA): 0.41 (rank 24), 0.29 (rank 25), 0.47 (rank 25)
General Knowledge (MMLU): 0.41 (rank 30)

Rankings

Overall Rank: #43
Coding Rank: #31

GPU Requirements

Required VRAM depends on the selected weight quantization method and context size (from 1K up to the 128K-token context).
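As a rough sketch of how such an estimate can be made (this is not the site's actual calculator formula): weight memory is approximated as parameter count times bytes per parameter for the chosen quantization, plus a KV-cache term that grows with context length. The overhead factor and per-token cache formula are assumptions; model dimensions are taken from the spec table above.

# Rough, illustrative VRAM estimator for Gemma 3 12B. Formula and overhead
# factor are assumptions, not an official requirement calculation.

PARAMS = 12e9                     # 12B parameters
BYTES_PER_PARAM = {               # common quantization choices
    "fp16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}

# KV cache per token: 2 (K and V) * kv_heads * head_dim * layers * bytes.
# head_dim assumed as hidden_size / attention_heads = 3072 / 48 = 64;
# other values from the spec table above; cache kept in fp16 (2 bytes).
KV_HEADS, HEAD_DIM, LAYERS = 12, 64, 42

def estimate_vram_gb(quant: str, context_tokens: int) -> float:
    weights = PARAMS * BYTES_PER_PARAM[quant]
    kv_cache = 2 * KV_HEADS * HEAD_DIM * LAYERS * 2 * context_tokens
    overhead = 1.1                # ~10% for activations and runtime buffers
    return (weights + kv_cache) * overhead / 1024**3

for q in BYTES_PER_PARAM:
    print(f"{q}: ~{estimate_vram_gb(q, 8192):.1f} GB at an 8K context")

Note that this simple estimate treats every layer's cache as growing with context length; as sketched earlier, the 5:1 local/global attention pattern keeps the real cache smaller at long contexts.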