AI Capability Dictionary

As generative AI moves into more and more scenarios, a practical question keeps coming up: which AI capabilities are actually available? And for a specific requirement, which capability, model, or product should you choose?

The easiest approach is last-minute cramming ("clasping the Buddha's feet" only when trouble arrives): when a need comes up, search for cloud APIs and commercial solutions. An image need suggests image generation, a text need suggests an LLM, a voice need suggests ASR/TTS. But piling up products is not the same as systematically planning, selecting, and combining AI capabilities at enterprise scale. Searching on demand and judging by experience leads to fragmentation, ad hoc design, and poor reuse.

This handbook is organized around a core "AI Capability Landscape". The goal is not to cram in names but to help you quickly answer three questions: which AI capability does this job need? Which type of model or product provides it? Which keywords should I search to find an API, project, or service to try?

The handbook runs systematically from modality (text, image, audio, video, 3D, multimodal) to architecture (model, retrieval, Agent, platform engineering). For each typical need and scenario you can find the corresponding AI capability, representative models and products, and real use cases, which helps a team build AI systems with lower trial-and-error cost, more efficient decisions, and more reuse.

There is a lot of content, so look things up as you need them. We recommend sharing this handbook with an AI assistant so it can suggest models and APIs for a specific scenario.

If you only want to know the categories, read the intro of each chapter (e.g. 1.1, 1.2); there is no need to go down to 1.1.1 or 1.1.2.

Recommendation: look up the parts you need or browse the level-1 outline, and read in full only what interests you.

What you will learn

The AI capability landscape: from text, image, audio, video, and 3D to multimodal, Agent, RAG, safety, and platform engineering
Models and products for each capability: Embedding, OCR, ASR, TTS, VLM, RAG
Capability-to-scenario mapping: content products, search Q&A, smart customer service, automated ops

Model parameters

Before we start, let's clarify a pair of terms people use constantly: what is a big model, and what is a small model?

Academically: a big model is a general-purpose model with billions to hundreds of billions, or even a trillion, parameters. A small model is task- or scenario-specific, with tens of millions to hundreds of millions of parameters.

By price: if an API call is extremely cheap (a tiny per-call fee, or a few cents per 1K tokens) and the product does not advertise a general LLM, it is typically a small model (OCR, ASR, image classification, content moderation) or a lightweight LLM compressed or distilled for high-concurrency, low-cost serving. If the per-call price is clearly higher, it is usually a big model.

If a product mentions an LLM, a general big model, a multimodal big model, or complex end-to-end tasks, it is usually a big model. If it emphasizes a single vertical capability (card OCR, invoices, license plates, ad CTR, voice transcription, content safety), the underlying system is usually one or more small models.

Conventions in this handbook:

Big model: general-purpose, conversational, programmable, somewhat more expensive (including multimodal models such as GPT-4o, Gemini 2.5 Pro, Claude 3.7 Sonnet). Covers most general text, code, and multimodal tasks.
Small model: fine-tuned or custom-built for a specific task. Cheaper, stable performance, narrow scope.

An important industry shift: many capabilities in this handbook were handled by "small models" before 2021. Today, most general scenarios can be solved by calling a big model directly.

Where precision and cost are pushed to the extreme, small models remain irreplaceable. Newcomers can still start by calling a big-model API and move on to more advanced setups later. The key is to trade off cost, precision, and latency, then decide where a general LLM fits and where a task-specific small model fits.
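The cost / precision / latency trade-off can be sketched as a small selection rule. This is only an illustration: the option names, prices, accuracy scores, and latencies below are invented placeholders, not measurements of any real product.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    cost_per_call: float  # hypothetical currency units
    accuracy: float       # hypothetical offline score in [0, 1]
    latency_ms: float     # hypothetical typical latency


def pick_model(options, min_accuracy, max_latency_ms):
    """Return the cheapest option meeting both bars, or None if none does."""
    eligible = [
        o for o in options
        if o.accuracy >= min_accuracy and o.latency_ms <= max_latency_ms
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda o: o.cost_per_call)


# Placeholder catalogue: one task-specific small model, one general LLM.
OPTIONS = [
    Option("small-invoice-ocr", cost_per_call=0.001, accuracy=0.97, latency_ms=80),
    Option("general-llm", cost_per_call=0.02, accuracy=0.93, latency_ms=1200),
]

choice = pick_model(OPTIONS, min_accuracy=0.9, max_latency_ms=2000)
print(choice.name if choice else "no option fits")
```

In practice the accuracy figure would come from your own evaluation set for the task, since a small model's score only holds inside its narrow scope.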

Common general-purpose big models:
OpenAI: GPT-4, GPT-4.1, GPT-4o, GPT-5
Google: Gemini 2.5 Pro, Gemini 2.5 Flash
Anthropic: Claude 3.7 Sonnet, Opus 4
Open: Llama 4, Qwen 3, DeepSeek-V3, Mistral Large 2

1. Text task (Text / NLP / LLM)

Text tasks are foundational. Content moderation, search and recommendation, KB Q&A, writing assistants, code copilots: all of them revolve around one thing, the machine genuinely understanding text.

1.1 Basic language modeling + representation

Goal: make the machine statistically "familiar" with language and find stable vector representations for words, sentences, and documents.

Scenario:

Search: general search, e-commerce site search, KB retrieval
Recommend: content recommend, product recommend, user interest modeling
Q&A: FAQ, KB Q&A
Text analysis: sentiment, dedup, clustering
Downstream task base: classification, IE, generation

Principle:

Language modeling: autoregressive (GPT, Llama, Qwen) + Masked LM (BERT, RoBERTa, ERNIE)
Word/sentence/document representation: Word2Vec/GloVe (static), BERT embedding/Sentence-BERT (contextual), document-level vector

Model: BERT/RoBERTa/ERNIE, GPT/Llama/Qwen LLM, Embedding (OpenAI text-embedding-3, bge, E5, SimCSE; BGE-M3 for Vietnamese)
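To make "autoregressive language modeling" concrete, here is a toy bigram model: it estimates the probability of the next token from counts and generates greedily. It is a deliberately tiny stand-in, assuming whitespace tokenization and a three-sentence corpus; real LLMs such as GPT, Llama, or Qwen use neural networks over subword tokens.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Maximum-likelihood estimate of P(next | prev)."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}
    return {tok: c / total for tok, c in counts[prev].items()}


def generate(counts, max_len=10):
    """Greedy decoding: always take the most likely next token."""
    out, prev = [], "<s>"
    for _ in range(max_len):
        probs = next_token_probs(counts, prev)
        if not probs:
            break
        # sorted() makes tie-breaking deterministic (alphabetical).
        prev = max(sorted(probs), key=probs.get)
        if prev == "</s>":
            break
        out.append(prev)
    return " ".join(out)


CORPUS = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram(CORPUS)
print(next_token_probs(model, "the"))
print(generate(model))
```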

1.2 Text classification + matching

Build classifiers and similarity measures on top of the vector representations.

Scenario: sentiment classification, intent recognition, spam detection, FAQ matching, dedup, paraphrase detect, semantic search.

Principle:

Classification: single-label, multi-label, hierarchical
Matching: pairwise (Sentence-BERT, cross-encoder), retrieval (bi-encoder)
Natural Language Inference (NLI): entailment, contradiction, neutral

Model: fine-tuned BERT, sentence-transformers (all-MiniLM, paraphrase-mpnet), commercial API (OpenAI moderation, Azure Content Moderator).
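The retrieval-style (bi-encoder) matching pattern can be sketched as: embed each candidate independently, embed the query, then rank by cosine similarity. In this sketch `embed` is a toy bag-of-words counter standing in for a real embedding model such as a sentence-transformers checkpoint; the FAQ entries are made up.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: sparse word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_match(query, candidates):
    """Return (score, candidate) for the most similar candidate."""
    q = embed(query)
    return max((cosine(q, embed(c)), c) for c in candidates)


FAQ = [
    "how do i reset my password",
    "what is the refund policy",
    "how do i change my email",
]

score, question = best_match("reset password", FAQ)
print(round(score, 3), question)
```

With a real bi-encoder the candidate vectors are computed once and stored, which is what makes this pattern cheap at query time; a cross-encoder instead scores each (query, candidate) pair jointly and is usually reserved for reranking a short list.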

1.3 Sequence labeling + Information Extraction (IE)

Token-level decision.

Scenario: Named Entity Recognition (people, places, companies), relation extraction (Hoàng works at AIECOS), event extraction (M&A, leadership change), invoice extraction (date, amount, vendor).

Principle:

NER: BIO tagging, CRF, span-based
Relation extraction: pipeline (NER → RE), joint
LLM extraction: zero/few-shot, structured output (JSON schema)

Model: spaCy, Stanford CoreNLP, fine-tuned BERT (BERT-CRF), LLM (GPT-4o with JSON mode, Claude tool use), specialized (UIE, GLiNER).
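BIO tagging is easiest to understand from the decoding side: given per-token tags, merge them back into entity spans. The sketch below assumes tags of the form `B-X`, `I-X`, and `O`, and simply drops an `I-` tag that does not continue an open entity; the example sentence extends the "Hoàng works at AIECOS" example, and the tags are written by hand rather than predicted by a model.

```python
def bio_to_spans(tokens, tags):
    """Merge BIO-tagged tokens into (entity_text, label) spans."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity still open.
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # Continuation of the currently open entity.
            current.append(token)
        else:
            # "O", or an inconsistent "I-" tag: close the open entity.
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans


tokens = ["Hoàng", "works", "at", "AIECOS", "in", "Ha", "Noi"]
tags = ["B-PER", "O", "O", "B-ORG", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))
```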

1.4 Text generation + editing

Scenario: writing assistant, content creation, summarization, translation, code generation, rewriting, expansion, polishing.

Principle:

Generation: autoregressive LLM
Summarization: extractive, abstractive
Translation: NMT (Marian, mBART), LLM zero-shot
Editing: instruction-tuned LLM

Model: GPT-4o, Claude 3.7, Gemini 2.5, Qwen, Llama, Mistral. Translation: DeepL, Google Translate, M2M-100. Code: GitHub Copilot, Cursor, Codestral.
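The difference between extractive and abstractive summarization is that the extractive kind only selects existing sentences. A minimal frequency-based sketch follows; it assumes English text, splits sentences on end punctuation, and does no stop-word removal, so it is a baseline for intuition rather than a usable summarizer.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z']+")


def extractive_summary(text, n_sentences=1):
    """Pick the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(WORD.findall(text.lower()))

    def score(sentence):
        words = WORD.findall(sentence.lower())
        # Average word frequency, so long sentences are not favoured.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Return the selected sentences in their original order.
    return [s for s in sentences if s in top]


TEXT = "Cats sleep a lot. Cats and dogs are pets. The weather is nice."
print(extractive_summary(TEXT, n_sentences=1))
```

An abstractive summary, by contrast, is generated text, which is what an instruction-tuned LLM produces when asked to summarize.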

2. Image modality (Image / Vision)

2.1 Low-Level Vision

Pixel-level processing: denoising, super-resolution, deblur, color correction, HDR.

Scenario: photo enhance, video upscale, scan cleanup, low-light enhance.

Model: ESRGAN, Real-ESRGAN, SwinIR, DnCNN, NAFNet.

2.2 Image classification + recognition

Scenario: product category, scene tag, brand recognition, NSFW detection, medical image diagnosis.

Model: ResNet, EfficientNet, Vision Transformer (ViT), CLIP zero-shot, ImageNet-pretrained backbones.

2.3 Object detection

Scenario: pedestrian detect, vehicle detect, defect detect, security camera, autonomous driving.

Model: YOLO series (YOLOv8, v11, v12), DETR, Faster R-CNN, RT-DETR.
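Whatever detector you pick, its raw output is usually a set of overlapping boxes with scores, and two small routines show up again and again around it: intersection-over-union (IoU) and greedy non-maximum suppression (NMS). The sketch below assumes boxes in `(x1, y1, x2, y2)` format; some newer detectors (for example the DETR family) are designed to avoid NMS altogether.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes overlapping a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep  # indices into the original lists, best score first


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```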

2.4 Image segmentation

Pixel-level mask.

Scenario: cutout, virtual background, medical organ segmentation, satellite analysis.

Model: Segment Anything (SAM, SAM 2), Mask R-CNN, U-Net, MaskFormer.

2.5 Keypoint + action recognition

Skeleton + motion.

Scenario: motion capture, fitness app, dance grading, surveillance behavior.

Model: OpenPose, MediaPipe, MMPose, ViTPose.

2.6 Open-vocabulary / open-world / open-domain detection

Without fixed label set.

Scenario: "find me the green cup" search, prompt-based object search.

Model: GroundingDINO, OWL-ViT, OWLv2, CLIP.

2.7 Vision-Language tasks

VLM cross-modal.

Scenario: image captioning, VQA (visual Q&A), image search by text.

Model: CLIP, BLIP-2, LLaVA, Qwen2.5-VL, GPT-4V, Claude vision.

2.8 OCR

Text from image.

Scenario: invoice extraction, ID verification (eKYC), document digitalization, receipt parsing.

Model: PaddleOCR, EasyOCR, Tesseract, TrOCR, MinerU. Commercial: Google Vision OCR, Azure Document Intelligence, FPT.AI OCR.

2.9 Image generation + editing

Scenario: art creation, marketing image, product image variation, inpainting, outpainting.

Model: Flux 1.1 Pro, Stable Diffusion 3.5, SDXL, Midjourney v7, DALL-E 3, Google Imagen 3. Edit: ControlNet, InstantID, IP-Adapter, Adobe Firefly.

2.10 Image Quality Assessment (IQA)

Quality score for image.

Scenario: photo selection, model output filter, photography contest.

Model: BRISQUE, NIQE, MUSIQ, MANIQA, Q-Align.

3. 3D / Spatial modality (3D / Spatial / XR)

3.1 3D perception + reconstruction

Scenario: 3D scanning, AR/VR, photogrammetry, autonomous driving depth.

Model: NeRF, Gaussian Splatting, COLMAP, MVSNet, Depth Anything.

3.2 3D scene understanding + SLAM

Scenario: robot navigation, AR mapping, mobile robot.

Model: SuperPoint, SuperGlue, ORB-SLAM, NICE-SLAM.

3.3 3D generation + editing

Scenario: game asset, metaverse, virtual try-on, product visualization.

Model: Shap-E, GET3D, Meshy, Tripo, DreamGaussian, Trellis.

4. Audio (Audio / Speech)

4.1 Waveform-level audio processing

Scenario: noise reduction, echo cancellation, voice enhance.

Model: RNNoise, DNS Challenge models, FullSubNet, NSNet.

4.2 Speech Recognition (ASR) + Speaker

Scenario: voice memo, podcast subtitle, call center transcription, meeting note.

Model: Whisper (large-v3 is a strong multilingual choice), Faster-Whisper, Conformer, Paraformer. Vietnamese: VietAI Whisper-VN, FPT.AI ASR. Speaker ID: ECAPA-TDNN, x-vector, Resemblyzer.
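When comparing ASR options on your own recordings, the usual yardstick is word error rate (WER): word-level edit distance divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and no text normalization (real evaluations normally lowercase and strip punctuation first):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # match or substitution
            )
    return dp[-1][-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```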

4.3 Audio / Music understanding

Scenario: music tagging, content ID (Shazam-like), genre classification, scene classification.

Model: PANNs, YAMNet, MERT, MusicGen analysis.

4.4 Speech + audio generation (TTS / VC / Music)

Scenario: audiobook, voice assistant, voice cloning, music creation.

Model: TTS: ElevenLabs, OpenAI TTS, F5-TTS, Coqui XTTS, FishSpeech. Voice clone: OpenVoice, GPT-SoVITS. Music: Suno, Udio, MusicGen, Stable Audio.

5. Video

5.1 Traditional video processing

Scenario: video encoding, compression, watermark, transcoding.

Tools: FFmpeg, x264, AV1, NVENC.

5.2 Video understanding + structure analysis

Scenario: scene segmentation, action recognition, video tagging, highlight clip.

Model: TimeSformer, VideoMAE, InternVideo, V-JEPA.

5.3 Video + Language multimodal

Scenario: video Q&A, video captioning, video search by text.

Model: Gemini 2.5 (handles roughly hour-long video), Qwen2.5-VL, InternVL, Video-LLaMA.

5.4 Video generation + editing

Scenario: short video creation, ad gen, animation, scene generation.

Model: Sora, Veo 3, Runway Gen-4, Pika 2.0, Kling, Hunyuan Video, MiniMax Video.

5.5 Digital human / Avatar

Scenario: live host, virtual influencer, customer service, e-learning instructor.

Model: HeyGen, Synthesia, D-ID, Hedra, EMO (Alibaba), VASA-1 (Microsoft).

6. Time series + sequential decision

6.1 Classical statistical TS modeling

Scenario: sales forecast, inventory, energy demand, traffic flow.

Model: ARIMA, Prophet, ETS, statsmodels.
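As a feel for what the classical methods do, here is simple exponential smoothing, the most basic member of the ETS family: the forecast is a running weighted average in which recent observations count more. This is a hand-rolled sketch with a fixed `alpha`; libraries such as statsmodels also fit `alpha` and handle trend and seasonality.

```python
def simple_exp_smoothing(series, alpha):
    """Return the one-step-ahead forecast after smoothing the whole series."""
    if not series:
        raise ValueError("series must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    level = series[0]  # initialise the level with the first observation
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level


weekly_sales = [10, 12, 14]
print(simple_exp_smoothing(weekly_sales, alpha=0.5))
```

A larger `alpha` reacts faster to recent changes; a smaller one smooths out noise.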

6.2 Deep learning TS forecasting

Model: N-BEATS, Temporal Fusion Transformer (TFT), Informer, PatchTST, TimeMixer, TimesFM (Google foundation model).

6.3 Anomaly + change point detection

Scenario: fraud detection, equipment failure, system monitoring.

Model: Isolation Forest, LSTM autoencoder, USAD, RobustAD, Anomaly Transformer.
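Before reaching for the models above, it is worth having a statistical baseline to compare against. The sketch below is a plain z-score detector, not any of the listed models: it flags points that lie more than `threshold` standard deviations from the mean, and it assumes roughly stationary data without trend or seasonality.

```python
import statistics


def zscore_anomalies(values, threshold=3.0):
    """Return indices of points more than `threshold` std devs from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # constant series: nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


# Twenty normal readings followed by one spike.
readings = [10.0] * 20 + [100.0]
print(zscore_anomalies(readings))
```

Note that a large outlier inflates the standard deviation it is judged against, which is one reason robust or learned detectors are preferred on messy data.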

6.4 Spatio-temporal modeling

Scenario: traffic prediction, weather, crowd flow.

Model: ConvLSTM, ST-GCN, GraphWaveNet, ClimaX (climate).

7. Agent + Tool use

7.1 Tool calling / Function calling

Scenario: AI book ticket, query weather, control device, query DB.

Model: GPT-4o function calling, Claude tool use, Gemini function calling, Qwen function call. Framework: LangChain, LlamaIndex, OpenAI Assistants.
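Tool calling boils down to a loop in which the model emits a structured request and your code executes it. The application side can be sketched as a registry plus a dispatcher. The JSON shape used here (`name` plus `arguments`) is a simplified assumption, not the exact wire format of any vendor, and `get_weather` is a hypothetical stub.

```python
import json

TOOLS = {}


def tool(fn):
    """Register a function so the dispatcher can find it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_weather(city: str) -> str:
    # Hypothetical stub; a real tool would call a weather service.
    return f"Sunny in {city}"


def dispatch(tool_call_json):
    """Execute one model-issued tool call and return a JSON-serialisable result."""
    call = json.loads(tool_call_json)
    name, args = call["name"], call.get("arguments", {})
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"result": TOOLS[name](**args)}
    except TypeError as exc:
        # Wrong or missing arguments: report back instead of crashing.
        return {"error": str(exc)}


print(dispatch('{"name": "get_weather", "arguments": {"city": "Hanoi"}}'))
```

Returning errors as data matters because the result is fed back to the model, which can then correct its arguments and retry.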

7.2 Workflow orchestration + multi-Agent

Scenario: end-to-end task automation, multi-role collab simulation, complex business process.

Framework: LangGraph, CrewAI, AutoGen, Microsoft Magentic-One, Anthropic Claude Code, Cursor Agent. Protocol: MCP, A2A.

8. Retrieval + Knowledge

8.1 RAG (Retrieval-Augmented Generation)

Scenario: enterprise KB Q&A, document chat, customer support, code search.

Component: embedding model + vector DB + retrieval strategy + LLM. Framework: LlamaIndex, LangChain, Haystack, Dify, FastGPT, RAGFlow. Vector DB: Pinecone, Milvus, Qdrant, Weaviate, Chroma, pgvector.
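How the components fit together can be shown in a few lines: retrieve the most relevant chunks, then put them in the prompt. This sketch swaps the embedding model and vector DB for plain word overlap and stops before the LLM call; the chunks and the prompt wording are made up for illustration.

```python
def tokenize(text):
    """Lowercase, strip basic punctuation, and return the set of words."""
    for ch in "?.,!":
        text = text.replace(ch, " ")
    return set(text.lower().split())


def retrieve(query, chunks, k=2):
    """Rank chunks by word overlap with the query (stand-in for vector search)."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, chunks, k=2):
    """Assemble the grounded prompt that would be sent to the LLM."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


KB = [
    "Refunds are processed within 7 days.",
    "Our office is in Hanoi.",
    "Passwords can be reset from the login page.",
]

print(build_prompt("How are refunds processed?", KB, k=1))
```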

8.2 Structured data + Knowledge Graph

Scenario: enterprise knowledge management, search engine entity, recommendation, fraud detection.

Tool: Neo4j, Stardog, Apache TinkerPop, OpenSPG. LLM-based KG: GraphRAG (Microsoft), LightRAG.

9. Safety / Alignment / Evaluation

9.1 Capability evaluation + benchmark

Scenario: model selection, A/B test prompt, regression test.

Benchmark: MMLU (general), HumanEval (code), MT-Bench (chat), GAIA (agent), AgentBench, BFCL (function calling). Eval tool: LangSmith, Helicone, Langfuse, Phoenix (Arize), Braintrust, Promptfoo.
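Code benchmarks such as HumanEval are usually reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. Given n samples per problem of which c pass, the standard unbiased estimator is 1 - C(n-c, k) / C(n, k), sketched below for a single problem (the benchmark score is the mean over problems).

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct ones."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=5, k=1))
```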

9.2 Value alignment + training

Method: RLHF (Reinforcement Learning from Human Feedback), DPO (Direct Preference Optimization), Constitutional AI (Anthropic), RLAIF.
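Of these methods, DPO is the simplest to write down: for each preference pair it compares how much the policy has shifted probability toward the chosen answer versus the rejected one, relative to a frozen reference model. A scalar sketch of the per-pair loss follows; real training (for example with Hugging Face TRL) computes this over batches of sequence log-probabilities, and the default `beta` here is just a commonly used value, not a recommendation.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen shift - rejected shift)))."""
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(m)) == log(1 + exp(-m))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
```

The loss falls as the policy raises the chosen answer's log-probability or lowers the rejected one's, relative to the reference.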

9.3 Content safety + compliance

Scenario: jailbreak prevent, harmful content filter, prompt injection defense, PII detection.

Tool: OpenAI Moderation, Azure Content Safety, AWS Bedrock Guardrails, Llama Guard 3, NVIDIA NeMo Guardrails, Lakera Guard.
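For PII detection, a cheap first layer is often pattern-based redaction applied before text ever reaches a model or a log. The sketch below covers only email addresses and bare 9 to 11 digit phone numbers; both patterns are simplified assumptions and will miss many real formats, so treat it as a pre-filter in front of the dedicated tools above, not a replacement.

```python
import re

# Simplified patterns, assumed for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{9,11}\b")


def redact_pii(text):
    """Replace emails and bare digit-only phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


print(redact_pii("Contact a.b@example.com or call 0912345678."))
```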

10. AI for Science (AI4Science)

10.1 Molecular + drug design

Model: AlphaFold 3, DiffDock, MolFormer, Boltz-1, ChemCrow.

10.2 Protein + structural biology

Model: AlphaFold 3, ESMFold, RoseTTAFold, ProteinMPNN, RFdiffusion.

10.3 Physics simulation + surrogate modeling

Model: PINN (Physics-Informed NN), FNO (Fourier Neural Operator), GraphCast (Google weather), Aurora (Microsoft).

10.4 Materials discovery + crystal design

Model: GNoME (Google), MatterGen, M3GNet, ALIGNN.

10.5 Mathematics + symbolic reasoning

Model: AlphaProof, AlphaGeometry, Minerva, DeepSeek-Math, LLEMMA. Tool: Lean, Coq theorem provers.

10.6 Scientific workflow + lab automation

Tool: Coscientist (CMU), ChemCrow, Aviary, MOSES.

11. Platform + engineering (MLOps / Infra)

11.1 Model training + fine-tuning

Framework: PyTorch, JAX, DeepSpeed, FSDP, Hugging Face Transformers + PEFT + TRL. Service: Together AI, Fireworks AI, Modal, RunPod, Lambda Labs. Apple Silicon: MLX. Low-code / no-code: LLaMA-Factory, Unsloth, Axolotl, OpenAI fine-tuning UI.

11.2 Model deployment + inference optimization

Engine: vLLM, TGI (Text Generation Inference), TensorRT-LLM, SGLang, llama.cpp, MLC-LLM, Ollama (local). Quantization: GGUF, AWQ, GPTQ, bitsandbytes. Serving: BentoML, Modal, Replicate, Anyscale.
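The idea behind quantization is to store weights as small integers plus a scale factor. The sketch below shows plain symmetric per-tensor int8 rounding on a Python list; it is not the AWQ or GPTQ algorithm, which choose scales and rounding far more carefully, and GGUF is a file format that packages quantized weights rather than a method.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: w is approximated by q * scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, [round(w, 4) for w in restored])
```

Each restored weight differs from the original by at most half a quantization step, which is the precision traded for the memory saving over 16-bit or 32-bit floats.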

11.3 Data + model ops

Data: Label Studio, Argilla, Snorkel, Cleanlab. Experiment tracking: MLflow, Weights & Biases, Neptune. Monitoring: Arize, WhyLabs, Evidently, Helicone, LangSmith.

Summary

This handbook is a navigation map for the AI capability landscape. Recommendations:

Newcomers: scan the level-1 outline to get the big picture
With a specific need: look up the relevant section, copy the model or tool names, and search their docs
Architects: use it as a checklist when designing an AI system so that no capability is overlooked
Staying updated: AI changes fast and new models come out every week. Follow:

For Vietnamese developers in 2026

APIs with pricing friendly to Vietnamese developers: Fireworks, Together, DeepInfra, Replicate
Open source: Qwen 3, Llama 4, DeepSeek-V3; smaller or quantized variants can run locally on 1-2 GPUs
Vietnamese-specific: PhoGPT, Vistral, VinaLLaMA for Vietnamese-language tasks
Skill priorities for 2026:
1. Function calling + MCP: the foundational skill for every AI app
2. RAG patterns: KB Q&A is the largest enterprise use case
3. Agent orchestration: LangGraph, Claude Code, Cursor
4. Fine-tuning with LoRA: for domain-specific needs
5. Cost optimization + caching
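For the caching item, the simplest win is an exact-match response cache: identical (model, prompt) pairs are answered from memory instead of triggering a second paid call. In this sketch `call_fn` is a hypothetical stand-in for whatever client function you use; it only helps with deterministic, repeated requests and is separate from the provider-side prompt caching that some APIs offer.

```python
import hashlib
import json


class ResponseCache:
    """Exact-match cache in front of a model-calling function."""

    def __init__(self, call_fn):
        self.call_fn = call_fn  # hypothetical: (model, prompt) -> response text
        self.store = {}
        self.hits = 0

    def ask(self, model, prompt):
        # Hash the request so the key stays small for long prompts.
        key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            self.store[key] = self.call_fn(model, prompt)
        return self.store[key]


def fake_llm(model, prompt):
    # Placeholder for a real API call.
    return prompt.upper()


cache = ResponseCache(fake_llm)
cache.ask("some-model", "hello")
cache.ask("some-model", "hello")
print(cache.hits)
```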

This handbook is maintained continuously. Even so, any entry for a model released more than 6 months ago may already be out of date. Always verify against the official docs.
