Evaluation

LangSmith article, really good - https://docs.smith.langchain.com/concepts/evaluation

Karpathy tweet

Karpathy notes on Apple, via tweet 11/06/24:

Actually, really liked the Apple Intelligence announcement. It must be a very exciting time at Apple as they layer AI on top of the entire OS. A few of the major themes.

- Step 1 Multimodal I/O. Enable text/audio/image/video capability, both read and write. These are the native human APIs, so to speak.
- Step 2 Agentic. Allow all parts of the OS and apps to inter-operate via "function calling"; kernel process LLM that can schedule and coordinate work across them given user queries.
- Step 3 Frictionless. Fully integrate these features in a highly frictionless, fast, "always on", and contextual way. No going around copy pasting information, prompt engineering, or etc. Adapt the UI accordingly.
- Step 4 Initiative. Don't perform a task given a prompt, anticipate the prompt, suggest, initiate.
- Step 5 Delegation hierarchy. Move as much intelligence as you can on device (Apple Silicon very helpful and well-suited), but allow optional dispatch of work to cloud.
- Step 6 Modularity. Allow the OS to access and support an entire and growing ecosystem of LLMs (e.g. ChatGPT announcement).
- Step 7 Privacy. <3

We're quickly heading into a world where you can open up your phone and just say stuff. It talks back and it knows you. And it just works. Super exciting and as a user, quite looking forward to it.

Prompting

Articles

Strategies

Prompting Strategies

Books

Prompt Book Coming Out

Evaluation

Tools

Evaluation Links
Eugene Yan on Evaluation
List of Evaluation/Observability Tooling
TinyLLM seems particularly interesting
Logfire from Pydantic
Logfire Blog Post

Papers

Engineering and Development

Iterative Loop

Libraries

Education

Papers

Google Education Paper

Blogs

Safety and Standards

Safety Guidelines

Standards

Australian Safety Standards

AI Product Strategy and Portfolio

Strategy Documents

AI Product Strategy PDF

Tools and Resources

Reflection Tools

Reflection Tool

Instructor Use Cases

Knowledge Graphs

GraphRAG Design Patterns, Challenges, Recommendations

GPT Researcher

Companies to Watch

Study Fetch

Routing and Agentic RAG

Leah Bonser on AI Planner Agent
Leah also wrote a second post.
LlamaIndex Router Stuff
Signed up to the Course in DeepLearning.ai

Miscellaneous

AI agent infrastructure

https://www.madrona.com/the-rise-of-ai-agent-infrastructure/

Shadow AI

This keeps coming up as a thing.

Few-Shot Learning

Few-Shot Tool Use Doesn't Really Work Yet

Perplexity Pages

Perplexity Pages FAQ

O'Reilly Article on prompting

https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/

https://t.co/UT3pqK9uK67

https://applied-llms.org/

Evaluation links

https://hamel.dev/blog/posts/evals/
https://eugeneyan.com/writing/evals/

Think I need a page just on evaluation

Prompting

https://eugeneyan.com/writing/prompting/

Goals

https://www.loom.com/share/d0ff7c0b17b34aa7b46f9537c5b25785?start_download=true

I really like this idea of taking two companies, a vendor and a potential client, and framing the vendor's "Goals" (as I see it) so they are relevant to the client, to help in sales calls.

Engineering

Hrishi Olickel's iterative loop of Chat, Play, Loop, and Nest⁶ ⁷ guides a lot of my thinking on how to build agentic workflows. In general, the build principles I follow are:

Pipelines > Prompts
Inputs > Outputs
APIs > Abstractions
Multimodal > Text

The breakdown below collects more specific technique tips and tricks - summarised from Hrishi's talks - across the build cycle.

Stage: Planning & Design
- Iterative Loop:
  • Chat: Explore options
  • Play: Edit in OpenAI playground (60% time)
  • Loop: Add data and test cases
  • Nest: Simplify with subtasks
- Time Allocation:
  • Playing: 60%
  • Prompt Tuning: 20%
  • Input Massaging: 10%
  • Coding: 10%
  • Tooling: 1%

Stage: Development
- Dos:
  • Use all modalities
  • Code for input and output structure
  • Leverage pretraining information
  • Reduce search space upfront
- Don'ts:
  • Add abstractions between yourself and the LLM
  • Stick to one model
  • Have too high an I/O ratio

Stage: Testing & Debugging
- Debugging:
  • Start at the prompt level or try a different model
  • Transform input and add more structure to output
  • Classify tasks and errors to identify where the failure is
- Optimizing AI:
  • Break prompts to reduce complexity
  • Use separate models for structure and long-form writing
  • Implement state management and self-healing

Stage: Deployment & Scaling
- Future Planning:
  • Assume everything will be 10x to 50x cheaper and faster in the future
  • Build for at least 6 months ahead
- AI Resources:
  • Be mindful of the exponential scaling of attention algorithms
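To make the "Loop: Add data and test cases" step and the "Classify tasks and errors" tip above concrete for myself, here is a rough sketch. Everything in it is an assumption for illustration: fake_llm is a hypothetical stand-in for a real model call, and the prompt template, test cases and failure classes are made up, not taken from Hrishi's talks.

```python
import json
from collections import Counter
from typing import Callable

# Hypothetical stand-in for a real model call; swap in your own client.
def fake_llm(prompt: str) -> str:
    if "refund" in prompt:
        return '{"intent": "refund"}'
    if "hours" in prompt:
        return '{"intent": "opening_hours"}'
    return "Sorry, I am not sure."  # unstructured reply, exercises the error path

# Made-up prompt template that asks for structured (JSON) output.
PROMPT_TEMPLATE = (
    'Classify the intent of this message and reply as JSON {{"intent": ...}}: {message}'
)

# Made-up test cases: the data you keep adding in the Loop step.
TEST_CASES = [
    {"message": "I want a refund for my order", "expected": "refund"},
    {"message": "What are your hours on Sunday?", "expected": "opening_hours"},
    {"message": "Where is my parcel?", "expected": "tracking"},
]

def run_case(llm: Callable[[str], str], case: dict) -> str:
    """Return 'pass' or a failure class for a single test case."""
    raw = llm(PROMPT_TEMPLATE.format(message=case["message"]))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return "bad_structure"  # output was not valid JSON
    if not isinstance(parsed, dict) or "intent" not in parsed:
        return "bad_structure"  # valid JSON but the wrong shape
    return "pass" if parsed["intent"] == case["expected"] else "wrong_answer"

def run_suite(llm: Callable[[str], str], cases: list) -> Counter:
    """Run every case and count outcomes per class."""
    return Counter(run_case(llm, case) for case in cases)

results = run_suite(fake_llm, TEST_CASES)
print(results)
```

Counting failures per class is the point: a pile of bad_structure results suggests working on output structure first, while wrong_answer results point back at the prompt or the model.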

Andrew Ng

https://www.deeplearning.ai/the-batch/issue-249/

When building complex workflows, I see developers getting good results with this process:

- Write quick, simple prompts and see how it does.
- Based on where the output falls short, flesh out the prompt iteratively. This often leads to a longer, more detailed, prompt, perhaps even a mega-prompt.
- If that’s still insufficient, consider few-shot or many-shot learning (if applicable) or, less frequently, fine-tuning.
- If that still doesn’t yield the results you need, break down the task into subtasks and apply an agentic workflow.
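One way I could picture the escalation process above is as a ladder of strategies tried in order until the output is good enough. This is only a sketch under assumptions: the four strategy functions are hypothetical stubs returning canned text rather than real model calls, has_three_bullets is a toy acceptance check, and fine-tuning is left out because it does not fit in a few lines.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-ins for real model calls, ordered cheapest first.
def simple_prompt(task: str) -> str:
    return "A summary."

def detailed_prompt(task: str) -> str:
    return "- point one\n- point two"

def few_shot_prompt(task: str) -> str:
    return "- point one\n- point two\n- point three"

def agentic_workflow(task: str) -> str:
    # A real version would split the task into subtasks; this stub does not.
    return "- point one\n- point two\n- point three"

def has_three_bullets(output: str) -> bool:
    """Toy acceptance check: exactly three lines starting with '- '."""
    return sum(line.startswith("- ") for line in output.splitlines()) == 3

def escalate(
    task: str,
    strategies: List[Tuple[str, Callable[[str], str]]],
    is_good_enough: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Try each strategy in order and stop at the first acceptable output."""
    for name, run in strategies:
        output = run(task)
        if is_good_enough(output):
            return name, output
    return None, None  # nothing on the ladder was good enough

LADDER = [
    ("simple prompt", simple_prompt),
    ("detailed prompt", detailed_prompt),
    ("few-shot prompt", few_shot_prompt),
    ("agentic workflow", agentic_workflow),
]

chosen, answer = escalate(
    "Summarise the report in three bullet points", LADDER, has_three_bullets
)
print(chosen)
```

With these stubs the ladder stops at the few-shot rung, so the agentic workflow is never run. That matches the spirit of the advice: only pay for the heavier approach when the cheaper ones fall short.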

Prompting strategies: https://arxiv.org/abs/2311.16452?utm_campaign=The%20Batch&utm_source=hs_email&utm_medium=email&_hsenc=p2ANqtz-_PU4gmbfJN9_gBrzLMkZheDB1ROQnQWYv9cSxeMK53CO9ix0aYRLcabOd6v3xmmbHcM7HE

Hrishi Budget

https://olickel.com/object-oriented-large-language-models

Cameron Wolfe

Many useful ideas from his writings around prompt engineering and RAG https://cameronrwolfe.substack.com/

Reflection

At its simplest, the reflection tool is about thinking over what was done and trying to improve it. Evaluation is a subset of reflection, or maybe the other way around. Here's a great list of the evaluation/observability tooling available:

https://ianww.com/llm-tools
tinyllm seems particularly interesting to me

I also want to think about Logfire from Pydantic:

https://pydantic.dev/logfire
https://python.useinstructor.com/blog/2024/05/01/instructor-logfire/
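A minimal sketch of reflection as described at the top of this section (look at what was done, then try to improve it). It is an assumption-heavy toy: critique and revise are hypothetical rule-based stubs where a real setup would probably call a model, and none of this comes from the tools linked above.

```python
from typing import Callable, List, Tuple

def critique(text: str) -> List[str]:
    """Hypothetical evaluator: return a list of issues found in the draft."""
    issues = []
    if text and not text[0].isupper():
        issues.append("capitalise")
    if not text.endswith("."):
        issues.append("end_with_period")
    return issues

def revise(text: str, issues: List[str]) -> str:
    """Hypothetical reviser: fix the issues the critique reported."""
    if "capitalise" in issues:
        text = text[0].upper() + text[1:]
    if "end_with_period" in issues:
        text = text + "."
    return text

def reflect(
    draft: str,
    critique_fn: Callable[[str], List[str]],
    revise_fn: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Critique and revise until no issues remain or max_rounds is reached."""
    rounds = 0
    while rounds < max_rounds:
        issues = critique_fn(draft)
        if not issues:
            break  # the evaluation step found nothing left to improve
        draft = revise_fn(draft, issues)
        rounds += 1
    return draft, rounds

final, used = reflect("evaluation is part of reflection", critique, revise)
print(final, used)
```

In this framing the critique step is the evaluation, which is one way to read the "evaluation is a subset of reflection" thought. The max_rounds cap is there so a critique that is never satisfied cannot loop forever.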

An evaluation paper

https://arxiv.org/abs/2404.12272

Instructor use cases I want to explore

https://x.com/ddebowczyk/status/1792105564966662523
https://x.com/jxnlco/status/1788558053094117691 [Some rag thing I need to review]
In the docs there is a create_iterable method that can be used instead of Iterable[T] - https://github.com/jxnl/instructor?tab=readme-ov-file#streaming-iterables-create_iterable

Other libraries I'm thinking through

CSIRO links on Safety

https://research.csiro.au/ss/team/se4ai/responsible-ai-engineering/
https://research.csiro.au/ss/team/se4ai/

Routing [Probably more urgent that I read the Leah Post]

https://www.linkedin.com/pulse/how-create-good-ai-planner-agent-leah-bonser-mvrsf/?trackingId=nMqml7p%2BTVK2nhzyPLpsxQ%3D%3D
Note Leah also wrote a second post.

Jeremy Howard Stuff I need to read

https://www.youtube.com/watch?v=jkrNMKz9pWU
https://github.com/fastai/lm-hackers/blob/main/lm-hackers.ipynb
https://x.com/HamelHusain/status/1793319488731107718

LlamaIndex router stuff

https://medium.com/aimonks/agentic-rag-with-llama-index-router-query-engine-01-381e83a418af
Have signed up to the course in deeplearning.ai

AI Product Strategy for portfolio

AI Product Strategy

Ton of awesome AI speakers

https://x.com/HamelHusain/status/1792223793903276343

Prompt book coming out

https://www.oreilly.com/library/view/prompt-engineering-for/9781098156145/

Agent collusion

https://arxiv.org/abs/2404.00806

Hamel Husain

Consulting proposal from a meeting: https://gist.github.com/hamelsmu/ac72d18ee9d4cbd6a235a8e37a75f303

Australian safety standards

https://www.cyber.gov.au/resources-business-and-government/governance-and-user-education/artificial-intelligence/deploying-ai-systems-securely

Education

https://towardsdatascience.com/ai-knocking-on-the-classrooms-door-87db39d00b94

GPT Researcher

1 : https://x.com/hu_yifei/status/1793751353308901394
2 : https://github.com/assafelovic/gpt-researcher

Education

Google Education Paper. Worth reading, but long and dense: https://storage.googleapis.com/deepmind-media/LearnLM/LearnLM_paper.pdf
Donald Clark Blog - https://donaldclarkplanb.blogspot.com/
https://www.ai-supremacy.com/p/ai-in-education-googles-learnlm-product

Shadow AI

this keeps coming up as a thing

Companies to watch

https://www.studyfetch.com/

Few shot tool use doesn't really work yet

https://research.google/blog/few-shot-tool-use-doesnt-really-work-yet/

Perplexity pages

https://www.perplexity.ai/hub/faq/what-is-perplexity-pages

Knowledge graphs

https://gradientflow.com/graphrag-design-patterns-challenges-recommendations/

Evaluation

https://x.com/eugeneyan/status/1796300181186732522

McKinsey state of AI

https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2023-generative-ais-breakout-year

Aha moments for voice

Some specific and niche examples of potential aha moments:

1. Product Manager's Design Sprint

A product manager is leading a design sprint. While walking, they interact with the LLM to brainstorm ideas, outline user personas, and sketch out user journeys. The LLM helps by offering creative suggestions, asking probing questions, and even generating wireframes. This turns what would be idle walking time into productive, ideation-rich sessions.

2. Medical Researcher Reviewing Clinical Trials

A medical researcher on their way to the lab uses their walk to review the latest clinical trial data. The LLM summarizes complex medical journals, highlights key findings, and answers specific queries about methodologies or results. This ensures that the researcher is well-prepared and updated, maximizing their time efficiently.

3. Job Seeker Practicing Industry-Specific Interviews

An IT professional seeking a new job uses their morning walks to practice for interviews. The LLM conducts mock interviews, focusing on technical questions related to the latest programming languages and frameworks. It also provides feedback on responses and suggests improvements, helping the job seeker feel more confident and prepared.

4. Startup Founder Refining a Pitch

A startup founder uses their evening walks to refine their pitch. The LLM assists by simulating investor questions, helping to craft compelling narratives, and suggesting data points to include. This iterative process helps the founder fine-tune their pitch deck and elevator speech, making them better prepared for investor meetings.

5. Musician Composing and Arranging Music

A musician on a stroll uses voice interactions with an LLM to compose and arrange music. They hum melodies, and the LLM helps to notate the music, suggest chord progressions, and even generate accompanying parts for other instruments. This transforms a simple walk into a creative session, leading to new compositions and arrangements.

6. Author Developing a Novel Plot

An author uses their walks to develop novel plots. The LLM assists by discussing character development, plot twists, and thematic elements. The author can voice their ideas, and the LLM offers constructive feedback, alternative scenarios, and even dialogue snippets. This keeps the author in a creative flow, turning walks into productive brainstorming sessions.

These examples illustrate how specific, detailed interactions with an LLM can turn otherwise idle walking time into highly productive and insightful sessions tailored to various professional and creative needs.

Idea - Voice enabled

AI Powered design sprints with voice
walkandtalk.ai

Pattern for building LLM applications (Hrishi)

https://simonwillison.net/2024/Apr/9/a-solid-pattern-to-build-llm-applications/
https://youtu.be/8w0hUcQSDy8?si=VvwvHVuUv94aIzuu

Agent design patterns

https://arxiv.org/abs/2405.10467
agent reference architecture - https://arxiv.org/abs/2311.13148
https://www.youtube.com/watch?v=MXPYbjjyHXc

recommendation systems

https://www.buildingrecsys.com/

Advancing to agents

https://www.youtube.com/watch?v=MXPYbjjyHXc

Prompt report taxonomy

https://arxiv.org/abs/2406.06608

Really interesting infrastructure market map

https://www.bvp.com/atlas/roadmap-ai-infrastructure

AI regulation by Jeremy Howard

https://www.answer.ai/posts/2024-06-11-os-ai.html

Langgraph

Explanation - https://www.youtube.com/watch?v=hvAPnpSfSGo

Really good podcast on agents by Harrison Chase

https://www.youtube.com/watch?v=6XZLoW0-mPY&t

Spiral - Business idea

https://www.youtube.com/watch?v=iZw5GHuR9IY
https://spiral.computer/

Notebooks

https://youtu.be/-kdl04xqasY

Live podcast on O'Reilly LLMs

https://www.youtube.com/watch?v=c0gcsprsFig&t=2839s

Business Idea : Auto convert market map into a nice UI

https://app.dealroom.co/lists/46345

Claudette

https://www.youtube.com/watch?v=p_8Zk6HUCV8
https://www.answer.ai/posts/2024-06-23-claudette-src.html

Lance DB with instructor

https://x.com/Prashant_Dixit0/status/1804789578437722416

Australia Compliance

https://www.itnews.com.au/news/data-and-digital-ministers-agree-to-national-ai-framework-609028
https://www.finance.gov.au/government/public-data/data-and-digital-ministers-meeting/national-framework-assurance-artificial-intelligence-government
