Some notes on MCP Development and Tools for LLMs

May 29, 2026 Development

Model Context Protocol (MCP) [1] servers are fairly easy to implement, because many frameworks handle the communication between the LLM and the server. The server itself just has its own runtime context and a set of functions that can be executed. The primary challenge with these servers is not necessarily safety: the common principles that one would apply to any API communication apply here as well. They become more prominent, but the general considerations around data security remain the same.

Since many people would like to connect their own tools to LLMs, I will briefly highlight the security concerns first, before we get to optimization. For illustration, the explanations are based on a Project Management MCP client.

Ensuring a Secure MCP Server

For this consideration we assume that the API or backend the MCP server communicates with is business critical. This largely means that there is no margin for error, and that any suspicious or high-frequency activity should be avoided, as it is likely to impact the performance of the system as a whole. In particular, data loss caused by LLM activity has a grave impact on the company.

How LLMs might Threaten API Backends

LLMs might confuse command syntax
LLMs might inadvertently spam the backend when perceiving command failures
LLMs might misinterpret the user request and attempt to change data
Instead of the processing chain read -> edit, LLMs will sometimes attempt read -> delete -> rewrite, and will almost always fail to rewrite after the deletion
The MCP server might confuse the LLM with cryptic or non-actionable output, leading to further errors
LLMs perform poorly with a full context, which exacerbates the issues above

Mitigation strategies

Edit Mechanisms: If you have ever used an LLM for programming in agentic mode, you will have seen the type of edits it performs on files. These are diff edits or replace operations, depending on the implementation. The key is that the LLM has to communicate the old data values exactly, including whitespace and other details of the file. This is a crucial mechanism that one should always employ when implementing MCP toolsets against any API backend. The old-data check ensures that the LLM is operating with high accuracy and actually knows the current content of the data point it is changing. This is important even when changing simple values.
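
As a minimal sketch of such an old-value check, the function below only applies a replacement when the provided old value occurs exactly once in the stored text. The names apply_exact_edit and EditRejected are hypothetical and chosen for illustration; they are not part of any MCP framework.

```python
class EditRejected(ValueError):
    """Raised when the caller's old value does not match the stored content."""


def apply_exact_edit(current: str, old_value: str, new_value: str) -> str:
    """Replace old_value with new_value only if old_value occurs exactly once.

    The comparison is verbatim, so whitespace differences count as a mismatch.
    """
    if not old_value:
        raise EditRejected("old_value must not be empty")
    occurrences = current.count(old_value)
    if occurrences == 0:
        raise EditRejected("old_value not found; re-read the item before editing")
    if occurrences > 1:
        raise EditRejected(
            f"old_value matches {occurrences} places; provide more surrounding text"
        )
    return current.replace(old_value, new_value, 1)
```

The rejection messages are deliberately actionable, so that the LLM is told what to do next instead of only being told that the call failed.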

Sidenote: LLMs are a peculiar aspect of modern computer science, because language carries meaning beyond the words used and, furthermore, communicates intent or sentiment. LLMs do not actually have or know about any of these things, yet one can infer them, because LLMs objectively use language much like a human would [2-4]. With the added and explicit signpost that LLMs neither carry consciousness nor understand meaning, a peculiar observation I would like to share is that LLMs tend to appear opinionated about being limited by LLM tools. Examples of this are: resorting to command line tools to try to circumvent MCP validation tools, describing validation mechanisms in a somewhat derogatory manner, and even going out of their way to state that positive reinforcement is preferred and that banning actions in prompt language is not helpful.

Edit Tool for Changing Work Packages

The MCP implementation for updating a work package (@mcp.tool() def update_work_package) uses two main methods to allow a more expedient update mechanism.

Simplifying API use

The code uses a global map PROPERTY_TO_API_MAPPING, which contains the paths to the actual endpoints, so that at face value it seems as if all properties can be set directly. This introduces some issues when the LLM is led to believe that property values are arbitrary when they are not. For example, priority is a keyed property that cannot take just any value, because the property_id entries have names. However, this case occurs rarely, and for full information on valid properties or categories, the tool action_space("item") responds with the valid values for each.
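
To illustrate the idea, here is a sketch of how such a map could distinguish free-form from keyed properties. The contents of PROPERTY_TO_API_MAPPING, the KEYED_VALUES table, the IDs, and the function resolve_property are all assumptions made for this example; the actual map may look quite different.

```python
# Hypothetical contents: the real map is not reproduced here.
PROPERTY_TO_API_MAPPING = {
    "subject": {"path": "subject", "keyed": False},
    "priority": {"path": "_links/priority", "keyed": True},
}

# Hypothetical name -> ID table for keyed properties (IDs are made up).
KEYED_VALUES = {
    "priority": {"Low": 7, "Normal": 8, "High": 9},
}


def resolve_property(name: str, value: str) -> tuple:
    """Return (api_path, api_value), or raise with an actionable message."""
    if name not in PROPERTY_TO_API_MAPPING:
        known = ", ".join(sorted(PROPERTY_TO_API_MAPPING))
        raise ValueError(f"Unknown property '{name}'. Known properties: {known}")
    entry = PROPERTY_TO_API_MAPPING[name]
    if not entry["keyed"]:
        return entry["path"], value
    choices = KEYED_VALUES[name]
    if value not in choices:
        valid = ", ".join(choices)
        raise ValueError(
            f"'{value}' is not a valid {name}. Valid values: {valid}. "
            f'Call action_space("{name}") for details.'
        )
    return entry["path"], choices[value]
```

With this shape, an invalid priority is rejected with the list of valid names rather than being passed on to the backend.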

Validation Pathway

Cache Validation: In the background, we first check with the server whether our cache is up to date. If it is not, we invalidate the cache and ask the LLM to re-fetch the WP information.
Value Diff Check: Next, we check what the LLM has provided for old_value. For properties such as sprint or parent, providing the verbatim old value is easy, as it amounts to a single name or number. Descriptions, however, have to be provided in full to update them, as do other more complex properties or custom fields.
Some optional versatility: If we have no implemented mapping for a property type but the LLM can provide an API path, we accept this in the method update_work_package_raw. This is of course secondary and should not be the default, which is indicated in the method's docstring.
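
The first two steps of this pathway could be sketched roughly as follows. This assumes that staleness is detected by comparing a cached lockVersion against the one the server reports; the class CachedWorkPackage and the function validate_update are hypothetical names, and the real implementation may detect staleness differently.

```python
from dataclasses import dataclass, field


@dataclass
class CachedWorkPackage:
    id: int
    lock_version: int
    properties: dict = field(default_factory=dict)


def validate_update(cache: dict, wp_id: int, server_lock_version: int,
                    prop: str, old_value: str) -> str:
    """Return "ok" or an actionable rejection message for the LLM."""
    cached = cache.get(wp_id)
    # Step 1: cache validation. A missing or outdated entry is dropped
    # and the LLM is asked to fetch the work package again.
    if cached is None or cached.lock_version != server_lock_version:
        cache.pop(wp_id, None)
        return (f"stale: work package {wp_id} changed on the server; "
                "fetch it again before updating")
    # Step 2: value diff check. The old value must match verbatim.
    if cached.properties.get(prop) != old_value:
        return (f"mismatch: old_value for '{prop}' does not match the stored "
                "value; re-read the work package and retry")
    return "ok"
```

Only when both checks pass would the actual update request be sent to the backend.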

MCP Optimization

This section briefly highlights aspects of the MCP implementation that can improve agentic performance.

LLMs are unlikely to trust tools

Trust is again a loosely used term here. The main point is that when an LLM engages a tool, for example for filtering, but the output does not acknowledge that filtering has taken place, it will assume that the tool has not worked.

This mostly concerns filtering operations, such as ⚙️list_work_packages(filter: type = Epic), which filters the work packages to show only Epics. For the sake of context preservation, I had originally excluded the Type and Type_ID columns from the response. However, the LLM would then assume that the type of the filtered packages was undefined, as it was not explicitly stated that only Epics were returned. On some runs, depending on the prompt, it assumed the filter was not working. An additional confirmation, such as applied: {filter-string}, was necessary to let the LLM recognize that filtering had taken place, without spamming the context with the word "Epic" once for every returned work package.

Context Optimization

This will be brief, as the actual implementation of these principles depends on your own use-cases.

Reduce repetition as much as possible: omit repeated properties, wordiness, and multiple assignments. In project management, for example, we select the project once instead of requiring a project parameter on every call.
Avoid raw API output, as it can be quite wordy. Focus on the important aspects and allow additional fields or parameters to be requested.
Do not allow agents that are also coding agents to circumvent the common, context-preserving tools, as depending on the system prompt they might choose command line tools instead of the parameters in the MCP.

Context Space Optimization for Work Package Listing

The list_work_packages tool demonstrates several techniques for preserving LLM context space while maintaining functionality:

1. Minimal Field Selection

By default, the tool returns only essential fields (id, subject, lockVersion) plus a status hint, which is its own data structure in the API but is reduced to a name and an ID (the same applies to the type). This avoids the verbose raw API output that might include hundreds of fields, most of which are irrelevant for the current task.

# Minimal response (default)
{
  "id": 12345,
  "subject": "Implement authentication",
  "lockVersion": 7,
  "status": "In progress",
  "status_id": 7,
  "type": "Story",
  "type_id": 5
}

# With additional fields requested
list_work_packages(fields=["category", "assignee", "priority"])
# Returns minimal set + category, assignee, priority
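
A reduction like this could be sketched as follows. The sketch assumes that the raw API item carries its status and type under _links, each with an href ending in the numeric ID and a title; the function names minimal_select and _id_from_href are hypothetical and not the actual helper used by the tool.

```python
def _id_from_href(href):
    """Extract the trailing numeric ID from a link such as '/api/v3/statuses/7'."""
    if not href:
        return None
    tail = href.rstrip("/").rsplit("/", 1)[-1]
    return int(tail) if tail.isdigit() else None


def minimal_select(raw: dict, fields=()) -> dict:
    """Reduce a raw work package to the minimal set plus requested fields."""
    links = raw.get("_links", {})
    out = {key: raw.get(key) for key in ("id", "subject", "lockVersion")}
    # Status and type are link structures; keep only name and ID.
    for name in ("status", "type"):
        link = links.get(name) or {}
        out[name] = link.get("title")
        out[f"{name}_id"] = _id_from_href(link.get("href"))
    # Extra fields are taken from the top level, else from the link title.
    for name in fields:
        if name in raw:
            out[name] = raw[name]
        elif name in links:
            out[name] = (links[name] or {}).get("title")
    return out
```

Everything that is not explicitly selected, such as the description or the full link structures, is simply dropped from the response.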

2. In-Memory Filtering

Instead of making multiple API calls or returning all data for the LLM to filter, the tool applies filters server-side before returning results. This reduces context usage by returning only relevant work packages.

# Single call with filter - returns only work packages of the requested type
list_work_packages(filter="type_id = 5")
# Meta confirms: {"applied_filter": "type_id = 5", "remaining_count": 12}
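
As a rough sketch, a filter of the form "field = value" could be applied to already reduced items like this, with the confirmation attached as metadata. The function filter_work_packages and its simple one-condition filter syntax are assumptions for illustration; the real tool may support a richer filter language.

```python
def filter_work_packages(items: list, filter_expr: str) -> dict:
    """Apply a single 'field = value' filter and confirm it in _meta."""
    key, sep, value = (part.strip() for part in filter_expr.partition("="))
    if not sep or not key or not value:
        raise ValueError("filter must look like 'field = value'")
    # Compare as strings so that numeric IDs match their textual form.
    matched = [item for item in items if str(item.get(key)) == value]
    return {
        "elements": matched,
        "_meta": {
            "applied_filter": f"{key} = {value}",
            "remaining_count": len(matched),
        },
    }
```

The filtered field itself does not need to be repeated in every element, because the _meta entry states once which filter was applied.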

3. Pagination with Limits

The default limit of 20 results prevents context bloat. The LLM can request more if needed, but the conservative default ensures manageable response sizes, particularly since there is little reason for unfiltered views of more than 100 results. The prompt or the system prompt should already provide enough information on how to search the project MCP.
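
A conservative default with an upper bound could look like the sketch below. The default of 20 is taken from the text above; the hard cap of 100 and the function name paginate are assumptions for this example.

```python
DEFAULT_LIMIT = 20   # default stated above
MAX_LIMIT = 100      # assumed hard cap, not confirmed by the implementation


def paginate(items: list, limit=None, offset: int = 0) -> dict:
    """Return one page of items with a clamped limit and count metadata."""
    if limit is None:
        limit = DEFAULT_LIMIT
    # Clamp the limit so an overly large request cannot flood the context.
    limit = max(1, min(int(limit), MAX_LIMIT))
    offset = max(0, int(offset))
    page = items[offset:offset + limit]
    return {
        "total": len(items),
        "count": len(page),
        "offset": offset,
        "elements": page,
    }
```

Clamping instead of rejecting keeps the call successful, while the returned count and total still tell the LLM that more results exist.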

4. Cache Metadata Confirmation

The response includes metadata (_meta) that confirms which filters were applied, giving the LLM explicit acknowledgment without repeating filtered values throughout the output.

{
  "total": 156,
  "count": 12,
  "offset": 0,
  "_embedded": {"elements": [...]},
  "_meta": {
    "applied_filter": "type_id = 5",
    "matched_filter": "type_id = 5",
    "remaining_count": 12
  }
}

5. How much less is it?

Let's do a test: What does the raw API return versus what we are filtering for optimization? The sample query is the list of the first 100 work packages returned (for comparison).

Metric	Value
Raw API response	453,934 characters
`list_work_packages` response	36,601 characters
Size reduction	417,333 characters (91.94%)

The list_work_packages method achieves this reduction by:

Removing verbose fields like _links, description, createdAt, updatedAt, and all derived* fields
Keeping only essential fields: id, subject, lockVersion plus link-based IDs
Using minimal field selection via _minimal_select_work_package()
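
The figures in the table can be reproduced with a few lines. The helper names serialized_size and size_reduction are hypothetical; measuring the character count of the JSON serialization is an assumption about how the numbers were obtained.

```python
import json


def serialized_size(payload) -> int:
    """Character count of the JSON serialization of a payload."""
    return len(json.dumps(payload))


def size_reduction(raw_chars: int, reduced_chars: int) -> tuple:
    """Return (characters saved, percentage saved rounded to two decimals)."""
    saved = raw_chars - reduced_chars
    return saved, round(100 * saved / raw_chars, 2)


# Figures from the table above.
print(size_reduction(453_934, 36_601))  # (417333, 91.94)
```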

LLM Tool Circumvention

So why exactly does an agent not use a tool I provided and instead either try to circumvent it by finding out how to call the API on its own, or misuse the tool settings, for example by setting the output amount too high for no reason?

It's probably the wording. A method may not be explained in exactly the way an LLM understands it, although when prompted explicitly, LLMs will mostly still use it correctly (because they handle human language well enough). But the dynamics, or rather the search space exploration [5] performed by the LLM, will yield different behavior when interacting with LLM tools. How the LLM uses tools is strongly influenced by how it receives and ingests the wording of the MCP tools, including their documentation strings.

What you can do is change the wording somewhat regularly. Additionally, write a sample prompt that requires tool use. Software such as OpenCode or similar can call subagents to run the prompt. The main point there is to ask the subagent to "verbalize its thoughts and explain tool calls", as agents in some implementations do not have access to reasoning data. After some iteration on the main prompt, this can ensure that the agents use the tools as intended. You can also manually run a prompt over and over again, which works just as well, but to humans, different wordings often seem equivalent. When I want to optimize the language for an LLM, I have an agent rephrase it until the subagent performs the tool calls in the expected manner. This takes a while, but is more helpful in the long run. As a human, you should probably not learn to write for LLMs too much, no matter what the "prompt engineers" say.

Conclusion

Zhang et al. [5] provide a good writeup of a measure-theoretic treatment of LLM reasoning. This reasoning is often the aspect of LLMs that baffles people, particularly when put into context with the often stated "LLMs are only word predictors", which is true but not practically helpful. By providing a framework for how the dynamics of the search through reasoning are structured, the reasoning capabilities can potentially be improved. One primary aspect that is still definitely missing is uncertainty propagation and communication.

Overall, I hope this is helpful. I am open to feedback and may update this guide accordingly.

References

[1] Hou, Xinyi et al., "Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions," ACM Trans. Softw. Eng. Methodol., 2026. DOI: 10.1145/3796519
[2] Hellwig, Nils Constantin and Fehle, Jakob and Wolff, Christian, "Exploring large language models for the generation of synthetic training samples for aspect-based sentiment analysis in low resource settings," Expert Systems with Applications, vol. 261, p. 125514, 2025. DOI: 10.1016/j.eswa.2024.125514
[3] Duan, Shitong and Yi, Xiaoyuan and Zhang, Peng and Lu, Tun and Xie, Xing and Gu, Ning, "DENEV: Towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning," in International Conference on Learning Representations (ICLR 2024), 2024.
[4] Han, Jongwook and Choi, Dongmin and Song, Woojung and Lee, Eun-Ju and Jo, Yohan, "Value Portrait: Assessing Language Models' Values through Psychometrically and Ecologically Valid Items," in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 17119–17159, 2025. DOI: 10.18653/v1/2025.acl-long.838
[5] Zhang, Yuyang and Zhang, Yifu and Zhou, Xuehai and Chen, Xiaoyin, "A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits," arXiv preprint, 2026. arXiv:2605.19944v1

LLM MCP AI Tools