Using AI Builder to Auto-Tag and Categorise Knowledge Documents on Upload
Admin Content, Jun 25, 2026
Knowledge management lives and dies by metadata. A SharePoint library packed with thousands of policies, contracts, research papers and meeting notes is only useful if people can actually find what they need — and findability depends almost entirely on how well documents are tagged, categorised and described. The trouble is that asking users to fill in metadata fields when they upload a file is a losing battle. Tags are skipped, categories are guessed, and within a few months the library is a digital landfill.
This is exactly the kind of repetitive, judgement-light task that AI Builder was made for. By training a custom classification model and wiring it into Power Automate, you can have every incoming document read, understood and tagged automatically the moment it lands in SharePoint. Users keep uploading the way they always have, and the metadata simply appears.
Here is how to design, train and deploy that solution end to end.
Why Auto-Tagging Matters More Than People Think
Manual tagging fails for predictable reasons. Users do not know the taxonomy, they are in a hurry, and they often cannot tell the difference between a "Policy" and a "Procedure" without reading the document themselves. Even when training is provided, consistency erodes over time as new staff join and old taxonomies drift.
Automated classification solves three problems at once. It enforces a consistent taxonomy because the model applies the same logic every time. It removes friction from the upload process, which encourages people to actually use the official library rather than emailing files around. And it creates a foundation for downstream automation — retention policies, access reviews, Copilot grounding, and search refiners all depend on reliable metadata.
Choosing the Right AI Builder Model
AI Builder offers several model types, and picking the right one is the difference between a project that ships in a fortnight and one that drags on for months.
For tagging and categorisation, the Category Classification model is usually the right starting point. It reads text and assigns one or more categories from a list you define. If your taxonomy is flat — for example, classifying documents as HR, Finance, Legal, IT or Operations — this model handles it cleanly. It also supports multi-label classification, so a document can be tagged with more than one category if it genuinely spans topics.
If your needs go beyond category labels and into extracting specific values — contract dates, supplier names, invoice numbers — you will want to pair Category Classification with a Document Processing model, which pulls structured fields out of forms and semi-structured documents.
If your library contains scanned or image-based PDFs, Text Recognition (OCR) may also be needed upstream to convert them into text the classifier can actually read. Born-digital reports and policy documents already contain extractable text and can skip this step.
Preparing Your Training Data
The quality of your model is determined almost entirely by the quality of your training data. AI Builder's Category Classification needs a minimum of ten examples per category, but in practice you want at least fifty, and ideally a couple of hundred per category for production-grade accuracy.
Start by exporting a representative sample of historical documents from your existing library. If those documents are already tagged, even inconsistently, you have a head start. Pull the text content into a spreadsheet with two columns: the document text and its correct category. AI Builder trains Category Classification models from Dataverse tables, so plan to import that spreadsheet into Dataverse before training.
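As a minimal sketch of that two-column layout, the Python below writes text and category pairs to CSV using only the standard library. The function name and the column names DocumentText and Category are hypothetical; use whatever names your own table expects.

```python
import csv
import io

def build_training_csv(rows):
    """Write (document_text, category) pairs as a two-column CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Hypothetical column names, chosen for illustration only.
    writer.writerow(["DocumentText", "Category"])
    for text, category in rows:
        # Flatten line breaks and repeated spaces so each example is one row.
        writer.writerow([" ".join(text.split()), category.strip()])
    return buffer.getvalue()

# Illustrative examples, not real training data.
csv_text = build_training_csv([
    ("Annual leave\nmust be approved.", "HR "),
    ('Invoices over "10k" need sign-off.', "Finance"),
])
parsed = list(csv.reader(io.StringIO(csv_text)))
```

Letting the csv module handle quoting means document text containing commas or quotation marks survives the round trip intact.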
Watch out for class imbalance. If you have two thousand HR documents and only forty Legal documents, the model will become biased toward HR. Either gather more Legal examples or downsample the HR set so the categories are roughly balanced.
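Downsampling can be sketched in a few lines of Python. This is one possible approach, assuming the examples are already loaded as (text, category) pairs; the cap of 200 per category is an arbitrary illustration, not a recommendation.

```python
import random
from collections import Counter, defaultdict

def downsample(examples, max_per_category, seed=0):
    """Cap every category at max_per_category randomly chosen examples."""
    by_category = defaultdict(list)
    for text, category in examples:
        by_category[category].append((text, category))

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for rows in by_category.values():
        if len(rows) > max_per_category:
            rows = rng.sample(rows, max_per_category)
        balanced.extend(rows)
    return balanced

# Synthetic data mirroring the imbalance described above.
examples = [(f"hr doc {i}", "HR") for i in range(2000)]
examples += [(f"legal doc {i}", "Legal") for i in range(40)]

balanced = downsample(examples, max_per_category=200)
counts = Counter(category for _, category in balanced)
```

Small categories are left untouched, so downsampling narrows the gap but does not close it. Gathering more Legal examples is still the better fix where it is possible.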
Clean the text before training. Strip out boilerplate headers and footers, remove confidential information that should not influence classification, and make sure encoding issues are not leaving you with garbled characters. The model learns patterns in the words, so junk in means junk out.
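A cleaning pass might look like the following sketch. The two boilerplate patterns are hypothetical placeholders; replace them with the headers and footers that actually recur in your documents.

```python
import re
import unicodedata

# Hypothetical boilerplate patterns, for illustration only.
BOILERPLATE_PATTERNS = [
    re.compile(r"^page \d+ of \d+$", re.IGNORECASE),
    re.compile(r"^confidential\b.*$", re.IGNORECASE),
]

def clean_text(raw: str) -> str:
    """Normalise text and drop blank lines and known boilerplate lines."""
    # Normalise Unicode so equivalent characters share one representation.
    text = unicodedata.normalize("NFKC", raw)
    # Remove the replacement character left behind by failed decoding.
    text = text.replace("\ufffd", "")
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if any(p.match(stripped) for p in BOILERPLATE_PATTERNS):
            continue
        # Collapse runs of whitespace inside the line.
        kept.append(re.sub(r"\s+", " ", stripped))
    return "\n".join(kept)

sample = (
    "CONFIDENTIAL - internal use\n"
    "Expense   policy\ufffd\n"
    "\n"
    "Page 1 of 3\n"
    "Claims are due monthly."
)
cleaned = clean_text(sample)
```

Apply the same cleaning to documents at prediction time, so the model sees text in the same shape it was trained on.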
Training the Model in AI Builder
Inside Power Apps, navigate to AI Builder and create a new Category Classification model. You will be asked where your training data lives, which for this model type is a Dataverse table containing your prepared examples.
Point the model at the column containing the document text and the column containing the category labels. AI Builder will then ask which separator, if any, divides multiple tags within the label column. Pick a separator if your documents legitimately span multiple categories; otherwise choose no separator so that each example carries a single label, which gives cleaner results.
Training takes anywhere from a few minutes to a couple of hours depending on data volume. Once it finishes, AI Builder presents a performance score and a breakdown of how the model performed on each category. Pay close attention to categories with low scores — these usually indicate either insufficient training data or overlapping definitions in your taxonomy. It is common to discover during training that your taxonomy itself is the problem, with two categories that even humans cannot reliably distinguish.
Iterate. Adjust the data, retrain, and review the metrics until you are comfortable. Then publish the model so it can be called from Power Automate.
Building the Upload Flow
With the model trained and published, the next step is a Power Automate flow triggered by file uploads to your SharePoint library.
The trigger is When a file is created (properties only) pointing at the target library. From there, the flow needs to get the file content, extract its text, run it through the classifier, and write the result back as metadata.
For text extraction, Word documents and plain text files can be read directly. PDFs usually need to go through an OCR step or a text extraction action — there are several connectors and custom actions available, including the AI Builder text recognition model for image-based PDFs.
Once you have the document text, call the Predict action for your Category Classification model and pass the text in. The model returns one or more categories along with confidence scores.
Now comes a design decision that matters more than people realise: what do you do with low-confidence predictions? A model that tags a document as "Finance" with 95 percent confidence is fine to trust. A model that is split almost evenly between "Policy" and "Procedure", at roughly 50 percent each, should not silently pick one. A robust flow checks the confidence score and branches accordingly: high confidence writes the tag automatically, while low confidence leaves the field blank, applies a "Needs Review" flag, or sends a Teams message to a steward asking them to confirm.
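The branching logic is easier to reason about outside the flow designer. The Python sketch below shows one way to express it, assuming the predictions arrive as a mapping of category to score between 0 and 1. The threshold, the margin and the names route and Decision are illustrative assumptions, not AI Builder defaults.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed values; tune them against your own review data.
AUTO_TAG_THRESHOLD = 0.80
MIN_MARGIN = 0.15

@dataclass
class Decision:
    action: str               # "auto_tag" or "needs_review"
    category: Optional[str]   # set only when action is "auto_tag"

def route(predictions: Dict[str, float]) -> Decision:
    """Auto-tag only when the top score is high and clearly ahead."""
    if not predictions:
        return Decision("needs_review", None)
    ranked = sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)
    top_label, top_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_score >= AUTO_TAG_THRESHOLD and top_score - runner_up >= MIN_MARGIN:
        return Decision("auto_tag", top_label)
    return Decision("needs_review", None)

confident = route({"Finance": 0.95, "HR": 0.03})
uncertain = route({"Policy": 0.52, "Procedure": 0.48})
```

Checking the margin as well as the top score catches the case where two categories both score highly, which a single threshold would miss in a multi-label setup.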
Finally, use the Update file properties action to write the predicted categories into the SharePoint metadata columns. If your library uses managed metadata or a term store, you will need to map the model's category labels to the correct term GUIDs, which is straightforward but easy to forget.
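The label-to-term mapping amounts to a lookup table with a loud failure for anything unmapped. In this sketch the GUIDs are placeholders rather than real term IDs, and the "Label|GUID" output shape is an assumption about what a managed metadata column expects; verify it against your own library before relying on it.

```python
# Hypothetical mapping; the GUIDs below are placeholders, not real term IDs.
TERM_IDS = {
    "Finance": "00000000-0000-0000-0000-000000000001",
    "HR": "00000000-0000-0000-0000-000000000002",
}

def to_term_value(label: str) -> str:
    """Return an assumed 'Label|GUID' string for a predicted category."""
    key = label.strip()
    if key not in TERM_IDS:
        # Fail loudly so a new or renamed category is noticed, not dropped.
        raise KeyError(f"No term mapped for category {key!r}")
    return f"{key}|{TERM_IDS[key]}"

finance_value = to_term_value(" Finance ")
```

Raising on an unknown label is a deliberate choice here: when the model is retrained with a new category, the flow fails visibly instead of writing an empty or wrong term.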
Handling Edge Cases
Real document libraries are messier than training sets. A few situations come up repeatedly and deserve explicit handling in your flow.
Encrypted or password-protected files cannot be read, so the flow should catch these and route them to a queue for manual handling rather than silently failing. Very large documents may exceed the input size limit of the classification model, so truncate them and classify only the first few thousand characters, which is usually enough to determine the category. Documents in languages your model was not trained on will produce nonsense predictions, so detect the language first and either route those files to a different model or skip auto-tagging entirely.
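For the oversized-document case, a small helper can cut the text at a word boundary before it reaches the classifier. This is a sketch only: the 5,000 character default is an assumed limit, so check the current AI Builder documentation for the real figure.

```python
MAX_CHARS = 5000  # assumed input limit; confirm against current documentation

def truncate_for_classifier(text: str, max_chars: int = MAX_CHARS) -> str:
    """Return text unchanged if it fits, else cut at the last word boundary."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    last_space = cut.rfind(" ")
    # Back up to the last space so the final word is not split in half.
    if last_space > 0:
        cut = cut[:last_space]
    return cut

short = truncate_for_classifier("alpha beta gamma", max_chars=12)
```

Keeping the opening of the document works because titles, headings and introductions usually carry the strongest category signal.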
You should also think about what happens when someone uploads a file that genuinely does not fit any of your categories. Rather than forcing a wrong tag, train your model with an "Other" or "Uncategorised" class so it has somewhere honest to put the outliers.
Monitoring and Improving Over Time
A classification model is not a fire-and-forget asset. The documents people upload will drift over time — new product lines, new regulations, new acronyms — and the model's accuracy will quietly decay if you do not retrain it.
Build a feedback loop from day one. When a knowledge steward corrects a tag manually, capture that correction. Periodically (quarterly is reasonable for most organisations), pull the corrections together with fresh examples and retrain the model. Track both the model's confidence scores and the rate of steward corrections in a Power BI dashboard: confidence shows how sure the model is, not whether it is right, and the two together let you catch degradation before users start complaining.
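One simple signal to feed that dashboard is the share of predictions per category that a steward later changed. The sketch below assumes the corrections are logged as (predicted, final) pairs, which is a hypothetical log shape rather than anything AI Builder produces for you.

```python
from collections import Counter

def correction_rates(log):
    """Return, per predicted category, the fraction a steward corrected.

    log is an iterable of (predicted, final) category pairs, where final
    is the tag that stood after review.
    """
    totals = Counter()
    corrected = Counter()
    for predicted, final in log:
        totals[predicted] += 1
        if predicted != final:
            corrected[predicted] += 1
    return {category: corrected[category] / totals[category] for category in totals}

# Illustrative log entries, not real data.
log = [
    ("Policy", "Policy"),
    ("Policy", "Procedure"),
    ("Finance", "Finance"),
    ("Policy", "Procedure"),
    ("Finance", "Finance"),
]
rates = correction_rates(log)
```

A category whose correction rate climbs between retraining cycles is a good candidate for fresh examples or a taxonomy review.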
It is also worth reviewing the taxonomy itself once a year. Categories that started clean often accumulate ambiguity as the business evolves, and sometimes the right answer is to split a category or merge two that have become indistinguishable.
Bringing It Together
The combination of AI Builder, Power Automate and SharePoint turns metadata from a nagging compliance burden into something that just happens. Users upload files the way they always have, the model reads and classifies them in seconds, and the library becomes genuinely searchable — not because anyone changed their behaviour, but because the friction was removed.
The real payoff arrives later, when retention policies start working correctly because documents are tagged, when search results sharpen because metadata is consistent, and when Copilot can ground its answers in a properly organised corpus rather than a chaotic dumping ground. Auto-tagging is one of those investments that looks modest on the project plan and ends up underpinning half the knowledge work your organisation does.
Start small. Pick one library, one taxonomy, and one well-defined set of document types. Get the model working there, prove the value, and expand from a foundation you trust.