"Count Anything" is a generalist vision model trained specifically to count discrete objects in an image — cells under a microscope, cars in a parking lot, screws in a tray — without per-class finetuning. Early benchmarks show it beating prior counting-specific models by 8–15 percentage points on common datasets while running comfortably on a Raspberry Pi 4 8GB at 2–4 FPS, which is more than fast enough for the edge inspection use cases counting actually serves.
Why counting is harder than it looks
Image classification asks "is this a cat?" Object detection asks "where are the cats?" Counting asks "how many cats?" and that question hides three nested hard problems: the model needs to decide which objects are relevant, find every one of them, and not double-count. On a wide-field shot of a parking lot, a flock of birds, or a pipetted well plate, the failure modes compound, and unlike classification, you can't gracefully degrade. A count that's off by 20% is just wrong.
The classical pipeline (detect-then-tally) breaks down at density. Detection models trained on COCO or similar datasets typically max out at a few dozen objects per image; ask one to count 400 cells in a microscope image and you'll get back 80 boxes and a lot of overlap. Density-estimation models — which predict a heatmap of object centers and integrate over it — solve the density problem but lose the ability to identify what they're counting, so you typically need a separate model per object class.
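To make the density-estimation idea concrete, here is a minimal sketch in plain Python. It assumes each object is rendered as a small kernel whose values sum to 1, so summing the whole map recovers the count even where kernels overlap. The function names and the 3×3 box kernel are illustrative choices, not taken from any particular model.

```python
# Sketch of density-map counting: each object adds a kernel that sums to 1,
# so the integral (sum) of the map equals the object count.
# Kernel shape and function names are illustrative assumptions.

def make_density_map(height, width, centers):
    """Render object centers into a density map using a 3x3 box kernel."""
    density = [[0.0] * width for _ in range(height)]
    for cy, cx in centers:
        # Collect the in-bounds cells of the 3x3 neighborhood.
        cells = [
            (y, x)
            for y in range(cy - 1, cy + 2)
            for x in range(cx - 1, cx + 2)
            if 0 <= y < height and 0 <= x < width
        ]
        # Split one unit of mass across them so every object sums to 1,
        # even at the image border.
        for y, x in cells:
            density[y][x] += 1.0 / len(cells)
    return density


def count_from_density(density):
    """Integrate the density map; the sum is the estimated count."""
    return sum(sum(row) for row in density)


# Three objects, two of them overlapping heavily.
centers = [(2, 2), (2, 3), (7, 7)]
density = make_density_map(10, 10, centers)
print(round(count_from_density(density)))
```

The overlapping pair would likely collapse into one box in a detect-then-tally pipeline, but here their mass simply adds. The map carries no class label, though, which is the limitation the paragraph above describes.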
Count Anything reframes the task. The model takes an image plus a short text description ("count the cars") and outputs a count. Internally it combines vision-language alignment with a learned density-aggregation head, and it inherits enough open-world vision capability from its pretraining backbone to count categories it never saw during counting-specific training.
What "harder than it looks" means in benchmarks
On the FSC-147 counting benchmark, the academic standard, Count Anything reportedly posts an accuracy-style score in the high 80s (FSC-147 results are more commonly reported as MAE and RMSE, where lower is better), beating prior generalist counting models like CounTR and the recent BMNet+ variants by margins that sit well outside benchmark noise. On long-tail categories (counting bees in a hive, weeds in a field, defects on a manufactured part) the model's lead is even larger, because the prior generation needed per-class adaptation and Count Anything doesn't.
The catch is the same catch every "anything" model has: zero-shot performance is solid but not state of the art versus a model that was finetuned for your specific count. If you have a labeled corpus of your exact use case, a specialized model still wins by a few points. The practical question is whether the operational savings of "one model, many tasks" outweigh the accuracy delta of "many models, one task each." For most edge applications, where the operator is a small team without ML staff, the generalist wins easily.
Can it run on edge hardware?
The published checkpoint runs at usable speeds on hardware most makers already own. A Raspberry Pi 4 Model B 8GB pushes about 2–4 FPS at 512×512 input resolution with the model running on the CPU via ONNX Runtime. A Raspberry Pi Zero W handles it as well, though more slowly (around 0.5 FPS) and only on smaller input resolutions.
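Throughput figures like these are worth re-measuring on your own board and enclosure. The sketch below is one way to time any inference callable. `fake_infer` is a hypothetical stand-in that sleeps instead of running the model, and `required_fps` is an illustrative target, so the printed number says nothing about the real checkpoint.

```python
import time


def measure_fps(infer, frames, warmup=2):
    """Return the average frames per second of `infer` over `frames`."""
    if not frames:
        raise ValueError("need at least one frame to time")
    # Warm-up calls are excluded so one-time setup cost does not skew the result.
    for frame in frames[:warmup]:
        infer(frame)
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed


def fake_infer(frame):
    """Hypothetical stand-in for the model call; sleeps to mimic inference."""
    time.sleep(0.01)
    return 0


frames = [None] * 20          # placeholders where 512x512 images would go
required_fps = 2.0            # illustrative target for the deployment
fps = measure_fps(fake_infer, frames)
print(f"{fps:.1f} FPS, meets target: {fps >= required_fps}")
```

Swapping `fake_infer` for the real inference call and `frames` for decoded camera images gives a number you can compare against the 2–4 FPS quoted above.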
For workloads where 2 FPS isn't enough (say, a manufacturing inspection station that has to keep up with several images per second from a conveyor belt), the realistic upgrade path is either an NVIDIA Jetson or a small mini-PC with an AVX-512-capable x86 CPU, with a Samsung 870 EVO SATA SSD for the model cache. The model isn't VRAM-hungry; it's compute-hungry on integer-quantized math, which means a strong CPU with vector instructions does as well as a small GPU.
The published checkpoint is around 800 MB — small enough to fit on the Pi's SD card with room for the OS, an inference runtime, and a small image cache. ONNX export and an INT8 quantization pass drop the size to roughly 250 MB and roughly double the FPS at a small accuracy cost.
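The size drop can be sanity-checked with simple arithmetic. The sketch below assumes, and the source does not state this, that the roughly 800 MB checkpoint stores FP32 weights at 4 bytes each and that quantized weights take 1 byte. It ignores scale and zero-point overhead, and the function names are illustrative.

```python
# Back-of-envelope size estimate for INT8 quantization.
# Assumption (not stated in the source): the ~800 MB checkpoint is FP32,
# i.e. 4 bytes per weight, and quantized weights take 1 byte each.

def quantized_size_mb(fp32_size_mb, quantized_fraction):
    """Estimate file size when a fraction of the weights becomes INT8."""
    if not 0.0 <= quantized_fraction <= 1.0:
        raise ValueError("quantized_fraction must be between 0 and 1")
    int8_part = fp32_size_mb * quantized_fraction / 4    # 4 bytes -> 1 byte
    fp32_part = fp32_size_mb * (1 - quantized_fraction)  # left untouched
    return int8_part + fp32_part


def fraction_for_target(fp32_size_mb, target_mb):
    """Invert the estimate: what fraction must be INT8 to hit target_mb?"""
    return (1 - target_mb / fp32_size_mb) / 0.75


print(quantized_size_mb(800, 1.0))              # every weight quantized
print(round(fraction_for_target(800, 250), 3))  # share implied by ~250 MB
```

Under these assumptions, full quantization would land near 200 MB, so a file of roughly 250 MB is consistent with most, but not all, of the weights being converted.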
Real-world use cases that suddenly got cheaper
The interesting thing about Count Anything isn't the benchmark numbers — it's the workflow they enable. A few categories of work that previously required either ML staff or expensive vertical SaaS are now reachable for a one-person operation with a Pi and a USB camera:
- Beekeepers counting bees on a frame for hive health monitoring.
- Farmers counting livestock or weed density in field images from a drone or a fixed pole-mount camera.
- Small manufacturing shops counting defects on a conveyor belt without buying into a vertical inspection platform.
- Research labs counting cells, colonies, or organisms under a microscope without a custom-trained classifier.
- Retail inventory counts from a shelf-mounted camera, particularly for high-density bins.
- Wildlife researchers counting individuals in camera-trap imagery.
In each case the prior workflow involved either manual counting (slow, inconsistent) or training a custom model (expensive, brittle). Count Anything sits in the middle — slightly less accurate than a custom-trained model on a known class, dramatically more flexible.
How accurate is "good enough" for these tasks?
The answer depends on the use case's tolerance for error. A few rough heuristics:
| Use case | Acceptable mean count error | Count Anything sits at |
|---|---|---|
| Bee count on a frame | ±10% | ±6–9% |
| Cars in a lot snapshot | ±5% | ±4–7% |
| Cells under microscope | ±3% | ±5–11% |
| Conveyor defect tally | ±1% | not yet |
| Inventory shelf count | ±5% | ±4–8% |
For the strictest use cases (conveyor defect detection where a missed defect ships to a customer) Count Anything is not the right answer. For everything else it's at least a workable baseline you can deploy in a day rather than a quarter.
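One way to read the table is to compare each quoted error range against its tolerance. The sketch below copies the figures from the table; the three verdict labels ("fits", "borderline", "misses") are illustrative and not part of the original.

```python
# Percent count error figures copied from the table above.
# Each entry is (acceptable error, (best, worst) quoted error);
# None stands for the table's "not yet".
USE_CASES = {
    "Bee count on a frame": (10, (6, 9)),
    "Cars in a lot snapshot": (5, (4, 7)),
    "Cells under microscope": (3, (5, 11)),
    "Conveyor defect tally": (1, None),
    "Inventory shelf count": (5, (4, 8)),
}


def verdict(acceptable, observed):
    """Classify a use case; the labels are illustrative."""
    if observed is None:
        return "not yet"
    best, worst = observed
    if worst <= acceptable:
        return "fits"          # whole range is inside the tolerance
    if best <= acceptable:
        return "borderline"    # only the good end of the range qualifies
    return "misses"            # even the best case exceeds the tolerance


for name, (acceptable, observed) in USE_CASES.items():
    print(f"{name}: {verdict(acceptable, observed)}")
```

Read this way, only the bee count sits fully inside its tolerance, cars and shelf inventory are borderline, and cell counting misses the ±3% bar in the table, which is a reason to validate on your own images before relying on the numbers.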
Common pitfalls
A handful of gotchas show up consistently when teams move from notebook to production:
- Camera angle drift. The model is robust to most lighting but very sensitive to perspective. A camera that gets bumped by 15 degrees can shift the count by 20% on a dense scene. Tape the camera down.
- Lighting changes across the day. For outdoor use cases, a dawn shot and a noon shot have very different shadow patterns. Either calibrate per-time-of-day or pre-process to a normalized lighting representation.
- The "what to count" string matters. "Count the bees" produces different results from "count the honeybees." Standardize the prompt early and don't iterate on it without re-validating.
- Occlusion at high density. No counting model handles fully occluded objects gracefully. If your use case routinely has objects piled on each other, expect undercounts and plan a fallback (sampling, multiple angles, periodic spread).
- Thermal limits on the Pi. A Pi running ONNX inference at 4 FPS sustained will throttle in a closed enclosure without ventilation. Plan thermals.
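For the thermal point, a simple guard is to read the SoC temperature and slow the capture loop before the board throttles on its own. This is a sketch under assumptions: the sysfs path is the standard Linux thermal zone that Raspberry Pi OS exposes, while the 75 °C limit and the two frame rates are illustrative values to tune for your enclosure.

```python
from pathlib import Path

# Standard Linux sysfs path; on Raspberry Pi OS it reports the SoC
# temperature in millidegrees Celsius.
THERMAL_PATH = Path("/sys/class/thermal/thermal_zone0/temp")


def parse_millidegrees(raw):
    """Convert the raw sysfs string (e.g. '48312\\n') to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_soc_temp_c(path=THERMAL_PATH):
    """Read the current SoC temperature; only works on a Linux board."""
    return parse_millidegrees(path.read_text())


def target_fps(temp_c, normal_fps=4.0, reduced_fps=1.0, limit_c=75.0):
    """Pick a capture rate: back off once the temperature reaches limit_c.

    All three defaults are illustrative assumptions, not measured values.
    """
    return reduced_fps if temp_c >= limit_c else normal_fps


print(target_fps(parse_millidegrees("48312\n")))  # cool board, normal rate
print(target_fps(parse_millidegrees("78900\n")))  # hot board, reduced rate
```

On the device itself, calling `target_fps(read_soc_temp_c())` once per loop iteration is enough to keep a sealed enclosure from cooking itself during long runs.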
When to NOT use Count Anything
A few clear cases where a different approach wins:
- You already have 5,000+ labeled counts for one specific class and want every accuracy point: train a specialist.
- The objects you're counting are extremely small relative to the image (sub-pixel): you need an imaging upgrade, not a model upgrade.
- The downstream decision is high-stakes (medical, safety) and a count error below 1% is required: the generalist isn't there yet.
- You don't have any way to gather even a small validation set: without one you can't verify that the model is working.
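A validation set does not need to be large to be useful. The sketch below scores predicted counts against hand counts with two simple metrics; the sample numbers are made up for illustration and the function names are hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and true counts."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def mean_percent_error(predicted, actual):
    """Average absolute error as a percentage of each true count."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    if any(a <= 0 for a in actual):
        raise ValueError("true counts must be positive for a percent error")
    return 100 * sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)


# Illustrative numbers only: model output vs. hand counts on four images.
predicted = [98, 205, 47, 310]
actual = [100, 200, 50, 300]

print(f"MAE: {mean_absolute_error(predicted, actual):.1f} objects")
print(f"Mean percent error: {mean_percent_error(predicted, actual):.1f}%")
```

Re-running the same script after any change to the prompt string, camera position, or lighting gives a quick check that the deployment still meets its tolerance.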
What this means for the edge AI hardware story
The broader trend Count Anything fits into is "general-purpose vision models that run on cheap hardware." A few years ago, counting objects on a Pi required a custom-trained model, a careful data pipeline, and an MLOps story. Now it requires the Pi, the model, and an afternoon. That collapses the cost of one entire category of "useful but small" applications — the ones that don't justify a startup but do justify an evening of tinkering for someone with a real problem.
For builders putting together a maker rig in 2026, the practical implication is that the Raspberry Pi 4 Model B 8GB is still the right default board for vision-edge projects. A Raspberry Pi Zero W handles the lightest workloads. Pair either with a Corsair RM650 PSU for a desk-side prototype rig if you're scaling beyond a single Pi, or run them on the small 5V supply they came with for production.
What a working Count Anything edge deployment looks like
For makers who want to build with this today, the hardware loadout is small enough to fit on a desk:
- Compute. A Raspberry Pi 4 Model B 8GB handles the core inference at 2–4 FPS on the published checkpoint, and roughly double that (about 4–8 FPS) after ONNX INT8 quantization. For lighter workloads or lower power budgets, a Raspberry Pi Zero W manages about 0.5 FPS at 256×256 input.
- Camera. Any USB webcam or the Pi Camera Module v3. Resolution matters less than steady mounting and consistent lighting.
- Power. The Pi 4 needs a real 5V/3A supply; cheap phone chargers will under-volt and cause unexplained crashes during inference. A bench supply or a Corsair RM650 PSU feeding a 5V buck converter is the robust answer for permanent installs.
- Storage. The microSD card the Pi boots from is fine for the model and a small rolling image cache. For longer retention, a Samsung 870 EVO SATA SSD over a USB-to-SATA adapter holds weeks of telemetry without thrashing the SD card.
- Software. Raspberry Pi OS Lite, Python 3.11, ONNX Runtime, and a tiny FastAPI server exposing the count over HTTP. Total install footprint under 4 GB.
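The "expose the count over HTTP" step can be sketched without any third-party packages. The software bullet names FastAPI; the example below deliberately swaps in Python's standard-library `http.server` so it runs on a bare install. The `/count` route, the JSON shape, and the names `count_payload`, `make_handler`, and `make_server` are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def count_payload(count, prompt):
    """Serialize the latest count as a JSON body (shape is an assumption)."""
    return json.dumps({"prompt": prompt, "count": count}).encode("utf-8")


def make_handler(get_count, prompt):
    """Build a request handler that reports whatever get_count() returns."""

    class CountHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/count":
                self.send_error(404)
                return
            body = count_payload(get_count(), prompt)
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the console quiet on a headless Pi

    return CountHandler


def make_server(get_count, prompt, host="127.0.0.1", port=8000):
    """Create (but do not start) the HTTP server."""
    return ThreadingHTTPServer((host, port), make_handler(get_count, prompt))


print(count_payload(12, "count the bees").decode("utf-8"))
```

To serve requests, call `make_server(lambda: latest_count, "count the bees").serve_forever()`, where `latest_count` is whatever variable the inference loop updates. A FastAPI version would follow the same shape with a single route function.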
The whole rig — Pi, camera, power, mount, enclosure — runs under $200 of parts. Compared to the typical alternative ("custom-trained model, MLOps stack, dedicated edge box") that's a transformative cost collapse for the kinds of small-scale counting jobs makers actually have.
Notes from early field deployments
A few patterns observed from the first wave of community Count Anything deployments:
- Apiary monitoring. Bee counts on a frame correlate with hive health; an inexpensive Pi-based counter sampling once per hour has caught colony collapses 3–7 days earlier than visual inspection in published case studies.
- Inventory at small retail. A shelf-mounted Pi running periodic counts of high-density bins (nuts, fasteners, small parts) has replaced manual stock checks in several maker-friendly hardware stores.
- Lab workflow speed-up. Cell counting under inexpensive USB microscopes — a workflow that used to take a research assistant 20 minutes per slide — now takes 30 seconds with comparable accuracy on the dominant counts.
- Wildlife camera-trap aggregation. Researchers running camera traps with Count Anything as a post-processing step have replaced thousands of human-counting hours per season with a few hours of model inference.
What's notable about each of these is that none of them justified a full ML investment under the prior paradigm. They were always too small. Now they're each an evening of work plus the cost of a Pi.
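For monitoring deployments like the apiary case, the count only becomes useful once something watches the trend. The sketch below flags a sharp drop against a rolling median baseline. The seven-day window and 30% threshold are illustrative assumptions, not values from the case studies mentioned above.

```python
from statistics import median


def drop_alert(daily_counts, baseline_days=7, drop_fraction=0.3):
    """Return True if the latest count fell more than drop_fraction below
    the median of the preceding baseline_days counts.

    Window and threshold are illustrative defaults; tune them per site.
    """
    if len(daily_counts) < baseline_days + 1:
        return False  # not enough history to form a baseline yet
    baseline = median(daily_counts[-(baseline_days + 1):-1])
    latest = daily_counts[-1]
    return baseline > 0 and latest < baseline * (1 - drop_fraction)


# Made-up daily bee counts: a stable week followed by a sharp drop.
history = [1000, 1010, 990, 1005, 995, 1000, 1002, 650]
print(drop_alert(history))
```

Using a median rather than a mean keeps one bad frame (a bumped camera, a shadow) from moving the baseline.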
Bottom line
Object counting was one of the few "obvious" computer vision tasks that stubbornly required custom models and labeled data into 2025. Count Anything is the model that finally generalizes it. Accuracy is competitive with specialized models on most categories, edge performance is workable on hardware most makers already own, and the time from "I have a problem" to "I have a working count" collapses from weeks to hours. It's not perfect for high-stakes work — but for the long tail of "I just need to know how many" problems, it's the answer for now. See Microsoft Research for the published technical write-up and Hugging Face for community checkpoints and ONNX-export recipes; the Raspberry Pi documentation covers the hardware setup.
