Census is foundational counting infrastructure for teams that refuse to guess. We quantify lines, words, bytes, and characters with research-grade precision — because the first step to understanding any dataset is knowing exactly how much of it exists.
Research Position Paper
"Before you can model, predict, classify, or optimise, you must count. Counting is the bedrock of all quantitative reasoning — the irreducible operation from which every analytical insight ultimately derives. And yet, for decades, the act of counting has been an afterthought: an unnamed utility invoked in passing, its output piped elsewhere, its precision taken for granted. We founded Census because we believe counting deserves its own institution. Not a feature inside something else. Not a flag on someone else's tool. A dedicated, research-grade platform built from first principles, with the singular mission of answering the most fundamental question in data science: how much?"
— The Census Research Team
Capabilities
Each flag represents years of research into a distinct quantification domain. Together, they form a complete analytical framework for textual data.
Quantify structural density by enumerating newline-delimited records. The fundamental unit of log analysis, configuration auditing, and dataset cardinality estimation. Every line is a data point. Every count is a signal.
A non-empty sequence of characters delimited by whitespace boundaries — that's the atomic unit Census calls a "word." Our word-counting pipeline captures semantic density with zero ambiguity, giving you a precise measure of informational throughput.
Exact storage footprint measurement at the byte level. No estimation, no rounding, no approximation. When you need to know exactly how many bytes a file holds, Census delivers the ground truth, with every single byte accounted for.
In a multibyte world, bytes and characters diverge. Census honours your locale settings, counting actual characters rather than raw bytes. Essential for internationalised datasets, UTF-8 pipelines, and any system where a character is more than one byte.
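The divergence between bytes and characters can be illustrated with a minimal Python sketch. It assumes a UTF-8 locale and treats a "character" as one Unicode code point; the function name byte_and_char_counts is hypothetical and is not part of Census.

```python
def byte_and_char_counts(data: bytes, encoding: str = "utf-8") -> tuple[int, int]:
    """Return (byte_count, char_count) for raw input.

    Assumption: the locale encoding is UTF-8 and a character is one
    Unicode code point.
    """
    text = data.decode(encoding)
    return len(data), len(text)


sample = "日本語 text".encode("utf-8")
print(byte_and_char_counts(sample))  # (14, 8)
```

The three Japanese characters occupy three bytes each in UTF-8, which is why the byte count runs ahead of the character count.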
Determine the display width of the longest line in any input. Critical for terminal rendering, column alignment verification, and format compliance auditing. When your data has a shape, Census measures its widest point.
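Display width is not the same as character count, since some characters occupy two terminal columns and others none. The sketch below is an approximation under stated assumptions: wide and fullwidth East Asian characters count as two columns, combining marks as zero, everything else as one, and tab expansion is ignored. The helper names are hypothetical and this is not necessarily how Census computes width.

```python
import unicodedata


def char_width(ch: str) -> int:
    """Approximate terminal columns occupied by a single character."""
    if unicodedata.combining(ch):
        return 0  # combining marks attach to the previous character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters take two columns
    return 1


def max_display_width(text: str) -> int:
    """Return the width of the widest newline-delimited line in text."""
    return max(sum(char_width(ch) for ch in line) for line in text.split("\n"))


print(max_display_width("abc\n日本語\n"))  # 6
```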
Process file lists from NUL-delimited input streams. Built for pipelines where filenames contain spaces, newlines, or special characters. Pair with find -print0 for industrial-grade batch counting across entire directory trees.
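To show why NUL is a safe delimiter, here is a small Python sketch that splits a NUL-delimited name list of the kind find -print0 emits. The function parse_nul_delimited is hypothetical, and silently dropping empty entries is an assumption of this sketch rather than documented Census behaviour.

```python
import os


def parse_nul_delimited(stream: bytes) -> list[str]:
    """Split a NUL-delimited list of file names.

    NUL cannot appear inside a file name on POSIX systems, so names
    containing spaces or newlines survive intact.
    """
    # find -print0 terminates every name with NUL, so the final split
    # element is empty; this sketch drops empty entries (an assumption).
    return [os.fsdecode(name) for name in stream.split(b"\0") if name]


manifest = b"a b.log\0line\nbreak.txt\0"
print(parse_nul_delimited(manifest))  # ['a b.log', 'line\nbreak.txt']
```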
Methodology
Census follows a rigorous four-phase measurement protocol, designed to eliminate counting errors at every stage of the pipeline.
Census reads your input stream (file, stdin, or batch manifest) in a single linear pass. No seeking, no buffering beyond what's necessary. Running time is O(n) in the number of input bytes, guaranteed.
Each byte is classified against your selected measurement dimensions: newline boundaries for lines, whitespace transitions for words, encoding rules for characters, raw offsets for bytes.
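The classification step can be sketched as a small state machine that visits each byte once. This is an illustrative sketch, not Census's actual implementation: the names Counts and count_stream are hypothetical, whitespace is assumed to be the six ASCII whitespace bytes, and locale-aware character counting is left out.

```python
from dataclasses import dataclass

# Assumption: ASCII whitespace only (space, tab, newline, CR, VT, FF).
WHITESPACE = b" \t\n\r\x0b\x0c"


@dataclass
class Counts:
    lines: int = 0
    words: int = 0
    nbytes: int = 0


def count_stream(chunks) -> Counts:
    """Count lines, words, and bytes in one pass over an iterable of bytes."""
    counts = Counts()
    in_word = False  # carried across chunks so split words count once
    for chunk in chunks:
        counts.nbytes += len(chunk)
        for b in chunk:
            if b in WHITESPACE:
                in_word = False
                if b == 0x0A:  # a newline byte closes one line
                    counts.lines += 1
            elif not in_word:
                in_word = True  # whitespace-to-word transition
                counts.words += 1
    return counts


print(count_stream([b"hello wor", b"ld\nfoo"]))
# Counts(lines=1, words=3, nbytes=15)
```

Because lines are counted by newline bytes, a final line without a trailing newline adds words and bytes but not a line in this sketch.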
Per-file counts are computed and, when multiple files are specified, a total row is appended. The output follows a strict column order: lines, words, characters, bytes, max line length.
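The aggregation phase can be sketched as follows. The column order comes from the description above; the function build_rows and the dictionary keys are hypothetical, and treating the total for max line length as a maximum rather than a sum is an assumption of this sketch.

```python
COLUMN_ORDER = ("lines", "words", "chars", "bytes", "max_line_length")


def build_rows(per_file: dict[str, dict[str, int]]) -> list[tuple[str, list[int]]]:
    """Return (label, values) rows in column order, plus a total if needed."""
    rows = [(name, [c[key] for key in COLUMN_ORDER]) for name, c in per_file.items()]
    if len(rows) > 1:  # a total row is only appended for multiple files
        total = []
        for i, key in enumerate(COLUMN_ORDER):
            column = [values[i] for _, values in rows]
            # Assumption: the widest line overall, not a sum of widths.
            total.append(max(column) if key == "max_line_length" else sum(column))
        rows.append(("total", total))
    return rows


counts = {
    "a.txt": {"lines": 2, "words": 5, "chars": 20, "bytes": 22, "max_line_length": 12},
    "b.txt": {"lines": 1, "words": 3, "chars": 10, "bytes": 10, "max_line_length": 9},
}
for label, values in build_rows(counts):
    print(*values, label)
```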
Results are emitted to stdout in a column-aligned, machine-parseable format. Pipe it, store it, visualise it. Census outputs truth — what you do with it is your research.
Case Study
Lattice Analytics processes 2.3 million log files per day across their observability pipeline. Before Census, their data validation layer relied on file size heuristics and sampling-based line estimates — leading to silent data loss that went undetected for weeks.
After deploying Census with --files0-from and -l, Lattice built a real-time cardinality verification system that counts every line of every file as it enters the pipeline. Discrepancies trigger alerts within seconds.
"We thought we had observability. Then we started actually counting. The gap between what we assumed and what Census measured was terrifying. Now we don't assume anything — we count."
— Priya Chakraborty, Head of Data Integrity, Lattice Analytics
Live Demo
Real commands. Real output. Every number is exact.
Lines counted across all deployments
Counting accuracy (by definition)
Measurement primitives
Miscounts in production
Pricing
Every tier unlocks new dimensions of quantification.
Observer
$0 / mo
For individuals exploring the fundamentals of quantification.
Analyst
$19 / mo
For teams that need to count more than one thing at a time.
Researcher
$49 / seat / mo
For data teams that demand the complete quantification stack.
Institute
Custom
For organisations where counting is a core competency.
Testimonials
"We replaced our entire homegrown log validation layer with a single Census pipeline using -l and --files0-from. Our auditors were sceptical until they saw the numbers. Census doesn't estimate. It counts. That distinction matters when you're dealing with regulatory compliance."
"The -m flag changed our entire localisation QA process. We were counting bytes and wondering why our Japanese translations looked 'bigger.' Census showed us the difference between bytes and characters, and our bug backlog dropped by a third overnight."
"I use Census -L in every CI pipeline to enforce line width limits. If any file exceeds 120 columns, the build fails. It's the simplest, most reliable format gate I've ever deployed. Fifteen seconds of setup. Zero false positives."
"Our ML team counts everything. Tokens, lines, bytes — before and after every preprocessing step. Census is the source of truth. When the model output doesn't match what Census says went in, we know exactly where the pipeline diverged."
FAQ
The -c flag counts raw bytes (storage footprint), while -m counts actual characters according to your locale. Use -c when you care about disk space; use -m when you care about content length.
The --total flag gives you fine-grained control: auto (default, total only with multiple files), always (even for single files), only (suppress per-file rows), or never.
When the file argument is -, Census reads from standard input. This makes it composable with any Unix pipeline: pipe the output of any command into Census and get an instant quantitative snapshot of the data flowing through your system.
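The four --total modes described in this FAQ can be read as a small decision table. The Python sketch below encodes that reading; the function output_plan is hypothetical and only illustrates the documented modes, not Census's internals.

```python
def output_plan(mode: str, file_count: int) -> tuple[bool, bool]:
    """Return (show_per_file_rows, show_total_row) for a --total mode."""
    if mode == "auto":
        return True, file_count > 1  # total only with multiple files
    if mode == "always":
        return True, True  # total even for a single file
    if mode == "only":
        return False, True  # suppress per-file rows
    if mode == "never":
        return True, False
    raise ValueError(f"unknown --total mode: {mode!r}")


print(output_plan("auto", 1))  # (True, False)
print(output_plan("only", 3))  # (False, True)
```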