A mandō research lab

The science of
counting things.

Census is foundational counting infrastructure for teams that refuse to guess. We quantify lines, words, bytes, and characters with research-grade precision — because the first step to understanding any dataset is knowing exactly how much of it exists.

See the numbers
🔬 Peer-reviewed counting methodology
🔒 SOC 2 Type II compliant
O(n) guaranteed complexity
🧮 Zero counting drift
"Before you can model, predict, classify, or optimise, you must count. Counting is the bedrock of all quantitative reasoning — the irreducible operation from which every analytical insight ultimately derives. And yet, for decades, the act of counting has been an afterthought: an unnamed utility invoked in passing, its output piped elsewhere, its precision taken for granted. We founded Census because we believe counting deserves its own institution. Not a feature inside something else. Not a flag on someone else's tool. A dedicated, research-grade platform built from first principles, with the singular mission of answering the most fundamental question in data science: how much?"

— The Census Research Team

Six measurement primitives.
One unified counting model.

Each flag represents years of research into a distinct quantification domain. Together, they form a complete analytical framework for textual data.

📊

Line-Level Analytics™ -l

Quantify structural density by enumerating newline-delimited records. The fundamental unit of log analysis, configuration auditing, and dataset cardinality estimation. Every line is a data point. Every count is a signal.
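As a rough illustration of what enumerating newline-delimited records means, the Python sketch below counts newline bytes in a byte string. The helper name `count_lines` is hypothetical and not part of Census. It assumes, in line with the FAQ's reference to counting "newlines", that a final line with no trailing newline is not counted.

```python
def count_lines(data: bytes) -> int:
    """Count newline bytes (0x0A) in data. Hypothetical helper, not Census code."""
    return data.count(b"\n")


print(count_lines(b"alpha\nbeta\ngamma\n"))  # 3
print(count_lines(b"alpha\nbeta\ngamma"))    # 2: the last line has no trailing newline
```

Counting terminators rather than "visible lines" keeps the result additive: under this assumption, the counts of two inputs sum to the count of their concatenation.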

📝

Lexical Density Engine™ -w

Census defines a "word" as a non-empty sequence of non-whitespace characters delimited by whitespace boundaries. Our word-counting pipeline captures structural density with zero ambiguity, giving you a precise measure of informational throughput.
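The word definition above can be sketched in a couple of lines of Python. This is an illustration only: `count_words` is a hypothetical name, and Python's `str.split()` uses its own Unicode-aware notion of whitespace, which may not match the set Census uses.

```python
def count_words(text: str) -> int:
    """Count maximal runs of non-whitespace characters. Hypothetical helper."""
    # str.split() with no arguments splits on runs of whitespace and
    # discards empty strings, so leading/trailing whitespace adds no words.
    return len(text.split())


print(count_words("  the quick\tbrown\n\nfox "))  # 4
print(count_words("well-known, don't"))           # 2: punctuation does not split words
```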

💾

Byte-Precision Metering™ -c

Exact file-size measurement at the byte level. No estimation, no rounding, no approximation. When you need to know exactly how many bytes a file contains, Census delivers the ground truth: every single byte accounted for.

🌐

Character-Aware Counting™ -m

In a multibyte world, bytes and characters diverge. Census honours your locale settings, counting actual characters rather than raw bytes. Essential for internationalised datasets, UTF-8 pipelines, and any system where a character is more than one byte.
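To make the byte/character divergence concrete, here is a minimal Python sketch. It assumes valid UTF-8 input and takes the encoding as an explicit parameter instead of reading locale settings; `byte_and_char_counts` is a hypothetical name, not a Census API. Note that `len()` on a Python string counts Unicode code points, which is only one possible definition of "character".

```python
def byte_and_char_counts(data: bytes, encoding: str = "utf-8") -> tuple[int, int]:
    """Return (byte count, code point count). Hypothetical helper."""
    text = data.decode(encoding)  # raises UnicodeDecodeError on invalid input
    return len(data), len(text)


sample = "日本語 text".encode("utf-8")
print(byte_and_char_counts(sample))  # (14, 8): each kanji takes 3 bytes in UTF-8
```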

📏

Maximum Line Width Analysis™ -L

Determine the display width of the longest line in any input. Critical for terminal rendering, column alignment verification, and format compliance auditing. When your data has a shape, Census measures its widest point.
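Display width is not the same as character count, and the sketch below shows one simplified way to estimate it in Python. Everything here is an assumption for illustration: tabs advance to the next multiple of 8 columns, East Asian wide and fullwidth characters occupy 2 columns, and every other character occupies 1. Combining marks and control characters are not handled, and `max_line_width` is a hypothetical name.

```python
import unicodedata


def max_line_width(text: str, tab_stop: int = 8) -> int:
    """Estimate the display width of the widest line. Hypothetical helper."""
    widest = 0
    for line in text.split("\n"):
        col = 0
        for ch in line:
            if ch == "\t":
                # Advance to the next tab stop.
                col += tab_stop - (col % tab_stop)
            elif unicodedata.east_asian_width(ch) in ("W", "F"):
                col += 2  # wide or fullwidth character
            else:
                col += 1
        widest = max(widest, col)
    return widest


print(max_line_width("short\na much longer line\n"))  # 18
print(max_line_width("a\tb"))                         # 9
```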

📂

Null-Terminated Batch Ingestion™ --files0-from

Process file lists from NUL-delimited input streams. Built for pipelines where filenames contain spaces, newlines, or special characters. Pair with find -print0 for industrial-grade batch counting across entire directory trees.
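The reason NUL delimiting is robust is that NUL is the one byte that cannot appear inside a Unix filename. A minimal Python sketch of parsing such a manifest follows; `parse_nul_delimited` is a hypothetical name, and skipping empty entries is a choice made for this sketch, not documented Census behaviour.

```python
def parse_nul_delimited(manifest: bytes) -> list[bytes]:
    """Split a NUL-delimited manifest into filenames. Hypothetical helper."""
    # Names stay as bytes: filenames are not guaranteed to be valid UTF-8.
    # The trailing NUL produces an empty final piece, which is dropped here.
    return [name for name in manifest.split(b"\0") if name]


manifest = b"a file.txt\0line\nbreak.log\0plain.csv\0"
print(parse_nul_delimited(manifest))
# [b'a file.txt', b'line\nbreak.log', b'plain.csv']
```

A newline-delimited list would have split `line\nbreak.log` into two bogus names; the NUL-delimited form keeps it intact.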

A disciplined approach to quantification.

Census follows a rigorous four-phase measurement protocol, designed to eliminate counting errors at every stage of the pipeline.

1

Ingest

Census reads your input stream — file, stdin, or batch manifest — in a single linear pass. No seeking, no buffering beyond what's necessary. O(n) complexity, guaranteed.

2

Classify

Each byte is classified against your selected measurement dimensions: newline boundaries for lines, whitespace transitions for words, encoding rules for characters, raw offsets for bytes.

3

Aggregate

Per-file counts are computed and, when multiple files are specified, a total row is appended. The output follows a strict column order: lines, words, characters, bytes, max line length.

4

Report

Results are emitted to stdout in a column-aligned, machine-parseable format. Pipe it, store it, visualise it. Census outputs truth — what you do with it is your research.
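The four phases above can be sketched end to end in Python. This is an illustration under stated assumptions, not Census's implementation: only ASCII whitespace separates words, only lines, words, and bytes are counted, and the column padding is a guess at what "column-aligned" means. The names `count_stream` and `report` are hypothetical.

```python
import io
from typing import BinaryIO

WHITESPACE = b" \t\n\r\x0b\x0c"  # ASCII whitespace only; an assumption of this sketch


def count_stream(stream: BinaryIO, chunk_size: int = 65536) -> tuple[int, int, int]:
    """Ingest and classify: return (lines, words, bytes) in one forward pass."""
    lines = words = nbytes = 0
    in_word = False  # carried across chunks so split words count once
    while chunk := stream.read(chunk_size):
        nbytes += len(chunk)
        lines += chunk.count(b"\n")
        for byte in chunk:
            if byte in WHITESPACE:
                in_word = False
            elif not in_word:
                in_word = True  # whitespace-to-word transition
                words += 1
    return lines, words, nbytes


def report(named_streams: list[tuple[str, BinaryIO]]) -> str:
    """Aggregate and report: per-file rows plus a total row for 2+ inputs."""
    rows = [(*count_stream(stream), name) for name, stream in named_streams]
    if len(rows) > 1:
        totals = [sum(row[i] for row in rows) for i in range(3)]
        rows.append((*totals, "total"))
    width = max((len(str(n)) for row in rows for n in row[:3]), default=1)
    return "\n".join(
        " ".join(str(n).rjust(width) for n in row[:3]) + " " + row[3]
        for row in rows
    )


print(report([
    ("a.txt", io.BytesIO(b"one two\nthree\n")),
    ("b.txt", io.BytesIO(b"four\n")),
]))
```

Each input is read once in fixed-size chunks, so memory use stays constant and the work grows linearly with input size.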

How Lattice Analytics reduced
data auditing time by 94%.

Lattice Analytics processes 2.3 million log files per day across their observability pipeline. Before Census, their data validation layer relied on file size heuristics and sampling-based line estimates — leading to silent data loss that went undetected for weeks.

After deploying Census with --files0-from and -l, Lattice built a real-time cardinality verification system that counts every line of every file as it enters the pipeline. Discrepancies trigger alerts within seconds.

"We thought we had observability. Then we started actually counting. The gap between what we assumed and what Census measured was terrifying. Now we don't assume anything — we count."

— Priya Chakraborty, Head of Data Integrity, Lattice Analytics

2.3M Files counted per day
94% Reduction in audit time
0 Undetected data loss events since deployment

Observe the count.

Real commands. Real output. Every number is exact.

Default Count (lines, words, bytes)
$ census README.md
42 318 2047 README.md

Line-Level Analytics
$ census -l server.log
16384 server.log

Multi-File with Total
$ census src/*.js
 87 241 1893 src/index.js
134 402 3210 src/utils.js
 56 178 1344 src/config.js
277 821 6447 total

Character-Aware + Max Line Width
$ census -m -L translations.csv
2048 120 translations.csv

Rigorous counting, transparently priced.

Every tier unlocks new dimensions of quantification.

Observer

$0 / mo

For individuals exploring the fundamentals of quantification.

  • Line counting (-l)
  • Single-file analysis
  • Stdout output
  • Word counting (-w)
  • Byte metering (-c)
  • Multi-file totals

Researcher

$49 / seat / mo

For data teams that demand the complete quantification stack.

  • Everything in Observer
  • Character-Aware Counting™ (-m)
  • Maximum Line Width Analysis™ (-L)
  • Null-Terminated Batch Ingestion™
  • --total control (auto/always/only/never)
  • Unlimited files
  • Priority support

Institute

Custom

For organisations where counting is a core competency.

  • Everything in Researcher
  • Stdin pipeline integration
  • Custom output formatting
  • Dedicated counting cluster
  • Audit-grade logging
  • Dedicated CSM
  • 99.999% accuracy SLA

What researchers are saying.

"We replaced our entire homegrown log validation layer with a single Census pipeline using -l and --files0-from. Our auditors were sceptical until they saw the numbers. Census doesn't estimate. It counts. That distinction matters when you're dealing with regulatory compliance."

Martin Kessler
VP of Data Governance, Stratum Financial

"The -m flag changed our entire localisation QA process. We were counting bytes and wondering why our Japanese translations looked 'bigger.' Census showed us the difference between bytes and characters, and our bug backlog dropped by a third overnight."

Mika Lin
Localisation Lead, Meridian Software

"I use Census -L in every CI pipeline to enforce line width limits. If any file exceeds 120 columns, the build fails. It's the simplest, most reliable format gate I've ever deployed. Fifteen seconds of setup. Zero false positives."

Mirela Tamayo
Principal Engineer, Vectral Systems

"Our ML team counts everything. Tokens, lines, bytes — before and after every preprocessing step. Census is the source of truth. When the model output doesn't match what Census says went in, we know exactly where the pipeline diverged."

Morgan Lee
ML Infrastructure Lead, Canopy AI

Frequently asked questions.

What is the difference between -c and -m?

In single-byte encodings like ASCII, bytes and characters are identical. But in multibyte encodings like UTF-8, a single character can span 1–4 bytes. The -c flag counts raw bytes (file size), while -m counts actual characters according to your locale. Use -c when you care about storage size; use -m when you care about content length.

What does Census count by default?

With no flags, Census outputs three counts in a fixed order: newlines, words, and bytes. This is the canonical triad, the default measurement that has served as the foundation of text analysis for decades. Think of it as the complete vitals panel for any file.

How does Census define a word?

A word is a non-empty sequence of non-whitespace characters, delimited by whitespace or the start/end of input. This is a precise, locale-independent definition that avoids the ambiguity of natural language tokenisation. Census counts structural words, not semantic ones, and that's by design.

How does the total row work?

When Census processes more than one file, it automatically appends a row labelled "total" containing the sum of each column across all files. The --total flag gives you fine-grained control: auto (default, only with multiple files), always (even for single files), only (suppress per-file rows), or never.

Can Census read from standard input?

Yes. When no file is specified, or when the filename is -, Census reads from standard input. This makes it composable with any Unix pipeline: pipe the output of any command into Census and get an instant quantitative snapshot of the data flowing through your system.

See mandō's portfolio

The accelerator behind the tools that run the internet.