Census — Foundational Counting Infrastructure for Serious Data Work

Research Position Paper

"Before you can model, predict, classify, or optimise, you must count. Counting is the bedrock of all quantitative reasoning — the irreducible operation from which every analytical insight ultimately derives. And yet, for decades, the act of counting has been an afterthought: an unnamed utility invoked in passing, its output piped elsewhere, its precision taken for granted. We founded Census because we believe counting deserves its own institution. Not a feature inside something else. Not a flag on someone else's tool. A dedicated, research-grade platform built from first principles, with the singular mission of answering the most fundamental question in data science: how much?"

— The Census Research Team

Capabilities

Six measurement primitives.
One unified counting model.

Each flag represents years of research into a distinct quantification domain. Together, they form a complete analytical framework for textual data.

📊

Line-Level Analytics™ -l

Quantify structural density by enumerating newline-delimited records. The fundamental unit of log analysis, configuration auditing, and dataset cardinality estimation. Every line is a data point. Every count is a signal.

📝

Lexical Density Engine™ -w

A non-empty sequence of characters delimited by whitespace boundaries — that's the atomic unit Census calls a "word." Our word-counting pipeline captures semantic density with zero ambiguity, giving you a precise measure of informational throughput.

💾

Byte-Precision Metering™ -c

Exact file size measurement at the byte level. No estimation, no rounding, no approximation. When you need to know exactly how many bytes a file contains, Census delivers the ground truth: every single byte accounted for.

🌐

Character-Aware Counting™ -m

In a multibyte world, bytes and characters diverge. Census honours your locale settings, counting actual characters rather than raw bytes. Essential for internationalised datasets, UTF-8 pipelines, and any system where a character is more than one byte.
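To make the byte/character split concrete, here is a minimal Python sketch. It is not Census itself: the helper name `byte_and_char_counts` is hypothetical, and it assumes UTF-8 input and treats one Unicode code point as one character.

```python
def byte_and_char_counts(text: str, encoding: str = "utf-8") -> tuple[int, int]:
    """Return (byte_count, character_count) for text.

    Assumption: a "character" is one Unicode code point, and the
    encoding defaults to UTF-8.
    """
    encoded = text.encode(encoding)
    return len(encoded), len(text)


# "na\u00efve" uses the precomposed i-with-diaeresis (2 bytes in UTF-8).
for sample in ["count", "na\u00efve", "日本語"]:
    n_bytes, n_chars = byte_and_char_counts(sample)
    print(f"{sample!r}: {n_bytes} bytes, {n_chars} characters")
```

For pure ASCII the two counts agree; for the Japanese sample the byte count is three times the character count, which is the gap that -c and -m are meant to expose.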

📏

Maximum Line Width Analysis™ -L

Determine the display width of the longest line in any input. Critical for terminal rendering, column alignment verification, and format compliance auditing. When your data has a shape, Census measures its widest point.
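Display width is not the same as character count: tabs advance to a tab stop and East Asian wide characters occupy two columns. The following Python sketch approximates that idea. The function names, the tab stop of 8, and the width rules are assumptions for illustration, not a description of how Census computes -L.

```python
import unicodedata

TAB_STOP = 8  # assumed tab stop; not taken from the Census documentation


def char_width(ch: str) -> int:
    """Approximate the number of terminal columns one character occupies."""
    if unicodedata.combining(ch) or not ch.isprintable():
        return 0  # combining marks and control characters take no column
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters take two columns
    return 1


def max_line_width(text: str) -> int:
    """Return the display width of the widest newline-delimited line."""
    widest = 0
    for line in text.split("\n"):
        col = 0
        for ch in line:
            if ch == "\t":
                col += TAB_STOP - (col % TAB_STOP)  # jump to the next tab stop
            else:
                col += char_width(ch)
        widest = max(widest, col)
    return widest


print(max_line_width("short\na much longer line\n"))
```

A width gate such as "fail if any line exceeds 120 columns" then reduces to comparing this single number against the limit.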

📂

Null-Terminated Batch Ingestion™ --files0-from

Process file lists from NUL-delimited input streams. Built for pipelines where filenames contain spaces, newlines, or special characters. Pair with find -print0 for industrial-grade batch counting across entire directory trees.
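NUL is the one byte that cannot appear in a file name on Unix-like systems, which is why it makes a safe delimiter. A minimal Python sketch of reading such a manifest follows; `parse_files0` is a hypothetical helper, not part of Census.

```python
def parse_files0(manifest: bytes) -> list[bytes]:
    """Split a NUL-delimited manifest into file names.

    Names are kept as bytes because file names need not be valid text.
    Assumption: empty names in the middle of the list are passed through
    unchanged for the caller to reject.
    """
    names = manifest.split(b"\0")
    # A terminating NUL leaves one empty trailing element; drop it.
    if names and names[-1] == b"":
        names.pop()
    return names


# Names containing a space and an embedded newline survive intact.
manifest = b"logs/app one.log\0logs/odd\nname.log\0"
for name in parse_files0(manifest):
    print(name)
```

A newline-delimited list would have split the second name in two; the NUL-delimited form keeps it whole.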

Methodology

A disciplined approach to quantification.

Census follows a rigorous four-phase measurement protocol, designed to eliminate counting errors at every stage of the pipeline.

Ingest

Census reads your input stream (file, stdin, or batch manifest) in a single linear pass. No seeking, no buffering beyond what's necessary. Running time is O(n) in the size of the input, guaranteed.

Classify

Each byte is classified against your selected measurement dimensions: newline boundaries for lines, whitespace transitions for words, encoding rules for characters, raw offsets for bytes.
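The classification step can be sketched as a small state machine that walks the input once. The Python below is an illustration under stated assumptions, not Census's implementation: `Counts` and `count_stream` are hypothetical names, the input is assumed to be UTF-8, undecodable bytes are replaced rather than rejected, and whitespace is whatever Python's `str.isspace` reports.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    lines: int = 0
    words: int = 0
    chars: int = 0
    nbytes: int = 0


def count_stream(data: bytes, encoding: str = "utf-8") -> Counts:
    """Count newlines, words, characters, and bytes in one pass over data."""
    counts = Counts(nbytes=len(data))
    in_word = False
    # Assumption: undecodable bytes are replaced, so chars is approximate
    # for input that is not valid in the chosen encoding.
    for ch in data.decode(encoding, errors="replace"):
        counts.chars += 1
        if ch == "\n":
            counts.lines += 1  # a line is counted at its newline boundary
        if ch.isspace():
            in_word = False
        elif not in_word:
            in_word = True  # whitespace-to-word transition starts a new word
            counts.words += 1
    return counts


print(count_stream(b"hello world\nfoo\n"))
```

Because lines are counted at newline boundaries, a final line without a trailing newline contributes words, characters, and bytes but no line.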

Aggregate

Per-file counts are computed and, when multiple files are specified, a total row is appended. The output follows a strict column order: lines, words, characters, bytes, max line length.

Report

Results are emitted to stdout in a column-aligned, machine-parseable format. Pipe it, store it, visualise it. Census outputs truth — what you do with it is your research.
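The Aggregate and Report phases can be sketched together in Python. This is a hypothetical illustration: `format_report` is not a Census API, the four mode names mirror the values documented for the --total flag, and the exact layout (right-aligned columns of equal width, a "total" label in every mode) is an assumption.

```python
TOTAL_MODES = ("auto", "always", "only", "never")


def format_report(rows, total_mode="auto"):
    """Format (name, [count, ...]) rows as right-aligned columns.

    Assumption: every column is summed for the total row. A maximum line
    length column would need max() instead; that is not handled here.
    """
    if total_mode not in TOTAL_MODES:
        raise ValueError(f"unknown total mode: {total_mode!r}")
    show_total = total_mode in ("always", "only") or (
        total_mode == "auto" and len(rows) > 1
    )
    n_cols = len(rows[0][1]) if rows else 0
    totals = [sum(counts[i] for _, counts in rows) for i in range(n_cols)]
    shown = [] if total_mode == "only" else list(rows)
    if show_total:
        shown.append(("total", totals))
    # Pad every number to the width of the largest value shown.
    width = max((len(str(v)) for _, counts in shown for v in counts), default=1)
    return "\n".join(
        " ".join(str(v).rjust(width) for v in counts) + " " + name
        for name, counts in shown
    )


rows = [
    ("src/index.js", [87, 241, 1893]),
    ("src/utils.js", [134, 402, 3210]),
    ("src/config.js", [56, 178, 1344]),
]
print(format_report(rows))
```

With the three sample rows the default mode appends a total row of 277, 821, and 6447, while `total_mode="never"` would print the per-file rows alone.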

Case Study

How Lattice Analytics reduced data auditing time by 94%.

Lattice Analytics processes 2.3 million log files per day across their observability pipeline. Before Census, their data validation layer relied on file size heuristics and sampling-based line estimates — leading to silent data loss that went undetected for weeks.

After deploying Census with --files0-from and -l, Lattice built a real-time cardinality verification system that counts every line of every file as it enters the pipeline. Discrepancies trigger alerts within seconds.

"We thought we had observability. Then we started actually counting. The gap between what we assumed and what Census measured was terrifying. Now we don't assume anything — we count."

— Priya Chakraborty, Head of Data Integrity, Lattice Analytics

2.3M Files counted per day

94% Reduction in audit time

0 Undetected data loss events since deployment

Live Demo

Observe the count.

Real commands. Real output. Every number is exact.

Default Count (lines, words, bytes)

$ census README.md

42 318 2047 README.md

Line-Level Analytics

$ census -l server.log

16384 server.log

Multi-File with Total

$ census src/*.js

  87  241 1893 src/index.js
 134  402 3210 src/utils.js
  56  178 1344 src/config.js
 277  821 6447 total

Character-Aware + Max Line Width

$ census -m -L translations.csv

2048 120 translations.csv


Pricing

Rigorous counting, transparently priced.

Every tier unlocks new dimensions of quantification.

Observer

$0 / mo

For individuals exploring the fundamentals of quantification.

Line counting (-l)
Single-file analysis
Stdout output
Not included: word counting (-w)
Not included: byte metering (-c)
Not included: multi-file totals

Analyst

$19 / mo

For teams that need to count more than one thing at a time.

Everything in Observer
Word counting (-w)
Byte-Precision Metering™ (-c)
Multi-file support with totals
Default mode (lines + words + bytes)
Up to 10,000 files/day

Researcher

$49 / seat / mo

For data teams that demand the complete quantification stack.

Everything in Analyst
Character-Aware Counting™ (-m)
Maximum Line Width Analysis™ (-L)
Null-Terminated Batch Ingestion™
--total control (auto/always/only/never)
Unlimited files
Priority support

Institute

Custom

For organisations where counting is a core competency.

Everything in Researcher
Stdin pipeline integration
Custom output formatting
Dedicated counting cluster
Audit-grade logging
Dedicated CSM
99.999% accuracy SLA

Testimonials

What researchers are saying.

"We replaced our entire homegrown log validation layer with a single Census pipeline using -l and --files0-from. Our auditors were sceptical until they saw the numbers. Census doesn't estimate. It counts. That distinction matters when you're dealing with regulatory compliance."

Martin Kessler

VP of Data Governance, Stratum Financial

"The -m flag changed our entire localisation QA process. We were counting bytes and wondering why our Japanese translations looked 'bigger.' Census showed us the difference between bytes and characters, and our bug backlog dropped by a third overnight."

Mika Lin

Localisation Lead, Meridian Software

"I use Census -L in every CI pipeline to enforce line width limits. If any file exceeds 120 columns, the build fails. It's the simplest, most reliable format gate I've ever deployed. Fifteen seconds of setup. Zero false positives."

Mirela Tamayo

Principal Engineer, Vectral Systems

"Our ML team counts everything. Tokens, lines, bytes — before and after every preprocessing step. Census is the source of truth. When the model output doesn't match what Census says went in, we know exactly where the pipeline diverged."

Morgan Lee

ML Infrastructure Lead, Canopy AI

FAQ

Frequently asked questions.

What is the difference between -c and -m?

In single-byte encodings like ASCII, bytes and characters are identical. But in multibyte encodings like UTF-8, a single character can span 1–4 bytes. The -c flag counts raw bytes, while -m counts actual characters according to your locale. Use -c when you care about file size in bytes; use -m when you care about content length.

What does Census report when no flags are given?

With no flags, Census outputs three counts in a fixed order: newlines, words, and bytes. This is the canonical triad, the default measurement that has served as the foundation of text analysis for decades. Think of it as the complete vitals panel for any file.

How does Census define a word?

A word is a non-empty sequence of non-whitespace characters, delimited by whitespace or the start/end of input. This is a precise, locale-independent definition that avoids the ambiguity of natural language tokenisation. Census counts structural words, not semantic ones, and that's by design.

How are totals reported?

When Census processes more than one file, it automatically appends a row labelled "total" containing the sum of each column across all files. The --total flag gives you fine-grained control: auto (the default, a total row only with multiple files), always (a total row even for single files), only (suppress per-file rows), or never.

Does Census read from standard input?

Yes. When no file is specified, or when the filename is -, Census reads from standard input. This makes it composable with any Unix pipeline: pipe the output of any command into Census and get an instant quantitative snapshot of the data flowing through your system.

The science of counting things.
