Methodology

How DataLens BI detects column types, measures quality, applies cleaning, and generates insights — all using classical statistics computed client-side.

No machine learning is used. Every analysis uses deterministic statistical methods (Pearson correlation, IQR, linear regression, frequency counts) that are reproducible, explainable, and fast enough to run interactively in a browser on datasets up to ~100k rows.

1 · Automatic column type inference

Each column is sampled (up to 1,000 rows) and a parse ratio is computed for each candidate type. The type with the highest ratio above a threshold wins.

Number detection
Strips currency symbols, commas, and % signs, then parses. Rejects any value containing letters (excluding scientific e) — so codes like OP-119443 are never misclassified as negative numbers. Threshold: ≥ 75% of sampled values parse successfully.
Date detection
Attempts native Date.parse() and a set of explicit ISO, EU, and US patterns. Threshold: ≥ 75%. Dates are normalised to midnight UTC for consistent filtering.
Identifier detection
Numeric columns: flagged as identifier only if the column name contains ID-like tokens (id, uuid, code, ref…). Continuous metrics like Revenue are never flagged by uniqueness alone. Text columns: additionally flagged if uniqueness ratio ≥ 0.97 and cardinality > 50.
Boolean detection
Matches yes/no, true/false, 1/0, on/off patterns (case-insensitive). Threshold: ≥ 85% and ≤ 3 distinct values.

2 · Data quality scoring

Completeness is scored as the fraction of non-missing cells across all selected columns.

completeness = (present cells) / (rows × columns)
outlier_flag = |value − median| > 1.5 × IQR (Tukey fence)
duplicate_count = rows − distinct rows (by full row hash)

3 · Cleaning operations

Operation Target Method
Drop duplicatesAll columnsFull-row hash comparison; keeps first occurrence
Fill missing — medianNumeric columnsSorted array, index at floor(n/2)
Fill missing — modeCategorical columnsMost frequent non-null value
Fill missing — UnknownCategorical columnsLiteral string replacement
Drop rowsAny columnRemoves rows where the column is null/empty
Cap outliersNumeric columnsTukey IQR: clips to [Q1−1.5×IQR, Q3+1.5×IQR]
Trim whitespaceText columnsstr.trim() on every cell
Title caseCategorical textCapitalises first letter of each word; lowercases rest

Auto-clean applies all relevant operations in sequence. Each operation is logged with the column name, method, and count of affected cells. The cleaned copy is kept separately — the original raw data is never overwritten.

4 · Statistical analysis methods

Pearson correlation
Computed pairwise across all numeric columns. Results rendered as a heatmap (RdBu scale, −1 to +1). The top 15 pairs are ranked by absolute strength.
Linear regression (OLS)
Ordinary least squares on scatter pairs. Renders trend line and reports slope and r value in the dynamic insight beneath the chart.
Time-series aggregation
Groups rows by ISO period (day / week / month / quarter / year) and applies the selected aggregate function (sum, avg, median, min, max, count).
Dynamic insights engine
After every chart is rendered, a plain-English finding is computed from the data and injected below: skew direction, % change over time, strongest correlation pair, top/bottom segment gaps.

5 · Display fidelity — original values preserved

Type inference converts values internally for computation, but every row retains a _rawIdx pointer back to the original CSV string. The data preview table uses raw strings for display, ensuring codes like OP-119443 or SO-00001 always appear exactly as they were in the source file, even though the analytics engine correctly identifies and excludes them from numeric aggregations.