Overview Documentation Methodology Use Cases 🧪 Live Demo

Methodology

How DataLens BI detects column types, measures quality, applies cleaning, and generates insights — all using classical statistics computed client-side.

No machine learning is used. Every analysis uses deterministic statistical methods (Pearson correlation, IQR, linear regression, frequency counts) that are reproducible, explainable, and fast enough to run interactively in a browser on datasets up to ~100k rows.

1 · Automatic column type inference

Each column is sampled (up to 1,000 rows) and a parse ratio is computed for each candidate type. The type with the highest ratio above a threshold wins.

Number detection

Strips currency symbols, commas, and % signs, then parses. Rejects any value containing letters (excluding scientific e) — so codes like OP-119443 are never misclassified as negative numbers. Threshold: ≥ 75% of sampled values parse successfully.

Date detection

Attempts native Date.parse() and a set of explicit ISO, EU, and US patterns. Threshold: ≥ 75%. Dates are normalised to midnight UTC for consistent filtering.

Identifier detection

Numeric columns: flagged as identifier only if the column name contains ID-like tokens (id, uuid, code, ref…). Continuous metrics like Revenue are never flagged by uniqueness alone. Text columns: additionally flagged if uniqueness ratio ≥ 0.97 and cardinality > 50.

Boolean detection

Matches yes/no, true/false, 1/0, on/off patterns (case-insensitive). Threshold: ≥ 85% and ≤ 3 distinct values.

2 · Data quality scoring

Completeness is scored as the fraction of non-missing cells across all selected columns.

      completeness = (present cells) / (rows × columns)

      outlier_flag = |value − median| > 1.5 × IQR  (Tukey fence)

      duplicate_count = rows − distinct rows (by full row hash)

3 · Cleaning operations

Operation	Target	Method
Drop duplicates	All columns	Full-row hash comparison; keeps first occurrence
Fill missing — median	Numeric columns	Sorted array, index at floor(n/2)
Fill missing — mode	Categorical columns	Most frequent non-null value
Fill missing — Unknown	Categorical columns	Literal string replacement
Drop rows	Any column	Removes rows where the column is null/empty
Cap outliers	Numeric columns	Tukey IQR: clips to [Q1−1.5×IQR, Q3+1.5×IQR]
Trim whitespace	Text columns	str.trim() on every cell
Title case	Categorical text	Capitalises first letter of each word; lowercases rest

Auto-clean applies all relevant operations in sequence. Each operation is logged with the column name, method, and count of affected cells. The cleaned copy is kept separately — the original raw data is never overwritten.

4 · Statistical analysis methods

Pearson correlation

Computed pairwise across all numeric columns. Results rendered as a heatmap (RdBu scale, −1 to +1). The top 15 pairs are ranked by absolute strength.

Linear regression (OLS)

Ordinary least squares on scatter pairs. Renders trend line and reports slope and r value in the dynamic insight beneath the chart.

Time-series aggregation

Groups rows by ISO period (day / week / month / quarter / year) and applies the selected aggregate function (sum, avg, median, min, max, count).

Dynamic insights engine

After every chart is rendered, a plain-English finding is computed from the data and injected below: skew direction, % change over time, strongest correlation pair, top/bottom segment gaps.

5 · Display fidelity — original values preserved

Type inference converts values internally for computation, but every row retains a _rawIdx pointer back to the original CSV string. The data preview table uses raw strings for display, ensuring codes like OP-119443 or SO-00001 always appear exactly as they were in the source file, even though the analytics engine correctly identifies and excludes them from numeric aggregations.