How DataLens BI detects column types, measures quality, applies cleaning, and generates insights — all using classical statistics computed client-side.
Each column is sampled (up to 1,000 rows) and a parse ratio is computed for each candidate type. The type with the highest ratio above a threshold wins.
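The parse-ratio rule above can be sketched as follows. This is a minimal illustration, not DataLens BI's actual code: the function name `inferColumnType`, the two stand-in parsers, the choice to sample the first 1,000 values, and the choice to skip empty cells are all assumptions.

```javascript
// Hypothetical sketch of parse-ratio type inference.
const SAMPLE_SIZE = 1000; // "up to 1,000 rows" from the text
const THRESHOLD = 0.75;   // assumed to apply to every candidate type

// Stand-in parsers: each returns true when a value parses as that type.
const candidates = {
  number: (v) => /^[+-]?\d+(\.\d+)?$/.test(v),
  date: (v) => /^\d{4}-\d{2}-\d{2}$/.test(v),
};

function inferColumnType(values) {
  // Assumption: take the first rows as the sample and ignore empty cells.
  const sample = values
    .slice(0, SAMPLE_SIZE)
    .map((v) => String(v ?? "").trim())
    .filter((v) => v !== "");
  if (sample.length === 0) return "text";

  // Keep the candidate with the highest ratio that meets the threshold.
  let best = { type: "text", ratio: 0 };
  for (const [type, parses] of Object.entries(candidates)) {
    const ratio = sample.filter(parses).length / sample.length;
    if (ratio >= THRESHOLD && ratio > best.ratio) best = { type, ratio };
  }
  return best.type;
}
```

A column that no candidate parses well enough falls back to text in this sketch.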
e) — so codes like OP-119443 are never misclassified as negative numbers.
Threshold: ≥ 75% of sampled values parse successfully.
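One way to keep codes such as OP-119443 out of the numeric type is to require the whole string to match a number pattern, rather than extracting digits from it. The pattern and the helper names below are assumptions for illustration; the source does not show the exact expression it uses.

```javascript
// Assumed pattern: optional sign, digits with an optional decimal part,
// and an optional exponent. The anchors force a whole-string match.
const STRICT_NUMBER = /^[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$/;

// Returns the parsed number, or null when the string is not strictly numeric.
function parseStrictNumber(raw) {
  const s = String(raw).trim();
  return STRICT_NUMBER.test(s) ? Number(s) : null;
}

// Fraction of values that parse as numbers (compared against the 75% threshold).
function numericParseRatio(values) {
  if (values.length === 0) return 0;
  const parsed = values.filter((v) => parseStrictNumber(v) !== null).length;
  return parsed / values.length;
}
```

With this pattern, "-119443" parses as a negative number while "OP-119443" is rejected, because the leading letters fail the anchored match.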
Dates are detected with Date.parse() and a set of explicit ISO, EU, and US patterns. Threshold: ≥ 75%. Dates are normalised to midnight UTC for consistent filtering.
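A minimal sketch of pattern-based date parsing with midnight-UTC normalisation is shown below. The three regular expressions, the function name `toMidnightUtc`, and the order in which patterns are tried are assumptions; the source names ISO, EU, and US formats without listing them.

```javascript
// Assumed example patterns for the three families named in the text.
const ISO = /^(\d{4})-(\d{2})-(\d{2})$/;       // YYYY-MM-DD
const EU = /^(\d{1,2})\.(\d{1,2})\.(\d{4})$/;   // DD.MM.YYYY
const US = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/;   // MM/DD/YYYY
const DAY_MS = 24 * 60 * 60 * 1000;

// Returns a timestamp at 00:00:00 UTC, or null when the value is not a date.
function toMidnightUtc(raw) {
  const s = String(raw).trim();
  let parts = null; // [year, month, day]
  let m;
  if ((m = ISO.exec(s))) parts = [+m[1], +m[2], +m[3]];
  else if ((m = EU.exec(s))) parts = [+m[3], +m[2], +m[1]];
  else if ((m = US.exec(s))) parts = [+m[3], +m[1], +m[2]];

  if (parts) {
    const [y, mo, d] = parts;
    const ts = Date.UTC(y, mo - 1, d);
    const check = new Date(ts);
    // Reject overflowed dates such as 31.02.2024.
    const valid =
      check.getUTCFullYear() === y &&
      check.getUTCMonth() === mo - 1 &&
      check.getUTCDate() === d;
    return valid ? ts : null;
  }

  // Fallback: Date.parse() is implementation-dependent for non-ISO strings.
  const t = Date.parse(s);
  if (Number.isNaN(t)) return null;
  return Math.floor(t / DAY_MS) * DAY_MS; // truncate to the start of that UTC day
}
```

Because every parsed value lands on the same time of day, two cells holding the same calendar date compare equal regardless of the format they were written in.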
Completeness is scored as the fraction of non-missing cells across all selected columns.
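As a rough illustration, the completeness score could be computed like this. The definition of "missing" (null, undefined, or a blank string) and the names `isMissing` and `completeness` are assumptions made for the sketch.

```javascript
// Assumption: a cell is missing when it is null, undefined, or blank.
function isMissing(value) {
  return value === null || value === undefined || String(value).trim() === "";
}

// Fraction of non-missing cells across the selected columns.
// Returns null when there are no cells to score.
function completeness(rows, columns) {
  const total = rows.length * columns.length;
  if (total === 0) return null;
  let present = 0;
  for (const row of rows) {
    for (const col of columns) {
      if (!isMissing(row[col])) present++;
    }
  }
  return present / total;
}
```

For example, two rows and two selected columns with two blank cells give a score of 0.5.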
| Operation | Target | Method |
|---|---|---|
| Drop duplicates | All columns | Full-row hash comparison; keeps first occurrence |
| Fill missing — median | Numeric columns | Sorts the non-missing values and takes the element at index floor(n/2), which is the upper middle value when n is even |
| Fill missing — mode | Categorical columns | Most frequent non-null value |
| Fill missing — Unknown | Categorical columns | Literal string replacement |
| Drop rows | Any column | Removes rows where the column is null/empty |
| Cap outliers | Numeric columns | Tukey IQR: clips to [Q1−1.5×IQR, Q3+1.5×IQR] |
| Trim whitespace | Text columns | str.trim() on every cell |
| Title case | Categorical text | Capitalises first letter of each word; lowercases rest |
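Several of the operations in the table can be sketched in a few lines each. The function names are hypothetical, the quartile method (linear interpolation between sorted values) is an assumption since the table does not specify one, and a JSON string stands in for the full-row hash.

```javascript
// Drop duplicates: compare whole rows by a serialised key; keep the first occurrence.
function dropDuplicates(rows, columns) {
  const seen = new Set();
  return rows.filter((row) => {
    const key = JSON.stringify(columns.map((c) => row[c]));
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// Median as described in the table: sort, then take the element at floor(n/2).
function medianOf(nums) {
  const sorted = [...nums].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)];
}

// Fill null/undefined cells with the median of the remaining values.
function fillMissingWithMedian(values) {
  const present = values.filter((v) => v !== null && v !== undefined);
  if (present.length === 0) return [...values];
  const med = medianOf(present);
  return values.map((v) => (v === null || v === undefined ? med : v));
}

// Quantile by linear interpolation (assumed method).
function quantile(sorted, q) {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Cap outliers: clip every value to the Tukey fences.
function capOutliers(nums) {
  const sorted = [...nums].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  const low = q1 - 1.5 * iqr;
  const high = q3 + 1.5 * iqr;
  return nums.map((v) => Math.min(high, Math.max(low, v)));
}

// Title case: lowercase everything, then capitalise the first letter of each word.
function titleCase(text) {
  return text.toLowerCase().replace(/(^|\s)\S/g, (ch) => ch.toUpperCase());
}
```

Note that capping clips extreme values to the fence rather than removing the row, so row counts are unchanged by this operation.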
Auto-clean applies all relevant operations in sequence. Each operation is logged with the column name, method, and count of affected cells. The cleaned copy is kept separately — the original raw data is never overwritten.
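The sequencing and logging described above might look like the following sketch. The operation shape (`column`, `method`, `apply`) and the names `autoClean` and `trimOperation` are assumptions for illustration, not the tool's real interface.

```javascript
// Runs operations in order on a copy of the data and records a log entry for each.
function autoClean(rawRows, operations) {
  // Copy each row so the raw rows are never mutated.
  let rows = rawRows.map((row) => ({ ...row }));
  const log = [];
  for (const op of operations) {
    const result = op.apply(rows);
    rows = result.rows;
    log.push({ column: op.column, method: op.method, affected: result.affected });
  }
  return { cleaned: rows, log };
}

// Example operation: trim whitespace in one column and count the changed cells.
function trimOperation(column) {
  return {
    column,
    method: "trim whitespace",
    apply(rows) {
      let affected = 0;
      const next = rows.map((row) => {
        const value = row[column];
        if (typeof value === "string" && value !== value.trim()) {
          affected++;
          return { ...row, [column]: value.trim() };
        }
        return row;
      });
      return { rows: next, affected };
    },
  };
}
```

Calling `autoClean(rows, [trimOperation("name")])` returns the cleaned copy together with one log entry, while `rows` itself is left untouched.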
Type inference converts values internally for computation, but every row retains a
_rawIdx pointer back to the original CSV string. The data preview table uses raw
strings for display, ensuring codes like OP-119443 or SO-00001
always appear exactly as they were in the source file, even though the analytics engine
correctly identifies and excludes them from numeric aggregations.
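A simplified sketch of the raw-pointer idea follows. Only the `_rawIdx` field name comes from the text; the row shape, the helper names `buildRows` and `displayValue`, and the numeric conversion rule are assumptions.

```javascript
// Builds typed rows for computation; each keeps an index back to its raw row.
function buildRows(rawRows, numericColumns) {
  return rawRows.map((raw, idx) => {
    const typed = { _rawIdx: idx };
    for (const [col, value] of Object.entries(raw)) {
      if (numericColumns.includes(col)) {
        const n = Number(value);
        // Blank or non-numeric strings become null in the typed copy.
        typed[col] = value.trim() !== "" && Number.isFinite(n) ? n : null;
      } else {
        typed[col] = value;
      }
    }
    return typed;
  });
}

// The preview reads the original string through the pointer, not the typed value.
function displayValue(rawRows, typedRow, col) {
  return rawRows[typedRow._rawIdx][col];
}
```

In this sketch a cell written as "007.50" is computed on as 7.5 but still displayed as "007.50".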