Finding CSV files that start with a BOM using ripgrep
For sqlite-utils issue 250 I needed to locate some test CSV files that start with a UTF-8 BOM.
Here's how I did that using ripgrep:
$ rg --multiline --encoding none '^(?-u:\xEF\xBB\xBF)' --glob '*.csv' .
The --multiline option means the search spans multiple lines. I only want to match files that begin with my search term, so this means that ^ will match the start of the file, not the start of individual lines.
--encoding none runs the search against the raw bytes of the file, disabling ripgrep's default BOM detection.
--glob '*.csv' causes ripgrep to search only CSV files.
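To make "the raw bytes of the file" concrete, here is a small shell sketch that writes two throwaway CSV fixtures and prints their first three bytes as hex. The file names, the workdir variable and the first_three_bytes helper are invented for illustration and are not part of ripgrep or sqlite-utils. It assumes mktemp, head -c and od are available.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# \357\273\277 is octal for the bytes EF BB BF. Octal is used because
# not every printf implementation understands \x escapes.
printf '\357\273\277id,name\n1,Cleo\n' > "$workdir/with_bom.csv"
printf 'id,name\n1,Cleo\n' > "$workdir/without_bom.csv"

# Hypothetical helper: print the first three bytes of a file as lowercase hex.
first_three_bytes() {
    head -c 3 "$1" | od -An -tx1 | tr -d '[:space:]'
}

first_three_bytes "$workdir/with_bom.csv"; echo      # efbbbf
first_three_bytes "$workdir/without_bom.csv"; echo   # 69642c (the bytes of "id,")
```

The first file is the kind the search is meant to find. The second starts directly with its header row.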
The regular expression itself looks like this:
^(?-u:\xEF\xBB\xBF)
This is Rust regex syntax.
(?-u: means "turn OFF the u flag for the duration of this block" - the u flag, which is on by default, causes the Rust regex engine to interpret input as Unicode. So within the rest of that (...) block we can use escaped byte sequences.
Finally, \xEF\xBB\xBF is the byte sequence for the UTF-8 BOM itself.
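As a cross-check that does not depend on ripgrep, the same test can be sketched with find, head and od: a file qualifies if its first three bytes are EF BB BF. The csv_files_with_bom function and the demo fixtures below are hypothetical names for illustration, not something from sqlite-utils. The sketch assumes file paths contain no newlines.

```shell
# Hypothetical helper: list *.csv files under a directory (default: the
# current one) whose first three bytes are the UTF-8 BOM, EF BB BF.
csv_files_with_bom() {
    find "${1:-.}" -type f -name '*.csv' | while IFS= read -r path; do
        # Dump the first three bytes as hex and strip od's padding.
        prefix=$(head -c 3 "$path" | od -An -tx1 | tr -d '[:space:]')
        if [ "$prefix" = "efbbbf" ]; then
            printf '%s\n' "$path"
        fi
    done
}

# Demo fixtures in a throwaway directory.
demo=$(mktemp -d)
printf '\357\273\277id,name\n1,Cleo\n' > "$demo/with_bom.csv"
printf 'id,name\n1,Cleo\n' > "$demo/without_bom.csv"
printf '\357\273\277not a csv\n' > "$demo/notes.txt"   # has a BOM, but is not a CSV

csv_files_with_bom "$demo"   # prints only the path to with_bom.csv
```

If rg is installed, running the ripgrep command from this post inside the same directory should flag the same file.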
Created 2021-05-28T22:23:45-07:00