2019-04-05

How to handle unicode chars in Hadoop FS names on Ubuntu


Once i've caught myself staring on the weird unicode characters in names of files and folders within my hdfs directory browser on web. I haven't even seen such chars before, like this: ΞΆ, ₫, etc. Of course, I wasn't successful in their deletion via Ubuntu shell because they are not ASCII.

But Hadoop wildcards allowed me to do so.

It was a luck these chars were only at the beginning of the names, so, this might comes in handy for such occasions. Simply run the following shell command under hdfs user to get all names which don't start from a regular ASCII chars

$ hadoop fs -ls /{whatsoever_folder_name}/'[^a-z_0-9]*'