PHP upload text file encoding check and manipulation


I have a standard file upload where the user is supposed to upload a text file. But a "text file" is not equal to a "text file": the same file can have different encodings, such as UTF-8, UTF-7, UTF-16, UTF-32, ASCII, and ANSI.

To be more clear: I noticed that some encodings are not able to show certain characters that another encoding can show.

My questions:

  • Which encoding is "the complete" one, to which I can convert any other encoding without losing content?

  • How do I check whether the file is a text file and not binary?

  • How do I check whether the content of the text file is base64 encoded or not?

  • If the uploaded encoding is not "the complete" one, how do I change the encoding on the fly to "the complete" encoding (see question 1)?

I do not want to troll here by sending the whole code. Let's assume I have a form with action="upload.php", and then comes the part where I need the checks above.

    $target_dir = "uploads/";
    $target_file = $target_dir . basename($_FILES["filetoupload"]["name"]);
    [...]
    // first check after upload
    if (isset($_POST["submit"])) {
        // check 1: which encoding has been uploaded?
        // check 2: is the file a text file and not binary?
        // check 3: is the content of the file base64 encoded text?
    }
    // if the encoding is different from the "most preferred" one,
    // change the encoding to the "most preferred" one
    [...]

Can someone please help quickly?

Which encoding is "the complete" one, to which I can convert any other encoding without losing content?

Unicode. Choose one of the common encodings of the Unicode standard, UTF-8 or UTF-16. The de facto standard on the internet is UTF-8.

Check whether the file is a text file and not binary

There's no such difference as such. Text files also contain binary data; it just so happens that this binary data, when interpreted in the right encoding, results in human-readable text.

You can try to check whether the file contains a lot of "control characters" or NUL bytes or such, in which case it may not be text.
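As a minimal sketch of that heuristic, the function below flags data as probably binary when it contains a NUL byte or a high share of control characters. The name `looksBinary` and the 5% threshold are assumptions made for illustration, not established values, and the result is only a hint.

```php
<?php
// Hypothetical helper: returns true when $data looks more like binary than text.
// The default threshold of 5% control characters is an arbitrary assumption.
function looksBinary(string $data, float $maxControlRatio = 0.05): bool
{
    if ($data === '') {
        return false; // an empty file contains nothing suspicious
    }

    // A NUL byte is a strong hint for binary data. Caveat: UTF-16 and UTF-32
    // text also contains NUL bytes, so this only suits byte-oriented encodings.
    if (strpos($data, "\0") !== false) {
        return true;
    }

    // Count control characters other than tab, line feed and carriage return.
    $controlCount = preg_match_all('/[\x01-\x08\x0B\x0C\x0E-\x1F\x7F]/', $data);

    return ($controlCount / strlen($data)) > $maxControlRatio;
}
```

In an upload handler this could be called on `file_get_contents($_FILES["filetoupload"]["tmp_name"])` before the file is moved anywhere.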

You can try confirming whether the file is valid in any of your expected encodings. Have a list of supported/expected encodings at hand and check against that list. Note though that any random binary garbage is "valid" in a single-byte encoding like ISO-8859-1...
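A rough sketch of checking against such a list, assuming the mbstring extension is available. The function name `validEncodings` and the candidate list are hypothetical; pick the encodings you actually expect.

```php
<?php
// Hypothetical helper: returns every candidate encoding in which $data is valid.
// Requires the mbstring extension.
function validEncodings(string $data, array $candidates): array
{
    $valid = [];
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($data, $encoding)) {
            $valid[] = $encoding;
        }
    }
    return $valid;
}

// Example candidate list (an assumption, not a recommendation).
$candidates = ['ASCII', 'UTF-8', 'ISO-8859-1'];

// "café" stored as ISO-8859-1: the byte 0xE9 is neither ASCII nor valid UTF-8.
$latin1 = validEncodings("caf\xE9", $candidates);

// Plain ASCII text is valid in all three candidates.
$ascii = validEncodings("cafe", $candidates);
```

Because ISO-8859-1 accepts every byte sequence, it appears in both results, so a match there confirms very little on its own.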

Check whether the content of the text file is base64 encoded or not

Try to decode it as base64. If it decodes properly, it was probably base64 encoded. If it can't be decoded due to bad/malformed characters, it wasn't. However, this can yield false positives, as simple short text sequences may look like base64 encoded text.
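One way to sketch this in PHP is `base64_decode()` in strict mode plus a round-trip comparison. The helper name `looksLikeBase64` is hypothetical, and the sketch assumes padded, standard-alphabet base64 that may be wrapped across lines.

```php
<?php
// Hypothetical helper: returns true when $data decodes cleanly as base64.
// Assumes padded, standard-alphabet base64; whitespace is ignored.
function looksLikeBase64(string $data): bool
{
    // Base64 files are often wrapped into lines, so strip all whitespace first.
    $stripped = preg_replace('/\s+/', '', $data);
    if ($stripped === null || $stripped === '' || strlen($stripped) % 4 !== 0) {
        return false;
    }

    // Strict mode makes base64_decode() return false on characters
    // outside the base64 alphabet instead of silently skipping them.
    $decoded = base64_decode($stripped, true);
    if ($decoded === false) {
        return false;
    }

    // Round trip: re-encoding the decoded bytes must reproduce the input.
    return base64_encode($decoded) === $stripped;
}
```

The false positives mentioned above are easy to hit: the plain word "test" passes this check, because it happens to be a valid four-character base64 sequence.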

If the uploaded encoding is not "the complete" one, change the encoding on the fly to "the complete" encoding (see question 1)

If it's not UTF-8 encoded, convert it to UTF-8... from its original encoding...

How do you know its original encoding? You don't. You can guess. Again, have a list of encodings at hand and check them off one by one, using the one that seems most likely.

This doesn't sound sane to you? Well, that's because it isn't.

Trying to handle unknown encodings is a nightmare that you should try to avoid outright.
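For illustration only, the "check them off one by one" approach could look like the sketch below. It assumes the mbstring extension; the name `toUtf8` is hypothetical, and the order of the candidate list is a guess that decides the outcome. The first candidate that validates wins, which is exactly why the result is a guess rather than a detection.

```php
<?php
// Hypothetical helper: converts $data to UTF-8 from the first candidate
// encoding in which it is valid, or returns null if none matches.
// Requires the mbstring extension.
function toUtf8(string $data, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($data, $encoding)) {
            // mb_convert_encoding(string, to_encoding, from_encoding)
            return mb_convert_encoding($data, 'UTF-8', $encoding);
        }
    }
    return null;
}

// Strict encodings go first; ISO-8859-1 accepts any bytes, so it acts
// as a catch-all and must come last.
$utf8 = toUtf8("caf\xE9", ['UTF-8', 'ISO-8859-1']);
```

Data that is already valid UTF-8 passes through unchanged, since converting from UTF-8 to UTF-8 leaves valid input as it is.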

There is no right answer. There will be false positives. You cannot be sure without having a human confirm the result. If you have a text file in an unknown encoding, try to interpret it in all known encodings, rule out the ones in which it cannot be decoded correctly, and let a human pick the best result. There are libraries which implement such guessing/detection logic, paired with statistical text analysis to guesstimate the likelihood of the decoded text being actual text, but be aware that such libraries fundamentally can only provide a best guess.

Or know the encoding to begin with, through meta data or by having a human tell you.

Also see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

