python - Restrict separator to only some tabs when using pandas read_csv -
i'm reading tab-delimited data pandas dataframe using read_csv, have tabs occurring within column data means can't use "\t" separator. specifically, last entries in each line set of tab delimited optional tags match [a-za-z][a-za-z0-9]:[a-za-z]:.+ there no guarantees how many tags there or ones present, , different sets of tags may occur on different lines. example data looks (all white spaces tabs in data):
c42tmacxx:5:2316:15161:76101 163 1 @<@dffadddf:dd nh:i:1 hi:i:1 as:i:200 nm:i:0 c42tmacxx:5:2316:15161:76101 83 1 cccccacdddcb@b nh:i:1 hi:i:1 nm:i:1 c42tmacxx:5:1305:26011:74469 163 1 cccfffffhhhhgj nh:i:1 hi:i:1 as:i:200 nm:i:0
i proposing try read tags in single column, , thought passing in regular expression separator excludes tabs occur in context of tags.
following http://www.rexegg.com/regex-best-trick.html wrote following regex this: [a-za-z][a-za-z0-9]:[a-za-z]:[^\t]+\t..:|(\t). tested on online regular expression tester , seems match tabs want separators.
but when run
df = pd.read_csv(myfile.txt, sep=r"[a-za-z][a-za-z0-9]:[a-za-z]:[^\t]+\t..:|(\t)", header=none, engine="python") print(df)
i following output data:
0 1 2 3 4 5 6 7 8 \ 0 c42tmacxx:5:2316:15161:76101 \t 163 \t 1 \t @<@dffadddf:dd \t nan 1 c42tmacxx:5:2316:15161:76101 \t 83 \t 1 \t cccccacdddcb@b \t nan 2 c42tmacxx:5:1305:26011:74469 \t 163 \t 1 \t cccfffffhhhhgj \t nan 9 10 11 12 13 14 0 nan i:1 \t nan nan i:0 1 nan i:1 \t nm:i:1 nan none 2 nan i:1 \t nan nan i:0
what expecting / want is:
0 1 2 3 4 0 c42tmacxx:5:2316:15161:76101 163 1 @<@dffadddf:dd nh:i:1 hi:i:1 as:i:200 nm:i:0 1 c42tmacxx:5:2316:15161:76101 83 1 cccccacdddcb@b nh:i:1 hi:i:1 nm:i:1 2 c42tmacxx:5:1305:26011:74469 163 1 cccfffffhhhhgj nh:i:1 hi:i:1 as:i:200 nm:i:0
how achieve that?
in case it's relevant, i'm using pandas 0.17.1 , real data files of order of 100 million+ lines.
i took quick @ pandas docs , seems regex used separator cannot use groups.
c42tmacxx:5:2316:15161:76101 163 1 @<@dffadddf:dd nh:i:1 hi:i:1 as:i:200 nm:i:0 c42tmacxx:5:2316:15161:76101 83 1 cccccacdddcb@b nh:i:1 hi:i:1 nm:i:1 c42tmacxx:5:1305:26011:74469 163 1 cccfffffhhhhgj nh:i:1 hi:i:1 as:i:200 nm:i:0 ^ ^ ^ ^
you need match 4 first tabs can't without using groups.
a solution isolate wanted \t
using lookaheads , lookbehinds.
here regex should work:
(?<=\d)\t(?=\d)|\t(?=[a-z@<:]{14})|(?<=[a-z@<:]{14})\t
explanation
(?<=\d)\t(?=\d)
: tab precedeed (?<=...)
digit , followed (?=...)
digit
=> match 1st , 2nd tabs
|
or
\t(?=[a-z@<:]{14})
: tab followed 14 consecutive characters present in set letter,@,< or :
=> match 3rd tab
|
or
(?<=[a-z@<:]{14})\t
: tab precedeed same 14 characters set
=> match 4th tab
note
if need allow more characters in 14 consecutive characters pattern, add them set.
Comments
Post a Comment