python - Restrict separator to only some tabs when using pandas read_csv -
i'm reading tab-delimited data pandas dataframe using read_csv, have tabs occurring within column data means can't use "\t" separator. specifically, last entries in each line set of tab delimited optional tags match [a-za-z][a-za-z0-9]:[a-za-z]:.+ there no guarantees how many tags there or ones present, , different sets of tags may occur on different lines. example data looks (all white spaces tabs in data):
c42tmacxx:5:2316:15161:76101    163 1   @<@dffadddf:dd  nh:i:1  hi:i:1  as:i:200    nm:i:0 c42tmacxx:5:2316:15161:76101    83  1   cccccacdddcb@b  nh:i:1  hi:i:1  nm:i:1 c42tmacxx:5:1305:26011:74469    163 1   cccfffffhhhhgj  nh:i:1  hi:i:1  as:i:200    nm:i:0 i proposing try read tags in single column, , thought passing in regular expression separator excludes tabs occur in context of tags.
following http://www.rexegg.com/regex-best-trick.html wrote following regex this: [a-za-z][a-za-z0-9]:[a-za-z]:[^\t]+\t..:|(\t). tested on online regular expression tester , seems match tabs want separators.
but when run
df = pd.read_csv(myfile.txt, sep=r"[a-za-z][a-za-z0-9]:[a-za-z]:[^\t]+\t..:|(\t)",                   header=none, engine="python") print(df) i following output data:
                          0       1    2   3   4   5               6   7   8 \ 0  c42tmacxx:5:2316:15161:76101  \t  163  \t   1  \t  @<@dffadddf:dd  \t nan    1  c42tmacxx:5:2316:15161:76101  \t   83  \t   1  \t  cccccacdddcb@b  \t nan    2  c42tmacxx:5:1305:26011:74469  \t  163  \t   1  \t  cccfffffhhhhgj  \t nan        9    10  11      12  13    14   0 nan  i:1  \t     nan nan   i:0   1 nan  i:1  \t  nm:i:1 nan  none   2 nan  i:1  \t     nan nan   i:0   what expecting / want is:
                          0        1  2               3                      4 0  c42tmacxx:5:2316:15161:76101  163  1  @<@dffadddf:dd  nh:i:1 hi:i:1 as:i:200 nm:i:0    1  c42tmacxx:5:2316:15161:76101  83   1  cccccacdddcb@b  nh:i:1 hi:i:1 nm:i:1    2  c42tmacxx:5:1305:26011:74469  163  1  cccfffffhhhhgj  nh:i:1 hi:i:1 as:i:200 nm:i:0 how achieve that?
in case it's relevant, i'm using pandas 0.17.1 , real data files of order of 100 million+ lines.
i took quick @ pandas docs , seems regex used separator cannot use groups.
c42tmacxx:5:2316:15161:76101    163 1   @<@dffadddf:dd  nh:i:1  hi:i:1  as:i:200    nm:i:0 c42tmacxx:5:2316:15161:76101    83  1   cccccacdddcb@b  nh:i:1  hi:i:1  nm:i:1 c42tmacxx:5:1305:26011:74469    163 1   cccfffffhhhhgj  nh:i:1  hi:i:1  as:i:200    nm:i:0                               ^    ^  ^                ^            you need match 4 first tabs can't without using groups.
a solution isolate wanted \t using lookaheads , lookbehinds.
here regex should work:
(?<=\d)\t(?=\d)|\t(?=[a-z@<:]{14})|(?<=[a-z@<:]{14})\t
explanation
(?<=\d)\t(?=\d) : tab precedeed (?<=...) digit , followed (?=...) digit 
=> match 1st , 2nd tabs
| or
\t(?=[a-z@<:]{14}) : tab followed 14 consecutive characters present in set letter,@,< or : 
=> match 3rd tab
| or
(?<=[a-z@<:]{14})\t : tab precedeed same 14 characters set
=> match 4th tab
note
if need allow more characters in 14 consecutive characters pattern, add them set.
Comments
Post a Comment