python - Restrict separator to only some tabs when using pandas read_csv


I'm reading some tab-delimited data into a pandas DataFrame using read_csv, but I have tabs occurring within the column data, which means I can't just use "\t" as the separator. Specifically, the last entries in each line are a set of tab-delimited optional tags which match [A-Za-z][A-Za-z0-9]:[A-Za-z]:.+ There are no guarantees about how many tags there will be or which ones will be present, and different sets of tags may occur on different lines. Example data looks like this (all white space is tabs in the data):

c42tmacxx:5:2316:15161:76101    163 1   @<@dffadddf:dd  nh:i:1  hi:i:1  as:i:200    nm:i:0
c42tmacxx:5:2316:15161:76101    83  1   cccccacdddcb@b  nh:i:1  hi:i:1  nm:i:1
c42tmacxx:5:1305:26011:74469    163 1   cccfffffhhhhgj  nh:i:1  hi:i:1  as:i:200    nm:i:0

I am proposing to read all the tags into a single column, and I thought I could do this by passing in a regular expression as the separator that excludes tabs which occur in the context of the tags.

Following http://www.rexegg.com/regex-best-trick.html I wrote the following regex for this: [A-Za-z][A-Za-z0-9]:[A-Za-z]:[^\t]+\t..:|(\t). I tested it on an online regular expression tester and it seems to match only the tabs I want as separators.

But when I run

import pandas as pd

df = pd.read_csv("myfile.txt",
                 sep=r"[A-Za-z][A-Za-z0-9]:[A-Za-z]:[^\t]+\t..:|(\t)",
                 header=None, engine="python")
print(df)

I get the following output for the data:

                              0   1    2   3  4   5               6   7    8  \
0  c42tmacxx:5:2316:15161:76101  \t  163  \t  1  \t  @<@dffadddf:dd  \t  NaN
1  c42tmacxx:5:2316:15161:76101  \t   83  \t  1  \t  cccccacdddcb@b  \t  NaN
2  c42tmacxx:5:1305:26011:74469  \t  163  \t  1  \t  cccfffffhhhhgj  \t  NaN

     9   10  11      12   13    14
0  NaN  i:1  \t     NaN  NaN   i:0
1  NaN  i:1  \t  nm:i:1  NaN  None
2  NaN  i:1  \t     NaN  NaN   i:0

What I was expecting / want is:

                              0    1  2               3                              4
0  c42tmacxx:5:2316:15161:76101  163  1  @<@dffadddf:dd  nh:i:1 hi:i:1 as:i:200 nm:i:0
1  c42tmacxx:5:2316:15161:76101  83   1  cccccacdddcb@b  nh:i:1 hi:i:1 nm:i:1
2  c42tmacxx:5:1305:26011:74469  163  1  cccfffffhhhhgj  nh:i:1 hi:i:1 as:i:200 nm:i:0

How do I achieve that?

In case it's relevant, I'm using pandas 0.17.1, and my real data files are of the order of 100 million+ lines.

I took a quick look at the pandas docs, and it seems that a regex used as a separator cannot use groups.
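The extra "\t" columns in the question's output are what one would expect if each line is split the way Python's re.split does it, since re.split returns the text of any capturing group as an extra field. That pandas' python engine behaves this way internally is an assumption here. A minimal sketch on a shortened sample line:

```python
import re

# Shortened sample line; the fields are separated by real tab characters.
line = "c42tmacxx:5:2316:15161:76101\t163\t1\t@<@dffadddf:dd\tnh:i:1\thi:i:1"

# Separator with a capturing group: each captured tab comes back as a field.
with_group = re.split(r"(\t)", line)

# Same separator without the group: only the real fields come back.
without_group = re.split(r"\t", line)

print(with_group)
print(without_group)
```

With the group, every second element of the result is a literal tab, which matches the pattern of "\t" columns shown in the question.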

c42tmacxx:5:2316:15161:76101    163 1   @<@dffadddf:dd  nh:i:1  hi:i:1  as:i:200    nm:i:0
c42tmacxx:5:2316:15161:76101    83  1   cccccacdddcb@b  nh:i:1  hi:i:1  nm:i:1
c42tmacxx:5:1305:26011:74469    163 1   cccfffffhhhhgj  nh:i:1  hi:i:1  as:i:200    nm:i:0
                            ^      ^ ^                ^

You need to match only the first 4 tabs, but you can't do that with your regex without using groups.

A solution is to isolate the wanted \t using lookaheads and lookbehinds.

Here is a regex that should work:

(?<=\d)\t(?=\d)|\t(?=[a-z@<:]{14})|(?<=[a-z@<:]{14})\t

Explanation

(?<=\d)\t(?=\d) : a tab preceded (?<=...) by a digit and followed (?=...) by a digit

=> matches the 1st and 2nd tabs

| or

\t(?=[a-z@<:]{14}) : a tab followed by 14 consecutive characters from the set: letter, @, < or :

=> matches the 3rd tab

| or

(?<=[a-z@<:]{14})\t : a tab preceded by 14 characters from the same set

=> matches the 4th tab
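As a quick check of the pattern outside pandas, here is a sketch that applies it with re.split to the three sample lines from the question. The names SEP, lines and rows are illustrative only.

```python
import re

# The separator from the answer: it matches only the first four tabs.
SEP = re.compile(r"(?<=\d)\t(?=\d)|\t(?=[a-z@<:]{14})|(?<=[a-z@<:]{14})\t")

lines = [
    "c42tmacxx:5:2316:15161:76101\t163\t1\t@<@dffadddf:dd\tnh:i:1\thi:i:1\tas:i:200\tnm:i:0",
    "c42tmacxx:5:2316:15161:76101\t83\t1\tcccccacdddcb@b\tnh:i:1\thi:i:1\tnm:i:1",
    "c42tmacxx:5:1305:26011:74469\t163\t1\tcccfffffhhhhgj\tnh:i:1\thi:i:1\tas:i:200\tnm:i:0",
]

# Split each line; the tags should stay together in the fifth field.
rows = [SEP.split(line) for line in lines]

for row in rows:
    print(len(row), repr(row[4]))
```

Each line splits into five fields, and the tabs between the tags are left inside the last one. The same pattern string can be passed as sep to read_csv with engine="python", since it contains no capturing groups.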


Note

If you need to allow more characters in the pattern of 14 consecutive characters, just add them to the set.
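For comparison, if the tags are always preceded by exactly four fixed columns, as in the sample data, a plain string split with a maximum split count gives the same five fields without depending on the character set. This is a different technique from the regex above, and the fixed column count is an assumption taken from the sample. The helper name split_record is hypothetical.

```python
def split_record(line, fixed_columns=4):
    """Split on the first `fixed_columns` tabs only; the rest stays in one field."""
    # str.split with maxsplit stops after that many separators.
    return line.rstrip("\n").split("\t", fixed_columns)


sample = "c42tmacxx:5:2316:15161:76101\t83\t1\tcccccacdddcb@b\tnh:i:1\thi:i:1\tnm:i:1\n"
fields = split_record(sample)
print(fields)
```

A list of such rows could then be handed to the DataFrame constructor. Whether that is fast enough for 100 million+ lines would need measuring.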

