python - Restrict separator to only some tabs when using pandas read_csv


I'm reading tab-delimited data into a pandas DataFrame using read_csv, but I have tabs occurring within the column data, which means I can't just use "\t" as the separator. Specifically, the last entries in each line are a set of tab-delimited optional tags which match [A-Za-z][A-Za-z0-9]:[A-Za-z]:.+ There are no guarantees about how many tags there will be or which ones will be present, and different sets of tags may occur on different lines. Example data looks like this (all white spaces are tabs in my data):

c42tmacxx:5:2316:15161:76101    163 1   @<@dffadddf:dd  nh:i:1  hi:i:1  as:i:200    nm:i:0
c42tmacxx:5:2316:15161:76101    83  1   cccccacdddcb@b  nh:i:1  hi:i:1  nm:i:1
c42tmacxx:5:1305:26011:74469    163 1   cccfffffhhhhgj  nh:i:1  hi:i:1  as:i:200    nm:i:0
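As a quick sanity check of the tag pattern, here is a minimal sketch that splits the first example record on every tab and tests each field against the pattern. It assumes the fields really are separated by single tab characters, and the names TAG, line, fields and is_tag are only illustrative.

```python
import re

# Tag pattern from the question: two-character name, one-letter type, then a value.
TAG = re.compile(r"[A-Za-z][A-Za-z0-9]:[A-Za-z]:.+")

# First example record, written with explicit tab characters between fields.
line = ("c42tmacxx:5:2316:15161:76101\t163\t1\t@<@dffadddf:dd"
        "\tnh:i:1\thi:i:1\tas:i:200\tnm:i:0")

fields = line.split("\t")

# True for fields that look like optional tags, False for the fixed columns.
is_tag = [bool(TAG.fullmatch(field)) for field in fields]
print(is_tag)
```

For this record the four fixed columns come out as False and the four trailing fields as True, which is the split I want to keep together in one column.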

I am proposing to try to read all of the tags into a single column, and thought I could do this by passing in a regular expression as the separator which excludes tabs that occur in the context of the tags.

Following http://www.rexegg.com/regex-best-trick.html I wrote the following regex for this: [A-Za-z][A-Za-z0-9]:[A-Za-z]:[^\t]+\t..:|(\t). I tested it on an online regular expression tester and it seems to match only the tabs I want as separators.

But when I run

import pandas as pd

df = pd.read_csv("myfile.txt",
                 sep=r"[A-Za-z][A-Za-z0-9]:[A-Za-z]:[^\t]+\t..:|(\t)",
                 header=None, engine="python")
print(df)

I get the following output for the example data:

                              0   1    2   3   4   5               6   7   8 \
0  c42tmacxx:5:2316:15161:76101  \t  163  \t   1  \t  @<@dffadddf:dd  \t NaN
1  c42tmacxx:5:2316:15161:76101  \t   83  \t   1  \t  cccccacdddcb@b  \t NaN
2  c42tmacxx:5:1305:26011:74469  \t  163  \t   1  \t  cccfffffhhhhgj  \t NaN

    9    10  11      12  13    14
0 NaN  i:1  \t     NaN NaN   i:0
1 NaN  i:1  \t  nm:i:1 NaN  None
2 NaN  i:1  \t     NaN NaN   i:0

What I was expecting / want is:

                              0    1  2               3                              4
0  c42tmacxx:5:2316:15161:76101  163  1  @<@dffadddf:dd  nh:i:1 hi:i:1 as:i:200 nm:i:0
1  c42tmacxx:5:2316:15161:76101  83   1  cccccacdddcb@b  nh:i:1 hi:i:1 nm:i:1
2  c42tmacxx:5:1305:26011:74469  163  1  cccfffffhhhhgj  nh:i:1 hi:i:1 as:i:200 nm:i:0

How can I achieve that?

In case it's relevant, I'm using pandas 0.17.1, and my real data files are of the order of 100 million+ lines.

I took a quick look at the pandas docs, and it seems that a regex used as a separator cannot use groups.
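The extra \t and empty columns in the question's output are what Python's own re.split produces when the pattern contains a capture group. Here is a small sketch of that behaviour on the first example record. That pandas splits lines this way internally is my assumption, not something I checked in the pandas source.

```python
import re

# The separator from the question, including its capture group (\t).
pattern = r"[A-Za-z][A-Za-z0-9]:[A-Za-z]:[^\t]+\t..:|(\t)"

line = ("c42tmacxx:5:2316:15161:76101\t163\t1\t@<@dffadddf:dd"
        "\tnh:i:1\thi:i:1\tas:i:200\tnm:i:0")

# re.split also returns the text of the group for every match:
# "\t" when the second alternative matched, None when the first one did.
parts = re.split(pattern, line)
print(len(parts))
print(parts)
```

The list has 15 items, with "\t" entries between the first fields and None entries where the tag alternative matched, which lines up with the 15 columns shown in the question.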

c42tmacxx:5:2316:15161:76101    163 1   @<@dffadddf:dd  nh:i:1  hi:i:1  as:i:200    nm:i:0
c42tmacxx:5:2316:15161:76101    83  1   cccccacdddcb@b  nh:i:1  hi:i:1  nm:i:1
c42tmacxx:5:1305:26011:74469    163 1   cccfffffhhhhgj  nh:i:1  hi:i:1  as:i:200    nm:i:0
                              ^    ^  ^                ^

You need to match only the first 4 tabs (marked with ^ above), but your regex can't do that without using groups.

A solution is to isolate the wanted \t using lookaheads and lookbehinds.

Here is a regex that should work:

(?<=\d)\t(?=\d)|\t(?=[a-z@<:]{14})|(?<=[a-z@<:]{14})\t

Explanation

(?<=\d)\t(?=\d) : a tab preceded (?<=...) by a digit and followed (?=...) by a digit

=> matches the 1st and 2nd tabs

| or

\t(?=[a-z@<:]{14}) : a tab followed by 14 consecutive characters from the set of letters, @, < or :

=> matches the 3rd tab

| or

(?<=[a-z@<:]{14})\t : a tab preceded by 14 characters from the same set

=> matches the 4th tab
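To check the three alternatives against the example data, here is a minimal sketch using re.split directly. It assumes single tab characters between fields, and the names SEP, lines and rows are just for illustration.

```python
import re

# Lookaround-only separator: no capture groups, so split returns fields only.
SEP = re.compile(r"(?<=\d)\t(?=\d)|\t(?=[a-z@<:]{14})|(?<=[a-z@<:]{14})\t")

lines = [
    "c42tmacxx:5:2316:15161:76101\t163\t1\t@<@dffadddf:dd\tnh:i:1\thi:i:1\tas:i:200\tnm:i:0",
    "c42tmacxx:5:2316:15161:76101\t83\t1\tcccccacdddcb@b\tnh:i:1\thi:i:1\tnm:i:1",
    "c42tmacxx:5:1305:26011:74469\t163\t1\tcccfffffhhhhgj\tnh:i:1\thi:i:1\tas:i:200\tnm:i:0",
]

# Each row should have 5 fields; the last one keeps its internal tabs.
rows = [SEP.split(line) for line in lines]
for row in rows:
    print(len(row), row)
```

The same pattern string (SEP.pattern) is what would go into sep= together with engine="python" in read_csv. I have not run that against pandas 0.17.1, so treat the pandas side as untested.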


Note

If you need to allow more characters in the pattern of 14 consecutive characters, just add them to the set.
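One way to make the set easy to extend is to build the regex from a string of allowed characters. This is only a sketch: build_separator is a hypothetical helper, and the extra # character and the sample record are made up for the example.

```python
import re
import string

def build_separator(allowed_chars, run_length=14):
    """Hypothetical helper: build the lookaround separator for a character set."""
    # re.escape keeps characters such as ] ^ - \ literal inside the class.
    run = "[" + re.escape(allowed_chars) + "]{" + str(run_length) + "}"
    return re.compile(
        r"(?<=\d)\t(?=\d)|\t(?=" + run + r")|(?<=" + run + r")\t"
    )

# Made-up record whose 14-character column contains a '#'.
line = "id:1\t163\t1\tccc#ffffhhhhgj\tnh:i:1\thi:i:1"

base = build_separator(string.ascii_lowercase + "@<:")
extended = build_separator(string.ascii_lowercase + "@<:#")

print(base.split(line))      # '#' is not in the set, so tabs 3 and 4 are missed
print(extended.split(line))  # with '#' added, the record splits into 5 fields
```

The run length has to stay a fixed number, because Python's re module only accepts fixed-width lookbehinds.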

