Concurrent processing of multiple files: Python solutions

The Problem

Tim Bray recently posted about his experience using Erlang to do some straightforward parsing of a large log file, inspired by a chapter he wrote for the book Beautiful Code. As it turned out, Erlang isn’t exactly optimized for tasks like this. After trying to parse a 1,000,000-line log file, Tim notes:

“My first cut in Erlang, based on the per-process dictionary, took around eight minutes of CPU, and kept one of my MacBook’s Core Duo processors pegged at 97% while it was running. Ouch!”

That’s less than a half megabyte per second. Not very impressive. Let’s see if we can come up with something better in Python.

A Single-Threaded Python Solution

Santiago Gala followed up on Tim’s original post with a nice map/reduce-based implementation in Python:

http://memojo.com/~sgala/blog/2007/09/29/Python-Erlang-Map-Reduce

Santiago’s script uses a series of nested generators to do filtering and mapping, and then uses a for-in loop to reduce the mapped stream into a dictionary.

To benchmark the script, I created a sample by concatenating 100 copies of Tim’s original 10,000-line sample file. With that file, Santiago’s script needs about 6.7 seconds wall-time to parse 200 megabytes of log data on my Core Duo laptop (using Windows XP, warmed-up disk caches, and the final print statement replaced with a pass).

 

Tim’s 1.67 GHz Core Duo L2400 MacBook should match my 1.66 GHz Core Duo T2300 HP notebook pretty well, so that’s about 70 times faster than his Erlang program, and about twice as fast as his Ruby version. Not too shabby.

But we can speed things up a bit more, of course.

Compiling the RE

Python’s RE engine caches compiled expressions, but it’s usually a good idea to move the cache lookup out of the inner loop anyway. And while we’re at it, we can move the method lookup out of the loop as well:

pat = r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) "
search = re.compile(pat).search

matches = (search(line) for line in file("o10k.ap"))

With these changes, the script finishes in 4.1 seconds.

Skipping lines that cannot match

Somewhat less obvious is the fact that we can use Python’s in operator to filter out lines that cannot match:

matches = (search(line) for line in file("o10k.ap")
    if "GET /ongoing/When" in line)

The RE engine does indeed use special code for literal prefixes, but the sublinear substring search algorithm that was introduced in 2.5 is a lot faster in cases like this, so this simple change gives a noticeable speedup; the script now runs in 2.9 seconds.
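A minimal sketch of the prefilter idea, written for Python 3 (the article's own code is Python 2). The sample log lines are made up for illustration, and timings depend on the machine, so none are quoted here; the point is only that both variants return the same matches.

```python
import re
import timeit

pat = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")
search = pat.search

# Hypothetical log lines: one that matches, two that cannot.
lines = [
    'host - - [01/Oct/2007] "GET /ongoing/When/200x/2007/09/29/Erlang HTTP/1.1" 200 1\n',
    'host - - [01/Oct/2007] "GET /ongoing/ongoing.atom HTTP/1.1" 304 -\n',
    'host - - [01/Oct/2007] "GET /favicon.ico HTTP/1.1" 200 1\n',
] * 1000


def regex_only():
    # Every line goes through the RE engine.
    return [m.group(1) for m in map(search, lines) if m]


def prefiltered():
    # The cheap substring test rejects most lines before the RE engine runs.
    candidates = (search(line) for line in lines if "GET /ongoing/When" in line)
    return [m.group(1) for m in candidates if m]


if __name__ == "__main__":
    print("regex only :", timeit.timeit(regex_only, number=20))
    print("prefiltered:", timeit.timeit(prefiltered, number=20))
```

How much the prefilter helps depends on what fraction of lines contain the literal prefix; on a log where nearly every line matches, the extra test is pure overhead.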

Reading files in binary mode (Windows)

On Windows (and in theory, on other platforms that distinguish between text files and binary files), data read via the standard file object are scanned for Windows-style line endings (“\r\n”). Any such character combination is then translated to a single newline, for consistency.

This is of course very convenient, since it allows you to treat text files in the same way no matter what platform you’re on, but on files this large, the performance penalty is starting to get noticeable.

We can turn this off simply by passing in the “rb” flag (read binary) to the open function.

matches = (search(line) for line in file("o10k.ap", "rb")
    if "GET /ongoing/When" in line)

The file object will still break things up in lines, and our code doesn’t look at the line endings, so we still get the same result. Just a bit quicker.
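The difference between the two modes can be seen on a tiny file. This sketch targets Python 3, where the behaviour differs slightly from the Python 2 on Windows setup described above: text mode translates line endings on every platform by default, and binary mode returns bytes rather than strings.

```python
import os
import tempfile

# Write a small file with Windows-style line endings.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"first\r\nsecond\r\n")

with open(path, "r", encoding="ascii") as f:   # text mode: "\r\n" becomes "\n"
    text_lines = f.readlines()

with open(path, "rb") as f:                    # binary mode: bytes pass through
    raw_lines = f.readlines()

os.remove(path)

print(text_lines)
print(raw_lines)
```

In both modes the file is still split on the newline character, which is why code that ignores the line endings gets the same result either way.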

The Code

Here’s the final version of Santiago’s script:

import re
from collections import defaultdict

FILE = "o1000k.ap"

pat = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")

search = pat.search

# map
matches = (search(line) for line in file(FILE, "rb") if "GET /ongoing/When" in line)
mapp    = (match.group(1) for match in matches if match)

# reduce
count = defaultdict(int)
for page in mapp:
    count[page] += 1

# print the ten most frequently requested pages
for key in sorted(count, key=count.get, reverse=True)[:10]:
    print "%40s = %s" % (key, count[key])

To get a version that’s set up for benchmarking, get the wf-2.py file from this directory:

http://svn.effbot.org/public/stuff/sandbox/wide-finder/

This version of the script finishes in 1.9 seconds. This is a 3.5x speedup over Santiago’s version, and over 250x faster than Tim’s Erlang version. Pretty good for a short single-threaded script, don’t you think?

But I’m running this on a Core Duo machine. Two CPU cores, that is. What about using them both for this task?

A Multi-Threaded Python Solution

To run multiple subtasks in parallel, we need to split the task up in some way. Since the program reads a single text file, the easiest way to do that is to split the file into multiple pieces on the way in. Here’s a simple function that rushes through the file, splitting it up in 1 megabyte chunks, and returns chunk offsets and sizes:

def getchunks(file, size=1024*1024):
    f = open(file, "rb")  # binary mode, so tell() reports reliable offsets
    while 1:
        start = f.tell()
        f.seek(size, 1)
        s = f.readline()
        yield start, f.tell() - start
        if not s:
            break

By default, this splits the file in megabyte-sized chunks:

 >>> for chunk in getchunks("o1000k.ap"):
...     print chunk
(0L, 1048637L)
(1048637L, 1048810L)
(2097447L, 1048793L)
(3146240L, 1048603L)
 

Note the use of readline to make sure that each chunk ends at a newline character. (Without this, there’s a small chance that we’ll miss some entries here and there. This is probably not much of a problem in practice, but let’s stick to the exact solution for now.)
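The newline alignment can be checked on a small file. The following is a Python 3 variant of the idea, not the article's exact function: it seeks to absolute offsets and stops at the end of the file, and the `check` helper and the sample data are assumptions added for illustration.

```python
import os
import tempfile


def getchunks(path, size=1024 * 1024):
    """Yield (offset, length) pairs whose boundaries fall just after a newline."""
    filesize = os.path.getsize(path)
    with open(path, "rb") as f:
        start = 0
        while start < filesize:
            f.seek(min(start + size, filesize))
            f.readline()          # move forward to the end of the current line
            end = f.tell()
            yield start, end - start
            start = end


def check(path, size):
    """Return True if the chunks tile the file and each ends at a line break."""
    with open(path, "rb") as f:
        data = f.read()
    pieces = [data[off:off + n] for off, n in getchunks(path, size)]
    return b"".join(pieces) == data and all(p.endswith(b"\n") for p in pieces)


# Made-up sample: 200 short lines, split into chunks of roughly 100 bytes.
fd, sample = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    for i in range(200):
        f.write(b"line %d of a made-up log\n" % i)

print(check(sample, size=100))
os.remove(sample)
```

Because each chunk is extended to the next newline, chunks are slightly larger than the requested size, which matches the uneven sizes in the listing above.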

So, given a list of chunks, we need something that takes a chunk, and produces a partial result. Here’s a first attempt, where the map and reduce steps are combined into a single loop:

pat = re.compile(...)

def process(file, chunk):
    f = open(file)
    f.seek(chunk[0])
    d = defaultdict(int)
    search = pat.search
    for line in f.read(chunk[1]).splitlines():
        if "GET /ongoing/When" in line:
            m = search(line)
            if m:
                d[m.group(1)] += 1
    return d
 

Note that we cannot loop over the file itself, since we need to stop when we reach the end of the chunk rather than the end of the file. The above version solves this by reading the entire chunk, and then splitting it into lines.

To test this code, we can loop over the chunks and feed them to the process function, one by one, and combine the result:

count = defaultdict(int)
for chunk in getchunks(file):
    for key, value in process(file, chunk).items():
        count[key] += value

This version is a bit slower than the non-chunked version on my machine; one pass over the 200 megabyte file takes about 2.6 seconds.

However, since a chunk is guaranteed to contain a full set of lines, we can speed things up a bit more by looking for matches in the chunk itself instead of splitting it into lines:

def process(file, chunk):
    f = open(file)
    f.seek(chunk[0])
    d = defaultdict(int)
    for page in pat.findall(f.read(chunk[1])):
        d[page] += 1
    return d

With this change, the time drops to 1.8 seconds (3.7x faster than the original version).
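A small Python 3 sketch of why the two approaches agree: on a buffer made of complete lines, scanning the whole buffer with findall yields the same counts as splitting into lines and searching each one. The sample chunk is made up for illustration.

```python
import re
from collections import Counter

pat = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")

# A made-up chunk that starts and ends on line boundaries.
chunk = (
    '"GET /ongoing/When/200x/2007/09/29/Erlang HTTP/1.1" 200\n'
    '"GET /ongoing/ongoing.atom HTTP/1.1" 304\n'
    '"GET /ongoing/When/200x/2007/09/29/Erlang HTTP/1.1" 200\n'
    '"GET /ongoing/When/200x/2007/10/01/Wide-Finder HTTP/1.1" 200\n'
)

# Line by line: split first, then search each line.
by_line = Counter(m.group(1) for m in map(pat.search, chunk.splitlines()) if m)

# Whole chunk: let the RE engine scan the buffer in one call.
by_chunk = Counter(pat.findall(chunk))

print(by_line == by_chunk)
```

One caveat: the character class `[^ .]` also matches a newline, so the equivalence assumes that a matching URL is always followed by a space on the same line, as it is in this log format.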

The next step is to set things up so we can do the processing in parallel. First, we’ll call the process function from a standard “worker thread” wrapper:

import threading, Queue

# job queue
queue = Queue.Queue()

# result queue
result = []

class Worker(threading.Thread):
    def run(self):
        while 1:
            args = queue.get()
            if args is None:
                break
            result.append(process(*args))
            queue.task_done()

This uses the standard “worker thread” pattern, with a thread-safe Queue for pending jobs, and a plain list object to collect the results (list.append is an atomic operation in CPython).
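For readers on Python 3, where the module is named `queue`, the same pattern can be sketched as follows. The `work` function is a hypothetical stand-in for `process`, and the explicit sentinel shutdown is an addition that the article's daemon-thread version does not need.

```python
import queue
import threading

jobs = queue.Queue()      # thread-safe job queue
results = []              # plain list; append is atomic in CPython


def work(n):
    """Stand-in for the article's process(); hypothetical workload."""
    return n * n


def worker():
    while True:
        item = jobs.get()
        try:
            if item is None:          # sentinel: no more work
                return
            results.append(work(item))
        finally:
            jobs.task_done()          # runs for jobs and sentinels alike


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

for n in range(10):
    jobs.put(n)
for _ in threads:
    jobs.put(None)                    # one sentinel per worker

jobs.join()
for t in threads:
    t.join()

print(sorted(results))
```

Results arrive in completion order rather than submission order, which is fine here because the partial dictionaries are merged by key afterwards.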

To finish the script, just create a bunch of workers, give them something to do (via the queue), and collect the results into a single dictionary:

for i in range(4):
    w = Worker()
    w.setDaemon(1)
    w.start()

for chunk in getchunks(file):
    queue.put((file, chunk))

queue.join()

count = defaultdict(int)
for item in result:
    for key, value in item.items():
        count[key] += value

With a single thread, this runs in about 1.8 seconds (same as the non-threaded version). When we increase the number of threads, things are improved:

Two threads: 1.9 seconds
Three: 1.7 seconds
Four to eight: 1.6 seconds

For this specific test, the ideal number appears to be three threads per CPU. With fewer threads, the CPUs will occasionally get stuck waiting for I/O.

Or perhaps they’re waiting for the interpreter itself; Python uses a global interpreter lock to protect the interpreter internals from simultaneous access, so there’s probably some fighting over the interpreter going on as well. To get even more performance out of this, we need to get around the lock in some way.

Luckily, for this kind of problem, the solution is straightforward.

A Multi-Processor Python Solution

To fully get around the interpreter lock, we need to run each subtask in a separate process. An easy way to do that is to let each worker thread start an associated process, send it a chunk, and read back the result. To make things really simple, and also portable, we’ll use the script itself as the subprocess, and use a special option to enter “subprocess” mode.

Here’s the updated worker thread:

import subprocess, sys

executable = [sys.executable]
if sys.platform == "win32":
    executable.append("-u")  # use raw mode on windows

class Worker(threading.Thread):
    def run(self):
        process = subprocess.Popen(
            executable + [sys.argv[0], "--process"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE
            )
        stdin = process.stdin
        stdout = process.stdout
        while 1:
            cmd = queue.get()
            if cmd is None:
                putobject(stdin, None)
                break
            putobject(stdin, cmd)
            result.append(getobject(stdout))
            queue.task_done()

where the getobject and putobject helpers are defined as:

import marshal, struct

def putobject(file, object):
    data = marshal.dumps(object)
    file.write(struct.pack("I", len(data)))
    file.write(data)
    file.flush()

def getobject(file):
    try:
        n = struct.unpack("I", file.read(4))[0]
    except struct.error:
        return None
    return marshal.loads(file.read(n))
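The framing these helpers implement (a 4-byte length, then the marshalled payload) can be exercised without any subprocess. This Python 3 sketch uses an in-memory buffer in place of the pipe; the explicit little-endian header and the length check at end of stream are assumptions that differ slightly from the helpers above.

```python
import io
import marshal
import struct

# 4-byte little-endian length prefix (an assumption; the article uses
# native byte order via the "I" format).
HEADER = struct.Struct("<I")


def putobject(stream, obj):
    data = marshal.dumps(obj)
    stream.write(HEADER.pack(len(data)))
    stream.write(data)
    stream.flush()


def getobject(stream):
    header = stream.read(HEADER.size)
    if len(header) < HEADER.size:      # end of stream
        return None
    (n,) = HEADER.unpack(header)
    return marshal.loads(stream.read(n))


# Round-trip two messages through a buffer standing in for a pipe.
pipe = io.BytesIO()
putobject(pipe, ("o1000k.ap", (0, 1048637)))
putobject(pipe, {"2007/09/29/Erlang": 3})
pipe.seek(0)

print(getobject(pipe))
print(getobject(pipe))
print(getobject(pipe))
```

As in the article's version, a `None` message and an exhausted stream look the same to the reader, which is what lets `None` double as the shutdown signal.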

The worker thread runs a copy of the script itself, and passes in the “--process” option. To enter subprocess mode, we need to look for that before we do anything else:

if "--process" in sys.argv:
    stdin = sys.stdin
    stdout = sys.stdout
    while 1:
        args = getobject(stdin)
        if args is None:
            sys.exit(0)  # done
        result = process(*args)
        putobject(stdout, result)
else:
    ... create worker threads ...

With this approach, the processing time drops to 1.2 seconds, when using two threads/processes (one per CPU). But that’s about as good as it gets; adding more processes doesn’t really improve things on this machine.
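A stripped-down Python 3 sketch of the underlying idea: hand the work to a separate interpreter and read the result back over a pipe. Unlike the article's long-lived workers, this starts one child per call and uses JSON instead of marshal; the child program and sample data are made up for illustration.

```python
import json
import subprocess
import sys

# Child program: read text from stdin, count lines containing a marker,
# and write the count to stdout as JSON. Hypothetical stand-in for process().
CHILD = r"""
import json, sys
count = sum(1 for line in sys.stdin if "GET /ongoing/When" in line)
json.dump({"hits": count}, sys.stdout)
"""


def run_in_subprocess(text):
    """Run the child in its own interpreter, which has its own interpreter lock."""
    done = subprocess.run(
        [sys.executable, "-c", CHILD],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(done.stdout)


sample = (
    '"GET /ongoing/When/200x/2007/09/29/Erlang HTTP/1.1"\n'
    '"GET /favicon.ico HTTP/1.1"\n'
)
print(run_in_subprocess(sample))
```

Starting an interpreter per job is expensive, which is one reason the article keeps each subprocess alive and feeds it many chunks.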

Memory Mapping

So, is this the best we can get? Not quite. We can speed up the file access as well, by switching to memory mapping:

 

import mmap, os

filemap = fileobj = None

def process(file, chunk):
    global filemap, fileobj
    # map the file once per subprocess, and remap only if the name changes
    if filemap is None or fileobj.name != file:
        fileobj = open(file)
        filemap = mmap.mmap(
            fileobj.fileno(),
            os.path.getsize(file),
            access=mmap.ACCESS_READ
        )
    d = defaultdict(int)
    for page in pat.findall(filemap, chunk[0], chunk[0]+chunk[1]):
        d[page] += 1
    return d

Note that findall can be applied directly to the mapped region, thanks to Python’s internal memory buffer interface. Also note that the mmap module doesn’t support windowing, so the code needs to map the entire file in each subprocess. This can result in excessive use of virtual memory on some platforms, so running this on your own log files on a shared web server is not necessarily a good idea. (Yes, I’ve tried ;-)
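The same technique in Python 3 form, as a rough sketch on a temporary file. A memory map exposes bytes, so the pattern must be a bytes pattern; the sample data is made up, and the `pos` and `endpos` arguments play the role of the chunk offsets.

```python
import mmap
import os
import re
import tempfile

# Bytes pattern, because a memory map exposes bytes in Python 3.
pat = re.compile(rb"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b'"GET /ongoing/When/200x/2007/09/29/Erlang HTTP/1.1"\n' * 3)

with open(path, "rb") as f:
    filemap = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        line_length = filemap.find(b"\n") + 1
        # pos/endpos restrict the search to one "chunk" of the mapping.
        first_line = pat.findall(filemap, 0, line_length)
        whole_file = pat.findall(filemap)
    finally:
        filemap.close()

os.remove(path)
print(first_line, len(whole_file))
```

No data is copied into a Python string before matching; the RE engine reads straight from the mapped pages.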

Anyway, this gets the job done in 0.9 seconds, with the original chunk size. But since we’re mapping the entire file anyway in each subprocess, we can increase the chunk size to reduce the process communication overhead. With 50 megabyte chunks, the script runs in just under 0.8 seconds.

Summary

In this article, we took a relatively fast Python implementation and optimized it, using a number of tricks:

Pre-compiled RE patterns
Fast filtering of candidate lines
Chunked reading
Multiple processes
Memory mapping, combined with support for RE operations on mapped buffers

This reduced the time needed to parse 200 megabytes of log data from 6.7 seconds to 0.8 seconds on the test machine. Or in other words, the final version is over 8 times faster than the original Python version, and (potentially) 600 times faster than Tim’s original Erlang version.
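The ratios quoted throughout the article follow from the timings given in the text. A quick check, assuming Tim's "around eight minutes of CPU" can be treated as roughly 480 seconds of elapsed time on a file of comparable size:

```python
# Timings quoted in the article, in seconds.
erlang = 8 * 60          # "around eight minutes of CPU" (treated as elapsed time)
santiago = 6.7           # original Python script
single_threaded = 1.9    # tuned single-threaded script
final = 0.8              # processes + memory mapping, 50 MB chunks
megabytes = 200

print("Erlang throughput: %.2f MB/s" % (megabytes / erlang))
print("original vs Erlang: %.0fx" % (erlang / santiago))
print("tuned vs original: %.1fx" % (santiago / single_threaded))
print("final vs original: %.1fx" % (santiago / final))
print("final vs Erlang:   %.0fx" % (erlang / final))
```

The Erlang comparison is only approximate, since it was measured on a different (if similar) machine and reported as CPU time.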

However, it should be noted that the benchmark I’ve been using focuses on processing speed, not disk speed. The code will most likely behave differently on cold caches (and will definitely take longer to run), on machines with different disk systems, and of course also on machines with additional cores.

If you have some time to spare and some interesting hardware to run it on, feel free to grab the code and take it for a ride:

http://svn.effbot.org/public/stuff/sandbox/wide-finder/

(see the README.txt file for details.)

Addenda

2007-10-07: Stanley Seibert has adapted the code to use the processing library, which provides multiprocess functionality with a lot less (user) code; see Parallel Processing in Python with processing for details.

2007-10-07: Bioinformatics veteran and fellow Python string-type hacker Andrew Dalke points out, via mail, that it’s possible to shave off a few more cycles by extracting all URLs that start with “/ongoing/When/” (which we’re looking for anyway), and then removing bogus URLs during post-processing. Andrew has also written a custom parser based on mxTextTools, which is quite a bit faster than the RE solution. Hopefully, he’ll turn his findings into a blog post, so I can link to his work ;-) See More notes on Wide Finder for the full story (which is more about fast “narrow finding” than “wide finding”, though).

2007-10-07: Bill de hóra has some code too.

2007-10-07: And Steve Vinoski has tried the code from this article on some big iron: “I ran his wf-6.py on an 8-core 2.33 GHz Intel Xeon Linux box with 8GB of RAM, and it ran best at 5 processes, clocking in at 0.336 sec. Another process-based approach, wf-5.py, executed best with 8 processes, presumably one per core, in 0.358 sec. The multithreaded approach, wf-4.py, ran best with 5 threads, at 1.402 sec (but also got the same result with 19 threads, go figure). Using the same dataset, I get 11.8 sec from my best Erlang effort, which is obviously considerably slower.”

2007-10-08: Paul Boddie provides code and results using a different parallelization library, pprocess.

2007-10-08: Tim Bray summarizes recent developments.

2007-10-12: Updated the article to use binary mode on Windows. This makes the chunk calculations a bit more reliable (tell can misbehave on text files), and speeds things up quite a bit, since the I/O layer no longer needs to convert line endings.

2007-10-31: Tim Bray has tested a bunch of implementations on a multicore Solaris box. When I write this, Python’s in the lead ;-)

 


Technology Answers

Python multiprocessing: sharing a large read-only object between processes?

Do child processes spawned via multiprocessing share objects created earlier in the program?

I have the following setup:

import glob
import marshal
from multiprocessing import Pool

def do_some_processing(filename):
    for line in file(filename):
        if line.split(',')[0] in big_lookup_object:
            pass  # something here

if __name__ == '__main__':
    big_lookup_object = marshal.load(open('file.bin', 'rb'))
    pool = Pool(processes=4)
    print pool.map(do_some_processing, glob.glob('*.data'))

I'm loading some big object into memory, then creating a pool of workers that need to make use of that big object. The big object is accessed read-only, I don't need to pass modifications of it between processes.

My question is: is the big object loaded into shared memory, as it would be if I spawned a process in unix/c, or does each process load its own copy of the big object?

Update: to clarify further - big_lookup_object is a shared lookup object. I don't need to split that up and process it separately. I need to keep a single copy of it. The work that I need to split up is reading lots of other large files and looking up the items in those large files against the lookup object.

Further update: database is a fine solution, memcached might be a better solution, and file on disk (shelve or dbm) might be even better. In this question I was particularly interested in an in-memory solution. For the final solution I'll be using hadoop, but I wanted to see if I can have a local in-memory version as well.

Question by: Parand

This question originated from: stackoverflow.com


Answer            

"Do child processes spawned via multiprocessing share objects created earlier in the program?"

No.

Processes have independent memory space.

Solution 1

To make best use of a large structure with lots of workers, do this.

Write each worker as a "filter" -- reads intermediate results from stdin, does work, writes intermediate results on stdout.

Connect all the workers as a pipeline:

process1 < source | process2 | process3 | ... | processn > result

Each process reads, does work and writes.

This is remarkably efficient since all processes are running concurrently. The writes and reads pass directly through shared buffers between the processes.
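A single pipeline stage might look like the following Python 3 sketch. The marker string, the function names and the file name `stage.py` in the comment are hypothetical; the in-memory demonstration is there only so the sketch runs without a shell pipeline.

```python
import io
import sys


def filter_stream(source, sink, marker="GET /ongoing/When"):
    """Copy only the lines containing `marker` from source to sink."""
    for line in source:
        if marker in line:
            sink.write(line)


def main():
    # As a pipeline stage:  python stage.py < source | python next_stage.py > result
    filter_stream(sys.stdin, sys.stdout)


# In-memory demonstration of one stage.
source = io.StringIO("GET /ongoing/When/200x/a \nGET /favicon.ico \n")
sink = io.StringIO()
filter_stream(source, sink)
print(repr(sink.getvalue()))
```

Because each stage only reads and writes streams, the shell and the operating system take care of running the stages concurrently.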

Solution 2

In some cases, you have a more complex structure -- often a "fan-out" structure. In this case you have a parent with multiple children.

Parent opens source data. Parent forks a number of children.

Parent reads source, farms parts of the source out to each concurrently running child.

When parent reaches the end, close the pipe. Child gets end of file and finishes normally.

The child parts are pleasant to write because each child simply reads sys.stdin.

The parent has a little bit of fancy footwork in spawning all the children and retaining the pipes properly, but it's not too bad.

Fan-in is the opposite structure. A number of independently running processes need to interleave their inputs into a common process. The collector is not as easy to write, since it has to read from many sources.

Reading from many named pipes is often done using the select module to see which pipes have pending input.

Solution 3

Shared lookup is the definition of a database.

Solution 3A -- load a database. Let the workers process the data in the database.

Solution 3B -- create a very simple server using werkzeug (or similar) to provide WSGI applications that respond to HTTP GET so the workers can query the server.

Solution 4

Shared filesystem object. Unix OS offers shared memory objects. These are just files that are mapped to memory so that swapping I/O is done instead of more conventional buffered reads.

You can do this from a Python context in several ways:

Write a startup program that (1) breaks your original gigantic object into smaller objects, and (2) starts workers, each with a smaller object. The smaller objects could be pickled Python objects to save a tiny bit of file reading time.

Write a startup program that (1) reads your original gigantic object and writes a page-structured, byte-coded file using seek operations to assure that individual sections are easy to find with simple seeks. This is what a database engine does -- break the data into pages, make each page easy to locate via a seek .

Spawn workers with access to this large page-structured file. Each worker can seek to the relevant parts and do its work there.
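One way to picture the page-structured file is a table of fixed-size, sorted records that any worker can binary-search with seeks. This is a sketch of the idea in Python 3 under an assumed record layout (16-byte ASCII key, 4-byte value), not a description of any particular database format.

```python
import os
import struct
import tempfile

RECORD = struct.Struct("<16sI")   # hypothetical layout: 16-byte key, 4-byte value


def write_table(path, items):
    """Write sorted fixed-size records so record i starts at i * RECORD.size."""
    with open(path, "wb") as f:
        for key, value in sorted(items.items()):
            f.write(RECORD.pack(key.encode("ascii"), value))


def lookup(path, key):
    """Binary-search the file with seeks; only a few records are ever read."""
    target = key.encode("ascii")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // RECORD.size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD.size)
            raw_key, value = RECORD.unpack(f.read(RECORD.size))
            found = raw_key.rstrip(b"\0")   # strip the padding added by "16s"
            if found == target:
                return value
            if found < target:
                lo = mid + 1
            else:
                hi = mid
    return None


fd, table = tempfile.mkstemp()
os.close(fd)
write_table(table, {"alpha": 1, "beta": 2, "gamma": 3})
print(lookup(table, "beta"), lookup(table, "delta"))
os.remove(table)
```

Each worker opens the file on its own, so nothing has to be passed between processes except the file name.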

http://www.doughellmann.com/PyMOTW/multiprocessing/mapreduce.html

Implementing MapReduce with multiprocessing

The Pool class can be used to create a simple single-server MapReduce implementation. Although it does not give the full benefits of distributed processing, it does illustrate how easy it is to break some problems down into distributable units of work.

SimpleMapReduce

In MapReduce, input data is broken down into chunks for processing by different worker instances. Each chunk of input data is mapped to an intermediate state using a simple transformation. The intermediate data is then collected together and partitioned based on a key value so that all of the related values are together. Finally, the partitioned data is reduced to a result set.

import collections
import itertools
import multiprocessing

class SimpleMapReduce(object):

    def __init__(self, map_func, reduce_func, num_workers=None):
        """
        map_func

          Function to map inputs to intermediate data. Takes as
          argument one input value and returns a tuple with the key
          and a value to be reduced.

        reduce_func

          Function to reduce partitioned version of intermediate data
          to final output. Takes as argument a key as produced by
          map_func and a sequence of the values associated with that
          key.

        num_workers

          The number of workers to create in the pool. Defaults to the
          number of CPUs available on the current host.
        """
        self.map_func = map_func
        self.reduce_func = reduce_func
        self.pool = multiprocessing.Pool(num_workers)

    def partition(self, mapped_values):
        """Organize the mapped values by their key.
        Returns an unsorted sequence of tuples with a key and a sequence of values.
        """
        partitioned_data = collections.defaultdict(list)
        for key, value in mapped_values:
            partitioned_data[key].append(value)
        return partitioned_data.items()

    def __call__(self, inputs, chunksize=1):
        """Process the inputs through the map and reduce functions given.

        inputs
          An iterable containing the input data to be processed.

        chunksize=1
          The portion of the input data to hand to each worker.  This
          can be used to tune performance during the mapping phase.
        """
        map_responses = self.pool.map(self.map_func, inputs, chunksize=chunksize)
        partitioned_data = self.partition(itertools.chain(*map_responses))
        reduced_values = self.pool.map(self.reduce_func, partitioned_data)
        return reduced_values

Counting Words in Files

The following example script uses SimpleMapReduce to count the “words” in the reStructuredText source for this article, ignoring some of the markup.

import multiprocessing
import string

from multiprocessing_mapreduce import SimpleMapReduce

def file_to_words(filename):
    """Read a file and return a sequence of (word, occurrences) values.
    """
    STOP_WORDS = set([
            'a', 'an', 'and', 'are', 'as', 'be', 'by', 'for', 'if', 'in',
            'is', 'it', 'of', 'or', 'py', 'rst', 'that', 'the', 'to', 'with',
            ])
    TR = string.maketrans(string.punctuation, ' ' * len(string.punctuation))

    print multiprocessing.current_process().name, 'reading', filename
    output = []

    with open(filename, 'rt') as f:
        for line in f:
            if line.lstrip().startswith('..'):  # Skip rst comment lines
                continue
            line = line.translate(TR)  # Strip punctuation
            for word in line.split():
                word = word.lower()
                if word.isalpha() and word not in STOP_WORDS:
                    output.append((word, 1))
    return output


def count_words(item):
    """Convert the partitioned data for a word to a
    tuple containing the word and the number of occurrences.
    """
    word, occurrences = item
    return (word, sum(occurrences))


if __name__ == '__main__':
    import operator
    import glob

    input_files = glob.glob('*.rst')

    mapper = SimpleMapReduce(file_to_words, count_words)
    word_counts = mapper(input_files)
    word_counts.sort(key=operator.itemgetter(1))
    word_counts.reverse()

    print '\nTOP 20 WORDS BY FREQUENCY\n'
    top20 = word_counts[:20]
    longest = max(len(word) for word, count in top20)
    for word, count in top20:
        print '%-*s: %5s' % (longest+1, word, count)

Each input filename is converted to a sequence of (word, 1) pairs by file_to_words. The data is partitioned by SimpleMapReduce.partition() using the word as the key, so the partitioned data consists of a key and a sequence of 1 values representing the number of occurrences of the word. The reduction phase converts that to a pair of (word, count) values by calling count_words for each element of the partitioned data set.
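The data flow just described can be traced in a single process, without a pool. In this Python 3 sketch the map output is written out by hand (it is hypothetical, not produced from real files), and `partition` mirrors the method of the same name.

```python
import collections
import itertools


def partition(mapped_values):
    """Group (key, value) pairs by key, as SimpleMapReduce.partition() does."""
    grouped = collections.defaultdict(list)
    for key, value in mapped_values:
        grouped[key].append(value)
    return grouped.items()


# Map output from two hypothetical files, one list of (word, 1) pairs each.
map_responses = [
    [("process", 1), ("worker", 1), ("process", 1)],
    [("worker", 1), ("process", 1)],
]

# Partition: flatten the per-file lists, then collect the 1s for each word.
partitioned = partition(itertools.chain(*map_responses))

# Reduce: sum the 1s, which is what count_words does for each item.
reduced = sorted((word, sum(ones)) for word, ones in partitioned)
print(reduced)
```

Only the map and reduce steps are farmed out to workers in SimpleMapReduce; the partition step runs in the parent process.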

$ python multiprocessing_wordcount.py

PoolWorker-2 reading communication.rst
PoolWorker-2 reading index.rst
PoolWorker-1 reading basics.rst
PoolWorker-1 reading mapreduce.rst

TOP 20 WORDS BY FREQUENCY

process         :    75
multiprocessing :    40
worker          :    35
after           :    30
running         :    29
start           :    28
processes       :    26
python          :    26
literal         :    25
header          :    25
pymotw          :    25
end             :    25
daemon          :    23
now             :    21
consumer        :    19
starting        :    18
exiting         :    16
event           :    15
value           :    14
run             :    13

See also

MapReduce - Wikipedia: Overview of MapReduce on Wikipedia.
MapReduce: Simplified Data Processing on Large Clusters: Google Labs presentation and paper on MapReduce.
operator: Operator tools such as itemgetter().

Answer by: S.Lott

Related topics:    python    multiprocessing

Warning

Some of this package’s functionality requires a functioning shared semaphore implementation on the host operating system. Without one, the multiprocessing.synchronize module will be disabled, and attempts to import it will result in an ImportError. See issue 3770 for additional information.

Note

Functionality within this package requires that the __main__ module be importable by the children. This is covered in Programming guidelines; however, it is worth pointing out here. This means that some examples, such as the multiprocessing.Pool examples, will not work in the interactive interpreter.

Extended Slices

Ever since Python 1.4, the slicing syntax has supported an optional third "step" or "stride" argument. For example, these are all legal Python syntax: L[1:10:2], L[:-1:1], L[::-1]. This was added to Python at the request of the developers of Numerical Python, which uses the third argument extensively. However, Python's built-in list, tuple, and string sequence types have never supported this feature, raising a TypeError if you tried it. Michael Hudson contributed a patch to fix this shortcoming.

For example, you can now easily extract the elements of a list that have even indexes:

>>> L = range(10)
>>> L[::2]
[0, 2, 4, 6, 8]

Negative values also work to make a copy of the same list in reverse order:

>>> L[::-1]
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

This also works for tuples, arrays, and strings:

>>> s='abcd'
>>> s[::2]
'ac'
>>> s[::-1]
'dcba'

If you have a mutable sequence such as a list or an array you can assign to or delete an extended slice, but there are some differences between assignment to extended and regular slices. Assignment to a regular slice can be used to change the length of the sequence:

>>> a = range(3)
>>> a
[0, 1, 2]
>>> a[1:3] = [4, 5, 6]
>>> a
[0, 4, 5, 6]

Extended slices aren't this flexible. When assigning to an extended slice, the list on the right hand side of the statement must contain the same number of items as the slice it is replacing:

>>> a = range(4)
>>> a
[0, 1, 2, 3]
>>> a[::2]
[0, 2]
>>> a[::2] = [0, -1]
>>> a
[0, 1, -1, 3]
>>> a[::2] = [0,1,2]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: attempt to assign sequence of size 3 to extended slice of size 2

Deletion is more straightforward:

>>> a = range(4)
>>> a
[0, 1, 2, 3]
>>> a[::2]
[0, 2]
>>> del a[::2]
>>> a
[1, 3]

One can also now pass slice objects to the __getitem__ methods of the built-in sequences:

>>> range(10).__getitem__(slice(0, 5, 2))
[0, 2, 4]

Or use slice objects directly in subscripts:

>>> range(10)[slice(0, 5, 2)]
[0, 2, 4]

To simplify implementing sequences that support extended slicing, slice objects now have a method indices(length) which, given the length of a sequence, returns a (start, stop, step) tuple that can be passed directly to range(). indices() handles omitted and out-of-bounds indices in a manner consistent with regular slices (and this innocuous phrase hides a welter of confusing details!). The method is intended to be used like this:

class FakeSeq:
    ...
    def calc_item(self, i):
        ...
    def __getitem__(self, item):
        if isinstance(item, slice):
            indices = item.indices(len(self))
            return FakeSeq([self.calc_item(i) for i in range(*indices)])
        else:
            return self.calc_item(item)

From this example you can also see that the built-in slice object is now the type object for the slice type, and is no longer a function. This is consistent with Python 2.2, where int, str, etc., underwent the same change.
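A runnable take on the FakeSeq idea, written for Python 3. The `Squares` class is a hypothetical virtual sequence invented for this sketch; it returns a plain list for slices rather than a new instance, to keep the example short.

```python
class Squares:
    """A hypothetical virtual sequence: item i is i * i, computed on demand."""

    def __init__(self, length):
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, item):
        if isinstance(item, slice):
            # indices() clamps the bounds and fills in defaults for this length.
            return [i * i for i in range(*item.indices(len(self)))]
        if item < 0:
            item += len(self)
        if not 0 <= item < len(self):
            raise IndexError(item)
        return item * item


seq = Squares(6)
print(slice(None, None, 2).indices(6))
print(seq[::2], seq[::-1], seq[-1])
```

Delegating to indices() means the class never has to handle omitted, negative or out-of-range slice bounds itself.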
