草人 최광민
kwangmin.egloos.com
August 17, 2008
[Poem] Ode to Clothes (Neruda)
     

옷에게 바치는 송가 (頌歌)

- 파블로 네루다

아침마다 너는 기다린다.
옷이여, 의자 위에서
나의 허영과 나의 사랑과
나의 희망, 나의 육체로
너를 채워주길 기다리지.
꿈에서 나오자마자
나는 물을 떠나
너의 소매 속으로 들어가고
나의 발은 너의
발의 빈 구멍을 찾는다.
그렇게 해서
나는 너의 지칠 줄 모르는 성실한 도움으로
목장의 풀을 밟으러 나오지.
나는 이제 詩 안으로 들어가
창문으로
사람들을 바라본다.
남자들, 여자들
사실들과 싸움들이
나를 이루어 간다.
나와 맞서서
나의 손을 만들고 나의 눈을 뜨게 하고
나의 입이 닳도록 한다.
그렇게 해서 옷이여
나도 너를 이루어 간다.
너의 팔꿈치를 빼고
너의 실을 끊고
그렇게 해서 너의 일생은 나의 일생의 모습으로 성장해 간다.
마치 나의 영혼처럼.
바람에
나부끼고 소리를 내지.
불행한 순간에는 넌 나의 뼈에 붙는다. 밤이면 텅 비는 나의 뼈
어둠과 꿈이 도깨비 모습을 하고
너의 날개와 나의 날개를 가득 채운다.
나는 어느날
어느 적의
총알 하나가
네게 나의 핏자국을 남기지 않을까
걱정도 해보지.
또 어쩌면
일은 그렇게 극적으로 벌어지지 않고
그냥 단순하게
네가 차차 병이 들어 가리라는 생각도 해본다.
옷이여
너는 나와 함께
늙어 가며
나와 나의 몸과 함께
같이 살다가 같이 땅 속으로
들어가리라.
그래서
날마다
나는 네게 인사를 한다.
정중하게. 그러면 또 너는
나를 껴안고 나는 너를 잊어도 좋아.
우리는 결국 하나니까.
밤이면 너와 나는
바람에 맞서는 동지일 것이고
거리에서나 싸움터에서나
어쩌면 어쩌면 언젠가 움직이지 않는
한 몸일테니까.


Ode to Clothes

- Pablo Neruda

Every morning you wait,
clothes, over a chair,
to fill yourself with
my vanity, my love,
my hope, my body.
Barely
risen from sleep,
I relinquish the water,
enter your sleeves,
my legs look for
the hollows of your legs,
and so embraced
by your indefatigable faithfulness
I rise, to tread the grass,
enter poetry,
consider through the windows,
the things,
the men, the women,
the deeds and the fights
go on forming me,
go on making me face things
working my hands,
opening my eyes,
using my mouth,
and so,
clothes,
I too go forming you,
extending your elbows,
snapping your threads,
and so your life expands
in the image of my life.
In the wind
you billow and snap
as if you were my soul,
at bad times
you cling
to my bones,
vacant, for the night,
darkness, sleep
populate with their phantoms
your wings and mine.
I wonder
if one day
a bullet
from the enemy
will leave you stained with my blood
and then
you will die with me
or one day
not quite
so dramatic
but simple,
you will fall ill,
clothes,
with me,
grow old
with me, with my body
and joined
we will enter
the earth.
Because of this
each day
I greet you
with reverence and then
you embrace me and I forget you,
because we are one
and we will go on
facing the wind, in the night,
the streets or the fight,
a single body,
one day, one day, some day, still.


# by 草人 | 2008/08/17 15:50 | 詩 :-|
August 17, 2008
A Man from the Earth
     


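(Apart from the "blasphemous" remarks after the midpoint,) an ultra-low-budget film, based on an SF(?) original, that is well made in every respect. It feels like watching a well-staged play.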

# by 草人 | 2008/08/17 15:08 | Film
August 16, 2008
Introduction to Information Retrieval (Manning et al.)
     

Introduction to Information Retrieval

This is the companion website for the following book.

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.

You can order this book at CUP, at your local bookstore or on the internet. The best search term to use is the ISBN: 0521865719.

The book aims to provide a modern approach to information retrieval from a computer science perspective. It is based on a course we have been teaching in various forms at Stanford University and at the University of Stuttgart.

We'd be pleased to get feedback about how this book works out as a textbook, what is missing, or covered in too much detail, or what is simply wrong. Please send any feedback or comments to: informationretrieval (at) yahoogroups (dot) com

Online resources

Apart from small differences (mainly concerning copy editing and figures), the online editions should have the same content as the print edition. However, we are planning to fix errata in the online editions every few months or so.

The following materials are available online. The date of last update is given in parentheses.

  • HTML edition (2008.06.01)
  • PDF of the book for online viewing (with nice hyperlink features, 2008.07.12)
  • PDF of the book for printing (2008.05.27)
  • PDFs of individual chapters (2008.05.27)
  • slides (2008.07.12)
  • discussion forums (2008.08.13)
  • a moodle with interactive exercises (2008.08.13)
  • solutions to the exercises (2007.12.31, you will need to register with CUP)
  • errata (2008.07.12)

Information retrieval resources

A list of information retrieval resources is also available.

Introduction to Information Retrieval: Table of Contents

 
chapter      resources

Front matter (incl. table of notations) pdf

01 Boolean retrieval pdf html
02 The term vocabulary & postings lists pdf html
03 Dictionaries and tolerant retrieval pdf html
04 Index construction pdf html
05 Index compression pdf html
06 Scoring, term weighting & the vector space model pdf html
07 Computing scores in a complete search system pdf html
08 Evaluation in information retrieval pdf html
09 Relevance feedback & query expansion pdf html
10 XML retrieval pdf html
11 Probabilistic information retrieval pdf html
12 Language models for information retrieval pdf html
13 Text classification & Naive Bayes pdf html
14 Vector space classification pdf html
15 Support vector machines & machine learning on documents pdf html
16 Flat clustering pdf html
17 Hierarchical clustering pdf html
18 Matrix decompositions & latent semantic indexing pdf html
19 Web search basics pdf html
20 Web crawling and indexes pdf html
21 Link analysis pdf html

Bibliography & Index pdf


bibtex file bib

 

# by 草人 | 2008/08/16 15:38 | Data Mining
August 16, 2008
A Liberal Decalogue (Bertrand Russell)
     
A LIBERAL DECALOGUE

- Bertrand Russell, {Autobiography}, New York: Routledge, 2000., pp. 553~554.


1.

Do not feel absolutely certain of anything.

2.

Do not think it worth while to proceed by concealing evidence, for the evidence is sure to come to light.

3.

Never try to discourage thinking for you are sure to succeed.

4.

When you meet with opposition, even if it should be from your husband or your children, endeavour to overcome it by argument and not by authority, for a victory dependent upon authority is unreal and illusory.

5.

Have no respect for the authority of others, for there are always contrary authorities to be found.

6.

Do not use power to suppress opinions you think pernicious, for if you do the opinions will suppress you.

7.

Do not fear to be eccentric in opinion, for every opinion now accepted was once eccentric.

8.

Find more pleasure in intelligent dissent than in passive agreement, for, if you value intelligence as you should, the former implies a deeper agreement than the latter.

9.

Be scrupulously truthful, even if the truth is inconvenient, for it is more inconvenient when you try to conceal it.

10.

Do not feel envious of the happiness of those who live in a fool’s paradise, for only a fool will think that it is happiness.
# by 草人 | 2008/08/16 13:33 | Quotation
August 16, 2008
BLAST algorithm (Wikipedia)
     

Algorithm

To run, BLAST requires a query sequence to search for, and a sequence to search against (also called the target sequence) or a sequence database containing multiple such sequences. BLAST will find subsequences in the database that are similar to subsequences in the query. In typical usage, the query sequence is much smaller than the database; e.g., the query may be one thousand nucleotides while the database is several billion nucleotides.

The main idea of BLAST is that a statistically significant alignment often contains high-scoring segment pairs (HSPs). BLAST searches for high-scoring sequence alignments between the query sequence and sequences in the database using a heuristic approach that approximates the Smith-Waterman algorithm. The exhaustive Smith-Waterman approach is too slow for searching large genomic databases such as GenBank. Therefore, the BLAST algorithm uses a heuristic approach that is less accurate than Smith-Waterman but over 50 times faster. The speed and relatively good accuracy of BLAST are among the key technical innovations of the BLAST programs.
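
For context, the exhaustive baseline that BLAST approximates can be sketched in a few lines. The following is a minimal, score-only Smith-Waterman sketch and not BLAST's own code. It reuses the DNA match/mismatch scores quoted in step 3 below (+5 and -4); the linear gap penalty of -6 and the function name `smith_waterman_score` are assumptions made for illustration.

```python
def smith_waterman_score(a, b, match=5, mismatch=-4, gap=-6):
    """Return the best local alignment score between strings a and b.

    Assumptions: linear gap penalty (gap=-6 is an illustrative value),
    simple match/mismatch scoring instead of a substitution matrix.
    """
    prev = [0] * (len(b) + 1)   # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACGT", "GGACGTGG"))
```

Every cell of the table is filled, so the cost grows with the product of the two sequence lengths. That is the cost the seeding heuristic described below avoids for most of the database.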

Here the algorithm of BLASTP (a protein to protein search) is introduced to present the concept of BLAST.[4]

  1. Remove low-complexity region or sequence repeats in the query sequence.
    A low-complexity region is a region of a sequence composed of few kinds of elements. These regions might give high scores that confuse the program when it tries to find the actual significant sequences in the database, so they should be filtered out. The regions are marked with an X (protein sequences) or N (nucleic acid sequences) and then ignored by the BLAST program. To filter out the low-complexity regions, the SEG program is used for protein sequences and the DUST program is used for DNA sequences. The XNU program, on the other hand, is used to mask off tandem repeats in protein sequences.
  2. Make a k-letter word list of the query sequence.
    Taking k=3 as an example, we list the words of length 3 in the query protein sequence (k is usually 11 for a DNA sequence) “sequentially”, until the last letter of the query sequence is included. The method is illustrated in figure 1.
    Fig. 1 The method to establish the k-letter query word list.
  3. List the possible matching words.
    This step is one of the main differences between BLAST and FASTA. FASTA cares about all of the common words in the database and query sequences that are listed in step 2; however, BLAST cares about only the high-scoring words. The scores are created by comparing each word in the list from step 2 with all the 3-letter words. By using the scoring matrix (substitution matrix) to score the comparison of each residue pair, there are 20^3 possible match scores for a 3-letter word. For example, the scores obtained by comparing PQG with PEG and PQA are 15 and 12, respectively. For DNA words, a match is scored as +5 and a mismatch as -4. After that, a neighborhood word score threshold T is used to reduce the number of possible matching words. The words whose scores are greater than the threshold T remain in the list of possible matching words, while those with lower scores are discarded. For example, PEG is kept, but PQA is abandoned when T is 13.
  4. Organize the remaining high-scoring words into an efficient search tree.
    This allows the program to rapidly compare the high-scoring words to the database sequences.
  5. Repeat step 1 to 4 for each 3-letter word in the query sequence.
  6. Scan the database sequences for exact matches with the remaining high-scoring words.
    The BLAST program scans the database sequences for the remaining high-scoring words, such as PEG, at each position. If an exact match is found, this match is used to seed a possible ungapped alignment between the query and database sequences.
  7. Extend the exact matches to high-scoring segment pairs (HSPs).
    • The original version of BLAST extends the alignment between the query and the database sequence to the left and right from the position of the exact match. The extension does not stop until the accumulated total score of the HSP begins to decrease. A simplified example is presented in figure 2.
      Fig. 2 The process of extending the exact match.
    • To save more time, a newer version of BLAST, called BLAST2 or gapped BLAST, has been developed. BLAST2 adopts a lower neighborhood word score threshold to maintain the same level of sensitivity for detecting sequence similarity. Therefore, the list of possible matching words in step 3 becomes longer. Next, exact-match regions within distance A of each other on the same diagonal in figure 3 are joined into a longer new region. Finally, the new regions are extended by the same method as in the original version of BLAST, and the scores of the HSPs (high-scoring segment pairs) for the extended regions are computed using a substitution matrix as before.
      Fig. 3 The positions of the exact matches.
  8. List all of the HSPs in the database whose score is high enough to be considered.
    We list the HSPs whose scores are greater than the empirically determined cutoff score S. By examining the distribution of the alignment scores modeled by comparing random sequences, a cutoff score S can be determined such that its value is large enough to guarantee the significance of the remaining HSPs.
  9. Evaluate the significance of the HSP score.
    BLAST next assesses the statistical significance of each HSP score by exploiting the Gumbel extreme value distribution (EVD). (It is proved that the distribution of Smith-Waterman local alignment scores between two random sequences follows the Gumbel EVD, regardless of whether gaps are allowed in the alignment.) In accordance with the Gumbel EVD, the probability p of observing a score S equal to or greater than x is given by the equation
    p\left( S\ge x \right)=1-\exp \left( -e^{-\lambda \left( x-\mu \right)} \right),
    where
    \mu =\frac{\log \left( Km'n' \right)}{\lambda }
    The statistical parameters λ and K are estimated by fitting the distribution of the ungapped local alignment scores of the query sequence and many shuffled versions (global or local shuffling) of a database sequence to the Gumbel extreme value distribution. Note that λ and K depend upon the substitution matrix, gap penalties, and sequence composition (the letter frequencies). m’ and n’ are the effective lengths of the query and database sequences, respectively. The original sequence length is shortened to the effective length to compensate for the edge effect (an alignment that starts near the end of the query or database sequence is likely not to have enough sequence to build an optimal alignment). They can be calculated as
    m'\approx m-\frac{\ln \left( Kmn \right)}{H}
    n'\approx n-\frac{\ln \left( Kmn \right)}{H},
    where H is the average expected score per aligned pair of residues in an alignment of two random sequences. Altschul and Gish gave the typical values λ = 0.318, K = 0.13, and H = 0.40 for ungapped local alignment using BLOSUM62 as the substitution matrix. Assessing significance with these typical values is called the lookup table method; it is not accurate.
    The expect score E of a database match is the number of times that an unrelated database sequence would obtain a score S higher than x by chance. The expectation E obtained in a search of a database of D sequences is given by
    E\approx 1-e^{-p\left( S>x \right)D}
    Furthermore, when p < 0.1, E can be approximated by the Poisson distribution as
    E\approx pD
    Note that the E value assessing the significance of the HSP score here (for ungapped local alignment) is not identical to the one used in the later step to evaluate the final gapped local alignment score, because the statistical parameters differ.
  10. Make two or more HSP regions into a longer alignment.
    Sometimes, we find two or more HSP regions in one database sequence that can be made into a longer alignment. This provides additional evidence of the relation between the query and database sequence. There are two methods, the Poisson method and the sum-of-scores method, to compare the significance of the newly combined HSP regions. Suppose that there are two combined HSP regions with the score sets (65, 40) and (52, 45), respectively. The Poisson method gives more significance to the set whose lowest score is higher (45 > 40). However, the sum-of-scores method prefers the first set, because 65+40 (105) is greater than 52+45 (97). The original BLAST uses the Poisson method; gapped BLAST and WU-BLAST use the sum-of-scores method.
  11. Show the gapped Smith-Waterman local alignments of the query and each of the matched database sequences.
    • The original BLAST only generates ungapped alignments that each include one of the initially found HSPs, even when more than one HSP is found in one database sequence.
    • BLAST2 versions produce a single alignment with gaps that can include all of the initially found HSP regions. Note that the computation of the score and its corresponding E score involves adequate gap penalties.
  12. Report the matches whose expect score is lower than a threshold parameter E.
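
The seeding steps (2, 3 and 6) and the significance formulas of step 9 can be sketched as follows. This is a toy illustration under stated assumptions, not the BLAST implementation: all function names are hypothetical, the substitution scores are a hand-copied BLOSUM62 excerpt covering only the PQG/PEG/PQA example from step 3, and the candidate neighbor words are passed in explicitly instead of being enumerated over all 20^3 words. The statistics follow the formulas exactly as given in the text, with the typical parameter values λ = 0.318, K = 0.13, H = 0.40 as defaults.

```python
import math

# Excerpt of BLOSUM62 (assumed values) for the residue pairs in the step 3 example.
BLOSUM62_EXCERPT = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("Q", "E"): 2,
    ("G", "G"): 6, ("G", "A"): 0,
}


def pair_score(x, y):
    """Look up a symmetric substitution score; raises KeyError if not in the excerpt."""
    if (x, y) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(x, y)]
    return BLOSUM62_EXCERPT[(y, x)]


def query_words(seq, k=3):
    """Step 2: list every k-letter word of the query, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def word_score(w1, w2):
    """Score two equal-length words residue by residue."""
    return sum(pair_score(x, y) for x, y in zip(w1, w2))


def neighborhood(word, candidates, T):
    """Step 3: keep candidate words whose score against `word` is greater than T."""
    return [c for c in candidates if word_score(word, c) > T]


def find_seeds(db_seq, words, k=3):
    """Step 6: positions in the database sequence that exactly match a kept word."""
    return [i for i in range(len(db_seq) - k + 1) if db_seq[i:i + k] in words]


def gumbel_p_value(x, m, n, lam=0.318, K=0.13, H=0.40):
    """Step 9: p(S >= x) using effective lengths m', n' and mu = log(K m' n') / lambda."""
    shrink = math.log(K * m * n) / H          # edge-effect correction
    m_eff, n_eff = m - shrink, n - shrink
    mu = math.log(K * m_eff * n_eff) / lam
    return -math.expm1(-math.exp(-lam * (x - mu)))   # 1 - exp(-e^{-lambda (x - mu)})


def expect_value(p, D):
    """Step 9: E ~ 1 - e^{-p D}, as written in the text (close to p*D when p*D is small)."""
    return -math.expm1(-p * D)


if __name__ == "__main__":
    print(query_words("PQGEFG"))
    print(neighborhood("PQG", ["PEG", "PQA", "PQG"], T=13))
    print(find_seeds("AAPEGAA", {"PEG"}))
    print(gumbel_p_value(80, m=250, n=50_000_000))
```

With T = 13 the sketch keeps PEG (score 15) and drops PQA (score 12), matching the example in step 3. The sequence lengths passed to `gumbel_p_value` are arbitrary illustrative values.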
# by 草人 | 2008/08/16 02:50 | Bioinformatics