- What do you mean by line? The Konversation backend, for example, does not index the whole file at once but in blocks; I don't remember the exact details now, but it is something like 3 hours or 50 lines, whichever is smaller (or something like this). So when new lines are added to the chat log, the whole file does not need to be reindexed. This also means clients can report the time/date (session) of the chat that matched the query (though no one takes advantage of that right now). Is that what you meant? - Debajyoti Bera
- I meant considering each utterance as a separate IM hit. But on second thought, it would probably cause more trouble than it is worth. (Lukas)
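A rough sketch of the block idea described above, for illustration only. This is not the Konversation backend's actual code: the limits are the "something like" figures quoted in the comment, and `split_into_blocks` and `block_session` are hypothetical names.

```python
from datetime import datetime, timedelta

# Assumed limits, taken from the approximate figures quoted in the thread.
MAX_BLOCK_LINES = 50
MAX_BLOCK_SPAN = timedelta(hours=3)


def split_into_blocks(lines):
    """Group (timestamp, text) pairs into blocks.

    A block is closed as soon as it holds MAX_BLOCK_LINES lines or the
    next line is MAX_BLOCK_SPAN or more after the block's first line,
    whichever happens first.
    """
    blocks = []
    current = []
    for stamp, text in lines:
        too_long = len(current) >= MAX_BLOCK_LINES
        too_old = bool(current) and stamp - current[0][0] >= MAX_BLOCK_SPAN
        if too_long or too_old:
            blocks.append(current)
            current = []
        current.append((stamp, text))
    if current:
        blocks.append(current)
    return blocks


def block_session(block):
    """Return the (start, end) timestamps a client could report for a hit."""
    return block[0][0], block[-1][0]
```

With this layout only the last, still-open block has to be reindexed when new lines arrive, and each block carries the session time that a client could show next to a match.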
- dBera: It is not a fair assumption that clients will always open a hit using the backend-recommended open command. The property returned by the backend is a suggestion, but special-purpose clients can use the hits in different ways. There should be enough information in the hits to distinguish different types of hits. Mimetype, HitType and FileType are provided for easy distinguishing. Both MonodocEntry and DocbookEntry have filetype:documentation, so that property cannot be used. MonodocEntry does not have a distinct mimetype, so at least for MonodocEntry a separate HitType is needed. I am trying to convince myself that if we change the MonodocEntry mimetype from text/html to something different, then the mimetype could be used to distinguish them. But even then, semantically, the DocbookEntry and MonodocEntry properties are widely different, so I don't see any reason not to have two separate HitTypes for them.
- I think the name of the application, wherever applicable, should be a global property (say beagle:app or beagle:application). The opening URI could then be beagle:app_param or something like this. Here I am thinking of separating the name of the application from the param so that users don't have a hard time changing the name of the application, e.g. icedove in Debian, a non-standard path or wrapper scripts. - Debajyoti Bera 20:12, 13 January 2008 (EST)
- I've punted the Docbook and Monodoc types - they are just files like any other. We can use file:type to specify them. (Lukas)
- I don't think they are files. They are the child indexables of the monodoc and docbook files. Since they are documentation files, they currently have the filetype documentation (like manpages, allowing for ease of querying "filetype:documentation beagleclient"). We can't use source to distinguish between them either, since there could be docbook files in the filesystem too. - Debajyoti Bera
- I would like to think of them as child indexables of archives which are also considered files if I'm not mistaken. (Lukas)
- That is good but there are a few differences.
- Archive child indexables are really files, so giving them file:type=archive makes sense, but monodoc and docbook entries are not really files, so using file:type for them seems semantically incorrect. I am most uncomfortable for this reason.
- Monodoc and docbook are documentation indexes, so yelp queries them (and in the future there could be KDE or Mono documentation tools making use of these). They would need a specific type to restrict the query to only docbook or only monodoc. - Debajyoti Bera
- I'm really uneasy about adding two new different categories, but am willing to make a trade-off and add a beagle:documentation category where documentation:type specifies the type (analogously to what the beagle:file type does). However, what I'm worried about is the behavior when a user drops a docbook file into his home directory and searches for documentation within it - what will the results be? (Lukas)
- beagle:documentation will work, so yelp would add "beagle:documentation=docbook" to the query string. If yelp only wants to search within the system-wide index, then it has to specify that with the beagle:source property. But saying HitType=file for non-file data, does that sound right to you? Otherwise, beagle:documentation=monodoc and beagle:hittype=monodoc are the same to me. I see you are attaching a religious importance to the hittype categories ;-). - Debajyoti Bera
- Oh, I do not remember what is the current scenario, but there should be separate hittypes for browsing history and bookmarks; or we could have your style of hittype=webpage and webpage:type=bookmark :) (that was a bad joke, but I shalt abide by your orders, Sir). - Debajyoti Bera 15:13, 14 January 2008 (EST)
- Haha, well I'm trying to make this as easy to use as possible. :-) So basically the deal is that we use beagle:documentation (where documentation:type defines the type). Is that fine with you? (Lukas)
- beagle:documentation gives a special status to "documentation"; a non-beagle namespace would make me more comfortable, e.g. documentation:type. Actually, read below... - Debajyoti Bera
- Given the direction of the requirement, I think we should have a HitType and a SubType. For hittype=file, the subtype will be image, video, etc. For hittype=documentation, the subtype will be monodoc, manpage, etc. For hittype=webpage, the subtype will be bookmark, history, etc. Emails on disk will have hittype=email (as currently) and attachments will have the right subtype. Given that some types of data can have more than one categorization, this is a one-way, best-effort policy which is less complicated than defining a very complicated hierarchy consisting of multiple values. Just a revolutionary thought. - Debajyoti Bera
- This makes it once again more complicated. :-) This should be possible to add later on if we decide to because all the property names will be prefixed with "beagle:" in the index so you could just do hit_type + ":type" to add the subtype property later. (Lukas)
- Not sure I understand you fully, but currently the properties in the beagle namespace have a different technical meaning as well. I don't remember the details offhand. The non-beagle namespace properties are actually prefixed by "prop:t" or "prop:k". Besides having a different semantic meaning, they are handled differently in the code too. It's good to keep it simple, but we should also do it right so that this need not be changed in the future. Btw, what hittype do you propose for bookmarks and web history? - Debajyoti Bera
- I would like them to be of the same hit type because they point at the same type of data - webpages. Programmatically it is very easy to get to only either of those. But in my opinion, most of the time the end user will be using a graphical frontend to search for specific types (we cannot expect him to know all the property names and query syntax to use). So having more or fewer hit types won't affect the frontend applications, but will have an effect on the usability of the underlying API. (Lukas)
- I think browsing history should be separated from bookmarks. I query bookmarks more often than I query browsing history. I don't see how I can easily query only bookmarks using the query syntax (yes, I use the query syntax heavily). Also, bookmarks generally come with their own description (from the bookmark editor). Also, if you look at the KDE bookmark and webhist backends, you will see they have different properties. If you care about GUIs, then adding bookmark/webhistory/monodoc/docbook does not make it any worse. - Debajyoti Bera
- I should have checked the list above before replying. Possibly I can live with webpage and bookmark having same hittype. I can then query for only bookmarks by doing "webpage:bookmark=true" and only web history "-webpage:bookmark hittype:webpage". - Debajyoti Bera
- When you write hittype=beagle:file, do you really mean to prepend "beagle:" to the type? That's clumsy. Hittype is already namespaced itself (the property name is beagle:hittype), so why namespace the value? The query syntax becomes very cumbersome this way, and this will involve a lot of other changes as well. - Debajyoti Bera
- I would really like to keep the count of hit types to a minimum so it doesn't get cluttered again. But if we really need different categories for sane reasons, I'm fine with that. That is the main reason I want to discuss it, so that we can find an optimal solution. Yeah, I wanted to do Hit.Type = "beagle:file" in the API. If all you are worried about is the query syntax, we could add some query-fu magic that just adds the specific QueryParts (similarly to what your PropertyKeywordFu does). (Lukas)
- Any reason for adding a namespace to the value of the property? Since we don't add a namespace to any other property value, I am wondering why now. - Debajyoti Bera
- No specific reason, but in case in the future we decide to have other sources that Beagle will proxy (web sources - bugzilla, google, etc.) having those in a separate namespace would be nice. (Lukas)
- Those will have a different beagle:source value - right? The HitType is independent of the source and should only depend on the type of the data. If there are querydrivers in the future and they serve e.g. files, then they will have hittype=file; if they serve webpages (e.g. google), then they will have hittype=webpage. - Debajyoti Bera
- What I had in mind was bugzilla:bug, facebook:friend, etc. I thought it wouldn't hurt to have a namespace for the type, and it makes it look more elegant. :-) (Lukas)
- Hmmm... I don't find it amusing, but maybe others will. For me, there is too much information getting packed into words. Sigh ... life was so easy before: source was the name of the backend and hittype was a meaningful term describing the kind of object. For further classification, there was filetype. - Debajyoti Bera
- Not much has changed: hit type just has the namespace prepended and some of the types renamed; beagle:source stays and so does file:type. I wish we could have input from some other people as well. I really want to get this right. (Lukas)
- One reason I mentioned the subtype thing above is that for hittype=file we are adding file:type, and for hittype=documentation we are adding documentation:type; I was suggesting we make it formal. Not every hittype will have a subtype, but some might. That would solve the web history/bookmark dilemma. - Debajyoti Bera
- Once you think this spec is done, let's email the list and let people look at it. I am not expecting a huge crowd to gather around this, but maybe a few more eyes will have a look. The more the better. - Debajyoti Bera
- I looked at the property name mapping and query related code and am having second thoughts. - Debajyoti Bera
- Namespacing the hittype values. This will require the user to know the namespace too while specifying the source; the resulting query looks elegant as part of the API (part.Key="beagle:hittype", part.Value="facebook:foo") but not so much as part of a query string ("hittype:facebook:foo"). Instead, the facebook backend author should use a short descriptive term which can be used in the query string. Yes, for now the PropertyKeywordFu could take all hittype:abc and make it a querypart part.Key="beagle:hittype", part.Value="beagle:abc", but that would be assuming that all hittypes will always have the namespace beagle. Since that is contrary to the assumption that there will be different namespaces in the future, the query syntax has to contain the namespace too. The other option, to instruct future authors to use concise descriptive hittype values, is better IMO. Also, a design-level argument is that there should not be different hittypes "hittype:google:query", "hittype:a9:query" etc. but only "hittype:webquery" (or something similar). The namespace information as proposed describes the source of the data, which is already provided by beagle:source. - Debajyoti Bera
- Having hittype:webpage consist of both hittype:bookmark and hittype:webhistory will be difficult for OR queries. It's not difficult to do using queryparts, and the GUI can do all of these, but our query syntax cannot handle arbitrarily nested ANDs of ORs of ANDs of ORs ... queries, so handwriting a query to search only in bookmarks or only in emails would be difficult. - Debajyoti Bera
- Since the whole drive is towards making the property names consistent, I would suggest keeping the query syntax in mind for the users who hand-write their queries. As always, none of the above matters to a GUI. If the user chooses Webpage in the drop-down list, a querypart_or with "hittype=webhistory OR hittype=bookmark" could be added. - Debajyoti Bera
- How about webpage:type=bookmark (where the actual webpage is not indexed, instead the bookmark folder, description and keywords associated with the bookmark are indexed) and webpage:type=history (where the actual webpage is indexed). - Debajyoti Bera
- The main question we have to ask ourselves here is: are we building a query language or an ontology? In the former case we will end up with a messed-up API again. But if we decide to go for the latter, we can build our own query language following what the keyword-fu does, instead of obfuscating the Lucene one. So you can add a type:bookmark query which will be parsed by our own query parser. This will basically involve appending one keyword query part which will translate to (Hit.Type = "beagle:webpage" and Hit ["webpage:type"] = "bookmark"). Other query goodies could be implemented this way too. (Lukas)
- That is what I thought initially, but then, looking at the code, there is a technical limitation in expanding type:bookmark (the inability to handle nested OR and AND queries). Thus I think the ontology should be designed keeping the query language in mind. Just to give an example, the filetype keyword was not there originally. It was devised much later to act as an alias for (mimetype:image/jpeg OR mimetype:image/png OR ...) because of the inability to do nested ORs/ANDs. The filetype keyword gives direct access to a common enough group of data - technically it is not needed, since the filetype information can be derived from the mimetype anyway. The same reasoning applies to why a separate keyword should exist for bookmark and webhistory. More so, because bookmark and webhistory are treated entirely differently. Technically speaking, even monodoc and docbook have completely different metadata and it's not good to group them in the same category either, but documentation:type takes care of that. - Debajyoti Bera
- What you are saying is the same as if we had a separate hit type for each file type - does that sound sane to you? (Lukas)
- I am saying there should be one property key/value pair to select the different logical types of data. For different types of files, we could have added many different hittypes, but instead came up with a filetype property. Just another way of handling it. Grouping monodoc and docbook was fine because documentation:type can select within the group. Similarly, I want a keyword to select only bookmarks or only webhistory, because they are fundamentally completely different. Using hittype or webpage:type or something else. If you look into the current properties, there are many such examples where a keyword was added to broadly classify one group of data. This is bad from an ontology point of view, but required from a beagle point of view. Either someone fixes beagle (which looks pretty hard to me at this point) or the ontology needs to be polluted with such redundant classification. - Debajyoti Bera
- OK, so I'm completely lost now with what you are saying. :-) So you do or don't agree with the "*:type" property? (Lukas)
- Phew! There are too many 'types' involved :(. Could you elaborate on your previous question "have a separate hit type for each file type" with an example? - Debajyoti Bera
- I want to make sure we do this consistently. So if we have file:type, we may just as well have webpage:type. What I meant was having a Hit.Type {File, Image, Video, Audio, Webpage, IMLog} - but that is just plain stupid, you wouldn't be able to search for all files just by using the File type. But this just shows how bad it is to have *too* many hit types. (Lukas)
- What is the problem with having a webpage:type property? As you said, we have file:type and documentation:type. Now I understand your "*:type" question :) and I like it. It's probably bad from an ontology point of view, but I will leave the ontology experts to dwell on the perfect ontology for desktop data. For me, consistent property names and a decent classification are enough for practical usage in beagle. - Debajyoti Bera
- There is no problem with it, I'm all for it, actually I have been from the beginning. It's already in the spec so I thought you were against it. :-) (Lukas)
- Not the first time this has happened to me :-X. I was saying the same thing (of course without reading the spec, since it's so long ;-) and mentioned it earlier in the comment marked "revolutionary thought" :-). I faintly remember seeing a webpage:bookmarked boolean property saying whether a link is bookmarked - I was looking for a string property webpage:type = {history,bookmark} and got confused. A string webpage:type like file:type is fine for me.
- Perfect! (Lukas)
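A minimal sketch of where this part of the thread ended up: one hit type plus an optional "<hittype>:type" subtype, with a type:xyz keyword expanding to a flat list of required parts, so no nested AND/OR is needed. The alias table, the function names and the un-namespaced hittype values are all assumptions for illustration; this is not Beagle's PropertyKeywordFu or its query API.

```python
# Hypothetical alias table: keyword value -> (hit type, subtype value).
# The subtype names follow the examples mentioned in the discussion.
TYPE_ALIASES = {
    "bookmark": ("webpage", "bookmark"),
    "history": ("webpage", "history"),
    "monodoc": ("documentation", "monodoc"),
    "docbook": ("documentation", "docbook"),
    "image": ("file", "image"),
}


def subtype_key(hit_type):
    """Build the subtype property name, e.g. 'webpage' -> 'webpage:type'."""
    return hit_type + ":type"


def expand_type_keyword(value):
    """Expand a type:<value> keyword into a flat list of required parts.

    Known subtypes become two ANDed parts (hit type and subtype); any
    other value is treated as a plain hit type.
    """
    if value in TYPE_ALIASES:
        hit_type, subtype = TYPE_ALIASES[value]
        return [("beagle:hittype", hit_type), (subtype_key(hit_type), subtype)]
    return [("beagle:hittype", value)]


def matches(hit, parts):
    """Return True if the hit (a dict of properties) satisfies every part."""
    return all(hit.get(key) == wanted for key, wanted in parts)
```

Under these assumptions, type:bookmark selects only bookmarks, while type:webpage still selects bookmarks and history together, which is the grouping behaviour both sides asked for.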
- Quick glance at the spec: remove beagle:application, user:tag from the properties in each category since they are already declared as global ? - Debajyoti Bera
- What should a mandatory field (e.g., dc:date) contain if the program doesn't know, or if it doesn't make sense? Can dc:title (which is mandatory) be the empty string, if it doesn't have a title? --Ken 18:15, 16 January 2008 (EST)
- dc:title should always be available - I'm not aware of a type that couldn't have a title. For files it can be the filename if there is no specific title, for notes the note name, for emails the subject line, in worst case scenarios it could just be the URI pointing to the resource, but as I said I'm not aware of a type which couldn't have a title. The same goes for date (worst case scenario is to use the DateTime when the item was indexed but again each type should have a dc:date). (Lukas)
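The fallback order described in this answer could be sketched roughly as follows. `fill_mandatory` and its parameters are hypothetical names, and the hit is modelled as a plain dict of properties; the spec itself does not define such a helper.

```python
from datetime import datetime


def fill_mandatory(props, uri, indexed_at, fallback_title=None):
    """Return a copy of props in which dc:title and dc:date are always set.

    fallback_title stands for the type-specific substitute (filename,
    note name, email subject line); the URI is the last resort.
    indexed_at is the worst-case value for dc:date.
    """
    result = dict(props)
    if not result.get("dc:title"):
        result["dc:title"] = fallback_title or uri
    if not result.get("dc:date"):
        result["dc:date"] = indexed_at
    return result
```

An existing title or date is never overwritten; the fallbacks only apply when the property is missing or empty.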
- Even when the keys are standardized, it doesn't really help until the values are standardized, too. For example, several of the numeric values don't specify units -- is "bitrate" in bit/sec, kbit/sec, kbyte/sec? (One key is named "fps", which is itself a unit, but then there's "bitrate", which is not.) Is (color) 'depth' in bits or bytes? --Ken 18:15, 16 January 2008 (EST)
- Yes, I haven't forgotten about this and am aware of it. In the meantime someone can suggest the best formats to use for each field. (Lukas)
- I don't see anything about an extension mechanism for authors of other programs. Can we use, say, "#{my_domain_name}:#{field_value}", or do you have something in mind? --Ken 18:15, 16 January 2008 (EST)
- You're correct, there is a small note added about this at the beginning of the spec but we still need to create its own section in the spec about it. (Lukas)
This page was last modified 10:55, 17 January 2008.